Scrapy internals

Alexander Sibiryakov, 16-17 July 2017, PyConRU 2017

«Scrapy internals», Alexander Sibiryakov, Scrapinghub

Talk scope

• Design of a complex asynchronous application,

• flow-control issues,

• open source life.

Scrapy: web scraping

• Extraction of structured data,

• Selecting and extracting data from HTML/XML (CSS, XPath, regexps) → Parsel,

• Interactive shell,

• Feed exports in JSON, CSV, XML, with storage on FTP, S3, or the local filesystem,

• Robust encoding support and auto-detection.

Main features

• Extensible: spiders, signals, middlewares, extensions, and pipelines,

• Cookies,

• robots.txt,

• Form submission,

• Telnet console,

• Graceful shutdown by signal.

Scrapy architecture

Twisted

• Event-driven network programming framework,

• Event loop and Deferreds («Promises»),

• Protocols and transports:

    • TCP, UDP, SSL, UNIX sockets,

    • HTTP, DNS, SMTP/IMAP, IRC,

• Cross-platform.

Glyph Lefkowitz, creator of Twisted

self._nameResolver = _SimpleResolverComplexifier(resolver)

– Twisted source code

Twisted event loop

https://stackoverflow.com/questions/35111265/how-does-pythons-twisted-reactor-work
https://www.cs.cmu.edu/~adamchik/15-121/lectures/Binary%20Heaps/heaps.html

events: [e1: Event, e2: Event, … eN]

Event: func, args, desired_time

min: O(1)
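The slide's idea of a list of timed events with an O(1) minimum can be sketched with a binary heap. This is a toy model on a simulated clock, not Twisted's reactor code; the names `MiniLoop`, `call_later`, and `next_deadline` are made up for illustration.

```python
import heapq
import itertools


class MiniLoop:
    """Toy event loop: a binary heap of (desired_time, seq, func, args)."""

    def __init__(self):
        self._events = []               # heap ordered by desired_time
        self._seq = itertools.count()   # tie-breaker so funcs are never compared
        self.now = 0.0                  # simulated clock (assumption: no real waiting)

    def call_later(self, delay, func, *args):
        # Insertion into the heap is O(log n).
        heapq.heappush(self._events, (self.now + delay, next(self._seq), func, args))

    def next_deadline(self):
        # Peeking at the earliest event is O(1): it is always at index 0.
        return self._events[0][0] if self._events else None

    def run(self):
        while self._events:
            desired_time, _, func, args = heapq.heappop(self._events)
            # A real loop would wait for I/O until desired_time; here we just jump.
            self.now = desired_time
            func(*args)


def demo():
    order = []
    loop = MiniLoop()
    loop.call_later(0.3, order.append, "c")
    loop.call_later(0.1, order.append, "a")
    loop.call_later(0.2, order.append, "b")
    first = loop.next_deadline()
    loop.run()
    return first, order
```

Events run in order of their desired time regardless of the order in which they were scheduled, and the loop always knows its next deadline without scanning the list.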

x86 time sources

• Real Time Clock: absolute time, 1 s precision,

• 8254 chip (previously),

• HPET (High Precision Event Timer), at least 10 MHz,

• single counter for periodic mode,

• many for one-shot mode,

• compares actual timer value and target,

• RDTSC/RDTSCP: CPU clock cycles,

• Proprietary timers.

Twisted.Deferred

• callback

• errback

• addCallback, addErrback

• cancel

• addTimeout

• pause/unpause
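How a result travels through the callback and errback chains can be sketched with a small stand-in class. This is a simplified pure-Python model, not Twisted's implementation: `MiniDeferred` is a hypothetical name, failures are plain exceptions rather than Failure objects, and cancel, addTimeout and pause/unpause are left out.

```python
def _passthrough(value):
    return value


class MiniDeferred:
    """Toy model of a Deferred's paired callback/errback chain."""

    def __init__(self):
        self._chain = []      # list of (callback, errback) pairs
        self._fired = False
        self.result = None

    def addCallbacks(self, callback, errback):
        self._chain.append((callback, errback))
        if self._fired:
            self._run()       # already has a result: run the new pair at once
        return self

    def addCallback(self, callback):
        return self.addCallbacks(callback, _passthrough)

    def addErrback(self, errback):
        return self.addCallbacks(_passthrough, errback)

    def callback(self, result):
        self._fire(result)

    def errback(self, error):
        self._fire(error)

    def _fire(self, value):
        if self._fired:
            raise RuntimeError("already fired")
        self._fired = True
        self.result = value
        self._run()

    def _run(self):
        while self._chain:
            callback, errback = self._chain.pop(0)
            # An exception as the current result selects the errback side.
            handler = errback if isinstance(self.result, Exception) else callback
            try:
                self.result = handler(self.result)
            except Exception as exc:
                self.result = exc


def demo():
    log = []
    d = MiniDeferred()
    d.addCallback(lambda body: body.upper())
    d.addCallback(lambda text: 1 / 0)                # raises: switch to error path
    d.addCallback(lambda _: log.append("skipped"))   # bypassed while in error state
    d.addErrback(lambda err: "recovered from " + type(err).__name__)
    d.addCallback(log.append)                        # back on the success path
    d.callback("page")
    return log
```

The point of the sketch is the switching: a raised exception moves processing to the next errback, and an errback that returns a normal value moves it back to the callbacks.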

Internal components intercommunication

Web agent pipeline

Downloader

Slots:

PROBLEMS

Throttling between internal components

• Downloader,

• Scraper

• Item pipelines (cleansing, validation, deduplication, storage, ...)

• Feed exports (serialization + disk/network IO)

• ?

Flow control: memory

• Unlimited downloading -> unlimited growth of items from cascading feed pages.

• Maintain a limit on the amount of memory used for Responses in the queue (~5 MB).
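A memory limit of this kind can be sketched as a queue that tracks the total size of the responses it holds and reports when the budget is exceeded. The class and method names below are hypothetical, and the default of 5,000,000 bytes simply mirrors the slide's ~5 MB figure.

```python
from collections import deque


class ResponseQueue:
    """Hypothetical queue that accounts for the bytes held by queued responses."""

    def __init__(self, max_active_size=5_000_000):
        self.max_active_size = max_active_size
        self.active_size = 0
        self._queue = deque()

    def add(self, body):
        self._queue.append(body)
        self.active_size += len(body)

    def finish_next(self):
        # Processing a response releases its share of the budget.
        body = self._queue.popleft()
        self.active_size -= len(body)
        return body

    def needs_backout(self):
        # True means: stop feeding new responses until some are processed.
        return self.active_size > self.max_active_size


def demo():
    queue = ResponseQueue(max_active_size=10)   # tiny limit to keep the demo short
    states = []
    queue.add(b"x" * 6)
    states.append(queue.needs_backout())
    queue.add(b"y" * 6)
    states.append(queue.needs_backout())
    queue.finish_next()
    states.append(queue.needs_backout())
    return states
```

The downloader side would consult `needs_backout()` before handing over more work, so memory use stays bounded even when pages fan out into many further pages.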

Flow control: CPU

Spending more time on callbacks (CPU) than on IO.

> reactor.callLater(0.1, d.errback, _failure)

An artificial delay of 100 ms.
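The effect of the artificial delay can be illustrated with the standard library's asyncio as a stand-in for the Twisted reactor (an assumption made so the sketch runs without Twisted installed): `loop.call_later` plays the role of `reactor.callLater`, and a future's `set_exception` plays the role of `d.errback`.

```python
import asyncio


async def _demo():
    loop = asyncio.get_running_loop()
    order = []
    failed = loop.create_future()

    # Deliver the failure after 100 ms instead of immediately, so that work
    # already waiting in the loop is served first.
    loop.call_later(0.1, failed.set_exception, ValueError("download failed"))

    # Stand-ins for pending I/O handlers that are ready right now.
    loop.call_soon(order.append, "io-1")
    loop.call_soon(order.append, "io-2")

    try:
        await failed
    except ValueError as exc:
        order.append("errback: %s" % exc)
    return order


def demo():
    return asyncio.run(_demo())
```

The error handler runs only after the ready I/O work has had its turn, which is the trade the slide describes: a small added latency in exchange for not starving IO.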

Summarizing

• concurrent items limits,

• memory consumption limits,

• scheduling of new calls with delays.

If a limit is reached -> don't pick up a new request from the scheduler.
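The rule of not picking up new requests once a limit is reached can be sketched as a pull loop guarded by a single check. `Engine`, `pull`, and `finish` are hypothetical names, and only a concurrency limit is modelled; a memory check would be one more clause in `needs_backout`.

```python
from collections import deque


class Engine:
    """Hypothetical engine that pulls from the scheduler only while under its limit."""

    def __init__(self, scheduler, max_concurrent=2):
        self.scheduler = deque(scheduler)
        self.max_concurrent = max_concurrent
        self.in_flight = set()

    def needs_backout(self):
        return len(self.in_flight) >= self.max_concurrent

    def pull(self):
        # Take requests until the scheduler is empty or a limit is hit.
        started = []
        while self.scheduler and not self.needs_backout():
            request = self.scheduler.popleft()
            self.in_flight.add(request)
            started.append(request)
        return started

    def finish(self, request):
        # Completing a request frees a slot for the next pull.
        self.in_flight.discard(request)


def demo():
    engine = Engine(["a", "b", "c"], max_concurrent=2)
    first = engine.pull()    # fills both slots
    second = engine.pull()   # limit reached: nothing is taken
    engine.finish("a")
    third = engine.pull()    # one slot freed: one request taken
    return first, second, third
```

Requests that are not picked up stay in the scheduler, so the backlog waits there instead of accumulating inside the downstream components.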

It just stopped…

• Why?

• Some Deferred was lost?

• Where?

• How to debug?

No silver bullet.

> self.heartbeat = task.LoopingCall(nextcall.schedule)

+ extensive logging
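The heartbeat idea is to re-trigger the "next call" periodically, so that processing resumes even if the event that should have triggered it was lost. A minimal sketch follows, using asyncio instead of Twisted's `task.LoopingCall` so it runs on the standard library alone; the `NextCall` class is a hypothetical stand-in for the object whose `schedule` method appears on the slide.

```python
import asyncio


class NextCall:
    """Hypothetical stand-in: schedules func on the loop, at most once at a time."""

    def __init__(self, func):
        self._func = func
        self._pending = False

    def schedule(self):
        if self._pending:
            return                      # a call is already queued; coalesce
        self._pending = True
        asyncio.get_running_loop().call_soon(self._run)

    def _run(self):
        self._pending = False
        self._func()


async def heartbeat(nextcall, interval, beats):
    # Periodically poke the scheduler regardless of other events.
    # (A real heartbeat would loop until shutdown; 'beats' bounds the demo.)
    for _ in range(beats):
        nextcall.schedule()
        await asyncio.sleep(interval)


def demo():
    ticks = []
    nextcall = NextCall(lambda: ticks.append("tick"))
    asyncio.run(heartbeat(nextcall, interval=0.01, beats=3))
    return ticks
```

The heartbeat does not explain why a Deferred was lost; it only bounds how long the application can stay stalled, which is why the slide pairs it with extensive logging.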

Design your async application well

Iterations

State diagrams

Questions

Alexander Sibiryakov, Scrapinghub Ltd.