Scrapy internals
Alexander Sibiryakov, 16-17 July 2017, PyConRU 2017
Talk scope
• Design of a complex asynchronous application,
• Flow-control issues,
• Open source life.
Scrapy: web scraping
• Extraction of structured data,
• Selecting and extracting data from HTML/XML (CSS, XPath, regexps) → Parsel
• Interactive shell,
• Feed exports in JSON, CSV, XML and storing in FTP, S3, local fs,
• Robust encoding support and auto-detection,
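The select-extract-export flow listed above can be sketched with the standard library alone. This is only an illustration: `xml.etree.ElementTree` (with its limited XPath subset) stands in for Parsel, and `json`/`csv` stand in for Scrapy's feed exports. The sample markup, the field names and the helper functions are all hypothetical.

```python
import csv
import io
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree cannot parse broken HTML).
PAGE = """
<html><body>
  <div class="product"><h2>Kettle</h2><span class="price">Price: 25 EUR</span></div>
  <div class="product"><h2>Toaster</h2><span class="price">Price: 40 EUR</span></div>
</body></html>
"""


def extract_items(markup):
    """Select nodes with an XPath-like query, then clean values with a regexp."""
    root = ET.fromstring(markup.strip())
    items = []
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price_text = node.findtext("span[@class='price']")
        price = int(re.search(r"\d+", price_text).group())
        items.append({"name": name, "price": price})
    return items


def export_json_lines(items):
    """Serialize items as one JSON object per line."""
    return "\n".join(json.dumps(item) for item in items)


def export_csv(items):
    """Serialize items as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buf.getvalue()
```

In a real spider the selection step would go through Parsel selectors and the export step through the configured feed storage; the sketch only shows the shape of the data flow.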
Main features
• Extensible: spider, signals, middlewares, extensions, and pipelines,
• Cookies
• Robots.txt
• Form submission
• Telnet console
• Graceful shutdown by signal
Scrapy architecture
Twisted
• Event-driven network programming framework
• Event loop and Deferreds ("Promises")
• Protocols and transport:
• TCP, UDP, SSL, UNIX sockets
• HTTP, DNS, SMTP/IMAP, IRC
• Cross platform
Creator of Twisted
Glyph Lefkowitz
self._nameResolver = _SimpleResolverComplexifier(resolver)
– Twisted source code
Twisted event loop
https://stackoverflow.com/questions/35111265/how-does-pythons-twisted-reactor-work
https://www.cs.cmu.edu/~adamchik/15-121/lectures/Binary%20Heaps/heaps.html
events: [e1: Event, e2: Event, … eN]
Event: func, args, desired_time
min: O(1)
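The event list above maps naturally onto a binary heap keyed by `desired_time`: peeking at the earliest event is O(1), pushing and popping are O(log n). The following is a toy model of that idea, not Twisted's reactor; the class name `ToyEventLoop` and its methods are hypothetical, and a real reactor would spend the waiting time in `select`/`epoll` rather than in `sleep`.

```python
import heapq
import itertools
import time


class ToyEventLoop:
    """Timed calls kept in a min-heap ordered by desired_time."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._events = []              # heap of (desired_time, seq, func, args)
        self._seq = itertools.count()  # tie-breaker so funcs are never compared

    def call_later(self, delay, func, *args):
        desired_time = self._clock() + delay
        heapq.heappush(self._events, (desired_time, next(self._seq), func, args))

    def next_deadline(self):
        """Earliest desired_time, read from the heap root in O(1)."""
        return self._events[0][0] if self._events else None

    def run(self):
        """Run until no events remain."""
        while self._events:
            wait = self._events[0][0] - self._clock()
            if wait > 0:
                # A real reactor would poll sockets with this timeout instead.
                self._sleep(wait)
                continue
            _, _, func, args = heapq.heappop(self._events)
            func(*args)
```

Events scheduled out of order still fire in order of their desired time, because the heap root is always the earliest one.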
x86 time sources
• Real Time Clock - absolute time, 1 sec. precision,
• 8254 chip previously,
• HPET (High Precision Event Timer), at least 10 MHz
• single counter for periodic mode,
• many for one-shot mode,
• compares actual timer value and target
• RDTSC/RDTSCP - CPU clock cycles
• Proprietary timers
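From Python, these hardware sources are only visible indirectly, through the clocks the operating system exposes. A small sketch, assuming nothing beyond the standard `time` module, lists what each clock reports about itself; which hardware timer actually backs a given clock is the OS's choice and is not shown here. The helper name `describe_clocks` is hypothetical.

```python
import time


def describe_clocks(names=("time", "monotonic", "perf_counter")):
    """Collect the self-reported properties of the standard Python clocks."""
    report = {}
    for name in names:
        info = time.get_clock_info(name)
        report[name] = {
            "implementation": info.implementation,  # underlying OS call
            "resolution": info.resolution,          # in seconds
            "monotonic": info.monotonic,            # never goes backwards
            "adjustable": info.adjustable,          # can be changed by admin/NTP
        }
    return report


if __name__ == "__main__":
    for clock_name, props in describe_clocks().items():
        print(clock_name, props)
```

An event loop that schedules calls by desired time would normally rely on a monotonic clock, so that wall-clock adjustments do not reorder or delay pending events.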
Twisted.Deferred
• callback
• errback
• addCallback, addErrback
• cancel
• addTimeout
• pause/unpause
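The first three items can be illustrated with a toy model of a callback/errback chain. This is not Twisted's implementation: `ToyDeferred` is a hypothetical class, it uses plain exception instances where Twisted wraps errors in `Failure` objects, and it leaves out `cancel`, `addTimeout` and `pause`/`unpause` entirely.

```python
def _passthrough(value):
    return value


class ToyDeferred:
    """A result that travels down a chain of (callback, errback) pairs."""

    def __init__(self):
        self._chain = []
        self._fired = False
        self.result = None

    def addCallbacks(self, callback, errback):
        self._chain.append((callback, errback))
        if self._fired:
            self._run()  # already has a result: run the new pair right away
        return self

    def addCallback(self, callback):
        return self.addCallbacks(callback, _passthrough)

    def addErrback(self, errback):
        return self.addCallbacks(_passthrough, errback)

    def callback(self, result):
        self._fire(result)

    def errback(self, error):
        self._fire(error)

    def _fire(self, result):
        if self._fired:
            raise RuntimeError("already fired")
        self._fired = True
        self.result = result
        self._run()

    def _run(self):
        while self._chain:
            callback, errback = self._chain.pop(0)
            # An exception as the current result selects the error path.
            handler = errback if isinstance(self.result, Exception) else callback
            try:
                self.result = handler(self.result)
            except Exception as exc:
                self.result = exc
```

A callback that raises moves the result onto the error path, callbacks are skipped until an errback handles it, and an errback that returns a normal value moves the result back onto the success path.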
Internal components intercommunication
Web agent pipeline
DownloaderSlots:
PROBLEMS
Throttling between internal components
• Downloader,
• Scraper
• Item pipelines (cleansing, validating, deduplication, storing, ...)
• Feed exports (serialization + disk/network IO)
• ?
Flow control: memory
• Unlimited downloading -> unbounded growth of items from cascading feed pages.
• Maintain a limit on the amount of memory used by Responses in the queue (~5 MB)
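A minimal sketch of such a memory limit, assuming responses are plain byte strings and the budget is counted in bytes. The class `ResponseBudget`, its method names and the 1 KB minimum charge per response are hypothetical choices for illustration, not a copy of Scrapy's code.

```python
from collections import deque


class ResponseBudget:
    """Tracks how many bytes of responses are waiting to be processed."""

    def __init__(self, max_active_size=5_000_000, min_response_size=1024):
        self.max_active_size = max_active_size      # roughly 5 MB by default
        self.min_response_size = min_response_size  # charge even tiny responses
        self.active_size = 0
        self.queue = deque()

    def add_response(self, body):
        size = max(len(body), self.min_response_size)
        self.queue.append((body, size))
        self.active_size += size

    def finish_response(self):
        """Take the oldest response out of the queue and release its bytes."""
        body, size = self.queue.popleft()
        self.active_size -= size
        return body

    def needs_backout(self):
        """True when the downloader should stop feeding new responses."""
        return self.active_size > self.max_active_size
```

The downloader side would consult `needs_backout()` before starting another request, so that slow item processing pushes back on downloading instead of letting the queue grow.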
Flow control: CPU
• spending more time on callbacks (CPU) than on IO
> reactor.callLater(0.1, d.errback, _failure)
an artificial delay of 100 ms
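The same idea can be sketched with the standard library's `asyncio`, whose `loop.call_later` plays the role of `reactor.callLater` here. This is a stand-in for illustration, not Twisted code: the failure is delivered after a delay rather than immediately, which gives the loop a chance to service IO in between. The helper names `fail_later` and `demo` are hypothetical.

```python
import asyncio


def fail_later(loop, future, exc, delay=0.1):
    """Deliver the failure after `delay` seconds instead of right away."""
    loop.call_later(delay, future.set_exception, exc)


async def demo(delay=0.1):
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    started = loop.time()
    fail_later(loop, future, ValueError("download failed"), delay)
    try:
        await future  # other tasks and IO can run while this waits
    except ValueError as exc:
        return str(exc), loop.time() - started


if __name__ == "__main__":
    message, elapsed = asyncio.run(demo())
    print(message, round(elapsed, 2))
```

The trade-off is extra latency on the error path in exchange for not starving readers and writers when many callbacks are queued at once.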
Summarizing
• concurrent items limits,
• memory consumption limits,
• scheduling of new calls with delays.
if limit is reached -> don't pick up new request from scheduler
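The rule "if a limit is reached, do not take a new request from the scheduler" can be sketched as a small pump loop. Everything here is a hypothetical toy: `ToyEngine`, its attribute names, and the idea that each request carries a known size are assumptions made for the example.

```python
from collections import deque


class ToyEngine:
    """Starts requests from a scheduler queue until some limit is hit."""

    def __init__(self, requests, max_concurrent=2, max_active_size=5_000_000):
        self.scheduler = deque(requests)  # pending (url, size) pairs
        self.in_progress = {}             # url -> size currently held
        self.max_concurrent = max_concurrent
        self.max_active_size = max_active_size

    def needs_backout(self):
        too_many = len(self.in_progress) >= self.max_concurrent
        too_big = sum(self.in_progress.values()) > self.max_active_size
        return too_many or too_big

    def pump(self):
        """Start requests while no limit is reached; return the URLs started."""
        started = []
        while self.scheduler and not self.needs_backout():
            url, size = self.scheduler.popleft()
            self.in_progress[url] = size
            started.append(url)
        return started

    def finish(self, url):
        """Release a finished request, then try to start more."""
        del self.in_progress[url]
        return self.pump()
```

Finishing a request is what frees capacity, so `finish` calls `pump` again; if that call were ever lost, nothing else would restart the flow.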
It just stopped…
• Why?
• Some Deferred was lost?
• Where?
• How to debug?
No silver bullet.
> self.heartbeat = task.LoopingCall(nextcall.schedule)
+ extensive logging
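A heartbeat of this kind can be sketched with `asyncio` as a stand-in for Twisted's `task.LoopingCall`. In this hypothetical toy, `next_request` deliberately never reschedules itself, imitating a lost callback; the periodic tick keeps the crawl moving anyway, and the log line records each step. `ToyCrawler` and `run_with_heartbeat` are made-up names.

```python
import asyncio
import logging
from collections import deque

logger = logging.getLogger("toy_crawler")


class ToyCrawler:
    def __init__(self, urls):
        self.pending = deque(urls)
        self.done = []

    def next_request(self):
        """Process one pending URL.

        Deliberately does not schedule the next step itself, mimicking a
        lost callback that would otherwise stall the crawl.
        """
        if self.pending:
            url = self.pending.popleft()
            self.done.append(url)
            logger.debug("processed %s, %d left", url, len(self.pending))


async def run_with_heartbeat(crawler, interval=0.01):
    """Call next_request on every tick until nothing is pending."""
    while crawler.pending:
        crawler.next_request()
        await asyncio.sleep(interval)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    crawler = ToyCrawler(["a", "b", "c"])
    asyncio.run(run_with_heartbeat(crawler))
```

The heartbeat hides the symptom rather than the cause, which is why it is paired with logging: the log shows where the normal trigger stopped firing.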
Design your async application well
Iterations
State diagrams