Distributed Stream Processing in the real [Perl] world.
YAPC::Asia 2012 Day 1 (2012/09/28)
TAGOMORI Satoshi (@tagomoris)
NHN Japan
Saturday, September 29, 2012
tagomoris
• TAGOMORI Satoshi ( @tagomoris )
• Working at NHN Japan
What this talk contains
• What "Stream Processing" is
• Why we want "Stream Processing"
• What features we should write for "Stream Processing"
• Frameworks and tools for "Distributed Stream Processing"
• Implementations in the Perl world
What "Stream Processing" is
Stream
Stream ?
•Continuously increasing data
•access logs, trace logs, sales checks, ...
•typically written to a file line by line
tail -f
Stream Processing
•Convert, select, aggregate passed data
•Does NOT wait for EOF (in many cases)
tail -f | grep ^hit | sed -e s/hit/miss/g
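The same select-and-convert pipeline can be sketched with Python generators. This is only an illustration: the function names `select`, `convert` and `pipeline` are made up here, and any iterable of lines (for example an open file being followed) can be the input. Each stage pulls one record at a time, so nothing waits for EOF.

```python
from typing import Iterable, Iterator


def select(lines: Iterable[str], prefix: str) -> Iterator[str]:
    """Yield only the lines that start with prefix (like grep ^hit)."""
    for line in lines:
        if line.startswith(prefix):
            yield line


def convert(lines: Iterable[str], old: str, new: str) -> Iterator[str]:
    """Replace every occurrence of old with new (like sed -e s/hit/miss/g)."""
    for line in lines:
        yield line.replace(old, new)


def pipeline(lines: Iterable[str]) -> Iterator[str]:
    """Chain the stages lazily; records flow through one by one."""
    return convert(select(lines, "hit"), "hit", "miss")
```

For example, `list(pipeline(["hit /a", "miss /b"]))` keeps only the first line and rewrites it to `"miss /a"`.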
Stream Processing over network
•Data are collected from many nodes
•to search/query/store
•Separate heavy processes from edge nodes
edge: tail -f | nc
backend: nc -l | grep | sed | tee | ...
Why we want "Stream Processing"
Batch file copy & convert
access.0928.16.log (log lines from 16:00 to 16:59)
•Log file covers 60 min.
•Flush wait: 3 min.
•Copy over network: ? min.
•Convert into query-friendly structure: ? min.
•Latency for a 16:00 log line: 62+ minutes
Stream data copy & convert
access.0928.16.log (log lines from 16:00 to 16:59)
•Copy over network in real time
•Convert records one after another
•Very low latency for each log line (if traffic is not larger than capacity)
Case of data size explosion (batch)
serviceA
serviceB
serviceC
serviceD
(one service needs a long transfer time)
A casual batch over multiple nodes/services may be blocked by unbalanced data sizes
Asynchronous batch is a very good problem...
Case of data size explosion (stream)
serviceA
serviceB
serviceC
serviceD
heavy traffic
Streams are mixed and not blocked by heavy traffic
(if traffic is not larger than capacity)
What features we should write for "Stream Processing"
One-by-one input/process/output
one record -> convert / format / select -> one record (or none)
•Basic feature
•I/O call overhead is relatively heavy
Burst transfer/read/write and process
•fewer input/output calls
•better performance with async I/O and multiple processes
read and store records temporarily from input -> many records -> convert / format / select -> many records (or few or none) -> read and store records temporarily to output
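A minimal sketch of burst processing, assuming records arrive as an iterable of strings. The names `read_chunks` and `process_chunk` and the `chunk_size` parameter are hypothetical. Input is read many records at a time, and each chunk may yield many, few or no output records.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def read_chunks(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Read up to chunk_size records per call instead of one by one."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk: List[str],
                  convert: Callable[[str], Optional[str]]) -> List[str]:
    """Convert a whole chunk; records mapped to None are dropped (select)."""
    out = []
    for record in chunk:
        result = convert(record)
        if result is not None:
            out.append(result)
    return out
```

Each output chunk can then be written with a single call, which is where the saving in I/O calls comes from.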
Control buffer flush intervals
•Control flushing by buffer size and latency
•(Semi-)real-time control flow arguments
•Determines the max size of data lost when the process crashes
read inputs -> buffer (read and store records temporarily from input) -> many records -> many records (or few or none) -> buffer (read and store records temporarily to output) -> write records
Flush interval: 0.5sec? 1sec? 3sec? 30sec?
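One way to express the size/latency trade-off is a buffer that flushes when it holds enough records or when enough time has passed. This is a sketch under assumptions: the class `FlushBuffer` and its parameters `max_records`, `flush_interval` and `clock` are hypothetical names, and a real daemon would call `flush_if_needed()` from a timer as well as on every append.

```python
import time
from typing import Callable, List


class FlushBuffer:
    """Collect records and flush them by size or by elapsed time."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_records: int = 1000,
                 flush_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self.flush = flush
        self.max_records = max_records        # bounds data lost on a crash
        self.flush_interval = flush_interval  # bounds added latency (seconds)
        self.clock = clock
        self.records: List[str] = []
        self.last_flush = clock()

    def append(self, record: str) -> None:
        self.records.append(record)
        self.flush_if_needed()

    def flush_if_needed(self) -> None:
        too_big = len(self.records) >= self.max_records
        too_old = self.clock() - self.last_flush >= self.flush_interval
        if self.records and (too_big or too_old):
            self.flush(self.records)
            self.records = []
            self.last_flush = self.clock()
```

A short interval gives low latency and little data at risk; a long one gives fewer, larger writes.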
Buffering/Queueing
[Diagram: output records are buffered and sent to the next node. While the next node is stopped (STOP), buffers pile up; after it recovers, streaming resumes.]
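The buffering/queueing behavior can be sketched as a bounded queue of chunks that survives a stopped next node. The class `OutputQueue`, its `max_chunks` limit and the `send` callback are hypothetical; `send` is assumed to raise `OSError` (for example `ConnectionError`) while the next node is down.

```python
from collections import deque
from typing import Callable, Deque, List


class OutputQueue:
    """Hold chunks while the next node is down and resend after recovery."""

    def __init__(self, send: Callable[[List[str]], None], max_chunks: int = 100):
        self.send = send
        self.max_chunks = max_chunks
        self.chunks: Deque[List[str]] = deque()

    def enqueue(self, chunk: List[str]) -> bool:
        """Queue a chunk; return False when the queue is already full."""
        if len(self.chunks) >= self.max_chunks:
            return False  # caller must drop data or slow down its input
        self.chunks.append(chunk)
        return True

    def drain(self) -> int:
        """Send queued chunks in order; stop at the first failure."""
        sent = 0
        while self.chunks:
            try:
                self.send(self.chunks[0])
            except OSError:
                break  # next node still down: keep the chunk for a retry
            self.chunks.popleft()
            sent += 1
        return sent
```

Calling `drain()` periodically is enough to resume streaming once the next node accepts data again.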
Connection keepalive / connection pooling
•Keep connections and select one to use
•TCP connection establishment is costly
•manage node status (alive/down) at the same time
•not only inter-node, but also inter-process connections
node A
node B
node C
node D
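A sketch of keeping connections open while tracking node status, under assumptions: `ConnectionPool`, `connect`, `retry_after` and `clock` are hypothetical names, and `connect(node)` stands in for whatever opens a real TCP connection and raises `OSError` on failure.

```python
import time
from typing import Callable, Dict, List, Tuple


class ConnectionPool:
    """Reuse one open connection per node and skip nodes marked down."""

    def __init__(self, nodes: List[str], connect: Callable[[str], object],
                 retry_after: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.nodes = list(nodes)
        self.connect = connect
        self.retry_after = retry_after
        self.clock = clock
        self.connections: Dict[str, object] = {}  # kept alive between sends
        self.down_until: Dict[str, float] = {}    # node -> earliest retry time

    def is_alive(self, node: str) -> bool:
        return self.clock() >= self.down_until.get(node, 0.0)

    def mark_down(self, node: str) -> None:
        """Forget the connection and do not retry the node for a while."""
        self.connections.pop(node, None)
        self.down_until[node] = self.clock() + self.retry_after

    def get(self) -> Tuple[str, object]:
        """Return (node, connection) for the first usable node."""
        for node in self.nodes:
            if not self.is_alive(node):
                continue
            if node not in self.connections:
                try:
                    self.connections[node] = self.connect(node)
                except OSError:
                    self.mark_down(node)
                    continue
            return node, self.connections[node]
        raise ConnectionError("no node is alive")
```

A caller that hits a send error on a pooled connection would call `mark_down(node)` and then `get()` again.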
Distribution
Distribution: Load balancing (cpu/node)
•Distribute large scale data to many nodes
•nodes: servers, or processor processes
•to increase total performance
records -> load balancer -> send to next node -> processor (x3)
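Round robin is one simple load-balancing policy; the slides do not say which policy is meant, so this is only an illustration. The function name `distribute` is hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterable, List, Sequence


def distribute(records: Iterable[str], nodes: Sequence[str]) -> Dict[str, List[str]]:
    """Assign records to nodes (servers or processor processes) in turn."""
    assigned: Dict[str, List[str]] = {node: [] for node in nodes}
    for record, node in zip(records, cycle(nodes)):
        assigned[node].append(record)
    return assigned
```

With two processors, records 1, 3, 5, ... go to the first and 2, 4, ... to the second, so each handles about half of the load.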
Distribution: High availability (process/node)
•Distribute large scale data to N+1 (or 2 or more) nodes
•to make the system tolerant of node failures
•without any failover (and takeback) operations
records -> load balancer -> send to next node -> processor (x3)
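Tolerating node trouble without a manual failover step can be sketched as "try the next node when one fails". The function `send_with_failover` and the `send(node, record)` callback are hypothetical; `send` is assumed to raise `OSError` when a node cannot accept the record.

```python
from typing import Callable, Sequence


def send_with_failover(record: str, nodes: Sequence[str],
                       send: Callable[[str, str], None],
                       start: int = 0) -> str:
    """Try each node once, beginning at index start; return the one that accepted."""
    for offset in range(len(nodes)):
        node = nodes[(start + offset) % len(nodes)]
        try:
            send(node, record)
            return node
        except OSError:
            continue  # this node is in trouble: try the next one
    raise ConnectionError("all nodes failed")
```

With N+1 nodes, losing any single node only shifts its share of records to the others.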
Routing
[Diagram: routers direct records for service A, service B and service C through process A and process B to output A, output B and output C.]
TOO MANY FEATURES TO IMPLEMENT !!!!!
Frameworks and tools for "Distributed Stream Processing"
Frameworks and tools
•Apache Kafka
•written in Scala (... with JVM!)
•Twitter Storm
•written in Clojure (...with JVM!)
•Fluentd
Fluentd
•Mainly written by @frsyuki at Treasure Data
•Apache License 2.0 software on GitHub
•Log read/transfer/write daemon based on MessagePack
•structured data (Hash: key:value pairs)
•Plugin mechanism for input/output/buffer features
•many plugins are now published
Fluentd features: input/output
•File tailing, network, and other input plugins
•tail and parse line-by-line
•receive records from app logger or other fluentd
•in_syslog, in_exec, in_dstat, .....
•Output to many storage systems
•other fluentd, file, S3, mongodb, mysql, HDFS, .....
Fluentd features: buffers
•Pluggable buffers
•output plugin buffers are swappable (by configuration)
•In-memory buffers: fast, but lost when fluentd goes down
•File buffers: slow, but always saved
•Buffer plugins can also be added by users
•No public buffer plugin exists yet....
Fluentd features: routing
•Tag based routing
•all records have tag and time
•Fluentd uses the tag to decide which plugin each record is sent to next
•configurations are:
•tag matcher pattern + plugin configuration
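Tag-based routing can be sketched as an ordered list of (pattern, output) rules. This is a simplified model, not Fluentd's implementation: `tag_matches` and `Router` are hypothetical names, only `*` (exactly one dot-separated tag part) is supported, and the first matching rule is assumed to win.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Output = Callable[[str, float, Record], None]


def tag_matches(pattern: str, tag: str) -> bool:
    """Simplified matcher: '*' stands for exactly one tag part."""
    p_parts = pattern.split(".")
    t_parts = tag.split(".")
    if len(p_parts) != len(t_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(p_parts, t_parts))


class Router:
    """Send each (tag, time, record) to the first output whose pattern matches."""

    def __init__(self) -> None:
        self.rules: List[Tuple[str, Output]] = []

    def add_rule(self, pattern: str, output: Output) -> None:
        self.rules.append((pattern, output))

    def emit(self, tag: str, time: float, record: Record) -> bool:
        for pattern, output in self.rules:
            if tag_matches(pattern, tag):
                output(tag, time, record)
                return True
        return False  # no rule matched this tag
```

Putting the most specific patterns first gives one service its own output while a catch-all rule handles the rest.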
Fluentd features: exec_filter
•Outputs records to a specified (and forked) command
•And gets records back from the command's STDOUT
•We can specify our stream processor as the command
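A stream processor usable as such a command only has to read records from STDIN and write records to STDOUT. The sketch below assumes one JSON object per line in both directions, which is an assumption about how the command is configured. The `transform` rule (keep only records with status 500 or above) and the field names are invented for illustration.

```python
import io
import json
from typing import Optional, TextIO


def transform(record: dict) -> Optional[dict]:
    """Hypothetical rule: keep only server errors and flag them."""
    if record.get("status", 0) < 500:
        return None
    record["alert"] = True
    return record


def run_filter(stdin: TextIO, stdout: TextIO) -> None:
    """Read one JSON record per line and write converted records back."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        result = transform(json.loads(line))
        if result is not None:
            stdout.write(json.dumps(result) + "\n")
            stdout.flush()  # do not hold records back in the pipe
```

As a standalone command, the script would end with `run_filter(sys.stdin, sys.stdout)` after `import sys`.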
I'm very sorry that....
Fluentd is written in Ruby
Fluentd plugins are released as rubygems
Problems about Fluentd (for stream processing)
•Eager buffering
•Eager default buffering config: does not flush under 8MB
•Performance
•The many features for data protection hurt performance
•Doesn't work on Windows
Implementations in the Perl world
fluent-agent-lite (Fluent::AgentLite)
•Log collection agent tool (in Perl) by tagomoris
•fast and low load
•gets logs from file/STDIN, and sends to other nodes
•minimal features for a log collector agent
•doesn't parse log lines (sends 1 attribute containing the whole log line)
•supports load balancing and failover of destinations
fluent-agent (Fluent::Agent)
•Tool implementing a subset of Fluentd features, by tagomoris
•written in Perl
•uses libuv and the UV module as the async I/O library (for Windows)
•Goal: simple, fast and easy deployment
•UNDER CONSTRUCTION
•60% of features and many bugs, not on CPAN yet
Features of Fluent::Agent
•1 input, 1 output and 0/1 filter
•Network I/O: protocol compatible with Fluentd
•and simple load balancing/failover feature
•File input/output: superset of Fluentd's features (planned)
•Filter with any command: compatible with Fluentd's exec_filter
data/records input -> filter (any program you want) -> output data/records
Pros of Fluent::Agent (in plan)
•Simple and fast software for stream processing
•Stateless nodes
•fluent-agent works without any configuration files
•fluent-agent works with only commandline options
•Simple buffering and load balancing
•less memory usage
Cons of Fluent::Agent (in fact)
•Poor input/output methods
•fluent-agent doesn't have plugin architecture (currently)
•in future, CPAN based plugin system?
•No data protection if the process dies
•fluent-agent has only a memory buffer
Fluentd and fluent-agent
Fluentd and fluent-agent and fluent-agent-lite
[Diagram: each service node runs fluent-agent-lite; fluent-agent instances act as deliver and as processor; fluentd instances act as writer for storages / aggregator.]
Conclusion
•Distributed Stream Processing is:
•a way to give more power to our applications
•a very hard (and interesting) problem
•supported by frameworks/tools like Fluentd and/or fluent-agent
Thanks!
Let's try to improve your application
with stream processing
instead of many many batches
CAST: crouton & luke & chacha
Thanks to @kbysmnr