Distributed Stream Processing in the real [Perl] world. YAPC::Asia 2012 Day 1 (2012/09/28). TAGOMORI Satoshi (@tagomoris), NHN Japan.

Page 1: Distributed Stream Processing in the real [Perl] world

Distributed Stream Processing in the real [Perl] world.

YAPC::Asia 2012 Day 1 (2012/09/28)

TAGOMORI Satoshi (@tagomoris)

NHN Japan

Saturday, September 29, 2012

Page 2: Distributed Stream Processing in the real [Perl] world

tagomoris

• TAGOMORI Satoshi ( @tagomoris )

• Working at NHN Japan


Page 3: Distributed Stream Processing in the real [Perl] world

What this talk contains

• What "Stream Processing" is

• Why we want "Stream Processing"

• What features we should write for "Stream Processing"

• Frameworks and tools for "Distributed Stream Processing"

• Implementations in the Perl world


Page 4: Distributed Stream Processing in the real [Perl] world

What "Stream Processing" is


Page 5: Distributed Stream Processing in the real [Perl] world

Stream


Page 6: Distributed Stream Processing in the real [Perl] world

Stream ?

•Continuously increasing data

•access logs, trace logs, sales checks, ...

•typically written to a file line by line

tail -f


Page 7: Distributed Stream Processing in the real [Perl] world

Stream Processing

•Convert, select, aggregate passed data

•Does NOT wait for EOF (in many cases)

tail -f | grep '^hit' | sed -e 's/hit/miss/g'
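The pipeline on this slide can be sketched in Python with generators. This is only an illustration under assumptions: the function names `grep`, `sed`, and `process` and the sample lines are made up here, and a list stands in for the endless `tail -f` input.

```python
import re
from typing import Iterable, Iterator


def grep(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Select: pass through only the lines that match the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


def sed(lines: Iterable[str], old: str, new: str) -> Iterator[str]:
    """Convert: rewrite each line as soon as it arrives."""
    for line in lines:
        yield line.replace(old, new)


def process(lines: Iterable[str]) -> Iterator[str]:
    # Generators are lazy, so each line flows through both stages
    # without waiting for the end of the input (no EOF needed).
    return sed(grep(lines, r"^hit"), "hit", "miss")


# Hypothetical sample input; a real stream would be an open file or socket.
sample = ["hit /index.html", "miss /favicon.ico", "hit /about.html"]
result = list(process(sample))
print(result)
```

Because every stage yields one line at a time, the same code works unchanged on an input that never ends.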

Page 8: Distributed Stream Processing in the real [Perl] world

Stream Processing over network

•Data are collected from many nodes

•to search/query/store

•Separate heavy processes from edge nodes

edge: tail -f | nc

backend: nc -l | grep | sed | tee | ...

Page 9: Distributed Stream Processing in the real [Perl] world

Why we want "Stream Processing"


Page 10: Distributed Stream Processing in the real [Perl] world

Batch file copy & convert

access.0928.16.log (log lines from 16:00 to 16:59): 60 min.

flush wait: 3 min.

Copy over network: ? min.

Convert into query friendly structure: ? min.

latency for 16:00 log: 62+ minutes

Page 11: Distributed Stream Processing in the real [Perl] world

Stream data copy & convert

access.0928.16.log (log lines from 16:00 to 16:59)

Copy over network in real time

Convert one after another

Very low latency for each log line (if traffic is not larger than capacity)

Page 12: Distributed Stream Processing in the real [Perl] world

Case of data size explosion (batch)

serviceA

serviceB

needs long transfer time

serviceC

serviceD

A casual batch over multiple nodes/services may be blocked by unbalanced data size

Asynchronous batch is a very good problem...

Page 13: Distributed Stream Processing in the real [Perl] world

Case of data size explosion (stream)

serviceA

serviceB

serviceC

serviceD

heavy traffic

Streams are mixed and not blocked by heavy traffic

(if traffic is not larger than capacity)

Page 14: Distributed Stream Processing in the real [Perl] world

What features we should write for "Stream Processing"

Page 15: Distributed Stream Processing in the real [Perl] world

One-by-one input/process/output


Page 16: Distributed Stream Processing in the real [Perl] world

One-by-one input/process/output

one record → convert / format / select → one record (or none)

•Basic feature

•I/O call overhead is relatively heavy


Page 17: Distributed Stream Processing in the real [Perl] world

Burst transfer/read/write and process


Page 18: Distributed Stream Processing in the real [Perl] world

Burst transfer/read/write and process

•fewer input/output calls

•more performance with async I/O and multiple processes

many records → convert / format / select → many records (or few or none)

input side: read from input and store records temporarily

output side: store records temporarily and write to output

Page 19: Distributed Stream Processing in the real [Perl] world

Control buffer flush intervals


Page 20: Distributed Stream Processing in the real [Perl] world

Control buffer flush intervals

•Control flushing by buffer size and latency

•(Semi-)real-time control flow arguments

•Max size of lost data when the process crashes

many records → buffer → many records (or few or none)

input side: read inputs, store records temporarily in the buffer

output side: store records temporarily in the buffer, write records

flush interval: 0.5sec? 1sec? 3sec? 30sec?
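A minimal sketch of a buffer that flushes by size or by age, whichever comes first. The class name `FlushBuffer` and the parameters `max_records` and `max_wait` are assumptions made for this example; they are not taken from Fluentd or fluent-agent. The clock is injectable so the demo does not have to sleep.

```python
import time
from typing import Callable, List, Optional


class FlushBuffer:
    """Collect records and flush them by size or by age, whichever comes first."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_records: int = 100, max_wait: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.flush_fn = flush
        self.max_records = max_records   # bounds the buffer size
        self.max_wait = max_wait         # bounds the latency, in seconds
        self.clock = clock
        self.records: List[str] = []
        self.first_at: Optional[float] = None

    def add(self, record: str) -> None:
        if not self.records:
            self.first_at = self.clock()  # age is measured from the oldest record
        self.records.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Also call this from a timer so that an idle buffer still drains."""
        if not self.records or self.first_at is None:
            return
        too_big = len(self.records) >= self.max_records
        too_old = self.clock() - self.first_at >= self.max_wait
        if too_big or too_old:
            chunk, self.records = self.records, []
            self.first_at = None
            self.flush_fn(chunk)


# Demo with a fake clock (hypothetical values).
now = [0.0]
flushed: List[List[str]] = []
buf = FlushBuffer(flushed.append, max_records=3, max_wait=1.0, clock=lambda: now[0])
for r in ["a", "b", "c", "d"]:
    buf.add(r)           # "a", "b", "c" flush by size; "d" stays buffered
now[0] = 1.5              # pretend 1.5 seconds have passed
buf.maybe_flush()         # "d" flushes by age
print(flushed)
```

The two limits are the trade-off from the slide: a larger `max_records` or `max_wait` means fewer I/O calls, but more latency and more data lost if the process crashes before a flush.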

Page 21: Distributed Stream Processing in the real [Perl] world

Buffering/Queueing


Page 22: Distributed Stream Processing in the real [Perl] world

Buffering/Queueing

[Figure: output records are buffered before being sent to the next node. While the next node is stopped (STOP), buffers pile up; after it recovers, the buffered records are sent and streaming resumes.]
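The queueing behaviour can be sketched as follows. `QueuedSender`, its `limit` parameter, and the fake `send_to_next_node` function are assumptions for illustration only; a real sender would write to a socket and retry from a timer.

```python
from collections import deque
from typing import Callable, Deque, List


class QueuedSender:
    """Keep records queued while the next node is down; drain the queue on recovery."""

    def __init__(self, send: Callable[[str], None], limit: int = 1000) -> None:
        self.send = send
        # When the queue is full, the oldest records are dropped.
        self.queue: Deque[str] = deque(maxlen=limit)

    def emit(self, record: str) -> None:
        self.queue.append(record)
        self.drain()

    def drain(self) -> None:
        while self.queue:
            try:
                self.send(self.queue[0])
            except ConnectionError:
                return                 # next node is down: keep records, retry later
            self.queue.popleft()       # remove only after a successful send


# Demo: the next node is down at first, then recovers.
received: List[str] = []
node_up = [False]


def send_to_next_node(record: str) -> None:
    if not node_up[0]:
        raise ConnectionError("next node is down")
    received.append(record)


sender = QueuedSender(send_to_next_node)
sender.emit("r1")
sender.emit("r2")
queued_while_down = len(sender.queue)
node_up[0] = True          # recover
sender.emit("r3")          # queued records go out first, in order
print(queued_while_down, received)
```

Removing a record only after a successful send keeps the order and avoids losing the record that was in flight when the node stopped.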

Page 23: Distributed Stream Processing in the real [Perl] world

Connection keepalive / Connection pooling

Page 24: Distributed Stream Processing in the real [Perl] world

Connection keepalive / connection pooling

•Keep connections and select one to use

•TCP connection establishment is costly

•manage node status (alive/down) at the same time

•not only inter-nodes, but also inter-process connections

node A

node B

node C

node D

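A rough sketch of keeping connections and tracking node status together. `ConnectionPool` and `fake_connect` are hypothetical names; in practice the `connect` factory would wrap something like `socket.create_connection`.

```python
from typing import Callable, Dict, List, Set


class ConnectionPool:
    """Keep one open connection per node and remember which nodes are down."""

    def __init__(self, connect: Callable[[str], object]) -> None:
        self.connect = connect                      # hypothetical connection factory
        self.connections: Dict[str, object] = {}
        self.down: Set[str] = set()

    def get(self, node: str) -> object:
        """Reuse the kept connection, or establish one if none exists yet."""
        if node not in self.connections:
            try:
                self.connections[node] = self.connect(node)
            except OSError:
                self.down.add(node)
                raise
            self.down.discard(node)
        return self.connections[node]

    def mark_down(self, node: str) -> None:
        """Drop a broken connection so that the next get() reconnects."""
        self.connections.pop(node, None)
        self.down.add(node)


# Demo with a fake factory: "node D" refuses connections.
connect_calls: List[str] = []


def fake_connect(node: str) -> object:
    connect_calls.append(node)
    if node == "node D":
        raise OSError("connection refused")
    return f"connection to {node}"


pool = ConnectionPool(fake_connect)
pool.get("node A")
pool.get("node A")         # reused: no second connection establishment
try:
    pool.get("node D")
except OSError:
    pass
print(connect_calls, pool.down)
```

The second `get("node A")` costs nothing, which is the point of keepalive, and the `down` set gives callers the alive/down status without a separate health check.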

Page 25: Distributed Stream Processing in the real [Perl] world

Distribution


Page 26: Distributed Stream Processing in the real [Perl] world

Distribution: Load balancing (cpu/node)

•Distribute large scale data to many nodes

•nodes: servers, or processor processes

•to make total performance high

[Figure: records → load balancer → several processors, each sending to the next node]

Page 27: Distributed Stream Processing in the real [Perl] world

Distribution: High availability (process/node)

•Distribute large scale data to N+1 (or 2 or more) nodes

•to make system tolerant of node trouble

•without any failover (and takeback) operations

[Figure: records → load balancer → several processors, each sending to the next node]
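Load balancing and high availability can share one mechanism: round-robin over the nodes, skipping any node that fails. The sketch below assumes that a failed send raises `ConnectionError`; `Balancer` and `make_sender` are names invented for this example.

```python
from itertools import cycle
from typing import Callable, Dict, List


class Balancer:
    """Round-robin over nodes; on failure, try the next node without operator action."""

    def __init__(self, senders: Dict[str, Callable[[str], None]]) -> None:
        self.senders = senders
        self.order = cycle(list(senders))

    def send(self, record: str) -> str:
        # Try each node at most once per record.
        for _ in range(len(self.senders)):
            name = next(self.order)
            try:
                self.senders[name](record)
                return name
            except ConnectionError:
                continue               # node is down: fail over to the next one
        raise ConnectionError("all nodes are down")


# Demo: three nodes, node B is down.
delivered: Dict[str, List[str]] = {"A": [], "B": [], "C": []}


def make_sender(name: str, alive: bool = True) -> Callable[[str], None]:
    def send(record: str) -> None:
        if not alive:
            raise ConnectionError(f"node {name} is down")
        delivered[name].append(record)
    return send


balancer = Balancer({
    "A": make_sender("A"),
    "B": make_sender("B", alive=False),
    "C": make_sender("C"),
})
used = [balancer.send(f"record-{i}") for i in range(4)]
print(used, delivered)
```

With N+1 nodes the remaining nodes absorb the traffic of the failed one, so no failover or takeback operation is needed.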

Page 28: Distributed Stream Processing in the real [Perl] world

Routing

[Figure: routers split incoming records into records for service A, B, and C, which flow through process A and process B to output A, output B, and output C.]

Page 29: Distributed Stream Processing in the real [Perl] world

TOO MANY FEATURES TO IMPLEMENT !!!!!


Page 30: Distributed Stream Processing in the real [Perl] world

Frameworks and tools for "Distributed Stream Processing"

Page 31: Distributed Stream Processing in the real [Perl] world

Frameworks and tools

•Apache Kafka

•written in Scala (... with JVM!)

•Twitter Storm

•written in Clojure (...with JVM!)

•Fluentd


Page 32: Distributed Stream Processing in the real [Perl] world

Fluentd


Page 33: Distributed Stream Processing in the real [Perl] world

Fluentd

•Mainly written by @frsyuki at Treasure Data

•Apache License v2 software on GitHub

•Log read/transfer/write daemon based on MessagePack

•structured data (Hash: key:value pairs)

•Plugin mechanism for input/output/buffer features

•now many plugins are published

Page 34: Distributed Stream Processing in the real [Perl] world

Fluentd features: input/output

•File tailing, network, and other input plugins

•tail and parse line-by-line

•receive records from app logger or other fluentd

•in_syslog, in_exec, in_dstat, .....

•Output to many many storage/systems

•other fluentd, file, S3, mongodb, mysql, HDFS, .....


Page 35: Distributed Stream Processing in the real [Perl] world

Fluentd features: buffers

•Pluggable buffers

•output plugin buffers are swappable (by configuration)

•In-memory buffers: fast, but lost when fluentd goes down

•file buffers: slow, but always saved

•Buffer plugins can also be added by users

•No public plugin exists yet....

Page 36: Distributed Stream Processing in the real [Perl] world

Fluentd features: routing

•Tag based routing

•all records have tag and time

•Fluentd uses tags to decide which plugin each record is sent to next

•configurations are:

•tag matcher pattern + plugin configuration
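Tag-based routing can be sketched as a list of (pattern, plugin) rules checked in order. This is a simplified illustration, not Fluentd's implementation: it handles only `*` (exactly one tag part) and `**` (zero or more tag parts), and the rule list and plugin names are hypothetical.

```python
from typing import List, Optional, Tuple


def tag_matches(pattern: str, tag: str) -> bool:
    """Match a dot-separated tag against a pattern with '*' and '**' wildcards."""
    return _match(pattern.split("."), tag.split("."))


def _match(pat: List[str], parts: List[str]) -> bool:
    if not pat:
        return not parts
    head, rest = pat[0], pat[1:]
    if head == "**":
        # "**" absorbs zero or more tag parts.
        return any(_match(rest, parts[i:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    if head == "*" or head == parts[0]:
        return _match(rest, parts[1:])
    return False


# Hypothetical routing table: tag matcher pattern + plugin name.
rules: List[Tuple[str, str]] = [
    ("access.*", "store_access_log"),
    ("error.**", "notify_error"),
    ("**", "discard"),
]


def select_plugin(tag: str) -> Optional[str]:
    """Return the plugin of the first rule whose pattern matches the tag."""
    for pattern, plugin in rules:
        if tag_matches(pattern, tag):
            return plugin
    return None


print(select_plugin("access.serviceA"))
print(select_plugin("error.serviceB.fatal"))
print(select_plugin("debug.serviceC"))
```

Rules are evaluated top to bottom and the first match wins, so a catch-all pattern belongs at the end of the list.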

Page 37: Distributed Stream Processing in the real [Perl] world

Fluentd features: exec_filter

•Output records to specified (and forked) command

•And get records from command's STDOUT

•We can specify our stream processor as command

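The idea behind exec_filter, writing records to a forked command and reading results from its STDOUT, can be sketched with Python's `subprocess` module. This is not Fluentd code: the function name `exec_filter` is reused here only as a label, and the child command (a Python one-liner that upper-cases each line) is a made-up stream processor. For brevity the sketch passes one chunk and waits for the command to exit, whereas a long-running filter would keep the pipes open.

```python
import subprocess
import sys
from typing import List


def exec_filter(command: List[str], records: List[str]) -> List[str]:
    """Fork the command, write records to its STDIN, read results from its STDOUT."""
    proc = subprocess.run(
        command,
        input="".join(record + "\n" for record in records),  # one record per line
        capture_output=True,
        text=True,
        check=True,            # raise if the command exits with an error
    )
    return proc.stdout.splitlines()


# Hypothetical stream processor: any program that reads lines and prints lines.
processor = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]
filtered = exec_filter(processor, ["hit /index.html", "miss /favicon.ico"])
print(filtered)
```

Since the contract is just lines in and lines out, the processor can be written in any language, including Perl.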

Page 38: Distributed Stream Processing in the real [Perl] world

I'm very sorry that....


Page 39: Distributed Stream Processing in the real [Perl] world

Fluentd is written in Ruby

Fluentd plugins are released as rubygems

Page 40: Distributed Stream Processing in the real [Perl] world

Problems about Fluentd (for stream processing)

•Eager buffering

•Eager default buffering config: does not flush under 8MB

•Performance

•Many features for data protection hurt performance

•Doesn't work on Windows


Page 41: Distributed Stream Processing in the real [Perl] world

Implementations in the Perl world


Page 42: Distributed Stream Processing in the real [Perl] world

fluent-agent-lite (Fluent::AgentLite)

•Log collection agent tool (in Perl) by tagomoris

•fast and low load

•gets logs from file/STDIN, and sends them to other nodes

•minimal features for a log collector agent

•doesn't parse log lines (sends 1 attribute with the whole log line)

•supports load balancing and failover of destinations

Page 43: Distributed Stream Processing in the real [Perl] world

fluent-agent (Fluent::Agent)

•Fluentd feature subset tools by tagomoris

•written in Perl

•libuv and UV module for async I/O lib (for Windows)

•Goal: simple, fast and easy deployment

•UNDER CONSTRUCTION

•60% features and many bugs, not in CPAN now


Page 44: Distributed Stream Processing in the real [Perl] world

Features of Fluent::Agent

•1 input, 1 output and 0/1 filter

•Network I/O: protocol compatible with Fluentd

•and simple load balancing/failover feature

•File input/output: superset features of Fluentd (in plan)

•Filter with any command: compatible with Fluentd's exec_filter

data/records → input → filter (any program you want) → output → data/records

Page 45: Distributed Stream Processing in the real [Perl] world

Pros of Fluent::Agent (in plan)

•Simple and fast software for stream processing

•Stateless nodes

•fluent-agent works without any configuration files

•fluent-agent works with only commandline options

•Simple buffering and load balance

•less memory usage


Page 46: Distributed Stream Processing in the real [Perl] world

Cons of Fluent::Agent (in fact)

•Poor input/output methods

•fluent-agent doesn't have plugin architecture (currently)

•in future, CPAN based plugin system?

•Lack of data protection against death of the process

•fluent-agent has only a memory buffer

Page 47: Distributed Stream Processing in the real [Perl] world

Fluentd and fluent-agent


Page 48: Distributed Stream Processing in the real [Perl] world

Fluentd and fluent-agent and fluent-agent-lite

[Figure: service nodes, each running fluent-agent-lite → deliver (fluent-agent) → processor (fluent-agent) → writer for storages / aggregator (fluentd)]

Page 49: Distributed Stream Processing in the real [Perl] world

Conclusion

•Distributed Stream Processing is:

•to provide more power to our application

•very hard (and interesting) problem

•supported by frameworks/tools like Fluentd and/or fluent-agent

Page 50: Distributed Stream Processing in the real [Perl] world

Thanks!

Let's try to improve your application

with stream processing

instead of many many batches

CAST: crouton & luke & chacha

Thanks to @kbysmnr