Distributed Stream Processing in the real [Perl] world


YAPC::Asia 2012 Day 1 (2012/09/28)

TAGOMORI Satoshi (@tagomoris)

NHN Japan

Saturday, September 29, 2012

tagomoris

• TAGOMORI Satoshi ( @tagomoris )

• Working at NHN Japan


What this talk contains

• What "Stream Processing" is

• Why we want "Stream Processing"

• What features we should write for "Stream Processing"

• Frameworks and tools for "Distributed Stream Processing"

• Implementations in the Perl world


What "Stream Processing" is


Stream


Stream ?

•Continuously increasing data

•access logs, trace logs, sales checks, ...

•typically written to a file line by line

tail -f


Stream Processing

•Convert, select, aggregate passed data

•Does NOT wait for EOF (in many cases)

tail -f | grep ^hit | sed -e s/hit/miss/g
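The same select-and-convert step can be sketched in Python as a generator. This is only an illustration of the idea on this slide (emit each converted line as soon as it arrives, never wait for EOF); the function name `grep_sed` and the sample lines are made up for the example.

```python
from typing import Iterable, Iterator


def grep_sed(lines: Iterable[str]) -> Iterator[str]:
    """Yield converted lines one at a time, without waiting for the input to end."""
    for line in lines:
        if line.startswith("hit"):               # like: grep ^hit
            yield line.replace("hit", "miss")    # like: sed -e s/hit/miss/g


# Hypothetical sample input; in a real pipeline this would be sys.stdin.
sample = ["hit /index.html\n", "miss /favicon.ico\n", "hit /hit-counter\n"]
converted = list(grep_sed(sample))
```

Because `grep_sed` is a generator, it also works on an input that never ends, such as the output of `tail -f` read through `sys.stdin`.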


Stream Processing over network

•Data are collected from many nodes

•to search/query/store

•Separate heavy processes from edge nodes

edge: tail -f | nc
backend: nc -l | grep | sed | tee | ...


Why we want "Stream Processing"


Batch file copy & convert

access.0928.16.log (log lines from 16:00 to 16:59)

•Writing the hourly file: 60 min.

•Flush wait: 3 min.

•Copy over network: ? min.

•Convert into query-friendly structure: ? min.

Latency for a 16:00 log line: 62+ minutes


Stream data copy & convert

access.0928.16.log (log lines from 16:00 to 16:59)

•Copy over network in real time

•Convert records one after another

Very low latency for each log line (if traffic does not exceed capacity)


Case of data size explosion (batch)

serviceA

serviceB

needs long transfer time

serviceC

serviceD

A casual batch over multiple nodes/services may be blocked by unbalanced data size

Asynchronous batch is a very good problem...


Case of data size explosion (stream)

serviceA

serviceB

serviceC

serviceD

heavy traffic

Streams are mixed and not blocked by heavy traffic

(if traffic does not exceed capacity)


What features we should write for "Stream Processing"


One-by-one input/process/output


one record -> convert / format / select -> one record (or none)

•Basic feature

•I/O call overhead is relatively heavy
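A minimal sketch of the one-by-one model: one record goes in, and one record (or none) comes out. The log format ("METHOD PATH STATUS") and the name `process_record` are assumptions for this example only.

```python
from typing import Optional


def process_record(line: str) -> Optional[str]:
    """Turn one input record into one output record, or None to drop it."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 3:
        return None                      # unparsable line: drop it
    method, path, status = fields
    if not status.startswith("5"):       # select: keep only server errors
        return None
    # convert + format: lower-case the method, emit tab-separated fields
    return f"{status}\t{method.lower()}\t{path}"
```

Calling such a function once per read and once per write is simple, but it pays the I/O call overhead for every single record.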


Burst transfer/read/write and process


•fewer input/output calls

•better performance with async I/O and multiple processes

many records -> read from input and store records temporarily -> convert / format / select -> store records temporarily and write to output -> many records (or few or none)
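The burst model can be sketched as chunked processing: read many records with one call, process them together, and hand many records (or few or none) to the writer with one call. The names `process_in_chunks` and `chunk_size` are illustrative, not taken from any tool mentioned in this talk.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def process_in_chunks(
    records: Iterable[int],
    convert: Callable[[int], Optional[int]],
    chunk_size: int = 1000,
) -> Iterator[List[int]]:
    """Read up to chunk_size records at once and yield the surviving outputs together."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))    # one "read" for many records
        if not chunk:
            return                              # input is exhausted
        out = [r for r in map(convert, chunk) if r is not None]
        if out:
            yield out                           # one "write" for many records


# Keep even numbers only and double them, four records per burst.
bursts = list(process_in_chunks(range(10), lambda n: n * 2 if n % 2 == 0 else None, 4))
```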


Control buffer flush intervals


•Control flushing by buffer size and latency

•(Semi-)real-time control flow arguments

•Maximum size of data lost when a process crashes

many records -> buffer (read inputs, store records temporarily) -> buffer (store records temporarily, write records) -> many records (or few or none)

Flush interval: 0.5 sec? 1 sec? 3 sec? 30 sec?
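One way to picture the trade-off is a buffer that flushes when it holds too many records or when too much time has passed since the last flush. This is a sketch under stated assumptions: the class name `FlushingBuffer` and its parameters are hypothetical, and a real daemon would also call `flush_if_needed()` from a timer, not only when a record arrives.

```python
import time
from typing import Callable, List


class FlushingBuffer:
    """Buffer records and flush by size or by elapsed time, whichever comes first."""

    def __init__(
        self,
        write: Callable[[List[str]], None],
        max_records: int = 100,
        flush_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.write = write                    # called with one chunk per flush
        self.max_records = max_records        # bounds the data lost on a crash
        self.flush_interval = flush_interval  # bounds the added latency (seconds)
        self.clock = clock                    # injectable so it can be tested
        self.records: List[str] = []
        self.last_flush = clock()

    def append(self, record: str) -> None:
        self.records.append(record)
        self.flush_if_needed()

    def flush_if_needed(self) -> None:
        too_big = len(self.records) >= self.max_records
        too_old = self.clock() - self.last_flush >= self.flush_interval
        if self.records and (too_big or too_old):
            self.flush()

    def flush(self) -> None:
        chunk, self.records = self.records, []
        self.last_flush = self.clock()
        self.write(chunk)
```

A small `max_records` and a short `flush_interval` mean low latency and little data at risk, at the price of more write calls.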


Buffering/Queueing


[Diagram: output records are buffered and sent to the next node. When the next node STOPs, buffers pile up on the sender; after the next node recovers, the buffered records are sent and streaming continues.]
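The behavior in the diagram can be sketched as a queue of chunks that is drained in order and simply keeps its contents while the next node is down. The class `OutputQueue` and the convention that `send` raises `ConnectionError` on failure are assumptions made for this example.

```python
from collections import deque
from typing import Callable, Deque, List


class OutputQueue:
    """Queue chunks for the next node and keep them while that node is down."""

    def __init__(self, send: Callable[[List[str]], None]) -> None:
        self.send = send                       # assumed to raise ConnectionError on failure
        self.chunks: Deque[List[str]] = deque()

    def enqueue(self, chunk: List[str]) -> None:
        self.chunks.append(chunk)
        self.try_flush()

    def try_flush(self) -> bool:
        """Send queued chunks in order; stop at the first failure and keep the rest."""
        while self.chunks:
            try:
                self.send(self.chunks[0])
            except ConnectionError:
                return False                   # next node is down: chunks stay queued
            self.chunks.popleft()              # drop a chunk only after a successful send
        return True
```

Calling `try_flush()` again later (for example from a retry timer) resumes streaming once the next node has recovered.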


Connection keepalive / connection pooling


•Keep connections and select one to use

•TCP connection establishment is costly

•manage node status (alive/down) at the same time

•not only inter-node, but also inter-process connections

node A

node B

node C

node D
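A rough sketch of a pool that keeps one connection per node and tracks alive/down status at the same time. `ConnectionPool` and the `connect` callable are hypothetical; with real sockets, `connect` could be something like `lambda addr: socket.create_connection(addr, timeout=3)`, whose failures are `OSError` subclasses.

```python
from typing import Callable, Dict, Hashable, List, Set


class ConnectionPool:
    """Reuse one connection per node and remember which nodes are down."""

    def __init__(self, nodes: List[Hashable], connect: Callable[[Hashable], object]) -> None:
        self.nodes = list(nodes)
        self.connect = connect                 # assumed to raise OSError on failure
        self.connections: Dict[Hashable, object] = {}
        self.down: Set[Hashable] = set()

    def get(self, node: Hashable) -> object:
        """Return the kept connection for node, connecting only when none exists."""
        conn = self.connections.get(node)
        if conn is None:
            try:
                conn = self.connect(node)      # the expensive step we want to avoid repeating
            except OSError:
                self.down.add(node)
                raise
            self.connections[node] = conn
            self.down.discard(node)
        return conn

    def mark_down(self, node: Hashable) -> None:
        """Forget a broken connection, e.g. after a failed write."""
        self.connections.pop(node, None)
        self.down.add(node)

    def alive_nodes(self) -> List[Hashable]:
        return [n for n in self.nodes if n not in self.down]
```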


Distribution


Distribution: Load balancing (cpu/node)

•Distribute large scale data to many nodes

•nodes: servers, or processor processes

•to raise total performance

records -> load balancer -> send to next node / processor (one of several)
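The simplest distribution policy is round robin. The sketch below only decides which node gets which chunk; the function name `round_robin` and the node names are made up for the example.

```python
from itertools import cycle
from typing import Dict, Iterable, List, Sequence


def round_robin(chunks: Iterable[int], nodes: Sequence[str]) -> Dict[str, List[int]]:
    """Assign each chunk to the next node in turn and return the assignment."""
    assigned: Dict[str, List[int]] = {node: [] for node in nodes}
    # zip stops when the chunks run out, so cycling over nodes never loops forever.
    for chunk, node in zip(chunks, cycle(nodes)):
        assigned[node].append(chunk)
    return assigned


plan = round_robin(range(7), ["p1", "p2", "p3"])
```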


Distribution: High availability (process/node)

•Distribute large scale data to N+1 (or 2 or more) nodes

•to make the system tolerant of node failures

•without any failover (and takeback) operations

records -> load balancer -> send to next node / processor (one of several)
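With N+1 nodes, the sender can skip a node that is down and carry on with the others, and a recovered node is picked up again without any manual takeback. A sketch, assuming a hypothetical `send(node, chunk)` that raises `ConnectionError` when the node is unreachable:

```python
from typing import Callable, List, Sequence


class FailoverSender:
    """Rotate over nodes and skip the ones that fail, with no manual failover step."""

    def __init__(self, nodes: Sequence[str], send: Callable[[str, List[str]], None]) -> None:
        self.nodes = list(nodes)
        self.send = send            # assumed to raise ConnectionError when a node is down
        self.next_index = 0

    def deliver(self, chunk: List[str]) -> str:
        """Try each node once, starting after the last one used; return the node that took it."""
        for offset in range(len(self.nodes)):
            i = (self.next_index + offset) % len(self.nodes)
            node = self.nodes[i]
            try:
                self.send(node, chunk)
            except ConnectionError:
                continue            # node trouble: try the next candidate
            self.next_index = (i + 1) % len(self.nodes)
            return node
        raise ConnectionError("all nodes are down")
```

Because every delivery probes the nodes in rotation, a node that comes back starts receiving chunks again on its next turn.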


Routing

[Diagram: a router splits incoming records into records for service A, service B, and service C. They pass through process A / process B and another router, and end at output A, output B, and output C.]


TOO MANY FEATURES TO IMPLEMENT !!!!!


Frameworks and tools for "Distributed Stream Processing"


Frameworks and tools

•Apache Kafka

•written in Scala (... with JVM!)

•Twitter Storm

•written in Clojure (...with JVM!)

•Fluentd


Fluentd


•Mainly written by @frsyuki at Treasure Data

•Apache License 2.0 software on GitHub

•Log read/transfer/write daemon based on MessagePack

•structured data (Hash: key:value pairs)

•Plugin mechanism for input/output/buffer features

•many plugins are now published

Fluentd features: input/output

•File tailing, network, and other input plugins

•tail and parse line-by-line

•receive records from app logger or other fluentd

•in_syslog, in_exec, in_dstat, .....

•Output to many storages/systems

•other fluentd, file, S3, mongodb, mysql, HDFS, .....


Fluentd features: buffers

•Pluggable buffers

•output plugin buffers are swappable (by configuration)

•In-memory buffers: fast, but lost when fluentd goes down

•file buffers: slow, but always saved

•Buffer plugins can also be added by users

•No public plugin exists yet....


Fluentd features: routing

•Tag based routing

•all records have tag and time

•Fluentd uses tags to decide which plugin a record is sent to next

•configurations are:

•tag matcher pattern + plugin configuration
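Tag based routing can be pictured as a list of (pattern, plugin) pairs that is searched in order. The sketch below is a simplified re-implementation for illustration, not Fluentd's code: it assumes dot-separated tags, `*` matching exactly one tag part, `**` matching zero or more parts, and the first matching rule winning. The plugin names are placeholders.

```python
from typing import List, Optional, Sequence, Tuple


def _match(pattern: Sequence[str], parts: Sequence[str]) -> bool:
    if not pattern:
        return not parts                        # both exhausted: match
    head, rest = pattern[0], pattern[1:]
    if head == "**":                            # zero or more tag parts
        return any(_match(rest, parts[i:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    return (head == "*" or head == parts[0]) and _match(rest, parts[1:])


def tag_matches(pattern: str, tag: str) -> bool:
    """Check a dot-separated tag against a pattern with * and ** wildcards."""
    return _match(pattern.split("."), tag.split("."))


def route(rules: List[Tuple[str, str]], tag: str) -> Optional[str]:
    """Return the plugin of the first rule whose pattern matches the tag."""
    for pattern, plugin in rules:
        if tag_matches(pattern, tag):
            return plugin
    return None


# Hypothetical rules: tag matcher pattern + name of the configured plugin.
rules = [("access.serviceA", "out_a"), ("access.*", "out_access"), ("**", "out_default")]
```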


Fluentd features: exec_filter

•Output records to specified (and forked) command

•And get records from command's STDOUT

•We can specify our stream processor as command
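A stream processor used as such a command only has to read records from STDIN and write records to STDOUT. The sketch below assumes the records are exchanged as one JSON object per line and that each record has a numeric `status` field; both are assumptions for the example, as are the function name `run_filter` and the added `level` field.

```python
import io
import json
from typing import TextIO


def run_filter(stdin: TextIO, stdout: TextIO) -> None:
    """Read one JSON record per line, keep server errors, and write them back out."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status", 0) < 500:      # select
            continue
        record["level"] = "error"              # convert
        stdout.write(json.dumps(record, sort_keys=True) + "\n")
        stdout.flush()                         # do not hold records back in the pipe


# Demo with in-memory streams; as a real command, call run_filter(sys.stdin, sys.stdout).
demo_in = io.StringIO('{"status": 200}\n{"status": 503, "path": "/"}\n')
demo_out = io.StringIO()
run_filter(demo_in, demo_out)
```

Flushing after every record matters here: without it, output may sit in the command's STDOUT buffer and the stream stalls.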


I'm very sorry that....


Fluentd is written in Ruby

Fluentd plugins are released as rubygems


Problems about Fluentd (for stream processing)

•Eager buffering

•Eager default buffering config: does not flush under 8MB

•Performance

•Many features for data protection hurt performance

•Doesn't work on Windows


Implementations in the Perl world


fluent-agent-lite (Fluent::AgentLite)

•Log collection agent tool (in Perl) by tagomoris

•fast and low load

•gets logs from file/STDIN, and sends to other nodes

•minimal features for log collector agent

•doesn't parse log lines (sends 1 attribute with the whole log line)

•supports load balancing and failover of destination
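"Doesn't parse log lines" can be illustrated as wrapping each raw line into a record with a single attribute, next to the tag and time that every record carries. This is a sketch, not fluent-agent-lite's code: the function `make_event` and the attribute name `message` are assumptions for the example.

```python
import time
from typing import Dict, Optional, Tuple


def make_event(tag: str, line: str, now: Optional[float] = None) -> Tuple[str, int, Dict[str, str]]:
    """Wrap one raw log line into a (tag, time, record) event without parsing it."""
    timestamp = int(time.time() if now is None else now)
    # The whole line goes into one attribute; parsing is left to later nodes.
    return (tag, timestamp, {"message": line.rstrip("\n")})
```

Keeping the agent this thin moves the heavy work (parsing, converting) away from the service nodes.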

fluent-agent (Fluent::Agent)

•Fluentd feature subset tools by tagomoris

•written in Perl

•libuv and UV module for async I/O lib (for Windows)

•Goal: simple, fast and easy deployment

•UNDER CONSTRUCTION

•60% of features and many bugs, not on CPAN yet


Features of Fluent::Agent

•1 input, 1 output and 0/1 filter

•Network I/O: protocol compatible with Fluentd

•and simple load balancing/failover feature

•File input/output: superset features of Fluentd (in plan)

•Filter with any command: compatible with Fluentd's exec_filter

data/records -> input -> filter (any program you want) -> output -> data/records


Pros of Fluent::Agent (in plan)

•Simple and fast software for stream processing

•Stateless nodes

•fluent-agent works without any configuration files

•fluent-agent works with only commandline options

•Simple buffering and load balance

•less memory usage


Cons of Fluent::Agent (in fact)

•Poor input/output methods

•fluent-agent doesn't have plugin architecture (currently)

•in future, CPAN based plugin system?

•Lack of data protection against process death

•fluent-agent has only a memory buffer


Fluentd and fluent-agent


Fluentd and fluent-agent and fluent-agent-lite

[Diagram: many service nodes, each running fluent-agent-lite -> deliver (fluent-agent) -> processor (fluent-agent) -> writer for storages / aggregator (fluentd)]


Conclusion

•Distributed Stream Processing is:

•a way to provide more power to our applications

•a very hard (and interesting) problem

•supported by some frameworks/tools like Fluentd and/or fluent-agent


Thanks!

Let's try to improve your application

with stream processing

instead of many many batches

CAST: crouton & luke & chacha

Thanks to @kbysmnr
