Distributed Stream Processing in the real [Perl] world

YAPC::Asia 2012 Day 1 (2012/09/28)

TAGOMORI Satoshi (@tagomoris)

NHN Japan

Saturday, September 29, 2012

tagomoris

• TAGOMORI Satoshi ( @tagomoris )

• Working at NHN Japan

What this talk contains

• What "Stream Processing" is

• Why we want "Stream Processing"

• What features we should write for "Stream Processing"

• Frameworks and tools for "Distributed Stream Processing"

• Implementations in the Perl world

What "Stream Processing" is

Stream

Stream ?

•Continuously increasing data

•access logs, trace logs, sales checks, ...

•typically written to a file line by line

tail -f

Stream Processing

•Convert, select, aggregate passed data

•Does NOT wait for EOF (in many cases)

tail -f | grep ^hit | sed -e s/hit/miss/g

Stream Processing over network

•Data are collected from many nodes

•to search/query/store

•Separate heavy processes from edge nodes

edge: tail -f | nc

backend: nc -l | grep | sed | tee | ...

Why we want "Stream Processing"

Batch file copy & convert

access.0928.16.log: log lines from 16:00 to 16:59 (60min.)

flush wait: 3min.

Copy over network: ?min.

Convert into query friendly structure: ?min.

latency for 16:00 log: 62+ minutes

Stream data copy & convert

access.0928.16.log: log lines from 16:00 to 16:59

Copy over network in real time, convert one after another

Very low latency for each log line (if traffic is not larger than capacity)

Case of data size explosion (batch)

serviceA

serviceB

needs long transfer time

serviceC

serviceD

Casual batch over multiple nodes/services may be blocked by unbalanced data size

Asynchronous batch is a very good problem...

Case of data size explosion (stream)

serviceA

serviceB

serviceC

serviceD

heavy traffic

Streams are mixed and not blocked by heavy traffic

(if traffic is not larger than capacity)

What features we should write for "Stream Processing"

One-by-one input/process/output

one record → convert / format / select → one record (or none)

•Basic feature

•I/O call overhead is relatively heavy
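A minimal Python sketch of one-by-one processing, assuming the same rule as the shell pipeline shown earlier (select lines starting with `hit`, rewrite `hit` to `miss`). The function name `process_one_by_one` is hypothetical and only for illustration.

```python
from typing import Iterable, Iterator


def process_one_by_one(lines: Iterable[str]) -> Iterator[str]:
    """Read one record, emit one record (or none), without waiting for EOF."""
    for line in lines:
        record = line.rstrip("\n")
        if not record.startswith("hit"):     # select: drop non-matching records
            continue
        yield record.replace("hit", "miss")  # convert: like sed s/hit/miss/g


# Works on any iterable of lines, for example sys.stdin.
print(list(process_one_by_one(["hit /index\n", "pass /img\n", "hit /hit\n"])))
```

Because the function is a generator, each output record is available as soon as its input line arrives, at the price of one input and one output call per record.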

Burst transfer/read/write and process

•less input/output calls

•more performance with async I/O and multi process

many records → read and store records temporarily (from input) → convert / format / select → read and store records temporarily (to output) → many records (or few or none)
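A rough sketch of burst processing in Python, assuming records are grouped by count only. The names `read_chunks` and `process_burst` and the `hit`/`miss` rule are hypothetical; the point is one output call per chunk instead of one per record.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List, TextIO


def read_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Store up to chunk_size records temporarily, then hand them over together."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def process_burst(lines: Iterable[str], out: TextIO, chunk_size: int = 1000) -> int:
    """Convert/select each chunk and write it with a single output call."""
    writes = 0
    for chunk in read_chunks(lines, chunk_size):
        selected = [l.replace("hit", "miss") for l in chunk if l.startswith("hit")]
        if selected:                      # many records in, few or none out
            out.write("".join(selected))  # one write call per chunk
            writes += 1
    return writes


out = io.StringIO()
print(process_burst(["hit a\n", "x\n", "hit b\n"], out, chunk_size=2), repr(out.getvalue()))
```

Grouping by count alone means a chunk waits until enough records arrive, so a quiet stream adds latency; that trade-off is the reason a time limit is usually needed as well.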

Control buffer flush intervals

•Control flushing by buffer size and latency

•(Semi-)real-time control flow arguments

•Max size of lost data when the process crashes

many records → buffer (read and store records temporarily, from input) → buffer (read and store records temporarily, to output) → many records (or few or none)

read inputs, write records: 0.5sec? 1sec? 3sec? 30sec?
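One possible shape for flush control, sketched in Python. The class `FlushingBuffer` and its parameters `max_records` and `max_age` are hypothetical names: the first bounds how much data can be lost if the process dies, the second bounds latency.

```python
import time
from typing import Callable, List, Optional


class FlushingBuffer:
    """Buffer records; flush at max_records records or after max_age seconds."""

    def __init__(self, write: Callable[[List[str]], None], max_records: int,
                 max_age: float, clock: Callable[[], float] = time.monotonic):
        self.write = write
        self.max_records = max_records
        self.max_age = max_age
        self.clock = clock
        self.records: List[str] = []
        self.first_at: Optional[float] = None  # arrival time of oldest record

    def add(self, record: str) -> None:
        if not self.records:
            self.first_at = self.clock()
        self.records.append(record)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Call this from a timer too, so quiet streams still flush on time."""
        if not self.records:
            return
        too_big = len(self.records) >= self.max_records
        too_old = self.clock() - self.first_at >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        batch, self.records = self.records, []
        self.write(batch)
```

With `max_age=1.0` a record waits at most about one second (given a periodic `maybe_flush()` call), and with `max_records=1000` at most 1000 records sit in memory at the moment of a crash.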

Buffering/Queueing

output records → buffer → send to next node → next node

output records → buffer, buffer, buffer, buffer → (recover) → streaming
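A small Python sketch of output queueing, assuming the sender raises `ConnectionError` while the next node is down. `OutputQueue`, `max_chunks`, and the drop-oldest overflow policy are choices made for this sketch, not something the slides specify.

```python
from collections import deque
from typing import Callable, Deque, List


class OutputQueue:
    """Queue output chunks while the next node is down; drain in order on recovery."""

    def __init__(self, send: Callable[[List[str]], None], max_chunks: int = 100):
        self.send = send              # raises ConnectionError when the node is down
        self.max_chunks = max_chunks  # assumed limit on queued chunks
        self.queue: Deque[List[str]] = deque()

    def emit(self, chunk: List[str]) -> None:
        self.queue.append(chunk)
        while len(self.queue) > self.max_chunks:
            self.queue.popleft()      # overflow policy of this sketch: drop oldest
        self.try_flush()

    def try_flush(self) -> int:
        """Send queued chunks oldest first; stop at the first failure."""
        sent = 0
        while self.queue:
            try:
                self.send(self.queue[0])
            except ConnectionError:
                break                 # keep the chunk and retry later
            self.queue.popleft()
            sent += 1
        return sent
```

While the next node is down the queue simply grows; the first successful send after recovery drains everything that accumulated, in arrival order.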

Connection keepalive / connection pooling

•Keep connections and select one to use

•TCP connection establishment is costly

•manage node status (alive/down) at the same time

•not only inter-nodes, but also inter-process connections

node A

node B

node C

node D
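A sketch of connection keeping in Python, assuming a `connect(node)` callable is supplied by the caller (for real TCP it could wrap `socket.create_connection`). `ConnectionPool` and its method names are hypothetical.

```python
from typing import Callable, Dict, List, Set


class ConnectionPool:
    """Keep one open connection per node and track which nodes are alive."""

    def __init__(self, nodes: List[str], connect: Callable[[str], object]):
        self.nodes = nodes
        self.connect = connect
        self.conns: Dict[str, object] = {}
        self.down: Set[str] = set()

    def get(self, node: str) -> object:
        """Reuse the kept connection; connect only on first use or after a failure."""
        if node not in self.conns:
            try:
                self.conns[node] = self.connect(node)
            except OSError:
                self.down.add(node)
                raise
            self.down.discard(node)
        return self.conns[node]

    def mark_failed(self, node: str) -> None:
        """Drop a broken connection so the next get() reconnects."""
        self.conns.pop(node, None)
        self.down.add(node)

    def alive_nodes(self) -> List[str]:
        return [n for n in self.nodes if n not in self.down]
```

The connection cost is paid once per node rather than once per record, and the same object answers the question "which nodes can I send to right now?".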

Distribution

Distribution: Load balancing (cpu/node)

•Distribute large scale data to many nodes

•nodes: servers, or processor processes

•to make total performance high

records → load balancer → send to next node / processor
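The simplest load balancing policy is round robin. A short Python sketch, with a hypothetical `distribute` function and made-up node names:

```python
from itertools import cycle
from typing import Dict, Iterable, List


def distribute(records: Iterable[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Assign records to nodes in round-robin order."""
    assigned: Dict[str, List[str]] = {node: [] for node in nodes}
    for record, node in zip(records, cycle(nodes)):
        assigned[node].append(record)
    return assigned


print(distribute(["r1", "r2", "r3", "r4", "r5"], ["node1", "node2"]))
```

Round robin ignores how busy each node is; weighting by capacity or queue length would be a refinement of the same idea.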

Distribution: High availability (process/node)

•Distribute large scale data to N+1 (or 2 or more) nodes

•to make system tolerant of node trouble

•without any failover (and takeback) operations

records

load balancer
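A sketch of failover without operator action, in Python. It assumes `send(node, record)` raises `OSError` on failure; `FailoverSender` and `retry_after` are hypothetical names. A failed node is skipped for `retry_after` seconds and then tried again, so both failover and takeback happen automatically.

```python
import time
from typing import Callable, Dict, List


class FailoverSender:
    """Send to the first usable node; skip failed nodes until retry_after passes."""

    def __init__(self, nodes: List[str], send: Callable[[str, str], None],
                 retry_after: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.nodes = nodes
        self.send = send
        self.retry_after = retry_after
        self.clock = clock
        self.failed_at: Dict[str, float] = {}

    def usable(self, node: str) -> bool:
        failed = self.failed_at.get(node)
        return failed is None or self.clock() - failed >= self.retry_after

    def emit(self, record: str) -> str:
        """Return the name of the node that accepted the record."""
        for node in self.nodes:
            if not self.usable(node):
                continue
            try:
                self.send(node, record)
            except OSError:
                self.failed_at[node] = self.clock()  # automatic failover
                continue
            self.failed_at.pop(node, None)           # automatic takeback
            return node
        raise RuntimeError("no node available")
```

This sketch prefers nodes in list order, which is a primary/standby layout; combining it with round-robin selection over the usable nodes gives the N+1 arrangement.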

Routing

records → router → records for service A / records for service B / records for service C → process A / process B → router → output A / output B / output C
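Routing can be sketched as a lookup from a record's service name to a handler. In this Python sketch the `(service, payload)` record shape and the `route` function are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Tuple[str, str]  # (service name, payload); an assumed record shape


def route(records: Iterable[Record],
          handlers: Dict[str, Callable[[str], None]]) -> int:
    """Deliver each record to the handler for its service; return the unrouted count."""
    unrouted = 0
    for service, payload in records:
        handler = handlers.get(service)
        if handler is None:
            unrouted += 1  # no destination: counted and dropped in this sketch
            continue
        handler(payload)
    return unrouted
```

A handler can be a processor that itself emits records to a second router, which gives the two-stage layout of the diagram.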

TOO MANY FEATURES TO IMPLEMENT !!!!!

Frameworks and tools for "Distributed Stream Processing"

Frameworks and tools

•Apache Kafka

•written in Scala (... with JVM!)

•Twitter Storm

•written in Clojure (...with JVM!)

•Fluentd

Fluentd

•Mainly written by @frsyuki at Treasure Data

•Apache License 2.0 software on GitHub

•Log read/transfer/write daemon based on MessagePack

•structured data (Hash: key:value pairs)

•Plugin mechanism for input/output/buffer features

•many plugins are now published

Fluentd features: input/output

•File tailing, network, and other input plugins

•tail and parse line-by-line

•receive records from app logger or other fluentd

•in_syslog, in_exec, in_dstat, .....

•Output to many many storage/systems

•other fluentd, file, S3, mongodb, mysql, HDFS, .....

Fluentd features: buffers

•Pluggable buffers

•output plugin buffers are swappable (by configuration)

•In-memory buffers: fast, but lost when fluentd goes down

•File buffers: slow, but always saved

•Buffer plugins can also be added by users

•No public plugin exists yet....

Fluentd features: routing

•Tag based routing

•all records have tag and time

•Fluentd uses tags to decide which plugin each record is sent to next

•configurations are:

•tag matcher pattern + plugin configuration
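A simplified Python sketch of tag matching. It assumes tags are dot-separated, `*` matches exactly one tag part, `**` matches zero or more parts, and the first matching rule wins. `tag_matches` and `pick_plugin` are hypothetical names, and other pattern forms are not handled.

```python
from typing import List, Optional, Tuple


def _match_parts(pat: List[str], tag: List[str]) -> bool:
    if not pat:
        return not tag
    head, rest = pat[0], pat[1:]
    if head == "**":
        # try letting "**" consume 0, 1, ... up to all remaining tag parts
        return any(_match_parts(rest, tag[i:]) for i in range(len(tag) + 1))
    if not tag:
        return False
    if head == "*" or head == tag[0]:
        return _match_parts(rest, tag[1:])
    return False


def tag_matches(pattern: str, tag: str) -> bool:
    return _match_parts(pattern.split("."), tag.split("."))


def pick_plugin(rules: List[Tuple[str, str]], tag: str) -> Optional[str]:
    """Return the plugin name of the first rule whose pattern matches the tag."""
    for pattern, plugin in rules:
        if tag_matches(pattern, tag):
            return plugin
    return None


rules = [("access.*", "forward"), ("**", "file")]
print(pick_plugin(rules, "access.web"), pick_plugin(rules, "app.error"))
```

Each rule pairs a tag pattern with a plugin name, which mirrors the "tag matcher pattern + plugin configuration" structure of the configuration.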

Fluentd features: exec_filter

•Output records to specified (and forked) command

•And get records from command's STDOUT

•We can specify our stream processor as command
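A sketch of a stream processor that could be run as such a command, written in Python. It assumes records are exchanged as one JSON object per line, which depends on how the plugin is configured. The `status` rule and the `processed` attribute are made up for illustration.

```python
import io
import json
from typing import TextIO


def run_filter(infile: TextIO, outfile: TextIO) -> None:
    """Read one JSON record per line, select and convert, write one JSON record per line."""
    for line in infile:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("status") == "404":  # select: hypothetical drop rule
            continue
        record["processed"] = True         # convert: hypothetical extra attribute
        outfile.write(json.dumps(record) + "\n")
        outfile.flush()                    # do not hold records in the pipe buffer


demo_out = io.StringIO()
run_filter(io.StringIO('{"path": "/a", "status": "200"}\n'), demo_out)
print(demo_out.getvalue(), end="")
```

Calling `run_filter(sys.stdin, sys.stdout)` at the end of a script turns it into a command that reads records from STDIN and writes them to STDOUT.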

I'm very sorry that....

Fluentd is written in Ruby

Fluentd plugins are released as rubygems

Problems about Fluentd (for stream processing)

•Eager buffering

•Eager default buffering configuration: does not flush under 8MB

•Performance

•The many features for data protection hurt performance

•Doesn't work on Windows

Implementations in the Perl world

fluent-agent-lite (Fluent::AgentLite)

•Log collection agent tools (in perl) by tagomoris

•fast and low load

•gets logs from file/STDIN, and sends to other nodes

•minimal features for log collector agent

•doesn't parse log lines (sends 1 attribute containing the whole log line)

•supports load balancing and failover of destinations

fluent-agent (Fluent::Agent)

•Fluentd feature subset tools by tagomoris

•written in Perl

•libuv and UV module for async I/O lib (for Windows)

•Goal: simple, fast and easy deployment

•UNDER CONSTRUCTION

•60% of features and many bugs, not on CPAN yet

Features of Fluent::Agent

•1 input, 1 output and 0/1 filter

•Network I/O: protocol compatible with Fluentd

•and simple load balancing/failover feature

•File input/output: superset features of Fluentd (in plan)

•Filter with any command: compatible with Fluentd's exec_filter

data/records → input → filter (any program you want) → output → data/records

Pros of Fluent::Agent (in plan)

•Simple and fast software for stream processing

•Stateless nodes

•fluent-agent works without any configuration files

•fluent-agent works with only commandline options

•Simple buffering and load balance

•less memory usage

Cons of Fluent::Agent (in fact)

•Poor input/output methods

•fluent-agent doesn't have plugin architecture (currently)

•in future, CPAN based plugin system?

•Lack of data protection against process death

•fluent-agent has only a memory buffer

Fluentd and fluent-agent

Fluentd and fluent-agent and fluent-agent-lite

service node fluent-agent-lite

service node fluent-agent-lite

deliver

fluent-agent

processor

fluent-agent

fluent-agent

writer for storages / aggregator

fluentd

Conclusion

•Distributed Stream Processing is:

•provides more power to our applications

•a very hard (and interesting) problem

•supported by frameworks/tools like Fluentd and/or fluent-agent

Thanks!

Let's try to improve your application

with stream processing

instead of many many batches

CAST: crouton & luke & chacha

Thanks to @kbysmnr
