kyoungmo-yang
Agenda
1. What is Pulsar?
2. Twitter stream processing demo
3. Key points
4. Other platforms
1. What is Pulsar?
● Developed by eBay
● Real-time analytics platform
● Stream processing framework
● Scalability
– Scale to tens of millions of events per second
● Availability
– No downtime during software upgrades, stream processing rule changes, or topology changes
● Flexibility
– SQL-like language and annotations for defining stream processing rules
Pulsar's Building Block (basic framework)
● Jetstream
– Real-time stream processing framework
– Spring IoC(Inversion of Control) container
Pulsar's Building Block (Cont.) (basic framework)
● Jetstream's key points
– CEP (complex event processing) capabilities through Esper integration
– Define processing logic in SQL
– Extends SQL functionality and pipeline flow routing using SQL
– Hot deploy SQL without restarting applications
– Spring IoC enabling dynamic topology changes at runtime
– Clustering with elastic scaling
– Cloud deployment
Pulsar Real-time Analytics Pipeline
● Collector : Ingests events through a REST endpoint
● Sessionizer : Sessionizes the events, maintaining the session state and generating marker events
● Distributor : Filters and mutates events to different consumers; acts as an event router
● Metrics Calculator : Calculates metrics by various dimensions and persists them in the metrics store
● Replay : Replays the failed events on other stages
● ConfigApp : Configures dynamic provisioning for the whole pipeline
1) Collector
● Supports a REST API to ingest events
● Geo and device classification enrichment
● Detects fraud and bots
● Streams the enriched events to the Sessionizer stage
PulsarRawEvent:A
"si": "UUID", "ipv4": "ip", ..., "itmP": "itmPrice", "capQ": "campaignQuantity"
Enrichment
PulsarEvent:A
"device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
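The enrichment step on this slide can be pictured as wrapping the raw event together with derived device and geo information. The following is a minimal Python sketch, not Pulsar's actual Collector code: `enrich`, `classify_device`, and `lookup_geo` are hypothetical names, and the two lookups are passed in as plain functions because the slides do not say how classification is done.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class PulsarEvent:
    """Enriched event with the three fields shown on the slide."""
    device: str
    geo: str
    raw: Dict[str, Any]


def enrich(
    raw: Dict[str, Any],
    classify_device: Callable[[Dict[str, Any]], str],  # hypothetical classifier
    lookup_geo: Callable[[str], str],                   # hypothetical geo lookup
) -> PulsarEvent:
    """Attach device and geo info to a raw event, keeping the raw event intact."""
    return PulsarEvent(
        device=classify_device(raw),
        geo=lookup_geo(raw.get("ipv4", "")),
        raw=raw,
    )


# Usage with placeholder lookups (assumptions for illustration only).
raw_event = {"si": "UUID", "ipv4": "203.0.113.7"}
event = enrich(raw_event, lambda r: "deviceinfo", lambda ip: "geoinfo")
```

Keeping the raw event nested under `raw` mirrors the slide, where the enriched event carries the original PulsarRawEvent rather than copying its fields.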
2) Sessionizer
● Temporal grouping of events that share a specific identifier; each group spans a window referred to as the session duration
● Session metadata and state
● Session store (in-memory cache)
PulsarEvent:A
"device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
Sessionization
PulsarEvent:A
"device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
Metadata:A
sessionId, PageId, geo-loc, device, etc.
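To make the idea of temporal grouping concrete, here is a small Python sketch of a sessionizer that keeps session state in an in-memory store and emits begin/end markers. It is an illustration under stated assumptions, not Pulsar's implementation: the class and marker names are hypothetical, and a session is closed lazily when the next event for the same identifier arrives after the idle limit, whereas a real system would likely expire sessions with a timer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Session:
    key: str
    start: float
    last_seen: float
    event_count: int = 0


class Sessionizer:
    """Groups events by identifier; a gap longer than max_idle_seconds starts a new session."""

    def __init__(self, max_idle_seconds: float) -> None:
        self.max_idle_seconds = max_idle_seconds
        self.store: Dict[str, Session] = {}  # in-memory session store

    def on_event(self, key: str, timestamp: float) -> List[Tuple[str, str]]:
        """Update session state for one event and return any marker events."""
        markers: List[Tuple[str, str]] = []
        session = self.store.get(key)
        # Close the previous session if it has been idle for too long.
        if session is not None and timestamp - session.last_seen > self.max_idle_seconds:
            markers.append(("session_end", key))
            session = None
        if session is None:
            session = Session(key=key, start=timestamp, last_seen=timestamp)
            self.store[key] = session
            markers.append(("session_begin", key))
        session.last_seen = timestamp
        session.event_count += 1
        return markers


sessionizer = Sessionizer(max_idle_seconds=30)
first = sessionizer.on_event("A", 0)      # opens a session for "A"
second = sessionizer.on_event("A", 10)    # same session, no markers
third = sessionizer.on_event("A", 100)    # idle gap exceeded: end + begin
```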
3) Distributor
● Event filtering, mutation and routing
PulsarEvent:A
"si": "AAAAAA", "device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"
PulsarEvent:B
"si": "BBBBBB", "device": "deviceinfo", "geo": "geoinfo", "raw": "PulsarRawEvent:A"

Distributes PulsarEvent:A and PulsarEvent:B using:
@OutputTo("OutboundMessageChannel")
@ClusterAffinityTag(colname="si")
@PublishOn(topics="Pulsar.MC/ssnzEvent")
select * from PulsarEvent;

OutboundMessageChannel → "Pulsar.MC/ssnzEvent" → InboundMessageChannel (two consumers)
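The @ClusterAffinityTag(colname="si") annotation names the column used for affinity, which suggests that events with the same "si" value are meant to reach the same consumer. The slides do not describe how Jetstream picks the consumer, so the Python sketch below substitutes a plain hash-modulo scheme purely for illustration; `pick_consumer` and the consumer names are hypothetical.

```python
import hashlib
from typing import Any, Dict, List


def pick_consumer(event: Dict[str, Any], consumers: List[str], affinity_key: str = "si") -> str:
    """Map an event to a consumer so that equal affinity keys always land together.

    Assumption: simple hash-modulo routing, used here only as a stand-in for
    whatever scheduling the real framework applies.
    """
    key_bytes = str(event[affinity_key]).encode("utf-8")
    digest = hashlib.sha256(key_bytes).digest()
    index = int.from_bytes(digest[:8], "big") % len(consumers)
    return consumers[index]


consumers = ["consumer-0", "consumer-1"]  # hypothetical consumer names
event_a = {"si": "AAAAAA", "device": "deviceinfo", "geo": "geoinfo"}
event_b = {"si": "BBBBBB", "device": "deviceinfo", "geo": "geoinfo"}

target_a = pick_consumer(event_a, consumers)
target_b = pick_consumer(event_b, consumers)
```

A hash-modulo scheme reshuffles most keys whenever the consumer list changes, which is one reason clustered systems often prefer consistent hashing for this job.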
4) Metrics Calculator
● Real-time metrics computation engine (Esper)
● Metrics are stored in Cassandra for batch processing
context MCContext
insert into PulsarEventCount
select count(*) as count from PulsarEvent
output snapshot when terminated;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from PulsarEventCount;
PulsarEventCount:C (calculated)
"count": 2
OutboundMessageChannel → "Pulsar.Report/metric" → InboundMessageChannel
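The Metrics Calculator statement counts PulsarEvent rows inside a context and emits a single snapshot when that context terminates. A rough Python analogue of that behaviour is sketched below. `CountingContext` is a hypothetical name, and because the slides do not show how MCContext is defined, the moment of termination is left to the caller.

```python
from typing import Any, Dict


class CountingContext:
    """Accumulates a count and emits it once, when the context is terminated."""

    def __init__(self) -> None:
        self.count = 0

    def on_event(self, event: Dict[str, Any]) -> None:
        # Nothing is emitted per event; the count only accumulates.
        self.count += 1

    def terminate(self) -> Dict[str, int]:
        """Return the snapshot for this context and reset for the next one."""
        snapshot = {"count": self.count}
        self.count = 0
        return snapshot


context = CountingContext()
context.on_event({"si": "AAAAAA"})
context.on_event({"si": "BBBBBB"})
metric = context.terminate()        # one snapshot covering both events
next_metric = context.terminate()   # a fresh context has seen nothing yet
```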
5) Replay
● At every stage, events are stored in Kafka
● Failed events are replayed on other stages
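The replay idea can be sketched as a buffer of failed events that is drained into a handler later. In the Python sketch below an in-memory deque stands in for Kafka, and `ReplayBuffer` with its methods is a hypothetical illustration rather than Pulsar's replay component.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Event = Dict[str, Any]


class ReplayBuffer:
    """Holds failed events and re-delivers them on request."""

    def __init__(self) -> None:
        self.pending: Deque[Event] = deque()  # stand-in for a Kafka topic

    def record_failure(self, event: Event) -> None:
        self.pending.append(event)

    def replay(self, handler: Callable[[Event], None]) -> int:
        """Try each pending event once; events that fail again stay queued."""
        delivered = 0
        for _ in range(len(self.pending)):
            event = self.pending.popleft()
            try:
                handler(event)
                delivered += 1
            except Exception:
                self.pending.append(event)
        return delivered


def flaky_handler(event: Event) -> None:
    # Hypothetical downstream stage that rejects one specific event.
    if event["si"] == "BBBBBB":
        raise RuntimeError("stage unavailable")


buffer = ReplayBuffer()
buffer.record_failure({"si": "AAAAAA"})
buffer.record_failure({"si": "BBBBBB"})
delivered_count = buffer.replay(flaky_handler)
```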
2. Demo (Twitter stream processing)
Twitter Stream → Collector
EPLs (Context)
context MCContext
insert into TwitterTopCountryCount
select count(*) as count, country from TwitterSample(country is not null)
group by country
output snapshot when terminated
order by count(*) desc limit 10;

context MCContext
insert into TwitterTopLangCount
select count(*) as count, lang from TwitterSample(lang is not null)
group by lang
output snapshot when terminated
order by count(*) desc limit 10;

context MCContext
insert into TwitterTopHashTagCount
select topKNested(1000, 20, hashtag, ',') as TopHashTag from TwitterSample(hashtag is not null)
output snapshot when terminated;

context MCContext
insert into TwitterEventCount
select count(*) as count from TwitterSample
output snapshot when terminated;
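For readers less familiar with EPL, the TwitterTopCountryCount and TwitterTopLangCount statements amount to: drop events where the field is null, count per value, and keep the ten largest counts. A minimal Python equivalent of that logic for one batch of events is sketched below; `top_counts` is a hypothetical helper and it ignores the context and streaming aspects of the real statements.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_counts(events: Iterable[Dict[str, Any]], field: str, limit: int = 10) -> List[Tuple[Any, int]]:
    """Count non-null values of `field` and return the `limit` most frequent."""
    counts = Counter(
        event[field] for event in events if event.get(field) is not None
    )
    return counts.most_common(limit)


sample = [
    {"country": "KR", "lang": "ko"},
    {"country": "US", "lang": "en"},
    {"country": "KR", "lang": "en"},
    {"country": None, "lang": "en"},
    {"lang": "ja"},
]
top_countries = top_counts(sample, "country")
top_langs = top_counts(sample, "lang", limit=1)
```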
EPLs (Select)
@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopCountryCount;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopLangCount;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterTopHashTagCount;

@BroadCast
@OutputTo("OutboundMessageChannel")
@PublishOn(topics="Pulsar.Report/metric")
select * from TwitterEventCount;
Dashboard
http://<hostname>:8088
3. Pulsar's key points
● Creating pipelines declaratively
● SQL driven processing logic with hot deployment of SQL
● Framework for custom SQL extensions
● Dynamic partitioning and flow control
● < 100 millisecond pipeline latency
● 99.99% Availability
● < 0.01% data loss
● Cloud deployable
4. Other Stream Processing Frameworks
● Storm(Trident)
– Storm Transactional Topology
– Stateful
● Storm(Esper)
– Our solution developed in the NexR project
– Integrates Esper
● Apache Spark
– Fast and general cluster computing platform for Big Data
– Supports SQL
Storm(+Esper) / Spark vs Pulsar
Points | Pulsar | Storm(Trident) | Storm(Esper) | Spark
Declarative pipeline wiring | O | X | X | X
Pipeline stitching | Run time | Build time | Build time | Build time
Hot deployment of topologies | O | X | X | X
SQL support | O | X | O | O
Hot deployment of processing rules | O | X | O | X
Pipeline flow control | O | △ | △ | ?
Stateful processing | O | O | △ | O
<http://gopulsar.io/docs/Pulsar_Presentation.pdf>
References
● http://www.ebaytechblog.com/2015/02/23/announcing-pulsar-real-time-analytics-at-scale/#.VQIuqBCsVW2
● http://gopulsar.io/
● https://github.com/pulsarIO/realtime-analytics/wiki
● http://gopulsar.io/html/docs.html
● https://spark.apache.org/
● https://storm.apache.org/
● http://www.espertech.com/
Q & A