21
Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder & CEO, Think Big

Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

Real-time Big Data Analytics with Storm

June 2013

Ron Bodkin Founder & CEO, Think Big

Page 2: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 2

Accelerating Your Time to Value

Strategy and Roadmap

IMAGINE Training

and Education

ILLUMINATE Hands-On

Data Science and Data Engineering

IMPLEMENT

Leading Provider of Data Science and Engineering Services

Page 3: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 3

Agenda

�  What is Storm? �  When is Storm appropriate? �  API Options �  Use Cases

Page 4: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 4

�  DAG Processing of never ending streams of data -  Open Source: https://github.com/nathanmarz/storm/wiki -  Used at Twitter plus > 24 other companies -  Reliable - At Least Once semantics -  Think MapReduce for data streams -  Java / Clojure based -  Bolts in Java and ‘Shell Bolts’ -  Not a queue, but usually reads from a queue.

�  Related: -  S4, CEP

�  Compromises -  Static topologies & cluster sizing – Avoid messy dynamic rebalancing -  Nimbus – SPOF

�  Strong Community Support, emerging commercial support

Storm Overview

Page 5: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 5

�  Cluster - Nimbus - Zookeeper - Supervisor, Workers

�  Topology �  Streams �  Tuple �  Spout �  Bolt �  Stream Groupings

- Shuffle, Field �  Trident �  DRPC

Storm Concepts

Page 6: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 6

What is Real-Time?

�  Low latency -  Query response -  Data refresh -  End-to-end response

�  … nanoseconds, milliseconds, seconds, or minutes depending on your problem

Page 7: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 7

�  Better end-user experience -  Ex: View an ad, see the counter move.

�  Operational intelligence -  Low latency analysis -  Operational intelligence

�  Event response -  Content half life measured at 3 hours (H Mason: http://bit.ly/nu7IDw) -  Path to additional real-time capabilities -  Scalable analysis -  Example: Trend analysis to recommend ‘hot’ articles.

Why Real-time?

Page 8: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 8

�  Scalable way to manage queue readers and logic pipeline �  Much better than roll your own �  Reliable (Message guarantees, fault tolerant) �  Multi-node scaling (1MM messages / 10 nodes) �  It works �  Open source community �  See also: https://github.com/nathanmarz/storm/wiki/Rationale

Why Storm?

Page 9: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 9

�  Raw Java API (tuples and raw computation) �  Trident – Java DSL (SQL-like primitives) �  Python adapter �  Closure DSL �  RedStorm – Ruby �  Wukong – Ruby �  Trident-ML �  Storm-Esper

Programming Options

Page 10: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 10

Use Cases

�  Scale analysis pipeline �  Lively stats �  Recommendations �  Better predictions �  Realtime analytics �  Online machine learning �  Distributed RPC

Page 11: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 11

Overall Architecture

NoSQL

Memcache (Tuple fail tracking)

Queue

Hadoop

Ad Serving

LB

Edge

Edge

Impression

S3S3

S3DFS

Archive Logs

ManagementServer

LB

Edge

Edge

RelationalStore

Ad ManagementAd Selling

Storm- Queue Management- Simple Bot Filtering- Real-time Bucketization- Performance Counters- Event Logging

View Ad

CleansingModel TrainingRecommendations

Events

Monitoring & Alerting (Metrics, Alarms, Notifications)

Model Parameters

getPrediction

Performance CountersImpression Buckets

Page 12: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 12

Integration Patterns

Page 13: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 13

Analytics Architecture

Storm

Web Server

Time SeriesBucketBoltSimple Bot

Annotator

DFS Adapter

Impression Spout

Time SeriesBuckets (Batch)

Time SeriesBuckets

(Realtime)

ImpressionPrediction

Predictive Model

Parameters

ImpressionsImpressionsImpressions

Hadoop

ImpressionBucketization

Predictive Model

Training

NoSQLBolt

Time

Page 14: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 14

Analyze Massive Historical

Data Set

Analyze Recent

Past

Realtime Prediction

Solution Approach

Historical Data Set = S3 Analyze = Hadoop + Pig + R

Recent Past = Storm + NoSQL Analyze = R + Web Service

Page 15: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 15

Example: Real-Time Recommendation Updates

Web Servers

Read & Update

•  Requests read single user profile

•  May update that profile record •  Classic key/value store

problem •  Can split transient state into

read/write NoSQL with batch view NoSQL

Batch Log Feed

Batch Model Feed

Hadoop Cluster Batch Export

Classic BI Access Ad Hoc Queries

MPP Database

NoSQL Database

Page 16: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 16

Real-Time Recommendation Updates w/ Storm

Web Servers

•  More complex recommendation logic

•  Parallelize analysis across topology

•  Processing benefits from distributed logic of streaming

Batch Log Feed

Batch Model Feed

Hadoop Cluster Batch Export

Classic BI Access Ad Hoc Queries

Storm

MPP Database

NoSQL Database

Page 17: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 17

Social Updates (real-time news feed)

Web Servers

•  Pages now integrate many profile records

•  Fan-out-on-read or Fan-out-on-write

•  Processing benefits from distributed logic of streaming

Batch Log Feed

Batch Model Feed

Hadoop Cluster Batch Export

Classic BI Access Ad Hoc Queries

Storm

MPP Database

NoSQL Database

Page 18: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 18

Finance: Trade Compliance Monitoring

Order Mgmt System

•  Storm reduces market data to usable subset

•  Integrates high, medium and low frequency data

•  Real-time Storm filter triggers deeper search via NoSQL

Batch Log Feed

Batch Model Feed

Hadoop Cluster

Batch Export

Classic BI Access

Ad Hoc Queries

Storm

MPP Database

NoSQL Database

Market Data Feed

Page 19: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 19

Operational Intelligence: Stats and Search

Web Servers

•  Cache and microbatch updates to multidimensional stats

•  Support distributed queries •  Cache updates

Hadoop Cluster

Ad Hoc Queries

Storm Sys Logs

Batch System Log Feed

Batch Summary Views

Search & NoSQL DB

Page 20: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 20

�  Easy to develop, hard to debug – start locally -  Timeouts

�  Storm infinite loop of failures -  Use Memcached to count # tuple failures

�  At Least Once processing -  Hadoop based read-repair job

�  Performance Counters not getting flushed -  Tick Tuples

�  Always ACK �  Batching to S3

-  Run a compaction & event de-duplication job in Hadoop

Lessons

Page 21: Real-time Big Data Analytics with Storm NoSQL Roadshow ...nosqlroadshow.com/dl/NoSQL-sanfran-2013/Realtime...Real-time Big Data Analytics with Storm June 2013 Ron Bodkin Founder &

No SQL Roadshow| 21

Conclusions

�  There’s many kinds of real-time problems �  Use of Hadoop and/or NoSQL can solve

-  Low latency queries -  Event response with localized intelligence -  Operational intelligence

�  Storm is valuable for -  Ingesting data within seconds -  Complex real-time distributed logic -  Operational intelligence

�  We’re Hiring!

21 6/12/13