© Cloudera, Inc. All rights reserved. Oryx 2 Overview. Sean Owen | Cloudera | @sean_r_owen

Lambda architecture on Spark, Kafka for real-time large scale ML


Page 1: Lambda architecture on Spark, Kafka for real-time large scale ML


Oryx 2 Overview

Sean Owen | Cloudera | @sean_r_owen

Page 2: Lambda architecture on Spark, Kafka for real-time large scale ML

Consider the Music Recommender

• Collect Play Data & Do Data Science
• Build Taste Model Offline
• Learn Quickly from New Plays
• Recommend New Songs Now

Page 3: Lambda architecture on Spark, Kafka for real-time large scale ML

From Exploratory to Operational?

Exploratory Analytics vs Operational Analytics

• Explore Data, Pick Model
• Build Model at Scale, Offline
• Continuously Update Model ?
• Score Model in Real-Time ?

Page 4: Lambda architecture on Spark, Kafka for real-time large scale ML

Large Scale or Real-Time?

Large-Scale, Offline, Batch

vs

Real-Time, Online, Streaming

Why Don’t We Have Both?

λ!

Page 5: Lambda architecture on Spark, Kafka for real-time large scale ML

The Lambda Architecture

• Batch Layer
  • High latency, high throughput
  • Compute official result
• Speed Layer
  • Low latency
  • Compute approximate update to last known result
• Serving Layer
  • Real-time
  • Merge batch/speed results

www.ymc.ch/en/lambda-architecture-part-1
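
The division of labor between the three layers can be illustrated with a toy play-count example. This is a minimal sketch, not Oryx code: the function names (`batch_counts`, `speed_delta`, `serve`) and the sample data are hypothetical. Counting happens to merge exactly; for a real ML model the speed-layer update would only approximate what the next batch run computes.

```python
from collections import Counter

def batch_counts(all_events):
    """Batch layer: recompute the official result from all historical data."""
    return Counter(all_events)

def speed_delta(recent_events):
    """Speed layer: compute an update from events the last batch run has not seen."""
    return Counter(recent_events)

def serve(batch_result, delta, key):
    """Serving layer: merge batch and speed results at query time."""
    return batch_result[key] + delta[key]

history = ["song_a", "song_b", "song_a"]  # data covered by the last batch run
recent = ["song_a", "song_c"]             # plays that arrived afterwards

official = batch_counts(history)
update = speed_delta(recent)
print(serve(official, update, "song_a"))
print(serve(official, update, "song_c"))
```

When the next batch run completes over `history + recent`, its result replaces `official` and the speed-layer delta can be discarded.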

Page 6: Lambda architecture on Spark, Kafka for real-time large scale ML

λ Architecture fits ML + Hadoop

• Batch Layer
  • Train, evaluate, tune model over all data in hours
• Speed Layer
  • Update model approximately in minutes or seconds
• Serving Layer
  • Make prediction, recommendation from model in milliseconds

Streaming, MLlib

Page 7: Lambda architecture on Spark, Kafka for real-time large scale ML


www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg

Page 8: Lambda architecture on Spark, Kafka for real-time large scale ML

History (or: 5th time’s a charm)

Taste (2005 – 2009)
- Recommender toolkit in Java
- Local only
- Serves results

Apache Mahout (2009 – 2014)
- Adds Hadoop-based model building at scale
- But no serving

Myrrix (2011 – 2013)
- Mahout recs reimagined
- Adds serving to Hadoop-based model build

Oryx 1 (2013 –)
- Extends to classification, clustering
- PMML
- Merge with cloudera/ml

Oryx 2 (2014 –)
- Same APIs / goals
- Rewrite
- Full lambda architecture
- Kafka + Spark + YARN

Page 9: Lambda architecture on Spark, Kafka for real-time large scale ML


Complementary, Not Competitive

Most ML-on-Hadoop tools are for building models only, and excel at this.

Oryx and similar projects handle everything else around this: continuous updates and serving.

Page 10: Lambda architecture on Spark, Kafka for real-time large scale ML


Architecture

Page 11: Lambda architecture on Spark, Kafka for real-time large scale ML

Data Transport

• Input Kafka topic
  • Any type; usually strings
  • From external or Serving Layer
• Update Kafka topic
  • Serialized models (PMML) produced by Batch Layer
  • Model updates / deltas produced by Speed Layer
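
The flow of data over the two topics can be sketched with in-memory queues standing in for Kafka. Everything here is an assumption made for illustration: no real broker or Kafka client is used, the message keys `"MODEL"` and `"UP"` are hypothetical labels for the two kinds of update-topic message, and the payload strings are placeholders.

```python
from collections import deque

# In-memory stand-ins for the two Kafka topics (hypothetical; no real broker).
input_topic = deque()   # raw input, usually strings
update_topic = deque()  # (key, message) pairs: full models and deltas

def send_input(line):
    """External clients or the Serving Layer append raw input."""
    input_topic.append(line)

def publish_model(pmml_text):
    """Batch Layer publishes a serialized model."""
    update_topic.append(("MODEL", pmml_text))

def publish_update(delta_text):
    """Speed Layer publishes an incremental model update."""
    update_topic.append(("UP", delta_text))

send_input("user1,song_a,1")
publish_model("<PMML>placeholder</PMML>")  # placeholder, not a valid PMML document
publish_update("song_a,+1")                # placeholder delta format

for key, message in update_topic:
    print(key, message)
```

Tagging each update-topic message with its kind lets a consumer tell a full model replacement from an incremental delta without parsing the payload first.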

Page 12: Lambda architecture on Spark, Kafka for real-time large scale ML

Batch Layer

• Spark Streaming
• Persists input topic data to HDFS from Kafka
• Builds “model” occasionally from historical and new data
• Hours
• ML: can use MLlib
• ML: tunes hyperparameters
• Publishes models as PMML to update topic
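
One batch generation, with hyperparameter tuning on held-out data, might look roughly like the following. This is a sketch under stated assumptions rather than the Oryx or MLlib API: the model is a one-dimensional ridge regression chosen only because it fits in a few lines, and `train`, `evaluate`, `build_model`, and the sample data are all hypothetical. Serializing the result to PMML and publishing it is omitted.

```python
def train(data, lam):
    """Fit y ~ w * x with L2 penalty lam (closed-form 1-D ridge regression)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def evaluate(w, data):
    """Mean squared error of the model on held-out data."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

def build_model(historical, new, candidates, test_fraction=0.25):
    """One batch generation: merge data, tune lam on a hold-out, retrain on all data."""
    data = historical + new
    n_test = max(1, int(len(data) * test_fraction))
    train_set, test_set = data[:-n_test], data[-n_test:]
    scored = [(evaluate(train(train_set, lam), test_set), lam) for lam in candidates]
    best_error, best_lam = min(scored)
    return train(data, best_lam), best_lam, best_error

historical = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]   # previously persisted input
new = [(5.0, 9.8), (6.0, 12.1), (7.0, 14.2), (8.0, 15.9)]       # input since the last run
w, lam, err = build_model(historical, new, candidates=[0.0, 1.0, 10.0])
print(w, lam, err)
```

Each candidate is trained on the training split and scored on the hold-out; the winning hyperparameter is then used to retrain on all of the data.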

Page 13: Lambda architecture on Spark, Kafka for real-time large scale ML

Speed Layer

• Spark Streaming
• Listens for new PMML models
• Listens to input topic too
• Computes approximate updates to model implied by input and publishes to update topic
• Seconds
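
As one illustration of an "approximate update implied by input", the sketch below folds each new point into its nearest cluster centroid as a running mean. It assumes a k-means-style model and is not taken from the Oryx source; `nearest`, `speed_update`, and the dictionary layout of the model are hypothetical.

```python
def nearest(centroids, point):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    def dist2(i):
        return sum((c - p) ** 2 for c, p in zip(centroids[i]["center"], point))
    return min(range(len(centroids)), key=dist2)

def speed_update(centroids, micro_batch):
    """Fold each new point into its nearest centroid as a running mean,
    and return the (index, center, count) deltas that would be published."""
    deltas = []
    for point in micro_batch:
        i = nearest(centroids, point)
        c = centroids[i]
        c["count"] += 1
        c["center"] = [m + (p - m) / c["count"] for m, p in zip(c["center"], point)]
        deltas.append((i, list(c["center"]), c["count"]))
    return deltas

# Last model received from the batch layer (hypothetical values).
model = [{"center": [0.0, 0.0], "count": 4}, {"center": [10.0, 10.0], "count": 4}]
deltas = speed_update(model, [[1.0, 1.0], [9.0, 11.0]])
for delta in deltas:
    print(delta)
```

The update is approximate because cluster assignments of older points are never revisited; only the next full batch build does that.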

Page 14: Lambda architecture on Spark, Kafka for real-time large scale ML

Serving Layer

• Tomcat + JAX-RS
• (Can deploy on YARN)
• REST API
• Listens for new PMML models and updates from update topic
• Scores model / answers queries
• Writes to input topic too
• No shared state; scales horizontally
• Milliseconds
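
The "no shared state" property can be sketched as follows: each serving instance builds its in-memory model purely by replaying the update topic, so adding replicas requires no coordination between them. This is an illustrative assumption, not the Oryx implementation; the `ServingInstance` class, the `"MODEL"` and `"UP"` keys, and the use of Python dictionaries in place of PMML are all hypothetical.

```python
class ServingInstance:
    """One serving replica: holds the model in memory, shares nothing."""

    def __init__(self):
        self.scores = {}

    def consume(self, key, message):
        """Apply one message read from the update topic."""
        if key == "MODEL":      # a full model replaces the current state
            self.scores = dict(message)
        elif key == "UP":       # an incremental update touches one entry
            item, score = message
            self.scores[item] = score

    def top_n(self, n):
        """Answer a query from the in-memory model only."""
        ranked = sorted(self.scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [item for item, _ in ranked[:n]]

# Stand-in for the update topic's contents (hypothetical values).
update_log = [
    ("MODEL", {"song_a": 0.9, "song_b": 0.5}),
    ("UP", ("song_c", 0.95)),
]

replicas = [ServingInstance(), ServingInstance()]
for replica in replicas:
    for key, message in update_log:
        replica.consume(key, message)

print([replica.top_n(2) for replica in replicas])
```

Because every replica sees the same ordered messages, all replicas converge to the same model and any of them can answer any query.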

Page 15: Lambda architecture on Spark, Kafka for real-time large scale ML

Logical Architecture

              Serving Layer         Speed Layer                Batch Layer
App Tier      oryx-app-serving      oryx-app-mllib, oryx-app   oryx-app-mllib, oryx-app
ML Tier       -                     oryx-ml                    oryx-ml
Lambda Tier   oryx-lambda-serving   oryx-lambda                oryx-lambda

• Lambda Tier: generic Lambda-Architecture support
• ML Tier: ML-specific specialization
• App Tier: prebuilt recommender, clustering, classification implementations

Page 16: Lambda architecture on Spark, Kafka for real-time large scale ML

Recommendation Benchmarks

• Scoring on the fly is not cheap
• 1M user/items ≈ 1GB heap at scale (≈ 200 features)
• Feature, item count determines latency, throughput
• Java 8 + 16-core 2.3GHz Xeon
• Smallish models ≈ 100s QPS, 10s ms latency
• Huge models ≈ single digit QPS, 100s ms latency
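
A back-of-the-envelope calculation shows why feature and item counts drive both heap size and latency. The sketch assumes dense factor vectors stored as 4-byte floats and exhaustive scoring of every item per query; both are assumptions for illustration, not details stated in the slides, and the function names are hypothetical.

```python
def factor_bytes(n_vectors, n_features, bytes_per_value=4):
    """Raw size of a dense factor matrix (assumes 4-byte floats)."""
    return n_vectors * n_features * bytes_per_value

def multiply_adds_per_query(n_items, n_features):
    """Brute-force scoring: one dot product per candidate item."""
    return n_items * n_features

raw = factor_bytes(1_000_000, 200)
print(raw / 1e9)                                 # raw gigabytes, before JVM overhead
print(multiply_adds_per_query(1_000_000, 200))   # work for one exhaustive query
```

Under these assumptions, 1M vectors of 200 features take about 0.8 GB of raw storage, which is in line with the ≈ 1GB heap figure once object and indexing overhead is added, and a single exhaustive query costs about 200 million multiply-adds.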

Page 17: Lambda architecture on Spark, Kafka for real-time large scale ML

Key Technology Roster

• Spark 1.3.1
  • MLlib
  • Streaming
• Kafka 0.8.2.1
• Hadoop 2.6
  • HDFS
  • YARN
• JavaEE 7
• JAX-RS 2
• Jersey 2
• Servlet 3.1
• Tomcat 8
• JPMML + PMML 4.2.1

CDH 5.4+

Page 18: Lambda architecture on Spark, Kafka for real-time large scale ML

Status

• Cloudera Labs project
  • Partial collaboration with Intel
  • Not shipped with CDH
  • Not supported, no plans to yet
• 2.0.0 beta 3
  • Suitable for POCs
  • 2.0.0 by end of year
• Best For
  • Recommender engines
  • Real-time anomaly detection
  • Real-time classification
  • Problems where both scale and latency are important
  • CDH users

Page 19: Lambda architecture on Spark, Kafka for real-time large scale ML


Get Started in ~1 Hour

http://oryx.io

Page 20: Lambda architecture on Spark, Kafka for real-time large scale ML

Thank you

@[email protected]

Page 21: Lambda architecture on Spark, Kafka for real-time large scale ML

The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com

Public registration is now open!

Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco