1© Cloudera, Inc. All rights reserved.
Oryx 2 Overview
Sean Owen | Cloudera | @sean_r_owen

Consider the Music Recommender
• Collect Play Data & Do Data Science
• Build Taste Model Offline
• Learn Quickly from New Plays
• Recommend New Songs Now

From Exploratory to Operational?
• Exploratory Analytics
  • Explore Data
  • Pick Model
• Operational Analytics
  • Build Model at Scale, Offline
  • Continuously Update Model?
  • Score Model in Real-Time?

Large Scale or Real-Time?
Large-Scale, Offline, Batch vs. Real-Time, Online, Streaming
Why Don’t We Have Both?
λ!

The Lambda Architecture
• Batch Layer
  • High latency, high throughput
  • Compute official result
• Speed Layer
  • Low latency
  • Compute approximate update to last known result
• Serving Layer
  • Real-time
  • Merge batch/speed results
www.ymc.ch/en/lambda-architecture-part-1
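The division of labor among the three layers can be sketched with a toy play-count example. This is only an illustration under simple assumptions: the function names and sample events are hypothetical, and the "model" is just a count per song, so the speed-layer delta happens to be exact here, whereas for real ML models it is generally an approximation.

```python
from collections import Counter

def batch_layer(all_events):
    """Recompute the official result (play counts per song) from all data."""
    return Counter(all_events)

def speed_layer(new_events):
    """Compute an update (a delta) from events seen since the last batch run."""
    return Counter(new_events)

def serving_layer(batch_result, speed_delta, song):
    """Answer a query by merging the batch result with the speed-layer delta."""
    return batch_result[song] + speed_delta[song]

historical = ["song_a", "song_b", "song_a"]  # already covered by the batch layer
recent = ["song_a", "song_c"]                # arrived after the last batch run

batch_result = batch_layer(historical)
speed_delta = speed_layer(recent)
print(serving_layer(batch_result, speed_delta, "song_a"))  # 3
```

When the next batch run absorbs the recent events, the delta can be discarded and the cycle starts again.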

λ Architecture fits ML + Hadoop
• Batch Layer
  • Train, evaluate, tune model over all data in hours
• Speed Layer
  • Update model approximately in minutes or seconds
• Serving Layer
  • Make prediction, recommendation from model in milliseconds
Streaming MLlib
www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg

History (or: 5th time’s a charm)
• Taste (2005 – 2009)
  - Recommender toolkit in Java
  - Local only
  - Serves results
• Apache Mahout (2009 – 2014)
  - Adds Hadoop-based model building at scale
  - But no serving
• Myrrix (2011 – 2013)
  - Mahout recs reimagined
  - Adds serving to Hadoop-based model build
• Oryx 1 (2013 –)
  - Extends to classification, clustering
  - PMML
  - Merge with cloudera/ml
• Oryx 2 (2014 –)
  - Same APIs / goals
  - Rewrite
  - Full lambda architecture
  - Kafka + Spark + YARN
Complementary, Not Competitive
Most ML-on-Hadoop tools are for building models only, and excel at this.
Oryx and similar projects do everything else around this: continuous update, serving
Architecture

Data Transport
• Input Kafka topic
  • Any type; usually strings
  • From external sources or the Serving Layer
• Update Kafka topic
  • Serialized models (PMML) produced by Batch Layer
  • Model updates / deltas produced by Speed Layer
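The two-topic flow can be sketched with in-memory queues standing in for Kafka. No real Kafka client is used; the function names, the sample messages, and the `MODEL` / `UP` keys are hypothetical labels chosen for this sketch to distinguish full models from deltas.

```python
from collections import deque

# In-memory stand-ins for the two Kafka topics (no real Kafka client is used).
input_topic = deque()
update_topic = deque()

def publish_input(line):
    """Raw input, e.g. from an external source or the Serving Layer."""
    input_topic.append(line)

def publish_model(pmml_string):
    """A full serialized model, as the Batch Layer would publish it."""
    update_topic.append(("MODEL", pmml_string))  # "MODEL" key is an assumption

def publish_update(delta):
    """A model update / delta, as the Speed Layer would publish it."""
    update_topic.append(("UP", delta))           # "UP" key is an assumption

publish_input("user1,song_a,1")
publish_model("<PMML>...</PMML>")    # placeholder, not a real PMML document
publish_update("user1,song_a,0.1")

for key, message in update_topic:
    print(key, message)
```

Keeping models and deltas on one topic means a consumer that reads it from the start sees the latest model followed by the updates made to it.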

Batch Layer
• Spark Streaming
• Persists input topic data to HDFS from Kafka
• Builds “model” occasionally from historical and new data
• Hours
• ML: can use MLlib
• ML: tunes hyperparameters
• Publishes models as PMML to update topic
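The "tunes hyperparameters" step can be illustrated with a plain train/evaluate loop over candidate values. This is a minimal sketch, not Oryx's implementation: the one-weight ridge model, the synthetic data, the 80/20 split, and the candidate list are all assumptions made for illustration.

```python
import random

def train(points, lam):
    """Fit y = w * x with an L2 penalty lam; returns the weight w."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def evaluate(w, points):
    """Mean squared error on held-out points (lower is better)."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

# Synthetic data: y is roughly 2 * x plus noise (seeded for repeatability).
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    data.append((x, 2.0 * x + rng.gauss(0, 0.5)))

train_set, test_set = data[:160], data[160:]

# Build one candidate model per hyperparameter value and keep the best one.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: evaluate(train(train_set, lam), test_set) for lam in candidates}
best_lam = min(scores, key=scores.get)
best_model = train(train_set, best_lam)
print(best_lam, scores[best_lam])
```

In a batch layer, the winning model would then be serialized and published for the other layers to load.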

Speed Layer
• Spark Streaming
• Listens for new PMML models
• Listens to input topic too
• Computes approximate updates to model implied by input and publishes to update topic
• Seconds
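One way to picture an "approximate update" is a clustering model where each new input point nudges its nearest centroid. This is a sketch under assumptions, not the Oryx code: the model layout, the function names, and the shape of the emitted update message are hypothetical. It is approximate because a full rebuild would also reassign older points, which this skips.

```python
def nearest(centroids, point):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(centroids[i], point)),
    )

def update_centroid(centroid, count, point):
    """Move a centroid toward a new point, as if the point were averaged in."""
    new_count = count + 1
    new_centroid = [c + (p - c) / new_count for c, p in zip(centroid, point)]
    return new_centroid, new_count

def speed_update(model, point):
    """Apply one input point to the in-memory model and return a delta message."""
    i = nearest(model["centroids"], point)
    model["centroids"][i], model["counts"][i] = update_centroid(
        model["centroids"][i], model["counts"][i], point
    )
    return ("UP", i, model["centroids"][i])  # hypothetical delta format

# Hypothetical model as last published by a batch run.
model = {"centroids": [[0.0, 0.0], [10.0, 10.0]], "counts": [3, 5]}
update = speed_update(model, [2.0, 4.0])
print(update)  # ('UP', 0, [0.5, 1.0])
```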

Serving Layer
• Tomcat + JAX-RS
• (Can deploy on YARN)
• REST API
• Listens for new PMML models and updates from update topic
• Scores model / answers queries
• Writes to input topic too
• No shared state; scales horizontally
• Milliseconds
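For a recommender, "scores model / answers queries" might look like the following sketch. It assumes a latent-feature model with one vector per user and per item, held in memory and scored by dot product; the data, function names, and update format are hypothetical and not taken from the Oryx API.

```python
import heapq

# Hypothetical in-memory model state: one latent-feature vector per user and item.
user_features = {"u1": [1.0, 0.0], "u2": [0.5, 0.5]}
item_features = {"song_a": [0.9, 0.1], "song_b": [0.2, 0.8], "song_c": [0.6, 0.6]}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend(user, how_many=2):
    """Score every item against the user's vector and return the top few."""
    u = user_features[user]
    scored = ((dot(u, v), item) for item, v in item_features.items())
    return [item for _, item in heapq.nlargest(how_many, scored)]

def apply_update(kind, key, vector):
    """Apply one delta read from the update topic ("user" or "item" vector)."""
    target = user_features if kind == "user" else item_features
    target[key] = vector

print(recommend("u1"))  # ['song_a', 'song_c']
apply_update("item", "song_b", [2.0, 0.0])
print(recommend("u1"))  # ['song_b', 'song_a']
```

Because each instance keeps its own copy of the model and rebuilds it from the update topic, instances need no shared state. Scoring every item per query also means cost grows with item count and feature count.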

Logical Architecture
              Serving Layer         Speed Layer                Batch Layer
App Tier      oryx-app-serving      oryx-app-mllib, oryx-app   oryx-app-mllib, oryx-app
ML Tier       -                     oryx-ml                    oryx-ml
Lambda Tier   oryx-lambda-serving   oryx-lambda                oryx-lambda
• Lambda Tier: Generic Lambda-Architecture support
• ML Tier: ML-specific specialization
• App Tier: Prebuilt recommender, clustering, classification implementations

Recommendation Benchmarks
• Scoring on the fly is not cheap
• 1M users/items ≈ 1GB heap at scale (≈ 200 features)
• Feature and item counts determine latency, throughput
• Java 8 + 16-core 2.3GHz Xeon
  • Smallish models ≈ 100s QPS, 10s ms latency
  • Huge models ≈ single-digit QPS, 100s ms latency
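The heap figure is roughly what back-of-envelope arithmetic gives. This sketch assumes each feature is stored as a 4-byte float and that a query scores every item by brute force; the slide states neither, so both are assumptions, and JVM object overhead is ignored.

```python
def feature_heap_bytes(num_vectors, num_features, bytes_per_value=4):
    """Raw bytes for the feature vectors alone (assumes 4-byte floats)."""
    return num_vectors * num_features * bytes_per_value

raw = feature_heap_bytes(1_000_000, 200)
print(raw / 1e9)  # 0.8 (GB), before any object or index overhead

# Brute-force scoring touches every feature of every item once per query.
multiply_adds_per_query = 1_000_000 * 200
print(multiply_adds_per_query)  # 200000000
```

About 0.8 GB of raw values is consistent with the quoted ≈ 1GB once overhead is added, and the per-query work suggests why latency and throughput depend on feature and item counts.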

Key Technology Roster
• Spark 1.3.1
  • MLlib
  • Streaming
• Kafka 0.8.2.1
• Hadoop 2.6
  • HDFS
  • YARN
• JavaEE 7
  • JAX-RS 2
  • Jersey 2
• Servlet 3.1
  • Tomcat 8
• JPMML + PMML 4.2.1
CDH 5.4+

Status
• Cloudera Labs project
  • Partial collaboration with Intel
  • Not shipped with CDH
  • Not supported; no plans yet
• 2.0.0 beta 3
  • Suitable for POCs
  • 2.0.0 by end of year
• Best For
  • Recommender engines
  • Real-time anomaly detection
  • Real-time classification
  • Problems where both scale and latency are important
  • CDH users
Get Started in ~1 Hour
http://oryx.io

Thank you

The conference for and by Data Scientists, from startup to enterprise
wrangleconf.com
Public registration is now open!
Who: Featuring data scientists from Salesforce, Uber, Pinterest, and more
When: Thursday, October 22, 2015
Where: Broadway Studios, San Francisco