Introduction to Hivemall (20 min.)
Makoto YUI (@myui), Research Engineer, Treasure Data Inc.
2015/05/14, TD tech talk #3 @Retty
http://myui.github.io/
Copyright ©2015 Treasure Data. All Rights Reserved.


Who am I?
• 2015/04: Joined Treasure Data, Inc.
• 1st Research Engineer in Treasure Data
• My mission in TD is developing ML-as-a-Service (MLaaS)
• 2010/04-2015/03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan
• Worked on a large-scale Machine Learning project and Parallel Databases
• 2009/03: Ph.D. in Computer Science from NAIST
• My research topic was building XML native database and Parallel Database systems
• Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers)
• Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida

Figures of Imported Data in Treasure Data
[Figure: imported records per month (unit: billion records, y-axis 0 to 12,000), Aug-12 to Oct-14. Annotated milestones: Service in; Series A Funding; Reached 100 customers; Selected as "Cool Vendor in Big Data" by Gartner; 5 trillion records; 10 trillion records]
Figures as of Oct. 2014:
• 400 thousand records imported every second
• 10+ trillion records imported in total
• 12 billion records sent by an Ad-tech company

The latest numbers in Treasure Data
• 100+ customers in Japan
• 15 trillion stored records
• 4,000: a single company sends data to us from 4,000 nodes
• 500,000 records stored per second

Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS

What is Hivemall
Scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2
Check http://github.com/myui/hivemall
[Figure: software stack, top to bottom]
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MRv1); Apache Tez (DAG processing); MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS

MapReduce and DAG engine
[Figure: map (M) and reduce (R) tasks with their HDFS reads and writes, comparing MapReduce with a DAG engine (Tez/Spark)]
DAG engine (Tez/Spark): No intermediate DFS reads/writes!

The key characteristic of Hivemall
Very easy to use; Machine Learning on SQL

Classification with Mahout: 100+ lines of code

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

This SQL query automatically runs in parallel on Hadoop

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction

List of functions in Hivemall v0.3

• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA

• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA

• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization

• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer

Treasure Data will support Hivemall v0.3.1 next week!

bit.ly/hivemall-mf

Hivemall on Apache Pig
• Contribution from Daniel Dai (Pig PMC) of Hortonworks
• To be supported from Pig 0.15

Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS

How to use Hivemall
[Figure: machine learning workflow. Training: feature vectors and labels are fed to machine learning, which produces a prediction model. Prediction: the prediction model maps a feature vector to a label. Highlighted step: Data preparation]

How to use Hivemall - Data preparation
Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
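To make the row format concrete, the following minimal Python sketch parses one text row the way the table definition describes it: tab-separated fields, with the last field holding comma-separated array items. The sample row and the "index:value" feature syntax are assumptions for illustration and do not come from the slides.

```python
def parse_row(line):
    """Split one text row into (rowid, label, features)."""
    # Fields are separated by tabs (FIELDS TERMINATED BY '\t').
    rowid, label, features = line.rstrip("\n").split("\t")
    # Array items are separated by commas and stay strings,
    # matching the ARRAY<STRING> column type.
    return int(rowid), float(label), features.split(",")


# Hypothetical sample row; the "index:value" syntax is an assumption.
row = parse_row("1\t-3.5\t101:0.25,205:0.5\n")
print(row)
```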

How to use Hivemall
[Figure: training/prediction workflow diagram. Highlighted step: Feature Engineering]

How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization: transforming a label value to a value between 0.0 and 1.0

create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;
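As a rough illustration of min-max normalization, the sketch below assumes the standard formula (value - min) / (max - min). Whether Hivemall's rescale treats edge cases such as min = max the same way is not stated in the slides, and the label values are hypothetical.

```python
def rescale(value, min_value, max_value):
    """Min-max normalization: map [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption: reject a degenerate range rather than guess a result.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


labels = [2.0, 4.0, 10.0]  # hypothetical label values
min_label, max_label = min(labels), max(labels)
scaled = [rescale(v, min_label, max_label) for v in labels]
print(scaled)
```

In the view above, ${min_label} and ${max_label} play the role of min_label and max_label here, so they have to be computed over the training labels before the view is created.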

How to use Hivemall
[Figure: training/prediction workflow diagram. Highlighted step: Training]

How to use Hivemall - Training
Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

• The inner SELECT is a map-only task that learns a prediction model
• Map outputs are shuffled to reducers by feature
• Reducers perform model averaging in parallel
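The shuffle-and-average step can be pictured with a small single-process Python sketch. It assumes each mapper emits (feature, weight) rows from its own independently trained model; the rows below are hypothetical, and the actual learning done by logress is not reproduced.

```python
from collections import defaultdict


def average_models(mapper_outputs):
    """Average per-feature weights emitted by independently trained mappers."""
    grouped = defaultdict(list)
    for feature, weight in mapper_outputs:
        # Shuffle: rows with the same feature end up in the same group.
        grouped[feature].append(weight)
    # Reduce: one averaged weight per feature, like avg(weight) ... GROUP BY feature.
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}


# Hypothetical (feature, weight) rows produced by two mappers.
rows = [("a", 0.5), ("b", -1.0), ("a", 1.5), ("c", 0.25)]
model = average_models(rows)
print(model)
```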

How to use Hivemall - Training
Training of Confidence Weighted (CW) Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature

voted_avg: vote to use negative or positive weights for avg
Example weights: +0.7, +0.3, +0.2, -0.1, +0.7
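The voting idea can be sketched in Python as follows. This is only an interpretation of the slide's description: the majority sign wins and the weights of that sign are averaged. The tie-breaking rule and the handling of zero weights are assumptions, not taken from Hivemall's implementation.

```python
def voted_avg(weights):
    """Average only the weights whose sign is in the majority."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0  # assumption: nothing to vote on
    # Assumption: a tie goes to the negative side; the slide does not say.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes, one negative.
result = voted_avg([0.7, 0.3, 0.2, -0.1, 0.7])
print(result)
```

With the slide's weights, the four positive values win the vote, so the single negative weight is ignored and the result is the mean of 0.7, 0.3, 0.2, and 0.7.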

Ensemble learning for stable prediction performance
Just stack prediction models by union all

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
) t
group by label, feature;

How to use Hivemall
[Figure: training/prediction workflow diagram. Highlighted step: Prediction]

How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model

CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid

No need to load the entire model into memory

Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS

Type/Purpose Matrix of Machine Learning

Online Prediction, Online Learning:
• Algorithm Trade (HFT)
• Twitter real-time analysis

Online Prediction, Offline Learning:
• Ad-tech (e.g., CTR/CVR prediction)
• Real-time recommendation

Offline Prediction, Online Learning:
• no/few needs?

Offline Prediction, Offline Learning:
• Daily/weekly batch systems
• Business Analytics/Reporting

How to use Hivemall
[Figure: training/prediction workflow diagram. Training is labeled "Batch Training on Hadoop" and prediction is labeled "Online Prediction on RDBMS". Highlighted step: Export prediction models]

Export Prediction Model to a RDBMS

hive> desc news20b_cw_model1;
feature    int
weight     double

Sample rows:
103    -0.4896543622016907
104    -0.0955817922949791
105    0.12560302019119263
106    0.09214721620082855

TD export to any RDBMS: periodical export is very easy in Treasure Data

Real-time Prediction on MySQL

#2 Preparing a Test data table

hive> desc testing_exploded;
feature    string
value      float

#3 Online prediction on MySQL

SELECT
  sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)

SIGMOID(x) = 1.0 / (1.0 + exp(-x))

[Figure: a feature vector is fed to the prediction model, which outputs a label]

• Alternatively, you can define a SQL view for the test target
• Index lookups are very efficient in RDBMSs

http://bit.ly/hivemall-rtp
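The same join-based prediction can be tried locally with Python's built-in sqlite3 module standing in for MySQL. This is a sketch under stated assumptions: the weights and the test example are hypothetical, both feature columns are declared as integers for simplicity, and sigmoid is registered as a user-defined function because the database has no built-in one (on MySQL the formula above would be written inline instead).

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register SIGMOID(x) = 1.0 / (1.0 + exp(-x)) as a SQL function.
conn.create_function("sigmoid", 1, lambda x: 1.0 / (1.0 + math.exp(-x)))

conn.executescript("""
    -- The primary key gives the join an index lookup per feature.
    CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL);
    CREATE TABLE testing_exploded (feature INTEGER, value REAL);
""")
# Hypothetical exported model weights.
conn.executemany("INSERT INTO prediction_model VALUES (?, ?)",
                 [(103, -0.5), (104, 0.25), (105, 1.0)])
# One hypothetical test example; feature 999 is absent from the model.
conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)",
                 [(103, 1.0), (105, 2.0), (999, 1.0)])

(prob,) = conn.execute("""
    SELECT sigmoid(sum(t.value * m.weight)) AS prob
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()
print(prob)
```

The unmatched feature 999 joins to a NULL weight, which sum() skips, so features unknown to the model simply contribute nothing to the score.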

Cost of Amazon Machine Learning
Amazon-ML is suspected to be based on Vowpal Wabbit (single process)

• Data Analysis and Model Building Fees: $0.42/Instance per Hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request

Pay-per-request is apparently not suitable for doing prediction for each web request (e.g., online CTR prediction)
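A quick back-of-the-envelope calculation shows why per-request pricing adds up. The price is the real-time figure quoted above; the traffic volume of 100 million requests per day is a hypothetical assumption, not a number from the talk.

```python
REALTIME_PRICE_PER_REQUEST = 0.0001  # USD per request, as quoted on the slide


def realtime_cost(requests):
    """Total real-time prediction fee in USD for a number of requests."""
    return requests * REALTIME_PRICE_PER_REQUEST


# Hypothetical traffic: 100 million web requests per day.
daily_requests = 100_000_000
daily_cost = realtime_cost(daily_requests)
print(f"${daily_cost:,.0f} per day")
```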

Real-time Prediction on Treasure Data
[Figure: run the batch training job periodically; periodical export of the model; real-time prediction on a RDBMS]

Beyond Query-as-a-Service!

We ❤️ Open-source! We Invented ..

We are Hiring!