Introduction to Hivemall

Copyright ©2015 Treasure Data. All Rights Reserved.

Makoto YUI (@myui), Research Engineer, Treasure Data Inc.

2015/05/14, TD tech talk #3 @Retty

http://myui.github.io/

20 min. Introduction to Hivemall


Who am I?

• 2015/04: Joined Treasure Data, Inc. as its first Research Engineer. My mission at TD is developing ML-as-a-Service (MLaaS).
• 2010/04-2015/03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale machine learning project and parallel databases.
• 2009/03: Ph.D. in Computer Science from NAIST. My research topic was building XML-native database and parallel database systems.
• Super programmer award from the MITOU Foundation (a government-funded program for finding young and talented programmers). Super creators in Treasure Data: Sada Furuhashi, Keisuke Nishida.


Figures of Imported Data in Treasure Data

[Chart: number of imported records in billions (y-axis 0 to 12,000) by month from Aug-12 to Oct-14. Annotated milestones: service in, Series A funding, reached 100 customers, selected as "Cool Vendor in Big Data" by Gartner, 5 trillion records, 10 trillion records.]

Figures on Oct. 2014:
• 400 thousand records imported every second
• 10+ trillion records imported in total
• 12 billion records sent by an ad-tech company


The latest numbers in Treasure Data

• 100+ customers in Japan
• 15 trillion stored records
• 4,000 nodes from which a single company sends data to us
• 500,000 records stored per second


Plan of the Talk

1. Brief introduction to Hivemall

2. How to use Hivemall

3. Real-time prediction w/ Hivemall and RDBMS


What is Hivemall

Scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2

Software stack, top to bottom:
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MRv1); Apache Tez (DAG processing); MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS

Check http://github.com/myui/hivemall


MapReduce and DAG engine

[Figure: graphs of map (M) and reduce (R) tasks. MapReduce: jobs are connected through HDFS. DAG engine (Tez/Spark): no intermediate DFS reads/writes!]


The key characteristic of Hivemall

Very easy to use; Machine Learning on SQL

Classification with Mahout: 100+ lines of code

CREATE TABLE lr_model AS
SELECT
  feature, -- reducers perform model averaging in parallel
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t -- map-only task
GROUP BY feature; -- shuffled to reducers

This SQL query automatically runs in parallel on Hadoop

✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction


List of functions in Hivemall v0.3

• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer

Treasure Data will support Hivemall v0.3.1 next week!

bit.ly/hivemall-mf


Hivemall on Apache Pig

• Contribution from Daniel Dai (Pig PMC) of Hortonworks
• To be supported from Pig 0.15




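How to use Hivemall

[Workflow diagram: in training, feature vectors and labels go into machine learning, which produces a prediction model; in prediction, the model turns a feature vector into a label. Highlighted step: Data preparation.]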


How to use Hivemall - Data preparation

Define a Hive table for training/testing data

CREATE EXTERNAL TABLE e2006tfidf_train (
  rowid int,
  label float,
  features ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ","
STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
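To make the two delimiters concrete, here is a minimal Python sketch of how one line of such a text file maps onto the three columns. The sample line and the `index:value` form of each feature string are assumptions for illustration, not taken from the dataset.

```python
def parse_row(line):
    """Split one tab-delimited line into (rowid, label, features)."""
    rowid, label, features = line.rstrip("\n").split("\t")
    # Collection items are separated by commas, giving the ARRAY<STRING> column.
    return int(rowid), float(label), features.split(",")

# Hypothetical sample line; the "index:value" feature strings are an assumption.
sample = "1\t-3.58\t12:0.5,47:0.25,103:1.0\n"
row = parse_row(sample)
print(row)
```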




How to use Hivemall - Feature Engineering

Applying a Min-Max Feature Normalization

Transforming a label value to a value between 0.0 and 1.0

create view e2006tfidf_train_scaled as
select
  rowid,
  rescale(label, ${min_label}, ${max_label}) as label,
  features
from e2006tfidf_train;
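Min-max rescaling maps a value v to (v - min) / (max - min). The Python sketch below mirrors what the `rescale` call is described as doing on this slide; the function name `min_max_rescale` and the sample labels are hypothetical, and this is not Hivemall's implementation.

```python
def min_max_rescale(value, min_value, max_value):
    """Map value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        raise ValueError("min and max must differ")
    return (value - min_value) / (max_value - min_value)

labels = [-4.0, -3.0, -2.0]          # hypothetical label values
lo, hi = min(labels), max(labels)    # play the role of ${min_label}, ${max_label}
scaled = [min_max_rescale(v, lo, hi) for v in labels]
print(scaled)
```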




How to use Hivemall - Training

Training by logistic regression

CREATE TABLE lr_model AS
SELECT
  feature,
  avg(weight) as weight
FROM (
  SELECT logress(features,label,..) as (feature,weight)
  FROM train
) t
GROUP BY feature

• Map-only task to learn a prediction model
• Shuffle map outputs to reducers by feature
• Reducers perform model averaging in parallel
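The averaging step can be sketched in plain Python, assuming each mapper has already emitted its own (feature, weight) pairs. The function name `average_models` and the mapper outputs are hypothetical; the sketch only illustrates what `avg(weight) ... GROUP BY feature` computes, not how Hive executes it.

```python
from collections import defaultdict

def average_models(mapper_outputs):
    """Average weights per feature over (feature, weight) pairs from all mappers."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, weight in mapper_outputs:   # grouping by feature, as the shuffle does
        sums[feature] += weight
        counts[feature] += 1
    return {f: sums[f] / counts[f] for f in sums}   # avg(weight) per feature

# Hypothetical outputs of two mappers, each trained on its own data split.
mapper_a = [("f1", 0.5), ("f2", -1.0)]
mapper_b = [("f1", 1.5), ("f3", 0.25)]
model = average_models(mapper_a + mapper_b)
print(model)
```

A feature seen by only one mapper keeps that mapper's weight, since the average runs over the weights that were actually emitted.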


How to use Hivemall - Training

Training of Confidence Weighted Classifier

CREATE TABLE news20b_cw_model1 AS
SELECT
  feature,
  voted_avg(weight) as weight
FROM (
  SELECT train_cw(features,label) as (feature,weight)
  FROM news20b_train
) t
GROUP BY feature

Vote to use negative or positive weights for avg, e.g. +0.7, +0.3, +0.2, -0.1, +0.7
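One way to read "vote to use negative or positive weights" is: count how many weights are positive and how many are negative, then average only the majority side. The Python sketch below follows that reading. Breaking ties toward the negative side and ignoring zeros are assumptions, and this is not claimed to be Hivemall's exact implementation.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    if not positives and not negatives:
        return 0.0
    # Assumption: ties go to the negative side.
    winners = positives if len(positives) > len(negatives) else negatives
    return sum(winners) / len(winners)

# The example weights from the slide: four positive votes, one negative.
result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])
print(result)  # average of the four positive weights, about 0.475
```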


Ensemble learning for stable prediction performance

Just stack prediction models by union all

create table news20mc_ensemble_model1 as
select
  label,
  cast(feature as int) as feature,
  cast(voted_avg(weight) as float) as weight
from (
  select train_multiclass_cw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select train_multiclass_arow(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
  union all
  select train_multiclass_scw(addBias(features),label) as (label,feature,weight)
  from news20mc_train_x3
) t
group by label, feature;




How to use Hivemall - Prediction

CREATE TABLE lr_predict as
SELECT
  t.rowid,
  sigmoid(sum(m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  lr_model m ON (t.feature = m.feature)
GROUP BY t.rowid

• Prediction is done by LEFT OUTER JOIN between test data and prediction model
• No need to load the entire model into memory
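As a rough illustration of the join-based prediction, the Python sketch below looks up each (rowid, feature) pair in the model, sums the matching weights per row, and applies the sigmoid. The model weights and test rows are hypothetical, and a dictionary lookup stands in for the join.

```python
import math
from collections import defaultdict

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(testing_exploded, model):
    """testing_exploded: iterable of (rowid, feature); model: {feature: weight}."""
    sums = defaultdict(float)
    for rowid, feature in testing_exploded:
        # LEFT OUTER JOIN: a feature missing from the model adds nothing to the sum.
        sums[rowid] += model.get(feature, 0.0)
    return {rowid: sigmoid(total) for rowid, total in sums.items()}

model = {"f1": 2.0, "f2": -2.0}                      # hypothetical weights
rows = [(1, "f1"), (1, "f2"), (2, "f1"), (2, "f9")]  # hypothetical test rows
probs = predict(rows, model)
print(probs)
```

One difference from the SQL: a row whose features are all missing from the model gets 0.5 here, whereas `sum` over only NULL weights yields NULL in SQL.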




Type/Purpose Matrix of Machine Learning

• Online Learning, Online Prediction: algorithm trade (HFT); Twitter real-time analysis
• Offline Learning, Online Prediction: ad-tech (e.g., CTR/CVR prediction); real-time recommendation
• Online Learning, Offline Prediction: no/few needs?
• Offline Learning, Offline Prediction: daily/weekly batch systems; business analytics/reporting


How to use Hivemall

[Workflow diagram: batch training on Hadoop, export of prediction models, then online prediction on an RDBMS.]


Export Prediction Model to an RDBMS

hive> desc news20b_cw_model1;
feature int
weight double

Sample rows (feature, weight):
103 -0.4896543622016907
104 -0.0955817922949791
105 0.12560302019119263
106 0.09214721620082855

Any RDBMS

TD export: periodical export is very easy in Treasure Data


Real-time Prediction on MySQL

#2 Preparing a test data table

hive> desc testing_exploded;
feature string
value float

You can alternatively define a SQL view for the testing target

#3 Online prediction on MySQL

SIGMOID(x) = 1.0 / (1.0 + exp(-x))

SELECT sigmoid(sum(t.value * m.weight)) as prob
FROM
  testing_exploded t LEFT OUTER JOIN
  prediction_model m ON (t.feature = m.feature)

Index lookups are very efficient in RDBMSs

http://bit.ly/hivemall-rtp
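The query can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: SQLite stands in for MySQL, `sigmoid` is registered as a user-defined function because it is not a built-in here, the model rows are the four sample weights shown on the export slide, and the test request is hypothetical.

```python
import math
import sqlite3

def sigmoid(x):
    # SUM over only-NULL weights yields NULL (None); pass it through.
    return None if x is None else 1.0 / (1.0 + math.exp(-x))

conn = sqlite3.connect(":memory:")   # SQLite stands in for MySQL here
conn.create_function("sigmoid", 1, sigmoid)
conn.execute("CREATE TABLE prediction_model (feature INTEGER PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO prediction_model VALUES (?, ?)",
    [(103, -0.4896543622016907), (104, -0.0955817922949791),
     (105, 0.12560302019119263), (106, 0.09214721620082855)],  # rows from the slide
)
conn.execute("CREATE TABLE testing_exploded (feature INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO testing_exploded VALUES (?, ?)",
    [(103, 1.0), (105, 2.0), (999, 1.0)],  # hypothetical request; 999 is not in the model
)
(prob,) = conn.execute(
    "SELECT sigmoid(sum(t.value * m.weight)) AS prob "
    "FROM testing_exploded t LEFT OUTER JOIN prediction_model m "
    "ON (t.feature = m.feature)"
).fetchone()
print(prob)
```

The primary key on `feature` gives the join an index to probe, which is the point of doing the lookup in an RDBMS.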


Cost of Amazon Machine Learning

Amazon-ML is suspected to be based on Vowpal Wabbit (single process)

• Data Analysis and Model Building Fees: $0.42/Instance per Hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request

Pay-per-request is apparently not suitable for doing prediction for each web request (e.g. online CTR prediction)
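A back-of-the-envelope calculation shows why. The Python sketch below multiplies the real-time fee from the price list by a request rate; the rate of 1,000 requests per second is a hypothetical traffic level chosen only to illustrate the arithmetic.

```python
from decimal import Decimal

REALTIME_FEE_PER_REQUEST = Decimal("0.0001")   # USD, from the price list
SECONDS_PER_DAY = 24 * 60 * 60

def daily_realtime_cost(requests_per_second):
    """Daily fee in USD if every web request triggers one real-time prediction."""
    return REALTIME_FEE_PER_REQUEST * requests_per_second * SECONDS_PER_DAY

# Hypothetical traffic level, chosen only to illustrate the arithmetic.
cost = daily_realtime_cost(1000)
print(cost)
```

At that rate the fee comes to $8,640 per day (86,400,000 requests at $0.0001 each).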


Real-time Prediction on Treasure Data

1. Run batch training job periodically

2. Periodical export

3. Real-time prediction on an RDBMS


Beyond Query-as-a-Service!

We ❤️ Open-source! We Invented ..

We are Hiring!