© Hortonworks Inc. 2011-2015. All Rights Reserved
Hadoop
Yifeng Jiang
March 10, 2015
Yifeng Jiang
Solutions Engineer @ Hortonworks Japan
HBase book author
@uprush
? Hadoop
BI
Business Intelligence
Data Science
CDR NPTB
360
ROI
Amazon: 35%
Netflix: 75%
CTR
OCR
NLP
ETL
Java, Scala
Python
NLP
SQL, Excel
Hadoop, Pig, Hive
Solr
ETL
Java, Scala, Python
NLP
Hadoop, Pig, Hive, Solr
NLP, R, MATLAB, SAS, SQL
Hadoop, Pig/Hive, MapReduce, Java, Python, Perl, SQL, C++, NoSQL (HBase, Cassandra, MongoDB)
WALL-E 700
CTR
Rank = bid * CTR
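The ranking rule above can be sketched in a few lines of Python. The ad names, bids, and predicted CTR values are made up for illustration; in practice the CTR would come from a trained model.

```python
# Hypothetical ads: (name, bid in dollars per click, predicted CTR).
ads = [
    ("ad_a", 2.00, 0.010),
    ("ad_b", 0.50, 0.050),
    ("ad_c", 1.00, 0.015),
]


def rank_score(bid, ctr):
    """Expected revenue per impression: Rank = bid * CTR."""
    return bid * ctr


# Show the ad with the highest bid * CTR first.
ranked = sorted(ads, key=lambda ad: rank_score(ad[1], ad[2]), reverse=True)
for name, bid, ctr in ranked:
    print(f"{name}: rank = {rank_score(bid, ctr):.4f}")
```

Note that the highest bidder (ad_a) does not win here: a lower bid with a much higher predicted CTR ranks first.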
Collaborative Filtering
Model
(Train)
Feature Matrix
Feature Vector
ID   Total$  Age  City  Target
101  200     25   SF
102  350     35   LA
103  25      15   LA

Feature Matrix
Feature Engineering
Raw Transforms
Signal Processing
OCR
Geo-spatial
Normalize
Transform/aggregate
Sample
Dimensionality reduction
Feature Selection
NLP
Mutual Information
TB, PB
MB, GB
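As a rough illustration of the "Normalize" and "Transform" steps, the sketch below turns the three sample rows into numeric feature vectors. Min-max scaling and one-hot encoding of City are assumptions chosen for the example, not necessarily the transforms the slide had in mind.

```python
# Sample rows from the feature matrix slide (Target column omitted).
rows = [
    {"id": 101, "total": 200, "age": 25, "city": "SF"},
    {"id": 102, "total": 350, "age": 35, "city": "LA"},
    {"id": 103, "total": 25, "age": 15, "city": "LA"},
]


def min_max(values):
    """Scale values linearly into [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


cities = sorted({r["city"] for r in rows})  # one-hot column order
totals = min_max([r["total"] for r in rows])
ages = min_max([r["age"] for r in rows])

# Each feature vector: [scaled total, scaled age, one-hot city columns...]
feature_matrix = [
    [t, a] + [1.0 if r["city"] == c else 0.0 for c in cities]
    for r, t, a in zip(rows, totals, ages)
]

for r, vector in zip(rows, feature_matrix):
    print(r["id"], vector)
```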
Shopper ID  TX ID  Apple  Banana  Honey  Milk  Bread
101         TX 1   4      5       1      1     0
102         TX 2   0      2       0      1     1
103         TX 3   0      0       0      0     2
101         TX 4   1      1       0      0     0

       Apple  Banana  Honey  Milk  Bread
Price  $2     $1      $5     $3    $4

Shopper ID  Age  City  Size of household
101         25   SF    4
102         35   LA    3
Shopper ID  # Tx  Total $  Age  City
101         10    $200     25   SF
102         15    $350     35   LA
103         2     $25      15   LA
            25    $5       15   NYC
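The step from transaction rows to one row per shopper is an aggregation. The sketch below applies it to the four sample transactions and the price list shown earlier; the resulting counts and totals are for those four rows only, so they do not match the larger figures in the aggregated table.

```python
from collections import defaultdict

# Unit prices from the price table.
prices = {"Apple": 2, "Banana": 1, "Honey": 5, "Milk": 3, "Bread": 4}
items = list(prices)  # column order: Apple, Banana, Honey, Milk, Bread

# (shopper ID, transaction ID, quantities in the column order above)
transactions = [
    (101, "TX 1", [4, 5, 1, 1, 0]),
    (102, "TX 2", [0, 2, 0, 1, 1]),
    (103, "TX 3", [0, 0, 0, 0, 2]),
    (101, "TX 4", [1, 1, 0, 0, 0]),
]

# Aggregate to one feature row per shopper: transaction count and total spend.
features = defaultdict(lambda: {"num_tx": 0, "total": 0})
for shopper_id, _tx_id, quantities in transactions:
    spend = sum(q * prices[item] for q, item in zip(quantities, items))
    features[shopper_id]["num_tx"] += 1
    features[shopper_id]["total"] += spend

for shopper_id in sorted(features):
    print(shopper_id, features[shopper_id])
```

On a cluster the same group-by would typically be written in Hive or Pig rather than in a single Python process.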
ID   Total$  Age  City
101  $200    25   SF    2
102  $350    35   LA    2
103  $25     15   LA    1
10M rows x 100 features x 8 bytes (double) = ~7.5 GB
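The size estimate can be checked with a few lines of arithmetic. The sketch assumes a dense matrix of 8-byte doubles with no per-object overhead, which is the best case for an in-memory representation.

```python
rows = 10_000_000      # 10M rows
features = 100         # columns in the feature matrix
bytes_per_value = 8    # one double-precision float

total_bytes = rows * features * bytes_per_value
gib = total_bytes / 2**30  # binary gigabytes

print(f"{total_bytes:,} bytes = {gib:.2f} GiB")
```

The result is 8 billion bytes, or roughly 7.5 when expressed in binary gigabytes, which fits in the memory of a single well-provisioned node.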
Training (70%) / test (30%) split
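A 70/30 split can be sketched as a shuffle followed by a cut. The function name, the fixed seed, and the use of plain Python lists are assumptions for illustration; libraries such as scikit-learn provide their own helpers for this.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly, then cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(10))
print(len(train), "training rows,", len(test), "test rows")
```

The model is fitted on the training part only; the held-out part is used to estimate how it performs on unseen data.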
Confusion matrix:
                Actual Yes        Actual No
Predicted Yes   True positives    False positives
Predicted No    False negatives   True negatives

Precision = % of positive predictions that are correct
Recall = % of positive instances that were predicted as positive
F1 = a measure of a test's accuracy, combining precision and recall
Accuracy = % of correct classifications
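The four measures follow directly from the four cells of the confusion matrix. The counts in this sketch are invented for illustration, and F1 is computed as the harmonic mean of precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)   # correct share of positive predictions
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```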
ALS
MySQL / HBase
Hadoop
Hadoop
YARN Data Lake 2013 YARN Hadoop
YARN Data Lake Data Lake Hadoop
?
6 9
Schema change
HDFS
?
3
Schema on read
Hadoop
OCR
NLP
Hadoop
Hadoop
Feature Engineering
Raw Transforms
Signal Processing
OCR
Geo-spatial
Normalize
Transform/aggregate
Sample
Dimensionality reduction
Feature Selection
NLP
Mutual Information
Frequent Itemset
Anomaly Detection
Clustering
Collaborative Filter
Regression
Classification
Supervised Learning
Unsupervised Learning
Hadoop
R, Python Scikit-learn or SAS
Mahout
Spark MLlib
Hadoop
R, Python scikit-learn, or SAS
Mahout, Spark MLlib
Hadoop grid search
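Grid search evaluates a model once for every combination of hyperparameter values and keeps the best one. The sketch below shows the idea on a single machine; `toy_score` and its parameter names are hypothetical stand-ins for "train a model with these settings and return its validation score".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(depth, learning_rate):
    """Hypothetical score that peaks at depth=5, learning_rate=0.1."""
    return -((depth - 5) ** 2) - (learning_rate - 0.1) ** 2


best_params, best_score = grid_search(
    {"depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 1.0]},
    toy_score,
)
print(best_params, best_score)
```

Each combination is evaluated independently of the others, so the loop body is a natural unit of work to distribute across cluster nodes.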
Hadoop
20M
PMML (e.g., Zementis, Pattern)
Python, R, Java
Hadoop
Distributed K-means: Spark MLlib and Mahout
Collaborative filtering with Alternating Least Squares (ALS): Mahout, Spark MLlib
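To make the K-means step concrete, here is a minimal single-machine sketch of the standard algorithm (Lloyd's iterations). It is not the distributed implementation in Spark or Mahout; the starting centroids and the fixed iteration count are assumptions, and the points reuse the (Total $, Age) pairs from the shopper tables.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# (Total $, Age) pairs; features are left unscaled, so Total $ dominates distance.
points = [(200, 25), (350, 35), (25, 15), (5, 15)]
centroids, clusters = kmeans(points, centroids=[(25, 15), (350, 35)])
print(centroids)
print(clusters)
```

In a distributed version the assignment step runs in parallel over partitions of the data, and only the per-cluster sums and counts are combined to update the centroids.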
Hadoop and R
R: RStudio, RCloud
Hadoop: RMR (MapReduce from R), RHDFS, RHive, RHBase, RODBC
YARN
R (high-memory node)
Hadoop and Python
Python UI: IPython
Hadoop: Pydoop (HDFS access and Hadoop MapReduce from Python), Pig Python UDFs
IPython, Pandas, scikit-learn, NumPy, SciPy, Matplotlib, Pydoop
YARN
Python (high-memory node)
Hadoop and Spark
Edge node: Spark (MLlib) with Scala, Java, and Python APIs
Spark on YARN
Spark MLlib on the edge node
YARN
Hadoop
Hadoop
Hadoop YARN
Hadoop
Thank You! Yifeng Jiang Solutions Engineer