MLDM MONDAY
Chih-Ming

About Me
CM (志明)
Ph.D. Student in TIGP-SNHCC
Research Assistant at AS CITI
Research Intern at KKBOX
Advisor: Prof. Ming-Feng Tsai (蔡銘峰)
Advisor: Dr. Eric Yang (楊弈軒)
• CLIP Lab
• MAC Lab
Research, Machine Learning team
https://about.me/chewme
http://kaggletw.azurewebsites.net/
Taiwan Kaggle discussion group (台灣 Kaggle 交流區): https://www.facebook.com/groups/kaggletw/
First Things First
• The type of prediction task - classification? regression? top-N recommendations?
• Evaluation Metric - AUC, MAE, RMSE, Log-loss, MAP@N, …
• Why Compete? - For fun - For learning - For networking
The Prediction Task
Binary Classification
Multi-label Classification
Regression
Recommendations

Evaluation Metric
https://www.kaggle.com/wiki/Metrics
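Most of the metrics listed above are available out of the box in scikit-learn. A minimal sketch with made-up values (not from any competition):

```python
# Sketch: computing common competition metrics with scikit-learn.
from sklearn.metrics import roc_auc_score, log_loss, mean_absolute_error

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities of the positive class

auc = roc_auc_score(y_true, y_prob)    # AUC for binary classification
ll = log_loss(y_true, y_prob)          # penalizes confident wrong predictions
mae = mean_absolute_error([3.0, 2.5], [2.5, 3.0])  # for regression tasks
print(auc, ll, mae)
```

Always implement the exact metric of the competition locally, so your validation score is comparable with the leaderboard.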
Why Compete?
• For Fun: competing with others, like running a race
• For Learning: Improving your abilities
• What's Your Motivation?
Other Considerations …
• Data Size - 10MB? 10GB? >100GB? - no $$ to pay AWS
• Need GPU Power? - no $$ to pay AWS
• Good Prize? - $$$$$$$$$$$$
Check the Provided Data
• The Distribution of Train/Test Data - random splitting - split by time - split by Ids
• Available Features - categorical, numerical - text - image, audio - time - sparse, dense
Cross Validation (1): split the training data into folds; each round uses one fold for validation and trains on the rest.
Round 1: VALID | TRAIN | TRAIN
Round 2: TRAIN | VALID | TRAIN
Round 3: TRAIN | TRAIN | VALID
Cross Validation (2): the same rotation, with a held-out Test set kept fixed across all rounds.
Round 1: VALID | TRAIN | TRAIN | TEST
Round 2: TRAIN | VALID | TRAIN | TEST
Round 3: TRAIN | TRAIN | VALID | TEST
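The rotation above is exactly what scikit-learn's `KFold` produces; a small sketch with a toy feature matrix:

```python
# Sketch: 3-fold cross validation with scikit-learn's KFold.
# Each round holds out one fold for validation, trains on the rest.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)        # toy feature matrix, 6 examples
kf = KFold(n_splits=3, shuffle=False)

folds = list(kf.split(X))
for round_no, (train_idx, valid_idx) in enumerate(folds, 1):
    print(f"Round {round_no}: train={train_idx}, valid={valid_idx}")
```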
Hold a Proper Validation
• Random Splitting
• Split by Time
• Split by Id

Example time split: train on 5/13–5/20 (7 days), validate on 5/20–5/27 (7 days), with the Test period following.
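A sketch of the time-based split with pandas, using the 7-day windows from the slide (the year and target column are illustrative):

```python
# Sketch: time-based train/validation split matching the 7-day windows.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2017-05-13", "2017-05-26", freq="D"),
    "y": range(14),
})
train = df[df["date"] < "2017-05-20"]    # first 7 days
valid = df[df["date"] >= "2017-05-20"]   # next 7 days
print(len(train), len(valid))
```

If the test set is split by time, a random validation split will overestimate your score; mirror the organizer's split.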
Data Cleaning / Preprocessing
• Missing Values - drop the missing data - replace them with statistical values (mean / median / mode / clustering / model-based methods) - label them as a distinct "missing" value
• Outlier Detection - https://en.wikipedia.org/wiki/Outlier
• Redundant Features - usually remove them
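The three missing-value strategies can be sketched with pandas (toy column; names are illustrative):

```python
# Sketch: three ways to handle missing values with pandas.
import pandas as pd

df = pd.DataFrame({"age": [20.0, None, 30.0, None, 40.0]})

dropped = df.dropna()                               # 1) drop the missing rows
filled_mean = df["age"].fillna(df["age"].mean())    # 2) replace by a statistic
df["age_missing"] = df["age"].isna().astype(int)    # 3) label missingness itself
print(dropped.shape, filled_mean.tolist(), df["age_missing"].tolist())
```

The missingness indicator is often a useful feature in its own right, since data is rarely missing at random.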
Categorical Features
• One-hot Encoding
• Clustering Group
One-hot encoding (by Id):
Mayday          1 0 0 0
Sodagreen       0 1 0 0
SEKAI_NO_OWARI  0 0 1 0
The_Beatles     0 0 0 1

Clustering group (by Language):
Mayday          1 0 0
Sodagreen       1 0 0
SEKAI_NO_OWARI  0 1 0
The_Beatles     0 0 1
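Both encodings above are one call in pandas; a sketch with the slide's artists (the concrete language labels are my assumption, matching the slide's grouping):

```python
# Sketch: one-hot encoding vs. a coarser clustering-group encoding.
import pandas as pd

artists = pd.DataFrame({
    "id": ["Mayday", "Sodagreen", "SEKAI_NO_OWARI", "The_Beatles"],
    "language": ["Mandarin", "Mandarin", "Japanese", "English"],  # assumed labels
})

one_hot = pd.get_dummies(artists["id"])          # one column per artist
grouped = pd.get_dummies(artists["language"])    # one column per group/cluster
print(one_hot.shape, grouped.shape)
```

Grouping trades granularity for fewer columns, which helps when the raw category has many rare values.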
Categorical Features
• Col-hot Encoding
• Count-hot Encoding
• Likelihood Encoding
• …
             T1     T2     T3
count        23     1      6
binary       1      0      1
probability  23/30  1/30   6/30
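A sketch of the count / binary / probability encodings, reproducing the table's numbers (23 + 1 + 6 = 30 rows; the binary threshold is illustrative):

```python
# Sketch: count, binary, and probability (frequency) encodings
# for one categorical column.
from collections import Counter

column = ["T1"] * 23 + ["T2"] * 1 + ["T3"] * 6

counts = Counter(column)                                     # count encoding
total = sum(counts.values())
binary = {k: 1 if v > 5 else 0 for k, v in counts.items()}   # illustrative cutoff
probability = {k: v / total for k, v in counts.items()}      # frequency encoding
print(counts, binary, probability)
```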
Categorical Features (2)
• Latent Representations - Principal Component Analysis (PCA) - Linear Discriminant Analysis (LDA) - Laplacian Eigenmaps (LE) - Locally linear embedding (LLE) - Low-Rank Approximation / Latent Factorization - Latent Topic Model
- reduce the computation cost
- alleviate the overfitting issue
- find the meaningful components / remove the noise
https://en.wikipedia.org/wiki/Dimensionality_reduction
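PCA, the first method in the list, can be sketched in a few lines (random toy data, not from the slides):

```python
# Sketch: PCA as a latent representation / dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # toy high-dimensional feature matrix

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)      # project onto 2 latent components
print(X_low.shape)
```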
Categorical vs. Numerical
• Ordinal Categories
HATE → 0, DON'T MIND → 1, LIKE → 2, LOVE → 3
(or apply a non-linear scale such as exp(value) to widen the gaps between higher levels)
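The ordinal mapping and the exp() stretch above can be sketched directly:

```python
# Sketch: encoding ordinal categories as integers, with an optional
# exp() transform that stretches the gaps between higher levels.
import math

order = {"HATE": 0, "DON'T MIND": 1, "LIKE": 2, "LOVE": 3}
ratings = ["LOVE", "HATE", "LIKE"]

encoded = [order[r] for r in ratings]
stretched = [math.exp(v) for v in encoded]   # non-linear spacing
print(encoded, stretched)
```

Unlike one-hot encoding, this keeps the order information, which tree and linear models can exploit.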
Numerical Features
• Standardization / Normalization
• Rescaling
• Transform the Distribution - logarithmic transformation - tf-idf like transformation
• Binning / Sampling
https://en.wikipedia.org/wiki/Feature_scaling
required by many ML algorithms
https://en.wikipedia.org/wiki/Data_transformation_(statistics)
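The scaling and transformation steps above in a few lines of numpy (toy skewed values):

```python
# Sketch: standardization, log transform, and min-max rescaling.
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

standardized = (x - x.mean()) / x.std()          # zero mean, unit variance
log_scaled = np.log1p(x)                         # compress a heavy-tailed distribution
rescaled = (x - x.min()) / (x.max() - x.min())   # rescale to [0, 1]
print(standardized.mean(), log_scaled, rescaled)
```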
Other Features
• Text-based - Natural Language Processing
• Image-based, Audio-based - Image/Signal Processing
• Time-based - Time Series
Domain Knowledge is Important
Example (1)
• Text-based - Vector Space Model - Word Embeddings
https://en.wikipedia.org/wiki/Vector_space_model
(word-embedding analogy: KING - MAN + WOMAN ≈ QUEEN)
need stemming? lemmatization?
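A minimal sketch of the vector space model with tf-idf weighting in scikit-learn (a tiny English stand-in corpus; the next example uses Chinese reviews):

```python
# Sketch: bag-of-words / tf-idf vector space model.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "good service clean environment",
    "warm smile good service",
    "ordered the business lunch today",
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)    # documents -> sparse tf-idf vectors
print(X.shape, sorted(vec.vocabulary_))
```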
Example (2)
Text: 服務好、環境整潔…服務人員笑容溫暖…今天點了商業午餐…
("good service, clean environment… the staff's smiles are warm… ordered the business lunch today…")
↓ segmentation
[服務] [好] [環境] [整潔] [服務] [人員] [笑容] [溫暖] [今天] [點了] [商業午餐]
↓ filtering → dummy variables (term counts)
服務:1 好:1 環境:1 整潔:2 服務:1 笑容:1 溫暖:2 商業午餐:1
↓ advanced weighting? word embeddings?
服務:2 好:1 環境:1 整潔:4 服務:2 笑容:1 溫暖:1 商業午餐:0.8
Example (3)
• Image-based - SIFT - Convolutional NN
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
https://en.wikipedia.org/wiki/Convolutional_neural_network
Realize the Meaning Behind the Observed Features
• 2017/05/20 08:00 → holiday or weekday? day or night?
• Taipei → region: Asia; language: Mandarin
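Deriving such features from a raw timestamp is a few lines of standard-library Python (the day/night cutoff hours are illustrative):

```python
# Sketch: extracting calendar features from the slide's example timestamp.
from datetime import datetime

ts = datetime(2017, 5, 20, 8, 0)        # 2017/05/20 08:00 (a Saturday)

features = {
    "is_weekend": ts.weekday() >= 5,    # Saturday/Sunday
    "is_day": 6 <= ts.hour < 18,        # assumed day-vs-night cutoff
    "hour": ts.hour,
}
print(features)
```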
ML Libraries
• scikit-learn
• xgboost, lightgbm, …
• vowpal wabbit
• libsvm, liblinear, libfm, libffm, …
• tensorflow, keras, h2o, caffe, mxnet, …
• …
Understand the Pros and Cons
• Linear Model - simple, fast, and easy to tune - occupies little memory - cannot capture complex non-linear patterns
• Nearest Neighbours - depends on the prediction task and the data distributions
Understand the Pros and Cons (2)
• Random Forest - work very well in many competitions - fast and easy to tune - memory hungry
• SVM - strong theoretical guarantees - robust against overfitting - slow and memory-heavy - usually needs a grid search over hyperparameters
Understand the Pros and Cons (3)
• There are too many details …
• Find some online courses or ML books:
  • The Elements of Statistical Learning
  • Machine Learning: A Probabilistic Perspective
  • Programming Collective Intelligence
  • Pattern Recognition and Machine Learning (Information Science and Statistics series)
• …
Exploratory Data Analysis (EDA)
• Statistics Helps - min, max, variance, mode, …
• Data Visualization Helps
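The basic statistics above are available in Python's standard library; a sketch on toy values:

```python
# Sketch: quick summary statistics for exploratory data analysis.
import statistics

values = [1, 2, 2, 3, 4, 10]

summary = {
    "min": min(values),
    "max": max(values),
    "variance": statistics.variance(values),
    "mode": statistics.mode(values),
}
print(summary)
```

In practice `pandas.DataFrame.describe()` gives the same overview for every column at once.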
Model Ensembling
• Voting
• Averaging
• Bagging
• Boosting
• Blending
• Stacking

(diagram: the outputs of models 1–4 combined by averaging)
(diagram: models 1–4 feeding a single ensemble model)
(diagram: the predictions of models 1–4 used as new features for a next-level model, then averaged)
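Two of the schemes above, averaging and the feature-construction step of stacking, sketched with numpy (the "model" predictions are made-up numbers):

```python
# Sketch: averaging base-model predictions, and turning them into
# new features for a second-level (stacking) model.
import numpy as np

# predictions of four base models on five examples
preds = np.array([
    [0.20, 0.80, 0.60, 0.40, 0.90],
    [0.30, 0.70, 0.50, 0.50, 0.80],
    [0.10, 0.90, 0.70, 0.30, 0.95],
    [0.25, 0.75, 0.55, 0.45, 0.85],
])

averaged = preds.mean(axis=0)    # simple averaging ensemble
stacked_features = preds.T       # stacking: one row per example, one
                                 # column per base model's prediction
print(averaged, stacked_features.shape)
```

In a real stacking setup the base-model predictions must be produced out-of-fold, otherwise the second-level model overfits.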
Other Tricks
• Data Leakage
• Magic/Lucky Parameters
Overall - To Get into the Top
• Correct Validation
• Good Feature Extraction
• Diverse Models
• Proper Ensembling
https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom?qid=8f0c66f9-43ba-4646-8a05-d03bf30b2eeb&v=&b=&from_search=9
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
Learning from Others/Winners
• Discussion Forum
• Kernels
• Winner Solutions
ANY QUESTIONS?
changecandy at gmail