MLDM MONDAY
Chih-Ming

About Me
CM (志明)
Ph.D. Student in TIGP-SNHCC
Research Assistant at AS CITI
Research Intern at KKBOX
Advisor: Prof. Ming-Feng Tsai (蔡銘峰)
Advisor: Dr. Eric Yang (楊弈軒)
• CLIP Lab
• MAC Lab
Research, Machine Learning team
https://about.me/chewme
http://kaggletw.azurewebsites.net/
Taiwan Kaggle discussion group (台灣 Kaggle 交流區): https://www.facebook.com/groups/kaggletw/
First Things First
• The type of prediction task - classification? regression? top-N recommendations?
• Evaluation Metric - AUC, MAE, RMSE, Log-loss, MAP@N, …
• Why Compete? - For fun - For learning - For networking
The Prediction Task
Binary Classification
Multi-label Classification
Regression
Recommendations

Evaluation Metric
https://www.kaggle.com/wiki/Metrics
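Most of the metrics listed above are available out of the box in scikit-learn. A minimal sketch with made-up values (not from any competition):

```python
# Sketch: computing common competition metrics with scikit-learn.
from sklearn.metrics import roc_auc_score, log_loss, mean_absolute_error

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities of the positive class

auc = roc_auc_score(y_true, y_prob)    # AUC for binary classification
ll = log_loss(y_true, y_prob)          # penalizes confident wrong predictions
mae = mean_absolute_error([3.0, 2.5], [2.5, 3.0])  # for regression tasks
print(auc, ll, mae)
```

Always implement the exact metric of the competition locally, so your validation score is comparable with the leaderboard.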
Why Compete?
• For Fun: competing with others, like running a race
• For Learning: Improving your abilities
• What's Your Motivation?
Other Considerations …
• Data Size - 10MB? 10GB? >100GB? - no $$ to pay AWS
• Need GPU Power? - no $$ to pay AWS
• Good Prize? - $$$$$$$$$$$$
Check the Provided Data
• The Distribution of Train/Test Data - random splitting - split by time - split by Ids
• Available Features - categorical, numerical - text - image, audio - time - sparse, dense
Cross Validation (1): split the training data into folds; each round uses one fold for validation and trains on the rest.
Round 1: VALID | TRAIN | TRAIN
Round 2: TRAIN | VALID | TRAIN
Round 3: TRAIN | TRAIN | VALID
Cross Validation (2): the same rotation, with a held-out Test set kept fixed across all rounds.
Round 1: VALID | TRAIN | TRAIN | TEST
Round 2: TRAIN | VALID | TRAIN | TEST
Round 3: TRAIN | TRAIN | VALID | TEST
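The rotation above is exactly what scikit-learn's `KFold` produces; a small sketch with a toy feature matrix:

```python
# Sketch: 3-fold cross validation with scikit-learn's KFold.
# Each round holds out one fold for validation, trains on the rest.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)        # toy feature matrix, 6 examples
kf = KFold(n_splits=3, shuffle=False)

folds = list(kf.split(X))
for round_no, (train_idx, valid_idx) in enumerate(folds, 1):
    print(f"Round {round_no}: train={train_idx}, valid={valid_idx}")
```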
Hold a Proper Validation
• Random Splitting
• Split by Time
• Split by Id

Example time split: train on 5/13–5/20 (7 days), validate on 5/20–5/27 (7 days), with the Test period following.
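A sketch of the time-based split with pandas, using the 7-day windows from the slide (the year and target column are illustrative):

```python
# Sketch: time-based train/validation split matching the 7-day windows.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2017-05-13", "2017-05-26", freq="D"),
    "y": range(14),
})
train = df[df["date"] < "2017-05-20"]    # first 7 days
valid = df[df["date"] >= "2017-05-20"]   # next 7 days
print(len(train), len(valid))
```

If the test set is split by time, a random validation split will overestimate your score; mirror the organizer's split.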
Data Cleaning / Preprocessing
• Missing Values - drop the missing data - replace them with statistical values (mean / median / mode / clustering / model-based methods) - label them as a distinct "missing" value
• Outlier Detection - https://en.wikipedia.org/wiki/Outlier
• Redundant Features - usually remove them
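The three missing-value strategies can be sketched with pandas (toy column; names are illustrative):

```python
# Sketch: three ways to handle missing values with pandas.
import pandas as pd

df = pd.DataFrame({"age": [20.0, None, 30.0, None, 40.0]})

dropped = df.dropna()                               # 1) drop the missing rows
filled_mean = df["age"].fillna(df["age"].mean())    # 2) replace by a statistic
df["age_missing"] = df["age"].isna().astype(int)    # 3) label missingness itself
print(dropped.shape, filled_mean.tolist(), df["age_missing"].tolist())
```

The missingness indicator is often a useful feature in its own right, since data is rarely missing at random.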
Categorical Features
• One-hot Encoding
• Clustering Group
One-hot encoding (by Id):
Mayday          1 0 0 0
Sodagreen       0 1 0 0
SEKAI_NO_OWARI  0 0 1 0
The_Beatles     0 0 0 1

Clustering group (by Language):
Mayday          1 0 0
Sodagreen       1 0 0
SEKAI_NO_OWARI  0 1 0
The_Beatles     0 0 1
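Both encodings above are one call in pandas; a sketch with the slide's artists (the concrete language labels are my assumption, matching the slide's grouping):

```python
# Sketch: one-hot encoding vs. a coarser clustering-group encoding.
import pandas as pd

artists = pd.DataFrame({
    "id": ["Mayday", "Sodagreen", "SEKAI_NO_OWARI", "The_Beatles"],
    "language": ["Mandarin", "Mandarin", "Japanese", "English"],  # assumed labels
})

one_hot = pd.get_dummies(artists["id"])          # one column per artist
grouped = pd.get_dummies(artists["language"])    # one column per group/cluster
print(one_hot.shape, grouped.shape)
```

Grouping trades granularity for fewer columns, which helps when the raw category has many rare values.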
Categorical Features
• Col-hot Encoding
• Count-hot Encoding
• Likelihood Encoding
• …
             T1     T2     T3
count        23     1      6
binary       1      0      1
probability  23/30  1/30   6/30
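A sketch of the count / binary / probability encodings, reproducing the table's numbers (23 + 1 + 6 = 30 rows; the binary threshold is illustrative):

```python
# Sketch: count, binary, and probability (frequency) encodings
# for one categorical column.
from collections import Counter

column = ["T1"] * 23 + ["T2"] * 1 + ["T3"] * 6

counts = Counter(column)                                     # count encoding
total = sum(counts.values())
binary = {k: 1 if v > 5 else 0 for k, v in counts.items()}   # illustrative cutoff
probability = {k: v / total for k, v in counts.items()}      # frequency encoding
print(counts, binary, probability)
```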
Categorical Features (2)
• Latent Representations - Principal Component Analysis (PCA) - Linear Discriminant Analysis (LDA) - Laplacian Eigenmaps (LE) - Locally linear embedding (LLE) - Low-Rank Approximation / Latent Factorization - Latent Topic Model
- reduce the computation cost
- alleviate the overfitting issue
- find the meaningful components / remove the noise
https://en.wikipedia.org/wiki/Dimensionality_reduction
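PCA, the first method in the list, can be sketched in a few lines (random toy data, not from the slides):

```python
# Sketch: PCA as a latent representation / dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # toy high-dimensional feature matrix

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)      # project onto 2 latent components
print(X_low.shape)
```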
Categorical vs. Numerical
• Ordinal Categories
HATE → 0, DON'T MIND → 1, LIKE → 2, LOVE → 3
(or apply a non-linear scale such as exp(value) to widen the gaps between higher levels)
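The ordinal mapping and the exp() stretch above can be sketched directly:

```python
# Sketch: encoding ordinal categories as integers, with an optional
# exp() transform that stretches the gaps between higher levels.
import math

order = {"HATE": 0, "DON'T MIND": 1, "LIKE": 2, "LOVE": 3}
ratings = ["LOVE", "HATE", "LIKE"]

encoded = [order[r] for r in ratings]
stretched = [math.exp(v) for v in encoded]   # non-linear spacing
print(encoded, stretched)
```

Unlike one-hot encoding, this keeps the order information, which tree and linear models can exploit.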
Numerical Features
• Standardization / Normalization
• Rescaling
• Transform the Distribution - logarithmic transformation - tf-idf like transformation
• Binning / Sampling
https://en.wikipedia.org/wiki/Feature_scaling
required by many ML algorithms
https://en.wikipedia.org/wiki/Data_transformation_(statistics)
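The scaling and transformation steps above in a few lines of numpy (toy skewed values):

```python
# Sketch: standardization, log transform, and min-max rescaling.
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

standardized = (x - x.mean()) / x.std()          # zero mean, unit variance
log_scaled = np.log1p(x)                         # compress a heavy-tailed distribution
rescaled = (x - x.min()) / (x.max() - x.min())   # rescale to [0, 1]
print(standardized.mean(), log_scaled, rescaled)
```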
Other Features
• Text-based - Natural Language Processing
• Image-based, Audio-based - Image/Signal Processing
• Time-based - Time Series
Domain Knowledge is Important
Example (1)
• Text-based - Vector Space Model - Word Embeddings
https://en.wikipedia.org/wiki/Vector_space_model
(word-embedding analogy: KING - MAN + WOMAN ≈ QUEEN)
need stemming? lemmatization?
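A minimal sketch of the vector space model with tf-idf weighting in scikit-learn (a tiny English stand-in corpus; the next example uses Chinese reviews):

```python
# Sketch: bag-of-words / tf-idf vector space model.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "good service clean environment",
    "warm smile good service",
    "ordered the business lunch today",
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)    # documents -> sparse tf-idf vectors
print(X.shape, sorted(vec.vocabulary_))
```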
Example (2)
Text: 服務好、環境整潔…服務人員笑容溫暖…今天點了商業午餐…
("good service, clean environment… the staff's smiles are warm… ordered the business lunch today…")
↓ segmentation
[服務] [好] [環境] [整潔] [服務] [人員] [笑容] [溫暖] [今天] [點了] [商業午餐]
↓ filtering → dummy variables (term counts)
服務:1 好:1 環境:1 整潔:2 服務:1 笑容:1 溫暖:2 商業午餐:1
↓ advanced weighting? word embeddings?
服務:2 好:1 環境:1 整潔:4 服務:2 笑容:1 溫暖:1 商業午餐:0.8
Example (3)
• Image-based - SIFT - Convolutional NN
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
https://en.wikipedia.org/wiki/Convolutional_neural_network
Realize the Meaning Behind the Observed Features
• 2017/05/20 08:00 → holiday or weekday? day or night?
• Taipei → region: Asia; language: Mandarin
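Deriving such features from a raw timestamp is a few lines of standard-library Python (the day/night cutoff hours are illustrative):

```python
# Sketch: extracting calendar features from the slide's example timestamp.
from datetime import datetime

ts = datetime(2017, 5, 20, 8, 0)        # 2017/05/20 08:00 (a Saturday)

features = {
    "is_weekend": ts.weekday() >= 5,    # Saturday/Sunday
    "is_day": 6 <= ts.hour < 18,        # assumed day-vs-night cutoff
    "hour": ts.hour,
}
print(features)
```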
ML Libraries
• scikit-learn
• xgboost, lightgbm, …
• vowpal wabbit
• libsvm, liblinear, libfm, libffm, …
• tensorflow, keras, h2o, caffe, mxnet, …
• …
Understand the Pros and Cons
• Linear Model - simple, fast, and easy to tune - occupies little memory - cannot capture complex non-linear patterns
• Nearest Neighbours - depends on the prediction task and the data distributions
Understand the Pros and Cons (2)
• Random Forest - work very well in many competitions - fast and easy to tune - memory hungry
• SVM - strong theoretical guarantees - robust against overfitting - slow and memory-heavy - usually needs a grid search over hyperparameters
Understand the Pros and Cons (3)
• There are too many details …
• Find some online courses or ML books:
  • The Elements of Statistical Learning
  • Machine Learning: A Probabilistic Perspective
  • Programming Collective Intelligence
  • Pattern Recognition and Machine Learning (Information Science and Statistics series)
• …
Exploratory Data Analysis (EDA)
• Statistics Helps - min, max, variance, mode, …
• Data Visualization Helps
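The basic statistics above are available in Python's standard library; a sketch on toy values:

```python
# Sketch: quick summary statistics for exploratory data analysis.
import statistics

values = [1, 2, 2, 3, 4, 10]

summary = {
    "min": min(values),
    "max": max(values),
    "variance": statistics.variance(values),
    "mode": statistics.mode(values),
}
print(summary)
```

In practice `pandas.DataFrame.describe()` gives the same overview for every column at once.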
Model Ensembling
• Voting
• Averaging
• Bagging
• Boosting
• Blending
• Stacking

(diagram: the outputs of models 1–4 combined by averaging)
(diagram: models 1–4 feeding a single ensemble model)
(diagram: the predictions of models 1–4 used as new features for a next-level model, then averaged)
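Two of the schemes above, averaging and the feature-construction step of stacking, sketched with numpy (the "model" predictions are made-up numbers):

```python
# Sketch: averaging base-model predictions, and turning them into
# new features for a second-level (stacking) model.
import numpy as np

# predictions of four base models on five examples
preds = np.array([
    [0.20, 0.80, 0.60, 0.40, 0.90],
    [0.30, 0.70, 0.50, 0.50, 0.80],
    [0.10, 0.90, 0.70, 0.30, 0.95],
    [0.25, 0.75, 0.55, 0.45, 0.85],
])

averaged = preds.mean(axis=0)    # simple averaging ensemble
stacked_features = preds.T       # stacking: one row per example, one
                                 # column per base model's prediction
print(averaged, stacked_features.shape)
```

In a real stacking setup the base-model predictions must be produced out-of-fold, otherwise the second-level model overfits.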
Other Tricks
• Data Leakage
• Magic/Lucky Parameters
Overall - To Get into the Top
• Correct Validation
• Good Feature Extraction
• Diverse Models
• Proper Ensembling
https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom?qid=8f0c66f9-43ba-4646-8a05-d03bf30b2eeb&v=&b=&from_search=9
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
Learning from Others/Winners
• Discussion Forum
• Kernels
• Winner Solutions
ANY QUESTIONS?
changecandy at gmail