MLDM MONDAY Chih-Ming

MLDM CM Kaggle Tips

Page 1: MLDM CM Kaggle Tips

MLDM MONDAY Chih-Ming

Page 2: MLDM CM Kaggle Tips

About Me

CM 志明 (Chih-Ming)

Ph.D. Student in TIGP-SNHCC

Research Assistant at AS CITI

Research Intern at KKBOX

Advisor: Prof. Ming-Feng Tsai (蔡銘峰)

Advisor: Dr. Eric Yang (楊弈軒)

• CLIP Lab

• MAC Lab

Research, Machine Learning team

https://about.me/chewme

Page 3: MLDM CM Kaggle Tips

http://kaggletw.azurewebsites.net/

Page 4: MLDM CM Kaggle Tips

Taiwan Kaggle Discussion Group (台灣 Kaggle 交流區): https://www.facebook.com/groups/kaggletw/

Page 9: MLDM CM Kaggle Tips

First Things First

• The type of prediction task
  - classification? regression? top-N recommendations?

• Evaluation Metric
  - AUC, MAE, RMSE, Log-loss, MAP@N, …

• Why Compete?
  - For fun
  - For learning
  - For networking

Page 10: MLDM CM Kaggle Tips

The Prediction Task

Binary Classification

Multi-label Classification

Regression

Recommendations

Page 11: MLDM CM Kaggle Tips

Evaluation Metric

https://www.kaggle.com/wiki/Metrics
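As a rough illustration of three of the metrics named earlier (MAE, RMSE, Log-loss), here is a minimal standard-library sketch. The function names and the toy numbers are assumptions made for this example; they are not taken from the slides or from any competition.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Hypothetical toy data, not from the slides.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print("MAE:", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("Log-loss:", log_loss([1, 0, 1], [0.9, 0.2, 0.6]))
```

RMSE penalizes the single large error more heavily than MAE does, which is one reason the choice of metric changes which model looks best.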

Page 12: MLDM CM Kaggle Tips

Why Compete?

• For Fun: Competing with others, as in running or racing

• For Learning: Improving your abilities

• What's Your Motivation?

Page 13: MLDM CM Kaggle Tips

Other Considerations …

• Data Size
  - 10MB? 10GB? >100GB?
  - no $$ to pay AWS

• Need GPU Power?
  - no $$ to pay AWS

• Good Prize?
  - $$$$$$$$$$$$

Page 14: MLDM CM Kaggle Tips

Check the Provided Data

• The Distribution of Train/Test Data
  - random splitting
  - split by time
  - split by Ids

• Available Features
  - categorical, numerical
  - text
  - image, audio
  - time
  - sparse, dense

Page 15: MLDM CM Kaggle Tips

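Cross Validation (1)

[Figure: three rounds of cross-validation (Round 1, Round 2, Round 3). In each round the data is split into three parts: two are used for Train and the remaining one, different in each round, for Validation.]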
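The three rounds can be sketched as a plain k-fold split. This is a minimal standard-library version; the helper name `k_fold_indices` and the sample count are assumptions for illustration only.

```python
import random

def k_fold_indices(n_samples, k=3, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx

rounds = list(k_fold_indices(9, k=3))
for r, (train_idx, valid_idx) in enumerate(rounds, start=1):
    print(f"Round {r}: train={sorted(train_idx)} valid={sorted(valid_idx)}")
```

Averaging the validation score over the rounds gives a steadier estimate than a single split, at the cost of training k models.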

Page 16: MLDM CM Kaggle Tips

Cross Validation (2)

[Figure: Rounds 1 to 3, each with its own Train and Validation parts, plus a separate Test set.]

Page 17: MLDM CM Kaggle Tips

Hold A Proper Validation

• Random Splitting

• Split by Time

• Split by Id

[Figure: Train / Validation / Test laid out on a timeline marked 5/13, 5/20 and 5/27, in two 7-day windows.]
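One plausible reading of the timeline is that validation should mimic the test period: if the test set is a later 7-day window, hold out the last 7 days of the training log. The sketch below assumes that reading. The row layout, the field names `day` and `user_id`, and the chosen dates are hypothetical.

```python
from datetime import date, timedelta

def split_by_time(rows, valid_start, valid_days=7):
    """Train on rows before valid_start; validate on the next valid_days days."""
    valid_end = valid_start + timedelta(days=valid_days)
    train = [r for r in rows if r["day"] < valid_start]
    valid = [r for r in rows if valid_start <= r["day"] < valid_end]
    return train, valid

def split_by_id(rows, valid_ids):
    """Keep all rows of an id on the same side of the split."""
    train = [r for r in rows if r["user_id"] not in valid_ids]
    valid = [r for r in rows if r["user_id"] in valid_ids]
    return train, valid

# Hypothetical log: one row per day from 5/01 to 5/19, three user ids.
rows = [{"day": date(2017, 5, 1) + timedelta(days=i), "user_id": i % 3}
        for i in range(19)]

train_t, valid_t = split_by_time(rows, valid_start=date(2017, 5, 13))
train_i, valid_i = split_by_id(rows, valid_ids={2})
print(len(train_t), len(valid_t), len(train_i), len(valid_i))
```

A random split on data like this would let the model see the future, so its validation score would likely be too optimistic compared with the leaderboard.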

Page 18: MLDM CM Kaggle Tips

Data Cleaning / Preprocessing

• Missing Values
  - drop the missing data
  - replace them with a statistic (mean / median / mode) or with values from clustering / modelling methods
  - label them as a separate "missing" value

• Outlier Detection
  - https://en.wikipedia.org/wiki/Outlier

• Redundant Features
  - usually remove them
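The options for missing values can be sketched in a few lines. This is a minimal example under assumptions: missing entries are represented as `None`, and the column `ages` and the helper `impute` are made up for illustration.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None with a statistic of the observed values.

    Also returns a 0/1 indicator so a model can still see what was missing.
    """
    observed = [v for v in values if v is not None]
    fill = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[strategy](observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing

ages = [21, None, 35, 28, None, 28]          # hypothetical column
dropped = [v for v in ages if v is not None]  # option 1: drop
filled, was_missing = impute(ages, "median")  # options 2 and 3 together
print(dropped, filled, was_missing)
```

Keeping the indicator column alongside the filled values combines "replace" and "label as missing", which is often a safe default when it is unclear why the values are absent.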

Page 19: MLDM CM Kaggle Tips

Categorical Features

• One-hot Encoding (by Id)

  Mayday          1 0 0 0
  Sodagreen       0 1 0 0
  SEKAI_NO_OWARI  0 0 1 0
  The_Beatles     0 0 0 1

• Clustering Group (by Language)

  Mayday          1 0 0
  Sodagreen       1 0 0
  SEKAI_NO_OWARI  0 1 0
  The_Beatles     0 0 1
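The two encodings can be reproduced with one small helper. The language assigned to each artist below is an assumption added for this sketch (the slide only labels the grouping "Language"), and `one_hot` is a hypothetical helper rather than a library function.

```python
def one_hot(values):
    """Map each distinct category to its own 0/1 column (first-seen order)."""
    categories = list(dict.fromkeys(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

artists = ["Mayday", "Sodagreen", "SEKAI_NO_OWARI", "The_Beatles"]
# Assumed language of each artist, used as the coarser grouping.
language = {"Mayday": "Mandarin", "Sodagreen": "Mandarin",
            "SEKAI_NO_OWARI": "Japanese", "The_Beatles": "English"}

_, by_id = one_hot(artists)
groups, by_language = one_hot([language[a] for a in artists])
for name, id_row, lang_row in zip(artists, by_id, by_language):
    print(f"{name:15s} id={id_row} language={lang_row}")
```

Grouping shrinks the number of columns and lets rare categories share statistics, at the cost of losing the distinction between members of the same group.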

Page 20: MLDM CM Kaggle Tips

Categorical Features

• Col-hot Encoding

• Count-hot Encoding

• Likelihood Encoding

• …

[Figure: a node U linked to T1, T2 and T3 with counts 23, 1 and 6]

               T1     T2     T3
  count        23     1      6
  binary       1      0      1
  probability  23/30  1/30   6/30
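The count and probability rows follow directly from the counts 23, 1 and 6. The sketch below recomputes them. The rule behind the binary row is not stated on the slide, so the threshold `MIN_COUNT` is a guess that merely reproduces the "1 0 1" pattern.

```python
counts = {"T1": 23, "T2": 1, "T3": 6}  # counts shown on the slide

total = sum(counts.values())
count_enc = list(counts.values())
prob_enc = [c / total for c in counts.values()]

# Assumption: the slide does not state the binary rule; a threshold of
# count > 1 is a guess that happens to reproduce its "1 0 1" row.
MIN_COUNT = 1
binary_enc = [int(c > MIN_COUNT) for c in counts.values()]

print("count      ", count_enc)
print("binary     ", binary_enc)
print("probability", [f"{c}/{total}" for c in count_enc])
```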

Page 21: MLDM CM Kaggle Tips

Categorical Features (2)

• Latent Representations
  - Principal Component Analysis (PCA)
  - Linear Discriminant Analysis (LDA)
  - Laplacian Eigenmaps (LE)
  - Locally Linear Embedding (LLE)
  - Low-Rank Approximation / Latent Factorization
  - Latent Topic Model

Benefits:
  - reduce the computation cost
  - alleviate the overfitting issue
  - find the meaningful components
  - remove the noise

https://en.wikipedia.org/wiki/Dimensionality_reduction

Page 22: MLDM CM Kaggle Tips

Categorical vs. Numerical

• Ordinal Categories

  HATE  DON'T MIND  LIKE  LOVE
  0     1           2     3

[Figure: bar chart over HATE, DON'T MIND, LIKE, LOVE, labelled exp(value), with y-axis ticks from 0 to 8]
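An ordinal category can be mapped to its rank, and the rank can then be stretched with a transform such as exp(value) so that the gaps between levels are no longer equal. A minimal sketch follows; the `answers` list is hypothetical, and the code does not try to reproduce the exact values of the chart.

```python
import math

ORDER = ["HATE", "DON'T MIND", "LIKE", "LOVE"]
rank = {label: i for i, label in enumerate(ORDER)}  # 0, 1, 2, 3

answers = ["LIKE", "HATE", "LOVE"]                   # hypothetical responses
ordinal = [rank[a] for a in answers]
stretched = [math.exp(v) for v in ordinal]           # exp(value)

print(ordinal)
print([round(s, 2) for s in stretched])
```

Whether equal or growing gaps are more appropriate depends on the data; comparing both on a validation set is a reasonable way to decide.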

Page 23: MLDM CM Kaggle Tips

Numerical Features

• Standardization / Normalization

• Rescaling

• Transform the Distribution
  - logarithmic transformation
  - tf-idf like transformation

• Binning / Sampling

Notes:
  - Feature scaling is required by many ML algorithms: https://en.wikipedia.org/wiki/Feature_scaling
  - Data transformation: https://en.wikipedia.org/wiki/Data_transformation_(statistics)
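The numerical transformations listed on this slide can be sketched with the standard library. The helper names, the `plays` column and the bin edges are assumptions for illustration; ML libraries offer equivalent, more robust tools.

```python
import bisect
import math
import statistics

def standardize(xs):
    """Zero mean, unit (population) standard deviation."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def rescale(xs):
    """Min-max rescaling to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def log_transform(xs):
    """log(1 + x) compresses long right tails and is defined at x = 0."""
    return [math.log1p(x) for x in xs]

def bin_index(x, edges):
    """Index of the bin that x falls into, given sorted bin edges."""
    return bisect.bisect_right(edges, x)

plays = [1, 10, 100, 1000]  # hypothetical skewed counts
print(standardize(plays))
print(rescale(plays))
print(log_transform(plays))
print([bin_index(x, [10, 100]) for x in plays])
```

The statistics used for scaling (mean, min, max) should be computed on the training data only and then reused for validation and test data.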

Page 24: MLDM CM Kaggle Tips

Other Features

• Text-based - Natural Language Processing

• Image-based, Audio-based - Image/Signal Processing

• Time-based - Time Series

Domain Knowledge is Important

Page 25: MLDM CM Kaggle Tips

Example (1)

• Text-based
  - Vector Space Model: https://en.wikipedia.org/wiki/Vector_space_model
  - Word Embeddings
  - need stemming? lemmatization?

[Figure: word-embedding illustration with the words MAN, WOMAN, KING, QUEEN]

Page 26: MLDM CM Kaggle Tips

Example (2)

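Text

  服務好、環境整潔 … (Good service, clean environment …)
  服務人員笑容溫暖 ... (The staff's smiles are warm ...)
  今天點了商業午餐 ... (Ordered the business lunch today ...)

Segmentation

  [服務] [好] [環境] [整潔]        (service, good, environment, clean)
  [服務] [人員] [笑容] [溫暖]      (service, staff, smile, warm)
  [今天] [點了] [商業午餐]         (today, ordered, business lunch)

Filtering, then dummy variables

  服務:1 好:1 環境:1 整潔:2
  服務:1 笑容:1 溫暖:2
  商業午餐:1

Advanced Weighting?

  服務:2 好:1 環境:1 整潔:4
  服務:2 笑容:1 溫暖:1
  商業午餐:0.8

Word Embeddings?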
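The pipeline above (segmentation, filtering, counting, weighting) can be sketched as follows. The tokens are taken from the slide as already segmented. The stop list is an assumption chosen to match the tokens that disappear after filtering, and the smoothed tf-idf formula is one common variant, not necessarily the weighting the slide had in mind, so the numbers will not match the slide's illustrative values.

```python
import math
from collections import Counter

# Pre-segmented reviews from the slide (one token list per sentence).
docs = [
    ["服務", "好", "環境", "整潔"],
    ["服務", "人員", "笑容", "溫暖"],
    ["今天", "點了", "商業午餐"],
]
# Assumed stop list; it matches the tokens that disappear after "filtering".
STOP = {"人員", "今天", "點了"}

filtered = [[t for t in doc if t not in STOP] for doc in docs]
counts = [Counter(doc) for doc in filtered]  # term frequency per document

# Document frequency and a smoothed idf, one common tf-idf variant.
df = Counter(t for doc in filtered for t in set(doc))
n_docs = len(filtered)
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
tfidf = [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

for weights in tfidf:
    print({t: round(w, 2) for t, w in weights.items()})
```

Terms that appear in many documents, such as 服務 here, receive a lower weight than terms that are specific to one document.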

Page 27: MLDM CM Kaggle Tips

Example (3)

• Image-based
  - SIFT: https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  - Convolutional NN: https://en.wikipedia.org/wiki/Convolutional_neural_network

Page 28: MLDM CM Kaggle Tips

Understand the Meaning Behind the Observed Features

• 2017/05/20 08:00
  - Holiday? Weekday?
  - Day? Night?

• Taipei
  - Asia
  - Mandarin
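Derived features like these can be computed from the raw values. In the sketch below, the day/night boundary (06:00 to 18:00) and the `CITY_INFO` lookup table are assumptions. Detecting public holidays would need an external calendar and is left out.

```python
from datetime import datetime

def time_features(ts):
    """Turn a raw timestamp into features a model can use directly."""
    return {
        "weekday": ts.weekday(),               # Monday=0 ... Sunday=6
        "is_weekend": int(ts.weekday() >= 5),
        "hour": ts.hour,
        "is_daytime": int(6 <= ts.hour < 18),  # assumed day/night boundary
    }

# Hypothetical lookup table; a real one would come from domain knowledge.
CITY_INFO = {"Taipei": {"region": "Asia", "language": "Mandarin"}}

ts = datetime.strptime("2017/05/20 08:00", "%Y/%m/%d %H:%M")
features = {**time_features(ts), **CITY_INFO["Taipei"]}
print(features)
```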

Page 29: MLDM CM Kaggle Tips

ML Libraries

• scikit-learn

• xgboost, lightgbm, …

• vowpal wabbit

• libsvm, liblinear, libfm, libffm, …

• tensorflow, keras, h2o, caffe, mxnet, …

• …

Page 30: MLDM CM Kaggle Tips

Understand the Pros and Cons

• Linear Model
  - simple, fast and easy to tune
  - low memory usage
  - low model complexity

• Nearest Neighbours
  - depends on the prediction task and the data distribution

Page 31: MLDM CM Kaggle Tips

Understand the Pros and Cons (2)

• Random Forest
  - works very well in many competitions
  - fast and easy to tune
  - memory hungry

• SVM
  - strong theoretical guarantees
  - good at preventing overfitting
  - slow and memory heavy
  - usually needs a grid search over hyperparameters

Page 32: MLDM CM Kaggle Tips

Understand the Pros and Cons (3)

• There are too many details …

• Find some online courses or ML books
  - The Elements of Statistical Learning
  - Machine Learning: A Probabilistic Perspective
  - Programming Collective Intelligence
  - Information Science and Statistics
  - Pattern Recognition and Machine Learning
  - …

Page 33: MLDM CM Kaggle Tips

Exploratory Data Analysis (EDA)

• Statistics Helps - min, max, variance, mode, …

• Data Visualization Helps

Page 34: MLDM CM Kaggle Tips

Model Ensembling

• Voting

• Averaging

• Bagging

• Boosting

• Blending

• Stacking

[Figure: the outputs of models 1, 2, 3 and 4 combined by averaging (Avg.)]
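Averaging and voting are the simplest of the ensembling methods listed here. A minimal sketch follows, assuming four models whose predictions are already available as lists. All numbers are hypothetical.

```python
from collections import Counter
from statistics import mean

def average(predictions):
    """Average numeric predictions of several models, sample by sample."""
    return [mean(sample) for sample in zip(*predictions)]

def majority_vote(predictions):
    """Most common class label per sample (ties go to the first one seen)."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]

# Hypothetical outputs of four models on three samples.
probs = [[0.9, 0.2, 0.6],
         [0.8, 0.4, 0.4],
         [0.7, 0.1, 0.7],
         [0.8, 0.3, 0.5]]
labels = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 0, 1]]

avg = average(probs)
vote = majority_vote(labels)
print(avg, vote)
```

Averaging tends to help most when the individual models make different kinds of errors, which is why diverse models matter.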

Page 35: MLDM CM Kaggle Tips

Model Ensembling

[Figure: models 1, 2, 3 and 4 feeding into an ENSEMBLE MODEL]

Page 36: MLDM CM Kaggle Tips

Model Ensembling

[Figure: the outputs of models 1, 2, 3 and 4 added as a new feature, alongside averaging (Avg.)]
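Reading the "new feature" in this figure as stacking is an interpretation, not something the slide spells out. Under that assumption, a base model's out-of-fold predictions become an extra input column for a second-level model. The sketch below uses a deliberately trivial base model (predict the training mean) so that it stays self-contained; `out_of_fold_feature`, `fit_mean` and `predict_mean` are hypothetical names.

```python
def out_of_fold_feature(X, y, fit, predict, k=3):
    """Predict each row with a model trained on the other folds."""
    n = len(X)
    oof = [None] * n
    for i in range(k):
        valid_idx = list(range(i, n, k))
        train_idx = [j for j in range(n) if j % k != i]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        for j in valid_idx:
            oof[j] = predict(model, X[j])
    return oof

# Hypothetical base model: predicts the mean training target.
def fit_mean(X, y):
    return sum(y) / len(y)

def predict_mean(model, x):
    return model

X = [[1], [2], [3], [4], [5], [6]]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
oof = out_of_fold_feature(X, y, fit_mean, predict_mean)
X_stacked = [row + [p] for row, p in zip(X, oof)]  # original features + new one
print(X_stacked)
```

Using out-of-fold predictions, rather than predictions on rows the base model was trained on, keeps the target from leaking into the new feature.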

Page 37: MLDM CM Kaggle Tips

Other Tricks

• Data Leakage

• Magic/Lucky Parameters

Page 38: MLDM CM Kaggle Tips

Overall: To Get to the Top

• Correct Validation

• Good Feature Extractions

• Diverse Models

• Proper Ensembling

Page 39: MLDM CM Kaggle Tips

https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom?qid=8f0c66f9-43ba-4646-8a05-d03bf30b2eeb&v=&b=&from_search=9

Page 40: MLDM CM Kaggle Tips

http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/

Page 41: MLDM CM Kaggle Tips

Learning from Others/Winners

• Discussion Forum

• Kernels

• Winner Solutions

Page 42: MLDM CM Kaggle Tips

Learning from Others/Winners

http://blog.kaggle.com/

Page 43: MLDM CM Kaggle Tips

ANY QUESTIONS?

changecandy at gmail