2. About Me CM - Ph.D. student in TIGP-SNHCC - Research assistant at AS CITI - Research intern at KKBOX (Research, Machine Learning team) - Advisors: Prof. Victor Tsai, Dr. Eric Yang - CLIP Lab / MAC Lab - https://about.me/chewme
5. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities.
6. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities. What's your motivation?
8. Why Compete?
9. Related Websites http://dc.dsp.im/index.php
10. Related Websites https://tianchi.aliyun.com/
12. Common Prediction Tasks Binary classification, multi-label classification, regression, recommendation.
13. Other Prediction Tasks Route/path prediction, object detection, proposing a solution, designing a webpage, exploratory data analysis (EDA), and more.
14. Evaluation Metric https://www.kaggle.com/wiki/Metrics Many existing toolkits provide solvers that can optimize the loss for a given metric.
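Before reaching for a toolkit, it helps to compute the competition metric by hand to sanity-check a submission. A minimal sketch of two common metrics in plain Python (the sample values are made up):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, a common regression metric."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, a common classification metric."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(rmse([3.0, 5.0], [2.0, 6.0]))          # 1.0
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```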
15. Why Optimize with the Given Metric? Example: for true labels (5, 4, 3), the prediction (3, 2, 1) has a perfect ranking but a bad loss, while (3, 4, 3) has a bad ranking but a better loss.
16. Why Optimize with the Given Metric? Shifting the first prediction by a constant, (3+2, 2+2, 1+2) = (5, 4, 3), yields a perfect ranking and a perfect loss; (3, 4, 3) still has a bad ranking despite its better loss.
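The slide's example can be checked directly: the prediction with the lower squared loss is the one with the broken ranking, which is why a ranking metric should be optimized as a ranking metric.

```python
def mse(y, p):
    """Mean squared error between labels y and predictions p."""
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)

def same_ranking(y, p):
    """True if the predictions order the items exactly as the labels do."""
    order = lambda v: sorted(range(len(v)), key=lambda i: v[i])
    return order(y) == order(p)

labels = [5, 4, 3]
pred_a = [3, 2, 1]  # perfect ranking, bad squared loss
pred_b = [3, 4, 3]  # better squared loss, broken ranking

print(mse(labels, pred_a), same_ranking(labels, pred_a))  # 4.0 True
print(mse(labels, pred_b), same_ranking(labels, pred_b))
```

Note that adding the constant 2 to `pred_a` gives (5, 4, 3): zero loss, still a perfect ranking, exactly as slide 16 shows.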
17. Check the Provided Data The distribution of train/test data - random splitting - split by time - split by IDs. Available features - categorical, numerical - text - image, audio - time - sparse, dense.
18. Cross Validation (1) Rotate the train/validation/test roles across rounds. Round 1: TRAIN | VAL | TEST. Round 2: VAL | TEST | TRAIN. Round 3: TEST | TRAIN | VAL.
19. Common Given Data A fixed TRAIN set and a fixed TEST set.
20. Cross Validation (2) Keep TEST fixed and rotate the validation fold within the training data. Round 1: TRAIN | TRAIN | VAL. Round 2: TRAIN | VAL | TRAIN. Round 3: VAL | TRAIN | TRAIN.
21. Cross Validation (3) One round only: hold out a single VAL split from TRAIN to find the best model, then evaluate on TEST.
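Scheme (2) above, rotating a validation fold over the training data, can be sketched with plain index arithmetic; k = 3 folds over 6 rows is an illustrative choice:

```python
def kfold_indices(n, k):
    """Split range(n) into k consecutive (train, validation) index pairs."""
    fold = n // k
    for i in range(k):
        # Last fold absorbs any remainder rows.
        val = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in set(val)]
        yield train, val

for train, val in kfold_indices(6, 3):
    print(train, val)  # e.g. [2, 3, 4, 5] [0, 1]
```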
22. Hold a Proper Validation Random splitting, split by time, or split by ID. Time-based example from the slide: train up to 5/13, validate 5/13 to 5/20 (7 days), test 5/20 to 5/27 (7 days).
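A time-based split like the one sketched on the slide is just date comparison. The year and exact cutoffs below are assumptions for illustration:

```python
from datetime import date

# Hypothetical cutoffs mirroring the slide's "7 days + 7 days" layout.
VAL_START, TEST_START = date(2017, 5, 13), date(2017, 5, 20)

def time_split(rows):
    """Assign each (day, features) row to train / validation / test by date."""
    train = [r for r in rows if r[0] < VAL_START]
    val = [r for r in rows if VAL_START <= r[0] < TEST_START]
    test = [r for r in rows if r[0] >= TEST_START]
    return train, val, test

rows = [(date(2017, 5, d), None) for d in range(1, 28)]
train, val, test = time_split(rows)
print(len(train), len(val), len(test))  # 12 7 8
```

Splitting by time (never randomly) is what keeps the validation score honest when the test set lies in the future.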
23. Data Cleaning / Preprocessing Missing values - drop the missing data - replace them with statistical values (mean / median / mode / clustering / modeling methods) - label them as missing. Outlier detection - https://en.wikipedia.org/wiki/Outlier. Redundant features - we usually remove them.
24. Data Cleaning / Preprocessing Python - Pandas
25. Data Cleaning / Preprocessing Python - Pandas - drop -
replace - label
26. Data Cleaning / Preprocessing Example ages: User A 19, User B 27, User C 200.
27. Data Cleaning / Preprocessing Drop: User C (age 200) is removed as an outlier.
28. Data Cleaning / Preprocessing Drop, or add a label: a deviation score (1.2, 1.1, 6.3) singles out User C, and an outlier column (1, 1, 0) records the flag.
29. Data Cleaning / Preprocessing Drop, add a label, or replace: the outlier age 200 is substituted (e.g., User C becomes 36) using mean / median / mode / clustering / modeling methods.
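The drop / label / replace options from the Age table can be sketched with pandas (the deck's named tool), assuming it is installed. The cutoff of 120 and the mean-of-inliers replacement are illustrative choices, not the slide's exact method:

```python
import pandas as pd

ages = pd.DataFrame({"age": [19, 27, 200]}, index=["User A", "User B", "User C"])
cutoff = 120  # hypothetical plausibility threshold for this toy table

dropped = ages[ages["age"] <= cutoff]                              # drop
flagged = ages.assign(outlier=(ages["age"] > cutoff).astype(int))  # label
mean_age = ages.loc[ages["age"] <= cutoff, "age"].mean()           # 23.0
replaced = ages.copy()
replaced.loc[replaced["age"] > cutoff, "age"] = mean_age           # replace

print(dropped.shape[0], flagged["outlier"].tolist())  # 2 [0, 0, 1]
```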
31. Categorical Features One-hot encoding, count encoding, likelihood encoding. Example for tags T1 / T2 / T3: counts 23 / 1 / 6 (count encoding), presence 1 / 0 / 1 (binary), probabilities 23/30, 1/30, 6/30 (likelihood).
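A sketch of the three encodings on the slide's toy counts. Note that "likelihood encoding" is computed here as category frequency, matching the 23/30 example; in practice the term often means target-based encoding instead:

```python
from collections import Counter

tags = ["T1"] * 23 + ["T2"] + ["T3"] * 6    # the slide's toy counts: 23 / 1 / 6

counts = Counter(tags)                       # count encoding
onehot = {t: 1 for t in counts}              # binary presence encoding
total = sum(counts.values())
likelihood = {t: c / total for t, c in counts.items()}  # frequency encoding

print(counts["T1"], likelihood["T1"])
```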
32. Categorical Features (2) Latent Representations - Principal Component Analysis (PCA) - Linear Discriminant Analysis (LDA) - Laplacian Eigenmaps (LE) - Locally Linear Embedding (LLE) - Low-Rank Approximation / Latent Factorization - Latent Topic Models. Benefits: reduce the computation cost, alleviate the overfitting issue, find meaningful components, remove noise. https://en.wikipedia.org/wiki/Dimensionality_reduction
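As a minimal sketch of one of these reductions, PCA can be written in a few lines of NumPy via the SVD (assuming NumPy is available; real work would use a library implementation):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components (SVD-based)."""
    Xc = X - X.mean(axis=0)               # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T                  # coordinates in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```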
33. Numerical Features Standardization / normalization / rescaling (required by many ML algorithms; https://en.wikipedia.org/wiki/Feature_scaling). Transform the distribution - logarithmic transformation - tf-idf-like transformation (https://en.wikipedia.org/wiki/Data_transformation_(statistics)). Binning / sampling.
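The three transformations named above, sketched on a made-up heavy-tailed column:

```python
import math
import statistics

values = [1, 10, 100, 1000]

# Standardization: zero mean, unit variance (required by many ML algorithms).
mu, sigma = statistics.mean(values), statistics.pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Min-max rescaling to [0, 1].
lo, hi = min(values), max(values)
rescaled = [(v - lo) / (hi - lo) for v in values]

# Logarithmic transform to compress the heavy tail.
logged = [math.log1p(v) for v in values]

print(rescaled[0], rescaled[-1])  # 0.0 1.0
```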
34. Categorical vs. Numerical Ordinal categories (HATE, DON'T MIND, LIKE, LOVE) can be mapped linearly to 0, 1, 2, 3, or spaced non-linearly, e.g., via exp(value).
35. Data Sampling Label imbalance problem: e.g., a 3:1 class ratio. Find out more from: https://github.com/scikit-learn-contrib/imbalanced-learn
36. Data Sampling Over-sampling: duplicate minority samples, 3:1 to 1:1.
37. Data Sampling Under-sampling: discard majority samples, 3:1 to 1:1.
38. Data Sampling Combined over-under sampling?
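The imbalanced-learn library linked above implements these samplers; as a dependency-free sketch, random over-sampling is just duplicating minority rows until the classes balance:

```python
import random

def oversample(X, y, seed=0):
    """Randomly duplicate minority-class rows until all classes are balanced."""
    rng = random.Random(seed)
    by_label = {}
    for xi, yi in zip(X, y):
        by_label.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_label.values())
    Xb, yb = [], []
    for label, rows in by_label.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            Xb.append(xi)
            yb.append(label)
    return Xb, yb

X = [[0], [1], [2], [3]]
y = [0, 0, 0, 1]                 # 3:1 imbalance, as on the slide
Xb, yb = oversample(X, y)
print(yb.count(0), yb.count(1))  # 3 3
```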
39. Other Feature Kinds Text-based - natural language processing. Image-based, audio-based - image/signal processing. Time-based - time series. Domain knowledge is important.
40. Example (1) Text-based - Vector Space Model - Word Embeddings (https://en.wikipedia.org/wiki/Vector_space_model). The classic analogy: MAN is to WOMAN as KING is to QUEEN. Need stemming? Lemmatization?
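A minimal vector space model sketch: represent each document as a bag-of-words term-frequency vector and compare with cosine similarity (the example sentences reuse the Quora pairs from the EDA slide):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

doc1 = Counter("how do i read my youtube comments".split())
doc2 = Counter("how can i see all my youtube comments".split())
doc3 = Counter("what is the biggest monster in monster hunter".split())

print(cosine(doc1, doc2) > cosine(doc1, doc3))  # True
```

Real pipelines would weight terms (tf-idf) and normalize words (stemming / lemmatization) before vectorizing.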
45. Understand the Pros and Cons Linear models - simple, fast, and easy to tune - low memory footprint - limited to non-complex relationships. Random forests - work very well in many competitions - fast and easy to tune - memory hungry.
46. Understand the Pros and Cons (2) Neural networks - easy end-to-end learning - flexible - hard to tune/train. SVMs - strong theoretical guarantees - good at preventing overfitting - slow and memory heavy - usually need a grid search over hyperparameters.
47. Understand the Pros and Cons (3) Gradient Boosting Machine (GBM) - usually unbeatable on dense feature sets. Factorization Machine (FM) - the master at dealing with sparse data.
48. Understand the Pros and Cons (4) There are too many details. Find some online courses or ML books: The Elements of Statistical Learning; Machine Learning: A Probabilistic Perspective; Programming Collective Intelligence; Pattern Recognition and Machine Learning.
49. Understand the Pros and Cons (5) I'll tell you everything.
50. Exploratory Data Analysis (EDA) Quora duplicated-question detection as the example: "How do I read and find my YouTube comments?" vs. "How can I see all my YouTube comments?"; "What is the alternative to machine learning?" vs. "How do I over-sample a multi-class imbalanced data set?"; "What is the biggest monster in Monster Hunter?" vs. "Is there a Monster Hunter PC game?"
51. Example EDA Statistics helps - min, max, variance, mode, etc. Data visualization helps.
52. Model Ensemble Voting, averaging, bagging, boosting, blending, stacking. Diagram: predictions from models 1-4 are combined by an AVERAGE step.
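The AVERAGE combiner in the diagram is the simplest ensemble of all; a sketch with made-up prediction values:

```python
def average_ensemble(predictions):
    """Average the predictions of several models, item by item."""
    n = len(predictions)
    return [sum(cols) / n for cols in zip(*predictions)]

model_a = [1.0, 0.5, 0.0]
model_b = [0.5, 0.25, 0.5]
print(average_ensemble([model_a, model_b]))  # [0.75, 0.375, 0.25]
```

Averaging already reduces variance; the later slides build up to weighted blending and stacking from here.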
53. Diverse Models Ensemble
https://mlwave.com/kaggle-ensembling-guide/
56. Model Ensemble Voting, averaging, bagging, boosting, blending, stacking. Diagram: predictions from models 1-4 feed an ENSEMBLE MODEL.
58. Other Tricks Data Leakage Magic/Lucky Parameters
59. Overall - To Get into the Top Correct validation, good feature extraction, diverse models, proper model ensembling. Advanced way - understand and modify the model.
60. Step by Step DATASET -> Feature Sets A/B/C -> Model A -> Prediction A. Focus on using one single model. Extract N features every day. Check the validation score.
61. Step by Step Diversify the models and try different feature combinations: Models A/B/C -> Predictions A/B/C.
62. Step by Step Ensemble the models: Models A/B/C -> Prediction A.
63. Step by Step Package it: Models A/B/C -> Prediction A.
64. Step by Step Stacking: Feature Sets D/E/F -> Models D/E/F -> Prediction B, built on top of the packaged first-level predictions.
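The stacking step above can be sketched as feeding each base model's output into a second-level model. The toy models and the fixed 0.5/0.5 weights below are illustrative; in practice the meta-model is itself trained, on out-of-fold predictions of the base models:

```python
def stack(base_models, meta_model, X):
    """Feed each base model's prediction for x into a meta-model (stacking)."""
    meta_features = [[m(x) for m in base_models] for x in X]
    return [meta_model(f) for f in meta_features]

# Hypothetical toy base models and a fixed weighted-average meta-model.
base = [lambda x: x * 0.5, lambda x: x + 1]
meta = lambda feats: 0.5 * feats[0] + 0.5 * feats[1]

print(stack(base, meta, [2, 4]))  # [2.0, 3.5]
```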