UTAIPEI Chih-Ming

CM UTaipei Kaggle Share


  1. UTAIPEI Chih-Ming
  2. About Me: CM, Ph.D. student in TIGP-SNHCC; Research Assistant at AS CITI; Research Intern at KKBOX. Advisors: Prof. Victor Tsai and Dr. Eric Yang. Labs: CLIP Lab and MAC Lab Research, Machine Learning team. https://about.me/chewme
  3. Kaggle https://www.facebook.com/groups/kaggletw/
  4. (image-only slide)
  5. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities.
  6. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities. What's your motivation?
  7. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities. What's your motivation?
  8. Why Compete?
  9. Related Websites: http://dc.dsp.im/index.php
  10. Related Websites: https://tianchi.aliyun.com/
  11. (image-only slide)
  12. Common Prediction Tasks: binary classification, multi-label classification, regression, recommendations.
  13. Other Prediction Tasks: route/path prediction, object detection, proposing a solution, designing a webpage, exploratory data analysis (EDA), ...
  14. Evaluation Metric https://www.kaggle.com/wiki/Metrics - Many existing toolkits provide solvers that can optimize the loss for a given metric.
  15. Why Optimize with the Given Metric? True scores: 5 4 3. Prediction [3 2 1]: perfect ranking, bad loss. Prediction [3 4 3]: bad ranking, better loss.
  16. Why Optimize with the Given Metric? Shifting the first prediction by a constant, [3+2 2+2 1+2] = [5 4 3], gives a perfect ranking and a perfect loss, while [3 4 3] still has a bad ranking despite its better loss.
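The slides' toy numbers can be checked directly: a squared-error loss can prefer a prediction with the wrong ranking, and shifting a perfectly ranked prediction by a constant repairs the loss without changing the order. A small sketch in pure Python, with the values taken from the slides:

```python
# Toy example from the slides: squared-error loss vs. ranking quality.
truth = [5, 4, 3]
pred_rank = [3, 2, 1]   # perfect ranking, bad loss
pred_loss = [3, 4, 3]   # bad ranking, better loss

def sse(y, p):
    """Sum of squared errors between truth y and prediction p."""
    return sum((a - b) ** 2 for a, b in zip(y, p))

def same_ranking(y, p):
    """True if p orders the items exactly as y does."""
    order = lambda v: sorted(range(len(v)), key=lambda i: v[i])
    return order(y) == order(p)

print(sse(truth, pred_rank), same_ranking(truth, pred_rank))  # 12 True
print(sse(truth, pred_loss), same_ranking(truth, pred_loss))  # 4 False

# Shifting the ranked prediction by +2 fixes the loss, keeps the ranking:
shifted = [v + 2 for v in pred_rank]
print(sse(truth, shifted), same_ranking(truth, shifted))      # 0 True
```

So if the competition metric is rank-based, the prediction with the worse squared error can still be the better submission.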
  17. Check the Provided Data. Distribution of train/test data: random splitting, split by time, or split by IDs. Available features: categorical, numerical, text, image, audio, time, sparse, dense.
  18. Cross Validation (1). Train / Validation / Test: rotate which folds serve as TRAIN, VAL, and TEST across three rounds (diagram).
  19. Common Given Data: a TRAIN set and a TEST set (diagram).
  20. Cross Validation (2). Hold the TEST set fixed and rotate the VAL fold within TRAIN across three rounds (diagram).
  21. Cross Validation (3). A single round with one held-out VAL: find the best model on VAL, then score on TEST (diagram).
  22. Hold a Proper Validation: random splitting, splitting by time (e.g. two 7-day windows, 5/13 to 5/20 and 5/20 to 5/27), or splitting by ID (diagram).
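The "hold out TEST, rotate VAL inside TRAIN" scheme can be sketched with scikit-learn, which the deck lists later among its ML libraries; the arrays here are toy placeholders, not competition data:

```python
# Sketch: fixed TEST set plus a rotating VAL fold inside TRAIN.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(24).reshape(12, 2)   # toy feature matrix
y = np.arange(12)                  # toy targets

# Carve off a fixed TEST set first.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=3, random_state=0)

# Rotate the VAL fold inside the remaining TRAIN rows.
fold_sizes = []
for tr_idx, val_idx in KFold(n_splits=3).split(X_tr):
    # Fit a model on tr_idx rows and score it on val_idx rows here.
    fold_sizes.append((len(tr_idx), len(val_idx)))
```

Averaging the per-fold validation scores gives a more stable estimate than a single split.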
  23. Data Cleaning / Preprocessing. Missing values: drop the missing data, replace them with a statistic (mean / median / mode / clustering / modeling methods), or add a label marking them as missing. Outlier detection: https://en.wikipedia.org/wiki/Outlier. Redundant features: we usually remove them.
  24. Data Cleaning / Preprocessing. Python: Pandas.
  25. Data Cleaning / Preprocessing. Python: Pandas - drop, replace, label.
  26. Data Cleaning / Preprocessing. Example table: Age of User A = 19, User B = 27, User C = 200 (an outlier).
  27. Option 1: drop the outlier row (User C, age 200).
  28. Option 2: add a label. Standardized scores 1.2 / 1.1 / 6.3 expose User C; an outlier column marks the rows 1 / 1 / 0.
  29. Option 3: replace the outlier with a statistic (mean / median / mode / clustering / modeling methods), e.g. User C's age becomes 36.
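A pandas sketch of the three options above (drop / label / replace); the outlier cutoff of 100 is an assumption for this toy table, and the replacement uses the inlier mean rather than the slide's illustrative value:

```python
# Drop / label / replace an outlier in the slide's Age table with pandas.
import pandas as pd

df = pd.DataFrame({"age": [19, 27, 200]}, index=["User A", "User B", "User C"])
is_outlier = df["age"] > 100   # assumed cutoff for this toy example

# Option 1: drop the outlier row.
dropped = df[~is_outlier]

# Option 2: keep the row but add a label column (1 = normal, 0 = outlier, as on the slide).
labeled = df.assign(outlier=(~is_outlier).astype(int))

# Option 3: replace the outlier with a statistic, here the mean of the inliers.
replaced = df["age"].mask(is_outlier, df.loc[~is_outlier, "age"].mean())
```

Which option wins is competition-dependent; the label variant keeps the information that the value was anomalous.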
  30. Categorical Features. One-hot encoding: one binary column per artist (Mayday / Sodagreen / SEKAI_NO_OWARI / The_Beatles → four columns). Clustering / group encoding: map artists to a coarser ID such as language, so Mayday and Sodagreen share a column.
  31. Categorical Features. Count-based encodings for categories T1 / T2 / T3 with counts 23 / 1 / 6: count encoding keeps the raw counts, binary encoding keeps presence flags, and likelihood encoding keeps probabilities (23/30, 1/30, 6/30).
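One-hot, count, and likelihood encoding fit in a few lines of pandas; the artist names come from the slide, the data rows are illustrative:

```python
# One-hot, count, and likelihood encoding of a categorical column.
import pandas as pd

s = pd.Series(["Mayday", "Sodagreen", "Mayday", "The_Beatles"], name="artist")

onehot = pd.get_dummies(s)                           # one binary column per artist
counts = s.map(s.value_counts())                     # count encoding (raw frequencies)
likelihood = s.map(s.value_counts(normalize=True))   # likelihood / probability encoding
```

Likelihood encoding of the *target* (rather than the category frequency, as here) is also common, but it must be computed out-of-fold to avoid leakage.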
  32. Categorical Features (2). Latent representations: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Laplacian Eigenmaps (LE), Locally Linear Embedding (LLE), low-rank approximation / latent factorization, latent topic models. Benefits: reduces computation cost, alleviates overfitting, finds meaningful components, removes noise. https://en.wikipedia.org/wiki/Dimensionality_reduction
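All of the reductions above follow the same fit/transform pattern; a minimal PCA sketch with scikit-learn on random toy data:

```python
# PCA sketch: compress 20 (e.g. one-hot) columns into 5 dense latent components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 20)   # toy feature matrix
Z = PCA(n_components=5).fit_transform(X)     # dense latent representation
```

The same five components can then be fed to any downstream model in place of the original sparse columns.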
  33. Numerical Features. Standardization / normalization (required by many ML algorithms; https://en.wikipedia.org/wiki/Feature_scaling), rescaling, transforming the distribution (logarithmic or tf-idf-like transformations; https://en.wikipedia.org/wiki/Data_transformation_(statistics)), binning / sampling.
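Standardization and a log transform for a heavy-tailed feature, sketched with scikit-learn and numpy on an illustrative column:

```python
# Standardize a numeric column and compress its right tail with log1p.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [10.0], [100.0], [1000.0]])   # heavily right-skewed toy feature
standardized = StandardScaler().fit_transform(x)   # zero mean, unit variance
logged = np.log1p(x)                               # log(1 + x): evens out the scale
```

log1p is preferred over a plain log when zeros may appear in the data.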
  34. Categorical vs. Numerical. Ordinal categories carry an order: HATE / DON'T MIND / LIKE / LOVE can map to 0 / 1 / 2 / 3, or to a nonlinear scale such as exp(value).
  35. Data Sampling. Label imbalance problem: 3:1. Find out more at https://github.com/scikit-learn-contrib/imbalanced-learn
  36. Data Sampling. Over-sampling turns the 3:1 imbalance into 1:1.
  37. Data Sampling. Under-sampling also turns the 3:1 imbalance into 1:1.
  38. Data Sampling. Over-under sampling?
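A random over-sampling sketch for the 3:1 case, using sklearn.utils.resample; the imbalanced-learn toolkit linked above offers richer strategies (under-sampling, SMOTE, and combinations):

```python
# Random over-sampling: duplicate minority rows until the labels balance 1:1.
import numpy as np
from sklearn.utils import resample

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])       # 6:2, i.e. a 3:1 imbalance

minority = X[y == 1]
upsampled = resample(minority, replace=True, n_samples=6, random_state=0)

X_bal = np.vstack([X[y == 0], upsampled])    # 6 majority + 6 resampled minority
y_bal = np.array([0] * 6 + [1] * 6)          # now 1:1
```

Resampling must happen inside each training fold only; resampling before the split leaks duplicated rows into validation.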
  39. Other Feature Kinds. Text-based: natural language processing. Image-based, audio-based: image/signal processing. Time-based: time series. Domain knowledge is important.
  40. Example (1). Text-based: vector space model (https://en.wikipedia.org/wiki/Vector_space_model) and word embeddings (the classic MAN : WOMAN :: KING : QUEEN analogy). Need stemming? Lemmatization?
  41. Example (2). Pipeline: text → segmentation → filtering → term counts used as dummy variables. Word embeddings? Advanced weighting?
  42. Example (3). Image-based: SIFT (https://en.wikipedia.org/wiki/Scale-invariant_feature_transform) and convolutional neural networks (https://en.wikipedia.org/wiki/Convolutional_neural_network).
  43. Realize the Meaning Behind the Observed Features. 2017/05/20 08:00, Taipei: holiday or weekday? Day or night? Asia, Mandarin.
  44. ML Libraries: scikit-learn, xgboost, lightgbm, vowpal wabbit, libsvm, liblinear, libfm, libffm, tensorflow, keras, h2o, caffe, mxnet, ...
  45. Understand the Pros and Cons. Linear models: simple, fast, and easy to tune; low memory footprint; but limited to non-complex relationships. Random forests: work very well in many competitions; fast and easy to tune; but memory hungry.
  46. Understand the Pros and Cons (2). Neural networks: easy end-to-end learning; flexible; but hard to tune/train. SVMs: strong theoretical guarantees; resistant to overfitting; but slow and memory heavy, and usually need a grid search over hyperparameters.
  47. Understand the Pros and Cons (3). Gradient Boosting Machines (GBM): usually unbeatable on dense feature sets. Factorization Machines (FM): the master at dealing with sparse data.
  48. Understand the Pros and Cons (4). There are too many details; find some online courses or ML books, e.g. The Elements of Statistical Learning; Machine Learning: A Probabilistic Perspective; Programming Collective Intelligence; Pattern Recognition and Machine Learning (Information Science and Statistics series).
  49. Understand the Pros and Cons (5). I'll tell you everything.
  50. Exploratory Data Analysis (EDA). Quora duplicate-question pairs as the example: "How do I read and find my YouTube comments?" vs. "How can I see all my Youtube comments?"; "What is the alternative to machine learning?" vs. "How do I over-sample a multi-class imbalance data set?"; "What is the biggest monster in Monster Hunter?" vs. "Is there a Monster Hunter PC game?"
  51. Example EDA. Statistics helps: min, max, variance, mode, etc. Data visualization helps.
  52. Model Ensemble: voting, averaging, bagging, boosting, blending, stacking. (Diagram: predictions 1-4 combined by an AVERAGE step.)
  53. Diverse Models Ensemble. https://mlwave.com/kaggle-ensembling-guide/
  54. Diverse Models Ensemble. https://mlwave.com/kaggle-ensembling-guide/
  55. Diverse Models Ensemble. https://mlwave.com/kaggle-ensembling-guide/
  56. Model Ensemble: voting, averaging, bagging, boosting, blending, stacking. (Diagram: predictions 1-4 combined by an ENSEMBLE MODEL.)
  57. Model Ensembling: voting, averaging, bagging, boosting, blending, stacking. (Diagram: stacking turns base-model predictions into a new feature, then averages.)
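A minimal sketch of averaging versus stacking on toy regression data; real stacking should use out-of-fold predictions, as the linked ensembling guide explains, to avoid leaking the training labels into the meta-model:

```python
# Averaging vs. stacking two base models' predictions (toy regression data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X, y = rng.rand(100, 3), rng.rand(100)

p1 = LinearRegression().fit(X, y).predict(X)   # base model 1
p2 = Ridge(alpha=1.0).fit(X, y).predict(X)     # base model 2

averaged = (p1 + p2) / 2                       # simple averaging
meta_X = np.column_stack([p1, p2])             # stacking: predictions become features
stacked = LinearRegression().fit(meta_X, y).predict(meta_X)
```

The meta-model learns how much to trust each base model, which is why diverse base models (slide 53) help most.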
  58. Other Tricks: data leakage, magic/lucky parameters.
  59. Overall - To Get into the Top: correct validation, good feature extraction, diverse models, proper model ensembling. Advanced way: understand and modify the model.
  60. Step by Step. Feature Sets A/B/C → Model A → Prediction A. Focus on using one single model; extract N features every day; check the validation score.
  61. Step by Step. Diversify the models and try different feature combinations: Feature Sets A/B/C → Models A/B/C → Predictions A/B/C.
  62. Step by Step. Ensemble the models: Feature Sets A/B/C → Models A/B/C → Prediction A.
  63. Step by Step. Package it: Feature Sets A/B/C → Models A/B/C → Prediction A.
  64. Step by Step. Stacking: Feature Sets D/E/F → Models D/E/F → Prediction B, stacked with the first pipeline.
  65. https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom?qid=8f0c66f9-43ba-4646-8a05-d03bf30b2eeb&v=&b=&from_search=9
  66. https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng/
  67. http://blog.kaggle.com/2017/02/27/allstate-claims-severity-competition-2nd-place-winners-interview-alexey-noskov/
  68. http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
  69. Learning from Others/Winners. http://blog.kaggle.com/
  70. Learning from Others/Winners. https://docs.google.com/presentation/d/1bo7SahuYMzEEylVUTbJE29Oot7T4Cqk9SknOS9Lgsq4/edit?usp=sharing
  71. Any questions? changecandy at gmail