UTAIPEI Chih-Ming

CM UTaipei Kaggle Share


  1. UTAIPEI Chih-Ming
  2. About Me: CM, Ph.D. student in TIGP-SNHCC; Research Assistant at AS CITI; Research Intern at KKBOX. Advisors: Prof. Victor Tsai and Dr. Eric Yang. Labs: CLIP Lab and MAC Lab Research, Machine Learning team. https://about.me/chewme
  3. Kaggle https://www.facebook.com/groups/kaggletw/
  4. (image-only slide)
  5. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities.
  6. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities. What's your motivation?
  7. Why Compete? For fun: competing with others, like running or racing. For learning: improving your abilities. What's your motivation?
  8. Why Compete?
  9. Related Websites: http://dc.dsp.im/index.php
  10. Related Websites: https://tianchi.aliyun.com/
  11. (image-only slide)
  12. Common Prediction Tasks: binary classification, multi-label classification, regression, recommendations.
  13. Other Prediction Tasks: route/path prediction, object detection, proposing a solution, designing a webpage, exploratory data analysis (EDA), ...
  14. Evaluation Metric https://www.kaggle.com/wiki/Metrics - Many existing toolkits provide solvers that can optimize the loss for a given metric.
  15. Why Optimize with the Given Metric? True scores: 5 4 3. Prediction [3 2 1]: perfect ranking, bad loss. Prediction [3 4 3]: bad ranking, better loss.
  16. Why Optimize with the Given Metric? Shifting the first prediction by a constant, [3+2 2+2 1+2] = [5 4 3], gives a perfect ranking and a perfect loss, while [3 4 3] still has a bad ranking despite its better loss.
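The slides' toy numbers can be checked directly: a squared-error loss can prefer a prediction with the wrong ranking, and shifting a perfectly ranked prediction by a constant repairs the loss without changing the order. A small sketch in pure Python, with the values taken from the slides:

```python
# Toy example from the slides: squared-error loss vs. ranking quality.
truth = [5, 4, 3]
pred_rank = [3, 2, 1]   # perfect ranking, bad loss
pred_loss = [3, 4, 3]   # bad ranking, better loss

def sse(y, p):
    """Sum of squared errors between truth y and prediction p."""
    return sum((a - b) ** 2 for a, b in zip(y, p))

def same_ranking(y, p):
    """True if p orders the items exactly as y does."""
    order = lambda v: sorted(range(len(v)), key=lambda i: v[i])
    return order(y) == order(p)

print(sse(truth, pred_rank), same_ranking(truth, pred_rank))  # 12 True
print(sse(truth, pred_loss), same_ranking(truth, pred_loss))  # 4 False

# Shifting the ranked prediction by +2 fixes the loss, keeps the ranking:
shifted = [v + 2 for v in pred_rank]
print(sse(truth, shifted), same_ranking(truth, shifted))      # 0 True
```

So if the competition metric is rank-based, the prediction with the worse squared error can still be the better submission.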
  17. Check the Provided Data. Distribution of train/test data: random splitting, split by time, or split by IDs. Available features: categorical, numerical, text, image, audio, time, sparse, dense.
  18. Cross Validation (1). Train / Validation / Test: rotate which folds serve as TRAIN, VAL, and TEST across three rounds (diagram).
  19. Common Given Data: a TRAIN set and a TEST set (diagram).
  20. Cross Validation (2). Hold the TEST set fixed and rotate the VAL fold within TRAIN across three rounds (diagram).
  21. Cross Validation (3). A single round with one held-out VAL: find the best model on VAL, then score on TEST (diagram).
  22. Hold a Proper Validation: random splitting, splitting by time (e.g. two 7-day windows, 5/13 to 5/20 and 5/20 to 5/27), or splitting by ID (diagram).
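The "hold out TEST, rotate VAL inside TRAIN" scheme can be sketched with scikit-learn, which the deck lists later among its ML libraries; the arrays here are toy placeholders, not competition data:

```python
# Sketch: fixed TEST set plus a rotating VAL fold inside TRAIN.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(24).reshape(12, 2)   # toy feature matrix
y = np.arange(12)                  # toy targets

# Carve off a fixed TEST set first.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=3, random_state=0)

# Rotate the VAL fold inside the remaining TRAIN rows.
fold_sizes = []
for tr_idx, val_idx in KFold(n_splits=3).split(X_tr):
    # Fit a model on tr_idx rows and score it on val_idx rows here.
    fold_sizes.append((len(tr_idx), len(val_idx)))
```

Averaging the per-fold validation scores gives a more stable estimate than a single split.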
  23. Data Cleaning / Preprocessing. Missing values: drop the missing data, replace them with a statistic (mean / median / mode / clustering / modeling methods), or add a label marking them as missing. Outlier detection: https://en.wikipedia.org/wiki/Outlier. Redundant features: we usually remove them.
  24. Data Cleaning / Preprocessing. Python: Pandas.
  25. Data Cleaning / Preprocessing. Python: Pandas - drop, replace, label.
  26. Data Cleaning / Preprocessing. Example table: Age of User A = 19, User B = 27, User C = 200 (an outlier).
  27. Option 1: drop the outlier row (User C, age 200).
  28. Option 2: add a label. Standardized scores 1.2 / 1.1 / 6.3 expose User C; an outlier column marks the rows 1 / 1 / 0.
  29. Option 3: replace the outlier with a statistic (mean / median / mode / clustering / modeling methods), e.g. User C's age becomes 36.
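A pandas sketch of the three options above (drop / label / replace); the outlier cutoff of 100 is an assumption for this toy table, and the replacement uses the inlier mean rather than the slide's illustrative value:

```python
# Drop / label / replace an outlier in the slide's Age table with pandas.
import pandas as pd

df = pd.DataFrame({"age": [19, 27, 200]}, index=["User A", "User B", "User C"])
is_outlier = df["age"] > 100   # assumed cutoff for this toy example

# Option 1: drop the outlier row.
dropped = df[~is_outlier]

# Option 2: keep the row but add a label column (1 = normal, 0 = outlier, as on the slide).
labeled = df.assign(outlier=(~is_outlier).astype(int))

# Option 3: replace the outlier with a statistic, here the mean of the inliers.
replaced = df["age"].mask(is_outlier, df.loc[~is_outlier, "age"].mean())
```

Which option wins is competition-dependent; the label variant keeps the information that the value was anomalous.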
  30. Categorical Features. One-hot encoding: one binary column per artist (Mayday / Sodagreen / SEKAI_NO_OWARI / The_Beatles → four columns). Clustering / group encoding: map artists to a coarser ID such as language, so Mayday and Sodagreen share a column.
  31. Categorical Features. Count-based encodings for categories T1 / T2 / T3 with counts 23 / 1 / 6: count encoding keeps the raw counts, binary encoding keeps presence flags, and likelihood encoding keeps probabilities (23/30, 1/30, 6/30).
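One-hot, count, and likelihood encoding fit in a few lines of pandas; the artist names come from the slide, the data rows are illustrative:

```python
# One-hot, count, and likelihood encoding of a categorical column.
import pandas as pd

s = pd.Series(["Mayday", "Sodagreen", "Mayday", "The_Beatles"], name="artist")

onehot = pd.get_dummies(s)                           # one binary column per artist
counts = s.map(s.value_counts())                     # count encoding (raw frequencies)
likelihood = s.map(s.value_counts(normalize=True))   # likelihood / probability encoding
```

Likelihood encoding of the *target* (rather than the category frequency, as here) is also common, but it must be computed out-of-fold to avoid leakage.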
  32. Categorical Features (2). Latent representations: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Laplacian Eigenmaps (LE), Locally Linear Embedding (LLE), low-rank approximation / latent factorization, latent topic models. Benefits: reduces computation cost, alleviates overfitting, finds meaningful components, removes noise. https://en.wikipedia.org/wiki/Dimensionality_reduction
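All of the reductions above follow the same fit/transform pattern; a minimal PCA sketch with scikit-learn on random toy data:

```python
# PCA sketch: compress 20 (e.g. one-hot) columns into 5 dense latent components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 20)   # toy feature matrix
Z = PCA(n_components=5).fit_transform(X)     # dense latent representation
```

The same five components can then be fed to any downstream model in place of the original sparse columns.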
  33. Numerical Features. Standardization / normalization (required by many ML algorithms; https://en.wikipedia.org/wiki/Feature_scaling), rescaling, transforming the distribution (logarithmic or tf-idf-like transformations; https://en.wikipedia.org/wiki/Data_transformation_(statistics)), binning / sampling.
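Standardization and a log transform for a heavy-tailed feature, sketched with scikit-learn and numpy on an illustrative column:

```python
# Standardize a numeric column and compress its right tail with log1p.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [10.0], [100.0], [1000.0]])   # heavily right-skewed toy feature
standardized = StandardScaler().fit_transform(x)   # zero mean, unit variance
logged = np.log1p(x)                               # log(1 + x): evens out the scale
```

log1p is preferred over a plain log when zeros may appear in the data.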
  34. Categorical vs. Numerical. Ordinal categories carry an order: HATE / DON'T MIND / LIKE / LOVE can map to 0 / 1 / 2 / 3, or to a nonlinear scale such as exp(value).
  35. Data Sampling. Label imbalance problem: 3:1. Find out more at https://github.com/scikit-learn-contrib/imbalanced-learn
  36. Data Sampling. Over-sampling turns the 3:1 imbalance into 1:1.
  37. Data Sampling. Under-sampling also turns the 3:1 imbalance into 1:1.
  38. Data Sampling. Over-under sampling?
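A random over-sampling sketch for the 3:1 case, using sklearn.utils.resample; the imbalanced-learn toolkit linked above offers richer strategies (under-sampling, SMOTE, and combinations):

```python
# Random over-sampling: duplicate minority rows until the labels balance 1:1.
import numpy as np
from sklearn.utils import resample

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])       # 6:2, i.e. a 3:1 imbalance

minority = X[y == 1]
upsampled = resample(minority, replace=True, n_samples=6, random_state=0)

X_bal = np.vstack([X[y == 0], upsampled])    # 6 majority + 6 resampled minority
y_bal = np.array([0] * 6 + [1] * 6)          # now 1:1
```

Resampling must happen inside each training fold only; resampling before the split leaks duplicated rows into validation.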
  39. Other Feature Kinds. Text-based: natural language processing. Image-based, audio-based: image/signal processing. Time-based: time series. Domain knowledge is important.
  40. Example (1). Text-based: vector space model (https://en.wikipedia.org/wiki/Vector_space_model) and word embeddings (the classic MAN : WOMAN :: KING : QUEEN analogy). Need stemming? Lemmatization?
  41. Example (2). Pipeline: text → segmentation → filtering → term counts used as dummy variables. Word embeddings? Advanced weighting?
  42. Example (3). Image-based: SIFT (https://en.wikipedia.org/wiki/Scale-invariant_feature_transform) and convolutional neural networks (https://en.wikipedia.org/wiki/Convolutional_neural_network).
  43. Realize the Meaning Behind the Observed Features. 2017/05/20 08:00, Taipei: holiday or weekday? Day or night? Asia, Mandarin.
  44. ML Libraries: scikit-learn, xgboost, lightgbm, vowpal wabbit, libsvm, liblinear, libfm, libffm, tensorflow, keras, h2o, caffe, mxnet, ...
  45. Understand the Pros and Cons. Linear models: simple, fast, and easy to tune; low memory footprint; but limited to non-complex relationships. Random forests: work very well in many competitions; fast and easy to tune; but memory hungry.
  46. Understand the Pros and Cons (2). Neural networks: easy end-to-end learning; flexible; but hard to tune/train. SVMs: strong theoretical guarantees; resistant to overfitting; but slow and memory heavy, and usually need a grid search over hyperparameters.
  47. Understand the Pros and Cons (3). Gradient Boosting Machines (GBM): usually unbeatable on dense feature sets. Factorization Machines (FM): the master at dealing with sparse data.
  48. Understand the Pros and Cons (4). There are too many details; find some online courses or ML books, e.g. The Elements of Statistical Learning; Machine Learning: A Probabilistic Perspective; Programming Collective Intelligence; Pattern Recognition and Machine Learning (Information Science and Statistics series).
  49. Understand the Pros and Cons (5). I'll tell you everything.
  50. Exploratory Data Analysis (EDA). Quora duplicate-question pairs as the example: "How do I read and find my YouTube comments?" vs. "How can I see all my Youtube comments?"; "What is the alternative to machine learning?" vs. "How do I over-sample a multi-class imbalance data set?"; "What is the biggest monster in Monster Hunter?" vs. "Is there a Monster Hunter PC game?"
  51. Example EDA. Statistics helps: min, max, variance, mode, etc. Data visualization helps.
  52. Model Ensemble: voting, averaging, bagging, boosting, blending, stacking. (Diagram: predictions 1-4 combined by an AVERAGE step.)
  53. Diverse Models Ensemble. https://mlwave.com/kaggle-ensembling-guide/
  54. Diverse Models Ensemble. https://mlwave.com/kaggle-ensembling-guide/
  55. Diverse Models Ensemble. https://mlwave.com/kaggle-ensembling-guide/
  56. Model Ensemble: voting, averaging, bagging, boosting, blending, stacking. (Diagram: predictions 1-4 combined by an ENSEMBLE MODEL.)
  57. Model Ensembling: voting, averaging, bagging, boosting, blending, stacking. (Diagram: stacking turns base-model predictions into a new feature, then averages.)
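A minimal sketch of averaging versus stacking on toy regression data; real stacking should use out-of-fold predictions, as the linked ensembling guide explains, to avoid leaking the training labels into the meta-model:

```python
# Averaging vs. stacking two base models' predictions (toy regression data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X, y = rng.rand(100, 3), rng.rand(100)

p1 = LinearRegression().fit(X, y).predict(X)   # base model 1
p2 = Ridge(alpha=1.0).fit(X, y).predict(X)     # base model 2

averaged = (p1 + p2) / 2                       # simple averaging
meta_X = np.column_stack([p1, p2])             # stacking: predictions become features
stacked = LinearRegression().fit(meta_X, y).predict(meta_X)
```

The meta-model learns how much to trust each base model, which is why diverse base models (slide 53) help most.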
  58. Other Tricks: data leakage, magic/lucky parameters.
  59. Overall - To Get into the Top: correct validation, good feature extraction, diverse models, proper model ensembling. Advanced way: understand and modify the model.
  60. Step by Step. Feature Sets A/B/C → Model A → Prediction A. Focus on using one single model; extract N features every day; check the validation score.
  61. Step by Step. Diversify the models and try different feature combinations: Feature Sets A/B/C → Models A/B/C → Predictions A/B/C.
  62. Step by Step. Ensemble the models: Feature Sets A/B/C → Models A/B/C → Prediction A.
  63. Step by Step. Package it: Feature Sets A/B/C → Models A/B/C → Prediction A.
  64. Step by Step. Stacking: Feature Sets D/E/F → Models D/E/F → Prediction B, stacked with the first pipeline.
  65. https://www.slideshare.net/NathanielShimoni/starting-data-science-with-kagglecom?qid=8f0c66f9-43ba-4646-8a05-d03bf30b2eeb&v=&b=&from_search=9
  66. https://www.linkedin.com/pulse/ideas-sharing-kaggle-crowdflower-search-results-relevance-mark-peng/
  67. http://blog.kaggle.com/2017/02/27/allstate-claims-severity-competition-2nd-place-winners-interview-alexey-noskov/
  68. http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
  69. Learning from Others/Winners. http://blog.kaggle.com/
  70. Learning from Others/Winners. https://docs.google.com/presentation/d/1bo7SahuYMzEEylVUTbJE29Oot7T4Cqk9SknOS9Lgsq4/edit?usp=sharing
  71. Any questions? changecandy at gmail