Machine Learning: “Rocket Science” or Daily Bread?
Vladimir Alekseichenko

AIMeetup #3: Machine Learning - Rocket Science or Daily Bread?


Changes over time

10 minutes each

36,500,000 minutes

~70 years

Driver vs. Mechanic

dataworkshop.eu

Bike Sharing Demand

Task: Kaggle

Solution: github.com/dataworkshop

Understand Business & Data
Read and explore the data.

Feature Engineering
Create new features based on existing ones.

Feature Selection
Select only the useful features.

Model Selection
Find the best model(s): model A, model B, model C, model D, model E.

Tuning Hyperparameters
Find the best hyperparameters for a given model.

Ensemble Modeling
Combine a few models into one better model: a weighted sum of model B and model E with weights 0.6 and 0.4.

datetime             season  temp   count
2011-01-01 08:32:02  1       9.23   5
2012-04-02 12:10:00  2       18.78  32
2012-08-07 15:47:01  3       15.45  15

datetime             season  temp   hour  day  month  …  count  count_log
2011-01-01 08:32:02  1       9.23   8     1    1      …  5      1.609
2012-04-02 12:10:00  2       18.78  12    2    4      …  32     3.466
2012-08-07 15:47:01  3       15.45  15    7    8      …  15     2.708
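The step from the first table to the second can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code from the talk. The function name `add_features` is hypothetical, and `count_log` is assumed to be the natural logarithm of `count`, which is consistent with the values shown (ln 5 ≈ 1.609).

```python
import math
from datetime import datetime

# The three sample rows from the slide: (datetime, season, temp, count).
ROWS = [
    ("2011-01-01 08:32:02", 1, 9.23, 5),
    ("2012-04-02 12:10:00", 2, 18.78, 32),
    ("2012-08-07 15:47:01", 3, 15.45, 15),
]


def add_features(row):
    """Return a dict with the original columns plus the derived ones."""
    stamp, season, temp, count = row
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {
        "datetime": stamp,
        "season": season,
        "temp": temp,
        "hour": dt.hour,
        "day": dt.day,  # day of the month, as in the table
        "month": dt.month,
        "count": count,
        # Assumption: natural log, which reproduces the slide's values.
        "count_log": round(math.log(count), 3),
    }


features = [add_features(r) for r in ROWS]
for f in features:
    print(f)
```

The columns hidden behind "…" in the slide are not reconstructed here.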


Understand Business and Data

Working days

Weekend


Feature Engineering

• quantitative => ranges: 1 to 10, 11 to 20…

• dates => day, month, year, hour, is it a weekend…

• categorical/qualitative (red, green, white)

  • assign a numeric identifier (1, 2, 3)

  • create n binary columns (is it red? etc.)

  • probabilities with respect to the target variable
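The transformations listed above can be sketched in plain Python. This is an illustrative sketch rather than the workshop's code: all function names are hypothetical, the colors and the binary target `bought` are made up, and "probabilities with respect to the target variable" is interpreted here as the share of positive targets per category.

```python
from collections import defaultdict


def bin_index(value, width=10):
    """Map 1..10 to bin 0, 11..20 to bin 1, and so on (positive integers)."""
    return (value - 1) // width


def label_encode(values):
    """Assign each category a numeric identifier in order of first appearance."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids) + 1)
    return [ids[v] for v in values], ids


def one_hot(values, categories):
    """Create one binary column per category ("is it red?" etc.)."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def target_probability(values, targets):
    """Share of positive targets observed for each category."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for v, t in zip(values, targets):
        totals[v] += 1
        positives[v] += t
    return {c: positives[c] / totals[c] for c in totals}


colors = ["red", "green", "white", "red"]
bought = [1, 0, 1, 0]  # made-up binary target

codes, mapping = label_encode(colors)
print([bin_index(v) for v in (1, 10, 11, 20)])  # [0, 0, 1, 1]
print(codes)  # [1, 2, 3, 1]
print(one_hot(colors, ["red", "green", "white"]))
print(target_probability(colors, bought))
```

Because the last encoding uses the target, it should be computed on training rows only; otherwise information about the answer leaks into the features.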


Feature Selection

• The fewer, the better (a simpler model)

• Keep the most valuable ones (ideally just one :)

• Features are (usually) dependent on each other, so be careful… (verify empirically)

• Faster

Variance

Univariate

Recursive

xgbfir

https://github.com/limexp/xgbfir
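The first two approaches named above, a variance filter and a univariate filter, can be sketched as follows. This is a minimal sketch on made-up data. The helper `pearson` and the column names are hypothetical, and ranking by absolute Pearson correlation is only one possible univariate score. Recursive selection and xgbfir are not shown.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


# Made-up toy data: column name -> values, plus a target.
columns = {
    "hour": [8, 12, 15, 18, 23],
    "constant": [1, 1, 1, 1, 1],
    "noise": [3, 1, 3, 1, 3],
}
target = [5, 32, 40, 50, 70]

# Variance filter: drop features that never change.
varying = {name: vals for name, vals in columns.items() if pvariance(vals) > 0.0}

# Univariate filter: rank the remaining features by |correlation| with the target.
ranking = sorted(
    varying,
    key=lambda name: abs(pearson(varying[name], target)),
    reverse=True,
)
print(ranking)
```

A univariate score looks at one feature at a time, so it cannot see the dependencies between features that the slide warns about. The ranking is a starting point to verify empirically.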


Model Selection

• Linear

• Decision Tree

• Random Forest

• Gradient Boosting

• Neural Network
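Whichever families are on the list, the selection step itself comes down to fitting each candidate on the same training data and comparing them on data they have not seen. The sketch below assumes a single feature, made-up numbers, and two deliberately simple candidates (a mean baseline and a one-feature least-squares line) standing in for the models above. All names are hypothetical.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Ordinary least squares for one feature: y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return lambda x: a + b * x


def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on the given rows."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys)) ** 0.5


# Made-up data: temperature vs. rentals, split into train and validation parts.
train_x, train_y = [5, 10, 15, 20, 25], [8, 18, 31, 39, 52]
valid_x, valid_y = [8, 18, 22], [14, 36, 45]

candidates = {"mean baseline": fit_mean, "linear": fit_linear}
scores = {
    name: rmse(fit(train_x, train_y), valid_x, valid_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, best)
```

The same loop works for any candidate that can be fitted and then called on new rows, which is why adding more model families to the comparison is cheap.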

Linear

https://github.com/dataworkshop/model_evaluation/blob/master/step1-regression.ipynb

Decision Tree

http://xgboost.readthedocs.io/en/latest/model.html

Ensemble trees

http://xgboost.readthedocs.io/en/latest/model.html

Ensemble trees

• Bagging (bootstrap aggregation)

• Random Forest

• Extra Trees

• Boosting

• Gradient Boosting
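To make the bagging branch concrete, here is a rough sketch of bootstrap aggregation. It assumes one numeric feature and uses a one-split "stump" as a tiny stand-in for a full decision tree. The data and all function names are made up for illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all x values identical: predict the mean everywhere
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_bagging(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


xs = list(range(24))  # hour of the day
ys = [10 if 7 <= h <= 19 else 2 for h in xs]  # made-up demand
predict = fit_bagging(xs, ys)
print(predict(3), predict(12))
```

Random Forest adds one more source of randomness on top of this by also sampling the features considered at each split. Boosting differs in that models are trained one after another, each correcting the errors of the previous ones.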

XGBoost(Extreme Gradient Boosting)

“When in doubt, use xgboost”

Owen Zhang

Choosing a Model


Tuning Hyperparameters

• Grid Search

• Random Search

• Bayesian

hyperopt
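The difference between the first two strategies can be shown on a toy problem. In this sketch `validation_error` is a hypothetical stand-in for "train a model with these settings and measure its error". The parameter names `max_depth` and `learning_rate` and their ranges are assumptions chosen for illustration.

```python
import itertools
import random


def validation_error(max_depth, learning_rate):
    """Hypothetical stand-in for training a model and returning its validation error."""
    return (max_depth - 6) ** 2 + 100 * (learning_rate - 0.1) ** 2


def grid_search(grid):
    """Try every combination of the listed values."""
    names = list(grid)
    candidates = (
        dict(zip(names, combo)) for combo in itertools.product(*grid.values())
    )
    return min(candidates, key=lambda p: validation_error(**p))


def random_search(n_trials, seed=0):
    """Sample settings at random from the allowed ranges."""
    rng = random.Random(seed)
    trials = [
        {"max_depth": rng.randint(2, 10), "learning_rate": rng.uniform(0.01, 0.3)}
        for _ in range(n_trials)
    ]
    return min(trials, key=lambda p: validation_error(**p))


grid = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
best_grid = grid_search(grid)
best_random = random_search(20)
print(best_grid)
print(best_random)
```

Grid search costs as many trainings as the grid has cells (12 here), while random search has a fixed budget regardless of the number of parameters. Bayesian-style optimizers such as hyperopt go one step further and use the results of earlier trials to decide what to try next. That part is not shown here.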


Ensemble Modeling
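The overview describes this step as combining a few models into one, with the factors 0.6 and 0.4 shown next to model B and model E. A weighted average of predictions is one simple way to do that. In the sketch below the predictions are made up, the function name `blend` is hypothetical, and pairing 0.6 with model B and 0.4 with model E is an assumption, since the diagram does not make the pairing explicit.

```python
def blend(predictions, weights):
    """Weighted average of several models' predictions, row by row."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return [
        sum(w * p for w, p in zip(weights, row))
        for row in zip(*predictions)
    ]


# Made-up predictions of two models for the same three rows.
model_b = [6.0, 30.0, 14.0]
model_e = [4.0, 35.0, 16.0]

# Assumption: 0.6 belongs to model B and 0.4 to model E.
blended = blend([model_b, model_e], [0.6, 0.4])
print(blended)
```

In practice the weights are themselves chosen on validation data, for example by trying a few values and keeping the mix with the lowest error.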

Neuron

(Artificial) Neural Network
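A single artificial neuron is small enough to write out in full: a weighted sum of the inputs plus a bias, passed through an activation function. The sketch below assumes a sigmoid activation and uses made-up weights. In a real network the weights are learned from data, which is not shown.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def layer(inputs, weight_rows, biases):
    """A layer is several neurons looking at the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# A tiny network: two inputs -> two hidden neurons -> one output neuron.
hidden = layer([1.0, 0.0], [[2.0, -1.0], [-3.0, 0.5]], [0.0, 1.0])
output = neuron(hidden, [1.0, -1.0], 0.0)
print(hidden, output)
```

Stacking such layers, so that the outputs of one become the inputs of the next, is what turns individual neurons into a network.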

MNIST

Data

Neural Network error: 1.60%

Source: http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

Challenges

Overfitting

http://mlwiki.org/index.php/Overfitting

Cross-Validation

http://blog.goldenhelix.com/bchristensen/cross-validation-for-genomic-prediction-in-svs/
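Cross-validation guards against overfitting by always scoring a model on rows it was not trained on. A minimal k-fold sketch follows, assuming made-up counts and a trivial "predict the training mean" model in place of a real one. The function names are hypothetical and the rows are not shuffled.

```python
from statistics import mean


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; every row lands in exactly one test fold."""
    indices = list(range(n_rows))
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_val_mse(ys, k):
    """Score a 'predict the training mean' baseline with k-fold cross-validation."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        guess = mean(ys[i] for i in train)  # "fit" on the training rows only
        scores.append(mean((ys[i] - guess) ** 2 for i in test))
    return mean(scores)


counts = [5, 32, 15, 40, 8, 27, 19, 11, 36, 22]  # made-up targets
folds = list(k_fold_indices(len(counts), 3))
print([test for _, test in folds])
print(cross_val_mse(counts, 3))
```

A model that scores much better on its training rows than on the held-out folds is overfitting. Averaging over k folds makes that comparison less dependent on one lucky or unlucky split.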

Creativity is worth a lot

Source: https://techcrunch.com/2016/11/19/how-data-science-and-rocket-science-will-get-humans-to-mars

The wave is already coming… are you ready?

Thank you

@slon1024

dataworkshop.eu