Page 1: Boruta

Boruta – A System for Feature Selection

Miron B. Kursa, Aleksander Jankowski, Witold R. Rudnicki

Brandon Sherman

December 1, 2016

Page 2: Boruta

Background

Outline

1 Background

2 Boruta

3 Example – Detecting Aptamer Sequences

4 Questions?

Page 3: Boruta

Background

Red deer; Białowieża Forest - Podlaskie Voivodeship, Poland

Page 4: Boruta

Background

How do random forests work?

A random forest is a machine learning classifier that works as follows:

1 For a dataset (X, Y) with observations {x1, x2, ..., xN} and response {y1, y2, ..., yN} (yi ∈ {0, 1}), bootstrap B datasets of size N.
Bootstrapping: for a dataset (X, Y) with N observations, generate another dataset (Xb, Yb) with N observations by sampling from (X, Y) with replacement.
2 For b = 1, ..., B, train a decision tree on dataset (Xb, Yb).
3 Predict a new observation Xnew by taking the majority vote over all of the trees.
4 Calculate the out-of-bag error, which is the mean prediction error of each decision tree on the observations not included in its bootstrapped sample.
5 Calculate the feature importance of Xj by randomly permuting Xj (producing Xj(s)) and comparing the resulting out-of-bag error to the out-of-bag error with the original Xj (a minimal R sketch follows).
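A minimal R sketch of these steps using the randomForest package; the data frame train_df and its binary factor response y are placeholder names, not from the slides:

# Fit a random forest and inspect its OOB error and permutation importances.
# `train_df` and its response column `y` are placeholder names.
library(randomForest)

set.seed(1)
rf <- randomForest(y ~ ., data = train_df,
                   ntree = 500,        # B bootstrapped trees
                   importance = TRUE)  # compute permutation-based importance

rf$err.rate[rf$ntree, "OOB"]   # out-of-bag error estimate
importance(rf, type = 1)       # mean decrease in accuracy when each Xj is permuted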

Page 5: Boruta

Background

Problems with Feature Importances

“Feature importances can’t tell you the emotional story. They can give you the exact mathematical design, but what’s missing is the eyebrows.” – Frank Zappa (heavily paraphrased)

Page 6: Boruta

Background

Problems with Feature Importances

Vague, nebulous notion of “how important is this feature?”
Feature importances are relative, so there is no notion of “how important is this feature on its own?”
No idea how many features are needed.
Is a feature with small importance unimportant or slightly important?
The “Breiman assumption”, which states that feature importance is normally distributed, is false.

Page 7: Boruta

Boruta

Outline

1 Background

2 Boruta

3 Example – Detecting Aptamer Sequences

4 Questions?

Page 8: Boruta

Boruta

Boruta (Slavic forest spirit); artist’s depiction

Page 9: Boruta

Boruta

A Quick Diversion

Fire stockbroker Dan Aykroyd and hire untrained homeless man Eddie Murphy.
Can Eddie Murphy be as good a stockbroker as Dan Aykroyd?
If so, then there’s nothing inherent about Dan Aykroyd that makes him a good stockbroker.

Page 10: Boruta

Boruta

Boruta Algorithm

For each feature Xj, randomly permute it to generate a “shadow feature” (random attribute) Xj(s).
Fit a random forest to the original and the shadow features.
Calculate feature importances for the original and shadow features.
A feature is important for a single run if its importance is higher than the maximum importance of all shadow features (MIRA).
Eliminate all features whose importances across all runs are low enough; keep all features whose importances across all runs are high enough.
Repeat from the beginning with all tentative features (a sketch with the Boruta R package follows this list).
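These steps are what the Boruta R package automates; a hedged sketch of a typical call, again assuming a placeholder data frame train_df with response y:

# Run Boruta on a placeholder data frame `train_df` with response `y`.
library(Boruta)

set.seed(1)
bor <- Boruta(y ~ ., data = train_df,
              maxRuns = 100,  # upper bound on the number of random-forest runs
              doTrace = 1)    # print progress

print(bor)                           # counts of confirmed / tentative / rejected features
getSelectedAttributes(bor)           # features deemed important
bor_fixed <- TentativeRoughFix(bor)  # force a decision on tentative features
attStats(bor_fixed)                  # per-feature importance statistics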

Page 11: Boruta

Boruta

Boruta – Now with more math! (Part 1)

Iterate the following procedure N times for all original features {X1, ..., Xp}:

1 Create a random forest from the original and newly generated shadow features.
2 Calculate all Imp(Xj) and MIRA.
3 If a particular Imp(Xj) > MIRA, then increment Hj and call Xj important for the run (a hand-rolled sketch of one run follows).
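For intuition, one such run sketched by hand in R (simplified, not the Boruta package internals; train_df and y remain placeholder names):

# One illustrative Boruta-style run (simplified; not the Boruta package internals).
library(randomForest)

X <- train_df[, setdiff(names(train_df), "y")]  # original features (placeholder data)
y <- train_df$y

shadow <- as.data.frame(lapply(X, sample))      # shadow features: permuted copies
names(shadow) <- paste0("shadow_", names(X))

rf   <- randomForest(x = cbind(X, shadow), y = y, importance = TRUE)
imp  <- importance(rf, type = 1)[, 1]           # mean decrease in accuracy, by feature
MIRA <- max(imp[grepl("^shadow_", names(imp))]) # max importance among shadow features

hits <- imp[names(X)] > MIRA                    # hit: Imp(Xj) > MIRA in this run
# Across N runs, accumulate the hit counts: H <- H + hits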

Page 12: Boruta

Boruta

Boruta – Now with more math! (Part 2)

Once {H1, ..., Hp} have been calculated:

1 Perform the statistical test H0: Hi = E(H) vs. H1: Hi ≠ E(H). Because hits follow a binomial distribution with N trials and success probability 0.5, we have, approximately, Hi ~ N(0.5N, (√(0.25N))²), i.e. mean 0.5N and variance 0.25N.
2 If Hi is significantly greater than E(H), then we say the feature is important.
3 If Hi is significantly lower than E(H), then we say the feature is unimportant.
4 Finish the procedure after some number of iterations, or when all features have been rejected or deemed important. Otherwise, repeat the procedure from the beginning on all tentative features (a sketch of the decision rule follows).
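A sketch of this decision rule in R, using an exact two-sided binomial test in place of the normal approximation; the hit count H, run count N, and threshold alpha are placeholders:

# Decide a feature's status from its hit count H over N runs (illustrative).
decide <- function(H, N, alpha = 0.01) {
  p <- binom.test(H, N, p = 0.5)$p.value  # H0: H ~ Binomial(N, 0.5), so E(H) = 0.5 * N
  if (p < alpha && H > N / 2) "important"
  else if (p < alpha && H < N / 2) "unimportant"
  else "tentative"
}

decide(H = 80, N = 100)  # far above 0.5 * N -> "important"
decide(H = 52, N = 100)  # close to 0.5 * N  -> "tentative"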

Page 13: Boruta

Example – Detecting Aptamer Sequences

Outline

1 Background

2 Boruta

3 Example – Detecting Aptamer Sequences

4 Questions?

Page 14: Boruta

Example – Detecting Aptamer Sequences

The Big Question

Which DNA sequences are indicative of aptamers?

Page 15: Boruta

Example – Detecting Aptamer Sequences

Page 16: Boruta

Example – Detecting Aptamer Sequences

The Aptamer Problem

An aptamer is an RNA or DNA chain that strongly binds to various molecular targets.
An aptamer is represented as a genetic sequence that contains certain k-mers.

A, GC, TGA, and AGAC are a 1-, 2-, 3-, and 4-mer, respectively.

Each row of the dataset describes one sequence: its features record which of the p k-mers appear in the sequence (presence marked with 1, absence with 0), and the response records whether or not the sequence is an aptamer (a toy encoding is sketched below).
Very few sequences in the dataset (small n).
Many possible k-mers (high p).
How do we know which k-mers make up aptamers?
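A toy sketch of this presence/absence encoding in R; the sequences and k-mers below are made up for illustration:

# Toy k-mer presence/absence encoding (made-up sequences and k-mers).
seqs  <- c("ATGCGTA", "GGCATTA", "TTAGCGC")
kmers <- c("ATG", "GCA", "TTA", "CGC")

X <- t(sapply(seqs, function(s)
  as.integer(sapply(kmers, function(k) grepl(k, s, fixed = TRUE)))))
colnames(X) <- kmers

X   # one row per sequence, 1 if the k-mer occurs in it, 0 otherwise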

Page 17: Boruta

Example – Detecting Aptamer Sequences

Boruta and the Aptamer Problem

For a dataset consisting of n genetic sequences, p 3-, 4-, and 5-mers, and whether or not each sequence makes up an aptamer:

1 Create a shadow feature for each of the p k-mers.
2 Run Boruta on the combined dataset.
3 Build a random forest on all k-mers selected by Boruta.
4 Calculate the out-of-bag (OOB) error of the new random forest model; 30% is the maximum acceptable OOB error (the pipeline is sketched below).
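A sketch of this pipeline in R; the k-mer data frame kmer_df and its factor response aptamer are placeholder names, and the 30% threshold is the slide's:

# Aptamer pipeline sketch; `kmer_df` with factor response `aptamer` is a placeholder.
library(Boruta)
library(randomForest)

set.seed(1)
bor      <- Boruta(aptamer ~ ., data = kmer_df)  # shadow features are generated internally
selected <- getSelectedAttributes(TentativeRoughFix(bor))

rf  <- randomForest(x = kmer_df[, selected, drop = FALSE], y = kmer_df$aptamer)
oob <- rf$err.rate[rf$ntree, "OOB"]              # out-of-bag error of the reduced model

if (oob <= 0.30) selected else character(0)      # discard the result if OOB error > 30%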

Page 18: Boruta

Example – Detecting Aptamer Sequences

Example: ATP Binding Sites

See 2-Boruta.R

Page 19: Boruta

Example – Detecting Aptamer Sequences

Aptamer Problem Results

Out of 23 genetic sequence datasets:
2 had OOB error greater than 30% and didn’t select any sequences.
1 had its OOB error increase after Boruta, from 11% to 38%.
20 had an average out-of-bag error of 11% and, on average, selected 18 out of 1170 k-mers.
The k-mers selected were known to be important based on past biological knowledge.

Boruta does pretty darn well!

Page 20: Boruta

Questions?

Outline

1 Background

2 Boruta

3 Example – Detecting Aptamer Sequences

4 Questions?

Page 21: Boruta

Questions?

Any questions?
