
Page 1: Introduction to Boosting

Introduction to Boosting

Slides adapted from Che Wanxiang (车万翔) at HIT and Robin Dhamankar. Many thanks!

Page 2: Introduction to Boosting

Ideas

Boosting is considered to be one of the most significant developments in machine learning.

Finding many weak rules of thumb is easier than finding a single, highly accurate prediction rule.

The key is in how the weak rules are combined.

Page 3: Introduction to Boosting
Page 4: Introduction to Boosting
Page 5: Introduction to Boosting
Page 6: Introduction to Boosting
Page 7: Introduction to Boosting
Page 8: Introduction to Boosting
Page 9: Introduction to Boosting
Page 10: Introduction to Boosting
Page 11: Introduction to Boosting
Page 12: Introduction to Boosting
Page 13: Introduction to Boosting
Page 14: Introduction to Boosting
Page 15: Introduction to Boosting

Boosting (Algorithm)

W(x) is the distribution of weights over the N training points, with ∑ W(xi) = 1. Initially assign uniform weights W0(x) = 1/N for all x, and set step k = 0.

At each iteration k:

- Find the best weak classifier Ck(x) using the weights Wk(x), with error rate εk based on a loss function (a stump-based sketch of this step follows below).
- Assign the weight αk, the classifier Ck's weight in the final hypothesis.
- For each xi, update the weights based on εk to get Wk+1(xi).

CFINAL(x) = sign [ ∑ αi Ci (x) ]
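The step "find the best weak classifier Ck(x) using the weights Wk(x)" is easiest to picture with decision stumps, the weak learners used later in the performance discussion. Below is a minimal sketch, assuming NumPy and labels encoded as ±1; the helper names `fit_stump` and `stump_predict` are illustrative, not from the slides.

```python
import numpy as np

def fit_stump(X, y, w):
    """Minimal weighted decision stump: pick the (feature, threshold, sign)
    with the lowest weighted misclassification error. y is in {-1, +1},
    w is a normalized weight vector (the Wk(x) of the slide)."""
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):                        # try every feature
        for thr in np.unique(X[:, j]):        # candidate thresholds
            for sign in (+1, -1):             # which side predicts +1
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w * (pred != y))  # weighted error (w sums to 1)
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best, best_err

def stump_predict(stump, X):
    """Apply a fitted stump to new points."""
    j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)
```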

Page 16: Introduction to Boosting

Boosting (Algorithm)

Page 17: Introduction to Boosting
Page 18: Introduction to Boosting
Page 19: Introduction to Boosting

Boosting As Additive Model

The final prediction in boosting, f(x), can be expressed as an additive expansion of individual classifiers:

$$f(x) = \sum_{m=1}^{M} \beta_m\, b(x;\gamma_m)$$

The process is iterative and can be expressed as follows:

$$f_m(x) = f_{m-1}(x) + \beta_m\, b(x;\gamma_m)$$

Typically we would try to minimize a loss function on the training examples:

$$\min_{\{\beta_m,\gamma_m\}_{1}^{M}} \; \sum_{i=1}^{N} L\!\left(y_i,\; \sum_{m=1}^{M} \beta_m\, b(x_i;\gamma_m)\right)$$
Page 20: Introduction to Boosting

Boosting As Additive Model

Simple case: squared-error loss,

$$L(y, f(x)) = \tfrac{1}{2}\,\big(y - f(x)\big)^2$$

Forward stage-wise modeling then amounts to just fitting the residuals from the previous iteration:

$$L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\big) = \tfrac{1}{2}\big(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\big)^2 = \tfrac{1}{2}\big(r_{im} - \beta\, b(x_i;\gamma)\big)^2$$

where $r_{im} = y_i - f_{m-1}(x_i)$ is the residual. Squared-error loss, however, is not robust for classification.
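A minimal sketch of this squared-error case, assuming NumPy (the helper names are illustrative, not from the slides): each stage fits a regression stump b(x; γ_m) to the current residuals and adds it to the model.

```python
import numpy as np

def fit_regression_stump(X, r):
    """Least-squares stump: split one feature at a threshold and predict the
    mean residual on each side."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left, right = X[:, j] <= thr, X[:, j] > thr
            if left.all() or right.all():
                continue
            cl, cr = r[left].mean(), r[right].mean()
            sse = ((r[left] - cl) ** 2).sum() + ((r[right] - cr) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, thr, cl, cr)
    return best

def stage_wise_fit(X, y, M=50):
    """Forward stage-wise modeling: at stage m, fit b(x; gamma_m) to the
    residuals r_im = y_i - f_{m-1}(x_i) and add it to the model."""
    f = np.zeros(len(y))
    model = []
    for _ in range(M):
        stump = fit_regression_stump(X, y - f)     # fit the current residuals
        if stump is None:                          # no useful split left
            break
        j, thr, cl, cr = stump
        f += np.where(X[:, j] <= thr, cl, cr)      # f_m = f_{m-1} + b(x; gamma_m)
        model.append(stump)
    return model
```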

Page 21: Introduction to Boosting

Boosting As Additive Model

AdaBoost for classification uses the exponential loss function, L(y, f(x)) = exp(−y · f(x)). At each stage we solve

$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, G(x_i)\big)$$

$$= \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big[-y_i\big(f_{m-1}(x_i) + \beta\, G(x_i)\big)\big]$$

$$= \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\!\big(-y_i f_{m-1}(x_i)\big)\, \exp\!\big(-\beta\, y_i\, G(x_i)\big)$$

Page 22: Introduction to Boosting

Boosting As Additive Model

First assume that β is constant, and minimize w.r.t. G:

$$G_m = \arg\min_{G} \sum_{i=1}^{N} w_i^{(m)}\, \exp\!\big(-\beta\, y_i\, G(x_i)\big), \quad \text{where } w_i^{(m)} = \exp\!\big(-y_i f_{m-1}(x_i)\big)$$

$$= \arg\min_{G} \left[\, e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} \;+\; e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)} \right]$$

$$= \arg\min_{G} \left[\, \big(e^{\beta} - e^{-\beta}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq G(x_i)\big) \;+\; e^{-\beta} \sum_{i=1}^{N} w_i^{(m)} \right]$$

Page 23: Introduction to Boosting

Boosting As Additive Model

$$G_m = \arg\min_{G} \left[\, \big(e^{\beta} - e^{-\beta}\big) \sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq G(x_i)\big) + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)} \right] = \arg\min_{G}\; \overline{err}_m$$

where $\overline{err}_m = \dfrac{\sum_{i=1}^{N} w_i^{(m)}\, I\big(y_i \neq G(x_i)\big)}{\sum_{i=1}^{N} w_i^{(m)}}$ is the training error of G on the weighted samples.

The last equation tells us that in each iteration we must find a classifier that minimizes the training error on the weighted samples.

Page 24: Introduction to Boosting

Boosting As Additive Model

Now that we have found G, we minimize w.r.t. β. Writing the objective (divided by $\sum_{i} w_i^{(m)}$) as

$$H(\beta) = \overline{err}_m\, e^{\beta} + \big(1 - \overline{err}_m\big)\, e^{-\beta},$$

setting the derivative to zero gives

$$H'(\beta) = \overline{err}_m\, e^{\beta} - \big(1 - \overline{err}_m\big)\, e^{-\beta} = 0 \;\Longrightarrow\; e^{2\beta} = \frac{1 - \overline{err}_m}{\overline{err}_m} \;\Longrightarrow\; \beta_m = \frac{1}{2}\log\!\left(\frac{1 - \overline{err}_m}{\overline{err}_m}\right)$$
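As a quick numeric sanity check (values chosen for illustration, not from the slides): a weighted error of $\overline{err}_m = 0.3$ gives

$$\beta_m = \tfrac{1}{2}\log\frac{1 - 0.3}{0.3} = \tfrac{1}{2}\log\frac{7}{3} \approx 0.42,$$

and $\beta_m \to 0$ as $\overline{err}_m \to 0.5$, i.e. a weak classifier barely better than chance gets almost no say. Note that the algorithm on the next slide uses $\alpha_k = \log\big((1-\varepsilon_k)/\varepsilon_k\big) = 2\beta_m$; the missing factor of one half rescales all votes equally and does not change the sign of the final classifier.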

Page 25: Introduction to Boosting

AdaBoost (Algorithm)

W(x) is the distribution of weights over the N training points, with ∑ W(xi) = 1. Initially assign uniform weights W0(x) = 1/N for all x.

At each iteration k:

- Find the best weak classifier Ck(x) using the weights Wk(x).
- Compute the error rate εk as εk = [ ∑ Wk(xi) ∙ I(yi ≠ Ck(xi)) ] / [ ∑ Wk(xi) ]
- Set αk = log((1 − εk)/εk), the classifier Ck's weight in the final hypothesis.
- For each xi, update Wk+1(xi) = Wk(xi) ∙ exp[ αk ∙ I(yi ≠ Ck(xi)) ]

CFINAL(x) = sign [ ∑ αi Ci (x) ]
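A minimal sketch corresponding to the algorithm above, assuming NumPy, labels in {-1, +1}, and the illustrative `fit_stump` / `stump_predict` helpers from the earlier sketch (none of these names come from the slides).

```python
import numpy as np

def adaboost_fit(X, y, K=50):
    """AdaBoost sketch following the slide; y is in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                  # W0(x) = 1/N
    ensemble = []                            # list of (alpha_k, C_k)
    for _ in range(K):
        stump, _ = fit_stump(X, y, w)        # best weak classifier under Wk
        pred = stump_predict(stump, X)
        miss = (pred != y)
        eps = np.sum(w * miss) / np.sum(w)   # weighted error rate eps_k
        if eps <= 0 or eps >= 0.5:           # perfect or useless weak learner
            if eps <= 0:
                ensemble.append((1.0, stump))
            break
        alpha = np.log((1 - eps) / eps)      # alpha_k = log((1 - eps_k)/eps_k)
        w = w * np.exp(alpha * miss)         # up-weight the mistakes
        w = w / np.sum(w)                    # renormalize so sum W(xi) = 1
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """C_FINAL(x) = sign( sum_k alpha_k * C_k(x) )."""
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.sign(score)
```

For example, `adaboost_predict(adaboost_fit(X, y, K=10), X)` returns the combined ±1 votes on the training set.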

Page 26: Introduction to Boosting

AdaBoost(Example)

Original training set: equal weights for all training samples

Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire

Page 27: Introduction to Boosting

AdaBoost(Example)

ROUND 1

Page 28: Introduction to Boosting

AdaBoost(Example)

ROUND 2

Page 29: Introduction to Boosting

AdaBoost(Example)

ROUND 3

Page 30: Introduction to Boosting

AdaBoost(Example)

Page 31: Introduction to Boosting

AdaBoost (Characteristics)

Why the exponential loss function?

Computational: simple, modular re-weighting; the derivative is easy, so determining the optimal parameters is relatively easy.

Statistical: in the two-label case it estimates one half the log-odds of P(Y=1|x), so we can use the sign as the classification rule.

Accuracy depends on the number of iterations (how sensitive? we will see soon).
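The statistical point can be made precise with a standard result about the exponential loss (not derived on the slide): its population minimizer is

$$f^{*}(x) = \arg\min_{f(x)} \; \mathbb{E}_{Y \mid x}\!\left[e^{-Y f(x)}\right] = \frac{1}{2}\,\log\frac{P(Y=1 \mid x)}{P(Y=-1 \mid x)},$$

so $f^{*}(x) > 0$ exactly when $P(Y=1 \mid x) > 1/2$, which is why the sign of the boosted score serves as the classification rule.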

Page 32: Introduction to Boosting

Boosting performance

Decision stumps are very simple rules of thumb that test a condition on a single attribute.

Decision stumps formed the individual classifiers whose predictions were combined to generate the final prediction.

The misclassification rate of the Boosting algorithm was plotted against the number of iterations performed.
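To reproduce this kind of curve with the illustrative sketch above, one can record the misclassification rate of the partial ensemble after each round; `staged_error` below is a hypothetical helper built on the earlier `adaboost_fit` / `stump_predict` sketches, not something from the slides.

```python
import numpy as np

def staged_error(ensemble, X, y):
    """Misclassification rate of the partial ensemble after each boosting
    round, i.e. the 'error vs. number of iterations' curve."""
    score = np.zeros(len(y))
    errors = []
    for alpha, stump in ensemble:
        score += alpha * stump_predict(stump, X)   # add one more weighted vote
        errors.append(np.mean(np.sign(score) != y))
    return errors
```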

Page 33: Introduction to Boosting

Boosting performance

Steep decrease in error

Page 34: Introduction to Boosting

Boosting performance

Pondering over how many iterations would be sufficient, some observations:

The first few iterations (about 50) increase the accuracy substantially, as seen in the steep decrease in the misclassification rate.

As iterations increase, does the training error keep decreasing? Does the generalization error keep decreasing?

Page 35: Introduction to Boosting

Can Boosting do well if?

Limited training data? Probably not.

Many missing values? Noise in the data? Individual classifiers not very accurate? It could, if the individual classifiers have considerable mutual disagreement.

Page 36: Introduction to Boosting

Application: Data mining

Challenges in real-world data mining problems:

Data has a large number of observations and a large number of variables on each observation.

Inputs are a mixture of different kinds of variables.

Missing values, outliers, and variables with skewed distributions.

Results need to be obtained fast, and they should be interpretable.

Off-the-shelf techniques that cope with all of this are hard to come by. Boosted decision trees (AdaBoost or MART) come close to an off-the-shelf technique for data mining (a minimal example is sketched below).
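For contrast with the hand-rolled sketches above, a minimal off-the-shelf example, assuming scikit-learn is available (the dataset and parameter values are arbitrary illustrations, not from the slides):

```python
# MART-style boosted trees via scikit-learn: shallow trees fit stage-wise
# with a learning rate, then combined additively.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```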

Page 37: Introduction to Boosting
Page 38: Introduction to Boosting
Page 39: Introduction to Boosting
Page 40: Introduction to Boosting

AT&T “May I help you?”

Page 41: Introduction to Boosting