
Page 1

CSE 446 Logistic Regression

Winter 2012

Dan Weld

Some slides from Carlos Guestrin, Luke Zettlemoyer

Page 2

Gaussian Naïve Bayes

P(Xi = x | Y = yk) = N(μik, σik)

P(Y | X) ∝ P(X | Y) P(Y)

N(μik, σik) = 1 / (σik √(2π)) · exp( -(x - μik)² / (2 σik²) )

Sometimes assume the variance
– is independent of Y (i.e., σi),
– or independent of Xi (i.e., σk),
– or both (i.e., σ)
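A minimal Python sketch (not from the slides) of estimating the Gaussian Naïve Bayes parameters μik, σik and the priors P(Y=yk) from data, then classifying with Bayes rule; the function names and array shapes are assumptions for illustration.

import numpy as np

def fit_gnb(X, y):
    # X: (n_samples, n_features) continuous features; y: (n_samples,) class labels
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}               # P(Y = y_k)
    mus    = {k: X[y == k].mean(axis=0) for k in classes}        # mu_ik for each feature i
    sigmas = {k: X[y == k].std(axis=0) + 1e-9 for k in classes}  # sigma_ik (small constant avoids division by zero)
    return classes, priors, mus, sigmas

def predict_gnb(x, classes, priors, mus, sigmas):
    # P(Y=y_k | x) is proportional to P(Y=y_k) * prod_i N(x_i; mu_ik, sigma_ik); compare in log space
    log_posteriors = []
    for k in classes:
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigmas[k] ** 2)
                                + (x - mus[k]) ** 2 / sigmas[k] ** 2)
        log_posteriors.append(np.log(priors[k]) + log_lik)
    return classes[int(np.argmax(log_posteriors))]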

Page 3

Generative vs. Discriminative Classifiers

• Want to learn: h: X → Y
– X – features
– Y – target classes

• Bayes optimal classifier: P(Y|X)

• Generative classifiers, e.g., Naïve Bayes:
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X=x)
– This is a ‘generative’ model
• Indirect computation of P(Y|X) through Bayes rule
• As a result, can also generate a sample of the data: P(X) = Σy P(y) P(X|y)

• Discriminative classifiers, e.g., Logistic Regression:
– Assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data
– This is the ‘discriminative’ model
• Directly learn P(Y|X)
• But cannot obtain a sample of the data, because P(X) is not available

P(Y | X) ∝ P(X | Y) P(Y)

Page 4

Univariate Linear Regression

hw(x) = w1x + w0

Loss(hw) = Σj=1..n L2(yj, hw(xj)) = Σj=1..n (yj - hw(xj))² = Σj=1..n (yj - (w1xj + w0))²
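As a small illustration (not from the slides), the loss above can be evaluated directly; the function name and argument order are assumptions.

import numpy as np

def univariate_l2_loss(w0, w1, x, y):
    # Loss(h_w) = sum_j (y_j - (w1 * x_j + w0))^2, with x and y as equal-length arrays
    return float(np.sum((y - (w1 * x + w0)) ** 2))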

Page 5

Understanding Weight Space

Loss(hw) = Σj=1..n (yj - (w1xj + w0))²

hw(x) = w1x + w0

Page 6

Understanding Weight Space

hw(x) = w1x + w0

Loss(hw) = Σj=1..n (yj - (w1xj + w0))²

Page 7

Finding Minimum Loss

hw(x) = w1x + w0

Loss(hw) = Σj=1..n (yj - (w1xj + w0))²

Argminw Loss(hw):

∂Loss(hw) / ∂w0 = 0

∂Loss(hw) / ∂w1 = 0

Page 8

Unique Solution!

hw(x) = w1x + w0

Argminw Loss(hw):

w1 = [ N Σ(xjyj) - (Σxj)(Σyj) ] / [ N Σ(xj²) - (Σxj)² ]

w0 = [ Σyj - w1 Σxj ] / N
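A minimal Python sketch of the closed-form solution above; the function name is an assumption.

import numpy as np

def fit_univariate(x, y):
    # Unique minimizer of sum_j (y_j - (w1 * x_j + w0))^2
    N = len(x)
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    return w0, w1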

Page 9

Could also Solve Iteratively

Argminw Loss(hw):

w := any point in weight space
Loop until convergence:
  For each wi in w do:
    wi := wi - α ∂Loss(w)/∂wi
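A sketch of this loop in Python for the univariate loss; the learning rate alpha, tolerance, and iteration cap are assumptions (the slide leaves them unspecified).

import numpy as np

def gradient_descent_univariate(x, y, alpha=1e-3, tol=1e-8, max_iters=100000):
    w0, w1 = 0.0, 0.0                    # any point in weight space
    for _ in range(max_iters):
        err = y - (w1 * x + w0)
        g0 = -2.0 * np.sum(err)          # dLoss/dw0
        g1 = -2.0 * np.sum(err * x)      # dLoss/dw1
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1
        if max(abs(g0), abs(g1)) < tol:  # "loop until convergence"
            break
    return w0, w1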

Page 10

Multivariate Linear Regression

hw(xj) = w0 + Σi wi xj,i = Σi wi xj,i = wᵀxj   (taking xj,0 = 1)

Argminw Loss(hw) has the unique solution w* = (XᵀX)⁻¹ Xᵀy
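A minimal numpy sketch of this unique solution; solving the linear system is used here instead of an explicit matrix inverse, and the function name is an assumption.

import numpy as np

def fit_multivariate(X, y):
    # Prepend a column of ones so the first weight plays the role of w0
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # w* = (X^T X)^(-1) X^T y, computed as a linear solve
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)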

Problem….

Page 11

Overfitting

Regularize!! Penalize high weights.

Loss(hw) = Σj=1..n (yj - (w1xj + w0))² + Σi=1..k wi²

Alternatively….

Loss(hw) = Σj=1..n (yj - (w1xj + w0))² + Σi=1..k |wi|
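A sketch of both penalized losses in Python; the regularization weight lam is an assumption (the slide writes the penalty without an explicit coefficient).

import numpy as np

def ridge_loss(w, X, y, lam=1.0):
    # sum_j (y_j - w . x_j)^2 + lam * sum_i w_i^2   (L2 penalty)
    r = y - X @ w
    return float(r @ r + lam * np.sum(w ** 2))

def lasso_loss(w, X, y, lam=1.0):
    # sum_j (y_j - w . x_j)^2 + lam * sum_i |w_i|   (L1 penalty)
    r = y - X @ w
    return float(r @ r + lam * np.sum(np.abs(w)))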

Page 12

Regularization

[Figure: L1 vs. L2 regularization]

Page 13

Back to Classification

[Figure: regions where P(edible|X)=1 and P(edible|X)=0, separated by a decision boundary]

Page 14

Logistic Regression

Learn P(Y|X) directly! Assume a particular functional form… Not differentiable…

[Figure: a step function jumping from P(Y)=0 to P(Y)=1]

Page 15

Logistic Regression

Learn P(Y|X) directly! Assume a particular functional form: the logistic function (aka sigmoid)

Page 16

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data:

P(Y=1|X) = 1 / (1 + exp( -(w0 + Σi wiXi) ))

Features can be discrete or continuous!
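A minimal Python sketch of this functional form and the resulting classifier; the function names are assumptions.

import numpy as np

def p_y1_given_x(w0, w, x):
    # P(Y=1 | X=x) = 1 / (1 + exp(-(w0 + sum_i w_i * x_i)))
    z = w0 + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

def classify(w0, w, x):
    # Predict Y=1 exactly when w0 + sum_i w_i * x_i > 0 (the linear rule derived on a later slide)
    return int(w0 + np.dot(w, x) > 0)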

Page 17

Understanding Sigmoids

[Figure: three sigmoid plots over x from -6 to 6, with outputs between 0 and 1, for (w0=0, w1=-1), (w0=-2, w1=-1), and (w0=0, w1=-0.5)]

Page 18

Very convenient!

P(Y=1|X) > P(Y=0|X)

implies

ln [ P(Y=1|X) / P(Y=0|X) ] > 0, where ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σi wiXi

implies

predict Y=1 exactly when w0 + Σi wiXi > 0: a linear classification rule!

©Carlos Guestrin 2005-2009

Page 19

©Carlos Guestrin 2005-2009

Logistic regression more generally

Logistic regression in the more general case, where Y ∈ {y1, …, yR}:

P(Y = yk | X) = exp( wk0 + Σi wki Xi ) / ( 1 + Σj=1..R-1 exp( wj0 + Σi wji Xi ) )   for k < R

P(Y = yR | X) = 1 / ( 1 + Σj=1..R-1 exp( wj0 + Σi wji Xi ) )   for k = R (normalization, so no weights for this class)

Features can be discrete or continuous!
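A sketch of these R-class probabilities in numpy; the array layout (one weight row per class k < R, none for class R) mirrors the equations above, and the names are assumptions.

import numpy as np

def multiclass_probs(W, b, x):
    # W: (R-1, n_features) weights, b: (R-1,) intercepts; class R carries no weights
    scores = np.exp(b + W @ x)             # exp(w_k0 + sum_i w_ki * x_i) for k < R
    Z = 1.0 + np.sum(scores)               # shared normalization term
    return np.append(scores / Z, 1.0 / Z)  # P(Y=y_k|x) for k < R, then P(Y=y_R|x)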

Page 20

Loss Functions: Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) loss function: data likelihood
ln Πj P(xj, yj | w) = Σj ln P(yj | xj, w) + Σj ln P(xj | w)

Discriminative (Logistic Regression) loss function: conditional data likelihood
ln Πj P(yj | xj, w)

Discriminative models can’t compute P(xj | w)! Or, … “they don’t waste effort learning P(X)”: focus only on P(Y|X), which is all that matters for classification.

©Carlos Guestrin 2005-2009

Page 21

Expressing Conditional Log Likelihood

©Carlos Guestrin 2005-2009

l(w) = Σj [ yj ln P(Y=1 | xj, w) + (1 - yj) ln P(Y=0 | xj, w) ]

Here P(Y=1|xj,w) is the probability of predicting 1 and P(Y=0|xj,w) is the probability of predicting 0; the coefficient yj is 1 when the correct answer is 1, and (1 - yj) is 1 when the correct answer is 0. Substituting the logistic form:

ln P(Y=0 | X, w) = -ln( 1 + exp(w0 + Σi wiXi) )

ln P(Y=1 | X, w) = w0 + Σi wiXi - ln( 1 + exp(w0 + Σi wiXi) )
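A sketch of evaluating l(w) in Python using the two expressions above; the names and the use of log1p are implementation choices, not from the slides.

import numpy as np

def conditional_log_likelihood(w0, w, X, Y):
    # l(w) = sum_j [ y_j * ln P(Y=1|x_j,w) + (1 - y_j) * ln P(Y=0|x_j,w) ]
    z = w0 + X @ w                   # w0 + sum_i w_i * x_j,i for every example
    log1pez = np.log1p(np.exp(z))    # ln(1 + exp(z))
    return float(np.sum(Y * (z - log1pez) + (1 - Y) * (-log1pez)))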


Page 23

Maximizing Conditional Log Likelihood

Bad news: no closed-form solution to maximize l(w)

Good news: l(w) is a concave function of w!

No local optima to get stuck in: any local maximum is the global maximum

Concave functions are easy to optimize

©Carlos Guestrin 2005-2009

Page 24

Optimizing Concave Functions: Gradient Ascent

The conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent.

Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient: ∇w l(w) = [ ∂l(w)/∂w0, …, ∂l(w)/∂wn ]

Learning rate, η > 0

Update rule: w ← w + η ∇w l(w), i.e., wi ← wi + η ∂l(w)/∂wi
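A minimal Python sketch of this update for logistic regression; the gradient expression comes from differentiating l(w) above, and eta and the iteration count are assumptions.

import numpy as np

def logistic_gradient_ascent(X, Y, eta=0.01, iters=5000):
    # Maximize l(w) by gradient ascent; dl/dw_i = sum_j x_j,i * (y_j - P(Y=1|x_j, w)), with x_j,0 = 1
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])       # prepend 1s so w[0] is the intercept w0
    w = np.zeros(d + 1)
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # P(Y=1 | x_j, w) for every example
        w = w + eta * (Xb.T @ (Y - p1))        # update rule: w <- w + eta * gradient
    return w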

©Carlos Guestrin 2005-2009