
Page 1

CSE 446 Logistic Regression

Winter 2012

Dan Weld

Some slides from Carlos Guestrin, Luke Zettlemoyer

Page 2

Gaussian Naïve Bayes

P(Xi = x | Y = yk) = N(μik, σik)

P(Y | X) ∝ P(X | Y) P(Y)

N(μik, σik) = 1 / (σik √(2π)) · exp( -(x - μik)² / (2 σik²) )

Sometimes assume the variance
– is independent of Y (i.e., σi),
– or independent of Xi (i.e., σk),
– or both (i.e., σ)
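A minimal Python sketch (not from the slides) of estimating the Gaussian Naïve Bayes parameters μik, σik and the priors P(Y=yk) from data, then classifying with Bayes rule; the function names and array shapes are assumptions for illustration.

import numpy as np

def fit_gnb(X, y):
    # X: (n_samples, n_features) continuous features; y: (n_samples,) class labels
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}               # P(Y = y_k)
    mus    = {k: X[y == k].mean(axis=0) for k in classes}        # mu_ik for each feature i
    sigmas = {k: X[y == k].std(axis=0) + 1e-9 for k in classes}  # sigma_ik (small constant avoids division by zero)
    return classes, priors, mus, sigmas

def predict_gnb(x, classes, priors, mus, sigmas):
    # P(Y=y_k | x) is proportional to P(Y=y_k) * prod_i N(x_i; mu_ik, sigma_ik); compare in log space
    log_posteriors = []
    for k in classes:
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * sigmas[k] ** 2)
                                + (x - mus[k]) ** 2 / sigmas[k] ** 2)
        log_posteriors.append(np.log(priors[k]) + log_lik)
    return classes[int(np.argmax(log_posteriors))]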

Page 3

Generative vs. Discriminative Classifiers

• Want to learn: h: X → Y
– X – features
– Y – target classes

• Bayes optimal classifier: P(Y|X)

• Generative classifiers, e.g., Naïve Bayes:
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X=x)
– This is a ‘generative’ model
• Indirect computation of P(Y|X) through Bayes rule
• As a result, can also generate a sample of the data: P(X) = Σy P(y) P(X|y)

• Discriminative classifiers, e.g., Logistic Regression:
– Assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data
– This is the ‘discriminative’ model
• Directly learn P(Y|X)
• But cannot obtain a sample of the data, because P(X) is not available

P(Y | X) ∝ P(X | Y) P(Y)

Page 4

Univariate Linear Regression

hw(x) = w1x + w0

Loss(hw) = Σj=1..n L2(yj, hw(xj)) = Σj=1..n (yj - hw(xj))² = Σj=1..n (yj - (w1xj + w0))²
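As a small illustration (not from the slides), the loss above can be evaluated directly; the function name and argument order are assumptions.

import numpy as np

def univariate_l2_loss(w0, w1, x, y):
    # Loss(h_w) = sum_j (y_j - (w1 * x_j + w0))^2, with x and y as equal-length arrays
    return float(np.sum((y - (w1 * x + w0)) ** 2))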

Page 5

Understanding Weight Space

Loss(hw) = Σj=1..n (yj - (w1xj + w0))²

hw(x) = w1x + w0

Page 6

Understanding Weight Space

hw(x) = w1x + w0

Loss(hw) = Σj=1..n (yj - (w1xj + w0))²

Page 7

Finding Minimum Loss

hw(x) = w1x + w0

Loss(hw) = Σj=1..n (yj - (w1xj + w0))²

Argminw Loss(hw):

∂Loss(hw) / ∂w0 = 0

∂Loss(hw) / ∂w1 = 0

Page 8

Unique Solution!

hw(x) = w1x + w0

Argminw Loss(hw):

w1 = [ N Σ(xjyj) - (Σxj)(Σyj) ] / [ N Σ(xj²) - (Σxj)² ]

w0 = [ Σyj - w1 Σxj ] / N
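A minimal Python sketch of the closed-form solution above; the function name is an assumption.

import numpy as np

def fit_univariate(x, y):
    # Unique minimizer of sum_j (y_j - (w1 * x_j + w0))^2
    N = len(x)
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    return w0, w1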

Page 9

Could also Solve Iteratively

Argminw Loss(hw):

w := any point in weight space
Loop until convergence:
  For each wi in w do:
    wi := wi - α ∂Loss(w)/∂wi
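A sketch of this loop in Python for the univariate loss; the learning rate alpha, tolerance, and iteration cap are assumptions (the slide leaves them unspecified).

import numpy as np

def gradient_descent_univariate(x, y, alpha=1e-3, tol=1e-8, max_iters=100000):
    w0, w1 = 0.0, 0.0                    # any point in weight space
    for _ in range(max_iters):
        err = y - (w1 * x + w0)
        g0 = -2.0 * np.sum(err)          # dLoss/dw0
        g1 = -2.0 * np.sum(err * x)      # dLoss/dw1
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1
        if max(abs(g0), abs(g1)) < tol:  # "loop until convergence"
            break
    return w0, w1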

Page 10

Multivariate Linear Regression

hw(xj) = w0 + Σi wi xj,i = Σi wi xj,i = wᵀxj   (taking xj,0 = 1)

Argminw Loss(hw) has the unique solution w* = (XᵀX)⁻¹ Xᵀy
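A minimal numpy sketch of this unique solution; solving the linear system is used here instead of an explicit matrix inverse, and the function name is an assumption.

import numpy as np

def fit_multivariate(X, y):
    # Prepend a column of ones so the first weight plays the role of w0
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # w* = (X^T X)^(-1) X^T y, computed as a linear solve
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)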

Problem….

Page 11

Overfitting

Regularize!! Penalize high weights.

Loss(hw) = Σj=1..n (yj - (w1xj + w0))² + Σi=1..k wi²

Alternatively….

Loss(hw) = Σj=1..n (yj - (w1xj + w0))² + Σi=1..k |wi|
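A sketch of both penalized losses in Python; the regularization weight lam is an assumption (the slide writes the penalty without an explicit coefficient).

import numpy as np

def ridge_loss(w, X, y, lam=1.0):
    # sum_j (y_j - w . x_j)^2 + lam * sum_i w_i^2   (L2 penalty)
    r = y - X @ w
    return float(r @ r + lam * np.sum(w ** 2))

def lasso_loss(w, X, y, lam=1.0):
    # sum_j (y_j - w . x_j)^2 + lam * sum_i |w_i|   (L1 penalty)
    r = y - X @ w
    return float(r @ r + lam * np.sum(np.abs(w)))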

Page 12

Regularization

[Figure: L1 vs. L2 regularization]

Page 13

Back to Classification

[Figure: regions where P(edible|X)=1 and P(edible|X)=0, separated by a decision boundary]

Page 14

Logistic Regression

Learn P(Y|X) directly! Assume a particular functional form… Not differentiable…

[Figure: a step function jumping from P(Y)=0 to P(Y)=1]

Page 15

Logistic Regression

Learn P(Y|X) directly! Assume a particular functional form: the logistic function (aka sigmoid)

Page 16

Logistic Function in n Dimensions

Sigmoid applied to a linear function of the data:

P(Y=1|X) = 1 / (1 + exp( -(w0 + Σi wiXi) ))

Features can be discrete or continuous!
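A minimal Python sketch of this functional form and the resulting classifier; the function names are assumptions.

import numpy as np

def p_y1_given_x(w0, w, x):
    # P(Y=1 | X=x) = 1 / (1 + exp(-(w0 + sum_i w_i * x_i)))
    z = w0 + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

def classify(w0, w, x):
    # Predict Y=1 exactly when w0 + sum_i w_i * x_i > 0 (the linear rule derived on a later slide)
    return int(w0 + np.dot(w, x) > 0)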

Page 17

Understanding Sigmoids

[Figure: three sigmoid plots over x from -6 to 6, with outputs between 0 and 1, for (w0=0, w1=-1), (w0=-2, w1=-1), and (w0=0, w1=-0.5)]

Page 18

Very convenient!

P(Y=1|X) > P(Y=0|X)

implies

ln [ P(Y=1|X) / P(Y=0|X) ] > 0, where ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σi wiXi

implies

predict Y=1 exactly when w0 + Σi wiXi > 0: a linear classification rule!

©Carlos Guestrin 2005-2009

Page 19

©Carlos Guestrin 2005-2009

Logistic regression more generally

Logistic regression in the more general case, where Y ∈ {y1, …, yR}:

P(Y = yk | X) = exp( wk0 + Σi wki Xi ) / ( 1 + Σj=1..R-1 exp( wj0 + Σi wji Xi ) )   for k < R

P(Y = yR | X) = 1 / ( 1 + Σj=1..R-1 exp( wj0 + Σi wji Xi ) )   for k = R (normalization, so no weights for this class)

Features can be discrete or continuous!
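A sketch of these R-class probabilities in numpy; the array layout (one weight row per class k < R, none for class R) mirrors the equations above, and the names are assumptions.

import numpy as np

def multiclass_probs(W, b, x):
    # W: (R-1, n_features) weights, b: (R-1,) intercepts; class R carries no weights
    scores = np.exp(b + W @ x)             # exp(w_k0 + sum_i w_ki * x_i) for k < R
    Z = 1.0 + np.sum(scores)               # shared normalization term
    return np.append(scores / Z, 1.0 / Z)  # P(Y=y_k|x) for k < R, then P(Y=y_R|x)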

Page 20

Loss Functions: Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) loss function: data likelihood
ln Πj P(xj, yj | w) = Σj ln P(yj | xj, w) + Σj ln P(xj | w)

Discriminative (Logistic Regression) loss function: conditional data likelihood
ln Πj P(yj | xj, w)

Discriminative models can’t compute P(xj | w)! Or, … “they don’t waste effort learning P(X)”: focus only on P(Y|X), which is all that matters for classification.

©Carlos Guestrin 2005-2009

Page 21

Expressing Conditional Log Likelihood

©Carlos Guestrin 2005-2009

l(w) = Σj [ yj ln P(Y=1 | xj, w) + (1 - yj) ln P(Y=0 | xj, w) ]

Here P(Y=1|xj,w) is the probability of predicting 1 and P(Y=0|xj,w) is the probability of predicting 0; the coefficient yj is 1 when the correct answer is 1, and (1 - yj) is 1 when the correct answer is 0. Substituting the logistic form:

ln P(Y=0 | X, w) = -ln( 1 + exp(w0 + Σi wiXi) )

ln P(Y=1 | X, w) = w0 + Σi wiXi - ln( 1 + exp(w0 + Σi wiXi) )
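A sketch of evaluating l(w) in Python using the two expressions above; the names and the use of log1p are implementation choices, not from the slides.

import numpy as np

def conditional_log_likelihood(w0, w, X, Y):
    # l(w) = sum_j [ y_j * ln P(Y=1|x_j,w) + (1 - y_j) * ln P(Y=0|x_j,w) ]
    z = w0 + X @ w                   # w0 + sum_i w_i * x_j,i for every example
    log1pez = np.log1p(np.exp(z))    # ln(1 + exp(z))
    return float(np.sum(Y * (z - log1pez) + (1 - Y) * (-log1pez)))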


Page 23

Maximizing Conditional Log Likelihood

Bad news: no closed-form solution to maximize l(w)

Good news: l(w) is a concave function of w!

No local optima to get stuck in: any local maximum is the global maximum

Concave functions are easy to optimize

©Carlos Guestrin 2005-2009

Page 24

Optimizing Concave Functions: Gradient Ascent

The conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent.

Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient: ∇w l(w) = [ ∂l(w)/∂w0, …, ∂l(w)/∂wn ]

Learning rate, η > 0

Update rule: w ← w + η ∇w l(w), i.e., wi ← wi + η ∂l(w)/∂wi
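A minimal Python sketch of this update for logistic regression; the gradient expression comes from differentiating l(w) above, and eta and the iteration count are assumptions.

import numpy as np

def logistic_gradient_ascent(X, Y, eta=0.01, iters=5000):
    # Maximize l(w) by gradient ascent; dl/dw_i = sum_j x_j,i * (y_j - P(Y=1|x_j, w)), with x_j,0 = 1
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])       # prepend 1s so w[0] is the intercept w0
    w = np.zeros(d + 1)
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # P(Y=1 | x_j, w) for every example
        w = w + eta * (Xb.T @ (Y - p1))        # update rule: w <- w + eta * gradient
    return w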

©Carlos Guestrin 2005-2009