CSE 446 Logistic Regression
Winter 2012
Dan Weld
Some slides from Carlos Guestrin, Luke Zettlemoyer
Gaussian Naïve Bayes
P(X_i = x | Y = y_k) = N(μ_ik, σ_ik)

P(Y | X) ∝ P(X | Y) P(Y)

N(μ_ik, σ_ik):  p(x) = 1/(σ_ik √(2π)) · exp( −(x − μ_ik)² / (2 σ_ik²) )

Sometimes assume the variance
– is independent of Y (i.e., σ_i),
– or independent of X_i (i.e., σ_k),
– or both (i.e., σ)
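As an illustration (not from the original slides), a minimal NumPy sketch of the Gaussian Naïve Bayes posterior, assuming the per-feature, per-class means `mu`, standard deviations `sigma`, and class priors `prior` have already been estimated:

```python
import numpy as np

def gaussian_nb_posterior(x, mu, sigma, prior):
    """P(Y = y_k | x) for Gaussian Naive Bayes, given estimated parameters.

    x: (d,) feature vector; mu, sigma: (d, K) per-feature/per-class means and
    standard deviations; prior: (K,) class priors P(Y = y_k)."""
    # log N(x_i; mu_ik, sigma_ik), summed over features i for each class k
    log_lik = (-0.5 * np.log(2 * np.pi * sigma**2)
               - (x[:, None] - mu)**2 / (2 * sigma**2)).sum(axis=0)
    log_post = np.log(prior) + log_lik          # log P(Y=y_k) + log P(x | Y=y_k)
    post = np.exp(log_post - log_post.max())    # shift for numerical stability
    return post / post.sum()                    # normalize: P(Y=y_k | x)
```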
Generative vs. Discriminative Classifiers
• Want to learn h: X → Y
  – X – features
  – Y – target classes
• Bayes optimal classifier – P(Y|X)
• Generative classifier, e.g., Naïve Bayes:
  – Assume some functional form for P(X|Y), P(Y)
  – Estimate parameters of P(X|Y), P(Y) directly from training data
  – Use Bayes rule to calculate P(Y|X = x)
  – This is a ‘generative’ model
    • Indirect computation of P(Y|X) through Bayes rule
    • As a result, can also generate a sample of the data, P(X) = Σ_y P(y) P(X|y)
• Discriminative classifiers, e.g., Logistic Regression:
  – Assume some functional form for P(Y|X)
  – Estimate parameters of P(Y|X) directly from training data
  – This is the ‘discriminative’ model
    • Directly learn P(Y|X)
    • But cannot obtain a sample of the data, because P(X) is not available

P(Y | X) ∝ P(X | Y) P(Y)
Univariate Linear Regression
h_w(x) = w_1 x + w_0

Loss(h_w) = Σ_{j=1}^{n} L2(y_j, h_w(x_j)) = Σ_{j=1}^{n} (y_j − h_w(x_j))² = Σ_{j=1}^{n} (y_j − (w_1 x_j + w_0))²
Understanding Weight Space

h_w(x) = w_1 x + w_0

Loss(h_w) = Σ_{j=1}^{n} (y_j − (w_1 x_j + w_0))²

[Figures: the loss plotted as a surface and as contours over weight space (w_0, w_1)]
Finding Minimum Loss
Argmin_w Loss(h_w)

h_w(x) = w_1 x + w_0

Loss(h_w) = Σ_{j=1}^{n} (y_j − (w_1 x_j + w_0))²

Set  ∂Loss(h_w)/∂w_0 = 0  and  ∂Loss(h_w)/∂w_1 = 0

Unique solution!
Argmin_w Loss(h_w)

h_w(x) = w_1 x + w_0

w_1 = ( N Σ_j x_j y_j − (Σ_j x_j)(Σ_j y_j) ) / ( N Σ_j x_j² − (Σ_j x_j)² )

w_0 = ( Σ_j y_j − w_1 Σ_j x_j ) / N
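As a quick illustration of this closed-form solution (not from the original slides; the helper name `fit_univariate` and the sample data are made up for the example):

```python
import numpy as np

def fit_univariate(xs, ys):
    """Closed-form least-squares fit of h_w(x) = w1*x + w0."""
    n = len(xs)
    sx, sy = xs.sum(), ys.sum()
    sxy, sxx = (xs * ys).sum(), (xs * xs).sum()
    w1 = (n * sxy - sx * sy) / (n * sxx - sx**2)
    w0 = (sy - w1 * sx) / n
    return w0, w1

# Example usage: noisy points around y = 2x + 1
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
w0, w1 = fit_univariate(xs, ys)
print(w0, w1)   # roughly 1 and 2
```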
Could also Solve Iteratively
Argmin_w Loss(h_w)

w := any point in weight space
Loop until convergence:
  For each w_i in w do:
    w_i := w_i − α · ∂Loss(w)/∂w_i
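A minimal sketch of this iterative alternative for the univariate squared loss (not from the slides; the step size `alpha` and iteration count are illustrative choices):

```python
import numpy as np

def fit_univariate_gd(xs, ys, alpha=0.01, iters=5000):
    """Gradient descent on Loss(h_w) = sum_j (y_j - (w1*x_j + w0))^2."""
    w0, w1 = 0.0, 0.0                      # any point in weight space
    for _ in range(iters):
        err = ys - (w1 * xs + w0)          # residuals y_j - h_w(x_j)
        grad_w0 = -2.0 * err.sum()         # dLoss/dw0
        grad_w1 = -2.0 * (err * xs).sum()  # dLoss/dw1
        w0 -= alpha * grad_w0
        w1 -= alpha * grad_w1
    return w0, w1
```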
Multivariate Linear Regression
h_w(x_j) = w_0 + Σ_i w_i x_{j,i} = Σ_i w_i x_{j,i} = wᵀ x_j     (taking x_{j,0} = 1)

Argmin_w Loss(h_w)

Unique solution:  w* = (XᵀX)⁻¹ Xᵀ y

Problem…
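A minimal NumPy sketch of this unique solution (not from the slides; `np.linalg.solve` is used instead of an explicit matrix inverse, and the column of 1s implements the x_{j,0} = 1 convention):

```python
import numpy as np

def fit_multivariate(X, y):
    """Least-squares weights w* = (X^T X)^{-1} X^T y, with an intercept column."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_{j,0} = 1
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)     # solves (X^T X) w = X^T y
```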
Overfitting
Regularize!! Penalize high weights:

Loss(h_w) = Σ_{j=1}^{n} (y_j − (w_1 x_j + w_0))² + λ Σ_{i=1}^{k} w_i²

Alternatively…

Loss(h_w) = Σ_{j=1}^{n} (y_j − (w_1 x_j + w_0))² + λ Σ_{i=1}^{k} |w_i|
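For the squared (L2) penalty the closed-form solution changes only slightly; a minimal sketch, assuming the intercept is left unpenalized (the helper name `fit_ridge` and the default λ are illustrative, not from the slides):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Minimizes sum_j (y_j - w^T x_j)^2 + lam * sum_{i>=1} w_i^2."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # intercept column x_{j,0} = 1
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                              # do not penalize the intercept w_0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
```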
Regularization

[Figure: contours of the L1 vs. L2 penalty terms]
Back to Classification
[Figure: data labeled P(edible|X) = 1 vs. P(edible|X) = 0, separated by a decision boundary]
Logistic Regression
Learn P(Y|X) directly!
• Assume a particular functional form
• A hard step between P(Y) = 0 and P(Y) = 1? Not differentiable…

[Figure: step-function decision rule jumping from P(Y) = 0 to P(Y) = 1]
Logistic Regression
Learn P(Y|X) directly!
• Assume a particular functional form: the logistic function (aka sigmoid)

σ(z) = 1 / (1 + e^(−z))
Logistic Function in n Dimensions
Sigmoid applied to a linear function of the data:

P(Y = 0 | X, w) = 1 / (1 + exp(w_0 + Σ_i w_i X_i))

P(Y = 1 | X, w) = exp(w_0 + Σ_i w_i X_i) / (1 + exp(w_0 + Σ_i w_i X_i))

Features can be discrete or continuous!
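A minimal sketch of this n-dimensional form, using the same convention as above (P(Y=0|X,w) uses the +z sign inside the exponential; the function name is illustrative, not from the slides):

```python
import numpy as np

def predict_proba(x, w0, w):
    """Return (P(Y=0|x,w), P(Y=1|x,w)) for one feature vector x."""
    z = w0 + np.dot(w, x)             # linear function of the data
    p1 = 1.0 / (1.0 + np.exp(-z))     # sigmoid: exp(z) / (1 + exp(z))
    return 1.0 - p1, p1
```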
Understanding Sigmoids
[Three sigmoid plots over the input range −6 to 6, output 0 to 1:  w_0 = 0, w_1 = −1;  w_0 = −2, w_1 = −1;  w_0 = 0, w_1 = −0.5]
Very convenient!
P(Y = 1 | X, w) > 0.5

implies  exp(w_0 + Σ_i w_i X_i) > 1

implies  w_0 + Σ_i w_i X_i > 0

implies a linear classification rule!
Logistic regression more generally
Logistic regression in the more general case, where Y ∈ {y_1, …, y_R}:

P(Y = y_k | X, w) = exp(w_k0 + Σ_i w_ki X_i) / (1 + Σ_{j=1}^{R−1} exp(w_j0 + Σ_i w_ji X_i))   for k < R

P(Y = y_R | X, w) = 1 / (1 + Σ_{j=1}^{R−1} exp(w_j0 + Σ_i w_ji X_i))   for k = R (normalization, so no weights for this class)

Features can be discrete or continuous!
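A minimal sketch of these R-class probabilities, with class R acting as the normalization class that carries no weights (array names and shapes are illustrative, not from the slides):

```python
import numpy as np

def predict_proba_multiclass(x, W, b):
    """P(Y = y_k | x) for k = 1..R.

    W is (R-1, d) and b is (R-1,): weights for classes 1..R-1;
    class R has no weights and serves as the normalization class."""
    scores = np.exp(b + W @ x)               # exp(w_k0 + sum_i w_ki x_i), k < R
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom    # last entry is class R
```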
Loss Functions: Likelihood vs. Conditional Likelihood

• Generative (Naïve Bayes) loss function: data likelihood
  ln P(D | w) = Σ_j ln P(x_j, y_j | w)
• Discriminative (Logistic Regression) loss function: conditional data likelihood
  ln P(D_Y | D_X, w) = Σ_j ln P(y_j | x_j, w)
• Discriminative models can’t compute P(x_j | w)!
  Or, … “They don’t waste effort learning P(X)”
  Focus only on P(Y|X) – all that matters for classification
Expressing Conditional Log Likelihood

l(w) = Σ_j ln P(y_j | x_j, w)
     = Σ_j [ y_j ln P(Y = 1 | x_j, w) + (1 − y_j) ln P(Y = 0 | x_j, w) ]

(the factor y_j is 1 when the correct answer is 1, and (1 − y_j) is 1 when the correct answer is 0, so each example contributes the log probability of predicting its own label)

Plugging in the logistic form:

ln P(Y = 0 | X, w) = −ln(1 + exp(w_0 + Σ_i w_i X_i))

ln P(Y = 1 | X, w) = w_0 + Σ_i w_i X_i − ln(1 + exp(w_0 + Σ_i w_i X_i))
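A minimal sketch that evaluates this conditional log likelihood over a data set, using the two expressions above (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def conditional_log_likelihood(X, y, w0, w):
    """l(w) = sum_j [ y_j ln P(Y=1|x_j,w) + (1 - y_j) ln P(Y=0|x_j,w) ]."""
    z = w0 + X @ w                    # w_0 + sum_i w_i x_{j,i} for every example
    log_p0 = -np.logaddexp(0.0, z)    # ln P(Y=0|x,w) = -ln(1 + exp(z))
    log_p1 = z + log_p0               # ln P(Y=1|x,w) = z - ln(1 + exp(z))
    return np.sum(y * log_p1 + (1 - y) * log_p0)
```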
Maximizing Conditional Log Likelihood
• Bad news: no closed-form solution to maximize l(w)
• Good news: l(w) is a concave function of w!
  – No local optima
  – Concave functions are easy to optimize
Optimizing Concave Functions: Gradient Ascent

• Conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent.
• Gradient ascent is the simplest of optimization approaches
  – e.g., conjugate gradient ascent is much better (see reading)

Gradient:  ∇_w l(w) = [ ∂l(w)/∂w_0 , … , ∂l(w)/∂w_n ]

Update rule:  w_i ← w_i + η · ∂l(w)/∂w_i       (learning rate η > 0)
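Putting the pieces together, a minimal sketch of logistic regression trained by gradient ascent on the conditional log likelihood; the gradient ∂l/∂w_i = Σ_j x_{j,i} (y_j − P(Y=1|x_j,w)) follows from differentiating the expressions above, and the step size and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.01, iters=1000):
    """Gradient ascent on the conditional log likelihood l(w)."""
    n, d = X.shape
    w0, w = 0.0, np.zeros(d)
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(Y=1|x_j,w) for every example
        err = y - p1                               # y_j - P(Y=1|x_j,w)
        w0 += eta * err.sum()                      # dl/dw_0
        w  += eta * (X.T @ err)                    # dl/dw_i = sum_j x_{j,i} * err_j
    return w0, w
```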