
Page 1:

Name: 李政軒

INTRODUCTION TO Machine Learning: Parametric Methods

Page 2:

Parametric Estimation

X = {x^t}_{t=1}^N where x^t ~ p(x)

Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X

e.g., p(x) = N(μ, σ²) where θ = {μ, σ²}

Page 3:

Maximum Likelihood Estimation

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator:
θ* = argmax_θ L(θ|X)
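As a quick illustration of θ* = argmax_θ L(θ|X), the sketch below evaluates the Gaussian log likelihood on a grid of candidate parameters and picks the maximizer. The data, grid ranges, and seed are my own choices for the example, not from the slides.

```python
import numpy as np

# Hypothetical sample X = {x^t}, drawn here from N(2.0, 1.5^2) so we
# can check the recovered parameters against the truth.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=500)

def log_likelihood(mu, sigma, X):
    # L(theta|X) = sum_t log p(x^t|theta) for a Gaussian p
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (X - mu) ** 2 / (2 * sigma**2))

# Brute-force argmax over a grid of candidate (mu, sigma) pairs.
mus = np.linspace(0.0, 4.0, 101)
sigmas = np.linspace(0.5, 3.0, 101)
best = max(((log_likelihood(m, s, X), m, s) for m in mus for s in sigmas))
print(best[1], best[2])  # close to the true (2.0, 1.5)
```

In practice the maximum is found analytically (as on the following slides) or with a numerical optimizer; the grid just makes the argmax definition concrete.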

Page 4:

Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x ∈ {0,1}
P(x) = p_o^x (1 − p_o)^(1−x)
L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
MLE: p̂_o = ∑_t x^t / N

Multinomial: K > 2 states, x_i ∈ {0,1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p̂_i = ∑_t x_i^t / N
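A minimal sketch of both MLEs on synthetic data (the sample sizes, probabilities, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: x^t in {0, 1}; the MLE is the fraction of successes.
x = rng.binomial(1, 0.3, size=1000)      # hypothetical sample
p_hat = x.sum() / len(x)                 # p_o = sum_t x^t / N

# Multinomial: each x^t is a one-hot indicator vector over K states.
K, N = 4, 1000
states = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])
X = np.eye(K)[states]                    # rows are x^t with x_i^t in {0, 1}
p_hat_multi = X.sum(axis=0) / N          # p_i = sum_t x_i^t / N

print(p_hat, p_hat_multi)                # near 0.3 and (0.1, 0.2, 0.3, 0.4)
```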

Page 5:

Gaussian (Normal) Distribution

p(x) = N(μ, σ²):

p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:

m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
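The closed-form estimates translate directly to code; note that s² is the biased MLE (divide by N, not N − 1). The sample below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=1000)     # hypothetical sample

m = x.sum() / len(x)                    # m = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / len(x)      # s^2 = sum_t (x^t - m)^2 / N (biased MLE)
print(m, s2)                            # close to (5.0, 4.0)
```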

Page 6:

Bias and Variance

Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]

Mean square error: r(d, θ) = E[(d − θ)²]
= (E[d] − θ)² + E[(d − E[d])²]
= Bias² + Variance
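The decomposition r(d, θ) = Bias² + Variance can be checked by simulation. Here the estimator d is the sample mean, and θ is fixed and known only so the check is possible; everything else is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 5.0   # the "unknown" parameter, fixed here to verify the identity

# Draw many samples X_i, compute d_i = d(X_i) = mean(X_i) on each.
d = np.array([rng.normal(theta, 2.0, size=25).mean() for _ in range(10000)])

bias = d.mean() - theta                   # b_theta(d) = E[d] - theta
variance = ((d - d.mean()) ** 2).mean()   # E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()           # r(d, theta) = E[(d - theta)^2]

print(mse, bias**2 + variance)            # agree up to sampling noise
```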

Page 7:

Bayes’ Estimator

Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full Bayes: p(x|X) = ∫ p(x|θ) p(θ|X) dθ

Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
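With a conjugate prior the three estimators have closed forms, which makes their relationship easy to see. The sketch below uses a Beta(a, b) prior on a Bernoulli parameter (a standard conjugate pair; the data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.7, size=20)   # hypothetical Bernoulli sample
N, s = len(x), x.sum()

# A Beta(a, b) prior is conjugate to the Bernoulli likelihood,
# so the posterior is Beta(a + s, b + N - s) in closed form.
a, b = 2.0, 2.0

theta_ML = s / N                            # argmax_theta p(X|theta)
theta_MAP = (a + s - 1) / (a + b + N - 2)   # mode of the posterior
theta_Bayes = (a + s) / (a + b + N)         # E[theta|X], the posterior mean

print(theta_ML, theta_MAP, theta_Bayes)
```

As N grows, the data dominate the prior and all three estimates converge.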

Page 8:

Parametric Classification

g_i(x) = p(x|C_i) P(C_i), or equivalently
g_i(x) = log p(x|C_i) + log P(C_i)

With Gaussian class likelihoods,

p(x|C_i) = (1 / (√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]

the discriminant becomes

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)

Page 9:

Parametric Classification

Given the sample X = {x^t, r^t}_{t=1}^N where

r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i

ML estimates are

P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

Discriminant becomes

g_i(x) = −log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
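These estimates and the discriminant fit in a few lines of code. A minimal sketch on made-up two-class 1D data (the function names and dataset are my own):

```python
import numpy as np

def fit_gaussian_classes(x, r):
    # ML estimates from inputs x (N,) and one-hot labels r (N, K)
    priors = r.sum(axis=0) / len(x)                        # P(C_i)
    means = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)   # m_i
    var = (r * (x[:, None] - means) ** 2).sum(axis=0) / r.sum(axis=0)  # s_i^2
    return priors, means, var

def discriminant(x, priors, means, var):
    # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)
    return (-0.5 * np.log(var)
            - (x[:, None] - means) ** 2 / (2 * var)
            + np.log(priors))

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(2.0, 0.5, 100)])
r = np.zeros((200, 2))
r[:100, 0] = 1
r[100:, 1] = 1

priors, means, var = fit_gaussian_classes(x, r)
pred = discriminant(x, priors, means, var).argmax(axis=1)
print((pred == r.argmax(axis=1)).mean())   # training accuracy
```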

Page 10:

Parametric classification: (a) and (b) for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.

Page 11:

Parametric classification: (a) and (b) for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. In (c), the expected risks are shown for the two classes and for reject with λ = 0.2.

Page 12:

Regression

r = f(x) + ε, where ε ~ N(0, σ²)

estimator: g(x|θ)

p(r|x) ~ N(g(x|θ), σ²)

L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
= log ∏_{t=1}^N p(r^t|x^t) + log ∏_{t=1}^N p(x^t)

Page 13:

Regression: From LogL to Error

L(θ|X) = log ∏_{t=1}^N (1 / (√(2π) σ)) exp[−(r^t − g(x^t|θ))² / (2σ²)]

= −N log(√(2π) σ) − (1 / (2σ²)) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Since the first term is constant in θ, maximizing the log likelihood is equivalent to minimizing the sum of squared errors

E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Page 14:

Linear Regression

g(x^t | w_1, w_0) = w_1 x^t + w_0

Setting the derivatives of the squared error to zero gives the normal equations:

∑_t r^t = N w_0 + w_1 ∑_t x^t
∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²

In matrix form, A w = y with

A = [ N          ∑_t x^t
      ∑_t x^t    ∑_t (x^t)² ]

w = [w_0, w_1]^T,  y = [∑_t r^t, ∑_t r^t x^t]^T

w = A^(−1) y
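A direct translation of w = A^(−1) y on synthetic data (the true line and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 5.0, 50)
r = 3.0 * x + 1.0 + rng.normal(0.0, 0.5, 50)   # hypothetical noisy line

# Normal equations A w = y with w = [w_0, w_1]^T.
A = np.array([[len(x),   x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)                 # w = A^{-1} y
print(w0, w1)                                  # close to (1.0, 3.0)
```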

Page 15:

Polynomial Regression

g(x^t | w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0

In matrix form r = D w, where

D = [ 1   x^1   (x^1)²   ...   (x^1)^k
      1   x^2   (x^2)²   ...   (x^2)^k
      ...
      1   x^N   (x^N)²   ...   (x^N)^k ]

r = [r^1, r^2, ..., r^N]^T

w = (D^T D)^(−1) D^T r
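The design matrix D is a Vandermonde matrix, so the solution is two lines of numpy. The data and order k below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, 100)
r = 2.0 * np.sin(1.5 * x) + rng.normal(0.0, 0.2, 100)   # hypothetical target

k = 3
D = np.vander(x, k + 1, increasing=True)   # rows [1, x^t, (x^t)^2, ..., (x^t)^k]
w = np.linalg.solve(D.T @ D, D.T @ r)      # w = (D^T D)^{-1} D^T r
print(w)                                   # [w_0, w_1, ..., w_k]
```

For ill-conditioned D, np.linalg.lstsq is the numerically safer choice than forming D^T D explicitly.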

Page 16:

Other Error Measures

Square error: E(θ|X) = (1/N) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Relative square error: E(θ|X) = ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²

Absolute error: E(θ|X) = ∑_t |r^t − g(x^t|θ)|

ε-sensitive error: E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
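Each measure is a one-liner; here g stands for the predictions g(x^t|θ) (the function names and demo values are mine):

```python
import numpy as np

def square_error(r, g):
    return ((r - g) ** 2).mean()                  # (1/N) sum_t (r^t - g^t)^2

def relative_square_error(r, g):
    return ((r - g) ** 2).sum() / ((r - r.mean()) ** 2).sum()

def absolute_error(r, g):
    return np.abs(r - g).sum()

def eps_sensitive_error(r, g, eps):
    resid = np.abs(r - g)
    return np.where(resid > eps, resid - eps, 0.0).sum()   # zero inside the eps tube

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.1, 1.9, 3.2])
print(square_error(r, g), eps_sensitive_error(r, g, 0.15))
```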

Page 17:

Bias and Variance

E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
(= noise + squared error)

E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
(= bias² + variance)

Page 18:

Estimating Bias and Variance

M samples X_i = {x_i^t, r_i^t}, i = 1, ..., M, are used to fit g_i(x), i = 1, ..., M

ḡ(x) = (1/M) ∑_i g_i(x)

Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²

Variance(g) = (1/(N·M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²
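A simulation sketch of these estimates, using order-1 polynomial fits to a known f so the bias term can be computed (f, the sample sizes, and the noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: 2.0 * np.sin(1.5 * x)            # known target, needed for the bias
x_eval = np.linspace(0.0, 5.0, 50)             # fixed evaluation points x^t

# Fit M order-1 polynomials g_i on M independent noisy samples.
M, n = 10, 20
fits = []
for _ in range(M):
    xs = rng.uniform(0.0, 5.0, n)
    rs = f(xs) + rng.normal(0.0, 1.0, n)
    fits.append(np.polyval(np.polyfit(xs, rs, 1), x_eval))
g = np.stack(fits)                             # g[i, t] = g_i(x^t)

g_bar = g.mean(axis=0)                                # (1/M) sum_i g_i(x)
bias2 = ((g_bar - f(x_eval)) ** 2).mean()             # (1/N) sum_t [gbar - f]^2
variance = ((g - g_bar) ** 2).mean()                  # (1/NM) sum_t sum_i [...]^2
print(bias2, variance)
```

Refitting with higher polynomial orders shows the dilemma of the next slide: bias² falls while variance grows.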

Page 19:

Bias/Variance Dilemma

Example: g_i(x) = 2 has no variance and high bias

g_i(x) = ∑_t r_i^t / N has lower bias, but nonzero variance

As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data)

Page 20:

Bias/Variance Dilemma

(a) Function, f(x) = 2 sin(1.5x), and one noisy (N(0,1)) dataset sampled from the function. Five samples are taken, each containing twenty instances. (b), (c), (d) are five polynomial fits, namely g_i(·), of order 1, 3, and 5. For each case, the dotted line is the average of the five fits, namely ḡ(·).

Page 21:

Polynomial Regression

Best fit “min error”

In the same setting as the previous figure, but using one hundred models instead of five: bias, variance, and error for polynomials of order 1 to 5.

Page 22:

Model Selection

Cross-validation: measure generalization accuracy by testing on data unused during training (a sketch follows below)

Regularization: penalize complex models
E′ = error on data + λ · model complexity

Akaike's information criterion (AIC), Bayesian information criterion (BIC)

Minimum description length (MDL): Kolmogorov complexity, shortest description of data

Structural risk minimization (SRM)
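A minimal hold-out cross-validation sketch for choosing polynomial order, as referenced above (the data, split, and candidate orders are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0.0, 5.0, 60)
r = 2.0 * np.sin(1.5 * x) + rng.normal(0.0, 1.0, 60)

# Hold out half the data: train on one part, validate on the unused part.
x_tr, r_tr = x[:30], r[:30]
x_va, r_va = x[30:], r[30:]

for k in range(1, 6):
    w = np.polyfit(x_tr, r_tr, k)
    val_err = ((r_va - np.polyval(w, x_va)) ** 2).mean()
    print(k, val_err)   # pick the order with the lowest validation error
```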

Page 23:

Model Selection

Best fit: “elbow”

Page 24:

Bayesian Model Selection

p(model|data) = p(data|model) p(model) / p(data)

Prior on models, p(model)

Regularization, when prior favors simpler models

Bayes: MAP of the posterior, p(model|data)

Average over a number of models with high posterior

Page 25:

Regression example

Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization penalizes large weights:

E(w|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|w)]² + λ ∑_i w_i²