
Page 1:

Name: 李政軒

INTRODUCTION TO Machine Learning: Parametric Methods

Page 2:

Parametric Estimation

X = {x^t}_{t=1}^N where x^t ~ p(x)

Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X

e.g., p(x) = N(μ, σ²) where θ = {μ, σ²}

Page 3:

Maximum Likelihood Estimation

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator:
θ* = argmax_θ L(θ|X)
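As a quick illustration of θ* = argmax_θ L(θ|X), the sketch below evaluates the Gaussian log likelihood on a grid of candidate parameters and picks the maximizer. The data, grid ranges, and seed are my own choices for the example, not from the slides.

```python
import numpy as np

# Hypothetical sample X = {x^t}, drawn here from N(2.0, 1.5^2) so we
# can check the recovered parameters against the truth.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=500)

def log_likelihood(mu, sigma, X):
    # L(theta|X) = sum_t log p(x^t|theta) for a Gaussian p
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (X - mu) ** 2 / (2 * sigma**2))

# Brute-force argmax over a grid of candidate (mu, sigma) pairs.
mus = np.linspace(0.0, 4.0, 101)
sigmas = np.linspace(0.5, 3.0, 101)
best = max(((log_likelihood(m, s, X), m, s) for m in mus for s in sigmas))
print(best[1], best[2])  # close to the true (2.0, 1.5)
```

In practice the maximum is found analytically (as on the following slides) or with a numerical optimizer; the grid just makes the argmax definition concrete.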

Page 4:

Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x ∈ {0,1}
P(x) = p_o^x (1 − p_o)^(1−x)
L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
MLE: p̂_o = ∑_t x^t / N

Multinomial: K > 2 states, x_i ∈ {0,1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p̂_i = ∑_t x_i^t / N
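A minimal sketch of both MLEs on synthetic data (the sample sizes, probabilities, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: x^t in {0, 1}; the MLE is the fraction of successes.
x = rng.binomial(1, 0.3, size=1000)      # hypothetical sample
p_hat = x.sum() / len(x)                 # p_o = sum_t x^t / N

# Multinomial: each x^t is a one-hot indicator vector over K states.
K, N = 4, 1000
states = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])
X = np.eye(K)[states]                    # rows are x^t with x_i^t in {0, 1}
p_hat_multi = X.sum(axis=0) / N          # p_i = sum_t x_i^t / N

print(p_hat, p_hat_multi)                # near 0.3 and (0.1, 0.2, 0.3, 0.4)
```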

Page 5:

Gaussian (Normal) Distribution

p(x) = N(μ, σ²):

p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:

m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
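The closed-form estimates translate directly to code; note that s² is the biased MLE (divide by N, not N − 1). The sample below is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=1000)     # hypothetical sample

m = x.sum() / len(x)                    # m = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / len(x)      # s^2 = sum_t (x^t - m)^2 / N (biased MLE)
print(m, s2)                            # close to (5.0, 4.0)
```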

Page 6:

Bias and Variance

Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]

Mean square error: r(d, θ) = E[(d − θ)²]
= (E[d] − θ)² + E[(d − E[d])²]
= Bias² + Variance
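The decomposition r(d, θ) = Bias² + Variance can be checked by simulation. Here the estimator d is the sample mean, and θ is fixed and known only so the check is possible; everything else is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 5.0   # the "unknown" parameter, fixed here to verify the identity

# Draw many samples X_i, compute d_i = d(X_i) = mean(X_i) on each.
d = np.array([rng.normal(theta, 2.0, size=25).mean() for _ in range(10000)])

bias = d.mean() - theta                   # b_theta(d) = E[d] - theta
variance = ((d - d.mean()) ** 2).mean()   # E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()           # r(d, theta) = E[(d - theta)^2]

print(mse, bias**2 + variance)            # agree up to sampling noise
```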

Page 7:

Bayes’ Estimator

Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full Bayes: p(x|X) = ∫ p(x|θ) p(θ|X) dθ

Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
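With a conjugate prior the three estimators have closed forms, which makes their relationship easy to see. The sketch below uses a Beta(a, b) prior on a Bernoulli parameter (a standard conjugate pair; the data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.7, size=20)   # hypothetical Bernoulli sample
N, s = len(x), x.sum()

# A Beta(a, b) prior is conjugate to the Bernoulli likelihood,
# so the posterior is Beta(a + s, b + N - s) in closed form.
a, b = 2.0, 2.0

theta_ML = s / N                            # argmax_theta p(X|theta)
theta_MAP = (a + s - 1) / (a + b + N - 2)   # mode of the posterior
theta_Bayes = (a + s) / (a + b + N)         # E[theta|X], the posterior mean

print(theta_ML, theta_MAP, theta_Bayes)
```

As N grows, the data dominate the prior and all three estimates converge.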

Page 8:

Parametric Classification

g_i(x) = p(x|C_i) P(C_i), or equivalently
g_i(x) = log p(x|C_i) + log P(C_i)

With Gaussian class likelihoods,

p(x|C_i) = (1 / (√(2π) σ_i)) exp[−(x − μ_i)² / (2σ_i²)]

the discriminant becomes

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)

Page 9:

Parametric Classification

Given the sample X = {x^t, r^t}_{t=1}^N where

r_i^t = 1 if x^t ∈ C_i, 0 if x^t ∈ C_j, j ≠ i

ML estimates are

P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

Discriminant becomes

g_i(x) = −log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
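These estimates and the discriminant fit in a few lines of code. A minimal sketch on made-up two-class 1D data (the function names and dataset are my own):

```python
import numpy as np

def fit_gaussian_classes(x, r):
    # ML estimates from inputs x (N,) and one-hot labels r (N, K)
    priors = r.sum(axis=0) / len(x)                        # P(C_i)
    means = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)   # m_i
    var = (r * (x[:, None] - means) ** 2).sum(axis=0) / r.sum(axis=0)  # s_i^2
    return priors, means, var

def discriminant(x, priors, means, var):
    # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)
    return (-0.5 * np.log(var)
            - (x[:, None] - means) ** 2 / (2 * var)
            + np.log(priors))

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(2.0, 0.5, 100)])
r = np.zeros((200, 2))
r[:100, 0] = 1
r[100:, 1] = 1

priors, means, var = fit_gaussian_classes(x, r)
pred = discriminant(x, priors, means, var).argmax(axis=1)
print((pred == r.argmax(axis=1)).mean())   # training accuracy
```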

Page 10:

Parametric classification: (a) and (b) for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.

Page 11:

Parametric classification: (a) and (b) for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. In (c), the expected risks are shown for the two classes and for reject with λ = 0.2.

Page 12:

Regression

r = f(x) + ε, where ε ~ N(0, σ²)

estimator: g(x|θ)

p(r|x) ~ N(g(x|θ), σ²)

L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
= log ∏_{t=1}^N p(r^t|x^t) + log ∏_{t=1}^N p(x^t)

Page 13:

Regression: From LogL to Error

L(θ|X) = log ∏_{t=1}^N (1 / (√(2π) σ)) exp[−(r^t − g(x^t|θ))² / (2σ²)]

= −N log(√(2π) σ) − (1 / (2σ²)) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Since the first term is constant in θ, maximizing the log likelihood is equivalent to minimizing the sum of squared errors

E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Page 14:

Linear Regression

g(x^t | w_1, w_0) = w_1 x^t + w_0

Setting the derivatives of the squared error to zero gives the normal equations:

∑_t r^t = N w_0 + w_1 ∑_t x^t
∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²

In matrix form, A w = y with

A = [ N          ∑_t x^t
      ∑_t x^t    ∑_t (x^t)² ]

w = [w_0, w_1]^T,  y = [∑_t r^t, ∑_t r^t x^t]^T

w = A^(−1) y
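A direct translation of w = A^(−1) y on synthetic data (the true line and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 5.0, 50)
r = 3.0 * x + 1.0 + rng.normal(0.0, 0.5, 50)   # hypothetical noisy line

# Normal equations A w = y with w = [w_0, w_1]^T.
A = np.array([[len(x),   x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)                 # w = A^{-1} y
print(w0, w1)                                  # close to (1.0, 3.0)
```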

Page 15:

Polynomial Regression

g(x^t | w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0

In matrix form r = D w, where

D = [ 1   x^1   (x^1)²   ...   (x^1)^k
      1   x^2   (x^2)²   ...   (x^2)^k
      ...
      1   x^N   (x^N)²   ...   (x^N)^k ]

r = [r^1, r^2, ..., r^N]^T

w = (D^T D)^(−1) D^T r
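The design matrix D is a Vandermonde matrix, so the solution is two lines of numpy. The data and order k below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, 100)
r = 2.0 * np.sin(1.5 * x) + rng.normal(0.0, 0.2, 100)   # hypothetical target

k = 3
D = np.vander(x, k + 1, increasing=True)   # rows [1, x^t, (x^t)^2, ..., (x^t)^k]
w = np.linalg.solve(D.T @ D, D.T @ r)      # w = (D^T D)^{-1} D^T r
print(w)                                   # [w_0, w_1, ..., w_k]
```

For ill-conditioned D, np.linalg.lstsq is the numerically safer choice than forming D^T D explicitly.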

Page 16:

Other Error Measures

Square error: E(θ|X) = (1/N) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Relative square error: E(θ|X) = ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²

Absolute error: E(θ|X) = ∑_t |r^t − g(x^t|θ)|

ε-sensitive error: E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
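Each measure is a one-liner; here g stands for the predictions g(x^t|θ) (the function names and demo values are mine):

```python
import numpy as np

def square_error(r, g):
    return ((r - g) ** 2).mean()                  # (1/N) sum_t (r^t - g^t)^2

def relative_square_error(r, g):
    return ((r - g) ** 2).sum() / ((r - r.mean()) ** 2).sum()

def absolute_error(r, g):
    return np.abs(r - g).sum()

def eps_sensitive_error(r, g, eps):
    resid = np.abs(r - g)
    return np.where(resid > eps, resid - eps, 0.0).sum()   # zero inside the eps tube

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.1, 1.9, 3.2])
print(square_error(r, g), eps_sensitive_error(r, g, 0.15))
```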

Page 17:

Bias and Variance

E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
(= noise + squared error)

E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
(= bias² + variance)

Page 18:

Estimating Bias and Variance

M samples X_i = {x_i^t, r_i^t}, i = 1, ..., M, are used to fit g_i(x), i = 1, ..., M

ḡ(x) = (1/M) ∑_i g_i(x)

Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²

Variance(g) = (1/(N·M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²
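A simulation sketch of these estimates, using order-1 polynomial fits to a known f so the bias term can be computed (f, the sample sizes, and the noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: 2.0 * np.sin(1.5 * x)            # known target, needed for the bias
x_eval = np.linspace(0.0, 5.0, 50)             # fixed evaluation points x^t

# Fit M order-1 polynomials g_i on M independent noisy samples.
M, n = 10, 20
fits = []
for _ in range(M):
    xs = rng.uniform(0.0, 5.0, n)
    rs = f(xs) + rng.normal(0.0, 1.0, n)
    fits.append(np.polyval(np.polyfit(xs, rs, 1), x_eval))
g = np.stack(fits)                             # g[i, t] = g_i(x^t)

g_bar = g.mean(axis=0)                                # (1/M) sum_i g_i(x)
bias2 = ((g_bar - f(x_eval)) ** 2).mean()             # (1/N) sum_t [gbar - f]^2
variance = ((g - g_bar) ** 2).mean()                  # (1/NM) sum_t sum_i [...]^2
print(bias2, variance)
```

Refitting with higher polynomial orders shows the dilemma of the next slide: bias² falls while variance grows.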

Page 19:

Bias/Variance Dilemma

Example: g_i(x) = 2 has no variance and high bias

g_i(x) = ∑_t r_i^t / N has lower bias, but nonzero variance

As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data)

Page 20:

Bias/Variance Dilemma

(a) Function, f(x) = 2 sin(1.5x), and one noisy (N(0,1)) dataset sampled from the function. Five samples are taken, each containing twenty instances. (b), (c), (d) are five polynomial fits, namely g_i(·), of order 1, 3, and 5. For each case, the dotted line is the average of the five fits, namely ḡ(·).

Page 21:

Polynomial Regression

Best fit “min error”

In the same setting as the previous figure, but using one hundred models instead of five: bias, variance, and error for polynomials of order 1 to 5.

Page 22:

Model Selection

Cross-validation: measure generalization accuracy by testing on data unused during training (a sketch follows below)

Regularization: penalize complex models
E′ = error on data + λ · model complexity

Akaike's information criterion (AIC), Bayesian information criterion (BIC)

Minimum description length (MDL): Kolmogorov complexity, shortest description of data

Structural risk minimization (SRM)
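A minimal hold-out cross-validation sketch for choosing polynomial order, as referenced above (the data, split, and candidate orders are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0.0, 5.0, 60)
r = 2.0 * np.sin(1.5 * x) + rng.normal(0.0, 1.0, 60)

# Hold out half the data: train on one part, validate on the unused part.
x_tr, r_tr = x[:30], r[:30]
x_va, r_va = x[30:], r[30:]

for k in range(1, 6):
    w = np.polyfit(x_tr, r_tr, k)
    val_err = ((r_va - np.polyval(w, x_va)) ** 2).mean()
    print(k, val_err)   # pick the order with the lowest validation error
```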

Page 23:

Model Selection

Best fit: “elbow”

Page 24:

Bayesian Model Selection

p(model|data) = p(data|model) p(model) / p(data)

Prior on models, p(model)

Regularization, when prior favors simpler models

Bayes: MAP of the posterior, p(model|data)

Average over a number of models with high posterior

Page 25:

Regression example

Coefficients increase in magnitude as order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization penalizes large weights:

E(w|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|w)]² + λ ∑_i w_i²