Lesson 12: Machine Learning · fumblog.um.ac.ir/gallery/101/1spp-lesson12-1-parametric-methods.pdf · Monday, November 17, 2014 · Reza Monsefi

Parametric Methods

(Part 1)

Machine Learning

Ferdowsi University of Mashhad, Faculty of Engineering

Reza Monsefi

Lesson 12

Monday, November 17, 2014 · Dr. Reza Monsefi · Faculty of Engineering · Machine Learning

How to make optimal decisions when uncertainty is modeled probabilistically is discussed under Bayesian learning.

What this lesson covers:
- We want to see how these probabilities can be estimated from a given training set.
- We continue the discussion with the parametric approach to classification (Lesson 12) and regression (Lesson 13).
- Semiparametric and nonparametric approaches are discussed in other lessons.
- The bias/variance problem and model selection methods, which trade off model complexity against empirical error, will be introduced.

Outline:
- Introduction
- Maximum likelihood estimation
- Bernoulli density
- Multinomial density
- Gaussian (normal) density
- Evaluating an estimator (bias and variance)
- Bayesian estimator
- Parametric classification

Introduction: Parameter Estimation (Parametric Estimation)

Any value that is calculated from a given sample is called a statistic. In statistical inference, decisions are made using the information provided by a sample.

The first approach we study is the parametric approach. In this approach we assume that the sample is drawn from a probability distribution that obeys a known model, for example a Gaussian distribution.

The advantage of the parametric approach is that the model is defined by a small number of parameters, for example the mean and the variance, which are the sufficient statistics of the distribution. Once these parameters are estimated from the sample, the whole distribution is in effect known.

The parameters of the distribution are estimated from the given sample. These estimates are plugged into the assumed model to obtain an estimated distribution, which is then used to make a decision.

The method we use to estimate the parameters of a probability distribution (point estimation) is called maximum likelihood estimation (MLE). Bayesian estimation will be examined elsewhere.


We start with density estimation, which is the general case of estimating p(x).

Density estimation is used for classification. There the estimated densities are the class densities p(x|Ci) and the prior densities p(Ci), from which the posterior densities p(Ci|x) can be calculated and a decision made.

Later we discuss regression, where the estimated density is p(y|x).

Here x is one-dimensional, so the densities are univariate.


Maximum Likelihood Estimation

Parametric estimation: $X = \{x^t\}_t$ where $x^t \sim p(x)$. Assume a form for $p(x|\theta)$ and estimate $\theta$, its sufficient statistics, using $X$; e.g., $\mathcal{N}(\mu, \sigma^2)$ where $\theta = \{\mu, \sigma^2\}$.

We have an independent and identically distributed (iid) sample $X = \{x^t\}_{t=1}^{N}$.

We assume that the $x^t$ are instances drawn from a known family of probability densities $p(x|\theta)$, defined by the parameters $\theta$:

$$x^t \sim p(x|\theta)$$

We want to choose $\theta$ so that sampling $x^t$ from $p(x|\theta)$ is as likely as possible.

Because the $x^t$ are independent, the likelihood of the parameter $\theta$ given the sample $X$ is the product of the likelihoods of the individual points.

Likelihood of θ given the sample X:

$$l(\theta|X) = p(X|\theta) = \prod_t p(x^t|\theta)$$

Log likelihood:

$$L(\theta|X) = \log l(\theta|X) = \sum_t \log p(x^t|\theta)$$

Maximum likelihood estimator (MLE):

$$\theta^* = \theta_{ML} = \arg\max_\theta L(\theta|X)$$


Examples: Bernoulli/Multinomial

Bernoulli: two states (K = 2), failure/success, $x \in \{0, 1\}$. If we have a two-class problem, the distribution we use is the Bernoulli distribution.

$$P(x) = p^x (1 - p)^{1 - x}$$

$$E[X] = \sum_x x\, p(x) = 1 \cdot p + 0 \cdot (1 - p) = p$$

$$\mathrm{Var}(X) = \sum_x (x - E[X])^2 p(x) = p(1 - p)$$

$$L(p|X) = \log \prod_{t=1}^{N} p^{x^t} (1 - p)^{1 - x^t} = \sum_t x^t \log p + \Big(N - \sum_t x^t\Big) \log(1 - p)$$

The estimate $\hat{p}$ that maximizes the log likelihood is found by solving $\dfrac{dL}{dp} = 0$:

$$\text{MLE: } \hat{p} = \frac{\sum_t x^t}{N}$$

The estimate of p is the ratio of the number of occurrences of the event to the number of experiments.

Multinomial: K > 2 states, $x_i \in \{0, 1\}$. The generalization of the Bernoulli distribution to more than two classes is the multinomial distribution.

$$P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$$

$$L(p_1, p_2, \ldots, p_K|X) = \log \prod_t \prod_i p_i^{x_i^t}$$

$$\text{MLE: } \hat{p}_i = \frac{\sum_t x_i^t}{N}$$
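The two estimators above are simple relative frequencies. A minimal sketch, assuming hypothetical data (the `coin` and `labels` lists and the function names are illustrative, not part of the lesson):

```python
from collections import Counter


def bernoulli_mle(sample):
    """MLE of p for a 0/1 sample: sum_t x^t / N."""
    return sum(sample) / len(sample)


def multinomial_mle(labels, states):
    """MLE of p_i for each state: number of occurrences of state i divided by N."""
    counts = Counter(labels)
    n = len(labels)
    return {state: counts[state] / n for state in states}


# Hypothetical observations, chosen only for illustration.
coin = [1, 0, 1, 1, 0, 1, 0, 1]
labels = ["a", "b", "a", "c", "a", "b"]

p_hat = bernoulli_mle(coin)
p_hats = multinomial_mle(labels, ["a", "b", "c"])

print(p_hat)
print(p_hats)
```

In the multinomial case each observation is treated as a single label rather than a one-hot vector; counting labels gives the same sums as $\sum_t x_i^t$.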

The normal (Gaussian) density is one of the most widely used distributions for modeling class-conditional densities with numeric input.

Maximum Likelihood Estimation (MLE)

Example: Bernoulli (binomial) distribution function

X is a binomial random variable with parameters n = 10 and $p_0$, where $p_0$ is an unknown constant:

$$f(x; p_0) = P_{p_0}(X = x) = \binom{n}{x} p_0^x (1 - p_0)^{n - x}$$

To start, suppose X = 3.

As in all point estimation problems, the goal is to estimate the value of the parameter $p_0$ based on the available data.

We do not know the value of $p_0$. First suppose $p_0 = 0.5$. Under this particular assumption, the probability of generating the data X = 3 is

$$f(3; 0.5) = P_{0.5}(X = 3) = \binom{10}{3} 0.5^3 (1 - 0.5)^7 = 0.117$$

More generally, for $p_0 = p$ with $p \in [0, 1]$ we have

$$f(3; p) = P_p(X = 3) = \binom{10}{3} p^3 (1 - p)^7$$

The maximum likelihood principle says that the estimate of $p_0$ should be the p that maximizes $L(p; 3)$. In other words, the estimate of $p_0$ should make the observed data as probable as possible. Usually the log likelihood is used:

$$\log L(p; 3) = 3 \log p + 7 \log(1 - p) + \log \binom{n}{k}$$

By differentiating, we can find the point at which the function is maximized:

$$\frac{\partial}{\partial p} \log L(p; 3) = 0 \;\Rightarrow\; \frac{3}{p} - \frac{7}{1 - p} = 0 \;\Rightarrow\; 3 - 3p - 7p = 0 \;\Rightarrow\; 3 - 10p = 0 \;\Rightarrow\; p = \frac{3}{10} = 0.3$$

$$\arg\max_p \log L(p; 3) = 0.3$$

Only one point maximizes this function; it is called a single critical point. This point is called the maximum likelihood estimate of $p_0$ for X = 3.

To confirm that the critical point is a maximum, we take the second derivative:

$$\frac{\partial^2}{\partial p^2} \log L(p; 3) = -\frac{3}{p^2} - \frac{7}{(1 - p)^2} < 0$$

which means that the point is a maximum.

It is clear that if we observe X = k, for k = 0, 1, ..., n, the maximum likelihood estimate of $p_0$ is k/n.
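The worked example can be checked numerically. The sketch below evaluates the likelihood at p = 0.5 and then searches a grid of candidate values for the maximizer of the log likelihood; the grid resolution and function names are assumptions made for illustration.

```python
import math


def binomial_likelihood(k, n, p):
    """f(k; p) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_log_likelihood(k, n, p):
    """log L(p; k) = k log p + (n - k) log(1 - p) + log C(n, k)."""
    return k * math.log(p) + (n - k) * math.log(1 - p) + math.log(math.comb(n, k))


n, k = 10, 3

# Probability of observing X = 3 if p0 were 0.5.
like_half = binomial_likelihood(k, n, 0.5)

# Grid search over the open interval (0, 1); the step of 0.001 is an arbitrary choice.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: binomial_log_likelihood(k, n, p))

print(round(like_half, 3), p_hat)
```

A grid search is used here only because it mirrors the "try every p" reasoning of the slide; the closed form k/n gives the same answer directly.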

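Example: Poisson distribution function

In the previous example we saw that the problem of finding the maximum likelihood estimate reduced to the problem of optimizing the likelihood function.

Now let $x_1, x_2, \ldots, x_n$ be a set of iid Poisson($\mu$) random variables. The likelihood is

$$L(\mu; x) = \prod_{i=1}^{n} \frac{e^{-\mu} \mu^{x_i}}{x_i!} = e^{-n\mu}\, \mu^{\sum_{i=1}^{n} x_i} \prod_{i=1}^{n} \frac{1}{x_i!}$$

$$\log L(\mu; x) = -n\mu + \Big(\sum_{i=1}^{n} x_i\Big) \log \mu - \log \prod_{i=1}^{n} (x_i!)$$

Setting the derivative to zero:

$$\frac{\partial}{\partial \mu} \log L(\mu; x) = 0 \;\Rightarrow\; -n + \frac{\sum_{i=1}^{n} x_i}{\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}$$

The second derivative is negative, so this point is a maximum:

$$\frac{\partial^2}{\partial \mu^2} \log L(\mu; x) = -\frac{1}{\mu^2} \sum_{i=1}^{n} x_i < 0$$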
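As a quick numerical check of the Poisson result, the sketch below computes the sample mean and confirms that the log likelihood is lower at nearby values of µ. The counts in `xs` are hypothetical, and `math.lgamma(x + 1)` is used for log(x!).

```python
import math


def poisson_log_likelihood(mu, xs):
    """log L(mu; x) = -n*mu + (sum x_i) log mu - sum log(x_i!)."""
    n = len(xs)
    return -n * mu + sum(xs) * math.log(mu) - sum(math.lgamma(x + 1) for x in xs)


# Hypothetical iid counts, chosen only for illustration.
xs = [2, 0, 3, 1, 4, 2, 1, 3]

# MLE of mu: the sample mean.
mu_hat = sum(xs) / len(xs)

ll_at_mle = poisson_log_likelihood(mu_hat, xs)
ll_below = poisson_log_likelihood(mu_hat - 0.1, xs)
ll_above = poisson_log_likelihood(mu_hat + 0.1, xs)

print(mu_hat, ll_at_mle > ll_below, ll_at_mle > ll_above)
```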


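Gaussian (Normal) Distribution

For the moment generating function, see the Moment Generating Functions slides at the end of this lesson.

$p(x) = \mathcal{N}(\mu, \sigma^2)$:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]$$

[Figure: Gaussian density curve with µ, σ, 2σ and 3σ marked on the x axis]

Sample $X = \{x^t\}_{t=1}^{N}$ with $x^t \sim \mathcal{N}(\mu, \sigma^2)$. The log likelihood is

$$L(\mu, \sigma|X) = -\frac{N}{2} \log(2\pi) - N \log \sigma - \frac{\sum_t (x^t - \mu)^2}{2\sigma^2}$$

MLE for µ and σ²:

$$m = \frac{\sum_t x^t}{N}, \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N}$$

Note: we use Greek letters for population parameters and Roman letters for their estimates from the sample; for example, m estimates µ and s² estimates σ².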
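A minimal sketch of the two Gaussian estimators, assuming a small hypothetical sample `xs`. Note that the variance estimate divides by N, as in the MLE, not by N - 1.

```python
def gaussian_mle(xs):
    """Return (m, s2): the MLE of the mean and of the variance (divides by N)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return m, s2


# Hypothetical sample, chosen so the arithmetic is easy to follow by hand.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

m, s2 = gaussian_mle(xs)
print(m, s2)
```

For this sample the sum is 40 and the squared deviations from the mean sum to 32, so with N = 8 the estimates are m = 5 and s² = 4.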

Evaluating an Estimator: Bias and Variance

Let X be a sample from a population specified up to a parameter θ, and let d = d(X) be an estimator of θ. To evaluate the quality of this estimator, we can measure how much it differs from θ, that is, $(d(X) - \theta)^2$.

But since this quantity is a random variable (it depends on the sample), we need to average it over possible X and consider $r(d, \theta)$, the mean square error (MSE) of the estimator d:

$$\text{MSE: } r(d, \theta) = E[(d(X) - \theta)^2]$$

$$\text{Variance: } E[(d - E[d])^2]$$

$$\text{Bias: } b_\theta(d) = E[d(X)] - \theta$$

If $b_\theta(d) = 0$ for all θ values, then we say that d is an unbiased estimator of θ. For example, with $x^t$ drawn from some density with mean µ, the sample average m is an unbiased estimator of µ, because

$$E[m] = E\left[\frac{\sum_t x^t}{N}\right] = \frac{1}{N} \sum_t E[x^t] = \frac{N\mu}{N} = \mu$$

This means that although m may differ from µ on a particular sample, if we take many such samples $X_i$ and compute many estimates $m_i = m(X_i)$, their average gets close to µ as the number of samples increases.

m is also a consistent estimator, i.e., Var(m) → 0 as N → ∞:

$$\mathrm{Var}(m) = \mathrm{Var}\left(\frac{\sum_t x^t}{N}\right) = \frac{1}{N^2} \sum_t \mathrm{Var}(x^t) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}$$

As N, the number of points in the sample, gets larger, m deviates less from µ. Let us now check $s^2$, the MLE of σ²:

$$s^2 = \frac{\sum_t (x^t - m)^2}{N} = \frac{\sum_t (x^t)^2 - N m^2}{N}$$

$$E[s^2] = \frac{\sum_t E[(x^t)^2] - N\, E[m^2]}{N}$$

Given that $\mathrm{Var}(X) = E[X^2] - E[X]^2$, we get $E[X^2] = \mathrm{Var}(X) + E[X]^2$, and we can write

$$E[(x^t)^2] = \sigma^2 + \mu^2, \qquad E[m^2] = \frac{\sigma^2}{N} + \mu^2$$

Then we have

$$E[s^2] = \frac{N(\sigma^2 + \mu^2) - N(\sigma^2/N + \mu^2)}{N} = \left(\frac{N - 1}{N}\right)\sigma^2 \neq \sigma^2$$

which shows that $s^2$ is a biased estimator of σ². $(N/(N-1))\, s^2$ is an unbiased estimator. However, when N is large, the difference is negligible. This is an example of an asymptotically unbiased estimator, whose bias goes to 0 as N goes to infinity.

Because m is estimated from the same sample, one degree of freedom is lost, and $s^2$ underestimates σ².
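The bias of s² can be seen in a small simulation. The sketch below draws many samples of size N from a normal population and averages the MLE of the variance; the population values, the sample size, the number of trials and the seed are all arbitrary assumptions.

```python
import random


def biased_variance(xs):
    """MLE of the variance: divides by N, not by N - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n


rng = random.Random(0)           # fixed seed so the run is repeatable
mu, sigma = 1.0, 2.0             # assumed population parameters (sigma^2 = 4)
N, trials = 5, 20000             # small N makes the bias easy to see

total = 0.0
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(N)]
    total += biased_variance(sample)

avg_s2 = total / trials                   # should be near ((N - 1) / N) * sigma^2
avg_corrected = avg_s2 * N / (N - 1)      # should be near sigma^2
expected_biased = (N - 1) / N * sigma**2

print(avg_s2, expected_biased, avg_corrected)
```

With N = 5 the theoretical expectation of s² is 0.8 σ² = 3.2, so the average of s² should settle visibly below the true variance of 4, while the corrected average should be close to 4.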

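Mean Square Error (MSE), with d = d(X):

$$\begin{aligned}
r(d, \theta) &= E[(d - \theta)^2] \\
&= E[(d - E[d] + E[d] - \theta)^2] \\
&= E[(d - E[d])^2 + (E[d] - \theta)^2 + 2(E[d] - \theta)(d - E[d])] \\
&= E[(d - E[d])^2] + E[(E[d] - \theta)^2] + 2E[(E[d] - \theta)(d - E[d])] \\
&= E[(d - E[d])^2] + (E[d] - \theta)^2 + 2(E[d] - \theta)\, E[d - E[d]] \\
&= E[(d - E[d])^2] + (E[d] - \theta)^2 \\
&= \text{Variance} + \text{Bias}^2
\end{aligned}$$

The cross term vanishes because $E[d] - \theta$ is a constant and $E[d - E[d]] = 0$.

$$r(d, \theta) = \mathrm{Var}(d) + (b_\theta(d))^2$$

Unknown parameter θ; estimator $d_i = d(X_i)$ on sample $X_i$.

Bias: $b_\theta(d) = E[d] - \theta$
Variance: $E[(d - E[d])^2]$
MSE: $r(d, \theta) = E[(d - \theta)^2]$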
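The decomposition can also be checked empirically. The sketch below uses a deliberately biased estimator, 0.8 times the sample mean (a hypothetical choice made only so that the bias is nonzero), and compares the empirical MSE with the sum of the empirical variance and squared bias.

```python
import random

rng = random.Random(1)            # fixed seed so the run is repeatable
theta, sigma = 2.0, 1.0           # assumed true mean and standard deviation
N, trials = 10, 5000


def shrunk_mean(xs):
    """A deliberately biased estimator of the mean: 0.8 * sample average."""
    return 0.8 * sum(xs) / len(xs)


# One estimate d_i = d(X_i) per simulated sample X_i.
estimates = [
    shrunk_mean([rng.gauss(theta, sigma) for _ in range(N)])
    for _ in range(trials)
]

mean_d = sum(estimates) / trials
bias = mean_d - theta                                        # E[d] - theta
variance = sum((d - mean_d) ** 2 for d in estimates) / trials
mse = sum((d - theta) ** 2 for d in estimates) / trials

print(mse, variance + bias**2)
```

When the averages are taken over the same set of estimates, the identity holds up to floating-point rounding, not just approximately, because the algebra of the derivation applies to empirical averages as well.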

The decomposition allows us to see that the Mean Squared Error (MSE) of a model (generated by a particular learning algorithm) is in fact made up of two components.

The bias component tells us how accurate the model is, on average across different possible training sets.

The variance component tells us how sensitive the learning algorithm is to small changes in the training set.

Mathematically, this can be quantified as a decomposition of the mean squared error function.

For a testing example {x, d}, the decomposition has the same form: MSE = bias² + variance.

[Figure: estimates from a random generator under different distributions, with θ and µ marked]

Bias-Variance Decomposition

The figure shows that the bias-variance decomposition is like trying to hit the bullseye on a dartboard.

Each dart is thrown after training our “dart-throwing” model in a slightly different manner.

If the darts vary wildly, the learner is high variance.

If they are far from the bullseye, the learner is high bias.

The ideal is clearly to have both low bias and low variance; however this is often difficult, giving an alternative terminology as the bias-variance “dilemma” (Dartboard analogy, Moore & McCabe (2002))

Case       MSE   Var   bias²
OK         0     0     0
Overfit    ?     ?     0
Underfit   ?     0     ?
Error      ?     ?     ?

Parametric Estimation

• $X = \{x^t\}_t$ where $x^t \sim p(x)$
• Parametric estimation: assume a form for $p(x|\theta)$ and estimate $\theta$, its sufficient statistics, using $X$; e.g., $\mathcal{N}(\mu, \sigma^2)$ where $\theta = \{\mu, \sigma^2\}$

Maximum Likelihood Estimation

• Likelihood of θ given the sample X: $l(\theta|X) = p(X|\theta) = \prod_t p(x^t|\theta)$
• Log likelihood: $L(\theta|X) = \log l(\theta|X) = \sum_t \log p(x^t|\theta)$
• Maximum likelihood estimator (MLE): $\theta^* = \arg\max_\theta L(\theta|X)$

Examples: Bernoulli/Multinomial

• Bernoulli: two states, failure/success, $x \in \{0, 1\}$
  $P(x) = p_o^x (1 - p_o)^{1 - x}$
  $L(p_o|X) = \log \prod_t p_o^{x^t} (1 - p_o)^{1 - x^t}$
  MLE: $p_o = \sum_t x^t / N$
• Multinomial: K > 2 states, $x_i \in \{0, 1\}$
  $P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$
  $L(p_1, p_2, \ldots, p_K|X) = \log \prod_t \prod_i p_i^{x_i^t}$
  MLE: $p_i = \sum_t x_i^t / N$

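Gaussian (Normal) Distribution

• $p(x) = \mathcal{N}(\mu, \sigma^2)$:
  $p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\dfrac{(x - \mu)^2}{2\sigma^2}\right]$
• MLE for µ and σ²:
  $m = \dfrac{\sum_t x^t}{N}, \qquad s^2 = \dfrac{\sum_t (x^t - m)^2}{N}$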

Bias and Variance

Unknown parameter θ; estimator $d_i = d(X_i)$ on sample $X_i$.

Bias: $b_\theta(d) = E[d] - \theta$
Variance: $E[(d - E[d])^2]$

Mean square error:
$r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance}$


[Slide: Sample mean and variance (figure not recoverable)]

Moment Generating Functions

The moment generating function φ(t) of the random variable X is defined for all values of t by

$$\varphi(t) = E[e^{tX}]$$

We call φ(t) the moment generating function because all of the moments of X can be obtained by successively differentiating φ(t). Example:

$$\varphi'(t) = \frac{d}{dt} E[e^{tX}] = E[X e^{tX}], \qquad \text{hence } \varphi'(0) = E[X]$$

In general, the nth derivative of φ(t) evaluated at t = 0 equals $E[X^n]$; that is,

$$\varphi^{(n)}(0) = E[X^n], \quad n \geq 1$$

An important property of moment generating functions is that the moment generating function of the sum of independent random variables is just the product of the individual moment generating functions. Suppose that X and Y are independent and have moment generating functions $\varphi_X(t)$ and $\varphi_Y(t)$, respectively. Then $\varphi_{X+Y}(t)$, the moment generating function of X + Y, is given by

$$\varphi_{X+Y}(t) = E[e^{t(X+Y)}] = E[e^{tX}]\, E[e^{tY}] = \varphi_X(t)\, \varphi_Y(t)$$
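Both properties can be illustrated numerically with a Bernoulli variable, whose moment generating function is $\varphi(t) = (1 - p) + p e^t$ and whose moments $E[X^n]$ all equal p. This is a minimal sketch: the finite-difference step `h` and the probabilities 0.3 and 0.6 are arbitrary assumptions.

```python
import math


def bernoulli_mgf(t, p):
    """phi(t) = E[e^{tX}] = (1 - p) * e^0 + p * e^t for X ~ Bernoulli(p)."""
    return (1 - p) + p * math.exp(t)


def derivative_at_zero(phi, order, h=1e-3):
    """Central finite-difference approximation of phi'(0) or phi''(0)."""
    if order == 1:
        return (phi(h) - phi(-h)) / (2 * h)
    if order == 2:
        return (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2
    raise ValueError("only orders 1 and 2 are implemented in this sketch")


p, q, t = 0.3, 0.6, 0.5

# Moments from derivatives at t = 0: both should be close to p.
first_moment = derivative_at_zero(lambda s: bernoulli_mgf(s, p), 1)
second_moment = derivative_at_zero(lambda s: bernoulli_mgf(s, p), 2)

# MGF of X + Y for independent X ~ Bernoulli(p), Y ~ Bernoulli(q),
# computed directly by enumerating the four joint outcomes.
mgf_sum = sum(
    (p if x else 1 - p) * (q if y else 1 - q) * math.exp(t * (x + y))
    for x in (0, 1)
    for y in (0, 1)
)
mgf_product = bernoulli_mgf(t, p) * bernoulli_mgf(t, q)

print(first_moment, second_moment, mgf_sum, mgf_product)
```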

In probability theory and statistics, skewness is a measure of the extent to which a probability distribution of a real-valued random variable "leans" to one side of the mean.

The skewness value can be positive or negative, or even undefined.

In probability theory and statistics, kurtosis (from the Greek word κυρτός, kyrtos or kurtos, meaning curved, arching) is any measure of the "peakedness" of the probability distribution of a real-valued random variable.

In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population.

There are various interpretations of kurtosis, and of how particular measures should be interpreted; these are primarily peakedness (width of peak), tail weight, and lack of shoulders (distribution primarily peak and tails, not in between).

Kurtosis and skewness in terms of moments: Moment 0 (?), Moment 1, Moment 2, Moment 3, Moment 4.

Gaussian (Normal) Distribution

Mean = ?, Variance = ?, Skewness = ?, Kurtosis = ?
