Introduction
• Reasons for uncertainty
  – Prediction
    • Making a prediction about tomorrow based on data we have today
  – Sampling
    • The data may be a sample from a population, and we don't know how our data differ from another sample (or from the population)
  – Missing or unknown values
    • We need to guess these values
    • Example: censored data
Introduction
• Dealing with uncertainty
  – Probability
  – Fuzzy logic
• Probability theory vs. probability calculus
  – Probability theory
    • Mapping from the real world to a mathematical representation
  – Probability calculus
    • Based on well-defined and generally accepted axioms
    • The aim is to explore the consequences of those axioms
Introduction
• Frequentist (probability is objective)
  – The probability of an event is defined as the limiting proportion of times that the event would occur in identical situations
  – Examples
    • The proportion of times a head comes up when tossing the same coin repeatedly
    • Assessing the probability that a customer in a supermarket will buy a certain item (using similar customers)
Introduction
• Bayesian (subjective probability)
  – Explicit characterization of all uncertainty, including any parameters estimated from the data
  – Probability is an individual's degree of belief that a given event will occur
• Frequentist vs. Bayesian
  – Toss a coin 10 times and get 7 heads
  – A frequentist estimates the probability as P(A) = 7/10
  – A Bayesian starts from a prior guess, say P(A) = 0.5, then combines this prior with the data to estimate the probability
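The contrast can be sketched in a few lines of Python; the uniform Beta(1, 1) prior here is an illustrative assumption, not something fixed by the slides:

```python
# Frequentist estimate: the observed proportion of heads.
heads, tosses = 7, 10
p_freq = heads / tosses  # 0.7

# Bayesian sketch (assumption: a uniform Beta(1, 1) prior on p).
# The posterior is Beta(1 + heads, 1 + tails); its mean pulls the
# estimate back toward the prior guess of 0.5.
alpha, beta = 1 + heads, 1 + (tosses - heads)
p_bayes = alpha / (alpha + beta)  # posterior mean = 8/12

print(p_freq, p_bayes)
```

With more data the two estimates converge, which matches the slide's point that the prior matters most when data are scarce.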
Random variable
• A mapping from a property of objects to a variable that can take a set of possible values, via a process that appears to the observer to have an element of unpredictability
• Examples
  – Coin toss (domain is the set {heads, tails})
  – Number of times a coin has to be tossed to get a head
    • Domain is the positive integers
  – A student's score
    • Domain is the set of integers from 0 to 100
Properties of single random variable
• X is a random variable and x is its value
• If the domain is finite:
  – probability mass function p(x)
• If the domain is the real line:
  – probability density function f(x)
• Expectation of X
  – Discrete: E[X] = Σ_i x_i p(x_i)
  – Continuous: E[X] = ∫ x f(x) dx
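A minimal check of the discrete formula, using a fair die as the example:

```python
from fractions import Fraction

# E[X] = Σ_i x_i p(x_i) for a fair six-sided die.
p = {x: Fraction(1, 6) for x in range(1, 7)}  # uniform pmf
expectation = sum(x * p[x] for x in p)
print(expectation)  # 7/2, i.e. 3.5
```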
Multivariate random variable
• A set of several random variables
• For a p-dimensional vector x = {x_1, ..., x_p}
• The joint mass function is
  p(X_1 = x_1, ..., X_p = x_p) = p(x_1, ..., x_p)
The joint mass function
• For example
  – Roll two fair dice; X represents the first die's result and Y the second's
  – Then p(X = 3, Y = 3) = 1/6 · 1/6 = 1/36
The joint mass function
         Y=1   Y=2   Y=3   Y=4   Y=5   Y=6   p_X(x)
X=1     1/36  1/36  1/36  1/36  1/36  1/36    1/6
X=2     1/36  1/36  1/36  1/36  1/36  1/36    1/6
X=3     1/36  1/36  1/36  1/36  1/36  1/36    1/6
X=4     1/36  1/36  1/36  1/36  1/36  1/36    1/6
X=5     1/36  1/36  1/36  1/36  1/36  1/36    1/6
X=6     1/36  1/36  1/36  1/36  1/36  1/36    1/6
p_Y(y)   1/6   1/6   1/6   1/6   1/6   1/6     1

(each marginal 1/6 ≈ 0.17)
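The table can be reproduced programmatically; this sketch builds the joint pmf for two fair dice and sums out Y to recover the marginal of X:

```python
from fractions import Fraction

# Joint pmf of two independent fair dice: p(x, y) = 1/36 for every pair.
joint = {(x, y): Fraction(1, 36) for x in range(1, 7) for y in range(1, 7)}

assert joint[(3, 3)] == Fraction(1, 36)

# Marginal of X: sum the joint over all values of Y (the row sums).
px = {x: sum(joint[(x, y)] for y in range(1, 7)) for x in range(1, 7)}
print(px[1])  # 1/6 (about 0.17, as in the table)

# The whole table sums to 1.
assert sum(joint.values()) == 1
```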
Marginal probability mass function
• The marginal probability mass functions of X and Y are
  f_X(x) = p(X = x) = Σ_y p(x, y)
  f_Y(y) = p(Y = y) = Σ_x p(x, y)
Continuous
• The marginal probability density functions of X and Y are
  f_X(x) = ∫ f_XY(x, y) dy
  f_Y(y) = ∫ f_XY(x, y) dx
Conditional probability
• Density of a single variable (or a subset of complete set of variables) given (or “conditioned on”) particular values of other variables
• The conditional density of X given some value of Y is denoted f(x|y) and defined as
  f(x|y) = f(x, y) / f(y)
Conditional probability
• For example
  – A student's score is chosen at random
  – The sample space is S = {0, 1, ..., 100}
  – What is the probability that the student fails (F: score below 60)?
    • P(F) = |F| / |S| = 60/101
  – Given that the student's score is even (E, including 0), what is the probability that the student fails?
    • P(F|E) = P(E ∩ F) / P(E) = (30/101) / (51/101) = 30/51
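The same numbers can be verified by enumeration; the assumption that "fail" means a score below 60 is inferred from P(F) = 60/101:

```python
from fractions import Fraction

scores = range(101)                    # sample space S = {0, 1, ..., 100}
F = {s for s in scores if s < 60}      # assumption: "fail" = score below 60
E = {s for s in scores if s % 2 == 0}  # even scores, including 0

P = lambda A: Fraction(len(A), 101)    # uniform distribution on the 101 scores

assert P(F) == Fraction(60, 101)
assert P(E) == Fraction(51, 101)
# Conditional probability: P(F|E) = P(E ∩ F) / P(E).
assert P(E & F) / P(E) == Fraction(30, 51)
```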
Conditional independence
• A generic problem in data mining is finding relationships between variables
  – Is purchasing item A likely to be related to purchasing item B?
• Variables are independent if there is no relationship; otherwise they are dependent
• X and Y are independent if p(x, y) = p(x)p(y)
• Consequently, for independent A and B, conditioning has no effect:
  p(A|B) = p(A, B) / p(B) = p(A)p(B) / p(B) = p(A)
  p(B|A) = p(A, B) / p(A) = p(A)p(B) / p(A) = p(B)
Conditional Independence: More than 2 variables
• X is conditionally independent of Y given Z if, for all values of X, Y, and Z,
  p(x, y | z) = p(x | z) p(y | z)
Conditional Independence: More than 2 variables
• Example
  – P(F) = 60/101
  – P(E ∩ F) = 30/101
  – So E and F are dependent, since P(E ∩ F) ≠ P(E)P(F)
  – Let B be the event that the student's score is not 100; then
    • P(F|B) = 60/100
    • P(E|B) = 1/2
    • P(E ∩ F|B) = 30/100 = (60/100) · (1/2)
  – Given B, E and F are independent
Conditional Independence: More than 2 variables
• Example
  – Let C be the event that the student's score is 100; then
    • P(F|C) = 0
    • P(E|C) = 1
    • P(E ∩ F|C) = 0 = 1 · 0
    • Given C, E and F are independent
  – Now we can calculate P(E ∩ F) by total probability:
    P(E ∩ F) = P(E ∩ F | score ≠ 100) P(score ≠ 100) + P(E ∩ F | score = 100) P(score = 100)
             = 0.3 · (100/101) + 0 · (1/101) = 30/101
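A sketch that checks all three claims by enumeration (again assuming "fail" means a score below 60):

```python
from fractions import Fraction

scores = range(101)
F = {s for s in scores if s < 60}      # fail (assumption: score below 60)
E = {s for s in scores if s % 2 == 0}  # even score, including 0
B = {s for s in scores if s != 100}    # condition: score != 100
C = {100}                              # condition: score == 100

def P(A, given=None):
    # Uniform probability, optionally conditioned on an event.
    if given is None:
        return Fraction(len(A), 101)
    return Fraction(len(A & given), len(given))

# Marginally, E and F are dependent ...
assert P(E & F) != P(E) * P(F)
# ... but conditionally independent given B, and given C.
assert P(E & F, B) == P(E, B) * P(F, B)
assert P(E & F, C) == P(E, C) * P(F, C)
# Total probability recovers P(E ∩ F) = 30/101.
assert P(E & F, B) * P(B) + P(E & F, C) * P(C) == Fraction(30, 101)
```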
Conditional Independence
• Conditional independence does not imply marginal independence
• Note also that X and Y may be unconditionally independent but conditionally dependent given Z
  p(x, y | z) = p(x | z) p(y | z)   does not imply   p(x, y) = p(x) p(y)
On assuming independence
• Independence is a strong assumption, frequently violated in practice
• But it provides modeling advantages
  – Fewer parameters
  – Understandable models
Dependence and Correlation
• Covariance measures how X and Y vary together
  – Large and positive if large X is associated with large Y, and small X with small Y
  – Negative if large X is associated with small Y
• Two variables may be dependent but not linearly correlated
Correlation and Causation
• Two variables may be highly correlated without a causal relationship between them
  – Yellow-stained fingers and lung cancer may be correlated, but they are causally linked only through a third variable: smoking
  – Human reaction time and earned income are negatively correlated
    • This does not mean one causes the other
    • A third variable, age, is causally related to both
Samples and Statistical inference
• Samples can be used to model the data
• If the goal is to detect small deviations from the data, the sample size will affect the result
Estimation
• In inference we want to make statements about the entire population from which the sample is drawn
• Two important methods for estimating the parameters of a model
  – Maximum likelihood estimation
  – Bayesian estimation
Desirable properties of estimators
• Let θ̂ be an estimate of parameter θ
• Two measures of estimator quality
  – Bias: the difference between the expected and true value
    Bias(θ̂) = E[θ̂] − θ
  – Variance of the estimate
    Var(θ̂) = E[(θ̂ − E[θ̂])^2]
Mean squared error
• The mean of the squared difference between the value of the estimator and the true value of the parameter
  E[(θ̂ − θ)^2]
• The mean squared error can be partitioned as the sum of the squared bias and the variance
  E[(θ̂ − θ)^2] = (Bias(θ̂))^2 + Var(θ̂)
Mean squared error
E[(θ̂ − θ)^2] = E[(θ̂ − E[θ̂] + E[θ̂] − θ)^2]
             = E[(θ̂ − E[θ̂])^2] + (E[θ̂] − θ)^2 + 2 E[(θ̂ − E[θ̂])(E[θ̂] − θ)]
             = Var(θ̂) + (Bias(θ̂))^2 + 2 (E[θ̂] − θ) E[θ̂ − E[θ̂]]
             = Var(θ̂) + (Bias(θ̂))^2        since E[θ̂ − E[θ̂]] = E[θ̂] − E[θ̂] = 0

Using: (a − b + b − c)^2 = (a − b)^2 + (b − c)^2 + 2(a − b)(b − c)
       E[E[X]] = E[X], and E[c] = c where c is a constant
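The decomposition can also be checked empirically; this sketch uses a hypothetical estimator (the mean of 10 uniform draws) with known true value θ = 0.5:

```python
import random

random.seed(0)

# Hypothetical setup: estimate theta = 0.5 (the mean of Uniform(0, 1))
# with the mean of 10 draws; repeat many times to get the estimator's
# empirical distribution.
theta = 0.5
estimates = [sum(random.random() for _ in range(10)) / 10
             for _ in range(20000)]

m = len(estimates)
mean_est = sum(estimates) / m
bias = mean_est - theta                                # E[theta_hat] - theta
var = sum((e - mean_est) ** 2 for e in estimates) / m  # Var(theta_hat)
mse = sum((e - theta) ** 2 for e in estimates) / m     # E[(theta_hat - theta)^2]

# MSE = Bias^2 + Var holds for the empirical moments,
# up to floating-point rounding.
assert abs(mse - (bias ** 2 + var)) < 1e-6
```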
Maximum Likelihood Estimation
• The most widely used method for parameter estimation
• The likelihood function is the probability that the data D would have arisen for a given value of θ
• The value of θ for which the data have the highest probability is the MLE
  L(θ|D) = p(x(1), ..., x(n) | θ) = Π_{i=1}^{n} p(x(i) | θ)
Example of MLE for Binomial
• Customers either purchase or do not purchase milk
  – We want an estimate of the proportion purchasing
• Binomial with unknown parameter θ
• Samples x(1), ..., x(1000), where r customers purchase milk
• Assuming conditional independence, the likelihood function is
  L(θ | x(1), ..., x(1000)) = Π_i θ^{x(i)} (1 − θ)^{1 − x(i)} = θ^r (1 − θ)^{1000 − r}
Log-likelihood Function
• Since log is monotone, the θ with the highest probability also maximizes the log-likelihood function
  l(θ) = log L(θ) = r log(θ) + (1000 − r) log(1 − θ)
• Differentiating and setting the derivative equal to zero:
  dl/dθ = r/θ − (1000 − r)/(1 − θ) = 0
  r(1 − θ) = (1000 − r)θ
  θ̂ = r / 1000
Example of MLE for Binomial
• r milk purchases out of n customers
• θ is the probability that milk is purchased by a random customer
• For three data sets
  – r = 7, n = 10
  – r = 70, n = 100
  – r = 700, n = 1000
• The uncertainty becomes smaller as n increases
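A small sketch confirming that maximizing the log-likelihood numerically recovers the closed form θ̂ = r/n for the first data set:

```python
import math

# Log-likelihood for the binomial example:
# l(theta) = r*log(theta) + (n - r)*log(1 - theta)
def log_likelihood(theta, r, n):
    return r * math.log(theta) + (n - r) * math.log(1 - theta)

r, n = 7, 10
mle = r / n  # closed form from setting dl/dtheta = 0

# A coarse grid search over (0, 1) agrees with the closed form.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, r, n))
assert abs(best - mle) < 1e-3
print(best)
```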
Likelihood under Normal Distribution
• Known variance σ^2, unknown mean θ
• Density for one observation:
  f(x) = (1/√(2πσ^2)) exp(−(x − θ)^2 / (2σ^2))
• Likelihood function:
  L(θ | x(1), ..., x(n)) = Π_{i=1}^{n} (2πσ^2)^{−1/2} exp(−(x(i) − θ)^2 / (2σ^2))
                         = (2πσ^2)^{−n/2} exp(−(1/(2σ^2)) Σ_{i=1}^{n} (x(i) − θ)^2)
Log-likelihood function
• To find the MLE, set the derivative dl/dθ to zero
  l(θ | x(1), ..., x(n)) = −(n/2) log(2πσ^2) − (1/(2σ^2)) Σ_{i=1}^{n} (x(i) − θ)^2
  dl/dθ = (1/σ^2) Σ_{i=1}^{n} (x(i) − θ) = 0
  Σ_{i=1}^{n} x(i) − nθ = 0
  θ̂ = (1/n) Σ_{i=1}^{n} x(i)
Likelihood under Normal Distribution
• θ̂ is the estimated mean
• For two randomly generated data sets
  – 20 data points
  – 200 data points
Sufficient statistic
• A quantity s(D) is a sufficient statistic for θ if the likelihood l(θ) depends on the data only through s(D)
• No other statistic calculated from the same sample provides any additional information about the value of the parameter
Interval estimate
• A point estimate doesn't convey the uncertainty associated with it
• An interval estimate provides a confidence interval
Likelihood under Normal Distribution
Normal distribution:
  f(x) = (1/√(2πσ^2)) exp(−(x − μ)^2 / (2σ^2))

Likelihood function:
  L(μ, σ^2 | x) = (2πσ^2)^{−n/2} exp(−(1/(2σ^2)) Σ_{i=1}^{n} (x_i − μ)^2)

Log-likelihood function:
  l(μ, σ^2 | x) = −(n/2) log(2πσ^2) − (1/(2σ^2)) Σ_{i=1}^{n} (x_i − μ)^2
Mean
  dl(μ, σ^2 | x)/dμ = (1/σ^2) Σ_{i=1}^{n} (x_i − μ) = 0
  Σ_{i=1}^{n} x_i − nμ = 0
  μ̂ = x̄ = (1/n) Σ_{i=1}^{n} x_i
Variance
  dl(μ, σ^2 | x)/dσ^2 = −n/(2σ^2) + (1/(2σ^4)) Σ_{i=1}^{n} (x_i − μ)^2 = 0
  nσ^2 = Σ_{i=1}^{n} (x_i − μ)^2
  σ̂^2 = (1/n) Σ_{i=1}^{n} (x_i − μ)^2
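The two closed forms can be sketched directly; the data values below are illustrative:

```python
# MLE for a normal sample: mean = sample mean, variance = (1/n) * sum of
# squared deviations (note the 1/n divisor, not 1/(n-1)).
data = [2.0, 4.0, 6.0, 8.0]
n = len(data)

mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n

assert mu_hat == 5.0
assert var_hat == 5.0
```

The 1/n divisor makes σ̂² slightly biased downward; the familiar 1/(n−1) version is the unbiased correction, not the MLE.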
Bayesian approach
• Frequentist approach
  – The parameters of the population are fixed but unknown
  – The data D = {x(1), ..., x(n)} are a random sample
  – Intrinsic variability lies in the data
• Bayesian approach
  – The data are known
  – The parameters θ are random variables
  – θ has a distribution of values
  – The prior p(θ) reflects the degree of belief about where the true parameters θ may lie
Bayesian estimation
• The prior is modified into the posterior by Bayes' rule
  p(θ|D) = p(D|θ) p(θ) / p(D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ
• This leads to a distribution rather than a single value
  – A single value can be obtained from the mean or mode
Bayesian estimation
• p(D) is a constant independent of θ, so for a given data set D and a particular model (model = distributions for prior and likelihood)
  p(θ|D) ∝ p(D|θ) p(θ)
• If we have only a weak belief about the parameter before collecting data, we choose a wide prior (e.g., a normal with large variance)
Binomial example
• Single binary variable X; we wish to estimate θ = p(X = 1)
• A prior for a parameter in [0, 1] is the Beta distribution
  p(θ) ∝ θ^{α−1} (1 − θ)^{β−1}
  Beta(θ | α, β) = (Γ(α + β) / (Γ(α)Γ(β))) θ^{α−1} (1 − θ)^{β−1}
  where α > 0 and β > 0 are the two parameters of this model
Binomial example
• Likelihood function
  L(θ|D) = θ^r (1 − θ)^{n−r}
• Combining likelihood and prior
  p(θ|D) ∝ p(D|θ) p(θ) ∝ θ^r (1 − θ)^{n−r} · θ^{α−1} (1 − θ)^{β−1} = θ^{r+α−1} (1 − θ)^{n−r+β−1}
• We get another Beta distribution
  – With parameters r + α and n − r + β
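The conjugate update is a one-line computation; the Beta(2, 2) prior and the purchase counts below are illustrative assumptions:

```python
# Beta(alpha, beta) prior + binomial data (r successes in n trials)
# gives a Beta(alpha + r, beta + n - r) posterior.
def update(alpha, beta, r, n):
    return alpha + r, beta + (n - r)

# Assumption: a Beta(2, 2) prior and 7 milk purchases out of 10 customers.
post = update(2, 2, r=7, n=10)
assert post == (9, 5)

# The posterior mean, alpha / (alpha + beta), sits between the prior
# mean 0.5 and the MLE 0.7.
post_mean = post[0] / (post[0] + post[1])
assert 0.5 < post_mean < 0.7
```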
Advantages of Bayesian approach
• Retains full knowledge of all problem uncertainty
  – e.g., calculating the full posterior distribution of θ
• Natural updating of the distribution as data arrive sequentially
  p(θ | D_1, D_2) ∝ p(D_2 | θ) p(D_1 | θ) p(θ) ∝ p(D_2 | θ) p(θ | D_1)
Predictive distribution
• In the equation that modifies the prior into the posterior
  p(θ|D) = p(D|θ) p(θ) / p(D),  where p(D) = ∫ p(D|θ) p(θ) dθ
• The denominator p(D) is called the predictive distribution of D
• Useful for model checking
  – If the observed data have only a small probability under the model, the model is unlikely to be correct
Normal distribution example
• Suppose x comes from a normal distribution with unknown mean θ and known variance α
• The prior distribution for θ is θ ~ N(μ_0, α_0)
Normal distribution example
p(θ|x) ∝ p(x|θ) p(θ)
       = (1/√(2πα)) exp(−(x − θ)^2 / (2α)) · (1/√(2πα_0)) exp(−(θ − μ_0)^2 / (2α_0))
       ∝ exp(−(1/2) [(1/α + 1/α_0) θ^2 − 2 (x/α + μ_0/α_0) θ])
       ∝ exp(−(θ − μ_1)^2 / (2α_1))

where   α_1 = 1 / (1/α + 1/α_0)   and   μ_1 = α_1 (x/α + μ_0/α_0)

so      p(θ|x) = (1/√(2πα_1)) exp(−(θ − μ_1)^2 / (2α_1))
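A numeric sketch of the update formulas; the variances and observation are illustrative, chosen so the posterior mean is easy to check by hand:

```python
# Posterior for a normal mean with known variance alpha and a
# N(mu0, alpha0) prior, after one observation x:
#   alpha1 = 1 / (1/alpha + 1/alpha0)
#   mu1    = alpha1 * (x/alpha + mu0/alpha0)
def posterior(x, alpha, mu0, alpha0):
    alpha1 = 1 / (1 / alpha + 1 / alpha0)
    mu1 = alpha1 * (x / alpha + mu0 / alpha0)
    return mu1, alpha1

# Assumption: equal prior and data variances, so the posterior mean is
# simply the average of the prior mean and the observation.
mu1, alpha1 = posterior(x=4.0, alpha=1.0, mu0=0.0, alpha0=1.0)
assert mu1 == 2.0     # halfway between mu0 = 0 and x = 4
assert alpha1 == 0.5  # posterior variance shrinks below both inputs
```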
Jeffreys prior
• A reference prior
• Fisher information
  I(θ|x) = −E[∂^2 log L(θ|x) / ∂θ^2]
• Jeffreys prior
  p(θ) ∝ √I(θ|x)
Conjugate priors
• p(θ) is a conjugate prior for p(x|θ) if the posterior distribution p(θ|x) is in the same family as the prior p(θ)
  – Beta prior → Beta posterior
  – Normal prior → Normal posterior
Sampling in Data Mining
• The data set may not be well suited to statistical analysis
  – "Experimental design" in statistics is concerned with optimal ways of collecting data
  – Data miners usually can't control the data collection process
  – The data may be ideally suited to the purposes for which it was collected, but not adequate for its data mining uses
Sampling in Data Mining
• Two ways in which samples arise
  – The database is a sample of the population
  – The database contains every case, but the analysis is based on a sample
    • Not appropriate when we want to find unusual records
Why sampling
• Draw a sample from the database that allows us to construct a model reflecting the structure of the data in the database
  – Efficiency: quicker and easier
• The sample must be representative of the entire database
Systematic sampling
• Tries to ensure representativeness
  – e.g., taking one out of every two records
• Can lead to problems when there are regularities in the database
  – e.g., a data set where the records are of married couples
Random Sampling
• Avoids regularities
• Epsem sampling (equal probability of selection method)
  – Each record has the same probability of being chosen
Variance of Mean of Random Sample
• If the variance of a population of size N is σ^2, the variance of the mean of a simple random sample of size n drawn without replacement is
  Var(x̄) = σ^2/n − σ^2/N = (σ^2/n)(1 − n/N)
• Usually N >> n, so the second term is small, and the variance decreases as the sample size increases
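An exact check by enumerating every possible sample, assuming σ² is the population variance computed with the N − 1 divisor (the convention under which the formula is exact):

```python
from itertools import combinations

# Exact check of Var(sample mean) = (sigma^2 / n) * (1 - n/N) for simple
# random sampling without replacement. Assumption: sigma^2 uses the
# N - 1 divisor; the population values are illustrative.
pop = [1.0, 2.0, 4.0, 8.0, 10.0]
N, n = len(pop), 2
mu = sum(pop) / N
sigma2 = sum((x - mu) ** 2 for x in pop) / (N - 1)

# Enumerate all C(N, n) equally likely samples and their means.
means = [sum(s) / n for s in combinations(pop, n)]
var_mean = sum((m - mu) ** 2 for m in means) / len(means)

assert abs(var_mean - (sigma2 / n) * (1 - n / N)) < 1e-12
```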
Stratified Random Sampling
• Split the population into non-overlapping subpopulations, or strata
• Advantages
  – Enables making statements about each of the subpopulations separately
  – For example, one of the credit card companies we work with categorizes transactions into 26 categories: supermarket, gas station, and so on
Mean of Stratified Sample
• The total size of the population is N
• The k-th stratum has N_k elements in it
• n_k elements are chosen for the sample from this stratum
• The sample mean within the k-th stratum is x̄_k
• Estimate of the population mean:
  (1/N) Σ_k N_k x̄_k
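A minimal sketch of the estimator with made-up strata:

```python
from fractions import Fraction

# Stratified estimate of the population mean: sum_k (N_k / N) * xbar_k,
# weighting each stratum's sample mean by its share of the population.
strata = [
    # (N_k, sample drawn from stratum k) -- illustrative numbers
    (60, [10, 12, 14]),
    (40, [30, 34]),
]
N = sum(Nk for Nk, _ in strata)

estimate = sum(
    Fraction(Nk, N) * Fraction(sum(sample), len(sample))
    for Nk, sample in strata
)
print(estimate)  # weighted mean of the stratum sample means
```

Here the stratum means are 12 and 32, so the estimate is 0.6 · 12 + 0.4 · 32 = 20.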