Download ppt - Lecture 5 slides on Central Limit Theorem Stratified Sampling How to acquire random sample Prepared by Amrita Tamrakar

Lecture 5 slides on

• Central Limit Theorem

• Stratified Sampling

• How to acquire random sample

Prepared by

Amrita Tamrakar

Central Limit Theorem

Assume a given population of numbers

P={ x1,x2,…….infinity}

xi xj

Let xp= average of P, σp= variance of P,

k = tuples from sample, µs= average of sample.

-Does µs remain fixed?

Standard Error formula says, E(µs) = xp

If σs= variance of the average of sample then

E(µs) = xp

σs2 = σp

2 / k

Interesting phenomenon

If we plot µ, it is not going to be skewed but give a bell curve even though the actual population may be any distribution.

The Central limit theorem says:

As we repeat sampling random distribution, the randomness disappears and gets a bell shaped curve which gets tighter as we proceed.

2)(

2

2

2

)(

x

exf

0k 40k 200k

Skewed Distribution of salary

x = exact avg

Plot µ

Our main objective is

Not to reduce the error but to give exact error interval. Hence we need to find the variance.

There are two options to find variance σp

1) Use a materialized view with an extra column e.g.. 0 for females, 1 for males

2) Calculate the sample variance many times to get an unbiased original variance .i.e. Use sample variance as a surrogate of original variance.

Which one will be better?

http://www.math.duke.edu/~wka/math135/confidence.pdf

x-d x+dx

Area=0.95

∫ =1

Error Interval with Confidence level

• To give the error interval with 95% confidence.

• Find a point d which will give an area=0.95 from the curve, then x±d will be the error with 95% confidence

Alternatively, to find out d we can calculate 1.96*sd

Where standard deviation (sd)= σp /√ k

http://www.math.duke.edu/~wka/math135/confidence.pdf

Stratified Sampling

Will stratification of salary give a more accurate results?

50k 100k 200k0 k N1 N2 Nr

Population P broken into r strata (P1…Pr ) :

Sample Mean σ1

Sample Size k1

P1

σ2

k2

P2

σr

kr

Pr

Technique to stratify is to minimize variance in each strata.

Total sample = k1+k2+……+kr

Mean of sample

µs= N

NNN rr ...2211

Challenges :

1) Stratification : How to break into strata

2) Allocation : How many samples from 1st group, 2nd group…….? i.e. how to allocate samples

In this graph, can we say get more samples from 30-70k range (allocation strategy) ?

0k 30k 40k 70k

How data is organized in database?

• in disc blocks

• To read a single record , need to read the entire disc block

• Clustered index , B+ tree are some of the indexing techniques.

Two approaches for sampling

• Online sampling

• Offline sampling also called pre-computed sampling

Effects :

• Online sampling costly in-terms of response time.• Offline sampling can be done during pre-processing time.• Reuse the sample again.

How to get sample data :

Generate a random number between 0-106 and pull out the record with that record id.

OR

Bernoulli's theorem :

• Go to each record

• Toss a coin

• If head then pull out the record, else leave it.

Note: May not get the exact sample size

How to maintain freshness of data in random sample via offline method?

• Doesn’t matter much as they are done for history data• What if the original query changes? May be it was directed

towards particular field only..

Generate the random sample again as it doesn’t matter much towards the performance since it is pre-processed. E.g. generate once in 3 months.

Oracle, sqlserver are having the random sampling functionality added in their newer versions.