No lecture on Friday, 9 October. The exercise session this week will take place.

Last lecture summary

• Mode
• Distribution
• Five-number summary, percentiles, mean
• Box plot, modified box plot
• Robust statistic – mean, median, trimmed mean
• Outlier

SDA girls – histogram of heights 2014

n = 48 or N = 48

bin size = 3.8

SDA girls – all previous years + actual

n = 69

bin size = 3.8

MEASURES OF VARIABILITY

Setting the mood

www.udacity.com – Introduction to statistics

QUESTION

• Mean1, Mean2
• Mode1, Mode2
• Median1, Median2

www.udacity.com – Statistics

n = 1000

n = 1000

Range (Czech: variační rozpětí)

range = max − min


n = 1000

n = 1000

Range

The range changes when we add new data to the dataset:
• Always
• Sometimes
• Never
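The quiz above can be explored with a small sketch. The salary values below are made up for illustration and `value_range` is a hypothetical helper name; the point is only to compare the range before and after adding one new value.

```python
def value_range(values):
    """Return the range: largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical yearly salaries; the numbers are made up for illustration.
salaries = [38_000, 45_000, 52_000, 61_000]
base = value_range(salaries)

# A new value inside the current min..max interval leaves the range unchanged.
inside = value_range(salaries + [50_000])

# A new extreme value (for example a billionaire) stretches the range.
extreme = value_range(salaries + [10_000_000])

print(base, inside, extreme)
```

Because only the two most extreme values enter the formula, a single new observation can leave the range untouched or change it dramatically.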


n = 1000

n = 1000

Adding Mark Zuckerberg


n = 1000

n = 1000

Cut off data

IQR (Czech: mezikvartilové rozpětí)

www.udacity.com – Statistics

n = 1000

Interquartile range, IQR

Let's take this quiz: answer yes or no.

1. About 50% of the data fall within the IQR.

2. The IQR is affected by every value in the data set.

3. The IQR is not affected by outliers.

4. The mean is always between Q1 and Q3.

0 1 1 1 2 2 2 2 2 3 3 3 90

Q1 = 1, Q2 (median), Q3 = 3


mean = 8.62, n = 13
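The numbers on this slide can be checked with a short sketch using Python's `statistics` module. Note that `quantiles` uses one particular interpolation convention (its default "exclusive" method); other conventions can give slightly different quartiles on other data sets. The `tamed` list is a hypothetical variant of the data, introduced here only to show what happens when the outlier is replaced.

```python
from statistics import mean, quantiles

data = [0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 90]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
avg = mean(data)

# Hypothetical variant: replace the outlier 90 with an ordinary value.
tamed = data[:-1] + [3]
t1, _, t3 = quantiles(tamed, n=4)

print("Q1, Q3, IQR:", q1, q3, iqr)
print("mean:", round(avg, 2), "lies between Q1 and Q3:", q1 <= avg <= q3)
print("IQR without the outlier:", t3 - t1)
```

In this data set the single value 90 pulls the mean above Q3, while the IQR is the same with and without it.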

Define the outlier

$$x < Q_1 - 1.5 \cdot \mathrm{IQR} \quad \text{OR} \quad x > Q_3 + 1.5 \cdot \mathrm{IQR}$$

Sample (n=10)

$38,946

$43,420

$49,160

$50,430

$50,557

$52,580

$53,595

$54,160

$60,181

$10,000,000

What values are outliers for this data set?

1. $60,000
2. $80,000
3. $100,000
4. $200,000
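One way to work through this quiz is sketched below. It rests on two assumptions that the slide text does not spell out: outliers are defined by the common 1.5 × IQR rule, and quartiles are taken as the medians of the lower and upper halves of the sorted data. The helper name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 as medians of the lower half, the whole data, and the upper half.

    Assumption: for an odd number of values the middle one is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data), median(data[-half:])

salaries = [38_946, 43_420, 49_160, 50_430, 50_557,
            52_580, 53_595, 54_160, 60_181, 10_000_000]

q1, q2, q3 = quartiles(salaries)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # assumed 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr

# Which of the quiz options would count as outliers under this rule?
candidates = [60_000, 80_000, 100_000, 200_000]
flagged = [c for c in candidates if c < lower_fence or c > upper_fence]

# Which values already in the sample fall outside the fences?
sample_outliers = [x for x in salaries if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, flagged, sample_outliers)
```

A different quartile convention would move the fences somewhat, so the exact cut-offs should be read as one reasonable choice rather than the only answer.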


Problem with IQR

normal

bimodal

uniform


Options for measuring variability

• Find the average distance between all pairs of data values.
• Find the average distance between each data value and either the max or the min.
• Find the average distance between each data value and the mean.


Average distance from mean

Sample   Deviation from mean
10
5
3
2
19
1
7
11
1
1

Average distance from mean

Sample   Deviation from mean
10         4
5         -1
3         -3
2         -4
19        13
1         -5
7          1
11         5
1         -5
1         -5

$$\sum_{i} (x_i - \bar{x}) = 0$$

Find the average distance between each data value and the mean.
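The cancellation can be verified directly on the ten sample values from the table. This is a minimal sketch; the variable names are chosen here for illustration.

```python
sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

mean = sum(sample) / len(sample)
deviations = [x - mean for x in sample]

# Positive and negative deviations cancel, so their plain average is useless
# as a measure of spread.
total = sum(deviations)

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", total)
```

This holds for any data set, because the mean is exactly the value that balances the deviations on both sides.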

Preventing cancellation

• How can we prevent the negative and positive deviations from cancelling each other out?

1. Ignore (i.e. delete) the negative sign.
2. Multiply each deviation by two.
3. Square each deviation.
4. Take the absolute value of each deviation.

Average absolute deviation

Sample   Deviation from mean   Absolute deviation
10         4                     4
5         -1                     1
3         -3                     3
2         -4                     4
19        13                    13
1         -5                     5
7          1                     1
11         5                     5
1         -5                     5
1         -5                     5

avg. absolute deviation = 4.6
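The value 4.6 can be reproduced with a few lines of Python. This is a sketch on the same ten sample values; `aad` is just a name chosen here for the average absolute deviation.

```python
sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]

mean = sum(sample) / len(sample)

# Dropping the sign stops positive and negative deviations from cancelling.
absolute_deviations = [abs(x - mean) for x in sample]
aad = sum(absolute_deviations) / len(sample)

print("absolute deviations:", absolute_deviations)
print("average absolute deviation:", aad)
```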

Average absolute deviation

Which formulas describe what you just did?

Squared deviations

Sample   Deviation from mean   Squared deviation
10         4                    16
5         -1                     1
3         -3                     9
2         -4                    16
19        13                   169
1         -5                    25
7          1                     1
11         5                    25
1         -5                    25
1         -5                    25

avg. square deviation = 31.2

SS, sum of squares (Czech: čtverce odchylek, i.e. squared deviations)

Variance

The average squared deviation has a special name: variance (Czech: rozptyl).

$$\text{variance} = \frac{\sum_{i} (x_i - \bar{x})^2}{n}$$
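As a sketch, the sum of squares and the variance of the ten sample values can be computed step by step and compared with `statistics.pvariance`, which also divides by n. (The separate function `statistics.variance` divides by n − 1 instead, so it would return a different number.)

```python
from statistics import pvariance

sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]
n = len(sample)
mean = sum(sample) / n

# SS: sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in sample)

# Variance as defined on the slide: SS divided by n.
variance = ss / n

print("SS:", ss)
print("variance:", variance)
print("statistics.pvariance:", pvariance(sample))
```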

Standard deviation

• Czech: směrodatná odchylka; the square root of the variance

• Which symbol would you use for a variance?
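A short sketch of the last step, again on the ten sample values used earlier: the standard deviation is the square root of the variance, which brings the measure back to the same units as the data. `statistics.pstdev` is used as a cross-check because it divides by n, matching the variance formula above.

```python
import math
from statistics import pstdev

sample = [10, 5, 3, 2, 19, 1, 7, 11, 1, 1]
n = len(sample)
mean = sum(sample) / n

variance = sum((x - mean) ** 2 for x in sample) / n
sd = math.sqrt(variance)  # same units as the original data

print("variance:", variance)
print("standard deviation:", round(sd, 2))
print("statistics.pstdev:", round(pstdev(sample), 2))
```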

Standard deviation

• What is so great about the standard deviation? Why don't we just find the average absolute deviation?

1. SD is used because of tradition.
2. It is easier to work with squares than with absolute values.
3. SD has a very nice interpretation for the Gaussian distribution.

More on absolute vs. standard deviation: http://www.leeds.ac.uk/educol/documents/00003759.htm

Standard deviation – empirical rule



Empirical rule – well-behaved distribution

n = 400

• mean ± 1 s.d. covers 273 data values, 66.8%
• mean ± 2 s.d. covers 380 data values, 95%
• mean ± 3 s.d. covers 397 data values, 99.3%
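The empirical rule can be tried out on simulated data. The sketch below does not use the data set behind the slide; it draws 400 values from a normal distribution with made-up parameters (mean 170, s.d. 7) and counts how many fall within 1, 2 and 3 standard deviations of the mean. `coverage` is a hypothetical helper name.

```python
import random
from statistics import mean, pstdev

def coverage(values):
    """Fractions of values within 1, 2 and 3 standard deviations of the mean."""
    m, s = mean(values), pstdev(values)
    return [sum(abs(x - m) <= k * s for x in values) / len(values)
            for k in (1, 2, 3)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated, hypothetical measurements; not the data shown on the slide.
values = [rng.gauss(170, 7) for _ in range(400)]
within1, within2, within3 = coverage(values)

print(f"within 1 s.d.: {within1:.1%}")
print(f"within 2 s.d.: {within2:.1%}")
print(f"within 3 s.d.: {within3:.1%}")
```

For roughly bell-shaped data the three fractions should come out near 68%, 95% and 99.7%; with only 400 values some deviation from these figures is expected.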

Empirical rule – not-so-well behaved distribution

197 countries

65% within 1 s.d.

94.7% within 2 s.d.

100% within 3 s.d.

Statistical inference

• The goal of statistics: make rational conclusions or decisions based on the incomplete information we have in our data.
• This process is known as statistical inference.
• In inferential statistics we want to answer:

1. Is some relationship in the data due to chance, or is it a real difference?
2. If the effect is real, can it be generalized to a larger group?

Statistical jargon

• Population – the group we are interested in making conclusions about.
• Census – a collection of data on the entire population.
• Sample – if we can't conduct a census, we collect data from a sample of the population. Goal: make conclusions about that population.

Statistical jargon

• Population – parameter: mean, standard deviation
• Sample – statistic (Czech: výběr – statistika): sample mean (výběrový průměr), sample standard deviation (výběrová směrodatná odchylka)

• population (census) vs. sample
• parameter (population) vs. statistic (sample)

Statistical inference

• A statistic is a value calculated from our observed data (sample).
• A parameter is a value that describes the population.
• We want to be able to generalize what we observe in our data to our population. In order to do this, the sample needs to be representative.
• How to select a representative sample? Use randomization.
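The difference between a parameter and a statistic can be illustrated with a simulated population. Everything in this sketch is hypothetical: the population is 10,000 generated values, and the sample size of 100 is an arbitrary choice.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated heights in cm.
population = [rng.gauss(170, 7) for _ in range(10_000)]

# Parameter: describes the whole population (normally unknown in practice).
mu = mean(population)

# Statistic: computed only from a randomly selected sample.
sample = rng.sample(population, 100)
x_bar = mean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

Rerunning with a different seed gives a different sample mean each time, while the population mean of a given population stays fixed; this variation is what statistical inference has to account for.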

Random sampling

• Simple Random Sampling (SRS) – each possible sample from the population is equally likely to be selected.
• Stratified sampling – a simple random sample from subgroups of the population
  • subgroups: gender, age groups, …
• Cluster sampling – divide the population into non-overlapping groups (clusters); the sample is a randomly chosen cluster
  • example: the population is all students in an area; randomly select schools and create a sample from the students of the given school
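The three sampling schemes can be sketched on a small made-up population. The student records, the field names `school` and `gender`, and the three function names are all hypothetical; the cluster variant shown here keeps every student from the randomly chosen schools.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 300 students spread over 6 schools.
population = [
    {"id": i, "school": i % 6, "gender": "F" if i % 3 == 0 else "M"}
    for i in range(300)
]

def simple_random_sample(units, k):
    """Every subset of size k is equally likely."""
    return rng.sample(units, k)

def stratified_sample(units, key, k_per_stratum):
    """Draw a simple random sample of fixed size inside each subgroup."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, k_per_stratum))
    return chosen

def cluster_sample(units, key, n_clusters):
    """Randomly pick whole clusters and keep all of their members."""
    clusters = sorted({unit[key] for unit in units})
    picked = set(rng.sample(clusters, n_clusters))
    return [unit for unit in units if unit[key] in picked]

srs = simple_random_sample(population, 30)
stratified = stratified_sample(population, "gender", 10)
clustered = cluster_sample(population, "school", 2)

print(len(srs), len(stratified), len(clustered))
```

Stratifying guarantees that each subgroup is represented, whereas cluster sampling is often chosen because visiting a few whole schools is cheaper than reaching students scattered across all of them.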

Simple random sampling

• Sampling with replacement (WR; Czech: výběr s navrácením)
  • Generates independent samples.
  • Two sample values are independent if what we get on the first one doesn't affect what we get on the second.
• Sampling without replacement (WOR; Czech: výběr bez navrácení)
  • Deliberately avoids choosing any member of the population more than once.
  • This type of sampling is not independent; however, it is more common.
  • The error is small as long as
    1. the sample is large
    2. the sample size is no more than 10% of the population size
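In Python's standard library the two schemes correspond to `random.choices` (with replacement) and `random.sample` (without replacement). The population of 100 numbered units below is made up for illustration.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical population of 100 labelled units.
population = list(range(1, 101))

# With replacement: each draw is independent, so a unit may repeat.
with_replacement = rng.choices(population, k=10)

# Without replacement: no unit can be chosen more than once.
without_replacement = rng.sample(population, k=10)

# Share of the population taken by the sample (the 10% guideline above).
sampling_fraction = len(without_replacement) / len(population)

print("with replacement:   ", sorted(with_replacement))
print("without replacement:", sorted(without_replacement))
print("sampling fraction:  ", sampling_fraction)
```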

Bias

• If a sample is not representative, it can introduce bias into our results.
• Czech: bias = zkreslení, odchylka
• A sample is biased if it differs from the population in a systematic way.
• The Literary Digest poll, 1936 U.S. presidential election
  • surveyed 10 million people (subscribers)
  • 2.3 million responded, predicting (3:2) that the Republican candidate would win
  • the Democratic candidate won
• What went wrong?
  • only wealthy people were surveyed (selection bias)
  • the survey relied on voluntary response (nonresponse bias): angry people or people who want a change
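Selection bias can be illustrated with a toy electorate. Every number in this sketch is invented and has nothing to do with the actual 1936 figures; it only shows how surveying one subgroup shifts the estimate in a systematic direction.

```python
# Hypothetical electorate; every number below is made up for illustration.
# Each voter is a pair (group, prefers_candidate_a).
wealthy = [("wealthy", i < 240) for i in range(400)]  # 240 of 400 prefer A
others = [("other", i < 180) for i in range(600)]     # 180 of 600 prefer A
population = wealthy + others

def share_for_a(voters):
    """Fraction of the given voters who prefer candidate A."""
    return sum(prefers for _, prefers in voters) / len(voters)

true_share = share_for_a(population)    # what a census would show
biased_estimate = share_for_a(wealthy)  # what a wealthy-only survey shows

print(f"true share for A:      {true_share:.0%}")
print(f"wealthy-only estimate: {biased_estimate:.0%}")
```

Making the biased survey larger does not help: asking all 400 wealthy voters still gives the same wrong answer, which is why a huge sample did not protect the poll.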
