Probability & Statistics: Infinite Statistics

Robert Leishman Mark Colton

ME 363 Spring 2011

Large Data Sets

What happens to a histogram as N becomes large (N → ∞)?
– Number of bins becomes large (K → ∞)
– Width of bins becomes small (δx → 0)
– Histogram becomes smoother and approaches a continuous function

Example: ASTM-A242 Steel

[Figure: histogram of strength S (MPa) vs. relative frequency n (%) for N = 100, K = 12, S_ave = 479.18 MPa]

[Figure: histogram of strength S (MPa) vs. relative frequency n (%) for N = 10,000,000, K = 1,180, S_ave = 480.00 MPa]
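The smoothing effect can be reproduced numerically. The sketch below is only an illustration: it draws synthetic strengths from a normal distribution with a mean of 480 MPa and an assumed standard deviation of 10 MPa (the slides do not give a standard deviation), builds a normalized histogram, and measures how far the histogram sits from the smooth curve. The helper names `histogram_density` and `max_gap` are made up for this example.

```python
import math
import random

def histogram_density(samples, k):
    """Return (bin_centers, densities) for k equal-width bins.

    Densities are counts / (N * bin_width), so the bars enclose unit area.
    """
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / k
    counts = [0] * k
    for s in samples:
        idx = min(int((s - lo) / width), k - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    n = len(samples)
    centers = [lo + (i + 0.5) * width for i in range(k)]
    densities = [c / (n * width) for c in counts]
    return centers, densities

def normal_pdf(x, mean, sigma):
    """Normal probability density evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def max_gap(n, k, mean=480.0, sigma=10.0, seed=0):
    """Largest |histogram - p.d.f.| over all bin centers.

    mean and sigma are assumed values for illustration, not measured data.
    """
    rng = random.Random(seed)
    samples = [rng.gauss(mean, sigma) for _ in range(n)]
    centers, densities = histogram_density(samples, k)
    return max(abs(d - normal_pdf(c, mean, sigma)) for c, d in zip(centers, densities))

gap_small = max_gap(100, 12)        # few points, few bins
gap_large = max_gap(100_000, 100)   # many points, many bins
```

Under these assumptions `gap_large` should come out well below `gap_small`, which is the numerical counterpart of the histogram approaching a continuous function.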

Probability Density Function

For an infinite data set, the frequency distribution is smooth and continuous

This function is called the probability density function (p.d.f.)

The p.d.f. relates the value of a measured variable to its likelihood of occurring

Infinite vs. Finite Statistics

In theory, we can have infinite data sets
– An infinite number of data points
– We can examine the entire population, or all possible values

In reality, data sets are finite
– A limited number of data points
– Based on a sample of the entire population

We will use our limited sample size to extract information about the entire population
– Example: Extract information about ALL ASTM-A242 steel from 1000 specimens
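As a rough illustration of estimating population properties from a finite sample, the sketch below draws 1000 synthetic "specimens" and computes the sample mean and sample standard deviation. The population values (480 MPa mean, 10 MPa standard deviation) and the function name `sample_estimates` are assumptions made for this example, not data from the slides.

```python
import random
import statistics

def sample_estimates(n, pop_mean=480.0, pop_sigma=10.0, seed=1):
    """Draw n synthetic specimens and return (sample mean, sample std dev).

    pop_mean and pop_sigma are hypothetical population values.
    """
    rng = random.Random(seed)
    sample = [rng.gauss(pop_mean, pop_sigma) for _ in range(n)]
    return statistics.mean(sample), statistics.stdev(sample)

mean_1000, sigma_1000 = sample_estimates(1000)
print(f"sample mean = {mean_1000:.2f} MPa, sample std dev = {sigma_1000:.2f} MPa")
```

The estimates will not match the assumed population values exactly, but with 1000 specimens they should land close to them.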

Standard p.d.f.s

There are many shapes of p.d.f.s that can occur

See Table 4.2

The shape of the p.d.f. can be experimentally determined by looking at the shape of the histogram and finding which p.d.f. best matches it

Normal Distribution

The normal or Gaussian distribution is one of the most common

It can be used to represent variables that are random about a mean
– Electrical noise
– Variation in strength of steel
– Precision error

“Bell Curve”

Normal Distribution

The p.d.f. of the normal distribution is given by:

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - x')^2}{2\sigma^2}\right]$$

– x is the value of a particular measurement
– x’ is the true mean of the entire population
  • Describes the central tendency of the population
  • Determines the center point of the normal curve
  • Written μ in many texts (the mean computed from a finite sample is written x̄)
– σ is the standard deviation of the entire population
  • Describes the population’s spread
  • Determines the width of the normal curve
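The normal p.d.f. translates directly into code. This is a minimal sketch; the function name `normal_pdf`, the argument name `x_true` (standing in for x’), and the numerical values are chosen for illustration only. The standard-library `statistics.NormalDist` is used purely as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, x_true, sigma):
    """Evaluate p(x) = exp(-(x - x')^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - x_true) ** 2) / (2.0 * sigma ** 2))

# Hypothetical population: mean 480, standard deviation 10
peak = normal_pdf(480.0, 480.0, 10.0)         # height of the curve at the mean
one_sigma = normal_pdf(470.0, 480.0, 10.0)    # height one sigma below the mean
reference = NormalDist(480.0, 10.0).pdf(470.0)  # library value for comparison
```

The peak height depends only on σ (it equals 1/(σ√(2π))), which is why a larger σ gives a wider, flatter curve.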

Normal Distribution

If we have the p.d.f. of some data, then we can make predictions about the probability P of future measurements

Specifically, we can predict the probability of a data point falling within a certain interval

This probability is given by the area under the p.d.f. over the appropriate interval

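$$P(x' - \delta x \le x \le x' + \delta x) = \int_{x'-\delta x}^{x'+\delta x} p(x)\,dx$$

$$P(x' - \delta x \le x \le x' + \delta x) = \int_{x'-\delta x}^{x'+\delta x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-x')^2/(2\sigma^2)}\,dx$$

[Figure: normal curve with x’ − δx, x’, and x’ + δx marked on the x axis]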
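The "area under the p.d.f." can be approximated directly by numerical integration. The sketch below uses a simple trapezoidal rule; the function names, the step count, and the example values (mean 480, σ = 10, δx = 10) are assumptions chosen for illustration.

```python
import math

def normal_pdf(x, x_true, sigma):
    """Normal probability density with mean x_true and standard deviation sigma."""
    return math.exp(-((x - x_true) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def prob_within(x_true, sigma, dx, steps=10_000):
    """Trapezoidal estimate of P(x' - dx <= x <= x' + dx)."""
    a, b = x_true - dx, x_true + dx
    h = (b - a) / steps
    # End points carry half weight in the trapezoidal rule
    total = 0.5 * (normal_pdf(a, x_true, sigma) + normal_pdf(b, x_true, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, x_true, sigma)
    return total * h

# Probability of falling within one standard deviation of the mean
p_one_sigma = prob_within(480.0, 10.0, 10.0)
```

Choosing δx = σ here means the result is the probability of landing within one standard deviation of the mean.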

Normal Distribution

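Starting from

$$P(x' - \delta x \le x \le x' + \delta x) = \int_{x'-\delta x}^{x'+\delta x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-x')^2/(2\sigma^2)}\,dx$$

Make a change of variables:

$$\beta \equiv \frac{x - x'}{\sigma}, \qquad z_1 \equiv \frac{x_1 - x'}{\sigma}$$

The integral then becomes

$$P(-z_1 \le \beta \le z_1) = \frac{1}{\sqrt{2\pi}} \int_{-z_1}^{z_1} e^{-\beta^2/2}\,d\beta$$

Since the normal distribution is symmetric about x’:

$$P(-z_1 \le \beta \le z_1) = 2\left[\frac{1}{\sqrt{2\pi}} \int_{0}^{z_1} e^{-\beta^2/2}\,d\beta\right]$$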
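The substitution can be written out step by step. This is a sketch of the intermediate algebra, taking x_1 = x' + δx as the upper limit of the interval (an identification implied by the definition of z_1 rather than stated explicitly in the slides):

```latex
\begin{align*}
\beta = \frac{x - x'}{\sigma}
  \quad &\Rightarrow \quad x = x' + \sigma\beta, \qquad dx = \sigma\,d\beta \\
x = x' \pm \delta x
  \quad &\Rightarrow \quad \beta = \pm\frac{\delta x}{\sigma} = \pm z_1 \\
\int_{x'-\delta x}^{x'+\delta x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-x')^2/(2\sigma^2)}\,dx
  &= \int_{-z_1}^{z_1} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\beta^2/2}\,\sigma\,d\beta
   = \frac{1}{\sqrt{2\pi}} \int_{-z_1}^{z_1} e^{-\beta^2/2}\,d\beta
\end{align*}
```

The factor of σ from dx cancels the σ in the normalizing constant, so the transformed integral no longer depends on x’ or σ. That is what allows a single table to serve every normal distribution.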

Normal Distribution

The bracketed expression is called the “normal error function”

$$P(-z_1 \le \beta \le z_1) = 2\left[\frac{1}{\sqrt{2\pi}} \int_{0}^{z_1} e^{-\beta^2/2}\,d\beta\right], \qquad \beta \equiv \frac{x - x'}{\sigma}, \quad z_1 \equiv \frac{x_1 - x'}{\sigma}$$

Solutions to this integral are tabulated in Table 4.3 (p. 118)

The integral gives us a method for calculating the probability that x lies within x’ ± z1σ for a given distribution defined by x’ and σ

Best understood by doing some examples
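Instead of a printed table, the bracketed integral can be evaluated with the error function in Python's standard library, using the identity (1/√(2π)) ∫₀^z e^(−β²/2) dβ = ½ erf(z/√2). This is a minimal sketch; the function names are chosen for illustration and are not from the course text.

```python
import math

def normal_error_function(z1):
    """Area under the standard normal curve between 0 and z1."""
    return 0.5 * math.erf(z1 / math.sqrt(2.0))

def prob_within_z(z1):
    """P(-z1 <= beta <= z1), using symmetry about zero."""
    return 2.0 * normal_error_function(z1)

half_area = normal_error_function(1.43)  # one-sided area, as a table would list it
full_area = prob_within_z(1.43)          # two-sided area
```

Values computed this way should agree with a four-decimal table to within rounding.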

Example 1

What is the area under the normal distribution curve from z1 = -1.43 to z1 = 1.43?

What is the significance of this area?

Example 1

From the table:
– The area under the normal distribution curve from z1 = 0 to z1 = 1.43 is 0.4236
– This represents ½ of the integral between z1 = -1.43 and z1 = 1.43
– The total area is therefore 2(0.4236) = 0.8472

What does this mean?
– For data following a normal distribution, 84.72% of the population lies within the range -1.43 ≤ z1 ≤ 1.43
– But x1 = x’ + z1σ
– So this means that 84.72% of the population lies within ±1.43 standard deviations of the mean
– This is true for any normally distributed data

Example 2

What range of a random variable x will contain 90% of the population?

Solution:
– Find z1 such that 45% of the data lie between 0 and +z1 and the other 45% lie between –z1 and 0
– Use the table

Example 2

By interpolation, z0.45 = 1.645

Example 2

Again, x1 = x’ + z1σ

So 90% of the population will fall within the range

(x’ – z0.45σ) < x < (x’ + z0.45σ)

(x’ – 1.645σ) < x < (x’ + 1.645σ)

So 90% of the population will lie within ±1.645 standard deviations of the mean
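The reverse lookup (given a probability, find z1) can be sketched with the inverse cumulative distribution function in Python's `statistics.NormalDist`. A central fraction f leaves (1 − f)/2 in each tail, so z1 is the point where the cumulative probability reaches 0.5 + f/2. The helper names below are illustrative.

```python
from statistics import NormalDist

def z_for_central_fraction(fraction):
    """Return z1 such that P(-z1 <= beta <= z1) equals fraction (0 < fraction < 1)."""
    return NormalDist().inv_cdf(0.5 + fraction / 2.0)

def range_for_central_fraction(x_true, sigma, fraction):
    """Return (low, high) bounds containing the given central fraction of the population."""
    z1 = z_for_central_fraction(fraction)
    return x_true - z1 * sigma, x_true + z1 * sigma

z_90 = z_for_central_fraction(0.90)
low, high = range_for_central_fraction(0.0, 1.0, 0.90)  # standard normal case
```

For a 90% central fraction this should give a z1 close to the interpolated table value.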

Comments

The probability of a measurement occurring can be expressed in terms of the standard deviation of the population

The probability of a measurement being within:
– ±1σ of the mean is 68.27%
– ±2σ of the mean is 95.45%
– ±3σ of the mean is 99.73%

Data outside ±3σ are often considered “outliers”
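The ±1σ, ±2σ, and ±3σ percentages can be checked in a few lines using the standard normal cumulative distribution function. This is a sketch for verification only; the names `STANDARD_NORMAL` and `prob_within_k_sigma` are made up for the example.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def prob_within_k_sigma(k):
    """Probability that a normally distributed value lies within k sigma of the mean."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)

coverage = {k: prob_within_k_sigma(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within ±{k} sigma: {100 * p:.2f}%")
```

The complement of the ±3σ value gives the small fraction of data that would be flagged as outliers under that rule.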

Example 3

You are assigned to measure the maximum no-load speed of a new type of DC motor

You apply a constant voltage to “many” of these motors and measure the maximum no-load speed for each motor

You calculate that x’ = 4315.25 rpm and σ = 427.5 rpm

Assuming that the variations in no-load speed are random (and normally distributed), what is the probability that a motor will have a no-load speed between 5000 and 5200 rpm?

Example 3

P(5000 ≤ x ≤ 5200) = P(4315.25 ≤ x ≤ 5200) – P(4315.25 ≤ x ≤ 5000)

From upper limit (first term):
z1 = (5200 – 4315.25)/427.5 = 2.0696
P(2.0696) = 0.4808

From lower limit (second term):
z1 = (5000 – 4315.25)/427.5 = 1.6018
P(1.6018) = 0.4454

P(5000 ≤ x ≤ 5200) = 0.4808 – 0.4454 = 0.0354 = 3.54%
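The motor example can be cross-checked without a table by differencing the cumulative distribution function at the two limits, which is equivalent to subtracting the two areas measured from the mean. This sketch uses `statistics.NormalDist` with the x’ and σ given in the example; the function name `prob_between` is an illustrative choice.

```python
from statistics import NormalDist

def prob_between(lower, upper, x_true, sigma):
    """P(lower <= x <= upper) for a normal population with mean x_true and std dev sigma."""
    dist = NormalDist(mu=x_true, sigma=sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Values from the DC motor example (rpm)
p_motor = prob_between(5000.0, 5200.0, x_true=4315.25, sigma=427.5)
print(f"P(5000 <= x <= 5200) = {100 * p_motor:.2f}%")
```

The computed probability comes out at roughly 0.035. A hand calculation can differ in the third decimal place depending on how the table is read and interpolated.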
