Probability & Statistics: Infinite Statistics
Robert Leishman Mark Colton
ME 363 Spring 2011
Large Data Sets
What happens to a histogram as N becomes large (N → ∞)?
– Number of bins becomes large (K → ∞)
– Width of bins becomes small (δx → 0)
– Histogram becomes smoother and approaches a continuous function
Example: ASTM-A242 Steel
[Figure: histogram of strength S (MPa) vs. relative frequency n (%); N = 100, K = 12, S_ave = 479.18 MPa]
[Figure: histogram of strength S (MPa) vs. relative frequency n (%); N = 10,000,000, K = 1,180, S_ave = 480.00 MPa]
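The trend in the two histograms can be reproduced with a small simulation. The sketch below is only illustrative: the population is assumed to be normal with a mean of 480 MPa (taken from the slide) and a standard deviation of 10 MPa (a hypothetical value, since the slides do not give one). The helper names `relative_histogram` and `max_density_error` are made up for this example. It compares the histogram, rescaled to a density, against the assumed p.d.f. for a small and a large sample.

```python
import random
from statistics import NormalDist


def relative_histogram(samples, num_bins):
    """Return (bin_centers, relative_frequencies, bin_width) for equal-width bins."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for s in samples:
        # Clamp so the largest sample lands in the last bin.
        k = min(int((s - lo) / width), num_bins - 1)
        counts[k] += 1
    centers = [lo + (k + 0.5) * width for k in range(num_bins)]
    freqs = [c / len(samples) for c in counts]
    return centers, freqs, width


# Assumed population: mean from the slide, sigma hypothetical.
population = NormalDist(mu=480.0, sigma=10.0)
rng = random.Random(0)  # fixed seed so the run is repeatable


def max_density_error(n, k):
    """Largest gap between the rescaled histogram and the assumed p.d.f."""
    samples = [rng.gauss(population.mean, population.stdev) for _ in range(n)]
    centers, freqs, width = relative_histogram(samples, k)
    # Relative frequency divided by bin width estimates the p.d.f. height.
    return max(abs(f / width - population.pdf(c)) for c, f in zip(centers, freqs))


coarse_error = max_density_error(100, 12)        # small N, few wide bins
fine_error = max_density_error(200_000, 100)     # large N, many narrow bins
print(f"N=100:     max error {coarse_error:.5f}")
print(f"N=200,000: max error {fine_error:.5f}")
```

With more data and narrower bins, the rescaled histogram should track the smooth curve much more closely, which is the point of the N → ∞ limit.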
Probability Density Function
For an infinite data set, the frequency distribution is smooth and continuous
This function is called the probability density function (p.d.f.)
The p.d.f. relates the value of a measured variable to its likelihood of occurring
Infinite vs. Finite Statistics
In theory, we can have infinite data sets
– An infinite number of data points
– We can examine the entire population, or all possible values
In reality, data sets are finite
– A limited number of data points
– Based on a sample of the entire population
We will use our limited sample size to extract information about the entire population
– Example: Extract information about ALL ASTM-A242 steel from 1000 specimens
Standard p.d.f.s
There are many shapes of p.d.f.s that can occur
See Table 4.2
The shape of the p.d.f. can be determined experimentally by looking at the shape of the histogram and finding which p.d.f. best matches it
Normal Distribution
The normal or Gaussian distribution is one of the most common
It can be used to represent variables that are random about a mean
– Electrical noise
– Variation in strength of steel
– Precision error
“Bell Curve”
Normal Distribution
The p.d.f. of the normal distribution is given by:
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - x')^2}{2\sigma^2}\right]$$
– x is the value of a particular measurement
– x’ is the true mean of the entire population
• Describes the central tendency of the population
• Determines the center point of the normal curve
• Often written μ; the mean estimated from a finite sample is written x̄ instead
– σ is the standard deviation of the entire population
• Describes the population’s spread
• Determines the width of the normal curve
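As a quick illustration, the normal p.d.f. can be written as a short Python function. This is a minimal sketch; the function name `normal_pdf` and its argument names are chosen for this example and do not come from the slides.

```python
import math


def normal_pdf(x, true_mean, sigma):
    """Height of the normal p.d.f. at x for a population with the given mean and sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - true_mean) ** 2) / (2.0 * sigma ** 2))


# Standard normal curve (mean 0, sigma 1): the peak sits at the mean.
peak = normal_pdf(0.0, 0.0, 1.0)
one_sigma = normal_pdf(1.0, 0.0, 1.0)
print(peak, one_sigma)
```

Changing `true_mean` shifts the curve left or right, while a larger `sigma` makes it wider and lower.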
Normal Distribution
If we have the p.d.f. of some data, then we can make predictions about the probability P of future measurements
Specifically, we can predict the probability of a data point falling within a certain interval
This probability is given by the area under the p.d.f. over the appropriate interval
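$$P(x' - \delta x \le x \le x' + \delta x) = \int_{x' - \delta x}^{x' + \delta x} p(x)\,dx$$
$$P(x' - \delta x \le x \le x' + \delta x) = \int_{x' - \delta x}^{x' + \delta x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - x')^2 / 2\sigma^2}\,dx$$
[Figure: normal curve with x’ – δx, x’, and x’ + δx marked on the x axis]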
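The "area under the p.d.f." idea can be checked numerically. The sketch below integrates the normal p.d.f. over x’ ± δx with the trapezoidal rule and compares the result with the closed-form value from Python's `statistics.NormalDist`. The numbers used (mean 480, σ = 10, δx = 15) are hypothetical, and `probability_within` is a helper named for this example only.

```python
import math
from statistics import NormalDist


def normal_pdf(x, true_mean, sigma):
    """Normal p.d.f. evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - true_mean) ** 2) / (2.0 * sigma ** 2))


def probability_within(true_mean, sigma, dx, steps=10_000):
    """Trapezoidal estimate of P(true_mean - dx <= x <= true_mean + dx)."""
    a, b = true_mean - dx, true_mean + dx
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, true_mean, sigma) + normal_pdf(b, true_mean, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, true_mean, sigma)
    return total * h


# Hypothetical population and interval half-width.
p_numeric = probability_within(480.0, 10.0, 15.0)
dist = NormalDist(mu=480.0, sigma=10.0)
p_exact = dist.cdf(495.0) - dist.cdf(465.0)
print(p_numeric, p_exact)
```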
Normal Distribution
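$$P(x' - \delta x \le x \le x' + \delta x) = \int_{x' - \delta x}^{x' + \delta x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - x')^2 / 2\sigma^2}\,dx$$
Make a change of variables:
$$\beta \equiv \frac{x - x'}{\sigma}, \qquad z_1 \equiv \frac{x_1 - x'}{\sigma}$$
The integral then becomes
$$P(-z_1 \le \beta \le z_1) = \frac{1}{\sqrt{2\pi}} \int_{-z_1}^{z_1} e^{-\beta^2/2}\,d\beta$$
Since the normal distribution is symmetric about x’:
$$P(-z_1 \le \beta \le z_1) = 2\left[\frac{1}{\sqrt{2\pi}} \int_{0}^{z_1} e^{-\beta^2/2}\,d\beta\right]$$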
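The substitution can be spelled out in a few lines. The sketch below assumes that x_1 denotes the upper limit x’ + δx, which is what the definition of z_1 implies but is not stated explicitly on the slide.

```latex
% Change of variables beta = (x - x') / sigma, assuming x_1 = x' + \delta x
\begin{align*}
x &= x' + \sigma\beta \quad\Rightarrow\quad dx = \sigma\,d\beta \\
x = x' \pm \delta x &\quad\Rightarrow\quad \beta = \pm\frac{\delta x}{\sigma} = \pm z_1 \\
\frac{(x - x')^2}{2\sigma^2} &= \frac{\beta^2}{2} \\
\int_{x' - \delta x}^{x' + \delta x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - x')^2/2\sigma^2}\,dx
  &= \int_{-z_1}^{z_1} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\beta^2/2}\,\sigma\,d\beta
   = \frac{1}{\sqrt{2\pi}} \int_{-z_1}^{z_1} e^{-\beta^2/2}\,d\beta
\end{align*}
```

The factor of σ from dx cancels the σ in the coefficient, so the result no longer depends on x’ or σ. This is why a single table serves every normal distribution.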
Normal Distribution
$$\beta \equiv \frac{x - x'}{\sigma}, \qquad z_1 \equiv \frac{x_1 - x'}{\sigma}$$
$$P(-z_1 \le \beta \le z_1) = 2\left[\frac{1}{\sqrt{2\pi}} \int_{0}^{z_1} e^{-\beta^2/2}\,d\beta\right]$$
The bracketed expression is called the “normal error function”
Solutions to this integral are tabulated in Table 4.3 (p. 118)
The integral gives us a method for calculating the probability that x lies within x’ ± z1σ for a given distribution defined by x’ and σ
Best understood by doing some examples
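When the table is not at hand, the bracketed integral can be evaluated with the standard library. The sketch below relies on the identity (1/√(2π)) ∫₀^z₁ e^(−β²/2) dβ = ½ erf(z1/√2), where `math.erf` is the conventional mathematical error function (scaled differently from the slide's "normal error function"). The function names `half_area` and `two_sided_probability` are hypothetical.

```python
import math


def half_area(z1):
    """Area under the standard normal curve between 0 and z1 (the tabulated quantity)."""
    return 0.5 * math.erf(z1 / math.sqrt(2.0))


def two_sided_probability(z1):
    """P(-z1 <= beta <= z1), using the symmetry of the normal curve."""
    return 2.0 * half_area(z1)


# Print a few rows in the style of a one-sided normal table.
for z in (0.5, 1.0, 1.5, 2.0):
    print(f"z1 = {z:.2f}  half area = {half_area(z):.4f}")
```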
Example 1
What is the area under the normal distribution curve from z1 = -1.43 to z1 = 1.43?
What is the significance of this area?
From the table:
– The area under the normal distribution curve from z1 = 0 to z1 = 1.43 is 0.4236
– This represents ½ of the integral between z1 = -1.43 and z1 = 1.43
– The total area is therefore 2(0.4236) = 0.8472
What does this mean?
– For data following a normal distribution, 84.72% of the population lies within the range -1.43 ≤ z1 ≤ 1.43
– But x1 = x’ + z1σ
– So this means that 84.72% of the population lies within ±1.43 standard deviations of the mean
– This is true for any normally distributed data
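The claim that this fraction holds for any normally distributed data can be checked with a quick Monte Carlo experiment. In the sketch below, the two (mean, σ) pairs are arbitrary, hypothetical populations, and `fraction_within` is a helper written for this example. The fraction of samples falling within ±1.43σ of the mean should come out close to 0.8472 in both cases.

```python
import random


def fraction_within(mean, sigma, z1, n=200_000, seed=1):
    """Fraction of n simulated normal samples lying within mean +/- z1*sigma."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n) if abs(rng.gauss(mean, sigma) - mean) <= z1 * sigma
    )
    return inside / n


# Two unrelated, hypothetical populations; different seeds give independent draws.
narrow = fraction_within(mean=480.0, sigma=10.0, z1=1.43, seed=1)
wide = fraction_within(mean=-3.0, sigma=250.0, z1=1.43, seed=2)
print(narrow, wide)
```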
Example 2
What range of a random variable x will contain 90% of the population?
Solution:
– Find z1 such that 45% of the data lie between 0 and +z1 and the other 45% lie between –z1 and 0
– Use the table
– By interpolation, z0.45 = 1.645
Again, x1 = x’ + z1σ
So 90% of the population will fall within the range
(x’ – z0.45σ) < x < (x’ + z0.45σ)
(x’ – 1.645σ) < x < (x’ + 1.645σ)
So 90% of the population will lie within ±1.645 standard deviations of the mean
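Reading the table "backwards" amounts to solving for the z1 that gives a target area. The sketch below does this by bisection rather than by table interpolation. It is one possible approach, not the method used in the slides, and the function names are hypothetical.

```python
import math


def half_area(z1):
    """Area under the standard normal curve between 0 and z1."""
    return 0.5 * math.erf(z1 / math.sqrt(2.0))


def z_for_half_area(target, lo=0.0, hi=10.0, tol=1e-10):
    """Find z1 with half_area(z1) == target by bisection (half_area is increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if half_area(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


z_45 = z_for_half_area(0.45)  # 45% of the population on each side of the mean
print(f"z1 for 90% coverage: {z_45:.4f}")
```

The range for any particular population then follows from x’ ± z1σ.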
Comments
The probability of a measurement occurring can be expressed in terms of the standard deviation of the population
The probability of a measurement being within:
– ±1σ of the mean is 68.27%
– ±2σ of the mean is 95.45%
– ±3σ of the mean is 99.73%
Data outside ±3σ are often considered “outliers”
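These three percentages can be reproduced with the standard normal distribution in Python's `statistics` module. The sketch below is only a check of the quoted numbers; the variable names are chosen for this example.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

# Probability of falling within +/- k standard deviations of the mean.
coverage = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within ±{k}σ: {100 * p:.2f}%   outside: {100 * (1 - p):.2f}%")
```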
Example 3
You are assigned to measure the maximum no-load speed of a new type of DC motor
You apply a constant voltage to “many” of these motors and measure the maximum no-load speed for each motor
You calculate that x’ = 4315.25 rpm and σ = 427.5 rpm
Assuming that the variations in no-load speed are random (and normally distributed), what is the probability that a motor will have a no-load speed between 5000 and 5200 rpm?
P(5000 ≤ x ≤ 5200) = P(4315.25 ≤ x ≤ 5200) – P(4315.25 ≤ x ≤ 5000)
From upper limit (first term):
– z1 = (5200 – 4315.25)/427.5 = 2.0696
– P(2.0696) = 0.4808
From lower limit (second term):
– z1 = (5000 – 4315.25)/427.5 = 1.6018
– P(1.6018) = 0.4454
P(5000 ≤ x ≤ 5200) = 0.4808 – 0.4454 = 0.0354 = 3.54%
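The same calculation can be done directly, without table lookup or interpolation. The sketch below uses the x’ and σ from the example and Python's `statistics.NormalDist`; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

# Population parameters from the example (rpm).
true_mean = 4315.25
sigma = 427.5
motor_speed = NormalDist(mu=true_mean, sigma=sigma)

# Standardized limits, as in the hand calculation.
z_upper = (5200.0 - true_mean) / sigma
z_lower = (5000.0 - true_mean) / sigma

# One-sided areas measured from the mean, then their difference.
area_upper = motor_speed.cdf(5200.0) - 0.5
area_lower = motor_speed.cdf(5000.0) - 0.5
probability = area_upper - area_lower

print(f"z_upper = {z_upper:.4f}, z_lower = {z_lower:.4f}")
print(f"P(5000 <= x <= 5200) = {probability:.4f}")
```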