Descriptive Statistics-II Dr Mahmoud Alhussami

Page 1: Descriptive statistics ii

Descriptive Statistics-II

Dr Mahmoud Alhussami

Page 2: Descriptive statistics ii

Shapes of Distribution

A third important property of data, after location and dispersion, is its shape.

Distributions of quantitative variables can be described in terms of a number of features, many of which are related to the distribution’s physical appearance or shape when presented graphically:

Modality

Symmetry and skewness

Degree of skewness

Kurtosis

Page 3: Descriptive statistics ii

Modality

The modality of a distribution concerns how many peaks or high points there are.

A distribution with a single peak, that is, one value with a high frequency, is a unimodal distribution.

Page 4: Descriptive statistics ii

Modality

A distribution with two or more peaks is called a multimodal distribution.

Page 5: Descriptive statistics ii

Symmetry and Skewness

A distribution is symmetric if it could be split down the middle to form two halves that are mirror images of one another.

In asymmetric distributions, the peaks are off center, with a bulk of scores clustering at one end and a tail trailing off at the other end. Such distributions are often described as skewed.

When the longer tail trails off to the right, this is a positively skewed distribution. E.g. annual income.

When the longer tail trails off to the left, this is called a negatively skewed distribution. E.g. age at death.

Page 6: Descriptive statistics ii

Symmetry and Skewness

Shape can be described by degree of asymmetry (i.e., skewness).

mean > median: positive or right-skewness

mean = median: symmetric or zero-skewness

mean < median: negative or left-skewness

Positive skewness can arise when the mean is increased by some unusually high values.

Negative skewness can arise when the mean is decreased by some unusually low values.

Page 7: Descriptive statistics ii

[Figures: example curves labeled Left skewed, Right skewed, and Symmetric]

Page 8: Descriptive statistics ii

April 13, 2023 8

Shapes of the Distribution

Three common shapes of frequency distributions:

Symmetrical and bell shaped

Positively skewed or skewed to the right

Negatively skewed or skewed to the left

[Figure panels: A, B, C]

Page 9: Descriptive statistics ii

Shapes of the Distribution

Three less common shapes of frequency distributions:

Bimodal

Reverse J-shaped

Uniform

[Figure panels: A, B, C]

Page 10: Descriptive statistics ii


This guy took a VERY long time!

Page 11: Descriptive statistics ii

Degree of Skewness

A skewness index can readily be calculated by most statistical computer programs in conjunction with frequency distributions.

The index has a value of 0 for a perfectly symmetric distribution, a positive value if there is a positive skew, and a negative value if there is a negative skew.

A skewness index that is more than twice the value of its standard error can be interpreted as a departure from symmetry.

Page 12: Descriptive statistics ii

Measures of Skewness or Symmetry

Pearson’s skewness coefficient

It is nonalgebraic and easily calculated. It is also useful for quick estimates of symmetry.

It is defined as: skewness = (mean − median) / SD

Fisher’s measure of skewness

It is based on deviations from the mean to the third power.

Page 13: Descriptive statistics ii

Pearson’s skewness coefficient

For a perfectly symmetrical distribution, the mean will equal the median, and the skewness coefficient will be zero. If the distribution is positively skewed, the mean will be more than the median and the coefficient will be positive. If the coefficient is negative, the distribution is negatively skewed and the mean is less than the median.

Skewness values will fall between -1 and +1 SD units. Values falling outside this range indicate a substantially skewed distribution.

Hildebrand (1986) states that skewness values above 0.2 or below -0.2 indicate severe skewness.
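The coefficient defined above can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_skewness` and the two small data sets are made up, and the sample standard deviation (n − 1 denominator) is assumed for SD.

```python
import statistics

def pearson_skewness(values):
    """Skewness as defined above: (mean - median) / SD."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator), an assumption
    return (mean - median) / sd

# Made-up data for illustration: one unusually high value pulls the mean up.
right_skewed = [1, 2, 3, 4, 100]
symmetric = [1, 2, 3, 4, 5]

print(pearson_skewness(right_skewed))  # positive: mean (22) exceeds median (3)
print(pearson_skewness(symmetric))     # 0.0: mean equals median
```

The sign of the result follows the mean-versus-median comparison described earlier: a positive value when the mean exceeds the median, zero when they coincide.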

Page 14: Descriptive statistics ii

Assumption of Normality

Many of the statistical methods that we will apply require the assumption that a variable or variables are normally distributed.

With multivariate statistics, the assumption is that the combination of variables follows a multivariate normal distribution.

Since there is not a direct test for multivariate normality, we generally test each variable individually and assume that they are multivariate normal if they are individually normal, though this is not necessarily the case.

Page 15: Descriptive statistics ii

Evaluating normality

There are both graphical and statistical methods for evaluating normality.

Graphical methods include the histogram and normality plot.

Statistical methods include diagnostic hypothesis tests for normality, and a rule of thumb that says a variable is reasonably close to normal if its skewness and kurtosis have values between –1.0 and +1.0.

None of the methods is absolutely definitive.

Page 16: Descriptive statistics ii

Transformations

When a variable is not normally distributed, we can create a transformed variable and test it for normality. If the transformed variable is normally distributed, we can substitute it in our analysis.

Three common transformations are: the logarithmic transformation, the square root transformation, and the inverse transformation.

All of these change the measuring scale on the horizontal axis of a histogram to produce a transformed variable that is mathematically equivalent to the original variable.

Page 17: Descriptive statistics ii

Types of Data Transformations

For moderate skewness, use a square root transformation.

For substantial skewness, use a log transformation.

For severe skewness, use an inverse transformation.
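Outside SPSS, the three transformations can be sketched as follows. The helper `transform` and the sample values are hypothetical, and base-10 logarithms are an assumption, since the slides do not state a base.

```python
import math

def transform(values, kind):
    """Return a transformed copy of values; kind is 'sqrt', 'log', or 'inverse'."""
    if kind == "sqrt":
        return [math.sqrt(v) for v in values]   # requires v >= 0
    if kind == "log":
        return [math.log10(v) for v in values]  # requires v > 0; base 10 assumed
    if kind == "inverse":
        return [1.0 / v for v in values]        # requires v != 0
    raise ValueError(f"unknown transformation: {kind}")

# Made-up positive values with a long right tail.
hours = [0.2, 1.0, 4.0, 25.0, 100.0]
for kind in ("sqrt", "log", "inverse"):
    print(kind, [round(v, 3) for v in transform(hours, kind)])
```

Each transformed list would then be checked for normality in the same way as the original variable. Note that the log and inverse transformations are undefined at zero, so a variable containing zeros needs a constant added first.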

Page 18: Descriptive statistics ii

Computing “Explore” descriptive statistics

To compute the statistics needed for evaluating the normality of a variable, select the Explore… command from the Descriptive Statistics menu.

Page 19: Descriptive statistics ii

Adding the variable to be evaluated

First, click on the variable to be included in the analysis to highlight it.

Second, click on the right arrow button to move the highlighted variable to the Dependent List.

Page 20: Descriptive statistics ii

Selecting statistics to be computed

To select the statistics for the output, click on the Statistics… command button.

Page 21: Descriptive statistics ii

Including descriptive statistics

First, click on the Descriptives checkbox to select it. Clear the other checkboxes.

Second, click on the Continue button to complete the request for statistics.

Page 22: Descriptive statistics ii

Selecting charts for the output

To select the diagnostic charts for the output, click on the Plots… command button.

Page 23: Descriptive statistics ii

Including diagnostic plots and statistics

First, click on the None option button on the Boxplots panel since boxplots are not as helpful as other charts in assessing normality.

Second, click on the Normality plots with tests checkbox to include normality plots and the hypothesis tests for normality.

Third, click on the Histogram checkbox to include a histogram in the output. You may want to examine the stem-and-leaf plot as well, though I find it less useful.

Finally, click on the Continue button to complete the request.

Page 24: Descriptive statistics ii

Completing the specifications for the analysis

Click on the OK button to complete the specifications for the analysis and request SPSS to produce the output.

Page 25: Descriptive statistics ii

[Figure: histogram of TOTAL TIME SPENT ON THE INTERNET. Horizontal axis: 0.0 to 100.0; vertical axis: Frequency, 0 to 50. Std. Dev = 15.35, Mean = 10.7, N = 93.00]

The histogram

An initial impression of the normality of the distribution can be gained by examining the histogram.

In this example, the histogram shows a substantial violation of normality caused by an extremely large value in the distribution.

Page 26: Descriptive statistics ii

[Figure: Normal Q-Q Plot of TOTAL TIME SPENT ON THE INTERNET. Horizontal axis: Observed Value, -40 to 120; vertical axis: Expected Normal, -3 to 3]

The normality plot

The problem with the normality of this variable’s distribution is reinforced by the normality plot.

If the variable were normally distributed, the red dots would fit the green line very closely. In this case, the red points in the upper right of the chart indicate the severe skewing caused by the extremely large data values.

Page 27: Descriptive statistics ii

Tests of Normality

                                    Kolmogorov-Smirnov (a)      Shapiro-Wilk
                                    Statistic   df   Sig.       Statistic   df   Sig.
TOTAL TIME SPENT ON THE INTERNET    .246        93   .000       .606        93   .000

a. Lilliefors Significance Correction

The test of normality

Problem 1 asks about the results of the test of normality. Since the sample size is larger than 50, we use the Kolmogorov-Smirnov test. If the sample size were 50 or less, we would use the Shapiro-Wilk statistic instead.

The null hypothesis for the test of normality states that the actual distribution of the variable is equal to the expected distribution, i.e., the variable is normally distributed. Since the probability associated with the test of normality (< 0.001) is less than or equal to the level of significance (0.01), we reject the null hypothesis and conclude that total hours spent on the Internet is not normally distributed. (Note: we report the probability as <0.001 instead of .000 to be clear that the probability is not really zero.)

The answer to problem 1 is false.

Page 28: Descriptive statistics ii

The assumption of normality script

An SPSS script to produce all of the output that we have produced manually is available on the course web site.

After downloading the script, run it to test the assumption of normality.

Select Run Script… from the Utilities menu.

Page 29: Descriptive statistics ii

Selecting the assumption of normality script

First, navigate to the folder containing your scripts and highlight the NormalityAssumptionAndTransformations.SBS script.

Second, click on the Run button to activate the script.

Page 30: Descriptive statistics ii

Specifications for normality script

First, move variables from the list of variables in the data set to the Variables to Test list box.

The default output is to do all of the transformations of the variable. To exclude some transformations from the calculations, clear the checkboxes.

Third, click on the OK button to run the script.

Page 31: Descriptive statistics ii

Tests of Normality

                                    Kolmogorov-Smirnov (a)      Shapiro-Wilk
                                    Statistic   df   Sig.       Statistic   df   Sig.
TOTAL TIME SPENT ON THE INTERNET    .246        93   .000       .606        93   .000

a. Lilliefors Significance Correction

The test of normality

The script produces the same output that we computed manually; in this example, the tests of normality.

Page 32: Descriptive statistics ii

When transformations do not work

When none of the transformations induces normality in a variable, including that variable in the analysis will reduce our effectiveness at identifying statistical relationships, i.e. we lose power.

We do have the option of changing the way the information in the variable is represented, e.g. substitute several dichotomous variables for a single metric variable.

Page 33: Descriptive statistics ii

Fisher’s Measure of Skewness

The formula for Fisher’s skewness statistic is based on deviations from the mean to the third power.

The measure of skewness can be interpreted in terms of the normal curve.

A symmetrical curve will result in a value of 0.

If the skewness value is positive, then the curve is skewed to the right, and vice versa for a distribution skewed to the left.

A z-score is calculated by dividing the measure of skewness by the standard error for skewness. Values above +1.96 or below -1.96 are significant at the 0.05 level because 95% of the scores in a normal distribution fall between +1.96 and -1.96 standard deviations from the mean.

E.g. if Fisher’s skewness = 0.195 and st. err. = 0.197, the z-score = 0.195/0.197 = 0.99.
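The slides do not give the formula itself, so the sketch below assumes the bias-adjusted sample skewness and the standard error that statistical packages commonly report; treat both formulas and all function names as assumptions. For N = 93, the standard error formula gives about .250, which agrees with the value SPSS reports for this data set.

```python
import math

def fisher_skewness(data):
    """Bias-adjusted sample skewness from third-power deviations (assumed form)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # sample SD
    cubed = sum((x - mean) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed / s ** 3

def skewness_standard_error(n):
    """Standard error of skewness for a sample of size n (assumed form)."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def z_score(statistic, standard_error):
    """Ratio compared against +/-1.96 at the 0.05 level."""
    return statistic / standard_error

print(round(z_score(0.195, 0.197), 2))         # 0.99, the slide's example
print(round(skewness_standard_error(93), 3))   # 0.25
```

A z-score of 0.99 lies inside ±1.96, so the skewness in the slide's example would not be judged significant at the 0.05 level.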

Page 34: Descriptive statistics ii

Kurtosis

A distribution’s kurtosis concerns how pointed or flat its peak is.

Two types:

Leptokurtic distribution (means thin).

Platykurtic distribution (means flat).

Page 35: Descriptive statistics ii

Kurtosis

There is a statistical index of kurtosis that can be computed when computer programs are instructed to produce a frequency distribution.

For the kurtosis index, a value of zero indicates a shape that is neither flat nor pointed.

Positive values on the kurtosis statistic indicate greater peakedness, and negative values indicate greater flatness.

Page 36: Descriptive statistics ii

Fisher’s Measure of Kurtosis

Fisher’s measure is based on deviations from the mean to the fourth power.

A z-score is calculated by dividing the measure of kurtosis by the standard error for kurtosis.
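As with skewness, the slides omit the formula. The sketch below assumes the bias-adjusted excess kurtosis (zero for a normal curve) and its usual standard error; the function names are hypothetical. For N = 93, the standard error comes out near .495, matching the SPSS output for this data set.

```python
import math

def excess_kurtosis(data):
    """Bias-adjusted excess kurtosis from fourth-power deviations (assumed form)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    g2 = m4 / m2 ** 2 - 3  # unadjusted excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

def kurtosis_standard_error(n):
    """Standard error of kurtosis for a sample of size n (assumed form)."""
    ses = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return 2 * ses * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

print(round(kurtosis_standard_error(93), 3))  # 0.495
print(excess_kurtosis([1, 2, 3, 4, 5]))       # negative: a flat, uniform-like sample
```

Dividing a kurtosis statistic by this standard error gives the z-score described above, read against the same ±1.96 bounds used for skewness.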

Page 37: Descriptive statistics ii

Descriptives: TOTAL TIME SPENT ON THE INTERNET

                                                 Statistic   Std. Error
Mean                                             10.731      1.5918
95% Confidence Interval for Mean   Lower Bound   7.570
                                   Upper Bound   13.893
5% Trimmed Mean                                  8.295
Median                                           5.500
Variance                                         235.655
Std. Deviation                                   15.3511
Minimum                                          .2
Maximum                                          102.0
Range                                            101.8
Interquartile Range                              10.200
Skewness                                         3.532       .250
Kurtosis                                         15.614      .495

Table of descriptive statistics

To answer problem 2, we look at the values for skewness and kurtosis in the Descriptives table.

The skewness and kurtosis for the variable both exceed the rule of thumb criteria of 1.0. The variable is not normally distributed.

The answer to problem 2 is false.

Page 38: Descriptive statistics ii

Other problems on assumption of normality

A problem may ask about the assumption of normality for a nominal level variable. The answer will be “An inappropriate application of a statistic” since there is no expectation that a nominal variable be normal.

A problem may ask about the assumption of normality for an ordinal level variable. If the variable or transformed variable is normal, the correct answer to the question is “True with caution” since we may be required to defend treating an ordinal variable as metric.

Questions will specify a level of significance to use and the statistical evidence upon which you should base your answer.

Page 39: Descriptive statistics ii

Normal Distribution

Also called the bell-shaped curve, normal curve, or Gaussian distribution.

A normal distribution is one that is unimodal, symmetric, and not too peaked or flat.

Given its name by the Belgian mathematician Quetelet who, in the early 19th century, noted that many human attributes, e.g. height, weight, and intelligence, appeared to be distributed normally.

Page 40: Descriptive statistics ii

Normal Distribution

The normal curve is unimodal and symmetric about its mean (μ).

In this distribution the mean, median and mode are all identical.

The standard deviation (σ) specifies the amount of dispersion around the mean.

The two parameters μ and σ completely define a normal curve.

Page 41: Descriptive statistics ii

Also called a probability density function. The probability is interpreted as "area under the curve."

The random variable takes on an infinite number of values within a given interval.

The probability that X equals any particular value is 0. Consequently, we talk about intervals. The probability is equal to the area under the curve.

The area under the whole curve = 1.

Page 42: Descriptive statistics ii


Page 43: Descriptive statistics ii

Normal Distribution

X is the random variable.

μ is the mean value.

σ is the standard deviation (std) value.

e = 2.7182818... constant.

π = 3.1415926... constant.
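The symbols listed here belong to the normal density function, whose formula is not reproduced in the text. As a sketch, assuming the standard form f(X) = 1 / (σ√(2π)) · e^(−(X − μ)² / (2σ²)), it can be evaluated as follows (the function name `normal_pdf` is made up):

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

print(round(normal_pdf(0.0, 0.0, 1.0), 4))  # 0.3989, the peak of the standard normal
print(normal_pdf(140, 150, 10) == normal_pdf(160, 150, 10))  # True: symmetric about mu
```

The returned value is a curve height, not a probability; probabilities come from areas under the curve over an interval.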

Page 44: Descriptive statistics ii

Importance of Normal Distribution to Statistics

Although most distributions are not exactly normal, most variables tend to have an approximately normal distribution.

Many inferential statistics assume that the populations are distributed normally.

The normal curve is a probability distribution and is used to answer questions about the likelihood of getting various particular outcomes when sampling from a population.

Page 45: Descriptive statistics ii

Probabilities are obtained by getting the area under the curve inside of a particular interval. The area under the curve = the proportion of times under identical (repeated) conditions that a particular range of values will occur.

Characteristics of the Normal distribution:

It is symmetric about the mean μ.

Mean = median = mode. [“bell-shaped” curve]

f(X) decreases as X gets farther and farther away from the mean.

It approaches the horizontal axis asymptotically: −∞ < X < +∞. This means that there is always some probability (area) for extreme values.

Page 46: Descriptive statistics ii

Why Do We Like The Normal Distribution So Much?

There is nothing “special” about standard normal scores.

These can be computed for observations from any sample/population of continuous data values.

The score measures how far an observation is from its mean in standard units of statistical distance.

But, if the distribution is not normal, we may not be able to use the Z-score approach.

Page 47: Descriptive statistics ii

Probability Distributions

Any characteristic that can be measured or categorized is called a variable.

If the variable can assume a number of different values such that any particular outcome is determined by chance, it is called a random variable.

Every random variable has a corresponding probability distribution.

The probability distribution applies the theory of probability to describe the behavior of the random variable.

Page 48: Descriptive statistics ii

Discrete Probability Distributions

Binomial distribution – the random variable can only assume 1 of 2 possible outcomes. There are a fixed number of trials and the results of the trials are independent.

E.g. flipping a coin and counting the number of heads in 10 trials.

Poisson distribution – the random variable can assume a value between 0 and infinity.

Counts usually follow a Poisson distribution (e.g. number of ambulances needed in a city in a given night).

Page 49: Descriptive statistics ii

Discrete Random Variable

A discrete random variable X has a finite number of possible values. The probability distribution of X lists the values and their probabilities.

Value of X     x_1   x_2   x_3   …   x_k
Probability    p_1   p_2   p_3   …   p_k

1. Every probability p_i is a number between 0 and 1.

2. The sum of the probabilities must be 1.

Find the probability of any event by adding the probabilities of the particular values that make up the event.

Page 50: Descriptive statistics ii

Example

The instructor in a large class gives 15% each of A’s and D’s, 30% each of B’s and C’s and 10% F’s. The student’s grade on a 4-point scale is a random variable X (A=4).

Grade          F=0    D=1    C=2    B=3    A=4
Probability    0.10   0.15   0.30   0.30   0.15

What is the probability that a student selected at random will have a B or better?

ANSWER: P(grade of 3 or 4) = P(X=3) + P(X=4) = 0.3 + 0.15 = 0.45
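The grade example can be checked with a short Python sketch. The dictionary mirrors the grade table; the helper name `probability_of` is made up for illustration.

```python
# Probability distribution of the grade X on a 4-point scale (F=0 ... A=4).
grade_probability = {0: 0.10, 1: 0.15, 2: 0.30, 3: 0.30, 4: 0.15}

# Rule 2: the probabilities must sum to 1 (allowing for floating-point rounding).
assert abs(sum(grade_probability.values()) - 1.0) < 1e-9

def probability_of(event, distribution):
    """Add the probabilities of the values that make up the event."""
    return sum(p for value, p in distribution.items() if event(value))

p_b_or_better = probability_of(lambda grade: grade >= 3, grade_probability)
print(round(p_b_or_better, 2))  # 0.45
```

Passing the event as a function makes it easy to ask other questions of the same table, such as the probability of a C or worse.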

Page 51: Descriptive statistics ii

Continuous Probability Distributions

When it follows a Binomial or a Poisson distribution, the variable is restricted to taking on integer values only.

Between two values of a continuous random variable we can always find a third.

A histogram is used to represent a discrete probability distribution, and a smooth curve called the probability density is used to represent a continuous probability distribution.

Page 52: Descriptive statistics ii

Normal Distribution

Q: Is every variable normally distributed?

A: Absolutely not.

Q: Then why do we spend so much time studying the normal distribution?

A: Some variables are normally distributed; a bigger reason is the “Central Limit Theorem”!

Page 53: Descriptive statistics ii

Central Limit Theorem

The Central Limit Theorem describes the characteristics of the "population of the means", which has been created from the means of an infinite number of random population samples of size (N), all of them drawn from a given "parent population".

It predicts that, regardless of the distribution of the parent population:

The mean of the population of means is always equal to the mean of the parent population from which the population samples were drawn.

The standard deviation of the population of means is always equal to the standard deviation of the parent population divided by the square root of the sample size (N).

The distribution of means will increasingly approximate a normal distribution as the size N of samples increases.
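A small simulation can illustrate the first two predictions. This is a sketch under stated assumptions: the parent population is an exponential distribution with mean 1 and standard deviation 1 (chosen only because it is clearly non-normal), and a finite number of samples stands in for the "infinite number" in the theorem. All names are hypothetical.

```python
import math
import random
import statistics

def simulate_sample_means(sample_size, num_samples, seed=0):
    """Draw num_samples samples of the given size and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

N = 30
means = simulate_sample_means(sample_size=N, num_samples=5000)

print("mean of means:", statistics.fmean(means))   # close to the parent mean, 1.0
print("SD of means:  ", statistics.stdev(means))   # close to 1 / sqrt(N)
print("predicted SD: ", 1.0 / math.sqrt(N))
```

Plotting a histogram of `means` for increasing values of N would show the third prediction: the shape moves toward the bell curve even though the parent distribution is strongly right-skewed.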

Page 54: Descriptive statistics ii

Central Limit Theorem

A consequence of the Central Limit Theorem is that if we average measurements of a particular quantity, the distribution of our average tends toward a normal one.

In addition, if a measured variable is actually a combination of several other uncorrelated variables, all of them "contaminated" with a random error of any distribution, our measurements tend to be contaminated with a random error that is normally distributed as the number of these variables increases.

Thus, the Central Limit Theorem explains the ubiquity of the famous bell-shaped "Normal distribution" (or "Gaussian distribution") in the measurements domain.

Page 55: Descriptive statistics ii

Note that the normal distribution is defined by two parameters, μ and σ . You can draw a normal distribution for any μ and σ combination. There is one normal distribution, Z, that is special. It has a μ = 0 and a σ = 1. This is the Z distribution, also called the standard normal distribution. It is one of trillions of normal distributions we could have selected.

55

Page 56: Descriptive statistics ii

Standard Normal Variable

It is customary to call a standard normal random variable Z.

The outcomes of the random variable Z are denoted by z.

The table in the coming slide gives the area under the curve (probabilities) between the mean and z.

The probabilities in the table refer to the likelihood that a randomly selected value Z is equal to or less than a given value of z and greater than 0 (the mean of the standard normal).

Page 57: Descriptive statistics ii

57

Source: Levine et al, Business Statistics, Pearson.

Page 58: Descriptive statistics ii

The 68-95-99.7 Rule for the Normal Distribution

68% of the observations fall within one standard deviation of the mean.

95% of the observations fall within two standard deviations of the mean.

99.7% of the observations fall within three standard deviations of the mean.

When applied to ‘real data’, these estimates are considered approximate!

Page 59: Descriptive statistics ii

Remember these probabilities (percentages):

# standard deviations from the mean    Approx. area under the normal curve
±1                                     .68
±1.645                                 .90
±1.96                                  .95
±2                                     .955
±2.575                                 .99
±3                                     .997

Practice: Find these values yourself using the Z table.
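These areas can also be reproduced without a Z table. The sketch below uses the identity P(−k < Z < k) = erf(k / √2) for a standard normal Z; the function name `area_within` is made up.

```python
import math

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.645, 1.96, 2, 2.575, 3):
    print(f"±{k}: {area_within(k):.4f}")
```

The printed values agree with the approximate areas listed above once rounded to two or three decimals.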

Page 60: Descriptive statistics ii

[Figure: Standard Normal Curve]

Page 61: Descriptive statistics ii

Standard Normal Distribution

[Figure: standard normal curve with 50% of probability (0.5) on each side of the mean]

Page 62: Descriptive statistics ii

Standard Normal Distribution

[Figure: Standard Normal Distribution with 95% area marked; 95% of probability in the center and 2.5% of probability in each tail]

Page 63: Descriptive statistics ii

Calculating Probabilities

Probability calculations are always concerned with finding the probability that the variable assumes any value in an interval between two specific points a and b.

The probability that a continuous variable assumes a value between a and b is the area under the graph of the density between a and b.

Page 64: Descriptive statistics ii

Normal Distribution

If the weight of males is N.D. with μ=150 and σ=10, what is the probability that a randomly selected male will weigh between 140 lbs and 155 lbs?

[Important Note: Always remember that the probability that X is equal to any one particular value is zero, P(X=value) = 0, since the normal distribution is continuous.]

Page 65: Descriptive statistics ii

Solution:  

Z = (140 – 150) / 10 = -1.00 s.d. from mean. Area under the curve = .3413 (from Z table)

Z = (155 – 150) / 10 = +.50 s.d. from mean. Area under the curve = .1915 (from Z table)

Answer: .3413 + .1915 = .5328
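The table lookup can be cross-checked in Python. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Weight of males: normally distributed with mean 150 and SD 10.
weight = NormalDist(mu=150, sigma=10)

# P(140 < X < 155) is the area under the curve between the two values.
p = weight.cdf(155) - weight.cdf(140)
print(round(p, 4))  # 0.5328
```

The cumulative areas are subtracted here, which is equivalent to adding the two areas measured from the mean as done in the solution.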

[Figure: normal curve with X = 140, 150, 155 marked at Z = -1, 0, 0.5]

Page 66: Descriptive statistics ii

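Example

What’s the probability of getting a math SAT score of 575 or less, given μ = 500 and σ = 50?

$$Z = \frac{575 - 500}{50} = 1.5$$

i.e., a score of 575 is 1.5 standard deviations above the mean.

$$P(X \le 575) = \int_{200}^{575} \frac{1}{50\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-500}{50}\right)^{2}}\, dx \approx \int_{-\infty}^{1.5} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^{2}}\, dz$$

Yikes! But looking up Z = 1.5 in the standard normal chart (or entering it into SAS) is no problem: the probability is .9332.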

Page 67: Descriptive statistics ii

If IQ is ND with a mean of 100 and a S.D. of 10, what percentage of the population will have

(a) IQs ranging from 90 to 110?

(b) IQs ranging from 80 to 120?

Solution:

Z = (90 – 100)/10 = -1.00

Z = (110 – 100)/10 = +1.00

Area between 0 and 1.00 in the Z-table is .3413; area between 0 and -1.00 is also .3413 (Z-distribution is symmetric).

Answer to part (a) is .3413 + .3413 = .6826.

Page 68: Descriptive statistics ii

(b) IQs ranging from 80 to 120?

Solution:

Z = (80 – 100)/10 = -2.00

Z = (120 – 100)/10 = +2.00

Area between 0 and 2.00 in the Z-table is .4772; area between 0 and -2.00 is also .4772 (Z-distribution is symmetric).

Answer is .4772 + .4772 = .9544.

Page 69: Descriptive statistics ii

Suppose that the average salary of college graduates is N.D. with μ=$40,000 and σ=$10,000.

(a) What proportion of college graduates will earn $24,800 or less?

(b) What proportion of college graduates will earn $53,500 or more?

(c) What proportion of college graduates will earn between $45,000 and $57,000?

(d) Calculate the 80th percentile. (e) Calculate the 27th percentile.


Page 70: Descriptive statistics ii

(a) What proportion of college graduates will earn $24,800 or less?

Solution: Convert the $24,800 to a Z-score: Z = ($24,800 - $40,000)/$10,000 = -1.52.

Always DRAW a picture of the distribution to help you solve these problems.

Page 71: Descriptive statistics ii

First, find the area between 0 and -1.52 in the Z-table. From the Z table, that area is .4357. Then, the area from -1.52 to -∞ is .5000 - .4357 = .0643.

Answer: 6.43% of college graduates will earn less than $24,800.

[Figure: normal curve with X = $24,800 (Z = -1.52) and X = $40,000 (Z = 0); area between them = .4357]

Page 72: Descriptive statistics ii

(b) What proportion of college graduates will earn $53,500 or more?

Solution: Convert the $53,500 to a Z-score.

Z = ($53,500 - $40,000)/$10,000 = +1.35.

Find the area between 0 and +1.35 in the Z-table: .4115 is the table value.

When you DRAW A PICTURE you see that you need the area in the tail: .5 - .4115 = .0885.

Answer: .0885. Thus, 8.85% of college graduates will earn $53,500 or more.

[Figure: normal curve with X = $40,000 (Z = 0) and X = $53,500 (Z = +1.35); area between them = .4115, area in the upper tail = .0885]

Page 73: Descriptive statistics ii

(c) What proportion of college graduates will earn between $45,000 and $57,000?

Z = ($45,000 – $40,000) / $10,000 = .50

Z = ($57,000 – $40,000) / $10,000 = 1.70

From the table, we can get the area under the curve between the mean (0) and .5; we can get the area between 0 and 1.7. From the picture we see that neither one is what we need.

What do we do here? Subtract the small piece from the big piece to get exactly what we need.

Answer: .4554 − .1915 = .2639
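Parts (a) through (c) can be cross-checked with `statistics.NormalDist` (Python 3.8 or later). This is a sketch with illustrative variable names; because it does not round Z-table entries, a result can differ from the table answer in the fourth decimal place.

```python
from statistics import NormalDist

# Salary of college graduates: normal with mean $40,000 and SD $10,000.
salary = NormalDist(mu=40_000, sigma=10_000)

p_a = salary.cdf(24_800)                       # (a) $24,800 or less
p_b = 1 - salary.cdf(53_500)                   # (b) $53,500 or more
p_c = salary.cdf(57_000) - salary.cdf(45_000)  # (c) between $45,000 and $57,000

for label, p in (("a", p_a), ("b", p_b), ("c", p_c)):
    print(f"({label}) {p:.4f}")
```

Part (c) shows the same "big piece minus small piece" idea: the area up to $57,000 minus the area up to $45,000.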

73

$40k

Z

0 1.7

$45k $57k

.5

.19

15

.4554
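Parts (a) to (c) can be checked without a Z-table. The sketch below is one possible check, assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative only. Because it uses the exact normal curve rather than a table rounded to four decimals, the last digit may differ slightly from the hand calculation.

```python
from statistics import NormalDist

# Salary model from the example: mean $40,000, standard deviation $10,000.
salary = NormalDist(mu=40_000, sigma=10_000)

# (a) P(X <= 24,800): area in the lower tail.
p_a = salary.cdf(24_800)

# (b) P(X >= 53,500): area in the upper tail.
p_b = 1 - salary.cdf(53_500)

# (c) P(45,000 <= X <= 57,000): the big piece minus the small piece.
p_c = salary.cdf(57_000) - salary.cdf(45_000)

print(f"(a) {p_a:.4f}  (b) {p_b:.4f}  (c) {p_c:.4f}")
```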

Page 74: Descriptive statistics ii

Parts (d) and (e) of this example ask you to compute percentiles. Every Z-score is associated with a percentile. A Z-score of 0 is the 50th percentile: if you take any test that is normally distributed (e.g., the SAT exam) and your Z-score on the test is 0, you scored at the 50th percentile. In fact, your score is the mean, median, and mode.

Page 75: Descriptive statistics ii

(d) Calculate the 80th percentile.

Solution: First, what Z-score is associated with the 80th percentile? A Z-score of approximately +.84 will give you about .3000 of the area under the curve. Also, the area under the curve between -∞ and 0 is .5000. Therefore, a Z-score of +.84 is associated with the 80th percentile (.5000 + .3000 = .8000).

Now to find the salary (X) at the 80th percentile, just solve for X: +.84 = (X − $40,000)/$10,000, so X = $40,000 + $8,400 = $48,400.

[Figure: normal curve with $40,000 (Z = 0) and the answer at Z = .84 marked; area .5000 to the left of 0 and .3000 between 0 and .84.]

Page 76: Descriptive statistics ii

(e) Calculate the 27th percentile.

Solution: First, what Z-score is associated with the 27th percentile? A Z-score of approximately -.61 will give you about .2300 of the area under the curve, with .2700 in the tail. (The area under the curve between 0 and -.61 is .2291, which we are rounding to .2300.) The area under the curve below 0 is .5000, so the area below -.61 is .5000 − .2300 = .2700. Therefore, a Z-score of -.61 is associated with the 27th percentile.

Now to find the salary (X) at the 27th percentile, just solve for X: -.61 = (X − $40,000)/$10,000, so X = $40,000 - $6,100 = $33,900.

[Figure: normal curve with $40,000 (Z = 0) and the answer at Z = -.61 marked; area .2300 between -.61 and 0, .2700 in the lower tail, and .5000 above 0.]
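The percentile calculations in (d) and (e) run the Z-table lookup in reverse. A minimal sketch, again assuming the standard-library `statistics.NormalDist` (names are illustrative): `inv_cdf` returns the value below which a given proportion of the distribution lies. The results are close to, but not identical with, the slide answers, because the slides round Z to two decimals.

```python
from statistics import NormalDist

standard = NormalDist()                          # Z ~ N(0, 1)
salary = NormalDist(mu=40_000, sigma=10_000)     # X ~ N(40,000, 10,000)

# Z-scores that cut off 80% and 27% of the area to their left.
z80 = standard.inv_cdf(0.80)
z27 = standard.inv_cdf(0.27)

# Same percentiles on the salary scale: X = mu + Z * sigma.
x80 = salary.inv_cdf(0.80)
x27 = salary.inv_cdf(0.27)

print(f"80th percentile: Z = {z80:.2f}, X = ${x80:,.0f}")
print(f"27th percentile: Z = {z27:.2f}, X = ${x27:,.0f}")
```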

Page 77: Descriptive statistics ii

April 13, 2023 77

T-Distribution
• Similar to the standard normal in that it is unimodal, bell-shaped and symmetric.
• The tails of the distribution are "thicker" than those of the standard normal.
• The distribution is indexed by "degrees of freedom" (df).
• The degrees of freedom measure the amount of information available in the data set that can be used for estimating the population variance (df = n - 1).
• The area under the curve still equals 1.
• Probabilities for the t-distribution with infinite df equal those of the standard normal.

Page 78: Descriptive statistics ii

T-Distribution

• The table of the t-distribution gives the probability to the right of a critical value, i.e. the area in the upper tail.
• The area (or probability) is given only for a few selected critical values for each degree of freedom.

Page 79: Descriptive statistics ii

T-Distribution Example

For a t-curve from a sample of size 15, find the area to the left of 2.145.

Answer: df = 15 - 1 = 14. In the table of the t-distribution, the area to the right of 2.145 is 0.025. Therefore the area to the left of 2.145 is 1 - 0.025 = 0.975.
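The table lookup can be cross-checked numerically. The sketch below is not how a printed t-table is produced; it is one simple way to approximate the area, assuming only the standard library: it integrates the t density with Simpson's rule and uses the symmetry of the curve around 0. The function names `t_pdf` and `t_cdf` are made up for this illustration.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """Area to the left of x: Simpson's rule on [0, |x|] plus symmetry.

    `steps` must be even for Simpson's rule.
    """
    if x == 0:
        return 0.5
    b = abs(x)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3          # area between 0 and |x|
    return 0.5 + half if x > 0 else 0.5 - half

n = 15                            # sample size from the example
area_left = t_cdf(2.145, df=n - 1)
print(f"Area to the left of 2.145 with df = {n - 1}: {area_left:.3f}")
```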

Page 80: Descriptive statistics ii

Graphical Methods

• Frequency Distribution
• Histogram
• Frequency Polygon
• Cumulative Frequency Graph
• Pie Chart

Page 81: Descriptive statistics ii

Presenting Data: Table

• Condenses data into a form that can make them easier to understand;
• Shows many details in summary fashion;
BUT
• Since a table shows only numbers, it may not be readily understood without comparing it to other values.

Page 82: Descriptive statistics ii

Principles of Table Construction
• Don't try to do too much in a table.
• Use white space effectively to make the table layout pleasing to the eye.
• Make sure tables & text refer to each other.
• Use some aspect of the table to order & group rows & columns.

Page 83: Descriptive statistics ii

Principles of Table Construction
• If appropriate, frame the table with summary statistics in rows & columns to provide a standard of comparison.
• Round numbers in the table to one or two decimal places to make them easily understood.
• When creating tables for publication in a manuscript, double-space them unless contraindicated by the journal.

Page 84: Descriptive statistics ii

Frequency Distributions

• A useful way to present data when you have a large data set is the formation of a frequency table or frequency distribution.
• Frequency: the number of observations that fall within a certain range of the data.

Page 85: Descriptive statistics ii

Frequency Table

Age      Number of Deaths
<1       564
1-4      86
5-14     127
15-24    490
25-34    66
35-44    806
45-54    1,425
55-64    3,511
65-74    6,932
75-84    10,101
85+      9,825
Total    34,524

Page 86: Descriptive statistics ii

Frequency Table

Data Intervals   Frequency   Cumulative Frequency   Relative Frequency (%)   Cumulative Relative Frequency (%)
10-19            5           5
20-29            18          23
30-39            10          33
40-49            13          46
50-59            4           50
60-69            4           54
70-79            2           56
Total
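The remaining columns of this table follow mechanically from the frequency column. A minimal sketch, assuming only the interval counts shown above (variable names are illustrative): cumulative frequency is a running sum, relative frequency is each count divided by the total, and cumulative relative frequency is the running sum divided by the total.

```python
from itertools import accumulate

intervals = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
freq = [5, 18, 10, 13, 4, 4, 2]

n = sum(freq)                                   # total number of observations
cum_freq = list(accumulate(freq))               # running total of counts
rel_freq = [100 * f / n for f in freq]          # percent in each interval
cum_rel_freq = [100 * c / n for c in cum_freq]  # percent at or below each interval

print("interval  freq  cum   rel%   cum%")
for row in zip(intervals, freq, cum_freq, rel_freq, cum_rel_freq):
    print("{:>8} {:>5} {:>4} {:>6.1f} {:>6.1f}".format(*row))
```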

Page 87: Descriptive statistics ii

Cumulative Relative Frequency

• Cumulative Relative Frequency: the percentage of persons having a measurement less than or equal to the upper boundary of the class interval.
• E.g. the cumulative relative frequency for the 3rd interval of our data example: 8.8 + 33.3 + 17.5 = 59.6%
• We say that 59.6% of the children have weights below 39.5 pounds.

Page 88: Descriptive statistics ii

Number of Intervals

• There is no clear-cut rule on the number of intervals or classes that should be used.
• Too many intervals: the data may not be summarized enough for a clear visualization of how they are distributed.
• Too few intervals: the data may be over-summarized and some of the details of the distribution may be lost.

Page 89: Descriptive statistics ii

Presenting Data: Chart

- Visual representation of a frequency distribution that helps to gain insight about what the data mean.
- Built with lines, area & text. Ex: bar chart, pie chart

Page 90: Descriptive statistics ii

Bar Chart
• Simplest form of chart
• Used to display nominal or ordinal data

[Figure: bar chart of Ethical Issues Scale, Item 8, "Acting against your own personal/religious views"; categories Frequently, Sometimes, Seldom, Never; vertical axis Percent, 0 to 60.]

Page 91: Descriptive statistics ii

Horizontal Bar Chart

[Figure: horizontal bar chart of Clinical Practice Area; categories Acute Care, Critical Care, Gerontology, Post Anesthesia, Perinatal, Clinical Research, Family Nursing, Neonatal, Psych/Mental Health, Community Health, General Practice, Orthopedics, Primary Care, Operating Room, Medical, Oncology, Other; horizontal axis Percent, 0 to 14.]

Page 92: Descriptive statistics ii

Cluster Bar Chart

[Figure: clustered bar chart of RN Highest Education (Post Bac, Bachelor Degree, Associate Degree, Diploma); vertical axis Percent, 0 to 70; bars clustered by Employment (Full time RN, Part time RN, Self employed).]

Page 93: Descriptive statistics ii

Pie Chart
• Alternative to bar chart
• Circle partitioned into percentage distributions of qualitative variables with total area of 100%

[Figure: pie chart with slices Doctorate NonNursing, Doctorate Nursing, MS NonNursing, MS Nursing, Juris Doctor, BS NonNursing, BS Nursing, AD Nursing, Diploma-Nursing, Missing.]

Page 94: Descriptive statistics ii

Histogram
• Appropriate for interval, ratio and sometimes ordinal data
• Similar to bar charts but bars are placed side by side
• Often used to represent both frequencies and percentages
• Most histograms have from 5 to 20 bars

Page 95: Descriptive statistics ii

Histogram

[Figure: histogram of SF-36 Vitality Scores; horizontal axis 0.0 to 100.0; vertical axis Frequency, 0 to 80; Std. Dev = 22.17, Mean = 61.6, N = 439.]

Page 96: Descriptive statistics ii

Pictures of Data: Histograms

Blood pressure data on a sample of 113 men

Histogram of the Systolic Blood Pressure for 113 men. Each bar spans a width of 5 mmHg on the horizontal axis. The height of each bar represents the number of individuals with SBP in that range.

[Figure: histogram; horizontal axis Systolic BP (mmHg), 80 to 160; vertical axis Number of Men, 0 to 20.]

Page 97: Descriptive statistics ii

Frequency Polygon

•First place a dot at the midpoint of the upper base of each rectangular bar.

•The points are connected with straight lines.

•At the ends, the points are connected to the midpoints of the previous and succeeding intervals (these intervals have zero frequency).

[Figure: frequency polygon of children's weights; horizontal axis interval midpoints 4.5 to 84.5; vertical axis frequency, 0 to 20.]
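The three construction steps can be sketched in a few lines. This is an illustration only: it assumes the polygon is drawn from the 10-19 to 70-79 interval counts of the earlier frequency-table slide, which the slide itself does not state, and the variable names are made up. Each bar contributes one point at its midpoint, and one zero-frequency point is added before the first interval and after the last.

```python
# Class limits and counts (assumed to be the data behind the polygon).
lower = [10, 20, 30, 40, 50, 60, 70]
upper = [19, 29, 39, 49, 59, 69, 79]
freq = [5, 18, 10, 13, 4, 4, 2]
width = 10  # every interval spans 10 units

# Step 1: one dot at the midpoint of the top of each bar.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]

# Step 3: close the polygon with the zero-frequency neighbouring intervals.
points = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, freq))
    + [(midpoints[-1] + width, 0)]
)

# Step 2: consecutive points are joined with straight line segments.
segments = list(zip(points, points[1:]))

for x, y in points:
    print(f"midpoint {x:5.1f}  frequency {y}")
```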

Page 98: Descriptive statistics ii

Hallmarks of a Good Chart
• Simple & easy to read
• Placed correctly within text
• Use color only when it has a purpose, not solely for decoration
• Make sure others can understand the chart; try it out on somebody first
• Remember: A poor chart is worse than no chart at all.

Page 99: Descriptive statistics ii

Cumulative Frequency Plot

[Figure: cumulative frequency plot, "Weights of Daycare Children"; horizontal axis Weight Range, class boundaries 9.5 to 89.5; vertical axis Percent of Children, 0% to 120%.]

•Plot a point for each interval, with the upper class boundary on the horizontal axis and the corresponding cumulative relative frequency on the vertical axis.

•Each point represents the cumulative relative frequency and the points are connected with straight lines.

•The left end is connected to the lower boundary of the first interval that has data.

Page 100: Descriptive statistics ii

Coefficient of Correlation

• Measure of linear association between 2 continuous variables.
• Setting:
  - Two measurements are made for each observation.
  - The sample consists of pairs of values and you want to determine the association between the variables.

Page 101: Descriptive statistics ii

Association Examples
• Example 1: Association between a mother's weight and the birth weight of her child
  - 2 measurements: mother's weight and baby's weight
  - Both continuous measures
• Example 2: Association between a risk factor and a disease
  - 2 measurements: disease status and risk factor status
  - Both dichotomous measurements

Page 102: Descriptive statistics ii

Correlation Analysis

• When you have 2 continuous measurements you use correlation analysis to determine the relationship between the variables.
• Through correlation analysis you can calculate a number that relates to the strength of the linear association.

Page 103: Descriptive statistics ii

Types of Relationships

• There are 2 types of relationships:
  - Deterministic relationship: the values of the 2 variables are related through an exact mathematical formula.
  - Statistical relationship: this is not a perfect relationship!

Page 104: Descriptive statistics ii

Scatter Plots and Association
• You can plot the 2 variables in a scatter plot (one of the types of charts in SPSS/Excel).
• The pattern of the "dots" in the plot indicates the statistical relationship between the variables (the strength and the direction).
  - Positive relationship: pattern goes from lower left to upper right.
  - Negative relationship: pattern goes from upper left to lower right.
  - The more the dots cluster around a straight line, the stronger the linear relationship.

Page 105: Descriptive statistics ii

Birth Weight Data

x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91

x: birth weight in ounces

y: increase in weight between 70th and 100th days of life, expressed as a percentage of birth weight

Page 106: Descriptive statistics ii

Pearson Correlation Coefficient

[Figure: scatter plot of the birth weight data; horizontal axis Birth Weight (in ounces), 70 to 140; vertical axis Increase in Birth Weight (%), 40 to 120.]

Page 107: Descriptive statistics ii

Calculations of Correlation Coefficient
• In Excel:
  - Go to the TOOLS menu and select DATA ANALYSIS.
  - Highlight CORRELATION and click "ok".
  - Enter the INPUT RANGE (2 columns of data that contain "x" and "y").
  - Choose the cells where you want the answer to be placed and click "ok".

Page 108: Descriptive statistics ii

Pearson Correlation Results

         x (oz)     y (%)
x (oz)   1
y (%)    -0.94629   1

Pearson Correlation Coefficient = -0.946

Interpretation:

- values near 1 indicate strong positive linear relationship

- values near –1 indicate strong negative linear relationship

- values near 0 indicate a weak linear association
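The coefficient reported above can be reproduced from the twelve (x, y) pairs on the Birth Weight Data slide. The sketch below is a plain implementation of the usual Pearson formula, r = Sxy / sqrt(Sxx · Syy), using only the standard library; the function name `pearson_r` is chosen for this illustration.

```python
import math

# Birth weight data from the slides.
x = [112, 111, 107, 119, 92, 80, 81, 84, 118, 106, 103, 94]   # birth weight (oz)
y = [63, 66, 72, 52, 75, 118, 120, 114, 42, 72, 90, 91]       # weight increase (%)

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

A value this close to -1 matches the scatter plot: heavier babies at birth tend to show a smaller percentage weight increase.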

Page 109: Descriptive statistics ii

CAUTION!

• Interpreting the correlation coefficient should be done cautiously!
• A result of 0 does not mean there is NO relationship. It means there is no linear association.
• There may be a perfect non-linear association.
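A small made-up data set makes the caution concrete. In the sketch below (hypothetical values, not from the lecture), y is determined exactly by x through y = x², yet the Pearson coefficient is 0 because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect, but non-linear, relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```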

Page 110: Descriptive statistics ii

The Uses of Frequency Distributions
• Becoming familiar with the dataset.
• Cleaning the data.
  - Outliers: values that lie outside the normal range of values for other cases.
• Inspecting the data for missing values.
• Testing assumptions for statistical tests.
  - An assumption is a condition that is presumed to be true and that, when ignored or violated, can lead to misleading or invalid results.
  - When the dependent variable (DV) is not normally distributed, researchers have to choose between three options:
    - Select a statistical test that does not assume a normal distribution.
    - Ignore the violation of the assumption.
    - Transform the variable to better approximate a normal distribution (consult the various data transformations).

Page 111: Descriptive statistics ii

The Uses of Frequency Distributions
• Obtaining information about sample characteristics.
• Directly answering research questions.

Page 112: Descriptive statistics ii

Outliers
• Are values that are extreme relative to the bulk of scores in the distribution.
• They appear to be inconsistent with the rest of the data.
• Advantages:
  - They may indicate characteristics of the population that would not be known in the normal course of analysis.
• Disadvantages:
  - They do not represent the population.
  - They run counter to the objectives of the analysis.
  - They can distort statistical tests.

Page 113: Descriptive statistics ii

Sources of Outliers
• An error in the recording of the data.
• A failure of data collection, such as not following sample criteria (e.g. inadvertently admitting a disoriented patient into a study), a subject not following instructions on a questionnaire, or equipment failure.
• An actual extreme value from an unusual subject.

Page 114: Descriptive statistics ii

Methods to Identify Outliers
• The traditional way of labeling outliers: any value more than 3 SD from the mean.
• Values that are more than 3 IQRs from the upper or lower edge of the box in a box plot are extreme outliers.
• Values between 1.5 and 3 IQRs from the upper or lower edge of the box are minor outliers.
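The three labeling rules can be applied side by side. The sketch below is one possible implementation on hypothetical scores, using the standard library. Note one assumption: quartiles can be computed by several conventions, and `statistics.quantiles` with its default method may give slightly different box edges than SPSS or a hand-drawn box plot.

```python
from statistics import mean, stdev, quantiles

def classify_outliers(values):
    """Label values by the 3 SD rule and by the box-plot (IQR) rules."""
    q1, _, q3 = quantiles(values, n=4)   # box edges; quartile convention is an assumption
    iqr = q3 - q1
    m, s = mean(values), stdev(values)
    result = {"beyond_3sd": [], "minor": [], "extreme": []}
    for v in values:
        if abs(v - m) > 3 * s:
            result["beyond_3sd"].append(v)
        outside = max(q1 - v, v - q3)    # distance beyond the nearer box edge
        if outside > 3 * iqr:
            result["extreme"].append(v)
        elif outside > 1.5 * iqr:
            result["minor"].append(v)
    return result

# Hypothetical scores with one moderately and one very unusual value.
scores = [10, 12, 12, 13, 14, 14, 15, 15, 16, 17, 18, 26, 60]
labels = classify_outliers(scores)
print(labels)
```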

Page 115: Descriptive statistics ii

Handling Outliers
• Analyze the data two ways:
  - With the outliers in the distribution.
  - With the outliers removed.
• If the results are similar, as they are likely to be if the sample size is large, then the outliers may be ignored.
• If the results are not similar, then a statistical analysis that is resistant to outliers can be used (e.g. median and IQR).
• If you want to use a mean with outliers, then the trimmed mean is an option. It is calculated with a certain percentage of the extreme values removed from both ends of the distribution (e.g. if n = 100, the 5% trimmed mean is the mean of the middle 90% of the observations).
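The trimmed mean is straightforward to compute by sorting and slicing. A minimal sketch on hypothetical data with n = 100, matching the slide's example: a 5% trim removes the 5 lowest and the 5 highest values and averages the middle 90. The function name and the data are illustrative only.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from EACH end."""
    data = sorted(values)
    k = int(len(data) * proportion)      # number of values cut from each tail
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

# Hypothetical sample of 100 scores: 1..98 plus two extreme values.
scores = list(range(1, 99)) + [500, 900]

plain = sum(scores) / len(scores)
trimmed = trimmed_mean(scores, 0.05)     # mean of the middle 90 observations
print(f"mean = {plain:.2f}, 5% trimmed mean = {trimmed:.2f}")
```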

Page 116: Descriptive statistics ii
Page 117: Descriptive statistics ii

Handling Outliers
• Another alternative is a Winsorized mean.
• The highest and lowest extremes are replaced by the next-to-highest value and by the next-to-lowest value.
• For univariate outliers, Tabachnick and Fidell (2001) suggest changing the scores on the variables for the outlying cases so that they are less deviant. E.g. if the two largest scores in the distribution are 125 and 122 and the next largest score is 87, recode 122 as 88 and 125 as 89.
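A Winsorized mean keeps the sample size intact, unlike trimming. The sketch below is a simple illustration on hypothetical scores: the `k` most extreme values at each end are replaced by the nearest remaining value before averaging. The function name and the parameter `k` are assumptions made for this example.

```python
def winsorized_mean(values, k=1):
    """Mean after replacing the k lowest and k highest values by their neighbours."""
    data = sorted(values)
    if 2 * k >= len(data):
        raise ValueError("k is too large for this sample")
    low, high = data[k], data[-k - 1]    # next-to-lowest and next-to-highest values
    clipped = [min(max(v, low), high) for v in data]
    return sum(clipped) / len(clipped)

# Hypothetical scores with one low and one high extreme.
scores = [2, 10, 11, 12, 13, 14, 15, 16, 17, 90]

plain = sum(scores) / len(scores)
winsorized = winsorized_mean(scores, k=1)  # 2 becomes 10, 90 becomes 17
print(f"mean = {plain:.1f}, Winsorized mean = {winsorized:.1f}")
```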

Page 118: Descriptive statistics ii

Outliers

Steps on SPSS
1. Analyze
2. Descriptive
3. Explore
4. Statistics … Plots
5. Outliers

Page 119: Descriptive statistics ii

Missing Data
• Any systematic event external to the respondent (such as data entry errors or data collection problems) or action on the part of the respondent (such as refusal to answer) that leads to missing data.
• It means that analyses are based on fewer study participants than were in the full study sample. This, in turn, means less statistical power, which can undermine statistical conclusion validity: the degree to which the statistical results are accurate.
• Missing data can also affect internal validity (the degree to which inferences about the causal effect of the independent variable on the dependent variable are warranted) and external validity (generalizability).

Page 120: Descriptive statistics ii

Strategies to Avoid Missing Data
• Persistent follow-up.
• Flexibility in scheduling appointments.
• Paying incentives.
• Using well-proven methods to track people who have moved.
• Performing a thorough review of completed data forms prior to excusing participants.

Page 121: Descriptive statistics ii

Factors to consider in designing a missing values strategy
• Extent of missing data.
• Pattern of missing data.
• Nature of missing data.
• Role of the variable.
• Level of measurement of the variable.

Page 122: Descriptive statistics ii

Extent of missing data
• Researchers usually handle the problem differently if there is only 1% missing data as opposed to, say, 25% missing.

Page 123: Descriptive statistics ii

Pattern of missing data
• It is more straightforward to deal with data that are missing in a haphazard, random fashion, as opposed to a systematic fashion that typically reflects a bias.
• Different patterns of missing data:
  - Missing completely at random (MCAR)
  - Missing at random (MAR)
  - Missing not at random (MNAR)

Page 124: Descriptive statistics ii

Missing Completely at Random (MCAR)
• It means that the probability that the observation is missing is completely unrelated to either the value of the missing case or the value of any other variables.
• Occurs when cases with missing values are just a random subsample of all cases in the sample.
• When data are MCAR, analyses remain unbiased, although power is reduced.
• E.g. one participant did not show up after the intervention due to an emergency. In this situation, the missing values are not related to the main variable or to the value of other characteristics, such as the person's age, sex or experimental group status.

Page 125: Descriptive statistics ii

Missing at Random (MAR)

Data are considered MAR if missingness is related to other variables, but not related to the value of the variable that has the missing values.

This pattern is perhaps the most prevalent pattern of missingness in clinical research.

E.g. men were less likely to keep their follow-up appointment. Thus, missingness is related to a person’s gender.

Page 126: Descriptive statistics ii

Missing not at Random (MNAR)
• A pattern in which the value of the variable that is missing is related to its missingness. This is often found for such variables as income, which some respondents are unwilling to report truthfully.

Page 127: Descriptive statistics ii

Nature of Missing Data

• Sometimes data are missing for only one item in a multi-item measure.
• Sometimes an entire variable is missing.
• In other situations, all data are missing for some study participants.

Page 128: Descriptive statistics ii

Role of the variable
• How one handles the missing data problem may depend on whether a variable is considered a primary outcome, a secondary outcome, an independent (predictor) variable, or a control variable (covariate).

Page 129: Descriptive statistics ii

Level of Measurement of the Variable
• Some strategies are best applied when the variable is measured on an interval or ratio scale, while others only make sense for nominal-level variables.

Page 130: Descriptive statistics ii

Techniques for Handling Missing Data
• Deletion techniques. Involve excluding subjects with missing data from statistical calculation.
• Imputation techniques. Involve calculating an estimate of each missing value and replacing, or imputing, each value by its respective estimate.
• Note: techniques for handling missing data often vary in the degree to which they affect the amount of dispersion around true scores, and the degree of bias in the final results. Therefore, the selection of a data handling technique should be carefully considered.

Page 131: Descriptive statistics ii

Deletion Techniques
• Deletion methods involve removal of cases or variables with missing data.
• Listwise deletion. Also called complete case analysis. It is simply the analysis of those cases for which there are no missing data. It eliminates an entire case when any of its items/variables has a missing data point, whether or not that data point is part of the analysis. It is the default in SPSS.
• Pairwise deletion. Also called available case analysis (or "unwise" deletion). Involves omitting cases from the analysis on a variable-by-variable basis. It eliminates a case only when that case has missing data for the variables or items under analysis.
• Note: deletion techniques are widely criticized because they assume that the data are MCAR (which is very difficult to ascertain), pose a risk for bias, and lead to reduction of sample size and power.
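The difference between the two deletion rules shows up in how many cases survive. The sketch below uses a tiny hypothetical dataset (variable names and values are invented, with `None` marking a missing value): listwise deletion keeps only fully complete cases, while pairwise deletion keeps every case that is complete on the variables used in a given analysis.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "weight": 70.0, "score": 12},
    {"age": 51, "weight": None, "score": 15},
    {"age": None, "weight": 82.5, "score": 9},
    {"age": 45, "weight": 64.0, "score": None},
    {"age": 29, "weight": 77.0, "score": 14},
]

def listwise(rows):
    """Keep only cases with no missing value on ANY variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, variables):
    """Keep cases that are complete on the variables in this analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

complete_cases = listwise(records)
age_score_cases = pairwise(records, ["age", "score"])

print(f"listwise: {len(complete_cases)} of {len(records)} cases")
print(f"pairwise (age, score): {len(age_score_cases)} of {len(records)} cases")
```

With pairwise deletion, different analyses on the same dataset can rest on different subsets of cases, which is one reason results may be hard to compare.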

Page 132: Descriptive statistics ii

Imputation Techniques
• Imputation is the process of estimating missing data based on valid values of other variables or cases in the sample.
• The goal of imputation is to use known relationships that can be identified in the valid values of the sample to help estimate the missing data.

Page 133: Descriptive statistics ii

Types of Imputation Techniques
• Using prior knowledge.
• Inserting mean values.
• Using regression.
• Expectation maximization (EM).
• Multiple imputation.

Page 134: Descriptive statistics ii

Prior Knowledge
• Involves replacing a missing value with a value based on an educated guess.
• It is a reasonable method if the researcher has a good working knowledge of the research domain, the sample is large, and the number of missing values is small.

Page 135: Descriptive statistics ii

Mean Replacement
• Median replacement is used instead for skewed distributions.
• Involves calculating mean values from the available data on that variable and using them to replace missing values before analysis.
• It is a conservative procedure because the distribution mean as a whole does not change and the researcher does not have to guess at missing values.

Page 136: Descriptive statistics ii

Mean Replacement
• Advantages:
  - Easily implemented and provides all cases with complete data.
  - A compromise procedure is to insert a group mean for the missing values.
• Disadvantages:
  - It invalidates the variance estimates derived from the standard variance formulas by understating the data's true variance.
  - It distorts the actual distribution of values.
  - It depresses the observed correlation that this variable will have with other variables because all missing data have a single constant value, thus reducing the variance.
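The variance problem is easy to see numerically. In this hypothetical example (invented values), three missing observations are replaced by the mean of the five observed ones: the mean is unchanged, but the sample variance shrinks because the imputed values add cases without adding any spread.

```python
from statistics import mean, variance

# Hypothetical variable: five observed values, three missing.
observed = [4.0, 6.0, 8.0, 10.0, 12.0]
n_missing = 3

fill = mean(observed)                     # value used for every missing case
imputed = observed + [fill] * n_missing   # dataset after mean replacement

print(f"mean:     {mean(observed):.2f} -> {mean(imputed):.2f}")
print(f"variance: {variance(observed):.2f} -> {variance(imputed):.2f}")
```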

Page 137: Descriptive statistics ii

Using Regression
• Involves using other variables in the dataset as independent variables to develop a regression equation for the variable with missing data serving as the dependent variable.
• Cases with complete data are used to generate the regression equation.
• The equation is then used to predict missing values for incomplete cases.
• More regressions are computed, using the predicted values from the previous regression to develop the next equation, until the predicted values from one step to the next are comparable.
• Predictions from the last regression are the ones used to replace missing values.
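The first pass of this procedure can be sketched with a single predictor. This is a deliberately reduced illustration on hypothetical data: it fits one least-squares line on the complete cases and fills the gaps once, and it does not implement the repeated re-fitting described above. The function name `fit_line` and the data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Hypothetical (predictor, outcome) pairs; None marks a missing outcome.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.1), (5.0, 9.9), (6.0, None)]

# Step 1: build the equation from the complete cases only.
complete = [(x, y) for x, y in rows if y is not None]
slope, intercept = fit_line(*zip(*complete))

# Step 2: predict the missing outcomes from the equation.
filled = [(x, y if y is not None else intercept + slope * x) for x, y in rows]

for x, y in filled:
    print(f"x = {x:.1f}  y = {y:.2f}")
```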

Page 138: Descriptive statistics ii

Using Regression
• Advantages:
  - It is more objective than the researcher's guess but not as blind as simply using the overall mean.
• Disadvantages:
  - It reinforces the relationships already in the data, resulting in less generalizability.
  - The variance of the distribution is reduced because the estimate is probably too close to the mean.
  - It assumes that the variable with missing data is correlated substantially with the other variables in the dataset.
  - The regression procedure is not constrained in the estimates it makes.

Page 139: Descriptive statistics ii

Expectation Maximization
• For randomly missing data.
• It is an iterative process that proceeds in two discrete steps:
  - In the expectation (E) step, the conditional expected value of the missing data is computed, given the observed values and current estimates of parameters such as correlations.
  - In the maximization (M) step, these expected values are substituted for the missing data and maximum likelihood estimation is then computed as though there were no missing data.

Page 140: Descriptive statistics ii

Multiple Imputation
• It produces several datasets and analyzes them separately.
• One set of parameters is then formed by averaging the resulting estimates and standard errors.
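The pooling step can be illustrated with made-up numbers. The slide describes it as averaging; the sketch below follows the combining rules commonly attributed to Rubin, in which the pooled estimate is the average of the per-dataset estimates and the pooled standard error also includes the variation between the imputed datasets. The function name and the example values are hypothetical.

```python
import math

def pool(estimates, std_errors):
    """Combine results from m imputed datasets (Rubin-style pooling)."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled estimate
    within = sum(se ** 2 for se in std_errors) / m               # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # spread across datasets
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical results from three imputed datasets.
estimates = [2.0, 2.4, 2.2]
std_errors = [0.5, 0.5, 0.5]

pooled_estimate, pooled_se = pool(estimates, std_errors)
print(f"pooled estimate = {pooled_estimate:.2f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than any single dataset's standard error here, which reflects the extra uncertainty due to the missing values.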

Page 141: Descriptive statistics ii

Multiple Imputation
• Advantages:
  - It makes no assumptions about whether data are randomly missing but incorporates random error because it requires random variation in the imputation process.
  - It permits use of complete-data methods for data analysis and also includes the data collector's knowledge.
  - It permits estimates of nonlinear models.
  - It simulates proper inference from data and increases efficiency of the estimates by minimizing standard errors.
  - It is the method of choice for databases that are made available for analyses outside the agency that collected the data.

Page 142: Descriptive statistics ii

Multiple Imputation
• Disadvantages:
  - It is conceptually intensive to carry out, requiring special software and model building.
  - It does not produce a unique answer because randomness is preserved in the MI process, making reproducibility of exact results problematic.
  - It requires large amounts of data storage space, often exceeding the space on personal computers' hard drives.

Page 143: Descriptive statistics ii