Biomedical Statistics
Advisor: Prof. 蔡章仁
Student: 15. 許傳智 (101581001)
Time: 2013.12.31 (09:00)

Real Statistics Using Excel

Descriptive Statistics

• Measures of Central Tendency
• Measures of Variability
• Symmetry, Skewness and Kurtosis
• Ranking Functions in Excel
• Descriptive Statistics Tools
• Frequency Tables
• Histograms
• Creating Box Plots
• Outliers and Robustness
• Dealing with Missing Data
• Assumptions for Statistical Tests
• Data Transformations

Ranking Functions in Excel

Figure 1 summarizes the various ranking functions in Excel for a data set R. We describe each of these functions in more detail in the rest of the section. In the examples which follow we describe the values of these functions for a range R with data values {4, 0, -1, 7, 5}.

MIN and MAX

Description: MIN(R) = the smallest value in R and MAX(R) = the largest value in R.

Examples: MIN(R) = -1, MAX(R) = 7

SMALL and LARGE

Description: SMALL(R, n) = the nth smallest value in R and LARGE(R, n) = the nth largest value in R. Here n can take on any value from 1 to the number of elements in R, i.e. COUNT(R).

Examples:
LARGE(R, 1) = 7, LARGE(R, 2) = 5, LARGE(R, 5) = -1
SMALL(R, 1) = -1, SMALL(R, 2) = 0, SMALL(R, 5) = 7

RANK

Description: RANK(c, R, d) = the rank of data element c in R. If d = 0 (or is omitted) then the ranking is in decreasing order, i.e. a rank of 1 represents the largest data element in R. If d ≠ 0 then the ranking is in increasing order and so a rank of 1 represents the smallest element in R.

Examples:
RANK(7, R) = RANK(7, R, 0) = 1
RANK(7, R, 1) = 5
RANK(0, R) = RANK(0, R, 0) = 4
RANK(0, R, 1) = 2

Observations:
If LARGE(R, n) = c then RANK(c, R) = RANK(c, R, 0) = n
If SMALL(R, n) = c then RANK(c, R, 1) = n
For any element c of R and any value d, 1 ≤ RANK(c, R, d) ≤ COUNT(R)
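
The ranking functions described so far can be mimicked outside Excel. The following is a minimal Python sketch, not Excel's implementation; the helper names small, large and rank are hypothetical, each takes the data list as an argument, and ties are assumed to share the best rank.

```python
# Illustrative Python analogues of the SMALL, LARGE and RANK worksheet functions.
# The helper names are hypothetical; they are not part of Excel or of any library.

def small(data, n):
    """Return the n-th smallest value of data (n counts from 1)."""
    return sorted(data)[n - 1]

def large(data, n):
    """Return the n-th largest value of data (n counts from 1)."""
    return sorted(data, reverse=True)[n - 1]

def rank(c, data, d=0):
    """Rank of c in data: d == 0 counts down from the largest value,
    any other d counts up from the smallest value."""
    if c not in data:
        raise ValueError("c must be an element of data")
    if d == 0:
        return 1 + sum(1 for x in data if x > c)
    return 1 + sum(1 for x in data if x < c)

R = [4, 0, -1, 7, 5]
print(small(R, 2), large(R, 2))     # second smallest and second largest values
print(rank(7, R), rank(7, R, 1))    # rank of 7 from the top and from the bottom
print(rank(0, R), rank(0, R, 1))    # rank of 0 from the top and from the bottom
```

Counting how many elements are strictly larger (or smaller) than c keeps every rank between 1 and the number of elements.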

PERCENTILE

Description: For any percentage p (i.e. 0 ≤ p ≤ 1 or equivalently 0% ≤ p ≤ 100%), PERCENTILE(R, p) = the element at the pth percentile. This means that if PERCENTILE(R, p) = c then approximately 100p% of the data elements in R are less than c.

If p = k/(n–1) for some integer value k = 0, 1, 2, …, n–1 where n = COUNT(R), then PERCENTILE(R, p) = SMALL(R, k+1) = the (k+1)th smallest element in R. If p is not a multiple of 1/(n–1), then the PERCENTILE function makes a linear interpolation as described in the examples below.

Example: The 5 data elements in R divide the range into 4 intervals of size 25%, i.e. 1/(5-1) = .25. Thus,
PERCENTILE(R, 0) = -1 (the smallest element in R)
PERCENTILE(R, .25) = 0 (the second smallest element in R)
PERCENTILE(R, .5) = 4 (the third smallest element in R)
PERCENTILE(R, .75) = 5 (the fourth smallest element in R)
PERCENTILE(R, 1) = 7 (the fifth smallest element in R)

For other values of p we need to interpolate. For example,
PERCENTILE(R, .8) = 5 + (7 – 5) * (0.8 – 0.75) / 0.25 = 5.4
PERCENTILE(R, .303) = 0 + (4 – 0) * (0.303 – 0.25) / 0.25 = .848 ≈ .85

Of course, Excel’s PERCENTILE function calculates all these values automatically.
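
The interpolation rule can be written out in a few lines. This is a sketch in Python under the assumptions stated in the text (0 ≤ p ≤ 1 and n – 1 equal intervals); percentile_inc is a hypothetical helper name, not an Excel or library function.

```python
import math

def percentile_inc(data, p):
    """Percentile by linear interpolation between sorted values, for 0 <= p <= 1."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    xs = sorted(data)
    pos = p * (len(xs) - 1)        # fractional zero-based position in the sorted data
    k = math.floor(pos)            # index of the value just below that position
    frac = pos - k                 # how far to move towards the next value
    if k + 1 >= len(xs):           # p == 1 lands exactly on the largest value
        return xs[-1]
    return xs[k] + (xs[k + 1] - xs[k]) * frac

R = [4, 0, -1, 7, 5]
for p in (0, 0.25, 0.5, 0.75, 1, 0.8, 0.303):
    print(p, percentile_inc(R, p))
```

Results for values of p that are not multiples of 1/(n–1) are subject to ordinary floating-point rounding.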

PERCENTRANK

Description: PERCENTRANK(R, c) = the percentage of data elements below c. If PERCENTRANK(R, c) = p then PERCENTILE(R, p) = c.

Example:
PERCENTRANK(R, 5) = .75
PERCENTRANK(R, 5.4) = .8

You can also add a 3rd argument which represents the number of significant figures in the answer. Thus PERCENTRANK(R, .85, 5) = .30312
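
Since PERCENTRANK inverts PERCENTILE, it can be sketched as the reverse interpolation. The Python below is an illustration only: percent_rank is a hypothetical helper, it assumes at least two data values and a c inside the data range, and it does not round to a number of significant figures the way the optional 3rd argument does.

```python
def percent_rank(data, c):
    """Fraction p such that the interpolated percentile at p equals c."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data values")
    if not xs[0] <= c <= xs[-1]:
        raise ValueError("c must lie within the range of the data")
    for k in range(n - 1):
        lo, hi = xs[k], xs[k + 1]
        if lo <= c <= hi:
            # Position of c inside the k-th interval, as a fraction of that interval.
            frac = 0.0 if hi == lo else (c - lo) / (hi - lo)
            return (k + frac) / (n - 1)

R = [4, 0, -1, 7, 5]
print(percent_rank(R, 5))       # c is a data element
print(percent_rank(R, 0.85))    # c falls between 0 and 4, so interpolation is needed
```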

QUARTILE

Description: For any integer n = 0, 1, 2, 3 or 4, QUARTILE(R, n) = PERCENTILE(R, n/4). If c is not an integer, but 0 ≤ c ≤ 4, then QUARTILE(R, c) = QUARTILE(R, INT(c)).

Observation:
QUARTILE(R, 0) = PERCENTILE(R, 0) = MIN(R)
QUARTILE(R, 1) = PERCENTILE(R, .25)
QUARTILE(R, 2) = PERCENTILE(R, .5) = MEDIAN(R)
QUARTILE(R, 3) = PERCENTILE(R, .75)
QUARTILE(R, 4) = PERCENTILE(R, 1) = MAX(R)

Example:
QUARTILE(R, 0) = PERCENTILE(R, 0) = -1
QUARTILE(R, 1) = PERCENTILE(R, .25) = 0
QUARTILE(R, 2) = PERCENTILE(R, .5) = 4
QUARTILE(R, 3) = PERCENTILE(R, .75) = 5
QUARTILE(R, 4) = PERCENTILE(R, 1) = 7
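
For comparison outside Excel, the five quartile values can be reproduced with Python's standard library. This sketch assumes Python 3.8 or later, where statistics.quantiles accepts method="inclusive"; that method interpolates over n – 1 equal intervals, which is the same rule as the one described for PERCENTILE.

```python
import statistics

R = [4, 0, -1, 7, 5]

# method="inclusive" treats the smallest and largest values as the 0% and 100%
# points and interpolates linearly between neighbouring sorted values.
q1, q2, q3 = statistics.quantiles(R, n=4, method="inclusive")

# Minimum, three quartiles and maximum, in the order of QUARTILE(R, 0) ... QUARTILE(R, 4).
five_numbers = [min(R), q1, q2, q3, max(R)]
print(five_numbers)
```

The three interior values are returned as floats, so they compare equal to 0, 4 and 5 but print with a decimal point.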

Descriptive Statistics Tools

Excel provides a data analysis tool called Descriptive Statistics which produces a summary of the key statistics for a data set.

Example 1 – Provide a table of the most common descriptive statistics for the scores in column A of Figure 1.

Figure 1 – Output from Descriptive Statistics data analysis tool

The output from the tool is shown in the right side of Figure 1. To use the tool, select Data > Analysis|Data Analysis and choose the Descriptive Statistics option. A dialog box appears as in Figure 2.

Figure 2 – Descriptive Statistics dialog box

Now click on Input Range and highlight the scores in column A (i.e. cells A3:A14). If you include the heading, as is done here, check Labels in first row. Since we want the output to start in cell C3, click the Output Range radio button and insert C3 (or click on cell C3). Finally click the Summary statistics checkbox and press OK.

Note that if we had also checked the Kth Largest checkbox, the output would also contain the value for LARGE(A4:A14, k) where k is the number we insert in the box to the right of the label Kth Largest. Similarly, checking the Kth Smallest checkbox outputs SMALL(A4:A14, k). The option Confidence Interval for Mean generates a confidence interval using the t distribution as explained in One Sample t Test.

Real Statistics Data Analysis Tool: The Real Statistics Resource Pack provides a supplemental Descriptive Statistics and Normality data analysis tool which outputs the above statistics plus GEOMEAN, HARMEAN, MAD, AAD and IQR. Instead of just generating the numerical value of each statistic, as Excel’s Descriptive Statistics data analysis tool does, the Real Statistics tool outputs the appropriate Excel formula for computing each statistic (see Figure 4 below). Thus whenever the input data values change, the output values change automatically as well.

Both Excel’s Descriptive Statistics and the Real Statistics Descriptive Statistics and Normality data analysis tools allow you to report on multiple sets of data at the same time, as shown in the following example.

Example 2 – Use Excel’s Descriptive Statistics data analysis tool as well as the Real Statistics Descriptive Statistics and Normality data analysis tool to show the descriptive statistics for the two samples on the left side of Figure 3.

Figure 3 – Output from Excel’s Descriptive Statistics data analysis tool

The output from the Excel Descriptive Statistics data analysis tool is given on the right side of Figure 3. To use the Real Statistics data analysis tool, enter Ctrl-m and select the Descriptive Statistics and Normality option. A dialog box will now appear. Select the Descriptive Statistics option and the following output will be displayed:

Figure 4 – Real Statistics Descriptive Statistics data analysis tool

As described above, the tool actually generates formulas instead of the numerical values.

Frequency Tables

Often data is presented in the form of a frequency table. For example,

Figure 1 – Frequency Table

This means that the data element 2 occurs 4 times, the element 4 occurs 2 times and the elements 3 and 5 each occur 1 time. This is equivalent to a data set with elements 2, 2, 2, 2, 3, 4, 4, 5.

When data is provided in the form of a frequency table, the calculation of the mean and standard deviation cannot be performed directly using the usual AVERAGE and STDEV Excel functions. In fact, for sample data {x1, …, xm} with corresponding frequency counts f1, …, fm respectively and n = f1 + f2 + … + fm, the sample mean is:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{m} f_i x_i \]

This can be calculated in Excel as
=SUMPRODUCT(R1, R2) / SUM(R2)
where R1 is a range containing the data elements {x1, …, xm} and R2 is a range containing {f1, …, fm}.

Based on Property 1 of Measures of Variability, the sample variance can be calculated as

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{m} f_i (x_i - \bar{x})^2 = \frac{\sum_{i=1}^{m} f_i x_i^2 - n\bar{x}^2}{n-1} \]

This can be calculated in Excel as
=SUMPRODUCT((R1-R3)^2, R2)/(SUM(R2)-1)
where R1 and R2 are as above and R3 contains the sample mean (as described above).

Using these formulas we can calculate the mean and variance of sample data expressed in the form of a frequency table. We demonstrate this in the following example.
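
The same weighted calculation can be checked outside Excel. The Python below is a minimal sketch: freq_mean_var is a hypothetical helper name, and the values and counts are those of the frequency table described above (2 occurs 4 times, 3 once, 4 twice, 5 once).

```python
import statistics

def freq_mean_var(values, freqs):
    """Sample mean and sample variance of data given as (value, frequency) pairs."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    var = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / (n - 1)
    return mean, var

values = [2, 3, 4, 5]
freqs = [4, 1, 2, 1]
mean, var = freq_mean_var(values, freqs)

# Cross-check against the fully expanded data set.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
print(mean, statistics.mean(expanded))
print(var, statistics.variance(expanded))
```

Expanding the table and applying the ordinary mean and variance functions gives the same results, which is a convenient way to validate the weighted formulas.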

Example 1: Calculate the mean and variance of the sample data from the frequency table in Figure 1 of this section.

Figure 2 – Calculation of mean and standard deviation from frequency table

The required calculation is displayed in Figure 2. Here cell F11 contains the formula =D11/E11, which calculates the mean, and cell G14 contains the formula =(D14-E14*F14)/(E14-1), which calculates the variance. The results are the same as calculating the mean and variance by applying Excel’s AVERAGE and VAR functions to the data set {2, 2, 2, 2, 3, 4, 4, 5}.

Note too that a frequency table is closely linked to a frequency function, as defined in Definition 1 of Discrete Distributions. E.g., since there are 8 elements in the data set in Figure 2, we see that the frequency function for the random variable x is as in Figure 3, where each frequency value is divided by 8:

Figure 3 – Frequency function corresponding to frequency table

Often frequency tables are given for a range of data values, i.e. intervals for the x values. In this case the midpoint of each interval is assigned the value xi.

Example 2: Calculate the mean and variance for the data in the frequency table in Figure 4.

Figure 4 – Calculations for a frequency table with intervals

The first interval in Figure 4 is 0 < x ≤ 4, the second 4 < x ≤ 10, etc. The calculation of the mean and variance is as in Figure 2, except that now the midpoints are used as the x values.

Observation: Sometimes the first and/or last interval is unbounded: e.g. if the last interval in Figure 4 is replaced by “over 20”. In this case it isn’t possible to establish a midpoint, and so all you can do is make your best estimate of a suitable representative value for that interval.

Excel Function: When you have a lot of data, it is convenient to put the data in bins, usually of equal size, and then graph the number of data elements in each bin. Excel provides the FREQUENCY(R1, R2) array function for doing this, where R1 = the input array and R2 = the bin array.

To use the FREQUENCY array function, enter the data into the worksheet and then enter a bin array. The bin array defines the intervals that make up the bins. E.g., if the bin array = 10, 20, 30, then there are 4 bins, namely data with value x ≤ 10, data with value x where 10 < x ≤ 20, data with value x where 20 < x ≤ 30, and finally data with value x > 30. The FREQUENCY function simply returns an array consisting of the number of data elements in each of the bins.

Example 3: Create a frequency table for the 22 data elements in the range A4:B14 of Figure 5 based on the bin array D4:D7 (the text “over 80” in cell D8 is not part of the bin array).

Figure 5 – Example of the FREQUENCY function

To produce the output, highlight the range E4:E8 (i.e. a column range with one more cell than the number of bins) and enter the formula
=FREQUENCY(A4:B14,D4:D7)

Since this is an array formula, you must press Ctrl-Shift-Enter. Excel now inserts frequency values in the highlighted range E4:E8. Here E4 contains the number of data elements in the input range with value in the first bin (i.e. data elements whose value is ≤ 20). Similarly, E5 contains the number of data elements in the input range with value in the second bin (i.e. data elements whose value is > 20 and ≤ 40). The final output cell (E8) contains the number of data elements in the input range with value greater than the final bin value (i.e. > 80 for this example).
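
The binning rule (each bin value is an inclusive upper bound, plus one overflow bin) can be sketched in Python as follows. The helper name frequency and the sample data are hypothetical and chosen only to exercise the bin boundaries; they are not the values from Figure 5.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Count values per bin: x <= b1, b1 < x <= b2, ..., and x > last bin value."""
    edges = sorted(bins)
    counts = [0] * (len(edges) + 1)      # one extra slot for the overflow bin
    for x in data:
        # bisect_left finds the first edge that is >= x, so a value equal
        # to an edge is counted in the bin that the edge closes.
        counts[bisect_left(edges, x)] += 1
    return counts

# Made-up illustrative data; the bin array matches the 20, 40, 60, 80 layout.
sample = [5, 20, 21, 40, 41, 79, 80, 81, 100]
print(frequency(sample, [20, 40, 60, 80]))
```

The returned list has one more entry than the bin array, mirroring the extra output cell that must be highlighted in the worksheet.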

Observation: As described in Discrete Probability Distributions, the Real Statistics Resource Pack provides the FREQTABLE function. This function can also be used to create a frequency table with bins where the bins are equally spaced.

Real Statistics Function: The Real Statistics Resource Pack supplies the following supplemental array function to create a frequency table.

FREQTABLE(R1, bsize) = an array which contains the frequency table for the data in range R1, assuming equally sized bins of size bsize.

To use the function you must highlight an array with 3 columns and at least k rows where k = (MAX(R1) – MIN(R1)) / bsize + 1. You can highlight more rows than you need; any extra rows will take the value #N/A.

Example 4: Create a frequency table for the 22 data elements in the range A4:B14 of Figure 5 based on bins of size 15.

The desired frequency table can be produced using the array formula
=FREQTABLE(A4:B14,15)
as shown in range M4:O11 of Figure 6.

Figure 6 – FREQTABLE function with bin size 15

The headings are not output by the function but have been added manually. Note that two extra rows have been highlighted and so are filled with #N/A.

Observation: You can also use the Frequency Table data analysis tool for creating frequency tables. See Histograms for an example of how to use this tool. Also see Frequency Table Conversion for how to calculate the descriptive statistics (see Descriptive Statistics) for the data described by a frequency table.

Histograms

A histogram is a graphical representation of the output of the FREQUENCY function (as described in Frequency Tables).

Example 1: Create a histogram for the data and bin selection for Example 3 of Frequency Tables.

We start by replicating the data and bin selection for that example in Figure 1.

Figure 1 – Data for Example 1

You can use Excel’s chart tool to graph the data in Figure 1, or alternatively you can use the Histogram data analysis tool to accomplish this directly, as described next.

Excel Data Analysis Tool: To use Excel’s Histogram data analysis tool, you must first establish a bin array (as for the FREQUENCY function described in Frequency Tables) and then select the Histogram data analysis tool. In the dialog box that is displayed you next specify the input data (Input Range) and bin array (Bin Range). You can optionally include the labels for these ranges (in which case you check the Labels check box).

For Example 1, the Input Range is A4:B14 and the Bin Range is D4:D7 (with the Labels check box unchecked). The output is displayed in Figure 2.

Observation: Caution must be exercised when creating histograms to present the data in a clear and accurate way. For most purposes it is important that the intervals be equal in size (except for an unbounded first and/or last interval). Otherwise a distorted picture of the data may be presented.

To avoid this problem equally-spaced intervals can be used. This is the approach illustrated in Example 4 of Frequency Tables using the FREQTABLE supplemental function. The Frequency Table supplemental data analysis tool can be used as well.

Real Statistics Data Analysis Tool: The Frequency Table data analysis tool provided in the Real Statistics Resource Pack can be used to create a Frequency Table and histogram as illustrated in the following example.

Figure 2 – Histogram data analysis tool

Example 2: Create a frequency table and histogram for the 22 data elements in the range A4:B14 of Figure 1 based on bins of size 15.

Enter Ctrl-m and select the Frequency Table option. The dialog box shown in Figure 3 will appear.

Figure 3 – Dialog box for Frequency Table data analysis tool 

Figure 4 – Frequency Table and Histogram

Insert A4:B14 in the Input Range field, select Raw data as the Input format and insert 15 as the bin size. The output is shown in Figure 4.

Observation: You can also produce a frequency table (and histogram) of the type described in Example 3 of Discrete Probability Distributions (i.e. without specifying any bins) via the Frequency Table data analysis tool. In this case you would select Raw data as the Input format for the dialog box shown in Figure 3 and leave the Bin size field blank.

Creating Box Plots in Excel

Another way to characterize a distribution or sample is via a box plot. Specifically, a box plot provides a pictorial representation of the following statistics: maximum, 75th percentile, median (50th percentile), 25th percentile and minimum.

Box plots are especially useful when comparing samples and testing whether data is symmetric.

Real Statistics Data Analysis Tool: To generate a box plot, you can use the Box Plot option of the Descriptive Statistics and Normality supplemental data analysis tool found in the Real Statistics Resource Pack, as described in the following example. See also Special Charting Capabilities for how to create the box plot manually using Excel’s charting capabilities.

Figure 1 – Sample data

Example 1: A market research company asks 30 people to evaluate three brands of tablet computers using a questionnaire. The 30 people are divided at random into 3 groups of 10 people each, where the first group evaluates Brand A, the second evaluates Brand B and the third evaluates Brand C. The questionnaire scores from these groups are summarized in Figure 1.

To generate the box plots for these three groups, enter Ctrl-m and select the Descriptive Statistics and Normality supplemental data analysis tool. A dialog box will appear. Select the Box Plot option and insert A3:C13 in the Input Range and check Headings included with the data. The resulting plot is shown in Figure 2.

Figure 2 – Box Plot

Figure 3 – Box Plot elements

Note too that the data analysis tool also generates a table, which may in fact be located behind the chart. For those who are interested, this table contains the information in Figure 3, as explained in Special Charting Capabilities.

For each sample, the box plot consists of a rectangular box with one line extending upward and another extending downward (usually called whiskers). The box itself is divided into two parts. In particular, the meaning of each element in the box plot is described in Figure 3.

From the box plot (see Figure 2) we can see that the scores for Brand C tend to be higher than for the other brands and those for Brand B tend to be lower. We also see that the distribution of Brand A is fairly symmetric, at least in the range between the 1st and 3rd quartiles, although there is some asymmetry for higher values (or potentially there is an outlier). Brands B and C look less symmetric. Because of the long upper whisker (especially with respect to the box), Brand B may have an outlier (see Outliers and Robustness for a discussion of outliers).

We can also convert the box plot to a horizontal representation of the data (as in Figure 4) by clicking on the chart and selecting Insert > Charts|Bar > Stacked Bar.

Figure 4 – Horizontal Box Plot

See Special Charting Capabilities for more information about the Box Plot data analysis tool, especially regarding issues that arise when some of the data is negative.

Outliers and Robustness

One problem that we face in analyzing data is the presence of outliers, i.e. data elements that differ greatly from the rest of the data collected, especially values that are much bigger or much smaller.

For example, the mean of the sample {2, 3, 4, 5, 6} is 4, while the mean of {2, 3, 4, 5, 60} is 14.8. The appearance of the 60 completely distorts the mean in the second sample. Some statistics, such as the median, are more resistant to such outliers. In fact, the median for both samples is 4.

For this example it is obvious that 60 is a potential outlier. In Identifying Outliers and Missing Data we show how to identify potential outliers using a data analysis tool provided in the Real Statistics Resource Pack.

Excel Function: One approach for dealing with outliers is to throw away data that is either too big or too small. Excel provides the TRIMMEAN function for dealing with this issue.

TRIMMEAN(R, p) – calculates the mean of the data in the range R after first throwing away a proportion p of the data (i.e. 100p%), half from the top and half from the bottom. If R contains n data elements and k = the largest whole number ≤ np/2, then the k largest items and the k smallest items are removed before calculating the mean.

For example, suppose R = {5, 4, 3, 20, 1, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. Then TRIMMEAN(R, 0.2) works as follows. Since R has 15 elements, k = INT(15 * .2 / 2) = 1. Thus the largest element (20) and one copy of the smallest element (1) are removed from R to get R′ = {5, 4, 3, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2}. TRIMMEAN now returns the mean of this range, namely 4.385, instead of the mean of R, which is 5.2.
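
The trimming rule can be sketched in Python as below. This follows the description in the text rather than Excel's internal implementation; trimmean is a hypothetical helper name and p is assumed to satisfy 0 ≤ p < 1.

```python
import math

def trimmean(data, p):
    """Mean after dropping the k smallest and k largest values, k = floor(n * p / 2)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    xs = sorted(data)
    k = math.floor(len(xs) * p / 2)      # number of values removed from each end
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

R = [5, 4, 3, 20, 1, 4, 6, 4, 5, 6, 7, 1, 3, 7, 2]
print(sum(R) / len(R))        # ordinary mean of all 15 values
print(trimmean(R, 0.2))       # mean after removing one value from each end
```

Because k is rounded down, values of p that do not remove a whole pair of elements leave the data untrimmed.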

A related approach is to use Winsorized samples, in which the trimmed values are replaced by the remaining highest and lowest values. Consider the following sample:

4, 6, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 50, 80

A 10% trimmed sample would simply remove the two lowest and two highest elements (i.e. 4, 6, 50, 80). A 10% Winsorized sample replaces the two lowest elements by the third lowest and the two highest by the third highest, resulting in the following data set:

10, 10, 10, 14, 16, 19, 22, 23, 25, 27, 27, 31, 37, 38, 40, 44, 45, 48, 48, 48

Since 4 data elements have been replaced, the degrees of freedom of any test need to be reduced by 4.
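
Winsorizing amounts to clamping the data between two of its own values. A minimal Python sketch, assuming k values are replaced at each end (winsorize is a hypothetical helper name):

```python
def winsorize(data, k):
    """Replace the k lowest values by the (k+1)-th lowest and the k highest by the (k+1)-th highest."""
    xs = sorted(data)
    if k < 0 or 2 * k >= len(xs):
        raise ValueError("k must be non-negative and leave at least one value untouched")
    low, high = xs[k], xs[-k - 1]        # the values that replace the tails
    return [min(max(x, low), high) for x in xs]

sample = [4, 6, 10, 14, 16, 19, 22, 23, 25, 27,
          27, 31, 37, 38, 40, 44, 45, 48, 50, 80]
print(winsorize(sample, 2))   # 10% of the 20 values replaced at each end
```

Unlike trimming, the sample size stays the same; only the extreme values are pulled in.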

Dealing with Missing Data

Another problem faced when collecting data is that some data may be missing. For example, in conducting a survey with ten questions, perhaps some of the people who take the survey don’t answer all ten questions. In Identifying Outliers and Missing Data we show how to identify missing data using a supplemental data analysis tool provided in the Real Statistics Resource Pack.

A simple approach for dealing with missing data is to throw out all the data for any sample missing one or more data elements. One problem with this approach is that the sample size will be reduced. This is particularly relevant when the reduced sample size is too small to obtain significant results in the analysis. In this case additional sample data elements may need to be collected. This problem is bigger than might first be evident. E.g. if a questionnaire with 5 questions is randomly missing 10% of the data, then on average about 41% of the sample will have at least one question missing (since 1 – 0.9^5 ≈ 0.41).

Also it is often the case that the missing data is not randomly distributed. E.g., people filling out a long questionnaire may give up at some point and not answer any further questions, or they may be offended or embarrassed by a particular question and choose not to answer it. These are characteristics that might be quite relevant to the analysis.

In general there are the following types of remedies for missing data:
• Delete the samples with any missing data elements
• Impute the value of the missing data
• Remove a variable (e.g. a particular question in the case of a questionnaire or survey) which has a high incidence of missing data, especially if there are other variables (i.e. questions) which measure similar aspects of the characteristics being studied.

Deleting Missing Data

Of particular importance is the randomness of the missing data. E.g. suppose question 5 has a lot of missing data and question 7 has no missing data. If the frequency of the responses to question 7 changes significantly when samples which are missing responses to question 5 are dropped, then the missing data is not random, and so dropping samples can bias the results of the analysis. In this case either another remedy should be employed or the analysis should be run twice: once with samples with missing data retained (e.g. by adding a “no response” for missing data) and once with these samples dropped.

Missing data can be removed by using the following supplemental Excel functions.

Supplemental Excel Functions:

DELBLANK(R1, s) – fills the highlighted range with the data in range R1 (by columns) omitting any empty cells

DELNonNum(R1, s) – fills the highlighted range with the data in range R1 (by columns) omitting any non-numeric cells

DELROWBLANK(R1, b) – fills the highlighted range with the data in range R1 omitting any row which has one or more empty cells; if b is True then the first row of R1 (presumably containing column headings) is always copied (even if it contains an empty cell); the second argument is optional and defaults to b = False.

DELROWNonNum(R1, b) – fills the highlighted range with the data in range R1 omitting any row which has one or more non-numeric cells; if b is True then the first row of R1 (presumably containing column headings) is always copied (even if it contains a non-numeric cell); the second argument is optional and defaults to b = False.

The string s is used as a filler in case the output range has more cells than R1. This second argument is optional and defaults to the error value #N/A. See Data Conversion and Reformatting for an example of the use of these functions.

Also see Data Conversion and Reformatting for how to use the supplemental Reformat Data Range data analysis tool found in the Real Statistics Resource Pack to accomplish the same objectives.
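
The idea of dropping every row that has a missing cell can be illustrated in Python. This is only a sketch of the row-deletion behavior described above, not the Real Statistics implementation; drop_incomplete_rows is a hypothetical helper, missing cells are assumed to be None or an empty string, and the survey rows are made-up data.

```python
def drop_incomplete_rows(rows, keep_header=False):
    """Return the rows that have no missing cell; optionally always keep the first row."""
    kept = []
    for i, row in enumerate(rows):
        is_header = keep_header and i == 0
        complete = all(cell is not None and cell != "" for cell in row)
        if is_header or complete:
            kept.append(row)
    return kept

# Made-up survey answers: None marks a question that was not answered.
survey = [
    ["Q1", "Q2", ""],      # heading row with one empty label
    [3, 4, 5],
    [2, None, 4],          # incomplete, will be dropped
    [5, 5, 1],
]
print(drop_incomplete_rows(survey, keep_header=True))
```

With keep_header left at its default of False, the heading row above would be dropped too because it contains an empty cell.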

Imputing the values for missing data

Some techniques for imputing values for missing data include:
• Substituting the missing data with another observation which is considered similar, either taken from another sample or from a previous study.
• Using the mean of all the non-missing data elements for that variable. This might be acceptable in cases with a small number of missing data elements, but otherwise it can distort the distribution of the data (e.g. by reducing the variance) or lower the observed correlations (see Basic Concepts of Correlation).
• Using regression techniques. In this approach regression (as described in Regression and Multiple Regression) is used to predict the value of the missing data element based on the relationship between that variable and other variables. This approach reinforces existing relationships and so makes it more likely that the analysis will characterize the sample and not the general population.
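
The variance-reducing effect of mean imputation is easy to see on a small example. The Python below is a hedged illustration: impute_mean is a hypothetical helper, missing values are represented by None, and the scores are made up for the purpose.

```python
import statistics

def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Made-up scores with two missing entries.
scores = [2, 4, None, 6, None, 8]
observed = [v for v in scores if v is not None]
imputed = impute_mean(scores)

print(statistics.variance(observed))   # sample variance of the observed values only
print(statistics.variance(imputed))    # sample variance after mean imputation
```

Every imputed value sits exactly at the mean, so it adds nothing to the sum of squared deviations while increasing the sample size, which is why the variance shrinks.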

Assumptions for Statistical Tests

As we will see shortly, most of the statistical tests we perform are based on a set of assumptions. When these assumptions are violated the results of the analysis can be misleading or completely erroneous.

Typical assumptions are:
• Normality: Data have a normal distribution (or at least are symmetric)
• Homogeneity of variances: Data from multiple groups have the same variance
• Linearity: Data have a linear relationship
• Independence: Data are independent

We explore in detail what it means for data to be normally distributed in Normal Distribution, but in general it means that the graph of the data has the shape of a bell curve. Such data is symmetric around its mean and has kurtosis equal to zero. In Testing for Normality and Symmetry we provide tests to determine whether data meet this assumption.

Some tests (e.g. ANOVA) require that the groups of data being studied have the same variance. In Homogeneity of Variances we provide some tests to determine whether groups of data have the same variance.

Some tests (e.g. Regression) require that there be a linear correlation between the dependent and independent variables. Generally linearity can be tested graphically using scatter diagrams or via other techniques explored in Correlation, Regression and Multiple Regression.

We touch on the notion of independence in Definition 3 of Basic Probability Concepts. In general, data are independent when there is no correlation between them (see Correlation). Many tests require that data be randomly sampled with each data element selected independently of data previously selected. E.g. if we measure the monthly weight of 10 people over the course of 5 months, these 50 observations are not independent since repeated measurements from the same people are not independent. Also the IQs of 20 married couples don’t constitute 40 independent observations.

Almost all of the most commonly used statistical tests rely on adherence to some distribution function (such as the normal distribution). Such tests are called parametric tests. Sometimes when one of the key assumptions of such a test is violated, a non-parametric test can be used instead. Such tests don’t rely on a specific probability distribution function (see Non-parametric Tests).

Another approach for addressing problems with assumptions is by transforming the data (see Transformations).

Data Transformations

It can sometimes be useful to transform data to overcome the violation of an assumption required for the statistical analysis we want to make. Typical transformations take a random variable x and transform it into log x or 1/x or x^2 or √x, etc.

There is some controversy regarding the desirability of performing such transformations since often they cause more problems than they solve. Sometimes a transformation can be considered simply as another way of looking at the data. For example, sound volume is often given in decibels, which is essentially a log transformation; time to complete a task is often expressed as speed, which is essentially a reciprocal transformation; area of a circular plot of land can be expressed as the radius, which is essentially a square root transformation.

In any case, we will see some examples in the rest of the book where transformations are desirable. One thing that is very important is that transformations be applied uniformly. E.g. when comparing three groups of data, it would not be appropriate to apply a log transformation to one group but not to the other two.

Also transformations should only be used to achieve the assumptions of a test. You shouldn’t try lots of transformations in order to find one that achieves a specific test result.
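
Applying a transformation uniformly simply means running the same function over every group. A small Python sketch, with a hypothetical helper transform_all and made-up positive measurements for three groups:

```python
import math

# Made-up positive measurements for three groups (illustrative only).
groups = {
    "A": [1.0, 10.0, 100.0],
    "B": [2.0, 20.0, 200.0],
    "C": [4.0, 40.0, 400.0],
}

def transform_all(groups, func):
    """Apply the same transformation to every value in every group."""
    return {name: [func(x) for x in values] for name, values in groups.items()}

logged = transform_all(groups, math.log10)           # log transformation
reciprocal = transform_all(groups, lambda x: 1 / x)  # reciprocal transformation
rooted = transform_all(groups, math.sqrt)            # square root transformation

print(logged["A"])
print(reciprocal["B"])
print(rooted["C"])
```

Routing every group through one function makes it hard to transform one group and forget the others. Note that log and square root require positive (or non-negative) data, and the reciprocal requires non-zero data.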

--The end--