Descriptive Statistics

1. Intro to Research in Information Studies
Inferential Statistics: Standard Error of the Mean, Significance, Inferential tests you can use

2. Do you speak the language?

$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\left[\dfrac{\left(\sum X_A^2 - \frac{(\sum X_A)^2}{n_1}\right) + \left(\sum X_B^2 - \frac{(\sum X_B)^2}{n_2}\right)}{(n_1 - 1) + (n_2 - 1)}\right] \times \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

3. Don't panic!

The same formula as on slide 2, with two callouts:
o The numerator, $\bar{X}_A - \bar{X}_B$, is the difference between means
o Compare the denominator with the SD formula

4. Basic types of statistical treatment
o Descriptive statistics, which summarize the characteristics of a sample of data
o Inferential statistics, which attempt to say something about a population on the basis of a sample of data - infer to all on the basis of some

Statistical tests are inferential.

5. Two kinds of descriptive statistic:
o Measures of central tendency (or whereabouts on the measurement scale most of the data fall)
  mean
  median
  mode
o Measures of dispersion (variation) (or how spread out they are)
  range
  inter-quartile range
  variance/standard deviation

The different measures have different sensitivity and should be used at the appropriate times.

6. Symbol check
o $\Sigma$ (sigma): means "the sum of"
o $\sum_{i=1}^{n} x_i$ (sigma, 1 to n, x of i): means add all values of $x_i$ for i from 1 to n in a data set
o $x_i$ = the ith data point

7. Mean
Sum of all observations divided by the number of observations.
In notation:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Refer to handout on notation. See example on next slide.
Mean uses every item of data but is sensitive to extreme outliers.

8. To overcome problems with range etc.
we need a better measure of spread

Variance and standard deviation
o A deviation is a measure of how far a score in our data is from the mean
o Sample: 6, 4, 7, 5; mean = 5.5
o Each score can be expressed in terms of distance from 5.5
o 6, 4, 7, 5 => 0.5, -1.5, 1.5, -0.5 (these are distances from the mean)
o Since these are measures of distance, some are positive (greater than the mean) and some are negative (less than the mean)
o TIP: Sum of these distances ALWAYS = 0

9. Symbol check
o $\bar{x}$: called x-bar; refers to the mean
o $(x - \bar{x})$: called x minus x-bar; implies subtracting the mean from a data point x. Also known as a deviation from the mean

10. Two ways to get SD

$$sd = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

o Sum the squared deviations from the mean
o Divide by the number of observations
o Take the square root of the result

$$sd = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$

o Sum the squared raw scores
o Divide by N
o Subtract the squared mean
o Take the square root of the result

11. Example:

x:    2   2   2   2   2   3   3   4   4   5     $\sum x = 29$
x^2:  4   4   4   4   4   9   9  16  16  25     $\sum x^2 = 95$

$$s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{95}{10} - 2.9^2} = \sqrt{9.5 - 8.41} = \sqrt{1.09} = 1.044$$

What happens if we recalculate the variance with a 60 in the data instead of the 5?

12. If we include a large outlier:

x:    2   2   2   2   2   3   3   4   4   60      $\sum x = 84$
x^2:  4   4   4   4   4   9   9  16  16   3600    $\sum x^2 = 3670$

$$s = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2} = \sqrt{\frac{3670}{10} - 8.4^2} = \sqrt{367 - 70.56} = \sqrt{296.44} = 17.22$$

Like the mean, the standard deviation uses every piece of data and is therefore sensitive to extreme values. Note the increase in SD.

13. Two sets of data can have the same mean but different standard deviations.
The bigger the SD, the more s-p-r-e-a-d out are the data.

14. On the use of N or N-1
o When your observations are the complete set of people that could be measured (parameter):

$$sd = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

o When you are observing only a sample of potential users (statistic):

$$sd = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

The use of N-1 increases the size of sd slightly.

15.
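The two SD formulas and the N versus N-1 choice can be checked with a short Python sketch. It is only an illustration, not part of the original slides: the function names are made up here, and the ten data values are an assumption reconstructed from the sums quoted on the slides (sum of x = 29 and sum of x squared = 95, then 84 and 3670 once the 5 is replaced by 60).

```python
import math

def sd_deviation_method(data, sample=False):
    """SD from squared deviations; divides by n, or by n - 1 if sample=True."""
    n = len(data)
    mean = sum(data) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in data)
    divisor = n - 1 if sample else n
    return math.sqrt(sum_sq_dev / divisor)

def sd_raw_score_method(data):
    """SD from raw scores: mean of the squared scores minus the squared mean."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum(x * x for x in data) / n - mean ** 2)

# Assumed data, chosen to match the sums quoted on the slides.
scores = [2, 2, 2, 2, 2, 3, 3, 4, 4, 5]
with_outlier = [2, 2, 2, 2, 2, 3, 3, 4, 4, 60]

print(round(sd_deviation_method(scores), 3))        # 1.044
print(round(sd_raw_score_method(scores), 3))        # 1.044
print(round(sd_raw_score_method(with_outlier), 2))  # 17.22

# Dividing by n - 1 gives a slightly larger value than dividing by n.
print(sd_deviation_method(scores, sample=True) > sd_deviation_method(scores))  # True
```

Both methods agree on the same data, and the single outlier raises the SD from about 1.04 to about 17.22, matching the figures on the slides.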
Summary

Measures of Central Tendency
  Mode      Most frequent observation. Use with nominal data
  Median    Middle of data. Use with ordinal data or when data contain outliers
  Mean      Average. Use with interval and ratio data if no outliers

Measures of Dispersion
  Range                           Dependent on two extreme values
  Interquartile Range             More useful than range. Often used with median
  Variance / Standard Deviation   Same conditions as mean. With mean, provides excellent summary of data

16. Deviation units: Z scores
(Note, Andrew Dillon: Move this to later in the course, after distributions?)

Any data point can be expressed in terms of its distance from the mean in SD units:

$$z = \frac{x - \bar{x}}{sd}$$

A positive z score implies a value above the mean.
A negative z score implies a value below the mean.

17. Interpreting Z scores
o Mean = 70, SD = 6
o Then a score of 82 is 2 sd [(82-70)/6] above the mean, or 82 = Z score of 2
o Similarly, a score of 64 = a Z score of -1
o By using Z scores, we can standardize a set of scores to a scale that is more intuitive
o Many IQ tests and aptitude tests do this, setting a mean of 100 and typically an SD of 15

18. Comparing data with Z scores
You score 49 in class A but 58 in class B. How can you compare your performance in both?

            Class A   Class B
  Mean      45        55
  SD        4         6
  Score     49        58
  Z         1.0       0.5

19. With normal distributions
Mean, SD and Z tables in combination provide a powerful means of estimating what your data indicate.

20. Graphing data - the histogram
[Figure: histogram. Vertical axis "Number of errors", 0 to 100; horizontal axis categories 1 to 10.]
o Vertical axis: the frequency of occurrence for the measure of interest, e.g., errors, time, scores on a test etc.
o Horizontal axis: the categories of data we are studying, e.g., task or interface, or user group etc.
o Graph gives instant summary of data - check spread, similarity, outliers, etc.

21. Very large data sets tend to have a distinct shape:
[Figure: histogram, vertical axis 0 to 80.]

22.
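The Z-score arithmetic from the slides above, z = (x - mean) / sd, can be checked with a minimal Python sketch. The helper name `z_score` is hypothetical; the numbers are the ones quoted on the slides.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, expressed in standard deviation units."""
    return (x - mean) / sd

# Interpreting Z scores: mean = 70, SD = 6.
print(z_score(82, 70, 6))  # 2.0
print(z_score(64, 70, 6))  # -1.0

# Comparing a score of 49 in class A with a score of 58 in class B.
z_a = z_score(49, 45, 4)
z_b = z_score(58, 55, 6)
print(z_a, z_b)  # 1.0 0.5
```

On this reading, 49 in class A is the stronger result relative to the class: it sits one SD above that class's mean, while 58 in class B sits only half an SD above its mean.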
Normal distribution
o Bell shaped, symmetrical, measures of central tendency converge
o Mean, median, mode are equal in a normal distribution
o Mean lies at the peak of the curve
o Many events in nature follow this curve
o IQ test scores, height, tosses of a fair coin, user performance in tests

23. The Normal Curve
[Figure: normal curve, frequency f on the vertical axis; Mean, Median and Mode coincide at the peak.]
o NB: position of measures of central tendency
o 50% of scores fall below the mean

24. Positively skewed distribution
[Figure: positively skewed curve, frequency f on the vertical axis; left to right: Mode, Median, Mean.]
Note how the various measures of central tendency separate now - note the direction of the change: mode moves left of the other two, mean stays highest, indicating frequency of scores less than the mean.

25. Negatively skewed distribution
[Figure: negatively skewed curve, frequency f on the vertical axis; left to right: Mean, Median, Mode.]
Here the tendency to have higher values more common serves to increase the value of the mode.

26. Other distributions
o Bimodal
o Data shows 2 peaks separated by a trough
o Multimodal
o More than 2 peaks
o The shape of the underlying distribution determines your choice of inferential test

27. Bimodal
[Figure: bimodal curve, frequency f on the vertical axis; labels: Mode (at each peak), Mean, Median.]
Will occur in situations where there might be distinct groups being tested, e.g., novices and experts.
Note how each mode is itself part of a normal distribution (more later).

28. Standard deviations and the normal curve
[Figure: normal curve, frequency f on the vertical axis, marked in steps of 1 sd either side of the mean.]
o 68% of observations fall within 1 s.d. of the mean
o 95% of observations fall within 2 s.d. (approx)

29.
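The 68% and 95% figures quoted above for the normal curve can be reproduced without a printed table. The sketch below is an illustration under the assumption of an exactly normal distribution; `normal_cdf` and `share_between` are hypothetical helper names, built on the standard-library `math.erf`.

```python
import math

def normal_cdf(z):
    """Proportion of a standard normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def share_between(z_low, z_high):
    """Proportion of observations with a Z score between z_low and z_high."""
    return normal_cdf(z_high) - normal_cdf(z_low)

print(round(share_between(-1, 1), 4))  # 0.6827, within 1 sd of the mean
print(round(share_between(-2, 2), 4))  # 0.9545, within 2 sd of the mean
print(round(share_between(0, 1), 4))   # 0.3413, between the mean and +1 sd
print(round(1 - normal_cdf(1), 4))     # 0.1587, more than 1 sd above the mean
```

The rounded percentages used on slides (68%, 95%, 34%, 16%) are these values to the nearest whole percent.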
Z scores and tables
Knowing a Z score allows you to determine where under the normal distribution it occurs.
Z score between:
o 0 and 1 = 34% of observations
o 1 and -1 = 68% of observations, etc.
o Or 16% of scores are more than 1 Z score above the mean
Check out Z tables in any basic stats book.

30. Remember:
o A Z score reflects position in a normal distribution
o The Normal Distribution has been plotted out such that we know what proportion of the distribution occurs above or below any point

31. Importance of distribution
o Given the mean, the standard deviation, and some reasonable expectation of normal distribution, we can establish the confidence level of our findings
o With a distribution, we can go beyond descriptive statistics to inferential statistics (tests of significance)

32. So - for your research:
o Always summarize the data by graphing it - look for the general pattern of distribution
o Then, determine the mean, median, mode and standard deviation
o From these we know a LOT about what we have observed

33. Inference is built on Probability
o Inferential statistics rely on the laws of probability to determine the significance of the data we observe
o Statistical significance is NOT the same as practical significance
o In statistics, we generally consider significant those differences that occur less than 1 time in 20 by chance alone (p < 0.05)

34. Calculating probability
Presenter note: At this point I ask people to take out a coin and toss it 10 times, noting the exact sequence of outcomes, e.g., h,h,t,h,t,t,h,t,t,h. Then I have people compare outcomes.
o Probability refers to the likelihood of any given event occurring out of all possible events, e.g.:
  o Tossing a coin - outcome is either head or tail
  o Therefore probability of head is 1/2
  o Probability of two heads on two tosses is 1/4, since the other possible outcomes are two tails and the two possible sequences of one head and one tail (HT, TH)
o The probability of any event is expressed as a value between 0 (no chance) and 1 (certain)

35. Sampling distribution for 3 coin tosses
[Figure: bar chart of the number of possible sequences for each outcome.]
  0 heads: 1
  1 head: 3
  2 heads: 3
  3 heads: 1

36. Probability and normal curves
o Q? When is the probability of getting 10 heads in 10 coin tosses the same as getting 6 heads and 4 tails?
o HHHHHHHHHH
o HHTHTHHTHT
o Answer: when you specify the precise order of the 6 H/4T sequence:
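The counting behind these coin-toss slides can be sketched in Python. This is an illustration only, assuming a fair coin and independent tosses; the function names are hypothetical. It enumerates the 8 sequences of 3 tosses, then compares one exact sequence of 10 tosses with "6 heads in any order".

```python
from collections import Counter
from itertools import product
from math import comb

def p_exact_sequence(n_tosses):
    """Probability of one fully specified sequence of fair-coin tosses."""
    return 0.5 ** n_tosses

def p_heads_any_order(n_heads, n_tosses):
    """Probability of exactly n_heads heads, ignoring the order of tosses."""
    return comb(n_tosses, n_heads) / 2 ** n_tosses

# Sampling distribution for 3 tosses: count sequences by number of heads.
counts = Counter(seq.count("H") for seq in product("HT", repeat=3))
print(sorted(counts.items()))  # [(0, 1), (1, 3), (2, 3), (3, 1)]

# Any single specified sequence of 10 tosses is equally likely.
print(p_exact_sequence(10))       # 0.0009765625, i.e. 1/1024
# Six heads in any order covers 210 of the 1024 sequences.
print(p_heads_any_order(6, 10))   # 0.205078125
print(p_heads_any_order(10, 10))  # 0.0009765625
```

HHHHHHHHHH and HHTHTHHTHT are each one sequence out of 1024, so they are equally likely; "6 heads and 4 tails" is far more likely only when the order is left unspecified.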