

Page 1: What Types Of  Data  Are Collected?

Research Is A Partnership Of Questions And Data

What Types Of Data Are Collected? What Kinds Of Question Can Be Asked Of Those Data?

Questions That Require Us To Describe Single Features of the Participants

  “Categorical” Data
  • How many members of the class are women?
  • What proportion of the class is full-time?
  • … ?

  “Continuous” Data
  • How tall are class members, on average?
  • How many hours a week do class members report that they study?
  • … ?

Questions That Require Us To Examine Relationships Between Features of the Participants

  “Categorical” Data
  • Are men more likely to study part-time?
  • Are women more likely to enroll in CCE?
  • … ?

  “Continuous” Data
  • Do people who say they study for more hours also think they’ll finish their doctorate earlier?
  • Are computer literates less anxious about statistics?
  • … ?

© Willett, Harvard University Graduate School of Education, 04/21/23 S010Y/C08 – Slide 1

S010Y: Answering Questions with Quantitative Data Class 8/III.1: Displaying and Summarizing Continuous Data


Page 2: What Types Of  Data  Are Collected?

It is more difficult to summarize the sample distribution of a continuous variable, like MAT score, than it is to summarize the sample distribution of a categorical variable, because the sample distributions of continuous variables have so many interesting properties, including:

• The “center” or “location” of the batch.
• The “spread” of the batch.
• The “one-sidedness” of the batch.
• The “peakiness” of the batch.

We have distinguished two broad approaches for creating statistical summaries of these properties:

• Approach #1: Based on the ordering of data values: median, quartiles, percentiles, inter-quartile range, …
• Approach #2: Based on the arithmetic manipulation of data values: mean, standard deviation, skewness, kurtosis, …

Last time, I focused on generating such summaries using the ordering principle.

Today, I’ll focus on generating summaries using the arithmetic manipulation principle.

Page 3: What Types Of  Data  Are Collected?

Let’s use the arithmetic principle to develop a statistic for describing the center of the distribution of the values of a continuous variable like MAT score … for the “Early,” “Elsewhere” batch, for instance …

Stem-and-leaf display of the 19 MAT scores in the batch:

Stem  Leaf
   9
   8  233
   7  07889
   6  00016
   5  056
   4  7
   3  9
   2  1
   1

A good summary statistic for describing the center of a distribution of the values of a continuous variable is the place where the distribution would need to be supported so that it could “balance.”

Page 4: What Types Of  Data  Are Collected?

A good summary statistic for describing the center of the distribution of the values of a continuous variable, like MAT score, is the place where the distribution must be supported for it to balance. It is known as the sample mean, or average.

[Figure: stem-and-leaf display of the “Early, Elsewhere” batch]

\[
\text{Balance Point}
  = \frac{\text{Add up all the values}}{\text{Number of values}}
  = \frac{83 + \dots + 50 + 47 + 39 + 21}{19}
  = \frac{1205}{19}
  = 63.4
\]

Page 5: What Types Of  Data  Are Collected?

Let’s use the arithmetic principle to create a summary statistic for describing the spread of the distribution of values of a continuous variable … how about the “average distance from the center”?

[Figure: stem-and-leaf display of the “Early, Elsewhere” batch]

Why don’t we just find the average distance of all the “blocks” from the center?

\[
\begin{aligned}
\text{Average distance of the ``blocks'' from the center}
  &= \frac{\text{Add all the distances of the ``blocks'' from the center}}{\text{Number of blocks} - 1} \\
  &= \frac{(83 - 63.4) + (83 - 63.4) + \dots + (39 - 63.4) + (21 - 63.4)}{19 - 1} \\
  &= \frac{(19.6) + (19.6) + \dots + (-24.4) + (-42.4)}{18} \\
  &= \frac{0}{18} = 0
\end{aligned}
\]
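Why the distances always cancel can be seen numerically. The sketch below is an illustration only: it uses the 19 scores read off the stem-and-leaf display (an assumed decoding) and exact fractions, so that rounding the mean to 63.4 does not hide the result. All names are hypothetical.

```python
from fractions import Fraction

# Scores decoded from the slide's stem-and-leaf display (assumed decoding).
early_elsewhere = [83, 83, 82, 79, 78, 78, 77, 70, 66, 61,
                   60, 60, 60, 56, 55, 50, 47, 39, 21]

n = len(early_elsewhere)
mean = Fraction(sum(early_elsewhere), n)        # exactly 1205/19

# Signed distance of each "block" from the center.
deviations = [x - mean for x in early_elsewhere]

sum_of_deviations = sum(deviations)             # positives and negatives cancel
average_distance = sum_of_deviations / (n - 1)

print(sum_of_deviations, average_distance)
```

With the unrounded mean the sum is exactly zero. Using the rounded value 63.4 by hand leaves a small remainder (1205 − 19 × 63.4 = 0.4), which is rounding error rather than real spread.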

Page 6: What Types Of  Data  Are Collected?

When you sum, everything goes to zero, so what do we do now …?

[Figure: stem-and-leaf display of the “Early, Elsewhere” batch]

Let’s do what we’ve done before, and square all the distances before averaging?

\[
\begin{aligned}
\text{Average squared distance of the ``blocks'' from the center}
  &= \frac{\text{Add the squared distances of all blocks from the center}}{\text{Number of blocks} - 1} \\
  &= \frac{(83 - 63.4)^2 + (83 - 63.4)^2 + \dots + (39 - 63.4)^2 + (21 - 63.4)^2}{19 - 1} \\
  &= \frac{(19.6)^2 + (19.6)^2 + \dots + (-24.4)^2 + (-42.4)^2}{18} \\
  &= \frac{5026.64}{18} = 279.26
\end{aligned}
\]

Now I guess we should take the square root, to reverse the squaring that we did to begin with? Let’s call this the standard deviation.

\[
\sqrt{279.26} = 16.7
\]
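The same steps (square the distances, average them over n − 1, take the square root) can be sketched in Python. The scores are again the 19 values read off the stem-and-leaf display, which is an assumption about how the display decodes; the variable names are hypothetical, and the library call is included only as a cross-check.

```python
import math
import statistics

# Scores decoded from the slide's stem-and-leaf display (assumed decoding).
early_elsewhere = [83, 83, 82, 79, 78, 78, 77, 70, 66, 61,
                   60, 60, 60, 56, 55, 50, 47, 39, 21]

n = len(early_elsewhere)
mean = sum(early_elsewhere) / n

# Square each distance from the center so negatives no longer cancel positives.
squared_distances = [(x - mean) ** 2 for x in early_elsewhere]

variance = sum(squared_distances) / (n - 1)   # "average" squared distance
std_dev = math.sqrt(variance)                 # undo the squaring

# Cross-check against the standard library's sample variance (also divides by n - 1).
assert math.isclose(variance, statistics.variance(early_elsewhere))

print(f"variance = {variance:.2f}, standard deviation = {std_dev:.1f}")
```

To two decimals the variance agrees with the 279.26 on the slide. The slide's intermediate total of 5026.64 was computed with the rounded mean 63.4, so the unrounded sum of squares differs slightly in the second decimal.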

Page 7: What Types Of  Data  Are Collected?

And so, creating summary statistics based on the arithmetic principle, here’s the story so far …

[Figure: stem-and-leaf display of the “Early, Elsewhere” batch, marked at the mean (63.4) and at 1 standard deviation on either side of it (46.7 and 80.1)]
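The two markers on either side of the mean are just the mean minus and plus one standard deviation. A short sketch, assuming the same 19 decoded scores and using hypothetical names, reproduces them and counts how many scores fall between them:

```python
import statistics

# Scores decoded from the slide's stem-and-leaf display (assumed decoding).
early_elsewhere = [83, 83, 82, 79, 78, 78, 77, 70, 66, 61,
                   60, 60, 60, 56, 55, 50, 47, 39, 21]

mean = statistics.mean(early_elsewhere)
sd = statistics.stdev(early_elsewhere)    # sample standard deviation (n - 1)

lower = mean - sd    # one standard deviation below the mean
upper = mean + sd    # one standard deviation above the mean

# Scores lying within one standard deviation of the mean.
within_one_sd = [x for x in early_elsewhere if lower <= x <= upper]

print(f"{lower:.1f} to {upper:.1f} contains {len(within_one_sd)} of {len(early_elsewhere)} scores")
```

The count is not on the slide; it is added here only to show what the interval captures in this batch.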

Page 8: What Types Of  Data  Are Collected?

You don’t have to do all these computations by hand: SAS can do them for you. Here are the MAT data you worked with, supplemented by data from the 1987 cohort, all in the MAT.txt dataset.

Columns, from left to right:

• Entering cohort: 1 = 1987, 2 = 1989
• ID label
• When the test was received in the Admissions Office: 1 = Early, 2 = Late
• Raw MAT score
• Location of test site: 1 = Harvard, 2 = Elsewhere

1 01 1 64 2
1 02 1 54 2
1 03 1 93 2
1 04 1 82 2
1 05 1 75 2
1 06 1 72 2
1 07 1 59 2
1 08 1 76 2
1 09 1 38 2
1 10 1 73 2
1 11 1 88 2
1 12 1 50 1
1 13 1 96 1
1 14 1 66 1
1 15 1 93 1
1 16 1 63 1

(74 cases omitted)
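For readers who want to see the record layout outside SAS, here is a minimal Python sketch that parses lines in the same column order. It is an illustration under stated assumptions: only the 16 records shown on the slide are included (the other 74 are not reproduced), and the `Applicant` type, its field names, and `parse_records` are hypothetical, not part of the course materials.

```python
from typing import List, NamedTuple


class Applicant(NamedTuple):
    """One record, in the column order shown on the slide."""
    yeartest: int   # entering cohort: 1 = 1987, 2 = 1989
    case_id: int    # ID label
    whenrecd: int   # 1 = Early, 2 = Late
    matscor: int    # raw MAT score
    testsite: int   # 1 = Harvard, 2 = Elsewhere


# The 16 records visible on the slide; the remaining cases are omitted there too.
SAMPLE = """\
1 01 1 64 2
1 02 1 54 2
1 03 1 93 2
1 04 1 82 2
1 05 1 75 2
1 06 1 72 2
1 07 1 59 2
1 08 1 76 2
1 09 1 38 2
1 10 1 73 2
1 11 1 88 2
1 12 1 50 1
1 13 1 96 1
1 14 1 66 1
1 15 1 93 1
1 16 1 63 1
"""


def parse_records(text: str) -> List[Applicant]:
    """Split each non-blank line on whitespace and convert the five fields to int."""
    return [Applicant(*map(int, line.split()))
            for line in text.splitlines() if line.strip()]


applicants = parse_records(SAMPLE)
print(len(applicants), applicants[0])
```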

Page 9: What Types Of  Data  Are Collected?

Here’s a PC-SAS program to provide descriptive univariate statistics on these data … Handout C08_1

OPTIONS Nodate Pageno=1;
TITLE1 'S010Y: Answering Questions with Quantitative Data';
TITLE2 'Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I';
TITLE3 'MAT Scores from 2 Years of Doctoral Applicants';
TITLE4 'Data in MAT.txt';

*-----------------------------------------------------------------------------*
 Input data, name and label variables in dataset
*-----------------------------------------------------------------------------*;
DATA MAT;
  INFILE 'C:\DATA\S010Y\MAT.txt';
  INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
  LABEL ID       = 'Case identification number'
        YEARTEST = 'Year test taken'
        WHENRECD = 'When application received'
        MATSCOR  = 'Millers Analogies Test Score'
        TESTSITE = 'Test site';

*-----------------------------------------------------------------------------*
 Format labels for values of categorical variables
*-----------------------------------------------------------------------------*;
PROC FORMAT;
  VALUE YEARFMT 1='1987' 2='1989';
  VALUE WHENFMT 1='Early' 2='Late';
  VALUE SITEFMT 1='Harvard' 2='Elsewhere';

Notes on the program:

• DATA step: standard data input statements. Notice that there are several other variables in the dataset.
• PROC FORMAT: the usual process of formatting the categorical variables.

Page 10: What Types Of  Data  Are Collected?

And here’s the rest of the PC-SAS program … this part provides the requested univariate descriptive statistics …

*--------------------------------------------------------------------------*
 Data Listing
*--------------------------------------------------------------------------*;
PROC PRINT LABEL DATA=MAT;
  TITLE5 'Listing of MAT Scores & Background Variables for all Applicants';
  VAR ID YEARTEST WHENRECD TESTSITE MATSCOR;
  FORMAT YEARTEST YEARFMT. WHENRECD WHENFMT. TESTSITE SITEFMT.;

*--------------------------------------------------------------------------*
 Displaying and summarizing the MAT scores for the whole sample
*--------------------------------------------------------------------------*;
PROC UNIVARIATE PLOT DATA=MAT;
  TITLE5 'Univariate Descriptive Summaries of MAT Score for all Applicants';
  VAR MATSCOR;
  ID ID;
RUN;

Notes on the program:

• PROC PRINT: printing, titling and formatting a few cases for inspection.
• PROC UNIVARIATE provides all kinds of univariate (“single variable”) descriptive statistics for continuous variables.
• The PLOT option requests various data plots, including the stem-and-leaf plot.
• The VAR statement specifies the continuous variable to be summarized.
• The ID statement identifies a variable that contains respondent identifying information.

Page 11: What Types Of  Data  Are Collected?

Here’s a listing of a few cases from the dataset …

S010Y: Answering Questions with Quantitative Data
Class 8/Handout 1: Displaying and Summarizing Continuous Data, Part I
MAT Scores from 2 Years of Doctoral Applicants
Data in MAT.txt

Listing of MAT Scores and Background Variables for all Applicants

          Case         Year      When                     Millers
     identification    test   application                Analogies
Obs      number        taken    received    Test site    Test Score

  1         1          1987     Early       Elsewhere        64
  2         2          1987     Early       Elsewhere        54
  3         3          1987     Early       Elsewhere        93
  4         4          1987     Early       Elsewhere        82
  5         5          1987     Early       Elsewhere        75
  6         6          1987     Early       Elsewhere        72
  7         7          1987     Early       Elsewhere        59
  8         8          1987     Early       Elsewhere        76
  9         9          1987     Early       Elsewhere        38
 10        10          1987     Early       Elsewhere        73
  .
  .
 83        83          1989     Late        Elsewhere        55
 84        84          1989     Late        Harvard          72
 85        85          1989     Late        Elsewhere        32
 86        86          1989     Late        Elsewhere        53
 87        87          1989     Late        Elsewhere        76
 88        88          1989     Late        Elsewhere        62
 89        89          1989     Late        Elsewhere        78
 90        90          1989     Late        Elsewhere        54

Each row is a case, as usual.

[Photo: Harvard graduation, 1890. The six class day speakers, with W.E.B. Du Bois on the far right.]

Page 12: What Types Of  Data  Are Collected?

And the “ordering” and “arithmetic manipulation” summary statistics for MATSCOR are …

Variable: MATSCOR (Millers Analogies Test Score)

                          Moments
N                   90            Sum Weights         90
Mean                63.3888889    Sum Observations    5705
Std Deviation       18.6924815    Variance            349.408864
Skewness            -0.5406701    Kurtosis            -0.320241

               Basic Statistical Measures
     Location                    Variability
Mean      63.38889     Std Deviation           18.69248
Median    65.00000     Variance               349.40886
Mode      62.00000     Range                   78.00000
                       Interquartile Range     24.00000

        Quantiles
Quantile       Estimate
100% Max         96.0
99%              96.0
95%              90.0
90%              85.5
75% Q3           77.0
50% Median       65.0
25% Q1           53.0
10%              35.0
5%               27.0
1%               18.0
0% Min           18.0

Reading the output:

• The sample mean of MATSCOR is 63.39.
• The sample standard deviation of MATSCOR is 18.69.
• The median (or 50th percentile) of MATSCOR is 65.
• The inter-quartile range is the difference between the upper and lower quartiles: lower quartile = 53, upper quartile = 77, inter-quartile range = (77 − 53) = 24.
• The range is the difference between the minimum and the maximum: minimum = 18, maximum = 96, range = (96 − 18) = 78.

Page 13: What Types Of  Data  Are Collected?

Here’s SAS’s version of the stem-and-leaf plot for the values of MATSCOR …

Millers Analogies Test Score

Stem Leaf            #
   9 6               1
   9 00333           5
   8 5689            4
   8 222334          6
   7 556667788899   12
   7 0011122223344  13
   6 55669           5
   6 000122223444   12
   5 556899          6
   5 00333444        8
   4 57              2
   4 022             3
   3 55889           5
   3 124             3
   2 7               1
   2 114             3
   1 8               1
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1

This is scientific notation:

\[
\begin{aligned}
\texttt{10**1} &= 10^{1} = 10 \\
\texttt{10**2} &= 10^{2} = 100 \\
\texttt{10**3} &= 10^{3} = 1000 \\
\texttt{10**4} &= 10^{4} = 10000
\end{aligned}
\]

And don’t forget the inverses …

\[
\begin{aligned}
\texttt{10**-1} &= 10^{-1} = \frac{1}{10^{1}} = \frac{1}{10} = 0.1 \\
\texttt{10**-2} &= 10^{-2} = \frac{1}{10^{2}} = \frac{1}{100} = 0.01 \\
\texttt{10**-3} &= 10^{-3} = \frac{1}{10^{3}} = \frac{1}{1000} = 0.001
\end{aligned}
\]

So the bottom row of the plot (stem 1, leaf 8) is 1.8 × 10^1 = 18, etc.
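Because each leaf is one case, the stem-and-leaf plot can be decoded back into the 90 individual scores and the summary statistics recomputed. The sketch below is an illustration, not part of the handout: the `STEM_LEAF` rows are transcribed from the plot, the names are hypothetical, and Python's `statistics.quantiles` uses its own default quartile rule, which need not match the SAS definition for other datasets.

```python
import statistics

# (stem, leaves) rows transcribed from the SAS stem-and-leaf plot, top to bottom.
STEM_LEAF = [
    (9, "6"), (9, "00333"), (8, "5689"), (8, "222334"),
    (7, "556667788899"), (7, "0011122223344"),
    (6, "55669"), (6, "000122223444"),
    (5, "556899"), (5, "00333444"),
    (4, "57"), (4, "022"), (3, "55889"), (3, "124"),
    (2, "7"), (2, "114"), (1, "8"),
]

# "Multiply Stem.Leaf by 10**+1": stem 1, leaf 8 means 1.8 x 10 = 18.
# stem * 10 + leaf gives the same value using integers only.
scores = sorted(stem * 10 + int(leaf)
                for stem, leaves in STEM_LEAF
                for leaf in leaves)

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)                    # divides by n - 1
median = statistics.median(scores)
q1, _, q3 = statistics.quantiles(scores, n=4)    # Python's default quartile rule
value_range = max(scores) - min(scores)

print(f"n={n} mean={mean:.2f} sd={sd:.2f} median={median} "
      f"Q1={q1} Q3={q3} IQR={q3 - q1} range={value_range}")
```

For these data the decoded scores reproduce the count, sum, mean, standard deviation, median, quartiles and range reported by PROC UNIVARIATE.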

Page 14: What Types Of  Data  Are Collected?

We can bring several of these univariate descriptive statistics (both the “ordering” and “arithmetic manipulation” versions) together in a useful single summary figure called the “box and whisker” plot, or boxplot …

Recall that, for the full sample (n = 90) …

Minimum, Maximum, & Range:
• Min = 18
• Max = 96
• Range = 78

Quartiles, Median & Inter-Quartile Range:
• 25th %ile, Q1 = 53
• Median = 65
• 75th %ile, Q3 = 77
• Interquartile Range = 24

Mean:
• Mean = 63.4

[Figure: boxplot of MATSCOR, drawn against a vertical axis running from 10 to 100]
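One detail the summary values let us check is how far the whiskers should reach. The sketch below assumes the common convention that whiskers stop at the most extreme observations within 1.5 inter-quartile ranges of the box; the slide does not state which rule is used, so treat the rule, and the variable names, as assumptions for illustration.

```python
# Summary values quoted on the slide for the full sample (n = 90).
minimum, q1, median, q3, maximum = 18, 53, 65, 77, 96

iqr = q3 - q1                       # height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

lower_whisker_reaches_min = minimum >= lower_fence
upper_whisker_reaches_max = maximum <= upper_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("no low outliers:", lower_whisker_reaches_min)
print("no high outliers:", upper_whisker_reaches_max)
```

Under that assumed rule, both the minimum and the maximum fall inside the fences, so the whiskers would run all the way from 18 to 96 with no separately plotted points.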

Page 15: What Types Of  Data  Are Collected?

And here’s the PROC UNIVARIATE version of the box-plot from the previous handout …

The UNIVARIATE Procedure
Variable: MATSCOR (Millers Analogies Test Score)

Stem Leaf            #  Boxplot
   9 6               1     |
   9 00333           5     |
   8 5689            4     |
   8 222334          6     |
   7 556667788899   12  +-----+
   7 0011122223344  13  |     |
   6 55669           5  *-----*
   6 000122223444   12  |  +  |
   5 556899          6  |     |
   5 00333444        8  +-----+
   4 57              2     |
   4 022             3     |
   3 55889           5     |
   3 124             3     |
   2 7               1     |
   2 114             3     |
   1 8               1     |
     ----+----+----+----+
Multiply Stem.Leaf by 10**+1

Questions to consider:

• What would the box-plot look like if the sample distribution of MATSCOR were perfectly symmetrical?
• What would the box-plot look like if there was very little variability in MATSCOR in the sample?
• What features of the sample distribution of MATSCOR account for the fact that the sample mean is smaller than the sample median?
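The last question can be explored numerically. The following sketch decodes the scores from the stem-and-leaf rows (transcribed from the plot) and compares the mean with the median, then computes a skewness statistic. The skewness formula is one common bias-adjusted definition and is an assumption here; whether it matches the value in the SAS output depends on SAS using the same definition. All names are hypothetical.

```python
import statistics

# (stem, leaves) rows transcribed from the SAS stem-and-leaf plot.
STEM_LEAF = [
    (9, "6"), (9, "00333"), (8, "5689"), (8, "222334"),
    (7, "556667788899"), (7, "0011122223344"),
    (6, "55669"), (6, "000122223444"),
    (5, "556899"), (5, "00333444"),
    (4, "57"), (4, "022"), (3, "55889"), (3, "124"),
    (2, "7"), (2, "114"), (1, "8"),
]
scores = [stem * 10 + int(leaf) for stem, leaves in STEM_LEAF for leaf in leaves]

n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.stdev(scores)

# Assumed definition: n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations.
skewness = n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in scores)

# Distance from the median to each extreme: a longer lower tail pulls the mean down.
lower_tail = median - min(scores)
upper_tail = max(scores) - median

print(f"mean={mean:.2f} median={median} skewness={skewness:.2f}")
print(f"lower tail length={lower_tail}, upper tail length={upper_tail}")
```

A negative skewness together with a lower tail that is longer than the upper tail is consistent with a few low scores dragging the mean below the median.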

Page 16: What Types Of  Data  Are Collected?

An interesting aside on the normal distribution …

A considerable number of continuous variables that occur “naturally” turn out to be “normally distributed”: Height, Weight, Test Scores, Opinions, etc. …

If you were to plot a vertical histogram of the values of variables like these, you would get the familiar “bell-shaped curve” …

There is a special relationship between percentiles and standard deviation in a normal distribution.

[Figure: bell-shaped curve with positions marked at Mean − 2sd, Mean − 1sd, Mean, Mean + 1sd and Mean + 2sd]

Links on the slide: Ball-drop simulation; Normal distribution simulation.
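The “special relationship” can be made concrete with Python's standard library. This is a sketch for illustration, not something taken from the slides: it uses `statistics.NormalDist` for a standard normal distribution, and the helper `share_within` is a hypothetical name.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1sd = share_within(1)
within_2sd = share_within(2)

# Percentile rank of a value sitting exactly one standard deviation above the mean.
percentile_at_plus_1sd = standard_normal.cdf(1) * 100

print(f"within 1 sd: {within_1sd:.1%}")
print(f"within 2 sd: {within_2sd:.1%}")
print(f"mean + 1 sd is at about the {percentile_at_plus_1sd:.0f}th percentile")
```

In a normal distribution roughly two thirds of values lie within one standard deviation of the mean and about 95% within two, so standard deviations translate directly into percentiles. Real samples such as the MAT scores only follow this approximately.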

Page 17: What Types Of  Data  Are Collected?

The boxplot is very useful if you want to compare sample distributions of a continuous variable like MATSCOR across different groups, as in Activity #1 (see Handout C08_2) …

OPTIONS Nodate Pageno=1;

TITLE1 'S010Y: Answering Questions with Quantitative Data';
TITLE2 'Class 8/Handout 2: Displaying and Summarizing Continuous Data, Part II';
TITLE3 'Using Boxplots To Compare MAT Scores of Doctoral Applicants to APSP';
TITLE4 'Data in MAT.txt';

*-----------------------------------------------------------------------------*
 Input data, name and label variables in dataset
*-----------------------------------------------------------------------------*;
DATA MAT;
  INFILE 'C:\DATA\S010Y\MAT.txt';
  INPUT YEARTEST ID WHENRECD MATSCOR TESTSITE;
  IF YEARTEST = 2;   * Pick out 1989 Cohort for comparison with Activity #1;
  LABEL ID       = 'Case identification number'
        YEARTEST = 'Year test taken'
        WHENRECD = 'When application received'
        MATSCOR  = 'Millers Analogies Test Score'
        TESTSITE = 'Test site';

*-----------------------------------------------------------------------------*
 Format labels for the values of the categorical variables
*-----------------------------------------------------------------------------*;
PROC FORMAT;
  VALUE WHENFMT 1='Early' 2='Late';
  VALUE SITEFMT 1='Harvard' 2='Elsewhere';

Notes on the program:

• Let’s use categorical variables WHENRECD and TESTSITE to sub-divide the sample, so that we can compare sub-sample distributions of MATSCOR using boxplots … like original Activity #1.
• Here, I’ve picked out only applicants in the 1989 (YEARTEST = 2) cohort, so that the new analyses will match the analyses that you conducted in original Activity #1.

Page 18: What Types Of  Data  Are Collected?

And here’s the rest of the PC-SAS program …

*-----------------------------------------------------------------------------*
 Comparing Distributions of MAT scores across groups of testees
*-----------------------------------------------------------------------------*;
PROC SORT DATA=MAT;
  BY TESTSITE WHENRECD;

PROC UNIVARIATE PLOT DATA=MAT;
  TITLE5 'Sample Distributions of MAT Scores, by Test Site and Week Received';
  VAR MATSCOR;
  BY TESTSITE WHENRECD;
  FORMAT TESTSITE SITEFMT. WHENRECD WHENFMT.;

Notes on the program:

• To split the sample, first you need to sort it by the categorical variables of interest. Here, I have sorted first by TESTSITE and then by WHENRECD. So, the data will be ordered by “Early” and “Late” within an ordering by “Harvard” and “Elsewhere.” The new analyses should therefore have an ordering that matches the ordering in Activity #1.
• Here’s the usual use of PROC UNIVARIATE to generate “single variable” summary statistics for MATSCOR, with the PLOT option exercised.
• To obtain standard PROC UNIVARIATE analyses for the separate subgroups defined by TESTSITE and WHENRECD, use the BY statement (you’ve seen this used before in the categorical data-analysis part of the module). When the BY statement is used along with the PLOT option, an interesting “stacking” of the boxplots occurs (see later).
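The sort-then-split pattern has a close analogue in Python, where `itertools.groupby` also requires the data to be sorted by the grouping keys first. The sketch below only illustrates the mechanics: it uses the 16 cases visible on the earlier data slide (all from the 1987 cohort and all “Early”, so it does not reproduce the 1989 analysis), and every name in it is hypothetical.

```python
from itertools import groupby
from statistics import mean

# (TESTSITE, WHENRECD, MATSCOR) for the 16 cases shown on the data slide.
cases = [
    (2, 1, 64), (2, 1, 54), (2, 1, 93), (2, 1, 82), (2, 1, 75), (2, 1, 72),
    (2, 1, 59), (2, 1, 76), (2, 1, 38), (2, 1, 73), (2, 1, 88),
    (1, 1, 50), (1, 1, 96), (1, 1, 66), (1, 1, 93), (1, 1, 63),
]

SITE_LABELS = {1: "Harvard", 2: "Elsewhere"}   # mirrors SITEFMT
WHEN_LABELS = {1: "Early", 2: "Late"}          # mirrors WHENFMT


def group_key(case):
    """Grouping keys in the same order as BY TESTSITE WHENRECD."""
    return case[0], case[1]


# Sorting first plays the role of PROC SORT: groupby only merges adjacent rows.
sorted_cases = sorted(cases, key=group_key)

summaries = {}
for (site, when), rows in groupby(sorted_cases, key=group_key):
    group_scores = [row[2] for row in rows]
    label = (SITE_LABELS[site], WHEN_LABELS[when])
    summaries[label] = (len(group_scores), mean(group_scores))

for label, (count, average) in summaries.items():
    print(label, count, round(average, 2))
```

As in the SAS output, the groups come out ordered by test site first (Harvard before Elsewhere) and by time received within site.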

Page 19: What Types Of  Data  Are Collected?

Conclusions?

Mean scores of those who took the MAT test at Harvard are generally higher than the mean scores of applicants who took the test elsewhere.

• Why? Perhaps applicants who took the test at Harvard were already Master’s students here, and were therefore already a highly selected sample.
• The mean scores of those taking the test elsewhere were lower because the sample of folk taking the test was much more inclusive of all members of the general population?

The sample distribution of MAT scores is less spread out for those who took the test at Harvard:

• Perhaps this further indicates that Harvard test takers were a selected group, maybe the top tail of the general population.

The scores of applicants who took the test elsewhere are more spread out, in general, than those who took the test at Harvard:

• Interestingly, the sample distribution of the “early, elsewhere” group looks a little similar to that of those who took the test at Harvard, but the distribution has a long lower tail.
• Perhaps there is still some self-selection going on here, with more highly motivated (and therefore “self-selected”) folk tending to apply early.
• Perhaps the long lower tail is a few folk, like foreign students, who found the test difficult because it was in English?

Those who took the test elsewhere and applied late had a lower mean, a larger spread, and the distribution was very symmetric:

• Most like a sample drawn from the general population?
• Perhaps those who took the test elsewhere and submitted a late application were busy with work, like everyone else in the general population, and they just found it hard to get to the post office on time?