Beyond the Mean

Data analysis for School Leaders

Glen Gilchrist

Alexavier Fareheed

This edition published by LULU, February 2012

ISBN: 978‐1‐4716‐1146‐9

This work is licensed under a Creative Commons Attribution‐NonCommercial‐ShareAlike 3.0 Unported License (CC BY‐NC‐SA 3.0).

To view a copy of this license, visit http://creativecommons.org/licenses/by‐nc‐sa/3.0/

or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco,

California 94105, USA. Whilst the Creative Commons License for this book entitles you

to distribute / modify the work for non‐commercial use, without additional

permissions, we kindly request that you inform the authors of any intention to

re‐publish / remix this title. Send an email to [email protected]

Every effort has been made to contact perceived copyright holders for material

reproduced in this publication. Any omissions or oversights will be rectified in

subsequent editions if written notice is given to the author. All trademarks are the

property of their respective owners. The authors are not associated with any product

or vendor mentioned in this book except where stated. Unless otherwise stated, any

third‐party quotes, images and screenshots, or portions thereof, are included under

‘fair use’ for comment, news reporting, teaching, scholarship, and research.

Acknowledgements

The authors would like to thank Michelle Gilchrist for her help, support and tireless

proofreading skills, without which this book would not have seen the light of day.

Disclaimer

This is a book aimed at those readers wanting to explore data as used to drive decisions

in schools. It is not a comprehensive guide to statistics – no responsibility is assumed

or accepted for your decisions based on your data. Using the techniques detailed in

this text provides an aid to decision making whereas, the decision to act is left to the

discretion of the reader. No liability can be placed with the authors of this text.

By using the material contained within this guide, you acknowledge that you have read

and accept this disclaimer.

Preface to first edition

I have to admit a long standing and growing interest in the subject of statistics. As a

research scientist before finding my vocation as a teacher I used the tools of statistics

on a daily basis to inform my research and to plan future investigations. When I started

on my teaching career I was amazed at just how underdeveloped the use of “proper”

numbers was, both in the classroom and within the wider arena of educational policy

making. Far reaching decisions are made on the basis of poorly researched and

under‐analyzed data. Everyone’s tax investments and our endeavors as teachers and leaders are

constantly being misdirected by the improper analysis of data. This book is my

contribution to the cause of using data in an appropriate and considered manner.

Good luck dear reader.

Glen Gilchrist, February 2012

I’ve been head of faculty for 5 years now and in all that time, I don’t think that I’ve seen

anyone – literally anyone in the education sector use data in a robust manner. Sure,

I’ve seen pretty bar charts and tables used to justify interventions and to determine

policy. I’ve sat through too many INSET sessions discussing the consequences of poorly

analyzed data; in fact I’ve been asked to lead on data sessions as presented to

incoming PGCE, NQTs and new staff – I guess in short, I’ve become part of the problem.

I believe that you dear reader have an obligation to reflect upon the data that you

collect and the consequences of your analysis.

Alexavier Fareheed, February 2012

Corresponding with the authors

Data analysis can be a lonely pursuit. The authors are happy to receive questions,

queries and other correspondence – send an email to [email protected].

Contents

Introduction
It’s easy to see why data is mishandled and unsafe conclusions drawn.
Essential definitions
A word about software
Minitab
Final note

Chapter 1
DATA ANALYSIS THAT SCHOOLS “DO”
Why we use the mean average
Factors we can compare
Central tendency
The mean ‐ a point statistic
More sophisticated analysis
Complementing the mean – bar charts
Using the mean to compare “segments” of data
Using the language of statistics
The wider school picture
Call to action
Conclusions

Chapter 2
THE PROBLEMS WITH THE MEAN
Statistics in action
Call to action
Problems with the mean
Call to action
The dangers of presumption – pre analyzing the data
Call to action
What do your bar charts show?
Ethics, politics and “getting your own way”
Call to action
How big an effect / difference is “big enough” to matter?
Extra information in a “modified” bar chart
Call to Action
Looking at a whole cohort
Preconceptions again
Conclusions

Chapter 3
COMPARATIVE STATISTICS
What does significant mean?
T‐tests and p values
Calculating significance using Excel
Excel command for T‐testing
Call to action
Conclusions

Chapter 4
FACTORS WITH MULTIPLE LEVELS.
Multi level factors
Combine levels to make a binary solution
Calculating t‐test for “binned” data
Limits of the t‐test
Multi level factors
ANALYSIS OF VARIANCE
Does attendance affect attainment?
Fitting a trend line to Excel data
Using R2 to check for “goodness” of fit
One way Analysis of Variance (ANOVA)
Non numeric multi level factors
Call to action
Pause for breath ……..
Questions to reflect on
Conclusions

Chapter 5
GENERAL LINEAR MODEL (GLM)
Constructing a GLM
Deeper analysis
Extending the GLM
Building interactions into the GLM
Big implications of the GLM
Call to action
Conclusions

Chapter 6
MAIN EFFECTS
Main Effects Plot
Interactions Plot
Call to action
Conclusions

Chapter 7
FINAL REMARKS
Tools you’ll need:

Introduction

Every school leader, head of subject and classroom teacher will recognize the

following scenario:

It’s a school INSET, and what wonderful pedagogical expertise is going to be shared

with you, the willing staff? – Yes, you’ve guessed it: “Addressing the gender differential” –

the very name sends waves of déjà‐vu through the staff and the authors of this book

develop an instant migraine.

We’re not denying that there is a difference between the genders and their approach

to education; nor are we suggesting that as teachers and leaders that you don’t need

to monitor things to ensure that situations aren’t improving/deteriorating ‐ what brings

us to the point of tears, is that this statement is based on poorly and superficially

analyzed data.

As we will show in this book, it’s easy to assume that responses will be different for a

certain factor, and when you just look at the mean of a data set, this “difference” is often

seen – you’ve then proved your initial assumption and you don’t look for a more

fundamental root cause. In our experience, this is the case with the gender

differential, and I bet you’ve fallen into it too.

When we came into teaching, for the first time in our professional lives we became

aware of the situation of being “data rich but information poor”. Education abounds

with numbers, and schools, students & teachers have never been “measured” as much

as they are in 2010‐2011 [1].

But which numbers do you use and which demand that you take them seriously?

[1] Whilst this appears to be particularly true of the English / Welsh systems, all educational infrastructures constantly battle with league tables, “banding” and other lists.

It’s easy to see why data is mishandled and unsafe conclusions drawn.

Until very recently, use of correct descriptive statistics was the preserve of the

statistician, often resulting in the calculation of arcane numbers, utilizing impenetrable

mathematics. Indeed, pick up anything but the most basic of statistics text books and

the reader will soon be swimming in a sea of mathematical notation, far beyond the

readability of those without degrees in mathematics.

But with the change in responsibilities, the TLR structure, and the reduction in

extraneous funding, the expectation is that as a subject/school leader, you undertake

data analysis and draw conclusions.

I doubt you’re trained in statistics (and why should you be?) ‐ so instead of carrying out

statistically valid analysis you have returned to that most basic of measures – the

“average” – after all, it’s easy to calculate and means something, doesn’t it?

Throughout the text of this book, we will look at analysing the data a typical

department in a school might produce – initially by calculating “means” and developing

this into a more rigorous assessment of data.

So dear reader, this book is aimed at classroom practitioners, heads of department and

school leaders seeking a deeper understanding of what your data actually shows.

In a nutshell, we’re going to take you “beyond the mean”.

Glen Gilchrist & Alexavier Fareheed

2012

Essential definitions

We need to define three vital terms that will be used throughout this text:

Factor: A factor is a variable whose values are independent of changes in the values of

other variables. Traditionally factors are the groups into which we split our

data – gender, SEN, free school meals are examples of educational factors.

Level: Factors can be split into different values. Statistically, these values are called

levels.

Levels can be numerical, quantitative or qualitative, binary or multi level.

Binary Levels

Levels can be binary in nature “boy or girl”, “SEN or not” and can be

represented numerically “1=boy, 2=girl” or remain as text.

Multilevel Levels

Levels are not always binary, “originating primary school” for example could be

one of 10 or more levels, with each school either referred to by name or a

coded “number”: 1=School A, 2=School B, etc.

For continuous levels (age and attendance are good examples) levels

themselves might be grouped together to make analysis easier. These

groupings are often called “bins” and reference will be made to “bin size”.

Attendance for example could be binned as:

‐1 = less than 80%

0= 80% to 89.9%

1 = 90% and greater

The numerical value of the groups (‐1, 0, 1) is not important and the labels are

used to identify the grouped levels. Some consideration needs to be made into

the size / range of the groupings as this choice can affect subsequent

data analysis – however this is outside the scope of this text, and for

the analysis undertaken in schools, just ensure that the bins are

“sensible”.
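The attendance bins described above can be sketched in a few lines of Python. This is only an illustration: the function name `attendance_bin` and the sample attendance figures are hypothetical, and the bin edges are the ones quoted in the text.

```python
def attendance_bin(attendance_pct: float) -> int:
    """Return the bin label for an attendance percentage (0 to 100)."""
    if attendance_pct < 80:
        return -1   # less than 80%
    if attendance_pct < 90:
        return 0    # 80% to 89.9%
    return 1        # 90% and greater


# Hypothetical attendance figures, invented purely for illustration.
attendance = [72.5, 80.0, 89.9, 90.0, 97.3]
bins = [attendance_bin(a) for a in attendance]
print(bins)
```

The labels carry no numerical meaning; swapping them for "low", "medium" and "high" would leave any later analysis unchanged.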

Response: The response is the output that you are measuring. For school based

data, average or total points score and number of “C’s” are the typical

responses measured.

A word about software

MS Excel is referred to throughout this text and is used as convenient shorthand for

“spreadsheet”. We acknowledge that other spreadsheets such as OpenOffice and

GoogleDocs are available and can be used fairly interchangeably for MS Excel (except

where indicated). Each has its strengths / weaknesses, but all process statistical

information in much the same manner. There is no need to change your spreadsheet

package to complete the numerical analysis undertaken in the majority of this text.

Some of the more advanced statistics require the use of a dedicated statistics tool.

Recently the cost of these tools has fallen dramatically and academic licenses can be

obtained for less than £50. We cannot recommend strongly enough the value in

obtaining the correct tool to analyze your data.

A great list is maintained at Wikipedia, which compares different statistical tools, their

costs and licenses: http://en.wikipedia.org/wiki/Comparison_of_statistical_packages.

Minitab

Throughout this book the authors make use of Minitab as a conveniently easy tool to

get to grips with and available at an excellent price (from sub £20)

(http://www.minitab.com/en‐GB/academic/licensing‐options.aspx). The publisher also

makes available a free 30 day trial – more than enough time to learn the ropes and to

process data for your self evaluation.

Final note

The authors are practicing teachers, currently heads of subject in maintained

secondary schools and have no association with any of the tools / software / publishers

mentioned in this text.

“Data analysis is a journey that the only destination is enlightenment – get ready for

the ride of your life.” Glen & Alexavier – February 2012

Chapter 1

Data analysis that schools “do”

One of the biggest challenges in getting data used correctly in schools used to be the

actual collection and manual processing of the “numbers”. Now with tools such as MS

Excel, OpenOffice and GoogleDocs available to all, the challenge has shifted to the

actual processing and analysis that turns “numbers” into “data”.

Courses abound in educational circles about the “use” of data, but from personal

experiences they all focus on 3 areas:

1. Sources of baseline data (CATs, FFT, Government, Feeder Primaries)

2. Segmenting the data (gender, free school meals, SEN)

3. Monitoring, assessing and explaining student performance against (1) and (2)

Valuable as these courses are (and a significant improvement on not using data), they

all focus on basic statistics – the mean average, range and a cursory diversion into

drawing and formatting bar / line graphs; and whilst this is encouraged, reliance on

these measures alone can lead to poorly drawn and costly conclusions.

Why we use the mean average

Whilst Excel et al have democratized the collection and analysis of data, they have also

exposed the fact that most users of these tools are unaware how to use them at a high

enough level to process statistical information. As a result, most users are content with

tabulation, calculation of “averages” of data sets and with drawing basic, overly

coloured bar charts.

These “averages” are then used to draw conclusions, usually in the form of

comparisons; Boys vs Girls, free school meals vs non free school, English vs Maths,

2009 vs 2010, one school vs another.

Factors we can compare

The candidate list for comparison is long: special educational needs, ethnicity, “looked

after”, target group, literacy “booster” support or a hundred‐and‐one other

educational imperatives – a situation that I am certain occurs in your school. Indeed the

schools inspection framework [2] demands that schools use data to “identify, plan and

monitor” the attainment of “groups” of learners. Without extensive use of such data,

schools cannot hope to achieve a coveted “Grade 1” status.

We will expose in this chapter the dangers of using just the mean to represent a data

set, and show how drawing conclusions can lead to costly and unnecessary

interventions.

Central tendency

Used in this context, the mean is a “measure of central tendency” [3].

The two most widely used measures of "central tendency" of data are the mean

(average) and the median. For example, to calculate the mean weight of 50 people,

add the 50 weights together and divide by 50. To find the median weight of the 50

people, order the data and find the number that splits the data into two equal parts.

The median is generally a better measure of the centre when there are extreme values

or outliers because it is not affected by the precise numerical values of the outliers

themselves. (The median is often used to describe “average” earnings in a population as

it is not affected by a small number of very large (or small) salaries.)
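As a rough illustration of the point about outliers, the sketch below compares the mean and the median of a small set of salaries using Python's standard `statistics` module. The salary figures are invented for the example and are not taken from any real data set.

```python
from statistics import mean, median

# Hypothetical annual salaries (in pounds), invented for illustration:
# nine modest salaries and one very large one.
salaries = [18_000, 20_000, 21_000, 22_000, 23_000,
            24_000, 25_000, 26_000, 28_000, 250_000]

mean_salary = mean(salaries)      # pulled upwards by the single outlier
median_salary = median(salaries)  # middle of the ordered data

print("mean:", mean_salary)
print("median:", median_salary)
```

Here the mean works out at 45,700, a figure that nine of the ten people earn less than, whereas the median of 23,500 sits in the middle of the typical salaries.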

The mean ‐ a point statistic

The mean is a “point” statistic – that is, it reduces an entire data set to a single value,

useful to succinctly describe the data. As a result, the mean is the most widely used

measure of central tendency but, as we will see, not always the most useful: you lose

any sense of the spread and variability of the numbers.

[2] UK wide, but certainly heavily endorsed in England and Wales.

[3] There are three measures of central tendency used to describe data sets – mean, mode and median. If you are unfamiliar with these terms or just need a recap, remember – Google is your friend.

For example, the Average Points score for 5 schools in 2011 was:

School    Average Points Score
A         435
B         403
C         440
D         427
E         438

What conclusions can be drawn from this data?

School “C” is the best performing

School “B” is the lowest performing

Schools “A”, “C” and “E” all have similar points scores

School “B” needs to do “something” as its performance is very different to the

other schools.

It’s likely that such analysis is undertaken at this level in both your department and

whole school self evaluations.

The consequences of such analysis are likely to be some form of change, intervention

or closer monitoring. In short, money, time and effort will be expended acting on this

analysis of means. A situation that we are sure has happened in your school or

department.

More sophisticated analysis

Further and seemingly more sophisticated analysis will have you looking at the same

data over a period of 3 or 5 years:

School    2008‐2009    2009‐2010    2010‐2011
A         425          430          435
B         440          420          403
C         411          424          440
D         425          430          427
E         430          438          438

What does this show?

School “C” is the most improved over the 3 years

School “B” has fallen 37 points over 3 years

Schools “D” and “E” have shown little improvement over the three years
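The three-year comparisons above are simple subtractions, and can be checked with a short Python sketch. The figures are those in the table; the variable names are our own choice for illustration.

```python
# Average points scores from the table above, keyed by school.
scores = {
    "A": [425, 430, 435],
    "B": [440, 420, 403],
    "C": [411, 424, 440],
    "D": [425, 430, 427],
    "E": [430, 438, 438],
}

# Change from the first year (2008-2009) to the last (2010-2011).
change = {school: years[-1] - years[0] for school, years in scores.items()}

# List the schools from most improved to least improved.
for school, delta in sorted(change.items(), key=lambda item: item[1], reverse=True):
    print(f"School {school}: {delta:+d}")
```

This gives +29 for school C and -37 for school B, matching the statements above, with D (+2), E (+8) and A (+10) in between. Note that every one of these numbers is still a difference between two means.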

As part of your self evaluation / action plan – you will have undoubtedly looked at 3

year trends in mean data. You’re likely to have compared your results to that of other

departments, between local, national and family of schools and made pronouncements

on how well you are doing compared to last year.

To try and unravel some of the mystery about what your data is showing you, chances

are you’ll draw a bar chart of the means.

Complementing the mean – bar charts

Let’s complete the analysis and draw a bar chart of the data for the schools over three

years:

What does this chart show us?

It emphasizes the fall in performance of school “B”

The performance gains of school “C” look incredible

School “D” looks all but static over the past three years

Overall, what conclusions can be drawn about schools “A” to “E”?

School “A” is doing something that is improving performance

School “C” is clearly doing something “better” than the other schools and

better than school “A”

School “D” appears not to be doing anything and performance is static

School “E” looks like something happened during 2009‐2010, but these gains

have stopped and the school has not improved since.

School “B” looks like it’s in free fall and standards are falling rapidly

No doubt such analysis is regularly completed by you and/or your senior leadership

team. And if our personal experiences are reflected in your school, stress levels and

anxiety rise in proportion to the preparation and analysis of such data.

Using the mean to compare “segments” of data

As teachers, administrators or policy makers, we often need to compare the means of

two or more populations – essentially to test whether or not an intervention or

observation produces a measurable difference. For example, the average points score

for Year 11 students upon receiving their L2 qualifications is often segmented into data

for males and females.

Sex     Average Points Score
Boy     402
Girl    448

As a result of this basic analysis, decisions will be made and policy decided.

In this case, “clearly” there is a sex linked differential between Boys and Girls – with Girls outperforming Boys by some 10%. From this analysis of means an intervention will be planned – possibly grouping next year’s cohort into separate sex classes, planning boy friendly lessons and tweaking the seating plans.

Again, we’re sure that you’re familiar with such segmentation of data and are certain that your self evaluation contains statements about the gender differential and how you intend to tackle it.

Using the language of statistics

At this point, let’s start to use the language of statistics more fully.

In the case above for boys / girls L2 performance:

We have one factor, SEX, split into two levels (Boy and Girl) – we say we have a

binary factor.

Our response is the Average Points Score

From now on, we will use factor, level and response to describe our data.

The wider school picture

Such analysis is extended across the wider school, comparing the differentials in your

subject to those in English, French and DT [4] ‐ as a direct result of this analysis a working

party or even a PLC [5] will be created to tackle the clear differences between subject

areas.

(Whilst written here in a tongue‐in‐cheek manner, I suspect that your school has at

some point created a working party to contemplate differences in responses when

factors are analyzed for mean differences)

[4] Insert the high performing subject areas in your school.

[5] PLC – Professional Learning Community, school based collaborative action research – for more details see: http://www.centerforcsri.org/plc/program.html

What can we conclude from this chart?

French has the smallest sex differential

Science has the widest differential

In DT, boys outperform girls

The temptation in this case is to view the French differential (low) as in some way

“better” than the Science differential (high) and to invest time and resources in solving

the “problem”.

We’re not suggesting that this does not need to be solved; just that the data analysis

performed so far does not demand such investigations, merely hints at it.

Call to action

1. Do you know the three measures of central tendency and when to use each

one? Do you know how to get Excel to calculate each?

2. Find your self evaluation and identify where you have used the mean of a data

set to draw a conclusion about segmentation of data

3. Look at the charts and graphs you have created for your exam analysis

meeting. Are they based on means of data? What conclusions did you draw

from them?

4. Look at whole school, local and national data – how often is an entire data set

reduced to a point statistic?

5. How well can you use your spreadsheet tools?

a. Can you enter a formula to calculate the average of a data set?

b. What about counting the numbers in a column when the value in a

different column is a particular value? (CountIF() – used to

automatically count data, say based on a column containing the sex of

a learner)
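For readers who prefer to see the logic spelled out, the conditional count in item 5b can be sketched in Python. The class list below is hypothetical and invented for illustration; in a spreadsheet the same job is done by a conditional counting function such as COUNTIF.

```python
from statistics import mean

# Hypothetical class list, invented for illustration: (sex, score) per learner.
learners = [
    ("Boy", 61), ("Girl", 70), ("Boy", 55),
    ("Girl", 66), ("Girl", 74), ("Boy", 64),
]

# Conditional count: how many rows have "Girl" in the sex column?
girl_count = sum(1 for sex, _ in learners if sex == "Girl")

# Conditional average: mean score for the same rows.
girl_mean = mean(score for sex, score in learners if sex == "Girl")

print(girl_count, girl_mean)
```

The pattern is the same in any tool: pick the rows that match a level of the factor, then count or average the response for those rows.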

Conclusions

During this chapter we have shown the basic data analysis undertaken by schools. We imagine

that, as a subject team leader, you have laboured over such figures yourself,

painstakingly entering figures into MS Excel, creating comparison bar / pie charts and

drawing conclusions based on the mean average of data sets.

You’ve likely taken such figures into exam analysis meetings with your head teacher

and drawn conclusions about why students who obtain free school meals do “less well”

in your subject than, say, Spanish.

All of these things are steps on the road to understanding how to use data effectively

and the fact that you are reading this title demonstrates a clear desire to take your use

of data to a higher, more effective level.

In the coming chapters I’ll show you why data analysis based solely on the mean of a

population is dangerously superficial and can lead to misdirected effort and the

potential to miss a more fundamental underlying truth.

Chapter 2

The problems with the mean

Demonstrating that there are “issues” with using the mean of a data set is often the

most instructive way forward.

Consider the following data obtained for a group of year 10 Maths students.

Student L / R Hand Score

A R 80

B R 78

C R 82

D R 84

E R 76

F L 82

G R 81

H L 79

I L 79

J R 81

K L 84

L R 76

M R 81

N R 78

If we take the average of the left handed and the right handed students, we obtain;

Hand Average Score

Left 81

Right 79.7
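The two averages can be reproduced from the class table with a few lines of Python. This is a minimal sketch; the helper name `mean_by_hand` is our own and the rows are copied from the table above.

```python
from statistics import mean

# (student, hand, score) rows copied from the table above.
results = [
    ("A", "R", 80), ("B", "R", 78), ("C", "R", 82), ("D", "R", 84),
    ("E", "R", 76), ("F", "L", 82), ("G", "R", 81), ("H", "L", 79),
    ("I", "L", 79), ("J", "R", 81), ("K", "L", 84), ("L", "R", 76),
    ("M", "R", 81), ("N", "R", 78),
]


def mean_by_hand(rows, hand):
    """Mean score of the rows whose hand column equals `hand`."""
    return mean(score for _, h, score in rows if h == hand)


left_mean = mean_by_hand(results, "L")
right_mean = mean_by_hand(results, "R")
print(left_mean, right_mean)
```

The four left handed scores total 324 (mean 81) and the ten right handed scores total 797 (mean 79.7), a gap of 1.3 points.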

From this, we conclude that right handed students underperform compared to left

handed – we might even plan further monitoring, investigate the scheme of work to

look for bias and set up a far reaching working party.

Statistics in action

If you take any data set, made up from “real” data – and by real, I mean measured from

real people / events, not simulated on a computer, and segment that data into two –

you are likely to see a difference between one group and the other.

In this case, we looked at L and R hands, but the argument holds for any segmentation,

regardless of how ridiculous it sounds.
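One way to convince yourself of this is to split the same 14 scores at random, many times over, and look at the gap between the two means each time. The sketch below does this in Python; the seed, the number of repetitions and the function name are arbitrary choices made for the illustration.

```python
import random
from statistics import mean

# The 14 scores from the handedness table, with the hand labels discarded.
scores = [80, 78, 82, 84, 76, 82, 81, 79, 79, 81, 84, 76, 81, 78]


def random_split_difference(values, group_size, rng):
    """Shuffle the values, split off `group_size` of them, return the gap in means."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    group_a, group_b = shuffled[:group_size], shuffled[group_size:]
    return mean(group_a) - mean(group_b)


rng = random.Random(2012)  # fixed seed so the run is repeatable
differences = [random_split_difference(scores, 4, rng) for _ in range(1000)]

# How many of these meaningless splits still show a gap between the two means?
nonzero = sum(1 for d in differences if d != 0)
print(f"{nonzero} of {len(differences)} random splits show a difference")
print(f"largest gap: {max(abs(d) for d in differences):.2f}")
```

For this particular data set a split into 4 and 10 can never give two identical means, because the total of the scores (1121) does not divide evenly by 14. Every random grouping therefore "shows a difference", even though the groups mean nothing at all.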

Call to action

1. The next time you teach any class, survey them for one of the following:

o Xbox or Playstation

o Blackberry or iPhone

o Eastenders vs Coronation Street

o Family Guy vs American Dad

(The choices don’t need to be binary, but at this stage, it will help with the data

analysis)

2. Add this segmentation to the class register.

3. The next time you “test” your learners, split the data into the segments that you

have just defined and calculate the mean for each: (for example)

Console Average Score

Xbox 67

Playstation 83

Ask yourselves the following question – does this show anything meaningful?

Have we just uncovered the route to educational success – “buy everyone a

Playstation” or is there something else going on?

Whilst a contrived example, I am sure from your own experience that this

segmentation and superficial analysis has been undertaken – possibly with the gender

differentials cited in the previous chapter.

Problems with the mean

From the previous example, what exactly are the problems with using the mean?

Some observations stand out:

1. The difference between left and right handed is small – 1.3 –

a. The question we should ask is:

“Is this difference big enough to matter?”

2. There are only 4 left handed students – does this affect the conclusions?

“How much data do you need to draw realistic inferences?”

These issues aside, we are sure that you have drawn conclusions using similarly

analyzed data.

Call to action

Before you read on, use your favourite spreadsheet to draw a bar chart of a set of results

that can be split into two segments, using either your own data or the left / right

handedness data presented previously. For the purposes of this text, I’ll

assume that you’ve used my data.

The dangers of presumption – pre analyzing the data

The analysis of data by using just the mean is not the only concern for rigorous data

analysis.

When we presume there is a difference between two segments of data, we are

unsurprised when we find it, and are then more likely to accept that difference as

meaningful. After all boys and girls are different, so when your data shows this, it must

be true – right?

Call to action

What presumptions do you make in your data analysis?

1. Would you have expected left and right handed segmentation to produce

different means?

a. Can you think of a pseudo‐pedagogical reason why this might be true?

2. What about other splits of data?

a. Everyone knows that free school meals, linked to poverty, affects

attainment – right? Does your data show this difference?

When you analyze your data and find a difference, you are ready to accept it as real

and meaningful. The same is true with gender, SEN and a host of other factors that we

assess.

What do your bar charts show?

Let me show you plots of the mean data for handedness as a series of bar charts, all

showing the same data:

Firstly, let me assure you that these charts all show the same “numbers” for the left

and right hand segmentation of the data.

Chart “B” is the default MS Excel and OpenOffice formatting of the data as entered.

The only difference between each chart is the scale of the y‐axis.

Chart “A” shows 79.5≤ y ≤81.1, with each division being equal to 0.2

Chart “B” shows 79 ≤ y ≤81.5, with each division being equal to 0.5

Chart “C” shows 0 ≤ y ≤80, with each division being equal to 20

Chart “D” shows 0 ≤ y ≤100, with each division being equal to 20

Quite dramatically charts “A” and “B” emphasize the differences between L and R,

whilst charts “C” and “D” seem to imply the difference is almost nonexistent.
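The effect of the y‐axis scale can be put into numbers. The sketch below assumes each bar is drawn from the bottom of the y‐axis up to the mean, which is how a bar chart with a truncated axis normally appears, and uses the lower axis limits quoted above for charts A to D. The function name is our own.

```python
# Means from the handedness example.
left_mean, right_mean = 81.0, 79.7


def drawn_height_ratio(axis_min):
    """Ratio of the drawn bar heights (L to R) when the y-axis starts at `axis_min`."""
    return (left_mean - axis_min) / (right_mean - axis_min)


# Lower y-axis limits quoted for charts A to D in the text.
for chart, axis_min in [("A", 79.5), ("B", 79.0), ("C", 0.0), ("D", 0.0)]:
    ratio = drawn_height_ratio(axis_min)
    print(f"Chart {chart}: the L bar is drawn {ratio:.2f}x the height of the R bar")
```

With the axis starting at 79.5 the L bar is drawn roughly seven and a half times as tall as the R bar; with the axis starting at 0 it is about 2% taller. The underlying difference of 1.3 points is identical in every case.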

[Figure: four bar charts, labelled A to D, each plotting the mean score for left and right handed students.]

Ethics, politics and “getting your own way”

But which is the correct way to display the data?

At this point, those of you reading this who find the whole concept of data and analysis

abhorrent will likely be thinking “that’s why I hate doing all this stuff” – “see I was right,

it’s way beyond me”, and the most insightful “It’s bloody confusing!”

Oddly for a book aimed at using statistics we are going to tend to agree with the last

statement.

For the four charts shown, there is no “right” answer – heck, there’s not even a “best”

answer.

The surprising thing (and this is what causes the data averse to shiver with

indecision) is that it’s entirely up to you and you can choose the one that makes your

case the strongest.

Say, I had presumed that there was a L/R split in data and analyzed the results – I

would choose chart “A” or “B” to represent my results as they clearly demand taking

the L/R split seriously. Had I on the other hand assumed that there would be no

difference, I would choose “C” or “D” as it backs up my case. Both are strictly “correct”

but I have subtly manipulated the presentation of data to support my case.

The whole point is that our preconceptions will (often subconsciously) guide us

through the data visualization and analysis process. We will change what we do to

support our personal agenda – however careful we are.

Call to action

1. What did your bar chart look like? Which of my examples was it closest to?

2. As part of the self evaluation and action planning process you will have

certainly either constructed or interpreted charts showing results – often

segmented into different groups. Those groups will have likely shown a

difference. Have you presented data to SLT by using either the default or

custom scales – to “make your point clearer”?

What we’ve just demonstrated is that the apparent importance of differences can be

manipulated by just how you construct your charts.

3. How have you constructed charts for last year’s examination analysis meeting

with your SLT? Have you emphasized or played down an effect to influence a

decision or opinion?

However well meaning your intentions, I suspect that you will have exerted some

“influence” on the data – even if it was just by using the default settings in Excel –

which in this case seem to imply that there is a huge difference between L and R.

How big an effect / difference is “big enough” to matter?

To try and resolve some of these issues just raised, let’s go back to the data for “L” and

“R” and construct a type of “modified” bar chart [6], where we are combining the discrete

data of “L” and “R” on the x‐axis, with a continuous y‐axis showing the “score”:

You see two “columns” of data, one for L and one for R, each comprising a series of

“o” points corresponding to each value. Superimposed on the chart is a marker showing

the mean for L and the mean for R, with a line connecting the two means.

At this point, as was demonstrated, had the data been displayed as a bar chart, you

would have shown that “L” outperforms “R” and depending on the scale you used,

could have either emphasized or downplayed the results.

[Figure: Point Plot of Score vs Hand. x‐axis: Left / Right (levels L and R); y‐axis: Score, from 75 to 84.]

[6] This chart was produced in Minitab (a great statistical analysis software package) by tweaking the “box plot” chart, but a similar chart can be produced in MS Excel or plotted straight onto graph paper.

Extra information in a “modified” bar chart

What this chart clearly shows is the spread of data for each segment. You can see that the entire L data sits within the R data.

• The range of the R data is greater than that of the L data
• No value of L is higher than the highest value of R
• There are low values of R, lower than any of the L data

What can be concluded from this chart is that, whilst the means are different, with L being higher than R, the spread of the data and the low values of R have influenced the mean value.

What if those learners with the lowest right hand score just happened to be the SEN

learners in the class? Or, what if those lowest R scores correspond to learners who

have been long term sick, incomers to school, EAL learners?

Call to Action

1. Find a data set that you can segment into two (boy / girl splits work well and

are a constant political/educational debate). You need the actual score for a

class, broken down into learners / gender.

2. Plot the scores as a modified bar chart, one column for each segment (boy /

girl)

3. What does this show you for your data?


Looking at a whole cohort

The figure shows 2010 data for the average point score for a secondary school, split by sex. As before, the line joins the means.

The chart shows that girls have a higher average points score than boys (as the line slopes down, from left to right).

From analysis of the means, the following was presented to SLT for the annual exam analysis meeting:

Sex     Average Points Score
Boy     406
Girl    443

From the mean analysis, it appears that there is a real and big difference between the boy and girl average point scores.

[Figure: Average points score vs sex. x‐axis: Sex (M / F); y‐axis: Points, 0 to 900]

The modified bar chart starts to add more meaning:

• The spread (range) of the boy data is greater than that of the girl data
• The boy data has far more low scores than the girl data
• The girl data has the highest performing students

Preconceptions again

Again whilst sex is a convenient (and presumptuous) way of explaining difference – and

indeed the means substantiate a conclusion, might it just be that the lowest scoring

learners (who happen to be boys) also happen to be the EAL students? Might it be

equally true that the highest performing girls receive tuition outside of school?

We come back to the question:

How big an effect / difference is “big enough” to matter?

and we add:

How do we tell what the real cause of something is?


Conclusions

• We have demonstrated how the mean, as a point statistic, is a blunt instrument in data analysis and can lead to spurious conclusions.
• Our own preconceptions about “what’s likely” to make a difference (gender) will influence how we visualize and analyze data.
• We can (often unwittingly) influence / bias perception with the way that we represent data.
• The use of a modified bar chart can begin to shed more light on the data and allow us to draw safer conclusions.

In the next chapter we will begin to quantify differences to allow us to make firmer,

evidence based data analysis.


Chapter 3

Comparative statistics

Over the previous two chapters we’ve been talking about the mean of data being a

poor summary tool and incomplete when used to compare two segments of data.

We’ve shown how we can draw a chart to help illustrate the difference between means

and how, by tweaking the scales of bar charts, you can magnify or minimize apparent

differences. Ultimately, all of these techniques are qualitative and assessing whether

or not data sets are different has been a matter of choice.

Whilst this might be satisfactory when deciding what the most popular games console is, surely we can apply more forethought to decisions that are likely to have profound implications for the education of young people.

What we are looking for is a way to quantify how different sets of data are, and an

agreed upon set of standards for assessing whether or not a measured difference is

significant – hence, if the difference is significant it demands attention and solution.

What does significant mean?

It’s important at this point to clarify that a difference is statistically significant if the

observed difference is greater than can be accounted for by random error alone.

T‐tests and p values

For the professional statistician there are a number of measures that can be used to

assess the significance of measurements being different. If we intended to compare a

response to one factor only (say gender), we would use the t‐test, which returns a

probability that the difference between the data sets cannot be distinguished from

random occurrences or accounted for by other factors.

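That mouthful (presented for statistical correctness) can be reduced to:

The probability (%) that the data sets are not really different. This is often referred to as the p value, and is either a decimal in the range 0.000 to 1.000 or a percentage. The higher the p value, the less sure we are that the data sets are different.

Strictly speaking, the p value is the probability of seeing a difference at least as large as the one observed if the two groups were really the same. The looser reading given above is the working shorthand used throughout this book.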

For example:

• If p=0.000 or 0% we would have zero concern that the means were the same. Or, put the other way, we would be totally certain that the means are different. We would be (1‐p) or 100% confident that the means were different.
• If p=0.001 or 0.1%, we would be slightly concerned and not totally confident that the means were different. We would be (1‐p) or 99.9% confident that the means are different.
• If p=0.005 or 0.5%, we would be more concerned that the means were not different. We would be (1‐p) or 99.5% confident that the means are different.
• If p=0.10 or 10%, we would be quite concerned that the means were not different. We would be (1‐p) or 90% confident that the means are different.
• If p=0.50 or 50%, we would be totally unsure and (1‐p) = 50% would show that it was 50/50 that the means are different.

Consider the following question – if you wanted me to invest £1,000,000 in your idea to

cure cancer, and you had tested it against a placebo, what value of p would you accept

as sensible evidence for “proving” your cure worked?

Would you accept p=0.10 or only 90% sure that your cure worked?

Would you accept p=0.005 or p=0.001?

• Statisticians agree that a p value of 0.005 or less is needed for “proof” that a difference is real and hence defined as significant.
• P values in the range p=0.01 to p=0.006 show increasing evidence that a difference might be real and probably warrant further analysis.
• P values in the range p=0.05 to p=0.01 show a hint that there is a real difference. At p=0.05, we would be 95% sure there is a real difference, or there’s a 5% chance that the means are actually the same. This p=0.05 value corresponds to the limit of “significance”: a p‐value of p=0.05 or less indicates significance of a difference between two levels of a factor.
• P values greater than p=0.05 are rejected, as we are less than 95% sure the data sets are different.

This might sound draconian, but these levels of significance are used by drug

companies to “prove” a cure works, by the courts and police to convict those accused

of crimes and by all serious scientists trying to prove that A caused B or C worked

better than D – so if it works for them, it should work for us.
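The bands above can be written down as a simple rule. What follows is a minimal Python sketch rather than anything Excel provides: Python is not used elsewhere in this book, and the function name describe_p_value is our own invention. It assumes that anything between 0.005 and 0.01 falls into the “increasing evidence” band, because the text leaves a small gap between 0.005 and 0.006.

```python
def describe_p_value(p: float) -> str:
    """Map a p value onto the bands described in the text above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p value must lie between 0 and 1")
    if p <= 0.005:
        return "significant: treated as proof that the difference is real"
    if p <= 0.01:
        # assumption: the gap between 0.005 and 0.006 is folded into this band
        return "significant: increasing evidence, warrants further analysis"
    if p <= 0.05:
        return "significant: a hint of a real difference"
    return "not significant: rejected"


for p in (0.001, 0.008, 0.05, 0.10, 0.414):
    print(f"p = {p:<5}  (1 - p) = {1 - p:.1%}  ->  {describe_p_value(p)}")
```

Writing the rule down before looking at any results is a useful discipline: the threshold is fixed in advance and cannot drift to suit the answer we were hoping for.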


Calculating significance using Excel

You can use Excel to calculate the t‐test p values. The data, however, does need to be laid out in a particular manner. From our previous left / right handed example:

Student L / R Hand Score

A R 80

B R 78

C R 82

D R 84

E R 76

F L 82

G R 81

H L 79

I L 79

J R 81

K L 84

L R 76

M R 81

N R 78

For Excel to compute a t‐test, we need to have each response corresponding to a particular level of a factor in a different column. In this case, the data for left hand goes in a different column to the data for right hand, so some manipulation is needed:

In this screen, data for R has been placed in C2:C11, whilst data for L has been placed in D2:D5.
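The same re‐arrangement can be done outside Excel. As a rough sketch only (Python is our choice for illustration, and the variable names are made up), the table of students can be split into one list of scores per hand, which is exactly what the two Excel columns hold:

```python
# (student, hand, score) rows exactly as in the table above
rows = [
    ("A", "R", 80), ("B", "R", 78), ("C", "R", 82), ("D", "R", 84),
    ("E", "R", 76), ("F", "L", 82), ("G", "R", 81), ("H", "L", 79),
    ("I", "L", 79), ("J", "R", 81), ("K", "L", 84), ("L", "R", 76),
    ("M", "R", 81), ("N", "R", 78),
]

# One list of scores per level of the factor ("L" or "R")
columns = {}
for _student, hand, score in rows:
    columns.setdefault(hand, []).append(score)

print("R:", columns["R"])  # ten scores, the equivalent of C2:C11
print("L:", columns["L"])  # four scores, the equivalent of D2:D5
```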

Excel command for T‐testing

The formula for Excel to calculate the t‐test is TTEST(range 1, range 2, tails, type), which returns the p value as seen in D13 above. In Excel 2010 and later the same function is also available as T.TEST.

• Range 1 and range 2 correspond to the two data sets.
• Tails can be “1” or “2”. A one‐tailed test (“1”) only looks for a difference in one stated direction; a two‐tailed test (“2”) looks for a difference in either direction. As we have no reason to assume in advance which group will score higher, we will always pick “2”.
• Type can be “1”, “2” or “3”. “1” is for paired data (the same learners measured twice), “2” is for two unpaired groups assumed to have equal variance, and “3” is for two unpaired groups whose variances may differ. Given the data that we are analysing, we will always choose “3”.

The p value of 0.414 indicates a 41.4% chance that the means are actually the same.

Or as we discussed previously, a 1‐p or 58.6% chance that the means are different.

(Remember what we are talking about here – this almost represents a 50/50 case –

that the data is different OR not)

This is well above the value of statistical significance (p=0.05) and the p‐value

demands that we treat the means of these data sets as “not different”.
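For readers who would like to see what sits behind that single Excel cell, here is a hedged sketch of the unequal‐variance (“type 3”, two‐tailed) t‐test in plain Python. It is a stand‐in written for illustration, not Excel’s own code: the function names are ours, and the p value is obtained by a simple numerical integration of the t distribution rather than by a library routine. On the left / right data it should land close to the figure Excel reports.

```python
import math
from statistics import mean, variance

right = [80, 78, 82, 84, 76, 81, 81, 76, 81, 78]
left = [82, 79, 79, 84]


def t_density(x, df):
    """Probability density of Student's t distribution."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=20_000):
    """Two-tailed p value: Simpson's rule on the t density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_area)


def welch_t_test(a, b):
    """Unequal-variance two-sample t-test: returns (t, degrees of freedom, p)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_tailed_p(t, df)


t, df, p = welch_t_test(left, right)
print(f"t = {t:.3f}, df = {df:.2f}, p = {p:.3f}")
```

The important point is not the arithmetic but what feeds it: the difference between the two means is weighed against the spread and size of each group, which is exactly the information the modified bar chart displayed.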

Contrast the value of a numerical result with the previous charts we created:

Whilst we might have concluded that the means were the same or “not likely to be different”, clearly this was open to interpretation / bias and depended on how I chose to draw the charts.

Now we have a numerical value to assess just how different a difference actually is.

Call to action

1. Revisit the data you collected previously.

2. For the factors that you were considering, put one value of the response

corresponding to one level of a factor (boy) in one column and the other level

(girl) into another column.

3. Calculate the TTEST value, using the ranges for the data, “2” for the tails and

“3” for the type.

4. What is the p value?

5. Does this show a significant difference between the data sets or do you

conclude that they are the same?

6. Does this disagree with any analysis you previously undertook?

7. Next time you split a data set into two groups, calculate a t‐test to see if the

means really are different.


Conclusions

In this chapter we have introduced the concept of calculating a value that shows whether or not the difference between two means is caused by the factor being measured or could be down to random chance or some other, unmeasured factors.

We introduced the concept of the p‐value, which corresponds to a probability or percentage that the difference between means is real or just down to chance.

• P values less than p=0.001 show a 99.9% chance that the means really are different and the factor you are measuring is responsible.
• P values of p=0.05 are considered the critical value and correspond to a 95% chance that the factor you are measuring is responsible.
• P values greater than p=0.05 are rejected, as we are less than 95% certain that the factor being measured is responsible.

The t‐test can be calculated in Excel with the TTEST(range 1, range 2, tails, type) formula entered into a cell. Tails is normally “2” and type “3”.

In the next chapter we’ll look at a more useful test that allows you to look at factors at

more than two levels, such as previous primary school.


Chapter 4

Factors with multiple levels.

So far we’ve looked at assessing responses against factors that exist in two levels: splitting data sets by boy/girl, looking at left or right handed, free school meals or not. Processing a t‐test in Excel required the data to be laid out in a specific manner, but did result in a quantifiable measure of the difference between means.

Multi level factors

But what about factors that have multiple levels, such as previous primary school? Or factors that are continuous in nature, such as reading or spelling age? Simply put, the t‐test doesn’t work for factors with more than two levels.

Combine levels to make a binary solution

The first and possibly the simplest solution is to re‐code levels into a binary set – say by

grouping reading age into 10 ≤ x ≤ 12 and 12 < x ≤ 14 and then perform a t‐test.

It doesn’t matter what we call these levels ‐ “1” and “2” or “Low” and “High” are

traditionally used.

Once we have the factor levels, we lay out the data as we did before in Excel, with one

column for each factor level.

In the following example, we have coded reading age using this scheme:

8 ≤ x ≤ 12 = 1 and x > 12 = 2
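Written as a rule, the coding scheme is tiny. The sketch below uses Python purely for illustration; the function name and the example reading ages are hypothetical, and because the scheme does not say what to do with a reading age below 8, the sketch simply refuses to code one.

```python
def recode_reading_age(age: float) -> int:
    """Re-code a reading age (in years) into bin 1 or bin 2."""
    if age < 8:
        raise ValueError("the scheme above does not define a bin below 8")
    return 1 if age <= 12 else 2


# Hypothetical reading ages, used only to show the re-coding
for age in (8.0, 11.5, 12.0, 12.01, 14.02):
    print(age, "->", recode_reading_age(age))
```

Note how much rests on the boundary: a learner with a reading age of 12.0 and one with 12.01 end up in different bins.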


If we take the means of the bins, we conclude:

Bin Mean

"1" 449

"2" 492

Surely a 43 point difference between the average points score for the two different

reading age “bins” represents something that we must take seriously?


Let’s look at the data:

Looks encouraging, that difference of 43 surely looks impressive and stands out.

Remember what we said about scales? If we draw the same chart on axes starting at 0:

Now, the difference between the two groups looks less impressive than before –

maybe they’re not that different.


Calculating t‐test for “binned” data

As before, let’s reorganize the data and get Excel to calculate the t‐test.

The t‐test p value of 0.1987 indicates a 19.87% (say 20%) chance that the means are actually the same and there is no difference between the reading age bins. Put another way, there is a 1‐p or nearly 80% chance that the means are actually different, but we cannot conclude that the factor we are assessing is solely responsible for the difference.

Now 80% sounds positive, but remember we agreed that p=0.05 was the upper limit, above which we cannot be certain that the factor is causing the difference in the response.

Limits of the t‐test

I know that sounds like a bunch of statistical waffle, but the wording is important. The t‐test does not rule out reading age having an effect on points score, but the low significance of p=0.1987 points to some other factor either jointly being responsible or (as is likely) more significant in explaining the difference between the data.

In our case, it means we should keep analyzing the data to find a more fundamental difference.

As before, let’s plot a modified bar chart for the bins “1” and “2”, joining the means for each level. In this case, it proves a particularly useful chart as it clearly shows that the mean for level “2” of reading age is pulled upward by the three high points scores.

[Figure: Boxplot of Points Score vs Re-coded. x‐axis: Re-coded reading age bin (1, 2); y‐axis: Points Score, 300 to 650]

Multi level factors

We can use the same idea of binning‐up factor levels to ease analysis of other factors –

such as attendance data for example.

However, what if we don’t want to combine factors into just two levels? In the case of

attendance data, we might want:

‐1 = less than 80

0 = 80 to 89.99

1 = 90 to 94.99

2 = 95+

We can’t use the t‐test as it only works to discriminate between factors that are in two

levels. We need a different statistical tool – analysis of variance.

Analysis of variance

You’ve arrived at the point in the statistics journey where you are about to leave the

“core” functions of Excel behind. Whilst it’s true that you can get Excel to calculate

analysis of variance, it’s not an easy process, the preparation of the data can be

confusing and the results leave a lot to be desired.

At this point I strongly suggest that you get hold of a copy of Minitab7 or download the

excellent Daniels XL Toolbox8 – a free add‐in to Excel that will enhance its native

statistics capability.

However, even Daniels XL Toolbox will run out of steam in the next chapter, so maybe

it’s time to break the Excel apron strings ‐ ;‐)

7 Or alternative statistics package. See the preface to this book for how to obtain Minitab for a reasonable price.

8 http://xltoolbox.sourceforge.net/

Does attendance affect attainment?

Anyway, let’s push on and look at a continuous variable, attendance, and try to answer the question: “Does attendance affect attainment?” Received wisdom is, “surely yes, attendance affects attainment and the more you attend the higher the attainment”, but ask yourself whether you’ve actually tested this “wisdom”.

As we have two data sets that are continuous, we can get a feel for what’s going on by

plotting a traditional scatter graph of attendance (x) against points score (y)

Does that help? Is there a link between attendance and attainment?


Fitting a trend line to Excel data

Excel allows us to fit a line between the data points that “best” represents the data. How well that line fits is shown by the R² value: the closer it is to 1, the better the fit, with anything above 0.8 indicating a “good” fit to the data.

Create a scatter graph as normal. Once created, right click on a data point to bring up

the context menu:

Select “Add Trendline”.

From the next context menu, you can choose what kind of line to fit – in this case we

are looking for a straight line, so choose “linear”:

Leave most of the settings to the default, but at the bottom, before you click the CLOSE

button, put a check as indicated:


The full context menu for adding a trend line to an Excel chart:


From our data, the following linear trend line is fitted.

Using R² to check for “goodness” of fit

The R² value of 0.0093 indicates that the line does not represent the data well; anything below 0.80 is regarded as “poor”.

In fact, when R² = 0 the line fits the data no better than a horizontal line drawn through the mean “y” value.

The closer R² is to 1, the better we can use the line and its equation to predict values. If R² = 1, we could predict a points score from the attendance with complete accuracy.

Clearly this is not the case for our data.
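To see where the number comes from, the trend line and its R² can be worked out by hand. The sketch below is one way of doing so in Python, assuming the usual least‐squares straight line; the function name is ours, and it is offered as an illustration of the idea rather than a description of Excel’s internals. R² is simply one minus the ratio of the scatter left around the line to the scatter around the mean.

```python
from statistics import mean


def linear_fit_r_squared(xs, ys):
    """Least-squares line y = slope * x + intercept, plus its R squared."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # scatter left over around the line, and scatter around the mean of y
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return slope, intercept, 1 - ss_residual / ss_total


# Hypothetical points lying exactly on a line, so R squared comes out as 1
print(linear_fit_r_squared([1, 2, 3, 4], [3, 5, 7, 9]))

# First five rows of the attendance sample table later in this chapter
attendance = [90.35, 91.32, 100, 99.36, 76.85]
points = [479, 350, 440, 597, 314]
print(linear_fit_r_squared(attendance, points))
```

Five rows are far too few to say anything about the school; they are here only to show the calculation running on real figures.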


So does attendance matter?

Let’s bin up the attendance figures as previously agreed:

‐1 = less than 80

0 = 80 to 89.99

1 = 90 to 94.99

2 = 95+

Sample of the original data and “binned” or “coded” figures.

Attendance Coded Points Attendance Coded Points

90.35 1 479 98.07 2 548

91.32 1 350 100 2 440

100 2 440 81.35 0 413

99.36 2 597 76.53 ‐1 695

76.85 ‐1 314 95.82 2 752

98.07 2 698 89.71 0 502

100 2 440 93.25 1 834

88.42 0 614 78.14 ‐1 389

95.18 2 566 84.24 0 290

96.14 2 631 59.81 ‐1 269

100 2 440 85.85 0 425

100 2 284 95.18 2 292

96.14 2 469 75.56 ‐1 410

98.71 2 342 100 2 262

100 2 400 63.02 ‐1 538

89.97 0 426 96.78 2 612

94.21 1 626 100 2 80

94.21 1 552 87.14 0 158

88.75 0 467 92.93 1 494

92.93 1 519 89.71 0 509


Let’s calculate the means of each bin to assess if there is any variation between

attendance figures:

Binned Mean Points

‐1 435.8

0 422.7

1 550.6

2 460.7
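The binning and the bin means can be checked with a few lines of code. This is a sketch in Python (our choice for illustration; the function and variable names are made up) using the forty attendance / points pairs from the sample table above.

```python
from statistics import mean

# (attendance %, points) pairs from the sample table above
data = [
    (90.35, 479), (91.32, 350), (100, 440), (99.36, 597), (76.85, 314),
    (98.07, 698), (100, 440), (88.42, 614), (95.18, 566), (96.14, 631),
    (100, 440), (100, 284), (96.14, 469), (98.71, 342), (100, 400),
    (89.97, 426), (94.21, 626), (94.21, 552), (88.75, 467), (92.93, 519),
    (98.07, 548), (100, 440), (81.35, 413), (76.53, 695), (95.82, 752),
    (89.71, 502), (93.25, 834), (78.14, 389), (84.24, 290), (59.81, 269),
    (85.85, 425), (95.18, 292), (75.56, 410), (100, 262), (63.02, 538),
    (96.78, 612), (100, 80), (87.14, 158), (92.93, 494), (89.71, 509),
]


def bin_attendance(percent):
    """Code an attendance percentage as -1, 0, 1 or 2."""
    if percent < 80:
        return -1
    if percent < 90:
        return 0
    if percent < 95:
        return 1
    return 2


# Collect the points scores that fall into each bin
by_bin = {}
for percent, score in data:
    by_bin.setdefault(bin_attendance(percent), []).append(score)

bin_means = {code: mean(scores) for code, scores in by_bin.items()}
for code in sorted(bin_means):
    print(f"{code:>2}  n = {len(by_bin[code]):>2}  mean = {bin_means[code]:.1f}")
```

Printing the count alongside each mean is worthwhile: the lowest bin rests on only six learners, which is worth remembering before reading too much into its average.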

What the mean analysis shows is a difference of 25 points in going from the lowest (sub 80%) attendance bin to the highest (95%+) attendance bin. But is this a big enough effect to conclude that attendance matters?

If we plot the binned attendance against points score, we can see that “something” is

going on, and the connected means show some variation

[Figure: Modified Bar Chart of Points vs Binned Attendance. x‐axis: Binned attendance (‐1, 0, 1, 2); y‐axis: Points, 0 to 900]

At this point, the observant reader might ask “Doesn’t all this depend on the size of

the bins?” – Let’s see....

If we re‐bin the data into ‐1 (less than 90) and +1 (90 and greater) we find:

Binned Mean Points

‐1 427.9

1 485.9

This time, there’s nearly 60 points of difference between the lowest and highest attendance. Surely this is significant?

At this point we’ve reduced the factors to a binary split, so we can use the t‐test to see if the difference between the means is real and significant.

The preparation of the data is left as an exercise for the reader, but by binning into ‐1 and +1, separating the data into columns and running the Excel TTEST function, we obtain a value of p=0.243.

This p value is well above the p=0.05 limit, so we cannot consider the means statistically different, and we conclude that there is no statistically significant difference between the average points scores when we consider the factor “attendance”.

However, this is not where we wanted to be – we’ve reduced a factor to a binary

split.

We’re going to stick with the original binned data, as they correspond to how we track

learners in school:

‐1 = less than 80

0 = 80 to 89.99

1 = 90 to 94.99

2 = 95+

You’ll need Daniels XL toolbox or Minitab at this stage. Download a copy for MS

Excel from: http://xltoolbox.sourceforge.net/

One way Analysis of Variance (ANOVA)

The statistical test that we’re going to perform is called the One‐way analysis of variance or, as it’s usually referred to, ANOVA.

ANOVA is similar in function (but mathematically much more complex) to the t‐test, except ANOVA can test whether or not two or more means are different. ANOVA tests produce a p value which can be interpreted in the same manner as the t‐test.

This is ideal for our case – ANOVA will reduce our problem of determining if attendance

matters to the familiar task of interpreting a p‐value.

As we’re going to use Daniels XL toolbox or Minitab, data this time can be laid out as

you would receive it from your examinations officer, without further processing.

That is, a list of information with headings across the top and one row per pupil. No preparation will be required, which makes it much easier to deal with than before.

From the Add‐In menu in Excel, select XL Toolbox, and navigate to the Statistics > ANOVA menu.

From the One‐Way Analysis of Variance (ANOVA) menu that appears, select the ranges for the input data:

• For the bins, click in the box once and then drag down over the range of the bins, not including the heading.
• For the data, click in the box once and then drag down over the range of the data, not including the heading.

You should find that the numerical range of each is the same (in this case, $2 to $41), but your data might be different, and they don’t need to be the same size.

Once the ranges are set up, select Run ANOVA.

This dialogue shows a number of things, but the most important for us are:

• The bin names (‐1, 0, 1 and 2), their counts and means
• The ANOVA Results p‐value, which allows us to comment on the significance

In our case, P=0.41370, which is well above P=0.05, indicating that there is no statistically significant difference between the means and any differences cannot be ascribed to the attendance levels alone.
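If you are curious about what the add‐in is doing, the one‐way ANOVA can be sketched in a page of Python. This is an illustration under our own assumptions, not the XL Toolbox code: the function names are invented, and the p value comes from a simple numerical integration of the F distribution that is only suitable when there are at least three groups. ANOVA compares the variation between the bin means with the variation within the bins; the ratio of the two is the F value, and the p value follows from it. Run on the binned sample data, the result should sit close to the figure in the dialogue.

```python
import math
from statistics import mean

# Points scores from the sample table, grouped by attendance bin
groups = {
    -1: [314, 695, 389, 269, 410, 538],
    0: [614, 426, 467, 413, 502, 290, 425, 158, 509],
    1: [479, 350, 626, 552, 519, 834, 494],
    2: [440, 597, 698, 440, 566, 631, 440, 284, 469, 342, 400,
        548, 440, 752, 292, 262, 612, 80],
}


def f_density(x, d1, d2):
    """Probability density of the F distribution with (d1, d2) degrees of freedom."""
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    return (math.exp(-log_beta) * (d1 / d2) ** (d1 / 2)
            * x ** (d1 / 2 - 1) * (1 + d1 * x / d2) ** (-(d1 + d2) / 2))


def upper_tail_p(f, d1, d2, steps=20_000):
    """P(F > f) by Simpson's rule; this simple integrator assumes d1 >= 2."""
    h = f / steps  # steps must be even for Simpson's rule
    total = f_density(0.0, d1, d2) + f_density(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_density(i * h, d1, d2)
    return max(0.0, 1.0 - total * h / 3)


def one_way_anova(samples):
    """Return (F, p) for a list of groups of scores."""
    all_scores = [s for group in samples for s in group]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in samples)
    ss_within = sum((s - mean(g)) ** 2 for g in samples for s in g)
    df_between = len(samples) - 1
    df_within = len(all_scores) - len(samples)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, upper_tail_p(f, df_between, df_within)


f_stat, p = one_way_anova(list(groups.values()))
print(f"F = {f_stat:.3f}, p = {p:.3f}")
```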


Non numeric multi level factors

We started this text by looking at gender and handedness; both were binary non numeric factors (either one value or another). Some factors under consideration can be non numerical and text based with more than two levels: originating primary school⁹ for example.

Our fictional secondary school has 4 feeder primaries: Elm Tree, Everymans, Oldberry and St Judes.

The average points score at the end of Year 11 for each of a group of learners is:

Primary Points Primary Points Primary Points Primary Points

St Judes 314 St Judes 698 Elm Tree 509 St Judes 494

St Judes 695 St Judes 440 St Judes 614 Elm Tree 440

St Judes 389 St Judes 566 St Judes 426 St Judes 597

Elm Tree 269 Oldberry 631 St Judes 467 St Judes 698

St Judes 410 Oldberry 440 Elm Tree 413 St Judes 440

Elm Tree 400 Everymans 501 St Judes 502 Everymans 566

St Judes 314 Oldberry 469 Oldberry 290 Everymans 631

St Judes 614 St Judes 342 Elm Tree 425 St Judes 440

St Judes 426 Oldberry 400 Elm Tree 158 St Judes 284

St Judes 467 Oldberry 626 St Judes 509 Oldberry 469

Oldberry 413 Oldberry 552 St Judes 479 St Judes 342

Everymans 695 Oldberry 519 Everymans 490 St Judes 400

Everymans 502 St Judes 548 St Judes 626 St Judes 548

St Judes 389 Oldberry 440 Elm Tree 401 Oldberry 440

St Judes 290 Everymans 752 Oldberry 519 Oldberry 752

St Judes 269 Everymans 834 St Judes 834 Oldberry 292

Elm Tree 425 St Judes 292 Oldberry 494 Oldberry 262

Everymans 410 Oldberry 262 Oldberry 350 Oldberry 612

St Judes 538 Elm Tree 612 Oldberry 440 Elm Tree 80

St Judes 158 Everymans 540 Oldberry 597

9 At this point, I need to be clear – I’m not suggesting a blame culture between Primary and Secondary,

more, the fact that we have this data in secondary and it can be instructive to see if and where a response

can be split by a factor.


Firing up Excel and the XL Toolbox we place the data in two columns, one for feeder

primary and the other for points score. Navigating through XL Toolbox we run an

ANOVA:

What this ANOVA shows us, with a P value of p=0.0089 is that feeder primary is more

than 99% certain to have an effect upon the average points score at the end of year 11.

What it doesn’t show is where this variation actually is. Are all the schools different, or

just one school different from the rest?


Let’s plot a modified bar chart to see:

[Figure: Modified Bar Chart of Points vs Primary. x‐axis: Primary (Elm Tree, Everymans, Oldberry, St Judes); y‐axis: Points, 0 to 900]

The “difference” is likely to be between Elm Tree and Everymans. But, being the good

statistician we now want to ask more rounded questions:

Is Everymans different to Oldberry & St Judes?

Is Elm Tree different to Oldberry?

Fortunately, tests exist to quantify this difference.


If the p‐value of the ANOVA indicates a statistically significant difference, (indicated by

* or ** next to the value), an additional tab at the top of the window is active. Select

this tab:

The window that appears allows you to test for significance between the levels of the

factors previously analyzed for the ANOVA test.

Leaving the default “Bonferroni‐Holm” (named after the statisticians who devised the

test) you can click on each level of factor in the “Compare” column and look how

different that is to other levels – importantly for us, the dialogue displays the

significance.


On this screen, click on “Produce report”, which will summarise this test in an easy to

read table.

Posthoc test: Bonferroni‐Holm

Group 1 Group 2 Critical P Significant?

Elm Tree Everymans 0.008333333 0.002662327 Yes

Oldberry Everymans 0.01 0.017707646 No

St Judes Everymans 0.0125 0.01989173 No

St Judes Elm Tree 0.016666667 0.074365767 No

Elm Tree Oldberry 0.025 0.082440719 No

St Judes Oldberry 0.05 0.96789046 No

Here, the test for significance is slightly different from before: if the value of p is less than the displayed “critical” value for that row, the difference is significant.
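The “critical” column follows a simple pattern, which the sketch below reproduces. It is written in Python for illustration and is our reading of the Bonferroni‐Holm procedure, not the add‐in’s own code; the function name is invented. The comparisons are sorted from the smallest p value to the largest, and with six comparisons the critical values are 0.05 divided by 6, then by 5, and so on down to 1. Testing stops at the first comparison that fails, and everything after it is reported as not significant.

```python
# (group 1, group 2, p value) for each pairwise comparison, from the report above
comparisons = [
    ("Elm Tree", "Everymans", 0.002662327),
    ("Oldberry", "Everymans", 0.017707646),
    ("St Judes", "Everymans", 0.01989173),
    ("St Judes", "Elm Tree", 0.074365767),
    ("Elm Tree", "Oldberry", 0.082440719),
    ("St Judes", "Oldberry", 0.96789046),
]


def holm(rows, alpha=0.05):
    """Bonferroni-Holm step-down: returns (g1, g2, critical, p, significant) rows."""
    ordered = sorted(rows, key=lambda row: row[2])  # smallest p first
    m = len(ordered)
    results, still_rejecting = [], True
    for rank, (g1, g2, p) in enumerate(ordered):
        critical = alpha / (m - rank)  # 0.05/6, 0.05/5, ..., 0.05/1
        # once one comparison fails, every later (larger) p fails too
        still_rejecting = still_rejecting and p < critical
        results.append((g1, g2, critical, p, still_rejecting))
    return results


for g1, g2, critical, p, significant in holm(comparisons):
    print(f"{g1:<9} {g2:<10} {critical:.9f} {p:.9f} {'Yes' if significant else 'No'}")
```

The shrinking critical value is the price of asking several questions of the same data: the more comparisons we make, the stronger the evidence each one has to show.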

We can see that for our data, only the Elm Tree to Everymans difference is significant, whilst the Oldberry to Everymans and St Judes to Everymans differences are approaching significance.

Whilst our modified bar chart hinted at this before, we now have a hard and fast figure

that describes the difference between the primary schools.

Call to action

Now that we’ve got some real statistical tests in our tool kit, go and find your master

data set for your school / department / class.

Most schools will have spreadsheets of such data, and they probably look something

like this:

Name Sex SEN FSM CATs Att% Feeder Read Maths English Science Overall Points

Adams, Jon M NA N 119 90.35 St Judes 14.02 30 35 40 440

See if you can answer the following questions from your own data:

1. Are the overall results for your school different for gender? Is this a significant

difference ?

a. (TTEST and P value)

b. Repeat the analysis for free school meals (FSM)

2. How well does CATS, (or other base line data), attendance or reading age

predict Maths, English, Science (insert subjects that you have data for)?

a. (Scatter graph for continuous data and fit a trend line. Check R2 value)

3. Create some binned data (CATs, Feeder School) and use ANOVA to check the

significance of a multi leveled factor.

a. Use Bonferroni‐Holm to check for differences between levels of a

factor


Pause for breath ……..

At this point, you’ve come a long way. Instead of using the means of responses to

describe (possibly erroneous) differences between the effects of factor levels, you’ve

just used some real statistical tests (TTEST and ANOVA) to provide you with evidence

that is more than just a “hunch”.

Questions to reflect on

1. Did any of your analysis contradict your preconceptions?

2. Did you show that gender was statistically significant overall? What about

gender for Maths, English, Science?

3. Do learners from any of your feeder primaries perform significantly differently from learners from the others? Does this surprise you?

This is the beauty of simple statistical tests – you can ask the “What if” questions and

very quickly get an answer.

But, and isn’t there always a but – from the factors listed how do you decide which is

the most important and most significant in driving a response?

Name Sex SEN FSM CATs Att% Feeder Read Overall Points

Adams, Jon M NA N 119 90.35 St Judes 14.02 440

And for that, we need yet another tool – this time, the final one we’ll introduce and the

“most useful”, generic test available. Say hello to the General Linear Model


Conclusions

We’ve covered a lot of ground in this chapter. Starting with the t‐test previously described we’ve looked at:

• Grouping or binning factor levels to allow us to continue to use the t‐test and the familiar p value for significance.
• How we can use Excel and trend lines to explore the relationship between continuous data.
• The R² value, which we used to decide how “well” a trend line matched the data. R² = 0.80 is the agreed upon limit; below this the fit is described as “poor”.
• How continuous data can also be binned up to allow t‐tests to differentiate between binary leveled factors.
• One‐way analysis of variance (ANOVA), which allows us to test for significance between multi level factors.
• Extending this ANOVA to explore differences between the levels of factors, and how to assess the significance of these differences.
• Daniels XL Toolbox, a free add‐in to Excel which makes calculating ANOVA much more straightforward.

Chapter 5

General Linear Model (GLM)

To perform the GLM test you will need a dedicated statistics package like Minitab, as the sophistication of the analysis is beyond what’s possible within Excel. It even beats Daniels XL Toolbox.

Regardless of how it’s calculated, what a GLM does is clever and essential when

exploring a data set – it allows you to assess what’s going on without any

preconceptions over what factors affect the response you are measuring. With TTEST

and ANOVA we went looking for a difference caused by changing the levels of a factor

and assessed if this produced a statistical difference in the response – we “assumed”

that there might be a difference and went looking for it.

A GLM is different: it analyses all the factors and levels that you input and returns, for each factor, the significance (p value) of its influence on the response. More than that, it then assesses how well you can use these factors to predict the response value (R²).

The clever part (as far as we are concerned) comes from assessing the p‐values for each

factor and the overall R2 value.

The same rules over p‐value as we used in TTEST and ANOVA apply here. P=0.05 is the upper limit of where we say a factor is significant.

The same rules apply over R²: the higher the value, the better the fit of the factors and the closer we are to accounting for all the variation. What this means is that if we have a low R² value, it probably means that we are “missing” a factor in our analysis and should consider what else we could bring in (spelling age, FFT, teacher).

Constructing a GLM

Assume that we have data laid out in a similar pattern to before:

Name Sex SEN FSM CATs Att% Feeder LAC Read Overall Points

Adams, Jon M NA N 119 90.35 St Judes Yes 14.02 440

We’ve binned up attendance and reading age as we did earlier in the text.

The GLM command exists under the Minitab > Stat > ANOVA menu:

Once the GLM window is active, you simply select the “response” – in our case the

total points (but could equally be the scores for just your subject), and build the

“model”.


In the first instance the model is just the factors that we are assessing. You can

construct a more sophisticated model later.

The output of GLM will depend on the package you are using, but for Minitab, the

output for our data was:


Analysis of Variance for POINTS, using Adjusted SS for Tests

Source             DF   Seq SS   Adj SS  Adj MS      F      P
Sex                 1    97779    24701   24701   1.33  0.252
Previous School     6   162246    80249   13375   0.72  0.635
Attendance          3   147096   181165   60388   3.25  0.026
Reading Age         3   129266    82841   27614   1.49  0.224
FSM Eligible        1     1898     2340    2340   0.13  0.724
LAC                 1    14316    12533   12533   0.67  0.414
SEN                 3   563318   563318  187773  10.10  0.000
Error              86  1598457  1598457   18587
Total             104  2714376

S = 136.333   R-Sq = 41.11%   R-Sq(adj) = 28.79%

This looks very busy, but we are only interested in the p‐values and the overall R2 value
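Reading only the P column against our 0.05 rule is mechanical enough to sketch in code. The p‐values below are copied from the Minitab output; the names `p_values`, `ALPHA` and `significant` are ours, chosen for illustration.

```python
# p-values copied from the P column of the Minitab GLM output.
p_values = {
    "Sex": 0.252,
    "Previous School": 0.635,
    "Attendance": 0.026,
    "Reading Age": 0.224,
    "FSM Eligible": 0.724,
    "LAC": 0.414,
    "SEN": 0.000,
}

ALPHA = 0.05  # the upper limit at which we call a factor significant

# Keep only the factors whose p-value is at or below the limit.
significant = [name for name, p in p_values.items() if p <= ALPHA]
print(significant)
```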

Following our previous rules, only our binned Attendance and the SEN status of our
learners are significant factors in determining the average points score.

R2 = 41.11%, showing that these factors account for only 41% of the variation seen in

the data – implying that there is likely to be something “else” that we haven’t

accounted for.
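If you want to see where the R‐Sq figures come from, they can be recovered from the Error and Total rows of the table. The sketch below uses the standard definitions (R‐Sq = 1 − SS Error / SS Total, and the adjusted version divides each sum of squares by its degrees of freedom first); the variable names are ours.

```python
# Error and Total rows copied from the Minitab GLM output.
ss_error, df_error = 1598457, 86
ss_total, df_total = 2714376, 104

# Share of the total variation that the factors account for.
r_sq = 1 - ss_error / ss_total

# Adjusted version: penalises the model for the degrees of freedom it uses.
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)

print(f"R-Sq = {r_sq:.2%}")
print(f"R-Sq(adj) = {r_sq_adj:.2%}")
```

The adjusted figure is noticeably lower because the model spends 18 degrees of freedom on its factors; it is the fairer number to quote when comparing models with different numbers of factors.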


Deeper analysis

Now that the GLM has identified SEN and Attendance as being significant compared to

the other factors, we can dig deeper into each and run ANOVA or TTest as appropriate.

SEN

Using Minitab to run ANOVA is no different to Daniel’s XL Toolbox, but the output is

slightly different:

Source     DF       SS      MS      F      P
SEN         3   775711  258570  12.64  0.000
Error     147  3007614   20460
Total     150  3783326

S = 143.0   R-Sq = 20.50%   R-Sq(adj) = 18.88%

Individual 99.5% CIs For Mean Based on Pooled StDev

Level              N   Mean  StDev  -------+---------+---------+---------+--
No                94  472.2  139.6                      (---*---)
school action     13  407.9  121.5        (-----------*----------)
school action +   36  300.5  158.0  (------*------)
statement of SEN   8  449.0  143.8         (--------------*-------------)
                                    -------+---------+---------+---------+--
                                          300       400       500       600

What this shows us:

The p‐value (reported as p=0.000, meaning p<0.0005 after rounding) indicates a highly
statistically significant result. (The p‐value from ANOVA can be different from the GLM
as during ANOVA we are only considering the factor itself, but in GLM we are
considering it along with all the other factors.)

R2 of 20.5%, indicating that SEN alone accounts for nearly 21% of the variation

in the data
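The F and R‐Sq values in this output can be approximately reproduced from nothing more than the N, Mean and StDev columns. The sketch below applies the usual one‐way ANOVA formulas to those printed (rounded) figures, so expect small differences from Minitab, which works from the raw data. It does not compute the p‐value, which needs the F distribution.

```python
# (level, N, mean, StDev) as printed in the Minitab output.
groups = [
    ("No", 94, 472.2, 139.6),
    ("school action", 13, 407.9, 121.5),
    ("school action +", 36, 300.5, 158.0),
    ("statement of SEN", 8, 449.0, 143.8),
]

n_total = sum(n for _, n, _, _ in groups)
grand_mean = sum(n * mean for _, n, mean, _ in groups) / n_total

# Variation between the level means, and variation within each level.
ss_between = sum(n * (mean - grand_mean) ** 2 for _, n, mean, _ in groups)
ss_within = sum((n - 1) * sd ** 2 for _, n, _, sd in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
r_sq = ss_between / (ss_between + ss_within)

print(f"F = {f_stat:.2f}, R-Sq = {r_sq:.1%}")
```

The result lands very close to the reported F of 12.64 and R‐Sq of 20.50%, which is a useful sanity check that you are reading the table correctly.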

Minitab analysis (unlike MS Excel / Daniel’s XL Toolbox) automatically charts the data,
showing the mean and 99.5% confidence interval for each level (CI – the range within
which we can be 99.5% confident that the true mean of that group lies). Comparing
intervals rather than individual scores reduces the impact of excessively high / low values.

From this analysis, we can see that No SEN and School Action + are totally different –

and this is the cause of the low p‐value (the CIs don’t overlap)
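To get a feel for how these intervals depend on group size, here is a rough sketch of the calculation for a group mean: mean ± critical value × pooled StDev / √N. One loud assumption: Minitab uses the t distribution for the critical value, whereas the sketch uses the normal distribution from Python’s standard library, so its intervals come out slightly narrower than Minitab’s. The function name `interval` is ours.

```python
from math import sqrt
from statistics import NormalDist

POOLED_SD = 143.0   # "S" from the Minitab output
CONFIDENCE = 0.995

# Normal approximation to the critical value (an assumption: Minitab
# uses the t distribution, which gives slightly wider intervals).
z = NormalDist().inv_cdf(1 - (1 - CONFIDENCE) / 2)


def interval(mean, n):
    """Approximate confidence interval for a group mean."""
    half_width = z * POOLED_SD / sqrt(n)
    return mean - half_width, mean + half_width


no_sen = interval(472.2, 94)
sa_plus = interval(300.5, 36)
statement = interval(449.0, 8)

# The intervals overlap only if SA+ reaches up as far as No SEN reaches down.
overlap = sa_plus[1] >= no_sen[0]

for label, (low, high) in [("No", no_sen), ("SA+", sa_plus), ("Statement", statement)]:
    print(f"{label}: {low:.0f} to {high:.0f}")
print("No SEN and SA+ overlap:", overlap)
```

Because the width shrinks with √N, the group of 8 gets an interval more than three times as wide as the group of 94.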


The cohort size for SEN statements (8) results in huge confidence intervals – meaning

that we can’t draw conclusions on the statemented learners. However, if we remove

them from the analysis, the following picture emerges:

Level              N   Mean  StDev  -------+---------+---------+---------+--
No                94  472.2  139.6                      (---*---)
school action     13  407.9  121.5        (-----------*----------)
school action +   36  300.5  158.0  (------*------)
                                    -------+---------+---------+---------+--
                                          300       400       500       600

We would be safe to conclude that as we progress up the SEN “ladder” the
performance of learners falls, from 472 points for No‐SEN to 300 points for SA+.

With significance of p=0.000, this would demand further analysis and possible

interventions.

Attendance

Source     DF       SS      MS      F      P
C_ATT       3   173144   57715   2.35  0.075
Error     147  3610182   24559
Total     150  3783326

S = 156.7   R-Sq = 4.58%   R-Sq(adj) = 2.63%

Individual 99.5% CIs For Mean Based on Pooled StDev

Level   N   Mean  StDev  --+---------+---------+---------+-------
-1     26  356.7   99.6  (------------*-----------)
0      30  428.6  142.5              (----------*-----------)
1      33  464.0  146.9                   (----------*----------)
2      62  429.9  184.8                 (-------*--------)
                         --+---------+---------+---------+-------
                          280       350       420       490

What this shows us:

The p‐value is higher (less significant) than from the GLM, but at p=0.075, we are on
the cusp of statistical significance.


The R2 of 4.58% indicates that whilst Attendance might be significant, it only

accounts for 4.58% of the spread of data – clearly something else is accounting

for the variation in data.

What would we expect Attendance vs Attainment to look like? I think we would all
agree that the higher the attendance, the higher the attainment we would expect. In
fact, we looked at this before.

Our ANOVA seems to show that there is a drop off in attainment for the highest

attendance group (95%+, binned into “2”).

Being the data gurus that we now are, we want to be able to explain that seemingly

contradictory finding.

Extending the GLM

What often causes responses to behave in this way is some sort of “interaction” between

the factors. What does that mean?

Until now, all the statistics we’ve used have relied on the understanding that one
factor has no influence on another – for example, “what primary school you go to
does not affect whether you are a boy or a girl”. But, in the case of a single sex primary
school, clearly, the sex could be linked to the school itself. When the effect of one
factor on the response depends on the level of another factor, this is called an
“interaction”.

In our case, an interaction between the factors is likely to be pulling down the

performance of the high attendance learners – but what could that be?

Prior knowledge would seem to indicate that SEN Statemented students tend to have a

higher attendance than other students AND the highest attending boys seem to do less

well than other learners.

Let’s explore the data graphically:

[Figure: Scatterplot of POINTS vs Attendance %, points marked by Sex (F, M). x‐axis: Attendance % (20 to 110); y‐axis: POINTS (0 to 900).]

[Figure: Scatterplot of POINTS vs Attendance %, points marked by SEN (no special provision, school action, school action plus, statement of SEN). x‐axis: Attendance % (20 to 110); y‐axis: POINTS (0 to 900).]

Both of these plots indicate high attending learners who achieve less than we would

expect.

The first chart seems to indicate that there are a number of low achieving, but

high attendance males

The second chart seems to indicate that there are a high number of School

Action Plus learners, with high attendance that are also low achieving.

These two statements would imply an interaction between “Sex” and “SEN” – that

is “SEN Boys perform differently to SEN Girls”.
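To make “SEN Boys perform differently to SEN Girls” concrete, here is a small sketch of what an interaction looks like in numbers. The cell means below are invented purely for illustration and are not taken from our data set; the names `cell_means` and `sen_gap` are also ours.

```python
# Invented cell means (average points) purely for illustration;
# they are NOT taken from the school data set.
cell_means = {
    ("F", "No SEN"): 480.0,
    ("M", "No SEN"): 460.0,
    ("F", "SEN"): 400.0,
    ("M", "SEN"): 300.0,
}


def sen_gap(sex):
    """Drop in average points associated with SEN for one sex."""
    return cell_means[(sex, "No SEN")] - cell_means[(sex, "SEN")]


gap_girls = sen_gap("F")
gap_boys = sen_gap("M")

# If SEN affected both sexes equally, this contrast would be zero.
interaction_contrast = gap_boys - gap_girls
print(gap_girls, gap_boys, interaction_contrast)
```

In this made‐up example SEN “costs” girls 80 points but boys 160 points. That non‐zero difference between the two gaps is what the GLM tests for when we ask it about an interaction.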

Building interactions into the GLM

When we created the GLM, we entered a range of factors into Minitab. We can

indicate an interaction by:

Factor 1 * Factor 2


To enter a “Sex” / “SEN” interaction, I have entered Sex*SEN into the GLM model.
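It can help to picture what Sex*SEN asks the package to do: treat every combination of the two factors as its own cell. The sketch below simply enumerates those cells and applies the standard degrees of freedom rule (levels minus one for each factor, and the product of those for the interaction). The level names are taken from the Minitab outputs in this chapter.

```python
from itertools import product

sex_levels = ["F", "M"]
sen_levels = ["No", "school action", "school action +", "statement of SEN"]

# Every Sex/SEN combination is a separate cell in the model.
cells = list(product(sex_levels, sen_levels))

# Degrees of freedom: each main effect uses (levels - 1);
# the interaction uses the product of those two numbers.
df_sex = len(sex_levels) - 1
df_sen = len(sen_levels) - 1
df_interaction = df_sex * df_sen

print(len(cells), df_sex, df_sen, df_interaction)
```

These are the same DF values (1, 3 and 3) that Minitab reports for Sex, SEN and Sex*SEN in its output for this model.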

If we run the GLM, the output is similar to before, but with the additional factor of

Sex*SEN

Analysis of Variance for POINTS, using Adjusted SS for Tests

Source             DF   Seq SS   Adj SS  Adj MS      F      P
Sex                 1    97779     8598    8598   0.54  0.464
Previous School     6   162246    90813   15135   0.95  0.462
Attendance          3   147096   238927   79642   5.01  0.003
C-READ              3   129266    52955   17652   1.11  0.349
FSM Eligible        1     1898     2961    2961   0.19  0.667
LAC                 1    14316    30069   30069   1.89  0.173
SEN                 3   563318   292055   97352   6.13  0.001
Sex*SEN             3   279666   279666   93222   5.87  0.001
Error              83  1318791  1318791   15889
Total             104  2714376

S = 126.052   R-Sq = 51.41%   R-Sq(adj) = 39.12%

What we notice:

The p‐values have changed – we would expect that as we have added in

another factor to assess against.

Attendance and SEN are still significant

Sex*SEN interaction is as significant as SEN alone

In this way, we have built an interaction into the GLM and assessed that SEN pupils
perform differently depending on Sex.

Overall, this GLM analysis points the direction for further study. Specifically, with this

data set, Attendance, SEN and SEN*SEX are the factors that we should be considering.
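One way to convince yourself that the interaction has earned its place is to compare the Error rows of the two GLM outputs (without and with Sex*SEN). The sketch below does this by hand; it is the standard comparison of two nested models, with variable names of our own choosing.

```python
# Error rows from the two Minitab GLM outputs (without / with Sex*SEN).
sse_without, df_without = 1598457, 86
sse_with, df_with = 1318791, 83
ss_total = 2714376

# Variation (and degrees of freedom) that the new term takes out of Error.
ss_interaction = sse_without - sse_with
df_interaction = df_without - df_with

# F compares the variation explained per DF with what is left unexplained.
f_stat = (ss_interaction / df_interaction) / (sse_with / df_with)

# The same variation expressed as a share of the total.
r_sq_gain = ss_interaction / ss_total

print(ss_interaction, df_interaction)
print(f"F = {f_stat:.2f}, R-Sq gain = {r_sq_gain:.2%}")
```

The drop in Error matches the Sex*SEN row of the second output (279666 on 3 DF, F = 5.87), and the gain of roughly ten percentage points is the difference between the two R‐Sq values of 41.11% and 51.41%.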


Big implications of the GLM

By not assuming any one factor is important, we have used a General Linear Model to

show that only Attendance and SEN are statistically important in determining the total

points score of our learners.

Let me put that another way: compared to Sex, Free School Meals or LAC status,
Attendance and SEN are more important in determining the total points score on
leaving (see footnote 10).

This is where the importance of the GLM should not be underestimated.

Before embarking on a whole school initiative to “solve” a problem, we must surely use

detailed statistics to correctly assess what we need to “solve” in the first place.

10 We’re not making far reaching conclusions here – for this data, analyzed for 2010‐2011, this is the

output of the GLM. Your data / conclusions will be different.


Call to action

1. Get the biggest data set that you can get your hands on, including, but not

limited to:

Name        Sex  SEN  FSM  CATs  Att%   Feeder    Read   Maths  English  Science  Points
Adams, Jon  M    NA   N    119   90.35  St Judes  14.02  30     35       40       440

2. Build a GLM of the input factors and determine which are the most significant

in determining your output response.

a. You might need to bin up your input factors

3. Are there any factors that come out as significant that show a different trend

than you would expect?

a. Consider extending the GLM to take into account interactions between

factors.

Conclusions

The GLM allows you to determine which factors are most significant in driving an

output response. Crucially, the GLM allows you to perform this analysis, without any

preconceptions over which factor will be significant in the first place.

We’ve shown you (briefly) how to use Minitab to create a GLM model for a range of input
factors and how to interpret the significance of the output.

Once we’d built the GLM, we looked at digging deeper into the data and running
separate ANOVAs to check for trends within each factor.

Finally, we discussed how to add in interactions between factors to account for
counter‐intuitive trends.


Chapter 6

Main Effects

We’ve taken a fairly constructivist approach to working our way through the statistics

we’ve needed to analyze the data routinely seen in school and at faculty level. This has

led to exploring means, data visualization, t‐tests, ANOVA and latterly the very

powerful GLM approach. These were presented in that order to allow us to build up a

picture of how to effectively analyze a data set. Once you’ve got these tools under

your belt, you’re going to want a short cut method that focuses your analysis – that’s

the Main Effects.

Imagine drawing a series of charts, with common scales, so that you can directly

compare the magnitudes of the effects of different factors. That way, you can see

which factors have the most effect and need to be analyzed further. Excel can be

driven in this way, but you’ll need to create separate graphs for each response and

scale them accordingly – lots of work.

Fortunately, now that you’re using Minitab (see footnote 11), there’s an immediate way to get just

what you want.

Right under the Stat >> ANOVA menu, towards the bottom is the Main Effects Plot.

11 Or equivalent

Pick the Response and Factors, exactly as when building the ANOVA analysis.

Main Effects Plot

The Main Effects Plot is quite simple – it shows on common y‐axis scales the effect of

each of the factors you chose, split by the level of each factor.
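A rough numeric counterpart to the plot is to compare, for each factor, the gap between its highest and lowest level mean. The sketch below does this for the two factors whose level means were printed in the one‐way ANOVA outputs of the previous chapter. Note the assumption: those means come from the ANOVA runs, not from the Main Effects Plot itself, so treat the numbers as indicative only. The helper name `effect_range` is ours.

```python
# Level means copied from the one-way ANOVA outputs in the previous chapter.
level_means = {
    "SEN": {
        "No": 472.2,
        "school action": 407.9,
        "school action +": 300.5,
        "statement of SEN": 449.0,
    },
    "C_ATT": {-1: 356.7, 0: 428.6, 1: 464.0, 2: 429.9},
}


def effect_range(means):
    """Gap between the highest and lowest level mean for one factor."""
    return max(means.values()) - min(means.values())


ranges = {factor: effect_range(means) for factor, means in level_means.items()}
largest = max(ranges, key=ranges.get)

for factor, size in ranges.items():
    print(f"{factor}: {size:.1f} points")
print("Largest effect:", largest)
```

On these figures SEN spans about 172 points and Attendance about 107, which is the kind of comparison the common y‐axis lets you make by eye.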

The chart below leads us to the following conclusions:

SEN, going from SA to SA+ has the largest impact on Average Points

Going from ‐1 to +1 in C‐Read (Reading Age) has a large impact

Going from ‐1 to +1 in C_ATT (Attendance) has a large impact

There is a large difference between Previous Schools B and D

Out of all the factors charted, Sex shows the least effect on Average Points.

[Figure: Main Effects Plot for Average Points. One panel per factor on a common y‐axis (Average Points, 300 to 500): Sex (F, M); C_ATT (-1, 0, 1, 2); C-READ (-2, -1, 0, 1); SEN (No, SA, SA+, Statement); FSM Eligible (N, Y); LAC (N, Y); Previous School (*, A to F).]

Interactions Plot

Following the same thoughts as the Main Effects, it would be a good idea to see all the

possible interactions and to decide which need further analysis.

From the same STAT >> ANOVA menu as before, select Interactions Plot and pick the

same factors as previously:

For the purposes of this text however, we will only look at Sex, Attendance, Reading Age
and SEN – and how they interact to determine Attainment. As you will see, the chart is
confusing enough with just these factors.

Take time to look at the chart before reading on.

[Figure: Interaction Plot for Average Points Score. Matrix of panels for Sex (F, M), C_ATT (-1, 0, 1, 2), C-READ (-2, -1, 0, 1) and SEN (no special provision, school action, school action plus, statement of SEN); y‐axis from 300 to 600.]

What does the interaction plot show?

In all cases of Attendance, Girls outperform Boys

As girls’ reading age increases, they improve in attainment at a faster rate than
boys

Reading age only affects attainment for the highest attendees – for those with
the lowest attendance, reading age matters much less

Statemented boys do far better than statemented girls

Those statemented students with the highest reading age achieve most

The purpose of the interaction chart is to quickly identify avenues for further research.

What stands out as a promising line of enquiry is the effect of the gender split and
reading age on attainment (seen here zoomed in):


It can be clearly seen that as the reading age goes from ‐2 to +1, both genders improve
in attainment. However, the chart also suggests that the attainment gains for girls are
far higher at higher reading ages than for boys. This needs further analysis.

Let’s build a GLM of attainment against the factors of SEX, Reading Age and the

Sex*Reading Age interaction:

General Linear Model: POINTS versus Sex, C-READ

Analysis of Variance for POINTS, using Adjusted SS for Tests

Source             DF   Seq SS   Adj SS  Adj MS      F      P
Sex                 1    97779    60685   60685   2.42  0.123
C-READ              3   157821   130298   43433   1.73  0.166
Sex*C-READ          3    26056    26056    8685   0.35  0.792
Error              97  2432719  2432719   25080
Total             104  2714376

S = 158.365   R-Sq = 10.38%   R-Sq(adj) = 3.91%

What this shows is that none of the factors is statistically significant in determining
attainment, and the Sex*Reading Age interaction, with p=0.792, is so far from
significant that it should be removed from our model.

Let’s remove it and re‐run the GLM:

Analysis of Variance for POINTS, using Adjusted SS for Tests

Source             DF   Seq SS   Adj SS  Adj MS      F      P
Sex                 1    97779    77753   77753   3.16  0.078
C-READ              3   157821   157821   52607   2.14  0.100
Error             100  2458776  2458776   24588
Total             104  2714376

Now the GLM hints that both Sex (p=0.078) and Reading Age (p=0.100) are approaching
statistical significance, so warrant further investigation.

So again the value of a thorough statistical investigation is clear. What looks like a

difference between the effect of reading age for boys and girls, is not actually

statistically valid.
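If you are wondering why the numbers shifted when we dropped the interaction, the bookkeeping can be checked by hand. When a term is removed, its sum of squares and degrees of freedom are pooled into the Error row. The sketch below reproduces the new Error row and the new F for Reading Age from the two outputs above; the variable names are ours.

```python
# Rows from the GLM that included the Sex*C-READ interaction.
ss_interaction, df_interaction = 26056, 3
ss_error, df_error = 2432719, 97

# Dropping the term pools its sum of squares and DF into Error.
ss_error_reduced = ss_error + ss_interaction
df_error_reduced = df_error + df_interaction
ms_error_reduced = ss_error_reduced / df_error_reduced

# C-READ has Adj SS = 157821 on 3 DF in the reduced model's output.
f_cread = (157821 / 3) / ms_error_reduced

print(ss_error_reduced, df_error_reduced)
print(f"MS error = {ms_error_reduced:.0f}, F(C-READ) = {f_cread:.2f}")
```

The pooled Error row agrees with Minitab’s second output to within rounding (2458776 on 100 DF), and the smaller error per degree of freedom is what nudges the remaining factors closer to significance.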


Call to action

1. Find the data set that you’ve been looking at and use Minitab to display the

Main Effects. Does the ME plot lead you to any conclusions?

2. Use Minitab to create an Interactions Plot for the same factors and responses.

Do any interactions stand out? For example “do high attending boys perform

differently to high attending girls?”

3. For any interesting interactions, build a GLM to assess if these interactions are

significant and need further analysis.

Conclusions

During this Chapter we’ve turned the previous analysis on its head and gone right back

to a datasheet.

We’ve used the Main Effects plot to quickly show the average effect of different factors.

These factors can then be assessed by ANOVA for their overall significance.

Finally, we used the Interactions Plot to explore any trends that “break the mould”. To

assess these interactions, we built a GLM.


Chapter 7

Final remarks

Data is nothing to be afraid of. Throughout this text we’ve tried to show how, by
approaching the crunching of numbers methodically, you can make sense of the data that is

presented in schools. We’ve shown how simple analysis and display of the mean can

be misleading and is open to interpretation/bias. As a class teacher, head of subject or

member of the senior leadership team, surely you need more than “gut feel” to make

decisions that could adversely affect the education of a whole generation – and we

don’t make that pronouncement lightly.

As educational professionals, we owe it to the learners for whom we are devising

interventions and creating policy – to base our pronouncements on sound, statistically

valid understanding of the numbers we are “crunching”. I hope this text has begun to

demonstrate that this is not hard, and is certainly not beyond you, dear reader.

Tools you’ll need:

1. MS Excel (or equivalent)

2. Daniel’s XL Toolbox (or equivalent Add‐In)

3. Minitab (or equivalent)

What we can’t guarantee is that you’ll uncover that “missing link” that will improve

your school performance from 30% to 80% A*‐C overnight – but what we can ensure is

that when you’re asked “Are you sure” – you can say “Yes”.

So, go on – get “beyond the mean”.

Glen Gilchrist

Alexavier Fareheed

www.goingbeyond.co.uk
