As a school leader or head of subject you are required to analyse attainment data relating to whole cohorts of learners. From this analysis you need to produce timely interventions and measurable initiatives to improve the very performance you are monitoring. This requires data analysis - something that most teachers and leaders either find daunting or only address in a superficial manner. Covering chapters on Data Analysis, Problems With The Mean, Comparative Statistics, Analysis Of Variance and the incredibly powerful General Linear Model (GLM) - this is a text book for real teachers faced with real issues in real classrooms.
Beyond the Mean
Data analysis for School Leaders
Glen Gilchrist
Alexavier Fareheed
This edition published by LULU, February 2012
ISBN: 978‐1‐4716‐1146‐9
This work is licensed under a Creative Commons Attribution‐NonCommercial‐
ShareAlike 3.0 Unported License (CC BY‐NC‐SA 3.0).
To view a copy of this license, visit http://creativecommons.org/licenses/by‐nc‐sa/3.0/
or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco,
California 94105, USA. Whilst the Creative Commons License for this book entitles you
to distribute / modify the work for non‐commercial use, without additional
permissions, we kindly request that you inform the authors of any intention to re‐
publish / remix this title. Send an email to [email protected]
Every effort has been made to contact perceived copyright holders for material
reproduced in this publication. Any omissions or oversights will be rectified in
subsequent editions if written notice is given to the author. All trademarks are the
property of their respective owners. The authors are not associated with any product
or vendor mentioned in this book except where stated. Unless otherwise stated; any
third‐party quotes, images and screenshots, or portions thereof, are included under
‘fair use’ for comment, news reporting, teaching, scholarship, and research.
Acknowledgements
The authors would like to thank Michelle Gilchrist for her help, support and tireless
proofreading skills, without which this book would not have seen the light of day.
Disclaimer
This is a book aimed at those readers wanting to explore data as used to drive decisions
in schools. It is not a comprehensive guide to statistics – no responsibility is assumed
or accepted for your decisions based on your data. Using the techniques detailed in
this text provides an aid to decision making whereas, the decision to act is left to the
discretion of the reader. No liability can be placed with the authors of this text.
By using the material contained within this guide, you acknowledge that you have read
and accept this disclaimer.
Preface to first edition
I have to admit a long standing and growing interest in the subject of statistics. As a
research scientist before finding my vocation as a teacher I used the tools of statistics
on a daily basis to inform my research and to plan future investigations. When I started
on my teaching career I was amazed at just how underdeveloped the use of “proper”
numbers was, both in the classroom and within the wider arena of educational policy
making. Far reaching decisions are made on the basis of poorly researched and
under‐analyzed data. Everyone’s tax investments and our endeavors as teachers and
leaders are constantly being misdirected by the improper analysis of data. This book is my
contribution to the cause of using data in an appropriate and considered manner.
Good luck dear reader.
Glen Gilchrist, February 2012
I’ve been head of faculty for 5 years now and in all that time, I don’t think that I’ve seen
anyone – literally anyone in the education sector use data in a robust manner. Sure,
I’ve seen pretty bar charts and tables used to justify interventions and to determine
policy. I’ve sat through too many INSET sessions discussing the consequences of poorly
analyzed data; in fact I’ve been asked to lead on data sessions as presented to
incoming PGCE, NQTs and new staff – I guess in short, I’ve become part of the problem.
I believe that you dear reader have an obligation to reflect upon the data that you
collect and the consequences of your analysis.
Alexavier Fareheed, February 2012
Corresponding with the authors
Data analysis can be a lonely pursuit. The authors are happy to receive questions,
queries and other correspondence – send an email to [email protected].
Contents
Introduction
It’s easy to see why data is mishandled and unsafe conclusions drawn.
Essential definitions
A word about software
Minitab
Final note
Chapter 1
DATA ANALYSIS THAT SCHOOLS “DO”
Why we use the mean average
Factors we can compare
Central tendency
The mean ‐ a point statistic
More sophisticated analysis
Complementing the mean – bar charts
Using the mean to compare “segments” of data
Using the language of statistics
The wider school picture
Call to action
Conclusions
Chapter 2
THE PROBLEMS WITH THE MEAN
Statistics in action
Call to action
Problems with the mean
Call to action
The dangers of presumption – pre analyzing the data
Call to action
What do your bar charts show?
Ethics, politics and “getting your own way”
Call to action
How big an effect / difference is “big enough” to matter?
Extra information in a “modified” bar chart
Call to Action
Looking at a whole cohort
Preconceptions again
Conclusions
Chapter 3
COMPARATIVE STATISTICS
What does significant mean?
T‐tests and p values
Calculating significance using Excel
Excel command for T‐testing
Call to action
Conclusions
Chapter 4
FACTORS WITH MULTIPLE LEVELS.
Multi level factors
Combine levels to make a binary solution
Calculating t‐test for “binned” data
Limits of the t‐test
Multi level factors
ANALYSIS OF VARIANCE
Does attendance affect attainment?
Fitting a trend line to Excel data
Using R2 to check for “goodness” of fit
One way Analysis of Variance (ANOVA)
Non numeric multi level factors
Call to action
Pause for breath ……..
Questions to reflect on
Conclusions
Chapter 5
GENERAL LINEAR MODEL (GLM)
Constructing a GLM
Deeper analysis
Extending the GLM
Building interactions into the GLM
Big implications of the GLM
Call to action
Conclusions
Chapter 6
MAIN EFFECTS
Main Effects Plot
Interactions Plot
Call to action
Conclusions
Chapter 7
FINAL REMARKS
Tools you’ll need:
Introduction
Every school leader, head of subject and class room teacher will recognize the
following scenario:
It’s a school INSET, and what wonderful pedagogical expertise is going to be shared
with you, the willing staff? – Yes, you’ve guessed it: “Addressing the gender differential” –
the very name sends waves of déjà‐vu through the staff and the authors of this book
develop an instant migraine.
We’re not denying that there is a difference between the genders and their approach
to education; nor are we suggesting that, as teachers and leaders, you don’t need
to monitor whether situations are improving or deteriorating. What brings
us to the point of tears is that this statement is based on poorly and superficially
analyzed data.
As we will show in this book, it’s easy to assume that responses will be different for a
certain factor, and when you just look at the mean of a data set, this “difference” is often
seen – you’ve then proved your initial assumption and you don’t look for a more
fundamental root cause. In our experience, this is the case with the gender
differential, and I bet you’ve fallen into this trap too.
When we came into teaching, for the first time in our professional lives we became
aware of the situation of being “data rich but information poor”. Education abounds
with numbers, and schools, students & teachers have never been “measured” as much
as they are in 2010‐2011.¹
But which numbers do you use and which demand that you take them seriously?
¹ Whilst this appears to be particularly true of the English / Welsh systems, all educational infrastructures
constantly battle with league tables, “banding” and other lists.
It’s easy to see why data is mishandled and unsafe conclusions drawn.
Until very recently, use of correct descriptive statistics was the preserve of the
statistician, often resulting in the calculation of arcane numbers, utilizing impenetrable
mathematics. Indeed, pick up anything but the most basic of statistics text books and
the reader will soon be swimming in a sea of mathematical notation, far beyond the
readability of those without degrees in mathematics.
But with the change in responsibilities, the TLR structure, and the reduction in
extraneous funding, the expectation is that as a subject/school leader, you undertake
data analysis and draw conclusions.
I doubt you’re trained in statistics (and why should you be?) ‐ so instead of carrying out
statistically valid analysis you have returned to that most basic of measures – the
“average” – after all, it’s easy to calculate and means something, doesn’t it?
Throughout the text of this book, we will look at analysing the data a typical
department in a school might produce – initially by calculating “means” and developing
this into a more rigorous assessment of data.
So dear reader, this book is aimed at classroom practitioners, heads of department and
school leaders seeking a deeper understanding of what your data actually shows.
In a nutshell, we’re going to take you “beyond the mean”.
Glen Gilchrist & Alexavier Fareheed
2012
Essential definitions
We need to define three vital terms that will be used throughout this text:
Factor: A factor is a variable whose values are independent of changes in the values of
other variables. Traditionally factors are the groups into which we split our
data – gender, SEN, free school meals are examples of educational factors.
Level: Factors can be split into different values. Statistically, these values are called
levels.
Levels can be numerical, quantitative or qualitative, binary or multi level.
Binary Levels
Levels can be binary in nature “boy or girl”, “SEN or not” and can be
represented numerically “1=boy, 2=girl” or remain as text.
Multilevel Levels
Levels are not always binary, “originating primary school” for example could be
one of 10 or more levels, with each school either referred to by name or a
coded “number” 1=School A, 2=School B etc
For continuous levels (age and attendance are good examples) levels
themselves might be grouped together to make analysis easier. These
groupings are often called “bins” and reference will be made to “bin size”.
Attendance for example could be binned as:
‐1 = less than 80%
0= 80% to 89.9%
1 = 90% and greater
The numerical value of the groups (‐1, 0, 1) is not important and the labels are
used to identify the grouped levels. Some consideration needs to be given to
the size / range of the groupings as this choice can affect subsequent
data analysis – however this is outside the scope of this text, and for
the analysis undertaken in schools, just ensure that the bins are
“sensible”.
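If you would rather script the binning than do it by hand, the attendance example above might be sketched as follows. This is only an illustration in Python, a language this book does not otherwise use; the function name attendance_bin and the sample attendance figures are our own inventions and not real data.

```python
def attendance_bin(attendance_percent):
    """Return the bin label (-1, 0 or 1) for an attendance percentage."""
    if attendance_percent < 80:
        return -1   # less than 80%
    if attendance_percent < 90:
        return 0    # 80% to 89.9%
    return 1        # 90% and greater


# Hypothetical attendance figures, invented purely for illustration.
attendance = [72.5, 80.0, 89.9, 90.0, 97.3]
bins = [attendance_bin(a) for a in attendance]
print(bins)
```

Changing the two thresholds (80 and 90) is all it takes to try a different bin size, which is a quick way of checking whether your bins are “sensible”.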
Response: The response is the output that you are measuring. For school based
data, average or total points score and number of “C’s” are the typical
responses measured.
A word about software
MS Excel is referred to throughout this text and is used as convenient shorthand for
“spreadsheet”. We acknowledge that other spread sheets such as OpenOffice and
GoogleDocs are available and can be used fairly interchangeably for MS Excel (except
where indicated). Each has their strengths / weaknesses, but all process statistical
information in much the same manner. There is no need to change your spreadsheet
package to complete the numerical analysis undertaken in the majority of this text.
Some of the more advanced statistics require the use of a dedicated statistics tool.
Recently the cost of these tools has fallen dramatically and academic licenses can be
obtained for less than £50. We cannot recommend strongly enough the value in
obtaining the correct tool to analyze your data.
A great list is maintained at Wikipedia, which compares different statistical tools, their
costs and licenses: http://en.wikipedia.org/wiki/Comparison_of_statistical_packages.
Minitab
Throughout this book the authors make use of Minitab, a tool that is easy to get to
grips with and available at an excellent price (from sub £20)
(http://www.minitab.com/en‐GB/academic/licensing‐options.aspx). The publisher also
makes available a free 30 day trial – more than enough time to learn the ropes and to
process data for your self evaluation.
Final note
The authors are practicing teachers, currently heads of subject in maintained
secondary schools and have no association with any of the tools / software / publishers
mentioned in this text.
“Data analysis is a journey that the only destination is enlightenment – get ready for
the ride of your life.” Glen & Alexavier – February 2012
Chapter 1
Data analysis that schools “do”
One of the biggest challenges in getting data used correctly in schools used to be the
actual collection and manual processing of the “numbers”. Now with tools such as MS
Excel, OpenOffice and GoogleDocs available to all, the challenge has shifted to the
actual processing and analysis that turns “numbers” into “data”.
Courses abound in educational circles about the “use” of data, but from personal
experiences they all focus on 3 areas:
1. Sources of baseline data (CATs, FFT, Government, Feeder Primaries)
2. Segmenting the data (gender, free school meals, SEN)
3. Monitoring, assessing and explaining student performance against (1) and (2)
Valuable as these courses are (and a significant improvement on not using data), they
all focus on basic statistics – the mean average, range and a cursory diversion into
drawing and formatting bar / line graphs; and whilst this is encouraged, reliance on
these measures alone can lead to poorly drawn and costly conclusions.
Why we use the mean average
Whilst Excel et al have democratized the collection and analysis of data, they have also
exposed the fact that most users of these tools are unaware how to use them at a high
enough level to process statistical information. As a result, most users are content with
tabulation, calculation of “averages” of data sets and with drawing basic, overly
coloured bar charts.
These “averages” are then used to draw conclusions, usually in the form of
comparisons; Boys vs Girls, free school meals vs non free school, English vs Maths,
2009 vs 2010, one school vs another.
Factors we can compare
The candidate list for comparison is long: special educational needs, ethnicity, “looked
after”, target group, literacy “booster” support or a hundred‐and‐one other
educational imperatives. I am certain this situation occurs in your school. Indeed the
schools inspection framework² demands that schools use data to “identify, plan and
monitor” the attainment of “groups” of learners. Without extensive use of such data,
schools cannot hope to achieve a coveted “Grade 1” status.
We will expose in this chapter the dangers of using just the mean to represent a data
set, and show how drawing conclusions from the mean alone can lead to costly and
unnecessary interventions.
Central tendency
Used in this context, the mean is a “measure of central tendency”.³
The two most widely used measures of "central tendency" of data are the mean
(average) and the median. For example, to calculate the mean weight of 50 people,
add the 50 weights together and divide by 50. To find the median weight of the 50
people, order the data and find the number that splits the data into two equal parts.
The median is generally a better measure of the centre when there are extreme values
or outliers because it is not affected by the precise numerical values of the outliers
themselves. (The median is often used to describe “average” earnings in a population as
it is not affected by a small number of very large (or small) salaries.)
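To see the effect of an outlier for yourself, here is a minimal sketch in Python using the standard statistics module. The ten salaries are hypothetical, invented for illustration only: nine modest salaries and one very large one.

```python
from statistics import mean, median

# Hypothetical salaries in pounds, invented purely for illustration:
# nine modest salaries and one very large one.
salaries = [21000, 22000, 23000, 24000, 25000,
            26000, 27000, 28000, 29000, 500000]

mean_salary = mean(salaries)      # pulled upwards by the single large salary
median_salary = median(salaries)  # the middle of the ordered data

print("Mean salary:  ", mean_salary)
print("Median salary:", median_salary)
```

The mean comes out at 72,500, a figure that nobody in this made‐up group actually earns or is even close to, whilst the median of 25,500 sits in the middle of the nine typical salaries.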
The mean ‐ a point statistic
The mean is a “point” statistic – that is, it reduces an entire data set to a single value,
useful to succinctly describe the data. (However, you lose any sense of the spread and
variability of the numbers). As a result, the mean is the most widely used measure of
central tendency, but as we will see, not always the most useful.
² UK wide, but certainly heavily endorsed in England and Wales.
³ There are three measures of central tendency used to describe data sets – mean, mode and median. If
you are unfamiliar with these terms or just need a recap, remember – Google is your friend.
For example, the Average Points Score for 5 schools in 2011 was:
School   Average Points Score
A        435
B        403
C        440
D        427
E        438
What conclusions can be drawn from this data?
• School “C” is the best performing
• School “B” is the lowest performing
• Schools “A”, “C” and “E” all have similar points scores
• School “B” needs to do “something” as its performance is very different to the
  other schools.
It’s likely that such analysis is undertaken at this level in both your department and
whole school self evaluations.
The consequences of such analysis are likely to be some form of change, intervention
or closer monitoring. In short, money, time and effort will be expended acting on this
analysis of means. A situation that we are sure has happened in your school or
department.
More sophisticated analysis
Further and seemingly more sophisticated analysis will have you looking at the same
data over a period of 3 or 5 years:
School 2008‐2009 2009‐2010 2010‐2011
A 425 430 435
B 440 420 403
C 411 424 440
D 425 430 427
E 430 438 438
What does this show?
• School “C” is the most improved over the 3 years
• School “B” has fallen 37 points over 3 years
• Schools “D” and “E” have shown little improvement over the three years
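The three statements above are simple subtractions, and a short Python sketch (our own illustration, using the figures from the table) makes the arithmetic explicit. The variable names are assumptions chosen for readability.

```python
# Average points scores from the table: 2008-2009, 2009-2010, 2010-2011.
scores = {
    "A": [425, 430, 435],
    "B": [440, 420, 403],
    "C": [411, 424, 440],
    "D": [425, 430, 427],
    "E": [430, 438, 438],
}

# Change from the first year to the last year for each school.
change = {school: years[-1] - years[0] for school, years in scores.items()}
most_improved = max(change, key=change.get)

for school, delta in change.items():
    print(f"School {school}: {delta:+d} points")
print("Most improved:", most_improved)
```

Note that this is still nothing more than a comparison of means: it tells you the size of each change, not whether any of the changes is big enough to matter.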
As part of your self evaluation / action plan – you will have undoubtedly looked at 3
year trends in mean data. You’re likely to have compared your results to that of other
departments, between local, national and family of schools and made pronouncements
on how well you are doing compared to last year.
To try and unravel some of the mystery about what your data is showing you, chances
are you’ll draw a bar chart of the means.
Complementing the mean – bar charts
Let’s complete the analysis and draw a bar chart of the data for the schools over three
years:
What does this chart show us?
• It emphasizes the fall in performance of school “B”
• The performance gains of school “C” look incredible
• School “D” looks all but static over the past three years
Overall, what conclusions can be drawn about schools “A” to “E”?
• School “A” is doing something that is improving performance
• School “C” is clearly doing something “better” than the other schools and
  better than school “A”
• School “D” appears not to be doing anything and performance is static
• School “E” looks like something happened during 2009‐2010, but these gains
  have stopped and the school has not improved since.
• School “B” looks like it’s in free fall and standards are falling rapidly
No doubt such analysis is regularly completed by you and/or your senior leadership
team. And if our personal experiences are reflected in your school, stress levels and
anxiety rise in proportion to the preparation and analysis of such data.
Using the mean to compare “segments” of data
As a teacher, administrator or policy maker we often need to compare the means of
two or more populations – essentially to test whether or not an intervention or
observation produces a measurable difference. For example, the average points score
for Year 11 students upon receiving their L2 qualifications is often segmented into data
for males and females.
        Average Points Score
Boy     402
Girl    448
As a result of this basic analysis, decisions will be made and policy set.
In this case, “clearly” there is a sex linked differential between Boys and Girls – with
Girls outperforming Boys by roughly 11% (46 points). From this analysis of means an
intervention will be planned – possibly grouping next year’s cohort into separate sex classes,
planning boy friendly lessons and tweaking the seating plans.
Again, we’re sure that you’re familiar with such segmentation of data and are certain
that your self evaluation contains statements about the gender differential and how
you intend to tackle it.
Using the language of statistics
At this point, let’s start to use the language of statistics more fully.
In the case above for boys / girls L2 performance:
• We have one factor, SEX, split into two levels (Boy and Girl) – we say we have a
  binary factor.
• Our response is the Average Points Score
From now on, we will use factor, level and response to describe our data.
The wider school picture
Such analysis is extended across the wider school, comparing the differentials in your
subject to those in English, French and DT⁴ ‐ as a direct result of this analysis a working
party or even a PLC⁵ will be created to tackle the clear differences between subject
areas.
(Whilst written here in a tongue‐in‐cheek manner, I suspect that your school has at
some point created a working party to contemplate differences in responses when
factors are analyzed for mean differences)
⁴ Insert the high performing subject areas in your school.
⁵ PLC – Professional Learning Community, school based collaborative action research – for more details
see: http://www.centerforcsri.org/plc/program.html
What can we conclude from this chart?
• French has the smallest sex differential
• Science has the widest differential
• In DT, boys outperform girls
The temptation in this case is to view the French differential (low) as in some way
“better” than the Science differential (high) and to invest time and resources in solving
the “problem”.
We’re not suggesting that this does not need to be solved; just that the data analysis
performed so far does not demand such investigations, merely hints at them.
Call to action
1. Do you know the three measures of central tendency and when to use each
one? Do you know how to get Excel to calculate each?
2. Find your self evaluation and identify where you have used the mean of a data
set to draw a conclusion about segmentation of data
3. Look at the charts and graphs you have created for your exam analysis
meeting. Are they based on means of data? What conclusions did you draw
from them?
4. Look at whole school, local and national data – how often is an entire data set
reduced to a point statistic?
5. How well can you use your spreadsheet tools?
a. Can you enter a formula to calculate the average of a data set?
b. What about counting the numbers in a column when the value in a
different column is a particular value? (COUNTIF() – used to
automatically count data, say based on a column containing the sex of
a learner)
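For readers who prefer a script to a spreadsheet, the conditional counting and averaging asked about in point 5 (what Excel offers as COUNTIF and AVERAGEIF) might look like the following Python sketch. The five learners and their scores are hypothetical, invented for illustration only.

```python
# Hypothetical class data, invented purely for illustration:
# one entry per learner, in the same order in both lists.
sexes = ["Boy", "Girl", "Girl", "Boy", "Girl"]
scores = [52, 61, 58, 47, 66]

# Count the learners in one segment (the spreadsheet COUNTIF idea).
boy_count = sexes.count("Boy")

# Average the scores for one segment (the spreadsheet AVERAGEIF idea).
girl_scores = [score for sex, score in zip(sexes, scores) if sex == "Girl"]
girl_mean = sum(girl_scores) / len(girl_scores)

print("Number of boys:", boy_count)
print("Mean score for girls:", round(girl_mean, 1))
```

The point is the same whichever tool you use: you pick out the rows belonging to one level of a factor, then count or average the response for those rows alone.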
Conclusions
During this chapter we have shown the basic data analysis undertaken by schools. As
subject team leader we imagine that you have laboured over such figures yourself,
painstakingly entering figures into MS Excel, creating comparison bar / pie charts and
drawing conclusions based on the mean average of data sets.
You’ve likely taken such figures into exam analysis meetings with your head teacher
and drawn conclusions about why students who obtain free school meals do “less well”
in your subject than, say, Spanish.
All of these things are a step in the road to understanding how to use data effectively
and the fact that you are reading this title demonstrates a clear desire to take your use
of data to a higher, more effective level.
In the coming chapters I’ll show you why data analysis based solely on the mean of a
population is dangerously superficial and can lead to misdirected effort and the
potential to miss a more fundamental underlying truth.
Chapter 2
The problems with the mean
Demonstrating that there are “issues” with using the mean of a data set is often the
most instructive way forward.
Consider the following data obtained for a group of year 10 Maths students.
Student L / R Hand Score
A R 80
B R 78
C R 82
D R 84
E R 76
F L 82
G R 81
H L 79
I L 79
J R 81
K L 84
L R 76
M R 81
N R 78
If we take the average of the left handed and the right handed students, we obtain:
Hand Average Score
Left 81
Right 79.7
From this, we conclude that right handed students underperform compared to left
handed – we might even plan further monitoring, investigate the scheme of work to
look for bias and set up a far reaching working party.
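The two averages quoted above can be checked in a few lines. This is a minimal Python sketch of our own, using the fourteen results from the table; the variable names are assumptions chosen for readability.

```python
from statistics import mean

# (student, hand, score) for the year 10 Maths group in the table above.
results = [
    ("A", "R", 80), ("B", "R", 78), ("C", "R", 82), ("D", "R", 84),
    ("E", "R", 76), ("F", "L", 82), ("G", "R", 81), ("H", "L", 79),
    ("I", "L", 79), ("J", "R", 81), ("K", "L", 84), ("L", "R", 76),
    ("M", "R", 81), ("N", "R", 78),
]

# Group the scores by the level of the factor (L or R).
by_hand = {}
for _student, hand, score in results:
    by_hand.setdefault(hand, []).append(score)

means = {hand: mean(scores) for hand, scores in by_hand.items()}
counts = {hand: len(scores) for hand, scores in by_hand.items()}

for hand in sorted(means):
    print(f"{hand}: mean {means[hand]:.1f} from {counts[hand]} students")
```

Printing the group sizes alongside the means is a deliberate choice: a mean built from 4 students deserves rather less trust than one built from 10.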
Statistics in action
If you take any data set, made up from “real” data – and by real, I mean measured from
real people / events, not simulated on a computer, and segment that data into two –
you are likely to see a difference between one group and the other.
In this case, we looked at L and R hands, but the argument holds for any segmentation,
regardless of how ridiculous it sounds.
Call to action
1. The next time you teach any class, survey them for one of the following:
o Xbox or Playstation
o Blackberry or iPhone
o Eastenders vs Coronation Street
o Family Guy vs American Dad
(The choices don’t need to be binary, but at this stage, it will help with the data
analysis)
2. Add this segmentation to the class register.
3. The next time you “test” your learners, split the data into the segments that you
have just defined and calculate the mean for each: (for example)
Console Average Score
Xbox 67
Playstation 83
Ask yourselves the following question – does this show anything meaningful?
Have we just uncovered the route to educational success – “buy everyone a
Playstation” or is there something else going on?
Whilst a contrived example, I am sure from your own experience that this
segmentation and superficial analysis has been undertaken – possibly with the gender
differentials cited in the previous chapter.
Problems with the mean
From the previous example, what exactly are the problems with using the mean?
Some observations stand out:
1. The difference between left and right handed is small – 1.3.
   The question we should ask is: “Is this difference big enough to matter?”
2. There are only 4 left handed students – does this affect the conclusions?
   The question we should ask is: “How much data do you need to draw realistic inferences?”
These issues aside, we are sure that you have drawn conclusions using similarly
analyzed data.
Call to action
Before you read on, either for your own data or the data presented previously, splitting
into Left and Right handedness, use your favourite spreadsheet to draw a bar chart of a
set of results that can be split into two segments. For the purposes of this text, I’ll
assume that you’ve used my data.
The dangers of presumption – pre analyzing the data
The analysis of data by using just the mean is not the only concern for rigorous data
analysis.
When we presume there is a difference between two segments of data, we are
unsurprised when we find it, and are then more likely to accept that difference as
meaningful. After all boys and girls are different, so when your data shows this, it must
be true – right?
Call to action
What presumptions do you make in your data analysis?
1. Would you have expected left and right handed segmentation to produce
different means?
a. Can you think of a pseudo‐pedagogical reason why this might be true?
2. What about other splits of data?
a. Everyone knows that free school meals, linked to poverty affects
attainment – right? Does your data show this difference?
When you analyze your data and find a difference, you are ready to accept it as real
and meaningful. The same is true with gender, SEN and a host of other factors that we
assess.
What do your bar charts show?
Let me show you my plots of the mean data for handedness as a series of bar charts, all
showing the same data:
Firstly, let me assure you that these charts all show the same “numbers” for the left
and right hand segmentation of the data.
Chart “B” is the default MS Excel and OpenOffice formatting of the data as entered.
The only difference between each chart is the scale of the y‐axis.
Chart “A” shows 79.5≤ y ≤81.1, with each division being equal to 0.2
Chart “B” shows 79 ≤ y ≤81.5, with each division being equal to 0.5
Chart “C” shows 0 ≤ y ≤80, with each division being equal to 20
Chart “D” shows 0 ≤ y ≤100, with each division being equal to 20
Quite dramatically charts “A” and “B” emphasize the differences between L and R,
whilst charts “C” and “D” seem to imply the difference is almost nonexistent.
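[Figure: four bar charts, labelled A to D, of mean score by hand, differing only in the scale of the y‐axis]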
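You can put a number on how much the choice of y‐axis exaggerates or hides the difference. The Python sketch below is our own illustration: it works out what fraction of the plot height each bar fills for the axis limits described for charts “A” and “D”, then compares the two bars. The function name bar_fraction is an assumption, not part of any charting package.

```python
def bar_fraction(value, axis_min, axis_max):
    """Fraction of the plot height that a bar of this value fills."""
    return (value - axis_min) / (axis_max - axis_min)


left_mean, right_mean = 81.0, 79.7

# y-axis limits as described in the text for charts "A" and "D".
axes = {"A": (79.5, 81.1), "D": (0.0, 100.0)}

ratios = {}
for chart, (low, high) in axes.items():
    left_bar = bar_fraction(left_mean, low, high)
    right_bar = bar_fraction(right_mean, low, high)
    ratios[chart] = left_bar / right_bar
    print(f"Chart {chart}: L bar is {ratios[chart]:.2f} times the height of the R bar")
```

On chart “A” the L bar is drawn about seven and a half times as tall as the R bar; on chart “D” it is less than 2% taller. The means have not changed at all, only the axis.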
Ethics, politics and “getting your own way”
But which is the correct way to display the data?
At this point, those of you reading this who find the whole concept of data and analysis
abhorrent will likely be thinking “that’s why I hate doing all this stuff” – “see I was right,
it’s way beyond me”, and the most insightful “It’s bloody confusing!”
Oddly for a book aimed at using statistics we are going to tend to agree with the last
statement.
For the four charts shown, there is no “right” answer – heck, there’s not even a “best”
answer.
The surprising thing (and this is what causes the data averse to shiver with
indecision) is that it’s entirely up to you and you can choose the one that makes your
case the strongest.
Say, I had presumed that there was a L/R split in data and analyzed the results – I
would choose chart “A” or “B” to represent my results as they clearly demand taking
the L/R split seriously. Had I on the other hand assumed that there would be no
difference, I would choose “C” or “D” as it backs up my case. Both are strictly “correct”
but I have subtly manipulated the presentation of data to support my case.
The whole point is that our preconceptions will (often subconsciously) guide us
through the data visualization and analysis process. We will change what we do to
support our personal agenda – however careful we are.
Call to action
1. What did your bar chart look like? Which of my examples was it closest to?
2. As part of the self evaluation and action planning process you will have
certainly either constructed or interpreted charts showing results – often
segmented into different groups. Those groups will have likely shown a
difference. Have you presented data to SLT by using either the default or
custom scales – to “make your point clearer”?
What we’ve just demonstrated is that the apparent importance of differences can be
manipulated by just how you construct your charts.
3. How have you constructed charts for last year’s examination analysis meeting
with your SLT? Have you emphasized or played down an effect to influence a
decision or opinion?
However well intentioned you are, I suspect that you will have exerted some
“influence” on the data – even if it was just by using the default settings in Excel –
which in this case seem to imply that there is a huge difference between L and R.
How big an effect / difference is “big enough” to matter?
To try and resolve some of the issues just raised, let’s go back to the data for “L” and
“R” and construct a type of “modified” bar chart⁶, where we are combining the discrete
data of “L” and “R” on the x‐axis, with a continuous y‐axis showing the “score”:
You see two “columns” of data, one for L and one for R, each comprised of a series of
“o” points corresponding to each value. Superimposed on the chart is a marker showing
the mean of L and the mean of R, with a line connecting the two means.
At this point, as was demonstrated, had the data been displayed as a bar chart, you
would have shown that “L” outperforms “R” and depending on the scale you used,
could have either emphasized or downplayed the results.
⁶ This chart was produced in Minitab (a great statistical analysis software package) by tweaking
the “box plot” chart, but something similar can be achieved with MS Excel or plotted straight onto graph paper.
[Figure: Point Plot of Score vs Hand. x‐axis: Left / Right; y‐axis: Score, 75 to 84]
33
Extra information in a “modified” bar chart
What this chart clearly shows is the spread of data for each segment. You can see that
the entire L data sits within the R data.
The range of the R data is greater than that of the L data
There are no values of L that are higher than the highest value of R
There are low values of R, lower than any of the L data
What can be concluded from this chart, is that whilst the means are different, with L
being higher than R, the spread of the data and the low values of R have influenced the
mean value.
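The same observations can be checked numerically. The following is a minimal Python sketch (Python is not used elsewhere in this text, and the names `scores` and `summarise` are illustrative assumptions) that takes the left / right hand scores tabulated in Chapter 3 and reports the lowest score, highest score and mean for each segment:

```python
from statistics import mean

# Scores for the left / right handed example, as tabulated in Chapter 3.
scores = {
    "L": [82, 79, 79, 84],
    "R": [80, 78, 82, 84, 76, 81, 81, 76, 81, 78],
}


def summarise(values):
    """Return the lowest score, highest score and mean of one segment."""
    return min(values), max(values), mean(values)


summary = {hand: summarise(values) for hand, values in scores.items()}

for hand, (low, high, avg) in summary.items():
    print(f"{hand}: lowest={low} highest={high} range={high - low} mean={avg:.1f}")
```

The mean alone hides the fact that the lowest R scores sit below every L score; printing the range alongside it makes that visible without drawing the chart.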
What if those learners with the lowest right hand score just happened to be the SEN
learners in the class? Or, what if those lowest R scores correspond to learners who
have been long term sick, incomers to school, EAL learners?
Call to Action
1. Find a data set that you can segment into two (boy / girl splits work well and
are a constant political/educational debate). You need the actual score for a
class, broken down into learners / gender.
2. Plot the scores as a modified bar chart, one column for each segment (boy /
girl)
3. What does this show you for your data?
Looking at a whole cohort
[Figure: Average points score vs sex. x‐axis: Sex (F, M); y‐axis: Points (0 to 900)]
The figure above shows 2010 data for the average points score for a secondary school,
split by sex. As before, the line joins the means.
The chart shows that girls have a higher average points score than boys (as the line slopes
down, from left to right).
From analysis of the means, the following was presented to SLT for the annual exam
analysis meeting:
Average Points Score
Boy 443
Girl 406
From the mean analysis, it appears that there is a real and big difference between the
boy and girl average points scores.
The modified bar chart starts to add more meaning:
The spread or range of the boy data is greater than that of the girl data
The boy data has far more low scores than the girl data
The girl data has the highest performing students.
Preconceptions again
Again whilst sex is a convenient (and presumptuous) way of explaining difference – and
indeed the means substantiate a conclusion, might it just be that the lowest scoring
learners (who happen to be boys) also happen to be the EAL students? Might it be
equally true that the highest performing girls receive tuition outside of school?
We come back to the question:
How big an effect / difference is “big enough” to matter?
and we add:
How do we tell what the real cause of something is?
Conclusions
We have demonstrated how the mean as a point statistic is a blunt instrument in data
analysis, and can lead to spurious conclusions
Our own preconceptions about “what’s likely” to make a difference (gender) will
influence how we visualise and analyse data.
How we can (often unwittingly) influence / bias perception with the way that we
represent data.
The use of a modified bar chart can begin to shed more light on the data and allow us
to draw safer conclusions.
In the next chapter we will begin to quantify differences to allow us to make firmer,
evidence based data analysis.
Chapter 3
Comparative statistics
Over the previous two chapters we’ve been talking about the mean of data being a
poor summary tool and incomplete when used to compare two segments of data.
We’ve shown how we can draw a chart to help illustrate the difference between means
and how, by tweaking the scales of bar charts, you can magnify or minimize apparent
differences. Ultimately, all of these techniques are qualitative and assessing whether
or not data sets are different has been a matter of choice.
Whilst this might be satisfactory when deciding what the most popular games console
is, surely we can apply more forethought over decisions that are likely to lead to
profound implications to the education of young people.
What we are looking for is a way to quantify how different sets of data are, and an
agreed upon set of standards for assessing whether or not a measured difference is
significant – hence, if the difference is significant it demands attention and solution.
What does significant mean?
It’s important at this point to clarify that a difference is statistically significant if the
observed difference is greater than can be accounted for by random error alone.
T‐tests and p values
For the professional statistician there are a number of measures that can be used to
assess the significance of measurements being different. If we intended to compare a
response to one factor only (say gender), we would use the t‐test, which returns a
probability that the difference between the data sets cannot be distinguished from
random occurrences or accounted for by other factors.
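That mouthful (presented for statistical correctness) can be reduced to:
The probability (%) that the data sets are not really different. This is often referred to
as the p value, and is either a decimal in the range 0.000 to 1.000 or a percentage. The
higher the p value, the less sure we are that the data sets are different. (Strictly, the
p value is the probability of seeing a difference at least as large as the one observed if
the data sets were not really different; the simpler reading above is the working
shorthand used in this text.)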
For example:
If p=0.000 or 0% we would have zero concern that the means were the same.
Or put the other way, we would be totally certain that the means are different.
We would be (1‐p) or 100% confident that the means were different.
If p=0.001 or 0.1%, we would be slightly concerned and not totally confident
that the means were different. We would be (1‐p) or 99.9% confident that the
means are different.
If p=0.005 or 0.5%, we would be more concerned that the means were not
different – We would be (1‐p) or 99.5% confident that the means are different.
If p=0.10 or 10%, we would be quite concerned that the means were not
different. We would be (1‐p) or 90% confident that the means are different.
If p=0.50 or 50%, we would be totally unsure and (1‐p) = 50% would show that
it was 50/50 that the means are different.
Consider the following question – if you wanted me to invest £1,000,000 in your idea to
cure cancer, and you had tested it against a placebo, what value of p would you accept
as sensible evidence for “proving” your cure worked?
Would you accept p=0.10 or only 90% sure that your cure worked?
Would you accept p=0.005 or p=0.001?
Statisticians agree that a p value of 0.005 or less is needed for “proof” that a
difference is real and hence defined as significant.
P values in the range p=0.01 to p=0.006 show increasing evidence that a
difference might be real and probably warrants further analysis
P values in the range p=0.05 to 0.01 show a hint that there is a real difference.
At p=0.05, we would be 95% sure there is a real difference, or there’s a 5%
chance that the means are actually the same. This p=0.05 value corresponds
to the limit of “significance” – a p‐value of p=0.05 or less indicates
significance of a difference between two levels of a factor.
P values greater than p=0.05 are rejected as we are less than 95% sure the
data sets are different.
This might sound draconian, but these levels of significance are used by drug
companies to “prove” a cure works, by the courts and police to convict those accused
of crimes and by all serious scientists trying to prove that A caused B or C worked
better than D – so if it works for them, it should work for us.
Calculating significance using Excel
You can use Excel to calculate the t‐test p values. The data however does need to be
laid out in a particular manner: From our previous left / right handed example:
Student L / R Hand Score
A R 80
B R 78
C R 82
D R 84
E R 76
F L 82
G R 81
H L 79
I L 79
J R 81
K L 84
L R 76
M R 81
N R 78
For Excel to compute the t‐test, we need to have each response corresponding to a
particular level of a factor in a different column. In this case, the data for left hand goes in
a different column to the data for right hand, so some manipulation is needed:
In this screen, data for R has been placed in C2 : C11, whilst data for L has been placed
in D2: D5.
Excel command for T‐testing
The formula for Excel to calculate the t‐test is TTEST(range 1, range 2, tails, type) – which
returns the p value as seen in D13 above.
Range 1 and range 2 correspond to the two data sets.
Tails can be “1” or “2” – corresponding to a one‐tailed or a two‐tailed test. As
we want to detect a difference in either direction (higher or lower), we will
always pick “2”.
Type can be “1”, “2” or “3” – “1” for paired data, “2” for unpaired data with
equal variances and “3” for unpaired data with unequal variances.
The difference between these is quite involved and difficult to explain briefly.
It’s sufficient to say that given the data that we are analysing, we will always
choose “3”.
The p value of 0.414 indicates a 41.4% chance that the means are actually the same.
Or as we discussed previously, a 1‐p or 58.6% chance that the means are different.
(Remember what we are talking about here – this almost represents a 50/50 case –
that the data is different OR not)
This is well above the value of statistical significance (p=0.05) and the p‐value
demands that we treat the means of these data sets as “not different”.
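For readers who want to see what happens behind the TTEST formula, here is a minimal Python sketch of a two‐tailed, unequal‐variance (“type 3”) t‐test on the left / right data. It is an illustration under stated assumptions, not Excel’s own implementation: the function names `t_pdf` and `welch_t_test` are made up for this example, and the p value is obtained by numerically integrating the t distribution with Simpson’s rule rather than by calling a statistics library.

```python
import math
from statistics import mean, variance

# (student, hand, score) rows, laid out as in the table above.
rows = [
    ("A", "R", 80), ("B", "R", 78), ("C", "R", 82), ("D", "R", 84),
    ("E", "R", 76), ("F", "L", 82), ("G", "R", 81), ("H", "L", 79),
    ("I", "L", 79), ("J", "R", 81), ("K", "L", 84), ("L", "R", 76),
    ("M", "R", 81), ("N", "R", 78),
]

# One list per level of the factor, like the two Excel columns.
left = [score for _, hand, score in rows if hand == "L"]
right = [score for _, hand, score in rows if hand == "R"]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def welch_t_test(a, b, steps=2000):
    """Two-tailed, unequal-variance t-test; returns (t, df, p)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Simpson's rule (steps must be even) for the area between 0 and |t|.
    h = abs(t) / steps
    area = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # Whatever lies outside -|t| .. +|t| is the two-tailed p value.
    return t, df, 1 - 2 * area


t_stat, df, p_value = welch_t_test(left, right)
print(f"t = {t_stat:.3f}, df = {df:.2f}, p = {p_value:.3f}")
```

The p value printed should agree with the 0.414 returned by Excel to within rounding, which is a useful cross‐check that the data were laid out correctly.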
Contrast this numerical value with the previous charts we created:
Whilst we might have concluded that the means were the same or “not likely to be
different”, clearly this was open to interpretation / bias and was left to my decision
over how we drew the charts.
Now we have a numerical value to assess just how different a difference actually
is.
Call to action
1. Revisit the data you collected previously.
2. For the factors that you were considering, put one value of the response
corresponding to one level of a factor (boy) in one column and the other level
(girl) into another column.
3. Calculate the TTEST value, using the ranges for the data, “2” for the tails and
“3” for the type.
4. What is the p value?
5. Does this show a significant difference between the data sets or do you
conclude that they are the same?
6. Does this disagree with any analysis you previously undertook?
7. Next time you split a data set into two groups, calculate a t‐test to see if the
means really are different.
Conclusions
In this chapter we have introduced the concept of calculating a value that shows
whether or not the difference between two means is caused by the factor being
measured or could be down to random chance or some other, non‐measured factors.
We introduced the concept of the p‐value, which corresponds to a probability or
percentage that the difference between means is real or just down to chance.
P values less than p=0.001 show a 99.9% chance that the means really are different and
the factor you are measuring is responsible
P values of p=0.05 are considered the critical value and correspond to a 95% chance
that the factor you are measuring is responsible.
P values greater than p=0.05 are rejected as we are less than 95% certain that the
factor being measured is responsible.
The t‐test can be calculated in Excel with the TTEST(range 1, range 2, tails, type)
formula entered into a cell. Tails is normally “2” and type “3”
In the next chapter we’ll look at a more useful test that allows you to look at factors at
more than two levels, such as previous primary school.
Chapter 4
Factors with multiple levels.
So far we’ve looked at assessing responses against factors that exist in two levels –
splitting data sets by boy/girl, looking at left or right handed, free school meals or not.
To process a t‐test in Excel required the data to be laid out in a specific manner, but did
result in a quantifiable measure of the difference between means.
Multi level factors
But what about factors that have multiple levels – such as previous primary school? Or
factors that are continuous in nature, such as reading or spelling age? Simply put, the
t‐test doesn’t work for factors in more than two levels.
Combine levels to make a binary solution
The first and possibly the simplest solution is to re‐code levels into a binary set – say by
grouping reading age into 10 ≤ x ≤ 12 and 12 < x ≤ 14 and then perform a t‐test.
It doesn’t matter what we call these levels ‐ “1” and “2” or “Low” and “High” are
traditionally used.
Once we have the factor levels, we lay out the data as we did before in Excel, with one
column for each factor level.
In the following example, we have coded reading age using this scheme:
8 ≤ x ≤ 12 = 1 and x > 12 = 2
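The coding scheme above can be sketched in a few lines of Python. This is an illustration only: the function name `recode_reading_age` and the four (reading age, points score) pairs are hypothetical and are not taken from the school data used in this chapter.

```python
def recode_reading_age(age):
    """Code a reading age (years) as bin 1 (8 to 12 inclusive) or bin 2 (over 12)."""
    if age < 8:
        raise ValueError("reading age below 8 is outside the coding scheme")
    return 1 if age <= 12 else 2


# Hypothetical (reading age, points score) pairs, for illustration only.
learners = [(9.5, 410), (12.0, 455), (12.5, 480), (14.02, 440)]

# One column of points scores per factor level, as Excel's TTEST expects.
columns = {1: [], 2: []}
for age, points in learners:
    columns[recode_reading_age(age)].append(points)

print(columns)
```

Writing the boundary down as code forces a decision about which bin a reading age of exactly 12 belongs to, something that is easy to get inconsistent when re‐coding by hand.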
If we take the means of the bins, we conclude:
Bin Mean
"1" 449
"2" 492
Surely a 43 point difference between the average points score for the two different
reading age “bins” represents something that we must take seriously?
Let’s look at the data:
Looks encouraging, that difference of 43 surely looks impressive and stands out.
Remember what we said about scales? If we draw the same chart on axes starting at 0:
Now, the difference between the two groups looks less impressive than before –
maybe they’re not that different.
Calculating t‐test for “binned” data
As before, let’s reorganize the data and get Excel to calculate the t‐test.
The t‐test p value of 0.1987 indicates a 19.87%, say 20%, chance that the means are actually the
same and there is no difference between the reading age bins. Put another way, there
is a 1‐p or nearly 80% chance that the means are actually different, and we cannot
conclude that the factor we are assessing is solely responsible for the difference.
Now 80% sounds positive – but remember we agreed that p=0.05 was the upper limit,
above which we cannot be certain that the factor is causing the difference in the
response.
Limits of the t‐test
I know that sounds like a bunch of statistical waffle, but the wording is important. The
t‐test does not rule out reading age having an effect on points score, but the low
significance of p=0.1987, points to some other factor either jointly being responsible or
(as is likely) more significant in explaining the difference between the data.
In our case, it means we should keep analyzing the data to find a more fundamental
difference.
As before, let’s plot a modified bar chart for the bins “1” and “2”, joining the means for
each level. In this case, it proves a particularly useful chart as it clearly shows that the
mean for level “2” of reading age is pulled upward by the three high points scores.
[Figure: Boxplot of Points Score vs Re-coded. x‐axis: Re-coded (1, 2); y‐axis: Points Score (300 to 650)]
Multi level factors
We can use the same idea of binning‐up factor levels to ease analysis of other factors –
such as attendance data for example.
However, what if we don’t want to combine factors into just two levels? In the case of
attendance data, we might want:
‐1 = less than 80
0 = 80 to 89.99
1 = 90 to 94.99
2 = 95+
We can’t use the t‐test as it only works to discriminate between factors that are in two
levels. We need a different statistical tool – analysis of variance.
Analysis of variance
You’ve arrived at the point in the statistics journey where you are about to leave the
“core” functions of Excel behind. Whilst it’s true that you can get Excel to calculate
analysis of variance, it’s not an easy process, the preparation of the data can be
confusing and the results leave a lot to be desired.
At this point I strongly suggest that you get hold of a copy of Minitab7 or download the
excellent Daniels XL Toolbox8 – a free add‐in to Excel that will enhance its native
statistics capability.
However, even Daniels XL Toolbox will run out of steam in the next chapter, so maybe
it’s time to break the Excel apron strings ‐ ;‐)
7 Or alternative statistics package. See the preface to this book for how to obtain Minitab for a reasonable
price.
8 http://xltoolbox.sourceforge.net/
Does attendance affect attainment?
Anyways, let’s push on and look at a continuous variable, attendance, and try to
answer the question – “Does attendance affect attainment?”. Received wisdom is,
“surely yes, attendance affects attainment and the more you attend the higher the
attainment” – but ask yourself whether you’ve actually tested this “wisdom”.
As we have two data sets that are continuous, we can get a feel for what’s going on by
plotting a traditional scatter graph of attendance (x) against points score (y)
Does that help? Is there a link between attendance and attainment?
Fitting a trend line to Excel data
Excel allows us to fit a line between the data points that “best” represents the data.
How well that line fits is shown by the R² value – the closer it is to 1, the better the fit,
with anything above 0.8 indicating a “good” fit to the data.
Create a scatter graph as normal. Once created, right click on a data point to bring up
the context menu:
Select “Add Trendline”.
From the next context menu, you can choose what kind of line to fit – in this case we
are looking for a straight line, so choose “linear”:
Leave most of the settings to the default, but at the bottom, before you click the CLOSE
button, put a check as indicated:
The full context menu for adding a trend line to an Excel chart:
From our data, the following linear trend line is fitted.
Using R² to check for “goodness” of fit
The R² value of 0.0093 indicates that the line does not represent the data well – in fact
anything below 0.80 is regarded as “poor”.
In fact when R² = 0, the line fits the data no better than a horizontal line drawn through
the mean “y” value.
The closer R² is to 1, the better we can use the line and its equation to predict values –
in this case, if R²=1 we could 100% predict a points score from the attendance.
Clearly this is not the case for our data.
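To make the meaning of R² concrete, here is a minimal Python sketch of the calculation behind a linear trend line. The function name `r_squared` and the two tiny data sets are hypothetical, chosen only to show the two extremes described above: a perfect straight line and a set of points with no linear trend at all. This is the standard least‐squares calculation, offered as an illustration rather than as a description of Excel’s internals.

```python
from statistics import mean


def r_squared(xs, ys):
    """R squared of the least-squares straight line through the points (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Variation left over after fitting the line ...
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # ... compared with the variation around a horizontal line at the mean.
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: a perfect straight line, and points with no linear trend.
perfect = r_squared([1, 2, 3, 4], [10, 20, 30, 40])
no_trend = r_squared([1, 2, 3, 4], [5, 7, 7, 5])

print(f"perfect line: R squared = {perfect:.2f}")
print(f"no trend:     R squared = {no_trend:.2f}")
```

In the second data set the best straight line is horizontal, so it does no better than the mean “y” value and R² comes out at 0.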
So does attendance matter?
Let’s bin up the attendance figures as previously agreed:
‐1 = less than 80
0 = 80 to 89.99
1 = 90 to 94.99
2 = 95+
Sample of the original data and “binned” or “coded” figures.
Attendance Coded Points Attendance Coded Points
90.35 1 479 98.07 2 548
91.32 1 350 100 2 440
100 2 440 81.35 0 413
99.36 2 597 76.53 ‐1 695
76.85 ‐1 314 95.82 2 752
98.07 2 698 89.71 0 502
100 2 440 93.25 1 834
88.42 0 614 78.14 ‐1 389
95.18 2 566 84.24 0 290
96.14 2 631 59.81 ‐1 269
100 2 440 85.85 0 425
100 2 284 95.18 2 292
96.14 2 469 75.56 ‐1 410
98.71 2 342 100 2 262
100 2 400 63.02 ‐1 538
89.97 0 426 96.78 2 612
94.21 1 626 100 2 80
94.21 1 552 87.14 0 158
88.75 0 467 92.93 1 494
92.93 1 519 89.71 0 509
Let’s calculate the means of each bin to assess if there is any variation between
attendance figures:
Binned Mean Points
‐1 435.8
0 422.7
1 550.6
2 460.7
What the mean analysis shows, is a difference of 25 points in going from the lowest sub
80% attendance to the highest 95%+ attendance. But, is this a big enough effect to
conclude that attendance matters?
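The binning and the mean for each bin can be reproduced with a short Python sketch using the forty attendance / points pairs from the table above. The function name `code_attendance` is an illustrative assumption; the bin boundaries are the ones agreed earlier.

```python
from collections import defaultdict
from statistics import mean


def code_attendance(pct):
    """Code an attendance percentage into the bins -1, 0, 1 and 2."""
    if pct < 80:
        return -1
    if pct < 90:
        return 0
    if pct < 95:
        return 1
    return 2


# (attendance %, points) pairs from the sample table above.
data = [
    (90.35, 479), (91.32, 350), (100, 440), (99.36, 597), (76.85, 314),
    (98.07, 698), (100, 440), (88.42, 614), (95.18, 566), (96.14, 631),
    (100, 440), (100, 284), (96.14, 469), (98.71, 342), (100, 400),
    (89.97, 426), (94.21, 626), (94.21, 552), (88.75, 467), (92.93, 519),
    (98.07, 548), (100, 440), (81.35, 413), (76.53, 695), (95.82, 752),
    (89.71, 502), (93.25, 834), (78.14, 389), (84.24, 290), (59.81, 269),
    (85.85, 425), (95.18, 292), (75.56, 410), (100, 262), (63.02, 538),
    (96.78, 612), (100, 80), (87.14, 158), (92.93, 494), (89.71, 509),
]

# Collect the points scores that fall into each attendance bin.
bins = defaultdict(list)
for attendance, points in data:
    bins[code_attendance(attendance)].append(points)

bin_means = {code: round(mean(values), 1) for code, values in bins.items()}

for code in sorted(bin_means):
    print(f"bin {code:>2}: n={len(bins[code]):>2} mean={bin_means[code]}")
```

Printing the count next to each mean is a useful habit: here the top bin holds 18 learners while the bottom bin holds only 6, which matters when judging how much weight each mean can bear.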
If we plot the binned attendance against points score, we can see that “something” is
going on, and the connected means show some variation
[Figure: Modified Bar Chart of Points vs Binned Attendance. x‐axis: Binned attendance (‐1, 0, 1, 2); y‐axis: Points (0 to 900)]
At this point, the observant reader might ask “Doesn’t all this depend on the size of
the bins?” – Let’s see....
If we re‐bin the data, into ‐1 (less than 90) and +1 (90 and greater) we find;
Binned Mean Points
‐1 427.9
1 485.9
This time, there’s nearly 60 points of difference between the lowest and highest
attendance – surely this is significant?
At this point we’ve reduced the factors to a binary split, so we can use the t‐test to see
if the difference between the means is real and significant.
The preparation of the data is left as an exercise for the reader, but by binning into ‐1
and +1, separating the data into columns and running the Excel TTEST function, we
obtain a value of p=0.243.
This p value is well above the value of p=0.05 for us to consider the means as
statistically different and we conclude, that there is no statistical difference between
the average points score, when we consider the factor “attendance”.
However, this is not where we wanted to be – we’ve reduced a factor to a binary
split.
We’re going to stick with the original binned data, as they correspond to how we track
learners in school:
‐1 = less than 80
0 = 80 to 89.99
1 = 90 to 94.99
2 = 95+
You’ll need Daniels XL toolbox or Minitab at this stage. Download a copy for MS
Excel from: http://xltoolbox.sourceforge.net/
One way Analysis of Variance (ANOVA)
The statistical test that we’re going to perform is called the One‐way analysis of
variance or, as it’s usually referred to, ANOVA.
ANOVA is similar in function (but mathematically much more complex) to the t‐test,
except ANOVA can test whether or not two or more means are different. ANOVA tests
produce a p value which can be interpreted in the same manner as the t‐test.
This is ideal for our case – ANOVA will reduce our problem of determining if attendance
matters to the familiar task of interpreting a p‐value.
As we’re going to use Daniels XL toolbox or Minitab, data this time can be laid out as
you would receive it from your examinations officer, without further processing.
That is, a list of information with headings across the top – no preparation will be
required. Your data will be laid out with one row per pupil – much easier to deal with
than before.
From the Add‐In menu in Excel, select XL Toolbox,
and navigate to the Statistics > ANOVA menu
From the One‐Way Analysis of Variance (ANOVA)
menu that appears, select the ranges for the
input data.
Click in the box once and then drag down
over the range of the bins – not including
the heading
Click in the box once and then drag down
over the range of the data – not including
the heading
You should find that the numerical range of each is the same – in this case, $2 to $41 –
but your data might be different, and they don’t need to be the same size.
Once the ranges are set up, select Run ANOVA.
This dialogue shows a number of things, but the most important for us are:
The bin names (‐1,0, 1 and 2), their counts & means
ANOVA Results p‐value, which allows us to comment on the significance.
In our case, P=0.41370, which is well above P=0.05, indicating that there is no
statistically significant difference between the means and any differences cannot be
ascribed to the attendance levels alone.
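To show what the add‐in is doing, here is a minimal Python sketch of a one‐way ANOVA on the binned attendance data. It follows the standard textbook calculation (variation between bins compared with variation within bins) and is not a description of how XL Toolbox or Minitab are implemented. The function names `f_upper_tail` and `one_way_anova` are made up for this example, and the p value is found by numerical integration (Simpson’s rule), assuming at least two degrees of freedom for the error term.

```python
import math
from statistics import mean

# Points scores from the sample table, grouped by attendance bin.
groups = {
    -1: [314, 695, 389, 269, 410, 538],
    0: [614, 426, 467, 413, 502, 290, 425, 158, 509],
    1: [479, 350, 626, 552, 519, 834, 494],
    2: [440, 597, 698, 440, 566, 631, 440, 284, 469, 342, 400,
        548, 440, 752, 292, 262, 612, 80],
}


def f_upper_tail(f, d1, d2, steps=4000):
    """P(F > f) for an F distribution with (d1, d2) degrees of freedom, d2 >= 2.

    Uses P(F > f) = I_x(d2/2, d1/2) with x = d2 / (d2 + d1 * f), where I_x is
    the regularised incomplete beta function, integrated by Simpson's rule.
    """
    a, b = d2 / 2, d1 / 2
    x = d2 / (d2 + d1 * f)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(u):
        return u ** (a - 1) * (1 - u) ** (b - 1)

    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3 / math.exp(log_beta)


def one_way_anova(levels):
    """Return (F, p) for a dict mapping each factor level to its responses."""
    all_values = [v for values in levels.values() for v in values]
    grand_mean = mean(all_values)
    level_means = {k: mean(v) for k, v in levels.items()}
    # Variation of the level means around the grand mean.
    ss_between = sum(len(v) * (level_means[k] - grand_mean) ** 2
                     for k, v in levels.items())
    # Variation of individual scores around their own level mean.
    ss_within = sum((x - level_means[k]) ** 2
                    for k, v in levels.items() for x in v)
    df_between = len(levels) - 1
    df_within = len(all_values) - len(levels)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, f_upper_tail(f_stat, df_between, df_within)


f_stat, p_value = one_way_anova(groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

An F value close to 1 says that the bins differ from each other about as much as learners within a bin differ from one another, which is why the p value is nowhere near 0.05.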
Non numeric multi level factors
We started this text by looking at gender and handedness, both were binary non
numeric factors (either one value or another). Some factors under consideration can
be non numerical and text based – originating primary school9 for example.
Our fictional secondary school has 4 feeder primaries: Elm Tree, Everymans, Oldberry
and St Judes.
The average points score at the end of Year 11 for a group of learners is:
Primary Points Primary Points Primary Points Primary Points
St Judes 314 St Judes 698 Elm Tree 509 St Judes 494
St Judes 695 St Judes 440 St Judes 614 Elm Tree 440
St Judes 389 St Judes 566 St Judes 426 St Judes 597
Elm Tree 269 Oldberry 631 St Judes 467 St Judes 698
St Judes 410 Oldberry 440 Elm Tree 413 St Judes 440
Elm Tree 400 Everymans 501 St Judes 502 Everymans 566
St Judes 314 Oldberry 469 Oldberry 290 Everymans 631
St Judes 614 St Judes 342 Elm Tree 425 St Judes 440
St Judes 426 Oldberry 400 Elm Tree 158 St Judes 284
St Judes 467 Oldberry 626 St Judes 509 Oldberry 469
Oldberry 413 Oldberry 552 St Judes 479 St Judes 342
Everymans 695 Oldberry 519 Everymans 490 St Judes 400
Everymans 502 St Judes 548 St Judes 626 St Judes 548
St Judes 389 Oldberry 440 Elm Tree 401 Oldberry 440
St Judes 290 Everymans 752 Oldberry 519 Oldberry 752
St Judes 269 Everymans 834 St Judes 834 Oldberry 292
Elm Tree 425 St Judes 292 Oldberry 494 Oldberry 262
Everymans 410 Oldberry 262 Oldberry 350 Oldberry 612
St Judes 538 Elm Tree 612 Oldberry 440 Elm Tree 80
St Judes 158 Everymans 540 Oldberry 597
9 At this point, I need to be clear – I’m not suggesting a blame culture between Primary and Secondary,
more, the fact that we have this data in secondary and it can be instructive to see if and where a response
can be split by a factor.
Firing up Excel and the XL Toolbox we place the data in two columns, one for feeder
primary and the other for points score. Navigating through XL Toolbox we run an
ANOVA:
What this ANOVA shows us, with a P value of p=0.0089 is that feeder primary is more
than 99% certain to have an effect upon the average points score at the end of year 11.
What it doesn’t show is where this variation actually is. Are all the schools different, or
just one school different from the rest?
Let’s plot a modified bar chart to see:
[Figure: Modified Bar Chart of Points vs Primary. x‐axis: Primary (Elm Tree, Everymans, Oldberry, St Judes); y‐axis: Points (0 to 900)]
The “difference” is likely to be between Elm Tree and Everymans. But, being the good
statistician we now want to ask more rounded questions:
Is Everymans different to Oldberry & St Judes?
Is Elm Tree different to Oldberry?
Fortunately, tests exist to quantify this difference.
If the p‐value of the ANOVA indicates a statistically significant difference, (indicated by
* or ** next to the value), an additional tab at the top of the window is active. Select
this tab:
The window that appears allows you to test for significance between the levels of the
factors previously analyzed for the ANOVA test.
Leaving the default “Bonferroni‐Holm” (named after the statisticians who devised the
test) you can click on each level of factor in the “Compare” column and look how
different that is to other levels – importantly for us, the dialogue displays the
significance.
On this screen, click on “Produce report”, which will summarise this test in an easy to
read table.
Posthoc test: Bonferroni‐Holm
Group 1 Group 2 Critical P Significant?
Elm Tree Everymans 0.008333333 0.002662327 Yes
Oldberry Everymans 0.01 0.017707646 No
St Judes Everymans 0.0125 0.01989173 No
St Judes Elm Tree 0.016666667 0.074365767 No
Elm Tree Oldberry 0.025 0.082440719 No
St Judes Oldberry 0.05 0.96789046 No
(Here, the significance of the P value is slightly different than before – if the value of p
is less than the displayed “critical value”, the difference is significant.)
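The “Critical” column in the report follows a simple pattern, which this minimal Python sketch reproduces. It assumes the usual Holm step‐down rule with an overall significance level of 0.05: the smallest p value is compared with 0.05 / 6, the next with 0.05 / 5, and so on, and once one comparison fails every later one is treated as not significant. The function name `holm_critical_values` is illustrative, and the sketch says nothing about how the add‐in calculates the individual p values.

```python
def holm_critical_values(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of (p, critical, significant) tuples, ordered from the
    smallest p value to the largest.
    """
    ordered = sorted(p_values)
    m = len(ordered)
    results = []
    still_rejecting = True
    for rank, p in enumerate(ordered):
        critical = alpha / (m - rank)
        # Once one comparison fails, all later comparisons are non-significant.
        still_rejecting = still_rejecting and p < critical
        results.append((p, critical, still_rejecting))
    return results


# P values from the post hoc report above.
report_p = [0.002662327, 0.017707646, 0.01989173,
            0.074365767, 0.082440719, 0.96789046]

results = holm_critical_values(report_p)

for p, critical, significant in results:
    print(f"P = {p:.6f}  critical = {critical:.6f}  "
          f"{'Yes' if significant else 'No'}")
```

The shrinking denominator is the price of making six comparisons at once: the more pairs you test, the stricter the threshold for the strongest result.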
We can see that for our data, only the Elm Tree – Everymans difference is significant,
whilst the Oldberry – Everymans and St Judes – Everymans differences are approaching significance.
Whilst our modified bar chart hinted at this before, we now have a hard and fast figure
that describes the difference between the primary schools.
Call to action
Now that we’ve got some real statistical tests in our tool kit, go and find your master
data set for your school / department / class.
Most schools will have spreadsheets of such data, and they probably look something
like this:
Name Sex SEN FSM CATs Att% Feeder Read Maths English Science Overall Points
Adams, Jon M NA N 119 90.35 St Judes 14.02 30 35 40 440
See if you can answer the following questions from your own data:
1. Are the overall results for your school different for gender? Is this a significant
difference ?
a. (TTEST and P value)
b. Repeat the analysis for free school meals (FSM)
2. How well does CATS, (or other base line data), attendance or reading age
predict Maths, English, Science (insert subjects that you have data for)?
a. (Scatter graph for continuous data and fit a trend line. Check R² value)
3. Create some binned data (CATs, Feeder School) and use ANOVA to check the
significance of a multi leveled factor.
a. Use Bonferroni‐Holm to check for differences between levels of a
factor
Pause for breath ……..
At this point, you’ve come a long way. Instead of using the means of responses to
describe (possibly erroneous) differences between the effects of factor levels, you’ve
just used some real statistical tests (TTEST and ANOVA) to provide you with evidence
that is more than just a “hunch”.
Questions to reflect on
1. Did any of your analysis contradict your preconceptions?
2. Did you show that gender was statistically significant overall? What about
gender for Maths, English, Science?
3. Do learners from any of your feeder primaries perform significantly differently
from learners from the others? Does this surprise you?
This is the beauty of simple statistical tests – you can ask the “What if” questions and
very quickly get an answer.
But, and isn’t there always a but – from the factors listed how do you decide which is
the most important and most significant in driving a response?
Name Sex SEN FSM CATs Att% Feeder Read Overall Points
Adams, Jon M NA N 119 90.35 St Judes 14.02 440
And for that, we need yet another tool – this time, the final one we’ll introduce and the
“most useful”, generic test available. Say hello to the General Linear Model
Conclusions
We’ve covered a lot of ground in this chapter. Starting with the t‐test previously
described we’ve looked at:
Grouping or binning factor levels to allow us to continue to use the t‐test and the
familiar p value for significance
How we can use Excel and trend lines to explore the relationship between continuous
data.
We looked at the R² value and used it to decide how “well” a trend line matched the
data. R² = 0.80 is the agreed upon limit; below this the fit is described as “poor”.
How continuous data can also be binned up to allow t‐tests to differentiate between
binary leveled factors
We’ve introduced the concept of One‐way analysis of variance (ANOVA), which allows
us to test for significance between multi level factors.
We looked at extending this ANOVA to explore differences between the levels of
factors and how to assess the significance of these differences.
We explored Daniels XL Toolbox, a free add‐in to Excel which makes calculating ANOVA
much more straight forward.
Chapter 5
General Linear Model (GLM)
To perform the GLM test you will need a dedicated statistics package like Minitab
and the sophistication of the analysis is beyond what’s possible within Excel. It even
beats Daniels XL Toolbox.
Regardless of how it’s calculated, what a GLM does is clever and essential when
exploring a data set – it allows you to assess what’s going on without any
preconceptions over what factors affect the response you are measuring. With TTEST
and ANOVA we went looking for a difference caused by changing the levels of a factor
and assessed if this produced a statistical difference in the response – we “assumed”
that there might be a difference and went looking for it.
A GLM is different – it analyses all the factors and levels that you input and returns the
significance (p value) of each factor’s influence on the response. More than that
though, it then assesses how well you can use these factors to predict the response value
(R²).
The clever part (as far as we are concerned) comes from assessing the p‐values for each
factor and the overall R² value.
The same rules over p‐value as we used in TTEST and ANOVA apply here. P=0.05 is
the upper limit of where we say a factor is significant.
The same rules apply over R² – the higher the value, the better the fit of the factors and
the closer we are to accounting for all the variation. What this means is that if we have
a low R² value, it probably means that we are “missing” a factor in our analysis and
should consider what else we could bring in (spelling age, FFT, teacher).
Constructing a GLM
Assume that we have data laid out in a similar pattern to before:
Name Sex SEN FSM CATs Att% Feeder LAC Read Overall Points
Adams, Jon M NA N 119 90.35 St Judes Yes 14.02 440
We’ve binned up attendance and reading age as we did earlier in the text.
The GLM command exists under the Minitab > Stat > ANOVA menu:
Once the GLM window is active, you simply select the “response” – in our case the
total points (but could equally be the scores for just your subject), and build the
“model”.
In the first instance the model is just the factors that we are assessing. You can
construct a more sophisticated model later.
The output of GLM will depend on the package you are using, but for Minitab, the
output for our data was:
Analysis of Variance for POINTS, using Adjusted SS for Tests
Source            DF   Seq SS   Adj SS  Adj MS      F      P
Sex                1    97779    24701   24701   1.33  0.252
Previous School    6   162246    80249   13375   0.72  0.635
Attendance         3   147096   181165   60388   3.25  0.026
Reading Age        3   129266    82841   27614   1.49  0.224
FSM Eligible       1     1898     2340    2340   0.13  0.724
LAC                1    14316    12533   12533   0.67  0.414
SEN                3   563318   563318  187773  10.10  0.000
Error             86  1598457  1598457   18587
Total            104  2714376
S = 136.333   R-Sq = 41.11%   R-Sq(adj) = 28.79%
This looks very busy, but we are only interested in the p‐values and the overall R2 value.
Following our previous rules, only our binned Attendance (p=0.026) and the SEN status of our
learners (p=0.000) are significant factors in determining the average points score.
R2 = 41.11%, showing that these factors account for only 41% of the variation seen in
the data – implying that there is likely to be something “else” that we haven’t
accounted for.
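The R-Sq and R-Sq(adj) figures can be checked by hand from the Error and Total rows of the Minitab output above. A minimal sketch in Python, using the standard definitions of the two quantities (the function names are ours, not Minitab's):

```python
def r_squared(ss_error, ss_total):
    """Share of the total variation that the model accounts for."""
    return 1.0 - ss_error / ss_total


def adjusted_r_squared(ss_error, df_error, ss_total, df_total):
    """R-Sq penalised for the number of terms in the model."""
    return 1.0 - (ss_error / df_error) / (ss_total / df_total)


# Error and Total rows copied from the Minitab output above.
SS_ERROR, DF_ERROR = 1598457, 86
SS_TOTAL, DF_TOTAL = 2714376, 104

r2 = r_squared(SS_ERROR, SS_TOTAL)
r2_adj = adjusted_r_squared(SS_ERROR, DF_ERROR, SS_TOTAL, DF_TOTAL)

print(f"R-Sq      = {r2:.2%}")
print(f"R-Sq(adj) = {r2_adj:.2%}")
```

The adjusted figure is lower because it charges the model for every degree of freedom it uses, which is why it is the fairer number to quote when comparing models with different numbers of factors.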
Deeper analysis
Now that the GLM has identified SEN and Attendance as being significant compared to
the other factors, we can dig deeper into each and run ANOVA or TTest as appropriate.
SEN
Using Minitab to run ANOVA is no different to Daniels XL Toolbox, but the output is
slightly different:
Source   DF       SS      MS      F      P
SEN       3   775711  258570  12.64  0.000
Error   147  3007614   20460
Total   150  3783326

S = 143.0   R-Sq = 20.50%   R-Sq(adj) = 18.88%

Individual 99.5% CIs For Mean Based on Pooled StDev

Level              N   Mean  StDev  -------+---------+---------+---------+--
No                94  472.2  139.6                      (---*---)
school action     13  407.9  121.5        (-----------*----------)
school action +   36  300.5  158.0  (------*------)
statement of SEN   8  449.0  143.8         (--------------*-------------)
                                    -------+---------+---------+---------+--
                                          300       400       500       600
What this shows us:
The p‐value is reported as p=0.000, which means p < 0.0005 rather than exactly zero: a
highly statistically significant result. (The p‐value from ANOVA can be different from the
GLM because during ANOVA we are only considering the factor itself, but in GLM we are
considering it along with all the other factors.)
R2 of 20.5%, indicating that SEN alone accounts for nearly 21% of the variation
in the data.
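For readers curious about where the SS, MS, F and R-Sq columns come from, here is a minimal Python sketch of the one-way ANOVA arithmetic. The three groups of scores are hypothetical, chosen so the sums are easy to follow; they are not our school data. Turning F into a p-value needs the F distribution, which a statistics package supplies, so that step is left out.

```python
def one_way_anova(groups):
    """One-way ANOVA arithmetic for a dict of {level: [scores, ...]}."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    means = {level: sum(s) / len(s) for level, s in groups.items()}

    # Variation between the group means (the factor row in Minitab).
    ss_factor = sum(len(s) * (means[level] - grand_mean) ** 2
                    for level, s in groups.items())
    # Variation of learners around their own group mean (the Error row).
    ss_error = sum((x - means[level]) ** 2
                   for level, s in groups.items() for x in s)

    df_factor = len(groups) - 1
    df_error = len(all_scores) - len(groups)
    ms_factor = ss_factor / df_factor
    ms_error = ss_error / df_error
    return {
        "df": (df_factor, df_error),
        "ss": (ss_factor, ss_error),
        "ms": (ms_factor, ms_error),
        "f": ms_factor / ms_error,
        "r_sq": ss_factor / (ss_factor + ss_error),
    }


# Hypothetical points scores, three learners per SEN level.
example = {
    "No": [480, 500, 520],
    "school action": [380, 400, 420],
    "school action +": [280, 300, 320],
}
result = one_way_anova(example)
print(result)

# The same ratio applied to the MS column of the Minitab table above.
f_from_table = 258570 / 20460
print(f"F from the SEN table = {f_from_table:.2f}")
```

The last two lines apply the same MS ratio to the Minitab figures, which is a quick way to convince yourself that F is nothing more than "variation between groups" divided by "variation within groups".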
Minitab analysis (unlike MS Excel / Daniels XL Toolbox) automatically charts the data,
showing the mean and a 99.5% confidence interval for each group (CI – the range in which
we would expect the true group mean to lie). Working with the interval for the mean, rather
than individual scores, reduces the impact of excessively high / low values.
From this analysis, we can see that No SEN and School Action + are clearly different:
their CIs don’t overlap, and this is the cause of the low p‐value.
The small cohort size for SEN statements (8) results in huge confidence intervals – meaning
that we can’t draw conclusions on the statemented learners. However, if we remove
them from the analysis, the following picture emerges:
Level              N   Mean  StDev  -------+---------+---------+---------+--
No                94  472.2  139.6                      (---*---)
school action     13  407.9  121.5        (-----------*----------)
school action +   36  300.5  158.0  (------*------)
                                    -------+---------+---------+---------+--
                                          300       400       500       600
We would be safe to conclude that as we progress up the SEN “ladder” the
performance of learners falls: mean points drop from 472 for No‐SEN, through 408 for
School Action, to 300 for SA+.
With significance of p=0.000, this would demand further analysis and possible
interventions.
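Why does a cohort of 8 give such a wide interval? The width of each interval scales with the standard error of the group mean, which is the pooled StDev divided by the square root of N. A rough Python sketch using the S and N values from the Minitab SEN output above (the multiplier that turns a standard error into a 99.5% interval depends on the package's method, so it is deliberately not computed here):

```python
from math import sqrt

POOLED_STDEV = 143.0  # "S" from the Minitab SEN output
GROUP_SIZES = {
    "No": 94,
    "school action": 13,
    "school action +": 36,
    "statement of SEN": 8,
}


def standard_error(pooled_stdev, n):
    """Standard error of a group mean, based on the pooled StDev."""
    return pooled_stdev / sqrt(n)


standard_errors = {level: standard_error(POOLED_STDEV, n)
                   for level, n in GROUP_SIZES.items()}

for level, se in standard_errors.items():
    print(f"{level:<17} N={GROUP_SIZES[level]:>3}  SE={se:6.1f}")
```

The statemented group ends up with a standard error more than three times that of the No SEN group, purely because of its size.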
Attendance
Source   DF       SS      MS      F      P
C_ATT     3   173144   57715   2.35  0.075
Error   147  3610182   24559
Total   150  3783326

S = 156.7   R-Sq = 4.58%   R-Sq(adj) = 2.63%

Individual 99.5% CIs For Mean Based on Pooled StDev

Level   N   Mean  StDev  --+---------+---------+---------+-------
-1     26  356.7   99.6  (------------*-----------)
0      30  428.6  142.5              (----------*-----------)
1      33  464.0  146.9                   (----------*----------)
2      62  429.9  184.8                 (-------*--------)
                         --+---------+---------+---------+-------
                          280       350       420       490
What this shows us:
The p‐value is higher than from the GLM (0.075 against 0.026), putting us just above
the 0.05 limit, on the cusp of statistical significance.
The R2 of 4.58% indicates that whilst Attendance might be significant, it only
accounts for 4.58% of the spread of data – clearly something else is accounting
for the variation in data.
What would we expect Attendance vs Attainment to look like? I think we would all
agree that we would expect the higher the attendance, the higher the attainment. In
fact we looked at this before.
Our ANOVA seems to show that there is a drop off in attainment for the highest
attendance group (95%+, binned into “2”).
Being the data gurus that we now are, we want to be able to explain that seemingly
contradictory finding.
Extending the GLM
What often causes responses to behave in this way is some sort of “interaction” between
the factors. What does that mean?
Until now, all the statistics we’ve used have relied on the understanding that one
factor has no influence on the other – for example, “what primary school you go to
does not affect whether you are a boy or a girl”. But, in the case of a single sex primary
school, clearly, the sex could be linked to the school itself. More generally, when the
effect of one factor on the response depends on the level of another factor, this is called an
“interaction”.
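A quick way to see this in numbers is as a "difference of differences": work out the effect of one factor separately at each level of the other, then compare. The Python sketch below uses hypothetical cell means (not our data), and the function names are ours.

```python
# Hypothetical mean points for each Sex x SEN cell (illustrative only).
CELL_MEANS = {
    ("F", "No"): 480.0, ("F", "SA+"): 400.0,
    ("M", "No"): 470.0, ("M", "SA+"): 250.0,
}


def sen_effect(cell_means, sex):
    """Change in mean points going from No SEN to SA+ for one sex."""
    return cell_means[(sex, "SA+")] - cell_means[(sex, "No")]


def interaction_contrast(cell_means):
    """Difference between the SEN effect for boys and for girls."""
    return sen_effect(cell_means, "M") - sen_effect(cell_means, "F")


print("SEN effect for girls:", sen_effect(CELL_MEANS, "F"))
print("SEN effect for boys: ", sen_effect(CELL_MEANS, "M"))
print("Difference of differences:", interaction_contrast(CELL_MEANS))
```

If the two effects were the same, the contrast would be zero and there would be no interaction to speak of. Whether a non-zero contrast is statistically significant is what the GLM then tells us.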
In our case, an interaction between the factors is likely to be pulling down the
performance of the high attendance learners – but what could that be?
Prior knowledge would seem to indicate that SEN Statemented students tend to have a
higher attendance than other students AND the highest attending boys seem to do less
well than other learners.
Let’s explore the data graphically:
[Figure: Scatterplot of POINTS vs Attendance %, with points marked by Sex (F, M). x-axis: Attendance % (20 to 110); y-axis: POINTS (0 to 900).]
[Figure: Scatterplot of POINTS vs Attendance %, with points marked by SEN (no special provision, school action, school action plus, statement of SEN). Same axes as above.]
Both of these plots indicate high attending learners who achieve less than we would
expect.
The first chart seems to indicate that there are a number of low achieving, but
high attendance males.
The second chart seems to indicate that there is a high number of School
Action Plus learners with high attendance who are also low achieving.
These two statements would suggest an interaction between “Sex” and “SEN” – that
is “SEN Boys perform differently to SEN Girls”.
Building interactions into the GLM
When we created the GLM, we entered a range of factors into Minitab. We can
indicate an interaction by:
Factor 1 * Factor 2
To enter a “Sex” / “SEN” interaction, I have entered Sex*SEN into the GLM model.
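Behind the scenes, an interaction term such as Sex*SEN adds extra columns to the model, one for each combination of the non-reference levels. The sketch below uses simple 0/1 reference-level coding, which is an assumption made for illustration; Minitab's internal coding may differ, though the number of columns, (2-1) x (4-1) = 3, is the same and matches the DF that Minitab reports for Sex*SEN.

```python
SEX_LEVELS = ["F", "M"]
SEN_LEVELS = ["No", "school action", "school action +", "statement of SEN"]


def dummy_code(value, levels):
    """0/1 indicators for every level except the first (the reference)."""
    return [1 if value == level else 0 for level in levels[1:]]


def interaction_columns(sex, sen):
    """Product of each Sex indicator with each SEN indicator."""
    return [s * n
            for s in dummy_code(sex, SEX_LEVELS)
            for n in dummy_code(sen, SEN_LEVELS)]


print(interaction_columns("M", "school action +"))
print(interaction_columns("F", "school action +"))
print("Interaction columns:", len(interaction_columns("M", "No")))
```

With this coding, an interaction column is only "switched on" for learners who are in both non-reference groups at once, which is how the model can give SEN boys their own adjustment.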
If we run the GLM, the output is similar to before, but with the additional factor of
Sex*SEN
Analysis of Variance for POINTS, using Adjusted SS for Tests

Source            DF   Seq SS   Adj SS  Adj MS      F      P
Sex                1    97779     8598    8598   0.54  0.464
Previous School    6   162246    90813   15135   0.95  0.462
Attendance         3   147096   238927   79642   5.01  0.003
C-READ             3   129266    52955   17652   1.11  0.349
FSM Eligible       1     1898     2961    2961   0.19  0.667
LAC                1    14316    30069   30069   1.89  0.173
SEN                3   563318   292055   97352   6.13  0.001
Sex*SEN            3   279666   279666   93222   5.87  0.001
Error             83  1318791  1318791   15889
Total            104  2714376

S = 126.052   R-Sq = 51.41%   R-Sq(adj) = 39.12%
What we notice:
The p‐values have changed – we would expect that, as we have added in
another factor to assess against.
Attendance and SEN are still significant.
The Sex*SEN interaction is as significant as SEN alone (both p=0.001).
R2 has risen from 41.11% to 51.41%, so the extended model accounts for more of the variation.
In this way, we have built an interaction into the GLM and assessed that SEN pupils
perform differently depending on Sex.
Overall, this GLM analysis points the direction for further study. Specifically, with this
data set, Attendance, SEN and Sex*SEN are the factors that we should be considering.
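With more factors, picking out the significant rows by eye gets tedious. A minimal Python sketch of the same screening rule (p below 0.05), using the P column from the extended GLM output above; the function name is ours:

```python
ALPHA = 0.05  # the significance threshold used throughout this text

# P column of the extended GLM output above.
P_VALUES = {
    "Sex": 0.464,
    "Previous School": 0.462,
    "Attendance": 0.003,
    "C-READ": 0.349,
    "FSM Eligible": 0.667,
    "LAC": 0.173,
    "SEN": 0.001,
    "Sex*SEN": 0.001,
}


def significant_terms(p_values, alpha=ALPHA):
    """Names of the terms whose p-value is below alpha, smallest p first."""
    keep = [name for name, p in p_values.items() if p < alpha]
    return sorted(keep, key=lambda name: p_values[name])


print(significant_terms(P_VALUES))

# R-Sq before and after adding the interaction, from the two outputs.
r_sq_gain = 51.41 - 41.11
print(f"R-Sq gain from adding Sex*SEN: {r_sq_gain:.2f} percentage points")
```

Note that Sex on its own stays non-significant even though Sex*SEN is significant: the difference between boys and girls only shows up within particular SEN groups.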
Big implications of the GLM
By not assuming any one factor is important, we have used a General Linear Model to
show that only Attendance and SEN (together with the Sex*SEN interaction) are statistically
important in determining the total points score of our learners.
Let me put that another way: compared to Sex, Free School Meals or LAC status,
Attendance and SEN are more important in determining the total points score on
leaving¹⁰.
This is where the importance of the GLM should not be underestimated.
Before embarking on a whole school initiative to “solve” a problem, we must surely use
detailed statistics to correctly assess what we need to “solve” in the first place.
10 We’re not making far reaching conclusions here – for this data, analyzed for 2010‐2011, this is the
output of the GLM. Your data / conclusions will be different.
Call to action
1. Get the biggest data set that you can get your hands on, including, but not
limited to:
   Name        Sex  SEN  FSM  CATs  Att%   Feeder    Read   Maths  English  Science  Points
   Adams, Jon  M    NA   N    119   90.35  St Judes  14.02  30     35       40       440
2. Build a GLM of the input factors and determine which are the most significant
in determining your output response.
a. You might need to bin up your input factors
3. Are there any factors that come out as significant that show a different trend
than you would expect?
a. Consider extending the GLM to take into account interactions between
factors.
Conclusions
The GLM allows you to determine which factors are most significant in driving an
output response. Crucially, the GLM allows you to perform this analysis, without any
preconceptions over which factor will be significant in the first place.
We’ve shown you (briefly) how to use Minitab to create a GLM model for a range of input
factors and how to interpret the significance of the output.
Once we’d built the GLM we looked at digging deeper into the data and running
separate ANOVA to check for trends within each factor.
Finally, we discussed how to add in interactions between factors to account for
counter‐intuitive trends.
Chapter 6
Main Effects
We’ve taken a fairly constructivist approach to working our way through the statistics
we’ve needed to analyze the data routinely seen in school and at faculty level. This has
led to exploring means, data visualization, t‐tests, ANOVA and latterly the very
powerful GLM approach. These were presented in that order to allow us to build up a
picture of how to effectively analyze a data set. Once you’ve got these tools under
your belt, you’re going to want a short cut method that focuses your analysis – that’s
the Main Effects.
Imagine drawing a series of charts, with common scales, so that you can directly
compare the magnitudes of the effects of different factors. That way, you can see
which factors have the most effect and need to be analyzed further. Excel can be
driven in this way, but you’ll need to create separate graphs for each response and
scale them accordingly – lots of work.
Fortunately, now that you’re using Minitab¹¹, there’s an immediate way to get just
what you want.
Right under the Stat >> ANOVA menu, towards the bottom is the Main Effects Plot.
11 Or equivalent
Pick the Response and Factors, exactly as when building the ANOVA analysis.
Main Effects Plot
The Main Effects Plot is quite simple – it shows on common y‐axis scales the effect of
each of the factors you chose, split by the level of each factor.
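The numbers behind a Main Effects Plot are nothing more than the mean response at each level of each factor. A minimal Python sketch, using a handful of hypothetical learners (not our school data) and function names of our own choosing:

```python
from collections import defaultdict


def main_effects(records, response, factors):
    """Mean of the response at each level of each factor."""
    effects = {}
    for factor in factors:
        by_level = defaultdict(list)
        for row in records:
            by_level[row[factor]].append(row[response])
        effects[factor] = {level: sum(v) / len(v)
                           for level, v in by_level.items()}
    return effects


def effect_size(level_means):
    """Gap between the highest and lowest level mean for one factor."""
    return max(level_means.values()) - min(level_means.values())


# Hypothetical learners, not our school data.
RECORDS = [
    {"Sex": "F", "SEN": "No", "Points": 500},
    {"Sex": "M", "SEN": "No", "Points": 460},
    {"Sex": "F", "SEN": "SA+", "Points": 340},
    {"Sex": "M", "SEN": "SA+", "Points": 300},
]

effects = main_effects(RECORDS, "Points", ["Sex", "SEN"])
for factor, level_means in effects.items():
    print(factor, level_means, "range:", effect_size(level_means))
```

Plotting each factor's level means on the same y-axis is what lets you compare the ranges at a glance; in this made-up example SEN has a far bigger range than Sex.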
The chart below leads us to the following conclusions:
For SEN, going from SA to SA+ has the largest impact on Average Points.
Going from ‐1 to +1 in C‐Read (Reading Age) has a large impact.
Going from ‐1 to +1 in C_ATT (Attendance) has a large impact.
There is a large difference between Previous Schools B and D.
Out of all the factors charted, SEX shows the least effect on Average Points.
[Figure: Main Effects Plot for Average Points. One panel per factor on a common y-axis (Average Points, 300 to 500): Sex (F, M); C_ATT (-1, 0, 1, 2); C-READ (-2, -1, 0, 1); SEN (No, SA, SA +, Statement); FSM Eligible (N, Y); LAC (N, Y); Previous School (*, A to F).]
Interactions Plot
Following the same thoughts as the Main Effects, it would be a good idea to see all the
possible interactions and to decide which need further analysis.
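An interactions plot is built from the mean response in every cell of every pair of factors. A minimal Python sketch of that tabulation, again with hypothetical learners rather than our data (the function name is ours):

```python
from collections import defaultdict
from itertools import combinations


def interaction_means(records, response, factors):
    """Mean response in every cell of every pair of factors."""
    tables = {}
    for first, second in combinations(factors, 2):
        cells = defaultdict(list)
        for row in records:
            cells[(row[first], row[second])].append(row[response])
        tables[(first, second)] = {cell: sum(v) / len(v)
                                   for cell, v in cells.items()}
    return tables


# Hypothetical learners, not our school data.
RECORDS = [
    {"Sex": "F", "C_ATT": 2, "SEN": "No", "Points": 520},
    {"Sex": "F", "C_ATT": -1, "SEN": "No", "Points": 400},
    {"Sex": "M", "C_ATT": 2, "SEN": "SA+", "Points": 280},
    {"Sex": "M", "C_ATT": 2, "SEN": "No", "Points": 480},
    {"Sex": "M", "C_ATT": -1, "SEN": "SA+", "Points": 300},
]

tables = interaction_means(RECORDS, "Points", ["Sex", "C_ATT", "SEN"])
for pair, cells in tables.items():
    print(pair)
    for cell, mean in sorted(cells.items(), key=str):
        print("   ", cell, mean)
```

Three factors give three pairs, seven factors give twenty-one, which is why the full chart becomes crowded so quickly. Cells with very few learners should be treated with caution, as their means can swing on a single score.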
From the same STAT >> ANOVA menu as before, select Interactions Plot and pick the
same factors as previously:
For the purposes of this text however, we will only look at Sex, Attendance, Reading Age
and SEN – and how they interact to determine Attainment – as you will see, the chart is
confusing enough with just these factors.
Take time to look at the chart before reading on.
[Figure: Interaction Plot for Average Points Score. Matrix of panels for Sex (F, M), C_ATT (-1, 0, 1, 2), C-READ (-2, -1, 0, 1) and SEN (no special provision, school action, school action plus, statement of SEN); y-axis 300 to 600.]
What does the interaction plot show?
In all cases of Attendance, girls outperform boys.
As girls’ reading age increases, they improve in attainment at a faster rate than
boys.
Reading age mainly affects attainment for the highest attendees – for those with
the lowest attendance, reading age matters much less.
Statemented boys do far better than statemented girls.
Those statemented students with the highest reading age achieve most.
The purpose of the interaction chart is to quickly identify avenues for further research.
What stands out as a promising line of enquiry is the gender split and the effect of
reading age on attainment (seen here zoomed in):
It can be seen that as the reading age goes from ‐2 to +1, both genders improve
in attainment. However, the plot suggests that the attainment gains for girls are
far higher at higher reading ages than for boys. This needs further analysis.
Let’s build a GLM of attainment against the factors of SEX, Reading Age and the
Sex*Reading Age interaction:
General Linear Model: POINTS versus Sex, C-READ

Analysis of Variance for POINTS, using Adjusted SS for Tests

Source            DF   Seq SS   Adj SS  Adj MS      F      P
Sex                1    97779    60685   60685   2.42  0.123
C-READ             3   157821   130298   43433   1.73  0.166
Sex*C-READ         3    26056    26056    8685   0.35  0.792
Error             97  2432719  2432719   25080
Total            104  2714376

S = 158.365   R-Sq = 10.38%   R-Sq(adj) = 3.91%
What this shows is that none of the factors is significant at the p=0.05 level, and the
Sex*Reading Age interaction, with p=0.792, is nowhere near significant, so it
should be removed from our model.
Let’s remove it and re‐run the GLM:
Analysis of Variance for POINTS, using Adjusted SS for Tests

Source            DF   Seq SS   Adj SS  Adj MS      F      P
Sex                1    97779    77753   77753   3.16  0.078
C-READ             3   157821   157821   52607   2.14  0.100
Error            100  2458776  2458776   24588
Total            104  2714376
Now the GLM hints that both Sex (p=0.078) and Reading Age (p=0.100) are approaching
statistical significance, so warrant further investigation.
So again the value of a thorough statistical investigation is clear. What looks like a
difference between the effect of reading age for boys and girls is not actually
statistically significant.
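The two outputs above can be compared directly. Dropping a term moves its variation into the Error row, and the standard "extra sum of squares" F ratio asks whether that move matters. A minimal Python sketch using only the Error and Total rows from the two Minitab outputs (the function and variable names are ours):

```python
def partial_f(ss_error_reduced, df_error_reduced,
              ss_error_full, df_error_full):
    """F statistic for the terms dropped between two nested models."""
    extra_ss = ss_error_reduced - ss_error_full
    extra_df = df_error_reduced - df_error_full
    return (extra_ss / extra_df) / (ss_error_full / df_error_full)


SS_TOTAL = 2714376

# Error rows from the two Minitab outputs above.
FULL = {"ss_error": 2432719, "df_error": 97}      # with Sex*C-READ
REDUCED = {"ss_error": 2458776, "df_error": 100}  # without it

f_stat = partial_f(REDUCED["ss_error"], REDUCED["df_error"],
                   FULL["ss_error"], FULL["df_error"])
r_sq_full = 1.0 - FULL["ss_error"] / SS_TOTAL
r_sq_reduced = 1.0 - REDUCED["ss_error"] / SS_TOTAL

print(f"F for the dropped interaction = {f_stat:.2f}")
print(f"R-Sq with interaction    = {r_sq_full:.2%}")
print(f"R-Sq without interaction = {r_sq_reduced:.2%}")
```

The F value comes out close to the 0.35 that Minitab reports for Sex*C-READ, and R-Sq falls by only about one percentage point when the interaction is removed, which is another way of seeing that the term was contributing very little.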
Call to action
1. Find the data set that you’ve been looking at and use Minitab to display the
Main Effects. Does the ME plot lead you to any conclusions?
2. Use Minitab to create an Interactions Plot for the same factors and responses.
Do any interactions stand out? For example “do high attending boys perform
differently to high attending girls?”
3. For any interesting interactions, build a GLM to assess if these interactions are
significant and need further analysis.
Conclusions
During this Chapter we’ve turned the previous analysis on its head and gone right back
to a datasheet.
We’ve used the Main Effects plot to quickly show the average effect of different factors.
These factors can then be assessed by ANOVA for their overall significance.
Finally, we used the Interactions Plot to explore any trends that “break the mould”. To
assess these interactions, we built a GLM.
Chapter 7
Final remarks
Data is nothing to be afraid of. Throughout this text we’ve tried to show how, by
approaching the crunching of numbers methodically, you can make sense out of the data that is
presented in schools. We’ve shown how simple analysis and display of the mean can
be misleading and is open to interpretation/bias. As a class teacher, head of subject or
member of the senior leadership team, surely you need more than “gut feel” to make
decisions that could adversely affect the education of a whole generation – and we
don’t make that pronouncement lightly.
As educational professionals, we owe it to the learners for whom we are devising
interventions and creating policy – to base our pronouncements on sound, statistically
valid understanding of the numbers we are “crunching”. I hope this text has begun to
demonstrate that this is not hard, and is certainly not beyond you, dear reader.
Tools you’ll need:
1. MS Excel (or equivalent)
2. Daniels XL Toolbox (or equivalent Add‐In)
3. Minitab (or equivalent)
What we can’t guarantee is that you’ll uncover that “missing link” that will improve
your school performance from 30% to 80% A*‐C overnight – but what we can ensure is
that when you’re asked “Are you sure” – you can say “Yes”.
So, go on – get “beyond the mean”.
Glen Gilchrist
Alexavier Fareheed
www.goingbeyond.co.uk