Copyright © 2011 Educational Testing Service. All rights reserved.
Validating Automated Essay Scores: Are Humans the Gold Standard?
Brent Bridgeman, Educational Testing Service
EALTA, Innsbruck, Austria, June 2012
Why Score Essays by Machine?
• Large pool of qualified raters difficult to find and maintain
• Machines can score quickly and inexpensively
• Machines are very reliable: they will give the same answer every time
• Machines do not get frustrated or angry
• Machines do not get tired
Two Types of e-rater® Models
• Generic (G)
– treats every prompt for a particular task type in exactly the same way
  • consistent set of feature weights
  • single intercept for aligning human and automated scores
• Prompt Specific (PS)
– different weights and intercepts for each prompt
– includes two additional features that are based on the unique vocabulary in a particular prompt
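The difference between the two model types can be sketched as a linear scoring function. This is only an illustration, not e-rater's actual implementation: the feature names, weights, and intercepts are invented for the example, and the "weighted sum of features plus an intercept" form is an assumption inferred from the slide's mention of weights and intercepts.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LinearScoringModel:
    """Hypothetical linear model: intercept + sum(weight * feature)."""
    weights: Dict[str, float]
    intercept: float

    def score(self, features: Dict[str, float]) -> float:
        # Features without a weight in this model are ignored.
        return self.intercept + sum(
            w * features.get(name, 0.0) for name, w in self.weights.items()
        )


# Generic (G): one set of weights and one intercept for every prompt.
# All numbers here are made up for illustration.
GENERIC = LinearScoringModel(
    weights={"grammar": 0.5, "organization": 1.0},
    intercept=1.0,
)

# Prompt Specific (PS): separate weights and intercept per prompt, plus two
# extra prompt-vocabulary features (names are hypothetical).
PROMPT_SPECIFIC = {
    "prompt_A": LinearScoringModel(
        weights={
            "grammar": 0.5,
            "organization": 0.75,
            "vocab_match_1": 0.5,
            "vocab_match_2": 0.25,
        },
        intercept=0.25,
    ),
}


def score_essay(features: Dict[str, float], prompt_id: Optional[str] = None) -> float:
    """Use the prompt-specific model when one exists, else the generic one."""
    model = PROMPT_SPECIFIC.get(prompt_id, GENERIC)
    return model.score(features)


essay = {"grammar": 2.0, "organization": 2.0, "vocab_match_1": 1.0, "vocab_match_2": 2.0}
generic_score = score_essay(essay)
prompt_specific_score = score_essay(essay, prompt_id="prompt_A")
```

In this sketch the generic model never sees the two vocabulary features, so it can be applied to a new prompt without retraining, whereas the prompt-specific model needs its own entry for each prompt.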
Copyright © 2011 Educational Testing Service. All rights reserved. Copyright © 2011 Educational Testing Service. All rights reserved. 6/14/2012
Gold Standard Validity Criterion
• Score agreement between humans and machine rating the same essay
• Frequently, human-machine agreement is higher than human-human agreement
– Praxis I® Pre-Professional Skills Test (Essay)
– GRE® Analytical Writing (Issue Topic)
– Test of English as a Foreign Language Internet-based Test (TOEFL iBT®) Independent Essay
TOEFL iBT Essay Tasks
• Independent
– support an opinion on the provided topic
• Integrated
– requires the examinee to listen to a lecture, read a relevant selection that may present a different point of view, then write an essay integrating and reacting to these different points of view
• 30 minutes for each task, each scored on a 1-5 scoring rubric
Correlation of Human and e-rater Scores for TOEFL iBT Independent and Integrated Essay Tasks
Essay Task Independent Integrated
Human 1 with Human 2 .68 .83
Human 1 with e-rater (G) .72 .62
Human 1 with e-rater (PS) .77
Note. n = 141,203; G is generic model; PS is prompt specific model
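Agreement figures like those in the table can be computed as a correlation between two sets of scores for the same essays. The sketch below assumes a Pearson product-moment correlation, which the slides do not state explicitly, and the score lists are invented solely to show the call.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five essays (not data from the study).
human_1 = [3, 4, 2, 5, 3]
human_2 = [3, 5, 2, 4, 3]
machine = [3.2, 4.1, 2.4, 4.6, 3.1]

human_human_agreement = pearson_r(human_1, human_2)
human_machine_agreement = pearson_r(human_1, machine)
```

Comparing `human_machine_agreement` with `human_human_agreement` is the "gold standard" check described in the slides: the machine is judged by how closely it tracks a human rater relative to how closely two humans track each other.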
Alternative Validation Criteria
• Correlation with other tasks in the same general domain
• Correlation with essays written on a different topic at a different time
Relationship of Human and Machine Essay Scores with Other Measures of English Language Proficiency
• In addition to the Writing Score, TOEFL iBT assesses Reading, Listening, and Speaking
• These four skills are highly related to the extent that it makes logical and psychometric sense to produce a total score
• Factor analyses suggest that there is a dominant general factor across all four skills (Sawaki, Stricker, & Oranje, 2009)
Correlations with TOEFL iBT Section Scores
                     Essay Task
                     Independent        Integrated
TOEFL iBT Scores     H1     e-rater     H1     e-rater (G)   e-rater (PS)
Reading              .55    .59         .63    .60           .71
Listening            .55    .55         .65    .57           .70
Speaking             .58    .58         .58    .58           .65
Read+Listen+Speak    .63    .65         .71    .66           .78
Note. n = 141,203; G is generic model; PS is prompt specific model
Repeater Sample
            Total Sample                   Repeater Sample
                                           Time 1                       Time 2
Language    n        %     M      SD       n      %     M      SD       M      SD
Chinese     27286    19    20.4   4.5      1267   16    19.3   4.1      20.4   4.0
Korean      28069    20    20.5   4.8      3068   39    20.4   4.6      21.2   4.5
Japanese    11757    8     18.5   4.8      1339   17    18.5   4.5      19.4   4.3
Arabic      7864     6     18.9   5.1      362    5     17.9   4.3      19.2   4.2
All         141203   100   20.8   4.9      7894   100   19.6   4.5      20.5   4.4
Correlation of Time 1 Independent Scores with Time 2 Total (Independent + Integrated) Human Scores
Scores on Independent Task    r
One human score               .558
e-rater                       .635
Two human scores              .615
e-rater & 1 human^a           .644
^a Includes adjudicated scores when human and e-rater differed by more than 1.5 points
Correlation of Time 1 Integrated Scores with Time 2 Total (Independent + Integrated) Human Scores
Scores on Integrated Task     r
One human score               .623
e-rater                       .616
Two human scores              .648
e-rater & 1 human^a           .676
^a Includes adjudicated scores when human and e-rater differed by more than 1.0 points
Correlation of Time 1 Combined Scores with Time 2 Total (Independent + Integrated) Human Scores
Time 1 Scores
Independent            Integrated             r
Two human scores       Two human scores       .719
e-rater & 1 human^a    Two human scores       .727
e-rater & 1 human^a    One human              .713
e-rater & 1 human^a    e-rater & 1 human^b    .729
^a Includes adjudicated scores when human and e-rater differed by more than 1.5 points
^b Includes adjudicated scores when human and e-rater differed by more than 1.0 points
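The footnotes describe a discrepancy rule: a score pair is adjudicated when the human and e-rater scores differ by more than a task-specific threshold (1.5 points for Independent, 1.0 for Integrated). The sketch below shows one way such a rule could work. The slides do not say how adjudication is resolved, so replacing the e-rater score with a second human score, and averaging in both branches, are assumptions made only for illustration.

```python
from typing import Optional

# Thresholds taken from the table footnotes.
INDEPENDENT_THRESHOLD = 1.5
INTEGRATED_THRESHOLD = 1.0


def combined_score(
    human: float,
    machine: float,
    threshold: float,
    second_human: Optional[float] = None,
) -> float:
    """Combine one human score and one machine score for a single essay.

    Assumed rule (not confirmed by the slides): if the two scores differ by
    more than `threshold`, a second human score replaces the machine score.
    """
    if abs(human - machine) > threshold:
        if second_human is None:
            raise ValueError("discrepancy exceeds threshold: second human score required")
        return (human + second_human) / 2
    # Scores agree closely enough: average human and machine.
    return (human + machine) / 2


# Hypothetical essays.
agreeing = combined_score(4.0, 3.5, INDEPENDENT_THRESHOLD)
adjudicated = combined_score(4.0, 2.0, INDEPENDENT_THRESHOLD, second_human=3.0)
```

Note that the comparison is strict, matching the footnote wording "more than": a difference of exactly 1.0 on the Integrated task would not trigger adjudication in this sketch.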
Summary
• Using only the “gold standard” criterion of the correlation with a human rater on the same task, e-rater appears to be deficient for the Integrated task
– Correlation with a human rater is much lower than the correlation of two human raters.
• When the criterion is expanded to include scores on other measures of English language proficiency, e-rater and human scores appear to be much more comparable
• When the criterion is the scores on writing tasks from a different occasion, the combination of one human score and one e-rater score outperforms two human scores.
What Next?
• Equal weights for humans and machine not written in stone
– Research in progress suggests that for the integrated task double-weighting the human score may provide optimal measurement
• Optimizing prediction of human scores may not be the best way to create machine scores
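Double-weighting the human score amounts to a weighted average in which the human rating counts twice as much as the machine rating. A minimal sketch, assuming a simple weighted mean (the slides do not specify the formula) and using invented scores:

```python
def weighted_score(
    human: float,
    machine: float,
    human_weight: float = 1.0,
    machine_weight: float = 1.0,
) -> float:
    """Weighted mean of one human score and one machine score."""
    if human_weight < 0 or machine_weight < 0:
        raise ValueError("weights must be non-negative")
    total = human_weight + machine_weight
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return (human_weight * human + machine_weight * machine) / total


# Hypothetical essay scored 4 by the human and 1 by the machine.
equal_weights = weighted_score(4.0, 1.0)
human_double_weighted = weighted_score(4.0, 1.0, human_weight=2.0)
```

With equal weights the combined score sits midway between the two ratings; with the human score double-weighted it moves two thirds of the way toward the human rating.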
Let People do What They do Best and Let Machines do What They do Best
• What machines do well
– Machines can count accurately
– Machines are very reliable: they will give the same answer every time
– Machines do not get frustrated or angry
– Machines do not get tired
• What people do well
– People can recognize tone, appropriateness for a particular audience, and illogical arguments
– People can recognize abrupt changes in quality within an essay that may suggest use of memorized material (but machines could possibly be trained to do this)
– People can easily spot well-written nonsense