
Copyright © 2011 Educational Testing Service. All rights reserved.

Validating Automated Essay Scores: Are Humans the Gold Standard?

Brent Bridgeman, Educational Testing Service

EALTA, Innsbruck, Austria, June 2012


Why Score Essays by Machine?

• A large pool of qualified raters is difficult to find and maintain

• Machines can score quickly and inexpensively

• Machines are very reliable: they will give the same answer every time

• Machines do not get frustrated or angry

• Machines do not get tired



Two Types of e-rater® Models

• Generic (G)
– treats every prompt for a particular task type in exactly the same way
– consistent set of feature weights
– single intercept for aligning human and automated scores

• Prompt Specific (PS)
– different weights and intercepts for each prompt
– includes two additional features that are based on the unique vocabulary in a particular prompt
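The difference between the two model types can be sketched in a few lines of Python. This is only an illustration, assuming the score is a weighted sum of features plus an intercept, as the terms "feature weights" and "intercept" suggest. The class name, feature names, weights, and intercepts are all made up and are not e-rater's actual features or values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LinearScoringModel:
    """Hypothetical model: intercept plus a weighted sum of essay features."""
    weights: Dict[str, float]
    intercept: float

    def score(self, features: Dict[str, float]) -> float:
        # Only features that have a weight in this model contribute.
        return self.intercept + sum(
            w * features.get(name, 0.0) for name, w in self.weights.items()
        )


# Generic (G): one set of weights and one intercept shared by every prompt.
generic = LinearScoringModel(
    weights={"grammar": 0.5, "organization": 1.0},
    intercept=1.0,
)

# Prompt Specific (PS): separate weights and intercept for each prompt, plus
# two extra features based on prompt vocabulary (names are placeholders).
prompt_specific = {
    "prompt_A": LinearScoringModel(
        weights={
            "grammar": 0.5,
            "organization": 0.75,
            "prompt_vocab_1": 0.5,
            "prompt_vocab_2": 0.25,
        },
        intercept=0.25,
    ),
}


def score_essay(features: Dict[str, float], prompt_id: str,
                use_prompt_specific: bool) -> float:
    """Score with the per-prompt model if requested, else the generic one."""
    if use_prompt_specific:
        return prompt_specific[prompt_id].score(features)
    return generic.score(features)


# Made-up feature values for one essay.
essay = {"grammar": 2.0, "organization": 2.0,
         "prompt_vocab_1": 1.0, "prompt_vocab_2": 2.0}
generic_score = score_essay(essay, "prompt_A", use_prompt_specific=False)
ps_score = score_essay(essay, "prompt_A", use_prompt_specific=True)
```

The generic model ignores the two vocabulary features because it has no weights for them, which is the practical difference the slide describes.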


Gold Standard Validity Criterion

• Score agreement between humans and machine rating the same essay

• Frequently, human-machine agreement is higher than human-human agreement
– Praxis I® Pre-Professional Skills Test (Essay)
– GRE® Analytical Writing (Issue Topic)
– Test of English as a Foreign Language Internet-based Test (TOEFL iBT®) Independent Essay


TOEFL iBT Essay Tasks

• Independent
– support an opinion on the provided topic

• Integrated
– requires the examinee to listen to a lecture, read a relevant selection that may present a different point of view, then write an essay integrating and reacting to these different points of view

• 30 minutes for each task; each is scored on a 1-5 scoring rubric


Correlation of Human and e-rater Scores for TOEFL iBT Independent and Integrated Essay Tasks

Essay Task                   Independent   Integrated

Human 1 with Human 2         .68           .83
Human 1 with e-rater (G)     .72           .62
Human 1 with e-rater (PS)                  .77

Note. n = 141,203; G is generic model; PS is prompt specific model
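For readers who want to reproduce this kind of agreement statistic on their own data, here is a minimal sketch. It assumes the correlations reported are Pearson product-moment correlations, which the slides do not state explicitly. The scores below are invented for illustration and are not from the TOEFL iBT sample.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Raises ZeroDivisionError if either list has no variance.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented scores for six essays on a 1-5 rubric (illustration only).
human_1 = [3, 4, 2, 5, 3, 4]
human_2 = [3, 4, 3, 5, 2, 4]
machine = [3.2, 3.9, 2.4, 4.6, 3.1, 4.2]

r_human_human = pearson_r(human_1, human_2)
r_human_machine = pearson_r(human_1, machine)
```

Comparing `r_human_machine` against `r_human_human` is the "gold standard" comparison: the machine is judged by how closely it tracks one human relative to how closely a second human does.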


Alternative Validation Criteria

• Correlation with other tasks in the same general domain

• Correlation with essays written on a different topic at a different time


Relationship of Human and Machine Essay Scores with Other Measures of English Language Proficiency

• In addition to the Writing Score, TOEFL iBT assesses Reading, Listening, and Speaking

• These four skills are highly related to the extent that it makes logical and psychometric sense to produce a total score

• Factor analyses suggest that there is a dominant general factor across all four skills (Sawaki, Stricker, & Oranje, 2009)


Correlations with TOEFL iBT Section Scores

Essay Task             Independent        Integrated
TOEFL iBT Scores       H1     e-rater     H1     e-rater (G)   e-rater (PS)

Reading                .55    .59         .63    .60           .71
Listening              .55    .55         .65    .57           .70
Speaking               .58    .58         .58    .58           .65
Read+Listen+Speak      .63    .65         .71    .66           .78

Note. n = 141,203; G is generic model; PS is prompt specific model


Repeater Sample

            Total Sample                  Repeater Sample
                                                        Time 1         Time 2
Language    n        %     M      SD      n      %      M      SD      M      SD

Chinese     27286    19    20.4   4.5     1267   16     19.3   4.1     20.4   4.0
Korean      28069    20    20.5   4.8     3068   39     20.4   4.6     21.2   4.5
Japanese    11757    8     18.5   4.8     1339   17     18.5   4.5     19.4   4.3
Arabic      7864     6     18.9   5.1     362    5      17.9   4.3     19.2   4.2
All         141203   100   20.8   4.9     7894   100    19.6   4.5     20.5   4.4


Correlation of Time 1 Independent Scores with Time 2 Total (Independent + Integrated) Human Scores

Scores on Independent Task    r

One human score               .558
e-rater                       .635
Two human scores              .615
e-rater & 1 human (a)         .644

(a) Includes adjudicated scores when human and e-rater differed by more than 1.5 points
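The footnote's adjudication rule can be illustrated with a short sketch. The slides state only that scores were adjudicated when the human and e-rater scores differed by more than a threshold; how the discrepancy was resolved is not described. The rule below, where a second human rating replaces the machine score, is an assumption made for illustration, and the function and constant names are hypothetical.

```python
from typing import Optional

# Thresholds taken from the slide footnotes.
INDEPENDENT_THRESHOLD = 1.5
INTEGRATED_THRESHOLD = 1.0


def combined_score(human: float, machine: float, threshold: float,
                   second_human: Optional[float] = None) -> float:
    """Average one human and one machine score, adjudicating large gaps.

    ASSUMPTION: when the gap exceeds `threshold`, a second human rating
    replaces the machine score. The slides do not specify the actual rule.
    """
    if abs(human - machine) > threshold:
        if second_human is None:
            raise ValueError("gap exceeds threshold; a second human rating is needed")
        return (human + second_human) / 2
    return (human + machine) / 2


# Gap of 0.8 is within the threshold: human and machine are averaged.
agree = combined_score(4.0, 3.2, INDEPENDENT_THRESHOLD)

# Gap of 2.0 exceeds the threshold: the second human rating is used instead.
adjudicated = combined_score(5.0, 3.0, INDEPENDENT_THRESHOLD, second_human=4.0)
```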


Correlation of Time 1 Integrated Scores with Time 2 Total (Independent + Integrated) Human Scores

Scores on Integrated Task     r

One human score               .623
e-rater                       .616
Two human scores              .648
e-rater & 1 human (a)         .676

(a) Includes adjudicated scores when human and e-rater differed by more than 1.0 points


Correlation of Time 1 Combined Scores with Time 2 Total (Independent + Integrated) Human Scores

Time 1 Scores
Independent              Integrated               r

Two human scores         Two human scores         .719
e-rater & 1 human (a)    Two human scores         .727
e-rater & 1 human (a)    One human                .713
e-rater & 1 human (a)    e-rater & 1 human (b)    .729

(a) Includes adjudicated scores when human and e-rater differed by more than 1.5 points

(b) Includes adjudicated scores when human and e-rater differed by more than 1.0 points


Summary

• Using only the “gold standard” criterion of the correlation with a human rater on the same task, e-rater appears to be deficient for the Integrated task
– Correlation with a human rater is much lower than the correlation of two human raters

• When the criterion is expanded to include scores on other measures of English language proficiency, e-rater and human scores appear to be much more comparable

• When the criterion is the scores on writing tasks from a different occasion, the combination of one human score and one e-rater score outperforms two human scores


What Next?

• Equal weights for humans and machine are not written in stone
– Research in progress suggests that for the integrated task, double-weighting the human score may provide optimal measurement

• Optimizing prediction of human scores may not be the best way to create machine scores
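To make the weighting idea concrete, the following sketch compares an equally weighted combination with one that double-weights the human score. It assumes the combined score is a weighted average of the two ratings. The function name and example scores are hypothetical, and the slides do not give the operational formula.

```python
def weighted_score(human: float, machine: float,
                   human_weight: float = 1.0,
                   machine_weight: float = 1.0) -> float:
    """Weighted average of a human score and a machine score."""
    total_weight = human_weight + machine_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (human_weight * human + machine_weight * machine) / total_weight


# Invented scores: human gives 4, machine gives 3.
equal = weighted_score(4.0, 3.0)                            # (4 + 3) / 2
double_human = weighted_score(4.0, 3.0, human_weight=2.0)   # (2*4 + 3) / 3
```

Double-weighting pulls the combined score toward the human rating, so the machine acts more as a check on the human than as an equal partner.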


Let People do What They do Best and Let Machines do What They do Best

• What machines do well
– Machines can count accurately
– Machines are very reliable: they will give the same answer every time
– Machines do not get frustrated or angry
– Machines do not get tired

• What people do well
– People can recognize tone, appropriateness for a particular audience, and illogical arguments
– People can recognize abrupt changes in quality within an essay that may suggest use of memorized material (but machines could possibly be trained to do this)
– People can easily spot well-written nonsense
