NLP Research for Commercial Development of Writing Evaluation Capabilities Jill Burstein Educational Testing Service Presentation for OTSLAC Columbia University December 2, 2004

Page 1: NLP Research for  Commercial Development of Writing Evaluation Capabilities

NLP Research for Commercial Development of Writing Evaluation Capabilities

Jill Burstein
Educational Testing Service
Presentation for OTSLAC, Columbia University
December 2, 2004

Page 2: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Criterion℠, E-rater®, Critique, C-rater, & more …

Jill Burstein, Claudia Leacock, Thomas Morton
Educational Testing Service

Martin Chodorow
Hunter College, CUNY

Susanne Wolff
Princeton University

Page 3: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Let’s Talk About Writing & Assessment

Page 4: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Educators’ Vision: Writing Skill Development

• Master basic skills in K-12
  – Grammar, spelling, punctuation, etc.

• Perfect the 5-paragraph essay
  – U.S. concept
  – Thesis, 3 main points, conclusion

• Writing within and beyond discipline
  – Address different audiences
  – Generate various genres: persuasive, compare-and-contrast, research writing within discipline

Page 5: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Evaluation of Vision Through Writing Assessments

• High stakes: Undergraduate & Graduate
  – Admissions
  – Placement

• No/Low stakes: K-12
  – Statewide and national assessments
  – Classroom instruction

Page 6: NLP Research for  Commercial Development of Writing Evaluation Capabilities

What do essays look like?

Page 7: NLP Research for  Commercial Development of Writing Evaluation Capabilities

The Reality of Writing Quality

• Timed Assessments
  – Up to 500 words (grade-level dependent)
  – Not Literary Essays
    ! creative
    ! irony
    ! metaphor

• Instructional Uses
  – Maybe Longer Essays
  – Better Quality with Revision Facility

Page 8: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Most essays look like this

“My position on school uniforms is as follows. School uniforms are a violation on several students rights. School uniforms makes us, the students, think that we do not have the write to express our feelings through clothing. Many students show pride through the clothing that they choose to wear. For instance some students may be of a different religion. Their relgion may require them to wear a clothing of such. When the school forces us to wear a uniform it forces these people to go back on their religion. wearing differnent types of clothing expresses a student's inner child, if not more. Through clothing we can see a students hobbies, joys, and loves of life. Putting uniforms on us would violate the fact to actually to show an opinion. Does the school want us to look exactly alike? Some students may not have an open mind about the fact that they cannot show their youth and personalty through clothing they will show it in another unhealthy way. In closing uniforms are and unjustice act against all students alike.”

Page 9: NLP Research for  Commercial Development of Writing Evaluation Capabilities

And others like this…

“You are stupid. You are stupid because you can't read. You are also stupid becuase you don't speak English and because you can't add. Your so stupid, you can't even add! Once, a teacher give you a very simple math problem; it was 1+1=?. Now keep in mind that this was in fourth grade, when you should have known the answer. You said it was 23! I laughed so hard I almost wet my pants! How much more stupid can you be?! So have I proved it? Don't you agree that your the stupidest person on earth? I mean, you can't read, speak English, or add. Let's face it, your a moron, no, an idiot, no, even worse, you're an imbosol.”

Page 10: NLP Research for  Commercial Development of Writing Evaluation Capabilities

And this ….

“I THINK THAT EVERYONE SHOULD BE ABLE TO WEAR WHATEVER THE HELL THEY WANT TO WEAR.”

Page 11: NLP Research for  Commercial Development of Writing Evaluation Capabilities

And this…

“I don't know how to explain this question because I took a nap while listening. Sorry.”

Page 12: NLP Research for  Commercial Development of Writing Evaluation Capabilities

And this …

This is my topic on presidents. The one i will be talking about is Bill Clinton. He like most of us have fingers, with fingernails. He also has two arms were his fingers are on, were teh fingernails connect.What can a arm be without a hand? nothing thats the answer, so he obviously has two hands in witch the fingers are on, with the fingernails connect too. He also has eyes.....EYES o yeah even him the big cheese has eyes, its weird cuz not many people have eyes but this good president does, and he can SEE you SEE you, you might not be able to see him , but he can SEE you.

One time we were on AOL chat and i was talking to him he said he like to climb trees in his underwear well his arms were covered with sauce....pizza sauce. He said it makes him feel free and good about himself. Also in the chat he said he has a pigeon for a pet and the things name is Frances, he said they like to make bacon together in the mourning and at night.And they eat at his friends Y place mostly everyday.

Page 13: NLP Research for  Commercial Development of Writing Evaluation Capabilities

And this …

“Is true that in so many jobs people have to wear dress codes for so many reazons.Like in the restaurant the workers are obligaded to use a drees code because they have to look differently and have a good looking to impresionate the cutmers.Not necesesary at all the schools have drees codes but the ones that havet is because   Arriba todos mis compas ya llego el rey del cristal, y yo mismo lo cosino para mejor calidad por esos mismos motivos me busca la federal la dea de estados unidos tambien me quiere agarar. si ellos me buscan por tierra yo me les pelo por mar si piensan que ando colima yo me paseo en michoacan por la ruana y poir tepeque aguililla y cuancoman ,cuantas libras va a llevarse”

[Descriptive Translation: “It’s a rhyme about a drug-runner … the guy is basically saying that he's the king of Cristal meth, wanted by the DEA and the Feds.”]

Page 14: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Human Scoring Algorithm

Page 15: NLP Research for  Commercial Development of Writing Evaluation Capabilities

[Flowchart]

essay → human reader score (S1) and human reader score (S2)

Is |S1 - S2| > 1 ?
  NO: Final Score = mean
  YES: expert human reader score (S3); Final Score = mode, or mean of closest
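
The adjudication flow on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the scoring procedure as actually implemented: the function name `final_score` is hypothetical, and the reading of "mode, or mean of closest" (take the expert score when it matches one reader, otherwise average the expert score with the nearer reader score, and fall back to the expert score on a tie) is an interpretation of the slide.

```python
def final_score(s1, s2, s3=None):
    """Resolve two reader scores, using an expert score only when needed."""
    # Readers agree exactly or are adjacent: the final score is their mean.
    if abs(s1 - s2) <= 1:
        return (s1 + s2) / 2
    # Readers differ by more than one point: an expert score is required.
    if s3 is None:
        raise ValueError("discrepant scores need an expert reader score (s3)")
    # Assumed reading of "mode": the expert matches one of the two readers.
    if s3 == s1 or s3 == s2:
        return float(s3)
    # Assumed reading of "mean of closest": average the expert score with
    # the reader score nearest to it.
    d1, d2 = abs(s1 - s3), abs(s2 - s3)
    if d1 == d2:
        return float(s3)  # assumed tie-break: keep the expert score
    closest = s1 if d1 < d2 else s2
    return (s3 + closest) / 2
```

For example, `final_score(4, 5)` needs no expert, while `final_score(2, 5, 4)` averages the expert score with the nearer reader score.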

Page 16: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Building Automated Essay Scoring Capabilities

Page 17: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Some History

– PEG – 1960’s essay scoring (Page, 1966)
  • Transformation of essay length
  • Some syntactic analysis
  • Convincing results

– Writer’s Workbench (Cherry et al, 1982)
  • Editing tool for students
  • Diction, style, spelling
  • Discourse structure (‘topic sentence’ identification)

– Intelligent Essay Assessor (Landauer et al, 1998)
  • Essay scoring with latent semantic analysis (LSA)
  • Style and mechanics measures

Page 18: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Assessment Market Technologically Ready

• Increase in Internet & Computer Access
  – Ratio of students to instructional computers with Internet access in public schools: 12.1 to 1 in 1998 & 4.8 to 1 in 2001 (NCES, 2002)
  – Web resources used in over 40% of college courses (Campus Computing Project, 2000)
  – 99% of public schools have Internet access (NCES report, 2002)

• State Assessments: Increase in computer-based delivery

• Largest Test Publishers offer 850+ digital textbook titles

Page 19: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Motivation in Assessment

• Cost Savings for Large-Scale Assessments
• Classroom Integration for Instruction
  – More practice writing possible
  – Electronic writing portfolios
  – Individual performance assessment
  – Classroom assessments

Page 20: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Educators’ Questions About Innovation

1. Reliability: Can automated essay assessments increase scoring consistency for authentic assessments?

2. Assessment Type: Can automated scoring introduce more varied high-stakes assessments?

3. Costs/Performance: Can scoring costs be reduced, but scoring performance maintained?

Page 21: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Starting Development

Page 22: NLP Research for  Commercial Development of Writing Evaluation Capabilities

What should a good essay look like?

• Clearly states the author's position, and effectively persuades the reader of validity of author's argument.
• Well organized, with strong transitions helping to link words and ideas.
• Develops its arguments with specific, well-elaborated support.
• Varies sentence structures and makes good word choices; very few errors in spelling, grammar, or punctuation.

Page 23: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Mapping Writing Features to NLP Tools

Writing Features                       | NLP Tools
Grammar Errors & Sentence Structures   | POS Taggers; Syntactic Parsers
Vocabulary Usage                       | Content Analysis; POS Taggers
Sentence & Word-Level Mechanics        | Spelling Tools; POS Taggers
Organization & Development of Ideas    | Discourse Analyzers

Page 24: NLP Research for  Commercial Development of Writing Evaluation Capabilities

E-rater (2/99 – 9/04)

• 50+ Writing-Relevant Features
  – Syntactic Structure Features: clause types
  – Discourse Structure Features: cue words & terms
  – Content: Content vector analysis
  – Lexical Complexity: e.g., word length, unique words

• NLP Tools
  – Syntactic Parses
  – High-level discourse analyzer
  – tf*idf (essay level & argument level)

• Topic-Specific Models
  – Training with Human-Scored Essays
  – Stepwise Linear Regression (Variable Feature Set & Weights)

• System Performance
  – Agreement with Humans
  – Comparable to Two Humans
  – E-rater/Human agreement: 59% exact; 98% exact + adjacent
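
To make the "content vector analysis" and tf*idf items concrete, here is a minimal Python sketch. It is an illustration under assumptions, not E-rater's implementation: the helper names (`build_idf`, `vectorize`, `cosine`, `most_similar_score`), the whitespace tokenizer, the smoothed idf formula, and the rule of assigning the score of the single most similar human-scored essay are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def build_idf(docs):
    """Inverse document frequency per term, smoothed so weights stay positive."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def vectorize(text, idf):
    """tf*idf vector as a dict; terms unseen in training get weight 0."""
    counts = Counter(text.lower().split())
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar_score(essay, scored_essays):
    """Return the human score of the training essay most similar to `essay`."""
    idf = build_idf([text for text, _ in scored_essays])
    essay_vec = vectorize(essay, idf)
    best = max(scored_essays,
               key=lambda pair: cosine(essay_vec, vectorize(pair[0], idf)))
    return best[1]
```

A usage example: with `training = [("school uniforms violate student rights and expression", 5), ("i like pizza and my dog", 1)]`, calling `most_similar_score("uniforms limit student expression", training)` picks the score attached to the first essay, because only that essay shares vocabulary with the input.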

Page 25: NLP Research for  Commercial Development of Writing Evaluation Capabilities

[Flowchart]

essay → human reader score (S1) and E-rater score (S2)

Is |S1 - S2| > 1 ?
  NO: Final Score = mean
  YES: expert human reader score (S3); Final Score = mode, or mean of closest

Page 26: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Outcomes of Early Success

Page 27: NLP Research for  Commercial Development of Writing Evaluation Capabilities

NY Times Headline Phobia

Can you spell imbecile?: E-rater® Gives Good Score to Bad Essay
By A. Reporter

ETS’s automated scoring system thinks that this essay should get something like a “B.” Would you want your child to do well on this kind of writing? You be the judge.

“You are stupid. You are stupid because you can't read. You are also stupid becuase you don't speak English and because you can't add. Your so stupid, you can't even add! Once, a teacher give you a very simple math problem; it was 1+1=?. Now keep in mind that this was in fourth grade, when you should have known the answer. You said it was 23! I laughed so hard I almost wet my pants! How much more stupid can you be?! So have I proved it? Don't you agree that your the stupidest person on earth? I mean, you can't read, speak English, or add. Let's face it, your a moron, no, an idiot, no, even worse, you're an imbosol.”

Page 28: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Anomalous Essay Detection

Statistical evaluation of word usage to flag anomalous essays

– “Your essay does not resemble others being written on this topic.”

– “Your essay might not be relevant to assigned topic.”

– “Your essay appears to be restatement of the topic with a few additional concepts.”

– “Compared to other essays written on this topic, your essay contains more repetition of words.”

– “Your essay shows less development of a theme than other essays written on this topic.”

Page 29: NLP Research for  Commercial Development of Writing Evaluation Capabilities

What teachers really wanted: Qualitative Feedback

Page 30: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Learning from Assessment Experts

• Holistic scores not meaningful to students

  Score 3:
  – While a position may be stated, either it is unclear OR undeveloped.
  – May have organization in parts, but lacks organization in other parts.
  – The support of the position may be brief, repetitive, or irrelevant.
  – Inconsistent control of sentence structure, and incorrect word choices; errors in spelling, grammar, or punctuation occasionally interfere with reader understanding.

• Demos for focus groups with teachers, policy makers, assessment experts
  – Errors in grammar, usage, mechanics, and style
  – Organization & Development

Page 31: NLP Research for  Commercial Development of Writing Evaluation Capabilities

More Innovation – More Questions

– Meaningfulness: Is the feedback consistently related to a clearly-defined standard?

– Self-Evaluation: Can instructional software help students understand evaluation of their writing?

– Improvement: Can writing practice with immediate feedback help students?

Page 32: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Criterion℠ Online Essay Evaluation Service

• Critique writing analysis tools
  – Grammar
  – Usage
  – Mechanics
  – Style
  – Organization & Development

• E-rater

Page 33: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Motivation For New Capability Development

• What’s free for commercial use
  – Spelling

• What’s not … free …
  – Grammar
  – Usage
  – Mechanics
  – Style

• What doesn’t exist
  – Organization & Development

Page 34: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Methods

Page 35: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Bigrams for Grammar, Usage & Mechanics Errors

(Leacock & Chodorow)

• Corpus of well-formed text: 30 million words from newspapers

• Features: function words and part-of-speech tags
    a_AT good_JJ job_NN during_IN

• Collect frequencies for:
  – Unigrams of tags and function words
  – Bigrams of tags and function words
      a_JJ AT_JJ JJ_NN NN_during NN_IN

• Method: pointwise mutual information and log likelihood ratio used to detect unexpected sequences – likely violations of English grammar
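
The pointwise mutual information part of this method can be sketched as follows. This is a simplified illustration, not the Leacock & Chodorow system: the function name `make_pmi` and the toy tag sequences are assumptions, the log likelihood ratio test is omitted, and unseen bigrams are simply assigned negative infinity.

```python
import math
from collections import Counter


def make_pmi(sequences):
    """Count unigrams and bigrams over tag/function-word sequences and
    return a function computing PMI(a, b) = log2( P(a,b) / (P(a) * P(b)) )."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        # A bigram never seen in the well-formed corpus is maximally unexpected.
        if bigrams[(a, b)] == 0:
            return float("-inf")
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log2(p_ab / (p_a * p_b))

    return pmi


# Toy "well-formed" corpus of part-of-speech tag sequences (illustrative only).
corpus = [["AT", "JJ", "NN", "IN"], ["AT", "NN", "IN"], ["DTS", "NNS"]]
pmi = make_pmi(corpus)
```

With this toy corpus, `pmi("AT", "JJ")` is positive (the pair co-occurs more than chance predicts), while `pmi("DTS", "NN")` is negative infinity because the sequence never occurs in the reference text.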

Page 36: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Grammar, Usage and Mechanics Errors

• Harvest low probability bigrams from a set of essays.

• Low probability bigrams:
  – DTS_NN, AT_NNS
  – *these pencil, *every teenagers

• Write Filters:
  – *These pencil is yellow.
  – but not These pencil erasers are dirty.
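
A filter of the kind shown above might look like the sketch below. The specific rule (suppress a DTS_NN flag when the next tag is a plural noun, so that the singular noun is read as a modifier in a compound) is an assumption chosen to reproduce the slide's two examples; the names `LOW_PROBABILITY_BIGRAMS` and `flag_bigrams` are hypothetical, and the input is assumed to be already tagged.

```python
# Low probability tag bigrams harvested from essays (the two from the slide).
LOW_PROBABILITY_BIGRAMS = {("DTS", "NN"), ("AT", "NNS")}


def flag_bigrams(tagged):
    """Return (word, word) pairs whose tag bigram is suspicious.

    `tagged` is a list of (word, tag) pairs from some part-of-speech tagger.
    """
    flags = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if (t1, t2) not in LOW_PROBABILITY_BIGRAMS:
            continue
        next_tag = tagged[i + 2][1] if i + 2 < len(tagged) else None
        # Assumed filter: "these pencil erasers" is a noun compound headed
        # by a plural noun, so the DTS_NN bigram is not an error there.
        if (t1, t2) == ("DTS", "NN") and next_tag == "NNS":
            continue
        flags.append((w1, w2))
    return flags


bad = [("These", "DTS"), ("pencil", "NN"), ("is", "BEZ"), ("yellow", "JJ")]
ok = [("These", "DTS"), ("pencil", "NN"), ("erasers", "NNS"),
      ("are", "BER"), ("dirty", "JJ")]
```

Here `flag_bigrams(bad)` reports the pair for "These pencil", and `flag_bigrams(ok)` reports nothing.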

Page 37: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Grammar

• Fragments
• Garbled Sentences
• Subject-Verb Agreement: the motel are …
• Verb form: They are need to distinguish …
• Pronoun Errors: Them are my reasons …
• Possessive Errors: the students grades
• Wrong or Missing Word: The should take the student
• Proofread This!: I think my through problems

Page 38: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Usage

• Article Errors: I like these song
• Confused Words: Because of there different genres …
• Wrong Form of Word: the right choose
• Faulty Comparison: It is more easier
• Nonstandard Verb or Word Form

Page 39: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Mechanics

• Spelling
• Missing Capitalization
• Missing Initial Capitalization
• Missing Question Mark
• Missing Final Punctuation
• Missing Apostrophe: Thats about the only thing
• Missing Comma
• Missing Hyphen: a well deserved vacation
• Fused Words
• Compound Words
• Duplicate Words: escape to the another town

Page 40: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Style

• Short sentences, unusually long sentences, and passives?
• Automatic detection of repeated words
  – 300 essays manually annotated for repetition
  – Word-based text features with C5.0
    • proportion of word use in essays
    • distance between repeated word occurrences
    • pronoun?
    • word length
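
The word-based features listed above could be extracted along these lines before being passed to a decision-tree learner such as C5.0. This is only a sketch: the function name `repetition_features`, the feature keys, the small pronoun list, and the exact definitions (for instance, distance measured in tokens to the nearest other occurrence) are assumptions, and the classifier itself is not shown.

```python
# Small illustrative pronoun list; a real system would use a fuller one.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}


def repetition_features(tokens, index):
    """Features for the token at `index`, describing how it repeats in the essay."""
    lowered = [t.lower() for t in tokens]
    word = lowered[index]
    positions = [i for i, t in enumerate(lowered) if t == word]
    distances = [abs(i - index) for i in positions if i != index]
    return {
        # Share of all essay tokens that are this word.
        "proportion": len(positions) / len(tokens),
        # Token distance to the nearest other occurrence (None if unique).
        "nearest_repeat_distance": min(distances) if distances else None,
        "is_pronoun": word in PRONOUNS,
        "word_length": len(word),
    }


tokens = "the dog saw the other dog".split()
features = repetition_features(tokens, 1)  # features for the first "dog"
```

Each token's feature dictionary, paired with a human "repetitive / not repetitive" annotation, would form one training instance.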

Page 41: NLP Research for  Commercial Development of Writing Evaluation Capabilities

How Do We Identify Organization & Development in Essay Writing?

• Discourse Theories
  • Lacks Essay-Based Discourse Function
    – Cue-word & term detection (Cohen, Hirschberg & Litman, Hovy et al, Knott, Mann & Thompson, Vander Linden & Martin, Sidner, & Quirk, et al)
  • Topical Coherence, Not Discourse Function
    – TextTiling (Hearst & Plaunt)
    – LSA (Landauer et al)
    – Select-a-Kibitzer (Wiemer-Hastings & Graesser)
  • Not Student Friendly
    – RST Trees (Mann & Thompson)

• Essay-Based Discourse Analyzer (Burstein, Marcu, & Knight)
  • Background, Thesis, Main Points, Supporting Ideas, and Conclusion

Page 42: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Organization and Development:Essay-Based Discourse Analyzer

• 1400+ essays manually annotated with pre-defined labels

• Voting Between 3 Systems: 2 Probabilistic & 1 Decision-Based
  – Probable discourse label sequences
  – Essay sublanguage: agree, should, would, opinion, for example, because, however...
  – RST relations: contrast, elaboration, antithesis...
  – Syntactic structures: infinitive, complement, subordinating clauses...

• Identifies background text, thesis statement, main ideas, supporting ideas, & conclusion statement
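
The voting step between three labeling systems can be illustrated with a simple per-sentence majority vote. The slide does not say how votes are combined or how disagreements are resolved, so the function name `vote`, the strict-majority rule, and the fallback to one designated system when all three disagree are assumptions for this sketch.

```python
from collections import Counter


def vote(system_outputs, tie_breaker=0):
    """Combine per-sentence labels from several systems by majority vote.

    `system_outputs` is a list of label sequences, one per system, all the
    same length. When no label has a strict majority, the label from the
    system at index `tie_breaker` is used (an assumed fallback).
    """
    if len({len(output) for output in system_outputs}) != 1:
        raise ValueError("all systems must label the same number of sentences")
    voted = []
    for labels in zip(*system_outputs):
        label, count = Counter(labels).most_common(1)[0]
        voted.append(label if count > len(labels) / 2 else labels[tie_breaker])
    return voted


system_a = ["thesis", "main point", "supporting idea"]
system_b = ["thesis", "supporting idea", "supporting idea"]
system_c = ["background", "main point", "conclusion"]
combined = vote([system_a, system_b, system_c])
```

In this example each sentence receives the label that at least two of the three systems agree on.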

Page 43: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Evaluating Capabilities

• Precision, Recall, & F-measure
  – Trade-off: favor Precision over Recall
  – Better not to show falsely identified errors

• Grammar, Usage, & Mechanics (Bigrams)
  – Minimum Precision for Deployment

• Style & Organization & Development
  – Human-annotated data
  – Develop baseline comparisons
  – Precision, Recall, F-measure outperform baselines & approach human agreement
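
For reference, the three measures named on this slide follow the standard definitions, sketched here in Python. The function name `precision_recall_f` and the representation of system output and human annotation as sets of error identifiers are illustrative assumptions.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Precision, recall, and F-measure of predicted items against gold items.

    precision = true positives / items the system flagged
    recall    = true positives / items the annotators marked
    F         = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# The system flags 4 errors, 3 of which are among the 6 annotated errors.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
```

A system tuned as described above would accept a lower `r` in exchange for a higher `p`, since a falsely flagged error is shown directly to the student.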

Page 44: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Some Numbers

• Grammar, Usage, & Mechanics
  – (Minimum) Overall System Precision = 0.90

• Discourse Capability (Org & Dev)
  – Baseline Precision = 0.71
  – Overall System Precision = 0.85
  – Human Agreement = 0.95

• Repetitive Word Use (Style)
  – Baseline Precision = 0.27
  – Overall System Precision = 0.95
  – Human Agreement = 0.55

Page 45: NLP Research for  Commercial Development of Writing Evaluation Capabilities

E-rater v.2.0 – September 2004

• 12 Features: Relevant to Writing Standards
  – Grammar, Usage, and Mechanics: Error Types
  – Style: Sentence Type, Sentence Length, Repeated Words
  – Organization: Thesis, Main Points, Support, and Conclusion
  – Content: Vocabulary Usage

• Topic-Specific & Grade-Specific Models
  – Training with Human-Scored Essays
  – Multiple Regression (Standardized Feature Set & Variable Weights)

• System Performance
  – Agreement with Humans
  – Comparable to Two Humans
  – E-rater/Human agreement: up to 62% exact (from 59%); 98% exact + adjacent
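
The "exact" and "exact + adjacent" agreement figures can be computed as in this short sketch, where adjacent means the two scores differ by at most one point. The function name `agreement` and the sample scores are made up for illustration and are not E-rater data.

```python
def agreement(machine_scores, human_scores):
    """Return (exact, exact_plus_adjacent) agreement rates between two raters."""
    if len(machine_scores) != len(human_scores) or not machine_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(machine_scores, human_scores))
    exact = sum(m == h for m, h in pairs) / len(pairs)
    # "Exact + adjacent": scores are identical or differ by one point.
    exact_plus_adjacent = sum(abs(m - h) <= 1 for m, h in pairs) / len(pairs)
    return exact, exact_plus_adjacent


# Illustrative scores on a 6-point scale (not real data).
exact, within_one = agreement([4, 3, 5, 2], [4, 4, 5, 4])
```

The same function applied to two human readers gives the human/human baseline that the slide compares against.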

Page 46: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Criterion℠ Online Essay Evaluation

Page 47: NLP Research for  Commercial Development of Writing Evaluation Capabilities

More to Prevent System Gaming: Word Salad Detector (Morton)

• Rare p-o-s tag sequences (mixing up content!)
    quick The the over brown dogs fox. jumped lazy
• Abnormal p-o-s tag distributions
    kfdl afjidaoi djfd &&&&**
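
One way to test for an abnormal part-of-speech tag distribution is to compare an essay's tag proportions with a reference distribution. The sketch below is not the detector described on the slide: the L1 distance, the threshold of 1.0, the reference proportions, and the names `tag_distribution`, `l1_distance`, and `looks_abnormal` are all assumptions, and the input is assumed to be already tagged.

```python
from collections import Counter


def tag_distribution(tags):
    """Relative frequency of each part-of-speech tag in a tagged essay."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


def l1_distance(p, q):
    """Sum of absolute differences between two distributions (0 to 2)."""
    return sum(abs(p.get(tag, 0.0) - q.get(tag, 0.0)) for tag in set(p) | set(q))


def looks_abnormal(tags, reference, threshold=1.0):
    """Flag an essay whose tag distribution is far from the reference."""
    return l1_distance(tag_distribution(tags), reference) > threshold


# Hypothetical reference proportions estimated from ordinary essays.
reference = {"NN": 0.3, "VB": 0.2, "AT": 0.2, "JJ": 0.1, "IN": 0.2}
gibberish_tags = ["UNK", "UNK", "UNK", "SYM"]
ordinary_tags = ["AT", "JJ", "NN", "VB", "IN", "AT", "NN", "NN", "VB", "IN"]
```

Here `looks_abnormal(gibberish_tags, reference)` is true because none of the essay's tags overlap with the reference, while `looks_abnormal(ordinary_tags, reference)` is false.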

Page 52: NLP Research for  Commercial Development of Writing Evaluation Capabilities

[Diagram labels: essay; E-rater score; Critique Writing Analysis Feedback (Org & Dev; G, U, M, S); Student Independence]

Page 53: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Effectiveness Studies

Page 54: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Criterion Field Data (Attali, 2004)

• Research Questions
  – Can we evaluate the basic effectiveness of Critique writing analysis tools?
  – Can students understand and respond to system feedback?

• Criterion Field Data
  – Multiple submissions from about 9,000 6th to 12th grade essays
  – Available for analysis:
    • First and last essay submission
    • Total number of submissions

Page 55: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Summary Results

• 25% error reduction across 30+ error types
• Increased number of essay-based discourse elements
  – background
  – main point
  – supporting idea
  – conclusion

Page 56: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Criterion℠ & Standardized Testing (Shermis et al, 2004)

• Research Question
  Can Criterion use have a positive impact on FCAT writing scores?

• Data/Design
  36 10th Grade English classes in Miami-Dade
  – 18 used Criterion
  – 18 used “traditional” instruction

Page 57: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Summary Results

Bad News
  – No significant differences in FCAT scores

Good News
  – Significant growth in writing performance across different topics; reduced numbers of errors
  – Significantly more writing productivity

Page 58: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Users & Volumes

E-rater – GMAT (1999 – present)
  – 350K essays scored each year
  – Moving into Statewide Assessment

Criterion: E-rater + Critique
  – K-12, college, and graduate level practice applications
  – Dec 2002: 200 clients & 50K subscriptions
  – Dec 2003: 445 clients & 127K subscriptions
  – Dec 2004: 747 clients & 682K subscriptions

International Exposure
  – Users in Canada, Mexico, Pakistan, India, Estonia, Puerto Rico, Egypt, Nepal, Taiwan, Hong Kong & Japan

Page 59: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Beyond Essay Evaluation

C-rater (Leacock)
  – short-answer, concept-based evaluation
    • morphological analyzer
    • pronoun resolution
    • syntactic chunker
    • predicate argument structure generator
    • automated spelling correction
    • word similarity matrices
    • part-of-speech tagger

Test Item Creation Assistants
  – key-distractor selection for word-based test items
    • Statistical word similarity tools (Deane & Higgins)

Page 60: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Something New for March 2005

• E-rater Trait-Based Scoring (Attali)
  – Separate Ranks for Grammar, Usage, Mechanics, Style, and Organization & Development
  – E-rater-Style Model Building
  – Directly to Specific Feedback Categories

Page 61: NLP Research for  Commercial Development of Writing Evaluation Capabilities

Current Research Problems

Additional Grammar Error Detection (Chodorow & Wolff)
  – Preposition Errors
      a knowledge of/*at math
  – Long Distance Subject-Verb Agreement
      The use of dress codes are becoming a popular subject.

Coherence in Organization & Development (Higgins, Burstein & Marcu, 2004)
  – Does the thesis statement respond to the question?
  – Do the main points relate to the thesis statement?
  – Are all sentences in a supporting idea related?

Page 62: NLP Research for  Commercial Development of Writing Evaluation Capabilities

More Publications:
http://www.ets.org/research/erater.html

Tom Morton’s Freeware Parser:
https://sourceforge.net/projects/opennlp

OpenNLP Tools, Download