School of Data Science, Fudan University (复旦大学大数据学院)
Chinese Event Extraction
杨依莹
2017.11.22
Outline
1. ACE program
2. Assignment 3: Chinese event extraction
3. CRF++: Yet Another CRF toolkit
ACE program
Automatic Content Extraction (ACE) program:
• The objective of the Automatic Content Extraction (ACE) Program was to develop extraction technology to support automatic processing of source language data (in the form of natural text and as text derived from ASR and OCR).
• The program relates to English, Arabic and Chinese texts.
• The ACE corpus is one of the standard benchmarks for testing new information extraction algorithms.
ACE program
Automatic Content Extraction (ACE) program:
Given a text in natural language, the ACE challenge is to detect:
1. entities mentioned in the text, such as: persons, organizations, locations, facilities, weapons.
2. relations between entities, such as: person A is the manager of company B. Relation types include: role, part, located, near, and social.
3. events mentioned in the text, such as: interaction, movement, transfer, creation and destruction.
[Figure: an example of text]
ACE program : entity
• Entity Detection and Tracking (EDT)
• ACE tasks identified seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). Each type was further divided into subtypes.
• For every mention, the annotator identified the maximal extent of the string that represents the entity and labeled the head of each mention. Nested mentions were also captured.
ACE program : relation
• Relation Detection and Characterization (RDC):
  • involved the identification of relations between entities.
  • For every relation, annotators identified two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes.
ACE program : relation
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
  • Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
  • DBpedia: an ontology derived from Wikipedia containing over 2 billion RDF triples.
  • Freebase: a dataset from Wikipedia infoboxes.
  • On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to be a replacement for the Freebase API.
• Support question answering
  • The granddaughter of which actor starred in the movie "E.T."?
  • (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
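The question-answering query above can be read as a join over relation triples. The following is a minimal sketch, assuming a toy in-memory set of hand-written triples; the facts, the `TRIPLES` set and the `answer` function are illustrative only and are not part of ACE or of any knowledge-base API.

```python
# Toy knowledge base of (relation, subject, object) triples.
# These facts are written by hand for illustration, not produced by an extractor.
TRIPLES = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}


def answer(triples):
    """Return every ?y such that some ?x acted in "E.T.", ?y is an actor,
    and ?x is the granddaughter of ?y."""
    cast = {s for (r, s, o) in triples if r == "acted-in" and o == "E.T."}
    actors = {s for (r, s, o) in triples if r == "is-a" and o == "actor"}
    return sorted(
        o for (r, s, o) in triples
        if r == "granddaughter-of" and s in cast and o in actors
    )


print(answer(TRIPLES))
```

Each clause of the query becomes one filter over the triples, which is why relation extraction that fills such a store directly supports this kind of question.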
ACE program : relation
Automatic Content Extraction (ACE) program:
• 7 types and 17 subtypes of relations from the "Relation Extraction Task"

Type                  Subtypes
ARTIFACT              User-Owner-Inventor-Manufacturer
GENERAL AFFILIATION   Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
ORG AFFILIATION       Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
PART-WHOLE            Geographical, Subsidiary
PERSON-SOCIAL         Business, Family, Lasting Personal
PHYSICAL              Located, Near
ACE program : relation
• Physical-Located (PER-GPE): He was in Tennessee
• Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
• Person-Social-Family (PER-PER): John's wife Yoko
• Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…
ACE program : relation
• Using Patterns to Extract Relations
  • Lexico-syntactic patterns: hand-written rules over words and syntax that signal a relation
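As an illustration of a lexico-syntactic pattern, the sketch below uses one "X such as Y" rule (often called a Hearst pattern, which is a choice made here and not something the slides name) to extract is-a pairs. The regular expression and the `extract_is_a` helper are assumptions for this example; real systems usually match over parsed noun phrases instead of single words.

```python
import re

# One hand-written pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Single words stand in for noun phrases to keep the sketch short.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+, )*\w+(?: (?:and|or) \w+)?)")


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs found by the "such as" pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(r", | and | or ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


print(extract_is_a("ACE covers languages such as English, Arabic and Chinese."))
```

Patterns like this tend to be precise but cover few sentences, which is one motivation for the learning-based approaches that follow.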
ACE program : relation
• Supervised Learning
  1. Find all pairs of named entities
  2. Decide whether the two entities are related
  3. If yes, classify the relation
ACE program : relation
• Supervised Learning
  • The most important step: classification
  • e.g. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
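The three supervised steps can be sketched on the example sentence as follows. This is only an outline under stated assumptions: the entity mentions are labeled by hand, and `is_related` and `classify` are hypothetical rule-based stand-ins for the trained classifiers a real system would use.

```python
from itertools import combinations

SENTENCE = ("American Airlines, a unit of AMR, immediately matched the move, "
            "spokesman Tim Wagner said.")
# Entity mentions as (text, type), labeled by hand for this illustration.
ENTITIES = [("American Airlines", "ORG"), ("AMR", "ORG"), ("Tim Wagner", "PER")]


def text_between(sentence, first, second):
    """Return the text between two mentions, a typical classifier feature."""
    start = sentence.index(first) + len(first)
    return sentence[start:sentence.index(second)]


def is_related(sentence, e1, e2):
    # Step 2 stand-in: a real system would use a trained binary classifier.
    return text_between(sentence, e1[0], e2[0]).strip(" ,") == "a unit of"


def classify(sentence, e1, e2):
    # Step 3 stand-in: a real system would use a trained multi-class classifier.
    return "Part-Whole-Subsidiary" if (e1[1], e2[1]) == ("ORG", "ORG") else "Other"


def extract_relations(sentence, entities):
    relations = []
    for e1, e2 in combinations(entities, 2):   # step 1: all entity pairs
        if is_related(sentence, e1, e2):       # step 2: related or not
            relations.append((e1[0], classify(sentence, e1, e2), e2[0]))  # step 3
    return relations


print(extract_relations(SENTENCE, ENTITIES))
```

Splitting the decision into a cheap related/unrelated filter followed by a finer classifier keeps the second model from being swamped by the many unrelated pairs.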
ACE program : relation
• Semi-supervised Learning
  1. Start from a few high-precision seed patterns or seed tuples.
  2. Find sentences that contain the entities in a seed pair.
  3. Extract and generalize the context to learn new patterns.
• May cause semantic drift: an erroneous pattern introduces erroneous tuples, which in turn lead to more erroneous patterns.
ACE program : relation
• Semi-supervised Learning
  • To avoid semantic drift, we assign a confidence value to each new pattern and tuple.
  • Set conservative confidence thresholds for the acceptance of new patterns and tuples.
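One bootstrapping round with a confidence threshold might look like the sketch below. Everything in it is an assumption made for illustration: the corpus entries are made up, each sentence is reduced to (first entity, context, second entity), and a pattern's confidence is simply the fraction of its matches that are already known tuples, which is a simplification of the scores used in practice.

```python
from collections import defaultdict

# Made-up corpus, already reduced to (first entity, context between, second entity).
CORPUS = [
    ("Steve Jobs", "co-founder of", "Apple"),
    ("Bill Gates", "co-founder of", "Microsoft"),
    ("Larry Page", "co-founder of", "Google"),
    ("Steve Jobs", "visited", "Apple"),
    ("Bill Gates", "visited", "Apple"),
    ("Larry Page", "visited", "Microsoft"),
    ("Steve Jobs", "visited", "Google"),
]
SEEDS = {("Steve Jobs", "Apple"), ("Bill Gates", "Microsoft")}


def bootstrap_once(corpus, known, threshold):
    """Run one round; return (accepted patterns with confidence, new tuples)."""
    matches = defaultdict(list)
    for first, context, second in corpus:
        matches[context].append((first, second))
    accepted, new_tuples = {}, set()
    for pattern, pairs in matches.items():
        hits = sum(pair in known for pair in pairs)
        confidence = hits / len(pairs)          # share of matches already known
        if hits and confidence >= threshold:
            accepted[pattern] = confidence
            new_tuples.update(set(pairs) - known)
    return accepted, new_tuples


print(bootstrap_once(CORPUS, SEEDS, threshold=0.5))
print(bootstrap_once(CORPUS, SEEDS, threshold=0.2))
```

With the conservative threshold only "co-founder of" is accepted; lowering it also admits "visited", which pulls in unrelated pairs and shows how drift starts.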
ACE program : event
Automatic Content Extraction (ACE) program:
• Event Detection and Characterization (EDC)
2. Assignment 3: Chinese event extraction
Description
• In this assignment, you will need to use sequence labeling models for Chinese event extraction.
• Event information is defined as two parts:
  • Trigger: the main word that most clearly expresses the occurrence of an event.
  • Argument: an entity, temporal expression or value that plays a certain role in the event.
• For example: "英特尔在中国成立了研究中心" ("Intel established a research center in China")
  • "成立" (established) is the trigger, of type Business
  • "英特尔" (Intel), "中国" (China) and "研究中心" (research center) are the arguments, of types Agent, Place and Org
Description
• This task is separated into two subtasks:
  • Trigger labeling: identify the trigger word in the sentence, and classify it into one of the following 8 types:
    [Figure: list of the 8 trigger types]
  • Argument labeling: identify all the arguments in the sentence, and classify them into 35 types (some are listed below; all types can be found in the training file):
    [Figure: partial list of argument types]
• You are required to use both HMM and CRF models for this task. You can use any toolkit for their implementation.
• Note that the performance of HMM can be very poor.
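For the HMM part, a from-scratch baseline could look like the sketch below: counts for start, transition and emission probabilities, then Viterbi decoding in log space. The add-one smoothing, the function names and the label strings such as "A_Agent" are assumptions for this illustration, not requirements of the assignment.

```python
import math
from collections import Counter


def train_hmm(instances):
    """Collect counts from instances, each a list of (word, label) pairs."""
    counts = {name: Counter() for name in ("start", "trans", "out", "emit", "label")}
    vocab = set()
    for instance in instances:
        previous = None
        for word, label in instance:
            if previous is None:
                counts["start"][label] += 1
            else:
                counts["trans"][previous, label] += 1
                counts["out"][previous] += 1
            counts["emit"][label, word] += 1
            counts["label"][label] += 1
            vocab.add(word)
            previous = label
    return counts, vocab


def log_prob(count, total, n_outcomes):
    """Add-one smoothed log probability (an assumed, simple smoothing choice)."""
    return math.log((count + 1) / (total + n_outcomes))


def viterbi(words, model):
    """Return the most probable label sequence for a non-empty list of words."""
    counts, vocab = model
    labels = sorted(counts["label"])
    n_labels, n_words = len(labels), len(vocab) + 1  # +1 leaves mass for unseen words
    n_starts = sum(counts["start"].values())

    def emission(label, word):
        return log_prob(counts["emit"][label, word], counts["label"][label], n_words)

    # Each column maps label -> (best log score of a path ending here, previous label).
    columns = [{y: (log_prob(counts["start"][y], n_starts, n_labels)
                    + emission(y, words[0]), None) for y in labels}]
    for word in words[1:]:
        last = columns[-1]
        columns.append({
            y: max((last[p][0]
                    + log_prob(counts["trans"][p, y], counts["out"][p], n_labels)
                    + emission(y, word), p) for p in labels)
            for y in labels})
    # Pick the best final label, then follow the back-pointers.
    label = max(labels, key=lambda y: columns[-1][y][0])
    path = [label]
    for column in reversed(columns[1:]):
        label = column[label][1]
        path.append(label)
    return path[::-1]


INSTANCE = [("英特尔", "A_Agent"), ("在", "O"), ("中国", "A_Place"),
            ("成立", "T_Business"), ("了", "O"), ("研究中心", "A_Org")]
model = train_hmm([INSTANCE])
print(viterbi([word for word, _ in INSTANCE], model))
```

Because an HMM conditions each label only on the current word and the previous label, it cannot use the richer context features a CRF template provides, which is consistent with the warning about its performance.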
Formal Definition
Input: A sequence of segmented Chinese words.
Output: Label each word with 'T_type' (trigger), 'A_type' (argument) or 'O' (neither trigger nor argument). Save your labeling result after the real label, separated with a tab.
[Figures: fig. 1, input; fig. 2, training instance; fig. 3, testing result]
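The output format can be sketched as follows, assuming each result line is "word, real label, predicted label" separated by tabs. The `format_result` helper and the concrete label strings (for example "T_Business") are illustrative guesses based on the 'T_type' and 'A_type' scheme above.

```python
def format_result(words, real_labels, predicted_labels):
    """Return one instance as lines of word<TAB>real label<TAB>predicted label."""
    rows = zip(words, real_labels, predicted_labels)
    return "\n".join("\t".join(row) for row in rows)


words = ["英特尔", "在", "中国", "成立", "了", "研究中心"]
real = ["A_Agent", "O", "A_Place", "T_Business", "O", "A_Org"]
predicted = ["A_Agent", "O", "O", "T_Business", "O", "A_Org"]  # one deliberate error
print(format_result(words, real, predicted))
```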
Provided Files
• trigger_train.txt & trigger_test.txt:
  • These two files contain 1,918 and 669 instances for training and testing, respectively.
  • Each line contains one word and its label, separated by a tab.
  • Instances are separated by a blank line.
• argument_train.txt & argument_test.txt:
  • These two files contain 2,131 and 997 instances for training and testing, respectively.
• Your job is to predict the sequence label for instances in the test files, and write your predictions in the result files. The labels in the test files are only for evaluation.
• eval.py
  • This file can help you evaluate your model's recall, accuracy, precision and F1-score.
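To make the file format and the metrics concrete, here is a minimal sketch that parses blank-line-separated instances and computes token-level scores. It is not the provided eval.py: the definitions below (precision, recall and F1 counted over non-'O' labels, accuracy over all tokens) are one common convention and an assumption here.

```python
def read_instances(text):
    """Split file content into instances; each is a list of tab-separated rows."""
    instances = []
    for block in text.strip().split("\n\n"):
        rows = [line.split("\t") for line in block.splitlines() if line.strip()]
        if rows:
            instances.append(rows)
    return instances


def score(real, predicted):
    """Token-level accuracy, plus precision, recall and F1 over non-'O' labels."""
    pairs = list(zip(real, predicted))
    correct = sum(r == p for r, p in pairs)
    true_pos = sum(r == p != "O" for r, p in pairs)
    n_predicted = sum(p != "O" for p in predicted)
    n_real = sum(r != "O" for r in real)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_real if n_real else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


sample = "a\tO\nb\tT_X\n\nc\tO\n"
print(read_instances(sample))
print(score(["A_Agent", "O", "A_Place", "T_Business", "O", "A_Org"],
            ["A_Agent", "O", "O", "T_Business", "A_Org", "A_Org"]))
```

Since most words are labeled 'O', accuracy alone can look high for a model that finds few triggers or arguments, which is why precision, recall and F1 are reported alongside it.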
Submission
• Generate a zip file and name it "sid_homework-3.zip".
• It should include a Python file named "extraction.py", two output files named "trigger_result.txt" and "argument_result.txt", and a written report named "chinese eventextraction.pdf".
• Program: code should be written in Python.
• Report: the report needs to be written in English, with no more than 4 pages.
Evaluation
• We will mark your homework based on four criteria:
  • Final accuracy (20%)
  • Program (30%)
  • Report (40%)
  • HMM implementation (10%)
Due
• Submit your homework via the E-learning system.
• Deadline: midnight on December 8th, 2017
• If you have any questions about this homework, send email to the TA or our course mailbox.
• TA in charge: 杨依莹 ([email protected])
3. CRF++: Yet Another CRF toolkit
CRF++: Yet Another CRF toolkit
• CRF++ (http://taku910.github.io/crfpp/) is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.
• CRF++ is designed for generic purpose and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
CRF++: Yet Another CRF toolkit
• Template basic
• Each line in the template file denotes one template. In each template, special macro %x[row,col] will be used to specify a token in the input data.
• Here you can find some examples for the replacements
Input: Data
He PRP B-NP
reckons VBZ B-VP
the DT B-NP << CURRENT
current JJ I-NP
account NN I-NP
template            expanded feature
%x[0,0]             the
%x[0,1]             DT
%x[-1,0]            reckons
%x[-2,1]            PRP
%x[0,0]/%x[0,1]     the/DT
ABC%x[0,1]123       ABCDT123
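The macro expansion in the table can be reproduced with a few lines of Python. This is a sketch of the substitution rule only, not CRF++ code: the `expand` function and the `_PAD_` placeholder for rows outside the sentence are assumptions (CRF++ uses its own boundary markers there).

```python
import re

DATA = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],       # current token in the example above
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, data, position):
    """Replace each %x[row,col] with the token at (position + row, col)."""
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        # Hypothetical placeholder for positions outside the sentence.
        return data[row][col] if 0 <= row < len(data) else "_PAD_"
    return MACRO.sub(lookup, template)


for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                 "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, DATA, position=2))
```

Here row is relative to the current token and col is an absolute column index, so %x[-2,1] reads the second column two tokens back.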
CRF++: Yet Another CRF toolkit
• Training (encoding)
  • Use the crf_learn command:
% crf_learn template_file train_file model_file
• There are 4 major parameters to control the training condition:
  • -a CRF-L2 or CRF-L1: Changes the regularization algorithm. The default setting is L2. Generally speaking, L2 performs slightly better than L1.
  • -c float: With this option, you can change the hyper-parameter for the CRFs. This parameter trades the balance between overfitting and underfitting; with a larger value, the model fits the training data more closely.
  • -f NUM: This parameter sets the cut-off threshold for the features. CRF++ uses the features that occur no less than NUM times in the given training data. The default value is 1.
  • -p NUM: If the PC has multiple CPUs, you can make the training faster by using multi-threading. NUM is the number of threads.
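If training is driven from the Python program, the four options above can be assembled into a command line as sketched here. The `crf_learn_command` helper, its default values, and the file names "template" and "trigger.model" are hypothetical; only the flags themselves come from the description above.

```python
import shlex


def crf_learn_command(template_file, train_file, model_file,
                      algorithm="CRF-L2", cost=1.0, min_freq=1, threads=1):
    """Build the argument list for crf_learn from the four options above."""
    return ["crf_learn",
            "-a", algorithm,        # regularization: CRF-L2 or CRF-L1
            "-c", str(cost),        # overfitting/underfitting trade-off
            "-f", str(min_freq),    # feature cut-off threshold
            "-p", str(threads),     # number of threads
            template_file, train_file, model_file]


command = crf_learn_command("template", "trigger_train.txt", "trigger.model",
                            cost=1.5, min_freq=3, threads=4)
print(shlex.join(command))
# Running it requires CRF++ to be installed, for example:
#   import subprocess; subprocess.run(command, check=True)
```

Building the argument list in one place makes it easy to try several values of -c and -f and compare the resulting models on the test file.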
CRF++: Yet Another CRF toolkit
• Testing (decoding)
  • Use the crf_test command:
% crf_test -m model_file test_files
• where model_file is the file crf_learn creates. test_files is the test data you want to assign sequential tags to. This file has to be written in the same format as the training file.
• The -v option sets the verbose level. The default value is 0. With a higher level, you can also get the marginal probability for each tag and a conditional probability for the output.
% crf_test -v1 -m model test.data | head
Rockwell NNP B B/0.992465
International NNP I I/0.979089
Corp. NNP I I/0.954883
's POS B B/0.986396
Tulsa NNP I I/0.991966
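The verbose output can be post-processed to separate the predicted tag from its marginal probability. The sketch below assumes the row layout shown in the sample (input columns followed by a final "tag/probability" column); `parse_verbose_line` is a hypothetical helper, not part of CRF++.

```python
def parse_verbose_line(line):
    """Split one -v1 output row into (input columns, predicted tag, probability)."""
    *columns, scored = line.split()
    tag, probability = scored.rsplit("/", 1)
    return columns, tag, float(probability)


OUTPUT = """\
Rockwell NNP B B/0.992465
International NNP I I/0.979089
Corp. NNP I I/0.954883
's POS B B/0.986396
Tulsa NNP I I/0.991966"""

for row in OUTPUT.splitlines():
    columns, tag, probability = parse_verbose_line(row)
    print(columns[0], tag, probability)
```

The marginal probabilities give a per-token confidence, which can help locate the words a model is least sure about when analysing errors for the report.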
Thanks for your attention!