101035 Chinese Information Processing
Chinese NLP
Lecture 15
Application: Information Extraction
• Concepts
• IE Tasks
• History and Benchmarks
• IE Process
• IE vs IR
Concepts
• Information extraction (IE) analyzes unrestricted texts in order to extract information about pre-specified types of events, entities and relations, and to create a structured output from unstructured texts.
• IE is an essential NLP technique that supports information retrieval, automatic summarization, question answering, etc.
• Object of IE
• IE typically deals with natural language text, especially unstructured text.
• In a broad sense, IE deals with speech, image, video, and other types of data besides electronic text.
• In a narrow sense, IE deals only with natural language text.
IE Tasks
• Named Entity Detection and Recognition
• It finds and classifies the named entities in the text into pre-defined categories, such as persons, organizations, locations, expressions of time, quantities, monetary values, and percentages, etc.
Example: … banks in Boston and New York.
Named entities: Boston, New York
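As a rough illustration of the task, the sketch below finds and classifies names by looking them up in a small dictionary (a gazetteer). The gazetteer entries, the label names, and the function `find_named_entities` are assumptions made for this example; the lecture does not prescribe this method.

```python
import re

# Toy gazetteer: hypothetical entries chosen for this example only.
GAZETTEER = {
    "Boston": "LOCATION",
    "New York": "LOCATION",
    "Acme Corp.": "ORGANIZATION",
    "Jim": "PERSON",
}

def find_named_entities(text):
    """Return (start, end, surface, label) for every gazetteer match."""
    found = []
    # Try longer names first so a long name wins over a shorter overlapping one.
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            start, end = m.start(), m.end()
            # Skip matches that overlap an entity already found.
            if any(s < end and start < e for s, e, _, _ in found):
                continue
            found.append((start, end, name, GAZETTEER[name]))
    return sorted(found)

entities = find_named_entities("... banks in Boston and New York.")
for start, end, surface, label in entities:
    print(surface, label)
```

A lookup like this only finds names it already knows, which is one reason learning-based taggers are used in practice.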
• Co-Reference Resolution
• It identifies which mentions in the text refer to the same entity.
Example: Jim bought 300 shares of Acme Corp. in 2006. He sold them in 2008.
Entities: Jim (person), 300 shares (quantity), Acme Corp. (organization), 2006 (date)
Co-reference: "He" refers to Jim; "them" refers to the 300 shares.
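One very simple way to approximate this is to link each pronoun to the most recent earlier mention that agrees with it in type and number. The sketch below does that for the example sentence. The `Mention` class, the `PRONOUNS` table, and the agreement rule are assumptions for illustration, not the method used in the lecture or in any benchmark system.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    kind: str      # "person", "quantity", "organization", "date"
    plural: bool

# Pronoun constraints: a hypothetical, tiny table for illustration.
PRONOUNS = {
    "he": {"kind": "person", "plural": False},
    "she": {"kind": "person", "plural": False},
    "them": {"kind": None, "plural": True},   # any kind, but must be plural
}

def resolve(pronoun, previous_mentions):
    """Pick the most recent earlier mention compatible with the pronoun."""
    rule = PRONOUNS[pronoun.lower()]
    for mention in reversed(previous_mentions):
        if rule["kind"] is not None and mention.kind != rule["kind"]:
            continue
        if mention.plural != rule["plural"]:
            continue
        return mention
    return None

mentions = [
    Mention("Jim", "person", False),
    Mention("300 shares", "quantity", True),
    Mention("Acme Corp.", "organization", False),
    Mention("2006", "date", False),
]
he = resolve("He", mentions)
them = resolve("them", mentions)
print(he.text, "|", them.text)
```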
• Entity Relation Detection and Characterization
• It finds the relations between entities in the text and classifies them into pre-defined categories, such as AT, NEAR, PART, GROUP, AFFILIATION, POSITION, etc.
Example: … banks in Boston and New York.
Entity relations: banks located at Boston; banks located at New York
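The "located at" relations in the example can be sketched with a single surface pattern. The regular expression, the function `extract_located_at`, and the choice of the AT label are assumptions for this example; a real system would work on recognized entities and syntactic structure instead of raw strings.

```python
import re

# Hypothetical pattern: "<word> in <Capitalized Name> [and <Capitalized Name>]".
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
PATTERN = re.compile(rf"(\w+) in ({NAME})(?: and ({NAME}))?")

def extract_located_at(text):
    """Return (relation, entity, place) triples for each pattern match."""
    relations = []
    for m in PATTERN.finditer(text):
        head, first, second = m.group(1), m.group(2), m.group(3)
        relations.append(("AT", head, first))
        if second:
            # The coordinated place shares the same head entity.
            relations.append(("AT", head, second))
    return relations

relations = extract_located_at("... banks in Boston and New York.")
print(relations)
```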
• Event Detection and Characterization
• It detects the events in which the entities participate, their arguments (such as agent, object, source and target) and attributes (such as time, location, instrument and purpose) and classifies the identified events into pre-defined categories, such as CREATION, MOVEMENT, TRANSFER, INTERACTION, etc.
Example: In 1997, the company hired John D. Idol to take over as chief executive.
Event: hired
employee: John D. Idol
employer: the company
position: chief executive
time: 1997
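A detected event is typically stored as a structured record. The sketch below shows one possible representation of the hiring example. The `Event` class and its field names are assumptions for illustration; treating employee, employer, and position as arguments and time as an attribute follows the definition given above, but the category is left unset because the slide does not state it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Event:
    trigger: str                                               # word that signals the event
    arguments: Dict[str, str] = field(default_factory=dict)    # participants
    attributes: Dict[str, str] = field(default_factory=dict)   # time, location, ...
    category: Optional[str] = None                             # pre-defined event type, if known

    def slots(self):
        """Flatten the event into one slot -> value mapping."""
        return {"trigger": self.trigger, **self.arguments, **self.attributes}

hire = Event(
    trigger="hired",
    arguments={
        "employee": "John D. Idol",
        "employer": "the company",
        "position": "chief executive",
    },
    attributes={"time": "1997"},
)
print(hire.slots())
```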
In-Class Exercise
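• Please read the following sentence and hand-label the entity relations (feel free to be creative!). The entities are shown in bold.
经过一晚的休息,我们一行 10人,早起从丰大出发前往黄山南大门,大概半小时左右到达黄山脚下的东岭换乘中心,然后乘坐大巴到云谷寺选择坐索道上山。
(English gloss: After a night's rest, our group of 10 set out early from 丰大 for the south gate of Huangshan (黄山), reached the Dongling (东岭) transfer center at the foot of Huangshan in about half an hour, and then took a bus to Yungu Temple (云谷寺), where we chose to take the cable car up the mountain.)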
History and Benchmarks
• MUC (Message Understanding Conference)
• From 1987 to 1997, it was sponsored by DARPA (Defense Advanced Research Projects Agency, TIPSTER Program).
• Datasets: News Domains (Military Messages, Terrorist Events, Corporate Joint Ventures, Airplane crashes, etc.)
• MUC Extraction Tasks
• Named Entity (NE): Find proper names, such as person, organization and location names, and quantities of interest, such as dates, times, percentages, and monetary amounts.
• Co-reference (CO)
• Template Element (TE): Fill slots of entity attributes, such as name, type, descriptor, and category.
• Template Relation (TR): Find the relations between TEs, such as employee_of, product_of, location_of.
• Scenario Template (ST): Build a template around an event in which entities participated.
• ACE (Automatic Content Extraction) Evaluation
• From 1999 to 2008, it was sponsored by NIST (National Institute of Standards and Technology).
• The ACE research objectives are the detection and characterization of Entities, Relations, and Events.
• Entity: an object in the real world
• Mention: Named (e.g., “George Bush”), Nominal (e.g., “our president”), and Pronominal (e.g., “he”)
• ACE Extraction Tasks
• ACE 2000: Entity Detection and Tracking (EDT)
• ACE 2001: Entity Detection and Tracking (EDT) + Relation Detection and Characterization (RDC)
• ACE 2002: The Same as ACE 2001
• ACE 2003: EDT (for English, Chinese, Arabic) + RDC (for English, Chinese)
• ACE 2004: Entity Detection and Recognition (EDR) + Relation Detection and Recognition (RDR) + Time Expression Recognition and Normalization (TERN)
• ACE 2005: EDR + RDR + TERN + Value Detection and Recognition (VAL) + Event Detection and Recognition (VDR)
• ACE 2007: The Same as ACE 2005
• ACE 2008: Local (Within-Document) EDR and RDR (for English and Arabic) + Global (Cross-Document) EDR and RDR (for English and Arabic)
• TAC (Text Analysis Conference) - KBP (Knowledge Base Population) Track
• Since 2009, it has been sponsored by NIST (National Institute of Standards and Technology).
• TAC Extraction Tasks
• Entity Linking: For each query (a name string), determines which knowledge base entity is being referred to, or whether the entity is absent from the reference KB (mono-lingual vs. cross-lingual).
• Slot Filling: Involves collecting a pre-defined set of information regarding certain attributes of an entity, which may be a person or some type of organization.
IE Process
IE pipeline, built on Natural Language Processing (NLP): Text → Tokenization → Morphological and Lexical Processing → Syntactic Analysis → Entity Detection and Recognition → Relation Detection and Recognition → Event Detection and Recognition → Templates
• Tokenization and Word Segmentation
• In the first step, the text is divided into sentences and tokens (word occurrences).
• For Chinese, tokenization also includes word segmentation.
Sam, Schwartz, retired, as, executive, …
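The token step can be sketched as follows (sentence splitting is left out). For English, a regular expression is enough for a first pass. For Chinese, the sketch uses forward maximum matching against a word list, a common baseline that is used here as an assumption; the lecture does not name a segmentation algorithm. The function names and the toy lexicon are hypothetical.

```python
import re

def tokenize_english(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def segment_forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

english_tokens = tokenize_english("Sam Schwartz retired as executive.")
print(english_tokens)

lexicon = {"信息", "抽取", "信息抽取", "任务"}  # toy word list for this example
chinese_words = segment_forward_max_match("信息抽取的任务", lexicon)
print(chinese_words)
```

Greedy matching picks the longest dictionary word first, so it can mis-segment ambiguous strings; it is shown only to make the extra step for Chinese concrete.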
• Morphological and Lexical Processing
• Each token may be looked up in a dictionary to determine its possible POS and features (both syntactic and semantic).
• The system may utilize several special purpose dictionaries, such as dictionaries of major place names, common first names, and common company suffixes (such as ‘Inc’).
Input tokens: Sam, Schwartz, retired, as, executive, …
Lookup results: retired -> retire, Sam -> NAME, Inc -> COMPANY SUFFIX, retired -> VBD, …
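The dictionary lookup described above can be sketched as a few small tables consulted per token. All dictionary contents, the feature names, and the function `lookup` are assumptions for illustration; real systems use much larger resources and handle ambiguity between several possible tags.

```python
# Hypothetical toy dictionaries for this example only.
GENERAL_LEXICON = {
    "retired": {"lemma": "retire", "pos": "VBD"},
    "as": {"lemma": "as", "pos": "IN"},
    "executive": {"lemma": "executive", "pos": "NN"},
}
FIRST_NAMES = {"Sam"}
COMPANY_SUFFIXES = {"Inc", "Corp", "Ltd"}

def lookup(token):
    """Collect the features that the dictionaries provide for one token."""
    features = dict(GENERAL_LEXICON.get(token.lower(), {}))
    # Special-purpose dictionaries add a semantic class.
    if token in FIRST_NAMES:
        features["class"] = "NAME"
    if token in COMPANY_SUFFIXES:
        features["class"] = "COMPANY SUFFIX"
    return features

for token in ["Sam", "Schwartz", "retired", "Inc"]:
    print(token, lookup(token))
```

A token missing from every dictionary, such as "Schwartz" here, comes back with no features, which is where later stages have to rely on context.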
• Syntactic Analysis
• It identifies the syntactic structure of the text.
• The contents to be extracted often correspond to the phrases (mainly noun phrases) in the text.
• Named Entity (NE) Recognition
• NE systems identify all the names of people, places, organizations, dates, amounts of money, etc.
• It is useful for answering “What”, “Who”, “When”, and “Where” questions.
• Entity Extraction Approaches
• Rule-based
• Learning-based (Classification or Sequential Tagging)
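In the sequential tagging view, a model assigns one tag to each token and entities are read off from runs of tags. The sketch below shows only the decoding step, with hand-written tags standing in for a model's output. The BIO tag scheme, the label names, and the function `bio_to_spans` are assumptions for illustration; the lecture does not specify a tag set.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)       # continue the open entity
        else:                           # "O": outside any entity
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
tags = ["B-PER", "O", "B-QUANTITY", "I-QUANTITY", "O", "B-ORG", "I-ORG", "O", "B-DATE", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```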
In-Class Exercise
• Sequential tagging also applies to ___________.
A) word segmentation
B) POS tagging
C) dependency parsing
D) text classification
• Scenario Pattern Matching
• Scenario pattern matching extracts events relevant to the scenario using patterns specific to the task.
• Templates are often used for this purpose.
Matching rules:
PERSON retires as POSITION
PERSON is succeeded by PERSON
“retire” and “succeeded” are the trigger verbs.
An event template:
EVENT: succession
Slots: Person in, Person out, Position, Organization
• Scenario Pattern Matching
• When PERSON and POSITION (or PERSON and PERSON) match noun phrases with the associated types, the event is identified, and the associated information is filled into the template, such as the Position and Person out slots.
PERSON retires as POSITION: PERSON fills Person out; POSITION fills Position.
PERSON is succeeded by PERSON: the first PERSON fills Person out; the second PERSON fills Person in.
• Scenario Pattern Matching
Rule: PERSON retires as POSITION
Dowd retires as chief of Kenilworth Police Department.
Rocky Marciano retires as world heavyweight champion.
Filled template (first sentence): Person out: Dowd; Position: Chief of Kenilworth Police Department
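The two matching rules and the succession template can be sketched with regular expressions. This is a simplification under stated assumptions: the patterns match raw strings, whereas the process described above requires the PERSON and POSITION slots to match noun phrases of the right type. The names `RETIRE`, `SUCCEED`, `new_template`, and `match_sentence` are hypothetical.

```python
import re

# One regular expression per matching rule; "retires" and "succeeded" are the triggers.
RETIRE = re.compile(r"^(?P<person>.+?) retires as (?P<position>.+?)\.?$")
SUCCEED = re.compile(r"^(?P<person_out>.+?) is succeeded by (?P<person_in>.+?)\.?$")

def new_template():
    """Empty succession template with the four slots from the slide."""
    return {"event": "succession", "person_in": None, "person_out": None,
            "position": None, "organization": None}

def match_sentence(sentence):
    """Fill a template if a rule matches; return None otherwise."""
    template = new_template()
    m = RETIRE.match(sentence)
    if m:
        template["person_out"] = m.group("person")
        template["position"] = m.group("position")
        return template
    m = SUCCEED.match(sentence)
    if m:
        template["person_out"] = m.group("person_out")
        template["person_in"] = m.group("person_in")
        return template
    return None

dowd = match_sentence("Dowd retires as chief of Kenilworth Police Department.")
rocky = match_sentence("Rocky Marciano retires as world heavyweight champion.")
print(dowd)
print(rocky)
```

Slots that no rule fills, such as the organization here, stay empty; they would be filled by other rules or by merging templates for the same event.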
IE vs IR
• Information Retrieval (IR)
• IR retrieves a collection or subset of documents that are hopefully relevant to a query, based on keyword searching.
• IR is the essential technique underlying search engines (Google, Baidu, Bing, etc.) and many other IT successes.
• IE vs IR
• Information Retrieval returns sets of relevant documents; the user then analyzes the documents.
• Information Extraction pulls facts out of documents; the user then analyzes the facts.
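The contrast can be made concrete on a tiny document set: retrieval returns whole documents, while extraction returns structured rows. Both functions below are minimal sketches; the keyword-matching rule in `retrieve` and the single pattern in `extract` are assumptions for illustration, not how search engines or IE systems are actually built.

```python
import re

DOCS = [
    "Dowd retires as chief of Kenilworth Police Department.",
    "Rocky Marciano retires as world heavyweight champion.",
    "Jim bought 300 shares of Acme Corp. in 2006.",
]

def retrieve(query, docs):
    """IR: return the documents that contain every query keyword."""
    keywords = query.lower().split()
    return [d for d in docs if all(k in d.lower() for k in keywords)]

def extract(docs):
    """IE: return structured facts (rows) instead of documents."""
    pattern = re.compile(r"^(.+?) retires as (.+?)\.?$")
    return [{"person": m.group(1), "position": m.group(2)}
            for m in map(pattern.match, docs) if m]

hits = retrieve("retires", DOCS)
facts = extract(DOCS)
print(hits)
print(facts)
```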
• IE vs IR
Information Need → IR → Relevant Documents → IE → Relational Database → QA → Answer
Information Need → IR → Relevant Documents → IE → Database → DM
Wrap-Up
• Concepts
• IE Tasks
  • Named Entity Detection and Recognition
  • Co-Reference Resolution
  • Entity Relation Detection and Characterization
  • Event Detection and Characterization
• History and Benchmarks
  • MUC
  • ACE
  • TAC
• IE Process
• IE vs IR