Eberhard Karls Universität Tübingen, Seminar für Sprachwissenschaft
Treebanking
Sandra Kübler
Seminar für Sprachwissenschaft
University of Tübingen
Treebanking – p.1

Def. Treebank
A treebank is a syntactically annotated corpus.
Issues:
complete analysis vs. partial analysis
theory-neutral vs. theory-dependent
spoken vs. written language
constituents vs. dependency
annotate grammatical functions?
manual vs. automatic annotation

Some Remarks
treebanking is extremely labor-intensive (i.e. costly)
good planning is therefore necessary
good tools are crucial
they speed up the process
they help with consistency
try Annotate!
a detailed stylebook is essential
every time you hire a well-trained linguist, your treebank will get better

Penn WSJ Treebank – Example
( (S (NP-SBJ (NP Pierre Vinken)
             ,
             (ADJP (NP 61 years)
                   old)
             ,)
     (VP will
         (VP join
             (NP the board)
             (PP-CLR as
                     (NP a nonexecutive director))
             (NP-TMP Nov. 29)))
     .))
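The bracketing above can be read mechanically. The following is a minimal sketch of such a reader, not the Penn Treebank's own tooling: the function names `parse_brackets` and `leaves` and the nested `(label, children)` representation are assumptions made for this illustration. Because the example on the slide shows words without part-of-speech tags, bare words are stored as plain strings.

```python
def parse_brackets(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples."""
    # Put spaces around parentheses so that split() yields one token per symbol.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        # The outermost Penn bracket carries no label; use "" for it.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # embedded constituent
            else:
                children.append(tokens[pos])   # bare word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of a parsed tree in sentence order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


EXAMPLE = """
( (S (NP-SBJ (NP Pierre Vinken) , (ADJP (NP 61 years) old) ,)
     (VP will (VP join (NP the board)
                       (PP-CLR as (NP a nonexecutive director))
                       (NP-TMP Nov. 29)))
     .))
"""

tree = parse_brackets(EXAMPLE)
print(" ".join(leaves(tree)))
```

The function tags such as `-SBJ` and `-TMP` stay attached to the category label here; splitting them off would be a natural next step when grammatical functions are needed separately.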

Treebanks for English
Penn Treebank
BLLIP Treebank
The Penn-Helsinki Parsed Corpus of Middle English
Susanne Corpus and Christine Project
International Corpus of English (ICE)
Lancaster Treebank
The Redwoods HPSG Treebank

Treebanks Projects
Basque
Eus3LB project
Bulgarian
HPSG-based Syntactic Treebank of Bulgarian (BulTreeBank)
Catalan
CAT3LB project
Chinese
The Chinese Treebank Project
Czech
Prague Dependency Treebank

Treebanks Projects (2)
Danish
Danish Dependency Treebank
Dutch
The Alpino Treebank
French
Project TALANA
German
NeGra Project - NeGra Corpus
Project TIGER
Verbmobil Treebank of Spoken German (TüBa-D/S)
The Tübingen Treebank of Written German (TüBa-D/Z)

Treebanks Projects (3)
Italian
Turin University Treebank (TUT)
Italian Syntactic-Semantic Treebank
Portuguese
The Floresta Sintá(c)tica project
Slovene
Slovene Dependency Treebank
Swedish
Swedish Treebank
Turkish
METU treebank

The Annotation Scheme
Should the annotation scheme be dependent on a particular theory?
Theory-neutrality is a fiction. Every annotation scheme is at least implicitly theory-dependent.
Grounding an annotation scheme in a linguistic theory tends to improve consistency of annotations.

Theory-dependent Treebanks
Prague Dependency Treebank
based on Dependency Grammar
The Redwoods HPSG Treebank
based on Head-Driven Phrase Structure Grammar
CCGbank
translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations

Theory-neutral Treebanks
do not adhere to any particular linguistic theory
encode those grammatical properties that are distinguished by many, if not all, grammatical frameworks
advantage: more widely usable and less dependent on whatever version of a particular grammatical theory may have existed at the time when the treebank annotation scheme was determined
examples: Penn Treebank, NeGra treebank, Tübingen treebanks

Characteristics of Spontaneous Speech
Fragmentary Utterances
Repetitions
False starts
Speech errors (with correction)
Interruptions
Parentheticals
Discourse markers
Hesitation noises

Examples
Vorbereitungen eigentlich nicht
preparations really not
Theater wäre wäre mal nicht schlecht
theater would be would be surely not bad
trotz Nebel Nebels im November
despite fog of the fog in November

Examples (2)
ja, also, das, wenn wir allerdings,
yes well that if we though
wenn wir mit dem Flugzeug fliegen
if we fly by plane
wie kommen wir dann nach Hannover rein ?
how do we then get into Hannover?

Annotation Principles
Longest Match Principle
as many daughter nodes as possible are combined into a single mother node, provided that the resulting construction is syntactically as well as semantically well-formed
Speech errors, repetitions, corrections, and hesitations are structured as much as possible, but are not typically connected to surrounding constituents as a whole
Flat Clustering Principle
keeps the number of hierarchy levels in a syntactic structure as small as possible
any branching factor is allowed

Repetition
Theater/NN wäre/VAFIN wäre/VAFIN mal/ADV nicht/PTKNEG schlecht/ADJD ./$.
[Tree diagram. Node labels: NX (ON), VXFIN, VXFIN (HD), ADVX (MOD) twice, ADJX (PRED); topological fields VF, LK, MF; root SIMPX]

Parenthesis
da/ADV können/VMFIN wir/PPER uns/PRF auf/APPR das/ART Hotel/NN ,/$, glaube/VVFIN ich/PPER ,/$, einigen/VVINF ./$.
(roughly: there we can, I think, agree on the hotel)
[Tree diagram. Node labels: ADVX (MOD), VXFIN (HD), NX (ON, OA, HD), VXINF (OV), PX (FOPP); topological fields VF, LK, MF, VC; two SIMPX nodes]

Fragmentary Utterance
ja/PTKANT ,/$, das/PDS ist/VAFIN das/PDS das/PDS ist/VAFIN in/APPR Ordnung/NN ,/$, genau/ITJ ./$.
(roughly: yes, that is that that is all right, exactly)
[Tree diagram. Node labels: DM, NX (ON, HD), VXFIN (HD), PX (PRED); topological fields VF, LK, MF; two SIMPX nodes]

Constituent-Based Annotation
describes phrase structure, clause structure
e.g. noun phrases, adjectival phrases, adverbial phrases, clauses
structures often recursive
ex.: Penn Treebank

Penn WSJ Treebank – Example
( (S (NP-SBJ (NP Pierre Vinken)
             ,
             (ADJP (NP 61 years)
                   old)
             ,)
     (VP will
         (VP join
             (NP the board)
             (PP-CLR as
                     (NP a nonexecutive director))
             (NP-TMP Nov. 29)))
     .))

Dependency Grammar
phrase structure grammar (PSG) describes the constituent structure of a sentence
dependency grammar is more interested in grammatical relations between the words of a sentence: the governing and the dependent words
dependency grammar does not propose a recursive structure but rather a network of relations
the verb is the part of the sentence on which ultimately everything depends
the direction of a link represents the dependency, the angle represents the word order

Dependency Grammar - An Example
the little bald man likes the present
likes
  man
    the little bald
  present
    the
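One compact way to store such a dependency structure is a list that gives, for every word, the position of its governor. The sketch below encodes the example sentence this way; the encoding (with `-1` marking the root) and the helper names `root`, `dependents`, and `depends_on_root` are assumptions chosen for this illustration, not part of the slides.

```python
WORDS = ["the", "little", "bald", "man", "likes", "the", "present"]
# HEADS[i] is the position of the governor of WORDS[i]; -1 marks the root.
HEADS = [3, 3, 3, 4, -1, 6, 4]


def root(heads):
    """Return the position of the single root, the word that has no governor."""
    roots = [i for i, h in enumerate(heads) if h == -1]
    if len(roots) != 1:
        raise ValueError("a dependency tree needs exactly one root")
    return roots[0]


def dependents(heads, governor):
    """Return the positions of all words directly governed by `governor`."""
    return [i for i, h in enumerate(heads) if h == governor]


def depends_on_root(heads, i):
    """Follow governor links upward; True if they reach the root without a cycle."""
    seen = set()
    while heads[i] != -1:
        if i in seen:
            return False
        seen.add(i)
        i = heads[i]
    return True


verb = root(HEADS)
print(WORDS[verb], "governs", [WORDS[i] for i in dependents(HEADS, verb)])
```

The check in `depends_on_root` mirrors the remark that ultimately everything depends on the verb: every word's chain of governors should end at the root.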

Extending Dependency Grammar
dependency grammars are often extended by labels that denote the grammatical function that the dependent word has with regard to its governor
Example:
I saw a man with a dog and a cat in the park
[Figure: labeled dependency arcs over the sentence; the arc labels are subj, spec, cmpl, and adjn]
from (Lin 1995)

From Constituents To Dependencies
conversion from constituents to dependencies is possible
needs head / non-head information
if no such information is given, heuristics are needed
Lin 1995, 1998: convert Penn Treebank to dependencies

Lin’s Conversion
idea: head of a phrase governs all sisters
uses Tree Head Table: list of rules where to find the head of a constituent
an entry consists of the node, the direction of search, and the list of possible heads
sample entries:
(S right-to-left (Aux VP NP AP PP))
(VP left-to-right (V VP))
(NP right-to-left (Pron N NP))
first line: the head of an S constituent is the first Aux daughter from the right; if there is no Aux, then the first VP, etc.
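The lookup described above can be sketched in a few lines. This is an illustration of the rule as stated on the slide, not Lin's implementation: the dictionary layout `HEAD_TABLE` and the function name `find_head` are assumptions, and only the three sample entries are included.

```python
# Tree Head Table: node label -> (direction of search, candidate heads in priority order)
HEAD_TABLE = {
    "S": ("right-to-left", ["Aux", "VP", "NP", "AP", "PP"]),
    "VP": ("left-to-right", ["V", "VP"]),
    "NP": ("right-to-left", ["Pron", "N", "NP"]),
}


def find_head(label, daughters):
    """Return the position of the head daughter, or None if no candidate matches.

    `daughters` is the list of daughter categories in sentence order.
    """
    direction, candidates = HEAD_TABLE[label]
    positions = list(range(len(daughters)))
    if direction == "right-to-left":
        positions.reverse()
    # Earlier candidates take priority: look for any Aux before looking for a VP.
    for candidate in candidates:
        for i in positions:
            if daughters[i] == candidate:
                return i
    return None


print(find_head("S", ["NP", "VP"]))   # position of the VP daughter
print(find_head("NP", ["N", "N"]))    # first N from the right
```

Returning `None` when nothing matches leaves room for the heuristics that are needed when a constituent has no listed head candidate.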

Lin’s Conversion - Example
(S right-to-left (Aux VP NP AP PP))
(VP left-to-right (V VP))
(NP right-to-left (Pron N NP))
(S (NP1 (PRON I))
   (VP1 (ADV really)
        (VP2 (V like)
             (NP2 (N1 ice)
                  (N2 cream)))))
root   head   lex. head
S      VP1    like
VP1    VP2    like
VP2    V      like
head of a phrase governs all sisters
VP1 governs NP1, so like governs I
VP2 governs ADV, so like governs really
like
  I
  really
  cream
    ice
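The whole example can be traced in code. The sketch below is a simplified illustration under stated assumptions, not Lin's actual converter: the tree is written as nested `(category, body)` tuples without the numeric indices used on the slide (NP1, VP2, ...), category names are compared case-insensitively so that PRON matches the table entry Pron, and words stand in for token positions, which only works because no word occurs twice in this sentence.

```python
HEAD_TABLE = {
    "S": ("right-to-left", ["Aux", "VP", "NP", "AP", "PP"]),
    "VP": ("left-to-right", ["V", "VP"]),
    "NP": ("right-to-left", ["Pron", "N", "NP"]),
}

# (category, word) for preterminals, (category, [children]) for phrases.
TREE = ("S", [
    ("NP", [("PRON", "I")]),
    ("VP", [
        ("ADV", "really"),
        ("VP", [
            ("V", "like"),
            ("NP", [("N", "ice"), ("N", "cream")]),
        ]),
    ]),
])


def head_index(label, children):
    """Position of the head daughter according to the Tree Head Table."""
    direction, candidates = HEAD_TABLE[label]
    positions = list(range(len(children)))
    if direction == "right-to-left":
        positions.reverse()
    for candidate in candidates:
        for i in positions:
            # Assumption: category names are matched case-insensitively.
            if children[i][0].upper() == candidate.upper():
                return i
    raise ValueError(f"no head found for {label}")


def lexical_head(node):
    """Follow head daughters down to a word."""
    label, body = node
    if isinstance(body, str):
        return body
    return lexical_head(body[head_index(label, body)])


def to_dependencies(node):
    """Return (governor, dependent) word pairs: each head governs all its sisters."""
    label, body = node
    if isinstance(body, str):
        return []
    h = head_index(label, body)
    governor = lexical_head(body[h])
    deps = [(governor, lexical_head(child))
            for i, child in enumerate(body) if i != h]
    for child in body:
        deps.extend(to_dependencies(child))
    return deps


for governor, dependent in to_dependencies(TREE):
    print(governor, "governs", dependent)
```

Besides the two pairs spelled out on the slide, the same rule yields the remaining links of the tree: the V daughter governs its NP sister, and the rightmost N of that NP governs the other N.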

XML Annotation Tool
CLaRK tool: http://www.bultreebank.org/clark
[Screenshot of the CLaRK tool]