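Chapter 7: Parsing
Overview of Parsing
• Parsing
– The process of analyzing the structure of an input sentence
• Grammar
– A system that defines the sentence structures permitted in a language
• Parsing techniques
– Methods for analyzing the structure of a sentence according to a grammar
– Chart parsing
– …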
Sentence Structure and Trees
• Sentence: “John ate the apple.”
• Tree representation and list representation
(S (NP (N John))
   (VP (V ate)
       (NP (DET the) (N apple))))
• Meaning
– S consists of an NP and a VP.
– The NP consists of the name (N) “John”.
– The VP consists of the verb (V) “ate” and another NP.
– That NP consists of the determiner (DET) “the” and the noun (N) “apple”.
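[Figure: tree representation of “John ate the apple”. S branches into NP and VP; the first NP into N (John); VP into V (ate) and NP; that NP into DET (the) and N (apple).]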
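The two representations carry the same information. As a minimal sketch (the nested-tuple encoding and the helper names `leaves` and `to_list_repr` are illustrative assumptions, not part of the slides), the tree can be stored as nested tuples and printed back in list representation:

```python
# Parse tree for "John ate the apple" as nested tuples: (label, child, ...).
# A leaf is a plain string (the word itself).
tree = ("S",
        ("NP", ("N", "John")),
        ("VP", ("V", "ate"),
               ("NP", ("DET", "the"), ("N", "apple"))))

def leaves(node):
    """Return the words at the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def to_list_repr(node):
    """Render the tree in bracketed list representation."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [to_list_repr(c) for c in node[1:]]) + ")"

print(leaves(tree))
print(to_list_repr(tree))
```

Reading the leaves left to right gives back the input sentence, which is a quick sanity check on any parse tree.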
Context-Free Grammar
• Components of a grammar
– Words and part-of-speech symbols (terminals)
• ate, the, apple, etc.
• V, DET, N, etc.
– Phrase symbols (nonterminals)
• NP, VP, S, etc.
– Grammar rules (productions)
S → NP VP
NP → N | DET N
VP → V | V NP | V PP
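One possible way to hold these components in a program is a dictionary from each left-hand side to its alternative right-hand sides. This is only a sketch; the name `GRAMMAR` and the list-of-lists encoding are assumptions made for illustration.

```python
# Each nonterminal maps to its alternative right-hand sides (rules from the slide).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["DET", "N"]],
    "VP": [["V"], ["V", "NP"], ["V", "PP"]],
}

nonterminals = set(GRAMMAR)
# Symbols used on a right-hand side that this rule set never expands.
unexpanded = {sym for alts in GRAMMAR.values() for rhs in alts for sym in rhs} - nonterminals
print(sorted(unexpanded))
```

The unexpanded symbols are the part-of-speech symbols N, DET and V, plus PP, for which this slide lists no rule.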
Top-Down Parsing
• Top-down parsing
– Proceeds from the sentence symbol S toward the input sentence
– Repeatedly replaces the LHS (left-hand side) symbol of a grammar rule with its RHS (right-hand side) symbols
• Example of top-down parsing (leftmost derivation)
S ⇒ NP VP ⇒ N VP ⇒ John VP ⇒ John V NP ⇒ John ate NP ⇒ John ate DET N ⇒ John ate the N ⇒ John ate the apple
G: S → NP VP
NP → N | DET N
VP → V NP
N → John
DET → the
V → ate
N → apple
Input sentence:
John ate the apple
Top-Down Parsing Process
• Grammar G and input sentence: as above
[Figure: the parse tree (S (NP (N John)) (VP (V ate) (NP (DET the) (N apple)))) built from S downward.]
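The top-down process can be sketched as a small recursive-descent parser with backtracking over grammar G. The function names `expand`, `expand_seq` and `parse` are illustrative assumptions. The approach works here because G has no left-recursive rule; it would loop on a rule such as NP → NP PP.

```python
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["N"], ["DET", "N"]],
    "VP":  [["V", "NP"]],
    "N":   [["John"], ["apple"]],
    "DET": [["the"]],
    "V":   [["ate"]],
}

def expand(symbol, words, pos):
    """Yield (tree, next_pos) for every way symbol can cover words[pos:next_pos]."""
    if symbol not in GRAMMAR:                  # a word: must match the next input word
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:                # replace the LHS by each RHS in turn
        yield from expand_seq(symbol, rhs, words, pos, [])

def expand_seq(lhs, rhs, words, pos, children):
    """Expand the RHS symbols left to right, backtracking over alternatives."""
    if not rhs:
        yield (lhs, *children), pos
        return
    for child, nxt in expand(rhs[0], words, pos):
        yield from expand_seq(lhs, rhs[1:], words, nxt, children + [child])

def parse(sentence):
    """Return all trees rooted in S that cover the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in expand("S", words, 0) if end == len(words)]

trees = parse("John ate the apple")
print(trees)
```

For this grammar and sentence exactly one tree is found; an incomplete input such as "John ate" yields no tree.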
Bottom-Up Parsing
• Bottom-up parsing
– Proceeds from the input sentence toward the grammar symbol S
– Repeatedly replaces the RHS of a grammar rule with its LHS
• Example of bottom-up parsing (reverse rightmost derivation)
John ate the apple
⊢ N ate the apple
⊢ NP ate the apple
⊢ NP V the apple
⊢ NP V DET apple
⊢ NP V DET N
⊢ NP V NP
⊢ NP VP
⊢ S
Bottom-Up Parsing Process
• Grammar G and input sentence: as in the top-down example
[Figure: the same parse tree for “John ate the apple”, built from the words upward to S.]
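The reduction sequence can be reproduced with a greedy shift-reduce loop. This is a sketch under a strong assumption: it always reduces when it can and prefers the longest matching right-hand side, with no backtracking. That happens to be enough for grammar G, but it is not a general bottom-up parser.

```python
RULES = [  # (LHS, RHS) pairs of grammar G
    ("S", ("NP", "VP")),
    ("NP", ("N",)), ("NP", ("DET", "N")),
    ("VP", ("V", "NP")),
    ("N", ("John",)), ("DET", ("the",)), ("V", ("ate",)), ("N", ("apple",)),
]
# Try longer right-hand sides first so that "DET N" wins over a bare "N".
BY_LENGTH = sorted(RULES, key=lambda rule: -len(rule[1]))

def shift_reduce(sentence):
    """Return the successive sentential forms; the last one is "S" on success."""
    stack, rest = [], sentence.split()
    steps = [" ".join(rest)]
    while True:
        for lhs, rhs in BY_LENGTH:
            if tuple(stack[-len(rhs):]) == rhs:   # RHS on top of the stack: replace by LHS
                stack[-len(rhs):] = [lhs]
                steps.append(" ".join(stack + rest))
                break
        else:                                     # nothing to reduce: shift or stop
            if not rest:
                return steps
            stack.append(rest.pop(0))

for form in shift_reduce("John ate the apple"):
    print(form)
```

Each printed line is the stack followed by the unread input, which matches the sentential forms of the reverse rightmost derivation.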
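Ambiguity in Natural Language (1)
• Structural ambiguity
– The property that a single sentence can be analyzed as more than one structure
• Example of structural ambiguity
G: S → NP VP
NP → N | DET N | NP PP
VP → V NP | VP PP
PP → P NP
Input sentence: John saw Mary in the park.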
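[Figure: two parse trees for “John saw Mary in the park”, tagged N V N P DET N.]
– PP attached to the VP: (S (NP (N John)) (VP (VP (V saw) (NP (N Mary))) (PP (P in) (NP (DET the) (N park)))))
– PP attached to the NP: (S (NP (N John)) (VP (V saw) (NP (NP (N Mary)) (PP (P in) (NP (DET the) (N park))))))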
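To check that the grammar really licenses two structures, one can enumerate every parse over every span. The sketch below assumes the part-of-speech tags shown in the figure; the tables `BINARY`, `UNARY`, `LEXICON` and the function `all_parses` are illustrative names.

```python
from functools import lru_cache

BINARY = {("NP", "VP"): ["S"], ("NP", "PP"): ["NP"], ("DET", "N"): ["NP"],
          ("V", "NP"): ["VP"], ("VP", "PP"): ["VP"], ("P", "NP"): ["PP"]}
UNARY = {"N": ["NP"]}                               # NP -> N
LEXICON = {"John": "N", "saw": "V", "Mary": "N",    # tags as in the figure
           "in": "P", "the": "DET", "park": "N"}

def all_parses(sentence):
    """Return every tree rooted in S that covers the whole sentence."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def span(i, j):
        """All trees covering words[i:j], as nested tuples."""
        trees = []
        if j - i == 1:
            trees.append((LEXICON[words[i]], words[i]))
        for k in range(i + 1, j):                   # every split point
            for left in span(i, k):
                for right in span(k, j):
                    for lhs in BINARY.get((left[0], right[0]), []):
                        trees.append((lhs, left, right))
        for t in list(trees):                       # unary rules on top
            for lhs in UNARY.get(t[0], []):
                trees.append((lhs, t))
        return tuple(trees)

    return [t for t in span(0, len(words)) if t[0] == "S"]

parses = all_parses("John saw Mary in the park")
print(len(parses))
```

The two results differ only in where the PP attaches: under the VP in one tree and under the object NP in the other.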
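Ambiguity in Natural Language (2)
• Lexical ambiguity
– Occurs when a single word can be used as more than one part of speech
– Lexical ambiguity can in turn give rise to structural ambiguity
• Example of lexical ambiguity
G: S → NP VP
NP → D N | A N | N
VP → V | VP NP | VP PP
PP → P NP
Input sentence: Time flies like an arrow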
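[Figure: two parse trees for “Time flies like an arrow”.]
– Tags N V P D N: “Time” is the subject NP, “flies” is the verb, and “like an arrow” is a PP.
– Tags A N V D N: “Time flies” is the subject NP, “like” is the verb, and “an arrow” is an NP.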
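The source of the ambiguity is that several words carry more than one tag. A small sketch, assuming a lexicon read off the two trees in the figure (the slides do not list the lexicon explicitly), enumerates the candidate tag sequences a parser would have to consider:

```python
from itertools import product

# Tags per word, read off the two trees in the figure (an assumption about the lexicon).
LEXICON = {"Time": ["N", "A"], "flies": ["V", "N"], "like": ["P", "V"],
           "an": ["D"], "arrow": ["N"]}

def tag_sequences(sentence):
    """Return every combination of one tag per word."""
    words = sentence.split()
    return [list(tags) for tags in product(*(LEXICON[w] for w in words))]

seqs = tag_sequences("Time flies like an arrow")
print(len(seqs))
```

Only some of these sequences lead to a complete parse under G; the two shown in the figure are among the candidates.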
Chart Parsing
• Chart
– A table that records the progress of parsing
– A bookkeeping mechanism
– Keeps track of constituents that were built during part of the parse but may be used by other rules
• Chart parsing
– Parsing that uses a chart
• Eliminates the overhead of repeating the same analysis after backtracking
– Makes no commitment to a particular parsing strategy
• top-down or bottom-up
• left-to-right, right-to-left, or island-driven
– Uses a general CFG parsing algorithm (CYK, Earley, etc.)
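Advantages of Chart Parsing
• Grammar G
S → NP VP
S → NP VP PP
VP → V
NP → DET N
NP → NP PP
PP → P NP
• Sentence: The rabbit with a saw nibbled on an orange
• Traditional parsing (with backtracking)
– Applies S → NP VP and, if that fails, backtracks
– Then parses with S → NP VP PP
• The NP and VP in this rule are the same ones already analyzed under S → NP VP, yet they must be analyzed again from scratch (inefficient)
• Chart parsing
– Even if S → NP VP fails, the NP and VP structures built as partial results are not discarded but recorded in the chart
– Under S → NP VP PP, the NP and VP need not be analyzed again; the entries recorded in the chart are reused as they are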
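A minimal sketch of the idea: the chart is a table from spans to the symbols found there, filled once from short spans to long ones, so both S rules consult the same recorded NP and VP entries. The part-of-speech lexicon and the names `build_chart` and `covers` are assumptions for illustration; the slides give only the phrase-level rules.

```python
RULES = [("S", ("NP", "VP")), ("S", ("NP", "VP", "PP")), ("VP", ("V",)),
         ("NP", ("DET", "N")), ("NP", ("NP", "PP")), ("PP", ("P", "NP"))]
# Assumed part-of-speech tags; "saw" is treated only as a noun here.
LEXICON = {"the": "DET", "a": "DET", "an": "DET", "rabbit": "N", "saw": "N",
           "orange": "N", "with": "P", "on": "P", "nibbled": "V"}

def build_chart(sentence):
    """Return a chart mapping (start, end) to the set of symbols over words[start:end]."""
    words = sentence.lower().split()
    n = len(words)
    chart = {}

    def covers(rhs, i, j):
        """True if the symbols in rhs can be laid end to end over words[i:j]."""
        if len(rhs) == 1:
            return rhs[0] in chart.get((i, j), set())
        return any(rhs[0] in chart.get((i, k), set()) and covers(rhs[1:], k, j)
                   for k in range(i + 1, j))

    for length in range(1, n + 1):              # shorter spans are finished first
        for i in range(n - length + 1):
            j = i + length
            found = {LEXICON[words[i]]} if length == 1 else set()
            found |= {lhs for lhs, rhs in RULES if len(rhs) > 1 and covers(rhs, i, j)}
            # Unary rules (here only VP -> V) are applied once to the finished cell.
            found |= {lhs for lhs, rhs in RULES if len(rhs) == 1 and rhs[0] in found}
            chart[(i, j)] = found
    return chart

chart = build_chart("The rabbit with a saw nibbled on an orange")
print(chart[(0, 5)], chart[(0, 6)], chart[(0, 9)])
```

The NP over "the rabbit with a saw" is stored once at span (0, 5). It supports S → NP VP over (0, 6), which does not cover the sentence, and is then reused by S → NP VP PP over the full span (0, 9).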
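Chart Parsing Algorithms
• CYK algorithm
– The most basic chart parsing algorithm
– Bottom-up
– Complexity: O(n^3) in the sentence length n
• Earley algorithm
– A chart parsing algorithm that improves on CYK
• Produces fewer unnecessary constituents
– Combines bottom-up and top-down
– Complexity: O(n^3) in the worst case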
The CYK Algorithm
• The membership problem
– Problem: given a context-free grammar G and a string w
• G = (V, ∑, P, S), where
» V is a finite set of variables
» ∑ (the alphabet) is a finite set of terminal symbols
» P is a finite set of rules
» S is the start symbol (a distinguished element of V)
» V and ∑ are assumed to be disjoint
• G is used to generate the strings of a language
– Question: is w in L(G)?
The CYK Algorithm
• J. Cocke
• D. Younger
• T. Kasami
– Independently developed an algorithm to answer this question.
The CYK Algorithm Basics
– Relies on the structure of the rules in a Chomsky Normal Form grammar
– Uses a “dynamic programming” or “table-filling” algorithm
Chomsky Normal Form
• A normal form is described by a set of conditions that each rule in the grammar must satisfy
• A context-free grammar is in CNF if each rule has one of the following forms:
– A → BC (exactly two variables on the right side)
– A → a (a single terminal symbol)
– S → λ (the null string)
– where B, C ∈ V – {S}
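The three conditions translate directly into a check over a rule list. The encoding below (rules as `(lhs, rhs)` pairs, with an empty tuple standing for λ) and the function name `is_cnf` are assumptions made for this sketch.

```python
def is_cnf(rules, variables, start):
    """rules: list of (lhs, rhs) pairs; rhs is a tuple of symbols, () is the null string."""
    for lhs, rhs in rules:
        if len(rhs) == 2 and all(s in variables and s != start for s in rhs):
            continue                      # A -> B C with B, C in V - {S}
        if len(rhs) == 1 and rhs[0] not in variables:
            continue                      # A -> a
        if rhs == () and lhs == start:
            continue                      # S -> lambda
        return False
    return True

V = {"S", "NP", "VP", "PP", "N"}
print(is_cnf([("S", ("NP", "VP")), ("NP", ("time",)), ("VP", ("flies",))], V, "S"))
print(is_cnf([("S", ("NP", "VP", "PP"))], V, "S"))   # three symbols on the right
print(is_cnf([("NP", ("N",))], V, "S"))              # unit rule between variables
```

Rules such as S → NP VP PP or NP → N from the earlier slides fail the check, so those grammars would need conversion before CYK can be applied.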
Construct a Triangular Table
• Each row corresponds to one length of substrings
–Bottom Row – Strings of length 1
–Second from Bottom Row – Strings of length 2
– …
–Top Row – string ‘w’
Construct a Triangular Table
• Xi, i is the set of variables A such that A → wi is a production of G
• To compute Xi, j for i < j, compare at most n pairs of previously computed sets:
– (Xi, i , Xi+1, j ), (Xi, i+1 , Xi+2, j ) … (Xi, j-1 , Xj, j )
– A is placed in Xi, j if G has a production A → BC with B in the first set of a pair and C in the second
– e.g. i=1, j=5
– (X1,1 , X2,5 ), (X1,2 , X3,5 ) … (X1,4 , X5,5 )
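The pairs for a given cell follow a simple pattern: the split point k runs from i to j-1. A one-line sketch (the function name `pairs_to_compare` is an assumption) lists them using the same 1-based indices:

```python
def pairs_to_compare(i, j):
    """Index pairs ((i, k), (k + 1, j)) for k = i .. j - 1, with 1-based X[i, j]."""
    return [((i, k), (k + 1, j)) for k in range(i, j)]

print(pairs_to_compare(1, 5))
# [((1, 1), (2, 5)), ((1, 2), (3, 5)), ((1, 3), (4, 5)), ((1, 4), (5, 5))]
```

A cell Xi, j therefore needs j - i comparisons, which is where the "at most n pairs" bound comes from.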
Construct a Triangular Table
X1, 5
X1, 4 X2, 5
X1, 3 X2, 4 X3, 5
X1, 2 X2, 3 X3, 4 X4, 5
X1, 1 X2, 2 X3, 3 X4, 4 X5, 5
w1 w2 w3 w4 w5
Table for string ‘w’ that has length 5
[Figure: the same triangular table, highlighting the pairs of cells to compare.]
Example CYK Algorithm
• Show the CYK Algorithm with the following example:
– CNF grammar G
• S → NP VP
• NP → DET NP | NP NP | time | flies | arrow
• VP → VP NP | VP PP | flies | like
• PP → P NP
• DET → an
• P → like
– w is "time flies like an arrow"
– Question: Is "time flies like an arrow" in L(G)?
Constructing The Triangular Table
{NP} (X1,1) {NP,VP} (X2,2) {VP,P} (X3,3) {DET} (X4,4) {NP} (X5,5)
time (1) flies (2) like (3) an (4) arrow (5)
Calculating the Bottom ROW: Xi, i
Constructing The Triangular Table
{NP,S} (X1,2) {S} (X2,3) {} (X3,4) {NP} (X4,5)
{NP} {NP,VP} {VP,P} {DET} {NP}
time (1) flies (2) like (3) an (4) arrow (5)
X1,2: (X1,1 , X2,2 )
X2,3: (X2,2 , X3,3 )
X3,4: (X3,3 , X4,4 )
X4,5: (X4,4 , X5,5 )
Constructing The Triangular Table
{S} (X1,3) {} (X2,4) {VP,PP} (X3,5)
{NP,S} {S} {} {NP}
{NP} {NP,VP} {VP,P} {DET} {NP}
time (1) flies (2) like (3) an (4) arrow (5)
X1,3: (X1,1 , X2,3 ), (X1,2 , X3,3 )
X2,4: (X2,2 , X3,4 ), (X2,3 , X4,4 )
X3,5: (X3,3 , X4,5 ), (X3,4 , X5,5 )
Constructing The Triangular Table
{} (X1,4) {S,VP} (X2,5)
{S} {} {VP,PP}
{NP,S} {S} {} {NP}
{NP} {NP,VP} {VP,P} {DET} {NP}
time (1) flies (2) like (3) an (4) arrow (5)
X1,4: (X1,1 , X2,4 ), (X1,2 , X3,4 ) , (X1,3 , X4,4 )
X2,5: (X2,2 , X3,5 ), (X2,3 , X4,5 ) , (X2,4 , X5,5 )
Constructing The Triangular Table
{S} (X1,5)
{} {S,VP}
{S} {} {VP,PP}
{NP,S} {S} {} {NP}
{NP} {NP,VP} {VP,P} {DET} {NP}
time (1) flies (2) like (3) an (4) arrow (5)
X1,5: (X1,1 , X2,5 ), (X1,2 , X3,5 ) , (X1,3 , X4,5 ) , (X1,4 , X5,5 )
Since S is in X1,5, "time flies like an arrow" is in L(G).
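The table-filling steps traced above can be sketched in a few lines. The encoding of the example grammar as `BINARY` and `LEXICAL` and the function name `cyk_table` are assumptions; the cells use the same 1-based Xi, j indexing as the slides.

```python
# Example CNF grammar: binary rules as (A, B, C) for A -> B C, word rules as a lookup.
BINARY = [("S", "NP", "VP"), ("NP", "DET", "NP"), ("NP", "NP", "NP"),
          ("VP", "VP", "NP"), ("VP", "VP", "PP"), ("PP", "P", "NP")]
LEXICAL = {"time": {"NP"}, "flies": {"NP", "VP"}, "arrow": {"NP"},
           "like": {"VP", "P"}, "an": {"DET"}}

def cyk_table(words):
    """Return X where X[i, j] is the set of variables deriving words i..j (1-based)."""
    n = len(words)
    X = {}
    for i in range(1, n + 1):                     # bottom row: X[i, i]
        X[i, i] = set(LEXICAL.get(words[i - 1], set()))
    for length in range(2, n + 1):                # one row per substring length
        for i in range(1, n - length + 2):
            j = i + length - 1
            X[i, j] = set()
            for k in range(i, j):                 # pairs (X[i, k], X[k+1, j])
                for a, b, c in BINARY:
                    if b in X[i, k] and c in X[k + 1, j]:
                        X[i, j].add(a)
    return X

X = cyk_table("time flies like an arrow".split())
print(X[1, 5])
print("S" in X[1, 5])
```

The three nested loops over length, start and split point give the O(n^3) behaviour in the sentence length, with the grammar size as a constant factor.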
CYK algorithm: Pseudocode
•let the input be a string S consisting of n characters: a1 ... an.
•let the grammar contain r nonterminal symbols R1 ... Rr.
•This grammar contains the subset Rs which is the set of start symbols.
•let P[n,n,r] be an array of booleans.
•Initialize all elements of P to false.
•for each i = 1 to n
•  for each unit production Rj → ai
•    set P[i,1,j] = true
•for each i = 2 to n -- Length of span
•  for each j = 1 to n-i+1 -- Start of span
•    for each k = 1 to i-1 -- Partition of span
•      for each production RA → RB RC
•        if P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
•if any of P[1,n,x] is true (x is iterated over the set s, where s are all the indices for Rs) then
•  S is member of language
•else