Chapter 6: Tagging


Page 1:

Chapter 6: Tagging

Page 2:

POS Tagging (품사태깅)

• Labeling each word in a sentence with its appropriate part of speech (POS)

• Information sources in tagging:

– Tags of other words in the context

– The word itself

• Different approaches:

– Rule-based Tagger

– Stochastic POS Tagger

• Simplest stochastic Tagger

• HMM Tagger

• …

Page 3:

Simplest Stochastic Tagger

• Each word is assigned its most frequent tag (the tag most frequently seen with it in the training set)

– For English, this alone yields over 90% accuracy

• Problem: every word may receive a valid tag while the tag sequence as a whole is unacceptable (see the sketch below):

– Time flies like an arrow

  NN  VBZ  VB  DT  NN

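A minimal sketch of this baseline in Python; the training pairs below are illustrative toy counts, not real corpus statistics:

```python
# Most-frequent-tag baseline: each word gets the tag it was seen with most often.
from collections import Counter, defaultdict

def train_baseline(pairs):
    """pairs: iterable of (word, tag) from a tagged training corpus."""
    counts = defaultdict(Counter)            # word -> tag frequencies
    for word, tag in pairs:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_sentence(words, model, default="NN"):
    # Unknown words fall back to a default tag.
    return [model.get(w, default) for w in words]

train = [("Time", "NN"), ("flies", "NNS"), ("flies", "NNS"), ("flies", "VBZ"),
         ("like", "VB"), ("like", "VB"), ("like", "IN"),
         ("an", "DT"), ("arrow", "NN")]
model = train_baseline(train)
print(tag_sentence("Time flies like an arrow".split(), model))
# -> ['NN', 'NNS', 'VB', 'DT', 'NN']: each tag is valid for its word,
#    but the sequence is unacceptable because context is ignored.
```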

Page 4:


Markov Models (MM)

• In a Markov chain, the next element of the sequence depends only on the current element, not on the earlier elements

• X = (X1, …, XT) is a sequence of random variables taking values in the state space S = {s1, …, sN}

and the transition probabilities are

$a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, with $a_{ij} \ge 0 \;\; \forall i, j$ and $\sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i$

ex.

$P(X_3 \mid X_1, X_2) = P(X_3 \mid X_2)$

$P(X_5 \mid X_1, X_2, X_3, X_4) = P(X_5 \mid X_4)$
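As a concrete illustration, a minimal first-order Markov chain sketch; the states and probabilities below are hypothetical:

```python
# First-order Markov chain over two hypothetical states.
states = ["rain", "sun"]
pi = {"rain": 0.3, "sun": 0.7}                 # initial distribution
a = {                                          # a[i][j] = P(X_{t+1}=j | X_t=i)
    "rain": {"rain": 0.6, "sun": 0.4},         # each row sums to 1
    "sun":  {"rain": 0.2, "sun": 0.8},
}

def sequence_probability(xs):
    """P(X_1..X_T) = pi(X_1) * prod_t a[X_t][X_{t+1}], by the Markov assumption."""
    p = pi[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= a[prev][cur]
    return p

print(sequence_probability(["sun", "sun", "rain"]))   # 0.7 * 0.8 * 0.2 = 0.112
```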

Page 5:

Example of Markov Models (MM)


Page 6:

Hidden Markov Model (HMM)

• In a (visible) MM, we know which state sequence the model passes through, so the state sequence itself is regarded as the output

• In an HMM, we don't know the state sequence, only some probabilistic function of it

– In POS tagging, only the words are given; the true POS tag of each word is unknown

– The probabilistic function is known

• Markov models can be used wherever one wants to model the probability of a linear sequence of events

• An HMM can be trained from unannotated text

– In theory, it can be trained without labeled data

– In practice, however, this gives poor performance, so labeled training data is used in most cases

[Figure: (Visible) Markov Model]


Page 7:

HMM Example

[Figure: HMM for "Time flies like an arrow"]

State:  NN    VBZ    IN    DT   NN
Output: Time  flies  like  an   arrow

(alternative states shown: NNS, VB)

Page 8:

HMM Tagger

• Assumption: a word's tag depends only on the previous tag, and this dependency does not change over time

– $P(S_{t+1} \mid S_t)$, $0 < t < n+1$

• An HMM tagger uses states to represent POS tags and outputs (symbol emissions) to represent words

• The tagging task is to find the most probable tag sequence for a given sequence of words


Page 9:

Finding the most probable sequence

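The standard way to find this sequence is the Viterbi dynamic-programming algorithm (the same algorithm named for the HMM taggers in the results slides later). A minimal sketch, assuming toy dictionaries `pi`, `trans`, and `emit` for initial, transition, and emission probabilities:

```python
# Viterbi decoding for an HMM tagger. In practice log-probabilities are
# used to avoid underflow; raw products are kept here for clarity.
def viterbi(words, tags, pi, trans, emit):
    # delta[t]: probability of the best path ending in tag t after the first word
    delta = {t: pi[t] * emit[t].get(words[0], 0.0) for t in tags}
    backpointers = []
    for w in words[1:]:
        prev, step = delta, {}
        delta = {}
        for t in tags:
            best = max(tags, key=lambda s: prev[s] * trans[s][t])
            delta[t] = prev[best] * trans[best][t] * emit[t].get(w, 0.0)
            step[t] = best
        backpointers.append(step)
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: delta[t])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return list(reversed(path))
```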

Page 10:

HMM tagging – an example

Page 11:

Calculating the most likely sequence

[Figure legend: green = transition probabilities, blue = emission probabilities]


Page 12:

Dealing with unknown words

• The simplest model: assume that an unknown word can take any POS tag, or assign it the most frequent tag in the tagset

• In practice, morphological information such as suffixes is used as a hint


Page 13:


TnT (Trigrams’n’Tags)

• A statistical tagger using Markov Models: states represent tags and outputs represent words

• Finding the best tag sequence amounts to calculating:

$\operatorname*{argmax}_{t_1 \ldots t_T} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2}) \, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)$

Page 14:


Transition and emission probabilities

• Transition and output probabilities are estimated from a tagged corpus:

Bigrams:  $\hat{P}(t_3 \mid t_2) = \dfrac{f(t_2, t_3)}{f(t_2)}$

Trigrams: $\hat{P}(t_3 \mid t_1, t_2) = \dfrac{f(t_1, t_2, t_3)}{f(t_1, t_2)}$

Lexical:  $\hat{P}(w_3 \mid t_3) = \dfrac{f(w_3, t_3)}{f(t_3)}$

Page 15:


Smoothing Technique

• Needed due to the sparse-data problem

• In a limited corpus, many trigrams are likely to have zero counts:

– Without smoothing, the complete probability becomes zero

• Smoothing:

$P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$
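A sketch of this interpolation, reusing the estimators from the previous slide's sketch; the λ weights here are hypothetical placeholders (TnT itself sets them by deleted interpolation on the training corpus):

```python
# TnT-style linear interpolation of unigram, bigram, and trigram estimates.
def smoothed_trigram(t3, t1, t2, p_uni, p_bi, p_tri,
                     lambdas=(0.1, 0.3, 0.6)):   # hypothetical weights, sum to 1
    l1, l2, l3 = lambdas
    return l1 * p_uni(t3) + l2 * p_bi(t3, t2) + l3 * p_tri(t3, t1, t2)
```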

Page 16:

Other techniques

• Handling unknown words (see the sketch after this list)

– Use the longest suffix (the final sequence of characters of a word) as a strong predictor of the word class

– Calculate the probability of a tag t given the last m letters l_i of an n-letter word

• m depends on the specific word

• Capitalization

– Works better for English than for German

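A sketch of the suffix-based estimate for unknown words; the suffix-length cap and rare-word threshold below are illustrative assumptions (TnT builds its suffix statistics from infrequent training words, since those best resemble unknowns):

```python
# Suffix-based tag guessing for unknown words: estimate P(tag | last m letters)
# from rare words, backing off from the longest matching suffix.
from collections import Counter, defaultdict

def build_suffix_model(word_tag_counts, max_suffix=5, rare_threshold=10):
    """word_tag_counts: dict mapping (word, tag) -> frequency."""
    suffix_tags = defaultdict(Counter)
    for (word, tag), n in word_tag_counts.items():
        if n <= rare_threshold:                  # rare words stand in for unknowns
            for m in range(1, min(max_suffix, len(word)) + 1):
                suffix_tags[word[-m:]][tag] += n
    return suffix_tags

def p_tag_given_suffix(word, tag, suffix_tags, max_suffix=5):
    # Try the longest suffix first, then progressively shorter ones.
    for m in range(min(max_suffix, len(word)), 0, -1):
        c = suffix_tags.get(word[-m:])
        if c:
            return c[tag] / sum(c.values())
    return 0.0
```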

Page 17:

Learning Curve for Penn Treebank

Page 18:

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF (ACL16)

• LSTM-CRF model + char-level representation

– Char-level encoder: CNN

Page 19:

End-to-End Korean Morphological Analysis (Winter Conference '16)

Attention + Input-feeding + Copying mechanism

Page 20:

Korean Morphological Analysis and POS Tagging Using a BERT-based LSTM-CRF Model (HCLT19)


Input: 신문 시장은 여느 때보다 탈법과 불법이 난무하고 있다.

Syllable-level input sentence: 신문시장은여느때보다탈법과불법이난무하고있다 .

KorBERT input: [CLS] 신문_ 시장은_ 여느_ 때보다_ 탈법과_ 불법이_ 난무하고_ 있다 ._ [SEP]

Eojeol boundary labels: E B E B I E B E B I E B I E B I E B I I E B I E E

Output (morphological analysis and POS tagging): B-NNG I-NNG B-NNG I-NNG B-JX B-MM I-MM B-NNG B-JKB I-JKB B-NNG I-NNG B-JC B-NNG I-NNG B-JKS B-NNG I-NNG B-XSV B-EC B-VX B-EF B-SF

Models                              F1
나승훈 [18]: CRF*                   97.65
이건일 [6]: Sequence-to-sequence*   97.15
이창기 [15]: Structural SVM         98.03
RNN-search [13]                     95.92
황현선 [29]: Copying mechanism      97.08
BERT + LSTM-CRF (ours)              98.74

Page 21:

Recent Tagging Techniques – English

• Target: WSJ corpus (Penn tag set, Penn Treebank)

• Hidden Markov Model (Viterbi algorithm)

– TnT (2000): 96.5% ~ 96.7%

• Brill Tagger (rule-based, 1995): 97.2%

• Averaged Perceptron

– Collins (2002): 97.1%, COMPOST (2009): 97.2%

• SVM (2004): 97.2%

• Maximum Entropy / Conditional Random Fields

– CRF: 97%, Melt (2009): 97.0%, GENiA Tagger (2005): 97.1%

– Stanford Tagger (2011): 97.3%

• Deep Learning

– SENNA: 97.55%

– Bi-directional LSTM-CNNs-CRF (ACL16): 97.55%

– Bi-LSTM RNN + char representation (Ling et al., 2015): 97.78%

– Morpho-syntactic Tagging with a Meta-BiLSTM Model (Bohnet et al., 2018): 97.96%


Page 22:

Recent Tagging Techniques – Korean

• 창원대학교 (HMM + eojeol patterns, Sejong corpus, 2009)

– Eojeol accuracy = 90.5%, per-morpheme POS accuracy = 95.9%, 20K eojeols/sec

• KOMORAN (HMM?, 2013)

– Eojeol accuracy = 84.8%, per-morpheme POS accuracy = 91.2%, 30K eojeols/sec (200 KB/sec)

• KAIST (HMM, 2010)

– Eojeol accuracy = 89%

• 울산대 (HMM + pre-analyzed dictionary, 2012)

– Per-morpheme POS accuracy = 95.8% (including morpheme sense numbers), 48K eojeols/sec

• CRF + syllable-based

– Prof. 심광섭: eojeol accuracy 96.6% (different tag set and test set)

– 나승훈 (ETRI): POS tagging accuracy 96.2% (Sejong corpus; eojeol-level performance unknown)

• Structural SVM + syllable-based

– 이창기 (강원대, 2013): per-morpheme POS accuracy = 97.96% (Sejong corpus), 100 KB/sec

• Deep Learning (sequence-to-sequence, end-to-end)

– 이창기 (강원대, 2016): per-morpheme POS accuracy = 97.08% (Sejong corpus)

• Deep Learning (BERT + LSTM-CRF)

– 박천음, 이창기 (강원대, 2019): per-morpheme POS accuracy = 98.74% (Sejong corpus)
