K46 Nguyen Cam Tu Thesis


VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY

Nguyen Cam Tu

NAMED ENTITY RECOGNITION IN VIETNAMESE TEXT TO SUPPORT THE SEMANTIC WEB AND ENTITY-ORIENTED SEARCH

UNDERGRADUATE GRADUATION THESIS, FULL-TIME PROGRAM
Major: Information Technology

Supervisor: Dr. Ha Quang Thuy
Co-supervisor: MSc. Phan Xuan Hieu

HANOI - 2005

Acknowledgments

First of all, I would like to express my deepest gratitude to my teachers, Dr. Ha Quang Thuy and MSc. Phan Xuan Hieu, who devotedly guided me throughout my research and the writing of this thesis. I am sincerely grateful to the lecturers who taught me over the past four years; the knowledge I gained at university will support me in the years ahead. I would also like to thank the members and teachers of the Data Mining seminar group, among them MSc. Nguyen Tri Thanh, MSc. Tao Thi Thu Phuong, BSc. Vu Boi Hang and BSc. Nguyen Thi Huong Giang, for their valuable technical advice during my research. Finally, I thank all my friends, and especially my parents and my younger brother, who always encouraged me and helped me through the difficulties of life.

Student
Nguyen Cam Tu

Abstract

Named entity recognition is a basic step in information extraction from text and in natural language processing. It is widely used in machine translation, text summarization, natural language understanding and biomedical named entity recognition, and in particular in automatically integrating objects and entities from the Web into semantic ontologies and knowledge bases. This thesis presents a solution for recognizing entity types in Vietnamese text on the Web. After examining several approaches, I chose a machine learning approach and built a named entity recognition system based on Conditional Random Fields (CRF; Lafferty, 2001). The strength of CRFs is that they handle sequential data and can integrate hundreds of thousands, or even millions, of highly diverse features drawn from the data to support classification. Experiments on Vietnamese text show that the classification procedure achieves very promising results.

Table of contents

Acknowledgments
Abstract
Table of contents
List of abbreviations
Introduction
Chapter 1. The named entity recognition problem
    1.1. Information extraction
    1.2. The named entity recognition problem
    1.3. Modeling the named entity recognition problem
    1.4. Significance of the named entity recognition problem
Chapter 2. Approaches to the named entity recognition problem
    2.1. The manual approach
    2.2. Hidden Markov models (HMM)
        2.2.1. Overview of HMMs
        2.2.2. Limitations of hidden Markov models
    2.3. Maximum entropy Markov models (MEMM)
        2.3.1. Overview of maximum entropy Markov models (MEMM)
        2.3.2. The label bias problem
    2.4. Chapter summary
Chapter 3. Conditional Random Field (CRF)
    3.1. Definition of CRF
    3.2. The maximum entropy principle
        3.2.1. The conditional entropy measure
        3.2.2. Constraints on the model distribution
        3.2.3. The maximum entropy principle
    3.3. Potential functions of CRF models
    3.4. Labeling algorithm for sequence data
    3.5. CRFs can solve the label bias problem
    3.6. Chapter summary
Chapter 4. Parameter estimation for CRF models
    4.1. Iterative methods
        4.1.1. The GIS algorithm
        4.1.2. The IIS algorithm
    4.2. Numerical optimisation methods
        4.2.1. First-order numerical optimisation techniques
        4.2.2. Second-order numerical optimisation techniques
    4.3. Chapter summary
Chapter 5. A named entity recognition system for Vietnamese
    5.1. Experimental environment
        5.1.1. Hardware
        5.1.2. Software
        5.1.3. Experimental data
    5.2. The named entity recognition system for Vietnamese
    5.3. Training parameters and experimental evaluation
        5.3.1. Training parameters
        5.3.2. Evaluating named entity recognition systems
        5.3.3. The 10-fold cross validation method
    5.4. Feature selection
        5.4.1. Lexical context templates
        5.4.2. Context templates describing word characteristics
        5.4.3. Regular expression context templates
        5.4.4. Dictionary context templates
    5.5. Experimental results
        5.5.1. Results of the 10 experiments
        5.5.2. The experiment with the best result
        5.5.3. Average over the 10 experiments
        5.5.4. Remarks
Conclusion
Appendix: Output of the Vietnamese named entity recognition system
References

List of abbreviations

Term                              Abbreviation
Conditional Random Field          CRF
Hidden Markov model               HMM
Maximum entropy Markov model      MEMM

Introduction

Tim Berners-Lee, the father of today's World Wide Web, describes the Semantic Web as the future of the World Wide Web, one that combines content humans can understand with content machines can process. The success of the Semantic Web depends largely on ontologies and on Web pages annotated according to those ontologies. While the benefits of the Semantic Web are considerable, building ontologies by hand is extremely difficult. The solution is to use information extraction techniques in general, and named entity recognition in particular, to automate part of the ontology construction process. Once integrated into a search engine, ontologies and a named entity recognition system increase search precision and enable entity-oriented search, overcoming some weaknesses of today's keyword-based search engines.

Aware of the benefits of information extraction in general and of named entity recognition in particular, I chose named entity recognition for Vietnamese as the topic of this thesis.

The thesis is organized into five chapters:

Chapter 1 introduces information extraction and the named entity recognition problem together with its applications.

Chapter 2 presents several approaches to named entity recognition: the manual approach and the machine learning methods HMM and MEMM. Manual approaches are costly in time and effort and are not portable. Machine learning methods such as HMM and MEMM overcome these weaknesses but run into problems specific to each model. An HMM cannot integrate overlapping features, even though such features are very useful for labeling sequence data. An MEMM, in some special cases, suffers from the label bias problem: the tendency to ignore the observations when a state has few outgoing transitions.

Chapter 3 introduces the definition of CRFs, the maximum entropy principle (a method for estimating probability distributions from data, and the basis for choosing the potential functions of CRF models), and the Viterbi algorithm for labeling sequence data. Because a CRF defines a conditional distribution globally over the whole sequence, it overcomes the weaknesses of other machine learning models such as HMM and MEMM in labeling and segmenting sequence data.

Chapter 4 presents methods for estimating the parameters of CRF models: the IIS and GIS algorithms and gradient-based methods such as conjugate gradient, quasi-Newton and L-BFGS. Among these, L-BFGS is regarded as the best and converges fastest.

Chapter 5 presents the named entity recognition system for Vietnamese based on CRFs, proposes feature selection methods for recognizing entity types in Vietnamese text, and reports experimental results.

Chapter 1. The named entity recognition problem

The main subject of this thesis is the application of CRF models to named entity recognition for Vietnamese. This chapter gives an overview of information extraction [30][31][32], describes the named entity recognition problem in detail [13][15][30][31], and presents its applications.

1.1. Information extraction

Unlike full text understanding, information extraction systems only try to identify certain kinds of information of interest. Information can be extracted from text at several levels: identifying entities (Element Extraction), identifying relations between entities (Relation Extraction), identifying and tracking events and scenarios (Event and Scenario Extraction and Tracking), resolving co-reference (Co-reference Resolution), and so on. The techniques used in information extraction include segmentation, classification, association and clustering.

Input text:

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a superimportant shift for us in terms of code access. Richard Stallman, founder of the Free Software Foundation, countered saying

Output of information extraction (IE):

NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Soft..

Figure 1: An information extraction system

The output of an information extraction system is usually a set of templates, each containing a fixed number of slots filled with information.

At the level of semantic information extraction, a template represents an event in which the entities play specific roles. For example, at MUC-7 [31] (Seventh Message Understanding Conference), one required scenario template concerned missile and rocket launch events in 100 New York Times articles. Participating systems had to fill the template with enough information to answer questions about the time, the place, and so on of the launch events mentioned in the articles.

1.2. The named entity recognition problem

People, times, places, numbers and so on are basic objects in a text in any language. The main purpose of named entity recognition is to identify these objects, which in turn helps us understand the text.

Named entity recognition is the simplest of the information extraction problems, yet it is the most basic step before more complex problems in the field can be addressed. Clearly, before the relations between entities can be identified, the entities taking part in those relations must be identified first.

Although it is the most basic problem in information extraction, a large number of ambiguous cases make named entity recognition difficult. Some examples:

- "Bình Định và HAGL đều thua ở AFC Champions League." (Bình Định and HAGL both lost in the AFC Champions League.)
    o Here "Bình Định" must be marked as an organization (a football team) rather than as a location.
    o The word "Bình" stands at the start of the sentence, so its capitalization carries little information.
- When is "Hồ Chí Minh" used as a person name, and when as a location name?

Named entity recognition in Vietnamese text is harder than in English for the following reasons:

- There is a lack of training data and of lookup resources comparable to WordNet for English.
- There is a lack of part-of-speech (POS) information and of phrase information, such as noun phrases and verb phrases, for Vietnamese, although this information plays a very important role in named entity recognition.

Consider the following example:

"Cao Xumin, Chủ tịch Phòng Thương mại Xuất nhập khẩu thực phẩm của Trung Quốc, cho rằng cách xem xét của DOC khi đem so sánh giá tôm của Trung Quốc và giá tôm của Ấn Độ là vi phạm luật thương mại" (Cao Xumin, Chairman of China's Chamber of Commerce for Import and Export of Foodstuffs, holds that the DOC's approach of comparing Chinese shrimp prices with Indian shrimp prices violates trade law.)

We would like the passage above to be annotated with its entity labels.

This example reveals some of the difficulties that a Vietnamese named entity recognition system faces when labeling data (see the appendix):

- The phrase "Phòng Thương mại Xuất nhập khẩu thực phẩm" is the name of an organization, but not every word in it is capitalized.
- Knowing that "Phòng Thương mại Xuất nhập khẩu thực phẩm" is a noun phrase acting as the subject of the sentence would be very useful for predicting the entity type correctly. Because Vietnamese lacks systems that automatically predict grammatical function and phrase labels, named entity recognition is much harder than in English.

1.3. Modeling the named entity recognition problem

Named entity recognition in text amounts to answering the questions: who?, when?, where?, how much?, and so on. It is a particular case of the sequence labeling problem in which every label (except O) consists of a prefix B_ or I_ (meaning the beginning or the inside of an entity name) combined with a label name.

Table 1: Entity types

Label   Meaning
PER     Person name
ORG     Organization name
LOC     Location name
NUM     Number
PCT     Percentage
CUR     Currency
TIME    Date, time
MISC    Entity types other than the 7 above
O       Not an entity

Example: the label sequence for the phrase "Phan Văn Khải" is B_PER I_PER I_PER.

Thus, with 8 entity types including MISC, there are 17 labels (8*2+1). In essence, labeling data is a special case of text classification in which the classes are the labels to be assigned to the data.
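The label scheme above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_label_set` and `encode_spans`, the example sentence and its span annotation are assumptions made here, not part of the thesis system.

```python
# Entity types from Table 1; the outside label O is added separately.
ENTITY_TYPES = ["PER", "ORG", "LOC", "NUM", "PCT", "CUR", "TIME", "MISC"]


def build_label_set(entity_types):
    """Return every B_/I_ label plus the single outside label O."""
    labels = []
    for entity_type in entity_types:
        labels.append("B_" + entity_type)
        labels.append("I_" + entity_type)
    labels.append("O")
    return labels


def encode_spans(tokens, spans):
    """Turn (start, end, type) token spans (end exclusive) into one label per token."""
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        labels[start] = "B_" + entity_type
        for position in range(start + 1, end):
            labels[position] = "I_" + entity_type
    return labels


LABELS = build_label_set(ENTITY_TYPES)

# Hypothetical example sentence and annotation.
tokens = ["Phan", "Văn", "Khải", "thăm", "Hà", "Nội"]
spans = [(0, 3, "PER"), (4, 6, "LOC")]
tagged = encode_spans(tokens, spans)
```

With this encoding, a classifier only has to choose one of the 17 labels for each token, which is what makes the problem a sequence labeling task.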

1.4. Significance of the named entity recognition problem

A good named entity recognition system can be applied in many different areas. Specifically, it can be used to:

- Support the Semantic Web. The Semantic Web consists of Web pages that represent data intelligently, where "intelligent" refers to the ability to combine, classify and reason over the data. The success of the Semantic Web depends on ontologies [] and on the growth of Web pages annotated with metadata that follow these ontologies. Although the benefits of ontologies are considerable, building them by hand is extremely difficult. For this reason, tools that automatically extract information from Web pages to populate ontologies, such as a named entity recognition system, are essential.
- Build entity-oriented search engines. A user can quickly find Web pages about Clinton, a place in North Carolina, without having to browse hundreds of pages about President Bill Clinton.
- Serve as a preprocessing step that simplifies problems such as machine translation and text summarization.
- Act, as mentioned above, as a basic component for more complex information extraction problems.
- Let a user skim the person names, location names and company names mentioned in a document before reading it.
- Index books automatically. Most index entries in a book are entities.

A named entity recognition system for Vietnamese lays the groundwork for extracting information from Vietnamese documents and supports Vietnamese language processing. Using such a system to build an ontology of Vietnamese entities would lay the foundation for a new generation of the Web: the Vietnamese Semantic Web.

Chapter 2. Approaches to the named entity recognition problem

There are many approaches to named entity recognition. This chapter introduces several of them together with their strengths and weaknesses, and thereby explains why we chose the CRF-based method to build a named entity recognition system for Vietnamese.

2.1. The manual approach

A typical example of the manual approach is Proteus, the named entity recognition system of New York University that took part in MUC-6. The system is written in Lisp and is supported by a large number of rules. Below are some examples of the rules used by Proteus together with their exceptions:

- Title Capitalized_Word => Title Person_Name
    o Correct: Mr. Johns, Gen. Schwarzkopf
    o Exception: Mrs. Fields Cookies (a company)
- Month_name number_less_than_32 => Date
    o Correct: February 28, July 15
    o Exception: Long March 3 (the name of a Chinese rocket)

In practice, every such rule has a large number of exceptions. Even when the designers try to handle all the exceptions they can think of, there remain cases that only show up once the system is put to use. Moreover, building a rule-based extraction system is very labor-intensive. It typically takes several months of work by a programmer with considerable experience in linguistics, and even longer when the system has to be moved to another domain or another language.

The answer to these limitations is to build a system that can somehow learn by itself, which reduces the involvement of language experts and makes the system more portable. Many machine learning methods can be applied to named entity recognition, such as hidden Markov models (HMM), maximum entropy Markov models (MEMM) and Conditional Random Fields (CRF). CRF models are described in detail in the next chapter; here we only consider HMM and MEMM along with their strengths and weaknesses.
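To make the brittleness of hand-written rules concrete, the two rules quoted above can be approximated with regular expressions. This is a rough sketch, not how Proteus itself is implemented (Proteus is written in Lisp); the title list and the names `TITLE_RULE`, `DATE_RULE` and `apply_rules` are assumptions for illustration.

```python
import re

# Rule 1: Title Capitalized_Word => person name (the title list is an assumption).
TITLE_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Gen)\.\s+[A-Z][a-z]+")

# Rule 2: Month_name number_less_than_32 => date.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RULE = re.compile(r"\b(?:%s)\s+(?:[12]?\d|3[01])\b" % MONTHS)


def apply_rules(text):
    """Return (matched text, label) pairs produced by the two rules."""
    found = [(m.group(0), "PERSON") for m in TITLE_RULE.finditer(text)]
    found += [(m.group(0), "DATE") for m in DATE_RULE.finditer(text)]
    return found


correct = apply_rules("Mr. Johns arrived on February 28")
wrong_date = apply_rules("Long March 3 was launched")      # a rocket, not a date
wrong_person = apply_rules("Mrs. Fields Cookies opened")   # a company, not a person
```

Both exception cases are matched just as confidently as the correct ones, so each would need yet another hand-written rule to suppress it.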

2.2. Hidden Markov models (HMM)

Hidden Markov models [7][13][19] were introduced and studied in the late 1960s and early 1970s, and have since been widely applied in speech recognition, bioinformatics and natural language processing.

2.2.1. Overview of HMMs

An HMM is a probabilistic finite state machine whose parameters are the state transition probabilities and the probabilities of emitting an observation in each state.

The states of an HMM are regarded as hidden beneath the observations that the model generates.

An HMM generates an observation sequence through a series of state transitions that starts in one of the start states and stops in an end state. In each state, one element of the observation sequence is emitted before the model moves to the next state. In named entity recognition, each state corresponds to one of the labels B_PER, B_LOC, I_PER, ... and the observations are the words of the sentence. Although these classes do not generate words, each class assigned to a word can be viewed as generating that word in some way. We can therefore find the state sequence (the sequence of entity classes) that best describes the observation sequence (the sequence of words) by computing

$$ P(S \mid O) = \frac{P(S, O)}{P(O)} \tag{2.1} $$

Here S is the hidden state sequence and O is the known observation sequence. Because P(O) does not depend on S (and can be computed efficiently with the forward-backward algorithm [19]), finding the sequence S* that maximizes P(S|O) is equivalent to finding the S* that maximizes P(S,O).

An HMM can be represented as a directed graph:

[Figure: a chain of states S1 -> S2 -> S3 -> ... -> Sn-1 -> Sn, in which each state St emits the observation Ot.]

Figure 2: Directed graph describing an HMM

Here S_i is the state at time t=i in the state sequence S, and O_i is the observation at time t=i in the sequence O. Using the first-order Markov property (the current state depends only on the immediately preceding state) and the assumption that the observation at time t depends only on the state at time t, the probability P(S,O) can be computed as:

$$ P(S, O) = P(S_1)\, P(O_1 \mid S_1) \prod_{t=2}^{n} P(S_t \mid S_{t-1})\, P(O_t \mid S_t) \tag{2.2} $$
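The factorization in (2.2), and the search for the most probable state sequence under it, can be sketched as follows. All probabilities are hypothetical toy values chosen for illustration, and the three-label state set is a simplification; `viterbi` is a plain dynamic programming implementation that is checked against exhaustive enumeration.

```python
from itertools import product

# Hypothetical toy parameters (not estimated from any corpus).
STATES = ["B_PER", "I_PER", "O"]
START = {"B_PER": 0.4, "I_PER": 0.0, "O": 0.6}
TRANS = {
    "B_PER": {"B_PER": 0.1, "I_PER": 0.6, "O": 0.3},
    "I_PER": {"B_PER": 0.1, "I_PER": 0.4, "O": 0.5},
    "O": {"B_PER": 0.3, "I_PER": 0.0, "O": 0.7},
}
EMIT = {
    "B_PER": {"Bill": 0.7, "Clinton": 0.2, "spoke": 0.1},
    "I_PER": {"Bill": 0.1, "Clinton": 0.8, "spoke": 0.1},
    "O": {"Bill": 0.05, "Clinton": 0.05, "spoke": 0.9},
}


def joint_probability(states, observations):
    """P(S, O) as in equation (2.2)."""
    prob = START[states[0]] * EMIT[states[0]][observations[0]]
    for t in range(1, len(states)):
        prob *= TRANS[states[t - 1]][states[t]] * EMIT[states[t]][observations[t]]
    return prob


def viterbi(observations):
    """Return (probability, path) of the most probable state sequence."""
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Best way to reach state s from any previous state.
            prob, path = max(
                ((best[prev][0] * TRANS[prev][s], best[prev][1]) for prev in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


def brute_force(observations):
    """Maximum of P(S, O) over all state sequences, by enumeration."""
    return max(joint_probability(list(seq), observations)
               for seq in product(STATES, repeat=len(observations)))


observations = ["Bill", "Clinton", "spoke"]
best_prob, best_path = viterbi(observations)
```

Enumeration costs |S|^n evaluations, while the dynamic program needs only on the order of n*|S|^2 steps, which is why the latter is used in practice.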

Finding the optimal state sequence that best describes a given observation sequence can be done with a dynamic programming technique, the Viterbi algorithm [19].

2.2.2. Limitations of hidden Markov models

In the paper "Maximum Entropy Markov Models for Information Extraction and Segmentation" [5], Andrew McCallum points out two problems that traditional HMMs in particular, and generative models in general, face when labeling sequence data.

First, to compute the joint probability P(S, O) in (2.1), one generally has to enumerate all possible sequences S and O. The sequences S can be enumerated because the number of states is finite, but in some applications the sequences O cannot, because the observations are extremely rich and diverse. To deal with this, an HMM has to assume independence among the observations: the observation at time t depends only on the state at that time. For sequence labeling, however, more flexible representations of the observations are desirable, such as representing them by features that need not be independent of one another. For example, when classifying questions and answers in a FAQ list, the features may be the words themselves, the length of the line, the number of whitespace characters, whether the current line is indented, the number of non-alphabetic characters, features describing grammatical functions, and so on. Clearly these features are not necessarily independent of each other.

The second problem of generative models in sequence classification is that they use a joint probability to model problems that are conditional in nature. For such problems it is more appropriate to use a conditional model that computes P(S|O) directly rather than P(S,O) as in (2.1).
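The kind of overlapping line features mentioned for the FAQ example can be sketched as below. The function name `line_features` and the feature names are hypothetical; the "ends with a question mark" feature is an extra assumption added here for illustration and is not in the list above.

```python
def line_features(line):
    """Describe one line of text by several overlapping, non-independent features."""
    stripped = line.strip()
    return {
        "length": len(line),
        "num_spaces": sum(ch.isspace() for ch in line),
        "indented": line[:1].isspace(),
        # Characters that are neither letters nor whitespace.
        "num_non_alpha": sum(not ch.isalpha() and not ch.isspace() for ch in line),
        "first_word": stripped.split()[0].lower() if stripped else "",
        # Assumption for illustration only.
        "ends_with_question_mark": stripped.endswith("?"),
    }


features = line_features("  Q: What is a CRF?")
```

Features such as `length`, `num_spaces` and `indented` are clearly correlated, which is exactly what the independence assumption of an HMM cannot accommodate.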

2.3. Maximum entropy Markov models (MEMM)

McCallum proposed a new Markov model, the MEMM [5] (Maximum Entropy Markov Model), as an answer to these problems of traditional Markov models.

2.3.1. Overview of maximum entropy Markov models (MEMM)

An MEMM replaces the transition probabilities and the emission probabilities of the HMM by a single probability function P(S_i | S_{i-1}, O_i): the probability that the current state is S_i given that the previous state is S_{i-1} and the current observation is O_i. The MEMM takes the observations as given, so we need not care about the probability of generating them; the only quantities of interest are the state transition probabilities. Compared with the HMM, the current observation here depends not only on the current state but possibly also on the previous state. In other words, the current observation is attached to the state transition rather than to individual states as in the traditional HMM.

[Figure: a chain of states S1 -> S2 -> S3 -> ... -> Sn-1 -> Sn, in which each state St is also conditioned on the observation Ot.]

Figure 3: Directed graph describing an MEMM

Applying the first-order Markov property, the probability P(S|O) can be computed as:

$$ P(S \mid O) = P(S_1 \mid O_1) \prod_{t=2}^{n} P(S_t \mid S_{t-1}, O_t) \tag{2.3} $$

An MEMM treats the observations as given conditions rather than as elements generated by the model, as in an HMM. The state transition probability can therefore depend on diverse features of the observation sequence. These features are not restricted by the independence assumption of the HMM and play an important role in determining the next state.

Denote P_{S_{i-1}}(S_i | O_i) = P(S_i | S_{i-1}, O_i). Applying the maximum entropy method (discussed in Chapter 3), McCallum determines that the distribution of the state transition probability has the following exponential form:

$$ P_{S_{i-1}}(S_i \mid O_i) = \frac{1}{Z(O_i, S_{i-1})} \exp\left( \sum_a \lambda_a f_a(O_i, S_i) \right) \tag{2.4} $$

Here the λ_a are the parameters to be trained (estimated); Z(O_i, S_{i-1}) is the normalization factor that makes the probabilities of moving from state S_{i-1} to all adjacent states S_i sum to 1; f_a(O_i, S_i) is a feature function at position i in the observation sequence and in the state sequence. Each feature function f_a(O_i, S_i) takes two arguments: the current observation O_i and the current state S_i. McCallum defines a = <b, s>, where b is a binary feature that depends only on the current observation and s is a value of the current state. An example of such a feature b:

$$ b(O_i) = \begin{cases} 1 & \text{if the current observation is ``the''} \\ 0 & \text{otherwise} \end{cases} $$

The feature function f_a(O_i, S_i) is determined by b(O_i) and by the current state taking a particular value:

$$ f_a(O_i, S_i) = \begin{cases} 1 & \text{if } b(O_i) = 1 \text{ and } S_i = s \\ 0 & \text{otherwise} \end{cases} $$
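A small numerical sketch of the per-state exponential model in (2.4) may help. The two states, the binary features and all weights below are hypothetical values chosen for illustration; each previous state gets its own weight table, and the normalization runs only over the next states.

```python
import math

STATES = ["B_PER", "O"]


def b_is_capitalized(observation):
    """Binary feature of the observation alone."""
    return 1 if observation[:1].isupper() else 0


def b_is_the(observation):
    """Binary feature of the observation alone."""
    return 1 if observation == "the" else 0


BINARY_FEATURES = {"capitalized": b_is_capitalized, "is_the": b_is_the}

# WEIGHTS[previous state][(b, s)] = lambda_a for a = <b, s>; hypothetical values.
WEIGHTS = {
    "O": {("capitalized", "B_PER"): 2.0, ("is_the", "O"): 3.0},
    "B_PER": {("capitalized", "B_PER"): 0.5, ("is_the", "O"): 3.0},
}


def transition_distribution(prev_state, observation):
    """P_{prev_state}(S_i | O_i) as in equation (2.4)."""
    scores = {}
    for state in STATES:
        total = 0.0
        for (b_name, target_state), weight in WEIGHTS[prev_state].items():
            # f_a fires only if b(O_i) = 1 and S_i equals the state named in a.
            if target_state == state and BINARY_FEATURES[b_name](observation) == 1:
                total += weight
        scores[state] = math.exp(total)
    z = sum(scores.values())  # Z(O_i, S_{i-1})
    return {state: score / z for state, score in scores.items()}


after_outside = transition_distribution("O", "Bill")
after_person = transition_distribution("B_PER", "Bill")
```

Because the distribution is normalized separately for every previous state, the probabilities leaving each state always sum to 1, whatever the observation is.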

To label data, an MEMM finds the state sequence S that maximizes P(S|O) in (2.3). As with HMMs, this is done with the Viterbi algorithm.

2.3.2. The label bias problem

In some special cases, MEMMs and other models that define a separate probability distribution for each state may suffer from the label bias problem [15][17]. Consider the following simple state transition scenario:

[Figure: a state graph with two paths from state 0 to state 5: 0 -> 1 -> 2 -> 5, reading r, i, b (rib), and 0 -> 3 -> 4 -> 5, reading r, o, b (rob).]

Figure 4: The label bias problem

Suppose we need to determine the state sequence for the observation sequence rob. The correct state sequence S is 0345, and we expect the probability P(0345|rob) to be greater than P(0125|rob).

Applying (2.3), we have:

P(0125|rob) = P(0) * P(1|0,r) * P(2|1,o) * P(5|2,b)

Because the probabilities of moving from a state to its adjacent states sum to 1, state 1 has no choice but to move to state 2, even though it has never seen the observation o. This means P(2|1,x) = 1, where x can be any observation. More generally, states whose transition distributions have low entropy (few outgoing transitions) tend to pay less attention to the current observation.

We also have P(5|2,b) = 1, and therefore P(0125|rob) = P(0) * P(1|0,r). Similarly, P(0345|rob) = P(0) * P(3|0,r). If the word rib occurs more often than the word rob in the training set, then P(3|0,r) is smaller than P(1|0,r), so P(0345|rob) is smaller than P(0125|rob). In other words, the state sequence S=0125 is always chosen, whether the observation sequence is rib or rob.

In 1991, Léon Bottou proposed two solutions to this problem. The first is to merge states 1 and 3 and to delay the branching until a discriminating observation is met (here, i and o). This is a special case of converting a non-deterministic automaton into a deterministic one. The difficulty is that, even when the conversion is possible, it leads to a combinatorial explosion in the number of states. The second solution is to start with a fully connected graph of states and let the training procedure decide on a suitable structure for the model. Unfortunately, this loses the ordered structure of the model, a property that is very useful in information extraction tasks [5].

A more appropriate solution is to consider the whole state sequence as a unit and to let some of the transitions in the sequence play a decisive role in choosing it. This means that the probability of the whole state sequence does not have to be conserved during the transitions, but can change at a transition depending on the observation there. In the example above, the transition probabilities at states 1 and 3 can then have more influence on which state sequence is chosen than the transition probability at state 0.
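The rib/rob argument can be checked numerically. The sketch below estimates locally normalized transition probabilities by counting over a hypothetical training set in which rib is three times as frequent as rob; the corpus, the function names and the uniform treatment of unseen observations are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Outgoing transitions of the state graph in Figure 4.
SUCCESSORS = {0: [1, 3], 1: [2], 2: [5], 3: [4], 4: [5]}
PATH_FOR_WORD = {"rib": [0, 1, 2, 5], "rob": [0, 3, 4, 5]}


def train(words):
    """Count (state, observation) -> next state over the training words."""
    counts = defaultdict(Counter)
    for word in words:
        path = PATH_FOR_WORD[word]
        for position, char in enumerate(word):
            counts[(path[position], char)][path[position + 1]] += 1
    return counts


def transition_prob(counts, state, observation, next_state):
    """Locally normalized P(next_state | state, observation)."""
    seen = counts.get((state, observation))
    if not seen:
        # Unseen observation: the mass is still shared among the outgoing
        # transitions only, so a state with one successor gets probability 1.
        if next_state in SUCCESSORS[state]:
            return 1.0 / len(SUCCESSORS[state])
        return 0.0
    return seen[next_state] / sum(seen.values())


def path_prob(counts, path, word):
    """P(path | word) as in (2.3), with P(0) = 1 since every path starts at 0."""
    prob = 1.0
    for position, char in enumerate(word):
        prob *= transition_prob(counts, path[position], char, path[position + 1])
    return prob


counts = train(["rib", "rib", "rib", "rob"])  # hypothetical corpus
p_0125_given_rob = path_prob(counts, [0, 1, 2, 5], "rob")
p_0345_given_rob = path_prob(counts, [0, 3, 4, 5], "rob")
```

The observation o at the second position has no effect on either path probability: the decision is made entirely at state 0, where both words look the same.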

2.4. Chapter summary

This chapter introduced the approaches to named entity recognition: the manual approach and the machine learning approaches (HMM and MEMM). The manual approach is costly in effort and time and is not portable; an HMM cannot integrate the rich features of the observation sequence into classification; and an MEMM suffers from the label bias problem. The analysis of each method shows the need for a model that is truly suited to labeling sequence data in general and to named entity recognition in particular.

Chapter 3. Conditional Random Field (CRF)

CRFs [6][11][12][15][16][17] were first introduced in 2001 by Lafferty and colleagues. Like the MEMM, a CRF is a conditional probability model and can integrate diverse features of the observation sequence to support classification. Unlike the MEMM, however, a CRF is an undirected graphical model. This lets a CRF define the probability distribution of the entire state sequence given the observation sequence, instead of a distribution over each state given the previous state and the current observation as in MEMMs. Because of this way of modeling, CRFs can solve the label bias problem.

This chapter gives the definition of CRFs, some methods for estimating the parameters of CRF models, and the modified Viterbi algorithm for finding the state sequence that best describes a given observation sequence.

Notation conventions:

- Upper-case letters X, Y, Z, ... denote random variables.
- Bold lower-case letters x, y, t, s, ... denote vectors, such as the vector representing the observation sequence or the vector representing the label sequence.
- A bold lower-case letter with an index denotes one component of a vector; for example, x_i is the component at position i of the vector x.
- Non-bold lower-case letters such as x, y, ... denote single values, such as one observation or one state.
- S: the finite set of states of a CRF model.

3.1. Definition of CRF

Let X be a random variable over the data sequences to be labeled and Y a random variable over the corresponding label sequences. Each component Y_i of Y is a random variable taking values in the finite set of states S. In named entity recognition, X ranges over natural language sentences, Y is a random sequence of entity labels corresponding to these sentences, and each component Y_i of Y ranges over the set of all entity labels (person name, location name, ...).

Let G = (V, E) be an undirected acyclic graph, where V is the set of vertices and E is the set of undirected edges connecting them. The vertices V represent the components of the random variable Y, such that there is a one-to-one mapping between a vertex v and a component Y_v of Y. We say that (Y|X) is a conditional random field (CRF) when, conditioned on X, the random variables Y_v obey the Markov property with respect to the graph G:

$$ P(Y_v \mid X, Y_\omega, \omega \neq v) = P(Y_v \mid X, Y_\omega, \omega \in N(v)) \tag{3.1} $$

Here N(v) is the set of all vertices adjacent to v. A CRF is thus a random field globally conditioned on X. In problems on sequence data, G is simply a chain, G = (V = {1, 2, ..., n}, E = {(i, i+1)}).

Write X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n). The graphical model of a CRF has the form:

[Figure: a chain of label nodes Y1 - Y2 - Y3 - ... - Yn-1 - Yn connected by undirected edges, all conditioned on the observation sequence X.]

Figure 5: Undirected graph describing a CRF

Let C be the set of all cliques (fully connected subgraphs) of the graph G that represents the structure of a CRF. Applying the Hammersley-Clifford result [14] for Markov random fields, p(y|x), the probability of the label sequence given the observation sequence, factorizes into a product of potential functions:

$$ P(\mathbf{y} \mid \mathbf{x}) = \prod_{A \in C} \Psi_A(A \mid \mathbf{x}) \tag{3.2} $$

In problems on sequence data, the graph representing the structure of a CRF is a chain as in Figure 5, so the set C must be the union of E and V, where E is the set of edges of G and V is the set of vertices of G. In other words, each subgraph A consists of either a single vertex or a single edge of G.

3.2. The maximum entropy principle

Lafferty et al. [17] determine the potential functions of CRF models from the maximum entropy principle [1][3][8][29]. Maximum entropy is a principle for estimating probability distributions from a set of training data.

3.2.1. The conditional entropy measure

Entropy measures the uniformity, or uncertainty, of a probability distribution. The conditional entropy of a model distribution p(y|x) over a state sequence given an observation sequence has the following form:

$$ H(p) = -\sum_{\mathbf{x}, \mathbf{y}} \tilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) \log p(\mathbf{y} \mid \mathbf{x}) \tag{3.3} $$
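As a quick check of (3.3), the following sketch computes the conditional entropy of two hypothetical models over a two-label set. The observation marginal and both conditional distributions are made-up values for illustration.

```python
import math


def conditional_entropy(p_x, p_y_given_x):
    """H(p) = - sum over x, y of p~(x) * p(y|x) * log p(y|x), as in (3.3)."""
    total = 0.0
    for x, px in p_x.items():
        for y, p in p_y_given_x[x].items():
            if p > 0.0:  # the term 0 * log 0 is taken to be 0
                total -= px * p * math.log(p)
    return total


# Hypothetical empirical marginal over observations.
P_X = {"Bill": 0.5, "spoke": 0.5}

uniform_model = {
    "Bill": {"B_PER": 0.5, "O": 0.5},
    "spoke": {"B_PER": 0.5, "O": 0.5},
}
confident_model = {
    "Bill": {"B_PER": 1.0, "O": 0.0},
    "spoke": {"B_PER": 0.0, "O": 1.0},
}

h_uniform = conditional_entropy(P_X, uniform_model)
h_confident = conditional_entropy(P_X, confident_model)
```

The uniform model reaches the largest possible value for two labels, log 2, while a model that is certain about every label has entropy 0.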

3.2.2. Constraints on the model distribution

The constraints on the model distribution are set up by collecting statistics of features extracted from the training data. An example of such a feature:

$$ f = \begin{cases} 1 & \text{if the preceding word is ``ông'' and the current label is B\_PER} \\ 0 & \text{otherwise} \end{cases} $$

The set of features captures the important information in the training data. The expectation of a feature f under the empirical distribution is written:

$$ E_{\tilde{p}(x,y)}[f] \equiv \sum_{x,y} \tilde{p}(x, y)\, f(x, y) \tag{3.4} $$

Here \tilde{p}(x, y) is the empirical distribution in the training data. Suppose the training data consists of N pairs, each made of an observation sequence and a label sequence, D = {(x_i, y_i)}. The empirical distribution in the training data is then computed as:

$$ \tilde{p}(x, y) = \frac{1}{N} \times \text{number of times } x, y \text{ occur together in the training set} $$

The expectation of the feature f under the model distribution is:

$$ E_{p}[f] \equiv \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) \tag{3.5} $$

The model distribution agrees with the empirical distribution only if the expectation of every feature under the empirical distribution equals its expectation under the model distribution:

$$ E_{\tilde{p}(x,y)}[f] = E_{p}[f] \tag{3.6} $$
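The two expectations in (3.4) and (3.5) can be computed directly on a toy data set. To keep the sketch short, each training event is simplified to a single position, with x = (previous word, current word) and y = the current label; the data, the label set and the uniform model are hypothetical.

```python
from collections import Counter

LABELS = ["B_PER", "B_LOC", "O"]

# Hypothetical training events: ((previous word, current word), label).
TRAINING = [
    (("ông", "Khải"), "B_PER"),
    (("ông", "Khải"), "B_PER"),
    (("ông", "ấy"), "O"),
    (("tại", "Huế"), "B_LOC"),
]


def f(x, y):
    """1 if the preceding word is 'ông' and the current label is B_PER."""
    previous_word, _current_word = x
    return 1 if previous_word == "ông" and y == "B_PER" else 0


def empirical_expectation(data, feature):
    """Equation (3.4): sum over (x, y) of p~(x, y) * f(x, y)."""
    joint_counts = Counter(data)
    n = len(data)
    return sum(count / n * feature(x, y) for (x, y), count in joint_counts.items())


def model_expectation(data, feature, model):
    """Equation (3.5): sum over x, y of p~(x) * p(y|x) * f(x, y)."""
    marginal_counts = Counter(x for x, _ in data)
    n = len(data)
    return sum(count / n * model(y, x) * feature(x, y)
               for x, count in marginal_counts.items() for y in LABELS)


def uniform_model(y, x):
    """A model that ignores x and spreads probability evenly over the labels."""
    return 1.0 / len(LABELS)


e_empirical = empirical_expectation(TRAINING, f)
e_uniform = model_expectation(TRAINING, f, uniform_model)
```

Here the uniform model gives 0.25 while the data gives 0.5, so the uniform model violates the constraint (3.6) for this feature and cannot be the solution.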

Equation (3.6) expresses one constraint on the model distribution. If we select n features from the training data, we obtain n corresponding constraints on the model distribution.

3.2.3. The maximum entropy principle

Let P be the space of all conditional probability distributions and n the number of features extracted from the training data. P' is the subset of P defined as follows:

$$ P' = \left\{ p \in P \mid E_p(f_i) = E_{\tilde{p}}(f_i) \;\; \forall i \in \{1, 2, 3, \dots, n\} \right\} \tag{3.7} $$

Figure 6: Model constraints

P is the space of all probability distributions. Case (a): there are no constraints. Case (b): there is one constraint C1, and the models p that satisfy it lie on the line C1. Case (c): two constraints C1 and C2 intersect, and the model p that satisfies both is the intersection of the lines C1 and C2. Case (d): two constraints C1 and C2 do not intersect, and no model p satisfies both.

The guiding idea of the maximum entropy principle is to find a model distribution that obeys every assumption known from the data and makes no additional assumption. That is, the model distribution must satisfy all constraints derived from the data and must otherwise be as close as possible to the uniform distribution.

In mathematical terms, we must find the model distribution p(y|x) that satisfies two conditions: it belongs to the set P' in (3.7), and it maximizes the conditional entropy (3.3).

For each feature f_i we introduce a Lagrange multiplier λ_i and define the Lagrangian L(p, λ) as follows:

$$ L(p, \lambda) = H(p) + \sum_i \lambda_i \left( E_p[f_i] - E_{\tilde{p}}[f_i] \right) \tag{3.8} $$

The distribution p(y|x) that maximizes the entropy H(p) and satisfies the n constraints of the form E_{\tilde{p}(x,y)}[f] = E_p[f] also maximizes L(p, λ) (by the theory of Lagrange multipliers). From (3.8) we derive:

$$ p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_i \lambda_i f_i(\mathbf{x}, \mathbf{y}) \right) \tag{3.9} $$

Here Z(x) is the normalization factor that ensures \sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}) = 1 for every x:

$$ Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left( \sum_i \lambda_i f_i(\mathbf{x}, \mathbf{y}) \right) \tag{3.10} $$

3.3. Potential functions of CRF models
By applying the maximum entropy principle, Lafferty determined that the potential function of a CRF has the form of an exponential function:

$$\psi_A(A \mid x) = \exp\Big( \sum_k \lambda_k f_k(A \mid x) \Big) \tag{3.11}$$

Here $f_k$ is a feature of the observation sequence and $\lambda_k$ is a weight that indicates how informative the feature $f_k$ is. There are two kinds of features, transition features (denoted t) and state features (denoted s), depending on whether A is a subgraph of G consisting of an edge or of a single vertex. Substituting the potential functions into (3.2) and adding a normalization factor Z(x), which ensures that the probabilities of all label sequences for a given observation sequence sum to 1, we obtain:


$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \sum_k \lambda_k\, t_k(y_{i-1}, y_i, x) + \sum_i \sum_k \mu_k\, s_k(y_i, x) \Big) \tag{3.12}$$

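Here x and y are the observation sequence and the corresponding state sequence; $t_k$ is a feature of the entire observation sequence and of the states at positions i-1 and i in the state sequence; $s_k$ is a feature of the entire observation sequence and of the state at position i in the state sequence. For example:

$$s_i = \begin{cases} 1 & \text{if } x_i = \text{Bill and } y_i = \text{B\_PER} \\ 0 & \text{otherwise} \end{cases}$$

$$t_i = \begin{cases} 1 & \text{if } x_{i-1} = \text{Bill},\ x_i = \text{Clinton},\ y_{i-1} = \text{B\_PER and } y_i = \text{I\_PER} \\ 0 & \text{otherwise} \end{cases}$$

The normalization factor Z(x) is computed as follows:

$$Z(x) = \sum_{y} \exp\Big( \sum_i \sum_k \lambda_k\, t_k(y_{i-1}, y_i, x) + \sum_i \sum_k \mu_k\, s_k(y_i, x) \Big) \tag{3.13}$$

$\theta = (\lambda_1, \lambda_2, \ldots, \mu_1, \mu_2, \ldots)$ is the vector of model parameters. The value of $\theta$ is estimated with the parameter estimation methods presented in the next part.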
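Equations (3.12) and (3.13) can be evaluated by brute force on a very short sentence. The sketch below makes several assumptions that are not in the text: it uses only the two example features above, the weights `LAMBDA` and `MU` are invented, there is no special start or end state, and Z(x) is computed by enumerating every label sequence, which is only feasible for toy inputs.

```python
import math
from itertools import product

LABELS = ["B_PER", "I_PER", "O"]
LAMBDA, MU = 1.5, 2.0  # hypothetical weights for t and s

def s(y_i, x, i):
    """State feature: the word is "Bill" and its label is B_PER."""
    return 1.0 if x[i] == "Bill" and y_i == "B_PER" else 0.0

def t(y_prev, y_i, x, i):
    """Transition feature: "Bill Clinton" labeled B_PER, I_PER."""
    return 1.0 if (x[i - 1] == "Bill" and x[i] == "Clinton"
                   and y_prev == "B_PER" and y_i == "I_PER") else 0.0

def score(y, x):
    """The exponent of (3.12) for one label sequence y."""
    total = 0.0
    for i in range(len(x)):
        total += MU * s(y[i], x, i)
        if i > 0:  # no transition into the first position in this sketch
            total += LAMBDA * t(y[i - 1], y[i], x, i)
    return total

def normalizer(x):
    """Z(x) of (3.13), summed over all |LABELS|^n label sequences."""
    return sum(math.exp(score(y, x)) for y in product(LABELS, repeat=len(x)))

def prob(y, x):
    return math.exp(score(y, x)) / normalizer(x)

x = ["Bill", "Clinton"]
probs = {y: prob(y, x) for y in product(LABELS, repeat=len(x))}
best = max(probs, key=probs.get)
```

Because Z(x) sums over all label sequences, the probabilities in `probs` add up to 1, and the sequence that fires both features receives the highest probability.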

3.4. Labeling algorithm for sequence data
At each position i in the observation sequence, we define an |S| × |S| transition matrix as follows:

$$M_i(x) = [\, M_i(y', y, x) \,] \tag{3.14}$$

$$M_i(y', y, x) = \exp\Big( \sum_k \lambda_k\, t_k(y', y, x) + \sum_k \mu_k\, s_k(y, x) \Big) \tag{3.15}$$

Here $M_i(y', y, x)$ is the (unnormalized) score of the transition from state y' to state y given the observation sequence x. The state sequence y* that best describes the observation sequence x is the solution of:

$$y^* = \arg\max_{y}\, p(y \mid x) \tag{3.16}$$


The sequence y* is determined by a modified Viterbi algorithm. Define $\delta_i(y)$ as the largest probability of a state sequence of length i that ends in state y, given the observation sequence x.

Figure 7: One step of the modified Viterbi algorithm (the values $\delta_i(y_1), \delta_i(y_2), \ldots, \delta_i(y_N)$ at position i are used to compute $\delta_{i+1}(y_j)$)

Suppose all $\delta_i(y_k)$ are known for every $y_k$ in the state set S of the model, and $\delta_{i+1}(y_j)$ is to be determined. From Figure 7 we derive the recursion:

$$\delta_{i+1}(y_j) = \max_{y_k \in S} \big( \delta_i(y_k) \cdot M_{i+1}(y_k, y_j, x) \big) \tag{3.17}$$

Let $\mathrm{Pre}_i(y) = \arg\max_{y'} \big( \delta_{i-1}(y') \cdot M_i(y', y, x) \big)$. Suppose the observation sequence x has length n. The corresponding state sequence y* is recovered by backtracking as follows:

Step 1: over all y in the state set, find
    y*(n) = argmax_y δ_n(y)
    i ← n, y ← y*(n)
Loop: while i > 1
    y ← Pre_i(y)
    i ← i - 1
    y*(i) = y

The sequence y* found in this way is the sequence with the largest probability p(y*|x), and is therefore the label sequence that best fits the given observation sequence.
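The recursion and the backtracking step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: scores are kept in log space (sums instead of products, to avoid underflow), positions are numbered from 0, there is no start state, and `log_m` is a hypothetical stand-in for log M_i that reuses the two "Bill Clinton" example features with invented weights.

```python
LABELS = ["B_PER", "I_PER", "O"]

def log_m(i, y_prev, y, x):
    """Hypothetical log M_i(y_prev, y, x); y_prev is None at position 0."""
    value = 2.0 if (x[i] == "Bill" and y == "B_PER") else 0.0
    if (i > 0 and x[i - 1] == "Bill" and x[i] == "Clinton"
            and y_prev == "B_PER" and y == "I_PER"):
        value += 1.5
    return value

def viterbi(x, labels=LABELS):
    n = len(x)
    # delta[i][y]: best log score of a label sequence ending in y at position i
    delta = [{y: log_m(0, None, y, x) for y in labels}]
    pre = [{}]  # pre[i][y]: best predecessor of y at position i
    for i in range(1, n):
        delta.append({})
        pre.append({})
        for y in labels:
            best_prev = max(labels,
                            key=lambda yp: delta[i - 1][yp] + log_m(i, yp, y, x))
            delta[i][y] = delta[i - 1][best_prev] + log_m(i, best_prev, y, x)
            pre[i][y] = best_prev
    # Backtracking: start from the best final state and follow the pointers.
    y = max(labels, key=lambda lab: delta[n - 1][lab])
    path = [y]
    for i in range(n - 1, 0, -1):
        y = pre[i][y]
        path.append(y)
    path.reverse()
    return path

tags = viterbi(["Bill", "Clinton"])
```

The running time is proportional to n·|S|², in contrast to the |S|^n sequences that a brute-force search over all label sequences would have to score.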


3.5. CRFs can solve the label bias problem
The globally normalized nature of CRFs allows these models to avoid the label bias problem described in section 2.3.2 above. From a modeling point of view, a CRF can be regarded as a probabilistic state machine with unnormalized weights, each weight being attached to one state transition. Because the weights are unnormalized, state transitions can carry different degrees of importance. Any state can therefore amplify or dampen the probability mass passed on to its successor states, while the probability finally assigned to the whole state sequence remains a valid probability thanks to the global normalization factor.

In [17], Lafferty and his colleagues ran an experiment with 2000 training samples and 500 test samples, all of which contained ambiguous cases like the example described in section 2.3.2. The experiment showed an error rate of 4.6% for the CRF against 42% for the MEMM, which demonstrates that MEMMs fail to identify the correct branch in label bias situations.

3.6. Chapter summary
This chapter introduced the fundamentals of CRFs: the definition of a CRF, the labeling algorithm for sequence data in CRFs, the maximum entropy principle used to determine the potential functions of CRF models, and the argument that CRFs can solve the label bias problem. Applications of CRF models to sequence data problems [5] [9] show that CRFs handle this kind of data better than other machine learning models such as HMMs or MEMMs.


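Chapter 4. Parameter estimation for CRF models
The technique used to estimate the parameters of a CRF model is to maximize the likelihood between the model distribution and the empirical distribution. Suppose the training data consists of a set of N pairs, each made up of an observation sequence and the corresponding state sequence, $D = \{(x^{(i)}, y^{(i)})\}$, $i = 1 \ldots N$. The likelihood between the training set and the corresponding conditional model $p(y \mid x, \theta)$ is:

$$L(\theta) = \prod_{x,y} p(y \mid x, \theta)^{\tilde{p}(x, y)} \tag{4.1}$$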

Here $\theta = (\lambda_1, \lambda_2, \ldots, \mu_1, \mu_2, \ldots)$ are the parameters of the model and $\tilde{p}(x, y)$ is the joint empirical distribution of x and y in the training set. The maximum likelihood principle states that the best parameters of the model are those that maximize the likelihood function:

$$\theta_{ML} = \arg\max_{\theta} L(\theta) \tag{4.2}$$

$\theta_{ML}$ ensures that the data observed in the training set receive high probability under the model. In other words, the parameters that maximize the likelihood function bring the model distribution as close as possible to the empirical distribution of the training set. Because computing $\theta$ directly from (4.1) is very difficult, we instead determine the $\theta$ that maximizes the logarithm of the likelihood function (usually called the log-likelihood):

$$l(\theta) = \sum_{x,y} \tilde{p}(x, y)\, \log p(y \mid x, \theta) \tag{4.3}$$

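Since the logarithm is a monotonic function, this does not change the value of $\theta$ that is selected. Substituting $p(y \mid x, \theta)$ of the CRF model into (4.3), we obtain:

$$l(\theta) = \sum_{x,y} \tilde{p}(x, y) \Big[ \sum_{i=1}^{n+1} \lambda \cdot t + \sum_{i=1}^{n} \mu \cdot s \Big] - \sum_{x} \tilde{p}(x)\, \log Z(x) \tag{4.4}$$

Here $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_n)$ and $\mu = (\mu_1, \mu_2, \ldots, \mu_m)$ are the parameter vectors of the model, t is the vector of transition features $(t_1(y_{i-1}, y_i, x), t_2(y_{i-1}, y_i, x), \ldots)$, and s is the vector of state features $(s_1(y_i, x), s_2(y_i, x), \ldots)$.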


The log-likelihood function of a CRF model is concave and smooth over the entire parameter space. Because it is concave, its global maximum can be found by setting the components of its gradient vector to zero. Each component of the gradient vector is the derivative of the log-likelihood with respect to one model parameter. Differentiating the log-likelihood with respect to the parameter $\lambda_k$ gives:

$$\frac{\partial l(\theta)}{\partial \lambda_k} = \sum_{x,y} \tilde{p}(x, y) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x) - \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x) = E_{\tilde{p}(x,y)}[t_k] - E_{p(y \mid x, \theta)}[t_k] \tag{4.5}$$

Setting the expression above to zero is equivalent to imposing a constraint on the model: the mean of $t_k$ under the distribution $\tilde{p}(x)\, p(y \mid x, \theta)$ equals the mean of $t_k$ under the empirical distribution $\tilde{p}(x, y)$.

Mathematically, estimating the parameters of a CRF model is the problem of finding the maximum of the log-likelihood function. This chapter presents several methods for finding that maximum: iterative methods (IIS and GIS) and numerical optimization methods (conjugate gradient, Newton methods, ...).
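The identity "gradient = empirical expectation minus model expectation" from (4.5) can be verified numerically. The sketch below is a toy check under assumptions that are not from the text: three invented two-word training sentences, the two "Bill Clinton" example features only, no start or end state, and expectations computed by enumerating all label sequences.

```python
import math
from itertools import product

LABELS = ["B_PER", "I_PER", "O"]

def features(y, x):
    """Global feature counts (sum_i t(y_{i-1}, y_i, x), sum_i s(y_i, x))."""
    t_count = sum(1.0 for i in range(1, len(x))
                  if x[i - 1] == "Bill" and x[i] == "Clinton"
                  and y[i - 1] == "B_PER" and y[i] == "I_PER")
    s_count = sum(1.0 for i in range(len(x))
                  if x[i] == "Bill" and y[i] == "B_PER")
    return (t_count, s_count)

def log_prob(y, x, theta):
    def score(labels):
        return sum(w * f for w, f in zip(theta, features(labels, x)))
    log_z = math.log(sum(math.exp(score(cand))
                         for cand in product(LABELS, repeat=len(x))))
    return score(y) - log_z

def log_likelihood(data, theta):
    # Equation (4.3) with p~(x, y) = 1/N for each training pair
    return sum(log_prob(y, x, theta) for x, y in data) / len(data)

def gradient(data, theta):
    # Equation (4.5): empirical minus expected feature counts
    grad = [0.0] * len(theta)
    for x, y in data:
        for k, f in enumerate(features(y, x)):
            grad[k] += f / len(data)
        for cand in product(LABELS, repeat=len(x)):
            p = math.exp(log_prob(cand, x, theta))
            for k, f in enumerate(features(cand, x)):
                grad[k] -= p * f / len(data)
    return grad

# Hypothetical training set and parameter vector (lambda, mu).
data = [(("Bill", "Clinton"), ("B_PER", "I_PER")),
        (("Bill", "said"), ("B_PER", "O")),
        (("the", "Bill"), ("O", "O"))]
theta = [0.3, -0.2]

grad = gradient(data, theta)
eps = 1e-6
numeric = []
for k in range(len(theta)):
    bumped = list(theta)
    bumped[k] += eps
    numeric.append((log_likelihood(data, bumped) - log_likelihood(data, theta)) / eps)
```

The analytic gradient in `grad` agrees with the finite-difference estimate in `numeric` up to the discretization error, which is the kind of check one would typically run before handing the gradient to an optimizer.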

4.1. Iterative methods
Iterative methods gradually refine the model distribution by updating the model parameters according to

$$\lambda_k \leftarrow \lambda_k + \delta\lambda_k \tag{4.6}$$

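Here the values $\delta\lambda_k$ are chosen so that the likelihood function moves closer to its maximum. Lafferty et al. [17] proposed two iterative algorithms for estimating the parameters of a CRF model, IIS and GIS. In this section we first look at the general iterative method and then examine the IIS and GIS algorithms in detail.

Suppose we have a model $p(y \mid x, \theta)$, where $\theta = (\lambda_1, \lambda_2, \ldots, \mu_1, \mu_2, \ldots)$. The goal of the iterative methods is to find a new parameter set $\theta + \Delta\theta$ for which the log-likelihood takes a larger value than for the old parameter set, where $\Delta\theta = (\delta\lambda_1, \delta\lambda_2, \ldots, \delta\mu_1, \delta\mu_2, \ldots)$. In other words, we must find a way of updating the model parameters that brings the log-likelihood as close as possible to its maximum. The update is repeated until the log-likelihood converges (the absolute change in the log-likelihood is smaller than a given threshold).

For a CRF model, the increase in the log-likelihood is bounded below by an auxiliary function $A(\theta, \Delta\theta)$ defined as follows:

$$
\begin{aligned}
A(\theta, \Delta\theta) \equiv{} & \sum_{x,y} \tilde{p}(x, y) \Big[ \sum_{i=1}^{n+1} \sum_k \delta\lambda_k\, t_k(y_{i-1}, y_i, x) + \sum_{i=1}^{n} \sum_k \delta\mu_k\, s_k(y_i, x) \Big] \\
& + 1 - \sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x, \theta) \Big[ \sum_{i=1}^{n+1} \sum_k \frac{t_k(y_{i-1}, y_i, x)}{T(x, y)} \exp\big(\delta\lambda_k T(x, y)\big) \\
& \qquad + \sum_{i=1}^{n} \sum_k \frac{s_k(y_i, x)}{T(x, y)} \exp\big(\delta\mu_k T(x, y)\big) \Big]
\end{aligned} \tag{4.7}
$$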

Here $T(x, y)$ is the sum of all features of the observation sequence and the corresponding label sequence (x, y):

$$T(x, y) \equiv \sum_{i=1}^{n+1} \sum_k t_k(y_{i-1}, y_i, x) + \sum_{i=1}^{n} \sum_k s_k(y_i, x) \tag{4.8}$$

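Since $l(\theta + \Delta\theta) - l(\theta) \geq A(\theta, \Delta\theta)$, maximizing $A(\theta, \Delta\theta)$ maximizes a lower bound on the increase of the log-likelihood. The iterative procedure for finding the parameter set that maximizes the likelihood is as follows:

Initialize the λ_k
Repeat until convergence:
    Solve ∂A(θ, Δθ)/∂δλ_k = 0 for each parameter λ_k
    Update the parameters λ_k ← λ_k + δλ_k

Setting the partial derivative of $A(\theta, \Delta\theta)$ with respect to $\delta\lambda_k$ to zero yields the following equation: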


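$$E_{\tilde{p}(x,y)}[t_k] \equiv \sum_{x,y} \tilde{p}(x, y) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x) \tag{4.9}$$

$$E_{\tilde{p}(x,y)}[t_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k T(x, y)\big) \tag{4.10}$$

From this we can compute the increments $\delta\lambda_k$ and $\delta\mu_k$. IIS [2][15] and GIS [15] are two special cases of the iterative method; each algorithm chooses the increment vector used to update the parameters in a different way.

4.1.1. The GIS algorithm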

Let C be the largest value of T(x,y) over all x, y in the training data. Define a global feature (a feature that is not attached to any edge or vertex of the graph describing the CRF):

$$g(x, y) \equiv C - \Big( \sum_{i=1}^{n+1} \sum_k t_k(y_{i-1}, y_i, x) + \sum_{i=1}^{n} \sum_k s_k(y_i, x) \Big) \tag{4.11}$$

Adding a feature normally changes the probability distribution of the model. The global feature g(x,y), however, depends entirely on the features already in the model, so it introduces no new constraint on the model distribution; in other words, the model distribution does not change when the global feature is added. Although it leaves the model distribution unchanged, adding g(x,y) does change the value of T(x,y): once the global feature is included, T(x,y) always equals the constant C. If the features only take the values 0 and 1, T(x,y) is simply the number of active features in the model.

Under the assumption T(x,y) = C, Lafferty et al. [15][17] showed that equation (4.10) can be solved analytically. Taking the logarithm of both sides of (4.10), we have:

$$\log E_{\tilde{p}(x,y)}[t_k] = \log \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \exp(\delta\lambda_k C) = \log E_{p(y \mid x, \theta)}[t_k] + \delta\lambda_k C \tag{4.12}$$


From this it follows that:

$$\delta\lambda_k = \frac{1}{C} \log \frac{E_{\tilde{p}(x,y)}[t_k]}{E_{p(y \mid x, \theta)}[t_k]} \tag{4.13}$$
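The GIS update (4.13) can be tried out on a small conditional model. The sketch below is an illustration under simplifying assumptions, not the CRF training procedure itself: each item is a single (word, label) pair rather than a sequence, the eight training pairs and three base features are invented, and C is taken as the maximum of T over every word in the data combined with every label. The slack feature plays the role of g(x, y) in (4.11).

```python
import math

LABELS = ["PER", "O"]
# Hypothetical training pairs (word, label).
DATA = ([("ông", "PER")] * 3 + [("ông", "O")]
        + [("tại", "O")] * 3 + [("tại", "PER")])

BASE_FEATURES = [
    lambda x, y: 1.0 if x == "ông" and y == "PER" else 0.0,
    lambda x, y: 1.0 if y == "O" else 0.0,
    lambda x, y: 1.0 if x == "tại" and y == "O" else 0.0,
]
CONTEXTS = sorted({x for x, _ in DATA})
C = max(sum(f(x, y) for f in BASE_FEATURES) for x in CONTEXTS for y in LABELS)

def feature_vector(x, y):
    base = [f(x, y) for f in BASE_FEATURES]
    return base + [C - sum(base)]  # slack feature: all features now sum to C

def model_probs(x, lam):
    scores = [math.exp(sum(l * f for l, f in zip(lam, feature_vector(x, y))))
              for y in LABELS]
    z = sum(scores)
    return [s / z for s in scores]

def empirical_expectations():
    exp = [0.0] * (len(BASE_FEATURES) + 1)
    for x, y in DATA:
        for k, f in enumerate(feature_vector(x, y)):
            exp[k] += f / len(DATA)
    return exp

def model_expectations(lam):
    exp = [0.0] * len(lam)
    for x, _ in DATA:  # iterating over the data weights each x by p~(x)
        for y, p in zip(LABELS, model_probs(x, lam)):
            for k, f in enumerate(feature_vector(x, y)):
                exp[k] += p * f / len(DATA)
    return exp

def gis(iterations):
    emp = empirical_expectations()
    lam = [0.0] * len(emp)
    for _ in range(iterations):
        mod = model_expectations(lam)
        # Update (4.13): delta_k = (1/C) * log(E_emp[f_k] / E_model[f_k])
        lam = [l + math.log(e / m) / C for l, e, m in zip(lam, emp, mod)]
    return lam

def max_gap(lam):
    return max(abs(e - m) for e, m in
               zip(empirical_expectations(), model_expectations(lam)))

gap_start = max_gap([0.0] * 4)
gap_end = max_gap(gis(1000))
```

Each update is divided by C, so the individual steps shrink as C grows; in this toy problem C is 2 and the gap between empirical and model expectations closes after a modest number of iterations.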

The convergence speed of the GIS algorithm depends on the magnitude of C: the larger C is, the smaller the update steps and the slower the convergence; conversely, the smaller C is, the faster the convergence.

4.1.2. The IIS algorithm

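The idea of the IIS algorithm is to express equation (4.10) as a polynomial in $\exp(\delta\lambda_k)$ and to apply the Newton-Raphson method to the resulting polynomial in order to find $\delta\lambda_k$. To express (4.10) as a polynomial in $\exp(\delta\lambda_k)$, Lafferty et al. introduce the approximation

$$T(x, y) \approx T(x) = \max_{y} T(x, y) \tag{4.14}$$

Substituting this approximation of T(x, y) into equation (4.10), we have:

$$E_{\tilde{p}(x,y)}[t_k] = \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \exp\big(\delta\lambda_k T(x)\big) \tag{4.15}$$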

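Partition the set of pairs (x,y) into $T_{max}$ disjoint subsets, where $T_{max} = \max_x T(x)$. Rewrite (4.15) in the form

$$E_{\tilde{p}(x,y)}[t_k] = \sum_{m=0}^{T_{max}} \; \sum_{\{x,y \,\mid\, T(x) = m\}} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \big[\exp(\delta\lambda_k)\big]^m \tag{4.16}$$

Define $a_{k,m}$ as the expectation of $t_k$ over the set of pairs (x,y) with $T(x) = m$:

$$a_{k,m} = \sum_{x,y} \tilde{p}(x)\, p(y \mid x, \theta) \sum_{i=1}^{n+1} t_k(y_{i-1}, y_i, x)\, \delta(m, T(x)) \tag{4.17}$$

Here $\delta(m, T(x))$ is defined as follows:

$$\delta(m, T(x)) = \begin{cases} 1 & \text{if } T(x) = m \\ 0 & \text{otherwise} \end{cases}$$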


Equation (4.16) can then be rewritten in the form

$$E_{\tilde{p}(x,y)}[t_k] = \sum_{m=0}^{T_{max}} a_{k,m} \big[\exp(\delta\lambda_k)\big]^m \tag{4.18}$$

Solving equation (4.18) with the Newton-Raphson method gives $\delta\lambda_k$.
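Solving (4.18) amounts to finding the positive root of a polynomial in β = exp(δλ_k). The sketch below shows one way this could be done with Newton-Raphson; the coefficients `a` and the target expectation are invented numbers, and the function name `solve_iis_update` is hypothetical.

```python
import math

def solve_iis_update(a, target, beta=1.0, tol=1e-12, max_iter=100):
    """Solve sum_m a[m] * beta**m = target for beta > 0 and return log(beta).

    a[m] plays the role of a_{k,m} in (4.17) and target the role of the
    empirical expectation of t_k; beta = 1 corresponds to delta = 0.
    """
    for _ in range(max_iter):
        value = sum(a_m * beta ** m for m, a_m in enumerate(a)) - target
        slope = sum(m * a_m * beta ** (m - 1) for m, a_m in enumerate(a) if m > 0)
        step = value / slope
        beta -= step  # Newton-Raphson step
        if abs(step) < tol:
            break
    return math.log(beta)

# Hypothetical coefficients a_{k,0}, a_{k,1}, a_{k,2} and empirical expectation.
a = [0.0, 0.2, 0.1]
target = 0.5
delta = solve_iis_update(a, target)
```

For these numbers the equation is 0.1β² + 0.2β = 0.5, whose positive root is β = √6 − 1, so the returned increment is log(√6 − 1). Because the coefficients are non-negative, the polynomial is increasing and convex for β > 0, which is what makes Newton-Raphson well behaved here.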

4.2. Numerical optimization methods
Numerical optimization techniques [15][28] use the gradient vector of the log-likelihood function to find its extremum. Two kinds of techniques are covered in this section: first-order and second-order optimization techniques.

4.2.1. First-order numerical optimization techniques

First-order numerical optimization techniques use the information contained in the gradient vector of the objective function to move the estimates step by step toward the point where the gradient is zero and the function reaches its extremum. Two first-order methods can be used to estimate the parameters of a CRF model, and both are variants of the non-linear conjugate gradient algorithm.

Rather than considering a single search direction while maximizing the function, as hill-climbing methods do, conjugate direction methods generate a set of non-zero vectors, the conjugate set, and maximize the function along each of these directions in turn. Non-linear conjugate gradient methods are a special case of conjugate direction techniques in which each conjugate vector, or search direction, is generated only from the previous search direction rather than from all earlier members of the conjugate set. Specifically, each new search direction $p_j$ is a linear combination of the steepest ascent direction (the gradient of the objective function) and the previous search direction $p_{j-1}$. Each iteration of the conjugate gradient update moves the parameters of the objective function along the current conjugate vector using the update rule:

$$\lambda_k^{(j+1)} = \lambda_k^{(j)} + \alpha^{(j)} p_j \tag{4.19}$$

Here $\alpha^{(j)}$ is the optimal step size.

Two first-order methods are well suited to estimating the parameters of CRF models: the Fletcher-Reeves and Polak-Ribière-Positive algorithms. In essence the two algorithms are the same; they differ only in how they choose the search direction and the optimal step size.


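4.2.2. Second-order numerical optimization techniques

In addition to the gradient vector, second-order numerical optimization techniques improve on first-order techniques by including the curvature, that is the second derivative, of the objective function when computing the parameter updates. The second-order update rule is obtained from the second-order Taylor expansion of $l(\theta + \Delta)$:

$$l(\theta + \Delta) \approx l(\theta) + \Delta^{T} G(\theta) + \frac{1}{2} \Delta^{T} H(\theta)\, \Delta \tag{4.20}$$

$G(\theta)$ and $H(\theta)$ are the gradient vector and the Hessian matrix (the matrix of second partial derivatives) of the log-likelihood function $l(\theta)$. Setting the derivative of the approximation in (4.20) with respect to $\Delta$ to zero gives the increment used to update the model parameters:

$$\Delta^{(k)} = -H^{-1}(\theta^{(k)})\, G(\theta^{(k)}) \tag{4.21}$$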

Here k is the index of the current iteration. Although updating the model parameters in this way converges very quickly, computing the inverse of the Hessian matrix is very expensive in time, especially for large problems such as those in natural language processing. Second-order methods that require the inverse Hessian to be computed directly are therefore unsuitable for estimating the parameters of CRF models.

Quasi-Newton methods are a special case of second-order optimization techniques. They resemble Newton methods, but instead of computing the Hessian matrix directly they build a model of the Hessian at each iteration by measuring the change in the gradient vector. The key element of quasi-Newton methods is that they replace the Hessian matrix in the Taylor expansion (4.20) by $B(\theta)$. The parameter update changes accordingly:

$$\Delta^{(k)} = -B^{-1}(\theta^{(k)})\, G(\theta^{(k)}) \tag{4.22}$$

At each iteration, $B^{-1}(\theta)$ is updated to reflect the changes in the parameters since the previous iteration. Rather than being recomputed from scratch, however, $B^{-1}(\theta)$ only needs to be adjusted at each step to reflect the curvature measured during the previous iteration:

$$B(\theta^{(k)})^{-1} \big( G(\theta^{(k)}) - G(\theta^{(k-1)}) \big) = \Delta^{(k-1)} \tag{4.23}$$


Approximating the Hessian matrix by $B(\theta)$ allows quasi-Newton methods to reach the maximum faster in practice than the traditional Newton method, because each iteration avoids computing and inverting the exact Hessian. The limited-memory quasi-Newton method (L-BFGS) [11] is a refinement of the quasi-Newton method that carries out the computation when the amount of memory is limited. Recent experiments show that the limited-memory quasi-Newton method clearly outperforms the other methods, including GIS, IIS and conjugate gradient, at maximizing the log-likelihood function.
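The limited-memory idea can be sketched with the standard two-loop recursion, which applies the inverse Hessian approximation to the gradient using only the last few parameter and gradient differences. The code below is a minimal, unoptimized sketch and not the routine used by FlexCRFs: the function names, the memory size, the simple backtracking line search and the stand-in concave objective `l` (used in place of a real CRF log-likelihood) are all assumptions made for the example.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lbfgs_maximize(l, grad_l, theta0, memory=5, max_iter=100, tol=1e-8):
    """Maximize a concave function l by running L-BFGS on f = -l."""
    f = lambda th: -l(th)
    grad = lambda th: [-g for g in grad_l(th)]
    theta, g = list(theta0), grad(theta0)
    history = []  # most recent (s, y, rho) triples, oldest first
    for _ in range(max_iter):
        if max(abs(v) for v in g) < tol:
            break
        # Two-loop recursion: q becomes B^{-1} g without ever forming B^{-1}.
        q, alphas = list(g), []
        for s, y, rho in reversed(history):
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y, _ = history[-1]
            scale = dot(s, y) / dot(y, y)
            q = [scale * qi for qi in q]
        for (s, y, rho), a in zip(history, reversed(alphas)):
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        direction = [-qi for qi in q]
        # Backtracking line search (Armijo condition) on f.
        step, f0, slope = 1.0, f(theta), dot(g, direction)
        while (step > 1e-12 and
               f([t + step * d for t, d in zip(theta, direction)])
               > f0 + 1e-4 * step * slope):
            step *= 0.5
        new_theta = [t + step * d for t, d in zip(theta, direction)]
        new_g = grad(new_theta)
        s = [a - b for a, b in zip(new_theta, theta)]
        y = [a - b for a, b in zip(new_g, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            history.append((s, y, 1.0 / dot(s, y)))
            history = history[-memory:]
        theta, g = new_theta, new_g
    return theta

# Stand-in concave objective with its maximum at (1, -2).
def l(theta):
    return -(theta[0] - 1.0) ** 2 - 10.0 * (theta[1] + 2.0) ** 2

def grad_l(theta):
    return [-2.0 * (theta[0] - 1.0), -20.0 * (theta[1] + 2.0)]

theta_hat = lbfgs_maximize(l, grad_l, [0.0, 0.0])
```

Only `memory` pairs of vectors are stored, so the cost per iteration grows linearly with the number of parameters, which is what makes this family of methods attractive for models with very many features.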

4.3. Chapter summary
This chapter addressed the estimation of the parameters of a CRF model by maximizing the likelihood, and introduced several methods for finding the maximum of the log-likelihood function, namely IIS, GIS, conjugate gradient, quasi-Newton and L-BFGS. Among these methods, L-BFGS is regarded as clearly superior to the others.


Chapter 5. An entity type recognition system for Vietnamese
An entity type recognition system for Vietnamese would make an important contribution to Vietnamese language processing and to the understanding of Vietnamese documents. Although entity type recognition is a basic problem in information extraction and natural language processing, it is still a relatively new problem for Vietnamese. Despite the difficulties that come from the specific characteristics of Vietnamese and from the pioneering nature of this line of research, my initial experiments on Vietnamese achieved very encouraging results.

5.1. Experimental environment
5.1.1. Hardware

Celeron III machine, 768 MHz chip, 128 MB RAM

5.1.2. Software

FlexCRFs is a CRF framework for sequence labeling problems such as POS tagging and noun phrase chunking. It is an open-source tool developed by Phan Xuân Hiếu, MSc., and Dr. Nguyễn Lê Minh (JAIST, Japan). My entity type recognition system for Vietnamese is built on top of this framework.

5.1.3. Experimental data

The experimental data consists of 50 business news articles (nearly 1,400 sentences) taken from http://www.vnexpress.net. The raw data is first passed through a preprocessor that removes HTML tags and converts the text from UTF-8 into unaccented Vietnamese in Telex encoding. The data is then labeled by hand for use in the experiments.
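The two preprocessing steps described above, removing HTML tags and rewriting accented Vietnamese in Telex form, might look roughly like the sketch below. It is not the preprocessor used in the thesis: the class and function names are hypothetical, the Telex convention assumed here (vowel shape keys inline, tone key appended at the end of each word) may differ from the one actually used, and punctuation attached to a word, as well as script or style content in the HTML, is not handled.

```python
import unicodedata
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and drops the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

# Combining tone marks and their Telex keys.
TONE_KEYS = {"\u0301": "s", "\u0300": "f", "\u0309": "r",
             "\u0303": "x", "\u0323": "j"}

def telex_word(word):
    out, tone = [], ""
    for ch in unicodedata.normalize("NFD", word):
        if ch in TONE_KEYS:
            tone = TONE_KEYS[ch]              # tone key goes to the end of the word
        elif ch == "\u0302":                  # circumflex: â -> aa, ê -> ee, ô -> oo
            out.append(out[-1].lower())
        elif ch in ("\u0306", "\u031b"):      # breve or horn: ă -> aw, ơ -> ow, ư -> uw
            out.append("w")
        elif ch == "đ":
            out.append("dd")
        elif ch == "Đ":
            out.append("DD")
        else:
            out.append(ch)
    return "".join(out) + tone

def to_telex(text):
    return " ".join(telex_word(w) for w in text.split())

plain = strip_html("<p>tiếng <b>Việt</b></p>")
encoded = to_telex(plain)
```

Working on plain ASCII after this step means the feature templates can rely on simple string comparisons, at the cost of a restoration step when the tagged text is written back out.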

5.2. The entity type recognition system for Vietnamese
The steps for labeling a Vietnamese Web page are illustrated in the figure below.


Input (HTML) → Preprocessing → Feature selection → FlexCRF framework → Restoration + tagging → Output (HTML)

Figure 8: Structure of the entity type recognition system

5.3. Training parameters and experimental evaluation
5.3.1. Training parameters

Some options of the FlexCRF framework for the training process:

Table 2: Parameters used in the training process

Parameter | Value | Meaning
init_lamda_val | 1.0 | Initial value for the model parameters
num_iterations | 55 | Number of training iterations
f_rare_threshold | 1 | Only features whose frequency is greater than this value are integrated into the CRF model
cp_rare_threshold | 1 | Only context predicates whose frequency is greater than this value are integrated into the CRF model
eps_log_likelihood | 0.01 | FlexCRF uses the L-BFGS method to estimate the model parameters. This value gives the stopping condition of the training loop: if |log_likelihood(t)-log_likelihood(t-1)|