7/30/2019 TRCH CHN THNG TIN Y T TING VIT CHO BI TON TM KIM NG NGHA
1/67
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Trần Thị Ngân
VIETNAMESE MEDICAL INFORMATION EXTRACTION FOR SEMANTIC SEARCH
UNDERGRADUATE THESIS, FULL-TIME PROGRAM
Major: Information Technology
HANOI - 2009
Supervisor: Assoc. Prof. Dr. Hà Quang Thụy
Co-supervisor: MSc. Nguyễn Cẩm Tú
ACKNOWLEDGMENTS
First of all, I would like to express my deepest gratitude to Assoc. Prof. Dr. Hà Quang Thụy and MSc. Nguyễn Cẩm Tú, who guided me with great dedication throughout the work on this thesis. I ran into many difficulties during the research, but thanks to their dedicated guidance I gradually overcame them and completed the thesis.
I am grateful to the lecturers of the University of Engineering and Technology, whose teaching gave me valuable knowledge that served as the foundation for this thesis and for my future research and work.
I would like to thank the senior members of the lab for their valuable and helpful advice during the work on this thesis.
I also thank my dear friends, especially my dormitory roommates, who stood by me and encouraged me, helping me complete the thesis and get through many difficulties in life.
Finally, I send my deepest thanks to my family, my father, mother, older sister, and younger sibling, who gave me so much love and timely encouragement to overcome the difficulties in life and complete this thesis.
ABSTRACT
Extracting medical information in order to build a good, complete data set that supports semantic search is an essential need and has received particular attention in recent years. An ontology represents the concepts, properties, and relations of an application domain in a consistent and rich way. Building an information extraction system on top of a Vietnamese medical ontology, so that data in this domain can be searched and mined more effectively, is therefore an essential need.
This thesis addresses the construction of an information extraction system based on an ontology for the Vietnamese medical domain. The thesis analyzes several ontology construction methods and tools in order to select a model, and builds a Vietnamese medical ontology with 21 entity classes, 13 relations, and more than 500 instances of the entity classes. The thesis annotates 96 data files containing more than 1500 instances. The experimental entity recognition system of the thesis is feasible, with an average F1 measure of about 64% over 10 experimental runs.
TABLE OF CONTENTS
Preface
Chapter 1. OVERVIEW OF SEMANTIC SEARCH
1.1. The need for semantic search
1.2. Foundations of semantic search
1.2.1. The Semantic Web
1.2.2. Ontology
1.3. Architecture of a semantic search engine
1.4. Information extraction
Chapter 2. BUILDING A VIETNAMESE MEDICAL ONTOLOGY
2.1. Introduction to ontologies
2.1.1. The concept of an ontology
2.1.2. Components of an ontology
2.1.3. Related work on ontology construction
2.2. Theory of ontology construction
2.2.1. Ontology construction methods
2.2.2. Ontology construction tools
2.2.3. Ontology languages
2.3. Building the Vietnamese medical ontology
Chapter 3. ENTITY RECOGNITION
3.1. Introduction to the entity recognition problem
3.1.1. General introduction to entity recognition
3.1.2. Some research results on entity recognition
3.2. Characteristics of Vietnamese data
3.2.1. Phonetic characteristics
3.2.2. Lexical characteristics
3.2.3. Grammatical characteristics
3.3. Some entity recognition methods
3.3.1. Rule-based and semi-supervised methods
3.3.2. Finite-state machine methods
3.3.3. Gazetteer-based methods
3.4. Vietnamese medical entity recognition
3.4.1. Vietnamese entity recognition
3.4.2. Vietnamese medical entity recognition
Chapter 4. IDENTIFYING SEMANTIC RELATIONS
4.1. Overview of semantic relation identification
4.1.1. Overview of semantic relations
4.1.2. Semantic relation extraction
4.1.3. Some research related to semantic relation identification
4.2. Semantic labeling of sentences
4.3.1. Classification for relation identification and entity recognition
4.3.2. The SVM (Support Vector Machine) algorithm
4.3.3. Multi-class classification with SVM
4.3.4. Applying SVM to semantic relation classification in the Vietnamese medical domain
Chapter 5. EXPERIMENTS
5.1. Experimental environment
5.1.1. Hardware
5.1.2. Software
5.1.3. Test data
5.2. Building the ontology
5.2.1. Entity class hierarchy
5.2.2. Relations between the entity classes
5.3. Data annotation
5.4. Entity recognition
5.4.1. Building the gazetteer set
5.4.2. Evaluating the entity recognition system
5.4.3. Results obtained
5.4.4. Remarks and evaluation
5.5. Semantic labeling of sentences
APPENDIX - SOME ENGLISH-VIETNAMESE TERMS
CONCLUSION
LIST OF TABLES
Table 1: Explanation of the semantic relations
Table 2: Number of instances of each entity class in the gazetteer data set
Table 3: Evaluation measures for an entity recognition system
Table 4: Results of 10 entity recognition experiments
Table 5: Examples of sentences labeled with relations
LIST OF FIGURES
Figure 1: An example of the Semantic Web
Figure 2: Architecture of a semantic search engine
Figure 3: Illustration of an information extraction system
Figure 4: The meaning of an ontology
Figure 5: The hierarchical structure of the BioCaster ontology
Figure 6: Some gazetteer files built for the entity recognition problem
Figure 7: A semantic relation for the entity "car"
Figure 8: Illustration of semantic relation extraction
Figure 9: The place of semantic relation mining in natural language processing
Figure 10: The semantic relations identified in WordNet
Figure 11: Some of the semantic relations that were built
Figure 12: The general task of the relation identification problem
Figure 13: Components of the SR semantic parser [24]
Figure 14: A framework for resolving proper names across documents
Figure 15: Some semantic labels assigned to sentences [30]
Figure 16: Semantic labeling of sentences describing President Bill Clinton [30]
Figure 17: The stages of the classification process
Figure 18: Partition of documents according to the sign of the function f(d)
Figure 19: The learning process of the classifier for sentences containing relations [2]
Figure 20: The classes in the constructed ontology
Figure 21: The layered structure of the constructed ontology
Figure 22: Instances of the entity classes and the relations between instances
Figure 23: A data item annotated with the ontology
Figure 24: The files containing entities in the constructed gazetteer set
Figure 25: Results of 10 entity recognition experiments
Preface
Health care has always been an essential human need, and so searching the Internet for medical information is an essential need as well. The issue deserves even more attention now that people face many infectious diseases; a typical example is the influenza A (H1N1) epidemic, which has been spreading and increasing recently. As online resources continue to appear and grow, exploiting them effectively to deliver useful knowledge to users will contribute to public health education and to improving community health.
Medical resources have exploded, especially online information related to health. The many websites, the redundant information, and the free-form (unstructured or semi-structured) organization of that information make it hard for users to follow and grasp the most up-to-date information. In addition, traditional information retrieval technology either returns too few results, because natural language expression is rich and complex, or too many, in the sense that searchers want the hidden knowledge rather than just the documents that contain the search keywords. Exploiting this rich resource optimally has therefore become an important topic that has attracted many researchers over the last two decades. Many works aim to extract structured information from these resources in order to build knowledge bases for organizing, searching, querying, managing, and analyzing information.
Many tasks have been posed in the field of medical information extraction, such as BioCreative-I (recognizing gene and protein names in text) [32], LLL05 (extracting information about genes) [33], and BioCreative-II (extracting interaction relations between proteins) [49]. These tasks were set up to evaluate medical data mining strategies and focus in particular on two subproblems: entity recognition and relation extraction. Entity recognition requires identifying basic elements in text such as drug names, disease names, symptoms, genes, and proteins. Identifying a relation for a given pattern means recognizing an occurrence of that relation in text, for example the relation between a specific disease and a specific virus.
An ontology is one of the most consistent and richest ways of representing a schema of concepts and relations. Building an ontology for the medical domain in Vietnamese will provide the basis for searching and mining this kind of information effectively.
A survey of the available data shows that there are currently almost no ontologies for the Vietnamese medical domain in Vietnam. Some research groups have, however, concentrated on building ontologies for other specific domains and for various purposes. One example is the VN-KIM ontology [34], developed at Ho Chi Minh City University of Technology, Vietnam National University, Ho Chi Minh City. This ontology comprises 347 entity classes and 114 relations and properties. The VN-KIM ontology includes classes of commonly named entities such as Con_người (person), Tổ_chức (organization), Tỉnh (province), and Thành_phố (city), together with the relations between the entity classes and the properties of each class.
Many methods have been proposed for building an information extraction system and a semantic network, and for applying them to semantic search. This thesis presents an ontology-based representation, one of the methods in fairly wide use today. The thesis presents several methods for building an ontology and for extending an ontology automatically, and introduces the entity recognition problem and relation classification based on several different methods. The thesis also builds a medical data set that makes entity and relation recognition more effective.
Chapter 1
OVERVIEW OF SEMANTIC SEARCH
1.1. The need for semantic search
The explosion of online information on the Internet and the World Wide Web has produced an enormous amount of information and raises the challenge of how to mine all of it effectively for the benefit of people. Search engines such as Google and Yahoo were created to support users in searching for and using information. Although the results returned by these engines keep improving in quality and quantity, they are still simply lists of documents containing the words that appear in the query. The information in these results can be understood only by humans, not by computers, which makes the subsequent processing of the retrieved information difficult. The emergence of a generation of entity search engines (the Cazoodle system at http://www.cazoodle.com/, the Arnetminer system at http://www.arnetminer.org/ ...) marked a new step in the development of search engines. In addition, with the arrival of the Wolfram semantic search engine, built and developed by Wolfram Research, Inc. and proposed by Stephen Wolfram [35], the problem of knowledge search has received even more attention.
The Semantic Web, initiated by the W3C (The World Wide Web Consortium), opened a new step forward for Web technology: information on the Semantic Web is fully structured and carries semantics that computers can understand. This information can be reused without preprocessing steps. When conventional search engines (Google, Yahoo, and so on) are used to search the Semantic Web, they do not take advantage of its strengths, and the returned results show no improvement. In other words, to current search engines the Semantic Web and the ordinary Web are one and the same. A semantic search system is therefore needed, one that searches the Semantic Web or a semantic knowledge network and returns fully structured information that computers can understand, so that the information becomes easier to use and process [6][26][2]. Moreover, building a concrete semantic search system lays the groundwork for building automatic question answering systems in specific domains such as health care and culture, which has practical value in everyday life.
1.2. Foundations of semantic search
1.2.1. The Semantic Web
According to Tim Berners-Lee, the Semantic Web is an extension of the current World Wide Web technology that contains clearly defined information so that people and computers can work together more effectively. The goal of the Semantic Web is to build on common standards and technologies that allow computers to understand more of the information contained in Web pages, in order to better support people in data mining, information aggregation, and the construction of other automated systems. Conventional Web content consists only of text, links, images, and video, whereas the Semantic Web can also include more abstract information resources such as places, people, organizations, and even events in everyday life. In addition, links in the Semantic Web are not merely hyperlinks between resources but include many other kinds of links and relations. These characteristics make the content of the Semantic Web more diverse, more detailed, and more complete. At the same time, the pieces of information in the Semantic Web are tightly connected to one another. This tight connection makes it easier for users to use and search for information, and it is the greatest advantage of the Semantic Web over conventional Web technology [2].
Figure 1. An example of the Semantic Web [6]
Figure 1 shows an example of a Semantic Web page containing information about a person named Yo-Yo Ma. The page is structured as a weighted directed graph in which each vertex describes a kind of resource contained in the page. Each edge of the graph represents a kind of link between resources (also called a property of the resource), and the weight of the link is the name of the link (the name of the property). Concretely, Yo-Yo Ma has a date-of-birth property with the value 10/07/55 and a place of birth of Paris, France; Paris, France has a temperature of 62 F; and so on.
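The graph described for Figure 1 can be sketched as a list of (resource, property, value) triples. This is only an illustrative sketch: the property names `birthDate`, `birthPlace`, and `temperature`, and the helper functions, are assumptions made for this example and are not taken from the figure or from any Semantic Web library.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Each triple is (resource, property name, value); the property names are
# hypothetical labels for the edges described in the Figure 1 discussion.
TRIPLES: List[Tuple[str, str, str]] = [
    ("Yo-Yo Ma", "birthDate", "10/07/55"),
    ("Yo-Yo Ma", "birthPlace", "Paris, France"),
    ("Paris, France", "temperature", "62 F"),
]

def build_graph(triples: List[Tuple[str, str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Index the triples as a directed graph: node -> [(edge label, target)]."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, prop, value in triples:
        graph[subject].append((prop, value))
    return graph

def follow(graph: Dict[str, List[Tuple[str, str]]], start: str, path: List[str]) -> Optional[str]:
    """Follow a chain of property names from a start node; None if the path breaks."""
    node = start
    for prop in path:
        targets = [target for label, target in graph.get(node, []) if label == prop]
        if not targets:
            return None
        node = targets[0]
    return node

graph = build_graph(TRIPLES)
# A value can itself be a resource, so properties can be chained.
birthplace_temperature = follow(graph, "Yo-Yo Ma", ["birthPlace", "temperature"])
```

Because the value of `birthPlace` is itself a node of the graph, a query can hop from the person to the place and on to a property of the place, which is the kind of linkage an ordinary hyperlink does not express.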
Each resource described in the Semantic Web is thus an object. The object has a name, properties, property values (a value may itself be another object), and links to other resources (objects), if any. Building a Semantic Web page requires a complete data set; in other words, a set of objects describing the resources of the Semantic Web must be built. Objects that are related to one another form a broad network of links, called a semantic network.
A semantic network is shared widely, so the objects in it must be described according to a common standard. Ontologies are used to describe the objects and resources of the Semantic Web [2].
1.2.2. Ontology
Put simply, an ontology is a data model that presents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects in that domain [12].
An ontology is one of the most consistent and richest ways of representing a schema of concepts and relations. For this reason it is used to build a semantic network from raw (unstructured or semi-structured) data, which provides the foundation for building an effective semantic search engine. Ontologies are introduced in more detail in Chapter 2 of this thesis.
1.3. Architecture of a semantic search engine
Fundamentally, a semantic search engine has a structure similar to that of a conventional search engine and likewise consists of two main parts [2]:
The user interface (front end), which has two main functions:
- Query interface: lets the user enter a question or query.
- Display of answers and results.
The internal architecture (back end), which is the core of the search engine and consists of three main components:
- Question analysis.
- Searching for results for the query or question.
- The document collection, the search data, and the semantic network.
The architecture of a semantic search engine is shown in Figure 2.
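The division into front end and back end can be sketched in a few lines. This is a toy sketch under strong assumptions, not the architecture of any real system: the triple store, the relation name `caused_by`, and the single supported question form "What causes X?" are all invented for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Back end, component 3: the search data, here a tiny hand-written semantic network.
SEMANTIC_NETWORK: List[Triple] = [
    ("chickenpox", "caused_by", "varicella-zoster virus"),
    ("measles", "caused_by", "measles virus"),
]

def analyze_question(question: str) -> Tuple[str, str]:
    """Back end, component 1: map a question to a (subject, relation) pattern.

    Only the hypothetical form "What causes X?" is supported in this sketch.
    """
    normalized = question.strip().lower().rstrip("?")
    prefix = "what causes "
    if normalized.startswith(prefix):
        return normalized[len(prefix):], "caused_by"
    raise ValueError("unsupported question form")

def search(subject: str, relation: str) -> List[str]:
    """Back end, component 2: look the pattern up in the semantic network."""
    return [obj for subj, rel, obj in SEMANTIC_NETWORK if subj == subject and rel == relation]

def ask(question: str) -> List[str]:
    """Front end: accept a query and return the answers to display."""
    subject, relation = analyze_question(question)
    return search(subject, relation)

answers = ask("What causes chickenpox?")
```

The answer is a structured value taken from the network rather than a list of documents that happen to contain the query words.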
Figure 2. Architecture of a semantic search engine [2]
The structural difference between a semantic search engine and a conventional one lies in the internal architecture, specifically in two components: question analysis and the search data set.
Question analysis is discussed in detail in [2]. The search data set is the Semantic Web and the semantic network, built on an ontology and an information extraction system. This thesis concentrates on building an ontology and on extending it automatically by means of information extraction, specifically entity recognition. The thesis also covers semantic relation recognition and the classification of sentences that contain relations, with the aim stated above: to build a complete search data set for a future semantic search engine.
[Figure 2 diagram labels: 1. Enter query; 2. Question classification; 3. Question reformulation; 4. Information extraction; 5. Search; 6. Return results; Search Services; semantic network; Semantic Web/Ontology]
1.4. Information extraction
Information extraction is an important area of text mining that extracts structured information from unstructured text. In other words, an information extraction system pulls predefined information about entities and the relations between them out of a natural language text and fills it into a structured record or a predefined template. Information can be extracted from text at several levels, such as identifying entities (Element Extraction), identifying relations between entities (Relation Extraction), identifying and tracking events and scenarios (Event and Scenario Extraction and Tracking), and resolving co-reference (Co-reference Resolution). The techniques used in information extraction include segmentation, classification, association, and clustering [1].
Figure 3. Illustration of an information extraction system
To build an information extraction system, we first need an entity recognition system; relation classification comes only afterwards. Recognizing entity types is the simplest of the information extraction problems, yet it is the most basic step before the more complex problems in this field can be tackled. Besides its use in information extraction systems, it can also be applied in information retrieval, machine translation, and question answering.
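Entity recognition in its simplest form can be sketched as a dictionary (gazetteer) lookup. The sketch below is an assumption-laden illustration, not the system built in the thesis: the gazetteer entries are a handful of terms in the spirit of the Figure 3 example, and the sample sentence and function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteer: entity class -> known surface forms (lowercase).
GAZETTEER: Dict[str, List[str]] = {
    "Disease": ["phổi cấp tính"],
    "Symptom": ["mệt mỏi", "lú lẫn", "ho khan", "khó thở"],
    "Drug": ["corticoid", "thuốc giãn phế quản"],
}

def tag_entities(text: str) -> List[Tuple[str, str, int]]:
    """Return (surface form, entity class, start offset) for every gazetteer hit."""
    lowered = text.lower()
    hits: List[Tuple[str, str, int]] = []
    for label, terms in GAZETTEER.items():
        for term in terms:
            start = lowered.find(term)
            while start != -1:
                hits.append((text[start:start + len(term)], label, start))
                start = lowered.find(term, start + 1)
    # Report the hits in reading order.
    return sorted(hits, key=lambda hit: hit[2])

# Hypothetical sample sentence written for this sketch.
sample = "Bệnh phổi cấp tính thường gây ho khan và khó thở; có thể phải dùng corticoid."
entities = tag_entities(sample)
labels = [label for _, label, _ in entities]
```

Plain substring lookup ignores word boundaries and ambiguity, which is one reason the later chapters consider rule-based and machine learning methods alongside gazetteers.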
[Figure 3 content. Input passage: "Acute lung disease is one of the leading causes of death among the elderly, more dangerous even than lung disease caused by influenza. Common symptoms are fatigue, sometimes confusion, irregular fever, a heavy and laborious dry cough, and sometimes difficulty breathing. Sedatives and cough suppressants must be used with caution; if wheezing is present and is determined to be due to bronchial asthma, corticoids and bronchodilators must be used." Output of the IE step: Disease: acute lung disease; Symptoms: fatigue, confusion, irregular fever, dry cough, difficulty breathing; Drugs: sedatives, cough suppressants, corticoid, bronchodilators.]
Many tasks have been posed in the field of medical information extraction, such as BioCreative-I (recognizing gene and protein names in text) [32], LLL05 (extracting information about genes) [33], and BioCreative-II (extracting interaction relations between proteins) [49]. These tasks were set up to evaluate medical data mining strategies and focus in particular on two subproblems: entity recognition and relation extraction. Entity recognition requires identifying basic elements in text such as drug names, disease names, symptoms, genes, and proteins. Identifying a relation for a given pattern means recognizing an occurrence of that relation in text, for example the relation between a specific disease and a specific virus. An ontology is one of the most consistent and richest ways of representing a schema of concepts and relations. Building an ontology for the medical domain in Vietnamese will provide the basis for searching and mining this kind of information effectively. Once the ontology has been built, the next task, which is also very important, is to extend it automatically. Having an information extraction system (including entity recognition, relation extraction, and so on) is the prerequisite for extending the ontology automatically.
Chapter 2
BUILDING A VIETNAMESE MEDICAL ONTOLOGY
2.1. Introduction to ontologies
2.1.1. The concept of an ontology
In recent years the term "ontology" has come to be used not only in artificial intelligence laboratories but also in many other areas of life. From the point of view of artificial intelligence, an ontology is a description of concepts and of the relations between those concepts, intended to express a view of the world. In other areas of science, an ontology comprises a basic vocabulary or a resource for a specific domain, which lets researchers store, manage, and exchange knowledge in the most convenient way [2].
Many definitions of ontology exist today, and many of them conflict with one another. This thesis presents only one general and fairly widely used definition, given by Kincho H. Law: "An ontology is a representation of a set of concepts (objects) in a specific domain and of the relationships between these concepts." An ontology is thus the combination of a shared vocabulary and descriptions of the meaning of its terms in a form that computers can understand.
[Figure 4 diagram: Ontology = a shared vocabulary + a formal characterization of its meaning]
Figure 4. The meaning of an ontology
Figure 4 illustrates the meaning of an ontology, in which the shared vocabulary (Vocabulary) consists of the instances of the classes and relations. For example, there may be a Vocabulary (...), Categories (Cat, White, Leg, Fish, Animal, ...), Relations (Is-a, Part-of, hasMother, ...), a Characterization (...), and relation instances such as "A cat is an animal" and "A cat has four legs".
Figure 5. The hierarchical structure of the BioCaster ontology [11]
2.1.2. Components of an ontology
The main components of an ontology are classes (Class), properties (Property), and individuals (Individual).
A class is a set of individuals; the individuals are described logically in order to define the objects of the class. Classes are organized in a parent-child hierarchy that acts as a classification of the objects. An individual is an instance of a class; it makes the class more concrete and can be understood as some object in the real world (England, Manchester United, measles, chickenpox, and so on).
A property expresses a binary relation between individuals (a relation between two individuals), that is, it links two individuals together. For example, the property "do_virus" (caused by virus) links the two entities "bệnh" (disease) and "virus".
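The three components can be sketched with plain data classes. This is a minimal sketch, not an OWL implementation: the class names `Entity`, `Disease`, and `Virus`, the individual `measles virus`, and the helper types are hypothetical; only the property name `do_virus` and the measles example come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class OntClass:
    """A class in the parent-child hierarchy."""
    name: str
    parent: Optional["OntClass"] = None

    def ancestors(self) -> List[str]:
        """Names of all superclasses, nearest first."""
        names: List[str] = []
        current = self.parent
        while current is not None:
            names.append(current.name)
            current = current.parent
        return names

@dataclass
class Individual:
    """An instance of a class."""
    name: str
    cls: OntClass

# Hypothetical hierarchy: Entity is an assumed root class.
entity = OntClass("Entity")
disease = OntClass("Disease", parent=entity)
virus = OntClass("Virus", parent=entity)

measles = Individual("measles", disease)
measles_virus = Individual("measles virus", virus)

# A property is a binary relation: (subject individual, property name, object individual).
properties: List[Tuple[Individual, str, Individual]] = [
    (measles, "do_virus", measles_virus),
]

def objects_of(subject: Individual, prop: str) -> List[str]:
    """Names of all individuals linked to `subject` by the property `prop`."""
    return [obj.name for subj, name, obj in properties if subj is subject and name == prop]

caused_by = objects_of(measles, "do_virus")
```

Keeping classes, individuals, and properties as separate structures mirrors the distinction the section draws: the hierarchy classifies, the individuals instantiate, and the properties link.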
Properties come in four kinds:
(1) Functional: an individual is related to at most one other individual; for example, the property "has flavor" for individuals of the class "thức_ăn" (food).
(2) Inverse Functional: the inverse of a functional property; for example, the property "is the flavor of".
(3) Transitive: if individual a is related to individual b, and individual b is related to individual c, then individual a is related to individual c.
(4) Symmetric: if individual a is related to individual b, then individual b is related to individual a.
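The four kinds can be made concrete by treating a property as a set of (a, b) pairs. The following is a small sketch under that assumption; the function names are chosen for this example and do not come from any ontology toolkit.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]

def is_functional(pairs: Set[Pair]) -> bool:
    """True if every subject is related to at most one object."""
    subjects = [a for a, _ in pairs]
    return len(subjects) == len(set(subjects))

def is_inverse_functional(pairs: Set[Pair]) -> bool:
    """True if every object is related to by at most one subject."""
    objects = [b for _, b in pairs]
    return len(objects) == len(set(objects))

def symmetric_closure(pairs: Set[Pair]) -> Set[Pair]:
    """Add (b, a) for every (a, b)."""
    return pairs | {(b, a) for a, b in pairs}

def transitive_closure(pairs: Set[Pair]) -> Set[Pair]:
    """Repeatedly add (a, c) whenever (a, b) and (b, c) are present."""
    closure = set(pairs)
    while True:
        inferred = {(a, d) for a, b in closure for c, d in closure if b == c}
        if inferred <= closure:
            return closure
        closure |= inferred

chain = {("a", "b"), ("b", "c")}
inferred_chain = transitive_closure(chain)
mirrored = symmetric_closure({("a", "b")})
```

A reasoner that knows a property is transitive or symmetric can derive the extra pairs instead of requiring them to be asserted one by one.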
Properties are expressed in three forms:
(1) Object Property: links one individual to another individual.
(2) Datatype Property: links an individual to an XML Schema datatype value or an RDF literal.
(3) Annotation Property: adds metadata about a class, an individual, or a property of the two kinds above.
Working with Web ontologies requires the Web Ontology Language (OWL). OWL provides the annotation property type, which is used to add information (metadata, that is, data about data) to classes, individuals, and object or datatype properties.
2.1.3. Related work on ontology construction
Ontologies are now widely used in fields related to semantics such as artificial intelligence (AI), the Semantic Web, and software engineering. Because of these applications, many projects, in Vietnam and around the world, have concentrated on building ontologies for different data domains and for a variety of purposes. In the medical domain, a large number of medical and biological ontologies have been published by The National Center for Biomedical Ontology [52]. This project has released many ontologies in medicine and biology, for example ontologies of cell types, Gene, FMA, and Human disease; the list of published ontologies is given in [41].
Another example is the Disease Ontology [42], a medical vocabulary developed at the Bioinformatics Core Facility in collaboration with the NuGene Project at the Center for Genetic Medicine. This ontology was designed to map diseases and associated conditions to particular medical codes such as ICD9CM, SNOMED, and others. The Disease Ontology is also used to associate model organism phenotypes with human diseases and for mining medical data. It is implemented as a directed acyclic graph and uses UMLS (Unified Medical Language System) as its vocabulary for accessing other medical ontologies such as ICD9CM.
An English-language ontology that has been discussed a great deal in the medical field recently is GENIA [43]. The main subject this ontology targets is the reactions of cells in the human brain. The ontology concentrates mainly on medical topics and is also used in natural language processing tasks: information retrieval (IR), information extraction, and text classification and summarization. The following figure shows the hierarchical structure of the GENIA ontology.
Many medical ontologies have been built around the world. In Vietnam, however, although semantic search is an active research topic, there are almost no medical ontologies, so users searching for web pages about drugs and diseases do not yet get complete and effective results. One ontology does cover medical terms in Vietnamese: the BioCaster ontology [44]. It was developed within the BioCaster project at the National Institute of Informatics, Japan, in collaboration with universities in Japan, Thailand, Vietnam, and other countries. It is a multilingual ontology covering Japanese, English, Thai, Vietnamese, and other languages.
The BioCaster ontology [11] contains terms in many languages, including 371 Vietnamese terms related to diseases, viruses, and symptoms in Vietnam. Although the ontology handles extraction from Vietnamese text, it is used to produce reports on health in Vietnam in English. As a result, the terms, entities, diseases, and viruses are written in Vietnamese, while the relations are described in English. For example, the term Vietnamese_103 has the label "vi rút gây bệnh thủy đậu" (the virus that causes chickenpox), with hasLanguage: vi (Vietnamese), hasRootTerm: VIRUS_124, and so on.
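The structure of such a term entry can be sketched as a plain record. Only the first record below reflects the example in the text; the second record, the field name `id`, and the helper function are hypothetical, added to show how a language-independent root term can group labels from several languages. This is not the BioCaster file format.

```python
from typing import Dict, List, Optional

Term = Dict[str, str]

TERMS: List[Term] = [
    # From the example in the text.
    {
        "id": "Vietnamese_103",
        "label": "vi rút gây bệnh thủy đậu",
        "hasLanguage": "vi",
        "hasRootTerm": "VIRUS_124",
    },
    # Hypothetical English counterpart, invented for this sketch.
    {
        "id": "English_EXAMPLE",
        "label": "chickenpox virus",
        "hasLanguage": "en",
        "hasRootTerm": "VIRUS_124",
    },
]

def labels_for_root(terms: List[Term], root: str, language: Optional[str] = None) -> List[str]:
    """Labels of all terms attached to a root term, optionally filtered by language."""
    return [
        term["label"]
        for term in terms
        if term["hasRootTerm"] == root
        and (language is None or term["hasLanguage"] == language)
    ]

vietnamese_labels = labels_for_root(TERMS, "VIRUS_124", language="vi")
all_labels = labels_for_root(TERMS, "VIRUS_124")
```

The relation names stay in English while the labels vary by language, which matches the mixed-language design described above.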
2.2. Theory of ontology construction
2.2.1. Ontology construction methods
Research on the ontology construction process is receiving more and more attention. Many research groups have proposed different methods for building ontologies.
The Uschold & King method was derived from the development of the Enterprise Ontology. It focuses mainly on helping developers determine, from the purpose of the ontology, which directions its development can take, and then on evaluating and documenting the ontology. During construction, users can integrate existing ontologies into the ontology being built. Three approaches are proposed for defining the main concepts of the ontology: top-down, bottom-up, and middle-out. This methodology is application-independent: the purpose for which the ontology is built and the construction process do not depend on each other. The same method can be used for any application [17].
The next methodology was developed by Gruninger and Fox [16] through the Toronto Virtual Enterprise (TOVE) ontology project. It originates from ideas about developing knowledge-based systems using first-order logic. In this method the most prominent concepts are defined first and are then specialized and generalized in the appropriate directions. The method thus starts from a number of salient concepts, moves down to lower-level concepts, and generalizes to higher levels. It uses the middle-out approach to define concepts and is partly dependent on the future application of the ontology: before building the ontology, users need to decide what it will be used for and which application it will be integrated into.
METHONTOLOGY is an ontology construction method developed by the artificial intelligence laboratory of the Polytechnic University of Madrid. It allows users to build a new ontology from a new design or to reuse existing ontologies. The METHONTOLOGY framework helps users build ontologies at the knowledge level and includes a definition of the ontology development process and a number of techniques that support this process (for example management and scheduling, quality control, data and knowledge acquisition, and configuration management). This methodology uses the middle-out strategy and is application-independent.
2.2.2. Ontology construction tools
Ontology construction and development toolkits comprise support tools and environments that help users build a new ontology from a new design or reuse existing ontologies. Earlier development environments include Ontosaurus, Ontolingua, and WebOnto. Newer toolkits in wide use recently include OntoEdit, OilEd, WebODE, Chimaera, DAG-Edit, and Protégé.
Ontolingua server [45] is an ontology construction toolkit developed in the 1990s at the Knowledge Systems Laboratory (KSL) of Stanford University (USA). Its main modules are an ontology editor and other modules such as Webster and an OKBC (Open Knowledge Base Connectivity) server.
Ontosaurus [46] was developed in the same period by the Information Sciences Institute (ISI) of the University of Southern California (USA). OntoSaurus consists of two main modules: an ontology server (which uses Loom) and a web browser for Loom ontologies. The toolkit also supports KIF, KRSS, and C++, and OntoSaurus ontologies can be accessed through the OKBC protocol of the Ontolingua server.
WebOnto is an ontology editor for OCML (Operational Conceptual Modelling Language) ontologies, developed by the Knowledge Media Institute (KMi) at the Open University. It is implemented in Java with a web server and lets users browse and edit knowledge models over the Internet. Its main strength is that it allows several people to collaborate on changing and improving an ontology [26].
The tools above (Ontolingua server, Ontosaurus, and WebOnto) were built purely to support browsing and editing ontologies written in their own languages (Ontolingua, LOOM, and OCML). These editors no longer meet users' needs. The new generation of ontology construction tools offers many advantages and features beyond them, for example extensibility and a component-based architecture that lets users add new functionality to the development environment easily.
WebODE [47] is an extensible toolkit developed by the Ontology group of the Technical University of Madrid (UPM) and is regarded as the successor to ODE (Ontology Design Environment). WebODE runs as a web server with a web interface. The core of the environment is an ontology service that all other services and applications can use. The ontology editor also provides constraint checking, axiom rule creation and parsing with the WebODE Axiom Builder (WAB), documentation in HTML, and ontology translation to and from several formats [XML, RDF(S), OIL, DAML+OIL, CARIN, FLogic, Java, and Jess].
OilEd [48] is an ontology editor that lets users build ontologies in OIL and DAML+OIL. It was built by the University of Manchester, the University of Amsterdam, and Interprice GmbH.
Protégé 2000 [51], developed at Stanford University, is one of the most widely used toolkits today. It was developed with two goals: compatibility with other systems, and ease of use together with support for knowledge extraction tools. The main part of the environment is an ontology editor. In addition, Protégé includes many plugins that support functions such as managing multiple ontologies, inference services, and ontology language import and export.
2.2.3. Ontology languages
Typical ontology languages at present include LOOM, LISP, Ontolingua, XML, SHOE, OIL, DAML+OIL, and OWL. Ontology languages fall into three groups: vocabularies defined in natural language, object-based knowledge representation languages such as UML, and languages based on first-order predicate logic such as Description Logics. An ontology language needs to be compatible with other tools, natural and easy to learn, and compatible with current web standards such as XML, XML Schema, RDF, and UML. Some web-based languages are described below.
Extensible Markup Language (XML) is an open W3C standard for representing data that is more flexible and more powerful than HTML. RDF (Resource Description Framework) was developed as a framework for describing and exchanging metadata [12].
SHOE (Simple HTML Ontology Extensions) was created in 1996 at the University of Maryland as an extension of HTML that incorporates semantic knowledge into existing web documents by annotating HTML pages [27].
OIL (Ontology Inference Layer) is an extension of RDF developed by the On-To-Knowledge project as a language for describing and exchanging ontologies. It combines a frame-based language with the formal semantics and inference services of description logics. The language is divided into three levels: the object level (concrete instances), the first meta level (ontological definitions), and the second meta level (relations) [8].
DAML+OIL was developed under a DARPA project in 2000. Both OIL and DAML+OIL can describe concepts, taxonomies, binary relations, functions, and instances [9].
OWL is a widely used ontology language that is optimized for data exchange and knowledge sharing. It is used when the information contained in documents needs to be processed by applications. OWL can represent the meaning of the terms in a vocabulary and the relationships between those terms. OWL comprises OWL Lite, OWL DL [RDF], and OWL Full.
2.3. Building a Vietnamese medical ontology
Designing and building an ontology involves the following steps:
- Define the classes in the ontology.
- Arrange the classes in a taxonomic hierarchy.
- Define the properties (slots) and describe the values allowed for them.
- Fill in the slot values for the instances.
The knowledge base is then created by defining instances of these classes together with their values.
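The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the structure of the ontology built in this thesis: the class names (Disease, InfectiousDisease, Virus), the slot causedBy, and the helper functions are all hypothetical.

```python
# Hypothetical class hierarchy: child class -> parent class (None = root).
PARENT = {"Disease": None, "InfectiousDisease": "Disease", "Virus": None}

# Hypothetical slots: class -> {slot name: class allowed as the slot value}.
SLOTS = {"Disease": {"causedBy": "Virus"}}


def ancestors(cls):
    """Return cls followed by its superclasses, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def add_instance(kb, name, cls, **values):
    """Add an instance to the knowledge base after checking its slot values."""
    allowed = {}
    for c in ancestors(cls):
        allowed.update(SLOTS.get(c, {}))  # slots are inherited from superclasses
    for slot, target in values.items():
        if slot not in allowed:
            raise ValueError(f"{cls} has no slot {slot!r}")
        if allowed[slot] not in ancestors(kb[target]["class"]):
            raise ValueError(f"{target!r} is not a valid value for {slot!r}")
    kb[name] = {"class": cls, **values}
    return kb[name]


# Fill the knowledge base with two instances.
kb = {}
add_instance(kb, "varicella_zoster_virus", "Virus")
add_instance(kb, "chickenpox", "InfectiousDisease", causedBy="varicella_zoster_virus")
```

A tool such as Protégé performs the same kind of check (is this value allowed for this slot?) when instances are entered, which is why the classes and slots are defined before the instances.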
No single method can be called the correct method for building every ontology [18]. The choice of a suitable method depends on the purpose and nature of each ontology. After surveying medical data and several ontology development methods, we chose the Protégé OWL environment to build an experimental medical ontology in Vietnamese.
After collecting and surveying the data, we listed the important terms so that definitions could be given to users; a direction for further work is to link these terms automatically to the definitions available on Wikipedia. From these terms, the next step is to define their properties. Building an ontology is an iterative process that begins with defining the concepts in the class hierarchy and describing the properties of those concepts.
Chapter 3
NAMED ENTITY RECOGNITION
3.1. Introduction to the named entity recognition problem
3.1.1. General introduction to named entity recognition
Named entity recognition can be understood simply as classifying the words in a text into predefined entity classes such as person (PER), organization (ORG), location (LOC), disease (BENH), symptom (TCHUNG), and drug (THUOC). It provides a shallow analysis in which the entities answer important questions, which makes it useful in question answering systems.
Many methods have been applied to named entity recognition, from hand-crafted methods to machine learning methods such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF), and Support Vector Machines (SVM).
A typical example of the hand-crafted approach is Proteus, the entity type recognition system of New York University that took part in MUC-6. The system is written in Lisp and relies on a large number of rules. However, most of the rules have many exceptions, some of which appear only once the system is in use and are hard to handle exhaustively. Some examples of the rules used by Proteus and their exceptions are given below [1]:
Rule: Title Capitalized_Word => Title Person_Name
Correct cases: Mr. Johns, Gen. Schwarzkopf
Exception: Mrs. Fields Cookies (a company).
Rule: Month_name number_less_than_32 => Date
Correct cases: February 28, July 15
Exception: Long March 3 (the name of a Chinese rocket).
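The two rules above can be approximated with regular expressions to show why such rules misfire. This is a simplified sketch written for illustration, not the Proteus implementation (which is in Lisp); the list of titles, the pattern names, and the function apply_rules are assumptions.

```python
import re

# Hypothetical, simplified versions of the two rules quoted above.
# Rule 1: Title Capitalized_Word => person name
TITLE_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Gen)\.\s+[A-Z][a-z]+")

# Rule 2: Month_name number_less_than_32 => date
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RULE = re.compile(rf"\b(?:{MONTHS})\s+(?:[12]?\d|3[01])\b")


def apply_rules(text):
    """Return (label, matched_text) pairs: persons first, then dates."""
    found = [("PERSON", m.group()) for m in TITLE_RULE.finditer(text)]
    found += [("DATE", m.group()) for m in DATE_RULE.finditer(text)]
    return found


# The rules fire correctly on the regular cases ...
print(apply_rules("Mr. Johns met Gen. Schwarzkopf on February 28."))
# ... and also on the exceptions, producing false positives.
print(apply_rules("Mrs. Fields Cookies"))
print(apply_rules("Long March 3"))
```

Handling "Mrs. Fields Cookies" or "Long March 3" correctly requires further rules or exception lists, which is the maintenance burden described above.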
Hand-crafted methods cost time and effort and still do not deliver the desired results, so machine learning methods are now receiving more research attention. Most of these methods have their own strengths as well as limitations that stem from the characteristics of each model. Typical examples are Hidden Markov Models (HMM) and their improved variants such as MEMM and CRF; in these models each state corresponds to one of the entity labels and the observations are the words of the sentence under consideration. Support Vector Machines (SVM) are another machine learning method that gives very promising results.
3.1.2. Some research results on named entity recognition
Named entity recognition has been studied worldwide for a long time and has achieved impressive results. Many methods, from hand-crafted ones to machine learning ones, have been applied to the problem. In his 2007 work [5], David Nadeau reviewed a number of representative earlier studies related to named entity recognition.
His review is summarized below.
A typical example of the hand-crafted approach is Proteus, the New York University system that took part in MUC-6, which is written in Lisp and supported by a large number of rules. In 1998, Radev studied the recognition of descriptions of a given entity; for example, "Bill Clinton" may be described as "the President of the U.S.", "the democratic presidential candidate", or "an Arkansas native". The systems of Fung (1995) and Huang (2005) address the translation of entities from one language to another (for example, the Vietnamese translation of the entity "College of Technology" is "Trường Đại học Công nghệ"); this system was reported to make fewer than 10% translation errors. In 2001, Charniak and colleagues published work on recognizing the structure of the parts of a person name, for example splitting "Doctor Paul R. Smith" into title, first name, middle name, and surname. This is an important preprocessing step in an entity recognizer because it makes it possible to determine that "John F. Kennedy" and "President Kennedy" are the same person. Also in 2001, the record linkage system of Cohen and Richman was built to find all forms of the same entity across an entire database. In 2002, Dimitrov and colleagues addressed pronoun resolution; for example, in the sentence "Rabi finished reading the book and he replaced it in the library", the pronoun "he" stands for "Rabi". This work has many practical applications, for instance in automatic question answering. In 2003, Mann and Yarowsky built a system for disambiguating person names; the technique is used to build biographies, which underlie search engines such as Zoominfo.com and Spock.com. In 2005, Nadeau and Turney published work on recognizing the full form of abbreviations in a given text, for example "IBM" as the abbreviation of "International Business Machines". A 2006 study by Agbago aimed to build a system that restores the correct case of words, including ensuring that the first character of a sentence and of an entity is capitalized, which is very useful in machine translation.
In the same work [5], David Nadeau used the ENAMEX entity label set following the MUC-7 (Message Understanding Conference 7) template and carried out training and testing on the Medstract Gold Standard Evaluation Corpus (built by Pustejovsky in 2001). The author used the Weka machine learning toolkit to test several supervised learning algorithms and concluded that system quality depends heavily on the algorithm used and that his semi-supervised method gave the most promising results.
To date, several major international conferences and programs have discussed named entity recognition and evaluated named entity recognition systems. Notable examples are MUC (Message Understanding Conference, 1987-1997), MET (Multilingual Entity Task Conference, 1998), ACE (Automatic Content Extraction Program, 2000), HAREM (Evaluation contest for named entity recognizers in Portuguese, 2004-2006), and IREX (Information Retrieval and Extraction Exercise, 1998-1999).
3.2. Characteristics of Vietnamese data
Vietnamese is an isolating language: each syllable ("tiếng") is pronounced separately and written as one written unit. This property shows clearly at every level: phonetics, vocabulary, and grammar. The characteristics of Vietnamese presented below follow the authors of the Vietnam Center for Linguistics. Studying them gives me an overview of the distinctive features of Vietnamese data, and a clearer understanding of the data helps make ontology construction and information extraction more effective.
3.2.1. Phonetic characteristics
Vietnamese has a special unit called the "tiếng", and phonetically each tiếng is a syllable. The Vietnamese phoneme system is rich and balanced, which gives Vietnamese phonetics great potential for expressing meaningful units. Many ideophones and onomatopoeic words are highly expressive. When forming sentences and utterances, Vietnamese speakers pay close attention to phonetic harmony and to the rhythm of the sentence.
3.2.2. Lexical characteristics
In general, each tiếng is a meaningful element. The tiếng is the basic unit of the system of meaningful units in Vietnamese. From tiếng, other lexical units are created to name things and phenomena, mainly through compounding and reduplication.
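Lexical units formed by compounding are always governed by rules of semantic combination, for example: đất nước (country), máy bay (airplane), nhà lầu xe hơi (mansion and car), nhà tan cửa nát (a home in ruins). Compounding is currently the main way of producing new lexical units. Through it, Vietnamese makes full use of native word-forming elements and of elements borrowed from other languages to create new words and phrases, for example tiếp thị (marketing), karaoke, thư điện tử (e-mail), thư thoại (voice mail), phiên bản (version), xa lộ thông tin (information highway), siêu liên kết văn bản (hypertext link), and truy cập ngẫu nhiên (random access).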
Lexical units formed by reduplication are governed mainly by rules of phonetic combination, for example chôm chỉa, chỏng chơ, đỏng đa đỏng đảnh, thơ thẩn, and lúng lá lúng liếng.
The core vocabulary of Vietnamese consists mostly of monosyllabic words (one syllable, one tiếng). Flexibility in use and the ease of creating new words have favored the growth of the vocabulary, which is both large and varied in use. The same thing, phenomenon, activity, or property can be expressed by many different words. The potential of the Vietnamese vocabulary is exploited to the full in the functional styles of the language, especially in literary style. Today, with the rapid progress of science and technology, and of information technology in particular, this potential is being exploited even more strongly.
3.2.3. Grammatical characteristics
Vietnamese words do not inflect. This property governs the other grammatical characteristics. When words combine into structures such as phrases and sentences, Vietnamese relies heavily on word order and function words.
Arranging words in a particular order is the main way of expressing syntactic relations. In Vietnamese, "Anh ta lại đến" (he has come again) differs from "Lại đến anh ta" (it is his turn again). When words of the same class combine in a head-modifier relation, the word that comes first is the head and the word that follows is the modifier. Because of word order, "củ cải" differs from "cải củ", and "tình cảm" (feeling) differs from "cảm tình" (sympathy). Subject before predicate is the usual order of the Vietnamese sentence.
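Function words are the other main grammatical device of Vietnamese. Because of function words, the combination "anh của em" (your brother) differs from "anh và em" (you and I) and "anh vì em" (I, for your sake). Together with word order, function words allow Vietnamese to form several sentences that convey the same basic information but differ in expressive nuance. Compare the following sentences:
- Ông ấy không hút thuốc. (He does not smoke.)
- Thuốc, ông ấy không hút. (Tobacco, he does not smoke.)
- Thuốc, ông ấy cũng không hút. (Even tobacco, he does not smoke.)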
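Besides word order and function words, Vietnamese also uses intonation. Intonation helps express the syntactic relations between the elements of a sentence and thereby conveys the intended message. In writing, intonation is usually indicated by punctuation. The difference in meaning can be seen by comparing the following two sentences:
- Đêm hôm qua, cầu gãy. (Last night, the bridge collapsed.)
- Đêm hôm, qua cầu gãy. (Late at night, crossing a broken bridge.)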
The salient characteristics outlined above give some idea of the identity and potential of Vietnamese, and also of the difficulties encountered in named entity recognition and information extraction for Vietnamese.
3.3. Some named entity recognition methods
Many methods have been proposed for named entity recognition. The main stages of the task can nevertheless be summarized as follows:
- Preprocessing: remove HTML, split sentences, segment words.
- Feature selection: choose the tags and the context patterns (features such as uppercase, lowercase, ...).
- Training and self-learning: use HMM, CRF, MEMM, SVM, ...
- Labeling and recovery.
The choice of tags differs with the domain of the named entity recognition task. The seven most general basic label types, usually chosen first (according to Ralph & Beth, [5]), are ORG (organization), LOC (location), PER (person), DATE, TIME, CUR (currency), and PCT (percentage). The label set can be changed and extended for each project. The BioCaster project [11] defines 22 labels for the medical domain.
Each assigned label consists of three parts:
- Boundary category: the position of the current word within an entity.
- Entity category: the type of the entity.
- Feature set: the context information (context patterns).
There are several ways to represent the boundary part. The most frequently mentioned and most widely used one builds each label from a prefix: B_ (beginning of an entity) or I_ (inside an entity), with the label O for words that are not part of any entity. For example, "bệnh viêm não Nhật Bản" (Japanese encephalitis) can be labeled B_DIS I_DIS I_DIS I_DIS.
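The B_/I_/O scheme just described can be sketched as a small function that turns entity spans into per-token labels. The function name bio_tags, the span format, and the tokenization of the example sentence are assumptions made for illustration.

```python
def bio_tags(tokens, entities):
    """Label each token with B_<type>, I_<type>, or O.

    entities is a list of (start, end, type) token spans, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B_{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I_{etype}"          # remaining tokens of the entity
    return tags


# Assumed tokenization: one token per syllable.
tokens = ["bị", "viêm", "não", "Nhật", "Bản", "nặng"]
print(list(zip(tokens, bio_tags(tokens, [(1, 5, "DIS")]))))
```

Keeping B_ distinct from I_ matters when two entities of the same type are adjacent: without it, their boundary could not be recovered from the labels.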
Selecting context patterns is an important problem that determines the accuracy of named entity recognition. The context pattern at any observed position provides the context information. Any complete named entity recognition system has to build an accurate set of context patterns that describes its particular domain. For general named entity recognition these include uppercase, lowercase, the % character, digits, periods, commas, and so on. The analogous problem in the medical domain is choosing context patterns for recognizing proteins, genes, drugs, and cells.
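As a rough illustration of the surface cues mentioned above (capitalization, the % character, digits, periods, commas), the following sketch maps a token to a set of feature names. The feature names and the function context_features are illustrative assumptions, not the feature set used in this thesis.

```python
def context_features(token):
    """Return a set of simple orthographic feature names for one token."""
    feats = set()
    if token == ",":
        feats.add("comma")
    if token == ".":
        feats.add("dot")
    if token.isdigit():
        feats.add("AllDigits")
        if len(token) == 1:
            feats.add("oneDigit")
    if "%" in token:
        feats.add("percent")
    if token[:1].isupper():
        feats.add("initCap")      # first character is uppercase
    if token.isupper():
        feats.add("allCaps")      # every cased character is uppercase
    if token.islower():
        feats.add("allLower")     # every cased character is lowercase
    return feats


for tok in ["AIDS", "Hà", "virus", "7", "15%", ","]:
    print(tok, sorted(context_features(tok)))
```

In a statistical tagger such as a CRF, these names would be combined with the position of the token in a window around the current word.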
Types of context patterns [6]:
- Basic orthographic patterns (uppercase, lowercase, period, comma): comma, dot, oneDigit, AllDigits, ...
- Morphological patterns: prefixes and suffixes (~virus, ~lipid, ~vitamin, ...).
- Grammatical patterns: verb phrases, noun phrases, ...
- Semantic trigger patterns:
  o Head noun triggers: the head noun of a phrase ("B cell" in "activated human B cells", "bệnh" in "bệnh viêm xoang").
  o Special verb triggers: nhiễm (to be infected), lây (to spread), bao gồm (to include), gây ra (to cause).
3.3.1. Rule-based, semi-supervised methods
A rule-based system consists of a set of basic rules (If-Then), a set of facts, and an interpreter that uses the rules to derive new facts. With a rule-based method we first build an initial set of rules and entities. Through semi-supervised learning and the bootstrapping technique, we then extend the initial entity set and rule set.
Semi-supervised learning [28] is a machine learning approach that uses both labeled and unlabeled data for training. It combines the advantages and reduces the drawbacks of supervised and unsupervised learning. The main task of semi-supervised algorithms is to expand a small initial training set into a larger one.
A key technique in semi-supervised learning is bootstrapping. It involves a small degree of supervision: training starts from an initial data set (called the seed set). For example, a disease name recognizer initially requires a small sample of disease names. The system then searches for sentences containing these names and tries to find context information shared by several of them (for example, context that is common to at least five of the sample names). From this context information the system finds other disease names that occur in similar contexts. The training process is repeated to find new examples and to discover new relevant contexts. By repeating it many times, a large number of disease names and a large amount of context information are collected.
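The bootstrapping loop just described can be sketched on a toy corpus. This is a deliberately minimal illustration under strong assumptions: entity names are single tokens, a context is just the pair (previous word, next word), and no support threshold (such as "shared by five names") is applied. The function name bootstrap and the English corpus are hypothetical.

```python
def bootstrap(sentences, seeds, rounds=2):
    """Grow a set of names and a set of contexts from a small seed set."""
    names = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: collect the contexts in which known names occur.
        for sent in sentences:
            toks = sent.split()
            for i in range(1, len(toks) - 1):
                if toks[i] in names:
                    patterns.add((toks[i - 1], toks[i + 1]))
        # Step 2: any token found in a known context becomes a new name.
        for sent in sentences:
            toks = sent.split()
            for i in range(1, len(toks) - 1):
                if (toks[i - 1], toks[i + 1]) in patterns:
                    names.add(toks[i])
    return names, patterns


corpus = [
    "patients with measles need rest",
    "patients with cholera need fluids",
    "children with mumps need care",
    "vaccine against mumps is available",
    "vaccine against rabies is available",
]
names, patterns = bootstrap(corpus, {"measles"})
print(sorted(names))
print(sorted(patterns))
```

In this run "rabies" is only reachable in the second round, through the context learned from "mumps", which was itself found in the first round. Without a threshold on how many names must share a context, a single noisy context can pull in many wrong names, which is why a support threshold is used in practice.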
3.3.2. Finite state machine methods
Finite state machine methods share the general scheme of a finite state machine (FSM), also called a finite state automaton (FSA). A finite state machine is an abstract machine used in the study of computation and languages that has a finite, fixed number of states. It is represented as a directed graph with a finite number of nodes (the states), where each node has zero or more arcs (transitions) leading to other nodes. Given an input string, the task is to determine the matching sequence of transitions. There are several kinds of finite state machine. An acceptor answers "yes" or "no" to whether the input string is accepted. A recognizer classifies the input string. A transducer produces an output string that corresponds to the input string. The finite state models used in information extraction are transducers: for an input text string, the system outputs the string of features that correspond to the keywords in that text. Under another classification, finite state machines are either deterministic (deterministic finite automaton, DFA) or non-deterministic (non-deterministic finite automaton, NFA).
A finite state machine consists of:
- An alphabet Σ.
- A set of states S, where
  o for a DFA: there is one start state and zero or more accepting (final) states;
  o for an NFA: there are one or more start states and zero or more accepting (final) states.
- A transition function T: S × Σ → S.
The machine operates as follows. Starting from the start state (or set of start states), it reads the characters of the input string one by one and uses the transition function T to move to the next state until every character has been read. If it ends in a final state, the run succeeds. In that case the sequence of states visited while processing the input string is taken as the result string, also called the label string that matches the input string.
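The operation just described can be written down directly for the deterministic case. The sketch below is illustrative: the function run_dfa, the state names, and the example machine (which accepts one or more digits followed by "%") are assumptions chosen for the example.

```python
def run_dfa(transitions, start, accepting, text):
    """Run a DFA on text.

    transitions maps (state, character) -> next state.
    Returns (accepted, visited_states).
    """
    state = start
    path = [state]
    for ch in text:
        if (state, ch) not in transitions:
            return False, path           # no transition defined: reject
        state = transitions[(state, ch)]
        path.append(state)
    return state in accepting, path


# Example machine: q0 --digit--> q1 --digit--> q1 --'%'--> q2 (accepting)
DIGITS = "0123456789"
T = {("q0", d): "q1" for d in DIGITS}
T.update({("q1", d): "q1" for d in DIGITS})
T[("q1", "%")] = "q2"

print(run_dfa(T, "q0", {"q2"}, "15%"))
print(run_dfa(T, "q0", {"q2"}, "15"))
```

The list of visited states returned alongside the accept flag plays the role of the label string mentioned above: each input character is paired with the state reached after reading it.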
The finite state models used in information extraction add some further elements, mainly concerning the transition function T, which is usually described as a Markov process.
3.3.3. Gazetteer-based methods
A gazetteer dictionary (or simply gazetteer) is a list of entities such as person names, organizations, and locations; for the medical domain in particular it is a list of diseases, drug names, symptoms, and causes. A gazetteer that is good, complete, and accurate is an important prerequisite for a named entity recognition system. Besides building the ontology, this thesis also covers the construction of an initial gazetteer for Vietnamese medical text. Named entity recognition based on this gazetteer gives promising results.
Gazetteer files are declared in the format a.lst:b:c, where a.lst is the file containing the instances of entity class a, b is the major type, and c is the minor type. Put simply, a class of the minor type is a subclass of the class of the major type. For example, the gazetteer files for the causes of diseases are declared as follows:
nguyen_nhan.lst:nguyen_nhan:vikhuan,
nguyen_nhan.lst:nguyen_nhan:tac_nhan.
Figure 6: Some gazetteer files built for the named entity recognition task.
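To make the a.lst:b:c convention concrete, the sketch below parses one declaration line and performs a naive dictionary lookup. It is an illustration only: the functions parse_definition and lookup, the in-memory gazetteer, and the sample entry are assumptions and do not reproduce the files shown in Figure 6.

```python
def parse_definition(line):
    """Split a declaration such as 'a.lst:b:c' into (file, major, minor)."""
    parts = line.strip().rstrip(",.").split(":")
    filename, major = parts[0], parts[1]
    minor = parts[2] if len(parts) > 2 else None   # the minor type is optional
    return filename, major, minor


def lookup(text, gazetteer):
    """Return sorted (name, (major, minor)) pairs for names found in text.

    gazetteer maps (major, minor) -> set of lowercase entity names.
    Matching is a plain case-insensitive substring test.
    """
    low = text.lower()
    hits = []
    for label, entries in gazetteer.items():
        for name in entries:
            if name in low:
                hits.append((name, label))
    return sorted(hits)


print(parse_definition("nguyen_nhan.lst:nguyen_nhan:vikhuan,"))

# Hypothetical gazetteer content for the 'vikhuan' (bacteria) minor type.
gaz = {("nguyen_nhan", "vikhuan"): {"vibrio cholerae"}}
print(lookup("Bệnh tả do vi khuẩn Vibrio cholerae gây ra", gaz))
```

Substring matching is the simplest possible strategy; it ignores word boundaries, so a real lookup would normally match against segmented words and prefer the longest match.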
Many papers discuss the use of data sets for named entity recognition. In the paper on building a data set for named entity recognition (presented in Section 3.4.1), the authors stress the importance of building an initial data set for the recognition process. The paper uses BioCaster NE to annotate the data and Yamcha to train an SVM model on the annotated articles [20].
3.4. Vietnamese medical named entity recognition
3.4.1. Vietnamese named entity recognition
Several studies address the use of data sets for Vietnamese named entity recognition. Nguyễn Cẩm Tú [1] built an entity type recognition system based on Conditional Random Fields (CRF) that identifies 8 entity types, corresponding to 17 labels. The author ran experiments with FlexCRFs (an open-source tool developed by Phan Xuân Hiếu and Nguyễn Lê Minh) on 50 business news articles (nearly 1400 sentences) taken from http://vnexpress.net.
Thao P.T.X. and colleagues [21] exploit voting strategies by combining several trained learners that use a word-based approach. Their main idea is that combining learners built with different classification algorithms (SVM, CRF, TBL, Naïve Bayes) gives better results than using each algorithm on its own.
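The simplest voting strategy is a per-token majority vote over the label sequences produced by the individual classifiers. The sketch below illustrates only that idea; it is not the specific strategy of [21], and the function vote and the sample predictions are hypothetical.

```python
from collections import Counter


def vote(predictions):
    """Combine label sequences by per-token majority vote.

    predictions is a list of label sequences, one per classifier,
    all of the same length.
    """
    combined = []
    for labels in zip(*predictions):             # labels proposed for one token
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical outputs of three classifiers for a three-token sentence.
svm_out = ["B_PER", "O", "O"]
crf_out = ["B_PER", "I_PER", "O"]
nb_out = ["O", "I_PER", "O"]
print(vote([svm_out, crf_out, nb_out]))
```

With an even number of classifiers ties become possible, so practical schemes often weight each vote, for example by the classifier's accuracy on held-out data.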
In [20], Thao P.T.X. and colleagues stress the importance of building an initial data set for named entity recognition. The authors use BioCaster NE to annotate the data and Yamcha to train an SVM model based on related work. The group, which detects infectious diseases from online health articles, notes that building the data set plays a very important role in named entity recognition and defines 22 entity labels for labeling and annotating the data.
A representative study on named entity recognition in Vietnam is the VN-KIM IE tool [40], built by the research group led by Assoc. Prof. Dr. Cao Hoàng Trụ at the Ho Chi Minh City University of Technology. VN-KIM IE automatically recognizes named entities on Vietnamese web pages and annotates them with their classes.
3.4.2. Vietnamese medical named entity recognition
Several researchers worldwide (John McNaught [10], Sammy Wang [25], ...) have pointed out difficulties in processing medical data. The most typical are the ambiguity and variety of words; entities in medical data have complex structure, and the principles by which they are formed are sometimes unusual; there is still no clear convention for naming entities; synonyms, antonyms, and abbreviations cause problems, and in many cases a word is not used in its usual sense; many words may denote one concept, and one word may have several meanings.
Named entity recognition for Vietnamese medical text faces further obstacles beyond the general difficulties above. Vietnamese lacks training data and lookup resources (such as WordNet for English), and lacks grammatical information (POS) and phrase information such as noun phrases and verb phrases, even though this information plays an important role in entity recognition; word boundaries are unclear, which easily causes ambiguity. The nature of medical data adds more difficulties: information is stored in unstructured or semi-structured form (drug names, viruses), entity names are abbreviated in various ways, entity names are long and varied, and the same entity can be written in different ways. For Vietnamese disease entities in particular, the following characteristics make recognition difficult:
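- They follow no capitalization rule.
- The number of words is hard to bound: some disease names consist of a single word (such as "sởi", measles), while others consist of many words, such as "chứng rối loạn tâm thần thể hoang tưởng" (paranoid-type psychotic disorder).
- The words forming an entity can have a very complex structure: "rối loạn chức phận não nhẹ ở trẻ em" (minimal brain dysfunction in children), ...
- There are many loanwords and Sino-Vietnamese words: stress, bệnh paranoia, bệnh gout, bệnh thiên đầu thống, ...
- The same disease is sometimes written in ways that are not quite the same or are entirely different: "thủy đậu" or "trái rạ" (chickenpox); "bệnh gút", "gout", or "thống phong"; "bệnh ung thư máu" (blood cancer) is also called "bệnh máu trắng" (leukemia).
- There are many abbreviations: AIDS (from the English Acquired Immunodeficiency Syndrome or Acquired Immune Deficiency Syndrome) is translated in many Vietnamese medical documents as "hội chứng suy giảm miễn dịch mắc phải", ...
- They contain words that are easily missed because the phrase counts as an entity with or without them, such as "mãn tính" (chronic), "cấp tính" (acute), "nguyên phát" (primary), and "thứ phát" (secondary).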
Recognizing the entities specific to biological and medical data is also a very active research topic. The entity types that attract the most attention in biomedical data are Disease, Drug, Gene, Organism, Protein, Enzyme, Malignancies, Fibrinogen, ... [10] [23]
One of the simplest methods proposed for named entity recognition in medical data is to use predefined dictionaries or vocabularies. An example is MeSH [23], a controlled medical vocabulary used for indexing. In essence it is a list of terms approved for use as index terms, and only the terms in this list are accepted in that role. The terms in MeSH are organized in a tree structure. The MeSH tree has 16 branches in total, which are the largest and most characteristic groups of terms in medical data, for example branch A - Anatomy, branch B - Organisms, branch C - Diseases, branch D - Chemicals and Drugs, and branch G - Biological Sciences. Each branch is divided into sub-branches, for example A01 - Body Regions and A09 - Sense Organs.
In the BioCreAtIvE international conference series (Critical Assessment of Information Extraction systems in Biology), which is organized as a competition, BioCreAtIvE I (2003-2004) focused on recognizing gene and protein names. Some representative results are listed below [32]:
- Alexander Yeh and colleagues, using the data and evaluation software of W. John Wilbur and Lorraine Tanabe, obtained an F-measure of about 80-83%.
- Shuhei Kinoshita and colleagues treated named entity recognition as a form of part-of-speech tagging by adding a GENE tag to the usual tag set. They used Brill's part-of-speech tagging method and the TnT tool, which is based on an HMM. Without post-processing the system reached a precision of 68.0%, a recall of 77.2%, and an F-measure of 72.3%; with one post-processing step (a set of error-correcting rules) it reached a precision of 80.3%, a recall of 80.5%, and an F-measure of 80.4%; with an additional dictionary-based post-processing step it reached an F-measure of 80.9%.
- In 2004, Yi-Feng Lin, Tzong-Han Tsai, Wen-Chi Chou, Kuen-Pin Wu, Ting-Yi Sung, and Wen-Lian Hsu published work applying the Maximum Entropy Markov Model to named entity recognition in medical data. The results, given as precision P, recall R, and F-measure (2PR/(P+R)), were (0.512, 0.538, 0.525); after post-processing they were (0.729, 0.711, 0.72).
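The F-measure formula 2PR/(P+R) quoted above can be checked against the reported figures with a few lines of code. The function name f_measure is an assumption; the numbers are the ones cited in the list.

```python
def f_measure(p, r):
    """Harmonic mean of precision p and recall r (the balanced F-measure)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Figures reported for the MEMM system, before and after post-processing.
print(round(f_measure(0.512, 0.538), 3))
print(round(f_measure(0.729, 0.711), 2))

# Figures reported for the TnT-based system, in percent.
print(round(f_measure(68.0, 77.2), 1))
print(round(f_measure(80.3, 80.5), 1))
```

Because the F-measure is a harmonic mean, it stays close to the lower of the two values, so a system cannot score well by maximizing precision or recall alone.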
In 2004, Haochang Wang and colleagues [7] proposed a named entity recognition method for medical data based on a classifier that combines Generalized Winnow, Conditional Random Fields, Support Vector Machines, and Maximum Entropy, with the methods combined under three different strategies. Their system achieved an F-measure of about 77.57%, a fairly good result compared with other studies of the same period.
In 2007, Andreas Vlachos [3] compared two methods for named entity recognition in medical data, one based on an HMM and one based on a CRF combined with syntactic parsing. The two tables below show the experimental results: the left table gives the results of training on a small set of data manually annotated with entities and testing on the whole training set, and the right table gives the results of training on a small set of noisy data and testing on the whole training set.
Most recently, in March 2009, Razvan C. Bunescu [45], presenting work on relation extraction from medical data, drew attention to the problem of recognizing the entities specific to medical data; the entities of interest are Disease, Gene, and Protein. After recognizing these entities, the author takes the important further step of extracting the interaction relations between them (for example, a gene encodes a protein, or a protein performs its function by interacting with another protein).
Chapter 4
IDENTIFYING SEMANTIC RELATIONS
4.1. Overview of semantic relation identification
4.1.1. Overview of semantic relations
As described above, once a set of entity classes is available (from the named entity recognition step), the next step toward a semantic network of entities is to extract semantic relations. A semantic relation can be understood as the underlying relationship between two concepts expressed by words or phrases [24]. Semantic relations play an important role in lexical semantic analysis and can therefore be applied to many other tasks: building lexical semantic knowledge bases, question answering, text summarization, and so on. Typical semantic relations in the medical domain are IS_A (influenza - disease), PART_WHOLE (virus - cause), and CAUSE_EFFECT (virus - disease).
Figure 7: Illustration of semantic relations for the entity "car"
Although semantic relations play an important role in semantic analysis, they usually exist in implicit form, which makes them hard to extract. The question is how to mine these semantic relations effectively from raw (unstructured or semi-structured) data. Answering this question is the main goal of relation extraction, which has received much attention recently.
4.1.2. Semantic relation extraction
The purpose of semantic relation extraction is to extract particular, specific relations between entities from a large text corpus. In essence, the task is: given a pair of entities x-y, determine the meaning of the pair [24]. For example, from the sentence "mất ngủ do căng thẳng, hồi hộp" (insomnia caused by stress and anxiety) we can infer the semantic relation that stress and anxiety are causes of insomnia.
Figure 8. Illustration of semantic relation extraction
The resources for semantic relation extraction include:
- Data sets: based on co-occurrence and statistical methods.
- Existing resources on semantic relations such as WordNet and standard benchmarks.
- Human judgment.
Like named entity recognition, semantic relation recognition has its own difficulties: (1) there is no agreement on the number of semantic relations, and semantic relations are hidden under different forms; (2) noun-noun combinations do not strictly follow fixed constraints, semantic relations are usually implicit, a pair of concepts may be linked by several relations, interpretation can depend heavily on context, and there is no well-defined set of semantic relations.
Semantic relation extraction is part of important international projects in knowledge discovery [24], for example ACE (Automatic Content Extraction), DARPA EELD (Evidence Extraction and Link Discovery), ARDA-AQUAINT (Question Answering for Intelligence), ARDA NIMD (Novel Intelligence from Massive Data), and Global WordNet.
Figure 9. The place of semantic relation mining in natural language processing
The semantic relations differ from one domain to another. The table in Figure 10 illustrates some of the semantic relations in WordNet.
Figure 10. Illustration of the semantic relations defined in WordNet [37]
For the medical domain, our survey identified 12 types of semantic relations, which are described in detail in Chapter 5.
Figure 11. Some of the semantic relations that were built
Figure 11 shows some of the semantic relations; their meanings are described in Table 1.
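Relation | Meaning | Inverse relation
Gây_ra | A cause (Nguyên_nhân) gives rise to a disease | Bị_gây_ra_bởi
Có_triệu_chứng | A disease has symptoms | Liên_quan
Tại | An organization (Tổ_chức) is located at a place (Địa_điểm) | -
Chữa_bằng | A disease is treated with a drug | Chữa
Làm_việc | A person works at an organization (Tổ_chức) | -
Biến_chứng | A disease develops into another disease as a complication | -
Tương_tác_thuốc | A drug interacts with another drug | -
Phát_hiện_tại | A disease is detected at an organization (Tổ_chức) | -
Tác_động_tốt | Food (Thực_phẩm), an activity (Hoạt_động), or a chemical (Chất_hóa_học) has a beneficial effect on the human body (Cơ_thể_người) or on a disease | -
Tác_động_xấu | Food (Thực_phẩm), an activity (Hoạt_động), or a chemical (Chất_hóa_học) has a harmful effect on the human body (Cơ_thể_người) or on a disease | -
Table 1. Explanation of the semantic relations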
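A relation inventory like the one in Table 1 can be held in a small lookup structure. The Python sketch below is illustrative only: the `Relation` record and the `invert` helper are hypothetical names that do not come from the thesis tooling, relation names are written with their Vietnamese diacritics, and only a subset of the table is included. The example fact is the "stress causes insomnia" relation used earlier in this chapter.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Relation:
    name: str               # relation identifier as listed in Table 1
    inverse: Optional[str]  # inverse relation, or None if Table 1 lists none


# Subset of Table 1 (hypothetical in-memory representation).
RELATIONS = {
    r.name: r
    for r in (
        Relation("Gây_ra", "Bị_gây_ra_bởi"),
        Relation("Có_triệu_chứng", "Liên_quan"),
        Relation("Chữa_bằng", "Chữa"),
        Relation("Biến_chứng", None),
    )
}

Triple = Tuple[str, str, str]  # (subject, relation, object)


def invert(triple: Triple) -> Optional[Triple]:
    """Return the inverse triple, or None when no inverse relation is listed."""
    subject, rel, obj = triple
    inverse = RELATIONS[rel].inverse
    if inverse is None:
        return None
    return (obj, inverse, subject)


fact = ("căng thẳng", "Gây_ra", "mất ngủ")  # stress causes insomnia
print(invert(fact))
```

Storing the inverse next to each relation means a fact needs to be annotated in one direction only; the reverse reading can be derived when the knowledge base is queried.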
4.1.3. Related work on semantic relation identification
At the SemEval 2007 workshop [38], recognizing the semantic relation between two nominals was one of the main tasks. The meaning of the two entities depends on the meaning of the other words in the context, and the pair is classified under a particular relation type. Example: "đi xe đạp" (cycling) and "sự vui vẻ" (happiness) stand in a cause-effect relation. The task was built around seven basic relations: Cause-Effect, Instrument-Agency, Product-Producer, Origin-Entity, Theme-Tool, Part-Whole, and Content-Container.
Other methods extract relations between two concepts of the following kind: a drug is a treatment for a disease, or a gene is a cause of a disease. Swanson [29] introduced a model for extracting such relations from biomedical databases by bringing in a third concept (for example, a physiological function) that is related to both of the concepts connected with the disease. Extracting this third concept reveals a relation between the two main concepts that is not stated explicitly in any single document. More concretely: X is related to some disease, Z is related to a drug, and Y is a pathological or physiological function or a symptom. X and Y, and likewise Y and Z, are often mentioned together, whereas X and Z do not appear together in any research document. Concept Y can then be used as evidence of a link between concepts X and Z.
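The X-Y-Z reasoning described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not Swanson's actual procedure: each document is reduced to a set of concept names, "mentioned together" is taken to mean co-occurrence in the same document, and the function names and the toy corpus are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(documents):
    """Map each concept to the set of concepts it shares a document with."""
    links = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(set(concepts), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def bridging_concepts(documents, x, z):
    """Concepts Y that co-occur with both X and Z when X and Z never co-occur."""
    links = cooccurrence(documents)
    if z in links[x]:
        return set()  # X and Z are already linked directly; nothing hidden
    return links[x] & links[z]


# Toy corpus (invented for illustration): each set is one document.
docs = [
    {"disease_X", "function_Y"},
    {"function_Y", "drug_Z"},
    {"disease_X", "symptom_W"},
]
print(bridging_concepts(docs, "disease_X", "drug_Z"))
```

A real system would rank the candidate Y concepts, for instance by co-occurrence frequency, instead of returning an unordered set.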
Regarding the use of an Ontology, several groups of authors have proposed ontology-based semi-supervised learning as a new approach. In this approach the input is a set of texts in which entity names have been identified and mapped to the corresponding ontology concepts. Existing data sets such as the GENIA corpus [14] are annotated manually, but corpus data can also be generated automatically with a suitable NER system. The output is a set of patterns, each consisting of a pair of classes and the relation between them in the GENIA ontology (for example, the template: virus infect cell).
Many methods have been proposed for relation identification. The general task, however, is the same: starting from raw text such as Web pages, documents, and news articles, a semantic parser produces knowledge bases (KB) that contain the concepts, the relations, and the links between documents [24]. Figure 12 depicts this general task of the relation identification problem.
Figure 12. The general task of the relation identification problem
The relation identification problem can also be understood as determining the meaning of a given noun (entity) pair [24]. That meaning is expressed in terms of a list of relations, the recognized entity pairs, and some other resources.
As described above, the semantic parser plays an important role in extracting semantic relations. It consists of the components shown in Figure 13:
Figure 13. Components of the semantic parser (SR) [24]
- Preprocessing: tokenizer, part-of-speech tagger, syntactic parser, word sense disambiguation, named entity recognition.
- Feature selection: determine the properties and constraints (or context) that the classifier uses to discriminate between semantic relations.
- Learning model: classify the input instances into the appropriate relations.
The semantic parser (SR) performs two main tasks:
- Labeling: given a predefined set of semantic relations and an entity (noun-noun) pair, assign the relation that holds between the two entities. Example: "bánh xe ô tô" (car wheel).
- Paraphrasing: given a noun or entity pair, produce a description of it in the context of the nouns. Example: from "bệnh mất ngủ do căng thẳng" (insomnia due to stress) we can infer the relation that stress is the cause of insomnia.
4.2. Semantic labeling of sentences
In [30], Xuan-Hieu Phan and colleagues address cross-document entity disambiguation by assigning semantic labels to the sentences of a text. Cross-document entity disambiguation means distinguishing entities that share the same surface form within a given document collection. For example, given a set of entity mentions that all read "Bill Clinton", we must determine which subset of documents is really about Bill Clinton the former US president, which subset is about Bill Clinton the golfer, and which is about some other Bill Clinton.
Semantic labeling can be treated as a classification problem over sentences that contain semantic relations. The paper uses a Maxent-based classifier that takes the sentences of a personal summary as input and outputs them with semantic labels. An advantage of the Maxent-based classifier is that it can tightly integrate a very large number (hundreds of thousands or even millions) of overlapping, interdependent features at different levels.
The authors of [30] also propose a framework for cross-document entity disambiguation consisting of three main parts, of which semantic labeling of sentences is an indispensable one:
- Preprocessing: shallow processing is used to collect a summary consisting of the sentences related to the entity in question.
- Semantic labeling: the sentences in the summary are assigned semantic labels that place them in different classes of facts. The labeling is performed by a high-accuracy Maxent-based classifier trained with a semi-supervised learning method.
- Clustering: the similarity between personal summaries is computed by comparing sentences that carry the same semantic label, which yields a measure of semantic closeness.
Figure 14. Framework for resolving proper names across documents
Figure 14 shows that semantic labeling of sentences plays an important role in cross-document proper-name resolution and also serves as a basis for semantic relation identification.
Some semantic labels for sentences are illustrated in Figure 15 below.
Figure 15. Some semantic labels assigned to sentences [30]
With these labels, the personal summary of Bill Clinton is labeled as shown in Figure 16 below.
Figure 16. Semantic labels assigned to sentences describing President Bill Clinton [30]
In this thesis we experimentally labeled 1000 sentences with labels that capture relations in the medical domain. The labels and the labeled data are presented in detail in Chapter 5.
4.3. Classifying relation-bearing sentences
4.3.1. Classification in relation identification and entity recognition
The entities to be recognized and the relations to be identified depend on the problem and on the application domain. In ordinary entity recognition, for example, entity names may be names of people, organizations, or places. In the application domain of this thesis, entity names may be names of diseases, drugs, symptoms, causes, and so on. For some entity names and relations, however, such as disease names, symptoms, causes, and the relations có_triệu_chứng and có_biến_chứng, recognizing them and telling them apart is itself a complex problem. A disease name often coincides with a symptom or a cause: headache and cough, for example, can be read as a disease, a cause, or a symptom depending on the context. Entity recognition and relation identification are therefore closely tied to classification. Once entities have been recognized, they must be assigned to the correct classes. Moreover, as described in the previous section, semantic labeling of sentences is itself based on a classification algorithm. For these reasons the thesis examines the classification problem and the classification algorithms studied in recent years.
Figure 17 describes the stages of the classification process. The model has three main stages. The first is data representation: the data (sentences) are converted into some structured form and the given samples are gathered into a training set. The second stage applies machine learning techniques to the training samples just represented, so the representation produced in the first stage is the input to the second. The third stage adds extra knowledge supplied by the user in order to improve accuracy, either in the text representation or in the learning process.
Figure 17. Stages of the classification process
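The first stage, turning sentences into a structured form, can be illustrated with a plain bag-of-words representation. The sketch below is an assumption made for illustration: the thesis does not state that it uses exactly this representation, the function names are hypothetical, and tokens are obtained by splitting on whitespace, whereas real Vietnamese text needs a word segmenter because one word may span several syllables.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Assign an index to every distinct token, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(sentence, vocab):
    """Turn one sentence into a term-count vector over the vocabulary."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:  # tokens unseen during training are ignored
            vector[vocab[token]] = n
    return vector


training = ["mất ngủ do căng thẳng", "căng thẳng gây mất ngủ"]
vocab = build_vocabulary(training)
print(vocab)
print(vectorize("mất ngủ do căng thẳng", vocab))
```

Each sentence becomes a fixed-length numeric vector, which is the kind of input the learning stage expects.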
In recent years many algorithms have been proposed for classification, for example SVM (Support Vector Machine), k-nearest neighbors, and decision-tree classification. These algorithms are described in some detail by Nguyễn Minh Tuấn [2]. We use SVM to classify relation-bearing sentences; the following sections present this algorithm in more detail.
[Figure 17 diagram labels: Data (sentences); Initial representation; Dimensionality reduction or feature selection; Final representation; Inductive learning [2]; Added knowledge [3]; Classification tools]
4.3.2. The SVM (Support Vector Machine) algorithm
The Support Vector Machine (SVM) algorithm was introduced by Cortes and Vapnik in 1995. SVM is very effective for problems with high-dimensional data, such as the vectors that represent text.
SVM is trained on a data set $D = \{(X_i, C_i),\ i = 1, \dots, n\}$, where $C_i \in \{-1, 1\}$ indicates whether an example is positive or negative. The goal of the algorithm is to find a hyperplane $\alpha_{svm} \cdot d + b = 0$ that divides the data into two regions. Classifying a new document $d$ then amounts to determining the sign of $f(d) = \alpha_{svm} \cdot d + b$: the document belongs to the positive class if $f(d) > 0$ and to the negative class if $f(d) < 0$.
Figure 18: Partition of documents according to the sign of the function $f(d) = \alpha_{svm} \cdot d + b$
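The decision rule can be written out directly. In the Python sketch below, `weights` plays the role of the hyperplane's normal vector (the vector carrying the subscript svm in the text) and `bias` plays the role of b. The numeric values are invented for illustration and are not trained parameters. The text leaves the case f(d) = 0 unspecified; the sketch assigns it to the negative class as an arbitrary convention.

```python
def decision_value(weights, bias, document):
    """Compute f(d) = w . d + b for a document vector d."""
    return sum(w * x for w, x in zip(weights, document)) + bias


def classify(weights, bias, document):
    """Return +1 if f(d) > 0, otherwise -1 (f(d) = 0 is treated as negative)."""
    return 1 if decision_value(weights, bias, document) > 0 else -1


# Hypothetical parameters, standing in for the result of SVM training.
weights = [0.5, -1.0, 2.0]
bias = -0.25

print(classify(weights, bias, [1, 0, 1]))  # f(d) = 0.5 + 2.0 - 0.25 = 2.25
print(classify(weights, bias, [0, 2, 0]))  # f(d) = -2.0 - 0.25 = -2.25
```

Only the sign of f(d) is used for classification; its magnitude indicates how far the document lies from the separating hyperplane.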
4.3.3. Multi-class classification with SVM
Relation classification requires a multi-class classifier, so the basic (binary) SVM has to be extended to the multi-class case.
One such extension is the one-against-all algorithm [12]. The basic idea is as follows:
- Suppose the sample data set is $(x_1, y_1), \dots, (x_m, y_m)$, where $x_i$ is an n-dimensional vector and $y_i \in Y$ is the class label assigned to $x_i$.
- Split the set $Y$ into m two-class subproblems of the form $z_i = \{y_i, Y \setminus y_i\}$.
- Apply the basic binary SVM to each of the m sets $z_i$ to build a separating hyperplane for that subproblem.
The combination of these m classifiers is called the multi-class extension of SVM.
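The one-against-all decomposition can be sketched as follows. To keep the example self-contained, each binary classifier is trained with a simple perceptron rather than an SVM; this is a deliberate stand-in, and only the decomposition into one "class versus rest" problem per label follows the description above. The 2-D points and their assignment to relation labels are invented for illustration, and all function names are hypothetical.

```python
def train_binary(samples, labels, epochs=1000):
    """Perceptron used as a stand-in for a binary SVM; labels are +1 or -1."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def train_one_vs_all(samples, labels):
    """Train one binary classifier per class: that class (+1) against the rest (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        binary = [1 if y == cls else -1 for y in labels]
        models[cls] = train_binary(samples, binary)
    return models


def predict(models, x):
    """Pick the class whose binary classifier gives the highest score."""
    def score(model):
        w, b = model
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=lambda cls: score(models[cls]))


# Toy data (invented): three well-separated clusters, one per relation label.
samples = [(-5, 0), (-4, 0), (-5, 1), (0, 5), (1, 5), (0, 6), (5, 0), (4, 0), (5, 1)]
labels = ["Gây_ra"] * 3 + ["Chữa_bằng"] * 3 + ["Có_triệu_chứng"] * 3

models = train_one_vs_all(samples, labels)
print([predict(models, x) for x in samples])
```

Taking the highest-scoring classifier resolves the cases where several binary classifiers, or none of them, claim the same input.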
4.3.4. Applying SVM to semantic relation classification in the Vietnamese medical domain
Although SVM was originally designed for binary classification, it has since been extended to multi-class classification, and this extension can be used to classify relation-bearing sentences [2].
Building an SVM-based relation classification model involves two data preparation steps:
- Design a hierarchical model (taxonomy) for the set of relation classes. The application domain of the relations determines the complexity (depth) of the taxonomy.
- Build a sample data set (corpus) labeled with each relation class. In this step the choice of features used to represent a relation is important. The feature set differs according to the characteristics of each language; for English, for example, the features are words.
Once the set of relation classes and the data set have been built, training proceeds according to the following model:
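Figure 19. Training process for classifying relation-bearing sentences [2]
[Figure 19 diagram labels: Sentence; Preprocessing; Feature extraction; Set of feature vectors; Multi-class SVM classification; Sentence (containing a relation)]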
Chapter 5
EXPERIMENTS
Building an Ontology for Vietnamese medical text and extending it automatically through the steps of information extraction (entity recognition, relation identification, and so on) lays the groundwork for building a semantically annotated data set (a semantic network). The result of this work plays an important role in building a semantic search engine in the future.
5.1. Experimental environment
5.1.1. Hardware
We used a personal computer with the following hardware configuration: Genuine Intel CPU T2050 1.60 GHz, CHIP 798 MHz, 1 GB RAM.
5.1.2. Software
We integrated utilities from the Protégé and Gate toolkits to build the ontology, annotate data, and recognize Vietnamese entities in the medical domain.
Protégé [13] is an ontology construction tool built and developed at the Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine. Protégé comes in two variants: Protégé Frame and Protégé OWL. Protégé Frame provides a full user interface and a ready-made model for creating and storing Ontologies in Frame form. Protégé OWL supports the Web Ontology Language (OWL), the Semantic Web language endorsed by the W3C.
Gate [31] is a software architecture for developing and deploying software components that process human language. Gate helps developers in three ways:
- It specifies an architecture, or organizational structure, for language processing software.
- It provides a framework, or class library, that implements this architecture and can be used in natural language processing applications.
- It provides a development environment built on top of the framework, made up of convenient graphical tools for developing components.
Gate exploits component-based software development, object orientation, and mobile code. The framework and the development environment are written in Java and are available as open-source software under the GNU library licence. Gate uses Unicode (Unicode Consortium 96) and has been tested on a number of languages, among them Germanic and Indic languages.
Gate has been developed at the University of Sheffield since 1995 and has been used in research and in many projects since then. Version 1 was released in 1996 and was licensed by hundreds of organizations. Gate has been used in a wide range of language analysis contexts in many languages: English, Greek, Swedish, German, Italian, French, and others. Later versions have followed and serve research and applications increasingly well.
5.1.3. Experimental data
We collected more than 500 web pages from the web site http://suckhoedoisong.vn and then removed noisy documents that did not help Ontology construction or entity recognition. After this processing, nearly 400 web pages remained, corresponding to more than 5000 sentences, which were used to build the Ontology, to recognize entities, and as a basis for classifying relations in sentences.
We used the JvnTextPro tool by Nguyễn Cẩm Tú [1] to strip HTML from the web pages and to perform sentence segmentation and word segmentation on this document collection.
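The two cleaning steps mentioned here, removing HTML and splitting text into sentences, can be illustrated with the Python standard library. This sketch is not JvnTextPro and does not reproduce its behavior: the class and function names are hypothetical, the sentence splitter is a naive punctuation rule, and word segmentation (which Vietnamese requires) is not covered.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


page = ("<html><style>p{color:red}</style><body>"
        "<p>Mất ngủ do căng thẳng. Nên đi khám sớm!</p></body></html>")
for sentence in split_sentences(html_to_text(page)):
    print(sentence)
```

A punctuation rule of this kind breaks on abbreviations and decimal numbers, which is one reason a dedicated tool is preferable for real data.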
5.2. Building the Ontology
5.2.1. Entity class hierarchy
From the medical data collected from web pages and from existing ontologies, we listed the important terms so that definitions can be presented to users; a direction for further work is to link them automatically to the definitions available on Wikipedia. The next step is to define the properties of these terms. Building an Ontology is an iterative process that starts with defining the concepts in the class hierarchy and describing their properties.
After surveying the BioCaster Ontology, which includes Vietnamese terms, together with a large number of current Vietnamese medical Web pages, we compiled a set of terms and the most basic relations and, on that basis, proposed an initial experimental Ontology.
The entity classes proposed in this thesis for building the Ontology include the following:
- Drug: traditional (Eastern) and Western medicine. Examples: 5-Fluorouracil Ebewe, an anti-cancer drug (colorectal, breast, esophageal, and stomach cancer); Ciloxan, an antiseptic used against bacterial eye infections; and the traditional remedy ngũ gia bì, used to treat rheumatism and to strengthen tendons and bones.
- Disease, syndrome: diseases such as avian influenza and peptic ulcer, and syndromes such as insomnia and heart failure.
- Symptom: for example, the symptoms of H5N1 influenza are high fever, headache, and aches all over the body.
- Cause: agents (viruses, bacteria, mosquitoes, chickens, birds, ...) and other causes such as lack of sleep, lack of exercise, and passive smoking.
- Food: dishes that benefit or harm human health, or that suit particular diseases.
- Person: doctors and professors whom patients can consult for examination and help when ill.
- Organization: hospitals, clinics, pharmacies, and similar places that patients can go to when ill.
- Location: the address of an organization that patients can go to, and places where an epidemic is emerging and spreading.
- Human body: all parts of the human body that can be affected by disease: eyes, nose, liver, heart, ...
- Activity: diagnosis and treatment, testing, resuscitation, artificial respiration, prevention, vaccination, ...
- Chemical: vitamins and minerals that have harmful or beneficial effects on the human body; for example, vitamin A is good for the eyes, and vitamins C and E reduce the risk of heart disease.
- Syndrome: a syndrome that can appear in the course of a disease (for example, the shock syndrome of dengue hemorrhagic fever).
- Complication: one disease can develop into another as a complication (for example, mumps with meningitis as a complication).
Figure 20: Classes in the constructed Ontology
Figure 21: Hierarchical structure of the constructed Ontology
5.2.2. Relations between entity classes
The thesis uses the following semantic relations between entities, both to define the relations in the Ontology and to assign semantic labels to sentences:
- Drug-drug interaction: one drug may cause side effects when taken with another, or several drugs may be combined to treat a disease. For example, the anti-cancer drug Alexan should not be used together with methotrexate or 5-fluorouracil.
- Food has a harmful or beneficial effect on a disease or on the human body. For example, drinking a lot of soda carries a risk of metabolic disorders, increased waist circumference, and high blood pressure.
- Disease-drug relation.
- A cause gives rise to a disease, or a disease has a cause.
- Disease-symptom relation.
- A disease develops into another disease as a complication.
- Activities act on a disease.
- A person works in an organization at some location.
- A disease falls within a person's medical specialty.
- A disease is detected and treated at an organization.
- Disease-syndrome relation.
Figure 22. Instances of entity classes and the relations between them
Figure 22 illustrates relations between instances of the entity classes. It shows the instance sốt Dengue (dengue fever) and its relations to instances of other entity classes: Gán_nhãn, phát_hiện_tại, có_triệu_chứng, biến_chứng, chữa_bằng, bị_gây_ra_bởi.
The thesis produced an Ontology with 21 entity classes, 13 relations, and more than 500 instances of the entity classes.
5.3. Data annotation
The thesis integrates the Ontology into the Gate tool (General Architecture for Text Engineering) to annotate data. Given the collected data and the constructed ontology, annotation proceeds through the following steps:
- Open the file containing the data to annotate; a whole directory containing several files can also be opened. Gate's Data_Store is used to save the data, both as opened and after annotation.
- Open the constructed Ontology. Gate can also be used to edit its classes, properties, and so on.
- Change the annotation colors of the Ontology entities appropriately so that entities can be told apart clearly.
- Select the entity to annotate and choose the name of the ontology entity class to annotate it with.
After annotation we obtain data containing entities that correspond to the classes defined in the ontology. Annotating data makes it easier to build a corpus from medical data and also contributes to extending the entities in the ontology automatically.
The thesis annotated 96 data files, corresponding to more than 1500 instances.
Figure 23: Example of data annotated with the Ontology
5.4. Entity recognition
5.4.1. Building the gazetteer
After annotation we have data files annotated with the individual entity classes. From this annotated data we can build a data set of entity names. A good data set of this kind makes entity recognition more effective. The thesis uses the Ontology together with an extension integrated into