VIETNAMESE MEDICAL INFORMATION EXTRACTION FOR SEMANTIC SEARCH (TRÍCH CHỌN THÔNG TIN Y TẾ TIẾNG VIỆT CHO BÀI TOÁN TÌM KIẾM NGỮ NGHĨA)


  • 7/30/2019 TRCH CHN THNG TIN Y T TING VIT CHO BI TON TM KIM NG NGHA


    VIETNAM NATIONAL UNIVERSITY, HANOI
    UNIVERSITY OF ENGINEERING AND TECHNOLOGY

    Tran Thi Ngan

    VIETNAMESE MEDICAL INFORMATION EXTRACTION FOR SEMANTIC SEARCH

    UNDERGRADUATE THESIS, FULL-TIME PROGRAM

    Major: Information Technology

    Supervisor: Assoc. Prof. Dr. Ha Quang Thuy
    Co-supervisor: MSc. Nguyen Cam Tu

    HANOI - 2009


    ACKNOWLEDGEMENTS

    First of all, I would like to express my deepest gratitude to Assoc. Prof. Dr. Ha Quang Thuy and MSc. Nguyen Cam Tu, who guided me devotedly throughout the work on this thesis. I ran into many difficulties during the research, but thanks to their dedicated guidance I gradually overcame them and completed the thesis.

    I am grateful to the lecturers of the University of Engineering and Technology, whose teaching gave me the valuable knowledge that underpins this thesis and my future research and work.

    I thank the senior members of the Lab for their valuable and useful advice while I was working on the thesis.

    I also thank my dear friends, especially my dormitory roommates, who stood by me and encouraged me, helping me complete the thesis and get through many difficulties in life.

    Finally, my deepest thanks go to my family, my father, my mother, my older sister and my younger sibling, for their love and timely encouragement, which helped me overcome hardships and complete this thesis.


    ABSTRACT

    Extracting medical information in order to build a good, sufficiently complete data set for semantic search is an essential need that has attracted particular attention recently. An ontology represents the concepts, properties and relations of an application domain in a consistent and rich way. Building an information extraction system on top of a Vietnamese medical ontology, so that data in this domain can be searched and mined more effectively, is therefore an essential need.

    This thesis addresses the construction of an ontology-based information extraction system for the Vietnamese medical domain. It analyzes several ontology construction methods and tools in order to choose a model, and builds a Vietnamese medical ontology with 21 entity classes, 13 relations and more than 500 instances of the entity classes. It annotates 96 data files containing more than 1,500 instances. The experimental entity recognition system proved feasible, reaching an average F1 score of about 64% over 10 experimental runs.


    TABLE OF CONTENTS

    Preface
    Chapter 1. OVERVIEW OF SEMANTIC SEARCH
        1.1. The need for semantic search
        1.2. Foundations of semantic search
            1.2.1. Semantic Web
            1.2.2. Ontology
        1.3. Architecture of a semantic search engine
        1.4. Information extraction
    Chapter 2. BUILDING A VIETNAMESE MEDICAL ONTOLOGY
        2.1. Introduction to ontologies
            2.1.1. The concept of an ontology
            2.1.2. Components of an ontology
            2.1.3. Related work on ontology construction
        2.2. Theory of ontology construction
            2.2.1. Ontology construction methods
            2.2.2. Ontology construction tools
            2.2.3. Ontology languages
        2.3. Building the Vietnamese medical ontology
    Chapter 3. ENTITY RECOGNITION
        3.1. Introduction to the entity recognition problem
            3.1.1. General introduction to entity recognition
            3.1.2. Some research results on entity recognition
        3.2. Characteristics of Vietnamese data
            3.2.1. Phonetic characteristics
            3.2.2. Lexical characteristics
            3.2.3. Grammatical characteristics
        3.3. Some entity recognition methods
            3.3.1. Rule-based and semi-supervised methods
            3.3.2. Finite-state machine methods
            3.3.3. Gazetteer-based method
        3.4. Vietnamese medical entity recognition
            3.4.1. Vietnamese entity recognition
            3.4.2. Vietnamese medical entity recognition
    Chapter 4. IDENTIFYING SEMANTIC RELATIONS
        4.1. Overview of semantic relation identification
            4.1.1. Semantic relations in general
            4.1.2. Semantic relation extraction
            4.1.3. Related research on semantic relation identification
        4.2. Semantic labeling of sentences
            4.3.1. Classification for relation identification and entity recognition
            4.3.2. The SVM (Support Vector Machine) algorithm
            4.3.3. Multi-class classification with SVM
            4.3.4. Applying SVM to semantic relation classification in the Vietnamese medical domain
    Chapter 5. EXPERIMENTS
        5.1. Experimental environment
            5.1.1. Hardware
            5.1.2. Software
            5.1.3. Test data
        5.2. Building the ontology
            5.2.1. Entity class hierarchy
            5.2.2. Relations between entity classes
        5.3. Data annotation
        5.4. Entity recognition
            5.4.1. Building the gazetteer set
            5.4.2. Evaluating the entity recognition system
            5.4.3. Results
            5.4.4. Remarks and evaluation
        5.5. Semantic labeling of sentences
    APPENDIX - ENGLISH-VIETNAMESE GLOSSARY
    CONCLUSION


    LIST OF TABLES

    Table 1: Explanation of the semantic relations
    Table 2: Number of instances of each entity class in the gazetteer data set
    Table 3: Evaluation measures for an entity-type recognition system
    Table 4: Results of 10 entity recognition experiments
    Table 5: Examples of sentences labeled with relations


    LIST OF FIGURES

    Figure 1: An example of the Semantic Web
    Figure 2: Architecture of a semantic search engine
    Figure 3: Illustration of an information extraction system
    Figure 4: The meaning of an ontology
    Figure 5: Hierarchical structure of the BioCaster ontology
    Figure 6: Some gazetteer files built for the entity recognition task
    Figure 7: A semantic relation for the entity "car"
    Figure 8: Illustration of semantic relation extraction
    Figure 9: The place of semantic relation mining in natural language processing
    Figure 10: Semantic relations identified in WordNet
    Figure 11: Some of the semantic relations that were built
    Figure 12: The general task of relation identification
    Figure 13: Components of the SR semantic analyzer [24]
    Figure 14: A framework for cross-document proper name resolution
    Figure 15: Some semantic labels assigned to sentences [30]
    Figure 16: Semantic labeling of sentences describing President Bill Clinton [30]
    Figure 17: Stages of the classification process
    Figure 18: Partition of documents by the sign of the function f(d)
    Figure 19: The learning process of the relation-sentence classifier [2]
    Figure 20: Classes in the ontology that was built
    Figure 21: Layered structure of the ontology that was built
    Figure 22: Instances of entity classes and the relations between instances
    Figure 23: A data sample annotated with the ontology
    Figure 24: Entity files in the gazetteer set that was built
    Figure 25: Results of 10 entity recognition experiments


    Preface

    Health care has always been an essential human need, and so searching the Internet for medical information is an essential need as well. The issue deserves even more attention now that people face many infectious diseases, a typical example being the influenza A (H1N1) epidemic, which is spreading and has tended to grow recently. As online resources keep appearing and expanding, exploiting them effectively to bring useful knowledge to users will contribute to public health communication and improvement.

    Medical resources, especially online health information, have exploded. The large number of web sites, the redundant information, and the free-form organization of that information (unstructured or semi-structured) make it hard for users to follow and grasp the most up-to-date information. In addition, traditional information retrieval technology either returns too few results, because natural language expression is rich and complex, or too many, in the sense that the searcher wants the underlying knowledge rather than just documents containing the search keywords. Making optimal use of these rich resources has therefore become an important topic that has drawn many researchers over the last two decades. Many works aim to extract structured information from these resources in order to build knowledge bases for organizing, searching, querying, managing and analyzing information.

    Many tasks have been posed in medical information extraction, such as BioCreative-I (recognizing gene and protein names in text) [32], LLL05 (extracting information about genes) [33], BioCreative-II (extracting protein-protein interaction relations) [49], and others. These tasks were set up to evaluate strategies for mining medical data, and they focus on two subproblems: entity recognition and relation extraction. Entity recognition requires identifying basic elements in text such as drug names, disease names, symptoms, genes and proteins. Relation identification with a given template means recognizing an occurrence of that relation in text, for example the relation between a specific disease and a specific virus.

    An ontology is one of the most consistent and richest ways to represent a schema of concepts and relations. Building an ontology for the medical domain in Vietnamese will provide the basis for searching and mining this kind of information effectively.

    A survey of available data shows that ontologies for Vietnamese medicine hardly exist in Vietnam today. Several research groups, however, have built ontologies for other specific domains and for various purposes. One example is the VN-KIM ontology [34], developed at the University of Technology, Vietnam National University Ho Chi Minh City. This ontology contains 347 entity classes and 114 relations and properties. The VN-KIM ontology includes entity classes with common names such as Con_người (Person), Tổ_chức (Organization), Tỉnh (Province) and Thành_phố (City), the relations between entity classes, and the properties of each entity class.

    Many methods have been proposed for building an information extraction system and a semantic network, which can then be applied to semantic search. This thesis presents the ontology-based representation, one of the approaches most widely used today. It presents several methods for building an ontology and for extending an ontology automatically, and introduces the entity recognition problem and relation classification based on several different methods. The thesis also builds a medical data set that makes entity and relation recognition more effective.


    Chapter 1

    OVERVIEW OF SEMANTIC SEARCH

    1.1. The need for semantic search

    The explosion of online information on the Internet and the World Wide Web has produced an enormous volume of information, and with it the challenge of mining all of it effectively for the benefit of people. Search engines such as Google and Yahoo were created to help users find and use information. Although the quality and quantity of their results keep improving, the results are still just lists of documents containing the words that appear in the query. The information in these results can be understood only by people; computers cannot "understand" it, which makes any further processing of the retrieved information difficult. The arrival of entity search engines (the Cazoodle system at http://www.cazoodle.com/, the Arnetminer system at http://www.arnetminer.org/ ...) marked a new step in the development of search engines. Moreover, with the appearance of the Wolfram semantic search engine, built and developed by Wolfram Research, Inc. and proposed by Stephen Wolfram [35], knowledge search has attracted even more attention.

    The Semantic Web, initiated by the W3C (The World Wide Web Consortium), opened up a step forward in Web technology: information in the Semantic Web is fully structured and carries semantics that computers can "understand". Such information can be reused without preprocessing steps. When ordinary search engines (Google, Yahoo, ...) are used to search the Semantic Web, they cannot exploit its advantages, and the results show no improvement. In other words, to current search engines the Semantic Web and the ordinary Web are the same thing. A semantic search system is therefore needed, one that searches the Semantic Web or a semantic knowledge network and returns fully structured information that computers can "understand", so that using and processing the information becomes easier [6][26][2]. In addition, building a concrete semantic search system lays the groundwork for automatic question answering systems in specific domains such as medicine or culture, which is of practical value in everyday life.


    1.2. Foundations of semantic search

    1.2.1. Semantic Web

    According to Tim Berners-Lee, the Semantic Web is an extension of the current World Wide Web in which information is given well-defined meaning so that people and computers can work together more effectively. The goal of the Semantic Web is to build on common standards and technologies that let computers understand more of the information contained in Web pages, in order to better support people in data mining, information aggregation, and the construction of other automated systems. Unlike conventional Web technology, whose content consists only of text, links, images and video, the Semantic Web can include more abstract information resources such as places, people, organizations, and even real-life events. Furthermore, links in the Semantic Web are not merely hyperlinks between resources; they include many other kinds of links and relations. These characteristics make Semantic Web content more diverse, more detailed and more complete. At the same time, the pieces of information in the Semantic Web are tightly connected to one another, which makes it easier for users to use and search for information. This is the greatest advantage of the Semantic Web over conventional Web technology [2].

    Figure 1. An example of the Semantic Web [6]

    Figure 1 shows an example of a Semantic Web page containing information about a person named Yo-Yo Ma. The page is structured as a directed graph with labeled edges, in which each vertex describes a kind of resource contained in the page. Each edge represents a kind of link between resources (also called a property of the resource), and the label on the edge is the name of that link (the name of the property). Specifically, Yo-Yo Ma has the property "date of birth" with value "10/07/55" and the property "place of birth" with value "Paris, France", and "Paris, France" has the property "temperature" with value "62 F".


    Thus, every resource described in the Semantic Web is an object. The object has a name, properties, property values (a value may itself be another object), and links to other resources (objects), if any. Building a Semantic Web page requires a sufficiently complete data set; in other words, it requires building a set of objects that describe the resources for the Semantic Web. Objects that are related to one another form a broad network of links, called a semantic network.

    A semantic network is shared widely, so the objects in it must be described according to a common standard. Ontologies are used to describe the objects and resources of the Semantic Web [2].
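    The object-and-link view described for Figure 1 can be sketched as a small set of triples. This is only an illustrative sketch: the triple layout and the helper names `properties_of` and `follow` are assumptions made here, not part of the thesis or of any Semantic Web library; the property names and values are the ones given in the Figure 1 description.

```python
# Each triple is (subject, property name, value), following the Figure 1 description.
# The representation itself is an assumption chosen for illustration.
TRIPLES = [
    ("Yo-Yo Ma", "date of birth", "10/07/55"),
    ("Yo-Yo Ma", "place of birth", "Paris, France"),
    ("Paris, France", "temperature", "62 F"),
]


def properties_of(resource, triples=TRIPLES):
    """Return {property name: value} for every triple whose subject is `resource`."""
    return {prop: value for subj, prop, value in triples if subj == resource}


def follow(resource, path, triples=TRIPLES):
    """Follow a chain of property names from `resource`; return None if a link is missing."""
    current = resource
    for prop in path:
        current = properties_of(current, triples).get(prop)
        if current is None:
            return None
    return current


# A property value can itself be another object, so links can be chained.
birthplace_temperature = follow("Yo-Yo Ma", ["place of birth", "temperature"])
```

    Because the value "Paris, France" is itself a subject of another triple, a query can walk from a person to a property of that person's birthplace, which is the kind of linked structure a keyword index does not expose.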

    1.2.2. Ontology

    Put simply, an ontology is a data model that represents a set of concepts in a domain and the relations between those concepts. It is used to reason about the objects in that domain [12].

    An ontology is one of the most consistent and richest ways to represent a schema of concepts and relations. For this reason it is used to build a semantic network from raw (unstructured or semi-structured) data, which in turn provides the foundation for an effective semantic search engine. Ontologies are introduced in more detail in Chapter 2 of this thesis.

    1.3. Architecture of a semantic search engine

    In essence, a semantic search engine has the same structure as a conventional search engine and also consists of two main parts [2]:

    The user interface (front end), with two main functions:
    - Query interface: lets the user enter a question or query.
    - Display of the answers and results.

    The internal architecture (back end), the core of the search engine, with three main components:
    - Question analysis.
    - Searching for results for the query or question.
    - The document collection, search data, and semantic network.

    The architecture of a semantic search engine is shown in Figure 2.
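    The front end and back end listed above can be sketched as a toy pipeline. Everything in this sketch is hypothetical: the keyword table, the function names, and the sample facts are assumptions chosen for illustration, and the question analysis is deliberately naive compared with the approach cited in [2].

```python
from dataclasses import dataclass, field


@dataclass
class SemanticNetwork:
    """Back-end data store: a list of (subject, relation, object) facts."""
    facts: list = field(default_factory=list)

    def search(self, subject, relation):
        """Return the objects linked to `subject` by `relation`, in insertion order."""
        return [o for s, r, o in self.facts if s == subject and r == relation]


# Hypothetical mapping from question keywords to relation names.
RELATION_KEYWORDS = {"symptoms": "has_symptom", "cause": "caused_by"}


def analyze_question(question):
    """Toy question analysis: first known keyword gives the relation, last word the subject."""
    words = question.lower().rstrip("?").split()
    relation = next((RELATION_KEYWORDS[w] for w in words if w in RELATION_KEYWORDS), None)
    subject = words[-1] if words else None
    return subject, relation


def back_end(question, network):
    """Back end: analyze the question, then search the semantic network."""
    subject, relation = analyze_question(question)
    if relation is None:
        return []
    return network.search(subject, relation)


def front_end(question, network):
    """Front end: accept a query and format the results for display."""
    results = back_end(question, network)
    return ", ".join(results) if results else "no result"


network = SemanticNetwork(facts=[
    ("measles", "has_symptom", "fever"),
    ("measles", "has_symptom", "rash"),
    ("measles", "caused_by", "measles virus"),
])
answer = front_end("What are the symptoms of measles?", network)
```

    The point of the sketch is the separation of concerns: the front end never touches the facts directly, and the quality of the answer depends on the two back-end parts the text singles out, question analysis and the search data.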


    Figure 2. Architecture of a semantic search engine [2]

    [Figure 2: diagram. Recoverable labels: Search Services; Semantic network; Semantic Web/Ontology; 1. Query input; 2. Question classification; 3. Question reformulation; 4. Information extraction; 5. Search; 6. Results returned.]

    The difference between the structure of a semantic search engine and that of a conventional one lies in the internal architecture, specifically in two components: question analysis and the search data set.

    Question analysis is covered in detail in [2]. The search data set consists of the Semantic Web and the semantic network, which are built from an ontology and an information extraction system. This thesis concentrates on building an ontology and on extending it automatically through information extraction, in particular entity recognition. It also addresses semantic relation recognition and the classification of sentences that contain relations, with the purpose stated above: building a sufficiently complete search data set for a future semantic search engine.

    1.4. Information extraction

    Information extraction is an important area of text data mining that extracts structured information from unstructured text. In other words, an information extraction system pulls predefined information about entities and the relations between them out of a natural language text and fills it into a structured record or a predefined template. Information can be extracted from text at several levels, such as identifying entities (Element Extraction), identifying relations between entities (Relation Extraction), identifying and tracking events and scenarios (Event and Scenario Extraction and Tracking), and resolving coreference (Co-reference Resolution). The techniques used in information extraction include segmentation, classification, association and clustering [1].

    Figure 3. Illustration of an information extraction system

    [Figure 3: an input text passed through an information extraction (IE) step that fills a structured record.
    Input text (translated): "Acute lung disease is one of the leading causes of death among the elderly, more dangerous even than lung disease caused by influenza. Common symptoms are fatigue, occasional confusion, irregular fever, a heavy and laborious dry cough, and sometimes shortness of breath. Sedatives and cough suppressants must be used with caution; if wheezing appears, it must be determined whether it is due to bronchial asthma, in which case corticoids and bronchodilators are required."
    Extracted record:
        Disease: acute lung disease
        Symptoms: fatigue; confusion; irregular fever; dry cough; shortness of breath
        Drugs: sedatives; cough suppressants; corticoid; bronchodilators]

    To have an information extraction system, we first need an entity recognition system; relation classification comes only after that. Recognizing entity types is the simplest of the information extraction problems, yet it is the most basic step before tackling the more complex problems in this field. Besides its use in information extraction systems, it can also be applied in information retrieval, machine translation and question answering.

    Many tasks have been posed in medical information extraction, such as BioCreative-I (recognizing gene and protein names in text) [32], LLL05 (extracting information about genes) [33], BioCreative-II (extracting protein-protein interaction relations) [49], and others. These tasks were set up to evaluate strategies for mining medical data, and they focus on two subproblems: entity recognition and relation extraction. Entity recognition requires identifying basic elements in text such as drug names, disease names, symptoms, genes and proteins. Relation identification with a given template means recognizing an occurrence of that relation in text, for example the relation between a specific disease and a specific virus. An ontology is one of the most consistent and richest ways to represent a schema of concepts and relations. Building an ontology for the medical domain in Vietnamese will provide the basis for searching and mining this kind of information effectively. Once the ontology has been built, the next task, which is just as important, is to extend it automatically. An information extraction system (including entity recognition, relation extraction, and so on) is the prerequisite for extending the ontology automatically.
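    The extraction illustrated in Figure 3 can be approximated with a dictionary lookup. The sketch below is a deliberately simplified assumption, not the system evaluated in this thesis: the gazetteer contents, the function name `extract_entities`, and the plain substring matching are all chosen here for illustration, and the sample text is a reconstruction, with diacritics restored, of the Figure 3 passage.

```python
from collections import defaultdict

# Hypothetical gazetteer: entity class -> known surface forms (lower-cased).
GAZETTEER = {
    "Disease": ["phổi cấp tính"],
    "Symptom": ["mệt mỏi", "lú lẫn", "sốt thất thường", "ho khan", "khó thở"],
    "Drug": ["an thần", "chống ho", "corticoid", "thuốc giãn phế quản"],
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return {entity class: [terms found]} using case-insensitive substring lookup."""
    lowered = text.lower()
    found = defaultdict(list)
    for entity_class, terms in gazetteer.items():
        for term in terms:
            if term in lowered:
                found[entity_class].append(term)
    return dict(found)


# Reconstructed sample passage (assumption: diacritics restored by hand).
SAMPLE = (
    "Bệnh phổi cấp tính là một trong những nguyên nhân tử vong chính của người già. "
    "Triệu chứng thường gặp là người mệt mỏi, đôi khi có lú lẫn, sốt thất thường, "
    "ho khan nhiều và nặng nhọc, có khi khó thở. "
    "Các thuốc an thần, chống ho phải được sử dụng một cách thận trọng, "
    "nếu do hen phế quản thì phải dùng corticoid và thuốc giãn phế quản."
)

record = extract_entities(SAMPLE)
```

    Substring lookup ignores word boundaries and context, so it can produce false matches on real text; it only shows how unstructured text is mapped onto the Disease, Symptom and Drug slots of a structured record.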


    Chapter 2

    BUILDING A VIETNAMESE MEDICAL ONTOLOGY

    2.1. Introduction to ontologies

    2.1.1. The concept of an ontology

    In recent years the term "ontology" has no longer been confined to artificial intelligence laboratories; it has become common in many areas of life. From the point of view of artificial intelligence, an ontology is a description of concepts and of the relations between those concepts, intended to express a view of the world. In other areas of science, an ontology comprises a basic vocabulary or a resource for a specific domain, by means of which researchers can store, manage and exchange knowledge in the most convenient way [2].

    Many definitions of an ontology exist today, and some of them contradict others. This thesis presents only one general and widely used definition, given by Kincho H. Law: "An ontology is a representation of a set of concepts (objects) in a specific domain and of the relations between these concepts." An ontology is thus the combination of a shared vocabulary and descriptions of the meaning of its terms in a form that computers can understand.

    Figure 4. The meaning of an ontology

    [Figure 4 labels: "a shared vocabulary"; "a formal characterization of its meaning"; "Ontology".]

    Figure 4 illustrates the meaning of an ontology, in which the shared vocabulary (Vocabulary) consists of the names of classes and relations. For example, there may be Vocabulary (...), Categories (Cat, White, Leg, Fish, Animal, ...), Relations (Is-a, Part-of, hasMother, ...), Characterization (...), and relation instances such as "A cat is an animal" and "A cat has four legs".

    Figure 5. Hierarchical structure of the BioCaster ontology [11]

    2.1.2. Components of an ontology

    The main components of an ontology are classes (Class), properties (Property) and individuals (Individual).

    A class is a set of individuals, described logically so as to define the objects that belong to the class. Classes are organized in a parent-child hierarchy that acts as a classification of objects. An individual is regarded as an instance of a class; it makes the class more concrete and can be understood as some object in the real world (England, Manchester United, measles, chickenpox, ...).

    A property expresses a binary relation between individuals (a relation between two individuals), that is, it links two individuals together. For example, the property "do_virus" (caused by virus) links an individual of the class "disease" with an individual of the class "virus".

    Properties have four characteristics:
    (1) Functional: an individual is related to at most one other individual, for example the property "has flavor" for individuals of the class "thức_ăn" (food).
    (2) Inverse Functional: the inverse of the property is functional, for example "is the flavor of", the inverse of "has flavor".
    (3) Transitive: if individual a is related to individual b and individual b is related to individual c, then individual a is related to individual c.
    (4) Symmetric: if individual a is related to individual b, then individual b is related to individual a.

    Properties come in three kinds:
    (1) Object Property: links one individual to another individual.
    (2) DataType Property: links an individual to an XML Schema datatype value or an RDF literal.
    (3) Annotation Property: adds metadata about classes, individuals, or properties of the two kinds above.

    Working with Web ontologies requires the Web Ontology Language (OWL). OWL provides annotation properties, which are used to attach information (metadata, that is, data about data) to classes, individuals, and object or datatype properties.
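    The four property characteristics above can be checked on a property stored as a set of (subject, object) pairs. This is a sketch under assumptions: the pair-set representation, the function names, and the sample individuals are chosen here for illustration and are not taken from the thesis or from any OWL tool.

```python
def is_functional(pairs):
    """True if every subject is linked to at most one object."""
    subjects = [a for a, _ in set(pairs)]
    return len(subjects) == len(set(subjects))


def is_inverse_functional(pairs):
    """True if the inverse property is functional (each object has at most one subject)."""
    return is_functional({(b, a) for a, b in pairs})


def is_symmetric(pairs):
    """True if (b, a) is present whenever (a, b) is present."""
    pairs = set(pairs)
    return all((b, a) in pairs for a, b in pairs)


def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing new appears."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


# Hypothetical individuals for the "has flavor" example on the food class.
has_flavor = {("soup", "salty"), ("cake", "sweet")}
part_of = {("finger", "hand"), ("hand", "arm")}
inferred_part_of = transitive_closure(part_of)
```

    A reasoner uses these characteristics in the other direction as well: declaring a property transitive lets it infer ("finger", "arm") from the two stated pairs, while declaring it functional lets it flag a second flavor for the same dish as an inconsistency or as evidence that two names denote the same individual.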

    2.1.3. Related work on ontology construction

    Ontologies are now used widely in fields that deal with semantics, such as artificial intelligence (AI), the Semantic Web and software engineering. Because of these applications, many projects, in Vietnam and around the world, focus on building ontologies for different data domains and for a variety of purposes. In the medical domain, many medical and biological ontologies have been published by The National Center for Biomedical Ontology [52]. This project has released many ontologies in medicine and biology, for example ontologies for cell type, Gene, FMA and Human disease; the list of released ontologies is given in [41].

    Another example is the Disease Ontology [42], a medical vocabulary developed at the Bioinformatics Core Facility in collaboration with the NuGene Project at the Center for Genetic Medicine. This ontology was designed to map diseases and related conditions to specific medical codes such as ICD9CM, SNOMED and others. The Disease Ontology is also used to associate model organism phenotypes with human diseases, as well as in medical data mining. It is implemented as a directed acyclic graph and uses UMLS (Unified Medical Language System) as the vocabulary through which it accesses other medical ontologies such as ICD9CM.

    An English-language ontology that has been mentioned frequently in the medical field recently is GENIA [43]. Its main focus is cell reactions in humans. The ontology concentrates on medical topics and is also used in natural language processing tasks: information retrieval (IR), information extraction, text classification and text summarization. The figure below shows the hierarchical structure of the GENIA ontology.

    Many medical ontologies have been built around the world. In Vietnam, however, although semantic search is an active research topic, medical ontologies hardly exist, so users searching for web pages about drugs and diseases do not yet get complete and effective results. One ontology does cover medical terms in Vietnamese: the BioCaster ontology [44]. It was produced by the BioCaster project, developed at the National Institute of Informatics, Japan, in collaboration with universities in Japan, Thailand, Vietnam and elsewhere. It is a multilingual ontology covering languages such as Japanese, English, Thai and Vietnamese.

    The BioCaster ontology [11] contains terms in many languages, including 371 Vietnamese terms relating to diseases, viruses and symptoms in Vietnam. Although this ontology supports extraction from Vietnamese text, the medical reports about Vietnam that it produces are in English. Consequently, terms, entities, diseases and viruses are written in Vietnamese, while relations are described in English. For example, the term Vietnamese_103 has the label "vi rút gây bệnh thủy đậu" (the virus that causes chickenpox), with hasLanguage: vi (Vietnamese) and hasRootTerm: VIRUS_124.

    2.2. Theory of ontology construction

    2.2.1. Ontology construction methods

    Research on the process of building ontologies is attracting increasing attention. Many research groups have proposed different methods for building an ontology.

    The Uschold & King method was derived from the development of the Enterprise Ontology. It focuses on helping developers work from the purpose of the ontology to possible development directions, and then on evaluating and documenting the ontology. During construction, existing ontologies can be integrated into the ontology being built. Three approaches are proposed for defining the main concepts of the ontology: top-down, bottom-up and middle-out. This methodology is application-independent: the purpose for which the ontology is built and the process of building it do not depend on each other, so the same method can be used for any application [17].

    The next methodology was developed by Gruninger and Fox [16] through the Toronto Virtual Enterprise (TOVE) ontology project. It originates from ideas about developing knowledge-based systems using first-order logic. In this method the most salient concepts are defined first and are then specialized and generalized in the appropriate directions. In other words, the method starts from a set of prominent concepts, moves down to lower-level concepts and generalizes up to higher levels. It uses the middle-out approach to define concepts and is partly dependent on the future application of the ontology: before building the ontology, the user needs to decide what it will be used for and which application it will be integrated into.

    METHONTOLOGY is an ontology construction method developed by the Artificial Intelligence Laboratory of the Polytechnic University of Madrid. It allows users either to build a new ontology from a new design or to reuse existing ontologies. The METHONTOLOGY framework helps users build ontologies at the knowledge level and includes a definition of the ontology development process together with a number of supporting techniques (for example planning and scheduling, quality control, data and knowledge acquisition, and configuration management). This methodology uses the middle-out strategy and is application-independent.

    2.2.2. Ontology construction tools

    Ontology construction and development toolkits consist of supporting tools and environments that let users build a new ontology from a new design or reuse existing ontologies. Earlier development environments include Ontosaurus, Ontolingua and WebOnto. More recent and widely used tools include OntoEdit, OilEd, WebODE, Chimaera, DAG-Edit and Protégé.

    The Ontolingua server [45] is an ontology construction toolkit developed in the 1990s at the Knowledge Systems Laboratory (KSL) of Stanford University (USA). Its main modules are the ontology editor and other modules such as Webster and the OKBC (Open Knowledge Base Connectivity) server.


    Ontosaurus [46] was developed in the same period by the Information Sciences Institute (ISI) of the University of Southern California (USA). OntoSaurus consists of two main modules: an ontology server (using Loom) and a web browser for Loom ontologies. The toolkit also supports KIF, KRSS and C++, and OntoSaurus ontologies can be accessed through the OKBC protocol of the Ontolingua server.

    WebOnto is an ontology editor for OCML (Operational Conceptual Modelling Language) ontologies, developed by the Knowledge Media Institute (KMi) at the Open University. It is implemented in Java with a web server and lets users browse and modify knowledge models over the Internet. Its main strength is that it allows several people to collaborate on changing and refining an ontology [26].

    The tools above (Ontolingua server, Ontosaurus and WebOnto) were built solely to support browsing and editing ontologies written in their own languages (Ontolingua, LOOM and OCML). These editors no longer meet users' needs. The new generation of ontology construction tools offers clear advantages and far more features, for example extensibility and component-based architectures that make it easy for users to add functionality to the development environment.

    WebODE [47] is an extensible toolkit developed by the Ontology Group of the Technical University of Madrid (UPM) and is regarded as the successor of ODE (Ontology Design Environment). WebODE runs as a web server with a web interface. The core of the environment is an ontology service that all other services and applications can use. The ontology editor also provides constraint checking, axiom rule creation and parsing with the WebODE Axiom Builder (WAB), documentation in HTML, and ontology import and export in several formats [XML\RDF[s], OIL, DAML+OIL, CARIN, Flogic, Java and Jess].

    OilEd [48] is an ontology editor that lets users build ontologies in OIL and DAML+OIL. It was built by the University of Manchester, the University of Amsterdam and Interprice GmbH.

    Protégé 2000 [51] is one of the most widely used toolkits today and was developed by Stanford University. It was designed with two goals: interoperability with other systems, and ease of use together with support for information extraction tools. The core of the environment is an ontology editor. In addition, Protégé offers a large number of plugins that provide functions such as managing multiple ontologies, inference services and ontology language import and export.

    2.2.3. Ontology languages

    Typical ontology construction languages (ontology languages) currently include LOOM, LISP, Ontolingua, XML, SHOE, OIL, DAML+OIL and OWL. Ontology languages fall into three groups: vocabularies defined in natural language, object-based knowledge representation languages such as UML, and languages based on first-order predicate logic such as Description Logics. An ontology language should be compatible with other tools, natural and easy to learn, and compatible with current web standards such as XML, XML Schema, RDF and UML. Some web-based languages are described below.

    Extensible Markup Language [XML] is an open W3C standard for representing data that is more flexible and more powerful than HTML. RDF (Resource Description Framework) was developed as a framework for describing and exchanging metadata [12].

    SHOE (Simple HTML Ontology Extensions) was created in 1996 at the University of Maryland as an extension of HTML that incorporates semantic knowledge into existing web documents by annotating HTML pages [27].

    OIL (Ontology Inference Layer) is an extension of RDF developed by the On-To-Knowledge project as a language for describing and exchanging ontologies. It combines a frame-based language with formal semantics and the inference services of description logics. The language is organized in three levels: the object level (concrete instances), the first meta level (ontological definitions) and the second meta level (relations) [8].

    DAML+OIL was developed under a DARPA project in 2000. Both OIL and DAML+OIL allow the description of concepts, taxonomies, binary relations, functions and instances [9].


    OWL is an ontology language in common use today, optimized for data exchange and knowledge sharing. It is used when the information contained in documents needs to be processed by applications. OWL can represent the meaning of the terms in a vocabulary and the relationships between those terms. OWL comprises OWL Lite, OWL DL [RDF] and OWL Full.

    2.3. Building a Vietnamese medical ontology

    Designing and building an ontology involves the following steps:

    - Define the classes in the ontology.
    - Arrange the classes in a taxonomic hierarchy.
    - Define the properties (slots) and describe the values allowed for these properties.
    - Fill in the slot values for the instances.

    The knowledge base is then created by defining instances of these classes together with their values.
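    The steps above can be illustrated with a small Python sketch. This is only a toy model under our own assumptions, not the Protégé OWL representation used in the thesis; the class names `OntologyClass` and `Instance`, the slot names and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    """A class (concept) with an optional parent and typed slots."""
    name: str
    parent: Optional["OntologyClass"] = None
    slots: dict = field(default_factory=dict)  # slot name -> allowed Python type

    def all_slots(self) -> dict:
        """Slots defined here plus those inherited along the hierarchy."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}


@dataclass
class Instance:
    """An instance of a class with concrete slot values."""
    cls: OntologyClass
    values: dict

    def __post_init__(self):
        # Reject values for undeclared slots or of a disallowed type.
        allowed = self.cls.all_slots()
        for slot, value in self.values.items():
            if slot not in allowed:
                raise ValueError(f"unknown slot: {slot}")
            if not isinstance(value, allowed[slot]):
                raise TypeError(f"bad value type for slot: {slot}")


# Steps 1-3: define classes, arrange them in a hierarchy, declare slots
# (hypothetical names chosen for illustration).
entity = OntologyClass("ThucThe", slots={"label": str})
disease = OntologyClass("Benh", parent=entity, slots={"symptoms": list})

# Step 4: fill in slot values for an instance.
measles = Instance(disease, {"label": "sởi", "symptoms": ["sốt", "phát ban"]})
```

    Checking slot values when an instance is created mirrors the idea that each slot only admits the values declared for it; a real ontology editor enforces far richer constraints.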

    There is no single method that can be called the correct one for building every ontology [18]. The choice of a suitable construction method depends on the purpose and nature of each ontology. After surveying medical data and a number of ontology development methods, we chose the Protégé OWL environment to build an experimental medical ontology in Vietnamese.

    After collecting and surveying the data, we listed the important terms so that definitions can be provided to users; as future work, these terms could be linked automatically to existing definitions on Wikipedia. The properties of these terms are defined next. Building an ontology is an iterative process that starts with defining the concepts in the class hierarchy and describing the properties of those concepts.


    Chapter 3

    ENTITY RECOGNITION

    3.1. Introduction to the entity recognition problem

    3.1.1. Overview of entity recognition

    Entity recognition can be understood simply as classifying the words of a document into predefined entity classes such as person (PER), organization (ORG), location (LOC), disease (BENH), symptom (TCHUNG) and drug (THUOC). Entity recognition provides a shallow analysis of the text; the entities answer important questions, which is why it is commonly used in question answering systems.

    Many methods have been used to solve the entity recognition problem, from hand-crafted methods to machine learning methods such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Conditional Random Fields (CRF) and Support Vector Machines (SVM).

    A typical example of the hand-crafted approach is the Proteus entity type recognition system from New York University, which took part in MUC-6. The system is written in Lisp and relies on a large number of rules. Most of these rules, however, have many exceptions, some of which only appear once the system is in use and are therefore hard to handle exhaustively. Below are some examples of rules used by Proteus together with their exceptions [1]:

    Rule: Title Capitalized_Word => Title Person Name
    Correct case: Mr. Johns, Gen. Schwarzkopf
    Exception: Mrs. Fields Cookies (a company).

    Rule: Month_name number_less_than_32 => Date
    Correct case: February 28, July 15
    Exception: Long March 3 (the name of a Chinese rocket).
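    To make the behaviour of such rules concrete, the two rules above can be approximated with regular expressions. This is a hedged sketch in Python, not the Lisp implementation of Proteus; the title list, the pattern names and the function `apply_rules` are our own assumptions.

```python
import re

# Rule 1 (approximation): Title Capitalized_Word => person name.
TITLE_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Gen)\.\s+[A-Z][a-z]+")

# Rule 2 (approximation): Month_name number_less_than_32 => date.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RULE = re.compile(rf"\b(?:{MONTHS})\s+(?:[12]?\d|3[01])\b")


def apply_rules(text):
    """Return (label, matched text) pairs produced by the two rules."""
    found = [("PERSON", m.group()) for m in TITLE_RULE.finditer(text)]
    found += [("DATE", m.group()) for m in DATE_RULE.finditer(text)]
    return found


# The intended cases are recognized.
ok = apply_rules("Mr. Johns met Gen. Schwarzkopf on February 28.")

# The exceptions are matched as well, which is exactly the weakness noted above:
# a company is tagged as a person and a rocket name as a date.
company = apply_rules("Mrs. Fields Cookies opened a store.")
rocket = apply_rules("Long March 3 was launched.")
```

    Handling these false positives requires further rules or exception lists, which is why the rule set of a hand-crafted system keeps growing.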

    Compared with hand-crafted methods, which are time-consuming and labor-intensive and still do not deliver the desired results, machine learning methods are now receiving more research attention. Most methods have their own strengths and also some limitations that stem from the nature of each model. Typical examples are Hidden Markov Models (HMM) and their improved variants MEMM and CRF; in these models each state corresponds to one of the entity tags, and the observations are the words of the sentence under consideration. Support Vector Machines (SVM) are another machine learning method that gives very promising results.

    3.1.2. Selected research results on entity recognition

    Internationally, entity recognition has long been studied and has achieved impressive results. Many methods, from hand-crafted to machine learning methods, have been applied to the problem. In a 2007 study [5], David Nadeau reviewed a number of representative earlier works related to entity recognition. His review is summarized below.

    A typical example of the hand-crafted approach is the Proteus entity type recognition system from New York University, which took part in MUC-6; it is written in Lisp and relies on a large number of rules. In 1998, Radev studied the recognition of descriptions of a given entity; for example, "Bill Clinton" may be described as "the President of the U.S.", "the democratic presidential candidate" or "an Arkansas native". The systems of Fung (1995) and Huang (2005) address the translation of entities from one language to another (for example, the Vietnamese translation of the entity "College of Technology" would be "Trường Đại học Công nghệ"); this system was reported to make fewer than 10% translation errors. In 2001, Charniak and colleagues published results on recognizing the structure of the parts of a person name; for example, the phrase "Doctor Paul R. Smith" is split into title, first name, middle name and surname. This is an important preprocessing step for an entity recognizer and makes it possible to determine that "John F. Kennedy" and "President Kennedy" are the same person. Also in 2001, the "record linkage" system of Cohen and Richman was built to find all forms of the same entity across an entire database. In 2002, Dimitrov and colleagues addressed the use of pronouns as substitutes; for example, in the sentence "Rabi finished reading the book and he replaced it in the library", the pronoun "he" stands for "Rabi". This work has many practical applications, for instance in automatic question answering systems. In 2003, Mann and Yarowsky built a system for disambiguating person names; this technique is used to build biographies, which underpin search engines such as Zoominfo.com and Spock.com. In 2005, Nadeau and Turney published results on recognizing the full forms of abbreviations in a given document, for example "IBM" as the abbreviation of "International Business Machines". A 2006 study by Agbago aimed to build a system that restores the correct casing of words, including ensuring that the first character of a sentence and of an entity is capitalized, which is very useful in machine translation.

    In the same study [5], David Nadeau used the ENAMEX entity tag set following the MUC-7 (Message Understanding Conference 7) template and carried out training and testing on the Medstract Gold Standard Evaluation Corpus (built by Pustejovsky in 2001). The author used the Weka machine learning toolkit to test several supervised learning algorithms and concluded that the quality of the system depends heavily on the algorithm used, and that his own semi-supervised learning method gave the most promising results.

    To date, many major international conferences have discussed the entity recognition problem and evaluated the entity recognition systems that have been built. Representative examples are MUC (Message Understanding Conference, 1987-1997), MET (Multilingual Entity Task Conference, 1998), ACE (Automatic Content Extraction Program, 2000), HAREM (Evaluation contest for named entity recognizers in Portuguese, 2004-2006) and IREX (Information Retrieval and Extraction Exercise, 1998-1999).

    3.2. Characteristics of Vietnamese data

    Vietnamese is an isolating language: each "tiếng" (syllable) is pronounced separately and is written as a separate unit. This characteristic shows clearly in phonetics, vocabulary and grammar. This section presents some characteristics of Vietnamese as described by the authors of the Vietnam Linguistics Center. Studying these characteristics gives an overview of the features of Vietnamese data, and a clearer understanding of the data makes ontology construction and information extraction more effective.

    3.2.1. Phonetic characteristics

    Vietnamese has a special unit called the "tiếng"; phonetically, each tiếng is a syllable. The Vietnamese phoneme system is rich and balanced, which gives Vietnamese phonetics considerable potential for expressing meaningful units. Many ideophones and onomatopoeic words are strikingly evocative. When forming sentences and utterances, Vietnamese speakers pay close attention to phonetic harmony and to the rhythm of the sentence.

    3.2.2. Lexical characteristics

    In general, each tiếng is a meaningful element. The tiếng is the basic unit of the system of meaningful units in Vietnamese. From tiếng, other lexical units are formed to name things and phenomena, mainly through compounding and reduplication.

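    The creation of lexical units by compounding is always governed by rules of semantic combination, for example: đất nước (country), máy bay (airplane), nhà lầu xe hơi (mansion and car), nhà tan cửa nát (ruined home). Today this is the main way of producing new lexical units. Using this method, Vietnamese makes full use of native word-forming elements as well as elements borrowed from other languages to create new words and phrases, for example tiếp thị (marketing), karaoke, thư điện tử (e-mail), thư thoại (voice mail), phiên bản (version), xa lộ thông tin (information highway), siêu liên kết văn bản (hypertext link), truy cập ngẫu nhiên (random access), and so on.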

    In reduplication, the creation of lexical units is governed mainly by rules of phonetic coordination, for example chôm chỉa, chỏng chơ, đỏng đa đỏng đảnh, thơ thẩn, lúng lá lúng liếng, and so on.

    The core vocabulary of Vietnamese consists largely of monosyllabic words (one syllable, one tiếng). Flexibility in use and the ease of creating new words have favored the growth of the vocabulary, which is both large in size and diverse in use. The same thing, phenomenon, activity or property can be expressed by many different words. The potential of the Vietnamese vocabulary is fully exploited in the functional styles of the language, especially in the literary style. Today, with the rapid development of science and technology, and of information technology in particular, this potential is being exploited even more strongly.

    3.2.3. Grammatical characteristics

    Vietnamese words do not inflect. This characteristic governs the other grammatical characteristics. When words combine into structures such as phrases and sentences, Vietnamese relies heavily on word order and function words.

    Arranging words in a particular order is the main way of expressing syntactic relations. In Vietnamese, "Anh ta lại đến" (he came again) differs from "Lại đến anh ta" (now it is his turn). When words of the same class combine in a head-modifier relation, the first word is the head and the second is the modifier. Because of word order, "củ cải" differs from "cải củ", and "tình cảm" differs from "cảm tình". Subject before predicate is the common order of Vietnamese sentence structure.

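    Function words are also a principal grammatical device in Vietnamese. Through function words, the combination "anh của em" (my brother) differs from "anh và em" (you and I) and "anh vì em" (I, for your sake). Together with word order, function words allow Vietnamese to form many sentences with the same basic informational content but different expressive nuances. For example, compare the following sentences: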

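    - Ông ấy không hút thuốc. (He does not smoke.)

    - Thuốc, ông ấy không hút. (Cigarettes, he does not smoke.)

    - Thuốc, ông ấy cũng không hút. (Cigarettes, he does not smoke either.)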

    Besides word order and function words, Vietnamese also uses intonation. Intonation helps express the syntactic relations between the elements of a sentence and thereby conveys the intended content. In written text, intonation is usually indicated by punctuation. The difference in content can be seen by comparing the following two sentences:

    - Đêm hôm qua, cầu gãy. (Last night, the bridge collapsed.)

    - Đêm hôm, qua cầu gãy. (At night, crossing a broken bridge.)

    The salient characteristics above give some idea of the identity and potential of Vietnamese, as well as of the difficulties encountered in entity recognition and information extraction for Vietnamese.

    3.3. Entity recognition methods

    Many methods have been proposed for entity recognition. Nevertheless, the main stages of the task can be summarized as follows:

    - Preprocessing: removing HTML, sentence splitting, word segmentation.
    - Feature selection: choosing the tag set and the context patterns (features such as capitalized, lowercase, ...).
    - Training: using HMM, CRF, MEMM, SVM, ...
    - Tagging and recovery.


    The choice of tags depends on the domain of the entity recognition task. The seven most general basic tag types, which are usually chosen first (following Ralph & Beth, [5]), are: ORG (organization), LOC (location), PER (person), DATE, TIME, CUR (currency) and PCT (percentage). The tag set can be changed and extended depending on the project. The BioCaster project [11] defines 22 tags for the medical domain.

    Each assigned tag consists of three parts:

    - Boundary category: determines the position of the current word within an entity.
    - Entity category: determines the entity type.
    - Feature set: determines the context information (context patterns).

    There are several ways to represent the boundary part. The most commonly cited and used representation gives each tag a prefix: B_ (beginning of an entity) or I_ (inside an entity), with the tag O for words that are not part of an entity. For example, "bệnh viêm não nhật bản" (Japanese encephalitis) can be tagged as B_DIS I_DIS I_DIS I_DIS.
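    The B_/I_/O scheme can be sketched in a few lines of Python. The helper names `bio_tags` and `decode_bio`, the span format and the word segmentation of the example sentence (four tokens for the disease name, with "Nhật_Bản" treated as one word) are our own assumptions for illustration.

```python
def bio_tags(tokens, entities):
    """Turn entity spans into one B_/I_/O tag per token.

    `entities` is a list of (start, end, type) with token indices and an
    exclusive end.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B_{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I_{etype}"
    return tags


def decode_bio(tokens, tags):
    """Recover (type, text) entities from a B_/I_/O tag sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B_"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I_") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]


tokens = ["bệnh", "viêm", "não", "Nhật_Bản", "rất", "nguy_hiểm"]
tags = bio_tags(tokens, [(0, 4, "DIS")])
entities = decode_bio(tokens, tags)
```

    Encoding and decoding are inverse operations here, which is what allows a sequence tagger that predicts one tag per word to output whole entities.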

    Selecting context patterns is an important problem that determines the accuracy of entity recognition. The context pattern at any observed position provides the context information. Any complete entity recognition system must build a set of context patterns that is accurate and that describes the specific domain of the recognition task. For general entity recognition, typical patterns are: capitalized, lowercase, the % character, digits, dot, comma, and so on. The analogous task in the medical domain is to select context patterns for recognizing proteins, genes, drugs and cells.
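    As a rough illustration of the general-domain patterns just listed, the following Python sketch computes a few surface features for one token. The function name `surface_features` and the feature names are hypothetical; a real system would add domain-specific patterns on top of these.

```python
def surface_features(token):
    """Simple orthographic features of a single token."""
    return {
        "init_cap": token[:1].isupper(),       # first character is upper case
        "all_caps": token.isupper(),           # e.g. abbreviations
        "all_lower": token.islower(),
        "all_digits": token.isdigit(),
        "one_digit": token.isdigit() and len(token) == 1,
        "has_digit": any(ch.isdigit() for ch in token),
        "has_percent": "%" in token,
        "is_comma": token == ",",
        "is_dot": token == ".",
    }


aids = surface_features("AIDS")
percent = surface_features("15%")
```

    Each feature is a boolean, so the resulting dictionary can be passed directly to most feature-based sequence labelers.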

    Types of context patterns [6]:

    - Basic orthographic patterns (capitalized, lowercase, dot, comma): comma, dot, oneDigit, AllDigits, ...
    - Morphological patterns: prefixes and suffixes (~virus, ~lipid, ~vitamin, ...).
    - Grammatical patterns: verb phrases, noun phrases, ...
    - Semantic trigger patterns:
      - Head noun triggers: the head noun of a phrase ("B cell" in "activated human B cells", "bệnh" in "bệnh viêm xoang").
      - Special verb triggers: nhiễm (infect), lây (transmit), bao gồm (include), gây ra (cause).

    3.3.1. Rule-based and semi-supervised methods

    A rule-based system consists of a set of basic rules (If-Then), a set of facts, and an interpreter that uses the rules to derive facts. With the rule-based method, we first build an initial set of rules and entities. Through semi-supervised learning and the bootstrapping technique, we then extend both the entity set and the initial rule set.

    Semi-supervised learning [28] is a machine learning approach that uses both labeled and unlabeled data for training. It combines the advantages and reduces the drawbacks of supervised and unsupervised learning. The main task of semi-supervised algorithms is to expand a small initial training set into a larger one.

    A key technique in semi-supervised learning is bootstrapping. It involves a small degree of supervision: training starts from an initial data set (the seed set). For example, a disease name recognition system initially requires a small sample set of disease names. The system then searches for sentences containing these names and tries to find context information shared by several of them (for example, context information shared by groups of five sample disease names). From this context information, the system finds further disease names that appear in similar contexts. The training process is repeated to find new examples and to discover new relevant context information. By repeating this process, a large number of disease names and a large number of contexts are collected.
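    The loop described above can be sketched as follows. This is a deliberately simplified Python illustration, not the system built in the thesis: the context of a name is reduced to the single word to its left, and the function name `bootstrap`, the `min_support` threshold and the toy sentences are our own assumptions.

```python
def bootstrap(sentences, seeds, max_rounds=5, min_support=2):
    """Grow a set of names and a set of contexts from a few seed names.

    `sentences` is a list of token lists. A context is the word immediately
    to the left of a known name; it is kept once it has been seen with at
    least `min_support` known names.
    """
    names, contexts = set(seeds), set()
    for _ in range(max_rounds):
        # Step 1: count the left-context words of the names known so far.
        counts = {}
        for tokens in sentences:
            for prev, cur in zip(tokens, tokens[1:]):
                if cur in names:
                    counts[prev] = counts.get(prev, 0) + 1
        contexts |= {ctx for ctx, n in counts.items() if n >= min_support}

        # Step 2: every word that follows a trusted context is a candidate name.
        found = {cur for tokens in sentences
                 for prev, cur in zip(tokens, tokens[1:]) if prev in contexts}
        if found <= names:  # nothing new was learned, so stop
            break
        names |= found
    return names, contexts


sentences = [
    ["mắc", "bệnh", "sởi"],
    ["điều_trị", "bệnh", "lao"],
    ["phòng", "bệnh", "cúm"],
    ["nhiễm", "cúm"],
    ["nhiễm", "lao"],
    ["nhiễm", "HIV"],
]
names, contexts = bootstrap(sentences, seeds={"sởi", "lao"})
```

    In this toy run the context "bệnh" is learned first and yields "cúm"; with "cúm" known, the context "nhiễm" gains enough support and yields "HIV". On real text such a loop also picks up wrong names, so practical systems filter candidates at every round.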

    3.3.2. Finite state machine methods

    Finite state machine methods share the general scheme of a finite state machine (FSM), also called a finite state automaton (FSA). A finite state machine can be regarded as an abstract machine, used in the study of computation and languages, with a finite and fixed number of states. It is represented as a directed graph with a finite number of nodes (the states), where each node has zero or more arcs (transitions) leading to other nodes. Given an input string, the task is to determine the matching sequence of transitions. There are several kinds of finite state machine. An acceptor answers "yes" or "no" to whether the input string is accepted. A recognizer classifies the input string. A transducer produces an output string corresponding to the input string. The finite state machine model used in information extraction is a transducer: for an input text string, the system outputs the string of features corresponding to the keywords in that text. Under another classification, there are two kinds of finite state machine: deterministic (deterministic finite automaton, DFA) and non-deterministic (non-deterministic finite automaton, NFA).

    A finite state machine consists of:

    - An alphabet $\Sigma$.
    - A set of states $S$, in which
      - for a DFA: there is one start state and zero or more accepting (final) states;
      - for an NFA: there are one or more start states and zero or more accepting (final) states.
    - A transition function $T : S \times \Sigma \rightarrow S$.

    The machine operates as follows. Starting from the start state (or set of start states), it examines each character of the input string over the alphabet $\Sigma$ in turn and uses the transition function $T$ to move to the next state, until every character of the string has been examined. If it ends in a final state, the run is successful. In that case, the string of states visited while processing the input is regarded as the output string, also called the label string matching the input string.
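    The operation of a deterministic machine can be sketched in Python as follows. The representation is our own assumption: the transition function is a dictionary keyed by (state, symbol), a missing entry is treated as rejection, and the example automaton over token classes is hypothetical.

```python
def run_dfa(transitions, start, accepting, symbols):
    """Run a DFA and return (accepted, list of visited states).

    `transitions` maps (state, symbol) -> next state. A missing entry means
    the machine has no move and the input is rejected.
    """
    state = start
    visited = [state]
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return False, visited
        state = transitions[key]
        visited.append(state)
    return state in accepting, visited


# Hypothetical automaton: a title followed by one or more capitalized words.
transitions = {
    ("q0", "TITLE"): "q1",
    ("q1", "CAP"): "q2",
    ("q2", "CAP"): "q2",
}
accepted, states = run_dfa(transitions, "q0", {"q2"}, ["TITLE", "CAP", "CAP"])
rejected, _ = run_dfa(transitions, "q0", {"q2"}, ["CAP", "TITLE"])
```

    The returned list of visited states plays the role of the label string mentioned above: one state per processed symbol, plus the start state.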

    The finite state machine model used in information extraction is extended with some additional elements, mainly concerning the transition function T, which is usually described as a Markov process.

    3.3.3. Gazetteer-based methods

    A gazetteer dictionary (or simply gazetteer) is a list of entities such as person names, organizations and locations; in the medical domain, it is a list of diseases, drug names, symptoms and causes. A gazetteer that is good, complete and accurate is an important prerequisite for an entity recognition system. Besides building the ontology, this thesis therefore also covers the construction of an initial gazetteer for Vietnamese medical text. Entity recognition based on this gazetteer gives promising results.

    Gazetteer files are declared in the format a.lst:b:c, where a.lst is the file containing the instances of entity class a, b is the major type and c is the minor type. Put simply, a class of the minor type is a subclass of the class of the major type. For example, the gazetteer files for the causes of a disease are declared as follows:

    nguyen_nhan.lst:nguyen_nhan:vikhuan,
    nguyen_nhan.lst:nguyen_nhan:tac_nhan.
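    A declaration in the a.lst:b:c format can be read with a few lines of Python. This is a minimal sketch under our own assumptions: the names `GazetteerEntry` and `parse_gazetteer_entry` are hypothetical, exactly three colon-separated fields are expected, and trailing list punctuation such as "," or "." is ignored.

```python
from typing import NamedTuple


class GazetteerEntry(NamedTuple):
    list_file: str  # file holding the instances, e.g. "nguyen_nhan.lst"
    major: str      # major type
    minor: str      # minor type (a subclass of the major type)


def parse_gazetteer_entry(line):
    """Parse one 'a.lst:b:c' declaration into its three fields."""
    cleaned = line.strip().rstrip(",.").strip()
    parts = cleaned.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'file:major:minor', got: {line!r}")
    return GazetteerEntry(*parts)


entries = [
    parse_gazetteer_entry("nguyen_nhan.lst:nguyen_nhan:vikhuan,"),
    parse_gazetteer_entry("nguyen_nhan.lst:nguyen_nhan:tac_nhan."),
]
minors = sorted(e.minor for e in entries if e.major == "nguyen_nhan")
```

    Grouping entries by major type, as in the last line, is how the minor types can be treated as subclasses of one major class.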

    Figure 6: Some gazetteer files built for the entity recognition task.

    Quite a few papers discuss the use of data sets for entity recognition. In the paper on building a data set for entity recognition (presented in Section 3.4.1), the authors stress the importance of building an initial data set for the recognition process. The paper uses BioCaster NE to annotate the data and Yamcha to train an SVM model on the annotated articles [20].

    3.4. Vietnamese medical entity recognition

    3.4.1. Vietnamese entity recognition

    Several studies discuss the use of data sets for Vietnamese entity recognition. Nguyễn Cẩm Tú [1] built an entity type recognition system based on Conditional Random Fields (CRF) that identifies 8 entity types, corresponding to 17 tags. The experiments used the FlexCRFs tool (an open-source tool developed by Phan Xuân Hiếu and Nguyễn Lê Minh) on a data set of 50 business news articles (nearly 1,400 sentences) taken from http://vnexpress.net.

    Thao P.T.X. and colleagues [21] explore voting strategies that combine several word-based learners. Their main idea is that combining learners that use different classification algorithms (SVM, CRF, TBL, Naïve Bayes) gives better results than using each algorithm on its own.
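    The idea of combining several taggers can be illustrated with plain majority voting over their tag sequences. This Python sketch is only an assumption about the simplest possible strategy and is not claimed to be the voting scheme of [21]; the function name `majority_vote`, the tie-breaking rule and the sample predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine tag sequences from several classifiers, position by position.

    `predictions` holds one tag sequence per classifier, all of equal length.
    Ties are broken in favor of the classifier listed first.
    """
    voted = []
    for tags in zip(*predictions):
        counts = Counter(tags)
        best = max(counts.values())
        voted.append(next(tag for tag in tags if counts[tag] == best))
    return voted


# Hypothetical outputs of three taggers for the same three-word sentence.
svm_tags = ["B_PER", "I_PER", "O"]
crf_tags = ["B_PER", "O", "O"]
nb_tags = ["O", "I_PER", "O"]
combined = majority_vote([svm_tags, crf_tags, nb_tags])
```

    Here each individual tagger makes at most one mistake, and the vote recovers the full entity because the mistakes fall on different positions.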

    In [20], Thao P.T.X. and colleagues stress the importance of building an initial data set for entity recognition. The authors use BioCaster NE to annotate the data and Yamcha to train an SVM model, building on related work. Working on the detection of infectious diseases from online health news, they argue that building the data set plays a very important role in entity recognition, and they propose 22 entity tags for labeling and annotating the data.

    A representative study on entity recognition in Vietnam is the VN-KIM IE tool [40], built by a research group led by Associate Professor Dr. Cao Hoàng Trụ at the Ho Chi Minh City University of Technology. VN-KIM IE automatically recognizes named entities on Vietnamese web pages and annotates them with their classes.

    3.4.2. Vietnamese medical entity recognition

    Internationally, several researchers (John McNaught [10], Sammy Wang [25], ...) have pointed out a number of difficulties in processing medical data. The most typical are the ambiguity and diversity of words: entities in medical data have complex structures, and the rules by which they are formed sometimes differ from ordinary usage; there is still no clear convention for naming entities; there are problems of synonyms, antonyms and abbreviations; in many cases a word is not used in its usual sense; and several words may denote the same concept while one word may have several meanings.

    Entity recognition for Vietnamese medical text faces further obstacles beyond the general difficulties described above. Vietnamese lacks training data and lookup resources (such as WordNet for English), and it lacks grammatical information (POS) and phrase information such as noun phrases and verb phrases, although this information plays an important role in entity recognition; word boundaries are unclear, which easily leads to ambiguity. The nature of medical data causes additional difficulties: the information is unstructured or semi-structured (drug names, viruses), entity names are abbreviated in various ways, entity names are long and varied, and the same entity may be written in different ways. For Vietnamese disease entities in particular, the following characteristics make recognition difficult:

    - They do not follow any capitalization rule.
    - The number of words is hard to bound: some disease names consist of a single word (such as bệnh "sởi", bệnh "chn"), while others consist of many words, such as "chứng rối loạn tâm thần thể hoang tưởng".
    - The structure of the words forming an entity can be very complex: "rối loạn chức phận não nhẹ ở trẻ em".
    - There are many loanwords and Sino-Vietnamese words: stress, bệnh paranoia, bệnh gout, bệnh thiên đầu thống.
    - The same disease sometimes has several spellings that are not quite the same or even entirely different: "thủy đậu" or "trái rạ"; bệnh "gút" or "gout", also called "thống phong"; bệnh "ung thư máu", also called bệnh "máu trắng".
    - There are many abbreviations: AIDS (from the English Acquired Immunodeficiency Syndrome or Acquired Immune Deficiency Syndrome) is translated in many Vietnamese medical documents as "hội chứng suy giảm miễn dịch mắc phải".
    - They contain words that are easily missed, because the phrase counts as an entity with or without them, such as mãn tính (chronic), cấp tính (acute), nguyên phát (primary) and thứ phát (secondary).

    Recognition of entities specific to biological and medical data is also a topic of great research interest. The entities of biomedical data that receive the most attention are: Disease, Drug, Gene, Organism, Protein, Enzyme, Malignancies and Fibrinogen [10] [23].

    One of the simplest methods proposed for entity recognition in medical data is to use dictionaries or predefined vocabularies. A case in point is MeSH [23], a controlled medical vocabulary used for indexing. In essence it is a list of terms approved for use as index terms, and only the terms in this list are accepted in that role. MeSH terms are organized in a tree structure. The MeSH tree has 16 branches, which are the largest and most characteristic groups of terms in medical data, including branch A - Anatomy, branch B - Organisms, branch C - Diseases, branch D - Chemicals and Drugs, and branch G - Biological Sciences. Each branch is divided into sub-branches, for example A01 - Body Regions and A09 - Sense Organs.

    The international BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) series is organized as a competition. BioCreAtIvE I (2003-2004) focused on recognizing Gene and Protein entity names; some representative results are summarized below [32]:

    - Alexander Yeh and colleagues used the data and evaluation software of W. John Wilbur and Lorraine Tanabe and obtained an F-measure of about 80-83%.
    - Shuhei Kinoshita and colleagues treated entity recognition as a form of part-of-speech tagging, adding a GENE tag to the usual tag set. The authors used Brill's part-of-speech tagging method and the TnT tool, which is based on the HMM model. Without post-processing, the system achieved a precision of 68.0%, a recall of 77.2% and an F-measure of 72.3%. With a post-processing step (a set of error-correction rules), it achieved a precision of 80.3%, a recall of 80.5% and an F-measure of 80.4%; with an additional dictionary-based post-processing step, the F-measure reached 80.9%.
    - In 2004, Yi-Feng Lin, Tzong-Han Tsai, Wen-Chi Chou, Kuen-Pin Wu, Ting-Yi Sung and Wen-Lian Hsu published a study applying the Maximum Entropy Markov Model to entity recognition in medical data. The results, given as precision P, recall R and F-measure (2PR/(P+R)), were (0.512, 0.538, 0.525); after post-processing, the corresponding results were (0.729, 0.711, 0.72).
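    The F-measure formula 2PR/(P+R) quoted above can be checked against the reported figures with a short Python sketch. The function name `f_measure` and the convention of returning 0 when both inputs are 0 are our own assumptions.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid division by zero
    return 2 * precision * recall / (precision + recall)


# Reported (P, R) pairs from the BioCreAtIvE I results quoted above.
lin_before = f_measure(0.512, 0.538)       # about 0.525
lin_after = f_measure(0.729, 0.711)        # about 0.72
kinoshita_before = f_measure(68.0, 77.2)   # about 72.3 (in percent)
kinoshita_after = f_measure(80.3, 80.5)    # about 80.4 (in percent)
```

    The recomputed values agree with the reported F-measures after rounding, and the formula works equally for fractions and percentages as long as P and R use the same scale.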

    In 2004, Haochang Wang et al. [7] proposed an entity recognition method for medical data based on a classifier ensemble that combines Generalized Winnow, Conditional Random Fields, Support Vector Machine and Maximum Entropy, with the individual methods combined under three different strategies. The resulting system reached an F-measure of about 77.57%, a fairly good result compared with other studies of the time.

    In 2007, Andreas Vlachos [3] compared two approaches to entity recognition in medical data, one based on an HMM and one based on a CRF combined with syntactic parsing. The two tables below show the experimental results: the left table gives the results when training on a small set of data with manually annotated entities and testing on the full training set, and the right table gives the results when training on a small set of noisy data and testing on the full training set.

    Most recently, in March 2009, Razvan C. Bunescu [45], in presenting relation extraction from medical data, drew attention to the problem of recognizing the entities characteristic of that data; the entities of interest were Diseases, Genes and Proteins. After recognizing these entities, the author took the further important step of extracting the interaction relations between them (for example, a Gene encodes a Protein, or a Protein fulfils its function by interacting with another Protein).


    Chapter 4

    SEMANTIC RELATION IDENTIFICATION

    4.1. Overview of semantic relation identification

    4.1.1. Semantic relations in general

    As presented above, once a set of entity classes is available (from the entity recognition step), building a semantic network of entities requires a further step: semantic relation extraction. A semantic relation can be understood as the underlying relation between two concepts expressed by words or phrases [24]. Semantic relations play an important role in lexical semantic analysis and are therefore used in many other tasks: building lexical-semantic knowledge bases, question answering, text summarization and so on. Typical semantic relations in the medical domain are IS_A (flu - disease), PART_WHOLE (virus - cause) and CAUSE_EFFECT (virus - disease).

    Figure 7: Illustration of a semantic relation for the entity car

    Although semantic relations play an important role in semantic analysis, they are usually implicit, which makes them hard to extract. The question is how to mine these relations effectively from raw (unstructured or semi-structured) data. Answering that question is the main goal of the relation extraction problem, which has received much attention recently.

    4.1.2. Semantic relation extraction

    The goal of semantic relation extraction is to extract specific, specialized relations between entities from large text corpora. In essence, given a pair of entities x-y, the task is to determine the meaning of that pair [24]. For example, from the phrase "insomnia due to stress and anxiety" we can infer the semantic relation: stress and anxiety are causes of insomnia.
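    The inference in this example can be imitated with a very simple cue-phrase rule. The sketch below is only an illustration on the English rendering of the example ("insomnia due to stress and anxiety"); the cue "due to", the relation label and the function name are assumptions, and this is not the classifier-based method the thesis goes on to use.

```python
import re

# Hypothetical cue pattern: "<effect> due to <cause>[, <cause> ...]".
CAUSE_PATTERN = re.compile(r"^(?P<effect>.+?)\s+due to\s+(?P<causes>.+)$")


def extract_cause_effect(phrase):
    """Return (cause, 'CAUSE_EFFECT', effect) triples found by the cue pattern."""
    match = CAUSE_PATTERN.match(phrase.strip())
    if not match:
        return []
    effect = match.group("effect")
    # Several causes may be listed, separated by commas or "and".
    causes = [c.strip() for c in re.split(r",|\band\b", match.group("causes"))]
    return [(cause, "CAUSE_EFFECT", effect) for cause in causes if cause]
```

    A rule like this only covers relations that are signalled explicitly; the implicit relations discussed in the text need the learning-based approach described later in the chapter.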

    Figure 8. Illustration of semantic relation extraction

    Resources for semantic relation extraction include:

    - Data sets: used through co-occurrence and statistical methods.
    - Existing resources on semantic relations, such as WordNet and standard reference sets.
    - Human judgment.

    Like entity recognition, semantic relation identification has its own difficulties: (1) there is no agreement on the number of semantic relations, and relations are hidden in different forms; (2) (noun - noun) combinations do not fully follow fixed constraint rules, semantic relations are usually implicit, a pair of concepts may be linked by several relations, interpretation can depend heavily on context, and there is no well-defined set of semantic relations.


    Semantic relation extraction is part of important international projects in knowledge discovery [24], such as ACE (Automatic Content Extraction), DARPA EELD (Evidence Extraction and Link Discovery), ARDA-AQUAINT (Question Answering for Intelligence), ARDA NIMD (Novel Intelligence from Massive Data) and Global WordNet.

    Figure 9. The place of semantic relation mining in natural language processing

    The semantic relations of interest differ from one domain to another. The table in Figure 10 illustrates some semantic relations in WordNet.


    Figure 10. Semantic relations defined in WordNet [37]

    For the medical domain, our survey collected 12 types of semantic relations, which are described in detail in Chapter 5.


    Figure 11. Some of the semantic relations constructed

    Figure 11 shows some of the semantic relations; their meanings are described in Table 1.


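    Relation | Meaning | Inverse relation
    Gây_ra (causes) | A cause (Nguyên_nhân) causes a disease | Bị_gây_ra_bởi (caused by)
    Có_triệu_chứng (has symptom) | A disease has symptoms | Liên_quan (related to)
    Tại (located at) | An organization (Tổ_chức) is located at a location (Địa_điểm) |
    Chữa_bằng (treated by) | A disease is treated with a drug | Chữa (treats)
    Làm_việc (works at) | A person works at an organization |
    Biến_chứng (complication) | A disease develops into another disease |
    Tương_tác_thuốc (drug interaction) | A drug interacts with a drug |
    Phát_hiện_tại (detected at) | A disease is detected at an organization |
    Tác_động_tốt (good effect) | A food, activity or chemical has a good effect on the human body or a disease |
    Tác_động_xấu (bad effect) | A food, activity or chemical has a bad effect on the human body or a disease |

    Table 1. Explanation of the semantic relations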

    4.1.3. Related work on semantic relation identification

    At the SemEval 2007 workshop [38], identifying semantic relations between two nominals was one of the main topics. The meaning of the two entities depends on the meaning of the other words in the context, and the pair is identified as an instance of some relation type. For example: "cycling" and "happiness" (a cause-effect relation). Relation extraction there was based on 7 basic relations: Cause-Effect, Instrument-Agency, Product-Producer, Origin-Entity, Theme-Tool, Part-Whole and Content-Container.

    In addition, there are methods for extracting relations between two concepts such as: a drug is a treatment for a disease, or a gene is a cause of a disease. Swanson [29] introduced a model for extracting relations of this kind from biomedical databases by uncovering a third concept (for example a physiological function) that is related to both the drug and the disease. Extracting this third concept makes it possible to reveal a relation between the two main concepts that is not stated explicitly in any single document. More concretely: X is related to some disease, Z is related to a drug, and Y is a pathological or physiological function, a symptom and so on. X and Y, and Y and Z, are often mentioned together, whereas X and Z do not appear together in any one research document. The concept Y can then be used as evidence of a link between X and Z.
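    The X, Y, Z reasoning above can be sketched with plain co-occurrence counts. This is a minimal illustration under our own assumptions (each document is reduced to the set of concepts it mentions, and the concept names are placeholders), not Swanson's actual system.

```python
from itertools import combinations


def cooccurring_pairs(documents):
    """Collect every unordered pair of concepts mentioned in the same document."""
    pairs = set()
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            pairs.add((a, b))
    return pairs


def linking_concepts(documents, x, z):
    """Return concepts Y that co-occur with X and with Z when X and Z never co-occur."""
    pairs = cooccurring_pairs(documents)

    def together(a, b):
        return tuple(sorted((a, b))) in pairs

    if together(x, z):
        return []  # the X-Z relation is already explicit in some document
    vocabulary = set().union(*documents)
    return sorted(
        y for y in vocabulary
        if y not in (x, z) and together(x, y) and together(y, z)
    )


# Placeholder documents, each given as the set of concepts it mentions.
docs = [
    {"disease X", "function Y"},
    {"function Y", "drug Z"},
    {"disease X", "unrelated concept"},
]
print(linking_concepts(docs, "disease X", "drug Z"))
```

    In practice the co-occurrence test would be replaced by frequency thresholds over a large literature database, since a single shared mention is weak evidence.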

    Regarding the use of ontologies, several groups of authors have discussed semi-supervised learning with an ontology as a new approach. In this approach the input is a collection of text documents in which entity names corresponding to concepts in the ontology have been identified. Existing data sets such as the GENIA corpus [14] can be used; there the labeling was done manually, but corpus data can also be generated automatically with a suitable NER system. The output is a set of patterns consisting of pairs of classes and a relation in the GENIA ontology (for example the template: virus infect cell).

    Many methods have been proposed for relation identification. The common task, however, is to start from raw text such as Web pages, documents and news, pass it through a semantic parser, and obtain as output knowledge bases (KB) of concepts, relations and links between texts [24]. Figure 12 describes this general task of relation identification.

    Figure 12. The general task of relation identification

    Relation identification can also be understood as determining the meaning of a given pair of nouns (entities) [24]. The meaning is expressed through a list of relations, the recognized entity pairs and some other resources.

    As presented above, the semantic parser plays an important role in extracting semantic relations. It consists of the components shown in Figure 13:


    Figure 13. Components of the semantic parser (SR) [24]

    - Preprocessing: tokenizer, part-of-speech tagger, syntactic parser, word sense disambiguation, named entity recognition.
    - Feature Selection: determines the properties and constraints (or context) that the classifier uses to distinguish semantic relations.
    - Learning Model: classifies input instances into the appropriate relations.

    The semantic parser (SR) performs two main tasks:

    - Labeling: given the predefined semantic relations and a pair of entities (noun - noun), assign a label to the relation between the two entities. For example: wheel - car.
    - Paraphrasing: given a pair of nouns or entities, produce an expression of their relation in the context of the nouns. For example, from "insomnia due to stress" we can infer the relation "stress is a cause of insomnia".

    4.2. Semantic labeling of sentences

    In [30], Xuan-Hieu Phan et al. address cross-document entity disambiguation by assigning semantic labels to the sentences of a text. Cross-document entity disambiguation means distinguishing entities that share the same surface form within a given document collection. For example, given a set of entities that all appear as "Bill Clinton", we must determine which subset of documents is really about Bill Clinton the former US president, which subset is about Bill Clinton the golfer, and which is about some other Bill Clinton.

    Semantic labeling can be viewed as classifying sentences that contain semantic relations. The paper uses a Maxent-based classifier that takes the sentences of a personal summary as input and outputs them with semantic labels. A Maxent-based classifier has the advantage of tightly integrating a very large number (up to hundreds of thousands or millions) of overlapping, non-independent features at different levels.

    The authors of [30] also propose a framework for cross-document entity disambiguation with three main parts, of which semantic labeling of sentences is an indispensable one:

    - Preprocessing: shallow processing is used to collect a summary consisting of the sentences related to the entity in question.
    - Assigning semantic labels to the sentences of the summary so as to place them into different classes. This assignment is performed by a high-accuracy Maxent-based classifier trained with a semi-supervised learning method.
    - Clustering: when computing the similarity between personal summaries, sentences with the same semantic label are compared with each other to measure semantic closeness.

    Figure 14. Framework for resolving personal names across documents.

    Figure 14 shows that semantic labeling of sentences plays an important role in cross-document name resolution and also serves as a basis for semantic relation identification.

    Some semantic labels for sentences are illustrated in Figure 15 below.


    Figure 15. Some semantic labels assigned to sentences [30]

    With these labels, the personal summary of Bill Clinton is labeled as in Figure 16 below.

    Figure 16. Semantic labeling of sentences describing President Bill Clinton [30].

    The thesis experimentally labels 1000 sentences with labels for relations in the medical domain. The labels and the labeled data are presented in detail in Chapter 5.

    4.3. Classifying sentences that contain relations

    4.3.1. Classification in relation identification and entity recognition

    The entities to recognize and the relations to identify depend on the task and the application domain. For example, entity names may be person names, organization names, place names and so on (the ordinary entity recognition task). In the domain of this thesis, entity names may be diseases, drugs, symptoms, causes and so on. For some entity names and relations, however, for example disease names, symptoms and causes, or the relations có_triệu_chứng (has symptom) and có_biến_chứng (has complication), recognizing and telling them apart is itself a complex problem. Disease names often coincide with symptoms or causes: headache and cough, for instance, can be read as diseases, but also as causes or symptoms, depending on the context. Entity recognition and relation identification are thus closely tied to classification: once recognized, entities must be assigned to the correct classes. Moreover, as presented in the previous section, semantic labeling of sentences is itself essentially based on a classification algorithm. For these reasons the thesis discusses the classification problem and the classification algorithms studied in recent years.

    Figure 17 describes the stages of the classification process. The model has three main stages. The first stage is data representation: converting the data (the sentences) into some structured form and gathering the given samples into a training set. The second stage applies machine learning techniques to the training samples just represented, so the representation produced by the first stage is the input to the second. The third stage adds extra knowledge supplied by the user in order to increase accuracy in text representation or in the learning process.

    Figure 17. Stages of the classification process

    In recent years many algorithms have been proposed for classification, for example SVM (Support Vector Machine), k-nearest neighbors and decision-tree classification. These algorithms are described in some detail by Nguyễn Minh Tuấn [2]. We use SVM to classify sentences that contain relations; the following sections present this algorithm in more detail.
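    The first stage of the process, turning sentences into a structured representation, can be sketched as a bag-of-words encoding. This is a minimal sketch under our own assumptions: tokens are obtained by whitespace splitting (Vietnamese text would first need word segmentation), the feature values are raw counts, and the function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Map each distinct token of the training sentences to a vector index."""
    vocabulary = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_vector(sentence, vocabulary):
    """Represent a sentence as token counts over the vocabulary (unknown tokens are dropped)."""
    counts = Counter(sentence.lower().split())
    return [counts.get(token, 0) for token in vocabulary]
```

    The resulting vectors are the kind of high-dimensional input that the learning stage consumes; dimensionality reduction or feature selection would then operate on the vocabulary.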

    [Figure 17 diagram labels: Data (sentences); Initial representation; Dimensionality reduction or feature selection; Final representation; Inductive learning (stage 2); Additional knowledge (stage 3); Classification tools]


    4.3.2. The SVM (Support Vector Machine) algorithm

    The Support Vector Machine (SVM) algorithm was introduced by Cortes and Vapnik in 1995. SVM is very effective for problems with high-dimensional data (such as vectors representing text).

    SVM operates on a training set $D = \{(X_i, C_i),\ i = 1, \dots, n\}$, where $C_i \in \{-1, 1\}$ marks an example as positive or negative. The goal of the algorithm is to find a hyperplane $\alpha_{svm} \cdot d + b = 0$ that divides the data into two regions. Classifying a new document $d$ amounts to determining the sign of $f(d) = \alpha_{svm} \cdot d + b$: the document belongs to the positive class if $f(d) > 0$ and to the negative class if $f(d) < 0$.

    Figure 18: Separation of documents by the sign of the function $f(d) = \alpha_{svm} \cdot d + b$
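    Once the hyperplane is known, classification reduces to evaluating the sign of f(d). The sketch below assumes the weight vector and bias have already been learned; the numeric values in the example are made up, and assigning f(d) = 0 to the negative class is our own convention, since the text leaves that case open.

```python
def decision_value(weights, bias, d):
    """Compute f(d) = weights . d + b for a document vector d."""
    return sum(w * x for w, x in zip(weights, d)) + bias


def classify(weights, bias, d):
    """Return +1 if f(d) > 0 and -1 otherwise."""
    return 1 if decision_value(weights, bias, d) > 0 else -1


# Made-up hyperplane and documents, for illustration only.
weights, bias = [1.0, -1.0], 0.5
print(classify(weights, bias, [2.0, 1.0]))  # f(d) = 1.5, positive class
print(classify(weights, bias, [0.0, 2.0]))  # f(d) = -1.5, negative class
```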

    4.3.3 Multi-class classification with SVM

    Relation classification requires a multi-class classifier, so the basic SVM (a binary classifier) must be extended to a multi-class classifier.

    One such extension is the one-against-all algorithm [12]. The basic idea is as follows:

    - Suppose the sample data set is $(x_1, y_1), \dots, (x_m, y_m)$, where $x_i$ is an n-dimensional vector and $y_i \in Y$ is the class label assigned to $x_i$.
    - Split the set $Y$ into $m$ class subsets of the form $z_i = \{y_i, Y \setminus y_i\}$.
    - Apply the basic binary SVM to each of the $m$ sets $z_i$ to build the hyperplane for that split.

    The classifier obtained by combining these $m$ classifiers is called the multi-class extension of SVM.
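    The one-against-all scheme can be sketched as follows. To keep the example self-contained, the binary learner is a simple linear model trained by sub-gradient descent on the regularized hinge loss, which stands in for a full SVM solver; the hyperparameters, function names and toy data are our own assumptions. At prediction time the class whose binary classifier gives the largest decision value is chosen, a common way to combine the classifiers.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_binary(X, c, epochs=1000, lr=0.01, lam=0.001):
    """Stand-in binary learner: sub-gradient descent on the regularized hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, ci in zip(X, c):
            violated = ci * (dot(w, x) + b) < 1
            # Regularization shrinks w; a margin violation moves the plane toward x.
            w = [wi - lr * (lam * wi - (ci * xi if violated else 0.0))
                 for wi, xi in zip(w, x)]
            if violated:
                b += lr * ci
    return w, b


def train_one_against_all(X, y):
    """Train one binary classifier per class: that class (+1) against all others (-1)."""
    models = {}
    for label in sorted(set(y)):
        c = [1 if yi == label else -1 for yi in y]
        models[label] = train_binary(X, c)
    return models


def predict(models, x):
    """Pick the class whose classifier returns the largest decision value."""
    return max(models, key=lambda label: dot(models[label][0], x) + models[label][1])


# Toy two-dimensional data with three well-separated classes.
X = [[5, 0], [6, 1], [5, -1], [0, 5], [1, 6], [-1, 5], [-5, -5], [-6, -4], [-4, -6]]
y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
models = train_one_against_all(X, y)
print(predict(models, [5.5, 0.5]))
```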


    4.3.4. Applying SVM to semantic relation classification in the Vietnamese medical domain

    Although SVM was originally designed for binary classification, it has since been extended to multi-class classification, and this extension can be used to classify sentences that contain relations [2].

    Building an SVM-based relation classification model involves two data preparation steps:

    - Design a taxonomy for the set of relation classes. The application domain of the relations determines the complexity (depth) of the taxonomy.
    - Build a labeled sample data set (corpus) for each relation class. In this step the choice of features used to represent relations is important. The feature set differs with the characteristics of each language; for English, for example, the features are words.

    Once the set of relation classes and the data set have been built, learning proceeds according to the following model:

    Figure 19. The learning process for classifying sentences that contain relations [2]

    [Figure 19 diagram labels: Sentence; Preprocessing; Feature extraction; Set of feature vectors; SVMMulti classifier; Sentence (containing a relation)]


    Chapter 5

    EXPERIMENTS

    Building an Ontology for Vietnamese medical data and extending it automatically through the steps of information extraction (entity recognition and relation identification) lays the groundwork for the thesis to build a semantically annotated data set (a semantic network). The result of this work plays an important role in the future task of building a semantic search engine.

    5.1. Experimental environment

    5.1.1. Hardware

    We used a personal computer with the following hardware configuration: Genuine Intel CPU T2050 1.60 GHz, CHIP 798 MHz, 1 GB RAM.

    5.1.2 Software

    We integrated utilities from the Protégé and Gate toolkits to build the ontology, annotate data and recognize Vietnamese entities in the medical domain.

    Protégé [13] is an ontology-building tool developed at the Stanford Center for Biomedical Informatics Research at the Stanford University School of Medicine. Protégé comes in two flavors: Protégé Frame and Protégé OWL. Protégé Frame provides a full user interface and a ready-made model for creating and storing ontologies in Frame form. Protégé OWL supports the Web Ontology Language (OWL), which is endorsed by the W3C for the Semantic Web.

    Gate [31] is a software architecture for developing and deploying software components that process human language. Gate helps developers in three ways:

    - It specifies an architecture, or organizational structure, for language processing software.
    - It provides a framework, or class library, that implements the architecture and can be used in natural language processing applications.
    - It provides a development environment built on top of the framework, made up of convenient graphical tools for developing components.


    Gate exploits component-based software development, object orientation and mobile code. The framework and the development environment are written in Java and are open-source software under the GNU library licence. Gate uses Unicode (Unicode Consortium 96) and has been tested on several languages, including German and Indian languages.

    Gate has been developed at the University of Sheffield since 1995 and has been used in research and in projects ever since. Version 1 was released in 1996 and was licensed by several hundred organizations. It has been used in a wide range of language analysis contexts in many languages: English, Greek, Swedish, German, Italian, French and others. Later versions followed and have served research and applications ever more effectively.

    5.1.3 Experimental data

    We collected more than 500 web pages from the web site http://suckhoedoisong.vn and then removed or cleaned noisy texts that do not help ontology construction or entity recognition. After this processing we retained nearly 400 web pages, corresponding to more than 5000 sentences, for building the Ontology, recognizing entities and laying the foundation for classifying relation sentences.

    We used the JvnTextPro tool of Nguyễn Cẩm Tú [1] to strip HTML from the web pages and to perform sentence and word segmentation on this document collection.
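    The cleaning step can be pictured with a small stand-in that uses only the Python standard library. It is not JvnTextPro and does not perform Vietnamese word segmentation; the class and function names are hypothetical, and the sentence splitter is a naive rule based on end-of-sentence punctuation.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_sentences(html):
    """Strip markup, normalize whitespace and split on sentence-final punctuation."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
```

    A production pipeline would also need to drop navigation menus and advertisements, which is part of the noise removal described above.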

    5.2 Building the Ontology

    5.2.1. Entity class hierarchy

    From the medical data collected from web pages and from existing ontologies, we listed the important terms so that definitions can be given to users; a direction for future work is to link them automatically to existing definitions on Wikipedia. The properties of these terms are defined next. Building an ontology is an iterative process that starts by defining the concepts in the class hierarchy and describing the properties of those concepts.

    By surveying the BioCaster Ontology together with Vietnamese terms and a large number of current Vietnamese medical web pages, we built a set of terms and the most basic relations, and from these proposed an initial experimental Ontology.

    The following are some of the entity classes proposed by the thesis for the Ontology:

    - Drug: traditional (Eastern) medicine and Western medicine. Examples are the anticancer drug 5-Fluorouracil Ebewe (colorectal, breast, esophageal and stomach cancer) and Ciloxan, an antiseptic against eye infections. The traditional remedy ngũ gia bì treats rheumatism and strengthens tendons and bones.
    - Disease, syndrome: diseases such as avian flu and gastric ulcer, and syndromes such as insomnia and heart failure.
    - Symptom: for example, the symptoms of H5N1 flu are high fever, headache and aching all over the body.
    - Cause: agents (viruses, bacteria, mosquitoes, chickens, birds and so on) and other causes such as lack of sleep, lack of exercise and smoking.
    - Food: dishes that benefit or harm human health, or that are suited to particular diseases.
    - Person: doctors and professors whom patients can seek out for examination and help when ill.
    - Organization: hospitals, clinics and pharmacies, the places patients can go to when ill.
    - Location: the address of an organization that patients can go to, and places where an epidemic is emerging and spreading.
    - Human body: all parts of the human body that can be affected by disease: eyes, nose, liver, heart and so on.
    - Activity: diagnosis and treatment, testing, resuscitation, artificial respiration, prevention, vaccination and so on.
    - Chemical: vitamins and minerals that affect the human body for better or worse; for example, vitamin A is good for the eyes, and vitamins C and E reduce the risk of heart disease.
    - Syndrome: a syndrome that may appear in a disease (for example, the shock syndrome of dengue hemorrhagic fever).
    - Complication: a disease may develop into another disease as a complication (for example, mumps with the complication meningitis).


    Figure 20: Classes in the constructed Ontology.

    Figure 21: Hierarchical structure of the constructed Ontology.


    5.2.2. Relations between entity classes

    The thesis uses the following semantic relations between entities, both to build the semantic relations in the Ontology and to assign semantic labels to sentences:

    - Drug - drug interaction: one drug may cause side effects with another, or drugs may be combined to treat a disease. For example, the anticancer drug Alexan should not be used together with methotrexate or 5-fluorouracil.
    - Food has a bad or good effect on a disease or on the human body. For example, drinking a lot of soda carries a risk of metabolic disorders, increased waist size and high blood pressure.
    - Disease - drug relation.
    - A cause causes a disease, or a disease has a cause.
    - Disease - symptom relation.
    - A disease develops into another disease as a complication.
    - Activities affect a disease.
    - A person works in an organization at some location.
    - A disease belongs to a person's specialty.
    - A disease is detected or treated at an organization.
    - Disease - syndrome relation.
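    Relations of this kind can be stored as triples whose subject and object classes are checked against a small schema. The sketch below is illustrative only: the English relation names and the dictionary layout are assumptions (the ontology itself was built in Protégé with Vietnamese identifiers), and the instances are taken from the examples mentioned in the text.

```python
# Hypothetical schema: relation name -> (class of subject, class of object).
RELATION_SCHEMA = {
    "drug_interaction": ("Drug", "Drug"),
    "complication": ("Disease", "Disease"),
    "treated_by": ("Disease", "Drug"),
    "has_symptom": ("Disease", "Symptom"),
}

# Instances and their classes, using examples from the text.
INSTANCES = {
    "Alexan": "Drug",
    "methotrexate": "Drug",
    "mumps": "Disease",
    "meningitis": "Disease",
}


def is_valid(triple):
    """Check that a (subject, relation, object) triple respects the schema."""
    subject, relation, obj = triple
    if relation not in RELATION_SCHEMA:
        return False
    domain, range_ = RELATION_SCHEMA[relation]
    return INSTANCES.get(subject) == domain and INSTANCES.get(obj) == range_


print(is_valid(("Alexan", "drug_interaction", "methotrexate")))
print(is_valid(("mumps", "drug_interaction", "Alexan")))
```

    Checking candidate triples against the schema is one simple way to reject extracted relations whose arguments belong to the wrong entity classes.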


    Figure 22. Instances of entity classes and the relations between them

    Figure 22 illustrates relations between instances of entity classes. It shows the instance "Dengue fever" and its relations to instances of other entity classes: Gán_nhãn (labeled), phát_hiện_tại (detected at), có_triệu_chứng (has symptom), biến_chứng (complication), chữa_bằng (treated by), bị_gây_ra_bởi (caused by).

    The thesis built an Ontology comprising 21 entity classes, 13 relations and more than 500 instances of the entity classes.

    5.3. Data annotation

    The thesis integrates the Ontology into Gate (General Architecture for Text Engineering) to annotate data. Starting from the collected data and the constructed ontology, the annotation process consists of the following steps:

    - Open the file containing the data to annotate; a whole directory containing several files can also be opened. Gate's Data_Store is used to store the data that has been opened and, later, annotated.
    - Open the constructed Ontology. Gate can also be used to edit the ontology's classes, properties and so on.
    - Change the annotation colors of the Ontology's entities as appropriate, so that entities can be told apart clearly.
    - Select the entity to annotate and choose the name of the ontology entity class to annotate it with.

    After annotation we obtain data containing entities that correspond to the classes defined in the ontology. Data annotation makes it easier to build a corpus over medical data and also contributes to automatically extending the entities in the ontology.

    The thesis annotated 96 data files, corresponding to more than 1500 instances.

    Figure 23: Data annotated with the Ontology.


    5.4. Entity recognition

    5.4.1. Building the gazetteer

    After annotation we have data files annotated with distinct entity classes. From this annotated data we can build a data set of entity names. A good data set makes entity recognition more effective. The thesis uses the Ontology together with an extension integrated into