Data Mining Coursework Report - Le Ngoc Hieu - CH1101012 - K6UIT


Coursework report for the DATA MINING course: A study of GRAPH MINING, 2012

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY

COURSEWORK REPORT
COURSE: DATA MINING

TOPIC

A STUDY OF GRAPH MINING

Student: Le Ngoc Hieu
Student ID: CH1101012
Class: CH K6 - UIT
Supervisor: Assoc. Prof. Dr. Phuc

TABLE OF CONTENTS

Table of contents

Preface

Chapter I: Introduction to graph data mining - Graph mining
1) Concepts & definitions
2) Graph data mining - frequent graph patterns
3) Some algorithms in graph mining

Chapter II: Graph mining algorithms
1) The Apriori algorithm
2) The pattern-growth algorithm
3) Characteristics of graph mining algorithms

Chapter III: Graph classification
1) Structure-based classification
2) Pattern-based classification
3) Decision-tree-based classification
4) Kernel-based classification

Chapter IV: Graph compression

Chapter V: Applying graph mining to trust management on the Internet
1) Notation
2) Related metrics
3) Global cluster structure
4) Local cluster structure
5) Topology

Chapter VI: Conclusion

References

PREFACE

Graph data mining is not a brand-new research area, but it is still fairly new in Vietnam.

This end-of-term report for the Data Mining & Data Warehousing course has helped me better understand the applications of graph data mining, together with their goals, purposes, and results in practice. It gives me a solid foundation for later research and development during my studies at the university.

For the completion of this report I sincerely thank Assoc. Prof. Dr. Phuc, who inspired me, guided me attentively, and provided the information, materials, and valuable lectures that allowed this work to reach the level of a first research step.

The topic is neither new nor old. However, the time and research effort invested were limited, so this work is only a course essay. It surveys the subject at a general level and does not analyze the issues with the depth expected of a scientific research paper.

I hope for your understanding.

Ho Chi Minh City, November 2012.

CHAPTER I: INTRODUCTION TO GRAPH DATA MINING - GRAPH MINING

I.1) CONCEPTS & DEFINITIONS
1. Why mine graph data?
Graphs can be found everywhere in daily life, for example:
a. Networks, e.g. the Internet or co-expression networks
b. Social networks
c. Program flow
d. Chemical compounds
e. Protein structures
Much of today's large-scale network data can be represented as graphs and relationships, according to:
a. Physical links and connections
b. Connections between networks at the network layer
c. Relationships in social networks
d. Hyperlinks between web pages
e. Complex interactions between entities
These graphs carry valuable information for network applications such as:
a. Community detection
b. Classification
c. Recommendation systems
d. Web search
e. P2P (peer-to-peer) search and data retrieval
f. Trust and reputation
To work with such data in graph form, we need to:
a. Define metrics that describe the global structure of the graph
b. Find the community structure of the network
c. Define metrics that describe the patterns of local interaction within the graph
d. Develop and apply efficient algorithms for mining data on networks
e. Understand the models by which such graphs are generated
In general, graphs are more general than itemsets, sequences, trees, and lattices, and many graph problems have high computational complexity.
2. Notation & terminology:
- A graph can be viewed as a 5-tuple (V, E, F, Lv, Le).
- D = {G1, G2, ..., Gn} is the dataset of transactions.
- The transactions in D are labeled undirected graphs.
- The support of a graph G is the percentage of graphs in D that contain G as a subgraph.
- A graph is called frequent if its support is greater than a given threshold.
Example of a frequent subgraph: (figure not preserved)
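The definition of support can be made concrete with a small sketch. The representation below (a graph as a pair of a node-label dictionary and a set of undirected edges) and the helper names `contains`, `support`, and `edge_set` are assumptions chosen for illustration, not part of the original report. The containment test is a brute-force subgraph isomorphism check, so it is only suitable for very small graphs.

```python
from itertools import permutations

def contains(graph, pattern):
    """Return True if `pattern` occurs in `graph` as a (non-induced) subgraph.

    Both arguments are (labels, edges) pairs: `labels` maps node -> label and
    `edges` is a set of frozenset({u, v}). Brute force over injective mappings.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # Node labels must be preserved by the mapping.
        if any(p_labels[n] != g_labels[mapping[n]] for n in p_nodes):
            continue
        # Every pattern edge must map onto an edge of the graph.
        if all(frozenset(mapping[n] for n in e) in g_edges for e in p_edges):
            return True
    return False

def support(dataset, pattern):
    """Fraction of graphs in `dataset` that contain `pattern`."""
    return sum(contains(g, pattern) for g in dataset) / len(dataset)

def edge_set(pairs):
    return {frozenset(p) for p in pairs}

# Three toy "molecules" (hypothetical data).
D = [
    ({1: "C", 2: "C", 3: "O"}, edge_set([(1, 2), (2, 3)])),
    ({1: "C", 2: "O"}, edge_set([(1, 2)])),
    ({1: "C", 2: "C", 3: "N"}, edge_set([(1, 2), (2, 3)])),
]
c_o = ({"a": "C", "b": "O"}, edge_set([("a", "b")]))
c_c = ({"a": "C", "b": "C"}, edge_set([("a", "b")]))

min_support = 0.5
print(support(D, c_o) > min_support)  # the C-O edge occurs in 2 of 3 graphs
```

With a threshold of 0.5, both the C-O edge and the C-C edge are frequent in this toy dataset, since each occurs in two of the three graphs.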

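3. Overview of graph data mining (Graph Mining): (figure not preserved)

I.2) GRAPH DATA MINING: FREQUENT GRAPHS
1. Frequent graph mining - Graph Pattern Mining:

Introduction: the input is a set of graphs (figure not preserved).

2. Graph patterns - Graph Pattern:

Useful and interesting measures, which drive the mining according to the stated objective:
- Frequency: frequent graph patterns
- Discriminative power: extracting the information needed to tell graphs apart
- Significance

3. Frequent graph patterns - Frequent Graph Pattern
Given a graph dataset D, find every subgraph g such that $\mathrm{freq}(g) \ge \theta$, where freq(g) is the percentage of graphs in D that contain g and $\theta$ is the minimum support threshold.
Example 1 of a frequent subgraph - chemical compounds: (a) caffeine, (b) diurobromine, (c) Viagra.

The frequent subgraph in the compounds above is: (figure not preserved)

Example 2 of a frequent subgraph - graphs that represent the function-call relationships of a program.

The resulting frequent subgraph has support 2. (figure not preserved)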

I.3) SOME ALGORITHMS IN GRAPH MINING
- Inductive Logic Programming: the intersection of logic programming and inductive learning. It combines machine learning and logic programming techniques and applies them to graph data mining.
- Algorithms based on graph properties:
+ Apriori-based approach: finds the frequent sets level by level
_ AGM/AcGM: Inokuchi et al. (2000)
_ FSG: Kuramochi & Karypis (ICDM 2001)
_ PATH: Vanetik and Gudes (ICDM 2002, 2004)
_ FFSM: Huan (ICDM 2003) and SPIN: Huan (KDD 2004)
_ FTOSM: Horvath (KDD 2006)
+ Pattern-growth approach (growing graph patterns)
_ Subdue: Holder (KDD 1994)
_ MoFa: Borgelt and Berthold (ICDM 2002)
_ gSpan: Yan and Han (ICDM 2002)
_ Gaston: Nijssen and Kok (KDD 2004)
_ CMTreeMiner: Chi (TKDE 2005), LEAP: Yan (SIGMOD 2008)

CHAPTER II: GRAPH MINING ALGORITHMS

II.1) THE APRIORI ALGORITHM
1. Principle: if a graph is frequent, then all of its subgraphs are also frequent.

2. Characteristics of the Apriori algorithm
The algorithm has two main steps:
- Join step: generate the set of candidate subgraphs.
- Prune step: check the frequency of each candidate subgraph.
Most work concentrates on optimizing the first step. The second step requires subgraph isomorphism testing.
The size of a subgraph can be measured in vertices, edges, or edge-disjoint paths (path number).
Execution order of the algorithm: (figure not preserved)

AGM (Inokuchi): the size of a graph is its number of vertices (#vertices).
FSG (Kuramochi & Karypis): the size of a graph is its number of edges (#edges).
- Candidates are generated by edge count: the subgraph size grows by one after each iteration.
- Join step: two subgraphs of size k are joined if and only if they share a common core of size k-1.

PATH (Vanetik): the size of a graph is its path number, i.e. the minimum number of edge-disjoint paths into which the graph can be decomposed.
However, the join step makes candidate generation complex and expensive, and it consumes a lot of memory (when BFS is used). The prune step also has weaknesses: it is inefficient because it has to test subgraph isomorphism. This motivated algorithms based on the pattern-growth approach.
3. Analysis of the algorithm: cost of the algorithm: (content not preserved)

II.2) THE PATTERN-GROWTH ALGORITHM
1) Basic idea:
- Avoid the complexity of the join step of the Apriori algorithm.
- Extend patterns directly by adding a new edge e. The newly generated candidate is written g +x e: x takes the value f if e is a forward edge that introduces a new vertex, and the value b if e is a backward edge.
- Recursively extend the frequent pattern g until no frequent graph containing g can be found.
2) Framework:
Input: g is a frequent subgraph, D is the graph dataset, min_support is the minimum support, and S is the set of frequent subgraphs.
Algorithm PatternGrowth(g, D, min_support, S):
    Duplicate check:
        if g is in S then return
        else add g to S
    Extension step:
        find all edges e in the dataset such that g can be extended to g +x e
    Pruning step:
        for each frequent g +x e do
            call PatternGrowth(g +x e, D, min_support, S)
    return
3) Drawback: the extension step is inefficient because the same graph can be generated many times. For example, a graph with n edges can be reached from n different graphs with n-1 edges. Repeatedly generating duplicates and checking for them wastes memory, resources, and running time.
4) The gSpan algorithm:
gSpan uses depth-first search (DFS) to traverse the graph. DFS code: the order in which the vertices are visited in the DFS tree.

Main idea: restrict extension by allowing growth only in certain directions, along the main path (the rightmost path in gSpan's terminology):
- add a new edge from the last vertex Vn of the main path back to any vertex on the main path, or
- introduce a new vertex and connect it to any vertex on the main path.
Problem: many DFS trees exist for the same graph, which leads to duplicates.
Solution: choose one of the duplicates as the canonical one and extend only along the main path.
DFS lexicographic search tree: (figure not preserved)
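The generic pattern-growth framework from 2), which gSpan refines, can be sketched in a few lines under a strong simplifying assumption: every node label is unique within a graph (a relational graph), so a pattern is just a set of labeled edges and no isomorphism test is needed. The function names `pattern_growth`, `mine`, and `E`, and the toy dataset, are hypothetical and only serve the illustration.

```python
def pattern_growth(g, D, min_sup, S):
    """Recursively extend the frequent pattern g by one edge at a time."""
    if g in S:               # duplicate check: g was already reached another way
        return
    S.add(g)
    nodes = set().union(*g)  # all node labels touched by the pattern
    # Extension step: edges that occur next to g in some graph containing g.
    candidates = {e for graph in D if g <= graph
                  for e in graph if e not in g and e & nodes}
    for e in candidates:
        g_ext = g | {e}
        sup = sum(g_ext <= graph for graph in D) / len(D)
        if sup >= min_sup:   # pruning step: only frequent extensions recurse
            pattern_growth(g_ext, D, min_sup, S)

def mine(D, min_sup):
    """Return all connected frequent edge sets in the dataset D."""
    S = set()
    for e in set().union(*D):
        seed = frozenset([e])
        if sum(seed <= graph for graph in D) / len(D) >= min_sup:
            pattern_growth(seed, D, min_sup, S)
    return S

def E(*pairs):
    """Build a graph as a frozenset of undirected edges, e.g. E("AB", "BC")."""
    return frozenset(frozenset(p) for p in pairs)

D = [E("AB", "BC", "CD"), E("AB", "BC"), E("BC", "CD")]
frequent = mine(D, 0.6)
print(len(frequent))
```

The early return on `g in S` is exactly the duplicate check of the framework: the pattern {AB, BC} is reached both from {AB} and from {BC}, and the second visit is cut off immediately. gSpan avoids generating most of these duplicates in the first place by restricting where new edges may be attached.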

5) The pattern explosion problem: mining closed frequent graphs
- The Apriori principle states that if a graph is frequent, all of its subgraphs are frequent as well. A frequent graph with n edges can therefore have up to $2^n$ frequent subgraphs.
- Example: among 423 chemical compounds confirmed to be active against AIDS in a dataset, there are about 1 million frequent graph patterns with support of at least 5%.
This motivates mining closed frequent subgraphs.
Closed frequent graph: a frequent graph G is closed if there is no supergraph of G that has the same support as G.
- The set of closed frequent subgraphs has the same expressive power as the full set of frequent subgraphs.
- In the AIDS antiviral dataset there are about 1 million frequent subgraphs, but only about 2,000 of them are closed.
Advantages of closed frequent graphs:
- The number of closed frequent graphs is much smaller than the total number of frequent graphs.
- Closed frequent graphs can replace frequent graphs as features in applications.
CloseGraph:
- An efficient pattern-growth algorithm for mining closed frequent graphs (CFG).
- A fairly simple extension of the gSpan algorithm.
- Handles the hard cases: (figure not preserved)
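The definition of a closed frequent graph can be illustrated with a short filter. The sketch assumes, as a simplification not made in the report, that patterns are sets of uniquely labeled edges, so "supergraph" reduces to "proper superset". The support counts below are made-up toy values and `closed_patterns` is a hypothetical helper name.

```python
def closed_patterns(supports):
    """Keep the patterns that have no proper super-pattern with equal support.

    `supports` maps each frequent pattern (a frozenset of edges) to its
    support count.
    """
    return {
        p for p, s in supports.items()
        if not any(p < q and s == supports[q] for q in supports)
    }

def E(*edges):
    return frozenset(edges)

# Toy support counts over three graphs (hypothetical data).
supports = {
    E("AB"): 2,
    E("BC"): 3,
    E("CD"): 2,
    E("AB", "BC"): 2,
    E("BC", "CD"): 2,
}
closed = closed_patterns(supports)
print(sorted(sorted(p) for p in closed))
```

Here {AB} is dropped because {AB, BC} has the same support, while {BC} stays because every super-pattern has strictly lower support. The supports of all five frequent patterns can still be recovered from the three closed ones, which is the sense in which the closed set loses no information.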

6) The SUBDUE algorithm:
- Start with single vertices.
- Expand the best substructures with a new edge.
- Limit the number of best substructures that are kept.
- Substructures are evaluated by their ability to compress the input graphs, using the minimum description length (DL).
- The best substructure S in a graph G minimizes DL(S) + DL(G\S).
- Stop when no new substructure can be found.

Advantages:
- Matching is approximate rather than exact, so slight structural variations are allowed.
- The number of frequent subgraph patterns is reduced.
Applications:
- Structural clustering where the clusters are not clear-cut
- Graph compression
- Graph grammar learning
7) Maximal graph pattern mining (Maximal Graph Pattern Mining)
Tree-based equivalence classes:
- Trees are arranged in a fixed canonical order.
- Graphs belong to the same equivalence class if they share the same canonical tree.

Locally maximal:
- A subgraph g is locally maximal if it is maximal within its equivalence class, i.e. g has no frequent supergraph that shares the same canonical tree as g.
- Every maximal graph pattern must be locally maximal.
- Subgraphs that are not locally maximal are pruned.

II.3) CHARACTERISTICS OF GRAPH MINING ALGORITHMS
1) Search order:
- As in other graph algorithms, the search can proceed breadth-first (BFS) or depth-first (DFS).
- The enumeration can be complete or incomplete.
2) Candidate generation in graph mining algorithms: (figure not preserved)

3) Pattern discovery order
Free extension: (figure not preserved)

Extension along the main path: (figure not preserved)

4) Coherent subgraph mining (Coherent Subgraph)
Motivation: cope with the damage caused by high dimensionality while keeping the ability to find frequent patterns.
Basic idea: remove redundant features that do not provide any additional information.
A graph G is considered coherent if the mutual information between G and each of its subgraphs exceeds a threshold (in other words, G is strongly correlated with all of its subgraphs).
The mutual information between a graph G and its subgraph G' is given by:

$$I(G; G') = \sum_{X_G, X_{G'}} P(X_G, X_{G'}) \log \frac{P(X_G, X_{G'})}{P(X_G)\, P(X_{G'})}$$

- $P(X_G, X_{G'})$ is the joint distribution.
- $P(X_G = 1) = \mathrm{support}(G)$.
The mining procedure is similar to gSpan, but candidates are pruned based on mutual information.
5) Dense subgraph mining (Dense Subgraph)
Relational graph: every node has a unique label, as in models of social networks and biological networks.
Problem: mine the dense or highly connected frequent subgraphs of a set of relational graphs.
- Mining data from social networks.
- Sets of genes with the same function are usually organized in a certain biological arrangement.
- Density can be measured like an average vertex degree.
- A set of edges whose removal disconnects the graph is called an edge cut. The minimum cut is the smallest such set of edges.
- A graph is called dense if the size of its minimum cut is not smaller than a given threshold.
Decompose step: decompose the relational graphs to find the maximal subgraphs that satisfy the connectivity constraint.
Intersection step: intersect the decomposed graphs to obtain highly connected graphs.
If the graph produced by the intersection does not satisfy the connectivity constraint, it is decomposed again to obtain smaller candidates.

6) Graph search - Graph index
Problem: given a graph database and a query graph, find all graphs in the database that contain the query graph.

Naive solution:
- Sequential scan (disk I/O)
- Subgraph isomorphism test (NP-complete)
The problem is scalability:
- The number of substructures of a graph is exponential, so building an index over all subgraphs leads to a huge number of index entries.
Intuition: if a graph G contains the query graph Q, then G must also contain every substructure of Q.
Step 1: build the index
- Enumerate the distinct structures of the graphs.
- Build an inverted index between graphs and structures.
Step 2: query processing
- Enumerate the structures in the query graph.
- Compute the candidate graphs that contain all of these structures.
- Eliminate the false positives by testing subgraph isomorphism.
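The two-step index-and-query scheme can be sketched as follows. This is a minimal illustration under the assumption that the indexed "structures" are simply labeled edges (pairs of endpoint labels). Real systems index richer features such as paths or frequent subgraphs. The names `edge_features`, `build_index`, and `candidates` and the toy database are hypothetical, and the final isomorphism check on the candidates is left out.

```python
from collections import defaultdict

def edge_features(labels, edges):
    """Feature set of a graph: the sorted label pair of every edge."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def build_index(database):
    """Step 1: inverted index from each feature to the ids of graphs having it."""
    index = defaultdict(set)
    for gid, (labels, edges) in database.items():
        for f in edge_features(labels, edges):
            index[f].add(gid)
    return index

def candidates(index, database, query):
    """Step 2: graphs that contain every feature of the query graph."""
    result = set(database)
    for f in edge_features(*query):
        result &= index.get(f, set())
    return result

database = {
    "g1": ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "g2": ({1: "C", 2: "O"}, [(1, 2)]),
    "g3": ({1: "C", 2: "N"}, [(1, 2)]),
}
index = build_index(database)
query = ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)])
print(sorted(candidates(index, database, query)))
```

The filter never discards a true answer, because a graph that contains the query must contain all of its features. It can still return false positives, which is why a subgraph isomorphism test on the surviving candidates remains necessary.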

Path-based indexing approaches:
- Daylight (commercial system)
- GraphGrep: Shasha et al.
- Grace: Srinath Srinivasan
Basic idea of path-based indexing and search:
- Enumerate all paths in the database up to a maximum length (or a given threshold).
- Build an inverted index between paths and graphs.
- Use the index to identify the candidate graphs that contain all paths, up to the maximum length, of the query graph.
Advantages:
- Paths are easier to handle than trees and graphs.
- Working in the well-bounded space of paths is very efficient.
Limitations:
- Structural information is lost, which can produce misleading results.
- Not suitable for complex graph queries.
The gIndex algorithm:
Idea:
- Find the frequent structures of the database.
- Identify a small set of discriminative structures.
- Build an inverted index between the discriminative structures and the graphs.
Intuition:
- Redundant structures need not be indexed.
- Index only those structures that carry more information than the structures already indexed.
Discriminative structures: given a set of structures f1, f2, ..., fn and a new structure x, we measure the extra indexing power provided by x:

$$P(x \mid f_1, f_2, \dots, f_n), \qquad f_i \subseteq x$$

When P is small enough, x is a discriminative structure and should be included in the index.
The index built from discriminative structures is an order of magnitude smaller than the index built from all frequent structures.
7) Substructure similarity search:
Motivation:
- Exact search is too restrictive for graphs such as molecular or biological structures.
- Similarity (approximate) search is necessary and important.
Similarity search:
- It is hard to build an index that covers all similar subgraphs.
Basic idea:
- Build the index beforehand.
- Select features in the query space instead of the database space.
Similarity measure:
- Each graph is represented by a feature vector.
- Similarity is defined as the distance between the vectors.
- Easy to index and fast, which simplifies queries.
- Some edges may be missed, regardless of their position.

CHAPTER III: GRAPH CLASSIFICATION

III.1) STRUCTURE-BASED APPROACH
Basic idea:
- Convert each graph G in the database into a vector

$$G \mapsto \mathbf{x} = (x_1, x_2, \dots, x_n)$$

where $x_i$ is the frequency of the i-th structure (or the i-th pattern) in G.
- Each vector is labeled with a class.
- Classify these vectors in the vector space.
Structural features:
- Local structures in a graph, for example the edges around a vertex or paths of a fixed length.

III.2) PATTERN-BASED APPROACH
1) Classification based on graph patterns from data mining:
- Sequence patterns (De Raedt and Kramer)
- Frequent subgraphs
- Coherent frequent subgraphs
- Closed frequent subgraphs
- Acyclic subgraphs (Wale and Karypis 2006)
2) Decision-tree-based classification
Basic idea:
- Partition the data top-down and build the tree using the best feature at each step.
- Each split divides the dataset into two subsets: one that contains the feature and one that does not.

3) Classification with Boosting

III.3) KERNEL-BASED APPROACH
Motivation:
- Kernel-based learning methods do not need access to the data points themselves. They rely only on the values of a kernel function between data points.
- They can be applied to complex structures, provided a kernel function can be defined on them.
Basic idea:
- Map each graph to a set of patterns.
- Define a kernel on the corresponding pattern sets.
Random walk kernels (Gartner et al., Borgwardt et al., Inokuchi et al.)
Basic idea: count the matching walks between two graphs.
Some basics:
- Adjacency matrix: A[i,j] = 1 if there is an edge between vertices i and j, and 0 otherwise.
- Transition matrix: $T = D^{-1} A$.

From any vertex i, jump randomly to an adjacent vertex j. The probability of jumping to j is proportional to A[i,j].
Random walks of length n:
- The entries of $A^n$ give the number of walks of length n.
- The entries of $T^n$ give the probability of walks of length n.
Comparing graphs:
- Count the walks that match in both graphs.
- Two graphs are similar if they share many matching walks.
- Down-weight longer walks, and be careful with cycles.
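The relation between matrix powers and walks can be checked on a tiny example. The sketch below uses plain Python lists and the hypothetical helper names `mat_mul`, `mat_pow`, and `transition`. It only illustrates the building blocks named in the text (walk counts and walk probabilities), not a full random walk kernel.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, n):
    """n-th power of a square matrix (n >= 0) by repeated multiplication."""
    result = [[int(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(n):
        result = mat_mul(result, A)
    return result

def transition(A):
    """Row-normalize A, i.e. compute D^-1 A where D holds the vertex degrees."""
    return [[a / sum(row) if sum(row) else 0.0 for a in row] for row in A]

# Path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

walks2 = mat_pow(A, 2)             # walks2[i][j] = number of walks of length 2
probs2 = mat_pow(transition(A), 2)  # probs2[i][j] = probability of such a walk
print(walks2[1][1], probs2[1][1])
```

There are two walks of length 2 from vertex 1 back to itself (via 0 and via 2), and a random walker starting at vertex 1 returns after two steps with probability 1, since both neighbors have vertex 1 as their only neighbor.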

Graph classification applied to program debugging (figure not preserved)

Graph classification applied to malware detection (figure not preserved)

CHAPTER IV: GRAPH COMPRESSION

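CHAPTER V: APPLYING GRAPH MINING TO TRUST MANAGEMENT ON NETWORKS
V.1 NOTATION
- G = (V, E): a graph
- V: the set of N vertices
- $E \subseteq V \times V$: the set of directed or undirected edges
- $N(u) = \{v \mid (u, v) \in E\}$: the neighborhood of u
- $d(u) = |N(u)|$: the degree of u

V.2 RELATED METRICS
1) Degree distribution:
$C_k = |\{u : d(u) = k\}|$ is the number of vertices with (in/out) degree k.
Power law: $C_k = c\,k^{-\gamma}$ with $\gamma > 1$, or equivalently $\log C_k = \log c - \gamma \log k$. Plotting $\log C_k$ against $\log k$ gives a straight line with slope $-\gamma$.
2) The Internet graph [Faloutsos 1999] (figure not preserved)

3) In-degree of the web graph [Broder et al., 2000, Donato et al., 2007] (figure not preserved)

4) Out-degree of the web graph [Broder et al., 2000, Donato et al., 2007] (figure not preserved)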

5) Other degree-related metrics:
- Edge reciprocity: the percentage of edges that are linked in both directions.
- Degree / average degree of the neighboring (adjacent) vertices.
- Average out-degree of the neighboring vertices.
- Average in-degree of the neighboring vertices.
6) Computing a metric over the neighborhood up to distance d:
Semi-streaming model: graph on disk
    for node : 1 ... N do
        INITIALIZE-MEM(node)
    end for
    for distance : 1 ... d do    {Iteration step}
        for src : 1 ... N do    {Follow links in the graph}
            for all links from src to dest do
                COMPUTE(src, dest)
            end for
        end for
        NORMALIZE
    end for
    POST-PROCESS
    return Something
7) Computing the average degree of the neighboring vertices:
Semi-streaming model: graph on disk
    for node : 1 ... N do
        SUMDEG(node) := 0
    end for
    for distance : 1 ... 1 do    {Iteration step}
        for src : 1 ... N do    {Follow links in the graph}
            for all links from src to dest do
                SUMDEG(src) := SUMDEG(src) + DEG(dest)
            end for
        end for
        for node : 1 ... N do
            AVGDEG(node) := SUMDEG(node) / DEG(node)
        end for
    end for
    return AVGDEG
8) Number of supporters within distance d:

Naive solution:
- Run a BFS of depth d from every vertex u. This is not feasible for large networks.
Semi-streaming solution, first version:
- At iteration i of the algorithm, every vertex u holds the set of vertices at distance i.
- At iteration i + 1, for every vertex u, take the union of the distance-i sets of all vertices adjacent to u.
Problem: this needs O(n) bits per vertex.
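The first semi-streaming version can be sketched directly with Python sets. The function name `supporters_within` and the adjacency-dictionary input are assumptions for illustration. Each vertex keeps the full set of vertices reached so far, which is exactly the O(n) memory per vertex that the text identifies as the problem.

```python
def supporters_within(neighbors, d):
    """Exact set of vertices within distance d of every vertex.

    `neighbors` maps each vertex to an iterable of adjacent vertices.
    """
    reach = {u: {u} for u in neighbors}
    for _ in range(d):
        # One pass over all links: merge in the sets held by adjacent vertices.
        reach = {u: reach[u].union(*(reach[v] for v in neighbors[u]))
                 for u in neighbors}
    return reach

# Path graph 0 - 1 - 2 - 3.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reach1 = supporters_within(neighbors, 1)
reach2 = supporters_within(neighbors, 2)
print(len(reach2[0]))
```

Each iteration needs only one sequential scan of the links, which fits the graph-on-disk setting, but the per-vertex sets can grow to the size of the whole graph.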

9) Probabilistic counting:
- Used to estimate the number of supporters within distance d.
- Performed concurrently for all vertices and repeated d times.
- At each step, a bit vector is propagated to all adjacent vertices.
- At each step, the vectors are aggregated: OR with all vectors received from the adjacent vertices.
- The bit vectors have logarithmic size.

10) General algorithm
Require: N: number of nodes, d: distance, k: bits
    for node : 1 ... N, bit : 1 ... k do
        INIT(node, bit)
    end for
    for distance : 1 ... d do    {Iteration step}
        Aux ← 0^k
        for src : 1 ... N do    {Follow links in the graph}
            for all links from src to dest do
                Aux[src] ← Aux[src] OR K[dest,·]
            end for
        end for
        K ← Aux
    end for
    for node : 1 ... N do    {Estimate supporters}
        Supporters[node] ← ESTIMATE(K[node,·])
    end for
    return Supporters

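Initialization: every bit of every vertex is set to 1 (one) with a given probability.
Estimation: the number of supporters is predicted from the number of ones in the vertex's bit vector (formula not preserved).
The procedure is repeated with different probabilities while observing the number of ones (details not preserved).
11) Convergence and error: the number of iterations of the algorithm is not preserved in this copy.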
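The bit-propagation scheme of the general algorithm can be sketched as below. Several details are assumptions rather than the report's method: the functions `init_bits`, `propagate`, and `estimate` are hypothetical names, each vertex keeps its own bits when aggregating, and the estimator is derived here from first principles. If every vertex sets each bit independently with probability eps, a bit of the OR over n vertices is 1 with probability 1 - (1 - eps)^n, so n can be estimated as log(1 - ones/k) / log(1 - eps).

```python
import math
import random

def init_bits(num_nodes, k, eps, rng):
    """Each of the k bits of every vertex is set to 1 with probability eps."""
    return [sum(1 << b for b in range(k) if rng.random() < eps)
            for _ in range(num_nodes)]

def propagate(neighbors, bits, d):
    """Repeat d times: OR every vertex's vector with those of its neighbors."""
    for _ in range(d):
        aux = list(bits)                  # assumption: keep the vertex's own bits
        for src, dests in neighbors.items():
            for dest in dests:
                aux[src] |= bits[dest]
        bits = aux
    return bits

def estimate(vector, k, eps):
    """Invert P(bit = 1) = 1 - (1 - eps)^n to estimate n from the ones count."""
    ones = bin(vector).count("1")
    if ones == k:
        return math.inf                   # saturated: eps is too large here
    return math.log(1 - ones / k) / math.log(1 - eps)

# Path graph 0 - 1 - 2 - 3 with 64-bit vectors (toy parameters).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = random.Random(0)
k, eps = 64, 0.1
bits0 = init_bits(len(neighbors), k, eps, rng)
bits1 = propagate(neighbors, bits0, 1)
bits3 = propagate(neighbors, bits0, 3)
print([estimate(v, k, eps) for v in bits3])
```

Each vertex stores only k bits instead of a full set, and one sequential scan of the links per iteration is enough. The estimate is noisy for small k. A saturated vector signals that a smaller initialization probability should be tried.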

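V.3 GLOBAL CLUSTERING STRUCTURE
1) Global and local clustering structure:
- Transitivity coefficient of a graph G:

$$T(G) = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}}$$

i.e. the probability that the two endpoints of a randomly chosen connected triple are themselves connected.

2) Clustering coefficient:
- Clustering coefficient of a vertex v:

$$C(v) = \frac{\text{number of triangles containing } v}{\text{number of triples centered at } v}$$

i.e. the probability that a random pair of vertices adjacent to v are connected to each other.
- The clustering coefficient $C_G$ of the graph G is the average of the coefficients of its vertices.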
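Both coefficients can be computed exactly on a small graph, which helps to see how they differ. The sketch assumes an undirected graph stored as a dictionary of neighbor sets. The function names `local_clustering` and `transitivity` are chosen for illustration.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are connected to each other."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

def transitivity(adj):
    """3 * (number of triangles) / (number of connected triples)."""
    # Each triangle is counted once per center vertex, i.e. three times.
    closed = sum(1 for v in adj
                 for a, b in combinations(adj[v], 2) if b in adj[a])
    triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return closed / triples if triples else 0.0

# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
avg_clustering = sum(local_clustering(adj, v) for v in adj) / len(adj)
print(transitivity(adj), avg_clustering)
```

On this graph the transitivity is 3/5, while the average clustering coefficient is (1 + 1 + 1/3 + 0) / 4, so the two global measures generally do not coincide.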

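3) Counting triangles:
- Exact computation reduces to matrix multiplication. This is not feasible even for medium-sized networks.
- Approximation by random sampling in the streaming and semi-streaming models.
There are two stream models:
_ Arbitrary order: the edges are stored in no particular order.
_ Ordered edges (incidence stream): all edges incident to the same vertex are stored consecutively.

4) Brute-force computation
- The brute-force algorithm checks every triple of vertices.
- Let $T_0, T_1, T_2, T_3$ denote the sets of triples that span 0, 1, 2 and 3 edges.

- Total number of triples: $|T_0| + |T_1| + |T_2| + |T_3| = \binom{n}{3}$
5) Naive sampling
- r is the number of independent samples, each consisting of 3 distinct vertices (a, b, c) of the graph.
- For the i-th sample, output $\beta_i = 1$ if (a, b, c) is a triangle, and $\beta_i = 0$ otherwise.
- Estimate: $\tilde{T}_3 = \left(\frac{1}{r} \sum_{i=1}^{r} \beta_i\right) \cdot \binom{n}{3}$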

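Optimized sampling in the semi-streaming model: the 3-pass algorithm [Buriol 2006]
Pass 1: count the number of paths of length 2 in the stream.
Pass 2: choose one path of length 2, (a, b, c), uniformly at random.

Pass 3: if (a, c) belongs to E then $\beta = 1$, otherwise $\beta = 0$.
Number of triangles: $\tilde{T}_3 = \left(\frac{1}{r} \sum_{i=1}^{r} \beta_i\right) \cdot \frac{P_2}{3}$, where $P_2$ is the number of paths of length 2 counted in pass 1 and r is the number of independent repetitions.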
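The three passes can be mimicked in memory to see why the estimator works. This is a sketch under assumptions: the graph is held as a dictionary of neighbor sets rather than read as a stream, the function name `estimate_triangles` is hypothetical, and the scaling by one third of the number of length-2 paths relies on the fact that every triangle contains exactly three such paths.

```python
import random

def estimate_triangles(adj, r, rng):
    """Estimate the triangle count from r sampled paths of length 2."""
    # Pass 1: count the paths (a, b, c) of length 2 centered at each vertex b.
    centers = list(adj)
    weights = [len(adj[b]) * (len(adj[b]) - 1) // 2 for b in centers]
    num_paths = sum(weights)
    if num_paths == 0:
        return 0.0
    hits = 0
    for _ in range(r):
        # Pass 2: pick one path of length 2 uniformly at random.
        b = rng.choices(centers, weights=weights)[0]
        a, c = rng.sample(sorted(adj[b]), 2)
        # Pass 3: beta = 1 if the edge (a, c) closes the path into a triangle.
        hits += c in adj[a]
    return hits / r * num_paths / 3

# Complete graph on 4 vertices: every path of length 2 is closed.
k4 = {u: {v for v in range(4) if v != u} for u in range(4)}
# Star graph: no path of length 2 is closed.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}

rng = random.Random(42)
print(estimate_triangles(k4, 100, rng), estimate_triangles(star, 100, rng))
```

The two extreme cases are deterministic: in the complete graph all samples are hits, giving 12 / 3 = 4 triangles, and in the star none are, giving 0. For graphs in between, the estimate fluctuates around the true count and tightens as r grows.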

6) Extensions of the algorithm:

V.4 LOCAL CLUSTERING STRUCTURE
1) Local clustering coefficient

- Compute the number of triangles for every vertex.
- This is not directly feasible in the semi-streaming model.

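2) Estimating set intersection:
The Jaccard coefficient of two sets A and B is $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. For a random permutation $\pi$ of the elements, $\Pr[\min \pi(A) = \min \pi(B)] = J(A, B)$, so the fraction of random permutations under which the two minima coincide estimates $J(A, B)$.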
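Set-intersection estimation by min-wise hashing can be sketched as follows. The sketch assumes true random permutations produced by shuffling, where a large-scale implementation would use hash functions, and the name `minhash_jaccard` is hypothetical. It estimates the Jaccard coefficient of two vertex sets as the fraction of trials in which both sets have the same minimum label.

```python
import random

def minhash_jaccard(A, B, universe, m, rng):
    """Estimate |A ∩ B| / |A ∪ B| from m independent random permutations.

    A and B must be non-empty subsets of `universe`.
    """
    items = list(universe)
    z = 0
    for _ in range(m):
        order = items[:]
        rng.shuffle(order)                       # a fresh random permutation
        label = {x: i for i, x in enumerate(order)}
        # Fingerprint of a set: the smallest label among its elements.
        if min(label[x] for x in A) == min(label[x] for x in B):
            z += 1                               # the minima are equal
    return z / m

rng = random.Random(7)
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
j = minhash_jaccard(A, B, range(10), 500, rng)
# From J, the intersection size follows as J / (1 + J) * (|A| + |B|).
intersection = j / (1 + j) * (len(A) + len(B))
print(j, intersection)
```

The true Jaccard coefficient of A and B here is 2/6, so the printed estimate should be close to one third and the derived intersection size close to 2. Applied to the neighbor sets of two adjacent vertices, the intersection size counts the triangles through that edge.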

3) Algorithm:
    Z = 0
    for i : 1 ... m do    {Independent trials}
        for u : 1 ... |V| do    {Assign labels}
            l_i(u) = hash_i(u)    {Minwise linear permutation}
        end for
        for u : 1 ... |V| do    {Compute fingerprints}
            F_i(u) = min over v ∈ S(u) of l_i(v)
        end for    {1 scan of G}
        for u : 1 ... |V| do    {Update counters}
            for v ∈ S(u) do
                if F_i(u) == F_i(v) then    {Minima are equal}
                    Z_uv = Z_uv + 1    {The Z_uv counters are stored on disk}
                end if
            end for
        end for
    end for
4) Computation

V.5 TOPOLOGICAL METRICS
1) PageRank:

2) Computing PageRank
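A power-iteration computation of PageRank in the semi-streaming style (one sequential scan of the links per iteration, one score per vertex in memory) might look like the sketch below. The damping factor of 0.85, the uniform redistribution of the score of vertices without out-links, and the name `pagerank` are assumptions made for this illustration and are not taken from the report.

```python
def pagerank(num_nodes, links, iterations=50, damping=0.85):
    """Power iteration over a list of (src, dest) links."""
    out_deg = [0] * num_nodes
    for src, _ in links:
        out_deg[src] += 1
    pr = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        aux = [0.0] * num_nodes
        # Score held by vertices without out-links is spread uniformly.
        dangling = sum(pr[u] for u in range(num_nodes) if out_deg[u] == 0)
        for src, dest in links:          # one sequential scan of the links
            aux[dest] += pr[src] / out_deg[src]
        pr = [(1 - damping) / num_nodes
              + damping * (a + dangling / num_nodes) for a in aux]
    return pr

cycle = [(0, 1), (1, 2), (2, 0)]         # every vertex is equally important
inward = [(1, 0), (2, 0)]                # vertices 1 and 2 both point to 0
print(pagerank(3, cycle))
print(pagerank(3, inward))
```

On the directed cycle the scores stay uniform, and on the inward star vertex 0 collects the highest score. In both cases the scores sum to 1, because the dangling mass is redistributed instead of being lost.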

Semi-streaming version of the power iteration method
    for node : 1 ... N do
        PR(node) := 1/N
    end for
    for distance : 1 ... d do    {Iteration step}
        Aux := 0
        for dest : 1 ... N do    {Follow links in the graph}
            for all links from src to dest do
                Aux(dest) := Aux(dest) + PR(src) × T(src,dest)
            end for
        end for
        PR := Aux
    end for
    return PR
3) TrustRank

CHAPTER VI: CONCLUSION
Graph data mining opens up a direction for research and application development in information technology. It handles very large, high-level real-world data problems and is grounded in practical issues as they arise. In Vietnam, at present and in the near future, the use and development of graph data mining has not yet received attention, and its role and position are not properly appreciated. Research on graph data mining is still very limited. Developing graph mining applications would help Vietnamese industry grow, especially biotechnology and biochemistry.

REFERENCES

English
[1] D. V. Janardhan Rao, Prof. Prasad Tadepalli, "A Study of Graph Mining Algorithms", 2007.
[2] Deepayan Chakrabarti and Christos Faloutsos, "Graph Mining: Laws, Generators, and Algorithms", Yahoo! Research and Carnegie Mellon University, 2006.
[3] Karsten Borgwardt and Xifeng Yan, "Graph Mining", Max Planck Institute for Developmental Biology, 2008.
[4] Stefano Leonardi, "Graph Mining and its applications to Reputation Management in Networks", Sapienza University of Rome, Rome, Italy, 2008.
