Tam Quan Trong Cua Toan Trong Tin Hoc

Embed Size (px)

Citation preview

V quan h gia ton hc v tin hc

H T BoVin Cng ngh Thng tin, Vin KH & CN Vit Nam Japan Advanced Institute of Science and Technology1Ton hc v Tin hc

Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)2Ton hc v Tin hc

i th v tin hc (khoa hc my tnh)An ton thng tin Tin sinh hc Nhn dng Tr tu nhn to

Phn mm ng dng

softwarePhn mm h thngCray Y-MP8E

K ngh phn mm DOS, Windows, C s d liu Mac OS, UNIX, Pascal, Linux C, C++, Thut ton Java, Ton hc cho tin hc

hardware

Supercomputer s, mainframes Minicomputers, workstations

My tnh song songLAN, WAN, Internet

3

PC computers

Mng my tnh

Truyn thng

Ton hc v Tin hc

Hai con ng ca ton trong tin hc

Cc l thuyt, m hnh ton hc lm c s cho s pht trin tin hc.

gii quyt cc vn ca tin hc v ng dng tin hc, tm v dng cc l thuyt v cng c ton hc.Ton hc v Tin hc

4

Ton hc ri rc i vi tin hcLogic (lp lun t ng, AI, h thng minh) Set theory (tp m, tp th, tnh ton vi thng tin khng , khng chnh xc, ) Number theory (bo mt thng tin) Combinatorics (hnh hc t hp, ) Graph theory (mng, sinh hc, vt l, ) Digital geometry and digital topology (phn tch nh, etc.)5

Algorithmics (phng php tnh ton) Computability (gii hn l thuyt v thc t ca thut ton) Probability and Markov chains (x l ting ni, sinh hc, ) Linear algebra (phn tch d liu, ...) Probability (ngu nhin) Proofs (k ngh phn mm, AI) etc.Ton hc v Tin hc

Mt s lnh vc ton hc khcThng k v phn tch d liu (statistics, data analysis) L thuyt ti u (Optimization)

6

Ton hc v Tin hc

Logic trong tin hc =Knowledge(i s, thng k, ton hc ri rc, ...)

+Inference(logic ton hc, )

Logic mnh , logic tn t, logic khng chun (modal, temporal, non-monotonic, fuzzy, high order, ... logic) Th d: My t ng dng tam on lun (syllosism)Elephants are bigger than dogs Dogs are bigger than mice Therefore Elephants are bigger than mice7Ton hc v Tin hc

L thuyt tp hp (tp m, tp th)Tp m (Zadeh 1965), tp th (Pawlak, 1982) Rough fuzzy hybridization? Rough sets + Fuzzy sets?

X

Xp x di X* : Hp ca cc lp tng ng nm trn trong X Xp x trn X* : Hp ca cc lp tng ng c giao khc rng vi X

U ={ ,

,

,

,

,} , ,

R = {color, shape },

Eq. class / color = { { Eq. class / shape = {{

}{ ,,

,

}}}}

}, { }, {

}, { }}

Eq. class = {{X ={8

,

}, {

}, {

,

}, X * = {

}, X * = { ,

,

}

Ton hc v Tin hc

23 bi ton ca th k 20Ti i hi Ton hc Th gii ln th hai (Paris, thng Nm 1900), Hilbert nu ra 23 bi ton, thch thc cc nh ton hc ton th gii gii trong th k 20. 12 bi ton c gii ton b, 8 bi ton c gii tng phn, 3 bi vn cha c li gii.

9

Ton hc v Tin hc

7 bi ton ca th k 214 gi chiu Th t ngy 24 thng 5 nm 2000, Vin Ton hc Clay cng b v thch thc 7 bi ton ca th k 21 (1 triu $ cho mi li gii). Bi ton s 1: P versus NP Su bi ton khc: 1. The Hodge Conjecture 2. The Poincar Conjecture (solved 2006) 3. The Riemann Hypothesis 4. Yang-Mills Existence and Mass Gap 5. Navier-Stokes Existence and Smoothness 6. The Birch and Swinnerton-Dyer Conjecture.10Ton hc v Tin hc

Bi ton P versus NPNu ai hi rng liu 13.717.421 c l tch ca hai s nh hn khng, bn s cm thy rt kh tr li l ng hay sai. Nu ngi bo bn rng s ny c th l tch ca 3607 v 3803, bn c th kim tra iu ny tht d dng. Xc nh xem vi mt bi ton cho trc, liu c tn ti mt li gii c th kim chng nhanh (bng my tnh chng hn), nhng li cn rt nhiu thi gian gii t u (nu khng bit li gii)? C rt nhiu bi ton nh vy. Cha ai c th chng minh c rng, vi bt k bi ton no nh vy, thc s cn rt nhiu thi gian gii. C th ch n gin l chng ta vn cha tm ra c cch gii chng nhanh chng. Stephen Cook pht biu bi ton P versus NP vo nm 1971.11Ton hc v Tin hc

Bi ton P versus NPBi ton SAT: cho trc mt mch in t Boolean, liu c cc cch chn inputs sao cho output l True hay khng? Inputs ca cc mch in t Boolean (vi cng AND, OR v NOT) hoc l T (true) hoc l F (false). Mi cng nhn mt s cc inputs, v outputs gi tr logic tng hp c.12Ti khng tm ni cho chef mt thut ton hiu qu. u c ti do ny c v m u qu.

Hy tm ngay cho ti mt thut ton hiu qu gii SAT.

Ton hc v Tin hc

Bi ton P versus NPTi khng th tm c mt thut ton hiu qu bi v khng th c mt thut ton no nh vy. Ti khng th tm c mt thut ton hiu qu bi v tt c nhng ngi ni ting ny cng khng tm c n.

nu bn chng minh c SAT l intractable (chng minh intractability c th kh nh vic tm li gii hiu qu)

nu bn bit SAT l NP-complete

13

Ton hc v Tin hc

phc tp tnh tonS tn ti cc bi ton gii c nhng v cng kh gii phc tp tnh ton: P (thi gian a thc) v non-P (thi gian hm m). Bi ton kiu P c th gii d dng (sp xp dy s theo th t), bi ton kiu non-P rt kh gii (tm cc tha s nguyn t ca mt s nguyn cho trc). Ngi ta tin rng c rt nhiu bi ton thuc kiu non-P, nhng cha bao gi chng minh c chnh chng l nh vy (ht sc kh). NP (Nondeterministic Polynomial) l mt h c bit cc bi ton kiu non-P: nu bt k trong chng c nghim thi gian a thc th tt c s c nghim thi gian a thc. P = NP? Cc bi ton kiu P v NP l nh nhau?14Ton hc v Tin hc

Thi gian a thc v hm mTime complexity function n = 100.0001 second 0.001 second 0.01 second 1 second 0.01 second 0.059 second

n = 200.0002 second 0.002 second 0.02 second 3.2 second 1.0 second 58 minutes

n = 300.0003 second 0.003 second 0.03 second 24.3 second 17.9 second 6.5 years

n = 400.0004 second 0.004 second 0.04 second 1.7 minutes 12.7 days 3855 centuries

n = 500.0005 second 0.005 second 0.05 second 5.2 minutes 35.7 years 2x108 centuries

n = 600.0006 second 0.006 second 0.06 second 13.0 minutes 336 centuries 1.3x1013 centuries

n n2 n3 n5 2n 3n 15

Ton hc v Tin hc

Scalable algorithmsiu g xy ra khi d liu ln, N>106? Cn thut ton heuristic nhanh, gn ngFast CMB analysis by Szapudi et al (2001)

O(NlogN) cn 1 ngy O(N3) cn 10 triu nmCh trng n gi ca tnh tonCn lun iu khin c chnh xc t kt qu tt nht trong thi gian v ti nguyn cho php Polynomial time algorithms do not always work!Ton hc v Tin hc

Dng my tnh song song16

Mt m v an ton thng tinL nghin cu v b mt ca truyn tin (truyn tin trong iu kin c k ch). An ton mng v my tnh: qun l s truy nhp my tnh v tin cy ca thng tin, v cc ng dng nh: ATM cards, computer passwords, e-commerce, ...

Germany Lorenz cipher machine

17

Ton hc v Tin hc

Mt m v an ton thng tinTo m (encryption: plaintext (decryption) ciphertext) v gii m

Khi nim: mt m (cipher: letter, bit), m (code: word or phrase meaning), kha (key). Rt nhiu k thut:Symmetric-key cryptography (cng mt key cho nhn/gi) Public-key cryptography 1976 (public/private key) da trn number theory (integer factorization) Cryptanalysis (tm im yu, khng an ton trong mt h dng mt m) Cryptographic primitives Cryptographic protocols(Kha l mt on tin (piece of information) kim sot hot ng ca mt thut ton to mt m, ch k in t, nh quyn hn (encryption, digital signature, authentication)).

18

Ton hc v Tin hc

Public-key cryptographyinteger factorization l qu trnh phn tch mt ch s a hp thnh tch ca cc c s nh hn sao cho khi nhn cc c s ny ta c s ban u. BSI team: Nov. 2005, RSA-640 (193 decimal digits, 640 bits) on 80 AMD Opteron CPUs computer. Kh nht trong cc bi ton ny l trng hp c s l cc s nguyn t c chn ngu nhin vi cng ln. Vn cha c thut ton thi gian a thc phn tch mt s ln b-bit thnh tch ca hai s nguyn t c cng kch thc. Mt trong cc bi ton ln cha gii c trong KHMT v l thuyt s l vic tm mt thut ton nhn t ha cc s trong thi gian a thc.19Ton hc v Tin hc

Finite state machinesMy hu hn trng thi

L m hnh ca cc hnh vi to nn bi mt s hu hn cc trng thi (states), cc chuyn i trng thi (transitions), v cc hnh ng. M hnh ton hc: (,S,s0,,F), Vic thc hin mt phn cng my tnh i hi mt cng-t (register) cha cc bin trng thi, mt khi (block) mch logic xc nh s chuyn trng thi, v mt khi th hai cc mch logic xc nh output ca mt FSM.20Ton hc v Tin hc

Machine learning and data miningHc my v Khai ph d liu

Machine learning Xy cc h my tnh c th hc nh con ngi (mt s kh nng). ICML since 1982 ECML since 1989 ECML/PKDD since 2001

Data mining Tm tri thc mi v hu ch t nhng c s d liu ln. ACM KDD since 1995, PKDD and PAKDD since 1997, IEEE ICDM and SIAM DM since 2000, etc.

21

Ton hc v Tin hc

Machine learning and data miningHc my v Khai ph d liu

Machine learning Xy cc h my tnh c th hc nh con ngi (mt s kh nng). ICML since 1982 ECML since 1989

Data mining Tm tri thc mi v hu ch t nhng c s d liu ln.

ACM KDD since 1995, PKDD and PAKDD since beats IEEE ICDM 1997: ECML/PKDD since 2001 Deep Blue1997, Kasparov (12.2006: Deep Fritz beets Kramnik) and SIAM DM since 1997: Robot cup 2000, etc.2005: DARPA Grand Challenge

22

Ton hc v Tin hc

Complexly structured dataRt nhiu d liu c thu thp v tch ly trong cc c s d liu ln (hng triu bn ghi, hng nghn thuc tnh). Rt nhiu d liu l loi c cu trc phc tp (khng dng vect).Thch thc: Tm thut ton kh c x l d liu ln v cu trc phc tp?

23

Ton hc v Tin hc

Hai con ng ca ton trong tin hc

Cc l thuyt, m hnh ton hc lm c s cho s pht trin tin hc.

gii quyt cc vn ca tin hc v ng dng tin hc, tm v dng cc l thuyt v cng c ton hc.Ton hc v Tin hc

24

Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)25Ton hc v Tin hc

Natural language processing (NLP)Lexical / Morphological AnalysisTagging (gn nhn t loi) Chunking (phn cm t)

text

Shallow parsingThe woman will give Mary a book POS tagging The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN chunking

Syntactic AnalysisGrammatical Relation Finding Named Entity Recognition Word Sense Disambiguation

[The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP subject relation finding

Semantic AnalysisReference Resolution

[The woman] [will give] [Mary] [a book]

Discourse Analysis

meaning26

i-object

object

ng gi i nhanh quTon hc v Tin hc

Xu th trong NLPTrainable parsers 1990s2000s: Statistical learning algorithms, evaluation, corpora Trainable FSMs 1980s: Standard resources and tasks Penn Treebank, WordNet, MUC 1970s: Kernel (vector) spaces clustering, information retrieval (IR) 1960s: Representation Transformation Finite state machines (FSM) and Augmented transition networks (ATNs) 1960s: Representationbeyond the word level lexical features, tree structures, networks27Ton hc v Tin hc

Machine learning and statistics in NLP

some ML/Stat

no ML/StatTon hc v Tin hc

28

(Marie Claire, ECML/PKDD 2005)

Trainable finite state machines... ... ... St-1 St St+1 ... St-1 St St+1 St-1 St St+1

... ...

... ...

... Ot-1 Ot

Ot+1

... Ot-1 Ot

Ot+1

... Ot-1 Ot

... Ot+1

Maximum Entropy Markov Models (MEMMs) [McCallum et al., 2000]Discriminative No independence assumption Global optimum Local normalization

Conditional Random Fields (CRFs) [Lafferty et al., 2001]Discriminative No independence assumption Global optimum Global normalization

Hidden Markov Models (HMMs) [Baum et al., 1970]- Generative - Need independence assumption - Local optimum - Local normalization

More accurate than MEMMs

More accurate than HMMs Ton hc v Tin hc

29

Increasing computing power1966

1.6 meters

3500 3300 3100 2900 2700 2500 2300 2100 1900 1700 1500 1300 1100 900 700 500 300 100 -100

30 Mb 30MBJAISTs CRAY XT3

Linux OS, 180 AMD Opteron 2.4GHz CPUs, 8Gb RAM/CPU, total:

Training time (minute)

1.44Tb RAM

Second order of Conditional Random Fields (Text chunking):# of parallel processes

All phrase chunking, 17h46 on 90 processes vs.1348hTon hc v Tin hc

1

2

5

60

70

80

30

90 10 0 11 0 12 0 13 0 14 0 15 0

10

20

30

40

50

Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)31Ton hc v Tin hc

Web link dataFood Web [Martinez 91]

Over 3 billion documents

Friendship Network [Moody 01]

32

Ton hc v Tin hc

Tm kim trn WebHai cch tiu biu (1998):1. Page rank: tm cc trang

Web quan trng v lin quan nht trn Web (nh trong Google)

Larry Page, Sergey Brin

2. Hubs and authorities:

nh gi chi tit hn tm quan trng ca cc trang Web.Ton hc v Tin hc

33

PageRank: key ideaGoogle thng a ra mt s rt ln cc trang Web (weather forecast 5.5 million pages). Cc trang lin quan nht thng nm trong mt, hai chc trang u. Lm sao search engine bit c cc trang Web no l quan trng nht? Google gn cho mi trang Web mt con s c trng tm quan trng. Con s ny c gi l s PageRank v c tnh nh gi tr ring (eigenvalue) ca bi ton

Pw = w y P c tnh da trn link structure ca Internet. Vn c bn l lm sao thit lp xc ng c link structure ca ton b Web, i.e., ma trn P.34Ton hc v Tin hc

PageRank: key ideaM hnh c s ca thut ton PageRank l random walk thc hin trn ton b cc trang Web trn Internet. K hiu pt(x) kh nng ta ti trang Web x thi im t. S PageRank ca x c cc nh nh lim(pt(x)) khi t . Ma trn P c tnh cht khng rt gn c (irreducible) v ngu nhin (stochastic), nn random walk c th biu din qua chui Markov, v PageRank ca mi trang c th tnh c qua cc vectors ring ca P.35Ton hc v Tin hc

PageRank: Stochastic matrixMi trang i tng ng vi dng i v ct i ca P Nu trang j c n successors (links) v trang i l mt trong s n succesors ca j th ijth ca ma trn c l to 1/n, nu khng l 0.Assume the Web consists of only three pages A, B, and C. The links among these pages are shown as in the graph.

A C

A A B1/2 1/2

B1/2

C 0 1 0

01/2

B

C

0

36

Ton hc v Tin hc

The PageRank algorithmMa trn Google P hin c kch thc 4.2109 v do vic tm gi tr ring l khng tm thng.w0 = initial guess For k = 1 to 50 wk = P*wk-1 End Return w50

Tm xp x cc vc t ring ca P (power method) Vic tnh ton (matrix-vector multiplications) phi thc hin trn cc h my tnh song song ln.37Ton hc v Tin hc

PageRank: Example

w1 2 w = 1 2 2 w3 0 1

1

2

01 2

0 w1 w 1 2 0 w3

w0 = initial guess For k = 1 to 50 wk = P*wk-1 End Return w50

Bt u cho w1 = w2 = w3 = 1, v tnh quy php nhn ma trn vi cc gi tr ny

a = b = c =38

1 1 1

1 3/2 1/2

5/4 1 3/4

9/8 11/8 1/2

5/4 6/5 17/16 6/5 11/16 ... 3/5Ton hc v Tin hc

Finding things but not pagesInformation Extraction vs. Information RetrievalInformation extraction: the process of extracting text segments of semi-structured or free text to fill data slots in a predefined templatefoodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2004 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1

39

Ton hc v Tin hc

Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)40Ton hc v Tin hc

Problems in bioinformatics

Metabolomics Proteomics Genomics

1400 Chemicals

10,000 Proteins

30,000 Genes41Ton hc v Tin hc

How biological data look like?Mt mu 350 ch ci ca dy DNA 1.6 triu ch (nh hn 4570 ln):TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATG TAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGT TTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTT TTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATA AGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTAT GGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTAC CCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGT ATATTATGT

Nhiu loi d liu khc42Ton hc v Tin hc

i snh dy (string matching)(Approximate) String MatchingInput: Text T ,, Pattern P Input: Text T Pattern P Question(s): Question(s):P xut hin trong T? P xut hin trong T? Tm mt xut hin ca P trong T. Tm mt xut hin ca P trong T. Tm mi xut hin ca P trong T. Tm mi xut hin ca P trong T. Tnh s xut hin ca P trong T. Tnh s xut hin ca P trong T. Tm dy con di nht ca P trong T. Tm dy con di nht ca P trong T. Tm dy con gn nht ca P trong T. Tm dy con gn nht ca P trong T. Xc nh cc lp trc tip ca P Xc nh cc lp trc tip ca P trong T. trong T.

Applications: Applications:

Liu P c trong c s d liu T? Liu P c trong c s d liu Xc nh v tr ca P trong T. Xc nh v tr ca P trong T. Liu c th dng P nh mt nguyn Liu c th dng P nh mt nguyn t ca T? t ca T? P c tng ng vi g trong T? P c tng ng vi g trong T? P c b hng bi T? P c b hng bi T? Liu prefix(P) = suffix(T)? Liu prefix(P) = suffix(T)? Xc nh cc lp sau trc (tandem) Xc nh cc lp sau trc (tandem) ca P trong T. ca P trong T.

v nhiu bin dng khc v nhiu bin dng khc

43

Ton hc v Tin hc

Pairwise Sequence AlignmentSp dy tng cpInputHai dy ch ci Mt cch cho im

Bi ton c bn nht ca tin sinh hc Cc dy c sp thng c dng cu trc hoc chc nng Cho nhiu gi nu cu trc v chc nng ca mt trong cc dy c sp thng bit

OutputCch sp thng dy ti u

ATTGCGC ATTGCGC C

ATTGCGC ATCCGC

ATTGCGC AT-CCGC ATTGCGC ATC-CGC ATTGCGC ATCCG-CTon hc v Tin hc

44

Pairwise Sequence AlignmentSp dy tng cp

Ph bin dng HMM: Cc cp dy c sp do dng cc thut ton vt cn bi quy hoch ng. phc tp ca cc phng php vt cn l O(2n mn)

n = s cc dy

Sp dy bi x dng cc phng php heuristic#Rat #Mouse #Rabbit #Human #Oppossum #Chicken #Frog 45 ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT ---ATGGGTTTGACAGCACATGATCGT---CAGCTTon hc v Tin hc

on nhn gene

Gene prediction

L bi ton quan trng ca tin sinh hc v hin c nhiu thut ton cho on nhn gene da trn cc gene bit nh d liu hun luyn. Mt k thut ton nhn gene ph bin l Hidden Markov Models (HMMs).

(given the genomic DNA sequence, can we tell where the genes are?)46Ton hc v Tin hc

Bi ton on nhn cu trc proteinC khong 15,000 cu trc protein trong cc c s d liu cng cng, v trong s ny rt nhiu cu trc ging nhau. Con ngi mi bit chng 1,500 cu trc protein khc nhau. D on cu trc protein t cc dy amino-acid l mt trong cc bi ton quan trng nht ca tin sinh hc, v con ngi cn ang cch li gii rt xa.47Ton hc v Tin hc

D on tng tc proteinsDNA RNA protein Sequence Structure Function

DNA SequenceGene

Interaction Network Function7000 Yeast interactions among 3000 proteins

Protein Structure

Function

Hu ht cc qu trnh sinh hc u lin quan n tng tc no ca proteins.Ton hc v Tin hc

48

Ton hc trong tin sinh hcNeural NetworksSequence Encoding and Output Interpretation Prediction of Protein Secondary Structure Prediction of Signal Peptides and Their Cleavage Sites Applications for DNA and RNA Nucleotide Sequences

Hidden Markov ModelsProtein Applications DNA and RNA Applications

Probabilistic Graph Models Probabilistic Models of Evolution Stochastic Grammars and Linguistics Kernel methods49Ton hc v Tin hc

Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)50Ton hc v Tin hc

Kernel methods: a bit of historyMinsky and Pappert (1969) ch ra hn ch ca perceptrons. Neural networks (t gia 1980s) vt qua cc hn ch bng cch gn vo nhau nhiu n v tuyn tnh (multilayer neural networks). Gp hn ch ca tc v cc tiu a phng. Kernel methods (2000s) ni cc hm tuyn tnh nhng trong khng gian feature vi s chiu cao (high dimensional feature space.)51Ton hc v Tin hc

Kernel methods: the basic ideaBin i d liu vo khng gian nhiu chiu hn c th bin d liu thnh tch c tuyn tnh.

:X = R2 H = R352

2 ( x1 , x2 ) a ( x1 , x2 , x12 + x2 )

Ton hc v Tin hc

Kernel idea: biu din d liuX S (S) = (atcgagtttcgt, acccgtgtggta, cgaaacgtact) 1 0.5 0.3 K = 0.5 1 0.6 0.3 0.6 1

X l tp cc oligonucleotides, S gm ba oligonucleoides. Thng thng, mi oligonucleotide c biu din bi mt dy cc ch ci. Trong kernel methods, S c biu din nh mt ma trn ca o s tng t gia tng cp phn t.53Ton hc v Tin hc

Kernel methods: the schemeInput space Xx1 x2

Feature space Finverse map -1 (x)(x1) ... (x2) (xn) (xn-1)

xn-1 xn

k(xi,xj) = (xi).(xj)

kernel function k: XxX

R

Kernel matrix Knxn= {k(xi,xj)}

-

kernel-based algorithm on K (SVM, KPCA, etc.)

nh x d liu trong X vo mt (high-dimensional) khng gian vect, gi l feature space F, bi mt hm trn tng cp phn t x. Tm cc dng tuyn tnh (linear pattern) trong F vi cc thut ton quan bit (tnh tan trn kernel matrix). Khi dng nh x ngc, cc dng tuyn tnh trong F c th tm c cc dng tng ng phc tp trong X. Vic ny c ngm nh thc hin ch bi ni tch (inner products) trong F (kernel trick) xc nh bi mt hm kernel

54

Ton hc v Tin hc

Kernel methods: math backgroundInput space Xx1 x2

Feature space Finverse map -1 (x)(x1) ... (x2) (xn) (xn-1)

xn-1 xn

k(xi,xj) = (xi).(xj)

kernel function k: XxX

R

Kernel matrix Knxn= {k(xi,xj)}

-

kernel-based algorithm on K (SVM, KPCA, etc.)

Linear algebra, probability/statistics, functional analysis, optimization Mercer theorem: Mi hm xc nh dng u c th vit nh ni tch trong mt khng gian feature no y. Kernel trick: Dng ma trn kernel thay v dng ni tch trong khng gian feature space. Representer theorem:Every minimizer of min{C ( f , {xi , yi }) + ( ff H m H

) admits

a representa tion of the form f (.) = i K (., xi )i=1

55

Ton hc v Tin hc

Support vector machinesSoft margin problemn 1 2 min w + Ci w,b,1 ,...,n 2 i =1

Equivalent to dual problem n 1 n n T W ( ) = yi y j i j x i x j + i 2 i =1 j =1 i =1n yi i = 0 i =1 0 i C for i = 1,..., n

i 0 i, i 1+ yi (wTxi + b) 0 Input space

Feature space

56

Ton hc v Tin hc

Kt lun C mi quan h su sc, rng ln gia ton hc v tin hc. Hu ht cc vn ca khoa hc my tnh u cn n ton hc hin i mc cao hoc rt cao.

57

Ton hc v Tin hc