Upload
dongchirua
View
92
Download
1
Embed Size (px)
Citation preview
V quan h gia ton hc v tin hc
H T BoVin Cng ngh Thng tin, Vin KH & CN Vit Nam Japan Advanced Institute of Science and Technology1Ton hc v Tin hc
Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)2Ton hc v Tin hc
i th v tin hc (khoa hc my tnh)An ton thng tin Tin sinh hc Nhn dng Tr tu nhn to
Phn mm ng dng
softwarePhn mm h thngCray Y-MP8E
K ngh phn mm DOS, Windows, C s d liu Mac OS, UNIX, Pascal, Linux C, C++, Thut ton Java, Ton hc cho tin hc
hardware
Supercomputer s, mainframes Minicomputers, workstations
My tnh song songLAN, WAN, Internet
3
PC computers
Mng my tnh
Truyn thng
Ton hc v Tin hc
Hai con ng ca ton trong tin hc
Cc l thuyt, m hnh ton hc lm c s cho s pht trin tin hc.
gii quyt cc vn ca tin hc v ng dng tin hc, tm v dng cc l thuyt v cng c ton hc.Ton hc v Tin hc
4
Ton hc ri rc i vi tin hcLogic (lp lun t ng, AI, h thng minh) Set theory (tp m, tp th, tnh ton vi thng tin khng , khng chnh xc, ) Number theory (bo mt thng tin) Combinatorics (hnh hc t hp, ) Graph theory (mng, sinh hc, vt l, ) Digital geometry and digital topology (phn tch nh, etc.)5
Algorithmics (phng php tnh ton) Computability (gii hn l thuyt v thc t ca thut ton) Probability and Markov chains (x l ting ni, sinh hc, ) Linear algebra (phn tch d liu, ...) Probability (ngu nhin) Proofs (k ngh phn mm, AI) etc.Ton hc v Tin hc
Mt s lnh vc ton hc khcThng k v phn tch d liu (statistics, data analysis) L thuyt ti u (Optimization)
6
Ton hc v Tin hc
Logic trong tin hc =Knowledge(i s, thng k, ton hc ri rc, ...)
+Inference(logic ton hc, )
Logic mnh , logic tn t, logic khng chun (modal, temporal, non-monotonic, fuzzy, high order, ... logic) Th d: My t ng dng tam on lun (syllosism)Elephants are bigger than dogs Dogs are bigger than mice Therefore Elephants are bigger than mice7Ton hc v Tin hc
L thuyt tp hp (tp m, tp th)Tp m (Zadeh 1965), tp th (Pawlak, 1982) Rough fuzzy hybridization? Rough sets + Fuzzy sets?
X
Xp x di X* : Hp ca cc lp tng ng nm trn trong X Xp x trn X* : Hp ca cc lp tng ng c giao khc rng vi X
U ={ ,
,
,
,
,} , ,
R = {color, shape },
Eq. class / color = { { Eq. class / shape = {{
}{ ,,
,
}}}}
}, { }, {
}, { }}
Eq. class = {{X ={8
,
}, {
}, {
,
}, X * = {
}, X * = { ,
,
}
Ton hc v Tin hc
23 bi ton ca th k 20Ti i hi Ton hc Th gii ln th hai (Paris, thng Nm 1900), Hilbert nu ra 23 bi ton, thch thc cc nh ton hc ton th gii gii trong th k 20. 12 bi ton c gii ton b, 8 bi ton c gii tng phn, 3 bi vn cha c li gii.
9
Ton hc v Tin hc
7 bi ton ca th k 214 gi chiu Th t ngy 24 thng 5 nm 2000, Vin Ton hc Clay cng b v thch thc 7 bi ton ca th k 21 (1 triu $ cho mi li gii). Bi ton s 1: P versus NP Su bi ton khc: 1. The Hodge Conjecture 2. The Poincar Conjecture (solved 2006) 3. The Riemann Hypothesis 4. Yang-Mills Existence and Mass Gap 5. Navier-Stokes Existence and Smoothness 6. The Birch and Swinnerton-Dyer Conjecture.10Ton hc v Tin hc
Bi ton P versus NPNu ai hi rng liu 13.717.421 c l tch ca hai s nh hn khng, bn s cm thy rt kh tr li l ng hay sai. Nu ngi bo bn rng s ny c th l tch ca 3607 v 3803, bn c th kim tra iu ny tht d dng. Xc nh xem vi mt bi ton cho trc, liu c tn ti mt li gii c th kim chng nhanh (bng my tnh chng hn), nhng li cn rt nhiu thi gian gii t u (nu khng bit li gii)? C rt nhiu bi ton nh vy. Cha ai c th chng minh c rng, vi bt k bi ton no nh vy, thc s cn rt nhiu thi gian gii. C th ch n gin l chng ta vn cha tm ra c cch gii chng nhanh chng. Stephen Cook pht biu bi ton P versus NP vo nm 1971.11Ton hc v Tin hc
Bi ton P versus NPBi ton SAT: cho trc mt mch in t Boolean, liu c cc cch chn inputs sao cho output l True hay khng? Inputs ca cc mch in t Boolean (vi cng AND, OR v NOT) hoc l T (true) hoc l F (false). Mi cng nhn mt s cc inputs, v outputs gi tr logic tng hp c.12Ti khng tm ni cho chef mt thut ton hiu qu. u c ti do ny c v m u qu.
Hy tm ngay cho ti mt thut ton hiu qu gii SAT.
Ton hc v Tin hc
Bi ton P versus NPTi khng th tm c mt thut ton hiu qu bi v khng th c mt thut ton no nh vy. Ti khng th tm c mt thut ton hiu qu bi v tt c nhng ngi ni ting ny cng khng tm c n.
nu bn chng minh c SAT l intractable (chng minh intractability c th kh nh vic tm li gii hiu qu)
nu bn bit SAT l NP-complete
13
Ton hc v Tin hc
phc tp tnh tonS tn ti cc bi ton gii c nhng v cng kh gii phc tp tnh ton: P (thi gian a thc) v non-P (thi gian hm m). Bi ton kiu P c th gii d dng (sp xp dy s theo th t), bi ton kiu non-P rt kh gii (tm cc tha s nguyn t ca mt s nguyn cho trc). Ngi ta tin rng c rt nhiu bi ton thuc kiu non-P, nhng cha bao gi chng minh c chnh chng l nh vy (ht sc kh). NP (Nondeterministic Polynomial) l mt h c bit cc bi ton kiu non-P: nu bt k trong chng c nghim thi gian a thc th tt c s c nghim thi gian a thc. P = NP? Cc bi ton kiu P v NP l nh nhau?14Ton hc v Tin hc
Thi gian a thc v hm mTime complexity function n = 100.0001 second 0.001 second 0.01 second 1 second 0.01 second 0.059 second
n = 200.0002 second 0.002 second 0.02 second 3.2 second 1.0 second 58 minutes
n = 300.0003 second 0.003 second 0.03 second 24.3 second 17.9 second 6.5 years
n = 400.0004 second 0.004 second 0.04 second 1.7 minutes 12.7 days 3855 centuries
n = 500.0005 second 0.005 second 0.05 second 5.2 minutes 35.7 years 2x108 centuries
n = 600.0006 second 0.006 second 0.06 second 13.0 minutes 336 centuries 1.3x1013 centuries
n n2 n3 n5 2n 3n 15
Ton hc v Tin hc
Scalable algorithmsiu g xy ra khi d liu ln, N>106? Cn thut ton heuristic nhanh, gn ngFast CMB analysis by Szapudi et al (2001)
O(NlogN) cn 1 ngy O(N3) cn 10 triu nmCh trng n gi ca tnh tonCn lun iu khin c chnh xc t kt qu tt nht trong thi gian v ti nguyn cho php Polynomial time algorithms do not always work!Ton hc v Tin hc
Dng my tnh song song16
Mt m v an ton thng tinL nghin cu v b mt ca truyn tin (truyn tin trong iu kin c k ch). An ton mng v my tnh: qun l s truy nhp my tnh v tin cy ca thng tin, v cc ng dng nh: ATM cards, computer passwords, e-commerce, ...
Germany Lorenz cipher machine
17
Ton hc v Tin hc
Mt m v an ton thng tinTo m (encryption: plaintext (decryption) ciphertext) v gii m
Khi nim: mt m (cipher: letter, bit), m (code: word or phrase meaning), kha (key). Rt nhiu k thut:Symmetric-key cryptography (cng mt key cho nhn/gi) Public-key cryptography 1976 (public/private key) da trn number theory (integer factorization) Cryptanalysis (tm im yu, khng an ton trong mt h dng mt m) Cryptographic primitives Cryptographic protocols(Kha l mt on tin (piece of information) kim sot hot ng ca mt thut ton to mt m, ch k in t, nh quyn hn (encryption, digital signature, authentication)).
18
Ton hc v Tin hc
Public-key cryptographyinteger factorization l qu trnh phn tch mt ch s a hp thnh tch ca cc c s nh hn sao cho khi nhn cc c s ny ta c s ban u. BSI team: Nov. 2005, RSA-640 (193 decimal digits, 640 bits) on 80 AMD Opteron CPUs computer. Kh nht trong cc bi ton ny l trng hp c s l cc s nguyn t c chn ngu nhin vi cng ln. Vn cha c thut ton thi gian a thc phn tch mt s ln b-bit thnh tch ca hai s nguyn t c cng kch thc. Mt trong cc bi ton ln cha gii c trong KHMT v l thuyt s l vic tm mt thut ton nhn t ha cc s trong thi gian a thc.19Ton hc v Tin hc
Finite state machinesMy hu hn trng thi
L m hnh ca cc hnh vi to nn bi mt s hu hn cc trng thi (states), cc chuyn i trng thi (transitions), v cc hnh ng. M hnh ton hc: (,S,s0,,F), Vic thc hin mt phn cng my tnh i hi mt cng-t (register) cha cc bin trng thi, mt khi (block) mch logic xc nh s chuyn trng thi, v mt khi th hai cc mch logic xc nh output ca mt FSM.20Ton hc v Tin hc
Machine learning and data miningHc my v Khai ph d liu
Machine learning Xy cc h my tnh c th hc nh con ngi (mt s kh nng). ICML since 1982 ECML since 1989 ECML/PKDD since 2001
Data mining Tm tri thc mi v hu ch t nhng c s d liu ln. ACM KDD since 1995, PKDD and PAKDD since 1997, IEEE ICDM and SIAM DM since 2000, etc.
21
Ton hc v Tin hc
Machine learning and data miningHc my v Khai ph d liu
Machine learning Xy cc h my tnh c th hc nh con ngi (mt s kh nng). ICML since 1982 ECML since 1989
Data mining Tm tri thc mi v hu ch t nhng c s d liu ln.
ACM KDD since 1995, PKDD and PAKDD since beats IEEE ICDM 1997: ECML/PKDD since 2001 Deep Blue1997, Kasparov (12.2006: Deep Fritz beets Kramnik) and SIAM DM since 1997: Robot cup 2000, etc.2005: DARPA Grand Challenge
22
Ton hc v Tin hc
Complexly structured dataRt nhiu d liu c thu thp v tch ly trong cc c s d liu ln (hng triu bn ghi, hng nghn thuc tnh). Rt nhiu d liu l loi c cu trc phc tp (khng dng vect).Thch thc: Tm thut ton kh c x l d liu ln v cu trc phc tp?
23
Ton hc v Tin hc
Hai con ng ca ton trong tin hc
Cc l thuyt, m hnh ton hc lm c s cho s pht trin tin hc.
gii quyt cc vn ca tin hc v ng dng tin hc, tm v dng cc l thuyt v cng c ton hc.Ton hc v Tin hc
24
Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)25Ton hc v Tin hc
Natural language processing (NLP)Lexical / Morphological AnalysisTagging (gn nhn t loi) Chunking (phn cm t)
text
Shallow parsingThe woman will give Mary a book POS tagging The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN chunking
Syntactic AnalysisGrammatical Relation Finding Named Entity Recognition Word Sense Disambiguation
[The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP subject relation finding
Semantic AnalysisReference Resolution
[The woman] [will give] [Mary] [a book]
Discourse Analysis
meaning26
i-object
object
ng gi i nhanh quTon hc v Tin hc
Xu th trong NLPTrainable parsers 1990s2000s: Statistical learning algorithms, evaluation, corpora Trainable FSMs 1980s: Standard resources and tasks Penn Treebank, WordNet, MUC 1970s: Kernel (vector) spaces clustering, information retrieval (IR) 1960s: Representation Transformation Finite state machines (FSM) and Augmented transition networks (ATNs) 1960s: Representationbeyond the word level lexical features, tree structures, networks27Ton hc v Tin hc
Machine learning and statistics in NLP
some ML/Stat
no ML/StatTon hc v Tin hc
28
(Marie Claire, ECML/PKDD 2005)
Trainable finite state machines... ... ... St-1 St St+1 ... St-1 St St+1 St-1 St St+1
... ...
... ...
... Ot-1 Ot
Ot+1
... Ot-1 Ot
Ot+1
... Ot-1 Ot
... Ot+1
Maximum Entropy Markov Models (MEMMs) [McCallum et al., 2000]Discriminative No independence assumption Global optimum Local normalization
Conditional Random Fields (CRFs) [Lafferty et al., 2001]Discriminative No independence assumption Global optimum Global normalization
Hidden Markov Models (HMMs) [Baum et al., 1970]- Generative - Need independence assumption - Local optimum - Local normalization
More accurate than MEMMs
More accurate than HMMs Ton hc v Tin hc
29
Increasing computing power1966
1.6 meters
3500 3300 3100 2900 2700 2500 2300 2100 1900 1700 1500 1300 1100 900 700 500 300 100 -100
30 Mb 30MBJAISTs CRAY XT3
Linux OS, 180 AMD Opteron 2.4GHz CPUs, 8Gb RAM/CPU, total:
Training time (minute)
1.44Tb RAM
Second order of Conditional Random Fields (Text chunking):# of parallel processes
All phrase chunking, 17h46 on 90 processes vs.1348hTon hc v Tin hc
1
2
5
60
70
80
30
90 10 0 11 0 12 0 13 0 14 0 15 0
10
20
30
40
50
Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)31Ton hc v Tin hc
Web link dataFood Web [Martinez 91]
Over 3 billion documents
Friendship Network [Moody 01]
32
Ton hc v Tin hc
Tm kim trn WebHai cch tiu biu (1998):1. Page rank: tm cc trang
Web quan trng v lin quan nht trn Web (nh trong Google)
Larry Page, Sergey Brin
2. Hubs and authorities:
nh gi chi tit hn tm quan trng ca cc trang Web.Ton hc v Tin hc
33
PageRank: key ideaGoogle thng a ra mt s rt ln cc trang Web (weather forecast 5.5 million pages). Cc trang lin quan nht thng nm trong mt, hai chc trang u. Lm sao search engine bit c cc trang Web no l quan trng nht? Google gn cho mi trang Web mt con s c trng tm quan trng. Con s ny c gi l s PageRank v c tnh nh gi tr ring (eigenvalue) ca bi ton
Pw = w y P c tnh da trn link structure ca Internet. Vn c bn l lm sao thit lp xc ng c link structure ca ton b Web, i.e., ma trn P.34Ton hc v Tin hc
PageRank: key ideaM hnh c s ca thut ton PageRank l random walk thc hin trn ton b cc trang Web trn Internet. K hiu pt(x) kh nng ta ti trang Web x thi im t. S PageRank ca x c cc nh nh lim(pt(x)) khi t . Ma trn P c tnh cht khng rt gn c (irreducible) v ngu nhin (stochastic), nn random walk c th biu din qua chui Markov, v PageRank ca mi trang c th tnh c qua cc vectors ring ca P.35Ton hc v Tin hc
PageRank: Stochastic matrixMi trang i tng ng vi dng i v ct i ca P Nu trang j c n successors (links) v trang i l mt trong s n succesors ca j th ijth ca ma trn c l to 1/n, nu khng l 0.Assume the Web consists of only three pages A, B, and C. The links among these pages are shown as in the graph.
A C
A A B1/2 1/2
B1/2
C 0 1 0
01/2
B
C
0
36
Ton hc v Tin hc
The PageRank algorithmMa trn Google P hin c kch thc 4.2109 v do vic tm gi tr ring l khng tm thng.w0 = initial guess For k = 1 to 50 wk = P*wk-1 End Return w50
Tm xp x cc vc t ring ca P (power method) Vic tnh ton (matrix-vector multiplications) phi thc hin trn cc h my tnh song song ln.37Ton hc v Tin hc
PageRank: Example
w1 2 w = 1 2 2 w3 0 1
1
2
01 2
0 w1 w 1 2 0 w3
w0 = initial guess For k = 1 to 50 wk = P*wk-1 End Return w50
Bt u cho w1 = w2 = w3 = 1, v tnh quy php nhn ma trn vi cc gi tr ny
a = b = c =38
1 1 1
1 3/2 1/2
5/4 1 3/4
9/8 11/8 1/2
5/4 6/5 17/16 6/5 11/16 ... 3/5Ton hc v Tin hc
Finding things but not pagesInformation Extraction vs. Information RetrievalInformation extraction: the process of extracting text segments of semi-structured or free text to fill data slots in a predefined templatefoodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2004 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1
39
Ton hc v Tin hc
Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)40Ton hc v Tin hc
Problems in bioinformatics
Metabolomics Proteomics Genomics
1400 Chemicals
10,000 Proteins
30,000 Genes41Ton hc v Tin hc
How biological data look like?Mt mu 350 ch ci ca dy DNA 1.6 triu ch (nh hn 4570 ln):TACATTAGTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATG TAAATTTCTTATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGT TTGCTGGTTTCATTGTTAGAATATTTAACTTAATCAAATTATTTGAATTT TTGAAAATTAGGATTAATTAGGTAAGTAAATAAAATTTCTCTAACAAATA AGTTAAATTTTTAAATTTAAGGAGATAAAAATACTACTCTGTTTTATTAT GGAAAGAAAGATTTAAATACTAAAGGGTTTATATATATGAAGTAGTTAC CCTTAGAAAAATATGGTATAGAAAGCTTAAATATTAAGAGTGATGAAGT ATATTATGT
Nhiu loi d liu khc42Ton hc v Tin hc
i snh dy (string matching)(Approximate) String MatchingInput: Text T ,, Pattern P Input: Text T Pattern P Question(s): Question(s):P xut hin trong T? P xut hin trong T? Tm mt xut hin ca P trong T. Tm mt xut hin ca P trong T. Tm mi xut hin ca P trong T. Tm mi xut hin ca P trong T. Tnh s xut hin ca P trong T. Tnh s xut hin ca P trong T. Tm dy con di nht ca P trong T. Tm dy con di nht ca P trong T. Tm dy con gn nht ca P trong T. Tm dy con gn nht ca P trong T. Xc nh cc lp trc tip ca P Xc nh cc lp trc tip ca P trong T. trong T.
Applications: Applications:
Liu P c trong c s d liu T? Liu P c trong c s d liu Xc nh v tr ca P trong T. Xc nh v tr ca P trong T. Liu c th dng P nh mt nguyn Liu c th dng P nh mt nguyn t ca T? t ca T? P c tng ng vi g trong T? P c tng ng vi g trong T? P c b hng bi T? P c b hng bi T? Liu prefix(P) = suffix(T)? Liu prefix(P) = suffix(T)? Xc nh cc lp sau trc (tandem) Xc nh cc lp sau trc (tandem) ca P trong T. ca P trong T.
v nhiu bin dng khc v nhiu bin dng khc
43
Ton hc v Tin hc
Pairwise Sequence AlignmentSp dy tng cpInputHai dy ch ci Mt cch cho im
Bi ton c bn nht ca tin sinh hc Cc dy c sp thng c dng cu trc hoc chc nng Cho nhiu gi nu cu trc v chc nng ca mt trong cc dy c sp thng bit
OutputCch sp thng dy ti u
ATTGCGC ATTGCGC C
ATTGCGC ATCCGC
ATTGCGC AT-CCGC ATTGCGC ATC-CGC ATTGCGC ATCCG-CTon hc v Tin hc
44
Pairwise Sequence AlignmentSp dy tng cp
Ph bin dng HMM: Cc cp dy c sp do dng cc thut ton vt cn bi quy hoch ng. phc tp ca cc phng php vt cn l O(2n mn)
n = s cc dy
Sp dy bi x dng cc phng php heuristic#Rat #Mouse #Rabbit #Human #Oppossum #Chicken #Frog 45 ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT ---ATGGGTTTGACAGCACATGATCGT---CAGCTTon hc v Tin hc
on nhn gene
Gene prediction
L bi ton quan trng ca tin sinh hc v hin c nhiu thut ton cho on nhn gene da trn cc gene bit nh d liu hun luyn. Mt k thut ton nhn gene ph bin l Hidden Markov Models (HMMs).
(given the genomic DNA sequence, can we tell where the genes are?)46Ton hc v Tin hc
Bi ton on nhn cu trc proteinC khong 15,000 cu trc protein trong cc c s d liu cng cng, v trong s ny rt nhiu cu trc ging nhau. Con ngi mi bit chng 1,500 cu trc protein khc nhau. D on cu trc protein t cc dy amino-acid l mt trong cc bi ton quan trng nht ca tin sinh hc, v con ngi cn ang cch li gii rt xa.47Ton hc v Tin hc
D on tng tc proteinsDNA RNA protein Sequence Structure Function
DNA SequenceGene
Interaction Network Function7000 Yeast interactions among 3000 proteins
Protein Structure
Function
Hu ht cc qu trnh sinh hc u lin quan n tng tc no ca proteins.Ton hc v Tin hc
48
Ton hc trong tin sinh hcNeural NetworksSequence Encoding and Output Interpretation Prediction of Protein Secondary Structure Prediction of Signal Peptides and Their Cleavage Sites Applications for DNA and RNA Nucleotide Sequences
Hidden Markov ModelsProtein Applications DNA and RNA Applications
Probabilistic Graph Models Probabilistic Models of Evolution Stochastic Grammars and Linguistics Kernel methods49Ton hc v Tin hc
Ton hc v tin hcVi lnh vc tiu biu ca ton hc trong tin hc Ton hc trong x l ngn ng t nhin Ton hc v Web Ton hc trong sinh hc Ton hc trong hc my (machine learning)50Ton hc v Tin hc
Kernel methods: a bit of historyMinsky and Pappert (1969) ch ra hn ch ca perceptrons. Neural networks (t gia 1980s) vt qua cc hn ch bng cch gn vo nhau nhiu n v tuyn tnh (multilayer neural networks). Gp hn ch ca tc v cc tiu a phng. Kernel methods (2000s) ni cc hm tuyn tnh nhng trong khng gian feature vi s chiu cao (high dimensional feature space.)51Ton hc v Tin hc
Kernel methods: the basic ideaBin i d liu vo khng gian nhiu chiu hn c th bin d liu thnh tch c tuyn tnh.
:X = R2 H = R352
2 ( x1 , x2 ) a ( x1 , x2 , x12 + x2 )
Ton hc v Tin hc
Kernel idea: biu din d liuX S (S) = (atcgagtttcgt, acccgtgtggta, cgaaacgtact) 1 0.5 0.3 K = 0.5 1 0.6 0.3 0.6 1
X l tp cc oligonucleotides, S gm ba oligonucleoides. Thng thng, mi oligonucleotide c biu din bi mt dy cc ch ci. Trong kernel methods, S c biu din nh mt ma trn ca o s tng t gia tng cp phn t.53Ton hc v Tin hc
Kernel methods: the schemeInput space Xx1 x2
Feature space Finverse map -1 (x)(x1) ... (x2) (xn) (xn-1)
xn-1 xn
k(xi,xj) = (xi).(xj)
kernel function k: XxX
R
Kernel matrix Knxn= {k(xi,xj)}
-
kernel-based algorithm on K (SVM, KPCA, etc.)
nh x d liu trong X vo mt (high-dimensional) khng gian vect, gi l feature space F, bi mt hm trn tng cp phn t x. Tm cc dng tuyn tnh (linear pattern) trong F vi cc thut ton quan bit (tnh tan trn kernel matrix). Khi dng nh x ngc, cc dng tuyn tnh trong F c th tm c cc dng tng ng phc tp trong X. Vic ny c ngm nh thc hin ch bi ni tch (inner products) trong F (kernel trick) xc nh bi mt hm kernel
54
Ton hc v Tin hc
Kernel methods: math backgroundInput space Xx1 x2
Feature space Finverse map -1 (x)(x1) ... (x2) (xn) (xn-1)
xn-1 xn
k(xi,xj) = (xi).(xj)
kernel function k: XxX
R
Kernel matrix Knxn= {k(xi,xj)}
-
kernel-based algorithm on K (SVM, KPCA, etc.)
Linear algebra, probability/statistics, functional analysis, optimization Mercer theorem: Mi hm xc nh dng u c th vit nh ni tch trong mt khng gian feature no y. Kernel trick: Dng ma trn kernel thay v dng ni tch trong khng gian feature space. Representer theorem:Every minimizer of min{C ( f , {xi , yi }) + ( ff H m H
) admits
a representa tion of the form f (.) = i K (., xi )i=1
55
Ton hc v Tin hc
Support vector machinesSoft margin problemn 1 2 min w + Ci w,b,1 ,...,n 2 i =1
Equivalent to dual problem n 1 n n T W ( ) = yi y j i j x i x j + i 2 i =1 j =1 i =1n yi i = 0 i =1 0 i C for i = 1,..., n
i 0 i, i 1+ yi (wTxi + b) 0 Input space
Feature space
56
Ton hc v Tin hc
Kt lun C mi quan h su sc, rng ln gia ton hc v tin hc. Hu ht cc vn ca khoa hc my tnh u cn n ton hc hin i mc cao hoc rt cao.
57
Ton hc v Tin hc