Upload
fay-blankenship
View
217
Download
2
Embed Size (px)
Citation preview
A new algorithm for detecting low-complexity regions(LCRs) in protein sequences
指導教授:孫光天報告學生:魏至軒
BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 2 2005, pages 160–170 doi:10.1093/bioinformatics/bth497Sung W. Shin. and Sam M. KimDepartment of Computer Engineering, Kyungpook National University,Daegu 702-701, KoreaReceived on September 30, 2003; revised on May 25, 2004; accepted August 19, 2004 Advance Access publication August 27, 2004
Abstract(1) What is LCRs:
short-period repeats( 一段段短的重複序列 ) EX : TGEASTSSTSSTSSGST……….
如果要淨化 sequence, 就要 filer 掉 LCRs 淨化後的 sequence, 更能清楚找到 domain
Reason of this research: 因為 LCRs(low complexity regions) 的關係 , 所有 pair-wise
alignment of protein sequences 和 local similarity searches總是有很多的 false positive.
EX: 可能會發生一對無關的 protein, 卻產生高相似度 , 因為 sequences 中包含了一些 LCRs.
現在大部份的 algorithms 都是用統計分析去做的 .
Abstract(2) Objective of this research:
我們希望能研究出 LCRs 的結構化 properties, 並用 algorithm 去過濾它們 .
Results of this research: 我們發展了一個 algorithm
Detecting LCRs in protein sequences Masking LCRs in protein sequences
改進 database searches 的效能
Introduction 這類的 algorithms 對 subsequence 不斷的進行複雜的分析 , 並區隔出完全相同的部份 (repeating subsequence).
常見的 algorithms: XNU :使 Statistic 和 PAM matrix SEG :使用 sliding windows 進行辨視 CAST :計算相似度的 threshold value
New algorithm: CARD(complexity analysis with repeat delimit)
CARD-preliminary(1) For a sequence s:
prefix is a leading subsequence of s. suffix is a trailing subsequence of s.
TGEASTSSGTSSGTSSGST……….[TSSG]
|s| :the number of residues in s i.e. the length of s.
Repeating subsequence(simply repeat) It’s a subsequence that appears more than once in s. In a biological sequence, a pair of repeats can occur
separated, in tandem or overlapped.
CARD-preliminary(2) s[i .. j ] :
the subsequence of s from i-th position up to j -th position of sequence s.
For two subsequences s[i .. j ]s[k .. l] : the sequence constructed by concatenating the two subs
equences.( 意指: Repeating subsequence)
CARD-preliminary(3)-a suffix tree for a sequence s(1)
A suffix tree 的定義: (1) 對每個 suffix 而言 , 都有一條唯一的 path 從 root 出發直到 leaf node , y 從 i-th 開始 , 則 leaf node 就是 i.
TGEASTSSGTSSGTSSGST……….[TSSG] (2) 如果 y,y’ 這 2 個 suffices 有共同的 prefix, 從 root 出發並經過共同的 internal node.
TGEASTSSG TSSG ST………. (3) 能很容易的表示所有 repeating subsequences 的結構和他們的位置 .
CARD-preliminary(3)-a suffix tree for a sequence s(2)
A rooted tree is a tree which has a designated node called root.
Leaf nodes are the nodes with no child node.
A node that is neither leaf nor root is called internal node.
A path is a sequence of edges Suffix tree for sequence baaabaaa$.
CARD-preliminary(4) Let v be an internal node. The length of the path label x is k.
path label x :from root to node v. Pv : position list at v.
Suffix numbers i1, i2, . . . , im.
Pv = (k, i1, i2, . . . , im) K 指 path label x 的長度 i1, i2, . . . , im指這段 segment 分別在 sequence
中出現的位置 m 2(≧ 至少要 2 個 segment 才能 repeat)
需計算的 Pv 有aaab,a,aa,aaa
CARD-preliminary(5)
如果 Xj0是 LCR 的話 ,H(xj0xj1),H(xj0xj2),H(xj0xj3), . . . 並不會比 H(xj0) 高 .
For a sequence s,Shannon’s information measure Σ 是一個 alphabet, 一個 residues 的集合
Σ = {a1, a2, . . . , ak} size k (e.g. k = 20 in case of amino acid sequence)
H(s):
CARD-preliminary(6)- Three measures
so :未處理過的 sequence ex:FPGSSTSTPIFG sm : mask 過的 sequence ex:FPGSXXXXPIFG Do : so的 domain 數 Dm: sm的 domain 數 TX : masked residues 在 sequence 中的數量 NX : masked residues 在去除 domain 之後 sequence 中的 數量
CARD-algorithm(2)
Pv={2,38,44}
if(ij+k i≧ (j+1)) //38+2 44 ?≧else{t=H(s[ij…ij+k-1]); r=1; //t=H(s[38~(38+2-1)=39])<1log21+1log21=0> ; r 設初值While({t H(s[i≧ j…ij+k-1] s[ij+rk…ij+(r+1)k-1]) and ij+rk < ij+1)
//t=0 r=1 H(s[38~39]s[40~41]) FPGSSSSTGSITP ij+rk < ij+1=>38+2<44(Y)
r=2 H(s[38~39]s[42~43]) FPGSSSSTGSITP ij+rk < ij+1=>38+4<44(Y)
r=3 H(s[38~39]s[44~45]) FPGSSSSTGSITP ij+rk < ij+1=>38+6<44(N)
If(ij+rk i≧ j+1) then mask s[ij…ij+1+k-1]as on LCRs //mask s[38~45]
Table 1. Tandem pattern (lsls, dldl, slsl, ylyl, flfl, lnln) distribution in the test set
1.Lsls,dldl,slsl,ylyl,flfl,lnln
2.C2: 這 6 個 tandem patterns 在每個 family 中出現的次數 .
3.C3: 這 6 個 tandem patterns 在每個 family 中的 domain 出現的次數 .
Table 2. Test data analysis
1.C2: 在這個 family 的 domain 總數
2.C3:family 中的 domain residues 的比例
3.C4:residues 在 LCRs 中的比率
Table 3. Performance comparison of the four algorithms with test set of 1000 proteins from 50 top 20 families of Pfam database
Protein Q9WDB4 has a long domain (italics). Masking is not critical enough to affect Pfam’s capability of detecting domains.