22
A new algorithm for detecting low-complexity regions(LCRs) in protein sequences 指指指指 指指指 指指指指 指指指 BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 2 2005, pages 160–170 doi:10.1093/bioinformatics/bth497 Sung W. Shin. and Sam M. Kim Department of Computer Engineering, Kyungpook National University,Daegu 702-701, Korea Received on September 30, 2003; revised on May 25, 2004; accepted August 19, 2004 Advance Access publication August 27, 2004

A new algorithm for detecting low-complexity regions(LCRs) in protein sequences 指導教授:孫光天 報告學生:魏至軒 BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 2 2005,

Embed Size (px)

Citation preview

A new algorithm for detecting low-complexity regions(LCRs) in protein sequences

指導教授:孫光天報告學生:魏至軒

BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 2 2005, pages 160–170 doi:10.1093/bioinformatics/bth497Sung W. Shin. and Sam M. KimDepartment of Computer Engineering, Kyungpook National University,Daegu 702-701, KoreaReceived on September 30, 2003; revised on May 25, 2004; accepted August 19, 2004 Advance Access publication August 27, 2004

Abstract(1) What is LCRs:

short-period repeats( 一段段短的重複序列 ) EX : TGEASTSSTSSTSSGST……….

如果要淨化 sequence, 就要 filer 掉 LCRs 淨化後的 sequence, 更能清楚找到 domain

Reason of this research: 因為 LCRs(low complexity regions) 的關係 , 所有 pair-wise

alignment of protein sequences 和 local similarity searches總是有很多的 false positive.

EX: 可能會發生一對無關的 protein, 卻產生高相似度 , 因為 sequences 中包含了一些 LCRs.

現在大部份的 algorithms 都是用統計分析去做的 .

Abstract(2) Objective of this research:

我們希望能研究出 LCRs 的結構化 properties, 並用 algorithm 去過濾它們 .

Results of this research: 我們發展了一個 algorithm

Detecting LCRs in protein sequences Masking LCRs in protein sequences

改進 database searches 的效能

Introduction 這類的 algorithms 對 subsequence 不斷的進行複雜的分析 , 並區隔出完全相同的部份 (repeating subsequence).

常見的 algorithms: XNU :使 Statistic 和 PAM matrix SEG :使用 sliding windows 進行辨視 CAST :計算相似度的 threshold value

New algorithm: CARD(complexity analysis with repeat delimit)

CARD-preliminary(1) For a sequence s:

prefix is a leading subsequence of s. suffix is a trailing subsequence of s.

TGEASTSSGTSSGTSSGST……….[TSSG]

|s| :the number of residues in s i.e. the length of s.

Repeating subsequence(simply repeat) It’s a subsequence that appears more than once in s. In a biological sequence, a pair of repeats can occur

separated, in tandem or overlapped.

CARD-preliminary(1) -Repeating subsequence

CARD-preliminary(2) s[i .. j ] :

the subsequence of s from i-th position up to j -th position of sequence s.

For two subsequences s[i .. j ]s[k .. l] : the sequence constructed by concatenating the two subs

equences.( 意指: Repeating subsequence)

CARD-preliminary(3)-a suffix tree for a sequence s(1)

A suffix tree 的定義: (1) 對每個 suffix 而言 , 都有一條唯一的 path 從 root 出發直到 leaf node , y 從 i-th 開始 , 則 leaf node 就是 i.

TGEASTSSGTSSGTSSGST……….[TSSG] (2) 如果 y,y’ 這 2 個 suffices 有共同的 prefix, 從 root 出發並經過共同的 internal node.

TGEASTSSG TSSG ST………. (3) 能很容易的表示所有 repeating subsequences 的結構和他們的位置 .

CARD-preliminary(3)-a suffix tree for a sequence s(2)

A rooted tree is a tree which has a designated node called root.

Leaf nodes are the nodes with no child node.

A node that is neither leaf nor root is called internal node.

A path is a sequence of edges Suffix tree for sequence baaabaaa$.

CARD-preliminary(4) Let v be an internal node. The length of the path label x is k.

path label x :from root to node v. Pv : position list at v.

Suffix numbers i1, i2, . . . , im.

Pv = (k, i1, i2, . . . , im) K 指 path label x 的長度 i1, i2, . . . , im指這段 segment 分別在 sequence

中出現的位置 m 2(≧ 至少要 2 個 segment 才能 repeat)

需計算的 Pv 有aaab,a,aa,aaa

CARD-preliminary(5)

如果 Xj0是 LCR 的話 ,H(xj0xj1),H(xj0xj2),H(xj0xj3), . . . 並不會比 H(xj0) 高 .

For a sequence s,Shannon’s information measure Σ 是一個 alphabet, 一個 residues 的集合

Σ = {a1, a2, . . . , ak} size k (e.g. k = 20 in case of amino acid sequence)

H(s):

CARD-preliminary(6)- Three measures

so :未處理過的 sequence ex:FPGSSTSTPIFG sm : mask 過的 sequence ex:FPGSXXXXPIFG Do : so的 domain 數 Dm: sm的 domain 數 TX : masked residues 在 sequence 中的數量 NX : masked residues 在去除 domain 之後 sequence 中的   數量

CARD-algorithm(1)

// 建立 suffix tree

每一個 internal node v, 都建一個 Pv

CARD-algorithm(2)

Pv={2,38,44}

if(ij+k i≧ (j+1)) //38+2 44 ?≧else{t=H(s[ij…ij+k-1]); r=1; //t=H(s[38~(38+2-1)=39])<1log21+1log21=0> ; r 設初值While({t H(s[i≧ j…ij+k-1] s[ij+rk…ij+(r+1)k-1]) and ij+rk < ij+1)

//t=0 r=1 H(s[38~39]s[40~41]) FPGSSSSTGSITP ij+rk < ij+1=>38+2<44(Y)

r=2 H(s[38~39]s[42~43]) FPGSSSSTGSITP ij+rk < ij+1=>38+4<44(Y)

r=3 H(s[38~39]s[44~45]) FPGSSSSTGSITP ij+rk < ij+1=>38+6<44(N)

If(ij+rk i≧ j+1) then mask s[ij…ij+1+k-1]as on LCRs //mask s[38~45]

The number of LCR for repeat string length

The variation of marked residues for the repeat string size

Table 1. Tandem pattern (lsls, dldl, slsl, ylyl, flfl, lnln) distribution in the test set

1.Lsls,dldl,slsl,ylyl,flfl,lnln

2.C2: 這 6 個 tandem patterns 在每個 family 中出現的次數 .

3.C3: 這 6 個 tandem patterns 在每個 family 中的 domain 出現的次數 .

Table 2. Test data analysis

1.C2: 在這個 family 的 domain 總數

2.C3:family 中的 domain residues 的比例

3.C4:residues 在 LCRs 中的比率

Table 3. Performance comparison of the four algorithms with test set of 1000 proteins from 50 top 20 families of Pfam database

Protein Q9WDB4 has a long domain (italics). Masking is not critical enough to affect Pfam’s capability of detecting domains.

Table 4. The results of querying DB Uniprot using BLASTP with 100 masked proteins

報 告 結 束