21
Recent Research and Development on Mic roarray Data Mining Shin-Mu Tseng 曾曾曾 [email protected] Dept. Computer Science and Information Engineering National Cheng Kung University Taiwan, R.O.C. August 13, 2001

Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 [email protected] Dept. Computer Science and Information Engineering

Embed Size (px)

Citation preview

Page 1: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

Recent Research and Development on Microarray Data Mining

Shin-Mu Tseng 曾新穆[email protected]

Dept. Computer Science and Information EngineeringNational Cheng Kung University

Taiwan, R.O.C.

August 13, 2001

Page 2: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

2

Outline

Microarray Techniques Goal of Microarray Data Mining Clustering Methods Efficient Microarray Data Mining Conclusions

Page 3: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

3

Current Status

Human genome project is at finishing stage, revealing that there are about 30,000 functional genes in a human cell

For more than 90% of the genes, we know little about their real functions

Page 4: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

4

Microarray Techniques

Main Advantage of Microarray Techniques allow simultaneous studies of the expression of thousands of

genes in a single experiment

Microarray Process Arrayer Experiments: Hybridization Image Capturing of Results Analysis

Page 5: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

5

Goal of Microarray Mining

genetest

1234....1000

A

0.60.2 00.7 .. ..0.3

B

0.40.9 00.5 .. ..0.8

C

0.20.80.30.2 .. ..0.7

… … ….

… … … … … … …

Multi-Conditions Expression Analysis

Page 6: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

6

Goal of Microarray Mining

genetest

1234....1000

A

0.60.2 00.7 .. ..0.3

B

0.40.9 00.5 .. ..0.8

C

0.20.80.30.2 .. ..0.7

… … ….

… … … … … … …

Multi-Conditions Expression Analysis

Page 7: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

7

Sample Clustering Results

Page 8: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

8

Clustering Methods

Types of Clustering Methods Partitioning : K-Means, K-Medoids, PAM, CLARA … Hierarchical : HAC 、 BIRCH 、 CURE 、 ROCK Density-based : CAST, DBSCAN 、 OPTICS 、 CLIQUE

… Grid-based : STING 、 CLIQUE 、 WaveCluster… Model-based : COBWEB 、 SOM 、 CLASSIT 、 AutoCl

ass…

Page 9: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

9

Clustering Methods (cont.)

a,b,c,d,e

c,d,e

a,b

d,ed

a

b

c

e

0 1 2 3 4agglomerative

divisive4 3 2 1 0

Partitioning Hierarchical

Page 10: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

10

Clustering Methods (cont.)

Density-based Grid-based

Page 11: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

11

CAST Clustering

Input S : a symmetic n × n Similarity Matrix , S(i, j) [0, 1]∈ t : Affinity Threshold (0 < t < 1)

Method1. Choose a seed for generating a new cluster

2. ADD: add qualified items to the cluster

3. REMOVE: remove unqualified items from the stable cluster

4. Repeat Steps 1-3 till no more clusters can be generated

Page 12: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

12

Similarity Measurements:Correlation Coefficients

The most popular correlation coefficient is Pearson correlation coefficient (1892)

correlation between X={X1, X2, …, Xn} and Y={Y1, Y2,

…, Yn} :

where

n

k

kk

YX

YYXX

nr

1

1

n

k

k

n

GGG

1

2

Page 13: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

13

Similarity Measurements:Correlation Coefficients (cont.)

It captures the similarity of the ‘‘shapes’’ of two expression profiles, and ignores differences between their magnitudes.

Page 14: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

14

Problems in Microarray Mining

How to cluster microarray data with the following requirements met simultaneously ? Efficiency Accuracy Automation

Page 15: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

15

Problems in Microarray Mining (cont.)

How to cluster microarray data with the following requirements met simultaneously ? Efficiency Accuracy Automation

Good Clustering Methods +

Validation Techniques

Page 16: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

16

Efficient Microarray Mining

Improved CAST algorithm for clustering Hubert’s Γ statistic for validation Iterative sampled computation for automatic clustering

Page 17: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

17

Reduce the Computation

1. Narrow down the threshold range

2. Split and Conquer: find “nearly-best” result

<Example> m = 4

threshold0 100%

LM RM

LM: Left MarginRM: Right Margin

Page 18: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

18

Experimental Results

Dataset Source : Lawrence Berkeley National Lab (LBNL)

Michael Eisen's Lab (http://rana.lbl.gov/EisenData.htm )

Microarray expression data of yeast saccharomyces cerevisiae, containing 6221 genes with 80 conditions

Similarity matrix was obtained in advance

Page 19: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

19

Experimental Results (cont.)

Without Range Narrow down Executions: 19 Execution Time: 246

sec Γ statistic: 0.5138

With Range Narrow down Executions: 13 Execution Time: 27 sec Γ statistic: 0.5137

Page 20: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

20

Experimental Results (cont.)

Comparison

Method

Execution

Time (Sec)

Cluster Number

Best

Γ Statistic

Our Method 27 57 0.51

K-means(k= 3 ~ 21)

404 5 0.45

K-means(k= 3 ~ 39)

1092 5 0.45

Page 21: Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering

21

Conclusions

Microarray data analysis is an emerging field needing support of data mining techniques Accuracy Efficiency Automation