Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 [email protected] Dept. Computer Science and Information Engineering

Recent Research and Development on Microarray Data Mining

Shin-Mu Tseng 曾新穆[email protected]

Dept. Computer Science and Information EngineeringNational Cheng Kung University

Taiwan, R.O.C.

August 13, 2001

2

Outline

Microarray Techniques Goal of Microarray Data Mining Clustering Methods Efficient Microarray Data Mining Conclusions

3

Current Status

Human genome project is at finishing stage, revealing that there are about 30,000 functional genes in a human cell

For more than 90% of the genes, we know little about their real functions

4

Microarray Techniques

Main Advantage of Microarray Techniques allow simultaneous studies of the expression of thousands of

genes in a single experiment

Microarray Process Arrayer Experiments: Hybridization Image Capturing of Results Analysis

5

Goal of Microarray Mining

genetest

1234....1000

A

0.60.2 00.7 .. ..0.3

B

0.40.9 00.5 .. ..0.8

C

0.20.80.30.2 .. ..0.7

… … ….

… … … … … … …

Multi-Conditions Expression Analysis

6

Goal of Microarray Mining

genetest

1234....1000

A

0.60.2 00.7 .. ..0.3

B

0.40.9 00.5 .. ..0.8

C

0.20.80.30.2 .. ..0.7

… … ….

… … … … … … …

Multi-Conditions Expression Analysis

7

Sample Clustering Results

8

Clustering Methods

Types of Clustering Methods Partitioning ： K-Means, K-Medoids, PAM, CLARA … Hierarchical ： HAC 、 BIRCH 、 CURE 、 ROCK Density-based ： CAST, DBSCAN 、 OPTICS 、 CLIQUE

… Grid-based ： STING 、 CLIQUE 、 WaveCluster… Model-based ： COBWEB 、 SOM 、 CLASSIT 、 AutoCl

ass…

9

Clustering Methods (cont.)

a,b,c,d,e

c,d,e

a,b

d,ed

a

b

c

e

0 1 2 3 4agglomerative

divisive4 3 2 1 0

Partitioning Hierarchical

10

Clustering Methods (cont.)

Density-based Grid-based

11

CAST Clustering

Input S ： a symmetic n × n Similarity Matrix ， S(i, j) [0, 1]∈ t ： Affinity Threshold (0 < t < 1)

Method1. Choose a seed for generating a new cluster

2. ADD: add qualified items to the cluster

3. REMOVE: remove unqualified items from the stable cluster

4. Repeat Steps 1-3 till no more clusters can be generated

12

Similarity Measurements：Correlation Coefficients

The most popular correlation coefficient is Pearson correlation coefficient (1892)

correlation between X={X1, X2, …, Xn} and Y={Y1, Y2,

…, Yn} ：

where

n

k

kk

YX

YYXX

nr

1

1

n

k

k

n

GGG

1

2

13

Similarity Measurements：Correlation Coefficients (cont.)

It captures the similarity of the ‘‘shapes’’ of two expression profiles, and ignores differences between their magnitudes.

14

Problems in Microarray Mining

How to cluster microarray data with the following requirements met simultaneously ? Efficiency Accuracy Automation

15

Problems in Microarray Mining (cont.)

How to cluster microarray data with the following requirements met simultaneously ? Efficiency Accuracy Automation

Good Clustering Methods +

Validation Techniques

16

Efficient Microarray Mining

Improved CAST algorithm for clustering Hubert’s Γ statistic for validation Iterative sampled computation for automatic clustering

17

Reduce the Computation

1. Narrow down the threshold range

2. Split and Conquer: find “nearly-best” result

<Example> m = 4

threshold0 100%

LM RM

LM: Left MarginRM: Right Margin

18

Experimental Results

Dataset Source ： Lawrence Berkeley National Lab (LBNL)

Michael Eisen's Lab (http://rana.lbl.gov/EisenData.htm ）

Microarray expression data of yeast saccharomyces cerevisiae, containing 6221 genes with 80 conditions

Similarity matrix was obtained in advance

19

Experimental Results (cont.)

Without Range Narrow down Executions： 19 Execution Time： 246

sec Γ statistic： 0.5138

With Range Narrow down Executions： 13 Execution Time： 27 sec Γ statistic： 0.5137

20

Experimental Results (cont.)

Comparison

Method

Execution

Time (Sec)

Cluster Number

Best

Γ Statistic

Our Method 27 57 0.51

K-means(k= 3 ~ 21)

404 5 0.45

K-means(k= 3 ~ 39)

1092 5 0.45

21

Conclusions

Microarray data analysis is an emerging field needing support of data mining techniques Accuracy Efficiency Automation

Documents

Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 [email protected] Dept. Computer Science and Information Engineering