Upload
ruth-mcgee
View
216
Download
0
Embed Size (px)
Citation preview
Recent Research and Development on Microarray Data Mining
Shin-Mu Tseng 曾新穆[email protected]
Dept. Computer Science and Information EngineeringNational Cheng Kung University
Taiwan, R.O.C.
August 13, 2001
2
Outline
Microarray Techniques Goal of Microarray Data Mining Clustering Methods Efficient Microarray Data Mining Conclusions
3
Current Status
Human genome project is at finishing stage, revealing that there are about 30,000 functional genes in a human cell
For more than 90% of the genes, we know little about their real functions
4
Microarray Techniques
Main Advantage of Microarray Techniques allow simultaneous studies of the expression of thousands of
genes in a single experiment
Microarray Process Arrayer Experiments: Hybridization Image Capturing of Results Analysis
5
Goal of Microarray Mining
genetest
1234....1000
A
0.60.2 00.7 .. ..0.3
B
0.40.9 00.5 .. ..0.8
C
0.20.80.30.2 .. ..0.7
… … ….
… … … … … … …
Multi-Conditions Expression Analysis
6
Goal of Microarray Mining
genetest
1234....1000
A
0.60.2 00.7 .. ..0.3
B
0.40.9 00.5 .. ..0.8
C
0.20.80.30.2 .. ..0.7
… … ….
… … … … … … …
Multi-Conditions Expression Analysis
7
Sample Clustering Results
8
Clustering Methods
Types of Clustering Methods Partitioning : K-Means, K-Medoids, PAM, CLARA … Hierarchical : HAC 、 BIRCH 、 CURE 、 ROCK Density-based : CAST, DBSCAN 、 OPTICS 、 CLIQUE
… Grid-based : STING 、 CLIQUE 、 WaveCluster… Model-based : COBWEB 、 SOM 、 CLASSIT 、 AutoCl
ass…
9
Clustering Methods (cont.)
a,b,c,d,e
c,d,e
a,b
d,ed
a
b
c
e
0 1 2 3 4agglomerative
divisive4 3 2 1 0
Partitioning Hierarchical
10
Clustering Methods (cont.)
Density-based Grid-based
11
CAST Clustering
Input S : a symmetic n × n Similarity Matrix , S(i, j) [0, 1]∈ t : Affinity Threshold (0 < t < 1)
Method1. Choose a seed for generating a new cluster
2. ADD: add qualified items to the cluster
3. REMOVE: remove unqualified items from the stable cluster
4. Repeat Steps 1-3 till no more clusters can be generated
12
Similarity Measurements:Correlation Coefficients
The most popular correlation coefficient is Pearson correlation coefficient (1892)
correlation between X={X1, X2, …, Xn} and Y={Y1, Y2,
…, Yn} :
where
n
k
kk
YX
YYXX
nr
1
1
n
k
k
n
GGG
1
2
13
Similarity Measurements:Correlation Coefficients (cont.)
It captures the similarity of the ‘‘shapes’’ of two expression profiles, and ignores differences between their magnitudes.
14
Problems in Microarray Mining
How to cluster microarray data with the following requirements met simultaneously ? Efficiency Accuracy Automation
15
Problems in Microarray Mining (cont.)
How to cluster microarray data with the following requirements met simultaneously ? Efficiency Accuracy Automation
Good Clustering Methods +
Validation Techniques
16
Efficient Microarray Mining
Improved CAST algorithm for clustering Hubert’s Γ statistic for validation Iterative sampled computation for automatic clustering
17
Reduce the Computation
1. Narrow down the threshold range
2. Split and Conquer: find “nearly-best” result
<Example> m = 4
threshold0 100%
LM RM
LM: Left MarginRM: Right Margin
18
Experimental Results
Dataset Source : Lawrence Berkeley National Lab (LBNL)
Michael Eisen's Lab (http://rana.lbl.gov/EisenData.htm )
Microarray expression data of yeast saccharomyces cerevisiae, containing 6221 genes with 80 conditions
Similarity matrix was obtained in advance
19
Experimental Results (cont.)
Without Range Narrow down Executions: 19 Execution Time: 246
sec Γ statistic: 0.5138
With Range Narrow down Executions: 13 Execution Time: 27 sec Γ statistic: 0.5137
20
Experimental Results (cont.)
Comparison
Method
Execution
Time (Sec)
Cluster Number
Best
Γ Statistic
Our Method 27 57 0.51
K-means(k= 3 ~ 21)
404 5 0.45
K-means(k= 3 ~ 39)
1092 5 0.45
21
Conclusions
Microarray data analysis is an emerging field needing support of data mining techniques Accuracy Efficiency Automation