
Covariate-adjusted Matrix Visualization via Correlation Decomposition


Page 1: Covariate-adjusted Matrix Visualization via Correlation Decomposition

Covariate-adjusted Matrix Visualization via Correlation Decomposition

Han-Ming Wu (吳漢銘)

Department of Mathematics, Tamkang University (Data Science and Mathematical Statistics Division)

http://www.hmwu.idv.tw

Page 2

Outline

Data/Information Visualization
Two Demo Data Sets
Generalized Association Plots (GAP)
Related Works with Matrix Visualization
Covariate-adjusted Matrix Visualization
    For a discrete covariate: Within And Between Analysis (WABA)
    For a continuous covariate: Partial Correlations
Examples
GAP Software
Concluding Remarks

Page 3

Data/Information Visualization

Exploits the human visual system to extract information from data.
Provides an overview of complex data sets.
Identifies structure, patterns, trends, anomalies, and relationships in data.
Assists in identifying the areas of interest.

Matrix Visualization: reorderable matrix, the heatmap, color histogram, data image.

Visualization = Graphing for Data + Fitting + Graphing for Model

[Figure labels: Data, Information; Raw Data Matrix, Raw Data Map]

Page 4

The Iris Data (Anderson 1935; Fisher 1936)

The iris data published by Fisher (1936) have been widely used for examples in discriminant analysis and cluster analysis.

Images source: http://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture27.pdf

Raw Data Matrix: 150 subjects (50 x 3 species) by 4 variables, plus 1 covariate (species: setosa, versicolor, virginica).

Page 5

Psychosis Disorder Data (Chen 2002)

Raw Data Matrix: 95 subjects (69 schizophrenic, 26 bipolar disorder) by 50 variables.

All the symptoms are recorded on a six-point scale (0-5).

Scale for Assessment of Negative Symptoms (SANS): 20 items, 5 subgroups.
Expression (NA1-7)
Speech (NB1-4)
Hygiene (NC1-3)
Activity (ND1-4)
Inattentiveness (NE1-2)

Scale for Assessment of Positive Symptoms (SAPS): 30 items, 4 subgroups.
Hallucinations (AH1-6)
Behavior (BE1-4)
Delusions (DL1-12)
Thought disorder (TH1-8)

Positive symptoms: behavioral excess (hallucinations, delusions, behavior, thought disorder).
Negative symptoms: behavioral deficit (expression, speech, social interaction, volition, hygiene).

Mental disorders: schizophrenia, bipolar disorder.

胡海國: Professor of Psychiatry, National Taiwan University; Director, Department of Psychiatry, National Taiwan University Hospital.

Page 6

Generalized Association Plots (GAP) (Chen, 2002)

Four Steps of Generalized Association Plots (GAP)

(1) Presentation
(2) Seriation
(3) Partition
(4) Sufficient

[Figure labels: Raw Data Matrix; Proximity Matrices for Rows and Columns; Clustering; Summarization]

Page 7

The 1st Step of GAP: Presentation of Raw Data Matrix

0. Data Transformation
1. Selection of Proximity Measures
2. Color Spectrum
3. Display Conditions

Page 8

Presentation of Raw Data Matrix: iris data

(1) Selection of Proximity Measures: Pearson correlation matrix for variables; Euclidean distance matrix for subjects
(2) Color Spectrum
(3) Range Matrix Condition
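The two proximity matrices named on this slide can be sketched in a few lines of plain Python. This is only an illustration: the toy matrix and the helper names (`pearson`, `column_correlations`, `row_distances`) are hypothetical and are not part of the GAP software.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def column_correlations(data):
    """Proximity matrix for variables: correlations between columns."""
    cols = list(zip(*data))
    return [[pearson(a, b) for b in cols] for a in cols]

def row_distances(data):
    """Proximity matrix for subjects: Euclidean distances between rows."""
    return [[math.dist(r, s) for s in data] for r in data]

# Hypothetical toy data: 4 subjects (rows) by 3 variables (columns).
toy = [
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
]
R_vars = column_correlations(toy)   # 3 x 3, ones on the diagonal
D_subj = row_distances(toy)         # 4 x 4, zeros on the diagonal
```

Both matrices are symmetric, which is what lets the same seriation be applied to rows and columns of each map.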

Page 9

Presentation of Raw Data Matrix: Psychosis Disorder Data

(1) Selection of Proximity Measures: Pearson correlation coefficient (correlation matrix for variables; correlation matrix for subjects)
(2) Color Spectrum
(3) Range Matrix Condition

[Figure: raw data matrix with the two correlation matrices]

Page 10

The 2nd Step of GAP: Seriation of Proximity Matrices and Raw Data Matrix

Relativity of a Statistical Graph
Global Criterion: GAP Rank-Two Elliptical Seriation
Local Criterion: Tree Seriation (Flipping of Tree Intermediate Nodes)

Page 11

Relativity of a Statistical Graph
Placing similar objects at closer positions.
Placing different objects at distant positions.

Seriation Methods
(1) Rank-Two Ellipse Ordering (Chen, 2002)
(2) Hierarchical Clustering Tree (Average-Linkage)

Page 12

GAP Rank-Two Elliptical Seriation

Seriation Algorithms with Converging Correlation Matrices

The p objects fall on an ellipse and have unique relative positions on the ellipse (Chen 2002).

[Figure panels: Correlation Matrix (without ordering); First two Eigenvectors]
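The idea of converging correlation matrices can be sketched as follows. This is a simplified illustration and not the GAP implementation: it repeatedly takes the correlation matrix of the current correlation matrix, stops when the matrix is numerically of rank two, and orders the objects by their angle in the plane of the first two eigenvectors. The stopping tolerance, the largest-gap rule for opening the ellipse, and the function name `rank_two_ellipse_order` are assumptions made for this sketch. It relies on NumPy.

```python
import numpy as np

def rank_two_ellipse_order(R, tol=1e-8, max_iter=100):
    """Order objects by their position on the rank-two ellipse (sketch)."""
    R = np.asarray(R, dtype=float)
    for _ in range(max_iter):
        # Eigen-decomposition, sorted from largest to smallest eigenvalue.
        evals, evecs = np.linalg.eigh((R + R.T) / 2.0)
        evals, evecs = evals[::-1], evecs[:, ::-1]
        # Assumed stopping rule: third eigenvalue negligible => rank two.
        if R.shape[0] < 3 or abs(evals[2]) < tol * evals.sum():
            break
        R = np.corrcoef(R)  # correlation matrix of the correlation matrix
    # Coordinates of the objects in the plane of the first two eigenvectors.
    coords = evecs[:, :2] * np.sqrt(np.clip(evals[:2], 0.0, None))
    angles = np.arctan2(coords[:, 1], coords[:, 0])
    order = np.argsort(angles)
    # Open the closed curve at the largest angular gap (assumed rule).
    sorted_angles = angles[order]
    gaps = np.diff(np.append(sorted_angles, sorted_angles[0] + 2 * np.pi))
    cut = int(np.argmax(gaps)) + 1
    return np.roll(order, -cut)

# Hypothetical test matrix with a known gradient: r_ij = 0.7 ** |i - j|.
p = 8
R0 = np.array([[0.7 ** abs(i - j) for j in range(p)] for i in range(p)])
order = rank_two_ellipse_order(R0)
```

For a matrix with a simple one-dimensional gradient like `R0`, the returned order is expected to follow the gradient in one direction or the other; the direction itself is arbitrary.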

Page 13

Hierarchical Clustering Tree with a Dendrogram

Different Seriations Generated from Identical Tree Structure

Tree seriation for proximity matrices
Tree seriation for raw data matrices

[Figure panels: ideal model; 1 flip; 3 flips; 5 flips; many flips]

Tree seriation: Internal Tree Flips, External Tree Flips

Bar-Joseph, Z., Gifford, D. K. and Jaakkola, T. S. (2001), Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics 17(Suppl. 1):S22–S29.
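Why one tree yields many seriations: every internal node of a binary dendrogram can be flipped without changing the clustering, so a tree with n leaves admits 2^(n-1) leaf orders. The sketch below enumerates them for a toy tree; the nested-tuple representation and the name `leaf_orders` are hypothetical, chosen only for illustration. Leaf-ordering methods such as the one cited above choose one order out of this set.

```python
from itertools import product

def leaf_orders(tree):
    """Yield every leaf order reachable by flipping internal nodes.

    A tree is either a leaf label (str) or a pair (left, right).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for lo, ro in product(list(leaf_orders(left)), list(leaf_orders(right))):
        yield lo + ro  # children kept in their original order
        yield ro + lo  # children flipped at this node

# Hypothetical dendrogram: (a, b) and (c, d) merge first, then the two pairs.
tree = (("a", "b"), ("c", "d"))
orders = list(leaf_orders(tree))
for o in orders:
    print(" ".join(o))
```

Every printed order keeps `a` next to `b` and `c` next to `d`, because flips never separate the leaves of a subtree.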

Page 14

Global vs. Local Seriation

Data: 517 genes by 13 arrays

[Figure panels: GAP rank-two elliptical seriation; Michael Eisen (1998) tree seriation]

Tien, Y. J., Lee, Y. S, Wu, H. M. and Chen, C. H.* (2008), Methods for Simultaneously Identifying Coherent Local Clusters with Smooth Global Patterns in Gene Expression Profiles. BMC Bioinformatics 9:155, 1-16.

Page 15

Related Works of Matrix Visualization

Concept:
1. Bertin (1967): reorderable matrix.
2. Carmichael and Sneath (1969): taxometric maps.

Clustering of data arrays:
1. Hartigan (1972): direct clustering of a data matrix.
2. Tibshirani (1999): block clustering.
3. Lenstra (1974): traveling-salesman problem.
4. Slagle et al. (1975): shortest spanning path.

Colour Representation:
1. Wegman (1990): colour histogram.
2. Minnotte and West (1998): data image.
3. Marchette and Solka (2003): outlier detection.

Page 16

Related Works of MV (conti.)

Matrix Visualization (MV): reorderable matrix, the heatmap, color histogram, data image.

Exploring proximity matrices only:
1. Ling (1973): shaded correlation matrix.
2. Murdoch and Chow (1996): elliptical glyphs.
3. Friendly (2002): corrgrams.

Integration of raw data matrix with two proximity matrices:
1. Chen (1996, 1999, and 2002): generalized association plots (GAP).

Reordering of variables and samples:
1. Chen (2002): concept of relativity of a statistical graph.
2. Friendly and Kwan (2003): effect ordering of data displays.
3. Hurley (2004): placing interesting displays in prominent positions.

Page 17

Covariate-adjusted

[Figure panels: first two principal components for the Iris Data and the Psychosis Disorder Data]

Page 18

A Model

Page 19

Correlation Decomposition

Page 20

Covariate-adjusted MV for Discrete Case

Discrete covariate: Y

Correlations (distances) for rows and correlations for columns, based on:
(1) raw data matrix
(2) fitted data matrix
(3) residual data matrix
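The three data matrices for the discrete case can be sketched as below. The model slide did not survive in this transcript, so the sketch assumes the simplest reading: the fitted value of each entry is the mean of its column within the subject's group, and the residual is raw minus fitted. The function name `group_decompose` and the toy numbers are hypothetical.

```python
from collections import defaultdict

def group_decompose(data, groups):
    """Split each row into a fitted part (group column means) and a residual."""
    members = defaultdict(list)
    for row, g in zip(data, groups):
        members[g].append(row)
    # Column means within each level of the discrete covariate.
    means = {g: [sum(col) / len(rows) for col in zip(*rows)]
             for g, rows in members.items()}
    fitted = [means[g][:] for g in groups]
    residual = [[x - m for x, m in zip(row, means[g])]
                for row, g in zip(data, groups)]
    return fitted, residual

# Hypothetical toy data: 4 subjects, 2 variables, covariate with levels A and B.
data = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [14.0, 2.0]]
groups = ["A", "A", "B", "B"]
fitted, residual = group_decompose(data, groups)
```

Row and column proximities can then be computed separately on `data`, `fitted`, and `residual`, giving one matrix map per component.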

Page 21

Within And Between Analysis (WABA)

Dansereau, F., Alutto, J. A., & Yammarino, F. J. (1984).

WABA equation: Total correlation = Between component + Within component

Between component: between-eta correlations and between-group correlation
Within component: within-eta correlations and within-group correlation
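The symbolic equation did not survive in this transcript. In the WABA literature the decomposition is usually written as r_xy = eta_Bx * eta_By * r_Bxy + eta_Wx * eta_Wy * r_Wxy, where the between parts are group means and the within parts are deviations from group means. The sketch below checks that identity numerically on made-up data; the function names and the numbers are hypothetical.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sum_sq(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v)

def split(v, groups):
    """Between part (group mean per subject) and within part (deviation)."""
    totals = {}
    for val, g in zip(v, groups):
        s, n = totals.get(g, (0.0, 0))
        totals[g] = (s + val, n + 1)
    between = [totals[g][0] / totals[g][1] for g in groups]
    within = [val - b for val, b in zip(v, between)]
    return between, within

def waba(x, y, groups):
    xb, xw = split(x, groups)
    yb, yw = split(y, groups)
    eta_bx, eta_by = (math.sqrt(sum_sq(v) / sum_sq(t)) for v, t in ((xb, x), (yb, y)))
    eta_wx, eta_wy = (math.sqrt(sum_sq(v) / sum_sq(t)) for v, t in ((xw, x), (yw, y)))
    r_b, r_w = pearson(xb, yb), pearson(xw, yw)
    # total, between component, within component, r_B, r_W
    return pearson(x, y), eta_bx * eta_by * r_b, eta_wx * eta_wy * r_w, r_b, r_w

# Hypothetical scores for 6 subjects in two groups.
x = [1.0, 2.0, 3.0, 6.0, 7.0, 9.0]
y = [2.0, 1.0, 4.0, 5.0, 9.0, 7.0]
groups = ["A", "A", "A", "B", "B", "B"]
total, between, within, r_b, r_w = waba(x, y, groups)
```

With only two groups the between-group correlation `r_b` is necessarily +1 or -1, since two group means always lie on a line.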

Page 22

Three Steps to WABA

WABA I: Assessment of Variation (eta)

Each variable is assessed to determine whether it varies
between groups (suggesting within-group homogeneity);
within groups (suggesting within-group heterogeneity);
both between and within groups (suggesting individual differences rather than within-group homogeneity or heterogeneity).

Page 23

Three Steps to WABA (conti.)

WABA II: Assessment of Covariation (RB, RW)

Relationships among variables are assessed to determine whether the correlation between variables is primarily a function of
between-group covariance;
within-group covariance;
within- and between-group covariance (suggesting individual differences).

Drawing Inferences: Combination of WABA I and WABA II (R, B, W)

The results of the first two steps are assessed for consistency and combined to draw the best overall conclusion from the data.

Page 24

Covariate-adjusted MV for Continuous Case

Continuous covariate: Y

Correlations (distances) for rows and correlations for columns, based on:
(1) raw data matrix
(2) fitted data matrix
(3) residual data matrix

Page 25

Partial Correlations

Conditional correlation is equivalent to partial correlation under some assumptions (Kurowicka and Cooke, 2000).
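For a single continuous covariate z, the first-order partial correlation has the textbook form r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)), and it equals the ordinary correlation between the residuals of x and y after each is regressed on z. The sketch below checks that equivalence; the data values and function names are hypothetical and the slide's own formula is not reproduced in this transcript.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def residuals(v, z):
    """Residuals of the least-squares line v ~ a + b * z."""
    n = len(v)
    mv, mz = sum(v) / n, sum(z) / n
    b = (sum((zi - mz) * (vi - mv) for zi, vi in zip(z, v))
         / sum((zi - mz) ** 2 for zi in z))
    return [(vi - mv) - b * (zi - mz) for vi, zi in zip(v, z)]

# Hypothetical data: covariate z and two variables that both depend on it.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
y = [1.0, 2.5, 2.7, 4.4, 4.6, 6.3]

r_partial = partial_corr(x, y, z)
r_residual = pearson(residuals(x, z), residuals(y, z))
```

This is why working on the residual data matrix, as in the continuous case above, amounts to displaying partial correlations.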

Page 26

Assessing the Goodness of Fit of the Model Component

Page 27

Significance Analysis of the Residual Component

Dunn and Clark's z test for the equality of two dependent correlations, for the case where N exceeds 20 (Steiger, 1980).
Tests whether the correlation between variables Xj and Xk differs significantly before and after a covariate adjustment.
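As a rough illustration of the kind of statistic involved, the sketch below implements the commonly cited form of Dunn and Clark's z for two dependent correlations that share one variable (r_jk versus r_jh, measured on the same N subjects). How the slides map the raw and adjusted correlations onto this test is not recoverable from the transcript, so the setup, the function name `dunn_clark_z`, and the input values are assumptions.

```python
import math

def dunn_clark_z(r_jk, r_jh, r_kh, n):
    """Dunn and Clark's z for H0: rho_jk == rho_jh (overlapping case).

    r_kh is the correlation between the two non-shared variables;
    n is the number of subjects.
    """
    psi = (r_kh * (1 - r_jk ** 2 - r_jh ** 2)
           - 0.5 * r_jk * r_jh * (1 - r_jk ** 2 - r_jh ** 2 - r_kh ** 2))
    # Estimated correlation between the two Fisher-transformed correlations.
    c = psi / ((1 - r_jk ** 2) * (1 - r_jh ** 2))
    diff = math.atanh(r_jk) - math.atanh(r_jh)  # Fisher z difference
    return diff * math.sqrt(n - 3) / math.sqrt(2 - 2 * c)

# Hypothetical correlations; n = 95 matches the number of subjects in the data.
z = dunn_clark_z(0.5, 0.3, 0.4, n=95)
```

The statistic is referred to the standard normal distribution, so values far from zero flag a variable pair whose correlation changed.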

Page 28

z-score Significance Map

The z-score significance map helps identify variable pairs with the most significant differences in correlation before and after a covariate adjustment.

[Figure panels: R; Radj; z (Dunn and Clark's z test)]

Page 29

Simulation Study

Page 30

Psychosis Disorder Data: R

Rank-two ellipse ordering

Five symptom groups identified by Chen (2002):
thought disorder
negative
auditory hallucination
loss of ego boundary
mania

NOTE: the mania symptoms are negatively related to the negative symptoms and the auditory hallucination symptoms.

Page 31

Psychosis Disorder Data: R=B+W

By comparing B and R, the negative correlations of the mania symptoms V5 (DL4, TH6-8) with the negative symptoms V2 (NC1-ND4) and the auditory hallucination symptoms V3 are mostly due to the patients' subtypes.

Page 32

Psychosis Disorder Data: B

Average-linkage + GrandPa Flip

[Figure labels: mania symptoms (DL4, TH6-8); negative symptoms (NC1-ND4); auditory hallucination symptoms; delusions]

Page 33

Psychosis Disorder Data: RB

Average-linkage + GrandPa Flip

All correlations are either +1 or -1, since there are only two patient subtypes.

Two clusters (DL2-TH6) and (NA7-NA6) are formed and are negatively correlated.

Among the 50 between-etas, symptom TH6 has the darkest between-eta, i.e. the most significant difference between schizophrenic and bipolar disorder patients.

TH6: Pressure of Speech (speech that cannot be stopped)

Page 34

Psychosis Disorder Data: W

Rank-two ellipse ordering

Residual Patterns

Page 35

Psychosis Disorder Data: RW

Rank-two ellipse ordering

Four new symptom groups: (ND2-NE1), (TH5-TH7), (TH3-TH4) and (DL4-DL6).

Four symptoms NE1, DL2, BE1, and BE2 were grouped into the original negative symptoms group.

The symptoms in the TH (thought disorder) group were split into two highly correlated subgroups (TH3-TH4, TH5-TH7).

All hallucination symptoms (AH1-6) and most of the delusion symptoms (except DL2, DL3) were clustered together.

negative symptoms

thought disorder

hallucination

delusion

Page 36

Psychosis Disorder Data: Z

Positive z scores:
between and within the group of the negative symptoms (V2, except NA6, DL3, and BE4) and the group of the auditory hallucination symptoms (V3, except DL6);
within the group of the mania symptoms (V5, except DL5 and BE3).

Negative z scores:
between the group of the mania symptoms V5 and both the group of the negative symptoms (V2, except DL3 and BE4) and the group of the auditory hallucination symptoms (V3, except DL6).

negative

auditory hallucination

mania

Page 37

Psychosis Disorder Data: Z

Changed significantly: the relationships of symptom TH4 (illogicality) of V1 to the group of the negative symptoms (V2, except NA6, DL3, and BE4) and to the group of the auditory hallucination symptoms (V3, except DL6).

Without a significant relationship with any other symptom for different patients' subtypes: eleven symptoms (TH5, NE2, DL2, BE1, BE2, DL3, BE4, AH6, AH5, DL5, and BE3).

Note: positive symptoms of behavior (BE1-BE4) are all included.

thought disorder

loss of ego boundary

Page 38

Psychosis Disorder Data: Z

Single-linkage + GrandPa Flip

[Legend: most significant differences, marked with a right slash or a reversed slash]

(AH1, DL4), (AH1, TH7), (DL1, DL4), (TH6, NC2), (TH7, NA1), (TH7, NA2), (TH7, NA3), (TH7, NA4), (TH7, NA5), (TH7, NB1), and (TH7, ND1).

Bipolar disorder patients tend to have higher distractible speech scores (TH7).

Schizophrenic patients are more likely to have higher negative symptom scores.

Page 39

Psychosis Disorder Data: Z

Single-linkage + GrandPa Flip

[Legend: most significant differences, marked with a right slash or a reversed slash]

(AH1, NA5), (DL1, NA4), (DL7, NA1) and (DL7, NA5).

Bipolar disorder patients have lower scores on these symptoms than schizophrenic patients.

Page 40

GAP Software version 0.2.7

Generalized Association Plots
Input Data Type: continuous or binary.
Various seriation algorithms and clustering analysis.
Various display conditions.

Modules:
Covariate Adjusted.
Proximity Modelling.
Nonlinear Association Analysis.
Missing Value Imputation.

Statistical Plots: 2D Scatterplot, 3D Scatterplot (Rotatable)

Download: http://gap.stat.sinica.edu.tw/Software/GAP

Wu, H. M., Tien, Y. J. and Chen, C. H.* (2010). GAP: A Graphical Environment for Matrix Visualization and Cluster Analysis, Computational Statistics and Data Analysis, 54, 767-778.

Page 41

Concluding Remarks

Suggestions
A preliminary step in modern exploratory data analysis.
A continuing and active topic of research and application.
A new generation of exploratory data analysis (EDA) tool.

Matrix Visualization
Color order-based representation of data matrices.
Provides several levels of information.

Covariate-adjusted Matrix Visualization
Decomposition of correlations.
Working on fitted and residual data matrices.
Interactive software: GAP.
Extension to multi-level data.
