1 August 30, 2014
中央研究院 人文社會科學館國際會議廳
Collaboration with Statistician? 矩陣視覺化於探索式資料分析
陳君厚 中央研究院 統計科學研究所
2
http://escience.washington.edu/blog/new-phd-tracks-big-data
3
Q: 君厚, Big Data 這麼熱, 我們是否應該排進統計系所學程?
A: Big Data? 現在的統計系所畢業學生能夠處理 Data 嗎?
4
胡海國 教授 台灣大學 醫學院 精神科主任
主要合作研究者
楊泮池 教授 台灣大學 醫學院 院長
陳章榮 研究員 美國 FDA 毒理研究所
李御賢 教授 長庚紀念醫院 核心實驗室 銘傳大學
白果能 研究員 中央研究院 生物醫學科學研究所
吳漢銘 教授 淡江大學 數學系
周正中 教授 中正大學 生命科學系
楊永正 教授 陽明大學 生物醫學資訊研究所
林文昌 研究員 中央研究院 生物醫學科學研究所
黃奇英 教授 陽明大學 臨床醫學研究所
楊欣洲 研究員 中央研究院 生物醫學科學研究所
中研院賴明昭副院長推廣跨領域合作 (November 10, 2004)
I. Collaboration with Statistician?
合作不一定是資料分析 與統計學家合作很可能是需要資料分析
陳老師,我的同事用Factor Analysis上 同樣的Journal只花了三分之一的時間
Luxury Research vs. Necessity Research
Senior Researcher vs. Young Investigator
(Established) vs. (Struggling)
0 10.2 0.4 0.6 0.8
Corr. -1 0 1
Corr. -1 0 1
Corr. -1 0 1
Corr. -1 0 1
-.1 .1
-0.2 0.2 -0.4 0.4
Chun-houh, can you create some powerful statistical/bioinformatics methods so we can get our experiments published in Nature/Science?
Sir, can you conduct some meaningful biological/medical experiments so we can get our methods published in Nature/Science?
Mutual Trust (XX 人在中研院 YY 會議中說) 統計所的陳君厚說你們 ZZ 所的生物晶片 都有問題
Mutual Understanding You can remove Figure1 together with my name from the paper
P3G9P1P6
G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15
G4G16G8P7S2G14S1S3P4
G3G6
P5G2G1
Negative Disorg. Host./Excit.
De l./Hall.
15 10 5 0Average Euc lidean Distance
GONEG
GWNEG
G2
G3
G4
G1
PANSS Score
1 2 3 4 5 6 7
P3G9P1P6
G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15
G4G16G8P7S2G14S1S3P4
G3G6
P5G2G1
Average Correlation1 0.20.8 0.6 0.4
Correlation Coefficient
-1 -0.5 0 0.5 1
Negative�Symptoms
Disorganized�Thought
Hostility /�Excitement
Delusion /�Hallucination
PANSS Score
1 2 3 4 5 6 7
Correlation Coefficient
-1 -0.5 0 0.5 1
G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6
G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6
Negative�Symptoms
Disorganized�Thought
Hostility /�ExcitementDelusion /�HallucinationAnxiety�Symptoms
RMG�(n=61)
MBG�(n=50)
PDHG2�(n=38)
PDHG1�(n=14)
0 5 10Average Euc lidean Distance
0.2 10.4 0.6 0.8Average Correlation Negative Disorg. Host./
Excit.De l./Hall. Anxiety
9
II. 矩陣視覺化於探索式資料分析
Matrix Visualization: Approaching Statistics and Statistical Approach
矩陣視覺化: 趨近統計與統計趨勢
Lab 309 (???) for Information Visualization
Dr. 田銀錦 Postdoc. Fellow
張勝傑 張文宗 陳柏旭 鐘雅齡 黃建勳 林香誼 劉勝宗 曾聖澧 葉紫君 吳怡真 林倩如 歐陽智聞 . . .
10
Prof. 吳漢銘 Dept. Math. Tamkang U.
Mr. 高君豪 Ph.D. student
Prof. 須上英 Dept. Stat. Nat’l Taipei U.
Dr. 何孟如 Postdoc. Fellow
Ms. 石佳鑫 Research Assistant
11
Data analysis
With the goal of • discovering useful information • suggesting conclusions • supporting decision making
A process of • inspecting data • cleaning data • transforming data • modeling data
了解資料 探索式資料分析 (Exploratory Data Analysis)
資料視覺化
allow the data to speak for themselves before standard assumptions or formal modeling
Matrix Visualization as an EDA tool for assisting formal mathematical modeling
The greatest value of a picture is when it forces us to notice what we never expected to see.
It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.
Exploratory Data Analysis EDA, John Tukey (1977)
1915~ 2000
12
John W. Tukey在探索式資料分析 (Exploratory Data Analysis, EDA) 書中開宗明義地提到: It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it. 學習你可以做什麼,有助於在資料分析的過程中達到事半功倍的效果。EDA的作用在於從「看」資料獲得資料所傳達的訊息,所著重的是簡單的算術與容易建構的圖、表。透過 EDA 對於圖表中所顯露之型樣 (pattern) 做一初步的認知與描述,再進一步以人類的心智 (mind) 對所接收的訊息做全面的分析與判斷,以探索潛藏於資料中的訊息。強調的是探索式的分析而非嚴謹的模式確認。
14
資料視覺化需要統計嗎?
I. SetosaI. VerginicaI. Versicolor
Species name
0
20
40
60
80
Pet É Pet É Sep É Sep É
0
20
40
60
80
Pet É Pet É Sep É Sep É
20
40
60
80
30 60 90 120
Series
Data
nscores
Petal width
I.S I.V I.V
10
20
30
40
50
Species name
Graphics/Visualization for high dimensional data? P>5 p>10 p>100 p>10000
15
Recent Review Articles for MV The History of the Cluster Heat Map
Leland WILKINSON and Michael FRIENDLY
The American Statistician, May 2009, Vol. 63, No. 2 179
REVIEW Seriation and Matrix Reordering Methods: An
Historical Overview by Innar Liiv
Statistical Analysis and Data Mining 3: 70–91, 2010
Figure 2. Shaded matrix display from Loua (1873), available online at http://books.google.com/books/. This was designed as a summary of 40 separate maps of Paris, showing the characteristics (e.g., national origin, professions, age, social classes) of 20 districts, using a color scale ranging from white (low) through yellow and blue to red (high).
Figure 3. Sorted shaded display from Brinton (1914). The data are ranks of U.S. states on each of 10 educational features assessed in 1910. The matrix has been sorted by the row-marginal ranks.
Figure 5. Sorted shaded display from Czekanowski (1909), reproduced in Hage and Harary (1995).
Figure 9. Cluster heat map from Wilkinson (1994). The data are social statistics (i.e., urbanization, literacy, life expectancy for females, GDP, health expenditures, educational expenditures, military expenditures, death rate, infant mortality, birth rate, and ratio of birth to death rate) from a United Nations survey of world countries. The variables were standardized before the hierarchical clustering was performed.
Matrix Visualization (MV): reorderable matrix, heatmap, color histogram, data image 16
1100004000005000000000000000002223302220034220330032111010001030000002000000000000000000000000000002555000100000000000111100000000333330053151214444205555455005551550550010000300003002210000000020000000000000010000000000000000000010000000000000000000202002202000020000022020100000323120000021322122200000000003000000000000000002200000000000000000000031100020000131300002503300000020202003043331300031551004400000344040005500000000444430444143554133300000000003000000000000000000000000000000000000000020200000030122032000000000000020101000000030200020501000500450000000134000000000200120000003204423110005005000042000000020000020001000100000003141030000000030155033000000400000000000000000000000040333555000100000044033003040440000333230300000023322220000000002000000000000000003210000000000000000000011100020000010000000503000000020022200200300302034554000550000444040020002000000200200000003204044330000000000001000000000000000001000200222200121120042400030040030000010402200023022322002000330301222333000400000200000244043340020302300000003304000452310003000003000000400330000000003000000002245430244400030000440030020233200000043433334433232222231321000100220230020000000000000200220333020424133331110000005500000000240300000000000000000010001000200000000000000000000300000000000000000000000000020444000400440000000332433033000433334333003444244441110000013001112000030112222101011121221100121222111111111111115000005001111111155555005511555555551331000000300500000010043441110101101111000101112110000000004000000000000230004400000000000000000002000000030010030000022103202200032323322202233303321221100110001100000321032212000332232223233333133214410034002002000000050000000000000000000002020002000000012000020000000000003000000000002200200000200440000401000000000015001001000011110210020322023002200004000004000000210000000002000010010003020221040400000020055300000000002000000000000010031212101555505400000344440005002000000000000000100000002100000000000000000000000000000001101200020013020321044140040000441140000100000000011101010000000223200222000500000400010000100000000323240000022222204005500053004003000000050000000000000000000000000000043400030000040000000000000000044434033311032444422333000000044530000000033030000423333333021333333211002000000000200000020000000002003200002103221121033200030000041330001210001000020102222211032323310333000200300300000011022022000312022222010322022223330003004003000000210220200002220220110002020220055511150551115555522335544101142424433454455445545000000300200000000120032023000222123112001313222114000000003303333330030555500432011120203123111002000000000000000000000002200043000000000000020000030110000302003300000232032030000433333333023333333325500002000002330100000342100402333213344403431115121110150111155552201000000000021234031111021335500555000101000010000200022020040333421221222505523445550000000003000000220000300001000000032000000031144400350001033332200402102211001000000100030202321330003300303300200003043300030022002000002000000320000000005000000002030332014300002100000003000003300000000020020000000300000020000000000000000200100111021201220000000022155111150201023024111404111555550050005003341000200040343300000000000000000001044400240040030033320505434204040022400000240414445505000500000000000000000000000333134331011111134025550005000055550000000000000002222100210003331230030330000000050000045000000000031243313304323303410551001300310330000025000000000333220231231224113230000000000000000000000210023010000000000000000001222100120000030000000200000000000000000000230210222200000000300000000032000000043000000000000000000430000000003000000000130000002000000000000012000003133300320023210300202303323000020002200022043220022000000210100401000010033011000210010013012101032005040045004005000000050000000000000000003300000000020000200000000000000100000000043434033202224012330511000300000030005030033200000333440431545144114125500053001300000000200000000003000230000000035540000000000000000002002000000000022201001200022200211330003000440200000014220000210000000000000201000301000001302000000004300432430404443403341224440242033300020030000200000003302044200000000000000000032330000000000300000223232000000322223223102302022110000000013100000000221210003320010010110011110001100000000030000000023200000033000000000000000000030000000230200200000023000000220000000011000221200100000000004000000002320000003200000020000021010001100000000020000000000100000022110000100000000220010444000300300300000333003000230200002000001200000020000000003002000002220200003220000010000000000001100000003040000000033310000043000000000000230000032222000000000010000000022110000212210111010320000214140024000013000000250300200002200000220003243222320200000031000000000400000243000000000000030000030
Data Matrix
50 題精神症狀量表 95
位精
神醫
學患
者
Data Map
5
0
嚴重
正常
50 題精神症狀量表
95 位
精神
醫學
患者
Generalized Association Plots (GAP) for MV of continuous data
2.permutation
4.summary
3.partition GAP
Data Matrix Continuous
ordinal Binary nominal
1.presentation
18
Approaching Statistics & Statistical Approach 1100004000005000000000000000002223302220034220330032111010001030000002000000000000000000000000000002555000100000000000111100000000333330053151214444205555455005551550550010000300003002210000000020000000000000010000000000000000000010000000000000000000202002202000020000022020100000323120000021322122200000000003000000000000000002200000000000000000000031100020000131300002503300000020202003043331300031551004400000344040005500000000444430444143554133300000000003000000000000000000000000000000000000000020200000030122032000000000000020101000000030200020501000500450000000134000000000200120000003204423110005005000042000000020000020001000100000003141030000000030155033000000400000000000000000000000040333555000100000044033003040440000333230300000023322220000000002000000000000000003210000000000000000000011100020000010000000503000000020022200200300302034554000550000444040020002000000200200000003204044330000000000001000000000000000001000200222200121120042400030040030000010402200023022322002000330301222333000400000200000244043340020302300000003304000452310003000003000000400330000000003000000002245430244400030000440030020233200000043433334433232222231321000100220230020000000000000200220333020424133331110000005500000000240300000000000000000010001000200000000000000000000300000000000000000000000000020444000400440000000332433033000433334333003444244441110000013001112000030112222101011121221100121222111111111111115000005001111111155555005511555555551331000000300500000010043441110101101111000101112110000000004000000000000230004400000000000000000002000000030010030000022103202200032323322202233303321221100110001100000321032212000332232223233333133214410034002002000000050000000000000000000002020002000000012000020000000000003000000000002200200000200440000401000000000015001001000011110210020322023002200004000004000000210000000002000010010003020221040400000020055300000000002000000000000010031212101555505400000344440005002000000000000000100000002100000000000000000000000000000001101200020013020321044140040000441140000100000000011101010000000223200222000500000400010000100000000323240000022222204005500053004003000000050000000000000000000000000000043400030000040000000000000000044434033311032444422333000000044530000000033030000423333333021333333211002000000000200000020000000002003200002103221121033200030000041330001210001000020102222211032323310333000200300300000011022022000312022222010322022223330003004003000000210220200002220220110002020220055511150551115555522335544101142424433454455445545000000300200000000120032023000222123112001313222114000000003303333330030555500432011120203123111002000000000000000000000002200043000000000000020000030110000302003300000232032030000433333333023333333325500002000002330100000342100402333213344403431115121110150111155552201000000000021234031111021335500555000101000010000200022020040333421221222505523445550000000003000000220000300001000000032000000031144400350001033332200402102211001000000100030202321330003300303300200003043300030022002000002000000320000000005000000002030332014300002100000003000003300000000020020000000300000020000000000000000200100111021201220000000022155111150201023024111404111555550050005003341000200040343300000000000000000001044400240040030033320505434204040022400000240414445505000500000000000000000000000333134331011111134025550005000055550000000000000002222100210003331230030330000000050000045000000000031243313304323303410551001300310330000025000000000333220231231224113230000000000000000000000210023010000000000000000001222100120000030000000200000000000000000000230210222200000000300000000032000000043000000000000000000430000000003000000000130000002000000000000012000003133300320023210300202303323000020002200022043220022000000210100401000010033011000210010013012101032005040045004005000000050000000000000000003300000000020000200000000000000100000000043434033202224012330511000300000030005030033200000333440431545144114125500053001300000000200000000003000230000000035540000000000000000002002000000000022201001200022200211330003000440200000014220000210000000000000201000301000001302000000004300432430404443403341224440242033300020030000200000003302044200000000000000000032330000000000300000223232000000322223223102302022110000000013100000000221210003320010010110011110001100000000030000000023200000033000000000000000000030000000230200200000023000000220000000011000221200100000000004000000002320000003200000020000021010001100000000020000000000100000022110000100000000220010444000300300300000333003000230200002000001200000020000000003002000002220200003220000010000000000001100000003040000000033310000043000000000000230000032222000000000010000000022110000212210111010320000214140024000013000000250300200002200000220003243222320200000031000000000400000243000000000000030000030
1. Data Matrix (n * p)
(w/ Color coding)
Continuous Ordinal Binary
Nominal
2. Proximity Matrix for Subject (n * n)
Continuous Ordinal Binary
Nominal
3. Proximity (Variable p * p)
Continuous Ordinal Binary
Nominal
Some essential elements in a GAP MV procedure
4. Permutation (variable)
4. Permutation (subject)
19
Statistical Approach Identify Global Trend: Singular Value Decomposition
20
Chen 2002, Statistica Sinica Rank 2 Elliptical
R2E
SVD
SVD1
(c) Correlation-1 0 +1
-8 1:1 +8(a) Expression
(d)
(b) Correlation-1 0 +1
Alter O. et al 2000, PNAS
SVD2
21
Eisen et al. (1998)
Tree seriation & flipping of intermediate nodes
(a)
A E D C B
D A
E C B
(b)
(c)
A B E D C
1 flip 3 flips 5 flips many flips
2n-1=25-1=16
Different Seriations (Ordering of Terminal Nodes or Leaves) Generated from Identical Tree Structure
ideal model
external and internal references for guiding flipping mechanism
Statistical Approach: Identify Local Clusters
(c) Correlation-1 0 +1
-8 1:1 +8(a) Expression
(d)
(b) Correlation-1 0 +1
(c) Correlation-1 0 +1
-8 1:1 +8(a) Expression (b) Correlation
-1 0 +1
(d) (e)(c) Correlation-1 0 +1(c) Correlation
-1 0 +1
-8 1:1 +8(a) Expression
-8 1:1 +8(a) Expression (b) Correlation
-1 0 +1(b) Correlation
-1 0 +1
(d) (e)
( c) Correl at i on
- 1 0 +1
( d)
- 8 1: 1 +8( a) Expressi on ( b) Correl ati on
- 1 0 +1
GAP Elliptical (R2E) Seriation Hierarchical Tree Seriation Tree guided by (R2E) 22
Approaching Statistics & Statistical Approach
HCT R2E HCTR2E + =
Admission 6 month
0.2 10.4 0.6 0.8
Absolute Random Error Coefficient
0 1comforting the aggravating patient�assistant to the aggravating patient�transport of the aggravating patient to service setting �financial aid�general psychological/practical supportcoping with medical team�understanding diagnosis and treatment�identifying early signs of relapse�understanding mental health laws�general social acceptance�occupational therapy�sheltered working facilities�advice on intimate relationship for patient�lifelong custodial care for patient
Need �cluster for�assistant�to patient�care
Need �cluster for�accessing �to relevant�information
Need �cluster �for�societal �support
Need �cluster�for�burden�release
Admission Hwu et al. Schizophrenia Research (2002) Symptom Patterns and Subgrouping of Schizophrenic Patients: Significance of Negative Symptoms Assessed on Admission
0 10.2 0.4 0.6 0.8
Corr. -1 0 1
Corr. -1 0 1
Corr. -1 0 1
Corr. -1 0 1
-.1 .1
-0.2 0.2 -0.4 0.4
Psychiatry Research (1998) Lin, Chen et al. Psychopathological Dimensions in Schizophrenia: A Correlational Approach to Items of the SANS and SAPS
Genes, Brain and Behavior (2009) Lin et al. Clustering by neurocognition for fine-mapping of the schizophrenia susceptibility loci on chromosome 6p
6 month Liu et al. J. of the Formosan Med. Ass. (2012) Medium-term course and outcome of schizophrenia depicted by the sixth-month subtype after an acute episode
GAP for Heritable (Genetic) Disease: Schizophrenia (National Taiwan University)
P3G9P1P6
G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15
G4G16G8P7S2G14S1S3P4
G3G6
P5G2G1
Negative Disorg. Host./Excit.
De l./Hall.
15 10 5 0Average Euc lidean Distance
GONEG
GWNEG
G2
G3
G4
G1
PANSS Score
1 2 3 4 5 6 7
P3G9P1P6
G7N6G13N1N4N2N3N5G10G12G5P2G11N7G15
G4G16G8P7S2G14S1S3P4
G3G6
P5G2G1
Average Correlation1 0.20.8 0.6 0.4
Correlation Coefficient
-1 -0.5 0 0.5 1
Negative�Symptoms
Disorganized�Thought
Hostility /�Excitement
Delusion /�Hallucination
PANSS Score
1 2 3 4 5 6 7
Correlation Coefficient
-1 -0.5 0 0.5 1
G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6
G7N6N3N1N2N4N5G16G10N7G5G13G11P2G15G12G8P7S1G14S2S3P4P6P3G9P1P5G4G2G1G3G6
Negative�Symptoms
Disorganized�Thought
Hostility /�ExcitementDelusion /�HallucinationAnxiety�Symptoms
RMG�(n=61)
MBG�(n=50)
PDHG2�(n=38)
PDHG1�(n=14)
0 5 10Average Euc lidean Distance
0.2 10.4 0.6 0.8Average Correlation Negative Disorg. Host./
Excit.De l./Hall. Anxiety
J. of the Formosan Med. Ass.(2008) Yeh et al. Factors Related to Perceived Needs of Chief Caregivers of Patients with Schizophrenia
PLoS ONE (2011) Lai et al. MicroRNA expression aberration as potential peripheral blood biomarkers for schizophrenia
Schizophrenia Research (2013) Liu et al. Development of a brief self-report questionnaire for screening putative pre-psychotic states.
BMC Genomics 9 (2008) Genomics and proteomics of immune modulatory effects of a butanol fraction of Echinacea purpurea in human dendritic cells Wang et al.
Phytochemistry 70 (2009) Anti-diabetic properties of three common Bidens pilosa variants in Taiwan Chien et al.
Journal of Nutritional Biochemistry 21 (2010) Comparative metabolomics approach coupled with cell- and gene-based assays for species classification and anti-inflammatory bioactivity validation of Echinacea plants Hou et al.
GAP for Comparative Metabolome: Chinese Herbal Medicine Drs. Ning-Sun Yang, Lie-Fen Shyur, Wen-Chin Yang Agricultural Biotechnology Research Center (ABRC) of Academia Sinica
BMC Complementary and Alternative Medicine 13 (2013) Morus alba and active compound oxyresveratrol exert anti-inflammatory activity via inhibition of leukocyte migration involving MEK/ERK signaling. Chen et al.
紫錐菊
咸豐草
白桑
Journal of Clinical Oncology 23 (2005) Tumor-Associated Macrophages in Cancer Progression Chen J. J. et al.
The New England Journal of Medicine 356 (2007) A Five-Gene Signature and Clinical Outcome in Non–Small-Cell Lung Cancer Chen H. Y. et al.
BMC Genomics 6 (2005)Molecular signature of clinical severity in recovering patients with (SARS-CoV) Lee Y. S. et al. (Chang Gung Hospital)
Open Access Scientific Reports 1 (2006) In silico Therapeutic Drug Screening for Reversing the Lung Adenocarcinoma Overexpressed Gene Signatures. Kuo Y. L. et al. (Nat’l Yang-Ming Univ.)
Cancer Research 66 (2006) Non–Small Cell Lung Cancer with Tumor Cell Invasiveness Sher Y. P. et al.
GAP for Cancer Study: Non–Small Cell Lung Cancer (National Taiwan University)
GAP for Infectious Disease: SARS b Simple Match Between Pathways
a PPI to Pathway c Simple Match Between PPIs
F13A1,HSPB1!MAPK14,EGFR!
EGFR,HSPB1!STAT1,PDGFRB!PDGFRB,CRKL!
HCK,CRKL!ITGAV,PTK2!FLT1,CRKL!
CRKL,MAPK1!CRKL,RAF1!
MAPK3,PTPN11!STAT5A,SHC1!
CRK,SRC!GAB1,SOS1!
CRK,SHC1!PXN,PTPN11!
PDGFRB,PTPN11!PDGFRB,PLCG1!
PLCG1,PTK2!CRKL,GAB1!
CRKL,PTPN11!BAD,YWHAZ!
BAD,RAF1!PTK2,PTEN!PXN,PTEN!
CRKL,PIK3R1!AKT1,HSPB1!AKT1,PDPK1!
MAPK14,AKT1!PIK3R1,SHC1!
PIK3R1,SRC!HCK,SOS1!
CRKL,SOS1!PDGFRB,RAF1!
FLT1,PTPN11!HCK,PLCG1!FLT1,PLCG1!CRKL,EGFR!
CRK,KDR!CRKL,PTK2!FLT1,PTK2!
MAPK14,MAPK3!BAD,MAPK8!
AKT1,SMAD4!FLT1,HCK!
HCK,PIK3CB!CTNNB1,FLT1!
PIK3R1,PXN!FLT1,PIK3R1!
AKT1,PAK1!AKT1,NOS3!AKT1,MDM2!PTK2,YES1!
PXN,MAPK8!CRK,FLT1!
MAPK3,MAPK1!PDGFRB,SLC9A3R1!
EGFR,HCK!MCM7,CDC6!CDC6,MCM6!
PLK1,PKMYT1!E2F1,CDC6!
CCNB1,PKMYT1!CDK7,E2F1!
PLK1,CCNB1!CCNB1,CDC25A!
CCNA2,CCNB1!
M1!
M2!
H1!
H2!
B1!
B2!
A1!
A2!
P1!
P2!
P3!P4!P5!
Signalling to RAS!Signaling by EGFR!PDGFR-alpha signaling pathway!PDGFR-beta signaling pathway!Signaling events activated by Hepatocyte Growth Factor Receptor (c-Met)!IGF1 pathway!Signaling events mediated by VEGFR1 and VEGFR2!role of pi3k subunit p85 in regulation of actin organization and cell migration!PI3K/AKT signalling!akt signaling pathway!mTOR signaling pathway!Hedgehog signaling events mediated by Gli proteins!PPAR signaling pathway - Homo sapiens (human)!Canonical Wnt signaling pathway!Complement and coagulation cascades - Homo sapiens (human)!Unwinding of DNA!Activation of the pre-replicative complex!cdk regulation of dna replication!sonic hedgehog receptor ptc1 regulates cell cycle!Cyclin A/B1 associated events during G2/M transition!E2F mediated regulation of DNA replication!
a Both Positive!
Mahlavu Only!
Huh7 Only!
Not on the pathway!
b,c 1 0
Color legends
GAP for Protein-Protien
Interaction C-Y F. Huang,
Nat’l Yang-Ming Univ.
Molecular and Cellular Proteomics 12 (2013) An analysis of protein-protein interactions in cross-talk pathways reveals CRKL as a novel prognostic marker in hepatocellular carcinoma. Liu et al.
X1 x2 x3 x4 0
1
Scatter-plot Matrix (SM) Parallel Coordinates Plot (PCP)
Mosaic Plot
Visualization for Binary Data
26
1. Data Matrix
2. Subject Proximity
3. Variable Proximity
Essential elements in a GAP MV procedure?
1. Data Matrix
2. Subject Proximity
3. Variable Proximity
Continuous Binary
Correlation Covariance polychoric Correlation . . .
Euclidean Distance Manhattan Distance Correlation … ?
27
Matrix Visualization for Binary Data
28
Commonly used similarity coefficients for binary data
Tzeng et al. (BMEI 2009) (IEEE Xplore Digital Library)
CGMIM performs automated text-mining of OMIM to identify genetically-related cancers Online Mendelian In Man (OMIM) is a computerized database of information about genes and heritable traits in human populations OMIM is maintained on the Internet by the National Center for Biotechnology Information at the US National Institutes of Health CGMIM considers 21 anatomic sites based on the major cancers identified by the National Cancer Institute of Canada CGMIM compares each OMIM entry name and alternative name with a list of gene names assigned by HUGO (HUman Genome Organization). CGMIM produces the number of genes for which an OMIM entry mentions each pair of cancers, as well as a ratio of the observed and expected number of genes for the combination
http://www.bccrc.ca/ccr/CGMIM/ CGMIM Online Binary GAP Example
29
CGMIM All Data (1948 genes * 21 Sites) Original Order
21 Cancer Sites
1948 Related Genes
Jaccard: a/(a+b+c)
30
21 Cancer Sites
1948 Related Genes
CGMIM All Data (1948 genes * 21 Sites) Single_Tree_GrandPa_Guide
Jaccard: a/(a+b+c)
31
21 Cancer Sites
768 Related Genes
CGMIM 768 genes at least at 2 Sites Original Order
Jaccard: a/(a+b+c)
32
21 Cancer Sites
768 Related Genes
CGMIM 768 genes at least at 2 Sites GAP_Elliptical_Order
Jaccard: a/(a+b+c)
33
Matrix visualization of nominal data (GAP approach)
Example: Classification of Animals Data
Shizuhiko Nishisato 2006 34
Approaching Statistics & Statistical Approach
35
票種 code 合計
早鳥/社會 0 351
早鳥/學生 1 121
邀請/⼀一般 2 48
邀請/媒體 3 25
邀請/貴賓 4 22
黑客松 5 90
邀請/講師 6 32 邀請/贊助 7 12
總計 701
性別 code 合計
女 0 166
男 1 535
總計 701 業別 code 合計
無法認定 0 56
資訊業 1 203
研究機構 2 41
科技業 3 33
顧問公司 4 13
金融業 5 18
通訊業 6 21
自由業 7 11
傳播業 8 17
學術 9 242
政府機關 10 11
醫藥業 11 8
電子業 12 4
公益團體 13 8
通路商 14 5
其他 15 10
總計 701
餐飲 code 合計
素食 0 47
葷食 1 654
總計 701
活動 code 合計
黑客松 0 91
資料分析 1 196
演講議程 2 414
總計 701
通知 code 合計
否 0 61
是 1 640
總計 701
電郵 code 合計
.edu 0 49
.org 1 15
gmail 2 500
hotmail 3 44
yahoo 4 10
MSN 5 3
.com 6 64
其他 7 16
總計 701
報名資料 可能變數
報名序 1~701 票種
業別
通知
活動
電郵
… … … … …
0 1 1 1 1 0 9 1 1 2 0 1 1 0 6 0 1 2 1 2 0 9 2 1 2 1 9 2 1 2 1 9 2 1 2 1 9 1 1 2 1 9 1 1 2 0 3 2 1 3 1 9 2 1 3 0 3 2 1 6 0 11 2 1 2 1 9 1 1 2 1 9 2 1 2 0 1 2 0 2 1 9 2 1 2 1 9 2 1 2 0 3 2 1 6 0 2 2 1 1 0 1 2 1 2 … … … … …
701 人
7 變數 報 名
葷素
性別
… … …
139 1 1
140 1 1
141 0 0
142 1 1
143 1 1
144 1 0
145 1 1
146 1 1
147 1 0
148 1 1
149 1 1
150 1 1
151 1 1
152 1 1
155 1 1
157 1 1
158 1 0
159 0 1
160 1 0
162 1 1
163 1 0
… … …
1. Data Matrix
2. Subject Proximity
3. Variable Proximity
Essential elements in a GAP MV procedure?
1. Data Matrix
2. Subject Proximity
3. Variable Proximity Continuous Nominal
Correlation Covariance polychoric Correlation . . .
Euclidean Distance Manhattan Distance Correlation … ?
type measurements
Matching proportion
€
χ 2
?
?
36
Approaching Statistics & Statistical Approach
Shizuhiko Nishisato, 2006 Classifica5on of Animals
35 animals were sorted into piles of similar animals by 15 students
Animal/Subject S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 Alligator 8 1 6 9 6 1 3 4 1 4 3 4 3 6 4 Bear 6 3 2 6 6 1 3 4 4 5 5 1 4 2 2 Camel 4 3 9 3 1 4 3 5 4 2 5 1 7 7 8 Cat 6 3 7 4 0 1 1 2 3 3 1 1 6 3 5 Cheetah 3 3 7 4 0 1 3 5 4 3 6 1 6 2 2 Chiken 7 2 4 1 2 5 7 1 5 1 1 3 8 4 1 Chimpanzee 5 3 5 7 5 2 4 4 2 2 6 4 3 6 6 Cow 1 3 9 6 1 1 3 5 3 4 1 1 4 5 8 Crane 7 2 4 5 2 5 5 1 5 1 2 3 8 4 1 Crow 7 2 4 5 2 5 5 1 5 1 2 3 8 4 1 Dog 6 3 7 10 0 2 1 2 3 3 1 1 4 3 5 Duck 7 2 4 1 2 5 5 1 5 1 2 3 8 4 1 Elephant 4 3 6 3 1 4 3 5 4 5 3 1 7 7 2 Fox 6 3 7 4 0 1 6 2 3 3 3 1 4 3 5 Frog 8 1 3 2 3 3 2 3 1 4 4 2 1 1 3 Giraffe 1 3 8 3 1 4 3 5 4 2 5 1 7 7 8 Goat 3 3 9 6 1 4 6 5 3 3 1 1 5 3 5 Hawk 7 2 4 5 2 5 5 1 5 1 3 3 8 4 1 Hippopotamus 4 3 6 6 6 4 3 3 4 4 5 1 7 7 2 Horse 6 3 9 6 1 2 3 5 3 3 1 1 5 5 8 Leopard 1 3 7 4 0 1 3 5 4 3 3 1 6 2 2 Lion 5 3 7 4 6 1 3 5 4 3 3 1 7 2 2 Lizard 2 1 3 2 3 3 2 3 1 4 4 2 2 1 3 Monkey 6 3 5 7 5 2 4 4 2 2 6 4 3 6 6 Ostrich 3 2 4 1 2 5 3 1 5 1 5 3 8 7 8 Pig 1 3 9 6 1 1 6 5 3 3 1 1 5 5 5 Pigeon 7 2 4 5 2 5 5 1 5 1 2 1 8 4 1 Rabbit 6 3 1 6 0 4 6 2 3 3 1 1 5 3 5 Racoon 6 3 7 10 4 1 6 2 3 3 3 1 4 3 5 Rhinoceros 4 3 5 6 6 4 3 5 4 4 5 1 7 7 2 Snake 8 1 3 9 6 3 2 3 1 4 4 2 2 1 3 Sparrow 7 2 4 5 2 5 5 1 5 2 2 3 8 4 1 Tiger 5 3 7 4 0 1 3 5 4 3 3 1 6 2 2 Tortoise 8 1 3 9 3 3 2 3 1 5 4 2 1 1 3 Turkey 7 2 4 1 2 5 7 1 5 1 1 3 8 4 1
What about 3500 samples 1500 variables
?
A typical nominal data
37
Alligator Bear Camel Cat Cheetah Chicken Cow Crane Chimpanzee Crow
Dog Duck Elephant Fox Frog Giraffe Goat Hawk Hippopotamus Horse
Leopard Lion Lizard Ostrich Pig Pigeon Rabbit Racoon Rhinoceros Snake
Sparrow Tiger Tortoise Turkey Monkey
39
S12
Mammalia Reptilia Aves Primates ? Mammalia Reptilia Aves
Uni-variate Display Bar-Chart Pie-Chart
S2
20
15
10
5
0 3 4 2 1 3 2 1
41 3. Mammal 1. Reptile 2. Bird
4. P
rimat
e?
1. M
amm
al
2. R
eptil
e 3.
Bird
Bi-variate Display Mosaic Display
Conventional multivariate visualization for this data
2D Mosaic Display
5D Mosaic Display
Parallel Coordinate Plot
Scatter-plot Matrix
Scatter-plot
42
GAP Categorical MV Solution with 3 matrix maps
(original orders)
43
Alliga Bear Camel Cat Cheeta Chicke Chimpa Cow Crane Crow Dog Duck Elepha Fox Frog Giraff Goat Hawk Hippop Horse Leopar Lion Lizard Monke Ostric Pig Pigeon Rabbit Racoo Rhinoc Snake Sparro Tiger Tortoi Turkey
44
GAP Categorical MV Solution with 3 matrix maps (R2E orders)
45
11 6 7
15 14 12 13 9 4 3 5 8
10 2 1
Os
tr
ic
Tu
rk
ey
C
hic
ke
P
ige
on
H
aw
k
Sp
ar
ro
D
uc
k
Cr
ow
C
ra
ne
L
iza
rd
F
ro
g
To
rt
oi
Sn
ak
e
Al
lig
a
Hip
po
p
Be
ar
R
hin
oc
E
le
ph
a
Lio
n
Tig
er
L
eo
pa
r
Co
w
Fo
x
Ra
co
o
Ch
ee
ta
C
at
D
og
P
ig
Ra
bb
it
Ho
rs
e
Go
at
C
am
el
G
ira
ff
M
on
ke
C
him
pa
11 6 7
15 14 12 13 9 4 3 5 8
10 2 1
Mammalia Reptilia Aves Primates
D0_Lion D1_Elephant D2_Camel D3_Hawk D4_Fox
A0_Dog A1_Alligator A2_Chimpanzee A3_Cow A4_Crow A5_Pigeon A6_Cheetah A7_Chiken A8_Bear A9_Cat
B0_Rabbit B1_Frog B2_Goat B3_Tiger B4_Rhinoceros B5_Giraffe B6_Duck B7_Sparrow B8_Hippopotamus B9_Monkey
C0_Turkey C1_Pig C2_Crane C3_Leopard C4_Ostrich C5_Lizard C6_Horse C7_Racoon C8_Tortoise C9_Snake
Alligator Bear Camel Cat Cheetah Chicken Cow Crane Chimpanzee Crow
Dog Duck Elephant Fox Frog Giraffe Goat Hawk Hippopotamus Horse
Leopard Lion Lizard Ostrich Pig Pigeon Rabbit Racoon Rhinoceros Snake
Sparrow Tiger Tortoise Turkey Monkey
?
48
Univariate frequency breakdown Bivariate table
與會人士類別性資料
49
票種 code 合計
早鳥票/社會人士 0 351
早鳥票/學生 1 121
邀請票/⼀一般 2 48
邀請票/媒體 3 25
邀請票/貴賓 4 22
g0v 黑客松票 5 90
邀請票/講師 6 32 邀請票/贊助 7 12
總計 701
性別 code 合計
女 0 166
男 1 535
總計 701
業別 code 合計
Missing、無法認定 0 56
資訊業 1 203
研究機構 2 41
科技業 3 33
顧問公司 4 13
金融業 5 18
通訊業 6 21
自由業 7 11
傳播業 8 17
學術 9 242
政府機關 10 11
醫藥業 11 8
電子業 12 4
公益團體 13 8
通路商 14 5
製造業、食品業、運輸業、服務業 15 10
總計 701
餐飲需求 code 合計
素食 0 47
葷食 1 654
總計 701
參加活動 code 合計
g0v 黑客松 0 91
資料分析上手課程 1 196
演講議程 2 414
總計 701
日後通知 code 合計
否 0 61
是 1 640
總計 701
電郵 code 合計
.edu 0 49
.org 1 15
gmail 2 500
hotmail 3 44
yahoo 4 10
MSN 5 3
.com 6 64
未定、其他 7 16
總計 701
報名資料可能變數
報名序 1~701
50
701
人
5 變數
Optimization 票種
業別
通知
活動
電郵
… … … … …
0 1 1 1 1 0 9 1 1 2 0 1 1 0 6 0 1 2 1 2 0 9 2 1 2 1 9 2 1 2 1 9 2 1 2 1 9 1 1 2 1 9 1 1 2 0 3 2 1 3 1 9 2 1 3 0 3 2 1 6 0 11 2 1 2 1 9 1 1 2 1 9 2 1 2 0 1 2 0 2 1 9 2 1 2 1 9 2 1 2 0 3 2 1 6 0 2 2 1 1 0 1 2 1 2 … … … … …
葷素
性別
報名
… … …
1 1 139
1 1 140
0 0 141
1 1 142
1 1 143
1 0 144
1 1 145
1 1 146
1 0 147
1 1 148
1 1 149
1 1 150
1 1 151
1 1 152
1 1 155
1 1 157
1 0 158
0 1 159
1 0 160
1 1 162
1 0 163
… … …
3 共變數
51
其他
.com
MSN
Yahoo
Hotmail
Gmail
.org
.edu
是
否
演講議程
資料分析
黑客松
其他
通路商
公益團體
電子業
醫藥業
政府機關
學術
傳播業
自由業
通訊業
金融業
顧問公司
科技業
研究機構
資訊業
無法認定
邀請贊助
邀請講師
黑客松
邀請貴賓
邀請媒體
邀請一般
早鳥學生
早鳥社會
票種 電郵
通知 活動
業別 李育杰
701 人
5 變數
52
報 名 序 女素
票 業 活 通 電 種 別 動 知 郵
票種 業別 活動 通知 電郵
5 變數
701
人
Original Orders (報名序)
53
邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會
其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定
其他 .com MSN yahoo hotmail gmail .org .edu
通知 是否
活 動
演講議程 資料分析 g0v黑客松
報 名 序
票 種
業 別
電 郵
女素
票 業 活 通 電 種 別 動 知 郵
票種 業別 活動 通知 電郵
5 變數 Original Orders (報名序)
701
人
票 通 活 電 業 種 知 動 郵 別
54
其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定
其他 .com MSN yahoo hotmail gmail .org .edu
通知 是否
活 動
演講議程 資料分析 g0v黑客松
報 名 序
票 種
業 別
電 郵
Random Orders
票種 通知 活動 電郵 業別 女素
5 變數
701
人
邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會
55
票種 業別 活動 通知 電郵 報名 女男 素葷
通知 活動
票種
業別
電郵 其他 .com MSN yahoo hotmail gmail .org .edu
是否
演講議程 資料分析 黑客松
邀請/贊助 邀請/講師
黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會
其他 通路商
公益團體 電子業 醫藥業
政府機關 學術
傳播業 自由業 通訊業 金融業
顧問公司 科技業
研究機構 資訊業
無法認定
701 人
沈澱圖 (pie-chart, bar-chat) (可觀察單一變數各類別之比例,但失去變數間連結)
166
100
56
其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定
報 名 序
票 種
業 別
Elliptical Seriations (R2E)
電郵 通知 活動 票種 業別
電 通 活 票 業 郵 知 動 種 別 其他
.com MSN yahoo hotmail gmail .org .edu
通知 是否
活 動
演講議程 資料分析 g0v黑客松
電 郵
5 變數
701
人
女素
邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會
業 電 通 活 票 別 郵 知 動 種
57
其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定
報 名 序
業別 電郵 通知 活動 票種
票 種
業 別
Hierarchical Clustering Tree (HCT)
其他 .com MSN yahoo hotmail gmail .org .edu
通知 是否
活 動
演講議程 資料分析 g0v黑客松
電 郵
5 變數
701
人
女素
邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會
電 通 活 票 業 郵 知 動 種 別
58
其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定
報 名 序
票 種
業 別
Hierarchical Clustering Tree (HCT)
其他 .com MSN yahoo hotmail gmail .org .edu
通知 是否
活 動
演講議程 資料分析 g0v黑客松
電 郵 電郵
通知 活動 票種 業別
5 變數
701
人
女素
邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會
59
電郵 通知 活動 票種 業別
票 種
其他 通路商 公益團體 電子業 醫藥業 政府機關 學術 傳播業 自由業 通訊業 金融業 顧問公司 科技業 研究機構 資訊業 無法認定
業 別
電 郵
其他 .com MSN yahoo hotmail gmail .org .edu
通 知
是否
活 動
演講議程 資料分析 g0v黑客松
報名 Orders: HCT-R2E Hierarchical Clustering Tree with Flips Guided Elliptical Seriation
邀請/贊助 邀請/講師 g0v黑客松 邀請/貴賓 邀請/媒體 邀請/一般 早鳥/學生 早鳥/社會
女素
Data: 160 international organization membership pattern (variables) for 230 countries/regions (subjects) 0. non-member □ 1. member ■ 2. observer 3. associate member 4. guest 5. dialogue partner
http://www.faqs.org/docs/factbook/index.html
CIA Political Map of the World
230 countries (regions)
CIA
60
160 international organization
Matrix Visualization with cartography links Approaching Statistics & Statistical Approach
. . . . . . 160 maps (?)
Draw one membership map for each organization (variable)?
1 2 3
4 5 6
7 8 9
158 159 160 61
Data: Ranks of 5 Candidates (扁宋連許李) on 360 Townships 2000 總統大選資料
Is it possible to visualize information structure for all 5 candidates in a single MAP?
A B C D E
�
D E
A B C
? Rank 1 2 3 4 4.5 5
���������������������
��"
#"
��"
��"
��"
��"
��"
��"
��"
��"
��"
��"
�"
�"�!" �" �"��"
ABCDE
Cartography Coloring Scheme with Categorical GAP (CartoGAP) - 2
扁 宋 連
許 扁宋連 許李
李
A B C
E
A B C D E
D
Cartography Coloring Scheme with GAP (CartoGAP)-2 (B). CateGAP Color Map for Each Individual Variable
(C). Final Single CateGAP Cartography Color MAP for Complete Information Visualization
扁 宋 連
李
扁宋連 許李
許
From physical maps to conceptual maps
64
Chromosome Map
Semiconductor Wafer Quality
Control
Macro Biodiversity
Micro Biodiversity
65
Matrix Visualization for Symbolic Data (Analysis)
for Big Data?
Fig. 1. Diagram for related conven5onal data matrix and symbolic (interval type) data table with their corresponding proximity matrices for samples/concepts and variables.
1.1 Symbolic Data Analysis (SDA) and 1.2 Matrix Visualiza<on (MV)
67
58 variables
1899 Level 4 Cities
市區町村
58 variables (interval)
151 Level 2
Areas地域
continuous Data ↓
Rank Data
(1~1899)
merged (interval of ranks)
data
10 Level 1 Regions
covariate
Example: Japan Minryoku 2010 Data (with Junji Nakano, ISM)
Level 1 Level 2 Level 3 Level 4 Region (10) Area (151) District (821) City (1899)
Japan Minryoku 2010 Data
12 displaying modes for MV of interval data
58 interval variables
151
regi
ons
(con
cept
s)
Min Mid Max Length
Sufficient Sediment Row Condition Col Condition
Length < 949 len<949, 949<mid 1746 < length 900<mid<1000
Statisticians, Data Analysts, & Bioinformaticists A statistician is someone who wants to get exactly the right answer, even if it’s the answer to the wrong question. A data analyst is someone who is willing to settle for an approximate answer, as long as it’s the answer to the right question. A bioinformaticist is someone who is willing to settle for answers of unknown accuracy, to questions that have not been clearly articulated, as long as the results can be graphed in color.
David B. Allison, Ph.D. Department of Biostatistics University of Alabama at Birmingham
12. MV for Color Blind people Types of color blind Monochromacy Dichromacy Protanopia and deuteranopia Hereditary tritanopia Anomalous Trichromacy
http://www.vischeck.com/examples/
To act passively to prevent from using color systems that are difficult for color blind people to understand. or To work actively in assisting people with visual impairments to have better visualization of data/information.
“I believe there are more mathematics/statistics blind people than color blind people” 71
Approaching Statistics & Statistical Approach
Thank You!
72