www.gigasciencejournal.com Scott Edmunds, GigaScience/BGI Hong Kong ICG7, Hong Kong, 1st December 2012 Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

DESCRIPTION

Scott Edmunds' talk at the 7th International Conference on Genomics (ICG7), Hong Kong, 1st December 2012: "Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era"

Page 1: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

www.gigasciencejournal.com

Scott Edmunds, GigaScience/BGI Hong Kong ICG7, Hong Kong, 1st December 2012

Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Page 2: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

The challenges integrating papers + data:

Technical issues:

• Data volumes: (1.2 zettabytes generated globally each year)

• >Exponential growth of genomics data

• Technical challenges (VMs/cloud, compression)

Cultural issues:

• Lack of incentives (Data DOIs)

• Data licensing (CC-BY, CC0)

• Journal/funder policies

Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.

Page 3: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

* T-Shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham

Page 4: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

“Faked research is endemic in China”

Why is this important?

• Transparency

• Reproducibility

• Re-use

Source: New Scientist, 17th Nov 2012: http://www.newscientist.com/article/mg21628910.300-fraud-fighter-faked-research-is-endemic-in-china.html

Page 5: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Why is this important?

“Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.”

“There have been widespread complaints from scientists inside and outside China about this lack of transparency.”

“Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”

Source: Nature 475, 267 (2011) http://www.nature.com/news/2011/110720/full/475267a.html?

Page 6: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Global Issue: increasing number of retractions

>15X increase in last decade

Strong correlation of “retraction index” with higher impact factor

1. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html

2. Retracted Science and the Retraction Index http://iai.asm.org/content/79/10/3855.abstract?

Page 7: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Global Issue: unrepeatability of scientific results

Out of 18 microarray papers, results from 10 could not be reproduced

Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155.

Page 8: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Sharing aids authors…

Sharing Detailed Research Data Is Associated with Increased Citation Rate. Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Every 10 datasets collected contribute to at least 4 papers in the following 3 years. Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a

Page 9: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Rice v Wheat: consequences of publicly available genome data.

[Chart: rice vs. wheat, by year from 1996 to 2008; vertical axis 0 to 700]

Page 10: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Our first DOI:

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.

Page 11: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era
Page 12: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era
Page 13: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era
Page 14: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Downstream consequences:

“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain that infected roughly 4,000 people in Germany between May and July. But he knew that it might take days for the lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could use data collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work without wasting time on legal wrangling.”

1. Citations (~100)

2. Therapeutics (primers, antimicrobials)

3. Platform Comparisons

4. Example for faster & more open science

Page 15: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

1.3 The power of intelligently open data

The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.

Page 16: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Not just (data) quantity, but quality

1. Lack of sufficient metadata

2. Lack of interoperability

3. Long tail of curation (“Democratization” of “Big-Data”)

Page 17: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Not just (data) quantity, but quality

Cloud solutions?

Better handling of metadata…

Novel tools/formats for data interoperability/handling.

Page 18: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Not just (data) quantity, but quality

Data quality assessment

Tools making work more easily reproducible…

Workflows

Interoperability/Ease of use

Page 19: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

www.gigasciencejournal.com

Large-Scale Data Journal/Database

Editor-in-Chief: Laurie Goodman, PhD

Editor: Scott Edmunds, PhD

Commissioning Editor: Nicole Nogoy, PhD

Lead Curator: Tam Sneddon, D.Phil

Data Platform: Peter Li, PhD

In conjunction with:

Page 20: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Addressing the reproducibility gap:

Computable methods/workflow systems

Bioinformatics

Development

Publishing

Biomedical and bioinformatics research

Page 21: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Redefining what is a paper in the era of big-data?

Analysis Data

Tools/Workflows

Compute

goal: Executable Research Objects

Citable DOI

Page 22: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Integrating workflows into papers…

Page 23: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Anatomy of a Publication

Data

Idea

Study

Analysis

Answer

Metadata

Page 24: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Anatomy of a Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

Page 25: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

• Background

• Methods

• Results (Data)

• Conclusions/Discussion

Publication: doi:10.1186/2047-217X-1-3

Page 26: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

• Background

• Methods

• Results (Data)

• Conclusions/Discussion

Publication: doi:10.1186/2047-217X-1-3

Data: doi:10.5524/100035

Page 27: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

• Background

• Methods

• Results (Data)

• Conclusions/Discussion

Publication: doi:10.1186/2047-217X-1-3

+ Data: doi:10.5524/100035

+ Methods: DOI for workflows?

Page 28: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

Page 29: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

DOI: B + DOI: X = DOI: 2

Page 30: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

DOI: B + DOI: X = DOI: 2

DOI: A + DOI: Y = DOI: 3

Page 31: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

DOI: B + DOI: X = DOI: 2

DOI: A + DOI: Y = DOI: 3

A, B, C… + X, Y, Z… = 4, 5, 6…
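
The scheme on these slides, where a data DOI combined with a methods DOI yields a separately citable analysis, can be sketched as a small registry. This is only an illustrative sketch: the `AnalysisRegistry` class, its method names, and the placeholder identifiers ("A", "X", "1", ...) are assumptions made for this example and do not describe an existing GigaScience, GigaDB, or DOI-registration API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Analysis:
    """One citable analysis: a dataset combined with a method."""
    data_doi: str
    methods_doi: str
    analysis_doi: str


@dataclass
class AnalysisRegistry:
    """Hypothetical registry mapping (data, methods) pairs to analysis IDs."""
    _by_pair: dict = field(default_factory=dict)
    _next_id: int = 1  # placeholder numbering, mirroring "DOI: 1, 2, 3" on the slide

    def register(self, data_doi: str, methods_doi: str) -> Analysis:
        """Return the analysis for this pair, minting a new ID only once."""
        key = (data_doi, methods_doi)
        if key not in self._by_pair:
            self._by_pair[key] = Analysis(data_doi, methods_doi, str(self._next_id))
            self._next_id += 1
        return self._by_pair[key]

    def reuse_of(self, doi: str) -> list:
        """List every analysis that used the given data or methods DOI."""
        return [a for a in self._by_pair.values()
                if doi in (a.data_doi, a.methods_doi)]


registry = AnalysisRegistry()
first = registry.register("A", "X")   # DOI: A + DOI: X
second = registry.register("B", "X")  # DOI: B + DOI: X
third = registry.register("A", "Y")   # DOI: A + DOI: Y
```

Keeping the data and methods identifiers separate is what makes re-use countable: `registry.reuse_of("X")` lists every analysis that applied method X, which is the kind of credit the slides argue for.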

Page 32: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Different shaped publishable objects

Data Papers

Executable (Methods) Papers

Analysis Papers

Page 33: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Different levels of granularity

Experiment (e.g. ACRG project): e.g. doi:10.5524/100001

Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2

Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz

Smaller still?

Papers

Data/Micropubs

Nanopubs

Facts/Assertions (~10^14 in literature)

Different shaped publishable objects
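
The granularity levels above suggest identifiers whose suffix extends a parent DOI with a sub-identifier. A minimal sketch of splitting such identifiers follows. It assumes the suffix convention shown in the slide examples (a parent identifier optionally followed by "-" or "_" and a child identifier); the slide offers these only as examples, so the pattern, `GranularDoi`, `parse_granular_doi`, and `parent_doi` are hypothetical names for illustration and not a documented GigaDB scheme.

```python
import re
from typing import NamedTuple, Optional


class GranularDoi(NamedTuple):
    prefix: str           # registrant prefix, e.g. "10.5524"
    parent: str           # top-level identifier, e.g. "100001"
    child: Optional[str]  # sub-identifier after "-" or "_", if any


# Assumed form: optional "doi:", prefix, "/", parent, optional "-child" or "_child".
_PATTERN = re.compile(
    r"^(?:doi:)?(10\.\d+)/([A-Za-z0-9]+)(?:[-_]([A-Za-z0-9]+))?$"
)


def parse_granular_doi(doi: str) -> GranularDoi:
    """Split a DOI of the assumed form into prefix, parent and child parts."""
    match = _PATTERN.match(doi.strip())
    if match is None:
        raise ValueError(f"not in the assumed form: {doi!r}")
    return GranularDoi(*match.groups())


def parent_doi(doi: str) -> str:
    """Return the top-level DOI that a finer-grained identifier belongs to."""
    parsed = parse_granular_doi(doi)
    return f"doi:{parsed.prefix}/{parsed.parent}"
```

Note that the syntax alone cannot tell a dataset-level suffix ("-2") from a sample-level one ("-2000"); the level would have to come from metadata attached to each identifier.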

Page 34: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Adding “value” publishing data

DOIs are cheap*, data is precious: maximise its use

• Scope for different shaped publishable objects

• Scope for publishing methods/executable papers

• Peer review of data problematic

– Post publication peer review

– Change criteria (assess on transparency/access only)

– Better use of workflows/cloud/VMs

* ish

Page 35: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

• Transparency

• Reproducibility

• Re-use

} = Credit

Page 36: Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era

Thanks to:

Laurie Goodman, Tam Sneddon, Nicole Nogoy, Alexandra Basford, Peter Li, Jesse Si Zhe

Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Cogini

Contact us:

[email protected]@gigasciencejournal.com

Follow us:

@gigascience

facebook.com/GigaScience

blogs.openaccesscentral.com/blogs/gigablog/

www.gigadb.org

www.gigasciencejournal.com