FAIR or unfair? Principled publishing for data. Scott Edmunds, China #OAWeek 2016, 18th October 2016

Page 1: Scott Edmunds: FAIR or unfair? Principled publishing for data

FAIR or unfair? Principled publishing for data.

Scott Edmunds, China #OAWeek 2016, 18th October 2016

Page 2: Scott Edmunds: FAIR or unfair? Principled publishing for data

FAIR or unfair? Principled publishing for data.

Talk outline:

• What is unFAIR about research dissemination?
• What are the alternative models & principles?
• What is the most FAIR approach?
• Our practical experience in a FAIR ecosystem.

Page 3: Scott Edmunds: FAIR or unfair? Principled publishing for data

What is FAIR?

fair /fɛː/

Adjective: Treating people equally without favouritism or discrimination.
‘the group has achieved fair and equal representation for all its members’
‘a fairer distribution of wealth’

Adverb: Without cheating or trying to achieve unjust advantage.
‘no one could say he played fair’

Page 4: Scott Edmunds: FAIR or unfair? Principled publishing for data

Is this FAIR?

http://dx.doi.org/10.1087/20110203

PaperChu cp70,000¥

Page 5: Scott Edmunds: FAIR or unfair? Principled publishing for data

Nature 475, 267 (2011)

http://www.nature.com/news/2011/110720/full/475267a.html

“Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.”

“There have been widespread complaints from scientists inside and outside China about this lack of transparency.”

“Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.”

Is this FAIR?

Page 6: Scott Edmunds: FAIR or unfair? Principled publishing for data

A FAIR way out of this dilemma?

'If I have seen further it is by standing on the shoulders of giants'.

Buckheit & Donoho: Scholarly articles are merely advertisement of scholarship. The actual scholarly artifacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible.

Page 7: Scott Edmunds: FAIR or unfair? Principled publishing for data

FAIR questions to ask?

Is the raw data publicly available?

Are the reagents (plasmids, antibodies, etc.) available?

Are detailed protocols available?

Can I access the processed data & results (supporting the figures)?

Was this all available BEFORE publication to the peer reviewers?

Can I inspect the peer reviews?

Can I publish/link positive or negative replication experiments to this?

Page 8: Scott Edmunds: FAIR or unfair? Principled publishing for data

Taking the IMPACT out of Impact Factor

GigaScience Ethos/Policies: Impact is subjective. Data is quantitative.

http://gigascience.biomedcentral.com/guide-for-gigascience-reviewers

UnFAIR ranking system places us above “selective” journals.

A more FAIR approach: Reproducibility?

Page 9: Scott Edmunds: FAIR or unfair? Principled publishing for data

A more FAIR approach: Open Data?

Page 10: Scott Edmunds: FAIR or unfair? Principled publishing for data

Importance of licensing: ability to mine & reuse content

Budapest Open Access Initiative:

“By “open access” to [peer-reviewed research literature], we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

CC0 better than CC-BY for datasets to prevent “attribution stacking”

ODbL, Licence Ouverte

Page 11: Scott Edmunds: FAIR or unfair? Principled publishing for data

What is open data?

http://opendefinition.org/od/2.0/en/

Definition of “open knowledge”

Page 12: Scott Edmunds: FAIR or unfair? Principled publishing for data

Panton Principles

http://pantonprinciples.org/

CC0 better than CC-BY for datasets to prevent “attribution stacking”

Page 13: Scott Edmunds: FAIR or unfair? Principled publishing for data

Is 5★ open data more FAIR?

http://5stardata.info

Page 14: Scott Edmunds: FAIR or unfair? Principled publishing for data

Is 5★ open data more FAIR?

http://5stardata.info

★ Make your stuff available on the Web (whatever format) under an open license.

★★ Make it available as structured data (e.g., Excel instead of image scan of a table).

★★★ Make it available in a non-proprietary open format (e.g., CSV as well as Excel).

★★★★ Use URIs to denote things, so that people can point at your stuff.

★★★★★ Link your data to other data to provide context.
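
The five levels above are cumulative: each star assumes the ones before it. As a rough sketch of that idea (the class and field names here are hypothetical and not part of the 5★ scheme itself), a dataset description could be scored like this:

```python
from dataclasses import dataclass


@dataclass
class DatasetDescription:
    """Hypothetical description of a published dataset (illustrative field names)."""
    on_web_open_license: bool    # 1 star: on the Web under an open license
    structured: bool             # 2 stars: structured data rather than a scan
    open_format: bool            # 3 stars: non-proprietary format such as CSV
    uses_uris: bool              # 4 stars: URIs denote things
    linked_to_other_data: bool   # 5 stars: linked to other data for context


def star_rating(d: DatasetDescription) -> int:
    """Count stars cumulatively: stop at the first criterion that is not met."""
    criteria = [
        d.on_web_open_license,
        d.structured,
        d.open_format,
        d.uses_uris,
        d.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


csv_with_uris = DatasetDescription(True, True, True, True, False)
scanned_table = DatasetDescription(True, False, False, False, False)
print(star_rating(csv_with_uris))  # 4
print(star_rating(scanned_table))  # 1
```

Treating the stars as strictly cumulative is one reading of the scheme; a dataset that skips a level is scored only up to the gap.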

Page 15: Scott Edmunds: FAIR or unfair? Principled publishing for data

gigagalaxy.net

More FAIR to consider workflows

Page 16: Scott Edmunds: FAIR or unfair? Principled publishing for data

Research Objects: a concept & model

http://www.researchobject.org/

• Supporting publication of more than just PDFs, making data, code, & other resources first class citizens of scholarship.

• Recognizing that there is often a need to publish collections of these resources together as one shareable, cite-able resource.

• Enriching these resources and collections with any & all additional information required to make research reusable, & reproducible!

Page 17: Scott Edmunds: FAIR or unfair? Principled publishing for data

Importance of metadata: context (& discoverability)

https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
https://twitter.com/AlisonMcNab/status/751375987624009728/photo/1

Page 18: Scott Edmunds: FAIR or unfair? Principled publishing for data

Novel tools/formats for data interoperability/handling: ISA

Importance of metadata: context (& discoverability)

Page 19: Scott Edmunds: FAIR or unfair? Principled publishing for data

Importance of granularity

Where do you set it?

Experiment (e.g. International Cancer Genome Consortium): e.g. doi:10.5524/100001
Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2
Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
Smaller still?

Papers
Data/Micropubs
Nanopubs
Facts/Assertions (~10^13 in literature)
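
The DOI examples on this slide nest a finer-grained identifier under a parent record. A minimal sketch of reading that layout follows. It assumes the pattern seen in the examples (a numeric parent followed by an optional "-" or "_" suffix); this is an illustration, not a registered DOI convention, and the function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class GranularDoi(NamedTuple):
    prefix: str            # registrant prefix, e.g. "10.5524"
    parent: str            # top-level record, e.g. "100001"
    part: Optional[str]    # sub-identifier after "-" or "_", if any


# Assumed layout: optional "doi:" scheme, numeric parent, optional suffix.
_DOI_RE = re.compile(r"^(?:doi:)?(10\.\d+)/(\d+)(?:[-_](\w+))?$")


def parse_granular_doi(text: str) -> GranularDoi:
    """Split a DOI string into prefix, parent record and optional part."""
    match = _DOI_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not in the assumed DOI layout: {text!r}")
    return GranularDoi(*match.groups())


def parent_doi(doi: GranularDoi) -> str:
    """Return the DOI of the enclosing (coarser-grained) record."""
    return f"doi:{doi.prefix}/{doi.parent}"


for example in ("doi:10.5524/100001", "doi:10.5524/100001-2", "doi:10.5524/100001_xyz"):
    parsed = parse_granular_doi(example)
    print(example, "->", parent_doi(parsed), "part:", parsed.part)
```

Whatever level is chosen, keeping the parent recoverable from the child identifier lets a citation to a single sample still be traced back to the experiment it belongs to.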

Page 20: Scott Edmunds: FAIR or unfair? Principled publishing for data

Nanopublication: Assertion, Provenance, PublicationInfo

Nanopublication URL: Unique, persistent and resolvable identifier

Assertion: An individual association between concepts:
• statement or declaration
• measurement
• hypothetical inference
• quantitative or qualitative

Provenance: How this assertion came to be: methods, evidence, context, etc.

PublicationInfo:
• Detailed attribution for authors, institutions, lab technicians, curators
• License info
• Publication date

Integrity Key: Guarantee immutability after publication

[Diagram labels: assertion opm:wasDerivedFrom, opm:wasGeneratedBy; thisnanopub dcterms:created, pav:authoredBy; association a sio:statisticalAssociation; sio:has-measurementValue Association_1_p_value; a sio:probability-value; sio:has-value 6.56e-5^^xsd:float; sio:refers-to; dcterms:DOI]

A nanopub represents structured data along with its provenance in a single publishable & citable entity.

http://nanopub.org/
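
To make the three-part structure concrete, here is a minimal sketch of a nanopub-like record in Python. It is not the nanopub.org format (real nanopublications are RDF graphs): the class name, the example DOI and the author value are hypothetical, and the integrity key is approximated with a plain SHA-256 hash over canonical JSON rather than whatever mechanism a real nanopub server uses. The property names are taken from the diagram labels on the slide.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be rebound after creation
class Nanopub:
    assertion: dict          # the claim itself
    provenance: dict         # how the assertion came to be
    publication_info: dict   # who published it, when, under what license

    def integrity_key(self) -> str:
        """Hash of the full content; any later change yields a different key."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


example = Nanopub(
    assertion={
        "a": "sio:statisticalAssociation",
        "sio:has-measurementValue": {
            "a": "sio:probability-value",
            "sio:has-value": 6.56e-5,  # p-value shown on the slide
        },
    },
    provenance={"opm:wasDerivedFrom": "doi:10.0000/hypothetical"},
    publication_info={
        "dcterms:created": "2016-10-18",
        "pav:authoredBy": "hypothetical-author",
        "license": "CC0",
    },
)

print(example.integrity_key())
```

Note that `frozen=True` only stops fields from being reassigned; the nested dictionaries can still be mutated, which is exactly the kind of change a recomputed integrity key would expose.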

Page 21: Scott Edmunds: FAIR or unfair? Principled publishing for data

Lots of models/standards/guidelines

Where does that leave us?

5★ open data

Page 22: Scott Edmunds: FAIR or unfair? Principled publishing for data

A mnemonic to remember: FAIR

http://www.nature.com/articles/sdata201618
http://www.datafairport.org/

Findable
Accessible
Interoperable
Reusable
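
As a rough illustration of how the four letters might be turned into checks on a metadata record, consider the sketch below. The field names (`identifier`, `access_url`, `vocabulary` and so on) and the example values are assumptions made for this example; the FAIR principles themselves do not prescribe them, and a real assessment would be far more detailed.

```python
from typing import Dict


def fair_report(record: Dict[str, str]) -> Dict[str, bool]:
    """Return a rough yes/no answer for each letter of FAIR."""
    access_url = record.get("access_url", "")
    return {
        # Findable: a persistent identifier plus descriptive metadata
        "Findable": bool(record.get("identifier")) and bool(record.get("title")),
        # Accessible: retrievable over a standard, open protocol
        "Accessible": access_url.startswith(("http://", "https://", "ftp://")),
        # Interoperable: declares its format and the vocabulary it uses
        "Interoperable": bool(record.get("format")) and bool(record.get("vocabulary")),
        # Reusable: states a license and where the data came from
        "Reusable": bool(record.get("license")) and bool(record.get("provenance")),
    }


# Hypothetical record: no vocabulary and no provenance are given.
record = {
    "identifier": "doi:10.0000/hypothetical",
    "title": "Hypothetical example dataset",
    "access_url": "https://example.org/dataset",
    "format": "text/csv",
    "license": "CC0",
}

for letter, ok in fair_report(record).items():
    print(f"{letter}: {'yes' if ok else 'no'}")
```

For this record the sketch reports Findable and Accessible as met, and Interoperable and Reusable as not met, because the vocabulary and provenance fields are missing.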


Page 23: Scott Edmunds: FAIR or unfair? Principled publishing for data

A mnemonic to remember: FAIR

http://www.nature.com/articles/sdata201618
http://www.datafairport.org/

Page 24: Scott Edmunds: FAIR or unfair? Principled publishing for data

Beyond a mnemonic: FAIR ecosystems

FAIRifier tool

Page 25: Scott Edmunds: FAIR or unfair? Principled publishing for data

Beyond a mnemonic: FAIR ecosystems

• A particular class of FAIR Data System to provide support for data interoperability;
• Supports publication, search and access to FAIR data;
• Fosters an ecosystem of applications and services;
• Federated architecture: different FAIRports (and other FAIR Data Systems) are interconnectable;
• Supports citations of datasets and data items;
• Provides metrics for data usage and citation.

A ‘FAIRpoint or FAIRport’ can be any specific data instance following FAIR data principles.

http://www.datafairport.org/

Page 26: Scott Edmunds: FAIR or unfair? Principled publishing for data

Beyond a mnemonic: FAIR ecosystems

http://www.datafairport.org/


Page 27: Scott Edmunds: FAIR or unfair? Principled publishing for data

More FAIR mnemonics: “BYODs”

DTL/ELIXIR-NL “Bring Your Own Data Party”

GigaScience/BGI HK Metabolomics ISA-TAB-athon

Page 28: Scott Edmunds: FAIR or unfair? Principled publishing for data

FAIR Data in the wild

Taking a microscope to the publication process

Page 29: Scott Edmunds: FAIR or unfair? Principled publishing for data

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612

Page 30: Scott Edmunds: FAIR or unfair? Principled publishing for data

How FAIR can we get?

Open-Paper: DOI:10.1186/2047-217X-1-18, >50,000 accesses & 885 citations

Open-Review: 7 reviewers tested data in ftp server & named reports published

Open-Code: Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/ (>40,000 downloads). Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2

Open-Pipelines / Open-Workflows: DOI:10.5524/100044

Open-Data: DOI:10.5524/100038, 78GB CC0 data

Data sets: linked to a DOI

Analyses: linked to a DOI

Page 31: Scott Edmunds: FAIR or unfair? Principled publishing for data

Can we reproduce results? SOAPdenovo2 S. aureus pipeline

Page 32: Scott Edmunds: FAIR or unfair? Principled publishing for data

The SOAPdenovo2 case study

Subject to and test with 3 models:

Types of resources in an RO: Data; Method/Experimental protocol; Findings

Models to describe each resource type: Wfdesc/ISA-TAB/ISA2OWL

Page 33: Scott Edmunds: FAIR or unfair? Principled publishing for data
Page 34: Scott Edmunds: FAIR or unfair? Principled publishing for data

1. While there are huge improvements to the quality of the resulting assemblies, other than in the tables it was not stressed in the text that SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.

Page 35: Scott Edmunds: FAIR or unfair? Principled publishing for data

Lessons Learned

• Most published research findings are false. Or at least have errors.

• With enough effort it is possible to push button(s) & recreate a result from a paper with current tools.

• Being FAIR can be COSTLY. How much are you willing to spend? Who will build FAIR infrastructure?

• Much easier to make things FAIR before rather than after publication. BYODs are a useful intermediate here.

Page 36: Scott Edmunds: FAIR or unfair? Principled publishing for data

http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html

“The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: “is it FAIR”?”

Be FAIR to yourself: stop limiting your impact

Every 10 datasets collected contributes to at least 4 papers in the following 3 years.

Piwowar, HA, Vision, TJ, & Whitlock, MC (2011). Data archiving is a good investment. Nature, 473 (7347), 285. DOI: 10.1038/473285a

Page 37: Scott Edmunds: FAIR or unfair? Principled publishing for data

www.gigasciencejournal.com

Give us your FAIR data, papers & pipelines

Help GigaPanda make things FAIR!

Contact us:

[email protected] [email protected] [email protected]

Page 38: Scott Edmunds: FAIR or unfair? Principled publishing for data

Thanks to:
Laurie Goodman, Editor in Chief
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Xiao (Jesse) Si Zhe, Database Developer
Chen Qi, Shenzhen Office

Follow us:
@GigaScience
facebook.com/GigaScience
blogs.biomedcentral.com/gigablog/
+ Weibo & WeChat

www.gigasciencejournal.com
www.gigadb.org