September 29–30, 2015 California Institute of Technology Beckman Institute 400 S Wilson Ave, Pasadena, CA 91125 IBD/CROHN’S DISEASE PROGRAM Information Technology and Bioinformatics Workshop for the Helmsley IBD Research Network MEETING REPORT



TABLE OF CONTENTS

Executive Summary

Workshop Report

Session 1: Technology Capabilities, Gaps, and Priorities

Session 2: Information Models and Ontologies

Session 3: Data Analytics

Recommendations & Next Steps

From Workshop Participants

From Workshop Faculty

Appendix A: Participants

Appendix B: Data Programs Used By Helmsley IBD Grant Recipients

Appendix C: Proposed Metadata Elements

Appendix D: Bioinformatics Planning Template

Appendix E: Glossary


IBD/CROHN’S DISEASE PROGRAM

BIOINFORMATICS WORKSHOP REPORT

EXECUTIVE SUMMARY

On September 29-30, 2015, the Helmsley Charitable Trust sponsored a workshop focused on bioinformatics resources and capabilities, needs and challenges, and future priorities for Inflammatory Bowel Disease (IBD) research. The workshop was held at the California Institute of Technology (Caltech) in Pasadena. (See Appendix A for the list of workshop participants.)

Since 2009, the Helmsley Charitable Trust has authorized more than $185 million in grants to find a cure for IBD and Crohn’s disease and, while awaiting that outcome, to find better therapeutic and prevention strategies. More than 100 Helmsley-funded investigators are working in nine countries to understand how human genetics, the gut microbiome, and the immune system can cause and exacerbate IBD and to bring those insights into the development of new treatments.

As these grants mature, the types and amounts of data generated are growing exponentially. To capitalize on these valuable resources, the lead data scientists for the Helmsley IBD Network requested funding to host the Information Technology and Bioinformatics Workshop with the goal of creating a shared bioinformatics vocabulary and standardized data management protocols across the IBD & Crohn’s Program grantees.

Significant progress has been made in informatics to improve the capture and analysis of data in many different areas, from biomedical research to astronomy and space science. Many of these fields are now embracing “data science” as critical to bringing data, computation, and science together in order to improve the opportunity for scientific discovery.

This workshop brought together cross-disciplinary expertise from NASA’s Jet Propulsion Laboratory (NASA JPL), Caltech scientists, and Helmsley Charitable Trust-funded IBD researchers involved in bioinformatics, data handling, and data analytics. While many of the IBD research data scientists are developing bioinformatics capabilities to support their projects, many of the techniques remain substantially ad hoc, with little standardization of methods for collecting and analyzing the generated data in a systematic manner. As a result, the workshop identified opportunities for collaboration and for development of an IBD information model and scalable data infrastructures that will benefit IBD research.

Prior to the workshop, participants from each Helmsley-funded IBD project completed a structured survey to reflect and report on their bioinformatics capabilities, gaps, and priorities. The questions covered areas including technology, data handling, information models, and analytics. The survey information was used to organize working breakout groups and to support the exploration of new paradigms, with the goals of beginning to construct individual project roadmaps, developing a collective roadmap, and creating a set of recommendations, which are reflected in this report.

During the workshop, keynotes and speakers discussed the data science capabilities implemented both in IBD research and in other disciplines, such as planetary science, cancer research, and computational pathology. The speakers explored how methods from these disciplines for developing an ontology (the formal naming and defining of the properties or domains that make up the data and systems) and for capturing, managing, and analyzing data could be adapted for the Helmsley IBD research initiatives. The concept of defining an ontology in biomedical research is critical, as it provides researchers with standardized categories and a vocabulary across which the data can be organized, stored, and ultimately accessed.

Data scientists from the University of North Carolina at Chapel Hill (UNC) and NASA JPL, working in conjunction with the Helmsley Charitable Trust, organized the workshop. The program leaders included Kristen Anton (UNC/Geisel School of Medicine at Dartmouth) and Dan Crichton (NASA JPL/Caltech).


THE FOLLOWING OBSERVATIONS EMERGED FROM THE WORKSHOP DISCUSSIONS:

• Technology used in the projects (including hardware infrastructure, software, and databases) is somewhat divergent, but this is less critical if the ontology is consistent.

• Architecture is important: decoupling the technology from the information model and standardization of the API layer (the way in which software pieces can connect to each other, like “Lego” pieces that plug into each other universally) allows for flexibility, scalability, and consistency across projects.

• Consistent, well-defined IBD ontology eliminates platform dependence and promotes data sharing and meaningful use of data.

• Making first steps toward consistent ontology, data sharing, and technology sharing—even if small—will enable cross project training and encourage publication by the bioinformaticians and data scientists.

OVERALL, GROUPS IDENTIFIED THE FOLLOWING CONSISTENT CHALLENGES:

• Assembling technical expertise (e.g. in areas of data architecture, information model development, and professional software development), perhaps addressed by training and identifying technical expertise within the network.

• Building collaborations within the Helmsley data science community.

• Understanding and standardizing metadata.

• Integrating data for analysis (related to lack of metadata and metadata standards).

• Harnessing statistical and methodological expertise to inform the appropriate analytic methods to apply to datasets.

• Reaching the limits of computational infrastructure capacity within large-scale jobs.

• Using data from open source repositories (e.g. access and integration of publicly available data with data generated by the network).

• Addressing the negative impact of patient privacy, data security, and the culture of academic research on freedom to share data and resources.


The challenges noted above are listed in the order of priority the group recommended for addressing them. For instance, enhancing bioinformatics expertise at all funded research centers lays the foundation for universal improvement in data science capabilities. This can be accomplished by developing an online training forum (blog and tutorials) and by supporting meetings and/or workshops that facilitate cross-training within the group of data scientists assembled for the workshop.

Finally, all workshop participants agreed in principle that data should be shared and made public. But they also noted that the field’s current culture prevents this because funding and careers are so highly dependent on people being able to publish their research. Workshop participants praised the Helmsley Charitable Trust for the fact that its funding opportunities are not weighted so heavily on publication record.

IN THE CONTEXT OF A SHARED ROADMAP, THE WORKSHOP FACULTY DISTILLED SEVERAL RECOMMENDATIONS FROM THE POST-BREAKOUT SESSION DISCUSSIONS, WHICH ARE DETAILED IN THIS REPORT AND INCLUDE:

1. Create a data management template for new grantees and project applications.

2. Develop strategies and materials for training the projects’ data scientists.

3. Create solutions for data integration and sharing.

4. Develop specific shared goals and vision for linking data science and IBD research.

5. Work with grant leaders to assess the value of developing a comprehensive IBD ontology for Helmsley IBD research.

6. Consider funding data science and infrastructure initiatives and encourage other funders to do the same.

7. Provide funding for short-term “test and learn” projects.


WORKSHOP REPORT


WORKSHOP STRUCTURE

To provide a good base for discussion during the workshop and in breakout sessions, workbooks were developed by the faculty with questions that followed the three themes of the workshop: technology, ontology, and analytic methods. In the technology section, participants were asked to provide descriptions of the hardware and infrastructure used in the course of their research and to identify any gaps as well as capabilities that might be needed. They were also asked to describe the types of databases and software used in their projects and the tools needed to support their work, and then to prioritize their needs and identify the most critical. Similar questions followed for the ontology and analytics sections, asking what tools and models were used, what capabilities were missing, what were the perceived barriers, and what were the priorities for filling the most critical needs.

The workbooks were sent to each of the participants prior to the workshop to complete and return to us. Each customized workbook was printed and given to its author to use during the facilitated breakout sessions.

The workshop participants identified 93 analysis programs or packages that are used by at least one Helmsley IBD grant recipient; some programs are used by as many as 10 laboratories. This list, presented alphabetically in Appendix B, will allow researchers to identify and reach out to other scientists using the same programs.

Participants from the nine grant projects were broken into four small working groups. They were paired with a facilitator from the workshop faculty who guided the group through specific questions relating to the workbook content. The outcome of the sessions will ultimately lead the participants in drafting a roadmap for achieving their data management goals.

The breakout groups were organized as follows:

• Group 1 – Sinai Helmsley Alliance for Research Excellence (SHARE) (15 participants)

• Group 2 – CCFA projects, Genetics, Environmental and Microbial (GEM) project (8 participants)

• Group 3 – IBD Over Time, Israeli IBD Research Nucleus (IIRN), Weizmann (10 participants)

• Group 4 – Very Early Onset (VEO) IBD (5 participants)

SESSION 1: TECHNOLOGY CAPABILITIES, GAPS, AND PRIORITIES IN HELMSLEY CHARITABLE TRUST-FUNDED IBD RESEARCH

The application of multiple biotechnologies to the development of indicators and treatments for disease is inextricably linked to the development of algorithms to support the analysis of complex datasets. These technologies and algorithms play a key role in understanding data generated by research projects funded with the goal of ultimately curing IBD, and until then, of developing improved treatments. Informatics capabilities to support the capture, integration, and analysis of these datasets are critical to supporting IBD research and enabling reproducibility of the results. This session included contributions from researchers who identified key needs for supporting experimental methodologies; data scientists who are developing needed algorithms for data capture, integration, and analysis; and data managers who actively handle the data in the research projects.

Presentations in this session included:

“Using Data Science to Enable Scientific Innovation,” Lt. Gen. Larry James, Deputy Director, NASA Jet Propulsion Laboratory (Keynote)

“What Components Make Up a Data Science System?” Kristen Anton, UNC at Chapel Hill and Geisel School of Medicine at Dartmouth

“Informatics for the Human Microbiome in Health and Disease,” Eric Franzosa, Harvard School of Public Health

The technology breakout session was guided by the following questions:

1. What types of hardware infrastructure are being used?

2. What types of databases are used?

3. What types of software are used?

4. What gaps in databases, computation, and other infrastructures exist? What is needed in the future to support IBD research?

5. Are there opportunities for collaboration? Within the group? With other groups?

Following the breakout sessions, a representative from each group reported on the group’s discussion to all participants. Each group documented points from its breakout discussion on colored paper, so that a poster offering a view of the session discussions could be built. For each session topic, this technique made visible both an emerging common bioinformatics vocabulary and the heterogeneity in data science capability across groups.

SUMMARY OF DISCUSSION

• Hardware infrastructure: Hardware infrastructure across the groups was fairly consistent, e.g. High Performance Computing (HPC), cluster computing, cloud services, and local storage. One challenge commonly expressed was that large compute-intensive jobs are reaching the limits of infrastructure. The computing for several projects incorporated Hadoop and i2b2.

• Databases: There was both diversity and overlap in the databases implemented to support the projects. Clinical databases include EPIC, Allscripts, Cerner, and Clindesk (a homegrown portal), although EPIC implementations were the source for more than 75 percent of clinical data. Databases such as Oracle and REDCap were commonly used, although some projects are using MS Access and Excel. Statistical databases included SAS and SPSS.

• Software: There was both diversity and overlap in the software being used and/or developed across the network. Much of the software used to support the research, including data analytics and visualization, included custom code developed by staff with varying levels of software development expertise. Much of the custom code was built for local use and not for sharing or distribution. Software in each of these areas was discussed:

➢ Scripting languages: The most common scripting language in use is Python. Others noted include PHP, JavaScript, Perl, and HTML.

➢ Statistical/data analysis software: The most commonly used statistical software is R (including the Bioconductor framework built around R), with others using SAS and SPSS. Several projects are using MATLAB (“matrix laboratory”) and Golden Helix.

➢ LIMS: Every project interacts with a Laboratory Information Management System (LIMS). Many projects use homegrown systems. However, common tools for managing biospecimens include Freezerworks and caTissue, an open source software tool developed by the National Cancer Institute of NIH.

➢ Instruments and bioinformatics tools: Common platforms were, predictably, Affymetrix, Illumina, and Agilent. Sequence analyses are performed using BLAST and FASTA. PLINK, a whole genome analysis toolset, and QIIME (Quantitative Insights into Microbial Ecology), a pipeline framework for microbiome analysis from raw DNA sequencing data, were common. Group 4, the VEO team, uses SAGE, a platform for data sharing. Much of the pipelining code was developed as a custom application. Data visualization tools were lacking and desired. Group 1, the SHARE team, expressed the need for a flexible query tool to interface with the SHARE Registry.

Some packaged software and tools were ubiquitous (Oracle, Python, Golden Helix, etc.). However, there was general agreement that the field has a great need for better sharing of knowledge among groups, and for tools and services that can help in executing common tasks. The plethora of custom code speaks to an opportunity to collaborate and share bioinformatics tools across projects.

A table listing some of the software in use by the workshop participants can be found in Appendix B. This table is a first step in cataloging all tools implemented across the network and identifying expertise for collaboration in the effective use of these computer programs.

Opportunities

There was a clearly expressed need for technical expertise: identifying staff in the network with expertise, sharing that expertise, and training project staff. One group specifically identified the need for a better understanding of the proper statistical techniques to apply to analyze the data.

Short-term Actions: Create a blog or online forum to cross-train one another; include tutorials or information about resources and software packages. Organize “boot camp” training sessions within the network. Organize and hold regular meetings (including face-to-face meetings) of bioinformatics staff.

There is also a clear opportunity for collaboration in co-developing and sharing software. In particular, the newer projects and projects with limited on-site bioinformatics and analytical expertise would benefit from collaboration with data scientists in the network who have more extensive experience with large data integration and handling. One critical need identified by Breakout Group 3 was the development of tools to merge and compare datasets from different studies. All projects in SHARE would benefit from a flexible query interface to the SHARE Registry.

Short-term Actions: Identify and support cross-project technical collaborations, sharing expertise, methods, software designs, and software (observations and feedback from the workshop may facilitate this). Build and/or share methods, tools, and skills for integrating clinical data with research data and for merging data from different studies. Create an online library for shared code and open source solutions.
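As an illustration of the merge-and-compare need raised by Breakout Group 3, the following is a minimal sketch rather than a tool used by any project in the network. It assumes each study keeps its own local field names and that the network has agreed on a mapping to shared names. The study labels, field names, and the `FIELD_MAPS` table are hypothetical and chosen only for the example.

```python
from typing import Dict, List

# Hypothetical mappings from each study's local field names to shared names.
# A real mapping would come from an agreed data dictionary, not from this sketch.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "study_a": {"pt_id": "subject_id", "dx": "diagnosis", "age_yrs": "age_years"},
    "study_b": {"SubjectID": "subject_id", "Diagnosis": "diagnosis", "Age": "age_years"},
}


def harmonize(study: str, rows: List[dict]) -> List[dict]:
    """Rename mapped local fields to shared names and tag each row with its source."""
    mapping = FIELD_MAPS[study]
    harmonized = []
    for row in rows:
        # Fields without a mapping are dropped rather than guessed at.
        renamed = {mapping[key]: value for key, value in row.items() if key in mapping}
        renamed["source_study"] = study
        harmonized.append(renamed)
    return harmonized


def merge_studies(datasets: Dict[str, List[dict]]) -> List[dict]:
    """Concatenate harmonized rows from every study into one list."""
    merged: List[dict] = []
    for study, rows in datasets.items():
        merged.extend(harmonize(study, rows))
    return merged


def compare_counts(merged: List[dict], field: str) -> Dict[tuple, int]:
    """Count how often each value of a shared field occurs within each study."""
    counts: Dict[tuple, int] = {}
    for row in merged:
        key = (row["source_study"], row.get(field))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Keeping the mapping in a data table, separate from the merge logic, means a new study can be added by extending `FIELD_MAPS` without touching the functions.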

SESSION 2: INFORMATION MODELS AND ONTOLOGIES

The successful implementation of data science capability for Helmsley Charitable Trust-funded projects depends in large part upon the successful definition of an information model for describing IBD data. An IBD ontology would serve as a blueprint to describe the data produced during the course of Helmsley Charitable Trust IBD research and the relationships among those data. An ontology in data engineering is a representation of concepts and the relationships, constraints, rules, and operations that specify data semantics. An IBD ontology, applicable for the research conducted within the network, may include information about patient phenotypes, clinical outcomes, biospecimens, microbiomes, host genetics, proteomics, and other varied biologic information that forms a “model” for IBD research.

From a software system architecture perspective, this approach puts the intelligence in the ontology, rather than in software. This ontology- or model-driven software architecture approach allows the data definitions to be described in the information model and provides full flexibility for the underlying software to dynamically evolve to support new types of data generated by new methodology. This will enable software developed in the Helmsley Charitable Trust IBD research network to support IBD research into the future. An IBD information model may be developed collaboratively, in phases, over time.

Presentations in this session included:

“Information Models for Scientific Research,” Steven Hughes, Information Architect and Principal Computer Scientist, NASA Jet Propulsion Laboratory

“Common Data Elements (CDEs): Where to Begin?” Heather Kincaid, Project Manager and Computer Scientist, NASA Jet Propulsion Laboratory

“Data Standards and Curation: Application to Building a Data Model,” Maureen Colbert, Bioinformaticist and Biocurator, NASA Jet Propulsion Laboratory/Geisel School of Medicine at Dartmouth

The ontology breakout session was guided by the following questions:

1. What types or categories of data do you capture, acquire and/or manage?

2. What data and metadata standards are used?

3. How do you format your data for storage and/or use?

4. Do you have a documented information model for your project?

5. How much data do you generate and manage?

6. Do you share your data?

7. What gaps in data standards exist? What data is missing? What is needed in the future to support IBD research?

8. Are there opportunities for collaboration? Within the group? With other groups?

SUMMARY OF DISCUSSION

One of the strongest recommendations to come from this workshop emerged from this session: Establish an IBD Ontology Working Group within the Helmsley IBD research network as a way to support data scientists and share ideas regarding information modeling and development.

• Types of data captured and managed: The types of data collected are diverse, but also have much overlap, including disease description, diagnosis, demographics, biopsy/blood specimens, family history, medications, microbiome, clinical, clinical phenotype, genomic sequence, gene expression, genotyping, etc. The need to integrate clinical data with experimental data was expressed consistently. The considerable overlap in data types offers the opportunity to develop a data model and standards that would allow the network to share and/or integrate this data meaningfully.

• Standards: Data and metadata standards, e.g. MIAME, KEGG, HL7, OID, OWL (for semantics), BioPortal, and disease indices such as Harvey-Bradshaw, exist and are used in some projects. However, there was a significant gap in shared standards. Creating an IBD Ontology Working Group would address this issue and offer the research network valuable data and metadata standards.

• Information Model: Some projects have data dictionaries that describe their data. However, no project offered a comprehensive IBD information model. All groups expressed the desire for this development. In particular, the group would like to see immediate development and implementation of metadata and data descriptors to catalog research studies across the network, which could comprise a minimal common descriptive dataset. The group concluded that a consistent ontology eliminates platform dependence in sharing and meaningful use of data, making heterogeneity in technology solutions largely inconsequential.

• Data Sharing: All projects described challenges with data sharing. These challenges include the “publish or perish” culture of research, subject privacy, and the lack of an information model that facilitates data sharing. The challenges with data sharing were universal. All participants saw great value in overcoming these challenges and sharing data.

Opportunities

The groups noted two primary opportunities:

➢ Collaborate to create an IBD Information Model Working Group to develop a comprehensive information model.

Short-term Actions: In the short term, define and implement metadata to describe studies and datasets across the network. Appendix C contains a proposed list of metadata and descriptors, drafted by Group 1 (SHARE) in this session.

➢ Collaborate to create a Data Sharing Policy Working Group to suggest policies for intra-network data sharing.
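The model-driven architecture described in this session, in which the data definitions live in the information model rather than in the software, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the information model proposed at the workshop: the `Biospecimen` entity, its attribute names, and the permitted values are hypothetical, and a real model would be maintained by the working group in a standard representation.

```python
from typing import Any, Dict, List

# Hypothetical fragment of an information model, expressed as plain data.
# Entity names, attributes, and permitted values are illustrative only.
INFORMATION_MODEL: Dict[str, Dict[str, Dict[str, Any]]] = {
    "Biospecimen": {
        "specimen_id": {"type": str, "required": True},
        "specimen_type": {
            "type": str,
            "required": True,
            "allowed": {"blood", "biopsy", "stool"},
        },
        "collection_age_years": {"type": float, "required": False},
    },
}


def validate(entity: str, record: Dict[str, Any],
             model: Dict[str, Dict[str, Dict[str, Any]]] = INFORMATION_MODEL) -> List[str]:
    """Check a record against the model and return a list of problems found.

    The function knows nothing about biospecimens; every rule it applies
    is read from the model.
    """
    errors: List[str] = []
    spec = model[entity]
    for name, rules in spec.items():
        if name not in record:
            if rules.get("required"):
                errors.append(f"missing required attribute: {name}")
            continue
        value = record[name]
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: expected {rules['type'].__name__}")
        elif "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{name}: value {value!r} not permitted")
    for name in record:
        if name not in spec:
            errors.append(f"unknown attribute: {name}")
    return errors
```

Supporting a new kind of data then means adding an entry to the model, for example a new attribute on `Biospecimen`, while `validate` stays unchanged. That is the sense in which the intelligence sits in the model rather than in the code.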

SESSION 3: DATA ANALYTICS

The bioinformatics systems we build are focused on creating valid, meaningful datasets for feeding analytic algorithms to enable scientific discovery, understanding, and value. This session discussed reproducible pipelines, novel analytic methodology, and extending methodology from other disciplines (e.g. computational pathology) to IBD research.

Presentations in this session included:

“Using Data Science Methods for Scientific Research,” Thomas Fuchs, Memorial Sloan Kettering Cancer Center, Caltech, NASA Jet Propulsion Laboratory

“Putting it all together: Examples of Data Science Systems as Models,” Kristen Anton, UNC at Chapel Hill and Geisel School of Medicine at Dartmouth

“Putting it all together: Demonstration of Working Knowledge Environment,” Dan Crichton and Heather Kincaid, NASA Jet Propulsion Laboratory

The Analytics breakout session was guided by the following questions:

1. What types of analyses do you perform in the context of your projects?

2. What packaged software/tools do you use?

3. What custom programming have you developed (or anticipate developing) for your project?

4. How compute-intensive is the processing and analytics for your project? Can you do what you need to do?

5. How do you format and store your analytic results?

6. Are there any capabilities missing in your analytical methods? What is needed in the future to support IBD research?

7. Are there opportunities for collaboration? Within the group? With other groups?

SUMMARY OF DISCUSSION

Many of the projects are in the data modeling and data capture phase, with little experience to date in the analytics for the project. The analytical needs of the groups varied significantly based on conditions such as the population being studied. For example, the Very Early Onset team has a very small number of patients (from a few weeks old up to 6 years of age) with a large signal (genetics, with few confounding environmental factors), whereas the other groups, studying large adult populations, need analytics to separate the signal from the noise of large sample variation and confounding environmental factors. The need for data visualization tools was universal.

• Reproducibility of results: This was a common theme throughout the workshop and is highly dependent on the proper information capabilities. In particular, proper capture of the data is essential. This includes documentation of the pre-determined protocol, raw data, code, scripts, data provenance, and analysis workflows. The capture of data, metadata, and supporting documentation should ensure proper reuse of the data.

• Transparency: There is a direct need to share data and analytical methods. Well curated and maintained data repositories are needed to support proper sharing of data and methods. The group would like to identify and share analysis expertise to help them understand the proper statistical techniques to apply to the datasets.

• Share information about data quality: Common approaches should be developed to share information about data quality in the network research, as data quality impacts analytics and results. One approach discussed would be to define metadata that will give a quality “score” to each data set, or even data point as applicable. The “score” offers a level of confidence in the accuracy of the data, which is important in determining appropriateness of data integration and accuracy in analytic results. The metadata descriptors would become part of the information model.

• Analytical methods and tools: Among the groups there was a consistent call for collaboration to implement data visualization tools, data searching tools (e.g. to independently mine the SHARE Registry data), analytic methods and tools and workflow systems (e.g. The Early Detection Research Network LabCAS).

• Data integration: Groups are capturing and validating data, but methods for integrating data (e.g. clinical data into experimental data sets) are lacking. Groups expressed interest in supporting the capability to integrate data from clinical systems and open-source repositories into their project data collection.

Opportunities

Groups all noted the opportunity to share data integration and analytic expertise as well as analytic methodologies. JPL/Caltech and Mayo offered to collaborate on data visualization methods and tools. The network saw value in collaborating to define and share a minimal data set from all projects.
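The quality “score” idea discussed in this session can be made concrete with a small sketch. The workshop did not define a scoring scheme, so everything here is an assumption made for illustration: the `DatasetDescriptor` fields, the 0.0 to 1.0 scale, and the threshold-based selection are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetDescriptor:
    """Hypothetical metadata record for one dataset in the network."""
    dataset_id: str
    data_type: str
    quality_score: float  # assumed scale: 0.0 (no confidence) to 1.0 (full confidence)


def select_for_integration(datasets: List[DatasetDescriptor],
                           min_score: float) -> List[DatasetDescriptor]:
    """Return the datasets whose quality score meets the chosen threshold."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must be between 0.0 and 1.0")
    return [d for d in datasets if d.quality_score >= min_score]
```

Because the score travels with the dataset as metadata, each analysis can choose its own threshold instead of relying on a single network-wide cutoff.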


RECOMMENDATIONS AND NEXT STEPS

POSSIBLE NEXT STEPS: RECOMMENDATIONS FROM WORKSHOP PARTICIPANTS

The workshop structure allowed projects to begin to build project-specific roadmaps for bioinformatics, with each session building upon the previous and covering the topics of technology, information models, and data analytics.

At the close of the workshop, the participants were asked, ‘What can the Trust do to continue to facilitate the data science needs of our grant projects?’ The workshop participants came forward with many ideas and collectively built a set of recommendations for bioinformatics in the Helmsley Charitable Trust-funded IBD research network. These recommendations, summarized by Helmsley staff and workshop faculty, are to:

1. Create a data management template for new grantees and project applications. Part of the grant application process and onboarding process for new grantees would include a description of data capture, validation, and management processes for Helmsley IBD research. This documentation would ensure that investigators and their bioinformatics staff have carefully considered how they will handle their data. This is especially critical for high throughput and other “big data.” The process of developing a data management strategy would also allow for investigators to develop collaborations before any data is captured, if required, to ensure responsible stewardship of research data. Appendix D provides a draft Bioinformatics Planning Template for Helmsley IBD applicants.

2. Develop strategies and materials for training the projects’ data scientists by:

• Polling grant leaders and data scientists for specific needs for knowledge and training;

• Creating an online forum to share information and expertise;

• Creating online tutorials;

• Creating an online library of shared code and open source resources; and,

• Supporting opportunities for the network’s data scientists to meet and share methods, accomplishments and challenges.

3. Create solutions for data integration and sharing by:

• Nominating, convening, and supporting a Data Sharing Methods and Policy Working Group to propose innovative collaborations and solutions for data-sharing issues; and,

• Defining a minimal set of data to share and leading the way; establishing feasibility and exercising policy; and building out that minimal set iteratively.

4. Develop specific shared goals and vision for linking data science and IBD research by regularly convening data scientists from the Helmsley IBD research network via teleconferences and face-to-face meetings. It would be helpful to overlap a bioinformatics meeting with a scientific meeting, perhaps bi-annually or annually, allowing data scientists to meet and providing opportunities for bioinformaticists and IBD researchers to meet collectively. These groups could discuss results and accomplishments in the IBD Network and help evolve the bioinformatics mandate to most effectively support the research.

5. Work with grant leaders to assess the value of developing a comprehensive IBD ontology for Helmsley IBD research, incorporating data standards to create a sustainable model by:

• Nominating, convening, and supporting an IBD Ontology Working Group;

• Describing all existing studies and datasets; and,

• Defining a pilot project to effectively initiate the ontology development process. As part of this, also determine interest in working cooperatively across the network to develop a shared “core” data model. In doing so, recognize the importance of an information model in allowing for heterogeneous technical platforms and workflows without hindering data harmonization and integration capabilities.

6. Build funding for data science and infrastructure initiatives by encouraging the Helmsley Charitable Trust (and other funding sources) to consider offering grant funds for bioinformatics infrastructure projects, e.g. data-sharing frameworks, ontology development, and shared data analytics.

7. Encourage funding for short-term “test and learn” projects to allow rapid forward progress on the established priorities without high cost.

Two primary themes emerge from these recommendations: 1) Cataloging all existing studies and datasets, starting with short-term, low-cost, pilot “test-and-learn” projects that may be scaled up, and 2) Creating bioinformatics working groups to address priorities with collaborative innovation.

POSSIBLE NEXT STEPS: RECOMMENDATIONS FROM WORKSHOP FACULTY

The following recommendations were developed by workshop organizers and faculty to guide the Trust in establishing initiatives for the IBD Network data scientists that could be implemented quickly and have immediate impact. These action items are recommended with the intent to build out models, concepts, and technology in the longer term.

1. Create Working Groups

Nominating and implementing several “technical working groups” would maintain the momentum of the workshop. These working groups would meet monthly by teleconference, and ideally bi-annually or annually in person. Each working group would nominate a chair and a secretary, and all proceedings could be made available to the entire network through a website. We suggest that these working groups, comprised of members from across the Helmsley Charitable Trust IBD Network, would greatly improve the network’s data science capability.

Training and Methodology Sharing Working Group

Focus: This working group would define a baseline standard for bioinformatics expertise and resources required for research projects, identify and prioritize training initiatives, organize and (if necessary) seek additional funding for training initiatives, identify expertise across the network, fill expertise gaps by facilitating cross-project collaboration, and create a library or catalog of methods and software available to the network. In the short term, the product of this working group effort will be the implementation of an online training forum. In the longer term, this working group may consider sharing and/or developing robust methods for data integration, sharing, analytics and visualization.

Information Model Development Working Group

The understanding of and motivation to build an IBD research information model is one of the most important potential follow-on activities from the workshop.

Focus: This working group, which could be comprised of a representative from each IBD project, would determine and, if necessary, develop metadata and data standards and would document a comprehensive information model (i.e. data elements, allowable values, constraints and relationships) that describes the landscape of IBD research data across all Helmsley-funded projects. The standards would be developed with consideration of industry and government standards to streamline efforts and ensure maximum data-sharing potential and system interoperability.

In the short term, goals include cataloging all studies and datasets across the network, and rapidly defining a minimal standard dataset to share across the network (may be implemented as part of existing projects or as a “test-and-learn” pilot study). In the longer term, the ontology (including experiments, datasets, and data elements) could be built out in phases.

We recommend leveraging the ontology work done for the EDRN, both in format (using Protégé software) and process. Steve Hughes, Data Architect for JPL, presented his work in building ontologies both for NASA Planetary Data Science (an enterprise-wide effort) and EDRN at the workshop; we recommend leveraging Steve’s expertise for getting started and forming a plan and timeline. The SHARE Mayo Clinic participants expressed willingness for leadership in this effort.
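To make the idea of an information model (data elements, allowable values, and constraints) more concrete, the following is a minimal sketch in Python. It is only an illustration under stated assumptions: the class `DataElement`, the function `validate`, and the element names and permitted values are hypothetical, loosely echoing the draft descriptors in Appendix C, and are not standards agreed by the network. Relationships between elements are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DataElement:
    """One element of a hypothetical information model."""
    name: str
    allowed_values: Optional[frozenset] = None            # None means free-form
    constraint: Optional[Callable[[object], bool]] = None  # extra rule, if any
    required: bool = True


# Hypothetical elements; names and values are illustrative assumptions only.
MODEL = [
    DataElement("sample_type", frozenset({"tissue", "stool", "blood"})),
    DataElement("tissue_state", frozenset({"inflamed", "uninflamed"}), required=False),
    DataElement("age_of_patient",
                constraint=lambda v: isinstance(v, int) and 0 <= v <= 120),
]


def validate(record: dict, model=MODEL) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for element in model:
        if element.name not in record:
            if element.required:
                problems.append(f"missing required element: {element.name}")
            continue
        value = record[element.name]
        if element.allowed_values is not None and value not in element.allowed_values:
            problems.append(f"{element.name}: value {value!r} not allowed")
        if element.constraint is not None and not element.constraint(value):
            problems.append(f"{element.name}: constraint violated by {value!r}")
    return problems
```

A record such as `{"sample_type": "stool", "age_of_patient": 34}` passes with no problems, while an unlisted sample type is reported by name. Keeping the model as plain data, separate from the checking logic, is one way to let projects on different technical platforms share the same definitions.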

IBD Bioinformatics Initiative Working Group

Focus: This working group would define and prioritize a series of “test-and-learn” bioinformatics projects that, with low cost and short timeline, would allow proof of feasibility and rapid progress, and act as seeds upon which more substantial resources may be built out. These projects may include: 1) the implementation of an initiative to catalog all studies and datasets existing in the Helmsley IBD research network, 2) the sharing of a relatively small, manageable set of data across the network to work through patient privacy, IT, and ontology issues, and 3) the specification and development of software modules to be shared across the network. In the longer term, this workgroup could use and/or develop bioinformatics tools. For instance, it could implement the SAGE data-sharing platform (currently in use by Group 4, VEO) and customize tools developed by Caltech for data visualization.

2. Convene Face-to-Face Meetings

Data scientists from each project could be gathered at least annually to share and discuss working group progress, new capabilities, emerging gaps in capability, and impact of bioinformatics on the scientific research. Having a cohesive group that identifies as a data science “team” will promote resource sharing and strong collaboration. We recommend convening this group in the context of a scientific meeting, e.g. Digestive Disease Week (DDW), so research scientists may participate in part, enabling open communication about needs, priorities, and available resources. Research scientists and data scientists can motivate and learn from each other, offering the possibility of innovation and efficiency.

3. Fund Bioinformatics Projects

We recommend developing Requests for Proposals (RFPs) for data infrastructure projects. These proposals may offer to fund such projects as the development of an online training session on data integration principles and methods for Helmsley-funded data scientists; a feasibility data-sharing initiative that includes and builds off of the rapid development of a prototype information model and which includes integrating a small relevant dataset across all projects; or an evaluation of high-performance computing resources across projects with innovative recommendations for collaboration to enable large-scale computation that tests the limits of an individual project infrastructure.

The ultimate long-term vision that emerged from the workshop was for a coordinated, well-defined, consistent data model infrastructure for the Helmsley IBD research network that would facilitate data sharing, integration and collaboration among bioinformaticists to most effectively move research to the goal of a cure for IBD.

APPENDIX A — LIST OF PARTICIPANTS

Kristen Anton, MS
Director, Bioinformatics
University of North Carolina & Geisel School of Medicine at Dartmouth
Hanover, New [email protected]

Megan Baldridge, MD, PhD
Postdoctoral Research Fellow
Washington University School of Medicine
St. Louis, [email protected]

Yu-Ling Chang
Graduate Student
UCLA
Los Angeles, [email protected]

Luca Cinquini, PhD
Software Engineer
NASA/JPL
Pasadena, [email protected]

Maureen Colbert, MS
Bioinformatics Analyst
Geisel School of Medicine at Dartmouth
Hanover, New [email protected]

Ana Corraliza, MSc
Graduate Student
IDIBAPS
Barcelona, [email protected]

Dan Crichton, MS
Program Manager and Principal Investigator
NASA/JPL
Pasadena, [email protected]

Olga Davidov
Data Manager
Tel Aviv-Sourasky Medical Center
Tel Aviv, [email protected]

George Djorgovski, PhD
Director and Professor
Caltech
Pasadena, [email protected]

Angela Dobes, BS, MPH
IBD Plexus Director
Crohn’s and Colitis Foundation of America
New York, New [email protected]

Fengshi Dong, MS
Research Technologist
The University of Chicago
Chicago, [email protected]

Richard Doyle, PhD
Program Manager
Jet Propulsion Laboratory
Pasadena, [email protected]

Abdul Elkadri, MD
Clinical Research Fellow
The Hospital For Sick Children
Toronto, [email protected]

Eric Franzosa, PhD
Research Associate
Harvard T. H. Chan School of Public Health
Boston, [email protected]

Thomas Fuchs, PhD
Head, Computational Pathology Lab
Memorial Sloan Kettering Cancer Center
New York, New [email protected]

John Garber, MD
Gastroenterologist/Research Fellow
Massachusetts General Hospital
Boston, [email protected]

Gautam Goel, PhD
Research Fellow in Medicine
Mass General Hospital
Boston, [email protected]

George Gulotta, BS
Report Writer
University of Chicago
Chicago, [email protected]

Gabriel Hoffman, PhD
Assistant Professor
Icahn School of Medicine at Mount Sinai
New York, New [email protected]

Steven Hughes, MS
Information Architect
Principal Computer Scientist
NASA/JPL
Pasadena, California [email protected]

Larry James, Lt. Gen. (ret)
Deputy Director
NASA/JPL
Pasadena, California

Heather Kincaid, BS
Operations Lead
NASA/JPL
Pasadena, [email protected]

Jennifer Kwon, DrPH, MPH
SHARE Research Consortium Manager
Mount Sinai School of Medicine
New York, New [email protected]

Ilias Lagkouvardos, PhD
Post Doctoral Researcher
Technical University of Munich
Munich, [email protected]

Lionel Le Bourhis, PhD
Research Scientist
INSERM, Saint-Louis Hospital
Paris, [email protected]

Maayan Levy
Graduate Student
Weizmann Institute of Science
Rehovot, [email protected]

Dalin Li, PhD
Research Scientist II
Cedars-Sinai Medical Center
Los Angeles, [email protected]

Ashish Mahabal, PhD
Sr. Computational Scientist
Caltech
Pasadena, [email protected]

Peter Meisel, MPH
Associate Program Officer, IBD/Crohn’s Disease Program
Helmsley Charitable Trust
New York, New [email protected]

Kelly Monroe, MSW
Lead Project Coordinator
Washington University School of Medicine
St. Louis, [email protected]

Darren Nix
Clinical Lab Manager
Washington University School of Medicine
St. Louis, [email protected]

Jim O’Sullivan, MPH
Program Director, IBD/Crohn’s Disease Program
Helmsley Charitable Trust
New York, New [email protected]

Jodie Ouahed, MD, MMSc
Physician/Scientist
Boston Children’s Hospital
Boston, [email protected]

Kevin Ow
Clinical Research Project Manager
Mount Sinai Hospital
Toronto, [email protected]

Susan Painter, MPP
Program Officer, IBD/Crohn’s Disease Program
Helmsley Charitable Trust
New York, New [email protected]

Andrew Paterson, MB ChB
Senior Scientist, Genetics and Genome Biology
Hospital for Sick Children
Toronto, [email protected]

Núria Planell, MSc
Graduate Student
CIBERehd
Barcelona, [email protected]

Alka Potdar, MS, PhD
Research Bioinformatician III
Cedars-Sinai Medical Center
Los Angeles, [email protected]

Daniel Quest, PhD
Bioinformatician and Information Architect
Mayo Clinic
Rochester, [email protected]

Robert Rheaume, BS
Senior Programmer/Analyst
Dartmouth College
Hanover, New [email protected]

Khushbu Rupani, MS
Database Analyst
Mount Sinai Hospital
New York, New [email protected]

Melissa Saul, MS
Analytics Center Director
University of Pittsburgh
Pittsburgh, [email protected]

David Seligson
Developer
University of North Carolina
Chapel Hill, North [email protected]

Ruslan Sergienko, MA, BEng
Department of Public Health
Ben-Gurion University of the Negev
Beer-Sheva, [email protected]

Shai Shen-Orr, PhD
Assistant Professor
Technion-Israeli Institute of Technology
Haifa, [email protected]

Shefali Soni, PhD
Associate Program Officer, IBD/Crohn’s Disease Program
Helmsley Charitable Trust
New York, New [email protected]

Nicole Villaverde, MS
Senior Associate Researcher
Mount Sinai School of Medicine
New York, New [email protected]

Mathieu Wiepert, MS
Unit Head IT
Mayo Clinic
Rochester, [email protected]

Lior Yahav, MSc
Research Coordinator
Tel Aviv-Sourasky Medical Center
Tel Aviv, [email protected]

Xiaofei Yan, MS
Senior Biostatistician
Cedars-Sinai Medical Center
Los Angeles, [email protected]

Guoyan Zhao, PhD
Assistant Professor
Washington University
St. Louis, [email protected]

Gili Zilberman
Graduate Student
Weizmann Institute of Science
Rehovot, [email protected]

APPENDIX B — DATA PROGRAMS USED BY HELMSLEY IBD GRANT RECIPIENTS

| PROGRAM | PURPOSE | CONSORTIA* |
| --- | --- | --- |
| 16S rDNA Genome Shotgun sequencing | In-house pipelines for utilizing, analyzing, and interpreting the sequencing data | Weizmann Institute of Science |
| ADMIXTURE | Genome analysis toolset | Cedars Sinai (SHARE) |
| affy | Bioconductor tools for oligonucleotide array analysis | IBD Over Time (IBDOT); Cedars Sinai (SHARE) |
| Affymetrix expression console | Software for summarizing and performing gene-level normalization | Washington University; University of North Carolina (SHARE) |
| ANADAMA | Pre-built workflows for microbiome analysis | CCFA Microbiome |
| Annovar | Tools for annotating genetic variants from high-throughput sequencing data | Cedars Sinai (VEO-IBD) |
| ANSI-SQL | Standard relational database on an Oracle 10g RDBMS platform | GEM-CCC |
| Bioinformatics toolbox | Network construction and visualization packages | Cedars Sinai (SHARE) |
| Biom | Conversion of biome data to text data | GEM-CCC |
| Biosample Inventory Management System | Cloud-based multi-tenant SQL database (hosted on AWS) which uses MicroStrategy BI and analytics tools | IBD Plexus |
| BWA | Alignment tool for sequencing data | Ludwig Maximilian University of Munich [LMU] (VEO-IBD); Cedars Sinai (VEO-IBD) |
| ClassDiscovery | RNA differential expression analysis | IBD Over Time (IBDOT) |
| Correlation Network Analysis | Gene association analysis | MGH (SHARE) |
| Cytoscape | Bioinformatics software platform for visualizing molecular interaction networks | Cedars Sinai (SHARE) |
| DAVID | Annotation and Gene Enrichment Analysis | Cedars Sinai (SHARE) |
| Delly | WG structural variation | Mount Sinai (VEO-IBD) |
| DESeq | Bioconductor tools | Cedars Sinai (VEO-IBD); Cedars Sinai (SHARE) |

| PROGRAM | PURPOSE | CONSORTIA* |
| --- | --- | --- |
| eCRFs | Leveraging Salesforce to create 3 electronic case report forms for site coordinators to fill out | IBD Plexus |
| edgeR | Bioconductor package for RNA differential expression | Mayo Clinic (SHARE); Cedars Sinai (SHARE) |
| EIGENSTRAT | Genome analysis toolset | Cedars Sinai (SHARE) |
| EMPEROR | Generation of 3D Principal Coordinates Analysis for microbiome data | GEM-CCC |
| EMR accelerator tools | Build of an extract process to transfer Epic Smart-form data and additional EMR data to Clarity database | IBD Plexus |
| ExAC | Biological databases | LMU (VEO-IBD) |
| FACSDiva for acquisition and FlowJo for analysis | Immune cell phenotyping | IBD Over Time (IBDOT) |
| Freezerworks | Biological sample tracking and freezer inventory technology | Washington University; University of North Carolina (SHARE) |
| GATK | Software package for analysis of high-throughput sequencing data | Cedars Sinai (VEO-IBD); Mount Sinai (SHARE) |
| GEM | Genome Multitool mapper is an alignment tool for sequencing data | Weizmann Institute of Science |
| GEMINI | Filtering | Mount Sinai (VEO-IBD) |
| Gene expression deconvolution algorithms and mixed models | Longitudinal Analysis | Israeli IBD Research Nucleus (IIRN) |
| Genome Analysis Toolkit | Customized NGS-analysis pipeline | LMU (VEO-IBD) |
| Golden Helix: SNP & Variation Suite | Software to analyze large genomic datasets | Boston Children’s Hospital and Sickkids (VEO-IBD); Cedars Sinai (SHARE) |
| GWASTools | Bioconductor package for analysis of genome-wide association studies | Cedars Sinai (SHARE) |
| HGMD | Biological databases | LMU (VEO-IBD) |
| hgu219hsentrezg.db packages | RNA differential expression analysis | IBD Over Time (IBDOT) |
| HUMANn2 | Microbial metagene analysis platform | CCFA Microbiome |

| PROGRAM | PURPOSE | CONSORTIA* |
| --- | --- | --- |
| Illumina | A tool for genetic analysis | University of Chicago (SHARE) |
| Illumina GenomeStudio Genotyping Module | Generation of genotype data and allele-calling | Cedars Sinai (SHARE) |
| IlluminaHumanMethylation450k.db | Epigenetic analysis (the methylation of cytosine residues) | IBD Over Time (IBDOT) |
| javascript | Data validation scripts | Mayo Clinic (SHARE) |
| Lefse | Generation of cladogram from microbiome data | GEM-CCC |
| Limesurvey | Web application to develop, publish, and collect responses to online and offline surveys | Israeli IBD Research Nucleus (IIRN) |
| Limma | RNA differential expression analysis | IBD Over Time (IBDOT) |
| Lumy | Analysis pipeline for Illumina expression and methylation microarray data | IBD Over Time (IBDOT); Mount Sinai (VEO-IBD) |
| MAP-RSeq, Mayo Analysis Pipeline | RNA sequencing analysis pipeline | Mayo Clinic (SHARE) |
| MATLAB | Statistical analyses | Weizmann Institute of Science |
| Megan | Bacterial 16S rDNA sequencing data analysis | CCFA Genetics Initiative |
| Metadistance in Matlab | Statistics and machine learning tools | Cedars Sinai (SHARE) |
| METAL | Genetic statistical analyses | Cedars Sinai (SHARE) |
| Methylumi | Epigenetic analysis (the methylation of cytosine residues) | IBD Over Time (IBDOT) |
| MLR | Interface for a number of classification and regression techniques | Cedars Sinai (SHARE) |
| MS Access | Database management system | Israeli IBD Research Nucleus (IIRN) |
| MS tools: SSIS, SQL Server | For BI process | Israeli IBD Research Nucleus (IIRN) |
| Nanostring | Data normalization and machine learning frameworks for feature identification and classifier training | MGH (SHARE) |
| NGS tools | Sequencing tools | Cedars Sinai (SHARE) |
| OMIM | Biological database | LMU (VEO-IBD) |
| Pandaseq | Merging pair-end sequences from genotyping | GEM-CCC |

| PROGRAM | PURPOSE | CONSORTIA* |
| --- | --- | --- |
| Partek | Sequencing, microarray and qPCR data analysis software | Washington University and University of North Carolina (SHARE) |
| PHP | Web interfaces | Mount Sinai (VEO-IBD) |
| Phyloseq | Software for analysis of high-throughput microbiome data | Cedars Sinai (VEO-IBD) |
| Picrust | Generation of imputed microbial function | GEM-CCC |
| PLINK | Analysis of genotyping data | GEM-CCC; Cedars Sinai (VEO-IBD); Mount Sinai (SHARE) |
| PostgreSQL | Customized NGS-analysis pipeline | LMU (VEO-IBD) |
| Prism | Statistical analysis | CCFA Genetics Initiative |
| Python | For integrative data analysis – including high-dimensional, single-cell data (CyTOF), for longitudinal analysis | Israeli IBD Research Nucleus (IIRN) |
| QIIME | Bioinformatics pipeline for performing microbiome analysis | GEM-CCC; Weizmann Institute of Science |
| QIITA | Microbiome storage and analysis resource | CCFA Microbiome |
| QlikView | Software platform for development of big data analytics | IBD Plexus |
| QTL analysis pipelines | Mapping of quantitative trait loci | MGH (SHARE) |
| Quantitative EST | Immune expression enrichment analysis | CCFA Genetics Initiative |
| R | Statistical analysis | CCFA Genetics Initiative; GEM-CCC; IBD Plexus; Israeli IBD Research Nucleus (IIRN); Weizmann Institute of Science; IBD Over Time (IBDOT); LMU and SickKids (VEO-IBD); Mayo Clinic (SHARE); Mount Sinai (SHARE); Cedars Sinai (SHARE) |
| RTN | Master regulator analysis | Mayo Clinic (SHARE) |

| PROGRAM | PURPOSE | CONSORTIA* |
| --- | --- | --- |
| SAAP-RRBS | DNA methylation sequencing data analysis pipeline developed at Mayo Bioinformatics Core and used for the RRBS data | Mayo Clinic (SHARE) |
| SAM Tools | Sequencing tools | LMU (VEO-IBD); Cedars Sinai (SHARE) |
| SAP | Enterprise management software | Israeli IBD Research Nucleus (IIRN) |
| SAS | Software for advanced analytics | Mount Sinai (SHARE) |
| scikit-learn in python | Statistics and machine learning tools | Cedars Sinai (SHARE) |
| SNPassoc and SNPstats R CRAN | Analysis of Single Nucleotide Polymorphisms (SNPs) | IBD Over Time (IBDOT); Boston Children’s Hospital and Sickkids (VEO-IBD) |
| SOLiD | High-throughput DNA sequencing platform | University of Chicago (SHARE) |
| SQL database | Managing study data | Mount Sinai (VEO-IBD) |
| SVM, Penalized | Statistics and machine learning tools | Cedars Sinai (SHARE) |
| Syzygy - SNP and indel calling | Pooled and individual targeted resequencing | CCFA Genetics Initiative |
| TopHat | Sequencing tools | Cedars Sinai (SHARE) |
| UniProt | A high-quality, curated biological database | LMU (VEO-IBD) |
| Usearch | Removal of chimera sequences and clustering of Operational Taxonomic Units (OTUs) from sequencing data | GEM-CCC |
| VarSeq | Software to analyze large genomic datasets | Boston Children’s Hospital and Sickkids (VEO-IBD) |
| VEP | Tool for annotating genetic variants from high-throughput sequencing data | Cedars Sinai (VEO-IBD); Mount Sinai (VEO-IBD) |
| VirusSeeker | Virome analysis | CCFA Genetics Initiative |
| WGCNA | Software for performing gene correlation network analysis | Cedars Sinai (VEO-IBD) |
| XHMM | Exome copy number variation | Mount Sinai (VEO-IBD) |
| ZIBSeq | Bioconductor tools | Cedars Sinai (SHARE) |
| ZIG | Bioconductor tools | Cedars Sinai (SHARE) |

* CCFA Microbiome (Washington University); CCFA Genetics Initiative (Washington University); GEM-CCC (Mount Sinai Hospital, Toronto; Hospital for Sick Children, Toronto); Israeli IBD Research Nucleus (IIRN) [Tel Aviv Sourasky Medical Center, Ben-Gurion University of the Negev, Technion Israel Institute of Technology]; IBD Over Time (IBDOT) [INSERM, IDIBAPS, TUM]

APPENDIX C — PROPOSED METADATA ELEMENTS AND DATA DESCRIPTORS FOR EACH HELMSLEY-FUNDED STUDY

This list of descriptors is the start of a set of standard information that may be captured about each Helmsley-funded research study. Each of these elements would be carefully and precisely defined; the Information Model Development Working Group would complete this work. This descriptive information would allow researchers to know what is being investigated across the network, and standardizing these descriptors allows for direct query and classification of experiments.

Metadata list (draft):

1. Sample type (tissue/stool/blood)
2. Tissue type location
3. Tissue state (inflamed/uninflamed)
4. Disease activity at the time of collection (through some indices to be defined)
5. Experiments
6. Age of patient
7. Age of disease
8. Ethnicity
9. Gender
10. Undergone surgery?
11. Medication, by category
12. Smoking history
13. Diagnosis
14. Family history
15. Flag for self-report
16. Various indices

Dataset descriptors (draft):

1. Size of set
2. Origin of collection
3. Origin of data
4. Consent status
5. Methods used
6. Location
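As a rough illustration of how standardized descriptors could support direct query of studies across the network, the following Python sketch stores a handful of the draft descriptors as structured records and filters them. It is a sketch under stated assumptions, not a proposed implementation: the `StudyDescriptor` class, the `find_studies` function, the field names, and the three example entries are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class StudyDescriptor:
    """Hypothetical catalog entry built from a few of the draft descriptors."""
    study_id: str
    sample_type: str      # tissue / stool / blood
    tissue_state: str     # inflamed / uninflamed / "" when not applicable
    diagnosis: str
    consent_status: str
    size_of_set: int


def find_studies(catalog, **criteria):
    """Return the entries whose fields equal every keyword criterion given."""
    return [
        study for study in catalog
        if all(getattr(study, field) == wanted for field, wanted in criteria.items())
    ]


# Invented example entries, for illustration only.
catalog = [
    StudyDescriptor("study-A", "stool", "", "Crohn's disease", "consented", 120),
    StudyDescriptor("study-B", "tissue", "inflamed", "ulcerative colitis", "consented", 45),
    StudyDescriptor("study-C", "tissue", "uninflamed", "Crohn's disease", "pending", 30),
]

matches = find_studies(catalog, sample_type="tissue", diagnosis="Crohn's disease")
print([study.study_id for study in matches])
```

Exact-match filtering like this only works when every project records a descriptor with the same name and the same vocabulary, which is the practical reason for defining each element precisely before sharing.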

APPENDIX D — BIOINFORMATICS PLANNING TEMPLATE FOR HELMSLEY APPLICANTS

In up to three pages of text, please address the questions below about how you will manage data needs for your research project.

I. Roles and Responsibilities

a. Who will be the leads for (a) data management and (b) bioinformatics for the proposed project(s)?

b. What experience do team members have that will inform their data management roles?

II. Technology and Analyses Packages

a. What hardware is needed for the proposed research?

b. Which databases will be utilized or developed for these projects?

c. What software is needed for the proposed research, data dissemination, and analyses?

d. For each item above, please specify whether the components are already in place (and if so, for how long) or need to be built or developed (and if so, on what timeframe).

III. Data Standards, Sharing, and Ontology

a. How will metadata be generated and captured for each of your data sets?

i. Where is the data generated for this project?

ii. Who is responsible for the data management and coordination (at each site, if applicable)?

b. What metadata will you need to generate so that others will be able to find, understand, and make use of your data?

c. Who on your team is responsible for ensuring that data standards are properly applied and data are properly formatted?

i. How will you format your data so that others in your field will be able to make use of it?

d. Who on your team will be responsible for ensuring metadata standards are followed among network participants?

e. Would you place any conditions on sharing your data with others?

i. Who will own these data sets? Who needs to be consulted before data sets are made available?

ii. How will you ensure ongoing access to the data beyond the life of the project?

iii. How will you ensure that these data sets are able to withstand changes in or the obsolescence of the storage technologies?

APPENDIX E — GLOSSARY

Affymetrix: American company that manufactures DNA microarrays used to identify large numbers of genes simultaneously.

Allscripts: An American company providing physician practices, hospitals, and other healthcare providers with electronic health record and practice management technology, including electronic prescribing, care management, and revenue cycle management software.

Agilent: Bioinformatics software that provides a full range of tools to enable robust visualization and analysis of genomic datasets across a range of applications.

BioPortal: Comprehensive repository of biomedical ontologies.

BLAST: Basic Local Alignment Search Tool. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

Cerner: Cerner Corporation is a supplier of health information technology (HIT) solutions, services, devices, and hardware.

Clindesk: Electronic health record (EHR) software.

Cluster Computing: A network of computers acting as a single, much more powerful machine.

EPIC: Electronic Health Record software for medical groups, hospitals, and integrated healthcare organizations.

Fasta: A suite of programs for searching nucleotide or protein databases with a query sequence. Often used as an alternative to, or in conjunction with, BLAST.

Freezerworks: Biological sample tracking and freezer inventory software.

Golden Helix: High quality software to analyze large genomic datasets.

Harvey-Bradshaw: A standardized, validated index used to assess Crohn’s disease activity.

High Performance Computing: The practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business.

HL7: A framework (and related standards) for the exchange, integration, sharing, and retrieval of electronic health information.

HTML: Hyper Text Markup Language. Standard language used to create websites.

Illumina: An American company that develops, manufactures and markets integrated systems for the analysis of genetic variation and biological function.

Javascript: A computer programming language.

KEGG: Kyoto Encyclopedia of Genes and Genomes is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances.

LIMS: Laboratory Information Management System is a software-based laboratory and information management system with features that support a modern laboratory’s operations.

MATLAB: A high-level language and interactive environment that lets the user explore and visualize ideas and collaborate across disciplines including signal and image processing, communications, control systems, and computational finance.

Metadata: The data that describes other data. It describes, for instance, how and when and by whom a particular set of data was collected, and how the data is formatted.

MIAME: Minimum information about a microarray experiment is a standard created for reporting microarray experiments.

OID: An object identifier, which is an extensively used identification mechanism for naming any type of object or concept with a globally unambiguous, persistent name. It is part of a naming structure and serves as a consistent naming standard in computer programming.

Ontology: In data engineering, this is a representation of concepts and the relationships, constraints, rules and operations to specify data semantics.

Oracle: A relational database management system. Oracle databases hold and manage large tabular data and are a leading industry standard.

OWL: Web Ontology Language. A World Wide Web Consortium (W3C) standard language for representing ontologies, that is, classes, properties, individuals, and the relationships among them, in a machine-readable form.

Perl: Practical Extraction and Report Language is a programming language.

PHP: A scripting language designed for web development but also used as a general-purpose programming language.

PLINK: A free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.

Python: A widely used general-purpose, high-level programming language.

QIIME: An open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data.

R: A programming language and free software environment for statistical computing and graphics.

REDCap: An open source database application designed to support data capture for research studies that are not complex. This product was developed by Vanderbilt University, with public funds, and is made available to institutions free of charge.

SAS: Statistical Analysis System is a software suite for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics.

SPSS: A software package used for statistical analysis, an alternative to SAS and R.

helmsleytrust.org