Genomes, Clouds, and Organization
eMedLab Workshop, London, May 2016
Chris Dwan, Director, Research Computing
@fdmts
Conclusions
• In order to take full advantage of cloud technologies, we need to change not just what we do, but also how we do it.
• Organizations need to fundamentally rethink how they engage with technology and technologists in order to remain relevant.
• The groups who get good at collaboration in this new world will lead the next decade of biomedical science.
• The Broad Institute is a non-profit biomedical research institute founded in 2004
• Fifty core faculty members and hundreds of associate members from MIT and Harvard
• ~1000 research and administrative personnel, plus ~2,400+ associated researchers
• ~1.4 x 10^6 genotyped samples
Programs and Initiatives: focused on specific disease or biology areas
• Cancer
• Genome Biology
• Cell Circuits
• Psychiatric Disease
• Metabolism
• Medical and Population Genetics
• Infectious Disease
• Epigenomics
Platforms: focused technological innovation and application
• Genomics
• Data Sciences
• Therapeutics
• Imaging
• Metabolite Profiling
• Proteomics
• Genetic Perturbation
The Broad Institute
“This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and cure of disease”
People @ Broad
WGS / day: ~120, 140, ... (plus other products)
Data generation: ~0.5 PB/mo (200 MB/s)
Network: ~1.6 Gb/sec
This is not going to slow down any time soon.
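The three rates quoted above are roughly the same number in different units. A minimal sketch of the conversion, assuming decimal units (1 PB = 10^15 bytes) and a 30-day month; both assumptions and the function names are illustrative and not from the talk:

```python
# Sanity check of the throughput figures quoted on the slide.
# Assumptions (not from the talk): decimal units (1 PB = 10**15 bytes)
# and a 30-day month.

SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30  # assumed month length


def pb_per_month_to_mb_per_sec(pb_per_month: float) -> float:
    """Convert a monthly data volume in PB to a sustained rate in MB/s."""
    bytes_per_month = pb_per_month * 10**15
    seconds_per_month = DAYS_PER_MONTH * SECONDS_PER_DAY
    return bytes_per_month / seconds_per_month / 10**6


def mb_per_sec_to_gbit_per_sec(mb_per_sec: float) -> float:
    """Convert a rate in megabytes per second to gigabits per second."""
    return mb_per_sec * 8 / 1000


if __name__ == "__main__":
    rate = pb_per_month_to_mb_per_sec(0.5)
    print(f"0.5 PB/mo is about {rate:.0f} MB/s sustained")
    print(f"200 MB/s is {mb_per_sec_to_gbit_per_sec(200):.1f} Gb/s")
```

Under these assumptions 0.5 PB/mo works out to a little under 200 MB/s, and 200 MB/s is 1.6 Gb/sec, which matches the network figure on the slide.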
Colocated File Storage: ~30 PB
Colocated HPC: ~14k cores
Colocated Object Storage Capacity: ~5 PB
Public cloud data: ~7 PB
Public cloud cores: ~15k cores steady state
Internal network: 10 Gb/sec
External network: 100 Gb/sec
Base pairs vs. Samples
The future is already here – it’s just not very well distributed
William Gibson
A lot of technology has happened since we were all worried about “data tsunamis” in 2007.
Amazon’s innovation
2002: All sharing of data, provisioning of services, configuration of infrastructure – everything is via programmatic call (API)
APIs must be written to be called by external customers.
Anyone who does not do this will be fired, have a nice day.
2006: Amazon launches a product with which I can provision servers and storage as easily as I buy books.
Cloudbursting (Aug, 2015)
50,000+ cores used for ~2 hours
Data Storage (May 2016)
Avere (June 2015): A cloud gateway for files.
• Data uploaded: 4 PB and counting
• Compression and client-side encryption in-line (push-button)
• Simple enough that we’re out in front of the computational capabilities ($$)
[Diagram: Broad Data Center (Physical Compute Hosts, Physical Avere Cluster, Physical Data Store) connected to Google Cloud Services (Virtual Compute Hosts, Virtual Avere Cluster, Cloud Bucket). The links between the two sides are labeled “Free” and “Expensive”.]
Liberation from the location of metal
The billing API is the best way to get usage information out of Google’s cloud offerings.
Eight Exabytes Free
File-based storage: The Information Limits
• Single namespace filers hit real-world limits at:
  – ~5 PB (restriping times, operational hotspots, MTBF headaches)
  – ~10^9 files: directories must be either wider or deeper than human brains can handle.
• Filesystem paths are presumed to persist forever
  – Forests of symbolic links
  – “Charlotte’s web”
• Access semantics are fundamentally inadequate.
  – We need complex, dynamic, context-sensitive semantics, including consent for research use.
  – File hierarchies will never scale to a federated world.
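The "wider or deeper" trade-off at ~10^9 files can be made concrete with a little arithmetic. The sketch below assumes an idealized, perfectly balanced directory tree in which every directory holds at most a fixed number of entries; the function name and the example fan-outs are illustrative, not from the talk:

```python
def min_depth(total_files: int, max_entries_per_dir: int) -> int:
    """Fewest directory levels needed so that a balanced tree with at most
    max_entries_per_dir entries per directory can hold total_files files.

    Uses integer arithmetic only, so there is no floating-point rounding.
    """
    if total_files < 1 or max_entries_per_dir < 2:
        raise ValueError("need total_files >= 1 and max_entries_per_dir >= 2")
    depth, capacity = 0, 1
    while capacity < total_files:
        capacity *= max_entries_per_dir
        depth += 1
    return depth


if __name__ == "__main__":
    # Example fan-outs are arbitrary choices for illustration.
    for fanout in (10, 100, 1000):
        levels = min_depth(10**9, fanout)
        print(f"{fanout} entries per directory -> {levels} levels")
```

Keeping directories small enough to browse (tens of entries) forces many levels, while keeping the tree shallow forces directories with thousands of entries or more.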
3rd Party Companies Fill Cloud Feature Gaps
CloudHealth dashboard atop the billing API
[Dashboard annotations: Storage $$ (direct storage cost); Network $$ (two kinds of network egress)]
Data’s trip to the cloud should be one-way.
Genomes on the Cloud (April 2016)
Testing the genome analysis pipeline
“Go-live”
“To be without method is deplorable, but to depend entirely on method is worse.”
The Mustard Seed Garden Manual of Painting, 1679
Most laboratory and clinical work
Manager of compute infrastructure for use by others.
Consumer of analysis
User of GUI and visual tools
Author of scripts and workflows for personal use
Author of scripts and command line tools for use by others
Manager of compute infrastructure for personal use
A Technology Engagement Spectrum
“Users”
“Shadow IT”
Well served by traditional “research computing”
To The Cloud!
To The Other Cloud!
Already happily off-prem, PaaS, etc.
Most laboratory and clinical work
Manager of compute infrastructure for use by others.
Consumer of analysis
User of GUI and visual tools
Author of scripts and workflows for personal use
Author of scripts and command line tools for use by others
Manager of compute infrastructure for personal use
Tool Building
Training / Access
Shifting how we engage with technology
A Technology Engagement Spectrum
“Users”
“Shadow IT”
Well served by traditional “research computing”
What does “cloud” mean to me?
• Engineering and Design Approach:
  – All infrastructure and technology choices are seamlessly available, as necessary, to every project and product.
• Integrative Organizing Principle*
  – Technologists directly engaged and accessible
  – Shared accountability for business / project goals
Organizations that fail to integrate in this way will be routed around.
*DevOps
Product (revenue generation)
User Services (workstations, laptops, printers)
Run the Business (HR, Finance, …)
IT / Infrastructure
Internal Service Catalog
Business Priorities
A traditional IT organization, splitting infrastructure and technical architecture away from business priorities
Product (increased connection with architectural and infrastructure design)
User Services (workstations, laptops, printers)
Run the Business (HR, Finance, …)
Infrastructure
Business Priorities
Internal Service Catalog
DevOps (direct engagement w/ teams through entire product lifecycle)
The beginnings of a DevOps transition, characterized by teams named “DevOps,” that serve particular projects
Business units dive into infrastructure as they need, partnering with technologists to achieve business goals
A mature DevOps IT organization composed of the same staff, working in a fundamentally different way.
Business Priorities
Clouds open new possibilities for IT Services
Traditional IT:
• Globally shared services
• NFS, AD / LDAP, DNS, …
• Many services provided using public clouds
Responsibility: CIO
Cancer Genome Analysis
Connectivity Map
Billing Support:
• IT provides coordination between internal cost objects and cloud vendor “projects” or “roles”
• No shared services
Responsibility: User
Governance remains critical
$$ !!
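The billing coordination described above amounts to mapping vendor-side project identifiers onto internal cost objects and totalling the charges. A minimal sketch, assuming billing line items have already been exported as simple records; the project names, cost object codes, and record fields are all hypothetical and do not reflect any vendor's actual billing schema:

```python
from collections import defaultdict

# Hypothetical mapping from cloud vendor project IDs to internal cost objects.
PROJECT_TO_COST_OBJECT = {
    "cga-pipeline": "CO-1001",
    "cmap-analysis": "CO-2002",
}


def charges_by_cost_object(line_items, mapping):
    """Sum line-item costs per internal cost object.

    Charges from projects missing in the mapping land in an "UNMAPPED"
    bucket so that they stay visible instead of silently disappearing.
    """
    totals = defaultdict(float)
    for item in line_items:
        cost_object = mapping.get(item["project"], "UNMAPPED")
        totals[cost_object] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample_items = [
        {"project": "cga-pipeline", "service": "storage", "cost": 10.0},
        {"project": "cga-pipeline", "service": "egress", "cost": 2.5},
        {"project": "cmap-analysis", "service": "compute", "cost": 7.0},
        {"project": "someones-side-project", "service": "compute", "cost": 1.0},
    ]
    for cost_object, total in charges_by_cost_object(
        sample_items, PROJECT_TO_COST_OBJECT
    ).items():
        print(f"{cost_object}: {total:.2f}")
```

The explicit "UNMAPPED" bucket is one way to act on the governance point: spending with no owner shows up in the report rather than being absorbed.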
Cloud / Hybrid Model
• Granular shared services
• VPN used to expose selected services to particular projects
Responsibility: Project / Service Lead
BITS DevOps DSDE Dev Cloud Pilot
API API API
The Cloud Future (where we are going)
• We are not so special:
  – Dozens to hundreds of businesses have multiple exabytes of data.
  – Health care / life sciences is playing catch-up.
• Objects, not files:
  – Engineer like an MMORPG* designer.
  – Do not copy files. Access APIs.
  – Avere gets around this by turning objects back into files.
• Cloud-aware access patterns:
  – Data egress is expensive.
  – Do computing adjacent to the data.
  – Figure out a cost model to support this world.
• Not everybody will use the same cloud vendor:
  – If we want to collaborate at scale, we need to stop thinking in terms of single, monolithic solutions.
*Massively Multiplayer Online Role Playing Game
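A cost model for "compute adjacent to the data" can start very small. The sketch below compares pulling a whole dataset out of the cloud against computing in place and pulling out only the results. All prices and sizes are placeholders invented for illustration, not vendor rates, and the model ignores the cost of on-premises compute on the assumption that the capacity is already paid for:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Placeholder unit prices; substitute real vendor rates before use."""
    egress_per_gb: float
    compute_per_core_hour: float


def cost_download_then_compute_locally(dataset_gb: float, prices: Prices) -> float:
    """Egress the whole dataset; local compute is treated as already paid for."""
    return dataset_gb * prices.egress_per_gb


def cost_compute_in_cloud(core_hours: float, result_gb: float, prices: Prices) -> float:
    """Pay for cloud compute, then egress only the (small) results."""
    return core_hours * prices.compute_per_core_hour + result_gb * prices.egress_per_gb


if __name__ == "__main__":
    # Every number below is made up for illustration.
    prices = Prices(egress_per_gb=0.10, compute_per_core_hour=0.05)
    pull = cost_download_then_compute_locally(dataset_gb=100_000, prices=prices)
    stay = cost_compute_in_cloud(core_hours=20_000, result_gb=10, prices=prices)
    print(f"download everything: {pull:,.2f}")
    print(f"compute in place:    {stay:,.2f}")
```

With real rates plugged in, the same two functions show how large a result set can get before moving the data becomes the cheaper option.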
[Figure: funding matrix. One axis is the funding source: funding for specific analysis; funding allocated by headcount, team, or department; unfunded. The other axis is the cost / scale of analysis: large, moderate, trivial. Cell labels: Moonshots; Lost opportunity; Elastic capacity on shared use systems; Fixed capacity on shared use systems (hard choices, limitations); Ad-hoc / opportunistic use; Ongoing unfunded support burden.]
Distinct funding models
You move towards and become like that which you think about.
The Big Data Healthcare Feeding Frenzy
• “If we sequence X new patients with condition Y every year, the sequencing data alone will take up ALL THE EXABYTES”*
• The data storage and analysis needs of precision / personalized / genomic medicine are not unreasonable by comparison with major, data-driven industries (100s of exabytes over the next decade).
• We can compensate by being thoughtful about what data we store, how we store it, and how we share it.
* If you multiply a number by a sufficiently large number the product is a large number.
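The footnote's point is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a round figure of 100 GB of stored data per genome, which is an illustrative assumption and not a number from the talk, and uses decimal units (1 EB = 10^18 bytes):

```python
BYTES_PER_EXABYTE = 10**18

# Assumed round figure for illustration only; real per-genome footprints
# depend on coverage, file format, and compression.
ASSUMED_BYTES_PER_GENOME = 100 * 10**9


def genomes_per_exabyte(bytes_per_genome: int) -> int:
    """How many genomes of the given size fit in one exabyte."""
    return BYTES_PER_EXABYTE // bytes_per_genome


def storage_exabytes(num_genomes: int, bytes_per_genome: int) -> float:
    """Total storage, in exabytes, for num_genomes genomes."""
    return num_genomes * bytes_per_genome / BYTES_PER_EXABYTE


if __name__ == "__main__":
    per_eb = genomes_per_exabyte(ASSUMED_BYTES_PER_GENOME)
    print(f"genomes per exabyte at the assumed size: {per_eb:,}")
    total = storage_exabytes(1_000_000, ASSUMED_BYTES_PER_GENOME)
    print(f"one million genomes: {total} EB")
```

Under that assumption, exhausting even a single exabyte takes millions of genomes, so the estimate is driven almost entirely by the choice of X and by what is kept per sample.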
… people who had nothing to do with the design and execution of the study …
… use another group’s data for their own ends …
… even use the data to try to disprove what the original investigators had posited…
… some researchers have characterized as “research parasites”
Fear, Uncertainty, and Doubt
What we need
• Incentive structures that reward making data accessible and useful
  – All indicators except the benefit of the patient lead to suboptimal behavior
  – This will require courage.
• National / global scale data repositories, standards, and toolkits
  – Death to walled gardens, monolithic systems, and GUIs.
  – Life to APIs built for a global community (cf. Amazon, 2002)
• Open, fearless conversation about data protection vs. appropriate use
  – Genomic data is inherently personally identifiable and should be treated as such
  – “Appropriate usage” goes well beyond legal conformity
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics and Health is to accelerate progress in human health by helping to establish a common framework of harmonized approaches to enable effective and responsible sharing of genomic and clinical data, and by catalyzing data sharing projects that drive and demonstrate the value of data sharing.”
Regulatory Issues
Ethical Issues
Technical Issues
This stuff is important
We have an opportunity to change lives and health outcomes, and to realize the gains of genomic medicine, not in an indefinite future, but this year.
We also have an opportunity to waste vast amounts of money (very rapidly) and still not really help anybody.
I would like to work together with you to build a better future.
Conclusions
• In order to take full advantage of cloud technologies, we need to change not just what we do, but also how we do it.
• Organizations need to fundamentally rethink how they engage with technology and technologists in order to remain relevant.
• The groups who get good at collaboration in this new world will lead the next decade of biomedical science.
Thank You