Overview of using the Hadoop ecosystem for life sciences/health use cases
1
Hadoop ecosystem for life sciences
Uri Laserson
30 September 2013
2
About the speaker
• Currently “Data Scientist” at Cloudera
• PhD in Biomedical Engineering at MIT/Harvard (2005-2012)
• Focused on next-generation DNA sequencing technology in George Church’s lab
• Co-founded Good Start Genetics (2007-)
  • First application of next-gen sequencing to genetic carrier screening
3
Agenda
• Historical context
• Introduction to Hadoop ecosystem
• Genomics on Hadoop
• Other use cases in life sciences
4
Historical Context
5
1999!
6
Indexing the Web
• Web is huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build search index of keywords to pages
  • Do it in real time!
7
8
Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup
9
10
Database Limitations
• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data
11
12
Google does something different
• Designed their own storage and processing infrastructure
• Google File System (GFS) and MapReduce (MR)
• Goals: KISS
  • Cheap
  • Scalable
  • Reliable
13
Google does something different
• It worked!
  • Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day
14
Google benevolent enough to publish
2003: The Google File System (SOSP) · 2004: MapReduce: Simplified Data Processing on Large Clusters (OSDI)
15
Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement open-source versions of GFS/MR
• 2006: Spun out as Apache Hadoop
  • Named after Doug’s son’s yellow stuffed elephant
16
Industry strategy: Copy Google
Google          | Open-source (Hadoop ecosystem) | Function
GFS             | HDFS                           | Distributed file system
MapReduce       | MapReduce                      | Batch distributed data processing
Bigtable        | HBase                          | Distributed DB/key-value store
Protobuf/Stubby | Thrift or Avro                 | Data serialization/RPC
Pregel          | Giraph                         | Distributed graph processing
Dremel/F1       | Cloudera Impala                | Scalable interactive SQL (MPP)
FlumeJava       | Crunch                         | Abstracted data pipelines on Hadoop
17
Overview of core technology
18
HDFS design assumptions
• Based on the Google File System
• Files are large (GBs to TBs)
• Failures are common
  • Massive scale means failures are very likely
  • Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
19
HDFS properties
• Fault-tolerant
  • Gracefully responds to node/disk/network failures
• Horizontally scalable
  • Low marginal cost
• High-bandwidth
[Figure: HDFS storage distribution. An input file is split into blocks 1-5, and each block is replicated on three of the five nodes (Node A through Node E).]
20
MapReduce computation
21
MapReduce
• Structured as:
  1. Embarrassingly parallel "map" stage
  2. Cluster-wide distributed sort ("shuffle")
  3. Aggregation "reduce" stage
• Data locality: process the data where it is stored
• Fault tolerance: failed tasks are automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema
22
WordCount example
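The original slide showed WordCount as a diagram; as a rough stand-in, here is a minimal Hadoop Streaming version in Python (the file names mapper.py and reducer.py are hypothetical):

    #!/usr/bin/env python
    # mapper.py: emit (word, 1) for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: the shuffle sorts by key, so all counts for a word arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The two scripts would typically be launched with the hadoop-streaming jar, passing them as the -mapper and -reducer options along with -input and -output paths on HDFS.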
23
HPC separates compute from storage
Storage infrastructure:
• Proprietary, distributed file system
• Expensive

Compute cluster:
• High-performance hardware
• Low failure rate
• Expensive

Connected by a big network pipe ($$$)

User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)

HPC is about compute. Hadoop is about data.
24
Hadoop colocates compute and storage
Compute cluster and storage infrastructure on the same machines:
• Commodity hardware
• Data locality
• Reduced networking needs

HPC is about compute. Hadoop is about data.
25
HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault-tolerance, data locality, horizontal scalability
26
Sqoop
Bidirectional data transfer between Hadoop and almost any SQL database with a JDBC driver
27
Flume
A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: multiple clients sending events to a tier of Flume agents.]
28
Cloudera Impala
Modern MPP database built on top of HDFS
Designed for interactive queries on terabyte-scale data sets.
29
Cloudera Search
• Interactive search queries on top of HDFS
• Built on Solr and SolrCloud
• Near-realtime indexing of new documents
30
Benefits of Hadoop ecosystem
• Inexpensive commodity compute/storage
  • Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
  • Co-locate compute and storage
  • Exploit data locality
• Simple horizontal scalability by adding nodes
  • MapReduce jobs effectively guaranteed to scale
• Fault tolerance and replication built in; data is durable
• Large ecosystem of tools
• Flexible data storage: schema-on-read, unstructured data
31
Scaling Genomics
32
33
NCBI Sequence Read Archive (SRA)
Today: 1.14 petabytes
One year ago: 609 terabytes
34
Every ‘ome has a -seq
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq
35
Genomics ETL
GATK best practices
36
Genomics ETL
.fastq → short read alignment → .bam → genotype calling → .vcf
• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires a distributed sort (see the sketch after this list)
• GATK is a reimplementation of MapReduce; could run on Hadoop
• Already-available Hadoop tools:
  • Crossbow: short read alignment/variant calling
  • Hadoop-BAM: distributed bamtools
  • BioPig: manipulating large fasta/q
  • SEAL: Hadoop-enabled BWA
  • Contrail: de novo assembly
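To illustrate why pileup relies on the shuffle's distributed sort, here is a toy Hadoop Streaming sketch in Python (not from the original deck, and not how GATK or Crossbow are implemented): the mapper keys each aligned SAM record by chromosome and position, the framework sorts by that key, and the reducer then sees reads in genome order.

    #!/usr/bin/env python
    # pileup_mapper.py (hypothetical): key each aligned read by (chrom, position)
    # so the MapReduce shuffle performs the genome-wide sort needed for pileup.
    import sys

    for line in sys.stdin:
        if line.startswith("@"):            # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos, seq = fields[2], fields[3], fields[9]
        if rname == "*":                    # unmapped read
            continue
        # Zero-pad the position so the text-based shuffle sort matches numeric order.
        print("%s:%010d\t%s" % (rname, int(pos), seq))

    #!/usr/bin/env python
    # pileup_reducer.py (hypothetical): reads arrive sorted by (chrom, position);
    # count reads starting at each position as a crude coverage track.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, _seq = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += 1
    if current_key is not None:
        print("%s\t%d" % (current_key, count))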
37
Use case 1: Scaling a genome center pipeline
• Currently at 5k genomes (150 TB incl. raw), looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB)
• Current throughput
  • >1,300 samples per month
  • >12 TB raw data per month
• Data ultimately served from a MySQL database
  • 750 GB of processed variant data
  • 25k genomes would require >3.5 TB in MySQL
• Complex 4-tier storage system, including tape, filer, and RDBMS
38
Use case 1: Scaling a genome center pipeline
• Database serves population genetics applications and case/control studies
• Unify all data processing into HDFS
• Replace MySQL with Impala on Hadoop for increased scalability
• Possibly move raw data processing into MapReduce
39
Use case 2: Querying large, integrated data sets
• Biotech client has thousands of genomes
• Want to expose ad hoc querying functionality at large scale
  • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE, UCSC browser)
  • Terabyte-scale annotation sets
• Currently, these capabilities (e.g., data joins) are often manually implemented
40
Use case 2: Querying large, integrated data sets
• Hadoop allows all data to be centrally stored and accessible
• Impala exposes a SQL query interface to data sets in Hadoop
41
Variant-filtering example
• "Give me all SNPs that are:
  • on chromosome 5
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples
  • absent from prostate cancer samples
  • overlap a DNase hypersensitivity site
  • overlap a ChIP-seq site for a particular TF"
• On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds (a sketch of issuing such a query through Impala follows)
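A hedged sketch of issuing this kind of query from Python through Impala with the impyla client. The table and column names (variants, samples, dbsnp, cosmic) are hypothetical, and only part of the filter is shown; the prostate-cancer exclusion and the DNase/ChIP-seq overlaps would require further joins against annotation tables.

    # Hypothetical schema: variants, samples, dbsnp, cosmic tables in Impala.
    from impala.dbapi import connect

    conn = connect(host="impalad.example.com", port=21050)   # hypothetical host
    cur = conn.cursor()
    cur.execute("""
        SELECT v.chrom, v.pos, v.ref, v.alt
        FROM variants v
        JOIN samples s     ON v.sample_id = s.sample_id
        JOIN cosmic c      ON v.chrom = c.chrom AND v.pos = c.pos
        LEFT JOIN dbsnp d  ON v.chrom = d.chrom AND v.pos = d.pos
        WHERE v.chrom = '5'
          AND d.rsid IS NULL                -- absent from dbSNP
          AND s.disease = 'breast cancer'   -- observed in breast cancer samples
        GROUP BY v.chrom, v.pos, v.ref, v.alt
    """)
    for chrom, pos, ref, alt in cur.fetchall():
        print(chrom, pos, ref, alt)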
42
All-vs-all eQTL
• Possible to generate trillions of hypothesis tests
  • 10^7 loci × 10^4 phenotypes × 10s of tissues = 10^12 p-values
  • Tested below on 120 billion associations
• Example queries:
  • "Given 5 genes of interest, find the top 20 most significant eQTLs (cis and/or trans)"
    • Finishes in several seconds
  • "Find all cis-eQTLs across the entire genome"
    • Finishes in a couple of minutes
    • Limited by disk throughput
43
All-vs-all eQTL
• "Find all SNPs that are:
  • in LD with some lead SNP or eQTL of interest
  • aligned with some functional annotation of interest"
• Still in testing, but likely finishes in seconds
Schaub et al, Genome Research, 2012
44
Genomics summary
• ETL (raw data to analysis-ready data)
• Data integration
  • e.g., an interactively queryable UCSC genome browser
• De novo assembly
• NLP on scientific literature
45
Other use cases
• Clinical data
• Manufacturing
46
Use case 3: Clinical document queries for EHR company
• EHR company wants to expose query functionality to clinicians
  • >16 million clinical documents with free text, processed through an NLP pipeline
  • >500 million lab results
• Perform subject expansion on search queries via ontologies (see the sketch below)
  • e.g., "myocardial infarction" will match "heart disease"
• Search functionality implemented with Lucene (serving) on top of HBase (processing/storage/indexing)
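A minimal sketch of the subject-expansion idea, assuming a hypothetical hard-coded concept map; a real system would walk a medical ontology (e.g., SNOMED CT or UMLS) rather than a dict, and feed the expanded terms into the Lucene query.

    # Hypothetical ontology map from a concept to related terms.
    ONTOLOGY = {
        "myocardial infarction": ["heart disease", "heart attack", "MI"],
        "diabetes": ["diabetes mellitus", "type 2 diabetes", "hyperglycemia"],
    }

    def expand_query(query):
        """Return the original query term plus any ontology-related terms."""
        return [query] + ONTOLOGY.get(query.lower(), [])

    # The expanded terms would be OR-ed together into a single Lucene/Solr query.
    print(expand_query("myocardial infarction"))
    # ['myocardial infarction', 'heart disease', 'heart attack', 'MI']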
47
Use case 3: Clinical document queries for EHR company
• Interested in recommendation engine-enabled queries, like:
• Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record
• Clinician wants to know what other conditions might be correlated with a finding of interest
48
Use case 3: Clinical document queries for EHR company
“Find other patients similar to mine”
• The Stanford system is limited to search
• Recommendation engines allow a "find similar" button
49
Use case 4: Insurance company
• Data from 30 different EHRs across multiple business units
• High variance in ICD9 coding between locales
• Use NLP and machine learning to improve ICD9 coding and reduce variance in diagnosis
50
Use case 5: Pharma company variance in yields
• Pharma company performs large batch fermentations of their product
• Find high levels of variance in their yield
• Fermentations are automated and highly instrumented
  • e.g., dissolved oxygen, nutrients, COAs, temperature, etc.
• Perform time series analysis on fermentation runs to predict yields and determine which variables control variance (a sketch follows)
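A hedged sketch of such an analysis, assuming a hypothetical per-run summary table (fermentation_runs.csv) in which each run's sensor time series has already been reduced to a few aggregate features plus the final yield; a random forest is one plausible model for ranking which process variables drive yield variance.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical per-run summary table: one row per fermentation run.
    runs = pd.read_csv("fermentation_runs.csv")
    features = ["mean_dissolved_o2", "min_ph", "max_temperature", "nutrient_feed_rate"]
    X, y = runs[features], runs["yield_grams"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    print("held-out R^2:", model.score(X_test, y_test))
    # Feature importances hint at which process variables drive yield variance.
    for name, importance in zip(features, model.feature_importances_):
        print(name, round(importance, 3))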
51
Use case 6: AgTech company integrating data sources
• Multiple reference genome sequences
• Genotyping on thousands of samples
• Weather data
• Soil data
• Microbiome data
• Yield data
• Geo data
• All integrated in HBase (see the sketch below)
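A minimal sketch of how such heterogeneous sources might land in HBase, using the happybase Python client; the table name, column families, and row-key scheme (field ID plus date) are hypothetical illustrations of one possible design.

    import happybase

    # Hypothetical HBase table 'field_data' with one column family per source.
    connection = happybase.Connection("hbase-thrift.example.com")   # hypothetical host
    table = connection.table("field_data")

    row_key = b"field0042:2013-09-30"
    table.put(row_key, {
        b"weather:rainfall_mm": b"12.4",
        b"soil:nitrogen_ppm":   b"31",
        b"yield:bushels_acre":  b"168",
    })

    # All measurements for field 0042, across all dates, come back in one prefix scan.
    for key, data in table.scan(row_prefix=b"field0042:"):
        print(key, data)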
52
Use case 6: AgTech company integrating data sources
• Can increase crop yields ~15% by “printing” seeds onto a field
• Support search queries by name, ontology concepts, protein families, creation dates, assembly/chromosome positions, SNPs
• Import any annotation data in CSV/GFF
• Integration with cloning tools
• Supports a web front-end for easy access
53
Conclusions
Highly heterogeneous data
54
• COMMUNICATIONS: Location-based advertising
• HEALTH CARE: Patient sensors, monitoring, EHRs; quality of care
• LAW ENFORCEMENT & DEFENSE: Threat analysis, social media monitoring, photo analysis
• EDUCATION & RESEARCH: Experiment sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; new products
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; website optimization
• UTILITIES: Smart meter analysis for network capacity
• CONSUMER PACKAGED GOODS: Sentiment analysis of what's hot; customer service
• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; customer sentiment
• LIFE SCIENCES: Clinical trials; genomics
• RETAIL: Consumer sentiment; optimized marketing
• AUTOMOTIVE: Auto sensors reporting location, problems
• HIGH TECH / INDUSTRIAL MFG.: Manufacturing quality; warranty analysis
• OIL & GAS: Drilling exploration sensor analysis
55
Flexibility
• Store any data
• Run any analysis and processing
• Keeps pace with the rate of change of incoming data

Scalability
• Proven growth to PBs / 1,000s of nodes
• No need to rewrite queries; automatically scales
• Keeps pace with the rate of growth of incoming data

Efficiency
• Cost per TB at a fraction of other options
• Keep all of your data alive in an active archive
• Powering the data beats algorithm movement
The Cloudera Enterprise Platform for Big Data
56
Cloudera Hadoop Stack
57