Overview of using the Hadoop ecosystem for life sciences/health use cases
1
Hadoop ecosystem for life sciences
Uri Laserson
30 September 2013
2
About the speaker
• Currently “Data Scientist” at Cloudera
• PhD in Biomedical Engineering at MIT/Harvard (2005-2012)
• Focused on next-generation DNA sequencing technology in George Church’s lab
• Co-founded Good Start Genetics (2007-)
  • First application of next-gen sequencing to genetic carrier screening
3
Agenda
• Historical context
• Introduction to Hadoop ecosystem
• Genomics on Hadoop
• Other use cases in life sciences
4
Historical Context
5
1999!
6
Indexing the Web
• Web is huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build search index of keywords to pages
  • Do it in real time!
7
8
Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup
9
10
Database Limitations
• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data
11
12
Google does something different
• Designed their own storage and processing infrastructure
• Google File System (GFS) and MapReduce (MR)
• Goals: KISS
  • Cheap
  • Scalable
  • Reliable
13
Google does something different
• It worked!
  • Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day
14
Google benevolent enough to publish
2003: The Google File System (SOSP) · 2004: MapReduce: Simplified Data Processing on Large Clusters (OSDI)
15
Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement open-source versions of GFS/MR
• 2006: Spun out as Apache Hadoop
  • Named after Doug’s son’s yellow stuffed elephant
16
Industry strategy: Copy Google
Google          | Open-source (Hadoop ecosystem) | Function
GFS             | HDFS                           | Distributed file system
MapReduce       | MapReduce                      | Batch distributed data processing
Bigtable        | HBase                          | Distributed DB/key-value store
Protobuf/Stubby | Thrift or Avro                 | Data serialization/RPC
Pregel          | Giraph                         | Distributed graph processing
Dremel/F1       | Cloudera Impala                | Scalable interactive SQL (MPP)
FlumeJava       | Crunch                         | Abstracted data pipelines on Hadoop
17
Overview of core technology
18
HDFS design assumptions
• Based on the Google File System
• Files are large (GBs to TBs)
• Failures are common
  • Massive scale means failures are very likely
  • Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
19
HDFS properties
• Fault-tolerant
  • Gracefully responds to node/disk/network failures
• Horizontally scalable
  • Low marginal cost
• High-bandwidth
[Figure: HDFS storage distribution. An input file is split into blocks 1-5, and each block is replicated on three of the five nodes (Node A through Node E).]
20
MapReduce computation
21
MapReduce
• Structured as:
  1. Embarrassingly parallel "map" stage
  2. Cluster-wide distributed sort ("shuffle")
  3. Aggregation "reduce" stage
• Data locality: process the data where it is stored
• Fault tolerance: failed tasks are automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema
22
WordCount example
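The original slide showed WordCount as a diagram; as a rough stand-in, here is a minimal Hadoop Streaming version in Python (the file names mapper.py and reducer.py are hypothetical):

    #!/usr/bin/env python
    # mapper.py: emit (word, 1) for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: the shuffle sorts by key, so all counts for a word arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The two scripts would typically be launched with the hadoop-streaming jar, passing them as the -mapper and -reducer options along with -input and -output paths on HDFS.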
23
HPC separates compute from storage
Storage infrastructure:
• Proprietary, distributed file system
• Expensive

Compute cluster:
• High-performance hardware
• Low failure rate
• Expensive

Connected by a big network pipe ($$$)

User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)

HPC is about compute. Hadoop is about data.
24
Hadoop colocates compute and storage
Compute cluster and storage infrastructure on the same machines:
• Commodity hardware
• Data locality
• Reduced networking needs

HPC is about compute. Hadoop is about data.
25
HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault-tolerance, data locality, horizontal scalability
26
Sqoop
Bidirectional data transfer between Hadoop and almost any SQL database with a JDBC driver
27
Flume
A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: multiple clients sending events to a tier of Flume agents.]
28
Cloudera Impala
Modern MPP database built on top of HDFS
Designed for interactive queries on terabyte-scale data sets.
29
Cloudera Search
• Interactive search queries on top of HDFS
• Built on Solr and SolrCloud
• Near-realtime indexing of new documents
30
Benefits of Hadoop ecosystem
• Inexpensive commodity compute/storage
  • Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
  • Co-locate compute and storage
  • Exploit data locality
• Simple horizontal scalability by adding nodes
  • MapReduce jobs effectively guaranteed to scale
• Fault tolerance and replication built in; data is durable
• Large ecosystem of tools
• Flexible data storage: schema-on-read, unstructured data
31
Scaling Genomics
32
33
NCBI Sequence Read Archive (SRA)
Today: 1.14 petabytes
One year ago: 609 terabytes
34
Every ‘ome has a -seq
Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq
35
Genomics ETL
GATK best practices
36
Genomics ETL
.fastq → short read alignment → .bam → genotype calling → .vcf
• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires a distributed sort (see the sketch after this list)
• GATK is a reimplementation of MapReduce; could run on Hadoop
• Already-available Hadoop tools:
  • Crossbow: short read alignment/variant calling
  • Hadoop-BAM: distributed bamtools
  • BioPig: manipulating large fasta/q
  • SEAL: Hadoop-enabled BWA
  • Contrail: de novo assembly
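To illustrate why pileup relies on the shuffle's distributed sort, here is a toy Hadoop Streaming sketch in Python (not from the original deck, and not how GATK or Crossbow are implemented): the mapper keys each aligned SAM record by chromosome and position, the framework sorts by that key, and the reducer then sees reads in genome order.

    #!/usr/bin/env python
    # pileup_mapper.py (hypothetical): key each aligned read by (chrom, position)
    # so the MapReduce shuffle performs the genome-wide sort needed for pileup.
    import sys

    for line in sys.stdin:
        if line.startswith("@"):            # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos, seq = fields[2], fields[3], fields[9]
        if rname == "*":                    # unmapped read
            continue
        # Zero-pad the position so the text-based shuffle sort matches numeric order.
        print("%s:%010d\t%s" % (rname, int(pos), seq))

    #!/usr/bin/env python
    # pileup_reducer.py (hypothetical): reads arrive sorted by (chrom, position);
    # count reads starting at each position as a crude coverage track.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, _seq = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += 1
    if current_key is not None:
        print("%s\t%d" % (current_key, count))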
37
Use case 1: Scaling a genome center pipeline
• Currently at 5k genomes (150 TB incl. raw), looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB)
• Current throughput
  • >1,300 samples per month
  • >12 TB raw data per month
• Data ultimately served from a MySQL database
  • 750 GB of processed variant data
  • 25k genomes would require >3.5 TB in MySQL
• Complex 4-tier storage system, including tape, filer, and RDBMS
38
Use case 1: Scaling a genome center pipeline
• Database serves population genetics applications and case/control studies
• Unify all data processing into HDFS
• Replace MySQL with Impala on Hadoop for increased scalability
• Possibly move raw data processing into MapReduce
39
Use case 2: Querying large, integrated data sets
• Biotech client has thousands of genomes
• Want to expose ad hoc querying functionality at large scale
  • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE, UCSC browser)
  • Terabyte-scale annotation sets
• Currently, these capabilities (e.g., data joins) are often manually implemented
40
Use case 2: Querying large, integrated data sets
• Hadoop allows all data to be centrally stored and accessible
• Impala exposes a SQL query interface to data sets in Hadoop
41
Variant-filtering example
• "Give me all SNPs that are:
  • on chromosome 5
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples
  • absent from prostate cancer samples
  • overlap a DNase hypersensitivity site
  • overlap a ChIP-seq site for a particular TF"
• On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds (a sketch of issuing such a query through Impala follows)
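A hedged sketch of issuing this kind of query from Python through Impala with the impyla client. The table and column names (variants, samples, dbsnp, cosmic) are hypothetical, and only part of the filter is shown; the prostate-cancer exclusion and the DNase/ChIP-seq overlaps would require further joins against annotation tables.

    # Hypothetical schema: variants, samples, dbsnp, cosmic tables in Impala.
    from impala.dbapi import connect

    conn = connect(host="impalad.example.com", port=21050)   # hypothetical host
    cur = conn.cursor()
    cur.execute("""
        SELECT v.chrom, v.pos, v.ref, v.alt
        FROM variants v
        JOIN samples s     ON v.sample_id = s.sample_id
        JOIN cosmic c      ON v.chrom = c.chrom AND v.pos = c.pos
        LEFT JOIN dbsnp d  ON v.chrom = d.chrom AND v.pos = d.pos
        WHERE v.chrom = '5'
          AND d.rsid IS NULL                -- absent from dbSNP
          AND s.disease = 'breast cancer'   -- observed in breast cancer samples
        GROUP BY v.chrom, v.pos, v.ref, v.alt
    """)
    for chrom, pos, ref, alt in cur.fetchall():
        print(chrom, pos, ref, alt)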
42
All-vs-all eQTL
• Possible to generate trillions of hypothesis tests
  • 10^7 loci × 10^4 phenotypes × 10s of tissues = 10^12 p-values
  • Tested below on 120 billion associations
• Example queries:
  • "Given 5 genes of interest, find the top 20 most significant eQTLs (cis and/or trans)"
    • Finishes in several seconds
  • "Find all cis-eQTLs across the entire genome"
    • Finishes in a couple of minutes
    • Limited by disk throughput
43
All-vs-all eQTL
• "Find all SNPs that are:
  • in LD with some lead SNP or eQTL of interest
  • aligned with some functional annotation of interest"
• Still in testing, but likely finishes in seconds
Schaub et al, Genome Research, 2012
44
Genomics summary
• ETL (raw data to analysis-ready data)
• Data integration
  • e.g., an interactively queryable UCSC genome browser
• De novo assembly
• NLP on scientific literature
45
Other use cases
• Clinical data
• Manufacturing
46
Use case 3: Clinical document queries for EHR company
• EHR company wants to expose query functionality to clinicians
  • >16 million clinical documents with free text, processed through an NLP pipeline
  • >500 million lab results
• Perform subject expansion on search queries via ontologies (see the sketch below)
  • e.g., "myocardial infarction" will match "heart disease"
• Search functionality implemented with Lucene (serving) on top of HBase (processing/storage/indexing)
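A minimal sketch of the subject-expansion idea, assuming a hypothetical hard-coded concept map; a real system would walk a medical ontology (e.g., SNOMED CT or UMLS) rather than a dict, and feed the expanded terms into the Lucene query.

    # Hypothetical ontology map from a concept to related terms.
    ONTOLOGY = {
        "myocardial infarction": ["heart disease", "heart attack", "MI"],
        "diabetes": ["diabetes mellitus", "type 2 diabetes", "hyperglycemia"],
    }

    def expand_query(query):
        """Return the original query term plus any ontology-related terms."""
        return [query] + ONTOLOGY.get(query.lower(), [])

    # The expanded terms would be OR-ed together into a single Lucene/Solr query.
    print(expand_query("myocardial infarction"))
    # ['myocardial infarction', 'heart disease', 'heart attack', 'MI']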
47
Use case 3: Clinical document queries for EHR company
• Interested in recommendation engine-enabled queries, like:
• Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record
• Clinician wants to know what other conditions might be correlated with a finding of interest
48
Use case 3: Clinical document queries for EHR company
“Find other patients similar to mine”
• The Stanford system is limited to search
• Recommendation engines allow a "find similar" button
49
Use case 4: Insurance company
• Data from 30 different EHRs across multiple business units
• High variance in ICD9 coding between locales
• Use NLP and machine learning to improve ICD9 coding and reduce variance in diagnosis
50
Use case 5: Pharma company variance in yields
• Pharma company performs large batch fermentations of their product
• Find high levels of variance in their yield
• Fermentations are automated and highly instrumented
  • e.g., dissolved oxygen, nutrients, COAs, temperature, etc.
• Perform time series analysis on fermentation runs to predict yields and determine which variables control variance (a sketch follows)
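A hedged sketch of such an analysis, assuming a hypothetical per-run summary table (fermentation_runs.csv) in which each run's sensor time series has already been reduced to a few aggregate features plus the final yield; a random forest is one plausible model for ranking which process variables drive yield variance.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical per-run summary table: one row per fermentation run.
    runs = pd.read_csv("fermentation_runs.csv")
    features = ["mean_dissolved_o2", "min_ph", "max_temperature", "nutrient_feed_rate"]
    X, y = runs[features], runs["yield_grams"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    print("held-out R^2:", model.score(X_test, y_test))
    # Feature importances hint at which process variables drive yield variance.
    for name, importance in zip(features, model.feature_importances_):
        print(name, round(importance, 3))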
51
Use case 6: AgTech company integrating data sources
• Multiple reference genome sequences
• Genotyping on thousands of samples
• Weather data
• Soil data
• Microbiome data
• Yield data
• Geo data
• All integrated in HBase (see the sketch below)
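A minimal sketch of how such heterogeneous sources might land in HBase, using the happybase Python client; the table name, column families, and row-key scheme (field ID plus date) are hypothetical illustrations of one possible design.

    import happybase

    # Hypothetical HBase table 'field_data' with one column family per source.
    connection = happybase.Connection("hbase-thrift.example.com")   # hypothetical host
    table = connection.table("field_data")

    row_key = b"field0042:2013-09-30"
    table.put(row_key, {
        b"weather:rainfall_mm": b"12.4",
        b"soil:nitrogen_ppm":   b"31",
        b"yield:bushels_acre":  b"168",
    })

    # All measurements for field 0042, across all dates, come back in one prefix scan.
    for key, data in table.scan(row_prefix=b"field0042:"):
        print(key, data)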
52
Use case 6: AgTech company integrating data sources
• Can increase crop yields ~15% by “printing” seeds onto a field
• Support search queries by name, ontology concepts, protein families, creation dates, assembly/chromosome positions, SNPs
• Import any annotation data in CSV/GFF
• Integration with cloning tools
• Supports a web front-end for easy access
53
Conclusions
Highly heterogeneous data
54
• COMMUNICATIONS: Location-based advertising
• HEALTH CARE: Patient sensors, monitoring, EHRs; quality of care
• LAW ENFORCEMENT & DEFENSE: Threat analysis, social media monitoring, photo analysis
• EDUCATION & RESEARCH: Experiment sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; new products
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; website optimization
• UTILITIES: Smart meter analysis for network capacity
• CONSUMER PACKAGED GOODS: Sentiment analysis of what's hot; customer service
• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; customer sentiment
• LIFE SCIENCES: Clinical trials; genomics
• RETAIL: Consumer sentiment; optimized marketing
• AUTOMOTIVE: Auto sensors reporting location, problems
• HIGH TECH / INDUSTRIAL MFG.: Manufacturing quality; warranty analysis
• OIL & GAS: Drilling exploration sensor analysis
55
Flexibility
• Store any data
• Run any analysis and processing
• Keeps pace with the rate of change of incoming data

Scalability
• Proven growth to PBs / 1,000s of nodes
• No need to rewrite queries; automatically scales
• Keeps pace with the rate of growth of incoming data

Efficiency
• Cost per TB at a fraction of other options
• Keep all of your data alive in an active archive
• Powering the data beats algorithm movement
The Cloudera Enterprise Platform for Big Data
56
Cloudera Hadoop Stack
57