Hadoop ecosystem for life sciences
Uri Laserson
30 September 2013

Hadoop ecosystem for health/life sciences


Overview of using the Hadoop ecosystem for life sciences/health use cases



About the speaker

• Currently “Data Scientist” at Cloudera

• PhD in Biomedical Engineering at MIT/Harvard (2005-2012)

• Focused on next-generation DNA sequencing technology in George Church’s lab

• Co-founded Good Start Genetics (2007-)
  • First application of next-gen sequencing to genetic carrier screening

[email protected]


Agenda

• Historical context
• Introduction to Hadoop ecosystem
• Genomics on Hadoop
• Other use cases in life sciences


Historical Context


1999!


Indexing the Web

• Web is Huge
  • Hundreds of millions of pages in 1999
• How do you index it?
  • Crawl all the pages
  • Rank pages based on relevance metrics
  • Build search index of keywords to pages
  • Do it in real time!


Databases in 1999

1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup


Database Limitations

• Didn’t scale horizontally
  • High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
  • Complex analysis (PageRank)
  • Unstructured data


Google does something different

• Designed their own storage and processing infrastructure
  • Google File System (GFS) and MapReduce (MR)
• Goals: KISS
  • Cheap
  • Scalable
  • Reliable


Google does something different

• It worked!
  • Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day


Google benevolent enough to publish

[Figure: the Google File System paper (2003) and the MapReduce paper (2004)]


Birth of Hadoop at Yahoo!

• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR.

• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant


Industry strategy: Copy Google

Google            Open-source       Function
GFS               HDFS              Distributed file system
MapReduce         MapReduce         Batch distributed data processing
Bigtable          HBase             Distributed DB/key-value store
Protobuf/Stubby   Thrift or Avro    Data serialization/RPC
Pregel            Giraph            Distributed graph processing
Dremel/F1         Cloudera Impala   Scalable interactive SQL (MPP)
FlumeJava         Crunch            Abstracted data pipelines on Hadoop

Hadoop


Overview of core technology


HDFS design assumptions

• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
  • Massive scale means failures very likely
  • Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only


HDFS properties

• Fault-tolerant
  • Gracefully responds to node/disk/network failures
• Horizontally scalable
  • Low marginal cost
• High-bandwidth

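[Figure: HDFS storage distribution. An input file is split into blocks 1 to 5, and each block is replicated on three of the five nodes (Node A to Node E); the per-node block sets shown are {2, 4, 5}, {1, 2, 5}, {1, 3, 4}, {2, 3, 5}, {1, 3, 4}.]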
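The storage layout in the figure can be mimicked with a toy model. The sketch below is not HDFS's real placement policy (which is rack-aware); it assumes a simple round-robin rule, a replication factor of 3, and hypothetical node names A to E, and it only illustrates why every block stays readable when a node or two fails.

```python
from itertools import cycle

def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {node: set() for node in nodes}
    ring = cycle(nodes)
    for block in range(1, num_blocks + 1):
        # Consecutive nodes on the ring are distinct because replication <= len(nodes).
        for _ in range(replication):
            placement[next(ring)].add(block)
    return placement

def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a surviving node."""
    alive = [blocks for node, blocks in placement.items() if node not in failed]
    return set().union(*alive)

placement = place_blocks(5, ["A", "B", "C", "D", "E"])
for node, blocks in placement.items():
    print(node, sorted(blocks))

# With three replicas per block, any two simultaneous node failures are survivable.
print(sorted(readable_blocks(placement, failed={"B", "D"})))
```

Losing three nodes at once can still lose data in this model, which is why real clusters re-replicate blocks as soon as a failure is detected.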


MapReduce computation


MapReduce

• Structured as
  1. Embarrassingly parallel “map stage”
  2. Cluster-wide distributed sort (“shuffle”)
  3. Aggregation “reduce stage”
• Data-locality: process the data where it is stored
• Fault-tolerance: failed tasks automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema


WordCount example
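The original slide showed WordCount as a diagram. As a rough stand-in, here is a single-process Python sketch of the three stages (map, shuffle, reduce). It is not Hadoop code: the function names are illustrative, and the in-memory sort only imitates the cluster-wide shuffle.

```python
from collections import defaultdict

def map_stage(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_stage(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    # Each line could be mapped on a different node; here it is a plain loop.
    mapped = [pair for line in lines for pair in map_stage(line)]
    return dict(reduce_stage(word, counts) for word, counts in shuffle(mapped))

lines = ["the quick brown fox", "the lazy dog", "The fox"]
print(word_count(lines))
```

Because `map_stage` looks at one line at a time and `reduce_stage` looks at one key at a time, both can be spread across machines without coordination; only the shuffle needs to move data between nodes.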


HPC separates compute from storage

Storage infrastructure
• Proprietary, distributed file system
• Expensive

Compute cluster
• High-performance hardware
• Low failure rate
• Expensive

Big network pipe ($$$) between storage and compute

User typically works by manually submitting jobs to scheduler, e.g., LSF, Grid Engine, etc.

HPC is about compute. Hadoop is about data.


Hadoop colocates compute and storage

Compute cluster / Storage infrastructure
• Commodity hardware
• Data-locality
• Reduced networking needs

User typically works by manually submitting jobs to scheduler, e.g., LSF, Grid Engine, etc.

HPC is about compute. Hadoop is about data.


HPC is lower-level than Hadoop

• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
  • Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
  • Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault-tolerance, data locality, horizontal scalability


Sqoop

Bidirectional data transfer between Hadoop and almost any SQL database with a JDBC driver


Flume

A streaming data collection and aggregation system for massive volumes of data from sources such as RPC services, Log4J, Syslog, etc.

[Figure: multiple clients feeding into Flume agents]


Cloudera Impala

Modern MPP database built on top of HDFS

Designed for interactive queries on terabyte-scale data sets.


Cloudera Search

• Interactive search queries on top of HDFS

• Built on Solr and SolrCloud
• Near-realtime indexing of new documents


Benefits of Hadoop ecosystem

• Inexpensive commodity compute/storage
  • Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
  • Co-locate compute and storage
  • Exploit data locality
• Simple horizontal scalability by adding nodes
  • MapReduce jobs effectively guaranteed to scale
• Fault-tolerance/replication built-in. Data is durable
• Large ecosystem of tools
• Flexible data storage. Schema-on-read. Unstructured data.


Scaling Genomics


NCBI Sequence Read Archive (SRA)

Today: 1.14 petabytes

One year ago: 609 terabytes


Every ‘ome has a -seq

Genome          DNA-seq
Transcriptome   RNA-seq, FRT-seq, NET-seq
Methylome       Bisulfite-seq
Immunome        Immune-seq
Proteome        PhIP-seq, Bind-n-seq


Genomics ETL

GATK best practices


Genomics ETL

.fastq → short read alignment → .bam → genotype calling → .vcf

• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires distributed sort
• GATK is a reimplementation of MapReduce; could run on Hadoop
• Already available Hadoop tools
  • Crossbow: short read alignment/variant calling
  • Hadoop-BAM: distributed bamtools
  • BioPig: manipulating large fasta/q
  • SEAL: Hadoop-enabled BWA
  • Contrail: de-novo assembly
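To make the point about pileup needing a sort concrete, here is a toy Python sketch of pileup expressed as map, sort, and reduce. It is not taken from any of the tools listed above: the reads are hypothetical, already aligned, and gapless (no CIGAR strings or BAM parsing), and the in-memory sort stands in for the distributed shuffle.

```python
from collections import Counter
from itertools import groupby

# Hypothetical aligned reads: (chromosome, 1-based start, bases); gapless alignment assumed.
reads = [
    ("chr1", 100, "ACGT"),
    ("chr1", 102, "GTTA"),
    ("chr1", 101, "CGA"),
]

def map_read(read):
    """Map: emit ((chromosome, position), base) for each base of one read."""
    chrom, start, bases = read
    return [((chrom, start + i), base) for i, base in enumerate(bases)]

def pileup(reads):
    """Sort emitted bases by genomic position, then count bases per position."""
    emitted = [kv for read in reads for kv in map_read(read)]
    emitted.sort(key=lambda kv: kv[0])  # stand-in for the cluster-wide sort
    return {
        key: Counter(base for _, base in group)
        for key, group in groupby(emitted, key=lambda kv: kv[0])
    }

for (chrom, pos), counts in pileup(reads).items():
    print(chrom, pos, dict(counts))
```

Each read is mapped independently, but bases covering the same position come from different reads, so they only meet after the sort; a variant caller would then inspect each per-position count in the reduce step.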


Use case 1: Scaling a genome center pipeline

• Currently at 5k genomes (150 TB incl. raw), looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB)

• Current throughput
  • >1300 samples per month
  • >12 TB raw data per month
• Data ultimately served from MySQL database
  • 750 GB of processed variant data
  • 25k genomes requires >3.5 TB in MySQL
• Complex 4-tier storage system, including tape, filer, and RDBMS


Use case 1: Scaling a genome center pipeline

• Database serves population genetics applications and case/control studies

• Unify all data processing into HDFS
• Replace MySQL with Impala on Hadoop for increased scalability
• Possibly move raw data processing into MapReduce


Use case 2: Querying large, integrated data sets

• Biotech client has thousands of genomes
• Want to expose ad hoc querying functionality on large scale
  • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE, UCSC browser)
  • Terabyte-scale annotation sets
• Currently, these capabilities (e.g., data joins) are often manually implemented


Use case 2: Querying large, integrated data sets

• Hadoop allows all data to be centrally stored and accessible

• Impala exposes a SQL query interface to data sets in Hadoop


Variant-filtering example

• “Give me all SNPs that are:
  • on chromosome 5
  • absent from dbSNP
  • present in COSMIC
  • observed in breast cancer samples
  • absent from prostate cancer samples
  • overlap a DNase hypersensitivity site
  • overlap a ChIP-seq site for a particular TF”
• On full 1000 Genomes data set (~37 billion variants), query finishes in a couple of seconds
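The slides do not show the schema or the query itself, so the following is only a sketch of what such a filter could look like in SQL. It uses Python's built-in `sqlite3` as a stand-in for Impala, and every table and column name (`variants`, `observations`, `dnase_sites`, `in_dbsnp`, and so on) is hypothetical. The ChIP-seq condition is omitted; it would be one more interval-overlap clause of the same shape as the DNase one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (id INTEGER PRIMARY KEY, chrom TEXT, pos INTEGER,
                       in_dbsnp INTEGER, in_cosmic INTEGER);
CREATE TABLE observations (variant_id INTEGER, cancer_type TEXT);
CREATE TABLE dnase_sites (chrom TEXT, start_pos INTEGER, end_pos INTEGER);

INSERT INTO variants VALUES
  (1, '5', 1000, 0, 1),   -- passes every filter
  (2, '5', 2000, 1, 1),   -- already in dbSNP
  (3, '5', 1050, 0, 1),   -- also seen in prostate samples
  (4, '7', 1000, 0, 1),   -- wrong chromosome
  (5, '5', 9000, 0, 1);   -- outside any DNase site
INSERT INTO observations VALUES
  (1, 'breast'), (2, 'breast'), (3, 'breast'), (3, 'prostate'),
  (4, 'breast'), (5, 'breast');
INSERT INTO dnase_sites VALUES ('5', 900, 1100), ('5', 1900, 2100), ('7', 900, 1100);
""")

QUERY = """
SELECT v.id
FROM variants AS v
WHERE v.chrom = '5'
  AND v.in_dbsnp = 0
  AND v.in_cosmic = 1
  AND EXISTS (SELECT 1 FROM observations o
              WHERE o.variant_id = v.id AND o.cancer_type = 'breast')
  AND NOT EXISTS (SELECT 1 FROM observations o
                  WHERE o.variant_id = v.id AND o.cancer_type = 'prostate')
  AND EXISTS (SELECT 1 FROM dnase_sites d
              WHERE d.chrom = v.chrom
                AND v.pos BETWEEN d.start_pos AND d.end_pos)
ORDER BY v.id
"""

hits = [row[0] for row in conn.execute(QUERY)]
print(hits)  # only variant 1 satisfies all conditions
```

The value of a SQL interface here is that each bullet in the question maps onto one predicate or subquery, so analysts can add or drop conditions without writing a custom join by hand.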


All-vs-all eQTL

• Possible to generate trillions of hypothesis tests
  • 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
  • Tested below on 120 billion associations
• Example queries:
  • “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)”
    • Finishes in several seconds
  • “Find all cis-eQTLs across the entire genome”
    • Finishes in a couple of minutes
    • Limited by disk throughput


All-vs-all eQTL

• “Find all SNPs that are:
  • in LD with some lead SNP or eQTL of interest
  • align with some functional annotation of interest”
• Still in testing, but likely finishes in seconds

Schaub et al, Genome Research, 2012


Genomics summary

• ETL (raw data to analysis-ready data)
• Data integration
  • e.g., interactively queryable UCSC genome browser
• De novo assembly
• NLP on scientific literature


Other use cases

Clinical data
Manufacturing


Use case 3: Clinical document queries for EHR company

• EHR wants to expose query functionality to clinicians
  • >16 million clinical documents with free text; processed through NLP pipeline
  • >500 million lab results
• Perform subject expansion on search queries via ontologies
  • e.g., “myocardial infarction” will match “heart disease”
• Search functionality implemented with Lucene (serving) on top of HBase (processing/storage/indexing)
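Subject expansion via an ontology can be illustrated with a small Python sketch. This is not the system described on the slide: the is-a links are a made-up fragment rather than a real ontology, the expansion only walks to broader concepts, and plain substring matching stands in for a Lucene index.

```python
# Hypothetical is-a links (child -> parent); a real system would load a clinical ontology.
PARENT = {
    "myocardial infarction": "ischemic heart disease",
    "ischemic heart disease": "heart disease",
    "heart disease": "cardiovascular disease",
}

def expand_query(term):
    """Return the term plus all broader concepts reachable via is-a links."""
    term = term.lower()
    expanded = [term]
    while term in PARENT:
        term = PARENT[term]
        expanded.append(term)
    return expanded

def search(term, documents):
    """Return ids of documents whose text mentions any expanded term."""
    terms = expand_query(term)
    return [doc_id for doc_id, text in documents.items()
            if any(t in text.lower() for t in terms)]

documents = {
    "note-1": "Patient has a history of heart disease.",
    "note-2": "Admitted with acute myocardial infarction.",
    "note-3": "Routine follow-up for asthma.",
}
print(search("myocardial infarction", documents))
```

Here a query for "myocardial infarction" also returns the note that only mentions "heart disease", which is the behavior the slide describes; expanding toward narrower concepts instead would be an equally plausible design.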


Use case 3: Clinical document queries for EHR company

• Interested in recommendation engine-enabled queries, like:

• Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record

• Clinician wants to know what other conditions might be correlated with a finding of interest


Use case 3: Clinical document queries for EHR company

“Find other patients similar to mine”

• The Stanford system is limited to search

• Recommendation engines allow a button “find similar”


Use case 4: Insurance company

• Data from 30 different EHRs across multiple business units

• High variance in ICD9 coding between locales.
• Use NLP and machine learning to improve ICD9 coding to reduce variance in diagnosis


Use case 5: Pharma company variance in yields

• Pharma company performs large batch fermentations of their product

• Find high levels of variance in their yield
• Fermentations are automated and highly instrumented
  • e.g., dissolved oxygen, nutrients, COAs, temperature, etc.

• Perform time series analysis on fermentation runs to predict yields and determine which variables control variance.


Use case 6: AgTech company integrating data sources

• Multiple reference genome sequences
• Genotyping on thousands of samples
• Weather data
• Soil data
• Microbiome data
• Yield data
• Geo data
• All integrated in HBase


Use case 6: AgTech company integrating data sources

• Can increase crop yields ~15% by “printing” seeds onto a field

• Support search queries by name, ontology concepts, protein families, creation dates, assembly/chromosome positions, SNPs
• Import any annotation data in CSV/GFF
• Integration with cloning tools
• Supports a web front-end for easy access


Conclusions


Highly heterogeneous data


• COMMUNICATIONS: Location-based advertising
• HEALTH CARE: Patient sensors, monitoring, EHRs; Quality of care
• LAW ENFORCEMENT & DEFENSE: Threat analysis, Social media monitoring, Photo analysis
• EDUCATION & RESEARCH: Experiment sensor analysis
• FINANCIAL SERVICES: Risk & portfolio analysis; New products
• ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; Website optimization
• UTILITIES: Smart Meter analysis for network capacity
• CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot, customer service
• MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
• TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; Customer sentiment
• LIFE SCIENCES: Clinical trials; Genomics
• RETAIL: Consumer sentiment; Optimized marketing
• AUTOMOTIVE: Auto sensors reporting location, problems
• HIGH TECH / INDUSTRIAL MFG.: Mfg. quality; Warranty analysis
• OIL & GAS: Drilling exploration sensor analysis

©2013 Cloudera, Inc. All Rights Reserved.


Flexibility
• Store any data
• Run any analysis and processing
• Keeps pace with the rate of change of incoming data

Scalability
• Proven growth to PBs/1,000s of nodes
• No need to rewrite queries, automatically scales
• Keeps pace with the rate of growth of incoming data

Efficiency
• Cost per TB at a fraction of other options
• Keep all of your data alive in an active archive
• Powering the data beats algorithm movement

The Cloudera Enterprise Platform for Big Data


Cloudera Hadoop Stack
