Biology, Big Data, Precision Medicine, and Other Buzzwords C. Titus Brown School of Veterinary Medicine; Genome Center & Data Science Initiative 1/15/16 #titusbuzz Slides are on slideshare.

2016 davis-biotech


Page 1: 2016 davis-biotech

Biology, Big Data, Precision Medicine, and Other Buzzwords

C. Titus Brown

School of Veterinary Medicine; Genome Center & Data Science Initiative

1/15/16

#titusbuzz

Slides are on slideshare.

Page 2: 2016 davis-biotech

N.B. This talk is for the students!

(I heard they had to attend, and I couldn’t pass up a guaranteed audience!)

Note: at the end, I would like to take a question or two from grad students first!

Page 3: 2016 davis-biotech

My academic path

• Undergrad: math major
• Grad school: developmental biology/genomics
• Postdoc: developmental biology/genomics
• Asst Prof: genomics/bioinformatics
• Now: bioinformatics/data-intensive biology

Page 4: 2016 davis-biotech

My non-academic path:

• Open source programming.
• Two startups, one real one & one half-academic thing.
• Some consulting on software engineering and testing.

Page 5: 2016 davis-biotech

Outline

1. Research on how to deal with lots of data.
2. How biology, in particular, is unprepared.
3. My advice for the next generation of researchers.

Page 6: 2016 davis-biotech

1. My research!

Some background & then some information.

Page 7: 2016 davis-biotech

DNA sequencing rates continue to grow.

Stephens et al., 2015 - 10.1371/journal.pbio.1002195

Page 8: 2016 davis-biotech

Oxford Nanopore sequencing

Slide via Torsten Seemann

Page 9: 2016 davis-biotech

Nanopore technology

Slide via Torsten Seemann

Page 10: 2016 davis-biotech

Scaling up --

Page 11: 2016 davis-biotech

Scaling up --

Page 12: 2016 davis-biotech

Slide via Torsten Seemann

Page 13: 2016 davis-biotech

http://ebola.nextflu.org/

Page 14: 2016 davis-biotech

“Fighting Ebola With a Palm-Sized DNA Sequencer”

See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/

Page 15: 2016 davis-biotech

“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.

Via Elizabeth Kujawinski

Lots of data other than just sequencing!

Page 16: 2016 davis-biotech

Data integration between different data types.

Figure 2. Summary of challenges associated with the data integration in the proposed project.

Figure via E. Kujawinski

Page 17: 2016 davis-biotech

=> My research

Planning for ~infinite amounts of data, and trying to do something effective with it.

Page 18: 2016 davis-biotech

Shotgun sequencing and coverage

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.

Page 19: 2016 davis-biotech

Random sampling => deep sampling needed

Typically 10-100x needed for robust recovery (30-300 Gbp for human)
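
The arithmetic behind these numbers can be sketched in a few lines. This is only an illustration, not something from the slides: the function names are made up here, the human genome is rounded to 3 Gbp, and the "fraction of bases missed" estimate assumes reads land uniformly at random (a simple Poisson model), which real sequencing only approximates.

```python
import math

# Assumption: the human genome is rounded to ~3 Gbp for this illustration.
HUMAN_GENOME_BP = 3_000_000_000


def mean_coverage(n_reads, read_length, genome_size):
    """Average number of reads overlapping each base: total bases / genome size."""
    return n_reads * read_length / genome_size


def bases_needed(genome_size, coverage):
    """Total sequenced bases required to reach a target mean coverage."""
    return genome_size * coverage


def expected_uncovered_fraction(coverage):
    """Fraction of bases hit by zero reads, assuming uniform random (Poisson) sampling."""
    return math.exp(-coverage)


if __name__ == "__main__":
    for target in (10, 100):
        gbp = bases_needed(HUMAN_GENOME_BP, target) / 1e9
        print(f"{target}x coverage of a 3 Gbp genome needs about {gbp:.0f} Gbp of reads")
```

Under the same random-sampling assumption, `expected_uncovered_fraction` shows why shallow sampling is not enough: at 1x mean coverage roughly a third of the bases are expected to be missed entirely, and the missed fraction only becomes negligible at much higher depth.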

Page 20: 2016 davis-biotech

Digital normalization

Page 21: 2016 davis-biotech

Digital normalization

Page 22: 2016 davis-biotech

Digital normalization

Page 23: 2016 davis-biotech

Digital normalization

Page 24: 2016 davis-biotech

Digital normalization

Page 25: 2016 davis-biotech

Digital normalization

Page 26: 2016 davis-biotech

Computational problem now scales with information content rather than data set size.

Most samples can be reconstructed via de novo assembly on commodity computers.
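
The slides show digital normalization only as figures, so here is a rough sketch of the idea as it is usually described: stream through the reads, estimate each read's coverage from the median count of its k-mers among the reads kept so far, and discard the read if that estimate has already reached a cutoff. This is a toy version under stated assumptions: it uses an exact Python dict rather than khmer's memory-efficient counting structures, the function names and the default `k` and `cutoff` values are illustrative, and it ignores reverse complements and sequencing errors.

```python
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage so far is below the cutoff.

    Toy sketch: exact dict counts stand in for a probabilistic counter.
    """
    counts = {}
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k; nothing to estimate from
        # Median k-mer count among kept reads approximates this read's coverage.
        estimated_coverage = median(counts.get(km, 0) for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept


if __name__ == "__main__":
    # 100 copies of one sequence plus one rare sequence.
    reads = ["ACGTTGCA"] * 100 + ["GGGGCCCCAA"]
    kept = digital_normalization(reads, k=4, cutoff=5)
    print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy run the highly redundant sequence is capped at the cutoff while the rare one is retained, which is the sense in which the work scales with information content rather than with the raw number of reads.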

Page 27: 2016 davis-biotech

Digital normalization & horse transcriptome

The computational demands for cufflinks
- Read binning (processing time)
- Construction of gene models: number of genes, number of splicing junctions, number of reads per locus, sequencing errors, complexity of the locus such as gene overlap and multiple isoforms (processing time & memory utilization)

Diginorm
- Significant reduction of binning time
- Relative increase of the resources required for gene model construction when merging more samples and tissues
- ? false recombinant isoforms

Tamer Mansour

Page 28: 2016 davis-biotech

Effect of digital normalization

** Should be very valuable for detection of ncRNA

Tamer Mansour

Page 29: 2016 davis-biotech

The khmer software package

• Demo implementation of research data structures & algorithms;
• 10.5k lines of C++ code, 13.7k lines of Python code;
• khmer v2.0 has 87% statement coverage under test;
• ~3-4 developers, 50+ contributors, ~1000s of users (?)

The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1

Page 30: 2016 davis-biotech

khmer is developed as a true open source package

• github.com/dib-lab/khmer;
• BSD license;
• Code review, two-person sign off on changes;
• Continuous integration (tests are run on each change request);

Crusoe et al., 2015; doi: 10.12688/f1000research.6924.1

Page 31: 2016 davis-biotech

Literate graphing & interactive exploration

Camille Scott

Page 32: 2016 davis-biotech

Research process

Page 33: 2016 davis-biotech

This is the standard process in the lab --

Our papers now have:

• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)

Zhang et al. doi: 10.1371/journal.pone.0101271

Page 34: 2016 davis-biotech

The buoy project - decentralized infrastructure for bioinformatics.

ivory.idyll.org/blog/2014-moore-ddd-award.html

Page 35: 2016 davis-biotech

The next questions --

(a) If you had all the data from all the things, what could you do with it?

(b) If you could edit any genome you wanted, in any way you wanted, what would you edit?

Page 36: 2016 davis-biotech

2. Big Data, Biology, and how we’re underprepared.

(Answers to previous qs: we are not that good at using data to inform our models or our experimental plans...)

Page 37: 2016 davis-biotech

My first 7 reasons --

1. Biology is very complicated.
2. We know very little about function in biology.
3. Very few people are trained in both data analysis and biology.
4. Our publishing system is holding back the sharing of knowledge.
5. We don’t share data.
6. We are too focused on hypothesis-driven research.
7. Most computational research is not reproducible.

Page 38: 2016 davis-biotech

Biology is complicated.

Sea urchin gene network for early development; http://sugp.caltech.edu/endomes/

Page 39: 2016 davis-biotech

We know very little, and a lot of what we “know” is wrong.

One recent story that caught my eye – problems with genetic testing & databases. (See URL below for full story.)

• “1/4 of mutations linked to childhood diseases are debatable.”
• In a study of 60,000 people, on average each had 53 “pathogenic” variants…

http://www.theatlantic.com/science/archive/2015/12/why-human-genetics-research-is-full-of-costly-mistakes/420693/

Page 40: 2016 davis-biotech

Very few people are trained in both data analysis and biology.

(More on this later)

Page 41: 2016 davis-biotech

Our publishing system has become a real problem.

• The journal system costs more than $10bn/yr, with profit margins estimated at 20-30% (see citation, below).

• Articles in high impact factor journals have lower statistical power.

• High-IF journals have higher rates of retractions (which cannot solely be attributed to “attention paid”)

• We publish in PDF form, which is computationally opaque.

• Publishing is slow!

$10bn/year: http://www.stm-assoc.org/2015_02_20_STM_Report_2015.pdf

Page 42: 2016 davis-biotech

High-impact-factor articles have poor statistical power.

Our current system rewards A but not B.

Brembs et al., 2013 - http://journal.frontiersin.org/article/10.3389/fnhum.2013.00291/full

Page 43: 2016 davis-biotech

High impact factor => high retraction index.

Brembs et al., 2013 - http://journal.frontiersin.org/article/10.3389/fnhum.2013.00291/full

Page 44: 2016 davis-biotech

We just don’t share our data.

• Researchers have virtually no short-term incentives to share data in useful ways.

• “46% of respondents reported they do not make their data available to others” – study in ecology (Tenopir et al., 2011)

• Some “great” stories from the rare disease community – see New Yorker link, below.

http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2

Page 45: 2016 davis-biotech

We are focused on hypothesis-driven research.

• Granting agencies require specific hypotheses, even when little is known.

• This focuses research on “known unknowns”, and leaves “unknown unknowns” out in the cold.

Page 46: 2016 davis-biotech

The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome"

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect."

Ref.: Pandey et al. (2014), PLoS One 11, e88889. Via Erich Schwarz

Page 47: 2016 davis-biotech

Most computational research is not reproducible.

I don’t know of a systematic study, but of papers that I read, approximately 95% fail to include details necessary for replication.

It’s very hard to build off of research like this.

(There’s a lot more to say about reproducibility and replicability than I can fit in here…)

Page 48: 2016 davis-biotech

What am I doing about it?

1. Open science
2. “Culture hacking” to drive open data.
3. Training!

(I don’t have any guaranteed solutions. All I can do is think & work.)

Page 49: 2016 davis-biotech

Perspectives on training

• Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report)

• Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing.

• Training is systematically undervalued in academia (!?)

Page 50: 2016 davis-biotech

UC Davis and training

My goal here is to support the coalescence and growth of a local community of practice around “data intensive biology”.

Page 51: 2016 davis-biotech

Summer NGS workshop (2010-2017)

Page 52: 2016 davis-biotech

General parameters:

• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.

Page 53: 2016 davis-biotech

Thus far & near future

~12 workshops on bioinformatics in 2015.

Trying out Q1 & Q2 2016:

• Half-day intro workshops (27 planned);
• Week-long advanced workshops;
• Co-working hours (“data therapy”).

dib-training.readthedocs.org/

Page 54: 2016 davis-biotech

3. Advice to the next generation

(or two generations, if you want me to feel really old.)

a. Get involved with a broad group of people and ideas (social media FTW!)

b. Learn something about both computing and biology.

c. Realize that you have nothing but opportunity, and that there has never been a better time to be in bio research!

Page 55: 2016 davis-biotech

Precision Medicine?

Page 56: 2016 davis-biotech

Thanks for listening!

Please contact me at [email protected]!

Note: I work here!

(I’d like to start with a grad student question?)