2015 genome-center
![Page 1: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/1.jpg)
A data intensive future: How can biology take full advantage of the coming data deluge?
C. Titus Brown
School of Veterinary Medicine; Genome Center & Data Science Initiative
11/13/15
![Page 2: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/2.jpg)
Outline
0. Background
1. Research: what do we do with infinite data?
2. Development: software and infrastructure.
3. Open science & reproducibility.
4. Training
![Page 3: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/3.jpg)
0. Background
In which I present the perspective that we face increasingly large data sets, from diverse samples, generated in real time, with many different data
types.
![Page 4: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/4.jpg)
DNA sequencing rates continue to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
![Page 5: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/5.jpg)
Oxford Nanopore sequencing
Slide via Torsten Seemann
![Page 6: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/6.jpg)
Nanopore technology
Slide via Torsten Seemann
![Page 7: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/7.jpg)
Scaling up --
![Page 8: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/8.jpg)
Scaling up --
![Page 9: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/9.jpg)
Slide via Torsten Seemann
![Page 10: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/10.jpg)
http://ebola.nextflu.org/
![Page 11: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/11.jpg)
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
![Page 12: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/12.jpg)
“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.
Via Elizabeth Kujawinski
![Page 13: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/13.jpg)
1. Research
In which I discuss advances made towards analyzing infinite amounts of genomic data, and the perspectives
engendered thereby: to wit, streaming and sketches.
![Page 14: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/14.jpg)
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
De Bruijn graphs (sequencing graphs) scale with data size, not information size.
![Page 15: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/15.jpg)
Why do sequence graphs scale badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
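The scaling argument above can be checked with a small simulation. This is a minimal sketch, not taken from the talk: the genome size, read length, k, and 1% substitution rate are all assumed values chosen for illustration, and `distinct_kmers` is a hypothetical helper. Distinct k-mers are used as a rough proxy for graph memory.

```python
import random


def mutate(base, rng):
    """Return a base different from `base` (a simulated substitution error)."""
    return rng.choice([b for b in "ACGT" if b != base])


def distinct_kmers(genome, n_reads, read_len, k, error_rate, rng):
    """Count distinct k-mers in reads sampled uniformly from `genome`."""
    seen = set()
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = "".join(
            mutate(base, rng) if rng.random() < error_rate else base
            for base in genome[start:start + read_len]
        )
        for i in range(read_len - k + 1):
            seen.add(read[i:i + k])
    return len(seen)


# Assumed toy parameters: 2 kbp random genome, 100 bp reads, k = 21.
genome_rng = random.Random(42)
genome = "".join(genome_rng.choice("ACGT") for _ in range(2000))

for n_reads in (200, 400, 800):  # roughly 10x, 20x, 40x coverage
    clean = distinct_kmers(genome, n_reads, 100, 21, 0.0, random.Random(1))
    noisy = distinct_kmers(genome, n_reads, 100, 21, 0.01, random.Random(1))
    print(n_reads, "reads:", clean, "k-mers without errors,", noisy, "with errors")
```

Without errors, the k-mer count cannot exceed the number of k-mers in the genome, however many reads are added. With errors, each additional read can contribute new erroneous k-mers, so the count keeps growing with the size of the data set.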
![Page 16: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/16.jpg)
Practical memory measurements
Velvet measurements (Adina Howe)
![Page 17: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/17.jpg)
Our solution: lossy compression
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486
![Page 18: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/18.jpg)
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
![Page 19: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/19.jpg)
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
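The arithmetic behind these numbers can be sketched as follows. The ~3 Gbp human genome size is an assumed round figure, and the uncovered-fraction estimate assumes idealized uniform random sampling (the standard Poisson approximation), which real sequencing only roughly follows.

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # assumed round figure of ~3 Gbp


def coverage(n_reads, read_length, genome_size):
    """Average number of reads overlapping each base."""
    return n_reads * read_length / genome_size


def bases_needed(target_coverage, genome_size):
    """Total sequenced bases required to reach a target average coverage."""
    return target_coverage * genome_size


def fraction_uncovered(target_coverage):
    """Expected fraction of bases hit by zero reads under Poisson sampling."""
    return math.exp(-target_coverage)


for c in (10, 100):
    gbp = bases_needed(c, HUMAN_GENOME_BP) / 1e9
    print(f"{c}x coverage of human: {gbp:.0f} Gbp")

for c in (1, 5, 10):
    print(f"{c}x average coverage leaves ~{fraction_uncovered(c):.5f} of bases unsampled")
```

Because reads land at random, an average coverage of 1x leaves over a third of the genome unsampled; this is why deep sampling is needed even though most of the resulting data is redundant.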
![Page 20: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/20.jpg)
Digital normalization
![Page 21: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/21.jpg)
Digital normalization
![Page 22: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/22.jpg)
Digital normalization
![Page 23: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/23.jpg)
Digital normalization
![Page 24: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/24.jpg)
Digital normalization
![Page 25: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/25.jpg)
Digital normalization
![Page 26: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/26.jpg)
Graph size now scales with information content.
Most samples can be reconstructed via de novo assembly on commodity computers.
![Page 27: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/27.jpg)
Diginorm ~ “lossy compression”
Nearly perfect from an information theoretic perspective:
– Discards 95% or more of the data for genomes.
– Loses < 0.02% of information.
![Page 28: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/28.jpg)
This changes the way analyses scale.
Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486
![Page 29: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/29.jpg)
Streaming lossy compression:
for read in dataset:
    if estimated_coverage(read) < CUTOFF:
        yield read
This is literally a three line algorithm. Not kidding.
It took four years to figure out which three lines, though…
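A runnable sketch of the three lines above, under stated assumptions: coverage is estimated as the median count of a read's k-mers seen so far, and an exact `Counter` stands in for whatever low-memory counting structure a real implementation would use. The helper names, `k`, and `cutoff` values here are illustrative, not khmer's API. The count update for kept reads is implied by the slide's `estimated_coverage` and is spelled out here.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping substrings of length k in `read`."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def estimated_coverage(read, counts, k):
    """Median abundance, so far, of the k-mers in `read`."""
    kms = kmers(read, k)
    if not kms:  # reads shorter than k have no k-mers; treat as unseen
        return 0
    return median(counts[km] for km in kms)


def digital_normalization(dataset, k=20, cutoff=20):
    """Yield only reads whose estimated coverage is still below `cutoff`."""
    counts = Counter()
    for read in dataset:
        if estimated_coverage(read, counts, k) < cutoff:
            counts.update(kmers(read, k))  # only kept reads are counted
            yield read


# 50 identical reads stand in for a region sequenced to 50x.
reads = ["ACGTTGCAAGGCTTAACCGGTTAGC"] * 50
kept = list(digital_normalization(reads, k=5, cutoff=5))
print(len(kept))  # 5
```

Once a region has been seen `cutoff` times, further reads from it are discarded, so the number of retained reads tracks the information in the sample rather than the raw volume of data.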
![Page 30: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/30.jpg)
Diginorm can detect information saturation in a stream.
Zhang et al., submitted.
![Page 31: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/31.jpg)
This generically permits semi-streaming analytical approaches.
Zhang et al., submitted.
![Page 32: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/32.jpg)
e.g. E. coli analysis => ~1.2 pass, sublinear memory
Zhang et al., submitted.
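One way to read "~1.2 pass" is sketched below, as an assumption about the general approach rather than the method of Zhang et al.: reads that arrive after their region is saturated are analyzed immediately, while reads that arrive earlier are set aside and revisited once the stream ends. `semi_streaming`, `analyze`, and the parameter values are hypothetical names chosen for illustration.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping substrings of length k in `read`."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def semi_streaming(reads, analyze, k=20, cutoff=20):
    """Analyze saturated reads on the fly; revisit only the early ones.

    Returns the analysis results and the effective number of passes.
    """
    counts = Counter()
    deferred = []  # reads seen before their region reached `cutoff`
    results = []
    total = 0
    for read in reads:
        total += 1
        kms = kmers(read, k)
        cov = median(counts[km] for km in kms) if kms else 0
        if cov < cutoff:
            counts.update(kms)
            deferred.append(read)
        else:
            results.append(analyze(read, counts))
    # Second, partial pass: only the deferred reads are looked at again.
    for read in deferred:
        results.append(analyze(read, counts))
    passes = 1 + len(deferred) / total if total else 0.0
    return results, passes


reads = ["ACGTTGCAAGGCTTAACCGGTTAGC"] * 25
results, passes = semi_streaming(reads, lambda r, c: len(r), k=5, cutoff=5)
print(round(passes, 2))  # 1.2
```

In this toy case 5 of 25 reads are deferred, so the data is read 1.2 times in total. Only the deferred reads and the k-mer counts need to be held, which is where the memory saving over a full two-pass approach would come from.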
![Page 33: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/33.jpg)
Another simple algorithm.
Zhang et al., submitted.
![Page 34: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/34.jpg)
Single pass, reference free, tunable, streaming online variant calling.
Error detection variant calling
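A minimal sketch of how error detection and variant calling can share one mechanism, assuming the idea is that rare k-mers inside an otherwise well-covered read mark either a sequencing error or a real variant. The function name and the `min_count` threshold are hypothetical, and an exact `Counter` is used in place of a low-memory sketch.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping substrings of length k in `read`."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def low_abundance_positions(read, counts, k, min_count=2):
    """Start positions of k-mers in `read` seen fewer than `min_count` times."""
    return [
        i for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


K = 5
reference_read = "ACGTTGCAAGGCTTAACCGGTTAGC"

# Build counts from 10 identical reads (a stand-in for a saturated region).
counts = Counter()
for _ in range(10):
    counts.update(kmers(reference_read, K))

# A read with a single substitution (T -> G) at position 12.
variant_read = reference_read[:12] + "G" + reference_read[13:]
print(low_abundance_positions(variant_read, counts, K))  # [8, 9, 10, 11, 12]
```

The flagged k-mers are exactly those overlapping the changed base. Whether such a site is an error or a variant depends on how often it recurs across reads, which is what makes the threshold tunable.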
![Page 35: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/35.jpg)
Real time / streaming data analysis.
![Page 36: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/36.jpg)
My real point -
• We need well founded, and flexible, and algorithmically efficient, and high performance components for sequence data manipulation in biology.
• We are building some of these on a streaming and low memory paradigm.
• We are building out a scripting library for composing these operations.
![Page 37: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/37.jpg)
2. Software and infrastructure
Alas, practical data analysis depends on software and computers, which
leads to depressingly practical considerations for gentleperson
scientists.
![Page 38: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/38.jpg)
Software
It’s all well and good to develop new data analysis approaches, but their utility is greater when they are implemented in usable software.
Writing, maintaining, and progressing research software is hard.
![Page 39: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/39.jpg)
The khmer software package
• Demo implementation of research data structures & algorithms;
• 10.5k lines of C++ code, 13.7k lines of Python code;
• khmer v2.0 has 87% statement coverage under test;
• ~3-4 developers, 50+ contributors, ~1000s of users (?)
The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1
![Page 40: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/40.jpg)
khmer is developed as a true open source package
• github.com/dib-lab/khmer;
• BSD license;
• Code review, two-person sign off on changes;
• Continuous integration (tests are run on each change request);
![Page 41: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/41.jpg)
Challenges:
Research vs stability!
Stable software for users, & platform for future research;
vs research “culture” (funding and careers)
![Page 42: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/42.jpg)
How is continued software dev feasible?!
Representative half-arsed lab software development
![Page 43: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/43.jpg)
A not-insane way to do software development
![Page 44: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/44.jpg)
Infrastructure issues
Suppose that we have a nice ecosystem of bioinformatics & data analysis tools.
Where and how do we run them?
Consider:
1. Biologists hate funding computational infrastructure.
2. Researchers are generally incompetent at building and maintaining usable infrastructure.
3. Centralized infrastructure fails in the face of infinite data.
![Page 45: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/45.jpg)
Decentralized infrastructure for bioinformatics?
ivory.idyll.org/blog/2014-moore-ddd-award.html
![Page 46: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/46.jpg)
3. Open science and reproducibility
In which I start from the point that most researchers* cannot replicate their own
computational analyses, much less reproduce those published by anyone else.
* This doesn’t apply to anyone in this
audience; you’re all outliers!
![Page 47: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/47.jpg)
My lab & the diginorm paper.
• All our code was on github;
• Much of our data analysis was in the cloud (on Amazon EC2);
• Our figures were made in IPython Notebook.
• Our paper was in LaTeX.
Brown et al., 2012 (arXiv)
![Page 48: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/48.jpg)
IPython Notebook: data + code =>
![Page 49: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/49.jpg)
To reproduce our paper:
git clone <khmer> && python setup.py install
git clone <pipeline>
cd pipeline
wget <data> && tar xzf <data>
make && cd ../notebook && make
cd ../ && make
![Page 50: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/50.jpg)
This is standard process in the lab --
Our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)
Zhang et al. doi: 10.1371/journal.pone.0101271
![Page 51: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/51.jpg)
Research process
![Page 52: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/52.jpg)
Literate graphing & interactive exploration
Camille Scott
![Page 53: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/53.jpg)
Why bother??
“There is no scientific knowledge of the individual.” (Aristotle)
More pragmatically, we are tired of struggling to reproduce other people’s results.
And, in the end, it’s not all that much extra work.
![Page 54: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/54.jpg)
What does this have to do with open science?
This is a longer & larger conversation, but:
All of our processes enable easy and efficient pre-publication sharing. Source code, analyses, preprints…
When we share early, our ideas have a significant competitive advantage in the research marketplace of
ideas.
![Page 55: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/55.jpg)
4. Training
In which I note that methods and tools do little without a trained hand
wielding them, and a trained eye examining the results.
![Page 56: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/56.jpg)
Perspectives on training
• Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report)
• Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing.
• Training is systematically undervalued in academia (!?)
![Page 57: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/57.jpg)
UC Davis and training
My goal here is to support the coalescence and growth of a local community of practice around “data intensive biology”.
![Page 58: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/58.jpg)
Summer NGS workshop (2010-2017)
![Page 59: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/59.jpg)
General parameters:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
![Page 60: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/60.jpg)
Thus far & near future
~12 workshops on bioinformatics in 2015.
Trying out soon:
• Half-day intro workshops;
• Week-long advanced workshops;
• Co-working hours.
dib-training.readthedocs.org/
![Page 61: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/61.jpg)
The End.
• If you think 5-10 years out, we face significant practical issues for data analysis in biology.
• We need new algorithms/data structures, AND good implementations, AND better computational practice, AND training.
• This can be either viewed with despair… or seen as an opportunity to seize the competitive advantage!
(How I view it varies from day to day.)
![Page 62: 2015 genome-center](https://reader036.vdocument.in/reader036/viewer/2022070515/5879a6af1a28ab082c8b7121/html5/thumbnails/62.jpg)
Thanks for listening!
Please contact me at [email protected]!
Note: I work here now!