Bridging the gap: Enabling top research in translational research - Knut Reinert

Upload: australian-bioinformatics-network

Post on 10-May-2015

DESCRIPTION

In this talk I will give my view of the steps needed to enable efficient biomedical research at a time when biotechnology can give us comprehensive views of certain data. I will start by arguing that the NGS technologies developed in recent years have changed the research landscape to a degree similar to the beginning of the millennium, when the human genome was first sequenced. As a consequence, the research tools of many biomedical researchers have changed or will change, in the sense that they will conduct large-scale, complex computations. Hence, as a community, we have to turn our focus to how we develop such tools. Thinking about this becomes essential, since in the near future clinical decisions concerning the treatment of individuals (personalised medicine) will be based on such computations. I will talk about the past and future role of software libraries for enabling translational research and illustrate some points with the SeqAn C++ library developed in my lab.

TRANSCRIPT

Page 1: Bridging the gap: Enabling top research in translational research - Knut Reinert

Prof. Dr. Knut Reinert

Algorithmische Bioinformatik, FB Mathematik und Informatik

Australia, 10/13

Knut Reinert

Freie Universität Berlin

Institute for Computer Science

Bridging the gap: Enabling top research in translational research

Page 2: Bridging the gap: Enabling top research in translational research - Knut Reinert

~ 13 years ago...

Data volume and cost: In 2000 the 3 billion base pairs of the human genome were sequenced for about 3 billion US dollars

100 million bp per day

Page 3: Bridging the gap: Enabling top research in translational research - Knut Reinert

Sequencing today...

Within roughly ten years sequencing has become about 10 million times cheaper

Illumina HiSeq

100 billion bps per day

Page 4: Bridging the gap: Enabling top research in translational research - Knut Reinert

Sequencing earth?

10^7 species x 10^8 genome size => earth genome has 10^15 bps

10^4 HiSeqs can each sequence 10^11 bps per day => earth genome at 10x in 10 days
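
The estimate on this slide can be checked in one line, reading the flattened exponents as powers of ten (10^7 species, 10^8 bp per genome, 10^4 HiSeq machines at 10^11 bp per day each) and taking "10x" to mean tenfold coverage:

```latex
\[
\text{days}
  = \frac{10 \times \left(10^{7} \times 10^{8}\right)\ \text{bp}}
         {10^{4} \times 10^{11}\ \text{bp/day}}
  = \frac{10^{16}}{10^{15}}
  = 10 .
\]
```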

Page 5: Bridging the gap: Enabling top research in translational research - Knut Reinert

Future of NGS data analysis

Page 6: Bridging the gap: Enabling top research in translational research - Knut Reinert

Why is translational research hard?

Roles involved: Medical doctors/Biologists, Biomedical Modelers, Engineers (Hardware), Engineers (Software), Algorithmicists, Mathematicians

Algorithms: Algorithm 1 (quality 0.75), Algorithm 2 (quality 0.45), Algorithm 3 (quality 0.95), Algorithm 4 (quality 0.85)

Implementations: Implementation 1 (quality 0.65), Implementation 2 (quality 0.75), Implementation 3 (quality 0.85)

Results: Result 1 (quality 0.75), Result 2 (quality 0.95), Result 3 (quality 0.35)

0.95 * 0.85 * 0.95 ≈ 0.77
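
The product on this slide can be reproduced with a few lines of C++. This is only a sketch under the assumption, suggested by the slide's formula, that the qualities of the chained stages simply multiply; the function name `chainQuality` is made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Overall quality of a chain of stages (algorithm, implementation, result),
// ASSUMING stage qualities multiply, as the product on the slide suggests.
double chainQuality(const std::vector<double>& stageQualities) {
    double overall = 1.0;
    for (double q : stageQualities) {
        assert(q >= 0.0 && q <= 1.0);  // qualities are fractions in [0, 1]
        overall *= q;
    }
    return overall;
}
```

Calling `chainQuality({0.95, 0.85, 0.95})` gives roughly 0.767, so even three very good stages lose almost a quarter of the quality end to end.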

Page 7: Bridging the gap: Enabling top research in translational research - Knut Reinert

Software libraries bridge gap

From theory to practice: Theoretical Considerations, Algorithm design, Prototype implementation, Maintainable tool, Analysis pipelines

Between Computer Scientists and Experimentalists: Algorithm libraries

Applications: RNA-Seq, ChIP-Seq, Structural variants, Metagenomics abundance, Sequence assembly, Cancer genomics

Building blocks: FM-index, Suffix arrays, Multicore, Hardware acceleration, K-mer filter, Fast I/O, Secondary memory

We need to be very good on all levels

Page 8: Bridging the gap: Enabling top research in translational research - Knut Reinert

This talk

Translational research

Page 9: Bridging the gap: Enabling top research in translational research - Knut Reinert

This talk

SeqAn Content

SeqAn Performance

Page 10: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn

Now SeqAn/SeqAn tools have been cited more than 360 times

Among the institutions are (omitting German institutes):

Department of Genetics, Harvard Medical School, Boston,

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton,

J. Craig Venter Institute, Rockville MD, USA,

Department of Molecular Biology, Princeton University,

Applied Mathematics Program, Yale University, New Haven,

IBM T.J. Watson Research Center, Yorktown Heights,

The Ohio State University, Columbus, University of Minnesota,

Australian National University, Canberra,

Department of Statistics, University of Oxford,

Swedish University of Agricultural Sciences (SLU), Uppsala,

Graduate School of Life Sciences, University of Cambridge,

Broad Institute, Cambridge, USA,

EMBL-EBI, University of California, University of Chicago,

Iowa State University, Ames, The Pennsylvania State University,

Peking University, Beijing, University of Science and Technology of China,

BGI-Shenzhen, China, Beijing Institute of Genomics……

SeqAn is under the BSD license and hence free for academic AND commercial use.

Page 11: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn developers

[Figure: number of SeqAn developers per year (axis 0 to 16), 2003 to 2012; legend: External, CSC, BMBF, DFG, IMPRS, FU]

Page 12: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn Content - SDK

Page 13: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn SDK Components - Tutorials

Page 14: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn SDK Components – Reference Manual

Page 15: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn SDK Components

CDash/CTest to automatically compile and test across platforms

Review Board to ensure code quality

Code coverage reports

Page 16: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn Content

algorithms & data structures

Page 17: Bridging the gap: Enabling top research in translational research - Knut Reinert

Unified Alignment Algorithms

Versatile & Extensible DP-Interface

For example ...

Standard DP-Algorithms: Global & Semi-Global Alignments, Local Alignments

Modified DP-Algorithms: Split Breakpoint Detection, Banded Chain Alignment

Page 18: Bridging the gap: Enabling top research in translational research - Knut Reinert

Unified Alignment Algorithms

Configure: Central Configuration

DPProfile_<TAlgorithm, TGaps, TTrace>

Compile: … selects code snippets accordingly … generates desired DP-Algorithm

Run: Unbanded DP-Algorithms, Banded DP-Algorithms

Page 19: Bridging the gap: Enabling top research in translational research - Knut Reinert

Unified Alignment Algorithms

For Example ...

Banded Smith-Waterman with Affine Gap Costs:

DPBand_<BandOn>(lowerDiag, upperDiag),

DPProfile_<LocalAlignment_<>, AffineGaps, TracebackOn<> >

Semi-Global Gotoh without Traceback:

DPProfile_<GlobalAlignment_<FreeEndGaps_<True, False, True, False> >, AffineGaps, TracebackOff>

Split-Breakpoint Detection for Right Anchor:

DPProfile_<SplitAlignment_<>, AffineGaps, TracebackOn<GapsRight> >

Needleman-Wunsch with Traceback:

DPProfile_<GlobalAlignment_<>, LinearGaps, TracebackOn<> >
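
The idea of configuring one DP kernel at compile time can be sketched in plain C++ without SeqAn. The tags `GlobalTag` and `LocalTag`, the `Scoring` struct and `alignmentScore` below are hypothetical names chosen for illustration, not SeqAn's API; the sketch only covers linear gap costs and returns the score without a traceback.

```cpp
#include <algorithm>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical tags (NOT SeqAn's): they play the role of TAlgorithm in DPProfile_.
struct GlobalTag {};  // Needleman-Wunsch
struct LocalTag {};   // Smith-Waterman

struct Scoring { int match; int mismatch; int gap; };  // linear gap costs only

// One DP kernel; the tag selects initialisation, clamping and result cell
// at compile time, so each instantiation only contains the code it needs.
template <typename TAlgorithm>
int alignmentScore(const std::string& a, const std::string& b, const Scoring& sc) {
    constexpr bool isLocal = std::is_same<TAlgorithm, LocalTag>::value;
    const std::size_t n = a.size(), m = b.size();
    std::vector<int> prev(m + 1), cur(m + 1);
    // First row: free for local alignment, gap-penalised for global alignment.
    for (std::size_t j = 0; j <= m; ++j)
        prev[j] = isLocal ? 0 : static_cast<int>(j) * sc.gap;
    int best = 0;  // only used for local alignment
    for (std::size_t i = 1; i <= n; ++i) {
        cur[0] = isLocal ? 0 : static_cast<int>(i) * sc.gap;
        for (std::size_t j = 1; j <= m; ++j) {
            const int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? sc.match : sc.mismatch);
            int v = std::max({diag, prev[j] + sc.gap, cur[j - 1] + sc.gap});
            if (isLocal) v = std::max(v, 0);  // local alignments may restart anywhere
            cur[j] = v;
            if (isLocal) best = std::max(best, v);
        }
        std::swap(prev, cur);  // prev now holds the row just computed
    }
    return isLocal ? best : prev[m];
}
```

With `Scoring sc{1, -1, -1}`, `alignmentScore<GlobalTag>("ACGT", "AGT", sc)` scores three matches and one gap, while `alignmentScore<LocalTag>(...)` reports the best-scoring substring pair. Gap models, banding and traceback would be further template parameters in the same style.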

Page 20: Bridging the gap: Enabling top research in translational research - Knut Reinert

Unified Alignment Algorithms

... And how much slower is that? (affine alignment of 10kb Dengue virus sequences)

[Figure: bar chart of time [s] (axis 0 to 10) for SeqAn, NCBI, GGSEARCH, NEEDLE]

Page 21: Bridging the gap: Enabling top research in translational research - Knut Reinert

Support for Common File Formats

Important file formats for HTS analysis

Sequences

FASTA, FASTQ

Indexed FASTA (FAI) for random access

Genomic Features

GFF 2, GFF 3, GTF, BED

Read Mapping

SAM, BAM (plus BAM indices)

Variants

VCF

… or write your own parser

Tutorials and helper routines for writing your own parsers.

SequenceStream ss("file.fa.gz");
while (!atEnd(ss))
{
    readRecord(id, seq, ss);
    cout << id << '\t' << seq << '\n';
}

BamStream bs("file.bam");
while (!atEnd(bs))
{
    readRecord(record, bs);
    cout << record.qName << '\t' << record.pos << '\n';
}
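
For the "write your own parser" route, a minimal FASTA reader in standard C++ might look as follows. This is a sketch that does not use SeqAn's helper routines; `FastaRecord`, `readFasta` and `readFastaFromString` are illustrative names, and compressed input (as in `file.fa.gz` above) is not handled.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct FastaRecord {
    std::string id;   // header line without the leading '>'
    std::string seq;  // sequence with line breaks removed
};

// Read all records from a FASTA stream. Sequence lines that appear before
// the first header are ignored; blank lines are skipped.
std::vector<FastaRecord> readFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == '>')
            records.push_back({line.substr(1), ""});
        else if (!records.empty())
            records.back().seq += line;
    }
    return records;
}

// Convenience wrapper for in-memory data, e.g. in tests.
std::vector<FastaRecord> readFastaFromString(const std::string& data) {
    std::istringstream in(data);
    return readFasta(in);
}
```

For a file, pass an `std::ifstream` to `readFasta`; each record then exposes `id` and `seq` much like the `readRecord(id, seq, ss)` loop above.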

Page 22: Bridging the gap: Enabling top research in translational research - Knut Reinert

Journaled Sequences

Store Multiple Genomes

Save Storage Capacities

StringSet<TJournaled, Owner<JournalSet> > set;

setGlobalReference(set, refSeq);

appendValue(set, seq1);

join(set, idx, JoinConfig<>());

String<Dna, Journaled<Alloc<> > >

[Figure: genomes G1, G2, ..., GN shown against the reference Ref]
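
The storage saving behind journaled sequences can be illustrated with a toy version in standard C++: each genome keeps only its differences to a shared reference and is rebuilt on demand. `Edit` and `JournaledSeq` are hypothetical names, and the real SeqAn data structure is more elaborate than this sketch.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One difference to the reference: replace `delLen` reference characters
// starting at `refPos` by `inserted` (either part may be empty).
struct Edit {
    std::size_t refPos;
    std::size_t delLen;
    std::string inserted;
};

// Toy journaled sequence: a pointer to the shared reference plus a list of
// edits. The reference must outlive this object.
class JournaledSeq {
public:
    JournaledSeq(const std::string& ref, std::vector<Edit> edits)
        : ref_(&ref), edits_(std::move(edits)) {}

    // Rebuild the full sequence. Edits must be sorted by refPos and must not overlap.
    std::string materialize() const {
        std::string out;
        std::size_t cursor = 0;  // next unread reference position
        for (const Edit& e : edits_) {
            assert(e.refPos >= cursor && e.refPos + e.delLen <= ref_->size());
            out.append(*ref_, cursor, e.refPos - cursor);  // unchanged stretch
            out += e.inserted;
            cursor = e.refPos + e.delLen;
        }
        out.append(*ref_, cursor, std::string::npos);  // tail after the last edit
        return out;
    }

    // Characters stored in addition to the shared reference.
    std::size_t storedCharacters() const {
        std::size_t total = 0;
        for (const Edit& e : edits_) total += e.inserted.size();
        return total;
    }

private:
    const std::string* ref_;
    std::vector<Edit> edits_;
};
```

A genome that differs from the reference by one SNP, one insertion and one deletion stores only a handful of characters, which is why many similar genomes fit into far less space than their concatenation.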

Page 23: Bridging the gap: Enabling top research in translational research - Knut Reinert

Journaled stringset benchmark

1091 x chr. 22 (~60 GB)

Task: run a sequential algorithm over all strings (in the demo, the Horspool algorithm)

Use core parallelism AND data parallelism (JournaledStringSet)

Timings with and without I/O
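
For readers unfamiliar with the Horspool algorithm named in the benchmark task, here is a compact standard C++ version. It is a generic textbook sketch, not the SeqAn implementation used in the demo, and `horspoolSearch` is an illustrative name.

```cpp
#include <array>
#include <string>
#include <vector>

// Horspool exact string matching: returns the start positions of all
// (possibly overlapping) occurrences of `pattern` in `text`.
std::vector<std::size_t> horspoolSearch(const std::string& text, const std::string& pattern) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size(), m = pattern.size();
    if (m == 0 || m > n) return hits;

    // Shift table: distance from the last occurrence of each character in
    // pattern[0, m-1) to the end of the pattern; m for characters not in it.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(pattern[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        // Compare the window right to left.
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pattern[j - 1]) --j;
        if (j == 0) hits.push_back(pos);
        // Shift by the table entry of the character under the window's last position.
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return hits;
}
```

Because the scan only moves forward through the text, it is a natural candidate for the "sequential algorithm over all strings" setting described above.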

Page 24: Bridging the gap: Enabling top research in translational research - Knut Reinert

Timings without I/O (secs)

[Figure: time in seconds (axis 0 to 180) at 1, 2, 4 and 8 on the x-axis; series: SS, JSS (no DP), JSS (DP)]

About 4x slower (no DP), 5x faster (with DP)

Page 25: Bridging the gap: Enabling top research in translational research - Knut Reinert

Timings with I/O (min)

[Figure: time in minutes (axis 0 to 14) for SS, JSS (no DP), JSS (DP)]

~ 30 times faster

Page 26: Bridging the gap: Enabling top research in translational research - Knut Reinert

Fragment Store

All-In-One Data Structure for HTS

Designed to represent:

• reads, mate-pairs, reference genomes

• pairwise alignments

• genome annotations

Easy-to-use interface and high-level functions for typical workflows.

Genome Annotations

Annotation files can easily be imported (GFF/GTF/UCSC):

FragmentStore<> store;
read(file, store, Gff());

The annotation tree can be traversed with iterators:

Iterator<TStore, AnnotationTree<> > it(store);
goDown(it);

[Figure: annotation tree with nodes root, gene, mRNA, exon, CDS]

Page 27: Bridging the gap: Enabling top research in translational research - Knut Reinert

Fragment Store

(Multi) Read Alignments

Read alignments can be easily imported:

std::ifstream file("ex1.sam");
read(file, store, Sam());

… and accessed as a multiple alignment, e.g. for visualization:

AlignedReadLayout layout;
layoutAlignment(layout, store);
printAlignment(svgFile, Raw(), layout, store, 1, 0, 150, 0, 36);

Page 28: Bridging the gap: Enabling top research in translational research - Knut Reinert

Unified Full-Text Indexing Framework

Available Indices

All indices support multiple strings and external memory construction/usage.

Index<TSeq, IndexEsa<> >

Index<StringSet<TSeq>, FMIndex<> >

Suffix Trees:

• suffix array

• enhanced suffix array

• lazy suffix tree

Prefix Trie:

• FM-index

q-Gram Indices:

• direct addressing

• open addressing

• gapped

Index Lookup Interface

All indices support the (sequential) find interface:

Finder<TIndex> finder(index);
while (find(finder, "TATAA"))
    cout << "Hit at position " << position(finder) << endl;
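
What an index-based find does can be sketched with a plain suffix array in standard C++. This is not SeqAn code: `buildSuffixArray` and `findAll` are illustrative names, and the construction is the naive sort-all-suffixes approach, which is only suitable for small inputs.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array construction by sorting all suffix start positions.
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// All start positions of `pattern` in `text`, found by binary search for the
// block of suffixes that begin with `pattern`.
std::vector<std::size_t> findAll(const std::string& text,
                                 const std::vector<std::size_t>& sa,
                                 const std::string& pattern) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&text](std::size_t suffix, const std::string& p) {
            return text.compare(suffix, p.size(), p) < 0;  // suffix prefix < pattern
        });
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&text](const std::string& p, std::size_t suffix) {
            return text.compare(suffix, p.size(), p) > 0;  // pattern < suffix prefix
        });
    std::vector<std::size_t> hits(lo, hi);
    std::sort(hits.begin(), hits.end());  // report positions in text order
    return hits;
}
```

After `auto sa = buildSuffixArray(text);`, a call such as `findAll(text, sa, "TATAA")` returns every hit position, comparable to looping over `find(finder, "TATAA")` above.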

Page 29: Bridging the gap: Enabling top research in translational research - Knut Reinert

Unified Full-Text Indexing Framework

String Tree Interface

Suffix/prefix trees can be accessed with iterators that support different traversals:

• top-down

• depth-first search

• random

Iterator<TIndex, TopDown<> >::Type it(index);
goDown(it, 'a');

[Figure: suffix tree]

Advanced Index Algorithms

Repeat search iterators

• for maximal/supermaximal repeats and maximal unique matches

Pattern search with backtracking

• parallel exact/approximate search of multiple patterns (tree vs. tree)

q-gram filters

• counting/seed filters for approximate pattern search and local alignments
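
Counting filters rest on the q-gram lemma: a pattern of length n that occurs with at most k errors shares at least n + 1 - (k + 1) * q q-grams with its occurrence. The sketch below applies that bound to a single candidate window in standard C++; the function names are illustrative and this is not how SeqAn's filters are implemented.

```cpp
#include <map>
#include <string>

// q-gram lemma threshold for a pattern of length n and at most k errors.
// A value <= 0 means the filter cannot exclude anything.
long long qGramThreshold(std::size_t n, std::size_t q, std::size_t k) {
    return static_cast<long long>(n) + 1 - static_cast<long long>((k + 1) * q);
}

// Number of q-grams shared by a and b, counted with multiplicity.
std::size_t sharedQGrams(const std::string& a, const std::string& b, std::size_t q) {
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + q <= a.size(); ++i) ++counts[a.substr(i, q)];
    std::size_t shared = 0;
    for (std::size_t i = 0; i + q <= b.size(); ++i) {
        auto it = counts.find(b.substr(i, q));
        if (it != counts.end() && it->second > 0) {
            --it->second;  // each q-gram of a can be matched only once
            ++shared;
        }
    }
    return shared;
}

// A window can only contain a match with at most k errors if it shares
// enough q-grams with the pattern; windows below the threshold are discarded.
bool passesCountingFilter(const std::string& pattern, const std::string& window,
                          std::size_t q, std::size_t k) {
    return static_cast<long long>(sharedQGrams(pattern, window, q))
           >= qGramThreshold(pattern.size(), q, k);
}
```

Windows that pass still have to be verified by an alignment step, since the lemma gives a necessary condition only.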

Page 30: Bridging the gap: Enabling top research in translational research - Knut Reinert

Applications

Page 31: Bridging the gap: Enabling top research in translational research - Knut Reinert

STELLAR – exact local aligner

... finds all maximal ε-matches in short time.

Filtering (Index module), applied to sequence 1 and sequence 2:

Index <TSeq, IndexQGram>

Finder<Tsequence,Swift<SwiftLocal> >

Verification

Page 32: Bridging the gap: Enabling top research in translational research - Knut Reinert

STELLAR – exact local aligner

Verification

Alignment module: Align <Tseq>, LocalAlignmentEnumerator<TScore, Banded>

Seeds module: Seed <Simple>, extendSeed(seed, ...,GappedXDrop())

Page 33: Bridging the gap: Enabling top research in translational research - Knut Reinert

Stellar – exact local aligner

Stellar is based on a SWIFT filter and allows epsilon threshold matches with X-drops
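
The X-drop idea can be shown with a simplified, ungapped extension in standard C++. Note the assumptions: the slides name a gapped variant (`GappedXDrop`), whereas this sketch allows no gaps, and `extendRightXDrop` is a made-up helper rather than a SeqAn function.

```cpp
#include <string>

// Extend a match to the right, starting at a[i] and b[j]. The extension stops
// once the running score has fallen more than `xDrop` below the best score
// seen so far. Returns the length of the best-scoring extension.
std::size_t extendRightXDrop(const std::string& a, std::size_t i,
                             const std::string& b, std::size_t j,
                             int match, int mismatch, int xDrop) {
    int score = 0, best = 0;
    std::size_t len = 0, bestLen = 0;
    while (i + len < a.size() && j + len < b.size()) {
        score += (a[i + len] == b[j + len]) ? match : mismatch;
        ++len;
        if (score > best) {
            best = score;
            bestLen = len;  // remember the best-scoring prefix
        } else if (best - score > xDrop) {
            break;  // dropped too far below the best score: give up
        }
    }
    return bestLen;
}
```

A single mismatch inside a long run of matches is bridged, while a run of mismatches ends the extension at the last position that improved the score.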

Page 34: Bridging the gap: Enabling top research in translational research - Knut Reinert

Exact vs. Heuristics

E-value 6 x 10^-84 not found by standard BLAST

Page 35: Bridging the gap: Enabling top research in translational research - Knut Reinert

SplazerS: split read mapping

SplazerS is based on a SWIFT filter and allows large indels

Combination of split matches

Page 36: Bridging the gap: Enabling top research in translational research - Knut Reinert

SplazerS: split read mapping

Acceptable speed and very sensitive

Page 37: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn Performance

Page 38: Bridging the gap: Enabling top research in translational research - Knut Reinert

Masai read mapper

Page 39: Bridging the gap: Enabling top research in translational research - Knut Reinert

Masai read mapper

Algorithm is based on the simultaneous traversal of two string indices (e.g., FM-index, Enhanced suffix array, Lazy suffix tree)

Index of reads (Radix tree of seeds)

Index of genome (e.g. FM-index)

[Figure: reads (ACGCTTCATCGCCCT…) and genome (Chr. 1, Chr. 2, Chr. X)]

Page 40: Bridging the gap: Enabling top research in translational research - Knut Reinert

Read Mapping: Masai

Faster and more accurate than BWA and Bowtie2

Timings on a single core

Page 41: Bridging the gap: Enabling top research in translational research - Knut Reinert

No bias in SNP/Sequencing error

Page 42: Bridging the gap: Enabling top research in translational research - Knut Reinert

Easily exchange index….

Page 43: Bridging the gap: Enabling top research in translational research - Knut Reinert

What about multi-core implementation?

Collaboration to parallelize indices and verification algorithms in SeqAn, to speed up any applications making use of indices

Page 44: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn going parallel

GOAL

Parallelize the finder interface of SeqAn so it works on CPU and accelerators like GPU

Will be replaced by hg18 and 10 million 20-mers

Page 45: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn going parallel

Construct FM-index on reverse genome

Set # OMP threads

Call generic count function

Page 46: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn going parallel : NVIDIA GPUs

Copy needles and index to GPU

SAME count function as on CPU !

Page 47: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn going parallel

Count occurrences of 10 million 20-mers in the human genome using an FM-index

I7,3.2 GHz: 18.6 sec (1 X)

I7,3.2 GHz, …12...: 2.66 sec (7 X)

Intel Xeon Phi 7120, 244 threads: 2.18 sec (8.5 X)

NVIDIA Tesla K20: 0.4 s (47 X)
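
Counting k-mer occurrences with an FM-index comes down to backward search. The toy index below, in standard C++, shows the principle under clearly simplified assumptions: naive construction, an uncompressed occurrence table, a text that contains no '$', and the textbook backward search on the forward text (the slides instead build the index on the reverse genome). `TinyFmIndex` and `countAll` are illustrative names, not SeqAn's.

```cpp
#include <algorithm>
#include <array>
#include <numeric>
#include <string>
#include <vector>

class TinyFmIndex {
public:
    explicit TinyFmIndex(std::string text) {
        text.push_back('$');  // sentinel; assumed smaller than all text characters
        size_ = text.size();
        std::vector<std::size_t> sa(size_);
        std::iota(sa.begin(), sa.end(), std::size_t{0});
        std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });
        // occ_[c][i] = occurrences of c in the first i characters of the BWT.
        occ_.assign(256, std::vector<std::size_t>(size_ + 1, 0));
        std::array<std::size_t, 256> freq{};
        for (std::size_t i = 0; i < size_; ++i) {
            const unsigned char c =
                static_cast<unsigned char>(text[(sa[i] + size_ - 1) % size_]);  // BWT[i]
            for (std::size_t a = 0; a < 256; ++a) occ_[a][i + 1] = occ_[a][i];
            ++occ_[c][i + 1];
            ++freq[c];
        }
        // first_[c] = number of characters in the text smaller than c.
        std::size_t sum = 0;
        for (std::size_t a = 0; a < 256; ++a) { first_[a] = sum; sum += freq[a]; }
    }

    // Number of occurrences of `pattern`, via backward search over the
    // half-open suffix array interval [lo, hi).
    std::size_t count(const std::string& pattern) const {
        std::size_t lo = 0, hi = size_;
        for (auto it = pattern.rbegin(); it != pattern.rend() && lo < hi; ++it) {
            const unsigned char c = static_cast<unsigned char>(*it);
            lo = first_[c] + occ_[c][lo];
            hi = first_[c] + occ_[c][hi];
        }
        return hi > lo ? hi - lo : 0;
    }

private:
    std::size_t size_ = 0;
    std::vector<std::vector<std::size_t>> occ_;
    std::array<std::size_t, 256> first_{};
};

// Each needle is counted independently of the others, which is what makes
// this workload easy to spread over threads or accelerator cores.
std::vector<std::size_t> countAll(const TinyFmIndex& index,
                                  const std::vector<std::string>& needles) {
    std::vector<std::size_t> counts;
    counts.reserve(needles.size());
    for (const std::string& needle : needles) counts.push_back(index.count(needle));
    return counts;
}
```

The time per needle depends on the needle length rather than on the genome length, so the total cost of a batch is dominated by the number of needles.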

Page 48: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn going parallel

Approx. count occurrences of 1.2 million 33-mers in the human genome using an FM-index

I7,3.2 GHz: 66.1 s (1 X)

I7,3.2 GHz, …12...: 9.0 s (7.3 X)

Intel Xeon Phi 7120, 244 threads: 3.9 s (16.9 X)

NVIDIA Tesla K20: 3.2 s (20.7 X)

Page 49: Bridging the gap: Enabling top research in translational research - Knut Reinert

Workflow integration

Page 50: Bridging the gap: Enabling top research in translational research - Knut Reinert

Generic workflow nodes

Page 51: Bridging the gap: Enabling top research in translational research - Knut Reinert

Library Integration

• Give every tool a self-describing output format: semantic annotation of its inputs/outputs

• In OpenMS and SeqAn we developed CTD (Common Tool Description) for this purpose

• Each tool can thus ‘tell’ its requirements and options in a coherent format

• All interfaces are fully described by this format

• Information on a tool's options and I/O is entirely contained within the individual tool (maintenance!)

Page 52: Bridging the gap: Enabling top research in translational research - Knut Reinert

CTD Format

General tool description in header

<tool name="MasaiMapper" version="0.7.1 [14053]"

docurl="http://www.seqan.de" category="Read Mapping" >

<executableName>masai_mapper</executableName>

<description>Masai Mapper</description>

<manual>Masai is a fast and accurate read mapper based on

approximate seeds and multiple backtracking.

See http://www.seqan.de/projects/masai for more information.

(c) Copyright 2011-2012 by Enrico Siragusa.

</manual>

<cli>

<clielement optionIdentifier="--write-ctd-file-ext" isList="false">

<mapping referenceName="masai_mapper.write-ctd-file-ext" />

</clielement>

........

Page 53: Bridging the gap: Enabling top research in translational research - Knut Reinert

Node generator

Generic workflow nodes can generate nodes to be used in e.g. KNIME.

https://github.com/genericworkflownodes

It is compatible with both internal and external tools.

This means ANY tool can be integrated in KNIME as long as it has a CTD.

Page 54: Bridging the gap: Enabling top research in translational research - Knut Reinert

Workflows Enabling Software Re-Use

External tools

Page 55: Bridging the gap: Enabling top research in translational research - Knut Reinert

Workflows Enabling Software Re-Use

Page 56: Bridging the gap: Enabling top research in translational research - Knut Reinert

SeqAn projects

People: David Weese, Björn Kahlert, Sabrina Krakau, Jochen Singer, Manuel Holtgrewe, Jialu Hu, Birte Kehr, Enrico Siragusa, Anne-Katrin Emde, Leon Kuchenbecker (Charité), Stephan Aiche, Kathrin Trappe, Oliver Stolpe (Charité), René Märker, Temesgen Dadi

Genome comparison (Evolutionary models due to new breakpoint distance, Local alignments to find synteny blocks)

Variant detection (SNPs, small indels, insert assembly, split read mapping)

Network analysis (Network alignment)

Metagenomics and bisulphite mapping (BlastX replacement, Bisulphite analysis)

Usability and KNIME support

Data parallelism (Data parallel iterators, dynamic indices)

Parallelization of indices (Multicore, GPU, Xeon Phi), Error correction, Read mapping

Page 57: Bridging the gap: Enabling top research in translational research - Knut Reinert

THANK YOU for your attention

The OpenMS and SeqAn teams (Berlin, Tübingen, Zürich)

The KNIME team (Michael Berthold, Konstanz)

NVIDIA (Jacopo Pantaleoni, Jonathan Cohen)

www.seqan.de, www.openms.de

SeqAn Nvidia webinar: October 22nd 2013 at 9.00 AM Pacific time.