The data deluge driven by Next Generation Sequencing is transforming the life sciences and their computational needs

Simon Rasmussen, Assistant Professor

Center for Biological Sequence Analysis, Department of Systems Biology

Technical University of Denmark

[email protected]

[Figure: labels of bacterial reference genomes (among others Helicobacter, Campylobacter, Fusobacterium, cyanobacteria such as Synechococcus and Prochlorococcus, Bifidobacterium, Mycobacterium, Bacteroides, Prevotella, Chlamydia, Borrelia, Thermotoga and Deinococcus) interleaved with entries labelled GU... (for example GU234V1.CD36.0 and GU474MH0006).]


Life science data deluge

• Massive unstructured data from several areas: DNA, patient journals, proteomics, imaging, ...

• Impacts industry, environment and health

• Societal grand challenges

• Cheap sequencing technologies result in an explosion of DNA data

What does DNA do?

How to make a car? Car blueprint

How to make a human? DNA

DNA contains the information on how to create an organism!

DNA: strings...GGATCAGCTGACTCGCCTGGCTCTGAGCCCCGCCGCCGCGCTCGGGCTCCGTCAGTTTCCTCGGCAGCGGTAGGCGAGAGCACGCGGAGGAGCGTGCGCGGGGGCCCCGGGAGACGGCGGCGGTGGCGGCGCGGGCAGAGCAAGGACGCGGCGGATCCCACTCGCACAGCAGCGCACTCGGTGCCCCGCGCAGGGTCGCGATGCTGCCCGGTTTGGCACTGCTCCTGCTGGCCGCCTGGACGGCTCGGGCGCTGGAGGTACCCACTGATGGTAATGCTGGCCTGCTGGCTGAACCCCAGATTGCCATGTTCTGTGGCAGACTGAACATGCACATGAATGTCCAGAATGGGAAGTGGGATTCAGATCCATCAGGGACCAAAACCTGCATTGATACCAAGGAAGGCATCCTGCAGTATTGCCAAGAAGTCTACCCTGAACTGCAGATCACCAATGTGGTAGAAGCCAACCAACCAGTGACCATCCAGAACTGGTGCAAGCGGGGCCGCAAGCAGTGCAAGACCCATCCCCACTTTGTGATTCCCTACCGCTGCTTAGTTGGTGAGTTTGTAAGTGATGCCCTTCTCGTTCCTGACAAGTGCAAATTCTTACACCAGGAGAGGATGGATGTTTGCGAAACTCATCTTCACTGGCACACCGTCGCCAAAGAGACATGCAGTGAGAAGAGTACCAACTTGCATGACTACGGCATGTTGCTGCCCTGCGGAATTGACAAGTTCCGAGGGGTAGAGTTTGTGTGTTGCCCACTGGCTGAAGAAAGTGACAATGTGGATTCTGCTGATGCGGAGGAGGATGACTCGGATGTCTGGTGGGGCGGAGCAGACACAGACTATGCAGATGGGAGTGAAGACAAAGTAGTAGAAGTAGCAGAGGAGGAAGAAGTGGCTGAGGTGGAAGAAGAAGAAGCCGATGATGACGAGGACGATGAGGATGGTGATGAGGTAGAGGAAGAGGCTGAGGAACCCTACGAAGAAGCCACAGAGAGAACCACCAGCATTGCCACCACCACCACCACCACCACAGAGTCTGTGGAAGAGGTGGTTCGAGAGGTGTGCTCTGAACAAGCCGAGACGGGGCCGTGCCGAGCAATGATCTCCCGCTG...

Human: 3 billion characters

Bacteria: 4 million characters

Virus: 10 thousand characters

Alphabet: A, C, G and T
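To give a feel for these sizes, the sketch below packs a DNA string into 2 bits per character and estimates the raw size of each genome. The 2-bit code and the function names are assumptions made for illustration; real sequencing files also carry read names and quality scores, so they are considerably larger.

```python
# Illustrative sketch: a four-letter alphabet fits in 2 bits per character.
# The encoding below is an assumption for this example, not a file format.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases characters from packed data."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n_bases)
    )


def packed_megabytes(n_bases: int) -> float:
    """Size in megabytes (10**6 bytes) at 2 bits per base."""
    return n_bases * 2 / 8 / 1e6


if __name__ == "__main__":
    for name, n in [("Human", 3_000_000_000),
                    ("Bacteria", 4_000_000),
                    ("Virus", 10_000)]:
        print(f"{name}: {packed_megabytes(n):.4f} MB")
```

Under this assumption a single human genome is well under a gigabyte; the data volumes discussed in the following slides come from sequencing each position many times over.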

Some sequencing examples

• Sequence 1000s of human genomes: who are we, who are you, drug effects, diseases, cancers, ...

• Sequence environmental samples, thousands of different bacteria: novel enzymes, the human microbiome, bacteria producing electricity

The sequencing data avalanche

• Computer speed and storage capacity double every 18 months, and this rate is steady

• DNA sequence data has been doubling every 6-8 months over the last 3 years!
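The gap between the two doubling times can be made concrete with a short calculation. This is a minimal sketch that assumes clean exponential growth over the 3 years mentioned above; the function name is made up for the example.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Growth after `months` when the quantity doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    horizon = 36  # the 3 years mentioned on the slide
    print(f"compute/storage (18 months): x{growth_factor(horizon, 18):.0f}")
    print(f"DNA data (8 months):         x{growth_factor(horizon, 8):.1f}")
    print(f"DNA data (6 months):         x{growth_factor(horizon, 6):.0f}")
```

Over 36 months, hardware grows about 4-fold while the data grows roughly 23- to 64-fold, so the data outpaces the hardware by about an order of magnitude.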

Distributed data production

Worldwide: >900 centers

Data transfer and storage become an issue

GenomeDK (KU, DTU, AU)

We ship hard drives...

>60 Pb per year

http://omicsmaps.com
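A rough estimate shows why shipping drives can be competitive with network transfer. The data sizes and link speeds below are hypothetical values chosen for illustration, not figures from the talk, and the function name is an assumption.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link.

    `efficiency` is the fraction of the nominal link speed actually
    achieved (1.0 = ideal); it is an assumed parameter.
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    one_tb = 1e12  # hypothetical data set of 1 terabyte
    for label, speed in [("100 Mbit/s", 1e8), ("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10)]:
        print(f"1 TB over {label}: {transfer_hours(one_tb, speed):.1f} h")
```

At tens of terabytes per project, even an ideal 1 Gbit/s link is busy for days, which is the situation the slide describes.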



What does this mean?

• First human genome draft in 2001, final 2004

• Estimated costs $3 billion, time 13 years

• Today: 1 week, $8000

• Towards $1000 genome

Storage and analysis

The highest cost is not the experiment but storage and analysis

A standard human whole-genome sequencing experiment (30-40x coverage) creates about 150 Gb of compressed data

High strain on I/O: reading and writing GBs to TBs
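The "30-40x" figure is the coverage: the average number of times each genome position is sequenced. The sketch below relates coverage to the number of reads, using the 100-character read length and 3.2 billion character genome quoted on the alignment slide; the function names are made up for the example.

```python
def reads_needed(genome_size: int, coverage: int, read_length: int) -> int:
    """Number of reads so that reads * read_length >= coverage * genome_size."""
    return (genome_size * coverage + read_length - 1) // read_length


def coverage_from_reads(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each genome position is covered."""
    return n_reads * read_length / genome_size


if __name__ == "__main__":
    genome = 3_200_000_000
    for cov in (30, 40):
        print(f"{cov}x coverage: {reads_needed(genome, cov, 100):,} reads")
    print(f"1.2 billion reads: {coverage_from_reads(1_200_000_000, 100, genome)}x")
```

The 1.2 billion reads quoted for the human data correspond to 37.5x coverage, inside the 30-40x range.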

Analysis: Two basic approaches

• Alignment: we compare our data to a known genome

• De novo assembly: the genome is unknown, so we must construct it ourselves

• Algorithm development is very dynamic - code optimization is no longer vital

• What we used 2 years ago we don’t use today!

Alignments: Human data

• Using a known reference genome to assemble our data

• Where does each of the 100-character strings match in the genome?

• Originally hash-based algorithms - problem: high memory demand and slow

1.2 billion 100-character DNA strings

3.2 billion character genome
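A toy version of the hash-based approach illustrates where the memory demand comes from: every k-mer position of the genome is stored in a table. This is a minimal seed-and-verify sketch under simplifying assumptions (seed on the first k characters only, mismatches but no gaps); it does not reproduce any particular aligner, and all names are hypothetical.

```python
from collections import defaultdict


def build_index(genome: str, k: int) -> dict:
    """Hash table from every k-mer in the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read: str, genome: str, index: dict, k: int,
             max_mismatches: int = 1) -> list:
    """Look up the read's first k-mer, then verify the full read at each hit.

    Returns a list of (position, mismatches) tuples.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the genome
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    genome = "GGATCAGCTGACTCGCCTGGCTCTGAGC"  # start of the sequence shown earlier
    index = build_index(genome, k=4)
    print(map_read("ACTCGCCTGG", genome, index, k=4))
    print(map_read("ACTCGCCTGA", genome, index, k=4))  # one mismatch at the end
```

The index holds one entry per genome position, so for a 3.2 billion character genome the table alone runs to many gigabytes.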

BWT alignment

• Burrows-Wheeler transform (BWT, known from bzip2)

• Reversible transformation that rearranges a character string into runs of similar characters

• Important because genomes tend to have many similar strings!

• Combined with suffix arrays to quickly find all possible matches

• High speed, high precision, low memory usage
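The points above can be illustrated with a toy implementation: the transform, its inverse, and a backward search that counts matches using only the transformed string. This is a sketch for short inputs. It sorts all rotations and recounts characters on every step, whereas production aligners build the index from a suffix array and use precomputed occurrence tables. The function names are assumptions.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    assert "$" not in text, "'$' is reserved as the end marker"
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str) -> str:
    """Invert the transform by repeatedly prepending and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


def count_matches(last: str, pattern: str) -> int:
    """Backward search: count occurrences of pattern in the original text."""
    counts = Counter(last)
    first_row = {}  # first_row[c] = number of characters smaller than c
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    lo, hi = 0, len(last)  # current range of sorted rotations
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + last[:lo].count(c)
        hi = first_row[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                        # equal characters end up adjacent
    print(inverse_bwt(transformed))           # the transform is reversible
    print(count_matches(transformed, "ana"))  # matches found without the text
```

The search narrows a range of sorted rotations one pattern character at a time, so its cost depends on the length of the read rather than on the length of the genome.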

Human project example

• 51 human genomes from around the world

• Compute cluster resources used: >20 CPU-years, 43 Tb storage

• >30 algorithms/software packages used for this project alone

• Application pipelines (sequential code and data flow) - need versatile compute facilities!

• Time to solution is key

• Competing with Stanford University, we could outperform them on compute time - we publish

• Time matters! - Accessibility matters!

De novo assembly algorithms

• If no genome is known for a species, we need to make it

• Graph theory - de Bruijn graphs

• Example: polar bear

• Raw data: ~4 billion 100-character DNA strings

• I.e. a total of 400 billion characters!

• Originally: all vs. all comparisons - no chance
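A quick count shows why all-vs-all comparison is hopeless at this scale. The comparison rate used below is an assumed round number chosen for illustration, not a measurement.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n reads."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    n_reads = 4_000_000_000
    read_length = 100
    print(f"total characters: {n_reads * read_length:,}")

    pairs = pairwise_comparisons(n_reads)
    print(f"read pairs to compare: {pairs:,}")

    rate = 1e9  # assumed: one billion read-vs-read comparisons per second
    years = pairs / rate / (365 * 24 * 3600)
    print(f"time at that rate: about {years:.0f} years")
```

The work grows quadratically with the number of reads, which is why assemblers moved to graph structures built in a single pass over the data.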

De Bruijn graph assemblers

• Directed graph of overlapping items (here DNA sequences)

• The graph is created in one pass over the data, and the assembly is obtained by walking an Eulerian path

• Lots of RAM required (up to 2 Tb or more)

• Data is unstructured - each data string can be placed anywhere in the graph!

• High strain on communication between nodes in SMP systems!
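The two steps named above, building the graph in one pass and walking an Eulerian path, can be sketched on a toy example. This sketch assumes error-free reads and a genome without repeats, represents each distinct k-mer as one edge between its (k-1)-character prefix and suffix, and uses Hierholzer's algorithm for the walk. Real assemblers also track coverage and handle errors, repeats and both DNA strands. All names are made up for the example.

```python
from collections import Counter, defaultdict


def de_bruijn(reads, k):
    """One pass over the reads: each distinct k-mer becomes one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once."""
    in_degree = Counter(v for targets in graph.values() for v in targets)
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(
        (u for u, targets in graph.items() if len(targets) - in_degree[u] == 1),
        next(iter(graph)),
    )
    edges = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # hypothetical overlapping reads
    print(assemble(reads, k=4))
```

Each read only touches the nodes for its own k-mers, and those can lie anywhere in the graph, which is the access pattern behind the memory and communication demands listed above.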

Example genome de Bruijn graphs

... sequence length from a reference assembly. The read lengths need only be over the k-mer length to generate a reasonable assembly (in theory, k must be over 15 bp, though in practice 19 is the lowest sensible k-mer, and larger k-mers are always better, although at the expense of having to generate more coverage to support these large k-mer sizes).

The first assembler to exploit this technology was Roche’s 454 assembler, Newbler, which adapted the scheme specifically to handle the main source of error in 454 sequencing, namely ambiguity in the length of homopolymer runs. In late 2007 and early 2008, several second-generation de Bruijn graph assemblers were released for very short reads, compatible with the Solexa technology, including SHARCGS [27], VCAKE [28], VELVET [29], EULER-SR [30], EDENA [31], ABySS [32] and ALLPATHS [33]. Some of these methods, such as VELVET, EULER-SR and ABySS, explicitly use de Bruijn graphs, whereas other methods implicitly explore a de Bruijn graph, for example constrained by read-pair behavior, as in ALLPATHS. The methods differ in how they treat errors and to what extent they use read-pair information. Read pairs are defined as two short DNA sequence reads generated from different ends of a longer DNA molecule (for example, 35-bp reads generated from both ends of a 500-bp fragment). One does not know the identity of the sequence between the read pairs, but one usually has an estimate of the length of the intervening sequence. As it is only marginally more expensive to generate short reads in read-pair format than as single reads, extremely high coverage of read pairs is routinely available. The more advanced de Bruijn graph assemblers [29,30,32,33] can use read pairs to provide long assemblies. A particular challenge has been the two-base-encoding ‘color space’ of ABI SOLiD technology. In this two-base encoding, a single error produces a systematic translation error on all subsequent decoding of the bases for the rest of the read. In the context of an alignment, such an encoding scheme can be integrated into the alignment routine, and there is an argument that the double base encoding provides better discrimination between errors and observed differences. In de novo assembly, however, there is no reference. The solution has been to perform the assembly directly in color space and then ‘key’ the resulting color space assembly to one of the four feasible base-pair assemblies using either a small amount of traditional sequence or the presence of a known base at the start of each SOLiD read.

Whichever sequencing technology and assembly method are used, the ability to provide long assemblies critically requires that at least a proportion of the read pairs are longer than the longest common near-identical repeat in the genome. This varies considerably between genomes. Bacterial genomes often have only a handful of near-identical repeats longer than 200 bp (Fig. 3), whereas complex genomes, such as the human, usually have their repeat length determined by whether there has been an active LINE or SINE transposable element (usually around 4 kb in length for the former and between 500 bp and 1 kb for the latter). As the ability to produce longer read pairs (also referred to as ‘mate pairs’ to distinguish them from the shorter read pairs) has only recently been optimized for next-generation technologies, assemblies of complex genomes have been rare.

The other main barrier for large, complex genome assemblies is the memory overhead for these methods. Although the de Bruijn data structure is compressed, all the methods use some sort of adjunct data structures in addition to the core de Bruijn graph to map the reads to the graph. These adjunct structures are critical for leveraging additional information required for accurate assemblies, such as read pair information.

[Figure 3, panel a: 1. Sequencing (for example, Solexa or 454); 2. Hashing; 3. Simplification of linear stretches; 4. Error (tip and bubble) removal. Graph features labelled: linear stretches, tips, bubble. Panel b: full graph.]

Figure 3 | Constructing and visualizing a de Bruijn graph of a DNA sequence. (a) An example de Bruijn graph assembly for a short genomic sequence without polymorphism. Sequence at top represents the genome, which is then sampled using shotgun sequencing in base space with 7-bp reads (step 1). Some of the reads have errors (red). In step 2, the k-mers in the reads (4-mers in this example) are collected into nodes and the coverage at each node is recorded. There are continuous linear stretches within the graph, and the sequencing errors create distinctive, low-coverage features throughout the graph. In step 3, the graph is simplified to combine nodes that are associated with the continuous linear stretches into single, larger nodes of various k-mer sizes. In step 4, error correction removes the tips and bubbles that result from sequencing errors and creates a final graph structure that accurately and completely describes the original genome sequence. (b) A full de Bruijn graph from a bacterial genome that shows the general lack of repetitive structure within the entire genome.
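Step 2 of the caption (collecting k-mers and recording their coverage) and the low-coverage signature of sequencing errors can be sketched as follows. The genome, the reads and the coverage threshold are hypothetical, chosen to mirror the 7-bp reads and 4-mers of the figure. A plain coverage cut-off is a simplification of the tip and bubble removal performed by real assemblers.

```python
from collections import Counter


def kmer_coverage(reads, k):
    """Count how many times each k-mer is observed across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_coverage(counts, min_coverage):
    """Separate well-supported k-mers from low-coverage, likely erroneous ones."""
    solid = {kmer for kmer, c in counts.items() if c >= min_coverage}
    weak = set(counts) - solid
    return solid, weak


if __name__ == "__main__":
    # Hypothetical genome ATGGCGTGCA, every 7-bp read sampled twice,
    # plus one read (TGGAGTG) carrying a single sequencing error.
    correct = ["ATGGCGT", "TGGCGTG", "GGCGTGC", "GCGTGCA"]
    reads = correct * 2 + ["TGGAGTG"]
    counts = kmer_coverage(reads, k=4)
    solid, weak = split_by_coverage(counts, min_coverage=2)
    print("solid:", sorted(solid))
    print("weak: ", sorted(weak))
```

A single wrong base creates up to k new k-mers, each seen only once, which is why errors show up as distinctive low-coverage features in the graph.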

Source: Nature Methods Supplement, Vol. 6, No. 11s, November 2009, p. S10 (Review)

Figure 2.9: Graphical representation of the de Bruijn graph of the Streptococcus suis genome

In this representation, node sequences are represented as curves, which connect at their tips.

2.7.2 Local reference based visualisation

It quickly became clear that troubleshooting some of the algorithms described in the following chapters would require some adequate visualisation techniques. Given that most tests were done on species with a known reference, it was possible to use this sequence as a guide to the graph’s complexity.

The first solution consisted in following the path of the reference sequence through the graph, recording the properties of the nodes being traversed. Figure 2.10 represents the length and multiplicity of successive nodes on the reference path.

In this diagram, long contigs are interrupted by two types of breaks. Sometimes, two long contigs are separated by a very short, isolated fea-

Simple genome

A bit more complex genome

>1 Tb RAM

580 days of compute

>5 Tb storage

Conclusions

• The data deluge is fundamentally changing life science and the computational resources it requires

• Analysis requires High Performance Computing facilities: CPU, memory, storage, I/O and fast data links

• Time to solution - need accessible compute resources

• Dynamic algorithm development - very fast algorithm turnaround

• A need for shared compute (cloud) and storage facilities - computable storage

Acknowledgements

• Center for Biological Sequence Analysis (DTU)

• Søren Brunak

• John Damm Sørensen

• Bent Petersen

