pan-genome graphs biodata14


Upload: andrew-warren

Post on 02-Jul-2015


DESCRIPTION

Pan-genome graphs for bacteria and the web.

TRANSCRIPT

Page 1: Pan-genome Graphs biodata14
Page 2: Pan-genome Graphs biodata14

[Figure: graphSVG.svg]

Page 3: Pan-genome Graphs biodata14

Background

• “Pan-genome”: a way to think about, compute on, and visualize the differences and similarities of many genomes at once

• Reference free structure

• Many, many genomes

Page 4: Pan-genome Graphs biodata14

de Bruijn Graph Construction

• Dk = (V, E)

– V = All length-k subfragments

– E = Directed edges between consecutive subfragments

• Nodes overlap by k-1 words

• Locally constructed graph reveals the global sequence structure

• Overlaps between sequences implicitly computed
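The definition of Dk can be written out a little more formally. The block below is one common way to state it, not necessarily the exact formulation on the original slide; the symbols s, n and i are introduced here for illustration, and the symbols of s may be words (as in the example fragment) or nucleotides.

```latex
% Assumed notation: s = s_1 s_2 \dots s_n is one input fragment, k \le n.
\[
V = \{\, s_i s_{i+1} \cdots s_{i+k-1} \;:\; 1 \le i \le n-k+1 \,\}
\]
\[
E = \{\, (\, s_i \cdots s_{i+k-1},\; s_{i+1} \cdots s_{i+k} \,) \;:\; 1 \le i \le n-k \,\}
\]
% The two endpoints of every edge share the k-1 symbols s_{i+1} \cdots s_{i+k-1}.
% For several fragments, V and E are the unions over all fragments.
```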

Slide: http://cbcb.umd.edu/confcour/CMSC828H-materials/Lecture12-MSchatz-DeBruijnAssembly.pptx

Original Fragment: It was the best of

Directed Edge: It was the best → was the best of

de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001

Page 5: Pan-genome Graphs biodata14

Strategy: find all k-mers, build graph

• Every k-mer becomes a node

• Two nodes are linked with an edge if they share a k-1 mer

Sequence: GACTGGGACTCC

6-mers: GACTGG, ACTGGG, CTGGGA, TGGGAC, GGGACT, GGACTC, GACTCC
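The strategy on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `kmers` and `build_graph` are made up here, and edges are added only between k-mers that are consecutive in the input sequence (which is what gives them a k-1 overlap).

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all length-k windows of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(seq, k):
    """Build a node-centric de Bruijn-style graph.

    Every k-mer is a node; consecutive k-mers, which overlap by
    k-1 symbols, are joined by a directed edge.
    """
    path = kmers(seq, k)
    edges = defaultdict(set)
    for left, right in zip(path, path[1:]):
        edges[left].add(right)
    return set(path), dict(edges)


# The sequence and k from the slide.
nodes, edges = build_graph("GACTGGGACTCC", 6)
for left in kmers("GACTGGGACTCC", 6)[:-1]:
    print(left, "->", sorted(edges[left]))
```

Because every 6-mer in this sequence is distinct, the graph is a single path of seven nodes; repeated k-mers would introduce branches or cycles.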

Page 6: Pan-genome Graphs biodata14

Strategy: k-mers from feature families, build graph

• Every k-mer becomes a node

– If it is present in m genomes

• Two nodes are linked with an edge if they share a k-1 mer

• d# = a feature family

Feature-family order: d1 d2 d3 d4 d5 d6 d7 d8 d9

6-mers: d1d2d3d4d5d6, d2d3d4d5d6d7, d3d4d5d6d7d8, d4d5d6d7d8d9
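The same idea applied to feature families might look like the sketch below. It rests on several assumptions that are not stated on the slide: "present in m genomes" is read as "present in at least m genomes", orientation and strand are ignored, and the genome names, the family `d10`, and the function names are hypothetical.

```python
from collections import defaultdict


def family_kmers(families, k):
    """Return all windows of k consecutive feature families as tuples."""
    return [tuple(families[i:i + k]) for i in range(len(families) - k + 1)]


def build_pangenome_graph(genomes, k, m):
    """genomes maps a genome name to its ordered list of family IDs.

    A k-mer becomes a node only if it occurs in at least m genomes
    (assumed reading of the slide). Two retained nodes are linked when
    they are consecutive in some genome, i.e. overlap by k-1 families.
    """
    support = defaultdict(set)  # k-mer -> names of genomes containing it
    for name, families in genomes.items():
        for kmer in family_kmers(families, k):
            support[kmer].add(name)

    nodes = {kmer for kmer, names in support.items() if len(names) >= m}

    edges = set()
    for families in genomes.values():
        path = family_kmers(families, k)
        for left, right in zip(path, path[1:]):
            if left in nodes and right in nodes:
                edges.add((left, right))
    return nodes, edges


# Hypothetical input: two genomes that differ only in their last family.
genomes = {
    "genomeA": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"],
    "genomeB": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d10"],
}
nodes, edges = build_pangenome_graph(genomes, k=6, m=2)
print(len(nodes), "nodes,", len(edges), "edges")
```

With m=2 the k-mers ending in d9 or d10 are dropped because each occurs in only one genome; lowering m to 1 would keep them as a fork at the end of the shared path.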

Page 7: Pan-genome Graphs biodata14

rf-graph de Bruijn “like”

Page 8: Pan-genome Graphs biodata14

Create pg-graph

Page 9: Pan-genome Graphs biodata14

Similarities and Differences

10 groups of 10

Organism         Sum Pairwise Distances (Phylogenetic)
E. coli          0.07
Coxiella         10.42
Mycobacterium    2.70
Brucella         0.08
Rickettsia       8.62
Burkholderia     7.21
Clostridium      9.05
Bacillus         4.48
Staph.           2.08
Strep.           4.79

Page 10: Pan-genome Graphs biodata14

Similarities and Differences

Node Increase = (Nodes – Max(Families)) / Nodes

Diversity Score = Sum of maximum pairwise distances in Order level tree

MUMi = Maximum of all pairwise MUMi in a group

[Figure: Node Increase vs. MUMi. x-axis: MUMi (0.850 to 1.000); y-axis: Node Increase (0 to 0.9)]

[Figure: Node Increase vs. Diversity Score. x-axis: Diversity Score (0.00 to 12.00); y-axis: Node Increase (0 to 0.9)]
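The Node Increase formula on this slide is simple arithmetic and can be written as a small helper. The sketch assumes that Max(Families) means the largest number of feature families found in any single genome of the group; the slide does not spell this out, and the numbers in the example are invented for illustration only.

```python
def node_increase(node_count, families_per_genome):
    """Compute (Nodes - Max(Families)) / Nodes.

    node_count: number of nodes in the graph built for the group.
    families_per_genome: family counts, one per genome in the group
    (assumed meaning of "Families" on the slide).
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return (node_count - max(families_per_genome)) / node_count


# Hypothetical group of three genomes; these are not values from the talk.
example = node_increase(5000, [4000, 4200, 4100])
print(round(example, 3))
```

Under this reading, a value near 0 means the graph is barely larger than the largest single genome, while a value near 1 means most nodes come from content not shared with that genome.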

Page 11: Pan-genome Graphs biodata14

Layout

• Gephi Toolkit

• Yifan Hu’s Multilevel

• Force Atlas 2

Page 12: Pan-genome Graphs biodata14

Colors and Lines

Page 13: Pan-genome Graphs biodata14

Dealing with many Genomes

N=2, K=5, M=2, B. abortus

N=40, K=5, M=2, B. suis

N=20, K=5, M=2, Brucella

N=400, K=5, M=2, All Brucella

N=1000, K=10, M=100, E. coli

Page 14: Pan-genome Graphs biodata14

Information Compounded

Page 15: Pan-genome Graphs biodata14

For the Web

• GEXF

– NetworkX, Gephi, Cytoscape, Gexf-JS, D3-Gexf

• BGZF GFF

– Backing store

– Byte range loading
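To make the GEXF bullet concrete, here is a minimal sketch that serializes a small directed graph as GEXF 1.2 using only the Python standard library. It is not the exporter used for the talk: the function name `to_gexf` and the example nodes are assumptions, and only the bare node and edge elements of the format are written (no attributes, colors or positions).

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(nodes, edges):
    """Serialize a directed graph to a GEXF 1.2 string.

    nodes: list of unique string labels.
    edges: iterable of (source_label, target_label) pairs.
    """
    ids = {label: str(i) for i, label in enumerate(nodes)}
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "directed"}
    )
    nodes_el = ET.SubElement(graph, "nodes")
    for label, node_id in ids.items():
        ET.SubElement(nodes_el, "node", {"id": node_id, "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for i, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(i), "source": ids[source], "target": ids[target]},
        )
    return ET.tostring(root, encoding="unicode")


# Three consecutive 6-mers from the earlier GACTGGGACTCC example.
gexf_text = to_gexf(
    ["GACTGG", "ACTGGG", "CTGGGA"],
    [("GACTGG", "ACTGGG"), ("ACTGGG", "CTGGGA")],
)
print(gexf_text)
```

If NetworkX is available, `networkx.write_gexf` writes the same format directly from a graph object, and the resulting file can be opened in Gephi.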

Page 16: Pan-genome Graphs biodata14

Other Uses

• “Rearrangement” detection

Page 17: Pan-genome Graphs biodata14

Other Uses

• “Scaffolding”

– e.g. 86 contigs

• Closing

– Predicted primers

Page 18: Pan-genome Graphs biodata14

Other Uses

• Rearrangements

– Insertions/Deletions

– Islands

– Inversions

Page 19: Pan-genome Graphs biodata14

Other Uses

• Synthetic BAM

Page 20: Pan-genome Graphs biodata14

Takeaways

• A new way to leverage protein family databases

• “Reference free” structure for many bacterial genomes using feature families

• Quickly investigate whole genome relationships and speed up potentially expensive calculations

Page 21: Pan-genome Graphs biodata14

Acknowledgements

• Eric Nordberg

• Lenny Heath

• CID at VBI (PATRIC)

• RAST – Argonne (PATRIC)

https://github.com/aswarren

https://twitter.com/aswarren