pangenome-wide association studies (pwas) with frequented … · 2020. 12. 31. · t. bellerand e....

Indika Kahanda, Buwani Manuweera, Brendan Mumey

Gianforte School of Computing, Montana State University

Bozeman, MT, USA

Alan Cleary, Joann Mudge, Thiruvarangan Ramaraj

National Center for Genome Resources, Santa Fe, NM, USA

Pangenome-wide association studies (PWAS) with frequented regions

Pangenomic data

• Increasingly common to sequence multiple genomes per species.

• ...creating pangenomic data sets.

Compressed de Bruijn Graphs

Slide from: T. Beller, E. Ohlebusch, “Efficient construction of a compressed de Bruijn graph for pan-genome analysis”

Compressed de Bruijn graph: 9 strains of Bacillus anthracis, k=25

Slide from: S. Marcus, “splitMEM: graphical pan-genome analysis with suffix skips”

• Software:
– splitMEM (uses suffix trees): S. Marcus, H. Lee, and M. C. Schatz. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476–3483, 2014.
– E-SplitMem (uses FM-index): T. Beller and E. Ohlebusch. A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms for Molecular Biology, 11(1):20, 2016.

Pangenomic graphs

genomic sequences

…can have millions of vertices and edges

The data:
– A cDB graph G
– A set of paths P within G

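A frequented region (FR) is a tuple (C, S):
– C is a set of de Bruijn nodes
– S is a set of (α, κ)-supporting subpaths
Parameters: k, α, κ

[Figure: a subpath p[i,j] passing through the nodes of C; each gap between C nodes is ≤ κ, and the subpath covers ≥ α|C| of the nodes in C]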
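The (α, κ) conditions above can be checked directly on a single path. The following is a minimal sketch, not the authors' implementation: it assumes a path is a list of node identifiers, that a supporting subpath must start and end on a node of C, and that a "gap" is a run of consecutive nodes outside C. The function name `is_supporting_subpath` is hypothetical, and reverse-complement support is not handled.

```python
from typing import Hashable, Sequence, Set


def is_supporting_subpath(
    path: Sequence[Hashable],
    i: int,
    j: int,
    cluster: Set[Hashable],
    alpha: float,
    kappa: int,
) -> bool:
    """Check whether path[i..j] (inclusive) is an (alpha, kappa)-supporting subpath.

    Assumed conditions (an interpretation of the slide, not a quoted definition):
      * the subpath starts and ends on a node of `cluster`;
      * no run of consecutive non-cluster nodes is longer than `kappa`;
      * the subpath visits at least alpha * |cluster| distinct cluster nodes.
    """
    if not cluster or not (0 <= i <= j < len(path)):
        return False
    if path[i] not in cluster or path[j] not in cluster:
        return False

    seen = set()  # distinct cluster nodes visited so far
    gap = 0       # length of the current run of non-cluster nodes
    for node in path[i : j + 1]:
        if node in cluster:
            seen.add(node)
            gap = 0
        else:
            gap += 1
            if gap > kappa:
                return False
    return len(seen) >= alpha * len(cluster)


path = ["a", "x", "b", "y", "z", "c"]
cluster = {"a", "b", "c"}
print(is_supporting_subpath(path, 0, 5, cluster, alpha=1.0, kappa=2))  # True
print(is_supporting_subpath(path, 0, 5, cluster, alpha=1.0, kappa=1))  # False
```

In the example, the run "y", "z" is a gap of length 2, so the subpath supports C only when κ ≥ 2.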

NB: need to also consider reverse-complement support

Our FR Algorithm

• Basic idea: find FRs in a bottom-up, agglomerative fashion:
– Each de Bruijn node starts in its own cluster.
– Repeat: merge best pair of clusters.

Merge?

Merge process = maximum weight matching: fast parallel approximation algorithms exist
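The bottom-up idea can be illustrated with a toy version. This is a simplified sketch under stated assumptions, not the published algorithm: it merges one best pair per round (greedy) instead of computing a maximum weight matching, it scores a candidate merge by the number of maximal supporting subpaths of the merged cluster, and it stops when no merge reaches a hypothetical `min_support` threshold. The names `support` and `agglomerate` are illustrative only, and reverse-complement support is ignored.

```python
from itertools import combinations


def support(paths, cluster, alpha, kappa):
    """Count maximal subpaths, over all paths, that support `cluster`.

    A segment is closed when a run of non-cluster nodes exceeds `kappa`;
    it counts if it visited at least alpha * |cluster| distinct cluster nodes.
    """
    need = alpha * len(cluster)
    count = 0
    for path in paths:
        seen, gap = set(), 0
        for node in path:
            if node in cluster:
                seen.add(node)
                gap = 0
            elif seen:  # only count gaps after the segment has started
                gap += 1
                if gap > kappa:
                    if len(seen) >= need:
                        count += 1
                    seen, gap = set(), 0
        if seen and len(seen) >= need:
            count += 1
    return count


def agglomerate(paths, alpha, kappa, min_support):
    """Greedy agglomeration: every node starts as its own cluster."""
    nodes = sorted({n for p in paths for n in p})
    clusters = [frozenset([n]) for n in nodes]
    while True:
        best = None
        for a, b in combinations(clusters, 2):
            s = support(paths, a | b, alpha, kappa)
            if s >= min_support and (best is None or s > best[0]):
                best = (s, a, b)
        if best is None:  # no pair is worth merging
            return clusters
        _, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]


paths = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
print(agglomerate(paths, alpha=1.0, kappa=0, min_support=3))
```

With these three toy paths, only the pair "a", "b" is traversed contiguously by all three paths, so it is the only merge that reaches a support of 3. Re-scoring every pair in every round is quadratic per round and is only meant to make the idea concrete.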

Running time: O(LV + V² lg V)
– V = # of cDB vertices
– L = total length of all genomic sequences in P

(NB: no dependence on # of sequences)

A. Cleary, T. Ramaraj, I. Kahanda, J. Mudge and B. Mumey, "Exploring Frequented Regions in Pan-Genomic Graphs," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. doi: 10.1109/TCBB.2018.2864564

Uses for FRs

• FRs identify syntenic regions
• We have been exploring the following:
– Visualizing pan-genomic space
– Machine learning with FRs as features

FRs to visualize pangenomes…

[Figure legend: Alcohol (wine, sake, ale, bioethanol); Laboratory; Bakery; Other]

17kb insertion on chromosome XIV

Yeast Insertion for Alcohol Tolerance:

Machine Learning

• We propose to use FRs as features for:
– describing existing genomes
– making inferences on unseen genomes.

Strain  FR1  FR2  …  FRn  Label
A       1    0    …  1    0.6
B       0    1    …  1    0.2
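A table like the one above can be built from per-strain FR membership. The sketch below assumes the FRs found in each strain are available as a set of FR identifiers; the function name `fr_feature_matrix` and the input layout are illustrative assumptions, not part of the original work.

```python
def fr_feature_matrix(strain_frs):
    """Build a binary strain-by-FR matrix.

    `strain_frs` maps a strain name to the set of FR identifiers present
    in that strain (hypothetical input format). Returns the sorted strain
    names, the sorted FR identifiers, and one 0/1 row per strain.
    """
    fr_ids = sorted({fr for frs in strain_frs.values() for fr in frs})
    strains = sorted(strain_frs)
    rows = [
        [1 if fr in strain_frs[strain] else 0 for fr in fr_ids]
        for strain in strains
    ]
    return strains, fr_ids, rows


strains, fr_ids, rows = fr_feature_matrix(
    {"A": {"FR1", "FRn"}, "B": {"FR2", "FRn"}}
)
print(fr_ids)  # ['FR1', 'FR2', 'FRn']
print(rows)    # [[1, 0, 1], [0, 1, 1]]
```

Each row is the feature vector for one strain; the phenotype label (0.6 and 0.2 in the table) would be kept in a separate list aligned with `strains`.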

Back to pangenomics…

Can we use FR content for an ML task such as phenotype regression?

[Figure: pan-proteome vs. core-proteome by # genomes sequenced; # sharing accession (1–13, with accession-specific shown separately) for accessions HM101, HM058, HM056, HM125, HM129, HM034, HM095, HM060, HM185, HM004, HM050, HM023, HM010]

Medicago truncatula (450 Mb genome)
– Model legume
– Whole genome duplications
– High level of rearrangements and gene family expansions

Test case: Phenotype regression with FRs

• 100 yeast genome dataset:
– Strope, Pooja K., et al. “The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen.” Genome Research 25.5 (2015): 762-774.

…studied SNP-phenotype associations for 49 phenotypes

SNPs vs FRs for phenotype regression

• 5-fold cross validation:

need to tune parameters:

SNP-based regression (per fold):
– Fold 1: slope = 0.48, R² = 0.0943
– Fold 2: slope = 0.76, R² = 0.058
– Fold 3: slope = 1.1, R² = 0.343
– Fold 4: slope = 0.85, R² = 0.206
– Fold 5: slope = 1, R² = 0.432

[Figure: predicted vs. original values by fold (Folds 1–5), lithium chloride phenotype]

Regression done with a sparse Bayesian mixed model using the GEMMA tool
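The per-fold evaluation can be sketched as follows. This is a minimal illustration under several assumptions: folds are contiguous index blocks, "slope" is the least-squares slope of predicted on original values, and R² is the squared Pearson correlation between them (the slides do not state how either was computed). The model is passed in as a hypothetical `fit` callable; the GEMMA model itself is not reproduced here.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def slope_and_r2(original, predicted):
    """Least-squares slope of predicted vs. original, and squared correlation."""
    n = len(original)
    mx = sum(original) / n
    my = sum(predicted) / n
    sxx = sum((x - mx) ** 2 for x in original)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(original, predicted))
    if sxx == 0 or syy == 0:
        raise ValueError("slope and R^2 are undefined for constant values")
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def cross_validate(X, y, fit, k=5):
    """Return one (slope, R^2) pair per fold.

    `fit(train_X, train_y)` must return a function mapping a feature
    vector to a predicted phenotype value (assumed interface).
    """
    results = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        original = [y[i] for i in test_idx]
        predicted = [predict(X[i]) for i in test_idx]
        results.append(slope_and_r2(original, predicted))
    return results
```

Any regression model can be plugged in through `fit`, whether its features are SNPs or FRs, which keeps the comparison between the two feature sets on identical folds.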

FRs show improvement over SNPs:

Conclusions

• FRs capture the notion of frequented regions or junctions in the cDB graph.

• PWAS – pangenome-wide association studies:
– Identifying core and adapted gene sets quickly
– Visualization

FutuRe work

• Scale up to larger plant and human pangenomic data sets.
• Interested in new collaborations!

• Acknowledgements: Supported in part by:
– NSF-ABI award 1542262
– NSF-IOS award 1444806
– NSF-DBI award 1759522
– USDA-ARS project funding for the Legume Information System
– Google Summer of Code

Questions?
