Post on 08-Mar-2021
Indika Kahanda, Buwani Manuweera, Brendan Mumey
Gianforte School of Computing, Montana State University
Bozeman, MT, USA
Alan Cleary, Joann Mudge, Thiruvarangan Ramaraj
National Center for Genome Resources, Santa Fe, NM, USA
Pangenome-wide association studies (PWAS) with frequented regions
Pangenomic data
• Increasingly common to sequence multiple genomes per species.
• ...creating pangenomic data sets.
Compressed de Bruijn Graphs
Slide from: T. Beller, E. Ohlebusch, “Efficient construction of a compressed de Bruijn graph for pan-genome analysis”
Compressed de Bruijn graph: 9 strains of Bacillus anthracis, k=25
Slide from: S. Marcus, “splitMEM: graphical pan-genome analysis with suffix skips”
• Software:
– SplitMem (uses suffix trees): S. Marcus, H. Lee, and M. C. Schatz. Splitmem: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476–3483, 2014.
– E-SplitMem (uses FM-index): T. Beller and E. Ohlebusch. A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms for Molecular Biology, 11(1):20, 2016.
Pangenomic graphs
genomic sequences
…can have millions of vertices and edges
The data:
– A cDB graph G
– A set of paths P within G
A frequented region (FR) is a tuple (C, S):
– C is a set of de Bruijn nodes
– S is a set of (α, κ)-supporting subpaths
Parameters: k, α, κ
[Figure: a supporting subpath p[i,j] passing through the C nodes; labels: gap ≤ κ, ≥ α|C|]
NB: need to also consider reverse-complement support
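The support condition can be sketched in code. This is a minimal illustration, not the authors' implementation, and it assumes one reading of the figure labels: a subpath supports C if it starts and ends on C nodes, visits at least α|C| distinct C nodes, and never runs through more than κ consecutive non-C nodes. Measuring the gap in nodes (rather than base pairs) and the function name `supports` are assumptions; reverse-complement support is not handled.

```python
def supports(path, i, j, cluster, alpha, kappa):
    """Return True if path[i..j] (inclusive) supports `cluster`.

    Assumed reading of (alpha, kappa)-support; gaps are counted in nodes.
    """
    sub = path[i:j + 1]
    # Assumption: a supporting subpath begins and ends on a cluster node.
    if not sub or sub[0] not in cluster or sub[-1] not in cluster:
        return False
    # Must visit at least alpha * |C| distinct cluster nodes.
    visited = {v for v in sub if v in cluster}
    if len(visited) < alpha * len(cluster):
        return False
    # No run of consecutive non-cluster nodes may exceed kappa.
    gap = 0
    for v in sub:
        if v in cluster:
            gap = 0
        else:
            gap += 1
            if gap > kappa:
                return False
    return True
```

For example, with path a, x, b, y, z, c and C = {a, b, c}, the whole path supports C when κ = 2 but not when κ = 1, because the run y, z has length 2.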
Our FR Algorithm
• Basic idea: find FRs in a bottom-up, agglomerative fashion:
– Each de Bruijn node starts in its own cluster.
– Repeat: merge best pair of clusters.
[Figure: example graph with nodes a, b, c, d, e, f, g; candidate cluster pair marked "Merge?"]
Merge process = maximum weight matching: fast parallel approximation algorithms exist
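One merge round can be sketched with a greedy matching, a standard 1/2-approximation to maximum weight matching. This is only an illustration of the idea: the slides do not say which approximation algorithm is used, the `score` function (the gain from merging two clusters) is a hypothetical placeholder supplied by the caller, and scoring all pairs as done here is a simplification.

```python
from itertools import combinations

def greedy_matching(edge_weights):
    """Greedy approximate maximum weight matching.

    edge_weights maps (u, v) pairs to weights. Pairs are taken heaviest
    first, skipping any pair that shares an endpoint with one already taken.
    """
    matched, pairs = set(), []
    # Ties are broken by the pair itself so the result is deterministic.
    for (u, v), w in sorted(edge_weights.items(), key=lambda kv: (-kv[1], kv[0])):
        if w > 0 and u not in matched and v not in matched:
            matched.update((u, v))
            pairs.append((u, v))
    return pairs

def merge_round(clusters, score):
    """Merge matched cluster pairs once.

    clusters: list of frozensets of node ids.
    score(a, b): hypothetical merge gain; non-positive means "do not merge".
    """
    weights = {
        (i, j): score(clusters[i], clusters[j])
        for i, j in combinations(range(len(clusters)), 2)
    }
    pairs = greedy_matching(weights)
    used = {i for pair in pairs for i in pair}
    merged = [clusters[i] | clusters[j] for i, j in pairs]
    untouched = [c for i, c in enumerate(clusters) if i not in used]
    return merged + untouched
```

Calling `merge_round` repeatedly until no pair has positive score gives the bottom-up clustering loop; merging many disjoint pairs per round is what makes a matching formulation attractive.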
Running time: O(LV + V² lg V)
– V = # of cDB vertices
– L = total length of all genomic sequences in P
(NB: no dependence on # of sequences)
A. Cleary, T. Ramaraj, I. Kahanda, J. Mudge and B. Mumey, "Exploring Frequented Regions in Pan-Genomic Graphs," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018. doi: 10.1109/TCBB.2018.2864564
Uses for FRs
• FRs identify syntenic regions
• We have been exploring the following:
– Visualizing pan-genomic space
– Machine Learning with FRs as features
FRs to visualize pangenomes…
Yeast insertion for alcohol tolerance: 17 kb insertion on chromosome XIV
[Figure legend: Alcohol (wine, sake, ale, bioethanol); Laboratory; Bakery; Other]
Machine Learning
• We propose to use FRs as features for:
– describing existing genomes
– making inferences on unseen genomes.
Strain  FR1  FR2  …  FRn  Label
A       1    0    …  1    0.6
B       0    1    …  1    0.2
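A presence/absence table like the one above could be assembled as follows. This is a small sketch under stated assumptions: the input format (a mapping from strain name to the set of FR identifiers found in it) and the function name `fr_feature_matrix` are illustrative, not taken from the authors' software.

```python
def fr_feature_matrix(strain_frs, labels):
    """Build a binary FR feature matrix.

    strain_frs: dict mapping strain name -> set of FR ids present in it.
    labels: dict mapping strain name -> phenotype value.
    Returns (strains, fr_ids, X, y) with one row of X per strain.
    """
    # Column order: every FR seen in any strain, sorted for reproducibility.
    fr_ids = sorted(set().union(*strain_frs.values()))
    strains = sorted(strain_frs)
    X = [[1 if fr in strain_frs[s] else 0 for fr in fr_ids] for s in strains]
    y = [labels[s] for s in strains]
    return strains, fr_ids, X, y
```

Each row of `X` is then the feature vector for one strain, and `y` holds the matching phenotype values for a regression model.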
Back to pangenomics…
Can we use FR content for an ML task like phenotype regression?
[Figure, panel A: # Ortholog Groups (y-axis) vs. # Sharing Accession (x-axis, 1 to 13); legend: Accession-Specific, HM101, HM058, HM056, HM125, HM129, HM034, HM095, HM060, HM185, HM004, HM050, HM023, HM010]
[Figure, panel B: # Ortholog Groups (y-axis) vs. # Genomes Sequenced (x-axis); legend: Pan-proteome, Core-proteome]
Medicago truncatula (450 Mb genome)
• Model legume
• Whole genome duplications
• High level of rearrangements and gene family expansions
Test case: Phenotype regression with FRs
• 100 yeast genome dataset:
– Strope, Pooja K., et al. “The 100-genomes strains, an S. cerevisiae resource that illuminates its natural phenotypic and genotypic variation and emergence as an opportunistic pathogen.” Genome Research 25.5 (2015): 762-774.
…studied SNP-phenotype associations for 49 phenotypes
SNPs vs FRs for phenotype regression
• 5-fold cross validation:
need to tune parameters:
SNP-based regression, Lithium Chloride folds:
Fold  Slope  R²
1     0.48   0.0943
2     0.76   0.058
3     1.1    0.343
4     0.85   0.206
5     1      0.432
[Figure: scatter plot of Original (x-axis) vs. Predicted (y-axis), colored by fold]
Regression done with a sparse Bayesian mixed model using the GEMMA tool
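The per-fold slope and R² values reported for the folds can be computed from held-out predictions as sketched below. This does not reproduce the GEMMA model; it only shows the fold split and the summary statistics. It assumes the slope is that of a least-squares line fitting predicted values against original values, which the slides do not state explicitly, and the function names are illustrative.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def slope_and_r2(original, predicted):
    """Least-squares slope of predicted on original, and squared correlation.

    Assumes both lists have equal length and neither is constant.
    """
    n = len(original)
    mx = sum(original) / n
    my = sum(predicted) / n
    sxx = sum((x - mx) ** 2 for x in original)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(original, predicted))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)
```

For each fold, a model would be trained on the strains outside the fold, used to predict the phenotype of the strains inside it, and `slope_and_r2` applied to the held-out original and predicted values. A slope near 1 with a high R² indicates predictions that track the measured phenotype.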
FRs show improvement over SNPs:
Conclusions
• FRs capture the notion of frequented regions or junctions in the cDB graph.
• PWAS – pangenome-wide association studies:
– Identifying core and adapted gene sets quickly
– Visualization
FutuRe work
• Scale up to larger plant and human pangenomic data sets.
• Interested in new collaborations!
• Acknowledgements: Supported in part by:
– NSF-ABI award 1542262
– NSF-IOS award 1444806
– NSF-DBI award 1759522
– USDA-ARS project funding for the Legume Information System
– Google Summer of Code
Questions?