jonathan keith - exploring the structure of whole-genome conservation profiles using bayesian...
Post on 10-May-2015
TRANSCRIPT
1. Exploring the Structure of Whole-Genome Conservation Profiles using Bayesian Segmentation. Jonathan M Keith. Mathematical & Computational Biology Winter School, Brisbane, July 8, 2014
2. Outline. Introduction (DNA/RNA/protein/ncRNAs); genome segmentation; incorporating multiple data types into segmentation; Generalised Gibbs Sampler; applications: the proportion of functional sequence in genomes, investigating alternative splicing, complexity of Drosophila 3′ UTRs, non-coding RNAs in zebrafish, non-coding RNAs in Wolbachia, regions contributing to malaria pathogenicity and host specificity, transcription factor binding sites in zebrafish.
3. DNA
4. The Central Dogma of Molecular Biology
5. Gene structure
6. Non-coding RNA
7. Non-protein-coding RNA
8. Conservation
9. Bayesian Genome Segmentation. Input: a sequence of characters from a finite alphabet. Output: segmentations and classifications.
10. Segmenting what?
   GC content: binary, 1 = GC, 0 = AT
   Pairwise alignment: binary, 1 = match, 0 = mismatch
   Multiple sequence alignment: column-wise counts of the most frequent character, or column-wise parsimony score given a phylogeny
   Pairwise conservation + GC content
   Pairwise content + GC content + indel frequency
11. Segmenting an alignment. The alignment is encoded by a 32-character alphabet.
   Species 1: A A A A A A A A A A A A A A A A C C C C C C C C C C C C C C C C
   Species 2: A A A A C C C C G G G G T T T T A A A A C C C C G G G G T T T T
   Species 3: A C G T A C G T A C G T A C G T A C G T A C G T A C G T A C G T
   Encoding:  a b c d e f g h i j k l m n o p q r s t u v w x y z U V W X Y Z
   Example:
   Human: GCCGA--
   Mouse: GTC-A--
   Zf:    ATTAATG
   S:     xZxIaJJ
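The encoding on slide 11 can be sketched in Python as follows. This is a minimal sketch, not the changept implementation: the index formula (16 × species 1 + 4 × species 2 + species 3), the folding of columns whose first base is G or T onto their complements, and the reading of I and J as gap markers are all inferred from the slide's table and worked example, and the function names are hypothetical.

```python
# 32 symbols, in the order shown in the "Encoding" row of the slide.
ALPHABET = "abcdefghijklmnopqrstuvwxyzUVWXYZ"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_column(human, mouse, zebrafish):
    """Encode one alignment column as a single character."""
    # Assumption from the slide's example: a gap in the human row gives 'J'.
    if human == "-":
        return "J"
    # Assumption from the slide's example: a gap in another row gives 'I'.
    if mouse == "-" or zebrafish == "-":
        return "I"
    # Columns starting with G or T are complemented so that species 1 is
    # always A or C, which is why 32 letters cover all 64 columns.
    if human in "GT":
        human, mouse, zebrafish = (COMPLEMENT[b] for b in (human, mouse, zebrafish))
    index = 16 * BASE_INDEX[human] + 4 * BASE_INDEX[mouse] + BASE_INDEX[zebrafish]
    return ALPHABET[index]


def encode_alignment(human, mouse, zebrafish):
    """Encode three equal-length aligned rows column by column."""
    if not (len(human) == len(mouse) == len(zebrafish)):
        raise ValueError("aligned rows must have equal length")
    return "".join(encode_column(h, m, z) for h, m, z in zip(human, mouse, zebrafish))


print(encode_alignment("GCCGA--", "GTC-A--", "ATTAATG"))
```

Under these assumptions the call reproduces the string S = xZxIaJJ from the slide's example. Ambiguous bases such as N are not handled in this sketch.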
12. A Bayesian model. Binary sequences encoding matches and mismatches are assumed to have been generated by a process involving the following parameters:
   $S$: binary sequence
   $k$: number of change-points
   $c = (c_1, \dots, c_k)$: vector of change-points
   $\theta = (\theta_0, \dots, \theta_k)$: vector of conservation levels for each segment
   $\phi$: probability that any given sequence position is a change-point
   $g = (g_0, \dots, g_k)$: vector containing, for each segment, the number of the group to which it belongs
   $\alpha$: parameters of the beta distributions for each group
   $\pi$: proportion of segments in each group
   The algorithm generates segmentations $(k, c)$ and group parameters $(\pi, \alpha)$ and uses them to generate profiles for each group.
13. Model parameters. The Bayesian hierarchical model includes the parameters shown. Integrate over $\theta$ and $\phi$, sum over $g$, and sample $k$, $c$, $\pi$ and $\alpha$.
14.
15.
16. Example: Subset-based sampling [figure]
17. Generalized Gibbs sampler [figure]
18. Generalized Gibbs sampler. Notation:
   $\mathcal{X}$: target set
   $\mathcal{I}$: index set (move types)
   $\mathcal{U} \subseteq \mathcal{I} \times \mathcal{X}$
   $\mathcal{Q}(x)$: elements of $\mathcal{U}$ of the form $(i,x)$
   $Q_x$: transition matrix on $\mathcal{Q}(x)$
   $q_x$: stationary distribution for $Q_x$
   $\mathcal{R}(i,x)$: elements that can be reached from $x$ by move type $i$
19. Generalized Gibbs Sampler (discrete space). Starting with an arbitrary $U_0$, perform the following steps iteratively:
   1. [Q-step] Given $U_n = (i,x)$, generate $(j,x) \in \mathcal{Q}(x)$ using $Q_x((i,x), \cdot)$.
   2. [R-step] Given $(j,x)$, generate $W \in \mathcal{R}(j,x)$ using $R((j,x), \cdot)$.
   3. Let $U_{n+1} = W$.
   The R-step transition probabilities are
   $$R((i,x),(j,y)) = \frac{p(y)\, q_y(j)}{\sum_{(k,z) \in \mathcal{R}(i,x)} p(z)\, q_z(k)}.$$
20. For further details
21. Order of updates
22. Model Selection
23. Model selection: AIC approximation, BIC approximation, DICV. [formulas not recoverable from the transcript]
24. Genome-wide conservation levels. Oldmeadow et al. (2010) Mol. Biol. Evol. 27(4): 942-953
25. Investigating alternative splicing. Boyd et al. (2012) PLoS ONE 7(3): e33565
26. Complexity of Drosophila 3′ UTRs. Algama et al. (2014) PLoS ONE 9(5): e97336
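The generative process described on slide 12 can be illustrated by simulating from it. The following is a minimal sketch under stated assumptions, not the changept code: each position is independently a change-point with a fixed probability, each segment is assigned to a group according to the group proportions, the segment's conservation level is drawn from that group's beta distribution, and each character is then a match (1) with that probability. The function name, argument names and the example values are hypothetical.

```python
import random


def simulate_sequence(n, change_prob, group_props, beta_params, seed=0):
    """Simulate a binary match/mismatch sequence of length n.

    change_prob  -- probability that any given position is a change-point
    group_props  -- proportion of segments in each group (weights)
    beta_params  -- one (a, b) pair of beta parameters per group
    """
    rng = random.Random(seed)
    # Positions 1..n-1 can start a new segment.
    change_points = [pos for pos in range(1, n) if rng.random() < change_prob]
    bounds = [0] + change_points + [n]

    sequence, groups, levels = [], [], []
    for start, end in zip(bounds, bounds[1:]):
        # Group membership of this segment.
        group = rng.choices(range(len(group_props)), weights=group_props)[0]
        a, b = beta_params[group]
        # Conservation level of this segment, drawn from the group's beta.
        level = rng.betavariate(a, b)
        groups.append(group)
        levels.append(level)
        sequence.extend(1 if rng.random() < level else 0 for _ in range(end - start))
    return sequence, change_points, groups, levels


# Hypothetical example: a weakly conserved and a strongly conserved group.
seq, cps, grp, lev = simulate_sequence(
    n=2000, change_prob=0.005, group_props=[0.7, 0.3],
    beta_params=[(2.0, 5.0), (20.0, 2.0)],
)
print(len(seq), len(cps) + 1, "segments")
```

The sampler works in the opposite direction: given only the binary sequence, it draws segmentations and group parameters from their posterior distribution.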
27. Muscle Development Genes Project. Objective: method development in finding putative functional regions in 13 muscle genes: EYA1, SIX1, EYA4, SIX4, MYF5, WNT1, MYF6, WNT7a, MYOD1, PAX3, MYOG, PAX7, SHH.
28. Generating Input Sequence.
   1. Pairwise alignment
   Human: A A A A C C C C G G G G T T T T -    N
   Mouse: A C G T A C G T A C G T A C G T N    -
   Code:  a b c d e f g h i j k l m n o p skip I
   2. Multiple alignment
   In a 3-way alignment, columns with complementary bases are encoded using the same letters. Human indels are skipped. ZF/Mouse indels are encoded with I.
29. Results: EYA4. EYA4: 311,523 bp and 20 exons. Model selection. Conservation levels: proportion of characters a, f, k, p in segments.
30. Identification of conserved non-coding sequences. Identifying putative functional elements (PFEs) in 13 muscle genes. Application of changept on eya1, 3-way alignment (human, mouse, zebrafish). [figure: conservation, with levels labelled 50%, 65%, 45%]
31. Results 3-way.
   Columns: Gene; # PFEs; # PFEs matched with EvoFold; # PFEs matched with DNase footprints; # PFEs matched with fRNAdb nc transcripts
   EYA1: 6; 5; 6; 1 (Transcript: 3679, PFE: 169)
   EYA4: 2; 1
   SHH: 4; 1; 1
   PAX3: 9; 6; 4; 2 (Transcript: 1521, PFE: 126, 127)
   PAX7: 6; 4; 3
   MYF5: 1; 1
   SIX4: 1; 1
   TOTAL: 29; 17; 16; 3
32. Muscle Genes Project: Lab Results. PCR results: expression was determined from pooled 24 hpf zebrafish cDNA.
33. Bacterial Genome Project: wMel & wPip. Aligned using the Mauve alignment program, which takes genomic rearrangements into account. Model selection: AIC, DICV. Conservation levels.
34. wMel & wPip: Results. Reference species: wMel (1.27 million bp and 1309 genes). 11 tRNAs and 19 non-coding regions were identified with thresholds:
   1. Conservation > 0.95
   2. Profile value > 0.75
   3. Segment length > 50 bp
   WIG profile of the most conserved group.
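The three thresholds on slide 34 amount to a simple filter over candidate segments. The sketch below is illustrative only: the `Segment` record, its field names and the example values are hypothetical, and treating the profile value as a single per-segment number is an assumption; only the three cut-offs come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int           # 0-based start position (hypothetical convention)
    end: int             # exclusive end position
    conservation: float  # conservation level of the segment
    profile: float       # profile value for the most conserved group

    @property
    def length(self):
        return self.end - self.start


def passes_thresholds(seg, min_conservation=0.95, min_profile=0.75, min_length=50):
    """Apply the three cut-offs from the slide as strict inequalities."""
    return (
        seg.conservation > min_conservation
        and seg.profile > min_profile
        and seg.length > min_length
    )


# Made-up segments, one passing and three failing one threshold each.
candidates = [
    Segment(0, 120, 0.97, 0.90),     # passes all three
    Segment(200, 230, 0.99, 0.95),   # too short
    Segment(300, 400, 0.90, 0.90),   # conservation too low
    Segment(500, 600, 0.96, 0.50),   # profile value too low
]
selected = [seg for seg in candidates if passes_thresholds(seg)]
print([(seg.start, seg.end) for seg in selected])
```

Keeping the cut-offs as keyword arguments makes it easy to check how sensitive the list of candidate regions is to each threshold.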
35. Identification of conserved non-coding sequences. Identifying putative ncRNAs in two bacterial genomes, wMel and wPip. Identified 17 tRNAs, 2 rRNAs, 2 ncRNAs, 2 pseudogenes and 19 intergenic regions with no previous annotations. Discovery of small non-coding RNAs from the obligate intracellular bacterium Wolbachia pipientis (target journal: Parasites and Vectors). [figure labels: tRNA, ncRNA]
36. wMel & wPip: Results. changept was mainly used to search for 6 intergenic regions in the wMel genome, identified as being transcribed using the RACE (Rapid Amplification of cDNA Ends) method. Currently conducting lab experiments.
37. Genomic regions contributing to malaria pathogenicity and host specificity. The shared evolutionary history that has honed the ability of malaria parasites to use haemoglobin as a fundamental resource, coupled with complex and divergent vertebrate immune systems that work to preclude access to the resource, suggests that malarial genomes are a mosaic of conserved and divergent regions that reflect this tug-of-war. Portions of the genome that are directly affected or recognized by the host's immune systems or are involved in infecting a host cell are likely to be divergent across parasites that specialise on different host species. On the other hand, portions that are fundamental for the parasite's ability to use haemoglobin as a primary resource are likely to be conserved. We apply a Bayesian segmentation model to a three-way whole-genome alignment of Plasmodium falciparum (human malaria), P. reichenowi (chimpanzee malaria), and P. gallinaceum (chicken malaria). Seeking novel regions relevant to drug targets.
38. Identifying TFBS. Regions upstream of the zebrafish muscle development genes mentioned earlier that were identified as conserved contain regulatory elements. Currently attempting to use the technique to identify TFBS genome-wide.
39. Acknowledgments.
   Queensland University of Technology: Kerrie Mengersen, Chris Oldmeadow
   University of Queensland: Peter Adams, Dirk Kroese, Darryn Bryant, Benjamin Goursaud, Rachel Crehange
   Institute for Molecular Biosciences: John Mattick, Stuart Stephen
   Monash University: Manjula Algama, Edward Tasker, Robert Bryson-Richardson, Caitlin Johnston, Adam Parslow, Beth McGraw, Jean Popovici, Meg Woolfit, Anders Goncalves da Silva
   Australian Research Council Research Grants DP0879308, DP1095849
   Email: jonathan.keith@monash.edu