Imputation for genotyping by sequencing - Emma Huang

Imputation for genotyping by sequencing. Emma Huang, Chitra Raghavan, Ramil Mauleon, Karl Broman, Hei Leung. CSIRO MATHEMATICS, INFORMATICS AND STATISTICS AND FOOD FUTURES FLAGSHIP

Upload: australian-bioinformatics-network

Post on 22-Jan-2015

Category:

Science


DESCRIPTION

Genotyping-by-sequencing (GBS) technology has made dense genotyping cost-effective for many species. However, the high levels of missing data can result in a large loss of information. The popularity of GBS makes the development of efficient imputation approaches a priority. Here we consider imputation under the further difficulty caused by multi-parental experimental crosses. We present an approach to imputing founder genotypes which allows recovery of a large proportion of markers. Once these have been imputed, we compare three approaches to imputing progeny genotypes and apply our strategy to an eight-parent rice population to demonstrate the potential gain from imputation.

TRANSCRIPT

  • 1. Imputation for genotyping by sequencing. Emma Huang, Chitra Raghavan, Ramil Mauleon, Karl Broman, Hei Leung. CSIRO MATHEMATICS, INFORMATICS AND STATISTICS AND FOOD FUTURES FLAGSHIP

  • 2. CSIRO MATHEMATICS, INFORMATICS AND STATISTICS AND FOOD FUTURES FLAGSHIP
  • 3. Comparing Designs. FOAM 2014. [Figure: designs placed on axes of resolution/diversity and allele frequency/power. Labels: BC, F2, RIL, MAGIC, NAM, natural populations. Groupings: biparental crosses, experimental crosses.]
  • 4. MAGIC Wheat. [Figure labels: inbreeding; no mixing; 2 generations intercrossing; 3 generations intercrossing; double haploids.]
  • 5. MAGIC Arabidopsis. [Figure: diagram with lettered lines and a cross symbol (X).] Kover et al. PLoS Genet 2009
  • 6. Arabidopsis MAGIC. 19 founders, outcrossed for four generations. Lines from 342 F4 families selfed for 6 generations. Founder lines resequenced (60x coverage): ~3M SNPs. ~500 progeny sequenced (0.5x coverage): ~500K SNPs.
  • 7. MAGIC Rice. Indica X Japonica. Bandillo et al. Rice 2013 6:11. ~2000 lines selfed for 6-8 generations. Preliminary genotyping/phenotyping of 200 lines at S4. Further genotyping by sequencing (GBS) planned for S8 and founder lines.
  • 8. Organisms. 125 Mb, diploid; 17 Gb, hexaploid; 430 Mb, diploid.
  • 9. Major differences in resources. Arabidopsis: reference genomes, annotation, Rice: reference genome (japonica). Wheat: www.wheatgenome.org
  • 10. Genotypes. 60x founders, 0.5x progeny; 9K/90K SNP chips; low-coverage GBS, founders and progeny.
  • 11. Low-coverage GBS. Stretches of missing values where reads don't align. Arabidopsis: 0.5x coverage, 500K/3M SNPs, 17% of total. Rice: filtering process reduces 159,522 SNPs to 12,767 (8%). How do we make use of the genome structure to fill in the gaps in our knowledge?
  • 12. Genotype Imputation. Missing data (random). Comparison across studies (systematic).
        1 0 - 1 - 1
        1 0 - 1 - 1
        0 0 - 0 - 1
        1 1 - 1 - 0
        - - 1 - 0 -
        - - 1 - 1 -
  • 13. Typical approach. [Flowchart labels: high-density reference panel; phasing; low-density targets; HMM; pedigree; probabilities; phases; imputation.]
  • 14. History.
        Software      Release Date  Author    Institute
        (fast)PHASE   2001/2006     Stephens  Chicago
        MACH          2007          Abecasis  Michigan
        BEAGLE        2007          Browning  Washington
        AlphaImpute   2011          Hickey    Roslin
        IMPUTE(2)     2009/2012     Marchini  Oxford
        SHAPEIT(2)    2011/2013     Delaneau  CNAM
  • 15. Top-down. Reference Panel; Subj_ct 1; S_bj_ct 3; _ubj_c_ 2
  • 16. But what happens if our reference panel is incomplete?
        Spacing (/cM)  N    %MISS  %B    %M    %K
        1              200  30     93.7  96.3  79.8
        1              200  40     93.0  95.5  78.8
        1              200  50     92.0  94.8  77.5
        1              400  30     94.3  96.3  80.3
        1              400  40     93.8  95.5  79.4
        1              400  50     92.6  94.8  78.2
        2              200  30     96.7  98.3  83.5
        2              200  40     96.3  98.0  82.3
        2              200  50     95.4  97.6  80.8
        2              400  30     97.0  98.3  84.1
        2              400  40     96.5  98.0  83.1
        2              400  50     96.0  97.6  81.8
  • 17. Simplest solution: get more data. Higher coverage. Different platform. More replicates.
  • 18. Progeny; Fo_nder A; F_und_r B; Fo__der C; _oun__r D
  • 19. Very simple approach.
        Founder  A   1   1   -   1   1   1   0
                 B   1   -   1   0   1   -   1
                 C   1   -   -   0   1   -   0
                 D   -   1   1   1   -   0   1
        Founder  0   27  48  26  36  43  43  51
                 1   73  52  74  64  57  57  49
  • 20. Very simple approach.
        Founder  A   1   1   ?   1   1?  1   0
                 B   1   0   1   0   1?  ?   1
                 C   1   0   ?   0   1?  ?   0
                 D   0   1   1   1   0   0   1
        Founder  0   27  48  26  36  43  43  51
                 1   73  52  74  64  57  57  49
  • 21. More complicated version. Missing data in progeny. Recombination between markers. Genotyping error in progeny.
  • 22. Simulations. MAGIC 8-parent populations. Masked out founder values and progeny values. Varying marker density, sample size, missing %. Imputed founders and used those to impute all data.
  • 23. Simulations.
        Spacing (/cM)  N    %MISS  %F0   %FC   %FK
        1              200  30     46.9  100   86.6
        1              200  40     24.5  100   85.4
        1              200  50     9.8   99.6  83.9
        1              400  30     47.3  100   88.4
        1              400  40     24.9  100   87.4
        1              400  50     10.1  100   86.2
        2              200  30     47.1  100   90.7
        2              200  40     24.8  100   89.5
        2              200  50     10.0  100   87.8
        2              400  30     47.1  100   92.1
        2              400  40     24.9  100   91.3
        2              400  50     10.0  100   90.1
  • 24. Rice data. 178 F4 lines, 37240 markers after filtering. ~21% missing parents; 38% missing progeny. Masked data on Chr 1 from 1130 markers with full parent data. Simulated 22% missingness. 128 -> 1092 with 96% correctly imputed. For all markers, 25.2% imputed up to 92.7%.
  • 25. Wheat: requirement of map position. Arabidopsis: resequenced founders; detection of other variants? Density of markers. Level of missingness. Genotyping errors. Heterozygosity. Relevance to other populations?
  • 26. Thanks! CCI. Emma Huang, t +61 7 3833 5542, e [email protected]. COMPUTATIONAL INFORMATICS AND FOOD FUTURES FLAGSHIP. xkcd.com
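
The "Very simple approach" slides (19 and 20) can be illustrated with a minimal Python sketch. It is one possible reading of the slides, not the authors' implementation. It assumes that the two bottom rows of the slide-19 table give the percentage of progeny carrying allele 0 and allele 1 at each marker (the slide does not label them explicitly), that founders are inbred and contribute equally to the progeny, and that recombination and genotyping error can be ignored. The function name `impute_founders` and the tolerance `tol` are hypothetical.

```python
from typing import Dict, List, Optional

Allele = Optional[int]  # 0 or 1; None marks a missing founder call


def impute_founders(
    founders: Dict[str, List[Allele]],
    count0: List[float],
    count1: List[float],
    tol: float = 0.05,  # hypothetical tolerance, not taken from the slides
) -> Dict[str, List[Allele]]:
    """Fill missing founder alleles from progeny allele frequencies.

    Assumes each founder contributes equally, so the progeny frequency of
    allele 0 should sit near k/n when k of the n founders carry allele 0.
    """
    names = list(founders)
    n = len(names)
    imputed = {name: list(calls) for name, calls in founders.items()}

    for m in range(len(count0)):
        total = count0[m] + count1[m]
        if total == 0:
            continue  # no progeny information at this marker
        freq0 = count0[m] / total
        k = round(freq0 * n)  # implied number of founders with allele 0
        if abs(freq0 - k / n) > tol:
            continue  # frequency is not close to any k/n: leave as missing

        calls = [founders[name][m] for name in names]
        seen0 = sum(c == 0 for c in calls)
        missing = [name for name in names if founders[name][m] is None]
        need0 = k - seen0  # allele-0 founders still unaccounted for

        if need0 == 0:
            fill = 1  # every missing founder must carry allele 1
        elif need0 == len(missing):
            fill = 0  # every missing founder must carry allele 0
        else:
            continue  # ambiguous or inconsistent: leave as missing
        for name in missing:
            imputed[name][m] = fill
    return imputed


# Values transcribed from slide 19 ("-" written as None).
founders = {
    "A": [1, 1, None, 1, 1, 1, 0],
    "B": [1, None, 1, 0, 1, None, 1],
    "C": [1, None, None, 0, 1, None, 0],
    "D": [None, 1, 1, 1, None, 0, 1],
}
pct0 = [27, 48, 26, 36, 43, 43, 51]
pct1 = [73, 52, 74, 64, 57, 57, 49]

result = impute_founders(founders, pct0, pct1)
for name, calls in result.items():
    print(name, ["-" if c is None else c for c in calls])
```

With the slide-19 values and the default tolerance, this sketch assigns allele 0 to founder D at marker 1 and to founders B and C at marker 2. It leaves marker 3 unresolved because one allele-0 founder is implied but two founders are missing, and it leaves markers 5 and 6 unresolved because 43% is not close to a multiple of 1/4. Slide 20 handles marker 5 differently, showing a value for D and flagging the other founders with question marks, which this sketch does not attempt to reproduce.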