analysis of next generation sequence data biost 2055 04/06/2015
DESCRIPTION
Genetic Spectrum of Complex Diseases GWAS Sequencing LinkageTRANSCRIPT
![Page 1: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/1.jpg)
Analysis of Next Generation Sequence Data
BIOST 205504/06/2015
![Page 2: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/2.jpg)
Last Lecture
• Genome-wide association study has identified thousands of disease-associated loci
• Large consortium performs meta-analysis to further increase the sample size (power) to detect additional loci
• GWAS is limited by the chip design and rare variants are rarely explored
![Page 3: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/3.jpg)
Genetic Spectrum of Complex Diseases
GWASSequencing
Linkage
![Page 4: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/4.jpg)
Outline• Background
• From sequence data to genotype
• Rare Variant Tests
![Page 5: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/5.jpg)
Human Genome and Single Nucleotide Polymorphisms (SNPs)
• 23 chromosome pairs• 3 billion bases
• A single nucleotide change between pairs of chromosomes
• E.g.
• A/A or G/G homozygote • A/G heterozygote
Haplotype1: AAGGGATCCACHaplotype2: AAGGAATCCAC
![Page 6: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/6.jpg)
Association Study in Case Control Samples
CAGATCGCTGGATGAATCGCATCCGGATTGCTGCATGGATCGCATC
CAGATCGCTGGATGAATCGCATCCAGATCGCTGGATGAATCCCATC
CGGATTGCTGCATGGATCCCATCCGGATTGCTGCATGGATCCCATC
SNP2↓
SNP3↓
SNP4↓
SNP5↓
SNP1↓
Disease
![Page 7: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/7.jpg)
– Only subset of functional elements include common variants– Rare variants are more numerous and thus will point to additional loci
![Page 8: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/8.jpg)
History of DNA Sequencing
![Page 9: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/9.jpg)
Sequencing Cost
http://www.genome.gov/
![Page 10: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/10.jpg)
A Road to Discover Human Genome
1990-2003 2002 - 2008 -
![Page 11: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/11.jpg)
Current Genome Scale Approaches
• Deep whole genome sequencing– Expensive, only can be applied to limited samples currently– Most complete ascertainment of all variations
• Low coverage whole genome sequencing– Modest cost, typically 100-1000 samples– Complete ascertainment of common variations– Less complete ascertainment of rare variants
• Exome capture and targeted region sequencing– Modest cost, high coverage– Most interesting part of the genome
![Page 12: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/12.jpg)
Next Generation Sequencing• Commercial platforms produce gigabases of sequence rapidly
and inexpensively – ABI SOLiD, Illumina Solexa, Roche 454, Complete
Genomics, and others…
• Sequence data consist of thousands or millions of short sequence reads with moderate accuracy 0.5 – 1.0% error rates per base may be typical
• High-throughput but hard to assemble
![Page 13: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/13.jpg)
A Typical PipelineShotgun Sequencing Reads
Single Marker Caller
Haplotype-basedCaller
Mapped Reads
Polymorphic Sites
Individual Genotypes
ReadAlignment Software
![Page 14: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/14.jpg)
Short read alignment
Sequencer
Reads from new sequencing machines are short: 30-400 bp
Human source
![Page 15: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/15.jpg)
Short read alignment
Sequencing machine
And you get MILLIONS of them
![Page 16: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/16.jpg)
Short read alignment
Need to map them back to human reference
![Page 17: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/17.jpg)
AlignmentReference sequence:
actgtagattagccgagtagctagctagtcgat
ccgagaagctag
Find best match for each read in a reference sequence
• Hashing is time and memory consuming for millions of reads and billion-base long reference
• Errors in reads• Each read may be mapped to multiple positions• Individual polymorphisms
![Page 18: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/18.jpg)
Existing Alignment by Category• Hashing reference genome
– SOAP1, MOSAIK, PASS, BFAST, …
• Hashing short reads– Eland, MAQ, SHRiMP, …
• Merge-sorting reference together with reads– Slider
• Based on Burrows-Wheeler Transform– BWA, SOAP2, Bowtie, …
Li and Durbin (2009), Bioinformatics 25 (14): 1754-60
![Page 19: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/19.jpg)
After Alignment
• Each read is mapped to reference genome with tolerated number of mismatches– Mismatches allow us to discover the individual
variation
• Each site of reference genome is covered by multiple un-evenly distributed reads– Some sites might not be covered
![Page 20: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/20.jpg)
Genome
Genome 1
Genome 2
Genome 3
Genome 4
Reads
Coverage (High vs Low)
VS
• Which one has more power to detect variations?
![Page 21: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/21.jpg)
Genotype Calling from Sequence Data
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’Reference Genome
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTCTAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Sequence Reads
Predicted GenotypeA/C or A/A or C/C
Observed Data2A and 3C
![Page 22: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/22.jpg)
A Simple ModelAt one site, Na reads carry A, Nb reads carry B
![Page 23: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/23.jpg)
Inference with no reads
Reference Genome
Sequence Reads
Possible Genotypes
P(reads|A/A)= 1.0
P(reads|A/C)= 1.0
P(reads|C/C)= 1.0
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
![Page 24: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/24.jpg)
Inference with short read data
Reference Genome
Sequence Reads
Possible Genotypes
P(reads|A/A)= P(C observed, read maps |A/A)
P(reads|A/C)= P(C observed, read maps |A/C)
P(reads|C/C)= P(C observed, read maps |C/C)
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
![Page 25: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/25.jpg)
Inference assuming error of 1%
Reference Genome
Possible Genotypes
P(reads|A/A)= 0.01
P(reads|A/C)= 0.50
P(reads|C/C)= 0.99
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
![Page 26: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/26.jpg)
As data accumulate …
Reference Genome
Possible Genotypes
P(reads|A/A)= 0.0001
P(reads|A/C)= 0.25
P(reads|C/C)= 0.98
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
![Page 27: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/27.jpg)
As data accumulate …
Reference Genome
Possible Genotypes
P(reads|A/A)= 0.000001
P(reads|A/C)= 0.125
P(reads|C/C)= 0.97
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
![Page 28: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/28.jpg)
As data accumulate …
Reference Genome
Possible Genotypes
P(reads|A/A)= 0.00000099
P(reads|A/C)= 0.0625
P(reads|C/C)= 0.0097
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
![Page 29: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/29.jpg)
In the “end”
Reference GenomeP(reads|A/A)= 0.00000098
P(reads|A/C)= 0.03125
P(reads|C/C)= 0.000097
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGAAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
TAGCTGATAGCTAGATAGCTGATGAGCCCGATATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
![Page 30: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/30.jpg)
Not the “end” yet
Reference GenomeP(reads|A/A) = 0.00000098
P(reads|A/C) = 0.03125P(reads|C/C) = 0.000097
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGAAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTG
ATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Making a genotype call requires combining sequence data with prior information
ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
![Page 31: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/31.jpg)
Not the “end” yet
Reference Genome
P(reads|A/A)= 0.00000098 Prior(A/A) = 0.00034 P(A/A|reads) < 0.01P(reads|A/C)= 0.03125 Prior(A/C) = 0.00066 P(A/C|reads) = 0.175P(reads|C/C)= 0.000097 Prior(C/C) = 0.99900 P(C/C|reads) = 0.825
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Base Prior: every site has 1/1000 probability of varying
ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
![Page 32: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/32.jpg)
Population Based Prior
Reference Genome
P(reads|A/A)= 0.00000098 Prior(A/A) = 0.04 P(A/A|reads) < .001P(reads|A/C)= 0.03125 Prior(A/C) = 0.32 P(A/C|reads) = 0.999P(reads|C/C)= 0.000097 Prior(C/C) = 0.64 P(C/C|reads) = <.001
Sequence Reads
5’-ACTGGTCGATGCTAGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGCTAGCTCGACG-3’
GCTAGCTGATAGCTAGCTAGCTGATGAGCCCGA
AGCTGATAGCTAGCTAGCTGATGAGCCCGATCGCTGATGCTAGCTGATAGCTAGCTAGCTGATGAGCC
TAGCTGATAGCTAGATAGCTGATGAGCCCGAT
Population Based Prior: Use frequency information from examining others at the same site. E.g. P(A) = 0.2
ATAGCTAGATAGCTGATGAGCCCGATCGCTGCTAGCTC
![Page 33: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/33.jpg)
Prior Information• Individual based prior
– Equal probability of showing polymorphism– 1/1000 bases different from reference– Error Free and Poisson distribution– Single sample, single site
• Population based prior– Estimate frequency from many individuals– Multiple sample, single site
• Haplotype/Imputation based prior– Jointly model flanking SNPs, use haplotype information– Important for low coverage sequence data– Multiple samples, multiple sites
![Page 34: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/34.jpg)
Comparisons of Different Genotype Calling Methods
![Page 35: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/35.jpg)
Rare Variant Tests
• Genotype calling is the first step of the journey
• Identify SNPs/genes associated with phenotype
• Sequencing provides more comprehensive way to study the genome– Discover more rare variants
![Page 36: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/36.jpg)
– Only subset of functional elements include common variants– Rare variants are more numerous and thus will point to additional loci
![Page 37: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/37.jpg)
Genetic Spectrum of Complex Diseases
GWASSequencing
McCarthy MI et al. Nat Rev Genet. 2008
![Page 38: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/38.jpg)
Several Approaches to Study Rare Variants
• Deep whole genome sequencing – Can only be applied to limited numbers of samples – Most complete ascertainment of variation
• Exome capture and targeted sequencing – Can be applied to moderate numbers of samples – SNPs and indels in the most interesting 1% of the genome
• Low coverage whole genome sequencing – Can be applied to moderate numbers of samples – Very complete ascertainment of shared variation
• New Genotyping Arrays and/or Genotype Imputation – Examine low frequency coding variants in 100,000s of samples – Current catalogs include 97-98% of sites detectable by sequencing an individual
![Page 39: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/39.jpg)
Single SNP Test for Rare Variant
• Rare variants are hard to detect
• Power/sample size depends on both frequency and effect size
• Rare causal SNPs are hard to identify even with large effect size
![Page 40: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/40.jpg)
Single SNP Test for Rare Variant
• Disease prevalence ~10%• Type I error 5x10-6
• To achieve 80% power • Equal number of cases and controls
• Minor Allele Frequency (MAF) = 0.1, 0.01, 0.001
• Required sample size = 486, 3545, 34322,
![Page 41: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/41.jpg)
Alternatives to Single Variant Test Collapsing Method (Burden Test)
• Group rare variants in the same gene/region
• Score each individual– Presence or absence of rare copy– Weight each variant
• Use individual score as a new “genotype”
• Test in a regression framework
![Page 42: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/42.jpg)
Challenges
• Disease is caused by multiple rare variants in an additive manner
• It is hard to separate causal and null SNPs– Including all rare variants will dilute the true signals
• The effect size of each rare variant varies
![Page 43: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/43.jpg)
Power of Burden Test
• Power tabulated in collections of simulated data
• Combining variants can greatly increase power
• Currently, appropriately combining variants is expected to be key feature of rare variant studies.
![Page 44: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/44.jpg)
Impact of Null Variants
• Including non-disease variants reduces power
• Power loss is manageable, combined test remains preferable to single marker tests
![Page 45: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/45.jpg)
Impact of Missing Disease Alleles
• Missing disease alleles loses power
• Still better than single variant test
![Page 46: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/46.jpg)
1. yi: quantitative or binary phenotypes;
2. α'Xi: fixed effects of covariates;
3. β'Gi: genetic effects from one gene consisted of SNPs;
4. εi: random error.
Sequence Kernel Association Test (SKAT)
0τ:H0 0:,,0~ Assume 0 HWN IE2,0~
![Page 47: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/47.jpg)
Sequence Kernel Association Test (SKAT)
• Regression based method
• Score statistic
• Kernel
'K GWG
![Page 48: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/48.jpg)
![Page 49: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/49.jpg)
Maximizing the Power• Power depends summed frequency
– Choose threshold for defining rare carefully
• Enriched functional variants in cases increase power – Focus on loss of function variants only
• Use more efficient design– For quantitative traits, focus on individuals with extreme trait values– For binary traits, focus on individuals with family history of disease
![Page 50: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/50.jpg)
![Page 51: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/51.jpg)
Discussion
• Analysis of rare variants is an active research area
• Weight for each SNP is the key
• What to do if the samples are related
• Most tests reply on permutation– Computationally intensive
![Page 52: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/52.jpg)
Reference• The 1000 Genomes Project (2010) A map of human genome vairation from
population-scale sequencing. Nature 467:1061-73
• Nielsen R, Paul JS et al. (2011) Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet
• Li Y, Chen W et al. (2012) Single Nucleotide Polymorphism (SNP) Detection and Genotype Calling from Massively Parallel Sequencing (MPS) Data. Statistics in Biosciences.
• Li Y et al (2011) Low-coverage sequencing: Implication for design of complex trait association studies. Genome Research 21: 940-951
• Chen W, Li B et al. (2013) Genotype calling and haplotyping in parent-offspring trios. Genome Research.
![Page 53: Analysis of Next Generation Sequence Data BIOST 2055 04/06/2015](https://reader030.vdocument.in/reader030/viewer/2022020416/5a4d1be87f8b9ab0599e3204/html5/thumbnails/53.jpg)
Reference• http://genome.sph.umich.edu/wiki/Rare_variant_tests
• Raychaudhuri S. Mapping rare and common causal alleles for complex human diseases. Cell. 2011 Sep 30;147(1):57-69.
• Li and Leal (2008) Am J Hum Genet 83:311-321
• Madsen BE, Browning SR (2009) A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. PLoS Genet 5(2)
• Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zöllner S (2010) Am J Hum Genet 87:604-617
• Wu M, Lee S, et al. (2011) Am J Hum Genet