BST 775 Lecture
PLINK – A Popular Toolset for GWAS
Guodong Wu
SSG, Department of Biostatistics
University of Alabama at Birmingham
September 24, 2013
Overview
• Designed for GWAS and population-based linkage analysis.
• Developed by Shaun Purcell*; current version v1.07.
• http://pngu.mgh.harvard.edu/~purcell/plink/
• Why is the toolset so popular?
– Stores GWAS data sets, which are too large for SAS, R, or other statistical packages.
– Provides well-developed guidelines and tools for data set management and quality control.
– Serves as a platform for various association methods.
* Purcell et al. 2007, AJHG
Overview
• Data management
• Summary statistics
• Quality Control
• Association Test
PLINK in GWAS workflow
[Figure: workflow diagram with the following boxes: Experimental Design & Sample Collection; GeneChip Scanner; Cell Intensity Files for each chip; Phenotype, sex and other covariates; Summary statistics and quality control; Assessment of population stratification; Whole genome SNP-based association; Further exploration of ‘hits’; Visualization and follow-up]
Data Format
PED and MAP format
PED (people in rows, SNPs in columns →):
P1 A A A C C G T T A A T T
P2 A C A A C G G T A C T T
P3 C C A C G G T T A A T T
P4 C C A A G G G T A A T T
MAP:
1 snp1 0 1000
X snp2 0 1000
Y snp3 0 1000
XY snp4 0 1000
MT snp5 0 1000
Transposed format (SNPs in rows, people in columns →):
S1 A A A C C C C C
S2 A C A A A C A A
S3 C G C G G G G G
S4 T T C G T T G T
S5 A A G T A A A A
S6 T T A C T T T T
Compact binary format
People information: P1 … P2 … P3 … P4 … P5 …
SNP information
01010100101010101011010011101010101010110111010100101010111010010111011010101101010101010111010
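To make the relationship between the people-major (PED-style) and SNP-major (transposed) layouts concrete, here is a minimal Python sketch. It assumes the simplified layout shown on the slide (one ID followed by allele pairs); real PED files carry additional leading columns, and the function names are hypothetical.

```python
# Simplified PED-style rows as shown on the slide: person ID, then two alleles per SNP.
PED_LINES = [
    "P1 A A A C C G T T A A T T",
    "P2 A C A A C G G T A C T T",
    "P3 C C A C G G T T A A T T",
    "P4 C C A A G G G T A A T T",
]


def parse_simplified_ped(lines):
    """Return {person_id: [(allele1, allele2), ...]} from simplified rows."""
    people = {}
    for line in lines:
        fields = line.split()
        person, alleles = fields[0], fields[1:]
        if len(alleles) % 2 != 0:
            raise ValueError(f"odd number of alleles for {person}")
        people[person] = [tuple(alleles[i:i + 2]) for i in range(0, len(alleles), 2)]
    return people


def transpose_to_snp_major(people, snp_names):
    """Return {snp_name: [genotype of each person, in input order]}."""
    return {
        snp: [genotypes[i] for genotypes in people.values()]
        for i, snp in enumerate(snp_names)
    }


people = parse_simplified_ped(PED_LINES)
snp_major = transpose_to_snp_major(people, ["S1", "S2", "S3", "S4", "S5", "S6"])
for snp, genotypes in snp_major.items():
    print(snp, " ".join(a + " " + b for a, b in genotypes))
```

The same genotypes are stored either way; only the iteration order (by person or by SNP) changes.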
Data management
• Recode dataset (A,C,G,T → 1,2)
• Reorder, reformat dataset
• Flip DNA strand
• Extract/remove individuals/SNPs
• New phenotypes, covariates as extra file
• Merge 2 or more data sets
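Two of the operations above, strand flipping and recoding alleles to 1/2, can be sketched in a few lines of Python. This is an illustration of the idea only, not PLINK's implementation; the convention that the less frequent allele becomes "1" and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter

# Complementary bases; "0" stands for a missing allele and is left unchanged.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "0": "0"}


def flip_strand(genotype):
    """Return the genotype as read from the opposite DNA strand."""
    return tuple(COMPLEMENT[allele] for allele in genotype)


def recode_12(genotypes):
    """Recode one SNP's alleles to '1'/'2' (assumed: rarer allele -> '1')."""
    counts = Counter(a for g in genotypes for a in g if a != "0")
    if len(counts) > 2:
        raise ValueError("more than two alleles observed at this SNP")
    # Sort by count, then alphabetically (hypothetical tie-break rule).
    ordered = sorted(counts, key=lambda allele: (counts[allele], allele))
    mapping = {"0": "0"}
    if len(ordered) == 2:
        mapping[ordered[0]], mapping[ordered[1]] = "1", "2"
    elif len(ordered) == 1:
        mapping[ordered[0]] = "2"  # a monomorphic SNP has only a major allele
    return [tuple(mapping[a] for a in g) for g in genotypes]


print(flip_strand(("A", "C")))
print(recode_12([("A", "A"), ("A", "C"), ("0", "0")]))
```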
Summary and QC
• Hardy-Weinberg test
• Mendel errors
• Missing genotypes
• Allele frequencies
• Tests of non-random missingness
– by phenotype and by (unobserved) genotype
• Sex Check
• Pairwise IBD estimates
Hardy-Weinberg test
plink --file data --hardy
Uses an exact test by default.
In a case-control study, the control group typically needs a more lenient threshold (e.g., P-value < 1e-3).
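For intuition about what an exact Hardy-Weinberg test computes, the following Python sketch enumerates every possible heterozygote count given the observed allele counts and sums the probabilities of outcomes no more likely than the observed one. It is a straightforward enumeration written for this note, not necessarily the algorithm PLINK uses internally, and the function name is hypothetical.

```python
from math import exp, lgamma, log


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Two-sided exact HWE p-value from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # log P(het heterozygotes | n people, n_a copies of A) under HWE
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2)
                + lgamma(n + 1) - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    # The heterozygote count must have the same parity as the allele count.
    het_counts = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {het: exp(log_prob(het)) for het in het_counts}
    total = sum(probs.values())
    p_obs = probs[n_ab]
    # Sum all outcomes at most as likely as the observed one (small tolerance).
    p_value = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9)) / total
    return min(1.0, p_value)


print(hwe_exact_pvalue(25, 50, 25))  # genotype counts at HWE proportions
print(hwe_exact_pvalue(50, 0, 50))   # no heterozygotes at all
```

A SNP would then be flagged when its p-value falls below whatever threshold the study has chosen.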
Mendel errors
plink --file data --mendel
A Mendel error is a genotyping error in which the child’s genotype could not have been inherited from the parents under Mendel’s laws.
Outputs the error rate for each SNP and each individual, with errors coded as follows:
Code  Pat , Mat -> Offspring
1     AA , AA -> AB
2     BB , BB -> AB
3     BB , ** -> AA
4     ** , BB -> AA
5     BB , BB -> AA
6     AA , ** -> BB
7     ** , AA -> BB
8     AA , AA -> BB
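The code table above can be read as a small lookup on a father/mother/child trio. The sketch below is one possible reading in Python; the function name is hypothetical, and giving the two-parent codes (5 and 8) precedence over the single-parent codes (3, 4, 6, 7) is an assumption of this example.

```python
def mendel_error_code(pat, mat, child):
    """Return the error code from the table, or 0 if the trio is consistent.

    Genotypes are "AA", "AB" or "BB"; use None for a missing parental genotype.
    """
    if child == "AB":
        if pat == "AA" and mat == "AA":
            return 1
        if pat == "BB" and mat == "BB":
            return 2
    elif child == "AA":
        if pat == "BB" and mat == "BB":
            return 5  # assumed to take precedence over codes 3 and 4
        if pat == "BB":
            return 3
        if mat == "BB":
            return 4
    elif child == "BB":
        if pat == "AA" and mat == "AA":
            return 8  # assumed to take precedence over codes 6 and 7
        if pat == "AA":
            return 6
        if mat == "AA":
            return 7
    return 0


print(mendel_error_code("AA", "AA", "AB"))
print(mendel_error_code("BB", "AB", "AA"))
print(mendel_error_code("AB", "AB", "AA"))
```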
Missingness and Allele Frequency
plink --file data --missing
Output the missing rate per SNP and per individual.
plink --file data --freq
Output each SNP’s allele frequency
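As a rough illustration of what these summaries contain, the Python sketch below computes per-individual and per-SNP missing rates and per-SNP allele frequencies from a tiny in-memory data set. The data, the "0" missing-allele convention used here, and the function names are all assumptions for the example; PLINK's own output files have their own columns.

```python
from collections import Counter

# Hypothetical toy data: person -> one genotype per SNP; ("0", "0") is missing.
GENOTYPES = {
    "P1": [("A", "A"), ("0", "0")],
    "P2": [("A", "C"), ("G", "G")],
}


def is_missing(genotype):
    return "0" in genotype


def missing_rate_per_individual(genotypes):
    return {
        person: sum(is_missing(g) for g in calls) / len(calls)
        for person, calls in genotypes.items()
    }


def missing_rate_per_snp(genotypes):
    by_snp = list(zip(*genotypes.values()))  # regroup genotypes SNP by SNP
    return [sum(is_missing(g) for g in calls) / len(calls) for calls in by_snp]


def allele_frequencies(genotypes):
    """Per-SNP allele frequencies among non-missing alleles."""
    result = []
    for calls in zip(*genotypes.values()):
        counts = Counter(a for g in calls if not is_missing(g) for a in g)
        total = sum(counts.values())
        result.append({a: c / total for a, c in counts.items()} if total else {})
    return result


print(missing_rate_per_individual(GENOTYPES))
print(missing_rate_per_snp(GENOTYPES))
print(allele_frequencies(GENOTYPES))
```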
Is the missingness random?
plink --file data --test-missing
Tests whether a SNP’s missingness differs between cases and controls.
plink --file data --test-mishap
• Tests whether a SNP is missing at random, based on the observed genotypes of nearby SNPs.
• Assumes dense SNP genotyping.
• Uses haplotype and LD information in the tests.
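The case/control missingness comparison can be framed as a 2×2 table (missing vs. called, cases vs. controls). One natural way to test such a table is Fisher's exact test, sketched below in plain Python. Treating this as equivalent to PLINK's own computation is an assumption, and the function name and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    For example: a = missing in cases, b = called in cases,
                 c = missing in controls, d = called in controls.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum every table at most as likely as the observed one (small tolerance).
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)


print(fisher_exact_two_sided(5, 5, 5, 5))    # identical missingness in both groups
print(fisher_exact_two_sided(10, 0, 0, 10))  # missing only in cases
```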
Sex Check
plink --file data --check-sex
Uses X-chromosome heterozygosity rates to infer each individual’s sex, then compares the inferred sex with the reported sex.
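The idea behind the check is that males carry a single X chromosome, so their X-linked genotypes should show essentially no heterozygous calls. The Python sketch below applies that idea with a simple rate cutoff; the cutoff value and function names are hypothetical and chosen only for illustration, and PLINK's actual statistic and thresholds may differ.

```python
def x_heterozygosity(x_genotypes):
    """Fraction of non-missing X-chromosome genotypes that are heterozygous."""
    called = [g for g in x_genotypes if "0" not in g]
    if not called:
        return None
    return sum(a != b for a, b in called) / len(called)


def infer_sex(x_genotypes, male_max_het=0.05):
    """Call 'male' when X heterozygosity is at or below a hypothetical cutoff."""
    het = x_heterozygosity(x_genotypes)
    if het is None:
        return "unknown"
    return "male" if het <= male_max_het else "female"


def sex_mismatch(reported_sex, x_genotypes):
    """True when the inferred sex is known and disagrees with the reported sex."""
    inferred = infer_sex(x_genotypes)
    return inferred != "unknown" and inferred != reported_sex


print(infer_sex([("A", "A"), ("C", "C"), ("G", "G")]))
print(infer_sex([("A", "C"), ("C", "C"), ("G", "T")]))
print(sex_mismatch("female", [("A", "A"), ("C", "C"), ("G", "G")]))
```

With only a handful of SNPs a female can look homozygous by chance, so a real check relies on many X-linked markers.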
Pairwise IBD sharing (relatedness)
[Figure: pedigree diagram. Labels: genotypes AB and AC (IBS = 1, IBD = 0); Parents; Most recent common ancestor from homogeneous random mating population]
PLINK tutorial, October 2006; Shaun Purcell
Relatedness Check
plink --file data --genome
• Relatedness is estimated from genome-wide information, but the full set of whole-genome SNPs is typically not needed.
• Typically 100K independent SNPs are enough.
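IBD sharing is estimated from what can actually be observed, which is identity by state (IBS). The Python sketch below only computes the observable part: the number of alleles two people share at each SNP and the average proportion shared. It does not perform the IBD estimation itself, and the function names are hypothetical.

```python
from collections import Counter


def ibs_count(genotype1, genotype2):
    """Number of alleles (0, 1 or 2) shared identical by state at one SNP."""
    shared = Counter(genotype1) & Counter(genotype2)  # multiset intersection
    return sum(shared.values())


def mean_ibs_proportion(person1, person2):
    """Average proportion of alleles shared IBS over SNPs called in both people."""
    shared = 0
    compared = 0
    for g1, g2 in zip(person1, person2):
        if "0" in g1 or "0" in g2:
            continue  # skip SNPs with a missing genotype in either person
        shared += ibs_count(g1, g2)
        compared += 2
    return shared / compared if compared else None


print(ibs_count(("A", "B"), ("A", "C")))  # one shared allele, as in the AB/AC example
print(mean_ibs_proportion([("A", "A"), ("A", "C")], [("A", "A"), ("C", "C")]))
```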
Association methods in PLINK
• Population-based
– Allelic, trend, genotypic, Fisher’s exact
– Stratified tests (Cochran-Mantel-Haenszel, Breslow-Day)
– Linear & logistic regression models
  • multiple covariates, interactions, joint tests, etc.
• Family-based
– Disease traits: TDT / sib-TDT
– Continuous traits: QFAM (between/within model, QTDT)
• Permutation procedures
– “adaptive”, max(T), gene-dropping, between/within, rank-based, within-cluster
• Multilocus tests
– Haplotype estimation, set-based tests, Hotelling’s T², epistasis
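As a small worked example of the simplest population-based test in the list, the allelic test, the Python sketch below computes a 1-degree-of-freedom chi-square statistic on a 2×2 table of allele counts in cases and controls. It applies no continuity correction, the counts are made up for illustration, and the function name is hypothetical.

```python
from math import erfc, sqrt


def allelic_chi_square(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Return (chi-square, p-value) for allele counts in cases vs. controls."""
    n = case_a1 + case_a2 + ctrl_a1 + ctrl_a2
    row_case = case_a1 + case_a2
    row_ctrl = ctrl_a1 + ctrl_a2
    col_a1 = case_a1 + ctrl_a1
    col_a2 = case_a2 + ctrl_a2
    denom = row_case * row_ctrl * col_a1 * col_a2
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (case_a1 * ctrl_a2 - case_a2 * ctrl_a1) ** 2 / denom
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 50 cases and 50 controls, i.e. 100 alleles per group.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(chi2, p_value)
```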
An Example: logistic Regression
plink --maf 0.05 --exclude nonautosomalSNPs.txt --out AllAssoc --bfile bdata --remove exclusions.txt --logistic --hide-covar --pheno IChipCovs.txt --pheno-name cas_con --covar IChipCovs.txt --covar-name Sex,EurAdmix
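The long command above is easier to read when each option is assembled separately. The Python sketch below builds the same argument list with a comment per option; it only constructs and prints the command and does not run PLINK. The helper name is hypothetical, and the one-line descriptions in the comments are brief summaries rather than full documentation.

```python
import shlex


def build_logistic_command(bfile, covar_file, pheno_name, covar_names, out,
                           maf=0.05, exclude_snps=None, remove_people=None):
    """Assemble a PLINK logistic-regression command as an argument list."""
    args = ["plink", "--bfile", bfile]           # binary fileset prefix
    args += ["--maf", str(maf)]                  # minor allele frequency filter
    if exclude_snps:
        args += ["--exclude", exclude_snps]      # file listing SNPs to drop
    if remove_people:
        args += ["--remove", remove_people]      # file listing individuals to drop
    args += ["--pheno", covar_file, "--pheno-name", pheno_name]  # phenotype column
    args += ["--covar", covar_file, "--covar-name", ",".join(covar_names)]
    args += ["--logistic", "--hide-covar"]       # report SNP rows only
    args += ["--out", out]                       # output file prefix
    return args


command = build_logistic_command(
    bfile="bdata",
    covar_file="IChipCovs.txt",
    pheno_name="cas_con",
    covar_names=["Sex", "EurAdmix"],
    out="AllAssoc",
    exclude_snps="nonautosomalSNPs.txt",
    remove_people="exclusions.txt",
)
print(shlex.join(command))
```

Because each run loads the data anew, every filter has to appear in the same command as the analysis it should apply to.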
An Example: logistic Regression Result
Cardinal rules in PLINK
• Always consult the log file, console output
• Also consult the web documentation
– regularly
• PLINK has no memory
– each run loads data anew, previous filters lost
• Exact syntax and spelling is important
– “minus minus” …
PLINK tutorial, October 2006; Shaun Purcell