BST 775 Lecture PLINK – A Popular Toolset for GWAS Guodong Wu SSG, Department of Biostatistics University of Alabama at Birmingham September 24, 2013

Page 1

BST 775 Lecture
PLINK – A Popular Toolset for GWAS

Guodong Wu
SSG, Department of Biostatistics

University of Alabama at Birmingham
September 24, 2013

Page 2

Overview

• Designed for GWAS and population-based linkage analysis.

• Developed by Shaun Purcell*; current version: v1.07.

• http://pngu.mgh.harvard.edu/~purcell/plink/

• Why is the toolset so popular?

– Stores GWAS data sets, which are too large for SAS, R, or other statistical packages.

– Well-developed guidelines and tools for data set management and quality control.

– A platform for various association methods.

* Purcell et al. 2007, AJHG

Page 3

Overview

• Data management

• Summary statistics

• Quality Control

• Association Test

Page 4

PLINK in GWAS workflow

[Figure: GWAS workflow diagram. Box labels: Experimental Design & Sample Collection; GeneChip Scanner; Cell Intensity Files for each chip; Phenotype, sex and other covariates; Summary statistics and quality control; Assessment of population stratification; Whole genome SNP-based association; Further exploration of ‘hits’; Visualization and follow-up]

Page 5

Data Format

PED and MAP format

PED file (one row per person, one pair of allele columns per SNP):

P1 A A A C C G T T A A T T
P2 A C A A C G G T A C T T
P3 C C A C G G T T A A T T
P4 C C A A G G G T A A T T

MAP file (one row per SNP; columns: chromosome, SNP ID, genetic distance, base-pair position):

1 snp1 0 1000
X snp2 0 1000
Y snp3 0 1000
XY snp4 0 1000
MT snp5 0 1000

Transposed format

SNP genotypes (one row per SNP, one pair of allele columns per person):

S1 A A A C C C C C
S2 A C A A A C A A
S3 C G C G G G G G
S4 T T C G T T G T
S5 A A G T A A A A
S6 T T A C T T T T

People information (one row per person):

P1 …
P2 …
P3 …
P4 …
P5 …

Compact binary format

SNP information

People information

Genotypes stored in binary form:

01010100101010101011010011101010101010110111010100101010111010010111011010101101010101010111010

Page 6

Data management

• Recode dataset (A,C,G,T → 1,2)

• Reorder, reformat dataset

• Flip DNA strand

• Extract/remove individuals/SNPs

• Supply new phenotypes and covariates as extra files

• Merge 2 or more data sets
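As a rough illustration of what flipping the DNA strand means, the Python sketch below swaps each allele for its complement (A↔T, C↔G). The function name `flip_strand` and the use of `0` as a missing-allele code are assumptions made for this example; it is not PLINK's implementation.

```python
# Complement map for the four bases.
# "0" is assumed here to be the missing-allele code and is left unchanged.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "0": "0"}


def flip_strand(genotype):
    """Return the genotype as read from the opposite DNA strand.

    `genotype` is a pair of allele characters, e.g. ("A", "C").
    """
    return tuple(COMPLEMENT[allele] for allele in genotype)


if __name__ == "__main__":
    # A person typed A/C on one strand reads as T/G on the other.
    print(flip_strand(("A", "C")))
```

Flipping twice returns the original genotype, which is a quick sanity check when two data sets typed on different strands are merged.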

Page 7

Summary and QC

• Hardy-Weinberg test

• Mendel errors

• Missing genotypes

• Allele frequencies

• Tests of non-random missingness

– by phenotype and by (unobserved) genotype

• Sex Check

• Pairwise IBD estimates

Page 8

Hardy-Weinberg test

plink --file data --hardy

An exact test is used by default.

In a case-control study, the control group typically needs a more lenient threshold (e.g., P-value < 1e-3).
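To make the idea of an exact Hardy-Weinberg test concrete, here is a minimal Python sketch. It enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities of outcomes no more likely than the observed one. The function name `hwe_exact_p` and the genotype labels are assumptions for this example; it is a didactic version, not PLINK's code.

```python
from math import exp, lgamma, log


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact Hardy-Weinberg p-value from genotype counts AA, AB, BB."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # log P(het heterozygotes | n people, n_a copies of A) under HWE
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2.0)
                + lgamma(n + 1) - lgamma(hom_a + 1)
                - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1)
                - lgamma(2 * n + 1))

    # Valid heterozygote counts share the parity of n_a.
    het_counts = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {het: exp(log_prob(het)) for het in het_counts}
    p_observed = probs[n_ab]
    # Small tolerance so ties with the observed outcome are included.
    total = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    print(hwe_exact_p(25, 50, 25))  # genotype counts in HWE proportions
    print(hwe_exact_p(50, 0, 50))   # no heterozygotes at all
```

A SNP with no heterozygotes despite two common alleles gives a vanishingly small p-value, which is the kind of pattern a Hardy-Weinberg filter is meant to flag.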

Page 9

Mendel errors

plink --file data --mendel

A Mendel error is a genotyping error in which the child’s genotype cannot have been inherited from the parents under Mendel’s laws.

PLINK outputs the error rate for each SNP and each individual, and labels each error with a code:

Code  Pat , Mat -> Offspring
1     AA , AA -> AB
2     BB , BB -> AB
3     BB , ** -> AA
4     ** , BB -> AA
5     BB , BB -> AA
6     AA , ** -> BB
7     ** , AA -> BB
8     AA , AA -> BB
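The error-code table can be read as a small decision rule. The Python sketch below encodes codes 1 to 8 exactly as listed, treating `**` as "any genotype" (an assumption about the notation). The function name `mendel_code` and the string genotype labels are hypothetical, and missing parental genotypes are not handled.

```python
def mendel_code(pat, mat, child):
    """Return the Mendel error code (1-8) for a trio, or None if consistent.

    Genotypes are the strings "AA", "AB" or "BB".
    """
    if child == "AB":
        # A heterozygous child needs an A from one parent and a B from the other.
        if pat == "AA" and mat == "AA":
            return 1
        if pat == "BB" and mat == "BB":
            return 2
        return None
    if child == "AA":
        # An AA child cannot have a BB parent.
        if pat == "BB" and mat == "BB":
            return 5
        if pat == "BB":
            return 3
        if mat == "BB":
            return 4
        return None
    if child == "BB":
        # A BB child cannot have an AA parent.
        if pat == "AA" and mat == "AA":
            return 8
        if pat == "AA":
            return 6
        if mat == "AA":
            return 7
        return None
    raise ValueError(f"unexpected child genotype: {child!r}")


if __name__ == "__main__":
    print(mendel_code("AA", "AA", "AB"))
    print(mendel_code("AB", "AB", "AA"))
```

The both-parents cases (5 and 8) are checked before the single-parent cases so that each trio receives exactly one code.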

Page 10

Missingness and Allele Frequency

plink --file data --missing

Output the missing rate per SNP and per individual.

plink --file data --freq

Output each SNP’s allele frequency
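The two summaries are simple counts. The Python sketch below computes them on a hypothetical toy data set; the data, the function names, and the use of `0` as the missing-allele code are all assumptions for illustration and do not reproduce PLINK's output format.

```python
from collections import Counter

MISSING = "0"  # assumed missing-allele code

# Hypothetical toy data: one row per individual, one (allele1, allele2) pair per SNP.
genotypes = [
    [("A", "A"), ("A", "C"), ("0", "0")],
    [("A", "C"), ("A", "A"), ("G", "G")],
    [("C", "C"), ("0", "0"), ("G", "T")],
    [("A", "C"), ("A", "A"), ("G", "G")],
]


def is_missing(genotype):
    return MISSING in genotype


def missing_rate_per_individual(rows):
    """Fraction of SNPs with a missing genotype, for each individual."""
    return [sum(is_missing(g) for g in row) / len(row) for row in rows]


def missing_rate_per_snp(rows):
    """Fraction of individuals with a missing genotype, for each SNP."""
    n_people = len(rows)
    n_snps = len(rows[0])
    return [sum(is_missing(row[j]) for row in rows) / n_people
            for j in range(n_snps)]


def allele_frequencies(rows, snp_index):
    """Allele frequencies at one SNP, ignoring missing genotypes."""
    counts = Counter()
    for row in rows:
        genotype = row[snp_index]
        if not is_missing(genotype):
            counts.update(genotype)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: count / total for allele, count in counts.items()}


if __name__ == "__main__":
    print(missing_rate_per_snp(genotypes))
    print(missing_rate_per_individual(genotypes))
    print(allele_frequencies(genotypes, 2))
```

Allele frequencies are computed over non-missing genotypes only, so a SNP with heavy missingness gets a frequency estimate based on fewer chromosomes.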

Page 11

Is the missingness random?

plink --file data --test-missing

Tests whether a SNP’s missingness differs between cases and controls.

plink --file data --test-mishap

• Tests whether a SNP’s missingness depends on the observed genotypes at nearby SNPs.

• Assumes dense SNP genotyping.

• Uses haplotype and LD information in the tests.
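One natural way to compare missingness between cases and controls is a Fisher's exact test on the 2x2 table of missing versus called genotypes. The Python sketch below is an illustration under that assumption; the function name and the example counts are hypothetical, and it is not PLINK's code.

```python
from math import comb


def fisher_exact_two_sided(case_missing, case_called,
                           control_missing, control_called):
    """Two-sided Fisher's exact p-value for a 2x2 missingness table."""
    n_cases = case_missing + case_called
    n_controls = control_missing + control_called
    n_missing = case_missing + control_missing
    n_total = n_cases + n_controls

    def prob(x):
        # Hypergeometric probability that x of the missing calls fall in cases.
        return (comb(n_cases, x) * comb(n_controls, n_missing - x)
                / comb(n_total, n_missing))

    lowest = max(0, n_missing - n_controls)
    highest = min(n_cases, n_missing)
    p_observed = prob(case_missing)
    # Sum all tables that are no more likely than the observed one.
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Same missing rate in both groups: no evidence of differential missingness.
    print(fisher_exact_two_sided(5, 95, 5, 95))
    # All missing calls are in cases: strong evidence of non-random missingness.
    print(fisher_exact_two_sided(20, 80, 0, 100))
```

A SNP whose missing calls are concentrated in one group can produce spurious association signals, which is why this check sits alongside the overall missing-rate filter.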

Page 12

Sex Check

plink --file data --check-sex

Uses X chromosome heterozygosity rates to infer sex, then compares the result with the recorded sex.
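The logic can be sketched in Python as an inbreeding-style F statistic on X chromosome SNPs: males carry one X, so they should look fully homozygous (F near 1), while females should not (F near 0). The function names, the simple expected-homozygosity formula, and the 0.2 / 0.8 cut-offs are assumptions for this example; check the PLINK documentation for the exact estimator and defaults.

```python
def x_inbreeding_f(genotypes, allele_freqs):
    """F = (observed hom - expected hom) / (N - expected hom) over X SNPs.

    genotypes    : list of (allele1, allele2) pairs, "0" meaning missing
    allele_freqs : frequency of one allele at each SNP, in the same order
    """
    observed_hom = 0
    expected_hom = 0.0
    n_called = 0
    for (a1, a2), p in zip(genotypes, allele_freqs):
        if a1 == "0" or a2 == "0":
            continue  # skip missing genotypes
        n_called += 1
        observed_hom += int(a1 == a2)
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if n_called == expected_hom:
        return None  # no informative SNPs
    return (observed_hom - expected_hom) / (n_called - expected_hom)


def call_sex(f, female_max=0.2, male_min=0.8):
    """Turn F into a sex call; thresholds are assumed, not authoritative."""
    if f is None:
        return "unknown"
    if f < female_max:
        return "female"
    if f > male_min:
        return "male"
    return "unknown"


if __name__ == "__main__":
    freqs = [0.5] * 10
    all_homozygous = [("A", "A")] * 10
    half_heterozygous = [("A", "A")] * 5 + [("A", "C")] * 5
    print(call_sex(x_inbreeding_f(all_homozygous, freqs)))
    print(call_sex(x_inbreeding_f(half_heterozygous, freqs)))
```

Individuals whose call disagrees with the recorded sex, or who fall between the two thresholds, are the ones worth investigating for sample mix-ups.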

Page 13

Pairwise IBD sharing (relatedness)

[Figure: pedigree diagram. Genotypes shown: AB and AC; annotation: IBS = 1, IBD = 0. Labels: Parents; Most recent common ancestor from homogeneous random mating population]

PLINK tutorial, October 2006; Shaun Purcell
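IBS (identity by state) is directly observable: it is the number of alleles two genotypes have in common. IBD (identity by descent) is not observable and has to be inferred. The Python sketch below only counts IBS; the function name `ibs` and the two-letter genotype strings are assumptions for this example.

```python
from collections import Counter


def ibs(genotype1, genotype2):
    """Number of alleles (0, 1 or 2) shared identical-by-state.

    Genotypes are two-character strings such as "AB"; allele order is ignored.
    """
    shared = Counter(genotype1) & Counter(genotype2)  # multiset intersection
    return sum(shared.values())


if __name__ == "__main__":
    print(ibs("AB", "AC"))  # the pair of genotypes used on this slide
    print(ibs("AA", "BB"))
```

For the genotypes AB and AC the count is 1, matching the slide. Whether that shared A allele is also IBD depends on ancestry, which a genotype count alone cannot reveal.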

Page 14

Relatedness Check

plink --file data --genome

• Relatedness is estimated from genome-wide information, but this typically does not require every genotyped SNP.

• Typically 100K independent SNPs are enough.

Page 15

Page 16

Association methods in PLINK

• Population-based

– Allelic, trend, genotypic, Fisher’s exact

– Stratified tests (Cochran-Mantel-Haenszel, Breslow-Day)

– Linear & logistic regression models (multiple covariates, interactions, joint tests, etc.)

• Family-based

– Disease traits: TDT / sib-TDT

– Continuous traits: QFAM (between/within model, QTDT)

• Permutation procedures

– “adaptive”, max(T), gene-dropping, between/within, rank-based, within-cluster

• Multilocus tests

– Haplotype estimation, set-based tests, Hotelling’s T², epistasis
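As an illustration of the simplest population-based method in the list, the allelic test, the Python sketch below runs a 1-degree-of-freedom chi-square test on a 2x2 table of allele counts in cases and controls. The function name `allelic_test` and the example counts are hypothetical; this is a textbook version, not PLINK's implementation.

```python
from math import erfc, sqrt


def allelic_test(case_a, case_b, control_a, control_b):
    """Chi-square allelic association test on allele counts.

    Returns (chi-square statistic, p-value, odds ratio for allele A).
    """
    n = case_a + case_b + control_a + control_b
    numerator = n * (case_a * control_b - case_b * control_a) ** 2
    denominator = ((case_a + case_b) * (control_a + control_b)
                   * (case_a + control_a) * (case_b + control_b))
    chi2 = numerator / denominator
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    odds_ratio = (case_a * control_b) / (case_b * control_a)
    return chi2, p_value, odds_ratio


if __name__ == "__main__":
    # Hypothetical counts: allele A seen 60 times in cases, 40 times in controls.
    print(allelic_test(60, 40, 40, 60))
```

The test counts alleles rather than people, so each genotyped individual contributes two observations to the table.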

Page 17

An Example: Logistic Regression

plink --maf 0.05 --exclude nonautosomalSNPs.txt --out AllAssoc \
  --bfile bdata --remove exclusions.txt \
  --logistic --hide-covar \
  --pheno IChipCovs.txt --pheno-name cas_con \
  --covar IChipCovs.txt --covar-name Sex,EurAdmix
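To make the role of each option easier to see, the Python sketch below assembles the same command as an argument list with one comment per option. The function name `build_logistic_command` is hypothetical, the file names are the ones used on the slide, and the sketch only builds the command; it does not run PLINK.

```python
import shlex


def build_logistic_command():
    """Return the example logistic-regression command as a list of arguments."""
    return [
        "plink",
        "--bfile", "bdata",                    # input: binary fileset named bdata
        "--maf", "0.05",                       # drop SNPs with minor allele frequency < 0.05
        "--exclude", "nonautosomalSNPs.txt",   # drop the SNPs listed in this file
        "--remove", "exclusions.txt",          # drop the individuals listed in this file
        "--pheno", "IChipCovs.txt",            # read the phenotype from this file ...
        "--pheno-name", "cas_con",             # ... using the column named cas_con
        "--covar", "IChipCovs.txt",            # read covariates from the same file ...
        "--covar-name", "Sex,EurAdmix",        # ... using these two columns
        "--logistic",                          # logistic regression association test
        "--hide-covar",                        # report only the SNP term, not covariate terms
        "--out", "AllAssoc",                   # prefix for all output files
    ]


if __name__ == "__main__":
    args = build_logistic_command()
    print(shlex.join(args))
    # To actually run it (assumes plink is installed and on PATH):
    #   import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list rather than one string avoids shell quoting problems when file names contain spaces.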

Page 18

An Example: Logistic Regression Result

Page 19

Cardinal rules in PLINK

• Always consult the log file, console output

• Also consult the web documentation

– regularly

• PLINK has no memory

– each run loads data anew, previous filters lost

• Exact syntax and spelling is important

– “minus minus” …
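The "minus minus" rule is easy to break when commands are copied from slides or word processors, which often replace `--` with a typographic dash. The Python sketch below is a small, hypothetical pre-flight check (the function name `check_plink_args` and its messages are invented for this example) that flags look-alike dashes and single-dash options before a command is run.

```python
import re

# Dash-like characters that editors often substitute for the ASCII "-".
LOOKALIKE_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"


def check_plink_args(command):
    """Return a list of problems found in the tokens of a command string."""
    problems = []
    for token in command.split():
        if any(ch in LOOKALIKE_DASHES for ch in token):
            problems.append(f"{token!r}: contains a non-ASCII dash")
        elif re.fullmatch(r"-[A-Za-z][\w-]*", token):
            problems.append(f"{token!r}: single dash, expected '--'")
    return problems


if __name__ == "__main__":
    print(check_plink_args("plink --file data --missing"))
    print(check_plink_args("plink --file data \u2013check-sex"))
```

An empty list means every option uses two plain ASCII hyphens. Anything else is worth fixing before consulting the log file.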

PLINK tutorial, October 2006; Shaun Purcell