statistical applications in genetics and molecular biology · volume 11, issue 3 2012 article 4...

Volume 11, Issue 3 2012 Article 4

Statistical Applications in Geneticsand Molecular Biology

Non-Iterative, Regression-Based Estimation ofHaplotype Associations with Censored

Survival Outcomes

Benjamin French, University of PennsylvaniaThomas Lumley, University of Auckland

Thomas P. Cappola, University of PennsylvaniaNandita Mitra, University of Pennsylvania

Recommended Citation:French, Benjamin; Lumley, Thomas; Cappola, Thomas P.; and Mitra, Nandita (2012) "Non-Iterative, Regression-Based Estimation of Haplotype Associations with Censored SurvivalOutcomes," Statistical Applications in Genetics and Molecular Biology: Vol. 11: Iss. 3, Article4.

DOI: 10.1515/1544-6115.1764©2012 De Gruyter. All rights reserved.

Brought to you by | University of Pennsylvania (University of Pennsylvania)Authenticated | 172.16.1.226

Download Date | 2/16/12 7:29 PM

Non-Iterative, Regression-Based Estimation ofHaplotype Associations with Censored

Survival OutcomesBenjamin French, Thomas Lumley, Thomas P. Cappola, and Nandita Mitra

AbstractThe general availability of reliable and affordable genotyping technology has enabled genetic

association studies to move beyond small case-control studies to large prospective studies. Forprospective studies, genetic information can be integrated into the analysis via haplotypes, withfocus on their association with a censored survival outcome. We develop non-iterative, regression-based methods to estimate associations between common haplotypes and a censored survivaloutcome in large cohort studies. Our non-iterative methods—weighted estimation and weightedhaplotype combination—are both based on the Cox regression model, but differ in how theimputed haplotypes are integrated into the model. Our approaches enable haplotype imputationto be performed once as a simple data-processing step, and thus avoid implementation based onsophisticated algorithms that iterate between haplotype imputation and risk estimation. We showthat non-iterative weighted estimation and weighted haplotype combination provide valid tests forgenetic associations and reliable estimates of moderate associations between common haplotypesand a censored survival outcome, and are straightforward to implement in standard statisticalsoftware. We apply the methods to an analysis of HSPB7-CLCNKA haplotypes and risk of adverseoutcomes in a prospective cohort study of outpatients with chronic heart failure.

KEYWORDS: Cox regression, phase ambiguity, prospective study, unphased genotypes

Author Notes: The Penn Heart Failure Study is supported by the National Institutes of Healththrough grant R01 HL088577.



1 Introduction

Genetic association studies often focus on estimating the association betweenhaplotypes—combinations of alleles at adjacent loci on a chromosome—and adisease trait. A haplotype-based analysis may offer an attractive data reduc-tion and efficiency gain compared to an analysis based on individual single-nucleotide polymorphisms (SNPs). However, linkage phase is typically un-known, so that there may be more than one pair of haplotypes that is con-sistent with the observed genotype for each individual. Thus, haplotype riskestimation is frequently based on sophisticated iterative algorithms that iter-ate between haplotype imputation and risk estimation. Currently, there existreliable haplotype-based estimation methods for a binary trait in case-controlstudies and for continuous or discrete traits in cohort studies (e.g., Schaid et al.,2002; Schaid, 2004; Venkatraman et al., 2004). We focus on large, population-based prospective cohort studies with a censored survival outcome, for whichthe development of estimation methods is ongoing and their application toclinical research is increasing.

Several iterative estimation methods exist for population-based cohortstudies in which primary interest lies in a censored survival outcome. Lin(2004) and Tregouet and Tiret (2004) introduced an expectation-maximization(EM) algorithm for estimation based on a proportional hazards regressionmodel. Tan et al. (2005) developed estimation procedures for longevity stud-ies based on a proportional hazards regression model, and estimated haplo-type hazard ratios from population survival information. Lin and Zeng (2006)established semi-parametric maximum likelihood estimation for cohort stud-ies. Chen et al. (2004) developed a haplotype-based score test for detectingthe association of a disease with a genomic region of interest using prospec-tive information and unphased genotype data collected from cohort studieswith a censored survival outcome. Subsequently, Chen and Chatterjee (2006)examined the statistical properties of alternative EM-based procedures forhaplotype risk estimation in cohort studies. Souverein et al. (2008) extendedEM-based approaches to rare haplotypes using a penalized proportional haz-ards regression model. Recently, Scheike et al. (2010) introduced estimationfor haplotype effects in survival data based on estimating equations.

We develop non-iterative, regression-based methods to estimate associ-ations between common haplotypes and a censored survival outcome in largecohort studies. Our approaches enable haplotype imputation to be performedonce as a simple data-processing step, and thus avoid implementation basedon complex algorithms that iterate between haplotype imputation and risk es-timation. We focus on moderate associations of common haplotypes inferred

1

French et al.: Estimating Haplotype Associations with Censored Survival Outcomes

Published by De Gruyter, 2012



from tag SNPs in small numbers of genes in large prospective studies. Our goalis to allow haplotypes to be integrated into the analysis of a censored survivaloutcome, so that analyses combining genetic and environmental informationcan be conducted in standard software by researchers expert in the relevantsubject matter, rather than by statisticians using specialized software.

Our non-iterative methods—weighted estimation and weighted hap-lotype combination—are both based on the Cox regression model, but dif-fer in how the imputed haplotypes are integrated into the model. French etal. (2006) introduced non-iterative weighted estimation in the context of case-control data. The method is based on creating multi-record data, in whichthe data for each individual consist of multiple records, one for each diplotypeconsistent with the unphased genotype. Weighted logistic regression is usedto relate the haplotypes to the binary disease outcome, in which the weightsare set to the conditional probability of each diplotype given the observedgenotype. A robust or ‘sandwich’ variance estimator is used for standarderror estimation. For a case-control outcome, weighted estimation providesvalid tests for genetic associations and reliable estimates of moderate effectsof common haplotypes. For analyses of case-control data, weighted estimationis implemented in the R package haplo.ccs (French and Lumley, 2007). Re-cently, Neuhausen et al. (2009) applied non-iterative weighted estimation to ananalysis of insulin-like growth factor variants and incidence of breast cancer.

Weighted haplotype combination is based on creating single-record datafrom the set of diplotypes consistent with the observed genotype. For eachindividual, a weighted combination of the set of diplotypes consistent withthe observed genotype is determined, in which the weights are determinedby the conditional probability of the diplotypes given the observed genotype.We have applied weighted haplotype combination to collaborative researchprojects regarding haplotypes of BRCA1- and BRCA2-interacting genes andincidence of ovarian and breast cancer (Rebbeck et al., 2009; Rebbeck et al.,2011). The approach is similar to that of Zaykin et al. (2002) in that diplotypeprobabilities are used as predictors. However, Zaykin creates a multi-recorddataset with unaveraged diplotype probabilities.

In Section 2, we detail non-iterative weighted estimation and weightedhaplotype combination in the context of a censored survival outcome. In Sec-tion 3, we use a simulation study to evaluate their statistical properties. InSection 4, we apply the methods to investigate associations between HSPB7-CLCNKA haplotypes (Cappola et al., 2010; Stark et al., 2010) and risk ofadverse outcomes in a prospective study of chronic heart failure patients. Weprovide concluding discussion in Section 5. In the Appendix, we provide in-structions for implementation in Stata (StataCorp, College Station, Texas).

2

Statistical Applications in Genetics and Molecular Biology, Vol. 11 [2012], Iss. 3, Art. 4



2 Statistical Methods

2.1 Notation and model

Let Ti and Ci denote the event time and censoring time, respectively, forindividual i = 1, . . . , n such that Yi = min(Ti, Ci) denotes the possibly censoredevent time. Let δi denote the event indicator such that δi = 1 if Yi = Ti andδi = 0 if Yi = Ci. Let Gi denote the unphased genotype and Zi denote a setof environmental exposures. Thus, {Yi, δi, Gi, Zi} compose the observed datafor individual i.

The survival model is specified as a function of Zi and Di, the unob-served diplotype (or pair of haplotypes). We assume that if the diplotypeswere known, then we would fit a Cox regression model (Cox, 1972) to estimatethe association between {Di, Zi} and {Yi, δi}. Recall that the Cox regres-sion model employs a log function to relate the hazard function to a linearcombination of the diplotypes and environmental exposures:

log λi(t) = log λ0(t) +DiβD + Ziβ

Z , (1)

where: λi(t) is defined as the instantaneous rate at which failures occur forindividuals that are surviving at time t:

λi(t) = lim∆t→0

P[t ≤ Ti < t+ ∆t | Ti ≥ t]/∆t; (2)

λ0(t) is an unspecified baseline hazard function, possibly stratified by an ex-posure of interest; βD are regression parameters that correspond to diplotypesDi; and βZ are regression parameters that correspond to environmental expo-sures Zi. In the context of an association study, β = {βD, βZ} represents thetarget of inference. Note that in our case study, diplotypes and exposures areonly measured at baseline and are hence constant over time. In the situationof time-independent exposures, λi(t) is often referred to as a ‘proportionalhazards’ regression model.

2.2 Haplotype imputation

Because linkage phase may be unknown, there may be more than one pairof haplotypes that is consistent with the observed genotype. In this case, aset of diplotypes may be imputed and corresponding posterior probabilitiesmay be estimated for each individual. Let d(Gi) denote the set of all possiblediplotypes consistent with the unphased genotype Gi and pid denote the popu-lation haplotype frequencies that correspond to each diplotype d in d(Gi). Letπid ≡ πid(pid, β) denote the conditional probability of diplotype d given d(Gi).

3





We impute haplotypes and estimate population haplotype frequenciesusing haplo.em, an implementation of an EM algorithm (Dempster, 1977)included in the R package haplo.stats (Sinnwell and Schaid, 2009), whichcomputes maximum likelihood estimates of haplotype probabilities from un-phased genotypes measured on unrelated individuals. Unlike the standard EMalgorithm that attempts to enumerate all possible pairs of haplotypes beforeiterating over the EM steps, the implementation in haplo.em is based on a‘progressive insertion’ algorithm that progressively inserts batches of loci intohaplotypes of growing lengths, runs the EM steps, trims off pairs of haplotypesper individual when the posterior probability of the pair is below a specifiedthreshold, and then continues the insertion, EM, and trimming steps until allloci are inserted into the haplotype. Haplotype imputation and estimation ofpopulation haplotype frequencies can also be performed using Bayesian meth-ods implemented in software such as PHASE (Stephens and Donnelly, 2003).

2.3 Non-iterative weighted estimation

Weighted estimation is based on creating multi-record data, in which eachindividual contributes multiple records to the analysis, one for each diplotypeconsistent with the unphased genotype. Let Xid denote the multi-record designmatrix (or set of covariates) that includes both diplotype information andenvironmental exposures. Diplotype information is integrated into Xid as avector of haplotype counts for all imputed haplotypes except the referent.For example, consider the following imputed haplotypes and correspondingconditional probabilities for four hypothetical individuals (i = 1, 2, 3, 4) basedon our simulation study (Table 1):

i A B C D E F* G H I πid1 0 0 0 1 0 1 0 0 0 0.81 0 0 1 0 0 0 1 0 0 0.22 0 0 0 0 0 1 0 0 1 0.62 0 0 0 0 0 0 1 1 0 0.43 0 1 0 1 0 0 0 0 0 1.04 0 0 0 0 0 2 0 0 0 1.0* Referent

Two individuals (i = 3, 4) have one possible diplotype; two individuals (i =1, 2) have two possible diplotypes. Therefore, the latter two individuals wouldeach contribute two rows to Xid. The column indicating the reference haplo-type (here, haplotype F) would be excluded.

4




Non-iterative weighted estimation is straightforward to implement inthe context of a censored survival outcome:

1. Impute haplotypes and estimate population haplotype frequencies p as-suming Hardy-Weinberg Equilibrium (HWE);

2. Create multi-record data for each individual:(a) Form a design matrix Xid containing the set of diplotypes d consis-

tent with the observed genotype d(Gi);(b) Set πid to the conditional probability of diplotype d given the ob-

served genotype d(Gi), and;3. Estimate β using a weighted Cox regression model.

Appendix A provides instructions on implementing weighted estimation inStata. Note that β is obtained as the solution to an estimating equation basedon the weighted Cox partial likelihood:

U(β) =n∑

i=1

∑d∈d(Gi)

∂lid(β)

∂β= 0, (3)

in which:

lid(β) = δid πid(pid, β)

Xidβ − log∑

i′∈R(ti)

expXi′dβ

, (4)

and R(ti) denotes the set of individuals at risk at time ti. Standard errorestimates for confidence intervals and hypothesis tests must be based on arobust variance estimator. By estimating robust standard errors from multi-record data, weighted estimation accounts for the uncertainty in phase, inaddition to the sampling variability of the data. Inference for one or moreelements of β are based on univariable or multivariable Wald tests. Underthe null hypothesis, the estimator is valid and efficient, because the estimatingfunction is the expectation of the known-phase score function given all theavailable data.

2.4 Weighted haplotype combination

Weighted haplotype combination is based on creating single-record data fromthe set of diplotypes consistent with the observed genotype for each individual.Let Xi denote the single-record design matrix that includes both diplotype in-formation and environmental exposures. For each individual, diplotype infor-mation is integrated into Xi as a weighted combination of the set of diplotypes

5





consistent with the observed genotype, in which the weights are determinedby the conditional probability of the diplotypes given the observed genotype.For example, the four sets of imputed haplotypes given in Section 2.3 wouldeach be averaged according to the conditional probability for each diplotype:

i A B C D E F* G H I1 0 0 0.2 0.8 0 0.8 0.2 0 02 0 0 0 0 0 0.6 0.4 0.4 0.63 0 1.0 0 1.0 0 0 0 0 04 0 0 0 0 0 2.0 0 0 0* Referent

Therefore, each individual would contribute one record to the analysis. Notethat the column indicating the reference haplotype (F) would be excluded.

It is also straightforward to implement weighted haplotype combinationin the context of a censored survival outcome:

1. Impute haplotypes and estimate population haplotype frequencies p as-suming HWE;

2. Create single-record data for each individual:(a) Form a design matrix Xi = πT

idXid as a weighted combination of theset of diplotypes d consistent with the observed genotype d(Gi), inwhich the weights are determined by πid, and;

3. Estimate β using an unweighted Cox regression model.

Appendix B provides instructions on implementing weighted haplotype com-bination approach in Stata. Note that β is obtained as the solution to anestimating equation based on the unweighted Cox partial likelihood:

U(β) =n∑

i=1

∂li(β)

∂β= 0, (5)

in which:

li(β) = δi

Xiβ − log∑

i′∈R(ti)

expXi′β

. (6)

Model-based standard error estimates may be used for confidence intervalsand hypothesis tests, because the data consist of one record for each individ-ual. Because the set of possible diplotypes for each individual are averagedusing weights determined by their conditional probability given the observed

6




genotype, weighted haplotype combination also accounts for the uncertaintyin phase, in addition to the sampling variability of the data. Inference for oneor more elements of β are based on univariable or multivariable Wald tests.The estimator is exactly valid under the null hypothesis.

3 Simulation Study

We designed a simulation study to evaluate the statistical properties of non-iterative methods to estimate associations between common haplotypes anda censored survival outcome. Simulated genotypes were based on haplotypefrequencies for angiotensin II receptor, type 1 (AGTR1), a gene in the renin-angiotensin system, which regulates blood pressure (French et al., 2006; Mer-ciante et al., 2007). Table 1 provides common AGTR1 haplotypes and theircorresponding estimated population frequencies. Simulated AGTR1 diplo-types had moderate phase ambiguity: approximately 60% of individuals hadan unambiguous diplotype; and approximately 90% had a highest posteriorprobability of having a particular diplotype greater than 0.75.

Table 1: Common haplotypes and estimated population frequencies forAGTR1.

Label Haplotype FrequencyA ATTATGCATCTC 0.029B ATTATGTGATCC 0.051C TCCACGCATCTC 0.027D TCCACGCATCTT 0.090E TCTGTGCAACTT 0.029F* TCTGTGCATCTC 0.223G TCTGTGCATCTT 0.188H TTTACACATCTC 0.038I TTTACACATCTT 0.032* Referent

3.1 Parameters

At each of 1000 iterations, we generated event times for a sample of eithern = 200, 500, or 1000 individuals from an Exponential distribution, in which

7





the log-rate was determined by a linear combination of the haplotypes accord-ing to an assumed genetic risk model with additive inheritance: no geneticeffects, moderate effects, strong effects characterized by a single SNP, andstrong effects not characterized by a single SNP. For effects characterized bya single SNP, a haplotype was related to risk if and only if it had the minorallele at a particular locus (here, the 12th locus). We defined moderate geneticeffects as hazard ratios between 1.0 and 2.0 (or between 1.0 and 1/2.0) relativeto the reference haplotype (here, haplotype F), and strong effects as hazardratios equal to 4.0. We generated an independent censoring process from anExponential distribution and selected the rate such that either 10% or 25% ofindividuals were censored before their event time.

We used four metrics to compare weighted estimation and weightedhaplotype combination. First, to evaluate the performance of standard errorestimation, we compared the average of the estimated standard errors to theempirical standard deviation of the estimated log-hazard ratios for each hap-lotype. Second, for each haplotype we created error intervals by calculatingthe difference between the estimated and true log-hazard ratios at every iter-ation and calculating the 5th, 50th, and 95th percentile. These error intervalsapproximate the bias in the estimated log-hazard ratios. Third, we calculatedpercent coverage of estimated 95% confidence intervals. Fourth, at every iter-ation we conducted a two-sided hypothesis test (α = 0.05) for each haplotypeeffect based on its estimated log-hazard ratio and standard error. Across alliterations, the rejection rate quantified the type-I error rate for null effectsand statistical power for non-null effects. We present results obtained assum-ing known phase to illustrate that, due to sampling variability, there is errorinherent in the log-hazard ratio estimates obtained from the Cox regressionmodel even when the haplotypes are known.

3.2 Results

Simulations with 10% and 25% censoring yielded similar results. Therefore,we only present results assuming 25% censoring. For brevity, we only presentresults with n = 200 and 1000.

Table 2 provides average standard error estimates and the empiricalstandard deviation of estimated log-hazard ratios under no genetic effects,i.e., all log-hazard ratios equal to 0 relative to the reference haplotype. Whenthe sample size is small (n = 200), every estimation method under-estimatesthe standard error; the average standard error estimate for every haplotype issystematically smaller than the corresponding empirical standard deviation.

8




However, when the sample size is large (n = 1000), every estimation methodproperly estimates the standard error; the average standard error estimatefor every haplotype is approximately equal to the corresponding empiricalstandard deviation.

Figure 1 presents error intervals, Table 3 provides estimated coverageof 95% confidence intervals, and Table 4 provides the estimated rejection rateof two-sided hypothesis tests (α = 0.05) for AGTR1 haplotype effects withn = 200 and 1000. In every scenario, estimation based on the known phaseprovides approximately unbiased parameter estimates with proper confidenceinterval coverage and the nominal type-I error rate. Under no genetic effects(results not shown), every estimation method provides approximately unbi-ased parameter estimates with proper coverage and the nominal type-I errorrate. Under moderate genetic effects, weighted haplotype combination pro-vides approximately unbiased parameter estimates, as shown by the greenerror intervals in Figures 1(a) and 1(b) centered at zero, with proper cov-erage. However, weighted estimation may provide slightly biased parameterestimates, as shown by the red error intervals in Figures 1(a) and 1(b) centeredaway from, yet covering zero. Coverage is reduced due to the small amount ofbias in the parameter estimates, especially when the sample size is small dueto under-estimation of standard errors. Power is modest for all methods whenthe sample size is small.

Under strong effects characterized by a single SNP, every estimationmethod provides approximately unbiased parameter estimates with properconfidence interval coverage. In addition, the type-I error rate is near thenominal 5% level for null effects (haplotypes A, B, C, H), and power is highfor non-null effects (haplotypes D, E, G, I). However, under strong effects notcharacterized by a single SNP, weighted estimation may provide heavily biasedparameter estimates, as shown by the red error intervals in Figures 1(e) and1(f) that do not cover zero. In this case, the approximate bias is negative,so that the bias is toward the null. Due to the large bias in the parameterestimates, coverage is substantially reduced. The type-I error rate is above thenominal 5% level for null effects (haplotypes A, B, D, I), particularly whenthe sample size is large. Power is reduced for non-null effects (haplotypesC, H), particularly when the sample size is small. Under strong effects notcharacterized by a single SNP, weighted haplotype combination may provideslightly biased (toward the null) parameter estimates, as shown by the greenerror intervals in Figures 1(e) and 1(f) centered away from, yet covering zero.Coverage is reduced due to the small amount of bias in the parameter esti-mates. Compared to weighted estimation, the type-I error rate is lower andthe power level is higher for weighted haplotype combination.

9





Tab

le2:

Ave

rage

stan

dar

der

ror

esti

mat

es(‘

Est

imat

ed’)

and

empir

ical

stan

dar

ddev

iati

onof

esti

mat

edlo

g-haz

ard

rati

os(‘

Em

pir

ical

’)under

no

genet

iceff

ects

,25

%ce

nso

ring.

Wei

ghte

des

tim

ati

on

Wei

ghte

dco

mb

inati

on

Kn

own

ph

ase

nH

aplo

typ

eE

stim

ated

Em

pir

ical

Est

imate

dE

mp

iric

al

Est

imate

dE

mp

iric

al

200

A0.

307

0.3

65

0.3

33

0.3

95

0.3

24

0.3

66

B0.

242

0.2

65

0.2

52

0.2

72

0.2

49

0.2

66

C0.

264

0.3

00

0.4

19

0.4

50

0.3

34

0.3

64

D0.

193

0.2

15

0.2

07

0.2

26

0.2

00

0.2

20

E0.

305

0.3

57

0.3

55

0.4

08

0.3

21

0.3

58

G0.

147

0.1

57

0.1

71

0.1

83

0.1

58

0.1

67

H0.

249

0.2

65

0.3

39

0.3

61

0.2

85

0.3

00

I0.

265

0.2

98

0.3

51

0.3

87

0.3

07

0.3

20

1000

A0.

135

0.1

36

0.1

38

0.1

38

0.1

37

0.1

37

B0.

106

0.1

07

0.1

08

0.1

09

0.1

08

0.1

07

C0.

119

0.1

24

0.1

72

0.1

77

0.1

42

0.1

45

D0.

085

0.0

85

0.0

89

0.0

89

0.0

86

0.0

88

E0.

135

0.1

36

0.1

41

0.1

42

0.1

37

0.1

37

G0.

064

0.0

64

0.0

74

0.0

75

0.0

68

0.0

67

H0.

109

0.1

10

0.1

40

0.1

40

0.1

21

0.1

21

I0.

120

0.1

25

0.1

45

0.1

51

0.1

31

0.1

33

10




HaplotypeHazard ratio

App

roxi

mat

e bi

as

−2

−1

01

2

A1/2.0 B1/1.75 C1.25 D1.5 E1/1.25 G1/1.5 H1.75 I2.0

Weighted estimationWeighted combinationKnown phase

(a) Moderate effects, n = 200


App

roxi

mat

e bi

as

−2

−1

01

2

A1/2.0 B1/1.75 C1.25 D1.5 E1/1.25 G1/1.5 H1.75 I2.0

(b) Moderate effects, n = 1000


App

roxi

mat

e bi

as

−2

−1

01

2

A1.0 B1.0 C1.0 D4.0 E4.0 G4.0 H1.0 I4.0

(c) Strong SNP effects, n = 200


App

roxi

mat

e bi

as

−2

−1

01

2

A1.0 B1.0 C1.0 D4.0 E4.0 G4.0 H1.0 I4.0

(d) Strong SNP effects, n = 1000


App

roxi

mat

e bi

as

−2

−1

01

2

A1.0 B1.0 C4.0 D1.0 E4.0 G4.0 H4.0 I1.0

(e) Strong non-SNP effects, n = 200


App

roxi

mat

e bi

as

−2

−1

01

2

A1.0 B1.0 C4.0 D1.0 E4.0 G4.0 H4.0 I1.0

(f) Strong non-SNP effects, n = 1000

Figure 1: Error intervals for AGTR1 haplotype effects, 25% censoring.

11





Tab

le3:

Est

imat

edco

vera

ge(%

)of

95%

confiden

cein

terv

als

forAGTR1

hap

loty

pe

effec

ts,

25%

censo

ring.

Wei

ghte

des

tim

ati

on

Wei

ghte

dco

mb

inati

on

Kn

own

ph

ase

Ris

km

od

elH

aplo

typ

eH

azar

dra

tio

n=

200

n=

1000

n=

200

n=

1000

n=

200

n=

1000

Mod

erat

eA

1/2.

091

94

94

95

94

95

effec

tsB

1/1.

7593

93

94

96

95

95

C1.

2581

76

94

95

94

94

D1.

594

94

95

95

94

95

E1/

1.25

92

93

95

95

94

95

G1/

1.5

92

82

95

96

95

96

H1.

7587

84

93

94

94

95

I2.

094

91

94

95

94

96

Str

ong

A1.

090

95

92

95

92

95

SN

Peff

ects

B1.

094

95

94

96

96

95

C1.

092

94

95

95

94

95

D4.

093

96

94

96

94

96

E4.

090

92

94

94

94

94

G4.

092

94

94

95

95

96

H1.

092

95

95

94

94

95

I4.

092

94

95

95

94

94

Str

ong

A1.

094

90

96

94

95

95

non

-SN

PB

1.0

93

87

93

92

94

94

effec

tsC

4.0

20

96

95

94

95

D1.

092

94

94

96

93

95

E4.

084

39

96

95

94

95

G4.

019

095

86

94

94

H4.

018

093

82

94

94

I1.

086

69

91

88

95

94

12




Tab

le4:

Est

imat

edre

ject

ion

rate

(%)

oftw

o-si

ded

hyp

othes

iste

sts

(α=

0.05

)fo

rAGTR1

hap

loty

pe

effec

ts,

25%

censo

ring.

Typ

e-I

erro

rra

tepre

sente

din

italic

typ

e;st

atis

tica

lp

ower

pre

sente

din

stan

dar

dty

pe.

Wei

ghte

des

tim

ati

on

Wei

ghte

dco

mb

inati

on

Kn

own

ph

ase

Ris

km

od

elH

aplo

typ

eH

azar

dra

tio

n=

200

n=

1000

n=

200

n=

1000

n=

200

n=

1000

Mod

erat

eA

1/2.

048

100

49

100

51

100

effec

tsB

1/1.

7553

100

55

100

57

100

C1.

2540

84

14

27

17

38

D1.

562

100

58

100

62

100

E1/

1.25

930

734

736

G1/

1.5

64

100

63

100

74

100

H1.

7581

100

48

98

60

100

I2.

071

100

63

100

74

100

Str

ong

A1.

010

58

58

5S

NP

effec

tsB

1.0

65

64

45

C1.

08

65

56

5D

4.0

100

100

100

100

100

100

E4.

099

100

97

100

99

100

G4.

0100

100

100

100

100

100

H1.

08

55

66

5I

4.0

100

100

97

100

100

100

Str

ong

A1.

06

10

46

55

non

-SN

PB

1.0

713

78

66

effec

tsC

4.0

16

62

91

100

99

100

D1.

08

66

47

5E

4.0

98

100

97

100

99

100

G4.

0100

100

100

100

100

100

H4.

052

99

95

100

100

100

I1.

014

31

812

56

13





Non-iterative, regression-based methods provide valid tests for geneticassociations and reliable estimates of moderate associations between com-mon haplotypes and a censored survival outcome. Weighted estimation andweighted haplotype combination provided estimates with reasonable bias un-der moderate SNP effects. In addition, weighted haplotype combination pro-vided estimates with reasonable bias under large non-SNP effects.

4 Case Study

4.1 Background

The Penn Heart Failure Study is a prospective cohort study of outpatientswith chronic heart failure recruited from referral centers at the University ofPennsylvania (Philadelphia, Pennsylvania), Case Western Reserve University(Cleveland, Ohio), and the University of Wisconsin (Madison, Wisconsin).The primary inclusion criterion was a clinical diagnosis of heart failure. Par-ticipants were excluded if they had a non-cardiac condition resulting in anexpected mortality of less than six months as judged by the treating physi-cian, or if they were unable to provide consent. At time of study entry, detailedclinical data were obtained using standardized questionnaires administered tothe participant and physician, with verification via medical records, as previ-ously described (Ky et al., 2009). Subsequent adverse events, including all-cause mortality, cardiac transplantation, and placement of a ventricular assistdevice were prospectively ascertained every six months via direct patient con-tact and verified through death certificates, medical records, and contact withfamily members by research personnel. All participants provided written, in-formed consent; the study protocol was approved by participating InstitutionalReview Boards.

The goal of this analysis was to estimate associations between HSPB7-CLCNKA haplotypes at 1p36 and risk of adverse outcomes. SNPs at 1p36 havebeen shown to be associated with heart failure and dilated cardiomyopathy incase-control studies (Cappola et al., 2010; Stark et al., 2010).

4.2 Materials and Methods

We limited our analysis to the combined outcome of all-cause mortality, car-diac transplantation, or placement of a ventricular assist device to focus onthe most serious adverse outcomes associated with heart failure. Genotyp-ing was performed using the IBC cardiovascular SNP array (Keating et al.,

14




2008), which includes 14 SNPs across the 1p36 locus. To account for po-tential population stratification, we used multi-dimensional scaling of ∼30,000SNP genotypes to identify a homogenous subgroup of 1149 genetically inferredCaucasians, as previously described (Cappola et al., 2011).

Cox regression models were used to estimate associations between com-mon HSPB7-CLCNKA haplotypes and the combined outcome. The EM algo-rithm was used to impute the set of all haplotypes consistent with the observedunphased genotype for each participant. Corresponding posterior probabilityestimates were either used as weights (weighted estimation) or to obtain aweighted average of the imputed haplotypes for each participant (weightedhaplotype combination). Haplotypes with an estimated population frequencyless than 0.02 were defined as rare and recoded into one category. Becauseparticipants entered the cohort at different stages of their disease progression,the baseline hazard function was stratified by New York Heart Association(NYHA) functional classification (class I, II, III, or IV), a standard system toclassify severity of heart failure symptoms. Additional adjustment was madefor gender, age, heart failure etiology (ischemic or non-ischemic), and clini-cal site. Age exhibited non-proportional hazards and was adjusted for usinga time-varying covariate, which was obtained by multiplying age by a linearterm for the natural log of time.

4.3 Results

HSPB7-CLCNKA genotypes were available for 1149 genetically inferred Cau-casians; 803 (70%) were male and the median age at study entry was 58.4 years(inter-quartile range, 49.7 to 66.4 years). Ischemic heart failure etiology wasreported by 409 participants (36%). Approximately 18%, 44%, 30%, and 9%of participants were classified as a NYHA class I, II, III, and IV, respectively.Approximately 65% of participants had an unambiguous diplotype; 90% hada highest posterior probability of having a particular diplotype greater than0.765. The median follow-up time was 2.9 years (maximum, 5 years), duringwhich 251 participants (22%) experienced an adverse event: 152 deaths, 75cardiac transplantations, and 25 ventricular assist device placements.

In individual Cox regression analyses of 14 pre-selected SNPs in theHSPB7-CLCNKA gene, adjusted for gender, age (time-varying), etiology, andsite and stratified by NYHA class, we found that only one SNP, SNP 5(rs12083572), was associated with adverse outcomes at the α = 0.05 level.Using the EM algorithm, we inferred 10 haplotypes with an estimated popula-tion frequency > 0.02 from all 14 HSPB7-CLCNKA SNPs (Table 5). In global

15





Tab

le5:

Com

mon

HSPB7-CLCNKA

hap

loty

pes

,es

tim

ated

pop

ula

tion

freq

uen

cies

,an

des

tim

ated

asso

ciat

ions

wit

hri

skof

all-

cause

mor

tality

,ca

rdia

ctr

ansp

lanta

tion

,or

ventr

icula

ras

sist

dev

ice

pla

cem

ent:

Haz

ard

rati

o(H

R)

and

95%

confiden

cein

terv

al(C

I).In

div

idualp

valu

esob

tain

edfr

omuniv

aria

ble

Wal

dte

stto

eval

uat

ew

het

her

the

haz

ard

rati

ois

equal

to1;

over

allp

valu

eob

tain

edfr

omm

ult

ivar

iable

Wal

dte

stto

eval

uat

ew

het

her

all

haz

ard

rati

osar

eeq

ual

to1.

Wei

ghte

des

tim

ati

on

Wei

ghte

dco

mb

inati

on

Lab

elH

aplo

typ

eF

requ

ency

HR

(95%

CI)

pH

R(9

5%

CI)

pQ

AGAGCGAGACGAGG

0.0

36

1.1

9(0

.80,

1.7

7)

0.3

91.1

9(0

.76,

1.8

7)

0.4

4R

AGAGCGAGGGAAGG

0.1

60

1.0

4(0

.80,

1.3

5)

0.7

91.0

5(0

.81,

1.3

7)

0.7

3S

AGAGCGGAGCAAGA

0.0

36

1.2

0(0

.80,

1.7

7)

0.3

81.2

0(0

.74,

1.9

4)

0.4

7T

AGCGAGAGGCAAGA

0.0

66

0.5

5(0

.34,

0.8

8)

0.0

14

0.5

3(0

.32,

0.9

0)

0.0

19

UGACGCGGAGCGCGG

0.0

63

0.8

0(0

.53,

1.2

0)

0.2

80.8

0(0

.52,

1.2

1)

0.2

9V

GGAACAAGGGAAGG

0.0

37

0.4

9(0

.26,

0.9

2)

0.0

25

0.4

3(0

.21,

0.8

8)

0.0

21

WGGAACAGAGCAAGA

0.2

99

Ref

eren

tR

efer

ent

XGGAACAGAGCAAGG

0.0

48

1.3

6(0

.95,

1.9

5)

0.0

89

1.4

4(0

.95,

2.1

9)

0.0

90

YGGAGCAAGGCAAGG

0.0

50

1.1

5(0

.78,

1.6

9)

0.4

91.1

5(0

.77,

1.7

2)

0.4

9Z

GGCGCGGAGCAAGG

0.0

31

1.0

9(0

.62,

1.9

2)

0.7

61.1

0(0

.64,

1.8

7)

0.7

3O

vera

ll0.0

15

0.0

26

16




tests of association, both analysis methods found marginally significant hap-lotype associations with adverse outcomes (weighted estimation: p = 0.015;weighted combination: p = 0.026). Using the most common haplotype (haplo-type W, estimated frequency 0.299) as the reference haplotype, we observed adecreased risk of adverse outcomes with carriage of haplotype T (which differsfrom the referent at SNPs 1, 3, 4, 5, 6, 7, 8) and haplotype V (which differsfrom the referent at SNPs 7, 8, 10, 14). However, neither of these associationswould be considered statistically significant after adjustment for multiple com-parisons. Of note, both analysis methods yielded very similar point estimates,95% confidence intervals, and p values.

5 Discussion

In this paper, we developed non-iterative, regression-based methods to esti-mate associations between common haplotypes and a censored survival out-come in large cohort studies, such that haplotype imputation is done onceas a data-processing step. We focused on moderate associations of commonhaplotypes inferred from tag SNPs in small numbers of genes in large prospec-tive studies. In our simulation study, we showed that non-iterative weightedestimation and weighted haplotype combination provide valid tests for geneticassociations and reliable estimates of moderate associations between commonhaplotypes and a censored survival outcome. Our case study provided an ex-ample in which the estimation methods provided very similar results becausethere was modest phase ambiguity.

Our simulation study focused on effects of common haplotypes. Thus,our results do not extend to estimating effects of rare haplotypes, nor to smallcohort studies, in which common haplotypes may appear to be rare. In ourcase study, we grouped together all imputed haplotypes with an estimatedpopulation frequency < 0.02. In addition, our simulation study focused oncohort studies of unrelated individuals. Thus, we did not consider situationsin which individuals are related. However, our methodology may be applicableto studies of related individuals, and may in fact perform better, because phasecan be estimated much more accurately from related individuals. There areoptions for estimating haplotype associations in family-based studies usingFBAT/PBAT (Laird et al., 2000).

An advantage of non-iterative estimation methods is their relative easeof implementation in standard statistical software. For weighted haplotypecombination, implementation only requires the standard Cox regression model.For non-iterative weighted estimation, implementation requires that the Cox

17





regression model accommodate probability weights, and depends on the avail-ability of a robust variance estimator. Both options are typically availablein standard software. In addition, because implementation of both methodsdepends on the Cox regression model, it is straightforward to adjust for orspecify interaction with environmental exposures, construct confidence inter-vals for haplotype associations, and generate inference via Wald tests. It isalso straightforward to extend the Cox model to include any requisite time-varying covariates and/or stratify the baseline hazard function by a key factor.For example, in our case study, we included a time-varying covariate for ageand stratified the baseline hazard function by NYHA functional classification.Given the reliability of non-iterative methods to estimate moderate associ-ations of common haplotypes, and their ease of implementation in standardsoftware, we continue to recommend them to applied researchers as a viable op-tion for haplotype risk estimation. In the Appendix, we provide instructionsfor implementing non-iterative weighted estimation and weighted haplotypecombination in Stata.

Appendix: Stata Implementation

In this Appendix, we provide Stata commands for implementing non-iterativemethods to estimate haplotype associations with a censored survival outcome.We assume that a program such as PHASE is used to impute the haplo-types and estimate population haplotype frequencies. The commands belowuse the following notation: event time time; event indicator event; exposurex; imputed haplotypes h1, h2, . . . , hk, with the referent haplotype removed;estimated diplotype probability prob; subject identifier id.

A Non-iterative weighted estimation

Create a multi-record dataset by merging the imputed haplotypes with theoutcome and exposure data. Declare the data to be survival data with prob-ability weights:

stset time [pweight=prob], failure(event)

Fit the Cox regression model with a robust variance estimator:

stcox h1-hk x, vce(cluster id)

18




See help stcox for additional options, including those for specifying stratifi-cation factors and including time-varying covariates.

B Weighted haplotype combination

Collapse the set of imputed haplotypes for each individual according to theestimated diplotype probabilities:

foreach var of varlist h1-hk {

quietly replace ‘var’ = ‘var’*prob

}

drop prob

collapse (sum) h1-h10, by(id)

Create a single-record dataset by merging the resultant haplotype data withthe outcome and exposure data. Declare the data to be survival data and fitthe Cox regression model:

stset time, failure(event)

stcox h1-hk x

References

Cappola TP, Li M, He J, Ky B, Gilmore J, Qu L, Keating B, Reilly M, Kim CE,Glessner J, Frackelton E, Hakonarson H, Syed F, Hindes A, MatkovichSJ, Cresci S, Dorn II GW. 2010. Common variants in HSPB7 andFRMD4B associated with advanced heart failure. Circ CardiovascGenet 3:147–54.

Cappola TP, Matkovich SJ, Wang W, van Booven D, Li M, Wang X, Qu L,Sweitzer NK, Fang JC, Reilly MP, Hakonarson H, Nerbonne JM, DornII GW. 2011. Loss-of-function DNA sequence variant in the CLCNKAchloride channel implicates the cardio-renal axis in interindividual heartfailure risk variation. Proc Natl Acad Sci U S A 108:2456–61.

Chen J, Chatterjee N. 2006. Haplotype-based association analysis in cohortand nested case-control studies. Biometrics 62:28–35.

Chen J, Peters U, Foster C, Chatterjee N. 2004. A haplotype-based test of as-sociation using data from cohort and nested case-control epidemiologicstudies. Hum Hered 58:18–29.

19





Cox D. 1972. Regression models and life tables (with discussion). J R StatSoc Series B 34:187–220.

Dempster AP, Laird NM, Rubin D. 1977. Maximum likelihood from incom-plete data via the EM algorithm. J R Stat Soc Series B 39:1–38.

French B, Lumley T, Monks SA, Rice KM, Hindorff LA, Reiner AP, PsatyBM. 2006. Simple estimates of haplotype relative risks in case-controldata. Genet Epidemiol 30:485–94.

French B, Lumley T. 2007. haplo.ccs: Estimate haplotype relative risks incase-control data. R package version 1.3. http://CRAN.R-project.org/package=haplo.ccs

Keating BJ, Tischfield S, Murray SS, Bhangale T, Price TS, Glessner JT,Galver L, Barrett JC, Grant SF, Farlow DN, Chandrupatla HR, HansenM, Ajmal S, Papanicolaou GJ, Guo Y, Li M, Derohannessian S, deBakker PI, Bailey SD, Montpetit A, Edmondson AC, Taylor K, GaiX, Wang SS, Fornage M, Shaikh T, Groop L, Boehnke M, Hall AS,Hattersley AT, Frackelton E, Patterson N, Chiang CW, Kim CE, Fab-sitz RR, Ouwehand W, Price AL, Munroe P, Caulfield M, Drake T,Boerwinkle E, Reich D, Whitehead AS, Cappola TP, Samani NJ, Lu-sis AJ, Schadt E, Wilson JG, Koenig W, McCarthy MI, Kathiresan S,Gabriel SB, Hakonarson H, Anand SS, Reilly M, Engert JC, NickersonDA, Rader DJ, Hirschhorn JN, Fitzgerald GA. 2008. Concept, designand implementation of a cardiovascular gene-centric 50 k SNP arrayfor large-scale genomic association studies. PLoS One 3:e3583.

Ky B, Kimmel SE, Safa RN, Putt ME, Sweitzer NK, Fang JC, Sawyer DB,Cappola TP. 2009. Neuregulin-1 beta is associated with disease severityand adverse outcomes in chronic heart failure. Circulation 120:310–7.

Laird N, Horvath S, Xu X. Implementing a unified approach to family basedtests of association. Genet Epidemiol 19:S36–S42.

Lin DY. 2004. Haplotype-based association analysis in cohort studies of unre-lated individuals. Genet Epidemiol 26:255–64.

Lin DY, Zeng D. 2006. Likelihood-based inference on haplotype effects ingenetic association studies. J Am Stat Assoc 101:89–104.

Marciante KD, Bis JC, Rieder MJ, Reiner AP, Lumley T, Monks SA, Kooper-berg C, Carlson C, Heckbert SR, Psaty BM. 2007. Renin-angiotensinsystem haplotypes and the risk of myocardial infarction and strokein pharmacologically treated hypertensive patients. Am J Epidemiol166:19–27.

20




Neuhausen SL, Brummel S, Ding YC, Singer CF, Pfeiler G, Lynch HT, NathansonKL, Rebbeck TR, Garber JE, Couch F, Weitzel J, Narod SA, Ganz PA,Daly MB, Godwin AK, Isaacs C, Olopade OI, Tomlinson G, RubinsteinWS, Tung N, Blum JL, Gillen DL. 2009. Genetic variation in insulin-like growth factor signaling genes and breast cancer risk among BRCA1and BRCA2 carriers. Breast Cancer Res 11:R76.

Rebbeck TR, Mitra N, Domchek SM, Wan F, Chuai S, Friebel TM, Panos-sian S, Spurdle A, Chenevix-Trench G, kConFab, Singer CF, PfeilerG, Neuhausen SL, Lynch HT, Garber JE, Weitzel JN, Isaacs C, CouchF, Narod SA, Rubinstein WS, Tomlinson GE, Ganz PA, Olopade OI,Tung N, Blum JL, Greenberg R, Nathanson KL, Daly MB. 2009. Mod-ification of ovarian cancer risk by BRCA1/2-interacting genes in a mul-ticenter cohort of BRCA1/2 mutation carriers. Cancer Res 69:5801–10.

Rebbeck TR, Mitra N, Domchek SM, Wan F, Friebel TM, Tran TV, SingerCF, Tea MK, Blum JL, Tung N, Olopade OI, Weitzel JN, Lynch HT,Snyder CL, Garber JE, Antoniou AC, Peock S, Evans DG, Paterson J,Kennedy MJ, Donaldson A, Dorkins H, Easton DF; for the Epidemio-logical Study of BRCA1 and BRCA2 Mutation Carriers (EMBRACE),Rubinstein WS, Daly MB, Isaacs C, Nevanlinna H, Couch FJ, AndrulisIL, Freidman E, Laitman Y, Ganz PA, Tomlinson GE, Neuhausen SL,Narod SA, Phelan CM, Greenberg R, Nathanson KL. 2011. Modifica-tion of BRCA1-associated breast and ovarian cancer risk by BRCA1-interacting genes. Cancer Res 71:5792–805.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. 2002. Scoretests for association of traits with haplotypes when linkage phase isambiguous. Am J Hum Genet 70:425–34.

Schaid DJ. 2004. Evaluating associations of haplotypes with traits. GenetEpidemiol 27:348–64.

Scheike TH, Martinussen T, Silver JD. 2010. Estimating haplotype effects forsurvival data. Biometrics 66:705–15.

Sinnwell JP, Schaid DJ. 2009. haplo.stats: Statistical analysis of haplotypeswith traits and covariates when linkage phase is ambiguous. R packageversion 1.4.4. http://CRAN.R-project.org/package=haplo.stats

Souverein OW, Zwinderman AH, Jukema JW, Tanck MW. 2008. Estimat-ing effects of rare haplotypes on failure time using a penalized Coxproportional hazards regression model. BMC Genet 9:9.

21





Stark K, Esslinger UB, Reinhard W, Petrov G, Winkler T, Komajda M, IsnardR, Charron P, Villard E, Cambien F, Tiret L, Aumont MC, DubourgO, Trochu JN, Fauchier L, Degroote P, Richter A, Maisch B, WichterT, Zollbrecht C, Grassl M, Schunkert H, Linsel-Nitschke P, ErdmannJ, Baumert J, Illig T, Klopp N, Wichmann HE, Meisinger C, KoenigW, Lichtner P, Meitinger T, Schillert A, Konig IR, Hetzer R, Heid IM,Regitz-Zagrosek V, Hengstenberg C. 2010. Genetic association studyidentifies HSPB7 as a risk gene for idiopathic dilated cardiomyopathy.PLoS Genet 6:e1001167.

Stephens M, Donnelly P. 2003. A comparison of Bayesian methods for haplo-type reconstruction from population genotype data. Am J Hum Genet73:1162–69.

Tan Q, Christiansen L, Bathum L, Zhao JH, Yashin AI, Vaupel JW, Chris-tensen K, Kruse TA. 2005. Estimating haplotype relative risks onhuman survival in population-based association studies. Hum Hered59:88–97.

Tregouet DA, Tiret L. 2004. Cox proportional hazards survival regressionin haplotype-based association analysis using the Stochastic-EM algo-rithm. Eur J Hum Genet 12:971–4.

Venkatraman ES, Mitra N, Begg CB. 2004. A method for evaluating theimpact of individual haplotypes on disease incidence in molecular epi-demiology studies. Stat Appl Genet Mol Biol 3:27.

Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG.2002. Testing associations of statistically inferred haplotypes with dis-crete and continuous traits in samples of unrelated individuals. HumHered 53:79–91.

22




statistical applications in genetics and molecular biology · volume 11, issue 3 2012 article 4...

Documents