
Pinard M-H, Gay C, Pastoret P-P, Dodet B (eds): Animal Genomics for Animal Health. Dev Biol (Basel). Basel, Karger, 2008, vol 132, pp 293-299.

A Systematic, Data-driven Approach to the Combined Analysis of Microarray and QTL Data

C. Rennie1,2, H. Hulme2, P. Fisher2, L. Hall3, M. Agaba4, H.A. Noyes1, S.J. Kemp1,4, A. Brass2

1. School of Biological Sciences, Biosciences Building, Liverpool, UK
2. School of Computer Science/Faculty of Life Sciences, University of Manchester, UK
3. Roslin Institute, Roslin, Midlothian, Scotland, UK
4. ILRI, Nairobi, Kenya

Keywords: automated analysis, microarray, QTL, workflow

Abstract: High-throughput technologies inevitably produce vast quantities of data. This presents challenges in terms of developing effective analysis methods, particularly where the analysis involves combining data derived from different experimental technologies. In this investigation, a systematic approach was applied to combine microarray gene expression data, quantitative trait loci (QTL) data and pathway analysis resources in order to identify functional candidate genes underlying tolerance to Trypanosoma congolense infection in cattle. We automated much of the analysis using Taverna workflows previously developed for the study of trypanotolerance in the mouse model.

Pathways represented by genes within the QTL regions were identified, and this list was subsequently ranked according to which pathways were over-represented in the set of genes that were differentially expressed (over time or between tolerant N’dama and susceptible Boran breeds) at various timepoints after T. congolense infection. The genes within the QTL that played a role in the highest ranked pathways were flagged as good targets for further investigation and experimental confirmation.

INTRODUCTION

The analysis of microarray gene expression data can present difficulties due to the vast size of the datasets. Depending on the purpose of the study, analysis may be further complicated by the need to combine data produced using different experimental techniques or by the underlying complexity of the phenotype being investigated. A systematic, data-driven, semi-automated analysis pipeline was developed for the pathway-based combined analysis of microarray and quantitative trait locus (QTL) data as part of a study investigating the genetics underlying tolerance to African bovine trypanosomiasis (nagana).

Nagana is transmitted by the tsetse fly, leading to loss of productivity and often death in infected cattle. It represents a major constraint on livestock production in Africa [1].

Some breeds of cattle, such as the Boran, are susceptible to the pathological consequences of trypanosomiasis. Others, such as the N'dama, are more resistant to these effects (trypanotolerant) [2]. However, the susceptible breeds have desirable traits, such as greater size, and may be preferred by farmers. Identification of genes that influence response to trypanosomiasis might inform new treatment approaches, or even pave the way for creating transgenic breeds that combine the desirable traits of susceptible and trypanotolerant cattle.

Trypanotolerance is a complex phenotype including several distinct components, likely to involve separate genetic control mechanisms. Features include the ability to control anaemia, control parasitaemia and maintain bodyweight. Previous studies provide evidence of the complexity of trypanotolerance. The trypanosomiasis response of haematopoietic chimeric twins bred from one Boran and one N’dama parent was studied, demonstrating that control of anaemia depends on bone marrow from a trypanotolerant background, whereas control of parasitaemia does not [3]. The mapping study that provided QTL data used in this analysis showed that the proportion of phenotypic variation explained by each QTL was between 6 and 20%, suggesting that multiple genes, or complex epistatic or environmental effects, may influence each trait [4].

A microarray gene expression time course study was carried out to investigate gene expression differences between (trypanotolerant) N'dama and (trypanosusceptible) Boran cattle infected with T. congolense strain IL1180. This study generated a vast dataset. Thousands of probe sets on the array generated signals that were significantly different between timepoints and/or between the two breeds (in T-tests or paired T-tests with p≤0.01).

A mapping study identified QTL for 16 phenotypic traits associated with trypanotolerance in Boran and N’dama cattle [4]. The gene underlying a QTL is not assumed to be differentially expressed. However, it is expected to connect biologically with differentially expressed genes. The known pathways that included a gene within one of the five QTL with the largest effect were identified and compared with the known pathways that included a differentially expressed gene. The rationale behind this approach was to establish the possible connections between the QTL and the differentially expressed genes.

A systematic strategy was used to enable an objective triage of the datasets, resulting in a short list of strong candidate pathways that included both a differentially expressed gene and a gene within a trypanotolerance QTL. These pathways were ranked according to the results of a Fisher exact test performed using the Database for Annotation, Visualisation and Integrated Discovery (DAVID) [5]. A literature search was carried out to determine whether the biological function of each pathway was likely to be linked to the phenotypic trait influenced by the QTL.

Large sections of the analysis were automated by adapting Taverna workflows originally developed for the study of trypanotolerance in the mouse model [6]. This allowed for the entire analysis to be repeated consistently and relatively quickly and, for example, for the incorporation of information on the bovine genome from a different EnsEMBL build. It was also possible to adapt the analysis procedure to examine a different species or a different phenotype for which QTL data was available.

MATERIALS AND METHODS

Microarray gene expression data was acquired using Affymetrix Bovine Genome ‘100 format (Midi)’ microarrays for liver samples harvested from Boran and N’dama cattle at 0, 12, 15, 18, 21, 26, 29, 32 and 35 days post-infection.

This data was analysed with dChip [7] to identify and remove outliers before normalisation using the robust multi-array average (RMA) method. Principal components analysis (PCA) was used to check that hybridisations grouped as expected.

T-tests were used to compare gene expression between breeds at each time point. Paired T-tests (using data for the same individual animals at different timepoints) were used to compare gene expression for each time point with day 0. Lists of probes that showed differential gene expression (p≤0.01) between breeds or over time were compiled.
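For reference, the two test statistics can be written out for a single probe set. The text does not state which two-sample variant was applied, so the pooled-variance (Student) form shown here is an assumption, and the symbols (subscripts N and B for N'dama and Boran, d for a post-infection day) are introduced only for illustration.

```latex
% Between-breed comparison at one time point (pooled variance assumed),
% with n_N + n_B - 2 degrees of freedom
t_{\text{breed}} = \frac{\bar{x}_{N} - \bar{x}_{B}}
                        {s_p \sqrt{\dfrac{1}{n_N} + \dfrac{1}{n_B}}},
\qquad
s_p^2 = \frac{(n_N - 1)\,s_N^2 + (n_B - 1)\,s_B^2}{n_N + n_B - 2}

% Within-animal comparison of day d with day 0 (paired test),
% with n - 1 degrees of freedom for n animals
t_{\text{time}} = \frac{\bar{\delta}}{s_{\delta} / \sqrt{n}},
\qquad
\delta_i = x_{i,d} - x_{i,0}
```

A probe set enters the differential expression lists when the p-value corresponding to either statistic is at or below 0.01.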

A previous study identified trypanotolerance QTL in N’dama and Boran cattle [4]. Five QTL were selected to be included in this analysis based on phenotypic trait, mapping resolution and strength of effect. Base pair positions of QTL relative to the EnsEMBL bovine genome preliminary build Btau2.0 were determined manually. Names and phenotypes for the five QTL are shown in Table 1.

Table 1: Name and phenotype for the five trypanotolerance QTL used in this analysis. For more detailed information, please refer to the original mapping study [4].

To combine microarray and QTL data, a Taverna workflow previously developed for the study of trypanotolerance in the mouse model was adapted [6]. The paper cited provides a full description. In brief, lists of differentially expressed genes (over time or between breeds) were associated with Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways. A separate process identified genes within QTL and associated these with KEGG pathways. A third process compared these lists to produce a list of KEGG pathways that contained both differentially expressed genes and genes from the QTL.
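The three processes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Taverna workflow itself: the gene names, coordinates, QTL interval and gene-to-pathway table are made-up placeholders, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical gene positions: gene -> (chromosome, start_bp, end_bp).
GENE_POSITIONS = {
    "GENE_A": ("BTA7", 1_000_000, 1_050_000),
    "GENE_B": ("BTA7", 9_000_000, 9_020_000),
    "GENE_C": ("BTA2", 500_000, 540_000),
}

# Hypothetical gene -> pathway annotation (pathway names used as plain strings).
GENE_PATHWAYS = {
    "GENE_A": {"Apoptosis", "Focal adhesion"},
    "GENE_B": {"Cell cycle"},
    "GENE_C": {"Apoptosis"},
    "GENE_D": {"Focal adhesion"},
}


def genes_in_qtl(qtl, positions):
    """Return genes whose span overlaps the QTL interval (chrom, start, end)."""
    chrom, q_start, q_end = qtl
    return {
        gene
        for gene, (g_chrom, g_start, g_end) in positions.items()
        if g_chrom == chrom and g_start <= q_end and g_end >= q_start
    }


def pathways_for(genes, annotation):
    """Map each pathway to the subset of `genes` annotated to it."""
    result = defaultdict(set)
    for gene in genes:
        for pathway in annotation.get(gene, ()):
            result[pathway].add(gene)
    return dict(result)


def shared_pathways(qtl_genes, de_genes, annotation):
    """Pathways holding at least one QTL gene and one differentially expressed gene."""
    qtl_pw = pathways_for(qtl_genes, annotation)
    de_pw = pathways_for(de_genes, annotation)
    return {
        pathway: {"qtl_genes": qtl_pw[pathway], "de_genes": de_pw[pathway]}
        for pathway in qtl_pw.keys() & de_pw.keys()
    }


qtl_genes = genes_in_qtl(("BTA7", 900_000, 2_000_000), GENE_POSITIONS)
candidates = shared_pathways(qtl_genes, {"GENE_C", "GENE_D"}, GENE_PATHWAYS)
```

Note that the QTL gene itself (GENE_A in this toy input) need not appear in the differentially expressed list; it is retained because it shares a pathway with genes that do.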

Some adaptations of the workflow were necessary. Rather than the mouse EnsEMBL build and IDs, the bovine EnsEMBL preliminary build (Btau2.0) was used. Bovine gene IDs were retrieved for Affymetrix probes then mapped to human homologues (using EnsEMBL data for NCBI build 36) so that human IDs could be used for the remainder of the analysis (available annotation on bovine genes is very limited). Output was in the same form as the original, comprising a list of KEGG pathways that included at least one differentially expressed gene and at least one gene from the QTL.
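The two-step identifier mapping (probe set to bovine gene, bovine gene to human homologue) amounts to chaining two lookups and keeping track of what is lost at each step. The sketch below assumes simple in-memory dictionaries with invented identifiers; in practice the tables would be derived from array annotation and EnsEMBL homology data, and the function name is hypothetical.

```python
# Hypothetical lookup tables with invented identifiers.
PROBE_TO_BOVINE = {
    "probe_1": "bovine_gene_1",
    "probe_2": "bovine_gene_2",
    "probe_3": None,  # probe set with no mapped bovine gene
}
BOVINE_TO_HUMAN = {
    "bovine_gene_1": "human_gene_1",
    # bovine_gene_2 has no recorded human homologue in this toy table
}


def map_probes_to_human(probes, probe_to_bovine, bovine_to_human):
    """Return (mapped, unmapped): probe -> human ID, plus probes lost on the way."""
    mapped, unmapped = {}, []
    for probe in probes:
        bovine = probe_to_bovine.get(probe)
        human = bovine_to_human.get(bovine) if bovine else None
        if human is None:
            unmapped.append(probe)
        else:
            mapped[probe] = human
    return mapped, unmapped


mapped, unmapped = map_probes_to_human(
    ["probe_1", "probe_2", "probe_3", "probe_4"], PROBE_TO_BOVINE, BOVINE_TO_HUMAN
)
```

Returning the unmapped probes alongside the mapped ones makes the attrition at this stage visible, which matters when annotation for the species of interest is sparse.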

This list was ranked based on the p-value of each pathway in a Fisher exact test performed on the microarray data using DAVID, indicating whether pathway genes showed more differential expression than expected by chance. The list was annotated to add gene symbols for pathway genes in the QTL and to indicate the breeds and timepoints in which pathway genes were differentially expressed. Further annotation was derived from gene and pathway resources including GenBank, iHOP, GenMAPP and GeneGo: MetaCore.
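The ranking step can be illustrated with a plain one-sided Fisher exact test, computed here from the hypergeometric distribution using only the Python standard library. This is a sketch of the general technique rather than of DAVID: DAVID applies its own variant of the test and its own background set, so values from this code would not be expected to match DAVID output. The pathway names and counts are invented.

```python
from math import comb


def fisher_enrichment_p(de_in_pathway, de_total, pathway_total, background_total):
    """One-sided Fisher exact p-value, P(X >= de_in_pathway), where X is the
    number of pathway genes among de_total genes drawn from the background."""
    upper = min(de_total, pathway_total)
    denom = comb(background_total, de_total)
    tail = sum(
        comb(pathway_total, k) * comb(background_total - pathway_total, de_total - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / denom


def rank_pathways(counts, de_total, background_total):
    """Sort pathways by ascending p-value.

    counts maps pathway -> (de_genes_in_pathway, genes_in_pathway).
    """
    scored = [
        (pathway, fisher_enrichment_p(k, de_total, size, background_total))
        for pathway, (k, size) in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Toy example: 20 background genes, 5 of them differentially expressed.
ranked = rank_pathways(
    {"Pathway X": (3, 5), "Pathway Y": (1, 5)}, de_total=5, background_total=20
)
significant = [pathway for pathway, p in ranked if p <= 0.05]
```

In this toy input neither pathway reaches p≤0.05, which shows why the threshold shortens the target list: sharing a pathway with a differentially expressed gene is not, by itself, evidence of over-representation.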

Figure 1 summarises the analysis protocol described above.

QTL      Phenotype
BTA2     Anaemia
BTA4     Parasitaemia
BTA7     Anaemia and parasitaemia
BTA16    Anaemia
BTA27    Anaemia


Fig. 1: Summary of the analysis procedure. Automated sections are indicated using grey shading.


RESULTS

This analysis procedure could be re-used or adapted to examine another species or phenotype for which QTL data are available. The modular nature of the protocol and of Taverna workflows facilitates adding or altering analysis stages. Workflow scripts and supplementary data (e.g. files from intermediate stages) are available from the authors.

In the bovine trypanotolerance study, the result of the analysis procedure was to provide a short list of targets for further investigation. This result can be quantified by assessing the numbers of genes requiring further investigation based on the combined analysis results or on the original data.

Out of 24,128 probe sets on the array, 12,591 were significantly differentially expressed (p≤0.01) in T-tests or paired T-tests comparing expression between breeds or over time. Of these probe sets, 8,342 were mapped to a known gene, in total representing 7,071 unique gene symbols.

After combining the pathway lists for differentially expressed and QTL genes, pathway genes within QTL provided a list of 127 targets. Restricting the pathway list to those with a significant (p≤0.05) score in the DAVID Fisher exact test reduced this to 51 targets (it could be reduced further by checking whether expression changes are downstream of the QTL gene and whether the pathway function is related to the QTL phenotype).
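The scale of the triage can be made explicit with a short calculation on the counts quoted above. The variable names are illustrative only; the numbers are those reported in the text.

```python
# Counts quoted in the Results text.
probe_sets_on_array = 24_128
probe_sets_de = 12_591          # differentially expressed at p <= 0.01
probe_sets_mapped = 8_342       # DE probe sets mapped to a known gene
targets_shared_pathways = 127   # QTL genes in pathways shared with DE genes
targets_significant = 51        # after restricting to pathways with p <= 0.05


def percent(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


triage = {
    "DE probe sets as % of array": percent(probe_sets_de, probe_sets_on_array),
    "mapped as % of DE probe sets": percent(probe_sets_mapped, probe_sets_de),
    "targets kept after p filter (%)": percent(
        targets_significant, targets_shared_pathways
    ),
}

for label, value in triage.items():
    print(f"{label}: {value}")
```

Roughly half of the array is differentially expressed in at least one comparison, whereas the significance filter on pathways retains about two fifths of the 127 QTL targets.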

The list of pathways with a significant score (p≤0.05) in the DAVID Fisher exact test is displayed in Table 2. Pathway genes lying within each QTL are also listed. Note that these data are based on an analysis using the EnsEMBL Btau2.0 preliminary build. A more recent preliminary build is available, and the analysis will be repeated, and key findings discussed, in a future publication.

DISCUSSION

When studying complex phenotypes, analysis based on biological processes already known to be involved may be insufficient. It is possible that other key biological pathways, or complex interactions between them, could be missed. Data-driven approaches are useful to identify the biological processes showing strongest variation in the results.

The aim of a pathway-based approach to analysing microarray and QTL data is to identify biologically meaningful links between the two datasets. The gene underlying a QTL is not necessarily differentially expressed, but may influence the expression of other genes downstream in a known pathway. This approach allows such genes to be identified without detailed investigation of every gene in the QTL regions or every gene that is differentially expressed.

Automation is increasingly necessary to handle the vast quantities of data produced by high-throughput technologies, where manual analysis of the entire dataset is not feasible. Automated approaches are systematic, promoting consistency and reducing bias. Consistent replication of automated analyses is relatively simple, allowing separate studies to produce comparable results and allowing analyses to be repeated in order to incorporate new information (e.g. from updates to genome build and gene information available in public databases). Automated analysis can be an effective triage process, producing a short list of strong targets for thorough manual investigation.


Table 2: Pathways with a significant (p≤0.05) score in a Fisher exact test to determine whether the differential expression of pathway genes is higher than expected by chance. The columns on the right give the gene symbols for pathway genes within each of the QTL.

KEGG pathway name BTA2 BTA4 BTA7 BTA16 BTA27
Leukocyte transendothelial migration: VAV1, CLDN23
Regulation of actin cytoskeleton: FN1, CHRM2, VAV1, BRAF, PIP5K3, FGF20
Cell cycle: MCM6, CDKN2D, ORC4L
Gap junction: PRKACA, GNAQ, TUBB4
Focal adhesion: FN1, ZYX, COL5A3, CAPN2, BRAF, VAV1
MAPK signalling pathway: CASP8, CASP2, ECSIT, DUSP10, BRAF, PRKACA, DUSP4, FGF20, IKBKB
Hematopoietic cell lineage: EPOR, FCER2
Huntington’s disease: CASP8
Glycerolipid metabolism: LCT, DGKI, AGPAT6
Axon guidance: EFNB1, EPHA1, UNC5D, EPHB6
Glycerophospholipid metabolism: DGKI, ARD1A, AGPAT6
Adherens junction: INSP, FGFR1
Neurodegenerative disorders: CASP8
T cell receptor signalling pathway: CD28, VAV1, IKBKB, CTLA4, ICOS
Long-term potentiation: PRKACA, GNAQ, BRAF
Apoptosis: CASP8, IRAK1, CAPN2, IKBKB, PRKACA
Toll-like receptor signalling pathway: CASP8, IRAK1, IKBKB, TICAM1
Wnt signalling pathway: PRKACA, DKK4, SFRP1
Glutathione metabolism: IDH1, GSTK1
Calcium signalling pathway: CHRM2, PTGER1, GNAQ, ADRB3, PRKACA, GNA14, VDAC3, ITPKB


CONCLUSION

Systematic data-driven automated approaches offer an excellent means to triage data from high-throughput technologies, providing a shortlist of viable targets for thorough manual analysis and experimental confirmation.

ACKNOWLEDGEMENTS

This work was wholly funded by The Wellcome Trust.

REFERENCES

1 Kristjanson PM, Swallow BM, Rowlands GJ, Kruska RL, de Leeuw PN: Measuring the costs of African animal trypanosomosis, the potential benefits of control and returns to research. Agric Syst 1999;59:79-98.

2 Murray M, D’Ieteren G, Teale AJ: Trypanotolerance, in Maudlin I, Holmes PH, Miles MA (eds): The Trypanosomiases. Wallingford UK, CABI Publishing, 2004, pp 461-477.

3 Naessens J, Leak SG, Kennedy D, Kemp SJ, Teale AJ: Responses of bovine chimaeras combining trypanosomosis resistant and susceptible genotypes to experimental infection with Trypanosoma congolense. Vet Parasitol 2003;111:125-142.

4 Hanotte O, Ronin Y, Agaba M, Nilsson P, Gelhaus A, Horstmann R et al: Mapping of quantitative trait loci controlling trypanotolerance in a cross of tolerant West African N’dama and susceptible East African Boran cattle. Proc Natl Acad Sci USA 2003;100(13):7443-7448.

5 Dennis GJ, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC et al: DAVID: Database for Annotation, Visualisation, and Integrated Discovery. Genome Biol 2003;4(9):R60.

6 Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S et al: A systematic strategy for large-scale analysis of genotype-phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucl Acids Res 2007;35(16):5625-5633.

7 Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc Natl Acad Sci USA 2001;98:31-36.

Catriona Rennie, LF8, Kilburn Building, The University of Manchester, Oxford Rd, Manchester, M13 9PL, UK.
E-mail: [email protected]
