Using network information with gene expression data

Jean Yee Hwa Yang, School of Mathematics and Statistics

Upload: australian-bioinformatics-network

Post on 11-May-2015


DESCRIPTION

Large-scale molecular interaction networks are dynamic in nature, and changes in these networks, rather than changes in individual genes/proteins, are often drivers of complex diseases such as cancer. In this talk, I use data from stage III melanoma patients provided by Prof. Mann's lab, comprising clinical, mRNA and miRNA data, to discuss how network information can be utilised in the analysis of gene expression data to aid biological interpretation. I will also present an R software package, Variability Analysis in Networks (VAN), that enables an integrative analysis of protein-protein or microRNA-gene networks and expression data to identify hubs (i.e. highly connected proteins/microRNAs in a network) that are dysregulated in terms of expression correlation with their interaction partners.

TRANSCRIPT

Page 1: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Using network information with gene expression data

Jean Yee Hwa Yang School of Mathematics and Statistics

Page 2: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Central dogma of molecular biology

Image source: Central dogma of molecular biology, Wikipedia; http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology

Microarrays are used to detect the extent to which genes are being expressed.

Page 3: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Technologies ~ measuring expression

Figure: mRNA and microRNA measured by microarray (expression data) or next-gen sequencing (count data), giving a data matrix of variables (5000-30000 genes, or 2000 miRNAs) by N samples.

Page 4: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Motivation: Melanoma prognosis

›  Melanomas are common in a large demographic of the population, especially in Caucasians living in sunny climates. Of those that metastasise (Stage III), about 40% go on to live cancer free, but another 40% succumb to the disease in less than 1 year.

›  Samples were obtained from Professor Graham Mann's group from the Westmead Institute for Cancer Research and Melanoma Institute Australia.

›  Aim: To predict survival prognosis for Stage III melanoma patients.

Currently, we have gene expression data for 79 Stage III individuals. In addition, we have clinical data consisting of patient stage at diagnosis and survival status, as well as histology, pathology and mutation information.

Page 5: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Research aims

› New prognostic markers

-  To determine whether there are significant biomarker and pathway differences between melanomas of good and bad prognosis after resection of nodal metastatic disease;

› New therapeutic targets

-  To identify and validate the principal regulatory pathway abnormalities that characterise metastatic (stage III and IV) melanomas;

-  To investigate novel genomic drivers of melanoma tumour progression and outcome.

Provided by Sara-Jane Schramm  (Usyd)

Page 6: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Survival outcome

Figure: histogram of survival time of stage III melanoma patients (x-axis: survival time in years, 0 to 12; y-axis: frequency, 0 to 20).

Two survival groups:

Bad prognosis: survival < 1 year and died due to melanoma

Good prognosis: survival ≥ 4 years with no sign of relapse
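As a minimal sketch of the two-group definition above (not taken from the talk), the grouping rule could be written as follows. The function name and the three patient fields are hypothetical; patients who meet neither rule are left unlabelled.

```python
from typing import Optional

def prognosis_class(survival_years: float,
                    died_of_melanoma: bool,
                    relapsed: bool) -> Optional[str]:
    """Return 'PP' (poor prognosis), 'GP' (good prognosis), or None."""
    # Bad prognosis: survival < 1 year and died due to melanoma.
    if survival_years < 1 and died_of_melanoma:
        return "PP"
    # Good prognosis: survival >= 4 years with no sign of relapse.
    if survival_years >= 4 and not relapsed:
        return "GP"
    # Everything in between belongs to neither group.
    return None

# Invented example patients: (survival_years, died_of_melanoma, relapsed)
patients = [(0.6, True, True), (5.2, False, False), (2.5, True, True)]
labels = [prognosis_class(*p) for p in patients]
```

Only the labelled patients would enter a two-class comparison; the intermediate group is set aside.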

Page 7: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Gene expression (microarray data)

No correlation with BRAF mutation

Figure: expression values for the PP (poor prognosis) and GP (good prognosis) groups. Pink: no BRAF mutation. Gray: BRAF mutation.

Page 8: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Gene expression : DE analysis

Three main types of questions:

1. Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).

2. Cluster analysis: finding common patterns between samples / genes.

3. Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).

Image reproduced from JOURNAL OF INVESTIGATIVE DERMATOLOGY|Vol 133|2013
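To make the DE question concrete, the sketch below ranks genes by a plain Welch t-statistic between the two classes. It is an illustration only: the gene names and expression values are invented, and the moderated t-statistic mentioned later in the talk is not reproduced here.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch two-sample t-statistic (does not assume equal variances)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

# Toy expression values per gene: (GP samples, PP samples). Invented numbers.
expr = {
    "geneA": ([5.1, 5.3, 4.9, 5.2], [7.0, 7.2, 6.8, 7.1]),
    "geneB": ([3.0, 3.4, 2.8, 3.1], [3.1, 2.9, 3.3, 3.0]),
}

# One statistic per gene: this is the gene-by-gene ("single gene") view.
t_stats = {gene: welch_t(gp, pp) for gene, (gp, pp) in expr.items()}

# Rank genes by the magnitude of the statistic, largest first.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
```

Each gene is tested on its own here, which is the single-gene view that the network-level analysis later in the talk moves beyond.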

Page 9: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Gene expression : cluster analysis

Three main types of questions:

1. Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).

2. Cluster analysis: finding common patterns between samples / genes.

3. Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).

Page 10: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Gene expression : classification

Three main types of questions:

1. Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).

2. Cluster analysis: finding common patterns between samples / genes.

3. Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).

Figure: example decision-tree classifier (splits on Gene 1, Mi1 < -0.67, and Gene 2, Mi2 > 0.18; leaves B-ALL, AML and T-ALL), alongside a plot of error rate against number of features (genes).

Page 11: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Most of the approaches to date can be considered as “single gene” analysis.

Page 12: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Three different levels of DE analysis

Let's think of performing DE analysis at 3 different levels:

1. Single gene level: this is gene-by-gene analysis (individual node).

2. Gene set level: the features are subsets of genes (sets of nodes), e.g. gene set test.

3. Network level: examine a subset of genes (nodes in the network) together with information on relationships between the genes (the edges in the network).

Page 13: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Networks

A network is made up of nodes and edges:

Figure: example network with nodes CD19, CD38, LYN, VAV1, VAV2, ABL1, LCP2, NCK1, VAV3, ZAP70, YWHAQ, CD6, GRAP2, GRB2, SHB, MAP4K1, SYK, RHOA, DVL2, ELAVL4, SWAP70, THY1, RHOG, CDC42, RAC1, KLK3, EGFR, ALK and LCK.

Page 14: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Network discovery vs predefined networks

We have used two different methods for defining networks:

› Network discovery: use microarray information to find genes whose expression probes are highly correlated, and define edges accordingly (e.g. WGCNA).

› Predefined networks: use predefined gene interaction databases such as MetaCore or iRefWeb.

- E.g. protein-protein interaction networks: a node represents a protein-coding gene, and an edge between two nodes represents an interaction between the proteins coded for by the genes.
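The network discovery route can be illustrated with a deliberately simple sketch: compute pairwise Pearson correlations and keep an edge wherever the absolute correlation passes a hard cut-off. This is not WGCNA, which uses soft thresholding and further steps; the gene names, values and the 0.9 cut-off are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy expression profiles across five samples (invented values).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [1.1, 2.1, 2.9, 4.2, 4.8],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}

threshold = 0.9  # hypothetical hard cut-off on |correlation|

# Define an edge between every pair of genes whose profiles are
# strongly correlated (positively or negatively).
edges = [
    (a, b)
    for a, b in combinations(sorted(expr), 2)
    if abs(pearson(expr[a], expr[b])) >= threshold
]
```

In this toy data only g1 and g2 track each other closely, so they form the single discovered edge.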

Page 15: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Protein-protein interaction data

›  Human Protein Reference Database -  Keshava Prasad et al. 2009

›  iRefWeb -  Turner et al. 2010

›  BioGRID -  Chatr-aryamontri et al. 2013

›  MetaCore -  From GeneGo Inc.

Hairball image generated using Cytoscape (Smoot et al. 2011)

Thanks to Simone Li and Drs Igy Pang and David Fung at the Systems Biology Initiative, the University of New South Wales

Page 16: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

MetaCore network dataset

› Split the network into subnetworks, each containing a central hub gene (a gene with 5 interactors) and its immediate interactors.

› For example, one network dataset from the MetaCore database consists of 1273 hub subnetworks with a total of 3607 genes in common with the microarray dataset.

Figure: VAV3 hub subnetwork, with VAV3 connected to RHOG, CDC42, RAC1, RHOA, KLK3, EGFR, GRB2, ALK, LCP2, LCK and SYK.
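A small sketch of the splitting step, under stated assumptions: the network is given as an undirected edge list, and the "5 interactors" on the slide is read as a minimum (the VAV3 example has 11), which is an interpretation rather than something the slide states. The function name is hypothetical and the GRB2-EGFR edge is invented to show a non-hub.

```python
from collections import defaultdict

def hub_subnetworks(edges, min_interactors=5):
    """Map each hub gene to its sorted list of immediate interactors."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Keep only genes with enough interactors to count as a hub (assumed rule).
    return {
        gene: sorted(partners)
        for gene, partners in neighbours.items()
        if len(partners) >= min_interactors
    }

# VAV3 and its interactors as listed on the slide figure.
vav3_partners = ["RHOG", "CDC42", "RAC1", "RHOA", "KLK3", "EGFR",
                 "GRB2", "ALK", "LCP2", "LCK", "SYK"]
edges = [("VAV3", gene) for gene in vav3_partners]
edges.append(("GRB2", "EGFR"))  # invented extra edge, for illustration only

hubs = hub_subnetworks(edges)
```

Each entry of `hubs` is one hub subnetwork; genes such as GRB2, with too few partners in this toy edge list, appear only as interactors.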

Page 17: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Taylor et al., Nature Biotech., 2009

$$P\text{-value}_{\mathrm{Hub}} = \frac{\text{frequency of (random average hub difference} > \text{real average hub difference)}}{1000}$$

NATURE BIOTECH.|Vol 27|2009


Page 19: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Finding hubs of interest

For a given sub-network (predefined hub) i:

Figure labels: hub gene; interactor gene i.

Page 20: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Finding hubs of interest

For a given sub-network (predefined hub) i:

› For each edge k, the correlation difference between the two classes (GP and PP) was calculated:

$$\Delta_{PP,GP,k} = \mathrm{PPcor}_k - \mathrm{GPcor}_k$$

Figure labels: hub gene; interactor gene i.
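For a single hub-interactor edge, the correlation difference could be computed as in this sketch. The expression values are invented and chosen so that the pair is perfectly co-expressed in GP and perfectly anti-correlated in PP; Pearson correlation is assumed as the correlation measure.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy expression of one hub and one interactor, split by class (invented).
hub_gp, interactor_gp = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
hub_pp, interactor_pp = [1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0]

gp_cor = pearson(hub_gp, interactor_gp)   # correlation within GP samples
pp_cor = pearson(hub_pp, interactor_pp)   # correlation within PP samples

# Correlation difference for this edge: PPcor_k - GPcor_k
delta_k = pp_cor - gp_cor
```

Here `delta_k` sits at the extreme of its range (-2), the kind of edge that signals a change in co-expression rather than a change in expression level.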

Page 21: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Finding hubs of interest

For a given sub-network (predefined hub) i:

› For each edge k of the sub-network, the correlation difference between the two classes is

$$\Delta_{PP,GP,k} = \mathrm{PPcor}_k - \mathrm{GPcor}_k$$

› For each sub-network i, calculate the average absolute difference in hub-interactor correlation:

$$\mathrm{AveHubDiff}_i = \frac{\sum_{k=1}^{n_i} \left|\Delta_{PP,GP,k}\right|}{n_i - 1}$$

where $n_i$ is the number of interactors of the central hub gene in the network i.

Rank the hub subnetworks based on their AveHubDiff values, or use a permutation test to determine the statistical significance of each hub.
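The ranking statistic and its permutation test can be sketched as below. This is an illustration under assumptions, not the VAN implementation: the function names are hypothetical, the data are simulated, class labels are shuffled to build the null distribution, and the denominator n_i - 1 follows the slide formula as it appears to read.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ave_hub_diff(hub, interactors, labels):
    """Summed |PPcor_k - GPcor_k| over edges, divided by n_i - 1."""
    pp = [j for j, lab in enumerate(labels) if lab == "PP"]
    gp = [j for j, lab in enumerate(labels) if lab == "GP"]
    total = 0.0
    for inter in interactors:
        pp_cor = pearson([hub[j] for j in pp], [inter[j] for j in pp])
        gp_cor = pearson([hub[j] for j in gp], [inter[j] for j in gp])
        total += abs(pp_cor - gp_cor)
    return total / (len(interactors) - 1)

def permutation_p_value(hub, interactors, labels, n_perm=1000, seed=0):
    """Fraction of label shuffles whose statistic exceeds the observed one."""
    rng = random.Random(seed)
    observed = ave_hub_diff(hub, interactors, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ave_hub_diff(hub, interactors, shuffled) > observed:
            exceed += 1
    return observed, exceed / n_perm

# Simulated hub with three interactors: co-expressed with the hub in GP,
# anti-correlated in PP (invented data, 8 samples per class).
rng = random.Random(1)
labels = ["GP"] * 8 + ["PP"] * 8
hub = [rng.gauss(0, 1) for _ in labels]
interactors = [
    [(h if lab == "GP" else -h) + rng.gauss(0, 0.1) for h, lab in zip(hub, labels)]
    for _ in range(3)
]

observed, p_value = permutation_p_value(hub, interactors, labels)
```

Running the same function over every hub subnetwork gives one statistic and one p-value per hub, which can then be sorted to rank hubs.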

Page 22: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Applying to Melanoma gene expression data

›  A: Patients surviving >4yr post resection of metastatic disease

›  B: Patients surviving <1yr post resection of metastatic disease

› C & D: Enlarged view (HDAC)

Results – gene co-expression networks are significantly disturbed among patients with good and poor clinical outcomes

PIG. CELL & MEL. RES.|In press|2013 Provided by Sara-Jane Schramm  (Usyd)

Page 23: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Software: VAN

VAN: Identifying biologically perturbed networks using differential variability analysis

› Transcriptomics data (… states)

› Network data (PPI, microRNA-gene)

› Cancer gene census data

Data analysis

› Obtain hub-interactor correlations in each state

› Perform tests of significance to identify hubs where average correlation (with interactors) varies across states

› Perform meta-analysis (based on p-values) if multiple datasets are considered

› Hubs of interest

› Hubs causally implicated in cancer

› Network visualization using R (one hub at a time) or Cytoscape (multiple hubs at the same time)

Figure: workflow diagram with example plots "Hub and interactors - alive" and "Hub and interactors - ddm", and a multi-hub network drawn in Cytoscape (gene labels omitted).

Page 24: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Hub and interactors

Figure: hub and interactors in the two groups, labelled ANSR and DM.

Page 25: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Software: VAN

VAN: Identifying biologically perturbed networks using differential variability analysis

Transcriptome data

Network data

Cancer gene data

Data analysis

Page 26: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Moving to classification

Three main categories of question:

1. Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).

2. Cluster analysis: finding common patterns between genes.

3. Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).

How can this concept be extended from DE analysis to classification and prediction?

Page 27: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

DE analysis Classification

Constructing features for the network approach in two main ways:

1. Gene-based features: rank genes using network information (e.g. NetRank), or construct weights for genes using network information (e.g. weighted lasso).

2. Network-based features: define some network measure which can be used to quantify network perturbation between the two classes; rank the networks accordingly (e.g. Rapaport et al., Taylor et al., BSS/WSS).

› Note that it is surprisingly difficult to come up with a network measure which can be translated from a DE framework into a classification framework.
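BSS/WSS is named on the slide without a formula. The sketch below assumes the usual definition, the ratio of between-class to within-class sums of squares for one feature, and applies it to invented per-sample scores for two hypothetical networks. How the talk summarises a network into one score per sample is not stated, so that part is an assumption.

```python
def bss_wss(values, labels):
    """Between-class over within-class sum of squares for one feature."""
    overall = sum(values) / len(values)
    bss = 0.0
    wss = 0.0
    for cls in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        centre = sum(group) / len(group)
        bss += len(group) * (centre - overall) ** 2   # spread of class means
        wss += sum((v - centre) ** 2 for v in group)  # spread within the class
    return bss / wss

labels = ["GP", "GP", "GP", "PP", "PP", "PP"]

# Hypothetical per-sample network scores (invented values).
network_scores = {
    "netA": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # separates the classes well
    "netB": [1.0, 8.0, 3.0, 7.0, 2.0, 9.0],   # barely separates them
}

ratios = {name: bss_wss(vals, labels) for name, vals in network_scores.items()}
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

A larger ratio means the class means are far apart relative to the scatter inside each class, so networks would be ranked from largest to smallest ratio.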

Page 28: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Taylor et al.: feature

›  Instead of using the top ranked networks as the classification features, Taylor et al. use the edges in the top ranked networks.

›  Each edge k in the selected networks is assigned the feature value.

Figure: hub H with interactors I1, I2, I3, I4 and I5.

Page 29: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Looking at one hub

›  Some individual networks are capable of separating the classes reasonably well, by considering the difference between hub and interactor expression (the LDA method).

Figure: CEBPB (49 interactors). x-axis: expression for the CEBPB gene (-2 to 2); y-axis: median absolute expression for the interactors (0.4 to 1.4); points labelled GP and PP.
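The two axes of the CEBPB plot suggest a pair of per-sample features: the hub's expression and the median absolute expression of its interactors. The sketch below only constructs those two features from invented values; the function name is hypothetical and the LDA fitting step itself is not shown.

```python
from statistics import median

def hub_features(hub_expr, interactor_expr):
    """Per-sample (hub expression, median |interactor expression|) pairs.

    hub_expr: one value per sample.
    interactor_expr: one row per interactor, each with one value per sample.
    """
    features = []
    for j, hub_value in enumerate(hub_expr):
        med_abs = median(abs(row[j]) for row in interactor_expr)
        features.append((hub_value, med_abs))
    return features

# Toy data: one hub, three interactors, three samples (invented values).
hub = [-1.5, 0.2, 1.8]
interactors = [
    [0.5, -1.0, 0.3],
    [-0.7, 0.9, 1.2],
    [0.6, -1.1, -0.4],
]

features = hub_features(hub, interactors)
```

Each sample becomes a point in a two-dimensional plane, and a linear classifier could then be fitted to separate GP from PP within that plane.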

Page 30: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Classification procedure


Page 31: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Other network based approaches

Winter et al., PLoS Computational Biology, 2012

Page 32: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Cross-validation error rate

Figure: classification error (y-axis, 0.2 to 0.6) using a random forest classifier, by feature construction method (x-axis): Mod-t, unweighted lasso, average expression, Taylor, Rapaport, inner product, BSS/WSS, weighted lasso (hub), weighted lasso (all). Method groups: single-gene, gene set, network-based features, gene-based features.
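A cross-validation error rate of the kind plotted here can be estimated as in this sketch. It is not the talk's procedure: a nearest-centroid classifier stands in for the random forest to keep the example dependency-free, the data are invented, and the function names and the choice of 5 folds are assumptions.

```python
import random

def fit_centroids(X, y):
    """Mean feature vector per class."""
    centroids = {}
    for cls in set(y):
        rows = [x for x, lab in zip(X, y) if lab == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )

def cv_error_rate(X, y, k=5, seed=0):
    """k-fold cross-validation misclassification rate."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Any feature selection should also happen here, on training data only.
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        errors += sum(predict(model, X[i]) != y[i] for i in fold)
    return errors / len(X)

# Two well-separated toy classes with two features each (invented values).
X = [[0.1 * i, 1.0] for i in range(10)] + [[5.0 + 0.1 * i, 1.0] for i in range(10)]
y = ["GP"] * 10 + ["PP"] * 10

error = cv_error_rate(X, y)
```

Because each method in the figure builds its features differently, the feature construction step belongs inside the loop so that held-out samples never influence it.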

Page 33: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

›  Error rates for Taylor's method are only slightly better than for the classical single-gene moderated-t method.

› However, the two methods are capturing different information: they are correctly classifying different subsets of patients.

Page 34: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Summary and discussion

› VAN (R package) enables the testing of modules for dysregulation based on two or more conditions; it is also suitable for the examination of changes across developmental timelines.

› The majority of network methods based on the discovery network do not perform as well as methods based on the predefined network.

› Combining Taylor's method and the single-gene method could yield a more accurate classifier.

› Using the LDA method, some hub subnetworks independently act as accurate prognostic predictors.

› The best performing network feature selection methods only select small hub subnetworks.

Page 35: Using Network Information With Gene Expression Data - Jean Yee Hwa Yang

Acknowledgements

›  Graham Mann (Usyd)

-  Gulietta Pupo & Varsha Tembe

›  Sara-Jane Schramm

›  John Thompson

›  Richard Scolyer (RPA)

›  Marc Wilkins (UNSW)

-  Simone Li

-  Chi Nam Ignatius Pang

-  David Fung

-  Apurv Goel

-  Natalie Twine

›  School of Mathematics and Statistics (Usyd)

-  Samuel Mueller

-  Vivek Jayaswal

-  Kaushala Jayawardana

-  Rebecca Barter

-  Shila Ghanazfar

-  Anna Campain