
For written notes on this lecture, please read chapter 14 of The Practical Bioinformatician.

CS2220: Introduction to Computational Biology

Lecture 4: Gene Expression Analysis

Limsoon Wong

Copyright 2012 © Limsoon Wong

Plan

• Microarray background

• Gene expression profile classification

• Gene expression profile clustering

• Normalization

• Extreme sample selection

• Intersection Analysis

Background on Microarrays

What is a Microarray?

• Contains a large number of DNA molecules spotted on glass slides, nylon membranes, or silicon wafers

• Detects which genes are being expressed or found in a cell of a tissue sample

• Measures expression of thousands of genes simultaneously

Affymetrix GeneChip Array

Making Affymetrix GeneChip Array

• Quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules

• Exposed linkers become deprotected and are available for nucleotide coupling

Exercise: What is the other commonly used type of microarray? How is that one different from Affymetrix’s?

Gene Expression Measurement by Affymetrix GeneChip Array

[Movie explaining the working of microarray]

A Sample Affymetrix GeneChip Data File (U95A)

Some Advice on Affymetrix GeneChip Data

• Ignore AFFX genes

– These genes are control genes

• Ignore genes with “Abs Call” equal to “A” or “M”

– Measurement quality is suspect

• Upper bound 40000, lower bound 100

– Accuracy of laser scanner

• Deal with missing values

Exercise: Suggest 2 ways to deal with missing values
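
The advice above can be sketched as a small filtering step. This is only an illustration under stated assumptions: the `(probe_id, abs_call, value)` row layout and the function name `preprocess` are hypothetical, and reading the upper and lower bounds as clipping limits is one possible interpretation of the slide. Missing-value handling is left out on purpose, since it is the subject of the exercise.

```python
def preprocess(rows, lower=100.0, upper=40000.0):
    """Filter and clip the rows of one array.

    Each row is a (probe_id, abs_call, value) tuple; this layout is hypothetical.
    """
    kept = {}
    for probe_id, abs_call, value in rows:
        if probe_id.startswith("AFFX"):  # control genes are ignored
            continue
        if abs_call in ("A", "M"):       # measurement quality is suspect
            continue
        # Assumption: the bounds are applied by clipping values into [lower, upper]
        kept[probe_id] = min(max(value, lower), upper)
    return kept


rows = [
    ("AFFX-BioB-5_at", "P", 500.0),
    ("37720_at", "P", 215.0),
    ("38028_at", "A", 12.0),
    ("38319_at", "P", 55000.0),
    ("1000_at", "P", 20.0),
]
cleaned = preprocess(rows)
```

In this made-up input, the control probe and the probe with call "A" are dropped, and the two out-of-range values are clipped to the bounds.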

Type of Gene Expression Datasets

Gene-Conditions or Gene-Sample (numeric or discretized): 100-500 rows, 1000 - 100,000 columns

         Class    Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  Gene7  .....
Sample1  Cancer   0.12   -1.3   1.7    1.0    -3.2   0.78   -0.12
Sample2  Cancer   1.3
.
         ~Cancer
SampleN  ~Cancer

Gene-Sample-Time, Gene-Time

[Figure: expression level plotted against time]

Type of Gene Expression Datasets

Gene-Conditions or Gene-Sample (numeric or discretized): 100-500 rows, 1000 - 100,000 columns

         Class    Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  Gene7  .....
Sample1  Cancer   1      0      1      1      1      0      0
Sample2  Cancer   1
.
         ~Cancer
SampleN  ~Cancer

Gene-Sample-Time, Gene-Time

[Figure: expression level plotted against time]

Application: Disease Subtype Diagnosis

[Figure: expression matrix with samples as rows and genes as columns; four samples are labelled malign, four are labelled benign, and three unlabelled samples are marked ???]

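Application: Treatment Prognosis

[Figure: expression matrix with samples as rows and genes as columns; four samples are labelled NR, four are labelled R, and three unlabelled samples are marked ???]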

Type of Gene Expression Datasets

Gene-Conditions or Gene-Sample (numeric or discretized): 100-500 rows, 1000 - 100,000 columns

       Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  Gene7
Cond1  0.12   -1.3   1.7    1.0    -3.2   0.78   -0.12
Cond2  1.3
.
CondN

Gene-Sample-Time, Gene-Time

[Figure: expression level plotted against time]

Application: Drug Action Detection

[Figure: expression matrix with conditions as rows and genes as columns; four conditions are labelled Normal and four are labelled Drug]

Which group of genes is the drug affecting?

Gene Expression Profile Classification

Diagnosis of Childhood Acute Lymphoblastic Leukemia and Optimization of Risk-Benefit Ratio of Therapy

Childhood ALL

• Major subtypes: T-ALL, E2A-PBX, TEL-AML, BCR-ABL, MLL genome rearrangements, Hyperdiploid>50

• Different subtypes respond differently to the same treatment (Tx)

• Over-intensive Tx

– Development of secondary cancers

– Reduction of IQ

• Under-intensive Tx

– Relapse

• The subtypes look similar

• Conventional diagnosis

– Immunophenotyping

– Cytogenetics

– Molecular diagnostics

• Unavailable in most ASEAN countries

Mission

• Conventional risk assignment procedure requires difficult, expensive tests and collective judgement of multiple specialists

• Generally available only in major advanced hospitals

Can we have a single-test easy-to-use platform instead?

Single-Test Platform of Microarray & Machine Learning

Overall Strategy

• For each subtype, select genes to develop classification model for diagnosing that subtype

• For each subtype, select genes to develop prediction model for prognosis of that subtype

[Figure: three boxes labelled "Diagnosis of subtype", "Subtype-dependent prognosis", and "Risk-stratified treatment intensity"]

Subtype Diagnosis by PCL

• Gene expression data collection

• Gene selection by χ²

• Classifier training by emerging pattern

• Classifier tuning (optional for some machine learning methods)

• Apply classifier for diagnosis of future cases by PCL

Childhood ALL Subtype Diagnosis Workflow

A tree-structured diagnostic workflow was recommended by our doctor collaborator

Training and Testing Sets

Signal Selection Basic Idea

• Choose a signal w/ low intra-class distance

• Choose a signal w/ high inter-class distance
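
One way to make the two criteria above concrete is to divide the gap between the class means (inter-class distance) by the sum of the class standard deviations (intra-class spread). The sketch below is only an illustration of that idea; the function name `separation_score` and the toy values are hypothetical, and this particular score is not claimed to be the one used in the lecture.

```python
from statistics import mean, stdev


def separation_score(values_a, values_b):
    """Gap between class means divided by the summed class standard deviations.

    Large when the two classes are far apart and each class is tight.
    """
    gap = abs(mean(values_a) - mean(values_b))
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread


# A gene that separates the classes well, and one that does not (made-up values)
good = separation_score([1.0, 1.1, 0.9], [5.0, 5.1, 4.9])
poor = separation_score([1.0, 5.0, 3.0], [2.0, 4.0, 3.5])
```

Genes can then be ranked by this score, with the highest-scoring genes kept as signals.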

Signal Selection by χ²
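
As a reminder of how the chi-square (χ²) statistic works, the sketch below computes the standard Pearson χ² value for a contingency table of a discretized gene against the class label. The function name `chi_square` and the counts are hypothetical; how the lecture discretizes expression values before applying the statistic is not shown in this transcript.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Rows: gene "high" / gene "low"; columns: class A / class B (made-up counts)
informative = chi_square([[18, 2], [2, 18]])
uninformative = chi_square([[10, 10], [10, 10]])
```

A gene whose discretized values line up with the classes gets a large statistic, while a gene that is independent of the class gets a value near zero.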

Emerging Patterns

• An emerging pattern is a set of conditions

– usually involving several features

– that most members of a class satisfy

– but none or few of the other class satisfy

• A jumping emerging pattern is an emerging pattern that

– some members of a class satisfy

– but no members of the other class satisfy

• We use only jumping emerging patterns
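
The definitions above can be checked mechanically. In this minimal sketch, a pattern is a list of conditions over a sample, its frequency in a class is the number of samples satisfying all conditions, and a jumping emerging pattern has non-zero frequency in its own class and zero frequency in the other. The gene names `g1` and `g2`, the thresholds, and the samples are all hypothetical.

```python
def frequency(pattern, samples):
    """Number of samples that satisfy every condition in the pattern."""
    return sum(all(cond(s) for cond in pattern) for s in samples)


def is_jumping_ep(pattern, home_class, other_class):
    """True if some home-class samples satisfy the pattern and no other-class sample does."""
    return frequency(pattern, home_class) > 0 and frequency(pattern, other_class) == 0


# Hypothetical samples: gene name -> expression value
positive = [{"g1": 300, "g2": 5}, {"g1": 250, "g2": 8}, {"g1": 100, "g2": 9}]
negative = [{"g1": 400, "g2": 50}, {"g1": 90, "g2": 7}]

# Hypothetical pattern made of two conditions
pattern = [lambda s: s["g1"] > 215, lambda s: s["g2"] <= 12]

freq_pos = frequency(pattern, positive)
freq_neg = frequency(pattern, negative)
jumping = is_jumping_ep(pattern, positive, negative)
```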

Examples

Reference number 9: the expression of gene 37720_at > 215

Reference number 36: the expression of gene 38028_at 12

Patterns    Frequency (P)    Frequency (N)
{9, 36}     38 instances     0
{9, 23}     38               0
{4, 9}      38               0
{9, 14}     38               0
{6, 9}      38               0
{7, 21}     0                36
{7, 11}     0                35
{7, 43}     0                35
{7, 39}     0                34
{24, 29}    0                34

Easy interpretation

PCL: Prediction by Collective Likelihood

PCL Learning

Top-Ranked EPs in Positive class: EP_1^P (90%), EP_2^P (86%), ..., EP_n^P (68%)

Top-Ranked EPs in Negative class: EP_1^N (100%), EP_2^N (95%), ..., EP_n^N (80%)

The idea of summarizing multiple top-ranked EPs is intended to avoid some rare tie cases

PCL Testing

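$$\mathrm{Score}^P = \frac{EP_1^{P\prime}}{EP_1^{P}} + \dots + \frac{EP_k^{P\prime}}{EP_k^{P}}$$

EP_1^{P'}: most freq EP of pos class in the test sample

EP_1^{P}: most freq EP of pos class

Similarly,

$$\mathrm{Score}^N = \frac{EP_1^{N\prime}}{EP_1^{N}} + \dots + \frac{EP_k^{N\prime}}{EP_k^{N}}$$

If Score^P > Score^N, then positive class,

Otherwise negative class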
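
The scoring rule can be sketched as follows, under one reading of the slide: each EP stands for its frequency in its own class, the numerators come from the k most frequent EPs of that class that the test sample contains, and the denominators come from the k most frequent EPs of that class overall. Representing patterns as frozensets of items and the names `pcl_score` and `pcl_predict` are assumptions made for this illustration.

```python
def pcl_score(test_sample, ranked_eps, k):
    """Collective score of one class for a test sample.

    ranked_eps: list of (pattern, frequency) pairs sorted by descending frequency,
    where each pattern is a frozenset of items. test_sample is a set of items.
    """
    top_k = ranked_eps[:k]
    # The most frequent EPs of this class that occur in the test sample
    contained = [(p, f) for p, f in ranked_eps if p <= test_sample][:k]
    return sum(f_in / f_top for (_, f_in), (_, f_top) in zip(contained, top_k))


def pcl_predict(test_sample, pos_eps, neg_eps, k):
    score_p = pcl_score(test_sample, pos_eps, k)
    score_n = pcl_score(test_sample, neg_eps, k)
    return "positive" if score_p > score_n else "negative"


# Made-up EPs over made-up items
pos_eps = [(frozenset({"a", "b"}), 0.90), (frozenset({"a", "c"}), 0.86), (frozenset({"d"}), 0.68)]
neg_eps = [(frozenset({"x"}), 1.00), (frozenset({"y", "z"}), 0.95), (frozenset({"w"}), 0.80)]
sample = {"a", "c", "d", "w"}

score_p = pcl_score(sample, pos_eps, k=2)
score_n = pcl_score(sample, neg_eps, k=2)
label = pcl_predict(sample, pos_eps, neg_eps, k=2)
```

Here the sample contains two positive-class EPs but only one negative-class EP, so the positive score is larger.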

Accuracy of PCL (vs. other classifiers)

The classifiers are all applied to the 20 genes selected by χ² at each level of the tree

Understandability of PCL

• E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs

• These give us the diagnostic rule

Multidimensional Scaling Plot for Subtype Diagnosis

Obtained by performing PCA on the 20 genes chosen for each level

Childhood ALL Cure Rates

• Conventional risk assignment procedure requires difficult, expensive tests and collective judgement of multiple specialists

Not available in less advanced ASEAN countries

Childhood ALL Treatment Cost

• Treatment for childhood ALL over 2 yrs

– Intermediate intensity: US$60k

– Low intensity: US$36k

– High intensity: US$72k

• Treatment for relapse: US$150k

• Cost for side-effects: Unquantified

Current Situation (2000 new cases/yr in ASEAN)

• Intermediate intensity conventionally applied in less advanced ASEAN countries

• Over intensive for 50% of patients, thus more side effects

• Under intensive for 10% of patients, thus more relapse

• US$120m (US$60k * 2000) for intermediate intensity tx

• US$30m (US$150k * 2000 * 10%) for relapse tx

• Total US$150m/yr plus unquantified costs for dealing with side effects

Using Our Platform

• Low intensity applied to 50% of patients

• Intermediate intensity to 40% of patients

• High intensity to 10% of patients

Reduced side effects

Reduced relapse

75-80% cure rates

• US$36m (US$36k * 2000 * 50%) for low intensity

• US$48m (US$60k * 2000 * 40%) for intermediate intensity

• US$14.4m (US$72k * 2000 * 10%) for high intensity

• Total US$98.4m/yr

Save US$51.6m/yr
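
The cost figures quoted in the lecture can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated on the slides (2000 new cases per year, per-regimen costs, and patient percentages); the names `COST_K` and `total_cost_m` are illustrative.

```python
CASES_PER_YEAR = 2000
# Cost per patient in US$ thousands, as quoted on the treatment cost slide
COST_K = {"low": 36, "intermediate": 60, "high": 72, "relapse": 150}


def total_cost_m(shares_pct):
    """Yearly cost in US$ millions, given the percentage of patients per cost item."""
    total_k = sum(COST_K[item] * CASES_PER_YEAR * pct / 100 for item, pct in shares_pct.items())
    return total_k / 1000


# Current situation: everyone gets intermediate intensity, 10% need relapse treatment
current = total_cost_m({"intermediate": 100, "relapse": 10})
# Risk-stratified treatment using the platform
platform = total_cost_m({"low": 50, "intermediate": 40, "high": 10})
saving = current - platform
```

This gives 150 for the current situation and 98.4 for the platform, a difference of about 51.6 (all in US$ millions per year), matching the slides.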

A Nice Ending…

• Asian Innovation Gold Award 2003

Gene Expression Profile Clustering

Novel Disease Subtype Discovery

Is there a new subtype?

• Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL

Exercise: Name and describe one bi-clustering method

Hierarchical Clustering

• Assign each item to its own cluster

– If there are N items initially, we get N clusters, each containing just one item

• Find the “most similar” pair of clusters, merge them into a single cluster, so we now have one fewer cluster

– “Similarity” is often defined using

    • Single linkage

    • Complete linkage

    • Average linkage

• Repeat previous step until all items are clustered into a single cluster of size N

Single, Complete, & Average Linkage

Single linkage defines distance between two clusters as min distance between them

Complete linkage defines distance between two clusters as max distance between them

Exercise: Give definition of “average linkage”

Image source: UCL Microcore Website
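
The agglomerative procedure and the two linkage definitions can be sketched together. This is a deliberately naive illustration, assuming Euclidean distance between profiles; the function names and the toy one-dimensional points are hypothetical. Average linkage is omitted so that the exercise stays open.

```python
import math


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in a for j in b]
    return min(dists) if linkage == "single" else max(dists)


def agglomerate(points, linkage="single"):
    """Merge the closest pair of clusters repeatedly; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # each item starts in its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = [
            (cluster_distance(clusters[i], clusters[j], points, linkage), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        d, i, j = min(pairs)
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


points = [(0.0,), (2.0,), (5.0,), (9.0,)]
single_heights = [d for _, _, d in agglomerate(points, "single")]
complete_heights = [d for _, _, d in agglomerate(points, "complete")]
```

On these four points the two linkages agree on the first merge but differ afterwards: single linkage chains the points together one at a time, while complete linkage pairs them up first.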

Some Patient Samples

• Does Mr. A have cancer?

[Figure: expression matrix with samples as rows and genes as columns; four samples are labelled malign, four are labelled benign, and Mr. A's sample is marked ???]

Let’s rearrange the rows…

• Does Mr. A have cancer?

[Figure: the expression matrix of malign, benign, and Mr. A's samples with its rows rearranged]

and the columns too…

[Figure: the expression matrix of malign, benign, and Mr. A's samples with both its rows and its columns rearranged]

Normalization

Sometimes, a gene expression study may involve batches of data collected over a long period of time…

[Figure: Time Span of Gene Expression Profiles; horizontal axis from Jan-04 to Jan-10 in quarterly steps, vertical axis from 0 to 70]

Image credit: Dong Difeng

In such a case, batch effect may be severe… to the extent that you can predict the batch that each sample comes from!

Need normalization to correct for batch effect

Image credit: Dong Difeng

Approaches to Normalization

• Aim of normalization: Reduce variance w/o increasing bias

• Scaling method

– Intensities are scaled so that each array has same average value

– E.g., Affymetrix’s

• Transform data so that distribution of probe intensities is same on all arrays

– E.g., (x − μ) / σ

• Quantile normalization
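
The first two approaches can be sketched in a few lines. This is an illustration under stated assumptions: the common target value for scaling is taken to be the mean of the per-array averages, and the transformation is the usual standardization by an array's own mean and standard deviation. The function names and the toy intensities are hypothetical, and this is not a description of Affymetrix's own procedure.

```python
from statistics import mean, pstdev


def scale_to_common_mean(arrays):
    """Rescale every array so that all arrays share the same average intensity."""
    target = mean(mean(a) for a in arrays)  # assumption: target is the mean of array means
    return [[x * target / mean(a) for x in a] for a in arrays]


def standardize(array):
    """Transform an array to zero mean and unit standard deviation: (x - mu) / sigma."""
    mu, sigma = mean(array), pstdev(array)
    return [(x - mu) / sigma for x in array]


arrays = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
scaled = scale_to_common_mean(arrays)
z = standardize([1.0, 2.0, 3.0])
```

After scaling, both toy arrays have an average of 300; after standardization, the array has mean 0 and standard deviation 1.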

Quantile Normalization

• Given n arrays of length p, form X of size p × n where each array is a column

• Sort each column of X to give X_sort

• Take means across rows of X_sort and assign this mean to each elem in the row to get X'_sort

• Get X_normalized by arranging each column of X'_sort to have same ordering as X

• Implemented in some microarray s/w, e.g., EXPANDER
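
The steps above translate almost line by line into code. The sketch below is a minimal version, assuming arrays of equal length with no missing values, and breaking ties by position rather than averaging them; the function name `quantile_normalize` is illustrative, and this is not the EXPANDER implementation.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of n equal-length arrays (the columns of X)."""
    p = len(arrays[0])
    # For each column, the indices of its elements in ascending order of value
    orders = [sorted(range(p), key=lambda i: col[i]) for col in arrays]
    # Row means of X_sort: mean of the r-th smallest value across all columns
    row_means = [
        sum(col[order[r]] for col, order in zip(arrays, orders)) / len(arrays)
        for r in range(p)
    ]
    # Put each row mean back at the position its value occupied in the original column
    normalized = []
    for order in orders:
        out = [0.0] * p
        for rank, idx in enumerate(order):
            out[idx] = row_means[rank]
        normalized.append(out)
    return normalized


arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization, every array contains the same set of values (here 1.5, 3.5, and 5.5), arranged according to that array's original ordering.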

After quantile normalization

Selection of Patient Samples and Genes for Disease Prognosis

Gene Expression Profile + Clinical Data

Outcome Prediction

• Univariate & multivariate Cox survival analysis (Beer et al 2002, Rosenwald et al 2002)

• Fuzzy neural network (Ando et al 2002)

• Partial least squares regression (Park et al 2002)

• Weighted voting algorithm (Shipp et al 2002)

• Gene index and “reference gene” (LeBlanc et al 2003)

• ……

Our Approach

“extreme” sample selection

ERCOF

Extreme Sample Selection

Short-term Survivors vs. Long-term Survivors

T: sample

F(T): follow-up time

E(T): status (1: unfavorable; 0: favorable)

c1 and c2: thresholds of survival time

Short-term survivors, who died within a short period: F(T) < c1 and E(T) = 1

Long-term survivors, who were alive after a long follow-up time: F(T) > c2
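
The two selection rules can be written down directly. In this minimal sketch, each sample is a dictionary with hypothetical keys `followup` (standing for F(T), in years) and `status` (standing for E(T)); the thresholds and patient records are made up for illustration.

```python
def select_extremes(samples, c1, c2):
    """Split samples into short-term and long-term survivors; the rest are left out."""
    short_term = [s for s in samples if s["followup"] < c1 and s["status"] == 1]
    long_term = [s for s in samples if s["followup"] > c2]
    return short_term, long_term


samples = [
    {"id": "A", "followup": 0.5, "status": 1},  # died early: short-term
    {"id": "B", "followup": 0.8, "status": 0},  # short follow-up but favorable: excluded
    {"id": "C", "followup": 9.0, "status": 0},  # long follow-up: long-term
    {"id": "D", "followup": 8.5, "status": 1},  # long follow-up, status ignored: long-term
    {"id": "E", "followup": 4.0, "status": 1},  # in between: excluded
]
short_term, long_term = select_extremes(samples, c1=1.0, c2=8.0)
```

Note that the long-term rule depends only on follow-up time, so a sample with an unfavorable status can still be selected as a long-term survivor if its follow-up time exceeds c2.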

ERCOF: Entropy-Based Rank Sum Test & Correlation Filtering

• Remove genes with expression values w/o cut point found (can’t be discretized)

• Calculate Wilcoxon rank sum w(x) for gene x. Remove gene x if w(x) ∈ [c_lower, c_upper]

• Group features by Pearson Correlation

• For each group, retain the top 50% wrt class entropy
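
The rank-sum step can be illustrated in isolation. The sketch below computes a Wilcoxon rank sum as the sum of the ranks of one class's values within the pooled sample, using average ranks for ties. Which class is summed, the function names, and the example thresholds are all assumptions for illustration; in practice the thresholds would come from the test's critical values, which are not given in this transcript.

```python
def rank_sum(values_a, values_b):
    """Sum of the ranks of class A's values in the pooled sample (average ranks for ties)."""
    pooled = sorted(values_a + values_b)

    def avg_rank(v):
        first = pooled.index(v) + 1  # 1-based rank of the first occurrence
        count = pooled.count(v)
        return first + (count - 1) / 2

    return sum(avg_rank(v) for v in values_a)


def keep_gene(w, c_lower, c_upper):
    """Keep a gene only if its rank sum falls outside [c_lower, c_upper]."""
    return not (c_lower <= w <= c_upper)


separated = rank_sum([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
mixed = rank_sum([1.0, 4.0, 5.0], [2.0, 3.0, 6.0])
tied = rank_sum([1.0, 2.0], [2.0, 3.0])

# Hypothetical thresholds, chosen only to show the filtering rule
kept = [keep_gene(w, c_lower=7, c_upper=14) for w in (separated, mixed)]
```

A gene whose values are cleanly separated between the classes gets an extreme rank sum and is kept, while a gene whose values are mixed falls inside the interval and is removed.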

Risk Score Construction

Linear Kernel SVM regression function

$$G(T) = \sum_i a_i \, y_i \, K(T, x(i)) + b$$

T: test sample, x(i): support vector,

y_i: class label (1: short-term survivors; -1: long-term survivors)

Transformation function (posterior probability)

$$S(T) = \frac{1}{1 + e^{-G(T)}}, \qquad S(T) \in (0, 1)$$

S(T): risk score of sample T
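
To make the risk score concrete, the sketch below evaluates a linear-kernel SVM decision value G(T) from given support vectors, labels, coefficients, and bias, and then maps it through the logistic function to obtain S(T). All numeric values are made up, the names are illustrative, and training the SVM itself is outside the scope of the sketch.

```python
import math


def linear_kernel(u, v):
    """Linear kernel: the dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(t, support_vectors, labels, coeffs, b):
    """G(T): weighted sum of kernel values between T and the support vectors, plus b."""
    return sum(a * y * linear_kernel(t, x) for a, y, x in zip(coeffs, labels, support_vectors)) + b


def risk_score(g):
    """S(T): logistic transformation of G(T) into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-g))


# Made-up model: one support vector per class
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
labels = [1, -1]     # 1: short-term survivors; -1: long-term survivors
coeffs = [0.5, 0.5]  # hypothetical coefficients a_i
b = 0.0

g = decision_value((2.0, 0.0), support_vectors, labels, coeffs, b)
s = risk_score(g)
```

Because short-term survivors carry the label 1, a larger decision value gives a risk score closer to 1, and a decision value of 0 gives a score of exactly 0.5.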

Diffuse Large B-Cell Lymphoma

• DLBC lymphoma is the most common type of lymphoma in adults

• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients

DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy

• Intl Prognostic Index (IPI)

– age, “Eastern Cooperative Oncology Group” Performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...

• Not very good for stratifying DLBC lymphoma patients for therapeutic trials

Use gene-expression profiles to predict outcome of chemotherapy?

Rosenwald et al., NEJM 2002

• 240 data samples

– 160 in preliminary group

– 80 in validation group

– each sample described by 7399 microarray features

• Rosenwald et al.’s approach

– identify genes: Cox proportional-hazards model

– cluster identified genes into four gene signatures

– calculate for each sample an outcome-predictor score

– divide patients into quartiles according to score

Knowledge Discovery from Gene Expression of “Extreme” Samples

[Figure: workflow diagram with the labels: 240 samples; 80 samples; “extreme” sample selection: < 1 yr vs > 8 yrs; 47 short-term survivors; 26 long-term survivors; 7399 genes; 84 genes; knowledge discovery from gene expression]

T is long-term if S(T) < 0.3

T is short-term if S(T) > 0.7

Discussions: Sample Selection

Application  Data set     Status: Alive  Status: Dead  Total
DLBCL        Original     72             88            160
DLBCL        Informative  25             47+1(*)       73

Number of samples in original data and selected informative training set.

(*): Number of samples whose corresponding patient was dead at the end of follow-up time, but selected as a long-term survivor.

Discussions: Gene Identification

Gene selection  DLBCL
Original        4937(*)
Phase I         132 (2.7%)
Phase II        84 (1.7%)

Number of genes left after feature filtering for each phase.

(*): number of genes after removing those genes that were absent in more than 10% of the experiments.

Kaplan-Meier Plot for 80 Test Cases

p-value of log-rank test: < 0.0001

Risk score thresholds: 0.7, 0.3

Improvement Over IPI

(A) IPI low, p-value = 0.0063

(B) IPI intermediate, p-value = 0.0003

Merit of “Extreme” Samples

(A) W/o sample selection (p = 0.38)

(B) With sample selection (p = 0.009)

No clear difference in the overall survival of the 80 samples in the validation group of the DLBCL study if no training sample selection is conducted

About the Inventor: Huiqing Liu

• Huiqing Liu

– PhD, NUS, 2004

– Currently Senior Scientist at Centocor

– Asian Innovation Gold Award 2003

– New Jersey Cancer Research Award for Scientific Excellence 2008

– Gallo Prize 2008

Beyond Disease Diagnosis & Prognosis

Beyond Classification of Gene Expression Profiles

• After identifying the candidate genes by feature selection, do we know which ones are causal genes, which ones are surrogates, and which are noise?

[Figure: heat map of diagnostic ALL BM samples (n=327) against genes for class distinction (n=271); color scale from -3 to 3 std deviations from mean; sample groups: TEL-AML1, BCR-ABL, Hyperdiploid >50, E2A-PBX1, MLL, T-ALL, Novel]

Gene Regulatory Circuits

• Genes are “connected” in “circuit” or network

• Expr of a gene in a network depends on expr of some other genes in the network

• Can we “reconstruct” the gene network from gene expression and other data?

Source: Miltenyi Biotec

Hints to extend reach of prediction

• Each disease subtype has underlying cause

There is a unifying biological theme for genes that are truly associated with a disease subtype.

• Uncertainty in reliability of selected genes can be reduced by considering molecular functions and biological processes associated with the genes

• The unifying biological theme is basis for inferring the underlying cause of disease subtype

Intersection Analysis

• Intersect the list of differentially expressed genes with a list of genes on a pathway

• If the intersection is significant, the pathway is postulated as the basis of the disease subtype or treatment response

Caution:

• The initial list of differentially expressed genes is defined using test statistics with arbitrary thresholds

• Different test statistics and different thresholds result in a different list of differentially expressed genes

Outcome may be unstable

Exercise: What is a good test statistic to determine if the intersection is significant?
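One common way to approach the exercise is a one-sided hypergeometric test (equivalent to Fisher's exact test on the 2x2 table). The sketch below illustrates that idea; it is not necessarily the answer the lecture intends. The function names, the `universe` argument (all genes measured on the array), and the toy gene names are assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which lie on the pathway, and X counts the pathway genes drawn."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms contribute nothing to the sum.
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / total


def intersection_p_value(de_genes, pathway_genes, universe):
    """p-value for the overlap between a differentially expressed gene list
    and a pathway gene list, both restricted to the measured universe."""
    universe = set(universe)
    de = set(de_genes) & universe
    pathway = set(pathway_genes) & universe
    overlap = len(de & pathway)
    return hypergeom_tail(len(universe), len(pathway), len(de), overlap)


# Toy example (hypothetical gene names): 10 measured genes, 4 on the pathway,
# 3 differentially expressed, 2 of which are on the pathway.
universe = {f"g{i}" for i in range(10)}
pathway = {"g0", "g1", "g2", "g3"}
de = {"g0", "g1", "g9"}
# (C(4,2)*C(6,1) + C(4,3)*C(6,0)) / C(10,3) = 40/120
p = intersection_p_value(de, pathway, universe)
```

A small p-value suggests the overlap is larger than expected by chance. Because the gene list itself depends on the chosen statistic and threshold, it may be worth repeating the test at several thresholds to see how stable the conclusion is.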


Gene Interaction Prediction


Beyond Classification of Gene Expression Profiles

• After identifying the candidate genes by feature selection, do we know which ones are causal genes and which ones are surrogates?

[Figure: expression heatmap of diagnostic ALL BM samples (n=327) against genes for class distinction (n=271). Color scale: -3 to 3 std deviations from mean. Sample groups: TEL-AML1, BCR-ABL, Hyperdiploid >50, E2A-PBX1, MLL, T-ALL, Novel]


Gene Regulatory Circuits

• Genes are “connected” in a “circuit” or network

• Expression of a gene in a network depends on expression of some other genes in the network

• Can we reconstruct the gene network from gene expression data?


Key Questions

• For each gene in the network:

• Which genes affect it?

• How do they affect it?

– Positively?

– Negatively?

– More complicated ways?


Some Techniques

• Bayesian Networks

– Friedman et al., JCB 7:601--620, 2000

• Boolean Networks

– Akutsu et al., PSB 2000, pages 293--304

• Differential equations

– Chen et al., PSB 1999, pages 29--40

• Classification-based method

– Soinov et al., “Towards reconstruction of gene networks from expression data by supervised learning”, Genome Biology 4:R6.1--9, 2003


A Classification-Based Technique

Soinov et al., Genome Biology 4:R6.1-9, Jan 2003

• Given a gene expression matrix $X$

– each row is a gene

– each column is a sample

– each element $x_{ij}$ is the expression of gene $i$ in sample $j$

• Find the average value $a_i$ of each gene $i$

• Denote $s_{ij}$ as the state of gene $i$ in sample $j$:

– $s_{ij}$ = up if $x_{ij} > a_i$

– $s_{ij}$ = down if $x_{ij} \le a_i$
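The discretization step above can be sketched in a few lines. This is a minimal illustration that assumes the matrix is a plain list of rows (one row per gene); the function name `discretize` and the toy values are hypothetical.

```python
def discretize(X):
    """Turn an expression matrix (rows = genes, columns = samples) into
    up/down states by comparing each value with its gene's own average."""
    states = []
    for row in X:
        average = sum(row) / len(row)  # a_i for this gene
        # Strictly above the average is "up"; equal or below is "down".
        states.append(["up" if x > average else "down" for x in row])
    return states


# Toy matrix: 2 genes, 3 samples.
X = [
    [1.0, 3.0, 2.0],  # average 2.0
    [5.0, 5.0, 5.0],  # average 5.0, so no value is strictly above it
]
S = discretize(X)
```

Note that the threshold is derived from the data for each gene, so no global cut-off has to be chosen by hand.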


A Classification-Based Technique

Soinov et al., Genome Biology 4:R6.1-9, Jan 2003

• To see whether the state of gene $g$ is determined by the state of other genes

– see whether $\{s_{ij} \mid i \ne g\}$ can predict $s_{gj}$

– if it can predict with high accuracy, then “yes”

– any classifier can be used, such as C4.5, PCL, SVM, etc.

• To see how the state of gene $g$ is determined by the state of other genes

– apply C4.5 (or PCL or other “rule-based” classifiers) to predict $s_{gj}$ from $\{s_{ij} \mid i \ne g\}$

– and extract the decision tree or rules used
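A minimal sketch of the prediction test described above. To stay dependency-free it swaps in a one-rule classifier (a single-gene decision stump) for C4.5, PCL or SVM, and uses leave-one-out accuracy as the accuracy estimate. Both choices, and all function names, are assumptions made for illustration and not the method of Soinov et al.

```python
def best_stump(states, g, samples):
    """Find the gene i != g whose state best predicts the state of gene g
    on the given samples, either directly (same=True) or inverted."""
    best_correct, best_rule = -1, None
    for i in range(len(states)):
        if i == g:
            continue
        agree = sum(states[i][j] == states[g][j] for j in samples)
        for same, correct in ((True, agree), (False, len(samples) - agree)):
            if correct > best_correct:
                best_correct, best_rule = correct, (i, same)
    return best_rule


def predict(states, rule, j):
    """Apply a (gene, same) rule to sample j."""
    i, same = rule
    s = states[i][j]
    if same:
        return s
    return "down" if s == "up" else "up"


def loo_accuracy(states, g):
    """Leave-one-out accuracy of predicting gene g from the other genes."""
    n = len(states[0])
    hits = 0
    for j in range(n):
        train = [k for k in range(n) if k != j]
        rule = best_stump(states, g, train)
        hits += predict(states, rule, j) == states[g][j]
    return hits / n


# Toy states: gene 1 is always the opposite of gene 0; gene 2 is noisy.
states = [
    ["up", "up", "down", "down", "up", "down"],
    ["down", "down", "up", "up", "down", "up"],
    ["up", "down", "up", "down", "up", "down"],
]
accuracy = loo_accuracy(states, 0)
rule = best_stump(states, 0, range(6))  # the extracted "rule" for gene 0
```

In the toy data the extracted rule reads "gene 0 is up when gene 1 is down", which answers both the whether question (high accuracy) and the how question (the rule itself). A decision-tree learner would play the same role with rules over several genes.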


Advantages of this method

• Can identify genes affecting a target gene

• Don’t need discretization thresholds

• Each data sample is treated as an example

• Explicit rules can be extracted from the classifier (assuming C4.5 or PCL)

• Generalizable to time series
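The slide does not spell out the time-series generalization. One plausible reading, sketched here purely as an assumption, is to predict the state of gene g at time t+1 from the states of the other genes at time t. The function name `lagged_examples` and the toy states are hypothetical.

```python
def lagged_examples(states, g):
    """Build (features, label) pairs for a time series: the features are the
    states of all genes except g at time t, the label is gene g at time t+1.
    `states` has one row per gene and one column per time point."""
    n_times = len(states[0])
    examples = []
    for t in range(n_times - 1):
        features = {i: states[i][t] for i in range(len(states)) if i != g}
        examples.append((features, states[g][t + 1]))
    return examples


# Toy series: 2 genes observed at 3 time points.
series = [
    ["up", "down", "up"],
    ["down", "down", "up"],
]
pairs = lagged_examples(series, 0)
```

The resulting pairs can be fed to any classifier in the same way as the per-sample examples, so the rest of the procedure stays unchanged.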


Concluding Remarks


Bcr-Abl

• Targeted drug development

– Know what molecular effect you want to achieve

• E.g., inhibit a mutated form of a protein

– Engineer a compound that directly binds and causes the desired effect

• Gleevec (imatinib)

– First success for a real drug

– Targets the Bcr-Abl fusion protein (i.e., Philadelphia chromosome, Ph)

– NCI summary of clinical trial of imatinib for ALL at http://www.cancer.gov/clinicaltrials/results/ALLimatinib1109/print


What have we learned?

• Technologies

– Microarray

– PCL, ERCOF

• Microarray applications

– Disease diagnosis by supervised learning

– Subtype discovery by unsupervised learning

• Important tactics

– Extreme sample selection

– Intersection analysis, Gene network reconstruction


Useful Gene Expression Analysis Packages

• EXPANDER (EXPression Analyser & DisplayER)

– Ron Shamir @ Tel Aviv University

– http://acgt.cs.tau.ac.il/expander

• BRB-Array Tools

– Biometric Research Branch @ National Cancer Institute, USA

– http://linus.nci.nih.gov/BRB-ArrayTools.html


Any Questions?


References

• E.-J. Yeoh et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling”, Cancer Cell, 1:133--143, 2002

• H. Liu, J. Li, L. Wong, “Use of Extreme Patient Samples for Outcome Prediction from Gene Expression Data”, Bioinformatics, 21(16):3377--3384, 2005

• L.D. Miller et al., “Optimal gene expression analysis by microarrays”, Cancer Cell, 2:353--361, 2002

• J. Li, L. Wong, “Techniques for Analysis of Gene Expression”, The Practical Bioinformatician, Chapter 14, pages 319--346, WSPC, 2004

• B. Bolstad et al., “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias”, Bioinformatics, 19:185--193, 2003