limma linear model analysis of microarrays bayesian regularized t-test (baldi & long 2001) the...


Upload: natalie-horn

Post on 26-Dec-2015


Page 1: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

• T-statistics are widely used in assessing differential expression.

• Unstable variance estimates that arise when sample size is small can be corrected using:
– Error fudge factors (SAM)
– Bayesian methods (Limma)

Page 2

Limma

Linear model analysis of microarrays

Page 3

Bayesian regularized t-test (Baldi & Long 2001)

$$ t = \frac{m_C - m_T}{\sqrt{\dfrac{\sigma_C^2}{n_C} + \dfrac{\sigma_T^2}{n_T}}} $$

The method tries to decouple the mean–variance dependency by modeling the variance of the expression of a gene as a function of the mean expression of the gene.

The empirical variance is modulated by $\nu_0$ ‘pseudo-observations’ associated with a background variance $\sigma_0^2$.

[Figure: plot with one point labelled "My gene"]
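The regularization described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the blending formula `(nu0 * sigma0_sq + (n - 1) * s_sq) / (nu0 + n - 2)` is one commonly quoted form of the Baldi & Long estimate and is not given on the slide, the function names are hypothetical, and the background variance is passed in by hand rather than estimated from genes with similar mean expression.

```python
import math
from statistics import mean, variance


def regularized_variance(values, nu0, sigma0_sq):
    """Blend the sample variance with a background variance.

    nu0 plays the role of a number of pseudo-observations that carry
    the background variance sigma0_sq (assumed form, see lead-in).
    """
    n = len(values)
    s_sq = variance(values)  # unbiased sample variance (n - 1 denominator)
    return (nu0 * sigma0_sq + (n - 1) * s_sq) / (nu0 + n - 2)


def regularized_t(control, treated, nu0, sigma0_sq_c, sigma0_sq_t):
    """t = (m_C - m_T) / sqrt(var_C / n_C + var_T / n_T), regularized variances."""
    var_c = regularized_variance(control, nu0, sigma0_sq_c)
    var_t = regularized_variance(treated, nu0, sigma0_sq_t)
    standard_error = math.sqrt(var_c / len(control) + var_t / len(treated))
    return (mean(control) - mean(treated)) / standard_error


# Hypothetical log2 expression values for a single gene.
control = [7.9, 8.0, 8.1]
treated = [9.0, 9.1, 8.9]
print(regularized_t(control, treated, nu0=10, sigma0_sq_c=0.05, sigma0_sq_t=0.05))
```

With only three replicates per group the sample variance alone can be very small by chance; the pseudo-observations pull it towards the background value, which keeps the denominator of t from collapsing.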

Page 4

Bayesian regularized t-test

The main goal of this approach is to stabilize the variance estimates that arise when sample size is small, to make the t-test results more robust.

Page 5

Bayesian regularized t-test

The regularized t-test makes the presence of significant differential expression more evident.

Page 6

BH correction

• BH is the most widely used method for the correction of type I errors in microarray analysis.

• However, it has some limitations due to its initial hypotheses:
– The gene expressions are independent from each other.
– The raw distribution of p-values should be uniform in the non-significant range.
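For reference, the BH (Benjamini-Hochberg) adjustment itself is short enough to sketch. The function name `bh_adjust` and the example p-values are illustrative only; this is a plain-Python sketch of the standard step-up procedure, not the code used by limma or affylmGUI.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values
print(bh_adjust(raw))
```

Each raw p-value is scaled by the number of tests divided by its rank, so with thousands of probe sets only p-values well below the chosen threshold survive the correction.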

Page 7

Page 8

Applying BH correction to these p-values will not yield any differentially expressed gene!

Page 9

Let’s identify differentially expressed probe sets by linear modelling

To use linear models, the targets description and the raw data will be reorganized on the basis of the number of factors under analysis by Compute Linear Model Fit.

Page 10

The next step is the definition of the contrasts, which represent the pairs of conditions to be compared for differential expression.

If more than two conditions are available, more contrasts can be evaluated.

Page 11

Contrast parameterization is saved with a specific name

REMEMBER: contrasts are built from the different experimental groups (e.g. Treated, Control). Making Treated – Control means that the log2(expression) of control samples is subtracted from that of treated samples. The result is the log2(fold change).
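A tiny numeric illustration of the Treated – Control contrast, assuming the expression values are already on the log2 scale (as RMA summaries are). The values and the function name are made up for the example.

```python
from statistics import mean


def log2_fold_change(treated_log2, control_log2):
    """Treated - Control on the log2 scale gives the log2 fold change."""
    return mean(treated_log2) - mean(control_log2)


treated = [10.0, 10.5, 9.5]  # hypothetical log2 intensities, treated arrays
control = [8.0, 8.5, 7.5]    # hypothetical log2 intensities, control arrays

log2_fc = log2_fold_change(treated, control)
fold_change = 2 ** log2_fc   # back on the linear scale
print(log2_fc, fold_change)
```

A positive log2 fold change means higher expression in the treated samples; swapping the order of the contrast only flips the sign.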

Page 12

Before evaluating differential expression, the raw p-value distribution is checked.

[Figure: panels A, B, C]

Page 13

Page 14

[Figure: panels A, B, C]

If BH correction can be applied to correct type I errors, we can move to the selection of the subset of differentially expressed genes.

Page 15

[Figure: panels A, B]

Page 16

These results can be saved in a new topTable containing only the probe sets shown in red on the plots.

Page 17

TopTable structure

• AffyID
• Gene Symbol
• Gene Description
• Log2 FC
• Average intensity
• T statistics
• P-values
• Log-odds statistics

Page 18

Exercise 10 (30 minutes)

• Go to the folder estrogen.IGF1.
• Create, with Excel, a tab-delimited file named targets.txt:
– The targets file is made of three columns with the following headers:
  • Name
  • FileName
  • Target
– In column Name place a brief name (e.g. c1, c2, etc.)
– In column FileName place the name of the corresponding .CEL file
– In column Target place the experimental conditions (e.g. control, treatment, etc.)
• Create a targets file only for MCF7 and SKER3 with/without estrogen (E2) treatment.
• Calculate probe set summaries with RMA

See next page
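If Excel is not at hand, the same three-column tab-delimited layout can be produced with a short script. This is a sketch only: the .CEL file names and Target labels below are hypothetical placeholders, not the real names in the estrogen.IGF1 folder, and the text is built in memory rather than written to targets.txt.

```python
import csv
import io


def build_targets(rows):
    """Return the text of a tab-delimited targets file (Name, FileName, Target)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Name", "FileName", "Target"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical arrays: replace with the actual .CEL file names.
rows = [
    ("c1", "mcf7_ctrl_1.CEL", "mcf7.ctrl"),
    ("c2", "mcf7_ctrl_2.CEL", "mcf7.ctrl"),
    ("t1", "mcf7_e2_1.CEL", "mcf7.e2"),
    ("t2", "mcf7_e2_2.CEL", "mcf7.e2"),
]
targets_text = build_targets(rows)
print(targets_text)
```

Writing `targets_text` to a file named targets.txt in the data folder gives the same result as saving the Excel sheet as tab-delimited text.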

Page 19

Exercise 10 (30 minutes)

• In this experiment we have a breast cancer tumor cell line (MCF7) and a tumor cell line derived from the central nervous system (SKER3).

• Question:
– Which are the probe sets controlled by E2 in a tissue-independent manner?

See next page

Page 20

Exercise 10

• Filter the data:
– IQR 0.25, intensity 25% >100

• Calculate the models for E2 versus untreated cells both in mcf7 and sker3.

• Contrasts:
mcf7.e2 – mcf7.ctrl
sker3.e2 – sker3.ctrl

See next page
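The filtering step can be pictured with the following sketch. It rests on an assumed reading of the slide: "IQR 0.25" is taken as keeping probe sets whose interquartile range across arrays exceeds 0.25 on the log2 scale, and "intensity 25% >100" as requiring at least 25% of the arrays to have a linear-scale intensity above 100. The function names and the quantile method are illustrative choices, not the GUI's implementation.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range using inclusive quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def passes_filter(log2_values, iqr_min=0.25, intensity_min=100.0, fraction=0.25):
    """Keep a probe set if it varies enough and is expressed in enough arrays."""
    # Convert log2 values back to the linear scale for the intensity check.
    above = sum(1 for v in log2_values if 2 ** v > intensity_min)
    expressed = above / len(log2_values) >= fraction
    return iqr(log2_values) > iqr_min and expressed


flat = [5.0] * 8                                          # no variation, low signal
variable = [6.0, 6.2, 6.4, 8.0, 8.5, 9.0, 9.5, 10.0]      # changes across arrays
print(passes_filter(flat), passes_filter(variable))
```

Removing flat or unexpressed probe sets before fitting the models reduces the number of tests that the subsequent multiple-testing correction has to account for.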

Page 21

Exercise 10

• Evaluate if the raw p-value distributions are suitable for BH correction.

• Question:
– Is the raw p-value distribution good enough to perform BH correction?
  • YES NO

See next page

Page 22

Exercise 10

• Use the “Table of Genes Ranked in order of Differential Expression”.

• Plot differentially expressed genes with raw p-value ≤ 0.05 and an absolute fold change ≥ 1 for the two contrasts.

• Save the subset of the topTables in ex10.mcf7.xls, ex10.sker3.xls

• Save the project as ex10.lma

Page 23

Differential expression probe set lists generated by affylmGUI or SAM can be compared using Venn diagrams.

A maximum of three files can be compared.
Attention: each file is made of a single column of probe set IDs without a header.
Comparison can be performed at the probe set or Entrez Gene (EG) level.

[Figure: panels A to G]
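What the Venn diagram computes can be sketched with plain set operations. The probe set IDs below are invented for the example, and the function handles two lists only; it illustrates the idea and is not the GUI's implementation.

```python
def venn_regions(list_a, list_b):
    """Split two ID lists into the three regions of a two-set Venn diagram."""
    a, b = set(list_a), set(list_b)
    return {"only_a": a - b, "common": a & b, "only_b": b - a}


# Hypothetical probe set IDs, one list per single-column file.
mcf7_ids = ["1001_at", "1002_at", "1003_at"]
sker3_ids = ["1002_at", "1003_at", "1004_at"]

regions = venn_regions(mcf7_ids, sker3_ids)
print(sorted(regions["common"]))
```

Each region corresponds to one of the list subsets that the tool saves to the working directory.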

Page 24

The various list subsets will be saved in your working directory.

Page 25

Exercise 11 (15 minutes)

• Using “Venn Diagram between probe set lists”, evaluate the level of overlap between the Entrez Genes differentially expressed upon E2 treatment in MCF7 and in SKER3.

• Filter the expression data by the genes in common between the two conditions and export the Normalized Expression Values (ex10.common.txt).

Page 26

Time Course experiments

• maSigPro is an R package for the analysis of single and multiseries time course microarray experiments.

• maSigPro follows a two-step regression strategy to find genes with
– significant temporal expression changes
– significant differences between experimental groups.

Page 27

• Time course experimental design:
– We denote as experimental groups the experimental factors (dummy variables) for which temporal profiles are defined (e.g. "Treatment A", "Tissue1", etc.)
– Conditions are each experimental group vs. time combination (e.g. "Treatment A at Time 0"). Conditions may or may not have replicates.
– Variables are the regression variables defined by the maSigPro approach for the experiment regression model.
– maSigPro defines dummy variables to model differences between experimental groups.
– Dummy variables, Time and their interactions are the variables of the regression model.

Page 28

Time Course design for maSigPro

All this information should be collapsed in the Target column of the targets file using _ to combine data. This can be done using the function JOIN in Excel.

IMPORTANT: each treatment at each time has its corresponding untreated control!

Page 29

Time Course design for maSigPro

Page 30

Time Course design for maSigPro

The targets file for maSigPro has a peculiar structure: each row of the column named Target describes the array on the basis of the experimental design.

Each element describing the time course experiment is separated from the others by an underscore.

The first three elements of the row are fixed and represent Time, Replicate and Control; all the other elements refer to the various experimental conditions.

In this case we have an 8, 24, 48 h time course, in triplicate, with two different treatments: cond1 and cond2.
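To make the underscore-separated layout concrete, here is a sketch that splits one Target entry into its parts. The example string "8_1_0_1_0" and the use of integer 0/1 flags for Control and the treatment columns are assumptions made for illustration; check the actual targets file for the exact encoding.

```python
def parse_target(target):
    """Split a Target entry of the form Time_Replicate_Control_cond1_cond2..."""
    fields = target.split("_")
    if len(fields) < 4:
        raise ValueError("expected Time, Replicate, Control and at least one condition")
    time, replicate, control = (int(f) for f in fields[:3])
    conditions = [int(f) for f in fields[3:]]  # one flag per treatment (assumed 0/1)
    return {"time": time, "replicate": replicate,
            "control": control, "conditions": conditions}


# Hypothetical entry: 8 h, replicate group 1, not a control, cond1 yes, cond2 no.
parsed = parse_target("8_1_0_1_0")
print(parsed)
```

Parsing a few rows this way is a quick check that every array carries the three fixed elements before the file is handed to maSigPro.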

Page 31

The Target column is reformatted to be used by maSigPro using the command

Page 32

Large data set

• The oneChannelGUI interface has some limits (RAM) in loading/handling large sets of .CEL files.

• This is especially true for a large time course experiment like our example.

• To overcome this problem, probe set average expression intensities are calculated by Expression Console.

Page 33

Page 34

Page 35

When loading a tab-delimited file, the Bioconductor annotation library is not automatically defined.

Annotation Library information can be attached using:

Page 36

Do not forget!

• The multiple testing problem is also present in maSigPro analysis.

• Therefore, before running maSigPro, remember to perform some filtering based on functional information or sample distribution.

Page 37

Page 38

Once the experimental design for maSigPro is ready, it is possible to run the analysis.

When maSigPro is running, check what is going on in the main R window!

Page 39

Some parameters need to be set

Q: The first step is to compute a regression fit for each gene. The p-values associated with the F-statistic of the model are computed and subsequently used to select significant genes. maSigPro corrects these p-values for multiple comparisons by applying false discovery rate (FDR) procedures. The level of FDR control is given by the function parameter Q.

Page 40

Some parameters need to be set

Alpha: maSigPro applies a variable selection procedure to find significant variables for each gene. This will ultimately be used to find the profile differences between experimental groups. At each regression step the p-value of each variable is computed, and variables enter or leave the model when this p-value is lower or higher than the given cut-off value alpha.

Page 41

Some parameters need to be set

R-squared: The following step is to generate lists of significant genes according to the way we want to see the results. As a filter, maSigPro uses the R-squared of the regression model.

Page 42

What is the R-squared coefficient?

• r.squared: the "fraction of variance explained by a linear model"

$$ R^2 = 1 - \frac{\sum_i R_i^2}{\sum_i (y_i - y^*)^2} $$

where $R_i$ are the residuals of the fit and $y^*$ is the mean of $y_i$ if there is an intercept and zero otherwise.
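The definition can be checked numerically with a small sketch. It fits a straight line with an intercept by ordinary least squares and applies the formula directly; the function names and data are illustrative, and maSigPro's own regression models are more elaborate than a single line.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """R^2 = 1 - sum(residual^2) / sum((y - mean(y))^2) for a model with intercept."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    y_star = mean(y)
    return 1 - sum(r ** 2 for r in residuals) / sum((yi - y_star) ** 2 for yi in y)


x = [1.0, 2.0, 3.0, 4.0]
print(r_squared(x, [2.0, 4.0, 6.0, 8.0]))  # points exactly on a line
print(r_squared(x, [1.0, 2.0, 2.0, 1.0]))  # no linear trend
```

The two calls correspond to the two extreme cases: residuals that are all zero, and residuals that are as large as the deviations from the mean.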

Page 43

R-squared graphical view

$$ R^2 = 1 - \frac{\sum_i R_i^2}{\sum_i (y_i - y^*)^2} $$

[Figure: two Y versus X plots, one labelled $\sum_i R_i^2$ and one labelled $\sum_i (y_i - y^*)^2$]

Page 44

R-squared graphical view

$$ R^2 = 1 - \frac{0}{\sum_i (y_i - y^*)^2} = 1 $$

[Figure: two Y versus X plots, one labelled $\sum_i R_i^2$ and one labelled $\sum_i (y_i - y^*)^2$]

Page 45

R-squared graphical view

$$ \sum_i R_i^2 = \sum_i (y_i - y^*)^2 $$

$$ R^2 = 1 - \frac{\sum_i R_i^2}{\sum_i (y_i - y^*)^2} = 0 $$

[Figure: two Y versus X plots, the second labelled $\sum_i (y_i - y^*)^2$]

Page 46

Computation information is available in the main R window

Step 1

The procedure first fits this global model by the least-squares technique to identify differentially expressed genes, and selects significant genes by applying false discovery rate control procedures.

Step 2

Secondly, stepwise regression is applied as a variable selection strategy to study differences between experimental groups and to find statistically significant different profiles.

Page 47

When the computation is finished a message pops up

The coefficients obtained in the second regression model will be useful to cluster together significant genes with similar expression patterns and to visualize the results.

Page 48

Results can be visualized as Venn diagrams or by plotting the curves in a PDF file. K-means clustering is not yet implemented.

Page 49

Results can be visualized by plotting the curves in a PDF file.

[Figure: panels A, B, C, D]

The plots relate only to the subset of genes specific to each treatment condition.

Page 50

Page 51

Exercise 12 (30 minutes)

• This experiment was done with HGU133A.
– This is a cell line experiment made of three time points: 8, 24, 48 h.
– Each point is made of three biological replicates.
– Two different chemotherapeutic agents have been used (Treatment 1 and 2)
– Since these data have not yet been published, the probe set ids have been scrambled.
• In the time.course directory there are two files:
– An expression file derived from Expression Console
– A tab-delimited file describing the experimental conditions.
• Use this information to load the data, filter them by IQR (threshold of your choice), run maSigPro (e.g. Q=0.05, α=0.05, R=0.8) and view the results generated by maSigPro.

Page 52

Analysis pipeline

• Quality control
• Normalization
• Filtering
• Statistical analysis
• Annotation
• Biological knowledge extraction

Page 53

Annotation

• An important issue in microarray data analysis is the specific association of probe identifiers with genome annotated transcripts.

• A critical point in annotation is the way in which the association between probes and genes is produced.

Page 54

Annotation in Affymetrix

• NetAffx: Affymetrix annotation repository
• Bioconductor:
– uses a specific annotation library, AnnBuilder, to create annotation libraries starting from the association between probe set identifier and GenBank accession number (i.e. the primary target for probe design).
• RESOURCERER (Tsai et al. 2001):
– the annotation tool at the TIGR center uses EST and gene sequences stored in the TGI databases (www.tigr.org/tdb/tgi.shtml).
– They provide an analysis of publicly available EST and gene sequence data for the identification of transcripts and their placement in a genomic context, and the identification of orthologs and paralogs wherever possible.
• Neither Bioconductor nor TIGR methods operate at the probe level, nor do they consider the limited reliability of some sets due to probe cross-hybridization or erroneous probe/transcript annotation.
• Ensembl:
– Annotation with the Ensembl tool is built by direct matching of Affymetrix probes over the Ensembl sequence database.
– Its weak point is that matching of only 50% of the probes of a specific set to an Ensembl gene is needed for a true association definition "probe set identifier"/"Ensembl gene identifier".

Page 55

Gene Ontology

Page 56

Ontologies

• An ontology is a specification of a conceptualization:
– a hierarchical mapping of concepts within a given frame of reference.

• An ontology is a restricted structured vocabulary of terms that represent domain knowledge.

• An ontology specifies a vocabulary that can be used to exchange queries and assertions.

• A commitment to the use of the ontology is an agreement to use the shared vocabulary in a consistent way.

Page 57: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

The Gene Ontology

• The goal of the Gene Ontology (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
– http://www.geneontology.org/

• For genes and gene products, the GO Consortium is an initiative designed to address the problem of defining a common set of terms and descriptions for basic biological functions.

• GO provides a restricted vocabulary as well as clear indications of the relationships between terms.

Page 58: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

The Gene Ontology

• The Gene Ontology (GO) consortium produces three independent ontologies for gene products.

• The three ontologies are:
– molecular function of a gene product, defined as the biochemical activity or action of the gene product (MF 7220).
– biological process, interpreted as a biological objective to which the gene product contributes (BP 9529).
– cellular component, a component of a cell that is part of some larger object or structure (CC 1536).

Page 59: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

The Graph Structure of GO

• The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents.

• "GO node" is interchangeable with "GO term".

• Child terms are more specific than their parents:
– The term “transmembrane receptor protein-tyrosine kinase” is a child of both “transmembrane receptor” and “protein tyrosine kinase”.
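The multiple-parent structure described above can be sketched as a small dictionary-based DAG. This is only an illustration: the two parent links of the kinase term come from the slide, while the common "molecular function" parent and the names `PARENTS` and `ancestors` are assumptions made for the example.

```python
# Toy GO-like DAG: each term maps to the list of its direct parents.
# Only the two parent links of the kinase term come from the slide;
# "molecular function" as their common parent is an illustrative assumption.
PARENTS = {
    "transmembrane receptor protein-tyrosine kinase": [
        "transmembrane receptor",
        "protein tyrosine kinase",
    ],
    "transmembrane receptor": ["molecular function"],
    "protein tyrosine kinase": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


kinase_ancestors = ancestors("transmembrane receptor protein-tyrosine kinase")
print(sorted(kinase_ancestors))
```

Because a term can have several parents, the traversal keeps a `seen` set so that a shared ancestor is reported only once.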

Page 60: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

The Graph Structure of GO

• The relationship between a child and a parent can be characterized by the relations:
– is a
– has a (part of)

• “mitotic chromosome” is a child of “chromosome” and the relationship is an is a relation.

• “telomere” is a child of “chromosome” with the has a relation: a chromosome has a telomere, so the telomere is part of the chromosome.

Page 61: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

GO structure

[Figure: graph of GO relationships for the term transcription factor (GO:0003700), with the top node marked.]

Page 62: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Induced GO graph for a set of differentially expressed genes.

GO can be used to link differentially expressed genes to specific functional classes.

[Figure: the induced GO graph, with the top node marked, colored according to the unadjusted hypergeometric p-value (threshold 0.01).]

Page 63: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Consider a population of genes representing a diverse set of GO terms, shown below as different colors.

Page 64: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Many methods can be used to identify a set of differentially expressed genes.

Page 65: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

What are some of the predominant GO terms represented in the set of differentially expressed genes, and how should significance be assigned to a discovered GO term?

Page 66: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Example:
Population size: 40 genes
Subset of differentially expressed genes: 12 genes

10 genes, shown in light blue, have a common GO term and 8 occur within the set of differentially expressed genes.

Page 67: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Contingency Matrix

A 2x2 contingency matrix is typically used to capture the relationship between membership in the differentially expressed set and membership to a GO term.

Page 68: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Contingency Matrix

                 Subset: in   Subset: out
GO term: in           8            2
GO term: out          4           26

Page 69: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Hypergeometric Distribution

For a 2x2 contingency matrix with cells a, b (first row) and c, d (second row), the row totals are a+b and c+d, the column totals are a+c and b+d, and n = a+b+c+d.

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule:

\[
p = \frac{\binom{a+c}{a}\binom{b+d}{b}}{\binom{n}{a+b}}
  = \frac{\dfrac{(a+c)!}{a!\,c!}\cdot\dfrac{(b+d)!}{b!\,d!}}{\dfrac{n!}{(a+b)!\,(c+d)!}}
  = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}
\]

Page 70: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Assigning Significance to the Findings

The Hypergeometric Test permits us to determine if there are non-random associations between the two variables, differential expression membership and membership to a particular Gene Ontology term.

( 2x2 contingency matrix )

                 Subset: in   Subset: out
GO term: in           8            2
GO term: out          4           26

p ≈ 0.0002
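The p-value quoted for the 40-gene example can be checked with a few lines of Python. This is a minimal sketch, not the code used by any particular package: the function names are illustrative, and the slide does not say whether its value is the probability of the single observed matrix or the one-sided tail sum, so both are computed (both round to 0.0002).

```python
from math import comb


def matrix_probability(a, b, c, d):
    """Hypergeometric probability of one particular 2x2 matrix [[a, b], [c, d]]."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def enrichment_p_value(a, b, c, d):
    """One-sided p-value: probability of the observed matrix or a more extreme one
    (more genes in both the GO term and the subset), with all margins held fixed."""
    total = 0.0
    # Move k genes from the off-diagonal cells to the diagonal cells; the row
    # and column totals stay unchanged.
    for k in range(min(b, c) + 1):
        total += matrix_probability(a + k, b - k, c - k, d + k)
    return total


# Slide example: rows are GO term in/out, columns are subset in/out.
p_single = matrix_probability(8, 2, 4, 26)
p_value = enrichment_p_value(8, 2, 4, 26)
print(f"P(this matrix) = {p_single:.6f}")
print(f"one-sided p    = {p_value:.6f}")
```

The tail sum is the usual choice for an enrichment test, because a term that is even more over-represented than observed would also count as evidence against random selection.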

Page 71: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

EASE (Expression Analysis Systematic Explorer)

• EASE analysis identifies prevalent biological themes within gene clusters.

• The highest-ranking themes derived by a computational method can recapitulate manually derived themes in previously published microarray, proteomics and SAGE results, and provide evidence that these themes are stable to varying methods of gene selection.

Hosack et al. Genome Biol., 4:R70-R70.8, 2003.

Page 72: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 73: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 74: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

EASE Results

• Consider all of the Results

EASE reports all themes represented in a cluster, and although some themes may not meet statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.

• Independently Verify Roles

Once found, biological themes should be independently verified using annotation resources.

Page 75: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

GOstats package

• To perform an analysis using the Hypergeometric-based test, one needs to define a gene universe and a list of selected genes from the universe.

• To identify the set of expressed genes from a microarray experiment, R. Gentleman (GOstats developer) proposed that a non-specific filter be applied and that the genes that pass the filter be used to form the universe for any subsequent functional analyses.

Page 76: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

The Bioconductor library GOstats allows the calculation of enriched GO classes within a set of differentially expressed probe sets.

Select the threshold of significance and the GO class of interest.

Select the list of affyIDs representing the differentially expressed probe sets. REMEMBER: the file should contain only the affy ids!

[Figure: panels labeled A, B, C and D.]

Page 77: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

If the names of the GO classes are too small in the plot, save it as a PDF and view it with Acrobat Reader, zooming in on the figure.

Page 78: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

The reason for this representation is the selection of the GO terms that contain the smaller subsets.

Page 79: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

– GO identifier
– Description of GO term
– significance
– N. of genes belonging to the GO term in the universe
– N. of genes in the differentially expressed set

Page 80: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

To learn more about the parents of a specific GO term you can use the plotGO function.

Page 81: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

It is possible to identify the affy ids associated with a specific GO term.

[Figure: panels labeled A, B, C and D.]

Page 82: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 83: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Exercise 13 (20 minutes)

• Using the GOenrichment function, check if there is any overlap between the BP GO classes found enriched (p-value threshold 0.01) using the sets of probe sets found differentially expressed upon E2 treatment in MCF7 or SKER3.

• Question:
– Which are the BP or MF GO terms in common between the two sets of differentially expressed probe sets?

See next page

Page 84: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Exercise 13 (10 minutes)

• Using plotGO, see which are the parents of the GO term(s) in common between the probe sets differentially expressed in MCF7 and those in SKER3 upon E2 treatment.

• Using the extractAffyids function, check the number of probe sets derived by limma differential expression that are also present in the common GO terms.

• Question:
– Are the probe sets belonging to the common GO terms the same in the two differential expression analyses?

Page 85: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Clustering

Page 86: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Is there an ideal clustering procedure?

• No!
– Each clustering algorithm has its ideal data structure.

• Since we do not know the structure of the data:
• Various clustering methods have to be applied in order to identify the one that best fits the data under analysis.

N.B. Tmev 4.0 (www.tigr.org) was used for this presentation.

Page 87: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Supervised versus unsupervised clustering

• Supervised clustering tries to find the best partition for data that belong to a known set of classes.

• Unsupervised clustering tries to define the number and the size of the classes into which the transcription profiles can be fitted.

Page 88: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

The Expression Matrix is a representation of data from multiple microarray experiments.

Rows: N probe sets. Columns: D experiments.

X11 X12 X13 … X1d (L)
X21 X22 X23 … X2d (L)
…
Xn1 Xn2 Xn3 … Xnd (L)

Each element is a log ratio.

Up modulation (+) is usually represented as RED and down modulation (-) as GREEN; 0 corresponds to no change.

Page 89: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Large data sets can be loaded as tab delimited files

To load them you need:
1) A tab delimited file with array names on the first row and probe set ids on the first column.
2) A target file containing the clinical information. The usual Target column of the target file should have these characteristics.

Page 90: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

This file can be generated by joining the columns of the clinical parameters with an underscore “_”.

Join function in Excel

Page 91: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 92: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 93: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Loading data as tab delimited file

Select tab delimited files as the format description.

Export expression data as tab delimited files.

Page 94: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Select the first numerical value and load the data

Page 95: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Expression Vectors

• Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

[Figure: an example expression vector with values -0.8, 0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5 plotted over conditions 1 to 8; y-axis: log2(time_t/time_0), from -2 to 2.]

Page 96: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Data reformatting

• Clustering can be performed using a virtual array as reference:
– A virtual array can be calculated by averaging gene expression over the experimental conditions.

• Clustering can be performed by building virtual two-dye experiments:

\[ \log_2\frac{T}{C} \quad\text{or}\quad \log_2\frac{T_i}{C_j} \]

where i=1…I, j=1…J

• Clustering can also be performed without the use of a common reference by:
– Genes centering
– Experiments centering

Page 97: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Data reformatting

Gene centering:

\[ Z_i = \frac{X_i - \mu_{row}}{\sigma_{row}} \]

Array centering:

\[ Z_i = \frac{X_i - \mu_{col}}{\sigma_{col}} \]

Centering at gene level removes the scaling differences!
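Gene (row) and array (column) centering can be sketched in plain Python as follows. This is an illustrative sketch, assuming that each value is shifted by the mean and divided by the sample standard deviation of its row or column; the slide does not say whether the sample or population standard deviation is used, and the function names and the small matrix are made up for the example.

```python
from statistics import mean, stdev


def standardize(values):
    """Return Z_i = (X_i - mean) / sd for one row (gene) or one column (array)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [(x - mu) / sigma for x in values]


def center_genes(matrix):
    """Standardize each row (probe set) of the expression matrix."""
    return [standardize(row) for row in matrix]


def center_arrays(matrix):
    """Standardize each column (experiment) of the expression matrix."""
    z_columns = [standardize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*z_columns)]


# Hypothetical data: two probe sets with the same profile shape but a
# ten-fold difference in scale.
expression = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
]
z_genes = center_genes(expression)
z_arrays = center_arrays(expression)
print(z_genes)
```

After gene centering both rows become identical, which is the sense in which the scaling differences between genes are removed.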

Page 98: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Various data reformatting options are available

We will mainly use gene/row adjustment

Page 99: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Distance and Similarity

• The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.

• Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.

• Selection of a distance metric defines the concept of distance.

Page 100: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 101: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Distance is Defined by a Metric

x = (5,5)
y = (9,8)

Euclidean distance: d(x,y) = √(4² + 3²) = 5
Manhattan distance: d(x,y) = 4 + 3 = 7

[Figure: right triangle between x and y with legs 4 and 3 and hypotenuse 5.]
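The two distances for x = (5,5) and y = (9,8) can be reproduced with a short Python sketch; the function names are illustrative and are not taken from any clustering package.

```python
from math import sqrt


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x = (5, 5)
y = (9, 8)
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7
```

The same pair of points is 5 units apart under one metric and 7 under the other, so the choice of metric changes which expression vectors count as close.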

Page 102: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Distance is Defined by a Metric

Distance Metric:   Euclidean   Pearson
D                     4.2        1.00
D                     1.4        0.90

[Figure: two pairs of expression profiles; y-axis: log2(time_t/time_0), from -2 to 2.]

Page 103: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Many distance metrics are available. If a selection is not made, the default selection for each type of clustering approach will be used.

Page 104: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Hierarchical Clustering (HCL)

• HCL is an agglomerative/divisive clustering method.

• The iterative process continues until all groups are connected in a hierarchical tree.

Page 105: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 106: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency
Page 107: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Hierarchical Clustering (agglomerative)

g1 g2 g3 g4 g5 g6 g7 g8

g1 is most like g8:
g1 g8 g2 g3 g4 g5 g6 g7

g4 is most like {g1, g8}:
g1 g8 g4 g2 g3 g5 g6 g7

Page 108: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Hierarchical Clustering

g1 g8 g4 g2 g3 g5 g6 g7

g5 is most like g7:
g1 g8 g4 g2 g3 g5 g7 g6

{g5,g7} is most like {g1, g4, g8}:
g1 g8 g4 g5 g7 g2 g3 g6

Page 109: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Hierarchical Tree

g1 g8 g4 g5 g7 g2 g3 g6

Page 110: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Hierarchical Clustering

• During construction of the hierarchy, decisions must be made to determine which clusters should be joined.

• The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

Page 111: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Agglomerative Linkage Methods

• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.

• Three linkage methods that are commonly used are:
– Single Linkage
– Average Linkage
– Complete Linkage

Page 112: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Single Linkage

• Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster.

• Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.

[Figure: distance D_AB between clusters A and B under single linkage.]

Page 113: Limma Linear model analysis of microarrays Bayesian regularized t-test (Baldi & Long 2001) The method tries to decouple the mean–variance dependency

Average Linkage

• Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster.

• Average linkage has a slight tendency to produce clusters of similar variance.

[Figure: distance D_AB between clusters A and B under average linkage.]

Complete Linkage

• Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster.

• Complete linkage tends to create clusters of similar size and variability.

[Figure: complete-linkage distance D_AB between clusters A and B]
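The three linkage rules can be compared on a toy example. This is a minimal sketch, assuming Euclidean distance between members; the function names and the two small clusters are made up for illustration and are not taken from the slides.

```python
from itertools import product
from math import dist


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every member of A and every member of B."""
    return [dist(a, b) for a, b in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    # Minimum distance between members of the two clusters.
    return min(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    # Average distance over all member pairs.
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


def complete_linkage(cluster_a, cluster_b):
    # Maximum distance between members of the two clusters.
    return max(pairwise_distances(cluster_a, cluster_b))


# Hypothetical clusters A and B, each with two members on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 3.0
print(average_linkage(cluster_a, cluster_b))   # 4.5
print(complete_linkage(cluster_a, cluster_b))  # 6.0
```

The same pair of clusters gets three different D_AB values, which is why the choice of linkage changes the shape of the resulting tree.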

HCL

• A clustering result can be represented by many different graphical views.

[Figure: alternative graphical views of one clustering of elements 1, 2, 3 and 4]

HCL

• HCL does not converge to a unique result, and each run represents one of the possible solutions.

• To obtain information on cluster stability, a resampling method should be applied:
– Bootstrapping:
  • resampling with replacement
– Jackknifing:
  • resampling without replacement

To perform HCL, click on the HCL icon

To see results click on

Visualization can be reformatted

Bootstrapping (ST)

Bootstrapping – resampling with replacement

[Figure: the original expression matrix (Gene 1–Gene 6 by Exp 1–Exp 6) and various bootstrapped matrices (by experiments), in which experiment columns are drawn with replacement, so that an experiment such as Exp 4 can appear more than once while others are left out]
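Bootstrapping by experiments can be sketched as follows. This is an illustrative sketch only: the toy matrix values, the function name and the fixed random seed are assumptions, not part of the slides or of any tool shown in them.

```python
import random


def bootstrap_columns(matrix, rng):
    """Draw as many experiment columns as the original has, with replacement."""
    n_cols = len(matrix[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    resampled = [[row[j] for j in picks] for row in matrix]
    return resampled, picks


# Toy 6 genes x 6 experiments matrix; entry = 10 * gene + experiment.
matrix = [[10 * gene + exp for exp in range(1, 7)] for gene in range(1, 7)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
boot, picks = bootstrap_columns(matrix, rng)

# Column labels of the bootstrapped matrix; repeats are expected.
print(["Exp %d" % (j + 1) for j in picks])
```

Clustering each bootstrapped matrix and counting how often a node reappears gives the support values used to judge cluster stability.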

Jackknifing (ST)

Jackknifing – resampling without replacement

[Figure: the original expression matrix (Gene 1–Gene 6 by Exp 1–Exp 6) and various jackknifed matrices (by experiments): one without Exp 2 (Exp 1, Exp 3, Exp 4, Exp 5, Exp 6) and one without Exp 5 (Exp 1, Exp 2, Exp 3, Exp 4, Exp 6)]

[Figure: example tree obtained with 1000 bootstraps, Euclidean distance, average clustering]
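Jackknifing by experiments can be sketched in the same way. The sketch assumes the leave-one-out form suggested by the jackknifed matrices on the slide, where each matrix omits a single experiment; the function name and toy values are hypothetical.

```python
def jackknife_columns(matrix):
    """Yield (left_out_index, matrix without that experiment column)."""
    n_cols = len(matrix[0])
    for left_out in range(n_cols):
        reduced = [[value for j, value in enumerate(row) if j != left_out]
                   for row in matrix]
        yield left_out, reduced


# Toy 6 genes x 6 experiments matrix; entry = 10 * gene + experiment.
matrix = [[10 * gene + exp for exp in range(1, 7)] for gene in range(1, 7)]

jackknifed = list(jackknife_columns(matrix))

# The second jackknifed matrix omits Exp 2, so Gene 1 keeps Exp 1, 3, 4, 5, 6.
left_out, reduced = jackknifed[1]
print(left_out, reduced[0])  # 1 [11, 13, 14, 15, 16]
```

No column is ever repeated here, which is the difference from bootstrapping.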

To run HCL with resampling

To see results click on

A subset of genes can be selected by clicking on the node of interest.

By placing the mouse over the node and clicking the right mouse button, various information about the group of genes can be saved.

Principal component analysis

• Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.

• The first principal component accounts for as much of the variability in the data as possible.

• Each succeeding component accounts for as much of the remaining variability as possible.

• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

Often the first three components account for most of the variability. Therefore, PCA can be reasonably represented in a 3D space.
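The idea that each component captures as much of the remaining variability as possible can be illustrated numerically. This is a minimal sketch, assuming power iteration with deflation on the covariance matrix as a stand-in for a library routine; the six 2-D samples are invented for illustration.

```python
from math import sqrt


def covariance_matrix(samples):
    """Sample covariance of the columns (variables) of `samples`."""
    n, p = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(p)]
    centered = [[s[j] - means[j] for j in range(p)] for s in samples]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_axis(cov, iterations=500):
    """Power iteration: the direction of largest variance and that variance."""
    p = len(cov)
    axis = [1.0 / (k + 1) for k in range(p)]  # arbitrary non-zero start
    for _ in range(iterations):
        product = [sum(cov[i][j] * axis[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in product))
        axis = [x / norm for x in product]
    variance = sum(axis[i] * cov[i][j] * axis[j]
                   for i in range(p) for j in range(p))
    return axis, variance


def principal_components(samples, k):
    """Return k (axis, fraction of total variance) pairs, largest first."""
    cov = covariance_matrix(samples)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    result = []
    for _ in range(k):
        axis, variance = leading_axis(cov)
        result.append((axis, variance / total))
        # Deflation: remove the variability already explained by this axis.
        cov = [[cov[i][j] - variance * axis[i] * axis[j] for j in range(p)]
               for i in range(p)]
    return result


# Hypothetical samples with two strongly correlated variables.
samples = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2),
           (3.0, 3.2), (-2.0, -1.8), (1.0, 0.6)]

(axis1, frac1), (axis2, frac2) = principal_components(samples, 2)
dot = sum(a * b for a, b in zip(axis1, axis2))
print(frac1, frac2, dot)
```

With two correlated variables nearly all the variability falls on the first axis, and the second axis comes out orthogonal to it.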

[Figure: data cloud with principal component axes 1 and 2]

The 2nd PC will be orthogonal to the 1st.

[Figure, panel A: PCA separating samples into Cluster 1 and Cluster 2; sample labels MCF7, SKER-3, E2, IGF]

The parameters that produce the partition in PCA are dependent on the correlated variables.

[Figure from Quaglino et al., J. Clin. Invest. 2004]

We have already used PCA for quality control

Results clicking on

Click on the right mouse button over the 3D/2D view

Cluster Affinity Search Technique (CAST)

• CAST uses an iterative approach to segregate elements with ‘high affinity’ into a cluster.

• The process iterates through two phases:
– addition of high affinity elements to the cluster being created
– removal or clean-up of low affinity elements from the cluster being created

Clustering Affinity Search Technique (CAST) – 1

Affinity = a measure of similarity between a gene and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.

ADD GENES:

1. Create a new empty cluster C1.

2. Set the initial affinity of all genes to zero.

3. Move the two most similar genes into the new cluster.

4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).

5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.

[Figure: empty cluster C1 and the unassigned genes G1–G15]

CAST – 2

REMOVE GENES:

6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene’s affinity its similarity to the removed gene.

7. Repeat step 6 while C1 contains a low-affinity gene.

8. Repeat steps 5-7 as long as changes occur to the cluster C1.

9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.

10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.

[Figure: current cluster C1 and the unassigned genes, drawn from G1–G15]
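Steps 1-10 can be sketched compactly. This is an illustrative sketch under stated assumptions, not the TMEV implementation: similarities lie in [0, 1] with 1.0 on the diagonal, so the maximum affinity for a cluster of size n is n and the threshold becomes `threshold * n`; affinities are recomputed from scratch instead of updated incrementally; a cluster member's affinity includes its self-similarity; `max_rounds` is a hypothetical safety cap.

```python
def affinity(gene, cluster, similarity):
    """Sum of similarities between one gene and every gene in the cluster."""
    return sum(similarity[gene][member] for member in cluster)


def cast(similarity, threshold, max_rounds=100):
    unassigned = set(range(len(similarity)))
    clusters = []
    while unassigned:
        if len(unassigned) == 1:  # a lone leftover gene forms its own cluster
            clusters.append(sorted(unassigned))
            break
        # Steps 1-3: seed the new cluster with the two most similar genes.
        pairs = [(i, j) for i in sorted(unassigned)
                 for j in sorted(unassigned) if i < j]
        seed = max(pairs, key=lambda p: similarity[p[0]][p[1]])
        cluster = set(seed)
        for _ in range(max_rounds):  # step 8: repeat while the cluster changes
            changed = False
            # Step 5 (ADD): take the highest-affinity outside gene if it passes.
            while unassigned - cluster:
                best = max(sorted(unassigned - cluster),
                           key=lambda g: affinity(g, cluster, similarity))
                if affinity(best, cluster, similarity) < threshold * len(cluster):
                    break
                cluster.add(best)
                changed = True
            # Steps 6-7 (REMOVE): drop the lowest-affinity member if it fails.
            while len(cluster) > 1:
                worst = min(sorted(cluster),
                            key=lambda g: affinity(g, cluster, similarity))
                if affinity(worst, cluster, similarity) >= threshold * len(cluster):
                    break
                cluster.remove(worst)
                changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))
        unassigned -= cluster  # steps 9-10: continue with the remaining genes
    return clusters


# Hypothetical similarity matrix: genes 0-2 form one group, genes 3-4 another.
similarity = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
clusters = cast(similarity, threshold=0.5)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Unlike HCL or SOM, the number of clusters is not specified in advance; it follows from the threshold.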

Click on

Parameter to be set

SOMs

Self-organizing maps (SOMs) – 1

1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.

[Figure: genes G1–G29 (G = genes) and nodes N1–N6 (N = nodes) arranged in a 3 × 2 grid]

SOMs – 2

2. Choose a random gene, e.g., G9.

3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The further away the node is from N2, the less it is moved.

[Figure: nodes N1–N6 moving toward G9 among genes G1–G29]

SOM Neighborhood Options

– Bubble neighborhood: some nodes move (those within the neighborhood radius), alpha is constant.
– Gaussian neighborhood: all nodes move, alpha is scaled.

[Figure: nodes N1–N6 and genes G7–G11 under each neighborhood option, with the neighborhood radius marked]
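The two neighborhood options can be written as update-rate functions. This is a sketch only: the Gaussian formula below is a common textbook choice assumed for illustration, and the function and parameter names are hypothetical rather than taken from the software.

```python
from math import exp


def bubble_rate(grid_distance, radius, alpha):
    """Nodes inside the radius move with the full, constant alpha; others stay."""
    return alpha if grid_distance <= radius else 0.0


def gaussian_rate(grid_distance, radius, alpha):
    """Every node moves; alpha is scaled down smoothly with grid distance."""
    return alpha * exp(-(grid_distance ** 2) / (2 * radius ** 2))


# Rates for the winning node (distance 0) and nodes 1 and 2 grid steps away.
for d in (0, 1, 2):
    print(d, bubble_rate(d, 1.0, 0.5), gaussian_rate(d, 1.0, 0.5))
```

Here `grid_distance` is measured between nodes on the 2-D node grid, not in expression space.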

SOMs – 3

4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.

5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.

[Figure: nodes N1–N6 nestled among clusters of genes G1–G29]
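Steps 1-5 can be put together in a short training loop. This is a minimal sketch under several assumptions that are not in the slides: a Gaussian neighborhood, a linearly decreasing step size, nodes initialised on randomly chosen genes, and a toy data set with two nodes instead of six.

```python
import random
from math import dist, exp


def train_som(genes, grid, iterations=2000, alpha0=0.5, radius=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: one node per grid position, started on a randomly chosen gene.
    nodes = [list(rng.choice(genes)) for _ in grid]
    for t in range(iterations):
        alpha = alpha0 * (1 - t / iterations)  # step 4: moves shrink over time
        gene = rng.choice(genes)               # step 2: pick a random gene
        winner = min(range(len(nodes)), key=lambda k: dist(nodes[k], gene))
        for k, node in enumerate(nodes):       # step 3: move nodes toward it
            grid_distance = dist(grid[k], grid[winner])
            rate = alpha * exp(-(grid_distance ** 2) / (2 * radius ** 2))
            for d in range(len(node)):
                node[d] += rate * (gene[d] - node[d])
    return nodes


def assign(genes, nodes):
    """Step 5: each gene belongs to the cluster of its closest node."""
    return [min(range(len(nodes)), key=lambda k: dist(nodes[k], g))
            for g in genes]


# Hypothetical expression profiles forming two well-separated groups.
genes = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 9.0), (9.0, 10.0)]
grid = [(0, 0), (0, 1)]  # a 1 x 2 rectangular node geometry

nodes = train_som(genes, grid)
labels = assign(genes, nodes)
print(labels)
```

Because the winning node always moves the most, the two nodes separate and each settles near one group of genes.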

Click on

Exercise 14

• This exercise is based on the breast cancer data set published by Chin in Cancer Cell 2006 (hgu133A HT platform).

• Using the clinical data (E-TABM-158-clinical.data.txt) available in the large.data.set dir:
– Construct a target file, like the one used in the time course.
– Load the data in E-TABM-158-processed-data.txt using the created target file.
– Filter the data by IQR 0.5, and require that 25% of samples have a signal intensity over 100.
– Save the project as ex14.lma
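The two filters can be read as a per-gene test. This is a hedged sketch, not the GUI's implementation: it assumes the IQR is computed on log2 intensities, that a gene is kept when its IQR is at least 0.5, and that both filters must pass; the function name and example values are hypothetical.

```python
from math import log2
from statistics import quantiles


def passes_filters(intensities, iqr_cutoff=0.5, min_fraction=0.25,
                   min_signal=100.0):
    """Keep a gene that varies enough and is expressed in enough samples."""
    log_values = [log2(v) for v in intensities]  # assumes positive intensities
    q1, _, q3 = quantiles(log_values, n=4)
    varies = (q3 - q1) >= iqr_cutoff
    fraction_high = sum(v > min_signal for v in intensities) / len(intensities)
    return varies and fraction_high >= min_fraction


flat_gene = [200.0] * 8                                   # no variation
variable_gene = [50, 60, 400, 800, 1600, 90, 3200, 70]    # varies, expressed
low_gene = [10, 20, 40, 80, 10, 20, 40, 80]               # varies, never > 100

print(passes_filters(flat_gene))      # False
print(passes_filters(variable_gene))  # True
print(passes_filters(low_gene))       # False
```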

Exercise 14 (30 minutes)

– Filter the data on the basis of a list of EGs derived from Ingenuity related to cell signaling (use the advanced search at Ingenuity).
– Export the data as tab-delimited files.

• After row mean centering perform:
– Hierarchical clustering, and select those gene clusters that group samples in two main groups. Label those groups.
– Apply CAST or SOM and see how the HCL groups of genes are reorganized.
– Subset and save the clusters you have identified.
– Combine them in Excel.
– Load them in TMEV and see how PCA divides the samples.
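Row mean centering, the preprocessing step named in the exercise, subtracts each gene's mean from its own row. A minimal sketch with invented values:

```python
def row_mean_center(matrix):
    """Subtract each row's (gene's) mean from every value in that row."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


# Two hypothetical genes measured in three samples.
expression = [[1.0, 2.0, 3.0],
              [10.0, 10.0, 10.0]]
print(row_mean_center(expression))  # [[-1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

After centering, clustering compares the shape of the profiles across samples rather than the absolute expression level of each gene.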

Classification

• The task of diagnosing cancer on the basis of microarray data has been termed class prediction in the literature.

• The task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile.

The example of a classification problem used in the PAM publication

• Data for small round blue cell tumors (SRBCT) of childhood (Khan et al. 2001), consisting of expression measurements on 2,308 genes, were obtained from glass-slide cDNA microarrays.

• The tumors are classified as:
– Burkitt lymphoma (BL),
– Ewing sarcoma (EWS),
– neuroblastoma (NB),
– rhabdomyosarcoma (RMS).

• A total of 63 training samples and 25 test samples were provided, although five of the latter were not SRBCTs.

PAM

• PAM is a modification of the nearest-centroid method, called “nearest shrunken centroid.”

• PAM uses “de-noised” versions of the centroids as prototypes for each class.

[Figure: centroids (grey) and shrunken centroids (red) for the SRBCT dataset. The overall centroid has been subtracted from the centroid of each class.]

• SRBCT classification: training (tr, green), cross-validation (cv, red), and test (te, blue) errors are shown as a function of the threshold parameter.

• The value 4.34 is chosen and yields a subset of 43 selected genes.

• Shrunken differences d_ik for the 43 genes having at least one nonzero difference.

• The genes with nonzero components in each class are almost mutually exclusive.
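The shrinkage step can be sketched as soft thresholding of the differences d_ik between each class centroid and the overall centroid. Only the threshold 4.34 comes from the slide; the gene names and d_ik values below are invented, and the standardisation of d_ik used by the real method is left out.

```python
def soft_threshold(d, delta):
    """Move d toward zero by delta; anything within delta of zero becomes 0."""
    magnitude = max(abs(d) - delta, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if d > 0 else -magnitude


def shrink(differences, delta):
    """Apply soft thresholding to every d_ik (gene i, class k)."""
    return {gene: [soft_threshold(d, delta) for d in row]
            for gene, row in differences.items()}


# Hypothetical d_ik values for classes BL, EWS, NB, RMS.
differences = {
    "gene_a": [6.0, -1.0, 0.5, -2.0],
    "gene_b": [1.0, -0.5, 0.2, 0.3],
    "gene_c": [-0.4, 0.1, -5.34, 2.0],
}

shrunken = shrink(differences, delta=4.34)
selected = [gene for gene, row in shrunken.items() if any(v != 0.0 for v in row)]
print(selected)  # ['gene_a', 'gene_c']
```

A gene whose shrunken differences are all zero no longer contributes to any class centroid, which is how raising the threshold reduces the gene list.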

PAM performance

• Misclassification rates for seven classifiers on six microarray datasets based on 50 random partitions into learning sets (two-thirds of the data) and test sets (one-third of the data).

• The nearest shrunken centroid classifier (PAM), as well as the simple benchmarks NNR and DLDA, do surprisingly well and can almost keep up except on the prostate data (the largest dataset in the analysis).

• The success of such methodologically simple tools is limited to gene expression datasets with small sample size.
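The evaluation scheme described above, random partitions into two-thirds learning and one-third test sets, can be sketched as follows. The function names, sample identifiers and labels are hypothetical; no classifier is included.

```python
import random


def random_partition(sample_ids, rng, learning_fraction=2 / 3):
    """Shuffle the samples and split them into learning and test sets."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * learning_fraction)
    return shuffled[:cut], shuffled[cut:]


def misclassification_rate(predicted, actual):
    """Fraction of test samples whose predicted class is wrong."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
learning, test = random_partition(range(30), rng)
print(len(learning), len(test))  # 20 10

# One wrong call out of three test samples.
rate = misclassification_rate(["ER+", "ER-", "ER+"], ["ER+", "ER+", "ER+"])
print(rate)
```

Repeating the partition many times (50 in the comparison above) and averaging the rate gives a more stable estimate than a single split.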

Reorganize clinical information

Load a large data set as a tab-delimited file. Save in a file the description of the clinical parameters collapsed in the Target column of the targets file.

Run PAMR analysis

If the selected probe sets are fewer than 50

Yes

A nice separation between ER-positive and ER-negative samples can also be achieved on the test set

Exercise 15

• Load ex14.lma.

• Attach the clinical parameters description.

• Divide the data into training and test sets on the basis of one of the non-continuous parameters (e.g., Yes/No; Pos/Neg).

• Use PAMR to define the minimal subset of genes, if any, discriminating between the two groups.

Revision exercise

• Use the data set HuGene, made of:
– Three breast samples:
  • TisMap_Breast_01_v1_WTGene1.CEL
  • TisMap_Breast_02_v1_WTGene1.CEL
  • TisMap_Breast_03_v1_WTGene1.CEL
– Three brain samples:
  • TisMap_Brain_01_v1_WTGene1.CEL
  • TisMap_Brain_02_v1_WTGene1.CEL
  • TisMap_Brain_03_v1_WTGene1.CEL

• Perform all the steps of a microarray analysis:
– QC, filtering, statistical analysis, GO analysis.