limma linear model analysis of microarrays bayesian regularized t-test (baldi & long 2001) the...

• The t-statistic is widely used to assess differential expression.

• Unstable variance estimates that arise when the sample size is small can be corrected using:
– Error fudge factors (SAM)
– Bayesian methods (Limma)

Limma

Linear model analysis of microarrays

Bayesian regularized t-test (Baldi & Long 2001)

$$t = \frac{m_T - m_C}{\sqrt{\dfrac{\sigma_T^2}{n_T} + \dfrac{\sigma_C^2}{n_C}}}$$

The method tries to decouple the mean–variance dependency by modeling the variance of the expression of a gene as a function of the mean expression of the gene.

The empirical variance is modulated by $\nu_0$ 'pseudo-observations' associated with a background variance $\sigma_0^2$.
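A minimal Python sketch of how pseudo-observations stabilize a variance estimate. It assumes the estimator usually attributed to Baldi & Long (2001), sigma^2 = (nu0 * sigma0^2 + (n - 1) * s^2) / (nu0 + n - 2); this formula is not spelled out in the slides, and the function name and example values are hypothetical.

```python
from statistics import variance

def regularized_variance(values, nu0, sigma0_sq):
    """Shrink the empirical variance towards a background variance.

    nu0 is the number of pseudo-observations and sigma0_sq the background
    variance (assumed form of the Baldi & Long estimator).
    """
    n = len(values)
    s_sq = variance(values)  # empirical sample variance (n - 1 denominator)
    return (nu0 * sigma0_sq + (n - 1) * s_sq) / (nu0 + n - 2)

# Three replicates that happen to agree very closely (hypothetical log2 values).
replicates = [7.9, 8.0, 8.1]
raw = variance(replicates)
reg = regularized_variance(replicates, nu0=10, sigma0_sq=0.25)
print(f"empirical variance:   {raw:.4f}")
print(f"regularized variance: {reg:.4f}")
```

With only three replicates the empirical variance can be very small by chance; the regularized value is pulled towards the background variance, which keeps the t-statistic from being inflated.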


Bayesian regularized t-test

The main goal of this approach is to stabilize the variance estimates that arise when the sample size is small, making the t-test results more robust.

Bayesian regularized t-test

The regularized t-test makes the presence of significant differential expression more evident.

BH correction

• BH (Benjamini-Hochberg) is the most widely used method for correcting type I errors in microarray analysis.

• However, it has some limitations due to its initial hypotheses:
– The gene expressions are independent of each other.
– The raw distribution of p-values should be uniform in the non-significant range.
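As an illustration of what the BH correction does to a list of raw p-values, here is a small Python sketch of the standard Benjamini-Hochberg step-up adjustment. The function name and the example p-values are hypothetical; the GUI tools discussed here perform this step internally.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_p = bh_adjust(raw_p)
n_significant = sum(p <= 0.05 for p in adj_p)
print(adj_p, n_significant)
```

Note that two raw p-values below 0.05 (0.039 and 0.041) no longer pass the 0.05 threshold after adjustment.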

Applying the BH correction to these p-values will not produce any differentially expressed gene!

Let's identify differentially expressed probe sets by linear modelling

To use linear models, the targets description and the raw data are reorganized by Compute Linear Model Fit on the basis of the number of factors under analysis.

The next step is the definition of the contrasts, which represent the pairs of conditions to be compared for differential expression.

If more than two conditions are available, more contrasts can be evaluated.

The contrast parameterization is saved with a specific name.

REMEMBER: contrasts are comparisons between the experimental groups (e.g. Treated, Control). The contrast Treated – Control means that the log2(expression) of the control samples is subtracted from that of the treated samples. The result is the log2(fold change).
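A short Python sketch of the arithmetic behind a Treated – Control contrast. The intensities are hypothetical, and the group means here are simple averages of log2 values rather than a full linear model fit.

```python
from math import log2
from statistics import mean

def log2_fold_change(treated, control):
    """Mean log2 expression of treated minus mean log2 expression of control."""
    return mean([log2(x) for x in treated]) - mean([log2(x) for x in control])

# Hypothetical intensities for one probe set: treated is four times control.
lfc = log2_fold_change(treated=[400, 400, 400], control=[100, 100, 100])
print(lfc)
```

A positive value means higher expression in the treated samples; swapping the order of the contrast flips the sign.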

Before evaluating differential expression, the raw p-value distribution is checked.

If BH correction can be applied to correct type I errors, we can move on to the selection of the subset of differentially expressed genes.

These results can be saved in a new topTable containing only the probe sets shown in red on the plots.

TopTable structure

• AffyID
• Gene Symbol
• Gene Description
• Log2 FC
• Average intensity
• T statistics
• P-values
• Log-odds statistics

Exercise 10 (30 minutes)

• Go to the folder estrogen.IGF1.
• Create, with Excel, a tab-delimited file named targets.txt:
– The targets file is made of three columns with the following header:
  • Name
  • FileName
  • Target
– In column Name place a brief name (e.g. c1, c2, etc.)
– In column FileName place the name of the corresponding .CEL file
– In column Target place the experimental conditions (e.g. control, treatment, etc.)
• Create a targets file only for MCF7 and SKER3 with/without estrogen (E2) treatment.
• Calculate probe set summaries with RMA.
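If you prefer not to use Excel, the same three-column, tab-delimited layout can be produced with a few lines of Python. This is only a sketch: the sample names and .CEL file names below are hypothetical placeholders, not the actual files in estrogen.IGF1.

```python
import csv
import io

def build_targets(rows):
    """Return the text of a tab-delimited targets file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Name", "FileName", "Target"])  # header required by the exercise
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample names and .CEL file names.
rows = [
    ("c1", "mcf7_ctrl_1.CEL", "mcf7.ctrl"),
    ("e1", "mcf7_e2_1.CEL", "mcf7.e2"),
]
targets_text = build_targets(rows)
print(targets_text)
# To save it: open("targets.txt", "w").write(targets_text)
```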



• In this experiment we have a breast cancer tumor cell line (MCF7) and a tumor cell line derived from the central nervous system (SKER3).

• Question:
– Which probe sets are controlled by E2 in a tissue-independent manner?


Exercise 10

• Filter the data:
– IQR 0.25, intensity 25% >100

• Calculate the models for E2 versus untreated cells in both mcf7 and sker3.

• Contrasts:
– mcf7.e2 – mcf7.ctrl
– sker3.e2 – sker3.ctrl
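The filtering step can be pictured with the Python sketch below. It rests on two assumptions that the slide does not state explicitly: that "IQR 0.25" means the interquartile range of the log2 intensities must exceed 0.25, and that "intensity 25% >100" means at least 25% of the arrays must have an intensity above 100. The function name and data are hypothetical, and quartile conventions differ slightly between tools.

```python
from math import log2
from statistics import quantiles

def passes_filter(intensities, iqr_min=0.25, fraction=0.25, level=100):
    """Keep a probe set only if it varies enough and is expressed enough."""
    logs = [log2(x) for x in intensities]
    q1, _, q3 = quantiles(logs, n=4, method="inclusive")
    n_above = sum(x > level for x in intensities)
    enough_signal = n_above >= fraction * len(intensities)
    return (q3 - q1) > iqr_min and enough_signal

flat = [120, 121, 119, 120]      # expressed but not changing
changing = [90, 95, 400, 420]    # expressed and changing
low = [10, 12, 40, 45]           # changing but never above 100

for name, values in [("flat", flat), ("changing", changing), ("low", low)]:
    print(name, passes_filter(values))
```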



• Evaluate whether the raw p-value distributions are suitable for BH correction.

• Question:
– Is the raw p-value distribution suitable for performing BH correction?
  • YES / NO



• Use the "Table of Genes Ranked in order of Differential Expression".

• Plot differentially expressed genes with raw p-value ≤ 0.05 and an absolute log2 fold change ≥ 1 for the two contrasts.

• Save the subset of the topTables in ex10.mcf7.xls, ex10.sker3.xls

• Save the project as ex10.lma
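The selection applied in this step amounts to a simple row filter on the topTable. The Python sketch below assumes the thresholds are applied to the raw p-value and to the absolute log2 fold change; the column names follow limma's topTable output (logFC, P.Value) and the rows are invented for illustration.

```python
def select_de(rows, p_max=0.05, lfc_min=1.0):
    """Keep rows with raw p-value <= p_max and |log2 fold change| >= lfc_min."""
    return [
        row for row in rows
        if row["P.Value"] <= p_max and abs(row["logFC"]) >= lfc_min
    ]

# Hypothetical topTable rows.
top_table = [
    {"ID": "probe_a", "logFC": 1.8, "P.Value": 0.001},
    {"ID": "probe_b", "logFC": -1.2, "P.Value": 0.030},
    {"ID": "probe_c", "logFC": 0.4, "P.Value": 0.002},   # change too small
    {"ID": "probe_d", "logFC": 2.5, "P.Value": 0.200},   # not significant
]
selected_ids = [row["ID"] for row in select_de(top_table)]
print(selected_ids)
```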

A maximum of three files can be compared.
Attention: each file consists of a single column of probe set IDs, without a header.
Comparison can be performed at probe set or EG (Entrez Gene) level.

Differential expression probe set lists generated by affylmGUI or SAM can be compared using Venn diagrams.

The various list subsets will be saved in your working directory.


• Using "Venn Diagram between probe set lists", evaluate the level of overlap between the Entrez Genes differentially expressed upon E2 treatment in MCF7 and in SKER3.

• Filter the expression data by the genes in common between the two conditions and export the Normalized Expression Values (ex10.common.txt).
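The regions of a two-list Venn diagram are plain set operations. The Python sketch below reads two single-column ID lists without header and computes the three subsets; the identifiers are made up for illustration and the helper name is hypothetical.

```python
def read_ids(text):
    """Parse a single-column list of identifiers (no header) into a set."""
    return {line.strip() for line in text.splitlines() if line.strip()}

# Hypothetical Entrez Gene IDs for the two contrasts.
mcf7 = read_ids("1001\n1002\n1003\n")
sker3 = read_ids("1002\n1003\n1004\n")

common = mcf7 & sker3        # intersection of the two circles
only_mcf7 = mcf7 - sker3     # specific to MCF7
only_sker3 = sker3 - mcf7    # specific to SKER3
print(sorted(common), sorted(only_mcf7), sorted(only_sker3))
```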

Time Course experiments

• maSigPro is an R package for the analysis of single and multiseries time course microarray experiments.

• maSigPro follows a two-step regression strategy to find genes with
– significant temporal expression changes
– significant differences between experimental groups.

• Time course experimental design:
– We denote as experimental groups the experimental factor (dummy variables) for which temporal profiles are defined (e.g. "Treatment A", "Tissue1", etc.)
– Conditions are each experimental group vs. time combination (e.g. "Treatment A at Time 0"). Conditions may or may not have replicates.
– Variables are the regression variables defined by the maSigPro approach for the experiment regression model.
– maSigPro defines dummy variables to model differences between experimental groups.
– Dummy variables, Time and their interactions are the variables of the regression model.

Time Course design for maSigPro

All this information should be collapsed in the Target column of the targets file, using _ to combine the data. This can be done using the function JOIN in Excel.

IMPORTANT: each treatment at each time has its corresponding untreated control!

The targets file for maSigPro has a peculiar structure: each row of the column named Target describes the array on the basis of the experimental design.

Each element describing the time course experiment is separated from the others by an underscore.

The first three elements of the row are fixed and represent Time, Replicate and Control; all the other elements refer to the various experimental conditions.

In this case we have an 8, 24 and 48 h time course, in triplicate, with two different treatments: cond1 and cond2.
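To make the underscore-separated layout concrete, the Python sketch below splits one Target entry into named fields. The example string and the 0/1 coding of Control, cond1 and cond2 are assumptions made for illustration; check the actual targets file for the coding used in the course data.

```python
def parse_target(target, condition_names):
    """Split a Target entry into Time, Replicate, Control and condition fields."""
    names = ["Time", "Replicate", "Control"] + list(condition_names)
    fields = target.split("_")
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return dict(zip(names, (int(field) for field in fields)))

# Hypothetical entry: 8 h, replicate group 1, not a control, treated with cond1.
row = parse_target("8_1_0_1_0", ["cond1", "cond2"])
print(row)
```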

The Target column is reformatted to be used by maSigPro using the command

Large data set

• The oneChannelGUI interface has some limits (RAM memory) in loading/handling large sets of .CEL files.

• This is especially true for a large time course experiment like our example.

• To overcome this problem, probe set average expression intensities are calculated by Expression Console.

When loading a tab-delimited file, the Bioconductor annotation library is not automatically defined.

Annotation Library information can be attached using:

Do not forget!

• The multiple testing problem is also present in maSigPro analysis.

• Therefore, before running maSigPro, remember to filter the data on the basis of functional information or sample distribution.

Once the experimental design for maSigPro is ready, it is possible to run the analysis.

When maSigPro is running, check what is going on in the main R window!


Some parameters need to be set

Q: The first step is to compute a regression fit for each gene. The p-value associated with the F-statistic of the model is computed and subsequently used to select significant genes. maSigPro corrects this p-value for multiple comparisons by applying false discovery rate (FDR) procedures. The level of FDR control is given by the function parameter Q.

Alpha: maSigPro applies a variable selection procedure to find significant variables for each gene. This will ultimately be used to find the profile differences between experimental groups. At each regression step the p-value of each variable is computed, and variables enter or leave the model when this p-value is lower or higher than the given cut-off value alpha.

R-squared: The following step is to generate lists of significant genes according to the way we want to see the results. maSigPro uses the R-squared of the regression model as a filter.

What is the R-squared coefficient?

• r.squared: the "fraction of variance explained by a linear model"

$$R^2 = 1 - \frac{\sum_i R_i^2}{\sum_i (y_i - y^*)^2}$$

where $R_i$ is the i-th residual and $y^*$ is the mean of $y_i$ if there is an intercept and zero otherwise.
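The definition above translates directly into a few lines of Python. This is a sketch with hypothetical names and data; it also shows the two extreme cases, a perfect fit and a fit that is no better than the mean.

```python
def r_squared(y, fitted, intercept=True):
    """R^2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2), with R[i] the residuals."""
    y_star = sum(y) / len(y) if intercept else 0.0
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
    tss = sum((yi - y_star) ** 2 for yi in y)               # total sum of squares
    return 1 - rss / tss

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, y)               # residuals are all zero
mean_only = r_squared(y, [2.5] * 4)     # fitted values equal the mean of y
print(perfect, mean_only)
```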

R-squared graphical view

[Figure: panels plotting Y against X, marking Sum(R[i]^2) and Sum((y[i] - y*)^2).]

$$R^2 = 1 - \frac{\sum_i R_i^2}{\sum_i (y_i - y^*)^2}$$

• When $\sum_i R_i^2 = 0$: $R^2 = 1 - 0 / \sum_i (y_i - y^*)^2 = 1$
• When $\sum_i R_i^2 = \sum_i (y_i - y^*)^2$: $R^2 = 1 - \sum_i R_i^2 / \sum_i (y_i - y^*)^2 = 0$

Computation information is available in the main R window.

Step 1

The procedure first fits this global model by the least-squares technique to identify differentially expressed genes, and selects significant genes by applying false discovery rate control procedures.

Step 2

Secondly, stepwise regression is applied as a variable selection strategy to study differences between experimental groups and to find statistically significant different profiles.

When the computation is finished a message pops up

The coefficients obtained in the second regression model will be useful to cluster together significant genes with similar expression patterns and to visualize the results.

Results can be visualized as Venn diagrams or by plotting the curves in a PDF file. K-means clustering is not yet implemented.

The plots relate only to the subset of genes specific to each treatment condition.

Exercise 12 (30 minutes)

• This experiment was done with HGU133A.
– This is a cell line experiment made of three time points: 8, 24 and 48 h.
– Each point is made of three biological replicates.
– Two different chemotherapeutic agents have been used (Treatment 1 and 2).
– Since these data have not yet been published, the probe set IDs have been scrambled.
• In the time.course directory there are two files:
– An expression file derived from Expression Console
– A tab-delimited file describing the experimental conditions.
• Use this information to load the data, filter them by IQR (threshold of your choice), run maSigPro (e.g. Q=0.05, alpha=0.05, R=0.8) and view the results generated by maSigPro.

Analysis pipeline

• Quality control
• Normalization
• Filtering
• Statistical analysis
• Annotation
• Biological knowledge extraction

Annotation

• An important issue in microarray data analysis is the specific association of probe identifiers with genome-annotated transcripts.

• A critical point in annotation is the way in which the association between probes and genes is produced.

Annotation in Affymetrix

• NetAffx: Affymetrix annotation repository
• Bioconductor:
– uses a specific annotation library, AnnBuilder, to create annotation libraries starting from the association between probe set identifier and GenBank accession number (i.e. the primary target for probe design).
• RESOURCERER (Tsai et al. 2001):
– the annotation tool at the TIGR center uses EST and gene sequences stored in the TGI databases (www.tigr.org/tdb/tgi.shtml).
– They provide an analysis of publicly available EST and gene sequence data for the identification of transcripts and their placement in a genomic context, and the identification of orthologs and paralogs wherever possible.
• Neither Bioconductor nor TIGR methods operate at the probe level, nor do they consider the limited reliability of some sets due to probe cross-hybridization or erroneous probe/transcript annotation.
• Ensembl:
– Annotation with the Ensembl tool is built by direct matching of Affymetrix probes over the Ensembl sequence database.
– Its weak point is that matching only 50% of the probes of a specific set to an Ensembl gene is enough to define a true "probe set identifier"/"Ensembl gene identifier" association.

Gene Ontology

Ontologies

• An ontology is a specification of a conceptualization:
– a hierarchical mapping of concepts within a given frame of reference.

• An ontology is a restricted structured vocabulary of terms that represent domain knowledge.

• An ontology specifies a vocabulary that can be used to exchange queries and assertions.

• A commitment to the use of the ontology is an agreement to use the shared vocabulary in a consistent way.

The Gene Ontology

• The goal of the Gene Ontology (GO) Consortium is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.
– http://www.geneontology.org/

• For genes and gene products, the Gene Ontology (GO) Consortium is an initiative designed to address the problem of defining a common set of terms and descriptions for basic biological functions.

• GO provides a restricted vocabulary as well as clear indications of the relationships between terms.



The Gene Ontology

• The Gene Ontology (GO) consortium produces three independent ontologies for gene products.

• The three ontologies are:
– molecular function of a gene product, which is defined to be the biochemical activity or action of the gene product (MF 7220).
– biological process, interpreted as a biological objective to which the gene product contributes (BP 9529).
– cellular component, a component of a cell that is part of some larger object or structure (CC 1536).

The Graph Structure of GO

• The GO ontologies are structured as directed acyclic graphs (DAGs) that represent a network in which each term may be a child of one or more parents.

• GO node is interchangeable with GO term.
• Child terms are more specific than their parents:
– The term "transmembrane receptor protein-tyrosine kinase" is a child of
  • "transmembrane receptor" and "protein tyrosine kinase".
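A DAG in which a term can have more than one parent is easy to represent as a mapping from each term to its set of parents. The Python sketch below uses only the example terms named above; the dictionary layout and the helper function are hypothetical and are not how GO itself is distributed.

```python
# Each term maps to the set of its direct parents (example terms from the slide).
parents = {
    "transmembrane receptor protein-tyrosine kinase": {
        "transmembrane receptor",
        "protein tyrosine kinase",
    },
    "transmembrane receptor": set(),
    "protein tyrosine kinase": set(),
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upwards."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(ancestors("transmembrane receptor protein-tyrosine kinase", parents))
```

A gene annotated to a child term is implicitly annotated to all of its ancestors, which is why counts per GO term are usually computed over this upward closure.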

The Graph Structure of GO

• The relationship between a child and a parent can be characterized by the relations:
– is a
– has a (part of)

• "mitotic chromosome" is a child of "chromosome" and the relationship is an is a relation.

• "telomere" is a child of "chromosome" with the has a relation.

[Figure: graph of GO relationships for the term transcription factor (GO:0003700), with the top node marked.]

GO structure

[Figure: induced GO graph for a set of differentially expressed genes.]

GO can be used to link differentially expressed genes to specific functional classes.

[Figure: the induced GO graph colored according to unadjusted hypergeometric p-value (0.01), with the top node marked.]

Consider a population of genes representing a diverse set of GO terms, shown below as different colors.

Many methods can be used to identify a set of differentially expressed genes.

What are some of the predominant GO terms represented in the set of differentially expressed genes, and how should significance be assigned to a discovered GO term?

Example:
Population size: 40 genes
Subset of differentially expressed genes: 12 genes

10 genes, shown in light blue, have a common GO term, and 8 of them occur within the set of differentially expressed genes.

Contingency Matrix

A 2x2 contingency matrix is typically used to capture the relationship between membership in the differentially expressed subset and membership in a GO term.

                 Subset: in    Subset: out
GO term: in           8             2
GO term: out          4            26

Hypergeometric Distribution

Contingency matrix with cells a, b (first row) and c, d (second row); row totals a+b and c+d, column totals a+c and b+d, and n = a+b+c+d.

$$p = \frac{\binom{a+c}{a}\binom{b+d}{b}}{\binom{n}{a+b}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\;a!\;b!\;c!\;d!}$$

The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule.
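As a worked check, the hypergeometric rule for a 2x2 matrix with cells a, b, c, d can be evaluated in Python for the example counts used in these slides (8, 2, 4, 26). The function names are hypothetical, and summing over matrices at least as extreme as the observed one is one common way to turn the rule into a one-sided test; dedicated tools may differ in detail.

```python
from math import comb

def matrix_probability(a, b, c, d):
    """Probability of one 2x2 matrix with fixed margins (hypergeometric rule)."""
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)

def enrichment_p_value(a, b, c, d):
    """Probability of observing a or more genes in the top-left cell."""
    total = 0.0
    k_max = min(a + b, a + c)  # the top-left cell cannot exceed either margin
    for k in range(a, k_max + 1):
        shift = k - a
        total += matrix_probability(k, b - shift, c - shift, d + shift)
    return total

# Example from the slides: 40 genes, 12 in the subset, 10 with the GO term, 8 in both.
p_matrix = matrix_probability(8, 2, 4, 26)
p_tail = enrichment_p_value(8, 2, 4, 26)
print(f"{p_matrix:.6f} {p_tail:.6f}")
```

Both values round to 0.0002, in line with the significance reported for this example.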

Assigning Significance to the Findings

The hypergeometric test permits us to determine whether there are non-random associations between the two variables: membership in the differentially expressed subset and membership in a particular Gene Ontology term.

( 2x2 contingency matrix )

                 Subset: in    Subset: out
GO term: in           8             2
GO term: out          4            26

p ≈ 0.0002

EASE (Expression Analysis Systematic Explorer)

• EASE analysis identifies prevalent biological themes within gene clusters.

• The highest-ranking themes derived by a computational method can recapitulate manually derived themes in previously published microarray, proteomics and SAGE results, and provide evidence that these themes are stable to varying methods of gene selection.

Hosack et al. Genome Biol., 4:R70-R70.8, 2003.

• Consider all of the results

EASE reports all themes represented in a cluster. Although some themes may not reach statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.

• Independently verify roles

Once found, biological themes should be independently verified using annotation resources.

EASE Results

GOstats package

• To perform an analysis using the hypergeometric-based test, one needs to define a gene universe and a list of selected genes from the universe.

• To identify the set of expressed genes from a microarray experiment, R. Gentleman (GOstats developer) proposed that a non-specific filter be applied and that the genes that pass the filter be used to form the universe for any subsequent functional analyses.

Bioconductor provides a library called GOstats, which allows the calculation of enriched GO classes within a set of differentially expressed probe sets.

Select the threshold of significance and the GO class of interest.

Select the list of affyIDs representing the differentially expressed probe sets.
REMEMBER: the file should contain only the affy IDs!

If the names of the GO classes are too small in the plot, save it as a PDF and view it with Acrobat Reader, zooming in on the figure.

The reason for this representation is the selection of the GO terms that contain smaller subsets.

Columns of the results table: GO identifier; description of GO term; significance; N. of genes belonging to the GO term in the universe; N. of genes in the differentially expressed set.

To know more about the parents of a specific GO term you can use the plotGO function.

It is possible to identify the affy IDs associated with a specific GO term.


• Using the GOenrichment function, check whether there is any overlap between the BP GO classes found enriched (p-value 0.01) using the sets of probe sets found differentially expressed upon E2 treatment in MCF7 or SKER3.

• Question:
– Which BP or MF GO terms are in common between the two sets of differentially expressed probe sets?

• Using plotGO, see which are the parents of the GO term(s) in common between the probe sets differentially expressed in MCF7 and those in SKER3 upon E2 treatment.

• Using the extractAffyids function, check the number of probe sets derived by limma differential expression that are also present in the common GO terms.

• Question:
– Are the probe sets belonging to the common GO terms the same in the two differential expression analyses?

Clustering

Is there an ideal clustering procedure?

• No!
– Each clustering algorithm has its ideal data structure.

• Since we do not know the structure of the data:

• Various clustering methods have to be applied in order to identify the one that best fits the data under analysis.

N.B. TMEV 4.0 (www.tigr.org) was used for this presentation.

Supervised versus unsupervised clustering

• Supervised clustering tries to find the best partition for data that belong to a known set of classes.

• Unsupervised clustering tries to define the number and the size of the classes into which the transcription profiles can be fitted.

The Expression Matrix is a representation of data from multiple microarray experiments.

The matrix has N rows (probe sets) and D columns (experiments):

X11 X12 X13 … X1d (L)
X21 X22 X23 … X2d (L)
…
Xn1 Xn2 Xn3 … Xnd (L)

Each element is a log ratio.

Up modulation (+) is usually represented as RED and down modulation (-) as GREEN.

Large data sets can be loaded as tab delimited files.

To load them you need:
1) a tab delimited file with array names on the first row and probe set IDs in the first column;
2) a target file containing the clinical information. The usual Target column of the target file should have these characteristics.

This file can be generated by joining the columns of the clinical parameters with an underscore "_".

Join function in Excel
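Outside Excel, the same underscore join can be done with a one-line helper. This is only a sketch: the function name `make_target_label` and the example clinical values are hypothetical.

```python
def make_target_label(*clinical_values):
    """Join clinical parameter values into a single Target label using underscores."""
    return "_".join(str(value).strip() for value in clinical_values)


# Hypothetical clinical parameters for one sample.
label = make_target_label("Pos", "Neg", "Grade2")
print(label)
```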

Loading data as tab delimited file

Select tab delimited files as the format description

Export expression data as tab delimited files

Select the first numerical value and load the data

Expression Vectors

• Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.

[Figure: an example expression vector with values -0.8, 0.8, 1.5, 1.8, 0.5, -1.3, -0.4, 1.5 plotted over conditions 1 to 8; y-axis log2(time_t/time_0), from -2 to 2.]

Data reformatting

• Clustering can be performed using a virtual array as reference:
– A virtual array can be calculated by averaging gene expression over the experimental conditions.

• Clustering can be performed by building virtual two-dye experiments:

$\log_2\left(\frac{T}{C}\right)$ or $\log_2\left(\frac{T_i}{C_j}\right)$, where i=1…I, j=1…J

• Clustering can also be performed without the use of a common reference by:
– Gene centering: $Z_i = \frac{X_i - \mu_{row}}{\sigma_{row}}$
– Experiment centering: $Z_i = \frac{X_i - \mu_{col}}{\sigma_{col}}$

Data reformatting

Gene centering: $Z_i = \frac{X_i - \mu_{row}}{\sigma_{row}}$

Array centering: $Z_i = \frac{X_i - \mu_{col}}{\sigma_{col}}$

Centering at the gene level removes the scaling differences!

Various data reformatting options are available.

We will mainly use gene/row adjustment.
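A gene/row adjustment of this kind can be sketched as follows. The function names `center_rows` and `center_columns` and the toy matrix are hypothetical; the optional division by the row standard deviation is an assumption about what "removing the scaling differences" involves, not a description of the TMEV implementation.

```python
from statistics import mean, stdev


def center_rows(matrix, scale=False):
    """Subtract each row (gene) mean; optionally divide by the row standard deviation."""
    result = []
    for row in matrix:
        row_mean = mean(row)
        row_sd = stdev(row) if scale else 1.0
        if row_sd == 0:
            row_sd = 1.0  # constant row: centered values stay at zero
        result.append([(x - row_mean) / row_sd for x in row])
    return result


def center_columns(matrix, scale=False):
    """Same adjustment applied to columns (arrays) by transposing twice."""
    transposed = [list(col) for col in zip(*matrix)]
    centered = center_rows(transposed, scale)
    return [list(row) for row in zip(*centered)]


# Two hypothetical genes with the same shape but a tenfold difference in scale.
expression = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
centered = center_rows(expression)
scaled = center_rows(expression, scale=True)
print(centered)
print(scaled)
```

With mean centering alone the two genes still differ in amplitude; once each row is also divided by its standard deviation, the two profiles coincide.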

Distance and Similarity

• The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.

• Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.

• Selection of a distance metric defines the concept of distance.

x = (5,5)

y = (9,8)

Euclidean distance: $d(x,y) = \sqrt{4^2 + 3^2} = 5$

Manhattan distance: $d(x,y) = 4 + 3 = 7$
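The two distances in the example, plus a correlation-based one, can be checked with a short sketch. The function names are hypothetical, and defining the Pearson distance as 1 minus the correlation coefficient is one common convention assumed here; a given tool may define it differently.

```python
from math import sqrt


def euclidean(x, y):
    """Straight-line distance between two vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson_distance(x, y):
    """1 - Pearson correlation (assumed convention): 0 for identical shapes, 2 for opposite ones."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


x, y = (5, 5), (9, 8)
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7
```

Euclidean and Manhattan distances respond to differences in magnitude, whereas the correlation-based distance depends only on the shape of the profiles.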

Distance is Defined by a Metric

[Figure: two pairs of expression profiles plotted as log2(time_t/time_0), from -2 to 2, with the distance D reported per metric. Euclidean: 4.2 and 1.4; Pearson: 1.00 and 0.90.]

Many distance metrics are available. If no selection is made, the default selection for each type of clustering approach is used.

Hierarchical Clustering (HCL)

• HCL is an agglomerative/divisive clustering method.

• The iterative process continues until all groups are connected in a hierarchical tree.

Hierarchical Clustering (agglomerative)

Starting from genes g1 to g8:

g1 is most like g8

g4 is most like {g1, g8}

g5 is most like g7

{g5, g7} is most like {g1, g4, g8}

Hierarchical Tree
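The merge sequence above can be reproduced with a naive agglomerative loop. This is a sketch under simplifying assumptions: each gene is reduced to a single hypothetical value, distance is the absolute difference, and clusters are compared by average distance. The function name `agglomerate` and the toy values are not from the slides.

```python
from itertools import combinations


def agglomerate(values):
    """Naive agglomerative clustering; returns the list of merges in order.

    values -- dict mapping a gene label to a single (hypothetical) expression value
    """
    clusters = [(label,) for label in values]

    def average_distance(a, b):
        return sum(abs(values[i] - values[j]) for i in a for j in b) / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Join the two clusters that are currently closest to each other.
        a, b = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges


# Toy values chosen so that the merge order mirrors the slide.
toy = {"g1": 1.0, "g4": 1.5, "g5": 5.0, "g7": 6.0, "g8": 1.1}
for left, right in agglomerate(toy):
    print(left, "+", right)
```

Each pass joins the closest pair and stops when a single group remains, which is the hierarchical tree.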

Hierarchical Clustering

• During construction of the hierarchy, decisions must be made to determine which clusters should be joined.

• The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

Agglomerative Linkage Methods

• Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.

• Three linkage methods that are commonly used are:
– Single Linkage
– Average Linkage
– Complete Linkage

Single Linkage

• Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster.

• Single linkage tends to create 'elongated' clusters with individual genes chained onto clusters.

Average Linkage

• Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster.

• Average linkage has a slight tendency to produce clusters of similar variance.

Complete Linkage

• Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster.

• Complete linkage tends to create clusters of similar size and variability.
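The three linkage rules differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, with a hypothetical function name and one-dimensional toy clusters:

```python
def linkage_distance(cluster_a, cluster_b, dist, method="average"):
    """Cluster-to-cluster distance under single, average, or complete linkage."""
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":
        return min(pairwise)          # closest pair of members
    if method == "complete":
        return max(pairwise)          # farthest pair of members
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


def absolute_difference(a, b):
    return abs(a - b)


# Hypothetical clusters on a line; the pairwise distances are 3, 5, 2 and 4.
A, B = [0.0, 1.0], [3.0, 5.0]
for method in ("single", "average", "complete"):
    print(method, linkage_distance(A, B, absolute_difference, method))
```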

HCL

• A clustering result can be represented by many different graphical views.

[Figure: the same tree over items 1 to 4 drawn in different graphical views.]

HCL

• HCL does not converge to a unique result, and each run represents one of the possible solutions.

• To obtain information on cluster stability, a resampling method should be applied:
– Bootstrapping: resampling with replacement
– Jackknifing: resampling without replacement

To perform HCL, click on the HCL icon

To see results click on

Visualization can be reformatted

Bootstrapping (ST)

Bootstrapping: resampling with replacement

Original expression matrix: Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns).

Various bootstrapped matrices (by experiments): experiments are drawn with replacement, so an experiment can appear more than once (for example Exp 4 twice) while others are left out.

Jackknifing (ST)

Jackknifing: resampling without replacement

Original expression matrix: Gene 1 to Gene 6 (rows) by Exp 1 to Exp 6 (columns).

Various jackknifed matrices (by experiments): one experiment is left out at a time, for example Exp 1, Exp 3, Exp 4, Exp 5, Exp 6 (Exp 2 omitted) or Exp 1, Exp 2, Exp 3, Exp 4, Exp 6 (Exp 5 omitted).
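Resampling by experiments can be sketched as below. The function names and the toy matrix are hypothetical, and the bootstrap is assumed to draw as many experiments as the original matrix has, which is the usual convention rather than something the slides state.

```python
import random


def bootstrap_columns(matrix, rng):
    """Draw experiments (columns) with replacement; some repeat, others drop out."""
    n_experiments = len(matrix[0])
    picks = [rng.randrange(n_experiments) for _ in range(n_experiments)]
    return picks, [[row[j] for j in picks] for row in matrix]


def jackknife_columns(matrix, left_out):
    """Leave one experiment (column) out; every other column appears exactly once."""
    keep = [j for j in range(len(matrix[0])) if j != left_out]
    return keep, [[row[j] for j in keep] for row in matrix]


# Hypothetical 6 genes x 6 experiments matrix; entry = 10 * gene index + experiment index.
matrix = [[10 * g + e for e in range(6)] for g in range(6)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
picks, boot = bootstrap_columns(matrix, rng)
keep, jack = jackknife_columns(matrix, left_out=1)  # omit the second experiment
print("bootstrap columns:", picks)
print("jackknife columns:", keep)
```

Clustering each resampled matrix and counting how often a node reappears gives the support for that node.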

1000 bootstraps, Euclidean distance, average linkage clustering

To run HCL with resampling

To see results click on

A subset of genes can be selected by clicking on the node of interest.

By placing the mouse over the node and clicking the right mouse button, various information about the group of genes can be saved.

Principal component analysis

• Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.

• The first principal component accounts for as much of the variability in the data as possible.

• Each succeeding component accounts for as much of the remaining variability as possible.

• The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

In many data sets the first three components account for most of the variability. Therefore, PCA can be reasonably represented in a 3D space.
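The idea that the first component captures as much variability as possible can be illustrated with a small pure-Python sketch. It uses power iteration on the covariance matrix, which is one simple way to find the leading component and is not necessarily what TMEV does; the function name and the toy data are hypothetical.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (direction, fraction of total variance) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the variables (columns).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)] for a in range(d)]
    v = [1.0] * d  # starting vector; assumed not orthogonal to the leading component
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance_along_v = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, variance_along_v / total_variance


# Two perfectly correlated hypothetical variables: the second is twice the first.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, explained = first_principal_component(data)
print(direction, explained)
```

For perfectly correlated variables a single component explains all of the variance, which is the extreme case of the dimensionality reduction described above.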

The 2nd PC will be orthogonal to the 1st.

[Figure: samples grouped into Cluster 1 and Cluster 2; labels MCF7 and SKER-3, each with E2 and IGF.]

The parameters that produce the partition in PCA depend on the correlated variables.

Quaglino et al. J. Clin. Invest. 2004

We have already used PCA for quality control.

Results clicking on

Click on right mouse button over 3D/2D view

Cluster Affinity Search Technique (CAST)

• CAST uses an iterative approach to segregate elements with 'high affinity' into a cluster.

• The process iterates through two phases:
– addition of high affinity elements to the cluster being created
– removal or clean-up of low affinity elements from the cluster being created

Cluster Affinity Search Technique (CAST) - 1

Affinity = a measure of similarity between a gene and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.

1. Create a new empty cluster C1.

2. Set the initial affinity of all genes to zero.

3. Move the two most similar genes into the new cluster.

4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).

ADD GENES:

5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.

CAST - 2

REMOVE GENES:

6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene's affinity its similarity to the removed gene.

7. Repeat step 6 while C1 contains a low-affinity gene.

8. Repeat steps 5-7 as long as changes occur to the cluster C1.

9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.

10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.

[Figure: genes G1 to G15 shown either as unassigned genes or as members of the current cluster C1.]
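Steps 1 to 10 can be sketched as follows. This is an illustrative reading of the procedure, not the TMEV code: similarities are assumed to lie in [0, 1] with self-similarity 1, and "percentage of the maximum affinity at that point" is interpreted as threshold times the current cluster size. The function name `cast`, the `max_rounds` safeguard and the toy similarity matrix are hypothetical.

```python
from itertools import combinations


def cast(sim, threshold, max_rounds=1000):
    """Cluster genes 0..n-1 from a symmetric similarity matrix `sim`."""
    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        cluster = set()                                   # step 1
        affinity = {g: 0.0 for g in unassigned}           # step 2
        if len(unassigned) == 1:
            seed = list(unassigned)
        else:                                             # step 3: two most similar genes
            seed = list(max(combinations(sorted(unassigned), 2),
                            key=lambda pair: sim[pair[0]][pair[1]]))
        for g in seed:
            cluster.add(g)
            for h in affinity:                            # step 4
                affinity[h] += sim[h][g]
        for _ in range(max_rounds):                       # step 8
            changed = False
            while True:                                   # step 5: ADD GENES
                outside = sorted(unassigned - cluster)
                if not outside:
                    break
                best = max(outside, key=affinity.get)
                if affinity[best] < threshold * len(cluster):
                    break
                cluster.add(best)
                for h in affinity:
                    affinity[h] += sim[h][best]
                changed = True
            while len(cluster) > 1:                       # steps 6-7: REMOVE GENES
                worst = min(sorted(cluster), key=affinity.get)
                if affinity[worst] >= threshold * len(cluster):
                    break
                cluster.remove(worst)
                for h in affinity:
                    affinity[h] -= sim[h][worst]
                changed = True
            if not changed:
                break
        clusters.append(sorted(cluster))                  # steps 9-10
        unassigned -= cluster
    return clusters


def toy_similarity(i, j):
    """Hypothetical similarities: genes 0-2 form one group, genes 3-4 another."""
    if i == j:
        return 1.0
    return 0.9 if (i < 3) == (j < 3) else 0.1


sim = [[toy_similarity(i, j) for j in range(5)] for i in range(5)]
print(cast(sim, threshold=0.5))
```

Unlike HCL or SOM, the number of clusters is not fixed in advance: it follows from the threshold.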

Click on

Parameter to be set

SOMs

Self-organizing maps (SOMs) - 1

1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal.

[Figure: genes G1 to G29 scattered in expression space, with nodes N1 to N6 arranged on a grid. N = nodes, G = genes.]

SOMs - 2

2. Choose a random gene, e.g., G9.

3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The further away the node is from N2, the less it is moved.

[Figure: genes G1 to G29 with nodes N1 to N6 shifted towards G9.]

SOM Neighborhood Options

Bubble neighborhood: some nodes move (those within the neighborhood radius); alpha is constant.

Gaussian neighborhood: all nodes move; alpha is scaled.

SOMs - 3

4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.

5. Finally, each node will "nestle" among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.

[Figure: genes G1 to G29 with nodes N1 to N6 settled among groups of genes.]
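The five SOM steps can be sketched in pure Python. This is a toy version under stated assumptions: a rectangular grid, a Gaussian neighborhood, and a learning rate and radius that shrink linearly; none of these defaults come from TMEV, and the function names, parameters and data are hypothetical.

```python
import math
import random


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_som(genes, grid_w, grid_h, iterations=2000, alpha0=0.5, radius0=1.5, seed=0):
    """Train a grid_w x grid_h map on expression vectors; returns the node vectors."""
    rng = random.Random(seed)
    coords = [(x, y) for y in range(grid_h) for x in range(grid_w)]    # step 1: node geometry
    nodes = [list(rng.choice(genes)) for _ in coords]
    for t in range(iterations):                                        # step 4: repeat many times
        remaining = 1.0 - t / iterations
        alpha = alpha0 * remaining                                     # allowed movement decreases
        radius = max(radius0 * remaining, 1e-3)
        gene = rng.choice(genes)                                       # step 2: random gene
        winner = min(range(len(nodes)), key=lambda k: squared_distance(nodes[k], gene))
        wx, wy = coords[winner]
        for k, (x, y) in enumerate(coords):                            # step 3: move the nodes
            grid_d2 = (x - wx) ** 2 + (y - wy) ** 2
            influence = math.exp(-grid_d2 / (2 * radius ** 2))         # Gaussian neighborhood
            nodes[k] = [n + alpha * influence * (g - n) for n, g in zip(nodes[k], gene)]
    return nodes


def assign(genes, nodes):
    """Step 5: each gene belongs to the cluster of its nearest node."""
    return [min(range(len(nodes)), key=lambda k: squared_distance(nodes[k], g)) for g in genes]


# Two hypothetical, well separated groups of expression vectors.
genes = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (10.0, 10.0), (9.8, 10.1), (10.2, 9.9)]
nodes = train_som(genes, grid_w=2, grid_h=1)
print(assign(genes, nodes))
```

A bubble neighborhood would instead set the influence to 1 inside the radius and 0 outside it.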

Click on

Exercise 14

• This exercise is based on the breast cancer data set published by Chin in Cancer Cell 2006 (hgu133A HT platform).

• Using the clinical data (E-TABM-158-clinical.data.txt) available in the large.data.set dir:
– Construct a target file, like the one used in the time course.
– Load the data in E-TABM-158-processed-data.txt using the created target file.
– Filter the data by IQR 0.5, and 25% of samples should have a signal intensity over 100.
– Save the project as ex14.lma

Exercise 14

– Filter the data on the basis of a list of EGs derived from Ingenuity related to cell signaling (use the advanced search at Ingenuity).
– Export the data as tab delimited files.

• After row mean centering perform:
– Hierarchical clustering, and select those gene clusters that group samples in two main groups. Label those groups.
– Apply CAST or SOM and see how the HCL groups of genes are reorganized.
– Subset and save the clusters you have identified.
– Combine them in Excel.
– Load them in TMEV and see how PCA divides the samples.

Classification

• The task of diagnosing cancer on the basis of microarray data has been termed class prediction in the literature.

• The task is to classify and predict the diagnostic category of a sample on the basis of its gene expression profile.

The example of a classification problem used in the PAM publication

• Data for small round blue cell tumors (SRBCT) of childhood (Khan et al. 2001), consisting of expression measurements on 2,308 genes, were obtained from glass-slide cDNA microarrays.

• The tumors are classified as:
– Burkitt lymphoma (BL),
– Ewing sarcoma (EWS),
– neuroblastoma (NB),
– rhabdomyosarcoma (RMS).

• A total of 63 training samples and 25 test samples were provided, although five of the latter were not SRBCTs.

PAM

• PAM is a modification of the nearest-centroid method, called "nearest shrunken centroid."

• PAM uses "de-noised" versions of the centroids as prototypes for each class.

Centroids (grey) and shrunken centroids (red) for the SRBCT dataset. The overall centroid has been subtracted from the centroid of each class.

• SRBCT classification: training (tr, green), cross-validation (cv, red), and test (te, blue) errors are shown as a function of the threshold parameter Δ.

• The value Δ = 4.34 is chosen and yields a subset of 43 selected genes.

• Shrunken differences $d'_{ik}$ for the 43 genes having at least one nonzero difference.

• The genes with nonzero components in each class are almost mutually exclusive.
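The shrinkage behind the "de-noised" centroids is soft thresholding: each standardized difference between a class centroid and the overall centroid is moved towards zero by the threshold and set to zero if it does not exceed it. A minimal sketch, with hypothetical function names and made-up difference values (only the threshold 4.34 comes from the text):

```python
def soft_threshold(d, delta):
    """Shrink d towards zero by delta; values within +/- delta become exactly 0."""
    shrunk = abs(d) - delta
    if shrunk <= 0:
        return 0.0
    return shrunk if d > 0 else -shrunk


def selected_genes(differences, delta):
    """Genes that keep at least one nonzero shrunken difference across the classes."""
    return [
        gene
        for gene, per_class in differences.items()
        if any(soft_threshold(d, delta) != 0.0 for d in per_class)
    ]


# Made-up standardized differences for three genes in two classes.
toy = {"geneA": [5.0, -1.0], "geneB": [2.0, -3.0], "geneC": [-6.0, 0.5]}
kept = selected_genes(toy, delta=4.34)
print(kept)  # geneB is shrunk to zero in every class and drops out
```

Raising the threshold removes more genes from the classifier, which is why the error curves are drawn as a function of it.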

PAM performance

• Misclassification rates for seven classifiers on six microarray datasets based on 50 random partitions into learning sets (two-thirds of the data) and test sets (one-third of the data).

• The nearest shrunken centroid classifier (PAM), as well as the simple benchmarks NNR and DLDA, do surprisingly well and can almost keep up, except on the prostate data (the largest dataset in the analysis).

• The success of such methodologically simple tools is limited to gene expression datasets with small sample size.

Reorganize clinical information

Load a large data set as a tab delimited file. Save in a file the description of the clinical parameters collapsed in the Target column of the targets file.

Run PAMR analysis

If the selected probe sets are less than 50

Nice separation between ER positive and negative samples can be achieved also on the test set

• Load ex14.lma.

• Attach the clinical parameters description.

• Divide the data into training and test sets on the basis of one of the non-continuous parameters (e.g. Yes/No; Pos/Neg).

• Use PAMR to define the minimal subset of genes, if any, discriminating between the two groups.

Revision exercise

• Use the data set HuGene, made of:
– Three breast samples:
  • TisMap_Breast_01_v1_WTGene1.CEL
  • TisMap_Breast_02_v1_WTGene1.CEL
  • TisMap_Breast_03_v1_WTGene1.CEL
– Three brain samples:
  • TisMap_Brain_01_v1_WTGene1.CEL
  • TisMap_Brain_02_v1_WTGene1.CEL
  • TisMap_Brain_03_v1_WTGene1.CEL

• Perform all the steps of a microarray analysis:
– QC, filtering, statistical analysis, GO analysis.
