Microarray data analysis with Chipster
16.-17.4.2008

Jarno Tuimala, Eija Korpelainen

Program – an analysis workflow

Day 1
• Basic functionality of Chipster (Eija)
• Data import (Eija)
• Quality control (Jarno)
• Normalization (Jarno)
  • Describing the experiment
• Filtering and missing value considerations (Jarno)

Day 2
• Statistical testing (Jarno)
• Clustering and visualization (Jarno)
• Annotation (Eija)
• Promoter analysis (Eija)
• Experimental design (Jarno) – if time allows

Demo data

Affymetrix
• Kidney cancer
• 8 controls, 9 cancer patients

Agilent
• Acute leukemia
• 7 controls, 7 FLT mutated

Illumina
• Teratozoospermia
• 5 controls, 8 affected

Introduction to microarrays

Research using microarrays

Plan!
• Experimental design

Laboratory work
• Extract, label, hybridize

Computer work
• Scanning, image analysis
• Bioinformatics

Laboratory work
• Confirmation

Publish
• Submit data to public databases

Introduction to Chipster

Chipster

Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project

Features
• Easy to use graphical user interface
• Comprehensive selection of tools
• Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
• Compatible with Windows, Linux and Mac OS X
• Easy to install and update
• Wizards and workflows
• Interactive graphics
• Transparency (as opposed to “black box”)
• Alternative annotations for Affymetrix arrays
• Automatic tracking of performed analyses

http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf
http://chipster.csc.fi

How does it work?

[Figure: architecture diagram. Labels: internet, front end, SSL, SOAP, international Web Services, analysis, visualisation, CSC, desktop client (Java Web Start installs and updates the client automatically), Corona/Murska, analyser, security.]

Acknowledgements

• Aleksi Kallio
• Jarno Tuimala
• Taavi Hupponen
• Mika Rissanen, Janne Käki, Mikko Koski, Petri Klemelä
• All the pilot users
• Department of computer science (HY)
• Dario Greco (HY)
• Prof. Olli Yli-Harja’s group (TUT)
• GeneCruiser team (MIT Broad Institute)
• Tekes/SA SYSBIO-program

[Figure: screenshot with panels labelled Data, Tools, and Visualization.]

Phenodata – describing your experiment

The phenodata file is created during normalization.

Fill in the group column with numbers describing your experimental setup
• e.g. 1 = healthy control, 2 = cancer sample
• necessary for the statistical tests to work

If you bring in previously created normalized data and phenodata:
• Choose ”import directly” in the import tool
• Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”

If you brought in normalized data and need to create phenodata for it:
• Utilities/ Generate phenodata (fill in the chiptype parameter!)
• Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”
• Fill in the group column

Visualizing the data

Data visualization panel
• Maximize and redraw for better viewing

Two types of visualizations
1. Interactive visualizations produced by the client program
  • Select the visualization method from the pulldown menu of the data visualization panel
  • Save by right clicking on the image
2. Static images produced by R/Bioconductor, Weeder, etc.
  • Select from Analysis tools/ Visualisation
  • View by double clicking on the image file
  • Save by right clicking on the file name and choosing ”Export”

Interactive visualizations by the client
• Spreadsheet
• Histogram
• Scatterplot
• 3D scatterplot
• Expression profiles
• Clustered profiles
• Hierarchical clustering
• SOM clustering
• Array pseudo-image

Available actions:
• Change titles, colors etc.
• Zoom in/out
• Select and annotate genes using the MIT GeneCruiser

Static images produced by R/Bioconductor

• Volcano plot
• Box plot
• Histogram
• Heatmap
• Venn diagram
• Idiogram
• Chromosomal position
• Correlogram
• Dendrogram
• QC stats plot
• RNA degradation plot
• K-means clustering
• SOM-clustering

Automatic tracking of analysis history

Running many analyses simultaneously

You can have at most 5 analysis jobs running at the same time.

Use Task manager to
• view parameters, status,…
• cancel jobs

Workspace – continue later/elsewhere

Saving your workspace allows you to continue later
• File/ Save workflow
• File/ Load workflow

Currently it is possible to have only one workspace saved at a time.

If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location
• C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot

Workflow – reusing your analysis pipeline

Creates a ”macro” that can be applied to another normalized dataset and phenodata.

Choose a dataset, and the workflow records the analysis steps that lead to that dataset.

You can give the workflow a meaningful name (ending in .bsh), but it has to be located in the chipster-scripts folder under nami-work-files.

You can run a workflow on another computer by making it visible to Chipster with ”Reload workflows from disk”.

You can change parameters directly in the workflow file.

Wizard – autopilot for analysis

Wizard for Affymetrix data

Ready-made workflow to find differentially expressed genes
• Normalization
• Phenodata creation
• Statistical test
• Hierarchical clustering

Importing files

Affymetrix CEL-files are imported to Chipster automatically

Other files are imported using the Import tool

Import tool, step 1

Define
• Header
• Footer
• Title row
• Delimiter

Import tool, step 2

• Define columns
• Modify flags

Importing Agilent files

• Sample (rMeanSignal)
• Sample background (rBGMedianSignal)
• Control (gMeanSignal)
• Control background (gBGMedianSignal)
• Identifier (ProbeUID)
• Annotation (ControlType)

https://extras.csc.fi/biosciences/chipster-manual/data-formats.html

Exercise

Exercise I

1. Import the demo data of your favorite type into Chipster
  • Affymetrix
  • Agilent
2. Save the workspace
3. Have lunch (back at 13.00)

Quality control

Quality control tools

Quality control
• Affymetrix basic
  • RNA degradation + Affy QC
• Affymetrix RLE & NUSE (might take a long time to run)
  • Fits a model to expression values
• Agilent
  • MA-plot + density plot + boxplot

Visualization – dendrogram

Statistics – NMDS

Affymetrix I

Quality control tools are run on raw data (CEL files).
• Dendrogram and NMDS are run on normalized data

Affymetrix II

Agilent

General QC – dendrogram and NMDS

Scatterplots

Heatmaps (this took an hour to calculate)

QC-tools in Chipster

Quality control
• Affymetrix basic
• Affymetrix RLE and NUSE
• Agilent 2-color

Visualization
• Dendrogram
• Heatmap
• Correlogram

Statistics
• NMDS

Normalization

What is normalization?

Normalization is the process of removing systematic variation from the data.

Typically you would normalize your data so that all the chips become comparable.

Methods

Affymetrix
• Background correction + expression estimation + summarization
• RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing

Agilent
• Background correction + averaging duplicate spots + normalization

After normalization the expression values are always expressed on log2-scale.
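The quantile normalization step mentioned for RMA can be sketched as follows. This is a minimal illustration in plain Python, not the Bioconductor code that Chipster runs; the function name `quantile_normalize` and the toy intensities are assumptions made for the example, and ties are simply broken by position.

```python
import math

def quantile_normalize(chips):
    """Force every chip (list of intensities) to share the same distribution.

    chips: list of equal-length lists, one list per chip.
    """
    n_probes = len(chips[0])
    # Sort each chip, then average across chips at every rank.
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [
        sum(chip[r] for chip in sorted_chips) / len(chips)
        for r in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Rank of each probe within its own chip (ties broken by position).
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

# Toy raw intensities for three chips and four probes (made-up numbers).
raw = [
    [100.0, 400.0, 200.0, 800.0],
    [150.0, 300.0, 600.0, 1200.0],
    [50.0, 200.0, 100.0, 400.0],
]
normalized = quantile_normalize(raw)
# Expression values are reported on the log2 scale after normalization.
log2_values = [[math.log2(v) for v in chip] for chip in normalized]
```

After this step every chip contains exactly the same set of values; only their assignment to probes differs from chip to chip.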

Affymetrix

Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• MAS5 is the older Affymetrix method, Plier is a newer one
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but also takes GC% content into account
• Li-Wong is the method implemented in dChip

Variance stabilization makes the variance over all the chips similar
• Works only with MAS5 and Plier, since all others output log2-transformed data by default (and thus corrected for the same phenomenon)

Custom chiptype
• If you want to use reannotated probes (they are really assigned to the genes where they belong), select one from this menu.

Agilent I

Background correction
• Background treatment
  • None, Subtract, Edwards, Normexp
• Background offset
  • 0 or 50

Normalize chips
• None, median, loess

Normalize genes (not typically used)
• None, scale (to median), quantile

Chiptype
• A must setting!

Agilent II

Background treatment typically generates many negative values that are coded as missing values after log2-transformation.
• The usual subtract option does this
• Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported)

Loess removes curvature from the data (suggested)
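Why plain subtraction produces missing values can be seen in a small sketch. The intensities below are made up, and the function name is hypothetical; the normexp method itself is a model-based correction that is not reproduced here.

```python
import math

def log2_after_subtraction(foreground, background):
    """Subtract background spot by spot; return log2 values, None if undefined."""
    result = []
    for fg, bg in zip(foreground, background):
        corrected = fg - bg
        # log2 is undefined for zero or negative intensities,
        # so such spots become missing values.
        result.append(math.log2(corrected) if corrected > 0 else None)
    return result

# Made-up foreground and background intensities for five spots.
foreground = [1200.0, 85.0, 300.0, 64.0, 40.0]
background = [176.0, 90.0, 44.0, 64.0, 55.0]
values = log2_after_subtraction(foreground, background)
n_missing = sum(v is None for v in values)
```

In this toy case three of the five spots have a background at least as high as the foreground and end up missing, which is the behaviour the normexp option is meant to avoid.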

Checking normalization

Exercise

Exercise II

Normalize your dataset
• Use two different normalization schemes

Describe the experiment (fill in phenodata)

Check the quality of your dataset
• Is there a difference between the normalization schemes?
• If there is, select the better one, and continue with it

Filtering

Gene filtering

Removing probes for genes that are
• Not expressed
• Expressed at constant level (not changing)

Often a good idea, and necessary before multiple testing correction can be adequately applied
• Some controversy on this…

Non-specific filtering
• Expression, flags, SD, …

Specific filtering
• Statistical testing

Non-specific filtering

Often used for removing bad quality data:
• Intensity value too low
• Intensity value saturated
• Appearance of the spot is abnormal

Typically, non-changing genes are also removed

These can be removed using
• Filter by standard deviation
• Filter by interquartile range
• Filter by expression

Specific filtering

Selecting genes that are associated with some phenotype

Typically involves statistical testing

Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value.
• The two tell slightly different stories. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
• Take both into account by combining the filters.
• Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)

Unspecific filtering in Chipster

Pre-processing
• Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips this rule has to be fulfilled on
  • Select whether to return genes inside or outside the range
• Filter by SD
  • Select the percentage of genes to filter out
• Filter by interquartile range (IQR)
  • Select the IQR
• Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)
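The idea behind filtering by SD can be sketched in a few lines. This is an illustration only, not Chipster's implementation: the function name `filter_by_sd`, the gene identifiers, and the values are all hypothetical.

```python
import statistics

def filter_by_sd(expression, percentage):
    """Drop the given percentage of genes with the smallest standard deviation.

    expression: dict mapping gene id -> list of log2 values, one per chip.
    """
    sds = {gene: statistics.stdev(values) for gene, values in expression.items()}
    ranked = sorted(sds, key=sds.get)  # least variable genes first
    n_drop = int(len(ranked) * percentage / 100)
    return {gene: expression[gene] for gene in ranked[n_drop:]}

# Made-up log2 expression values for four genes on four chips.
data = {
    "geneA": [7.0, 7.1, 6.9, 7.0],   # nearly constant
    "geneB": [5.0, 9.0, 5.2, 8.8],   # changing
    "geneC": [8.0, 8.0, 8.1, 8.0],   # nearly constant
    "geneD": [6.0, 10.0, 6.1, 9.9],  # changing
}
kept = filter_by_sd(data, 50)
```

With 50 % filtered out, the two nearly constant genes are removed and the two changing genes are kept for statistical testing.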

Utilities
1. Calculate descriptive statistics
2. Filter using a column

Venn diagram

Select three datasets in Chipster

Run the Venn diagram tool from the Visualization tool category

[Figure: Venn diagram with circles labelled SD, CV, and IQR.]

Exercise

Exercise III

Filter your dataset using unspecific filtering
• Use two different schemes
• Compare the schemes using Venn diagram
• Are there any common genes?

Statistics

Some terminology

Usually tests for comparing means of two or more groups are used
• Variance might be of interest too, but in practice this is never done.

Parametric tests (assume data normally distributed)
• Typically used for microarray data

Non-parametric tests (assume no normality)

P-value
• Risk of saying that there is a difference when there really isn’t
• Traditionally 0.05 is used as a cut-off for significance
• False discovery rate is a p-value corrected for multiple tests (more on this later)

Mean and variance, an example for 1 gene

[Figure: two density plots. First panel: density.default(x = x1), N = 100000, bandwidth = 0.08956, x axis from -6 to 6. Second panel: density.default(x = y1), N = 100000, bandwidth = 0.09006, x axis from -10 to 10. The y axis of both panels shows density from 0.0 to 0.4.]

Statistical testing

Needs replication (>2 chips per group)
• Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.

Comparing means (parametric tests)
• One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
• Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
• Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
• Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)

t-test

Compares means of two groups
• If the p-value is small, the data give evidence of a difference between the groups.
• If the p-value is large (>0.05), no difference between the groups has been demonstrated.
• The p-value is the risk of saying that there is a difference when there actually isn’t.

A test for every gene is run separately -> thousands of tests and p-values

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$

Here $\bar{x}_1$ and $\bar{x}_2$ are the group means and $SE$ is the standard error of their difference.
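As a rough sketch of what is computed for a single gene, the t statistic can be written out directly. The slides do not say which standard error Chipster uses; the unpooled (Welch) form below is an assumption, and the expression values are invented.

```python
import math
import statistics

def t_statistic(group1, group2):
    """Two-sample t statistic with an unpooled (Welch) standard error."""
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    # Standard error of the difference between the two group means.
    se = math.sqrt(
        statistics.variance(group1) / len(group1)
        + statistics.variance(group2) / len(group2)
    )
    return mean_diff / se

# Made-up log2 expression values of one gene in two groups of four chips.
controls = [7.0, 7.5, 8.0, 7.5]
patients = [9.0, 9.5, 10.0, 9.5]
t = t_statistic(patients, controls)
```

The same calculation is repeated for every gene on the chip, and each t value is then converted to a p-value using the t distribution.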

ANOVA

A generalization of the t-test.

Compares means of several groups.

Tells whether the means are different, but not which means differ from each other.
• For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster)

A test for every gene is run separately -> thousands of tests and p-values

Multiple testing correction I

After getting the results for all the genes, p-values are adjusted for the number of tests conducted.

When making several comparisons using the same test, some of the results will be chance findings.
• Example: if the p threshold is 0.05, one test in 20 is expected to come out significant by chance alone, even when there is no real difference. If 10000 genes were tested, 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (500) would be false positives, and only a minority (50) true positives.

This can be corrected for (to some extent) by using a multiple testing correction.
• Benjamini and Hochberg FDR: If the FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.
• Thus, the FDR threshold can be set much higher than a typical p-value cut-off, and the results can still be meaningful and worth investigating.
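The Benjamini and Hochberg adjustment itself is short enough to sketch. The following is a plain-Python illustration with made-up p-values, not the code Chipster runs; the function name is chosen for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for five genes.
p = [0.001, 0.03, 0.02, 0.2, 0.002]
fdr = benjamini_hochberg(p)
significant = [i for i, q in enumerate(fdr) if q <= 0.05]
```

Each p-value is multiplied by the number of tests and divided by its rank, so the adjustment never reorders the genes: a gene with a smaller raw p-value never receives a larger adjusted value.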

Multiple testing correction II

The ranking of the genes does not change after multiple testing correction!
• If you know that you can validate, say, 10 genes, then it makes no difference whether you select the most significant genes before or after the multiple testing correction.
• If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.

Gene set test (”global test”)

A typical result of a microarray experiment is a list of differentially expressed genes.

Biologically, grouping these genes in pathways or functional categories would be more interesting.

Are pathways associated with our endpoints of interest?
• Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?

Works on the expression values data.

Gene enrichment analysis

A typical result of a microarray experiment is a list of differentially expressed genes.

Biologically, grouping these genes in pathways or functional categories would be more interesting.

Takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories.

Works on the gene list.
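One common way to test a gene list for enrichment is the hypergeometric test. The slides do not state which test Chipster applies, so the sketch below is an assumption meant only to show the principle; the function name and all counts are invented.

```python
from math import comb

def enrichment_p_value(n_universe, n_in_category, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_category, n_selected)
    # Sum the probabilities of seeing n_overlap or more category genes
    # in a random gene list of the same size.
    tail = sum(
        comb(n_in_category, k) * comb(n_universe - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Made-up counts: 1000 genes on the chip, 40 of them in a pathway,
# 50 differentially expressed genes, 8 of which belong to the pathway.
p = enrichment_p_value(1000, 40, 50, 8)
```

By chance only about 2 of the 50 genes would be expected in the pathway, so an overlap of 8 gives a small p-value.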

Statistical tests in Chipster

Statistics
• One sample tests
  • Are the genes expressed at all (different from 0)?
• Two group tests
• Several group tests
• Linear modeling

Visualization
• Volcano plot

Exercise

Exercise IV

To find differentially expressed genes, run a suitable statistical test for your (filtered) data set.

Are these differentially expressed genes enriched in any KEGG pathways?
• There is a separate test for this.

Clustering

Clustering methods

Hierarchical clustering

Non-hierarchical clustering
• K-means
• QT-clustering
• Self-organizing maps

Classification aka class prediction
• K-nearest neighbor (KNN)

Unsupervised v. supervised

Hierarchical clustering

Two phases:
• Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
• Pick the dendrogram drawing method
  • Average linkage

Average linkage example

Hierarchical clustering - heatmap

K-means clustering

Finds K clusters from the data.

User has to specify the number of clusters (K).
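The two alternating steps of K-means can be sketched as below. This is a bare-bones illustration with invented profiles; taking the first K points as initial centers is a simplification made for reproducibility, whereas real implementations usually start from random centers.

```python
import math

def k_means(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Made-up expression profiles (two chips per gene) forming two obvious groups.
profiles = [(1.0, 1.2), (8.0, 8.1), (1.1, 0.9), (7.9, 8.3), (0.9, 1.0)]
labels, centers = k_means(profiles, k=2)
```

The result depends on K: asking for three clusters from the same data would split one of the two natural groups.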

K-means clustering

Clustering in Chipster

Clustering
• Hierarchical
  • Includes reliability checking of the resulting tree with bootstrapping
• K-means

Statistics
• PCA (principal component analysis)
• NMDS (non-metric multidimensional scaling)

Exercise

Exercise V

Cluster your differentially expressed genes using hierarchical clustering

Annotation

Annotation

Annotation = Descriptive text used for labeling features. For genes, extra information about their location in chromosomes, biological functions, etc.
• Retrieved from multiple biological databases and stored as a single database in Chipster (generated by the Bioconductor project).
• Not available for all chiptypes, but required by certain analyses (annotation, gene enrichment analysis, promoter analysis)
• For Affymetrix: either built-in or GeneCruiser
• For other chiptypes: built-in

Alternative CDF environments for Affy

• A CDF is a file that links individual probes to their location in genes (probesets)
• The default Affymetrix annotations use old CDF files that map a sizable number of probes to the wrong genes
• Alternative CDFs (custom chiptype in Affymetrix normalization) fix this problem
• After using the alt CDFs, you can’t use gene set enrichment or promoter analysis tools
  • No annotation files exist for alt CDFs

Promoter analysis

Promoter analysis with Chipster

Promoter sequences = sequences upstream of annotated transcription start site of RefSeq genes (from UCSC Golden Path)

Pattern discovery: Weeder
• looks for common sequence motifs in a set of promoters

Pattern matching: ClusterBuster
• looks for clusters of known transcription factor binding sites using the JASPAR matrices

Promoters from genes with similar expression patterns

Pattern discovery

Program to find common motifs
• Tool comparison: Nature Biotech. (2005) 23:137 => Weeder

Weeder
• Enumerates all oligos of given length, determines which appear in a significant fraction of seqs, ranks them according to statistical significance
• Pavesi et al (2004) Nuc Acids Res. Jul (W199-203)

Parameters (default in brackets)
• Species (human, mouse, rat, yeast) [human]
  • Background frequency files (oligo count of intergenic regions of a given organism)
• Promoter size (short, medium, long) [short]
• Analyze strands (single, both) [single]
• Motif appears more than once per sequence (yes, no) [no]
• Number of motifs to return (1-100) [10]
• Percentage of sequences the motif should appear in (1-100) [50]
• Transcription factor binding site size (small, medium) [small]
  • Small = 6 (1 mismatch allowed) and 8 (2 mismatches allowed)
  • Medium = 10 (3 mismatches allowed)

Pattern matching

• Collection of known binding motifs for TFs (Genomatix, Transfac, JASPAR)
• Program to scan the sequence for binding sites

[Figure: example motif TTTTTATA.]

ClusterBuster

• Looks for clusters of transcription factor binding sites
• Uses the JASPAR open access matrix database
  • http://jaspar.cgb.ki.se/cgi-bin/jaspar_db.pl
• Frith et al (2003) Nuc Acids Res, 31(13):3666-8

Parameters (default in brackets)
• Species (human, mouse, rat, yeast) [human]
• Promoter size (short, medium, long) [short]
• Cluster score threshold [5]
• Motif score threshold [6]
• Expected distance between motifs in a cluster [35]
• Range for counting nucleotide frequencies [100]
• Pseudocounts [0.375]

ClusterBuster output

Exercise

Exercise VI

Search your list of differentially expressed genes for binding sites of known transcription factors

Extra material

Linear modeling in Chipster

Linear model

$Y = a + b x_1 + c x_2 + d x_1 x_2$
• Like a normal multiple regression
• Intercept (a) is included by default
• Can contain both main effects (b, c) and interaction effects (d)
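To make the terms of the model concrete, the sketch below builds the model matrix for two main effects with two groups each and evaluates Y for every combination. The 0/1 coding of the groups and the coefficient values are assumptions chosen for illustration, not output from Chipster.

```python
def design_row(x1, x2, interaction=True):
    """One row of the model matrix: intercept, main effects, optional interaction."""
    row = [1, x1, x2]
    if interaction:
        row.append(x1 * x2)
    return row

def predict(coefficients, row):
    """Y = a + b*x1 + c*x2 + d*x1*x2 for one sample."""
    return sum(coef * value for coef, value in zip(coefficients, row))

# Two main effects with two groups each, coded 0/1 (an assumed coding).
samples = [(0, 0), (1, 0), (0, 1), (1, 1)]
model_matrix = [design_row(x1, x2) for x1, x2 in samples]

# Hypothetical coefficients a, b, c, d on the log2 scale.
coefficients = [8.0, 1.0, 0.5, -1.5]
fitted = [predict(coefficients, row) for row in model_matrix]
```

In this example each factor alone raises the expression, but the negative interaction term cancels the gain when both are present, which is something the main effects alone could not describe.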

Linear modeling in Chipster can take into account at most three main effects, their interactions, one technical replication level, and one level of pairing
• This is enough for all the experiments I’ve encountered in GEO so far.
• Technical replication: one biological sample is hybridized on more than one array
• Pairing: a before-after type of setting. Measurements are available just prior to treatment and after it from exactly the same cell culture flasks.

Setting up the model I

All columns (max. three) in the phenodata can be tested either as linear (is there a trend towards higher numbers?) or as a factor (are there differences between the groups?).
• With 2 groups there’s no difference between these settings.

[Figure: two panels with groups 1, 2, 3 on the x axis, titled "linear" and "factor".]

Linear modeling tool

Columns 1…3
• Main effects

Column 4
• Technical repl.

Column 5
• Pairing

One main effect – 3 groups

[Figure: panels labelled "linear" and "factor".]

Setting up the model

If you want to include more than one main effect, you need to add new columns to your phenodata.

Two main effects – both have two groups

No interactions

Two-way interactions, with significant genes returned for all effects (main effects and interactions)

Pairing or technical replication

All samples in the same pairing or replication groups are coded with the same number. Different groups are coded with a running number.

Result files

A model matrix and one result file are saved.

Experimental design

Some things to ponder

Bad experimental design is bad science!
• Wasted money
• More animal or human suffering
• Unreliable results

The main aspects of experimental design are
• Randomization and balancing (often neglected)
• Replication (usually rather well handled)
• Blocking (not even known of)
• Factorial experiments (sometimes considered)

You also need to consider
• Sample size
• Controls (direct or indirect measurements)

Before running the experiment

Define the principal hypothesis to test. Everything cannot be tested!
• ”I run this experiment for comparing two treatments on Arabidopsis. Now coming to think of it, these plants were of different age. Can you also test for the effect of it?”

Which are the main sources of variability? They need to be taken into account in the experimental design!
• Laboratory personnel (more than one person involved?)
• Chips (from more than one batch?)
• Biological samples (inter- or intraindividual variability?)
• Hybridization conditions (is the method standardized?)
• Day (often the greatest source of variation)
  • Intermingled with variation from chips, biological samples, etc. if not properly taken into account

Replication

Technical replication:
• Take a sample per animal, and hybridize every sample to several chips.

Biological replicate:
• Take a sample per animal, and hybridize every sample to one chip.

Replication does not mean taking repeated measurements from the same experimental units. That typically generates a time series.

Technical replication, when analysed as a biological replicate, is a pseudoreplicate. Pseudoreplication generates more problems than it solves.

Balancing

Balancing means that there should be an equal number of experimental units in all groups.

Balanced designs are statistically more powerful than unbalanced designs.

Example:
• In the study of breast cancer, 30 individuals were recruited from the cancer cohort, and 30 individuals as their healthy controls (balanced for the disease).
• 60 Affymetrix chips are available for hybridizing these samples. The Affymetrix station only takes 8 chips at a time, so 4 cancer patients and 4 healthy controls are randomly picked to be hybridized in every batch (balanced for day effect).
• Two laboratory technicians are making the hybridizations. Both process 30 samples, half being cancer patients and half healthy controls (balanced for the technician).

Randomization

Randomization is a way to control for the effects of factors not explicitly taken care of in the experimental design.

In randomization, experimental units are randomly allocated to treatment groups.
• Sixty cell culture vials are randomly divided into control and treatment groups. They retain their places in the incubator regardless of the group (completely randomized trial).

Random does not mean haphazard. Randomization takes some effort. Use e.g. dice, playing cards, a random number generator, or random number tables for randomization.

In the best case the randomization is blind. The experimenter must not be able to identify the samples before the whole experiment has been concluded.

Completely randomized design

Let’s divide 16 samples into two groups of equal size. I’ve created a random number table, shown below.

Row #  A  B  C  D
1      1  2  2  2
2      2  1  2  1
3      2  1  1  2
4      1  1  2  1

Reading the table from the top left to the bottom right, the cell culture vials are assigned to two groups.

We might then arrange the vials on the tray in the same order and put the tray in the incubator.
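A random number generator can produce such an allocation directly. The sketch below is one possible way to do it; the function name and the seed value are arbitrary choices made for the example.

```python
import random

def randomize_two_groups(n_samples, seed=None):
    """Randomly assign n_samples (even) units to groups 1 and 2 of equal size."""
    labels = [1] * (n_samples // 2) + [2] * (n_samples // 2)
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    rng.shuffle(labels)
    return labels

assignment = randomize_two_groups(16, seed=2008)
# Lay the labels out as a 4 x 4 tray, read row by row.
tray = [assignment[i:i + 4] for i in range(0, 16, 4)]
```

Shuffling a list that already contains eight 1s and eight 2s guarantees balanced groups, which drawing each label independently would not.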

Blocking

Blocking is arranging experimental units into similar groups. Blocking is used for controlling for factors that cannot be manipulated, but are known.

Example:
• While studying a response to a drug treatment, both males and females were recruited for the study. Response might depend on sex, so individuals were first divided into two groups according to their sex, and then randomly assigned to treatment groups (randomized block design).
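The randomized block design of the example can be sketched as follows: randomization is carried out separately inside each block. The identifiers, treatment names, and seed are hypothetical.

```python
import random

def randomized_block(units, block_of, treatments=("control", "drug"), seed=None):
    """Randomize treatments separately within each block (e.g. sex)."""
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of[unit], []).append(unit)
    allocation = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Deal treatments out in turn so each block is split evenly.
        for position, unit in enumerate(shuffled):
            allocation[unit] = treatments[position % len(treatments)]
    return allocation

# Hypothetical individuals and their sex.
sex = {"id1": "F", "id2": "F", "id3": "F", "id4": "F",
       "id5": "M", "id6": "M", "id7": "M", "id8": "M"}
allocation = randomized_block(list(sex), sex, seed=1)
```

Because the shuffle happens within each sex, both treatment groups end up with the same number of males and females, so the effect of sex cannot be mistaken for the effect of the drug.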

Factorial designs

In factorial design several factors are manipulated at the same time.

Better to analyze together than separately, because factorial design allows one to assess the possible interaction.

Example:
• Cells were treated with vitamin-C and hydrogen peroxide. Culturing cells with either chemical alone leads to missing the interaction where vitamin-C prevents peroxide-induced cell death to some extent.

Main effects: vitamin-C and peroxide
Interaction: vitamin-C * peroxide

Sample size

We need to use a sufficient number of samples to reach reliable conclusions. Using too small or too big a sample size is a waste of resources.

Finding out the correct sample size for DNA microarray experiments is tricky. Use of previous experiments for the same chip type and biological material is often needed.

In epidemiological studies estimating the sample size is a must. It might be hard to get published otherwise.

To estimate the sample size, we need an estimate of
• Effect size
• Variability
• Desired false positive rate
• Desired false negative rate

Sample size – a comparison of two experiments

Sample size – a rule of thumb

In statistics, variability is intrinsically associated with statistical significance. The lower the variability of replicates, the higher the significance.

Doubling the sample size halves the variance of the group mean, making the detection of differences easier.
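The rule of thumb follows from the variance of a mean. As a short derivation, assume $n$ independent replicates with a common variance $\sigma^2$ (symbols introduced here for illustration, not taken from the slides):

```latex
\operatorname{Var}(\bar{x}_n) = \frac{\sigma^2}{n},
\qquad
\operatorname{Var}(\bar{x}_{2n}) = \frac{\sigma^2}{2n} = \tfrac{1}{2}\operatorname{Var}(\bar{x}_n),
\qquad
SE(\bar{x}_n) = \frac{\sigma}{\sqrt{n}}
```

The variance of the mean is halved when the sample size doubles, but the standard error, which enters the t statistic, shrinks only by a factor of $\sqrt{2}$.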

Direct or indirect measurements?

[Figure: design diagrams with nodes labelled Reference, Sample, and Sample 2.]

An example of a better…good design

Comparing two groups of samples.
• 20 samples in each group (40 in total).
• You’re interested in comparing the two states (diseased, healthy).
• Interindividual variability (due to sex) can be expected.
• Using Affymetrix chips (all from the same batch).
• You’re doing all the wetlab work.

Hybridize (randomly ordered):
• 12122211
• 22112112
• 21211212
• 22221111
• 12211212

Coding: 1 = healthy, 2 = diseased; 1 = male, 2 = female