Localization Analysis

11/07/07

• Microarray probes are oligonucleotide sequences with regular spacing covering a whole genomic region.

chromosome

Tiling Arrays

http://en.wikipedia.org/

Typical applications:

Comparative Genomic Hybridization (aCGH) – copy number variation

RNA analysis: transcript structure, transcript discovery, etc.

Location analysis: nuclease sensitivity

Location analysis: chromatin immunoprecipitation (ChIP)

NOTE: ALL of these things can also be done by deep sequencing, which we will briefly cover towards the end

Spike-in experiments – we can find linkers as short as 7 bp

[Figure: measured red/green ratio vs. location of labeled PCR product; two data series]

Experimental Determination of Cross-Hybridization

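Spike in PCR product – (1+1)/1 > (1+n)/n, so cross-hybridizing (X-hyb) probes will detect less enrichment experimentally. Here n > 1 is the number of genomic targets a cross-hybridizing probe binds, compared with one target for a unique probe.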
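The inequality above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each genomic target and the spiked-in product contribute equal signal; the function name `spike_in_ratio` is hypothetical.

```python
def spike_in_ratio(n_targets, spike=1.0):
    """Expected red/green ratio for a probe that hybridizes to n_targets
    genomic loci when one spiked-in product is added to the red channel.

    Assumption: every target and the spike contribute equal signal.
    """
    return (spike + n_targets) / n_targets


# A unique probe (n = 1) versus probes that cross-hybridize to more loci.
for n in (1, 2, 4, 10):
    print(n, spike_in_ratio(n))
```

The ratio falls toward 1 as n grows, which is why cross-hybridizing probes report less enrichment than unique ones.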

[Figure: spike-in data; cross-hybridizing (X-hyb) probes are marked]

Array CGH Technology

Genome-wide measurement of DNA copy number alteration by array CGH

Pollack J R et al. PNAS 2002;99:12963-12968

©2002 by The National Academy of Sciences

DNA copy number alteration across chromosome 8 by array CGH

Pollack J R et al. PNAS 2002;99:12963-12968

©2002 by The National Academy of Sciences

RNA vs genomic

5’ UTR

3’ UTR

Tiling of the Hox loci – mRNA vs. genomic

ZY Xu et al. Nature 000, 1-5 (2009) doi:10.1038/nature07728

Transcript maps.

DNaseI HS profiling

DHS profiling identifies promoters, enhancers, and insulators

Isolation of nucleosomal DNA

Experimental Protocol

• Step 1: crosslink – fix protein to DNA

• Step 2: sonication – break DNA

• Step 3: immunoprecipitation – pull down target protein with a specific antibody

• Step 4: hybridization – hybridize input and pulled-down DNA on microarray

Kim and Ren 2007

Chromatin Immuno-precipitation

Tiling Array Data

Each TF binding signal is represented by multiple probes.

Need more sophisticated statistical tools.

Kim and Ren 2007

Boyer et al. 2005

Tiling arrays provide high resolution for identifying bound fragments

Overlapping 25-mer fragments

Mapping histone modifications

Chromatin’s primary structure

OK, now what?

• Analysis method strongly depends on how widespread the thing being examined is, and on whether you have a guess regarding its localization

• CGH: Just look!

• TF ChIP-chip, DHS: peak finding algorithms (BUT BUT BUT).

• RNA, chromatin marks: Hidden Markov Models, aggregation plots

CGH Array Segmentation

• Key idea: Most probe targets have same copy number as their next neighbors

• Can average over neighbors

• Key issue: when is a difference real?

• Recommended programs:

• DNAcopy – solid statistical basis; slow

• StepGram – heuristic; fast
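The "average over neighbors" idea can be sketched as a plain moving average over log ratios ordered by genomic position. This is only an illustration of the key idea, not the algorithm used by either recommended program; the function name and the window size are assumptions.

```python
def moving_average(values, window=5):
    """Smooth probe log ratios by averaging each probe with its neighbors.

    The window is centered on each probe and truncated at the array ends.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# A single noisy probe is damped; a run of gained probes keeps its level.
outlier = [0.0] * 5 + [1.0] + [0.0] * 5
gain = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
print(moving_average(outlier)[5], moving_average(gain)[15])
```

Averaging blurs the edges of a real copy-number change, which is one reason deciding when a difference is real needs a proper segmentation method.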

Methods

• Moving average t-test (Keles et al. 2004)

• HMM (Li et al. 2005; Yuan et al. 2005)

• Tilemap (Ji and Wong 2005)

• MAT (Johnson et al. 2006)

Keles’ method

• Calculate a two-sample t-statistic at each probe i, comparing the ChIP signal (Y2) with the input signal (Y1):

$$T_{i,n} = \frac{\bar{Y}_{2,i} - \bar{Y}_{1,i}}{\sqrt{\hat{\sigma}^2_{1,i}/n_1 + \hat{\sigma}^2_{2,i}/n_2}}$$

• Moving average scan-statistic over a window of w probes:

$$T^*_{i,n} = \frac{1}{w} \sum_{h=i}^{i+w-1} T_{h,n}$$

Keles et al. 2004
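As a rough illustration of the two statistics on this slide, the sketch below computes a per-probe two-sample t-statistic from replicate ChIP and input measurements and then averages it over a window of w probes. The function names and the toy data layout are assumptions, not taken from Keles et al.

```python
import math
from statistics import mean, variance


def two_sample_t(chip, control):
    """Unequal-variance two-sample t-statistic for one probe.

    chip and control are replicate measurements (at least two each);
    zero variance in both groups would cause a division by zero.
    """
    se = math.sqrt(variance(control) / len(control) + variance(chip) / len(chip))
    return (mean(chip) - mean(control)) / se


def scan_statistic(t_values, w):
    """Moving average of per-probe t-statistics over windows of w probes."""
    return [sum(t_values[i:i + w]) / w for i in range(len(t_values) - w + 1)]


# One probe with three ChIP and three input replicates, then a toy scan.
print(two_sample_t([3, 4, 5], [1, 2, 3]))
print(scan_statistic([1, 2, 3, 4], 2))
```

Averaging over neighboring probes rewards signals that persist across a window, which matches the expectation that one bound fragment covers several probes.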

Multiple hypothesis testing

• Multiple hypothesis testing needs to be considered to control false positive error rates.

• What is the null distribution of this statistic?

$$T^*_{i,n} = \frac{1}{w} \sum_{h=i}^{i+w-1} T_{h,n}$$

Multiple hypothesis testing

• Assume $T_{h,n}$ has a t-distribution

• Approximate $T^*_{i,n} = \frac{1}{w} \sum_{h=i}^{i+w-1} T_{h,n}$ by a normal distribution.

• Alternatively can use a resampling method to estimate the null distribution.

ChIPOTle: a simple method for identifying ‘bound’ genomic fragments (Buck et al. 2005)

Assumption: a real binding site will have a distribution of bound fragments encapsulating it. Therefore, true positives will likely have multiple, contiguous fragments with high signal.

1. Walk across tiled genomic probes with user-defined window size

2. Calculate mean signal intensity within each window

3. Estimate p-value of binding (Bonferroni-corrected) based on a standard error model or by permuting the dataset.
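The three steps above can be sketched as follows. This is not the ChIPOTle implementation: it assumes probe log ratios are independent under the null with mean 0 and a known standard deviation `sigma`, and the function name `window_pvalues` is hypothetical.

```python
import math
from statistics import NormalDist


def window_pvalues(log_ratios, window, sigma):
    """Slide a window across tiled probes and score each window mean.

    Returns (start, window_mean, bonferroni_p) tuples. Under the assumed
    null, a mean of `window` probes has standard error sigma / sqrt(window).
    """
    n_windows = len(log_ratios) - window + 1
    null = NormalDist(0.0, sigma / math.sqrt(window))
    results = []
    for start in range(n_windows):
        window_mean = sum(log_ratios[start:start + window]) / window
        p = 1.0 - null.cdf(window_mean)            # one-sided: enrichment only
        results.append((start, window_mean, min(1.0, p * n_windows)))
    return results


# Toy data: three contiguous enriched probes in a flat background.
data = [0.0] * 10 + [3.0] * 3 + [0.0] * 10
hits = [r for r in window_pvalues(data, window=3, sigma=1.0) if r[2] < 0.05]
print(hits)
```

Overlapping windows are not independent tests, so the Bonferroni correction here is conservative; permuting the dataset, as the slide notes, is the alternative when the error model is in doubt.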

BUT:

• “Extensive low-affinity transcriptional interactions in the yeast genome”, Amos Tanay, Genome Research 2006

OK, what about more continuous data like RNA or chromatin marks?

Inferring nucleosomes: HMM

A Hidden Markov Model objectively identifies nucleosome positions

Hidden Markov Models for Identifying Bound Fragments

HMMs are trained on known data to recognize different states (e.g., bound vs. unbound fragments) and the probability of moving between those states.

Example: ChIP-chip data from a tiling microarray identifying regions bound to a transcription complex with a known 50 bp binding sequence.

You expect that a bound fragment will have high signal on the array and that the bound fragment will be 2-3 probes long.

Once trained, an HMM can be used to identify the ‘hidden’ states in an unknown dataset, based on the known characteristics of each state (‘emission probabilities’) and the probability of moving between states (‘transition probabilities’).

Example: “A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences” 2005. Li, Meyer, Liu



[Figure: HMM over consecutive 25-mer probes with states Unbound 25mer, Bound 25mer, Bound 25mer, Bound 25mer]

I = Intensity units > 10,000; i = Intensity units < 10,000

Emission Probabilities

• Unbound 25mer: P( I ) = 0.2, P( i ) = 0.8

• Bound 25mer (each of the three bound states): P( I ) = 0.8, P( i ) = 0.2

Transition Probabilities (arrow labels in the figure): P = 0.5, 0.5, 1.0, 0, 0.7, 0.3, 1.0

Given the data, an HMM will consider many different models and give back the optimal model
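To make the decoding step concrete, here is a minimal Viterbi sketch for a simplified two-state version of this model. The emission probabilities are the ones on the slide; the transition and start probabilities are hypothetical, because the arrow labels cannot be matched to specific transitions from the extracted text, and the three bound states are collapsed into one.

```python
import math

STATES = ("unbound", "bound")
# Emission probabilities from the slide (I = high intensity, i = low).
EMIT = {"unbound": {"I": 0.2, "i": 0.8}, "bound": {"I": 0.8, "i": 0.2}}
# Hypothetical transition and start probabilities (assumptions).
TRANS = {"unbound": {"unbound": 0.9, "bound": 0.1},
         "bound": {"unbound": 0.3, "bound": 0.7}}
START = {"unbound": 0.9, "bound": 0.1}


def viterbi(observations):
    """Return the most probable state path for a sequence of 'I'/'i' symbols."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this probe.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][obs]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("iiiIIIiii"))
print(viterbi("iiIii"))
```

With these assumed transitions, a run of three high-intensity probes is called bound while a single high probe is treated as noise, which mirrors the expectation that a bound fragment spans 2-3 probes.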

Other types and uses of microarrays: aCGH

CGH (comparative genomic hybridization) looks at cytogenetic abnormalities

• genomic DNA hybridized to array

• often uses large clones (e.g., BACs) as array features

Validation of data

There’s no way that all of your microarray data can be validated.

It’s strongly recommended that any key findings be verified by independent means.

Northern blots and quantitative RT-PCR are the typical ways of doing this; real-time, quantitative RT-PCR is generally the method of choice.

Chromatin’s primary structure

One way to turn this 1D trace into 2D is via “averageogram”
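One plausible reading of the averaging step is an aggregation plot: align the signal around a set of anchor positions (for example, the nucleosome-free regions used on the next slide) and average at each offset. The sketch below shows that step only, under that assumption; the function name, the probe-index coordinates, and the toy data are all hypothetical.

```python
def aggregate_profile(signal, anchors, flank):
    """Average a 1D signal at each offset from -flank to +flank around anchors.

    signal is indexed by probe position; anchors are probe indices.
    Offsets that fall off the end of the signal are skipped.
    """
    sums = [0.0] * (2 * flank + 1)
    counts = [0] * (2 * flank + 1)
    for anchor in anchors:
        for offset in range(-flank, flank + 1):
            pos = anchor + offset
            if 0 <= pos < len(signal):
                sums[offset + flank] += signal[pos]
                counts[offset + flank] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Two anchors with the same local shape give that shape back as the average.
trace = [0, 1, 5, 1, 0, 0, 1, 5, 1, 0]
print(aggregate_profile(trace, anchors=[2, 7], flank=1))
```

Keeping the aligned rows instead of collapsing them to a mean would give a 2D (anchor by offset) view of the same data.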

H4 K16 Acetyl, aligned by NFR

Beyond Transcription

[Figure: % nucleosomes and % exchange events (printed arrays) by annotation class: CDS, TSS3, TSS5, promoter, Null, tRNA, ARS]

Multiple visualizations of tiling data

RNA-Seq

Lockhart and Winzeler 2000

Wang et al. 2009

RNA-Seq

• Whole Transcriptome Shotgun Sequencing

– Sequencing cDNA

– Using next-generation (NexGen) sequencing technology

• Revolutionary Tool for Transcriptomics

– More precise measurements

– Ability to do large scale experiments with little starting material

RNA-Seq Experiment

Wang et al. 2009

Mapping

• Create unique scaffolds

– Harder algorithms with such short reads

Unbiased sequencing of the yeast transcriptome

Yassour M et al. PNAS 2009;106:3264-3269

©2009 by National Academy of Sciences

Mapping

• Place reads onto a known genomic scaffold

– Requires known genome and depends on accuracy of the reference
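Placing reads onto a known reference can be illustrated with a toy exact-match mapper: index every k-mer of the reference, look up the first k bases of a read, and verify the rest. Real aligners tolerate mismatches and use far more compact indexes; the function names and sequences below are made up for illustration.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return all reference positions where the read matches exactly."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "ACGTACGTTAGC"
index = build_index(reference, k=4)
print(map_read("ACGTT", reference, index, k=4))   # unique placement
print(map_read("ACGT", reference, index, k=4))    # ambiguous: two placements
```

The second read shows why short reads are harder to place: the shorter the read, the more often it matches several locations, and any error in the reference changes where (or whether) a read lands.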

http://en.wikipedia.org/

Ab initio assembly of a transcript catalog

Yassour M et al. PNAS 2009;106:3264-3269

©2009 by National Academy of Sciences

Biases

Wang et al. 2009

What the data look like

Superimposing channels

Giresi et al, Genome Res. 10

Experimental Design for Microarrays

There are a number of important experimental design considerations for a microarray experiment:

• technical vs biological replicates

• amplification of RNA

• dye swaps

• reference samples


Technical vs biological replicates

• technical replicates are repeat hybridizations using the same RNA isolate

• biological replicates use RNA isolated from separate experiments/experimental organisms

Although technical replicates can be useful for reducing variation due to hybridization, imaging, etc., biological replicates are necessary for a properly controlled experiment


Amplification of RNA

• linear amplification methods can be used to increase the amount of RNA so that microarray experiments can be performed using very small numbers of cells. It’s not clear to what degree this affects results, especially with respect to rare transcripts, but seems to be generally OK if done correctly


Dye swaps

When using 2-color arrays, it’s important to hybridize replicates using a dye-swap strategy in which the colors (labels) are reversed between the two replicates. This is because there can be biases in hybridization intensity due to which dye is used (even when the sequence is the same).
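A small worked example shows why the dye swap helps. It assumes the dye bias is additive on the log-ratio scale and identical in both hybridizations, which is a simplification; the function name is hypothetical.

```python
def dye_swap_estimate(forward_log_ratio, swapped_log_ratio):
    """Combine a dye-swap pair of log ratios for one probe.

    Assumed model: forward = true + bias, swapped = -true + bias.
    Returns (estimated true log ratio, estimated dye bias).
    """
    true_ratio = (forward_log_ratio - swapped_log_ratio) / 2.0
    dye_bias = (forward_log_ratio + swapped_log_ratio) / 2.0
    return true_ratio, dye_bias


# True log ratio 1.0 with a dye bias of 0.3 in both hybridizations.
print(dye_swap_estimate(1.0 + 0.3, -1.0 + 0.3))
```

Under this model the bias cancels in the difference and the biological signal cancels in the sum, so the pair separates the two.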

[Figure: dye-swap design; samples S1 and S2 hybridized twice with the labels reversed]


Reference samples

• One common strategy is to use a reference sample in one channel on each array. This is usually something that will hybridize to most of the features (e.g., a complex RNA mixture). Using a reference sample allows comparisons to be made between different experimental conditions, as each is compared to the common reference.

[Figure: samples S1, S2, and S3 each hybridized against the reference R]

Compare S1/R vs. S2/R vs. S3/R
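On the log scale, the reason a common reference makes arrays comparable is a one-line identity. This assumes the reference contributes the same signal for a given feature on every array:

```latex
\log_2\frac{S_1}{R} - \log_2\frac{S_2}{R} = \log_2\frac{S_1}{S_2}
```

The reference term cancels, so any two samples can be compared even though they were never hybridized on the same array.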


The bottom line is that you should discuss your experimental design with a statistician before going ahead and beginning your experiments. It’s usually too late and too expensive to change the design once you’ve begun!

MIAME (Minimum Information About a Microarray Experiment)

When you publish a microarray experiment, you are expected to make available the following minimal information. This allows others to evaluate your data and compare it to other experimental results:

• EXPERIMENT DESIGN: type, factors, number of arrays, reference sample, qc, database accession (ArrayExpress, GEO)

• SAMPLES USED, PREPARATION AND LABELING

• HYBRIDIZATION PROCEDURES AND PARAMETERS

• MEASUREMENT DATA AND SPECIFICATIONS: quantitations, hardware & software used for scanning and analysis, raw measurements, data selection and transformation procedures, final expression data

• ARRAY DESIGN: platform type, features and locations, manufacturing protocols or commercial p/n
