tag-based expression/function analysis
![Page 1: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/1.jpg)
![Page 2: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/2.jpg)
Tag-based expression/function analysis
Data files are on the course web page (link at today's date), and also at: http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
![Page 3: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/3.jpg)
Where are we now?
• R to do statistics
• Genome browsers and Galaxy to visualize genes and genomics data
• Analyzing expression by microarrays + R and Bioconductor
• Tag analysis
• Proteomics
![Page 4: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/4.jpg)
What we want in transcriptomics
• Know which transcripts are transcribed, and how much they are transcribed
– Implicitly also which transcripts exist in the cell, and what they look like!
• Intuitively, we could get all this information by sequencing all mRNAs in one cell
![Page 5: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/5.jpg)
General problems with cDNA sequencing:
Reverse transcriptase falls off
Hard to sequence long transcripts
Many cDNAs are identical, but some occur only once per cell (or less!). Need to sequence MANY cDNAs
Very expensive if you want to sequence all molecules
![Page 6: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/6.jpg)
Solutions:
1) Do not sequence: use probes and hybridization: microarrays and tiling arrays (this is where we are now!)
2) Only sequence parts of transcripts: tag sequencing (this is where we are heading)
![Page 7: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/7.jpg)
Thought exercise
• What are the pros/cons of hybridization (micro/tiling arrays) vs sequencing? 2 minutes with your neighbor
![Page 8: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/8.jpg)
Albin’s take
Hybridization:
• + Cheap (per “gene”)
• + Mature methods
• + Standardized
• - Complex normalization needed
• - Cross-hybridization
• - Highly dependent on annotation of probes
• - Dependent on designed probes for genes
• - Cannot deal with repeats
• +/- Integrative signal (more on next slide)
Sequencing:
• - Expensive (now, but changing)
• - “Unbiased” - no designed probes
• - Non-standard computational methods
• - More demanding processing (now)
• - Much easier statistics in the end
• + Less noisy
• + Much higher resolution - up to nucleotide level
• + Location information
• +/- Sampled signal (more on next slides)
![Page 9: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/9.jpg)
Hybridization: integrative
We have many identical probes. Each time a probe gets a hybridization event, we add a little to the signal.
This includes non-optimal hybridization events - just something labeled that hybridizes will give some signal
![Page 10: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/10.jpg)
Sequencing: sampling
The number of cDNAs in a library is VERY LARGE
We pick only some of them to do sequencing, randomly
Blind sampling (does not know anything about RNAs)
We map sequences back to the genome (a kind of quality check)
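The sampling idea can be tried out in R. This is only a toy sketch: the gene names, abundances and library sizes are invented, and `rmultinom` is used as a stand-in for "picking cDNAs at random". It shows that rare transcripts are easily missed in a small sample.

```r
# Toy illustration of sequencing as random sampling from a cDNA library.
# All numbers below are invented for the example.
set.seed(1)

# Hypothetical true abundances (molecules per cell) of five transcripts
abundance <- c(geneA = 5000, geneB = 500, geneC = 50, geneD = 5, geneE = 1)
prob <- abundance / sum(abundance)

# Draw a small and a large library: each sequenced tag is one random pick
small_lib <- rmultinom(1, size = 1000, prob = prob)[, 1]
large_lib <- rmultinom(1, size = 100000, prob = prob)[, 1]

# Compare observed counts with the counts expected from the true abundances
print(rbind(expected_small = round(1000 * prob, 2),
            observed_small = small_lib,
            expected_large = round(100000 * prob, 2),
            observed_large = large_lib))
```

With the small library the expected count for the rarest transcripts is below one tag, so they will often not be seen at all.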
![Page 11: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/11.jpg)
Why is this interesting?
• Sequencing approaches are generally better than hybridization in quality and you can also do more diverse experiments
• New sequencers make it possible to do this almost as cheaply as with hybridization – normal research groups can now buy the capacity of an old sequencing centre
• It is basically the technology of the future
![Page 12: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/12.jpg)
5 types of sequencing data for expression and functional studies
• Non-subtracted cDNA
• ESTs
• SAGE
• CAGE
• RNA-seq
![Page 13: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/13.jpg)
Why so many techniques?
• Historical reasons – technology development over time
• Some of these technologies are only for expression – others also give other information (and different information)
• Difference in costs - efficiency
![Page 14: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/14.jpg)
Non-subtracted cDNA
• Theoretically possible to sequence all cDNAs in a cell
• Very, very expensive!
• Hard to get true expression, since amplification is length-dependent
• Is it really necessary to have the whole cDNA to measure expression?
![Page 15: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/15.jpg)
Expressed sequence tags (ESTs)
Sequence from 5’ and 3’ ends – until the reverse transcriptase falls off
Cheaper than full-length cDNAs
Problems: many ESTs are simply trash – the result of over-enthusiastic sequencing
For longer genes, no coverage of the middle part
![Page 16: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/16.jpg)
How can we use ESTs?
• View the EST as a random sample from a pool of transcripts:
– The number of ESTs found from a transcript should be proportional to the concentration of that transcript in the cell = the expression
• How do we know which transcript an EST comes from?
![Page 17: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/17.jpg)
Unigene: clustering ESTs to “genes”
Back in the 90s, the idea was to use a lot of ESTs to find, and puzzle together, genes
The UNIGENE database is one of the outcomes of this. Slightly obsolete, but useful at times
Basically, it tries to cluster ESTs and cDNAs to functional units: “genes”
Bonus: we can use this to look at expression of these genes – because we can count ESTs from different libraries
![Page 18: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/18.jpg)
Thought exercise: How?
• Say that we have two lung EST libraries (= two collections of tags) from two patients, one of whom has lung cancer
• How can we prove that a given gene, like RARA, is significantly altered in expression in lung cancer?
• Think R! What do we need, and what tests should we use?
• 2 minutes with your neighbor
![Page 19: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/19.jpg)
“Electronic Northern blot”
• In a nutshell: fill in the following contingency table for a given gene

                 ESTs from tissue A   ESTs from tissue B
RARA
Rest of ESTs

Fisher exact test situation!
We can do this within unigene for single genes
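In R, filling in such a table and testing it could look roughly like the sketch below. `electronic_northern()` is a hypothetical helper written for this example (it is not part of UniGene or of any package), and the counts are invented.

```r
# Sketch of an "electronic Northern": a Fisher exact test on EST counts.
# electronic_northern() is a hypothetical helper, not a UniGene function.
electronic_northern <- function(gene_a, total_a, gene_b, total_b) {
  tab <- matrix(c(gene_a, gene_b,
                  total_a - gene_a, total_b - gene_b),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("gene", "rest of ESTs"),
                                c("tissue A", "tissue B")))
  list(table = tab, test = fisher.test(tab))
}

# Invented counts: 30 of 100000 ESTs in tissue A, 5 of 120000 in tissue B
res <- electronic_northern(30, 100000, 5, 120000)
print(res$table)
print(res$test$p.value)
```

The "rest of ESTs" row is the library size minus the ESTs for the gene, so that each column sums to the total number of ESTs in that library.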
![Page 20: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/20.jpg)
Side-story for non-life-scientists: Northern what?
• The Northern blot is a classical method for detecting RNA molecules
• Related to Southern and Western blot (DNA and protein detection methods)
![Page 21: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/21.jpg)
![Page 22: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/22.jpg)
However…
• An electronic Northern is just a clever name, although it has the same goals - finding RNAs
• It is nothing more than a statistical over-representation test of mRNAs, by use of ESTs
![Page 23: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/23.jpg)
Unigene:
• http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene
• …or just google for unigene
![Page 24: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/24.jpg)
EST hits from different tissues
Public microarray data (nice for comparison - but not important now)
Let’s look at the tissue constraints of human RARA…
![Page 25: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/25.jpg)
![Page 26: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/26.jpg)
Note that the sample sizes are very different!
1 tag out of 282332 is not the same as 1 tag out of 131488
![Page 27: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/27.jpg)
What is TPM?
TPM = Tags per million
A normalization to be able to compare libraries of different sizes. Used very often for tag-based expression.
“How many tags would my gene have if the sample size was 1 million?”
…so, 10^6 * (#tags in my gene)/(#total tags)
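As a small R sketch of this formula (the function name `tpm` is just chosen for this example), applied to the two library sizes mentioned earlier:

```r
# Tags per million: rescale a raw tag count to a library of one million tags
tpm <- function(tags, total_tags) 1e6 * tags / total_tags

# The two library sizes quoted above: one tag means different things in each
tpm(1, 282332)   # roughly 3.5 TPM
tpm(1, 131488)   # roughly 7.6 TPM
```

A single tag in the smaller library therefore corresponds to roughly twice the TPM of a single tag in the larger one.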
![Page 28: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/28.jpg)
Challenge
• Is the RARA gene significantly different in expression in eye vs blood?
![Page 29: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/29.jpg)
               ESTs from blood   ESTs from eye
Gene X         12                12
Rest of ESTs   124139-12         210756-12
![Page 30: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/30.jpg)
> a <- matrix(c(12, 12, 124139-12, 210756-12), nrow=2, byrow=TRUE)
> fisher.test(a)
Fisher's Exact Test for Count Data
data: a
p-value = 0.2078
# so, despite a nearly twofold difference in TPM, not significant
![Page 31: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/31.jpg)
So ESTs are fantastic?
…not really!
Sometimes useful, but
There are too few of them, and the libraries are very diverse
…and way too expensive to make routinely in a normal lab
Basically, ESTs are rarely used now, but the data is worth considering
![Page 32: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/32.jpg)
Modern tag sequencing
• SAGE, CAGE and RNA-seq
![Page 33: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/33.jpg)
Underlying idea:
• Only sequence as much as you need: 5', 3' or whole cDNA (in pieces)
• Map tags to known cDNAs or the genome (Thought exercise: what is the difference?)
![Page 34: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/34.jpg)
SAGE
![Page 35: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/35.jpg)
SAGE
• After sequencing:
– Mask out adapters and primers
– Make a database of all possible hits in mRNAs following the restriction site (white board demo)
– Map tags to this database, or the genome
• Mapping is surprisingly tricky
– We cannot use BLAST or BLAT alignments (too short sequences)
– Sequencing errors exist, as well as RNA editing
– Some species have very few known mRNAs
![Page 36: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/36.jpg)
Common approach
First identify all unique tags, and how many times we have seen them
AAAGATGCTGC 67
CAGTCGATCGAT 192
…
Correlate these tags with our gene database. Sum up all the tags for each gene
Do the expression analysis!
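The steps above can be sketched in R as follows. The reads and the tag-to-gene table (`tag2gene`) are invented for illustration; a real mapping table would be built from known mRNAs, which is the hard part.

```r
# Sketch of the "common approach": count unique tags, then sum per gene.
# The reads and the tag-to-gene table are invented for illustration.
reads <- c("AAAGATGCTGC", "CAGTCGATCGAT", "AAAGATGCTGC",
           "GGGTTTCCAAA", "CAGTCGATCGAT", "AAAGATGCTGC")

# Step 1: unique tags and how many times each was seen
tag_counts <- as.data.frame(table(tag = reads), responseName = "count",
                            stringsAsFactors = FALSE)

# Step 2: hypothetical mapping of tags to genes (a real one comes from mRNAs)
tag2gene <- data.frame(tag  = c("AAAGATGCTGC", "CAGTCGATCGAT", "GGGTTTCCAAA"),
                       gene = c("geneA", "geneB", "geneA"),
                       stringsAsFactors = FALSE)

# Step 3: attach genes to tags and sum the counts for each gene
mapped <- merge(tag_counts, tag2gene, by = "tag")
gene_counts <- aggregate(count ~ gene, data = mapped, FUN = sum)
print(gene_counts)
```

Tags that are missing from the mapping table are silently dropped by `merge` here; in a real analysis one would want to keep track of how many tags could not be assigned.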
![Page 37: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/37.jpg)
How can we analyze count data?
• The difference from microarrays is that we deal with integers
• The more counts for a gene, the more expressed it is - theoretically a linear relation. We are theoretically counting actual RNA molecules
• Very much like the EST case, we can make statistics based on contingency tables if we have two samples
![Page 38: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/38.jpg)
Data flow for tags
…is a bit too complex for this course to do in real life - takes time and requires programming (and a big computer)
Mapping of tags to genes is complex, and no standard solutions are adopted (yet)
Statistical analysis often involves making multiple Fisher exact tests - this involves some R programming
To get a feeling for the data, we will instead use a website to do these things for us
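For those curious, the "multiple Fisher exact tests" part might look roughly like this in R. It is a minimal sketch with invented counts and library sizes; the Benjamini-Hochberg correction is one common choice for handling many tests, not necessarily what the website uses.

```r
# Sketch: one Fisher exact test per gene, then multiple-testing correction.
# Counts are invented; columns are two libraries (e.g. normal and cancer).
counts <- data.frame(gene   = c("geneA", "geneB", "geneC", "geneD"),
                     normal = c(120, 15, 3, 40),
                     cancer = c(60, 45, 4, 38))
total_normal <- 50000  # assumed library sizes (total mapped tags)
total_cancer <- 52000

# Build a 2x2 table for gene i against the rest of each library and test it
test_gene <- function(i) {
  tab <- matrix(c(counts$normal[i], counts$cancer[i],
                  total_normal - counts$normal[i],
                  total_cancer - counts$cancer[i]),
                nrow = 2, byrow = TRUE)
  fisher.test(tab)$p.value
}
counts$p <- sapply(seq_len(nrow(counts)), test_gene)

# Many tests are run, so adjust the p-values (Benjamini-Hochberg here)
counts$p_adj <- p.adjust(counts$p, method = "BH")
print(counts)
```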
![Page 39: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/39.jpg)
Typical data after mapping:
Tag          Frequency
AAAAAAAAAA   173
AAAAAAAAAG   1
AAAAAAAAAT   1
AAAAAAAATA   2
AAAAAAACAA   1
AAAAAAACTA   2
AAAAAAATAA   1
We want to go from here to actual counts per gene: we will let a web system do this for us
![Page 40: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/40.jpg)
• In the data directory, I have collected two such files: SAGE_Colon…, corresponding to normal and cancer colon
• These are linked on the web page, and also here: http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
• Then, go to http://cgap.nci.nih.gov/SAGE/
• This page has many SAGE-related analyses. We will try the Digital Gene Expression Displayer (DGED)
![Page 41: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/41.jpg)
Challenge
• Using DGED
• Use the “Two of your files” option to use the two colon samples. Select “short tags”
• Try to understand what the statistical test does (accept defaults)
• What types of genes are “over-expressed” in colon i) cancer tissue vs normal tissue, ii) normal tissue vs cancer tissue
![Page 42: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/42.jpg)
Thought exercise
• What are the limitations with SAGE?
![Page 43: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/43.jpg)
Albin’s take
• We can only measure expression – the location of tags in genes has no functional meaning
• Dependent on gene annotation - we can map to the genome, but hard to interpret such data (what genes?)
• Compared to array data: very few standard analysis methods
• Limited sequencing depth
![Page 44: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/44.jpg)
5’ tagging
• Three methods that really do the same thing. The differences lie in chemistry, throughput and tag length
– CAGE
– 5’SAGE
– 5’ Oligo-capping
• We will use CAGE as an example (“Cap Analysis of Gene Expression”)
![Page 45: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/45.jpg)
Sequencing and mapping to the genome
CAGE
![Page 46: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/46.jpg)
CAGE vs …
• SAGE
– Conceptually the same thing, but you catch the 5’ end of the gene: the transcription start site and thereby the promoter, which is a functional entity
– Higher number of tags
– 5’ ends give functional data apart from expression
![Page 47: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/47.jpg)
Issues
• Only capped transcripts
– Some real transcripts are not capped
– Some capped transcripts are not full-length
• Associating 5’ ends with gene products is sometimes problematic
– We only know starts of genes, not the length
• Tag length is borderline for mapping - 20-21 bp
• Not clear how to define cutoffs - how many tags are a “real biological promoter”?
• Under-sampling: we miss a lot of promoters because there are so many of them
![Page 48: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/48.jpg)
Strengths
We are actually looking at promoters, not genes
Find novel promoters - sometimes within known genes
We can look at expression at promoter level - for instance define “tissue-specific” promoters
We can get a first unbiased look at where promoters are, and how much they are used in a given cell
![Page 49: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/49.jpg)
CAGE concepts
• The atomic unit in CAGE is the tag, mapped to the genome. The tag comes from a given experiment (and has a label)
• What positional information is the most relevant for analysis?
[Figure: the tag, 20-21 bp, with positions marked “?”]
![Page 50: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/50.jpg)
Only 5’ ends are interesting!
• …since the 20 bp length is only for mapping purposes.
• What if we have many tags overlapping one another? How can we represent this?
![Page 51: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/51.jpg)
Some soon-to-be-outdated terminology
![Page 52: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/52.jpg)
So…
• Unlike SAGE, CAGE can be viewed as a “barplot” on the genome, at nucleotide level
• How to cluster nearby CAGE tags to a meaningful “promoter” is an open problem
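To make the “barplot” idea concrete, here is a small R sketch. The tag table, the chromosome name and the 20 bp merge distance (`max_gap`) are all assumptions made for this example, and the distance-based grouping is only one naive option, not an established way to define promoters.

```r
# Sketch: reduce mapped CAGE tags to their 5' ends and count per position.
# The tag table and the 20 bp merge distance are assumptions for illustration.
tags <- data.frame(chr    = "chr11",
                   start  = c(100, 100, 102, 103, 103, 103, 250, 251),
                   end    = c(120, 120, 122, 123, 123, 123, 270, 271),
                   strand = c("+", "+", "+", "+", "+", "+", "+", "+"),
                   stringsAsFactors = FALSE)

# The 5' end is the start on the plus strand and the end on the minus strand
tags$five_prime <- ifelse(tags$strand == "+", tags$start, tags$end)

# "Barplot on the genome": number of tags starting at each nucleotide
pos_counts <- table(tags$five_prime)
print(pos_counts)
if (interactive()) barplot(pos_counts, xlab = "5' end position", ylab = "tags")

# Naive clustering: positions closer than max_gap join the same cluster
max_gap <- 20
pos <- sort(unique(tags$five_prime))
cluster <- cumsum(c(1, diff(pos) > max_gap))
tags_per_cluster <- tapply(as.vector(pos_counts), cluster, sum)
print(tags_per_cluster)
```

A real analysis would do this per chromosome and per strand; the sketch uses a single chromosome and only plus-strand tags to stay short.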
![Page 53: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/53.jpg)
Within a promoter…
• …we can do exactly the same Fisher exact tests as before (as SAGE or ESTs do for whole genes)
• What is the advantage/disadvantage of doing this on promoters instead of genes? (2min)
![Page 54: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/54.jpg)
The big answer: alternative promoters with different tissue usage
![Page 55: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/55.jpg)
CAGE resources
• Genomic element viewer (very similar to the UCSC browser)
– CAGE tags and cDNA landscapes
– Easiest by the links on fantom.gsc.riken.jp/3
![Page 56: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/56.jpg)
![Page 57: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/57.jpg)
![Page 58: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/58.jpg)
Clicking on CAGE clusters gives two options:
CAGE analysis viewer
CAGE basic viewer
![Page 59: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/59.jpg)
CAGE resources
• Basic CAGE viewer
– Comprehensive browser of CAGE tags and CAGE tag clusters, and library information
![Page 60: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/60.jpg)
![Page 61: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/61.jpg)
Challenge
• Look at the RARA gene in the MM5 assembly in the genomic elements viewer (browser) (so, NOT UCSC).
• How many alternative promoters does it have?
• Are any of these biased towards certain tissues?
![Page 62: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/62.jpg)
Some points
• Not that easy to say which of these promoters are “significant”
• Easy to get overwhelmed by numbers when counting tags
![Page 63: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/63.jpg)
Back to work…
• We can treat CAGE tag counts, or really TPMs, in a promoter as expression
• We can do the same analyses as in microarrays - including the typical heatmap
• We will do a small exploratory study of some CAGE data
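As a preview, here is a minimal R sketch of treating promoter TPMs as expression values. The count matrix, promoter names and library names are invented; the log transform with a pseudocount and the correlation distance are common choices, not the only ones.

```r
# Sketch: treat per-promoter TPM values as expression and cluster them.
# The count matrix is invented: rows are promoters, columns are libraries.
counts <- matrix(c(50,  5, 48,  4,
                    2, 30,  3, 28,
                   10, 12,  9, 11),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("promoter1", "promoter2", "promoter3"),
                                 c("liver1", "brain1", "liver2", "brain2")))

# Normalize each library (column) to tags per million
tpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Log-transform with a pseudocount, as is common before clustering
log_tpm <- log2(tpm + 1)

# Hierarchical clustering of libraries on correlation distance
lib_clust <- hclust(as.dist(1 - cor(log_tpm)))
print(lib_clust$labels[lib_clust$order])

# The typical heatmap (drawn only in an interactive session)
if (interactive()) heatmap(log_tpm, scale = "none")
```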
![Page 64: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/64.jpg)
• http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
![Page 65: Tag-based expression/function analysis](https://reader036.vdocument.in/reader036/viewer/2022062315/56814e4a550346895dbbd474/html5/thumbnails/65.jpg)
Walk-through of CAGE exercise
• Also at http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
• …together with updated slides
• And linked from the web page