Differential expression analysis of de novo assembled transcriptomes - Nadia Davidson

Differential expression analysis of de novo assembled transcriptomes. Nadia Davidson, Murdoch Childrens Research Institute. WEHI Bioinformatics Seminar, April 9th 2013

Upload: australian-bioinformatics-network

Post on 28-Nov-2014


DESCRIPTION

With next generation sequencing it has become possible to analyse the transcriptome of non-model organisms by performing a de novo assembly of RNA-seq reads. In particular, differential expression analysis can be undertaken without the need for a reference genome or annotation. While a number of studies have compared the relative merits of different transcriptome assembly programs, less attention has been given to the methodology for performing a differential expression analysis after the transcriptome has been assembled. Differential expression analysis on a de novo assembly faces several challenges, including mapping reads to transcripts, clustering similar transcripts and producing a summary of read counts for statistical testing. In particular, we have found that transcriptome assembly produces a much larger number of transcripts than would generally be expected. I will discuss the reasons for this and will assess the different strategies for taking the de novo assembled transcripts and producing a list of differentially expressed genes. I demonstrate that clustering transcripts into loci improves the interpretability of results and increases statistical power, but that results are very dependent on the choice of clustering. Most clustering tools are not optimised for de novo assembled sequences, and to address this, we are developing a method which uses hierarchical clustering to group transcripts based on shared reads. We also explore possible choices for mapping and summarising read counts to gene clusters.

TRANSCRIPT

Page 1: Differential expression analysis of de novo assembled transcriptomes - Nadia Davidson

Differential expression analysis of de novo assembled transcriptomes

Nadia Davidson

Murdoch Childrens Research Institute

WEHI Bioinformatics Seminar April 9th 2013

Page 2

RNA-Seq on non-model organisms

• RNA-Seq is a powerful technology for studying the transcriptome:
  – Gene annotation, splice variants
  – Estimating gene abundance and differential gene expression
• In particular, these things can be done for non-model organisms
  – Without the need for a gene annotation
  – Without the need for a reference genome
• By de novo assembling the transcriptome
  – But it has its challenges

Page 3

De novo assembly

Page 4

Transcriptome Assemblers
– For genome assembly a k-mer length must be selected and optimised for the coverage level.
– But transcriptomes have a high dynamic range of coverage.
– Solution 1: Use a genome assembler, perform multiple assemblies with different k-mer values and then merge the results. The Trans-ABySS/ABySS and Oases/Velvet approach.
– Solution 2: Write a dedicated assembler for transcriptomes using a single k-mer. The Trinity approach.
– Many studies compare the different assemblers.
– Few studies explore ways to do a differential expression analysis after the transcriptome has been assembled
  • Our aim

Page 5

Our RNA-Seq dataset

• One HiSeq lane: 160 million 100 bp paired-end reads from chicken samples (4 female, 4 male)
  – Already had the data from another project
  – Model organism
• Assembled the data using Trinity and Oases.
  – Starting with these assemblies we investigated how to perform a differential expression analysis
  – 300k and 600k transcripts from Trinity and Oases respectively.

Page 6

Q1. Why so many transcripts?

Page 7

Transcripts grow with reads

Francis et al., BMC Genomics 2013

Page 8

De Bruijn Graph Complexity

– Sequencing errors
– Heterozygosity
– Different isoforms
– Paralogs

[Figure: a De Bruijn graph in which AGGTCTGA is followed by a branch (ATTCGATG or ATTCCATG) that rejoins at ACCTGAGA. Reported transcripts: AGGTCTGA ATTCGATG ACCTGAGA and AGGTCTGA ATTCCATG ACCTGAGA.]
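The branch on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not how Trinity or Oases work internally: it builds a De Bruijn graph from the two sequences shown on the slide and lists every path through it. The value k = 8 and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, start, end):
    """List every simple path from start to end, spelled out as a sequence."""
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            # First node in full, then the last base of each following node.
            paths.append(path[0] + "".join(n[-1] for n in path[1:]))
            continue
        for nxt in sorted(graph.get(node, ())):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return sorted(paths)


# The two sequences from the slide differ by a single base (G vs. C).
allele_1 = "AGGTCTGA" + "ATTCGATG" + "ACCTGAGA"
allele_2 = "AGGTCTGA" + "ATTCCATG" + "ACCTGAGA"

k = 8  # hypothetical k-mer length
graph = de_bruijn_edges([allele_1, allele_2], k)
transcripts = enumerate_paths(graph, allele_1[:k - 1], allele_1[-(k - 1):])
for transcript in transcripts:
    print(transcript)
```

A single differing base is enough to open a bubble in the graph, and each route through the bubble is reported as a separate transcript. Sequencing errors, heterozygosity, isoforms and paralogs all create branches of this kind.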

Page 9

Simulation Study

A simulation study of de novo transcriptome assemblies. 17K genes. 100 million 100 bp paired-end reads.

“Even in the data sets simulated without alternative splicing, no sequencing error, no polymorphism and no paralogs for 7.87% of the genes many isoforms were erroneously inferred (ranging from 2 to 335 isoforms per gene)”

Vijay et al., Molecular Ecology, 2012. doi:10.1111/mec.12014. Supplementary Fig. 7

Page 10

Variation in coverage
• Across transcripts

[Figure: reported transcripts at different coverage levels.] Different coverage could mean different contigs are assembled for each k-mer

Page 11

Other transcripts
• We get about 4.3 transcripts for each known chicken gene (Ensembl) in our Trinity assembly and 13.4 for Oases
• What are the other transcripts?

[Figure: transcript categories (known genes, novel in genome, novel not in genome) for the Trinity assembly and the Oases assembly]

Page 12

Abundance of Gene Type from ENCODE

[Figure from S Djebali et al. Nature 000, 1-8 (2012) doi:10.1038/nature11233, with > 100 million reads]

Page 13

Our novel genes

Trinity assembly

Page 14

Q2. Isoform or gene-level analysis?

Page 15

The Cons

Isoforms
• List may be too long:
  – Difficult to interpret
  – Computationally expensive
  – Larger correction for multiple testing
• Not obvious how to assign ambiguously aligned reads
  – Can lead to double counting if ignored, or
  – Less power if reads are split between transcripts
• Not all transcripts represent different isoforms anyway

Genes
• Not sensitive to differential splicing
• Not obvious how transcripts should be clustered into genes

Page 16

Q3. How to cluster transcripts into genes

Page 17

Which clusters to use?

• This is not an obvious problem to solve:
  – group transcripts which share sequence, i.e. only differ by splicing, SNPs or indels.
  – but place paralogs in a different cluster
  – This is complicated by the quality of the assembly, e.g.
    • Gene A and Gene A: two incomplete sequences from the same gene. Cluster ✓
    • Gene A and Gene B: low-coverage repeat sequence past the UTR. Do not cluster ✗

Page 18

Clustering Options

• What you can use:
  – The locus/component information from the assembler.
    • General form of a transcript name from the assembler: <loci>_<transcript>_<other info such as length>
  – Sequence similarity clustering such as CD-HIT, Blastclust etc.

We tested the accuracy of these clustering methods on our assemblies

“Truth” clusters were determined by matching transcripts to RefSeq genes using blat (98% identity over 200 bases)
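As an illustration of the first option, the sketch below groups transcripts by the locus part of their names. The two name patterns are assumptions based on the naming used by Trinity and Oases releases from around the time of the talk (`comp<N>_c<N>_seq<N>` and `Locus_<N>_Transcript_<i>/<n>_...`). They are not given on the slide, and other versions may name transcripts differently.

```python
import re
from collections import defaultdict

# Assumed naming patterns (not shown in the talk); adjust for your assembler version.
TRINITY_NAME = re.compile(r"^(comp\d+_c\d+)_seq\d+")
OASES_NAME = re.compile(r"^(Locus_\d+)_Transcript_\d+")


def locus_of(transcript_id):
    """Return the assembler's locus/component label for a transcript name."""
    for pattern in (TRINITY_NAME, OASES_NAME):
        match = pattern.match(transcript_id)
        if match:
            return match.group(1)
    # Unknown format: leave the transcript in a cluster of its own.
    return transcript_id


def clusters_from_names(transcript_ids):
    """Group transcript names by their locus label."""
    clusters = defaultdict(list)
    for tid in transcript_ids:
        clusters[locus_of(tid)].append(tid)
    return dict(clusters)


# Hypothetical transcript names for illustration only.
example_ids = [
    "comp10_c0_seq1",
    "comp10_c0_seq2",
    "comp11_c0_seq1",
    "Locus_3_Transcript_1/2_Confidence_0.667_Length_812",
    "Locus_3_Transcript_2/2_Confidence_0.333_Length_640",
]
clusters = clusters_from_names(example_ids)
for locus, members in clusters.items():
    print(locus, len(members))
```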

Page 19

How we assess clustering

Scored based on correct/incorrect pairwise groupings (as for the Rand index).

Example: true positives = 2, true negatives = 4, false positives = 2, false negatives = 2

TP = 2, TN = 6, FP = 0, FN = 2. False negatives indicate “under clustering”

TP = 4, TN = 0, FP = 6, FN = 0. False positives indicate “over clustering”

[Figure axes: “over clustered” and “under clustered”]
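The pairwise scoring can be sketched as follows. The five transcripts and their cluster labels are hypothetical: the slide's figure is not recoverable, so the groupings below were chosen only because they reproduce the three sets of counts quoted above.

```python
from itertools import combinations


def pairwise_scores(truth, predicted):
    """Count pairwise agreements between two clusterings.

    truth, predicted: dict mapping transcript -> cluster label.
    Returns (TP, TN, FP, FN) over all pairs of transcripts.
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_truth = truth[a] == truth[b]
        same_pred = predicted[a] == predicted[b]
        if same_truth and same_pred:
            tp += 1  # correctly share a cluster
        elif same_truth:
            fn += 1  # incorrectly split ("under clustering")
        elif same_pred:
            fp += 1  # incorrectly grouped ("over clustering")
        else:
            tn += 1  # correctly kept apart
    return tp, tn, fp, fn


# Hypothetical truth: transcripts t1-t3 belong to gene A, t4-t5 to gene B.
truth = {"t1": "A", "t2": "A", "t3": "A", "t4": "B", "t5": "B"}

mixed = {"t1": 1, "t2": 1, "t3": 2, "t4": 2, "t5": 2}
under = {"t1": 1, "t2": 1, "t3": 2, "t4": 3, "t5": 3}
over = {"t1": 1, "t2": 1, "t3": 1, "t4": 1, "t5": 1}

print(pairwise_scores(truth, mixed))  # (2, 4, 2, 2)
print(pairwise_scores(truth, under))  # (2, 6, 0, 2)
print(pairwise_scores(truth, over))   # (4, 0, 6, 0)
```

Five transcripts give ten pairs, so the four counts sum to ten in each case.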

Page 20

Trinity Assembly Clustering

335,377 transcripts

TP = true positives: number of pairs of transcripts which correctly share a cluster
FN = false negatives: number of pairs of transcripts which are incorrectly split

[Figure: axes “over clustered” and “under clustered”, with the number of clusters for each method. Legend: ✕ Ideal, CD-HIT-EST, Trinity]

Page 21

Oases Assembly Clustering

540,933 transcripts

[Figure: axes “over clustered” and “under clustered”, with the number of clusters and the maximum number of transcripts in a cluster for each method. Legend: ✕ Ideal, CD-HIT-EST, Oases]

Page 22

Can we do better?

• CD-HIT-EST uses only the sequence information, but we also have the reads
  – We could down-weight regions which are expressed at a low level
  – We could separate sequences which show different expression between sample groups
  – Using paired-end reads gives extra leverage to group transcripts

• We are developing a tool which will take multi-mapped reads and output clusters along with counts for each cluster

Page 23

The idea
– Multi-map reads to the assembly
– Separate transcripts into super-clusters
  • Transcripts are grouped if they share ANY reads with another transcript
– For each super-cluster
  • For each pair of transcripts, calculate the distance
  • To do: incorporate sample information too
– Hierarchically cluster the transcripts using the distance metric
– Stop when the distance between grouped transcripts is too large – this threshold is a parameter of the algorithm
– To do: output the counts for each cluster

Rab – Number of reads which map to transcript a and b
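The super-cluster step amounts to finding connected components, where two transcripts are linked if any read maps to both. Below is a minimal sketch using a union-find structure. The input format (a dict from read ID to the transcripts it aligns to) and all names are assumptions for illustration, not the interface of the tool under development.

```python
def super_clusters(read_alignments):
    """Group transcripts that are connected through shared reads.

    read_alignments: dict mapping read id -> list of transcripts it maps to.
    Returns a sorted list of sorted transcript lists, one per super-cluster.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for transcripts in read_alignments.values():
        transcripts = list(transcripts)
        for t in transcripts:
            find(t)  # register transcripts even if they share no reads
        for other in transcripts[1:]:
            union(transcripts[0], other)

    groups = {}
    for t in parent:
        groups.setdefault(find(t), set()).add(t)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical multi-mapping result.
reads = {
    "r1": ["tA", "tB"],
    "r2": ["tB", "tC"],
    "r3": ["tD"],
    "r4": ["tD", "tE"],
}
print(super_clusters(reads))  # [['tA', 'tB', 'tC'], ['tD', 'tE']]
```

Note that tA and tC end up together although no single read maps to both, because each shares a read with tB. The later hierarchical step is what can separate them again.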

Page 24

Step 1

Page 25

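Step 2 – make a distance matrix

Distance = (formula shown as an image on the slide), R = reads

[Figure: pairwise distances between the transcripts from Step 1, each labelled with the number of shared reads: 0 (R=2), 0.5 (R=1), 0.5 (R=1), 0 (R=2), 0 (R=2), 0 (R=2). After the closest transcripts are merged: 0 (R=2), 0.5 (R=1), 0 (R=2).]

Update:
R2’2’ = R22 + R33 - R23
R12’ = max(R12, R23)
Recalculate the distance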

Page 26

[Figure: dendrogram with merges at distance = 0, distance = 0.5 and distance = 1]

Cutting the tree at 0.5 or less would give the correct clustering
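The clustering loop can be sketched as below. Because the distance formula did not survive in the transcript, the sketch assumes distance = 1 - R_ab / min(R_aa, R_bb), which is consistent with the values shown (0 when both of two reads are shared, 0.5 when one is) but is an assumption. It also keeps the full read set of each cluster and recounts shared reads after every merge instead of applying the slide's update rule. The merged read count is then the size of the union, which equals R22 + R33 - R23. A merge is allowed only while the distance is strictly below the threshold, another assumption chosen so that a cut at 0.5 leaves the 0.5 merge undone.

```python
from itertools import combinations


def distance(reads_a, reads_b):
    """Assumed distance: 1 - shared reads / reads in the smaller cluster.

    Both read sets must be non-empty.
    """
    shared = len(reads_a & reads_b)
    return 1.0 - shared / min(len(reads_a), len(reads_b))


def cluster_by_shared_reads(transcript_reads, threshold):
    """Agglomerative clustering of transcripts on shared reads.

    transcript_reads: dict mapping transcript -> set of read ids.
    Returns a sorted list of clusters, each a tuple of transcript names.
    """
    clusters = {(name,): set(reads) for name, reads in transcript_reads.items()}
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        if distance(clusters[a], clusters[b]) >= threshold:
            break  # remaining clusters are too far apart
        merged_reads = clusters.pop(a) | clusters.pop(b)
        clusters[tuple(sorted(a + b))] = merged_reads
    return sorted(clusters)


# Hypothetical read assignments: t1 and t2 share both reads, t3 shares one.
transcript_reads = {
    "t1": {"r1", "r2"},
    "t2": {"r1", "r2"},
    "t3": {"r2", "r3"},
    "t4": {"r4", "r5"},
}
print(cluster_by_shared_reads(transcript_reads, threshold=0.5))
print(cluster_by_shared_reads(transcript_reads, threshold=0.7))
```

With a threshold of 0.5 only t1 and t2 are merged. Raising it to 0.7 also pulls in t3, which shows how the threshold trades under clustering against over clustering.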

Page 27

How do we do?

[Figure: two panels (Trinity assembly, Oases assembly), each with axes “over clustered” and “under clustered”. Legend: ✕ Ideal; CD-HIT-EST; Oases/Trinity; Ours – dist.=0.1; Ours – dist.=0.3; Ours – dist.=0.5; Ours – dist.=0.7; Ours – dist.=0.9]

Page 28

Impact on differential expression (DE)
• To assess this we:
  – Mapped reads back to all transcripts (“best” mapping - bowtie)
  – Counted the reads which overlapped a transcript (samtools)
  – Added up all the counts for each cluster
  – Performed a DE analysis in edgeR for males vs. females
  – Compared against a “truth” DE list
    • Obtained from a genome-based analysis on RefSeq genes (5 thousand genes)
    • True positives – false discovery rate < 0.05
    • RefSeq genes were identified in the de novo assembly
    • Non-identified clusters were excluded
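The step of adding up counts for each cluster can be sketched as below. The data structures (per-transcript lists of per-sample counts and a transcript-to-cluster map) and all values are hypothetical. The resulting cluster-by-sample table is the kind of count matrix that would then be passed to edgeR.

```python
def sum_counts_per_cluster(transcript_counts, transcript_to_cluster):
    """Add up per-sample read counts over all transcripts in each cluster.

    transcript_counts: dict mapping transcript -> list of counts, one per sample.
    transcript_to_cluster: dict mapping transcript -> cluster label.
    Returns a dict mapping cluster label -> list of summed counts.
    """
    cluster_counts = {}
    for transcript, counts in transcript_counts.items():
        cluster = transcript_to_cluster[transcript]
        if cluster not in cluster_counts:
            cluster_counts[cluster] = list(counts)
        else:
            cluster_counts[cluster] = [
                total + count for total, count in zip(cluster_counts[cluster], counts)
            ]
    return cluster_counts


# Hypothetical counts for four samples (two female, two male).
transcript_counts = {
    "t1": [10, 12, 3, 4],
    "t2": [5, 6, 1, 0],
    "t3": [0, 2, 20, 18],
}
transcript_to_cluster = {"t1": "cluster_1", "t2": "cluster_1", "t3": "cluster_2"}

cluster_counts = sum_counts_per_cluster(transcript_counts, transcript_to_cluster)
print(cluster_counts)
```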

Page 29

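DE results - Oases

[Figure: DE results for each clustering of the Oases assembly, shown next to the clustering accuracy plot with axes “over clustered” and “under clustered”. Legend: ✕ Ideal; CD-HIT-EST; Oases/Trinity; Ours – dist.=0.1; Ours – dist.=0.3; Ours – dist.=0.5; Ours – dist.=0.7; Ours – dist.=0.9]

Conclusion: Better to “under” cluster than “over” cluster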

Page 30

DE results - Trinity

[Figure: DE results for each clustering of the Trinity assembly, with axes “over clustered” and “under clustered”. Legend: ✕ Ideal; CD-HIT-EST; Oases/Trinity; Ours – dist.=0.1; Ours – dist.=0.3; Ours – dist.=0.5; Ours – dist.=0.7; Ours – dist.=0.9]

Page 31

Q4. What is the best way to go from reads to counts?

Page 32

Approaches
1. Do what we did before – add up counts for each cluster
2. Trinity and Oases suggest:
  – Multi-mapping reads to transcripts
  – Then use a program which can deal with ambiguously mapped reads
    • RSEM can take the clustering as input and return gene-level counts.
3. What people have actually done:
  – Select a set of representative transcripts (i.e. the longest one in each cluster)
  – Map reads using their favorite mapper.
  – Count the number of reads which overlap the transcripts

e.g. Sandmann et al., Genome Biology, 2011 12:R76
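For the third approach, choosing the longest transcript in each cluster as its representative might look like the sketch below. The transcript lengths, the cluster labels and the alphabetical tie-breaking rule are assumptions for illustration.

```python
def longest_representatives(transcript_lengths, transcript_to_cluster):
    """Pick the longest transcript in each cluster as its representative.

    Ties are broken by taking the alphabetically first transcript name.
    """
    best = {}
    for transcript in sorted(transcript_lengths):
        cluster = transcript_to_cluster[transcript]
        current = best.get(cluster)
        if current is None or transcript_lengths[transcript] > transcript_lengths[current]:
            best[cluster] = transcript
    return best


# Hypothetical transcript lengths in bases.
transcript_lengths = {"t1": 1200, "t2": 2500, "t3": 800, "t4": 800}
transcript_to_cluster = {
    "t1": "cluster_1",
    "t2": "cluster_1",
    "t3": "cluster_2",
    "t4": "cluster_2",
}

representatives = longest_representatives(transcript_lengths, transcript_to_cluster)
print(representatives)
```

Reads are then mapped to the representatives only, so reads from regions that are absent from the longest transcript are lost. That is one reason to compare this approach with summing counts over the whole cluster.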

Page 33

The alternatives

Use the same clustering for all three approaches:
1. “Best” map reads (bowtie) → count reads overlapping transcripts (samtools) → add counts in a cluster (own script)
2. Multi-map reads (Trinity script) → get counts (RSEM)
3. Select representative transcript (longest) → best-map reads (bowtie) → count reads overlapping transcripts (samtools)

Each approach gives gene-level counts → edgeR → gene-level DE results

Page 34

Results

Oases Assembly Trinity Assembly

Difference between methods is small - could probably do any of them

Page 35

Conclusions
• Q1. Why so many transcripts?
  – Expect de novo transcriptome assemblies to produce many more transcripts than a typical annotation.
  – De novo transcriptome assemblies must deal with a number of issues which make full-length transcript assembly, without redundancy, difficult.
  – Sequencing to a high depth may give you more intergenic non-coding transcripts.
• Q2. Isoform or gene-level analysis?
  – Doing a differential expression analysis on gene-level counts has a number of advantages over isoform-level counts

Page 36

Conclusions cont.

• Q3. How to cluster transcripts into genes?
  – We found that Trinity’s clustering was good, but the clusterings from Oases and CD-HIT-EST were poor
  – We are developing a tool for clustering which already works better than the alternatives based on differential expression results
• Q4. What is the best way to go from reads to counts?
  – We compared three alternatives for mapping/abundance estimation
  – Results were similar for all three
  – Getting the clustering correct has a bigger impact on the differential expression results than other steps in the pipeline

Page 37

Future Work
• We have only looked at one RNA-Seq dataset.
  – Would like to look at RNA-Seq from at least two other datasets to ensure that the conclusions drawn here also hold in general:
    • Different species (all model organisms)
    • Different read depths
• Our clustering tool:
  – Would like to output the gene-level counts for each cluster.
    • Then compare to other abundance estimation approaches.
  – Would like to incorporate differences in expression between groups to improve the clustering
• More investigation into the pipeline methods
  – E.g. mapping

Page 38

Acknowledgements

Chicken RNA-Seq data from: Katie Ayers (MCRI), Craig Smith (MCRI)

MCRI Bioinformatics: Alicia Oshlack, The Bioinformatics Group

VLSCI, AGRF

[Image: Red Jungle Fowl (credit: NHGRI)]

Page 39

Extra Slides

Page 40

Trinity and Oases compared

[Figure panels: Oases; Trinity – version from the start of 2012; Trinity – version from the end of 2012]

frac_match = length of the longest matching assembled transcript / “true” length of the transcript

Page 41

Number of genes to transcripts (ordered by DE)

Yeast

Chicken (Trinity)

Chicken (Oases)