differential expression analysis of de novo assembled transcriptomes - nadia davidson

Differential expression analysis of de novo assembled transcriptomes

Nadia Davidson

Murdoch Childrens Research Institute

WEHI Bioinformatics Seminar April 9th 2013

RNA-Seq on non-model organisms

• RNA-Seq is a powerful technology for studying the transcriptome:
  – Gene annotation, splice variants
  – Estimating gene abundance and differential gene expression
• In particular, these things can be done for non-model organisms
  – Without the need for a gene annotation
  – Without the need for a reference genome
• By de novo assembling the transcriptome
  – But it has its challenges

De novo assembly

Transcriptome Assemblers
– For genome assembly, a k-mer length must be selected and optimised for the coverage level.
– But transcriptomes have a high dynamic range of coverage.
– Solution 1: Use a genome assembler, perform multiple assemblies with different k-mer values, and then merge the results. The Trans-ABySS/ABySS and Oases/Velvet approach.
– Solution 2: Write a dedicated assembler for transcriptomes using a single k-mer. The Trinity approach.
– Many studies compare the different assemblers.
– Few studies explore ways to do a differential expression analysis after the transcriptome has been assembled
  • Our aim

Our RNA-Seq dataset

• One HiSeq lane: 160 million 100 bp paired-end reads from chicken samples (4 female, 4 male)
  – Already had the data from another project
  – Model organism
• Assembled the data using Trinity and Oases.
  – Starting with these assemblies, we investigated how to perform a differential expression analysis
  – 300k and 600k transcripts from Trinity and Oases respectively.

Q1. Why so many transcripts?

Transcripts grow with reads (Francis et al., BMC Genomics 2013)

– Sequencing errors
– Heterozygosity
– Different isoforms
– Paralogs

[Figure: graph in which AGGTCTGA branches to ATTCGATG or ATTCCATG, both followed by ACCTGAGA]

Reported Transcripts:
AGGTCTGA ATTCGATG ACCTGAGA
AGGTCTGA ATTCCATG ACCTGAGA
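The branching sequences on this slide can be mimicked with a toy graph. This is only an illustrative sketch: the nodes are whole sequence blocks rather than real k-mers, and `enumerate_paths` is a hypothetical helper, not a function from Trinity, Oases or any other assembler. It shows how a single-base difference (G vs. C) is enough to make an assembler report two transcripts.

```python
# Toy graph: each node is a sequence block, each edge a possible successor.
# The two middle blocks differ by one base, which creates a two-way branch.
graph = {
    "AGGTCTGA": ["ATTCGATG", "ATTCCATG"],
    "ATTCGATG": ["ACCTGAGA"],
    "ATTCCATG": ["ACCTGAGA"],
    "ACCTGAGA": [],
}


def enumerate_paths(graph, start):
    """Return every path from `start` to a node with no successors."""
    successors = graph[start]
    if not successors:
        return [[start]]
    paths = []
    for nxt in successors:
        for tail in enumerate_paths(graph, nxt):
            paths.append([start] + tail)
    return paths


# Every path through the branch becomes one reported transcript.
transcripts = [" ".join(path) for path in enumerate_paths(graph, "AGGTCTGA")]
for transcript in transcripts:
    print(transcript)
```

With several independent branches along one gene, the number of paths multiplies, which is one way the transcript count can grow far beyond the gene count.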

De Bruijn Graph Complexity

Vijay et al., Molecular Ecology, 2012. doi:10.1111/mec.12014. Supplementary Fig. 7

A simulation study of de novo transcriptome assemblies. 17K genes. 100 million, 100bp paired-end reads

“Even in the data sets simulated without alternative splicing, no sequencing error, no polymorphism and no paralogs for 7.87% of the genes many isoforms were erroneously inferred (ranging from 2 to 335 isoforms per gene)”

Simulation Study

Variation in coverage
• Across transcripts
• Different coverage could mean different contigs assembled for each k-mer

[Figure: reported transcripts]

Other transcripts
• We get about 4.3 transcripts for each known chicken gene (Ensembl) in our Trinity assembly and 13.4 for Oases
• What are the other transcripts?

[Figure: transcripts classed as known genes, novel in genome, or novel not in genome, for the Trinity assembly and the Oases assembly]

S Djebali et al. Nature 000, 1-8 (2012) doi:10.1038/nature11233

[Figure: abundance of gene type from ENCODE, with > 100 million reads, shown alongside our novel genes from the Trinity assembly]

Q2. Isoform or gene-level analysis?

The Cons

Isoforms
• List may be too long:
  – Difficult to interpret
  – Computationally expensive
  – Larger correction for multiple testing
• Not obvious how to assign ambiguously aligned reads
  – Can lead to double counting if ignored, or
  – Less power if reads are split between transcripts
• Not all transcripts represent different isoforms anyway

Genes
• Not sensitive to differential splicing
• Not obvious how transcripts should be clustered into genes

Q3. How to cluster transcripts into genes

Which clusters to use?

• This is not an obvious problem to solve:
  – Group transcripts which share sequence, i.e. only differ by splicing, SNPs or indels.
  – But place paralogs in a different cluster.
  – This is complicated by the quality of the assembly, e.g.:

[Figure: Gene A and Gene A, two incomplete sequences from the same gene: cluster ✓]

[Figure: Gene A and Gene B, with a low-coverage repeat sequence past the UTR: do not cluster ✗]

Clustering Options

• What you can use:
  – The locus/component information from the assembler.
    • General form of a transcript name from the assembler: <loci>_<transcript>_<other info such as length>
  – Sequence similarity clustering such as CD-HIT, Blastclust etc.
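As a rough sketch of the first option, the locus part of each transcript name can be pulled out with a regular expression and used as the cluster ID. The two name patterns below are assumptions about what Trinity-style and Oases-style FASTA headers looked like at the time (for example `comp10_c0_seq1` and `Locus_3_Transcript_1/2_...`); check them against your own assembly before relying on them. `locus_of` and `cluster_by_locus` are hypothetical helper names.

```python
import re
from collections import defaultdict

# ASSUMED name patterns; real headers depend on the assembler and its version.
LOCUS_PATTERNS = [
    re.compile(r"^(comp\d+_c\d+)_seq\d+"),       # Trinity-style (assumed)
    re.compile(r"^(Locus_\d+)_Transcript_\d+"),  # Oases-style (assumed)
]


def locus_of(name):
    """Return the locus/component prefix of a transcript name."""
    for pattern in LOCUS_PATTERNS:
        match = pattern.match(name)
        if match:
            return match.group(1)
    raise ValueError(f"unrecognised transcript name: {name}")


def cluster_by_locus(names):
    """Group transcript names by their assembler-assigned locus."""
    clusters = defaultdict(list)
    for name in names:
        clusters[locus_of(name)].append(name)
    return dict(clusters)


example_names = [
    "comp10_c0_seq1",
    "comp10_c0_seq2",
    "comp11_c0_seq1",
    "Locus_3_Transcript_1/2_Confidence_0.667_Length_812",
    "Locus_3_Transcript_2/2_Confidence_0.333_Length_640",
]
clusters = cluster_by_locus(example_names)
for locus, members in clusters.items():
    print(locus, len(members))
```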

We tested the accuracy of these clustering methods on our assemblies

“Truth” clusters were determined by matching transcripts to RefSeq genes using blat (98% identity over 200 bases)

[Figure axis labels: “over clustered”, “under clustered”]

How we assess clustering
Scored based on correct/incorrect pairwise groupings (as for the Rand index).
• Example: true positives = 2, true negatives = 4, false positives = 2, false negatives = 2
• TP = 2, TN = 6, FP = 0, FN = 2: false negatives indicate “under clustering”
• TP = 4, TN = 0, FP = 6, FN = 0: false positives indicate “over clustering”
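The pairwise scoring can be sketched as below. The slide's figure did not survive, so the five-transcript clusterings here are assumptions, chosen only because they reproduce the three sets of counts quoted above; `pairwise_counts` is a hypothetical helper name.

```python
from itertools import combinations


def pairwise_counts(truth, predicted):
    """Count TP, TN, FP and FN over all unordered pairs of transcripts.

    `truth` and `predicted` map each transcript to a cluster label.
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_truth = truth[a] == truth[b]
        same_pred = predicted[a] == predicted[b]
        if same_truth and same_pred:
            tp += 1  # correctly share a cluster
        elif same_truth:
            fn += 1  # incorrectly split ("under clustering")
        elif same_pred:
            fp += 1  # incorrectly joined ("over clustering")
        else:
            tn += 1  # correctly kept apart
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# ASSUMED example: the true clusters are {a, b, c} and {d, e}.
truth = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
mixed = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}  # c placed with d and e
under = {"a": 1, "b": 1, "c": 3, "d": 2, "e": 2}  # c split off on its own
over = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}   # everything in one cluster

for name, predicted in [("mixed", mixed), ("under", under), ("over", over)]:
    print(name, pairwise_counts(truth, predicted))
```

Five transcripts give ten pairs, so the four counts always sum to ten in this example.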

Trinity Assembly Clustering

TP = true positives: number of pairs of transcripts which correctly share a cluster
FN = false negatives: number of pairs of transcripts which are incorrectly split

335,377 transcripts

[Figure: Trinity assembly clustering, annotated with the number of clusters; axis label: “under clustered”; legend: ✕ Ideal, CD-HIT-EST, Trinity]

Oases Assembly Clustering

540,933 transcripts

[Figure: Oases assembly clustering, annotated with the number of clusters and the max transcripts in a cluster; axis labels: “over clustered”, “under clustered”; legend: ✕ Ideal, CD-HIT-EST, Oases]

Can we do better?

• CD-HIT-EST uses only the sequence information, but we also have the reads
  – We could down-weight regions which are expressed at a low level
  – We could separate sequences which show different expression between sample groups
  – Using paired-end reads gives extra leverage to group transcripts
• We are developing a tool which will take multi-mapped reads and output clusters along with counts for each cluster

The idea
– Multi-map reads to the assembly
– Separate transcripts into super-clusters
  • Transcripts are grouped if they share ANY reads with another transcript
– For each super-cluster
  • For each pair of transcripts, calculate the distance
  • To do: incorporate sample information too
– Hierarchically cluster the transcripts using the distance metric
– Stop when the distance between grouped transcripts is too large – this threshold is a parameter of the algorithm
– To do: output the counts for each cluster
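The super-cluster step amounts to finding connected groups of transcripts linked by shared reads. Below is a minimal sketch using a union-find structure. It is not the tool's actual implementation; `super_clusters` and the `read_hits` input layout (read ID mapped to a non-empty list of transcript names) are assumptions made for illustration.

```python
from collections import defaultdict


def super_clusters(read_hits):
    """Group transcripts linked, directly or indirectly, by shared reads.

    `read_hits` maps a read ID to the non-empty list of transcripts
    that the read multi-maps to (assumed input layout).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    for transcripts in read_hits.values():
        first = transcripts[0]
        find(first)  # register transcripts that share no reads at all
        for other in transcripts[1:]:
            union(first, other)

    groups = defaultdict(set)
    for transcript in parent:
        groups[find(transcript)].add(transcript)
    return sorted(sorted(group) for group in groups.values())


example_hits = {
    "read1": ["t1", "t2"],
    "read2": ["t2", "t3"],  # links t3 to t1 through t2
    "read3": ["t4"],
    "read4": ["t4", "t5"],
}
print(super_clusters(example_hits))
```

Splitting the problem this way means pairwise distances only need to be computed within each super-cluster rather than across the whole assembly.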

R_ab – number of reads which map to both transcript a and transcript b

Step 1

Step 2 – make a distance matrix

[Figure: graph of transcripts; each edge is labelled with its distance and with R, the number of shared reads: 0 (R=2), 0.5 (R=1), 0.5 (R=1), 0 (R=2), 0 (R=2), 0 (R=2)]

Step 1

[Figure: edge labels 0 (R=2), 0.5 (R=1), 0 (R=2)]

Update:
– R_2'2' = R_22 + R_33 - R_23
– R_12' = max(R_12, R_13)
– Recalculate the distance

[Figure: tree with levels at Distance = 0, Distance = 0.5 and Distance = 1]

Cutting the tree at 0.5 or less would give the correct clustering
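The merge-and-update loop can be sketched as follows. The distance formula itself did not survive on the slide, so this sketch assumes distance = 1 - R_ab / min(R_aa, R_bb), which is consistent with the edge labels shown (R=2 gives 0, R=1 gives 0.5 when each transcript has two reads) but is not confirmed by the text. The update for reads shared with a third cluster takes the larger of the two pre-merge values, which is our reading of the max rule. The function names and the example data are hypothetical, and every transcript is assumed to have at least one read.

```python
from itertools import combinations


def distance(r_ab, r_aa, r_bb):
    """ASSUMED form of the distance: 1 - R_ab / min(R_aa, R_bb)."""
    return 1.0 - r_ab / min(r_aa, r_bb)


def cluster_transcripts(totals, shared, threshold):
    """Greedy agglomerative clustering of one super-cluster.

    totals: {transcript: R_aa}, reads mapping to each transcript.
    shared: {(a, b): R_ab}, reads mapping to both a and b.
    Merging stops once the closest pair is further apart than `threshold`.
    """
    # Each cluster is a sorted tuple of transcript names.
    totals = {(name,): n for name, n in totals.items()}
    shared = {frozenset((t,) for t in pair): n for pair, n in shared.items()}

    def pair_distance(pair):
        a, b = pair
        return distance(shared.get(frozenset(pair), 0), totals[a], totals[b])

    while len(totals) > 1:
        a, b = min(combinations(sorted(totals), 2), key=pair_distance)
        if pair_distance((a, b)) > threshold:
            break
        merged = tuple(sorted(a + b))
        r_ab = shared.pop(frozenset((a, b)), 0)
        # Reads in the merged cluster: R_aa + R_bb - R_ab.
        new_total = totals.pop(a) + totals.pop(b) - r_ab
        for c in totals:
            # Reads shared with a third cluster c: max(R_ac, R_bc).
            r_mc = max(shared.pop(frozenset((a, c)), 0),
                       shared.pop(frozenset((b, c)), 0))
            if r_mc:
                shared[frozenset((merged, c))] = r_mc
        totals[merged] = new_total
    return sorted(totals)


# Hypothetical data: four transcripts with two reads each.
example_totals = {"t1": 2, "t2": 2, "t3": 2, "t4": 2}
example_shared = {("t1", "t2"): 2, ("t2", "t3"): 1, ("t3", "t4"): 2}

print(cluster_transcripts(example_totals, example_shared, threshold=0.3))
print(cluster_transcripts(example_totals, example_shared, threshold=0.6))
```

In this example the two pairs at distance 0 merge first; the resulting clusters sit at distance 0.5 from each other, so a threshold of 0.3 keeps them apart and a threshold of 0.6 joins them.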

How do we do?

[Figure: clustering accuracy for the Trinity assembly and the Oases assembly; axis label: “under clustered”; legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]

Impact on differential expression (DE)
• To assess this we:
  – Mapped reads back to all transcripts (“best” mapping - bowtie)
  – Counted the reads which overlapped a transcript (samtools)
  – Added up all the counts for each cluster
  – Performed a DE analysis in edgeR for males vs. females
  – Compared against a “truth” DE list
    • Obtained from a genome-based analysis on RefSeq genes (5 thousand genes)
    • True positives – false discovery rate < 0.05
    • RefSeq genes were identified in the de novo assembly
    • Non-identified clusters were excluded
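The "added up all the counts for each cluster" step can be sketched as below. The input layout (one list of per-sample counts per transcript, plus a transcript-to-cluster mapping) and the name `cluster_counts` are assumptions for illustration; the original script is not shown in the slides.

```python
def cluster_counts(transcript_counts, transcript_to_cluster):
    """Sum per-transcript read counts into per-cluster counts, per sample.

    transcript_counts: {transcript: [count in sample 1, count in sample 2, ...]}
    transcript_to_cluster: {transcript: cluster ID}
    """
    totals = {}
    for transcript, counts in transcript_counts.items():
        cluster = transcript_to_cluster[transcript]
        if cluster not in totals:
            totals[cluster] = list(counts)
        else:
            totals[cluster] = [x + y for x, y in zip(totals[cluster], counts)]
    return totals


# Hypothetical counts for three transcripts in two samples.
example_counts = {"t1": [10, 0], "t2": [5, 3], "t3": [1, 1]}
example_clusters = {"t1": "g1", "t2": "g1", "t3": "g2"}

gene_level = cluster_counts(example_counts, example_clusters)
print(gene_level)
```

The resulting cluster-by-sample table is the kind of count matrix that a DE package such as edgeR takes as input.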

DE results - Oases

Conclusion: Better to “under” cluster than “over” cluster

[Figure: DE results for the Oases assembly; axis label: “under clustered”; legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]

DE results - Trinity

[Figure: DE results for the Trinity assembly; axis labels: “over clustered”, “under clustered”; legend: ✕ Ideal, CD-HIT-EST, Oases/Trinity, Ours – dist.=0.1, Ours – dist.=0.3, Ours – dist.=0.5, Ours – dist.=0.7, Ours – dist.=0.9]

Q4. What is the best way to go from reads to counts?

Approaches
1. Do what we did before – add up counts for each cluster
2. Trinity and Oases suggest:
  – Multi-mapping reads to transcripts
  – Then use a program which can deal with ambiguously mapped reads
    • RSEM can take the clustering as input and return gene-level counts.
3. What people have actually done:
  – Select a set of representative transcripts (i.e. the longest one)
  – Map reads using their favorite mapper.
  – Count the number of reads which overlap the transcripts

e.g. Sandmann et al., Genome Biology, 2011 12:R76
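For the third approach, picking the longest transcript in each cluster is straightforward to sketch. The input layout and the name `longest_representatives` are assumptions for illustration; ties are broken by transcript name, which is an arbitrary choice made here so the result is deterministic.

```python
def longest_representatives(lengths, transcript_to_cluster):
    """Pick the longest transcript in each cluster.

    lengths: {transcript: sequence length in bases}
    transcript_to_cluster: {transcript: cluster ID}
    Ties go to the transcript whose name sorts first (arbitrary choice).
    """
    best = {}
    for transcript in sorted(lengths):
        cluster = transcript_to_cluster[transcript]
        current = best.get(cluster)
        if current is None or lengths[transcript] > lengths[current]:
            best[cluster] = transcript
    return best


# Hypothetical lengths for four transcripts in two clusters.
example_lengths = {"t1": 800, "t2": 1500, "t3": 400, "t4": 400}
example_membership = {"t1": "g1", "t2": "g1", "t3": "g2", "t4": "g2"}

representatives = longest_representatives(example_lengths, example_membership)
print(representatives)
```

Reads are then mapped to the representative sequences only, so each cluster contributes a single target and no read needs to be split between transcripts of the same cluster.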

The alternatives:
– Multi-map reads (Trinity script) → get counts (RSEM) → gene-level counts
– Select representative transcript (longest) → best-map reads (bowtie) → count reads overlapping transcripts (samtools) → gene-level counts
– “Best” map reads (bowtie) → count reads overlapping transcripts (samtools) → add counts in a cluster (own script) → gene-level counts

In each case: gene-level counts → edgeR → gene-level DE results

Use the same clustering for all three approaches

Results

Oases Assembly Trinity Assembly

Difference between methods is small - could probably do any of them

Conclusions
• Q1. Why so many transcripts?
  – Expect de novo transcriptome assemblies to produce many more transcripts than a typical annotation.
  – De novo transcriptome assemblies must deal with a number of issues which make full-length transcript assembly, without redundancy, difficult.
  – Sequencing to a high depth may give you more intergenic non-coding transcripts.
• Q2. Isoform or gene-level analysis?
  – Doing a differential expression analysis on gene-level counts has a number of advantages over isoform-level counts

Conclusions cont.

• Q3. How to cluster transcripts into genes?
  – We found that Trinity’s clustering was good, but the clustering from Oases and CD-HIT-EST was poor
  – We are developing a tool for clustering which already works better than the alternatives, based on differential expression results
• Q4. What is the best way to go from reads to counts?
  – We compared three alternatives for mapping/abundance estimation
  – Results were similar for all three
  – Getting the clustering correct has a bigger impact on the differential expression results than other steps in the pipeline

Future Work
• We have only looked at one RNA-Seq dataset.
  – Would like to look at RNA-Seq from at least two other datasets to ensure that the conclusions drawn here also hold in general:
    • Different species (all model organisms)
    • Different read depths
• Our clustering tool:
  – Would like to output the gene-level counts for each cluster.
    • Then compare to other abundance estimation approaches.
  – Would like to incorporate differences in expression between groups to improve the clustering
• More investigation into the pipeline methods
  – E.g. mapping

Acknowledgements

Chicken RNA-Seq Data from Katie Ayers (MCRI) Craig Smith (MCRI)

MCRI Bioinformatics Alicia Oshlack The Bioinformatics Group

Red Jungle Fowl (credit: NHGRI)

VLSCI AGRF

Extra Slides

Trinity and Oases compared

[Figure legend: Oases; Trinity – version from the start of 2012; Trinity – version from the end of 2012]

frac_match = length of the longest matching assembled transcript / “true” length of the transcript

Number of genes to transcripts (ordered by DE)

[Figure panels: Yeast; Chicken (Trinity); Chicken (Oases)]
