post-assembly data analysis - cornell universitycbsuss05.tc.cornell.edu/doc/trinity_lecture2.pdf ·...

27
Post-assembly Data Analysis Quantification: the expression level of each gene in each sample DE genes: genes differentially expressed between samples Clustering/network analysis Identifying over-represented functional categories in DE genes Evaluation of the quality of the assembly Assembled transcriptome

Upload: others

Post on 29-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Post-assembly Data Analysis

• Quantification: the expression level of each gene in each sample

• DE genes: genes differentially expressed between samples

• Clustering/network analysis

• Identifying over-represented functional categories in DE genes

• Evaluation of the quality of the assembly

Assembled transcriptome

Page 2: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Different summarization strategies will result in the inclusion or exclusion of different sets of reads in the table of counts.

Part 1. Abundance estimation using RSEM

Page 3: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Map reads to genome (TOPHAT)

Map reads to Transcriptome(BOWTIE)

vs

What is easier?No issues with alignment across splicing junctions.

What is more difficult?The same reads could be repeatedly aligned to different splicing isoforms, and paralogous genes.

Page 4: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Hass et al. Nature Protocols 8:1494

RSEM assign ambiguous reads based on unique reads mapped to the same transcript.

Red & yellow: unique regionsBlue: regions shared between two transcripts

Transcript 1

Transcript 2

Page 5: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Trinity provides a script for calling BOWTIE and RSEMalign_and_estimate_abundance.pl

align_and_estimate_abundance.pl \--transcripts Trinity.fasta \--seqType fq \--left sequence_1.fastq.gz \--right sequence_2.fastq.gz \--SS_lib_type RF \--aln_method bowtie \--est_method RSEM \--thread_count 4 \--trinity_mode \--output_prefix tis1rep1 \

Page 6: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Parameters for align_and_estimate_abundance.pl

--aln_method: alignment methodDefault: “bowtie” .Alignment file from other aligner might not be supported.

--est_method: abundance estimation methodDefault : RSEM, slightly more accurate.Optional: eXpress, faster and less RAM required.

--thread_count: number of threads

--trinity_mode: the input reference is from Trinity. Non-trinity reference requires a gene-isoform mapping file (--gene_trans_map).

Page 7: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

transcript_id gene_id length effective_length

expected_count

TPM FPKM IsoPct

gene1_isoform1 gene1 2169 2004.97 22.1 3.63 3.93 92.08

gene1_isoform2 gene1 2170 2005.97 1.9 0.31 0.34 7.92

gene_id transcript_id(s) length effective_length

expected_count

TPM FPKM

gene1 gene1_isoform1,gene1_isoform2

2169.1 2005.04 24 3.94 4.27

Output files from RSEM (two files per sample)

*.genes.results table

*.isoforms.results table

Page 8: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

transcript_id gene_id length effective_length

expected_count

TPM FPKM IsoPct

gene1_isoform1 gene1 2169 2004.97 22.1 3.63 3.93 92.08

gene1_isoform2 gene1 2170 2005.97 1.9 0.31 0.34 7.92

gene_id transcript_id(s) length effective_length

expected_count

TPM FPKM

gene1 gene1_isoform1,gene1_isoform2

2169.1 2005.04 24 3.94 4.27

Output files from RSEM (two files per sample)

*.genes.results table

*.isoforms.results table

sum

Percentage of an isoform in a gene

Page 9: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

filter_fasta_by_rsem_values.pl \

--rsem_output=s1.isoforms.results,s2.isoforms.results \

--fasta=Trinity.fasta \

--output=Trinity.filtered.fasta \

--isopct_cutoff=5 \

--fpkm_cutoff=10 \

--tpm_cutoff=10 \

Filtering transcriptome reference based on RSEMfilter_fasta_by_rsem_values.pl

• Can be filtered by multiple RSEM files simultaneously. (criteria met in any one filter will be filtered )

Page 10: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Expression level

% o

f re

plica

tes

0 5 10 15 20

0.0

00

.05

0.1

00

.15

0.2

00

.25

Condition 1

Condition 2

Distribution of Expression Level of A Gene

… with 3 biological replicates in each species

Part 2. Differentially Expressed Genes

Page 11: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Reported for each gene:

• Normalized read count (Log2 count-per-million);

• Fold change between biological conditions (Log2 fold)

• Q(FDR).

Use EdgeR or DESeq to identify DE genes

Using FDR, fold change and read count values to filter:

E.g. Log2(fold) >2 or <-2FDR < 0.01 Log2(counts) > 2

Page 12: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

abundance_estimates_to_matrix.pl \

--est_method RSEM \

s1.genes.results \

s2.genes.results \

s3.genes.results \

s4.genes.results \

--out_prefix mystudy

Combine multiple-sample RSEM results into a matrix

s1 s2 s3 s4

TR10295|c0_g1 0 0 0 0

TR17714|c6_g5 51 70 122 169

TR23703|c1_g1 0 0 0 0

TR16168|c0_g2 1 1 1 0

TR16372|c0_g1 10 4 65 91

TR12445|c0_g3 0 0 0 0

Output file: mystudy.counts.matrix

……

Page 13: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Use run_DE_analysis.pl script to call edgeR or DESeq

contition1 s1

condition1 s2

condition2 s3

condition2 s4

Make a condition-sample mapping file

run_DE_analysis.pl \

--matrix mystudy.counts.matrix \

--samples_file mysamples \

--method edgeR \

--min_rowSum_counts 10 \

--output edgeR_results

Run script run_DE_analysis.pl

Skip genes with summed read counts less than 10

Page 14: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

logFC logCPM PValue FDR

TR8841|c0_g2 13.28489 7.805577 3.64E-20 1.05E-15

TR14584|c0_g2 -12.9853 7.507117 2.82E-19 4.04E-15

TR16945|c0_g1 -13.3028 7.823384 4.21E-18 3.11E-14

TR15899|c3_g2 9.485185 7.708629 5.35E-18 3.11E-14

TR8034|c0_g1 -10.1468 7.836908 5.42E-18 3.11E-14

TR3434|c0_g1 -12.909 7.431009 1.16E-17 5.55E-14

Output files from run_DE_analysis.pl

Filtering (done in R or Excel):

FDR: False-discovery-rate of DE detection (0.05 or below)

logFC: fold change log2(fc)

logCPM: average expression level (count-per-million)

Page 15: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Heat map and Hierarchical clustering K-means clustering

Part 3. Clustering/Network Analysis

Page 16: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

# Heap map

analyze_diff_expr.pl \

--matrix ../mystudy.TMM.fpkm.matrix \

--samples ../mysamples \

-P 1e-3 -C 2 \

# K means clustering

define_clusters_by_cutting_tree.pl \

-K 6 -R cluster_results.matrix.RData

Clustering tools in Trinity package

Output:Images in PDF format;Gene list for each group from K-means analysis;

Page 17: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Part 4. Functional Category Enrichment Analysis

(from DE or clustering)TR14584|c0_g2TR16945|c0_g1TR8034|c0_g1TR3434|c0_g1TR13135|c1_g2TR15586|c2_g4TR14584|c0_g1TR20065|c2_g3TR1780|c0_g1TR25746|c0_g2

GO ID over represented pvalue

FDR GO term

GO:0044456 3.40E-10 3.78E-06synapse part

GO:0050804 1.49E-09 8.29E-06regulation of synaptic transmission

GO:0007268 5.64E-09 2.09E-05synaptic transmission

GO:0004889 1.00E-07 0.00012394acetylcholine-activated cation-selective channel activity

GO:0005215 1.23E-070.00013647

1transporter activity

DE Gene list(From DEseq or EdgeR)

Enriched GO categoriesFrom GO enrichment analysis of DE genes

Results from RNA-seq data analysis

Page 18: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Steps in RNA-seq analysis:

1. Identifying open reading frames and translate into

protein sequences (optional, not needed if using

BLAST2GO et al);

2. Annotate the transcripts with GO

• BLAST2GO

• TRINOTATE

• INTERPROSCAN

3. Enrichment analysis

Page 19: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Transdecoder – Identify ORF from Transcript

Page 20: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Transdecoder – Identify ORF from Transcript

• Long ORF

• Alignment to known protein sequences or motif

• Markov model of coding sequences (model trained on genes from the same genome)

Page 21: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Transdecoder – Identify ORF from Transcript

#identify longest ORF

TransDecoder.LongOrfs -t myassembly.fa

#BLAST ORF against known protein datbase

blastp -query myassembly.fa.transdecoder_dir/longest_orfs.pep \

-db swissprot

-max_target_seqs 1 \

-outfmt 6 -evalue 1e-5 \

-num_threads 8 > blastp.outfmt6

#Predict ORF based on a Markov model of coding sequence (trained from long ORF)

TransDecoder.Predict --cpu 8 \

--retain_blastp_hits blastp.outfmt6 \

-t myassembly.fa

Page 22: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Function annotation

Predict gene function from sequences

Trinotate package• BLAST Uniprot : homologs in known proteins• PFAM: protein domain)• SignalP: signal peptide)• TMHMM: trans-membrane domain• RNAMMER: rRNA

Output: go_annotations.txt (Gene Ontology annotation file)Step-by-step guidance: https://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=143#c

BLAST2GOhttps://cbsu.tc.cornell.edu/lab/userguide.aspx?a=software&i=73#c

or Recommended

Page 23: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

run_GOseq.pl \

--genes_single_factor DE.filtered.txt \

--GO_assignments go_annotations.txt \

--lengths genes.lengths.txt

Enrichment analysisUse the geneIDs in

the first column

GO annotation file from Trinotate

3rd column from RSEM *.genes.results

file

GO ID over represented pvalue

FDR GO term

GO:0044456 3.40E-10 3.78E-06synapse part

GO:0050804 1.49E-09 8.29E-06regulation of synaptic transmission

GO:0007268 5.64E-09 2.09E-05synaptic transmission

acetylcholine-activated cation-

Results:

Page 24: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Part 5. Evaluation of Assembly Quality

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment

A. Representation of known proteins

• Compare to SWISSPROT (% of full length)

• Compare to genes in a closely related species

(% of full length; completeness)

• Compare to BUSCO core proteins (BUSCO)

Page 25: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

analyze_blastPlus_topHit_coverage.pl \blastx.outfmt6 \Trinity.fasta \melanogaster.pep.all.fa \

Blastx \-query Trinity.fasta \-db melanogaster.pep.all.fa \-out blastx.outfmt6 -evalue 1e-20 -num_threads 6 \

BLAST against D. melanogaster protein database

Analysis the blast results and generate histogram

Page 26: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Part 5. Evaluation of Assembly Quality

#hit_pct_cov_bin count_in_bin >bin_below

100 5165 5165

90 1293 6458

80 1324 7782

70 1434 9216

60 1634 10850

50 1385 12235

40 1260 13495

30 1091 14586

20 992 15578

10 507 16085

Compare the Drosophila yakuba assembly used in the exercise to Drosophila melanogaster proteins from flybase

Page 27: Post-assembly Data Analysis - Cornell Universitycbsuss05.tc.cornell.edu/doc/Trinity_Lecture2.pdf · transcript_id gene_id length effective_ length expected_ count TPM FPKM IsoPct

Part 5. Evaluation of Assembly Quality

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment

A. Statistics scores

• Map RNA-Seq reads to the assembly (>80% of

reads can be aligned)

• ExN50 transcript contig length

N50: shortest sequence length at 50% of the genome

E90N50: N50 based on transcripts representing 90% of

expression data

• DETONATE score

With or without reference sequences