RNA SEQUENCING AND DATA ANALYSIS


Post on 01-Apr-2021


Length of mRNA transcripts in the human genome

[Figure: distribution of mRNA transcript lengths in the human genome. Only the axis ticks survive: 0 to 10,000 and 0 to 5,000 on the main panel, 0 to 800 on an inset.]

Insert size ~ 200bp


Overview of RNA sequencing protocol

[Figure: paired-end sequencing of an insert, with a forward read and a reverse read.]

Read length: 48-76bp


Sequencing parameters

Read Depth

Minimum mapped reads: 10 million for quantitative analysis of mammalian transcriptome

More reads needed for splicing variant discovery and differential comparison among samples

Current output: 120-180 million raw reads / lane

Multiplex level: 4-12 libraries / lane recommended
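As a rough illustration of the numbers above, the raw reads available per library follow from dividing lane output by the multiplex level. The sketch below is a back-of-envelope calculation only: the function name is hypothetical, it assumes reads are split evenly across libraries, and it counts raw reads rather than mapped reads.

```python
def raw_reads_per_library(lane_reads: float, libraries_per_lane: int) -> float:
    """Raw reads per library, assuming an even split across the lane."""
    if libraries_per_lane <= 0:
        raise ValueError("libraries_per_lane must be positive")
    return lane_reads / libraries_per_lane


# Lane output and multiplex levels quoted on the slide.
for lane_reads in (120e6, 180e6):
    for n_libraries in (4, 12):
        per_library = raw_reads_per_library(lane_reads, n_libraries)
        print(
            f"{lane_reads / 1e6:.0f}M reads / {n_libraries} libraries "
            f"= {per_library / 1e6:.0f}M raw reads per library"
        )
```

At 12 libraries on a 120-million-read lane, each library receives about 10 million raw reads, which leaves no margin for unmapped reads against the 10 million mapped-read minimum.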


All RNA is not the same

Types of RNA:

Messenger RNA

Micro RNA

Long non-coding RNA

Ribosomal RNA


Methods for RNA enrichment prior to library construction

Poly(A)-RNA selection: by hybridization to oligo-dT beads
Mature mRNA highly enriched
Efficient for quantification of gene expression level and similar applications
Limitation: 3’ bias correlating with RNA degradation

rRNA depletion: by hybridization to bead-bound rRNA probes
rRNA sequence-dependent and species-specific
All non-rRNA retained: premature mRNA, long non-coding RNA

Small RNA extraction: specific kits required to retain small RNA
Optional fine size-selection by gel or column


Different methods capture different types of RNA

Columns: Poly(A)-RNA selection, rRNA depletion, Small RNA extraction

Messenger RNA X X

Micro RNA X X

Long non-coding RNA X

Ribosomal RNA X

[Table: the column under which each X sat was not preserved in this transcript.]


Paraffin embedded vs fresh frozen

[Figure: read quality; surviving panel label: Fresh Frozen.]


First step: alignment


Or: assembly, then alignment


Alignment versus assembly

Assembly

Trinity, Cufflinks, ABySS

Particularly useful when no reference genome is available, like in bacterial transcriptomes

Alignment

Bowtie, BWA, MOSAIK

Maximum sensitivity, fewer false positives


RNA sequencing applications

Quantification of transcript expression levels

Detection of splice variation/different isoforms of the same gene

Allele specific expression levels

Strand specific expression levels

Detection of fusion transcripts (such as BCR-ABL in CML)

Detection of sequence variation (limited application)

Validation of DNA sequence variants


RNA-seq expression levels are linear where microarrays get saturated or are insensitive

Expression is measured as ‘reads per kilobase per million’ (RPKM) or ‘fragments per kilobase of exon per million fragments mapped’ (FPKM) to normalize for gene length and library size
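The RPKM normalization can be written out as a short calculation. This is a minimal sketch of the standard definition (read count scaled per kilobase of transcript and per million mapped reads); the function and parameter names are illustrative and not taken from any particular tool. FPKM has the same form with fragments counted in place of reads.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) gives a factor of 1e9.
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Toy numbers: 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))
```

With these toy numbers the transcript gets 250 reads per kilobase, and dividing by 10 (million mapped reads) gives an RPKM of 25.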


In GBM, the gene EGFR is frequently targeted by intragenic deletions


vIII deletion occurs in same domain as point mutations


Detecting EGFR transcript variants using RNA-seq data


SpliceSeq can detect splice variants http://bioinformatics.mdanderson.org/main/SpliceSeq:Overview


Allele-/Strand-specific RNA-seq

Haplotype-specific gene expression by computationally integrating RNA-seq with DNA SNP data

Strand-specific RNA-seq requires specific library preparation protocol

Costs more

Output more accurate, useful for analysis in absence of a reference genome


Identification of fusion transcripts

Popular methods search for

Read pairs that map to two different genes

Need to correct for gene homology

Reads that span fusion junction

Split reads in half and align the two halves separately

Make a database of all possible fusion junctions and align full reads

PRADA, MapSplice, TopHat

http://sourceforge.net/projects/prada/
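The first search strategy above (read pairs that map to two different genes) can be sketched as a simple tally. This is an illustration under stated assumptions, not the implementation of PRADA, MapSplice, or TopHat: each read pair is assumed to be already reduced to the gene hit by each end (a hypothetical input format), and the homology correction mentioned on the slide is not included.

```python
from collections import Counter


def count_discordant_pairs(pair_genes):
    """Count read pairs whose two ends map to different genes.

    pair_genes: iterable of (gene_of_end1, gene_of_end2); None marks an
    end that is unmapped or not assigned to a gene (hypothetical format).
    Returns a Counter keyed by the alphabetically sorted gene pair.
    """
    counts = Counter()
    for gene1, gene2 in pair_genes:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # concordant or uninformative pair
        counts[tuple(sorted((gene1, gene2)))] += 1
    return counts


# Toy input; the read pairs are made up for illustration.
pairs = [("FGFR3", "TACC3"), ("TACC3", "FGFR3"), ("EGFR", "EGFR"), ("EGFR", None)]
print(count_discordant_pairs(pairs))
```

Sorting the gene pair merges both orientations into one key; a real caller would keep orientation to distinguish the 5’ from the 3’ partner.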


FGFR3-TACC3 fusion in GBM is the result of a local inversion

FGFR3-TACC3


Fusion transcripts are often associated with copy number difference and genomic breakpoints

Copy number profile of two FGFR3-TACC3 cases in TCGA

FGFR3-TACC3


6.4% of GBM harbors transcript fusions involving EGFR

All fusions fall within the area of the EGFR amplification


[Figure: PRADA pipeline flowchart, from INPUTS to OUTPUTS.]

INPUTS: .fastq files [END1 & END2]; Config.txt [location of scripts and reference files]

Processing Module: read alignment, remap alignments, combine two ends, quality scores recalibrated. Produces the preprocessing .bam file [PAIRED END].

Expression & QC Module [YES || NO || ONLY]: RNA-SeQC. Output: RPKM & QC metrics.

Fusion Module [YES || NO || ONLY]. Output: fusion candidates.

GUESS-ft [YES || NO || ONLY], with -geneA -geneB. Output: supervised search evidence.

Fusion Module

Discordant read pair: each end of the read pair maps uniquely to distinct protein-coding genes.

Fusion spanning reads: chimeric read that maps to a putative junction, while the mate read maps to either GENE A or GENE B.


Structural transcript variants in low grade glioma

RNA-seq data from 272 TCGA low grade glioma

Fusion detection accuracy affected by:

PRADA detected 1,843 fusion transcripts

[Figure: number of mapped reads per sample against number of detected fusion transcripts per sample.]


Filtering out artifacts

Homology E value larger than 0.01 (column Evalue)

No mismatches in junction spanning reads

Count the number of partner genes for each individual gene

Identify genes with fusions mapping to more than 10 different chromosome arms

970/1,843 fusions filtered

Validation of predicted transcript fusions

509/970 fusions filtered
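The partner-counting filter listed above can be sketched as follows. This is a minimal illustration, assuming each candidate fusion is available as a (gene A, gene B) pair; the function names and the `max_partners` cutoff are hypothetical, since the slide gives no threshold for the number of partners, and the gene names in the example are placeholders.

```python
from collections import defaultdict


def partner_counts(fusions):
    """Number of distinct fusion partners observed for each gene."""
    partners = defaultdict(set)
    for gene_a, gene_b in fusions:
        partners[gene_a].add(gene_b)
        partners[gene_b].add(gene_a)
    return {gene: len(found) for gene, found in partners.items()}


def promiscuous_genes(fusions, max_partners):
    """Genes with more distinct partners than max_partners (hypothetical cutoff)."""
    return {gene for gene, n in partner_counts(fusions).items() if n > max_partners}


# Placeholder gene names, made up for illustration.
candidates = [("GENE1", "GENE2"), ("GENE1", "GENE3"), ("GENE1", "GENE4"), ("GENE5", "GENE6")]
print(partner_counts(candidates))
print(promiscuous_genes(candidates, max_partners=2))
```

The same tally could be keyed by chromosome arm instead of partner gene to approximate the "more than 10 different chromosome arms" criterion, given an arm annotation for each gene.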


Define four tiers of fusion transcripts based on evidence

Tier 1: At least 3 discordant read pairs (DSP), two perfect match junction spanning reads (JSR), and both partner genes only fused to one other partner gene in the same sample

Tier 2: At least 2 DSP and 1 JSR, with a DNA breakpoint within 100kb window

Use matching DNA copy number profile

Tier 3: At least 2 DSP and 1 JSR, unique partner genes, with predicted junction consistent for all

Tier 4: The rest
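The tier definitions above can be read as an ordered set of rules. The sketch below is one possible reading, not the authors' code: it assumes the tiers are tested in order, that "1 JSR" means at least one junction spanning read, and that the evidence has already been summarized into the counts and flags shown, all of which are hypothetical parameter names.

```python
def fusion_tier(
    discordant_pairs: int,
    perfect_junction_reads: int,
    junction_reads: int,
    unique_partners: bool,
    dna_breakpoint_within_100kb: bool,
    junction_consistent: bool,
) -> int:
    """Assign a fusion candidate to tier 1 to 4 (assumed evaluation order)."""
    if discordant_pairs >= 3 and perfect_junction_reads >= 2 and unique_partners:
        return 1
    if discordant_pairs >= 2 and junction_reads >= 1 and dna_breakpoint_within_100kb:
        return 2
    if (
        discordant_pairs >= 2
        and junction_reads >= 1
        and unique_partners
        and junction_consistent
    ):
        return 3
    return 4  # the rest


# Toy candidate: strong read support and unique partner genes.
print(fusion_tier(5, 3, 3, True, False, True))
```

Here `dna_breakpoint_within_100kb` stands for the check against the matching DNA copy number profile, which would have to be computed separately.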


Validation of RNA fusions using output of BreakDancer

BreakDancer detects DNA rearrangements in low pass sequencing data


Variant detection

Approximately 30% of mutations are covered sufficiently to be detected at a validation rate of ~80%.

From the TCGA clear cell renal cell carcinoma project

The reverse transcriptase step that converts RNA to cDNA complicates detection of RNA edits and mutations


RNA sequencing read alignment in PRADA

Transcripts from same gene

Reads are aligned to all possible transcripts

Reads are also aligned to genome


The final, single placement for each read is determined by re-mapping


PRADA alignments – advantages versus disadvantages

Advantage:

Alignment to DNA means mapping of unannotated transcripts

Alignment to transcriptome means mapping across exon-exon junctions

Disadvantage

More conservative alignment than split-read


PRADA focuses on the analysis of paired-end RNA-sequencing data.

Four modules:

1. Processing

2. Expression and Quality Control

3. Gene fusion

4. GUESS-ft: General User dEfined Supervised Search for fusion transcripts


Expression & QC Module

RNA-SeQC provides three types of quality control metrics: read counts, coverage, correlation

RPKM values at transcript level, for the longest transcript

RNA-SeQC process (Java)


Implementation results

Samples processed: >400 KIRC, >170 GBM


Works well in MDACC HPC* system

PRADA-fusion module validation rate ~85 % (53 out of 62)


RNA sequencing in The Cancer Genome Atlas

mRNA: poly-A mRNA purified from total RNA using poly-T oligo-attached magnetic beads

miRNA: Total RNA is mixed with oligo(dT) MicroBeads and loaded into MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.


Detecting fusion transcripts in GBM


KIRC fusion results

We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.

We identified 80 bona-fide fusion transcripts (57 intrachromosomal, 33 interchromosomal) in 62 individual samples

“Recurrent” fusions:

SFPQ-TFE3 (n=5, chr1-chrX)

DHX33-NLRP1 (n=2, chr17)

TRIP12-SLC16A14 (n=2, chr2)

TFG-GPR128 (n=4, chr3)


KIRC fusion validation

Sample ID                     5’ Gene   3’ Gene   Discordant Read Pairs  Fusion Span Reads  Fusion Junction(s)  5’ Gene Chr  3’ Gene Chr  Validated?
TCGA-AK-3456-01A-02R-1325-07  TFE3      SFPQ      175                    129                1                   chrX         chr1         Yes
TCGA-AK-3456-01A-02R-1325-07  SFPQ      TFE3      116                    81                 1                   chr1         chrX         Yes
TCGA-A3-3313-01A-02R-1325-07  C6orf106  LRRC1     90                     40                 2                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  CYP39A1   LEMD2     37                     9                  1                   chr6         chr6         Yes
TCGA-B2-4101-01A-02R-1277-07  FAM172A   FHIT      17                     4                  1                   chr5         chr3         Yes
TCGA-AK-3445-01A-02R-1277-07  KIAA0802  LRRC41    14                     6                  1                   chr18        chr1         Yes
TCGA-B0-5095-01A-01R-1420-07  GORASP2   WIPF1     14                     2                  1                   chr2         chr2         Yes
TCGA-A3-3313-01A-02R-1325-07  ZNF193    MRPS18A   11                     3                  1                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  FTSJD2    GPX6      9                      8                  1                   chr6         chr6         Yes
TCGA-B0-4945-01A-01R-1420-07  KIAA0427  GRM4      8                      5                  1                   chr18        chr6         No
TCGA-B8-4143-01A-01R-1188-07  SLC36A1   TTC37     5                      5                  1                   chr5         chr5         No

PRADA fusion module validation rate: ~85% (11 out of 13), by RT-PCR and FISH assays

TFE3-SFPQ was validated in three individual samples


KIRC fusion validation: RT-PCR

[Figure: RT-PCR result for FAM172A-FHIT.]

Figure 2. RT-PCR results for TFE3 fusion validations for sample TCGA-AK-3456, panels (a) and (b), shown for SFPQ-TFE3 and for TFE3-SFPQ.


TFG-GPR128 has been reported in other cancers

TCGA has thousands of RNA-seq samples. How can we quickly scan many samples for the presence of this fusion?


Supervised Search Module

GUESS-ft: General User dEfined Supervised Search for fusion transcripts

[Figure: GUESS-ft flowchart, from BAM input to summary report.]

Mapped to A or B

A-B discordant reads: the two read ends map to A and B respectively

Unmapped reads: parse unmapped reads with the other end mapping to A or B

Junction DB: map parsed reads to a DB of all possible exon junctions

Junction spanning reads: list reads with one end mapping to a junction and the other mapping to A or B

Summary report

Use high quality mapping reads only; check that read orientation fulfills the fusion schema; allow up to one mismatch.

Time-consuming step (flagged on the flowchart)
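The "DB of all possible exon junctions" step can be sketched as a nested loop over the exons of the two genes. This is an illustration under stated assumptions, not GUESS-ft's actual code: exons are given as plain sequences keyed by a hypothetical exon ID, gene A is taken as the 5’ partner, and `flank` (the number of bases kept on each side of the junction) is an invented parameter.

```python
def build_junction_db(exons_a, exons_b, flank):
    """Build candidate junction sequences between every exon of A and every exon of B.

    exons_a, exons_b: dict mapping exon ID -> exon sequence (hypothetical format).
    Each junction joins the last `flank` bases of an A exon to the first
    `flank` bases of a B exon, i.e. A is assumed to be the 5' partner.
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    junctions = {}
    for id_a, seq_a in exons_a.items():
        for id_b, seq_b in exons_b.items():
            junctions[f"{id_a}|{id_b}"] = seq_a[-flank:] + seq_b[:flank]
    return junctions


# Toy exon sequences, made up for illustration.
gene_a_exons = {"A_exon1": "AAAACCCC", "A_exon2": "ACGTACGT"}
gene_b_exons = {"B_exon1": "GGGGTTTT", "B_exon2": "TTTTAAAA", "B_exon3": "CACACACA"}
db = build_junction_db(gene_a_exons, gene_b_exons, flank=4)
print(len(db), db["A_exon1|B_exon1"])
```

The database grows with the product of the two exon counts, and covering the reciprocal fusion (B as the 5’ partner) would require a second pass with the arguments swapped.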


Identification of TFG-GPR128 fusion

All available normal samples in cghub

Subset of tumor samples selected based on RPKM expression pattern

Table. Samples across cancer types

Cancer Type                                   # of normal samples  # of tumor samples
Bladder Urothelial Carcinoma [BLCA]           0 (0%)               2 (3.6%)
Breast invasive carcinoma [BRCA]              1 (0.94%)            13 (1.6%)
Head and Neck squamous cell carcinoma [HNSC]  0 (0%)               6 (2.3%)
Kidney renal clear cell carcinoma [KIRC]      1 (1.5%)             5 (1.2%)
Kidney renal papillary cell carcinoma [KIRP]  0 (0%)               1 (5.9%)
Liver hepatocellular carcinoma [LIHC]         0 (0%)               1 (5.9%)
Lung adenocarcinoma [LUAD]                    0 (0%)               1 (0.79%)
Lung squamous cell carcinoma [LUSC]           0 (0%)               9 (4%)
Prostate adenocarcinoma [PRAD]                1 (14.3%)            2 (1.9%)
Thyroid carcinoma [THCA]                      0 (0%)               2 (0.89%)

* All performed by PRADA fusion module.


Tumors with the fusion have higher GPR128 expression levels

RPKM expression pattern seen in KIRC tumors

Fusion sample(s)

Higher expression of GPR128 (activation)

TCGA-B0-5703: 1 discordant read pair in the tumor sample, 33 discordant read pairs in the matched normal


Thanks.

http://sourceforge.net/projects/prada/