rna-seq analysis - bioinformaticsinstitute.rubioinformaticsinstitute.ru/...rna-seq_analysis... ·...

RNA-seq analysis

Alexey Sergushichev

July 30th, MIPT

Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis: from an expression table to pathway analysis

Overview of the lecture

RNA-seq analysis in R: from an expression table to pathway analysis

• Getting publicly available expression tables

• Doing differential expression

• Doing pathway analysis

Practical session

Central dogma of molecular biology

Information storage

Function

Mass-spectrometry based high-throughput proteomics:

• Measures what really matters: proteins

• Measures thousands of proteins simultaneously, but not all of them

• Complex instrument: costly maintenance, complex raw data, complex experimental design, complex analysis

Ask Pavel Synitcyn for more about proteomics

Measuring proteins: proteomics

Central dogma of molecular biology

Information storage

Function

Measuring RNA as a proxy to protein

Correct central dogma of molecular biology

Measuring RNA as a proxy to protein

RNA-seq = RNA->cDNA + DNA-sequencing

http://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf

FASTQ format has four lines per read: name, sequence, comment, base call qualities

Raw RNA-sequencing data: FASTQ files

@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:
NGGGCCAAAGGAGCTTTCAAGGAGAGAAAGAGAAGAAATAGAGAAGCAAA
+
#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD

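Fields of the read name line, in order:

• instrument name

• run number

• flowcell ID

• lane number

• tile number

• X coordinate of cluster

• Y coordinate of cluster

• read number (single/paired)

• filter flag: Y – filtered, N – not

• control number

The last line of the record holds the base call qualities.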
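The four-line layout and the colon-separated read name can be unpacked with a few lines of Python. This is a minimal sketch rather than a production parser: the function names and dictionary keys are made up for illustration, the field labels follow the slide annotations, and qualities are assumed to use the Phred+33 encoding of recent Illumina instruments.

```python
from typing import Dict, Iterator, List, Tuple

# Field labels follow the slide annotations; the key names are illustrative.
NAME_FIELDS = ["instrument", "run_number", "flowcell_id", "lane",
               "tile", "x", "y"]
READ_FIELDS = ["read_number", "is_filtered", "control_number", "index"]


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, qualities) for each four-line FASTQ record."""
    for i in range(0, len(lines) - 3, 4):
        name, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {name}")
        yield name[1:], seq, qual


def parse_read_name(name: str) -> Dict[str, str]:
    """Split an Illumina-style read name into its colon-separated fields."""
    left, _, right = name.partition(" ")
    fields = dict(zip(NAME_FIELDS, left.split(":")))
    fields.update(zip(READ_FIELDS, right.split(":")))
    return fields


def phred_scores(qualities: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qualities]


# The record from the slide, with the "+" comment line included.
record = [
    "@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:",
    "NGGGCCAAAGGAGCTTTCAAGGAGAGAAAGAGAAGAAATAGAGAAGCAAA",
    "+",
    "#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD",
]
name, seq, qual = next(read_fastq(record))
info = parse_read_name(name)
scores = phred_scores(qual)
print(info["flowcell_id"], info["lane"], scores[:3])
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the leading `#` (score 2) marks the uncalled base `N` as essentially unreliable.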

https://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm

model organism: with good reference genome

non-model organism: with no/poor reference genome

Two distinct types of RNA-seq

Well studied Not so much

Human: chr1-22, chrX, chrY, chrM

• 3235 Mb, 19815 genes

Mouse: chr1-19, chrX, chrY, chrM

• 2718 Mb, 21971 genes

Assembly is mostly complete, but not 100%: there are unplaced scaffolds and gaps

There are genes in unplaced/unlocalized sequence, which could be important

Well-defined genomes

http://www.slideshare.net/hhalhaddad/the-human-genome-project-part-iii

Human:

• UCSC notation (hg19, hg38)

• Genome Reference Consortium notation (major: GRCh37, minor: GRCh38.p7)

• 1000 genomes notation (b37)

Mouse – same (mm10, GRCm38)

In RNA-seq always use the latest primary assembly:

• hg38/GRCh38 for human

• mm10/GRCm38 for mouse

Popular genome assemblies

Primary assembly: the best known assembly of a haploid genome.

• Chromosome assembly: a sequence with known physical location (e.g. according to a physical map).

• Unlocalized sequence: a sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.

• Unplaced sequence: a sequence found in an assembly that is not associated with any chromosome.

Genome Reference Consortium terminology

https://www.ncbi.nlm.nih.gov/grc/help/definitions/

http://lh3lh3.users.sourceforge.net/humanref.shtml

Unlocalized/unplaced sequences and patches can contain genes!

rRNA – ribosomal RNA: 80% of the cell RNA

tRNA – transfer RNA: 15% of the cell RNA

mRNA – messenger RNA for protein coding genes

Other RNAs: miRNA, lncRNA, …

Some RNAs are short (tRNA, miRNA, …) and are not captured by standard RNA-seq

Main types of RNAs

https://www.ncbi.nlm.nih.gov/books/NBK21729/

Protein-coding RNA

Estimated 10^5 to 10^6 mRNA molecules per animal cell, with a high dynamic range across genes: from several copies to 10^4

mRNA levels correlate with protein levels

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3129258/

https://bionumbers.hms.harvard.edu/bionumber.aspx?id=111220

polyA selection: most standard, relatively cheap and easy protocol, selects mRNAs and some non-coding RNAs

riboZero: depletes rRNA, works better for degraded RNA, captures all long RNAs

Two main approaches for RNA selection

DNA is transcribed a lot, giving multiple types of RNA

Some encode proteins, some do not

Set of transcripts with a similar function = gene

For a canonical protein-coding gene, transcripts = isoforms

What’s a gene?

RefSeq – very conservative

ENSEMBL/Gencode – very inclusive

Practical definition of a gene: genome annotation

Using Gencode is the most practical

https://www.gencodegenes.org

We have raw data: FASTQ-files

We have genome reference and genome annotation

Summary (1)

Designed for DNA-seq, so “bad” is not always bad

QCFail: https://sequencing.qcfail.com/

Quality control: FastQC

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

Alignment

• HISAT2

• STAR

• bowtie/bowtie2

Counting

• featureCounts

• htseq

• mmquant

• RSEM

Alignment + counting pipeline

CIGAR string

Typical SAM file with alignment

mapping quality

bitwise SAM flag

CIGAR string
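The annotated SAM fields can be decoded without any library. The sketch below is illustrative only: the flag names are informal labels for the bits defined in the SAM specification, the helper names are made up, and the example values (flag 99, CIGAR 20M1000N30M) are hypothetical rather than taken from the slide.

```python
import re
from typing import List, Tuple

# Bits of the bitwise SAM flag (values from the SAM specification;
# the short names are informal labels chosen for this sketch).
SAM_FLAGS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped",
    0x8: "mate_unmapped", 0x10: "reverse_strand",
    0x20: "mate_reverse_strand", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    # M, D, N, = and X consume the reference; I, S, H and P do not.
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("20M1000N30M"), reference_span("20M1000N30M"))
```

An `N` operation marks a skipped reference region, which in RNA-seq alignments usually corresponds to an intron, so a 50-base read can span more than a kilobase of genome.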

Expectations for genomic RNA-seq alignment

All reads

Not mapped anywhere

Uniquely mapped

Multimapped 2-15 times

Multimapped >15 times

2-10% 70%

10-20% ~5%

Generate read coverage and visualize in a genome browser

Check alignment rate

Check library strategy (next slides)

Useful to check ribosomal RNA content as well

Tools: RSeQC, Picard/CollectRnaSeqMetrics, QoRTs

RNA-seq quality control

Library strategies: single-end vs paired-end

https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html

Library strategies: stranded vs unstranded

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393

Library strategies: 3’- or 5’- specific, full-length

https://bitesizebio.com/13559/ngs-quality-control-in-rna-sequencing-some-free-tools/

For a stranded experiment, we can distinguish between two different genes if they are on opposite strands

For a non-stranded experiment, we can’t

htseq-count discards reads that overlap 2 or more features (ambiguous)

~50% assignment rate is normal
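The effect of strandedness on read assignment can be shown with a toy counter. This is a simplified sketch, not the actual logic of htseq-count or featureCounts: genes are single intervals, the gene names and coordinates are hypothetical, and the read strand is assumed to be already converted to the strand of the originating transcript.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"


def assign_read(read_start: int, read_end: int, read_strand: str,
                genes: List[Gene], stranded: bool) -> str:
    """Assign a read to a gene, or report no_feature / ambiguous."""
    hits = [
        g.name for g in genes
        if g.start <= read_end and read_start <= g.end      # intervals overlap
        and (not stranded or g.strand == read_strand)       # strand filter
    ]
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"  # overlaps 2 or more features, so it is discarded
    return hits[0]


# Two hypothetical genes overlapping on opposite strands.
genes = [Gene("GeneA", 100, 500, "+"), Gene("GeneB", 300, 800, "-")]

print(assign_read(350, 400, "-", genes, stranded=True))
print(assign_read(350, 400, "-", genes, stranded=False))
```

With strand information the read in the overlap goes to GeneB; without it the same read overlaps both genes and is lost as ambiguous.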

FeatureCounts

>5M assigned reads are required for a typical analysis, thus there should be >10M raw reads

Usually it’s better to increase the number of biological replicates instead of library depth

Library depth

Gene expression table

Very fast pseudo-alignment

No sam/bam output

Transcript-level quantification

Expectation-maximization for counting multimappers/ambiguous reads

Very useful for reprocessing of public datasets
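How expectation-maximization splits ambiguous reads can be illustrated on a toy example. This is a deliberately simplified sketch and not kallisto's implementation: all transcripts are assumed to have the same effective length, each read is represented only by the set of transcripts it is compatible with, and the transcript names and read numbers are hypothetical.

```python
from typing import Dict, FrozenSet, List


def em_counts(reads: List[FrozenSet[str]], transcripts: List[str],
              n_iter: int = 100) -> Dict[str, float]:
    """Estimate expected read counts per transcript with a simple EM."""
    # Start from equal abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compatible in reads:
            total = sum(theta[t] for t in compatible)
            if total == 0.0:
                continue
            for t in compatible:
                counts[t] += theta[t] / total
        # M-step: abundances are the normalized expected counts.
        assigned = sum(counts.values())
        theta = {t: c / assigned for t, c in counts.items()}
    return counts


# 6 reads unique to t1, 2 unique to t2, 4 compatible with both.
reads = ([frozenset({"t1"})] * 6 + [frozenset({"t2"})] * 2
         + [frozenset({"t1", "t2"})] * 4)
estimated = em_counts(reads, ["t1", "t2"])
print(estimated)
```

The uniquely assigned reads suggest a 3:1 ratio between the two transcripts, and the EM iterations converge to dividing the four ambiguous reads in the same proportion instead of discarding them.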

Kallisto

FPKM = Fragments per Kilobase of gene per Million

• First normalize to library depth, then to gene length

TPM = Transcripts per Million

• First normalize to effective transcript length, then to library depth

• Sum of all TPMs is one million

• Works well with isoforms, ~proportional to concentrations

Effective length for 3’-seq is the same for all genes/transcripts
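The difference in normalization order can be checked on a small example. The sketch below uses hypothetical gene names, counts and lengths, and takes plain gene length in base pairs where a real pipeline would typically use the effective length.

```python
from typing import Dict


def fpkm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Normalize to library depth (per million), then to length (per kb)."""
    per_million = sum(counts.values()) / 1e6
    return {g: counts[g] / per_million / (lengths[g] / 1e3) for g in counts}


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Normalize to length (per kb), then rescale so the values sum to 1e6."""
    per_kb = {g: counts[g] / (lengths[g] / 1e3) for g in counts}
    scale = sum(per_kb.values()) / 1e6
    return {g: rate / scale for g, rate in per_kb.items()}


# Hypothetical fragment counts and gene lengths (bp).
counts = {"geneA": 10, "geneB": 20, "geneC": 30}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1000}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values, sum(fpkm_values.values()))
print(tpm_values, sum(tpm_values.values()))
```

Both units rank the genes identically within one sample, but only TPM has a fixed total of one million, which makes the values comparable as proportions across samples.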

Units: FPKM vs TPM

https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

FeatureCounts vs Kallisto: similar but different

Generate a single report for many tools:

• fastqc

• hisat2

• rseqc

• kallisto

• …

MultiQC

polyA vs riboZero:

• If interested only in protein-coding genes, then polyA selection, unless RNA quality is bad

Library strategy:

• Always stranded if possible

• Paired-end and full-length for isoform quantification

• Single-end 3’ is enough for simple mRNA analysis, no bias for transcript length

Library preparation recap

Went from raw data to gene expression tables

• Alignment + quantification pipeline

• Alignment-free analysis with kallisto

QC for every step

Summary so far

RNA-seq analysis: from an expression table to pathway analysis

Two biological conditions: heart after myocardial injury or sham treatment

Looking for individual genes differentially expressed between conditions

Looking for molecular processes differentially regulated between conditions

The simplest experiment design

https://www.ahajournals.org/doi/full/10.1161/CIRCULATIONAHA.117.028252

Quality control: principal component analysis

Outlier?
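A PCA check of this kind can be sketched in plain Python. The example is an illustration under stated assumptions: the count table is hypothetical (three sham and three injury samples), counts are log2-transformed with a pseudocount of 1, and only the first principal component is computed, by power iteration on the sample-by-sample Gram matrix. In practice one would use a numerical library and a proper variance-stabilizing transformation.

```python
import math
from typing import List


def pc1_sample_scores(table: List[List[float]], n_iter: int = 200) -> List[float]:
    """Return PC1 coordinates of the samples for a genes x samples table."""
    n = len(table[0])
    logged = [[math.log2(v + 1.0) for v in row] for row in table]
    # Center every gene so that PCA captures differences between samples.
    centered = []
    for row in logged:
        mean = sum(row) / n
        centered.append([v - mean for v in row])
    # Sample-by-sample Gram matrix; its top eigenvector gives PC1 up to scale.
    gram = [[sum(row[i] * row[j] for row in centered) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(n_iter):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            return [0.0] * n  # no variation between samples
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    # Sample coordinates are the eigenvector scaled by the singular value.
    return [v * math.sqrt(max(eigenvalue, 0.0)) for v in vec]


# Hypothetical counts: columns 0-2 are sham, columns 3-5 are injury samples.
table = [
    [10, 12, 11, 100, 110, 95],
    [50, 55, 52, 5, 6, 4],
    [20, 21, 19, 20, 22, 21],
    [7, 8, 7, 60, 66, 58],
]
scores = pc1_sample_scores(table)
print([round(s, 2) for s in scores])
```

Samples from the same condition should land close together on PC1. A sample that sits far from its own group, or inside the other group, is a candidate outlier worth inspecting before differential expression. The sign of a principal component is arbitrary.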

Multiple pipelines

• DESeq2 – the easiest one

• edgeR

• Limma+voom – faster and better for large datasets
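The quantity all of these pipelines report for each gene is a log2 fold change between conditions. The sketch below computes a naive version from counts-per-million values; it is not what DESeq2, edgeR or limma do internally, since those packages also model dispersion and produce p-values. The gene names, counts and the pseudocount of 1 are assumptions made for illustration.

```python
import math
from typing import Dict, List


def log2_fold_changes(counts: Dict[str, List[float]], conditions: List[str],
                      case: str, control: str,
                      pseudocount: float = 1.0) -> Dict[str, float]:
    """Naive log2 fold change of mean CPM, case versus control."""
    n = len(conditions)
    # Library size of each sample = total counts in that column.
    lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n)]
    result = {}
    for gene, row in counts.items():
        cpm = [row[i] / lib_sizes[i] * 1e6 for i in range(n)]
        case_vals = [v for v, c in zip(cpm, conditions) if c == case]
        ctrl_vals = [v for v, c in zip(cpm, conditions) if c == control]
        case_mean = sum(case_vals) / len(case_vals)
        ctrl_mean = sum(ctrl_vals) / len(ctrl_vals)
        result[gene] = math.log2((case_mean + pseudocount)
                                 / (ctrl_mean + pseudocount))
    return result


# Hypothetical counts for two sham and two myocardial injury (MI) samples.
conditions = ["sham", "sham", "MI", "MI"]
counts = {
    "GeneUp": [100, 100, 400, 400],
    "GeneDown": [400, 400, 100, 100],
    "GeneFlat": [500, 500, 500, 500],
}
lfc = log2_fold_changes(counts, conditions, case="MI", control="sham")
print({gene: round(value, 3) for gene, value in lfc.items()})
```

A fold change alone says nothing about significance: with few replicates a large value can arise by chance, which is why dedicated packages are used for the actual test.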

Differential gene expression analysis

Differential gene expression: there can be too many genes

~2000 significant genes

Pathway analysis: gene set enrichment analysis

Epithelial–mesenchymal transition pathway
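The running-sum statistic behind an enrichment plot can be sketched as follows. This is the unweighted, Kolmogorov-Smirnov-like variant chosen for brevity: standard GSEA additionally weights each hit by the gene's ranking statistic and estimates significance by permutation, neither of which is shown here. Gene names and the function name are hypothetical.

```python
from typing import List, Set


def enrichment_score(ranked_genes: List[str], gene_set: Set[str]) -> float:
    """Maximum deviation from zero of the unweighted running sum."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up at pathway genes, step down at all other genes.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Genes ranked from most up-regulated to most down-regulated.
ranked = [f"g{i}" for i in range(1, 11)]
top_set = {"g1", "g2", "g4"}      # concentrated at the top of the ranking
bottom_set = {"g8", "g9", "g10"}  # concentrated at the bottom

es_top = enrichment_score(ranked, top_set)
es_bottom = enrichment_score(ranked, bottom_set)
print(round(es_top, 3), round(es_bottom, 3))
```

A positive score means the pathway genes cluster among the up-regulated genes, a negative score means they cluster among the down-regulated ones, and a score near zero means they are spread evenly through the ranking.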

Gene set enrichment analysis table

MSigDB

Reactome

Enrichr pathways

Gene Ontology

Pathway databases

Do QC and visualize data to check that biology worked

Differential gene expression results can be hard to interpret directly

Pathway analysis lets us go from single genes to molecular pathways, which are much more robust and interpretable

Summary (3)

RNA-seq is a tool, not the result

Generate hypotheses based on RNA-seq data

Validate hypotheses experimentally

Epilogue: RNA-seq is the beginning of beautiful biology

Annual Systems Biology Workshop (http://bioinf.me/sbw)

International master program “Bioinformatics and Systems Biology” (https://vk.com/bioinf_itmo)
