rna-seq analysis - bioinformaticsinstitute.rubioinformaticsinstitute.ru/...rna-seq_analysis... ·...

RNA-seq analysis Alexey Sergushichev July 30th, MIPT


Post on 01-Aug-2020


Page 1: RNA-seq analysis - bioinformaticsinstitute.rubioinformaticsinstitute.ru/...rna-seq_analysis... · RNA-seq analysis in R: from an expression table to pathway analysis Overview of the

RNA-seq analysis

Alexey Sergushichev

July 30th, MIPT

Page 2:

Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis: from an expression table to pathway analysis

Overview of the lecture

Page 3:

Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis in R: from an expression table to pathway analysis

• Getting publicly available expression tables

• Doing differential expression

• Doing pathway analysis

Practical session

Page 4:

Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis in R: from an expression table to pathway analysis

Overview of the lecture

Page 5:

Central dogma of molecular biology

Page 6:

Central dogma of molecular biology

Information storage

Function

Page 7:

Mass-spectrometry-based high-throughput proteomics:

• Measures what really matters: proteins

• Measures thousands of proteins simultaneously, but not all of them

• Complex instrument: costly maintenance, complex raw data, complex experimental design, complex analysis

Ask Pavel Synitcyn for more about proteomics

Measuring proteins: proteomics

Page 8:

Central dogma of molecular biology

Information storage

Function

Measuring RNA as a proxy to protein

Page 9:

Correct central dogma of molecular biology

Measuring RNA as a proxy to protein

Page 10:

RNA-seq = RNA->cDNA + DNA-sequencing

http://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf

Page 11:

FASTQ format has four lines per read: name, sequence, comment, base call qualities

Raw RNA-sequencing data: FASTQ files

@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:
NGGGCCAAAGGAGCTTTCAAGGAGAGAAAGAGAAGAAATAGAGAAGCAAA
+
#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD

Fields of the name line, in order:

• instrument name (HW-ST997)

• run number (532)

• flowcell ID (h8um1adxx)

• lane number (1)

• tile number (1101)

• X coordinate of cluster (2141)

• Y coordinate of cluster (1965)

• read number, single/paired (1)

• filter flag, Y: filtered, N: not (N)

• control number (0)

The fourth line holds the base call qualities.

https://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm
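The record on this slide can be taken apart programmatically. The following is a minimal sketch, not a full FASTQ parser: the function and field names are made up for illustration, and the quality decoding assumes the Phred+33 encoding, which the slide does not state.

```python
from typing import NamedTuple


class ReadName(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read_number: int
    is_filtered: bool
    control_number: int


def parse_read_name(line: str) -> ReadName:
    """Split an Illumina-style FASTQ name line into its fields."""
    location, _, flags = line.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    # Anything after the control number is ignored (it is empty in this example).
    read_number, filtered, control = flags.split(":")[:3]
    return ReadName(instrument, int(run), flowcell, int(lane), int(tile),
                    int(x), int(y), int(read_number), filtered == "Y",
                    int(control))


def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a quality string; the offset of 33 is an assumption (Phred+33)."""
    return [ord(ch) - offset for ch in quality]


name = parse_read_name("@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:")
quals = phred_scores("#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD")
print(name.flowcell_id, name.tile, name.is_filtered)
print(quals[:5])  # [2, 16, 28, 35, 35]
```

The low score of the first base matches the N call at the start of the sequence.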

Page 12:

model organism: with good reference genome

non-model organism: with no/poor reference genome

Two distinct types of RNA-seq

Well studied Not so much

Page 13:

Human: chr1-22, chrX, chrY, chrM

• 3235 Mb, 19815 genes

Mouse: chr1-19, chrX, chrY, chrM

• 2718 Mb, 21971 genes

Assembly is mostly complete, but not 100%: there are unplaced scaffolds and gaps

There are genes in unplaced/unlocalized sequence, which could be important

Well-defined genomes

http://www.slideshare.net/hhalhaddad/the-human-genome-project-part-iii

Page 14:

Human:

• UCSC notation (hg19, hg38)

• Genome Reference Consortium notation (major: GRCh37, minor: GRCh38.p7)

• 1000 genomes notation (b37)

Mouse – same (mm10, GRCm37)

In RNA-seq always use the latest primary assembly:

• hg38/GRCh38 for human

• mm10/GRCm38 for mouse

Popular genome assemblies

Page 15:

Primary assembly: the best known assembly of a haploid genome.

• Chromosome assembly: a sequence with known physical location (e.g. according to a physical map).

• Unlocalized sequence: a sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.

• Unplaced sequence: a sequence found in an assembly that is not associated with any chromosome.

Genome Reference Consortium terminology

https://www.ncbi.nlm.nih.gov/grc/help/definitions/

http://lh3lh3.users.sourceforge.net/humanref.shtml

Page 16:

Unlocalized/unplaced sequences and patches can contain genes!

Page 17:

rRNA – ribosomal RNA: 80% of the cell RNA

tRNA – transfer RNA: 15% of the cell RNA

mRNA – messenger RNA for protein coding genes

Other RNAs: miRNA, lncRNA, …

Some RNAs are short (tRNA, miRNA, …) and are not captured by standard RNA-seq

Main types of RNAs

https://www.ncbi.nlm.nih.gov/books/NBK21729/

Page 18:

Protein-coding RNA

Estimated 10^5 to 10^6 mRNA molecules per animal cell, with a high dynamic range across genes: from several copies to 10^4

mRNA levels correlate with protein levels

mRNA

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3129258/

https://bionumbers.hms.harvard.edu/bionumber.aspx?id=111220

Page 19:

polyA selection: most standard, relatively cheap and easy protocol, selects mRNAs and some non-coding RNAs

riboZero: depletes rRNA, works better for degraded RNA, captures all long RNAs

Two main approaches for RNA selection

Page 20:

DNA is transcribed a lot, giving multiple types of RNA

Some encode proteins, some do not

Set of transcripts with a similar function = gene

For a canonical protein-coding gene, transcripts = isoforms

What’s a gene?

Page 21:

RefSeq – very conservative

ENSEMBL/Gencode – very inclusive

Practical definition of a gene: genome annotation

Page 22:

Using Gencode is the most practical

https://www.gencodegenes.org

Page 23:

We have raw data: FASTQ-files

We have genome reference and genome annotation

Summary (1)

Page 24:

Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis in R: from an expression table to pathway analysis

Overview of the lecture

Page 25:

Designed for DNA-seq, so “bad” is not always bad

QCFail: https://sequencing.qcfail.com/

Quality control: FastQC

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

Page 26:

Alignment

• HISAT2

• STAR

• bowtie/bowtie2

Counting

• featureCounts

• htseq

• mmquant

• RSEM

Alignment + counting pipeline

Page 27:

Typical SAM file with alignment

Annotated fields: bitwise SAM flag, mapping quality, CIGAR string
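The two least self-explanatory SAM fields named on this slide, the bitwise flag and the CIGAR string, can be decoded with a few lines. This is a minimal sketch: the bit meanings and CIGAR operations follow the SAM format specification, while the function names and the example values (flag 99, CIGAR 20M1000N30M) are hypothetical and not taken from the slide's figure.

```python
import re

# Bit meanings of the SAM FLAG field (SAM format specification).
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list[str]:
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar: str) -> int:
    """Bases covered on the reference: M, D, N, = and X consume reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("20M1000N30M"))     # N marks a skipped region, e.g. an intron
print(reference_span("20M1000N30M"))  # 1050
```

In RNA-seq alignments the N operation is what makes a read "spliced": the 50 sequenced bases in this example stretch over 1050 reference bases.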

Page 28:

Expectations for genomic RNA-seq alignment

All reads:

• Not mapped anywhere: 2-10%

• Uniquely mapped: 70%

• Multimapped 2-15 times: 10-20%

• Multimapped >15 times: ~5%

Page 29:

Generate read coverage and visualize in a genome browser

Check alignment rate

Check library strategy (next slides)

Useful to check ribosomal RNA content as well

Tools: RSeQC, Picard/CollectRnaSeqMetrics, QoRTs

RNA-seq quality control

Page 30:

Library strategies: single-end vs paired-end

https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html

Page 31:

Library strategies: stranded vs unstranded

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393

Page 32:

Library strategies: 3’- or 5’- specific, full-length

https://bitesizebio.com/13559/ngs-quality-control-in-rna-sequencing-some-free-tools/

Page 33:

For a stranded experiment, we can distinguish between two different genes if they are on opposite strands

For a non-stranded experiment, we can’t

htseq-count discards reads with 2 or more features (ambiguous)

~50% assignment rate is normal

FeatureCounts
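The effect of strandedness on read assignment can be shown with a toy example. This is a sketch under stated assumptions, not the logic of featureCounts or htseq-count: genes are plain intervals, the gene names and coordinates are hypothetical, and the read strand is assumed to equal the gene strand (some library protocols report the opposite strand).

```python
from typing import NamedTuple, Optional, Sequence


class Gene(NamedTuple):
    name: str
    start: int   # inclusive
    end: int     # inclusive
    strand: str  # "+" or "-"


def assign_read(start: int, end: int, strand: str,
                genes: Sequence[Gene], stranded: bool) -> Optional[str]:
    """Return the single overlapping gene, or None for 0 or 2+ candidates."""
    hits = [
        g.name for g in genes
        if g.start <= end and start <= g.end       # intervals overlap
        and (not stranded or g.strand == strand)   # strand is checked only if stranded
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical overlapping genes on opposite strands.
genes = [Gene("A", 100, 500, "+"), Gene("B", 300, 800, "-")]

print(assign_read(350, 400, "-", genes, stranded=True))   # B
print(assign_read(350, 400, "-", genes, stranded=False))  # None (ambiguous)
```

The same read is counted for gene B in the stranded case and discarded as ambiguous in the non-stranded case.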

Page 34:

>5M assigned reads are required for a typical analysis, thus there should be >10M raw reads

Usually it’s better to increase the number of biological replicates instead of library depth

Library depth
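The step from 5M assigned reads to 10M raw reads is a simple division. The sketch below assumes that the ~50% assignment rate mentioned for read counting applies to the raw reads as a whole; the function name is made up for illustration.

```python
import math


def raw_reads_needed(assigned_target: int, assignment_rate: float) -> int:
    """Raw reads needed so that assigned_target reads end up assigned to genes."""
    if not 0 < assignment_rate <= 1:
        raise ValueError("assignment_rate must be in (0, 1]")
    return math.ceil(assigned_target / assignment_rate)


print(raw_reads_needed(5_000_000, 0.5))  # 10000000
```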

Page 35:

Gene expression table

Page 36:

Very fast pseudo-alignment

No SAM/BAM output

Transcript-level quantification

Expectation-maximization for counting multimappers/ambiguous reads

Very useful for reprocessing of public datasets

Kallisto
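The expectation-maximization idea for ambiguous reads can be sketched on a toy input. This is not kallisto's implementation: it assumes all transcripts have the same effective length, and the function name, transcript names and read counts are hypothetical. Reads are grouped by the set of transcripts they are compatible with.

```python
def em_abundances(eq_classes: dict, n_iter: int = 100) -> dict:
    """eq_classes maps a tuple of compatible transcripts to a read count."""
    transcripts = sorted({t for ts in eq_classes for t in ts})
    total = sum(eq_classes.values())
    # Start from equal abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = dict.fromkeys(transcripts, 0.0)
        for ts, count in eq_classes.items():
            norm = sum(theta[t] for t in ts)
            for t in ts:
                # E-step: split reads in proportion to current abundances.
                expected[t] += count * theta[t] / norm
        # M-step: new abundances are the expected read fractions.
        theta = {t: expected[t] / total for t in transcripts}
    return theta


# 30 reads unique to t1, 10 unique to t2, 40 compatible with both.
theta = em_abundances({("t1",): 30, ("t2",): 10, ("t1", "t2"): 40})
print({t: round(v, 3) for t, v in theta.items()})  # {'t1': 0.75, 't2': 0.25}
```

The 40 ambiguous reads end up split 3:1, following the ratio of the uniquely assigned reads, instead of being discarded.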

Page 37:

FPKM = Fragments Per Kilobase of gene per Million

• First normalize to library depth, then to gene length

TPM = Transcripts Per Million

• First normalize to effective transcript length, then to library depth

• Sum of all TPMs is one million

• Works well with isoforms, ~proportional to concentrations

Effective length for 3’-seq is the same for all genes/transcripts

Units: FPKM vs TPM

https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
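The difference in the order of normalization can be made concrete with a small calculation. This is a minimal sketch with hypothetical counts and lengths; it uses plain lengths in base pairs where a real pipeline would use effective lengths.

```python
def fpkm(counts, lengths_bp):
    """Normalize to library depth (per million), then to length (per kilobase)."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (length / 1e3)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Normalize to length (per kilobase), then rescale so the sum is one million."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


counts = [10, 20, 30]          # hypothetical fragment counts for three genes
lengths = [1000, 2000, 1000]   # hypothetical lengths in bp

print([round(x) for x in fpkm(counts, lengths)])  # [166667, 166667, 500000]
print([round(x) for x in tpm(counts, lengths)])   # [200000, 200000, 600000]
```

TPM values always sum to one million, while the sum of FPKM values depends on the length distribution of the expressed genes, which makes TPM easier to compare between samples.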

Page 38:

FeatureCounts vs Kallisto: similar but different

Page 39:

Generate a single report for many tools:

• fastqc

• hisat2

• rseqc

• kallisto

• …

MultiQC

Page 40:

polyA vs riboZero:

• If interested only in protein-coding genes, use polyA selection, unless RNA quality is bad

Library strategy:

• Always stranded if possible

• Paired-end and full-length for isoform quantification

• Single-end 3’ is enough for simple mRNA analysis; no bias for transcript length

Library preparation recap

Page 41:

Went from raw data to gene expression tables

• Alignment + quantification pipeline

• Alignment-free analysis with kallisto

QC for every step

Summary so far

Page 42:

Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis: from an expression table to pathway analysis

Overview of the lecture

Page 43:

Two biological conditions: heart after myocardial injury or sham treatment

Looking for individual genes differentially expressed between conditions

Looking for molecular processes differentially regulated between conditions

The simplest experiment design

https://www.ahajournals.org/doi/full/10.1161/CIRCULATIONAHA.117.028252

Page 44:

Quality control: principal component analysis (PCA)

Page 45:

Quality control: principal component analysis (PCA)

Outlier?
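How an outlier shows up on the first principal component can be sketched without any libraries. This is an illustration only: it finds the first component by power iteration on centered data, the expression values are hypothetical log-scale numbers, and a real analysis would use an established PCA routine on many more genes.

```python
import math


def pc1_scores(samples, n_iter=100):
    """Scores of each sample on the first principal component.

    samples[i][g] is the log-expression of gene g in sample i.
    Assumes the samples are not all identical.
    """
    n_genes = len(samples[0])
    means = [sum(s[g] for s in samples) / len(samples) for g in range(n_genes)]
    centered = [[s[g] - means[g] for g in range(n_genes)] for s in samples]
    v = [float(g + 1) for g in range(n_genes)]  # arbitrary non-zero start
    for _ in range(n_iter):
        # One power-iteration step: v <- X^T (X v), then normalize.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(scores[i] * centered[i][g] for i in range(len(samples)))
             for g in range(n_genes)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Three similar samples and one that differs strongly (hypothetical values).
samples = [
    [5.0, 5.0, 5.0],
    [5.1, 4.9, 5.0],
    [5.0, 5.1, 4.9],
    [9.0, 1.0, 5.0],
]
scores = pc1_scores(samples)
outlier = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(outlier)  # 3
```

The sign of the scores is arbitrary, so the sketch compares absolute values.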

Page 46:

Multiple pipelines

• DESeq2 – the easiest one

• edgeR

• limma+voom – faster and better for large datasets

Differential gene expression analysis

Page 47:

Differential gene expression: there can be too many genes

~2000 significant genes

Page 48:

Pathway analysis: gene set enrichment analysis

Epithelial–mesenchymal transition pathway
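The running-sum statistic behind gene set enrichment analysis can be sketched in a few lines. This is the unweighted variant, in which every gene in the set adds the same step; standard GSEA weights the steps by the ranking statistic and adds a significance estimate. The function name and gene names are hypothetical, and the gene set is assumed to contain some but not all of the ranked genes.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of the running sum over the ranked list."""
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up for a gene in the set, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Genes ranked from most up-regulated to most down-regulated (hypothetical).
ranked = [f"g{i}" for i in range(1, 11)]

print(round(enrichment_score(ranked, {"g1", "g2", "g4"}), 3))  # 0.857
print(round(enrichment_score(ranked, {"g9", "g10"}), 3))       # -1.0
```

A set concentrated at the top of the ranking gives a positive score and a set concentrated at the bottom gives a negative one.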

Page 49:

Gene set enrichment analysis table

Page 50:

MSigDB

Reactome

KEGG

Enrichr pathways

Gene Ontology

Pathway databases

Page 51:

Do QC and visualize data to check that biology worked

Differential gene expression results can be hard to interpret directly

Pathway analysis lets us go from single genes to molecular pathways, which are much more robust and interpretable

Summary (3)

Page 52:

RNA-seq is a tool, not the result

Generate hypotheses based on RNA-seq data

Validate hypotheses experimentally

Epilogue: RNA-seq is the beginning of a beautiful biology

Page 53:

Annual Systems Biology Workshop (http://bioinf.me/sbw)

International master program “Bioinformatics and Systems Biology” (https://vk.com/bioinf_itmo)

Advertisement
