rna-seq analysis - bioinformaticsinstitute.rubioinformaticsinstitute.ru/...rna-seq_analysis... ·...

Post on 01-Aug-2020


RNA-seq analysis

Alexey Sergushichev

July 30th, MIPT

Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis: from an expression table to pathway analysis

Overview of the lecture


Intro to RNA-sequencing

RNA-seq quantification: from raw data to gene expression table

RNA-seq analysis in R: from an expression table to pathway analysis

• Getting publicly available expression tables

• Doing differential expression

• Doing pathway analysis

Practical session



Central dogma of molecular biology

Figure labels: information storage; function

Mass-spectrometry based high-throughput proteomics:

• Measures what really matters: proteins

• Measures thousands of proteins simultaneously, but not all

• Complex instrument: costly maintenance of the instrument, complex raw data, complex experimental design, complex analysis

Ask Pavel Synitcyn for more about proteomics

Measuring proteins: proteomics

Central dogma of molecular biology

Figure labels: information storage; function

Measuring RNA as a proxy to protein

Correct central dogma of molecular biology

Measuring RNA as a proxy to protein

RNA-seq = RNA->cDNA + DNA-sequencing

http://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf

FASTQ format has four lines per read: name, sequence, comment, base call qualities

Raw RNA-sequencing data: FASTQ files

@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:

NGGGCCAAAGGAGCTTTCAAGGAGAGAAAGAGAAGAAATAGAGAAGCAAA

+

#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD

Fields of the name line, in order:

• instrument name (HW-ST997)

• run number (532)

• flowcell ID (h8um1adxx)

• lane number (1)

• tile number (1101)

• X coordinate of cluster (2141)

• Y coordinate of cluster (1965)

• read number, single/paired (1)

• filter flag: Y – filtered, N – not (N)

• control number (0)

The fourth line holds the base call qualities

https://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm
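The name line and the quality line of the record above can be unpacked with a few lines of Python. This is only a sketch: the helper names `parse_read_name` and `phred_scores` are made up for illustration, and the quality decoding assumes the Phred+33 encoding, which the slide does not state.

```python
from typing import List, NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read_number: int
    is_filtered: bool
    control_number: int


def parse_read_name(line: str) -> IlluminaReadName:
    """Split an Illumina-style FASTQ name line into its fields."""
    location, _, flags = line.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    # Anything after the third colon (here an empty field) is ignored.
    read_number, filtered, control = flags.split(":")[:3]
    return IlluminaReadName(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x), int(y), int(read_number), filtered == "Y", int(control),
    )


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string (assumption: Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]


# The four lines of the record shown on the slide.
record = [
    "@HW-ST997:532:h8um1adxx:1:1101:2141:1965 1:N:0:",
    "NGGGCCAAAGGAGCTTTCAAGGAGAGAAAGAGAAGAAATAGAGAAGCAAA",
    "+",
    "#1=DDFFFHFHDHIJIJJJIIJGIGFGHIJIIGGIJJJJIIIIFHID9BD",
]

name = parse_read_name(record[0])
quals = phred_scores(record[3])
print(name)
print(quals[:5])
```

Under the Phred+33 assumption, the leading `#` decodes to a quality of 2, which matches the uncalled base `N` at the start of the read.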

model organism: with good reference genome

non-model organism: with no/poor reference genome

Two distinct types of RNA-seq

Model organisms: well studied; non-model organisms: not so much

Human: chr1-22, chrX, chrY, chrM

• 3235 Mb, 19815 genes

Mouse: chr1-19, chrX, chrY, chrM

• 2718 Mb, 21971 genes

Assembly is mostly complete, but not 100%: there are unplaced scaffolds and gaps

There are genes in unplaced/unlocalized sequence, which could be important

Well-defined genomes

http://www.slideshare.net/hhalhaddad/the-human-genome-project-part-iii

Human:

• UCSC notation (hg19, hg38)

• Genome Reference Consortium notation (major: GRCh37, minor: GRCh38.p7)

• 1000 genomes notation (b37)

Mouse – same (mm10, GRCm38)

In RNA-seq always use the latest primary assembly:

• hg38/GRCh38 for human

• mm10/GRCm38 for mouse

Popular genome assemblies


Primary assembly: the best known assembly of a haploid genome.

• Chromosome assembly: a sequence with known physical location (e.g. according to a physical map).

• Unlocalized sequence: a sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome.

• Unplaced sequence: a sequence found in an assembly that is not associated with any chromosome.

Genome Reference Consortium terminology

https://www.ncbi.nlm.nih.gov/grc/help/definitions/

http://lh3lh3.users.sourceforge.net/humanref.shtml

Unlocalized/unplaced sequences and patches can contain genes!

rRNA – ribosomal RNA: 80% of the cell RNA

tRNA – transfer RNA: 15% of the cell RNA

mRNA – messenger RNA for protein coding genes

Other RNAs: miRNA, lncRNA, …

Some of the RNAs are short: tRNA, miRNA, … and are not getting into normal RNA-seq

Main types of RNAs

https://www.ncbi.nlm.nih.gov/books/NBK21729/

Protein-coding RNA

Estimated 10^5 to 10^6 mRNA molecules per animal cell with high dynamic range for genes: from several copies to 10^4

mRNA levels correlate with protein levels

mRNA

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3129258/

https://bionumbers.hms.harvard.edu/bionumber.aspx?id=111220

polyA selection: most standard, relatively cheap and easy protocol, selects mRNAs and some non-coding RNAs

riboZero: depletes rRNA, works better for degraded RNA, captures all long RNAs

Two main approaches for RNA selection

DNA is transcribed a lot, giving multiple types of RNA

Some encode proteins, some do not

Set of transcripts with a similar function = gene

For a canonical protein-coding gene, transcripts = isoforms

What’s a gene?


RefSeq – very conservative

ENSEMBL/Gencode – very inclusive

Practical definition of a gene: genome annotation

Using Gencode is the most practical

https://www.gencodegenes.org

We have raw data: FASTQ-files

We have genome reference and genome annotation

Summary (1)



Designed for DNA-seq, so “bad” is not always bad

QCFail: https://sequencing.qcfail.com/

Quality control: FastQC

https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

Alignment

• HISAT2

• STAR

• bowtie/bowtie2

Counting

• featureCounts

• htseq

• mmquant

• RSEM

Alignment + counting pipeline


Typical SAM file with alignment

Figure labels: bitwise SAM flag; mapping quality; CIGAR string
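Two of the SAM fields named on this slide are compact encodings, so a small decoder may help. The sketch below follows the bit meanings and CIGAR operations of the SAM format. The function names and the example values (flag 99, CIGAR `20M1000N30M`) are made up for illustration and do not come from the slide.

```python
import re
from typing import List, Tuple

# Meaning of each bit in the bitwise SAM flag.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: M, D, N, = and X consume reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("20M1000N30M"))
print(reference_span("20M1000N30M"))
```

The `N` operation (skipped region on the reference) is how spliced aligners such as HISAT2 and STAR report a read that spans an intron.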

Expectations for genomic RNA-seq alignment

All reads:

• Not mapped anywhere: 2-10%

• Uniquely mapped: 70%

• Multimapped 2-15 times: 10-20%

• Multimapped >15 times: ~5%

Generate read coverage and visualize in a genome browser

Check alignment rate

Check library strategy (next slides)

Useful to check ribosomal RNA content as well

Tools: RSeQC, Picard/CollectRnaSeqMetrics, QoRTs

RNA-seq quality control


Library strategies: single-end vs paired-end

https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html

Library strategies: stranded vs unstranded

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393

Library strategies: 3’- or 5’- specific, full-length

https://bitesizebio.com/13559/ngs-quality-control-in-rna-sequencing-some-free-tools/

For a stranded experiment, we can distinguish between two different genes if they are on opposite strands

For a non-stranded experiment, we can’t

htseq-count discards reads that overlap 2 or more features (ambiguous)

~50% assignment rate is normal

FeatureCounts
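The effect of strandedness on read assignment can be shown with a toy counter. This is a minimal sketch, not the actual logic of featureCounts or htseq-count: the `Gene` and `Read` types, the coordinates and the gene names are hypothetical, and the read strand is assumed to be already converted to the strand of the originating transcript.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # inclusive
    end: int    # inclusive
    strand: str


class Read(NamedTuple):
    start: int
    end: int
    strand: str  # assumption: strand of the originating transcript


def assign(read: Read, genes: List[Gene], stranded: bool) -> str:
    """Return the gene name, 'ambiguous' or 'no_feature' for one read."""
    hits = [
        g.name
        for g in genes
        if g.start <= read.end and read.start <= g.end   # intervals overlap
        and (not stranded or g.strand == read.strand)    # strand filter
    ]
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"  # such reads are discarded from the counts
    return hits[0]


# Two hypothetical genes overlapping on opposite strands.
genes = [Gene("A", 100, 500, "+"), Gene("B", 400, 800, "-")]
read = Read(420, 470, "+")

print(assign(read, genes, stranded=True))
print(assign(read, genes, stranded=False))
```

With the stranded library the read is assigned to gene A; without strand information it overlaps both genes and is lost as ambiguous.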

>5M assigned reads are required for a typical analysis, thus there should be >10M raw reads

Usually it’s better to increase the number of biological replicates instead of library depth

Library depth

Gene expression table


Very fast pseudo-alignment

No sam/bam output

Transcript level quantification

Expectation-maximization for counting multimappers/ambiguous reads

Very useful for reprocessing of public datasets

Kallisto
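The expectation-maximization idea for ambiguous reads can be illustrated on a toy example. This sketch is not kallisto's implementation. It assumes all transcripts have the same effective length, and the function name `em_abundances` and the read counts are invented for illustration.

```python
from typing import List, Sequence, Tuple


def em_abundances(
    compat: Sequence[Tuple[Tuple[int, ...], int]],
    n_transcripts: int,
    n_iter: int = 100,
) -> List[float]:
    """Estimate relative transcript abundances from compatibility classes.

    compat holds (transcript indices a read is compatible with, read count).
    Assumption: equal effective lengths, so abundance ~ share of reads.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each class between its transcripts
        # in proportion to the current abundance estimates.
        for transcripts, count in compat:
            total = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += count * theta[t] / total
        # M-step: renormalize the expected counts to proportions.
        n_reads = sum(expected)
        theta = [e / n_reads for e in expected]
    return theta


# 30 reads unique to transcript 0, 10 unique to transcript 1,
# 60 reads compatible with both.
classes = [((0,), 30), ((1,), 10), ((0, 1), 60)]
theta = em_abundances(classes, n_transcripts=2)
print(theta)  # converges towards 0.75 and 0.25
```

The 60 ambiguous reads end up split 3:1, the same ratio as the uniquely assigned reads, instead of being discarded.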

FPKM = Fragments per Kilobase of gene per Million

• First normalize to library depth, then to gene length

TPM = Transcripts per Million

• First normalize to effective transcript length, then to library depth

• Sum of all TPMs is one million

• Works well with isoforms, ~proportional to concentrations

Effective length for 3’-seq is the same for all genes/transcripts

Units: FPKM vs TPM

https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
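The two normalization orders can be written out directly. The sketch below uses made-up counts and lengths for three genes, and the function names are illustrative only.

```python
from typing import List, Sequence


def fpkm(counts: Sequence[float], lengths_bp: Sequence[float]) -> List[float]:
    """First normalize to library depth (millions), then to length (kb)."""
    per_million = sum(counts) / 1e6
    return [c / per_million / (length / 1e3)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts: Sequence[float], eff_lengths_bp: Sequence[float]) -> List[float]:
    """First normalize to effective length (kb), then rescale to one million."""
    rates = [c / (length / 1e3) for c, length in zip(counts, eff_lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


# Hypothetical fragment counts and lengths for three genes.
counts = [10, 20, 30]
lengths = [1000, 2000, 500]

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(fpkm_values)
print(tpm_values)
print(sum(tpm_values))  # one million, up to floating-point rounding
```

TPM values always sum to one million within a sample, while the sum of FPKM values depends on the length distribution of the expressed genes.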

FeatureCounts vs Kallisto: similar but different


Generate a single report for many tools:

• fastqc

• hisat2

• rseqc

• kallisto

• …

MultiQC


polyA vs riboZero:

• If interested only in protein-coding genes, then polyA-selection, unless RNA quality is bad

Library strategy:

• Always stranded if possible

• Paired and full-length for isoform quantification

• Single-end 3’ is enough for simple mRNA analysis, no bias for transcript length

Library preparation recap

Went from raw data to gene expression tables

• Alignment + quantification pipeline

• Alignment-free analysis with kallisto

QC for every step

Summary so far



Two biological conditions: heart after myocardial injury or sham treatment

Looking for individual genes differentially expressed between conditions

Looking for molecular processes differentially regulated between conditions

The simplest experiment design

https://www.ahajournals.org/doi/full/10.1161/CIRCULATIONAHA.117.028252

Quality control: principal component analysis (PCA)

Outlier?

Multiple pipelines

• DESeq2 – the easiest one

• edgeR

• Limma+voom – faster and better for large datasets

Differential gene expression analysis

Differential gene expression: there can be too many genes

~2000 significant genes

Pathway analysis: gene set enrichment analysis

Epithelial–mesenchymal transition pathway
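The enrichment plot is built from a running sum over the ranked gene list. The sketch below computes that running-sum enrichment score in the weighted form commonly used for gene set enrichment analysis. Gene names, scores and the gene set are invented, and the permutation step that produces p-values is left out.

```python
from typing import List, Sequence, Set


def enrichment_score(
    ranked_genes: Sequence[str],
    scores: Sequence[float],
    gene_set: Set[str],
    weight: float = 1.0,
) -> float:
    """Maximum deviation from zero of the weighted running sum.

    ranked_genes must be sorted by score, from most up- to most down-regulated.
    """
    hits = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(hits)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for w, is_hit in zip(hit_weights, hits):
        # Step up at genes from the set, step down at all other genes.
        running += w / total_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking by a differential expression statistic.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
stat = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(ranked, stat, {"g1", "g2"})
es_bottom = enrichment_score(ranked, stat, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gets a score close to 1, a set concentrated at the bottom gets a score close to -1.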

Gene set enrichment analysis table


MSigDB

Reactome

KEGG

Enrichr pathways

Gene Ontology

Pathway databases


Do QC and visualize data to check that biology worked

Differential gene expression results can be hard to interpret directly

Pathway analysis allows going from single genes to molecular pathways, which are much more robust and interpretable

Summary (3)

RNA-seq is a tool, not the result

Generate hypotheses based on RNA-seq data

Validate hypotheses experimentally

Epilogue: RNA-seq is the beginning of a beautiful biology

Annual Systems Biology Workshop

(http://bioinf.me/sbw)

International master's program “Bioinformatics and Systems Biology” (https://vk.com/bioinf_itmo)

Advertisement