Download - small RNAseq data analysis - INRA
![Page 1: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/1.jpg)
small RNAseq data analysis
Philippe Bardou, Christine Gaspin, Jérôme Mariette & Olivier Rué
![Page 2: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/2.jpg)
2
Introduction to miRNA world and
sRNAseq
![Page 3: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/3.jpg)
3
Central dogma of molecular biology
• Evolution of the dogma : 1950-1970
DNA structure descovery.
DNA
One gene = one function
Replication
Translation
Transcription
RNA
Protein
![Page 4: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/4.jpg)
4
• Evolution of the dogma : 1970-1980
Genome analysis
DNA
Replication
Translation
Transcription
RNA
Protein
Reversetranscription
Splicing andedition events
Central dogma of molecular biology
![Page 5: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/5.jpg)
5
• Evolution of the dogma : aujourd'hui
Genome analysis + Sequencing
DNA
Replication
Translation
RNA
ProteinS
Reversetranscription
Splicing andedition events
RegulationGene formationEpigenesis
Central dogma of molecular biology
Transcription
Many genes = one functionnel complex
![Page 6: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/6.jpg)
6
• Evolution of the dogma : aujourd'hui
Genome analysis + Sequencing
DNA
Many genes = one functionnel complex
Replication
Translation
RNA
ProteinS
Reversetranscription
Splicing andedition events
RegulationGene formationEpigenesis
Central dogma of molecular biology
Transcription
![Page 7: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/7.jpg)
7
The RNA world
• An expending universe of RNA
« Regulatory » RNAs
- miRNA (development...)- siRNA (defense)- piRNA (epigenetic and post-transcriptional gene silencing in spermatogenesis)
- sRNA (adaptative responses in bacteria)
« Housekeeping » RNAs
- telomerase RNA (replication)- snRNA (maturation-splicing)- snoRNA, gRNA (modification, editing)
« Catalytic » RNAs
- Hairpin ribozyme- Hammerhead ribozyme- …
Non coding RNA
mRNA <Riboswitches>
RNA
→ Multiple roles of RNA in genes regulation
![Page 8: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/8.jpg)
8
The RNA world
• An expending universe of RNA
« Regulatory » RNAs
- miRNA (development...)- siRNA (defense)- piRNA (epigenetic and post-transcriptional gene silencing in spermatogenesis)
- sRNA (adaptative responses in bacteria)
« Housekeeping » RNAs
- telomerase RNA (replication)- snRNA (maturation-splicing)- snoRNA, gRNA (modification, editing)
« Catalytic » RNAs
- Hairpin ribozyme- Hammerhead ribozyme- …
Non coding RNA
mRNA <Riboswitches>
RNA
→ Multiple roles of RNA in genes regulation
![Page 9: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/9.jpg)
9
RNA background
• RNA folds on itself by base pairing :
• A with U : A-U, U-A
• C with G : G-C, C-G
• Sometimes G with U : U-G, G-U
• Folding = Secondary structure
• Structure related to function : ncRNA of the same family have a conserved structure
• Sequence less conserved
![Page 10: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/10.jpg)
10
RNA backgroundDifferent elementary motifs
![Page 11: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/11.jpg)
11
RNA backgroundExample: tRNA structure
![Page 12: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/12.jpg)
12
The non coding protein RNA world
• Not predicted by gene prediction• No specific signal (start, stop, splicing sites...)
• Multiple location (intergenic, intronic, coding, antisens)
• Variable size
• No strong sequence conservation in general
• A variety of existing approaches not always easy to integrate
• Known family: Homology prediction
• New family: De novo prediction
![Page 13: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/13.jpg)
13
• Large non coding protein RNA• >300 nt
• rRNA, tRNA, Xist, H19, ...
• Genome structure & expression
• Small non coding protein RNA• >30 nt
• snoRNA, snRNA...
• mRNA maturation, translation
• Micro non coding protein RNA• 18-30 nt
• miRNA, hc-siRNA, ta-siRNA, nat-siRNA, piRNA...
• PTGS, TGS, Genome stability, defense...
The non coding protein RNA world
![Page 14: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/14.jpg)
14
• Large non coding protein RNA• >300 nt
• rRNA, tRNA, Xist, H19, ...
• Genome structure & expression
• Small non coding protein RNA• >30 nt
• snoRNA, snRNA...
• mRNA maturation, translation
• Micro non coding protein RNA• 18-30 nt
• miRNA, hc-siRNA, ta-siRNA, nat-siRNA, piRNA...
• PTGS, TGS, Genome stability, defense...
The non coding protein RNA world
![Page 15: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/15.jpg)
15
The miRNA world
• Discovery of lin-4 in C. elegans in 1993
![Page 16: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/16.jpg)
16
The miRNA world
• A key regulation function
![Page 17: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/17.jpg)
17
The miRNA world
• Animals– Developmental timing (C. elegans): lin-4, let-7
– Neuronal left/right asymetry (C. elegans): Lys-6, mir-273
– Programmed cell death/fat metabolism (D. melanogaster): mir-14
– Notch signaling (D. malanogaster): mir-7
– Brain morphogenesis (Zebrafish): mir-430
– Myogeneses and cardiogenesis: mir-1, miR-181, miR-133
– Insulin secretion: miR-375
– …
• Plants– Floral timing and leaf development: miR-156
– Organ polarity, vascular and meristen development: mir-165, miR-166
– Expression of auxin response genes: miR-160
– ...
![Page 18: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/18.jpg)
18
The miRNA biogenesis
Pol II transcription Into a pri-miRNA
![Page 19: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/19.jpg)
19
The miRNA biogenesis
Drosha processing one or more pre-miRNAExported in the cytoplasm
![Page 20: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/20.jpg)
20
The miRNA biogenesisDicer processing
Into a duplex miRNAStructure
![Page 21: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/21.jpg)
21
The miRNA biogenesis
One strand (miRNA)Incorporated into
RISC
![Page 22: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/22.jpg)
22
The miRNA biogenesistarget mRNA
translationally repressed
![Page 23: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/23.jpg)
23
The miRNA location
→ Cluster organisation
![Page 24: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/24.jpg)
24
A. E. Pasquinelli et al., Nature 408, 86-9 (2000)
The miRNA conservation
![Page 25: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/25.jpg)
25
How can we study miRNA ?
• RNAseq not suited for miRNA (protocol and size)
• small RNAseq: ability of high throughput sequencing to
• Interrogate known and new small RNAs
• Quantify them
• Profile them on a large number of samples
• Cost-effective
![Page 26: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/26.jpg)
26
small RNAseq platforms comparisons
Platform 454Roche Titanium
HiSeq2000Illumina
Solid 3+Life Technologies
Caracteristics -Titanium chemistry-Pyrosequencing-PCR amplification
- Polymerase-based sequence-by-synthesis-PCR amplification-Multiplexing
-ligation-base-sequencing-PCR amplification
Applications -De novo sequencing -Small genomes-Transcriptome
-Resequencing -Transcriptome -Epigenomic-Small RNA-Allele specific sequencing
-De novo sequencing-Resequencing-Transcriptome-Epigenomic-Small RNA
Paired end separation
Not used 200bp 200bp
Mb / run 800Mb 600Gb 60Gb
Read length 800 bp 100bp 50bp
Known Biases
- Long homopolymer - makes signal saturation- read duplication
- Rich GC or AT regions: under-representation during amplification- Most error in end of cycle
- read duplication ?
![Page 27: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/27.jpg)
27
small RNA-Seq library preparation
• Monophosphate presence in 5' extremity and OH presence in 3' extremity
Total RNA: contain all kinds of RNA species including miRNA, mRNA, tRNA, rRNA...
RNA with modified 3'-end will not ligate with 3' adapters. Only RNA with OH in 3'-end will ligate.
Ligate with 3' adapter
Ligate with 5' adapter
RT-PCR and Size Selection
MicroRNA sequencing library
Only RNA with monophosphate in 5'-end will ligate with 5' adapters.
CDNA containing both adapter sequences will be amplified. MicroRNA will be enriched from PCR and gel size selection.
![Page 28: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/28.jpg)
28
What are we looking for ?
• List of known miRNA
• List of new miRNA
• miRNA target(s)
• miRNA quantification
• Differential expression
![Page 29: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/29.jpg)
29
small RNAseq data analysis
![Page 30: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/30.jpg)
30
• Pre-miRNA information:
– Hairpin structure of the pre-miRNA
– Pre-miRNA localisation (coding/non coding TU intronic/exonic )
– Presence of cluster
– Size of the pre-miRNA
• miRNA and miRNA* information:
– Existence of both miRNA and miRNA*
– Sequence conservation
– Overhang (around 2 nt) related to drosha and Dicer cuts
– Size of miRNA and miRNA*
– Overexpression of the miRNA compared to the miRNA*
• Existence of other products in sRNAseq data
What should we retain for data analysis ?
![Page 31: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/31.jpg)
31
• 5 experiments (5 lanes, no multiplexing)• Different tissues, different stages
• No reference genome• Only scaffolds
Description of the dataset
95.803.789
45.156.426
48.013.44791.418.250
121.171.011
![Page 32: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/32.jpg)
32
Fastq format
![Page 33: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/33.jpg)
33
Line 1 starts with @
Fastq format
![Page 34: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/34.jpg)
34
Line 1 starts with @
Fastq format
Line 2 Raw sequence of 36 nt (36 cycles in sequencing)
![Page 35: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/35.jpg)
35
Line 1 starts with @
Fastq format
Line 2 Raw sequence of 36 nt (36 cycles in sequencing)
Line 3 starts with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
![Page 36: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/36.jpg)
36
Line 1 starts with @
Fastq format
Line 2 Raw sequence of 36 nt (36 cycles in sequencing)
Line 3 starts with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
![Page 37: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/37.jpg)
37
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 38: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/38.jpg)
38
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 39: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/39.jpg)
39
A simple way to do quality control. It provides a modular set of analyses to give a quick impression of whether data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are:- Import of data from BAM, SAM or FastQ files (any variant)- Provide a quick overview to tell you in which areas there may be problems- Summary graphs and tables to quickly assess your data- Export of results to an HTML based permanent report- Offline operation to allow automated generation of reports without running the interactive application
1. Quality control
• FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
Fastqc -o nf.out nf_in.fastq
![Page 40: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/40.jpg)
40
1. Quality control
• Per base quality
![Page 41: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/41.jpg)
41
1. Quality control
• Sequences content in nucleotides
![Page 42: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/42.jpg)
42
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 43: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/43.jpg)
Outputed reads
2. Why cleaning ?
![Page 44: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/44.jpg)
Outputed reads- Some sequences contain only adapters
2. Why cleaning ?
![Page 45: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/45.jpg)
Outputed reads- Some sequences contain only adapters- Some sequences contain sequences of interest flanked by the beginning of adapters:
- Some of them are miRNA (yellow).
2. Why cleaning ?
![Page 46: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/46.jpg)
Outputed reads- Some sequences contain only adapters- Some sequences contain sequences of interest flanked by the beginning of adapters:
- Some of them are miRNA (yellow).- Some of them are other type of
RNAs (green).
2. Why cleaning ?
![Page 47: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/47.jpg)
Outputed reads- Some sequences contain only adapters- Some sequences contain sequences of interest flanked by the beginning of adapters:
- Some of them are miRNA (yellow).- Some of them are other type of
RNAs (green). - Some adapters contain errors (blue).
2. Why cleaning ?
![Page 48: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/48.jpg)
Outputed reads- Some sequences contain only adapters- Some sequences contain sequences of interest flanked by the beginning of adapters:
- Some of them are miRNA (yellow).- Some of them are other type of
RNAs (green). - Some adapters contain errors (blue).
- Some sequences contain polyN (red)
2. Why cleaning ?
![Page 49: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/49.jpg)
49
2. Cleaning
• Adapters removing and length filtering
Cutadapt http://code.google.com/p/cutadapt/.
Cutadapt removes adapter sequences from high-throughput sequencing data. Indeed, reads are usually longer than the RNA, and therefore contain parts of the 3’ adapter. It also allows to keep only sequences of desired length (15<length<29).
cutadapt -a ATCTCGTATGCCGTCTTCTGCTTG -m 15 -M 29 -o nf_out.fg nf_in.fq
![Page 50: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/50.jpg)
50
2. Cleaning
• 56 % of reads discarded
![Page 51: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/51.jpg)
51
2. Cleaning
• Size in between 18bp:24bp
→ miRNA ?
![Page 52: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/52.jpg)
52
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 53: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/53.jpg)
53
• Removing identical reads– save computational time
– useless to keep all the read
– Keep the number of occurrence for each reads
fastqnr.pl sample.fq | sort -k1,1 > sample.matrix
3. Remove redundancy
…AAATGAATGATCTATGGACAGCA 2AAATGAATGATCTATGGACAGCAG 38AAATGAATGATCTATGGACAGCAGA 2AAATGAATGATCTATGGACAGCAGAAAG 1AAATGAATGATCTATGGACAGCAGC 51AAATGAATGATCTATGGACAGCAGCA 82AAATGAATGATCTATGGACAGCAGCAA 5AAATGAATGATCTATGGACAGCAGCAAA 2AAATGAATGATCTATGGACAGCAGCAAC 3AAATGAATGATCTATGGACAGCAGCAAG 57AAATGAATGATCTATGGACAGCAGCAG 2AAATGAATGATCTATGGACAGCCGC 1AAATGAATGATCTATGGACGGCAGCA 1...
![Page 54: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/54.jpg)
54
3. Remove redundancy
![Page 55: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/55.jpg)
55
3. Remove redundancymiRNA ? piRNA ?
• More differencies between piRNAs than with miRNAs ?
![Page 56: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/56.jpg)
56
Exercice 1:
– Quality control
– Cleaning
– Remove redundancy
![Page 57: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/57.jpg)
57
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 58: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/58.jpg)
58
• Computes an expression matrix
– Read must be at least in 2 samples if present less than 5 times
quatification.pl -i 2 -a 5 sample1.matrix sample2.matrix ... > quantification.matrix
3. Quantification
#seq s_1_uT21 s_1_uT2 s_1_uT4 s_2_uT22 s_2_uT5...AAAAGGGCTGTTTGTGCAGGCAG 87 14 0 85 5AAAAGGGCTGTTTGTGCAGGCAGA 1 0 0 1 0AAAAGGGCTGTTTGTGCAGGCAGG 1 0 0 2 0AAAAGGGCTGTTTGTGCAGGCAGT 1 0 0 3 0AAAAGGGCTGTTTGTGCAGGCAGTTT 0 0 0 0 1AAAAGGGCTGTTTGTGCAGGCAT 1 2 0 3 0AAAAGGGCTGTTTGTGCAGGCTA 0 0 0 1 0AAAAGGGCTGTTTGTGCAGGCTG 1 0 0 1 0AAAAGGGCTGTTTGTGCAGGCTT 1 0 0 0 0AAAAGGGCTGTTTGTGCAGGG 6 1 0 4 2AAAAGGGCTGTTTGTGCAGGGA 11 1 0 3 4AAAAGGGCTGTTTGTGCAGGGAG 88 9 0 62 11AAAAGGGCTGTTTGTGCAGGGAGC 1 0 0 0 0AAAAGGGCTGTTTGTGCAGGGAGCTGA 0 0 0 1 0AAAAGGGCTGTTTGTGCAGGGAGT 0 1 0 0 0AAAAGGGCTGTTTGTGCAGGGAGTT 0 0 0 1 0AAAAGGGCTGTTTGTGCAGGGAT 2 0 0 0 1AAAAGGGCTGTTTGTGCAGGGATT 1 0 0 0 0...
![Page 59: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/59.jpg)
59
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 60: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/60.jpg)
60
• Useful databases:
– miRbase (http://microrna.sanger.ac.uk/)• miRBase::Registry provides names to novel miRNA genes prior
to their publication.
• miRBase::Sequences provides miRNA sequence data, annotation, references and links to other resources for all published miRNAs.
• miRBase::Targets provides an automated pipeline for the prediction of targets for all published animal miRNAs.
4. Annotation
![Page 61: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/61.jpg)
61
• Useful databases:
– miRbase (http://microrna.sanger.ac.uk/)
– Rfam (http://rfam.sanger.ac.uk/) • A collection of RNA families
– Rfam 10.1, June 2011, 1973 families
• A track now included in the UCSC genome browser
• Be careful: also contains (not all) miRNA families
4. Annotation
![Page 62: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/62.jpg)
62
• Useful databases:
– miRbase (http://microrna.sanger.ac.uk/)
– Rfam (http://rfam.sanger.ac.uk/)
– Silva (http://www.arb-silva.de/) • A comprehensive on-line resource for quality checked and
aligned ribosomal RNA sequence data. – SSU (16S rRNA, 18S rRNA)– LSU (23S rRNA, 28S rRNA)– Eukarya, bacteria, archaea
4. Annotation
![Page 63: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/63.jpg)
63
• Useful databases:
– miRbase (http://microrna.sanger.ac.uk/)
– Rfam (http://rfam.sanger.ac.uk/)
– Silva (http://www.arb-silva.de/)
– GtRNAdb(http://gtrnadb.ucsc.edu/) • Contains tRNA gene predictions made by the program
tRNAscan-SE (Lowe & Eddy, Nucl Acids Res 25: 955-964, 1997) on complete or nearly complete genomes.
• All annotation is automated and has not been inspected for agreement with published literature.
4. Annotation
![Page 64: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/64.jpg)
64
4. Annotation
• Reads with multiple annotation
![Page 65: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/65.jpg)
65
4. Annotation
• Reads with multiple annotation
→ A lot of reads annotated with mirBase but also with tRNA and rRNA database
![Page 66: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/66.jpg)
Mir-739 or 28S rRNA ?
4. Annotation
• rRNA present in miRBase
![Page 67: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/67.jpg)
67
4. Annotation
Annotation occurences
![Page 68: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/68.jpg)
68
Exercice 2:
– Annotation
![Page 69: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/69.jpg)
69
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 70: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/70.jpg)
70
• Blat http://genome.ucsc.edu/cgi-bin/hgBlat
• Blast http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Gmap http://www.gene.com/share/gmap/
• Bowtie http://bowtie-bio.sourceforge.net/index.shtml
• BWA http://bio-bwa.sourceforge.net
• ...
5. Mapping the reads
![Page 71: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/71.jpg)
71
5. Mapping the reads with bwa
![Page 72: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/72.jpg)
72
5. Mapping the reads with bwa
• Reference sequence indexing:
bwa index -a bwtsw db.fasta
• Read alignment:
bwa aln db.fasta short_read.fastq > short_read.sai
• Formatting reads:
bwa samse db.fasta short_read.sai short_read.fastq > short_read.sam
![Page 73: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/73.jpg)
73
5. Mapping the reads with bwa
![Page 74: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/74.jpg)
74
5. Mapping the reads with bwa
![Page 75: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/75.jpg)
75
5. Mapping the reads with bwa
![Page 76: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/76.jpg)
5. Mapping the reads with bwa
• Alignement of annotated reads
![Page 77: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/77.jpg)
5. Mapping the reads with bwa
• Alignement of annotated reads
→ keep reads aligned the most at 4 positions with 0 or 1 error
![Page 78: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/78.jpg)
5. Mapping the reads with bwa
• Alignement of all reads
→ keep reads aligned the most at 4 positions with 0 or 1 error
![Page 79: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/79.jpg)
79
Exercice 3:
– Mapping the reads
![Page 80: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/80.jpg)
80
small RNAseq pipeline
Quality check
Cleaning
Remove redundancy + Quantification
Mapping Annotation
Prediction
New miRNAKnown miRNA
Quantification matrix
Known miRNAQuantification matrix
with reference
![Page 81: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/81.jpg)
81
6. Prediction
• Precise excision of a 21-22mer is typical of microRNA
• less represented reads are products of Dicer errors and sequencing/sample preparation artifacts
![Page 82: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/82.jpg)
82
6. Prediction
• Once the reads mapped
![Page 83: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/83.jpg)
83
6. Prediction
• Identify all contiguous read regions
![Page 84: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/84.jpg)
84
6. Prediction
• Identify all contiguous read regions
![Page 85: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/85.jpg)
85
• miRNA precursors have a characteristic secondary structure
– The detection of a microRNA* sequence, opposing the most frequent read in a stable hairpin (but shifted by 2 bases), is sufficient to diagnose a microRNA.
6. Prediction
![Page 86: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/86.jpg)
86
6. Prediction
• Extend and fold read regions
![Page 87: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/87.jpg)
87
6. Prediction
• Extend and fold read regions~ 100bp
![Page 88: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/88.jpg)
88
6. Prediction
• Extend and fold read regions~ 100bp
![Page 89: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/89.jpg)
89
6. Prediction
• Extend and fold read regions~ 100bp
• Stable hairpin structure shifted by 2 bases
• miRNA > miRNA*
![Page 90: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/90.jpg)
90
6. Prediction
• Extend and fold read regions
~ 100bp
![Page 91: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/91.jpg)
91
6. Prediction
• Extend and fold read regions
~ 100bp
• In the absence of reads corresponding to an expected miRNA*, additional checks on the structure are:
– Degree of pairing in the miRNA region
– Hairpin: around 70nt in length
– The secondary structure is significantly more stable than randomly shuffled versions of the same sequence
– miRNA cluster
![Page 92: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/92.jpg)
92
N N N N A NG NG NG NG NU N C N C – N G – N U – N A – N U – N C – N C – N U – N A – N C – N U – N G – N U – N U – N U – N C – N C – N C – N U – N U – N U N G N
N N N NN AN GN GN GN GN U N C N – C N – G N – U N – A N – U N – C N – C N – U N – A N – C N – U N – G N – U N – U N – U N – C N – C N – C N U N U U G
?
6. Prediction
• Which one should be used ?
![Page 93: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/93.jpg)
93
Exercice 4:
– Locus identification
![Page 94: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/94.jpg)
94
miRDeep
• Tool for identification of known and novel miRNA
• Animals– Friedländer MR, Chen W, Adamidi C, Maaskola J, Einspanier R, Knespel S, Rajewsky N.
(2008) Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 26(4), 407-15.
– Friedländer, M.R., Mackowiak, S.D., Li, N., Chen, W., and Rajewsky, N. 2011. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res.
• Tool for plants but nothing to do with miRDeep !– Plants : Xiaozeng Yang, Lei Li. 2011 miRDeep-P: a computational tool for analyzing the
microRNA transcriptome in plant. Bioinformatics, doi: 10.1093
![Page 95: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/95.jpg)
95
miRDeep2
• Complex pipeline (3 main steps)
![Page 96: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/96.jpg)
96
miRDeep2
1 : Mapper
Mapping of the SGS data on the reference genome
Pipeline :
* Filter reads (not [ACGTN])
* Clip adapters
* Filter reads on size (<18 nt)
* Collapse reads
* Align with bowtie
* Transform bowtie output to specific miRDeep2 .arf format
* Filter the .arf file (soft clip)
![Page 97: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/97.jpg)
97
miRDeep2
2 : Quantifier
Annotation of sequences on miRBase database
Pipeline :
* Map mature miRNAs on precursors
* Map reads on precursors
* Intersect the 2 mappings
* Output signature and structure of annotated miRNAs
![Page 98: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/98.jpg)
98
miRDeep2
3 : miRDeep2
Prediction of novel miRNAs
Pipeline :
* Test input files
* Keep only perfect mappings of at least 18 nt
* Excise potential precursors within 20 & 70 nt up and down
* Map reads and known miRNAs on potential precursors
* Merge alignments
* RNAfold + randfold
* Run permuted controls
* Filter potential precursors
* Output novel miRNAs
![Page 99: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/99.jpg)
99
miRDeep2 output
PDF filePDF file
scaffold_336 280330 280384 novel:scaffold_336_2215130.7 280330 2803840,0,255
text file
![Page 100: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/100.jpg)
100
Why develop a new tool ?
* MiRDeep2 pipeline is not optimized :
– A lot of redundant steps (mapped reads filtering, inter-fastq redundant reads kept)
– A lot of temporary files :
• Input : 166 Go
• Output : 1,5 To
– A lot of time-processing :
• Mapper : 37 h
• MiRDeep2 : 390 h
* Bugs :
– Bad algorithm of 3' adapters clipping
– Quantification step not used for prediction
– Options not available
– ...
* Not enough user-defined parameters (bowtie, RNAfold ...)
* Not adapted for discovering other small RNAs (tRNA...)
x 10 !
![Page 101: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/101.jpg)
101
Why develop a new tool ?
* MiRDeep2 pipeline is not optimized :
– A lot of redundant steps (mapped reads filtering, inter-fastq redundant reads kept)
– A lot of temporary files :
• Input : 166 Go
• Output : 1,5 To
– A lot of time-processing :
• Mapper : 37 h
• MiRDeep2 : 390 h
* Bugs :
– Bad algorithm of 3' adapters clipping
– Quantification step not used for prediction
– Options not available
– ...
* Not enough user-defined parameters (bowtie, RNAfold ...)
* Not adapted for discovering other small RNAs (tRNA...)
x 10 !
Keep only 6 first nuc of ADAPTER
>1ADAPTEBLABLA → ADAPTE>2MYSEQUENCEA → MYSEQUENCE
![Page 102: small RNAseq data analysis - INRA](https://reader036.vdocument.in/reader036/viewer/2022062303/62a5c132da36c016aa1fcb6b/html5/thumbnails/102.jpg)
102
sRNAseq & GALAXYhttp://sigenae-workbench.toulouse.inra.fr/