11/06/2014
RNA Seq analysis Cleaning
Plateforme ABiMS
Data Cleaning
Why do we care about cleaning ?
RAW SEQUENCES
AMAZING TRANSCRIPTOME !!!
NO !!
Because…
• Unknown nucleotides
• Bad quality nucleotides
• Adaptor and primer subsequences
• Poly A/T tails
• Low complexity sequences
• rRNA sequences
• Contaminant sequences
• Short length sequences
But also:
• Removing singletons
• In-silico normalization
• Sequencing error correction
• …
But first… What data do we have ?
NGS sequences
• Illumina, 454 (Roche), Ion Torrent, Solid, …
• Single, Paired-end, Mate pairs
• Sequences length: 25, 35, 50, 75, 100, 150, 250, 500, 700, 800, … base pairs
• File format: Fastq Phred+33, Fastq Phred+64, 2 files (.fasta + .qual), Colorspace
NGS data Quality Checking (QC)
• These apply to all NGS data (not just RNAseq).
• Some of these problems can be worked around but others indicate that the lane is bad & must be re-run (or a new library is needed).
• Biases should be corrected in the reverse order of their generation:
1. Sequencing biases (bad quality, unknowns)
2. Library preparation
   a. Adaptor and primer sequences
   b. Poly A/T tails
3. Biological sample (low complexity, rRNA, contaminants)
• Our favorite NGS QC tool is FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
15/10/13 Trinity Lille
1. Sequencing biases
• Unknown nucleotides (Ns)
• Bad quality nucleotides
• Hexamer biases (random priming)? (Illumina. Now corrected?)
• Why do we need to correct those?
– To remove a lot of sequencing errors (detrimental to the vast majority of assemblers)
– Because most de Bruijn graph-based assemblers can’t handle unknown nucleotides
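The two sequencing-bias filters above (unknown nucleotides, bad quality) can be sketched in a few lines of Python. This is only an illustration, assuming Phred+33 qualities; the function names and the thresholds `max_n` and `min_mean_q` are hypothetical and are not PRINSEQ options.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed by default)."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def keep_read(sequence, quality, max_n=0, min_mean_q=20.0):
    """Return True if the read passes both filters.

    max_n and min_mean_q are illustrative thresholds, not recommended values.
    """
    if not sequence or len(sequence) != len(quality):
        return False  # empty or malformed record
    if sequence.upper().count("N") > max_n:
        return False  # too many unknown nucleotides
    return mean_phred(quality) >= min_mean_q
```

A read such as `ACNT` is rejected with the default `max_n=0`, whatever its quality.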
PRINSEQ
• http://prinseq.sourceforge.net/index.html
• Perl software for PReprocessing and INformation of SEQuence data
• Not the fastest, but very exhaustive
• 2 versions. We use the command-line version: prinseq_lite.pl
• But also: FASTX Toolkit, …
2. Adaptors & primers sequences
• Can be found in 3’ end if insert size is too short
Adaptor contaminations
Normal case: insert size > sequencing length
Abnormal case: insert size < sequencing length
• Why do we need to remove those?
– Because they can lead to “bridges” (links) between unrelated sequences (e.g. 2 genes) and generate chimeras
(Diagram: gene1 transcript, adaptor sequence, gene2 transcript)
Cutadapt
• http://code.google.com/p/cutadapt/
• Trimming of adaptor sequences from NGS data
• But also: Trimmomatic, far, btrim, SeqTrim, TagCleaner, solexaQA, ...
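To make the idea of 3’ adaptor trimming concrete, here is a deliberately simplified Python sketch. It uses exact matching only, which is an assumption made for brevity: Cutadapt and the other tools listed tolerate sequencing errors, so this is not their algorithm. The function name and `min_overlap` are hypothetical.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove an adapter (or the start of one) from the 3' end of a read.

    Exact matching only; real trimmers allow mismatches and indels.
    """
    # Case 1: the whole adapter is inside the read (insert much shorter
    # than the read length): cut from its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the beginning of the adapter was sequenced at the 3' end.
    # Try the longest possible overlap first.
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read
```

The quality string has to be cut to the same length as the trimmed read. A small `min_overlap` removes more partial adaptors but also trims a few bases from reads that only match the adaptor start by chance.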
3. Poly A/T tails, low complexity reads
• Some poly A/T tails can be left during library preparation
• Poly A/T or low complexity sequences can also lead to “bridges” between unrelated sequences and generate chimeras
> ACGTAGCTACTAGCTGACGATTCCCGTAGATCATCGGATAAAAAAAAAAAAAAAAAAAAAAA
> TTTTTTTTTTTTTTTTTTTTTTTTTTTACTGCGTAGCACATGGCTATTATTTCGGCCATCAA
> CGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG
> ATGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATGATGAT
PRINSEQ 2
• Trimming poly A/T tails
– From 5’-end and 3’-end
– w/ nucleotide nb >= 5
• Filtering low complexity sequences
– Entropy < 70 (out of 100)
• Filtering short reads (< 50 nt)
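The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not PRINSEQ's implementation: the complexity score here is a plain Shannon entropy over overlapping trinucleotides, rescaled to 0-100, and all function names are hypothetical. Only the thresholds (5, 70, 50) come from the slide.

```python
import math
import re
from collections import Counter


def trim_poly_at(seq, min_run=5):
    """Strip a run of at least min_run A's or T's from either end."""
    run = "(?:A{%d,}|T{%d,})" % (min_run, min_run)
    return re.sub("^" + run + "|" + run + "$", "", seq)


def entropy_score(seq, k=3):
    """Shannon entropy of overlapping k-mers, rescaled to 0-100 (simplified)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Highest entropy reachable with this many k-mers.
    max_h = math.log(min(total, 4 ** k))
    return 100.0 * h / max_h


def clean_read(seq, min_entropy=70.0, min_length=50):
    """Return the trimmed read, or None if it is too short or too simple."""
    trimmed = trim_poly_at(seq)
    if len(trimmed) < min_length:
        return None
    if entropy_score(trimmed) < min_entropy:
        return None
    return trimmed
```

With this scoring, a pure `CGCG…` or `ATGATG…` repeat scores far below 70 and is dropped, which is the behaviour the slide describes.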
4. Contaminations
• Most RNA-seq libraries comprise ribosomal RNA that you may want to remove
• Contaminations can also occur with foreign RNA/DNA (PhiX, Bacteria, …)
riboPicker
• http://ribopicker.sourceforge.net/
• Easy identification and removal of rRNA-like sequences
• For RNAseq and DNAseq
• But also: SortMeRNA, DeconSeq, …
Practical session (TP)
So… What data do you have ?
But first, let’s retrieve it:
• History → Create New
• Shared Data → Data Libraries → RNA-seq de-novo
• Select all datasets and import to current history
• Name your new history
NGS Data Basics: FASTQ format, SE data
@C060CACXX:1:2108:04435:81967
AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA
+
?@@DDDFFHFFFHJJEHIJIJIGHHHIJJIJJJJJJ@HGHGICBFGCHIECGGGDHACBC
@C060CACXX:1:1103:08674:67296
GTGCATTCTTATTTTATAATATTGACTCTATGACTCAAAAATTACAAGTGTTTATAACCC
+
CCCFFFFFHHHGHJIGIIGIGHIGIJJJIJJJIIJIJJJJJJJJJJIJEGGIIJIGIICH
@C060CACXX:1:1208:18816:38654
CTCCTTTCCCATTAATTGATTCATGTTCTCTTCTAGTAGCTTGATTGCAAAATTACAAGT
+
==>AA@?;?++@<=>AC>BB4,A7,,3?A>4+2?2A<@BBBA7):*111*?0?3:=?A>A
@C060CACXX:1:1305:16126:134486
ATCTATTCCTGAACAGGTCAATTTTAATGACTGATTCTTCAATCCGTGGTGGTCGAGATG
+
;>=AAAAABB+@=@C3+?++<,,33<=C<+?77+*:=7*1?A?=3?0:0=A<A3(<AA##
@C060CACXX:1:1308:04529:41884
ATTTGCCATCCCTGCATTGTGCGTGGTTTTCAGCAGCTTTTTAACAGGTGTTGTTTTTAT
+
@@<DDDEAFHHFDIGEEGGE9FGHHIA@FGIIGIIGIIJJJJIIIIEHDDBFFBCGHGII
@C060CACXX:1:2202:06955:98871
CTGAGATCTTCTTTAATTTCTTTCTTCAGGGACTTGAAGTTTTTATCATACAGATCTTTC
+
BCCDFFFFHHHHHJJJJJJJJJJIJJJIJJJIIJJJGIIFIJJJJJJJJJJIJJJJJJIJ
@C060CACXX:1:1105:15276:91210
TAGGAATCAGCGTGAGCTGTATTCTGACGGAGAATCTCTTCTGGTACCAGAAGGTTTGGA
+
?7?>BDD:C3:02@+AE2<3AEEDF++<))?D?DD4BDB9DDIIDBDD49DB;8.48@5@
@C060CACXX:1:1301:16367:35650
CGCTCTCCAAGCTCCTCCTCCTGGCCCTCAGCTTCTGTGGCTTTCTGGTCTTCACCAACC
+
==<;A8A7+?A7?CB9AAACA++++2<?)5@3*1????*0:?=>**00/*9AA43))==A
@C060CACXX:1:1205:17708:111304
CTGGTAGTAAAGTAGCTGCATGGAGTTCACCTGCAGTTCGTGCTGCTTGGCGCCGACCCA
+
?@@DABB=CC<,C:ACG4CFE4@E;+<?+<C3CDCFF?91::)0:?<93BG(7;;''58(
@C060CACXX:1:1208:13509:106734
GCTTTGTGGTCTTCACCAACCTTTCTCTGCAGAACAACACCATAGGCACCTATCAGCTGG
+
@CCFFFDFHFHHHJIJIJJJJJJJJJIJIIJJJJIIJJJJEHIIJIGIIJJJJJJJIHJG
@C060CACXX:1:1101:03034:113094
ATTCTCCGTCAGAATACAGCTCACGCTGATTCCTATTACTGTAGGTGTAATCCTAAATTC
+
@CCFFFFFHHHFHIIIJIHIIIJJIIHIJEIJJGJBHGIGGDDFCDHEFFCIBGICHIIG
.
.
.
.
Standard format is 4 lines per read:
1. Unique read identifier.
2. Read sequence.
3. Either read identifier again or a placeholder like ‘+’.
4. Phred-like base quality scores [Q:0-40].
Q = -10 log10(e), where e is the estimated probability of a wrong base. So the probability that a base call is an error is:
* 0.01% if Q=40
* 0.1% if Q=30
* 1% if Q=20
* 10% if Q=10
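The relation between a quality character, its Phred score Q and the error probability e can be checked with a short Python sketch. It assumes Phred+33 encoding by default; the function names are illustrative.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Invert Q = -10 log10(e): e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Print the error probability for the four scores listed above.
    for q in (40, 30, 20, 10):
        print("Q=%d -> %.2f%% chance of a wrong base" % (q, 100 * error_probability(q)))
```

For example, in Phred+33 the character `#` decodes to Q=2 and `I` to Q=40.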
NGS Data Basics: FASTQ format, PE data
@C060CACXX:1:2108:04435:81967/1
AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA
+
?@@DDDFFHFFFHJJEHIJIJIGHHHIJJIJJJJJJ@HGHGICBFGCHIECGGGDHACBC
@C060CACXX:1:1103:08674:67296/1
GTGCATTCTTATTTTATAATATTGACTCTATGACTCAAAAATTACAAGTGTTTATAACCC
+
CCCFFFFFHHHGHJIGIIGIGHIGIJJJIJJJIIJIJJJJJJJJJJIJEGGIIJIGIICH
@C060CACXX:1:1208:18816:38654/1
CTCCTTTCCCATTAATTGATTCATGTTCTCTTCTAGTAGCTTGATTGCAAAATTACAAGT
+
==>AA@?;?++@<=>AC>BB4,A7,,3?A>4+2?2A<@BBBA7):*111*?0?3:=?A>A
@C060CACXX:1:1305:16126:134486/1
ATCTATTCCTGAACAGGTCAATTTTAATGACTGATTCTTCAATCCGTGGTGGTCGAGATG
+
;>=AAAAABB+@=@C3+?++<,,33<=C<+?77+*:=7*1?A?=3?0:0=A<A3(<AA##
@C060CACXX:1:1308:04529:41884/1
ATTTGCCATCCCTGCATTGTGCGTGGTTTTCAGCAGCTTTTTAACAGGTGTTGTTTTTAT
+
@@<DDDEAFHHFDIGEEGGE9FGHHIA@FGIIGIIGIIJJJJIIIIEHDDBFFBCGHGII
@C060CACXX:1:2202:06955:98871/1
CTGAGATCTTCTTTAATTTCTTTCTTCAGGGACTTGAAGTTTTTATCATACAGATCTTTC
+
BCCDFFFFHHHHHJJJJJJJJJJIJJJIJJJIIJJJGIIFIJJJJJJJJJJIJJJJJJIJ
@C060CACXX:1:1105:15276:91210/1
TAGGAATCAGCGTGAGCTGTATTCTGACGGAGAATCTCTTCTGGTACCAGAAGGTTTGGA
+
?7?>BDD:C3:02@+AE2<3AEEDF++<))?D?DD4BDB9DDIIDBDD49DB;8.48@5@
@C060CACXX:1:1301:16367:35650/1
CGCTCTCCAAGCTCCTCCTCCTGGCCCTCAGCTTCTGTGGCTTTCTGGTCTTCACCAACC
+
==<;A8A7+?A7?CB9AAACA++++2<?)5@3*1????*0:?=>**00/*9AA43))==A
@C060CACXX:1:1205:17708:111304/1
CTGGTAGTAAAGTAGCTGCATGGAGTTCACCTGCAGTTCGTGCTGCTTGGCGCCGACCCA
+
?@@DABB=CC<,C:ACG4CFE4@E;+<?+<C3CDCFF?91::)0:?<93BG(7;;''58(
@C060CACXX:1:1208:13509:106734/1
GCTTTGTGGTCTTCACCAACCTTTCTCTGCAGAACAACACCATAGGCACCTATCAGCTGG
+
@CCFFFDFHFHHHJIJIJJJJJJJJJIJIIJJJJIIJJJJEHIIJIGIIJJJJJJJIHJG
@C060CACXX:1:1101:03034:113094/1
ATTCTCCGTCAGAATACAGCTCACGCTGATTCCTATTACTGTAGGTGTAATCCTAAATTC
+
@CCFFFFFHHHFHIIIJIHIIIJJIIHIJEIJJGJBHGIGGDDFCDHEFFCIBGICHIIG
.
.
.
.
@C060CACXX:1:2108:04435:81967/2
GGGAAATAGTTATTTTAGGAAGTAGAAGATTTTTCTCTTTGTGTCTGAGTCTTTCATTTG
+
??@DDBDEHF>,C:C@EFBCFHG>HHBDGGHD@<EHGGIJJEB1?F4*:BDGG9DGGI??
@C060CACXX:1:1103:08674:67296/2
GTTTTTATACCATTTCTAACACAACATCTTTGCAACAGAAGAATGTGGAATGGTGTTTCT
+
@CCFFFFDHHAFHIIJIHIJJIDIIIGGHIJJEIGIIJHEHIGGIFGIJIFFHBFGHIIG
@C060CACXX:1:1208:18816:38654/2
GCTAGAAGAGAATCACAATAATTTGGGCAGATACTTTGCAGGTATGCAGAACCATGAGTT
+
:B844A2AACA?A4<EFGI++AF:FHG92@;E><@C?D?*:00?*BB@BFFF(?DAG>BF
@C060CACXX:1:1305:16126:134486/2
ATTTGCCATCCCTGCATTGTGCGTGTTTTTCAGCAGCTTTTTAACAGGTGTTGTTTTTAT
+
:??D1A;;22+2<2CFG?3<,+)+11+)::?C9?41)*9?HG9*?*?8B*??########
@C060CACXX:1:1308:04529:41884/2
ATCTTATTCCTGAACAGGTCAATTTTAATGACTGATTCTTCAATCCGTGGTGGTCGAGAT
+
?B@+4=BDFFHBHGB<E@<+3A?CFBE39<?2ACDGC>DF?CDDDF:FBDDF?@F(<6@A
@C060CACXX:1:2202:06955:98871/2
CAATTTCGACAACAAAAGGAGATCAAGGGGATACAAATTGGAAAAGAGGAAGTCAAAATA
+
?BB4AAAD?CFDAFHIEHD?A8AAE?HHIE::?BFE?FAGDEHIBFCGAHA@==@GHEGH
@C060CACXX:1:1105:15276:91210/2
CTGCTGGTGTCCATCTGCATCGTGTTCCTCAACAAATGGATCTATGTAGACCACGGCTTC
+
=1?D+=:2222A<,2AGEB?<)<CCC9<AFHEH@):1??C?3**0:0**9?B@(/?@A@)
@C060CACXX:1:1301:16367:35650/2
AGTAAAAGTAGCTGCATGGAGTTCACCTGCAGGTCGTGCTGCTTGGCTCCGACCCACACT
+
+:+4+2=A22:+2A+A2A?<A:+<<CB9+<C?)1*:0)?B?B>DD)9*90?:;-;(;(;A
@C060CACXX:1:1205:17708:111304/2
GCTTTGTGGGCTTCACCAACCTTTCTCTGCAGAACAACACTATAGGCACCTATCAGCTGG
+
+:++AD22C)1<CAFDGF@G:E<+924C*91**1:3933B***9B*0*97?383BFH)))
@C060CACXX:1:1208:13509:106734/2
GCAGGCATGGCAGAAGACATGGGGGCCTGGTAGTAAAGTAGCTGCATGGAGTTCACCTGC
+
BBC+A@DDHFHHFIGIBGGIHJIGHJIIHJ?DGBDGAGBDFGIGIIIGHDCGHIIHCHFH
@C060CACXX:1:1101:03034:113094/2
GATAAGTTCACCATGAAAACGATTATTCCAGACAGCAGGACCATAAGCAAAGCAGAAACT
+
=?B=A=2A=C:CD++<CF++333<2+A+AE?9)1):C1)0)?F**900?BF3?F.8BF)/
.
.
.
.
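With paired-end data, each `/1` read has to find its `/2` mate at the same rank in the second file. A minimal Python sketch of that check is given below; it assumes plain 4-line FASTQ records, and the function names are hypothetical.

```python
from itertools import zip_longest


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines."""
    stripped = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        # Take the next 4 non-blank lines (fewer at end of input).
        record = [line for _, line in zip(range(4), stripped)]
        if not record:
            return
        if (len(record) < 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed FASTQ record: %r" % record)
        yield record[0][1:], record[1], record[3]


def pair_key(identifier):
    """Read name without the mate suffix ('/1', '/2') or a trailing comment."""
    name = identifier.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def pairs_in_sync(forward_lines, reverse_lines):
    """True if both inputs list the same read names in the same order."""
    forward = read_fastq(forward_lines)
    reverse = read_fastq(reverse_lines)
    for fwd, rev in zip_longest(forward, reverse):
        if fwd is None or rev is None:
            return False  # one file has more reads than the other
        if pair_key(fwd[0]) != pair_key(rev[0]):
            return False
    return True
```

Cleaning steps that drop reads from only one file break this ordering, so it is worth re-checking after each filter.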
FASTQ quality encoding
Thanks to Wikipedia… ;-)
Phred+64:
@MERCURE_0127:7:1101:1162:2110#CTTGTA/1
TAATAACCCATTAAATACCAATCCAGAAAGCAGCGTGGGTTCAATTCCCAAGATCGGAAG
+MERCURE_0127:7:1101:1162:2110#CTTGTA/1
bbbeeeeegggggiiiihfgfffgihhiihfhfcab``aKZ^]b]]_]`b^^_b``[a__
@MERCURE_0127:7:1101:1182:2111#CTTGTA/1
ACTTACCTCCTGACCCCCCAAAGCCTACTCTCCACTTGCCTGGATGAGCGCAGCTCCAAC
+MERCURE_0127:7:1101:1182:2111#CTTGTA/1
bbbeeeeegggghiihhihiiiiiigaaabb`b`b]`b`b^`T]T]bc_aOEETR___BB
Phred+33:
@HWI-ST227:191:D16GHACXX:8:2308:20216:200677 1:N:0:CGATGT
GCCATTGATGGTGGTGTGTGTTTGGTTGGTTGTTGGATGGGGGTGGGGGGTGTGGTGCG
+
++1BD2222==2A+2+2<3CFFIIA<E)1?C:)0?)*0*0?D@################
@HWI-ST227:191:D16GHACXX:8:2308:20300:200513 1:N:0:CGATGT
CGTTGTTCCTCGCGACGAGAAAAGTGCAGACGGTTTAGGGATCATCGGTATTTCGTGCG
+
?@?ADDDDDBCF@HIEIAGDHB;DDBHGIIEBG:FBDGHBD@CA+9:>098595?CCC<
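The encoding can usually be guessed from the range of characters in the quality lines. The Python sketch below uses a common heuristic, stated here as an assumption: characters below `;` (ASCII 59) only occur in Phred+33 files, and characters above `J` (ASCII 74) point to a +64 encoding. The function name is hypothetical.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64); return None if undecidable."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 59:
        return 33  # characters this low cannot occur with a +64 offset
    if max(codes) > 74:
        return 64  # would mean Q > 41 with a +33 offset
    return None    # every character is valid in both encodings
```

On a handful of high-quality reads the answer can be ambiguous, so the guess is more reliable when many reads are scanned.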
Practical session (TP): FastQC
FastQC: Basic Statistics
FastQC: Per base sequence quality
This plot shows the base quality score distribution for all reads in a lane, with each read position considered independently.
• x-axis = position in read (bp)
• y-axis = Phred-like base quality score [pink=0-20, tan=20-30, green=30-40]
• red bar = median score, blue line = mean score
• yellow box = 25th to 75th percentile, black whiskers = 10th to 90th percentile
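The numbers behind this kind of plot are simple per-position summaries. As a rough Python sketch (not FastQC's code; Phred+33 assumed, function name hypothetical), the median and mean at each read position can be computed like this:

```python
from statistics import mean, median


def per_base_quality(quality_strings, offset=33):
    """Return (position, median score, mean score) for each read position."""
    columns = {}
    for quality in quality_strings:
        for pos, char in enumerate(quality):
            columns.setdefault(pos, []).append(ord(char) - offset)
    # Positions are reported 1-based, as on the plot's x-axis.
    return [(pos + 1, median(columns[pos]), mean(columns[pos]))
            for pos in sorted(columns)]
```

Reads of different lengths are handled by summarising each position over the reads that reach it.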
FastQC: Per sequence quality scores
FastQC: Per base sequence content
NGS QC: Sequence bias across read length. (1)
This plot shows the nucleotide distribution per read position for all reads in a lane.
• x-axis = position in read (bp)
• y-axis = % of all reads in the lane
• colors refer to individual nucleotides: A, C, G, T
GOOD LANE BAD LANE
Can this be fixed? No.
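For reference, the per-position nucleotide percentages shown in such a plot can be reproduced with a few lines of Python. This is an illustrative sketch, not FastQC's implementation; the function name is hypothetical.

```python
from collections import Counter


def per_base_content(reads):
    """Return, for each read position, the percentage of A, C, G and T."""
    columns = []
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if pos == len(columns):
                columns.append(Counter())
            columns[pos][base] += 1
    result = []
    for counts in columns:
        total = sum(counts.values())  # includes N calls, if any
        result.append({base: 100.0 * counts[base] / total for base in "ACGT"})
    return result
```

In an unbiased library the four percentages stay roughly flat along the read; a position where one base dominates is what the BAD LANE panel illustrates.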
FastQC: Per base sequence content
NGS QC: Sequence bias across read length. (2)
This lane has a different problem – one sequence motif is highly over-represented.
In this lane, ~10% of reads have the adapter sequence & the rest are normal.
primer/adapter sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
Note: This sample underwent bisulfite treatment prior to sequencing.
Can this be fixed? Yes. Simply remove the reads w/ adapter contamination, and everything that’s left should be fine. (Talk to a bioinformatics analyst for help.)
FastQC: Per sequence GC content
NGS QC: Sequence bias relative to reference genome.
This plot shows the distribution of GC content per read for all reads in a lane.
• x-axis = mean GC content (%)
• y-axis = # of reads
• red: observed read count, blue: theoretical distribution (given observed)
GOOD LANE BAD LANE
mouse genome ≈ 40% GC
Can this be fixed? No.
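The observed (red) curve is just a histogram of per-read GC percentages. A small Python sketch, with hypothetical function names and GC values rounded to whole percents as an assumption:

```python
from collections import Counter


def gc_percent(read):
    """GC content of one read in percent, ignoring uncalled bases (N)."""
    read = read.upper()
    called = sum(read.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (read.count("G") + read.count("C")) / called


def gc_histogram(reads):
    """Map each rounded GC percentage to the number of reads showing it."""
    histogram = Counter(round(gc_percent(read)) for read in reads)
    return dict(sorted(histogram.items()))
```

A second peak far from the expected GC content of the organism is one of the usual hints of contamination.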
FastQC: Per sequence GC content
• A contamination?
Can this be fixed ? Maybe…
FastQC: Per base N content
![Page 48: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/48.jpg)
FastQC: Sequence Length DistribuNon
![Page 49: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/49.jpg)
FastQC: Sequence Duplication Levels
NGS QC: Low uniqueness among reported reads. This plot shows the degree of duplication for a subset of reads in a lane.
• x-axis = sequence duplication level
• y-axis = % duplicates relative to unique reads
Panels: GOOD LANE | BAD LANE
Can this be fixed? Maybe.
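A simplified sketch of how duplication levels can be computed from a list of reads. FastQC itself works on a subset of the reads and groups the higher levels into bins, so this is only an approximation of its plot; the function name and toy data are assumptions.

```python
from collections import Counter


def duplication_levels(reads):
    """Map each duplication level to the % of distinct sequences seen that many times."""
    copies = Counter(reads)            # sequence -> number of copies
    levels = Counter(copies.values())  # duplication level -> number of distinct sequences
    distinct = len(copies)
    return {level: 100.0 * n / distinct for level, n in sorted(levels.items())}


# Toy lane: AAA seen twice, GGG three times, CCC and TTT once each
reads = ["AAA", "AAA", "CCC", "GGG", "GGG", "GGG", "TTT"]
print(duplication_levels(reads))  # {1: 50.0, 2: 25.0, 3: 25.0}
```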
![Page 50: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/50.jpg)
FastQC: Sequence Duplication Levels
Can this be fixed? Hmm…
![Page 51: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/51.jpg)
FastQC: Overrepresented sequences
![Page 52: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/52.jpg)
FastQC: Kmer Content
![Page 53: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/53.jpg)
Hands-on: Quality cleaning
![Page 54: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/54.jpg)
Most lanes will not have problems with sequence bias, GC content, adapters, etc.
Most lanes will have reads with base quality problems. Here is a typical example...
Note: The stringency of the base qualities to retain is somewhat application-specific.
Step 1 = Trimming by base quality. Trim reads from the right (3') end where the base quality falls below 20.
Step 2 = Filtering by base quality. Retain only reads with an average base quality score ≥ 20.
Quality Trimming & Filtering Example (1)
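The two steps can be sketched in Python as follows. This is an illustration, not the tool used in the hands-on session. It assumes Phred+33 quality encoding and interprets step 1 as removing bases from the 3' end until a base with quality ≥ 20 is reached; the function names are hypothetical.

```python
def phred_scores(quality: str, offset: int = 33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is an assumption)."""
    return [ord(char) - offset for char in quality]


def trim_right(seq: str, quality: str, min_q: int = 20):
    """Step 1: drop bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def passes_filter(quality: str, min_mean_q: float = 20):
    """Step 2: keep the read only if its average base quality is >= min_mean_q."""
    scores = phred_scores(quality)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33
seq, qual = trim_right("ACGTACGT", "IIIII###")
print(seq, qual, passes_filter(qual))  # ACGTA IIIII True
```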
![Page 55: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/55.jpg)
PRINSEQ step 1
• Removing all unknown nucleotides
– First by trimming
– Then by filtering
• Trimming, from 3’ end, nucleotides w/ Q < 20
• Filtering sequences
– w/ average quality score < 25
– w/ length < 50
Hands-on
![Page 56: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/56.jpg)
Hands-on
![Page 57: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/57.jpg)
Quality Trimming & Filtering Example (2)
![Page 58: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/58.jpg)
PRINSEQ: add stringency
• Removing all unknown nucleotides
– First by trimming
– Then by filtering
• Trimming, from 3’ end, nucleotides w/ Q < 20
• Filtering sequences
– w/ average quality score < 25
– w/ length < 50
More stringent thresholds: Q < 25; average Q < 30
Hands-on
![Page 59: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/59.jpg)
More stringent
Quality Trimming & Filtering Example (3)
![Page 60: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/60.jpg)
Trimming effect
Recent publications report contradictory results on the effects of trimming raw reads on assembly quality.
→ How do de novo assemblers handle variable read lengths?
→ Should we remove a read entirely rather than trim only its poor-quality part?
→ Add an additional cleaning step later
Del Fabbro, C., Scalabrin, S., Morgante, M., & Giorgi, F. M. (2013). An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE, 8(12), e85024. doi:10.1371/journal.pone.0085024
MacManes, M. D. (2014, November). On the optimal trimming of high-throughput mRNAseq data. bioRxiv. doi:10.1101/000422
Sleep, J. A., Schreiber, A. W., & Baumann, U. (2013). Sequencing error correction without a reference genome. BMC Bioinformatics, 14(1), 367. doi:10.1186/1471-2105-14-367
![Page 61: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/61.jpg)
Hands-on: Adaptor cleaning
![Page 62: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/62.jpg)
Cutadapt
1. Compute the optimal alignment between the read and the adapter sequences. The type of alignment produced is called end-space free (or regular semi-global) alignment. It does not penalize initial or trailing gaps.
2. Depending on the option used (-a, -g or -b), Cutadapt assumes that the adapter is at the 3' end (-a), at the 5' end (-g), or at either end (-b).
M. Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, North America, 17, May 2011. Available at: http://journal.embnet.org/index.php/embnetjournal/article/view/
![Page 63: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/63.jpg)
Cutadapt
• Trimming from 3’ end
AGATCGGAAGAGCACACGTCTGAACTCCAG
• Filtering short reads (< 50 nt)
Hands-on
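To make the 3' trimming concrete, here is a deliberately simplified Python sketch. It only looks for exact matches of the adapter (or of an adapter prefix that runs into the end of the read), whereas Cutadapt uses an error-tolerant alignment. The function names and the 3-nt minimum overlap are assumptions made for the illustration.

```python
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAG"


def trim_3prime_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Cut the read at the first exact adapter match (full, or partial at the read end)."""
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:start + len(adapter)]
        # A full-length tail must equal the adapter; a shorter tail (end of read)
        # must equal the beginning of the adapter.
        if adapter.startswith(tail):
            return read[:start]
    return read


def clean(read: str, min_len: int = 50):
    """Trim the adapter, then discard reads shorter than min_len (returns None)."""
    trimmed = trim_3prime_adapter(read)
    return trimmed if len(trimmed) >= min_len else None


insert = "ACGT" * 15                      # 60 nt of toy sequence
print(clean(insert + ADAPTER) == insert)  # True: adapter removed, read kept
print(clean("TTGACCATGCAA" + ADAPTER[:10]))  # None: 12 nt left, too short
```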
![Page 64: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/64.jpg)
Cutadapt: Hands-on
![Page 65: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/65.jpg)
Hands-on: PolyA and low complexity cleaning
![Page 66: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/66.jpg)
PRINSEQ step 2
• Trimming poly A/T tails
– From 5’-end and 3’-end
– w/ at least 5 nucleotides
• Filtering low complexity sequences
– Entropy < 70 (out of 100)
• Filtering short reads (< 50 nt)
Hands-on
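The three filters of this step can be approximated in Python as below. The entropy score here is a plain Shannon entropy over trinucleotides, rescaled to 0-100; PRINSEQ's exact computation may differ, so treat the scores as indicative only. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def trim_poly_at(seq: str, min_len: int = 5) -> str:
    """Remove runs of at least min_len A's or T's from the 5' and 3' ends."""
    tail = "(?:A{%d,}|T{%d,})" % (min_len, min_len)
    seq = re.sub("^" + tail, "", seq)
    return re.sub(tail + "$", "", seq)


def entropy_score(seq: str, k: int = 3) -> float:
    """Shannon entropy of the k-mers of seq, scaled to the range 0-100."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not words:
        return 0.0
    total = len(words)
    entropy = sum(-(n / total) * math.log(n / total) for n in Counter(words).values())
    max_entropy = math.log(min(total, 4 ** k))  # reached when all k-mers differ
    return 100.0 * entropy / max_entropy if max_entropy > 0 else 0.0


def keep_read(seq: str, min_entropy: float = 70, min_len: int = 50) -> bool:
    """Apply tail trimming, then the low complexity and length filters."""
    trimmed = trim_poly_at(seq)
    return len(trimmed) >= min_len and entropy_score(trimmed) >= min_entropy


print(trim_poly_at("AAAAAAGCTGACTTTTTT"))  # GCTGAC
print(keep_read("A" * 60))                 # False: nothing is left after trimming
```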
![Page 67: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/67.jpg)
Hands-on
![Page 68: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/68.jpg)
PRINSEQ step 2
• Trimming poly A/T tails
– From 5’-end and 3’-end
– w/ at least 5 nucleotides
• Filtering low complexity sequences
– Entropy < 70 (out of 100)
• Filtering short reads (< 50 nt)
Changed threshold: Entropy < 50
Hands-on
![Page 69: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/69.jpg)
Hands-on
![Page 70: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/70.jpg)
riboPicker
• Select “rrnadb” as the reference database
Hands-on
![Page 71: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/71.jpg)
Hands-on
![Page 72: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/72.jpg)
riboPicker
• For additional databases (chloroplasts, mitochondria, …) please contact your favorite bioinformatics analysts at support.abims@sb-roscoff.fr
![Page 73: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/73.jpg)
Get Pairs
• Data cleaning is performed on each sequence file without using the pairing information
→ Cleaning generates singletons
• Very few tools can work with both paired reads and singletons
• For the next part of the pipeline we need to retrieve paired reads and isolate singletons
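A minimal sketch of the re-pairing idea, assuming read names end in /1 and /2 as in the FASTQ example at the start of the deck. This is not the tool used in the hands-on session; function names and the toy records are illustrative.

```python
def base_name(header: str) -> str:
    """Strip the /1 or /2 mate suffix from a FASTQ header."""
    name = header.split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def split_pairs(reads1: dict, reads2: dict):
    """Separate reads still present in both files from those that lost their mate.

    reads1 and reads2 map base read names to their records.
    """
    shared = sorted(reads1.keys() & reads2.keys())
    pairs = [(reads1[name], reads2[name]) for name in shared]
    singletons = [reads1[name] for name in sorted(reads1.keys() - reads2.keys())]
    singletons += [reads2[name] for name in sorted(reads2.keys() - reads1.keys())]
    return pairs, singletons


# Toy example: read "b" lost its mate in file 2, read "c" lost its mate in file 1
cleaned1 = {"a": "a/1", "b": "b/1"}
cleaned2 = {"a": "a/2", "c": "c/2"}
print(split_pairs(cleaned1, cleaned2))  # ([('a/1', 'a/2')], ['b/1', 'c/2'])
```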
![Page 74: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/74.jpg)
Hands-on: Get Pairs
![Page 75: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/75.jpg)
Get Pairs: Hands-on
![Page 76: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/76.jpg)
Additional optional step
FLASH (Fast Length Adjustment of SHort reads) is a very fast and accurate software tool to merge paired-end reads.
• FLASH is designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads.
• The resulting longer reads can significantly improve genome assemblies. They can also improve transcriptome assembly when FLASH is used to merge RNA-seq data.
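The merging principle can be illustrated with a toy Python function. It only accepts exact overlaps, whereas FLASH scores candidate overlaps and tolerates mismatches; the function names and the minimum overlap parameter are assumptions for the sketch.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10):
    """Merge two mates if the end of read1 exactly overlaps the start of
    reverse-complemented read2; return None when no overlap is found."""
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


# A 31 nt fragment sequenced with 20 nt reads: the mates overlap by 9 nt
fragment = "ACGTTGCATGCCATGGATCCGATTACAGGCT"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
print(merge_pair(read1, read2, min_overlap=5) == fragment)  # True
```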
Sequencing error correction
Errors occur during the sequencing process. These errors impact the assembly process (lower identity, larger graphs, ...).
Removing these errors before assembly:
• Limits the errors in the contigs
• Speeds up the assembly
Many different software packages exist, e.g. SGA, SOAP, REPTILE. One is adapted to RNA-Seq reads: Seecer.
The challenge is to separate errors from rare polymorphisms in an efficient manner!
MacManes, M. D., & Eisen, M. B. (2013). Improving transcriptome assembly through error correction of high-throughput sequence reads. PeerJ, 1, e113.
![Page 77: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/77.jpg)
NGS reads normalization (by Trinity)
• Context:
- By definition, RNA-seq data display a wide range of expression levels: from very lowly expressed to very highly expressed transcripts
- The information given by reads from highly expressed transcripts is redundant, and very high coverage also brings more sequencing errors
- De novo assemblers do not benefit from coverage increases beyond a certain point, and less data means faster assemblies
→ How to decrease the coverage of highly expressed transcripts without decreasing that of lowly expressed transcripts?
![Page 78: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/78.jpg)
NGS reads normalization (by Trinity)
1. Count kmers in all the data (Jellyfish):
e.g. for k = 5
> CAGTCGATCA
> CGATCAGTCG
CAGTC 2
AGTCG 2
GTCGA 1
TCGAT 1
CGATC 2
GATCA 2
ATCAG 1
TCAGT 1
…
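The counts in this example can be reproduced with a few lines of Python. This is only an illustration of the principle: Jellyfish is a dedicated, far more efficient kmer counter, and this sketch counts kmers on the forward strand only, as in the slide.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


counts = count_kmers(["CAGTCGATCA", "CGATCAGTCG"], k=5)
for kmer, n in counts.items():
    print(kmer, n)
# CAGTC 2, AGTCG 2, GTCGA 1, TCGAT 1, CGATC 2, GATCA 2, ATCAG 1, TCAGT 1
```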
![Page 92: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/92.jpg)
NGS reads normalization (by Trinity)
1. Count kmers in all the data (Jellyfish):
• with k = 25
2. For each read, compute the median, average and standard deviation (stdev) of its kmer coverage
![Page 93: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/93.jpg)
NGS reads normalization (by Trinity)
1. Count kmers in all the data (Jellyfish):
• with k = 25
2. For each read, compute the median, average and standard deviation (stdev) of its kmer coverage
3. Accept a read with a probability of: max_coverage / median coverage
![Page 94: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/94.jpg)
NGS reads normalization (by Trinity)
3. Accept a read with a probability of: max_coverage / median coverage
e.g. with max_coverage = 30
Read_A: median coverage = 60 → max_coverage / median = 0.5
→ Read_A has a 50% chance of being kept
Read_B: median coverage = 10 → max_coverage / median = 3
→ Read_B has a 300% chance of being kept ;-) → Read_B will be kept
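The acceptance rule of this example can be sketched in Python as follows. It assumes the probability is the ratio max_coverage / median capped at 1, which is what the "300%" case amounts to; the function names are hypothetical and this is not Trinity's own code.

```python
import random
from statistics import median


def keep_probability(kmer_coverages, max_coverage=30):
    """Probability of keeping a read, given the coverage of each of its kmers.

    Every kmer of a read is counted at least once, so the median is never 0.
    """
    return min(1.0, max_coverage / median(kmer_coverages))


def accept(kmer_coverages, max_coverage=30, rng=random):
    """Randomly keep the read with the probability computed above."""
    return rng.random() < keep_probability(kmer_coverages, max_coverage)


print(keep_probability([60, 60, 60]))  # 0.5 -> Read_A is kept half of the time
print(keep_probability([10, 10, 10]))  # 1.0 -> Read_B is always kept
```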
![Page 95: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/95.jpg)
NGS reads normalization (by Trinity)
3. Accept a read with a probability of: max_coverage / median coverage
Read_A comes from a highly expressed transcript and its coverage is twice the threshold. We know its information is also contained in other reads.
→ So it has a lower chance of being kept.
Read_B comes from a lowly expressed transcript, way below the threshold. Its information is not very redundant; we will need it for the assembly.
→ So it will definitely be kept
![Page 96: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/96.jpg)
NGS reads normalization (by Trinity)
1. Count kmers in all the data (Jellyfish):
• with k = 25
2. For each read, compute the median, average and standard deviation (stdev) of its kmer coverage
3. Accept a read with a probability of: max_coverage / median coverage
4. Remove a read if: stdev / average > max pct stdev (100%)
![Page 97: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/97.jpg)
NGS reads normalization (by Trinity)
4. Remove a read if: stdev / average > max pct stdev (100%)
stdev / average is also known as the coefficient of variation (CV)
The CV measures the dispersion of the values
Applied to NGS reads, the CV indicates the variability of the kmer coverage along a read
A high variability in a read's kmer coverage means there are probably many sequencing errors in this read
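A small Python sketch of this filter, assuming the criterion is the coefficient of variation (stdev / average, expressed as a percentage) compared with a 100% threshold. Whether Trinity uses the population or the sample standard deviation is not stated here; the population form is an assumption, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def coverage_cv_percent(kmer_coverages):
    """Coefficient of variation of the kmer coverage of one read, in percent."""
    return 100.0 * pstdev(kmer_coverages) / mean(kmer_coverages)


def too_variable(kmer_coverages, max_pct_stdev=100):
    """True when the read should be removed because its coverage is too uneven."""
    return coverage_cv_percent(kmer_coverages) > max_pct_stdev


even_read = [30, 31, 29, 30]    # coverage is stable along the read
uneven_read = [1] * 9 + [100]   # most kmers are rare, as sequencing errors would cause
print(too_variable(even_read), too_variable(uneven_read))  # False True
```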
![Page 98: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/98.jpg)
NGS reads normalization (by Trinity)
• Pros:
– Reduce the data to be assembled
→ faster assemblies
→ RAM requirement highly reduced
– Remove reads with potentially lots of sequencing errors → better assemblies?
• Cons:
– Small loss of information → slightly worse assemblies?
– Stringent filter on kmer coverage variability
→ loss of lowly expressed alternative transcripts (splice junctions)?
![Page 99: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/99.jpg)
Hands-on: Normalization
![Page 100: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/100.jpg)
NGS reads normalization (by Trinity)
• Concatenate left reads from all conditions → all.read1.fastq
• Concatenate right reads from all conditions → all.read2.fastq
• Normalize by kmer coverage:
– Paired: all.read1.fastq & all.read2.fastq
– pairs together
– max coverage = 30
– max pct stdev = 100
Hands-on
![Page 101: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/101.jpg)
Hands-on
![Page 102: RNA’Seq’ analysis’application.sb-roscoff.fr/download/fr2424/abims/corre/...NGS Data Basics: FASTQ format, PE data @C060CACXX:1:2108:04435:81967/1 AGAGAATGGTACAGGTACCAACAACATGCCATATGCATAGAGCAGCACAGAGCAACATAA](https://reader036.vdocument.in/reader036/viewer/2022081617/60501b3c634f3c5bf340ac18/html5/thumbnails/102.jpg)
Hands-on