Next-Generation Sequencing: Quality Control and Mapping
BaRC Hot Topics – January 2015
Bioinformatics and Research Computing, Whitehead Institute
http://barc.wi.mit.edu/hot_topics/
Outline
• Quality control
• Preprocessing
• Read mapping
  – Non-spliced alignment
  – Spliced alignment
• Post-process the mapped read files
  – Remove unmapped reads, sort, index, etc.
  – Mapping statistics
Illumina data format
• Fastq format:

@ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
GTAGAACTGGTACGGACAAGGGGAATCTGACTGTAG
+ILLUMINA-F6C19_0048_FC:5:1:12440:1460#0/1
hhhhhhhhhhhghhhhhhhehhhedhhhhfhhhhhh

Line 1: @ followed by the sequence identifier (ending in /1 or /2 for paired-end reads)
Line 2: sequence
Line 3: + followed by any description
Line 4: sequence quality values
Input qualities      Illumina versions
--solexa-quals       <= 1.2
--phred64            1.3-1.7
--phred33            >= 1.8
http://en.wikipedia.org/wiki/FASTQ_format
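If the Illumina pipeline version is not known, the encoding can often be guessed from the lowest quality character in the file. The sketch below is a heuristic, not part of the original slides: the function name `guess_quality_encoding` is hypothetical, and the thresholds (ASCII 59 and 64) follow the character ranges given on the Wikipedia FASTQ page. A file that contains only high-quality Phred+33 reads can be misclassified, so treat the answer as a hint.

```shell
# Heuristic sketch (assumption: not from the slides).
# Guess the FASTQ quality encoding from the lowest quality character seen.
guess_quality_encoding() {   # usage: guess_quality_encoding reads.fastq
    awk '
        BEGIN {
            # Build a character -> ASCII code lookup table for printable characters.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127
        }
        NR % 4 == 0 {                      # every 4th line is a quality string
            n = length($0)
            for (j = 1; j <= n; j++) {
                c = substr($0, j, 1)
                if ((c in ord) && ord[c] < min) min = ord[c]
            }
        }
        END {
            if (min == 127)    print "unknown"       # no quality characters seen
            else if (min < 59) print "phred33"       # only Phred+33 goes this low
            else if (min < 64) print "solexa-quals"  # Solexa+64 allows 59-63
            else               print "phred64"
        }
    ' "$1"
}
```

For the example read shown earlier, whose lowest quality character is `d` (ASCII 100), this reports `phred64`, which matches the `--phred64` option used in the mapping commands.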
Check read quality with fastqc (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
1. Run fastqc to check read quality
bsub fastqc sample.fastq
2. Open the output file: "fastqc_report.html"
Fastqc report
We have to know the quality encoding to use the
appropriate parameter in the mapping step.
FastQC: per base sequence quality
[Figure: FastQC per-base sequence quality box plot. Background bands mark very good quality calls, reasonable quality, and poor quality. Red line: median; blue line: mean; yellow box: 25%-75%; whiskers: 10%-90%.]
Preprocessing tools
• Fastx Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)
  – FASTQ/A Trimmer: Shortens reads in a FASTQ or FASTA file (removing barcodes or noise).
  – FASTQ Quality Filter: Filters sequences based on quality
  – FASTQ Quality Trimmer: Trims (cuts) sequences based on quality
  – FASTQ Masker: Masks nucleotides with 'N' (or other character) based on quality
  (for a complete list go to the link above)
• cutadapt to remove adapters (https://code.google.com/p/cutadapt/)
What preprocessing do we need?

• Flagged Kmer Content: About 100% of the first six bases are the same sequence -> Use "FASTQ Trimmer"
• Bad quality -> Use "FASTQ Quality Filter" and/or "FASTQ Quality Trimmer"
• Overrepresented sequences -> If the overrepresented sequence is an adapter, use "cutadapt"

Sequence                                   Count     Percentage           Possible Source
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCA   7360116   82.88507591015895    RNA PCR Primer, Index 3 (100% over 40bp)
GCGAGTGCGGTAGAGGGTAGTGGAATTCTCGGGTGCCAAG   541189    6.094535921273932    No Hit
TCGAATTGCCTTTGGGACTGCGAGGCTTTGAGGACGGAAG   291330    3.2807783416601866   No Hit
CCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGG   210051    2.365464495397192    RNA PCR Primer, Index 3 (100% over 38bp)
Examples of preprocessing I (hands-on exercise)

Remove reads with lower quality
bsub fastq_quality_filter -v -q 20 -p 75 -i sample.fastq -o sample_filtered.fastq
  -v: report number of sequences
  -q 20: the quality value required
  -p 75: the percentage of bases that have to have that quality value
  -i: input file
  -o: output file

Trim the reads
# Delete the first 6 nt from the 5' end
bsub fastx_trimmer -v -f 7 -l 36 -i sample.fastq -o sample_trimmed.fastq
  -f: first base to keep
  -l: last base to keep
  -i: input file
  -o: output file
  -v: report number of sequences
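To make the meaning of `-q`/`-p` and `-f`/`-l` concrete, here is a minimal awk sketch of the same two ideas. It is an illustration under stated assumptions, not how the Fastx Toolkit is implemented: the function names `quality_filter` and `trim_reads` are hypothetical, input is assumed to be 4-line FASTQ records, and the quality offset (64 or 33) must be passed explicitly.

```shell
# Sketch only (assumption: simplified re-implementation for illustration).

# Keep reads in which at least PCT percent of the bases have quality >= MINQ.
quality_filter() {   # usage: quality_filter MINQ PCT OFFSET < in.fastq > out.fastq
    awk -v minq="$1" -v pct="$2" -v offset="$3" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        { rec[NR % 4] = $0 }               # buffer the current 4-line record
        NR % 4 == 0 {                      # quality line completes the record
            n = length($0); good = 0
            for (j = 1; j <= n; j++)
                if (ord[substr($0, j, 1)] - offset >= minq) good++
            if (n > 0 && good * 100 >= pct * n)
                print rec[1] "\n" rec[2] "\n" rec[3] "\n" rec[0]
        }
    '
}

# Keep bases FIRST..LAST (1-based, inclusive) of every read and quality string.
trim_reads() {       # usage: trim_reads FIRST LAST < in.fastq > out.fastq
    awk -v first="$1" -v last="$2" '
        NR % 2 == 0 { print substr($0, first, last - first + 1); next }  # sequence and quality lines
        { print }                                                        # header and "+" lines
    '
}
```

With these definitions, `quality_filter 20 75 64 < sample.fastq` mirrors `-q 20 -p 75` on Phred+64 data, and `trim_reads 7 36 < sample.fastq` mirrors `-f 7 -l 36`, dropping the first six bases of each read.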
Examples of preprocessing II (hands-on exercise)
• Remove adapter/linker

cutadapt
# usage
bsub "cutadapt -m 20 -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG sample2.fastq | fastx_artifacts_filter > sample2_trimFilt.fastq"
  -a: sequence of an adapter that was ligated to the 3' end
  -b: sequence of an adapter that was ligated to the 5' or 3' end
  -g: sequence of an adapter that was ligated to the 5' end
  -m: minimum read length; reads shorter than this after trimming are discarded
  -o: output file name
Recommendation for preprocessing
• Treat all the samples the same way.
• Watch out for preprocessing that may result in very different read lengths in the different samples, as that can affect mapping.
• If you have paired-end reads, make sure you still have both reads of the pair after the processing is done.
• Run fastqc on the processed samples to see if the problem has been removed.
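A quick way to confirm that paired-end files are still in sync after preprocessing is to compare read counts and read identifiers. The sketch below is an assumption-laden illustration: the function names are hypothetical, identifiers are assumed to end in /1 and /2 as in the Fastq example earlier, and the identifier lists are held in memory, so it is best suited to spot checks on modest files.

```shell
# Sketch only (assumption: 4-line FASTQ records, mate IDs ending in /1 and /2).

# Number of reads in a FASTQ file.
count_reads() {      # usage: count_reads reads.fastq
    awk 'END { print int(NR / 4) }' "$1"
}

# Read identifiers with the trailing /1 or /2 removed.
read_ids() {         # usage: read_ids reads.fastq
    awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print $1 }' "$1"
}

# Succeeds only if both files hold the same reads in the same order.
pairs_in_sync() {    # usage: pairs_in_sync reads_1.fastq reads_2.fastq
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ] &&
    [ "$(read_ids "$1")" = "$(read_ids "$2")" ]
}
```

For example, `pairs_in_sync s_8_1_filtered.fastq s_8_2_filtered.fastq || echo "pairs out of sync"` flags files that need re-pairing before mapping.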
Local genomic files needed for mapping
tak: /nfs/genomes/
– Human, mouse, zebrafish, C. elegans, fly, yeast, etc.
– Different genome builds
  • mm9: mouse_gp_jul_07
  • mm10: mouse_mm10_dec_11
– human_gp_feb_09 vs human_gp_feb_09_no_random?
  • human_gp_feb_09 includes *_random.fa, *hap*.fa, etc.
– Sub directories:
  • bowtie
    – Bowtie1: *.ebwt
    – Bowtie2: *.bt2
  • fasta:
  • fasta_whole_genome: all sequences in one file
  • gtf: gene models from Refseq, Ensembl, etc.
Mapping I: Non-spliced alignment software
§ Used for mapping DNA fragments, e.g. ChIP-seq, SNP calling
§ Bowtie:
  § bowtie 1 vs bowtie 2
    § For reads >50 bp, Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1.
    § Bowtie 2 supports gapped alignment, which makes it better for SNP calling. Bowtie 1 only finds ungapped alignments.
    § Bowtie 2 supports a "local" alignment mode, in addition to the "end-to-end" alignment mode supported by Bowtie 1.
§ BWA:
  § refer to the BaRC SOP for detailed information
Mapping reads with bowtie2
• Mapping single reads:
bowtie2 [options]* -x <bt2-index> -U <r> [-S <output.sam>]
bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam

• Mapping paired-end reads:
bowtie2 [options]* -x <bt2-index> -1 <m1> -2 <m2> [-S <output.sam>]
bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -1 Reads1.fastq -2 Reads2.fastq -S DNA.sam
Some important parameters in bowtie2
• Reporting
  (default)       look for multiple alignments, report best, with MAPQ
  OR -k <int>     report up to <int> alns per read; MAPQ not meaningful
  OR -a/--all     report all alignments; very slow, MAPQ not meaningful
• Alignment mode
  --end-to-end    entire read must align; no clipping (on)
  OR --local      local alignment; ends might be soft clipped (off)
• -L <int>  length of seed substrings; must be >3 and <32 (default=22)
• -N <int>  max # mismatches in seed alignment; can be 0 or 1 (default=0)

Input qualities      Illumina versions
--solexa-quals       <= 1.2
--phred64            1.3-1.7
--phred33 (default)  >= 1.8
Mapping II: Spliced alignment software
§ Used if mapping RNA fragments
§ Tophat2 (uses bowtie2)
§ STAR: maps >60 times faster than Tophat2, tends to align more reads to pseudogenes. See BaRC SOPs
Spliced alignment with tophat2
Tophat2 uses bowtie2 to map the reads

# single-end reads
bsub tophat --solexa1.3-quals --segment-length 20 --no-novel-juncs -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq
# paired-end reads: add the additional fastq file to the end of the above command.

Input qualities        Refer to the bowtie2 mapping slide
--segment-length       Shortest length of a spliced read that can map to one side of the junction. default: 25
--no-novel-juncs       Only look at reads across junctions in the supplied GFF file
-G <GTF file>          Map reads to virtual transcriptome (from gtf file) first.
-N                     max. number of mismatches in a read, default is 2
-o/--output-dir        default = tophat_out
--library-type         (fr-unstranded, fr-firststrand, fr-secondstrand)
-I/--max-intron-length default: 500000
Optimize mapping across introns
• Tophat default parameters are designed for mammalian RNA-seq data.
• Reduce "maximum intron length" for non-mammalian organisms
  -I: default is 500,000

Species       Max_intron_length
yeast         2,484
arabidopsis   11,603
C. elegans    100,913
fly           141,628
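One way to keep these values consistent across jobs is a small lookup helper. This is only a sketch: the function name `max_intron_length` and the species keys are hypothetical, and the numbers are copied from the table above, with the TopHat default as the fallback.

```shell
# Sketch only (assumption: species keys are illustrative names, values are from the table above).
max_intron_length() {   # usage: max_intron_length SPECIES
    case "$1" in
        yeast)       echo 2484 ;;
        arabidopsis) echo 11603 ;;
        c_elegans)   echo 100913 ;;
        fly)         echo 141628 ;;
        *)           echo 500000 ;;   # TopHat default, designed for mammals
    esac
}
```

The value can then be passed to the `-I` option, for example `tophat -I "$(max_intron_length yeast)"` followed by the usual index and fastq arguments.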
Hands-on: Mapping
• bowtie2
bsub bowtie2 --phred64 -x /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 -U DNA.fastq -S DNA.sam
• tophat
bsub tophat --solexa1.3-quals --segment-length 20 -G /nfs/genomes/mouse_mm10_dec_11_no_random/gtf/mm10_no_random.refseq.gtf /nfs/genomes/mouse_mm10_dec_11_no_random/bowtie/mm10 sample_good_trimmed.fastq
Note: tophat output file will be: tophat_out/accepted_hits.bam
Mapped reads file formats: SAM/BAM
• SAM: Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Each alignment line has 11 mandatory fields for essential alignment information.
• BAM: binary format. It is much smaller than SAM.
• BAM is needed for viewing in a genome browser. It has to be sorted and indexed.
• To save space you should convert mapped files to .bam format, and delete the .sam file.
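The 11 mandatory fields are, in order, QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. As a small illustration, the awk sketch below labels the fields of the first alignment line in a SAM file; the function name `describe_sam_record` is hypothetical and the input is assumed to be uncompressed SAM text.

```shell
# Sketch only (assumption: input is plain-text SAM, not BAM).
# Print the 11 mandatory fields of the first alignment line, one per line.
describe_sam_record() {   # usage: describe_sam_record file.sam
    awk -F '\t' '
        BEGIN { split("QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL", name, " ") }
        /^@/ { next }                                    # skip header lines
        { for (i = 1; i <= 11; i++) print name[i] "=" $i; exit }
    ' "$1"
}
```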
SAM tools: Set of tools for manipulating mapped read files

TOOL                DESCRIPTION
samtools view       conversion between SAM and BAM files
samtools flagstat   simple statistics on the mapped reads
samtools sort       sort alignment file
samtools index      index alignment
samtools rmdup      remove PCR duplicates
samtools            displays all the tools available
Hands-on
Convert .sam to .bam format, sort and index.
bsub /nfs/BaRC_Public/BaRC_code/Perl/SAM_to_BAM_sort_index/SAM_to_BAM_sort_index.pl DNA.sam
1. Converts .sam to .bam
2. Sorts the bam file
3. Indexes the bam file, creating a .bai file
Delete the .sam file
How to get the number of reads mapped
• Bowtie2 prints the number of reads mapped to STDERR, so you will see it in the email that you receive.
• Tophat makes a summary file in the tophat output directory.
head tophat_out/align_summary.txt
• Tools:
  – bam_stat.py -i accepted_hits.bam
  – samtools flagstat mapped_unmapped.bam
• See BaRC SOPs http://barcwiki.wi.mit.edu/wiki/SOPs/miningSAMBAM
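When only a .sam file is at hand, mapped and unmapped records can also be counted directly from the FLAG field, where bit 0x4 marks an unmapped read. The sketch below is an illustration rather than a replacement for the tools above: the function name `sam_mapping_counts` is hypothetical, and it counts alignment records, so a read reported more than once (for example with `-k`) is counted more than once.

```shell
# Sketch only (assumption: plain-text SAM input; counts records, not unique reads).
sam_mapping_counts() {   # usage: sam_mapping_counts file.sam
    awk -F '\t' '
        /^@/ { next }                                   # skip header lines
        {
            # FLAG bit 0x4 set means the read is unmapped.
            if (int($2 / 4) % 2 == 1) unmapped++
            else mapped++
        }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}
```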
What to look for when few reads mapped?
• Reads are not perfectly paired *
  – Usually occurs after the QC step. Removing low-quality reads or adapters creates an uneven distribution of reads
  bsub "/nfs/BaRC_Public/BaRC_code/Perl/cmpfastq/cmpfastq.pl s_8_1_filtered.fastq s_8_2_filtered.fastq"
• Reads may have adapter sequences
  – Blast top overrepresented sequences in fastQC output
  – Refer to the preprocessing steps
• Mapping parameters are too stringent *
  – Increase number of mismatches
  – Adjust the insert size of paired-end reads?
* Refer to BaRC SOP for more information
Summary
• Quality control
  – fastqc
• Clean up reads:
  – fastx toolkit: fastq_quality_filter, fastx_trimmer
  – cutadapt
• Map reads:
  – Bowtie2
  – Tophat2
• Understand the mapped files, and check mapping quality:
  – Samtools
  – RSeQC: bam_stat.py
BaRC Standard operating procedures
http://barcwiki.wi.mit.edu/wiki/SOPs
References
Fastqc: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Fastx Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
cutadapt: https://code.google.com/p/cutadapt
Bowtie: Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25.
TopHat: Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 2013, 14:R36.
Engstrom et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods 10, 1185–1191 (2013).