Bioinformatics Summer School 2014
Konstantin Okonechnikov, Max Planck Institute for Infection Biology
Quality Control of High Throughput Sequencing Data
Летняя Школа Биоинформатики 2014
If we lived in a perfect world...
Meanwhile in the real world...
Quality control of High Throughput Sequencing Data
● HTS is a complex technology; it is prone to biases and errors
● Errors might lead to wrong conclusions
● Understanding biases and limitations is critical for analysis of HTS data and inference
● Bioinformatics methods exist to detect biases
● Bias handling is technology-specific
● Experimental design is extremely important
A bit of nomenclature (reminder)
● Basepairs = bp = основания
● Sequencing = секвенирование
● Short reads = короткие риды
● Alignment = выравнивание
● Assembly = сборка
● Coverage = покрытие
● GC content = GC-состав
● Others: BAM/SAM format, WGS, RNA-seq, ChIP-seq
Illumina sequencing overview
Source: http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf
Sources of errors and biases
● DNA preparation: biological contamination, biased fragment selection
● PCR amplification: GC-content shift, fragment duplication, adapter contamination
● Sequencing: base substitutions and indels
● Technology-specific biases: RNA-seq, ChIP-seq etc.
● Analysis errors: algorithm errors, inadequate model, human errors
Detecting biases: FastQC
● Input: raw reads
● Output: interactive GUI, HTML report
● Metrics:
– Per base statistics
– Per base quality profile
– Per sequence ACGTN content
– Sequence length distribution
– Duplicate sequences
– Overrepresented sequences, adapters, k-mer content
● Link: www.bioinformatics.babraham.ac.uk/projects/fastqc
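To make the "per base quality profile" metric concrete, here is a minimal sketch of how such a profile could be computed from raw reads. It is not FastQC's implementation; it assumes four-line FASTQ records and Phred+33 quality encoding, and the function names are hypothetical.

```python
from statistics import mean

def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line, ignored
        qual = next(it).strip()
        yield header[1:], seq, qual

def per_position_mean_quality(records, offset=33):
    """Mean Phred score at each read position (Phred+33 assumed)."""
    columns = []
    for _, _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(columns):
                columns.append([])
            columns[i].append(ord(ch) - offset)
    return [mean(col) for col in columns]

# Two toy reads; the second one loses quality towards the 3' end.
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGA", "+", "II5#",
]
profile = per_position_mean_quality(parse_fastq(example))
```

A falling profile towards the last positions is the pattern to look for before deciding where to trim.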
Detecting biases: QualiMap
● Input: BAM file, optionally genomic regions in GTF/GFF format
● Output: interactive GUI, HTML and PDF report
● Metrics:
– Summary statistics of alignment (coverage, ACGT, insert size, mapping quality, mismatches and indels etc.)
– Coverage across reference and various histograms
– Duplication rate
– Homopolymer indels
– Mapping quality plots
– Insert size plots
– Mapped reads GC-content and distribution
● Link: http://qualimap.bioinfo.cipf.es/
Read errors
Illumina read profile: quality decreases towards 3' end
Typical error rates: substitutions 0.1-0.3%, indels ~10E-5
Based on: http://genomebiology.com/2011/12/11/R112
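Quality values and error rates are linked by the Phred definition Q = -10 * log10(p), so Q30 corresponds to an error probability of 0.1%. A small sketch of the conversion, assuming Phred+33 encoded quality strings (function names are illustrative):

```python
def phred_to_error_prob(q):
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def expected_errors(quality_string, offset=33):
    """Expected number of wrong bases in a read (sum of per-base error probabilities)."""
    return sum(phred_to_error_prob(ord(c) - offset) for c in quality_string)

p_q30 = phred_to_error_prob(30)   # about 0.001, i.e. 0.1%
errs = expected_errors("II")      # two Q40 bases
```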
Read errors
● Trim the 3' end
● Remove reads with bad quality
– Empirical rule: keep only reads in which more than ⅔ of bases have Q > 30
● Tools: FastX, Cutadapt, Trimmomatic
● What about other platforms?
– 454 : homopolymers
– PacBio: increased error rate (up to 20%)
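The 3' trimming and the ⅔ rule for Illumina reads can be sketched as follows. This is an illustration only, not how FastX, Cutadapt or Trimmomatic work internally; the trimming threshold of Q20 is an assumed default, and Phred+33 encoding is assumed.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut bases from the 3' end until a base with quality >= min_q is reached."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def passes_two_thirds_rule(qual, min_q=30, offset=33):
    """True if more than 2/3 of the bases have quality above min_q."""
    if not qual:
        return False
    good = sum(1 for c in qual if ord(c) - offset > min_q)
    return 3 * good > 2 * len(qual)  # integer form of good/len > 2/3

# 'I' is Q40 and '#' is Q2 in Phred+33.
trimmed = trim_3prime("ACGTAC", "IIII##")
```

Trimming first and filtering afterwards keeps reads whose only problem is a poor tail.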
GC-content problems
GC-content distribution: compare to the theoretical distribution
GC-content problems
Compare and normalize according to expected distribution
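A rough sketch of the comparison: compute the GC fraction of every read, bin the values, and measure how far the observed histogram is from the expected one. The distance used here (half the sum of absolute differences) is an assumption chosen for simplicity; the expected histogram has to come from the reference genome or a model.

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def gc_histogram(reads, bins=10):
    """Fraction of reads falling into each GC-content bin."""
    counts = Counter()
    n = 0
    for read in reads:
        b = min(int(gc_fraction(read) * bins), bins - 1)
        counts[b] += 1
        n += 1
    if n == 0:
        return [0.0] * bins
    return [counts[b] / n for b in range(bins)]

def histogram_distance(observed, expected):
    """0 means identical distributions, 1 means no overlap at all."""
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))

observed = gc_histogram(["AAAA", "GGGG"], bins=2)
shift = histogram_distance(observed, [1.0, 0.0])
```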
Fragment duplication
● Duplicates can be removed using Picard tools.
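The idea behind position-based duplicate marking, in simplified form: alignments that start at the same reference position on the same strand are treated as copies of one fragment, and only the best-scoring one is kept. This is a sketch under that assumption and not Picard's actual procedure, which is more involved; the record layout (dicts with name, ref, pos, strand, score) is hypothetical.

```python
def mark_duplicates(alignments):
    """Return the names of alignments flagged as duplicates.

    alignments: list of dicts with keys name, ref, pos, strand, score.
    Read names are assumed to be unique.
    """
    best = {}
    for aln in alignments:
        key = (aln["ref"], aln["pos"], aln["strand"])
        if key not in best or aln["score"] > best[key]["score"]:
            best[key] = aln
    kept = {a["name"] for a in best.values()}
    return {a["name"] for a in alignments if a["name"] not in kept}

alignments = [
    {"name": "r1", "ref": "chr1", "pos": 100, "strand": "+", "score": 300},
    {"name": "r2", "ref": "chr1", "pos": 100, "strand": "+", "score": 350},
    {"name": "r3", "ref": "chr1", "pos": 100, "strand": "-", "score": 200},
    {"name": "r4", "ref": "chr1", "pos": 250, "strand": "+", "score": 310},
]
duplicates = mark_duplicates(alignments)
```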
Contaminations
● Biological contamination
– Map and remove reads (bacterial, rRNA, etc)
● Adapter contamination
– Solution: cut adapters or remove reads containing them (cutadapt, scythe, trimmomatic)
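What "cutting adapters" means can be shown with a naive exact-match sketch: remove a full adapter occurrence and everything after it, or a partial adapter hanging off the 3' end. Real tools such as cutadapt also tolerate mismatches, which this sketch does not; the adapter used below is only an example sequence and the minimum overlap of 3 is an assumed default.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove the adapter (exact matches only) from the 3' end of a read."""
    # Case 1: the whole adapter is inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends with the first k bases of the adapter.
    max_len = min(len(adapter) - 1, len(read))
    for k in range(max_len, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read

ADAPTER = "AGATCGGAAGAGC"  # example sequence; use the one from your library kit
full = trim_adapter("ACGTACGT" + ADAPTER + "TTTT", ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", ADAPTER)
clean = trim_adapter("ACGTACGT", ADAPTER)
```

A small min_overlap removes more adapter remnants but also cuts a few genuine bases that match the adapter start by chance.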
Alignment analysis: descriptive statistics
● Read metrics: mapped, paired, chimeric, singletons.
● Mismatch and indel count
● Coverage
● Mapping quality
● Insert size
Hint: use QualiMap
Alignment analysis: coverage
● Coverage histogram
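A coverage histogram answers the question "how many reference positions are covered by 0, 1, 2, ... reads". A minimal sketch, assuming alignments are given as 0-based half-open (start, end) intervals on a single reference sequence; in practice QualiMap derives this from the BAM file.

```python
from collections import Counter

def per_base_coverage(ref_length, alignments):
    """Depth at every reference position, computed with a difference array."""
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, ref_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    coverage = []
    depth = 0
    for i in range(ref_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage

def coverage_histogram(coverage):
    """Map depth -> number of reference positions with that depth."""
    return dict(sorted(Counter(coverage).items()))

cov = per_base_coverage(10, [(0, 5), (3, 8)])
hist = coverage_histogram(cov)
```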
Alignment analysis: insert size
● Insert size histogram (QualiMap, Picard tools)
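For illustration, an insert size histogram and a robust summary can be built from the signed template length (TLEN, column 9 of a SAM record, 0 when unavailable). The bin width of 50 bp and the use of median plus median absolute deviation are assumptions of this sketch, not what QualiMap or Picard report.

```python
from collections import Counter
from statistics import median

def insert_size_histogram(sizes, bin_width=50):
    """Count inserts per bin; keys are the lower bounds of the bins."""
    counts = Counter((abs(s) // bin_width) * bin_width for s in sizes if s != 0)
    return dict(sorted(counts.items()))

def insert_size_summary(sizes):
    """Median insert size and median absolute deviation (robust to outliers)."""
    values = [abs(s) for s in sizes if s != 0]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med, mad

tlens = [200, -210, 190, 0, 1000]
hist = insert_size_histogram(tlens)
med, mad = insert_size_summary(tlens)
```

A long tail or a second peak in the histogram usually points to chimeric pairs or mixed libraries.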
Multisample Analysis
● Detection of outliers in a group of samples: clustering and PCA
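As a lightweight stand-in for clustering or PCA, outlier samples can be spotted by their correlation with the rest of the group. Note that this is a swapped-in, simpler technique: each sample gets the median Pearson correlation with all other samples, and samples below a threshold are flagged. The threshold of 0.8 and the toy count vectors are assumptions.

```python
from math import sqrt
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def median_correlation_per_sample(samples):
    """samples: dict name -> vector of per-feature values (e.g. gene counts)."""
    result = {}
    for name, values in samples.items():
        others = [pearson(values, v) for o, v in samples.items() if o != name]
        result[name] = median(others)
    return result

def find_outliers(samples, threshold=0.8):
    scores = median_correlation_per_sample(samples)
    return sorted(name for name, score in scores.items() if score < threshold)

samples = {
    "A": [10, 20, 30, 40],
    "B": [12, 19, 33, 41],
    "C": [11, 22, 29, 38],
    "D": [40, 30, 20, 10],  # deliberately reversed profile
}
outliers = find_outliers(samples)
```

The median is used instead of the mean so that one bad sample does not drag down the scores of the good ones.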
RNA-seq specific
● (Not so) random hexamer primers
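The priming bias shows up as uneven nucleotide frequencies in the first positions of the reads. A sketch of how to tabulate base composition by position; the default of 13 positions is an assumed window, and the function name is hypothetical.

```python
from collections import Counter

def base_composition_by_position(reads, n_positions=13):
    """Fraction of A, C, G, T at each of the first n_positions of the reads."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return result

composition = base_composition_by_position(["ACGT", "AGGT"], n_positions=2)
```

With unbiased priming the four fractions at the first positions would look like those further into the read.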
RNA-seq specific
● Counts analysis: sequencing saturation
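Saturation can be estimated by subsampling: take growing fractions of the reads and count how many distinct features (e.g. genes) are detected at each depth. A minimal sketch, assuming every read has already been assigned to one feature; the fractions and the fixed random seed are arbitrary choices.

```python
import random

def saturation_curve(read_features, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return (number_of_reads, detected_features) for each sampled fraction.

    read_features: one feature identifier per read.
    """
    rng = random.Random(seed)
    shuffled = list(read_features)
    rng.shuffle(shuffled)
    curve = []
    for f in fractions:
        n = int(round(f * len(shuffled)))
        # Prefixes of one shuffled list are nested, so the curve never decreases.
        curve.append((n, len(set(shuffled[:n]))))
    return curve

reads = ["g1"] * 50 + ["g2"] * 30 + ["g3"] * 15 + ["g4"] * 5
curve = saturation_curve(reads)
```

If the curve is still rising at the full read count, deeper sequencing would likely detect more features.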
RNA-seq specific
● Counts analysis: feature distribution
ChIP-seq specific
Analysis errors
● Statistical model and simulations
● Self checks:
– Visualizations
– Edge conditions
● Published data examination (ENCODE)
● Method cross-checks
Conclusions
● HTS is prone to random errors and systematic biases
● Quality control is critical for analysis
● There are tools available for detection and removal of QC-related problems
● Additional QC analysis should be performed based on the problem (genome assembly, SNP calling, etc.) and technology (RNA-seq, ChIP-seq, <your choice>-seq...)
Tools for QC and EDA
Estimating quality metrics
● FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
● QualiMap http://qualimap.bioinfo.cipf.es/
Removing errors and cleaning reads
● FastX http://hannonlab.cshl.edu/fastx_toolkit/
● Cutadapt https://code.google.com/p/cutadapt/
● Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic
● Picard-tools http://picard.sourceforge.net/
Technology specific quality control
● RNA-seq: RNA-SeQC, RSeQC
● ChIP-seq: CEAS, Repitools
● Genome Assembly: QUAST
Thank you for your attention!
Спасибо за внимание!