Bioinformatics Summer School 2014

Konstantin Okonechnikov
Max Planck Institute for Infection Biology

Quality Control of High Throughput Sequencing Data

If we lived in a perfect world...

Meanwhile in the real world...

Quality control of High Throughput Sequencing Data

● HTS is a complex technology; it is prone to biases and errors

● Errors might lead to wrong conclusions

● Understanding biases and limitations is critical for analysis of HTS data and inference

● Bioinformatics methods exist to detect biases

● Bias handling is technology-specific

● Experimental design is extremely important

A bit of nomenclature (reminder): English term = Russian term

● Basepairs = bp = основания

● Sequencing = секвенирование

● Short reads = короткие риды

● Alignment = выравнивание

● Assembly = сборка

● Coverage = покрытие

● GC content = GC-состав

● Others: BAM/SAM format, WGS, RNA-seq, ChIP-seq

Illumina sequencing overview

Source: http://res.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

Sources of errors and biases

● DNA preparation: biological contamination, biased fragment selection

● PCR amplification: GC-content shift, fragment duplication, adapter contamination

● Sequencing: base substitutions and indels

● Technology-specific biases: RNA-seq, ChIP-seq, etc.

● Analysis errors: algorithm errors, inadequate model, human errors

Detecting biases: FastQC

● Input: raw reads (FASTQ files)

● Output: interactive GUI, HTML report

● Metrics:

– Per base statistics

– Per base quality profile

– Per sequence ACGTN content

– Sequence length distribution

– Duplicate sequences

– Overrepresented sequences, adapters, k-mer content

● Link: www.bioinformatics.babraham.ac.uk/projects/fastqc
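
To make the "per base quality profile" metric concrete, here is a minimal Python sketch that computes the mean Phred quality at each read position. It is not FastQC's implementation. It assumes strict 4-line FASTQ records and Phred+33 quality encoding, and the function name `per_base_mean_quality` is made up for this illustration.

```python
import io
from collections import defaultdict


def per_base_mean_quality(fastq_handle, phred_offset=33):
    """Return the mean Phred quality at each read position (0-based).

    Assumes 4-line FASTQ records and Phred+33 encoding (both assumptions).
    """
    totals = defaultdict(int)   # sum of quality values per position
    counts = defaultdict(int)   # number of reads covering each position
    while True:
        header = fastq_handle.readline()
        if not header:
            break                               # end of file
        fastq_handle.readline()                 # sequence line (not needed here)
        fastq_handle.readline()                 # '+' separator line
        qual = fastq_handle.readline().rstrip("\n")
        for pos, char in enumerate(qual):
            totals[pos] += ord(char) - phred_offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


# Two toy reads: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII5!\n"
)
profile = per_base_mean_quality(example)
print(profile)  # [40.0, 40.0, 30.0, 20.0]
```

A profile that drops towards the last positions, as in this toy input, is the pattern to look for in the FastQC report.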

Detecting biases: QualiMap

● Input: BAM file, optionally genomic regions in GTF/GFF format

● Output: interactive GUI, HTML and PDF report

● Metrics:

– Summary statistics of alignment (coverage, ACGT, insert size, mapping quality, mismatches and indels etc.)

– Coverage across reference and various histograms

– Duplication rate

– Homopolymer indels

– Mapping quality plots

– Insert size plots

– GC-content distribution of mapped reads

● Link: http://qualimap.bioinfo.cipf.es/

Read errors

Illumina read profile: quality decreases towards the 3' end

Typical error rates: substitutions 0.1 to 0.3 %; indels ~10E-5

Based on: http://genomebiology.com/2011/12/11/R112
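
Quality values and error rates are linked by the standard Phred definition, Q = -10 · log10(P). The sketch below converts between the two so the substitution rates above can be read as quality scores. The function names are illustrative and do not come from any particular tool.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given its per-base qualities."""
    return sum(phred_to_error_prob(q) for q in qualities)


# A 0.1 % substitution rate corresponds to Q30; 0.3 % is roughly Q25.
print(round(error_prob_to_phred(0.001), 2))
print(round(error_prob_to_phred(0.003), 2))
```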

Read errors

● Trim the low-quality 3' end

● Remove reads with bad quality

– Empirical rule: keep only reads in which more than ⅔ of the bases have Q > 30

● Tools: FastX, Cutadapt, Trimmomatic

● What about other platforms?

– 454: indel errors in homopolymer runs

– PacBio: increased error rate (up to 20%)
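
The two read-cleaning steps above (3' trimming and the ⅔ rule) can be sketched in a few lines of Python. This is an illustration, not the algorithm used by FastX, Cutadapt or Trimmomatic. The trimming threshold `min_q=20` is an assumption chosen for the example, and qualities are taken as already decoded integers.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop bases from the 3' end until the last base has quality >= min_q.

    min_q=20 is an illustrative default, not a value from the slides.
    """
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_two_thirds_rule(quals, q_threshold=30, min_fraction=2 / 3):
    """Empirical rule: more than 2/3 of the bases must have Q > 30."""
    if not quals:
        return False            # an empty (fully trimmed) read is discarded
    good = sum(1 for q in quals if q > q_threshold)
    return good / len(quals) > min_fraction


seq, quals = trim_3prime("ACGTACGT", [38, 37, 36, 35, 34, 31, 12, 8])
print(seq, passes_two_thirds_rule(quals))      # ACGTAC True

seq2, quals2 = trim_3prime("ACGTAC", [40, 40, 10, 10, 40, 10])
print(seq2, passes_two_thirds_rule(quals2))    # ACGTA False
```

Trimming before filtering matters: the first read fails the rule at full length only marginally less often than it passes after trimming, so the order of the two steps changes how many reads survive.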

GC-content problems

GC-content distribution: compare with the theoretical distribution

GC-content problems

Compare and normalize according to expected distribution
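
One simple way to compare and normalize is to bin reads by GC fraction and weight each bin by expected / observed frequency. The slides do not specify the normalization method, so the binning scheme, the weighting and all function names below are assumptions made for illustration.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_histogram(seqs, n_bins=10):
    """Relative frequency of sequences in each of n_bins equal-width GC bins."""
    counts = Counter()
    for s in seqs:
        bin_index = min(int(gc_fraction(s) * n_bins), n_bins - 1)
        counts[bin_index] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_bins
    return [counts[b] / total for b in range(n_bins)]


def gc_weights(observed, expected):
    """Per-bin correction factor expected / observed (0 where nothing was observed)."""
    return [e / o if o > 0 else 0.0 for o, e in zip(observed, expected)]


reads = ["AAAA", "AATT", "GGCC", "GCGC", "ACGT"]
observed = gc_histogram(reads, n_bins=2)     # low-GC bin, high-GC bin
expected = [0.5, 0.5]                        # hypothetical theoretical distribution
print(observed, gc_weights(observed, expected))
```

In practice the expected distribution would come from the reference genome (for example, GC content of windows of read length), not from a hand-written list.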

Fragment duplication

● Duplicates can be removed using Picard tools.
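
The idea behind position-based duplicate detection can be sketched as follows: reads that share chromosome, start position and strand are treated as copies of one fragment, and only the copy with the highest summed base quality is kept. This is a simplified illustration, not Picard's implementation. The `Read` record and its fields are hypothetical, and read names are assumed to be unique.

```python
from collections import namedtuple

# Hypothetical minimal alignment record for this example.
Read = namedtuple("Read", "name chrom pos strand qual_sum")


def mark_duplicates(reads):
    """Return names of reads flagged as duplicates (read names must be unique)."""
    best = {}
    for r in reads:
        key = (r.chrom, r.pos, r.strand)
        # Keep the read with the highest summed base quality for each key.
        if key not in best or r.qual_sum > best[key].qual_sum:
            best[key] = r
    kept = {r.name for r in best.values()}
    return [r.name for r in reads if r.name not in kept]


reads = [
    Read("r1", "chr1", 100, "+", 300),
    Read("r2", "chr1", 100, "+", 350),   # same position and strand as r1
    Read("r3", "chr1", 100, "-", 320),   # other strand: not a duplicate
    Read("r4", "chr2", 100, "+", 310),
]
duplicates = mark_duplicates(reads)
print(duplicates, len(duplicates) / len(reads))   # ['r1'] 0.25
```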

Contaminations

● Biological contamination

– Map reads against contaminant references (bacterial, rRNA, etc.) and remove those that match

● Adapter contamination

– Solution: cut adapters or remove reads containing them (Cutadapt, Scythe, Trimmomatic)
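
A minimal sketch of 3' adapter clipping with exact matching is given below. The tools listed above use error-tolerant alignment, so this only illustrates the principle. The adapter string, the `min_overlap` default and the function name are assumptions for the example.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter using exact matching only (no mismatches allowed)."""
    # Case 1: the complete adapter occurs inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends with a prefix of the adapter (longest prefix first).
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


ADAPTER = "AGATCGGAAGAGC"   # example adapter sequence used for illustration

print(trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC", ADAPTER))            # ACGTACGT
print(trim_adapter("ACGTACGT", ADAPTER))                 # ACGTACGT (unchanged)
```

The `min_overlap` parameter trades sensitivity for specificity: very short suffix matches occur by chance, so clipping them removes genuine sequence.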

Alignment analysis: descriptive statistics

● Read metrics: mapped, paired, chimeric, singletons.

● Mismatch and indel count

● Coverage

● Mapping quality

● Insert size

Hint: use QualiMap
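
Several of the read metrics above can be derived directly from the FLAG field of SAM/BAM records. The sketch below counts them from a list of integer flags. The bit values are those defined in the SAM specification; treating "singleton" as a mapped paired read whose mate is unmapped, and counting supplementary alignments as a proxy for chimeric reads, are simplifying assumptions of this example.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SUPPLEMENTARY = 0x800


def summarize_flags(flags):
    """Count basic read categories from SAM FLAG values."""
    stats = {"total": 0, "mapped": 0, "paired": 0,
             "singletons": 0, "supplementary": 0}
    for flag in flags:
        stats["total"] += 1
        is_mapped = not (flag & UNMAPPED)
        is_paired = bool(flag & PAIRED)
        if is_mapped:
            stats["mapped"] += 1
        if is_paired:
            stats["paired"] += 1
        # Singleton: read mapped, but its mate is not.
        if is_paired and is_mapped and (flag & MATE_UNMAPPED):
            stats["singletons"] += 1
        if flag & SUPPLEMENTARY:
            stats["supplementary"] += 1
    return stats


# 99/147: properly paired mates; 73: mapped with unmapped mate;
# 133: unmapped mate of 73; 2113: supplementary alignment; 4: unmapped single read.
flag_stats = summarize_flags([99, 147, 73, 133, 2113, 4])
print(flag_stats)
```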

Alignment analysis: coverage

● Coverage histogram
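
A coverage histogram answers the question "how many reference positions are covered by exactly k reads?". The sketch below computes per-base depth from alignment intervals and then tallies the depths. It assumes 0-based, half-open (start, end) intervals on a single reference sequence and ignores gaps inside alignments; the function name is illustrative.

```python
from collections import Counter


def coverage_histogram(ref_length, alignments):
    """Return (per-base depth list, Counter mapping depth -> number of positions).

    alignments: iterable of (start, end), 0-based and half-open (an assumption).
    """
    diff = [0] * (ref_length + 1)      # difference array: +1 at start, -1 at end
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    depth, coverage = 0, []
    for delta in diff[:ref_length]:
        depth += delta                 # running sum gives the depth at each position
        coverage.append(depth)
    return coverage, Counter(coverage)


coverage, histogram = coverage_histogram(10, [(0, 5), (3, 8), (4, 6)])
print(coverage)                        # [1, 1, 1, 2, 3, 2, 1, 1, 0, 0]
print(sorted(histogram.items()))       # [(0, 2), (1, 5), (2, 2), (3, 1)]
print(sum(coverage) / len(coverage))   # mean coverage: 1.2
```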

Alignment analysis: insert size

● Insert size histogram (QualiMap, Picard tools)

Multisample Analysis

● Detection of outliers in a group of samples: clustering and PCA
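
As a simpler stand-in for clustering or PCA, the sketch below flags a sample as an outlier when its median Pearson correlation with all other samples is low. This is not the method shown on the slide. The threshold of 0.8, the toy count vectors and the function names are assumptions, and every sample is assumed to have non-constant values.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def find_outliers(samples, min_median_corr=0.8):
    """Names of samples whose median correlation with the others is below the cutoff."""
    outliers = []
    for name, values in samples.items():
        corrs = [pearson(values, other)
                 for other_name, other in samples.items() if other_name != name]
        if median(corrs) < min_median_corr:
            outliers.append(name)
    return outliers


# Toy per-gene counts for four samples; s4 shows the reversed pattern.
samples = {
    "s1": [10, 20, 30, 40],
    "s2": [12, 19, 33, 41],
    "s3": [11, 22, 29, 39],
    "s4": [40, 30, 20, 10],
}
print(find_outliers(samples))   # ['s4']
```

The median is used instead of the mean so that one aberrant sample does not drag down the score of every other sample in a small group.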

RNA-seq specific

● (Not so) random hexamer primers
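
Hexamer priming bias shows up as a non-uniform nucleotide composition in the first positions of the reads. A minimal way to look for it is to tabulate base frequencies per position, as sketched below. The choice of 13 positions as the default window and the function name are assumptions for this example.

```python
from collections import Counter


def per_position_base_freq(reads, n_positions=13):
    """Frequency of A, C, G, T at each of the first n_positions of the reads.

    Other symbols (e.g. N) count towards the total but are not reported.
    """
    counters = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counters[i][base] += 1
    freqs = []
    for counter in counters:
        total = sum(counter.values())
        freqs.append({b: (counter[b] / total if total else 0.0) for b in "ACGT"})
    return freqs


toy_reads = ["GTACCA", "GTTGCA", "GAACGT", "CTAGCA"]
start_freqs = per_position_base_freq(toy_reads, n_positions=2)
print(start_freqs[0])   # {'A': 0.0, 'C': 0.25, 'G': 0.75, 'T': 0.0}
print(start_freqs[1])   # {'A': 0.25, 'C': 0.0, 'G': 0.0, 'T': 0.75}
```

With unbiased priming the four frequencies would be roughly constant along the read; a strong position-dependent pattern at the 5' end points to primer bias.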

RNA-seq specific

● Counts analysis: sequencing saturation
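
Sequencing saturation can be estimated by subsampling: if the number of detected features stops growing as more reads are added, deeper sequencing is unlikely to reveal new ones. The sketch below draws nested random subsamples of read-to-gene assignments. The input representation (one gene name per read), the fractions and the function name are assumptions made for illustration.

```python
import random


def saturation_curve(read_assignments, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Number of distinct features detected at increasing subsampling fractions.

    read_assignments: one feature (e.g. gene) name per read.
    Subsamples are nested prefixes of one shuffled list, so the curve never decreases.
    """
    rng = random.Random(seed)
    shuffled = list(read_assignments)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        n_reads = int(len(shuffled) * fraction)
        curve.append((fraction, len(set(shuffled[:n_reads]))))
    return curve


# Toy data: four genes with very different expression levels.
assignments = ["geneA"] * 50 + ["geneB"] * 30 + ["geneC"] * 15 + ["geneD"] * 5
curve = saturation_curve(assignments)
for fraction, detected in curve:
    print(f"{fraction:.2f} of reads: {detected} genes detected")
```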

RNA-seq specific

● Counts analysis: feature distribution

ChIP-seq specific

Analysis errors

● Statistical model and simulations

● Self checks:

– Visualizations

– Edge conditions

● Published data examination (ENCODE)

● Method cross-checks

Conclusions

● HTS is prone to random errors and systematic biases

● Quality control is critical for analysis

● There are tools available for detection and removal of QC-related problems

● Additional QC analysis should be performed based on the problem (genome assembly, SNP calling, etc.) and the technology (RNA-seq, ChIP-seq, <your choice>-seq...)

Tools for QC and EDA

Estimating quality metrics

● FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

● QualiMap http://qualimap.bioinfo.cipf.es/

Removing errors and cleaning reads

● FastX http://hannonlab.cshl.edu/fastx_toolkit/

● Cutadapt https://code.google.com/p/cutadapt/

● Trimmomatic http://www.usadellab.org/cms/?page=trimmomatic

● Picard-tools http://picard.sourceforge.net/

Technology specific quality control

● RNA-seq: RNA-SeQC, RSeQC

● ChIP-seq: CEAS, Repitools

● Genome Assembly: QUAST

Thank you for your attention!