
C3BI

VARIANTS CALLING

November 2016

Pierre Lechat, Stéphane Descorps-Declère

General Workflow (GATK)

Software websites

Software    Website
bwa         http://bio-bwa.sourceforge.net/
picard      http://picard.sourceforge.net/
samtools    http://samtools.sourceforge.net/
GATK        http://www.broadinstitute.org/gatk/
IGV         http://software.broadinstitute.org/software/igv/
tablet      http://bioinf.scri.ac.uk/tablet/
vcftools    http://vcftools.sourceforge.net/

Raw Sequence Data Format

• FASTQ format: each read is stored with its sequence ID, its sequence, and its quality scores

• Phred quality score, encoded as one ASCII character per base:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRS

Phred score: 0 (for !) to 50 (for S)

Error rate: 1 (Phred 0) to 0.00001 (Phred 50)

Phred score = -10 * log10(P), where P is the probability that the base call is wrong
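As a minimal sketch of the Phred relation on this slide, the Python below converts between scores and error probabilities and decodes a quality string. The function names are illustrative, and the ASCII offset of 33 is an assumption (it matches the scale shown here, where ! encodes 0); older Illumina files used an offset of 64.

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger-style encoding).
    """
    return [ord(c) - offset for c in qual]

scores = decode_quality_string("!+5S")
probs = [phred_to_error_prob(q) for q in scores]
```

With this encoding a base with score 20 has a 1 in 100 chance of being wrong, and a base with score 30 a 1 in 1000 chance.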

Quality Score

• Q-Score = Quality Table(Quality Predictors)

How do my newly obtained data look?

Check the overall data quality first. FastQC is a widely used tool for this quality assessment.

Good quality! Poor quality!

Quality Checks

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

What do I do when FastQC calls my data poor?

Poor quality at the read ends can be remedied with “quality trimmers” such as Trimmomatic or the FASTX-Toolkit.

Left-over adapter sequences in the reads can be removed with “adapter trimmers” such as Trimmomatic.

Always trim adapters as a matter of routine.

After trimming, rerun FastQC to check the resulting data.
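To make the idea of quality trimming concrete, here is a simplified sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm used by Trimmomatic or the FASTX-Toolkit; the function name, the threshold of 20, and the offset of 33 are assumptions chosen for illustration.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut trailing bases whose Phred score is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record.
    Trimming stops at the first base (from the 3' end) that passes.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Phred 40 and '#' encodes Phred 2 with offset 33
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
```

Real trimmers usually add a sliding window and a minimum read length so that very short leftovers are discarded.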

Quality Checks

Before quality trimming After quality trimming

Quality Checks

QC and Mapping

You don’t need to do quality trimming with bwa-mem. […] Bwa-mem largely does local alignment. If a tail cannot be mapped well, it will be (soft) clipped.

Heng Li.

It is still recommended to trim adapter sequences. After all, adapters are not part of the samples you are sequencing. They might affect variant calling in corner cases.

Heng Li.

General Workflow

Short Read Alignment

Sequencing machine

And you get MILLIONS of them!

Short Read Mapping

Need to map them back to the reference chromosomes


Mapping


Find the best match for the read in a reference sequence

Reference sequence: CTGACCTCATGTGATCCACCCGCCTTGGCC

Read (length 8 bases): TGATCCAC

Challenges:

• Errors in reads

• Errors in libraries

• Repetitive regions (repeats, homologous regions)

• Homopolymers

• Individual polymorphisms
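The toy example on this slide can be sketched as a brute-force search: slide the read along the reference and keep the position with the fewest mismatches. This is only an illustration; real mappers such as BWA use an index rather than scanning every position, and the function names below are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_match(reference, read):
    """Return (0-based position, mismatches) of the best ungapped match.

    Ties are resolved in favour of the leftmost position.
    """
    k = len(read)
    positions = range(len(reference) - k + 1)
    best = min(positions, key=lambda i: hamming(reference[i:i + k], read))
    return best, hamming(reference[best:best + k], read)

reference = "CTGACCTCATGTGATCCACCCGCCTTGGCC"
read = "TGATCCAC"
position, mismatches = best_match(reference, read)
```

Allowing mismatches is what lets a mapper tolerate sequencing errors and polymorphisms, but this sketch ignores indels and repeats, which are exactly the hard cases listed above.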

Different Mapping Algorithms

• BWA – 2009

• BWA-SW – 2010

• BWA-MEM – 2013

• Bowtie – 2009

• Bowtie2 – 2012

• Gem – 2012

• Cushaw2 – 2014

• Novoalign

Li, arXiv:1303.3997 (2013)

Further reading: “A survey of sequence alignment algorithms for next-generation sequencing”, Li H. and Homer N. 2010. Briefings in Bioinformatics

SAM/BAM Format

• SAM (Sequence Alignment/Map) format:

– Single unified format for storing read alignments to a reference genome

• BAM (Binary Alignment/Map) format:

– Binary equivalent of SAM

– Advantages: supports indexing, compact size
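As a rough illustration of what a SAM alignment line contains, the sketch below splits one line into the eleven mandatory fields defined by the SAM specification. The record class, the helper names, and the example line (read name, chromosome name) are made up for illustration; in practice a library such as pysam or samtools itself would be used.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 8M
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)

def parse_sam_line(line):
    """Parse the 11 mandatory fields of a SAM alignment line (optional tags ignored)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])

def is_duplicate(record):
    """True if the duplicate bit (0x400) is set in the flag."""
    return bool(record.flag & 0x400)

# Hypothetical alignment line for illustration
line = "read1\t0\tchr1\t12\t60\t8M\t*\t0\t0\tTGATCCAC\tIIIIIIII"
record = parse_sam_line(line)
```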

BAM Improvement

• Remove duplicates

• Local realignment

• Base quality recalibration

Improvement

raw reads

reference genome

low MQ: the probability of mapping to different locations is high, but no perfect multiple matches

high MQ: a single match

MQ0: a perfect multiple match

What if there are several possible places to align your sequencing read?

This may be due to:

- Repeated elements in the genome

- Low complexity sequences

- Reference errors and gaps

MQ is a Phred-scaled score of the alignment quality: MQ = -10 * log10(probability that the mapping position is wrong)

Mapping Quality (MQ)

Library Duplicates

• All second-generation sequencing platforms are NOT single-molecule sequencing

– The PCR amplification step in library preparation can result in duplicate DNA fragments in the final library prep.

(PCR-free protocols do exist, but they require large amounts of input DNA.)

• Can result in false SNP calls

– Duplicates manifest themselves as high read depth support

Duplicates and False SNP Calls

Remove Duplicates

• Identify read pairs where the outer ends map to the same position on the genome and remove all but 1 copy:

– Samtools: (samtools rmdup || samtools rmdupse)

– Picard/GATK: MarkDuplicates
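The rule above can be sketched in a few lines: group read pairs by the coordinates of their outer ends and keep one representative per group. This is a simplified illustration, not the implementation of samtools or Picard; the dictionary keys and the choice to keep the pair with the highest summed base quality are assumptions made for the example.

```python
def remove_duplicates(pairs):
    """Keep one read pair per (chrom, start, end) group.

    Each pair is a dict with hypothetical keys:
    name, chrom, start, end (outer coordinates) and qual_sum.
    Within a group, the pair with the highest qual_sum is kept.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"])
        if key not in best or pair["qual_sum"] > best[key]["qual_sum"]:
            best[key] = pair
    return list(best.values())

pairs = [
    {"name": "p1", "chrom": "chr1", "start": 100, "end": 350, "qual_sum": 900},
    {"name": "p2", "chrom": "chr1", "start": 100, "end": 350, "qual_sum": 950},
    {"name": "p3", "chrom": "chr1", "start": 120, "end": 370, "qual_sum": 800},
]
kept = [p["name"] for p in remove_duplicates(pairs)]
```

Marking duplicates with a flag instead of deleting them keeps the BAM file complete while still letting downstream tools ignore the extra copies.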

Local Realignment - indels

• The trouble with mapping approaches

Local realignment in GATK

• Uses information from known SNPs/indels (dbSNP, 1000 Genomes)

• Uses information from other reads

• Smith-Waterman exhaustive alignment on select reads

The quality of a base call depends on multiple factors (e.g. position in the read, sequence context). In addition, the alignment can provide useful information. Mismatches to the reference are considered errors (unless they are described polymorphisms).

It supports several platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences (stated on the website) and IonTorrent (stated in the GATK forum).

It combines all the available information to re-evaluate the probability of a wrong call at each position in each read.

It requires a catalogue of variable sites!

We will not run it, but instructions are available at http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html

Base Quality Recalibration

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html

Base Quality Recalibration

More information : http://zenfractal.com/2014/01/25/bqsr/

Variant Calling

• SNP Calling

• Short Indels

• Structural Variants

• A variant call is a conclusion: there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome.

• Sometimes accompanied by an estimate of variant frequency and some measure of confidence

Structural Variants (SVs)

Structural variant is the umbrella term to encompass a group of genomic alterations involving segments of DNA typically larger than 1 kb.

The structural variation may be:

• Quantitative (CNVs – indels and duplications)

• Positional (translocations)

• Orientational (inversions)

[Figure: three signals used to detect a deletion relative to the reference genome: 1. Paired ends, 2. Read depth, 3. Split read]

Structural Variants Detection

Sequence Read Depth Analysis

[Figure: reads from the individual sequence are mapped to the reference genome; counting mapped reads gives the read depth signal, shown relative to the zero level]
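Counting mapped reads per window can be sketched as follows. This is only a toy version of a read depth analysis, not the method implemented in CNVnator; the window size, the 0.5 ratio, and the function names are assumptions chosen for illustration.

```python
from statistics import median

def window_depth(read_starts, genome_length, window=100):
    """Count reads per fixed-size window, using 0-based read start positions."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1
    return counts

def low_depth_windows(counts, ratio=0.5):
    """Indices of windows whose count is below ratio * median count.

    Such windows are candidates for deletions, not confirmed calls.
    """
    m = median(counts)
    return [i for i, c in enumerate(counts) if c < ratio * m]

# Simulated data: 10 reads in windows 0, 1 and 3, a single read in window 2
read_starts = ([i * 10 for i in range(10)]
               + [100 + i * 10 for i in range(10)]
               + [250]
               + [300 + i * 10 for i in range(10)])
counts = window_depth(read_starts, genome_length=400)
candidates = low_depth_windows(counts)
```

A real analysis would also correct for GC content and mappability before interpreting a drop in depth.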

CNVnator on RD data

NA12878, Solexa 36 bp paired reads, ~28x coverage

Paired-ends methods


PEM(2)

Deletion? Insertion?

Deletion? Insertion?
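One way to reason about the paired-end signal is to compare the distance at which the two ends map on the reference with the insert size expected from the library. The sketch below encodes that reasoning under simple assumptions: the function name and the fixed tolerance are hypothetical, and real tools model the insert size distribution statistically.

```python
def classify_pair(mapped_span, expected_insert, tolerance):
    """Classify a read pair from its span on the reference.

    mapped_span: distance between the outer ends as mapped on the reference.
    A span larger than expected suggests sequence missing from the sample
    (deletion); a smaller span suggests extra sequence in the sample (insertion).
    """
    if mapped_span > expected_insert + tolerance:
        return "deletion"
    if mapped_span < expected_insert - tolerance:
        return "insertion"
    return "concordant"

# Hypothetical library with 300 bp inserts and a tolerance of 50 bp
labels = [classify_pair(span, 300, 50) for span in (310, 900, 120)]
```

A single discordant pair is weak evidence; callers usually require several pairs supporting the same breakpoints.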

Different classes of SVs

Nature Reviews Genetics 12, 363-376 (May 2011) | doi:10.1038/nrg2958

Bioinformatics. 2010 Aug 1; 26(15): 1895–1896.

SNP Calling

• SNP – single nucleotide polymorphism

– Examine the bases aligned to each position and look for differences from the reference

• Factors to consider when calling SNPs

– Base call qualities of each supporting base

– Proximity to small indels

– Mapping qualities of the reads supporting the SNP

– Read length

– Paired reads

– Sequencing depth

– Cluster of SNPs
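A deliberately naive sketch of SNP calling at a single position is shown below: discard low-quality bases, then report the most frequent non-reference base if it is supported by enough reads. This is not how GATK or samtools call variants (they use probabilistic models); the thresholds and the function name are assumptions for illustration.

```python
from collections import Counter

def call_snp(ref_base, bases, quals, min_bq=20, min_depth=4, min_alt_frac=0.2):
    """Return (alt_base, alt_fraction) for one pileup column, or None.

    bases: bases aligned to the position; quals: their Phred scores.
    """
    kept = [b for b, q in zip(bases, quals) if q >= min_bq]
    if len(kept) < min_depth:
        return None  # not enough usable depth
    alt_counts = Counter(b for b in kept if b != ref_base)
    if not alt_counts:
        return None  # all bases agree with the reference
    alt, n = alt_counts.most_common(1)[0]
    frac = n / len(kept)
    return (alt, frac) if frac >= min_alt_frac else None

# Six reads cover the site; the last base is low quality and is ignored
call = call_snp("A", "AAGGGA", [30, 30, 30, 30, 30, 5])
```

The other factors on the slide (mapping quality, indel proximity, SNP clusters) would be applied as additional filters around such a call.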

Example SNP

http://www.sanger.ac.uk/mousegenomes

Short Indel Calling

• Small insertions and deletions observed in the alignment of the read relative to the reference genome

• Factors to consider when calling indels

– Misalignment of the read

– Homopolymer runs either side of the indel (e.g. AAAA or TTTTTTTT)

– Length of the reads

Example Indel

http://www.sanger.ac.uk/mousegenomes

How does the HaplotypeCaller work?

Pileup format

Example

Variant Call Format (VCF)

• VCF is a standardized format for storing DNA polymorphism data

– SNPs, insertions, deletions and structural variants

– With rich annotations

• Indexed for fast data retrieval of variants from a range of positions

• Stores variant information across many samples

• Records meta-data about the site

– dbSNP accession, filter status, validation status

• Very flexible format

Header

lines starting with ##: arbitrary number of meta-information lines

line starting with #: column definition – mandatory columns include:

CHROM     chromosome
POS       position of the start of the variant
ID        unique identifier of the variant (e.g. rs number for SNPs)
REF       reference allele
ALT       comma-separated list of alternate non-reference alleles
QUAL      Phred-scaled quality score
FILTER    site filtering information
INFO      user-extensible annotation (e.g. samtools and GATK may differ in this)

Data

one line per site (all columns described above per line); useful information per site and per sample
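To show how the mandatory columns fit together, here is a minimal sketch that parses the first eight columns of one VCF data line. The function name and the example line (including the identifier rs123) are invented for illustration; sample columns are ignored, and for real work a dedicated parser such as vcftools, bcftools or pysam is a better choice.

```python
def parse_vcf_line(line):
    """Parse the 8 mandatory columns of a VCF data line into a dict."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # INFO entries are either KEY=VALUE or a bare flag
        info_dict[key] = value if sep else True
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": info_dict,
    }

# Hypothetical data line for illustration
variant = parse_vcf_line("chr1\t12345\trs123\tA\tG,T\t50\tPASS\tDP=30;DB")
```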



http://vcftools.sourceforge.net/specs.html

Example:

SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).

SnpEff: annotation of variants

A typical SnpEff use case would be:

• Input: The inputs are predicted variants (SNPs, insertions, deletions and MNPs). The input file is usually obtained as a result of a sequencing experiment, and it is usually in variant call format (VCF).

• Output: SnpEff analyzes the input variants. It annotates the variants and calculates the effects they produce on known genes (e.g. amino acid changes).

Common cautions (*):

- Base quality: BQ20

- Depth (min and max): very dependent on your average depth

- Mapping quality: MQ50/60

- Strand bias

- SNP density: dependent on the genome [e.g. no more than 1 SNP/10 bp]

- Indel proximity: not closer than 10 bp to an indel

(*) Some filters may be applied during the variant calling while others are applied afterwards
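Some of the cautions above translate directly into a simple filter function. The sketch below is only an illustration: the dictionary keys are hypothetical, the depth bounds are placeholder values that should be derived from your own average depth, and strand bias and SNP density are left out because they need information from neighbouring sites and both strands.

```python
def passes_filters(variant, min_bq=20, min_mq=50,
                   min_depth=10, max_depth=100, min_indel_dist=10):
    """Apply hard filters to one candidate SNP.

    variant is a dict with hypothetical keys:
    base_quality, mapping_quality, depth, distance_to_indel (in bp).
    """
    return (variant["base_quality"] >= min_bq
            and variant["mapping_quality"] >= min_mq
            and min_depth <= variant["depth"] <= max_depth
            and variant["distance_to_indel"] >= min_indel_dist)

candidates = [
    {"base_quality": 35, "mapping_quality": 60, "depth": 40, "distance_to_indel": 250},
    {"base_quality": 35, "mapping_quality": 20, "depth": 40, "distance_to_indel": 250},
    {"base_quality": 35, "mapping_quality": 60, "depth": 400, "distance_to_indel": 250},
    {"base_quality": 35, "mapping_quality": 60, "depth": 40, "distance_to_indel": 3},
]
results = [passes_filters(v) for v in candidates]
```

An upper depth limit is useful because unusually deep regions often come from collapsed repeats rather than from true coverage.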

Further reading: “Consensus Rules in Variant Detection from Next-Generation Sequencing Data” Jia et al 2012 PLoS One

Filtering SNP Rules

Keep in mind that your project may have some specific requirements!

What should I do if I don’t have a valid reference?
