C3BI VARIANTS CALLING November 2016 Pierre Lechat Stéphane Descorps-Declère

General Workflow (GATK)

Software websites

Software   Website
bwa        http://bio-bwa.sourceforge.net/
picard     http://picard.sourceforge.net/
samtools   http://samtools.sourceforge.net/
GATK       http://www.broadinstitute.org/gatk/
IGV        http://software.broadinstitute.org/software/igv/
tablet     http://bioinf.scri.ac.uk/tablet/
vcftools   http://vcftools.sourceforge.net/

General Workflow (GATK)

Raw Sequence Data Format

• FASTQ format: each record holds a sequence ID, the sequence, and a quality score string

• Phred quality score

Quality characters: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRS

Phred score: 0 ... 50

Error rate: 1 ... 0.00001

Phred score = -10 * log10(P), where P is the probability that the base call is wrong
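The relation between Phred score and error rate can be sketched in Python. This is only an illustration: the function names are made up for this example, and the offset of 33 is inferred from the character scale on the slide, which starts at "!" for score 0.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score Q for an error probability P."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in qual]


# "!" is the lowest character of the scale, "S" the highest shown on the slide.
scores = decode_quality("!5IS")
errors = [phred_to_error(q) for q in scores]
```

A score of 20 corresponds to one expected error in 100 base calls, and a score of 50 to one in 100,000.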

Quality Score

• Q-Score = Quality Table(Quality Predictors)

How do my newly obtained data look?

Check the overall data quality. FastQC is a great tool for this quality assessment.

Good quality! Poor quality!

Quality Checks

What do I do when FastQC calls my data poor?

Poor quality at the ends can be remedied with “quality trimmers” like Trimmomatic, fastx-toolkit, etc.

Left-over adapter sequences in the reads can be removed with “adapter trimmers” like Trimmomatic.

Always trim adapters as a matter of routine

Once the trimmers have been used, it is best to rerun the data through FastQC to check the resulting data
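To make the idea of quality trimming concrete, here is a minimal Python sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm of Trimmomatic or fastx-toolkit; the function name, the threshold of 20 and the Phred+33 offset are assumptions chosen for illustration.

```python
def trim_trailing(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: the last two bases have scores 2 ("#") and 0 ("!").
trimmed_seq, trimmed_qual = trim_trailing("ACGTAC", "IIII#!")
```

Real trimmers usually add sliding windows, minimum read length and paired-end bookkeeping on top of this.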

Quality Checks

Before quality trimming After quality trimming

Quality Checks

QC and Mapping

You don’t need to do quality trimming with bwa-mem. […] Bwa-mem largely does local alignment. If a tail cannot be mapped well, it will be (soft) clipped.

Heng Li.

It is still recommended to trim adapter sequences. After all, adapters are not part of the samples you are sequencing. They might affect variant calling in corner cases.

Heng Li.

General Workflow

Short Read Alignment

Sequencing machine

And you get MILLIONS of them!

Short Read Mapping

Need to map them back to the reference chromosomes

Mapping

Find best match for the read in a reference sequence

Reference sequence: CTGACCTCATGTGATCCACCCGCCTTGGCC

Read: TGATCCAC (a read of length 8 bases)

Challenges:
• Errors in reads
• Errors in libraries
• Repetitive regions (repeats, homologous regions)
• Homopolymers
• Individual polymorphisms
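The "find best match" task on this slide can be sketched as a brute-force search in Python. This is only a toy: it tries every ungapped placement and counts mismatches, whereas real mappers such as BWA rely on an index of the reference and also handle gaps. The function name is made up for this example.

```python
def best_match(reference: str, read: str) -> tuple[int, int]:
    """Return (0-based position, mismatches) of the best ungapped placement."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


reference = "CTGACCTCATGTGATCCACCCGCCTTGGCC"
read = "TGATCCAC"
print(best_match(reference, read))  # (11, 0): exact match starting at the 12th base
```

The cost grows with reference length times read length, which is why this approach does not scale to millions of reads against a genome.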

Different Mapping Algorithms

• BWA – 2009
• BWA-SW – 2010
• BWA-MEM – 2013
• Bowtie – 2009
• Bowtie2 – 2012
• Gem – 2012
• Cushaw2 – 2014
• Novoalign

Li, arXiv:1303.3997 (2013)

Further reading: “A survey of sequence alignment algorithms for next-generation sequencing”, Li H. and Homer N. 2010. Briefings in Bioinformatics

SAM/BAM Format

• SAM (Sequence Alignment/Map) format:
– Single unified format for storing read alignments to a reference genome

• BAM (Binary Alignment/Map) format:
– Binary equivalent of SAM
– Advantages
  – Supports indexing
  – Compact size
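As an illustration of what a SAM alignment line contains, the Python sketch below splits one line into the eleven mandatory tab-separated fields defined by the SAM specification. The example record is made up, and the class and function names are hypothetical; in practice a library or samtools would be used, especially for BAM.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag (0x4 set means unmapped)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 8M
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


# Made-up record for illustration only.
rec = parse_sam_line("read1\t0\tchr1\t12\t60\t8M\t*\t0\t0\tTGATCCAC\tIIIIIIII")
is_mapped = (rec.flag & 0x4) == 0
```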

BAM Improvement

• Remove duplicates
• Local realignment
• Base quality recalibration

Mapping Quality (MQ)

MQ is a Phred score of the quality of the alignment

What if there are several possible places to align your sequencing read? This may be due to:
- Repeated elements in the genome
- Low complexity sequences
- Reference errors and gaps

[Figure: raw reads aligned to the reference genome]
- high MQ: a single match
- low MQ: the probability of mapping to different locations is high, but no perfect multiple matches
- MQ0: a perfect multiple match

Library Duplicates

• All second-gen sequencing platforms are NOT single molecule sequencing
– PCR amplification step in library preparation can result in duplicate DNA fragments in the final library prep.
– (PCR-free protocols do exist – require large volumes of input DNA).

• Can result in false SNP calls
– Duplicates manifest themselves as high read depth support

Duplicates and False SNP Calls

Remove Duplicates

• Identify read-pairs where the outer ends map to the same position on the genome and remove all but 1 copy:
– Samtools: samtools rmdup or samtools rmdupse
– Picard/GATK: MarkDuplicates
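The principle of duplicate removal can be sketched in Python as follows. This is a simplified illustration, not the implementation of samtools rmdup or Picard MarkDuplicates: fragments are represented as hypothetical dictionaries, and keeping the copy with the highest score is an assumption made for this example.

```python
from collections import defaultdict


def find_duplicates(fragments: list[dict]) -> set[str]:
    """Return names of fragments to discard, keeping one copy per outer-end position."""
    groups = defaultdict(list)
    for frag in fragments:
        # Fragments whose outer ends map to the same place are candidate duplicates.
        groups[(frag["chrom"], frag["start"], frag["end"])].append(frag)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda frag: frag["score"])
        duplicates.update(frag["name"] for frag in group if frag is not best)
    return duplicates


# Made-up fragments: "a" and "b" share both outer ends, "c" does not.
fragments = [
    {"name": "a", "chrom": "chr1", "start": 100, "end": 350, "score": 70},
    {"name": "b", "chrom": "chr1", "start": 100, "end": 350, "score": 90},
    {"name": "c", "chrom": "chr1", "start": 100, "end": 360, "score": 60},
]
to_remove = find_duplicates(fragments)
```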

Local Realignment - indels

• The trouble with mapping approaches

Local realignment in GATK

• Uses information from known SNPs/indels (dbSNP, 1000 Genomes)
• Uses information from other reads
• Smith-Waterman exhaustive alignment on select reads

The quality of a call depends on multiple factors (e.g. position in the read, sequence context). In addition, the alignment can provide useful information. Mismatches to the reference are considered errors (unless they are described polymorphisms).

It supports several platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences (stated on the website) and IonTorrent (stated in the GATK forum).

It combines all the available information to re-evaluate the probability of a wrong call at each position in each read.

It requires a catalogue of variable sites!

We will not run it but you can find how to do it at http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html

Base Quality Recalibration
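The core idea of recalibration can be sketched in Python: count mismatches to the reference for each reported quality value, skip known variable sites, and derive an empirical quality from the observed error rate. This is a strongly simplified illustration, not GATK's BaseRecalibrator, which also bins by other factors such as position in the read and sequence context. The input layout and the +1/+2 smoothing are assumptions made for this example.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites: set[int]) -> dict[int, float]:
    """Map each reported quality to an empirical Phred score.

    observations: iterable of (position, reported_quality, is_mismatch).
    Positions listed in known_sites are ignored, since a mismatch there
    may be a real polymorphism rather than a sequencing error.
    """
    counts = defaultdict(lambda: [0, 0])  # reported quality -> [mismatches, total]
    for position, reported_q, is_mismatch in observations:
        if position in known_sites:
            continue
        counts[reported_q][1] += 1
        if is_mismatch:
            counts[reported_q][0] += 1
    return {
        q: -10 * math.log10((mismatches + 1) / (total + 2))
        for q, (mismatches, total) in counts.items()
    }


# Made-up data: 98 bases reported as Q30 with 1 mismatch, plus a known site.
observations = [(i, 30, i == 7) for i in range(98)] + [(500, 30, True)]
recalibrated = empirical_qualities(observations, known_sites={500})
```

Here bases reported as Q30 are wrong about 2% of the time, so their empirical quality is close to Q17.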

Base Quality Recalibration

More information: http://zenfractal.com/2014/01/25/bqsr/

General Workflow (GATK)

Variant Calling

• SNP Calling
• Short Indels
• Structural Variants

• A variant call is a conclusion: there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome.

• Sometimes accompanied by an estimate of variant frequency and some measure of confidence

Structural Variants (SVs)

Structural variant is the umbrella term for a group of genomic alterations involving segments of DNA typically larger than 1 kb.

The structural variation may be

• Quantitative (CNVs – indels and duplications)

• Positional (translocations)

• Orientational (inversions).

Structural Variants Detection

[Figure: three signatures of a deletion when reads are mapped to the reference genome: 1. Paired ends (sequenced paired-ends), 2. Read depth (read count relative to the zero level), 3. Split read]

Sequence Read Depth Analysis

[Figure: reads from an individual sequence are mapped to the reference genome; counting mapped reads gives a read depth signal above the zero level]
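Counting mapped reads per window can be sketched in Python as below. This is a toy illustration of the read depth idea, not the method used by CNVnator: the window size, the use of the median as the expected depth, and the 0.5 / 1.5 ratio thresholds are all assumptions.

```python
from statistics import median


def window_depth(read_starts: list[int], genome_length: int, window: int) -> list[int]:
    """Count reads whose 0-based start falls in each fixed-size window."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1
    return counts


def flag_windows(counts: list[int], low: float = 0.5, high: float = 1.5) -> list[str]:
    """Label each window as loss, gain or normal relative to the median depth."""
    expected = median(counts)
    labels = []
    for count in counts:
        ratio = count / expected if expected else 0.0
        if ratio < low:
            labels.append("loss")
        elif ratio > high:
            labels.append("gain")
        else:
            labels.append("normal")
    return labels


# Made-up window counts: the third window looks deleted, the fifth duplicated.
labels = flag_windows([10, 10, 2, 10, 20])
```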

CNVnator on RD data

NA12878, Solexa 36 bp paired reads, ~28x coverage

Paired-ends methods

PEM(2)

Deletion? Insertion?

Different classes of SVs

Nature Reviews Genetics 12, 363-376 (May 2011) | doi:10.1038/nrg2958

Bioinformatics. 2010 Aug 1; 26(15): 1895–1896.

SNP Calling

• SNP – single nucleotide polymorphisms
– Examine the bases aligned to a position and look for differences

• Factors to consider when calling SNPs
– Base call qualities of each supporting base
– Proximity to small indels
– Mapping qualities of the reads supporting the SNP
– Read length
– Paired reads
– Sequencing depth
– Cluster of SNPs
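"Examine the bases aligned to a position" can be sketched in Python with a naive counting approach. This is only an illustration: the thresholds (base quality 20, depth 4, alternate allele fraction 0.2) are arbitrary assumptions, and real callers such as the HaplotypeCaller use probabilistic models instead of fixed cut-offs.

```python
from collections import Counter


def call_site(ref_base: str, pileup: list[tuple[str, int]],
              min_bq: int = 20, min_depth: int = 4, min_alt_frac: float = 0.2):
    """Return (alt_base, alt_fraction) if a SNP is supported, else None.

    pileup: (base, base_quality) pairs for the reads covering one position.
    """
    usable = [base for base, quality in pileup if quality >= min_bq]
    if len(usable) < min_depth:
        return None
    counts = Counter(base for base in usable if base != ref_base)
    if not counts:
        return None
    alt_base, alt_count = counts.most_common(1)[0]
    alt_fraction = alt_count / len(usable)
    if alt_fraction < min_alt_frac:
        return None
    return alt_base, alt_fraction


# Made-up pileup: 5 reference bases, 5 alternate bases, 1 low-quality base.
pileup = [("A", 30)] * 5 + [("G", 35)] * 5 + [("T", 5)]
call = call_site("A", pileup)
```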

Example SNP

http://www.sanger.ac.uk/mousegenomes

Short Indel Calling

• Small insertions and deletions observed in the alignment of the read relative to the reference genome

• Factors to consider when calling indels
– Misalignment of the read
– Homopolymer runs either side of the indel
  • AAAA or TTTTTTTT
– Length of the reads
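One way to check the homopolymer factor is to measure the run of identical bases around a candidate indel position. The Python sketch below is a hypothetical helper written for this illustration, not part of any variant caller.

```python
def homopolymer_run(seq: str, pos: int) -> int:
    """Length of the run of identical bases that contains 0-based index pos."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


# Made-up sequence with a run of eight T bases.
run_length = homopolymer_run("ACGTTTTTTTTGA", 5)
```

An indel call that falls inside a long run deserves more suspicion than one in a non-repetitive context.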

Example Indel

http://www.sanger.ac.uk/mousegenomes

How does the HaplotypeCaller work?

General Workflow (GATK)

Pileup format

Example

Variant Call Format (VCF)

• VCF is a standardized format for storing DNA polymorphism data
– SNPs, insertions, deletions and structural variants
– With rich annotations

• Indexed for fast data retrieval of variants from a range of positions

• Store variant information across many samples

• Record meta-data about the site
– dbSNP accession, filter status, validation status

• Very flexible format

Header
lines starting with ##: arbitrary number of meta-information lines
line starting with #: column definition – mandatory columns include:

CHROM    chromosome
POS      position of the start of the variant
ID       unique identifier of the variant (e.g. rs number for SNPs)
REF      reference allele
ALT      comma separated list of alternate non-reference alleles
QUAL     phred-scaled quality score
FILTER   site filtering information
INFO     user extensible annotation (e.g. samtools and GATK may differ in this)

Data
one line per site (all columns described above per line); useful information per site and per sample

Variant Call Format (VCF)
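The mandatory columns can be read with a few lines of Python. This is a minimal sketch for illustration: the record is made up, the function name is hypothetical, and per-sample columns are ignored. A dedicated VCF library is a better choice for real data.

```python
def parse_vcf_line(line: str) -> dict:
    """Split one VCF data line into its eight mandatory columns."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True  # flags carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # comma separated alternate alleles
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


# Made-up record for illustration only.
record = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=30;DB")
```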

Variant Call Format (VCF)

http://vcftools.sourceforge.net/specs.html

Example:

SnpEff : annotation of variants

SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).

A typical SnpEff use case would be:

• Input: The inputs are predicted variants (SNPs, insertions, deletions and MNPs). The input file is usually obtained as a result of a sequencing experiment, and it is usually in variant call format (VCF).

• Output: SnpEff analyzes the input variants. It annotates the variants and calculates the effects they produce on known genes (e.g. amino acid changes).

Common cautions (*):
- Base quality BQ20
- Depth (min and max) very dependent on your average
- Mapping quality MQ50/60
- Strand-bias
- SNP density dependent on the genome [e.g. no more than 1 SNP/10 bp]
- Indel proximity not closer than 10bp to an indel

(*) Some filters may be applied during the variant calling while others are applied afterwards

Further reading: “Consensus Rules in Variant Detection from Next-Generation Sequencing Data” Jia et al 2012 PLoS One

Filtering SNP Rules

Keep in mind your project may have some specific requirements !
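Some of the cautions above can be sketched as simple Python filters. This is only an illustration under assumptions: the site dictionary layout is hypothetical, and the depth bounds of 0.5x and 2x the mean depth are arbitrary choices. Strand-bias is left out.

```python
def passes_filters(site: dict, mean_depth: float) -> bool:
    """Apply base quality, mapping quality, depth and indel proximity filters."""
    min_depth, max_depth = 0.5 * mean_depth, 2.0 * mean_depth  # assumed bounds
    return (
        site["bq"] >= 20
        and site["mq"] >= 50
        and min_depth <= site["depth"] <= max_depth
        and site["indel_distance"] >= 10
    )


def thin_by_density(positions: list[int], window: int = 10) -> list[int]:
    """Drop SNPs that have another SNP closer than `window` bp."""
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        near_previous = i > 0 and pos - ordered[i - 1] < window
        near_next = i + 1 < len(ordered) and ordered[i + 1] - pos < window
        if not (near_previous or near_next):
            kept.append(pos)
    return kept


# Made-up site and positions for illustration only.
site = {"bq": 30, "mq": 60, "depth": 28, "indel_distance": 45}
keep_site = passes_filters(site, mean_depth=30)
isolated = thin_by_density([100, 105, 200, 300, 309])
```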

What to do if I don’t have a valid reference?