C3BI - Institut Pasteur
VARIANT CALLING
November 2016
Pierre Lechat, Stéphane Descorps-Declère
General Workflow (GATK)
Software websites
Software    Website
bwa         http://bio-bwa.sourceforge.net/
picard      http://picard.sourceforge.net/
samtools    http://samtools.sourceforge.net/
GATK        http://www.broadinstitute.org/gatk/
IGV         http://software.broadinstitute.org/software/igv/
tablet      http://bioinf.scri.ac.uk/tablet/
vcftools    http://vcftools.sourceforge.net/
Raw Sequence Data Format
• FASTQ format: each record holds a sequence ID, the sequence and a quality score string
• Phred quality score, encoded as one character per base:
Quality character: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRS
Phred score:       0 ... 50
Error rate:        1 ... 0.00001
Phred score = -10 * log10(P), where P is the error rate
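As a minimal sketch of the Phred relation above, the following Python converts between scores and error rates and decodes a quality string. The function names are illustrative, and the offset of 33 is an assumption that matches the character range shown on the slide ('!' for 0 up to 'S' for 50).

```python
import math


def phred_to_error_prob(q):
    """Error rate P for a Phred score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error rate P."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (it maps '!' to 0 and 'S' to 50).
    """
    return [ord(ch) - offset for ch in qual]


print(decode_quality_string("!5?IS"))  # prints [0, 20, 30, 40, 50]
```

A score of 20 therefore corresponds to roughly one wrong base call in 100, and a score of 30 to one in 1000.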
Quality Score
• Q-Score = Quality Table(Quality Predictors)
Quality Checks
How do my newly obtained data look?
Check for overall data quality. FastQC is a great tool for quality assessment.
[Figure: two FastQC reports, labelled "Good quality!" and "Poor quality!"]
What do I do when FastQC calls my data poor?
• Poor quality at the ends can be remedied with “quality trimmers” like Trimmomatic, fastx-toolkit, etc.
• Left-over adapter sequences in the reads can be removed with “adapter trimmers” like Trimmomatic.
• Always trim adapters as a matter of routine.
• Once the trimmers have been used, it is best to rerun the data through FastQC to check the resulting data.
[Figure: FastQC reports before quality trimming and after quality trimming]
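To illustrate what trimming poor-quality ends means, here is a minimal Python sketch that removes bases from the 3' end of a read while their Phred score stays below a threshold. It is not the algorithm of Trimmomatic or fastx-toolkit; the function name, the threshold of 20 and the offset of 33 are assumptions made for this example.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record.
    min_q and offset are illustrative defaults, not tool settings.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# '#' encodes a Phred score of 2, 'I' a score of 40 (with offset 33).
print(trim_low_quality_tail("ACGTACGT", "IIIII###"))  # prints ('ACGTA', 'IIIII')
```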
QC and Mapping
You don’t need to do quality trimming with bwa-mem. […] Bwa-mem largely does local alignment. If a tail cannot be mapped well, it will be (soft) clipped.
Heng Li.
It is still recommended to trim adapter sequences. After all, adapters are not part of the samples you are sequencing. They might affect variant calling in corner cases.
Heng Li.
General Workflow
Short Read Alignment
Sequencing machine
And you get MILLIONS of them !
Short Read Mapping
Need to map them back to the reference chromosomes
Mapping
Find best match for the read in a reference sequence
Reference sequence: CTGACCTCATGTGATCCACCCGCCTTGGCC
Read (length 8 bases): TGATCCAC
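The toy example above can be reproduced with a brute-force scan. This is only a sketch of the idea of "best match": it slides the read along the reference without gaps and counts mismatches, which is not how BWA or Bowtie work (they rely on an index). The function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(reference, read):
    """Return (position, mismatches) of the best ungapped placement.

    Positions are 0-based; on ties the leftmost placement is returned.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(reference[pos:pos + len(read)], read)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


reference = "CTGACCTCATGTGATCCACCCGCCTTGGCC"
read = "TGATCCAC"
print(best_match(reference, read))  # prints (11, 0)
```

A scan like this costs time proportional to reference length times read length for every read, which is why it does not scale to millions of reads.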
Challenges:
• Errors in reads
• Errors in libraries
• Repetitive regions (repeats, homologous regions)
• Homopolymers
• Individual polymorphisms
Different Mapping Algorithms
• BWA – 2009
• BWA-SW – 2010
• BWA-MEM – 2013
• Bowtie – 2009
• Bowtie2 – 2012
• Gem – 2012
• Cushaw2 – 2014
• Novoalign
Li, arXiv:1303.3997 (2013)
Further reading: “A survey of sequence alignment algorithms for next-generation sequencing”, Li H. and Homer N. 2010. Briefings in Bioinformatics
SAM/BAM Format
• SAM (Sequence Alignment/Map) format:
– Single unified format for storing read alignments to a reference genome
• BAM (Binary Alignment/Map) format:
– Binary equivalent of SAM
– Advantages: supports indexing, compact size
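For orientation, a SAM alignment line is tab-separated and starts with 11 mandatory fields. The sketch below splits one such line in Python; the record itself is made up for illustration (it places the toy read from the mapping slide at 1-based position 12 on a hypothetical "chr1"), and optional tags after the 11th field are ignored.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, in order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
        f[5], f[6], int(f[7]), int(f[8]), f[9], f[10],
    )


# Hypothetical record, for illustration only.
line = "read1\t0\tchr1\t12\t60\t8M\t*\t0\t0\tTGATCCAC\tIIIIIIII"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.mapq, rec.cigar)  # prints chr1 12 60 8M
```

In practice BAM files are read with samtools or a dedicated library rather than by hand; the sketch is only meant to show what the format stores.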
BAM Improvement
• Remove duplicates
• Local realignment
• Base quality recalibration
Mapping Quality (MQ)
What if there are several possible places to align your sequencing read? This may be due to:
- Repeated elements in the genome
- Low complexity sequences
- Reference errors and gaps
MQ is a Phred-scaled score of the quality of the alignment:
- high MQ: a single match
- low MQ: the probability of mapping to different locations is high, but no perfect multiple matches
- MQ0: a perfect multiple match
[Figure: raw reads mapped against the reference genome]
Library Duplicates
• All second-gen sequencing platforms are NOT single molecule sequencing
– PCR amplification step in library preparation can result in duplicate DNA fragments in the final library prep.
– (PCR-free protocols do exist but require large volumes of input DNA.)
• Can result in false SNP calls
– Duplicates manifest themselves as high read depth support
Duplicates and False SNP Calls
Remove Duplicates
• Identify read-pairs where the outer ends map to the same position on the genome and remove all but 1 copy:
– Samtools: (samtools rmdup || samtools rmdupse)
– Picard/GATK: MarkDuplicates
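The rule above can be sketched in a few lines of Python. This is an illustration of the principle, not the implementation used by samtools or Picard: the dictionary keys are hypothetical, and keeping the copy with the highest summed base quality is an assumption of this sketch.

```python
def mark_duplicates(pairs):
    """Return the names of read pairs flagged as duplicates.

    pairs: list of dicts with keys name, chrom, start, end, strand, qual_sum
    (hypothetical field names). Pairs sharing the same outer coordinates and
    strand form one group; the copy with the highest qual_sum is kept
    (first one wins on ties) and all others are flagged.
    """
    best = {}
    for p in pairs:
        key = (p["chrom"], p["start"], p["end"], p["strand"])
        if key not in best or p["qual_sum"] > best[key]["qual_sum"]:
            best[key] = p
    kept = {p["name"] for p in best.values()}
    return {p["name"] for p in pairs if p["name"] not in kept}


pairs = [
    {"name": "a", "chrom": "chr1", "start": 100, "end": 400, "strand": "+", "qual_sum": 900},
    {"name": "b", "chrom": "chr1", "start": 100, "end": 400, "strand": "+", "qual_sum": 950},
    {"name": "c", "chrom": "chr1", "start": 150, "end": 400, "strand": "+", "qual_sum": 800},
]
print(mark_duplicates(pairs))  # prints {'a'}
```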
Local Realignment - indels
• The trouble with mapping approaches
Local realignment in GATK
• Uses information from known SNPs/indels (dbSNP, 1000 Genomes)
• Uses information from other reads
• Smith-Waterman exhaustive alignment on select reads
Base Quality Recalibration
The quality of a base call depends on multiple factors (e.g. position in the read, sequence context). In addition, the alignment can provide useful information. Mismatches to the reference are considered errors (unless they are described polymorphisms).
It supports several platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences (stated on the website) and IonTorrent (stated in the GATK forum).
It combines all the available information to re-evaluate the probability of a wrong call at each position in each read.
It requires a catalogue of variable sites!
We will not run it, but you can find how to do it at http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html
More information : http://zenfractal.com/2014/01/25/bqsr/
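The core idea of recalibration can be shown with a deliberately simplified Python sketch: group bases by their reported quality, count mismatches to the reference outside known variable sites, and derive an empirical quality from the observed mismatch rate. This is not GATK's model, which also uses covariates such as position in the read and sequence context; the function name, the input layout and the add-one smoothing are assumptions of this example.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_sites):
    """Map each reported quality to an empirical Phred score.

    observations: iterable of (position, reported_q, is_mismatch).
    known_sites: set of positions with described polymorphisms; these are
    skipped so that real variants are not counted as errors.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for pos, q, is_mismatch in observations:
        if pos in known_sites:
            continue
        counts[q][0] += int(is_mismatch)
        counts[q][1] += 1
    # Add-one smoothing (an assumption) avoids log10(0) when no mismatch is seen.
    return {q: -10 * math.log10((mm + 1) / (n + 2)) for q, (mm, n) in counts.items()}


# 998 bases reported at Q30, 9 of them mismatching, plus one known variant site.
obs = [(i, 30, i < 9) for i in range(998)] + [(5000, 30, True)]
print(empirical_qualities(obs, known_sites={5000}))
```

In this made-up input, bases reported at Q30 behave like roughly Q20 bases, which is the kind of discrepancy recalibration is meant to correct.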
General Workflow (GATK)
Variant Calling
• SNP Calling
• Short Indels
• Structural Variants
• A variant call is a conclusion: there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome.
• Sometimes accompanied by an estimate of variant frequency and some measure of confidence
Structural Variants (SVs)
Structural variant is the umbrella term to encompass a group of genomic alterations involving segments of DNA typically larger than 1 kb.
The structural variation may be
• Quantitative (CNVs – indels and duplications)
• Positional (translocations)
• Orientational (inversions)
Structural Variants Detection
1. Paired ends
2. Read depth
3. Split read
[Figure: one panel per method. Labels: reference, genome, mapping, deletion, sequenced paired-ends, reads, read count, zero level, read]
Sequence Read Depth Analysis
[Figure: read depth signal. Labels: individual sequence, reads, mapping, reference genome, counting mapped reads, zero level]
CNVnator on RD data
NA12878, Solexa 36 bp paired reads, ~28x coverage
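Counting mapped reads can be sketched as follows in Python. This is not the CNVnator algorithm; it only bins read start positions into fixed windows and flags windows whose count falls well below the median. The window size, the 0.5 ratio and the function names are assumptions chosen for the example.

```python
def window_depth(read_starts, genome_length, window=100):
    """Count reads whose 0-based start position falls in each window."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for s in read_starts:
        if 0 <= s < genome_length:
            counts[s // window] += 1
    return counts


def low_depth_windows(counts, ratio=0.5):
    """Indices of windows with fewer reads than ratio * median count."""
    ordered = sorted(counts)
    median = ordered[len(ordered) // 2]
    return [i for i, c in enumerate(counts) if c < ratio * median]


# Made-up data: 10 reads per window, except a single read in window 2.
starts = [w * 100 + i for w in (0, 1, 3, 4) for i in range(10)] + [250]
counts = window_depth(starts, genome_length=500)
print(counts)                     # prints [10, 10, 1, 10, 10]
print(low_depth_windows(counts))  # prints [2]
```

A drop like the one in window 2 is the kind of read depth signal that suggests a deletion; a sustained excess would suggest a duplication.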
Paired-ends methods
PEM (2)
Deletion? Insertion?
Different classes of SVs
Nature Reviews Genetics 12, 363-376 (May 2011) | doi:10.1038/nrg2958
Bioinformatics. 2010 Aug 1; 26(15): 1895–1896.
SNP Calling
• SNP – single nucleotide polymorphism
– Examine the bases aligned to a position and look for differences
• Factors to consider when calling SNPs
– Base call qualities of each supporting base
– Proximity to small indels
– Mapping qualities of the reads supporting the SNP
– Read length
– Paired reads
– Sequencing depth
– Cluster of SNPs
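A naive version of "examine the bases aligned to a position" can be written as below. It only uses two of the factors listed above (base quality and depth) and is not how GATK's HaplotypeCaller or samtools call variants; all thresholds and the function name are assumptions for illustration.

```python
from collections import Counter


def call_snp(ref_base, bases, quals, min_bq=20, min_depth=8, min_alt_fraction=0.2):
    """Return (alt_base, alt_fraction) for one position, or None.

    bases: read bases aligned to the position; quals: their Phred scores.
    Bases below min_bq are ignored; the thresholds are illustrative only.
    """
    kept = [b.upper() for b, q in zip(bases, quals) if q >= min_bq]
    if len(kept) < min_depth:
        return None
    alt_counts = Counter(b for b in kept if b != ref_base.upper())
    if not alt_counts:
        return None
    alt, n_alt = alt_counts.most_common(1)[0]
    fraction = n_alt / len(kept)
    return (alt, fraction) if fraction >= min_alt_fraction else None


print(call_snp("A", "AAAAGGGGGA", [30] * 10))       # prints ('G', 0.5)
print(call_snp("A", "AAAAAAAAAG", [30] * 9 + [5]))  # prints None
```

The second call returns None because the only non-reference base has a low quality score and is discarded before counting.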
Example SNP
http://www.sanger.ac.uk/mousegenomes
Short Indel Calling
• Small insertions and deletions observed in the alignment of the read relative to the reference genome
• Factors to consider when calling indels
– Misalignment of the read
– Homopolymer runs either side of the indel (e.g. AAAA or TTTTTTTT)
– Length of the reads
Example Indel
http://www.sanger.ac.uk/mousegenomes
How does the HaplotypeCaller work?
General Workflow (GATK)
Pileup format
Example
Variant Call Format (VCF)
• VCF is a standardized format for storing DNA polymorphism data
– SNPs, insertions, deletions and structural variants
– With rich annotations
• Indexed for fast data retrieval of variants from a range of positions
• Store variant information across many samples
• Record meta-data about the site
– dbSNP accession, filter status, validation status
• Very flexible format
Header
– lines starting with ##: arbitrary number of meta-information lines
– line starting with #: column definition – mandatory columns include:
CHROM   chromosome
POS     position of the start of the variant
ID      unique identifier of the variant (e.g. rs number for SNPs)
REF     reference allele
ALT     comma separated list of alternate non-reference alleles
QUAL    phred-scaled quality score
FILTER  site filtering information
INFO    user extensible annotation (e.g. samtools and GATK may differ in this)
Data
– one line per site (all columns described above per line); useful information per site and per sample
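To make the column layout concrete, here is a minimal Python sketch that parses the eight mandatory columns of one VCF data line. The record is invented for illustration, per-sample columns are ignored, and real projects would normally rely on vcftools or a dedicated VCF library instead.

```python
def parse_vcf_line(line):
    """Parse the 8 mandatory columns of one VCF data line into a dict."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            # Keys without a value (e.g. DB) are flags.
            info_dict[key] = value if value else True
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": info_dict,
    }


# Hypothetical record, for illustration only.
record = parse_vcf_line("chr1\t12\t.\tA\tG,T\t50\tPASS\tDP=14;DB")
print(record["ALT"], record["INFO"])  # prints ['G', 'T'] {'DP': '14', 'DB': True}
```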
http://vcftools.sourceforge.net/specs.html
Example:
SnpEff: annotation of variants
SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants (such as amino acid changes).
A typical SnpEff use case would be:
• Input: The inputs are predicted variants (SNPs, insertions, deletions and MNPs). The input file is usually obtained as a result of a sequencing experiment, and it is usually in variant call format (VCF).
• Output: SnpEff analyzes the input variants. It annotates the variants and calculates the effects they produce on known genes (e.g. amino acid changes).
Filtering SNP Rules
Common cautions (*):
- Base quality: BQ20
- Depth (min and max): very dependent on your average
- Mapping quality: MQ50/60
- Strand bias
- SNP density: dependent on the genome [e.g. no more than 1 SNP/10 bp]
- Indel proximity: not closer than 10 bp to an indel
(*) Some filters may be applied during the variant calling while others are applied afterwards
Further reading: “Consensus Rules in Variant Detection from Next-Generation Sequencing Data”, Jia et al. 2012, PLoS One
Keep in mind your project may have some specific requirements!
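Some of the cautions above translate directly into post-calling checks. The Python sketch below covers mapping quality, depth, indel proximity and SNP density; base quality and strand bias are left out. The field names, the function name and the depth bounds (0.5x to 2x the average) are assumptions for this example, while the MQ50 and 10 bp values follow the slide.

```python
def failed_filters(snp, mean_depth, indel_positions, snp_positions,
                   min_mq=50, depth_range=(0.5, 2.0),
                   indel_distance=10, snp_distance=10):
    """Return the names of the filters that a SNP fails (empty list = pass).

    snp: dict with keys pos, mq, depth (hypothetical field names).
    depth_range is expressed as multiples of the average depth (assumption).
    """
    failed = []
    if snp["mq"] < min_mq:
        failed.append("low_mq")
    low, high = depth_range
    if not (low * mean_depth <= snp["depth"] <= high * mean_depth):
        failed.append("depth")
    if any(abs(snp["pos"] - p) < indel_distance for p in indel_positions):
        failed.append("near_indel")
    if any(p != snp["pos"] and abs(snp["pos"] - p) < snp_distance for p in snp_positions):
        failed.append("snp_cluster")
    return failed


good = {"pos": 100, "mq": 60, "depth": 30}
bad = {"pos": 100, "mq": 40, "depth": 100}
print(failed_filters(good, 30, [200], [100, 300]))  # prints []
print(failed_filters(bad, 30, [105], [100, 104]))
```

Returning the list of failed filters rather than a single yes/no keeps the reason for each rejection visible, which helps when thresholds need to be adapted to a specific project.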
What to do if I don’t have a valid reference?