genomics and transcriptomics

51
INSTRUCTOR: Aureliano Bombarely Department of Bioscience Universita degli Studi di Milano [email protected] Genomics and Transcriptomics Class 08 - Variant Calling

Upload: others

Post on 24-Oct-2021

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Genomics and Transcriptomics

INSTRUCTOR:Aureliano Bombarely

Department of BioscienceUniversita degli Studi di [email protected]

Genomics and Transcriptomics

Class 08 - Variant Calling

Page 2: Genomics and Transcriptomics

1. Basics about variants in genomics.

2. Manipulating SAM/BAMs and coverage.

3. Simple variants: SNVs, InDels and MNVs.

4. Copy Number Variation (CNV).

5. Structural Variants (SV).

6. Annotating variants and assessing its impact.

Outline of Topics

Page 3: Genomics and Transcriptomics

1. Basics about variants in genomics.

2. Manipulating SAM/BAMs and coverage.

3. Simple variants: SNVs, InDels and MNVs.

4. Copy Number Variation (CNV).

5. Structural Variants (SV).

6. Annotating variants and assessing its impact.

Outline of Topics

Page 4: Genomics and Transcriptomics

1. Basics about variants in genomics.

Genetic variation can be defined as different forms of a genetic region of different individuals of different populations (polymorphisms) or species (variants).

** **polymorphismsvariants

Page 5: Genomics and Transcriptomics

1. Basics about variants in genomics.

Genetic variation can be defined as different forms of a genetic region of different individuals of different populations (polymorphisms) or species (variants).

**variants**

Page 6: Genomics and Transcriptomics

1. Basics about variants in genomics.

Genetic variation can be defined as different forms of a genetic region of different individuals of different populations (polymorphisms) or species/cells (variants).

** **variants

Page 7: Genomics and Transcriptomics

1. Basics about variants in genomics.

Genetic variation can be defined as different forms of a genetic region of different individuals of different populations (polymorphisms) or species (variants).

Genetic variation

• Large size changes - Chromosomal reorganisations (Structural Variants)

• Medium size changes - Changes in the gene copy number

• Discrete changes - Single Nucleotide Variants, Insertions/deletions

Page 8: Genomics and Transcriptomics

1. Basics about variants in genomics.

Genetic variation can be defined as different forms of a genetic region of different individuals of different populations (polymorphisms) or species (variants).

There are different approaches for study genetic variants

• Cellular techniques (e.g. flow cytometry, FISH…).

• Molecular techniques (e.g. PCR…).

• Genetic/Genomic techniques (e.g. WGS).

Page 9: Genomics and Transcriptomics

1. Basics about variants in genomics.

Variant calling is the process by which identify variants between different individuals or cell types.

Sample 1 Sample 2

• Pool of Individuals • Single individual • Single cell type • Single cell

DNA 1 DNA 2

raw/proc. reads 1

mapped reads 1

raw/proc. reads 2

mapped reads 2

variants 1 variants 2

reference reference

variant calling variant calling

read mapping read mapping

sequencing sequencing

Page 10: Genomics and Transcriptomics

1. Basics about variants in genomics.

2. Manipulating SAM/BAMs and coverage.

3. Simple variants: SNVs, InDels and MNVs.

4. Copy Number Variation (CNV).

5. Structural Variants (SV).

6. Annotating variants and assessing its impact.

Outline of Topics

Page 11: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

mapped reads 1

variants 1

variant calling

raw sam/bam file

filtered sam/bam file

sorted sam/bam file

indexed sam/bam file

Variants

Before the variant calling it is essential to perform several steps

Page 12: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Before the variant calling it is essential to perform several steps

http://samtools.sourceforge.net/

Page 13: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

mapped reads 1

variants 1

variant calling

raw sam/bam file

filtered sam/bam file

sorted sam/bam file

indexed sam/bam file

Variants

Before the variant calling it is essential to perform several steps

samtools view -F 4 -Sb -o filtered.bam mapping.sam

samtools sort -o sorted.bam filtered.bam

samtools index sorted.bam

Page 14: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Good practices before perform the variant calling:

1. Retrieve information about your mapping.

1.1. How many reads were mapped?

1.2. How many regions have coverage of 0?

1.3. How many regions have a coverage of < 5?

1.4. What it is the average coverage?

1.5. What it is the maximum coverage?

2. Know the limitations of your technique and filter your reads accordingly (e.g. for WGS it is worthy to filter PCR duplications).

3. Realign reads if the variant caller does not have this process integrated

Page 15: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Good practices before perform the variant calling:

1. Retrieve information about your mapping.

1.1. How many reads were mapped?

1.2. How many regions have coverage of 0?

1.3. How many regions have a coverage of < 5?

1.4. What it is the average coverage?

1.5. What it is the maximum coverage?

2. Know the limitations of your technique and filter your reads accordingly (e.g. for WGS it is worthy to filter PCR duplications).

3. Realign reads if the variant caller does not have this process integrated

https://bedtools.readthedocs.io/en/latest/

Page 16: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

https://bedtools.readthedocs.io/en/latest/

1. Retrieve information about your mapping

https://broadinstitute.github.io/picard/

• Samtools

• Bedtools

• Picards tools

Page 17: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Library preparation problems

Sequencing errors produce biases in the variant call.

Page 18: Genomics and Transcriptomics

Sequencing errors - Solutions: • High coverage (< 20 X) to minimize sequencing errors. • Recalibrate bases (Base Score Quality Recalibration - BSQR)

using tools such as BaseRecalibrator.

2. Manipulating SAM/BAMs and coverage.

Library preparation problems

Page 19: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Good practices before perform the variant calling:

1. Retrieve information about your mapping.

1.1. How many reads were mapped?

1.2. How many regions have coverage of 0?

1.3. How many regions have a coverage of < 5?

1.4. What it is the average coverage?

1.5. What it is the maximum coverage?

2. Know the limitations of your technique and filter your reads accordingly (e.g. for WGS it is worthy to filter PCR duplications).

3. Realign reads if the variant caller does not have this process integrated

Page 20: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Library preparation problems

PCR duplications produce biases in the variant call (e.g. het.) • Library specific problem for Whole Genome Sequencing.

GeneA GeneB GeneC

Fragmentation

Librarypreparation

PCRDuplication

Page 21: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Library preparation problems

PCR duplications - Solutions: • Mark duplicates with tools such as samtools rmdup

SKIPPCRDUPLICATIONMARKINGSTEPFORGBS,RAD-SEQ…

CAREFUL:SomereducedrepresentationstechniqueswithunequalratiosofsiteamplicationWILLPRODUCETHOUSANDSPCRDUPLICATION

Page 22: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Good practices before perform the variant calling:

1. Retrieve information about your mapping.

1.1. How many reads were mapped?

1.2. How many regions have coverage of 0?

1.3. How many regions have a coverage of < 5?

1.4. What it is the average coverage?

1.5. What it is the maximum coverage?

2. Know the limitations of your technique and filter your reads accordingly (e.g. for WGS it is worthy to filter PCR duplications).

3. Realign reads if the variant caller does not have this process integrated

Page 23: Genomics and Transcriptomics

2. Manipulating SAM/BAMs and coverage.

Alignment problems

coordinates 12345678901234 5678901234567890123456reference aggttttttataac---aattaagtctacagagcaactasample aggttttttataacAATaattaagtctacagagcaactaread1 aggttttttataac***aaAtaaread2 ggttttttataac***aaAtaaTtread3 ttttataacAATaattaagtctacaread4 CaaT***aattaagtctacagagcaacread5 aaT***aattaagtctacagagcaactread6 T***aattaagtctacagagcaacta

Aligners calculate the alignment correctness and give it a score depending of:

• Length of the alignment. • Number of mismatches and gaps. • Uniqueness of the alignment (number of hits).} Good alignment

Misaligned bases

Misaligned bases - Solutions: • Read realignment (IndelRealigner for GATK (obsolete),

now it is integrated in the HaplotypeCaller). • Mark alignment quality per base (BAQ) and do not use for

variant calling.

Page 24: Genomics and Transcriptomics

1. Basics about variants in genomics.

2. Manipulating SAM/BAMs and coverage.

3. Simple variants: SNVs, InDels and MNVs.

4. Copy Number Variation (CNV).

5. Structural Variants (SV).

6. Annotating variants and assessing its impact.

Outline of Topics

Page 25: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

GACGTGC

GCCGTGC| |||||

Sample 1

Sample 2

GACGTGC

G-CGTGC| |||||

Sample 1

Sample 2

SNVs/SNPs

INDELs/DIVs/DIPs

GACGTGC

GCTGTGC| ||||

Sample 1

Sample 2

MNVs/MNPs

Single Nucleotide Variant/Polymorphism (SNV/SNP) is a substitution of a single nucleotide at a specific position

Insertion/Deletion (InDel/DIV/DIP) is a insertion or a deletion of several nucleotides at a specific position

Multiple Nucleotide Variant/Polymorphism (MNV/MNP) is the substitution of several nucleotide at a specific position

Page 26: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

Considerations

Page 27: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

Processed Reads

Mapped Reads

Processed Map

Variants

Read mapping

Local realignment, sort and filtering

Variant calling

Annotated Variants

Variant annotation

Variant calling:

• Heuristic methods (read depth)

- SamTools

- VarScan

• Probabilistic methods (bayesian)

- GATK

- FreeBayes

- SOAPsnp/SOAPindel

Variant filtering

Page 28: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

Name Type Strength Weaknesses

SamTools Heuristic

• Assumes errors are non-independent (matches data)

• Good accuracy with low coverage data

• Reasonably quick

• Increase false positives at high coverage

• Lower quality indel calling

GATK Probabilistic• Trains with real data • Excellent accuracy with high

coverage data • Low false positive rate

• Assumes errors are independent

• High level of preprocessing • Very slow

FreeBayes Probabilistic

• Combined bam population estimate

• Good accuracy with low coverage data

• Very very quick

• No training, population level estimate only

• Lower quality indel calling

Variant calling popular tools

Page 29: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

Choosing the right tool

Page 30: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

Methods for Variant Evaluation

• Validation by Sanger Sequencing of specific candidates (~5 - 500) using other

datasets (e.g. transcriptome) if it is possible.

• Comparison with other method (e.g. genotyping chip).

• Different mapping and variant calling tools comparison (with a “truth set” or a

“gold standard” if it is possible).

https://gatkforums.broadinstitute.org/gatk/discussion/6308/evaluating-the-quality-of-a-variant-callset

Page 31: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

• Validation by Sanger Sequencing of specific candidates (~5 - 500) using other

datasets (e.g. transcriptome) if it is possible.

Variants from RNASeq

(Illumina)

Variants from ESTs

(Sanger)

Page 32: Genomics and Transcriptomics

• Different mapping, variant calling tools and datasets comparison (with a “truth

set” or a “gold standard” if it is possible).

3. Simple variants: SNVs, InDels and MNVs.

Assumptions:

1. The content of the truth set has been validated.

2. Your samples are expected to have similar genomic content as the

population of samples that was used to produce the truth set

Page 33: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

• Different mapping, variant calling tools and datasets comparison (with a “truth

set” or a “gold standard” if it is possible).

Metrics: 1. Variant level concordance: Percentage of variants in your samples that

match (are concordant with) variants in your truth set.

2. Genotype concordance: Percentage of variants in your genotype that match (are concordant with) variants in your truth set.

False Positives (FP) False Negatives (FN)

True Positives (TP)

My Dataset (16) True Set (18) % SENSITIVITY: TP * 100 / (TP + FN) = 13 * 100 / (13 + 5) = 72%

% FALSE DISCOVERY RATE:FP * 100 / (TP + FP) = 3 * 100 / (13 + 3) = 20%

% GT CONCORDANCE: SumMatches * 100 / TP

6 * 100 / 11 = 54%

A * T C T C C * C A CA T T C * C C T * A *1 0 1 1 0 1 1 0 0 1 0

True Set (9)My Dataset (8)

Matches (6)

Page 34: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

• Different mapping, variant calling tools and datasets comparison (with a “truth

set” or a “gold standard” if it is possible).

Metrics:

3. Number of SNPs and INDELs: Between different datasets should be consistent for the same number of mapped reads.

4. TiTv Ratio: Ratio of transition (Ts) to transversion (Tv) SNPs should be random (~0.5). Methylation islands (CpG) and other factors may introduce a bias so expected values will range from 0.5 - 3.0.

5. Ratio Insertions/Deletions: It should be close to 1, except in rare alleles that it could be 0.2 - 0.5.

Page 35: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

Comparison between different tools:

https://bcbio.wordpress.com/

• Different mapping, variant calling tools and datasets comparison (with a “truth

set” or a “gold standard” if it is possible).

Page 36: Genomics and Transcriptomics

3. Simple variants: SNVs, InDels and MNVs.

Tools:

Name URL

VariantEvaluation(GATK)

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php

GenotypeConcordance(GATK)

https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeConcordance.php

VCFTools http://vcftools.sourceforge.net/

VCFStats http://lindenb.github.io/jvarkit/

PicardTools https://broadinstitute.github.io/picard/index.html

• Different mapping, variant calling tools and datasets comparison (with a “truth

set” or a “gold standard” if it is possible).

Page 37: Genomics and Transcriptomics

1. Basics about variants in genomics.

2. Manipulating SAM/BAMs and coverage.

3. Simple variants: SNVs, InDels and MNVs.

4. Copy Number Variation (CNV).

5. Structural Variants (SV).

6. Annotating variants and assessing its impact.

Outline of Topics

Page 38: Genomics and Transcriptomics

4. Copy Number Variation (CNV).

https://www.genome.gov/genetics-glossary/Copy-Number-Variation

A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next.

Page 39: Genomics and Transcriptomics

4. Copy Number Variation (CNV).

A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next.

Page 40: Genomics and Transcriptomics

4. Copy Number Variation (CNV).

A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next.

Zhang, Le, et al. "Comprehensively benchmarking applications for detecting copy number variation." PLoS computational biology 15.5 (2019): e1007069.

Page 41: Genomics and Transcriptomics

1. Basics about variants in genomics.

2. Manipulating SAM/BAMs and coverage.

3. Simple variants: SNVs, InDels and MNVs.

4. Copy Number Variation (CNV).

5. Structural Variants (SV).

6. Annotating variants and assessing its impact.

Outline of Topics

Page 42: Genomics and Transcriptomics

5. Structural Variants (SV).

https://www.ncbi.nlm.nih.gov/dbvar/content/overview/

Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances (insertions and deletions), commonly referred to as copy number variants (CNVs). These CNVs often overlap with segmental duplications, regions of DNA >1 kb present more than once in the genome, copies of which are >90% identical. If present at >1% in a population a CNV may be referred to as copy number polymorphism (CNP).

Figure 1: Charcot-Marie Tooth (CMT) disease. Unequal crossing over between two highly homologous repeats on chromosome 17p12 can result in (A) 3 copies of the PMP22 gene with the CMT1A phenotype or the reciprocal (B) and 1 copy of the PMP22 gene with the HNPP phenotype.

Page 43: Genomics and Transcriptomics

5. Structural Variants (SV).

Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances

Page 44: Genomics and Transcriptomics

5. Structural Variants (SV).

Structural variation (SV) is generally defined as a region of DNA approximately 1 kb and larger in size and can include inversions and balanced translocations or genomic imbalances

Kosugi, Shunichi, et al. "Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing." Genome biology 20.1 (2019): 117.

Page 45: Genomics and Transcriptomics

1. Basics about variants in genomics.

2. Manipulating SAM/BAMs and coverage.

3. Simple variants: SNVs, InDels and MNVs.

4. Copy Number Variation (CNV).

5. Structural Variants (SV).

6. Annotating variants and assessing its impact.

Outline of Topics

Page 46: Genomics and Transcriptomics

6. Annotating variants and assessing its impact.

Variant

Non coding regions Coding regions

Synonymous Non Synonymous

Misssense

Nonssense

Page 47: Genomics and Transcriptomics

6. Annotating variants and assessing its impact.

https://en.wikipedia.org/wiki/DNA

mRNA is read by groups of three nucleotides called codons. Each three nucleotides represent an aminoacid that it is carried by a tRNA during the translation.

Page 48: Genomics and Transcriptomics

6. Annotating variants and assessing its impact.

A non-synonymous substitution is a nucleotide mutation that alters the amino acid sequence of a protein. Non-synonymous substitutions differ from synonymous substitutions, which do not alter amino acid sequences and are (sometimes) silent mutations. As non-synonymous substitutions result in a biological change in the organism, they are subject to natural selection.

Non Synonymous

Misssense Nonssense

https://en.wikipedia.org/wiki/Nonsynonymous_substitution

Page 49: Genomics and Transcriptomics

6. Annotating variants and assessing its impact.

Program used to annotate variants

http://snpeff.sourceforge.net/

Page 50: Genomics and Transcriptomics

6. Annotating variants and assessing its impact.

Program used to annotate variants http://snpeff.sourceforge.net/

Page 51: Genomics and Transcriptomics

6. Annotating variants and assessing its impact.

Program used to annotate variants http://snpeff.sourceforge.net/