Introduction to Next Generation Sequencing (NGS)
Andrew Parrish
Exeter, 2nd November 2017
Topics to cover today
• What is Next Generation Sequencing (NGS)?
• Why do we need NGS?
• Common approaches to NGS
• NGS Workflow
What is Next Generation Sequencing? (NGS)
• Historically we have used Sanger sequencing to investigate genetic diseases
• This looks at one stretch of DNA from one patient at a time (~600 base pairs in length)
• Measures the fluorescence given off when dye-labelled nucleotides are excited by a laser, and uses it to determine the order of bases
What is Next Generation Sequencing? (NGS)
• NGS is also referred to as high-throughput sequencing or massively parallel sequencing
• Generates hundreds of millions of overlapping short sequences (up to 300 bp) in a single run
• These have to be computationally put back together
• Can look at multiple patients in one run
Why do we need Next Generation Sequencing? (NGS)
• The Human Genome Project took about 13 years (1990 to 2003) to complete using Sanger-based technology, at an estimated cost of $3 billion
• Today, using NGS, this could be completed in a day or two for under $1000
Common approaches to NGS
• Targeted panels (tNGS)
  – Pull out specific genes from the patient’s DNA and only obtain the sequence data from these genes (up to about 150 genes)
• “Rare disease”/“Mendeliome”/“Clinical” exome
  – Essentially a very large (6,110 genes) panel that looks at the exons of genes known to cause human disease (at the time of design!)
• “Whole” exome
  – Looks at the exons of 23,244 expressed genes, which together make up 1-2% of the human genome
• Genome sequencing
  – Looks at the complete (ish) DNA sequence from a patient
Common approaches to NGS
• Single gene disease
  – Easily clinically recognisable disease
  – Single genetic aetiology (mutations in one gene cause this disorder)
  – Existing tests widely available in diagnostic laboratories
• Small number of genes for a disease
  – Clinically recognisable disease
  – Multiple sub-types caused by mutations in different genes
  – Highly developed clinical expertise and knowledge available in specialist centres
• Large number of possible causes (or no known cause)
  – Strong suggestion of monogenic disease, but no clear clue to which gene to test
Workflow for NGS
• Patient
• Extract DNA, prepare library and sequence
• Raw reads (FASTQ)
• Assess quality and process reads
• Processed reads (FASTQ)
• Map to reference genome
• Aligned reads (SAM/BAM)
• Call variants (VCF)
• Variant and sample quality control
• Annotate variants
• Assess depth and breadth of coverage
• Filter and prioritise variants
• Integrate with clinical data
• Shortlist of disease related variants
• Visualise data
• Diagnostic report
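The "assess quality and process reads" step can be illustrated with a minimal sketch. It assumes the common FASTQ layout of four lines per read and Phred+33 quality encoding; the two reads, their quality strings and the cut-off of 20 are made up for illustration and are not taken from the slides.

```python
# Hypothetical FASTQ content: two short reads, four lines each.
FASTQ_TEXT = """@read1
ATCTTGTAGG
+
IIIIIIIIII
@read2
GTCTAGGGAA
+
##########
"""


def parse_fastq(text):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean base quality, assuming Phred+33 encoding (ASCII code minus 33)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


# Assumed threshold for this sketch; real pipelines choose their own.
MIN_MEAN_QUALITY = 20

kept = [
    name
    for name, _sequence, quality in parse_fastq(FASTQ_TEXT)
    if mean_phred(quality) >= MIN_MEAN_QUALITY
]
print(kept)
```

Dedicated quality control tools do much more than this (adapter trimming, per-position statistics), so treat the sketch only as a picture of what a quality score is.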
Workflow for NGS
• Patient
• Extract DNA, prepare library and sequence
• Quality Control
• Map to reference genome
• Call variants
• Annotate variants
• Shortlist of disease related variants
• Visualise data
• Diagnostic report
DNA extraction and library preparation
• Genomic DNA
• Fragment
• Target
• Attach adaptors for paired end sequencing
Sequencing
Read mapping
.. TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGGTGTGTGACCAGGGAGGTCCC .. Reference Genome
ATCTTGTAGGGAAACACAAAGTG GTCTAGGGAAGAAGG
• After base calling, align/map sequences onto reference genome
• Determine coordinates (chromosomal position) and add basic annotations (coding, non-coding, etc) if known
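As a rough illustration of mapping, the sketch below looks for each read from the slide as an exact substring of the reference fragment. This is a deliberate simplification: real aligners use indexed data structures and tolerate mismatches and gaps. The names `map_exact`, `read_a` and `read_b` are made up for this example.

```python
# Reference fragment and reads copied from the slide.
REFERENCE = (
    "TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGGTGTGTGACCAGGGAGGTCCC"
)
READS = {
    "read_a": "ATCTTGTAGGGAAACACAAAGTG",
    "read_b": "GTCTAGGGAAGAAGG",
}


def map_exact(read, reference):
    """Return the 0-based offset of the first exact match, or None if absent."""
    position = reference.find(read)
    return None if position == -1 else position


for name, read in READS.items():
    offset = map_exact(read, REFERENCE)
    if offset is None:
        print(f"{name}: no exact match (needs gap/mismatch-tolerant alignment)")
    else:
        print(f"{name}: maps at offset {offset}")
```

Here `read_b` matches the reference exactly, while `read_a` does not because it differs from the reference by a short gap. Differences like this are what variant calling later picks up, which is why exact matching alone is not enough.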
Coverage
• Vertical coverage – how many times a particular base has been sequenced (e.g. 20X, 30X etc.)
  – Greater depth of coverage means improved accuracy for variant detection (but is more expensive)
• Horizontal coverage – how much of the genome has been sequenced
  – Greater target size means more genome is sequenced (but is more expensive)
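Both measures can be computed from a list of per-base read depths. The sketch below is illustrative only: the depth values are invented, and reporting breadth as the fraction of target bases covered at 20X or more is an assumption, not a threshold given in the slides.

```python
def coverage_summary(depths, min_depth=20):
    """Return (mean depth, fraction of bases covered at >= min_depth)."""
    total_bases = len(depths)
    mean_depth = sum(depths) / total_bases
    breadth = sum(1 for depth in depths if depth >= min_depth) / total_bases
    return mean_depth, breadth


# Hypothetical per-base depths across a ten-base target region.
depths = [30, 32, 28, 5, 0, 25, 40, 22, 18, 30]

mean_depth, breadth = coverage_summary(depths)
print(f"mean depth: {mean_depth:.1f}X")
print(f"breadth at >=20X: {breadth:.0%}")
```

A region can have a respectable mean depth while still containing bases with little or no coverage, which is why both numbers are usually reported together.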
Variant calling
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline
chr4 27668 . T C 8.65 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr4 27669 . G T 4.77 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr4 27712 . T C 44 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8
chr4 27774 . G A 5.47 . DP=2;AF1=0.5011;AC1=2;… GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28
chr4 36523 . A T 10.4 . DP=1;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
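To show how the columns of a VCF record fit together, here is a minimal parsing sketch for a single-sample line. It is not a full VCF parser. The example line is based on the chr4:27774 record above, with the INFO field cut off where the slide truncates it, and the function name `parse_vcf_line` is made up for illustration.

```python
def parse_vcf_line(line):
    """Split one single-sample VCF data line into a dictionary of fields."""
    # Real VCF files are tab-separated; split() also copes with the
    # space-separated form shown on the slide.
    chrom, pos, var_id, ref, alt, qual, flt, info, fmt, sample = line.split()[:10]

    # INFO is a semicolon-separated list of KEY=VALUE pairs (or bare flags).
    info_fields = {}
    for item in info.split(";"):
        if item:
            key, _, value = item.partition("=")
            info_fields[key] = value if value else True

    # FORMAT names the colon-separated values in the sample column.
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": flt,
        "info": info_fields,
        "sample": sample_fields,
    }


# INFO is truncated on the slide; only the fields that are visible are used.
line = "chr4 27774 . G A 5.47 . DP=2;AF1=0.5011;AC1=2 GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28"
record = parse_vcf_line(line)
print(record["chrom"], record["pos"], record["ref"], ">", record["alt"])
print("genotype:", record["sample"]["GT"], "depth:", record["info"]["DP"])
```

In the genotype field, `0` stands for the reference allele and `1` for the alternate allele, so `0/1` is a heterozygous call and `1/1` a homozygous alternate call.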
Variants
• A variant is a DNA sequence that is different to the “normal” sequence for a particular species.
• These should be named according to standardised nomenclature (HGVS)
• This allows consistent reporting and must include:
  – Reference sequence - e.g. NM_0000123.4
  – cDNA change - e.g. c.123A>G
  – Protein change - e.g. p.(V59M) or p.(Val59Met)
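The structure of a cDNA change such as c.123A>G can be pulled apart with a small pattern. This sketch handles only simple single-base substitutions at coding positions; the full HGVS nomenclature covers many more cases (deletions, insertions, intronic positions and so on), and the function name is made up for illustration.

```python
import re

# Matches descriptions like "c.123A>G": position, reference base, new base.
CDNA_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_cdna_substitution(description):
    """Return (position, ref_base, alt_base), or None if the pattern differs."""
    match = CDNA_SUBSTITUTION.match(description)
    if match is None:
        return None
    position, ref_base, alt_base = match.groups()
    return int(position), ref_base, alt_base


print(parse_cdna_substitution("c.123A>G"))
print(parse_cdna_substitution("c.123del"))
```

Anything the pattern does not recognise comes back as `None` rather than being guessed at, which is the safer behaviour when variant names feed into a report.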
Variant types
The sun was hot but the man did not get his hat.
• SNV – a change to a single base pair
  The sun wos hot but the man did not get his cat.
  The sun was .ot but the man did not get his hat.
• Small insertion/deletion (InDel) – in frame
  The sun hot but the man did not get his hat.
  The sun was too hot but the man did not get his hat.
• Small insertion/deletion (InDel) – frameshift
  The sun wah otb utt hem and idn otg eth ish at
  The sun wwa sho tbu tth ema ndi dno tge thi sha t
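The sentence analogy can be reproduced in a few lines: treat the sentence as a run of letters read three at a time, the way codons are read, then delete or insert letters. This is only a sketch of the analogy, not of real translation, and the variable names are made up for illustration.

```python
def read_in_triplets(letters, size=3):
    """Split a run of letters into consecutive words of `size` letters."""
    return " ".join(letters[i:i + size] for i in range(0, len(letters), size))


SENTENCE = "The sun was hot but the man did not get his hat"
letters = SENTENCE.lower().replace(" ", "")

# In frame: remove a whole three-letter word ("was"), the rest still reads.
in_frame_deletion = letters[:6] + letters[9:]

# Frameshift: remove one letter (the "s" of "was"), everything after shifts.
frameshift_deletion = letters[:8] + letters[9:]

# Frameshift: insert one extra letter ("w") before "was".
frameshift_insertion = letters[:6] + "w" + letters[6:]

for label, sequence in [
    ("original", letters),
    ("in-frame deletion", in_frame_deletion),
    ("frameshift deletion", frameshift_deletion),
    ("frameshift insertion", frameshift_insertion),
]:
    print(f"{label:22s}{read_in_triplets(sequence)}")
```

Changing the length by a multiple of three leaves the downstream words intact, while any other change in length scrambles every word after the edit.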
Variant types
• A variant is pathogenic if it interferes with normal protein production.
  – There are many ways that this can happen!
    • Stop codon
    • Change amino acid
    • Change splice site, add intron
    • Change splice site, remove exon
    • New stop codon
    • Frameshift, causing stop codon later
    • Regulatory region
Variant pathogenicity
• Frameshift and stop gain (nonsense substitution) variants are highly likely to be pathogenic.
• Splicing variants are likely to be pathogenic, but need checking with a splicing predictor.
• Missense variants can be pathogenic, and there are in silico tools to predict the effect. The effect depends on which amino acid is replaced and what it is replaced with.
• Synonymous substitutions are very unlikely to be pathogenic unless they affect splicing.
Variant prioritisation
• ~30,000 variants
• Exclude common variants
• Identify potential pathogenic mutation(s)
• Causal mutation(s)
Variant prioritisation
• We can pull information in from a variety of external sources, including:
  – “Population” databases, e.g. ExAC and dbSNP
    • These provide an approximation of the variants that are common in the population and may be excluded from consideration
  – Disease databases, e.g. HGMD
    • These provide a list of the known disease-causing mutations seen in a variety of settings and may be a flag for prioritisation
  – In silico analysis packages, e.g. SIFT, PolyPhen
  – Phenotypic terms provided by the clinician using HPO
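The filtering idea can be sketched as follows: drop variants that are common in a population database, then put variants already listed in a disease database at the top. Everything in this example is hypothetical, including the variant records, the field names and the 1% frequency cut-off; real pipelines query the actual databases and apply locally validated thresholds.

```python
# Hypothetical annotated variants; values are invented for illustration.
variants = [
    {"id": "var1", "pop_af": 0.15, "in_disease_db": False},
    {"id": "var2", "pop_af": 0.002, "in_disease_db": False},
    {"id": "var3", "pop_af": 0.0001, "in_disease_db": True},
    {"id": "var4", "pop_af": 0.0, "in_disease_db": False},
]


def prioritise(variants, max_pop_af=0.01):
    """Exclude common variants, then rank known disease variants first."""
    rare = [v for v in variants if v["pop_af"] <= max_pop_af]
    # Sort key: disease-database hits first, then rarest first.
    return sorted(rare, key=lambda v: (not v["in_disease_db"], v["pop_af"]))


shortlist = prioritise(variants)
print([v["id"] for v in shortlist])
```

A frequency filter like this removes most of the starting set in one step, leaving a shortlist small enough to assess against in silico predictions and the patient's phenotype.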
Variant annotation and filtering
Questions?