Introduction to Next Generation Sequencing (NGS)

Andrew Parrish, Exeter, 2nd November 2017

Topics to cover today

• What is Next Generation Sequencing (NGS)?

• Why do we need NGS?

• Common approaches to NGS

• NGS Workflow

What is Next Generation Sequencing? (NGS)

• Historically we have used Sanger sequencing to investigate genetic diseases

• This looks at one stretch of DNA from one patient at a time (~600 base pairs in length)

• Determines the order of bases by measuring the fluorescence given off when dye-labelled nucleotides are excited by a laser

What is Next Generation Sequencing? (NGS)

• NGS is also referred to as high-throughput sequencing or massively parallel sequencing

• Generates hundreds of millions of overlapping short sequences (up to 300bp) in a single run

• These have to be computationally put back together

• Can look at multiple patients in one run

Why do we need Next Generation Sequencing? (NGS)

• The Human Genome Project took around 13 years (1990 to 2003) to complete using Sanger-based technology, at an estimated cost of $3 billion

• Today, using NGS, this could be completed in a day or two for under $1000

Common approaches to NGS

• Targeted panels (tNGS)

– Pull out specific genes from the patient’s DNA and only obtain the sequence data from these genes (up to about 150 genes)

• “Rare disease”/“Mendeliome”/“Clinical” exome

– Essentially a very large (6,110 genes) panel that looks at the exons of genes known to cause human disease (at the time of design!)

• “Whole” exome

– Looks at the exons of 23,244 expressed genes, which together make up 1-2% of the human genome

• Genome sequencing

– Looks at the complete (ish) DNA sequence from a patient

Common approaches to NGS

• Single gene disease

– Easily clinically recognisable disease

– Single genetic aetiology (mutations in one gene cause this disorder)

– Existing tests widely available in diagnostic laboratories

• Small number of genes for a disease

– Clinically recognisable disease

– Multiple sub-types caused by mutations in different genes

– Highly developed clinical expertise and knowledge available in specialist centres

• Large number of possible causes (or no known cause)

– Strong suggestion of monogenic disease, but no clear clue to which gene to test

Workflow for NGS

Patient

Extract DNA, prepare library and sequence

Raw reads (FASTQ)

Assess quality and process reads

Processed reads (FASTQ)

Map to reference genome

Aligned reads (SAM/BAM)

Call variants (VCF)

Variant and sample quality control

Annotate variants

Assess depth and breadth of coverage

Filter and prioritise variants

Integrate with clinical data

Shortlist of disease related variants

Visualise data

Diagnostic report
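The "assess quality" step at the start of the workflow can be illustrated with a minimal sketch. It assumes a simple four-line-per-record FASTQ layout and the common Phred+33 quality encoding; the function names, the example reads and the mean-quality threshold of 20 are illustrative assumptions, not part of the original slides or of any particular pipeline.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Hypothetical example data: one high-quality and one low-quality read.
EXAMPLE = """@read1
GATTACA
+
IIIIIII
@read2
GATTACA
+
!!!!!!!
"""

records = list(parse_fastq(EXAMPLE))
# Keep reads whose mean quality is at least 20 (an arbitrary illustrative cut-off).
kept = [record.name for record in records if mean_phred(record.quality) >= 20]
```

In practice this step is done with dedicated quality control and trimming tools; the sketch only shows what "quality" means at the level of a single read.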

Workflow for NGS

Patient

Extract DNA, prepare library and sequence

Quality control

Map to reference genome

Call variants

Annotate variants

Shortlist of disease related variants

Visualise data

Diagnostic report

DNA extraction and library preparation

Genomic DNA

Fragment

Target

Attach adaptors for paired-end sequencing

Sequencing

Read mapping

.. TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGGTGTGTGACCAGGGAGGTCCC .. Reference Genome

ATCTTGTAGGGAAACACAAAGTG GTCTAGGGAAGAAGG

• After base calling, align/map sequences onto reference genome

• Determine coordinates (chromosomal position) and add basic annotations (coding, non-coding, etc.) if known
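The idea of mapping a read onto the reference can be sketched with a deliberately naive search: slide the read along the reference and keep the position with the fewest mismatches. The reference string is the one shown on the slide; the function names and the mismatch limit are illustrative assumptions. Real aligners use indexed data structures and handle insertions and deletions, which this sketch does not.

```python
from typing import Optional, Tuple

# Reference fragment as shown on the slide.
REFERENCE = (
    "TAGTACCCCATCTTGTAGGTCTGAAACACAAAGTGTGGGGTGTCTAGGGAAGAAGG"
    "TGTGTGACCAGGGAGGTCCC"
)


def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str,
             max_mismatches: int = 1) -> Optional[Tuple[int, int]]:
    """Return (0-based start, mismatches) of the best ungapped placement.

    Returns None if no placement has at most max_mismatches mismatches.
    Ties are resolved in favour of the leftmost position.
    """
    best: Optional[Tuple[int, int]] = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = count_mismatches(read, window)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


# One of the reads from the slide maps without mismatches.
exact_hit = map_read("GTCTAGGGAAGAAGG", REFERENCE)

# The same read with one altered base (hypothetical) still maps,
# and the mismatch is what a variant caller would later inspect.
variant_hit = map_read("GTCTAGGCAAGAAGG", REFERENCE)
```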

Coverage

• Vertical coverage – how many times a particular base has been sequenced (e.g. 20X, 30X etc.)

– Greater depth of coverage means improved accuracy for variant detection (but is more expensive)

• Horizontal coverage – how much of the genome has been sequenced

– Greater target size means more genome is sequenced (but is more expensive)
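Depth (vertical) and breadth (horizontal) of coverage can both be computed from the positions of aligned reads. The sketch below assumes reads are given as half-open (start, end) intervals on a single target region; the function names and the example intervals are hypothetical.

```python
from typing import List, Tuple


def depth_per_base(reads: List[Tuple[int, int]],
                   target_start: int, target_end: int) -> List[int]:
    """Count how many reads cover each base of the target region.

    Reads and the target are half-open intervals: start included, end excluded.
    """
    depth = [0] * (target_end - target_start)
    for read_start, read_end in reads:
        # Clip each read to the target so off-target bases are ignored.
        lo = max(read_start, target_start)
        hi = min(read_end, target_end)
        for position in range(lo, hi):
            depth[position - target_start] += 1
    return depth


def breadth_of_coverage(depth: List[int], min_depth: int = 1) -> float:
    """Fraction of target bases covered by at least min_depth reads."""
    if not depth:
        return 0.0
    return sum(d >= min_depth for d in depth) / len(depth)


# Three hypothetical reads over a 10-base target.
example_reads = [(0, 5), (3, 8), (4, 10)]
example_depth = depth_per_base(example_reads, 0, 10)
mean_depth = sum(example_depth) / len(example_depth)
breadth_at_2x = breadth_of_coverage(example_depth, min_depth=2)
```

Here the mean depth is the "vertical" figure usually quoted as 20X or 30X, and the breadth at a chosen minimum depth is the "horizontal" figure.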

Variant calling

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline
chr4 27668 . T C 8.65 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr4 27669 . G T 4.77 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr4 27712 . T C 44 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8
chr4 27774 . G A 5.47 . DP=2;AF1=0.5011;AC1=2;… GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28
chr4 36523 . A T 10.4 . DP=1;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
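To show how the columns of a VCF record fit together, here is a minimal sketch that splits one data line into its fixed fields, the INFO key/value pairs and the per-sample FORMAT values. It assumes tab-separated columns and a single alternate allele; the function name is hypothetical, and the INFO field in the example is shortened because the slide elides the rest of it. Dedicated VCF libraries should be preferred for real data.

```python
from typing import Any, Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Parse one tab-separated VCF data line into a nested dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filter_, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict: Dict[str, Any] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, separator, value = item.partition("=")
        info_dict[key] = value if separator else True

    # FORMAT names the colon-separated values given for each sample column.
    samples: Dict[str, Dict[str, str]] = {}
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(format_keys, column.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": filter_,
        "info": info_dict,
        "samples": samples,
    }


# Fourth record from the slide, with only the INFO entries that are shown.
EXAMPLE_LINE = (
    "chr4\t27774\t.\tG\tA\t5.47\t.\tDP=2;AF1=0.5011;AC1=2\t"
    "GT:PL:DP:SP:GQ\t0/1:34,0,23:2:0:28"
)
record = parse_vcf_line(EXAMPLE_LINE, ["germline"])
```

The genotype `0/1` in the `germline` column means one reference and one alternate allele, i.e. a heterozygous call.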

Variants

• A variant is a DNA sequence that is different to the “normal” sequence for a particular species.

• These should be named according to standardised nomenclature (HGVS)

• This allows consistent reporting and must include:

– Reference sequence - e.g. NM_0000123.4

– cDNA change - e.g. c.123A>G

– Protein change - e.g. p.(V59M) or p.(Val59Met)
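As an illustration of why a standard nomenclature helps, a description in this form can be checked and split into its parts by a program. The sketch below handles only simple cDNA substitutions such as c.123A>G, optionally prefixed by a reference sequence and a colon; the pattern and names are illustrative assumptions and cover only a small fraction of the full HGVS recommendations.

```python
import re
from typing import NamedTuple, Optional


class CdnaSubstitution(NamedTuple):
    reference: Optional[str]  # e.g. "NM_0000123.4", or None if not given
    position: int             # cDNA position
    ref_base: str
    alt_base: str


# Simplified pattern: optional "ACCESSION.VERSION:" then "c.<pos><ref>><alt>".
_SUBSTITUTION = re.compile(
    r"^(?:(?P<reference>[A-Z]{2}_\d+\.\d+):)?"
    r"c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_cdna_substitution(text: str) -> Optional[CdnaSubstitution]:
    """Parse a simple cDNA substitution; return None for anything else."""
    match = _SUBSTITUTION.match(text.strip())
    if match is None:
        return None
    return CdnaSubstitution(
        reference=match.group("reference"),
        position=int(match.group("position")),
        ref_base=match.group("ref"),
        alt_base=match.group("alt"),
    )


bare = parse_cdna_substitution("c.123A>G")
with_reference = parse_cdna_substitution("NM_0000123.4:c.123A>G")
unsupported = parse_cdna_substitution("c.123del")  # not a substitution
```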

Variants

http://varnomen.hgvs.org/

Variant types

The sun was hot but the man did not get his hat.

• SNV – a change to a single base pair

The sun wos hot but the man did not get his cat.

The sun was .ot but the man did not get his hat.

• Small insertion/deletion (InDel) – in frame

The sun hot but the man did not get his hat.

The sun was too hot but the man did not get his hat.

• Small insertion/deletion (InDel) – frameshift

The sun wah otb utt hem and idn otg eth ish at

The sun wwa sho tbu tth ema ndi dno tge thi sha t
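The sentence analogy above can be reproduced in a few lines: treat the sentence as a run of letters read three at a time, like codons, then delete or insert letters and read it again. The helper names are hypothetical; the sentence and the edits are the ones used on the slide.

```python
def read_in_frame(letters: str, frame: int = 3) -> str:
    """Split a run of letters into consecutive groups of `frame` letters."""
    return " ".join(letters[i:i + frame] for i in range(0, len(letters), frame))


def delete(letters: str, start: int, length: int = 1) -> str:
    """Remove `length` letters beginning at 0-based index `start`."""
    return letters[:start] + letters[start + length:]


def insert(letters: str, position: int, new: str) -> str:
    """Insert `new` before the letter at 0-based index `position`."""
    return letters[:position] + new + letters[position:]


SENTENCE = "The sun was hot but the man did not get his hat"
LETTERS = SENTENCE.replace(" ", "")

# Deleting a whole three-letter word keeps the reading frame intact.
in_frame_deletion = read_in_frame(delete(LETTERS, 6, 3))

# Deleting or inserting a single letter shifts every later "codon".
frameshift_deletion = read_in_frame(delete(LETTERS, 8))
frameshift_insertion = read_in_frame(insert(LETTERS, 7, "w"))
```

Only changes whose length is a multiple of three leave the downstream words readable, which is why frameshift InDels are usually far more disruptive than in-frame ones.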

Variant types

• A variant is pathogenic if it interferes with normal protein production.

– There are many ways that this can happen!

Stop codon

Change amino acid

Change splice site, add intron

Change splice site, remove exon

New stop codon

Frameshift, causing stop codon later

Regulatory region

Variant pathogenicity

• Frameshift and stop gain (nonsense substitution) variants are highly likely to be pathogenic.

• Splicing variants are likely to be pathogenic, but need checking with a splicing predictor.

• Missense variants can be pathogenic, and there are in-silico tools to predict the effect. The effect depends on how the amino acids are changed.

• Synonymous substitutions are very unlikely to be pathogenic unless they affect splicing.
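The difference between synonymous, missense and stop gain substitutions comes down to what the changed codon encodes. The sketch below classifies a single-codon change using a deliberately partial codon table (only the handful of codons needed for the examples); the function name and example codons are illustrative assumptions, and a real tool would use the full genetic code and transcript context.

```python
# Partial standard genetic code: just enough codons for the examples below.
CODON_TABLE = {
    "GTG": "Val", "GTA": "Val",
    "ATG": "Met",
    "TGG": "Trp",
    "TGC": "Cys", "TGT": "Cys",
    "TGA": "Stop", "TAA": "Stop", "TAG": "Stop",
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid.

    Raises KeyError for codons that are not in the partial table above.
    """
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "Stop":
        return "stop gain"
    if ref_aa == "Stop":
        return "stop loss"
    return "missense"


valine_to_methionine = classify_substitution("GTG", "ATG")  # Val to Met
new_stop = classify_substitution("TGG", "TGA")              # Trp to stop
silent_change = classify_substitution("GTG", "GTA")         # Val to Val
```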

Variant prioritisation

~30,000 variants

Exclude common variants

Identify potential pathogenic mutation(s)

Causal mutation(s)

Variant prioritisation

• We can pull information in from a variety of external sources, including:

– “Population” databases, e.g. ExAC and dbSNP

• These provide an approximation of the variants that are common in the population and may be excluded from consideration

– Disease databases, e.g. HGMD

• These provide a list of the known disease causing mutations seen in a variety of settings and may be a flag for prioritisation

– In silico analysis packages, e.g. SIFT, PolyPhen

– Phenotypic terms provided by clinician using HPO
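The filtering logic described above can be sketched as follows: drop variants that are common in a population database, then rank what remains, giving extra weight to variants already listed in a disease database or with a damaging predicted consequence. Everything here is hypothetical: the data, the 1% frequency cut-off, the scoring weights and the names are illustrative assumptions, not the thresholds or databases used in any real pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)

# Consequence types treated as likely damaging for this illustration.
DAMAGING = {"frameshift", "stop gain", "splice"}


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    consequence: str

    @property
    def key(self) -> VariantKey:
        return (self.chrom, self.pos, self.ref, self.alt)


def prioritise(variants: List[Variant],
               population_af: Dict[VariantKey, float],
               known_disease: Set[VariantKey],
               max_af: float = 0.01) -> List[Variant]:
    """Exclude common variants, then rank the rest by a simple score."""
    scored = []
    for variant in variants:
        # Variants absent from the population table are treated as rare.
        if population_af.get(variant.key, 0.0) > max_af:
            continue
        score = 0
        if variant.key in known_disease:
            score += 2
        if variant.consequence in DAMAGING:
            score += 1
        scored.append((score, variant))
    # sorted() is stable, so equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [variant for _, variant in scored]


# Hypothetical variants and lookup tables.
example_variants = [
    Variant("chr1", 100, "A", "G", "missense"),
    Variant("chr2", 200, "C", "T", "stop gain"),
    Variant("chr3", 300, "G", "A", "synonymous"),
    Variant("chr4", 400, "T", "C", "missense"),
]
example_af = {("chr1", 100, "A", "G"): 0.25}   # common, so excluded
example_disease = {("chr4", 400, "T", "C")}    # previously reported
shortlist = prioritise(example_variants, example_af, example_disease)
```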

Variant annotation and filtering

http://exac.broadinstitute.org/

https://www.ncbi.nlm.nih.gov/snp/

http://www.hgmd.cf.ac.uk/ac/index.php

http://sift.jcvi.org/www/SIFT_aligned_seqs_submit.html

http://genetics.bwh.harvard.edu/pph2/

http://compbio.charite.de/hpoweb/showterm?id=HP:0000118

Questions?
