2015 bioc4010 lecture1and2

Upload: dan-gaston

Post on 15-Jul-2015

Next-Generation Sequence Analysis for Biomedical Applications

BIOC 4010/5010

Lecture 1

Dr. Dan Gaston

Postdoctoral Fellow Department of Pathology

Dr. Karen Bedard Lab

LECTURE 1

Introduction to Next-Gen Sequencing

Overview: Lecture 1

• Why Next-Gen Sequencing Matters

• What is Next-Gen Sequencing

• Bioinformatics Workflows

• Types of Next-Gen Experiments

• Working with the Human Genome

• Slides available on slideshare:

– http://www.slideshare.net/DanGaston

Personalized Medicine

Major Areas in Human Disease Genomics

• Complex Disease

– Genome Wide Association Studies (GWAS)

• Mendelian Disease

– Whole Genome/Exome Sequencing

– Transcriptomics

– Genetic Linkage

– Sanger Sequencing

• Cancer

– Tumour Genomics

– Transcriptomics

Traditional Diagnosis of Genetic Disease

• Genetic Counselors/Physicians order individual testing of genes based on patient phenotype

• For rare diseases or unusual phenotypes may run tens to hundreds of tests

• …..EXPENSIVE (Easily thousands of dollars)

Next Generation Diagnosis of Genetic Disease

• NGS-Based Targeted Sequencing Panels

• Clinical Exome

• Clinical Genome

Genetic Disease Research: The Slow and Traditional Way in the Dark Ages (circa 2009)

Genetic Disease Research: Charcot-Marie-Tooth

Chromosome 9:120,962,282-133,033,431

Charcot-Marie-Tooth

• Linked Genomic Region ~13 Mb in size

• Contains 143 Genes

• Prioritize and select genes for individual Sanger sequencing

• …Slow

• …Laborious

• …Can be expensive

Human Genomics: More Power!

• $5,000 - $10,000 to sequence whole genome

– Dropping towards $1000 for sequencing only

• ~$1000 to sequence only protein-coding portion (exome, later)

Clinical Genomics

• Rapid diagnosis of genetic disease in NICU cases

• Quicker and cheaper than sequential genetic testing (traditional method)

Personalized Medicine: Oncology

[Figure: DNA from a tumour sample and from a non-tumour sample is sequenced; tumour-specific mutations are combined with databases and annotations to guide tumour classification and choice of drugs]

Personalized Medicine: Oncology

Welch JS, et al. JAMA, 2011;305, 1577

Personalized Medicine: Monitoring For Cancer Chemotherapy Resistance

Composition of Human Genome

Size: 3.2 Gb

Genomic Content

Chromosome  Base pairs   Variations  Confirmed proteins  Putative proteins  Pseudogenes  miRNA  rRNA  Misc ncRNA
1           249,250,621  4,401,091   2,012               31                 1,130        134    66    106
2           243,199,373  4,607,702   1,203               50                 948          115    40    93
3           198,022,430  3,894,345   1,040               25                 719          99     29    77
4           191,154,276  3,673,892   718                 39                 698          92     24    71
5           180,915,260  3,436,667   849                 24                 676          83     25    68
6           171,115,067  3,360,890   1,002               39                 731          81     26    67
7           159,138,663  3,045,992   866                 34                 803          90     24    70
8           146,364,022  2,890,692   659                 39                 568          80     28    42
9           141,213,431  2,581,827   785                 15                 714          69     19    55
10          135,534,747  2,609,802   745                 18                 500          64     32    56
11          135,006,516  2,607,254   1,258               48                 775          63     24    53
12          133,851,895  2,482,194   1,003               47                 582          72     27    69
13          115,169,878  1,814,242   318                 8                  323          42     16    36
14          107,349,540  1,712,799   601                 50                 472          92     10    46
15          102,531,392  1,577,346   562                 43                 473          78     13    39
16          90,354,753   1,747,136   805                 65                 429          52     32    34
17          81,195,210   1,491,841   1,158               44                 300          61     15    46
18          78,077,248   1,448,602   268                 20                 59           32     13    25
19          59,128,983   1,171,356   1,399               26                 181          110    13    15
20          63,025,520   1,206,753   533                 13                 213          57     15    34
21          48,129,895   787,784     225                 8                  150          16     5     8
22          51,304,566   745,778     431                 21                 308          31     5     23
X           155,270,560  2,174,952   815                 23                 780          128    22    52
Y           59,373,566   286,812     45                  8                  327          15     7     2
mtDNA       16,569       929         13                  0                  0            0      2     22

Exome and Genome Sequencing

Short Reads

Millions of “short reads” 75-150bp each

Usually “paired”

FastQ Format

Read ID

Sequence

Quality line
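The three parts of a FastQ record listed above can be read with a few lines of Python. This is a minimal sketch, assuming the common four-line layout (ID line starting with "@", sequence, a "+" separator line, quality line) with no line wrapping; the names FastqRecord and read_fastq are illustrative, not taken from the lecture.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped, four-lines-per-read FastQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line carries no data we need
        quality = handle.readline().rstrip()
        # Each base must have exactly one quality character.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example read, for illustration only.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
records = list(read_fastq(example))
```

For real data an established parser is the safer choice; the sketch is only meant to make the record layout concrete.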

FastQ Quality Scores

Quality Score (Q)   Probability of incorrect base call   Base call accuracy
10                  1 in 10                              90%
20                  1 in 100                             99%
30                  1 in 1,000                           99.9%
40                  1 in 10,000                          99.99%
50                  1 in 100,000                         99.999%

Q = -10 log10(P), where P is the probability of an incorrect base call
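The relationship between Q and P can be checked numerically. The sketch below converts in both directions and decodes a quality line into scores; the ASCII offset of 33 is an assumption (it is the common Sanger / recent Illumina encoding, but older files may use 64), and the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list:
    """Turn each quality character into its integer Phred score (offset assumed)."""
    return [ord(char) - offset for char in quality]
```

For example, decode_quality_string("I5") gives the scores 40 and 20 under the offset-33 assumption, which correspond to error probabilities of 1 in 10,000 and 1 in 100.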

Quality Scores of Sequencing Reads

General Genomics Workflow

• Raw Data Analysis: quality control of raw data

• Whole Genome Mapping: alignment to reference genome

• Variant Calling: detection of genetic variation (SNPs, Indels, SV)

• Annotation: linking variants to biological information

Find the Location of Each Read in the Genome

• Problems:

– Short sequence

– Millions of short sequences

– Big genome

– Mismatches

• Polymorphisms

• Sequencing errors

– Insertions and deletions

– May be processing many (100’s) of individuals

Short Read Mapping

…CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…

[Figure: short reads tiled beneath the reference sequence above, each placed at the position where it matches best; a few reads contain mismatches]

1) Report location of genome where read matches best

2) Minimize mismatches

3) Mismatches with lower quality bases better than mismatches with higher quality bases

Short Read Mapping: Brute Force Method (Stupid)

Simple conceptually: Compare each query k-mer to all k-mers of genome

Scales with size of the genome and the reads (Not particularly well)

Genome = AGCATGCTGCAGTCATGCTTAGGCTA

Read = GCT
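The brute-force comparison can be written directly. This is a minimal sketch using the genome and read from the slide; it reports 0-based start positions and looks for exact matches only (no mismatches or indels), and the function name is illustrative.

```python
def brute_force_matches(genome: str, read: str) -> list:
    """Compare the read against every k-mer of the genome, left to right."""
    k = len(read)
    return [
        start
        for start in range(len(genome) - k + 1)
        if genome[start:start + k] == read
    ]


genome = "AGCATGCTGCAGTCATGCTTAGGCTA"
read = "GCT"
positions = brute_force_matches(genome, read)
```

Every read is compared against every genome position, so the work grows with genome length times the number (and length) of reads, which is why this approach does not scale to millions of reads against a 3.2 Gb genome.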

Solution

Index the Reference Genome

Indexing the reference is like constructing a phone book: it lets you move quickly to the relevant portion of the genome and ignore the rest.

Suffix Array

Split genome into all suffixes (substrings) and sort alphabetically

Allows query to be searched against an alphabetical reference, skipping 96% of the genome

Ex: banana$

Suffixes   Sorted
banana$    $
anana$     a$
nana$      ana$
ana$       anana$
na$        banana$
a$         na$
$          nana$
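The banana$ example can be reproduced in a couple of lines. This is a deliberately naive sketch: it sorts the suffixes directly, which is fine for a toy string but far too slow and memory-hungry for a real genome. Positions are 0-based, and "$" sorts before the letters because of its ASCII value.

```python
def suffix_array(text: str) -> list:
    """Return the start positions of all suffixes, in sorted-suffix order."""
    return sorted(range(len(text)), key=lambda start: text[start:])


sa = suffix_array("banana$")
sorted_suffixes = ["banana$"[start:] for start in sa]
```

Only the start positions need to be stored; each suffix can be recovered from the original string when it is needed.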

Short Read Alignment: Binary Search

• Searching the index efficiently is still a problem…

Index # Sequence Pos

1 ACAGATTACC… 6

2 ACC… 13

3 AGATTACC… 8

4 ATTACAGATTACC… 3

5 ATTACC… 10

6 C… 15

7 CAGATTACC… 7

8 CC… 14

9 GATTACAGATTACC… 2

10 GATTACC… 9

11 TACAGATTACC… 5

12 TACC… 12

13 TGATTACAGATTACC… 1

14 TTACAGATTACC… 4

15 TTACC… 11

Search for GATTACA…

Binary Search

• Initialize search range to entire list

– mid = (hi+lo)/2; middle = suffix[mid]

– if query matches middle: done

– else if query < middle: pick low range

– else if query > middle: pick hi range

• Repeat until done or empty range
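The steps above can be sketched as follows, using the TGATTACAGATTACC example from the table. This is a minimal illustration, assuming exact matching and 0-based positions (the table on the slide is 1-based); the naive suffix array construction and the function names are for illustration only.

```python
def build_suffix_array(text: str) -> list:
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def find_occurrence(text: str, suffix_array: list, query: str):
    """Binary search for one suffix that starts with the query.

    Returns a 0-based position in the text, or None if the query is absent.
    """
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:  # stop when the search range is empty
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        middle = text[start:start + len(query)]  # compare only a query-length prefix
        if query == middle:
            return start
        if query < middle:
            hi = mid - 1  # pick the low range
        else:
            lo = mid + 1  # pick the high range
    return None


text = "TGATTACAGATTACC"
sa = build_suffix_array(text)
hit = find_occurrence(text, sa, "GATTACA")
```

Each step halves the remaining range, so the 15 suffixes here need at most four comparisons, and the number of comparisons grows only logarithmically with genome size.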

Applied to Human Genome

• In practice simple methods of indexing the genome can create very large data structures

– Suffix Array: > 12 GB

• Solution: Apply complex procedures that allow you to index and compress the data:

– Burrows-Wheeler Transform

– FM-Index

Burrows-Wheeler Transform

• Similar in many ways to creation of Suffix Array

Input string: BANANA$

Circular Permutation: BANANA$, ANANA$B, NANA$BA, ANA$BAN, NA$BANA, A$BANAN, $BANANA

Lexicographical Sort (Burrows-Wheeler Matrix), with the Suffix Array position of each row:

$BANANA   6
A$BANAN   5
ANA$BAN   3
ANANA$B   1
BANANA$   0
NA$BANA   4
NANA$BA   2

T(string) = ANNB$AA (the last column of the sorted matrix)

Transformed String: Compressible and Reversible

FM-Index = T(string) + Suffix Array (6, 5, 3, 1, 0, 4, 2) + Character Count Tables
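The BANANA$ example can be reproduced with a naive implementation. This sketch builds and sorts every rotation explicitly, which is only practical for toy strings; real aligners derive the transform from a suffix array and never materialize the matrix. The inverse shown is the simple repeated-sort method, included only to illustrate that the transform is reversible.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(transformed: str) -> str:
    """Rebuild the original string by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    # The original string is the rotation that ends with the '$' terminator.
    return next(row for row in table if row.endswith("$"))


transformed = bwt("BANANA$")
restored = inverse_bwt(transformed)
```

Note how the transformed string groups identical characters together (NN, AA), which is what makes it compress well.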

Short Read Aligners

• BLAT: BLAST-Like Alignment Tool

• MAQ: First to take into account quality scores

• Bowtie: One of the first to use BWT, ungapped alignment only

• BWA: One of the first to use BWT. First gapped BWT, incredibly fast and memory efficient

• Bowtie2: Allows indels

• SOAP, SOAP2: Also use BWT

• … and many more

Next-Gen Sequencing Experiments

• Whole Genome Sequencing

• Targeted Exome Sequencing

• RNA-Seq

• ChIP-Seq

• CLIP-Seq

Exome Sequencing

Transcriptomics: RNA-Seq

• Sequence the actively transcribed genes in a cell line or tissue

– Only about 20% of genes are transcribed in particular cell types

• Two types:

– Poly-A selection

– Total RNA + ribodepletion

• Many experimental questions can be addressed

RNA-Seq: Gene Expression

Condition 1

Condition 2

RNA-Seq: Differential Splicing

Exon1 Exon 2 Exon 3

RNA-Seq: Novel/Non-Canonical Exon Discovery

Exon1 Exon 2 Exon 3Exon X

RNA-Seq: Gene Fusion Events

Exon1 Exon 2 Exon 3

Gene 2 Exon 4

RNA-Seq

• Important to take into account biological variability. A sample of cells is a mixed population

– Replicates!

• Not well suited for discovering polymorphisms due to higher error rates introduced by the reverse transcription step (RNA -> cDNA)

• High false positive rates for fusion gene and novel exon discovery when expression levels are low

ChIP-Seq

LECTURE 2

Identifying and Annotating Genomic Variation for Disease Gene Discovery

Overview/Objectives

• Genetic Variation

– Types

• Identifying Genetic Variation

– Methods

• Annotation of Genes and Variants

– Methods

– Sources

• Gene/Variant Prioritization

– Methods

Mapping Alone is Insufficient

Need Information on Variation

Why Identify Variants?

Types of Genetic Variation

Genetic Variation

• dbSNP (NCBI) build 142

– Catalogs Single Nucleotide Variants (SNV)

– 365 Million Submitted

– 113 Million Validated

– 54 Million in Genes

– 36 Million With Frequency in Populations

• 50-80% of mutations involved in inherited disease are caused by SNVs

– May be an overestimate due to lack of knowledge

SNP vs SNV

• Technically a polymorphism is a variation that doesn’t cause disease and is common in a population

• What is common?

– Greater than 5% in a population is a typical definition

– Definition for rare ranges from < 0.1% to < 1.0%

FREQUENCY OF GENETIC VARIANTS

Studies and Populations

Frequency of Polymorphisms: Common vs Rare

• Mendelian disorders are caused by rare variation, < 1% frequency in the relevant population

• Leverage large projects aimed at assessing genetic diversity in populations around the world

1000 Genomes Project

Exome Sequencing Project

• Multi-Institutional

• Total possible patient pool of > 250,000 individuals, well phenotyped

– Includes healthy individuals and diseased

• Currently 6700 exomes sequenced

– 4420 European descent

– 2312 African American

• 1.2 million coding variations

– Most extremely rare/unique

– Many population specific

Other Resources and Projects

• Exome Aggregation Consortium: 60,000 Exomes

• Personal Genome Project (Ongoing)

• 100,000 Genomes Project (UK, Ongoing)

• BGI (Announced, China): 1 Million Genomes

• Precision Medicine Initiative (US, Announced): 1 Million Genomes

Population Matters

• Most variations in protein-coding genes occurred fairly recently (last 20,000 years)

– Adaptation to agriculture and diet changes, pathogen exposure and urban living

Human Populations

Population Matters

• Monogenic diseases have different prevalence in different populations

– Cystic fibrosis in European population

– Hereditary Hemochromatosis in Northern Europeans

– Tay-Sachs in Ashkenazi Jews

– Sickle-Cell Anemia in Sub-Saharan African populations

DISCOVERING GENETIC VARIATION

Finding the Needles

Finding All Needles

[Figure: short reads aligned beneath the reference genome ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG; a column where reads disagree with the reference is labelled SNPs, and gaps within reads are labelled INDELs]

All regions with mismatches are potential variants

Genotype Calling: Determining the Type of Needle, The Absurdly Simple Way (Stupid)

[Figure: reads aligned beneath the reference genome; at one position the reference base is A and some reads carry T instead]

Read depth at base: 10 T: 4 A: 6

Genotype: Heterozygous A/T

Genotype Calling: The Absurdly Simple Way (Stupid)

• Doesn’t account for sequencing error

• Doesn’t account for sequencing bias

• Doesn’t account for bias in short-read mapping process

• Doesn’t account for mapping error

• Doesn’t consider any external source of information regarding populations or known genetic variations

Genotype Calling: The Absurdly Simple Way (Slightly less Stupid)

• Algorithm:

– Count all aligned bases that pass quality threshold (e.g. >Q20)

– If #reads with alternative base > lower bound (20%) and < upper bound (80%) call heterozygous alt

– Else if > upper bound call homozygous alternative

– Else call homozygous reference

• …But what about using base qualities for more than deciding which reads to keep?
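The threshold algorithm above can be sketched in a few lines. The thresholds (Q20, 20%, 80%) come from the slide; the function name, the "no call" outcome for positions with no usable bases, and treating a fraction exactly at the upper bound as homozygous alternative are assumptions made for this illustration.

```python
def call_genotype(bases, quals, alt, min_qual=20, lower=0.2, upper=0.8):
    """Call a genotype at one position from aligned bases and their qualities."""
    # Count only aligned bases that pass the quality threshold (> Q20).
    kept = [base for base, qual in zip(bases, quals) if qual > min_qual]
    if not kept:
        return "no call"  # assumption: the slide does not cover this case
    alt_fraction = sum(base == alt for base in kept) / len(kept)
    if lower < alt_fraction < upper:
        return "heterozygous"
    if alt_fraction >= upper:
        return "homozygous alternative"
    return "homozygous reference"


# Pileup from the earlier slide: read depth 10, with A: 6 and T: 4.
pileup = "AAAAAATTTT"
genotype = call_genotype(pileup, [30] * 10, alt="T")
```

With 4 of 10 bases supporting T, the alternative fraction is 40% and the call is heterozygous, matching the A/T genotype on the earlier slide. Note that quality is only used as a pass/fail filter here.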

Improving Genotype Calling: Local Realignment

Duplicate Reads

Remove Duplicate Reads

What’s Missing

• No estimate of the confidence (stats) of variant and genotype calls

• Doesn’t account robustly for known sources of error

• Doesn’t make use of any sources of external information

• Doesn’t include base qualities

Improving Genotype Calling: Bayes Theorem

[Equation figure with three labelled terms: Prior Probability, Error Model, All Possible Genotypes]
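The three labelled terms map onto the usual form of Bayes' theorem for genotypes: the posterior for a genotype is its prior times the likelihood of the observed bases under an error model, divided by the same product summed over all possible genotypes. The sketch below is one hedged way to compute this; the error model (a miscalled base is equally likely to be any of the other three bases, and each allele of a diploid genotype is sampled with probability 1/2) and the example priors are assumptions, not the lecture's model.

```python
def genotype_posteriors(bases, quals, priors):
    """Posterior probability of each genotype given one pileup column.

    bases:  observed base calls at the position
    quals:  Phred quality score for each base call
    priors: {(allele1, allele2): prior probability}
    """
    unnormalized = {}
    for genotype, prior in priors.items():
        likelihood = 1.0
        for base, qual in zip(bases, quals):
            err = 10 ** (-qual / 10)  # error probability from the Phred score
            # Assumed error model: correct with 1 - err, else any other base at err / 3.
            per_allele = [(1 - err) if base == allele else err / 3 for allele in genotype]
            likelihood *= sum(per_allele) / 2
        unnormalized[genotype] = prior * likelihood
    total = sum(unnormalized.values())  # sum over all possible genotypes
    return {genotype: value / total for genotype, value in unnormalized.items()}


# Hypothetical priors for a reference-A site, restricted to alleles A and T.
example_priors = {("A", "A"): 0.9985, ("A", "T"): 0.001, ("T", "T"): 0.0005}
posteriors = genotype_posteriors("AAAAAATTTT", [30] * 10, example_priors)
best = max(posteriors, key=posteriors.get)
```

Even with a prior that strongly favours homozygous reference, four high-quality T calls out of ten make the heterozygous genotype overwhelmingly the most probable, and the posterior itself serves as a confidence estimate for the call.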

Improved Genotype Calling: Prior Probability

• Known Polymorphic Site?

– Allele Frequencies

• Global rate of polymorphisms

• Other samples

• Substitution Type

Substitution Type

• Transition:

– Purine to Purine (A to G)

– Pyrimidine to Pyrimidine (C to T)

• Transversion:

– Purine to Pyrimidine, or Pyrimidine to Purine

• Transition/Transversion ratio

– Transitions 2x as common (Genome Wide)

– 3x when looking only at exons

– Random Error: 0.5
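The "Random Error: 0.5" figure follows from counting: of the 12 possible single-base substitutions, 4 are transitions and 8 are transversions. A small sketch that classifies substitutions and computes the ratio (function names are illustrative; the ratio is undefined if the input contains no transversions):

```python
from itertools import permutations

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"  # stays within the same chemical class
    return "transversion"


def ti_tv_ratio(substitutions) -> float:
    """Transitions divided by transversions for (ref, alt) pairs."""
    kinds = [substitution_type(ref, alt) for ref, alt in substitutions]
    return kinds.count("transition") / kinds.count("transversion")


# Every possible substitution exactly once, as uniformly random errors would produce.
all_substitutions = list(permutations("ACGT", 2))
random_ratio = ti_tv_ratio(all_substitutions)
```

A call set whose ratio drifts from the expected biological value toward 0.5 is therefore likely to contain many random errors.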

Prior Probability Example

Assume:

• Heterozygous SNP Rate of 0.001

• Homozygous SNP Rate of 0.0005

• Reference: G

• Transition/Transversion Ratio: 2

Prior probability of each genotype (allele 1 in rows, allele 2 in columns):

     A             C             G             T
A    3.33 x 10^-4  1.11 x 10^-7  6.67 x 10^-4  1.11 x 10^-7
C                  8.33 x 10^-5  1.67 x 10^-4  2.78 x 10^-8
G                                0.9985        1.67 x 10^-4
T                                              8.33 x 10^-5
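The numbers in this example can be reproduced from the four assumptions. The sketch below is a reconstruction that happens to match the table, not necessarily how the slide was produced: it assumes that a Ti/Tv ratio of 2 means the single possible transition (G to A) is weighted 4 against 1 for each of the two transversions, and that a genotype with two different non-reference alleles is the product of two heterozygous events.

```python
BASES = "ACGT"
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}


def genotype_priors(ref, het_rate=0.001, hom_rate=0.0005, ti_tv=2.0):
    """Prior probability for each unordered genotype at a site with a known reference base."""
    # One transition and two transversions are possible from any base, so a
    # Ti/Tv ratio r gives the transition weight 2r against 1 per transversion.
    weights = {
        base: (2 * ti_tv if base == TRANSITION_PARTNER[ref] else 1.0)
        for base in BASES
        if base != ref
    }
    total = sum(weights.values())
    alt_share = {base: weight / total for base, weight in weights.items()}

    priors = {(ref, ref): 1.0 - het_rate - hom_rate}
    for alt, share in alt_share.items():
        priors[tuple(sorted((ref, alt)))] = het_rate * share  # heterozygous ref/alt
        priors[(alt, alt)] = hom_rate * share                 # homozygous alternative
    alts = sorted(alt_share)
    for i, first in enumerate(alts):
        for second in alts[i + 1:]:
            # Assumed: two independent heterozygous events, one per alternative allele.
            priors[(first, second)] = (het_rate * alt_share[first]) * (het_rate * alt_share[second])
    return priors


priors = genotype_priors("G")
```

The ten values are not renormalized, so they sum to very slightly more than 1, as do the rounded numbers on the slide.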

Improved Genotype Calling: Error Rates

Rows: Actual Base. Columns: Predicted Base. Values are the percentage of miscalls of each actual base.

Actual   A      C      G      T
A        -      57.7   17.1   25.2
C        34.9   -      11.3   53.9
G        31.9   5.1    -      63.0
T        45.9   22.1   32.0   -

If a base was miscalled, what is it most likely to be called as instead?

Variant Calling

• SNP Calls infested with False Positives

– Machine artifacts

– Mis-mapped reads

– Mis-aligned indels

• 5 – 20% false positive rate

Decisions and Trade-Offs

• Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set.

– Pro: Few false positives

– Con: Will miss real variants

• Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage

– Con: False positives

– Pro: Won’t miss real variants

How Good Are My Calls?

• How many called SNPs?

– Human average of 1 heterozygous SNP / 1000 bases

• Fraction of variants already in dbSNP

– ~90%

• Transition/Transversion ratio

– Transitions 2x as common

• 3x when looking only at exons

ANNOTATING VARIANTS

Methods and Practices

Identifying Genetic Variation Causing Genetic Disease

Discovering Genetic Variants Causing Mendelian Disease

4 million genetic variants

2 million associated with protein-coding genes

10,000 possibly of disease-causing type

1500 <1% frequency in population

Single Causal Genetic Variant

If a problem cannot be solved, enlarge it.

--Dwight D. Eisenhower

Supreme Commander Allied Forces: Second World War

34th President USA

Variant Annotation Pipeline Example

Transcript Effects: Impact

[Figure: Reference and Patient copies of a gene (Exon 1, Intron 1, Exon 2), showing the Start codon, the Splice Sites, the TAA Stop codon, and the mRNA coding for protein. Patient variants are labelled: Splice Site Loss; Missense/Frameshift; Stop Gain; TAC (Tyr) at the TAA Stop codon]

Predicting Pathogenicity

Example: SIFT Algorithm

Input Query Sequence → Psi-BLAST → Homologs → Alignment → Multiple Sequence Alignment → PSSM → Normalize by most frequent AA → Score
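The last two steps of the flow (normalizing by the most frequent amino acid, then scoring) can be illustrated on a single alignment column. This is a heavily simplified sketch and not the SIFT implementation: it uses raw counts from one column, ignores sequence weighting and pseudocounts, and the function name and example column are made up for illustration.

```python
from collections import Counter


def normalized_column_scores(column: str) -> dict:
    """Score each amino acid seen in one alignment column.

    Scores are observed counts divided by the count of the most frequent
    amino acid, so the most frequent residue always scores 1.0.
    """
    counts = Counter(residue for residue in column if residue != "-")  # skip gaps
    most_frequent = max(counts.values())
    return {residue: count / most_frequent for residue, count in counts.items()}


# Hypothetical column from a multiple sequence alignment of homologs.
scores = normalized_column_scores("LLLLIV-L")
```

In this toy column, leucine scores 1.0 while isoleucine and valine score 0.2; an amino acid never observed at the position gets no score at all, which in this simplified picture corresponds to the least tolerated substitutions.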

Prediction Take-Away

The more conserved a site is the more likely any substitution is to be deleterious

However: Current methods have pretty poor performance, not suitable for clinical-level diagnosis

Variant Annotation Pipeline Example

Classifying Genetic Variants

[Figure: classification tree for 4 million variants. Node labels: Intergenic; Intronic; Exonic; Unknown; Splice Site; Silent Mutation; Amino Acid Changing; Stop Loss / Stop Gain; Missense Mutation; Known Genetic Disease Variant; Known Polymorphism in Population; Potential Disease Causing]

Visualization

GENE LEVEL ANNOTATION

Annotating Genes and Variants

• Is variant in a known protein-coding gene?

– What does the gene do?

– What molecular pathways?

– What protein-protein interactions?

– What tissues is it expressed in?

– When in development?

4 million genetic variants

2 million associated with protein-coding genes

10,000 possibly of disease-causing type

1500 <1% frequency in population

Gene Level Annotations

GENETIC REGIONS OF INTEREST

Identifying Genetic Regions of Interest

Number of Genes in Genomic Regions of Interest

IGNITE Project: Local Controls

• IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada

• Atlantic Canada harbours several non-represented population groups and sub-groups…

– Acadians

– Native American

– Non-Acadian/European Descent

Population Frequency

• Mendelian disorders are rare

• If variation is in database, is it associated with disease?

• Causal variation also needs to be rare

– Cutoff somewhere in the range of < 0.1% to < 1%

– Should appear rarely or not at all in local controls

– Track with disease in family members under study
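The three criteria above translate naturally into a filter. The sketch below is illustrative only: the Variant fields, the gene names, and the default thresholds (1% frequency, zero local controls) are hypothetical choices, not values from the IGNITE pipeline.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    population_frequency: float      # fraction of alleles, e.g. 0.005 means 0.5%
    local_control_count: int         # number of local controls carrying the variant
    segregates_with_disease: bool    # tracks with disease in the family under study


def candidate_variants(variants, max_frequency=0.01, max_local_controls=0):
    """Keep variants that are rare, absent from local controls, and track with disease."""
    return [
        variant
        for variant in variants
        if variant.population_frequency < max_frequency
        and variant.local_control_count <= max_local_controls
        and variant.segregates_with_disease
    ]


# Made-up variants for illustration.
variants = [
    Variant("GENE_A", 0.0002, 0, True),   # rare, not in controls, segregates
    Variant("GENE_B", 0.12, 0, True),     # too common in the population
    Variant("GENE_C", 0.0005, 3, True),   # seen in local controls
    Variant("GENE_D", 0.0, 0, False),     # does not track with disease
]
candidates = candidate_variants(variants)
```

Keeping the thresholds as parameters makes it easy to relax them progressively if no candidate survives the strictest setting.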

CASE STUDIES

IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa

IGNITE Data Pipeline and Integration

Mapped Region(s)

Known Genes

Gene Definitions

Pathway and Interactions

Annotated Genomic Variants

FilterSort

Prioritize

Gene Annotations

Brain Calcification

• 84 genes in chromosome 5 region

• No likely homozygous or compound heterozygous variants within region shared between two patients

• 29 genes with at least one targeted region with little or no sequencing coverage

• Many only lacked coverage in 5’ and 3’ UTRs

• Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data

Brain Calcification

Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9:120,962,282 -133,033,431

Cutis Laxa: Genetic Mapping

Chromosome 17:79,596,811-81,041,077

Charcot-Marie-Tooth

• 143 genes in region

• 13 known causative genes

– MPZ, PMP22, GDAP1, KIF1B, MFN2, SOX, EGR2, DNM2, RAB7, LITAF (SIMPLE), GARS, YARS, LMNA

Cutis Laxa

• 52 genes in region

• 5 known causative genes

– ATP6V0A2, ELN, FBLN5, EFEMP2, SCYL1BP1, ALDH18A1

Gene Level Annotations

Pathway and Interaction Data

Charcot-Marie-Tooth

• 37 pathways

– Clathrin-derived vesicle budding

– Lysosome vesicle biogenesis

– Endocytosis

– Golgi-associated vesicle biogenesis

– Membrane trafficking

– Trans-Golgi network vesicle budding

• Primarily LMNA or DNM2

Cutis Laxa

• 10 pathways

– Phagosome

– Collecting duct acid secretion

– Lysosome

– Protein digestion and absorption

– Metabolic pathways

– Oxidative phosphorylation

– Arginine and proline metabolism

• Primarily ATP6V0A2

Simple Prioritization

Pathways and Protein-Protein Interactions of Known Genes

Pathways and Protein-Protein Interactions of Variant Genes

Results: Charcot-Marie-Tooth

• 8 Genes Prioritized

Gene       Interactions   Pathway
LRSAM1     Multiple       Endocytosis
DNM1       DNM2           -
FNBP1      DNM2           -
TOR1A      MNA            -
STXBP1     Multiple       Five
SH3GLB2    -              Endocytosis
PIP5KL1    -              Endocytosis
FAM125B    -              Endocytosis

• For more information

– Guernsey et al (2010) PLoS Genetics. 6(8): e1001081

Results: Cutis Laxa

• 10 genes prioritized

Gene       Interactions   Pathway
HEXDC      Multiple       Phagosome
HG5        -              Phagosome
HG5        Multiple       Lysosome, Protein digestion
SIRT7      Multiple       Metabolic Pathways
FASN       -              Metabolic Pathways
DCXR       -              Metabolic Pathways
PYCR1      -              Metabolic Pathways, Arginine/Proline
PCYT2      -              Metabolic Pathways
ARHGDIA    -              Oxidative Phosphorylation

• For more information

– Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9