bioc4010 lectures 1 and 2

Next-Generation Sequence Analysis for Biomedical Applications

BIOC 4010/5010 Lecture 1

Dr. Dan Gaston, Postdoctoral Fellow, Department of Pathology

Dr. Karen Bedard Lab; Bioinformatician, IGNITE Project

LECTURE 1: Introduction to Next-Gen Sequencing

Overview: Lecture 1

• Introduction AKA “Why does this matter?”
• “Next-Gen” Sequencing
• Bioinformatics Workflows
• Types of Next-Gen Experiments
• Working with the Human Genome
• Slides available on slideshare:
  – http://www.slideshare.net/DanGaston

Major Areas in Human Disease Genomics

• Complex diseases
  – Genome Wide Association Studies (GWAS)
• Cancer
  – Tumour genomics (Driver mutations)
  – Transcriptomics
• Mendelian disease
  – Whole Genome/Exome Sequencing
  – Transcriptomics
  – Genetic Linkage

Diagnosing Genetic Diseases

• Genetic Counselors/Physicians order individual testing of genes based on patient phenotype

• For rare diseases or unusual phenotypes may run tens to hundreds of tests

• …..EXPENSIVE (Easily thousands of dollars)

Genetic Disease Research

Genetic Disease Research: Cutis Laxa

Chromosome 9:120,962,282-133,033,431

Cutis Laxa

• Linked Genomic Region ~13Mb in size
• Contains 143 Genes
• Prioritize and select genes for individual Sanger sequencing
• …Slow
• …Laborious
• …Can be expensive

Personalized Medicine

Human Genomics

• $5,000 - $10,000 to sequence whole genome
• $1,000 to sequence only protein-coding portion (exome, later)

Clinical Genomics

• Rapid diagnosis of genetic disease in NICU cases
• Quicker and cheaper than sequential genetic testing (traditional method)

Cancer Genomics

Welch JS, et al. JAMA, 2011;305, 1577

Cancer Chemotherapy Resistance

Human Disease Genomics at Dalhousie

• IGNITE: Identifying genetic mutations causing rare Mendelian diseases in Atlantic Canada
  – 3 year, $2.5 million Genome Canada Project
  – Currently working on >10 different diseases including two inherited cancers
  – Sequenced >20 individual exomes, 4 whole genomes, and several transcriptomes
  – More on Thursday…

• Dr. Graham Dellaire: Transcriptome sequencing and analysis on multiple cancer cell lines

Short Reads

Millions of paired “short reads”, 75-150bp each

FastQ Format

Read ID

Sequence

Quality line
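As a rough illustration of the record layout above, the following sketch reads FASTQ records assuming the common four-line layout (read ID line starting with "@", sequence, a "+" separator line, quality line). The names FastqRecord and read_fastq are made up for this example and do not come from any particular library.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        # One quality character is expected per base.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# A tiny in-memory example instead of a real file.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHHG\n")
records = list(read_fastq(example))
```

Real files are often gzip-compressed and much larger, so a streaming generator like this avoids loading every read into memory at once.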

FastQ Quality Scores

Quality Score (Q)   Probability of incorrect base call   Base call accuracy
10                  1 in 10                              90%
20                  1 in 100                             99%
30                  1 in 1,000                           99.9%
40                  1 in 10,000                          99.99%
50                  1 in 100,000                         99.999%

Q = -10 log10(P), where P is the probability of an incorrect base call
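The relationship between Q and P can be checked with a few lines of code. This is a minimal sketch; the function names are illustrative, and the offset of 33 used to decode quality characters is an assumption (the Phred+33 encoding) that should be confirmed for a given data set.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_line(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Q scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]
```

For example, phred_to_error_prob(30) corresponds to the "1 in 1,000" row of the table.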

Quality Scores of Sequencing Reads

General Genomics Workflow

• Raw Data Analysis: Quality control of raw data
• Whole Genome Mapping: Alignment to reference genome
• Variant Calling: Detection of genetic variation (SNPs, Indels, SV)
• Annotation: Linking variants to biological information

Short Read Mapping

[Figure: short reads tiled along the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…, some containing mismatches]

1) Report location of genome where read matches best
2) Minimize mismatches
3) Mismatches with lower quality bases better than mismatches with higher quality bases

Discovering Genetic Variation

[Figure: reads aligned to the reference genome ATCCTGATTCGGTGAACGTTATCGACGATCCGATCGAACTGTCAGCGGCAAGCTGATCGATCGATCGATGCTAGTG. SNPs: several reads carry T where the reference has A. INDELs: several reads contain a gap relative to the reference.]

Next-Gen Sequencing Experiments

• Whole Genome Sequencing
• Targeted Exome Sequencing
• RNA-Seq
• ChIP-Seq
• CLIP-Seq

Composition of Human Genome

Size: 3.2 Gb

Genomic Content

Chromosome  Base pairs   Variations  Confirmed proteins  Putative proteins  Pseudogenes  miRNA  rRNA  Misc ncRNA
1           249,250,621  4,401,091   2,012               31                 1,130        134    66    106
2           243,199,373  4,607,702   1,203               50                 948          115    40    93
3           198,022,430  3,894,345   1,040               25                 719          99     29    77
4           191,154,276  3,673,892   718                 39                 698          92     24    71
5           180,915,260  3,436,667   849                 24                 676          83     25    68
6           171,115,067  3,360,890   1,002               39                 731          81     26    67
7           159,138,663  3,045,992   866                 34                 803          90     24    70
8           146,364,022  2,890,692   659                 39                 568          80     28    42
9           141,213,431  2,581,827   785                 15                 714          69     19    55
10          135,534,747  2,609,802   745                 18                 500          64     32    56
11          135,006,516  2,607,254   1,258               48                 775          63     24    53
12          133,851,895  2,482,194   1,003               47                 582          72     27    69
13          115,169,878  1,814,242   318                 8                  323          42     16    36
14          107,349,540  1,712,799   601                 50                 472          92     10    46
15          102,531,392  1,577,346   562                 43                 473          78     13    39
16          90,354,753   1,747,136   805                 65                 429          52     32    34
17          81,195,210   1,491,841   1,158               44                 300          61     15    46
18          78,077,248   1,448,602   268                 20                 59           32     13    25
19          59,128,983   1,171,356   1,399               26                 181          110    13    15
20          63,025,520   1,206,753   533                 13                 213          57     15    34
21          48,129,895   787,784     225                 8                  150          16     5     8
22          51,304,566   745,778     431                 21                 308          31     5     23
X           155,270,560  2,174,952   815                 23                 780          128    22    52
Y           59,373,566   286,812     45                  8                  327          15     7     2
mtDNA       16,569       929         13                  0                  0            0      2     22

http://en.wikipedia.org/wiki/Chromosome_1_(human)






















http://en.wikipedia.org/wiki/X_chromosome

http://en.wikipedia.org/wiki/Y_chromosome

http://en.wikipedia.org/wiki/Mitochondrial_DNA

Exome Sequencing

Transcriptomics: RNA-Seq

• Sequence the actively transcribed genes in a cell line or tissue
  – Only about 20% of genes are transcribed in particular cell types
• Two types:
  – Poly-A selection
  – Total RNA + ribodepletion

• Many experimental questions can be addressed

RNA-Seq: Gene Expression

[Figure: read coverage compared between Condition 1 and Condition 2]

RNA-Seq: Differential Splicing

[Figure: transcript with Exon 1, Exon 2, and Exon 3]

RNA-Seq: Novel/Non-Canonical Exon Discovery

[Figure: transcript with Exon 1, Exon 2, Exon 3, and an additional Exon X]

RNA-Seq: Gene Fusion Events

[Figure: transcript with Exon 1, Exon 2, and Exon 3, together with Exon 4 of Gene 2]

RNA-Seq

• Important to take into account biological variability. A sample of cells is a mixed population
  – Replicates!

• Not suited for discovering polymorphisms due to higher error rates introduced by reverse transcription step (RNA -> cDNA)

• High false-positive rates for fusion gene and novel exon discovery when expression levels are low

ChIP-Seq

Short Read Mapping: Placing Millions of Reads on Human Reference

• Problem: Efficiently place millions of reads (75bp – 200bp) accurately within 3.2Gb of reference genome

• Problem: Read may match equally well at more than one location (pseudogenes, copy number variation, repetitive elements)

• Problem: Sequencing reads may be paired

Short Read Mapping: Brute Force Method

Simple conceptually: Compare each query k-mer to all k-mers of genome

Genome Size (N): 3.2 billion bases
K-mer length (M): 7
Number of comparisons ((N - M + 1) * M): about 22.4 billion
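To make the cost concrete, here is a small sketch that evaluates the slide's formula (N - M + 1) * M and runs the brute-force comparison on a toy sequence. The function names and the toy "genome" are invented for illustration.

```python
from typing import List


def brute_force_comparisons(genome_size: int, kmer_length: int) -> int:
    """Character comparisons needed in the worst case: (N - M + 1) * M."""
    return (genome_size - kmer_length + 1) * kmer_length


def find_exact_matches(genome: str, query: str) -> List[int]:
    """Compare the query against every k-mer of the genome, one by one."""
    m = len(query)
    return [i for i in range(len(genome) - m + 1) if genome[i:i + m] == query]


comparisons = brute_force_comparisons(3_200_000_000, 7)
hits = find_exact_matches("TGATTACAGATTACC", "GATTACA")
```

This cost applies to a single 7-mer query; repeating it for millions of longer reads is what makes the brute-force approach impractical.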

Solution

Index the Reference Genome

Indexing the reference is like constructing a phone book: it lets you move quickly to the relevant portion of the genome and ignore the rest.

Short Read Alignment: Suffix Array

Split genome into all suffixes (substrings) and sort alphabetically

Allows query to be searched against an alphabetical reference, skipping 96% of the genome

Ex: banana

Suffixes: banana, anana, nana, ana, na, a

Sorted: a, ana, anana, banana, na, nana
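The banana example can be reproduced with a naive construction. This sketch sorts suffix start positions by the suffix text; it is only suitable for short strings, since genome-scale tools use far more memory-efficient algorithms. The function name is illustrative.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes, sorted alphabetically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana"
suffix_array = build_suffix_array(text)
sorted_suffixes = [text[i:] for i in suffix_array]
```

Storing positions rather than the suffix strings themselves is what keeps the index proportional to the genome length.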

Short Read Alignment: Binary Search

• Searching the index efficiently is still a problem…

Search for GATTACA…

Binary Search

• Initialize search range to entire list
  – mid = (hi+lo)/2; middle = suffix[mid]
  – if query matches middle: done
  – else if query < middle: pick low range
  – else if query > middle: pick hi range

• Repeat until done or empty range
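The steps above can be sketched as follows. This is a toy version under simple assumptions: the suffix array is built naively, only exact matches are found, and the query is compared against the first len(query) characters of each suffix. Function names are illustrative.

```python
from typing import List, Optional


def build_suffix_array(text: str) -> List[int]:
    """Naive suffix array: suffix start positions in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text: str, suffix_array: List[int], query: str) -> Optional[int]:
    """Binary search for one position where query occurs, or None."""
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        middle = text[start:start + len(query)]  # prefix of the middle suffix
        if query == middle:
            return start
        if query < middle:
            hi = mid - 1  # pick low range
        else:
            lo = mid + 1  # pick hi range
    return None  # empty range: query is absent


genome = "TGATTACAGATTACC"
index = build_suffix_array(genome)
position = find_occurrence(genome, index, "GATTACA")
```

Each step halves the search range, so a lookup needs roughly log2(N) suffix comparisons instead of a scan over all N positions.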

Applied to Human Genome

• In practice simple methods of indexing the genome can create very large data structures
  – Suffix Array: > 12 GB
• Solution: Apply complex procedures that allow you to index and compress the data:
  – Burrows-Wheeler Transform
  – FM-Index

Short Read Mapping: Mapping Quality

• So far we have also ignored the quality scores of reads
• Mapping Quality (for a read): Sum the quality scores at mismatched bases for the best alignment (SUM_BASE_Q(best)), and also consider all other possible alignments

MQ = -log10( 1 - 10^(-SUM_BASE_Q(best)) / SUM_i 10^(-SUM_BASE_Q(i)) )
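A small numeric sketch of the mapping quality formula, following the form written on the slide (no factor of 10 outside the logarithm). The function name, the convention that the list of candidate alignments includes the best one, and the cap applied when the read maps uniquely are all assumptions made for this example.

```python
import math
from typing import Sequence


def mapping_quality(sum_q_best: float, sum_q_all: Sequence[float], cap: float = 60.0) -> float:
    """MQ = -log10(1 - 10^-SUM_BASE_Q(best) / SUM_i 10^-SUM_BASE_Q(i)).

    sum_q_all holds SUM_BASE_Q for every candidate alignment, including
    the best one. The cap is an arbitrary choice for this sketch.
    """
    weights = [10 ** (-s) for s in sum_q_all]
    p_best = 10 ** (-sum_q_best) / sum(weights)
    p_wrong = 1.0 - p_best
    if p_wrong <= 0.0:
        return cap  # only one plausible location: report the maximum
    return min(cap, -math.log10(p_wrong))


# Two equally good placements: the read is ambiguous, so MQ is low.
ambiguous = mapping_quality(2.0, [2.0, 2.0])
# A perfect placement plus one much worse alternative: MQ is higher.
confident = mapping_quality(0.0, [0.0, 3.0])
```

The ambiguous case gives a 50% chance of a wrong placement, which is why reads in repeats and pseudogenes receive mapping qualities near zero.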

Short Read Aligners

• BLAT: BLAST-Like Alignment Tool
• MAQ: First to take into account quality scores
• BWA: First to use Burrows-Wheeler Transform
• Bowtie: Ungapped alignment only
• Bowtie2: Allows indels
• … and many more

LECTURE 2: Identifying and Annotating Genomic Variation for Disease Gene Discovery

Genetic Variation

• dbSNP (NCBI) catalogues > 53 million Single Nucleotide Variations (SNVs) in humans
  – 38 million validated
  – 22 million in genes
  – 36 million with frequencies

• 50-80% of mutations involved in inherited disease caused by SNVs

SNP vs SNV

• Technically a polymorphism is a variation that doesn’t cause disease and is common in a population

• What is common?
  – Greater than 5% in a population is a typical definition
  – Definition for rare ranges from < 0.5% to < 1.5%

Discovering Genetic Variation

[Figure: reads aligned to the reference genome, showing SNPs as mismatched bases shared by several reads and INDELs as gaps in reads relative to the reference]

Variant Calling: The Absurdly Simple Way






[Figure: reads aligned to the reference genome, with one position highlighted]

Read depth at base: 10 (T: 4, A: 6)

Genotype: Heterozygous A/T

Variant Calling: The Absurdly Simple Way

• Algorithm:
  – Count all aligned bases that pass quality threshold (e.g. >Q20)
  – If #reads with alternative base > lower bound (20%) and < upper bound (80%) call heterozygous alt
  – Else if > upper bound call homozygous alternative
  – Else call homozygous reference
• …But what about using base qualities for more than filtering reads?
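The counting algorithm can be sketched in a few lines. Assumptions made for this example: bases and qualities arrive as parallel sequences for one genome position, the reference base is known, and an alternative fraction exactly equal to the upper bound is treated as homozygous alternative. The function name and return strings are illustrative.

```python
from collections import Counter
from typing import Sequence


def call_genotype(bases: Sequence[str], quals: Sequence[int], ref_base: str,
                  min_q: int = 20, lower: float = 0.2, upper: float = 0.8) -> str:
    """Call a genotype at one position from base counts alone."""
    # Keep only base calls that pass the quality threshold (>Q20).
    kept = [b for b, q in zip(bases, quals) if q > min_q]
    if not kept:
        return "no call"
    alt_counts = Counter(b for b in kept if b != ref_base)
    if not alt_counts:
        return "homozygous reference"
    alt_base, alt_count = alt_counts.most_common(1)[0]
    alt_fraction = alt_count / len(kept)
    if lower < alt_fraction < upper:
        return f"heterozygous {ref_base}/{alt_base}"
    if alt_fraction >= upper:
        return f"homozygous alternative {alt_base}/{alt_base}"
    return "homozygous reference"


# The example from the earlier slide: depth 10, T: 4, A: 6 (reference A assumed).
genotype = call_genotype("AAAAAATTTT", [30] * 10, ref_base="A")
```

Note that base qualities only decide which bases are counted; a Q21 base and a Q40 base carry the same weight, which is the weakness the final bullet points at.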

Improving Variant Calling

• MAQ (Mapping and Assembling with Quality):
  – Short Read Mapper and Genotype Caller
  – First to use base qualities for either
  – Introduced mapping quality


① Base quality can not be more reliable than mapping quality of read

② At most individual can have two real nucleotides at a position (two alleles)

① Only consider two most frequent nucleotides
② Simplify to two states: A and B


• Three Possible Genotypes:
  – AA, BB, AB

• Construct a model that includes base quality to estimate the probability of error

• Calculate the probability of each genotype given the data and error rate

• Genotype with highest probability is called

The Model

[Equation shown on the slides was not recovered. It contains one term for reads that match the reference and one term for reads that don’t match the reference.]

m = ploidy (2)
k = number of reads
g = genotype
e = error probability
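The equations on the model slides did not survive as text, so the sketch below is an assumption rather than a transcription: it uses the genotype likelihood published for samtools, which has the same symbols (m, k, g, e) and the same split into reads that match and reads that don't match the reference. Here g is taken to be the number of reference alleles in the genotype, and a flat prior over genotypes is assumed. Function names are illustrative.

```python
from typing import Dict, Sequence, Tuple


def genotype_likelihood(g: int, match_errors: Sequence[float],
                        mismatch_errors: Sequence[float], m: int = 2) -> float:
    """Likelihood of the reads given g reference alleles out of m.

    match_errors: error probabilities of reads matching the reference.
    mismatch_errors: error probabilities of reads not matching it.
    """
    k = len(match_errors) + len(mismatch_errors)
    likelihood = 1.0 / (m ** k)
    for e in match_errors:        # reads that match reference
        likelihood *= (m - g) * e + g * (1 - e)
    for e in mismatch_errors:     # reads that don't match reference
        likelihood *= (m - g) * (1 - e) + g * e
    return likelihood


def call_genotype(match_errors: Sequence[float], mismatch_errors: Sequence[float],
                  m: int = 2) -> Tuple[int, Dict[int, float]]:
    """Return the most probable g and all posteriors (flat prior assumed)."""
    likelihoods = {g: genotype_likelihood(g, match_errors, mismatch_errors, m)
                   for g in range(m + 1)}
    total = sum(likelihoods.values())
    posteriors = {g: value / total for g, value in likelihoods.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Six reads match the reference, four do not, all at Q30 (e = 0.001).
best_g, posteriors = call_genotype([0.001] * 6, [0.001] * 4)
```

With g = 1 the call is heterozygous (AB); g = 2 and g = 0 correspond to the two homozygous genotypes.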


• Two widely used tool sets for calling variants
  – samtools (uses MAQ-type calculation)
  – Genome Analysis Toolkit (GATK)

UnifiedGenotyper

• UnifiedGenotyper: Capable of calling both indels and single nucleotide polymorphisms (SNPs) and allele frequencies given multiple samples

UnifiedGenotyper

Apply filters to discard poor reads and remove biases:
① Duplicate reads
② Malformed reads (i.e. mismatch in #bases and base qualities)
③ Bad mate (paired-end sequencing, paired reads map to different chromosomes)
④ Mapping quality zero (maps to multiple locations equally well)
⑤ Too many mismatches (reads are kept only with fewer than 10% mismatches in the 20bp to either side of the position)

Remove Duplicate Reads

Application   Avg #Molecules/Library   Read Length   Avg #Molecules Sampled   Molecules Sampled > 1
30X Genome    5bn                      2x100         450m                     4.4%
4x Genome     5bn                      2x100         60m                      0.6%
100x Exome    500m                     2x75          20m                      2.0%

Duplicate reads break the assumption of independent sampling from the library

Identify reads with identical start/stop positions
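A minimal sketch of duplicate removal by coordinates. The AlignedRead record and its fields are hypothetical, and keeping the read with the highest mean base quality from each group is one common convention assumed here, not something stated on the slide.

```python
from typing import Dict, List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int
    strand: str
    mean_quality: float


def remove_duplicates(reads: List[AlignedRead]) -> List[AlignedRead]:
    """Keep one read per (chrom, start, end, strand) group."""
    best: Dict[Tuple[str, int, int, str], AlignedRead] = {}
    for read in reads:
        key = (read.chrom, read.start, read.end, read.strand)
        # Retain the highest-quality read seen so far for these coordinates.
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.end))


reads = [
    AlignedRead("r1", "chr1", 100, 175, "+", 30.0),
    AlignedRead("r2", "chr1", 100, 175, "+", 35.0),  # duplicate of r1
    AlignedRead("r3", "chr1", 200, 275, "-", 28.0),
]
unique_reads = remove_duplicates(reads)
```

For paired-end data the key would normally also include the mate's position, so that only fragments with both ends identical are treated as duplicates.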

Sequencer-Specific Error Models

Rows: actual base. Columns: predicted base. Each row sums to 100.

Actual   A      C      G      T
A        -      57.7   17.1   25.2
C        34.9   -      11.3   53.9
G        31.9   5.1    -      63.0
T        45.9   22.1   32.0   -

If a base was miscalled, what is it most likely to be called as instead?

Variant Calling

• SNP Calls infested with False Positives
  – Machine artifacts
  – Mis-mapped reads
  – Mis-aligned indels

• 5 – 20% false positive rate

Decisions and Trade-Offs

• Option 1: Use stringent program options for calling variants and hard filtering early to produce only highly-confident call set.
  – Pro: Few false positives
  – Con: Will miss real variants
• Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage
  – Pro: Won’t miss real variants
  – Con: Many more false positives

How Good Are My Calls?

• How many called SNPs?
  – Human average of 1 heterozygous SNP / 1000 bases
• Fraction of variants already in dbSNP
• Transition/Transversion ratio
  – Transitions 2x as common
    • 2.8x when looking only at exons
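The transition/transversion check is easy to compute from a call set. A sketch, assuming each SNV is given as a (reference base, alternative base) pair with the two bases different; function names are illustrative.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Transitions stay within purines (A<->G) or pyrimidines (C<->T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Number of transitions divided by number of transversions."""
    snvs = list(snvs)
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions


ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
```

A call set whose ratio falls well below the expected value suggests an excess of false positives, since random errors would approach a ratio of about 0.5 (four possible transitions versus eight possible transversions).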

ANNOTATING VARIANTS

Identifying Genetic Variation Causing Genetic Disease

Discovering Genetic Variants Causing Mendelian Disease

4 million genetic variants

2 million associated with protein-coding genes

10,000 possibly of disease-causing type

1500 <1% frequency in population

Single Causal Genetic Variant

If a problem cannot be solved, enlarge it.

--Dwight D. Eisenhower

TYPES OF SINGLE NUCLEOTIDE VARIANTS

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure series: a reference gene model (Exon 1, Intron 1, Exon 2) with its start codon, TAA stop codon, splice sites, and the mRNA coding for protein, compared with patient sequences. Panel labels: TAC (Tyr); Splice Site Loss; Missense; Missense/Frameshift; Stop Gain]

GENETIC REGIONS OF INTEREST

Identifying Genetic Regions of Interest

Number of Genes in Genomic Regions of Interest

FREQUENCY OF GENETIC VARIANTS

Frequency of Polymorphisms: Common vs Rare

• Mendelian disorders are caused by rare variation, < 1-2% frequency in the relevant population

• Leverage large projects aimed at assessing genetic diversity in populations around the world
  – 1000 Genomes
  – NHLBI Exome Sequencing Project

Human Populations

Population Matters

• Most variations in protein-coding genes occurred fairly recently (last 20,000 years)
  – Adaptation to agriculture and diet changes, pathogen exposure and urban living
• Monogenic diseases have different prevalence in different populations
  – Cystic fibrosis in European population
  – Hereditary hemochromatosis in Northern Europeans
  – Tay-Sachs in Ashkenazi Jews
  – Sickle-cell anemia in Sub-Saharan African populations
• Polygenic disorders

1000 Genomes Project

Exome Sequencing Project

• Multi-Institutional
• Total possible patient pool of > 250,000 individuals, well phenotyped
  – Includes healthy individuals and diseased
• Currently 6700 exomes sequenced
  – 4420 European descent
  – 2312 African American
• 1.2 million coding variations
  – Most extremely rare/unique
  – Many population specific

IGNITE Project: Local Controls

• IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada
• Atlantic Canada harbours several non-represented population groups and sub-groups…
  – Acadians
  – Native American
  – Non-Acadian/European Descent

Population Frequency

• Mendelian disorders are rare
• If variation is in database, is it associated with disease?
• Causal variation also needs to be rare
  – Cutoff somewhere in the < 0.5 - < 1.5% range
  – Should appear rarely or not at all in local controls
  – Track with disease in family members under study
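The three criteria can be combined into a simple filter. Everything in this sketch is hypothetical: the Variant record, its field names, the 1% default cutoff (chosen from within the range above), and the simplification that "tracking with disease" means every affected family member carries the variant.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Variant:
    name: str
    population_frequency: float   # e.g. from a public frequency database
    local_control_count: int      # times seen in local control samples
    carriers: FrozenSet[str]      # family members carrying the variant


def candidate_variants(variants: Iterable[Variant], affected: Iterable[str],
                       max_frequency: float = 0.01,
                       max_local_controls: int = 0) -> List[Variant]:
    """Keep rare variants, absent from local controls, carried by all affected."""
    affected_set = set(affected)
    return [
        v for v in variants
        if v.population_frequency < max_frequency
        and v.local_control_count <= max_local_controls
        and affected_set <= v.carriers
    ]


variants = [
    Variant("var1", 0.12, 5, frozenset({"patient1", "patient2"})),    # too common
    Variant("var2", 0.001, 0, frozenset({"patient1"})),               # not in patient2
    Variant("var3", 0.0, 0, frozenset({"patient1", "patient2"})),     # candidate
]
candidates = candidate_variants(variants, affected=["patient1", "patient2"])
```

A fuller segregation check would also use unaffected relatives and the expected inheritance pattern (dominant or recessive).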

Predicting the Impact of Missense Mutations

• Most use some level of evolutionary conservation to determine how severe a mutation is
  – SIFT
  – PolyPhen
  – GERP++
  – EvoD

Example: SIFT Algorithm

Input Query Sequence
→ PSI-BLAST
→ Homologs
→ Alignment
→ Multiple Sequence Alignment
→ PSSM
→ Normalize by most frequent AA
→ Score

Predicting Impact

• Other approaches include additional features:
  – Protein structure information
  – Site level annotation (active sites, binding sites, etc)
  – Protein domain information
  – Biophysical properties of amino acids in that position and of the substituted amino acid

Prediction Take-Away

The more conserved a site is the more likely any substitution is to be deleterious

However: Current methods have pretty poor performance, not suitable for clinical-level diagnosis

Classifying Genetic Variants

[Figure: classification tree starting from 4 million variants. Node labels: Intergenic; Intronic; Exonic; Unknown; Splice Site; Silent Mutation; Amino Acid Changing; Missense Mutation; Stop Loss / Stop Gain; Known Polymorphism in Population; Known Genetic Disease Variant; Potential Disease Causing]

GENE LEVEL ANNOTATION

Annotating Genes and Variants

• Is variant in a known protein-coding gene?
  – What does the gene do?
  – What molecular pathways?
  – What protein-protein interactions?
  – What tissues is it expressed in?
  – When in development?






Gene Level Annotations

ADDING ANNOTATIONS TO VARIANTS

Genomic Intervals, Searching, and Annotation

• Most common way of describing genomic features is as an interval

• Multiple formats (BED, WIG, VCF, etc)
• In common for all is location:
  – Chromosome
  – Start Position of Feature
  – End Position of Feature
  – Annotations/Info (Optional)

Searching and Annotating: Interval Trees

• Interval Trees allow efficient searching of all overlapping intervals

• Easiest to make one tree per chromosome
• Given a set of intervals (n) on a number line (chromosome) construct a tree

Interval Trees

[Figure: a tree node with a left branch (all intervals to left of the centre point) and a right branch (all intervals to right)]

Node Contains:

- Centre point

- Intervals sorted by start

- Intervals sorted by end
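The node layout described above can be sketched as a centred interval tree. Assumptions for this example: intervals are closed (start, end, label) tuples with start <= end, the centre point is the median endpoint, and only point queries are shown. The class and method names are illustrative.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int, str]  # (start, end, label), closed interval


class IntervalNode:
    def __init__(self, intervals: List[Interval]):
        points = sorted(p for iv in intervals for p in (iv[0], iv[1]))
        self.centre = points[len(points) // 2]
        left = [iv for iv in intervals if iv[1] < self.centre]
        right = [iv for iv in intervals if iv[0] > self.centre]
        here = [iv for iv in intervals if iv[0] <= self.centre <= iv[1]]
        # Intervals overlapping the centre, kept in two sorted orders.
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left: Optional["IntervalNode"] = IntervalNode(left) if left else None
        self.right: Optional["IntervalNode"] = IntervalNode(right) if right else None

    def query_point(self, pos: int) -> List[Interval]:
        """Return all intervals that contain pos."""
        hits: List[Interval] = []
        if pos < self.centre:
            for iv in self.by_start:      # ends are >= centre, check starts only
                if iv[0] > pos:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.query_point(pos)
        elif pos > self.centre:
            for iv in self.by_end:        # starts are <= centre, check ends only
                if iv[1] < pos:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.query_point(pos)
        else:
            hits.extend(self.by_start)
        return hits


genes = [(100, 200, "geneA"), (150, 300, "geneB"), (400, 500, "geneC"), (50, 60, "geneD")]
tree = IntervalNode(genes)
overlapping = sorted(label for _, _, label in tree.query_point(175))
```

Following the slide's suggestion, one such tree would be built per chromosome, for example in a dictionary keyed by chromosome name, and each variant position looked up in the tree for its chromosome.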

CASE STUDIES

IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa

IGNITE Data Pipeline and Integration

[Figure: pipeline diagram. Data sources: Mapped Region(s), Known Genes, Gene Definitions, Pathway and Interactions, Gene Annotations, Annotated Genomic Variants. Steps: Filter, Sort, Prioritize]

Brain Calcification

• 84 genes in chromosome 5 region
• No likely homozygous or compound heterozygous variants within region shared between two patients
• 29 genes with at least one targeted region with little or no sequencing coverage
• Many only lacked coverage in 5’ and 3’ UTRs
• Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data

Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9:120,962,282-133,033,431

Cutis Laxa: Genetic Mapping

Chromosome 17:79,596,811-81,041,077

Charcot-Marie-Tooth
• 143 genes in region
• 13 known causative genes
  – MPZ
  – PMP22
  – GDAP1
  – KIF1B
  – MFN2
  – SOX
  – EGR2
  – DNM2
  – RAB7
  – LITAF (SIMPLE)
  – GARS
  – YARS
  – LMNA

Cutis Laxa
• 52 genes in region
• 5 known causative genes
  – ATP6V0A2
  – ELN
  – FBLN5
  – EFEMP2
  – SCYL1BP1
  – ALDH18A1

Pathway and Interaction Data

Charcot-Marie-Tooth
• 37 pathways
  – Clathrin-derived vesicle budding
  – Lysosome vesicle biogenesis
  – Endocytosis
  – Golgi-associated vesicle biogenesis
  – Membrane trafficking
  – Trans-Golgi network vesicle budding
• Primarily LMNA or DNM2

Cutis Laxa
• 10 pathways
  – Phagosome
  – Collecting duct acid secretion
  – Lysosome
  – Protein digestion and absorption
  – Metabolic pathways
  – Oxidative phosphorylation
  – Arginine and proline metabolism
• Primarily ATP6V0A2

Results: Charcot-Marie-Tooth

• 8 Genes Prioritized

Gene      Interactions   Pathway
LRSAM1    Multiple       Endocytosis
DNM1      DNM2           -
FNBP1     DNM2           -
TOR1A     MNA            -
STXBP1    Multiple       Five
SH3GLB2   -              Endocytosis
PIP5KL1   -              Endocytosis
FAM125B   -              Endocytosis

• For more information
  – Guernsey et al (2010) PLoS Genetics. 6(8): e1001081

Results: Cutis Laxa

• 10 genes prioritized

Gene      Interactions   Pathway
HEXDC     Multiple       Phagosome
HG5       -              Phagosome
HG5       Multiple       Lysosome, Protein digestion
SIRT7     Multiple       Metabolic Pathways
FASN      -              Metabolic Pathways
DCXR      -              Metabolic Pathways
PYCR1     -              Metabolic Pathways, Arginine/Proline
PCYT2     -              Metabolic Pathways
ARHGDIA   -              Oxidative Phosphorylation

• For more information
  – Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9
