bioc4010 lectures 1 and 2
Introduction to NGS and Bioinformatics for Human Disease Applications
Next-Generation Sequence Analysis for Biomedical Applications
BIOC 4010/5010
Lecture 1
Dr. Dan Gaston
Postdoctoral Fellow, Department of Pathology
Dr. Karen Bedard Lab
Bioinformatician, IGNITE Project
LECTURE 1
Introduction to Next-Gen Sequencing
Overview: Lecture 1
• Introduction AKA “Why does this matter?”
• “Next-Gen” Sequencing
• Bioinformatics Workflows
• Types of Next-Gen Experiments
• Working with the Human Genome
• Slides available on slideshare:
– http://www.slideshare.net/DanGaston
Major Areas in Human Disease Genomics
• Complex diseases
– Genome Wide Association Studies (GWAS)
• Cancer
– Tumour genomics (Driver mutations)
– Transcriptomics
• Mendelian disease
– Whole Genome/Exome Sequencing
– Transcriptomics
– Genetic Linkage
Diagnosing Genetic Diseases
• Genetic Counselors/Physicians order individual testing of genes based on patient phenotype
• For rare diseases or unusual phenotypes, tens to hundreds of tests may be run
• …EXPENSIVE (easily thousands of dollars)
Genetic Disease Research
Genetic Disease Research: Charcot-Marie-Tooth
Chromosome 9:120,962,282-133,033,431
Charcot-Marie-Tooth
• Linked Genomic Region ~13Mb in size
• Contains 143 Genes
• Prioritize and select genes for individual Sanger sequencing
• …Slow
• …Laborious
• …Can be expensive
Personalized Medicine
Human Genomics
• $5,000 - $10,000 to sequence whole genome
• $1,000 to sequence only protein-coding portion (exome, later)
Clinical Genomics
• Rapid diagnosis of genetic disease in NICU cases
• Quicker and cheaper than sequential genetic testing (traditional method)
Cancer Genomics
Welch JS, et al. JAMA, 2011;305, 1577
Cancer Chemotherapy Resistance
Human Disease Genomics at Dalhousie
• IGNITE: Identifying genetic mutations causing rare Mendelian diseases in Atlantic Canada
– 3 year, $2.5 million Genome Canada Project
– Currently working on >10 different diseases including two inherited cancers
– Sequenced >20 individual exomes, 4 whole genomes, and several transcriptomes
– More on Thursday…
• Dr. Graham Dellaire: Transcriptome sequencing and analysis on multiple cancer cell lines
Short Reads
Millions of paired “short reads”, 75-150bp each
FastQ Format
Read ID
Sequence
Quality line
FastQ Quality Scores
Quality Score (Q)   Probability of incorrect base call   Base call accuracy
10                  1 in 10                              90%
20                  1 in 100                             99%
30                  1 in 1,000                           99.9%
40                  1 in 10,000                          99.99%
50                  1 in 100,000                         99.999%
Q = -10 log10(P), where P is the probability of an incorrect base call
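The relationship between quality score and error probability can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the decoder assumes the common Phred+33 ASCII encoding of the FastQ quality line, which not every instrument or file uses.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (Q = -10 log10 P)."""
    return -10 * math.log10(p)


def decode_quality_line(quality_line, offset=33):
    """Turn a FastQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding: the ASCII code of each character minus 33.
    """
    return [ord(ch) - offset for ch in quality_line]


if __name__ == "__main__":
    for q in decode_quality_line("I5!"):
        print(q, phred_to_error_prob(q))
```

Under the Phred+33 assumption the character "I" encodes Q40 and "5" encodes Q20.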
Quality Scores of Sequencing Reads
General Genomics Workflow
Raw Data Analysis: Quality control of raw data
Whole Genome Mapping: Alignment to reference genome
Variant Calling: Detection of genetic variation (SNPs, Indels, SV)
Annotation: Linking variants to biological information
Short Read Mapping
[Figure: short reads tiled along the reference sequence …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…, each placed where it matches best]
1) Report location of genome where read matches best
2) Minimize mismatches
3) Mismatches with lower quality bases better than mismatches with higher quality bases
Discovering Genetic Variation
[Figure: reads aligned to the reference genome, with SNPs visible as positions where reads carry a different base and INDELs visible as gaps in the aligned reads]
Next-Gen Sequencing Experiments
• Whole Genome Sequencing
• Targeted Exome Sequencing
• RNA-Seq
• ChIP-Seq
• CLIP-Seq
Composition of Human Genome
Size: 3.2 Gb
Genomic Content
Chromosome  Base pairs   Variations  Confirmed proteins  Putative proteins  Pseudogenes  miRNA  rRNA  Misc ncRNA
1           249,250,621  4,401,091   2,012               31                 1,130        134    66    106
2           243,199,373  4,607,702   1,203               50                 948          115    40    93
3           198,022,430  3,894,345   1,040               25                 719          99     29    77
4           191,154,276  3,673,892   718                 39                 698          92     24    71
5           180,915,260  3,436,667   849                 24                 676          83     25    68
6           171,115,067  3,360,890   1,002               39                 731          81     26    67
7           159,138,663  3,045,992   866                 34                 803          90     24    70
8           146,364,022  2,890,692   659                 39                 568          80     28    42
9           141,213,431  2,581,827   785                 15                 714          69     19    55
10          135,534,747  2,609,802   745                 18                 500          64     32    56
11          135,006,516  2,607,254   1,258               48                 775          63     24    53
12          133,851,895  2,482,194   1,003               47                 582          72     27    69
13          115,169,878  1,814,242   318                 8                  323          42     16    36
14          107,349,540  1,712,799   601                 50                 472          92     10    46
15          102,531,392  1,577,346   562                 43                 473          78     13    39
16          90,354,753   1,747,136   805                 65                 429          52     32    34
17          81,195,210   1,491,841   1,158               44                 300          61     15    46
18          78,077,248   1,448,602   268                 20                 59           32     13    25
19          59,128,983   1,171,356   1,399               26                 181          110    13    15
20          63,025,520   1,206,753   533                 13                 213          57     15    34
21          48,129,895   787,784     225                 8                  150          16     5     8
22          51,304,566   745,778     431                 21                 308          31     5     23
X           155,270,560  2,174,952   815                 23                 780          128    22    52
Y           59,373,566   286,812     45                  8                  327          15     7     2
mtDNA       16,569       929         13                  0                  0            0      2     22
Exome Sequencing
Transcriptomics: RNA-Seq
• Sequence the actively transcribed genes in a cell line or tissue
– Only about 20% of genes are transcribed in particular cell types
• Two types:
– Poly-A selection
– Total RNA + ribodepletion
• Many experimental questions can be addressed
RNA-Seq: Gene Expression
[Figure: Condition 1 vs. Condition 2]
RNA-Seq: Differential Splicing
[Figure: Exon 1, Exon 2, Exon 3]
RNA-Seq: Novel/Non-Canonical Exon Discovery
[Figure: Exon 1, Exon 2, Exon 3, Exon X]
RNA-Seq: Gene Fusion Events
[Figure: Exon 1, Exon 2, Exon 3; Gene 2 Exon 4]
RNA-Seq
• Important to take into account biological variability. A sample of cells is a mixed population
– Replicates!
• Not suited for discovering polymorphisms due to higher error rates introduced by the reverse transcription step (RNA -> cDNA)
• High false-positive rates for fusion gene and novel exon discovery when expression levels are low
ChIP-Seq
Short Read Mapping: Placing Millions of Reads on Human Reference
• Problem: Efficiently place millions of reads (75bp – 200bp) accurately within 3.2Gb of reference genome
• Problem: Read may match equally well at more than one location (pseudogenes, copy number variation, repetitive elements)
• Problem: Sequencing reads may be paired
Short Read Mapping: Brute Force Method
Simple conceptually: Compare each query k-mer to all k-mers of genome
Genome Size (N): 3.2 billion bases
K-mer length (M): 7
Number of comparisons ((N - M + 1) * M): ~22 billion
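A minimal sketch of the brute-force idea, assuming plain Python strings for the genome and the query (the function names are invented for this illustration). It counts character comparisons so the cost is visible, and a second helper evaluates the worst-case bound (N - M + 1) * M directly.

```python
def brute_force_search(genome, query):
    """Compare the query against every position of the genome.

    Returns (hits, comparisons): the 0-based start positions of exact
    matches and the number of character comparisons performed.
    """
    n, m = len(genome), len(query)
    hits = []
    comparisons = 0
    for start in range(n - m + 1):
        matched = True
        for offset in range(m):
            comparisons += 1
            if genome[start + offset] != query[offset]:
                matched = False
                break  # stop at the first mismatch
        if matched:
            hits.append(start)
    return hits, comparisons


def worst_case_comparisons(n, m):
    """Upper bound on character comparisons: (N - M + 1) * M."""
    return (n - m + 1) * m


if __name__ == "__main__":
    print(brute_force_search("TGATTACAGATTACC", "GATTACA"))
    print(worst_case_comparisons(3_200_000_000, 7))
```

The bound applies to a single query; with millions of reads the total grows proportionally, which is why an index is needed.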
Solution
Index the Reference Genome
Indexing the reference is like constructing a phone book: you can move quickly to the relevant portion of the genome and ignore the rest.
Short Read Alignment: Suffix Array
Split genome into all suffixes (substrings) and sort alphabetically
Allows query to be searched against an alphabetical reference, skipping 96% of the genome
Ex: banana
Suffixes: banana, anana, nana, ana, na, a
Sorted: a, ana, anana, banana, na, nana
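The banana example can be reproduced with a naive suffix array builder. This sketch sorts the suffixes directly, which is fine for a toy string but far too slow and memory-hungry for a real genome; the function name is an assumption made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction: materialises every suffix as a sort key.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    text = "banana"
    for start in build_suffix_array(text):
        print(start, text[start:])
```

Storing only the start positions, rather than the suffix strings themselves, is what keeps the array linear in the length of the text.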
Short Read Alignment: Binary Search
• Searching the index efficiently is still a problem…
Search for GATTACA…
Binary Search
• Initialize search range to entire list
– mid = (hi + lo) / 2; middle = suffix[mid]
– if query matches middle: done
– else if query < middle: pick low range
– else if query > middle: pick hi range
• Repeat until done or empty range
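The binary search steps can be written out as runnable Python. This is a sketch under simple assumptions: the suffix array is built naively, only exact matches are found, and only one matching position is returned even if several exist. Function names are illustrative.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, sorted alphabetically (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrence(text, suffix_array, query):
    """Binary search for query among the sorted suffixes.

    Returns the 0-based position of one exact match, or None if absent.
    """
    lo, hi = 0, len(suffix_array) - 1
    while lo <= hi:  # stop when the range is empty
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        middle = text[start:start + len(query)]  # prefix of the middle suffix
        if query == middle:
            return start
        elif query < middle:
            hi = mid - 1  # pick the low range
        else:
            lo = mid + 1  # pick the high range
    return None


if __name__ == "__main__":
    genome = "TGATTACAGATTACC"
    index = build_suffix_array(genome)
    print(find_occurrence(genome, index, "GATTACA"))
```

Each step halves the search range, so a lookup needs on the order of log2(N) string comparisons instead of N.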
Applied to Human Genome
• In practice, simple methods of indexing the genome can create very large data structures
– Suffix Array: > 12 GB
• Solution: Apply complex procedures that allow you to index and compress the data:
– Burrows-Wheeler Transform
– FM-Index
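To give a feel for the Burrows-Wheeler Transform, here is a toy version that sorts all rotations of the text and keeps the last column. It assumes a "$" terminator that does not occur in the input and sorts before every other character. Real aligners build the transform from a suffix array and pair it with an FM-index; none of that is shown here.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy implementation)."""
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]  # drop the terminator


if __name__ == "__main__":
    encoded = bwt("banana")
    print(encoded, inverse_bwt(encoded))
```

The transform is reversible and tends to group identical characters into runs, which is what makes the result compressible.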
Short Read Mapping: Mapping Quality
• Have also ignored quality scores of reads
• Mapping Quality (for a read): Sum the quality scores at mismatched bases for the alignment (SUM_BASE_Q(best)), and also consider all other possible alignments
$MQ = -\log_{10}\left(1 - \frac{10^{-\mathrm{SUM\_BASE\_Q}(best)}}{\sum_i 10^{-\mathrm{SUM\_BASE\_Q}(i)}}\right)$
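The mapping quality formula can be evaluated directly. The sketch below follows the formula as written on the slide; the function name and the cap are assumptions added for this example (the cap avoids taking the log of zero when a read has only one candidate alignment). Published aligners use their own scalings and approximations, so treat this as an illustration only.

```python
import math


def mapping_quality(mismatch_quality_sums, cap=60.0):
    """Mapping quality from the summed mismatch qualities of each candidate alignment.

    mismatch_quality_sums[i] plays the role of SUM_BASE_Q(i); the best
    alignment is the one with the smallest sum. `cap` is a hypothetical
    upper limit used when the best alignment is effectively unambiguous.
    """
    weights = [10 ** (-s) for s in mismatch_quality_sums]
    p_best = max(weights) / sum(weights)
    p_wrong = 1 - p_best
    if p_wrong <= 0:
        return cap
    return min(cap, -math.log10(p_wrong))


if __name__ == "__main__":
    print(mapping_quality([1, 1]))  # two equally good placements
    print(mapping_quality([1, 2]))  # one placement clearly better
    print(mapping_quality([0]))     # unique placement, capped
```

A read that fits two locations equally well gets a score near zero, while a read with one clearly best location scores higher.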
Short Read Aligners
• BLAT: BLAST-Like Alignment Tool
• MAQ: First to take into account quality scores
• BWA: First to use Burrows-Wheeler Transform
• Bowtie: Ungapped alignment only
• Bowtie2: Allows indels
• … and many more
LECTURE 2
Identifying and Annotating Genomic Variation for Disease Gene Discovery
Genetic Variation
• dbSNP (NCBI) catalogues > 53 million Single Nucleotide Variations (SNVs) in humans
– 38 million validated
– 22 million in genes
– 36 million with frequencies
• 50-80% of mutations involved in inherited disease are caused by SNVs
SNP vs SNV
• Technically a polymorphism is a variation that doesn’t cause disease and is common in a population
• What is common?
– Greater than 5% in a population is a typical definition
– Definition for rare ranges from < 0.5% to < 1.5%
Discovering Genetic Variation
[Figure: reads aligned to the reference genome, with SNPs visible as positions where reads carry a different base and INDELs visible as gaps in the aligned reads]
Variant Calling: The Absurdly Simple Way
[Figure: reads aligned to the reference genome, with one position where some reads carry T and the others carry A]
Read depth at base: 10 T: 4 A: 6
Genotype: Heterozygous A/T
Variant Calling: The Absurdly Simple Way
• Algorithm:
– Count all aligned bases that pass a quality threshold (e.g. >Q20)
– If the fraction of reads with the alternative base is > lower bound (20%) and < upper bound (80%), call heterozygous alt
– Else if > upper bound, call homozygous alternative
– Else call homozygous reference
• …But what about using base qualities for more than just deciding which reads to keep?
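The threshold-based caller is short enough to write out in full. This sketch assumes the pileup at one position is given as a string of observed bases with a parallel list of Phred qualities; the function name, argument names, and the "no call" outcome for positions with no usable bases are additions for illustration.

```python
def call_genotype(bases, qualities, ref_base,
                  min_quality=20, lower=0.2, upper=0.8):
    """Call a genotype at one position from the fraction of non-reference bases.

    bases      -- observed bases from the reads covering the position
    qualities  -- Phred quality of each observed base
    ref_base   -- base in the reference genome at this position
    """
    # Keep only bases that pass the quality threshold (> Q20 by default).
    kept = [b for b, q in zip(bases, qualities) if q > min_quality]
    if not kept:
        return "no call"
    alt_fraction = sum(1 for b in kept if b != ref_base) / len(kept)
    if lower < alt_fraction < upper:
        return "heterozygous"
    if alt_fraction > upper:
        return "homozygous alternative"
    return "homozygous reference"


if __name__ == "__main__":
    # Read depth 10, T: 4, A: 6, reference base A.
    print(call_genotype("AAAAAATTTT", [30] * 10, "A"))
```

With depth 10, six A and four T, and A as the reference base, the alternative fraction is 0.4, so the call is heterozygous.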
Improving Variant Calling
• MAQ (Mapping and Assembling with Quality):
– Short Read Mapper and Genotype Caller
– First to use base qualities for either
– Introduced mapping quality
Improving Variant Calling
① Base quality cannot be more reliable than the mapping quality of the read
② An individual can have at most two real nucleotides at a position (two alleles)
① Only consider the two most frequent nucleotides
② Simplify to two states: A and B
Improving Variant Calling
• Three Possible Genotypes:
– AA, BB, AB
• Construct a model that includes base quality to estimate the probability of error
• Calculate the probability of each genotype given the data and error rate
• Genotype with highest probability is called
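One way to make these steps concrete is a small likelihood calculation over the three genotypes. The sketch below is a simplified two-state model written for this illustration, not the MAQ or samtools implementation: each read is treated as an independent draw, a miscalled A is assumed to read as B and vice versa, and the priors default to uniform.

```python
import math


def genotype_posteriors(observed, qualities, allele_a, allele_b, priors=None):
    """Posterior probability of genotypes AA, AB and BB at one position.

    observed  -- bases seen in the reads (only allele_a / allele_b are used)
    qualities -- Phred quality of each observed base
    priors    -- optional dict of prior probabilities (uniform if omitted)
    """
    fraction_a = {"AA": 1.0, "AB": 0.5, "BB": 0.0}  # expected share of allele A
    if priors is None:
        priors = {g: 1.0 / 3.0 for g in fraction_a}
    log_post = {}
    for genotype, frac in fraction_a.items():
        total = math.log(priors[genotype])
        for base, q in zip(observed, qualities):
            e = 10 ** (-q / 10)  # error probability from base quality
            p_a = frac * (1 - e) + (1 - frac) * e  # chance of observing A
            if base == allele_a:
                total += math.log(p_a)
            elif base == allele_b:
                total += math.log(1 - p_a)
        log_post[genotype] = total
    # Normalise in log space to avoid underflow.
    peak = max(log_post.values())
    norm = sum(math.exp(v - peak) for v in log_post.values())
    return {g: math.exp(v - peak) / norm for g, v in log_post.items()}


if __name__ == "__main__":
    post = genotype_posteriors("AAAAAATTTT", [30] * 10, "A", "T")
    print(max(post, key=post.get), post)
```

Unlike a fixed 20%/80% cutoff, low-quality bases here contribute less evidence rather than being simply kept or dropped.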
The Model
m = ploidy (2)
k = number of reads
g = genotype
e = error probability
[Equation not recovered from slide: genotype likelihood with separate terms for reads that match the reference and reads that don’t match the reference]
Improving Variant Calling
• Two widely used tool sets for calling variants– samtools (uses MAQ-type calculation)– Genome Analysis Toolkit (GATK)
UnifiedGenotyper
• UnifiedGenotyper: Capable of calling both indels and single nucleotide polymorphisms (SNPs) and allele frequencies given multiple samples
UnifiedGenotyper
Apply filters to discard poor reads and remove biases:
① Duplicate reads
② Malformed reads (i.e. mismatch between number of bases and number of base qualities)
③ Bad mate (paired-end sequencing, paired reads map to different chromosomes)
④ Mapping quality zero (maps to multiple locations equally well)
⑤ Too many mismatches: a read must have fewer than 10% mismatches in the 20bp to either side of the position
Remove Duplicate Reads
Application   Avg #Molecules/Library   Read Length   Avg #Molecules Sampled   Molecules Sampled > 1
30X Genome    5bn                      2x100         450m                     4.4%
4x Genome     5bn                      2x100         60m                      0.6%
100x Exome    500m                     2x75          20m                      2.0%
Duplicate reads break the assumption of independent sampling from the library
Identify reads with identical start/stop positions
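Grouping reads by identical start/stop positions can be sketched as follows. The `Read` record and its fields are hypothetical, introduced only for this example; including strand in the key and keeping the highest-quality read of each group are also assumptions, and production tools apply more elaborate rules (for instance for paired-end data).

```python
from collections import namedtuple

# Hypothetical minimal read record for this illustration.
Read = namedtuple("Read", "name chrom start end strand mean_quality")


def remove_duplicates(reads):
    """Keep one read per (chrom, start, end, strand), preferring higher quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.end, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.end))


if __name__ == "__main__":
    reads = [
        Read("r1", "chr1", 100, 175, "+", 30),
        Read("r2", "chr1", 100, 175, "+", 35),  # same coordinates as r1
        Read("r3", "chr1", 120, 195, "+", 28),
    ]
    print([r.name for r in remove_duplicates(reads)])
```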
Sequencer-Specific Error Models
If a base was miscalled, what is it most likely to be called as instead? (each row sums to 100%)
Actual Base   Predicted A   Predicted C   Predicted G   Predicted T
A             -             57.7          17.1          25.2
C             34.9          -             11.3          53.9
G             31.9          5.1           -             63.0
T             45.9          22.1          32.0          -
Variant Calling
• SNP calls infested with false positives
– Machine artifacts
– Mis-mapped reads
– Mis-aligned indels
• 5 – 20% false positive rate
Decisions and Trade-Offs
• Option 1: Use stringent program options for calling variants and hard filtering early to produce only a highly-confident call set.
– Pro: Few false positives
– Con: Will miss real variants
• Option 2: Use less stringent (but reasonable) options and filtering. Produce high-confidence call set. Progressive filtering at later stage
– Pro: Won’t miss real variants
– Con: Many more false positives
How Good Are My Calls?
• How many called SNPs?
– Human average of 1 heterozygous SNP / 1000 bases
• Fraction of variants already in dbSNP
• Transition/Transversion ratio
– Transitions 2x as common
– 2.8x when looking only at exons
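The transition/transversion check is easy to compute from a list of called SNVs. This sketch assumes each call is a (reference base, alternative base) pair; the function name is made up for the example.

```python
# Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions for (ref, alt) base pairs."""
    transitions = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    transversions = sum(
        1 for ref, alt in snvs if ref != alt and (ref, alt) not in TRANSITIONS
    )
    if transversions == 0:
        return float("inf")
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C")]
    print(ti_tv_ratio(calls))
```

A call set whose ratio falls well below the expected value is a hint that it contains many false positives, since random errors favour transversions (eight of the twelve possible substitutions).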
ANNOTATING VARIANTS
Identifying Genetic Variation Causing Genetic Disease
Discovering Genetic Variants Causing Mendelian Disease
4 million genetic variants
2 million associated with protein-coding genes
10,000 possibly of disease-causing type
1,500 at <1% frequency in population
Single causal genetic variant
If a problem cannot be solved, enlarge it.
--Dwight D. Eisenhower
TYPES OF SINGLE NUCLEOTIDE VARIANTS
Disease Genomics: Hunting Down Pathogenic Genetic Variation
[Figure: gene model (Exon 1, Intron 1, Exon 2) shown for the reference and for the patient, marking the start codon, the TAA stop codon, the splice sites and the mRNA coding for protein. Patient variants illustrated: splice site loss, missense, frameshift, and stop gain (TAC Tyr changed to TAA Stop)]
GENETIC REGIONS OF INTEREST
Identifying Genetic Regions of Interest
Number of Genes in Genomic Regions of Interest
FREQUENCY OF GENETIC VARIANTS
Frequency of Polymorphisms: Common vs Rare
• Mendelian disorders are caused by rare variation, < 1-2% frequency in the relevant population
• Leverage large projects aimed at assessing genetic diversity in populations around the world
– 1000 Genomes
– NHLBI Exome Sequencing Project
Human Populations
Population Matters
• Most variations in protein-coding genes occurred fairly recently (last 20,000 years)
– Adaptation to agriculture and diet changes, pathogen exposure and urban living
• Monogenic diseases have different prevalence in different populations
– Cystic fibrosis in European populations
– Hereditary hemochromatosis in Northern Europeans
– Tay-Sachs in Ashkenazi Jews
– Sickle-cell anemia in sub-Saharan African populations
• Polygenic disorders
1000 Genomes Project
Exome Sequencing Project
• Multi-Institutional
• Total possible patient pool of > 250,000 individuals, well phenotyped
– Includes healthy individuals and diseased
• Currently 6700 exomes sequenced
– 4420 European descent
– 2312 African American
• 1.2 million coding variations
– Most extremely rare/unique
– Many population specific
IGNITE Project: Local Controls
• IGNITE: Tasked with studying rare monogenic diseases identified in Atlantic Canada
• Atlantic Canada harbours several non-represented population groups and sub-groups…
– Acadians
– Native American
– Non-Acadian/European Descent
Population Frequency
• Mendelian disorders are rare
• If variation is in database, is it associated with disease?
• Causal variation also needs to be rare
– Cutoff somewhere in the < 0.5% to < 1.5% range
– Should appear rarely or not at all in local controls
– Should track with disease in family members under study
Predicting the Impact of Missense Mutations
• Most use some level of evolutionary conservation to determine how severe a mutation is
– SIFT
– PolyPhen
– GERP++
– EvoD
Example: SIFT Algorithm
[Figure: flow chart. Input query sequence -> PSI-BLAST -> Homologs -> Alignment -> Multiple sequence alignment -> PSSM -> Normalize by most frequent AA -> Score]
Predicting Impact
• Other approaches include additional features:
– Protein structure information
– Site level annotation (active sites, binding sites, etc)
– Protein domain information
– Biophysical properties of amino acids in that position and of the substituted amino acid
Prediction Take-Away
The more conserved a site is the more likely any substitution is to be deleterious
However: Current methods have pretty poor performance, not suitable for clinical-level diagnosis
Classifying Genetic Variants
[Figure: classification tree for 4 million variants. Node labels: Intronic; Unknown; Splice Site; Potential Disease Causing; Exonic; Amino Acid Changing; Known Genetic Disease Variant; Stop Loss / Stop Gain; Missense Mutation; Known Polymorphism in Population; Silent Mutation; Splice Site; Potential Disease Causing; Intergenic]
GENE LEVEL ANNOTATION
Annotating Genes and Variants
• Is variant in a known protein-coding gene?
– What does the gene do?
– What molecular pathways?
– What protein-protein interactions?
– What tissues is it expressed in?
– When in development?
Gene Level Annotations
ADDING ANNOTATIONS TO VARIANTS
Genomic Intervals, Searching, and Annotation
• Most common way of describing genomic features is as an interval
• Multiple formats (BED, WIG, VCF, etc)
• In common for all is location:
– Chromosome
– Start Position of Feature
– End Position of Feature
– Annotations/Info (Optional)
Searching and Annotating: Interval Trees
• Interval Trees allow efficient searching of all overlapping intervals
• Easiest to make one tree per chromosome
• Given a set of intervals (n) on a number line (chromosome), construct a tree
Interval Trees
Left subtree: all intervals to the left of the centre point
Right subtree: all intervals to the right of the centre point
Node Contains:
- Centre point
- Intervals sorted by start
- Intervals sorted by end
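A centred interval tree with this node layout can be sketched in Python. This is a minimal version for one chromosome that answers "which features overlap this position?"; the class name, the choice of the median endpoint as centre point, and the (start, end, label) tuple format with closed coordinates are assumptions made for this example.

```python
class IntervalNode:
    """One node of a centred interval tree over (start, end, label) tuples."""

    def __init__(self, intervals):
        # Centre point: median of all endpoints, so it lies inside
        # at least one interval and the recursion always shrinks.
        endpoints = sorted(p for iv in intervals for p in iv[:2])
        self.centre = endpoints[len(endpoints) // 2]
        left = [iv for iv in intervals if iv[1] < self.centre]
        right = [iv for iv in intervals if iv[0] > self.centre]
        here = [iv for iv in intervals if iv[0] <= self.centre <= iv[1]]
        self.by_start = sorted(here, key=lambda iv: iv[0])
        self.by_end = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query_point(self, pos):
        """Return every stored interval that contains pos."""
        hits = []
        if pos < self.centre:
            for iv in self.by_start:  # smallest start first
                if iv[0] > pos:
                    break
                hits.append(iv)
            if self.left:
                hits += self.left.query_point(pos)
        elif pos > self.centre:
            for iv in self.by_end:  # largest end first
                if iv[1] < pos:
                    break
                hits.append(iv)
            if self.right:
                hits += self.right.query_point(pos)
        else:
            hits.extend(self.by_start)
        return hits


if __name__ == "__main__":
    genes = [(100, 200, "geneA"), (150, 300, "geneB"), (400, 500, "geneC")]
    tree = IntervalNode(genes)
    print(sorted(iv[2] for iv in tree.query_point(175)))
```

Keeping the node's intervals sorted both by start and by end lets a query stop scanning as soon as an interval can no longer contain the position, which is why each node stores the two lists.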
CASE STUDIES
IGNITE: Brain Calcification, Charcot-Marie-Tooth and Cutis Laxa
IGNITE Data Pipeline and Integration
Mapped Region(s)
Known Genes
Gene Definitions
Pathway and Interactions
Annotated Genomic Variants
Filter
Sort
Prioritize
Gene Annotations
Brain Calcification
Brain Calcification
• 84 genes in chromosome 5 region
• No likely homozygous or compound heterozygous variants within region shared between two patients
• 29 genes with at least one targeted region with little or no sequencing coverage
• Many only lacked coverage in 5’ and 3’ UTRs
• Collaborators performed statistical tests for possible copy-number variations of targeted regions using exome sequencing data
Brain Calcification
Charcot-Marie-Tooth: Genetic Mapping
Chromosome 9:120,962,282 -133,033,431
Cutis Laxa: Genetic Mapping
Chromosome 17:79,596,811-81,041,077
Charcot-Marie-Tooth
• 143 genes in region
• 13 known causative genes
– MPZ
– PMP22
– GDAP1
– KIF1B
– MFN2
– SOX
– EGR2
– DNM2
– RAB7
– LITAF (SIMPLE)
– GARS
– YARS
– LMNA
Cutis Laxa
• 52 genes in region
• 5 known causative genes
– ATP6V0A2
– ELN
– FBLN5
– EFEMP2
– SCYL1BP1
– ALDH18A1
Pathway and Interaction Data
• 37 pathways
– Clathrin-derived vesicle budding
– Lysosome vesicle biogenesis
– Endocytosis
– Golgi-associated vesicle biogenesis
– Membrane trafficking
– Trans-Golgi network vesicle budding
• Primarily LMNA or DNM2
• 10 pathways
– Phagosome
– Collecting duct acid secretion
– Lysosome
– Protein digestion and absorption
– Metabolic pathways
– Oxidative phosphorylation
– Arginine and proline metabolism
• Primarily ATP6V0A2
Results: Charcot-Marie-Tooth
• 8 Genes Prioritized
Gene      Interactions   Pathway
LRSAM1    Multiple       Endocytosis
DNM1      DNM2           -
FNBP1     DNM2           -
TOR1A     LMNA           -
STXBP1    Multiple       Five
SH3GLB2   -              Endocytosis
PIP5KL1   -              Endocytosis
FAM125B   -              Endocytosis
• For more information
– Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
Results: Cutis Laxa
• 10 genes prioritized
Gene      Interactions   Pathway
HEXDC     Multiple       Phagosome
HG5       -              Phagosome
HG5       Multiple       Lysosome, Protein digestion
SIRT7     Multiple       Metabolic Pathways
FASN      -              Metabolic Pathways
DCXR      -              Metabolic Pathways
PYCR1     -              Metabolic Pathways, Arginine/Proline
PCYT2     -              Metabolic Pathways
ARHGDIA   -              Oxidative Phosphorylation
• For more information
– Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9