bioinformatics in gene research
DESCRIPTION
Intended for a mixed/general audience of clinicians, business interests, and research scientists. No audio; however, the event was recorded and posted to YouTube by Genome Atlantic: http://www.youtube.com/watch?v=FLVjwOngu-Q
TRANSCRIPT
Genetics Noon Symposium Series
Daniel Gaston, PhD
Dr. Karen Bedard Lab, Department of Pathology
Bioinformatics in Genetics Research
November 21st, 2012
IGNITE
Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment
Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada
Develop animal and cell culture models
Identify and develop novel therapeutics
igniteproject.ca
Outline
Introduction
Bioinformatics in Disease Genomics
Next-Generation Sequencing
Genomics in Research and the Clinic
The Data Deluge and its Solutions
Bioinformatic Methods for Analyzing Genomic Data
Case Studies
Conclusion
Bioinformatics in Disease Genomics
Handling and long-term storage of raw data (sequencing, gene expression, etc.)
Maintenance and support of computational infrastructure
Experimental design
Data analysis
Methods development
Analysis pipelines
Statistical analyses
Algorithm design
‘Next-Generation’ Sequencing and Disease Genomics
Disease Genomics: Hunting Down Pathogenic Genetic Variation
[Figure: reference and patient gene models, each drawn as Exon 1, Intron 1, Exon 2. Labels: Start, Splice Sites, TAA (Stop), mRNA coding for protein, TAC (Tyr).]
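The figure contrasts a TAA stop codon with a TAC tyrosine codon. The transcript does not say which of the two belongs to the patient, so the sketch below simply classifies a codon change in either direction. The function name and the category strings are illustrative assumptions, and the table covers only a handful of codons from the standard genetic code.

```python
# A deliberately tiny subset of the standard genetic code, enough for this example.
CODON_TABLE = {
    "TAC": "Tyr", "TAT": "Tyr",
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
    "TGC": "Cys", "TGT": "Cys",
}


def classify_codon_change(ref_codon, alt_codon):
    """Return a coarse effect label for a reference -> alternate codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "Stop":
        return "stop gain"
    if ref_aa == "Stop":
        return "stop loss"
    return "amino acid change"


if __name__ == "__main__":
    for ref, alt in [("TAC", "TAA"), ("TAA", "TAC"), ("TAC", "TAT"), ("TAC", "TGC")]:
        print(ref, "->", alt, ":", classify_codon_change(ref, alt))
```

A single base change is enough to move between these categories, which is why each coding variant has to be checked against its codon context.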
Disease Genomics: Research vs Clinic
Still predominantly research oriented
  Complex/Common disease
  Mendelian disorders
  Cancer genomics
Clinical genomics starting to gain traction
  Cancer genomics
    Cancer subtype identification
    Personalized medicine and predicting outcomes
  Mendelian disorders
    Early diagnosis
    Cost effectiveness
Clinical Genomics
Children’s Mercy Hospital NICU
In the US >20% of infant deaths due to genetic disease
Serial sequencing of candidate genes too slow
Children’s Mercy Hospital NICU
50-hour differential diagnosis of monogenic disease
  Sample preparation and sequencing: 30.5 hours
  Automated bioinformatics analysis: 17.5 hours
  Previous high-throughput sequencing methods: 19 days
Test on seven infants, two previously diagnosed using standard methods, five undiagnosed
Caveats
  Bioinformatics portion not available outside of hospital
  Requires thorough clinical phenotyping using a controlled vocabulary
  Generates a large amount of data
The Data Deluge
4 million genetic variants
2 million associated with protein-coding genes
10,000 possibly of disease-causing type
1,500 with <1% frequency in population
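To make the scale of that reduction concrete, the sketch below expresses each stage of the funnel as a fraction of the starting 4 million variants. The counts are the ones on the slide; the function name is an illustrative assumption.

```python
# Counts taken from the slide; labels shortened for printing.
FUNNEL = [
    ("genetic variants", 4_000_000),
    ("associated with protein-coding genes", 2_000_000),
    ("possibly of disease-causing type", 10_000),
    ("<1% frequency in population", 1_500),
]


def funnel_fractions(stages):
    """Return (label, count, fraction of the first stage) for each stage."""
    total = stages[0][1]
    return [(label, count, count / total) for label, count in stages]


if __name__ == "__main__":
    for label, count, fraction in funnel_fractions(FUNNEL):
        print(f"{count:>9,}  {fraction:9.4%}  {label}")
```

Even after all three reductions, roughly 1,500 candidates remain per genome, which is why the later annotation and prioritization steps matter.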
Surviving the Data Deluge
Reducing the Search Space: Exome Sequencing
Exome Sequencing
Exome: Portion of genome composed of protein-coding exons and functional RNA sequences
1.5 - 2% of human genome (50 Mb)
> 85% of monogenic diseases due to variants in exome
Complete exome sequencing: ~ $1000/sample
Caveats
Incomplete and non-uniform coverage of exome
  Systematic bias (GC content)
  Random sampling
Not all genetic variants amenable to discovery
  Non-coding variants
  Structural variants
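Two of these caveats can be measured directly. The sketch below computes the GC content of a target sequence and the fraction of target positions covered at or above a minimum depth. The function names and the 20x threshold are assumptions for illustration, not values from the talk.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def breadth_of_coverage(depths, min_depth=20):
    """Fraction of positions whose read depth is at least min_depth.

    min_depth=20 is an arbitrary illustrative threshold.
    """
    depths = list(depths)
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
    print(breadth_of_coverage([0, 10, 20, 30]))
```

Comparing these two numbers per exon is one simple way to see whether poorly covered targets are also the GC-extreme ones.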
Surviving The Data Deluge
Bioinformatics
Typical Bioinformatics Workflow
QC of Raw Data
Map to Reference
QC
Find Variants
QC
Annotate
Filter
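As one illustration of the first stage, QC of raw data often includes filtering reads on base quality. The sketch below assumes FASTQ quality strings in the common Phred+33 encoding; the function names and the mean-quality threshold of 20 are illustrative assumptions, not the tools or settings used in the talk.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def mean_quality(quality, offset=33):
    """Mean Phred score of a read, or 0.0 for an empty quality string."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores) if scores else 0.0


def passes_qc(quality, min_mean=20.0):
    """Keep a read only if its mean quality reaches min_mean (illustrative threshold)."""
    return mean_quality(quality) >= min_mean


if __name__ == "__main__":
    for quality in ["IIII", "I!5", "!!!!"]:
        print(quality, phred_scores(quality), passes_qc(quality))
```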
It sounds simple but…
For every stage there are multiple programs available and published in the literature
For every program there is a wide variety of parameter values and options. Defaults are often “good enough”, but not always
Best combinations of programs and options not well understood
Protocols changing rapidly as new technologies and methods are developed
Different centres and groups use slightly different workflows, with similar but not identical results
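One way to quantify "similar but not identical" is to compare the variant calls two workflows produce for the same sample. This is a minimal sketch, assuming each call is represented as a (chromosome, position, reference allele, alternate allele) tuple; the representation and function name are assumptions for illustration.

```python
def concordance(calls_a, calls_b):
    """Summarize agreement between two sets of variant calls."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared calls over all distinct calls.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    workflow_a = [("9", 100, "A", "G"), ("9", 200, "C", "T"), ("17", 300, "G", "A")]
    workflow_b = [("9", 100, "A", "G"), ("17", 300, "G", "A"), ("17", 400, "T", "C")]
    print(concordance(workflow_a, workflow_b))
```

The calls unique to one workflow are usually the ones worth inspecting by hand.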
Annotating Variants
If a problem cannot be solved, enlarge it.
--Dwight D. Eisenhower
Annotations Associated with Genomic Variants
Is variant in a known protein-coding gene?
  What does the gene do?
  What molecular pathways?
  What protein-protein interactions?
  What tissues is it expressed in?
  When in development?
Has this variant been seen before?
  What population(s)?
  With what frequency?
  Has it been seen in local sequencing projects?
  Is there any known clinical significance?
What is the effect of the variation?
  Does it change the resulting protein?
  How?
4 million genetic variants
2 million associated with protein-coding genes
10,000 possibly of disease-causing type
1,500 with <1% frequency in population
Gene Annotation Resources
Variant Annotation Resources
Potential Pitfalls with Annotation Sources
Databases often overlap and agree, but there may be disagreements
Source of information: Predicted versus experimental
Incorrect and out-of-date information
Large-scale un-validated versus manually curated datasets
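Disagreements between sources are easier to handle when they are surfaced rather than silently overwritten. The sketch below merges per-variant annotations from several sources and flags any variant on which the sources differ. The source names, variant identifiers, and function name are hypothetical placeholders.

```python
def merge_annotations(sources):
    """Combine annotations and flag disagreements.

    sources maps a source name to a dict of {variant_id: annotation value}.
    Returns {variant_id: {"by_source": {...}, "consistent": bool}}.
    """
    merged = {}
    for source_name, annotations in sources.items():
        for variant_id, value in annotations.items():
            merged.setdefault(variant_id, {})[source_name] = value

    report = {}
    for variant_id, by_source in merged.items():
        distinct_values = set(by_source.values())
        report[variant_id] = {
            "by_source": by_source,
            "consistent": len(distinct_values) == 1,
        }
    return report


if __name__ == "__main__":
    example = {
        "source_a": {"var1": "benign", "var2": "pathogenic"},
        "source_b": {"var1": "benign", "var2": "benign"},
    }
    for variant_id, entry in merge_annotations(example).items():
        print(variant_id, entry)
```

Keeping the per-source values lets a reviewer weigh a manually curated entry differently from a large-scale unvalidated one.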
Bioinformatics Analyses of Genomic Variants
Combining Data Sources and Filtering
IGNITE Data Pipeline and Integration
[Figure: data integration pipeline. Labels: Mapped Region(s), Known Genes, Gene Definitions, Pathway and Interactions, Gene Annotations, Annotated Genomic Variants; Filter, Sort, Prioritize.]
Filtering the Data: Categorization
[Figure: tree sorting 4 million variants into categories. Labels, in slide order:]
Intronic
Unknown
Splice Site: Potential Disease Causing
Exonic
Amino Acid Changing
Known Genetic Disease Variant
Stop Loss / Stop Gain
Amino Acid Change Likely Pathogenic
Amino Acid Change Likely Benign
Known Polymorphism in Population
Silent Mutation
Splice Site: Potential Disease Causing
Intergenic
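A categorization like this can be written as a small decision function. The sketch below uses the category names from the slide, but the branch order and the input fields (`region`, `splice_site`, `effect`, and the boolean flags) are assumptions, since the exact tree structure is not recoverable from the transcript.

```python
def categorize(variant):
    """Assign a variant (a dict of hypothetical annotation fields) to one category."""
    region = variant["region"]  # "exonic", "intronic" or "intergenic"
    if region == "intergenic":
        return "Intergenic"
    if variant.get("splice_site", False):
        return "Splice Site: Potential Disease Causing"
    if region == "intronic":
        return "Intronic: Unknown"

    # Remaining variants are exonic.
    effect = variant.get("effect")
    if effect == "silent":
        return "Silent Mutation"
    if variant.get("known_disease_variant", False):
        return "Known Genetic Disease Variant"
    if variant.get("known_polymorphism", False):
        return "Known Polymorphism in Population"
    if effect in ("stop loss", "stop gain"):
        return "Stop Loss / Stop Gain"
    if variant.get("predicted_pathogenic", False):
        return "Amino Acid Change Likely Pathogenic"
    return "Amino Acid Change Likely Benign"


if __name__ == "__main__":
    print(categorize({"region": "exonic", "effect": "stop gain"}))
    print(categorize({"region": "intronic", "splice_site": True}))
```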
Filtering the Data: Common or Rare?
Variants in dbSNP – Typically known polymorphisms, unlikely to be associated with rare disease
Variants with relatively high frequency in control populations (1000 Genomes, HapMap, EVS, 2800 Exomes)
Number of times variant previously seen at sequencing centre/locally
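These three criteria combine naturally into a single rarity check. The sketch below is one possible reading: the 1% frequency cutoff echoes the "<1% frequency" figure from the Data Deluge slide, while the local-count cutoff, the field names, and the option to exclude dbSNP entries outright are assumptions for illustration.

```python
def is_candidate_rare(variant, max_freq=0.01, max_local_count=2, exclude_dbsnp=True):
    """Return True if a variant looks rare enough to keep as a candidate.

    variant is a dict with hypothetical fields:
      in_dbsnp      - bool, variant is listed in dbSNP
      control_freqs - {population name: allele frequency}
      local_count   - times the variant was seen locally
    """
    if exclude_dbsnp and variant.get("in_dbsnp", False):
        return False
    frequencies = variant.get("control_freqs", {})
    if any(freq >= max_freq for freq in frequencies.values()):
        return False
    return variant.get("local_count", 0) <= max_local_count


if __name__ == "__main__":
    print(is_candidate_rare({"control_freqs": {"1000 Genomes": 0.001}, "local_count": 0}))
    print(is_candidate_rare({"control_freqs": {"EVS": 0.05}}))
```

Making each cutoff a parameter keeps the filter easy to loosen when a hard filter risks discarding the causal variant.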
Notes on Filtering and Variant Annotation
Very important to be aware of population when referencing frequency of a variant. Incorrect background leads to incorrect assumptions on prevalence
Reasonably well-sampled local populations are better than any other reference
Strike a balance between hard filtering for variants of largest potential effect and being inclusive so as not to miss variants
Some genes acquire large-effect variants (stop loss / stop gain, etc.) frequently. Some genes can be lost without causing disease
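The last point suggests checking how often each gene carries large-effect variants in unaffected individuals before treating such a variant as suspicious. This is a minimal sketch, assuming control variants are available as dicts with `gene` and `effect` fields; the gene names, field names, and the count threshold are hypothetical.

```python
from collections import Counter

LARGE_EFFECT = {"stop loss", "stop gain"}


def tolerant_genes(control_variants, min_count=3):
    """Genes seen with at least min_count large-effect variants in controls."""
    counts = Counter(
        variant["gene"]
        for variant in control_variants
        if variant["effect"] in LARGE_EFFECT
    )
    return {gene for gene, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    controls = (
        [{"gene": "GENE_A", "effect": "stop gain"}] * 3
        + [{"gene": "GENE_B", "effect": "stop gain"}]
        + [{"gene": "GENE_B", "effect": "silent"}] * 5
    )
    print(tolerant_genes(controls))
```

A large-effect variant in a gene flagged this way would be a weaker candidate than the same kind of variant in a gene that is rarely disrupted in controls.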
Applications to Real Data
Charcot-Marie-Tooth Disease and Cutis Laxa
Charcot-Marie-Tooth: Genetic Mapping
Chromosome 9:120,962,282-133,033,431
Cutis Laxa: Genetic Mapping
Chromosome 17:79,596,811-81,041,077
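Mapped intervals like these are what restrict the variant search. The sketch below parses the two region strings from the slides, reports their lengths, and tests whether a position falls inside a region. It assumes the coordinates are inclusive at both ends; the function names are illustrative.

```python
def parse_region(text):
    """Parse 'Chromosome 9:120,962,282-133,033,431' into ('9', start, end)."""
    cleaned = text.replace(",", "").replace(" ", "")
    chrom, span = cleaned.split(":")
    if chrom.lower().startswith("chromosome"):
        chrom = chrom[len("chromosome"):]
    start, end = span.split("-")
    return chrom, int(start), int(end)


def region_length(region):
    """Length in base pairs, assuming inclusive start and end coordinates."""
    _, start, end = region
    return end - start + 1


def in_region(region, chrom, position):
    """True if the position lies on the region's chromosome and within its bounds."""
    region_chrom, start, end = region
    return chrom == region_chrom and start <= position <= end


if __name__ == "__main__":
    cmt = parse_region("Chromosome 9:120,962,282-133,033,431")
    cutis_laxa = parse_region("Chromosome 17:79,596,811-81,041,077")
    print(cmt, region_length(cmt))
    print(cutis_laxa, region_length(cutis_laxa))
```

Under that assumption the Charcot-Marie-Tooth interval spans about 12.1 Mb and the Cutis Laxa interval about 1.4 Mb.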
Charcot-Marie-Tooth
143 genes in region
13 known genes in genome: MPZ, PMP22, GDAP1, KIF1B, MFN2, SOX, EGR2, DNM2, RAB7, LITAF (SIMPLE), GARS, YARS, LMNA
Cutis Laxa
52 genes in region
5 known genes in genome: ATP6V0A2, ELN, FBLN5, EFEMP2, SCYL1BP1, ALDH18A1
Pathway and Interaction Data
Charcot-Marie-Tooth
37 pathways
  Clathrin-derived vesicle budding
  Lysosome vesicle biogenesis
  Endocytosis
  Golgi-associated vesicle biogenesis
  Membrane trafficking
  Trans-Golgi network vesicle budding
Primarily LMNA or DNM2
Cutis Laxa
10 pathways
  Phagosome
  Collecting duct acid secretion
  Lysosome
  Protein digestion and absorption
  Metabolic pathways
  Oxidative phosphorylation
  Arginine and proline metabolism
Primarily ATP6V0A2
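The prioritization idea here is that a gene in the mapped region becomes more interesting if it interacts with a known disease gene or shares a pathway with one. The sketch below is one possible scoring, not the IGNITE pipeline itself: the gene names are hypothetical placeholders and the simple "count of links" score is an assumption.

```python
def prioritize(candidates, known_genes, interactions, pathways):
    """Rank candidate genes by their links to known disease genes.

    interactions: {gene: set of interacting genes}
    pathways:     {gene: set of pathway names}
    Returns (gene, interacting known genes, shared pathways) tuples,
    most links first; candidates with no links are dropped.
    """
    known = set(known_genes)
    known_pathways = set()
    for gene in known:
        known_pathways |= pathways.get(gene, set())

    ranked = []
    for gene in candidates:
        partners = interactions.get(gene, set()) & known
        shared = pathways.get(gene, set()) & known_pathways
        if partners or shared:
            ranked.append((gene, sorted(partners), sorted(shared)))
    ranked.sort(key=lambda row: (-(len(row[1]) + len(row[2])), row[0]))
    return ranked


if __name__ == "__main__":
    result = prioritize(
        candidates=["CAND_A", "CAND_B", "CAND_C"],
        known_genes=["KNOWN1"],
        interactions={"CAND_A": {"KNOWN1"}, "CAND_B": {"OTHER"}},
        pathways={"KNOWN1": {"Endocytosis"}, "CAND_A": {"Endocytosis"}, "CAND_C": {"Endocytosis"}},
    )
    for row in result:
        print(row)
```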
Results: Charcot-Marie-Tooth
8 Genes Prioritized
Gene      Interactions  Pathway
LRSAM1    Multiple      Endocytosis
DNM1      DNM2          -
FNBP1     DNM2          -
TOR1A     MNA           -
STXBP1    Multiple      Five
SH3GLB2   -             Endocytosis
PIP5KL1   -             Endocytosis
FAM125B   -             Endocytosis
For more information Guernsey et al (2010) PLoS Genetics. 6(8): e1001081
Results: Cutis Laxa
10 genes prioritized
Gene      Interactions  Pathway
HEXDC     Multiple      Phagosome
HG5       -             Phagosome
HG5       Multiple      Lysosome, Protein digestion
SIRT7     Multiple      Metabolic Pathways
FASN      -             Metabolic Pathways
DCXR      -             Metabolic Pathways
PYCR1     -             Metabolic Pathways, Arginine/Proline
PCYT2     -             Metabolic Pathways
ARHGDIA   -             Oxidative Phosphorylation
For more information Guernsey et al (2009) Am J Hum Genet. 85(1): 120-9
Conclusions
Bioinformatics is involved at every stage of genomic research from experimental design through to final analysis
Standards and best practices do exist, but are rapidly evolving as new technologies and methods are developed
Progress towards automatic generation of clinically interpretable genomics studies
Annotation, filtering, and prioritization of genetic variants crucial
Balance between false positive calls and false negatives
Where Are We Headed?
Integration of more data sources
  Gene expression
  More annotation sources
Controlled phenotype vocabularies
  Gene Ontology terms
Predictive models
  Recessive versus Dominant inheritance and Penetrance
“New” and Emerging Technologies
  RNA-Seq (Gene Expression)
  ChIP-Seq (Protein-DNA binding)
  Single-Molecule Sequencing
Acknowledgements
Dalhousie University
  Dr. Karen Bedard
  Dr. Chris McMaster
  Dr. Andrew Orr
  Dr. Conrad Fernandez
  Dr. Marissa Leblanc
  Mat Nightingale
Bedard Lab
IGNITE
McGill/Genome Quebec
  Dr. Jacek Majewski
  Jeremy Schwartzentruber
Dr. Sarah Dyack
Dr. Johane Robataille
Genome Atlantic