Bioinformatics in Gene Research

Genetics Noon Symposium Series

Daniel Gaston, PhD

Dr. Karen Bedard Lab, Department of Pathology

Bioinformatics in Genetics Research

November 21st, 2012

IGNITE

Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment

Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada

Develop animal and cell culture models

Identify and develop novel therapeutics

igniteproject.ca

Outline

Introduction

Bioinformatics in Disease Genomics

Next-Generation Sequencing

Genomics in Research and the Clinic

The Data Deluge and its Solutions

Bioinformatic Methods for Analyzing Genomic Data

Case Studies

Conclusion

Bioinformatics in Disease Genomics

Handling and long-term storage of raw data (sequencing, gene expression, etc.)

Maintenance and support of computational infrastructure

Experimental design

Data analysis

Methods development

Analysis pipelines

Statistical analyses

Algorithm design

‘Next-Generation’ Sequencing and Disease Genomics

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: a reference gene (Exon 1, Intron 1, Exon 2) with its start codon, splice sites and TAA stop codon, and the mRNA coding for protein, compared with patient copies of the same gene. The final patient panel is labelled TAC (Tyr), in contrast to the reference TAA stop codon.]
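The slide contrasts a reference TAA stop codon with a TAC (Tyr) codon. A change of that kind can be classified with a few lines of Python. This is a minimal sketch, not part of the talk: the codon table is deliberately partial and the helper name `classify_codon_change` is hypothetical.

```python
# Partial codon table (standard genetic code); only the codons needed
# for this illustration are listed.
CODON_TABLE = {
    "ATG": "Met",   # start codon
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
    "TAC": "Tyr",
    "TAT": "Tyr",
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon substitution (hypothetical helper)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if ref_aa == "Stop":
        return "stop loss"
    if alt_aa == "Stop":
        return "stop gain"
    return "amino acid change"


print(classify_codon_change("TAA", "TAC"))  # stop loss
```

A real annotation tool would use the full 64-codon table and the transcript's reading frame; the sketch only shows why a single base change at a stop codon matters.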

Disease Genomics: Research vs Clinic

Still predominantly research oriented

Complex/Common disease

Mendelian disorders

Cancer genomics

Clinical genomics starting to gain traction

Cancer genomics: cancer subtype identification; personalized medicine and predicting outcomes

Mendelian disorders: early diagnosis; cost effectiveness

Clinical Genomics

Children’s Mercy Hospital NICU

In the US, >20% of infant deaths are due to genetic disease

Serial sequencing of candidate genes is too slow

50-hour differential diagnosis of monogenic disease

Sample preparation and sequencing: 30.5 hours

Automated bioinformatics analysis: 17.5 hours

Previous high-throughput sequencing methods: 19 days

Tested on seven infants: two previously diagnosed using standard methods, five undiagnosed

Caveats

Bioinformatics portion not available outside of hospital

Requires thorough clinical phenotyping using a controlled vocabulary

Generates a large amount of data

The Data Deluge

4 million genetic variants

2 million associated with protein-coding genes

10,000 possibly of disease-causing type

1,500 with <1% frequency in population
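To make the scale of this funnel concrete, the counts on the slide can be turned into the fraction of the original call set that survives each step. The sketch below uses only the four numbers quoted above; the variable names are illustrative.

```python
# Variant counts at each filtering stage, as quoted on the slide.
funnel = [
    ("genetic variants", 4_000_000),
    ("associated with protein-coding genes", 2_000_000),
    ("possibly of disease-causing type", 10_000),
    ("<1% frequency in population", 1_500),
]

total = funnel[0][1]
for label, count in funnel:
    # Fraction of the original call set that survives to this stage.
    print(f"{count:>9,}  {count / total:9.4%}  {label}")

# Overall fold reduction from the first stage to the last.
overall_reduction = total / funnel[-1][1]
print(f"overall reduction: about {overall_reduction:,.0f}-fold")
```

Even after this reduction of more than three orders of magnitude, over a thousand candidates remain, which is why the annotation and prioritization steps later in the talk are needed.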

Surviving the Data Deluge

Reducing the Search Space: Exome Sequencing

Exome Sequencing

Exome: Portion of genome composed of protein-coding exons and functional RNA sequences

1.5 - 2% of human genome (50 Mb)

> 85% of monogenic diseases due to variants in exome

Complete exome sequencing: ~ $1000/sample

Caveats

Incomplete and non-uniform coverage of exome: systematic bias (GC content), random sampling

Not all genetic variants amenable to discovery: non-coding variants, structural variants

Surviving The Data Deluge

Bioinformatics

Typical Bioinformatics Workflow

QC of Raw Data

Map to Reference

QC

Find Variants

QC

Annotate

Filter
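The workflow is a fixed sequence of stages with a QC checkpoint after each major step. A minimal Python sketch of that structure follows. The stage names come from the slide; everything else (the `make_stage` and `run_workflow` helpers, the state dictionary, the sample identifier) is a hypothetical stand-in, since in practice each stage is an external program.

```python
from typing import Callable, Dict, List

# A stage takes the current state and returns an updated copy.
Stage = Callable[[Dict], Dict]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(state: Dict) -> Dict:
        # A real stage would invoke an aligner, variant caller, etc. here.
        return {**state, "history": state["history"] + [name]}
    return stage


# Stage order as shown on the slide.
WORKFLOW: List[Stage] = [make_stage(name) for name in (
    "QC of raw data",
    "Map to reference",
    "QC",
    "Find variants",
    "QC",
    "Annotate",
    "Filter",
)]


def run_workflow(sample_id: str) -> Dict:
    """Run every stage in order, threading the state through."""
    state: Dict = {"sample": sample_id, "history": []}
    for stage in WORKFLOW:
        state = stage(state)
    return state


result = run_workflow("sample-001")  # hypothetical sample identifier
print(" -> ".join(result["history"]))
```

Keeping the stages in a plain list makes it easy to swap one program for another at a given step, which matters when several tools exist for every stage.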

It Sounds Simple, But…

For every stage there are multiple programs available and published in the literature

For every program there is a wide variety of parameter values and options. Defaults are often “good enough”, but not always

Best combinations of programs and options are not well understood

Protocols are changing rapidly as new technologies and methods are developed

Different centres and groups use slightly different workflows, with similar but not identical results

Annotating Variants

If a problem cannot be solved, enlarge it.

--Dwight D. Eisenhower

Annotations Associated with Genomic Variants

Is the variant in a known protein-coding gene?

What does the gene do? What molecular pathways? What protein-protein interactions? What tissues is it expressed in? When in development?

Has this variant been seen before? What population(s)? With what frequency? Has it been seen in local sequencing projects? Is there any known clinical significance?

What is the effect of the variation? Does it change the resulting protein? How?
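These questions map naturally onto the fields of a per-variant record. The following Python sketch shows one possible layout. All field names, the `is_novel` helper, and the example values (including the placeholder gene `GENE_X` and position) are assumptions for illustration, not the schema used by any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VariantAnnotation:
    """One annotated variant; field names are illustrative assumptions."""
    chrom: str
    pos: int
    ref: str
    alt: str
    # Is the variant in a known protein-coding gene, and what does it do?
    gene: Optional[str] = None
    pathways: List[str] = field(default_factory=list)
    interactions: List[str] = field(default_factory=list)
    expressed_in: List[str] = field(default_factory=list)  # tissues
    # Has this variant been seen before?
    population_freq: Optional[float] = None  # None = not observed
    local_count: int = 0                     # times seen in local projects
    clinical_significance: Optional[str] = None
    # What is the effect of the variation?
    protein_effect: Optional[str] = None     # e.g. "stop gain"

    def is_novel(self) -> bool:
        """True if the variant was never observed in any reference set."""
        return self.population_freq is None and self.local_count == 0


# Placeholder values, not real data.
v = VariantAnnotation("9", 123, "A", "T", gene="GENE_X")
print(v.is_novel())
```

Using `None` for an unobserved frequency keeps "never seen" distinct from "seen at a frequency of zero in a sampled population".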

Gene Annotation Resources

Variant Annotation Resources

Potential Pitfalls with Annotation Sources

Databases often overlap and agree, but there may be disagreements

Source of information: Predicted versus experimental

Incorrect and out-of-date information

Large-scale unvalidated versus manually curated datasets

Bioinformatics Analyses of Genomic Variants

Combining Data Sources and Filtering

IGNITE Data Pipeline and Integration

Mapped Region(s)

Known Genes

Gene Definitions

Pathway and

Interactions

Annotated Genomic Variants

Filter

Sort

Prioritize

Gene Annotations

Filtering the Data: Categorization

[Figure: tree categorizing the 4 million variants]

Intronic: Unknown; Splice Site (Potential Disease Causing)

Exonic: Amino Acid Changing; Known Genetic Disease Variant; Stop Loss / Stop Gain; Amino Acid Change Likely Pathogenic; Amino Acid Change Likely Benign; Known Polymorphism in Population; Silent Mutation; Splice Site (Potential Disease Causing)

Intergenic

Filtering the Data: Common or Rare?

Variants in dbSNP – Typically known polymorphisms, unlikely to be associated with rare disease

Variants with relatively high frequency in control populations (1000 Genomes, HapMap, EVS, 2800 Exomes)

Number of times variant previously seen at sequencing centre/locally
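The common-versus-rare test can be sketched as a simple predicate over control frequencies and local observation counts. In the Python sketch below, the 1% cut-off echoes the "<1% frequency in population" figure from the Data Deluge slide; the local-count threshold, the dictionary layout, and the example variants are all assumptions made for illustration.

```python
from typing import Dict, Optional

MAX_POPULATION_FREQ = 0.01  # the <1% cut-off quoted earlier in the talk
MAX_LOCAL_COUNT = 2         # assumed threshold, not from the talk


def is_rare(variant: Dict) -> bool:
    """Keep a variant only if no control set or local tally marks it common."""
    freqs: Dict[str, Optional[float]] = variant.get("control_freqs", {})
    # A frequency of None means the variant was not observed in that set.
    too_common = any(
        f is not None and f >= MAX_POPULATION_FREQ for f in freqs.values()
    )
    seen_locally_too_often = variant.get("local_count", 0) > MAX_LOCAL_COUNT
    return not too_common and not seen_locally_too_often


# Made-up example variants.
variants = [
    {"id": "var1", "control_freqs": {"1000 Genomes": 0.12, "EVS": 0.10},
     "local_count": 30},
    {"id": "var2", "control_freqs": {"1000 Genomes": None, "EVS": 0.0004},
     "local_count": 0},
    {"id": "var3", "control_freqs": {}, "local_count": 5},
]

rare = [v["id"] for v in variants if is_rare(v)]
print(rare)  # ['var2']
```

The third variant is absent from every control set yet is still dropped, because it has been seen repeatedly at the local centre; that is the case the last bullet on the slide is guarding against.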

Notes on Filtering and Variant Annotation

It is very important to be aware of the population when referencing the frequency of a variant. An incorrect background leads to incorrect assumptions about prevalence

Reasonably well-sampled local populations are better than any other reference

Strike a balance between hard filtering for variants of largest potential effect and being inclusive so as not to miss variants

Some genes frequently acquire large-effect variants (stop loss, stop gain, etc.). Some genes can be lost without causing disease

Applications to Real Data

Charcot-Marie-Tooth Disease and Cutis Laxa

Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9:120,962,282-133,033,431

Cutis Laxa: Genetic Mapping

Chromosome 17:79,596,811-81,041,077
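Restricting the search to a mapped region amounts to an interval-overlap test on gene coordinates. The sketch below uses the two intervals quoted on these slides. The `Region` and `Gene` types, the `overlaps` helper, and the three placeholder genes with made-up coordinates are assumptions for illustration only.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


# Mapped intervals as quoted on the slides.
CMT_REGION = Region("9", 120_962_282, 133_033_431)
CUTIS_LAXA_REGION = Region("17", 79_596_811, 81_041_077)


def overlaps(gene: Gene, region: Region) -> bool:
    """True if the gene shares at least one base with the region."""
    return (
        gene.chrom == region.chrom
        and gene.start <= region.end
        and gene.end >= region.start
    )


# Placeholder genes with made-up coordinates, for illustration only.
genes = [
    Gene("GENE_A", "9", 121_000_000, 121_050_000),
    Gene("GENE_B", "9", 140_000_000, 140_020_000),
    Gene("GENE_C", "17", 80_000_000, 80_010_000),
]

in_cmt = [g.name for g in genes if overlaps(g, CMT_REGION)]
in_cutis_laxa = [g.name for g in genes if overlaps(g, CUTIS_LAXA_REGION)]
print(in_cmt, in_cutis_laxa)
```

The overlap test deliberately keeps genes that straddle a region boundary, which errs on the side of inclusion.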

Charcot-Marie-Tooth

143 genes in region

13 known genes in genome: MPZ, PMP22, GDAP1, KIF1B, MFN2, SOX, EGR2, DNM2, RAB7, LITAF (SIMPLE), GARS, YARS, LMNA

Cutis Laxa

52 genes in region

5 known genes in genome: ATP6V0A2, ELN, FBLN5, EFEMP2, SCYL1BP1, ALDH18A1

Pathway and Interaction Data

Charcot-Marie-Tooth: 37 pathways

Clathrin-derived vesicle budding

Lysosome vesicle biogenesis

Endocytosis

Golgi-associated vesicle biogenesis

Membrane trafficking

Trans-Golgi network vesicle budding

Primarily LMNA or DNM2

Cutis Laxa: 10 pathways

Phagosome

Collecting duct acid secretion

Lysosome

Protein digestion and absorption

Metabolic pathways

Oxidative phosphorylation

Arginine and proline metabolism

Primarily ATP6V0A2

Results: Charcot-Marie-Tooth

8 genes prioritized

Gene      Interactions  Pathway
LRSAM1    Multiple      Endocytosis
DNM1      DNM2          -
FNBP1     DNM2          -
TOR1A     LMNA          -
STXBP1    Multiple      Five
SH3GLB2   -             Endocytosis
PIP5KL1   -             Endocytosis
FAM125B   -             Endocytosis

For more information: Guernsey et al. (2010) PLoS Genetics 6(8): e1001081

Results: Cutis Laxa

10 genes prioritized

Gene      Interactions  Pathway
HEXDC     Multiple      Phagosome
HG5       -             Phagosome
HG5       Multiple      Lysosome, Protein digestion
SIRT7     Multiple      Metabolic Pathways
FASN      -             Metabolic Pathways
DCXR      -             Metabolic Pathways
PYCR1     -             Metabolic Pathways, Arginine/Proline
PCYT2     -             Metabolic Pathways
ARHGDIA   -             Oxidative Phosphorylation

For more information: Guernsey et al. (2009) Am J Hum Genet 85(1): 120-9
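The results tables rank genes in a mapped region by their interactions with known disease genes and by the pathways they share with them. The Python sketch below illustrates that idea with a simple additive score. The scoring rule, the `prioritize` function, and the placeholder gene and pathway sets are assumptions for illustration; the talk does not state how the IGNITE pipeline weights these sources.

```python
from typing import Dict, List, Set, Tuple


def prioritize(
    candidates: Dict[str, Dict[str, Set[str]]],
    known_genes: Set[str],
    known_pathways: Set[str],
) -> List[Tuple[str, int]]:
    """Rank candidates by links to known disease genes and pathways."""
    ranked = []
    for gene, info in candidates.items():
        # Interaction partners that are already known disease genes.
        partners = info.get("interactions", set()) & known_genes
        # Pathways shared with the known disease genes.
        shared = info.get("pathways", set()) & known_pathways
        score = len(partners) + len(shared)  # assumed additive score
        if score > 0:
            ranked.append((gene, score))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Placeholder candidates, not real data.
candidates = {
    "CAND_1": {"interactions": {"KNOWN_A"}, "pathways": {"Endocytosis"}},
    "CAND_2": {"interactions": set(), "pathways": {"Endocytosis"}},
    "CAND_3": {"interactions": set(), "pathways": {"Unrelated pathway"}},
}

ranking = prioritize(
    candidates,
    known_genes={"KNOWN_A", "KNOWN_B"},
    known_pathways={"Endocytosis", "Membrane trafficking"},
)
print(ranking)  # [('CAND_1', 2), ('CAND_2', 1)]
```

Candidates with no link to any known gene or pathway are dropped rather than ranked last, mirroring how the tables list only the prioritized genes.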

Conclusions

Bioinformatics is involved at every stage of genomic research, from experimental design through to final analysis

Standards and best practices do exist, but are rapidly evolving as new technologies and methods are developed

Progress towards automatic generation of clinically interpretable genomics studies

Annotation, filtering, and prioritization of genetic variants are crucial

Balance between false positive calls and false negatives

Where Are We Headed?

Integration of more data sources

Gene expression

More annotation sources

Controlled phenotype vocabularies

Gene Ontology terms

Predictive models: recessive versus dominant inheritance and penetrance

“New” and emerging technologies: RNA-Seq (gene expression), ChIP-Seq (protein-DNA binding), single-molecule sequencing

Acknowledgements

Dalhousie University: Dr. Karen Bedard, Dr. Chris McMaster, Dr. Andrew Orr, Dr. Conrad Fernandez, Dr. Marissa Leblanc, Mat Nightingale

Bedard Lab

IGNITE

McGill/Genome Quebec: Dr. Jacek Majewski, Jeremy Schwartzentruber

Dr. Sarah Dyack

Dr. Johane Robataille

Genome Atlantic
