Bioinformatics in Gene Research

Genetics Noon Symposium Series: Bioinformatics in Genetics Research. Daniel Gaston, PhD, Dr. Karen Bedard Lab, Department of Pathology. November 21st, 2012

Upload: dan-gaston

Post on 10-May-2015

DESCRIPTION

Intended for a mixed/general audience of clinicians, business interests, and research scientists. No audio; however, the event was recorded and posted to YouTube by Genome Atlantic: http://www.youtube.com/watch?v=FLVjwOngu-Q

TRANSCRIPT

Page 1: Bioinformatics in Gene Research

Genetics Noon Symposium Series

Daniel Gaston, PhD

Dr. Karen Bedard Lab, Department of Pathology

Bioinformatics in Genetics Research

November 21st, 2012

Page 2: Bioinformatics in Gene Research

IGNITE

Orphan Diseases: Identifying Genes and Novel Therapeutics to Enhance Treatment

Identify causative genetic variations in orphan diseases with an emphasis on Atlantic Canada

Develop animal and cell culture models Identify and develop novel therapeutics igniteproject.ca

Page 4: Bioinformatics in Gene Research

Outline

Introduction
Bioinformatics in Disease Genomics
Next-Generation Sequencing
Genomics in Research and the Clinic
The Data Deluge and its Solutions
Bioinformatic Methods for Analyzing Genomic Data
Case Studies
Conclusion

Page 5: Bioinformatics in Gene Research

Bioinformatics in Disease Genomics

Handling and long-term storage of raw data (sequencing, gene expression, etc.)
Maintenance and support of computational infrastructure
Experimental design
Data analysis
Methods development
Analysis pipelines
Statistical analyses
Algorithm design

Page 7: Bioinformatics in Gene Research

‘Next-Generation’ Sequencing and Disease Genomics

Page 8: Bioinformatics in Gene Research
Page 14: Bioinformatics in Gene Research

Disease Genomics: Hunting Down Pathogenic Genetic Variation

[Figure: gene structure diagram with a Reference track and a Patient track, each drawn as Exon 1, Intron 1, Exon 2. Labels: Start, TAA (Stop), mRNA coding for protein, Splice Sites, TAC (Tyr).]
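The TAA (Stop) and TAC (Tyr) labels in the diagram suggest the slide illustrates a stop codon changed into a tyrosine codon. Assuming that reading, the sketch below shows how a single-base codon change can be classified. The function name and the category strings are illustrative, and the codon table holds only the handful of standard genetic code entries needed here.

```python
# Subset of the standard genetic code (DNA codons), just enough for this example.
CODON_TABLE = {
    "ATG": "Met",   # start codon
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
    "TAC": "Tyr",
    "TAT": "Tyr",
}


def classify_codon_change(ref_codon, alt_codon):
    """Compare the reference and patient codon and name the effect."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if ref_aa == "Stop":
        return "stop loss"      # translation runs past the normal end
    if alt_aa == "Stop":
        return "stop gain"      # translation ends early
    return "amino acid change"


print(classify_codon_change("TAA", "TAC"))  # stop loss
```

A real annotation tool would use the full 64-codon table and the transcript's reading frame; this only shows the comparison step.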

Page 16: Bioinformatics in Gene Research

Disease Genomics: Research vs Clinic

Still predominantly research oriented
  Complex/Common disease
  Mendelian disorders
  Cancer genomics

Clinical genomics starting to gain traction
  Cancer genomics
    Cancer subtype identification
    Personalized medicine and predicting outcomes
  Mendelian disorders
    Early diagnosis
    Cost effectiveness

Page 17: Bioinformatics in Gene Research

Clinical Genomics

Children’s Mercy Hospital NICU
In the US, >20% of infant deaths are due to genetic disease
Serial sequencing of candidate genes is too slow

Page 19: Bioinformatics in Gene Research

Children’s Mercy Hospital NICU

50-hour differential diagnosis of monogenic disease
  Sample preparation and sequencing: 30.5 hours
  Automated bioinformatics analysis: 17.5 hours
  Previous high-throughput sequencing methods: 19 days
Tested on seven infants: two previously diagnosed using standard methods, five undiagnosed

Caveats
  Bioinformatics portion not available outside of hospital
  Requires thorough clinical phenotyping using a controlled vocabulary
  Generates a large amount of data
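As a quick check on the timing figures above, the two itemized steps can be added up and compared with the 19-day figure for previous methods. The variable names are illustrative; the numbers are the ones on the slide.

```python
sample_prep_and_sequencing_h = 30.5
automated_analysis_h = 17.5
previous_method_days = 19

# The two itemized steps together, in hours.
itemized_hours = sample_prep_and_sequencing_h + automated_analysis_h

# The older high-throughput approach, converted to hours.
previous_hours = previous_method_days * 24

# Speed-up relative to the headline 50-hour figure.
speedup = previous_hours / 50

print(itemized_hours)      # 48.0
print(previous_hours)      # 456
print(round(speedup, 1))   # 9.1
```

The itemized steps account for 48 of the quoted 50 hours, and the overall turnaround is roughly nine times faster than the 19-day figure.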

Page 20: Bioinformatics in Gene Research

The Data Deluge

4 million genetic variants
2 million associated with protein-coding genes
10,000 possibly of disease-causing type
1,500 with <1% frequency in population
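To make the scale of this funnel concrete, the short sketch below expresses each level as a fraction of the starting 4 million variants. The counts are the ones on the slide; the helper name is illustrative.

```python
# Counts taken from the slide, from widest to narrowest.
funnel = [
    ("all genetic variants", 4_000_000),
    ("associated with protein-coding genes", 2_000_000),
    ("possibly of disease-causing type", 10_000),
    ("<1% frequency in population", 1_500),
]

total = funnel[0][1]


def fraction_remaining(count):
    """Fraction of the original variant set that survives to this level."""
    return count / total


for label, count in funnel:
    print(f"{count:>9,}  {fraction_remaining(count):.4%}  {label}")
```

The last level is 1,500 of 4,000,000, or 0.0375% of the starting set, which is why automated annotation and filtering matter so much.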

Page 21: Bioinformatics in Gene Research

Surviving the Data Deluge

Reducing the Search Space: Exome Sequencing

Page 22: Bioinformatics in Gene Research

Exome Sequencing

Exome: Portion of genome composed of protein-coding exons and functional RNA sequences

1.5 - 2% of human genome (50 Mb)

> 85% of monogenic diseases due to variants in exome

Complete exome sequencing: ~ $1000/sample
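The 1.5 - 2% figure can be sanity-checked against the 50 Mb exome size. The genome size used below (about 3,100 Mb) is an assumption added for this example and does not come from the slide.

```python
EXOME_MB = 50               # from the slide
ASSUMED_GENOME_MB = 3_100   # assumption: approximate human genome size

exome_fraction = EXOME_MB / ASSUMED_GENOME_MB
print(f"{exome_fraction:.1%}")  # 1.6%
```

Under that assumption the exome is about 1.6% of the genome, inside the range quoted on the slide.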

Page 23: Bioinformatics in Gene Research

Caveats

Incomplete and non-uniform coverage of exome
  Systematic bias (GC content)
  Random sampling

Not all genetic variants amenable to discovery
  Non-coding variants
  Structural variants

Page 24: Bioinformatics in Gene Research

Surviving The Data Deluge

Bioinformatics

Page 25: Bioinformatics in Gene Research

Typical Bioinformatics Workflow

QC of Raw Data
Map to Reference
QC
Find Variants
QC
Annotate
Filter
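The workflow above can be sketched as an ordered list of steps with QC checkpoints between them. This is only a structural sketch: `run_workflow`, `require_nonempty`, and `passthrough` are hypothetical names, and the placeholder steps stand in for the real QC, mapping, variant-calling, annotation, and filtering tools, none of which are named on the slide.

```python
def run_workflow(data, steps):
    """Run each (name, function) step in order and record what completed.

    A step returns the data for the next step, or raises ValueError
    when a QC checkpoint fails, which stops the run.
    """
    completed = []
    for name, step in steps:
        data = step(data)
        completed.append(name)
    return data, completed


def require_nonempty(data):
    """Toy QC checkpoint: fail if nothing is left to analyze."""
    if not data:
        raise ValueError("QC failed: no records left")
    return data


def passthrough(data):
    """Placeholder for a real tool; returns its input unchanged."""
    return data


steps = [
    ("QC of raw data", require_nonempty),
    ("Map to reference", passthrough),
    ("QC", require_nonempty),
    ("Find variants", passthrough),
    ("QC", require_nonempty),
    ("Annotate", passthrough),
    ("Filter", passthrough),
]

result, completed = run_workflow(["read1", "read2"], steps)
print(completed)
```

Keeping the steps in a plain list makes it easy to swap one program for another at any stage without touching the rest of the pipeline.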

Page 30: Bioinformatics in Gene Research

It sounds simple, but…

For every stage there are multiple programs available and published in the literature
For every program there is a wide variety of parameter values and options. Defaults are often “good enough”, but not always
Best combinations of programs and options are not well understood
Protocols are changing rapidly as new technologies and methods are developed
Different centres and groups use slightly different workflows with similar, but not identical, results

Page 32: Bioinformatics in Gene Research

Annotating Variants

Page 33: Bioinformatics in Gene Research

If a problem cannot be solved, enlarge it.

--Dwight D. Eisenhower

Page 34: Bioinformatics in Gene Research

Annotations Associated with Genomic Variants

Is the variant in a known protein-coding gene?
  What does the gene do?
  What molecular pathways?
  What protein-protein interactions?
  What tissues is it expressed in?
  When in development?

Has this variant been seen before?
  What population(s)?
  With what frequency?
  Has it been seen in local sequencing projects?
  Is there any known clinical significance?

What is the effect of the variation?
  Does it change the resulting protein?
  How?

4 million genetic variants
2 million associated with protein-coding genes
10,000 possibly of disease-causing type
1,500 with <1% frequency in population
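One way to hold the answers to the questions above is a single record per variant. The sketch below is a hypothetical data structure, not the format used by any particular annotation tool; every field name is an assumption, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VariantAnnotation:
    # Where the variant is and what changed.
    chrom: str
    pos: int
    ref: str
    alt: str
    # Gene-level questions.
    gene: Optional[str] = None
    pathways: List[str] = field(default_factory=list)
    interactions: List[str] = field(default_factory=list)
    expressed_in: List[str] = field(default_factory=list)
    # "Has this variant been seen before?"
    population_frequency: Dict[str, float] = field(default_factory=dict)
    local_observations: int = 0
    clinical_significance: Optional[str] = None
    # "What is the effect of the variation?"
    protein_effect: Optional[str] = None

    def in_known_gene(self) -> bool:
        return self.gene is not None

    def seen_before(self) -> bool:
        return bool(self.population_frequency) or self.local_observations > 0


# Hypothetical example values, for illustration only.
example = VariantAnnotation(
    chrom="9", pos=125_000_000, ref="A", alt="C",
    gene="GENE_A", pathways=["Endocytosis"], protein_effect="stop loss",
)
print(example.in_known_gene(), example.seen_before())  # True False
```

Keeping every annotation on one record makes the later filtering and prioritization steps simple lookups.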

Page 35: Bioinformatics in Gene Research

Gene Annotation Resources

Page 36: Bioinformatics in Gene Research

Variant Annotation Resources

Page 37: Bioinformatics in Gene Research

Potential Pitfalls with Annotation Sources

Databases often overlap and agree, but there may be disagreements
Source of information: predicted versus experimental
Incorrect and out-of-date information
Large-scale unvalidated versus manually curated datasets

Page 38: Bioinformatics in Gene Research

Bioinformatics Analyses of Genomic Variants

Combining Data Sources and Filtering

Page 39: Bioinformatics in Gene Research

IGNITE Data Pipeline and Integration

[Figure: pipeline diagram. Labels: Mapped Region(s), Known Genes, Gene Definitions, Pathway and Interactions, Gene Annotations, Annotated Genomic Variants, Filter, Sort, Prioritize.]

Page 40: Bioinformatics in Gene Research

Filtering the Data: Categorization

4 million variants
  Intronic
    Unknown
    Splice Site: Potential Disease Causing
  Exonic
    Amino Acid Changing
    Known Genetic Disease Variant
    Stop Loss / Stop Gain
    Amino Acid Change Likely Pathogenic
    Amino Acid Change Likely Benign
    Known Polymorphism in Population
    Silent Mutation
    Splice Site: Potential Disease Causing
  Intergenic
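The categories on this slide can be read as a decision procedure applied to each annotated variant. The sketch below is one possible ordering of the checks, not the IGNITE implementation; the dictionary keys (`region`, `effect`, `splice_site`, and so on) are hypothetical, and defaulting to "Likely Benign" when no pathogenicity prediction is present is an assumption made to keep the example short.

```python
def categorize(variant):
    """Assign one of the slide's category labels to a variant dictionary."""
    region = variant["region"]  # "intergenic", "intronic", or "exonic"
    if region == "intergenic":
        return "Intergenic"
    if variant.get("splice_site"):
        return "Splice Site: Potential Disease Causing"
    if region == "intronic":
        return "Unknown"

    # Everything below is exonic.
    effect = variant.get("effect")
    if effect == "silent":
        return "Silent Mutation"
    if variant.get("known_disease_variant"):
        return "Known Genetic Disease Variant"
    if effect in ("stop_loss", "stop_gain"):
        return "Stop Loss / Stop Gain"
    if variant.get("known_polymorphism"):
        return "Known Polymorphism in Population"
    if variant.get("predicted_pathogenic"):
        return "Amino Acid Change Likely Pathogenic"
    return "Amino Acid Change Likely Benign"


print(categorize({"region": "exonic", "effect": "stop_loss"}))
```

The order of the checks decides what happens when a variant fits more than one category, so it is worth stating explicitly in any real pipeline.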

Page 41: Bioinformatics in Gene Research

Filtering the Data: Common or Rare?

Variants in dbSNP – Typically known polymorphisms, unlikely to be associated with rare disease

Variants with relatively high frequency in control populations (1000 Genomes, HapMap, EVS, 2800 Exomes)

Number of times variant previously seen at sequencing centre/locally
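A minimal sketch of a common-versus-rare filter along these lines is shown below. The 1% cutoff echoes the "<1% frequency in population" figure from the earlier slide; the local-count cutoff, the function name, and the dictionary keys are hypothetical choices for illustration.

```python
def is_rare(variant, max_control_frequency=0.01, max_local_count=2):
    """Return True if the variant looks rare enough to keep.

    A variant is dropped when any control database reports a frequency
    at or above the cutoff, or when it has been seen locally more often
    than the (hypothetical) local-count cutoff.
    """
    frequencies = variant.get("control_frequencies", {})
    if any(freq >= max_control_frequency for freq in frequencies.values()):
        return False
    if variant.get("local_count", 0) > max_local_count:
        return False
    return True


variants = [
    {"id": "v1", "control_frequencies": {"1000 Genomes": 0.15}},
    {"id": "v2", "control_frequencies": {"EVS": 0.0004}, "local_count": 0},
    {"id": "v3", "control_frequencies": {}, "local_count": 40},
]
rare_ids = [v["id"] for v in variants if is_rare(v)]
print(rare_ids)  # ['v2']
```

Membership in dbSNP could be added as a further criterion, with the caveat from the slide that it is only typically, not always, a sign of a benign polymorphism.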

Page 45: Bioinformatics in Gene Research

Notes on Filtering and Variant Annotation

Very important to be aware of population when referencing frequency of a variant. Incorrect background leads to incorrect assumptions on prevalence
Reasonably well-sampled local populations are better than any other reference
Strike a balance between hard filtering for variants of largest potential effect and being inclusive so as not to miss variants
Some genes acquire large-effect variants (stop loss / stop gain, etc.) frequently. Some genes can be lost without causing disease
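The point about population background can be sketched as a lookup that prefers a local reference over a global one when both are available. The source names and frequencies below are made up for illustration, and `reference_frequency` is a hypothetical helper.

```python
def reference_frequency(frequencies, preferred_sources):
    """Return (source, frequency) from the first preferred source with data."""
    for source in preferred_sources:
        if source in frequencies:
            return source, frequencies[source]
    return None, None


# Hypothetical variant: rare in a global reference, common in a local cohort.
frequencies = {"global_reference": 0.002, "local_cohort": 0.08}

source, freq = reference_frequency(
    frequencies, preferred_sources=["local_cohort", "global_reference"]
)
print(source, freq)  # local_cohort 0.08
```

Judged against the global reference alone, this variant would pass a 1% rarity filter; judged against the local cohort, it would not.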

Page 46: Bioinformatics in Gene Research

Applications to Real Data

Charcot-Marie-Tooth Disease and Cutis Laxa

Page 48: Bioinformatics in Gene Research

Charcot-Marie-Tooth: Genetic Mapping

Chromosome 9:120,962,282-133,033,431

Page 49: Bioinformatics in Gene Research

Cutis Laxa: Genetic Mapping

Chromosome 17:79,596,811-81,041,077
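Genetic mapping narrows the search to these intervals, so a first filter is simply whether a variant falls inside the mapped region. The sketch below uses the coordinates from the two slides; treating both endpoints as inclusive is an assumption, and the function names are illustrative.

```python
# (chromosome, start, end) as given on the slides.
MAPPED_REGIONS = {
    "Charcot-Marie-Tooth": ("9", 120_962_282, 133_033_431),
    "Cutis Laxa": ("17", 79_596_811, 81_041_077),
}


def in_region(chrom, pos, region):
    """True if the position lies inside the region (endpoints inclusive)."""
    region_chrom, start, end = region
    return chrom == region_chrom and start <= pos <= end


def region_length(region):
    """Number of bases in the region, counting both endpoints."""
    _, start, end = region
    return end - start + 1


for disease, region in MAPPED_REGIONS.items():
    print(f"{disease}: {region_length(region) / 1e6:.2f} Mb")
```

The two intervals come to roughly 12.07 Mb and 1.44 Mb, a small fraction of the genome to search for candidate variants.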

Page 50: Bioinformatics in Gene Research

Charcot-Marie-Tooth
143 genes in region
13 known genes in genome: MPZ, PMP22, GDAP1, KIF1B, MFN2, SOX, EGR2, DNM2, RAB7, LITAF (SIMPLE), GARS, YARS, LMNA

Cutis Laxa
52 genes in region
5 known genes in genome: ATP6V0A2, ELN, FBLN5, EFEMP2, SCYL1BP1, ALDH18A1

Page 51: Bioinformatics in Gene Research

Pathway and Interaction Data

Charcot-Marie-Tooth: 37 pathways
  Clathrin-derived vesicle budding
  Lysosome vesicle biogenesis
  Endocytosis
  Golgi-associated vesicle biogenesis
  Membrane trafficking
  Trans-Golgi network vesicle budding
  Primarily LMNA or DNM2

Cutis Laxa: 10 pathways
  Phagosome
  Collecting duct acid secretion
  Lysosome
  Protein digestion and absorption
  Metabolic pathways
  Oxidative phosphorylation
  Arginine and proline metabolism
  Primarily ATP6V0A2

Page 52: Bioinformatics in Gene Research

Results: Charcot-Marie-Tooth

8 Genes Prioritized

Gene      Interactions  Pathway
LRSAM1    Multiple      Endocytosis
DNM1      DNM2          -
FNBP1     DNM2          -
TOR1A     LMNA          -
STXBP1    Multiple      Five
SH3GLB2   -             Endocytosis
PIP5KL1   -             Endocytosis
FAM125B   -             Endocytosis

For more information: Guernsey et al. (2010) PLoS Genetics. 6(8): e1001081

Page 53: Bioinformatics in Gene Research

Results: Cutis Laxa

10 genes prioritized

Gene      Interactions  Pathway
HEXDC     Multiple      Phagosome
HG5       -             Phagosome
HG5       Multiple      Lysosome, Protein digestion
SIRT7     Multiple      Metabolic Pathways
FASN      -             Metabolic Pathways
DCXR      -             Metabolic Pathways
PYCR1     -             Metabolic Pathways, Arginine/Proline
PCYT2     -             Metabolic Pathways
ARHGDIA   -             Oxidative Phosphorylation

For more information: Guernsey et al. (2009) Am J Hum Genet. 85(1): 120-9
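The results tables suggest that candidate genes in the mapped region were prioritized when they interact with, or share a pathway with, genes already known to cause the disease. The sketch below illustrates that idea with a simple count-based score. The scoring scheme is an assumption, not the published method, `GENE_X` is a made-up gene, and the toy input reuses only a few entries from the Charcot-Marie-Tooth slides.

```python
def prioritize(candidates, known_genes, known_pathways):
    """Rank candidates by links to known disease genes and pathways.

    candidates maps gene name -> {"interactions": [...], "pathways": [...]}.
    Genes with no link at all are left out of the ranking.
    """
    ranked = []
    for gene, info in candidates.items():
        shared_interactions = sorted(set(info["interactions"]) & known_genes)
        shared_pathways = sorted(set(info["pathways"]) & known_pathways)
        score = len(shared_interactions) + len(shared_pathways)
        if score > 0:
            ranked.append((gene, score, shared_interactions, shared_pathways))
    # Highest score first; ties broken alphabetically by gene name.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


known_genes = {"DNM2", "LMNA"}
known_pathways = {"Endocytosis"}
candidates = {
    "DNM1": {"interactions": ["DNM2"], "pathways": []},
    "SH3GLB2": {"interactions": [], "pathways": ["Endocytosis"]},
    "GENE_X": {"interactions": [], "pathways": []},  # hypothetical, no links
}

for row in prioritize(candidates, known_genes, known_pathways):
    print(row)
```

A real pipeline would weight evidence types differently and combine them with the variant-level filters, but the basic join between candidate genes and known disease biology is the same.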

Page 54: Bioinformatics in Gene Research

Conclusions

Page 55: Bioinformatics in Gene Research

Conclusions

Bioinformatics is involved at every stage of genomic research, from experimental design through to final analysis
Standards and best practices do exist, but are rapidly evolving as new technologies and methods are developed
Progress towards automatic generation of clinically interpretable genomics studies
Annotation, filtering, and prioritization of genetic variants are crucial
Balance between false positive calls and false negatives

Page 56: Bioinformatics in Gene Research

Where Are We Headed?

Integration of more data sources
  Gene expression
  More annotation sources

Controlled phenotype vocabularies
  Gene Ontology terms

Predictive models
  Recessive versus dominant inheritance and penetrance

“New” and emerging technologies
  RNA-Seq (gene expression)
  ChIP-Seq (protein-DNA binding)
  Single-molecule sequencing

Page 57: Bioinformatics in Gene Research

Acknowledgements

Dalhousie University
Dr. Karen Bedard
Dr. Chris McMaster
Dr. Andrew Orr
Dr. Conrad Fernandez
Dr. Marissa Leblanc
Mat Nightingale
Bedard Lab
IGNITE

McGill/Genome Quebec
Dr. Jacek Majewski
Jeremy Schwartzentruber

Dr. Sarah Dyack
Dr. Johane Robataille
Genome Atlantic