Integrative Functional Genomics Anil Jegga Biomedical Informatics, CCHMC [email protected]

Author: erik-mclaughlin

Post on 31-Dec-2015






  • Integrative Functional Genomics. Anil Jegga, Biomedical Informatics, [email protected]

  • Medical Informatics; Bioinformatics & the "omes": Two Separate Worlds.. With Some Data Exchange

    Patient Records, Disease Database, PubMed, Clinical Trials, OMIM Clinical Synopsis

    >380 "omes" so far, and there is the UNKNOME too: genes with no function known

  • Motivation

    To correlate diseases with the anatomical parts affected, the genes/proteins involved, and the underlying physiological processes (interactions, pathways, processes). In other words, bringing the disciplines of Medical Informatics (MI) and Bioinformatics (BI) together (Biomedical Informatics, BMI) to support personalized or tailor-made medicine.

    How to integrate multiple types of genome-scale data across experiments and phenotypes in order to find genes associated with diseases and drug response?

  • Model Organism Databases: Common Issues

    Heterogeneous data sets and data integration: from genotype to phenotype; experimental and consensus views.

    Incorporation of large datasets: whole-genome annotation pipelines; large-scale mutagenesis/variation projects (dbSNP).

    Computational vs. literature-based data collection and evaluation (MedLine).

    Data mining: extraction of new knowledge; testable hypotheses (hypothesis generation).

  • Support Complex Queries

    Show me all genes involved in brain development that are expressed in the Central Nervous System.

    Show me all genes involved in brain development in human and mouse that also show iron ion binding activity.

    For this set of genes, what aspects of function and/or cellular localization do they share?

    For this set of genes, what mutations are reported to cause pathological conditions?
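
The first three queries above can be sketched against a small in-memory annotation store. This is only an illustration under stated assumptions: the `GeneRecord` type, the `query` and `shared_annotations` helpers, and the `GENE_A` to `GENE_D` records are all hypothetical and do not represent real annotations or any particular database's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """One integrated gene record (hypothetical schema)."""
    symbol: str
    species: str
    processes: frozenset = frozenset()
    functions: frozenset = frozenset()
    expressed_in: frozenset = frozenset()


# Hypothetical toy records; these are NOT real annotations.
GENES = [
    GeneRecord(symbol="GENE_A", species="human",
               processes=frozenset({"brain development"}),
               functions=frozenset({"iron ion binding"}),
               expressed_in=frozenset({"central nervous system"})),
    GeneRecord(symbol="GENE_B", species="mouse",
               processes=frozenset({"brain development"}),
               functions=frozenset({"DNA binding"}),
               expressed_in=frozenset({"liver"})),
    GeneRecord(symbol="GENE_C", species="human",
               processes=frozenset({"DNA repair"}),
               functions=frozenset({"iron ion binding"}),
               expressed_in=frozenset({"central nervous system"})),
    GeneRecord(symbol="GENE_D", species="zebrafish",
               processes=frozenset({"brain development"}),
               functions=frozenset({"iron ion binding"}),
               expressed_in=frozenset({"central nervous system"})),
]


def query(genes, process=None, function=None, tissue=None, species=None):
    """Return symbols of genes matching every criterion that is not None."""
    hits = []
    for gene in genes:
        if process is not None and process not in gene.processes:
            continue
        if function is not None and function not in gene.functions:
            continue
        if tissue is not None and tissue not in gene.expressed_in:
            continue
        if species is not None and gene.species not in species:
            continue
        hits.append(gene.symbol)
    return hits


def shared_annotations(genes, symbols):
    """Return the molecular functions common to all selected genes."""
    selected = [g for g in genes if g.symbol in symbols]
    if not selected:
        return frozenset()
    shared = selected[0].functions
    for gene in selected[1:]:
        shared = shared & gene.functions
    return shared


# Query 1: brain development genes expressed in the CNS.
cns_genes = query(GENES, process="brain development",
                  tissue="central nervous system")
# Query 2: the same process in human and mouse, with iron ion binding.
iron_genes = query(GENES, process="brain development",
                   function="iron ion binding", species={"human", "mouse"})
```

Each query is a conjunction of filters over one integrated record per gene, which is why the answers depend on the annotations having been brought into a common vocabulary first.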

  • Bioinformatic Data, 1978 to present: DNA sequence, gene expression, protein expression, protein structure, genome mapping, SNPs & mutations, metabolic networks, regulatory networks, trait mapping, gene function analysis, scientific literature, and others..

  • Human Genome Project Data Deluge

    No. of human gene records currently in NCBI: ~30K (excluding pseudogenes, mitochondrial genes and obsolete records). Includes ~700 microRNAs.

    NCBI Human Genome Statistics as of November 4, 2009.

  • The Gene Expression Data Deluge

    Till 2000: 413 papers on microarray! Problems Deluge!

    PubMed articles by year: 2001: 834; 2002: 1557; 2003: 2421; 2004: 3508; 2005: 4400; 2006: 4824; 2007: 5108; 2008: 5884; 2009: 5207..

    Allison DB, Cui X, Page GP, Sabripour M. 2006. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 7(1): 55-65.

  • Information Deluge..

    3 scientific journals in 1750. Now: >120,000 scientific journals!

    >500,000 medical articles/year; >4,000,000 scientific articles/year.

    >16 million abstracts in PubMed, derived from >32,500 journals.

  • Data-driven Problems.. Gene Nomenclature

    What's in a name! Rose is a rose is a rose is a rose!

    Gene names: Accelerin, Antiquitin, Bang Senseless, Bride of Sevenless, Christmas Factor, Cockeye, Crack, Draculin, Dickie's small eye, Fidgetin, Gleeful, Knobhead, Lunatic Fringe, Mortalin, Orphanin, Profilactin, Sonic Hedgehog.

    Disease names: Mobius Syndrome with Poland's Anomaly, Werner's syndrome, Down's syndrome, Angelman's syndrome, Creutzfeldt-Jakob disease.

    Generally, the names refer to some feature of the mutant phenotype.

    Dickie's small eye (Theiler et al., 1978, Anat Embryol (Berl), 155: 81-86) is now Pax6.

    Gleeful: "This gene encodes a C2H2 zinc finger transcription factor with high sequence similarity to vertebrate Gli proteins, so we have named the gene gleeful (Gfl)." (Furlong et al., 2001, Science 293: 1632)

  • Rose is a rose is a rose is a rose.. Not Really!

    What is a cell?

    Any small compartment.

    (Biology) The basic structural and functional unit of all organisms; cells may exist as independent units of life (as in monads) or may form colonies or tissues, as in higher plants and animals.

    A device that delivers an electric current as the result of a chemical reaction.

    A small unit serving as part of or as the nucleus of a larger political movement.

    Cellular telephone: a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver.

    A small room in which a monk or nun lives.

    A room where a prisoner is kept.

    Image sources: somewhere from the internet.

  • Foundation Model Explorer

  • Database name | No. of records: Query = p53 | Query = TP53 (HGNC) | Query = p53 OR TP53

    PubMed | 48,679 | 3360 | 49,469
    PMC | 21,193 | 1529 | 21,564
    Book | 782 | 504 | 820
    Nucleotide | 9473 | 592 | 9773
    Protein | 6219 | 509 | 6377
    Genome | 22 | 1 | 23
    OMIM | 403 | 141 | 414
    SNP | 424 | 337 | 453
    Gene | 1642 | 338 | 1750
    Homologene | 63 | 9 | 68
    GEO Profiles | 352,684 | 15,140 | 358,999
    Cancer Chr | 302 | 161 | 463
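
The counts above show why a search should expand a gene name to its synonyms. A minimal sketch follows, using only the three rows whose figures are comma-delimited in the source (PubMed, PMC, GEO Profiles). The helper names `build_or_query` and `missed_by_single_term` are hypothetical, and the query string is plain boolean text rather than the syntax of any specific search service.

```python
# Record counts taken from the table above, as
# (query "p53", query "TP53", query "p53 OR TP53").
COUNTS = {
    "PubMed": (48679, 3360, 49469),
    "PMC": (21193, 1529, 21564),
    "GEO Profiles": (352684, 15140, 358999),
}


def build_or_query(terms):
    """Join synonyms into one boolean OR query, dropping blanks and repeats."""
    cleaned = (term.strip() for term in terms)
    unique = list(dict.fromkeys(term for term in cleaned if term))
    return " OR ".join(unique)


def missed_by_single_term(counts):
    """For each database, count records the combined query finds
    that a single-name query would miss."""
    missed = {}
    for database, (alias, official, combined) in counts.items():
        missed[database] = {
            "alias_only_misses": combined - alias,
            "official_only_misses": combined - official,
        }
    return missed


combined_query = build_or_query(["p53", "TP53"])
missed = missed_by_single_term(COUNTS)
```

With these figures, searching PubMed for the official HGNC symbol alone would miss 46,109 of the 49,469 records returned by the combined query.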


  • The REAL Problems

  • Integrative Genomics: what is it? Another buzzword, or a meaningful concept useful for biomedical research?

    Acquisition, integration, curation, and analysis of biological data.

    Integrative Genomics: the study of complex interactions between genes, organism and environment, the triple helix of biology (Gene, Organism, Environment).

    It is definitely beyond the buzzword stage: universities now have programs named 'Integrated Genomics.'

  • Methods for Integration

    Link-driven federations: explicit links between databanks.

    Warehousing: data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.

    Others: Semantic Web, etc.

  • Link-driven Federations

    Create explicit links between databanks.

    Query: get interesting results and use web links to reach related data in other databanks.

    Examples: NCBI-Entrez, SRS.






  • Link-driven Federations

    Advantages: complex queries; fast.

    Disadvantages: require good knowledge; syntax-based; terminology problem not solved.

  • Data Warehousing

    Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.

    Advantages: good for very specific, task-based queries and studies. Since a warehouse is custom-built and usually expert-curated, it is relatively less error-prone.

    Disadvantages: can quickly become outdated and needs constant updates. Limited functionality, e.g., built around one disease or one system.

  • No Integrative Genomics is Complete without Ontologies: Gene Ontology (GO); Unified Medical Language System (UMLS)

  • The 3 Gene Ontologies

    Molecular Function = elemental activity/task: the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity. What a product does; precise activity.

    Biological Process = biological goal or objective: broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions. Biological objective, accomplished via one or more ordered assemblies of functions.

    Cellular Component = location or complex: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme. "Is located in" ("is a subcomponent of").

  • Example: Gene Product = hammer

    Function (what) | Process (why)

    Drive a nail into wood | Carpentry

    Drive a stake into soil | Gardening

    Smash a bug | Pest Control

    A performer's juggling object | Entertainment

  • GO term associations: Evidence Codes

    ISS: Inferred from sequence or structural similarity
    IDA: Inferred from direct assay
    IPI: Inferred from physical interaction
    TAS: Traceable author statement
    IMP: Inferred from mutant phenotype
    IGI: Inferred from genetic interaction
    IEP: Inferred from expression pattern
    ND: No data available
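
Evidence codes are typically used to filter annotations by how they were established. The sketch below keeps only annotations backed by experiment. The annotation triples and gene names are hypothetical, and grouping IDA, IPI, IMP, IGI and IEP as "experimental" follows GO's own categorisation of evidence codes but should be treated as an assumption of this example.

```python
# Evidence codes as listed above.
EVIDENCE_CODES = {
    "ISS": "Inferred from sequence or structural similarity",
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "TAS": "Traceable author statement",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "ND": "No data available",
}

# Assumption: these five codes are treated as experimental evidence.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical (gene, GO term, evidence code) triples; not real annotations.
ANNOTATIONS = [
    ("GENE_A", "DNA repair", "IDA"),
    ("GENE_A", "nucleus", "ISS"),
    ("GENE_B", "ATPase activity", "TAS"),
    ("GENE_C", "carbohydrate binding", "IMP"),
    ("GENE_D", "purine metabolism", "ND"),
]


def filter_by_evidence(annotations, allowed):
    """Keep annotations whose evidence code is in `allowed`."""
    unknown = set(allowed) - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    return [record for record in annotations if record[2] in allowed]


experimental_only = filter_by_evidence(ANNOTATIONS, EXPERIMENTAL)
```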

  • What can researchers do with GO?

    Access gene product functional information.

    Find how much of a proteome is involved in a process/function/component in the cell.

    Map GO terms and incorporate manual annotations into own databases.

    Provide a link between biological knowledge and gene expression profiles or proteomics data.

    And how? Getting the GO and GO_Association files. Data mining: my favorite gene, by GO, by sequence. Analysis of data: clustering by function/process. Other tools.

  • list enrichment analysis tools (DAVID, FatiGO, ToppGene)
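
List enrichment analysis asks whether an annotation term appears in a gene list more often than chance would predict. One common way to test this is a one-sided hypergeometric test, sketched below. This is a generic illustration and not the specific statistic used by DAVID, FatiGO or ToppGene; the function name and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric p-value: P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the term
    sample:     size of the input gene list
    hits:       input genes carrying the term
    """
    if not 0 <= annotated <= population:
        raise ValueError("annotated must lie between 0 and population")
    if not 0 <= sample <= population:
        raise ValueError("sample must lie between 0 and population")
    upper = min(annotated, sample)
    if not 0 <= hits <= upper:
        raise ValueError("hits must lie between 0 and min(annotated, sample)")

    total = comb(population, sample)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 carry the term,
# and 5 of the 50 genes in the input list carry it.
p_value = hypergeom_enrichment_p(20000, 200, 50, 5)
```

In practice a tool tests many terms at once, so the raw p-values would also need a multiple-testing correction, which this sketch leaves out.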

  • Open biomedical ontologies

  • Unified Medical Language System Knowledge Server UMLSKS

    The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.

    The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.

    The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.

  • Unified Medical Language System: Metathesaurus

    More than 1 million biomedical concepts and about 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.

    The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together. Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.

    Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.

    Uses: linking between different clinical or biomedical vocabularies; information retrieval from databases with human-assigned subject index terms and from free-text information sources; linking patient records to related information in bibliographic, full-text, or factual databases; natural language processing and automated indexing research.

  • UMLSKS Semantic Network

    Complexity is reduced by grouping concepts according to the semantic types that have been assigned to them. There are currently 15 semantic groups that provide a partition of the UMLS Metathesaurus for 99.5% of the concepts.

    ACTI|Activities & Behaviors|T053|Behavior
    ANAT|Anatomy|T024|Tissue
    CHEM|Chemicals & Drugs|T195|Antibiotic
    CONC|Concepts & Ideas|T170|Intellectual Product
    DEVI|Devices|T074|Medical Device
    DISO|Disorders|T047|Disease or Syndrome
    GENE|Genes & Molecular Sequences|T085|Molecular Sequence
    GEOG|Geographic Areas|T083|Geographic Area
    LIVB|Living Beings|T005|Virus
    OBJC|Objects|T073|Manufactured Object
    OCCU|Occupations|T091|Biomedical Occupation or Discipline
    ORGA|Organizations|T093|Health Care Related Organization
    PHEN|Phenomena|T038|Biologic Function
    PHYS|Physiology|T040|Organism Function
    PROC|Procedures|T061|Therapeutic or Preventive Procedure
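
The pipe-delimited rows above (group abbreviation, group name, semantic type identifier, semantic type name) are easy to load into a lookup table. A minimal sketch follows, assuming exactly four fields per row; the function name `parse_semantic_groups` is hypothetical and the data is limited to the fifteen example rows shown on the slide.

```python
# The fifteen example rows from the slide, one semantic type per group.
SEMGROUP_LINES = """\
ACTI|Activities & Behaviors|T053|Behavior
ANAT|Anatomy|T024|Tissue
CHEM|Chemicals & Drugs|T195|Antibiotic
CONC|Concepts & Ideas|T170|Intellectual Product
DEVI|Devices|T074|Medical Device
DISO|Disorders|T047|Disease or Syndrome
GENE|Genes & Molecular Sequences|T085|Molecular Sequence
GEOG|Geographic Areas|T083|Geographic Area
LIVB|Living Beings|T005|Virus
OBJC|Objects|T073|Manufactured Object
OCCU|Occupations|T091|Biomedical Occupation or Discipline
ORGA|Organizations|T093|Health Care Related Organization
PHEN|Phenomena|T038|Biologic Function
PHYS|Physiology|T040|Organism Function
PROC|Procedures|T061|Therapeutic or Preventive Procedure
""".splitlines()


def parse_semantic_groups(lines):
    """Map each semantic type identifier to its group and names."""
    type_to_group = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        abbrev, group_name, type_id, type_name = line.split("|")
        type_to_group[type_id] = {
            "group": abbrev,
            "group_name": group_name,
            "type_name": type_name,
        }
    return type_to_group


SEMANTIC_TYPES = parse_semantic_groups(SEMGROUP_LINES)
# Look up the group of the semantic type "Disease or Syndrome".
disease_group = SEMANTIC_TYPES["T047"]["group"]
```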

  • UMLSKS Semantic Navigator

  • Part 2: Integrative Functional Genomic Approaches to Identify and Prioritize Disease Genes

  • Disease Gene Identification and Prioritization

    Hypothesis: The majority of genes that impact or cause disease share membership in any of several functional relationships, OR functionally similar or related genes cause similar phenotypes.

    Functional similarity, i.e., common/shared: Gene Ontology term, pathway, phenotype, chromosomal location, expression, cis-regulatory elements (transcription factor binding sites), miRNA regulators, interactions, other features..

  • Background, Problems & Issues

    Most of the common diseases are multi-factorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors. High-throughput genome-wide studies, like linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease-causal genes.

  • Background, Problems & Issues

    Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in finding novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.

  • PPI: Predicting Disease Genes

    Direct protein-protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes.

    Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.

    Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, for example Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein-protein interactions might in principle be used to identify potentially interesting disease gene candidates.
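
The hypothesis above suggests a simple first pass: treat the direct interaction partners of known disease genes as candidates, and rank them by how many known disease genes they touch. The sketch below does exactly that. The interaction list and protein names `P1` to `P7` are hypothetical, and this direct-neighbor count is a deliberately simple stand-in, not the scoring used by any published tool.

```python
from collections import defaultdict

# Hypothetical undirected interactions; not real PPI data.
INTERACTIONS = [
    ("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
    ("P3", "P4"), ("P4", "P5"), ("P6", "P7"),
]


def build_network(edges):
    """Build an adjacency map from undirected edges, ignoring self-loops."""
    network = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        network[a].add(b)
        network[b].add(a)
    return network


def rank_candidates(network, seeds):
    """Rank non-seed proteins by the number of seed proteins they bind."""
    seeds = set(seeds)
    scores = defaultdict(int)
    for seed in seeds:
        for partner in network.get(seed, ()):
            if partner not in seeds:
                scores[partner] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


NETWORK = build_network(INTERACTIONS)
# Known disease genes (seeds) P3 and P5 share the partner P4.
candidates = rank_candidates(NETWORK, {"P3", "P5"})
```

Proteins with no recorded interactions never appear in the ranking, which mirrors the "No interaction data" cases that any network-based method has to report.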

  • ToppGene Suite General Schema

  • ToppGene Suite Applications

    ToppFun. Description: Detects functional enrichment of input gene list based on Transcriptome (gene expression), Proteome (protein domains and interactions), Regulome (TFBS and miRNA), Ontologies (GO, Pathway), Phenotype (human disease and mouse phenotype), Pharmacome (Drug-Gene associations), and Bibliome (literature co-citation). Input: single gene list; supported identifiers include NCBI Entrez gene IDs, approved human gene symbols, and NCBI Reference Sequence accession numbers. Output: HTML output; tab-delimited downloadable text file; graphical charts.

    ToppGene. Description: Prioritize or rank candidate genes based on functional similarity to training gene list. Input: same as above but with two gene lists (training and test). Output: HTML output.

    ToppNet. Description: Prioritize or rank candidate genes based on topological features in protein-protein interaction network. Input: same as above. Output: HTML output; Cytoscape-compatible input file; graphical networks.

    ToppGeNet. Description: Identify and prioritize the neighboring genes of the seeds in protein-protein interaction network based on functional similarity to the "seed" list (ToppGene) or topological features in protein-protein interaction network (ToppNet). Input: single gene list. Output: same as above.
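
The idea of ranking test genes by functional similarity to a training list can be sketched as follows. This is not ToppGene's actual scoring method, which the slides do not spell out. It is a simplified stand-in that weights each annotation term by how often it occurs in the training set and sums those weights per candidate; all gene names, term names and function names are hypothetical.

```python
from collections import Counter

# Hypothetical annotation sets (GO terms, pathways, phenotypes, ...).
ANNOTATIONS = {
    "TRAIN_1": {"process_x", "pathway_1", "phenotype_a"},
    "TRAIN_2": {"process_x", "pathway_1", "phenotype_b"},
    "TRAIN_3": {"process_x", "pathway_2"},
    "CAND_1": {"process_x", "pathway_1"},
    "CAND_2": {"pathway_2"},
    "CAND_3": {"process_y"},
}


def training_profile(annotations, training):
    """Weight each term by the fraction of training genes that carry it."""
    training = list(training)
    if not training:
        raise ValueError("training list must not be empty")
    counts = Counter()
    for gene in training:
        counts.update(annotations.get(gene, ()))
    return {term: count / len(training) for term, count in counts.items()}


def prioritize(annotations, training, candidates):
    """Rank candidates by summed weight of terms shared with the profile."""
    profile = training_profile(annotations, training)
    scored = []
    for gene in candidates:
        terms = annotations.get(gene, set())
        score = sum(profile.get(term, 0.0) for term in terms)
        scored.append((gene, score))
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


ranking = prioritize(
    ANNOTATIONS,
    training=["TRAIN_1", "TRAIN_2", "TRAIN_3"],
    candidates=["CAND_3", "CAND_2", "CAND_1"],
)
ranked_genes = [gene for gene, _ in ranking]
```

A candidate that shares only rare terms with the training set scores low, and one with no shared terms scores zero, so sparsely annotated genes are systematically disadvantaged by this kind of approach.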

  • Results of the genetic disease prioritizations using ToppGene and ToppNet

    Training sets: Compiled using phenotype/disease annotations in NCBI's Entrez Gene records and OMIM.

    Test set genes: Artificial linkage interval, i.e., the candidate gene + 99 nearest neighboring genes based on their genomic distance on the same chromosome. The gene-disease associations were from recently reported GWAS and include novel disease gene associations.

    Disease | Reference | Gene | ToppGene Rank | ToppNet Rank
    Bipolar Disorder | Le-Niculescu et al. | KLF12 | 2 | 15
    Bipolar Disorder | Le-Niculescu et al. | RORB | 4 | 18
    Bipolar Disorder | Le-Niculescu et al. | RORA | 7 | 13
    Bipolar Disorder | Le-Niculescu et al. | ALDH1A1 | 10 | No interaction data
    Bipolar Disorder | Le-Niculescu et al. | AK3L1 | 11 | No interaction data
    Cardiomyopathy | Dhandapany et al. | MYBPC3 | 1 | 2
    Celiac Disease | Hunt et al. | SH2B3 | 1 | 8
    Celiac Disease | Hunt et al. | CCR3 | 2 | 3
    Celiac Disease | Hunt et al. | IL18R1 | 3 | 29
    Celiac Disease | Hunt et al. | RGS1 | 9 | 26
    Celiac Disease | Hunt et al. | TAGAP | 14 | No interaction data
    Celiac Disease | Hunt et al. | IL12A | 14 | 10
    Crohn's Disease | Fisher et al. | MST1 | 1 | 27
    Crohn's Disease | Fisher et al. | NKX2-3 | 1 | 27
    Crohn's Disease | Fisher et al. | IRGM | 2 | No interaction data
    Crohn's Disease | Villani et al. | NLRP3 | 5 | 1
    Crohn's Disease | Fisher et al. | IL12B | 7 | 1
    Crohn's Disease | Barrett et al.; Franke et al. | STAT3 | 11 | 1
    Crohn's Disease | Franke et al. | PTPN2 | 30 | 6
    Obesity | Renstrom et al. | MC4R | 1 | 1
    Mean | | | 6.8 | 11.75
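
The evaluation setup above can be sketched in two steps: build an artificial linkage interval around a known disease gene, then summarise where that gene lands in each ranking. The rank pairs below are read from the table above, with `None` standing for "No interaction data"; the final lines check that they reproduce the reported means. The coordinate format, function names and toy genes in the usage example are hypothetical.

```python
from statistics import mean


def artificial_interval(genes, target, size=100):
    """Return the target gene plus its nearest neighbors on the same
    chromosome, `size` genes in total where that many are available.

    genes: dict mapping symbol -> (chromosome, position); hypothetical format.
    """
    chromosome, position = genes[target]
    neighbors = sorted(
        (abs(pos - position), symbol)
        for symbol, (chrom, pos) in genes.items()
        if chrom == chromosome and symbol != target
    )
    return [target] + [symbol for _, symbol in neighbors[: size - 1]]


# (ToppGene rank, ToppNet rank) per gene, as read from the table above.
RANKS = {
    "KLF12": (2, 15), "RORB": (4, 18), "RORA": (7, 13),
    "ALDH1A1": (10, None), "AK3L1": (11, None), "MYBPC3": (1, 2),
    "SH2B3": (1, 8), "CCR3": (2, 3), "IL18R1": (3, 29),
    "RGS1": (9, 26), "TAGAP": (14, None), "IL12A": (14, 10),
    "MST1": (1, 27), "NKX2-3": (1, 27), "IRGM": (2, None),
    "NLRP3": (5, 1), "IL12B": (7, 1), "STAT3": (11, 1),
    "PTPN2": (30, 6), "MC4R": (1, 1),
}


def mean_rank(ranks, column):
    """Average one rank column, skipping genes with no rank available."""
    values = [pair[column] for pair in ranks.values() if pair[column] is not None]
    return mean(values)


toppgene_mean = mean_rank(RANKS, 0)  # averaged over all 20 genes
toppnet_mean = mean_rank(RANKS, 1)   # averaged over the 16 genes with interaction data

# Toy usage of the interval builder with made-up coordinates.
TOY_GENES = {
    "TARGET": ("chr1", 100), "N1": ("chr1", 90), "N2": ("chr1", 130),
    "FAR": ("chr1", 500), "OTHER": ("chr2", 101),
}
interval = artificial_interval(TOY_GENES, "TARGET", size=3)
```

Note that the two means are taken over different numbers of genes, because four genes have no interaction data and therefore no ToppNet rank.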

  • ToppGene Suite (

  • Why is a test set gene ranked higher? ToppGene Suite (

  • Part 3: Drug Repositioning

  • What is Drug Repositioning?

    Drug development: It takes about 15 years and $800 million to bring a drug to market! The number of new drugs approved by the FDA each year remains at just 20-30 compounds. At this rate it will take more than 300 years for the number of approved drugs to double!

    Instead, start from existing drugs (already on the market) or failed drugs (late-stage failures discontinued in development), and test them to uncover new applications. This bypasses the early stages of drug development required to assess toxicity, so candidates enter clinical trials comparatively quickly.

    Drug repositioning: discovery of novel disease indications for existing drugs.

    "The most fruitful basis for the discovery of a new drug is to start with an old drug." Sir James Black, Nobel Laureate, Physiology and Medicine, 1988

  • Because existing drugs have known pharmacokinetics and safety profiles, and are often approved by regulatory agencies for human use, any newly identified use can be rapidly evaluated in phase II clinical trials, which last ~two years and cost much less (~$17 million).

    In 2008, of the 31 new medicines that reached their first markets, drug repositioning accounted for one-third. Since this strategy is economically more attractive than de novo drug discovery and development, pharmaceutical and biotech companies have directed their efforts towards it.

    Viagra; Rogaine

  • Integrative Functional Genomics Approaches

    Topiramate: From epilepsy to obesity

    PRADAR (Pharmacoinformatics Radar): Pattern Recognition Algorithms for Drug Analysis and Repositioning

  • Adverse Drug Reactions Mouse Phenotype: New Indications?

  • PubMed, OMIM