csce555 bioinformatics

CSCE555 Bioinformatics, Lecture 21: Integrative Genomics. Meeting: MW 4:00PM-5:15PM, SWGN2A21. Instructor: Dr. Jianjun Hu. Course page: http://www.scigen.org/csce555. University of South Carolina, Department of Computer Science and Engineering, 2008. www.cse.sc.edu

Post on 21-Jan-2016

Page 1: CSCE555 Bioinformatics

CSCE555 Bioinformatics

Lecture 21 Integrative Genomics

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu

Course page: http://www.scigen.org/csce555

University of South Carolina

Department of Computer Science and Engineering

2008 www.cse.sc.edu.

Page 2: CSCE555 Bioinformatics

Outline

• What is Integrative Genomics
• Why integrative genomics
• The Data Sources
• Integrating strategies
• Issues in Integrative genomics
• Application Example: disease gene prioritization

Page 3: CSCE555 Bioinformatics

Integrative Genomics - what is it?

Acquisition, Integration, Curation, and Analysis of biological data

Integrative Genomics: the study of complex interactions between genes, organism, and environment, the "triple helix" of biology: Gene <-> Organism <-> Environment

It is definitely beyond the buzzword stage - Universities now have programs named 'Integrated Genomics.'

Hypothesis

Information is not knowledge - Albert Einstein

Page 4: CSCE555 Bioinformatics

Why Integrative Genomics? Support Complex Queries

• Show me all genes involved in brain development that are expressed in the Central Nervous System.

• Show me all genes involved in brain development in human and mouse that also show iron ion binding activity.

• For this set of genes, what aspects of function and/or cellular localization do they share?

• For this set of genes, what mutations are reported to cause pathological conditions?
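
Queries of this kind amount to intersecting hits from several annotation sources. The following is a minimal sketch, assuming three toy in-memory tables (biological process, expression site, molecular function); every gene name and annotation in it is an invented placeholder, not real data.

```python
# Toy annotation tables; every gene name and annotation is a made-up placeholder.
PROCESS = {
    "geneA": {"brain development"},
    "geneB": {"brain development", "cell cycle"},
    "geneC": {"cell cycle"},
}
EXPRESSED_IN = {
    "geneA": {"central nervous system"},
    "geneB": {"liver"},
    "geneC": {"central nervous system"},
}
FUNCTION = {
    "geneA": {"iron ion binding"},
    "geneB": {"iron ion binding"},
    "geneC": {"ATPase activity"},
}


def genes_with(table, value):
    """Return the set of genes whose annotation set in `table` contains `value`."""
    return {gene for gene, values in table.items() if value in values}


def complex_query(process, tissue=None, function=None):
    """Combine one hit set per constraint with a logical AND (set intersection)."""
    hits = genes_with(PROCESS, process)
    if tissue is not None:
        hits &= genes_with(EXPRESSED_IN, tissue)
    if function is not None:
        hits &= genes_with(FUNCTION, function)
    return sorted(hits)


cns_brain_genes = complex_query("brain development", tissue="central nervous system")
iron_brain_genes = complex_query("brain development", function="iron ion binding")
```

Each constraint is answered by a different source, which is why such questions cannot be answered from any single databank alone.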

Page 5: CSCE555 Bioinformatics

Integrative genomics for Biomedicine

How can multiple types of genome-scale data be integrated across experiments and phenotypes in order to find genes associated with diseases?

• To correlate diseases with
– anatomical parts affected,
– the genes/proteins involved, and
– the underlying physiological processes (interactions, pathways, processes).
• To support personalized or “tailor-made” medicine.

Page 6: CSCE555 Bioinformatics

Medical Informatics / Bioinformatics & the “omes”

[Figure: Two separate worlds, with some data exchange]

Figure labels: Patient Records, Disease Database, PubMed, Clinical Trials, OMIM Clinical Synopsis, Disease World

Disease Database fields: Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials ...

“Omes” shown: Genome, Transcriptome, Proteome, miRNAome, Interactome, Metabolome, Physiome, Regulome, Variome, Pathome

382 “omes” so far, and there is the “UNKNOME” too: genes with no known function.

http://omics.org/index.php/Alphabetically_ordered_list_of_omics

Page 7: CSCE555 Bioinformatics

Data Sources: The -Omics

Clinical data

Disease data

Page 8: CSCE555 Bioinformatics

Bioinformatic Data: 1978 to present

• DNA sequence
• Gene expression
• Protein expression
• Protein Structure
• Genome mapping
• SNPs & Mutations
• Metabolic networks
• Regulatory networks
• Trait mapping
• Gene function analysis
• Scientific literature
• and others

Page 9: CSCE555 Bioinformatics

Human Genome Project: Data Deluge

NCBI Human Genome Statistics, as of February 12, 2008

Database name       Records
Nucleotide          12,427,463
Protein             419,759
Structure           11,232
Genome Sequences    75
Popset              21,010
SNP                 11,751,216
3D Domains          41,857
Domains             19
GEO Datasets        5,036
GEO Expressions     16,246,778
UniGene             123,777
UniSTS              323,773
PubMed Central      4,278
HomoloGene          19,520
Taxonomy            1

No. of Human Gene Records currently in NCBI: 29413 (excluding pseudogenes, mitochondrial genes and obsolete records).

Includes ~460 microRNAs

Page 10: CSCE555 Bioinformatics

Information Deluge

• 3 scientific journals in 1750

• Now: >120,000 scientific journals!

• >500,000 medical articles/year

• >4,000,000 scientific articles/year

• >16 million abstracts in PubMed derived from >32,500 journals

A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer (Baasiri et al., 1999 Oncogene 18: 7958-7965).

Page 11: CSCE555 Bioinformatics

Methods for Integration

1. Link-driven federations
• Explicit links between databanks.
2. Warehousing
• Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.
• Integrative analysis
3. Others: Semantic Web, etc.

Page 12: CSCE555 Bioinformatics

Link-driven Federations

1. Create explicit links between databanks.

2. Query: get interesting results, then use web links to reach related data in other databanks.

Examples: NCBI-Entrez, SRS
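
The link-following idea can be sketched in a few lines. This is only an illustration under stated assumptions: the four "databanks" are toy dictionaries, the record identifiers are hypothetical, and `follow_links` is a made-up helper, not how Entrez or SRS is implemented.

```python
# Toy databanks; each record carries explicit links (cross-references)
# to records in other databanks. All identifiers are hypothetical.
DATABANKS = {
    "gene": {
        "G1": {"name": "geneA", "links": [("protein", "P1"), ("literature", "L1")]},
    },
    "protein": {
        "P1": {"name": "proteinA", "links": [("structure", "S1")]},
    },
    "structure": {
        "S1": {"name": "structureA", "links": []},
    },
    "literature": {
        "L1": {"name": "paperA", "links": []},
    },
}


def follow_links(databank, record_id, max_hops=2):
    """Start from one query hit and follow explicit links up to `max_hops` away."""
    seen = {(databank, record_id)}
    frontier = [(databank, record_id)]
    for _ in range(max_hops):
        next_frontier = []
        for bank, rid in frontier:
            for target in DATABANKS[bank][rid]["links"]:
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return sorted(seen)


# Everything reachable from gene record G1 within two hops.
related = follow_links("gene", "G1")
one_hop = follow_links("gene", "G1", max_hops=1)
```

Note that the data stay in their home databanks; only the links are traversed, which is what distinguishes a federation from a warehouse.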

Page 13: CSCE555 Bioinformatics

http://www.ncbi.nlm.nih.gov/Database/datamodel/

Page 14: CSCE555 Bioinformatics

http://www.ncbi.nlm.nih.gov/Database/datamodel/

Page 15: CSCE555 Bioinformatics

http://www.ncbi.nlm.nih.gov/Database/datamodel/

Page 16: CSCE555 Bioinformatics

http://www.ncbi.nlm.nih.gov/Database/datamodel/

Page 17: CSCE555 Bioinformatics

http://www.ncbi.nlm.nih.gov/Database/datamodel/

Page 18: CSCE555 Bioinformatics

Querying Entrez-Gene

Page 19: CSCE555 Bioinformatics
Page 20: CSCE555 Bioinformatics

Data Warehousing

Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.

Advantages

1. Good for very specific, task-based queries and studies.

2. Since it is custom-built and usually expert-curated, it is relatively less error-prone.

Disadvantages

1. Can quickly become outdated; needs constant updates.

2. Limited functionality, e.g., restricted to one disease or one system.
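
The download, filter, integrate, store, query cycle can be illustrated with an in-memory SQLite database from the Python standard library. This is a minimal sketch, not a production design: the two "downloaded" sources, the table names, and all gene and disease values are hypothetical placeholders.

```python
import sqlite3

# Hypothetical records "downloaded" from two sources; all values are placeholders.
gene_source = [("geneA", "1p1"), ("geneB", "2q2"), ("", "3p3")]  # (symbol, cytoband)
disease_source = [("geneA", "disease X"), ("geneB", "disease X"), ("geneB", "disease Y")]

# Store step: one local warehouse holding both sources.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, cytoband TEXT)")
warehouse.execute("CREATE TABLE gene_disease (symbol TEXT, disease TEXT)")

# Filter step: drop records with an empty gene symbol before loading.
warehouse.executemany("INSERT INTO gene VALUES (?, ?)", [r for r in gene_source if r[0]])
warehouse.executemany("INSERT INTO gene_disease VALUES (?, ?)", disease_source)
warehouse.commit()


def genes_for_disease(disease):
    """Answer a query from the warehouse only; the original sources are not contacted."""
    rows = warehouse.execute(
        "SELECT g.symbol, g.cytoband FROM gene g "
        "JOIN gene_disease d ON d.symbol = g.symbol "
        "WHERE d.disease = ? ORDER BY g.symbol",
        (disease,),
    )
    return rows.fetchall()


disease_x_genes = genes_for_disease("disease X")
```

Because queries never go back to the sources, the warehouse is only as current as its last load, which is the first disadvantage listed above.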

Page 21: CSCE555 Bioinformatics

Integrative data analysis

• Data is downloaded and filtered
• Inference algorithms integrate heterogeneous data
• Evidence from a single data source is usually weak; integration enhances the signal
• Agreement across sources has a cross-validation effect that reduces false positives
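
One common way to see why integration enhances weak signals is a naive Bayes combination of likelihood ratios. The sketch below is an illustration only, not a method taken from the lecture: it assumes the data sources are independent, and the prior and likelihood-ratio values are arbitrary example numbers.

```python
import math


def combine_evidence(prior, likelihood_ratios):
    """Posterior probability from a prior and one likelihood ratio per data source.

    Assumes the sources are independent (naive Bayes), so their
    log-likelihood ratios simply add to the prior log-odds.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


prior = 0.01                                              # assumed prior that a gene is relevant
single = combine_evidence(prior, [3.0])                   # one weak source
combined = combine_evidence(prior, [3.0, 3.0, 3.0, 3.0])  # four weak, agreeing sources
conflicting = combine_evidence(prior, [3.0, 1.0 / 3.0])   # two sources that disagree
```

With these example numbers, a single source moves the probability from 0.01 to roughly 0.03, while four agreeing sources move it to about 0.45; two disagreeing sources cancel out and return the prior. Real data sources are rarely independent, so this overstates the gain in practice.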

Page 22: CSCE555 Bioinformatics

Common Issues in Integrative Genomics

• Heterogeneous Data Sets: Data Integration
– From Genotype to Phenotype
– Experimental and Consensus Views
• Incorporation of Large Datasets
– Whole genome annotation pipelines
– Large scale mutagenesis/variation projects (dbSNP)
• Computational vs. Literature-based Data Collection and Evaluation (MedLine)
• Data Mining
– extraction of new knowledge
– testable hypotheses (Hypothesis Generation)

Page 23: CSCE555 Bioinformatics

No Integrative Genomics is Complete without Ontologies

• Gene Ontology (GO)

• Unified Medical Language System (UMLS)

Gene World Biomedical World

Page 24: CSCE555 Bioinformatics

The 3 Gene Ontologies (Recap)

• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
– what a product ‘does’, precise activity
• Biological Process = biological goal or objective
– broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
– biological objective, accomplished via one or more ordered assemblies of functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
– ‘is located in’ (‘is a subcomponent of’)

http://www.geneontology.org

Page 25: CSCE555 Bioinformatics

What can researchers do with GO?

• Access gene product functional information
• Find how much of a proteome is involved in a process/function/component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and
– gene expression profiles
– proteomics data
• Getting the GO and GO_Association Files
• Data Mining
– My Favorite Gene
– By GO
– By Sequence
• Analysis of Data
– Clustering by function/process
• Other Tools
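
As a rough illustration of the "how much of a proteome" question, the sketch below counts gene products per GO term from a few association rows. It assumes the tab-separated GO annotation file (GAF) column order (symbol in column 3, GO ID in column 5, aspect in column 9); the rows themselves, including the gene products, are invented placeholders, and the helper names are hypothetical.

```python
from collections import defaultdict

# Toy rows laid out like a GO association (GAF) file, truncated after column 9.
# Gene products are placeholders. GO:0006281 is DNA repair, GO:0005634 is nucleus.
GAF_LINES = [
    "!gaf-version: 2.2",
    "DB\tID1\tgeneA\t\tGO:0006281\tREF:1\tIDA\t\tP",
    "DB\tID2\tgeneB\t\tGO:0006281\tREF:2\tIEA\t\tP",
    "DB\tID2\tgeneB\t\tGO:0005634\tREF:2\tIDA\t\tC",
    "DB\tID3\tgeneC\t\tGO:0016887\tREF:3\tIDA\t\tF",
    "DB\tID4\tgeneD\t\tGO:0005634\tREF:4\tIDA\t\tC",
]


def parse_annotations(lines):
    """Return (all gene products, {(aspect, GO ID): set of gene products})."""
    products = set()
    by_term = defaultdict(set)
    for line in lines:
        if line.startswith("!"):  # header/comment lines start with '!'
            continue
        cols = line.rstrip("\n").split("\t")
        symbol, go_id, aspect = cols[2], cols[4], cols[8]
        products.add(symbol)
        by_term[(aspect, go_id)].add(symbol)
    return products, by_term


def fraction_annotated(products, by_term, aspect, go_id):
    """Fraction of annotated gene products carrying a direct annotation to the term."""
    return len(by_term.get((aspect, go_id), set())) / len(products)


products, by_term = parse_annotations(GAF_LINES)
dna_repair_fraction = fraction_annotated(products, by_term, "P", "GO:0006281")
```

This sketch counts direct annotations only. A fuller analysis would also propagate annotations from descendant terms up the ontology and would use the whole proteome, not just annotated products, as the denominator.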

Page 26: CSCE555 Bioinformatics

Unified Medical Language System (UMLS)

http://umlsks.nlm.nih.gov/kss/

The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.

The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.

The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.

Page 27: CSCE555 Bioinformatics
Page 28: CSCE555 Bioinformatics

Example Study: Disease Gene Identification and Prioritization

Hypothesis: The majority of genes that impact or cause disease share membership in any of several functional relationships, OR functionally similar or related genes cause similar phenotypes.

Functional Similarity: common/shared
• Gene Ontology term
• Pathway
• Phenotype
• Chromosomal location
• Expression
• Cis regulatory elements (transcription factor binding sites)
• miRNA regulators
• Interactions
• Other features
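
The hypothesis suggests a simple ranking scheme: score each candidate gene by how many features it shares with known disease genes. The sketch below uses mean Jaccard similarity as a stand-in score; this is an assumption made for illustration, not the scoring method of any particular tool, and all gene names and feature labels are invented.

```python
# Feature sets (GO terms, pathways, miRNA regulators, ...) as plain string labels.
# All genes and features are hypothetical placeholders.
TRAINING = {  # known disease genes
    "known1": {"GO:term1", "pathway:P1", "mirna:m1"},
    "known2": {"GO:term1", "pathway:P2"},
}
CANDIDATES = {
    "cand1": {"GO:term1", "pathway:P1"},
    "cand2": {"GO:term9"},
    "cand3": {"pathway:P2", "mirna:m1", "GO:term9"},
}


def jaccard(a, b):
    """Shared features divided by all features seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prioritize(training, candidates):
    """Rank candidates by mean similarity to the training genes, best first."""
    scores = {
        gene: sum(jaccard(features, known) for known in training.values()) / len(training)
        for gene, features in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = prioritize(TRAINING, CANDIDATES)
ranked_genes = [gene for gene, _ in ranking]
```

Treating every feature type equally is the simplest possible choice; in practice the feature types listed above differ in coverage and reliability and would likely need separate weighting.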

Page 29: CSCE555 Bioinformatics

Mining human interactome

HPRD, BioGrid

Page 30: CSCE555 Bioinformatics

Example: Breast cancer

OMIM genes (level 0): 15

Directly interacting genes (level 1): 342

Indirectly interacting genes (level 2): 2469!
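
Expanding a seed set of disease genes through an interaction network is a breadth-first search. The sketch below assumes an undirected toy network; the gene names and interactions are invented, and `levels_from_seeds` is a hypothetical helper rather than part of any interactome tool.

```python
from collections import Counter, deque

# Toy undirected interaction list; names are placeholders.
INTERACTIONS = [
    ("seed1", "a"), ("seed1", "b"), ("seed2", "b"),
    ("a", "c"), ("b", "d"), ("d", "e"),
]


def build_graph(edges):
    """Adjacency sets for an undirected interaction network."""
    graph = {}
    for left, right in edges:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def levels_from_seeds(graph, seeds, max_level=2):
    """Breadth-first search: level 0 = seeds, level 1 = direct interactors, and so on."""
    level = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if level[node] == max_level:
            continue  # do not expand beyond the requested level
        for neighbour in graph.get(node, ()):
            if neighbour not in level:
                level[neighbour] = level[node] + 1
                queue.append(neighbour)
    return level


levels = levels_from_seeds(build_graph(INTERACTIONS), ["seed1", "seed2"])
genes_per_level = Counter(levels.values())
```

The number of genes can grow quickly with each level, so candidates reached only through indirect interactions usually need additional evidence before they are worth following up.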

Page 31: CSCE555 Bioinformatics

ToppGene - General Schema

http://toppgene.cchmc.org

Page 32: CSCE555 Bioinformatics

ToppGene - Data Sources

1. Gene Ontology: GO and NCBI Entrez Gene
2. Mouse Phenotype: MGI (used for the first time for human disease gene prioritization)
3. Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB
4. Domains: UniProt (Pfam, Interpro, etc.)
5. Interactions: NCBI Entrez Gene (Biogrid, Reactome, BIND, HPRD, etc.)
6. Pubmed IDs: NCBI Entrez Gene
7. Expression: GEO
8. Cytoband: MSigDB
9. Cis-Elements: MSigDB
10. miRNA Targets: MSigDB

New features added

Page 33: CSCE555 Bioinformatics

Benefits of Integrative Genomics

1. Unravels the connection between genotype and phenotype: systematically identifies novel phenotype-genotype relationships.

2. Serves as a hypothesis generator.

3. Paves the way for prognosis, diagnosis, and personalized medicine (adverse drug reactions, etc.).

4. Gives a deeper understanding of disease and an enhanced integration of medicine with biology.

5. Increasing knowledge of the genes associated with diseases will allow researchers to address more complicated issues, including the relative contributions to disease of genes in the core biological set shared by all species and those encoding proteins specific to humans; how sequence features (such as conservation and polymorphism) relate to disease characteristics; and how protein function relates to the outcome of clinical treatment.

6. And many more.

Page 34: CSCE555 Bioinformatics

Summary

• Networks and integration of databases are keys to success in Bioinformatics.
• Integration of computation and data into a single cohesive whole will increase the efficiency of research effort
◦ by reducing the serendipity and hit-and-miss nature of empirical research, and
◦ by providing valuable clues to biomedical researchers on their choice of experiments, given limitations of funds, manpower and time.
• Users have to know what is available, how to access it (and what the limitations are), and how to use the resources they are offered.

Page 35: CSCE555 Bioinformatics

Thank You!

Page 36: CSCE555 Bioinformatics

Algorithms in bioinformatics

• string algorithms
• dynamic programming
• machine learning (NN, k-NN, SVM, GA, ..)
• Markov chain models
• hidden Markov models
• Markov Chain Monte Carlo (MCMC) algorithms
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…