
Content, Format, and Standards in Genomics Scale Data

The ILSI – EBI Collaboration

Wm. B. Mattes, PhD, DABT

Outline

Why do we need a database for toxicogenomics

How is it envisioned that this will be developed

What are the issues for such a database

Who is involved in such development

The ILSI – EBI Collaboration

Traditional Biology

One tree at a time

“Omic” Biology

Forests and

Mountains

Challenge of Genomics

“It’s the informatics, period!”

And it’s awfully tempting to take shortcuts!

Experiment → INFORMATICS (?) → Biological Explanation

Why do we need a database?

Volume of data

Traditional endpoints per animal
• <20 histopathology observations
• <10 gross measurements (e.g. weights, food)
• <25 serum measurements
• <10 urinalysis measurements

Genomic endpoints per animal
• 5,000-10,000 transcripts!
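To put the volume difference in numbers, here is a rough back-of-the-envelope calculation using the upper bounds quoted on the slide. The variable names are illustrative only and are not part of the original talk.

```python
# Upper bounds on traditional endpoints per animal, taken from the slide.
traditional = {
    "histopathology observations": 20,
    "gross measurements": 10,
    "serum measurements": 25,
    "urinalysis measurements": 10,
}
traditional_total = sum(traditional.values())  # the true count is below this

# Genomic endpoints per animal (transcripts), range from the slide.
transcripts_low, transcripts_high = 5_000, 10_000

# Because the traditional total is an upper bound, these ratios are lower bounds.
fold_low = transcripts_low / traditional_total
fold_high = transcripts_high / traditional_total

print(f"Traditional endpoints per animal: < {traditional_total}")
print(f"Genomic endpoints: at least {fold_low:.0f}x to {fold_high:.0f}x more per animal")
```

Even on these generous assumptions, a single microarray yields roughly two orders of magnitude more measurements per animal than a conventional study.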

Why do we need a database? (cont)

Influence of technology details

Influence of probe sequence
• Many genes are “alternatively spliced” – such events may not be detected unambiguously by a microarray

Influence of Probe Sequence

Most arrays target this region of the mRNA!

Why do we need a database? (cont)

Influence of technology details

Influence of probe sequence
• For cDNA arrays, probes may hybridize to more than one sequence
• A database that captures probe sequence is required to resolve discrepancies through automated bioinformatics
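A minimal sketch of the kind of automated check that becomes possible once probe sequences are stored. It uses exact substring matching as a stand-in for hybridization, which is a strong simplification, and all sequences and identifiers are invented for illustration.

```python
from collections import defaultdict

def find_probe_targets(probes, transcripts):
    """Map each probe ID to the transcript IDs whose sequence contains it.

    Exact substring matching is a simplification; a real pipeline would
    use a sequence-alignment tool that tolerates mismatches.
    """
    hits = defaultdict(list)
    for probe_id, probe_seq in probes.items():
        for transcript_id, transcript_seq in transcripts.items():
            if probe_seq in transcript_seq:
                hits[probe_id].append(transcript_id)
    return dict(hits)

# Toy, invented sequences: two splice variants share the same 3' end.
transcripts = {
    "geneA_variant1": "ATGGCCTTTGACCGGTTAACCGTAGGCTA",
    "geneA_variant2": "ATGGCCGGTTAACCGTAGGCTA",
    "geneB": "ATGTTTCCCGGGAAATTTCCC",
}
probes = {
    "probe_3prime": "CCGTAGGCTA",  # lies in the shared 3' region
    "probe_exon2": "TTTGACC",      # present only in variant 1
}

targets = find_probe_targets(probes, transcripts)
# Probes that match more than one transcript cannot distinguish them.
ambiguous = {p: t for p, t in targets.items() if len(t) > 1}
print(ambiguous)
```

In this toy case the 3' probe matches both splice variants, so a signal from it cannot say which variant changed; the exon-specific probe can.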

How are databases being developed?

Microarray Gene Expression Data Society (MGED Society)

MIAME: Minimum Information About a Microarray Experiment
• “the minimum information that should be reported about a microarray experiment to enable its unambiguous interpretation and reproduction”
• Essentially, what should go into the database

How are databases being developed?

MIAME – Basic Areas
• Experiment design
• Samples used, extract preparation and labeling
• Hybridization procedures and parameters
• Measurement data and specifications
• Array design
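One way to picture MIAME as "what should go into the database" is a completeness check over the five areas listed above. This is only a sketch: the dictionary keys and the sample submission are hypothetical and are not the official MIAME field names.

```python
# The five MIAME areas named on the slide, as illustrative dictionary keys.
MIAME_AREAS = (
    "experiment_design",
    "samples_extract_labeling",
    "hybridization",
    "measurements",
    "array_design",
)

def missing_miame_areas(record):
    """Return the MIAME areas that are absent or empty in a submission."""
    return [area for area in MIAME_AREAS if not record.get(area)]

# Hypothetical submission with one empty and one missing area.
submission = {
    "experiment_design": {"type": "dose response"},
    "samples_extract_labeling": {"organism": "Rattus norvegicus"},
    "hybridization": {},  # present but empty, so still incomplete
    "measurements": {"file": "raw_intensities.txt"},
}

print(missing_miame_areas(submission))  # ['hybridization', 'array_design']
```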

How are databases being developed? (cont)

MGED Society

MAGE
• Programming conventions and data structures to communicate Microarray Gene Expression data
- MAGE-OM: Object Model
- MAGE-ML: Markup Language
• Essentially, how the data is exchanged and how the database is constructed

How are databases being developed? (cont)

MGED Society

Ontology working group
• Ontologies provide a vocabulary for representing and communicating knowledge about a topic, allowing interpretation and use by computers
• MGED Ontology will provide standard terms for the annotation of microarray experiments, allowing:
- structured queries
- unambiguous descriptions of experiments
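The link between standard terms and structured queries can be sketched as follows. The vocabulary here is invented for illustration and is not taken from the MGED Ontology.

```python
# Hypothetical controlled vocabulary: free-text variants -> one standard term.
CONTROLLED_TERMS = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "rat": "Rattus norvegicus",
    "rattus norvegicus": "Rattus norvegicus",
}

def annotate(free_text):
    """Return the controlled term for a free-text label, or raise if unknown."""
    key = free_text.strip().lower()
    if key not in CONTROLLED_TERMS:
        raise ValueError(f"No controlled term for {free_text!r}")
    return CONTROLLED_TERMS[key]

# Two labs describe the same thing differently; annotation makes them agree.
experiments = [
    {"id": "E1", "organism": annotate("Rat"), "tissue": annotate("Liver")},
    {"id": "E2", "organism": annotate("rattus norvegicus"),
     "tissue": annotate("hepatic tissue ")},
]

# A structured query: exact match on standard terms, no text guessing needed.
liver_ids = [e["id"] for e in experiments if e["tissue"] == "liver"]
print(liver_ids)  # ['E1', 'E2']
```

Rejecting unknown labels, rather than storing them as-is, is a design choice made here to keep the annotation unambiguous.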

How are databases being developed? (cont)

MGED Society

Data Transformation and Normalization Working Group
• Standards for recording how microarray data are transformed and normalized
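As an illustration of why the processing steps need to be recorded, the sketch below applies one common choice (log2 transform followed by median centering) and returns a record of what was done alongside the values. The method and the record layout are assumptions for the example, not something prescribed by the working group.

```python
import math
import statistics

def log2_median_center(intensities):
    """Log2-transform raw intensities, then subtract the array median.

    Returns the normalized values together with a record of the steps
    applied, so the processing can be reported alongside the data.
    """
    logged = [math.log2(x) for x in intensities]
    median = statistics.median(logged)
    normalized = [v - median for v in logged]
    provenance = [
        {"step": "transform", "method": "log2"},
        {"step": "normalize", "method": "median centering", "median": median},
    ]
    return normalized, provenance

values, steps = log2_median_center([256.0, 1024.0, 4096.0])
print(values)  # [-2.0, 0.0, 2.0]
print(steps)
```

Without the second return value, two labs could report numerically different results from identical raw data and have no way to trace why.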

What are the issues for a toxicogenomics database?

Scope of the ILSI effort:

Genotoxicity Group
• 10 array platforms
• 11 compounds
• >2 time points, up to 10 doses / compound

Nephrotoxicity Group
• 6 array platforms
• 3 compounds, 260 animals

What are the issues for a toxicogenomics database?

Scope of the ILSI effort:

Hepatotoxicity Group
• 8 array platforms
• 2 compounds, 144 animals
• 2 in-life studies / compound

ALL Groups
• Analysis of each sample at multiple sites

What are the issues for toxicogenomics databases? (cont)

Traditional toxicology endpoints are not currently covered by MAGE, MIAME, or the MGED Ontologies!
• Organ weights
• Clinical pathology
• Histopathology
• Etc.

What are the issues for toxicogenomics databases?

Traditional toxicology endpoints are not standardized in nomenclature

Clinical pathology/chemistry
• AACC
• IUPAC

Histopathology
• STP
• WHO/IARC/RITA
• NACAD
• SNOMED
• NTP, TDMS Database Pathology Code Table

Who is involved in database development

Private Companies
• Gene Logic, Iconix, CuraGen

MSU - dbZach
NIEHS - CEBS
NCTR - ArrayTrack
ILSI - EBI

ILSI-HESI and EBI collaboration

Establishment of database for toxicogenomics data

Capture, store and analyse gene expression data produced from many different toxicogenomic experiments, conducted in several different laboratories worldwide by the ILSI-HESI members

Interrogate the gene array data integrating information from genomic, experimental and toxicological domains

Gain knowledge of possible links between gene expression changes and toxicological endpoints

ILSI-HESI and EBI collaboration

Aims of the database and tools
• Provide a way to integrate the different domains
• Control the annotation to achieve data harmonization
• Centralize the information to ease data access and data sharing
• Improve array annotations as the genome assemblies are released
• ALLOW data comparison

ILSI-HESI and EBI collaboration

Main challenge
• Get internally consistent data to allow comparability among the experiments and run complex queries across and within domains
• Note: experiments were conducted at ~40 different sites, using different array platforms and terminologies, measuring parameters with different units, and storing information in different formats!
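The "different units" problem can be made concrete with a small conversion step applied before data from several sites are compared. Serum glucose is used purely as an example; the site names and values are invented, and the mmol/L factor assumes a glucose molar mass of about 180.16 g/mol.

```python
# Conversion factors to a common unit (mg/dL) for serum glucose.
# 1 g/L = 100 mg/dL; 1 mmol/L of glucose is about 18.016 mg/dL.
TO_MG_PER_DL = {
    "mg/dL": 1.0,
    "g/L": 100.0,
    "mmol/L": 18.016,
}

def harmonize(value, unit):
    """Convert a glucose measurement to mg/dL; reject unknown units."""
    try:
        return value * TO_MG_PER_DL[unit]
    except KeyError:
        raise ValueError(f"Unknown unit: {unit!r}") from None

# Hypothetical reports of comparable measurements from three sites.
site_reports = [
    ("site_1", 100.0, "mg/dL"),
    ("site_2", 1.0, "g/L"),
    ("site_3", 5.0, "mmol/L"),
]

for site, value, unit in site_reports:
    print(site, round(harmonize(value, unit), 1), "mg/dL")
```

Only after this step do the three raw values (100, 1 and 5) become comparable numbers.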

ILSI-HESI and EBI collaboration

‘Simple’ question:
• “Does expression of gene X go up after treatment with compound Y with biological endpoint Z in experiments from ILSI-HESI members A and B?”

‘Not simple’ question:
• “Which are the most reproducible gene expression changes (and the quantitative measure of this reproducibility) for all experiments on the rat arrays with biological endpoint X, which functional categories do these genes belong to, and which are the human homologues?”
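The ‘simple’ question maps naturally onto a database query once the data are harmonized. The sketch below uses Python's built-in sqlite3 module with an invented single-table schema and invented values; it does not reflect the actual ArrayExpress schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE expression (
           member TEXT, compound TEXT, endpoint TEXT,
           gene TEXT, log2_ratio REAL)"""
)
# Invented rows: (member, compound, endpoint, gene, log2 treated/control).
rows = [
    ("A", "Y", "Z", "X", 1.8),
    ("B", "Y", "Z", "X", 1.2),
    ("A", "Y", "Z", "W", -0.4),
    ("C", "Y", "Z", "X", -0.9),
]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?, ?, ?)", rows)

# "Does gene X go up after compound Y with endpoint Z for members A and B?"
query = """
    SELECT member, AVG(log2_ratio) > 0 AS goes_up
    FROM expression
    WHERE gene = ? AND compound = ? AND endpoint = ? AND member IN ('A', 'B')
    GROUP BY member
    ORDER BY member
"""
result = conn.execute(query, ("X", "Y", "Z")).fetchall()
print(result)  # [('A', 1), ('B', 1)]
```

The query only works because every site used the same values for gene, compound and endpoint, which is exactly what the annotation control is meant to guarantee. The ‘not simple’ question would additionally need reproducibility statistics, functional categories and homology mappings, which this sketch does not attempt.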

An international effort aiming to
• Share expertise
• Encourage harmonization
• Promote standardization initiatives

A call for community participation!

NIEHS-NCT

EMBL-EBI

Toxicogenomics ILSI-HESI

MIAME/Tox

MIAME/Tox objectives

Standard contextual information
• Establish worldwide scientific consensus on the minimal information descriptors for array-based toxicogenomics experiments

Data harmonization
• Encourage use of controlled vocabularies for the toxicological assessments

Data integration and data sharing
• Link data within a study
• Link several studies from one institution
• Exchange datasets among institutions

Data storage
• Facilitate development of MIAME/Tox compliant data management software and databases
- ArrayExpress @ EBI and CEBS @ NIEHS-NCT

MIAME/Tox document

Promote standard contextual information
• Defining the core common to most experiments
- Minimum/sufficient information
- Structured information

Promote data harmonization, data capture and communication
• MIAME/Tox is based on MIAME

Focus on toxicological domain
• Sample treatment and conventional toxicology information
- Clinical pathology, pathology, histopathology…

MIAME/Tox document

Available at the MGED Society and ILSI-HESI web sites
• Circulate for consensus
- Toxicogenomics, pharmacogenomics and ecotoxicogenomics communities
- Regulatory bodies
- MGED Meeting (AAAS, Denver, Feb 2003; MGED6, France, Sept 2003)
- Toxicology societies (SOT Meeting, Salt Lake City, March 2003)
• Review and publish

Work closely with the MGED working groups
• Ontology working group
- Identify controlled vocabularies for toxicological metadata

Data Input As a Key Step

1. Capture data in a standard manner: Tox-MIAMExpress

2. Store information domains in database: ArrayExpress

3. Compare/query across and within domains

Tox-MIAMExpress

Protocols
• Conventional toxicology tests
• Microarray experiments

Tox-MIAMExpress

Array designs
• A set of procedures for formatting the array design information into a standard referencing format (ADF)
• A set of procedures to re-annotate or update the array designs via a link to another database at EBI (EnsMart)
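Re-annotation can be pictured as refreshing each array feature's gene label from a newer lookup keyed on a stable sequence identifier. The sketch below uses plain dictionaries with invented accessions; it is not the ADF file format and does not call any EnsMart interface.

```python
def reannotate(array_design, current_annotation):
    """Return a copy of the array design with gene symbols refreshed.

    `array_design` maps feature ID -> dict with a stable sequence accession;
    `current_annotation` maps accession -> up-to-date gene symbol.
    Features with no current annotation keep their old symbol.
    """
    updated = {}
    for feature_id, feature in array_design.items():
        new_symbol = current_annotation.get(feature["accession"], feature["gene"])
        updated[feature_id] = {**feature, "gene": new_symbol}
    return updated

# Invented array design: one feature was unannotated when the array was made.
array_design = {
    "f001": {"accession": "ACC0001", "gene": "EST_unknown"},
    "f002": {"accession": "ACC0002", "gene": "Cyp1a1"},
}
# Invented lookup table standing in for a newer genome annotation release.
latest = {"ACC0001": "Hmox1"}

refreshed = reannotate(array_design, latest)
print(refreshed)
```

Returning a copy rather than editing in place keeps the original design available, so results annotated under the old release can still be traced.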

Tox-MIAMExpress

Experiment
• Experiment design, quality controls, publications
• Sample source and treatment
• Conventional toxicology test data
• Microarray hybridization data


ILSI-HESI and EBI collaboration

Status:
• Interface and database infrastructure developed
• Data input ongoing

Acknowledgments

Microarray Informatics Team at EBI, in particular

Alvis Brazma (Team Leader and MGED Society President)

Susanna-Assunta Sansone

Philippe Rocca-Serra (Data Management)

NIEHS-NCT and NTP

ILSI-HESI EBI Steering Committee

ILSI-HESI Genomics Committee