
Database Representation of Phenotype: Issues and Challenges

Prakash Nadkarni

Human Phenotyping Studies

“Phenotype” means different things to clinical researchers and to classical human or animal geneticists.

To the latter, it has traditionally been a “syndrome”, consisting of one or more detectable or visible traits.

These days, it is often likely to be defined in terms of variation from the norm (for better or for worse).

The single most useful catalog of human variation is Online Mendelian Inheritance in Man (OMIM). Being a text database, OMIM has limited computability.

Why standardize electronic phenotype representation?

In the post-sequencing era of genomics, electronic publication of primary data may eventually be mandated, as is already the case for sequence data in GenBank.

Requirements:

As in publication of research papers, the description must be detailed and unambiguous enough to allow others to reproduce the experimental design.

Must allow the possibility of data mining by analytical tools. While complete “understanding” of the data by a computer is rarely possible, the objective is to facilitate understanding by computer-aided human interaction.

Challenges and Caveats

The data itself is highly diverse: sub-cellular, cellular, organ system, clinical. For structured data, the format of the data depends on its nature.

A single batch of data could include combinations of types of data (e.g., both alterations in enzyme function and clinical descriptors).

The potential number of data elements could run into the hundreds of thousands across the entire life sciences field.

Phenotype-genotype “correlation” becomes difficult in multigenic disorders where thousands of genetic loci have been screened. A given trait may be the result of several interacting haplotypes.

Challenges and Caveats - II

Data does not age well with time. If it is more than a few years old, it is unlikely to be of much use, because newly discovered parameters now considered essential to phenotype characterization may be missing from old data (for example, in diabetes: insulin deficiency vs. resistance at various levels).

Mining of someone else’s data only suggests hypotheses. To confirm them, one usually needs to go back to the original subjects, which is not always possible. (Reconsenting, HIPAA).

How has Phenotype Data been represented so far?

Unstructured Data: Narrative Text is used to describe qualitative findings (e.g., OMIM)

Structured Data: Multiple values of quantitative data are typically represented in tables/spreadsheets that are described and annotated by accompanying text (descriptors or “meta-data”).

For a mix of structured and unstructured data, one must fall back on narrative (free) text. (Even after the standardization approaches described later, narrative text will still be necessary, because it captures nuances that codification cannot.)

Phenotype Database Structure

The problem of representing phenotypic data is very similar to the problem of representing clinical patient data in clinical patient record systems.

A vast number of clinical parameters can potentially apply to a human subject, but for a given clinical study, only a modest number of parameters actually apply.

The same modeling approach, Entity-Attribute-Value (EAV), can be used. It was first used in the TMR system (Stead and Hammond) in the 1970s.

For phenotyping data, the major challenge is one of imposing an organization on the universe of attributes, i.e., standardizing the metadata.
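As a minimal sketch of the Entity-Attribute-Value idea described above, the Python below stores one row per recorded fact and keeps the attribute definitions (the metadata) in a separate lookup. The class names, attribute IDs, subject identifiers and values are hypothetical, chosen only for illustration; they are not taken from TMR or from any particular repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Value = Union[float, str]


@dataclass(frozen=True)
class Attribute:
    """Metadata for one parameter (hypothetical minimal form)."""
    attribute_id: int
    name: str
    data_type: str  # "number" or "string"


@dataclass(frozen=True)
class EavRow:
    """One recorded fact: which entity, which attribute, what value."""
    entity_id: str
    attribute_id: int
    value: Value


# Illustrative attribute catalog; real catalogs may hold many thousands.
ATTRIBUTES: Dict[int, Attribute] = {
    1: Attribute(1, "blood glucose", "number"),
    2: Attribute(2, "body weight", "number"),
    3: Attribute(3, "clinical descriptor", "string"),
}

# Only the parameters that actually apply to a subject are stored.
ROWS: List[EavRow] = [
    EavRow("subject-001", 1, 95.0),
    EavRow("subject-001", 3, "polyuria"),
    EavRow("subject-002", 2, 70.5),
]


def values_for(entity_id: str, rows: List[EavRow],
               attributes: Dict[int, Attribute]) -> Dict[str, Value]:
    """Collect the recorded attribute names and values for one entity."""
    return {
        attributes[row.attribute_id].name: row.value
        for row in rows
        if row.entity_id == entity_id
    }
```

Calling `values_for("subject-001", ROWS, ATTRIBUTES)` returns only the two facts recorded for that subject. Adding a new kind of parameter means adding an entry to the attribute catalog rather than changing the row structure.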

Standardizing Phenotype MetaData

While the potential number of data elements is vast, the number of types of data elements is much fewer, and will possibly be tractable.

E.g., there are thousands of different clinical lab tests. However, all of them belong to one type of data (“lab test”). The description of a lab test is standardized by LOINC.

Data type, e.g., number, string.

Source of sample, how sample is collected (random vs. post-prandial, single vs. cumulative), how performed (bibliographic ref).

Units in which recorded, precision of method.

Maximum and minimum legal values, normal range of values, if applicable.
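The descriptors listed above for the “lab test” type of data element could be captured in a single record type. The following is only a sketch: the field names are hypothetical and do not reproduce the LOINC data model, the legal range is an assumed placeholder, and the normal range of 60-100 mg/dl is the glucose oxidase figure quoted elsewhere in this presentation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LabTestDescriptor:
    """Metadata describing one lab test (hypothetical field names)."""
    name: str
    data_type: str                  # e.g., "number" or "string"
    sample_source: str              # e.g., "blood"
    collection: str                 # e.g., "random" or "post-prandial"
    method_reference: str           # bibliographic reference for the method
    units: str
    precision: float                # precision of the method, in `units`
    legal_range: Tuple[float, float]
    normal_range: Optional[Tuple[float, float]] = None

    def is_legal(self, value: float) -> bool:
        """True if the value lies within the maximum and minimum legal values."""
        low, high = self.legal_range
        return low <= value <= high

    def is_normal(self, value: float) -> Optional[bool]:
        """True/False against the normal range; None if no range applies."""
        if self.normal_range is None:
            return None
        low, high = self.normal_range
        return low <= value <= high


GLUCOSE_OXIDASE = LabTestDescriptor(
    name="blood glucose",
    data_type="number",
    sample_source="blood",
    collection="random",
    method_reference="(bibliographic reference goes here)",
    units="mg/dl",
    precision=1.0,                  # assumed for illustration
    legal_range=(0.0, 2000.0),      # assumed for illustration
    normal_range=(60.0, 100.0),
)
```

With such a record, a submission tool could reject a value for which `GLUCOSE_OXIDASE.is_legal(value)` is false and flag one for which `is_normal(value)` is false.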

Objectives of Metadata Standardization

By structuring the metadata to a sufficient degree of richness, software can compare the scientific metadata accompanying the data items from disparate data sets and determine whether it is safe to combine them in a meta-analysis.

An example of data that cannot be directly combined is blood glucose measured by tests detecting reducing substances (normal 80-120 mg/dl) vs. tests based on glucose oxidase methods (normal 60-100 mg/dl).

A more mundane example is weight in kilograms vs. pounds, where conversion factors can be employed.
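The kind of check described above might look like the following sketch. The `ItemMetadata` record, the `can_combine` rule (same concept, same method, identical or convertible units) and the unit table are assumptions made for illustration; a real tool would compare far richer metadata. The pound-to-kilogram factor is the exact definition, 1 lb = 0.45359237 kg.

```python
from dataclasses import dataclass

KG_PER_LB = 0.45359237  # exact by definition

# Known unit conversions: (from_units, to_units) -> multiplication factor.
UNIT_FACTORS = {
    ("lb", "kg"): KG_PER_LB,
    ("kg", "lb"): 1.0 / KG_PER_LB,
}


@dataclass(frozen=True)
class ItemMetadata:
    """Hypothetical minimal metadata attached to one data item."""
    concept: str
    method: str
    units: str


def can_combine(a: ItemMetadata, b: ItemMetadata) -> bool:
    """Treat two items as combinable only if concept and method match
    and their units are identical or convertible."""
    if a.concept != b.concept or a.method != b.method:
        return False
    return a.units == b.units or (a.units, b.units) in UNIT_FACTORS


def convert(value: float, from_units: str, to_units: str) -> float:
    """Convert a value between units using the known factors."""
    if from_units == to_units:
        return value
    return value * UNIT_FACTORS[(from_units, to_units)]


glucose_reducing = ItemMetadata("blood glucose", "reducing substances", "mg/dl")
glucose_oxidase = ItemMetadata("blood glucose", "glucose oxidase", "mg/dl")
weight_kg = ItemMetadata("body weight", "scale", "kg")
weight_lb = ItemMetadata("body weight", "scale", "lb")
```

Here `can_combine(glucose_reducing, glucose_oxidase)` is false because the methods differ even though the units agree, while the two weight items can be combined once `convert` has brought them to a common unit.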

Mechanisms of Standardization: Controlled Vocabularies

Controlled Vocabularies organize the concepts (attributes) that describe a particular domain into a taxonomy.

They address the issue of different synonyms or lexical forms for the same concept (e.g., “glucose, blood” vs. “blood glucose”).

Concepts are assigned stable Identifiers which do not change between versions of the vocabulary.

Using an ID as part of the annotation of a dataset minimizes ambiguity, and reduces the need to supply numerous inferable details.

Limitations: for questionnaires, e.g., those used in psychometry, each questionnaire is essentially its own vocabulary, and mapping to “standard” vocabularies is rarely possible. (LOINC is trying to incorporate these questionnaires themselves, but newly devised questionnaires will be left out until a standards committee decides to incorporate them.)
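The synonym handling and stable identifiers described above can be illustrated with a toy lookup table. This is a sketch only: the concept ID "C0001" is invented, and the normalization (lowercasing, dropping commas, sorting the words) is a deliberately crude assumption that would merge unrelated terms in a real vocabulary.

```python
from typing import Dict, Iterable, Optional


def normalize(term: str) -> str:
    """Reduce a term to a canonical lexical form: lowercase, commas
    removed, words sorted so that word order does not matter."""
    words = term.lower().replace(",", " ").split()
    return " ".join(sorted(words))


class ControlledVocabulary:
    """Toy vocabulary mapping lexical forms to stable concept IDs."""

    def __init__(self) -> None:
        self._id_by_form: Dict[str, str] = {}
        self._preferred: Dict[str, str] = {}

    def add_concept(self, concept_id: str, preferred: str,
                    synonyms: Iterable[str] = ()) -> None:
        """Register a concept under its preferred name and synonyms."""
        self._preferred[concept_id] = preferred
        for form in (preferred, *synonyms):
            self._id_by_form[normalize(form)] = concept_id

    def lookup(self, term: str) -> Optional[str]:
        """Return the concept ID for a term, or None if it is unknown."""
        return self._id_by_form.get(normalize(term))

    def preferred_name(self, concept_id: str) -> str:
        """Return the preferred name registered for a concept ID."""
        return self._preferred[concept_id]


vocab = ControlledVocabulary()
# "C0001" is a made-up identifier, not one from any real vocabulary.
vocab.add_concept("C0001", "blood glucose", synonyms=["blood sugar"])
```

Both `vocab.lookup("glucose, blood")` and `vocab.lookup("Blood Glucose")` resolve to the same ID, so a dataset annotated with that ID is unaffected by which lexical form the submitter used.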

Examples of Controlled Vocabularies / IDs in the Life Sciences

Bibliographic databases: PubMed ID, OMIM ID, etc.

Sequence and Other Biology Databases: Gene Ontology ID, Geninfo ID, EC Number, etc.

Clinical: SNOMED for clinical diagnosis, LOINC for Lab tests and Clinical Observations.

Different controlled vocabularies are specialized for managing different types of data. Where there is overlap, some are known to do a better job than others.

Vocabulary identifiers can be used liberally for annotation and cross-referencing in narrative text, as in OMIM (structured text). Some degree of automated mapping of text to identifiers is possible, but the results require curation.

Requirements of Submission Tools

Must manage a balancing act: ease and convenience of use vs. rigorous validation.

Must facilitate the (re-)use of standard definitions, both within a submission and across submissions by the same investigator or by different investigators.

The repository’s curators need to define the minimum set of mandatory descriptors (cf. MAGE for microarray experiments). The value of data that is not accompanied by this minimum set may be markedly diminished for investigators other than those who originated it. (The nature of the minimum set varies with the category of data.)

Requirements (2)

The submission tools must provide intuitive searching of previously created definitions, or reuse is unlikely to occur.

The submission tools must provide access to a set of controlled vocabularies. These must be searched by curators, often in response to researchers who declare the intention to submit data. (Based on limited experience, we believe that requiring submitters to explore vocabularies during the process of submission itself may be too onerous; the addition of vocabulary mappings may therefore be done after submission.)

Structure of a Data Submission

A submission consists of two parts:

Descriptors (metadata) for each data item in the submission.

The data itself.

Based on the category of phenotypic data for that item, the minimum set of descriptors for each item must be specified.

Some examples of descriptors that are possibly universal:

Primary data vs. aggregated (mean, S.D.).

Qualitative vs. quantitative (nominal, ordinal, interval, ratio). Choice sets.

Raw vs. derived (e.g., composite score in psychometry).

?Test of significance, p value
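The two-part submission structure and the per-category minimum descriptor sets described above could be checked along the following lines. This is a sketch under stated assumptions: the categories, the descriptor names in `MINIMUM_DESCRIPTORS` and the sample items are all hypothetical, since the actual minimum sets are left to the repository's curators.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical minimum descriptor sets, keyed by category of data.
MINIMUM_DESCRIPTORS = {
    "lab test": {"data_type", "units", "sample_source", "method_reference"},
    "questionnaire score": {"data_type", "scale", "raw_or_derived"},
}


@dataclass
class DataItem:
    """One item in a submission: its descriptors plus the data itself."""
    name: str
    category: str
    descriptors: Dict[str, str]
    values: List[float]


def missing_descriptors(item: DataItem) -> List[str]:
    """List the mandatory descriptors the item lacks. In this sketch an
    unrecognized category is treated as having no requirements."""
    required = MINIMUM_DESCRIPTORS.get(item.category, set())
    return sorted(required - set(item.descriptors))


def validate_submission(items: List[DataItem]) -> Dict[str, List[str]]:
    """Map each incomplete item's name to its missing descriptors."""
    report = {}
    for item in items:
        missing = missing_descriptors(item)
        if missing:
            report[item.name] = missing
    return report


submission = [
    DataItem(
        name="blood glucose",
        category="lab test",
        descriptors={
            "data_type": "number",
            "units": "mg/dl",
            "sample_source": "blood",
            "method_reference": "(bibliographic reference goes here)",
        },
        values=[95.0, 102.0],
    ),
    DataItem(
        name="mood score",
        category="questionnaire score",
        descriptors={"data_type": "number", "raw_or_derived": "derived"},
        values=[12.0, 17.0],
    ),
]
```

Running `validate_submission(submission)` reports that the questionnaire item lacks its measurement scale (nominal, ordinal, interval or ratio), while the lab test item passes. Returning a report rather than raising an error is one way to balance convenience against rigorous validation.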

Conclusions & Summary

It is essential to define minimum descriptors for a variety of phenotypic parameters. While these have been defined for clinical parameters, defining them for pre-clinical parameters is a major challenge.

Collaboration within and across research consortia, coupled with experience, will determine the minimum set; several iterations may be required.

Some forbearance on the part of informatics personnel (who are building tools for electronic submission) may be required: some investigators who submit data may not want to be bothered to provide the extensive metadata needed to make that data understandable to others. Making the tools intuitive enough, and comprehensive enough, will also take several iterations.

Acknowledgments
