brief bioinform 2005 scherf 287 97

The next generation of literature analysis: Integration of genomic analysis into text mining

M. Scherf, A. Epple and T. Werner

Date received (in revised form): 27th May 2005

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO 3. 287–297. SEPTEMBER 2005

Matthias Scherf, PhD joined Genomatix Software GmbH in 2000, where he is Head of Discovery. He did his postdoctoral work in the group of Dr Werner at the GSF where he developed the first specific approach for genome wide promoter prediction in mammalian genomes. He has over 15 years of experience in pattern recognition, artificial intelligence and medicine.

Anton Epple received a Masters Degree in Biology from LMU Munich and completed postgraduate studies in Computer Science at the Technische Universität München in 2001. His research interests include the design of software for natural language processing, and in particular information extraction techniques for systems biology. He is currently a scientist at Genomatix GmbH.

Thomas Werner, PhD is CEO and CSO of Genomatix Software GmbH. Since 1998 he has been a full-time bioinformatics researcher at the GSF-National Research Centre for Environment and Health in Neuherberg, Germany, focusing on the analysis of genomic sequences with special emphasis on aspect of the regulation of transcription. He founded Genomatix Software GmbH in 1997 and it has rapidly developed a unique expertise and advanced software for genomic research.

Keywords: literature/text mining, gene regulation, promoter analysis, integrated analysis

Matthias Scherf, Genomatix Software GmbH, Landsberger Strasse 6, Munich, D-80339, Germany
Tel: +49 89 5997660
Fax: +49 89 59976655
E-mail: [email protected]

Abstract

Text-mining systems are indispensable tools to reduce the increasing flux of information in scientific literature to topics pertinent to a particular interest in focus. Most of the scientific literature is published as unstructured free text, complicating the development of data processing tools, which rely on structured information. To overcome the problems of free text analysis, structured, hand-curated information derived from literature is integrated in text-mining systems to improve precision and recall. In this paper several text-mining approaches are reviewed and the next step in development of text-mining systems, which is based on a concept of multiple lines of evidence, is described: results from literature analysis are combined with evidence from experiments and genome analysis to improve the accuracy of results and to generate additional knowledge beyond what is known solely from literature.

INTRODUCTION

The annual worldwide production of information in publications is estimated to be 8 terabytes in books, 25 terabytes in newspapers, 20 terabytes in magazines and 2 terabytes in journals.1 It would take five years to read the new scientific material that is produced every day. This rapid growth of information is observed in the field of biomedicine as well: over 15 million entries maintained by the National Library of Medicine2 are available today in MedLine, which is the primary source of free textual information data in biomedical literature. Thousands of new entries are added every day. Consequently, information processing systems must be applied to restrict the available information to that fraction which is pertinent to a particular topic or more precisely even to a particular context within a topic. A crucial requirement for such systems is the ability to analyse and extract information from unstructured text. The systems available today put their main focus on the analysis of abstracts from scientific papers; abstracts summarise the results of the scientific work in a compact way and are the predominant source available in electronic form.

The challenge that must be addressed in developing systems for the analysis of text databases is founded in Zipf’s law.3 It states that few instances (words) cover most of the text, while most instances appear very seldom. Our own findings confirm this statement. In more than 15 million abstracts contained in MedLine, more than 40 per cent of the words occur only once. This number illustrates the lack of standards in unstructured text. Obviously, even text on the same or similar topics is not similar with respect to the wording. As a consequence, general rules are difficult to set up for the classification of a text, eg by topic. Information processing methods, however, are based on finite sets of well-defined instructions to accomplish a certain task. Thus, the development of algorithms to analyse unstructured text remains challenging.

The authors are not aware of any



    generally applicable approach in the field

    of biomedicine and molecular biology

    where the number of relevant documents

    found by text mining reaches the number

    of detected documents (100 per cent

    precision) and the number of relevant

    documents detected reaches the total

    number of relevant documents (100 per

    cent recall). However, many different

    approaches exist, focusing on specific tasks

    and knowledge domains.
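The two measures defined in this paragraph can be made concrete with a small sketch. It assumes that document identifiers are held in Python sets; the function name and the example identifiers are illustrative and do not come from the paper.

```python
def precision_recall(detected, relevant):
    """Return (precision, recall) for a set of detected documents.

    precision = relevant documents among the detected / all detected documents
    recall    = relevant documents among the detected / all relevant documents
    """
    true_hits = detected & relevant
    precision = len(true_hits) / len(detected) if detected else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative identifiers only (not real MedLine entries).
detected = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc3", "doc5"}
p, r = precision_recall(detected, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

A system reaches 100 per cent on both measures only when the detected set and the relevant set are identical.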

    The next section starts with a brief

    overview of the different tasks and

    challenges in text mining of biomedical

    literature. Our focus will be on the tasks

    of identification, description and

    classification of relations between

    biological entities from free text. The

    methods available for this task are

    described. Next, the basic biological

    concepts that are used to improve and

    classify text analysis results by the

    integration of content from structured

    information resources derived from

    literature by human experts are explained.

    For further reading, the reader is referred

    to Blaschke et al.,4 Shatkay and Feldman,5

    Hirschman et al.,6 Dickman,7 De Bruijn

    and Martin,8 Grivell,9 Andrade and

    Bork10 and Schulze-Kremer11 for in-

    depth reviews on methods in text mining

    and natural language processing in the

    domain of biomedicine and molecular

    biology.

    The subsequent sections will illustrate

    how the principle of data integration in

    text-mining systems can be extended to

    genome analysis. Our focus will be on the

    biological process of transcription

    regulation as one example for genome

    analysis resulting in gene relations.

    TEXT MINING IN MOLECULAR BIOLOGY AND BIOMEDICINE

    Text mining deals with the analysis of text

    and the extraction of information. One

    very common task of text mining in the

    field of biomedicine and molecular

    biology is to identify and analyse relations

    between biological entities such as genes,

    proteins or diseases from free text. The

    process of text mining for this task implies

    the following steps:

    Identification of biological entities.

    Identification of entity relations.

    Classification of entity relations.

    Every step of the text-mining process can

    be addressed with several different

    methods. Consequently, a large variety of

    methods can be combined to solve the

    various aspects of literature mining. The

    combinatorial possibilities are also

    reflected in the number of currently

    available tools for literature analysis.

    However, all of these tools address more

    or less the same tasks of identification and

    analysis of gene relations.

    Since the strengths and weaknesses of

    the different tools differ according to the

    user queries, an objective and fair

    comparison is hard to achieve on the level

    of integrated tools. Thus we will focus on

    the underlying methods which can be

    applied for the three steps described above

    to offer the reader the basis for a method-

    oriented way of evaluation.

    Step 1: Identification of biological entities

    Biomedical literature contains a special

    category of entities that refer to gene and

    protein names, chemical compounds,

    diseases, tissues, cellular components or

    other predefined biological concepts.

    Therefore, identification of such

    biological entities in text is a crucial first

    step and essential for any subsequent

    analysis.

    The major challenge in entity

    identification is the synonym/homonym

    problem: biological entities such as genes

    not only have different names, ie

    synonyms (eg CSEN, DREAM,

    KCHIP3, MGC18289 and KCNIP3)

    with different typographical variants (eg

    Kcnip3, KCNIP-3 and KCNIP 3), but the

    names can also be ambiguous. An

    example for ambiguity is the abbreviation

    NRL which is used for natural rubber

    latex as well as neural retina leucine

    zipper gene.

    The synonym/homonym problem

    Owing to the synonym/homonym

    problem, the task of entity recognition

    requires an identification and

    disambiguation step. Identification

    strategies range from methods that use ad

    hoc rules about typical syntactic structures

    of entity identifiers12 to algorithms that

    search identifiers of a given dictionary

    with exact and inexact pattern matching

    methods.13 Principally, dictionaries are

    used in combination with pattern

    recognition approaches. The dictionaries

    are based on publicly available sources of

    standardised, structured data annotated by

    human experts. Examples for sources are

    HUGO Gene Nomenclature

    Committee,14,15 sequence annotation

    databases such as LocusLink,16 protein

    databases such as Swiss-Prot,17 Gene

    Ontology (GO)18 or medical pathological

    terms UMLS.19
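The dictionary-based identification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any tool cited here: the synonym list is the KCNIP3 example from the text, while the normalisation rule, the token pattern and all function names are hypothetical.

```python
import re

# Hypothetical dictionary: canonical symbol -> synonyms (KCNIP3 example from the text).
SYNONYMS = {
    "KCNIP3": ["CSEN", "DREAM", "KCHIP3", "MGC18289", "KCNIP3"],
}


def normalise(token):
    """Upper-case and drop hyphens/spaces so Kcnip3, KCNIP-3 and KCNIP 3 agree."""
    return re.sub(r"[\s\-]", "", token).upper()


# Normalised identifier -> canonical symbol.
LOOKUP = {normalise(name): symbol
          for symbol, names in SYNONYMS.items() for name in names}


def find_entities(text):
    """Return canonical symbols for dictionary identifiers found in text."""
    found = []
    # Candidate token: a word, optionally followed by a hyphen or space and digits.
    for match in re.finditer(r"[A-Za-z][A-Za-z0-9]*(?:[\- ]\d+)?", text):
        symbol = LOOKUP.get(normalise(match.group()))
        if symbol and symbol not in found:
            found.append(symbol)
    return found


print(find_entities("Expression of KCNIP-3 (also known as DREAM) was reduced."))
```

Because the lookup ignores case, the ordinary word "dream" would also be reported as KCNIP3. That is the homonym side of the problem and the reason a separate disambiguation step is needed.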

    Although a dictionary-based search

    allows a robust algorithm-based

    identification of entities, it is naturally

    restricted to the terms included in the

    dictionary. The disambiguation step

    implies a classification method deciding

    whether the text where the entity has

    been identified refers to the expected

    topic. A large variety of methods, ranging

    from machine learning,20 support vector

    machines21 and hidden Markov models22

    to Bayesian learning and decision

    trees,23,24 have been applied to this task.

    It is difficult to compare methods for

    identification and disambiguation, since

    the published methods generally focus on

    different kinds of biological entities and

    are often trained and tested on preselected

    text sets from certain domains. For

    example, the authors of the PROPER

    method12 reported over 90 per cent in

    precision and recall but calculated these

    values based on only 30 abstracts of papers

    about the SH3 domain.

    Regardless of the different methods

    applied, most authors agree that

    interaction with experts is indispensable

    to obtain good results in identification

    and disambiguation. These experts define

    the dictionaries for entity identifiers and

    create simple, generally applicable, rules

    for disambiguation. Thus, the

    performance of the various methods relies

    to a significant fraction on the excellence

    of the biological experts involved rather

    than on the algorithm employed to

    encode and use the expert knowledge.

    Step 2: Identification of relations between entities

    The next goal after identification and

    disambiguation of biological entities is the

    detection of relations between the entities

    as the text describes them. The most

    straightforward approach for this task is to

    assume relations between entities based on

    co-occurrence in a text. The probability

    of an established relation between entities

    depends to some extent on the location of

    the entities within a text. The weakest

    assumption about a relation is due to a co-

    occurrence of entities anywhere in a text.

    If two entities co-occur within the same

    sentence, a true relation becomes more

    likely, while the coverage might decrease

    simultaneously.
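The trade-off between co-occurrence anywhere in a text and co-occurrence within one sentence can be illustrated with a short sketch. It assumes that entities have already been identified and are passed in as plain strings; the naive sentence splitter, the function name and the example sentence are illustrative assumptions, not a description of a published system.

```python
import re
from itertools import combinations


def cooccurring_pairs(text, entities, level="abstract"):
    """Return entity pairs that co-occur in the whole text or within one sentence."""
    if level == "sentence":
        # Naive split after '.', '!' or '?' followed by whitespace.
        units = re.split(r"(?<=[.!?])\s+", text)
    else:
        units = [text]
    pairs = set()
    for unit in units:
        present = sorted(e for e in entities
                         if re.search(rf"\b{re.escape(e)}\b", unit))
        pairs.update(combinations(present, 2))
    return pairs


# Made-up example text, used only to show the difference between the two levels.
text = "STAT3 is activated by JAK2. RAC1 was also measured."
genes = ["STAT3", "JAK2", "RAC1"]
print(cooccurring_pairs(text, genes, level="abstract"))
print(cooccurring_pairs(text, genes, level="sentence"))
```

At abstract level all three pairs are reported; at sentence level only the pair that shares a sentence remains, which mirrors the gain in reliability and the loss in coverage described above.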

    Sophisticated approaches try to further

    improve the analysis on sentence level by

    employing dictionaries and rule-based

    analysis techniques.25–28 The dictionaries

    contain words related to the description

    of relations; the rules are designed for the

    analysis of sentence or phrase structures.

    These approaches lead to a better

    precision of results but decrease the recall

    owing to the restricted set of vocabulary

    and sentence structures. While all of these

    methods take the textual and sometimes

    grammatical context into consideration,

    none of them truly integrates the

    biological context (other than what was

    manually entered into the dictionaries).

    An ideal system would consider biological

    restrictions, eg rule out impossible

    combinations of entities such as

    information about bacterial genes

    involved in operon structures in relation

    to mammalian genes pertinent to

    chromatin organisation.

    Step 3: Classification of entity relations

    Gene relations depend on their functional context

    Relations between biological entities are

    not fixed but change according to the

    functional context in which an entity

    applies. The biological mechanisms and

    the environment in which the entity was

    observed generally specify the functional

    context of a biological entity.

    Consequently, the description of a

    functional context is usually distributed in

    multiple sentences, figures and tables, and

    in-depth expert knowledge is required to

    decode the functional context from

    publications. Text-mining systems might

    support the identification of single aspects

    of a functional context such as a tissue

    type, but it is still impossible to

    automatically elucidate the complex

    dependencies between the components of

    a functional context. A relation might

    thus be described and correctly identified

    by steps 1 and 2 from a text, but the

    functional context in which the relation

    was observed might not correspond to the

    topic of interest. Prominent examples for

    relations that change according to the

    functional context are signal transduction

    pathways such as the MAP kinase system,

    which can trigger a number of different

    transcriptional activators.29

    The most common way to consider

    functional context in text mining is the

    introduction of structured, hand-curated

    information about biological entities.

    Available sources such as GO18 or

    KEGG30 assign biological entities to

    classes, eg of biological functions and

    pathways. MESH31 assigns domains like

    diseases or anatomy to publications.

    Integrate independent lines of evidence

    These sources can be used to establish

    different lines of evidence for an entity

    relation derived from text. A certain

    disease can be assigned to a relation if the

    paper in which the two entities were

    identified has been assigned to the disease

    via MESH. If both genes of an identified

    relation belong to the same class in GO,

    then this class is assigned to the relation.
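The two assignment rules just described (disease via the MESH annotation of the paper, class via shared GO membership of both genes) can be sketched as below. The data structures, names and annotations are placeholders invented for illustration; real GO and MESH records are richer and are accessed differently.

```python
def annotate_relation(gene_a, gene_b, paper_id, go_classes, mesh_terms):
    """Collect curated context for a relation found in the paper `paper_id`.

    go_classes: dict gene -> set of class names (hypothetical layout)
    mesh_terms: dict paper id -> set of disease terms (hypothetical layout)
    """
    # A class is assigned only if both genes of the relation belong to it.
    shared = go_classes.get(gene_a, set()) & go_classes.get(gene_b, set())
    return {
        "relation": (gene_a, gene_b),
        "go_classes": sorted(shared),
        # Diseases are inherited from the annotation of the source paper.
        "diseases": sorted(mesh_terms.get(paper_id, set())),
    }


# Placeholder annotations, not real database content.
go_classes = {
    "GENE_A": {"class X", "class Y"},
    "GENE_B": {"class Y", "class Z"},
}
mesh_terms = {"paper-1": {"disease D"}}
print(annotate_relation("GENE_A", "GENE_B", "paper-1", go_classes, mesh_terms))
```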

    The natural internal consistency of

    biological facts and findings makes such an

    approach possible. True connections of

    biological entities are characterised by at

    least two hallmarks: they do not conflict

    with each other (within the correct

    context!) and they are always present on

    several levels. For example, two proteins

    reported to interact functionally (such as

    an enzyme and its substrate) necessarily

    also can be shown to interact physically

    (eg in yeast two hybrid systems). In

    addition, they are necessarily co-expressed

    in at least one cell type. Often such co-

    expression is also evident from common

    regulatory structures in the corresponding

    gene promoters. In short, isolated findings

    that are not supported on other levels

    (genome, transcriptome, proteome,

    metabolome) or are in conflict with other

    findings are usually much less likely to be

    true than findings consistently supported

    by several independent lines of evidence.

    This is a very general and a very powerful

    basic biological concept that can be used

    to enhance the results of any information

    retrieval system.

    In the following section we will discuss

    the extension of the line of evidence

    approach towards literature-independent

    data from genomic analysis.

    COMBINATION OF TEXT MINING AND GENOMIC ANALYSIS

    Sources such as GO or MESH contain

    specific, high-quality annotations

    primarily derived from the literature by

    experts. Such annotations can be

    complemented by literature-independent

    data, eg from laboratory or in-silico

    experiments, which confirm text-mining

    results and assign additional functional,

    cellular or molecular context to entity

    relations. It should be noted that only

    independent lines of evidence provide

    support. If three groups report the same

    results based on very similar experiments,

    this is only incremental evidence of one

    line. However, if a physical interaction

    deduced from a functional assay is

    supported by direct demonstration of

    physical interactions (protein–protein or

    protein–DNA/RNA), then two lines of

    evidence are established. This is much

    more difficult to realise based solely on

    text mining, but can readily be achieved

    when text mining is combined with other

    sources such as genomics or proteomics-

    based sequence analyses.
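The distinction between incremental evidence and independent lines of evidence can be expressed in a few lines. This sketch assumes that each report is tagged with an evidence type; the tags and the function name are hypothetical.

```python
def independent_lines(reports):
    """Count distinct evidence types; repeated reports of one type count once.

    reports: iterable of (source, evidence_type) tuples (hypothetical layout).
    """
    return len({evidence_type for _source, evidence_type in reports})


# Three groups reporting very similar experiments form a single line of evidence.
same_kind = [("group 1", "functional assay"),
             ("group 2", "functional assay"),
             ("group 3", "functional assay")]
# Adding a direct demonstration of physical interaction establishes a second line.
mixed = same_kind + [("group 4", "physical interaction")]
print(independent_lines(same_kind), independent_lines(mixed))
```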

    Below, the integration of literature

    independent data is illustrated with several

    examples. Special emphasis is placed on

    the regulation of gene transcription since

    it defines important protein–gene

    relations on the molecular level.

    Transcription regulation


    Gene transcription is regulated in part by

    nuclear factors (proteins) that recognise

    short DNA sequence motifs, called

    transcription factor binding sites (TFBSs).

    TFBSs are in most cases located upstream

    of the first exon of a transcript in so-called

    promoter and enhancer regions.

    Identification of TFBSs in regulatory

    regions of transcripts can confirm

    important relations between transcription

    factors and genes and add considerably to

    annotation of the biological context of

    genes. As a consequence, the analysis of

    transcription regulation first requires the

    annotation of regulatory regions for a

    gene and the identification of TFBSs in

    the annotated regulatory regions.

    Annotation of genes and their regulatory regions

    Genes can have alternative promoters

    The human genome currently is

    annotated with 23,245 gene loci (NCBI

    Build 34). For these loci 43,975

    transcripts are known. About 45 per cent

    (10,368) of the genes have between 2 and

    40 alternative transcripts. In

    addition 6,418 of the annotated loci have

    two or more promoters, ie alternative

    promoters. Figure 1 summarises the

    distribution of genes with alternative

    transcripts.
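The proportions behind these counts can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the text (NCBI Build 34); the variable names are illustrative.

```python
# Counts quoted in the text for NCBI Build 34.
gene_loci = 23_245
transcripts = 43_975
loci_with_alternative_transcripts = 10_368
loci_with_alternative_promoters = 6_418

# Share of loci with alternative transcripts (the "about 45 per cent" in the text).
share_alt_transcripts = 100 * loci_with_alternative_transcripts / gene_loci
# Share of loci with two or more promoters.
share_alt_promoters = 100 * loci_with_alternative_promoters / gene_loci
# Average number of known transcripts per locus.
transcripts_per_locus = transcripts / gene_loci

print(f"{share_alt_transcripts:.1f} % of loci have alternative transcripts")
print(f"{share_alt_promoters:.1f} % of loci have alternative promoters")
print(f"{transcripts_per_locus:.2f} transcripts per locus on average")
```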

    Alternative transcripts of a gene differ

    according to alternative splicing (see, eg,

    gene LIPT1 and Figure 2a), alternative

    termination (see, eg, gene FBLN1) or

    alternative first exons (see, eg, gene

    CYP19A1 and Figure 2b). This flexibility

    of alternative transcripts reflects the

    various biological contexts in which a

    gene might be functionally involved.

    Since publications in general describe

    only genes and not their transcript/

    regulatory regions, it is not possible to

    identify the functional context by text-

    mining methods. On the other hand, the

    flexibility of alternative transcripts needs

    to be understood to truly comprehend

    disease processes, especially for

    individualised diagnostics of chronic

    diseases and cancer. Only this knowledge

    will allow addressing the correct genetic

    mechanism in the pertinent context. The

    aromatase gene, which encodes the terminal

    enzyme responsible for oestrogen

    biosynthesis in mammals, provides a good

    example to illustrate this point. Aromatase

    has at least six different alternative

    promoters that regulate the production of

    the same gene product (exons II–X always

    remain the same). Dysregulation of

    aromatase promoters is found in severe

    diseases, especially breast cancer.32

    Figure 1: Organisation of the human genome: 45 per cent of all genes have alternative transcripts and alternative promoters

    Figure 2: Alternative splicing (a) and alternative promoters (b). Alternative transcripts occur by alternative splicing of one or several exons from a single primary transcript or by transcription starting from alternative promoters (P: promoter, E: exon)

    Aromatase in normal breast tissues is

    mainly regulated by promoter 1.4. In

    breast cancer tissues, over-activation of

    promoters 1.3 and 1.2 is often observed

    (Figure 3). This shift in promoter usage

    does not affect the coding region at all; all

    transcripts encode the exact same protein.

    Therefore, full understanding of this

    disease mechanism and subsequent design

    of an effective therapy requires detailed

    knowledge of transcription activation via

    alternative promoters.

    Analysis of regulatory regions

    Identification of regulatory mechanisms

    by promoter analysis provides a crucial link

    between the static nucleotide sequence of

    the genome and the dynamic aspects of

    gene regulation and expression.

    Furthermore, it provides a unique way to

    define functional context based on co-

    regulation mechanisms that cannot be

    derived by literature analysis.

    Promoter modules define biological functions

    Activation of transcription is triggered

    by binding of transcription factors to the

    promoter sequence of a transcript.

    However, in mammalian systems this is

    usually not achieved by individual

    transcription factors but by characteristic

    combinations of factors. Similar TFBS

    patterns are found within the promoters of

    transcripts that are expressed in the same tissue

    under similar conditions. Thus, the

    organisation of promoter motifs represents

    a footprint or framework of the

    transcriptional regulatory mechanisms at

    work in a specific biological context,

    consequently providing information

    about signal and tissue-specific control of

    expression.

    Software that allows detection and

    characterisation of individual binding sites

    is available from several sources, including

    MatInspector,33,34 Signal Scan,35,36

    MATRIX SEARCH37 or MATCH.38

    A collection of functional binding sites

    for high-quality prediction is derived

    from the literature and included in

    MatInspector.33 TRANSFAC is another

    source which provides extensive

    information about transcription factors

    derived from literature.39,40
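The general idea behind detecting individual binding sites with a weight matrix can be sketched as follows. This is a generic log-odds scan written for illustration; it is not the algorithm of MatInspector, MATCH or any other tool named above, and the count matrix, pseudocount, background frequency and threshold are all made-up assumptions.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per position).
COUNTS = [
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 8, "C": 1, "G": 0, "T": 1},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert base counts per position into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({base: math.log2(((column[base] + pseudocount) / total) / background)
                       for base in "ACGT"})
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above the threshold."""
    hits = []
    width = len(matrix)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits


matrix = log_odds_matrix(COUNTS)
print(scan("GGTATAGG", matrix, threshold=4.0))
```

A single-site scan of this kind reports every sufficiently similar window, which is one reason why individual matches alone say little about promoter function.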

    Although binding site detection is

    important in higher organisms, it is

    generally not sufficient for the elucidation

    of promoter function since, in more

    complex systems, the functional TFBSs

    within promoters are organised

    hierarchically41,42 (Figure 4). This

    hierarchical organisation increases the

    specificity and selectivity of gene

    regulation via TFBSs.43,44 Combinatorial

    biology appears to be the key to

    understanding regulation in higher

    organisms, where promoter function is

    determined to a large extent by the

    functional context within which the

    binding sites are located.

    The smallest entities on the level of

    TFBSs combinations that can be assigned

    to a particular biological function are

    called promoter modules. Promoter

    modules are defined as two or more

    individual elements that act in a

    coordinated way (either synergistically or

    antagonistically) and are arranged within a

    defined distance and in sequential order

    (Figure 4).44

    Figure 3: Gene structure of human aromatase (Cyp19A1). Data modified from Clyne et al.32

    Figure 4: Promoters in higher eukaryotes are organised hierarchically and elements that control a specific pattern of expression may also be found in other promoters expressed under similar circumstances

    Work to date suggests that

    promoter modules can be pathway or cell

    type specific41 and, in this regard, can

    mediate the transcriptional response to

    specific signal transduction pathways,45,46

    cell type-specific expression, and events

    central to developmental regulation.47 A

    given promoter module may show a

    robust stimulus-specific response in one

    tissue, but may not be functional in

    another cell type.
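The definition of a promoter module (two or more elements, in a fixed sequential order, within a defined distance) can be sketched as a simple search over predicted binding sites. This is only an illustration of the definition, not the algorithm used by any of the cited tools. The function name, the site representation, the use of start positions to measure distance, and the omission of strand orientation are all assumptions made here.

```python
from typing import List, Tuple

# A predicted binding site: (factor name, start position on the promoter).
# This representation is an assumption made for the sketch.
Site = Tuple[str, int]


def find_module_matches(
    sites: List[Site], module: List[str], max_gap: int
) -> List[List[Site]]:
    """Return every chain of sites that matches `module` in sequential order,
    with consecutive elements at most `max_gap` bases apart."""
    ordered = sorted(sites, key=lambda s: s[1])
    matches: List[List[Site]] = []

    def extend(chain: List[Site], start_index: int) -> None:
        if len(chain) == len(module):
            matches.append(list(chain))
            return
        wanted = module[len(chain)]
        for i in range(start_index, len(ordered)):
            name, pos = ordered[i]
            if chain and pos - chain[-1][1] > max_gap:
                break  # sites are sorted, so later ones are even further away
            if name == wanted and (not chain or pos > chain[-1][1]):
                chain.append((name, pos))
                extend(chain, i + 1)
                chain.pop()

    extend([], 0)
    return matches


# Hypothetical site predictions; names and positions are illustrative only.
example_sites = [("ETS", 10), ("CRE", 35), ("SP1", 60), ("ETS", 200), ("CRE", 400)]
print(find_module_matches(example_sites, ["ETS", "CRE"], max_gap=50))
```

Only the first ETS/CRE pair satisfies both the order and the distance constraint. The second pair contains the same two elements but is too far apart, which is the difference between a mere co-occurrence of binding sites and a module in the sense defined above.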

    Combination of promoter analysis and literature analysis

    Although the inclusion of transcription

    factor–gene relations in combination

    with literature mining has been realised in

    the BiblioSphere PathwayEdition,34 more

    sophisticated promoter analyses have not

    yet been implemented within text-mining

    tools. However, given the biological

    relevance outlined above, such analyses

    can be expected to be added to biological

    text mining in the near future.

    The power of such a combinatorial

    approach is illustrated with a recent

    example.48 During a microarray-based

    analysis of genes involved in the response

    of astrocytes to expression of the HIV-1

    protein Nef, a set of nine genes was

    identified as relevant (BCL2L1, CDC42,

    HCK, Jak2, JNK, MAPK1, RAC1,

    STAT3, Vav1). Unfortunately, most of

    these genes are involved in cell cycle

    regulation, resulting in an extraordinarily

    large body of related literature. Figure 5

    illustrates the strategy and the results of

    our approach. Even a co-citation

    network analysis restricting networks to

    genes co-cited with at least five of the

    nine genes can restrict the gene list only

    from the initial 2,846 to 440 genes. In

    contrast, automatic promoter framework

    analysis of the promoters of the nine

    initial genes yielded a framework

    consisting of four TFBSs, present in the

    promoters of three of the nine genes

    (BCL2L1, HCK, RAC1). This network

    selected 159 promoters out of 36,000

    annotated human promoters. The

    molecular evidence (159) was then

    crossed with the literature around the

    nine initial genes (2,846), which resulted

    in a network of 18 genes where all

    connections could be verified as

    functional. All of these 18 genes are thus

    directly relevant for the initial query, the

    response of astrocytes to the viral Nef

    protein of HIV-1.
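The crossing step in this example amounts to filtering a literature-derived gene set and intersecting it with a promoter-derived one. The sketch below shows that logic on toy data. The function names and the `GENE_*` identifiers are hypothetical placeholders, and the data structures are assumptions; the actual study worked on 2,846 co-cited genes and 159 promoter matches.

```python
from typing import Dict, Set


def cocited_with_at_least(
    cocitations: Dict[str, Set[str]], seeds: Set[str], k: int
) -> Set[str]:
    """Genes that are co-cited with at least `k` of the seed genes."""
    return {
        gene
        for gene, partners in cocitations.items()
        if len(partners & seeds) >= k
    }


def cross_evidence(literature_genes: Set[str], promoter_genes: Set[str]) -> Set[str]:
    """Keep only genes supported by both literature and promoter evidence."""
    return literature_genes & promoter_genes


# The nine seed genes named in the text.
seeds = {"BCL2L1", "CDC42", "HCK", "Jak2", "JNK", "MAPK1", "RAC1", "STAT3", "Vav1"}

# Toy co-citation table: gene -> genes it is co-cited with (placeholder data).
cocitations = {
    "GENE_A": {"BCL2L1", "HCK", "RAC1"},
    "GENE_B": {"STAT3"},
    "GENE_C": {"CDC42", "MAPK1", "GENE_X"},
}

# Toy set of genes whose promoters match the framework (placeholder data).
promoter_hits = {"GENE_A", "GENE_D"}

literature_hits = cocited_with_at_least(cocitations, seeds, k=2)
print(sorted(cross_evidence(literature_hits, promoter_hits)))
```

Raising `k` tightens the literature filter alone, whereas the intersection adds a second, literature-independent constraint; the text argues that the latter is what reduces the candidate list to a verifiable size.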

    In summary, promoter analysis

    provides information on transcription

    factor–gene relations on the molecular

    level independent of literature data.

    Relations between transcription factors

    and genes can be associated in a context-

    dependent manner. Moreover,

    information on alternative promoters and

    transcripts provides a detailed view on

    different biological contexts in which a

    gene can function. While this will clearly

    reduce the recall from the literature, it

    can dramatically increase the context-

    dependent precision, which is the most

    important parameter for the usefulness of

    data mining in general.
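The trade-off described here can be stated with the standard information-retrieval definitions. The symbols are introduced for this note and are not the authors': $R$ is the set of relations that are truly relevant in the context of interest, and $S$ is the set of relations a system returns.

```latex
\[
\text{precision} = \frac{|R \cap S|}{|S|},
\qquad
\text{recall} = \frac{|R \cap S|}{|R|}
\]
```

Requiring promoter evidence shrinks $S$. This can only remove items from $R \cap S$, so recall cannot rise, while precision rises whenever the discarded relations are mostly irrelevant in the chosen context.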

    Further approaches to integrate literature-independent data

    Another approach to integrate

    experimental, non-literature based

    information on relations between

    biological entities is to integrate

    Figure 5: Scheme illustrating the combination of literature mining and promoter analysis on the example of a group of nine genes initially identified in a microarray-based study of astrocyte response to the HIV-1 Nef gene48


    information about protein–protein

    interactions. There are a number of

    databases available containing such

    information, derived from experiments

    such as yeast/mammalian two hybrid. An

    example is the DIP database.49 Here,

    information from a variety of sources is

    combined to create a single, consistent set

    of protein–protein interactions. The text-

    mining system Chilibot50 uses this source

    to integrate additional relations in its

    literature-mining results.
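One simple way to picture this kind of integration is to record, for each pair of entities, which sources support the relation. The sketch below is a generic illustration under that assumption. It does not reproduce how Chilibot works or the DIP data format, and the names `annotate_relations`, `"literature"` and `"interaction_db"` are invented for the example.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# An unordered pair of entity names; interactions are treated as symmetric.
Pair = FrozenSet[str]


def annotate_relations(
    literature_pairs: Iterable[Tuple[str, str]],
    interaction_pairs: Iterable[Tuple[str, str]],
) -> Dict[Pair, Set[str]]:
    """Map each pair to the set of evidence sources that support it."""
    evidence: Dict[Pair, Set[str]] = {}
    for a, b in literature_pairs:
        evidence.setdefault(frozenset((a, b)), set()).add("literature")
    for a, b in interaction_pairs:
        evidence.setdefault(frozenset((a, b)), set()).add("interaction_db")
    return evidence


# Placeholder protein names; no real interaction data is implied.
literature = [("P1", "P2"), ("P2", "P3")]
experimental = [("P2", "P1"), ("P4", "P5")]

for pair, sources in annotate_relations(literature, experimental).items():
    print(sorted(pair), sorted(sources))
```

Pairs that carry both labels are supported by two independent lines of evidence, while pairs with only the database label are relations the literature search alone would have missed.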

    Four approaches to characterise gene relations

    Literature-mining systems are often

    used today for the interpretation of

    expression array data. However, the data

    from expression arrays also provide gene

    relations, defined by genes with similar

    expression profiles under defined

    experimental conditions (cell type,

    treatment, etc). This source is not directly

    correlated with gene regulation analysis

    since co-expressed genes are not

    necessarily co-regulated. Moreover,

    effects from different biological

    mechanisms such as post-transcriptional

    regulation via microRNA, RNA stability,

    etc are cumulated in the expression signal.

    Gene Expression Omnibus51,52 offers a

    large collection of documented results

    from expression array experiments that

    might be integrated in literature mining

    systems in the future.
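A gene relation based on "similar expression profiles" needs a similarity measure. The text does not name one; the sketch below assumes the Pearson correlation coefficient, a common choice, and uses made-up expression values. The threshold of 0.9 is likewise an arbitrary assumption.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two expression profiles of equal length."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# Made-up expression values across three conditions (illustrative only).
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
gene_c = [3.0, 2.0, 1.0]

coexpressed = pearson(gene_a, gene_b) > 0.9  # assumed threshold
print(coexpressed, pearson(gene_a, gene_c))
```

A high correlation only establishes co-expression under the measured conditions. As noted above, it does not show that the two genes are co-regulated.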

    CONCLUSION

    Relations between biological entities are

    conditional and may change when the

    same genes are considered in a different

    functional context. As a consequence,

    every relation between entities must be

    qualified with the functional context in

    which the relation was observed.

    Moreover, it is impossible to make

    general statements about whether a

    relation detected by literature mining is

    true or false without considering the

    context in which it was observed.

    This context-dependency of relations

    also precludes any quantitative

    comparison of the content of the various

    databases underlying the discussed

    methods. Pure numbers cannot answer

    the crucial question of how well such

    relations are qualified with respect to their

    biological context. It is safe to assume that

    all of the methods have ample basic

    information to build on. The only

    external indicator we could identify at

    least in a qualitative manner is the

    assessment of how many different lines of

    evidence are combined by the systems.

    Although the quality of the final results

    still depends crucially on how such

    integration procedures are implemented,

    the concept of multiple lines of evidence

    at least allows for using the principle of

    biological consistency discussed above.

    There is no doubt that text-mining

    methods are powerful tools to further

    understand biological principles by

    problem-oriented preselection of

    publications about biological entities and

    their relations. The main challenge in text

    mining remains coping with free

    unstructured text and its individual

    properties, which are characterised by

    Zipf's law. Text mining in molecular

    biology and biomedicine is complicated

    by multiple layers of problems. They

    range from the identification of biological

    entities through disambiguation (synonyms

    and homonyms) and identification of

    relations all the way to the interpretation

    of the functional context. Numerous

    approaches in the field of text mining

    show that the identification and

    disambiguation of biological entities

    already give remarkable results in

    precision and coverage, while the analysis

    of sentences and text to discover relations

    or biological concepts is still a challenge.

    Fortunately, the connection of

    molecular biology and biomedical

    literature to biology not only complicates

    the task, but also offers several

    opportunities unique to the field. The

    biological consistency also includes the

    interaction of the cellular transcriptional

    and translational machinery with the

    genome and the transcriptome. Since we

    do have access to several genome

    sequences as well as considerable parts of

    the transcriptomes (via cDNA/expressed

    sequence tag approaches), biological

    knowledge mining is no longer restricted

    Further independent lines of evidence


    to literature alone. Every phenomenon

    described in the literature necessarily has a

    molecular foundation within the genomic

    sequence. Although our current

    knowledge does not allow us to understand

    all of these molecular correlations, a great

    deal of information can already be derived

    from genomic sequences, especially about

    transcriptional regulation as detailed

    above. Moreover, as the biological

    principles governing gene regulation seem

    to be very general, it becomes possible to

    use sequence analyses not only to confirm

    knowledge from the literature, but also to

    derive new relationships beyond current

    knowledge.

    Focusing on the aspect of confirming

    results from text mining by other

    biological data, the following picture

    emerges. Starting from the co-occurrence

    of biological entities in a text, four

    approaches can be identified from the

    available applications to confirm, classify

    or discard the relation:

    • In-depth analysis on sentence or phrase level. Approaches range from

    the application of general syntax rules

    such as co-occurrences of entities in

    the same sentences up to application

    of in-depth analysis by syntactic and

    semantic parsers. Deeper analysis generally

    decreases coverage while increasing

    precision.

    • Hand annotation of preselected sentences by curators. This approach is

    applicable independently of any particular

    scientific topic, but it is slow.

    • Integration of hand-curated, structured data sources on gene classes

    and text annotations.

    • Integration of experimental results either from laboratory or in-silico

    analyses.

    While the first three methods are

    mainly text-driven, the integration of

    results from experiments and in-silico

    analyses introduces literature-independent

    data. This represents an independent level

    of defining functional context to evaluate

    relations of biological entities. It becomes

    increasingly clear that text mining will be

    only one tool for information retrieval

    and management in biomedical research.

    Only the combination with other

    methods and information sources will lead

    to the best possible structuring and

    compilation of biological knowledge.

    This does not come as a surprise because

    the whole concept of systems biology is

    based on this notion. The most immediate

    consequence is that text mining of

    biomedical literature cannot be

    outsourced from biology to other

    disciplines but has to be carried out in

    tight interaction with biologists.

    References

    1. Lyman, P. and Varian, H. R. (2003), How much information? (URL: http://www.sims.berkeley.edu/how-much-info-2003/).

    2. Wheeler, D. L., Chappey, C., Lash, A. E. et al. (2003), Database resources of the National Center of Biotechnology, Nucleic Acids Res., Vol. 31, pp. 193–195.

    3. Li, W. (1992), Random texts exhibit Zipf's-law-like word frequency distribution, IEEE Trans Information Theory, Vol. 38(6), pp. 1842–1845.

    4. Blaschke, C., Hirschman, L. and Valencia, A. (2002), Information extraction in molecular biology, Brief. Bioinformatics, Vol. 3, pp. 154–165.

    5. Shatkay, H. and Feldman, R. (2003), Mining the biomedical literature in the genomic era: An overview, J. Comput. Biol., Vol. 10, pp. 821–855.

    6. Hirschman, L., Park, J. C., Tsujii, J. et al. (2002), Accomplishments and challenges in literature data mining for biology, Bioinformatics, Vol. 18, pp. 1553–1561.

    7. Dickman, S. (2003), Tough mining: The challenges of searching the scientific literature, PLoS Biol., Vol. 1, p. E48.

    8. de Bruijn, B. and Martin, J. (2002), Getting to the (c)ore of knowledge: Mining biomedical literature, Int. J. Med. Inf., Vol. 67, pp. 7–18.

    9. Grivell, L. (2002), Mining the bibliome: Searching for a needle in a haystack? New computing tools are needed to effectively scan the growing amount of scientific literature for useful information, EMBO Rep., Vol. 3, pp. 200–203.

    10. Andrade, M. A. and Bork, P. (2000), Automated extraction of information in molecular biology, FEBS Lett., Vol. 476, pp. 12–17.

    11. Schulze-Kremer, S. (2002), Ontologies for molecular biology and bioinformatics, In Silico Biol., Vol. 2, pp. 179–193.

    12. Fukuda, K., Tamura, A., Tsunoda, T. and Takagi, T. (1998), Toward information extraction: identifying protein names from biological papers, in Proceedings of the 3rd Pacific Symposium on Biocomputing, 4th–9th January, Hawaii, pp. 705–716.

    13. Krauthammer, M., Rzhetsky, A., Morozov, P. and Friedman, C. (2000), Using BLAST for identifying gene and protein names in journal articles, Gene, Vol. 256, pp. 245–252.

    14. Wain, H. M., Lush, M. J., Ducluzeau, F. et al. (2004), Genew: The Human Gene Nomenclature Database, 2004 updates, Nucleic Acids Res., Vol. 32 (Database issue), pp. D255–257.

    15. Wain, H. M., Bruford, E. A., Lovering, R. C. et al. (2002), Guidelines for Human Gene Nomenclature, Genomics, Vol. 79(4), pp. 464–470.

    16. Maglott, D. R., Katz, K. S., Sicotte, H. and Pruitt, K. D. (2000), NCBI's LocusLink and RefSeq, Nucleic Acids Res., Vol. 28(1), pp. 126–128.

    17. Bairoch, A. and Apweiler, R. (1998), The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998, Nucleic Acids Res., Vol. 26, pp. 38–42.

    18. Ashburner, M., Ball, C. A., Blake, J. A. et al. (2000), Gene ontology: Tool for the unification of biology, The Gene Ontology Consortium, Nat. Genet., Vol. 25, pp. 25–29.

    19. Bodenreider, O. (2004), The Unified Medical Language System (UMLS): Integrating biomedical terminology, Nucleic Acids Res., Vol. 32, pp. 267–270.

    20. Tanabe, L. and Wilbur, W. J. (2002), Tagging gene and protein names in biomedical text, Bioinformatics, Vol. 18, pp. 1124–1132.

    21. Kazama, J., Makino, T., Ohta, Y. and Tsujii, J. (2002), Tuning support vector machines for biomedical named entity recognition, in Proceedings of the Natural Language Processing in the Biomedical Domain, Association for Computational Linguistics, Philadelphia, pp. 1–8.

    22. Nobata, C., Collier, N. and Tsujii, J. (1999), Automatic term identification and classification in biology texts, in Proceedings of the Natural Language Pacific Rim Symposium, Beijing, November, pp. 369–375.

    23. Hatzivassiloglou, V., Duboue, P. A. and Rzhetsky, A. (2001), Disambiguating proteins, genes, and RNA in text: A machine learning approach, Bioinformatics, Vol. 17, pp. S97–S106.

    24. Novichkova, S., Egorov, S. and Daraselia, N. (2003), MedScan, a natural language processing engine for MEDLINE abstracts, Bioinformatics, Vol. 19(13), pp. 1699–1706.

    25. Friedman, C., Kra, P., Yu, H. et al. (2001), GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles, Bioinformatics, Vol. 17, pp. 74–82.

    26. Park, J. C., Kim, H. S. and Kim, J. J. (2001), Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar, in Proceedings of the 6th Pacific Symposium on Biocomputing, 3rd–7th January, Hawaii, pp. 396–407.

    27. Pustejovsky, J., Castano, J., Zhang, J. et al. (2002), Robust relational parsing over biomedical literature: extracting inhibit relations, in Proceedings of the 7th Pacific Symposium, 3rd–7th January, Hawaii, pp. 362–373.

    28. Novichkova, S., Egorov, S. and Daraselia, N. (2003), MedScan, a natural language processing engine for MEDLINE abstracts, Bioinformatics, Vol. 19, pp. 1699–1706.

    29. Kolch, W., Calder, M. and Gilbert, D. (2005), When kinases meet mathematics: The systems biology of MAPK signalling, FEBS Lett., Vol. 579(8), pp. 1891–1895.

    30. Ogata, H., Goto, S., Sato, K. et al. (1999), KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Res., Vol. 27, pp. 29–34.

    31. Golbeck, J. (2003), The National Cancer Institute's thesaurus and ontology, J. Web Semantics, Vol. 1, pp. 75–80.

    32. Clyne, C. D., Kovacic, A., Speed, C. J. et al. (2004), Regulation of aromatase expression by the nuclear receptor LRH-1 in adipose tissue, Mol. Cell Endocrinol., Vol. 215(1–2), pp. 39–44.

    33. Quandt, K., Frech, K., Karas, H. et al. (1995), MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data, Nucleic Acids Res., Vol. 23, pp. 4878–4884.

    34. URL: http://www.genomatix.de/

    35. Prestridge, D. S. (2000), Computer software for eukaryotic promoter analysis, Methods Mol. Biol., Vol. 130, pp. 265–295.

    36. URL: http://bimas.dcrt.nih.gov/molbio/signal

    37. Chen, Q. K., Hertz, G. Z. and Stormo, G. D. (1995), MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices, Comput. Appl. Biosci., Vol. 11, pp. 63566.

    38. Kel, A. E., Gossling, E., Reuter, I. et al. (2003), MATCH™: A tool for searching transcription factor binding sites in DNA sequences, Nucleic Acids Res., Vol. 31, pp. 3576–3579.

    39. Heinemeyer, T., Chen, X., Karas, H. et al. (1999), Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms, Nucleic Acids Res., Vol. 27, pp. 318–322.

    40. URL: http://transfac.gbf.de/TRANSFAC/

    41. Klingenhoff, A., Frech, K., Quandt, K. and Werner, T. (1999), Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity, Bioinformatics, Vol. 15, pp. 180–186.

    42. Klingenhoff, A., Frech, K. and Werner, T. (2002), Regulatory modules shared within gene classes as well as across gene classes can be detected by the same in silico approach, In Silico Biol., Vol. 2, pp. S17–S26.

    43. Werner, T. (1999), Models for prediction and recognition of eukaryotic promoters, Mamm. Genome, Vol. 10, pp. 168–175.

    44. Firulli, A. B. and Olson, E. N. (1997), Modular regulation of muscle gene transcription: A mechanism for muscle cell diversity, Trends Genet., Vol. 13, pp. 364–369.

    45. Boehlk, S., Fessele, S., Mojaat, A. et al. (2000), ATF and Jun transcription factors, acting through an Ets/CRE promoter module, mediate lipopolysaccharide inducibility of the chemokine RANTES in monocytic Mono Mac 6 cells, Eur. J. Immunol., Vol. 30, pp. 1102–1112.

    46. Fessele, S., Boehlk, S., Mojaat, A. et al. (2001), Molecular and in silico characterization of a promoter module and C/EBP element that mediate LPS-induced RANTES/CCL5 expression in monocytic cells, FASEB J., Vol. 15, pp. 577–579.

    47. Wang, Q., Sigmund, C. D. and Lin, J. J. (2000), Identification of cis elements in the cardiac troponin T gene conferring specific expression in cardiac muscle of transgenic mice, Circ. Res., Vol. 86, pp. 478–484.

    48. Kramer-Hammerle, S., Hahn, A., Brack-Werner, R. and Werner, T. (2005), Elucidating effects of long-term expression of HIV-1 Nef on astrocytes by microarray, promoter, and literature analyses, Gene, June 13, available online: PMID: 15958282.

    49. Xenarios, I., Salwinski, L., Duan, X. J. et al. (2002), DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions, Nucleic Acids Res., Vol. 30, pp. 303–305.

    50. Hao, C. and Burt, M. S. (2004), Content-rich biological network constructed by mining PubMed abstracts, BMC Bioinformatics, Vol. 5, p. 147.

    51. Barrett, T., Suzek, T. O., Troup, D. B. et al. (2005), NCBI GEO: Mining millions of expression profiles – database and tools, Nucleic Acids Res., Vol. 33 (Database issue), pp. D562–566.

    52. Edgar, R., Domrachev, M. and Lash, A. E. (2002), Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Res., Vol. 30(1), pp. 207–210.
