a text-mining system for knowledge discovery from biomedical documents

18
A text-mining system for knowledge discovery from biomedical documents by N. Uramoto H. Matsuzawa T. Nagano A. Murakami H. Takeuchi K. Takeda This paper describes the application of IBM TAKMI for Biomedical Documents to facilitate knowledge discovery from the very large text databases characteristic of life science and healthcare applications. This set of tools, designated MedTAKMI, is an extension of the TAKMI (Text Analysis and Knowledge MIning) system originally developed for text mining in customer- relationship-management applications. MedTAKMI dynamically and interactively mines a collection of documents to obtain characteristic features within them. By using multifaceted mining of these documents together with biomedically motivated categories for term extraction and a series of drill-down queries, users can obtain knowledge about a specific topic after seeing only a few key documents. In addition, the use of natural language techniques makes it possible to extract deeper relationships among biomedical concepts. The MedTAKMI system is capable of mining the entire MEDLINE database of 11 million biomedical journal abstracts. It is currently running at a customer site. The life science industry is an emerging market in which application spaces, such as drug discovery and development in the pharmaceutical sector and clin- ical record management in health care, have become areas of significant recent interest. 1 Documents in the scientific literature play an important role in life science by serving as a potential source for under- lying knowledge discovery. These documents are a rich repository of information on relationships among biomedical concepts such as genes, proteins, diseases, and a variety of other key topics. Text mining is a technology that makes it possible to discover patterns and trends semiautomatically from huge collections of unstructured text. 2–6 It is based on technologies such as natural language pro- cessing, information retrieval, information extrac- tion, and data mining. 7 Early papers in this area men- tioned the possibility of knowledge discovery from the biomedical literature. Hearst, one of the founders of text mining, proposed a system for predicting the functions of unknown genes using biomedical doc- uments. 2 Swanson also described the idea of discov- ering new knowledge from the biomedical litera- ture. 4 Subsequently, considerable research has been done in the areas of biomedical concept extraction (named-entity extraction), relationship extraction, and network/pathway construction for protein-pro- tein interaction. However, although text mining has proved a promising approach for knowledge discov- ery from text sources, certain specific problems are encountered when trying to apply it to the realm of life science. First, existing approaches are incapable of handling the vast amount of textual domain-specific informa- tion available. Indeed, there is more data available than anyone could possibly read or digest. For ex- Copyright 2004 by International Business Machines Corpora- tion. Copying in printed form for private use is permitted with- out payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copy- right notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor. N. URAMOTO ET AL. 0018-8670/04/$5.00 © 2004 IBM IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 516

Upload: others

Post on 12-Sep-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A text-mining system for knowledge discovery from biomedical documents

A text-miningsystem for knowledgediscovery frombiomedical documents

by N. UramotoH. MatsuzawaT. NaganoA. MurakamiH. TakeuchiK. Takeda

This paper describes the application of IBMTAKMI� for Biomedical Documents tofacilitate knowledge discovery from the verylarge text databases characteristic of lifescience and healthcare applications. This setof tools, designated MedTAKMI, is anextension of the TAKMI (Text Analysis andKnowledge MIning) system originallydeveloped for text mining in customer-relationship-management applications.MedTAKMI dynamically and interactivelymines a collection of documents to obtaincharacteristic features within them. By usingmultifaceted mining of these documentstogether with biomedically motivatedcategories for term extraction and a series ofdrill-down queries, users can obtainknowledge about a specific topic after seeingonly a few key documents. In addition, theuse of natural language techniques makes itpossible to extract deeper relationshipsamong biomedical concepts. The MedTAKMIsystem is capable of mining the entireMEDLINE� database of 11 million biomedicaljournal abstracts. It is currently running at acustomer site.

The life science industry is an emerging market inwhich application spaces, such as drug discovery anddevelopment in the pharmaceutical sector and clin-ical record management in health care, have becomeareas of significant recent interest.1 Documents inthe scientific literature play an important role in lifescience by serving as a potential source for under-lying knowledge discovery. These documents are a

rich repository of information on relationships amongbiomedical concepts such as genes, proteins, diseases,and a variety of other key topics.

Text mining is a technology that makes it possibleto discover patterns and trends semiautomaticallyfrom huge collections of unstructured text.2–6 It isbased on technologies such as natural language pro-cessing, information retrieval, information extrac-tion, and data mining.7 Early papers in this area men-tioned the possibility of knowledge discovery fromthe biomedical literature. Hearst, one of the foundersof text mining, proposed a system for predicting thefunctions of unknown genes using biomedical doc-uments.2 Swanson also described the idea of discov-ering new knowledge from the biomedical litera-ture.4 Subsequently, considerable research has beendone in the areas of biomedical concept extraction(named-entity extraction), relationship extraction,and network/pathway construction for protein-pro-tein interaction. However, although text mining hasproved a promising approach for knowledge discov-ery from text sources, certain specific problems areencountered when trying to apply it to the realm oflife science.

First, existing approaches are incapable of handlingthe vast amount of textual domain-specific informa-tion available. Indeed, there is more data availablethan anyone could possibly read or digest. For ex-

�Copyright 2004 by International Business Machines Corpora-tion. Copying in printed form for private use is permitted with-out payment of royalty provided that (1) each reproduction is donewithout alteration and (2) the Journal reference and IBM copy-right notice are included on the first page. The title and abstract,but no other portions, of this paper may be copied or distributedroyalty free without further permission by computer-based andother information-service systems. Permission to republish anyother portion of this paper must be obtained from the Editor.

N. URAMOTO ET AL. 0018-8670/04/$5.00 © 2004 IBM IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004516

Page 2: A text-mining system for knowledge discovery from biomedical documents

ample, MEDLINE**8 is a database of over 11 millioncitations (abstracts) of biomedical articles datingback to the 1960s. MEDLINE is widely used as a goldenstandard for text-mining systems in life science, andseveral text-mining applications using MEDLINE havebeen proposed. The MedMeSH Summarizer9 ex-tracts MeSH** (Medical Subject Headings) terms10

that can summarize the nature of a cluster of genenames obtained from DNA microarrays (also calledDNA chips). MedMiner11 is a system that filters in-formation for the PubMed** search engine.12 Ob-viously, any approach that applies text-mining meth-ods to such a large document collection must behighly scalable and robust.

Second, existing information extraction systems onlyprovide extracted concepts and relationships in afixed way. Because these systems are noninteractive,it is difficult to iteratively apply mining processes ontheir results directly. With an interactive text-min-ing system, users are better able to discover hiddenknowledge by using a combination of mining func-tions and a trial-and-error approach.

To address these problems we have developed a text-mining system called IBM TAKMI* for BiomedicalDocuments (designated MedTAKMI hereafter),which is capable of mining the entire MEDLINE da-tabase in an interactive manner. The predecessor ofthis system, TAKMI (Text Analysis and KnowledgeMIning), is a text-mining system for customer rela-tionship management (CRM), which has been suc-cessfully used in call centers to mine customer sup-port call logs.13 The MedTAKMI system extendsTAKMI to provide a useful set of tools for knowledgediscovery from biomedical documents. MedTAKMIis designed to handle large document sets and is thuscapable of mining the entire set of MEDLINE citations.

The development of methods for extracting infor-mation on such biomedical concepts as genes, pro-teins, and diseases from text is an active area of re-search14–19 and typically involves the following twoprimary subtasks:

1. Entity extraction—the recognition of gene, pro-tein, and chemical names from biomedical text

2. Relation extraction—the extraction of relation-ships among these entities

Thus architecturally MedTAKMI consists of twomain components designed to handle informationextraction and entity/relationship mining.

The MedTAKMI system performs entity extractionbased on dictionary lookup. This approach is simpleconceptually and can recognize entities very quickly.We have developed a large domain dictionary thatcontains two million biomedical entities. These en-tities and their associated category names are usedas keywords in the MedTAKMI system so that userscan search for documents that contain a keywordwithin a specific category, for example, a query onthe keyword “p53” within the gene category.

In a preprocessing stage input documents are parsedby a shallow syntactic parser which extracts keywords(entities) with category labels, as well as any binaryand ternary relationships that may exist among theseentities. The MedTAKMI runtime engine then usesthis information to provide mining functions to users.Categories are constructed from public ontologicalknowledge, for example, using the MeSH terms inMEDLINE or the resources provided by Gene Ontol-ogy**.3 User-defined resources may also beemployed.

There has been extensive research in relation extrac-tion,20–35 wherein the goal is to extract relationshipsamong biomedical entities (e.g. proteins and genes),from patterns such as “A inhibits B” and “A acti-vates B,” where A and B represent specific entities.Such relationships may be extracted by using one ormore of the following information and methods:

● Surface string patterns20

● Syntactic information from shallow parsing21,22 andfull parsing23–27

● Templates and rules28–30

● Statistical information with machine learning32–35

In particular, the MedTAKMI system uses syntacticinformation with a shallow parser to extract binary(a noun and a verb) and ternary (two nouns and averb) relationships. These relationships from thedocument collection are aggregated and can be dis-played by category viewers as described later.

As previously noted, MedTAKMI is an extension ofTAKMI, a text-mining system for CRM.13 The maindifferences between these two systems are the fol-lowing:

● The use of hierarchical categories: TAKMI only sup-ports flat categories such as product names. ForMedTAKMI we developed a hierarchical categoryviewer because most biomedical entities (e.g.,genes and diseases) are defined hierarchically.

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 517

Page 3: A text-mining system for knowledge discovery from biomedical documents

● The extraction of ternary relationships to captureprotein-protein interaction by using deeper lan-guage analysis

● The introduction of support for domain-specificmining functions

● The development of a new system architecture andcomponentization structure

This paper is organized as follows: The next sectiondescribes the key features of MedTAKMI, includ-ing the system architecture and the information ex-traction process. We then introduce the searchingand mining functionalities of MedTAKMI. The fol-lowing section provides an example of the applica-tion of the MedTAKMI system to a specific user sce-nario. Finally, we summarize our work and describedirections for future research.

Features of the MedTAKMI mining system

The MedTAKMI architecture consists of two maincomponents: a preprocessing information extractionstage and a runtime search/mining server, as shownin Figure 1. In this section we briefly discuss thesemain components and then describe methods for in-formation extraction.

Information extraction process. Information extrac-tion occurs as a preprocessing step and involves sev-eral subcomponents (see Figure 1, upper left). InStep 1, the term annotator finds words in the inputtext (i.e., the sentence with the label “Input” shownin Figure 2) using the term dictionary and identifiesthese words using their canonical form. (As describedin more detail later, the term dictionary contains a

Preprocessing

OfflineTool

Subset Analysis Tool

MiningDatabaseMEDLINE

Documents

Terminology

Term Annotator

CategoryDictionary

CategoryDefinition

CCAT/TALENT

Web Server CategoryManager

TAKMIServer APIs

Parser CategoryAnnotater

FileDefinition

Indexes

• Keyword-based and full-text searching• Hierarchical category viewer• Chronological viewer• Two-dimensional viewer (term-association)• Trend analysis viewer

Figure 1 MedTAKMI architecture

MiningFunctions

IndexBuilder

Runtime Process

Information Extraction

Search/Mining Server

Building Indexes

Client(IE)

Download applet (client )program and access to server over TCP/IP

Applet

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004518

Page 4: A text-mining system for knowledge discovery from biomedical documents

pair of forms for each term: a surface form and acanonical form.) The obtained canonical words areembedded in the text document as annotations inXML (eXtensible Markup Language). In Step 2, theannotated text is passed to a syntactic parser. Theparser outputs segments of phrases labeled with theirsyntactic roles, for example NP (noun phrase) or VG

(verb group). The category annotator then assignscategories to the terms in these segments andphrases. The category dictionary consists of a set ofcanonical forms and their categories, which also in-dicates the node label in the hierarchy of categories.The hierarchical categories are in turn imported fromexisting hierarchies, such as the MeSH terms in MED-

Figure 2 Information extraction process

Repetitive sequence-based polymerase chain reaction effects deoxyribonucleic acids

repetitive sequence-based polymerase chain reaction

Proper Noun

verb

DNA

Proper Noun

repetitive sequence-based polymerase chain reaction

repetitive sequence-based polymerase chain reaction

DNA

effect

repetitive sequence-based polymerase chain reaction ... effect

repetitive sequence-based polymerase chain reaction ... effect ... DNA

Proper Noun

Proper Noun

A.1.2.23.4

verb

S ... V

S ... V ... O

DNA

Proper Noun

Category=A.1.2.23.4

Output:

V

Step 2: Parsing and assigning categories

Input:

Step 1: Term annotation

S O

Repetitive sequence-based polymerase chain reaction effects deoxyribonucleic acids

Repetitive sequence-based polymerase chain reaction effects deoxyribonucleic acids

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 519

Page 5: A text-mining system for knowledge discovery from biomedical documents

LINE, or from user-designed resources. Syntactic re-lationships among these entities, for example, sub-ject-verb (S-V) or subject-verb-object (S-V-O), arealso extracted from the output of the parser. All ex-tracted information is finally encoded into an indexfile that is used by the runtime part of the system.

Search/mining server process. The search and min-ing server, shown in the lower part of Figure 1, pro-vides users with the searching and mining servicesdescribed in detail later. The MedTAKMI system isa Web application and its client code is loaded intoInternet Explorer** as an applet (a servlet versionis also available now). Note that hierarchical cate-gory definitions are shared by both the information-extraction and the runtime-server preprocessingcomponents.

MEDLINE—a collection of biomedical documents.In their work life-science researchers typically useMEDLINE,8 a bibliography database that covers thebiomedical area. To understand our approach to theextraction of information from MEDLINE, some un-derstanding of MEDLINE itself is required. MEDLINEis administered by the National Center for Biotech-nology Information (NCBI)36 of the United States Na-tional Library of Medicine (NLM).37 It contains ap-proximately 11 million biomedical citations, datingfrom the mid 1960s to the present. Citations in MED-LINE are collected from over 4600 biomedical jour-nals published worldwide. Biomedical citations inMEDLINE are available to the general public at thePubMed Web site.12 Figure 3 shows an example ofone such citation.

To make citation lookup easy, NLM indexes the ar-ticles in MEDLINE for retrieval and classification us-ing the Medical Subject Headings (MeSH) thesau-rus.10 MeSH headings consist of sets of descriptorsin a hierarchical structure. Subheadings, or qualifi-ers, provide additional specificity for each descrip-tor. The major MeSH headings indicate the maincontents of the article, and the minor MeSH head-ings are used to describe secondary topics. MeSHalso contains other features such as check tags andage tags.

Each citation contains the article title, abstract, au-thors� names, MeSH headings, affiliations, publica-tion date, journal name, and other information. Thetitle and abstract are text strings that can be manip-ulated by natural-language-processing text-miningtechniques. For example, the MEDLINE citation

shown in Figure 3 contains the information shownin Table 1.

Figure 4 shows the result of retrieving the MeSH de-scriptor “Amino Acid Sequence” using the MeSHBrowser.38 These results indicate that there are twonodes labeled “Amino Acid Sequence” in the de-scriptor thesaurus: one at G06.184.603.060 and theother at L01.453.245.667.060. These codes representlocations; for example, G06.184.603.060 is the childnode of Molecular Structure (G06.184.603). The let-ter G at the beginning of the location represents themajor category Biological Sciences, while L repre-sents another major category, namely, InformationScience.

Dictionary-based information extraction. The pre-processing stage thus extracts meta-data informationfrom MEDLINE documents. In addition to informa-tion such as author and publication date, the systemcan extract information from other text fields, forexample, title or abstract, using natural language pro-cessing. This information can include sets of key-words (e.g. protein names), predicate-argument bi-nary relations (e.g. “activate-protein”), and ternaryrelations (e.g. subject-verb-object dependency triplets).

In the preprocessing phase of the MedTAKMI sys-tem, document titles and abstracts are parsed byCCAT, a shallow syntactic parser developed at the IBMThomas J. Watson Research Center using an ap-proach originally proposed by Charniak.39 Becausethis is a general-purpose parser that has not beentrained for biomedical documents, it is difficult toobtain optimized results by parsing documents fromthe medical domain. We solve this problem by firstannotating the text with domain dictionaries. Theterm dictionaries are constructed by users and em-ploy resources such as UMLS** (Unified MedicalLanguage System)40 or the users� own proprietaryresources. The annotations facilitate the parsing ofmedical-domain text even when the parser has notbeen specifically trained for this domain. This an-notation process is needed for two primary reasons:

1. Identification of term boundaries—Most techni-cal terms in the medical domain, for example,protein names, are compound words. Thus, bio-medical terms tend to consist of a combinationof numerals, symbols, and verbs, making it verydifficult to find term boundaries. For instance,the compound noun “repetitive sequence-basedpolymerase chain reaction” consists of an adjec-tive (repetitive), a past participle of a verb (se-

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004520

Page 6: A text-mining system for knowledge discovery from biomedical documents

Figure 3 PubMed Web site

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 521

Page 7: A text-mining system for knowledge discovery from biomedical documents

quence-based) and three nouns (polymerase,chain, reaction). The annotations made by thetechnical term dictionary are based upon a part-of-speech (POS) analysis. Note that words, suchas chain, which can be a noun or a verb, can fur-ther complicate this process.

2. Aggregation of synonymous expressions and spell-ing variations—There can be multiple expres-sions that are synonymous with a particular tech-nical term. These can arise from abbreviationsor acronyms as well as from spelling variations.If these variations are recognized as different en-tities, it can often cause problems for text min-ing. For instance, “DNA” and “deoxyribonucleicacid” are synonyms; thus, they should be countedas the same entity in the mining process. The dic-tionary contains spelling/abbreviation variantsand their canonical forms. By reducing these var-iants to a single canonical form, we can treatthem as the same entity.

After text is annotated with a technical term dictio-nary it is parsed by a parser, or tagger. In this sys-tem, we use the CCAT parser to assign a POS to eachword. This determination is based on the statisticaldistribution of candidate POSs for a word and theprobability of POS transitions (from/to adjoiningwords) that are extracted from a training corpus. Af-ter the annotation and parsing processes, phrases aredetermined by using head-driven models,41,42 and cat-egory identifiers are assigned. The current Med-TAKMI system maintains approximately 270000 hi-erarchical category identifiers. Of these, roughly170000 are MeSH category identifiers; the rest areuser-defined categories. MeSH terms are updatedannually, and the revisions are incorporated intoMedTAKMI.

Relation extraction. In general, a conventional textanalysis system counts the frequency of occurrenceof a given term and calculates its importance. The

Table 1 Meta-data for a MEDLINE citation

Symbol Meaning Value

PMID PubMed Identifier 12060689

TIS ISSN for the journal 1362-4962

VI Volume 30

DP Published Date 2002 Jun 15

TI Title Dictionary-driven prokaryotic gene finding

AB Abstract Gene identification, also known as gene finding or generecognition, is among the important problems of molecularbiology that have been receiving increasing attention with theadvent of large scale sequencing projects.......... (Snipped.)

AD Affiliation Exploratory Technology, IBM Tokyo Research Laboratory,1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502,Japan.

AU Author Shibuya, Tetsuo

AU Author Rigoutsos, Isidore

MH MeSH heading Algorithms

MH MeSH heading Amino Acid Sequence

MH MeSH heading Base Sequence

MH MeSH heading Codon, Initiator

MH MeSH heading Computational Biology/*methods (* represents Major MeSH)

MH MeSH heading *Genes, Archaeal

TA Journal Title Abbreviation Nucleic Acids Res

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004522

Page 8: A text-mining system for knowledge discovery from biomedical documents

Figure 4 MeSH browser

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 523

Page 9: A text-mining system for knowledge discovery from biomedical documents

system can then find documents that contain thequery terms or their synonyms. It may also displayrelationships among various terms using visualiza-tion tools. This would allow us to discover terms thatare strongly related, but we still would not know howor in what way they are related. For example, wecould find all documents containing the word smok-ing. But even though smoking may appear frequentlywithin these documents, conventional text analysissystems are unable to tell us, for example, what theeffects of smoking are, who is affected, and to whatextent. We could next identify terms that are stronglyrelated to the word smoking, but simple co-occur-rence of two terms (e.g., smoking and risk) revealsvery little about the semantic relationship betweenthem. Conventional text analysis systems ignore hintsthat indicate whether a verb is negative or not andwhether it is active or passive. Such systems are un-able to distinguish between the meanings of the fol-lowing two sentences: “Smoking increases the riskof lung cancer,” and, “The risk of lung cancer in-creases the consequences of smoking.” Both sen-tences contain the terms smoking and risk; simpleco-occurrence detection is not sufficient to deriverelationships.

The TAKMI system for CRM13 is able to extract “sub-ject . . . verb” or “verb . . . object” relationships ina sentence. It can extract such concepts as “modem. . . broken,” “file . . . not found,” and “hard disk . . .slow” from customer queries. These extractions wereinstrumental in facilitating problem detection andworkload reduction for analysts at customer helpcenters. Although the binary “subject . . . verb” or“verb . . . object” relationships suffice for customersupport purposes, a more detailed concept extrac-tion is needed for biomedical articles. This is espe-cially true because the “what subject influences whatobject” relationship is a very important one in thebiomedical domain. For instance, it is clear that ex-traction of the relationship “protein_A. . .activates

. . .protein_B. . .with. . .enzyme_C” provides moredetailed information than either “protein_A. . .ac-tivates. . .protein_B” or “protein_B. . .with. . .en-zyme_C.” Furthermore, we need to distinguish be-tween “protein_A. . .activates. . .protein_B. . .with. . .enzyme_C” and “protein_B. . .activates. . .pro-tein_A. . .with. . .enzyme_C.” In the MedTAKMIsystem, we extract these ternary relationships (as wellas binary relationships) using natural-language-pro-cessing technology. These extracted relationships areused as keywords by various MedTAKMI miningfunctions, as described in the next section. Table 2shows actual examples of ternary relationships ex-tracted from MEDLINE documents.

Searching and mining functionsThe MedTAKMI system is a text-mining system thatcan be integrated with search engines. It supportskeyword-based search engines and can also be usedwith other search engine implementations such asIBM�s GTR (Global Text Retrieval) and DB2* NetSearch Extender. The mining process can be appliedto the whole database or to a subset document col-lection obtained by a series of searches. Users cansubmit a query and receive a document collectionin which each document contains the query keywordsor their synonyms. Mining functions can then be ap-plied to the collection in order to discover under-lying information, such as protein-protein relation-ships. Alternatively, users can continue the searchprocess by using the results of previous searching andmining operations. Interactivity is a very importantMedTAKMI feature because it allows users to switchbetween searching and mining in a flexible manner.

MedTAKMI provides various mining functions forlarge document collections. Some of these functionsare general, but others are tailored to the life-sci-ence domain. The following functions are introducedin this section:

● Keyword-based and full-text searching● Hierarchical category viewer● Chronological viewer● Two-dimensional viewer (term-association)● Trend analysis viewer● Other analytical tools

Most of these viewers can function interactively, aunique feature of the MedTAKMI system. Experi-ences with text mining in the CRM domain13 havepreviously demonstrated the importance of interac-tivity in that application, and interactivity also proves

Table 2 Examples of ternary relationships extracted fromMEDLINE

apoptosis. . .induce. . .p53

apoptosis. . .inhibit. . .tumor protein p53 (Li-Fraumenisyndrome)

tumor protein p53. . .play. . .role

adhesion receptors. . .activate. . .extracellular signal-regulatedkinase

Ang II. . .activated. . .p38

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004524

Page 10: A text-mining system for knowledge discovery from biomedical documents

to be an important requirement for users in the life-science domain. For example, in knowledge-discov-ery and data-mining (KDD) tasks, such as on-line an-alytical processing (OLAP), users may wish to use afact discovered in a previous mining result to struc-ture a new query from a different point of view. In-deed, a user often cannot define a complete querybeforehand but rather must iteratively refine thequery based on previous results. Furthermore, whena document collection is very large, a single querymay not be able to sufficiently narrow the collection,requiring a user to submit a sequence of queries toobtain the desired subset of documents. MedTAK-MI�s viewers help users to navigate toward this goalinteractively.

Keyword-based and full-text searching. The Med-TAKMI system provides two types of searching: key-word and full text. A keyword index which associ-ates keywords with category codes is built by theinformation extraction phase, as described above.Users can thus submit a query such as “search fordocuments that contain the word p53 as a genename.” A disadvantage of keyword search, however,involves inaccuracies in keyword extraction. If a termconsisting of multiple words is not recognized as akeyword in the indexing phase, then this term can-not be matched later in a query submitted by a user.Thus, in MedTAKMI, users can also choose a full-text search which uses different indexes built sepa-rately to retrieve documents in which the term ap-pears literally. The MedTAKMI system allows usersto switch between these two search engines seamlessly.

Hierarchical category viewer. The hierarchical cat-egory viewer shows keyword distribution in a datacollection over a predefined hierarchy. For exam-ple, Figure 5 shows the distribution of keywords forthe “Disease” node and its child nodes in the MeSHhierarchy.

Blue bars at the bottom of the viewer show the fre-quency of occurrence of a keyword. In Figure 5, thereare 35 documents that contain the term “neoplasms”or any of the terms in its child nodes. (Note that forefficiency reasons this frequency was calculated fora subset of documents sampled from the full doc-ument collection. The cardinality of such a subsetcan be specified by the user.)

Red bars indicate relative frequency—this measurecompares the current document subcollection to theinitial document collection indexed in the Med-TAKMI system. In other words, searching is a pro-

cess that narrows down a document collection. As-sume D is the initial document collection. A keywordsearch due to query q1 returns D1, the subset of Dthat satisfies q1. Similarly, let (q1, q2, . . . , qn) be asequence of successive queries and let (D1, D2, . . . ,Dn) be document collections such that Di is the re-sult of query qi made also on collection D. The rel-ative frequency for a keyword w in the document col-lection Di is calculated using the following formula:

relfreq�w, Di� �

c�w, Di�

Dic�w, D�

D

where Di is the number of documents in the doc-ument collection Di and c(w, Di) is the number of doc-uments that contain the word w in the collection Di.

Two types of hierarchies are registered with the Med-TAKMI system. The flat type has no children; forexample, in Figure 5, “Compound,” “Amino Acid,”and “Organ” are flat. The tree type appears with a‘�’ notation; thus “Dry Lab Methods” and “MeSHMinor” in Figure 5 are tree type. Users can definethe hierarchy set using public hierarchies such asMeSH and Gene Ontology3 as well as user-specificor company-specific hierarchies. We have developeda hierarchy that currently consists of 95000 nodesfor the various concepts in the life-science domain.

Chronological viewer. This viewer allows a user todiscover trends by viewing the chronological distri-bution of a set of documents. Figure 6 shows themonthly distribution of approximately 330000 doc-uments selected from MEDLINE in 2002. MedTAKMIsupports yearly, monthly, and daily distributions. Us-ing this viewer, one can determine when a certainkeyword began to appear in the biomedical litera-ture and how its frequency of occurrence changedwith time. For example, the term HIV does not ap-pear in MEDLINE documents in the 1970s, but by the1990s it occurs with high frequency.

Two-dimensional maps (term-association). Thetwo-dimensional viewer allows a user to visualize thestrength of association between keywords. Figure 7shows protein-protein associations in a mouse doc-ument collection. The value in each cell representsthe strength of association of two keywords—thehigher the value, the stronger the association. Forexample, the proteins “Bcl2-associated X protein”

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 525

Page 11: A text-mining system for knowledge discovery from biomedical documents

Figure 5 Hierarchical category view

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004526

Page 12: A text-mining system for knowledge discovery from biomedical documents

and “G elongation factor” have a strong association.The numbers “19 (36.54%)” in the cell mean thatthere were 19 documents which mentioned bothBcl2-associated X protein and G elongation factor,and that these 19 documents comprise 36.54 percentof the documents mentioning Bc12-associated X pro-tein. Because 36.54 percent is much higher than otherpercentages, the cell is automatically highlighted.

Formally, the value for each cell v(wxi, wyj, D) is cal-culated by using the following formula:

p�wxi, wyj, D� �c�wxi � wyj, D�

c�wyj, D�

v�wxi, wyj, D� �p�wxi, wyj, D�

1M �

j�1

N

p�wxi, wyj, D�

.

where c(w, D) is the number of documents that con-tains the word w in the collection D. The words wxi

and wyj belong to the categories of x and y axes, re-spectively. M and N are the size of the matrix foreach x and y axis. Clicking a cell leads to furthersearching using the two keywords.

Users are not limited to protein-protein interactions.Associations for protein-gene, gene-disease, disease-organ, and so forth, are also possible. In fact, anycombination of hierarchy terms is possible.

Other analytical tools. The viewers described thusfar are interactive, allowing for a flexible mining pro-cess. However, some tools require a much longeranalysis time to produce a response. These tools areavailable to the user as off-line tools in the Med-TAKMI system. During the mining process interme-diate results can be stored and used as input to anoff-line tool. The final output of the off-line tool isshown independent of the runtime MedTAKMI pro-cess. This section introduces two off-line functionsprovided by MedTAKMI.

Similar entry search tool. The first function is one thatpermits searching of similar biological entities. Thisfunction uses the term-association table data de-scribed in the previous section. Each row of the ta-ble is represented by a vector whose attributes cor-respond to the keywords in the columns and whoseelements represent the joint occurrence probabili-ties of the two corresponding entities. The degreeof similarity between row entities is calculated based

on the distance between their corresponding vectorsusing a cosine measure. Principal component anal-ysis is used to reduce dimensions and improve pre-cision because the number of columns tends to belarge. Figure 8 shows the ranking result of signalingproteins similar to protein 14_3_3.

Subset analysis tool. The second function allows forthe category view data or term association view dataof two different queries to be compared and analyzedto identify significant differences. Consider two datasets A and B. The occurrence probabilities of en-tries that appear in both data set A and data set Bare compared. Entries are first organized into thefollowing three categories based on the AIC (AkaikeSelection Criterion) approach for statistical modelselection criteria.43

● Entries frequently appearing in data set A● Entries frequently appearing in data set B● Entries common to both data set A and data set B

For each category, entries are ranked by a scorebased on a statistical test. We use the following sta-tistical test:

H0: pA � pB

where pA and pB are the probability of occurrenceof the entry in data set A and data set B respectively.The chi-square statistic is calculated from the occur-rence data. Under the null hypothesis H0, this statis-tic has a chi-square distribution with one degree offreedom �1

2. We define the following ranking score,

Figure 6 Chronological view

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 527

Page 13: A text-mining system for knowledge discovery from biomedical documents

Figure 8 Result of similar entity search (query: 14_3_3)

Figure 7 Two-dimensional view of proteins vs proteins (mouse-related) for a document set that contains the term “p53”

PrintCopySave

protein tumor suppressor

breast cancer

Bc12-associated X protein

myelocytomatosis oncogene

transcription factor

cross category

By FrequecyBy Alphabet

protein

tumor suppressor

breast cancer

Bcl2-associated X-prot

myelocytomatosis onco

transcription factor

G elongation factor

proliferative cell nuclea

period

enhancer of rudimentar

epiregulin

progesterone receptor

Harvey rat sarcoma viru

tumor necrosis factor

epidermal growth factor

vascular endothelial gro

vitamin D receptor

estrogen receptor

protein-L-isoaspartate

MAS1 oncogene

protein kinase

cornichon

Mus musculus

vertical category

Cross Axle:

Mus musculus

Compound Amino Acid Organ+Dry Lab Methods+FUNCTION+cellular component+gene from Gene Ontology+Root of LocusLink phenotype+MeSH Minor (Tree)+MeSH Major (Tree) Minor MeSH Major MeSH+Protein By Species Age Tags Country Data Completed Publication Data URL Full-Text Last Revision Date Organization Name Prime Name of Substance Personal Name as Subject Publication Type Subset Secondary Source Identifier Source Summary For Patient In Publication Status Journal Title Abbreviation Check Tags Transliterated/Vernacular Title Update In Update Of URL Summary+PROPERTY Minor Qualifier Major Qualifier+SACCHARIDE+SMART Domain Affiliation Author Data Created

632 (100.0%)

57 (38.51%)

42 (35.9%)

60 (59.41%)

24 (38.1%)

27 (45.0%)

28 (53.85%)

19 (37.25%)

18 (37.5%)

17 (42.5%)

17 (42.5%)

13 (32.5%)

12 (32.42%)

11 (33.33%)

11 (33.33%)

9 (29.03%)

10 (33.33%)

8 (28.57%)

7 (25.92%)

9 (33.33%)

7 (25.92%)

8 (30.76%)

60 (9.49%)

8 (5.4%)

2 (1.7%)

101 (100.0%)

4 (6.34%)

3 (5.0%)

19 (36.54%)

2 (3.92%)

2 (4.16%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

1 (2.7%)

4 (12.12%)

0 (0.0%)

1 (3.22%)

1 (3.33%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

4 (14.81%)

0 (0.0%)

24 (3.79%)

6 (4.05%)

8(6.83%)

4 (3.96%)

63 (100.0%)

2 (3.33%)

1 (1.92%)

3 (5.88%)

3 (6.25%)

1 (2.5%)

1 (2.5%)

1 (2.5%)

9 (24.32%)

1 (3.03%)

3 (9.09%)

1 (3.22%)

1 (3.33%)

1 (3.57%)

3 (11.11%)

1 (3.7%)

1 (3.7%)

0 (0.0%)

27 (4.27%)

10 (6.75%)

3 (2.56%)

3 (2.97%)

2 (3.17%)

60 (100.0%)

1 (1.92%)

1 (1.96%)

0 (0.0%)

3 (7.5%)

3 (7.5%)

0 (0.0%)

1 (2.7%)

2 (6.06%)

0 (0.0%)

1 (3.22%)

1 (3.33%)

0 (0.0%)

1 (3.7%)

0 (0.0%)

1 (3.7%)

1 (3.84%)

57 (9.01%)

148 (100.0%)

4 (3.41%)

8 (7.92%)

6 (9.52%)

10 (16.66%)

3 (5.76%)

1 (1.96%)

2 (4.16%)

2 (5.0%)

2 (5.0%)

0 (0.0%)

5 (13.51%)

3 (9.09%)

2 (6.06%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

1 (3.7%)

1 (3.84%)

42 (6.64%)

4 (2.7%)

117 (100.0%)

2 (1.98%)

8 (12.7%)

3 (5.0%)

1 (1.92%)

3 (5.88%)

4 (8.33%)

23 (57.5%)

23 (57.5%)

20 (50.0%)

0 (0.0%)

0 (0.0%)

6 (18.18%)

1 (3.22%)

2 (6.66%)

16 (57.14%)

2 (7.4%)

11 (40.74%)

0 (0.0%)

3 (11.54%)

2D Map

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004528

Page 14: A text-mining system for knowledge discovery from biomedical documents

Ranking score � �log10 p

using the p-value calculated from the observed data.

Figure 9 shows the result of subset analysis by thisfunction using disease-experiment two-dimensional-

map data extracted by the MedTAKMI system. Doc-uments are selected using the term “Adult” in theage category in data set A and “Child” in data setB. Figure 9A shows the common entries in both Aand B. Figure 9B shows the entries which appear fre-quently in data set A, and Figure 9C shows those

Figure 9 Result of subset analysis tool

A Common entries

B Entries frequently appearing in data set A

C Entries frequently appearing in data set B

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 529

Page 15: A text-mining system for knowledge discovery from biomedical documents

appearing frequently in data set B. In Parts 9B and9C, entries are ranked by the statistical base scoredescribed earlier.

Usage scenarioThe MedTAKMI system can be used to aid in drugdiscovery and clinical information retrieval. For ex-ample, defects in the AML1 gene are thought tocause acute leukemia. Suppose we are interested indeveloping a treatment for leukemia and are look-ing for potential drug targets. We describe below theapplication of the MedTAKMI system to this prob-lem. For this example a subset of approximately330000 of the latest abstracts from the 2003 MEDLINEdistribution was used. A keyword search reveals that54 papers in the sample set mention AML1. Theseare depicted in Figure 10. Each line shows the pub-lication year, PubMed ID, and title of the paper.

A category view (Figure 10B) based on the NCBILocusLink44 phenotype information indicates thatleukemia is indeed the term most frequently asso-ciated with AML1, appearing in 17 of the 54 papers.Furthermore, the relative frequency of the term leu-kemia indicates that it appears 54.41 times more of-ten in this subset of 54 papers than it does in theentire sample database.

If we sort the entries in Figure 10B in decreasingorder of relative frequency (Figure 10C), we see thatmyelodysplasia syndrome 1 (a protein that appearsto be amplified in human prostate cancer)45,46 and3q21q26 syndrome (a syndrome associated with over-expression of the Evi-1 gene, an event in turn asso-ciated with myeloid leukemia) are the phenotypesthat are strongly associated with AML1. Similarly,we can discover that various fusion genes and pro-teins such as AML1-MTG16, MTG8, AML1-ETO,and TEL-AML1 are also strongly associated withAML1 when we consider the category view for primenames of substances (Figure 10D).

Now, consider a search using the keyword leukemia,which selects 1051 papers from the entire sample da-tabase. We have already established that there ap-pears to be a close association between AML1 andleukemia; if we can find another protein that isclosely associated with leukemia, that protein mightin turn serve as a potential target for drug design.A two-dimensional (2D) map analysis between Lo-cusLink phenotypes and signaling proteins in thissubcollection of 1051 papers shows very interestingresults (Figure 11).

Figure 10 Results and analysis of keyword search for the term AML1

Myelodysplasia syndrome-1

3q21 q26 syndrome

Gaucher disease

Myeloid leukemia

Megakaryoblastic leukemia

Metachromatic leukodystrophydue to deficiency of SAP-1

22

33

11

11

11

11

6066.066066.06

2022.022022.02

1011.011011.01

1011.011011.01

866.57866.57

195.67195.67

AML1-MTG16 protein

MTG8 protein

TEL-AML1 fusion protein

AML1-ETO fusion protein

22

1313

1010

6066.066066.06

226066.066066.06

4638.754638.75

4332.894332.89

Leukemia1717

44

44

44

44

33

54.4154.41

36.8136.81

50.0250.02

50.0250.02

50.0250.02

2022.022022.02

Melanoma

Li Fraumeni syndrome

Melanoma and neural system tumor syndrome

Pancreatic cancer/melanoma syndrome

3q21 q26 syndrome

A List of 54 papers containing the term AML1

LocusLink phenotypes with the highest relative frequencies in the list

LocusLink phenotypes most frequently appearing in the list

Prime names of substances with the highest relative frequency in the listD

B

C

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004530

Page 16: A text-mining system for knowledge discovery from biomedical documents

We find that some signaling proteins such as STYKcand TyrKc are commonly referred to in papers aboutleukemia, HMG-CoA lyase deficiency, hepatic lipasedeficiency, and Miller-Dieker syndrome. Similarly,SAM is associated with the same phenotypes. (Infact, fusion of the SAM domain of the TEL onco-gene to AML1 has been shown.45,46) In addition,HATPase_c appears associated with all the pheno-types. ITAM, on the other hand, might reveal a po-tential interesting relationship between leukemia andother phenotypes such as osteosarcoma. We can nowclick on a cell in the 2D map to access the corre-sponding papers if we are interested in pursuing aparticular association. These results could then beexplored as possible leads for drug development.Thus, the MedTAKMI system provides a contextualoverview for identifying potential targets for drugdesign and facilitates the navigation of relevantpapers.

Conclusions and future workThere are many useful applications for text miningin the life-science industry, particularly because ofthe vast amount of technical data and the myriad re-lationships contained therein that are waiting to beinferred, identified, and collated. There are alsomany textual databases other than MEDLINE to beexplored. For example, laboratory notebooks, inter-nal reports, and patent documents are all very im-portant resources that contain valuable informationabout new gene, protein, and drug discoveries.

The next stages of the MedTAKMI system evolu-tion include: (1) expanding the range of document

sources to include these other databases, (2) devel-oping a layer that allows integration of the Med-TAKMI system with external and internal tools thatare currently in daily use, and (3) improving the per-formance of the results system by developing a dis-tributed version of MedTAKMI running on a gridarchitecture.47,48 Because both the information-ex-traction process and the search/mining process han-dle several files independently, we believe that thefile I/O and calculation costs can be shared acrossmultiple processing units. If MedTAKMI can be ex-tended to run on a grid architecture, the performanceof both processes will be improved.

Compliance with the Unstructured InformationManagement Architecture (UIMA) proposed by a re-search community in IBM is also a major challenge.49

The UIMA is a common framework for text analyticstools that are defined as text analysis engines (TAEs).We have already redefined our information extrac-tors as TAEs. We are also working on developing text-mining middleware for life-science applications.50 Inthis project, the MedTAKMI system runs as a partof UIMA-based middleware.

In summary, this paper has described the Med-TAKMI text-mining system, a set of tools for knowl-edge discovery derived from millions of biomedicaldocuments. The MedTAKMI system is able to parse11 million MEDLINE citations with a syntactic shal-low parser and extract relationships from theseparsed entities with hierarchical category identifiers.It also provides a toolkit of interactive viewers to aid

Figure 11 2D map analysis correlating LocusLink phenotypes (vertical axis) and signaling proteins (horizontal axis) in 1051 papers

Leukemia

HMG-CoA lyase deficiency

Hepatic lipase deficiency

Miller-Dieker lissencephaly syndrome

Colorectal cancer

Lupus erythematosus

Osteosarcoma

Histiocytoma

Li-Fraumeni syndrome

9 (5.23%)

7 (10.29%)

7 (10.29%)

2 (5.4%)

0 (0.0%)

1 (3.57%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

S_TKc

9 (5.23%)

7 (10.29%)

7 (10.29%)

2 (5.4%)

0 (0.0%)

1 (3.57%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

STYKc

9 (5.23%)

7 (10.29%)

7 (10.29%)

2 (5.4%)

0 (0.0%)

1 (3.57%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

TyrKc

3 (1.74%)

3 (4.41%)

3 (4.41%)

1 (2.7%)

2 (6.06%)

3 (10.71%)

1 (3.7%)

1 (3.7%)

1 (3.7%)

HATPase_c

2 (1.16%)

2 (2.94%)

2 (2.94%)

1 (2.7%)

1 (3.03%)

3 (10.71%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

HisKA

2 (1.16%)

3 (4.41%)

3 (4.41%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

HR1

2 (1.16%)

2 (2.94%)

2 (2.94%)

1 (2.7%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

SAM

1 (0.58%)

0 (0.0%)

0 (0.0%)

0 (0.0%)

1 (3.03%)

0 (0.0%)

1 (3.7%)

1 (3.7%)

1 (3.7%)

ITAM

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 531

Page 17: A text-mining system for knowledge discovery from biomedical documents

in the discovery of underlying knowledge in the lit-erature of life science.

Acknowledgments

We would like to thank Celestar Lexico-Sciences,Inc. for collaborative work during the developmentand deployment of MedTAKMI.

*Trademark or registered trademark of International BusinessMachines Corporation.

**Trademark or registered trademark of the Gene Ontology Con-sortium, the United States National Library of Medicine, or Mi-crosoft Corporation.

Cited references

1. S. Arlington, S. Barnett, S. Hughes, and J. Palo, Pharma 2010:The Threshold of Innovation, IBM Corporation, http://www-1.ibm.com/services/strategy/e_strategy/pharma_2010.html.

2. M. Hearst, “Untangling Text Data Mining,” Proceedings ofthe 37th Annual Meeting of the Association for Computer Lin-guistics (ACL�99), College Park, MD, ACL, East Stroudsburg,PA (1999), pp. 3–10.

3. Gene Ontology Consortium, http://www.geneontology.org.4. D. Swanson, “Medical Literature as a Potential Source of New

Knowledge,” Bulletin of the Medical Library Association 78,No. 1, 29–37 (1990).

5. V. Brusic and J. Zeleznikow, “Knowledge Discovery and DataMining in Biological Databases,” Knowledge Engineering Re-view 14, No. 3, 257–277 (1999).

6. D. Swanson and N. Smalheiser, “An Interactive System forFinding Complementary Literatures: A Stimulus to Scien-tific Discovery,” Artificial Intelligence 91, No. 2, 183–203(1997).

7. R. Agrawal and R. Srikant, “Fast Algorithms for Mining As-sociation Rules in Large Databases,” Proceedings of the 20thInternational Conference on Very Large Databases (VLDB),Santiago, Chile, ACM, New York (1994), pp. 487–499.

8. MEDLINE, http://www.nlm.nih.gov/databases/databases_medline.html.

9. P. Kankar, S. Adak, A. Sarkar, K. Murari, and G. Sharma,“MedMesh Summarizer: Text Mining for Gene Clusters,”Proceedings of the 2nd SIAM International Conference on DataMining, Arlington, VA (2002), pp. 548–565.

10. Medical Subject Headings (MeSH), http://www.nlm.nih.gov/mesh/meshhome.html.

11. L. Tanabe, U. Scherf, L. Smith, J. Lee, L. Hunter, and J. Wein-stein, “MedMiner: An Internet Tool for Filtering and Or-ganizing Gene Expression and Pharmacological Information,”BioTechniques 27, 1210–1217 (1999).

12. PubMed, http://www.ncbi.nlm.nih.gov/entrez.13. T. Nasukawa and T. Nagano, “Text Analysis and Knowledge

Mining System,” IBM Systems Journal 40, No. 4, 967–984(2001).

14. K. Fukuda, T. Tsunoda, A. Tamura, and T. Takagi, “TowardInformation Extraction: Identifying Protein Names from Bi-ological Papers,” Proceedings of the Pacific Symposium on Bio-computing �98 (PSB �98), Kapalua, HI, (1998), pp. 707–718.

15. M. Krauthammer, A. Rzhetsky, P. Morozov, and C. Fried-man, “Using BLAST for Identifying Gene and Protein Namesin Journal Articles,” Gene 259, No. 1–2, 245–252 (2000).

16. N. Collier, C. Nobota, and J. Tsujii, “Extracting the Namesof Genes and Gene Products with a Hidden Markov Model,”Proceedings of the 18th International Conference on ComputerLinguistics (COLING 2000), Saarbrucken, Germany, ACL,East Stroudsburg, PA (2000), pp. 201–207.

17. V. Hatzivassiloglou, P. Duboue, and A. Rzhetsky, “Disam-biguating Proteins, Genes, and RNA in Text: A MachineLearning Approach,” Bioinformatics 17, Suppl. 1, S97–106(2001).

18. K. Yamamoto, T. Kudo, A. Konagaya, and Y. Matsumoto,“Protein Name Tagging for Biomedical Annotation in Text,”Proceedings of the ACL 2003 Workshop on Natural LanguageProcessing in Biomedicine (BioNLP), Sapporo, Japan, ACL,East Stroudsburg, PA (2003), pp. 65–72.

19. Y. Tsuruoka and J. Tsujii, “Boosting Precision and Recall ofDictionary-Based Protein Name Recognition,” Proceedingsof the ACL 2003 Workshop on Natural Language Processingin Biomedicine (BioNLP), Sapporo, Japan, ACL, EastStroudsburg, PA (2003), pp. 41–48.

20. C. Blaschke, M. Andrade, C. Ouzounis, and A. Valencia, “Au-tomatic Extraction of Biological Information from ScientificText: Protein-Protein Interactions,” Proceedings of the AAAIConference on Intelligent Systems in Microbiology (ISMB �99),Heidelberg, Germany, AAAI, Menlo Park, CA (1999), pp.60–77.

21. T. Sekimizu, H. Park, and J. Tsujii, “Identifying the Inter-action between Genes and Gene Products Based on Fre-quently Seen Verbs in MEDLINE Abstracts,” Genome In-formatics, 62–71 (1998).

22. T. Rindflesch J. Rajah, and L. Hunter, “Extracting Molec-ular Binding Relationships from Biomedical Text,” Proceed-ings of the 6th Applied Natural Language Processing Confer-ence (ANLP-NAACL 2000), Seattle, WA, ACL, EastStroudsburg, PA (2000), pp. 188–195.

23. A. Yakushiji, Y. Tateisi, Y. Miyao, and J. Tsujii, “Event Ex-traction from Biomedical Papers Using a Full Parser,” Pro-ceedings of the Pacific Symposium on Biocomputing �01(PSB�01), Kohala Coast, HI (2001), pp. 408–419.

24. C. Friedman, P. Kra, H. Yu, M. Krauthammer, and A.Rzhetsky, “GENIES: A Natural-Language Processing Sys-tem for the Extraction of Molecular Pathways from JournalArticles,” Bioinformatics 17, Suppl. 1, S74–82 (2001).

25. J. Park, H. Kim, and J. Kim, “Bidirectional Incremental Pars-ing for Automatic Pathway Identification with CombinatoryCategorical Grammar,” Proceedings of the Pacific Symposiumon Biocomputing �01 (PSB�01), Kohala Coast, HI (2001), pp.396–407.

26. J. Hosaka, J. Koh, and A. Konagaya, “Effect of Utilizing Ter-minology on Extraction of Protein-Protein Interaction Infor-mation from Biomedical Literature,” Proceedings of the 10thConference of the European Chapter of the Association for Com-puter Linguistics (EACL�03), Budapest, Hungary, ACL, EastStroudsburg, PA (2003), pp. 107–110.

27. S. Novichkova, S. Egorov, and N. Daraselia, “MedScan, a Nat-ural Language Processing Engine for MEDLINE Abstracts,”Bioinformatics 19, No. 13, 1699–1706 (2003).

28. J. Thomas, D. Milward, C. Ouzounis, and S. Pulman, “Au-tomatic Extraction of Protein Interactions from Scientific Ab-stracts,” Proceedings of the Pacific Symposium on Biocomput-ing �00 (PSB�00), Honolulu, HI (2000), pp. 538–549.

29. K. Humphreys, G. Demetriou, and R. Gaizauskas, “Auto-matically Extracting Enzyme Interaction and Protein Struc-ture Information from Biological Science Journal Articles,”Proceedings of the Symposium on Artificial Intelligence in Bioin-formatics of the 2000 Convention of the Society for the Study

N. URAMOTO ET AL. IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004532

Page 18: A text-mining system for knowledge discovery from biomedical documents

of Artificial Intelligence and Simulation of Behavior (AISB�00),Birmingham, UK, AISB, Brighton, UK (2000).

30. K. Humphreys, G. Demetriou, and R. Gaizauskas, “Two Ap-plications of Information Extraction to Biological ScienceJournal Articles: Enzyme Interactions and Protein Struc-tures,” Proceedings of the Pacific Symposium on Biocomput-ing �00 (PSB�00), Honolulu, HI (2000), pp. 502–513.

31. G. Leroy and H. Chen, “Filling Preposition-Based Templatesto Capture Information from Medical Abstracts,” Proceed-ings of the Pacific Symposium on Biocomputing �02 (PSB�02),Lihue, HI (2000), pp. 350–361.

32. M. Craven and J. Kumlien, “Constructing Biological Knowl-edge Bases by Extracting Information from Text Sources,”Proceedings of the AAAI Conference on Intelligent Systems inMicrobiology (ISMB �99), Heidelberg, Germany, AAAI,Menlo Park, CA (1999), pp. 77–86.

33. E. Marcotte, I. Xenarios, and D. Eisenberg, “Mining Liter-ature for Protein-Protein Interactions” Bioinformatics 17, No.4, 359–363 (2001).

34. See note 17.35. B. Stapley and G. Benoit, “Biobibliometrics: Information Re-

trieval and Visualization from Co-occurrences of Gene Namesin MEDLINE Abstracts,” Proceedings of the Pacific Sympo-sium on Biocomputing �00 (PSB�00), Honolulu, HI (2000),pp. 529–540.

36. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/.

37. United States National Library of Medicine, http://www.nlm.nih.gov/.

38. MeSH Browser, http://www.nlm.nih.gov/mesh/MBrowser.html.

39. E. Charniak, Statistical Language Learning, MIT Press, Cam-bridge, MA (1993).

40. Unified Medical Language System, http://www.nlm.nih.gov/research/umls/umlsmain.html.

41. C. Pollard and I. Sag, Head-Driven Phrase Structure Gram-mar, University of Chicago Press and CSLI Publications, Chi-cago, IL (1994).

42. R. Berwick, Robert, “Computational Complexity and Lex-ical Functional Grammar,” Proceedings of the 19th AnnualMeeting of the Association for Computational Linguistics, Stan-ford, CA, ACL, East Stroudsburg, PA (1981), pp. 7–12.

43. H. Akaike, “A New Look at the Statistical Model Identifi-cation,” IEEE Transactions on Automatic Control 19, 716–723(1974).

44. LocusLink, http://www.ncbi.nih.gov/LocusLink/.45. H. Tran, C. Kim, S. Faham, M. Siddall, and J. Bowie, “Na-

tive Interface of the SAM Domain Polymer of TEL,” BMCStructural Biology 2, No. 1, 5–11 (2002).

46. H. Sattler, R. Lensch, V. Rohde, E. Zimmer, E. Meese, H.Bonkhoff, M. Retz, T. Zwergel, A. Bex, M. Stoeckle, and B.Wullich, “Novel Amplification Unit at Chromosome 3q25-q27 in Human Prostate Cancer,” Prostate 45, No. 3, 207–15(2000).

47. I. Foster, C. Kesselman, and S. Tuecke, “The Anatomy ofthe Grid: Enabling Scalable Virtual Organizations,” Interna-tional Journal of Supercomputer Applications 15, No. 3,200–222 (2001), http://www.globus.org/research/papers/anatomy.pdf.

48. The Grid 2: Blueprint for a New Computing Infrastructure (2ndEdition), I. Foster and C. Kesselman, Editors, Morgan Kauf-mann Publishing, Inc., San Francisco, CA (2003).

49. D. Ferrucci and A. Lally, “Building an Example Applicationwith the Unstructured Information Management Architecture,”IBM Systems Journal 43, No. 3, 455-475 (2004, this issue).

50. R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown,A. Coden, J. Cooper, A. Inokuchi, B. Iyer, K. Kummamuru,Y. Maas, H. Matsuzawa, and L. Subramaniam, “Text Ana-lytics for Life Science using the Unstructured InformationManagement Architecture,” IBM Systems Journal 43, No. 3,490-515 (2004, this issue).

Accepted for publication February 6, 2004.

Naohiko Uramoto IBM Research Division, Tokyo Research Lab-oratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Dr. Uramoto joined the IBM Tokyo Re-search Laboratory in 1990 after receiving a master�s degree incomputer science from Kyushu University. He received a Ph.D.in computer science from Kyushu University in 1990. He has beena visiting associate professor at the National Institute of Infor-matics since 2000. His research interests include natural languageprocessing, text mining, XML and metadata, and informationintegration.

Hirofumi Matsuzawa IBM Research Division, Tokyo ResearchLaboratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Mr. Matsuzawa joined the IBM TokyoResearch Laboratory in 1993 after receiving a master�s degreein software engineering from Waseda University. His experienceincludes 3D solid modeling, business intelligence, parallel OLAP,data mining, and text mining. His current focus is on data min-ing, text mining, and the application of grid architectures to lifescience.

Tohru Nagano IBM Research Division, Tokyo Research Lab-oratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Mr. Nagano joined the IBM Tokyo Re-search Laboratory text mining project in 1998 after receiving amaster�s degree in computer science at the University of Tsukuba.Currently he is developing applications for text-mining systems.His research interests include natural language processing, ma-chine learning, and statistical analysis.

Akiko Murakami IBM Research Division, Tokyo Research Lab-oratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Ms. Murakami joined the IBM Tokyo Re-search Laboratory in 1999 after receiving a master�s degree inphysics at Waseda University. Her current research interests in-clude natural language processing and knowledge understand-ing in collaborative documents.

Hironori Takeuchi IBM Research Division, Tokyo Research Lab-oratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Mr. Takeuchi joined the IBM Tokyo Re-search Laboratory in 2000 after receiving a master�s degree inmathematical engineering at the University of Tokyo. His researchinterests include information retrieval based on statistical anal-ysis and patent knowledge management.

Koichi Takeda IBM Research Division, Tokyo Research Lab-oratory, 1623-14, Shimotsuruma, Yamato, Kanagawa, Japan([email protected]). Mr. Takeda joined the IBM Tokyo Re-search Laboratory in 1983 after receiving an M.E. degree in in-formation science from Kyoto University. His research interestsinclude machine translation, text analytics, and informationretrieval.

IBM SYSTEMS JOURNAL, VOL 43, NO 3, 2004 N. URAMOTO ET AL. 533