
Leverage taxonomies for enterprise search using IBM OmniFind, IBM Classification Module, and SchemaLogic

A guide to tapping the power of taxonomies for integrated search solutions

Skill Level: Intermediate

Jochen Dörre ([email protected]), Software Engineer, IBM

Josemina Magdalen ([email protected]), Software Engineer, IBM

Wendi Pohs ([email protected]), Principal, InfoClear Consulting

Bob St. Clair ([email protected]), Senior Product Manager, SchemaLogic Inc.

08 Feb 2007

Employ professional tools for taxonomy management and auto-classification to enhance enterprise search solutions built with IBM® OmniFind™ Enterprise Edition. Use SchemaLogic's Enterprise Suite to centrally maintain a consolidated enterprise taxonomy and the IBM Classification Module for automatic classification of documents into taxonomic categories. Consider this article your step-by-step guide to the practical integration of the three applications.

© Copyright IBM Corporation 1994, 2006. All rights reserved.


Introduction

As the volume of online documents in the enterprise keeps increasing, systematic organization of enterprise content through classification becomes more and more important. This need is amplified by the advancing integration of formerly separate systems and by comprehensive enterprise search systems, like IBM OmniFind Enterprise Edition, that allow end users to retrieve documents from disparate sources through a single point of access. The use of an enterprise taxonomy (in other words, a hierarchically organized set of relevant categories) and the classification of document content is a powerful approach to addressing this need.

This article leads you through the individual tasks required to successfully create and deploy a high-performance enterprise taxonomy using the following tools:

• SchemaLogic Enterprise Suite to manage, model, and maintain a consolidated enterprise taxonomy

• IBM Classification Module for OmniFind Discovery Edition (ICM) to automatically classify documents

• IBM OmniFind Enterprise Edition (henceforth called OmniFind) to exploit category information in document search

This article assumes you have a basic knowledge of the functionality of an enterprise search system, and of OmniFind in particular. It describes how OmniFind can be integrated with SchemaLogic Suite and ICM into a true enterprise search system that leverages the power of taxonomies and auto-classification.

The structure of the article is as follows:

• Sections "Taxonomy management" and "Taxonomy deployment" cover requirements of taxonomy systems and how taxonomies are deployed, with the succeeding section highlighting "SchemaLogic Enterprise Suite"

• Sections "Automatic text classification" and "ICM server and tools" give a background on auto-classification in general, the classification method used in ICM, its virtues and superiority to other techniques, as well as how to work with the tooling

• Sections "Integrating classification and search" and "How to fit the systems together step by step" describe the benefits of integrating classification and search and how the integration of the three applications works in practice

Uses for an enterprise taxonomy

An enterprise taxonomy is a set of terms and the associated relationships that exist between them. It can be as simple as a list of products, or it can be a complex structure that supports the relationships between companies, their suppliers, and their customers. Enterprise taxonomies are used to describe the content that is generated by a business. Using the standard language that is present in these taxonomies, both business users and the software that supports their work can describe similar content the same way.

Broken down into its component parts, an enterprise taxonomy is:

• A multi-purpose, often hierarchical, list of terms that describes content

• Centrally managed or distributed with a strong governance model

It typically includes a combination of the following:

• Controlled vocabularies

• Allowed values for defined metadata fields

• Preferred terms

• Synonyms, acronyms, and abbreviations

• Attributes

• Standard thesaurus relationships

• Other named relationships
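The component parts listed above can be sketched as a simple data structure. The following is an illustrative model only, not the schema of any product mentioned in this article; all names (`Term`, `add_narrower`, and the sample labels) are assumptions chosen for this sketch:

```python
# A minimal, hypothetical sketch of a taxonomy term: a preferred label,
# synonyms (including acronyms and abbreviations), attributes, and
# hierarchical plus other named relationships.
from dataclasses import dataclass, field

@dataclass
class Term:
    preferred_label: str
    synonyms: set = field(default_factory=set)       # acronyms, abbreviations
    attributes: dict = field(default_factory=dict)   # e.g. owner, last modified
    broader: list = field(default_factory=list)      # multiple parents = polyhierarchy
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)      # other named relationships

def add_narrower(parent: Term, child: Term) -> None:
    """Link a child under a parent (a thesaurus-style BT/NT relationship)."""
    parent.narrower.append(child)
    child.broader.append(parent)

server = Term("Application Server", synonyms={"WAS", "WebSphere Application Server"})
software = Term("Software")
add_narrower(software, server)
```

Because `broader` is a list, a term can carry several parents, which is the polyhierarchy capability discussed later under modeling capabilities.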

Enterprise taxonomies provide the following benefits:

• Taxonomies can describe content across applications

• Taxonomies are focus-agnostic; they have no explicit "point of view"

• Taxonomies can be used to control the values in metadata

• Taxonomies can enhance search queries by adding synonyms, acronyms, and abbreviations

• Taxonomies can assist site navigation

More recently, enterprise taxonomies are also being used to enhance corporate search. New search techniques, like semantic and actionable search, are greatly improved by adding the knowledge that is already built into an enterprise taxonomy structure.

Typical examples include:

• Faceted navigation: An active interface -- a dynamic combination of search and taxonomy browse

• Search results clustering: Automatic grouping of documents into spontaneously labeled categories

• Categorized search results: Search results are grouped into a meaningful and stable classification defined by the taxonomy

• Actionable search: Allows users to do something directly from a search result

• Semantic search: Search enhanced by relevant data from different sources; this data is described by using an enterprise taxonomy

Taxonomy management

Taxonomies are designed by looking at content and talking to subject matter experts so that an appropriate model of the data can be created. Typically, taxonomists look at existing databases, Web sites, organizational charts, product lines, other legacy databases, term lists, and product documentation to tease out existing categories. Often a good, representative category system can be drawn from other systems.

Once the majority of this up-front analysis has been completed and the requirements for the enterprise taxonomy have been determined, a taxonomy-modeling software system must be put in place. Up until the late 1990s, large organizations either built their own applications for managing taxonomies, relied on modeling packages that shipped with individual systems, such as search engines or auto-classification systems, or used simple desktop applications, such as spreadsheets. Each approach presents challenges to the enterprise.

Home-grown systems can be very expensive to develop and maintain, particularly in environments that rely on large, complex, and dynamic taxonomies that are central to the business, such as those driving high-quality search results or high-quality auto-classification. As a result, organizations have either continued to invest in their own systems or, in non-critical environments, the high cost has contributed to the demise of taxonomy projects.


Modeling tools that are part of a system, such as a search or auto-classification system, are built to interact specifically with that system. As such, the taxonomy models managed in these system-specific tools are not easily integrated with other systems without extensive custom software development efforts. Additionally, these tools are often simple and cannot be used in more complex modeling, large-scale, or distributed ownership scenarios.

Recently, however, another option has become available to organizations. A number of software vendors have developed and released taxonomy modeling applications that are designed for generic enterprise use and are not tied to specific systems. These systems run the gamut of cost and capabilities, from single-user desktop modeling applications to robust enterprise-grade semantic management systems. They are designed for modeling and are generally more usable than system-specific tools. The real power of these applications is to provide a centralized modeling environment that multiple users, user types, and systems throughout the enterprise can access, both for modeling and for consuming models.

Choosing the right system for your enterprise requires mapping the requirements gathered in the up-front analysis against the features, capabilities, and costs of each option.

Taxonomy system features

To determine which route to take for managing your taxonomy (whether to build a system, license a commercial system, or use a taxonomy package that ships with an existing system), several factors must be considered. The primary drivers for your decision should be the needs of the organization and how you plan to manage and use your taxonomy.

General requirements

The general requirements include technological requirements, user and usability requirements, and taxonomic requirements, such as the size and level of activity of the taxonomy. Some of these key requirements include:

• Level of activity: This describes how dynamic the taxonomies will be. Taxonomies that change very little typically have few editors or owners, and the model updates are not required to flow immediately to other systems. In these cases, a simpler modeling application may suffice. However, highly dynamic taxonomies, ones that are changing daily or even hourly, will require a system with enough performance to support rapid changes to models from multiple users, multiple systems, or both.

• Size and complexity of vocabularies: Taxonomies with a large number of terms or a large number of relationships between those terms will require a modeling system that can scale with the growth of the taxonomy.

• Taxonomy integration: The more systems that centralized taxonomy models can be leveraged in, the more powerful and cost-effective the taxonomy creation and maintenance process will be. Broadly, other systems interact with the modeling system in one or both of two ways:

• Subscribing systems: These are systems that consume the whole taxonomy or a subset of it. How a taxonomy can be used is usually limited by how subscribing systems can use it and what structures they can utilize. These systems are on the receiving end of the taxonomy and may consume anything from flat lists to complex hierarchies all the way to complex ontologies.

• Publishing systems: In some environments, other systems may publish to the modeling application. In one case, the "taxonomy of record" for certain subsets of the overall enterprise taxonomy might be managed in another system. For example, a product list may be managed in an ERP system, but that list can be utilized by other systems (such as an ECM, auto-classification, or search system). In this case, the modeling application may be used as a "clearinghouse," receiving the product list from the ERP system, then re-distributing all or subsets of it to different systems. In another case, another application might generate terms that should be incorporated into the enterprise taxonomy. These types of systems include advanced natural language processing systems that can discover new terms and relationships by analyzing content (such as document text in ECM systems). Other examples include terms generated from social tagging systems and terms generated from search analytics systems.

• Subscribing and publishing systems: In some cases, a system can both subscribe and publish to the taxonomy modeling system. An example of this is an auto-categorization system that can consume a taxonomy for the purpose of categorization and can also discover new terms and relationships to feed back into the model. Iterative taxonomy maintenance helps auto-categorization systems become iteratively more accurate.

Active enterprise environments, with multiple subscribing systems, publishing systems, or both, require a modeling application that can readily and reliably connect to multiple systems in such a manner.

• Distribution of ownership: How broadly, geographically, and organizationally model ownership is distributed throughout an enterprise will dictate how robust the modeling tool must be to support a diverse population of users. This includes the ability for users to connect to the system easily and for client-side application management to be minimal.

• Number and types of users: In addition to model owners and taxonomy editors, an enterprise may have many users who are stakeholders and interested parties to the taxonomy or taxonomy subsets and may be contributors to the taxonomy development process. These users need to be able to access and view the models in ways that fit with their needs and authority. They may need to contribute to the taxonomy development process directly in situ within a subscribing system. For example, an author tagging a document in an ECM system may need to suggest a new term for a particular option list managed in the taxonomy modeling tool. Ideally, that process would be seamlessly integrated with the ECM UI so that users would not even realize they are interacting with the taxonomy management system.

• How users will use the tools: Different types of users will use the taxonomy modeling system in different ways. Taxonomy editors need powerful editing capabilities to quickly and easily make individual and bulk model changes, and powerful search and browse capabilities to quickly locate terms and taxonomy branches. Business owners of subsets may need to view just their portion of the taxonomy and may not need such powerful capabilities. Finally, stakeholders and users who occasionally suggest new terminology may need an even smaller subset of capabilities.

• Usability: The modeling graphical user interfaces must be easy for editors and contributors to use, and the capabilities must match the user's role. Capabilities to search and navigate the taxonomy and to create, edit, and delete terms and branches must be easy and intuitive to use.

• Technical architecture: The system must fit within enterprise technical architecture guidelines, such as supported platforms and databases. Additionally, if custom connections to the modeling system are to be developed and maintained by the enterprise, the system must have well-documented APIs and a comprehensive SDK.

• Multilingual capability: Many organizations need to maintain their taxonomies in many languages. Simple multilingual environments may need the ability to map languages one to one. More complex environments may need to accurately model complex interrelationships among languages, such as the ability to map a single complex term in one language to multiple terms or even a Boolean-type statement in another language.

• Scalability: The modeling system may need to scale in a number of different ways, such as the volume of terms, the number of users, and the geographical distribution of servers.

• Security and permissions: The security model of the system must meet the needs of the organization, including the ability to tie into enterprise security systems such as LDAP. The permissions model must allow for sufficient levels of ownership, viewing, editing, and collaboration rights.
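The subscribe/publish interaction described under "Taxonomy integration" above can be sketched in a few lines. This is an illustrative toy, not the API of SchemaServer or any other product; the class and method names (`ModelingSystem`, `subscribe`, `publish_term`) are assumptions made for this sketch:

```python
# Hypothetical sketch: a central modeling system that notifies
# subscribing systems of changes, while publishing systems (an ERP,
# a tagging system, and so on) feed new terms into it.
class ModelingSystem:
    def __init__(self):
        self.terms = set()
        self.subscribers = []   # callbacks representing subscribing systems

    def subscribe(self, callback):
        """Register a subscribing system; it is notified of each new term."""
        self.subscribers.append(callback)

    def publish_term(self, term):
        """Called by a publishing system to contribute a candidate term."""
        if term not in self.terms:
            self.terms.add(term)
            for notify in self.subscribers:
                notify(term)

model = ModelingSystem()
search_terms = []                      # stand-in for a search engine's subset
model.subscribe(search_terms.append)
model.publish_term("Sales Regions")    # e.g. published from an ERP system
```

A system that both subscribes and publishes, such as the auto-categorization example above, would simply register a callback and also call `publish_term` with newly discovered terms.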

Modeling capabilities

A second category of requirements has to do with the actual logical modeling capabilities and the type and complexity of the models. These considerations include:

• Term relationships: The types of taxonomies and the complexity of the term relationships need to be determined. These relationships can range the gamut from flat lists (no relationships), to simple hierarchies (for example, parent/child relationships), to thesaurus relationships (for example, those conforming to the NISO construction standards; see Resources), all the way to highly specified, defined relationships and ontologies.

• Sub-setting taxonomies: The ability to define subsets of the taxonomy, often based on facets, term relationships, or other attributes, is often needed for integrating with downstream systems, which have their own constraints on required terminology or complexity of model.

• System- and user-defined attributes: Attributes of terms in your system, such as when a term was created or modified, by whom, and so on, are often required for managing a taxonomy. Most of these attributes will come with the modeling system "out of the box." However, particularly for integrating taxonomies with other systems, creation and management of user-defined attributes is necessary. User-defined attributes are those attributes that can be established and configured by the customer. Often, when integrating with other systems, information specific to that system must be published along with the terms or the taxonomy. Being able to record, store, and manage that information in the modeling system is highly advantageous.

• Other modeling capabilities: Additionally, there are a number of other considerations to be made concerning modeling capabilities. These should be based on known requirements and include:

• Polyhierarchy: The ability for terms to have multiple parents

• Topic maps: The ability to model topic maps (for example, those conforming to ISO/IEC 13250:2003)

• RDF: The ability to model information in a manner compliant with RDF (Resource Description Framework, a family of World Wide Web Consortium specifications)

• OWL (Web Ontology Language): The ability to render a taxonomic model in an OWL-compliant format (see Resources)

Taxonomy deployment

Integrating the taxonomy with other systems

The technical method for integrating taxonomies and taxonomy subsets is typically dictated by the capabilities of both the taxonomy modeling system and the target system.

Typically, integration falls into one of two types, with different variations on the theme. At the simplest, or shallowest, level, the integration can occur by transforming and transferring files, such as XML files. Most systems can now readily export taxonomies in an XML format, and subscribing systems can import taxonomies as XML.

At the deepest level, the modeling system can be directly connected to subscribing systems using APIs. On the modeling system side, this allows the exposure of the modeling system's capabilities directly to other systems, including their user interfaces. Additionally, this allows the subscribing system to use the models stored in the modeling system without having to store another representation of them, such as in an XML file or in a database structure. Whenever the subscribing system needs the taxonomy or a subset, either for display or for other purposes such as building a search index or analyzing a query string, it accesses the data in real time from the modeling system. This prevents the problem of having multiple versions of the taxonomy in different systems and the versions getting out of synchronization.
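The shallow, file-based integration path can be illustrated with a short sketch that serializes a taxonomy subset to XML. The element and attribute names (`taxonomy`, `category`, `label`) are invented for this example; real systems define their own export schemas:

```python
# Illustrative sketch of file-based taxonomy exchange: a nested
# dictionary of categories serialized to XML for a subscribing system.
import xml.etree.ElementTree as ET

taxonomy = {"Geography": {"Europe": {"Germany": {}}, "Americas": {}}}

def to_xml(name, children):
    """Recursively build a category element with nested child categories."""
    node = ET.Element("category", {"label": name})
    for child, grandchildren in children.items():
        node.append(to_xml(child, grandchildren))
    return node

root = ET.Element("taxonomy")
for top, kids in taxonomy.items():
    root.append(to_xml(top, kids))

xml_text = ET.tostring(root, encoding="unicode")
# a subscribing system would parse this file back with ET.fromstring(xml_text)
```

The round trip (export, transfer, import) is exactly where the versioning problem described above arises, which is what the deeper API-level integration avoids.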

Synchronizing the taxonomy with other systems

In all but the deepest API-to-API integrations, processes must be put in place to synchronize changes made in the modeling system to subscribing systems. Synchronization methods typically fall into one or more of the following categories:


• Manual: A user-activated batch process, typically executed through a UI or a command line interface

• Scheduled: An automated synchronization set up to occur on a specified schedule

• Event driven: An automated process that occurs whenever specified changes are made to the model
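The core of a manual or scheduled synchronization is the same: diff the master model against the subscriber's copy and push only the changes. The following sketch is illustrative; the function name and flat-set representation are assumptions, and a real synchronization would also carry relationship and attribute changes:

```python
# Hypothetical sketch of a batch synchronization step: compute the
# additions and removals a subscribing system must apply to match
# the modeling system's current set of terms.
def synchronize(master_terms: set, subscriber_terms: set):
    """Return (terms to add, terms to remove) for the subscriber."""
    to_add = master_terms - subscriber_terms
    to_remove = subscriber_terms - master_terms
    return to_add, to_remove

master = {"Software", "Hardware", "Services"}
subscriber = {"Software", "Legacy Products"}
additions, removals = synchronize(master, subscriber)
```

In the manual case a user triggers this from a UI or command line; in the scheduled case a job scheduler runs it on a fixed interval; in the event-driven case the same logic (or a narrower per-change update) runs whenever the model changes.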

SchemaLogic Enterprise Suite

One such enterprise-grade taxonomy management system is the SchemaLogic Enterprise Suite.

At the core of the SchemaLogic Enterprise Suite is SchemaServer, a central repository where business model standards are gathered, created, refined, and reconciled, and from which the standards are distributed to subscribing systems. The Suite is capable of managing both semantic (including taxonomic) and structural models.

For semantic models, SchemaServer allows an organization to capture and manage the standard business terminology, code systems, and semantics that must be used consistently across the enterprise as vocabularies, terms, and relationships. This powerful and flexible system can model simple but critical lists like sales regions or marketing segmentation models up to complex multi-faceted taxonomies, thesauri, or ontologies describing complex business systems or product and services networks with hundreds of thousands of terms.

Additionally, SchemaServer describes the structural models used to store and exchange information as a hierarchy of information classes. This logical modeling capability allows you to capture a consistent and easy-to-understand model of all of the information systems in the enterprise and easily view and understand how they interrelate with each other. Relational, object-oriented, XML, SOA, and other technical models can be unified and brought under a business-oriented management model appropriate for business and technical participants. The relationships between semantic and structural models are also captured, enabling comprehensive governance and impact analysis to be implemented.

SchemaServer is accessed through powerful GUIs that allow you to quickly and easily create and maintain taxonomies. Workshop is a powerful modeling application used by expert modelers and taxonomy editors to administer corporate taxonomies. It includes powerful editing tools for importing, exporting, and making large-scale or bulk changes to the model.


Workshop Web is a zero-footprint, browser-based UI to manage the objects in SchemaServer. It is designed for everyday business users of the system.

Figure 1. Geography hierarchy in Workshop Web

The SchemaLogic solution is built around four key capabilities to enable organizations to manage business semantics and taxonomies within the context of everyday operations:

1. Model the structure and information relationships

2. Govern and manage the changes

3. Publish to subscribing systems

4. Collaborate to expand and maintain

These key capabilities enable participation and contribution across organizational, corporate, and industry boundaries to facilitate the development of business semantics in a dynamic, constantly changing environment.

SchemaLogic Enterprise Suite: Key features and capabilities

The SchemaLogic Enterprise Suite includes many of the key features required for taxonomy development and deployment projects:

• Ease of use: Workshop Web is highly graphical and emphasizes ease of use by general business users who typically do not have deep taxonomy development experience. Workshop, in addition to the capabilities found in Workshop Web, provides additional powerful editing and administrative capabilities.

• Import/export: Workshop users can perform manual file-based imports and exports utilizing XML or CSV-formatted files. Imports and exports can be performed against any defined subset or the entire model. Non-file-based imports and exports can be accomplished with the API or through product adapters.

• Collaboration: The SchemaLogic Suite includes a customizable governance system of contracts, permissions, and rights that allows all users to collaborate on the development of the semantics model.

• Permissions model: The permissions model allows user roles to be applied to any object in the system. Changes can be suggested but not committed until the appropriate owners and stakeholders have approved the changes.

• Impact analysis: The impact analysis feature allows users to graphically see which objects would be affected by a proposed change, as well as all the owners, stakeholders, and subscribers affected by the change.

• Defined relationships: Customers can define any number of custom term relationships, allowing organizations to model a full range of semantic relationships, including flat list, simple hierarchical, thesaurus, and ontological relationships.

• Configurable attributes: Customers can define any number of custom attributes for any object in the system. This is useful when integrating taxonomies with other systems, as those systems often require specific term attributes in order to integrate taxonomies.

• SDK: The SchemaLogic Suite includes a fully documented Web services SOAP API and a Java API to allow customers to write custom applications against the modeling server. This allows the modeling server to be integrated with existing line-of-business applications, exposing the modeling capabilities to users through their line-of-business UIs.

• Integration service and adapters: The SchemaLogic Suite includes a small-footprint integration server, an integration architectural framework accessible through Web services designed to give organizations the ability to quickly build and deploy API-to-API adapters to publish taxonomies or taxonomy subsets to subscribing systems. Additionally, SchemaLogic has pre-built product adapters for systems such as the IBM Content Management suite and the IBM OmniFind suite of systems.
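The CSV import path mentioned in the features above can be sketched briefly. The two-column `(term, parent)` layout is an assumption made for this example, not SchemaLogic's actual file format:

```python
# Illustrative sketch of a file-based CSV import: flat (term, parent)
# rows rebuilt into a parent-to-children map. The column layout is a
# hypothetical example, not a product-defined format.
import csv
import io

csv_text = """term,parent
Geography,
Europe,Geography
Germany,Europe
"""

children = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    children.setdefault(row["parent"], []).append(row["term"])

# children[""] holds top-level terms; children["Geography"] its narrower terms
```

An export would simply walk the hierarchy in the other direction, writing one `(term, parent)` row per term.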

Integrating SchemaLogic Suite with OmniFind Enterprise Edition, Version 8.4

OmniFind Enterprise Edition V8.4 exposes a number of capabilities that can be leveraged by an organization's taxonomy to finely tune and significantly enhance search results.

• Rule-based taxonomy: To simplify enterprise search deployments, OmniFind Enterprise Edition V8.4 provides the ability to configure a taxonomy of categories and category rules. The taxonomy serves two purposes. First, when the search index is created, taxonomy categories are applied to documents based on whether a document satisfies the rule. Second, once categories are applied to documents, the taxonomy can be used to create a browsing interface to the collection. Unlike many navigation-only solutions, OmniFind Enterprise Edition does not require a pre-defined taxonomy in order to deliver highly relevant search results. However, it can take advantage of taxonomy tags to influence both the results and the interface of a search application. Using the SchemaLogic Suite, organizations can apply OmniFind-specific rules to existing taxonomic terms and publish those taxonomies, or subsets, to OmniFind. This allows organizations to leverage existing taxonomies in OmniFind and to manage their OmniFind taxonomy as part of their overall taxonomy management system and processes.

• Linguistic dictionaries: OmniFind allows organizations to manage a number of different dictionaries to fine-tune results. In the SchemaLogic Suite, each of these dictionaries can be managed seamlessly within the overall taxonomy of the organization and periodically published to OmniFind.

• Synonyms: This dictionary can be used to expand terms in the query string sent to the search engine to include specified synonyms. For example, this allows organizations to tune the search engine to search for the complete spellings of common enterprise acronyms. If a user searches on "WAS," the search engine can also automatically return results for "WebSphere Application Server," and vice versa.

• Boost words: This dictionary specifies terms and phrases that raise or lower the rank value of the document in which the term appears. This allows organizations to manipulate the ranking of search results to provide more highly relevant documents to users.

• Stop words: This dictionary specifies a list of enterprise-specific terms that are removed from query strings to improve the relevancy of search results. Typically, stop words are commonly occurring words or phrases whose inclusion in a query string may cause a large quantity of poor results.
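How these three dictionaries can shape a query and a result ranking is easy to sketch in a few lines. The sample below is purely illustrative -- the dictionary contents, function names, and weighting scheme are invented for this sketch and are not OmniFind APIs:

```python
# Illustrative sketch of synonym expansion, stop-word removal, and boost
# weighting; not the actual OmniFind implementation.

SYNONYMS = {"was": ["websphere application server"]}    # acronym expansion
STOP_WORDS = {"the", "of", "ibm"}                       # enterprise-specific noise
BOOST_WORDS = {"deprecated": 0.5, "announcement": 2.0}  # hypothetical rank multipliers

def expand_query(query):
    """Remove stop words, then expand each remaining term with its synonyms."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

def boost_factor(document_text):
    """Multiply a document's base rank by the factor of every boost word it contains."""
    factor = 1.0
    for word, weight in BOOST_WORDS.items():
        if word in document_text.lower():
            factor *= weight
    return factor

# The stop word "the" is dropped and "WAS" gains its expanded form.
print(expand_query("the WAS installation"))
```

A real deployment would of course maintain these dictionaries in the taxonomy tool rather than in code, which is exactly the role the SchemaLogic Suite plays here.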

Automatic text classification

Background

There are various approaches to building document taxonomies. Some approaches are rule-based, mostly created and maintained by human experts. Others rely on automatic text classification techniques. Different text classification methodologies may yield different types of "models," or statistical descriptions of the world. Models may be complex or simple in the sense that the "classifiers," or the software components that determine whether a text belongs to a category, can be architecturally simple or complex. Typically there is a trade-off between sophisticated models and simple models, or between variance and bias.

Techniques such as Bayesian networks or neural networks use highly expressive models, which try to produce a non-biased classifier in order to "describe" a corpus of documents. Their results tend to have very high variance, which can be reduced only by large training sets and very static data. But in most real-world situations, and always in the customer interaction space, the databases, or corpora, are small, heterogeneous, and tend to change rapidly. Therefore, it cannot be assumed that there will be enough data to train these "complex" structures and reduce the variance. This typically results in what is called an over-fitted system, perhaps performing well in artificial tests, but not in the real world.

The ICM RME approach

ICM relies on a proprietary and unique algorithm (the Relationship Modeling Engine, or RME) to create an optimal trade-off between variance and bias. This approach is superior to the apparently "complex" methods such as Bayesian networks or neural networks. ICM RME's sophistication is not in the architecture of the classifiers, but rather in how these classifiers are fine-tuned and built in real time.

The ICM RME advantages

• Robust and accurate across noisy, imperfect, multi-intent, and ambiguous content

• Semantic understanding of text results in high accuracy, the ability to serve multiple channels, and cross-learning

• Supports both statistical and rule-based classification; rules may be applied as required to guide the Concept Modeling phase (for example, to identify different intents based on channels of communication)

• Easy bootstrapping techniques and configuration tools make it simple to deploy

• Elegant and embeddable architecture makes it easy to integrate

• Multiple languages supported, including a language identification module

Most of the algorithmic effort in the development of ICM RME was invested in how to automatically create and tune classifiers. As a result, ICM RME classifiers provide superior accuracy and the ability to generalize and learn from small training sets. In addition, these classifiers are highly intelligent in the way they are created dynamically and tuned, with either training or incremental learning.

ICM RME's algorithmic infrastructure is a unique self-learning engine, capable of classifying textual information even in imperfect and noisy situations. It incorporates new knowledge on the fly, without the need to reconfigure or retrain the system. ICM RME's technology is different from standard classification techniques; it emphasizes cleanness and transparency. Using Concept Modeling techniques, ICM RME has the unique capability of serving multiple applications from a single knowledge base. This is the original premise of knowledge management: automatically push knowledge from the more human-intensive channels to the more automated or unattended channels. ICM RME provides a mature technology that can adapt to real-world changes and continuously provide accuracy levels that make it valuable in a variety of real-world, mission-critical applications.

ICM RME provides services to applications that need to understand text or correlate between text and certain objects (for example, personalization or general data classification applications).

Typically, an application sends raw data to ICM RME for analysis and expects to receive a quick and accurate response based on the data content. In a typical message, about half of the content is irrelevant to the actual analysis, and the message contains shorthand (abbreviations), potential spelling errors, and other imperfect characteristics.

ICM RME receives this message and processes it in two main phases: a multilingual Natural Language Processing (NLP) phase, and a language-independent statistical Concept Modeling phase. The first step of NLP processing primarily consists of finding portions of the text that contain relevant data, and extracting key features or linguistic events from the text that will be used later by the Concept Modeling engine (and possibly by the calling application directly).

The NLP engine processes input text, regardless of channel, and creates a Concept Model. A Concept Model is a computer-readable data structure containing the primary concepts that appear in the original text and some of their relationships. This structure is then fed into the Concept Modeling engine for pattern matching.
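The actual Concept Model format is internal to ICM RME, but the idea of mapping surface variants to normalized concepts and recording relationships between them can be sketched with a hypothetical data structure (all names below are invented for illustration):

```python
# Hypothetical sketch of a "Concept Model" style structure. The real ICM RME
# representation is proprietary; this only illustrates storing normalized
# concepts plus relationships extracted by an NLP phase.
from dataclasses import dataclass, field

@dataclass
class Concept:
    surface_forms: set      # e.g. {"go", "goes", "went"}
    canonical: str          # normalized concept label, e.g. "go"

@dataclass
class ConceptModel:
    concepts: dict = field(default_factory=dict)   # canonical label -> Concept
    relations: list = field(default_factory=list)  # (canonical, canonical) pairs

    def add(self, canonical, *surface_forms):
        """Register a concept with the surface variants that map to it."""
        self.concepts[canonical] = Concept(set(surface_forms), canonical)

model = ConceptModel()
model.add("go", "go", "goes", "went")     # morphological variants map to one concept
model.add("server", "server", "servers")
model.relations.append(("go", "server"))  # a co-occurrence style relationship
print(sorted(model.concepts))
```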

ICM RME employs these breakthrough technologies

• ICM RME classifiers are built on a proprietary algorithm that performs a unique bias-variance trade-off

• A special processing phase at the end of the semantic modeling process uses nonlinear warping techniques in order to express classifier results directly as actual statistical probabilities

• Real-time learning: ICM RME has a unique capability to learn on the fly. The learning process is adaptive and constantly changes its own characteristics based on feedback and the knowledge it gains. It uses an adaptive variable memory based on automatically collected characteristics and knowledge of the world.

• Generic multilingual support: The engine was built in a language-independent way, capturing the true semantic characteristics of a category, regardless of the language

• Robust NLP analysis of imperfect language

NLP processing addresses the fact that many different variants of the same "word" can appear in the text. Some of these variations are morphological variants (for example, "go," "goes," and "went" are linked to the same concept), and some are due to spelling errors or other naturally occurring variations in expression. Concepts can be words, short sentences, multi-word tokens, numbers, dates, URLs, e-mail addresses, or any other meaningful patterns that appear in the document.

This highlights one of ICM RME's differentiators: the system looks for patterns in higher-level semantic structures, leveraging automatically collected domain and language knowledge, rather than finding text patterns directly.

The result of ICM RME processing is a list of categories or intents embedded in the original text. The system may also extract certain features or patterns and create metadata fields if configured to do so. The system can also flag all messages with certain categories over a pre-defined threshold for special processing. It is important to note that the certainty factor is actually an estimate of the statistical likelihood that the category was identified correctly. This is another unique feature of ICM RME, which makes its configuration much more straightforward and provides companies with much greater control over how and when fully automated actions are taken by applications -- a critical requirement in most environments, but absolutely essential in customer interactions.
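An application consuming such results might filter on the certainty factor before taking automated action. The sketch below is hypothetical -- the function names, thresholds, and result layout are assumptions for illustration, not the ICM client API:

```python
# Sketch of how an application might consume classification results in which
# each category carries a certainty score interpreted as a probability.

def select_categories(scored, min_score=0.5, max_categories=3):
    """Keep categories at or above the threshold, best first, capped at a maximum."""
    kept = [(cat, s) for cat, s in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_categories]

def flag_for_review(scored, category, flag_threshold=0.9):
    """Flag a message for special processing when a category exceeds a preset threshold."""
    return any(cat == category and s >= flag_threshold for cat, s in scored)

results = [("dieting", 0.81), ("nutrition", 0.64), ("file systems", 0.22)]
print(select_categories(results))         # the low-certainty category is dropped
print(flag_for_review(results, "dieting"))
```

The two thresholds here deliberately mirror the kind of MinRelevanceScore/MaxCategories controls described later for the OmniFind integration.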

Note that ICM RME performs very well in multi-intent scenarios; the feedback can be provided as a list of categories. In addition, the feedback process is very simple for the calling application; it only has to tell ICM RME which categories were correct -- the system does the rest. There is no need to say why or express a degree of confidence in the feedback. ICM RME automatically finds out why the feedback was given, and, in case of erroneous feedback, it will quickly nullify its effect (or "unlearn" it).

ICM server and tools

IBM Classification Module for OmniFind Discovery Edition

IBM Classification Module for OmniFind Discovery Edition is a cross-platform server application for writing client applications that interact with the Relationship Modeling Engine. The Relationship Modeling Engine is a full suite of language processing technologies targeted at analyzing, understanding, and classifying the natural language of customer interactions and other types of everyday communication. This functionality is easily embedded: the Classification Module exposes all the functionality necessary to develop applications that harness the power of the Relationship Modeling Engine. The Classification Module provides several client API libraries to enable rapid development of client applications in several programming languages, in particular Java. Ease of use and maintainability are combined with high availability and scalability. It is designed to run on multiple machines and can scale with customer load by making optimal use of hardware and software resources. The system is configured and maintained using the Classification Manager application.

About the Relationship Modeling Engine (RME)

The Relationship Modeling Engine uses natural language processing and sophisticated semantic analysis techniques to analyze and categorize text. When an application sends input text to the Relationship Modeling Engine for analysis, the system identifies the categories that are most likely to match this text. The Relationship Modeling Engine works together with an adaptive knowledge base -- a set of collected data used to analyze and categorize texts. The knowledge base reflects the kinds of text that the system is expected to handle. Relationship Modeling Engine-enabled applications use categories to denote the intent of texts.


When text is sent to the Relationship Modeling Engine for matching, the knowledge base data is used to select the category that is most likely to match the text. Before the knowledge base can analyze texts, it must be trained with a sufficient number of sample texts that are properly classified into categories. A trained knowledge base can take a text and compute a numerical measure of its relevancy to each category. This process is called matching or categorization. The numerical measure is called relevancy or score. The accuracy of a knowledge base can be maintained and improved over time by providing it with feedback -- confirmation or correction of the current categorization. The feedback is used to automatically update and improve the knowledge base. This process of automatic self-adjustment is called learning.
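The train/match/feedback cycle just described can be mirrored with a deliberately naive word-overlap scorer. This illustrates only the workflow -- the real knowledge base algorithms are proprietary and far more sophisticated, and none of these class or method names come from the ICM API:

```python
# Toy illustration of the train / match / feedback cycle using a trivially
# simple word-overlap scorer; it mirrors the workflow, not the algorithm.
from collections import Counter

class ToyKnowledgeBase:
    def __init__(self):
        self.word_counts = {}   # category -> Counter of training words

    def train(self, text, category):
        """Fold a pre-categorized sample text into the knowledge base."""
        self.word_counts.setdefault(category, Counter()).update(text.lower().split())

    def match(self, text):
        """Return (category, relevancy) pairs, highest relevancy first."""
        words = text.lower().split()
        scores = []
        for category, counts in self.word_counts.items():
            overlap = sum(counts[w] for w in words)
            scores.append((category, overlap / max(1, len(words))))
        return sorted(scores, key=lambda pair: pair[1], reverse=True)

    def feedback(self, text, correct_category):
        """Learning: a confirmed categorization updates the knowledge base."""
        self.train(text, correct_category)

kb = ToyKnowledgeBase()
kb.train("reset my password account locked", "account access")
kb.train("invoice payment overdue balance", "billing")
best_category, score = kb.match("cannot reset password")[0]
print(best_category)
kb.feedback("cannot reset password", best_category)  # confirmation improves the KB
```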

Classification Workbench

Classification Workbench is an application that allows you to create a knowledge base (KB) for use with IBM Classification Module for OmniFind Discovery Edition (ICM), analyze the KB, and evaluate its accuracy using reports and graphical diagnostics. The result is a KB that can be used in conjunction with applications powered by ICM.

Prior to using Classification Workbench, you'll collect pre-categorized sample data (for example, documents) representative of the data you expect to classify using ICM. You'll import this data into Classification Workbench to create a corpus file. Classification Workbench provides a variety of features and techniques that allow you to fine-tune the corpus to optimize KB accuracy. Using the corpus as input, you can create and test the KB. Then you can evaluate the KB using Classification Workbench reports and graphical diagnostics and improve its accuracy by editing the corpus you use to create the KB. The final product is a production-ready KB, for use with ICM-based applications.

An ICM RME KB is represented as a tree of nodes, with each node containing statistical knowledge or rules that assist the system in classifying text. Categories are the names of the nodes in the KB. The simplest way to organize nodes in a KB is a flat knowledge base structure, with all nodes on the same level. Classification Workbench builds such KBs automatically from a categorized corpus, and you do not have to explicitly specify their structure. In some cases, you may want to build a hierarchical knowledge base, consisting of nodes at multiple levels in the hierarchy.

One important advantage of the ICM RME KB is the ability to mix rules and statistics. This way you can effectively apply business logic -- external, non-statistical information usually defined through metadata -- in the classification process. You can easily craft such a KB using the Classification Workbench interactive KB Editor. Alternatively, the KB structure can be specified in an external textual format and imported into Classification Workbench.


Figure 2 illustrates a possible hierarchical KB structure. Squares represent rule nodes that work on metadata (for example, "language = French" or "Products = Servers"). Ellipses represent statistical nodes.

Figure 2. KB structure example
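In the spirit of Figure 2, a hierarchy mixing rule nodes (metadata tests) with statistical nodes could be sketched as follows. The node classes and routing logic are invented for illustration and do not reflect ICM's internal representation:

```python
# Minimal sketch of a hierarchical KB mixing rule nodes and statistical nodes.

class RuleNode:
    """A node that routes a document down its branch only if a metadata rule holds."""
    def __init__(self, name, field, value, children=None):
        self.name, self.field, self.value = name, field, value
        self.children = children or []

    def route(self, doc):
        if doc.get("metadata", {}).get(self.field) != self.value:
            return []
        hits = [self.name]
        for child in self.children:
            hits.extend(child.route(doc))
        return hits

class StatNode:
    """Crude keyword stand-in for a trained statistical classifier node."""
    def __init__(self, name, keywords):
        self.name, self.keywords = name, set(keywords)

    def route(self, doc):
        words = set(doc.get("text", "").lower().split())
        return [self.name] if words & self.keywords else []

kb_root = RuleNode("Products = Servers", "Products", "Servers", children=[
    StatNode("hardware failure", {"crash", "disk", "power"}),
    StatNode("configuration", {"install", "setting", "config"}),
])

doc = {"metadata": {"Products": "Servers"}, "text": "disk crash after power loss"}
print(kb_root.route(doc))
```

Here the rule node applies the external business logic (product metadata) before any statistical matching happens, which is the mixing the article describes.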

A typical workflow of using Classification Workbench to create a KB would be:

1. Gather pre-categorized data that will form the basis of a corpus.

2. Convert this data into a format recognized by Workbench (for example, Workbench recognizes CSV or XML obeying a certain pattern). Writing an application that directly produces a format recognized by Workbench can be a good option.

3. Create the KB structure. Workbench recognizes an XML format for the KB.

Then you'll use Workbench to:

4. Import the data and create a corpus file

5. Import the KB (if available)

6. Edit and categorize corpus items, as required


7. Create and analyze a KB, and generate analysis results

8. Evaluate KB accuracy by viewing summary reports and graphs. The best way to evaluate KB accuracy is to use the "KB Tune-Up Wizard."

9. As required, improve KB accuracy by editing the corpus and retraining

10. Export the KB to the IBM Classification Module for OmniFind Discovery Edition (ICM) Server
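Step 2 above suggests producing Workbench-readable data programmatically. The sketch below writes a simple CSV corpus; the column layout (one category column, one text column) is an assumption made for illustration -- consult the Classification Workbench documentation for the exact format it expects:

```python
# Sketch of generating a pre-categorized CSV corpus file. The header names
# and column order are hypothetical, not the documented Workbench format.
import csv
import io

samples = [
    ("billing", "invoice payment is overdue"),
    ("account access", "cannot reset my password"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["category", "text"])   # assumed header row
for category, text in samples:
    writer.writerow([category, text])

print(buffer.getvalue().strip())
```

In practice the same loop would read documents from your repository instead of an in-memory list and write to a file for import in step 4.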

For the training task, the Classification Workbench reports present a lot of information, both on the overall KB accuracy and on a per-category basis. The evaluation should start with verification of the overall KB accuracy, by generating the "KB Data Sheet," "KB Summary," and "Cumulative Success" reports. The "KB Data Sheet" will highlight potential problems. Measures like "Total cumulative success," "Top performing categories," "Poorest performing categories," "Categories that may be determined by external factors," and "Pairs of categories with overlapping intents" are very informative for general KB accuracy measurement, but they are only indications. The final decision has to be taken by the KB administrator, who understands the data and the business logic of the project.

"Categories that may be determined by external factors" may indicate that you should add external information to the documents as metadata, and add rules to the KB that refer to that metadata.

"Pairs of categories with overlapping intents" may indicate that categories should be redefined: either split the "overlapping" categories into several non-overlapping ones, or combine several categories into one. These are possible indications, but the decision has to be made according to the project's data and business logic needs.

If the nature of the data changes over time, the KB accuracy can be verified periodically using the Classification Workbench reporting tools, and the KB can be retrained if needed.

To conclude, IBM Classification Module for OmniFind Discovery Edition (ICM) is a powerful tool that uses natural language processing and sophisticated semantic analysis techniques to analyze and categorize text. ICM works together with an adaptive knowledge base/taxonomy (KB) that uses categories to denote the intent of texts. When text is sent to ICM for matching, the knowledge base data is used to select the category that is most likely to match the text. The KB can be hierarchical and can combine rule-based and statistical information. The Classification Workbench tool allows easy creation, analysis, and tuning of a knowledge base from representative data. Its reporting tools are very powerful, allowing editing and tuning of the data and of the KB to increase the accuracy of the classification.

Integrating SchemaLogic Suite with ICM

Using the SchemaLogic Suite, organizations can publish existing taxonomic terms to the Classification Workbench, so that the subset of the taxonomy imported into the Workbench becomes the KB structure, and hence the set of categories, on which the classifier is trained.

Thus, the auto-categorizer will tag documents or text streams with enterprise-specific categories that are actively managed within the organization. This ensures that a consistent set of approved terminology is used for auto-categorization.

Integrating classification and search

Search and classification are often integrated in a single system. They fit together nicely for several reasons.

First, they provide complementary mechanisms for describing documents. Search describes the document based on a small set of words supplied by the user (such as the query "fat"), whereas classification attempts to describe the overall document based on a set of descriptors supplied by the taxonomy (for example, in a subject taxonomy, one of the subjects). This means that if a search engine supplies the category to the user, it can be extremely easy for the user to distinguish which search results are really relevant. For example, if the user query is "fat," some of the results will be marked as "dieting" or "nutrition," but others will be marked as "file systems" (because FAT is also the File Allocation Table, used by the DOS operating system). A user seeing this mixture of topics can then refine the query to select just the ones intended by this ambiguous query.

Second, the processing of the data required by search and classification (in other words, document fetching, tokenization, lemmatization, and so on) is to a large extent the same. Hence, a system that couples them together can take advantage of common processing steps.

Search and classification can be paired in a number of ways:

• Search within a category: You can select a category and then search only documents that are both within the category and that match your query.


• Faceted search: In this method, you are allowed to specify several different facets (or characteristics) of a document to a search engine (for example, "search for all PDF documents about databases from last year"). This is actually a generalization of "search within a category," where multiple criteria, which may or may not be categories from a taxonomy, can be combined.

• Taxonomy browsing: Some or all of the documents on a Web site are displayed as a taxonomy that can be navigated, with each document assigned to one or more nodes of the taxonomy.

• Classifying search results: The results of a search are displayed together with their assigned categories. Categories can be used to group or sort result sets.
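With category tags attached to search hits, "search within a category" becomes a filter and "classifying search results" becomes a group-by. The sketch below is purely illustrative (plain data structures, not an OmniFind API):

```python
# Illustrative pairing of search results with taxonomy categories, reusing
# the ambiguous "fat" query example from the text.
from collections import defaultdict

hits = [
    {"title": "Low-fat diets compared", "category": "dieting"},
    {"title": "FAT32 volume limits", "category": "file systems"},
    {"title": "Dietary fat and health", "category": "nutrition"},
]

def search_within_category(hits, category):
    """Filter: keep only hits tagged with the selected category."""
    return [h for h in hits if h["category"] == category]

def group_results(hits):
    """Group-by: arrange hit titles under their assigned categories."""
    grouped = defaultdict(list)
    for h in hits:
        grouped[h["category"]].append(h["title"])
    return dict(grouped)

print(search_within_category(hits, "dieting"))
print(sorted(group_results(hits)))
```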

Integrating the three applications

To address these usage scenarios in an optimal way, organizations can leverage the power of all three applications together by using the SchemaLogic Suite to centrally manage the enterprise taxonomy and publish appropriate subsets to both OmniFind and the Classification Workbench. This results in the use of a consistent, actively managed set of semantics for auto-categorization and search, significantly enhancing results and ensuring that these systems are automatically kept up to date with the ever-evolving enterprise taxonomy.

The following section demonstrates how the integration works in practice.

How to fit the systems together step by step

This section gives detailed instructions for using the three systems in concert in the following scenario:

• A taxonomy is centrally managed using the SchemaLogic Suite

• Based on the taxonomy, a KB is trained for auto-classification using ICM

• The taxonomy is deployed within OmniFind, which uses a plug-in in its document-processing pipeline to connect to the Classification Module server and receive classifications from the taxonomy for each document it processes

The description assumes the following software versions: ICM Version 8.3 (previous name: "IBM Classification Module for WebSphere® Content Discovery" Version 8.3) and OmniFind Version 8.4. Please note that the focus here is on the steps that realize the integration of the three systems; details of the tasks accomplished within the individual tools are not presented.

Setting up the integration

Step 1: Create an OmniFind collection on which you want to employ auto-classification

Create an OmniFind collection with "rule-based categorization." The configuration option "rule-based categorization" is needed to allow categories obtained later from ICM to be stored in the OmniFind index and to allow the Search Application to browse the category tree.

Step 2: Deploy BNSCategoryAnnotator in OmniFind

As mentioned above, the integration of the ICM server with OmniFind requires an extension module to be loaded into OmniFind. The extensibility of OmniFind is based on the Unstructured Information Management Architecture (UIMA) (see the Resources section for more information). In this architecture, extension modules (in other words, UIMA plug-ins) are also called annotators. The annotator used here is contained in the UIMA PEAR package BNSCategoryAnnotator.pear (a simplified version of it is attached in the Download section). Figure 3 gives an architectural overview of this integration:

Figure 3. BNSCategoryAnnotator provides the bridge between OmniFind and ICM


The package contains a configuration file, BNS.xml (BNSSample.xml in the downloadable version), which contains a number of configuration parameters that need to be set before deploying the plug-in. The most important parameters are listed in Table 1.

Table 1. BNSCategoryAnnotator configuration parameters

Parameter            | Meaning                                                                                     | Example
---------------------|---------------------------------------------------------------------------------------------|--------
ServerURL            | The URL of the ICM server                                                                   | http://127.0.0.1:8081/Listener/mod_gsoap.dll
KBName               | The name of the KB in ICM                                                                   | EnterpriseTaxonomy
DefaultBodyFieldName | The KB field in ICM that is expected to contain the document body                           | text
MinRelevanceScore    | A float between 0 and 1; categories with a relevancy score below this threshold are ignored | 0.5
MaxCategories        | The maximum number of categories that may be assigned to a document                         | 3
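In a UIMA analysis engine descriptor, configuration parameters are expressed as nameValuePair elements. A fragment using the example values from Table 1 might look like the following (this is a sketch of the relevant section only, not the complete BNS.xml shipped with the package):

```xml
<!-- Sketch of the configurationParameterSettings section of a UIMA
     descriptor, filled with the example values from Table 1. -->
<configurationParameterSettings>
  <nameValuePair>
    <name>ServerURL</name>
    <value><string>http://127.0.0.1:8081/Listener/mod_gsoap.dll</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>KBName</name>
    <value><string>EnterpriseTaxonomy</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>DefaultBodyFieldName</name>
    <value><string>text</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>MinRelevanceScore</name>
    <value><float>0.5</float></value>
  </nameValuePair>
  <nameValuePair>
    <name>MaxCategories</name>
    <value><integer>3</integer></value>
  </nameValuePair>
</configurationParameterSettings>
```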


We recommend that you use the Eclipse-based Component Descriptor Editor that comes with the UIMA SDK to adapt these parameters as required by your application (a description of how to install the UIMA SDK Eclipse tooling and how to use the editor is contained in the UIMA SDK User's Guide and Reference; see also the Resources section). At a minimum, the ServerURL parameter needs to be adapted to your ICM server installation so that the annotator can connect to the ICM server. Also, in the simplified version, the parameter CategoryDirectory needs to be set to the following path, which contains the CategoryTree.xml file on the OmniFind controller node: <ES_NODE_ROOT>/master_config/<CollectionId>.parserdriver/ (replace <ES_NODE_ROOT> with the value of the respective environment variable when logged on as the OmniFind administrator; to find the <CollectionId>, go to the collection's General tab).

For the parameter DefaultBodyFieldName, you can choose a name such as "text," "body," or "contents." This name must be used again in the Classification Workbench, where you have to choose a field that contains the document text of your training data. Finally, the value for the parameter KBName should be left at the default (empty string). This ensures that the name of the root node of the taxonomy is taken as the KBName, which is the case for KBs developed with Workbench.

Figure 4. Editing parameters in BNS.xml using the UIMA SDK's Component Descriptor Editor


Then the PEAR package must be uploaded onto the OmniFind controller node. In the OmniFind administration console, use the System:Parse page in Edit mode to add the PEAR package as a new text analysis engine. Please refer to the tutorial "Semantic Search with UIMA and OmniFind" (developerWorks, December 2006) for details about deploying and using custom analysis engines with OmniFind.

Step 3: Associate the custom analysis engine with your OmniFind collection

To have the collection use the auto-classifier, it needs to be associated with the new text analysis engine. This setting is available in the Text Processing Options page for your collection (see the collection's Parse page in Edit mode). More details about configuring text processing can be found in the tutorial "Semantic Search with UIMA and OmniFind."

Step 4: Create and publish a taxonomy with SchemaLogic Enterprise Suite

In the described setup, you need to publish a taxonomy both to ICM and to OmniFind. Publication to ICM is done using the SchemaLogic Adapter for ICM, and publication to OmniFind is done using the SchemaLogic Adapter for OmniFind. The configuration and running of those adapters is done through the Workshop UI and the Integration Service. Use CSV format in the ICM adapter to publish a taxonomy (subset) for ICM, and publish the same taxonomy subset directly to OmniFind.

Configuration of the adapters includes specifying:

1. The directory to which the CSV file for the ICM server is written, or the connection information for the OmniFind controller node, respectively

2. The taxonomy or taxonomy subset in the SchemaLogic modeling server to be published to ICM and OmniFind

3. Any terms that should be excluded or included based on term attributes or term relationship types
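The include/exclude filtering the adapters apply can be pictured with a minimal sketch. This is illustrative only: the term attributes, function names, and CSV layout below are assumptions made for the example, not SchemaLogic's actual data model or file format.

```python
import csv
import io

# Illustrative taxonomy terms with a status attribute; in SchemaLogic the
# attributes and relationship types are defined in the modeling server.
terms = [
    {"term": "Finance", "parent": "", "status": "approved"},
    {"term": "Accounting", "parent": "Finance", "status": "approved"},
    {"term": "Draft Topic", "parent": "Finance", "status": "draft"},
]

def publish_subset(terms, exclude_status=("draft",)):
    """Keep only terms whose attributes pass the exclude rules."""
    return [t for t in terms if t["status"] not in exclude_status]

def to_csv(terms):
    """Write the published subset in a simple term,parent CSV layout."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["term", "parent"])
    for t in terms:
        writer.writerow([t["term"], t["parent"]])
    return buf.getvalue()

print(to_csv(publish_subset(terms)))
```

A scheduled run of the real ICM adapter would write such a file to the configured directory, while the OmniFind adapter pushes the same subset over its connection to the controller node.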

Figure 5 shows how the SchemaLogic Adapter for OmniFind can be configured:

Figure 5. Editing configuration settings in the SchemaLogic Adapter for OmniFind


The adapters can be run by any of the following methods:

1. A manual process, where an administrator executes the publication from the Workshop UI

2. A scheduled process, where publication is configured to occur with a specified frequency

3. A Web services call to the Adapter made by another application or system


After successful publication, the taxonomy is available as a KB (structure only) for ICM and as a category tree that can be browsed in OmniFind.

Step 5: In Classification Workbench, import the taxonomy as a KB structure and train it for auto-classification

Using the Import Wizard, for what to import, select the option Knowledge base, and for what type of knowledge base, select the option KB configuration. Provide the path of the configuration file describing the KB structure in the following screen, and click Finish (for details, please refer to the section "Importing and Exporting a KB Structure" in the Classification Workbench User's Guide).

To get a high-quality classifier, it is important to carefully select enough training samples for each category that should be recognized in the taxonomy. A training sample needs to be pure text, extracted from a sample document of the category in question.

Because the OmniFind/ICM integration for the auto-categorization runtime requires that all document text is provided within a single field (of NLP usage type Body), each training sample should be formed in the same way: all document text contained within a fixed single field of type Body. To simplify the extraction of document text, the OmniFind/ICM integration can be run in "training mode" (not included in the sample annotator). This mode considerably simplifies the task of collecting and preprocessing training data for KB training with the Workbench, because you can use OmniFind crawlers to fetch training documents and the OmniFind parser for document preprocessing and content extraction in the same way documents would be preprocessed for categorization.
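The "single Body field" requirement can be sketched as follows. The helper and field names here are hypothetical; the sketch only illustrates collapsing whatever text the parser extracted into one field per training sample:

```python
# Illustrative: collapse all parsed document fields into one Body field,
# matching the single-field form the training samples need to take.
def make_training_sample(doc_fields, category):
    body = " ".join(text for text in doc_fields.values() if text)
    return {"Body": body, "category": category}

doc = {"title": "Quarterly report", "content": "Revenue grew by 8 percent."}
sample = make_training_sample(doc, category="Finance")
print(sample["Body"])  # → Quarterly report Revenue grew by 8 percent.
```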

For the training itself, import the training samples into Workbench and make sure that the categories associated with the samples are correct. Please refer to the Workbench documentation for the details on how to train a KB.

Step 6: Export the taxonomy to the ICM server

Alternative setup not involving SchemaLogic Suite

If you do not use SchemaLogic for taxonomy maintenance, you can still integrate ICM auto-classification with OmniFind. In that case, you will be maintaining the taxonomy within Workbench (ignore step 4 above, and in step 5, define a KB from scratch inside Workbench). Hence, you will need to export the KB structure to OmniFind whenever you modify the KB in Workbench. The following special export step of the KB structure is required in that case to produce a category tree file for downstream use in OmniFind.

Important: This step needs to be done before you export the KB to the ICM server (step 6).


In the Workbench main window:

1. Select Export (the Export Wizard), then click Next.

2. Check Knowledge base, then click Next.

3. Check KB XML file, and click Next.

4. Enter a path for export, then click Finish.

5. Answer Yes to the question of whether you want to overwrite the User Field properties of each node.

Then, transfer the exported file to the OmniFind controller node, and use the following OmniFind administration commands to import it into OmniFind (assuming your collection ID is col_tax1 and the exported KB structure is /tmp/taxonomy.xml):

>esadmin taxonomy add -cid col_tax1 -fname /tmp/taxonomy.xml
>esadmin configmanager syncComponent -sid col_tax1.parserdriver

When satisfied with the classification quality of the trained KB, you need to export the KB to the ICM server.

You deploy the KB with the ICM server using the Export Wizard: Select Knowledge base in what to export, and use the KB format "IBM Classification Module."

This export step needs to be repeated each time you change anything in the KB structure, such as when you add a category or change a name. When you maintain the taxonomy within SchemaLogic Suite, you will not perform such changes locally within Workbench, but rather on the original taxonomy, which you then re-import into Workbench before re-training.

Now, OmniFind is ready to process documents. Start crawling and parsing, and build an index. The sample OmniFind Search Application lets you browse through the taxonomy to view the documents associated with any given category. You can also use the Search and Indexing API (SIAPI) to enhance sophisticated queries with restrictions to categories. Note that category constraints need to specify the string "rulebased" as a taxonomy ID in that case.
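Conceptually, such a category constraint narrows the result set to documents whose category path lies under a given node of the "rulebased" taxonomy. The following sketch mimics that behavior in plain Python; it is not SIAPI code, and the path layout is an assumption made for illustration:

```python
# Illustrative index entries: each document carries the category paths
# assigned by the auto-classifier, under the "rulebased" taxonomy ID.
docs = [
    {"id": 1, "categories": ["rulebased/Finance/Accounting"]},
    {"id": 2, "categories": ["rulebased/HR/Benefits"]},
]

def restrict_to_category(docs, category_prefix):
    """Keep only documents with at least one category under the prefix."""
    return [
        d for d in docs
        if any(c.startswith(category_prefix) for c in d["categories"])
    ]

print([d["id"] for d in restrict_to_category(docs, "rulebased/Finance")])  # → [1]
```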

Completing the development cycle

Whenever the taxonomy changes, steps 4 (publish to ICM and OmniFind), 5 (import into Workbench and re-train), and 6 (export to ICM server) must be repeated.


Note that taxonomy changes may invalidate any categorization of documents processed by OmniFind previously. Hence, whenever you update the taxonomy, categories stored in the OmniFind index for a document may be wrong until you re-process (in other words, re-crawl, re-parse, and re-index) that document.

Conclusion

This article has motivated the use of

1. Centrally maintained and consolidated taxonomies and

2. Automatic text classification

for enterprise search applications. It has shown how to set up and use the three-fold integration of OmniFind combined with both SchemaLogic Enterprise Suite to address the first item and IBM Classification Module for OmniFind Discovery Edition to address the second. This integration exploits the plug-in architecture UIMA that is built into OmniFind, and a version of the required plug-in is provided as sample code.


Downloads

Description: Sample annotator to connect OmniFind to ICM
Name: BNSCategoryAnnotatorSample.pear
Size: 3.7MB
Download method: HTTP

Information about download methods


Resources

Learn

• SchemaLogic® home page: Find more information on SchemaLogic.

• OmniFind Enterprise Edition product home page: Find more information on OmniFind Enterprise Edition.

• IBM Classification Module for OmniFind Discovery Edition home page: Find more information on Classification Module for OmniFind Discovery Edition.

• Online documentation for OmniFind products: Find information about installing, administering, and developing content integration and enterprise search and discovery solutions.

• "Semantic Search with UIMA and OmniFind" (developerWorks, December 2006): This tutorial is a good starting point for learning how to use custom text analysis and semantic search in IBM OmniFind Enterprise Edition.

• ANSI/NISO standard for thesauri:

• ANSI/NISO Z39.19-2005 Guidelines for the Construction, Format and Management of Monolingual Controlled Vocabularies: Find guidelines and conventions for the contents, display, construction, testing, maintenance, and management of monolingual controlled vocabularies.

• Resource Description Framework (RDF) Standards of the World Wide Web Consortium (W3C): An integration of a variety of applications from library catalogs and world-wide directories to syndication and aggregation of news, software, and content to personal collections of music, photos, and events, using XML as an interchange syntax.

• OWL Web Ontology Language: A W3C Recommendation for describing Web content.

• developerWorks resource page for IBM OmniFind: Find articles and tutorials and connect to other resources to expand your OmniFind skills.

• Unstructured Information Management Architecture SDK: Learn more about the Unstructured Information Management Architecture (UIMA). This Java SDK supports the implementation, composition, and deployment of applications working with unstructured information.

• developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.


• Stay current with developerWorks technical events and webcasts.

• Technology bookstore: Browse for books on these and other technical topics.

Get products and technologies

• Download UIMA SDK: The free UIMA SDK comes as a self-extracting installer for Windows and Linux or a zip file for all other platforms.

• The full BNSCategoryAnnotator.pear includes a "training mode" and is not limited to only one collection. It is available from the OmniFind EMEA Center of Excellence as part of a service engagement, which you can inquire about by e-mail.

• Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

• Participate in the discussion forum for this content.

• Check out developerWorks blogs and get involved in the developerWorks community.

About the authors

Jochen Dörre
Dr. Jochen Dörre is a Software Engineer at IBM Böblingen Laboratory with a background in text search and text mining technology. He joined IBM in 1997 and has worked on several software development projects in those fields, specializing in text categorization, text analytics integration, search over XML documents, as well as core search engine design and performance issues. Prior to joining IBM, Jochen worked in natural language processing research for several years. He received his PhD from the University of Stuttgart. Jochen is a member of the World Wide Web Consortium (W3C) XQuery Working Group, where he co-develops the extension of the XML query language XQuery with full-text search operations.

Josemina Magdalen
Josemina Magdalen is a Software Development Team Leader at Israel Software Group (ILSL). She has a background in Natural Language Processing (text classification and search, as well as text mining technologies). Josemina joined IBM in 2005 and has worked in the Content Discovery Engineering Group doing software development projects in text categorization and search, as well as text analytics. Prior to joining IBM, Josemina worked in Natural Language Processing research and development (Machine Translation, Text Classification and Search, Data Mining) for over ten years. Josemina is working on her PhD at the Hebrew University in Jerusalem.

Wendi Pohs
Wendi Pohs has designed and developed taxonomy and search applications for large organizations for the past 20 years. She has served on development teams for Lotus Development Corporation's Notes/Domino and Discovery Server products, and most recently managed Search and Taxonomy Integration for IBM's Corporate Intranet, w3. Author of a book on knowledge management practices, she specializes in advanced taxonomy applications, built with an experienced practitioner's point of view. Currently, as CTO of InfoClear Consulting, Wendi has provided taxonomy consulting services to a large government contractor, a major news provider, a leading financial institution, and an innovative public health Web site.

Bob St. Clair
Bob St. Clair is a Senior Product Manager for SchemaLogic responsible for products integrating the SchemaLogic Enterprise Suite with other enterprise systems. Since joining SchemaLogic in 2005, he has designed and built several taxonomy and metadata integration solutions with Search, Portal, and Enterprise Content Management products. Prior to joining SchemaLogic, he worked for Corbis, one of the largest stock photo companies in the world, where he designed thesaurus construction, content cataloging, and Media Asset Management systems. Bob holds a Masters of Library and Information Science degree from the University of Washington in Seattle.
