
Building An Ontology of the NHIN: Status Report 3

Brand Niemann
Co-Chair, Semantic Interoperability Community of Practice (SICoP)

Best Practices Committee (BPC), CIO Council, and Enterprise Architecture Team, Office of Environmental Information

U.S. Environmental Protection Agency
April 5, 2005

Overview

• 1. The National Health Information Network (NHIN) Request for Information (RFI):
– 1.1 Scope & Quality
– 1.2 Statistics
– 1.3 Analysis & Reporting Strategy
– 1.4 Business Cases
– 1.5 Leadership Statements
– 1.6 Related Activities
– 1.7 Building Ontologies

• 2. Results and Next Steps

• Appendices

1. NHIN RFI:1.1 Scope & Quality

• The NHIN RFI stimulated substantial and unprecedented interest.
– Cumulatively, the 512 responses yielded nearly 5,000 pages of information.

• The National Coordinator established a federal government-wide RFI review task force (RTF) to review, summarize, and analyze the RFI responses.
– The RTF consists of more than 120 Federal officials from 17 agencies.

1. NHIN RFI:1.1 Scope & Quality

• The responses to these initial questions yielded the richest and most descriptive collection of thoughts on interoperability and health information exchange that has likely ever been assembled in the United States.

• The responses to the general questions are a treasure trove of the best thinking on the topic.

1. NHIN RFI:1.2 Statistics

Type of Respondent | Count | Percent
Individual Consumers | 174 | 34%
Individual - Health Professionals | 109 | 21%
Vendors - Software, hardware, system integrators | 94 | 18%
Associations - Medical, Patient Interests, Vendors | 54 | 11%
Multistakeholder Respondents | 16 | 3%
Provider Organizations (Hospitals, clinics, labs, homecare, hospice, pharmaceutical firms, etc.) | 16 | 3%
Research Org (think tanks, non-hospital Universities, etc.) | 15 | 3%
RHIOs | 10 | 2%
Payers (HMO, PPO) | 9 | 2%
Standards Development Organizations | 7 | 1%
Federal, State, Local Government agencies | 4 | 1%
Foundations | 4 | 1%
Total | 512 | 100%
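The percentages and the individual/organization split used later in the deck can be checked with a few lines of Python. This is only a sketch: the counts are copied from the table, while the shortened category labels and the function name are illustrative choices, not part of the original analysis.

```python
# Respondent counts copied from the statistics table (labels shortened here).
RESPONDENTS = {
    "Individual Consumers": 174,
    "Individual - Health Professionals": 109,
    "Vendors": 94,
    "Associations": 54,
    "Multistakeholder Respondents": 16,
    "Provider Organizations": 16,
    "Research Org": 15,
    "RHIOs": 10,
    "Payers": 9,
    "Standards Development Organizations": 7,
    "Government agencies": 4,
    "Foundations": 4,
}


def percent_shares(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total = sum(RESPONDENTS.values())
# The two "Individual" rows form the Individuals group; the rest are Organizations.
individuals = sum(n for name, n in RESPONDENTS.items() if name.startswith("Individual"))
organizations = total - individuals
shares = percent_shares(RESPONDENTS)

print(total, individuals, organizations)
```

The two "Individual" rows add up to the Individuals group, and the remaining rows to the Organizations group, reported in section 1.3.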

1. NHIN RFI:1.3 Analysis & Reporting Strategy

• The NHIN RFI consisted of:
– Twenty-four (24) questions, in
– Six (6) basic groups

• The NHIN Team divided the RFI responses into two basic groups:
– Individuals (283)
– Organizations (229)

• The NHIN Team organized the Organization responses for review in:
– Thirty (30) sets with 2-3 reviewers for each set
– Templates (matrices) with 13 entities by about 4 categories of the 24 questions mapped to each of the three Work Groups (see next slide).

• For example: WG1 – Standards (Questions 4b, 14-18), Technical Development/Architecture (Questions 2-4a, 23), Technical Services/Operations (Questions 9-11), and General Comments by Federal Government, Industry – Software/Hardware Vendors, etc.

• NHIN Team divided the participants into three Work Groups:
– Technical and Architecture
– Organization and Business Framework
– Finance, Privacy, Regulatory, and Legal

• Each Work Group created Major Themes:
– WG1: 3, WG2: 2, and WG3: 3

• Each Work Group reported out on Sub-teams:
– WG1: 5, WG2: 5, and WG3: 4

• NHIN Team mapped the Work Group results to new structures for two reports:
– Report 1 - Sections: 7, Sub-sections: 17, and Sub-Sub-sections: 18
– Report 2 - Sections: 4, Sub-sections: 16, and Sub-Sub-sections:

• There is and will be criticism:
– “It is important to note, in the front when talking about the process, that approximately 270 RFIs were not reviewed by the interagency process. The process that ONCCHIT used to select and review these responses should be made clear.” (name withheld)

• There will be responses to criticism:
– Statistical Summary Analysis of Responses from Individuals:
• 85% of the responses expressed strong concerns about the potential loss of privacy, as did 53% of health officials.
• 17% of health officials shared their experiences with implementations of EHR systems.
• Only about 4% expressed enthusiasm for the creation of a system that would facilitate interoperability.

1. NHIN RFI:1.4 Business Cases

• Veterans Can Personalize Medical Records on VA Web Site, GCN, November 9, 2004:
– My HealtheVet (also copy parts of VistA)
– Could allow the VA to share patient data with other providers.
– Patients can request changes to their medical records and allow their loved ones or their physicians to access portions of their records.

• iHealthBeat, November 13, 2004.

• Canada Health Infoway*:
– An EHR solution is a combination of people, organizational entities, business processes, systems, technology and standards that interact and exchange clinical data. A network of interoperable EHR solutions, one that links clinics, hospitals, pharmacies, and other points of care, will help enhance quality of care and patient safety, improve Canadians' access to health services, and make the health care system more efficient.
– Interoperability for electronic health records is the capability of computer and software systems to seamlessly communicate with each other. It is central to Infoway's mission, making clinical data available across the continuum of care and across health delivery organizations and regions, promoting reusable and replicable solutions that can be aligned with jurisdictional priorities and deployed across the country more cost-efficiently. Without a common framework and sets of standards, EHR systems across Canada would be a patchwork of incompatible systems and technologies.

*Accelerating the development of Electronic Health Information Systems for Canadians: http://www.infoway-inforoute.ca/ehr/index.php?lang=en

Canada Health Infoway Standards Collaboration: http://www.infoway-inforoute.ca/ehr/standards_overview.php?lang=en#

• One recent study estimated that national implementation of fully standardized interoperability between providers and five other types of organizations could yield net savings of $77.8 billion annually, or approximately 5 percent of the projected $1.7 trillion spent on U.S. health care in 2003.
– Source: J. Walker et al., “The Value of Health Care Information Exchange and Interoperability,” Health Affairs, January 19, 2005.

1. NHIN RFI:1.5 Leadership Statements

• HHS Secretary Leavitt’s Keynote Address at AFCEA International’s Homeland Security Conference, February 22, 2005 (See http://www.fcw.com/article88110):
– The next frontier of human productivity is the Interoperability Era.
– Collaboration is the premium leadership skill that’s needed in this new era.
– Interoperability begins by setting standards and should be organically grown through the “messy, complex, difficult process called collaboration.”
– Several elements (8) will improve the chances for success: a “common pain”; a “convener of stature”; a committed leader; openness, transparency, and voluntary participation; a critical mass of stakeholders; representatives of substance; a clearly defined purpose and goal; and a formally written and signed charter.

1. NHIN RFI:1.5 Leadership Statements

• Dr. Brailer’s Keynote Address at HIMSS Conference, February 17, 2005:
– Interoperability Themes from RFIs:
• Standards (WG1 & WG2)*
• Governance (WG2)
• Privacy (WG3)
• Regionalization (Initially none, then WG2)
• Financing (WG3)
• Architecture (WG1)
• Regulation (WG3)

*Mappings to WGs added by author of this presentation.

1. NHIN RFI:1.6 Related Activities

• Federal Health Architecture (FHA) Interoperability Work Group, March 17 and 24, 2005:
– Goal: Technology Standards Harmonization
• Strive for consensus on some of the potential technical specifications (see next slide)
• Draft Health Information Interoperability Standards Profile
• Present standards to OMB as Draft Standards for Trial Use (DSTU)
• Follow-up with more detailed guidance on implementation
– Concern: Narrow focus of Work Group is on the less crucial aspect of interoperability (technical standards)

Approach for Technology Classification

Layer | Web Services | ebXML | Other
Transport | HTTP | HTTP | HTTP, SMTP, FTP
Message | SOAP | SOAP w/ attach., ebMS | SOAP
Description | WSDL | CPP/A |
Discovery | UDDI | Registry (RIM) |
Business Process | BPEL | BPSS |

Security: XML Digital Signature, WS-Security

Content formats: Other (ASCII, Binary, e.g., image); XML (XSLT, XSL, etc.); HL7 (V 3.0, V 2.x)

Source: FHA Health Interoperability Work Group, March 24, 2005.

1. NHIN RFI:1.6 Related Activities

• FHA Architectural Peer Review Group (APRG) Initial Meeting, February 11, 2005:
– Scope – Health Domains as identified by the FHA Health Domain WG and incorporated into the FHA BRM (see FEA’06 Revision Summary, page 4).
– Semantics – Recommendations were made to consider an ontology that is being developed for this purpose by the CIO Council (actually by GSA, TopQuadrant, and SICoP).
• See Slide 18 for Example.

• Healthcare Informatics Online, January 2004 Cover Story on Emerging Technologies:
– Concept introduced in 2001 Scientific American article and described using the scenario of a man who goes online, employing intelligent agents on the Semantic Web to set up a series of physician appointments and physical therapy sessions for his ailing mother. (It could be 10 years before such agent-enabled scenarios play out, but simpler semantic functions are already emerging.)
• My Note: Semantic Web Applications for National Security (SWANS), April 7-8, 2005, Crystal City, Virginia.

• Healthcare Informatics Online, January 2004 Cover Story on Emerging Technologies:
– “It’s not a Web replacement, it’s an evolution based largely on eXtensible Markup Language (XML) with added technologies that allow computers to interpret and process data “ontologies”, or relationships between disparate pieces of information.”
– “The Semantic Web would represent a worldwide Web of connected data, radically different from today’s Web of discrete documents, which is why it could be the affordable answer to the electronic health record.”
• My Note: The Semantic Web could also deal with the privacy and security concerns expressed in the RFI Individual Responses.

1. NHIN RFI:1.7 Building Ontologies

• The Mind Map Book: How to Use Radiant Thinking to Maximize Your Brain’s Untapped Potential (Tony Buzan):
– Before the web came hypertext. And before hypertext came mind maps.
– A mind map consists of a central word or concept. Around the central word you draw the 5 to 10 main ideas that relate to that word. You then take each of those child words and again draw the 5 to 10 main ideas that relate to it.
– Mind maps allow associations and links to be recorded and reinforced.
– The non-linear nature of mind maps makes it easy to link and cross-reference different elements of the map.

• See next slide for examples from the “Explorer’s Guide to the Semantic Web,” Thomas Passin, Manning Publications, 2004, pages 106 and 141.

Mind Maps for Searching and Ontologies

Searching
– ENVIRONMENT: huge, changing, growing, inconsistent
– STRATEGIES: keywords, ontologies, classification, metadata, semantic focusing, social analysis, multiple passes, clustering

Ontologies
– CLASSIFICATION: informal, formal, distinctions, multiple trees, hierarchies, taxonomies, vocabularies, ad hoc categories, internet, predefined
– ONTOLOGIES: combining, specifying, commitment, properties, relationships, constraints, identifiers
– LANGUAGES: RDFS, OWL, DAML, Description Logics

Note: These are not complete.

Mind map branches:
– general; organizational & business; management & operational; standards & policies; financial, regulatory, & legal; other
– DR. BRAILER: standards, governance, privacy, regionalization, financing, architecture, regulation
– WORK GROUPS: technical & architecture; organization & business; financial, regulatory, & legal
– STRATEGIC PLAN GOALS: Inform Clinical Practice, Interconnect Clinicians, Personalize Care, Improve Population Health
– ORGANIZATIONAL STRUCTURE: regional initiatives, clinical practice, population health, health interoperability, Federal Health Architecture
– FRAMEWORKS: organizational, technical, semantic
– STANDARDS ORGANIZATIONS: NCVHS, CCHIT, etc.
– OTHER: other

Possible/probable interrelationships

• An ontology organizes things into types and categories with a well-defined structure, forming “networks of concepts”.

• Specific ontologies must be constructed with known vocabularies and rules of construction.

• A good ontology requires:
– The ability to conceptualize and articulate the underlying ideas.
– Skill at modeling abstractions.
– Knowledge of the syntax of the modeling language.
• OWL is poised to become the major ontology language for the Web.
– Use of well-developed and accepted ontologies whenever possible.
• The Suggested Upper Merged Ontology (SUMO) is a best practice example.
– A Community of Practice with all of these skills that can collaborate to develop the ontology.
• The Ontolog Forum is a best practice example (see next slide).

• A key aspect of successful large scale interoperability is shared meaning.
– Shared meaning requires not only a common syntax (XML), but a common vocabulary.
– That common vocabulary should be defined in terms of the broadest and most general foundation concepts and be in a formal and computable language not subject to human interpretation in English alone.
– Formal ontologies, defined in logic, and a hierarchy of ontologies that build from a common semantic foundation are needed (see next slide).

Current Ontology-Driven Information System for FHA/NHIN

Upper Ontology: Most General Thing; Process; Location
Mid-Level Ontology: Geographic Area of Interest
Domain Ontology: Airspace; Target Area of Interest

Source: Netcentric Semantic Linking (Mapping): An Approach for Enterprise Semantic Interoperability, Mary Pulvermacher, et al., MITRE, October 2004.

Examples: HL7 RIM, FEA-RMO, EON, SNOMED CT, LOINC
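One way to read the layered figure is as a chain of "is a kind of" links running from domain concepts up to the most general one. The sketch below encodes that reading in Python. The parent links are an assumption inferred from the figure's layout, and the function names are hypothetical.

```python
# Assumed reading of the figure: each concept points to its more general parent.
PARENT = {
    "Process": "Most General Thing",
    "Location": "Most General Thing",
    "Geographic Area of Interest": "Location",
    "Airspace": "Geographic Area of Interest",
    "Target Area of Interest": "Geographic Area of Interest",
}


def ancestors(concept, parent=PARENT):
    """List the increasingly general concepts above `concept`."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain


def shared_ancestor(a, b, parent=PARENT):
    """Return the most specific concept that both `a` and `b` fall under."""
    seen = [a] + ancestors(a, parent)
    for candidate in [b] + ancestors(b, parent):
        if candidate in seen:
            return candidate
    return None


print(ancestors("Airspace"))
print(shared_ancestor("Airspace", "Target Area of Interest"))
```

Two domain ontologies that share a mid-level concept can be linked through it, which is the point of building from a common semantic foundation.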

• Strategy for the NHIN Ontology:
– Compile repository/library of NHIN public and RFI documents in their native file formats.
– Repurpose the documents:
• Proprietary to text formats.
• Proprietary to XML documents.
• Chunk large documents into sub-documents.
– Compile the NHIN “Mind Maps” for defining searches and building the ontology.
– Work with ontology communities of practice to draw in their expertise.
• Proposed new Ontology and Taxonomy Coordinating Group (ONTACG) of SICoP.
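The "chunk large documents into sub-documents" step can be sketched as follows. This is not the procedure used in the pilot, which relied on the vendor tools. The function name, the 200-word default, and the paragraph-boundary rule are all illustrative assumptions.

```python
def chunk_document(text, max_words=200):
    """Split text into sub-documents of at most `max_words` words,
    breaking on blank-line paragraph boundaries where possible."""
    chunks, current, count = [], [], 0
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Close the current chunk if this paragraph would overflow it.
        if current and count + len(words) > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single oversized paragraph is cut into fixed-size word windows.
        while len(words) > max_words:
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i j k l"
print(chunk_document(sample, max_words=4))
```

Keeping paragraphs intact where possible means each sub-document stays readable on its own when it is later indexed or categorized.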

2. Results and Next Steps

• 2.1 The Challenge

• 2.2 A Suggested Solution

• 2.3 The Content

• 2.4 The Pilot

• 2.5 Sample Results

• 2.6 Next Steps

2.1 The Challenge

• Extract and organize the semantic concepts from about 5000 pages of semi-structured content in support of a comprehensive analysis to recommend the plan for the National Health Information Network (NHIN).
– For example: Dr. Brailer, ONCHIT Technical Assistance Call December 6, 2004, “NHIN refers to a specific bundle of technologies, business frameworks, financing arrangements, legal contracting or other mechanisms, policy requirements, organizational issues and related things that allow for network interoperability. So NHIN is the middleware in the grand schema of these pieces.”

2.2 A Suggested Solution

• Besides manual extraction by individuals and in the Work Group environment, there are machine-aided extraction, analysis, and visualization tools that could and should be brought to bear on this problem and that would lead to the building of an ontology.

• This approach was taken with the Federal Enterprise Architecture Reference Models to produce an ontology that has been released.– http://web-services.gov/fea-rmo.html

2.3 The Content

Content Category | (1) | (2) | Analysts (3) | Comments
Background (pre RFI) | Done | Done | Done | Contains structured relationships
Organizations | About 50% | Done | Done | Complex concepts and relationships
Individuals | Done | Done | Done | Only about 5 simple categories!
Workgroups | Done | Done | Done | Needs simplification

(1) Indexing, categorization, and relationship linking.
(2) Indexing, keyword/concept extraction, and taxonomy.
(3) Same as (2).
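As a rough illustration of what "keyword/concept extraction" means at its simplest, the sketch below counts frequent non-trivial terms in a passage. It is a plain term-frequency stand-in, not the method used by any of the pilot tools. The stop-word list and function name are assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stop-word list; real tools use far larger ones.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "for", "that", "is",
             "on", "with", "or", "as", "so", "these"}


def top_terms(text, n=5):
    """Return the n most frequent terms, ignoring stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(n)


quote = ("NHIN refers to a specific bundle of technologies, business frameworks, "
         "financing arrangements, legal contracting or other mechanisms, policy "
         "requirements, organizational issues and related things that allow for "
         "network interoperability.")
print(top_terms(quote, n=5))
```

Candidate terms found this way can seed the mind maps and taxonomy, with an analyst deciding which ones represent real concepts.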

2.4 The Pilot

• A Recommended Start to the NHIN Ontology:
– The European Interoperability Framework:
• Organisational
• Technical, &
• Semantic
– Leavitt's view of interoperability: it should be organically grown through the “messy, complex, difficult process called collaboration.”
• http://www.fcw.com/article88110

2.4 The Pilot

• Tools:
– Selection Criteria:
• Selected for participation in the SWANS Conference, April 7-8, 2005, because of support for Semantic Technologies (RDF/OWL).
• Willing to provide hardware, software, and advice for proof of concept.
• Two or more vendors initially – more after SWANS Conference
– Selection:
• NextPage FolioViews and LivePublish (recently acquired by FAST Search & Transfer)
• FAST Data Search and ProPublish
– http://www.fastsearch.com
• Content Analyst
– http://www.contentanalyst.com

2.4 The Pilot

• Ontology Expertise:
– Ontolog Forum:
• Submitted Response to the RFI
– Available on the Internet
• Providing Ontology Engineering Advice
• Suggests Brainstorming Session
– Proposed New SICoP Ontology and Taxonomy Coordinating Work Group (ONTACG)

2.5 Sample Results

• http://web-services.gov, See Best Practices
• Folio Views Infobase of RFIs
• Content Analyst: Compute Taxonomy
• Content Analyst: Run Queries
• Content Analyst: Set Training Documents
• FAST ProPublish: Production Manager
• FAST ProPublish: Build Progress
• FAST Data Search: Search View
• FAST Data Search: Taxonomy Results Saved in Excel Spreadsheet

2.6 Next Steps

• NHIN Suggest a Series of Queries:
– Results can be provided in Excel spreadsheets for further analysis and reuse

• Add content from those agencies interviewed by the FHA Interoperability Work Group recently:
– VA, DoD, EPA, CDC, FDA, NIH-NCI/DHS/HIS

• See future demonstrations with the initial public domain databases for semantic searching and ontology building (see next slide):
– SWANS Conference, April 7-8, 2005
– SICoP Meeting at KM Conference, April 22, 2005

2.6 Next Steps

Content Source | Content Type | Pilot Example
Web Site | Topics | Children’s Health, Mercury, Etc.
Web Site | Registries | System of Registries
Exchange Network | Nodes | Pacific Water Quality
E-Gov | E-Rulemaking | Samples
Data Mart | TBD | TBD
Indicators | Reports on the Indicators | EPA, Heinz, Etc.
GIS | Maps | Region 4 GeoBook
GIS | Metadata | Clearinghouse

Initial Public Domain Databases for Semantic Searching and Ontology Building

Appendices

• A. Ontology Engineering

• B. FAST Data Search and ProPublish

• C. Content Analyst

Appendix A: Ontology Engineering

• A.1 What Is An Ontology?
• A.2 Basic Requirements For an Ontology
• A.3 Ontology Examples
• A.4 Formal Taxonomies for the U.S. Government
• A.5 Medical Informatics Ontologies: Examples and Design Decisions
• A.6 GLIF in Protégé
• A.7 Why Develop an Ontology?
• A.8 Ontology-Development Process
• A.9 What Is “Ontology Engineering”?
• A.10 Ontology-Driven Information Systems

A.1 What Is An Ontology?

• An ontology is an explicit description of a domain:
– concepts
– properties and attributes of concepts
– constraints on properties and attributes
– individuals (often, but not always)

• An ontology defines:
– a common vocabulary
– a shared understanding

A.2 Basic Requirements For an Ontology

• 1. Finite controlled (extensible) vocabulary.

• 2. Unambiguous interpretation of classes and term relationships.

• 3. Strict hierarchical subclass relationships between classes.

• 4. Few others…

Source: Deborah McGuinness, Ontologies Come of Age, in the Semantic Web: Why, What, and How, MIT Press, 2002, page 6.

A.3 Ontology Examples

• Taxonomies on the Web

– Yahoo! categories

• Catalogs for on-line shopping

– Amazon.com product catalog

• Domain-specific standard terminology

– SNOMED Clinical Terms – terminology for clinical medicine

– UNSPSC - terminology for products and services

A.4 Formal Taxonomies for the U.S. Government

OWL Listing:

<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
    xmlns="http://www.owl-ontologies.com/unnamed.owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xml:base="http://www.owl-ontologies.com/unnamed.owl">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Transportation"/>
  <owl:Class rdf:ID="AirVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#GroundVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#Automobile">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="GroundVehicle"/>
    </rdfs:subClassOf>
  Etc.

Source: Formal Taxonomies for the U.S. Government, Michael Daconta, Metadata Program Manager, US Department of Homeland Security, XML.Com, http://www.xml.com/pub/a/2005/01/26/formtax.html

Transportation Class Hierarchy
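To show how a class hierarchy can be recovered from an OWL listing like the one above, here is a minimal Python sketch using only the standard library's XML parser. The embedded fragment is an assumption: it repeats the classes visible in the listing, closes the Automobile class, and omits everything the listing elides with "Etc." The helper names are hypothetical, and a production system would more likely use a dedicated RDF/OWL toolkit.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Well-formed fragment based on the listing; the truncated remainder is left out.
OWL_FRAGMENT = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Transportation"/>
  <owl:Class rdf:ID="AirVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#GroundVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#Automobile">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="GroundVehicle"/>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>
"""


def class_name(elem):
    """Read a class name from rdf:ID, or from rdf:about with the '#' removed."""
    ident = elem.get(f"{{{RDF}}}ID")
    if ident is not None:
        return ident
    about = elem.get(f"{{{RDF}}}about")
    return about.lstrip("#") if about else None


def subclass_pairs(xml_text):
    """Return (subclass, superclass) pairs declared by top-level owl:Class elements."""
    root = ET.fromstring(xml_text)
    pairs = []
    for cls in root.findall(f"{{{OWL}}}Class"):
        child = class_name(cls)
        for sub in cls.findall(f"{{{RDFS}}}subClassOf"):
            resource = sub.get(f"{{{RDF}}}resource")
            if resource is not None:
                parent = resource.lstrip("#")
            else:
                # The superclass may be given as a nested owl:Class element.
                nested = sub.find(f"{{{OWL}}}Class")
                parent = class_name(nested) if nested is not None else None
            if child and parent:
                pairs.append((child, parent))
    return pairs


for child, parent in subclass_pairs(OWL_FRAGMENT):
    print(f"{child} is a subclass of {parent}")
```

The sketch handles both ways the listing names a superclass: by `rdf:resource` reference and by a nested `owl:Class` element.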

A.5 Medical Informatics Ontologies: Examples and Design Decisions

• Foundational Model of Anatomy (FMA):
– Developed at University of Washington as part of the Digital Anatomist project.
– Contains: ~70,000 distinct concepts, ~110,000 terms, and 140 relations

• Gene Ontology (GO):
– A controlled vocabulary for describing genes and gene products with three organizing components: Molecular function, Biological process, and Cellular component.

• Health Level 7 (HL7) Data Types and Top-Level RIM Classes:
– HL7 data types as Protégé classes

• Guideline Interchange Format (GLIF) (See next slide):
– A format for sharing clinical guidelines independent of platforms and systems:
• Designed to support multiple vocabularies and medical knowledge bases.
• Designed to work with different patient information models.

A.6 GLIF in Protégé

A.7 Why Develop an Ontology?

• To share common understanding of the structure of information
– among people
– among software agents

• To enable reuse of domain knowledge
– to avoid “re-inventing the wheel”
– to introduce standards to allow interoperability

A.8 Ontology-Development Process

• In this tutorial: determine scope → consider reuse → enumerate terms → define classes → define properties → define constraints → create instances

• In reality, an iterative process: determine scope → consider reuse → enumerate terms → define classes → consider reuse → enumerate terms → define classes → define properties → create instances → define classes → define properties → define constraints → create instances → define classes → consider reuse → define properties → define constraints → create instances

A.9 What Is “Ontology Engineering”?

• Ontology Engineering: Defining terms in the domain and relations among them
– Defining concepts in the domain (classes)
– Arranging the concepts in a hierarchy (subclass-superclass hierarchy)
– Defining which attributes and properties (slots) classes can have and constraints on their values
– Defining individuals and filling in slot values
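The four activities listed above (classes, hierarchy, slots with constraints, individuals with slot values) can be mirrored in a few lines of Python. This is a toy sketch, not how Protégé or OWL represent ontologies. The class and method names are hypothetical, and the only constraint modeled is a value type per slot. The example classes are borrowed from the transportation taxonomy in A.4.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntClass:
    """A concept, its superclass, and the slots it declares (slot name -> value type)."""
    name: str
    parent: Optional["OntClass"] = None
    slots: dict = field(default_factory=dict)

    def all_slots(self):
        """Slots declared here plus those inherited from superclasses."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}

    def is_a(self, other):
        """True if this class is `other` or one of its subclasses."""
        cls = self
        while cls is not None:
            if cls is other:
                return True
            cls = cls.parent
        return False


@dataclass
class Individual:
    """An instance of a class with filled-in slot values."""
    name: str
    cls: OntClass
    values: dict = field(default_factory=dict)

    def set_value(self, slot, value):
        allowed = self.cls.all_slots()
        if slot not in allowed:
            raise KeyError(f"{self.cls.name} has no slot {slot!r}")
        if not isinstance(value, allowed[slot]):
            raise TypeError(f"slot {slot!r} expects {allowed[slot].__name__}")
        self.values[slot] = value


# Classes and hierarchy, then slots with type constraints, then an individual.
transportation = OntClass("Transportation", slots={"operator": str})
ground_vehicle = OntClass("GroundVehicle", parent=transportation, slots={"wheels": int})
automobile = OntClass("Automobile", parent=ground_vehicle)

car = Individual("car1", automobile)
car.set_value("wheels", 4)
print(automobile.is_a(transportation), sorted(automobile.all_slots()), car.values)
```

Slots declared on a superclass are inherited by its subclasses, which is why the subclass-superclass hierarchy is defined before the slots.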

A.10 Ontology-Driven Information Systems

• Methodology Side – the adoption of a highly interdisciplinary approach:
– Analyze the structure at a high level of generality.
– Formulate a clear and rigorous vocabulary.

• Architectural Side – the central role in the main components of an information system:
– Information resources.
– User interfaces.
– Application programs.

See for example: Nicola Guarino, Formal Ontology and Information Systems, Proceedings of FOIS ’98, Trento, Italy, 6-8 June 1998.

Appendix B: FAST Data Search

• B.1 Gartner Magic Quadrant for Enterprise Search, 2004

• B.2 FAST Data Search:
– Categorization and Taxonomy Support
– Integration

• B.3 FAST ProPublish System Overview:
– Gather Content
– Process Content
– Deliver Content

B.1 Gartner Magic Quadrant for Enterprise Search, 2004

Source: Gartner Research ID Number: M-22-7894, Whit Andrews, 17 May 2004.

B.1 Gartner Analysis: Leaders

• Fast Search & Transfer (FAST) now is counted in the Leaders quadrant, moving from the Visionaries quadrant. The vendor has experienced explosive growth, providing better-than-average means and an expanding list of approaches of determining relevancy. Its architecture is superior among search vendors, and sales are strong. (Sales of enterprise search technology were $42 million in 2003, up from $36 million in 2002.) Its acquisition of the remainder of AltaVista's business has had no real impact on operations.

• Critical questions include whether FAST will:
– 1) remain a specialist in search technologies;
– 2) pursue "search-derivative applications" (FAST's term for the general application category founded on search platforms, including customer relationship management (CRM) knowledge base support tools and scientific research managers); or
– 3) focus on original equipment manufacturer arrangements or on a broader suite of applications, such as those included in a smart enterprise suite. Search vendors typically follow an arc that leads to their acquiring a company, to failure or to a position as an enduring leader. FAST has the opportunity to pursue the last path.

• Note added by Brand Niemann: In December 2004 FAST acquired NextPage, which provides electronic publishing software to 6 of the 9 leading electronic publishers in the world. I have used NextPage in the pilots to date.

B.2 FAST Data Search: Categorization and Taxonomy Support

B.2 FAST Data Search: Integration

B.3 FAST ProPublish System Overview

Gather Content → Process Content → Deliver Content

• Searches in the online FAST ProPublish system are powered by FAST's proven search technology. Search results are displayed on a results list, and additional navigation interfaces such as key words, dynamic drill-down lists, metadata structures, and hierarchy are also provided. When documents are retrieved, they are pulled from the content repository. Search hits are highlighted in HTML and XML documents.

• FAST ProPublish is designed to be a distributed application. Nearly every component may be run on a separate machine (or multiple machines) for extreme scalability and reliability. However, this same flexibility also allows all of the components to be run on a single server.

• FAST ProPublish provides the following services:
– Search and query.
– Data and text mining and analysis.
– Exploration and static reporting.

• Gather Content:
– The Production Manager is the tool you use to create a collection. Also, through the Production Manager graphical user interface, you can establish a library. A library consists of a collection or group of related collections and enables you to structure content. That is, you can define a library hierarchically with folders, sub-folders, and collection nodes the way you want the content to appear on your site.
– Production Manager has the functionality and capability to build libraries from existing collections, or from collections that you define and build within the Production Manager interface from various sources of content.

• Process Content:
– A collection is, as the name implies, a collection of content/documents and is fully indexed, structured, and searchable. Documents within a collection reside in their native formats. Collections house three "chunks" of information:
• The table of contents (TOC)
• An index of the content
• A copy of the content
– Because collections contain this information, they are self-contained and portable.

• Process Content:
– Each node in the content tree is a library, folder, sub-folder, or collection.
– Folder nodes can contain other content nodes (such as sub-folders and collections).
– You can organize these nodes (folder and collection) within this pane according to your content and business needs to create a hierarchy of content for the library.

Process Content: Content Tab Icons and Descriptions

Name | Description
Library | The library node contains all folder and content collection nodes for a given library.
Folder and Sub-folder | Folder and sub-folder nodes enable you to create structure within the library and help you organize content.
Collection | Collection nodes represent collections that Production Manager builds and updates.
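The library / folder / collection hierarchy described above is an ordinary tree, which the following Python sketch models. It is a data-structure illustration only and does not reflect the actual FAST ProPublish API. The class, its rules, and the example node names (taken from the content categories in section 2.3) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a content tree: a library (root), a folder, or a collection (leaf)."""
    name: str
    kind: str  # "library", "folder", or "collection"
    children: list = field(default_factory=list)

    def add(self, child):
        if self.kind == "collection":
            raise ValueError("collection nodes are leaves in this sketch")
        if child.kind == "library":
            raise ValueError("a library can only be the root")
        self.children.append(child)
        return child

    def collections(self, path=()):
        """Yield the slash-separated path of every collection under this node."""
        here = path + (self.name,)
        if self.kind == "collection":
            yield "/".join(here)
        for child in self.children:
            yield from child.collections(here)


library = Node("NHIN", "library")
rfi = library.add(Node("RFI Responses", "folder"))
rfi.add(Node("Organizations", "collection"))
rfi.add(Node("Individuals", "collection"))
background = library.add(Node("Background", "folder"))
background.add(Node("Pre-RFI", "collection"))

print(list(library.collections()))
```

Because folders can contain sub-folders as well as collections, the same structure supports any depth of hierarchy.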

• Deliver Content:
– The user interface is composed of individual components built using Velocity templates and the Struts framework. Some of the components are:
• Search components – search forms (simple, advanced, and custom), search results page (configurable), parametric search.
• Navigation components – hierarchical table of contents, browse-by-category, dynamic drill down for search refinement, breadcrumb trails.
• Document display components – document retrieval, search hit highlighting, next / previous document, next / previous hit document.

B.3 FAST ProPublish System Overview: Deliver Content: Default User Interface

B.3 FAST ProPublish System Overview: Deliver Content: Advanced Search Page

Appendix C: Content Analyst

• C.1 Definitions
• C.2 Conceptual Mapping
• C.3 Document Proximity Conceptual Similarity
• C.4 Term Proximity Conceptual Similarity
• C.5 No Auxiliary Structures Required
• C.6 Retrieval Using Conceptual Comparison
• C.7 Terminology Variant Clustering
• C.8 Conceptual Generalization
• C.9 Deep Conceptual Generalization
• C.10 Cross-lingual Operations
• C.11 Cross-lingual Capabilities
• C.12 Automated Information Organization
• C.13 Category Creation by Example
• C.14 Automatic Categorization
• C.15 Categorizing Items of Interest
• C.16 Automated Taxonomy Generation
• C.17 Instant Context Display
• C.18 Alias Identification
• C.19 Automated Thematic Decomposition
• C.20 Conceptual Interlingua
• C.21 Product Status
• C.22 Performance
• C.23 For More Information

C.1 Definitions

• Content Analyst:
– …is a Machine Learning Technique…
– …that allows Conceptual Comparison of Text Objects…
– …based on the Technique of Latent Semantic Indexing.

• Latent Semantic Indexing is a patented machine learning technique that enables technology to identify, represent, and compare concepts that exist within a collection of documents or data.

C.2 Conceptual Mapping

Figure: documents mapped to concepts (Biological Weapons, Transportation, Agriculture).

C.3 Document Proximity Conceptual Similarity

Figure: documents mentioning “missile”, “fuel”, “rocket”, and “propellant” in the Content Analyst representation space.

C.4 Term Proximity Conceptual Similarity

Figure: the terms “Car” and “Automobile” in the Content Analyst representation space.

C.5 No Auxiliary Structures Required

Figure labels: Taxonomies, Grammars, Thesauri, Ontologies.

C.6 Retrieval Using Conceptual Comparison

Figure: a query is compared with documents. Proximity corresponds to conceptual similarity, giving a natural ranking of documents in relevance order.
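The idea of ranking documents by their proximity to a query can be sketched with plain cosine similarity over word counts. This is a deliberately simplified stand-in, not Latent Semantic Indexing: LSI also reduces the term space so that related terms (for example "car" and "automobile") end up close together, which raw word counts cannot do. The function names and sample documents are illustrative assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lower-cased word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return documents ordered from most to least similar to the query."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(d)), d) for d in documents]
    return [d for score, d in sorted(scored, key=lambda pair: pair[0], reverse=True)]


docs = [
    "agriculture and transportation",
    "missile fuel",
    "rocket fuel and propellant",
]
print(rank("rocket propellant", docs))
```

With raw word counts, "missile fuel" scores zero against the query "rocket propellant" even though the topics are related. Closing that gap is the motivation for a concept-based representation space.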

C.7 Terminology Variant Clustering

Figure: spelling variants clustered together:
• Osama bin Laden
• Usama bin Laden
• Osama Binladen
• Osama BinLadin
• Usama Binladen
• Osama bin Ladin
• Usama bin Ladin
• Usama Binladin
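Grouping spelling variants of a name can be illustrated with simple string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in. Content Analyst clusters variants through its concept space rather than by character matching, so this shows the goal, not the product's method. The 0.8 threshold and the greedy first-match grouping are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case the name and drop spaces and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_variants(names, threshold=0.8):
    """Greedy clustering: put each name into the first cluster whose
    first member is similar enough, otherwise start a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


names = [
    "Osama bin Laden", "Usama bin Laden", "Osama Binladen", "Osama BinLadin",
    "Usama Binladen", "Osama bin Ladin", "Usama bin Ladin", "Usama Binladin",
    "Ahmed Ressam",
]
for cluster in cluster_variants(names):
    print(cluster)
```

Character matching only catches variants that look alike. Aliases with unrelated spellings, the subject of C.18, need context-based comparison instead.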

C.8 Conceptual Generalization

Figure: the user’s terminology and the author’s terminology (“…devices that spread shrapnel…”) shown in the CA space.

C.9 Deep Conceptual Generalization

Figure: a passage reading “Methods of armed struggle not accepted internationally” is associated with the concept War Crimes.

C.10 Cross-lingual Operations

Figure labels: English query; documents in multiple languages (Farsi, Arabic, English); results; retrieved documents in correct relevance order.

C.11 Cross-lingual Capabilities

• Languages listed: Arabic, Chinese, English, Farsi, French, Korean, Russian, Spanish, Pashtu, Urdu, Italian, German, Portuguese, Dutch, Japanese
• Grouped on the slide as Current, Near-term, and Future.

C.12 Automated Information Organization

• Sorting into Predetermined Categories

• Determining the Natural Topical Breakdown of Information

C.13 Category Creation by Example

…anthrax… …smallpox…

Documents like this Correspond to the Category Bioterrorism…

CA Representation Space

C.14 Automatic Categorization

Newly Acquired Document

Document will be Assigned to this Category

Exemplar Document

CA Space
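The exemplar-based categorization pictured on these slides can be sketched as a nearest-exemplar lookup: a new document is assigned to the category whose exemplar it most resembles. This is a simplified stand-in, not the product's method: it compares raw word counts rather than positions in the Content Analyst representation space, and the exemplar texts, function names, and the rule of returning no category when nothing overlaps are all assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical exemplar texts, one per category; the category names follow the slides.
EXEMPLARS = {
    "Biological Weapons": "anthrax smallpox pathogen outbreak",
    "Transportation": "highway rail freight truck shipping",
    "Agriculture": "wheat crop harvest soil farm",
}

def bag_of_words(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize(document, exemplars):
    """Return (best_category, scores); best_category is None if nothing matches."""
    doc = bag_of_words(document)
    scores = {cat: cosine(doc, bag_of_words(text)) for cat, text in exemplars.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, scores
    return best, scores

category, scores = categorize("rail freight delays on the highway", EXEMPLARS)
print(category)  # Transportation
```

A version closer to the slides would project both the exemplars and the new document into the latent space first, so that a document can match an exemplar even when the two share no literal words.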

Sept. Report

Newly Acquired Document

Precursors

Hamas

Hamas Exemplar Document

C.15 Categorizing Items of Interest

New Content

Taxonomy

C.16 Automated Taxonomy Generation

Last February Qatada and seven other men, said to be members of the GSPC's British cell, were arrested in London after the discovery of plans to bomb or use GB against an unspecified target in Strasbourg. Charges against Qatada were not pursued. During the investigation, codenamed Operation Odin, Special Branch officers raided Qatada's home in Acton, west London.

sarin

organophosphorous

poisonous

vapors

cholinesterase

resorptive

bezhenar

C.17 Instant Context Display

ressam

ressam’s

ahmed

benni

charkaoui

zubeir

abdelrazik

zoubeida

Five men, three of whom identified themselves as Algerian, were arrested Thursday by federal officials wanting to question them about their possible links to Ahmed Ressam, an Algerian arrested in Washington state on explosive smuggling charges.

C.18 Alias Identification

C.19 Automated Thematic Decomposition

The hardware, software, and bandwidth currently installed are adequate to support this level of downloading activity. Three people currently are engaged in developing a comprehensive list of URLs to be monitored. This is a labor-intensive task, as existing Internet indexes of online newspapers are very incomplete. Final decisions have not yet been made as to the eventual level of caching that will be done, or the total number of users to be supported. One of the most important aspects of the existing implementation is a web crawler that we have developed and refined over the past five years that is optimized for this application. This crawler can deal with the many idiosyncrasies of this type of download activity: primitive communications in some countries, bizarre naming conventions, inconsistent and partial postings, and frequent changes in web page structure. The current implementation of this crawler reflects five years of lessons learned in carrying out newspaper downloads from the Internet. One of the functions to be carried out with the downloaded data is entity and relationship extraction. In support of this effort, SAIC personnel have conducted a comparison of current entity and relationship software packages. The test involved processing of actual downloaded material. Of the half dozen packages tested, the product from Attensity was, by far, the most complete and accurate. This package is being procured for use in the download processing. It should be noted that even the best of the entity and relationship packages still miss many entities and relationships of interest and still generate an undesirably high number of false relations. We have a current task to examine the ways in which Content Analyst and Attensity can be used together to provide significantly improved overall entity and relationship extraction capabilities. 
Although not addressed in the RFI, one topic that we have paid considerable attention to is processing of images of newspapers using optical character recognition (OCR). At present, approximately 13% of all foreign newspapers posted to the web consist of images of pages, as opposed to character-encoded representations. This includes some important newspapers; for example, most of the Urdu material on the web is only available as images. In order to automatically filter these articles, and to make them available for retrieval, an OCR process must be carried out. At various times over the past five years we have implemented such capabilities for Arabic, Chinese, Farsi, and Russian materials. OCR of newspaper articles is a challenging but not impossible task. The biggest problem is caused by the low resolution of images posted to the web

Topic #1

Topic #2

Topic #3

Arbitrary Documents

Biological Weapons

Transportation

Agriculture

C.20 Conceptual Interlingua

C.21 Product Status

• 6 Years Development

• 3 Years Operational Experience

• 24X7 Operations

• Multi-million Document Databases

• Conforms to Modern Standards:
– J2EE
– Unicode
– XML

C.22 Performance

• Can Fully Index > 1M Documents in 14 Hours on a Single PC

• Can Categorize > 1 Million Documents per Day on a Single PC

• Can Distribute Index Creation and Retrieval Operations across Multiple PCs
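The throughput figures above convert into per-second rates with simple arithmetic. The sketch below treats the ">" figures as lower bounds and assumes a full 24-hour day for the categorization figure; the function and variable names are illustrative only.

```python
SECONDS_PER_HOUR = 3600

def docs_per_second(documents, hours):
    """Average throughput for a batch processed over the given number of hours."""
    return documents / (hours * SECONDS_PER_HOUR)

index_rate = docs_per_second(1_000_000, 14)       # indexing: > 1M documents in 14 hours
categorize_rate = docs_per_second(1_000_000, 24)  # categorization: > 1M documents per day

print(f"indexing: at least {index_rate:.1f} documents/second")           # 19.8
print(f"categorizing: at least {categorize_rate:.1f} documents/second")  # 11.6
```

These are single-PC averages; the slide's third bullet indicates that indexing and retrieval can also be spread across multiple PCs, which this calculation does not model.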

C.23 For More Information

• Roger Bradford, 703-391-8700 x110, rbradford@contentanalyst.com
