1
Building An Ontology of the NHIN: Status Report 3
Brand Niemann
Co-Chair, Semantic Interoperability Community of Practice (SICoP)
Best Practices Committee (BPC), CIO Council, and
Enterprise Architecture Team, Office of Environmental Information
U.S. Environmental Protection Agency
April 5, 2005
2
Overview
• 1. The National Health Information Network (NHIN) Request for Information (RFI):
  – 1.1 Scope & Quality
  – 1.2 Statistics
  – 1.3 Analysis & Reporting Strategy
  – 1.4 Business Cases
  – 1.5 Leadership Statements
  – 1.6 Related Activities
  – 1.7 Building Ontologies
• 2. Results and Next Steps
• Appendices
3
1. NHIN RFI: 1.1 Scope & Quality
• The NHIN RFI stimulated substantial and unprecedented interest.
  – Cumulatively, the 512 responses yielded nearly 5,000 pages of information.
• The National Coordinator established a federal government-wide RFI review task force (RTF) to review, summarize, and analyze the RFI responses.
  – The RTF consists of more than 120 Federal officials from 17 agencies.
4
1. NHIN RFI: 1.1 Scope & Quality
• The responses to these initial questions yielded the richest and most descriptive collection of thoughts on interoperability and health information exchange that has likely ever been assembled in the United States.
• The responses to the general questions are a treasure trove of the best thinking on the topic.
5
1. NHIN RFI: 1.2 Statistics
Type of Respondent Count Percent
Individual Consumers 174 34%
Individual - Health Professionals 109 21%
Vendors - Software, hardware, system integrators 94 18%
Associations - Medical, Patient Interests, Vendors 54 11%
Multistakeholder Respondents 16 3%
Provider Organizations (Hospitals, clinics, labs, homecare, hospice, pharmaceutical firms, etc.) 16 3%
Research Org (think tanks, non-hospital Universities, etc.) 15 3%
RHIOs 10 2%
Payers (HMO, PPO) 9 2%
Standards Development Organizations 7 1%
Federal, State, Local Government agencies 4 1%
Foundations 4 1%
Total 512 100%
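The percentages in the table can be reproduced from the counts. The following is a minimal sketch, assuming whole-number rounding; the shortened category names and the function name `percent_shares` are illustrative and not part of the original analysis.

```python
# Respondent counts copied from the table above (category names shortened).
counts = {
    "Individual Consumers": 174,
    "Individual - Health Professionals": 109,
    "Vendors": 94,
    "Associations": 54,
    "Multistakeholder Respondents": 16,
    "Provider Organizations": 16,
    "Research Organizations": 15,
    "RHIOs": 10,
    "Payers": 9,
    "Standards Development Organizations": 7,
    "Government agencies": 4,
    "Foundations": 4,
}


def percent_shares(counts):
    """Return each category's share of the total as a whole-number percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


total = sum(counts.values())
shares = percent_shares(counts)

# The two individual categories form the "Individuals" group used later in
# the analysis; everything else is treated as an organization.
individuals = counts["Individual Consumers"] + counts["Individual - Health Professionals"]
organizations = total - individuals

for name, share in shares.items():
    print(f"{name}: {counts[name]} ({share}%)")
```

The same split into individuals and organizations is what the analysis and reporting strategy relies on.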
6
1. NHIN RFI: 1.3 Analysis & Reporting Strategy
• The NHIN RFI consisted of:
  – Twenty-four (24) questions, in
  – Six (6) basic groups
• The NHIN Team divided the RFI responses into two basic groups:
  – Individuals (283)
  – Organizations (229)
• The NHIN Team organized the Organization responses for review in:
  – Thirty (30) sets with 2-3 reviewers for each set
  – Templates (matrices) with 13 entities by about 4 categories of the 24 questions mapped to each of the three Work Groups (see next slide).
• For example: WG1 – Standards (Questions 4b, 14-18), Technical Development/Architecture (Questions 2-4a, 23), Technical Services/Operations (Questions 9-11), and General Comments by Federal Government, Industry – Software/Hardware Vendors, etc.
7
1. NHIN RFI: 1.3 Analysis & Reporting Strategy
• NHIN Team divided the participants into three Work Groups:
  – Technical and Architecture
  – Organization and Business Framework
  – Finance, Privacy, Regulatory, and Legal
• Each Work Group created Major Themes:
  – WG1: 3, WG2: 2, and WG3: 3
• Each Work Group reported out on Sub-teams:
  – WG1: 5, WG2: 5, and WG3: 4
• NHIN Team mapped the Work Group results to new structures for two reports:
  – Report 1 - Sections: 7, Sub-sections: 17, and Sub-Sub-sections: 18
  – Report 2 - Sections: 4, Sub-sections: 16, and Sub-Sub-sections: 86
8
1. NHIN RFI: 1.3 Analysis & Reporting Strategy
• There is and will be criticism:
  – “It is important to note, in the front when talking about the process, that approximately 270 RFIs were not reviewed by the interagency process. The process that ONCCHIT used to select and review these responses should be made clear.” (name withheld)
• There will be responses to criticism:
  – Statistical Summary Analysis of Responses from Individuals:
    • 85% of the responses expressed strong concerns about the potential loss of privacy, as did 53% of health officials.
    • 17% of health officials shared their experiences with implementations of EHR systems.
    • Only about 4% expressed enthusiasm for the creation of a system that would facilitate interoperability.
9
1. NHIN RFI: 1.4 Business Cases
• Veterans Can Personalize Medical Records on VA Web Site, GCN, November 9, 2004:
  – My HealtheVet (also copy parts of VistA)
  – Could allow the VA to share patient data with other providers.
  – Patients can request changes to their medical records and allow their loved ones or their physicians to access portions of their records.
• iHealthBeat, November 13, 2004.
10
1. NHIN RFI: 1.4 Business Cases
• Canada Health Infoway*:
  – An EHR solution is a combination of people, organizational entities, business processes, systems, technology and standards that interact and exchange clinical data. A network of interoperable EHR solutions, one that links clinics, hospitals, pharmacies, and other points of care, will help enhance quality of care and patient safety, improve Canadians' access to health services, and make the health care system more efficient.
  – Interoperability for electronic health records is the capability of computer and software systems to seamlessly communicate with each other. It is central to Infoway's mission, making clinical data available across the continuum of care and across health delivery organizations and regions, promoting reusable and replicable solutions that can be aligned with jurisdictional priorities and deployed across the country more cost-efficiently. Without a common framework and sets of standards, EHR systems across Canada would be a patchwork of incompatible systems and technologies.
*Accelerating the development of Electronic Health Information Systems for Canadians: http://www.infoway-inforoute.ca/ehr/index.php?lang=en
11
1. NHIN RFI: 1.4 Business Cases
Canada Health Infoway Standards Collaboration
http://www.infoway-inforoute.ca/ehr/standards_overview.php?lang=en#
12
1. NHIN RFI: 1.4 Business Cases
• One recent study estimated that national implementation of fully standardized interoperability between providers and five other types of organizations could yield net savings of $77.8 billion annually, or approximately 5 percent of the projected $1.7 trillion spent on U.S. health care in 2003.
  – Source: J. Walker et al., “The Value of Health Care Information Exchange and Interoperability,” Health Affairs, January 19, 2005.
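As a quick check of the "approximately 5 percent" figure, the arithmetic can be worked directly from the two dollar amounts quoted above. This is only a sketch; the variable names are illustrative.

```python
# Figures quoted on the slide (Walker et al., Health Affairs, 2005).
annual_savings = 77.8e9       # estimated net savings per year, in dollars
national_spending = 1.7e12    # projected 2003 U.S. health care spending, in dollars

# Savings as a share of national spending, rounded to one decimal place.
share_percent = round(100 * annual_savings / national_spending, 1)
print(f"Savings are about {share_percent}% of spending")
```

The unrounded share is a little under 5 percent, which the slide rounds up.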
13
1. NHIN RFI: 1.5 Leadership Statements
• HHS Secretary Leavitt’s Keynote Address at AFCEA International’s Homeland Security Conference, February 22, 2005 (See http://www.fcw.com/article88110):
  – The next frontier of human productivity is the Interoperability Era.
  – Collaboration is the premium leadership skill that’s needed in this new era.
  – Interoperability begins by setting standards and should be organically grown through the "messy, complex, difficult process called collaboration.”
  – Several elements (8) will improve the chances for success (a “common pain”, a “convener of stature”, a committed leader, openness, transparency, and voluntary participation, a critical mass of stakeholders, representative of substance, a clearly defined purpose and goal, and a formally written and signed charter).
14
1. NHIN RFI: 1.5 Leadership Statements
• Dr. Brailer’s Keynote Address at HIMSS Conference, February 17, 2005:
  – Interoperability Themes from RFIs:
    • Standards (WG1 & WG2)*
    • Governance (WG2)
    • Privacy (WG3)
    • Regionalization (Initially none, then WG2)
    • Financing (WG3)
    • Architecture (WG1)
    • Regulation (WG3)
*Mappings to WGs added by author of this presentation.
15
1. NHIN RFI: 1.6 Related Activities
• Federal Health Architecture (FHA) Interoperability Work Group, March 17 and 24, 2005:
  – Goal: Technology Standards Harmonization
    • Strive for consensus on some of the potential technical specifications (see next slide)
    • Draft Health Information Interoperability Standards Profile
    • Present standards to OMB as Draft Standards for Trial Use (DSTU)
    • Follow-up with more detailed guidance on implementation
  – Concern: Narrow focus of Work Group is on the less crucial aspect of interoperability (technical standards)
16
Approach for Technology Classification
Message Oriented Interchange, by layer:
• Transport – Web Services: HTTP; ebXML: HTTP; Other: HTTP, SMTP, FTP
• Message – Web Services: SOAP; ebXML: SOAP w/ attach., ebMS; Other: SOAP
• Description – Web Services: WSDL; ebXML: CPP/A
• Discovery – Web Services: UDDI; ebXML: Registry (RIM)
• Business Process – Web Services: BPEL; ebXML: BPSS
Security: XML Digital Signature, XKMS, SAML, WS-Security, XACML, PKI, SSL
Data: HL7 (V 3.0, V 2.x); XML (XSLT, XSL, etc.); Other (ASCII, Binary (e.g., image))
Source: FHA Health Interoperability Work Group, March 24, 2005.
17
1. NHIN RFI: 1.6 Related Activities
• FHA Architectural Peer Review Group (APRG) Initial Meeting, February 11, 2005:
  – Scope – Health Domains as identified by the FHA Health Domain WG and incorporated into the FHA BRM (see FEA’06 Revision Summary, page 4).
  – Semantics – Recommendations were made to consider an ontology that is being developed for this purpose by the CIO Council (actually by GSA, TopQuadrant, and SICoP).
    • See Slide 18 for Example.
18
1. NHIN RFI: 1.6 Related Activities
• Healthcare Informatics Online, January 2004 Cover Story on Emerging Technologies:
  – Concept introduced in 2001 Scientific American article and described using the scenario of a man who goes online, employing intelligent agents on the Semantic Web to set up a series of physician appointments and physical therapy sessions for his ailing mother. (It could be 10 years before such agent-enabled scenarios play out, but simpler semantic functions are already emerging.)
• My Note: Semantic Web Applications for National Security (SWANS), April 7-8, 2005, Crystal City, Virginia.
19
1. NHIN RFI: 1.6 Related Activities
• Healthcare Informatics Online, January 2004 Cover Story on Emerging Technologies:
  – “It’s not a Web replacement, it’s an evolution based largely on eXtensible Markup Language (XML) with added technologies that allow computers to interpret and process data “ontologies”, or relationships between disparate pieces of information.”
  – “The Semantic Web would represent a worldwide Web of connected data, radically different from today’s Web of discrete documents, which is why it could be the affordable answer to the electronic health record.”
• My Note: The Semantic Web could also deal with the privacy and security concerns expressed in the RFI Individual Responses.
20
1. NHIN RFI: 1.6 Related Activities
21
1. NHIN RFI: 1.7 Building Ontologies
• The Mind Map Book: How to Use Radiant Thinking to Maximize Your Brain’s Untapped Potential (Tony Buzan):
  – Before the web came hypertext. And before hypertext came mind maps.
  – A mind map consists of a central word or concept. Around the central word you draw the 5 to 10 main ideas that relate to that word. You then take each of those child words and again draw the 5 to 10 main ideas.
  – Mind maps allow associations and links to be recorded and reinforced.
  – The non-linear nature of mind maps makes it easy to link and cross-reference different elements of the map.
• See next slide for examples from the “Explorer’s Guide to the Semantic Web,” Thomas Passin, Manning Publications, 2004, pages 106 and 141.
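A mind map as described above is, structurally, a tree rooted at the central concept. The following is a minimal sketch of that structure; the class name `MindMapNode` and the sample labels are assumptions chosen for illustration (the work group names are taken from the earlier slides), not part of any tool used in the pilot.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MindMapNode:
    """One word or concept in a mind map, with its child ideas."""
    label: str
    children: List["MindMapNode"] = field(default_factory=list)

    def add(self, label):
        """Attach a child idea and return it so branches can be extended."""
        child = MindMapNode(label)
        self.children.append(child)
        return child

    def outline(self, depth=0):
        """Flatten the map into indented outline lines (two spaces per level)."""
        lines = ["  " * depth + self.label]
        for child in self.children:
            lines.extend(child.outline(depth + 1))
        return lines


# Illustrative map: central concept, two main ideas, then their child ideas.
root = MindMapNode("NHIN")
work_groups = root.add("Work Groups")
work_groups.add("Technical and Architecture")
work_groups.add("Organization and Business Framework")
work_groups.add("Finance, Privacy, Regulatory, and Legal")
themes = root.add("Themes")
themes.add("Standards")
themes.add("Privacy")

print("\n".join(root.outline()))
```

Cross-references between branches, which the slide mentions, would need extra links on top of this tree; they are left out to keep the sketch short.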
22
Mind Maps for Searching and Ontologies
Searching (mind map):
• ENVIRONMENT: huge, changing, growing, inconsistent
• STRATEGIES: keywords, ontologies, classification, metadata, semantic focusing, social analysis, multiple passes, clustering
Ontologies (mind map):
• Branch labels: CLASSIFICATION, ONTOLOGIES, NAMES, LANGUAGES, KINDS
• Branch terms:
  – informal, formal, distinctions, multiple trees, hierarchies, taxonomies, vocabularies
  – combining, specifying, commitment
  – properties, relationships, constraints, identifiers
  – RDFS, OWL, DAML, Description Logics
  – ad hoc, categories, internet, predefined
Note: These are not complete.
23
1. NHIN RFI: 1.7 Building Ontologies
NHIN (mind map; possible/probable interrelationships):
• RFI: general; organizational & business; management & operational; standards & policies; financial, regulatory, & legal; other
• WORK GROUPS: technical & architecture; organization & business; financial, regulatory, & legal
• STRATEGIC PLAN GOALS: Inform Clinical Practice; Interconnect Clinicians; Personalize Care; Improve Population Health
• ORGANIZATIONAL STRUCTURE: regional initiatives; clinical practice; population health; health interoperability; Federal Health Architecture
• FRAMEWORKS: organizational; technical; semantic
• DR. BRAILER: standards; governance; privacy; regionalization; financing; architecture; regulation
• STANDARDS ORGANIZATIONS: NCVHS, CCHIT, Etc.
• OTHER: other
24
1. NHIN RFI: 1.7 Building Ontologies
• An ontology organizes things into types and categories with a well-defined structure; ontologies are “networks of concepts”.
• Specific ontologies must be constructed with known vocabularies and rules of construction.
• A good ontology requires:
  – The ability to conceptualize and articulate the underlying ideas.
  – Skill at modeling abstractions.
  – Knowledge of the syntax of the modeling language.
    • OWL is poised to become the major ontology language for the Web.
  – Use of well-developed and accepted ontologies whenever possible.
    • The Suggested Upper Merged Ontology (SUMO) is a best practice example.
  – A Community of Practice with all of these skills that can collaborate to develop the ontology.
    • The Ontolog Forum is a best practice example (see next slide).
25
1. NHIN RFI: 1.7 Building Ontologies
• A key aspect of successful large scale interoperability is shared meaning.
  – Shared meaning requires not only a common syntax (XML), but a common vocabulary.
  – That common vocabulary should be defined in terms of the broadest and most general foundation concepts and be in a formal and computable language not subject to human interpretation in English alone.
  – Formal ontologies, defined in logic, and a hierarchy of ontologies that build from a common semantic foundation are needed (see next slide).
26
Current Ontology-Driven Information System for FHA/NHIN
Ontology layers shown in the figure:
• Upper Ontology: Most General Thing
• Mid-Level Ontology: Process, Location
• Domain Ontology: Geographic Area of Interest, Airspace, Target Area of Interest
Examples: SUMO; HL7 RIM; FEA-RMO; EON, SNOMED CT, LOINC
Source: Netcentric Semantic Linking (Mapping): An Approach for Enterprise Semantic Interoperability, Mary Pulvermacher, et al., MITRE, October 2004.
27
1. NHIN RFI: 1.7 Building Ontologies
• Strategy for the NHIN Ontology:
  – Compile repository/library of NHIN public and RFI documents in their native file formats.
  – Repurpose the documents:
    • Proprietary to text formats.
    • Proprietary to XML documents.
    • Chunk large documents into sub-documents.
  – Compile the NHIN “Mind Maps” for defining searches and building the ontology.
  – Work with ontology communities of practice to draw in their expertise.
    • Proposed new Ontology and Taxonomy Coordinating Group (ONTACG) of SICoP.
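The "chunk large documents into sub-documents" step can be sketched as follows. This is a hedged illustration only: the slides do not say how the pilot tools chunk documents, so the paragraph-boundary rule, the `max_chars` limit, and the function name `chunk_document` are all assumptions.

```python
def chunk_document(text, max_chars=2000):
    """Split text into sub-documents at blank-line paragraph boundaries.

    Paragraphs are packed into a chunk until adding the next one would
    exceed max_chars. A single paragraph longer than max_chars is kept
    whole rather than cut mid-sentence (an assumption of this sketch).
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Three 30-character paragraphs with a 70-character limit: the first two
# fit together (62 characters with the separator), the third starts a new chunk.
sample = "\n\n".join(["a" * 30, "b" * 30, "c" * 30])
sample_chunks = chunk_document(sample, max_chars=70)
print([len(chunk) for chunk in sample_chunks])
```

Splitting on paragraph boundaries keeps each sub-document readable on its own, which matters when the chunks are later indexed and searched individually.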
28
2. Results and Next Steps
• 2.1 The Challenge
• 2.2 A Suggested Solution
• 2.3 The Content
• 2.4 The Pilot
• 2.5 Sample Results
• 2.6 Next Steps
29
2.1 The Challenge
• Extract and organize the semantic concepts from about 5000 pages of semi-structured content in support of a comprehensive analysis to recommend the plan for the National Health Information Network (NHIN).
  – For example: Dr. Brailer, ONCHIT Technical Assistance Call December 6, 2004, “NHIN refers to a specific bundle of technologies, business frameworks, financing arrangements, legal contracting or other mechanisms, policy requirements, organizational issues and related things that allow for network interoperability. So NHIN is the middleware in the grand schema of these pieces.”
30
2.2 A Suggested Solution
• Besides manual human extraction, individually and in the Work Group environment, there are machine-aided extraction, analysis, and visualization tools that could and should be brought to bear on this problem and that would lead to the building of an ontology.
• This approach was taken with the Federal Enterprise Architecture Reference Models to produce an ontology that has been released.
  – http://web-services.gov/fea-rmo.html
31
2.3 The Content
Content Category       NextPage (1)   FAST (2)   Content Analyst (3)   Comments
Background (pre RFI)   Done           Done       Done                  Contains structured relationships
Organizations          About 50%      Done       Done                  Complex concepts and relationships
Individuals            Done           Done       Done                  Only about 5 simple categories!
Workgroups             Done           Done       Done                  Needs simplification
(1) Indexing, categorization, and relationship linking.
(2) Indexing, keyword/concept extraction, and taxonomy.
(3) Same as (2).
32
2.4 The Pilot
• A Recommended Start to the NHIN Ontology:
  – The European Interoperability Framework:
    • Organisational
    • Technical, &
    • Semantic
  – Leavitt's view of interoperability: interoperability should be organically grown through the "messy, complex, difficult process called collaboration.”
    • http://www.fcw.com/article88110
33
2.4 The Pilot
• Tools:
  – Selection Criteria:
    • Selected for participation in the SWANS Conference, April 7-8, 2005, because of support for Semantic Technologies (RDF/OWL).
    • Willing to provide hardware, software, and advice for proof of concept.
    • Two or more vendors initially – more after SWANS Conference
  – Selection:
    • NextPage FolioViews and LivePublish (recently acquired by FAST Search & Transfer)
    • FAST Data Search and ProPublish
      – http://www.fastsearch.com
    • Content Analyst
      – http://www.contentanalyst.com
34
2.4 The Pilot
• Ontology Expertise:
  – Ontolog Forum:
    • Submitted Response to the RFI
      – Available on the Internet
    • Providing Ontology Engineering Advice
    • Suggests Brainstorming Session
  – Proposed New SICoP Ontology and Taxonomy Coordinating Work Group (ONTACG)
35
2.5 Sample Results
http://web-services.gov, See Best Practices
36
2.5 Sample Results
http://web-services.gov, See Best Practices
37
2.5 Sample Results
http://web-services.gov, See Best Practices
38
2.5 Sample Results
http://web-services.gov, See Best Practices
39
2.5 Sample Results
Folio Views Infobase of RFIs
40
2.5 Sample Results
Content Analyst: Compute Taxonomy
41
2.5 Sample Results
Content Analyst: Run Queries
42
2.5 Sample Results
Content Analyst: Set Training Documents
43
2.5 Sample Results
FAST ProPublish: Production Manager
44
2.5 Sample Results
FAST ProPublish: Build Progress
45
2.5 Sample Results
FAST Data Search: Search View
46
2.5 Sample Results
FAST Data Search: Taxonomy Results Saved in Excel Spreadsheet
47
2.6 Next Steps
• NHIN Suggest a Series of Queries:
  – Results can be provided in Excel spreadsheets for further analysis and reuse
• Add content from those agencies interviewed by the FHA Interoperability Work Group recently:
  – VA, DoD, EPA, CDC, FDA, NIH-NCI/DHS/HIS
• See future demonstrations with the initial public domain databases for semantic searching and ontology building (see next slide):
  – SWANS Conference, April 7-8, 2005
  – SICoP Meeting at KM Conference, April 22, 2005
48
2.6 Next Steps
Initial Public Domain Databases for Semantic Searching and Ontology Building
Content Source     Content Type                Pilot Example
Web Site           Topics                      Children’s Health, Mercury, Etc.
Web Site           Registries                  System of Registries
Exchange Network   Nodes                       Pacific Water Quality
E-Gov              E-Rulemaking                Samples
Data Mart          TBD                         TBD
Indicators         Reports on the Indicators   EPA, Heinz, Etc.
GIS                Maps                        Region 4 GeoBook
GIS                Metadata                    Clearinghouse
49
Appendices
• A. Ontology Engineering
• B. FAST Data Search and ProPublish
• C. Content Analyst
50
Appendix A: Ontology Engineering
• A.1 What Is An Ontology?
• A.2 Basic Requirements For an Ontology
• A.3 Ontology Examples
• A.4 Formal Taxonomies for the U.S. Government
• A.5 Medical Informatics Ontologies: Examples and Design Decisions
• A.6 GLIF in Protégé
• A.7 Why Develop an Ontology?
• A.8 Ontology-Development Process
• A.9 What Is “Ontology Engineering”?
• A.10 Ontology-Driven Information Systems
51
A.1 What Is An Ontology?
• An ontology is an explicit description of a domain:
  – concepts
  – properties and attributes of concepts
  – constraints on properties and attributes
  – individuals (often, but not always)
• An ontology defines:
  – a common vocabulary
  – a shared understanding
52
A.2 Basic Requirements For an Ontology
• 1. Finite controlled (extensible) vocabulary.
• 2. Unambiguous interpretation of classes and term relationships.
• 3. Strict hierarchical subclass relationships between classes.
• 4. Few others…
Source: Deborah McGuinness, Ontologies Come of Age, in the Semantic Web: Why, What, and How, MIT Press, 2002, page 6.
53
A.3 Ontology Examples
• Taxonomies on the Web
– Yahoo! categories
• Catalogs for on-line shopping
– Amazon.com product catalog
• Domain-specific standard terminology
– SNOMED Clinical Terms – terminology for clinical medicine
– UNSPSC - terminology for products and services
54
A.4 Formal Taxonomies for the U.S. Government
OWL Listing:
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:daml="http://www.daml.org/2001/03/daml+oil#"
    xmlns="http://www.owl-ontologies.com/unnamed.owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xml:base="http://www.owl-ontologies.com/unnamed.owl">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Transportation"/>
  <owl:Class rdf:ID="AirVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#GroundVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#Automobile">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="GroundVehicle"/>
    </rdfs:subClassOf>
  Etc.
Source: Formal Taxonomies for the U.S. Government, Michael Daconta, Metadata Program Manager, US Department of Homeland Security, XML.Com, http://www.xml.com/pub/a/2005/01/26/formtax.html
Transportation Class Hierarchy
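To show how a class hierarchy like the one in the OWL listing can be read by software, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not a full RDF/OWL parser: it only looks at top-level `owl:Class` elements and their `rdfs:subClassOf` children. The sample document is a completed version of the truncated listing, with the closing tags supplied as an assumption, and the helper names are illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# A closed, shortened variant of the listing (closing tags are assumed).
OWL_DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Transportation"/>
  <owl:Class rdf:ID="AirVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#GroundVehicle">
    <rdfs:subClassOf rdf:resource="#Transportation"/>
  </owl:Class>
  <owl:Class rdf:about="#Automobile">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="GroundVehicle"/>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>"""


def class_name(element):
    """Read a class name from rdf:ID, rdf:about, or rdf:resource."""
    ref = (element.get(f"{{{RDF}}}ID")
           or element.get(f"{{{RDF}}}about")
           or element.get(f"{{{RDF}}}resource"))
    return ref.lstrip("#")


def subclass_pairs(xml_text):
    """Return (child, parent) pairs for each top-level owl:Class."""
    root = ET.fromstring(xml_text)
    pairs = []
    for cls in root.findall(f"{{{OWL}}}Class"):
        for sub in cls.findall(f"{{{RDFS}}}subClassOf"):
            if sub.get(f"{{{RDF}}}resource") is not None:
                parent = class_name(sub)            # reference form
            else:
                parent = class_name(sub.find(f"{{{OWL}}}Class"))  # nested form
            pairs.append((class_name(cls), parent))
    return pairs


def ancestors(name, pairs):
    """Walk up the hierarchy; assumes each class has at most one parent."""
    parents = dict(pairs)
    chain = []
    while name in parents:
        name = parents[name]
        chain.append(name)
    return chain


pairs = subclass_pairs(OWL_DOC)
print(ancestors("Automobile", pairs))
```

A production system would use a real RDF library rather than raw XML parsing, since the same hierarchy can be serialized in several equivalent ways.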
55
A.5 Medical Informatics Ontologies: Examples and Design Decisions
• Foundational Model of Anatomy (FMA):
  – Developed at University of Washington as part of the Digital Anatomist project.
  – Contains: ~70,000 distinct concepts, ~110,000 terms, and 140 relations
• Gene Ontology (GO):
  – A controlled vocabulary for describing genes and gene products with three organizing components: Molecular function, Biological process, and Cellular component.
• Health Level 7 (HL7) Data Types and Top-Level RIM Classes:
  – HL7 data types as Protégé classes
• Guideline Interchange Format (GLIF) (See next slide):
  – A format for sharing clinical guidelines independent of platforms and systems:
    • Designed to support multiple vocabularies and medical knowledge bases.
    • Designed to work with different patient information models.
56
A.6 GLIF in Protégé
57
A.7 Why Develop an Ontology?
• To share common understanding of the structure of information
  – among people
  – among software agents
• To enable reuse of domain knowledge
  – to avoid “re-inventing the wheel”
  – to introduce standards to allow interoperability
58
A.8 Ontology-Development Process
• In this tutorial: determine scope → consider reuse → enumerate terms → define classes → define properties → define constraints → create instances
• In reality - an iterative process: determine scope → consider reuse → enumerate terms → define classes → consider reuse → enumerate terms → define classes → define properties → create instances → define classes → define properties → define constraints → create instances → define classes → consider reuse → define properties → define constraints → create instances
59
A.9 What Is “Ontology Engineering”?
• Ontology Engineering: Defining terms in the domain and relations among them
  – Defining concepts in the domain (classes)
  – Arranging the concepts in a hierarchy (subclass-superclass hierarchy)
  – Defining which attributes and properties (slots) classes can have and constraints on their values
  – Defining individuals and filling in slot values
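The four activities above (classes, hierarchy, slots with constraints, individuals) can be sketched in a few lines. This is an illustration under stated assumptions, not how Protégé or any ontology editor stores its models: the names `OntClass` and `make_individual`, the sample classes, and the use of Python types as the only kind of constraint are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class OntClass:
    """A concept with an optional superclass and typed slots."""
    name: str
    parent: Optional["OntClass"] = None
    slots: Dict[str, type] = field(default_factory=dict)  # slot name -> required value type

    def all_slots(self):
        """Slots defined here plus those inherited from superclasses."""
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}

    def is_a(self, other):
        """True if this class is `other` or one of its subclasses."""
        cls = self
        while cls is not None:
            if cls is other:
                return True
            cls = cls.parent
        return False


def make_individual(cls, **values):
    """Create an individual, checking slot names and value types."""
    allowed = cls.all_slots()
    for slot, value in values.items():
        if slot not in allowed:
            raise ValueError(f"{cls.name} has no slot {slot!r}")
        if not isinstance(value, allowed[slot]):
            raise TypeError(f"{slot!r} must be {allowed[slot].__name__}")
    return {"class": cls, **values}


# Hypothetical classes: an RFI response is a kind of document.
document = OntClass("Document", slots={"title": str})
rfi_response = OntClass("RFIResponse", parent=document, slots={"pages": int})

response = make_individual(rfi_response, title="Example response", pages=12)
print(response["title"], response["pages"])
```

Calling `make_individual(rfi_response, pages="twelve")` raises a `TypeError`, which is the constraint on slot values at work.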
60
A.10 Ontology-Driven Information Systems
• Methodology Side – the adoption of a highly interdisciplinary approach:
  – Analyze the structure at a high level of generality.
  – Formulate a clear and rigorous vocabulary.
• Architectural Side – the central role in the main components of an information system:
  – Information resources.
  – User interfaces.
  – Application programs.
See for example: Nicola Guarino, Formal Ontology and Information Systems, Proceedings of FOIS ’98, Trento, Italy, 6-8 June 1998.
61
Appendix B: FAST Data Search
• B.1 Gartner Magic Quadrant for Enterprise Search, 2004
• B.2 FAST Data Search:
  – Categorization and Taxonomy Support
  – Integration
• B.3 FAST ProPublish System Overview:
  – Gather Content
  – Process Content
  – Deliver Content
62
B.1 Gartner Magic Quadrant for Enterprise Search, 2004
Source: Gartner Research ID Number: M-22-7894, Whit Andrews, 17 May 2004.
63
B.1 Gartner Analysis: Leaders
• Fast Search & Transfer (FAST) now is counted in the Leaders quadrant, moving from the Visionaries quadrant. The vendor has experienced explosive growth, providing better-than-average means and an expanding list of approaches of determining relevancy. Its architecture is superior among search vendors, and sales are strong. (Sales of enterprise search technology were $42 million in 2003, up from $36 million in 2002.) Its acquisition of the remainder of AltaVista's business has had no real impact on operations.
• Critical questions include whether FAST will:
  – 1) remain a specialist in search technologies;
  – 2) pursue "search-derivative applications" (FAST's term for the general application category founded on search platforms, including customer relationship management (CRM) knowledge base support tools and scientific research managers); or
  – 3) focus on original equipment manufacturer arrangements or on a broader suite of applications, such as those included in a smart enterprise suite. Search vendors typically follow an arc that leads to their acquiring a company, to failure or to a position as an enduring leader. FAST has the opportunity to pursue the last path.
• Note added by Brand Niemann: FAST acquired NextPage in December 2004, which provides electronic publishing software to 6 of the 9 leading electronic publishers in the world. I have used NextPage in the pilots to date.
64
B.2 FAST Data Search: Categorization and Taxonomy Support
65
B.2 FAST Data Search: Integration
66
B.3 FAST ProPublish System Overview
Gather Content → Process Content → Deliver Content
67
B.3 FAST ProPublish System Overview
• Searches in the online FAST ProPublish system are powered by FAST proven search technology. Search results are displayed on a results list and additional navigation interfaces such as key words, dynamic drill-down lists, metadata structures, and hierarchy are also provided. When documents are retrieved, they are pulled from the content repository. Search hits are highlighted in HTML and XML documents.
• FAST ProPublish is designed to be a distributed application. Nearly every component may be run on a separate machine (or multiple machines) for extreme scalability and reliability. However, this same flexibility also allows all of the components to be run on a single server.
• FAST ProPublish provides the following services:
  – Search and query.
  – Data and text mining and analysis.
  – Exploration and static reporting.
68
B.3 FAST ProPublish System Overview
• Gather Content:
  – The Production Manager is the tool you use to create a collection. Also, through the Production Manager graphical user interface, you can establish a library. A library consists of a collection or group of related collections and enables you to structure content. That is, you can define a library hierarchically with folders, sub-folders, and collection nodes the way you want the content to appear on your site.
  – Production Manager has the functionality and capability to build libraries from existing collections, or from collections that you define and build within the Production Manager interface from various sources of content.
69
B.3 FAST ProPublish System Overview
• Process Content:
  – A collection is, as the name implies, a collection of content/documents and is fully indexed, structured, and searchable. Documents within a collection reside in their native formats. Collections house three "chunks" of information:
    • The table of contents (TOC)
    • An index of the content
    • A copy of the content
  – Because collections contain this information, they are self-contained and portable.
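The three "chunks" a collection holds (table of contents, index, copy of the content) can be illustrated with a small in-memory structure. This is only a sketch: the slides do not describe FAST ProPublish's actual storage format, so the dictionary layout, the simple word index, and the names `build_collection` and `search` are assumptions.

```python
import re
from collections import defaultdict


def build_collection(documents):
    """Build a self-contained collection from a {title: text} mapping.

    Returns a dict with a table of contents, a word -> titles index,
    and a copy of the content itself.
    """
    toc = sorted(documents)
    index = defaultdict(set)
    for title, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(title)
    return {"toc": toc, "index": dict(index), "content": dict(documents)}


def search(collection, query):
    """Return titles containing every query word, in TOC order."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(collection["toc"])
    for word in words:
        hits &= collection["index"].get(word, set())
    return [title for title in collection["toc"] if title in hits]


# Hypothetical two-document collection.
docs = {
    "Privacy": "Concerns about privacy and security",
    "Standards": "Interoperability standards and security",
}
collection = build_collection(docs)
print(search(collection, "security"))
print(search(collection, "privacy security"))
```

Because the index and the content travel together in one object, the collection can be moved or copied as a unit, which is the portability point the slide makes.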
70
B.3 FAST ProPublish System Overview
• Process Content:
  – Each node in the content tree is a library, folder, sub-folder, or collection.
  – Folder nodes can contain other content nodes (such as sub-folders and collections).
  – You can organize these nodes (folder and collection) within this pane according to your content and business needs to create a hierarchy of content for the library.
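The content tree described above can be modeled as nodes of three kinds. The sketch below is illustrative only; the class name `ContentNode`, the rule that collection nodes are leaves, and the sample names are assumptions rather than documented FAST ProPublish behavior.

```python
from dataclasses import dataclass, field
from typing import List

CONTAINER_KINDS = {"library", "folder"}


@dataclass
class ContentNode:
    """A node in the content tree: a library, a folder, or a collection."""
    name: str
    kind: str  # "library", "folder", or "collection"
    children: List["ContentNode"] = field(default_factory=list)

    def add(self, child):
        """Attach a child node, enforcing the containment rules assumed here."""
        if self.kind not in CONTAINER_KINDS:
            raise ValueError("collection nodes cannot contain other nodes")
        if child.kind == "library":
            raise ValueError("a library can only be the root of the tree")
        self.children.append(child)
        return child

    def collections(self):
        """List the names of all collections below this node, depth first."""
        if self.kind == "collection":
            return [self.name]
        found = []
        for child in self.children:
            found.extend(child.collections())
        return found


# Hypothetical library holding the RFI responses.
library = ContentNode("NHIN RFI", "library")
organizations = library.add(ContentNode("Organizations", "folder"))
organizations.add(ContentNode("Vendors", "collection"))
library.add(ContentNode("Individuals", "collection"))

print(library.collections())
```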
71
B.3 FAST ProPublish System Overview
Process Content: Content Tab Icons and Descriptions
Name                    Description
Library                 The library node contains all folder and content collection nodes for a given library.
Folder and Sub-folder   Folder and sub-folder nodes enable you to create structure within the library and help you organize content.
Collection              Collection nodes represent collections that Production Manager builds and updates.
72
B.3 FAST ProPublish System Overview
• Deliver Content:
  – The user interface is composed of individual components built using Velocity templates and the Struts framework. Some of the components are:
    • Search components – search forms (simple, advanced, and custom), search results page (configurable), parametric search.
    • Navigation components – hierarchical table of contents, browse-by-category, dynamic drill down for search refinement, breadcrumb trails.
    • Document display components – document retrieval, search hit highlighting, next / previous document, next / previous hit document.
73
B.3 FAST ProPublish System Overview
Deliver Content: Default User Interface
74
B.3 FAST ProPublish System Overview
Deliver Content: Advanced Search Page
75
Appendix C: Content Analyst
• C.1 Definitions
• C.2 Conceptual Mapping
• C.3 Document Proximity Conceptual Similarity
• C.4 Term Proximity Conceptual Similarity
• C.5 No Auxiliary Structures Required
• C.6 Retrieval Using Conceptual Comparison
• C.7 Terminology Variant Clustering
• C.8 Conceptual Generalization
76
Appendix C: Content Analyst (continued)
• C.9 Deep Conceptual Generalization
• C.10 Cross-lingual Operations
• C.11 Cross-lingual Capabilities
• C.12 Automated Information Organization
• C.13 Category Creation by Example
• C.14 Automatic Categorization
• C.15 Categorizing Items of Interest
• C.16 Automated Taxonomy Generation
77
Appendix C: Content Analyst (continued)
• C.17 Instant Context Display
• C.18 Alias Identification
• C.19 Automated Thematic Decomposition
• C.20 Conceptual Interlingua
• C.21 Product Status
• C.22 Performance
• C.23 For More Information
78
C.1 Definitions
• Content Analyst:
  – …is a Machine Learning Technique…
  – …that allows Conceptual Comparison of Text Objects…
  – …based on the Technique of Latent Semantic Indexing.
• Latent Semantic Indexing is a patented machine learning technique that enables technology to identify, represent, and compare concepts that exist within a collection of documents or data.
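The general idea behind Latent Semantic Indexing can be sketched on a toy corpus. LSI is usually computed with a truncated singular value decomposition of the term-document matrix; to stay within the standard library, this sketch substitutes power iteration with deflation on the term co-occurrence matrix, which yields the same leading dimensions for this small example. The corpus, the function names, and the choice of two dimensions are assumptions, and nothing here reflects how Content Analyst itself is implemented.

```python
import math

# Toy corpus (hypothetical): "car" and "automobile" never share a document,
# but both appear alongside "engine".
TERMS = ["car", "automobile", "engine", "rocket", "fuel", "propellant"]
DOCS = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["rocket", "fuel"],
    ["rocket", "fuel", "propellant"],
]


def term_document_matrix(terms, docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    return [[float(doc.count(term)) for doc in docs] for term in terms]


def gram(matrix):
    """Return matrix times its transpose (term-by-term co-occurrence)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def top_eigenpairs(sym, k, iterations=500):
    """Power iteration with deflation on a symmetric positive semidefinite matrix."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        vec = [1.0 + i for i in range(n)]  # fixed start vector for repeatability
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def lsi_term_vectors(matrix, k):
    """Coordinates of each term in a k-dimensional concept space."""
    pairs = top_eigenpairs(gram(matrix), k)
    return [[math.sqrt(value) * vec[i] for value, vec in pairs]
            for i in range(len(matrix))]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


matrix = term_document_matrix(TERMS, DOCS)
vectors = lsi_term_vectors(matrix, k=2)
car, automobile, rocket = (vectors[TERMS.index(t)] for t in ("car", "automobile", "rocket"))

raw_similarity = cosine(matrix[0], matrix[1])   # car vs. automobile, raw counts
lsi_similarity = cosine(car, automobile)        # same pair in concept space
unrelated_similarity = cosine(car, rocket)      # terms from different topics
print(raw_similarity, lsi_similarity, unrelated_similarity)
```

On raw counts "car" and "automobile" look unrelated because they never co-occur; in the reduced space they land on the same concept dimension through their shared neighbor "engine", while "rocket" stays on a separate one.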
79
C.2 Conceptual Mapping
Figure labels: Documents; Biological Weapons; Transportation; Agriculture
80
C.3 Document Proximity Conceptual Similarity
Figure labels: ...missile...; ...fuel...; ...rocket...; propellant; Content Analyst Representation Space
81
C.4 Term Proximity Conceptual Similarity
Figure labels: Car; Automobile; Content Analyst Representation Space
82
C.5 No Auxiliary Structures Required
Figure labels: Taxonomies; Grammars; Thesauri; Ontologies
83
C.6 Retrieval Using Conceptual Comparison
Figure labels: Query; Documents in Relevance Order; Proximity; Conceptual Similarity; Natural Ranking
84
C.7 Terminology Variant Clustering
Figure labels: Osama bin Laden; Usama bin Laden; Osama Binladen; Osama BinLadin; Usama Binladen; Osama bin Ladin; Usama bin Ladin; Usama Binladin
85
C.8 Conceptual Generalization
Figure labels: User’s Terminology (“Bomb”); Author’s Terminology (“…devices that spread shrapnel…”); CA Space
86
...Methods of armed struggle not accepted internationally...
War Crimes
C.9 Deep Conceptual Generalization
87
C.10 Cross-lingual Operations
Farsi English
Arabic English
English Doc
Retrieved Documents in Correct Relevance Order
English Query
Results
Results
Documents in Multiple Languages
88
C.11 Cross-lingual Capabilities
• Arabic
• Chinese
• English
• Farsi
• French
• Korean
• Russian
• Spanish
• Pashtu
• Urdu
• Italian
• German
• Portuguese
• Dutch
Current Future Near-term
• Japanese
89
C.12 Automated Information Organization
• Sorting into Predetermined Categories
• Determining the Natural Topical Breakdown of Information
90
C.13 Category Creation by Example
...anthrax...smallpox...
Documents like this Correspond to the Category Bioterrorism…
CA Representation Space
91
C.14 Automatic Categorization
Newly Acquired Document
Document will be Assigned to this Category
Exemplar Document
CA Space
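One simple way to picture categorization by example is nearest-exemplar assignment: a newly acquired document goes to the category whose exemplar it sits closest to in the representation space. The sketch below assumes cosine similarity and invented two-dimensional coordinates; the `categorize` function and the exemplar vectors are hypothetical and only illustrate the idea on the slide.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    d = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / norm if norm else 0.0


def categorize(doc_vec, exemplars):
    """Return (category, similarity) of the exemplar closest to doc_vec."""
    score, category = max(
        (cosine(doc_vec, ex), cat)
        for cat, vectors in exemplars.items()
        for ex in vectors
    )
    return category, score


# Hypothetical exemplar documents, already mapped into the space.
exemplars = {
    "Bioterrorism": [[0.9, 0.1], [0.8, 0.3]],
    "Transportation": [[0.1, 0.9]],
}

new_doc = [0.7, 0.2]
category, score = categorize(new_doc, exemplars)
print(category, round(score, 3))
```

A practical system would likely also apply a minimum-similarity threshold so that documents unlike every exemplar are left uncategorized; that refinement is omitted here.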
92
Sept. Report
Newly Acquired Document
Precursors
Hamas
Hamas Exemplar Document
C.15 Categorizing Items of Interest
93
New Content
Taxonomy
C.16 Automated Taxonomy Generation
94
Last February Qatada and seven other men, said to be members of the GSPC's British cell, were arrested in London after the discovery of plans to bomb or use GB against an unspecified target in Strasbourg. Charges against Qatada were not pursued. During the investigation, codenamed Operation Odin, Special Branch officers raided Qatada's home in Acton, west London.
gb
sarin
organophosphorous
poisonous
vapors
cholinesterase
resorptive
bezhenar
C.17 Instant Context Display
95
ressam
ressam’s
ahmed
benni
charkaoui
zubeir
abdelrazik
zoubeida
Five men, three of whom identified themselves as Algerian, were arrested Thursday by federal officials wanting to question them about their possible links to Ahmed Ressam, an Algerian arrested in Washington state on explosive smuggling charges.
C.18 Alias Identification
96
C.19 Automated Thematic Decomposition
The hardware, software, and bandwidth currently installed are adequate to support this level of downloading activity. Three people currently are engaged in developing a comprehensive list of URLs to be monitored. This is a labor-intensive task, as existing Internet indexes of online newspapers are very incomplete. Final decisions have not yet been made as to the eventual level of caching that will be done, or the total number of users to be supported. One of the most important aspects of the existing implementation is a web crawler that we have developed and refined over the past five years that is optimized for this application. This crawler can deal with the many idiosyncrasies of this type of download activity: primitive communications in some countries, bizarre naming conventions, inconsistent and partial postings, and frequent changes in web page structure. The current implementation of this crawler reflects five years of lessons learned in carrying out newspaper downloads from the Internet. One of the functions to be carried out with the downloaded data is entity and relationship extraction. In support of this effort, SAIC personnel have conducted a comparison of current entity and relationship software packages. The test involved processing of actual downloaded material. Of the half dozen packages tested, the product from Attensity was, by far, the most complete and accurate. This package is being procured for use in the download processing. It should be noted that even the best of the entity and relationship packages still miss many entities and relationships of interest and still generate an undesirably high number of false relations. We have a current task to examine the ways in which Content Analyst and Attensity can be used together to provide significantly improved overall entity and relationship extraction capabilities. 
Although not addressed in the RFI, one topic that we have paid considerable attention to is processing of images of newspapers using optical character recognition (OCR). At present, approximately 13% of all foreign newspapers posted to the web consist of images of pages, as opposed to character-encoded representations. This includes some important newspapers; for example, most of the Urdu material on the web is only available as images. In order to automatically filter these articles, and to make them available for retrieval, an OCR process must be carried out. At various times over the past five years we have implemented such capabilities for Arabic, Chinese, Farsi, and Russian materials. OCR of newspaper articles is a challenging, but not impossible task. The biggest problem is caused by the low resolution of images posted to the web
Topic #1
Topic #2
Topic #3
97
Arbitrary Documents
Biological Weapons
Transportation
Agriculture
C.20 Conceptual Interlingua
98
C.21 Product Status
• 6 Years Development
• 3 Years Operational Experience
• 24X7 Operations
• Multi-million Document Databases
• Conforms to Modern Standards:
– J2EE
– UNICODE
– XML
99
C.22 Performance
• Can Fully Index > 1M Documents in 14 Hours on a Single PC
• Can Categorize > 1 Million Documents per Day on a Single PC
• Can Distribute Index Creation and Retrieval Operations across Multiple PCs
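For a sense of scale, the two single-PC figures above can be converted to sustained per-second rates. The arithmetic below takes the slide's numbers at face value (1 million documents in 14 hours for indexing, 1 million documents in 24 hours for categorization); since the slide states them as "greater than", the results are lower bounds.

```python
def docs_per_second(doc_count, hours):
    """Average sustained throughput in documents per second."""
    return doc_count / (hours * 3600)


# Figures as stated on the slide, treated as lower bounds.
index_rate = docs_per_second(1_000_000, 14)       # full indexing
categorize_rate = docs_per_second(1_000_000, 24)  # categorization

print(f"indexing:       > {index_rate:.1f} docs/s")
print(f"categorization: > {categorize_rate:.1f} docs/s")
```

That works out to roughly 20 documents per second for indexing and roughly 12 per second for categorization on a single machine.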
100
C.23 For More Information
• Roger Bradford, 703-391-8700 x110, rbradford@contentanalyst.com