Finding Stuff: -LSI and Database Searching- A Business Use Case
Joe Tragert, EBSCO Publishing. Bentley, June 26, 2006.
1
Finding Stuff: -LSI and Database Searching-
A Business Use Case
Joe Tragert
EBSCO Publishing
Bentley
June 26, 2006
2
Overview
• EBSCO Publishing overview
• Latent Semantic Indexing pros and cons
• Integrating diverse content types – the Executive Daily Brief use case
• Discovering obfuscated records – the USPTO example
3
EBSCO Industries
• Ranked #162 in Forbes “America’s Largest Private Companies” in 2005
4
EBSCO Publishing
• Research & reference solutions
  • Corporate
  • Medical
  • Academic
  • Public Library
  • K-12
• 73 terabytes of content, configured into over 100 different proprietary full-text databases
• Redistributes 100+ third-party reference products
• Founded in 1987; 550 employees worldwide; HQ in Ipswich, MA
5
Latent Semantic Indexing
• Searching is focused on the words, not indices or metadata.
• The engine can be “trained” to optimize results by domain (engineering, medical, general business, etc.).
• The engine creates a vector space based upon the data it sees.
• All articles are placed within that vector space.
• Updates are quickly assigned values within the vector space, enabling real-time integration of RSS feeds.
• Multiple data sources are integrated rapidly, requiring a few hours to a few days.
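The slides do not describe the engine's internals, so the following is only a minimal sketch of textbook LSI, not Content Analyst's or EBSCO's implementation. It builds a term-document matrix, extracts two "concept" axes with a small power-iteration SVD, and then places documents, new feed items and queries in the same space. All function names and the toy corpus are invented for illustration.

```python
import math
import random


def tokenize(text):
    return text.lower().split()


def term_doc_matrix(docs):
    """Rows are terms, columns are documents, cells are raw counts."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    row = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for w in tokenize(doc):
            matrix[row[w]][j] += 1.0
    return row, matrix


def concept_axes(matrix, k, iters=300, seed=0):
    """Top-k left singular vectors of the term-document matrix.

    Power iteration with deflation on A^T A is enough for a toy corpus;
    a real system would call a proper SVD routine.
    """
    n_terms, n_docs = len(matrix), len(matrix[0])
    gram = [[sum(matrix[t][i] * matrix[t][j] for t in range(n_terms))
             for j in range(n_docs)] for i in range(n_docs)]
    rng = random.Random(seed)
    found, axes = [], []
    for _ in range(k):
        v = [rng.random() for _ in range(n_docs)]
        for _ in range(iters):
            v = [sum(gram[i][j] * v[j] for j in range(n_docs))
                 for i in range(n_docs)]
            for p in found:  # remove directions already extracted
                overlap = sum(a * b for a, b in zip(v, p))
                v = [a - overlap * b for a, b in zip(v, p)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        found.append(v)
        u = [sum(matrix[t][j] * v[j] for j in range(n_docs))
             for t in range(n_terms)]
        norm = math.sqrt(sum(a * a for a in u))
        axes.append([a / norm for a in u])
    return axes


def project(text, row, axes):
    """Fold any text (document, new feed item, or query) into the space."""
    counts = [0.0] * len(row)
    for w in tokenize(text):
        if w in row:  # words unseen when the space was built are ignored
            counts[row[w]] += 1.0
    return [sum(a * c for a, c in zip(axis, counts)) for axis in axes]


def cosine(x, y):
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)


# Toy corpus (made up): three vehicle texts, three medical texts.
DOCS = [
    "motorcycle engine frame wheel",
    "motorcycle saddle riding vehicle",
    "saddle riding vehicle frame wheel",   # never says "motorcycle"
    "knee implant surgery",
    "prosthesis knee implant joint",
    "prosthesis joint surgery",
]
ROW, MATRIX = term_doc_matrix(DOCS)
AXES = concept_axes(MATRIX, k=2)
DOC_VECTORS = [project(d, ROW, AXES) for d in DOCS]

# A one-word query lands next to every vehicle text, including document 2.
query = project("motorcycle", ROW, AXES)
scores = [cosine(query, d) for d in DOC_VECTORS]

# A new item is folded in without rebuilding the space.
new_item = project("joint implant recall", ROW, AXES)
```

In this sketch `scores[2]` comes out close to 1 even though that document never contains the query word, while the medical documents score close to 0. Folding in `new_item` reuses the existing axes, which is the cheap step that makes rapid updates plausible; periodically rebuilding the axes as the corpus drifts is left out.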
6
LSI Advantages
• Conceptual search: concepts are matched, not keywords
  • Easier to create searches by using chunks of text as search “terms”
  • No need to understand thesauri or Boolean operators
• Integrated content: databases, blogs, RSS, etc.
  • Multiple databases can be searched at once (similar to federated search, but different…)
  • Since the words are searched, there is no need to normalize indices or record structures of source data sets
• Real-time content
  • The engine can rapidly assign new content to the existing vector space, enabling integration of current content with archival material
• Language agnostic
  • Since all content is converted to values in the vector space, multiple languages can be searched and returned in a single result list
7
LSI Disadvantages
• Precision: matching concepts does not lead to the “one perfect article”
• Multiple content types in one result set require robust filtering and refining functionality to minimize confusion
• Default date-order sorting can “overwhelm” a result list
• Multiple languages are seductive, but require a quality translation feature to get the best utility from the results
• It can be difficult for the “Google generation” to grasp the concept of “concepts”
8
Why Use LSI?
• Structured data: users tend not to care about metadata
• Currency is king: users tend to focus on “real time” content (news sites, blogs), but periodicals can provide real value
• Skills: not everyone is a librarian… actually, most aren’t
• Tools: slow to learn, slower to change
• Perspective: impatient with complexity
9
LSI Use Case: Customizable monitoring and alert service
• Supports non-librarian corporate uses: brand management, corporate intelligence, general counsel, IP management, etc.
• Two types of search
  • Content Analyst LLC’s patented Concept Search™
  • EBSCO’s keyword search
• Multiple content types
  • Premium business content (EBSCO structured content)
  • Newspapers
  • RSS feeds (blogs, news sites)
  • Licensed databases (USPTO, INSPEC, etc.)
  • Intranet repositories
10
Multiple Content Types and Search Methods
1. Users can set up folders and monitor for content that is conceptually related (same meaning, but different words) to the keywords or article “examples” already in the folders
2. Users can search for immediate results that are related to words, articles, emails or external documents, using Concept Search or keyword search
3. Users can link to “advanced” keyword search options, thesauri, and visual searching
11
Folders Are Determined by End Users
• Users can add, delete or edit “alerts” (folders) as needed
• Users put words, phrases, paragraphs, full articles, emails, MS Word docs, etc. into the folders
• EDB adds matches to the folders
• Results for a folder appear when the folder is selected
• Users can easily make a result into a “concept” (example) and put it into a folder
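The folder workflow on this slide can be sketched as a small data structure: a folder holds example texts, incoming items are scored against those examples, and a result can be promoted to a new example. This is a hypothetical sketch, not EDB's design. In particular, `word_cosine` is a plain word-overlap stand-in (the proprietary Concept Search would go in its place), and the `Folder` class, its method names and the 0.3 threshold are all assumptions.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def word_cosine(a, b):
    """Stand-in similarity: cosine over raw word counts (not Concept Search)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(n * n for n in ca.values()))
    nb = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Folder:
    name: str
    examples: list = field(default_factory=list)  # words, paragraphs, articles
    matches: list = field(default_factory=list)   # (score, text) pairs

    def add_example(self, text):
        self.examples.append(text)

    def offer(self, text, similarity=word_cosine, threshold=0.3):
        """File incoming text here if it resembles any example closely enough."""
        if not self.examples:
            return False
        score = max(similarity(text, ex) for ex in self.examples)
        if score >= threshold:
            self.matches.append((score, text))
            return True
        return False

    def promote(self, text):
        """Turn a matched result into a new example to monitor."""
        self.add_example(text)


folder = Folder("knee implants")
folder.add_example("prosthetic knee joint implant")
hit = folder.offer("new knee implant approved")
miss = folder.offer("quarterly earnings rise")
folder.promote("new knee implant approved")
```

Passing the similarity function as a parameter keeps the folder logic independent of the search method, so the same folder could be driven by concept matching or by keyword matching.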
12
Structured Content in Familiar Layout
• The full text is viewed in a pop-up window
• The user can link to the source (the article on EBSCOhost, the news site, the RSS feed provider, the licensed database or the intranet file)
• Users can email, save or print the document, or add it to their folder as a new example to be monitored
13
Linking to RSS Providers Simplifies Access
• Selected RSS articles are viewed in a pop-up window
• The user links to the source
14
Results Are Refined, Interactively
• Users can sort results by Date, Title, Publication and Relevance
• Users can narrow results by Publication or Content Type
• Users can delete previously read content, content of a specific relevance, or content published before a specific date
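The three refinement actions on this slide (sort, narrow, delete) can be sketched as ordinary list operations over scored results. The `Result` fields, the function names and the sample records below are made up for illustration, and "content of a specific relevance" is interpreted here as "content below a relevance floor", which is an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    title: str
    publication: str
    content_type: str
    published: date
    relevance: float
    read: bool = False


def sort_results(results, key="relevance"):
    """Sort by date or relevance (highest first), or by title or publication (A-Z)."""
    getters = {
        "date": lambda r: r.published,
        "title": lambda r: r.title.lower(),
        "publication": lambda r: r.publication.lower(),
        "relevance": lambda r: r.relevance,
    }
    return sorted(results, key=getters[key], reverse=key in ("date", "relevance"))


def narrow(results, publication=None, content_type=None):
    """Keep only results from one publication and/or of one content type."""
    return [r for r in results
            if (publication is None or r.publication == publication)
            and (content_type is None or r.content_type == content_type)]


def prune(results, drop_read=False, below_relevance=None, before=None):
    """Delete read items, items under a relevance floor, or items older than a date."""
    kept = []
    for r in results:
        if drop_read and r.read:
            continue
        if below_relevance is not None and r.relevance < below_relevance:
            continue
        if before is not None and r.published < before:
            continue
        kept.append(r)
    return kept


# Invented sample records.
results = [
    Result("Knee implant recall", "Daily News", "newspaper", date(2006, 6, 1), 0.91),
    Result("Suzuki files patent", "USPTO", "database", date(2006, 5, 20), 0.75, read=True),
    Result("Blog roundup", "Some Blog", "rss", date(2006, 6, 10), 0.40),
]
newest_first = sort_results(results, "date")
only_rss = narrow(results, content_type="rss")
worth_reading = prune(results, drop_read=True, below_relevance=0.5)
```

Each function returns a new list, so the refinements can be chained in any order without changing the underlying result set.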
15
Alerts Controlled by End User
• Users can set up email lists (groups and individuals) to automatically forward documents
• Users can set a higher relevancy threshold for shared documents than for their own inbox (only send the “best” articles to colleagues)
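The two-threshold idea can be sketched as a small routing rule: one relevance bar decides what the owner sees, and a stricter bar decides what is forwarded. The `Alert` class, its field names, the default thresholds and the addresses are hypothetical; the slides do not say how the service stores these settings.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    inbox_threshold: float = 0.5   # bar for the owner's own inbox
    share_threshold: float = 0.8   # stricter bar for forwarded documents
    recipients: list = field(default_factory=list)

    def route(self, relevance):
        """Return (keep_in_inbox, recipients_to_email) for one scored document."""
        keep = relevance >= self.inbox_threshold
        forward = list(self.recipients) if relevance >= self.share_threshold else []
        return keep, forward


alert = Alert("brand mentions", recipients=["legal@example.com", "pr@example.com"])
strong = alert.route(0.92)    # kept and forwarded to both addresses
moderate = alert.route(0.65)  # kept, but not forwarded
weak = alert.route(0.30)      # neither kept nor forwarded
```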
16
LSI Use Case:
• Find deliberately obscured patents
• Compare prior art to current research
• Monitor pending patents
• Search patents in native languages
  • USPTO
  • European Patent Organization
  • Japan Patent Office
• Expose patent search to more staff
  • Bench scientists
  • Competitive intelligence
  • Risk managers
17
Sneak Peek: EBSCO Patent Monitor
• In development – Fall 2006 release
• Use Concept Searching to identify “conceptually related patents”
• Enable cross-database searching
  • Patents (various sources)
  • Published STM literature
  • Proprietary research & intranets
18
Searching on “motorcycle” finds patents that do not include the term “motorcycle”
19
Patent #6,085,857 does not contain the word “motorcycle”, but it sure looks like one…
aka: “motorcycle”
20
Running a concept search on the patent abstract creates an “instant context list”
These terms are found in the USPTO database and relate to “saddle-type riding vehicles.” Users can search the USPTO database to find those patents, or they can research the individuals to see who else is an expert…
21
The terms and names on the Instant Context list can indicate the true nature of the patent…
Shinobu Tsutsumikoshi is a developer at Suzuki...
22
Searching with the press release for the new Maxim Knee System returns hundreds of related patents…
23
US Patent #6,090,144 is about prosthetic knees even though the Maxim press release never used the term “prosthesis”
24
Finding Stuff: The Dead Mouse Test
LSI, keywords, proximity, etc…
The real question is not which mousetrap works better…
…just: did we kill the mouse?
25
Thank You
Joe Tragert
Director, Market Development
EBSCO Publishing
O: +800-653-2726 ext. 661