Intelligent Access to Text: Integrating Information Extraction
Technology into Text Browsers
Robert Gaizauskas¹, Patrick Herring¹, Michael Oakes¹,
Micheline Beaulieu², Peter Willett², Helene Fowkes², and Anna Jonsson²
¹Department of Computer Science, ²Department of Information Studies
University of Sheffield
March, 2001 HLT01, San Diego
Outline of Talk
Is Information Extraction Technology Useful?
Barriers to Deployment
Information Seeking in Large Enterprises
The TRESTLE System
• System Overview
• NEAT: Named Entity Access to Text
• SCAT: Scenario Access to Text
Preliminary User Evaluation
• Evaluation Methodology
• Access Strategies
• User Perceptions
Conclusions and Discussion
Is Information Extraction Technology Useful?
Information Extraction (IE) technology has led to impressive new
abilities to extract structured information from texts
Named entity recognition
Template Element/Relation filling
Scenario Template filling
IE complements traditional Information Retrieval (IR) capabilities
However, unlike IR, IE has not found its way into widely used end-user
systems, such as
Web search engines
Document indexing systems
Why not?
Barriers to Deployment
Porting Cost
Moving to new domains requires considerable time + expertise
• to create/modify domain-specific resources + rule bases
• to annotate texts for supervised machine learning approaches
Sensitivity to inaccuracies in extracted data
• MUC-7 results – F-measure scores 50-92% depending on task
• Thus, IE only appropriate for applications where some error is tolerable/readily detectable by end users
• Note: formal IR evaluation results comparable, but application contexts make error less significant
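For reference, the F-measure quoted above combines precision and recall into a single score. The sketch below is a simplified illustration only: it ignores the partial-credit rules of the official MUC scorer, and the function name and the counts in the usage lines are assumptions made up for the example, not MUC-7 figures.

```python
def f_measure(correct: int, spurious: int, missing: int, beta: float = 1.0) -> float:
    """F-measure from counts of correct, spurious and missing extractions.

    Simplified sketch: no partial credit, unlike the official MUC scorer.
    """
    extracted = correct + spurious   # everything the system produced
    expected = correct + missing     # everything in the answer key
    if correct == 0 or extracted == 0 or expected == 0:
        return 0.0
    precision = correct / extracted
    recall = correct / expected
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only (not taken from any evaluation).
score = f_measure(correct=8, spurious=2, missing=2)
```

With `beta = 1` precision and recall are weighted equally, so a system that gets 8 of 10 expected items right while producing 10 items scores 0.8.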
Complexity of integration into end-user systems
IE systems’ outputs must be incorporated into larger application systems, if end users are to benefit from them
IE and Information Seeking in Large Enterprises
To investigate the utility of IE in a real setting, we have developed an advanced text access facility to support information workers at GlaxoSmithKline
TRESTLE – Text Retrieval Extraction and Summarisation Technology for Large Enterprises
Aim: increase effectiveness of employees in “industry watch” function – current awareness/tracking of
• People
• Companies
• Products – particularly progress of new drugs through clinical trial/regulatory approval process
Approach: provide enhanced access to Scrip, the largest circulation pharmaceutical industry newsletter
IE and Information Seeking in Large Enterprises
User requirements study at GSK (questionnaire, observation, interviews) revealed 2 key types of information seeking:
1. Current awareness
• general updating (what's happened in the industry today/this week)
• entity or event-based tracking (e.g. what's happened concerning a specific drug or what regulatory decisions have been made)
2. Retrospective search
• historical tracking of entities or events of interest (e.g. where has a specific person been reported before, what is the clinical trial history of a particular drug)
• search for a specific event or a remembered context in which a specific entity played a role
Note: both activities require identification of entities/events in the news = what IE systems do
TRESTLE System Overview
The system consists of two components
Off-line component
LaSIE IE system
• Input: Scrip texts delivered daily via the Internet
• Output: IE results
• Named entities: MUC-7 categories + drugs + diseases
• Scenario templates: Person Tracking; Clinical Trials; Regulatory Announcements
Summary Writer
• Input: Scenario templates
• Output: Single sentence NL summaries of the templates
Entity/Scenario Indexer
• Input: NE annotated texts; Scenario templates
• Output: Indices keyed by NE + date with pointers to source texts
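The indexer output described above (indices keyed by NE + date, pointing to source texts) could be sketched roughly as follows. This is an illustration under stated assumptions, not the TRESTLE implementation: the record layout (`id`, `date`, `entities`), the function name, and the sample documents are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def build_entity_index(annotated_texts):
    """Map (entity name, publication date) to a list of source text ids.

    `annotated_texts` is an assumed format: dicts with "id", "date" and
    "entities" keys standing in for NE-annotated texts.
    """
    index = defaultdict(list)
    for text in annotated_texts:
        # De-duplicate so a text is listed once per entity it mentions.
        for entity in sorted(set(text["entities"])):
            index[(entity, text["date"])].append(text["id"])
    return dict(index)


# Made-up sample data for illustration.
texts = [
    {"id": "doc-1", "date": date(2001, 3, 1), "entities": ["Drug A", "Company X"]},
    {"id": "doc-2", "date": date(2001, 3, 1), "entities": ["Drug A"]},
    {"id": "doc-3", "date": date(2001, 3, 2), "entities": ["Company X", "Company X"]},
]
index = build_entity_index(texts)
```

Keying on the (entity, date) pair means a date-restricted lookup for one entity only has to touch the keys for that entity within the requested range.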
TRESTLE System Overview (cont)
On-line component
Browser scripts
• Input: User requests for information
• Output: Results to requests returned from annotated Scrip DB
Entity/Scenario Index Search + Dynamic Page Generator
• Input: User information requests forwarded from Web server + entity/scenario indices + NE annotated texts/summaries
• Output: Relevant HTML pages with dynamically generated link information
TRESTLE System Architecture
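[Figure: architecture diagram. Components shown: User (info seeking), Web Browser, Internet, Web Server, Index Search + Dynamic Page Creator, Entity/Scenario Indices, NE Tagged Texts, Scenario Summaries, Scenario Templates, Indexer, Summary Writer, LaSIE System, Scrip. The Off-Line System groups the LaSIE System, Summary Writer and Indexer.]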
TRESTLE Interface Overview
TRESTLE browser-based interface allows 4 routes to access texts:
• by headline
• by named entity (NEAT: Named Entity Access to Text)
• by scenario summary (SCAT: Scenario Access to Text)
• by free text search
For first 3 routes date range of accessed articles may be set to
• current day
• previous day
• last week
• last four weeks
• full archive
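The date range options listed above could be turned into concrete date windows along these lines. This is a sketch only: the slides do not define the exact boundaries, so treating "last week" as the trailing 7 days and "last four weeks" as the trailing 28 days (both including today) is an assumption, as are the function names.

```python
from datetime import date, timedelta


def date_window(option: str, today: date):
    """Return an inclusive (start, end) pair; start is None for no lower bound."""
    if option == "current day":
        return today, today
    if option == "previous day":
        previous = today - timedelta(days=1)
        return previous, previous
    if option == "last week":
        # Assumption: trailing 7 days, including today.
        return today - timedelta(days=6), today
    if option == "last four weeks":
        # Assumption: trailing 28 days, including today.
        return today - timedelta(days=27), today
    if option == "full archive":
        return None, today
    raise ValueError(f"unknown date range option: {option!r}")


def in_window(published: date, window) -> bool:
    """True if an article's publication date falls inside the window."""
    start, end = window
    return (start is None or start <= published) and published <= end
```

An article list for any of the first three access routes can then be filtered with `in_window(article_date, date_window(option, today))`.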
TRESTLE Interface: Underlying Design
Head Frame
• User state
• Date range selection
Access Frame
• Choose access mode
• NE/Scenario/free text search
Index Frame
• Headline list, or
• NE + headline list, or
• Summary list
Text Frame
• Full text of source text
• embedded NE hyperlinks
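The embedded NE hyperlinks shown in the text frame could be generated roughly as below. This is a minimal sketch, not the TRESTLE page generator: the character-span input format, the `/neat?entity=` URL scheme, the `ne-<category>` CSS class (standing in for colour coding), and the company name in the example are all hypothetical.

```python
import html
from urllib.parse import quote


def link_entities(text: str, entities) -> str:
    """Wrap each entity span in a hyperlink.

    `entities` is an assumed format: non-overlapping
    (start, end, category) character spans into `text`.
    """
    parts = []
    pos = 0
    for start, end, category in sorted(entities):
        parts.append(html.escape(text[pos:start]))
        surface = text[start:end]
        href = "/neat?entity=" + quote(surface)  # hypothetical URL scheme
        parts.append(
            f'<a class="ne-{category}" href="{href}">{html.escape(surface)}</a>'
        )
        pos = end
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


sentence = "Mr Garcia joins Acme Pharma."
spans = [(0, 9, "person"), (16, 27, "organization")]
page_fragment = link_entities(sentence, spans)
```

Escaping the text between spans keeps the generated page valid even when the source article contains characters such as `<` or `&`.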
NEAT: Named Entity Access to Text
RUN
SCAT: Scenario Access To Text
RUN
Preliminary User Evaluation: Methodology
Prelude to full end-user study: preliminary study with 8 Information Studies postgrad students
Aim: to gain insight into
• ease of use and learnability of the system
• preferred strategies for accessing text
• problems in interpreting the interface
Instruments: usability questionnaire, verbal protocols, observational notes
Procedure:
• brief verbal introduction to evaluation and system
• undirected exploration of system, asking questions/providing comments
• simulated tasks of real end-user
You've heard that one of your colleagues, Mr Garcia, has recently accepted an
appointment at another pharmaceutical company. You want to find out which company
he will be moving to and what post he has taken up.
Preliminary User Evaluation: Access Strategies
NEAT: access to named entities was made available in three ways:
1. by clicking directly on a list of NE categories in the access frame
2. through the NE index look up query box in the access frame
3. through highlighted entries in a full article displayed in the text frame
Observation: users preferred 2 over 1 or 3, regardless of task
• perhaps because users knew what they were looking for
• perhaps more familiar than browsing NEs
• perhaps because of prominence of NE lookup box in interface
SCAT:
Observation: for tasks where SCAT was appropriate users opted for NE index lookup
• perhaps because of novelty of scenario tracking
• perhaps because SCAT functionality not clear from interface
Preliminary User Evaluation: User Perceptions
Colour coding + hyper-linking of NEs
• Highly noticeable; some objections to colour choice
• Disagreement about utility – distracting when reading full texts, but highly useful in leading to related previous Scrip texts
• Integration of current awareness + retrospective searching via NEs highly appreciated
NE index look-up
• Found very useful by all but one participant
• Some confusion over scope – differences wrt free-text search/only 5 searchable NE categories
• Exact string matching limiting (limitation now removed)
Scenario Tracking
• Function misunderstood from labelling in access frame
• Confusion between SCAT summaries and headlines
• Flag icons for summaries in headline lists not well understood
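The NE index look-up feedback above notes that exact string matching was limiting. The slides do not say how that limitation was removed; purely as an assumption, one plausible relaxation is case-insensitive substring matching, contrasted with exact matching in the sketch below. The function names and the sample entity names (other than the "Garcia" of the evaluation task) are hypothetical.

```python
def exact_lookup(query: str, entity_names):
    """Return only names identical to the query."""
    return [name for name in entity_names if name == query]


def relaxed_lookup(query: str, entity_names):
    """Assumed relaxation: case-insensitive substring match."""
    needle = query.strip().casefold()
    if not needle:
        return []
    return [name for name in entity_names if needle in name.casefold()]


# Made-up index entries for illustration.
names = ["Mr Garcia", "Garcia Laboratories", "Dr Smith"]
strict_hits = exact_lookup("garcia", names)
loose_hits = relaxed_lookup("garcia", names)
```

Under exact matching a user who types only a surname finds nothing, while the relaxed version returns every indexed name containing it, at the cost of some extra, possibly irrelevant, hits.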
Conclusions (I)
To date IE largely a “technology push” activity
For IE technology to become usable and influenced by end user
requirements (“user pull”), end user prototypes must be built which:
exploit the significant achievement of the technology to date
acknowledge its limitations
TRESTLE attempts to do this by exploiting NE and scenario
template IE technology to offer users
novel ways to access textual information
via a familiar text browsing interface
Conclusions (II)
Preliminary user evaluation has revealed:
search options initially selected from the access frame were not always
optimal for set tasks
on the whole, colour-coded textual/iconic cues in headline index + full text
enabled users to exploit the different functions seamlessly
interface supported interaction at procedural level, but some
misunderstanding at the conceptual level – esp. scenario access
• other studies report similar issues in introducing more complex
interactive search functions
• further investigation + modifications (e.g. to labelling) underway
Full evaluation in real end user environment now being organised
To answer the question: can professional information workers use IE-based
searching and awareness approaches effectively?
The End