Intelligent Access to Text: Integrating Information Extraction Technology into Text Browsers

Robert Gaizauskas¹, Patrick Herring¹, Michael Oakes¹, Micheline Beaulieu², Peter Willett², Helene Fowkes², and Anna Jonsson²

¹Department of Computer Science, ²Department of Information Studies, University of Sheffield

March, 2001 HLT01, San Diego

Outline of Talk

Is Information Extraction Technology Useful?

Barriers to Deployment

Information Seeking in Large Enterprises

The TRESTLE System
• System Overview
• NEAT: Named Entity Access to Text
• SCAT: Scenario Access to Text

Preliminary User Evaluation
• Evaluation Methodology
• Access Strategies
• User Perceptions

Conclusions and Discussion


Is Information Extraction Technology Useful?

Information Extraction (IE) technology has led to impressive new abilities to extract structured information from texts
• Named entity recognition
• Template Element/Relation filling
• Scenario Template filling

IE complements traditional Information Retrieval (IR) capabilities

However, unlike IR, IE has not found its way into widely used end-user systems, such as
• Web search engines
• Document indexing systems

Why not?


Barriers to Deployment

Porting Cost

Moving to new domains requires considerable time + expertise
• to create/modify domain-specific resources + rule bases
• to annotate texts for supervised machine learning approaches

Sensitivity to inaccuracies in extracted data
• MUC-7 results: F-measure scores of 50-92%, depending on task
• Thus, IE is only appropriate for applications where some error is tolerable or readily detectable by end users
• Note: formal IR evaluation results are comparable, but application contexts make error less significant
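
The F-measure scores quoted above combine precision and recall into one figure. The following is a minimal sketch, assuming the standard weighted F-measure with exact matching only (the official MUC scoring also credits partial matches, which is not modelled here). The function name and the counts are illustrative and are not MUC-7 data.

```python
def f_measure(correct: int, proposed: int, expected: int, beta: float = 1.0) -> float:
    """Weighted F-measure from raw counts (beta=1 gives the balanced score).

    correct:  extracted items that match the answer key
    proposed: all items the system extracted
    expected: all items in the answer key
    """
    if correct == 0 or proposed == 0 or expected == 0:
        return 0.0
    precision = correct / proposed
    recall = correct / expected
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 80 correct out of 100 proposed, with 120 in the key.
score = f_measure(correct=80, proposed=100, expected=120)
print(f"precision=0.80 recall={80 / 120:.2f} F={score:.2f}")
```

With these counts a system that is right four times out of five still misses a third of the key, which is the kind of error an end-user application has to tolerate or expose.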

Complexity of integration into end-user systems
• IE systems’ outputs must be incorporated into larger application systems if end users are to benefit from them


IE and Information Seeking in Large Enterprises

To investigate the utility of IE in a real setting, we have developed an advanced text access facility to support information workers at GlaxoSmithKline

TRESTLE – Text Retrieval Extraction and Summarisation Technology for Large Enterprises

Aim: increase effectiveness of employees in “industry watch” function, i.e. current awareness/tracking of
• People
• Companies
• Products, particularly progress of new drugs through the clinical trial/regulatory approval process

Approach: provide enhanced access to Scrip, the largest-circulation pharmaceutical industry newsletter


IE and Information Seeking in Large Enterprises

User requirements study at GSK (questionnaire, observation, interviews) revealed 2 key types of information seeking:

1. Current awareness
• general updating (what's happened in the industry today/this week)
• entity or event-based tracking (e.g. what's happened concerning a specific drug or what regulatory decisions have been made)

2. Retrospective search
• historical tracking of entities or events of interest (e.g. where has a specific person been reported before, what is the clinical trial history of a particular drug)
• search for a specific event or a remembered context in which a specific entity played a role

Note: both activities require identification of entities/events in the news, which is what IE systems do


TRESTLE System Overview

The system consists of two components

Off-line component

LaSIE IE system
• Input: Scrip texts delivered daily via the Internet
• Output: IE results
  • Named entities: MUC-7 categories + drugs + diseases
  • Scenario templates: Person Tracking; Clinical Trials; Regulatory Announcements

Summary Writer
• Input: Scenario templates
• Output: Single-sentence NL summaries of the templates

Entity/Scenario Indexer
• Input: NE annotated texts; Scenario templates
• Output: Indices keyed by NE + date with pointers to source texts
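
To make the indexer output concrete, here is a minimal sketch of an index keyed by named entity and date, with pointers back to source texts. The slides do not describe TRESTLE's actual data structures, so the `Mention` record, the function names, the document ids and the example entities are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mention:
    """One named entity found in one source text (hypothetical record shape)."""
    entity: str      # surface string, e.g. a company or drug name
    category: str    # e.g. "person", "organisation", "drug"
    doc_id: str      # pointer back to the source text
    published: date  # publication date of the source text


def build_index(mentions):
    """Map (category, entity) -> {date -> sorted list of doc ids}."""
    index = defaultdict(lambda: defaultdict(set))
    for m in mentions:
        index[(m.category, m.entity)][m.published].add(m.doc_id)
    return {key: {d: sorted(ids) for d, ids in by_date.items()}
            for key, by_date in index.items()}


def lookup(index, category, entity, start, end):
    """Return doc ids mentioning the entity from start to end inclusive, oldest first."""
    by_date = index.get((category, entity), {})
    hits = []
    for d in sorted(by_date):
        if start <= d <= end:
            hits.extend(by_date[d])
    return hits


# Invented example data, not taken from Scrip.
mentions = [
    Mention("Acme Pharma", "organisation", "doc-001", date(2001, 3, 1)),
    Mention("Acme Pharma", "organisation", "doc-007", date(2001, 3, 5)),
    Mention("examplomab", "drug", "doc-007", date(2001, 3, 5)),
]
index = build_index(mentions)
print(lookup(index, "organisation", "Acme Pharma", date(2001, 3, 2), date(2001, 3, 31)))
```

Keying on date as well as entity means the same structure can serve both current awareness (a narrow window) and retrospective search (the whole archive).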


TRESTLE System Overview (cont)

On-line component

Browser scripts

• Input: User requests for information

• Output: Results to requests returned from annotated Scrip DB

Entity/Scenario Index Search + Dynamic Page Generator

• Input: User information requests forwarded from Web server + entity/scenario indices + NE annotated texts/summaries

• Output: Relevant HTML pages with dynamically generated link information


TRESTLE System Architecture

[Architecture diagram. Components shown: User (Info Seeking); Web Browser; Internet; Web Server; Index Search + Dynamic Page Creator; Off-Line System (Scrip, LaSIE System, Summary Writer, Indexer). Data stores shown: NE Tagged Texts; Scenario Templates; Scenario Summaries; Entity/Scenario Indices.]


TRESTLE Interface Overview

TRESTLE browser-based interface allows 4 routes to access texts:
• by headline
• by named entity (NEAT: Named Entity Access to Text)
• by scenario summary (SCAT: Scenario Access to Text)
• by free text search

For first 3 routes, date range of accessed articles may be set to
• current day
• previous day
• last week
• last four weeks
• full archive
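
The date range settings can be read as inclusive date windows over the archive. The sketch below is one possible interpretation: the slides do not say whether "last week" means the trailing seven days or the previous calendar week, so the trailing-days reading and the function name `date_window` are assumptions.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def date_window(setting: str, today: date) -> Tuple[Optional[date], date]:
    """Return an inclusive (start, end) window; start=None means no lower bound.

    Assumption: "last week" and "last four weeks" are trailing windows
    of 7 and 28 days that include today.
    """
    if setting == "current day":
        return today, today
    if setting == "previous day":
        yesterday = today - timedelta(days=1)
        return yesterday, yesterday
    if setting == "last week":
        return today - timedelta(days=6), today
    if setting == "last four weeks":
        return today - timedelta(days=27), today
    if setting == "full archive":
        return None, today
    raise ValueError(f"unknown date range setting: {setting!r}")


start, end = date_window("last four weeks", date(2001, 3, 20))
print(start, end)
```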


TRESTLE Interface: Underlying Design

[Frame layout diagram: Head Frame, Access Frame, Index Frame, Text Frame]

• Head Frame
  • User state
  • Date range selection
• Access Frame
  • Choose access mode
  • NE/Scenario/free text search
• Index Frame
  • Headline list, or
  • NE + headline list, or
  • Summary list
• Text Frame
  • Full text of source text
  • Embedded NE hyperlinks
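
As an illustration of the embedded NE hyperlinks in the text frame, the sketch below wraps annotated character spans in HTML anchors. The span format, the `/neat` URL scheme, the CSS class names and the function name are assumptions made for the example; the slides do not specify how TRESTLE generates its pages.

```python
import html
from urllib.parse import quote


def link_entities(text, spans, base_url="/neat"):
    """Render text as HTML, turning each NE span into a hyperlink.

    spans: non-overlapping (start, end, category) character offsets.
    The link target (base_url with cat/ne query parameters) is hypothetical.
    """
    out = []
    pos = 0
    for start, end, category in sorted(spans):
        out.append(html.escape(text[pos:start]))  # plain text before the entity
        surface = text[start:end]
        href = f"{base_url}?cat={quote(category)}&ne={quote(surface)}"
        out.append(
            f'<a class="ne-{html.escape(category)}" href="{html.escape(href)}">'
            f"{html.escape(surface)}</a>"
        )
        pos = end
    out.append(html.escape(text[pos:]))  # trailing plain text
    return "".join(out)


sentence = "Acme Pharma hired Mr Garcia."
page = link_entities(sentence, [(0, 11, "organisation"), (18, 27, "person")])
print(page)
```

A per-category class name is one simple way to support the colour coding of entity types in the displayed text.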


NEAT: Named Entity Access to Text

RUN

http://dali.dcs.shef.ac.uk/trestle/secured/user.testing/index.html


SCAT: Scenario Access To Text

RUN

http://dali.dcs.shef.ac.uk/trestle/secured/user.testing/index.html


Preliminary User Evaluation: Methodology

Prelude to full end-user study: preliminary study with 8 Information Studies postgraduate students

Aim: to gain insight into
• ease of use and learnability of the system
• preferred strategies for accessing text
• problems in interpreting the interface

Instruments: usability questionnaire, verbal protocols, observational notes

Procedure:
• brief verbal introduction to evaluation and system
• undirected exploration of system, asking questions/providing comments
• simulated tasks of real end users, e.g.:

“You've heard that one of your colleagues, Mr Garcia, has recently accepted an appointment at another pharmaceutical company. You want to find out which company he will be moving to and what post he has taken up.”


Preliminary User Evaluation: Access Strategies

NEAT: access to named entities was made available in three ways:

1. by clicking directly on a list of NE categories in the access frame

2. through the NE index look up query box in the access frame

3. through highlighted entries in a full article displayed in the text frame

Observation: users preferred 2 over 1 or 3, regardless of task
• perhaps because users knew what they were looking for
• perhaps because a query box is more familiar than browsing NEs
• perhaps because of prominence of NE lookup box in interface

SCAT:

Observation: for tasks where SCAT was appropriate, users opted for NE index lookup
• perhaps because of novelty of scenario tracking
• perhaps because SCAT functionality was not clear from interface


Preliminary User Evaluation: User Perceptions

Colour coding + hyper-linking of NEs
• Highly noticeable; some objections to colour choice
• Disagreement about utility: distracting when reading full texts, but highly useful in leading to related previous Scrip texts
• Integration of current awareness + retrospective searching via NEs highly appreciated

NE index look-up
• Found very useful by all but one participant
• Some confusion over scope: differences wrt free-text search/only 5 searchable NE categories
• Exact string matching limiting (limitation now removed)

Scenario Tracking
• Function misunderstood from labelling in access frame
• Confusion between SCAT summaries and headlines
• Flag icons for summaries in headline lists not well understood
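
The remark above about exact string matching can be illustrated by comparing an exact lookup with a relaxed one. The slides do not say how the limitation was removed, so the case-insensitive substring matching below is only an assumed approach, and the function names and example index entries are invented.

```python
def exact_lookup(names, query):
    """Return index entries identical to the query string."""
    return [n for n in names if n == query]


def relaxed_lookup(names, query):
    """Return index entries containing the query, ignoring case and extra spaces."""
    needle = " ".join(query.casefold().split())
    if not needle:
        return []
    return [n for n in names if needle in " ".join(n.casefold().split())]


# Invented index entries for illustration.
entries = ["Garcia", "Mr J. Garcia", "Acme Pharma"]
print(exact_lookup(entries, "garcia"))
print(relaxed_lookup(entries, "garcia"))
```

Under exact matching a user who types a name in lower case, or only a surname, finds nothing; the relaxed version returns every entry that contains the typed string.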


Conclusions (I)

To date IE largely a “technology push” activity

For IE technology to become usable and influenced by end user requirements (“user pull”), end user prototypes must be built which:
• exploit the significant achievement of the technology to date
• acknowledge its limitations

TRESTLE attempts to do this by exploiting NE and scenario template IE technology to offer users
• novel ways to access textual information
• via a familiar text browsing interface


Conclusions (II)

Preliminary user evaluation has revealed:

• search options initially selected from the access frame were not always optimal for set tasks
• on the whole, colour-coded textual/iconic cues in headline index + full text enabled users to exploit the different functions seamlessly
• interface supported interaction at procedural level, but some misunderstanding at the conceptual level, esp. scenario access
  • other studies report similar issues in introducing more complex interactive search functions
  • further investigation + modifications (e.g. to labelling) underway

Full evaluation in real end user environment now being organised
• To answer question: can professional information workers use IE-based searching and awareness approaches effectively?


The End
