CNI, 3rd April 2006 Slide 1
UK National Centre for Text Mining:
Activities and Plans
Dr. Robert Sanderson
Dept. of Computer Science
University of Liverpool
http://www.nactem.ac.uk
CNI, 3rd April 2006 Slide 2
Overview
Text Mining?
NaCTeM
Consortium Components
Service Infrastructure
Future Work
CNI, 3rd April 2006 Slide 3
Centre for ...
National Centre for ... what was that?
Ticks Mining!
TEXT
CNI, 3rd April 2006 Slide 4
... Text Mining?
Text Mining: No canonical definition
Commonly used definition based on Data Mining:
“The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.”
“The non-trivial extraction of previously unknown, interesting facts from an invariably large collection of texts.”
CNI, 3rd April 2006 Slide 5
... Text Mining?
Typical Data Mining Functions:
Classification
Association Rule Mining
Clustering
Useful when applied to texts, but they don't fulfill the definition, as they don't discover “facts”.
Information Retrieval also doesn't discover facts.
CNI, 3rd April 2006 Slide 6
... Text Mining?
Need to understand the meaning of the text:
Part of Speech tagging
Clauses
Named Entity Recognition
Find correlations of entities
Infer information from logical chains
Result: New Knowledge
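
The "find correlations of entities" step can be illustrated with a minimal sketch. It assumes tagging and entity recognition have already run, so each sentence is reduced to a set of entity names. The data and the function name `cooccurrence_counts` are hypothetical, and simple sentence-level co-occurrence counting stands in for whatever correlation measure a real system would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: the named entities found in each sentence.
# A real pipeline would produce these sets with a tagger and entity recogniser.
sentences = [
    {"Google", "NASDAQ"},
    {"Google", "Microsoft"},
    {"Google", "NASDAQ", "Microsoft"},
    {"NASDAQ"},
]

def cooccurrence_counts(entity_sets):
    """Count how often each unordered pair of entities shares a sentence."""
    counts = Counter()
    for entities in entity_sets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(sentences)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Pairs that co-occur unusually often are candidates for a relationship worth inspecting; the later inference step is not covered by this sketch.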
CNI, 3rd April 2006 Slide 7
Other Benefits
Plus a lot more:
Improved document classification
Automatic semantic annotation of documents
Improved access -- search by semantics and concepts
Improved clustering of documents by concept
Summarization
Visualization techniques
CNI, 3rd April 2006 Slide 8
Event Extraction
Extract events from the text along with information about the participants
Can be modeled as relationships between named entities
Extracting events allows discovery of hidden temporal correlations
e.g. Google refuses to announce plans. Google's stock falls.
Improves understanding of the semantics, improving the functions based around those semantics
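
One way to picture events as relationships between named entities is a small record type plus a search for one kind of event following another. This is only a sketch under stated assumptions: the `Event` class, the `followed_by` helper, the action strings and the dates are all hypothetical and are not taken from any NaCTeM tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Event:
    participant: str   # the named entity involved
    action: str        # what happened
    when: date         # when it happened

def followed_by(events, first_action, second_action, max_days):
    """Return (first, second) pairs that share a participant, where the
    second event happens no more than max_days after the first."""
    pairs = []
    for a in events:
        if a.action != first_action:
            continue
        for b in events:
            if (b.action == second_action
                    and b.participant == a.participant
                    and 0 <= (b.when - a.when).days <= max_days):
                pairs.append((a, b))
    return pairs

# Hypothetical extracted events; the dates are invented for illustration.
events = [
    Event("Google", "refuses to announce plans", date(2006, 3, 1)),
    Event("Google", "stock falls", date(2006, 3, 2)),
    Event("Microsoft", "stock falls", date(2006, 3, 2)),
]

matches = followed_by(events, "refuses to announce plans", "stock falls", max_days=7)
for first, second in matches:
    print(first.participant, ":", first.action, "->", second.action)
```

Only the Google pair matches, because the Microsoft event has no preceding announcement event for the same participant.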
CNI, 3rd April 2006 Slide 9
NaCTeM
Hosted at University of Manchester
Participants: Universities of Manchester, Liverpool, Salford
Plus: San Diego Supercomputer Centre, University of Tokyo, University of Geneva, University of California Berkeley
Six full time posts for 3 years (2005-2007)
Plus active board of directors and experts
Current Director: Professor Jun'ichi Tsujii from U.Tokyo
Funding: JISC, BBSRC, EPSRC
CNI, 3rd April 2006 Slide 10
NaCTeM Aims
Provide text mining oriented services
Facilitate access to text mining resources
User support, advice, training and consultancy
Participate in international research
Formulate best practice guidelines
Increase awareness of text mining in all domains
Develop links with industrial partners involved in text mining
CNI, 3rd April 2006 Slide 11
Components
Liverpool: Cheshire3 (Information framework)
Manchester: CAFETIERE (Entity recognition, event extraction)
Salford: TerMine (Automatic term recognition)
SDSC: Storage Resource Broker (Data grid)
UC Berkeley: Cheshire, TM/IR expertise
U.Tokyo: GENIA, ENJU (Text analysis tools)
U.Geneva: User studies and evaluation
CNI, 3rd April 2006 Slide 12
Cheshire3
Information Processing Framework
Liverpool and UC Berkeley
Standards based: XML, SRU, Unicode, etc.
Scalable: Single machine to Grid (PVM, MPI, SRB)
Extensible: Python + C, Object Oriented with stable API
Work ongoing to integrate Data Mining tools and other information processing applications
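
Because SRU is listed among the supported standards, a search against an SRU service can be expressed as a plain URL. The sketch below builds an SRU 1.1 `searchRetrieve` request using the standard parameter names. The endpoint `http://example.org/sru/db` is hypothetical, and whether a given server supports the `dc.title` index is an assumption.

```python
from urllib.parse import urlencode

def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,                      # CQL query string
        "maximumRecords": str(maximum_records),  # page size
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint; replace with the address of a real SRU server.
url = sru_search_url("http://example.org/sru/db", 'dc.title = "text mining"')
print(url)
```

The response to such a request is an XML document, so any XML-aware client can consume it without Cheshire3-specific code.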
CNI, 3rd April 2006 Slide 13
Cheshire3 Examples
Integrated tools from other participants in preparation for NaCTeM service infrastructure.
Medline: 4350 records/second using 60 concurrent processes on SDSC's Teragrid cluster
440 seconds to index 1 field from 16 million MARC records
Distributed network of Archival Descriptions in the UK
NARA ERA prototype system with SDSC
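
The two throughput figures above can be put on a common footing with simple arithmetic. The sketch uses only the numbers quoted on the slide. The slide does not say how many processes the MARC run used, and the two workloads differ (Medline processing versus indexing a single MARC field), so the rates are not directly comparable.

```python
# Medline: aggregate rate and number of concurrent processes (from the slide).
medline_records_per_second = 4350
medline_processes = 60
medline_per_process = medline_records_per_second / medline_processes

# MARC: total records and elapsed time for indexing one field (from the slide).
marc_records = 16_000_000
marc_seconds = 440
marc_records_per_second = marc_records / marc_seconds

print(f"Medline: {medline_per_process:.1f} records/second per process")
print(f"MARC: {marc_records_per_second:.0f} records/second overall")
```

This works out to 72.5 Medline records per second per process, and roughly 36,000 MARC records per second overall.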
CNI, 3rd April 2006 Slide 14
CAFETIERE
Entity Recognition and Annotation
University of Manchester
Discovers named entities in part-of-speech-tagged text
Discovers temporal events referring to those entities
Integration of ontologies and term processing
Rule-based
CNI, 3rd April 2006 Slide 15
CAFETIERE Example
CNI, 3rd April 2006 Slide 16
TerMine
Automatic Term Recognition
University of Salford/Manchester
Discovers important terms
Assigns 'C-value' score to rank terms
Interaction with terminology databases for term management
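
The C-value score can be sketched from its commonly published form: a candidate term's frequency is weighted by the log of its length in words, and a term that appears nested inside longer candidates is discounted by the average frequency of those longer candidates. The code below is an illustration under that assumption, not TerMine's implementation. The function names and the frequencies are hypothetical, and candidate extraction (normally done with linguistic filters) is skipped.

```python
import math

def contains(longer, shorter):
    """True if the word sequence `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))

def c_values(freq):
    """Score multi-word candidate terms.

    freq maps each candidate term (a tuple of words) to its corpus frequency.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        nesting = [f_b for other, f_b in freq.items() if contains(other, term)]
        if nesting:
            # Discount by the average frequency of the containing terms.
            f = f - sum(nesting) / len(nesting)
        scores[term] = math.log2(len(term)) * f
    return scores

# Hypothetical candidate terms and frequencies.
freq = {
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 8,
}

scores = c_values(freq)
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(" ".join(term), round(score, 2))
```

Here "cell carcinoma" is more frequent than "basal cell carcinoma" but ranks lower, because five of its eight occurrences are accounted for by the longer term.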
CNI, 3rd April 2006 Slide 17
TerMine Example
CNI, 3rd April 2006 Slide 18
U. Tokyo Tools
Natural Language Parsing
University of Tokyo
Tagger, Chunker, ENJU, GENIA
Necessary for any text mining application
Fast and accurate
http://www-tsujii.is.s.u-tokyo.ac.jp/hiiragi/
http://www-tsujii.is.s.u-tokyo.ac.jp/CytoSailing/
CNI, 3rd April 2006 Slide 19
Tokyo Tools Example
CNI, 3rd April 2006 Slide 20
Tokyo Tools Example 2
CNI, 3rd April 2006 Slide 21
Service Infrastructure
NaCTeM will allow UK researchers to perform text mining on their own data in combination with other accessible resources (e.g. other data sets, ontologies, etc.)
Requirements:
Lots of processing power
Lots of storage capacity
Easily extensible/configurable service framework
Access to cutting edge TM, DM and IR tools
CNI, 3rd April 2006 Slide 22
Service Infrastructure
Processing provided by UK National Grid Service
Data Storage via SDSC's Storage Resource Broker
Important to store multiple versions of each document
Cheshire3 provides the Grid enabled information infrastructure
Plus information retrieval and data mining tools
Manchester and Tokyo provide the text mining tools
Stable tools integrated into Cheshire3 already
CNI, 3rd April 2006 Slide 23
Service Infrastructure
Initial NaCTeM services will be focused on the bio domain:
Bio-informatics is a growing field
Interest from both academic and corporate sectors
Large datasets/services available (MeSH, Medline, ...)
Web portal interaction
Then expand into other areas, such as Social Sciences and Historical text analysis.
CNI, 3rd April 2006 Slide 24
Future Work
Services for other domains
GUI Workflow configuration
Integration of user developed services and applications
Maximizing workflow potential with 'smart' components
Standardizing annotation schemas
Conference/Workshop
Other?
CNI, 3rd April 2006 Slide 25
Thank You
Questions?
...
Reception!