Apache UIMA - hands on code
Lesson about real UIMA use cases, integration with search engines, and a short hands-on coding session
Gestione delle Informazioni su Web - 2010/2011
Tommaso Teofili
tommaso [at] apache [dot] org
Use Cases - Agenda
UC1 : Real estate market analysis
UC2 : Tenders automatic information extraction
UIMA & search engines
Tutorial
Assignment
UC1 : Source
An online announcement site for sellers and buyers
General purpose (cars, real estate, hi-fi, etc.)
Local scope (Rome and nearby)
UC1 - Goals
Track real estate market in order to:
Make smart decisions
Predict how things will go in the (near) future
The text of estate listings is unstructured
Aggregate queries for statistical analysis need structured information
UC1 - Source
UC1 - Blocks
UC1 - Crawler
A specialized crawler extracts data from the source
Estate listing data is stored in files, grouped by zone, in a directory on a managed machine
Site navigation is defined with one XML file per city zone
The crawler downloads page fragments twice a week
The free text extracted from the estate listings is saved as XML, grouped by zone
UC1 - Crawler
Issues :
Cookies must be enabled
Some HTTP headers are required
Fixed sleep intervals are needed between requests
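The three issues above can be sketched with Python's standard library. This is a minimal sketch, not the crawler used in the project: the header values, the delay, and the function names `build_opener_with_cookies` and `fetch_all` are assumptions made for illustration.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def build_opener_with_cookies(headers):
    """Return an opener that keeps cookies and sends fixed headers."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # addheaders is a list of (name, value) pairs sent with every request.
    opener.addheaders = list(headers.items())
    return opener


def fetch_all(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, sleeping a fixed interval between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # fixed pause between consecutive requests
        pages.append(fetch(url))
    return pages
```

In real use, `fetch` could be `lambda url: opener.open(url).read()`. Passing `fetch` and `sleep` as parameters keeps the loop testable without touching the network.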
UC1 - Domain
Announcement
Zone
MagazineNumber
HouseStructure (with properties)
UC1 - Information Extraction Engine
Goal : extract price, zone and telephone number
The first version used huge regular expressions
Hard to maintain and inefficient
Poor extraction quality
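To illustrate the regular-expression approach, here is a minimal sketch that pulls a price and a telephone number out of listing text. The patterns are assumptions (prices written with dot thousands separators, phone numbers introduced by a "tel" cue) and are not the expressions used in the original engine.

```python
import re

# Assumed price format: digits grouped by dots, e.g. "295.000".
PRICE_RE = re.compile(r"\b\d{1,3}(?:\.\d{3})+\b")
# Assumed phone format: a "tel" cue, then digits optionally split by "/" or a space.
PHONE_RE = re.compile(r"\btel\.?\s*(\d{2,4}[/ ]?\d{5,8})\b", re.IGNORECASE)


def extract_price(text):
    """Return the first dot-grouped amount as an integer, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return int(match.group(0).replace(".", ""))


def extract_phone(text):
    """Return the first phone number after a "tel" cue as digits only, or None."""
    match = PHONE_RE.search(text)
    if match is None:
        return None
    return re.sub(r"[/ ]", "", match.group(1))
```

Even at this size, every new listing style needs another alternative in the pattern, which is how such expressions become huge and hard to maintain.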
UC1 - IE Engine
New requirements: extract the structure of the house
Number of rooms, garage (box), garden(s), external spaces, number of bathrooms, kitchen, etc.
Track finer-grained zones
Sample text
“ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto € 295.000”
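A dictionary lookup is one simple way to recover the house structure from text like the sample. The sketch below is an illustration only: the term-to-feature mapping `FEATURE_TERMS` and the feature names are hypothetical, not the project's actual dictionaries.

```python
import re

# Hypothetical mapping from Italian listing terms to structure features.
FEATURE_TERMS = {
    "salone": "living_room",
    "cucina": "kitchen",
    "camera": "bedroom",
    "cameretta": "bedroom",
    "bagno": "bathroom",
    "posto auto": "parking_space",
    "box": "garage",
    "giardino": "garden",
}


def extract_structure(text):
    """Count whole-word occurrences of each known term, grouped by feature."""
    counts = {}
    lowered = text.lower()
    for term, feature in FEATURE_TERMS.items():
        hits = re.findall(r"\b" + re.escape(term) + r"\b", lowered)
        if hits:
            counts[feature] = counts.get(feature, 0) + len(hits)
    return counts
```

On the room list from the sample, "camera" and "cameretta" both count as bedrooms, so the result reports two bedrooms, one bathroom, one kitchen, one living room and one parking space.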
UC1 - ContentAnnotator
Only the estate listings must be extracted from the XML produced by the crawler
A simple parser retrieves each node containing an estate listing (whose text is itself unstructured)
A ContentAnnotation is created over the document
ContentAnnotation
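The ContentAnnotator idea can be sketched in plain Python: parse the crawler XML, concatenate the listing texts into one document, and record begin/end offsets for each listing, much as a UIMA annotation spans a region of the document text. The XML layout (`<zone name="...">` containing `<listing>` nodes) is a hypothetical one, since the crawler's real format is not shown here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    begin: int  # offset of the first character of the listing
    end: int    # offset one past the last character of the listing
    zone: str


def annotate_contents(xml_text):
    """Return (document_text, annotations) covering every non-empty <listing>."""
    root = ET.fromstring(xml_text)
    parts, annotations, offset = [], [], 0
    for zone in root.iter("zone"):
        for listing in zone.iter("listing"):
            text = (listing.text or "").strip()
            if not text:
                continue
            annotations.append(
                ContentAnnotation(offset, offset + len(text), zone.get("name", ""))
            )
            parts.append(text)
            offset += len(text) + 1  # +1 for the newline joining the listings
    return "\n".join(parts), annotations
```

Later annotators can then work only inside each `ContentAnnotation` span instead of scanning the whole document.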
UC1 - Entities
UC1 - ZoneAnnotation
UC1 - Consuming extracted information
The previous version of the IE engine produced XML files that had to be parsed again to store structured data in the DB
With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically write the extracted information to the DB
UIMA - CAS Consumer
Analysis Engine responsible for consuming information contained inside the CAS
Can write extracted information to:
DBMS
Lucene index
Filesystem
...
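As a rough illustration of the DBMS case, the sketch below stores already-extracted listing fields in SQLite and runs an aggregate query over them. It is plain Python standing in for a consumer step, not UIMA code; the table layout and the function names `consume` and `average_price_by_zone` are assumptions.

```python
import sqlite3


def consume(connection, listings):
    """Store extracted listing fields; stands in for a CAS Consumer step."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS listing (zone TEXT, price INTEGER, phone TEXT)"
    )
    connection.executemany(
        "INSERT INTO listing (zone, price, phone) VALUES (?, ?, ?)",
        [(item["zone"], item["price"], item.get("phone")) for item in listings],
    )
    connection.commit()


def average_price_by_zone(connection):
    """Aggregate query that only works once the data is structured."""
    rows = connection.execute(
        "SELECT zone, AVG(price) FROM listing GROUP BY zone ORDER BY zone"
    )
    return dict(rows.fetchall())
```

Writing straight to the database removes the intermediate XML files and the extra parsing pass they required.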
UC1 - Analysis Graphs
UC2 - Monitor of EU announcements
Monitor various sources that publish announcements and tenders
Automate the lengthy monitoring of these sources and automatically extract useful common information from the announcements’ texts
UC2 - Blocks
Different input texts
UC2 - Domain annotations
Language
Abstract
Activity
Beneficiary
Budget
Expiration date
Funding type
Geographic region
Sector
Subject
Title
Tags
UC2 - Domain entities
The first and most important one is an entity that represents the entire tender or announcement
Domain annotations eventually fill the properties of this entity
Each annotator first checks:
whether some metadata was extracted during navigation
the most common patterns used to state information in such announcements
e.g.: “Budget: 200000$” or “Financial information: ......”
Such patterns are common across different languages
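The "Label: value" pattern can be sketched as a simple line scanner. The label lists in `FIELD_LABELS` (including the Italian ones) are hypothetical examples chosen for illustration, not the project's actual dictionaries.

```python
# Hypothetical labels per field, in more than one language.
FIELD_LABELS = {
    "budget": ["budget", "financial information", "importo"],
    "expiration_date": ["deadline", "scadenza"],
}


def extract_labeled_fields(text):
    """Find 'Label: value' lines and map known labels to field names."""
    found = {}
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        if not separator:
            continue  # no colon on this line, so no label to check
        key = label.strip().lower()
        for field, labels in FIELD_LABELS.items():
            # Keep the first value found for each field.
            if key in labels and field not in found:
                found[field] = value.strip()
    return found
```

Because only the label list changes from one language to another, the same check can run on announcements in different languages.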
UC2 - Simple first
UC2 - AbstractAnnotator
The abstract is usually in the first part of the document
We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences
We use a dictionary of “good” words and linguistic patterns
We search the first sentences of the document for the objectives of the announcement
UC2 - ExpirationDateAnnotator
A DateAnnotator is executed first
Iterate over the DateAnnotations
Get the sentences containing those DateAnnotations
Check whether terms or patterns like “the deadline is ...” appear near a DateAnnotation
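The steps above can be sketched in plain Python with begin/end spans in place of UIMA annotations. The date format, the cue terms and the naive sentence splitter are all assumptions made for illustration, and "near" is simplified to "in the same sentence".

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    begin: int
    end: int


DEADLINE_CUES = ("deadline", "expires", "no later than")  # assumed cue terms
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")        # assumed date format


def find_dates(text):
    """Stand-in for the DateAnnotator: one span per date match."""
    return [Span(m.start(), m.end()) for m in DATE_RE.finditer(text)]


def find_sentences(text):
    """Naive splitter: a run of characters up to and including '.', '!' or '?'."""
    return [Span(m.start(), m.end()) for m in re.finditer(r"[^.!?]+[.!?]?", text)]


def find_expiration_dates(text):
    """Return dates whose enclosing sentence contains a deadline cue."""
    sentences = find_sentences(text)
    result = []
    for date in find_dates(text):
        for sentence in sentences:
            if sentence.begin <= date.begin and date.end <= sentence.end:
                covered = text[sentence.begin:sentence.end].lower()
                if any(cue in covered for cue in DEADLINE_CUES):
                    result.append(text[date.begin:date.end])
                break  # a date belongs to at most one sentence
    return result
```

Running the date step first means this annotator only has to decide which of the already-found dates is the expiration date.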
UC2 - BandoEntity
UIMA & Search Engines
Decorate documents with automatically extracted metadata to improve the search experience
relevance
results
clustering
Information Retrieval and Named Entities
UIMA & Search Engines
“Push” scenario:
documents are sent to UIMA, which extracts metadata and writes it to the index with a CAS Consumer
“Pull” scenario:
documents are sent to Lucene, which asks UIMA to extract metadata and then writes it to the index itself
“On demand” scenario:
metadata is extracted only on demand, each time a document is retrieved or shown...
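The difference between the three scenarios is mainly who calls the analysis and when. The toy sketch below makes that explicit; `extract_metadata` and `InMemoryIndex` are hypothetical stand-ins for a UIMA pipeline and a Lucene index, and neither uses the real APIs.

```python
def extract_metadata(text):
    """Stand-in for a UIMA pipeline: treats capitalized words as entities."""
    return {"entities": sorted({w for w in text.split() if w[:1].isupper()})}


class InMemoryIndex:
    """Toy index standing in for Lucene."""

    def __init__(self, enrich=None):
        self.docs = {}
        self.enrich = enrich  # set only in the "pull" scenario

    def add(self, doc_id, text, metadata=None):
        if metadata is None and self.enrich is not None:
            metadata = self.enrich(text)  # pull: the index asks for metadata
        self.docs[doc_id] = {"text": text, "metadata": metadata}


def push(index, doc_id, text):
    # Push: analysis runs first, then its consumer step writes to the index.
    index.add(doc_id, text, extract_metadata(text))


def on_demand(index, doc_id):
    # On demand: nothing is stored; metadata is computed at retrieval time.
    doc = index.docs[doc_id]
    return {"text": doc["text"], "metadata": extract_metadata(doc["text"])}
```

Push and pull pay the analysis cost once at indexing time and make the metadata searchable, while on demand pays it on every retrieval and leaves the index unchanged.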
UIMA - tutorial
create a Type System
create an Analysis Engine descriptor
create a simple Annotator
Assignment
Named Entity Recognition
sport: person, player, coach, team, competition
videogames: person, videogame character, videogame, software house, hardware requirement
Precision & Recall test
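For the test, precision is the fraction of predicted entities that are correct and recall is the fraction of gold entities that were found. A minimal sketch, assuming entities are compared by exact match on `(begin, end, type)` tuples (one possible matching criterion among several):

```python
def precision_recall(predicted, gold):
    """Compare predicted and gold entity sets of (begin, end, type) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Convention used here: an empty set gives 0.0 rather than a division error.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, with 3 predicted entities of which 2 appear among 4 gold entities, precision is 2/3 and recall is 2/4.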