Apache UIMA - hands on code

Gestione delle Informazioni su Web - 2010/2011

Tommaso Teofili

tommaso [at] apache [dot] org

Use Cases - Agenda

UC1 : Real Estate market analysis

UC2 : Automatic information extraction from tenders

UIMA & search engines

Tutorial

Assignment

UC1 : Source

An online announcement site for sellers and buyers

General purpose (cars, real estate, hi-fi, etc.)

Local scope (Rome and nearby)

UC1 - Goals

Track real estate market in order to:

Make smart decisions

Predict how things will go in the (near) future

Estate listing text is unstructured

Aggregate queries for statistical analysis need structured information

UC1 - Source

UC1 - Blocks

UC1 - Crawler

A specialized crawler extracts data from the source

Estate listing data is stored in files, grouped by zone, in a directory on a managed machine

Site navigation is defined with one XML file per city zone

The crawler downloads page fragments twice a week

The free text extracted from the estate listings is saved as XML, grouped by zone

UC1 - Crawler

Issues :

Cookies must be enabled

Some HTTP headers are required

Fixed sleep intervals are needed between requests
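
The three issues above can be sketched with the JDK's own HTTP client. This is a minimal illustration, not the crawler actually used: the header values, the 5-second pause and the class name `PoliteFetcher` are assumptions, since the slides do not state them.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

class PoliteFetcher {
    // Assumed pause between requests; the real interval is not given in the slides.
    static final Duration PAUSE = Duration.ofSeconds(5);

    // A client that stores and resends cookies across requests.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .build();
    }

    // Builds a GET request with browser-like headers (hypothetical values).
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0 (compatible; ExampleCrawler/0.1)")
                .header("Accept-Language", "it-IT,it;q=0.8")
                .GET()
                .build();
    }

    // Downloads each URL in turn, sleeping a fixed interval between requests.
    static List<String> fetchAll(List<String> urls) throws Exception {
        HttpClient client = newClient();
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                Thread.sleep(PAUSE.toMillis());
            }
            HttpResponse<String> response =
                    client.send(buildRequest(urls.get(i)), HttpResponse.BodyHandlers.ofString());
            pages.add(response.body());
        }
        return pages;
    }
}
```

Sharing one client across all requests is what keeps the session cookies alive from one page to the next.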

UC1 - Domain

Announcement

Zone

MagazineNumber

HouseStructure (with properties)

UC1 - Information Extraction Engine

Goal : extract price, zone and telephone number

The first version used huge regular expressions

Hard to maintain and inefficient

Poor extraction
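
To give an idea of what regex-based extraction looks like, here is a small sketch for price, zone and telephone number. The patterns are hypothetical and far simpler than the ones the first version would have needed: the price is assumed to follow a euro sign, the zone is assumed to be the first all-uppercase word, and the phone number is assumed to be 9 to 11 digits starting with 0 or 3.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingRegexExtractor {
    // Euro sign, then digits with optional dot thousands separators.
    static final Pattern PRICE = Pattern.compile("\u20AC\\s*(\\d{1,3}(?:\\.\\d{3})*)");
    // Assumed phone format: 9 to 11 digits starting with 0 or 3.
    static final Pattern PHONE = Pattern.compile("\\b([03]\\d{8,10})\\b");
    // Assumed zone format: the first word of 3+ uppercase letters.
    static final Pattern ZONE = Pattern.compile("\\b(\\p{Lu}{3,})\\b");

    // Returns the first capture group of the first match, if any.
    static Optional<String> first(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Parses the price into a number by dropping the thousands separators.
    static Optional<Long> price(String text) {
        return first(PRICE, text).map(s -> Long.parseLong(s.replace(".", "")));
    }

    static Optional<String> zone(String text) {
        return first(ZONE, text);
    }

    static Optional<String> phone(String text) {
        return first(PHONE, text);
    }
}
```

Even in this reduced form, every new listing style means another pattern or another special case, which is where the maintenance cost comes from.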

UC1 - IE Engine

New requirements: extract the structure of the house

Number of rooms, garage (box), garden(s), outdoor spaces, number of bathrooms, kitchen, etc.

Track finer-grained zones

Sample text

“ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto € 295.000”

UC1 - ContentAnnotator

Only the estate listings must be extracted from the XML produced by the crawler

A simple parser gets each node containing an estate listing (whose text is itself unstructured)

Create a ContentAnnotation over the document
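
The parsing step could look roughly like the sketch below, which uses the JDK's DOM parser. The element name `listing` is an assumption: the crawler's real XML layout is not shown in the slides. Creating the ContentAnnotation itself is left out, since it depends on the generated UIMA types.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class ListingParser {
    // Hypothetical element name for one estate listing.
    static final String LISTING_ELEMENT = "listing";

    // Returns the free text of every listing node found in the crawler XML.
    static List<String> listings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(LISTING_ELEMENT);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Wrap parser errors so callers do not have to handle checked exceptions.
            throw new IllegalArgumentException("Malformed crawler XML", e);
        }
    }
}
```

Each returned string is still unstructured text, which is what the later annotators work on.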

ContentAnnotation

UC1 - Entities

UC1 - ZoneAnnotation

UC1 - Consuming extracted information

The previous version of the IE engine produced XML files that had to be parsed again to store structured data in the DB

With UIMA, a CAS Consumer at the end of the analysis pipeline can write the extracted information to the DB automatically

UIMA - CAS Consumer

An Analysis Engine responsible for consuming the information contained in the CAS

Can write extracted information to:

DBMS

Lucene index

Filesystem

...

UC1 - Analysis Graphs

UC2 - Monitor of EU announcements

Monitor various sources that provide announcements and tenders

Automate the lengthy monitoring of these sources and automatically extract useful common information from the announcement texts

UC2 - Blocks

Different input texts

UC2 - Domain annotations

Language

Abstract

Activity

Beneficiary

Budget

Expiration date

Funding type

Geographic region

Sector

Subject

Title

Tags

UC2 - Domain entities

The first and most important is an entity representing the entire tender or announcement

The annotations in the domain ultimately fill this entity's properties

Each annotator first checks:

whether some metadata was extracted during navigation

for the most common patterns used to state information in such announcements

e.g. “Budget: 200000$” or “Financial information: ......”

Such patterns are common across languages
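
A "label: value" pattern of this kind can be matched with a single multi-line regular expression. The sketch below is illustrative only: the label list (including `Deadline`) and the class name are assumptions, and a real annotator would need one label set per language.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabeledFieldExtractor {
    // One "Label: value" pair per line; the labels are hypothetical examples.
    static final Pattern FIELD = Pattern.compile(
            "^\\s*(Budget|Financial information|Deadline)\\s*:\\s*(.+?)\\s*$",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // Returns label -> value, keeping the first value seen for each label.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.putIfAbsent(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }
}
```

Trying the cheap pattern first means the heavier linguistic analysis only has to run on documents where no labelled field is found.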

UC2 - Simple first

UC2 - AbstractAnnotator

The abstract is usually in the first part of the document

We use Tokenizer and Tagger to get Tokens (with PoS tags) and Sentences

We use a dictionary of “good” words and linguistic patterns

We search the first sentences of the document for the objectives of the announcement
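
The dictionary lookup over the first sentences can be approximated as follows. This sketch swaps in a naive punctuation-based sentence splitter for the Tokenizer and Tagger and ignores PoS tags entirely. The word list and the limit of three sentences are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AbstractFinder {
    // Hypothetical dictionary of "good" words that tend to introduce objectives.
    static final Set<String> GOOD_WORDS =
            Set.of("aim", "aims", "objective", "objectives", "purpose", "goal");
    // Assumed number of leading sentences to inspect.
    static final int MAX_SENTENCES = 3;

    // Naive splitter: breaks after '.', '!' or '?' followed by whitespace.
    static List<String> sentences(String text) {
        return List.of(text.trim().split("(?<=[.!?])\\s+"));
    }

    // Returns the leading sentences that contain at least one dictionary word.
    static List<String> abstractSentences(String text) {
        List<String> all = sentences(text);
        List<String> hits = new ArrayList<>();
        for (String sentence : all.subList(0, Math.min(MAX_SENTENCES, all.size()))) {
            for (String token : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (GOOD_WORDS.contains(token)) {
                    hits.add(sentence);
                    break;
                }
            }
        }
        return hits;
    }
}
```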

UC2 - ExpirationDateAnnotator

A DateAnnotator runs first

Iterate over DateAnnotations

Get sentences wrapping such DateAnnotations

Check if some terms or patterns like “the deadline is ...” appear near a DateAnnotation
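
The steps above can be sketched in plain Java. Here a regular expression for `dd/MM/yyyy` dates stands in for the DateAnnotator, and the wrapping sentence is found by scanning to the nearest sentence punctuation. Both are simplifying assumptions, as is the list of cue terms.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpirationDateFinder {
    // Stand-in for the DateAnnotator: dd/MM/yyyy dates only (assumption).
    static final Pattern DATE = Pattern.compile("\\b\\d{2}/\\d{2}/\\d{4}\\b");
    // Hypothetical cue terms that signal an expiration date.
    static final Pattern CUE = Pattern.compile(
            "deadline|expires|no later than|closing date", Pattern.CASE_INSENSITIVE);
    static final String TERMINATORS = ".!?";

    // Returns the sentence that wraps the span [begin, end).
    static String wrappingSentence(String text, int begin, int end) {
        int from = begin;
        while (from > 0 && TERMINATORS.indexOf(text.charAt(from - 1)) < 0) {
            from--;
        }
        int to = end;
        while (to < text.length() && TERMINATORS.indexOf(text.charAt(to)) < 0) {
            to++;
        }
        return text.substring(from, to).trim();
    }

    // Iterates over the dates and returns the first one whose sentence contains a cue.
    static Optional<String> expirationDate(String text) {
        Matcher date = DATE.matcher(text);
        while (date.find()) {
            String sentence = wrappingSentence(text, date.start(), date.end());
            if (CUE.matcher(sentence).find()) {
                return Optional.of(date.group());
            }
        }
        return Optional.empty();
    }
}
```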

UC2 - BandoEntity

UIMA & Search Engines

Decorate documents with automatically extracted metadata to improve search experience

relevance

results

clustering

Information Retrieval and Named Entities

UIMA & Search Engines

“Push” scenario:

documents are sent to UIMA, which extracts metadata and writes it to the index with a CAS Consumer

“Pull” scenario:

documents are sent to Lucene, which asks UIMA to extract metadata and then writes it to the index itself

“On demand” scenario:

metadata is extracted only on demand, each time a document is retrieved or shown

UIMA - tutorial

create a Type System

create an Analysis Engine descriptor

create a simple Annotator

Assignment

Named Entity Recognition

sport: person, player, coach, team, competition

videogames: person, videogame character, videogame, software house, hardware requirement

Precision & Recall test
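
As a reminder, precision is the fraction of extracted entities that are correct, and recall is the fraction of correct entities that were extracted. A minimal way to compute both is sketched below, assuming entities are compared as exact strings. Returning 0 when a set is empty is a convention chosen here, not something the assignment prescribes.

```java
import java.util.HashSet;
import java.util.Set;

class PrecisionRecall {
    // Number of predicted entities that also appear in the gold standard.
    static int truePositives(Set<String> predicted, Set<String> gold) {
        Set<String> hits = new HashSet<>(predicted);
        hits.retainAll(gold);
        return hits.size();
    }

    // Correct predictions divided by all predictions.
    static double precision(Set<String> predicted, Set<String> gold) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(predicted, gold) / predicted.size();
    }

    // Correct predictions divided by all gold-standard entities.
    static double recall(Set<String> predicted, Set<String> gold) {
        if (gold.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(predicted, gold) / gold.size();
    }
}
```

For example, with four predicted entities of which two are among three gold entities, precision is 2/4 and recall is 2/3.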
