
Mining Reference Tables for Automatic Text Segmentation

Eugene Agichtein, Columbia University

Venkatesh Ganti, Microsoft Research

Scenarios

Importing unformatted strings into a target structured database
– Data warehousing
– Data integration

Requires each string to be segmented into the target relation schema

Input strings are prone to errors (e.g., data warehousing, data exchange)

Current Approaches

Rule-based
– Hard to develop, maintain, and deploy comprehensive sets of rules for every domain

Supervised
– E.g., [BSD01]
– Hard to obtain comprehensive datasets needed to train robust models

Our Approach

Exploit large reference tables
– Learn domain-specific dictionaries
– Learn structure within attribute values

Challenges
– Order of attribute concatenation in future test input is unknown
– Robustness to errors in test input after training on clean and standardized reference tables

Problem Statement

Target schema: R[A1,…,An]

For a given string s (a sequence of tokens)
– segment s into s1,…,sn substrings at token boundaries
– map s1,…,sn to Ai1,…,Ain
– maximize P(Ai1|s1)*…*P(Ain|sn) among all possible segmentations of s

The product combination function handles arbitrary concatenation order of attribute values

P(Ai|x), the probability that a string x belongs to Ai, is estimated by an Attribute Recognition Model ARMi

ARMs are learned from a reference relation r[A1,…,An]
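The objective above can be illustrated with a brute-force sketch: try every way of cutting the token sequence into n non-empty segments, try every attribute order, and keep the assignment with the highest product. The per-token probability tables (`TOY_ARMS`), the smoothing constant `UNSEEN`, and the function names are hypothetical stand-ins for learned ARMs, not part of the original system, and missing attribute values are not handled.

```python
from itertools import combinations, permutations
from math import prod

# Hypothetical per-attribute token probabilities standing in for learned ARMs.
TOY_ARMS = {
    "street": {"nw": 0.3, "57th": 0.3, "st": 0.4},
    "city": {"seattle": 0.9},
    "state": {"wa": 0.9},
}
UNSEEN = 0.01  # assumed probability for a token the toy ARM has never seen


def arm_probability(attribute, tokens):
    """Toy stand-in for P(attribute | tokens): product of per-token scores."""
    table = TOY_ARMS[attribute]
    return prod(table.get(token, UNSEEN) for token in tokens)


def best_segmentation(tokens, attributes):
    """Exhaustively search cut points and attribute orders for the best product."""
    n = len(attributes)
    best_score, best_assignment = 0.0, None
    # Choose n-1 cut points among the interior token boundaries.
    for cuts in combinations(range(1, len(tokens)), n - 1):
        bounds = (0, *cuts, len(tokens))
        segments = [tokens[bounds[i]:bounds[i + 1]] for i in range(n)]
        for order in permutations(attributes):
            score = prod(arm_probability(a, s) for a, s in zip(order, segments))
            if score > best_score:
                best_score, best_assignment = score, dict(zip(order, segments))
    return best_score, best_assignment


tokens = "seattle wa nw 57th st".split()
score, assignment = best_segmentation(tokens, list(TOY_ARMS))
print(assignment)
```

Because the score is a plain product over segments, the same search works whichever order the attribute values were concatenated in; the cost is that the number of candidates grows quickly with the number of attributes.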

Segmentation Architecture

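[Figure: segmentation architecture. An INPUT STRING of tokens t1, t2, t3, …, tm passes through SEGMENTATION to produce a SEGMENTED TUPLE (t1 | t2, t3 | … | tm) over attributes A1, A2, …, An. A PRE-PROCESSING/TRAINING stage over the REFERENCE TABLE covers the feature hierarchy and tokenization.]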

Design goals
– Accurately distinguish an attribute value from other attributes
– Generalize to unobserved/new attribute values
– Robust to input errors
– Able to learn over large reference tables

ARM: Instantiation of HMMs

Purpose: Estimate probabilities of token sequences belonging to attributes

ARM: instantiation of HMMs (sequential models)

Acceptance probability: product of emission and transition probabilities

[Figure: example HMM with three states: "Number ending in ‘st’ or ‘th’", "Short word (<= 5 chars)", and "st|rd|wy|blvd". Probabilities shown: 0.3, 0.4, 1.0, 1.0.]
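As a minimal sketch of the acceptance probability described on this slide, the following multiplies the transition and emission probabilities along one fixed state path. The state names and every number in `TRANSITIONS` and `EMISSIONS` are made up for illustration; they are not the values from the figure.

```python
from math import prod

# Hypothetical linear topology: START -> direction -> number -> suffix -> END.
TRANSITIONS = {
    ("START", "direction"): 1.0,
    ("direction", "number"): 0.8,
    ("number", "suffix"): 0.9,
    ("suffix", "END"): 1.0,
}
# Hypothetical emission probabilities per state.
EMISSIONS = {
    "direction": {"nw": 0.5, "ne": 0.5},
    "number": {"57th": 0.2, "42nd": 0.3, "40th": 0.5},
    "suffix": {"st": 0.4, "rd": 0.3, "wy": 0.2, "blvd": 0.1},
}


def acceptance_probability(tokens, path):
    """Product of transition and emission probabilities along a given path."""
    if len(tokens) != len(path):
        return 0.0  # this toy path emits exactly one token per state
    states = ["START", *path, "END"]
    transition_part = prod(
        TRANSITIONS.get(pair, 0.0) for pair in zip(states, states[1:])
    )
    emission_part = prod(
        EMISSIONS[state].get(token, 0.0) for state, token in zip(path, tokens)
    )
    return transition_part * emission_part


path = ["direction", "number", "suffix"]
p_full = acceptance_probability(["nw", "57th", "st"], path)
p_short = acceptance_probability(["nw", "57th"], path)
print(p_full, p_short)
```

The second call returns zero because this strict topology cannot absorb a deleted token, which is the kind of brittleness the later robustness operations are meant to address.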

Instantiating HMMs

Instantiation has to define
– Topology: states & transitions
– Emission & transition probabilities

Current automatic approaches for topology search from among a pre-defined class of topologies are based on cross validation [FC00, BSD01]
– Expensive
– Number of states in the ARM is kept small to keep the search space tractable

Intuition behind ARM Design

Street address examples
– [nw 57th St], [Redmond Woodinville Rd]

Album names
– [The best of eagles], [The fury of aquabats], [Colors Soundtrack]

Large dictionaries (e.g., aquabats, soundtrack, st, …) to exploit

Begin and end tokens are very important to distinguish values of an attribute (nw, st, the, …)

Can learn patterns on tokens (e.g., 57th generalizes to *th)

Need robustness to input errors
– [Best of eagles] for [The best of eagles], [nw 57th] for [nw 57th st]

Large Number of States

Associate a state per token: each state only emits a single base token
– More accurate transition probabilities

Model sizes for many large reference tables are still within a few megabytes
– Not a problem with current main memory sizes!

Prune the number of states (say, remove low frequency tokens) to limit the ARM size
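One simple way to picture frequency-based pruning: count how often each token occurs in the reference values and keep a state only for tokens at or above a threshold. The function name `token_states` and the threshold `min_count` are illustrative assumptions; the slide does not specify the pruning criterion beyond "low frequency".

```python
from collections import Counter


def token_states(values, min_count=2):
    """Keep one state per token that occurs at least min_count times."""
    counts = Counter(
        token for value in values for token in value.lower().split()
    )
    return {token: count for token, count in counts.items() if count >= min_count}


reference_values = [
    "nw 57th St",
    "Redmond Woodinville Rd",
    "ne 40th St",
    "nw 42nd St",
]
states = token_states(reference_values, min_count=2)
print(states)
```

Raising `min_count` shrinks the model at the cost of dropping rare dictionary entries.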

BMT Topology: Relax Positional Specificity

[Figure: BMT topology, with the BEGIN, MIDDLE, and TRAILING categories between the START and END states.]

A single state per distinct symbol within a category; the emission probability of a symbol within a category is the same
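A rough sketch of how reference values might be split into the three BMT categories: the first token goes to BEGIN, the last to TRAILING, and everything in between to MIDDLE. Placing single-token values in BEGIN only is an assumption made here for illustration, as is the function name `bmt_dictionaries`.

```python
from collections import Counter


def bmt_dictionaries(values):
    """Count tokens per positional category (BEGIN, MIDDLE, TRAILING)."""
    categories = {"BEGIN": Counter(), "MIDDLE": Counter(), "TRAILING": Counter()}
    for value in values:
        tokens = value.lower().split()
        if not tokens:
            continue
        categories["BEGIN"][tokens[0]] += 1
        if len(tokens) > 1:
            # Assumption: a single-token value contributes to BEGIN only.
            categories["TRAILING"][tokens[-1]] += 1
        for token in tokens[1:-1]:
            categories["MIDDLE"][token] += 1
    return categories


albums = ["The best of eagles", "The fury of aquabats", "Colors Soundtrack"]
categories = bmt_dictionaries(albums)
print(categories["BEGIN"], categories["MIDDLE"], categories["TRAILING"])
```

Collapsing all interior positions into one MIDDLE category is what relaxes positional specificity: "of" is recognized whether it is the second or the third token.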

Feature Hierarchy: Relax Token Specificity [BSD01]

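[Figure: feature hierarchy. Base tokens (ave, apt, st, 5th, 42nd, 40th, 123, 55, 5, #) are the leaves. Above them sit length-bounded feature classes ([a-z]{1-1}, …, [a-z]{1-9}, [a-z]{1-10}; [0-9]{1-1}, …, [0-9]{1-9}, [0-9]{1-10}; [a-z0-9]{1-2}, …, [a-z0-9]{1-10}), then the classes words [a-z]{1-}, numbers [0-9]{1-}, mixed [a-z0-9]{1-}, and delimiters, all under the root *.]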
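The hierarchy can be read as a chain of increasingly general classes for each base token. The sketch below produces such a chain using the class labels visible in the figure; the maximum bounded length of 10, the ordering of the chain, and the function name `generalizations` are assumptions.

```python
MAX_LEN = 10  # assumed upper bound, taken from the {1-10} labels in the figure


def generalizations(token):
    """Return the token followed by its feature classes, most specific first."""
    token = token.lower()
    if token.isascii() and token.isalpha():
        pattern = "[a-z]"        # words
    elif token.isascii() and token.isdigit():
        pattern = "[0-9]"        # numbers
    elif token.isascii() and token.isalnum():
        pattern = "[a-z0-9]"     # mixed
    else:
        return [token, "delimiters", "*"]
    chain = [token]
    # Length-bounded classes, from the tightest bound up to MAX_LEN.
    chain += [f"{pattern}{{1-{k}}}" for k in range(len(token), MAX_LEN + 1)]
    # Unbounded class for the token type, then the root.
    chain += [f"{pattern}{{1-}}", "*"]
    return chain


print(generalizations("57th"))
print(generalizations("#"))
```

An unseen token such as "61st" shares every class above the base level with "57th", which is how the hierarchy lets an ARM generalize beyond its dictionary.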

Example ARM for Address

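[Figure: example ARM for Address, with BEGIN, MIDDLE, and TRAILING categories between START and END. Labels shown include "Street" and "Address", with the sample text "40th Rd E 50th Street W 42nd St …".]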

Robustness Operations: Relax Sequential Specificity

Make ARMs robust to common errors in the input, i.e., maintain high probability of acceptance despite these errors

Common types of errors [HS98]
– Token deletions
– Token insertions
– Missing values

Intuition: Simulate the effects of such erroneous values over each ARM

Robustness Operations

[Figure: BEGIN, MIDDLE, and TRAILING categories.]

Simulating the effect of token insertions: tokens and the corresponding transition probabilities are copied from the BEGIN state to the MIDDLE state
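The copy operation can be pictured on emission tables alone: BEGIN tokens are added to MIDDLE with a discount and the result is renormalized, so a value whose true first token has been pushed back by an inserted token is still accepted. The discount `weight`, the renormalization step, and the function name are assumptions, and the sketch leaves out the matching copy of transition probabilities.

```python
def simulate_token_insertion(begin, middle, weight=0.5):
    """Copy BEGIN emissions into MIDDLE with a discount, then renormalize."""
    merged = dict(middle)
    for token, probability in begin.items():
        merged[token] = merged.get(token, 0.0) + weight * probability
    total = sum(merged.values())
    return {token: probability / total for token, probability in merged.items()}


# Hypothetical emission tables for an album-name ARM.
begin = {"the": 0.8, "colors": 0.2}
middle = {"of": 0.5, "best": 0.25, "fury": 0.25}
robust_middle = simulate_token_insertion(begin, middle)
print(robust_middle)
```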

Transition Probabilities

Transitions B→M, B→T, M→M, and M→T are allowed

Learned from examples in the reference table

Transition probabilities are also weighted by their ability to distinguish an attribute
– A transition "*" → "*", which is common across many attributes, gets a low weight

Summary of ARM Instantiation

BMT topology

Token hierarchy to generalize observed patterns

Robustness operations on HMMs to address input errors

One state per token in the reference table to exploit large dictionaries

Attribute Order Determination

If attribute order is known
– Can use dynamic programming algorithm to segment [Rabiner89]

If attribute order is unknown
– Can ask the user to provide attribute order
– Can discover attribute order

Naïve expensive strategy: evaluate all concatenation orders and segmentations for each input string

Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples
– Several datasets on the web satisfy this assumption
– Allows us to efficiently determine the attribute order over a batch of tuples and segment input strings (using dynamic programming)
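For the known-order case, a dynamic program over (attribute index, token position) finds the best cut points without enumerating every segmentation. The sketch below is a generic formulation under stated assumptions, not the authors' implementation: `TOY_ARMS`, `UNSEEN`, and the function names are hypothetical, every attribute must receive at least one token, and order discovery over a batch is not covered.

```python
from math import prod

# Hypothetical per-attribute token probabilities standing in for learned ARMs.
TOY_ARMS = {
    "street": {"nw": 0.3, "57th": 0.3, "st": 0.4},
    "city": {"seattle": 0.9},
    "state": {"wa": 0.9},
}
UNSEEN = 0.01  # assumed probability for unseen tokens


def arm_probability(attribute, tokens):
    table = TOY_ARMS[attribute]
    return prod(table.get(token, UNSEEN) for token in tokens)


def segment_known_order(tokens, order):
    """Best segmentation of tokens into len(order) non-empty segments."""
    m, n = len(tokens), len(order)
    if m < n:
        raise ValueError("need at least one token per attribute")
    # best[i][j]: best score for giving the first j tokens to the first i attributes.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            for k in range(i - 1, j):  # segment i covers tokens[k:j]
                score = best[i - 1][k] * arm_probability(order[i - 1], tokens[k:j])
                if score > best[i][j]:
                    best[i][j], back[i][j] = score, k
    # Walk the back-pointers from the last attribute to the first.
    segments, j = [], m
    for i in range(n, 0, -1):
        k = back[i][j]
        segments.append(tokens[k:j])
        j = k
    segments.reverse()
    return best[n][m], dict(zip(order, segments))


tokens = "nw 57th st seattle wa".split()
score, assignment = segment_known_order(tokens, ["street", "city", "state"])
print(assignment)
```

Once a batch-wide order has been fixed, each string costs on the order of n·m² ARM evaluations instead of a search over all orders and cut points.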

Segmentation Algorithm (runtime)

[Figure: runtime segmentation pipeline. INPUT STRING → LEARN ATTRIBUTE VALUE ORDER → SEGMENT (dynamic programming algorithm) → OUTPUT TUPLE.]

Experimental Evaluation

Reference relations from several domains
– Addresses: 1,000,000 tuples [Name, #1, #2, Street Address, City, State, Zip]
– Media: 280,000 tuples [ArtistName, AlbumName, TrackName]
– Bibliography: 100,000 tuples [Title, Author, Journal, Volume, Month, Year]

Compare CRAM (our system) with DataMold [BSD01]

Test Datasets

Naturally erroneous datasets: unformatted input strings seen in operational databases
– Media
– Customer addresses

Controlled error injection:
– Clean reference table tuples → [Inject errors] → Concatenate to generate input strings
– Evaluate whether a segmentation algorithm recovered the original tuple
– Accuracy measure: % of attribute values correctly recognized
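The accuracy measure can be computed by comparing each predicted attribute value with the original tuple it was generated from. The sketch assumes tuples are represented as dictionaries keyed by attribute name and that a value counts as correct only on an exact match; both are illustrative choices.

```python
def attribute_accuracy(predicted, gold):
    """Percentage of attribute values that exactly match the original tuples."""
    total = correct = 0
    for predicted_tuple, gold_tuple in zip(predicted, gold):
        for attribute, value in gold_tuple.items():
            total += 1
            correct += int(predicted_tuple.get(attribute) == value)
    return 100.0 * correct / total if total else 0.0


gold = [
    {"artist": "eagles", "album": "the best of eagles"},
    {"artist": "aquabats", "album": "the fury of aquabats"},
]
predicted = [
    {"artist": "eagles", "album": "the best of eagles"},
    {"artist": "aquabats the", "album": "fury of aquabats"},  # boundary off by one token
]
accuracy = attribute_accuracy(predicted, gold)
print(accuracy)
```

A single misplaced boundary makes two attribute values wrong at once, which is why the second tuple scores zero.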

Overall Accuracy

[Figure: CRAM vs. DataMold on the Addresses and DBLP datasets, by error type: Missing, Insertions, Deletions, Spelling, Reordering, AllErrors, Clean.]

Topology & Robustness Operations

[Figure: 1-Pos, BMT, and BMT-robust topologies compared on Addresses.]

Training on Hypothetical Error Models

[Figure: DataMold, DataMold:Hypothetical, and CRAM compared on Addresses:AllErrors, Addresses:Clean, DBLP:AllErrors, and DBLP:Clean.]

Exploiting Dictionaries

[Figure: accuracy vs. reference table size on Addresses.]

Conclusions

Reference tables leveraged for segmentation

Combining ARMs based on independence allows segmenting input strings with unknown attribute order

ARMs learned over clean reference relations can accurately segment erroneous input strings
– BMT topology
– Robustness operations
– Exploiting large dictionaries

Model Sizes & Pruning

[Figure: three panels for Addresses, DBLP, and Media, each plotted at the values 0.1, 0.2, 0.3, 0.5, 1: Accuracy; #States & Transitions (axis from 0.E+00 to 2.E+06); Model Size in MB.]

Order Determination Accuracy

Topology

[Figure: 1-Pos, BMT, and 9-Pos topologies compared on Err and Clean inputs at the values 500, 1000, 2000, 5000, and inf.]

Specificities of HMM Models

Model “specificity” restricts accepted token sequences

Positional specificity
– Number ending in ‘th|st’ can only be the 2nd token in an address value

Token specificity
– Last state only accepts “st, rd, wy, blvd”

Sequential specificity
– “st, rd, wy, blvd” have to follow a number in ‘st|th’

[Figure: example HMM with three states: "Number ending in ‘st’ or ‘th’", "Short word (<= 5 chars)", and "st|rd|wy|blvd". Probabilities shown: 0.3, 0.4, 1.0, 1.0.]

Robustness Operations

[Figure: the BEGIN, MIDDLE, TRAILING topology between START and END, shown for the three robustness operations: token insertion, token deletion, and missing values.]
