Mining Reference Tables for Automatic Text Segmentation
Eugene Agichtein, Columbia University
Venkatesh Ganti, Microsoft Research
Scenarios
Importing unformatted strings into a target structured database
– Data warehousing
– Data integration
Requires each string to be segmented into the target relation schema
Input strings are prone to errors (e.g., data warehousing, data exchange)
Current Approaches
Rule-based
– Hard to develop, maintain, and deploy comprehensive sets of rules for every domain
Supervised
– E.g., [BSD01]
– Hard to obtain comprehensive datasets needed to train robust models
Our Approach
Exploit large reference tables
– Learn domain-specific dictionaries
– Learn structure within attribute values
Challenges
– Order of attribute concatenation in future test input is unknown
– Robustness to errors in test input after training on clean and standardized reference tables
Problem Statement
Target schema: R[A1,…,An]
For a given string s (a sequence of tokens)
– segment s into substrings s1,…,sn at token boundaries
– map s1,…,sn to Ai1,…,Ain
– maximize P(Ai1|s1)*…*P(Ain|sn) among all possible segmentations of s
The product combination function handles arbitrary concatenation order of attribute values
The probability P(Ai|x) that a string x belongs to Ai is estimated by an Attribute Recognition Model ARMi
ARMs are learned from a reference relation r[A1,…,An]
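The objective above can be illustrated with an exhaustive search over split points and attribute orders. This is only a minimal sketch: the three `toy_*_arm` functions are hypothetical stand-ins for learned ARMs, their probabilities are made up, and missing attribute values are not handled.

```python
from itertools import combinations, permutations
from math import prod

# Hypothetical stand-ins for learned ARMs: each returns P(attribute | substring).
def toy_city_arm(part):
    return 0.9 if part in {("redmond",), ("new", "york")} else 0.05

def toy_zip_arm(part):
    is_zip = len(part) == 1 and part[0].isdigit() and len(part[0]) == 5
    return 0.9 if is_zip else 0.01

def toy_street_arm(part):
    return 0.8 if part[-1] in {"st", "rd", "ave"} else 0.1

def best_segmentation(tokens, arms):
    """Try every split into len(arms) non-empty substrings and every attribute order."""
    n = len(arms)
    best_score, best_assignment = 0.0, None
    for cuts in combinations(range(1, len(tokens)), n - 1):
        bounds = (0, *cuts, len(tokens))
        parts = [tuple(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]
        for order in permutations(arms):
            # Product combination of the per-attribute probabilities.
            score = prod(arms[attr](part) for attr, part in zip(order, parts))
            if score > best_score:
                best_score, best_assignment = score, dict(zip(order, parts))
    return best_score, best_assignment

arms = {"Street": toy_street_arm, "City": toy_city_arm, "Zip": toy_zip_arm}
score, assignment = best_segmentation("redmond 98052 nw 57th st".split(), arms)
print(score, assignment)
```

Because the score is a plain product, the same search works whatever order the attribute values were concatenated in; the cost of enumerating every order is what the later slides on attribute order determination address.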
Segmentation Architecture
[Figure: segmentation architecture. Pre-processing/training: tokenization and a feature hierarchy are applied to the reference table to learn ARM1, ARM2, …, ARMn. Segmentation: an input string t1, t2, t3, …, tm is split into a segmented tuple over A1, A2, …, An.]
ARMs
Design goals
– Accurately distinguish an attribute value from other attributes
– Generalize to unobserved/new attribute values
– Robust to input errors
– Able to learn over large reference tables
ARM: Instantiation of HMMs
Purpose: Estimate probabilities of token sequences belonging to attributes
ARM: instantiation of HMMs (sequential models)
Acceptance probability: product of emission and transition probabilities
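[Figure: example HMM for street addresses, with states "number ending in ‘st’ or ‘th’", "short word (<= 5 chars)", and "st|rd|wy|blvd" between START and END; transition probabilities shown: 0.3, 0.4, 1.0, 1.0.]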
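As a rough illustration of an acceptance probability computed as a product of emission and transition probabilities, the sketch below uses a strictly left-to-right chain of three states. The state names, regular expressions, and all probabilities are assumptions chosen for the example; they are not the values from the slide's figure.

```python
import re

# Hypothetical chain START -> ordinal -> short_word -> suffix -> END.
# Each state: (name, pattern of tokens it can emit, emission probability).
STATES = [
    ("ordinal", re.compile(r"\d+(st|th)"), 0.5),
    ("short_word", re.compile(r"[a-z]{1,5}"), 0.25),
    ("suffix", re.compile(r"st|rd|wy|blvd"), 0.25),
]
# Made-up transition probabilities along the chain, including the final hop to END.
TRANSITIONS = [0.5, 1.0, 1.0, 1.0]

def acceptance_probability(tokens):
    """Product of transition and emission probabilities along the single path."""
    if len(tokens) != len(STATES):
        return 0.0  # this rigid chain accepts exactly one token per state
    prob = 1.0
    for (_, pattern, emission), transition, token in zip(STATES, TRANSITIONS, tokens):
        if not pattern.fullmatch(token):
            return 0.0  # the state cannot emit this token
        prob *= transition * emission
    return prob * TRANSITIONS[-1]  # transition into END

print(acceptance_probability("57th main st".split()))
print(acceptance_probability("nw 57th st".split()))
```

The second call returns zero because the chain is too specific about which token may appear in which position, which is the kind of rigidity the BMT topology and robustness operations are meant to relax.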
Instantiating HMMs
Instantiation has to define
– Topology: states & transitions
– Emission & transition probabilities
Current automatic approaches for topology search from among a pre-defined class of topologies are based on cross validation [FC00, BSD01]
– Expensive
– Number of states in the ARM is small to keep the search space tractable
Intuition behind ARM Design
Street address examples
– [nw 57th St], [Redmond Woodinville Rd]
Album names
– [The best of eagles], [The fury of aquabats], [Colors Soundtrack]
Large dictionaries (e.g., aquabats, soundtrack, st…) to exploit
Begin and end tokens are very important to distinguish values of an attribute (nw, st, the,…)
Can learn patterns on tokens (e.g., 57th generalizes to *th)
Need robustness to input errors
– [Best of eagles] for [The best of eagles], [nw 57th] for [nw 57th st]
Large Number of States
Associate a state per token: each state only emits a single base token
– More accurate transition probabilities
Model sizes for many large reference tables are still within a few megabytes
– Not a problem with current main memory sizes!
Prune the number of states (say, remove low-frequency tokens) to limit the ARM size
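Frequency-based pruning could look roughly like the following. The function name `prune_tokens`, the whitespace tokenization, and the `min_count` threshold are assumptions made for illustration; the slides do not specify the pruning criterion beyond "low frequency".

```python
from collections import Counter

def prune_tokens(values, min_count=2):
    """Keep only tokens that occur at least min_count times across the attribute values."""
    counts = Counter(tok for value in values for tok in value.lower().split())
    return {tok for tok, count in counts.items() if count >= min_count}

kept = prune_tokens(["nw 57th st", "ne 40th st", "main st"])
print(kept)
```

Tokens that lose their own state would then have to be matched through a more general pattern, which is presumably where the feature hierarchy comes in.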
BMT Topology: Relax Positional Specificity
[Figure: BMT topology. The BEGIN, MIDDLE, and TRAILING categories sit between START and END.]
A single state per distinct symbol within a category; the emission probability of a symbol within a category is the same
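One way to picture the BMT categories is to bucket the tokens of each reference value by position, as sketched below. The rule used here (first token to BEGIN, last token to TRAILING, everything in between to MIDDLE, and a single-token value counted under BEGIN only) is an assumption for illustration, as is the helper name `bmt_counts`.

```python
from collections import Counter

def bmt_counts(values):
    """Count tokens per positional category over a column of attribute values."""
    counts = {"BEGIN": Counter(), "MIDDLE": Counter(), "TRAILING": Counter()}
    for value in values:
        tokens = value.lower().split()
        if not tokens:
            continue
        counts["BEGIN"][tokens[0]] += 1
        if len(tokens) > 1:
            counts["TRAILING"][tokens[-1]] += 1
        for token in tokens[1:-1]:
            counts["MIDDLE"][token] += 1
    return counts

counts = bmt_counts(["nw 57th St", "Redmond Woodinville Rd", "40th Rd E"])
for category, counter in counts.items():
    print(category, dict(counter))
```

Collapsing all interior positions into one MIDDLE category is what relaxes positional specificity: "rd" is learned as both a MIDDLE and a TRAILING token here, without fixing it to the second or third position.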
Feature Hierarchy: Relax Token Specificity [BSD01]
[Figure: feature hierarchy. The root "*" covers the feature classes words [a-z]{1-}, numbers [0-9]{1-}, mixed [a-z0-9]{1-}, and delimiters. Each class is refined by length, e.g. [a-z]{1-10}, [a-z]{1-9}, …, [a-z]{1-1}, and likewise for [0-9] and [a-z0-9]. The leaves are base tokens such as ave, apt, st; 123, 55, 5; 5th, 42nd, 40th; and #.]
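The hierarchy can be read as a chain of increasingly general patterns for each base token. The sketch below builds such a chain; the class labels follow the notation on the slide, while the function name `generalizations`, the `max_len` cap, and the treatment of every non-alphanumeric token as a delimiter are assumptions.

```python
import re

def generalizations(token, max_len=10):
    """Return the base token followed by increasingly general feature classes."""
    token = token.lower()
    if re.fullmatch(r"[a-z]+", token):
        charset = "[a-z]"        # words
    elif re.fullmatch(r"[0-9]+", token):
        charset = "[0-9]"        # numbers
    elif re.fullmatch(r"[a-z0-9]+", token):
        charset = "[a-z0-9]"     # mixed
    else:
        return [token, "delimiters", "*"]
    chain = [token]
    # Length-bounded classes, from the tightest bound that fits up to max_len.
    for k in range(len(token), max_len + 1):
        chain.append(f"{charset}{{1-{k}}}")
    chain.append(f"{charset}{{1-}}")  # unbounded class
    chain.append("*")                 # root of the hierarchy
    return chain

print(generalizations("57th", max_len=5))
print(generalizations("#"))
```

A token never seen in the reference table, such as "61st", shares the upper part of its chain with "57th", so a state for one of those shared classes can still emit it.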
Example ARM for Address
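[Figure: example ARM for the Address attribute. Between START and END, the BEGIN, MIDDLE, and TRAILING categories contain states for base tokens (W, E, Rd, St, Street, 40th, 42nd, 50th) and for feature classes (c1-3, w+), learned from the Address column of the reference table: 40th Rd E, 50th Street W, 42nd St, ….]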
Robustness Operations: Relax Sequential Specificity
Make ARMs robust to common errors in the input, i.e., maintain high probability of acceptance despite these errors
Common types of errors [HS98]
– Token deletions
– Token insertions
– Missing values
Intuition: Simulate the effects of such erroneous values over each ARM
Robustness Operations
[Figure: BMT topology (BEGIN, MIDDLE, TRAILING, END) after a robustness operation.]
Simulating the effect of token insertions: tokens and their corresponding transition probabilities are copied from the BEGIN state to the MIDDLE state
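The copy operation described above might be sketched as follows. The data layout (a dict of token sets per category, and a dict of transition probabilities keyed by (category, token) pairs) and the function name are assumptions, and the sketch does not renormalize probabilities after copying.

```python
def simulate_token_insertion(states, transitions):
    """Copy BEGIN tokens, with their outgoing transitions, into the MIDDLE category.

    states: {category: set of tokens}
    transitions: {((category, token), (category, token)): probability}
    """
    new_states = {category: set(tokens) for category, tokens in states.items()}
    new_transitions = dict(transitions)
    # A value's first token shows up in a middle position when a token is inserted before it.
    new_states["MIDDLE"].update(states["BEGIN"])
    for (src, dst), prob in transitions.items():
        if src[0] == "BEGIN":
            new_transitions.setdefault((("MIDDLE", src[1]), dst), prob)
    return new_states, new_transitions

states = {"BEGIN": {"nw"}, "MIDDLE": {"57th"}, "TRAILING": {"st"}}
transitions = {
    (("BEGIN", "nw"), ("MIDDLE", "57th")): 1.0,
    (("MIDDLE", "57th"), ("TRAILING", "st")): 1.0,
}
new_states, new_transitions = simulate_token_insertion(states, transitions)
print(sorted(new_states["MIDDLE"]))
```

After the copy, a string in which some token precedes "nw" can still be accepted, since "nw" is now a valid MIDDLE token with the same outgoing transitions it had as a BEGIN token.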
Transition Probabilities
Transitions B→M, B→T, M→M, and M→T are allowed
Learned from examples in the reference table
Transition probabilities are also weighted by their ability to distinguish an attribute
– A transition “*” → “*”, which is common across many attributes, gets low weight
Summary of ARM Instantiation
BMT topology
Token hierarchy to generalize observed patterns
Robustness operations on HMMs to address input errors
One state per token in reference table to exploit large dictionaries
Attribute Order Determination
If attribute order is known
– Can use dynamic programming algorithm to segment [Rabiner89]
If attribute order is unknown
– Can ask the user to provide attribute order
– Can discover attribute order
Naïve expensive strategy: evaluate all concatenation orders and segmentations for each input string
Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples
– Several datasets on the web satisfy this assumption
– Allows us to efficiently determine the attribute order over a batch of tuples and segment input strings (using dynamic programming)
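For a known attribute order, the best split can be found with a simple dynamic program over token positions, as sketched below. The toy ARM functions and their probabilities are hypothetical, and `best_order_for_batch` is a naive stand-in that scores every order over the batch; it is not the efficient order-determination method the slides refer to.

```python
from itertools import permutations

# Hypothetical stand-ins for learned ARMs: each returns P(attribute | substring).
def toy_city_arm(part):
    return 0.9 if part in {("redmond",), ("new", "york")} else 0.05

def toy_zip_arm(part):
    return 0.9 if len(part) == 1 and part[0].isdigit() and len(part[0]) == 5 else 0.01

def toy_street_arm(part):
    return 0.8 if part[-1] in {"st", "rd", "ave"} else 0.1

ARMS = {"City": toy_city_arm, "Zip": toy_zip_arm, "Street": toy_street_arm}

def segment_in_order(tokens, order, arms):
    """best[i][j]: best product for the first i attributes covering the first j tokens."""
    n, m = len(order), len(tokens)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            for k in range(i - 1, j):  # attribute i takes tokens[k:j]
                p = best[i - 1][k] * arms[order[i - 1]](tuple(tokens[k:j]))
                if p > best[i][j]:
                    best[i][j], back[i][j] = p, k
    if back[n][m] is None:
        return 0.0, None
    parts, j = [], m
    for i in range(n, 0, -1):  # follow back-pointers to recover the split
        k = back[i][j]
        parts.append(tuple(tokens[k:j]))
        j = k
    parts.reverse()
    return best[n][m], dict(zip(order, parts))

def best_order_for_batch(batch, arms):
    """Naive: pick the single order with the highest total score over the batch."""
    def total(order):
        return sum(segment_in_order(s.split(), order, arms)[0] for s in batch)
    return max(permutations(arms), key=total)

batch = ["redmond 98052 nw 57th st", "new york 10001 5th ave"]
order = best_order_for_batch(batch, ARMS)
print(order, segment_in_order(batch[0].split(), order, ARMS))
```

Under the consistent attribute order assumption, the order is chosen once per batch, so each string then needs only the dynamic program rather than a search over all orders.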
Segmentation Algorithm (runtime)
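[Figure: runtime segmentation. A batch of input strings is used to learn the attribute value order; each input string is then segmented into an output tuple by the dynamic programming algorithm, using the ARMs.]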
Experimental Evaluation
Reference relations from several domains
– Addresses: 1,000,000 tuples [Name, #1, #2, Street Address, City, State, Zip]
– Media: 280,000 tuples [ArtistName, AlbumName, TrackName]
– Bibliography: 100,000 tuples [Title, Author, Journal, Volume, Month, Year]
Compare CRAM (our system) with DataMold [BSD01]
Test Datasets
Naturally erroneous datasets: unformatted input strings seen in operational databases
– Media
– Customer addresses
Controlled error injection:
– Clean reference table tuples → [Inject errors] → Concatenate to generate input strings
Evaluate whether a segmentation algorithm recovered the original tuple
– Accuracy measure: % of attribute values correctly recognized
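The accuracy measure can be computed as in the sketch below, which assumes tuples are represented as dicts from attribute name to value and counts an attribute as correct only on an exact match. Both helper names are hypothetical, and the error-injection step itself is left out.

```python
def concatenate(record, order):
    """Build an unsegmented input string from a tuple, skipping missing values."""
    return " ".join(record[attr] for attr in order if record.get(attr))

def accuracy(predicted, truth):
    """Percentage of attribute values that the segmentation recovered exactly."""
    total = correct = 0
    for pred, gold in zip(predicted, truth):
        for attr, value in gold.items():
            total += 1
            if pred.get(attr) == value:
                correct += 1
    return 100.0 * correct / total if total else 0.0

truth = [{"City": "redmond", "Zip": "98052"}]
predicted = [{"City": "redmond", "Zip": "9805"}]
print(concatenate(truth[0], ["Zip", "City"]))
print(accuracy(predicted, truth))
```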
Overall Accuracy
[Figure: two bar charts of accuracy (%), one for Addresses and one for DBLP, comparing CRAM and DataMold on the Missing, Insertions, Deletions, Spelling, Reordering, AllErrors, and Clean test sets.]
Topology & Robustness Operations
[Figure: bar chart of accuracy (%) on Addresses for the 1-Pos, BMT, and BMT-robust topologies.]
Training on Hypothetical Error Models
[Figure: bar chart of accuracy (%) for DataMold, DataMold:Hypothetical, and CRAM on Addresses:AllErrors, Addresses:Clean, DBLP:AllErrors, and DBLP:Clean.]
Exploiting Dictionaries
[Figure: Accuracy vs Reference Table size. Accuracy (%) for DBLP, Addresses, and Media at reference table sizes from 1.E+02 to 2.E+05.]
Conclusions
Reference tables leveraged for segmentation
Combining ARMs based on independence allows segmenting input strings with unknown attribute order
ARM models learned over clean reference relations can accurately segment erroneous input strings
– BMT topology
– Robustness operations
– Exploiting large dictionaries
Model Sizes & Pruning
[Figure: three charts for Addresses, DBLP, and Media at x-axis values 0.1, 0.2, 0.3, 0.5, and 1: Accuracy (%), #States & Transitions, and Model Size in MB.]
Order Determination Accuracy
Topology
[Figure: bar chart of accuracy (%) on Media for the 1-Pos, BMT, and 9-Pos topologies, on Err and Clean test sets, at 500, 1000, 2000, 5000, and inf.]
Specificities of HMM Models
Model “specificity” restricts accepted token sequences
Positional specificity
– Number ending in ‘th|st’ can only be the 2nd token in an address value
Token specificity
– Last state only accepts “st, rd, wy, blvd”
Sequential specificity
– “st, rd, wy, blvd” have to follow a number in ‘st|th’
[Figure: the example HMM for street addresses, with states "number ending in ‘st’ or ‘th’", "short word (<= 5 chars)", and "st|rd|wy|blvd" between START and END; transition probabilities shown: 0.3, 0.4, 1.0, 1.0.]
Robustness Operations
[Figure: three BMT diagrams (BEGIN, MIDDLE, TRAILING between START and END) illustrating the robustness operations: token insertion, token deletion, and missing values; one diagram includes an additional state B’.]