Managing Spreadsheets
Michael Cafarella, Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo
University of Michigan
New England Database Summit
February 1, 2013
2
Spreadsheets: The Good Parts
A “Swiss Army Knife” for data: storing, sharing, transforming
Sophisticated users who are not DBAs
Contain lots of data, found nowhere else
Everyone uses them; almost wholly ignored by DB community
Thanks, Jeremy!
3
Spreadsheets: The Awful Parts
Users toss in data, worry about schemas later (well, never)
Spreadsheets designed for humans, not query processors
No explicit schemas: poor data integrity (Zeeberg et al., 2004)
Integration very hard
• Tumor suppressor gene Deleted In Esophageal Cancer 1
• aka DEC1
• aka (according to Excel) 01-DEC
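The DEC1 corruption above can be reproduced outside of Excel: any symbol that parses as a month-day string is at risk of silent coercion. A minimal sketch in Python (the function name and format list are illustrative assumptions, not from the talk, and Excel's real heuristics are more permissive):

```python
from datetime import datetime

# Date-like formats an Excel-style auto-conversion tends to match
# (an assumption for illustration only).
DATE_LIKE_FORMATS = ["%b%d", "%d%b", "%b-%y"]

def looks_like_excel_date(symbol):
    """Return True if a symbol such as 'DEC1' would parse as a date."""
    for fmt in DATE_LIKE_FORMATS:
        try:
            datetime.strptime(symbol, fmt)
            return True
        except ValueError:
            continue
    return False
```

Such a check flags 'DEC1' and 'OCT4' while leaving 'TP53' alone, which is the kind of data-integrity test a schema-aware system could run automatically.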
5
A Data Tragedy
Spreadsheets build, then entomb, our best, most expensive data
>400,000 just from ClueWeb09
From governments, the WTO, many other sources
How many inside the firewall?
Application vision: Ad-hoc integration & analysis for any dataset
Challenge: recover relations from any spreadsheet, w/little human effort
6
Closeup
Desired tuple:
One hierarchy error yields many bad tuples
Too many datasets to process manually
7
Agenda
Spreadsheets: An Overview
Extracting Data
  Hierarchy Extraction
  Manual Repairs
Experimental Results
Demo
Related and Future Work
9
Extracting Tuples
1. Extract frame, attribute hierarchy trees
2. Map values to attributes; create tuples
3. Apply manual repairs, repeat
How many repairs for 100% accuracy?
Yields tuples, not relations
We won't discuss: relation assembly
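The three-step loop above can be sketched as a driver. Everything below is a toy stand-in on a flat sheet (the real system extracts hierarchies and runs CRF inference); the function names are hypothetical:

```python
def detect_frame(sheet):
    # Toy stand-in for step 1: treat the top row as the attribute
    # region and everything below it as the data region.
    return {"top": sheet[0], "data": sheet[1:]}

def map_values(frame):
    # Step 2: pair every data value with its column attribute.
    attrs = frame["top"]
    return [dict(zip(attrs, row)) for row in frame["data"]]

def extract_tuples(sheet, repairs=()):
    # Step 1: extract the frame (and, in the real system, hierarchy trees).
    frame = detect_frame(sheet)
    # Step 3: apply any manual repairs before re-mapping values.
    for fix in repairs:
        fix(frame)
    # Step 2: map values to attributes, yielding tuples.
    return map_values(frame)

# Usage on a toy sheet:
sheet = [["year", "population"], [1990, 250], [2000, 281]]
tuples = extract_tuples(sheet)
```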
10
1. Frame Detection
Key assumption: inputs are data frames
Locate metadata in top/left regions
Locate data in center block
11
Closeup
12
1. Frame Detection
Key assumption: inputs are data frames
Locate metadata in top/left regions
Locate data in center block
~72% of spreadsheets fit; others not relational
Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE
Reconstruct regions from labels
Infer labels with a linear-chain Conditional Random Field (Lafferty et al., 2001)
Layout features: has bold cell? Merged cell?
Text features: contains 'table', 'total'? Indented text? Numeric cells? Year cells?
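The layout and text features listed above can be computed per row and handed to any linear-chain CRF library as a feature dict. A sketch, assuming rows are lists of cell dicts with `value`/`bold`/`merged` keys (the exact feature names and cell representation are illustrative, not the system's):

```python
def row_features(cells):
    """Layout and text features for one spreadsheet row, of the kind
    fed to the linear-chain CRF row labeler (sketch only)."""
    values = [c.get("value") for c in cells]
    text = " ".join(str(v) for v in values if v is not None).lower()
    numeric = [v for v in values if isinstance(v, (int, float))]
    return {
        "has_bold": any(c.get("bold") for c in cells),
        "has_merged": any(c.get("merged") for c in cells),
        "mentions_table": "table" in text,
        "mentions_total": "total" in text,
        "indented": any(str(v).startswith("  ")
                        for v in values if isinstance(v, str)),
        "frac_numeric": len(numeric) / max(len(values), 1),
        "has_year": any(isinstance(v, int) and 1900 <= v <= 2100
                        for v in numeric),
    }
```

A title row like `[{"value": "Table 5. Population", "bold": True}]` lights up `has_bold` and `mentions_table`, while a data row of numbers scores high on `frac_numeric` — exactly the separation the TITLE/DATA labels need.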
13
2. Hierarchy Extraction
14
Closeup
15
16
2. Hierarchy Extraction
1. One task for TOP, one for LEFT
2. Create boolean random var for each candidate parent relationship
3. Build conditional random field to obtain best variable assignment
17
2. Hierarchy Extraction
18
2. Hierarchy Extraction
CRFs use potential functions to incorporate features
Node potentials represent a single parent/child match
  Share style? Near each other? Whitespace-separated?
Edge potentials tie pairs of parent/child decisions
  Share style pairs? Share text? Indented similarly?
Spreadsheet potentials ensure a legal tree
  One-parent potential: -∞ weight for multiple parents
  Directional potential: -∞ weight when parent edges go in opposite directions
Run Loopy Belief Propagation for node and edge potentials; post-inference test and repair for spreadsheet potentials
Real sheets yielded 1K-8K variables; inference <0.13 sec
Approach adapted from (Pimplikar and Sarawagi, 2012)
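The two hard constraints the spreadsheet potentials encode can be checked directly on an assignment, which is one way to implement the post-inference test. A sketch, assuming nodes are row/column indices and `assignment` maps `(child, candidate_parent)` pairs to booleans (this representation is an assumption for illustration):

```python
def is_legal_tree(assignment):
    """Check the one-parent and directional constraints on a candidate
    assignment of boolean parent-edge variables (sketch only)."""
    chosen = {}
    directions = set()
    for (child, parent), active in assignment.items():
        if not active:
            continue
        if child in chosen:
            return False          # one-parent potential violated
        chosen[child] = parent
        directions.add(parent < child)  # which way does this edge point?
    return len(directions) <= 1   # directional potential: no mixing
```

An inference result that fails this test would then be repaired before tuples are emitted.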
19
3. Manual Repair
User reviews and repairs the extraction
Goal: reduce user burden
Extractor makes repeated mistakes, either within a spreadsheet or across the corpus
Headache for the user to repeat fixes
Our solution: after each repair, add repair potentials to the CRF
  Links user-repaired nodes to a set of nodes throughout the CRF
  Incorporates info on node similarity
  Edges are generated heuristically
After each repair, re-run inference
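The repair step can be sketched as follows: pin the variable the user fixed, then add soft potentials on similar variables so the fix propagates. The similarity scores, weight, and threshold below are assumptions for illustration, not the system's actual parameters:

```python
def add_repair_potential(potentials, repaired, value, similarity,
                         weight=2.0, threshold=0.8):
    """After a user repair, pin the repaired variable and add soft
    repair potentials on similar variables elsewhere in the CRF
    (a sketch; real edge generation is heuristic and richer)."""
    # Hard-pin the repaired variable to the user's answer.
    potentials[repaired] = float("inf") if value else float("-inf")
    # Nudge sufficiently similar variables in the same direction.
    for var, sim in similarity.items():
        if var != repaired and sim >= threshold:
            delta = (weight if value else -weight) * sim
            potentials[var] = potentials.get(var, 0.0) + delta
    return potentials
```

After updating the potentials, inference (e.g., loopy belief propagation) is re-run so one repair can correct many repeated mistakes.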
20
Agenda
Spreadsheets: An Overview
Extracting Data
  Hierarchy Extraction
  Manual Repairs
Experimental Results
Demo
Related and Future Work
21
Experiments
General survey of spreadsheet use
Evaluate:
  Standalone extraction accuracy
  Manual repair effectiveness
Test sets:
  SAUS: 1,322 files from the 2010 Statistical Abstract of the United States
  WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09
22
Spreadsheets in the Wild
Very common for Web-published gov’t data
Domain # files % total
bts.gov 12,435 3.03%
census.gov 7,862 1.91%
stat.go.jp 6,633 1.62%
bankofengland.co.uk 5,520 1.34%
ers.usda.gov 4,328 1.05%
agr.gc.ca 4,186 1.02%
wto.org 3,863 0.94%
doh.wa.gov 3,579 0.87%
nsf.gov 2,770 0.67%
nces.ed.gov 2,177 0.53%
23
Spreadsheets in the Wild
24
Standalone Extraction
100 random H-Sheets from SAUS, WEB
Three metrics:
  Pairs: parent/child pairs labeled correctly (F1)
  Tuples: relational tuples labeled correctly (F1)
  Sheets: % of sheets labeled 100% correctly
Two methods:
  Baseline uses just formatting and position
  Hierarchy uses our approach
25
Standalone Extraction
26
Manual Repair: Effectiveness
Gather 10 topic areas from SAUS, WEB
Expert provides ground-truth hierarchies
Extract; repeatedly repair and recompute
27
Manual Repair: Ordering
Good ordering: errors steadily decrease
Bad: extended periods of slow decrease
28
End-To-End Extraction
What is the overall utility of our extractor?
Final metric: correct tuples per manual repair

Dataset     # Tuples   # Errors   # Repairs   Tuples/Repair
SAUS R50    530.76     5.46       2.06        257.65
SAUS Arts   454.8      25.4       13.1        34.72
SAUS Fin.   266.1      29.9       13.5        19.71
WEB R50     520.28     11.38      3.84        135.49
WEB BTS     65.6       2.7        1           65.6
WEB USDA    350.3      6.8        1.7         206.06
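The final column is the ratio of the tuple and repair counts, which can be checked against the table directly (the function name is ours, not the talk's):

```python
def tuples_per_repair(n_tuples, n_repairs):
    """Final metric from the table above: extracted tuples per manual
    repair, rounded to two places as reported."""
    return round(n_tuples / n_repairs, 2)

# E.g., SAUS R50: 530.76 tuples / 2.06 repairs ≈ 257.65 tuples per repair.
```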
29
Agenda
Spreadsheets: An Overview
Extracting Data
  Hierarchy Extraction
  Manual Repairs
Experimental Results
Demo
Related and Future Work
30
Demo Details
Ran SAUS corpus through the extractor
Simple ad hoc integration and analysis tool on top of the extracted data
Early version of relation reconstruction
Early version of data ranking and join finding
31
Related Work
Spreadsheet as interface: (Witkowski et al., 2003), (Liu et al., 2009)
Spreadsheet extraction
  User-provided rules: (Ahmad et al., 2003), (Hung et al., 2011)
  No explicit user rules: (Abraham and Erwig, 2007), (Cunha et al., 2009)
Ad hoc integration for found data: (Cafarella et al., 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al., 2012)
Semi-automatic data programming: Wrangler (Guo et al., 2011)
32
Conclusions and Future Work
Spreadsheet extraction opens new datasets
Manual repair ensures accuracy with low user burden
Ongoing and Future Work:
  Relation assembly
  Data relevance ranking
  Join finding