Managing Spreadsheets
Michael Cafarella, Zhe Shirley Chen, Jun Chen, Junfeng Zhang, Dan Prevo
University of Michigan
New England Database Summit
February 1, 2013
2
Spreadsheets: The Good Parts
A “Swiss Army Knife” for data: storing, sharing, transforming
Sophisticated users who are not DBAs
Contain lots of data, found nowhere else
Everyone uses them; almost wholly ignored by DB community
Thanks, Jeremy!
3
Spreadsheets: The Awful Parts
Users toss in data, worry about schemas later (well, never)
Spreadsheets designed for humans, not query processors
No explicit schemas: poor data integrity (Zeeberg et al., 2004)
Integration very hard
• Tumor suppressor gene Deleted In Esophageal Cancer 1
• aka DEC1
• aka (according to Excel) 01-DEC
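The DEC1 corruption above can be reproduced outside of Excel: any symbol that parses as a month-day string is at risk of silent coercion. A minimal sketch in Python (the function name and format list are illustrative assumptions, not from the talk, and Excel's real heuristics are more permissive):

```python
from datetime import datetime

# Date-like formats an Excel-style auto-conversion tends to match
# (an assumption for illustration only).
DATE_LIKE_FORMATS = ["%b%d", "%d%b", "%b-%y"]

def looks_like_excel_date(symbol):
    """Return True if a symbol such as 'DEC1' would parse as a date."""
    for fmt in DATE_LIKE_FORMATS:
        try:
            datetime.strptime(symbol, fmt)
            return True
        except ValueError:
            continue
    return False
```

Such a check flags 'DEC1' and 'OCT4' while leaving 'TP53' alone, which is the kind of data-integrity test a schema-aware system could run automatically.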
5
A Data Tragedy
Spreadsheets build, then entomb, our best, most expensive data
>400,000 just from ClueWeb09
From governments, the WTO, many other sources
How many inside the firewall?
Application vision: Ad-hoc integration & analysis for any dataset
Challenge: recover relations from any spreadsheet, w/little human effort
6
Closeup
Desired tuple:
One hierarchy error yields many bad tuples
Too many datasets to process manually
7
Agenda
Spreadsheets: An Overview
Extracting Data
  Hierarchy Extraction
  Manual Repairs
Experimental Results
Demo
Related and Future Work
9
Extracting Tuples
1. Extract frame, attribute hierarchy trees
2. Map values to attributes; create tuples
3. Apply manual repairs, repeat
How many repairs for 100% accuracy?
Yields tuples, not relations
We won't discuss: relation assembly
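The three-step loop above can be sketched as a driver. Everything below is a toy stand-in on a flat sheet (the real system extracts hierarchies and runs CRF inference); the function names are hypothetical:

```python
def detect_frame(sheet):
    # Toy stand-in for step 1: treat the top row as the attribute
    # region and everything below it as the data region.
    return {"top": sheet[0], "data": sheet[1:]}

def map_values(frame):
    # Step 2: pair every data value with its column attribute.
    attrs = frame["top"]
    return [dict(zip(attrs, row)) for row in frame["data"]]

def extract_tuples(sheet, repairs=()):
    # Step 1: extract the frame (and, in the real system, hierarchy trees).
    frame = detect_frame(sheet)
    # Step 3: apply any manual repairs before re-mapping values.
    for fix in repairs:
        fix(frame)
    # Step 2: map values to attributes, yielding tuples.
    return map_values(frame)

# Usage on a toy sheet:
sheet = [["year", "population"], [1990, 250], [2000, 281]]
tuples = extract_tuples(sheet)
```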
10
1. Frame Detection
Key assumption: inputs are data frames
Locate metadata in top/left regions
Locate data in center block
11
Closeup
12
1. Frame Detection
Key assumption: inputs are data frames
Locate metadata in top/left regions
Locate data in center block
~72% of spreadsheets fit; others not relational
Each non-empty row labeled one of TITLE, HEADER, DATA, FOOTNOTE
Reconstruct regions from labels
Infer labels with a linear-chain Conditional Random Field (Lafferty et al., 2001)
Layout features: has bold cell? Merged cell?
Text features: contains 'table', 'total'? Indented text? Numeric cells? Year cells?
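The layout and text features listed above can be computed per row and handed to any linear-chain CRF library as a feature dict. A sketch, assuming rows are lists of cell dicts with `value`/`bold`/`merged` keys (the exact feature names and cell representation are illustrative, not the system's):

```python
def row_features(cells):
    """Layout and text features for one spreadsheet row, of the kind
    fed to the linear-chain CRF row labeler (sketch only)."""
    values = [c.get("value") for c in cells]
    text = " ".join(str(v) for v in values if v is not None).lower()
    numeric = [v for v in values if isinstance(v, (int, float))]
    return {
        "has_bold": any(c.get("bold") for c in cells),
        "has_merged": any(c.get("merged") for c in cells),
        "mentions_table": "table" in text,
        "mentions_total": "total" in text,
        "indented": any(str(v).startswith("  ")
                        for v in values if isinstance(v, str)),
        "frac_numeric": len(numeric) / max(len(values), 1),
        "has_year": any(isinstance(v, int) and 1900 <= v <= 2100
                        for v in numeric),
    }
```

A title row like `[{"value": "Table 5. Population", "bold": True}]` lights up `has_bold` and `mentions_table`, while a data row of numbers scores high on `frac_numeric` — exactly the separation the TITLE/DATA labels need.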
13
2. Hierarchy Extraction
14
Closeup
15
16
2. Hierarchy Extraction
1. One task for TOP, one for LEFT
2. Create boolean random var for each candidate parent relationship
3. Build conditional random field to obtain best variable assignment
17
2. Hierarchy Extraction
18
2. Hierarchy Extraction
CRFs use potential functions to incorporate features
Node potentials represent a single parent/child match
  Share style? Near each other? Whitespace-separated?
Edge potentials tie pairs of parent/child decisions
  Share style pairs? Share text? Indented similarly?
Spreadsheet potentials ensure a legal tree
  One-parent potential: -∞ weight for multiple parents
  Directional potential: -∞ weight when parent edges go in opposite directions
Run Loopy Belief Propagation for node and edge potentials; post-inference test and repair for spreadsheet potentials
Real sheets yielded 1K-8K variables; inference <0.13 sec
Approach adapted from (Pimplikar and Sarawagi, 2012)
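The two hard constraints the spreadsheet potentials encode can be checked directly on an assignment, which is one way to implement the post-inference test. A sketch, assuming nodes are row/column indices and `assignment` maps `(child, candidate_parent)` pairs to booleans (this representation is an assumption for illustration):

```python
def is_legal_tree(assignment):
    """Check the one-parent and directional constraints on a candidate
    assignment of boolean parent-edge variables (sketch only)."""
    chosen = {}
    directions = set()
    for (child, parent), active in assignment.items():
        if not active:
            continue
        if child in chosen:
            return False          # one-parent potential violated
        chosen[child] = parent
        directions.add(parent < child)  # which way does this edge point?
    return len(directions) <= 1   # directional potential: no mixing
```

An inference result that fails this test would then be repaired before tuples are emitted.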
19
3. Manual Repair
User reviews and repairs the extraction
Goal: reduce user burden
Extractor makes repeated mistakes, either within a spreadsheet or across the corpus
Headache for the user to repeat fixes
Our solution: after each repair, add repair potentials to the CRF
  Links user-repaired nodes to a set of nodes throughout the CRF
  Incorporates info on node similarity
  Edges are generated heuristically
After each repair, re-run inference
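The repair step can be sketched as follows: pin the variable the user fixed, then add soft potentials on similar variables so the fix propagates. The similarity scores, weight, and threshold below are assumptions for illustration, not the system's actual parameters:

```python
def add_repair_potential(potentials, repaired, value, similarity,
                         weight=2.0, threshold=0.8):
    """After a user repair, pin the repaired variable and add soft
    repair potentials on similar variables elsewhere in the CRF
    (a sketch; real edge generation is heuristic and richer)."""
    # Hard-pin the repaired variable to the user's answer.
    potentials[repaired] = float("inf") if value else float("-inf")
    # Nudge sufficiently similar variables in the same direction.
    for var, sim in similarity.items():
        if var != repaired and sim >= threshold:
            delta = (weight if value else -weight) * sim
            potentials[var] = potentials.get(var, 0.0) + delta
    return potentials
```

After updating the potentials, inference (e.g., loopy belief propagation) is re-run so one repair can correct many repeated mistakes.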
20
Agenda
Spreadsheets: An Overview
Extracting Data
  Hierarchy Extraction
  Manual Repairs
Experimental Results
Demo
Related and Future Work
21
Experiments
General survey of spreadsheet use
Evaluate:
  Standalone extraction accuracy
  Manual repair effectiveness
Test sets:
  SAUS: 1,322 files from the 2010 Statistical Abstract of the United States
  WEB: 410,554 files from 51,252 domains, crawled from ClueWeb09
22
Spreadsheets in the Wild
Very common for Web-published gov’t data
Domain # files % total
bts.gov 12,435 3.03%
census.gov 7,862 1.91%
stat.go.jp 6,633 1.62%
bankofengland.co.uk 5,520 1.34%
ers.usda.gov 4,328 1.05%
agr.gc.ca 4,186 1.02%
wto.org 3,863 0.94%
doh.wa.gov 3,579 0.87%
nsf.gov 2,770 0.67%
nces.ed.gov 2,177 0.53%
23
Spreadsheets in the Wild
24
Standalone Extraction
100 random H-Sheets from SAUS, WEB
Three metrics:
  Pairs: parent/child pairs labeled correctly (F1)
  Tuples: relational tuples labeled correctly (F1)
  Sheets: % of sheets labeled 100% correctly
Two methods:
  Baseline uses just formatting and position
  Hierarchy uses our approach
25
Standalone Extraction
26
Manual Repair: Effectiveness
Gather 10 topic areas from SAUS, WEB
Expert provides ground-truth hierarchies
Extract; repeatedly repair and recompute
27
Manual Repair: Ordering
Good ordering: errors steadily decrease
Bad: extended periods of slow decrease
28
End-To-End Extraction
What is the overall utility of our extractor?
Final metric: correct tuples per manual repair

Dataset     # Tuples   # Errors   # Repairs   Tuples/Repair
SAUS R50    530.76     5.46       2.06        257.65
SAUS Arts   454.8      25.4       13.1        34.72
SAUS Fin.   266.1      29.9       13.5        19.71
WEB R50     520.28     11.38      3.84        135.49
WEB BTS     65.6       2.7        1           65.6
WEB USDA    350.3      6.8        1.7         206.06
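The final column is the ratio of the tuple and repair counts, which can be checked against the table directly (the function name is ours, not the talk's):

```python
def tuples_per_repair(n_tuples, n_repairs):
    """Final metric from the table above: extracted tuples per manual
    repair, rounded to two places as reported."""
    return round(n_tuples / n_repairs, 2)

# E.g., SAUS R50: 530.76 tuples / 2.06 repairs ≈ 257.65 tuples per repair.
```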
29
Agenda
Spreadsheets: An Overview
Extracting Data
  Hierarchy Extraction
  Manual Repairs
Experimental Results
Demo
Related and Future Work
30
Demo Details
Ran SAUS corpus through the extractor
Simple ad hoc integration and analysis tool on top of the extracted data
Early version of relation reconstruction
Early version of data ranking and join finding
31
Related Work
Spreadsheet as interface: (Witkowski et al., 2003), (Liu et al., 2009)
Spreadsheet extraction
  User-provided rules: (Ahmad et al., 2003), (Hung et al., 2011)
  No explicit user rules: (Abraham and Erwig, 2007), (Cunha et al., 2009)
Ad hoc integration for found data: (Cafarella et al., 2009), (Pimplikar and Sarawagi, 2012), (Yakout et al., 2012)
Semi-automatic data programming: Wrangler (Guo et al., 2011)
32
Conclusions and Future Work
Spreadsheet extraction opens new datasets
Manual repair ensures accuracy with low user burden
Ongoing and Future Work:
  Relation assembly
  Data relevance ranking
  Join finding