
Mining Semi-structured Data: Understanding Web-tables – Building a Taxonomy for 2xn Tables

Emir Muñoz, MSc.

[email protected]

Galway, Ireland – 19 July 2012

Introduction WTT-Detection WTT-Interpretation On-going work Future work 1/39


Outline

1 Introduction

2 WTT-Detection

3 WTT-Interpretation

4 On-going work

5 Future work


Introduction I

Tables

They are used as a compact and efficient way to present relational information.

They are inherently concise as well as information rich.

The automatic understanding of tables has many applications, including:

Knowledge management
Information retrieval
Web and text mining
Summarization, and
Content delivery to mobile devices.

Interesting for domains like: medicine, health-care, finance, e-science (e.g., biotechnology), and public policy.


Introduction II

Web-tables (WTT) examples


Introduction III

Table understanding in Web documents includes [WH02]:

Table detection,

Functional and structural analysis, and

Table interpretation.

Cafarella et al. [CHW+08] estimated that there are around 14.1 billion HTML tables, of which 154 million contain high-quality relational data.

This represents a large source of knowledge, yet we do not have systems that can understand and exploit this knowledge properly.


Table detection I

In practice, tables are not only used to present relational information,

... they are also used to create multiple-column layouts to facilitate easy viewing.

The presence of the HTML tag <table> does not guarantee a relational table or, more generally, a table with content.

[WH02] A ML approach for Table Detection

Wang and Hu discriminated genuine from non-genuine tables on the grounds of their content. They checked whether the cells are logically related or the table is merely a mechanism for grouping content. To do so, they used a decision tree classifier.


Table detection II

In genuine or relational tables there are logical relations among the cells.

Non-genuine or non-relational tables are used as a mechanism for grouping contents.


Table detection: [WH02] I

A Machine Learning Based Approach for Table Detection on The Web

They define weights derived from the traditional tf ∗ idf measures used in IR, and define similarity based on the vector space model.

Their initial database contains a total of 2,851 pages harvested from Google directory and News, from around 200 web sites, using predefined keywords known to have a higher chance of recalling genuine tables.

They randomly selected 1,393 pages from this database, containing 11,477 <table> nodes.

For training they used 9-fold cross validation.

They experimented with decision trees and SVMs for separating genuine and non-genuine tables.
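As a rough illustration of tf ∗ idf weighting and vector-space similarity, the sketch below uses one common variant (raw term frequency times log(N/df)) together with cosine similarity. The exact weighting used in [WH02] is not given in these slides, so the formula, the toy documents and the function names are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = tf * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy "cell contents" standing in for three tables (hypothetical data).
docs = [
    "stock price change volume".split(),
    "stock price volume high low".split(),
    "home about contact login".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related, sim_unrelated)
```

The two stock-related documents share weighted terms and get a positive similarity; the third shares no terms with the first, so its similarity is zero.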


Table detection: [WH02] II

A Machine Learning Based Approach for Table Detection on The Web

Of these tables, 1,740 are genuine (15.16%) and 9,737 are non-genuine (84.84%).

The reported results are R = 94.25%, P = 97.50%, F = 95.88%.

(The pages were obtained by querying Google using keywords like “table”, “stock”, “weather”.)
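The figures on this slide can be checked with a few lines of arithmetic. The sketch below recomputes the class shares from the table counts and compares two common ways of combining P and R. Which definition of F [WH02] used is not stated in these slides; the reported 95.88% is consistent with the arithmetic mean of P and R, while the harmonic mean (the usual F1) gives a slightly lower value.

```python
genuine, non_genuine = 1740, 9737
total = genuine + non_genuine  # all <table> nodes in the selected pages

share_genuine = 100 * genuine / total          # about 15.16
share_non_genuine = 100 * non_genuine / total  # about 84.84

recall, precision = 94.25, 97.50
f_arithmetic = (precision + recall) / 2                     # 95.875
f_harmonic = 2 * precision * recall / (precision + recall)  # about 95.85

print(total, round(share_genuine, 2), round(share_non_genuine, 2))
print(f_arithmetic, f_harmonic)
```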


Table detection: [CP10b] I

Web-Scale Knowledge Extraction from Semi-Structured Tables

Tables called Attribute/Value

They propose a classification algorithm for recognizing layout tables and attribute/value tables. In their work, they adopted the Gradient Boosted Decision Tree classification model, with classes ATTRIBUTE/VALUE, LAYOUT, and OTHER (e.g., calendars, forms, enumerations).


Table detection: [CP10b] II

Web-Scale Knowledge Extraction from Semi-Structured Tables


Table detection: [CP10b] III

Web-Scale Knowledge Extraction from Semi-Structured Tables

Tables list attributes but rarely contain the subject in the table proper.

Their focus is on detecting the subject of the table. They call this open research problem Protagonist Detection.

Relational tables considered in their work encode facts, or semantic triples of the form <p, s, o>.

There are three different places where the protagonist could be found:

a) within the table (occasionally found in the table with a generic attribute such as name or model);

b) within the document or the HTML <title> tag; and

c) in anchor texts, which offer well-defined boundaries for identifying protagonist candidates; the document body offers fewer clues.


Table detection: [CP10b] IV

Web-Scale Knowledge Extraction from Semi-Structured Tables


Table detection: [CP10a, CP11] I

Web-scale Table Census and Classification

They extend their previous work, proposing a much finer-grained table-type classification, and report an overall accuracy of 75.2%.

From a total of 1.2 billion documents, they extracted 8.2 billion tables (2.6 billion unique tables).

In detail, 75% of the pages contain at least one table, with an average of 9.1 tables per document.

In preliminary experiments on identifying the protagonist of A-V tables, they use an N-gram based approach over a commercial search engine’s web link graph.

The correct protagonist appears among the top-20 ranked candidates in 90% of the cases, and among the top-3 in 79% of the cases.


Table detection: [CP10a, CP11] II

Table classes

[CP10a, CP11] propose the following table type taxonomy. (This proposal and others are based only on the syntactic structure of tables.)


WTT-Interpretation I

Recovering Table Semantics

Some works focus on mapping spreadsheets into RDF, but such systems require human intervention.

[MFSJ10] proposed an approach that uses linked data to interpret tables and associate their components with nodes in a reference linked data collection.

The collection provides general-purpose knowledge as well as specific facts about significant people, places, organizations, events and many other entities of interest.

[SFMJ10] used RDF for exporting and encoding the information embodied in tables.

They describe techniques to automatically infer a (partial) semantic model for information in tables using both table headings, if available, and the values stored in table cells. The techniques have been prototyped for a subset of linked data that covers the core of Wikipedia.


WTT-Interpretation II

Recovering Table Semantics

City          Mayor         State  Population
Boston        T. Menino     MA     610,000
New York      M. Bloomberg  NY     8,400,000
Philadelphia  M. Nutter     PA     1,500,000
Baltimore     S. Dixon      MD     640,000
Washington    A. Fenty      DC     595,000

@prefix dbp: <http://dbpedia.org/resource/> .
@prefix dbpo: <http://dbpedia.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix cyc: <http://www.cyc.com/2004/06/04/cyc#> .

dbp:Boston dbpo:leaderName dbp:Thomas_Menino ;
    cyc:partOf dbp:Massachusetts ;
    dbpo:populationTotal "610000"^^xsd:integer .

dbp:New_York_City ...

...
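The mapping from a table row to Turtle can be sketched in a few lines. This is only an illustration, not the authors' system: it assumes that an entity-linking step (not shown) has already mapped cell values to DBpedia resources, and it hard-codes the column-to-property mapping seen in the example above. The names `row_to_turtle`, `prefix_header` and `COLUMN_PROPERTY` are hypothetical.

```python
PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",
    "dbpo": "http://dbpedia.org/ontology/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "cyc": "http://www.cyc.com/2004/06/04/cyc#",
}

# Column-to-property mapping copied from the example; in a real system
# this mapping would have to be inferred.
COLUMN_PROPERTY = {
    "Mayor": "dbpo:leaderName",
    "State": "cyc:partOf",
    "Population": "dbpo:populationTotal",
}

def prefix_header():
    """Build the @prefix declarations."""
    return "\n".join("@prefix %s: <%s> ." % (p, iri) for p, iri in PREFIXES.items())

def row_to_turtle(subject, row):
    """Serialize one already-linked row as a single Turtle statement.

    `subject` and the Mayor/State values are assumed to be prefixed names.
    """
    parts = []
    for column, value in row.items():
        prop = COLUMN_PROPERTY[column]
        if column == "Population":
            # Strip thousands separators and emit a typed literal.
            obj = '"%d"^^xsd:integer' % int(value.replace(",", ""))
        else:
            obj = value
        parts.append("%s %s" % (prop, obj))
    return subject + " " + " ;\n    ".join(parts) + " ."

boston = row_to_turtle(
    "dbp:Boston",
    {"Mayor": "dbp:Thomas_Menino", "State": "dbp:Massachusetts", "Population": "610,000"},
)
turtle_doc = prefix_header() + "\n\n" + boston
print(turtle_doc)
```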


WTT-Interpretation III

Recovering Table Semantics

When predicting entity classes in a column, [SFMJ10] used DBpedia (85.71%), Yago (71.42%), WordNet (71.42%) and Freebase (90.47%).

Entity types and their correct prediction: Places (61.64%), Persons (90.76%) and Organizations (66.667%).

To describe relations between columns in a table, they take all pairs of entities in the same row (already linked to Wikipedia) and query DBpedia for the set of relations.

http://dbpedia.org/ontology/largestCity

http://dbpedia.org/ontology/PopulatedPlace/largestCity

http://dbpedia.org/ontology/capital

http://dbpedia.org/ontology/PopulatedPlace/capital

http://dbpedia.org/property/capital

http://dbpedia.org/property/largestcity
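To pick a single relation for a pair of columns from per-row candidate sets like the one above, a simple frequency count is enough. In this sketch the per-row sets are hard-coded stand-ins for DBpedia query results; the row data and the function name are assumptions.

```python
from collections import Counter

def most_frequent_relation(relations_per_row):
    """Return the relation found in the largest number of rows.

    Each element of `relations_per_row` is the set of relations returned
    for the entity pair of one row. Ties are broken arbitrarily.
    """
    counts = Counter()
    for relations in relations_per_row:
        counts.update(set(relations))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

DBPO = "http://dbpedia.org/ontology/"
# Hypothetical results for four rows of a two-column table.
rows = [
    {DBPO + "capital", DBPO + "largestCity"},
    {DBPO + "capital"},
    {DBPO + "capital", DBPO + "largestCity"},
    set(),  # no relation found for this row
]
selected = most_frequent_relation(rows)
print(selected)
```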


WTT-Interpretation IV

Recovering Table Semantics

The relation that appears most often is the one selected.

The evaluation test set is very small, just 5 tables taken from Google Squared.

Another example about basketball players:

Name            Team          Position
Michael Jordan  Chicago       Shooting guard
Allen Iverson   Philadelphia  Point guard
Yao Ming        Houston       Center
Tim Duncan      San Antonio   Power forward

It is important to discover relations between the table columns, but not only 2-ary relations.

[MFJ11] also analyzed government linked data.


WTT-Interpretation I

WTT as a very large repository of facts

[YTT01] focused their work on a probabilistic method to integrate tables according to the category of objects represented in each table (performing attribute clustering).

[YT01, TI06] proposed methods for ontology extraction from web-tables that use the relations represented by structures within the table. (The table structures have to be provided by humans.)

An IR approach presented in [YTL11] extracts structured data from WTT, aggregates and cleans the data, and stores it in a database. They create a very large repository of entity-attribute-value triples.


WTT-Interpretation II

WTT as a very large repository of facts

(How does this work?) A good example is the query “Saint Patrick’s day”: a search engine could directly show “17 March” within its top-ranked results.

http://en.wikipedia.org/wiki/Public_holidays_in_the_Republic_of_Ireland


WTT-Interpretation III

WTT as a very large repository of facts

(Hypothesis) Recovering semantics leads to better search and quality filtering. Some stated problems:

Take, for instance, a table about trees and a piece of text like “...North America species such as Green Ash...”. From the WTT we could infer that “Green Ash” is a species of tree, a.k.a. “Fraxinus pennsylvanica”.

Use schema statistics to automatically compute attribute synonyms (more complete than a thesaurus).

e.g., e-mail / email, phone / telephone, e-mail address / email address, date / last-modified

Recovering large fractions of binary relationships, and techniques for recovering numerical relationships (e.g., population, GDP), are still needed [VHM+11].


On-going work I

Introduction

Our initial aim was to understand relational tables.

We parse HTML pages to extract HTML tables using the NekoHTML library.

We have a corpus comprising 8.2 billion tables.

Each table is parsed as a matrix using TARTAR [PCS+07], dealing with the cell spans.

We manually annotated 14,695 randomly chosen tables:

10,923 content-poor (74.33%) and 3,785 content-rich (25.76%) tables.

We built a content-poor vs. content-rich table predictor with the same features as [CP11], using a maximum-entropy model and 10-fold cross-validation.
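Parsing a table into a matrix while handling cell spans can be illustrated independently of TARTAR. The sketch below is not TARTAR's algorithm; it simply copies the text of a spanning cell into every grid position it covers. The input format (rows of `(text, rowspan, colspan)` tuples) and the function name are assumptions.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular matrix."""
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip positions already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1 if grid else 0
    n_cols = max(c for _, c in grid) + 1 if grid else 0
    # Positions never covered by any cell become empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

# Hypothetical table: "Contact" spans two columns, "Ann" spans two rows.
rows = [
    [("Name", 1, 1), ("Contact", 1, 2)],
    [("Ann", 2, 1), ("e-mail", 1, 1), ("ann@example.org", 1, 1)],
    [("phone", 1, 1), ("555-0100", 1, 1)],
]
matrix = expand_spans(rows)
for line in matrix:
    print(line)
```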


On-going work II

Introduction

From a set of 115 features, we selected 19 via a greedy feature selection algorithm. These 19 features achieved an accuracy of 89.46%.

Most important features:

Presence of the <select> tag in a column
Distinct strings in the 1st column
Distinct tags in a column
Distinct tags in a row
Non-empty cells in columns or rows
Presence of links
Presence of colon “:”
Presence of break line <br>
Presence of input fields (HTML)
Presence of numbers in a row
Presence of the <th> tag
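A few of the features listed above can be computed with Python's standard `html.parser` module, as sketched below. This is a simplification and not the feature extractor used in this work: features are computed over the whole table rather than per column or row, and the feature names are hypothetical.

```python
import re
from html.parser import HTMLParser

class TableFeatureParser(HTMLParser):
    """Collect tag counts and text from an HTML table fragment."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def handle_data(self, data):
        self.text_parts.append(data)

def table_features(table_html):
    """Return a dict of simple whole-table features."""
    parser = TableFeatureParser()
    parser.feed(table_html)
    parser.close()
    text = " ".join(parser.text_parts)
    tags = parser.tag_counts
    return {
        "has_select": "select" in tags,
        "has_th": "th" in tags,
        "has_link": "a" in tags,
        "has_br": "br" in tags,
        "has_input": "input" in tags,
        "has_colon": ":" in text,
        "has_number": re.search(r"\d", text) is not None,
        "distinct_tags": len(tags),
    }

sample = (
    '<table><tr><th>Price:</th><td>19.99</td></tr>'
    '<tr><td>Shop</td><td><a href="#">link</a><br></td></tr></table>'
)
features = table_features(sample)
print(features)
```

A feature dictionary like this could then be fed to any classifier, for example a maximum-entropy model as used in this work.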


On-going work III

Introduction

Our aim is to propose a content-based taxonomy for WTT, in contrast to previous taxonomies based on syntactic structure.

We are now developing a 2xn table predictor with content-focused classes.


On-going work I

Why this work?

Why a new taxonomy?

[WH02] classify WTT as genuine or non-genuine tables.
[CHW+08] classify WTT as relational or non-relational tables.
Crestan and Pantel’s taxonomy is a more general-purpose taxonomy for tables, focused on the syntax of the tables, not on their semantics.
Intuitively, not all the classes of the taxonomy of [CP11] are useful, and only the A-V class is used.
Moreover, what does it mean to say that a table is A-V?
An A-V table could have spatio-temporal attributes or universal facts, or even describe a person, a company or a product.
All the previous approaches need more focus on the “message” of the tables.


On-going work II

Why this work?

Why are 2xn tables important?

The 2xn class is larger than the A-V class.

They make up about 20% of all tables on the Web.

Previous A-V tables were identified by the presence of “:” (colon).

We hope that extending the research to 2xn tables will uncover those A-V tables that are not indicated by the “colon rule”.
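The difference between a shape-based test and the “colon rule” can be sketched as follows. The code reads “2xn” as two columns and n rows, which is an assumption, and the 0.5 threshold and function names are hypothetical; it is not the predictor under development.

```python
def is_2xn(matrix):
    """True when the table has at least one row and every row has two cells."""
    return len(matrix) > 0 and all(len(row) == 2 for row in matrix)

def colon_ratio(matrix):
    """Fraction of rows whose first cell ends with a colon."""
    if not matrix:
        return 0.0
    hits = sum(1 for row in matrix if row and row[0].strip().endswith(":"))
    return hits / len(matrix)

def passes_colon_rule(matrix, threshold=0.5):
    """Colon-based A-V heuristic (threshold is an illustrative choice)."""
    return is_2xn(matrix) and colon_ratio(matrix) >= threshold

with_colon = [["Name:", "Ann"], ["Phone:", "555-0100"]]
without_colon = [["Name", "Ann"], ["Phone", "555-0100"]]

print(is_2xn(with_colon), passes_colon_rule(with_colon))
print(is_2xn(without_colon), passes_colon_rule(without_colon))
```

The second table is attribute/value in content but fails the colon rule, which is the kind of table a content-based 2xn classifier is meant to recover.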


On-going work

Proposed Taxonomy

We introduce new classes that could be important, e.g., when used with ontologies (such as FOAF) in a search engine.


On-going work

WTT examples according to our taxonomy


On-going work

Distribution

We are manually tagging 174,748 unique WTT.

The distribution per class so far is:

Class                        %
Social networks              34.2%
Spatio-temporal information  28.9%
Products                     28.4%
Resources                    4.3%
Universal facts              3.2%
Other                        1.0%
Events                       0.03%


Future work – Open problems I

Tables in web-pages can be used to model the data of web-sites, in particular their main entities and the key relations among them. This entails:

a) Web tables bear syntactic and semantic information that is useful for determining what they are talking about. Thus, patterns across web-tables can be exploited to automatically understand their “message”.

b) Once the “message” of the tables of a specific web-site is determined, it is possible to infer the main entities that this web-site talks about.

c) Once the relevant entities of a web-site are detected, it is plausible to recognize prominent relationships between these entities. Thus, we will be able to link data between the chief entities of a web-site.

d) Once predominant relations and entities of a web-site are determined, it is possible to link data between different web-sites.


Future work – Open problems II

Extracting RDF from Wikipedia tables (not only infoboxes).

Relation extraction – all kinds of relations.

Taxonomy definition.

Other levels, like: rankings, definitions.

Complex table understanding.

Table integration.

Protagonist detection for web-tables.


References I

If you want to go further

[CHW+08] Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. Webtables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.

[CP10a] Eric Crestan and Patrick Pantel. A fine-grained taxonomy of tables on the web. In Jimmy Huang, Nick Koudas, Gareth J. F. Jones, Xindong Wu, Kevyn Collins-Thompson, and Aijun An, editors, CIKM, pages 1405–1408. ACM, 2010.

[CP10b] Eric Crestan and Patrick Pantel. Web-scale knowledge extraction from semi-structured tables. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 1081–1082, New York, NY, USA, 2010. ACM.

[CP11] Eric Crestan and Patrick Pantel. Web-scale table census and classification. In Irwin King, Wolfgang Nejdl, and Hang Li, editors, WSDM, pages 545–554. ACM, 2011.

[MFJ11] Varish Mulwad, Tim Finin, and Anupam Joshi. Automatically Generating Government Linked Data from Tables. In Working notes of AAAI Fall Symposium on Open Government Knowledge: AI Opportunities and Challenges, November 2011.

[MFSJ10] Varish Mulwad, Tim Finin, Zareen Syed, and Anupam Joshi. Using linked data to interpret tables. In Proceedings of the First International Workshop on Consuming Linked Data, November 2010.


References II

If you want to go further

[PCS+07] Aleksander Pivk, Philipp Cimiano, York Sure, Matjaz Gams, Vladislav Rajkovic, and Rudi Studer. Transforming arbitrary tables into logical form with TARTAR. Data Knowl. Eng., 60(3):567–595, 2007.

[SFMJ10] Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi. Exploiting a Web of Semantic Data for Interpreting Tables. In Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, Raleigh NC, USA, April 26–27th 2010.

[TI06] Masahiro Tanaka and Toru Ishida. Ontology Extraction from Tables on the Web. In Proceedings of the International Symposium on Applications on Internet, pages 284–290, Washington DC, USA, 2006.

[VHM+11] Petros Venetis, Alon Y. Halevy, Jayant Madhavan, Marius Pasca, Warren Shen, Fei Wu, Gengxin Miao, and Chung Wu. Recovering semantics of tables on the web. PVLDB, 4(9):528–538, 2011.

[WH02] Yalin Wang and Jianying Hu. A Machine Learning Based Approach for Table Detection on the Web. In Proceedings of the 11th Int’l Conf. on World Wide Web (WWW’02), pages 242–250. ACM Press, 2002.

[YT01] Minoru Yoshida and Kentaro Torisawa. Extracting Ontologies from World Wide Web via HTML Tables. In Proceedings of the Pacific Association for Computational Linguistics (PACLING 2001), pages 332–341, 2001.


References III

If you want to go further

[YTL11] Xiaoxin Yin, Wenzhao Tan, and Chao Liu. FACTO: a fact lookup engine based on web tables. In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 507–516, New York, NY, USA, 2011. ACM.

[YTT01] Minoru Yoshida, Kentaro Torisawa, and Jun’ichi Tsujii. A method to integrate tables of the World Wide Web. In Proceedings of the International Workshop on Web Document Analysis (WDA 2001), pages 31–34, 2001.
