Automatically Generating Government Linked Data from Tables
Varish Mulwad (@varish)
University of Maryland, Baltimore County
November 5, 2011
Dr. Tim Finin Dr. Anupam Joshi
What ?
State | StateFIPS | County | CountyFIPS | Group | Label | Value
Alabama | 1 | Macon | 87 | Farms with Black or African American operators | Value of sales of grains, oilseeds, dry beans, and dry peas (farms) | 5
Arizona | … | Navajo | … | … | … | …
Arkansas | 5 | Union | 139 | Farms with women principal operators | Total value of agricultural products sold (farms) | 56
California | 6 | Humboldt | 23 | … | … | 19
http://dbpedia.org/class/AdministrativeRegion
http://dbpedia.org/resource/Arizona
Map literals as values of properties
dbpedia-owl:state
Introduction Related Work Baseline Results Joint Inference Conclusion
@prefix dbpedia: <http://dbpedia.org/resource/>.
@prefix dbpedia-owl: <http://dbpedia.org/ontology/>.
@prefix dbpprop: <http://dbpedia.org/property/>.
@prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#>.
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
[ a dgtwc:DataEntry;
  dbpedia-owl:state dbpedia:Alabama;
  dbpedia:FIPS_county_code 000;
  dbpedia:Federal_Information_Processing_Standard_state_code 001;
  dbpedia-owl:ethnicGroup "Farm with women principal operators"@en;
  dbpedia-owl:number 6444 ].
All this in a completely automated way !!
Contribution
Why ?
Tables are everywhere !! … yet …
The web – 154 million high quality relational tables [1]
Evidence–based medicine
Figure: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010
The idea behind Evidence-based Medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables.
However, the rate at which meta-analyses are published remains very low … hampers effective health care treatment …
[Chart: # of clinical trials published in 2008 vs. # of meta-analyses published in 2008]
> 400,000 raw and geospatial datasets
~ < 1 % in RDF
Current Systems
– Require users to have knowledge of the Semantic Web
– Do not automatically link to existing classes and entities on the Semantic Web / Linked Data cloud
– RDF data in some cases is as useless as raw data
– Majority of the work focused on relational data where schema is available
– Web tables systems use ‘semantically poor knowledge bases’
Dataset 1425
<rdf:Description rdf:about="#entry1">
  <value>6444</value>
  <label>Number of Farms</label>
  <group>Farms with women principal operators</group>
  <county_fips>000</county_fips>
  <state_fips>01</state_fips>
  <state>Alabama</state>
  <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
</rdf:Description>
How ?
• Preliminary work / Baseline system
• Analysis and Evaluation of baseline
• “Domain Independent” Framework grounded in graphical models and probabilistic reasoning
Building a table interpretation framework
The System’s Brain (Knowledgebase)
Yago
Wikitology¹ – A hybrid knowledgebase where structured data meets unstructured data
¹ Wikitology was created as part of Zareen Syed’s Ph.D. dissertation
Syed, Z., and Finin, T. 2011. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.
The Baseline System
T2LD Framework
Predict Class for Columns
Linking the table cells
Identify and Discover relations
Predicting Class Labels for column
State
Alabama
Arizona
Arkansas
California
Class
Instance
1. Alabama
2. Alabama_(band)
3. Alabama_(people)
{dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes …}
{dbpedia-owl:Place, yago:StatesOfTheUnitedStates, dbpedia-owl:Film, … }
{ … }
dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, dbpedia-owl:Film ...
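A minimal sketch of how class sets like the ones above could be pooled into a ranked list for the column, assuming a simple rank-weighted vote. The function name `rank_column_classes`, the weighting scheme, and the toy class sets are illustrative assumptions, not the scoring used by the baseline system.

```python
from collections import defaultdict

def rank_column_classes(candidates_per_cell):
    """Rank classes for a column by a rank-weighted vote (illustrative only).

    candidates_per_cell: one list per cell; each list holds the class sets of
    that cell's candidate entities, best candidate first.
    """
    scores = defaultdict(float)
    for candidates in candidates_per_cell:
        for rank, classes in enumerate(candidates, start=1):
            for cls in classes:
                scores[cls] += 1.0 / rank  # higher-ranked entities count more
    # Sort by score (descending), then by name for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))

# Toy data loosely modelled on the slide; the class sets are hypothetical.
column = [
    [  # cell "Alabama": Alabama, Alabama_(band), Alabama_(people)
        {"dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
         "yago:StatesOfTheUnitedStates"},
        {"dbpedia-owl:Band"},
        {"yago:NativeAmericanTribes"},
    ],
    [  # cell "Arizona"
        {"dbpedia-owl:Place", "yago:StatesOfTheUnitedStates"},
        {"dbpedia-owl:Film"},
    ],
]
ranked = rank_column_classes(column)
print(ranked[:3])
```

Classes shared by the top candidates of many cells rise to the top, while classes that come only from lower-ranked candidates (Band, Film) fall toward the bottom.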
Linking table cells to entities
Macon + County + Alabama + 1 + 87 + Farms with Black or African American operators + ... + dbpedia-owl:AdministrativeRegion
1. Macon County, Alabama
2. Macon County, Illinois
Classifier 1 – SVM Rank (ranks the set of entities)
Classifier 2 – SVM (Computes Confidence)
Link to the top ranked entity
Don’t link
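The two-stage decision on this slide (rank the candidates, then decide whether to link at all) can be sketched as follows. This is only an illustration: the trained SVM Rank and SVM classifiers are replaced by a token-overlap score and a fixed threshold, and the names `overlap_score`, `link_cell`, and the threshold value are assumptions.

```python
def overlap_score(query_tokens, candidate_label):
    """Fraction of candidate-label tokens that also occur in the query."""
    label_tokens = candidate_label.lower().replace(",", " ").split()
    query = {t.lower() for t in query_tokens}
    hits = sum(1 for t in label_tokens if t in query)
    return hits / len(label_tokens)

def link_cell(query_tokens, candidates, threshold=0.8):
    """Return the top-ranked candidate, or None when confidence is too low.

    Stage 1 stands in for the ranking classifier and stage 2 for the
    confidence classifier; both are simple stand-ins, not trained SVMs.
    """
    if not candidates:
        return None
    # Stage 1: rank the candidate entities.
    ranked = sorted(candidates,
                    key=lambda c: overlap_score(query_tokens, c),
                    reverse=True)
    best = ranked[0]
    # Stage 2: link only if the confidence clears the threshold.
    confidence = overlap_score(query_tokens, best)
    return best if confidence >= threshold else None

query = ["Macon", "County", "Alabama", "1", "87"]
candidates = ["Macon County, Illinois", "Macon County, Alabama"]
linked = link_cell(query, candidates)
unlinked = link_cell(["Springfield"], candidates)
print(linked, unlinked)
```

Keeping the "link or don't link" decision separate from the ranking lets the system leave a cell unlinked when even the best candidate is a poor match.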
Identify Relations
State
Alabama
Arizona
Arkansas
California
County
Macon
Navajo
Union
Humboldt
Rel ‘A’
Rel ‘A’
Rel ‘A’, ‘C’
Rel ‘A’, ‘B’, ‘C’
Rel ‘A’, ‘B’
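One way to read the candidate sets above is as votes: the relation found most often across the rows is chosen for the column pair. The sketch below assumes a plain majority vote with alphabetical tie-breaking; the function name `pick_relation` and the voting rule are assumptions, not the baseline system's scoring.

```python
from collections import Counter

def pick_relation(relations_per_row):
    """Pick the relation supported by the most rows (simple majority vote)."""
    votes = Counter(rel for rels in relations_per_row for rel in rels)
    if not votes:
        return None
    # Highest vote count wins; ties are broken alphabetically.
    return min(votes, key=lambda r: (-votes[r], r))

# The five candidate sets listed on the slide.
rows = [{"A"}, {"A"}, {"A", "C"}, {"A", "B", "C"}, {"A", "B"}]
best = pick_relation(rows)
print(best)
```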
Generating a linked RDF representation
@prefix dbpedia: <http://dbpedia.org/resource/>.
@prefix dbpedia-owl: <http://dbpedia.org/ontology/>.
@prefix dbpprop: <http://dbpedia.org/property/>.
@prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#>.
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
[ a dgtwc:DataEntry;
  dbpedia-owl:state dbpedia:Alabama;
  dbpedia:FIPS_county_code 000;
  dbpedia:Federal_Information_Processing_Standard_state_code 001;
  dbpedia-owl:ethnicGroup "Farm with women principal operators"@en;
  dbpedia-owl:number 6444 ].
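A minimal sketch of how one interpreted row might be serialised as a Turtle blank node like the one above. The helper `turtle_entry` is hypothetical, it assumes the prefixes are declared elsewhere, and it takes predicate and object terms that are already written in Turtle syntax; the FIPS properties are left out of the toy row.

```python
def turtle_entry(entry_class, properties):
    """Serialise one table row as a Turtle blank node.

    properties: list of (predicate, object) pairs whose terms are already
    written in Turtle syntax (prefixed names, numbers, or quoted literals).
    """
    lines = [f"[ a {entry_class}"]
    lines += [f"  {pred} {obj}" for pred, obj in properties]
    return " ;\n".join(lines) + " ] ."

row = [
    ("dbpedia-owl:state", "dbpedia:Alabama"),
    ("dbpedia-owl:ethnicGroup", '"Farm with women principal operators"@en'),
    ("dbpedia-owl:number", "6444"),
]
entry = turtle_entry("dgtwc:DataEntry", row)
print(entry)
```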
Evaluation of the baseline system
Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)*
Total number of entities: 639 (611)*
* The number in brackets excludes columns that contained numbers
Evaluation # 1 (MAP)
• Compared the system’s ranked list of labels against a human-ranked list of labels
• Metric - Average Precision (a.p.) [Mean Average Precision gives the mean over a set of queries]
• Commonly used in the Information Retrieval domain to compare two ranked sets
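For reference, a sketch of the standard binary-relevance definition of average precision and its mean over several columns. The slides do not say which variant was used in the evaluation (for example, how the order of the human-ranked list is taken into account), so treat this as the textbook form rather than the exact metric behind the reported score. The example lists are illustrative.

```python
def average_precision(system_ranked, relevant):
    """Average precision of a ranked list against a set of relevant labels.

    Precision is sampled at each rank where a relevant label appears and
    averaged over the number of relevant labels.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """Mean of per-column average precisions; runs is a list of (ranked, relevant)."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(
    ["Person", "Politician", "President"],
    {"President", "Politician", "OfficeHolder"},
)
print(round(ap, 3))
```

In this example the system list hits a relevant label at ranks 2 and 3, so the average precision is (1/2 + 2/3) / 3.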
Evaluation # 1 (MAP)
[Chart: Average Precision (y-axis, 0 to 1.2) for each Column # (x-axis, 0 to 60)]
MAP = 0.411
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
Accuracy for Entity Linking
[Chart: % of correct and incorrect instances linked, by category]
Category | Correct | Incorrect
Person | 83.05% | 16.95%
Place | 80.43% | 19.57%
Organization | 61.90% | 38.10%
Other | 29.22% | 70.78%
Overall Accuracy: 66.12 %
Lessons Learnt
• Sequential System – Error percolated from one phase to the next
• Current system favors general classes over specific ones (MAP score = 0.411)
• Largely, a system driven by “heuristics”
• Although we consider evidence, we don’t do assignment jointly
KB
a,b,c,…
m,n,o,… x,y,z,…
Probabilistic Graphical Model / Joint Inference Model
KB
Domain Knowledge – Linked Data Cloud / Medical Domain / Open Govt.
DomainQuery
Linked Data
A “Domain Independent” Framework
Joint Inference over evidence in a table
Probabilistic Graphical Models
Parameterized graphical model
C1 C2 C3
ψ5
R11 R12 R13 R21 R22 R23 R31 R32 R33
ψ3 ψ3 ψ3
ψ4 ψ4 ψ4
Function that captures the affinity between the column headers and row values
Row value
Variable Node: Column header
Captures interaction between column headers
Captures interaction between row values
Factor Node
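To make the structure concrete, here is a toy sketch of joint assignment over a 2-column, 2-row table with three kinds of factors: header-to-cell affinity, interaction between row values, and interaction between column headers. Every label, weight, and function name is made up for illustration, and exhaustive enumeration is used only because the example is tiny; the slides do not state which inference procedure the framework uses.

```python
from itertools import product

# Toy domains; every label and weight below is a made-up illustration.
HEADER_DOMAIN = [["State", "Band"], ["County", "City"]]  # one list per column
CELL_DOMAIN = [  # CELL_DOMAIN[row][col] = candidate entities for that cell
    [["Alabama", "Alabama_(band)"],
     ["Macon_County,_Alabama", "Macon,_Georgia"]],
    [["Arizona", "Arizona_(song)"],
     ["Navajo_County,_Arizona", "Navajo,_New_Mexico"]],
]
TYPE_OF = {
    "Alabama": "State", "Alabama_(band)": "Band", "Arizona": "State",
    "Macon_County,_Alabama": "County", "Macon,_Georgia": "City",
    "Navajo_County,_Arizona": "County", "Navajo,_New_Mexico": "City",
}
RELATED = {("Alabama", "Macon_County,_Alabama"),
           ("Arizona", "Navajo_County,_Arizona")}
COMPATIBLE_HEADERS = {("State", "County")}

def header_cell_affinity(cls, entity):
    """Affinity between a column header class and a row value."""
    return 2.0 if TYPE_OF.get(entity) == cls else 0.5

def row_interaction(left, right):
    """Interaction between the two row values of one row."""
    return 3.0 if (left, right) in RELATED else 1.0

def header_interaction(c0, c1):
    """Interaction between the two column headers."""
    return 2.0 if (c0, c1) in COMPATIBLE_HEADERS else 1.0

def score(headers, cells):
    """Unnormalised joint score: the product of all factor values."""
    total = header_interaction(headers[0], headers[1])
    for row in cells:
        total *= row_interaction(row[0], row[1])
        for col, entity in enumerate(row):
            total *= header_cell_affinity(headers[col], entity)
    return total

def map_assignment():
    """Brute-force search for the highest-scoring joint assignment."""
    best, best_score = None, float("-inf")
    flat_domains = [dom for row in CELL_DOMAIN for dom in row]
    n_cols = len(HEADER_DOMAIN)
    for headers in product(*HEADER_DOMAIN):
        for flat in product(*flat_domains):
            cells = [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
            s = score(headers, cells)
            if s > best_score:
                best, best_score = (headers, cells), s
    return best, best_score

(best_headers, best_cells), best_score = map_assignment()
print(best_headers, best_cells, best_score)
```

Because all variables are scored together, evidence from one cell (for example, a county that belongs to a state) can change the label chosen for another cell or for a column header, which is what a sequential pipeline cannot do.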
Challenges
Challenges - Literals
Population
690,000
345,000
510,020
120,000
Age
75
65
50
25
Population / Profit ?
Age / Percentage ?
Use evidence from the rest of the table to decide
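A rough sketch of the idea, assuming a two-step heuristic: first keep the properties whose plausible range covers every number in the column, then let the classes of the other columns break the tie. The ranges, the hint table, and the function names are all hypothetical and chosen only to make the example run.

```python
# Hypothetical plausible ranges and context hints, for illustration only.
PLAUSIBLE_RANGE = {
    "Age": (0, 120),
    "Percentage": (0, 100),
    "Population": (1_000, 10**10),
    "Profit": (-10**12, 10**12),
}
CONTEXT_HINT = {"Person": "Age", "Place": "Population", "Organization": "Profit"}

def candidate_properties(values):
    """Properties whose plausible range covers every value in the column."""
    return [prop for prop, (low, high) in PLAUSIBLE_RANGE.items()
            if all(low <= v <= high for v in values)]

def choose_property(values, other_column_classes):
    """Use the classes of the other columns to pick among the candidates."""
    candidates = candidate_properties(values)
    for cls in other_column_classes:
        hinted = CONTEXT_HINT.get(cls)
        if hinted in candidates:
            return hinted
    # No usable evidence: fall back to the first candidate, if any.
    return candidates[0] if candidates else None

ages = [75, 65, 50, 25]
populations = [690_000, 345_000, 510_020, 120_000]
print(choose_property(ages, ["Person"]))
print(choose_property(populations, ["Organization"]))
```

The numbers alone leave both columns ambiguous; only the class of a neighbouring column separates Age from Percentage or Population from Profit.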
Challenges - Metadata
More Challenges !
• Sampling and Interpretation
– Data set 1425 has > 400,000 rows !
• Human in the Loop
Conclusion
• Presented a framework for inferring the semantics of tables and generating Linked Data
• Evaluation of the baseline system shows the feasibility of tackling the problem
• Work in progress on building a framework grounded in graphical models and probabilistic reasoning
• Working on tackling challenges posed by tables from domains such as medicine and open government data
References
1. Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. PVLDB 1(1):538–549.
2. Hurst, M. 2006. Towards a theory of tables. IJDAR 8(2-3):123–131.
3. Embley, D. W.; Lopresti, D. P.; and Nagy, G. 2006. Notes on contemporary table recognition. In Document Analysis Systems, 164–175.
4. Wang, J.; Shao, B.; Wang, H.; and Zhu, K. Q. 2010. Understanding tables on the web. Technical report, Microsoft Research Asia.
5. Venetis, P.; Halevy, A.; Madhavan, J.; Pasca, M.; Shen, W.; Wu, F.; Miao, G.; and Wu, C. 2011. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).
6. Limaye, G.; Sarawagi, S.; and Chakrabarti, S. 2010. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).
Thank You ! Questions ?
@varish
http://ebiq.org/h/Varish/Mulwad