machine learning for knowledge extraction from wikipedia & other semantically weak sources -...
DESCRIPTION
Wikipedia contains a wealth of collective knowledge, but its semi-structured design and idiosyncratic markup make mining it a formidable challenge. This session will examine techniques for mining semantically weak data sources for explicit facts.

The session will use WEX, a preprocessed normalization of Wikipedia designed to make the corpus easily accessible to developers interested in machine learning, natural language processing, or knowledge extraction. The process through which WEX is prepared will be discussed as a guide to creating mineable structures from semi-structured data, followed by approaches to machine extraction on structures of mixed data quality.

The session is targeted at intermediate developers with an interest in machine learning or knowledge extraction (no experience with either is assumed).

The demonstrations use the XPath capability of Postgres 8.3 to simplify the programming model and present examples in Python, but the data and principles are compatible with any modern data infrastructure.

TRANSCRIPT
![Page 1: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/1.jpg)
We all love Wikipedia!
![Page 2: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/2.jpg)
Wikipedia has lots of data...
![Page 3: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/3.jpg)
Lots of semi-structured data!
![Page 4: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/4.jpg)
At Freebase, we use Wikipedia as a source for extracting facts and relationships
![Page 5: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/5.jpg)
Some Interesting Data in Wikipedia
•Infoboxes
•Categories
•Text
![Page 6: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/6.jpg)
Problems With Wikipedia Data
•The data is dirty
•Wiki markup is hard to parse..
•.. and often dirty
•.. and not well defined
•Properties and relations are in Wiki markup
•Page redirects have to be resolved
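Resolving redirects amounts to following a chain of title-to-title mappings until a real article is reached. The following is a minimal sketch under stated assumptions: the `redirects` dictionary, its entries, and the `max_hops` guard are all invented for illustration and are not taken from the talk or from WEX.

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirect entries until reaching a title that is not a redirect.

    `redirects` maps a redirecting title to its target (hypothetical data).
    The `seen` set stops the walk if the chain loops back on itself.
    """
    seen = set()
    while title in redirects and title not in seen and len(seen) < max_hops:
        seen.add(title)
        title = redirects[title]
    return title


# Toy data, invented for this example.
redirects = {
    "Sony Corp": "Sony Corporation",
    "Sony Corporation": "Sony",
    "A": "B",
    "B": "A",  # a deliberate loop
}

resolved = resolve_redirect("Sony Corp", redirects)
looped = resolve_redirect("A", redirects)
```

Here `resolved` ends up as the final target of a two-hop chain, while `looped` simply stops once the walk revisits a title it has already seen.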
![Page 7: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/7.jpg)
{{Infobox Company
| company_name = Sony Corporation <br>ソニー株式会社
| company_logo = [[Image:Sony logo.svg|220px|]]
| slogan = like.no.other
| company_type = [[Public company|Public]]<br>({{Tyo|6758}})<br>({{nyse|SNE}})
| foundation = [[May 7]] [[1946]] (adopted current name in 1958)<ref name=sonycorpinfo>{{citeweb|url=http://www.sony.net/SonyInfo/CorporateInfo/|title=Sony Global - Corporate Information|accessdate=2007-07-24}}</ref>
| founder = [[Masaru Ibuka]]<br>[[Akio Morita]]
| location = {{flagicon|Japan}} [[Minato, Tokyo]], [[Japan]]<ref name="sonycorpinfo"/>
| area_served = [[Worldwide]]
| key_people = [[Sir Howard Stringer]]<br><small>([[Chairman]]) & ([[CEO]])</small><ref name="sonycorpinfo"/><br />[[Ryoji Chubachi]]<br><small>([[President]]) & ([[CEO|Electronics CEO]])<ref name="sonycorpinfo"/>
| industry = [[Consumer electronics]]<br>[[Entertainment]]
| products = [[Audio]]<br>[[Video]]<br>[[Televisions]]<br>[[Information Technology|Communications and Information Technology]]<br>[[Semiconductors]]<br>[[Electronic components]]<br>[[Motion Picture]]<br>[[Music]]<br>[[Online|Online Business]]<br>[[Sony Playstation]]
| services = [[Financial services]]
| market cap = [[United States Dollar|US$]] 40.56 Billion (''2008'')
| revenue = {{profit}} [[United States Dollar|US$]] 88.714 Billion (''2008'')<ref name="2007 Q4">{{citeweb|url=http://www.sony.net/SonyInfo/IR/financial/fr/07q4_sony.pdf|title=Sony Corporation Earnings release for the fiscal year ended March 31, 2008|format=PDF}}</ref>
| operating_income = {{profit}} [[United States Dollar|US$]] 3.745 Billion (''2008'')
| net_income = {{profit}} [[United States Dollar|US$]] 3.694 Billion (''2008'')
| assets = {{increase}} [[United States Dollar|US$]] 117.603 Billion (''2008'')
| equity = {{increase}} [[United States Dollar|US$]] 32.465 Billion (''2008'')
| num_employees = 180,500 (as of [[March 31]] [[2008]]) <ref name="sonycorpinfo"/>
| parent =
| subsid = [[Sony Corporation shareholders and subsidiaries|List of the subsidiaries]]
| homepage = [http://www.sony.net Sony.net]
| footnotes =
}}
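Even a small task such as pulling the link targets out of one infobox value needs markup-aware handling. The sketch below is one rough way to do it with a regular expression; the `WIKILINK` pattern and the `link_targets` helper are assumptions made for this example, not the parser used for WEX, and the pattern ignores nested templates, `<ref>` tags, and other real-world cases.

```python
import re

# [[Target]] or [[Target|label]]: capture only the target part.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def link_targets(markup):
    """Return the link targets found in a fragment of wiki markup."""
    return [target.strip() for target in WIKILINK.findall(markup)]


# Two values copied from the Sony infobox above.
founder = "[[Masaru Ibuka]]<br>[[Akio Morita]]"
company_type = "[[Public company|Public]]<br>({{Tyo|6758}})<br>({{nyse|SNE}})"

founders = link_targets(founder)
types = link_targets(company_type)
```

The stock-ticker templates in `company_type` are skipped entirely by this pattern, which hints at how much of the infobox a plain regex leaves unparsed.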
![Page 8: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/8.jpg)
Problems With Wikipedia Data
•Wikipedia is huge!
  •2,150,000 articles
  •7,100,000 category references
  •found in 280,000 categories
  •54,029 non-trivial templates (>= 5 uses)
  •50,671,533 template name-value properties
•Wikipedia is growing!
  •Grows 2% a week
  •25,170 new articles
  •39,571 new redirects
  •8,000 deletes
  •5,000 name changes
  •1,000 article splits
  •1,000 id changes
•Before this talk is over, there will be 150 NEW articles!
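The 150-article figure can be checked against the weekly count on the slide. This is a quick back-of-the-envelope sketch: it assumes new articles arrive at a constant rate, and the constant names are made up for the example.

```python
NEW_ARTICLES_PER_WEEK = 25170      # from the slide
MINUTES_PER_WEEK = 7 * 24 * 60     # 10080

# Assumption: new articles arrive at a constant rate.
articles_per_minute = NEW_ARTICLES_PER_WEEK / float(MINUTES_PER_WEEK)
minutes_for_150 = 150 / articles_per_minute

print("%.2f new articles per minute" % articles_per_minute)
print("150 new articles take about %.0f minutes" % minutes_for_150)
```

At roughly 2.5 new articles per minute, 150 articles correspond to about an hour, which is consistent with the claim if the talk slot is around 60 minutes (the slides do not state the length).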
![Page 9: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/9.jpg)
The Freebase Wikipedia Extraction (WEX)
[Pipeline diagram]
Current Wikipedia dump (wiki markup articles)
→ Wiki2XML parser (Magnus Manske)
→ Current Wikipedia dump (XML markup articles)
→ Text articles, templates, redirects, categories, sections, Freebase mappings
→ Big Postgres database!
![Page 10: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/10.jpg)
WEX Article XML
<template name="Infobox_President">
<param name="name">Abraham Lincoln</param>
<param name="nationality">American</param>
<param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param>
<param name="order">16th<space/><link>
<target>President of the United States</target></link></param>
<param name="term_start"><link><target>March 4</target></link>,
<space/><link><target>1861</target></link></param>
<param name="term_end"><link><target>April 15</target></link>,
<space/><link><target>1865</target></link></param>
<param name="predecessor"><link><target>James Buchanan</target></link></param>
<param name="successor"><link><target>Andrew Johnson</target></link></param>
....
</template>
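Once templates are in this XML form, standard XML tooling can read them. The sketch below parses a trimmed copy of the sample with Python's `xml.etree.ElementTree`; the helper names `flatten` and `template_params` are invented for the example, and treating `<space/>` as a single blank is an assumption about the WEX format rather than something the slides state.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample above (elided params left out).
SAMPLE = """<template name="Infobox_President">
<param name="name">Abraham Lincoln</param>
<param name="order">16th<space/><link><target>President of the United States</target></link></param>
<param name="term_start"><link><target>March 4</target></link>,<space/><link><target>1861</target></link></param>
<param name="predecessor"><link><target>James Buchanan</target></link></param>
</template>"""


def flatten(elem):
    """Return the visible text of elem; <space/> is rendered as one blank (assumption)."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(" " if child.tag == "space" else flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


def template_params(xml_text):
    """Map each param name to its flattened text and its list of link targets."""
    root = ET.fromstring(xml_text)
    return {
        param.get("name"): {
            "text": flatten(param).strip(),
            "links": [target.text for target in param.iter("target")],
        }
        for param in root.findall("param")
    }


params = template_params(SAMPLE)
```

Keeping the link targets separate from the flattened text makes it possible to treat `March 4` and `1861` as references to other articles rather than as plain strings.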
![Page 11: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/11.jpg)
WEX Schema
articles
category_members
sections
template_calls
template_values
freebase_wpid
freebase_names
freebase_types
redirects
![Page 12: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/12.jpg)
SELECT xpath('/param/text()', template_values.xml)
FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = 'Template:Infobox Bridge'
  AND template_values.name = 'mainspan'
  AND articles.name = 'Fremont Bridge (Portland)'
Result: "{"1,255 ft (382.5 m)","longest in Oregon"}"
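The shape of this query can be tried without a WEX installation. The sketch below rebuilds the three tables in an in-memory SQLite database and uses ElementTree as a stand-in for the Postgres `xpath` call. Everything in the toy data is an assumption: the column placement is inferred from the query, the ids are invented, and the stored XML (including the `<br/>` separator) is a guess that merely reproduces the two text nodes in the result.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articles (wpid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE template_calls (id INTEGER PRIMARY KEY, article_wpid INTEGER,
                             template_article_name TEXT);
CREATE TABLE template_values (call_id INTEGER, name TEXT, xml TEXT);
INSERT INTO articles VALUES (1, 'Fremont Bridge (Portland)');
INSERT INTO template_calls VALUES (10, 1, 'Template:Infobox Bridge');
INSERT INTO template_values VALUES
    (10, 'mainspan', '<param name="mainspan">1,255 ft (382.5 m)<br/>longest in Oregon</param>');
""")


def direct_text_nodes(xml_text):
    """Stand-in for xpath('/param/text()'): text nodes directly under the root."""
    root = ET.fromstring(xml_text)
    nodes = [root.text] + [child.tail for child in root]
    return [node for node in nodes if node and node.strip()]


row = conn.execute("""
    SELECT template_values.xml
    FROM template_values
    INNER JOIN template_calls ON template_values.call_id = template_calls.id
    INNER JOIN articles ON articles.wpid = template_calls.article_wpid
    WHERE template_calls.template_article_name = ?
      AND template_values.name = ?
      AND articles.name = ?
""", ("Template:Infobox Bridge", "mainspan", "Fremont Bridge (Portland)")).fetchone()

mainspan = direct_text_nodes(row[0])
```

Doing the text extraction inside the database, as the original query does, avoids shipping the XML to the client; the Python stand-in is only there because SQLite has no XPath support.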
![Page 13: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/13.jpg)
![Page 14: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/14.jpg)
![Page 15: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/15.jpg)
![Page 16: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/16.jpg)
![Page 17: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/17.jpg)
![Page 18: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/18.jpg)
words become “features”
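The training and test code later in the deck calls a `getwords` function whose body does not appear on the slides. The following is one plausible sketch, assuming that a feature is simply a distinct lowercase word of moderate length; the length limits are arbitrary choices made for this example.

```python
import re


def getwords(text):
    """Turn raw article text into a set of word features.

    Assumption: features are distinct lowercase alphabetic words
    between 3 and 19 characters long.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return set(w for w in words if 2 < len(w) < 20)


features = getwords("The pirate ship left the coast.")
```

Returning a set means each word counts once per article, so a long article that repeats a word does not outweigh a short one.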
![Page 19: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/19.jpg)
category_list = ['Category:Ninjas', 'Category:Pirates', 'Category:Assassins',
                 'Category:Apple Inc. employees', 'Category:Microsoft employees',
                 'Category:Google employees', 'Category:Free software programmers',
                 'Category:Computer programmers']
print("Training classifiers...")

# Get the members of every category
name_classes = {}
for category in category_list:
    members = get_category_members(cur, category)
    print(str(len(members)) + " examples for " + category)
    for name in members:
        # An article can belong to several of the categories
        name_classes.setdefault(name, set()).add(category)
![Page 20: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/20.jpg)
def get_category_members(cur, category):
    records = []
    queue = [category]
    recordsSeen = set()
    # Breadth-first walk through the category and its subcategories,
    # stopping once roughly 500 article names have been collected
    while len(queue) > 0 and len(records) < 500:
        currentCategory = queue.pop(0)
        cur.execute("select articles.wpid, articles.name " +
                    "from wikipedia.category_members, wikipedia.articles " +
                    "where category_members.category_name like %s and " +
                    "articles.wpid=category_members.article_wpid",
                    (currentCategory,))
        result = cur.fetchall()
        for wpid, name in result:
            if wpid not in recordsSeen:
                recordsSeen.add(wpid)
                if name.startswith("Category:"):
                    # Subcategory: visit it later
                    queue.append(name)
                else:
                    records.append(name)
    return records
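The traversal logic can be exercised without a database. The sketch below applies the same breadth-first walk to an in-memory dictionary; `collect_members`, the `children` mapping, and the toy category membership are all invented for illustration.

```python
from collections import deque


def collect_members(children, root, limit=500):
    """Breadth-first walk over a category graph.

    `children` maps a category name to the names filed under it
    (articles and subcategories alike). As in the database version,
    the limit is only checked between categories.
    """
    records = []
    seen = {root}
    queue = deque([root])
    while queue and len(records) < limit:
        current = queue.popleft()
        for name in children.get(current, []):
            if name in seen:
                continue
            seen.add(name)
            if name.startswith("Category:"):
                queue.append(name)   # subcategory: visit later
            else:
                records.append(name)
    return records


# Toy graph, invented for this example; note the cycle back to the root.
toy = {
    "Category:Pirates": ["Henri Caesar", "Category:Fictional pirates"],
    "Category:Fictional pirates": ["Long John Silver", "Jack Sparrow",
                                   "Category:Pirates"],
}

members = collect_members(toy, "Category:Pirates")
```

The `seen` set matters because category graphs can contain cycles; without it, the walk over the toy graph would never terminate.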
![Page 21: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/21.jpg)
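for name in name_classes:
    name, text = get_article_text(cur, name)
    # Only the first 1024 characters of each article are used
    words = getwords(text[0:1024])
    for cat, cl in classifiers:
        # Positive example if the article is in this category, negative otherwise
        if cat in name_classes[name]:
            cl.train(words, 1)
        else:
            cl.train(words, 0)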
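The loop relies on a `classifiers` list and on objects offering `train`, `prob`, and `classify`, none of which are shown on the slides. The sketch below is one possible shape for such an object: a small binary naive Bayes classifier with add-one smoothing. The class name, the smoothing, and the choice to multiply only over the words present are assumptions made for this example and may differ from the code actually used in the talk.

```python
from collections import defaultdict


class BinaryNaiveBayes(object):
    """Hypothetical stand-in for the classifier objects used in the loop."""

    def __init__(self):
        self.word_counts = {0: defaultdict(int), 1: defaultdict(int)}
        self.doc_counts = {0: 0, 1: 0}

    def train(self, words, label):
        """Record one document (a collection of words) under label 0 or 1."""
        self.doc_counts[label] += 1
        for word in set(words):
            self.word_counts[label][word] += 1

    def word_prob(self, word, label):
        # Add-one smoothing so unseen words never give a zero probability.
        return (self.word_counts[label][word] + 1.0) / (self.doc_counts[label] + 2.0)

    def prob(self, words, label):
        """Unnormalized score: P(label) times the product of P(word | label)."""
        total = self.doc_counts[0] + self.doc_counts[1]
        if total == 0:
            return 0.0
        score = float(self.doc_counts[label]) / total
        for word in set(words):
            score *= self.word_prob(word, label)
        return score

    def classify(self, words):
        return 1 if self.prob(words, 1) > self.prob(words, 0) else 0


# Toy training data, invented for this example.
cl = BinaryNaiveBayes()
cl.train({"ship", "sea", "crew"}, 1)
cl.train({"ship", "captain"}, 1)
cl.train({"code", "python"}, 0)
cl.train({"software", "code"}, 0)

pirate_guess = cl.classify({"ship", "sea"})
coder_guess = cl.classify({"code"})
```

With a class like this, the list consumed by the loop could be built as `[(cat, BinaryNaiveBayes()) for cat in category_list]`, giving one independent yes/no classifier per category.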
![Page 22: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/22.jpg)
def get_article_text(cur, name):
    # Returns a (name, text) row, or None if no article has that name
    cur.execute("select name, text from " +
                "wikipedia.articles where name=%s", (name,))
    return cur.fetchone()
![Page 23: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/23.jpg)
# Test set:
test_set = ["Henri Caesar", "Long John Silver", "Jack Sparrow",
            "Storm Shadow (G.I. Joe)", "Leonardo (TMNT)", "Bill Gates",
            "Steve Jobs", "Richard Stallman", "Larry Page",
            "Guido van Rossum", "Larry Wall", "Jerry Yang"]

# Run tests:
for testName in test_set:
    name, text = get_article_text(cur, testName)
    print(name)
    words = getwords(text[0:1024])
    for cat, cl in classifiers:
        py, pn = cl.prob(words, 1), cl.prob(words, 0)
        # Columns: category, predicted class, ratio of "yes" to "no" score
        print('%s\t%s\t%f' % (cat, cl.classify(words), py/pn if pn > 0 else 100))
![Page 24: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/24.jpg)
Category:Ninjas 0 0.000000
Category:Pirates 1 155082805066493952.000000
Category:Assassins 0 0.000001
Category:Apple Inc. employees 0 0.000000
Category:Microsoft employees 0 0.000000
Category:Google employees 0 0.000000
Category:Free software programmers 0 0.000000
Category:Computer programmers 0 0.000000
![Page 25: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/25.jpg)
Category:Ninjas 1 166627867883323968.000000
Category:Pirates 1 19.751727
Category:Assassins 1 413388475722811.625000
Category:Apple Inc. employees 0 0.000000
Category:Microsoft employees 0 0.000000
Category:Google employees 0 0.000000
Category:Free software programmers 0 0.000000
Category:Computer programmers 0 0.000000
![Page 26: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/26.jpg)
ninja japan
characters period
sengoku service
they use
means term based
different japanese
era heroes
appearance kanji
![Page 27: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/27.jpg)
pirate
coast
ship
pirates
crew
off
sea
captured
island
century
ships
piracy
north
merchant
captain
according
![Page 28: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/28.jpg)
can we construct a sentence that fits into both categories?
(without using the word “pirate” or “ninja”)
![Page 29: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/29.jpg)
“Toby Segaran lived during the sengoku period in Japan. He spent many years at
sea capturing Japanese ships.”
![Page 30: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/30.jpg)
http://code.google.com/p/wexbayes/
![Page 31: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/31.jpg)
Category:Apple Inc. employees 0.000000
Category:Microsoft employees 0.000000
Category:Google employees 0.000000
Category:Free software programmers 0.000000
Category:Computer programmers 0.000000
Category:Ninjas 0.008158
Category:Pirates 21125425312885.750000
Category:Assassins 237.408562
![Page 32: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/32.jpg)
Category:Ninjas 2924533519139.380859
Category:Pirates 0.003120
Category:Assassins 67800277337186.242188
![Page 33: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/33.jpg)
Category:Ninjas 300.861827
Category:Pirates 8392781192.375138
Category:Assassins 3817.111709
![Page 34: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/34.jpg)
Category:Ninjas 0.000000
Category:Pirates 0.000000
Category:Assassins 0.000000
Category:Apple Inc. employees 63.870863
Category:Microsoft employees 186751.882154
Category:Google employees 0.012458
Category:Free software programmers 0.000197
Category:Computer programmers 1222.542202
![Page 35: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/35.jpg)
Category:Apple Inc. employees 66414979293154.234375
Category:Microsoft employees 694373.180082
Category:Google employees 2381.361809
Category:Free software programmers 0.014712
Category:Computer programmers 871493.654163
![Page 36: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/36.jpg)
Category:Apple Inc. employees 11.530269
Category:Microsoft employees 1.616829
Category:Google employees 2703.744581
Category:Free software programmers 12521439594.583622
Category:Computer programmers 141466542964.903381
![Page 37: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/37.jpg)
Category:Apple Inc. employees 23162.014385
Category:Microsoft employees 99.417180
Category:Google employees 291981001482.833679
Category:Free software programmers 0.026512
Category:Computer programmers 258.589845
![Page 38: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/38.jpg)
Category:Apple Inc. employees 2.018667
Category:Microsoft employees 0.310716
Category:Google employees 84.447472
Category:Free software programmers 21693.027739
Category:Computer programmers 7538656551.050776
![Page 39: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/39.jpg)
Category:Apple Inc. employees 518.061964
Category:Microsoft employees 16855.582495
Category:Google employees 940060750.923012
Category:Free software programmers 259957538.360797
Category:Computer programmers 462842056873.530640
![Page 40: Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008](https://reader033.vdocument.in/reader033/viewer/2022060109/5555a9dcd8b42afe5d8b46d9/html5/thumbnails/40.jpg)
Category:Apple Inc. employees 0.063467
Category:Microsoft employees 0.004360
Category:Google employees 79.474061
Category:Free software programmers 0.000001
Category:Computer programmers 0.000108