The Road to the Semantic Web
DESCRIPTION
A seminar lecture presenting the Carnegie Mellon research project "Read the Web" (http://rtw.ml.cmu.edu/rtw/). Presented at the Databases & the Internet seminar at the Hebrew University of Jerusalem.

TRANSCRIPT
The Road to the Semantic Web
Michael Genkin
SDBI 2010@HUJI
Michael Genkin ([email protected])
"The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
Tim Berners-Lee, James Hendler and Ora Lassila; Scientific American, May 2001
Over 25 billion RDF triples (October 2010)
More than 24 billion web pages (June 2010)
Probably more than one triple per page; likely many more
How will we populate the Semantic Web?
Humans will enter structured data
Data-store owners will share their data
Computers will read unstructured data
Read the Web
http://rtw.ml.cmu.edu/rtw/ (or google it)
Roadmap
Motivation
Some definitions: natural language processing, machine learning
Macro reading the web
Coupled training
NELL
Demo
Summary
Some Definitions
Natural Language Processing
Machine Learning
Natural Language Processing
Part-of-speech tagging (e.g. noun, verb)
Noun phrase: a phrase that normally consists of a (modified) head noun; "pre-modified" (e.g. this, that, the red…) or "post-modified" (e.g. …with long hair, …where I live)
Proper noun: a noun which represents a unique entity (e.g. Jerusalem, Michael)
Common noun: a noun which represents a class of entities (e.g. car, university)
Learning: What is it?
Assume there is some knowledge base KB (the experience), some algorithm performing a set of tasks T, and a performance metric Perf.
We will say that a computer program learns if its performance on the tasks in T, as measured by Perf, improves as the knowledge base KB grows.
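This definition can be made concrete with a toy sketch: the same performance metric, evaluated before and after the knowledge base grows, goes up. All names and data below are hypothetical illustrations, not part of the Read the Web system.

```python
# Toy illustration of the learning definition: Perf on a fixed task set
# improves as the knowledge base (the experience) grows.

def perf(kb, test_pairs):
    """Fraction of test queries answered correctly from the knowledge base."""
    return sum(1 for q, a in test_pairs if kb.get(q) == a) / len(test_pairs)

test_pairs = [("Jerusalem", "city"), ("Haifa", "city"), ("car", "common noun")]

kb_before = {"Jerusalem": "city"}                      # little experience
kb_after = {"Jerusalem": "city", "Haifa": "city",
            "car": "common noun"}                      # more experience

# The program "learned": Perf on the same tasks improved with experience.
assert perf(kb_after, test_pairs) > perf(kb_before, test_pairs)
```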
Training Methods
Supervised
We have a set of labeled examples (KB) and a domain (D): for every input x in KB we are given a label f(x), for some target function f. Examples might be positive or negative.
The learning algorithm A tries to find a function h such that h(x) ≈ f(x); h is called a classifier (or, for continuous outputs, a regression function).
Unsupervised
Distinguished from supervised learning in that there are no labeled examples (KB = D).
The unsupervised learning algorithm A tries to find a classifier that, given some x in D as input, returns some arbitrary label; i.e. the algorithm A analyses the structure of D.
Semi-Supervised
A middle way between supervised and unsupervised.
Uses a minimal amount of labeled examples and a large amount of unlabeled ones.
Learns the structure of D in an unsupervised manner, but uses the labeled examples to constrain the results; then repeats. Known as bootstrapping.
Bootstrapping
Iterative semi-supervised learning. Example for the category "city":
Seed instances: Jerusalem, Tel Aviv, Haifa
Learned patterns: "mayor of arg1", "life in arg1"
Extracted instances: Ness-Ziona, London, denial
Further patterns: "arg1 is home of", "traits such as arg1"
Extracted instances: anxiety, selfishness, Amsterdam
Under-constrained! Semantic drift
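The iteration can be sketched in code; the corpus, seeds, and pattern strings below are invented toy data, not NELL's actual pipeline. The ambiguous pattern "is home of" also matches a seed city, gets promoted, and then drags in non-cities: that is semantic drift.

```python
# A minimal bootstrapping loop for one category ("city"), showing how an
# under-constrained pattern causes semantic drift. All data is invented.

corpus = [
    ("mayor of", "Jerusalem"), ("mayor of", "Haifa"),
    ("life in", "Tel Aviv"),
    ("is home of", "Jerusalem"),   # ambiguous pattern also matches a seed
    ("is home of", "Amsterdam"), ("is home of", "denial"),
]

instances = {"Jerusalem", "Tel Aviv"}   # seed examples
patterns = set()

for _ in range(3):                      # a few bootstrapping iterations
    # Promote any pattern that co-occurs with a known instance...
    patterns |= {p for p, x in corpus if x in instances}
    # ...then promote any instance extracted by a promoted pattern.
    instances |= {x for p, x in corpus if p in patterns}

# "is home of" was promoted via Jerusalem and then extracted "denial":
# the category has drifted away from cities.
assert "denial" in instances
```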
Macro Reading the Web
Populating the Semantic Web by Macro-Reading Internet Text. T.M. Mitchell, J. Betteridge, A. Carlson, E.R. Hruschka Jr., and R.C. Wang. Invited paper, Proceedings of the International Semantic Web Conference (ISWC), 2009.
Problem Specification (1): Input
An initial ontology that contains:
Dozens of categories and relations (e.g. Company, CompanyHeadquarteredInCity)
Relations between categories and relations (e.g. mutual exclusion, type constraints)
A few seed examples of each predicate in the ontology
The web
Occasional access to a human trainer
Problem Specification (2): The Task
Run forever (24x7). Each day:
Run over ~500 million web pages.
Extract new facts and relations from the web to populate the ontology.
Perform better than the day before.
Populate the semantic web.
A Solution? An automatic, learning, macro-reader.
Micro vs. Macro Reading (1)
Micro-reading: the traditional NLP task of annotating a single web page to extract the full body of information contained in the document. NLP is hard!
Macro-reading: the task of "reading" a large corpus of web pages (e.g. the web) and returning a large collection of facts expressed in the corpus, but not necessarily all the facts.
Micro vs. Macro Reading (2)
Macro-reading is easier than micro-reading. Why?
Macro-reading doesn't require extracting every bit of information available.
In text corpora as large as the web, many important facts are stated redundantly, thousands of times, using different wordings.
Benefit by ignoring complex sentences.
Benefit by statistically combining evidence from many fragments to determine a belief in a hypothesis.
Why an Input Ontology?
The problem with understanding free text is that it can mean virtually anything.
By formulating macro-reading as populating an ontology, we allow the system to focus only on relevant documents.
The ontology can define meta-properties of its categories and relations.
This allows populating those parts of the semantic web for which an ontology is available.
Machine Learning Methods
Semi-supervised (use an ontology to learn).
Learn textual patterns for extraction.
Employ methods such as coupled training to improve accuracy.
Expand the ontology to improve performance.
Coupled Training
Bootstrapping – Revised
Iterative semi-supervised learning, revisiting the "city" example:
Seed instances: Jerusalem, Tel Aviv, Haifa
Learned patterns: "mayor of arg1", "life in arg1"
Extracted instances: Ness-Ziona, London, denial
Further patterns: "arg1 is home of", "traits such as arg1"
Extracted instances: anxiety, selfishness, Amsterdam
Coupled Training
Couple the training of multiple functions to make unlabeled data more informative
Makes the learning task easier by adding constraints
Coupling (1): Output Constraints
We wish to train a function f: X → Y, e.g. one that assigns the label city.
Assume we have two different functions, f1 and f2, that assign the label city but receive different inputs.
Coupling constraint: f1 and f2 must agree over the unlabeled data.
Coupling (1): Output Constraints – example
Sentence: "Nir Barkat is the mayor of Jerusalem", with arg1 = Jerusalem.
Two classifiers that receive arg1 and ask Y = city? must agree; a classifier asking Y = country? must disagree, since city and country are mutually exclusive.
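An output-coupling check can be sketched as follows; the two classifiers are hypothetical lookup stubs standing in for models trained on different inputs, and all data is invented.

```python
# Sketch of an output constraint: two classifiers for the same label,
# trained on different inputs, must agree on unlabeled examples.

def f1_is_city(noun_phrase):        # e.g. trained on free-text patterns
    return noun_phrase in {"Jerusalem", "Haifa"}

def f2_is_city(noun_phrase):        # e.g. trained on semi-structured lists
    return noun_phrase in {"Jerusalem", "London"}

def agreement_violations(f1, f2, unlabeled):
    """Count unlabeled examples on which the coupled classifiers disagree."""
    return sum(1 for x in unlabeled if f1(x) != f2(x))

unlabeled = ["Jerusalem", "Haifa", "London", "denial"]

# "Haifa" and "London" are the disagreements the coupling penalizes.
assert agreement_violations(f1_is_city, f2_is_city, unlabeled) == 2
```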
Coupling (2): Compositional Constraints
Assume we have two functions, f1: X1 → Y1 and f2: X2 → Y2, and a constraint on valid pairs (y1, y2) given (x1, x2).
Coupling constraint: the outputs (f1(x1), f2(x2)) must satisfy the constraint on (y1, y2).
e.g. a relation "type checks" its first argument.
Coupling (2): Compositional Constraints – example
Sentence: "Nir Barkat is the mayor of Jerusalem", giving MayorOf(X1, X2).
Each argument is checked against the candidate categories (city? location? politician?): X1 = Nir Barkat should be a politician, X2 = Jerusalem should be a city.
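A minimal type-checking sketch, under the assumption that the relation signature requires a politician and a city; category memberships and the signature are toy data, not NELL's ontology.

```python
# Sketch of a compositional (argument type-checking) constraint: a candidate
# relation instance is kept only if its arguments belong to the categories
# the relation requires. All memberships below are invented.

categories = {
    "politician": {"Nir Barkat"},
    "city": {"Jerusalem", "Haifa"},
}

# Hypothetical signature: MayorOf(politician, city)
relation_signature = ("politician", "city")

def type_checks(arg1, arg2, signature=relation_signature):
    t1, t2 = signature
    return arg1 in categories[t1] and arg2 in categories[t2]

assert type_checks("Nir Barkat", "Jerusalem")      # valid MayorOf instance
assert not type_checks("Jerusalem", "Nir Barkat")  # arguments swapped: rejected
```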
Coupling (3): Multi-view Agreement
We have a function f: X → Y, where X can be partitioned into two "views", X = (X1, X2).
Assume each of X1 and X2 alone suffices to predict Y.
We wish to learn f1: X1 → Y and f2: X2 → Y.
Coupling constraint: f1 and f2 must agree.
Coupling (3): Multi-view Agreement – example
Let Y be a set of possible web-page categories and X a set of web pages.
Assume X1 represents the words in a page, and X2 the words in hyperlinks pointing to the page.
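This is the classic co-training setup, sketched below with two invented keyword classifiers: one per view, coupled by requiring the same predicted label. The page data and category names are hypothetical.

```python
# Sketch of multi-view agreement (co-training): one classifier per view of
# a web page -- its own words, and the words of hyperlinks pointing at it.
# Both classifiers and the data are toy illustrations.

def classify_by_page_words(page_words):
    return "academic" if "course" in page_words else "other"

def classify_by_anchor_words(anchor_words):
    return "academic" if "syllabus" in anchor_words else "other"

page = {
    "page_words": {"course", "homework", "exam"},
    "anchor_words": {"syllabus", "cs101"},
}

y1 = classify_by_page_words(page["page_words"])
y2 = classify_by_anchor_words(page["anchor_words"])

# The coupling constraint: both views must predict the same label.
assert y1 == y2 == "academic"
```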
NELL – Never-Ending Language Learning
Coupled Semi-Supervised Learning for Information Extraction. A. Carlson, J. Betteridge, R.C. Wang, E.R. Hruschka Jr., and T.M. Mitchell. Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM), 2010.
Never-Ending Language Learning. Tom Mitchell's invited talk in the Univ. of Washington CSE Distinguished Lecture Series, October 21, 2010.
Motivation
Humans learn many things, for years, and become better learners over time.
Why not machines?
Coupled Constraints (1)
Mutual exclusion: two mutually exclusive predicates can't both be satisfied by the same input.
Relation argument type checking: ensure the noun phrases satisfying each relation correspond to the categories defined for that relation.
e.g. the CompanyIsInEconomicSector relation has arguments of the Company and EconomicSector categories.
Coupled Constraints (2)
Unstructured and semi-structured text features: noun phrases appear on the web in free-text contexts or semi-structured contexts.
Free-text and semi-structured classifiers will make independent mistakes, but each alone is sufficient for classification.
Both classifiers must agree.
Coupled Pattern Learner (CPL): Overview
Learns to extract category and relation instances.
Learns high-precision textual patterns, e.g. "arg1 scored a goal for arg2".
Coupled Pattern Learner (CPL): Extracting
Runs forever; on each iteration, bootstraps the patterns promoted in the last iteration to extract instances, and selects the 1000 instances that co-occur with the most patterns.
A similar procedure is used for patterns, but using recently promoted instances.
Uses part-of-speech heuristics to accomplish extraction, e.g. a per-category proper/common-noun specification; a pattern is a sequence of verbs followed by adjectives, prepositions, or determiners (and optionally preceded by nouns).
Coupled Pattern Learner (CPL): Filtering and Ranking
Candidates are filtered to enforce mutual exclusion and type constraints.
A candidate is rejected unless it co-occurs with a promoted pattern at least three times more often than it co-occurs with patterns of mutually exclusive predicates.
Candidates are ranked as follows:
Instances: by the number of promoted patterns they co-occur with.
Patterns: by an estimate of their precision.
Coupled Pattern Learner (CPL): Promoting Candidates
For each predicate, promotes at most 100 instances and 5 patterns (the highest rated).
Instances and patterns are promoted only if they co-occur with at least two promoted patterns or instances, respectively.
Relation instances are promoted only if their arguments are candidates for the specified categories.
Coupled SEAL (1)
SEAL is an established wrapper-induction algorithm.
Creates page-specific extractors, independent of language.
Category wrappers are defined by a prefix and postfix; relation wrappers are defined by an infix.
Wrappers for each predicate are learned independently.
Coupled SEAL (2)
Coupled SEAL adds mutual exclusion and type-checking constraints to SEAL.
Bootstraps recently promoted wrappers.
Filters candidates that are mutually exclusive or not of the right type for the relation.
Uses a single page per domain for ranking.
Promotes the top 100 instances extracted by at least two wrappers.
Meta-Bootstrap Learner
Couples the training of multiple extraction techniques.
Intuition: different extractors will make independent errors.
Replaces the PROMOTE step of the subordinate extractor algorithms.
Promotes any instance recommended by all the extractors, as long as mutual exclusion and type checks hold.
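The modified PROMOTE step reduces to an intersection plus a constraint check, as in this sketch; the candidate sets and the exclusion list are invented stand-ins for CPL and Coupled SEAL outputs.

```python
# Sketch of the Meta-Bootstrap Learner's PROMOTE step: promote an instance
# only if every subordinate extractor recommends it and no mutual-exclusion
# constraint is violated. All sets below are toy data.

cpl_candidates = {"Jerusalem", "Haifa", "denial"}       # e.g. from CPL
cseal_candidates = {"Jerusalem", "Haifa", "Amsterdam"}  # e.g. from Coupled SEAL
mutually_exclusive_with_city = {"denial", "anxiety"}

recommended_by_all = cpl_candidates & cseal_candidates
promoted = {x for x in recommended_by_all
            if x not in mutually_exclusive_with_city}

# Only instances both extractors agree on (and that pass the constraint
# check) survive; each extractor's independent errors are filtered out.
assert promoted == {"Jerusalem", "Haifa"}
```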
Learning New Constraints
Data-mine the KB to infer new beliefs.
Generates probabilistic, first-order Horn clauses.
Connects previously uncoupled predicates.
Rules are filtered manually.
Demo Time http://rtw.ml.cmu.edu/rtw/kbbrowser/
Summary
Populating the semantic web by using NELL for macro-reading
Populating the Semantic Web
There are many ways to accomplish this.
Use an initial ontology to focus and constrain the learning task.
Couple the learning of many, many extractors.
Macro-reading: instead of annotating a single page at a time, read many pages simultaneously.
A never-ending task.
Macro-Reading
Helps to improve accuracy.
Still doesn't help to annotate a single page, but many things that are true for a single page are also true for many pages.
Helps to populate databases with frequently mentioned knowledge.
Future Directions
Coupling with external sources: DBpedia, Freebase.
Ontology extension: new relations through reading, subcategories.
Use a macro-reader to train a micro-reader.
Self-reflection, self-correction.
Distinguishing tokens from entities.
Active learning via crowdsourcing.