Semantic Application Components

[Figure: "Information Extraction in Context" – a pipeline from textual documents, web pages, etc. (unstructured, semi-structured, and structured) through information extraction (filter/spider, segment, classify, associate, normalize, deduplicate) into a database/knowledge store, then data mining (discover patterns, select models, fit parameters, inference, report results), yielding actionable knowledge (prediction, outlier detection, decision support).]

From "Information Extraction" by Andrew McCallum in November 2005 ACM Queue, pp. 49-57.
Information/Knowledge Extraction
Convert information (and knowledge) into computer-understandable knowledge
- Unstructured Extraction
- Semi-Structured Extraction
- Structured Extraction

Information/Knowledge Extraction
Unstructured Extraction: Extract Entities and Relationships
Populate an ontology
- Also referred to as populating the slots of a frame (old A.I. speak)
- Populating can be a challenge
- Semantic technology separates the description of structure from the data
Knowledge Extraction Technologies
- Natural Language Processing
- Multi-Lingual Handling
- Multi-Modal (text, audio, video)
Foundational Technologies
- Rule-Based Systems
- Machine Learning
On Extraction Accuracy
- Be wary of junk in your knowledge store – it is very hard to clean out!
- It is better to extract less with higher accuracy, especially since many sources tend to yield redundant information.
http://www.opencalais.com
Example
New York Times says:
Saddam found in bunker
<Entity> <Relationship> <Entity>
Associated Press says:
Saddam Hussein captured at secret underground room
<Entity> <Relationship> <Entity>
Lots of redundancy, and likely significantly improved accuracy compared to relying on one source alone (depending on the strengths of the extractors and the ability to map to relationships your system is ready to handle)
Structured Knowledge Extraction
Given a database – typically rows of well-defined relationships
Extract useful entities
- Often, a row (or joined row) is a useful entity
- Given a schema, the knowledge is pretty explicit
If the structure is good, leave it there (conversion to RDF or similar probably doesn't make sense)
D2RQ (http://www4.wiwiss.fu-berlin.de/bizer/d2rq/) converts database data into RDF form. It can convert entire databases or provide a SPARQL endpoint that just converts on-the-fly.
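In the spirit of D2RQ's row-to-RDF approach, a minimal sketch (the table name, columns, and URI scheme are all hypothetical, and real mappings would handle types and foreign keys):

```python
# Sketch of a D2RQ-style mapping of a database row to RDF triples.
# Table name, column names, and URI prefixes are invented for illustration.

def row_to_triples(table, pk, row):
    """Map one database row (a dict of column -> value) to N-Triples lines."""
    subject = f"<http://example.org/{table}/{pk}>"
    triples = []
    for column, value in row.items():
        predicate = f"<http://example.org/vocab/{table}#{column}>"
        triples.append(f'{subject} {predicate} "{value}" .')
    return triples

triples = row_to_triples("employee", 42, {"name": "Ada", "dept": "R&D"})
for t in triples:
    print(t)
```

A tool like D2RQ does this either as a bulk dump of the whole database or on-the-fly behind a SPARQL endpoint.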
Structured Knowledge – Semantic Wiki
Normal Wiki Semantics are a Collection of Knowledge!
- Pages reference Pages
- Pages might have attributes
Semantic Wiki Enables the Knowledge
- Pages have specific types – Class (rdf:type)
- References between pages have a relationship property – Object property
- Pages have semantic attributes (metadata) – Datatype property
A full-featured semantic wiki is Semantic MediaWiki – an extension to the wikipedia infrastructure.
Extensions for input forms are required to make the system usable by normal humans.
One example is: http://foodfinds.referata.com/wiki/Home
List of examples in use at http://semantic-mediawiki.org/wiki/Sites_using_Semantic_MediaWiki
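As a sketch of how these map onto page markup in Semantic MediaWiki (page, property names, and values are invented for illustration): the typed link is an object property, `Has population` is a datatype property, and the category supplies the class (rdf:type):

```wikitext
Austin is the capital of [[Located in::Texas]].
[[Has population::1000000]]
[[Category:City]]
```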
Semi-Structured Knowledge Extraction
Mix of structured and unstructured
- Like a form with check boxes for some fields and open text areas for others
- Common to human intelligence reports like police or border patrol reports
Leverage structured tools and provide the known or simple knowledge to the unstructured extractor
- Many extractors won't use this additional contextual information (still a research area)
Semi-structured data can typically be converted into XML
- GRDDL (http://www.w3.org/2001/sw/grddl-wg/) is a W3C tool for creating/extracting RDF from normal XML documents.
- Then use unstructured tools on the unstructured fields
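A minimal sketch of this split (the report schema and element names are hypothetical): pull the known fields directly, and hand the free-text field to whatever unstructured extractor you use.

```python
# Sketch: split a semi-structured XML report into structured fields plus
# free text destined for an unstructured extractor. Element names are
# invented; a real report schema would differ.
import xml.etree.ElementTree as ET

report = """
<report>
  <officer>J. Smith</officer>
  <incident-type>theft</incident-type>
  <narrative>Suspect fled north on Main St. in a blue sedan.</narrative>
</report>
"""

root = ET.fromstring(report)
structured = {
    "officer": root.findtext("officer"),
    "incident_type": root.findtext("incident-type"),
}
free_text = root.findtext("narrative")  # pass this to the unstructured extractor

print(structured)
print(free_text)
```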
More Data vs More Entities
Be wary of more data – knowledge extraction must result in entities, not just data!
- Data grows exponentially (forever)
- Entities grow near-asymptotically (and are therefore manageable)
- For example: Facebook people, cars, cell phones…

[Figure: typical growth of data through time (exponential) versus expected growth of entities through time (leveling off).]
Disambiguation (Deduplication)
Ambiguous: adj., capable of being understood in two or more possible senses or ways
Disambiguate: v., to establish a single semantic interpretation for
Also known as entity resolution and deduplication
- Given sets of extracted knowledge, determine when pieces refer to the same object or entity
- Ex: Hussein is the same entity as Saddam Hussein in the previous news example.
Jaccard Measures for Disambiguation
Jaccard similarity measure
- Similarity between sample sets: size of the intersection divided by the size of the union
- J(A,B) = |A intersect B| / |A union B|
- Jaccard distance (the dissimilarity): 1 - J(A,B)
Is John(#1) the same as John(#2)?
- We have a set of assertions about John(#1) and John(#2) derived from source sets A and B.
- If we presume #1 and #2 are the same, how many features do we get versus how many if we assume they are different?
- If the sizes are similar enough, then the references are to the same John.
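The set form is a one-liner over feature sets; a sketch with invented feature sets for the two Johns:

```python
# Sketch: set-based Jaccard similarity for entity disambiguation.
# The feature sets for John(#1) and John(#2) are invented for illustration.

def jaccard(a, b):
    """Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

john1 = {"last:Doe", "city:Austin", "employer:Acme", "phone:555-0100"}
john2 = {"last:Doe", "city:Austin", "employer:Acme"}

sim = jaccard(john1, john2)   # 3 shared features / 4 total = 0.75
dist = 1 - sim                # Jaccard distance = 0.25
print(sim, dist)
```

A high similarity (low distance) suggests the two references denote the same John; the threshold is a tuning decision per domain.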
Jaccard Measures for Disambiguation
Binary data example (slight formulation difference – see wikipedia)
Compare features of two entities (where each feature is true/false):
- Both True – M11 = 1
- First True, Second False – M10 = 3
- First False, Second True – M01 = 0
- Both False – M00 = 0
Jaccard similarity = M11 / (M10 + M01 + M11) = 1/4
Jaccard distance = (M01 + M10) / (M10 + M01 + M11) = 3/4

         Sphere Shape | Sweet | Sour | Crunchy
Apple    Yes          | Yes   | Yes  | Yes
Banana   No           | Yes   | No   | No

Brief, handy Jaccard tutorial at: http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
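The binary formulation above can be sketched directly, reproducing the Apple vs. Banana counts:

```python
# Sketch: binary (asymmetric) Jaccard on true/false feature vectors,
# reproducing the Apple vs. Banana example.

def binary_jaccard(x, y):
    m11 = sum(1 for a, b in zip(x, y) if a and b)          # both true
    m10 = sum(1 for a, b in zip(x, y) if a and not b)      # first true only
    m01 = sum(1 for a, b in zip(x, y) if not a and b)      # second true only
    sim = m11 / (m10 + m01 + m11)                          # M00 is ignored
    return sim, 1 - sim

# Features: sphere shape, sweet, sour, crunchy
apple  = [True, True, True, True]
banana = [False, True, False, False]

sim, dist = binary_jaccard(apple, banana)
print(sim, dist)   # 0.25 and 0.75
```

Note that M00 (both false) never enters the formula; shared absences carry no evidence of a match.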
Random Forest for Name Disambiguation
Automatically create a large number of decision trees, each created from a different bootstrap sample set of the training data
- Select the number of trees
- Select the number of features to consider at each node split
Decision Tree Conclusion
- Run data through all trees for a match yes/no – majority count wins
Tested on names from the Medline Bibliography
- 16 million biomedical research references
Raw Features
- Primary author name – first, last, middle
- Co-author last names, affiliation of first author
- Title, journal name, publication year, language
- MeSH/keyword terms
Decision Tree Features (21 total)
- Comparisons between the two articles' names, etc.
- Uniqueness of names, titles, etc.
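The majority-vote step can be sketched as follows (the per-tree predictions are invented stand-ins; a real forest would train each tree on a bootstrap sample of labeled name pairs):

```python
# Sketch: random-forest style majority vote over per-tree match decisions.
# Each "tree" here is a hand-written stand-in function, not a trained tree.

def forest_predict(trees, pair):
    """Return True (same person) if a majority of trees vote yes."""
    votes = [tree(pair) for tree in trees]
    return sum(votes) > len(votes) / 2

# Three hypothetical stumps voting on a (name1, name2) pair:
trees = [
    lambda p: p[0].split()[-1] == p[1].split()[-1],   # same last name?
    lambda p: p[0][0] == p[1][0],                     # same first initial?
    lambda p: p[0] == p[1],                           # exact string match?
]

print(forest_predict(trees, ("J Smith", "John Smith")))  # two of three vote yes
```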
Disambiguation Results
- Random Forests (RF) work as well as any – at impressive accuracy
- RF trains orders of magnitude faster than SVM
- Bayesian issue: high correlation between features violated the independence assumption of random variables

[Table: accuracy results for majority class in training data, Bayesian, simple regression, decision tree, support vector machines (radial basis), and random forests, with the number of article pairings and number of features in the test set.]
Knowledge Store
Need to be able to discover and manipulate knowledge across stores and leave most of the knowledge where it is
Problems:
- Database schemas and knowledge store ontologies are different, inconsistent, and conflicting
- Entities are not aligned across the stores
Some approach to federating the information is required
- Federated data must conform to some schema or ontology applicable to the problem being solved
Swoogle can help with ontologies: http://swoogle.umbc.edu/
- Swoogle provides search of over a million RDF documents on the web
Unified Relational Database
Unifying relational database schema
- Define domain schema
- Convert all data into the domain schema
Possible to have the schema federate across multiple stores
- Performance likely to drive all data onto a central store
Disambiguate across data sources
- Also likely to drive all data onto a central store
Distributed RDF
Distributed RDF query
- Define domain ontology
- Enable all data as RDF: SPARQL end-points (D2RQ, RDF files from GRDDL or other, triplify from web pages, RDF stores)
- Enable within the domain schema
Disambiguate across data sources
- Can leave data where it is – use owl:sameAs or convert via SWRL
- Queries will have to include the reasoning effects of the disambiguation
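A sketch of the owl:sameAs pattern (all URIs hypothetical): after asserting `<http://storeA.example/person/42> owl:sameAs <http://storeB.example/emp/jdoe> .`, a federated query like the following should, with sameAs reasoning enabled, also return facts asserted in store B:

```sparql
PREFIX ex: <http://example.org/vocab#>
SELECT ?skill
WHERE {
  # With owl:sameAs reasoning, facts about storeB's emp/jdoe also bind here
  <http://storeA.example/person/42> ex:hasSkill ?skill .
}
```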
Twine at http://www.twine.com – people create shared threads around a topic (of tags). Not a federated store, but uses RDF for storage. To get started, search on Twine for "twine".
Data Mining / Discovery Patterns
Vector Space Model
- Multi-dimensional representation of documents as vectors
- Every dimension of the vector is a term of interest
- All vectors represent the same order of terms (at least to be manipulated together)
- The value at a vector location represents the weight for that term
- Given vectors, you can perform vector operations like 'angle-between' as a similarity measure
The weighting approach is some of the magic:
Term Frequency–Inverse Document Frequency (TF-IDF) weighting
- Used to evaluate how important a word is to a document in a collection
- Includes how likely a term is to occur in a document and a weight based on how unique the term is across all the documents of interest
- See wikipedia for a good explanation of the details.
Entropy weighting
- Used to estimate the relative importance of all words within a document
- Take the negative log of the number of occurrences (normalized by number of unique words in document): -ln(cnt/sum_of_all_words)
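A minimal TF-IDF sketch over a tiny invented corpus (using the common tf = count/length and idf = ln(N/df) formulation; real systems vary in normalization and smoothing):

```python
# Sketch: TF-IDF weighting over a tiny hypothetical corpus.
# tf  = term count / total terms in the document
# idf = ln(total documents / documents containing the term)
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "cats and dogs".split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # rare across docs, so weighted up
print(tf_idf("the", docs[0], docs))  # common across docs, so weighted down
```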
Data Mining / Discovery Patterns
F-Measure as a General Measure of Mining Performance
- Measure of a test's accuracy, considering both the precision p and the recall r of the test
- Precision: typically the number of relevant documents returned scaled by the total number returned
- Recall: the number of relevant documents retrieved by a search scaled by the total number of existing relevant documents (that should have been returned)
- Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision
- F1 = 2 * Precision * Recall / (Precision + Recall)
- See wikipedia for a good explanation of the details.
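These definitions can be sketched directly (the document sets are invented; the general formula is Fβ = (1+β²)·p·r / (β²·p + r), which reduces to F1 at β = 1):

```python
# Sketch: precision, recall, and F-beta from retrieved vs. relevant sets.

def f_beta(precision, recall, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

retrieved = {"d1", "d2", "d3", "d4"}   # what the search returned
relevant  = {"d1", "d2", "d5"}         # what it should have returned

precision = len(retrieved & relevant) / len(retrieved)   # 2/4 = 0.5
recall    = len(retrieved & relevant) / len(relevant)    # 2/3

print(round(f_beta(precision, recall), 3))           # F1
print(round(f_beta(precision, recall, beta=2), 3))   # weights recall more
```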
Data Mining / Discovery Patterns
Seeing some popularity in using Kullback-Leibler (K-L) divergence as a metric
- Also referred to as information gain
- See wikipedia to become quite confused on the details
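In its discrete form, D(P || Q) = Σ pᵢ · ln(pᵢ/qᵢ); a short sketch (note it is not symmetric, and is only defined where qᵢ > 0 whenever pᵢ > 0):

```python
# Sketch: Kullback-Leibler divergence between two discrete distributions.
import math

def kl_divergence(p, q):
    """D(P || Q) = sum of p_i * ln(p_i / q_i), skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]

print(kl_divergence(p, q))   # how much Q diverges from P
print(kl_divergence(p, p))   # a distribution diverges from itself by 0
```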
Wolfram-Alpha at http://www.wolframalpha.com supports data mining by model fitting, inference, and reporting given human language questions. For example, you can ask, “What is the capital of Texas” and receive both short and long answers.
Data Mining / Discovery Patterns
Consistency issues
- Mining and discovery include inference
- What about when a new fact or inference contradicts a prior one?
- IBM Entity Analytics removes old assertions that are contradicted by new information
- An alternative is to have some sort of confidence metric, but there is still a need for aging (at least) some relationships
Most important is to:
1) Have a way to detect you are contradicting yourself
2) Have some way to address it
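A sketch of those two steps for single-valued (functional) properties in a toy store, resolving by letting the newest assertion win (the remove-the-old-assertion style mentioned above; property names are invented):

```python
# Sketch: detect contradictions on functional properties, then resolve by
# replacing the old value with the new one. Property names are hypothetical.

FUNCTIONAL = {"birth_date", "country_of_citizenship"}  # at most one value each

def add_assertion(store, subject, prop, value):
    """Record (prop, value); report a contradicted functional value first."""
    if prop in FUNCTIONAL and prop in store.get(subject, {}):
        old = store[subject][prop]
        if old != value:
            print(f"contradiction on {subject}.{prop}: {old!r} vs {value!r}")
    store.setdefault(subject, {})[prop] = value  # newest assertion wins

store = {}
add_assertion(store, "person:42", "birth_date", "1937-04-28")
add_assertion(store, "person:42", "birth_date", "1939-04-28")  # contradicts
print(store["person:42"]["birth_date"])
```

The confidence-metric alternative would keep both values with scores instead of overwriting, at the cost of more complex queries.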
Actionable Knowledge/Decision Making
Decision making is quite domain dependent
- Deciding someone is a terrorist versus where to construct a highway
Typically involves inference and reasoning
- Rule languages (Pellet, SWRL, SPIN, Jena reasoners, etc.)
- Bayesian networks
- Dempster-Shafer belief handling
Provenance is required
- RDF reification or a custom approach (albeit a bit messy)
- Tends to make databases messy, too
Must be able to remove assertions (i.e. step back) when an incorrect path is detected
- Multi-hypothesis reasoning approaches can help
- Maintain multiple paths of possibility, simultaneously
- Obviously an information management headache
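A sketch of attaching provenance via RDF reification, using the news example from earlier (the statement URI and the `ex:` vocabulary are hypothetical):

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/vocab#> .

# The statement itself becomes a resource we can annotate with its source.
ex:stmt1 a rdf:Statement ;
    rdf:subject   ex:Saddam ;
    rdf:predicate ex:capturedAt ;
    rdf:object    ex:UndergroundBunker ;
    ex:source     ex:AssociatedPress ;   # provenance attached to the statement
    ex:reportedOn "2003-12-13" .
```

Because each assertion becomes four-plus triples, reification is the "bit messy" part; it does, however, make stepping back (retracting ex:stmt1 and everything hanging off it) straightforward.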
The Payoff – Information Superiority

[Figure: Information Yield Potential (# inferences / time) plotted against Knowledge Density (# related facts / domain). Low density gives information inferiority; past a phase transition from "low yield" to "high yield," high density gives information superiority. Two Sudoku-style number grids illustrate the idea: a sparse grid (LOW knowledge density example) is EXPERT-level hard, while a nearly full grid (HIGH knowledge density example) is EASY.]

Interesting problem domains are typically large, so information superiority also requires a large set of related facts.
Summary
Leveraging Semantic Technologies for Arriving at Actionable Knowledge Requires
- Converting information into computer-usable knowledge
- Resolving ambiguity across many sources of knowledge
- Accumulating sufficient knowledge density to result in high-yield intelligence
RMS In-Class Teams Exercise
Relationship Management System (RMS) - You need an information system to allow you to maintain information about university partnerships with your company. This is initially for your own use, but eventually will be a system to share with others that includes summary reporting about recent activity (including contact reports and contracts).
You will have to deal with at least these types of entities:
- Organization
- Person (in a position at an organization)
- Contract
- Contact Reports (by people at your organization with a person)
Management has decided that semantic wiki technology should be tried out on this system. (You are allowed to convince them otherwise, but they may override your objections.)
Federation In-Class Teams Exercise
Intelligent Skills Inventory System – When a Request For Proposal (RFP) arrives, we need to construct an inventory of skills among our personnel that is specific to those needed for the RFP. The skills of your personnel are scattered across a number of systems, which you are required to deal with. You will not be providing any new system for gathering skills; rather, you are to:
1. Build a system to provide federated query and reporting of skills
2. Provide a system to extract appropriate skills from the RFP and provide a list of potential personnel with matching (or near-matching) skills.
Emerging Alternative Federation Approach
Object Accumulation Overlay
- Develop an abstraction layer of persisted objects
- Adapters convert sources into objects – generally driven by a high-level ontology
- Objects have methods – set/retrieve attributes or perform operations
- Data (i.e. attributes) is left at the original source
Disambiguate within the object space
- Adapters and custom operations perform disambiguation
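The overlay idea can be sketched as follows (the source systems, entity ids, and field names are all hypothetical): attributes stay in the original sources and the federated object resolves them through adapters on demand.

```python
# Sketch of an object-accumulation overlay: a federated person object whose
# attributes remain in the original sources, fetched via per-source adapters.

class DictSourceAdapter:
    """Adapter wrapping one source; real adapters would hit a DB or service."""
    def __init__(self, records):
        self.records = records
    def get(self, entity_id, attribute):
        return self.records.get(entity_id, {}).get(attribute)

class FederatedPerson:
    """Overlay object: resolves attributes across adapters on demand."""
    def __init__(self, entity_id, adapters):
        self.entity_id = entity_id
        self.adapters = adapters
    def get(self, attribute):
        for adapter in self.adapters:   # first source that knows the value wins
            value = adapter.get(self.entity_id, attribute)
            if value is not None:
                return value
        return None

hr     = DictSourceAdapter({"p42": {"name": "Ada Lovelace"}})
skills = DictSourceAdapter({"p42": {"skill": "analysis"}})

person = FederatedPerson("p42", [hr, skills])
print(person.get("name"), person.get("skill"))
```

Disambiguation lives in the adapters here: both sources must agree that "p42" names the same person, which is exactly the entity-alignment problem discussed earlier.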