transforming data into knowledge › ~a78khan › docs › information...transforming data into...
TRANSCRIPT
Transforming DataTransforming Datainto Knowledgeinto Knowledge
Presented by:
Atif Khan
using Knowledge Engineering
&Machine Learning
Scope
Building Blocks ontology inference machine learning
DBpedia
InfoTrellisALLSight
Tools
01-24-2013 3
Motivation
Data Streams » Information?
.....1101111000011
.....1101100011111
.....000010010100
.....1101111111011
.....110110001100
.....011010010100
01-24-2013 4
Motivation
Information » KnowledgeCharles Smith @profcharlesJust finished registering for KDD 2012 in Beijing http://www.kdd.org/kdd2012/
Charles Smith shared a linkI am off to knowledge discovery and data mining conference in Beijing. Looking forward to Michael Jordan's keynote address
01-24-2013 5
Motivation
Information » Knowledge
KDD2012
Beijing
URLwww..kdd2012/
CharlesSmith
CharlesSmith
Conference
Knowledge Discoveryand Data Mining
Beijing
MichaelJordan
01-24-2013 6
Motivation
Information » Knowledge
KDD2012
Beijing
URLwww..kdd2012/
CharlesSmith
CharlesSmith
Conference
Knowledge Discoveryand Data Mining
Beijing
registered for
website
heldAt
MichaelJordan
01-24-2013 7
Motivation
Information » Knowledge
KDD2012
Beijing
URLwww..kdd2012/
CharlesSmith
CharlesSmith
Conference
Knowledge Discoveryand Data Mining
Beijing
registered for
website
heldAt
isa
heldAt
isKeyNoteSpeaker
attending
MichaelJordan
01-24-2013 8
Motivation
Information » Knowledge
KDD2012
Beijing
URLwww..kdd2012/
CharlesSmith
CharlesSmith
Conference
Knowledge Discoveryand Data Mining
Beijing
MichaelJordan
same as
registered for
same as
same as
website
heldAt
isa
heldAt
isKeyNoteSpeaker
attending
01-24-2013 9
Motivation
Information » Knowledge■ Charles Smith is an academic
■ KDD is a conference about data mining and knowledge discovery
■ Michael Jordan is an influential academic in data mining community
■ Charles Smith and Michael Jordan will both be in Beijing during KDD 2012
01-24-2013 10
Building Blocks
ontology inference
machinelearning
01-24-2013 11
Hybrid MDSS - Holmes**
Holmes■ a hybrid medical decision support system (MDSS)
Based on ■ ontological knowledge representation
■ logic-based inference
■ machine learning to deal with noise
**Atif Khan, John Doucette, Robin Cohen. "Validation of an Ontological Medical Decision Support System for Patient Treatment Using a Repository of Patient Data: Insights into the Value of Machine Learning". ACM Transactions on Intelligent Systems and Technology (TIST) 2012.
01-24-2013 12
Hybrid MDSS
Line of Inquiry■ which patients can be prescribed
what sleep medications?
Considerations■ patient-centric & evidence-based
■ automated
■ easy to explain and validate
■ noise tolerant
01-24-2013 13
Hybrid MDSS
Datasets
CDC–Behavioral Risk Factor Surveillance System (BRFSS) – 2010
01-24-2013 14
Hybrid MDSS
Datasets
multi-dimensionala. demographic informationb. medical informationc. behavioural Information
01-24-2013 15
Hybrid MDSS
Datasets
Characteristicsa. 450K+ individualsb. 400+ attributes/recordc. high “missingness”d. numeric coding
01-24-2013 16
Hybrid MDSS
Datasets
sleeping pill prescription protocol(HTML)
01-24-2013 17
Hybrid MDSS
Datasets
drug-to-drug interactions(HTML)
01-24-2013 18
Ontology Model
Drug
Patient
Disease
Condition
01-24-2013 19
Ontology Model
Drug
Patient
Disease
Condition
Pain pills
Sleepingpills
Age GenderMale
Female
MedicalRecord
BRFSSField
01-24-2013 20
Ontology Model
Drug
Patient
Disease
Condition
Pain pills
Sleepingpills
Age GenderMale
Female
MedicalRecord
BRFSSField
isa
01-24-2013 21
Ontology Model
Drug
Patient
Disease
Condition
Pain pills
Sleepingpills
Age GenderMale
Female
MedicalRecord
BRFSSField
hasContraIndication
isa
01-24-2013 22
Ontology Model
Drug
Patient
Disease
Condition
Pain pills
Sleepingpills
Age GenderMale
Female
MedicalRecord
BRFSSField
hasContraIndication
isa
hasRecord
isTakingprescribedTo hasC
ondition
hasAgehasGender
hasField
hasDisease
01-24-2013 23
An Ontology Model
Defines■ taxonomy
● a hierarchy of concepts
■ relationships
Scope - “domain of discourse”■ e.g. medical decision support system
01-24-2013 24
Mapping Raw Data
Patient1
MedicalRecord1
GenderField
2hasValue
AgeField
66hasValue
SnoreField
1hasValue
01-24-2013 25
Mapping Raw Data
Patient1
MedicalRecord1
GenderField
2hasValue
Female
hasGender
AgeField
66hasValue
SnoreField
1hasValue
01-24-2013 26
Mapping Raw Data
Patient1
MedicalRecord1
GenderField
2hasValue
Female
hasGender
AgeField
66hasValue hasCondition
Elderly
SnoreField
1hasValue
01-24-2013 27
Mapping BRFSS Data
Patient1
MedicalRecord1
GenderField
2hasValue
Female
hasGender
AgeField
66hasValue hasCondition
Elderly
SnoreField
1hasValue hasDisease
SleepApnea
01-24-2013 28
Mapping Expert Knowledge
Ramelteon
Eszopiclone
Insomnia
prescribedFor
01-24-2013 29
Mapping Expert Knowledge
Ramelteon
Eszopiclone
LungDisease
LiverDisease
Depression
SleepApnea
Asthma
Insomnia
hasContraIndication
prescribedFor
01-24-2013 30
Mapping Expert Knowledge
Ramelteon
Eszopiclone
Pregnancy
BreastFeeding
Elderly
AlcoholAbuse
LungDisease
LiverDisease
Depression
SleepApnea
Asthma
Insomnia
hasContraIndication
prescribedFor
01-24-2013 31
Mapping Expert Knowledge
Ramelteon
Eszopiclone
Pregnancy
BreastFeeding
Elderly
AlcoholAbuse
Propoxyphene
LungDisease
LiverDisease
Depression
SleepApnea
Asthma
Insomnia
hasContraIndication
prescribedFor
01-24-2013 32
Mapping Expert Knowledge
:Propoxyphene a :Drug; :isPrescribedFor :Pain; :hasContraIndication
:Eszopiclone.
:Wygesic a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.
:Trycet a :Drug; :isPrescribedFor :Pain; :hasContraIndication
:Eszopiclone.
:PropoxypheneCompound65 a :Drug; :isPrescribedFor :Pain; :hasContraIndication
:Eszopiclone.
:Propacet100 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.
:Aspirin a :Drug; :isPrescribedFor :Pain.
:Tylenol1 a :Drug; :isPrescribedFor :Pain.
:Tylenol2 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :SleepingMedication.
N3 representation
01-24-2013 33
Mapping Expert Knowledge
:Propoxyphene a :Drug; :isPrescribedFor :Pain; :hasContraIndication
:Eszopiclone.
:Wygesic a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.
:Trycet a :Drug; :isPrescribedFor :Pain; :hasContraIndication
:Eszopiclone.
:PropoxypheneCompound65 a :Drug; :isPrescribedFor :Pain; :hasContraIndication
:Eszopiclone.
:Propacet100 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :Eszopiclone.
:Aspirin a :Drug; :isPrescribedFor :Pain.
:Tylenol1 a :Drug; :isPrescribedFor :Pain.
:Tylenol2 a :Drug; :isPrescribedFor :Pain; :hasContraIndication :SleepingMedication.
N3 representation
01-24-2013 34
Knowledge Representation-Recap
Why?■ to create, maintain
and share informationin a precise manner without ambiguity of meaning
How?■ ontologies
“Now! That should clear up a few things around here!”
http://photos1.blogger.com/blogger2/1715/1669/1600/larson-oct-1987.gif
01-24-2013 35
Knowledge-Inference
a discovery processto find implied knowledge
using explicitly defined information
01-24-2013 36
Knowledge-Inference
logic-based
a discovery processto find implied knowledge
using explicitly defined information
ontology
01-24-2013 37
Knowledge-Inference: example
What do we know about Mary?
■ Mary is grandmother
■ Mary is a grand parent
■ Mary is woman
Mary EmilyJames
hasChild
hasChild
01-24-2013 38
Knowledge-Inference: example
What do we know about Mary?
■ Mary is grandmother
■ Mary is a grand parent
■ Mary is woman
Mary EmilyJames
hasChild
hasChild
if a person has a child, and that child also has a child, then the person is a grandparent
01-24-2013 39
Inference Rules
Drug-to-Drug Interactions■ If a patient is taking an existing drug (D1) and
D1 has contraindication to another drug D2 then drug D2 should not be prescribed to the patient
{ ?P a :Patient. ?D1 a :Drug.?D2 a :Drug.
?P :isTaking ?D1.?D1 :hasContraIndication ?D2.
} => {?P :cannotBeGiven ?D2}.
01-24-2013 40
Inference Rules
Drug-to-Disease Interactions■ If a patient has a condition that has a contra
indication to a drugthen the patient should not be given the drug
{ ?P a :Patient.?D a :Drug.?DIS a :Disease.
?P :hasDisease ?DIS.?D :hasContraIndication ?DIS.
} => {?P :cannotBeGiven ?D}.
01-24-2013 41
Putting it all Together
triplestore
reasonerquery
inferencerules
01-24-2013 42
Putting it all Together
triplestore
reasonerquery
inferencerules
Result
Proof
01-24-2013 43
Putting it all Together
Result
Proof
logic-based, can be verified by traversing the knowledge graph
query answer
triplestore
reasonerquery
inferencerules
01-24-2013 44
But What about Noise?
Mary Emily?
hasChild
hasChild
Noise■ cripples knowledge-based solutions
■ limits inference capability
01-24-2013 45
But What about Noise?
Noise■ cripples knowledge-based solutions
■ limits inference capability
Mary EmilyJames
hasChild
?
01-24-2013 46
Data is Almost Always Noisy
use “machine learning”
to deal with noise
01-24-2013 47
Machine Learning?
feature 2age
feature 1income
query: given a person's age and income predict if they are happy or sad
01-24-2013 48
Machine Learning?
feature 2age
feature 1income
query: given a person's age and income predict if they are happy or sad
happy people
sad people
01-24-2013 49
Machine Learning?
query: is he a happy or sad person ?
happy person
sad person
01-24-2013 50
Machine Learning 101
Machine Learning■ classification: predict class of an instance
■ regression: prediction of a numeric value
■ clustering: group similar items together
01-24-2013 51
Machine Learning 101
Machine Learning■ classification: predict class of an instance
■ regression: prediction of a numeric value
■ clustering: group similar items together
01-24-2013 52
Machine Learning 101
Machine Learning■ classification: predict class of an instance
01-24-2013 53
Classification
0. Train■ build a classifier based on known exemplars
1. Predict
2. Update
3. Evaluate
01-24-2013 54
Dealing with Noise
query: is John depressed ?
but we don't have access to that information
01-24-2013 55
Dealing with Noise
query: is John depressed ?
depressed
Training exemplars
unknown
healthy
01-24-2013 56
Dealing with Noise
answer:
sameAs( , ) = 0.14
sameAs( , ) = 0.85
sameAs( , ) = 0.01
query: is John depressed ?
01-24-2013 57
Dealing with Noise
answer:
sameAs( , ) = 0.14
sameAs( , ) = 0.85
sameAs( , ) = 0.01
query: is John depressed ?
01-24-2013 58
Recap
a hybrid decision supportsystem with ontology-based knowledge representation and logic- based reasoning augmented with machine learning classification fornoise tolerance
01-24-2013 59
Recap
a hybrid decision support system with ontology-based knowledge representation and logic-based reasoning augmented with machine learning classification for noise tolerance
01-24-2013 60
Taming Data in the Wild
01-24-2013 61
Taming Data in the Wild
“Data-to-knowledge” ■ an expensive journey
however, connecting the “data dots” is the key
“Linked Data”*■ can make this a reality
* http://linkeddata.org/
01-24-2013 62
Linked Data
Brief Summary■ Tim Berners-Lee's vision
■ URIs* to identify 'things'
■ HTTP-based dereferencing of URIs
■ structured representation (RDF**)
■ hyperlink 'things' together
Tim Berners-Lee
http://www.w3.org/Press/Stock/Berners-Lee/2001-eur-head-quarter.jpg
* URI = Universal Resource Identifier ** RDF = Resource Description Framework
01-24-2013 63
DBpedia
Extract Information from Wikipedia■ unstructured information
● articles (free text + noise)
■ structured components● infobox templates● categorisation information● images● geo-coordinates, ● external web links
http://wiki.dbpedia.org
01-24-2013 64
DBpedia
Knowledge Engineering using Ontology*■ 359 classes
● in a subsumption hierarchy
■ 1,775 different properties
*http://wiki.dbpedia.org/Ontology
01-24-2013 65
DBpedia
Knowledge Engineering using Ontology
http://wiki.dbpedia.org/Datasets
“English version of the DBpedia knowledge base currently
describes 3.77 million things, out of which 2.35 million are classified in a consistent Ontology, including
764,000 persons, 573,000 places (including
387,000 populated places), 333,000 creative works (including 112,000 music albums, 72,000
films and 18,000 video games), 192,000 organizations (including 45,000 companies
and 42,000 educational institutions), 202,000 species and 5,500 diseases.”
01-24-2013 66
DBpedia – Interlinked Datasets
2007
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2007-05-01.png
01-24-2013 67
2009
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2009-03-05_colored.png
01-24-2013 68
2011
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png
01-24-2013 69
DBpedia in Action
http://www.visualdataweb.org/relfinder/relfinder.php
Demo/Video
01-24-2013 70
know your customer
ALLSight Platform
01-24-2013 71
AllSight – 360 View of a Customer
01-24-2013 72
AllSight – 360 View of a Customer
01-24-2013 73
AllSight – 360 View of a Customer
01-24-2013 74
AllSight – 360 View of a Customer
01-24-2013 75
product
Example
location
transactions
customer
01-24-2013 76
Example
female
teenager
adult
senior
male
tech buff
shopaholic
Bieber nation
customer
01-24-2013 77
01-24-2013 78
01-24-2013 79
01-24-2013 80
Under the Hood
Data Enrichment■ taxonomies/ontologies
■ dictionaries/catalogues
Matching (decision making)■ knowledge-based rules
■ machine learning
01-24-2013 81
Challenges
Entity Resolution
Pre-processing
Feature Selection
Training (& retraining)
Noise
01-24-2013 82
Tools of the Trade
01-24-2013 83
Some Basic Tools
Knowledge Engineering■ Protégé
■ Web Ontology Language – OWL
■ Resource Description Framework (RDF)● XML● Notation 3 (N3)
01-24-2013 84
Some Basic Tools
Semantic Reasoners■ CWM
■ Euler Sharp
■ Jenna
■ FuXi
01-24-2013 85
Some Basic Tools
Machine Learning Toolkits■ Weka
■ Stanford NLP
■ Apache Mahout*
■ LIBLINEAR*
■ LIBSVM*
■ numpy, scipy (python libs)
01-24-2013 86
In Summary
The collaborative work of this group has the potential to unlock data to create knowledge
There are many■ uses cases that can benefit from this work
■ tools available to facilitate this process
01-24-2013 87
Thank You!!
01-24-2013 88
Confidence in Your Data
* Adler on Data Governancehttps://www.ibm.com/developerworks/mydeveloperworks/blogs/adler/entry/the_ibm_big_data_governance_summit3?lang=en
“Velocity, Volume, and Variety without Veracity creates Vulnerability”*
01-24-2013 89
Confidence in Your Data
“Velocity, Volume, and Variety without Veracity creates Vulnerability”*
* Adler on Data Governancehttps://www.ibm.com/developerworks/mydeveloperworks/blogs/adler/entry/the_ibm_big_data_governance_summit3?lang=en
evaluate confidence values