Semantic Technologies for Data Access
Diego Calvanese, Free University of Bozen-Bolzano, Italy
Martin Rezk, Rakuten Inc.
Rakuten Technology Conference (RTC), 22 October 2016, Tokyo
Outline of the presentation
1. Introduction to Semantic Technologies for data management (Diego)
2. Use cases at Rakuten (Martin)
3. Ontology-based data access with Ontop (Diego)
Semantics
• Is a branch of linguistics concerned with the study of the meaning of expressions in a representation scheme (i.e., a language).
• We naturally associate meaning with expressions in natural language, e.g., with words and sentences.
• Semantics is needed also for artificial and formal languages.
• Semantics is crucial when machines need to interact and exchange information with humans and with each other.
Mars Climate Orbiter had some "misunderstanding" with the ground station on imperial vs. metric units [1999].
US$327M burned in the Mars atmosphere!
Implicit vs. explicit semantics of data
Faculty Staff
id    surname    name      #courses
258   Lenzerini  Maurizio  2
262   Carlucci   Gigina    1
484   Nardi      Daniele   3
271   Catarci    Tiziana   0
435   Marchetti  Alberto   2
…     …          …         …
Who is the dean? Who works in the department of Nardi?
The dean is the one teaching no courses. The dept. is encoded in the first digit of the id.
Semantics is represented implicitly!
This hinders understanding, hence usability, maintenance, reusability, extensibility, …
Instead, we want to represent semantics explicitly!
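To see why implicit semantics is fragile, here is a minimal Python sketch that answers the two questions from the slide. The two decoding rules (dean = no courses, department = first digit of the id) are taken from the slide; the function names and the tuple layout are assumptions made only for illustration.

```python
# Rows of the Faculty Staff table from the slide: (id, surname, name, courses).
STAFF = [
    (258, "Lenzerini", "Maurizio", 2),
    (262, "Carlucci", "Gigina", 1),
    (484, "Nardi", "Daniele", 3),
    (271, "Catarci", "Tiziana", 0),
    (435, "Marchetti", "Alberto", 2),
]


def department_of(staff_id):
    """Hidden rule: the department is the first digit of the id."""
    return int(str(staff_id)[0])


def find_dean(rows):
    """Hidden rule: the dean is the one teaching no courses."""
    return [surname for _, surname, _, courses in rows if courses == 0]


def colleagues_of(rows, surname):
    """Everyone whose id starts with the same digit as the given person's id."""
    dept = next(department_of(i) for i, s, _, _ in rows if s == surname)
    return [s for i, s, _, _ in rows if department_of(i) == dept]
```

Nothing in the table itself states either rule, so every program that reads this data has to hard-code them, which is exactly the maintenance problem the slide points at.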
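Representing semantics explicitly
• We are using here an Entity-Relationship diagram, which admits a graphical notation.
• We could also have represented the semantics using a formalization in logic.
[Figure: Entity-Relationship diagram with entities Staff, Course, and Department, relationships teaches and belongsTo, and the labels surname, name, id, and is_dean.]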
Bringing Semantics to Web data
"The Semantic Web is […] an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." [T. Berners-Lee, J. Hendler, O. Lassila, 2001]
Semantic Web Technologies
• Are technologies enabling the Semantic Web vision to become reality.
• Based on flexible data representation formats.
• Semantics is represented explicitly by means of formal or logic-based languages.
[Figure: Semantic Web layers [2001]]
Technologies for representing Web data
We need mechanisms for representing data in a flexible way:
• XML (Extensible Markup Language): markup language for encoding documents in a human- and machine-readable format (by the W3C).
• JSON (JavaScript Object Notation): lightweight data-interchange format (by Ecma International).
• RDF (Resource Description Framework): flexible data model based on the idea of making statements about (web) resources in the form of subject–predicate–object triples (by the W3C).
A flexible data format is not sufficient
• XML, JSON, RDF define mechanisms to assert facts about data items:
    tiziana rdf:type Staff .
    databases rdf:type Course .
    tiziana teaches databases .
• But they do not provide means to validate the data:
    csEngineering rdf:type Department .
    dumbo rdf:type Animal .
    dumbo teaches csEngineering .
Adding semantics to data
The W3C has defined several languages for representing the semantics explicitly and for interlinking data:
• RDFS (RDF Schema): lightweight schema language for describing concepts and their relationships.
• OWL (Web Ontology Language): very expressive ontology language for modeling knowledge about a domain of interest.
• RIF (Rule Interchange Format): rule-based language for the specification of knowledge.
• Standards for LOD (Linked Open Data)
Linked Open Data (LOD)
Several standards have been defined to annotate and interlink web pages with HTML-embedded data:
• Microdata
• RDFa
• JSON-LD
• Microformats
See the invited talk by Chris Bizer at ISWC 2016 (Kobe): http://www.slideshare.net/bizer/is-the-semantic-web-what-we-expected-adoption-patterns-and-contentdriven-challenges-iswc-2016-keynote
WebDataCommons (11/2015): HTML-embedded data is provided by
• 19% of the primary-level-domains (2.72M out of 14.41M)
• 30% of the HTML pages (540M out of 1.71B)
Example of RDFS specification
    belongsTo rdfs:domain Staff .
    belongsTo rdfs:range Department .
    teaches rdfs:domain Staff .
    teaches rdfs:range Course .
    GradCourse rdfs:subClassOf Course .
    …
Essentially, in RDFS we can express the same information as in an ER diagram (except for multiplicities and complete/disjoint hierarchies).
We can encode the semantics of the domain, and use it to check whether statements are meaningful.
[Figure: the Entity-Relationship diagram from before (Staff, Course, Department, teaches, belongsTo, surname, name, id, is_dean), extended with GradCourse.]
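As a rough illustration of using the schema to check whether statements are meaningful, the following Python sketch reads rdfs:domain and rdfs:range as constraints on typed data. This is a deliberate simplification: in RDFS semantics proper, domain and range license inferences rather than rejecting data. The function names and the tuple encoding of triples are assumptions, not part of any RDF library.

```python
RDF_TYPE = "rdf:type"

# Schema triples from the slide.
SCHEMA = [
    ("belongsTo", "rdfs:domain", "Staff"),
    ("belongsTo", "rdfs:range", "Department"),
    ("teaches", "rdfs:domain", "Staff"),
    ("teaches", "rdfs:range", "Course"),
    ("GradCourse", "rdfs:subClassOf", "Course"),
]


def superclasses(cls, schema):
    """Return cls together with all of its (transitive) superclasses."""
    seen, todo = set(), [cls]
    while todo:
        c = todo.pop()
        if c not in seen:
            seen.add(c)
            todo.extend(o for s, p, o in schema
                        if s == c and p == "rdfs:subClassOf")
    return seen


def violations(data, schema):
    """List data triples whose subject or object lacks the declared type.

    Untyped nodes are flagged too, because this sketch treats the schema
    as a constraint rather than as a source of inferences.
    """
    types = {}
    for s, p, o in data:
        if p == RDF_TYPE:
            types.setdefault(s, set()).update(superclasses(o, schema))
    bad = []
    for s, p, o in data:
        if p == RDF_TYPE:
            continue
        for prop, kind, cls in schema:
            if prop != p or kind not in ("rdfs:domain", "rdfs:range"):
                continue
            node = s if kind == "rdfs:domain" else o
            if cls not in types.get(node, set()):
                bad.append((s, p, o))
                break
    return bad
```

Applied to the earlier examples, the statements about tiziana and databases pass, while `dumbo teaches csEngineering` is reported, since dumbo is typed only as Animal and csEngineering is not a Course.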
Some use cases for semantic technologies
1. To provide users with new discovery axes (Rakuten Ichiba, PriceMinister).
2. To do more fine-grained personalization in marketing campaigns (Rakuten Ichiba).
3. For accessing and integrating heterogeneous data sources (e.g., at Statoil and Siemens).
Giving Semantics to Data
Martin Rezk, Rakuten [email protected]
(Joint work with Bruno Charron, Hirate Yu, and David Purcell)
Published in the proc. of ISWC'16
Rakuten Technology Conference, October 22nd, 2016
• Ichiba offers around 200M items classified in a large legacy taxonomy (~40,000 classes).
• Each of these items has a page description created by the merchants.
What properties are important in each category?
• material, shape, size, color, maker, origin, period
What values exist?
• Poire, round, demi-lune, rectangular, oval
But which has the most GMS?
But which is trending?
What business need triggered this project? (Focusing on Ichiba for now)
Catalog Team (Comp.-User Interaction)
What are the relevant discovery axes that can help users explore complex classes?
Solution: Extracting semantic information
• We want to extract relevant data properties for complex and profitable subclasses.
• For each data property, we want to extract the relevant subset of its range.
• We do not want to model the whole domain.
• We want to link the properties and values back to the items.
• We want the solution to be as language independent as possible.
Pipeline: class selection → property extraction → linking
[Figure: HTML table markup and free Japanese text as input. Sample text, translated: "Elegant Ogura Hyakunin Isshu rice crackers! The packaging feels luxurious, and even when I mean to eat them a little at a time I can't help reaching for more, so I hide them and try to eat them bit by bit. Being too delicious is a problem too..." / "Rice crackers ★ trial set: I got a craving for Mochikichi's rice crackers, and home delivery is a big help because I don't have to go out and buy them. The taste is as delicious as ever ♪" / "Even when I give them to close friends, they are very well received ▼・∀・▼ Thank you"]
• Seed generation: using standard data mining techniques and novel mathematical models to clean the list.
• Bootstrapping: we use machine learning to extend the original seed.
• Linking: we generate triples of the form (Item 185, Origin, Japan).
Class subtree selection
Where we start: the taxonomy
Selecting which classes among the 40,000 we will work with.
We measure:
- Homogeneity (for the users)
- Need of navigational assistance (interesting to the user?)
Homogeneity
• The taxonomy tree we see: Alcohol > Wine > Red, White, Rose
• The taxonomy tree the users see (looking at users' shopping behavior): Wine > Usual wines (White and Red), Fancy wines (Rose)
• Subtree extraction: we select the "right" (homogeneous) subtrees to extract properties from.
Subtree selection: Need of Navigational Assistance
- Given a subtree, we compute its GMS diversity:
  $e^{-\sum_i p_i \ln(p_i)}$ (exponential of the Shannon entropy)
- where $p_i$ is the proportion of the total GMS of the subtree which is due to item $i$.
- Intuitively, it represents the effective number of items in a subtree making up its GMS.
- A subtree is said to have a high need for navigational assistance (NNA) if its effective number of items is more than 215.
[Figure: example subtree with GMS diversity ≈ 2.7]
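The GMS diversity above can be computed directly from per-item GMS figures. The following is a minimal Python sketch of that formula; the function names and the input format (a list of per-item GMS amounts) are assumptions, and only the threshold of 215 comes from the slide.

```python
import math

NNA_THRESHOLD = 215  # effective number of items quoted on the slide


def gms_diversity(item_gms):
    """Exponential of the Shannon entropy of the per-item GMS shares."""
    total = sum(item_gms)
    if total <= 0:
        raise ValueError("subtree has no GMS")
    entropy = 0.0
    for gms in item_gms:
        if gms > 0:  # items with zero GMS contribute nothing to the sum
            p = gms / total
            entropy -= p * math.log(p)
    return math.exp(entropy)


def has_high_nna(item_gms, threshold=NNA_THRESHOLD):
    """True if the effective number of items exceeds the threshold."""
    return gms_diversity(item_gms) > threshold
```

For a subtree whose GMS is spread evenly over n items the diversity is n, and for a subtree dominated by a single item it approaches 1, which matches the "effective number of items" reading.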
Seed generation: property value extraction
Given a subtree t, we extract the initial set of properties and values (PV) from HTML tables and semi-structured text input by merchants in t.
[Figure: HTML pages yield property candidates and possible values.]
This set contains some issues:
- Redundant property names: e.g., Maker/Producer
- Noisy property values.
- Useless PV for discovery axes: e.g., expiration date
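As an illustration of the seed generation step, here is a minimal Python sketch that pulls (property, value) pairs out of two-column HTML table rows using the standard library. It is not the extraction method used in the project; the class name, the two-cell heuristic, and the sample markup are assumptions.

```python
from html.parser import HTMLParser


class PVTableParser(HTMLParser):
    """Collect (property, value) pairs from table rows with exactly two cells."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Keep only rows that look like "property | value".
            if len(self._row) == 2 and all(self._row):
                self.pairs.append((self._row[0], self._row[1]))
            self._row = None


def extract_pv(html):
    """Return the list of (property, value) candidates found in the markup."""
    parser = PVTableParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

The raw pairs would still show the issues listed above (redundant names, noisy values, properties useless as discovery axes), so a cleaning pass is needed before they can serve as a seed.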
Bootstrapping: word2vec
• Word2vec is a shallow word embedding model.
• The model learns to map each discrete word id (0 through the number of words in the vocabulary) into a low-dimensional continuous vector space.
• Words with similar distributional properties (i.e., that co-occur regularly) tend to share some aspect of semantic meaning.
[Figure: words are mapped to vectors; white, blue, and turquoise are similar; electron, Mozart, and tango are not similar.]
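Similarity between word vectors is commonly measured with cosine similarity. The sketch below shows the idea in plain Python; the three-dimensional vectors are invented for illustration (real word2vec vectors are learned from text and have many more dimensions), and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors, NOT the output of a trained model.
VECTORS = {
    "white": [0.9, 0.1, 0.0],
    "blue": [0.8, 0.2, 0.1],
    "tango": [0.0, 0.2, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to the given word's."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))
```

With these toy vectors, "white" is closest to "blue" and far from "tango", mirroring the similar and not-similar groups on the slide.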
Bootstrapping: word2vec
• We train 2 different models with different parameters (CBOW, skip-gram, window, etc.) and chunking methods.
• We iterate over the properties, extending the known range, as follows:
[Figure sequence: for the property Shape, with known values Oval and Rectangle, the two Word2Vec models suggest Moon and Round; Round is added to Shape; the process then moves on to the property Material.]
Bootstrapping
For each property Pi:
    Take the seed values and get the most semantically similar words in each model.
    Take the intersection X.
    For each word in X:
        If the word is not a better value for a different property:
            Add it to the property values.
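The bootstrapping loop can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation: each model is assumed to be a callable that returns similar words for a set of seed values, and `fit(word, prop)` is a hypothetical scoring function used to decide whether a word is a better value for a different property.

```python
def bootstrap(seeds, models, fit):
    """Extend each property's values with words that all models agree on.

    seeds:  dict mapping property name -> set of seed values
    models: non-empty list of callables, seed values -> iterable of words
    fit:    callable (word, property name) -> float, higher is better
    """
    extended = {prop: set(values) for prop, values in seeds.items()}
    for prop, values in seeds.items():
        # Most similar words in each model, then the intersection X.
        candidate_sets = [set(model(values)) for model in models]
        x = set.intersection(*candidate_sets) - values
        for word in sorted(x):
            own = fit(word, prop)
            rivals = [fit(word, other) for other in seeds if other != prop]
            # Skip the word if it is a better value for a different property.
            if all(own >= rival for rival in rivals):
                extended[prop].add(word)
    return extended
```

Requiring agreement between the models is what drops a candidate such as Moon in the Shape example, while the cross-property check keeps a word like a material name from being added to Shape.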
Evaluation (at the time of publication)
• We evaluated our results in 4 categories: rice, beef, wine, and necktie.
• Rice and beef had been previously extended by the catalog team.
• The results from wine and necktie were evaluated by Rakuten members.
Results, beef: human comparison
                         Total       Intersection  Difference  Total   Values for   Values for
                         properties                            values  intersected  non-intersected
                                                                       properties   properties
Manually extracted       5           3             2           49      18           31
Automatically extracted  9           3             6           118     43           75
[Figure: diagram of the extracted properties: Cut, Type, Locality, Size, Intended Use, Ingredient, Allergens, Shipping fee, Processing area, Country of production, Product Name.]
Accuracy:
Subtree  Count  Overall  Max   Median  Mean  Min
Rice     6      0.92     1.00  0.97    0.81  0.20
Beef     9      0.86     1.00  0.88    0.83  0.50
Wine     9      0.91     1.00  0.81    0.77  0.20
Necktie  8      0.88     1.00  0.85    0.70  0.00
The above table shows the number of properties (Count), the overall accuracy of the properties, and the distribution of the accuracies by property.
Conclusions
• The Beauty: Rakuten can improve all its services by extracting semantic information.
• The Beast: There is too much data, spread across data sources, and the semantics is hidden in users' shopping logs, databases, and text.
• The Knight: With our approach, we can discover the semantics hidden in the data, and bring it to light to be used.
How much time is spent searching for data?
Engineers in industry spend a significant amount of their time searching for data that they require for their core tasks. For example, in the oil & gas industry, 30–70% of engineers' time is spent looking for data and assessing its quality. [Crompton, 2008]
Statoil Exploration
Facts:
• 1,000 TB of relational data
• using diverse schemata
• spread over 2,000 tables, over multiple individual databases
Data Access for Exploration:
• 900 experts in Statoil Exploration.
Experts in geology and geophysics develop stratigraphic models of unexplored areas on the basis of data acquired from previous operations at nearby locations.
How much time/money is spent searching for data?
A user query at Statoil
Show all Norwegian wellbores with some additional attributes (wellbore id, completion date, oldest penetrated age, result). Limit to all wellbores with a core and show attributes like (wellbore id, core number, top core depth, base core depth, intersecting stratigraphy). Limit to all wellbores with core in Brentgruppen and show key attributes in a table. After connecting to EPDS (slegge) we could for instance limit further to cores in Brent with measured permeability and where it is larger than a given value, for instance 1 mD. We could also find out whether there are cores in Brent which are not stored in EPDS (based on NPD info) and where there could be permeability values. Some of the missing data we possibly own, others not.
SELECT [...]
FROM
  db_name.table1 table1,
  db_name.table2 table2a, db_name.table2 table2b,
  db_name.table3 table3a, db_name.table3 table3b, db_name.table3 table3c, db_name.table3 table3d,
  db_name.table4 table4a, db_name.table4 table4b, db_name.table4 table4c,
  db_name.table4 table4d, db_name.table4 table4e, db_name.table4 table4f,
  db_name.table5 table5a, db_name.table5 table5b,
  db_name.table6 table6a, db_name.table6 table6b,
  db_name.table7 table7a, db_name.table7 table7b,
  db_name.table8 table8,
  db_name.table9 table9,
  db_name.table10 table10a, db_name.table10 table10b, db_name.table10 table10c,
  db_name.table11 table11,
  db_name.table12 table12,
  db_name.table13 table13,
  db_name.table14 table14,
  db_name.table15 table15,
  db_name.table16 table16
WHERE [...]
  table2a.attr1='keyword' AND
  table3a.attr2=table10c.attr1 AND
  table3a.attr6=table6a.attr3 AND
  table3a.attr9='keyword' AND
  table4a.attr10 IN ('keyword') AND
  table4a.attr1 IN ('keyword') AND
  table5a.kinds=table4a.attr13 AND
  table5b.kinds=table4c.attr74 AND
  table5b.name='keyword' AND
  (table6a.attr19=table10c.attr17 OR (table6a.attr2 IS NULL AND table10c.attr4 IS NULL)) AND
  table6a.attr14=table5b.attr14 AND
  table6a.attr2='keyword' AND
  (table6b.attr14=table10c.attr8 OR (table6b.attr4 IS NULL AND table10c.attr7 IS NULL)) AND
  table6b.attr19=table5a.attr55 AND
  table6b.attr2='keyword' AND
  table7a.attr19=table2b.attr19 AND
  table7a.attr17=table15.attr19 AND
  table4b.attr11='keyword' AND
  table8.attr19=table7a.attr80 AND
  table8.attr19=table13.attr20 AND
  table8.attr4='keyword' AND
  table9.attr10=table16.attr11 AND
  table3b.attr19=table10c.attr18 AND
  table3b.attr22=table12.attr63 AND
  table3b.attr66='keyword' AND
  table10a.attr54=table7a.attr8 AND
  table10a.attr70=table10c.attr10 AND
  table10a.attr16=table4d.attr11 AND
  table4c.attr99='keyword' AND
  table4c.attr1='keyword' AND
  table11.attr10=table5a.attr10 AND
  table11.attr40='keyword' AND
  table11.attr50='keyword' AND
  table2b.attr1=table1.attr8 AND
  table2b.attr9 IN ('keyword') AND
  table2b.attr2 LIKE 'keyword%' AND
  table12.attr9 IN ('keyword') AND
  table7b.attr1=table2a.attr10 AND
  table3c.attr13=table10c.attr1 AND
  table3c.attr10=table6b.attr20 AND
  table3c.attr13='keyword' AND
  table10b.attr16=table10a.attr7 AND
  table10b.attr11=table7b.attr8 AND
  table10b.attr13=table4b.attr89 AND
  table13.attr1=table2b.attr10 AND
  table13.attr20='keyword' AND
  table13.attr15='keyword' AND
  table3d.attr49=table12.attr18 AND
  table3d.attr18=table10c.attr11 AND
  table3d.attr14='keyword' AND
  table4d.attr17 IN ('keyword') AND
  table4d.attr19 IN ('keyword') AND
  table16.attr28=table11.attr56 AND
  table16.attr16=table10b.attr78 AND
  table16.attr5=table14.attr56 AND
  table4e.attr34 IN ('keyword') AND
  table4e.attr48 IN ('keyword') AND
  table4f.attr89=table5b.attr7 AND
  table4f.attr45 IN ('keyword') AND
  table4f.attr1='keyword' AND
  table10c.attr2=table4e.attr19 AND
  (table10c.attr78=table12.attr56 OR (table10c.attr55 IS NULL AND table12.attr17 IS NULL))
At Statoil, it takes up to 4 days to formulate a query in SQL.
Statoil loses up to €50M per year because of this.
Need for abstraction
We need to facilitate access to data:
• by abstracting away from how the data is stored, and
• by making use of a high-level view on the data, through an ontology.
Ontop
• Is a platform to query databases through ontologies, relying on semantic technologies.
• Compliant with the standards of the W3C.
• Supports all major relational DBs via JDBC (Oracle, DB2, MS SQL Server, Postgres, MySQL, H2, …).
• Open source and released under the Apache license.
• Development of Ontop:
  • development started 6 years ago
  • already well established:
    • 200 members in the mailing list
    • more than 7,000 downloads in the last 18 months
  • main development carried out in the context of the EU project Optique
http://ontop.inf.unibz.it
Query answering by rewriting
Ontological query → (Rewriting) → Rewritten query → (Unfolding) → SQL query → (Evaluation) → Relational answer → (Result translation) → Ontological answer
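The four stages can be walked through on a toy example. The Python sketch below is not Ontop's API or algorithm; it only mimics the flow for a single-class query, using the GradCourse/Course subclass axiom from the earlier RDFS example. The table names, the mappings, and the `ex:course/` prefix are all hypothetical.

```python
import sqlite3

# Ontology: GradCourse rdfs:subClassOf Course (subclass -> superclass).
SUBCLASS_OF = {"GradCourse": "Course"}

# Hypothetical mappings from ontology classes to SQL over invented tables.
MAPPINGS = {
    "Course": "SELECT code FROM course",
    "GradCourse": "SELECT code FROM grad_course",
}


def rewrite(cls):
    """Rewriting: the queried class plus all of its (transitive) subclasses."""
    result = [cls]
    for current in result:  # the list grows while we scan it
        for sub, sup in SUBCLASS_OF.items():
            if sup == current and sub not in result:
                result.append(sub)
    return result


def unfold(classes):
    """Unfolding: replace each class by its mapping and take the union."""
    return " UNION ".join(MAPPINGS[c] for c in classes if c in MAPPINGS)


def answer(conn, cls):
    """Evaluation over the database, then translation of rows into identifiers."""
    rows = conn.execute(unfold(rewrite(cls))).fetchall()
    return sorted("ex:course/" + code for (code,) in rows)


def demo_connection():
    """Build a small in-memory database matching the hypothetical mappings."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE course (code TEXT)")
    conn.execute("CREATE TABLE grad_course (code TEXT)")
    conn.execute("INSERT INTO course VALUES ('DB101')")
    conn.execute("INSERT INTO grad_course VALUES ('AI500')")
    return conn
```

Asking for all instances of Course returns both DB101 and AI500, even though AI500 is stored only in the table for graduate courses: the user writes the query against the ontology and never sees the tables.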
Thank you for listening
FRAZZ: © Jeff Mallett / Dist. by United Feature Syndicate, Inc.
Any questions? Diego Calvanese – [email protected] – [email protected]