schema.org usage for hotels
© Copyright 2015 STI INNSBRUCK www.sti-innsbruck.at
Elias Kärle – June 15th, 2015 – OC Meeting
An analysis based on the Web Data Commons data set
1. Motivation
• Schema.org annotation
• Hotels and tourism
  – Do they use schema.org annotations?
1) How many hotels use schema.org?
2) How is schema.org used?
   1) Which classes?
   2) Which attributes?
   3) Is schema.org used correctly?
3) Who is using schema.org in tourism?
2. Data
What is schema.org?
• Initiative founded in 2011
• Vocabulary for structuring data in web sites
• Embedded into HTML
  – Microdata
  – RDFa
  – JSON-LD
Source: http://www.schema.org
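To make the embedding concrete, the following is a minimal sketch of a schema.org Hotel annotation in JSON-LD, one of the three syntaxes listed above. The hotel name and address are invented for illustration, and the helper names `hotel_jsonld` and `as_script_tag` are assumptions, not part of the original slides.

```python
import json


def hotel_jsonld(name, street, postal_code, locality, country):
    """Build a schema.org Hotel description as a JSON-LD dictionary."""
    return {
        "@context": "http://schema.org",
        "@type": "Hotel",
        "name": name,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "postalCode": postal_code,
            "addressLocality": locality,
            "addressCountry": country,
        },
    }


def as_script_tag(data):
    """Wrap JSON-LD data in the script element used to embed it in HTML."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical hotel, for illustration only.
example = hotel_jsonld("Hotel Example", "Example Street 1", "6020", "Innsbruck", "AT")
print(as_script_tag(example))
```

Microdata and RDFa express the same statements as attributes on the existing HTML elements instead of a separate script block.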
Analysis of all web sites: Common Crawl
• Founded in 2007
• Non-profit organisation
• Crawls the web 4 times per year
• Data dumps are openly available to the public
• November 2013: 2.3 billion web pages, 148 TB
• December 2014: 2.1 billion web pages, 160 TB
Source: http://commoncrawl.org/the-data/get-started/
Surveying only structured data:
WebDataCommons:
• 2012: Freie Universität Berlin & KIT
• Currently: Uni Mannheim
• Operated by Chris Bizer
• Extracts structured data from the Common Crawl
  – WebTables: 147 million relational tables (from 11 billion HTML tables)
  – Hyperlink Graph: 3.5 billion web pages, 128 billion links
  – Semantically annotated data:
    • November 2013: 44 TB, 2.2 bn URLs
    • December 2014: 160 TB, 2 bn URLs
Source: http://webdatacommons.org/structureddata/
• November 2013 corpus
• Subset: schema.org/Hotel
  – 35 GB
  – 127 million triples
• OWLIM-SE Repository – thanks Ontotext
• SPARQL Queries
• Linux Debian 3.2, STI – thanks David
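The subset step can be sketched as a two-pass filter over the extracted statements. This is only an illustration, not the pipeline used for the talk: it assumes the data is available as N-Quads lines, that subject labels are unique across pages, and the function names `split_quad` and `hotel_subset` are hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
HOTEL = "<http://schema.org/Hotel>"


def split_quad(line):
    """Return (subject, predicate, rest) for one N-Quads line, or None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Subject and predicate never contain whitespace; the rest holds
    # the object, the graph (page URL) and the closing dot.
    parts = line.split(None, 2)
    return tuple(parts) if len(parts) == 3 else None


def hotel_subset(lines):
    """Keep every statement whose subject is typed as schema.org/Hotel."""
    lines = list(lines)
    hotels = set()
    for line in lines:  # pass 1: collect subjects typed as Hotel
        quad = split_quad(line)
        if quad and quad[1] == RDF_TYPE and quad[2].startswith(HOTEL):
            hotels.add(quad[0])
    kept = []
    for line in lines:  # pass 2: keep all statements about those subjects
        quad = split_quad(line)
        if quad and quad[0] in hotels:
            kept.append(line)
    return kept


# Invented sample statements, for illustration only.
sample = [
    '_:h1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Hotel> <http://example.org/page1> .',
    '_:h1 <http://schema.org/Hotel/name> "Hotel Example" <http://example.org/page1> .',
    '_:r1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Restaurant> <http://example.org/page2> .',
]
print(len(hotel_subset(sample)))
```

The sketch keeps only statements made directly about a Hotel node; nested nodes such as the address would need one more pass that follows the links from the kept statements.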
3. Analysis
1) How many hotels are annotated with schema.org?
4,841,353 annotated hotels
• Hotels are annotated several times
  – own website
  – booking websites
740,298 distinct hotel names
• Deduplicating by name loses all hotels that share a name
  – Adler, Post, ...
Bind to the address instead!
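The difference between the two counting strategies can be shown on a toy example. The records and the helper `count_distinct` below are invented for illustration; the sketch assumes that name plus street plus ZIP is enough to tell two hotels apart, which real data may not guarantee.

```python
def count_distinct(hotels, key):
    """Count hotels after collapsing records that share the same key."""
    return len({key(h) for h in hotels})


# Invented records: one "Hotel Post" annotated twice (own site and a
# booking site), a second, different "Hotel Post", and one "Hotel Adler".
records = [
    {"name": "Hotel Post", "street": "Example Street 1", "zip": "6020"},
    {"name": "Hotel Post", "street": "Example Street 1", "zip": "6020"},
    {"name": "Hotel Post", "street": "Other Street 5", "zip": "6370"},
    {"name": "Hotel Adler", "street": "Example Street 9", "zip": "6100"},
]

# Name only: the two different "Hotel Post" hotels collapse into one.
by_name = count_distinct(records, lambda h: h["name"].lower())

# Name bound to the address: duplicates collapse, distinct hotels survive.
by_name_and_address = count_distinct(
    records, lambda h: (h["name"].lower(), h["street"].lower(), h["zip"])
)

print(len(records), by_name, by_name_and_address)
```

The raw count overestimates (duplicates across sites), the name-only count underestimates (shared names), and the combined key lies in between.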
Annotation counts:
Hotel: 4,841,353
Address: 3,035,000
Country: 1,904,000
Name: 1,125,000
Region: 1,902,000
ZIP: 2,011,000
Street: 2,284,000
Hotels per Country
Austria: 148
Tirol: 287
Innsbruck: 63
 1. US 1021513    11. TR 681    21. TH 234
 2. CA   52360    12. AE 391    22. SR 217
 3. CN   20648    13. KR 377    23. HK 156
 4. GB   11580    14. RO 373    24. EC 150
 5. DE    3163    15. QA 343    25. AT 148
 6. MX    1921    16. PA 299    26. CO 143
 7. PR    1250    17. SA 292    27. PE 129
 8. AR    1016    18. AU 290    28. BE 127
 9. PH     765    19. BR 258    29. ID 109
10. IN     699    20. CH 238    30. BH  93
Obviously there are annotation errors (e.g. more hotels are annotated with region Tirol than with country Austria)
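A grouping like the one above, together with a simple consistency check, can be sketched as follows. The records are invented, and the explanation tested here (annotations that give a region but no country) is only one assumed cause of the mismatch, not a finding from the slides.

```python
from collections import Counter

# Invented address annotations; None marks a missing value.
hotels = [
    {"country": "AT", "region": "Tirol"},
    {"country": None, "region": "Tirol"},
    {"country": None, "region": "Tirol"},
    {"country": "US", "region": "NY"},
]

# Group by country and by region, skipping missing values.
per_country = Counter(h["country"] for h in hotels if h["country"])
per_region = Counter(h["region"] for h in hotels if h["region"])

# A region cannot contain more hotels than its country, so a larger
# regional count points to incomplete or wrong country annotations.
missing_country = [h for h in hotels if h["region"] == "Tirol" and not h["country"]]

print(per_country["AT"], per_region["Tirol"], len(missing_country))
```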
Hotels grouped by ZIP in Tirol (pie chart)
ZIP     Place        Share
6020    Innsbruck    18%
6370    Kitzbühel    10%
6100    Seefeld       8%
6450    Sölden        4%
6580    St. Anton     4%
6456    Obergurgl     3%
6215    Achenkirch    2%
6213    Pertisau      2%
6365    Kirchberg     2%
6010                  2%
other                45%
What categories of hotels are annotated?
http://schema.org/Rating
Annotation counts:
Hotel: 4,841,353
Address: 3,035,000
Country: 1,904,000
Name: 1,125,000
Region: 1,902,000
Rating: 2,377,000
RatingValue: 2,375,000
What categories of hotels are annotated?
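[Bar chart: number of annotated hotels per rating category; the category labels are not recoverable]
Bar values, in extraction order: 866,932; 651,606; 426,925; 176,800; 135,958; 35,079; 66,208; 15,476; 941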
2) How is schema.org used?
schema.org usage (pie chart, share of all property occurrences):
http://schema.org/Hotel/name                      15%
http://schema.org/Hotel/review                    14%
http://www.w3.org/1999/02/22-rdf-syntax-ns#type   13%
http://schema.org/Hotel/image                      9%
http://schema.org/Hotel/address                    8%
http://schema.org/Hotel/aggregateRating            7%
http://schema.org/Hotel/rating                     6%
http://schema.org/Hotel/description                5%
http://schema.org/Hotel/url                        5%
http://schema.org/Hotel/geo                        4%
Other                                             13%

Property                                                #  %
http://schema.org/Hotel/name                      5666474  117.0432
http://schema.org/Hotel/review                    5226132  107.9478
http://www.w3.org/1999/02/22-rdf-syntax-ns#type   4841353  100
http://schema.org/Hotel/image                     3439579  71.04582
http://schema.org/Hotel/address                   3035301  62.6953
http://schema.org/Hotel/aggregateRating           2723587  56.25673
http://schema.org/Hotel/rating                    2377406  49.10623
http://schema.org/Hotel/description               1934486  39.95755
http://schema.org/Hotel/url                       1749830  36.14341
http://schema.org/Hotel/geo                       1323333  27.33395
http://schema.org/Hotel/telephone                 1124948  23.23623
http://schema.org/Hotel/faxNumber                  703274  14.52639
http://schema.org/Hotel/photo                      642159  13.26404
http://schema.org/Hotel/openingHours               558353  11.533
http://schema.org/Hotel/logo                       549525  11.35065
http://schema.org/Hotel/branchof                   369942  7.641294
http://schema.org/Hotel/additionalType             308168  6.365328
http://schema.org/Hotel/photos                     224887  4.645127
http://schema.org/Hotel/maps                        86935  1.795676
http://schema.org/Hotel/breadcrumb                  82122  1.696261
http://schema.org/Hotel/priceRange                  52071  1.075546
http://schema.org/Hotel/price                       37634  0.777345
http://schema.org/Hotel/email                       31854  0.657957
http://schema.org/Hotel/event                       24838  0.513038
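The percentage column can be reproduced by dividing each property count by the number of rdf:type statements, which is the row listed as 100. The sketch below uses four rows from the table; the names `TYPE_COUNT` and `share` are chosen here for illustration.

```python
# Number of rdf:type statements, i.e. annotated Hotel entities (from the table).
TYPE_COUNT = 4841353

# A few property counts taken from the table.
counts = {
    "name": 5666474,
    "review": 5226132,
    "image": 3439579,
    "address": 3035301,
}


def share(count, total=TYPE_COUNT):
    """Property occurrences per 100 Hotel entities."""
    return 100.0 * count / total


for prop, n in counts.items():
    print(f"{prop:10s} {n:>8d} {share(n):9.4f}")
```

Values above 100, as for name and review, mean the property occurs more often than there are Hotel entities. Assuming every counted statement belongs to a Hotel entity, at least some hotels must then carry that property more than once.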
3) Who uses schema.org in tourism?
Hypothesis:
"Schema.org is mainly used by booking and rating websites, barely by hotels themselves."
Approach:
• Take hotels listed on booking and rating sites
  – Search for annotations on their own web sites
• Cross-check with annotated hotel websites
  – Do they appear multiple times in the data set?
Currently: by example (top booking sites)
Next step: full data set
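One way to run this check over the full data set is to group annotations by the host of the page they were found on. The following is a rough sketch under several assumptions: hotels have already been given a stable key (e.g. name plus address), the set of portal hosts is known in advance, and all hosts, keys and function names below are invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical booking and rating portals.
PORTAL_HOSTS = {"booking.example", "ratings.example"}


def sources_per_hotel(annotations):
    """Map each hotel key to the kinds of sites that annotate it.

    annotations: iterable of (hotel_key, page_url) pairs.
    """
    sources = defaultdict(set)
    for key, url in annotations:
        host = urlparse(url).hostname or ""
        kind = "portal" if host in PORTAL_HOSTS else "other"
        sources[key].add(kind)
    return dict(sources)


# Invented annotations, for illustration only.
annotations = [
    ("hotel-a", "http://booking.example/hotel/a"),
    ("hotel-a", "http://ratings.example/review/a"),
    ("hotel-b", "http://booking.example/hotel/b"),
    ("hotel-b", "http://hotel-b.example/"),
]

sources = sources_per_hotel(annotations)
# Hotels that are annotated on portals but never anywhere else.
portal_only = sorted(k for k, kinds in sources.items() if kinds == {"portal"})
print(portal_only)
```

A high share of portal-only hotels would support the hypothesis; hotels that appear under both kinds answer the cross-check question about multiple appearances.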
Summary:
• Main users of schema.org/Hotel: booking and rating sites
• Errors:
  – Incomplete annotations
  – Wrong classes
  – Wrong attributes
  – Wrong datatypes
• Comprehensive error analysis: Uni Mannheim (R. Meusel & H. Paulheim) [1]
[1] http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/MeuselPaulheim-HeuristicsForFixingCommonErrorsInDeployedSchemaOrgMicrodata-ESWC2015.pdf