Web Search Environments: Web Crawling Metadata using RDF and Dublin Core. Dave Beckett, http://purl.org/net/dajobe/ Slides: http://ilrt.org/people/cmdjb/talks/tnc2002/



Web Search Environments
Web Crawling Metadata using RDF and Dublin Core

Dave Beckett
http://purl.org/net/dajobe/

Slides: http://ilrt.org/people/cmdjb/talks/tnc2002/


Introduction

• Overview of SGs and Web Crawling
• Why WSE, what’s new? Novel results
• Future work (or stuff we didn’t do) and conclusions


Overview

• Digital Library community
• In UK, subject-specific gateways (SGs)
• Want to improve: scope (more), timeliness (fresh), cost (less)
• Stay professional – the Quality word
• Compete with web search engines – the Google Test


Human Cataloguing of the Web

• Pros: High quality, domain knowledge selection, subject-specialised, cataloguing done to well-known and developed standards

• Cons: Expensive, slow, descriptions need to be reviewed regularly to keep them relevant


Software running web crawls

• Pros: vastly comprehensive (Con: too much), can be very up-to-date

• Cons: cannot distinguish “this page sucks” from “this page rocks”, indiscriminate, subject to spamming, very general (but…)


Combining Web Crawling and High Quality Description

A solution
• Seed the web crawl from high quality records
• Crawl to other (presumably) good quality pages
• Track the provenance of the crawled pages
• Provenance can be used for querying and result ranking
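
The seeding and provenance-tracking steps above can be sketched roughly as follows. This is only an illustration, not the project's crawler: the in-memory `LINKS` table stands in for the live web, the record URIs are made up, and the ranking rule (fewest hops first, then most seeding records) is an assumption chosen to show how provenance could feed result ranking.

```python
from collections import deque

# Hypothetical link graph standing in for real fetched pages.
LINKS = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": [],
    "http://example.org/x": ["http://example.org/c"],
}


def crawl(seeds, links, max_depth=2):
    """Breadth-first crawl from catalogued seed URLs.

    seeds maps a seed page URL to the URI of the catalogue record describing it.
    Returns {page_url: {record_uri: fewest hops from that record's seed}}.
    """
    provenance = {}
    queue = deque((url, record, 0) for url, record in seeds.items())
    while queue:
        url, record, depth = queue.popleft()
        reached_from = provenance.setdefault(url, {})
        if record in reached_from:
            continue  # already reached from this record by a path at least as short
        reached_from[record] = depth
        if depth < max_depth:
            for target in links.get(url, []):
                queue.append((target, record, depth + 1))
    return provenance


def rank(provenance):
    """Assumed ranking: closest to a seed first, then most seeding records."""
    return sorted(
        provenance,
        key=lambda url: (min(provenance[url].values()), -len(provenance[url]), url),
    )


seeds = {
    "http://example.org/a": "urn:example:record:1",
    "http://example.org/x": "urn:example:record:2",
}
result = crawl(seeds, LINKS)
for url in rank(result):
    print(url, result[url])
```

Keeping the record URI alongside each crawled page is what lets a later query ask which catalogue record led to a given page.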


Web Search Environments (WSE) Project

• Research by ILRT and later Resource Discovery Network (RDN)
• RDN funds UK SGs (ILRT also had DutchESS)


WSE Technologies

• Simple Dublin Core (DC) records extracted from SGs

• OAI protocol used to collect these records in one place (not required)

• Combine Web Crawler

• RDF framework to connect the resource descriptions together
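
As a rough illustration of the OAI collection step, the sketch below builds an OAI-PMH `ListRecords` request URL asking for simple Dublin Core (`oai_dc`). The base URL is hypothetical, and nothing here reflects how the project's own harvester was configured.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    When continuing a partial list, resumptionToken is sent on its own,
    since OAI-PMH treats it as an exclusive argument.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical gateway endpoint, for illustration only.
print(list_records_url("http://gateway.example/oai"))
```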


Simple DC Records

Really simple:
• Title
• Description
• Identifier (URI of resource)
• Source (URI of record)
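
A minimal sketch of such a record, assuming the four fields map onto the Dublin Core element set namespace and that the described resource is the subject of each statement. The class name and example values are invented for illustration.

```python
from dataclasses import dataclass

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


@dataclass(frozen=True)
class SimpleDCRecord:
    title: str
    description: str
    identifier: str  # URI of the described resource
    source: str      # URI of the catalogue record

    def triples(self):
        """Return (subject, predicate, object) tuples about the resource."""
        subject = self.identifier
        return [
            (subject, DC + "title", self.title),
            (subject, DC + "description", self.description),
            (subject, DC + "identifier", self.identifier),
            (subject, DC + "source", self.source),
        ]


record = SimpleDCRecord(
    title="Example Economics Site",
    description="A made-up catalogued site.",
    identifier="http://example.org/site/",
    source="http://gateway.example/record/42",
)
for triple in record.triples():
    print(triple)
```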


Information model 1

• DC records describe all the resources

• Web crawler reads these and returns crawled web pages

• These generate a new web crawled resource


Information model 2

• Link back to original record(s), plus web page properties

• RDF model lets these be connected via page, record URIs

• Giving one large RDF graph of the total information
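
The way shared URIs join the two data sets can be sketched with plain tuples as triples. The `crawledFrom` property and the `http://example.org/wse#` namespace are hypothetical stand-ins, not the project's actual schema, and the data is invented.

```python
DC = "http://purl.org/dc/elements/1.1/"
WSE = "http://example.org/wse#"  # hypothetical vocabulary for illustration


def objects(graph, subject, predicate):
    """All objects of triples matching (subject, predicate, ?)."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def subjects(graph, predicate, obj):
    """All subjects of triples matching (?, predicate, obj)."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


def catalogue_titles_for_page(graph, page_uri):
    """Follow page -> record -> catalogued resource and collect its titles."""
    titles = []
    for record in objects(graph, page_uri, WSE + "crawledFrom"):
        for resource in subjects(graph, DC + "source", record):
            titles.extend(objects(graph, resource, DC + "title"))
    return titles


# Statements from the human-made catalogue record.
catalogue = {
    ("http://example.org/site/", DC + "title", "Example Economics Site"),
    ("http://example.org/site/", DC + "source", "http://gateway.example/record/42"),
}
# Statements produced by the crawler about a page it found.
crawl_output = {
    ("http://example.org/site/papers.html", WSE + "crawledFrom", "http://gateway.example/record/42"),
    ("http://example.org/site/papers.html", DC + "title", "Working papers"),
}

# Merging is a set union: the record URI appears in both sets and links them.
graph = catalogue | crawl_output
print(catalogue_titles_for_page(graph, "http://example.org/site/papers.html"))
```

Because the join happens purely on URIs, further data sets can be added to the graph without changing either existing set.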


WSE graph


Novel Outcomes?

It is obvious that:
• Metadata gathering is not new (Harvest)
• Web crawling is not new (Lycos)
• Cataloguing is not new (1000s of years)

So what is new?


WSE – Areas Not Focused

I digress…
• Gathering data together – not crucial, Combine is a distributed harvester
• Full text indexing – not optimised
• Web crawling algorithm – the routes through the web were not selected in a sophisticated way


WSE – General Benefits

• Connecting separate systems (one less place needed to go)

• RDF graph allows more data mixing (not fragile)

• Leverages existing systems (Combine, Zebra), standards (RDF, DC)


WSE – Novel Searching

• “game theory napster” – zero hits
• Cross-subject searching in one system – “gmo”
• Can navigate resulting provenance


WSE – Gains

• Web crawling gains from high quality human description

• SGs gain from increase in relevant pages

• Fresher content than human-catalogued resource

• More focused than a general search engine


WSE as a new tool

• For subject experts
• Which includes cataloguers
• Gives fast, relevant search (no formal precision, recall analysis)


WSE – new areas

• Cross-subject searching possible in subjects not yet catalogued, or that fall between SGs

• Searching emerging topics is possible ahead of additions to catalogue standards

• Helps indicate where new SGs, thesauri are needed


WSE – deploying

• ILRT WSE
• RDN WSE
• RDN – investigating for the main search system


WSE for SGs

Individual SGs – enhancing subject-specific searches:

• Deep / full web crawling of high quality sites

• Granularity of cataloguing and cost

It is better for humans to describe entire sites (or large parts) and let the software do the detailed work of individual pages


Future

• Improve and target the crawling
• Use the SG information with result ranking
• Add other relevant data to the graph such as RSS news
• A Semantic Web application


Questions?

• Thank You
• Slides: http://ilrt.org/people/cmdjb/talks/tnc/2002/
• Project: http://wse.search.ac.uk/


References

• Combine Web Crawler: http://www.lub.lu.se/combine/

• Dublin Core: http://dublincore.org/
• ILRT: http://ilrt.org/
• RDF: http://www.w3.org/RDF/
• Semantic Web: http://www.w3.org/2001/sw/