
1

Web Search Environments: Web Crawling Metadata using RDF and Dublin Core

Dave Beckett
http://purl.org/net/dajobe/

Slides: http://ilrt.org/people/cmdjb/talks/tnc2002/

2

Introduction
• Overview of SGs and Web Crawling
• Why WSE, what's new? Novel results
• Future work (or stuff we didn't do) and conclusions

3

Overview
• Digital Library community
• In the UK, subject-specific gateways (SGs)
• Want to improve: scope (more), timeliness (fresher), cost (less)
• Stay professional – the Quality word
• Compete with web search engines – the Google Test

4

Human Cataloguing of the Web
• Pros: high quality, domain-knowledge selection, subject-specialised, cataloguing done to well-known and developed standards
• Cons: expensive, slow; descriptions need to be reviewed regularly to keep them relevant

5

Software running web crawls
• Pros: vastly comprehensive (also a con: too much), can be very up to date
• Cons: cannot distinguish "this page sucks" from "this page rocks"; indiscriminate; subject to spamming; very general (but…)

6

Combining Web Crawling and High-Quality Description
A solution:
• Seed the web crawl from high-quality records
• Crawl to other (presumably) good-quality pages
• Track the provenance of the crawled pages
• Provenance can be used for querying and result ranking (a minimal crawler sketch follows)
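As a sketch of that combination, a toy provenance-tracking crawler might look like the following. All URLs and record URIs are invented for illustration; the real project used the Combine crawler, not anything like this Python fragment.

```python
import collections
import re
import urllib.request

# Hypothetical seed set: resource URL -> URI of the catalogue record
# that describes it (both values invented for illustration).
SEEDS = {
    "http://example.org/quality-site/": "http://example.org/records/12345",
}

HREF = re.compile(rb'href="(http[^"]+)"')

def crawl(seeds, max_pages=100):
    """Breadth-first crawl that remembers, for every page fetched,
    which catalogue record the crawl route started from."""
    provenance = {}                          # page URL -> seed record URI
    queue = collections.deque(seeds.items())
    while queue and len(provenance) < max_pages:
        url, record = queue.popleft()
        if url in provenance:
            continue
        provenance[url] = record
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue
        # Outgoing links inherit the provenance of the page they appear on,
        # so every crawled page can be traced back to a seed record.
        for m in HREF.finditer(html):
            queue.append((m.group(1).decode("ascii", "ignore"), record))
    return provenance
```

Every entry in the returned map ties a crawled page back to the catalogue record its crawl route began at, which is exactly the provenance exploited later for querying and result ranking.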

7

Web Search Environments (WSE) Project
• Research by ILRT and later the Resource Discovery Network (RDN)
• RDN funds UK SGs (ILRT also had DutchESS)

8

WSE Technologies
• Simple Dublin Core (DC) records extracted from SGs
• OAI protocol used to collect these records in one place (not required) – a harvesting sketch follows
• Combine web crawler
• RDF framework to connect the resource descriptions together
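OAI-PMH is plain HTTP plus XML, so the collection step needs little machinery. A minimal sketch, assuming a hypothetical endpoint and ignoring resumption-token paging:

```python
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest(base_url):
    """Fetch one ListRecords page of oai_dc records and yield
    (identifier, title, description) triples."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    tree = ET.parse(urllib.request.urlopen(url))
    for rec in tree.iter(OAI + "record"):
        yield (rec.findtext(".//" + DC + "identifier"),
               rec.findtext(".//" + DC + "title"),
               rec.findtext(".//" + DC + "description"))

# Hypothetical endpoint; any OAI-PMH repository would serve.
for ident, title, desc in harvest("http://example.org/oai"):
    print(ident, title)
```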

9

Simple DC Records
Really simple:
• Title
• Description
• Identifier (URI of resource)
• Source (URI of record)
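One plausible RDF rendering of such a record, sketched with the modern rdflib library (all URIs invented; the subject is the catalogued resource, and dc:source carries the record URI, following the slide):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

resource = URIRef("http://example.org/quality-site/")   # the catalogued site
record = URIRef("http://example.org/records/12345")     # the SG's record

g = Graph()
g.add((resource, DC.title, Literal("An Example Subject Gateway Entry")))
g.add((resource, DC.description, Literal("Short human-written description.")))
g.add((resource, DC.identifier, Literal(str(resource))))
g.add((resource, DC.source, record))   # provenance: which record said this

print(g.serialize(format="turtle"))
```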

10

Information model 1
• DC records describe all the resources
• Web crawler reads these and returns crawled web pages
• These generate a new web-crawled resource

11

Information model 2
• Link back to original record(s), plus web page properties
• RDF model lets these be connected via page and record URIs
• Giving one large RDF graph of the total information
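In the same illustrative rdflib style, the crawler's output is just more triples about new URIs, so the gateway's records and the crawl results merge into one graph by set union (URIs again invented):

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

record = URIRef("http://example.org/records/12345")
page = URIRef("http://example.org/quality-site/subpage.html")

crawl_graph = Graph()
crawl_graph.add((page, DC.title, Literal("Title taken from the crawled HTML")))
crawl_graph.add((page, DC.source, record))  # link back to the originating record

# Pages and records are plain URIs, so the gateway's record graph and
# the crawler's graph combine into one large graph by simple set union.
gateway_graph = Graph()            # would hold the Simple DC records
combined = gateway_graph + crawl_graph
```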

12

WSE graph [diagram: catalogue records and crawled pages connected in one RDF graph]

13

Novel Outcomes?
It is obvious that:
• Metadata gathering is not new (Harvest)
• Web crawling is not new (Lycos)
• Cataloguing is not new (thousands of years)
So what is new?

14

WSE – Areas Not Focused On
I digress…
• Gathering data together – not crucial; Combine is a distributed harvester
• Full-text indexing – not optimised
• Web crawling algorithm – the routes through the web were not selected in a sophisticated way

15

WSE – General Benefits
• Connecting separate systems (one less place needed to go)
• RDF graph allows more data mixing (not fragile)
• Leverages existing systems (Combine, Zebra) and standards (RDF, DC)

16

WSE – Novel Searching
• "game theory napster" – zero hits
• Cross-subject searching in one system – "gmo"
• Can navigate the resulting provenance (a query sketch follows)

17

WSE – Gains
• Web crawling gains from high-quality human description
• SGs gain from an increase in relevant pages
• Fresher content than a human-catalogued resource
• More focused than a general search engine

18

WSE as a new tool
• For subject experts
• Which includes cataloguers
• Gives fast, relevant search (no formal precision/recall analysis)

19

WSE – new areas
• Cross-subject searching possible in subjects not yet catalogued, or that fall between SGs
• Searching emerging topics is possible ahead of additions to catalogue standards
• Helps indicate where new SGs and thesauri are needed

20

WSE – deploying
• ILRT WSE
• RDN WSE
• RDN – investigating for the main search system

21

WSE for SGs
Individual SGs – enhancing subject-specific searches:
• Deep / full web crawling of high-quality sites
• Granularity of cataloguing and cost
It is better for humans to describe entire sites (or large parts) and let the software do the detailed work on individual pages.

22

Future
• Improve and target the crawling
• Use the SG information with result ranking
• Add other relevant data to the graph, such as RSS news (a sketch follows)
• A Semantic Web application
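RSS 1.0 happens to be RDF/XML already, so feed data drops into the same graph with no mapping step. A one-line sketch in the same illustrative rdflib style, with an invented feed URL:

```python
from rdflib import Graph

g = Graph()
# RSS 1.0 documents are RDF/XML, so they parse directly into the graph.
g.parse("http://example.org/news.rss", format="xml")
print(len(g), "triples harvested from the feed")
```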
