The ADEPT Digital Library Architecture

Alexandria Digital Earth ProtoType
James Frew & Greg Janée
Alexandria Digital Library Project
University of California, Santa Barbara


Page 2: The ADEPT Digital Library Architecture

Alexandria Digital Earth ProtoType

Frew • Cornell DL seminar • 2002-04-23

ADEPT Introduction

Alexandria Digital Earth ProtoType (ADEPT) is:
• Distributed digital library for geo-referenced information
• Services supporting DL federation and interoperation
• Large geospatial collections

Goal: an Internet “library” layer
• Organization
• Persistence
• Accessibility
• Scalability
  – Lots of collections
  – Big collections, small collections
  – Heterogeneous contents

Page 3: The ADEPT Digital Library Architecture

Outline

• Core library architecture
• Metadata interoperability
• Other features
  – Query translation
  – Collection discovery

Page 4: The ADEPT Digital Library Architecture

ADEPT Core Library Architecture

Page 5: The ADEPT Digital Library Architecture

Architectural Elements

• Item
  – structured descriptions (“reports”)
  – contents (optional)
• Library
  – set of collections
  – client (public) services
• Collection
  – set of items
  – library (internal) services

= a distributed catalog system

Page 6: The ADEPT Digital Library Architecture

The Big Picture

[Diagram. Labels: ADEPT; client; library (middleware server), shown twice; proxy collection; collection, shown twice; item, shown four times.]

Page 7: The ADEPT Digital Library Architecture

Role of the Middleware

• local access point
• standard services
• access control
• thin client support
• distributed search
• brokering of queries & results
• proxying of collections & items
• creation & organization of “local” collections

[Diagram with a logical view and a functional view. Labels: client; middleware; collection discovery service; collection; item; RDBMS; Z39.50; “local”; thesaurus/vocabulary.]

Page 8: The ADEPT Digital Library Architecture

Middleware Interfaces

Clients
  Configuration → {collection-id}
  Collection(collection-id) → report
  Query(query) → query-id
  Results(query-id) → {holding-id}
  Metadata(collection-id, holding-id, view) → report
Libraries
  Collection → report
  Query(query, accumulator) → query-thread
  Metadata(holding-id, view) → report
Collections
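The two interface layers on this slide can be sketched as a pair of small Python classes. This is an in-memory toy and not ADEPT's Java interfaces: the class names, the use of a Python predicate in place of a query, the synchronous stand-in for the query thread, and the sample data are all assumptions made for illustration. Results are returned as (collection-id, holding-id) pairs so that the client-side Metadata call has both identifiers it needs.

```python
from itertools import count
from typing import Callable, Dict, List, Set, Tuple

Views = Dict[str, str]  # view name (e.g. "scan", "full") -> report text


class ToyCollection:
    """Collection-facing operations: Collection, Query, Metadata (hypothetical names)."""

    def __init__(self, report: str, holdings: Dict[str, Views]):
        self.report = report
        self.holdings = holdings  # holding-id -> views

    def collection(self) -> str:
        return self.report

    def query(self, predicate: Callable[[Views], bool], accumulator: Set[str]) -> None:
        # The slide returns a query-thread; this sketch simply runs synchronously.
        for holding_id, views in self.holdings.items():
            if predicate(views):
                accumulator.add(holding_id)

    def metadata(self, holding_id: str, view: str) -> str:
        return self.holdings[holding_id][view]


class ToyLibrary:
    """Client-facing operations: Configuration, Collection, Query, Results, Metadata."""

    def __init__(self, collections: Dict[str, ToyCollection]):
        self.collections = collections
        self.result_sets: Dict[int, List[Tuple[str, str]]] = {}
        self.query_ids = count(1)

    def configuration(self) -> List[str]:
        return sorted(self.collections)

    def collection(self, collection_id: str) -> str:
        return self.collections[collection_id].collection()

    def query(self, predicate: Callable[[Views], bool]) -> int:
        hits: List[Tuple[str, str]] = []
        for collection_id in self.configuration():  # fan the query out
            accumulator: Set[str] = set()
            self.collections[collection_id].query(predicate, accumulator)
            hits.extend((collection_id, h) for h in sorted(accumulator))
        query_id = next(self.query_ids)
        self.result_sets[query_id] = hits  # merged results, stored per query-id
        return query_id

    def results(self, query_id: int) -> List[Tuple[str, str]]:
        return self.result_sets[query_id]

    def metadata(self, collection_id: str, holding_id: str, view: str) -> str:
        return self.collections[collection_id].metadata(holding_id, view)


library = ToyLibrary({
    "maps": ToyCollection("Toy map collection", {"m1": {"scan": "Goleta quad map"}}),
    "photos": ToyCollection("Toy photo collection", {
        "p1": {"scan": "Goleta air photo"},
        "p2": {"scan": "Ventura air photo"},
    }),
})
query_id = library.query(lambda views: "Goleta" in views["scan"])
hits = library.results(query_id)
```

The client only ever talks to the library object; the per-collection accumulator is what lets the library merge results from several collections under one query-id.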

Page 9: The ADEPT Digital Library Architecture

Metadata Reports

• Collection
  – Metadata that applies to entire collection
• Bucket
  – Item’s bucket metadata
• Scan
  – Brief (“1-line”) subset of bucket report
• Full
  – All the item’s metadata, in whatever format
• Browse
  – URL(s) for reduced-resolution graphics
• Access
  – URL(s) for content (if available)

Page 10: The ADEPT Digital Library Architecture

A Complete System: The Boxology

[Layered diagram with three tiers: CLIENT, MIDDLEWARE, SERVER. Recoverable labels:]
• web browser; SDLIP proxy, other clients
• web intermediary/XML→HTML converter
• client-side services (Java classes)
• HTTP transport; RMI transport; HTTP; XML
• core functionality; access control (service- and collection-level); query fan-out & results merging; query result ranking; result set caching
• access control mechanisms; ranking methods
• configuration file
• server-side interface (Java interfaces)
• generic DB driver; query translator; paradigm library; configuration files, Python scripts; JDBC; RDBMS
• proxy driver; group driver; Z39.50 driver
• thesauri
• OR

Page 11: The ADEPT Digital Library Architecture

Metadata Interoperability

Page 12: The ADEPT Digital Library Architecture

ADEPT’s Interoperability Problem

• Distributed, heterogeneous collections
  – locally, autonomously created and managed
• Minimize impact on collection providers
  – allow use of native metadata
• Uniform client services
  – common high-level interface across collections
  – discover and exploit collection-specific interfaces
• Assumptions
  – items have metadata
  – items have sufficient, “good” metadata
  – i.e., this is a metadata interoperability problem

Page 13: The ADEPT Digital Library Architecture

Bucket Concept

• Abstract metadata category
• Strongly typed
• Well-defined search semantics
  – query terms
  – query operators
• Explicitly mapped from source metadata
  – (FGDC, 1.3, “Time period of content”, “2001-09-08”)
• Bucket-level search uniform across all collections
  – e.g.: search all collections for items whose Originator bucket contains the phrase “geological survey”
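A bucket value that keeps its source mapping can be modelled as a small record, and a bucket-level search then ignores which source field a value came from. The sketch below is only an illustration of that idea: the class and function names, the case-insensitive substring rule for "contains the phrase", and the sample collections are assumptions, not ADEPT's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MappedValue:
    """One source-metadata field mapped into a bucket."""
    standard: str    # e.g. "FGDC"
    field_id: str    # e.g. "1.3"
    field_name: str  # e.g. "Time period of content"
    value: str       # e.g. "2001-09-08"


Item = Dict[str, List[MappedValue]]  # bucket name -> mapped values
Collection = Dict[str, Item]         # item id -> item


def bucket_contains_phrase(item: Item, bucket: str, phrase: str) -> bool:
    # Assumption: a case-insensitive substring test stands in for phrase matching.
    return any(phrase.lower() in mv.value.lower() for mv in item.get(bucket, []))


def search_all(collections: Dict[str, Collection], bucket: str, phrase: str) -> List[Tuple[str, str]]:
    """Return (collection name, item id) for every item whose bucket matches."""
    return [
        (name, item_id)
        for name, items in collections.items()
        for item_id, item in items.items()
        if bucket_contains_phrase(item, bucket, phrase)
    ]


# Hypothetical sample data using two different native metadata standards.
collections = {
    "foo": {
        "f1": {"Originator": [MappedValue("FGDC", "1.1/8.1", "Citation/Originator", "U.S. Geological Survey")],
               "Coverage date": [MappedValue("FGDC", "1.3", "Time period of content", "2001-09-08")]},
        "f2": {"Originator": [MappedValue("FGDC", "1.1/8.1", "Citation/Originator", "NASA")]},
    },
    "bar": {
        "b1": {"Originator": [MappedValue("DC", "Creator", "Creator", "California Geological Survey")]},
    },
}
hits = search_all(collections, "Originator", "geological survey")
```

The same query reaches both collections even though one maps an FGDC field and the other a Dublin Core field into the Originator bucket.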

Page 14: The ADEPT Digital Library Architecture

Bucket Properties

• name
  – Coverage date
• semantic definition
  – The time period to which the item is relevant.
• data type (strictly observed)
  – calendar date or range of calendar dates
• syntactic representation (strictly observed)
  – ISO 8601

Page 15: The ADEPT Digital Library Architecture

What is a bucket? (2/2)

• Source metadata is mapped to buckets
  – buckets hold not just simple values
    – “2001-09-08”
  – but rather, explicit representations of mappings
    – (FGDC, 1.3, “Time period of content”, “2001-09-08”)
  – may have multiple values per bucket
• Bucket definition includes search semantics
  – query terms
    – ISO 8601 date range
  – query operators
    – contains, overlaps, is-contained-in
• Some semantics are fuzzy to accommodate multiple implementations

Page 16: The ADEPT Digital Library Architecture

Collection-level aggregation

• Collection-level metadata describes
  – buckets supported by the collection
  – item-level metadata mappings
  – statistical overviews
    – item counts
    – spatiotemporal coverage histograms
• Example (de-XML-ized)
  – in collection foo, the Originator bucket is supported and the following item fields are mapped to it:
    – (FGDC, 1.1/8.1, “Citation/Originator”) [973 items]
    – (USGS DOQ, PRODUCER, “Producer”) [973 items]
    – (DC, Creator, “Creator”) [1249 items]
    – unknown [6 items]

Page 17: The ADEPT Digital Library Architecture

Searching collections

• Bucket-level
  – uniform across all collections
  – example
    – search all collections for items whose Originator bucket contains the phrase “geological survey”
• Field-level
  – collection-specific
  – but discovery and invocation mechanisms are uniform
  – functionally equivalent to searching the entire bucket plus additional constraint
  – example
    – search collection foo for items whose FGDC 1.1/8.1 field within the Originator bucket contains the phrase…
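The statement that a field-level search is a bucket-level search plus one additional constraint can be shown with a single function that takes an optional source-field filter. This is a minimal sketch: the tuple layout, the function name, the substring matching rule, and the sample items are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

Mapping = Tuple[str, str, str]   # (standard, field id, value)
Item = Dict[str, List[Mapping]]  # bucket name -> mapped values


def search(items: Dict[str, Item], bucket: str, phrase: str,
           field: Optional[Tuple[str, str]] = None) -> List[str]:
    """Bucket-level search; passing `field` narrows it to one mapped source field."""
    hits = []
    for item_id, item in items.items():
        for standard, field_id, value in item.get(bucket, []):
            if field is not None and (standard, field_id) != field:
                continue  # the extra field-level constraint
            if phrase.lower() in value.lower():
                hits.append(item_id)
                break  # one matching value is enough for this item
    return hits


# Hypothetical contents of collection "foo".
foo = {
    "a": {"Originator": [("FGDC", "1.1/8.1", "U.S. Geological Survey")]},
    "b": {"Originator": [("DC", "Creator", "Geological Survey of Canada")]},
    "c": {"Originator": [("FGDC", "1.1/8.1", "NOAA")]},
}
bucket_hits = search(foo, "Originator", "geological survey")
field_hits = search(foo, "Originator", "geological survey", field=("FGDC", "1.1/8.1"))
```

The field-level result is always a subset of the bucket-level result, which is the sense in which the two searches are functionally related.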

Page 18: The ADEPT Digital Library Architecture

Bucket types (1/7)

• 6 bucket types: spatial, temporal, hierarchical, textual, qualified textual, numeric
• Type captures the portion of the bucket definition that has functional implications
  – data type & syntactic representation
  – query terms
  – query operators
• Complete bucket definition
  – name
  – semantic definition
  – bucket type

Page 19: The ADEPT Digital Library Architecture

Bucket types (2/7)

• Spatial
  – data type: any of several types of geometric regions defined in WGS84 latitude/longitude coordinates
  – syntax: defined by ADEPT
  – query terms: WGS84 box or polygon
  – operators: contains, overlaps, is-contained-in
  – example query:

        <spatial-constraint bucket="geographic-location" operator="overlaps">
          <box north="37.5" south="30.0" east="-110" west="-140"/>
        </spatial-constraint>
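The three spatial operators can be illustrated for the box case with plain coordinate comparisons. The sketch below rests on stated assumptions: the operator relates the item's footprint to the query region (so "contains" means the item contains the query box), boundaries count as touching, and boxes that cross the antimeridian are rejected. Polygons and ADEPT's actual geometry handling are out of scope, and the item boxes are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    north: float
    south: float
    east: float
    west: float

    def __post_init__(self) -> None:
        if self.south > self.north or self.west > self.east:
            raise ValueError("inverted or antimeridian-crossing boxes are not handled")


def spatial_match(item: Box, operator: str, query: Box) -> bool:
    """Evaluate `item <operator> query` for two latitude/longitude boxes."""
    if operator == "contains":
        return (item.south <= query.south and query.north <= item.north
                and item.west <= query.west and query.east <= item.east)
    if operator == "is-contained-in":
        return spatial_match(query, "contains", item)
    if operator == "overlaps":
        return (item.south <= query.north and query.south <= item.north
                and item.west <= query.east and query.west <= item.east)
    raise ValueError(f"unknown operator: {operator}")


# The query box from the slide's example.
query = Box(north=37.5, south=30.0, east=-110.0, west=-140.0)

# Hypothetical item footprints.
small = Box(north=34.5, south=34.3, east=-119.6, west=-119.9)
straddling = Box(north=40.0, south=35.0, east=-100.0, west=-115.0)
far_away = Box(north=50.0, south=45.0, east=-70.0, west=-80.0)
```

With these boxes, `small` is contained in the query region, `straddling` only overlaps it, and `far_away` matches none of the operators.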

Page 20: The ADEPT Digital Library Architecture

Bucket types (3/7)

• Temporal
  – data type: calendar date or range of calendar dates
  – syntax: ISO 8601
  – query term: range of calendar dates
  – operators: contains, overlaps, is-contained-in
  – example query:

        <temporal-constraint bucket="coverage-date" operator="contains" from="1970-01-01" to="1979-12-31"/>
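The temporal operators reduce to comparisons on closed date ranges once the ISO 8601 strings are parsed. In this sketch the direction of each operator (the item's range relative to the query range) and the inclusive end dates are assumptions, since the slides note that some semantics are deliberately fuzzy; the helper names and item ranges are hypothetical.

```python
from datetime import date
from typing import Tuple

DateRange = Tuple[date, date]  # inclusive (start, end)


def parse_range(start: str, end: str) -> DateRange:
    """Parse two ISO 8601 calendar dates (YYYY-MM-DD) into a range."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    if first > last:
        raise ValueError("range start must not be after range end")
    return first, last


def temporal_match(item: DateRange, operator: str, query: DateRange) -> bool:
    """Evaluate `item <operator> query` on inclusive date ranges."""
    (item_start, item_end), (query_start, query_end) = item, query
    if operator == "contains":
        return item_start <= query_start and query_end <= item_end
    if operator == "is-contained-in":
        return query_start <= item_start and item_end <= query_end
    if operator == "overlaps":
        return item_start <= query_end and query_start <= item_end
    raise ValueError(f"unknown operator: {operator}")


# The query range from the slide's example: the 1970s.
seventies = parse_range("1970-01-01", "1979-12-31")

# Hypothetical item coverage dates.
long_survey = parse_range("1960-01-01", "1985-06-30")
one_flight = parse_range("1975-03-01", "1975-03-01")
late_series = parse_range("1979-06-01", "1981-01-01")
```

A single calendar date is represented as a range whose start and end coincide, so one code path covers both forms of the data type.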

Page 21: The ADEPT Digital Library Architecture

Bucket types (4/7)

• Hierarchical
  – data type: term drawn from a controlled vocabulary (thesaurus, etc.)
  – one-to-one relationship between hierarchical buckets and vocabularies
  – query term: vocabulary term
  – operator: is-a
  – example query:

        <hierarchical-constraint bucket="feature-type" operator="is-a" vocabulary="ADL Feature Type Thesaurus" term="populated place"/>
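An is-a test over a controlled vocabulary amounts to walking from a term up through its broader terms. The sketch below uses a small made-up hierarchy; it is not an excerpt of the ADL Feature Type Thesaurus, and the dictionary layout and function name are assumptions.

```python
from typing import Dict, Optional

# Hypothetical vocabulary fragment: term -> broader term (None at the root).
BROADER: Dict[str, Optional[str]] = {
    "feature": None,
    "populated place": "feature",
    "city": "populated place",
    "hydrographic feature": "feature",
    "river": "hydrographic feature",
}


def is_a(term: str, target: str, broader: Dict[str, Optional[str]] = BROADER) -> bool:
    """True if `term` equals `target` or has `target` among its broader terms."""
    current: Optional[str] = term
    while current is not None:
        if current == target:
            return True
        current = broader.get(current)  # unknown terms end the walk
    return False
```

Under this reading, a query for "populated place" also retrieves items tagged with the narrower term "city", which is the practical benefit of a hierarchical bucket over plain text matching.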

Page 22: The ADEPT Digital Library Architecture

Bucket types (5/7)

• Textual
  – data type: text
  – query term: text
  – operators: contains-all-words, contains-any-words, contains-phrase
  – example query:

        <textual-constraint bucket="subject-related-text" operator="contains-all-words" text="orthophotograph"/>
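The three textual operators differ only in how the query words must appear in the bucket text. The following sketch assumes case-insensitive matching on whole words and treats a phrase as a run of consecutive words; the tokenization rule and the sample value are illustrative choices, not ADEPT's defined semantics.

```python
import re
from typing import List


def words(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def textual_match(value: str, operator: str, text: str) -> bool:
    """Evaluate a textual operator against one bucket value."""
    have, want = words(value), words(text)
    if operator == "contains-all-words":
        return all(word in have for word in want)
    if operator == "contains-any-words":
        return any(word in have for word in want)
    if operator == "contains-phrase":
        size = len(want)
        # Look for the query words as a consecutive run in the value.
        return any(have[i:i + size] == want for i in range(len(have) - size + 1))
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical bucket value.
value = "Digital orthophotograph of Santa Barbara"
```

Word order matters only for contains-phrase: "santa barbara" matches this value as a phrase, while "barbara santa" satisfies contains-all-words but not contains-phrase.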

Page 23: The ADEPT Digital Library Architecture

Bucket types (6/7)

• Qualified textual
  – data type: text with optional associated namespace
  – query term: same
  – query operator: matches
  – example query:

        <qualified-textual-constraint bucket="identifier" operator="matches" text="90-70002-34-5" namespace="ISBN"/>
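A qualified textual value pairs a string with an optional namespace, and each value is compared on its own. The sketch below assumes exact string equality and assumes that a query without a namespace matches a value in any namespace; neither rule is spelled out on the slide, and the second identifier is invented.

```python
from typing import List, Optional, Tuple

Identifier = Tuple[str, Optional[str]]  # (text, namespace or None)


def identifier_matches(values: List[Identifier], text: str,
                       namespace: Optional[str] = None) -> bool:
    """True if any single value matches the text (and the namespace, if given)."""
    for value_text, value_namespace in values:
        if value_text != text:
            continue
        if namespace is None or value_namespace == namespace:
            return True
    return False


# One item's Identifier bucket; the local call number is hypothetical.
identifiers: List[Identifier] = [("90-70002-34-5", "ISBN"), ("G3700 1990", None)]
```

Because text and namespace are checked against the same value, an ISBN query cannot accidentally match text from one identifier and the namespace of another.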

Page 24: The ADEPT Digital Library Architecture

Bucket types (7/7)

• Numeric
  – data type: real number
  – query term: real number
  – query operators: standard relational operators
  – example query:

        <numeric-constraint bucket="minimum-feature-size" operator="less-than" value="1.0" unit="meters"/>

Page 25: The ADEPT Digital Library Architecture

Bucket types vs. buckets

• Bucket types are defined architecturally
• Buckets in use are defined by collections and items
  – need standard buckets, defined conventionally, to support cross-collection uniformity
• ADL core buckets
  – simple; universal; easily & broadly populated; useful
• Bucket descriptions in the following slides:
  – bucket type
  – semantic definition
  – effective treatment of multiple values in searching
  – comparison to Dublin Core

Page 26: The ADEPT Digital Library Architecture

ADL core buckets (1/6)

• Subject-related text
  – Title
  – Assigned term
• Originator
• Geographic location
• Coverage date
• Object type
• Feature type
• Format
• Identifier

Page 27: The ADEPT Digital Library Architecture

ADL core buckets (2/6)

• Subject-related text
  – type: textual
  – description: text indicative of item’s subject
    – not necessarily from controlled vocabularies
  – superset of Title and Assigned term
  – multiple values: concatenated
  – compare: DC.Subject
• Title
  – type: textual
  – description: item’s title
  – subset of Subject-related text
  – multiple values: concatenated
  – compare: DC.Title

Page 28: The ADEPT Digital Library Architecture

ADL core buckets (3/6)

• Assigned term
  – type: textual
  – description: subject-related terms from controlled vocabularies
  – subset of Subject-related text
  – multiple values: concatenated
  – compare: qualified DC.Subject
• Originator
  – type: textual
  – description: names of entities related to item’s origin
  – multiple values: concatenated
  – compare: DC.Creator + DC.Publisher

Page 29: The ADEPT Digital Library Architecture

ADL core buckets (4/6)

• Geographic location
  – type: spatial
  – description: subset of Earth’s surface related to item
  – multiple values: union
  – compare: DC.Coverage.Spatial
• Coverage date
  – type: temporal
  – description: calendar dates related to item
  – multiple values: union
  – compare: DC.Coverage.Temporal

Page 30: The ADEPT Digital Library Architecture

ADL core buckets (5/6)

• Object type
  – type: hierarchical
  – vocabulary: ADL Object Type Thesaurus
    – (image, map, thesis, sound recording, etc.)
  – multiple values: unioned
  – compare: DC.Type
• Feature type
  – type: hierarchical
  – vocabulary: ADL Feature Type Thesaurus
    – (river, mountain, park, city, etc.)
  – multiple values: unioned
  – compare: none

Page 31: The ADEPT Digital Library Architecture

ADL core buckets (6/6)

• Format
  – type: hierarchical
  – vocabulary: ADL Object Format Thesaurus
    – (loosely based on MIME)
  – multiple values: union
  – compare: DC.Format
• Identifier
  – type: qualified textual
  – description: unique identifiers for item
  – multiple values: treated separately
  – compare: DC.Identifier

Page 32: The ADEPT Digital Library Architecture

Bucket Summary

• Strongly typed, abstract metadata category, with defined search semantics, to which source metadata is mapped.
• Supports discovery/search across distributed, heterogeneous collections that use metadata structures of their choosing.
• Supports “drill-down” searching of item-level metadata elements.

Page 33: The ADEPT Digital Library Architecture

Challenges

• Metadata is like life: refuses to follow the rules
  – unknown semantics
  – inconsistent typing/syntax
  – unknown or unidentifiable sources
  – poor/inconsistent quality
  – proliferation of overlapping vocabularies
  – ...
• Reality check: Dublin Core won
  – adapt buckets to qualified Dublin Core
  – incorporate fallback mechanism or polymorphism
    – e.g., treat fields as thesauri/controlled vocabularies or as text

Page 34: The ADEPT Digital Library Architecture

Query Translation

Page 35: The ADEPT Digital Library Architecture

ADEPT query language

• Domain
  – collection of items
  – item = (unique ID, field, …)
  – field = (name, value)
  – bucket = (name, union or concatenation of fields)
• Queries
  – atomic constraint: (attribute name, operator, target)
    – return items that have at least 1 value for the attribute, for which at least one value matches the target
  – arbitrary Boolean combinations
    – AND, OR, AND NOT
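The query model above (atomic constraints combined with AND, OR, and AND NOT) can be written down as a small expression tree with an evaluator. This is a sketch of the semantics only: the class names, the in-memory item layout, and the substring operator are assumptions, and real ADEPT queries are XML documents evaluated by collection back ends.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Item = Dict[str, List[str]]  # attribute (bucket or field) name -> values


@dataclass(frozen=True)
class Constraint:
    attribute: str
    operator: Callable[[str, str], bool]  # (value, target) -> matches?
    target: str


@dataclass(frozen=True)
class And:
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class Or:
    left: "Query"
    right: "Query"


@dataclass(frozen=True)
class AndNot:
    left: "Query"
    right: "Query"


Query = Union[Constraint, And, Or, AndNot]


def matches(item: Item, query: Query) -> bool:
    if isinstance(query, Constraint):
        # At least one value of the attribute must match the target;
        # an item with no values for the attribute never matches.
        return any(query.operator(value, query.target)
                   for value in item.get(query.attribute, []))
    if isinstance(query, And):
        return matches(item, query.left) and matches(item, query.right)
    if isinstance(query, Or):
        return matches(item, query.left) or matches(item, query.right)
    if isinstance(query, AndNot):
        return matches(item, query.left) and not matches(item, query.right)
    raise TypeError(f"not a query: {query!r}")


def contains_phrase(value: str, target: str) -> bool:
    return target.lower() in value.lower()


# Hypothetical items.
items: Dict[str, Item] = {
    "1": {"originator": ["U.S. Geological Survey"], "title": ["Santa Barbara orthophoto"]},
    "2": {"originator": ["NASA"], "title": ["Global mosaic"]},
    "3": {"originator": ["California Geological Survey"], "title": ["Fault map"]},
}
query = AndNot(Constraint("originator", contains_phrase, "geological survey"),
               Constraint("title", contains_phrase, "map"))
hits = [item_id for item_id, item in items.items() if matches(item, query)]
```

Note that negation appears only as AND NOT, so every query still starts from a positive constraint, which matches the operator list on the slide.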

Page 36: The ADEPT Digital Library Architecture

The problem

• Algorithmically translate ADEPT queries to SQL
  – accommodate all possible SQL implementations
  – configurable by mere mortals
  – generate “reasonable” SQL
  – make up for DB deficiencies
    – stupid things like order of tables & conditions matter
    – incorporate optimizer hints and directives

Page 37: The ADEPT Digital Library Architecture

Approach

• Python-based translator
  – ~1500 lines
• Extensible “paradigms” describe atomic translation techniques
  – 15 paradigms
  – Each paradigm ~100 lines (50 Python code, 20 assertions, 30 documentation)
• Intrinsic & explicit combination rules
  – unify; then JOIN; then self-JOIN; etc.
• Configuration file describes:
  – buckets, fields, paradigms, paradigm configuration
  – Boolean override rules
  – misc: external identifier table, optimizer clauses
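To make the idea of a configuration-driven translator concrete, here is a deliberately tiny sketch with a single "paradigm" that turns a contains-phrase constraint into a SQL condition, plus a recursive rule for Boolean combinations. It is not the ADEPT translator: the table and column names, the tuple-based query format, and the use of `IN` subqueries instead of the unify/JOIN/self-JOIN rules named on the slide are all assumptions, and SQLite is used only so the example runs.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical configuration: bucket name -> (table, column) holding its values.
CONFIG = {"originator": ("originators", "name"), "title": ("titles", "title")}
BOOLEAN = {"and": "AND", "or": "OR", "and-not": "AND NOT"}


def phrase_paradigm(bucket: str, target: str) -> Tuple[str, List[str]]:
    """Translate one contains-phrase constraint into a parameterized condition."""
    table, column = CONFIG[bucket]
    # Caveat: '%' or '_' inside the target would act as LIKE wildcards here.
    return f"id IN (SELECT item_id FROM {table} WHERE {column} LIKE ?)", [f"%{target}%"]


def translate(query: tuple) -> Tuple[str, List[str]]:
    """Translate a nested query tuple into (SQL condition, parameters)."""
    if query[0] == "constraint":
        _, bucket, operator, target = query
        if operator != "contains-phrase":
            raise ValueError(f"no paradigm for operator: {operator}")
        return phrase_paradigm(bucket, target)
    kind, left, right = query
    left_sql, left_params = translate(left)
    right_sql, right_params = translate(right)
    return f"({left_sql} {BOOLEAN[kind]} {right_sql})", left_params + right_params


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY);
    CREATE TABLE originators (item_id INTEGER, name TEXT);
    CREATE TABLE titles (item_id INTEGER, title TEXT);
    INSERT INTO items VALUES (1), (2), (3);
    INSERT INTO originators VALUES
        (1, 'U.S. Geological Survey'), (2, 'NASA'), (3, 'California Geological Survey');
    INSERT INTO titles VALUES
        (1, 'Santa Barbara orthophoto'), (2, 'Global mosaic'), (3, 'Fault map');
""")

query = ("and-not",
         ("constraint", "originator", "contains-phrase", "geological survey"),
         ("constraint", "title", "contains-phrase", "map"))
where, params = translate(query)
ids = [row[0] for row in db.execute(f"SELECT id FROM items WHERE {where} ORDER BY id", params)]
```

Keeping the schema knowledge in `CONFIG` rather than in the translation code is the part that mirrors the slide: a different database layout would need a different configuration, and possibly a different paradigm, but not a different query language.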

Page 38: The ADEPT Digital Library Architecture

Collection Discovery

Page 39: The ADEPT Digital Library Architecture

The problem

• Distributed queries: necessary evil
  – necessary to achieve scalability
    – performance
    – autonomy
  – introduce scalability, performance, and reliability problems
• Amelioration strategies
  – increase server performance/reliability
    – replication, DIENST connectivity regions
  – turn into offline problem
    – Web search engines, OAI harvesting model
  – identify relevant collections to query (ADEPT)
    – analogous to Web search engine

Page 40: The ADEPT Digital Library Architecture

Approach

• Build on collection-level metadata
  – spatial & temporal density histograms
  – item counts per collection categorization schemes
• Upload periodically to central server
• Use Euler histograms to support range queries
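The reason an Euler histogram helps with range queries is that it avoids counting an item once for every bin it spans. The sketch below shows a one-dimensional simplification (for example, temporal coverage over fixed bins): it keeps one counter per bin and one per boundary between adjacent bins, and a range count is the bin total minus the interior-boundary total. The two-dimensional form also keeps vertex counts and uses faces minus edges plus vertices. The class name, bin layout, and sample data are assumptions; the slides do not describe ADEPT's actual histogram structure.

```python
from typing import List


class EulerHistogram1D:
    """Counts of intervals per bin and per boundary between adjacent bins."""

    def __init__(self, bins: int):
        self.cells: List[int] = [0] * bins        # intervals touching bin i
        self.edges: List[int] = [0] * (bins - 1)  # intervals spanning bins i and i+1

    def add(self, first: int, last: int) -> None:
        """Record an item covering bins first..last (inclusive)."""
        for i in range(first, last + 1):
            self.cells[i] += 1
        for i in range(first, last):
            self.edges[i] += 1

    def count_intersecting(self, first: int, last: int) -> int:
        """Number of recorded items that touch any bin in first..last."""
        # An item covering k bins of the range also spans k-1 of its interior
        # boundaries, so it contributes exactly k - (k - 1) = 1.
        return sum(self.cells[first:last + 1]) - sum(self.edges[first:last])


# Hypothetical collection: four items over ten bins.
histogram = EulerHistogram1D(bins=10)
for first, last in [(0, 3), (2, 2), (5, 9), (7, 8)]:
    histogram.add(first, last)

exact = histogram.count_intersecting(2, 5)
naive = sum(histogram.cells[2:6])  # plain per-bin sum, which double-counts
```

Here the plain per-bin sum over bins 2 to 5 over-counts the item that spans bins 2 and 3, while the Euler correction recovers the true number of intersecting items from the summary alone, without access to the items themselves.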

Page 41: The ADEPT Digital Library Architecture

Challenges

• Relevance not necessarily Boolean
  – worldwide, petabyte, 1 cm resolution database vs. world map drawn on napkin
  – weight by resolution or minimum feature size
    – but sometimes you want the napkin
• The JOIN problem
  – statistics computed independently
• Text overviews
  – STARTS?

Page 42: The ADEPT Digital Library Architecture

That’s All, Folks