Post on 22-Dec-2015
A.Frank1
Digital Libraries (DL): Awareness and Discovery
Ariel Frank
Dept. of Computer Science
Bar-Ilan University
Joint research with Nir Yom Tov, Alon Kadury & Elina Masevich
A.Frank2
Presentation motivation
Ad hoc and unsound use of Search Engines (SEs) does not help for retrieval of quality information on the Web.
Digital Libraries (DLs), on the other hand, provide high quality information retrieval of authoritative results, especially when doing exploratory search.
However, the awareness and discovery of DLs on the Web are still lacking.
So what can be done about it?
A.Frank3
Contents
• SEs vs. DLs?!
• DL Definition/Types
• How to tilt the balance of SE/DL use?
• SELFDL Model/Architecture
• RIDDLE Model/Architecture
• Future directions
A.Frank6
Often heard sayings
• “What – is there something to search with besides search engines?”
• “Sure I know all about search engines – I always use Google.”
• “Sure I know all about directories – I always use Yahoo!”
• “Sorry, never heard about digital libraries.”
• “Listen, I’m used to classical libraries.”
• “I can find only E-books in a digital library, no?”
A.Frank8
Sample list of Digital Libraries
• LOC – Library of Congress American Memory (http://memory.loc.gov/ammem/)
• NSDL – National Science DL (http://nsdl.org)
• IPL – Internet Public Library (http://www.ipl.org)
• CDL – California DL (http://www.cdlib.org)
• ADL – Alexandria DL (http://www.alexandria.ucsb.edu)
• BL – British Library (http://www.bl.uk/)
• NZDL – New Zealand DL (http://www.nzdl.org/)
• Einstein Archives Online (http://www.alberteinstein.info/)
A.Frank9
Which kind to use? The right one
[Diagram: Search Engines on the Web. Labels: Index, Directory; Search Engine (General, Specialty); Meta-Search Engine (General, Specialty)]
A.Frank10
When not to use SEs?
• You know it all.
• You prefer asking friends (or paid experts).
• You know the Web site for it (and didn’t forget the exact URL, or have auto-completion or a bookmark, or can access it through another known site).
• You already found a specific/relevant digital library or database (maybe in Invisible Web).
• Tired of paid inclusions, SE spamming, and sponsored commercial results.
• Tired of chasing down useless URLs.
A.Frank11
When to use an Index?
• Need to search for a narrow piece of information.
• Have a specific objective/site in mind.
• Want to find/rank many related Web sites.
• Want to factor quantity in (index has crawler based results).
• Need to check/fix spelling (based on Web statistics).
A.Frank12
When to use a Directory?
• Clear about the exact topic of your query.
• Need general information on a rather broad topic/category.
• Want to amass knowledge on a fairly wide subject.
• Would like to browse (and then search) a certain area.
• Want to factor quality in (directory has human-powered results), not quantity.
• Need information that is usually carefully evaluated and even annotated.
A.Frank13
When to use a Meta-SE?
• When a single Basic-SE fails to provide good results.
• One-stop shopping – prefer to search multiple SEs/sites at once to get blended ranked results (so as to save effort/time).
• When the query is simple (complex fields/options don't usually work).
• Searching for multi-faceted topics.
• Want to get clustered results to focus search on the relevant keywords.
• Looking for current events/news.
A.Frank14
When to use a Specialty-SE?
• When a general-SE fails to provide good results.
• When your target is very topic/technology specific.
• Want to find more than just Web pages/sites.
• Need more results from the Invisible Web.
• Want your search terms to more likely have the meanings you intended them to have.
A.Frank19
So What is a Digital Library?
• There are scores of definitions.
• Most are very general and verbose.
“A managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network.”
Arms, William Y., Digital Libraries, MIT Press, Cambridge, 2000.
A.Frank20
Definition - A Digital Library is:
1. Collection of digital objects
2. Collection of knowledge structures
3. Collection of library services
4. Library Categories: Domain, Focus & Topic
5. Quality Control
6. Preservation/Persistence
A.Frank21
1. Collection of Digital Objects
• Documents (e.g., texts, HTML pages)
• Books
• Journals
• Multimedia (images, audio, video, etc…)
• Charts/Maps
Data objects available directly or indirectly
A.Frank22
2. Collection of Knowledge Structures
• Metadata: Standards, Markup
• Indices, Catalogs, Guides
• Taxonomies, Ontologies, Thesauri
• Dictionaries, Glossaries, Concordances
• Gazetteers
• Abstracts/Summaries
A.Frank23
3. Collection of Library Services
• Management (computerization, communication)
• Collections development
• Search (query formulation) and Browse interfaces
• Multi-access/use for varied users
• Online Help, Reference, Consultation
• Logging, statistics and Performance Measurement Evaluation (PME)
• SDI: Selective Dissemination of Information (Push mode)
A.Frank24
4. Library Categories: Domain, Focus & Topic
• Domain: belongs to an area (DNS TLDs).
– edu, com, org, gov, us, il, ac.il, co.il, …
• Focus: created to serve a certain community of users/patrons.
– Academic, Public, National, School, …
• Topic: the subject of the collection; can be relatively fine-grained.
– Law, Medicine, Music, Web, …
A.Frank25
5. Quality Control
• Selection criteria.
• All material is assessed and authorized (“certified”).
• Adhere to licensing and copyrights.
• Use of Digital Rights Management (DRM).
• Integrity enforced (proven quality).
• Use of filtering.
• Support for profiling/stereotyping.
A.Frank26
6. Preservation/Persistence
• Access and usage is long term
• Serves as an archive
• Scanning and digitization
• Quality reproduction of material
• Material persistency
– paper vs. digital media
– digital formats (software tools)
A.Frank28
Web Repositories Hierarchy
[Diagram: three kinds of Web repositories and their subtypes]
• Directory (Catalog, Guide, Subject Gateway)
• Search Engine (SE): Basic SE (BSE), Meta SE (MSE), Popularity SE (PSE)
• Digital Library (DL): Stand-alone DL (SDL), Harvested DL (HDL), Federated DL (FDL)
A.Frank29
Types of DLs
• Stand-alone Digital Library (SDL) – also self-contained, several collections
• Federated Digital Library (FDL) – also confederated, networked
• Harvested Digital Library (HDL) – also distributed
A.Frank30
Stand-alone Digital Library (SDL)
• The regular (classical) DL.
• Implemented locally in a fully computerized fashion, with networked access.
• Self-contained material:
– edited/generated
– scanned/digitized
– purchased
• Single or several digital collections.
A.Frank31
Federated Digital Library (FDL)
• Contains several autonomous libraries.
• Based on common focus and topic.
• Usually heterogeneous repositories.
• Connected via a network.
• Forms a flat unified library.
• Transparent user interface.
The major problem is interoperability.
A.Frank32
Harvested Digital Library (HDL)
• Virtual library providing metadata-based access to relevant items distributed over the network.
• Objects harvested into metadata (protocol was Harvest/SOIF, nowadays OAI-PMH can be used).
• Harvests digital objects, not full DLs.
• But has regular DL characteristics.
A.Frank33
SDL vs. HDL
Property | Stand-alone Digital Library (SDL) | Harvested Digital Library (HDL)
Items Origin | Purchased/Digitized | Gathered
Items Location | Local/Networked | Scattered
Material | Items + Catalog | Catalog
Repository Size | Large | Small
Update | Medium | Fast/Dynamic
Composition Method | Interoperability | Inherent
A.Frank34
Parallel Evolution of SEs and DLs
Search Engines Generations
• 1st Generation – Basic SE (BSE): includes robots, indices, directories, basic/advanced user interfaces.
• 2nd Generation – Meta SE (MSE): uses several basic SEs simultaneously (federated search), ranks gathered pages by relevancy.
• 3rd Generation – Popularity SE (PSE): uses link analysis and use frequency measures to filter and rank the Web pages.
Digital Libraries Generations
• 1st Generation – Stand-alone (SDL): local, classical, focused material, digitized or scanned.
• 2nd Generation – Federated (FDL): comprised of autonomous SDLs representing related, possibly heterogeneous, network repositories.
• 3rd Generation – Harvested (HDL): contains only summaries and metadata structures; domain focused, of fine granularity.
A.Frank36
Why are SEs overused?
• I always use Google/Yahoo!
• It’s just a quick search!
• The truth? – not sure what I’m looking for.
• I’m too used to using SEs.
• SEs are more general, no?
• SEs always give me enough answers.
• SEs don’t care what my topic/domain is!
A.Frank37
SE vs. DL - Server Side
Aspect | Search Engine | (Harvested) DL
Effort | Complex Undertaking | Medium
Emphasis | Quantitative | Qualitative
Content | Global/Shallow | Focused/Annotated
Repository | Huge | Small
Maintenance | Continuously Updated (Robots) | Dynamically Updated
A.Frank38
SE vs. DL - Client Side
Aspect | Search Engine | (Harvested) DL
Interest | Sudden | Lasting
Query | Ad-hoc | Sounder
Use | Short Term | Medium Term
Information Returned | A lot | Modest
Quality | Noisy | Clean
Sift/Filter | Manual | Not much needed
Distribution | Pull Mode | Both Pull/Push Mode
A.Frank40
Qualitative IR from Digital Library!?
Fact: Quantity orientation in SE. Fact: Quality orientation in DL.
? Assumption: Accessible DLs in sought after domain.
? Assumption: Usable information retrieval interfaces for DLs.
Result: High quality information retrieval from digital libraries!
A.Frank41
Why are DLs underused (social)?
• Too used to classical libraries (fond memories).
• No public awareness (an unknown entity).
• No public relations (unlike for Portals/SEs).
• No money in it (marketing, banners, services).
• If it’s a library, you have to pay to use it, no?
• Are DLs up-to-date at all (as much as SEs)?
• No DLs in my language (localization).
A.Frank42
Why are DLs underused (general)?
• Portals don’t offer DLs (services).
• Aren’t DLs part of the Invisible/Deep Web?
• DLs are just for experts!
• Many interests – will need to know many DLs.
• How to find them at all (need a jump-start)?
• How to find relevant ones (sounds like search).
• How to find the right one (too many around).
• Lack of domain coverage (no DL in my area).
A.Frank43
Why are DLs underused (technical)?
• SEs crawl/index DLs, no?
• Aren’t directories enough?
• Aren’t SSEs (Specialized SEs) enough?
• Too focused/limited (too fine granularity).
• Need know-how to use DLs (unlike for SEs).
• Non-usable interfaces (not user-friendly).
• Mostly textual, not multimedia (like SEs are).
A.Frank44
DL Awareness & Discovery Problems
• Lack of use and familiarity with DLs.
• Hard to locate and identify DLs scattered around the Web.
• Not enough metadata kept for and on the DLs.
• DLs’ topics, focus and user interfaces are not always clear and usable.
A.Frank46
Sample (Digital) Library Directories
• Berkeley LibWeb (Library Servers via Web) – http://sunsite.berkeley.edu/Libweb/
• Academic Info: Digital Libraries – http://www.academicinfo.net/digital.html
• Google Directory: Digital Libraries – http://directory.google.com/Top/Reference/Libraries/Digital/
• Librarians’ Index to the Internet – http://lii.org/
A.Frank47
Use General SEs and DL Directories?
• Why can’t we just use large general SEs?
– noisy results, metadata not sufficient, too many (re)tries to get relevant results.
• Why can’t we just use existing DL Directories?
– messy categorization, non-friendly UI, not all libraries are DLs, not really DL Directories.
A.Frank48
Some possible directions/solutions
• Get SEs to better index, reference, and advertise DLs.
• Provide specialized SEs for locating DLs.
• Construct and enhance DL directories.
• DL coverage of more topics/domains.
• Employ SE-like interfaces in DLs:
– user-friendly interface (Google-like)
– easy-to-use site (usability like in SE)
A.Frank52
SELFDL Goals
Search Engine Locator For Digital Libraries
• Discover/identify/classify/generate DL resources/sites in the (in)visible Web.
• Supply search tools for users to find relevant DLs for their needs.
• Provide better, usable (thin) interfaces for locating DLs.
• Raise awareness, knowledge, discovery and use of DLs.
A.Frank56
SELFDL techniques
• Harness SE technologies to locate DLs on the Web using:
– Extractors: extract DLs from DL directories.
– Crawlers: focused crawl in search of DLs.
– Scripts: interface with Google/Yahoo APIs.
• Use site analysis (search for DL terms).
• Support Extended DC (Dublin Core) metadata for each DL.
• Provide SELFDL database indexing.
A.Frank57
DLs Identification test
• Manual collection of a list of 65 terms that could indicate that a Web site is a DL.
• Check whether there is a statistically significant association between each of the terms and the fact that a Web site is a DL.
• The initial statistical test included 100 manually identified DLs and 100 random Web sites.
• The statistical measure used (in SPSS) was cross-tabulation, tested with Chi-square, the phi coefficient and Cramér’s V.
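For a single term, the cross-tabulation described above reduces to a 2x2 table (DL or random site, term present or absent). The following is a minimal sketch of the three statistics for such a table, not the SPSS procedure the authors ran; the function name and the example counts are hypothetical.

```python
import math

def association_2x2(a: int, b: int, c: int, d: int) -> dict:
    """Pearson chi-square, phi and Cramér's V for a 2x2 contingency table.

    Assumed layout:       term present | term absent
        DL sites               a       |      b
        random sites           c       |      d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    phi = (a * d - b * c) / math.sqrt(denom)
    chi2 = n * phi ** 2  # chi-square with 1 degree of freedom, no continuity correction
    # For a 2x2 table, Cramér's V equals the absolute value of phi.
    return {"chi2": chi2, "phi": phi, "cramers_v": abs(phi)}

# Hypothetical counts: the term appears on 60 of 100 DLs and 10 of 100 random sites.
stats = association_2x2(a=60, b=40, c=10, d=90)
significant = stats["chi2"] > 3.841  # 5% critical value for 1 degree of freedom
```

With 100 sites per group, as in the test above, each term gets its own table and its own decision.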
A.Frank58
Results of DLs Identification test
• Terms that have been found to be statistically significant:
1. documents, book(s), journal(s), electronic/internet/web resource(s)
2. catalog(s)/catalogue(s)
3. ask a librarian, patron(s)
4. digital library, library, digital collection(s)
5. copyright(s)
6. preservation/preserve, digitization/digitize
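One way the significant terms could feed the site analysis step is a simple term-hit count over a page's text. This is only a sketch: the regular expressions are one possible encoding of the terms listed above, and the `min_hits` threshold is an assumption, not a value reported in the study.

```python
import re

# Indicator terms from the list above; the regex forms are illustrative.
DL_TERM_PATTERNS = [
    r"\bdocuments\b", r"\bbooks?\b", r"\bjournals?\b",
    r"\b(?:electronic|internet|web) resources?\b",
    r"\bcatalog(?:ue)?s?\b",
    r"\bask a librarian\b", r"\bpatrons?\b",
    r"\bdigital library\b", r"\blibrary\b", r"\bdigital collections?\b",
    r"\bcopyrights?\b",
    r"\bpreserv(?:e|ation)\b", r"\bdigitiz(?:e|ation)\b",
]

def dl_term_hits(page_text: str) -> int:
    """Count how many distinct indicator patterns occur in the page text."""
    text = page_text.lower()
    return sum(1 for pattern in DL_TERM_PATTERNS if re.search(pattern, text))

def looks_like_dl(page_text: str, min_hits: int = 4) -> bool:
    """Flag a page as a DL candidate; min_hits is a hypothetical threshold."""
    return dl_term_hits(page_text) >= min_hits

sample = ("Welcome to the digital library. Browse our catalogue of books "
          "and journals, or ask a librarian.")
hits = dl_term_hits(sample)
```

A real classifier would weight terms by the strength of their association rather than count them equally.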
A.Frank60
SELFDL Directory classifications
[Diagram: a Digital Library is classified along three dimensions]
• Topic (DDC): Life Science: DDC 570; Earth Science: DDC 550; Biology: DDC 574
• Focus (Breeding): Children; Academic; Professional
• Domain (IANA): Countries – .IL; Commercial – .COM; Educational – .EDU
A.Frank63
Advantages of SELFDL Directory
• Contains just DLs.
• Better classification/perspective based on domain/focus/topic.
• Provides user-friendly interface, like Google Directory.
• Additional metadata (based on DC).
A.Frank65
SELFDL Index
• Results from Web focused crawling.
• Can be searched for specific DL criteria:
– keywords
– DL type (SDL, FDL, HDL)
– DL media/content (audio, E-books, E-serials, theses, movies, etc.)
– Protocol support (OAI-PMH)
A.Frank66
SELFDL Index example queries
topic:biology domain:com
algebra domain:com source:crawler
focus:children type:SDL
protocol:OAI
topic:math media:ebooks
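The example queries mix free keywords with `field:value` filters. A minimal sketch of how such a query string could be split is shown below; the field set is inferred from the examples on this slide, and `parse_query` is a hypothetical helper, not part of SELFDL.

```python
# Field names inferred from the example queries above.
FIELDS = {"topic", "domain", "focus", "type", "protocol", "media", "source"}

def parse_query(query: str) -> dict:
    """Split a query into free keywords and field:value filters."""
    keywords = []
    filters = {}
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and value and field.lower() in FIELDS:
            filters[field.lower()] = value
        else:
            keywords.append(token)  # unknown fields are treated as keywords
    return {"keywords": keywords, "filters": filters}

parsed = parse_query("algebra domain:com source:crawler")
```

Here `parsed` holds the keyword `algebra` plus two filters, which an index lookup could then apply as exact-match criteria.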
A.Frank68
Advantages of SELFDL Index
• Built according to insights/techniques of various studies in the field.
• Supports directory and crawler results.
• Provides specialized SE for DLs.
• Easy to use query interface.
• Supports advanced keywords search.
A.Frank70
SELFDL Meta Engine
• Can be searched for DL keywords like in an ordinary search engine.
• Intersects SE (i.e., Google/Yahoo API) results with SELFDL database to extract the current DLs to be returned as query response.
• Performs like a regular SE – convenient for public use.
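The intersection step can be sketched as filtering the search engine's result URLs against the set of hosts already known to the DL database. This is an illustration under stated assumptions: matching by host name, the `www.` stripping rule, and both function names are hypothetical, and no real search API is called.

```python
from urllib.parse import urlsplit

def host_of(url: str) -> str:
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def filter_to_known_dls(se_results, known_dl_hosts):
    """Keep SE results whose host is a known DL, one result per DL, in SE rank order."""
    seen = set()
    kept = []
    for url in se_results:
        host = host_of(url)
        if host in known_dl_hosts and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

known = {"nzdl.org", "nsdl.org"}  # stand-in for the DL database
results = ["http://www.nzdl.org/", "http://example.com/page",
           "http://nsdl.org", "http://www.nzdl.org/about"]
dl_results = filter_to_known_dls(results, known)
```

Preserving the search engine's order keeps its relevance ranking while dropping everything that is not a known DL.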
A.Frank71
SELFDL intersects with Google & Yahoo! results
[Diagram: overlapping sets labeled SELFDL, Google and Yahoo!; the overlap is labeled Relevant DLs]
A.Frank74
Advantages of SELFDL Meta
• Provides all the advantages of the SELFDL model (UI, metadata).
• Supports query interface for terms, like existing SEs.
• Supports intersection between SEs results and relevant DLs.
• Supports different orders of results.
A.Frank75
SELFDL prototype testing methods
• Efficiency measures were computed for Directory and Meta.
• Satisfaction surveys were given to users before and after SELFDL use.
• A check was carried out to find the best GUI for SELFDL (regular or Google-like).
A.Frank76
Efficiency testing methods
• A series of queries was evaluated for the relevancy of its results.
• The F-measure was used as the efficiency measure.
Where:
P – Precision of results
R – Relative recall of results
F – Harmonic mean of P and R with equal weights: F = 2PR/(P+R)
• The two components tested were SELFDL Meta and SELFDL Directory.
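The F-measure defined above can be computed as follows. This is a small sketch; it assumes relative recall is measured against a pool of relevant items gathered from all compared systems, which is a common convention and not something the slides spell out, and the item names are made up.

```python
def precision_recall(retrieved, relevant_pool):
    """Precision and (relative) recall of one system's results.

    relevant_pool is assumed to be the set of relevant items found by
    all compared systems together.
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & set(relevant_pool))
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_pool) if relevant_pool else 0.0
    return precision, recall

def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: F = 2PR / (P + R), defined as 0 when P + R = 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical query: 4 results returned, 2 of them among the 3 pooled relevant DLs.
p, r = precision_recall(["a", "b", "c", "d"], {"a", "b", "e"})
f = f_measure(p, r)
```

Because F is a harmonic mean, it stays low unless both precision and recall are reasonably high.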
A.Frank77
SELFDL Directory vs. DL Directories
[Bar chart: Comparison of efficiency measures (precision P, recall R and F, axis from 0 to 1.2) for SELFDL Directory, Academic Info, Google Directory and Yahoo Directory]
A.Frank78
SELFDL Meta vs. Google & Yahoo
[Bar chart: Comparison of efficiency measures (precision P, recall R and F, axis from 0 to 0.6); legend entries recovered: SELFDL Meta, Yahoo]
A.Frank79
Users’ satisfaction surveys
1. Usability of Web utilities.
2. Ease of locating DLs.
3. Ease of identifying if site is DL.
4. DL results relevance.
5. DL metadata readability.
A.Frank82
RIDDLE Goals
Resource Inquiry and Discovery in a DL Environment
• Enable creation of HDLs by harvesting (filtering) relevant SDLs using OAI-PMH.
• Enable construction of HDLs based on composition of lower-level HDLs, so as to increase the coverage of DLs’ topics.
• Enable information exchange with SELFDL.
• Raise awareness, knowledge, discovery and use of DLs.
A.Frank83
Example of topics’ composition
[Diagram: topic composition tree]
University
– Life Sciences
– Exact Sciences
  – Chemistry
  – Computer Science
    – Hardware
    – Software
– Social Sciences
A.Frank84
OAI-PMH Protocol
• OAI-PMH – Open Archives Initiative (OAI) Protocol for Metadata Harvesting.
• Tackles the lack of uniformity and interoperability between data repositories, which makes information sharing between repositories difficult.
• Addresses these problems by defining the way queries are sent to repositories and the way answers are received.
• Mandates that repositories support at least one metadata format: Dublin Core (DC).
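An OAI-PMH query is an HTTP request whose `verb` parameter names one of six operations. The sketch below only builds request URLs; the base URL is a placeholder, `oai_request_url` is a hypothetical helper, and `oai_dc` is the prefix of the mandatory Dublin Core format.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}

def oai_request_url(base_url: str, verb: str, **arguments) -> str:
    """Build the URL of an OAI-PMH request (no network access is performed)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)

BASE = "http://example.org/oai"  # placeholder repository address
identify_url = oai_request_url(BASE, "Identify")
records_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
```

A harvester would fetch these URLs and parse the XML responses, following resumption tokens for long lists.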
A.Frank85
RIDDLE Model/Architecture
[Diagram: RIDDLE layered architecture]
Layer 5 – Presentation: Web interfaces
Layer 4 – Aggregated Service Providers: Aggregated HDLs
Layer 3 – Service Providers: HDLs
Layer 2 – Data Providers: SDLs
Layer 1 – Internet: Web
Protocols shown between the layers: OAI-PMH and Enhanced OAI-PMH
A.Frank86
Use of OAI-PMH for FDLs/HDLs
• OAI-PMH was planned to support harvesting, as manifested in its name, and also in its design (i.e., selective harvesting using “Sets”).
• However, the number of FDLs that use the protocol is relatively large, while there are very few HDLs that employ it.
• Since HDLs, unlike FDLs, filter the information, and not just federate it, we investigate ways by which HDLs can filter information using the OAI-PMH protocol.
A.Frank87
Levels of information filtering
• There are 3 levels where information filtering can be done, though each level has its own problems, mostly caused by lack of uniformity between SDLs:
1. Item-level metadata – relates to the (well-known) problems with the use of DC entries.
2. Group-level metadata – the use of OAI-PMH Sets for selective harvesting is not well defined, so it cannot easily be used for relating to groups of items.
3. Library-level metadata – the description of metadata at this level is not well defined.
Creation of HDLs using OAI-PMH is not fully supported.
A.Frank88
Suggested extensions to OAI-PMH
• Lack of uniformity in SDLs using OAI-PMH prevents effective creation of HDLs.
• Provide better harvesting/filtering capabilities from SDLs by (re-)use of standards, as follows:
1. Item-level metadata – use of extended DC for metadata description, instead of just DC.
2. Group-level metadata – use of a DDC topic as a defined Set identifier.
3. Library-level metadata – use of extended DC for the library description field in the OAI-PMH Identify verb.
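The group-level suggestion can be illustrated by filtering harvested records on a DDC-based set identifier. This is a sketch only: the `ddc:NNN` form of `setSpec` is an assumed convention (OAI-PMH itself only requires a colon-separated identifier), and the sample response and its record contents are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented ListRecords response with two records in different DDC sets.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2006-01-01</datestamp>
        <setSpec>ddc:574</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Cell biology notes</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2006-01-02</datestamp>
        <setSpec>ddc:550</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Volcano atlas</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

def titles_in_ddc_class(xml_text: str, ddc_prefix: str) -> list:
    """Return dc:title of records whose setSpec starts with 'ddc:' + ddc_prefix."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iterfind(".//oai:record", NS):
        specs = [e.text or "" for e in record.iterfind("oai:header/oai:setSpec", NS)]
        if any(spec.startswith("ddc:" + ddc_prefix) for spec in specs):
            titles.append(record.findtext(".//dc:title", default="", namespaces=NS))
    return titles

life_science_titles = titles_in_ddc_class(SAMPLE_RESPONSE, "57")
```

Matching on a digit prefix lets one filter select a whole DDC division (here 570 to 579) without listing every class.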
A.Frank89
The RIDDLE Prototype
• Provides for regular creation of FDLs.
• Enables creation of HDLs by harvesting/filtering the relevant SDLs.
• Supports HDL aggregation based on the DDC hierarchy.
• The user search results return not only items matching the query but also HDLs and SDLs related to the indicated topic.
• The user can search the HDL hierarchy (by textual or directory search) for a specific HDL and further down the aggregated HDLs tree.
A.Frank92
HDL aggregation
• The HDL aggregation capability is based on:
– use of the DDC topics hierarchy.
– assigning each HDL a suitable DDC topic identifier.
– providing it with an OAI-PMH interface, similar to what data providers have, thus enabling and supporting an HDL hierarchy.
– supporting both offline and online construction and corresponding search.
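Aggregation along the DDC hierarchy can be sketched with three-digit class codes, where a class such as 574 rolls up into 570 and then 500. This is a simplified illustration rather than the prototype's implementation; the HDL names are hypothetical and only three-digit codes are handled.

```python
from typing import Dict, List, Optional

def ddc_parent(code: str) -> Optional[str]:
    """Parent of a three-digit DDC class: 574 -> 570 -> 500 -> None."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("expected a three-digit DDC class")
    if code[2] != "0":
        return code[:2] + "0"
    if code[1] != "0":
        return code[0] + "00"
    return None  # main classes (e.g. 500) have no parent

def ddc_ancestors(code: str) -> List[str]:
    """All ancestors of a class, nearest first."""
    chain = []
    parent = ddc_parent(code)
    while parent is not None:
        chain.append(parent)
        parent = ddc_parent(parent)
    return chain

def aggregate_hdls(hdls: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each DDC class to the HDLs at or below it in the hierarchy."""
    members: Dict[str, List[str]] = {}
    for name, code in hdls.items():
        for c in [code] + ddc_ancestors(code):
            members.setdefault(c, []).append(name)
    return members

hdls = {"Biology HDL": "574", "Life Science HDL": "570", "Earth Science HDL": "550"}
tree = aggregate_hdls(hdls)
```

An aggregated HDL for class 500 would then expose all three member HDLs through its own OAI-PMH interface.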
A.Frank94
RIDDLE Experimentation
• Several tests were carried out, as follows:
1. The quality of information retrieval when using a specific HDL vs. use of several FDLs.
2. Ease of discovering and using the aggregated HDLs.
3. User preferences in searching several FDLs vs. use of aggregated HDLs.
• Initial testing indicates that use of HDLs and aggregated HDLs is more efficient than use of separate FDLs.
A.Frank95
Efficiency measures for RIDDLE
[Bar chart: Precision, Recall and F-measure (axis from 0 to 0.9) for HDLs vs. FDLs]
A.Frank97
Future directions
• Better locating, identification and ranking of DLs and their categories/types.
• Conduct wider, more significant, tests using SELFDL and RIDDLE.
• Publish a beta Web version of SELFDL and RIDDLE for public use/feedback.
• Better integration between SELFDL and RIDDLE.
• Investigate awareness and discovery of DLs on the Web.
A.Frank98
References
• Sharon, T. & Frank, A., “Digital Libraries on the Internet”, IFLA'00 66th IFLA Council and General Conference, 13-18, Jerusalem, Israel, August 2000, http://www.ifla.org/IV/ifla66/papers/029-142e.htm
• Hanani, U. & Frank, A., “The Parallel Evolution of Search Engines and Digital Libraries: their Convergence to the Mega-Portal”, ICDL'00 Kyoto Intl. Conf. on Digital Libraries: Research and Practice, 269-276, Kyoto, Japan, November 2000, http://csdl.computer.org/comp/proceedings/kyotodl/2000/1022/00/10220211abs.htm
• Yom Tov, N. & Frank, A., “Harnessing Search Engine Technologies to Increase Awareness and Discovery of Digital Libraries”, 4th IEEE Intl. Conf. on IT: Research and Education (ITRE), Tel-Aviv, October 2006.
• Kadury, A. & Frank, A., “Harvesting and Aggregation of Digital Libraries in the OAI Framework”, WEBIST 2007, 3rd Intl. Conf. on Web Information Systems and Technologies, 441-446, Barcelona, Spain, March 2007.
A.Frank99
Bibliography
• Arms, W. Y., Digital Libraries, MIT Press, Cambridge, 2000.
• Hill, L., Buchel, O., Janée, G. & Lei, Z. M., “Integration of Knowledge Organization Systems into Digital Library Architectures”, Position Paper for 13th ASIS&T SIG/CR Workshop, “Reconceptualizing Classification Research”, 62-68, Philadelphia, PA, 2002.
• Pace, A. K., The Ultimate Digital Library, American Library Association, Chicago, 2003.
• Lossau, N., “Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet”, D-Lib Magazine, Vol. 10, No. 6, June 2004.
• Summann, F. & Lossau, N., “Search Engine Technology and Digital Libraries: Moving from Theory to Practice”, D-Lib Magazine Online, Vol. 10, No. 9, September 2004.
• Lippincott, J. K., “Net Generation Students and Libraries”, EDUCAUSE Review, Vol. 40, No. 2, March/April 2005.