Emerging Information Technologies: The Role of XML, DOIs, OpenURL, and Federated Search
William H. Mischo
Grainger Engineering Library Information Center
University of Illinois at Urbana-Champaign
2002 International Conference on Digital Archive Technologies (ICDAT2002)
December 19, 2002
Outline
• Digital Libraries and the Distributed Information Environment.
• Document Representation and Full-Text
• Digital Library Tools
• Illinois Projects.
• XML Technologies.
• Metadata Technologies.
• DOIs, Linking, Local Resolver
• Portals, Simultaneous Search, Linking
• Grainger Search Aid
• Issues & Trends.
The Digital Library
• ‘Digital’, ‘Virtual’, ‘Electronic’ Library as network-based library without regard to place and time.
• Tendency to apply term to collections and resources.
• Digital Collections vs. Digital Library.
• Emphasis on the integration of collections and services (e.g. NSDL grant).
• Application of standards and protocols is important.
Scholarly Communication Overview
• E-Resources are Web-based and publisher-centric.
• Growth of Heterogeneous Distributed Repositories.
• Value-added services and ‘branding’ of journals.
• Prestige of Journals and Publishers
• Reciprocal linking relationships between publishers.
• Cooperation on linking standards (DOI, CrossRef).
• Alternative publishing models - Academia, Preprint Servers, disintermediation.
Distributed Information Environment
• We live in a world of multiple, heterogeneous information repositories, resources, portals, and IR systems.
– OPACs – local, regional, national shared bibliographic databases.
– Local and remote A & I Services.
– Discrete publisher and vendor repositories (full-text).
– Web search engines, vertical portals, custom portals (NSDL, ARL Portal).
– Local metadata, digital objects, GIS, finding aids.
– Preprint servers and institutional repositories (DSpace).
– Instructional (course) management systems (WebCT, Blackboard).
– Harvestable (OAI) sites and services.
Distributed Repository - Issues
• Integration of discrete, heterogeneous information resources.
• Role of federated and broadcast searching of distributed resources.
• Integration of collections with reference, instructional and navigation services - TOC, remote reference assistance.
• Integration of Library, institutional, vendor, publisher, and government portals and information services.
• Linking technologies.
• Metadata harvesting, archiving.
Distributed Environment Action Plan
• Pressing need for document representation, retrieval, transmission, and linking middleware tools and standards.
• Metadata standards, DOIs, OpenURL.
• Factor: changing landscape of Scholarly Communication and disintermediation of publishers and libraries.
• Federated search and simultaneous search with reference linking as mechanism to integrate DL landscape.
Portal Functions:
-- Authorization
-- Linking mechanisms between resources and among resources.
-- Simultaneous search.
-- Navigation
Linking:
-- Between full-text using DOI, CrossRef, Appropriate Copy.
-- Between A&I and full-text.
-- Between OPAC and full-text.
[Diagram: Web Client; Portal Presentation Level (Local Link Server, Local Value-Added); OPAC; A & I Services (Local and Remote); Full-Text Resources; Local Databases and OAI Resources via DBMS; Web Resources & Knowledge Environments; E-Resource Registry; Aggregator (Ebsco, OCLC); Publisher Portal (Elsevier); CrossRef Metadata; DOI Server]
Document Representation
• Continuum of Web-Enabled technologies -- all presently being utilized.
• Evolving technologies and standards.
• Role and history of markup.
• XML: its role and importance.
• The Smart Document.
Digital Library Tools
• We have at our disposal the tools to create integrated digital libraries from the distributed digital resources environment in which we operate:
– Standard retrieval environment (Web) and interface/client (Web Browser);
– Standard transport mechanisms to connect heterogeneous content (HTTP, OAI, SOAP);
– Standard metalanguages and tools for describing and transforming content and metadata (XML, DTDs & Schemas, XSLT, DC/DCQ, RDF, METS);
– Standardized search/retrieval mechanisms (HTTP Post/Get, SQL, Z39.50, Object Oriented Databases);
– Standard linking tools and infrastructure (DOI, OpenURL, CrossRef).
• Candidate set of ‘best practices’ for IR.
Work by Illinois DLI Group
• We are attempting to address many of these issues within the Digital Library Initiatives group.
• Headquartered at Grainger Engineering Library Information Center at UIUC.
• Grant Work:
– Digital Library Initiative I (NSF, others), 1994-1998.
– Corporation for National Research Initiatives (CNRI) D-Lib Test Suite, 1998-2001.
– Collaborating Partners Program, 1998--.
– Andrew Mellon Foundation OAI Harvesting grant, 2001-2002.
– NSF NSDL (National Science, Engineering, Technology, and Mathematics Digital Library) Program, 2002-2004.
– Institute of Museum and Library Services (IMLS) Registry and Integration grant, 2002-2005.
Illinois Testbed Project
• Funded under DLI-I by NSF, DARPA, and NASA, 1994--1998. Awards made to 6 universities.
• Large-scale Testbed, Distributed Repository models, evaluation, Web software.
• Funded under CNRI D-Lib Test Suite Program, 1998—2001.
• Collaborating Partners Program. AIP, APS, ASCE, IEE, NRL, ASM, ACM, NTT Learning Systems, Elsevier.
• All XML Journal -- AIP, APS, ACM.
Illinois Full-Text Testbed
• American Institute of Physics--APL, JAP, RSI
– 19,000+ articles, 1995--.
• American Physical Society--PRL
– 15,000+ articles, 1995--, weekly updates.
• ASCE Journals (25 titles)
– 11,000+ articles, 1995--.
• IEE Proceedings and Electronics Letters
– 9,500+ articles, 1993--.
• IEEE Computer Society.
• ASM International (formerly the American Society for Metals) Handbook.
• ACM (Association for Computing Machinery) Transactions.
• Elsevier Science.
Accomplishments
• Process & retrieve from multiple publishers & heterogeneous DTDs.
• SGML to XML Conversion.
• Development of a metadata specification that uses RDF, Dublin Core (DCQ and XML) XML Schemas, local Namespace.
• Cross-repository searching (Testbed & D-LIB Test Suite). Full-Text and Metadata.
• XSLT, CSS, for transformation & rendering, including Mathematics.
Accomplishments (2)
• Introduction of numerous technologies now deployed within publisher repositories:
– Forward and Backward links in bibliographies -- within Testbed/Repository, from/to A & I Services.
– Use of XSLT for transforming XML to HTML.
– Rich extended abstracts.
• Conversion of ISO 12083 math markup to MathML. CSS/DHTML mathematics rendering. Use of plug-ins.
• Enhanced Web retrieval mechanisms: Author Word Wheels, Co-Occurrence Matrices.
• Local Link Server for DOIs, Context-Sensitive linking.
XML (eXtensible Markup Language)
• Like SGML, a Data Description Metalanguage.
• XML is a subset/version of SGML.
• Document representation and interchange Standard.
• Allows fine-granularity markup of content and structure. Authors can create their own elements (extensible).
• Tags define the structure of the document, not the presentation format.
• Validated vs. “well-formed” - separation of authoring process from representation & presentation.
• Either validated against a DTD/Schema or well-formed.
• Integrated with relational DBs.
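The well-formed side of the validated vs. well-formed distinction can be illustrated with a minimal Python sketch. It relies only on the standard library's xml.etree.ElementTree, which checks well-formedness but does not validate against a DTD or Schema. The sample documents and the is_well_formed helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML (no DTD/Schema check)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Author-defined elements: any names are acceptable as long as tags nest properly.
good = "<article><title>Quantum-well lasers</title><volume>38</volume></article>"
# Overlapping tags: not well-formed, so a conforming parser rejects the document.
bad = "<article><title>Quantum-well lasers</article></title>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Validation is a further step: a validating parser would additionally check the element names and nesting against a DTD or Schema.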
XML Features
• The milestones in document description and transmission: ASCII, TCP/IP, HTTP and HTML, XML. Web Programmability.
• DTD not required with XML. Needed if internal entities.
• Use of Document Object Model (DOM).
• Technology approach from Web developer’s standpoint: XML data, CSS presentation layer, XSLT to transform the structure (‘view’) of the data/document.
XML in Information Technologies
• Used in Open Archives Initiative (OAI), NSDL.
• Compatible with MS SQL Server, Tamino (Software AG), Oracle, DLXS/XPAT (University of Michigan/OpenText), others.
• Integral to Web Services (WSDL) and SOAP – Google Web Service.
• Used in Library of Congress MODS and METS metadata technologies.
• Baked into XyVision and publishing packages.
XML, XSLT, and CSS
• Use XML full-text articles as ordered hierarchy of content objects.
• Generate item-level metadata in XML, using RDF and Dublin Core syntax and semantics.
• XSLT and CSS used to present metadata and articles in either XML or HTML format depending on Browser.
• Mathematics rendering using MathML tools (conversion from ISO 12083 to MathML).
• Real-time transformation between XML and HTML using XSLT.
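Item-level metadata in RDF and Dublin Core could look roughly like the output of the sketch below. This is an illustration, not the Testbed's actual metadata specification: the build_record helper, the title, the creator names, and the date are placeholders, and only the two namespace URIs are the standard RDF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def build_record(about, title, creators, date):
    """Build an rdf:RDF element holding one Dublin Core description."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about}
    )
    ET.SubElement(desc, f"{{{DC_NS}}}title").text = title
    for name in creators:  # one dc:creator element per author
        ET.SubElement(desc, f"{{{DC_NS}}}creator").text = name
    ET.SubElement(desc, f"{{{DC_NS}}}date").text = date
    return rdf


# Placeholder values, chosen only to show the shape of the record.
record = build_record(
    about="doi:10.1063/333",
    title="Sample article title",
    creators=["Author A", "Author B"],
    date="1999-04",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```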
Schemas vs. DTDs
• Both are systems of representing a data model that defines the data’s elements and attributes, and the relationship among elements.
• Schema addresses limitations of DTDs and the increasingly data-oriented role of XML.
• W3C XML Schema Working Group: two documents: XML structures and datatypes.
Schema Justification
• Description of document type’s structure should be in an XML document instead of written in special syntax (DTD).
• Schemas are in XML: easier to edit and process using standard XML DOM manipulation tools.
• DTD notation doesn’t allow schema designers the power to impose strong data typing -- for example, the ability to say that a certain element type must always have a positive integer value, that it may not be empty, or that it must be one of a list of possible choices.
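Both points can be seen in a small sketch. The schema fragment below is hypothetical (the element names volume, title, and genre are invented), but it uses real W3C XML Schema constructs for the three constraints named above: xs:positiveInteger, xs:minLength, and xs:enumeration. Because the schema is itself XML, an ordinary XML parser can read it. Note that the Python standard library only parses the schema here; it does not validate documents against it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: a positive-integer element, a non-empty string element,
# and an element restricted to a fixed list of choices.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="volume" type="xs:positiveInteger"/>
  <xs:element name="title">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:minLength value="1"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="genre">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="article"/>
        <xs:enumeration value="proceeding"/>
        <xs:enumeration value="preprint"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
"""

# Read the schema with a generic XML parser and summarize each declaration.
declared = {}
for el in ET.fromstring(SCHEMA).findall(f"{{{XS}}}element"):
    choices = [e.get("value") for e in el.iter(f"{{{XS}}}enumeration")]
    declared[el.get("name")] = {"type": el.get("type"), "choices": choices}

for name, info in declared.items():
    print(name, info)
```

A DTD could declare the same three elements but could not express any of the three value constraints.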
Metadata and Linking Standards
• Digital Object Identifier (DOI) and Persistent Object Identifiers.
• OpenURL and Value-Added Service Components (SFX).
• Open Archives Initiative (OAI), Dublin Core and Qualifiers, RDF.
• Local Resolver Servers.
Open Archives Initiative (OAI)
• Released version 1.0 of metadata harvesting protocols. Frozen through second quarter 2001.
• Mechanism for data providers to expose their metadata through an HTTP protocol and a mechanism for harvesting records containing metadata from repositories.
• Roots in e-print archives.
• Lightweight, low-barrier. Easy to implement Web server to handle OAI protocol requests; need to develop procedures to access and extract your metadata.
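The harvesting side can be sketched in a few lines of Python. The sketch assumes the parameter names and XML namespaces of the later OAI-PMH 2.0 specification rather than the 1.0 release mentioned above; the repository URL and the sample response are invented, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request; 'from' limits the harvest to newer records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml):
    """Pull the Dublin Core titles out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    titles = []
    for record in root.iterfind(".//oai:record", NS):
        for title in record.iterfind(".//dc:title", NS):
            titles.append(title.text)
    return titles


# Hand-written, heavily trimmed response used in place of a live repository.
SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("http://repository.example.org/oai", from_date="2002-01-01")
print(url)
print(extract_titles(SAMPLE))
```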
Ongoing Investigations
• Relationship between interoperability models for search and discovery: federated searching (OAI harvested) and broadcast, simultaneous searching of distributed repositories. Not mutually exclusive.
• OAI Provider and Harvesting software. Encoded Archival Description (EAD). OAI Engineering/CS/Physics site.
• Role of HTTP harvesting, Spider technology.
• Reference Linking integration built on OpenURL and DOI.
• Reference Assistant software with simultaneous search, point-of-contact assistance, and remote reference capability.
Portals and Gateways
• Role is to bring together and integrate disparate e-resources.
• Provide a systematic ‘view’ of the information landscape, particularly full-text.
• Two primary foci: robust search/navigation and the ability to link everywhere from anywhere in the environment of OPACs, A & I Services, full-text.
• Central to this implementation is federated and simultaneous search and reference linking technologies.
Digital Object Identifier (DOI)
• DOI is both a unique identifier of a piece of digital content AND a system to access that content digitally. Persistent object identifier.
• ‘The ISBN for the 21st Century’ -- Norman Paskin.
• DOI system has two main parts (the identifier and a directory system) and a third logical component, a database.
• Developed by AAP (Association of American Publishers), now managed by International DOI Foundation.
DOI Construction
• First real open standard for content identification.
• DOI is a number that identifies a digital object:
– 10.1063/S000369519903216
• 10 Registration Agency Prefix
• 1063 Publisher Prefix
• S000369519903216 Suffix (Publisher-assigned ID)
• Suffix can be SICI or PII.
• The DOI, together with the URL pointing to the digital object, is registered with the International DOI Foundation, e.g.:
– 10.1063/333 | http://www.pubsite.org/apr99/artl1.pdf
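The three-part construction can be made concrete with a short parsing sketch. The DOI class and parse_doi helper are illustrative names, not part of any DOI toolkit; the field names follow the terms the DOI Handbook appears to use for the two halves of the prefix (directory indicator and registrant code), which correspond to the "10" and "1063" parts above.

```python
from typing import NamedTuple


class DOI(NamedTuple):
    directory: str   # the "10" part of the prefix
    registrant: str  # the publisher part of the prefix, e.g. "1063"
    suffix: str      # publisher-assigned ID; may itself contain "/"


def parse_doi(doi: str) -> DOI:
    """Split a DOI at the first "/" and the prefix at the first "."."""
    prefix, slash, suffix = doi.partition("/")
    directory, dot, registrant = prefix.partition(".")
    if not (slash and dot and directory and registrant and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI(directory, registrant, suffix)


parts = parse_doi("10.1063/S000369519903216")
print(parts)
```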
Using a DOI
• DOIs are resolved using the Handle System technology from CNRI (Corporation for National Research Initiatives).
• Retrieval of the object is a two-step process: the link is sent to a central directory where the current Web address is stored, and the location is sent back to the browser with a special message to redirect to that address, e.g.:
– dx.doi.org/10.1063/333 redirects to www.pubsite.org/apr99/artl1.pdf
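The two-step retrieval can be simulated with a toy in-memory directory. This is not the Handle System protocol or API: the DIRECTORY dictionary and resolve function are stand-ins, and the 302 and 404 status codes are assumptions about how an HTTP proxy would signal the redirect.

```python
# Toy directory: DOI -> current Web address (the example pair from the slide).
DIRECTORY = {"10.1063/333": "http://www.pubsite.org/apr99/artl1.pdf"}


def resolve(proxy_url: str):
    """Step 1: look the DOI up in the directory. Step 2: answer with a redirect."""
    _host, _, doi = proxy_url.partition("/")
    location = DIRECTORY.get(doi)
    if location is None:
        return 404, None      # unknown DOI
    return 302, location      # browser is told to go to the current address


status, location = resolve("dx.doi.org/10.1063/333")
print(status, location)
```

Persistence follows from the indirection: if the article moves, only the directory entry changes, and every published link through the proxy keeps working.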
Reference Linking
• CrossRef Publisher system: major Sci-Tech professional societies and commercial publishers.
• System design calls for one URL for each DOI; underlying technology can handle multiple URLs however.
• Issue: Directing users to locally held or licensed version of Digital Object (locally loaded or from Aggregator). Appropriate Copy problem.
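A local link server's answer to the Appropriate Copy problem might be sketched as follows. Everything here is hypothetical: the LOCAL_HOLDINGS table, the local URL, and the choose_copy function are invented for illustration, and how the server recognizes a local user (for example, through a cookie) is left as a boolean input.

```python
# Hypothetical table: DOI prefixes whose full text is loaded or licensed locally.
LOCAL_HOLDINGS = {"10.1063": "http://local.example.edu/fulltext/"}


def choose_copy(doi: str, publisher_url: str, local_user: bool) -> str:
    """Prefer the locally held copy for local users; otherwise use the publisher's."""
    prefix = doi.split("/", 1)[0]
    local_base = LOCAL_HOLDINGS.get(prefix)
    if local_user and local_base:
        return local_base + doi
    return publisher_url


publisher = "http://www.pubsite.org/apr99/artl1.pdf"
print(choose_copy("10.1063/333", publisher, local_user=True))
print(choose_copy("10.1063/333", publisher, local_user=False))
```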
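[Diagram: DOI resolution through the Illinois Local Link Server. Components: Client (Web Browser) with cookie on client; DOI Proxy (dx.doi.org/10.1063/1234); Handle Server; Illinois Local Link Server (OpenURL aware); local AIP and IEE full-text; CrossRef Metadata Database; publisher sites (AIP, IEE, Elsevier); DOI metadata; Local Value-Added; UIUC Metadata Registry; OpenURL; Nosfx=y]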
Simultaneous Search Implementations
• DialIndex from Dialog.
• Ex Libris MetaLib service.
• Endeavor EnCompass.
• Innovative Interfaces MetaFind.
• Ovid Multiple Search and reference De-Duping.
• ISI Web of Knowledge.
• Gale Corporation InfoTrac Total Access.
• WebFeat.
• California Digital Library SearchLight system.
• Los Alamos FlashPoint system.
• Fretwell-Downing partnering with ARL Portal and Monash University.
Grainger Search Aid
• Assist users in the selection of appropriate databases.
• Normalize user search arguments and display search results from candidate databases.
• Cross-database asynchronous concurrent searching.
• Article level and e-journal Web site access to publisher full-text repositories.
• Utilize OpenURL, CrossRef metadata database and DOI for reference linking at the article level.
• Proxying of vendor systems and capability of ‘taking over’ the search in vendor native mode.
Grainger Search Aid
Reference Assistant Project
• Utilize Search Aid simultaneous search and link capabilities.
• Opportunity to explore interface and navigation issues.
• Mimics the behavior of reference librarian.
• Allows the application of ‘best match’ and ‘quorum searching’ algorithms.
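One common reading of quorum searching is: accept a document if it matches at least k of the n query terms, and rank by the number of terms matched. The sketch below follows that reading; it is not the Reference Assistant's actual algorithm, which the slides do not describe, and the sample documents are invented.

```python
def quorum_search(query, documents, quorum):
    """Return ids of documents matching at least `quorum` query terms, best first."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in documents.items():
        matched = terms & set(text.lower().split())
        if len(matched) >= quorum:
            hits.append((len(matched), doc_id))
    # Best match first: most matched terms, ties broken by document id.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]


docs = {
    "a": "temperature analysis of quantum well lasers",
    "b": "quantum computing survey",
    "c": "laser temperature control",
}
print(quorum_search("temperature quantum lasers", docs, quorum=2))
print(quorum_search("temperature quantum lasers", docs, quorum=1))
```

Lowering the quorum trades precision for recall, which is roughly what a reference librarian does when a strict search comes back empty.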
Reference Assistant Top Menu
Simultaneous Search Implementations
• Shared Blackboard approach employing Independent Searchbots dedicated to searching information resources and passing results to Web clients.
• Event-Driven, Asynchronous HTTP Queries from within a Single Script returning results to Web browser.
Event-Driven, Asynchronous Queries
• Single, event-driven web server process, asynchronously querying multiple resources.
• Uses WinHTTP from ASP and VBScript.
• Simpler, not as flexible. Search algorithms and processing coded in scripts.
• This is the approach we currently use for our service.
• Implementation of multi-step login and session variable pass-through being investigated.
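The single-process, event-driven pattern can be sketched with Python's asyncio as an analogue of the WinHTTP/ASP approach; it is not the service's code. The source names, the delays, and the search_source stub are invented, and asyncio.sleep stands in for the HTTP round trip to each resource.

```python
import asyncio


async def search_source(name, delay, query):
    """Stub for one remote query; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return name, [f"{name} hit for {query!r}"]


async def simultaneous_search(query, sources, timeout=1.0):
    """Query all sources at once and collect results in completion order."""
    tasks = [asyncio.create_task(search_source(n, d, query)) for n, d in sources]
    results = []
    for finished in asyncio.as_completed(tasks, timeout=timeout):
        try:
            results.append(await finished)
        except asyncio.TimeoutError:
            break  # give up on sources slower than the timeout
    for task in tasks:
        task.cancel()  # no effect on tasks that have already finished
    return results


# Hypothetical sources with simulated response times in seconds.
SOURCES = [("OPAC", 0.2), ("A&I", 0.0), ("FullText", 0.1)]

results = asyncio.run(simultaneous_search("quantum-well lasers", SOURCES))
for name, hits in results:
    print(name, hits)
```

Results arrive in the order the sources respond, so fast resources can be displayed while slow ones are still pending.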
OpenURL-Based Services
• Standard for expressing and transmitting metadata.
• Promise of standardized, normalized search results.
• Provides value-added links to the Ovid search results.
• Using CrossRef metadata database to look up DOIs.
CiteParse.dll
• An ActiveX DLL which can parse various Ovid citations and turn them into OpenURLs:
• Tansu N. Chang YL. Takeuchi T. Bour DP. Corzine SW. Tan MRT. Mawst LJ. Temperature analysis … quantum-well lasers. [Article] IEEE Journal of Quantum Electronics. 38(6):640-651, 2002 Jun.
• http://…/resolver.asp?genre=article&aulast=Tansu&auinit1=N&atitle=Temperature+analysis+…+quantum-well+lasers&title=IEEE+Journal+of+Quantum+Electronics&volume=38&issue=6&spage=640&epage=651&pages=640-651&date=2002-06
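The citation-to-OpenURL step can be approximated in Python. This is a sketch of the idea, not CiteParse.dll itself: the regular expression handles only the one citation layout shown above, the resolver address is a hypothetical placeholder (the original host is elided), and urlencode percent-encodes the "…" character, so the atitle value differs cosmetically from the slide.

```python
import re
from urllib.parse import urlencode

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# One author looks like "Tansu N." or "Tan MRT." (surname, then initials).
AUTHOR = r"[A-Z][A-Za-z'\-]+ [A-Z]{1,4}\."
CITATION = re.compile(
    rf"^(?P<authors>(?:{AUTHOR} )+)"
    r"(?P<atitle>.+?)\. \[(?P<doctype>[^\]]+)\] "
    r"(?P<title>.+?)\. "
    r"(?P<volume>\d+)\((?P<issue>[^)]+)\):(?P<spage>\d+)-(?P<epage>\d+), "
    r"(?P<year>\d{4}) (?P<month>[A-Za-z]{3})\.$"
)


def citation_to_openurl(citation, resolver):
    """Parse one Ovid-style citation and return an OpenURL for the resolver."""
    m = CITATION.match(citation.strip())
    if m is None:
        raise ValueError("unrecognized citation format")
    # Only the first author is carried into the OpenURL.
    aulast, initials = m["authors"].split(". ")[0].rsplit(" ", 1)
    fields = {
        "genre": m["doctype"].lower(),
        "aulast": aulast,
        "auinit1": initials[0],
        "atitle": m["atitle"],
        "title": m["title"],
        "volume": m["volume"],
        "issue": m["issue"],
        "spage": m["spage"],
        "epage": m["epage"],
        "pages": f'{m["spage"]}-{m["epage"]}',
        "date": f'{m["year"]}-{MONTHS[m["month"]]:02d}',
    }
    return f"{resolver}?{urlencode(fields)}"


CITE = ("Tansu N. Chang YL. Takeuchi T. Bour DP. Corzine SW. Tan MRT. Mawst LJ. "
        "Temperature analysis … quantum-well lasers. [Article] "
        "IEEE Journal of Quantum Electronics. 38(6):640-651, 2002 Jun.")
RESOLVER = "http://resolver.example.org/resolver.asp"  # hypothetical address

openurl = citation_to_openurl(CITE, RESOLVER)
print(openurl)
```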
Conclusions
• User reactions very positive.
• The one-stop-shopping approach has been successful.
• Users consider ability to link to full-text from citations in A & I Services and from references on publisher portals very helpful.
• Technically, best approach appears to be a hybrid of asynchronous client interface with Web Services querying databases. Moves database middleware to Web Services and eliminates extensive custom script code for search and database query.
Publishing Trends
• Publishers will continue to add value to online journal articles.
• Digital version will become version of record.
• Virtual journals (both publisher-based and cross-publisher) will become common.
• Next-generation knowledge environments will evolve. Multimedia, data exposed, live equations with in-place calculations.
Publishing Trends (Continued)
• Personalized services will be available -- agent technology, alerting services.
• Different economic and subscription models will be introduced.
• Deconstruction of Journal (Bob Kelly, APS); article at a time publishing.
• Journal branding or perhaps publisher branding.
• Academia issues: publishing, tenure.
Continuing Issues
• Role of Authors, Academic Institutions, Libraries, Publishers, Abstracting & Indexing Services.
• Disintermediation may affect both Libraries and Publishers.
• Information as Function not Place.
• Provide a ‘Digital Library’ out of digital collections.
• Role of XML technology.
• Service mechanisms: processing & archiving, search and discovery, presentation, linking.