The ‘XML’ Project: Integrated Access to Scientific Resources

LITA Forum 2003

Miriam Blake (Presenter), Beth Goldsmith, Mariella Di Giacomo

Los Alamos National Laboratory Research Library


Post on 11-Jan-2016


TRANSCRIPT

Page 1: The ‘XML’ Project: Integrated Access to Scientific Resources Miriam Blake – Presenter

LITA Forum 2003

The ‘XML’ Project:

Integrated Access to Scientific Resources

Miriam Blake – Presenter

Beth Goldsmith

Mariella Di Giacomo

Los Alamos National Laboratory

Research Library

Page 2

Rationale for Project

• 60+ million citations – multiple access points
• Duplicate records / citations
• No links between bibliographies and the records we store
• Need ‘smart objects’ with pointers (to full text, etc.)
• Wanted an updated interface with new features

Page 3

Existing Databases at LANL

Citation (A&I) databases:

• ISI (all ISI databases have associated citation records)
    • SciSearch, 1945-present: ~30 M + 4 k weekly
    • Social SciSearch, 1973-present: ~15 M + 1 k
    • Arts & Humanities, 1975-present: ~5 M + 0.5 k
    • ISI Proceedings, 1990-present: ~3 M + 0.5 k
• INSPEC: ~8 M+
• BIOSIS: ~15 M
• Engineering Index (Compendex)
• Other (DOE, LAUP/tech reports, GeoRef, OPAC, etc.)

Page 4

Project Team

• 6 developers (librarians and programmers)
    • Miriam Blake, Doug Chafe, Mariella Di Giacomo, Frances Knudson, Beth Goldsmith, Mark Martinez, Ming Yu, Jeff Scott (hardware)
• Research Library staff
    • Librarians / metadata experts
• Interface team – 2 staff doing JSP, HTML, and graphics for this project part time

Page 5

Project Workflow

[Workflow diagram: Vendor A, Vendor B, Vendor C (multiple vendor record formats) -> Conversion -> Verity XML (single record format) -> Indexing -> Verity indexes, MySQL indexes -> Application (display, search & browse)]

Page 6

Hardware

• Fault-tolerant architecture to provide reliability, flexibility, and speed
• Sun Solaris 2.8 platform
• Security environment
    • Data stored and accessed inside a firewall
    • Data accessed and application runs outside the firewall
• Required a data sharing file system (for Solaris)
    • LSC file system called QFS
    • Multiple readers, one writer per filesystem

Page 7

[Architecture diagram. Components shown: load balancer; user authentication db (MySQL, Linux); application nodes, each paired with a Verity broker; Verity servers; MySQL slave servers; SAN (Storage Area Network) holding the Verity collections, XML records, and AuthorBrowse db; firewall; development environment with its own application and Verity broker/servers.]

Page 8

Software components

• Verity search engine
• MySQL to handle author browse, user functions
• Interface
    • XSLT to transform XML for query result displays
    • Java servlets
    • JSP
    • Apache / Tomcat to handle Java/JSP presentation

Page 9

Verity Search Engine

• Commercial product – used by many large companies
• Used in our older apps – users familiar with search capabilities
• Strength in full-text searching
• Required Solaris (now runs on Linux)
• Verity K2 – parallel multi-tiered architecture
    • Brokered approach
    • Searches are distributed to multiple servers to concurrently search multiple Verity collections
• LANL collections broken by year
    • Recs colls – bibliographic metadata
    • Cites colls – citations within articles

Page 10

ISI Conversion

[Diagram: an ISI vendor record (bib record + citations) is converted into two XML records. The XML record with bib data (“recs”) goes into a Verity recs collection for searching; the XML record with citation data items (“cites”) goes into a Verity cites collection for searching.]
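A rough sketch of the split shown in the diagram, assuming a vendor record has already been parsed into a dictionary of bibliographic fields and a list of citations. The `record` root element, the `title` tag, and the function name are hypothetical; `fullKey`, `refItem`, `citAu`, `citSo`, and `citYear` are the tag names that appear elsewhere in these slides.

```python
import xml.etree.ElementTree as ET

def split_record(full_key, bib_fields, citations):
    """Return (recs, cites): two XML records that share one fullKey."""
    recs = ET.Element("record")          # hypothetical root element name
    ET.SubElement(recs, "fullKey").text = full_key
    for tag, text in bib_fields.items():
        ET.SubElement(recs, tag).text = text

    cites = ET.Element("record")
    ET.SubElement(cites, "fullKey").text = full_key
    for citation in citations:
        ref = ET.SubElement(cites, "refItem", {"type": "ref"})
        for tag, text in citation.items():
            ET.SubElement(ref, tag, {"src": "cit"}).text = text
    return recs, cites

recs, cites = split_record(
    "/recs/sici00/0018-8190/46/2/173_SCIANCE-LSTROST",
    {"title": "Example title"},           # illustrative data only
    [
        {"citAu": "VU, NV", "citSo": "ACAD MED", "citYear": "1992"},
        {"citAu": "LING, TW", "citYear": "1996"},
    ],
)
```

Each of the two records would then be indexed into its own collection.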

Page 11

Record Structure

• Record keys <fullKey> – structure:
    • Combination of ISSN, author name, volume, issue, start page, and title letters
    • /recs/sici00/0018-8190/46/2/173_SCIANCE-LSTROST
    • Not all elements are always present
• ISI records are split into 2 XML records with the same fullKey – one for bibliographic data and one for citations (bibliography)
• Bibliographic and cited records are indexed into separate collections for searching
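A minimal sketch of assembling a key with the same shape as the example above. The slides do not explain how the `sici00` segment or the title letters are derived, so both are passed in as ready-made values; the function name and the choice to leave a missing element as an empty segment are assumptions.

```python
def build_full_key(prefix, issn, volume, issue, start_page, author, title_letters):
    """Assemble a fullKey shaped like the slide's example.

    Assumption: a missing element becomes an empty segment. The slides only
    say that not all elements are always present.
    """
    parts = [issn or "", volume or "", issue or "", start_page or ""]
    path = "/".join(str(p) for p in parts)
    return f"/recs/{prefix}/{path}_{author or ''}-{title_letters or ''}"

key = build_full_key("sici00", "0018-8190", 46, 2, 173, "SCIANCE", "LSTROST")
```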

Page 12

Conversion to XML - recs

• Verity XML
    • Specific fields needed to handle vendor indexing requirements
    • One XML record containing matching articles from multiple vendors
    • Consistent XML tags across databases as much as possible
• [Link: example Verity XML bib record]

Page 13

Kludges for Verity

Sort fields

<sorttitle>SPIRITUALITY IN MEDICINE A COMPARISON OF MEDICAL STUDENTS ATTITUDES AND CLINICAL PERFORMANCE </sorttitle>
<sortauthor>MUSICK DW CHEEVER TR QUINLIVAN S NORA LM </sortauthor>
<sortsource>ACADEMIC PSYCHIATRY 2003000027000002000000000067</sortsource>
<sortdate>20030000000183296100001</sortdate>

Display fields

<resauthor>(Art)Musick, DW; Cheever, TR; Quinlivan, S; Nora, LM</resauthor>
<ressource>(Art)Source: ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73 </ressource>
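The sort fields look like normalized text plus zero-padded numbers, so that a plain string sort gives the right order. Here is a sketch of how such values might be produced. The field widths (4, 6, 6, and 12 digits for year, volume, issue, and start page) are inferred from the single `<sortsource>` example above, and the function names, the punctuation rule, and the mixed-case input title are assumptions.

```python
import re

def sort_title(title):
    """Uppercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", title)
    return re.sub(r"\s+", " ", cleaned).strip().upper()

def sort_source(journal, year, volume, issue, start_page):
    """Journal name plus zero-padded year/volume/issue/page (widths inferred)."""
    return f"{journal.upper()} {year:04d}{volume:06d}{issue:06d}{start_page:012d}"

title_key = sort_title(
    "Spirituality in medicine: A comparison of medical students' "
    "attitudes and clinical performance"
)
source_key = sort_source("Academic Psychiatry", 2003, 27, 2, 67)
```

Zero padding matters because string comparison would otherwise place volume 100 before volume 27.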

Page 14

Kludges for Verity

Zones

<znumber>
  <issn db="Soc">1042-9670</issn>
  <controlNum db="Soc">000183296100001</controlNum>
</znumber>

Data enhanced tags

<zjournal>
  <journalAbbrJ2 db="Soc">ACAD PSYCHIATR</journalAbbrJ2>
  <journalAbbrJ9 db="Soc">ACAD PSYCHIATRY</journalAbbrJ9>
  <journalAbbr db="Soc">Acad. Psych.</journalAbbr>
  <journalAbbrJ1 db="Soc">ACAD PSYCHI</journalAbbrJ1>
  <journal db="Soc">ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73</journal>
</zjournal>

Page 15

Unified record display

• Preference order for fields to display when multiple databases are present in the same record
    • Some fields should be deduped (e.g. title)
    • Some fields should display all data from all databases (e.g. subject, keywords)
• Becomes critical when multiple vendor records are displayed together
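The display rules above can be sketched as a small merge function. The preference order, the field policies, and the sample data are hypothetical, since the slides do not list them; the sketch only shows the two behaviors named on the slide: one value for deduped fields, and a union for fields that show everything.

```python
# Hypothetical preference order and field policies; the slides do not list them.
PREFERENCE = ["SciSearch", "INSPEC", "BIOSIS"]
SINGLE_VALUE = {"title"}               # shown once, from the preferred database
MULTI_VALUE = {"subject", "keywords"}  # shown in full, from every database

def unify(record):
    """record maps database name -> {field: value, or list of values}."""
    def rank(db):
        return PREFERENCE.index(db) if db in PREFERENCE else len(PREFERENCE)

    display = {}
    for db in sorted(record, key=rank):
        for field, value in record[db].items():
            if field in SINGLE_VALUE:
                display.setdefault(field, value)   # first (preferred) db wins
            elif field in MULTI_VALUE:
                merged = display.setdefault(field, [])
                for item in value:
                    if item not in merged:
                        merged.append(item)
    return display

merged = unify({
    "INSPEC": {"title": "Spin waves in thin films.",
               "subject": ["magnetism", "thin films"]},
    "SciSearch": {"title": "SPIN WAVES IN THIN FILMS",
                  "subject": ["thin films", "spin waves"]},
})
```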

Page 16

Unified record display

Page 17

ISI Conversion - Cites

• Cites – citation data (bibliographies) in each bibliographic record
    • Searchable separately from the articles which cite them
    • 500+ million individual citations (~170 M are unique)
    • Can be searched by cited author, source, year, volume, or a combination thereof
• One cites XML record can have multiple citations
    • <refItem> – one for each citation
    • After conversion to XML, fullKeys are created for each <refItem> where possible

Page 18

ISI Conversion - cites

[Diagram: a record (Title: xxx) with its bibliography (Citation 1, Citation 2, Citation 3, Citation x…). One citation is expanded as a <refItem>:]

• 3 authors: Ling, TW (1st author); Goh, CH; Lee, ML
• Source (title, year, vol, issue, start page): Information and software technology, 1996, v.38, # 9, p.601
• FullKey for this item: /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD

• 26 M records with bibliographies
• 500 M individual citations <refItem>

Page 19

Cites Fuzzy Matching

• Every <refItem> is processed to try to link it to the recs article it matches using fullKey
• Use “fuzzy matching” rules developed internally
    • Internal db of ISSNs matches brief source data (PHYS REV B or P REV B)
    • ISSN + cited author name, cited volume, cited page creates fullKey that can match to the key of a bib record, creating a link
• ~60% of bib records match a cite
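The internal matching rules are not given in the slides, so the following is only a sketch of the general idea: look up an ISSN from an abbreviated source title, build a wildcard key from the cited author, volume, and page, and compare it against bib record keys. The lookup table, the function names, and the `prefix` argument are assumptions; the sample values come from the cited reference example in this deck.

```python
from fnmatch import fnmatchcase

# Hypothetical ISSN lookup table; entries taken from this deck's own example.
ISSN_BY_SOURCE = {"ACAD MED": "1040-2446", "ACADEMIC MEDICINE": "1040-2446"}

def star_key(prefix, source, author, volume, page):
    """Build a wildcard key; '*' stands in for the issue and title letters."""
    issn = ISSN_BY_SOURCE.get(source.strip().upper())
    if issn is None:
        return None                     # unknown source: no link possible
    surname = author.split(",")[0].strip().upper()
    return f"/recs/{prefix}/{issn}/{volume}/*/{page}_{surname}*"

def match(pattern, bib_keys):
    """Return the bib record keys that fit the wildcard pattern."""
    return [key for key in bib_keys if fnmatchcase(key, pattern)]

bib_keys = [
    "/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS",
    "/recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD",
]
pattern = star_key("sici10", "ACAD MED", "VU, NV", 67, 42)
linked = match(pattern, bib_keys)
```

A citation rarely carries the issue number or the article title, which is why those two parts of the key are left as wildcards here.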

Page 20

XML Cited reference example

<refItem type="ref">
  <fullKey>/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS</fullKey>
  <starKey>/recs/sici10/1040-2446/67/*/42_VU*</starKey>
  <citAu src="cit">VU, NV</citAu>
  <citAu src="bib">VU, NV</citAu>
  <citAu src="bib">BARROWS, HS</citAu>
  <citAu src="bib">TRAVIS, T</citAu>
  <citSo src="cit">ACAD MED</citSo>
  <citSo src="bib">ACADEMIC MEDICINE</citSo>
  <citSo src="bib">ACAD MED</citSo>
  <citYear src="cit">1992</citYear>
  <citVol src="cit">67</citVol>
  <citIssue src="bib">1</citIssue>
  <citPage src="cit">42</citPage>
  <citEndPg src="bib">50</citEndPg>
  <citIssn src="bib">1040-2446</citIssn>
</refItem>
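As an illustration of how a record like this could be read, the sketch below parses a shortened copy of the example with Python's standard library and separates values by their `src` attribute. Reading `src="cit"` as "from the citation as given" and `src="bib"` as "filled in from the matched bib record" is an interpretation, not something the slides state; the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example record.
REF_ITEM = """<refItem type="ref">
  <fullKey>/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS</fullKey>
  <citAu src="cit">VU, NV</citAu>
  <citAu src="bib">VU, NV</citAu>
  <citAu src="bib">BARROWS, HS</citAu>
  <citAu src="bib">TRAVIS, T</citAu>
  <citYear src="cit">1992</citYear>
  <citIssn src="bib">1040-2446</citIssn>
</refItem>"""

def values(item, tag, src):
    """Collect the text of every <tag> child whose src attribute matches."""
    return [el.text for el in item.findall(tag) if el.get("src") == src]

item = ET.fromstring(REF_ITEM)
full_key = item.findtext("fullKey")
cited_authors = values(item, "citAu", "cit")
bib_authors = values(item, "citAu", "bib")
```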

Page 21

Matching citations and bib records

[Screenshot: sample bibliography, with callouts on individual citations:]

• No record match found
• Match on key /recs/sici01/0163-5808/29/3/76_LEE-CASXSL
• Match on key /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD

Page 22

Cited browse

• “citeinfo” database with over ½ billion individual citations (one of the largest MySQL dbs around!)
• Individual <refItem>s include fullKeys (which come from cite XML) for linking
    • FullKeys are deduped
• Each cited author name is pulled from <refItem>s, normalized, and added to browse list tables
    • Browse tables contain ~195 million names
    • After deduping, only ~12 M unique names
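The normalization rule itself is not described, so this sketch assumes a simple one: uppercase "SURNAME, INITIALS" with periods and extra spaces removed, which is consistent with names such as `VU, NV` in the cited reference example. Function names and sample inputs are hypothetical.

```python
import re
from collections import Counter

def normalize_name(raw):
    """Hypothetical rule: 'SURNAME, INITIALS' in uppercase, no periods."""
    surname, _, initials = raw.partition(",")
    surname = re.sub(r"\s+", " ", surname).strip().upper()
    initials = re.sub(r"[^A-Za-z]", "", initials).upper()
    return f"{surname}, {initials}" if initials else surname

def browse_list(raw_names):
    """Deduplicate normalized names and count how often each one occurs."""
    counts = Counter(normalize_name(name) for name in raw_names)
    return sorted(counts.items())

names = browse_list(["Vu, N.V.", "VU, NV", "vu,  n v", "Barrows, H.S."])
```

Normalizing before deduplication is what lets variant spellings of the same cited author collapse into a single browse entry.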

Page 23

Cited browse

[Screenshot: cited browse list, with callouts:]

• 12 M unique names
    • Browse cited papers
    • Browse general search
• Number of times each item is cited
• Links to record via fullKey
• Total cite count

Page 24

Times cited

• <fullKey> is used to create real-time times-cited counts
• Counts displayed in bibliographic record as well as cited browse
• Times-cited count is also pulled out and indexed into Verity to allow sorting of results by “times cited”
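A sketch of a times-cited lookup keyed on fullKey. SQLite (from Python's standard library) is swapped in for MySQL so the example runs on its own, and the table layout, column names, and sample rows are hypothetical; the real citeinfo schema is not shown in the slides.

```python
import sqlite3

KEY = "/recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD"
OTHER = "/recs/sici01/0163-5808/29/3/76_LEE-CASXSL"

conn = sqlite3.connect(":memory:")   # SQLite stand-in for the MySQL database
conn.execute("CREATE TABLE citeinfo (full_key TEXT, citing_year INTEGER)")
conn.execute("CREATE INDEX idx_full_key ON citeinfo (full_key)")
conn.executemany(
    "INSERT INTO citeinfo VALUES (?, ?)",
    [(KEY, 2003), (KEY, 2003), (KEY, 2002), (OTHER, 2003)],
)

def times_cited(conn, full_key):
    """Return (total, [(year, count), ...]) for one cited item."""
    total = conn.execute(
        "SELECT COUNT(*) FROM citeinfo WHERE full_key = ?", (full_key,)
    ).fetchone()[0]
    by_year = conn.execute(
        "SELECT citing_year, COUNT(*) FROM citeinfo WHERE full_key = ? "
        "GROUP BY citing_year ORDER BY citing_year DESC",
        (full_key,),
    ).fetchall()
    return total, by_year

total, by_year = times_cited(conn, KEY)
```

Because the count is computed at query time, it reflects newly loaded citations without a separate update step.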

Page 25

Times cited

Page 26

Cited linkages

[Diagram: linked record views]

• Full Record – Title: A, Published: 2000, with its bibliography (Citation 1, Citation 2, Citation 3, Citation x…)
• Full Record, Citation 1 – Times cited: 96
• Full Record, Citation 1 – Number of times cited: Total 96; 2003: 12; 2002: 23; 2001: 24; 2000: 50; 1999: 70; 1998: 17
• Records citing Citation 1, Published in 2000: Title A, Title B, Title C …

Page 27

Cited browse

• Connections to citeinfo MySQL use connection pooling
    • 100 connections refreshed after every 10 queries (can be increased on the fly)
• Table structure optimizations reduced browse time to avg. under 1 second
• Highly cited works (cited more than 10,000 times) are slow
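One reading of "100 connections refreshed after every 10 queries" is a fixed-size pool in which each connection is closed and reopened once it has served 10 queries. The sketch below follows that reading. The class name is hypothetical, SQLite stands in for MySQL so the example is self-contained, and the sketch is single-threaded; it is not the project's actual pooling code.

```python
import queue
import sqlite3

class RefreshingPool:
    """Fixed-size pool that replaces a connection after max_queries uses."""

    def __init__(self, size=100, max_queries=10):
        self.max_queries = max_queries   # plain attribute, adjustable at run time
        self.opened = 0                  # how many connections were ever opened
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put((self._connect(), 0))

    def _connect(self):
        self.opened += 1
        return sqlite3.connect(":memory:")   # SQLite stand-in for MySQL

    def run(self, sql, params=()):
        conn, used = self.idle.get()     # blocks while every connection is busy
        try:
            return conn.execute(sql, params).fetchall()
        finally:
            used += 1
            if used >= self.max_queries:  # refresh a worn-out connection
                conn.close()
                conn, used = self._connect(), 0
            self.idle.put((conn, used))

pool = RefreshingPool(size=2, max_queries=3)
first = pool.run("SELECT 1")
for _ in range(5):
    pool.run("SELECT 1")
```

With two connections and a limit of three queries each, six queries force both connections to be replaced once.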

Page 28

Adding MySQL to the mix

• Fast performance and an open source relational db
• On Sun platform, can address up to 32 GB of memory for query caching
• Used to provide browse capability for article authors / cited authors
• Also provides a live, disk-based backup to XML bibliographic data
• Separate MySQL databases used for user authentication and preferences and for current alerts services

Page 29

Application - Requirements

• 250,000+ searches per month
• 3300 users have weekly alerts set up
• 115 run saved searches “on demand”
• Access requests from all over the world
    • National Inst. of Materials Physics – Bucharest, Romania
    • Univ. Program in Ecology – Duke University
    • Dept. of Biochemistry and Molecular Biology – U of Western Australia
    • National Center for Atmospheric Research – Boulder, CO

Page 30

Application - Requirements

• Interface enhancements
    • Keep “successful” options from legacy interfaces
    • Add features based on user feedback
        • Search screen options - features based on appropriate dbs
        • Alerts and saved searches
        • User preferences
        • Marking and output
        • SFX

Page 31

Options in the Interface

Page 32

Performance

• Many variables – attempts to improve each component
    • XML layout on the filesystems
    • Memory use
    • Network infrastructure
    • Application issues
        • MySQL engine, Verity engine, JVM, Java compiler, XSL, and JSP
        • Application code itself

Page 33

Lessons Learned

• As deadlines approach, design suffers
• Standards evolve slower than software
• As projects become bigger, teams need to formalize work patterns
• Project management tools are critical – Ant, CVS, Bugzilla

Page 34

Next steps

• INSPEC will be added to ISI October 2003
    • Some interface rework to handle
        • INSPEC “only” users – no cited features
        • New / expanded list of indexes
        • Searches over INSPEC db only (not ISI)
• BIOSIS by the end of 2003
• Merging user databases across product suite
• Expanding into a “component architecture”
• Increase use of standards and open source (MARCXML, OAI, etc.)