LITA Forum 2003
The ‘XML’ Project:
Integrated Access to Scientific Resources
Miriam Blake – Presenter
Beth Goldsmith
Mariella Di Giacomo
Los Alamos National Laboratory
Research Library
Rationale for Project
60+ million citations, multiple access points
Duplicate records / citations
No links between bibliographies and the records we store
Need ‘smart objects’ with pointers (to full text, etc.)
Wanted an updated interface with new features
Existing Databases at LANL
Citation (A&I) databases
  ISI
    SciSearch, 1945-present: ~30M + 4k weekly
    Social SciSearch, 1973-present: ~15M + 1k
    Arts & Humanities, 1975-present: ~5M + .5k
    ISI Proceedings, 1990-present: ~3M + .5k
    • All ISI dbs have associated citation records
  INSPEC: ~8M+
  BIOSIS: ~15M
  Engineering Index (Compendex)
  Other (DOE, LAUP / tech reports, GeoRef, OPAC, etc.)
Project Team
6 developers (librarians and programmers)
  Miriam Blake, Doug Chafe, Mariella Di Giacomo, Frances Knudson, Beth Goldsmith, Mark Martinez, Ming Yu, Jeff Scott (hardware)
Research Library staff
  Librarians / metadata experts
Interface team: 2 staff doing JSP, HTML, and graphics for this project part time
Project Workflow
[Workflow diagram. Labels: Vendor A, Vendor B, Vendor C (multiple vendor record formats); Conversion; Verity XML (single record format); Indexing (Verity indexes, MySQL indexes); Application (display, search & browse)]
Hardware
Fault-tolerant architecture to provide reliability, flexibility, and speed
Sun Solaris 2.8 platform
Security environment
  Data stored and accessed inside a firewall
  Data accessed and application runs outside the firewall
Required a data-sharing file system (for Solaris)
  LSC file system called QFS
  Multiple readers, one writer per filesystem
[Architecture diagram. Labels: Load Balancer; User Authentication Db (MySQL, Linux); multiple hosts each labeled Application / Verity Broker, some also labeled Verity Servers and MySQL slave server; SAN (Storage Area Network) holding Verity colls, XML recs, and AuthorBrowse db; Verity broker/servers; Firewall; Development Environment]
Software components
Verity search engine
MySQL to handle author browse, user functions
Interface
  XSLT to transform XML for query result displays
  Java servlets
  JSP
  Apache / Tomcat to handle Java/JSP presentation
Verity Search Engine
Commercial product, used by many large companies
Used in our older apps; users are familiar with its search capabilities
Strength in full-text searching
Required Solaris (now runs on Linux)
Verity K2: parallel multi-tiered architecture
  Brokered approach
  Searches are distributed to multiple servers to concurrently search multiple Verity collections
LANL collections broken up by year
  Recs colls: bibliographic metadata
  Cites colls: citations within articles
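The brokered approach can be illustrated with a minimal sketch. It does not use the Verity K2 API: the per-year collections are plain in-memory lists, and `COLLECTIONS`, `search_collection`, and `broker_search` are hypothetical names chosen for illustration. It only shows the idea of fanning one query out to several year-partitioned collections concurrently and merging the hits.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-year collections: year -> list of record titles.
COLLECTIONS = {
    2001: ["Spirituality in medicine", "Quantum dots in silicon"],
    2002: ["Medicine and ethics", "Protein folding pathways"],
    2003: ["Spirituality and clinical performance"],
}

def search_collection(year, term):
    """Search one collection; return (year, title) hits."""
    term = term.lower()
    return [(year, title) for title in COLLECTIONS[year] if term in title.lower()]

def broker_search(term, years=None):
    """Fan the query out to every collection concurrently and merge the hits."""
    years = sorted(years or COLLECTIONS)
    with ThreadPoolExecutor(max_workers=len(years)) as pool:
        per_year = list(pool.map(lambda y: search_collection(y, term), years))
    merged = [hit for hits in per_year for hit in hits]
    # Newest collections first; the ordering rule is an assumption.
    return sorted(merged, key=lambda hit: hit[0], reverse=True)
```

Calling `broker_search("medicine")` searches all three collections; passing `years=[2003]` restricts the fan-out to one.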
ISI Conversion
[Diagram: an ISI vendor record (bib record + citations) is converted into an XML record with bib data (“recs”) and an XML record with citation data items (“cites”); these feed the Verity recs coll and the Verity cites coll (for searching)]
Record Structure
Record keys <fullKey>, structure:
  Combination of ISSN, author name, volume, issue, start page, and title letters
  /recs/sici00/0018-8190/46/2/173_SCIANCE-LSTROST
  Not all elements are always present
ISI records are split into 2 XML records with the same fullKey: one for bibliographic data and one for citations (bibliography)
Bibliographic and cited records are indexed into separate collections for searching
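A minimal sketch of how such a key might be assembled follows. The exact rules are not given on the slide, so several parts are assumptions: the `sici` prefix is taken to be followed by the first two digits of the ISSN (inferred from the sample keys), the title letters are taken to be the first character of each title word, and a missing volume or issue is left as an empty path segment. The function names are hypothetical.

```python
import re

def title_letters(title):
    """Assumption: first character of each word in the title, uppercased."""
    return "".join(word[0] for word in re.findall(r"[A-Za-z0-9]+", title)).upper()

def make_full_key(issn, volume, issue, start_page, first_author, title):
    """Build a /recs/... key from the elements listed on the slide."""
    surname = first_author.split(",")[0].strip().upper()
    prefix = "sici" + issn[:2]  # inferred from the sample keys, not documented
    parts = [
        issn,
        str(volume or ""),       # empty segment when the element is missing
        str(issue or ""),
        f"{start_page}_{surname}-{title_letters(title)}",
    ]
    return f"/recs/{prefix}/" + "/".join(parts)
```

Because the bibliographic record and its citations record share one key, either can be located from the other by key alone.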
Conversion to XML - recs
Verity XML
  Specific fields needed to handle vendor indexing requirements
  One XML record containing matching articles from multiple vendors
  Consistent XML tags across databases as much as possible
[Slide image: example Verity XML bib record]
Kludges for Verity
Sort fields
  <sorttitle>SPIRITUALITY IN MEDICINE A COMPARISON OF MEDICAL STUDENTS ATTITUDES AND CLINICAL PERFORMANCE </sorttitle>
  <sortauthor>MUSICK DW CHEEVER TR QUINLIVAN S NORA LM </sortauthor>
  <sortsource>ACADEMIC PSYCHIATRY 2003000027000002000000000067</sortsource>
  <sortdate>20030000000183296100001</sortdate>
Display fields
  <resauthor>(Art)Musick, DW; Cheever, TR; Quinlivan, S; Nora, LM </resauthor>
  <ressource>(Art)Source: ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73 </ressource>
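The sort fields work because zero-padded numbers compare correctly as plain strings. The sketch below rebuilds the `<sortsource>` value from the example; the field widths (4 digits for year, 6 for volume, 6 for issue, 12 for start page) are inferred from that single example and may not match the real conversion code. `sort_source` is a hypothetical name.

```python
def sort_source(journal, year, volume, issue, start_page):
    """Zero-pad numeric parts so string order equals numeric order.

    Widths 4/6/6/12 are inferred from the one example on the slide.
    """
    return f"{journal.upper()} {year:04d}{volume:06d}{issue:06d}{start_page:012d}"

# Without padding, "9" sorts after "27" as text; with padding the order is numeric.
vol_9 = sort_source("J Example", 2003, 9, 1, 5)
vol_27 = sort_source("J Example", 2003, 27, 1, 5)
```

With these widths, the values for journal ACADEMIC PSYCHIATRY, 2003, v.27, no.2, p.67 produce the same string as the `<sortsource>` shown on the slide.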
Kludges for Verity
Zones
  <znumber>
    <issn db="Soc">1042-9670</issn>
    <controlNum db="Soc">000183296100001</controlNum>
  </znumber>
Data-enhanced tags
  <zjournal>
    <journalAbbrJ2 db="Soc">ACAD PSYCHIATR</journalAbbrJ2>
    <journalAbbrJ9 db="Soc">ACAD PSYCHIATRY</journalAbbrJ9>
    <journalAbbr db="Soc">Acad. Psych.</journalAbbr>
    <journalAbbrJ1 db="Soc">ACAD PSYCHI</journalAbbrJ1>
    <journal db="Soc">ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73</journal>
  </zjournal>
Unified record display
Preference order for fields to display when multiple databases are present in the same record
  Some fields should be deduped (e.g. title)
  Some fields should display all data from all databases (e.g. subject, keywords)
Becomes critical when multiple vendor records are displayed together
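The display rule above can be sketched as follows. The preference order, the set of single-valued fields, and the record layout (`field -> {db: [values]}`) are all assumptions made for illustration; the slide does not specify them.

```python
PREFERENCE = ["ISI", "INSPEC", "BIOSIS"]  # hypothetical preference order
SINGLE_VALUED = {"title"}                 # fields shown once, from the preferred db

def unify(record):
    """record maps field -> {db: [values]}; returns field -> values to display."""
    display = {}
    for field, by_db in record.items():
        dbs = [db for db in PREFERENCE if db in by_db]
        if field in SINGLE_VALUED:
            # Deduped field: take one value from the most preferred database.
            display[field] = by_db[dbs[0]][:1] if dbs else []
            continue
        # Multi-valued field: union of all databases, case-insensitive dedup.
        seen, merged = set(), []
        for db in dbs:
            for value in by_db[db]:
                if value.lower() not in seen:
                    seen.add(value.lower())
                    merged.append(value)
        display[field] = merged
    return display
```

Keeping the preference order in one list means a new vendor database only needs an entry there, not a change to the display logic.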
Unified record display (continued): [slide image not included in transcript]
ISI Conversion - Cites
Cites: citation data (bibliographies) in each bibliographic record
  Searchable separately from the articles which cite them
  500+ million individual citations (~170M are unique)
  Can be searched by cited author, source, year, volume, or a combination thereof
One cites XML record can have multiple citations
  <refItem>: one for each citation
After conversion to XML, fullKeys are created for each <refItem> where possible
ISI Conversion - cites
[Diagram: a record (Title: xxx) with its bibliography (Citation 1, Citation 2, Citation 3, Citation x…); one citation is expanded into its parts]
3 authors: Ling, TW (1st author); Goh, CH; Lee, ML
Source (title, year, vol, issue, start page): Information and Software Technology, 1996, v.38, #9, p.601
FullKey for this item: /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD
26M records with bibliographies
500M individual citations (<refItem>)
Cites Fuzzy Matching
Every <refItem> is processed to try to link it to the recs article it matches using fullKey
Use “fuzzy matching” rules developed internally
  Internal db of ISSNs matches brief source data (PHYS REV B or P REV B)
  ISSN + cited author name, cited volume, cited page creates a fullKey that can match the key of a bib record, creating a link
~60% of bib records match a cite
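The internal matching rules are not published on the slide, so the following is only a sketch of the general idea: resolve an abbreviated source title to an ISSN through an alias table, then combine ISSN, cited author surname, volume, and page into a lookup key. The alias table contents, the normalization rule, and the names `ISSN_BY_SOURCE`, `partial_key`, and `link` are hypothetical.

```python
# Hypothetical alias table: abbreviated or full source titles -> ISSN.
ISSN_BY_SOURCE = {
    "ACAD MED": "1040-2446",
    "ACADEMIC MEDICINE": "1040-2446",
}

def normalize_source(source):
    """Uppercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in source.upper())
    return " ".join(cleaned.split())

def partial_key(source, author, volume, page):
    """Return (issn, volume, page, surname), or None if the ISSN is unknown."""
    issn = ISSN_BY_SOURCE.get(normalize_source(source))
    if issn is None:
        return None
    surname = author.split(",")[0].strip().upper()
    return (issn, str(volume), str(page), surname)

def link(citation, bib_index):
    """bib_index maps (issn, volume, page, surname) -> fullKey of a bib record."""
    key = partial_key(**citation)
    return bib_index.get(key) if key else None
```

A citation whose source cannot be resolved to an ISSN simply stays unlinked, which is consistent with only part of the citations matching a record.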
XML Cited reference example
<refItem type="ref">
<fullKey>/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS</fullKey>
<starKey>/recs/sici10/1040-2446/67/*/42_VU*</starKey>
<citAu src="cit">VU, NV</citAu>
<citAu src="bib">VU, NV</citAu>
<citAu src="bib">BARROWS, HS</citAu>
<citAu src="bib">TRAVIS, T</citAu>
<citSo src="cit">ACAD MED</citSo>
<citSo src="bib">ACADEMIC MEDICINE</citSo>
<citSo src="bib">ACAD MED</citSo>
<citYear src="cit">1992</citYear>
<citVol src="cit">67</citVol>
<citIssue src="bib">1</citIssue>
<citPage src="cit">42</citPage>
<citEndPg src="bib">50</citEndPg>
<citIssn src="bib">1040-2446</citIssn>
</refItem>
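A short sketch of reading such a record with Python's standard library follows, using an abridged copy of the example. Two readings here are assumptions rather than documented facts: that `src="cit"` marks values taken from the citation as given and `src="bib"` marks values filled in from the matched bibliographic record, and that `<starKey>` is a wildcard pattern for key elements the citation did not supply. `by_source` is a hypothetical helper.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Abridged copy of the example record.
REF_ITEM = """<refItem type="ref">
  <fullKey>/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS</fullKey>
  <starKey>/recs/sici10/1040-2446/67/*/42_VU*</starKey>
  <citAu src="cit">VU, NV</citAu>
  <citAu src="bib">VU, NV</citAu>
  <citAu src="bib">BARROWS, HS</citAu>
  <citVol src="cit">67</citVol>
  <citIssue src="bib">1</citIssue>
</refItem>"""

def by_source(ref_item, src):
    """Collect tag -> [text] for child elements whose src attribute equals src."""
    fields = {}
    for child in ref_item:
        if child.get("src") == src:
            fields.setdefault(child.tag, []).append(child.text)
    return fields

item = ET.fromstring(REF_ITEM)
cited = by_source(item, "cit")      # values as they appeared in the citation
enriched = by_source(item, "bib")   # values assumed to come from the bib record
# Treat the starKey as a shell-style pattern and test it against the fullKey.
star_matches = fnmatch.fnmatchcase(item.findtext("fullKey"), item.findtext("starKey"))
```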
Matching citations and bib records
Sample bibliography
No record match found
Match on key /recs/sici01/0163-5808/29/3/76_LEE-CASXSL
Match on key /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD
Cited browse
“citeinfo” database with over ½ billion individual citations (one of the largest MySQL dbs around!)
Individual <refItem>s include fullKeys (which come from the cite XML) for linking
  FullKeys are deduped
Each cited author name is pulled from <refItem>s, normalized, and added to browse-list tables
  Browse tables contain ~195 million names
  After deduping, only ~12M unique names
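The normalize-then-dedupe step can be sketched as below. The normalization rule (uppercase, strip periods, no spaces inside the initials) is an assumption; the slide does not say which rules were used, and `normalize_name` and `build_browse_list` are hypothetical names.

```python
from collections import Counter

def normalize_name(raw):
    """Assumed rule: uppercase, drop periods, 'SURNAME, INITIALS' with packed initials."""
    surname, _, initials = raw.upper().replace(".", "").partition(",")
    surname = " ".join(surname.split())
    initials = "".join(initials.split())
    return f"{surname}, {initials}" if initials else surname

def build_browse_list(cited_names):
    """Return sorted (name, count) rows, one row per unique normalized name."""
    counts = Counter(normalize_name(name) for name in cited_names)
    return sorted(counts.items())
```

Variant spellings such as `Vu, N.V.` and `vu,  n v` collapse into one browse entry, which is how a large raw name list shrinks to a much smaller set of unique names.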
Cited browse
[Screenshot annotations]
12M unique names
  Browse cited papers
  Browse general search
Number of times each item is cited
Links to record via fullKey
Total cite count
Times cited
<fullKey> is used to create real-time times-cited counts
Counts are displayed in the bibliographic record as well as in cited browse
Times-cited count is also pulled out and indexed into Verity to allow sorting of results by “times cited”
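A times-cited count keyed on fullKey amounts to a grouped count over the citation table. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere; the table layout and the column names `full_key` and `citing_year` are hypothetical, not the actual citeinfo schema.

```python
import sqlite3

# In-memory stand-in for the citation table; schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citeinfo (full_key TEXT, citing_year INTEGER)")
conn.execute("CREATE INDEX idx_full_key ON citeinfo (full_key)")

KEY = "/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS"
OTHER = "/recs/sici01/0163-5808/29/3/76_LEE-CASXSL"
conn.executemany(
    "INSERT INTO citeinfo VALUES (?, ?)",
    [(KEY, 2001), (KEY, 2002), (KEY, 2002), (OTHER, 2003)],
)

def times_cited(conn, full_key):
    """Return (total, [(year, count), ...]) for one fullKey, newest year first."""
    total = conn.execute(
        "SELECT COUNT(*) FROM citeinfo WHERE full_key = ?", (full_key,)
    ).fetchone()[0]
    per_year = conn.execute(
        "SELECT citing_year, COUNT(*) FROM citeinfo "
        "WHERE full_key = ? GROUP BY citing_year ORDER BY citing_year DESC",
        (full_key,),
    ).fetchall()
    return total, per_year
```

The index on the key column is what keeps a per-record count cheap enough to compute at display time.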
Times cited (continued): [slide image not included in transcript]
Cited linkages
[Diagram: navigation between a full record, one of its citations, and the records citing that citation]
Full Record: Title: A, Published: 2000, with bibliography (Citation 1, Citation 2, Citation 3, Citation x…)
Full Record for Citation 1: Times cited: 96
Number of times cited for Citation 1: Total 96; 2003: 12; 2002: 23; 2001: 24; 2000: 50; 1999: 70; 1998: 17
Records citing Citation 1, published in 2000: Title A, Title B, Title C, …
Cited browse
Connections to citeinfo MySQL use connection pooling
  100 connections, refreshed after every 10 queries (can be increased on the fly)
Table structure optimizations reduced browse time to under 1 second on average
Highly cited works (cited more than 10,000 times) are slow
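One way to read "refreshed after every 10 queries" is that each pooled connection is closed and reopened once it has served a fixed number of queries. The sketch below implements that reading; it is an assumption about the behavior, not the project's code. `RecyclingPool` is a hypothetical class, and `sqlite3` stands in for the MySQL driver.

```python
import queue
import sqlite3

class RecyclingPool:
    """Fixed-size pool that replaces a connection after max_queries uses."""

    def __init__(self, size, max_queries, connect):
        self.max_queries = max_queries
        self.connect = connect
        self.created = 0                  # total connections opened so far
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put((self._new(), 0))

    def _new(self):
        self.created += 1
        return self.connect()

    def query(self, sql, params=()):
        conn, used = self._idle.get()     # blocks while all connections are busy
        try:
            rows = conn.execute(sql, params).fetchall()
        finally:
            used += 1
            if used >= self.max_queries:  # refresh: close and reopen
                conn.close()
                conn, used = self._new(), 0
            self._idle.put((conn, used))
        return rows

pool = RecyclingPool(size=2, max_queries=10,
                     connect=lambda: sqlite3.connect(":memory:"))
results = [pool.query("SELECT 1") for _ in range(20)]
```

Making `max_queries` an instance attribute means the limit can be changed while the pool is running, in the spirit of "can be increased on the fly".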
Adding MySQL to the mix
Fast performance and an Open Source relational db
On Sun platform, can address up to 32GB of memory for query caching
Used to provide browse capability for article authors / cited authors
Also provides a live, disk based backup to XML bibliographic data
Separate MySQL databases used for User authentication and preferences and for current alerts services
Application - Requirements
250,000+ searches per month
3,300 users have weekly alerts set up
115 run saved searches “on demand”
Access requests from all over the world
  National Inst. of Materials Physics, Bucharest, Romania
  Univ. Program in Ecology, Duke University
  Dept. of Biochemistry and Molecular Biology, U of Western Australia
  National Center for Atmospheric Research, Boulder, CO
Application - Requirements
Interface enhancements
  Keep “successful” options from legacy interfaces
  Add features based on user feedback
Search screen options: features based on appropriate dbs
Alerts and saved searches
User preferences
Marking and output
SFX
Options in the Interface
Performance
Many variables; attempts to improve each component
  XML layout on the filesystems
  Memory use
  Network infrastructure
  Application issues
    • MySQL engine, Verity engine, JVM, Java compiler, XSL, and JSP
    • Application code itself
Lessons Learned
As deadlines approach, design suffers
Standards evolve more slowly than software
As projects become bigger, teams need to formalize work patterns
Project management tools are critical: Ant, CVS, Bugzilla
Next steps
INSPEC will be added to ISI in October 2003
  Some interface rework to handle:
    • INSPEC “only” users: no cited features
    • New / expanded list of indexes
    • Searches over INSPEC db only (not ISI)
BIOSIS by the end of 2003
Merging user databases across product suite
Expanding into a “component architecture”
Increase use of standards and open source (MARCXML, OAI, etc.)