AND WHY DO I CARE?
What is XML?
In the age of Google, why have fielded data?
More efficient for both data entry and for systems to search, retrieve and ingest
Parsed, discretely fielded data can be recombined mechanically for a variety of outputs and uses, including XML
A popular YouTube video that illustrates the power of XML:
“The Machine is Using Us” http://youtube.com/watch?v=NLlGopyXT_g
By Michael Wesch, an Assistant Professor of Cultural Anthropology at Kansas State University, this clip illustrates how the same data content can be supplied to many Web 2.0 sites. The same principles apply to supplying data to various software interfaces and tools in an automated fashion. Stop and watch it now; it will get you in the XML mood!
So…? This changes the landscape of digital tools for users and support staff
It is no longer a matter of “one-size-fits-all” tools, but a new scenario of multiple tools to fit the users and the use. Supporting multiple tools is less of a burden because the data can be generated once and then automatically transformed by XML stylesheets for each tool, interface, or digital collection.
What is XML?
Extensible Markup Language (XML) is a universal language for sharing data between applications. XML is most appropriate for situations where the volume of data is generally small, as the data is transmitted as text, and controlling the structure of the data is important.
TRANSLATION: It shuffles data between applications, and users can grab it and send it to a new application too
What XML does
Tags information
Facilitates transfer of that information between applications and also out to the Web (Web 2.0)
Allows information to be provided by schemas, which organize information and can represent standards (like MARC or VRA Core 4 or Dublin Core)
How does XML work?
It “tags” data—identifies what that data is (what meaning it holds).
MARC tags by using numeric designators: for instance, a “245” field is always a title, and a “700” field is a personal name (such as a creator); the other “7xx” fields hold further added entries
MARC example
XML tags
XML tags with natural language: it is easy to see what the information (the data value) is within the “chicken lips” (the angle brackets, < and >)
XML example (in VRA Core 4)
<!-- AGENT -->
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
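Because every value sits inside a named tag, a program can pick the record apart without guessing. Here is a minimal Python sketch of that idea, using only the standard library. The function name read_agent and the dictionary keys are made up for illustration; the XML is the Cropsey record from this slide.

```python
import xml.etree.ElementTree as ET

# The VRA Core 4 agent record from the slide, held as a string.
AGENT_XML = """
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
"""

def read_agent(xml_text):
    """Return the tagged values of the first agent as a plain dictionary.

    read_agent and its keys are hypothetical names chosen for this sketch.
    """
    root = ET.fromstring(xml_text.strip())
    agent = root.find("./index/agent")
    name = agent.find("name")
    return {
        "display": root.findtext("display"),
        "name": name.text,
        "name_vocab": name.get("vocab"),    # attribute on the <name> tag
        "name_refid": name.get("refid"),
        "earliest": agent.findtext("dates/earliestDate"),
        "latest": agent.findtext("dates/latestDate"),
        "culture": agent.findtext("culture"),
        "role": agent.findtext("role"),
    }

agent = read_agent(AGENT_XML)
print(agent["name"], "|", agent["role"], "|", agent["culture"])
```

The tag names do the work here: the code asks for "role" or "culture" by name, which is exactly what makes the data easy to send on to another application.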
Schema: Where the data standard and XML meet
Once a data standard like VRA Core 4.0 is devised, with all the elements and qualifiers laid out, the standard can be expressed in one XML document called the schema. The schema is a road map for writing a specific XSLT stylesheet that tells a database (or another type of application) how to export data into (Core 4) XML. A schema is a set of rules to which the XML document must conform to be “valid”.
VRA Core 4.0 XML schema (a small sample)
<!-- Agent -->
<xsd:complexType name="agentType">
  <xsd:annotation>
    <xsd:documentation>VRA Agent element. Subelements are used for different types of data (names, roles, dates, etc.). At least one subelement must be provided.</xsd:documentation>
  </xsd:annotation>
  <xsd:sequence minOccurs="1" maxOccurs="unbounded">
    <xsd:element name="attribution" type="basicString" minOccurs="0" />
    <xsd:element name="culture" type="basicString" minOccurs="0" />
    <xsd:element name="dates" type="agentDateType" minOccurs="0" />
    <xsd:element name="name" type="agentNameType" minOccurs="0" />
    <xsd:element name="role" type="basicString" minOccurs="0" />
  </xsd:sequence>
  <xsd:attributeGroup ref="vraAttributes" />
XML example (compare this output to the previous slide, the schema outline for the agent data element)
<!-- AGENT -->
<set>
  <display>Jasper Francis Cropsey (American painter, 1823-1900)</display>
  <index>
    <agent>
      <name type="personal" vocab="ULAN" refid="500012491">Cropsey, Jasper Francis</name>
      <dates type="life">
        <earliestDate>1823</earliestDate>
        <latestDate>1900</latestDate>
      </dates>
      <culture>American</culture>
      <role vocab="AAT" refid="300025136">painter</role>
    </agent>
  </index>
</set>
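To see what “conforming to the schema” means in practice, here is a rough Python sketch that checks two of the rules visible in the schema sample: an agent must have at least one subelement, and the subelements must be among the five listed. This is a hand-written stand-in for illustration only, not a real XSD validator (a real workflow would hand the .xsd file to schema-aware software). The function name check_agent is an assumption.

```python
import xml.etree.ElementTree as ET

# Subelement names that the schema sample allows inside <agent>.
ALLOWED_AGENT_CHILDREN = {"attribution", "culture", "dates", "name", "role"}

def check_agent(agent):
    """Return a list of problems found in one <agent> element (empty if none).

    Hypothetical helper: it covers only two rules from the schema sample.
    """
    problems = []
    children = list(agent)
    if not children:
        problems.append("agent has no subelements; at least one is required")
    for child in children:
        if child.tag not in ALLOWED_AGENT_CHILDREN:
            problems.append("unexpected subelement: " + child.tag)
    return problems

good = ET.fromstring("<agent><name>Cropsey, Jasper Francis</name><role>painter</role></agent>")
bad = ET.fromstring("<agent><nickname>JFC</nickname></agent>")
empty = ET.fromstring("<agent/>")

for label, element in [("good", good), ("bad", bad), ("empty", empty)]:
    print(label, check_agent(element))
```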
What is XSLT?
You can export XML data from FileMaker or Access (and many other programs) to use in an assortment of applications simply by applying the appropriate Extensible Stylesheet Language Transformations (XSLT) stylesheet. XSLT is also XML-based. You can use a stylesheet to turn an XML document into plain text, PDF documents, or web pages, or to import fielded data into other applications.
XSLT sample: how the XML is actually exported from a database (in this case FMP)
<!-- Agent -->
<set>
  <display>
    <xsl:value-of select="fm:AgentDisplay" />
  </display>
  <index>
    <xsl:for-each select="fm:AgentSortName/fm:DATA">
      <xsl:variable name="i">
        <xsl:value-of select="position()" />
      </xsl:variable>
      <agent>
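If XSLT syntax is unfamiliar, the same logic can be read as ordinary code. The sketch below is a Python stand-in for what the stylesheet fragment appears to do: write the display value once, then loop over a repeating field and emit one agent per value, using the position to pick the matching value from another repeating field. It is not XSLT and not the author's stylesheet. AgentDisplay and AgentSortName come from the sample; AgentRole and the function name export_agent_set are assumptions.

```python
import xml.etree.ElementTree as ET

def export_agent_set(record):
    """Build a <set> element from one database record held as a dictionary."""
    set_el = ET.Element("set")
    ET.SubElement(set_el, "display").text = record["AgentDisplay"]
    index_el = ET.SubElement(set_el, "index")
    # Like xsl:for-each over the repeating field: one <agent> per sort name.
    for position, sort_name in enumerate(record["AgentSortName"], start=1):
        agent_el = ET.SubElement(index_el, "agent")
        ET.SubElement(agent_el, "name").text = sort_name
        # The same position selects the matching value in another repeating
        # field (AgentRole is a hypothetical field name).
        ET.SubElement(agent_el, "role").text = record["AgentRole"][position - 1]
    return set_el

record = {
    "AgentDisplay": "Jasper Francis Cropsey (American painter, 1823-1900)",
    "AgentSortName": ["Cropsey, Jasper Francis"],
    "AgentRole": ["painter"],
}
xml_text = ET.tostring(export_agent_set(record), encoding="unicode")
print(xml_text)
```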
File Extensions for the 3 parts of XML
So when you see these file extensions, you will know what you are looking at:
The XML document is .xml
The XML schema is .xsd
The XSLT stylesheet is .xsl
Ummm, yeah, OK
Will you do coding/tagging for schemas? (No, you will use schemas provided/published for standards—MARC (MODS), VRA 4.0, CDWA lite, etc.)
Will you do coding/tagging for XSLT? (Maybe, if you take a class and are interested. More likely you will get tech support or support from user groups)
Will you be able to look at an XML document and basically understand it and edit it? (Yes, this is similar to learning HTML and HTML editors)
So how does this fit into my cataloging?
VRA Core 4 and CCO were both formed with an eye to output and expression in XML
They can be used in “flat” systems, but there is a clear benefit to using relational databases, and XML is also good at capturing/transmitting relational structure
Relational Databases
Relate information stored in multiple tables
Ideally, there is no redundancy of data entry: each value that might be reused in data entry is only entered once and stored in one table that is related for use everywhere else in the database (made available anywhere needed in the data entry workflow)
Numeric keys are normally used in this process
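A small sketch of the idea, using Python's built-in sqlite3 module. The table and column names are invented for illustration, and the work title and file names are placeholders. The agent is entered once, the work points to it by numeric key, and every image points to the work by numeric key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent (agent_id INTEGER PRIMARY KEY, sort_name TEXT, culture TEXT);
CREATE TABLE work  (work_id INTEGER PRIMARY KEY, title TEXT,
                    agent_id INTEGER REFERENCES agent(agent_id));
CREATE TABLE image (image_id INTEGER PRIMARY KEY, filename TEXT,
                    work_id INTEGER REFERENCES work(work_id));
""")

# The agent is typed in exactly once, in the authority table.
conn.execute("INSERT INTO agent VALUES (1, 'Cropsey, Jasper Francis', 'American')")
# The work stores only the numeric key of its agent (placeholder title).
conn.execute("INSERT INTO work VALUES (10, 'Hypothetical landscape', 1)")
# Each image stores only the numeric key of its work (placeholder file names).
conn.executemany("INSERT INTO image VALUES (?, ?, ?)",
                 [(100, "view_full.jpg", 10), (101, "view_detail.jpg", 10)])

# Following the keys gives every image the full work and agent information.
rows = conn.execute("""
    SELECT image.filename, work.title, agent.sort_name, agent.culture
    FROM image
    JOIN work  ON image.work_id = work.work_id
    JOIN agent ON work.agent_id = agent.agent_id
    ORDER BY image.image_id
""").fetchall()

for row in rows:
    print(row)
```

Correcting the agent's name in the one agent row would change what both images show, with no retyping.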
Excel sample (“flat file” output)
Notice that each row represents an image file and conflates the work and image records (repeats the information about the work for each image).
Each repeating value (like Artist) must have a column reserved for possible use.
A pithy answer to “why relational?” (for cataloging)
Message from Jan Eklund to VRA-L, Feb 20, 2008, subject: Re: CONTENTdm and metadata (search list archive for full message)
Complexity: “complexity cannot be captured efficiently in a flat data model because basically you have to leave space in every record to accommodate the most complex object you will ever encounter. This adds up to a lot of wasted space, and wasted space means more money…”
Consistency: “all the descriptive data about the work is entered once, and every image that shows this work inherits the same information”
Image and Work records (example from VCat)
A note field is possible for every Core 4 element
Repeating values are supported for each element
Numeric key
“indexed” value (in this case the sort name)
“display” value done to CCO recommended formatting. Note that the Agent Nationality is supplied automatically here by the link (numeric key) to the Agent Authority
Authority record
Numeric key
All the information about the agent is supplied from this file on the basis of the numeric key
<agentSet>
  <display>ACT Architecture (French architectural firm, ca. 1982-present); Gaetana Aulenti (Italian interior designer, born 1927); Victor Alexandre Frédéric Laloux (French architect, 1850-1937)</display>
  <notes>ACT Architecture (Renaud Bardon, Pierre Colboc and Jean-Paul Philippon)</notes>
  <agent>
    <name vocab="ULAN" refid="500023967" type="personal">Laloux, Victor Alexandre Frédéric</name>
    <dates type="life">
      <earliestDate>1850</earliestDate>
      <latestDate>1937</latestDate>
    </dates>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="LCNAF" refid="nr 95039966" type="corporate">ACT Architecture</name>
    <dates type="activity">
      <earliestDate>1982</earliestDate>
      <latestDate>2082</latestDate>
    </dates>
    <culture>French</culture>
  </agent>
  <agent>
    <name vocab="ULAN" refid="500031019" type="personal">Aulenti, Gaetana</name>
    <dates type="life">
      <earliestDate>1927</earliestDate>
      <latestDate>9999</latestDate>
    </dates>
    <culture>Italian</culture>
  </agent>
</agentSet>
The same information expressed in Core 4 XML—this is automatically output from the database
The Element Set of Core 4
Format and Global Attributes
Reciprocity in Relationships
Easy to show relationships between works in a relational database and via XML. In this case the XSLT stylesheet (in conjunction with programming within the database) can be written to supply the reciprocity (the other related work) based on the numeric key.
Stylesheets can do a lot!
They literally do “transformations”—they can change the XML into other formats, they can recombine parsed information—and they can even take that more efficient and consistent relational data and “flatten” it, and output it in csv (Excel) for import into delivery systems or other uses that are not yet XML-compatible!
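Here is a rough idea of what “flattening” involves, written in Python rather than XSLT so the steps are easy to follow. It takes repeating agent elements (a cut-down version of the agentSet shown earlier) and spreads them across a fixed number of columns, which is the shape a spreadsheet import expects. The column names, the function name flatten_agents, and the limit of three agents are assumptions.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A cut-down version of the agentSet from the earlier slide.
AGENT_SET = """
<agentSet>
  <agent><name type="personal">Laloux, Victor Alexandre Frédéric</name><culture>French</culture></agent>
  <agent><name type="corporate">ACT Architecture</name><culture>French</culture></agent>
  <agent><name type="personal">Aulenti, Gaetana</name><culture>Italian</culture></agent>
</agentSet>
"""

def flatten_agents(xml_text, max_agents=3):
    """Flatten repeating <agent> elements into one row of fixed columns."""
    root = ET.fromstring(xml_text.strip())
    agents = root.findall("agent")
    header, row = [], []
    for i in range(max_agents):
        header += ["Agent%d" % (i + 1), "Agent%dCulture" % (i + 1)]
        if i < len(agents):
            row += [agents[i].findtext("name"), agents[i].findtext("culture")]
        else:
            # A flat file must reserve the columns even when they stay empty.
            row += ["", ""]
    return header, row

header, row = flatten_agents(AGENT_SET)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Calling flatten_agents with a larger max_agents shows the “wasted space” problem directly: the extra columns are there in every row whether or not a work needs them.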
Other Data Standards (field structures) and XML
MARC; MODS
CDWA
Dublin Core
VRA Core 4.0
EAD
METS
MARC—Machine Readable Cataloging
Emerged from a Library of Congress-led initiative that began in the 1960s and became a standard in the early 1970s for bibliographic (reprographic) materials
Uses numeric tags to designate the fields (“245” means title, “700” fields are makers/creators, etc.)
This enabled computer protocols to share data worldwide
“The future of the MARC formats is a matter of some debate in the worldwide library science community. On the one hand, the formats are quite complex and are based on outdated technology. On the other, there is no alternative bibliographic format with an equivalent degree of granularity. The huge user base, billions of records in tens of thousands of individual libraries, also creates inertia” (Wikipedia entry)
MODS—Metadata Object Description Schema
A schema that allows the traditional numerically tagged MARC to be turned into XML
Can carry data from existing MARC plus allows creation of new XML-based records—a way to integrate and move forward?
http://www.loc.gov/standards/mods/
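The basic move from numeric tags to natural-language tags can be sketched in a few lines of Python. This is only an illustration of the idea, not the MODS schema itself: the element names, the "marc" attribute, the lookup table, and the sample title are all made up for the example, and real MODS records have their own required structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup from MARC numeric tags to natural-language element names.
TAG_NAMES = {"245": "title", "700": "name"}

def marc_fields_to_xml(fields):
    """Turn (tag, value) pairs into a small XML record with named elements."""
    record = ET.Element("record")
    for tag, value in fields:
        # Tags missing from the lookup become a generic <field> element.
        element = ET.SubElement(record, TAG_NAMES.get(tag, "field"))
        # Keep the original number as an attribute so nothing is lost.
        element.set("marc", tag)
        element.text = value
    return record

fields = [("245", "Example title"), ("700", "Cropsey, Jasper Francis")]
xml_text = ET.tostring(marc_fields_to_xml(fields), encoding="unicode")
print(xml_text)
```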
CDWA (Categories for the Description of Works of Art)
Developed by the Getty specifically to describe art, architecture and cultural artifacts
A very granular standard: the fields are very narrowly defined and there are many specific fields (as opposed to a few fields that use “qualifiers”)
Example: Creation - Commissioner - Commissioner Role
See the CDWA Lite XML schema: http://www.getty.edu/research/conducting_research/standards/cdwa/cdwalite.html
Dublin (Ohio) Core
Developed by OCLC (headquartered in Dublin, OH, and serving 53,500 libraries in 96 countries)
Created to describe “born digital” items in particular
Simple “bins” of data that can be further “qualified” (the difference between Simple DC and Qualified DC)
A qualifier is an element refinement, for example Date.Creation
The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 elements:
Title Creator Subject Description Publisher Contributor Date Type Format Identifier Source Language Relation Coverage Rights
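As a quick illustration, the fifteen “bins” can be written out as XML tags with a short Python sketch. The namespace is the published Dublin Core elements namespace; the wrapping record element, the function name make_dc_record, and the sample values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = ["title", "creator", "subject", "description", "publisher",
               "contributor", "date", "type", "format", "identifier",
               "source", "language", "relation", "coverage", "rights"]

# Ask ElementTree to write the namespace with the usual "dc" prefix.
ET.register_namespace("dc", DC_NS)

def make_dc_record(values):
    """Build a record from a dict that maps element names to lists of values."""
    unknown = sorted(set(values) - set(DC_ELEMENTS))
    if unknown:
        raise ValueError("not Simple Dublin Core elements: " + ", ".join(unknown))
    record = ET.Element("record")  # hypothetical wrapper element
    for name in DC_ELEMENTS:
        # Every element is optional and repeatable in Simple Dublin Core.
        for value in values.get(name, []):
            ET.SubElement(record, "{%s}%s" % (DC_NS, name)).text = value
    return record

record = make_dc_record({
    "title": ["Example title"],
    "creator": ["Cropsey, Jasper Francis"],
    "type": ["Image"],
})
dc_xml = ET.tostring(record, encoding="unicode")
print(dc_xml)
```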
VRA Core 4.0
Published in April 2007: http://www.vraweb.org/datastandards/VRA_Core4_Welcome.html
A data standard guiding data structure
Formed with an eye to expressing content in XML, with both index and display values
Formed like library records with a “bib” (work) record and an item (image) record
Formed as is Dublin Core with a 1:1 relationship: one record describes one object
EAD (Encoded Archival Description)
Started in 1993 at Berkeley; now maintained by the Library of Congress with SAA (Society of American Archivists)
Began using SGML, now uses XML
So, tagged and machine-readable, but not necessarily 1:1 records—simple way to make groups/boxes of material retrievable
Sample EAD Finding Aid
http://webtext.library.yale.edu/art/art.VRC1.htm
152 boxes; 64 linear feet of mounted photographs of American painting now in storage
Simply used the outline of the original filing/drawers and tagged them—this translates now to boxes of material with barcodes
METS (Metadata Encoding and Transmission Standard)
http://www.loc.gov/standards/mets/
Think of it as an XML “wrapper”: it can describe a group of objects or a collection of different objects, and it can “wrap” around a set of XML items that are in different formats, so it may be a way to integrate and present these
METS Profiles
UCSD Simple Object Profile abstract:
The UCSD Libraries uses the UCSD Simple Object profile for composing METS instances for digital objects consisting of a single digital content file and associated descriptive, administrative, and structural metadata. The single digital content file may be of any format type, e.g., audio, image, text, or video, and it may be represented in the METS instance with content equivalent file versions. For example, a digital image may be represented in the METS instance by a TIFF file, a JPEG file, and a GIF file, with each containing the same content image.
What do [book] librarians have that VR professionals don’t?
Tools and networked utilities for COPY CATALOGING:
MARC (Machine Readable Cataloging) for field structure (data standard)
AACR2 (Anglo-American Cataloging Rules) for data formatting (data content)
XML and Z39.50 (and other protocols) for transmitting data
OCLC as a shared records repository (sustainable business model)
How do we get to shared VR image cataloging?
Have to develop the same general mechanisms as the library world
VRA Core 4.0 = MARC
CCO = AACR2
XML will be one transmission vehicle/protocol
OAI (Open Archives Initiative) may become a harvesting and retrieval mechanism for record sharing
OAI (Open Archives Initiative)—XML Based
http://www.openarchives.org/
Started by 2 computer scientists at Cornell to quickly share information via mechanical “harvesting”: databases are opened to allow harvesting and results are then put in a central repository for searching. It is a “low-barrier” interoperability framework using Dublin Core (in XML) as its minimum standard, but one can also use other standards (expressed in XML) on top of that.
Google is using OAI to harvest data from the National Library of Australia. (See also U Michigan’s OAIster project).
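Harvesting is “low-barrier” partly because a request is just a web address with a few parameters. The Python sketch below only builds such an address and does not contact any server. ListRecords, metadataPrefix, and oai_dc are part of the OAI harvesting protocol (OAI-PMH); the repository address and the function name list_records_url are made up for the example.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one set defined by the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# Hypothetical repository address, used only for illustration.
url = list_records_url("https://repository.example.org/oai")
print(url)
```

A harvester would fetch that address, receive Dublin Core records in XML, and load them into the central repository for searching.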
See—XML matters!
Susan Jane Williams
Independent Cataloging and [email protected]