
Page 1: Cataloging the Internet for the Sake of the User

Nancy Florio
ILS 506, Summer 2009
Chang Suk Kim, Ph.D.

Page 2: Cataloging the Internet for the Sake of the User

Historically, libraries and librarians have always sought to collect, organize, preserve, and disseminate the collective knowledge of the world.

The Diamond Sutra (dated 868 CE) holds the distinction of being the earliest dated printed book.

It was not until around 1440, when Johannes Gutenberg invented a printing press with metal movable type, that printed material became accessible, although not affordable, to the masses.

Primarily for economic reasons, widespread book ownership is a fairly recent luxury; before that, people sought out books from their local academic or public libraries.

Page 3: Cataloging the Internet for the Sake of the User

Initially, libraries kept track of their collections by creating lists in books. This system allowed only one point of access when looking for information.

In 1901 the Library of Congress began to sell printed cards of bibliographic information to libraries.

Filing these bibliographic cards in a card catalog allowed users to find information through multiple access points: author, title, and subject.

Page 4: Cataloging the Internet for the Sake of the User

The Dewey Decimal and Library of Congress Classification systems, the Anglo-American Cataloguing Rules (AACR2), and the MARC format enabled users to access local information that adhered to consistent standards.

With the development of MARC records, card catalogs in individual libraries gave way to electronic cataloging systems, further streamlining the information seeking process.

In 1967, the Ohio College Library Center (OCLC), a consortium of 54 Ohio colleges, formed a network to share their collections cataloged in MARC records.

OCLC has since changed its name to the Online Computer Library Center, and membership is now open to all libraries through WorldCat.

Page 5: Cataloging the Internet for the Sake of the User

Today, anyone using a computer with Internet access can locate items through any of the 138 million bibliographic records held in WorldCat.

“The ability for the OCLC to operate as a collective requires consistent standards for precise communication” (O’Daniel, 1999).

The Library of Congress developed these consistent standards by creating Authority Headings for subject, name, title, and name/title combinations and permitting users to harvest this data at http://authorities.loc.gov/.

This controlled vocabulary ensures consistency and accuracy in bibliographic records and increases the likelihood that users will find the information they seek.
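In essence, a controlled vocabulary is a mapping from every variant form of a name or subject to a single authorized heading, so that all records share the same access point. A minimal sketch of the idea (the entries below are illustrative, not records harvested from authorities.loc.gov):

```python
# Minimal sketch of a controlled vocabulary: every variant form resolves
# to one authorized heading. Entries are illustrative examples only.
AUTHORIZED = {
    "clemens, samuel": "Twain, Mark, 1835-1910",
    "twain, mark": "Twain, Mark, 1835-1910",
    "samuel langhorne clemens": "Twain, Mark, 1835-1910",
}

def authorized_heading(name: str) -> str:
    """Return the authorized form of a name, or the name itself if unknown."""
    return AUTHORIZED.get(name.strip().lower(), name)

# Either variant leads a searcher to the same set of bibliographic records.
print(authorized_heading("Clemens, Samuel"))  # Twain, Mark, 1835-1910
print(authorized_heading("Twain, Mark"))      # Twain, Mark, 1835-1910
```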

Page 6: Cataloging the Internet for the Sake of the User

Over the past quarter century there has been a staggering increase in the availability of information in digital format on the World Wide Web.

The faculty and students at the School of Information Management and Systems at the University of California at Berkeley researched how much new information is created each year.

They estimated that in the year 2000 there were 20 to 50 terabytes of information on the Surface Web. In three short years that number had more than tripled, from roughly 50 terabytes to 167 terabytes (Lyman et al., 2003).

Page 7: Cataloging the Internet for the Sake of the User

BrightPlanet estimates that the Deep Web holds 400 to 450 times the information of the Surface Web, placing Deep Web content at between 66,800 and 91,850 terabytes.

As a point of reference, it would require 10 terabytes to contain all the information in the entire print collections of the U.S. Library of Congress.

As of 2003, then, the Web held the equivalent of 6,600 to 10,000 times the entire LOC print collection, most of it on the Deep Web.

Page 8: Cataloging the Internet for the Sake of the User

One may wonder why there is a need to catalog the information on the Web when people are able to access information through generalized search engines such as Google.

In fact, “Google's mission is to organize the world's information and make it universally accessible and useful”.

Looking at the statistics, it appears Google is doing just that. Currently, Google’s “millions of servers process about 1 petabyte of user-generated data every hour. It conducts hundreds of millions of searches every day” (Vogelstein, 2009).

Page 9: Cataloging the Internet for the Sake of the User

While Google is able to process the equivalent of roughly 100 Library of Congress print collections every hour (1 petabyte against 10 terabytes), volume and speed cannot be construed as true indicators of the accuracy of information or the relevancy of the items retrieved.

Traditional search engines like Google operate by building indices from crawled Web pages.

Pages need to be static for this form of indexing to work. Content on the Deep Web cannot be indexed this way because the majority of it does not exist in a static format (Bergman, 2001).
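In outline, crawl-and-index means fetching static HTML, extracting its words, and recording which pages each word appears on. The sketch below (URLs and names are illustrative, not any engine’s actual code) also shows why the approach fails on the Deep Web: a page generated only in response to a form query never appears in the list of pages to fetch.

```python
import re
import urllib.request
from collections import defaultdict

def build_index(urls):
    """Build a toy inverted index: term -> set of URLs containing it.

    Only static pages reachable by URL can be fetched; content generated
    on the fly from a database query never enters `urls`, which is why
    crawlers of this kind miss the Deep Web.
    """
    index = defaultdict(set)
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        text = re.sub(r"<[^>]+>", " ", html)  # crudely strip markup
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index

# Illustrative usage against hypothetical static pages:
# index = build_index(["http://example.org/a.html", "http://example.org/b.html"])
# index["cataloging"]  -> the set of pages containing the word "cataloging"
```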

Page 10: Cataloging the Internet for the Sake of the User

The Deep Web contains nearly 550 billion individual documents compared to the one billion of the surface Web.

More than 200,000 deep Web sites presently exist. On average, Deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites.

The Deep Web is not well known to the Internet-searching public.

The Deep Web is the largest growing category of new information on the Internet.

Page 11: Cataloging the Internet for the Sake of the User

Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.

Total quality content of the Deep Web is 1,000 to 2,000 times greater than that of the surface Web.

More than half of the Deep Web content resides in topic-specific databases.

A full ninety-five per cent of the Deep Web is publicly accessible information -- not subject to fees or subscriptions.

Page 12: Cataloging the Internet for the Sake of the User

Today’s academic libraries recognize the importance of providing access to high quality, peer-reviewed journals found on the Deep Web for their students and faculty.

Data collected by ARL libraries over the last decade indicate that the portion of the library materials budget spent on electronic resources is indeed growing rapidly, from an estimated 3.6% in 1992-93 to 10.56% in 1998-99 (Kyrillidou, 2000).

Colleges and universities now expend over 10% of their materials budgets to purchase subscription databases and journals.

Much of this money is wasted: only 2% of college students begin their information searches on their library’s website, and 90% are dissatisfied with the information they find when a general search engine directs them to electronic resources in their library.

Page 13: Cataloging the Internet for the Sake of the User

“Fast is better than slow. Google believes in instant gratification. You want answers and you want them right now. Who are we to argue?” - Google.com

Who indeed. In fact, the OCLC survey documented students’ perception that “search engines deliver better quality and quantity of information than librarian-assisted searching, and at greater speed” (OCLC, 2006).

Page 14: Cataloging the Internet for the Sake of the User

Carol C. Kuhlthau at Rutgers has conducted extensive research on information-seeking behavior from the user’s perspective.

“The bibliographic paradigm is based on certainty and order, whereas users’ problems are characterized by uncertainty and confusion” (Kuhlthau, 1991).

This uncertainty frequently causes feelings of anxiety in the student, and the search for broad, generalized information compounds this state.

Page 15: Cataloging the Internet for the Sake of the User

Kuhlthau’s research demonstrated that “a clear formulation reflecting a personal view of the information encountered is the turning point of the search… confusion decreases, and interest intensifies”.

It is exactly because of this uncertainty and confusion that the generalized search engines students prefer are detrimental to the information-seeking process (Kuhlthau, 1991).

Page 16: Cataloging the Internet for the Sake of the User

John Lubans’ research at Duke University on freshman Internet use found that only 7% of students rank their ability to use the Web as “best”, while 23% see their use as “better” and 29% rank their abilities as “good” (Lubans, 1998).

Clearly these students could benefit from the organization and cataloging of Internet resources.

The search engines they love, which enable users to keyword-pattern-match against billions of web pages, are very good at finding distinctive phrases.

Unfortunately, problems arise when students are in the beginning stages of discovery and are unsure exactly what they are looking for.

Page 17: Cataloging the Internet for the Sake of the User

Subject-organized URL lists on websites are cumbersome and labor-intensive to develop and update (Porter and Bayard, 1999), and many sites are of questionable authority.

Complaints from librarians about Web resources invariably center on the difficulties in organizing and archiving them, their inconsistent quality, and disappearing URLs resulting in the dreaded “404” message.

Page 18: Cataloging the Internet for the Sake of the User

While subject-organized lists are not the same as cataloged Internet resources, the Michigan Electronic Library (MEL), Internet Public Library (IPL) and INFOMINE are a few excellent examples that illustrate the organizational abilities of individual librarians to organize small portions of the Internet. Oder (1998)

The main problems these individual indexes or catalogs face, though, are their limited size and frequent redundancy of items.

A federated catalog like WorldCat would alleviate the redundancy problem.

Page 19: Cataloging the Internet for the Sake of the User

In 1991, OCLC’s Internet Cataloging Project began to address the need for a consortial approach to the problem, with 30 catalogers spearheading the movement to catalog Internet resources.

Findings at the end of the project demonstrated that, overall, MARC/AACR2 cataloging supported the cataloging of Internet resources, that a method to link the record to the resource was beneficial for the user, and that instructional materials should be developed (Jul, 1996).

A manual was published by the OCLC in response to these findings, library system vendors embraced the 856 MARC field for electronic location and access, and the Web OPAC was introduced.

By 1998 over 18,000 Internet resources had been cataloged by over 5,000 OCLC libraries.

Page 20: Cataloging the Internet for the Sake of the User

In spite of this initial success, there are many inherent difficulties when cataloging Internet resources.

These include the “lack of universally accepted controlled vocabulary; the lack of stability due to frequency of change of data; and the lack of quality standards” (O’Daniel, 1999).

Cataloging electronic serials brings its own difficulties: locating prior issues for descriptive information, publishers frequently updating digital information including titles, subtle differences between HTML and ASCII versions, variations between paper and digital versions, and the lack of a table of contents in many titles (Hawkins, 1997).

Websites move or even disappear. To address this problem, OCLC’s Office of Research developed persistent URLs, or PURLs. These aliases are assigned so that if a URL changes for any reason, the mapping need only be changed once, on the PURL server.
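The mechanics are simple: the PURL is a stable alias that redirects to the resource’s current location, so a moved resource requires one update on the PURL server rather than edits to every catalog record. A minimal sketch of the idea (the addresses are invented for illustration):

```python
# Toy model of a PURL server's redirect table. Each catalog record stores
# only the stable PURL; when a resource moves, one entry here is updated.
# The addresses are invented for illustration.
purl_table = {
    "/net/library-report": "http://old-host.example.edu/reports/2009.html",
}

def resolve(purl_path: str) -> str:
    """Return the current target URL for a PURL, as an HTTP redirect would."""
    return purl_table[purl_path]

# The resource moves: one change on the PURL server ...
purl_table["/net/library-report"] = "http://new-host.example.org/report.html"

# ... and every catalog record pointing at the PURL keeps working.
print(resolve("/net/library-report"))
```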

Page 21: Cataloging the Internet for the Sake of the User

The development of the Dublin Core Metadata Element Set standardized metadata found on websites and streamlined the complexity of the MARC format.

DC uses 15 predetermined but flexible elements.

Metatags are created and embedded within the documents themselves.

MARCit software was developed specifically to pull metadata from the title and URL fields and place it in the 245 and 856 MARC fields.
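As an illustration of what such a tool does (this is not MARCit’s actual code, and the page and field layout are simplified), the sketch below reads a Dublin Core style title metatag, falling back to the HTML page title, and emits text renderings of the MARC 245 (title statement) and 856 (electronic location) fields:

```python
import re

def html_to_marc_fields(html: str, url: str) -> list[str]:
    """Map a page's title metadata and URL to text renderings of the MARC
    245 and 856 fields. Simplified illustration, not MARCit's actual logic."""
    # Prefer an embedded Dublin Core title metatag; fall back to <title>.
    m = re.search(r'<meta\s+name="DC\.title"\s+content="([^"]*)"', html, re.I)
    if not m:
        m = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title = m.group(1).strip() if m else "[untitled]"
    return [
        f"245 00 $a {title}",   # title statement
        f"856 40 $u {url}",     # electronic location and access
    ]

sample = ('<html><head><meta name="DC.title" content="Cataloging the Internet">'
          '<title>fallback</title></head></html>')
for field in html_to_marc_fields(sample, "http://example.org/paper.html"):
    print(field)
# 245 00 $a Cataloging the Internet
# 856 40 $u http://example.org/paper.html
```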

Although cataloging is a time-consuming and often cost-prohibitive activity, it is only through such efforts to mesh Internet resources with local systems that Internet cataloging will succeed.

Page 22: Cataloging the Internet for the Sake of the User

“Subject gateways” may appeal more to academic libraries, whose mission is to support the academic curricula and research needs of their students and faculty (Oder, 1998).

INFOMINE, developed in 1994 at the University of California, Riverside, combines some 100,000 librarian-created links with 75,000 Web-crawler links.

It uses modified LC subject headings and “focused, automatic Internet crawling as well as automatic text extraction and metadata creation functions to assist our experts in content creation and users in searching” (http://infomine.ucr.edu/).

Page 23: Cataloging the Internet for the Sake of the User

Searching across multiple databases at one time frequently causes slow search speeds.

Because the databases have not been indexed locally, each search query is executed on the fly. This same limitation underlies general search engines’ inability to access information that resides on the Deep Web.

In this instance, library search engines do have access to the subscription databases; the search is simply too cumbersome because of the lack of indexing.

At the 1999 digital libraries conference in Santa Fe, several inherent problems with Z39.50 cross-searching were identified: the tools are too slow, results are limited, and they frequently time out (Rochkind, 2007).
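The contrast can be sketched as follows: a federated, Z39.50-style search must wait on every remote target at query time, while a locally indexed search is a single lookup against metadata gathered in advance. The targets and timings below are invented for illustration.

```python
import time

# Invented remote targets with the kind of latency a cross-search must
# absorb at query time; one slow or unresponsive target stalls the lot.
REMOTE_LATENCY = {"db_a": 0.5, "db_b": 2.0, "db_c": 8.0}  # seconds

def federated_search(term, timeout=5.0):
    """Query every remote database on the fly; slow targets time out."""
    results = {}
    for db, latency in REMOTE_LATENCY.items():
        if latency > timeout:
            results[db] = "TIMED OUT"
        else:
            time.sleep(latency)  # stand-in for the network wait
            results[db] = f"hits for {term!r}"
    return results

# With local indexing the waiting happened once, at harvest time; the
# query itself is a dictionary lookup against the merged index.
local_index = {"metadata": ["db_a/rec1", "db_b/rec7", "db_c/rec3"]}

def local_search(term):
    return local_index.get(term, [])

print(federated_search("metadata"))  # seconds of waiting, one timeout
print(local_search("metadata"))      # instant, and nothing is missed
```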

Page 24: Cataloging the Internet for the Sake of the User

Harvesting metadata into a single local index, commonly referred to as “local indexing”, offers an alternative.

This local indexing is the type used by Google Scholar and is what makes partnering with them so appealing for academic libraries.

A major roadblock to local indexing for many academic libraries is the lack of cooperation and permissions from their content providers.
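Mechanically, this harvesting is typically done with OAI-PMH, the protocol named in the conclusion below: the harvester issues ListRecords requests against a repository’s base URL and pages through the results with a resumption token. A minimal sketch, assuming a placeholder repository address and omitting error handling:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles(base_url):
    """Yield dc:title values from an OAI-PMH repository via ListRecords,
    following resumptionToken paging as the protocol specifies."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        with urllib.request.urlopen(
            base_url + "?" + urllib.parse.urlencode(params)
        ) as resp:
            root = ET.fromstring(resp.read())
        for title in root.iter(DC + "title"):
            yield title.text
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Subsequent requests carry only the token, per the protocol.
        params = {"verb": "ListRecords", "resumptionToken": token.text}

# Placeholder repository address; any OAI-PMH base URL would do:
# for t in harvest_titles("http://repository.example.edu/oai"):
#     print(t)
```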

Page 25: Cataloging the Internet for the Sake of the User

Content providers are beginning to give Google and Google Scholar access to their metadata, hoping for prominent placement in search results.

EBSCOhost and Gale have followed suit and allowed Google and other web crawlers to index their metadata.

The problem with partnering with Google Scholar is that libraries still do not know what Google has or has not indexed. “If libraries licensed full text or metadata by cooperating with the content provider, they could know exactly what they have in their index and be assured of its completeness” (Rochkind, 2007).

Page 26: Cataloging the Internet for the Sake of the User

In the course of a few short decades, most libraries have become, or will become, digital libraries on one scale or another.

At this point Google and Google Scholar, with their local indexing approach, have set the stage for academic and public libraries alike to adopt technology that allows their users to access information quickly, efficiently, and with verified authority.

Today’s student wants to access information in a seamless environment in a timely fashion.

It will be through the cooperative efforts of libraries, librarians, catalogers, and content providers, using the OAI-PMH harvesting process to transfer licensed content from providers to indexers, that today’s patrons will be able to access information in the time and format they require.

Page 27: Cataloging the Internet for the Sake of the User

Baruth, B. E. Is your catalog big enough to handle the Web? American Libraries, 31(7), 56-60.

Bergman, M. K. (2001). The deep Web: Surfing hidden value. Retrieved July 12, 2009, from http://brightplanet.com/

College students’ perceptions of libraries and information resources. (2006). OCLC. Retrieved from http://www.oclc.org/reports/pdfs/studentperceptions_conclusion.pdf

Dowling, T. P. (1997). The World Wide Web meets the OPAC: OhioLINK central catalog Web interface. ALCTS Newsletter, 8(2), A-D.

Glaser, R. Internet sites in the library catalog: Where are we now? Alabama Librarian, 56(2), 10-12.

Hawkins, L. (1997). Serials published on the World Wide Web: Cataloging problems and decisions. The Serials Librarian, 33(1-2), 123.

Jul, E. (1996). Why catalog Internet resources? Computers in Libraries, 16(1), 8.

Kuhlthau, C. C. (1991). Inside the search process: Information seeking from the user’s perspective. Journal of the American Society for Information Science (1986-1998), 42(5), 361.

Kyrillidou, M. (2000). Research library spending on electronic scholarly information is on the rise. The Association of Research Libraries. Retrieved July 11, 2009, from http://tinyurl.com/l5qm3c

Lubans, J. (1998, April). How first-year university students use and regard Internet resources. Retrieved July 10, 2009, from http://www.lubans.org/docs/1styear/firstyear.html

Lyman, P., & Varian, H. R. (2003). How much information? 2003. University of California, Berkeley, School of Information Management and Systems.

Page 28: Cataloging the Internet for the Sake of the User

Nichols releases MARCit for cataloging Internet resources. (1998). Information Today, 15(3), 51.

OCLC Internet Cataloging Colloquium. (1996). Proceedings of the OCLC Internet cataloging colloquium. OCLC.

Weitz, J., & Greene, R. O. (1998). Cataloging electronic resources: OCLC-MARC coding guidelines. OCLC.

O’Daniel, H. B. (1999). Cataloguing the Internet. Retrieved July 12, 2009, from http://associates.ucr.edu/heather399.htm

Oder, N. (1998). Cataloging the Net: Can we do it? Library Journal, 123(16), 47-51.

Porter, G. M., & Bayard, L. (1999). Including Web sites in the online catalog: Implications for cataloging, collection development, and access. The Journal of Academic Librarianship, 25(5), 390-394.

Rochkind, J. (2007). (Meta)search like Google. Library Journal, 132(3), 28-30.

Shafer, K. E. (1997). Scorpion helps catalog the Web: Research project at OCLC. Bulletin of the American Society for Information Science, 24(1), 28-29.

Taylor, A., & Clemson, P. (1998). Access to networked documents: Catalogs? Search engines? Both? Retrieved July 11, 2009, from http://worldcat.org/arcviewer/1/OCC/2003/07/21/0000003889/viewer/file9.html

Vine, R. (2004). Going beyond Google for faster and smarter Web searching. Teacher Librarian, 32(1), 19.

Vogelstein, F. (2009, August). Keyword: Monopoly. Wired, 58-65.