leslie johnston: library big data repository services, open repositories 2012

Big Data Challenges in Repository Development

Leslie Johnston, Library of Congress

Open Repositories 2012

Why is the Library of Congress at conferences on Repositories? LC doesn’t run an IR or have faculty or collect Research Data (yet).

But LC does have digital collections and runs numerous repository services.

What are the Biggest Insights that we have Learned in Fifteen Years of Building Digital Collections?

We can never guess every way that our collections will be used.

Researchers do not use digital collections the same way that they use analog collections.

Stewardship organizations have, until recently, spoken of “collections” or “content” or “records” or even “files,” but not data.

Data is not just generated by satellites, identified during experiments, or collected during surveys.

Datasets are not just scientific and business tables and spreadsheets.

I should not need to convince this audience that we have Big Data in our Libraries, Archives and Museums.

More and more researchers want to use Library, Archive, and Museum (LAM) collections as a whole, mining and organizing the information in novel ways.

Researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus.

Researchers use algorithms to mine the rich information and tools to create pictures that translate that information into knowledge.

Consider the Digging Into Data Challenge

Repositories available for research include not only scientific information (astronomy, botany, geology, physics, biology, social science surveys) but also:

– Art (Canadian Art Database, New York Public Library Gallery)

– Film and Audio (Prelinger Archive, Western Soundscape Archive, JISC Media Hub)

– Newspapers (National Digital Newspaper Program)

– Maps (Sanborn Fire Insurance Maps)

– Music (English Broadside Ballad Archive)

– Archaeology (Archaeology Data Service)

– Architecture (ArtSTOR)

– Journal Articles (Project Muse, JSTOR)

– Government Records (National Archives UK, National Library of Wales, NTIS)

What new, computationally-based research methods might be applied to such collections? And how can we make such collections available for computational use?

http://www.diggingintodata.org/


These Collections are “Big Data”? What Constitutes Big Data?

The definition of Big Data is very fluid. It is a moving target (whatever cannot be easily manipulated with common tools) and it is specific to the organization: what can be managed and stewarded by any one institution in its infrastructure. One researcher’s or organization’s concept of a large data set is small to another.

Not too long ago, an organization would be surprised to need 10 TB of storage for a large digital collection. Now a collection can increase by more than 10 TB in a single week.

We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment.

Now our collections are, more often than not, self-serve.

How do Big Data collection needs translate into Repository Services?

Case Study: Web Archives

• Web archives, such as the one at the Library of Congress, may comprise billions of files.

• When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. Our first researchers did want to know about all those topics, but they used scripts to query for them and sort them into categories. They were not very interested in reading web pages.

• The Library is testing tools for full-text indexing of the entire archive and collection subsets.

http://www.loc.gov/webarchiving/


Case Study: Web Archives

The challenges for developing repository and researcher services are twofold:

– How are web archives best validated, characterized, and structurally documented as web sites upon ingest?

– How are such large and varied collections described, indexed and made both discoverable and analyzable?

Case Study: Historic Newspapers

• The Chronicling America collection has 5 million page images from historic newspapers with OCR from organizations in 25 states.

• The site gets approximately 4 million views per day.

• Some researchers want to search for stories in historic newspapers.

• Some researchers want to mine newspaper OCR for trends across time periods and geographic areas.

• Requests have come in to analyze all 5 million page images.

http://chroniclingamerica.loc.gov/


Case Study: Historic Newspapers

• This form of content is well understood in terms of repository ingest, but grows at a very quick rate and requires constant processing.

• To accommodate unmediated research use, all NDNP content – metadata, OCR, and page images – is available via an API. Will we develop collection- or content-specific APIs for repositories?
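
To make "unmediated research use via an API" concrete, here is a minimal sketch of how a researcher's script might build a search request and tally the results by year. It is an illustration only: the `/search/pages/results/` path, the query parameter names, and the `date` field layout (`YYYYMMDD`) are assumptions modeled on the public Chronicling America search interface, and the function names are hypothetical. Check the current API documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed search endpoint; the host is the Chronicling America site,
# but the path and parameter names below are assumptions.
BASE_URL = "http://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(term, year_from, year_to, page=1):
    """Build a URL asking for OCR pages that mention `term` in a year range."""
    params = {
        "andtext": term,                 # assumed: full-text search term
        "date1": year_from,              # assumed: first year of the range
        "date2": year_to,                # assumed: last year of the range
        "dateFilterType": "yearRange",   # assumed: interpret dates as years
        "format": "json",                # assumed: request JSON, not HTML
        "page": page,                    # assumed: results page number
    }
    return BASE_URL + "?" + urlencode(params)


def count_pages_by_year(items):
    """Tally result records by year, assuming each has a 'date' like '19181005'."""
    counts = {}
    for item in items:
        year = item["date"][:4]
        counts[year] = counts.get(year, 0) + 1
    return dict(sorted(counts.items()))
```

A script would fetch each results page from the URL, pass the decoded records to `count_pages_by_year`, and chart the counts, without ever viewing a page image in a browser.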

Case Study: Twitter

• The Twitter archive has tens of billions of tweets in it.

• Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.

[Figure: word cloud of research topics, including status, privacy, commercial, personal, events, social media, visualization, social science]

Case Study: Twitter

• What infrastructure is required to process, ingest, and index collections at that scale?

• What services should libraries consider supplying to researchers for the analysis and use of such large collections?

Case Study: Viewshare

The Challenges:

• We have different kinds of metadata, and it is messy

• That data comes from multiple different systems

• Much of that data is not in the format researchers might wish it were

Digital cultural heritage collections include temporal, locative, and categorical data that could be tapped to interact with those collections more dynamically and to understand them better.

Viewshare: Functionality to review metadata, create visualizations, and expose the raw data for others to create their own views.

Data use and re-use

http://viewshare.org/

Case Study: Viewshare

• Viewshare can be used to analyze and visualize digital collections, and enable users to share not only the end results, but also the raw data for others to create their own views.

• While Viewshare has been launched as a standalone visualization service for digital collections and released as open source software, it is also a candidate for a visualization service on top of a repository datastore.

There are so many other possible use cases…

One 100 GB delivery of electronic journal articles received via mandatory copyright deposit had over 1 million files in it. The article files and accompanying metadata are highly variable, requiring normalization and description before ingest.

Born-digital broadcast television is coming our way through Copyright registration and deposit.

Petabytes of files have been created through digital conversion at the Culpeper audio and video facility.

What are the questions that all of our institutions are grappling with as we build large digital collections and discover new ways in which they can be managed and used through repository services?

Can each of our organizations support ingest, processing, and (hoped-for) real-time querying of billions of full-text items?

Can we support the frequent downloading by researchers of multiple collections that may be over 200 TB each?
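
A back-of-the-envelope calculation shows why download traffic at this scale is a real question. The sketch below estimates how long one 200 TB transfer would take; the link speeds are hypothetical, 1 TB is taken as 10^12 bytes, and the optional `efficiency` factor is a made-up stand-in for protocol overhead and contention.

```python
def transfer_days(size_tb, link_gbps, efficiency=1.0):
    """Estimate days needed to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 1e12 * 8                        # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)  # bits / (bits per second)
    return seconds / 86400                           # seconds -> days


# One 200 TB collection over a fully dedicated link (hypothetical speeds):
print(round(transfer_days(200, 1), 1))    # 1 Gbit/s  -> 18.5 days
print(round(transfer_days(200, 10), 1))   # 10 Gbit/s -> 1.9 days
```

Even under these ideal assumptions, a single researcher pulling a single collection would occupy a 1 Gbit/s link for more than two weeks, so "frequent" downloads of multiple collections quickly become an infrastructure decision, not just a policy one.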

Should/Can we provide tools for collection analysis and visualization?

The Library of Congress is proceeding on multiple fronts

The development of a variety of repository services that will be used to ingest and inventory Big Data collections.

The ingest and inventory of such collections, other than scale, is basically understood.

How much ingest processing should be done with data collections, or collections that can be treated as data?

Do we process collections to create a variety of derivatives that might be used in various forms of analysis before ingesting them?

Do we have sufficient infrastructure to create full-text indexes for billions of files to support full discovery?

Do we load collections into analytical tools, such as BigInsights or Greenplum? These products are still in their early days at the scale of billions of files.

LC will be benchmarking ingest and indexing processes in three hardware environments: a standard server, a Hadoop cluster, and in the cloud.

And what are the service models?

If we decide that we will simply provide access to data, do we limit it to the native format or provide pre-processed or on-the-fly format transformation services for downloads?

Can we handle the download traffic?

Can our staff develop the expertise to provide guidance to researchers in using analytical tools?

Or do we leave researchers to fend for themselves?

The Library is increasingly looking towards self-service: researchers need not ask to download or tell us that they have. We may never know.

BUT, we do have collections that are limited to on-site only access due to licenses or gift agreements. In that case, we may have to provide high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them.

Both have policy implications and implications for public service staffing.

Questions?

Leslie Johnston
