
The Cornell Web Library (or Laboratory)

William Y. Arms

A Very Large Digital Library for Research on the History of the Web


Research Team

Faculty

William Arms, Geri Gay, Dan Huttenlocher, Jon Kleinberg, Michael Macy, David Strang

Cornell Theory Center

Manuel Calimlim, Dave Lifka, Ruth Mitchell, and the Petabyte Data Store team

Ph.D. Students

Selcuk Aya, Pavel Dmitriev, Blazej Kot, with more than 20 M.Eng. and undergraduate students

Internet Archive

Brewster Kahle, Tracey Jacquith, John Berry


The Internet Archive


Archive of www.cni.org


www.cni.org in 1998


The Internet Archive Web Collection

The Data

Complete crawls of the Web, every two months since 1996, with some gaps:

• Range of formats and depth of crawl have increased with time.

• No data from sites that are protected by robots.txt or where owners have requested not to be archived

• Some missing or lost data

• Metadata contains format, links, anchor text

• Organized to facilitate historical access to a known URL (Wayback Machine)


Outline

Web-scale research

Building the Web Lab

Observations about very large digital libraries


Web-scale Research

Interviews with researchers

Interviews with 15 faculty and graduate students from social sciences and computer science.

• Focused Web Crawling

• The Structure and Evolution of the Web

• Diffusion of Ideas and Practices

• Social and Information Networks


NSF Cyberinfrastructure Tools

Sociology: Michael Macy (Principal Investigator), David Strang

Computing and Information Science: Bill Arms, Dan Huttenlocher, Jon Kleinberg

Very Large Semi-Structured Datasets for Social Science Research

"Computer scientists have learned through experience that it is usually best to build software tools in close collaboration with users. Hence, our proposal is two-fold – to build an intelligent front-end that will make the Internet Archive data broadly accessible to social scientists, and to develop, test, and refine these tools through a specific research application – the diffusion of innovation."

Began January 2006


Social Science Research

The Web as a social phenomenon

Political campaigns

Online retailing

Polarization of opinions

The Web as evidence of current social events

The spread of urban legends ("The Internet is doubling every six months")

Development of legal concepts across time


An Outsider's View of Diffusion Research (Current)

Ryan and Gross (1943)

Studied factors that influenced adoption of hybrid corn

Hybrid corn, introduced by Iowa State in 1928

Adopted by most Iowa farmers by 1940.

Found communication between previous and potential adopters important

Found an S-shaped rate of adoption

Methodology

Hypothesis, e.g., a model of diffusion

Retrospective survey interviews (small sample size)

Coding of data, by hand with high quality control

Analysis of coded data by hand or by computer
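The S-shaped rate of adoption found by Ryan and Gross is commonly modeled with a logistic function. A minimal sketch, with illustrative parameters rather than values fitted to their data:

```python
import math

def adoption_fraction(t, rate=1.0, midpoint=6.0):
    """Logistic model: cumulative fraction of adopters at time t (in years)."""
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))

# Adoption starts slowly, accelerates, then saturates: the S shape.
curve = [round(adoption_fraction(t), 2) for t in range(13)]
```

The midpoint parameter is the year at which half of the population has adopted; the rate controls how steep the middle of the curve is.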


An Outsider's View of Diffusion Research (Future)

A vast collection of information that is used for many studies and experiments (with many gaps, known and unknown)

Automatic extraction of items (e.g., Web pages) that appear relevant to a hypothesis (using search methods that will have errors of inclusion and omission)

Automatic coding of these items for factors believed relevant to the hypothesis (using Artificial Intelligence methods that have significant error rates)

Analysis of the encoded data usually by computer


Social and Information Networks

Studying social networks intertwined with online information networks

A major area of research at Cornell

– In sociology, communication, economics, information science and computer science

People mediated by information artifacts and vice versa

– Not just social connections or just links between documents


Sources of Data

Model systems

– Cornell e-print arXiv (scientific topics, coauthorship/collaboration, trends over time)

– Usenet, with Marc Smith at Microsoft (conversational structure, topic dynamics)

Medium-scale systems

– On-line communities (LiveJournal, MySpace)

Web-scale

– Internet Archive data, showing the evolution of the Web, 1996-2006.


Small Scale: Evolution of the arXiv Networks

Changes in the arXiv citation network over time

– Number of edges grows superlinearly in the number of nodes: e ∝ n^1.69

– Average distance between nodes decreases over time

Challenges theoretical models in which diameter is a slowly growing function of number of nodes

Have similarly observed densification laws for many other networks
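A densification exponent like the 1.69 above can be recovered from a sequence of network snapshots by a log-log regression of edge count against node count. A minimal sketch on synthetic data (real measurements such as the arXiv network are noisier):

```python
import math

def fit_densification_exponent(nodes, edges):
    """Least-squares slope of log(e) against log(n): the exponent a in e ~ n^a."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic snapshots obeying e = n^1.69 exactly; real networks deviate.
nodes = [10_000, 20_000, 40_000, 80_000]
edges = [n ** 1.69 for n in nodes]
exponent = fit_densification_exponent(nodes, edges)
```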


Medium-Scale: Online Social Networking Systems

• A fundamental question in the diffusion of innovation:

– What is the probability an individual will adopt a new behavior, as a function of the number of his/her friends who are adopters?

– New behavior could be: adopting a new technology, joining a social movement, believing a rumor, …

• Large online social networks can have 100,000+ explicitly defined user “communities”.

– New behavior: choosing to join a community.
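The joining probability can be estimated by counting, for each non-member, how many friends were already members in one snapshot, then checking who had joined by a later snapshot. A toy sketch with hypothetical users and two snapshots:

```python
from collections import defaultdict

def join_probability(friends, members_t1, members_t2):
    """Estimate P(a non-member at time t1 joins by t2) as a function of
    the number of friends who were already members at t1."""
    trials = defaultdict(int)  # k -> non-members at t1 with k member-friends
    joins = defaultdict(int)   # k -> how many of those had joined by t2
    for user, user_friends in friends.items():
        if user in members_t1:
            continue
        k = sum(1 for f in user_friends if f in members_t1)
        trials[k] += 1
        if user in members_t2:
            joins[k] += 1
    return {k: joins[k] / trials[k] for k in trials}

# Toy snapshots: A and B are members at t1; only C has joined by t2.
friends = {"C": {"A", "B"}, "D": {"A"}, "E": {"X"}}
prob = join_probability(friends, {"A", "B"}, {"A", "B", "C"})
```

At the scale described here, close to one billion (user, community) instances populate these counts, giving an unusually smooth empirical curve.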


Joining a Community

• Unprecedented scale for such a curve

– Close to one billion (user, community) instances

• Most standard models predict S-shaped probability curves

• Further: machine learning to predict joining


Web-Scale: Network Evolution

• Researchers are acquiring a large vocabulary of patterns for static networks

– small-world, scale-free, preferential attachment, PageRank, hubs and authorities, bow-ties, bipartite cores, network motifs, …

• Because of the computational challenges, most of the standard results have been measured very few times, e.g., on early AltaVista crawls.

• Little is known about the characteristic ways in which networks grow over time

– What are the analogous collections of patterns?

• Most studies have used link structures. There have been few studies of the evolution of terminology over time


Building the Web Library/Laboratory

The Cornell petabyte data store allows us to mount many crawls of the Web online for a broad range of Web research.

• Copy snapshots of the Web from the Internet Archive

• Index snapshots and store online at Cornell

• Extract feature sets for researchers

• Provide APIs for researchers (program interface, download of datasets, Web Services API)

• Provide Web GUI for social science researchers


The Internet Archive Web Collection

Sizes

• Current crawls are about 40-60 TByte (compressed)

• Total archive is about 600 TByte (compressed)

• Compression ratio:

– up to 25:1
– best estimate of overall average is 10:1

• Rate of increase is about 1 TByte/day (compressed)

Total storage requirement at Cornell will differ because:

• Elimination of data that is duplicated between crawls

• Expansion of metadata for research

• Database indexes
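A quick sanity check of these figures, using only the compressed sizes and the best-estimate 10:1 ratio given above:

```python
# Back-of-the-envelope check of the figures on this slide.
total_compressed_tb = 600   # total archive, compressed
avg_compression = 10        # best-estimate overall ratio
growth_per_day_tb = 1       # compressed

total_uncompressed_pb = total_compressed_tb * avg_compression / 1000
growth_per_year_tb = growth_per_day_tb * 365
```

So the archive is roughly 6 PByte uncompressed and grows by several hundred compressed TByte per year, which is why duplicate elimination matters for the Cornell copy.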


Data Processing Overview


Scale of Data Processing

Balance of Resources

                  Ideal            Realistic
Networking        500 Mbit/sec     100 Mbit/sec
Data online       all              few crawls/year
Metadata online   all              all?
Disk              750 TB           240 TB
Tape archive      all              few crawls/year
Computers         research shared  separate with storage


Equipment

Scidata1 -- Initial Configuration

16-Processor Unisys ES7000 Servers
– 16 GByte RAM
– 8 GByte/sec aggregate I/O bandwidth

100 TByte RAID Online Storage

ADIC Scalar 10K robotic tape library for archive

Separate Web server

Near-term Expansion
• Disk capacity will expand to 240 TByte by end of 2007

Network
• Internet2 with dedicated 100 Mbit/sec link to the Internet Archive


Data Processing

Transfer 300-500 GByte per day

Internet2 -- 100 Mbit/sec maximum throughput

Archive raw data to tape

Process raw data

Uncompress and unpack Web pages (ARC) and metadata (DAT) files

Create IDs for pages and content hashes

Database load

Load batches of metadata about pages and links (MS SQL Server 2000)

Store compressed page files
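The page-ID and content-hash step is what makes cross-crawl de-duplication possible: identical page content hashes identically, so only one copy needs to be stored. A sketch of the idea; the actual hash function and storage layout are not specified on this slide, so SHA-1 and an in-memory dict are assumptions:

```python
import hashlib

def content_hash(body):
    """Hash of the page body; identical content across crawls hashes identically."""
    return hashlib.sha1(body).hexdigest()

def dedupe(pages):
    """pages: iterable of (url, date, body). Store each distinct body once;
    return metadata rows that reference the shared content hash."""
    store, rows = {}, []
    for url, date, body in pages:
        h = content_hash(body)
        store.setdefault(h, body)   # keep only the first copy of each body
        rows.append((url, date, h))
    return store, rows

store, rows = dedupe([
    ("http://example.org/", "2004-01", b"<html>v1</html>"),
    ("http://example.org/", "2004-03", b"<html>v1</html>"),  # page unchanged
    ("http://example.org/", "2004-05", b"<html>v2</html>"),
])
```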


Metadata

URLs, pages, and links:

• URLs contained in metadata may link to pages never crawled

• URLs are not canonicalized: different URLs may refer to the same page

• Links are from a page to a URL

Web graph:

• Nodes are the union of pages crawled and URLs seen

• Each node and edge has time interval(s) over which it exists

Content:

• Anchor text in more recent crawls

• File and mime types
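The canonicalization problem noted above (different URLs referring to the same page) can be illustrated with one plausible normalization; the rules below (lowercasing, dropping default ports, fragments, and trailing slashes) are illustrative, not the archive's actual policy:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """One plausible normalization: lowercase scheme and host, drop default
    ports and fragments, strip a trailing slash from the path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

# Two different spellings of the same URL map to one canonical form.
canonical = canonicalize("HTTP://WWW.CNI.ORG:80/index/")
```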


Current Status

Data Capture:

• Connection of Internet Archive to Internet 2 (October 2005)

• Parallel loading of crawls of DAT and ARC files (January 2006)

Storage:

• Relational database and preload system: under test

• Two complete crawls available April/May 2006

• Page store: preliminary design work


User Services


Services for Researchers

Under development (available in 2006)

API for users to extract data and download it to their own computers, or to process it on the Scidata1 computer

Retro Browser (browse the Web as it was on a given date)

Subset extraction (select dataset by query of a relational database with indexes by date, URL, domain name, file type, anchor text, etc.)

Extract Web graph from subset, organized as sparse matrix

Full text index of subset (using Nutch/Lucene)

Future

NLP and machine learning tools to analyze text

Full text index of entire collection
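Extracting the Web graph of a subset as a sparse matrix can be sketched in coordinate (COO) form, with each URL mapped to a small integer node id; the representation here is illustrative:

```python
def web_graph(links):
    """Build a COO-style sparse adjacency structure from (source, target)
    URL pairs, assigning each URL a small integer node id."""
    ids = {}
    rows, cols = [], []
    for src, dst in links:
        rows.append(ids.setdefault(src, len(ids)))
        cols.append(ids.setdefault(dst, len(ids)))
    return ids, rows, cols

links = [("a.org", "b.org"), ("a.org", "c.org"), ("b.org", "c.org")]
ids, rows, cols = web_graph(links)
out_degree_a = rows.count(ids["a.org"])
```

The (rows, cols) pairs are exactly the nonzero entries of the adjacency matrix, so they can be handed directly to sparse linear-algebra tools for computations such as PageRank.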


Use of the Library (Draft)

Custodianship of data

Make no changes to the content of the Web pages and the identifying metadata such as URL and date.

Copyright

Assume an implied license to use Web data for archiving and academic research. The library respects robots.txt exclusions and other requests from copyright owners.

Privacy

Research that might identify individuals is subject to standards that apply to research involving human subjects.

Authentication of users

All users of this library, whether at Cornell or elsewhere, are authenticated. Use restricted to academic, non-commercial research.


Very Large Scale Digital Libraries

Only the computer reads every word

• Researchers interact with the library through computer programs that act as their agents.

• Users rarely view individual items except after preliminary screening by programs.

• The library is a highly technical computer system that is used by researchers who are not computing specialists.

• The library is a super-computing application.

• Use of the library depends on automated tools. These tools require state-of-the-art indexes for text and semi-structured data, natural language processing, and machine learning.


Design Guidelines for Builders of Digital Libraries

Every online collection or service needs an application program interface (API) for computers to interact with the library.

A primary methodology is:
– select a subset of the collection

– download to the researcher's computer

– use programs on the researcher's computer to analyze the data

Almost all metadata will be computer generated, but human cooperative editing can correct errors (see Crane, D-Lib Magazine, March 2006)
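The select-download-analyze pattern can be sketched as follows; the record fields and query interface are invented for this sketch, not the actual Web Lab API:

```python
# Hypothetical illustration of the select-download-analyze pattern.
SAMPLE = [
    {"url": "http://www.cni.org/", "year": 1998, "mime": "text/html"},
    {"url": "http://www.cni.org/pub/", "year": 1998, "mime": "text/html"},
    {"url": "http://example.org/", "year": 2001, "mime": "text/html"},
]

def select_subset(predicate, collection=SAMPLE):
    """Step 1: server-side selection of a subset of the collection."""
    return [record for record in collection if predicate(record)]

# Step 2: 'download' the subset; Step 3: analyze it on the local machine.
subset = select_subset(lambda r: r["year"] == 1998)
domains = {r["url"].split("/")[2] for r in subset}
```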


Further Information

Web site: http://www.infosci.cornell.edu/SIN/

General overview: Arms, et al., "A Research Library based on the Historical Collections of the Internet Archive". D-Lib Magazine, February 2006. http://www.dlib.org/dlib/february06/arms/02arms.html.

Technical information: Arms, et al.,"Building a Research Library for the History of the Web". Joint Conference on Digital Libraries, 2006.


Thanks

This work would not be possible without the forethought and longstanding commitment of the Internet Archive to capture and preserve the content of the Web for future generations.

This work has been funded in part by the National Science Foundation, grants CNS-0403340, DUE-0127308, and SES-0537606, with equipment support from Unisys and by an E-Science grant and a gift from Microsoft Corporation.
