Archiving Web Content CE Course #285 Sheraton NY Hotel & Towers Sunday, June 8, 2003

Archiving Web Content

CE Course #285

Sheraton NY Hotel & Towers

Sunday, June 8, 2003

Introductions

Meet today’s panelists

Today’s Panelists

Barry Abisch – The Journal News
Olivia Kobelt – Christian Science Monitor
Mark Stencel – Washingtonpost.com
Janine Yagielski – CNN.com

Agenda

Introductions & session overview
Technology
Workflow and processes
Brainstorming
Break
Brainstorming recap
Role of the librarian
Building the business case
Closing comments & session evaluation

Technology

Panelist: Janine Yagielski

Technology Overview

What can be archived?
Preparing content to be archived
Storing and serving archived content
Searching archived content

What can be archived?

Overview of file formats (handout)
Dynamic and static content
Archiving presentation as well as content
Archiving secondary information about online content (traffic information)
Challenges of changing technologies

Overview of File Formats (handout)

Text formats
Image/graphic file formats
Video formats
Other definitions

Static and Dynamic Content

Static Content: Content that, once posted, does not change.

Example: Simple story or information page

Dynamic Content: Constantly changing content

Example 1: Weather data, Stock Prices

Example 2: Election Results, Sports Scores (fixed end point)

Static and Dynamic Content

Hybrid: Changes occasionally but does not have a predictable updating schedule or end point

Example: Top story with multiple and significant updates

Example: Home page or section page

Archiving Presentation and Content

CNN.com has built an internal system to archive some presentation

Home Page, US, World, Politics, International Edition

One week of pages
Every 30 minutes
Perl script
Size of archive: 55.4 MB
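The schedule on this slide (a snapshot every 30 minutes, kept for one week) can be sketched as a rolling window. This is a minimal illustration in Python, not the Perl script CNN.com used; the page keys, the file-naming scheme and the function names are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical identifiers for the five pages named on the slide.
PAGES = ["home", "us", "world", "politics", "international"]
INTERVAL = timedelta(minutes=30)   # capture frequency from the slide
RETENTION = timedelta(days=7)      # "one week of pages"


def snapshot_slot(moment):
    """Round a timestamp down to the start of its 30-minute slot."""
    minute = 0 if moment.minute < 30 else 30
    return moment.replace(minute=minute, second=0, microsecond=0)


def snapshot_name(page, moment):
    """Build a sortable file name (the naming scheme is an assumption)."""
    return f"{page}-{snapshot_slot(moment):%Y%m%d-%H%M}.html"


def expired(snapshots, now):
    """Return names of snapshots that have aged out of the one-week window.

    `snapshots` maps a file name to the time the snapshot was taken.
    """
    cutoff = now - RETENTION
    return sorted(name for name, taken in snapshots.items() if taken < cutoff)


snapshots_per_page = RETENTION // INTERVAL          # slots in one week
total_snapshots = snapshots_per_page * len(PAGES)   # files held at steady state
```

With these numbers each page has 336 slots per week, or 1,680 snapshots across the five pages. If the 55.4 MB figure covers all five pages for the full week (the slide does not say), that works out to roughly 33 KB per snapshot.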

Archiving Secondary Information about Online Content (traffic)

CNN.com has an extensive Webstats reporting system that parses and archives the information from Web server logs.

Simple statistics: Page Views, hits (back to 1996)

Advanced statistics: Unique users, time spent, IP address, OS, browser

Real Time Monitor: tracks click-through rates of links
Home and US pages
One week of info on links
Tracks average and peak for links
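To make the difference between hits and page views concrete, here is a minimal sketch of parsing Web server log lines. It assumes the logs are in Common Log Format and that a "page view" is a successful request for an HTML page or directory; the slides specify neither, and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumption: Common Log Format. The slide does not name the log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

# Assumption: these suffixes identify pages, as opposed to images or style sheets.
PAGE_SUFFIXES = (".html", ".htm", ".shtml", "/")


def summarize(lines):
    """Count hits (all requests), page views per path, and distinct addresses."""
    hits = 0
    page_views = Counter()
    addresses = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip malformed lines rather than guess at their content
        hits += 1
        addresses.add(match["ip"])
        path = match["path"].split("?", 1)[0]
        if match["status"] == "200" and path.endswith(PAGE_SUFFIXES):
            page_views[path] += 1
    return {"hits": hits, "page_views": page_views, "addresses": len(addresses)}


# Invented sample lines: two page requests and one image request.
sample = [
    '10.0.0.1 - - [08/Jun/2003:10:00:00 -0400] "GET /US/ HTTP/1.0" 200 20480',
    '10.0.0.1 - - [08/Jun/2003:10:00:01 -0400] "GET /images/logo.gif HTTP/1.0" 200 1024',
    '10.0.0.2 - - [08/Jun/2003:10:00:05 -0400] "GET /US/ HTTP/1.0" 200 20480',
]
report = summarize(sample)
```

Here the three requests count as three hits but only two page views. Distinct IP addresses are only a rough stand-in for unique users, since many users can share one address.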

Challenges of Changing Technology

Interdependencies of the Web make it difficult to maintain old content when optimizing for new content.

Examples: .shtml pages, Vivo video, some Shockwave, other antiquated multimedia technology based on plug-ins

Preparing Content to be Archived

Directory Structure/Database
Key to consistency and automation in subject-specific archives.
Example: cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/

Slug conventions
Provide an additional method of automating archiving.

Examples: sprj; sprj.nitop; .ap
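A consistent directory structure lets a script recover a story's date, section and slug from the URL alone. The sketch below assumes the path layout is year/SECTION/subsection/month/day/slug, as in the example URL on the slide; the class and function names are hypothetical, and the meaning of individual slug parts is not stated in the slides.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StoryLocation:
    published: date
    section: str
    subsection: str
    slug: str
    slug_parts: list


def parse_story_path(path):
    """Split a [host/]YYYY/SECTION/subsection/MM/DD/slug/ path into fields."""
    parts = path.strip("/").split("/")
    if parts and not parts[0].isdigit():
        parts = parts[1:]  # drop a leading host name such as cnn.com
    if len(parts) != 6:
        raise ValueError(f"unexpected path layout: {path!r}")
    year, section, subsection, month, day, slug = parts
    return StoryLocation(
        published=date(int(year), int(month), int(day)),
        section=section,
        subsection=subsection,
        slug=slug,
        slug_parts=slug.split("."),
    )


location = parse_story_path(
    "cnn.com/2003/WORLD/meast/06/02/sprj.nitop.political.council/"
)

# Slug conventions then become simple prefix or suffix tests.
in_nitop_series = location.slug.startswith("sprj.nitop")
is_ap_story = location.slug.endswith(".ap")
```

An archiving job could use tests like these to route every story carrying a given slug prefix into one subject-specific archive.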

Preparing Content to be Archived

Content Management System
Imposes and uses directory structure to prepare content for publication, syndication and, in some cases, archiving and searching.

Metadata in stories on publish
<meta name="DESCRIPTION" content="A U.S. soldier was killed and five were wounded early Thursday in the Iraqi city of Fallujah, the U.S. Central Command announced -- the latest casualties in the city, which has become a center of resistance.">
<meta name="AUTHOR" content="">
<meta name="SECTION" content="WORLD">
<meta name="SUBSECTION" content="meast">
<meta name="DATE" content="2003-06-05 05:22:20">
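Metadata embedded this way can be read back out of a published page by an archiving or indexing job. The following is a minimal sketch using Python's standard `html.parser`; the `MetaCollector` class is hypothetical, and the sample page reuses only the short tags from the slide.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a story page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Normalize names to upper case, matching the tags on the slide.
            self.fields[name.upper()] = attributes.get("content") or ""


page = """<html><head>
<meta name="AUTHOR" content="">
<meta name="SECTION" content="WORLD">
<meta name="SUBSECTION" content="meast">
<meta name="DATE" content="2003-06-05 05:22:20">
</head><body>...</body></html>"""

collector = MetaCollector()
collector.feed(page)

# The DATE value on the slide follows a year-month-day hour:minute:second layout.
published = datetime.strptime(collector.fields["DATE"], "%Y-%m-%d %H:%M:%S")
```

Note that the AUTHOR field is empty in the slide's example, so any downstream process should expect blank values.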

Preparing Content to be Archived

XML (Extensible Markup Language)
CNN.com produces an XML file with every story for site search. We also produce XML feeds of story headlines and other data sent to syndication partners.

Metadata and XML for Multimedia
CNN.com is looking into ways to insert metadata and produce XML feeds of non-traditional stories. Currently there is only an internal, manual process for archiving the location and subject of interactive (pop-up) content.
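As an illustration of a per-story XML file, here is a minimal sketch that serializes a story's metadata with Python's standard `xml.etree.ElementTree`. The slides do not show CNN.com's schema, so the element names, the `url` attribute, the headline and the example URL are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Metadata fields to copy into the XML record (names borrowed from the meta tags).
FIELDS = ("SECTION", "SUBSECTION", "DATE", "DESCRIPTION")


def story_to_xml(url, headline, meta):
    """Serialize one story's metadata as a small XML document (assumed schema)."""
    story = ET.Element("story", attrib={"url": url})
    ET.SubElement(story, "headline").text = headline
    for name in FIELDS:
        if meta.get(name):  # skip missing or empty fields
            ET.SubElement(story, name.lower()).text = meta[name]
    return ET.tostring(story, encoding="unicode")


xml_text = story_to_xml(
    "example.com/2003/WORLD/meast/06/05/example.story/",  # hypothetical URL
    "Example headline",                                    # hypothetical headline
    {"SECTION": "WORLD", "SUBSECTION": "meast", "DATE": "2003-06-05 05:22:20"},
)
```

Because the record is built from the same fields the CMS already writes into each page, the same step could in principle feed both site search and syndication partners.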

Storing and Serving Archived Content

Simple storage of content:
Content servers
Burn to CD
Web servers (internal and external even if not served)
Tape backup

Serving to internal users:
Image query
Directory browsing on the inside Web servers
Content purged from outside available (AP, partner stories)
Limited space on internal Web server (36 GB)

Serving to All External Users

All unique URLs published on CNN.com from the launch of the site are still available, unless there was an editorial decision to remove or redirect a URL.

CNN video is hosted by AOL. Because of changes in hosting and in the capacity of video servers, not all previous video streams are available.

Serving to All External Users

Web servers/NFS Server

Hardware: Sun and Intel (running Linux)

Cost: $10,000-$15,000 (Sun), $5,000 (Intel)

Capacity: Storage capacity expanded by adding additional hard drives. Serving capacity varies by content. HTML -- 25K hits/minute; images, style sheets -- 60-70K hits/minute

Video Servers

Hardware: Reconfigured, video-dedicated Web server

Cost: $1,500-$3,000

Capacity: Depends on length and size of video and disk space

Serving to Select Users

Registration:
E-mail newsletters
New e-mail alerts
Backend Oracle database
JSPs dynamically served

Subscription Video:
Real Networks handles CNN.com’s subscriber authentication

Searching Archived Content

Searching for internal users

Limited functionality for internal materials. Graphics image search. New publishing tools will incorporate a search of external content.

Searching for external users

Site Search: Run by AOL. The CMS produces and publishes (restricted by IP) XML files for every story. At set intervals, AOL picks up the XML files and uses them to produce CNN.com’s internal search results.

Web Search: Powered by Google. Sponsored links from Overture. Both sets of results are returned to CNN.com as XML feeds, which are published on a CNN.com template.

Video/multimedia search: Exploring
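The site-search flow above (per-story XML files picked up at intervals and turned into search results) can be pictured as building an index from those files. This is a minimal sketch of a simple inverted index, not a description of AOL's system; the XML layout, the function names and the sample stories are assumptions.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(xml_documents):
    """Map each lowercase word to the set of story URLs that contain it."""
    index = defaultdict(set)
    for document in xml_documents:
        story = ET.fromstring(document)
        url = story.get("url")
        text = " ".join(story.itertext())
        for word in WORD.findall(text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word, sorted for stable output."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Invented story records in an assumed layout.
feeds = [
    '<story url="/a/"><headline>Council meets in Baghdad</headline></story>',
    '<story url="/b/"><headline>Council vote delayed</headline></story>',
]
index = build_index(feeds)
both = search(index, "council")
only_a = search(index, "council baghdad")
```

Rebuilding or updating the index each time new XML files are picked up would keep search results in step with what the CMS has published.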

Workflow

Panelist: Olivia Kobelt

Workflow Overview

Types of web content – what do we archive?
Archiving old content
Internal vs. external archive
Making corrections/fixes
Searchability
Current workflow
Systems we use
Future vision

Brainstorming!

Break!

Be back in 15 minutes!

Brainstorming Recap!

Legal compliance vs. business user or need
Copyright – can you archive someone else’s content, partner content?
Talking to IT about what the requirements are
How do you approach gathering user requirements?
Who are users?
What are retention criteria? (date, size of files, originals/drafts/versioning, exclude search, business value)
Hierarchy starting at bottom with knowledge, corporate, business use/reuse, compliance, vital records
How to capture and keep the hybrid web pages?
What software applications are available?
Microfilm archiving?
What tools are available to automate the archival process?
Where do we begin? Seeking advice in relation to storage, retrieval, technology, etc.
What type of information/literature is available on the topic of archiving web data?
Selling the idea to management
Archiving “how it looked”
How did we do it? Examples of how a project was done.
Measure what people are trying to find in older files
Managing the customer service side of it

Role of the Librarian

Panelist: Barry Abisch

Librarian Role Overview

You are the expert.
What do you need?
What do readers need?
A news Web site has as much in common with a library as it does with a newspaper.
Become familiar with your newspaper’s Web site.
If it is politically correct, insist that you be consulted on all matters relating to both archiving and searching.
If you can't insist, at least offer your services. Odds are, your online editor will welcome the offer.

Building A Business Case

Panelist: Mark Stencel

Business Case Overview

What’s worth saving
Making money
Indirect revenue
Costs and challenges
Getting credit

Does It Pay To Save?

Key points:
Your news organization can profit from its archive of original online content
Making money isn’t always profitable (your business case should account for the cost of doing business, not just revenue)

Original Content

Breaking news stories
Standing text (FAQs, online guides and primers)
Video/Audio
Photo Galleries
E-mail Newsletters
Interactive Discussions/Chats
Databases (listings, scores)

Making Money

Sponsorships (e.g., local visitor guides)
Resale (paid archives; research services, such as LexisNexis, Factiva; online reprint rights)

Note: Few good models for selling non-text content (video, audio, galleries)

Indirect Revenue

Promotion (can archived content attract more online users or even print or online subscribers?)

Registration (will users provide valuable e-mail addresses or other personal information in exchange for access to content)

Costs and Challenges

Do systems, process, equipment or personnel cost more than you can make?

Rights Management (which content do you have legal rights to use, re-use, or re-sell online)

Content Management (publishing systems and file/directory management for keeping track of where your content is)

Costs and Challenges (cont’d.)

Fulfillment and Customer Service (supporting services you provide to the public or to partners)

Revenue Shares (accounting for your partner’s shares)

Coordinating With Parents or Siblings (do your plans fit in or conflict with the overall business goals/strategies of your chain?)

Costs and Challenges (cont’d.)

Hosting (server space, streaming)
Un-hosting (time and effort to delete or de-link content; automatically deleting content vs. selectively maintaining content)

Get Credit!

Make sure your department gets credit for any revenue it generates, not just the bill for the cost of providing money-making content and services.

Questions & Answers

Closing remarks

Please complete an evaluation form.

Suggested Resources

“The Archival Black Hole” by Scott Kirsner, Editor & Publisher, 9/19/98
“Archiving the Internet” by Brewster Kahle, Scientific American, 11/4/96
“Nothing But Net, Preserving the Internet, 1 Terabyte at a Time” by Bill Barnes, Slate.msn.com
“It Was Here a Minute Ago!”: Archiving the Net by Susan E. Feldman, Searcher: The Magazine for Database Professionals
“SCC systems archiving billions of bytes at newspapers,” Newspapers & Technology, March 2000

http://www.archive.org