sandhaus, van valkenburg, cotler; nyt technical team: the future of the past

The Future of The Past

The New York Times and the Challenge of Archives

Evan Sandhaus, Sophia Van Valkenburg

Jane Cotler

The New York Times

@nytarchives

A Problem of Archives

“How do you faithfully represent information created with one technology using another?”

A Problem We Know Well

• Migrating The Index to The Times Information Bank

• Migrating The Microfilm Archive to TimesMachine

• Migrating Legacy Web Content to Modern Online Presentation (or the challenge of multiple legacy formats)

The Problem By The Numbers

Almost 60,000 Issues Published Since September 18, 1851

3,500,000+ Unique Pages Printed Since September 18, 1851

15,000,000+ Articles Published Since September 18, 1851

Digital Archives

Full Text NYT5

Full Text NYT4

Abstracts NYT4

Abstracts NYT5

The New York Times Information Bank

The Index

Evan Sandhaus

The New York Times Company Archives

TimesMachine

The Deep Archive

[Chart: Scanned Articles, Digital Articles, Blogs (axis ticks: 135000, 180000; shares shown: ≈75%, ≈25%)]

The Deep Archive

The Numbers

46,592 Issues Published Since September 18, 1851

The Numbers

2,335,446 Unique Pages Printed Since September 18, 1851

The Numbers

11,298,320 Articles Published Since September 18, 1851

The Scanned Archive

Headline

CROWD ROARS THUNDEROUS WELCOME; Breaks Through Lines of Soldiers and Police and Surging to Plane Lifts Weary Flier from His Cockpit AVIATORS SAVE HIM FROM FRENZIED MOB OF 100,000 Paris Boulevards Ring With Celebration After Day and Night Watch -- American Flag Is Called For and Wildly Acclaimed

The Scanned Archive

Lede Paragraph

PARIS, May 21. -- Lindbergh did it. Twenty minutes after 10 o'clock tonight suddenly and softly there slipped out of the darkness a gray-white airplane as 25,000 pairs of eyes strained toward it. At 10:24 the Spirit of St. Louis landed and lines of soldiers, ranks of policemen and stout steel fences went down before a mad rush as irresistible as the tides of the ocean.

The Scanned Archive

“Dirty” ASCII

…Lifte Fro'm His Cockpit. As he was lifted to the ground Lindbergh w as l,-:k:, :::. - hair unkempt, he looked completely worn out. lle h-:: strength enough, however, to smile, and waved his hand to t? ' crowd. Soldiers with fixed bayonets were unable to keep bach the crowd. United States Ambassador Herrick was among the first to welcome and congratulate the hero.s…

The Scanned Archive

Indexing Metadata

Headings: People, Places, Organizations, Subject

Abstracts: Concise summary of the facts in the article

TimesMachine Version 2.0

Archive Transcription

The Problem

• As a subscriber exclusive, TimesMachine does not appear in Google Search results.

• Lack of full text before 1980 makes it difficult to rank, or even appear, in Google results.

• For example: In 1945 The Times published 161,961 articles and only a tiny fraction appear in Google results.

The Solution

• Transcribe articles from archival scans and publish these assets as searchable pages on nytimes.com.

• Transcribe and publish 1964 as a pilot.

• If that works, transcribe and publish all remaining articles between 1960 and 1980.

Progress & Results

• All articles between 1960-1980 transcribed.

• All articles between 1970-1979 available on nytimes.com with more to come.

• Google now indexing 672,500 new assets published between 1970-1979!

• Plans to publish 1960-1969, and to monitor performance of new pages.

Online Archive Modernization

Archival Content on NYTimes.com

The Initial Solution

new format for CMS (JSON)

print data (XML)
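As a rough illustration of the XML-to-JSON step named on this slide, the sketch below converts one print-archive record into a JSON document. The element names (`headline`, `pubdate`, `body/p`), the JSON field names, and the sample record are all hypothetical; the slides do not show the real archive schema or the CMS format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical print-archive record; the real schema is not shown in the slides.
SAMPLE_XML = """
<article>
  <headline>Example Headline</headline>
  <pubdate>2004-01-01</pubdate>
  <body>
    <p>First paragraph of the example article.</p>
    <p>Second paragraph of the example article.</p>
  </body>
</article>
"""


def xml_to_cms_json(xml_text: str) -> str:
    """Map an XML article record onto an assumed JSON shape for a CMS."""
    root = ET.fromstring(xml_text.strip())
    record = {
        "headline": root.findtext("headline", default=""),
        "publication_date": root.findtext("pubdate", default=""),
        # Keep one list entry per <p> element, in document order.
        "paragraphs": [p.text or "" for p in root.iterfind("body/p")],
        "source": "print",
    }
    return json.dumps(record, indent=2)


print(xml_to_cms_json(SAMPLE_XML))
```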

The Case Of The Missing Articles

web data (HTML)

new format for CMS (JSON)

print data (XML)

The Case of the Missing Articles

1. What is the complete list of article URLs from 1996-2006?

2. How do we identify which of the missing web articles correspond to existing print articles so that we can combine them and avoid duplicate content?

3. Which articles are web-only and not in our print archive at all, and how do we scrape that page for content & metadata?

4. Can we build a system that will process all the data for each year easily & efficiently?

The Definitive List of Articles

4 different sources:

1. Print archive

2. Site analytics (from the past 6 months)

3. Movie, theater, and restaurant reviews

4. Sitemaps
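One way to picture combining the four sources into a single list is a set union over normalized URLs. The sketch below is only an assumption about how that could work: the normalization rules (drop the scheme, a leading `www.`, the query string, and a trailing slash) and the sample URLs are hypothetical placeholders, not taken from the talk.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # The query string and fragment are dropped; a trailing slash is ignored.
    return host + parts.path.rstrip("/")


def definitive_list(*sources):
    """Union all sources into one sorted, de-duplicated list of URLs."""
    urls = set()
    for source in sources:
        urls.update(normalize(u) for u in source)
    return sorted(urls)


# Placeholder URLs standing in for the four sources named on the slide.
print_archive = ["http://www.nytimes.com/2004/01/01/world/example-a.html"]
analytics = [
    "http://nytimes.com/2004/01/01/world/example-a.html?ref=home",
    "http://www.nytimes.com/2004/01/02/arts/example-b.html",
]
reviews = ["http://www.nytimes.com/2004/01/03/movies/example-c.html"]
sitemaps = ["http://www.nytimes.com/2004/01/02/arts/example-b.html/"]

urls = definitive_list(print_archive, analytics, reviews, sitemaps)
print(len(urls), "unique URLs")
```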

The Archive Migration Pipeline For A Given Year

archive XML

definitive list of URLs

extracted URLs

missing URLs

missing HTML

URLs with no article

XML to HTML

matches

unmatched HTML

JSON from XML and

JSON from unmatched

skipped files

JSON with no

duplicate
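The first stages labeled in the diagram (definitive list of URLs, extracted URLs, missing URLs, missing HTML, URLs with no article) can be read as a set difference followed by a fetch. The sketch below follows that reading; the function names and the `fetch` callback are hypothetical, and the real pipeline may be organized differently.

```python
def partition_urls(definitive, extracted, fetch):
    """Split the definitive list by what the archive XML already covers.

    `fetch` is a hypothetical callback that returns the page HTML for a URL,
    or None when no article exists at that address.
    """
    # URLs on the definitive list that the archive XML does not account for.
    missing = sorted(set(definitive) - set(extracted))
    missing_html = {}
    no_article = []
    for url in missing:
        html = fetch(url)
        if html is None:
            no_article.append(url)
        else:
            missing_html[url] = html
    return missing, missing_html, no_article


# Tiny stand-in for the live site: one real page, one dead URL.
fake_site = {"nytimes.com/2004/b.html": "<html><h1>Example</h1></html>"}
definitive = ["nytimes.com/2004/a.html", "nytimes.com/2004/b.html", "nytimes.com/2004/c.html"]
extracted = ["nytimes.com/2004/a.html"]

missing, missing_html, no_article = partition_urls(definitive, extracted, fake_site.get)
print(missing, list(missing_html), no_article)
```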

The Archive Migration Pipeline

2004 Articles (116K total)

• Print Archive (56K): 48.3%

• Print Archive and Web (42K)

• Web-only (15K)

• Bad URLs (3K): 3%
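The shares behind the 2004 breakdown can be checked with a few lines of arithmetic. The counts below are the rounded figures from the slide (in thousands), so the computed percentages are approximate; 56K of 116K comes to about 48.3%, which matches the figure shown.

```python
# Rounded 2004 counts as given on the slide.
counts = {
    "Print Archive": 56_000,
    "Print Archive and Web": 42_000,
    "Web-only": 15_000,
    "Bad URLs": 3_000,
}

total = sum(counts.values())
# Percentage of the year's articles in each category, to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```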

All The Little Things…

• 1996

• Article Matching

• Better URLs

• Quality Assurance

• Next Steps

Article Matching: Fusion

Fusion Explained

web data (HTML)

print data (XML)
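The slides do not say how web and print versions of an article were paired, so the following is only a sketch of one plausible approach: treat two records as the same article when they share a publication date and their normalized headlines are nearly identical, then merge them into one record. The similarity threshold, the field names, and the choice to take the URL from the web copy are all assumptions.

```python
from difflib import SequenceMatcher


def norm(title: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences do not matter."""
    return " ".join(title.lower().split())


def find_match(web, print_articles, threshold=0.9):
    """Return the print record that best matches a web record, or None."""
    best, best_score = None, 0.0
    for candidate in print_articles:
        if candidate["date"] != web["date"]:
            continue  # assumption: both copies carry the same publication date
        score = SequenceMatcher(None, norm(web["headline"]), norm(candidate["headline"])).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


def fuse(web, print_article):
    """Merge a matched pair: print fields plus the web URL (an assumed policy)."""
    fused = dict(print_article)
    fused["url"] = web["url"]
    fused["sources"] = ["print", "web"]
    return fused


print_articles = [
    {"id": 1, "date": "2004-01-01", "headline": "Example headline about archives"},
    {"id": 2, "date": "2004-01-01", "headline": "A completely different story"},
]
web = {"url": "/2004/01/01/example.html", "date": "2004-01-01",
       "headline": "Example  Headline About Archives "}

match = find_match(web, print_articles)
print(fuse(web, match) if match else "web-only article")
```

Records for which `find_match` returns None would fall into the "unmatched HTML" bucket and be treated as web-only.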

Search Engine Optimization

27iht-scoutus.t.html

Search Engine Optimization

curb-violates-free-speech-supreme-court-rules-72-justices-void-internet.html
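The second filename above reads like a slug built from headline words. A minimal sketch of that kind of URL generation follows; the exact rules used in the migration are not given in the slides, so the character handling, the word limit, and the sample headline here are assumptions.

```python
import re
import unicodedata


def slugify(headline: str, max_words: int = 10) -> str:
    """Build a readable, hyphen-separated filename from a headline."""
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", headline).encode("ascii", "ignore").decode("ascii")
    # Drop everything except letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    words = text.split()[:max_words]
    return "-".join(words) + ".html"


# Hypothetical headline, used only to exercise the function.
print(slugify("Justices Rule, 7-2: An Example Headline!"))
```

Removing punctuation before splitting means a token such as "7-2" collapses to "72"; a different policy could turn the hyphen into a word break instead.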

The Case Of The Missing Sections

Next Steps

Full Text

No Full Text

Next Steps

Photos

Next Steps

Digital preservation

To Conclude…

Thank You!

Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler

The New York Times
