Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Upload: reynolds-journalism-institute-rji

Post on 19-Jan-2017

TRANSCRIPT

Page 1: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Future of The Past

The New York Times and the Challenge of Archives

Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler

The New York Times

@nytarchives

Page 2: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past
Page 3: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

(us)

Page 4: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

A Problem of Archives

“How do you faithfully represent information created with one technology using another?”

Page 5: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

A Problem We Know Well

• Migrating The Index to The Times Information Bank

• Migrating The Microfilm Archive to TimesMachine

• Migrating Legacy Web Content to Modern Online Presentation (or the challenge of multiple legacy formats)

Page 6: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Problem By The Numbers

Almost 60,000 Issues Published Since September 18, 1851

Page 7: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Problem By The Numbers

3,500,000+ Unique Pages Printed Since September 18, 1851

Page 8: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Problem By The Numbers

15,000,000+ Articles Published Since September 18, 1851

Page 9: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Digital Archives

[Figure: timeline of the archive divided into periods: 1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016. Legend: Full Text NYT5, Full Text NYT4, Abstracts NYT4, Abstracts NYT5.]

Page 10: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The New York Times Information Bank

Page 11: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Index

Evan Sandhaus

Page 12: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The New York Times Company Archives

Page 13: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The New York Times Company Archives

Page 14: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The New York Times Company Archives

Page 15: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The New York Times Company Archives

Page 16: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The New York Times Company Archives

Page 17: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

TimesMachine

Page 18: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Deep Archive

[Chart: x-axis 1851 to 2012, y-axis 0 to 180,000. Series: Scanned Articles, Digital Articles, Blogs. Annotations: ≈75%, ≈25%.]

Page 19: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Deep Archive

Page 20: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Numbers

46,592 Issues Published Since September 18, 1851

Page 21: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Numbers

2,335,446 Unique Pages Printed Since September 18, 1851

Page 22: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Numbers

11,298,320 Articles Published Since September 18, 1851

Page 23: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Scanned Archive

Page 24: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Scanned Archive

Headline

CROWD ROARS THUNDEROUS WELCOME; Breaks Through Lines of Soldiers and Police and Surging to Plane Lifts Weary Flier from His Cockpit

AVIATORS SAVE HIM FROM FRENZIED MOB OF 100,000

Paris Boulevards Ring With Celebration After Day and Night Watch -- American Flag Is Called For and Wildly Acclaimed

Page 25: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Scanned Archive

Lede Paragraph

PARIS, May 21. -- Lindbergh did it. Twenty minutes after 10 o'clock tonight suddenly and softly there slipped out of the darkness a gray-white airplane as 25,000 pairs of eyes strained toward it. At 10:24 the Spirit of St. Louis landed and lines of soldiers, ranks of policemen and stout steel fences went down before a mad rush as irresistible as the tides of the ocean.

Page 26: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Scanned Archive

“Dirty” ASCII

…Lifte Fro'm His Cockpit. As he was lifted to the ground Lindbergh w as l,-:k:, :::. - hair unkempt, he looked completely worn out. lle h-:: strength enough, however, to smile, and waved his hand to t? ' crowd. Soldiers with fixed bayonets were unable to keep bach the crowd. United States Ambassador Herrick was among the first to welcome and congratulate the hero.s…
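
One rough way to put a number on how "dirty" a passage of OCR text is would be to count the share of tokens that do not look like ordinary words. The sketch below is an illustration only: the `noise_ratio` function and its regular expression are assumptions made for this example, not part of the tooling described in the talk.

```python
import re

# A token counts as "clean" if it is made of letters, optionally with
# internal apostrophes or hyphens, and optional trailing punctuation.
CLEAN_TOKEN = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*[.,;:!?]*$")


def noise_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look garbled."""
    tokens = text.split()
    if not tokens:
        return 0.0
    noisy = [t for t in tokens if not CLEAN_TOKEN.match(t)]
    return len(noisy) / len(tokens)


# A fragment of the OCR sample from the slide, and a clean sentence from the lede.
dirty = ("Lindbergh w as l,-:k:, :::. - hair unkempt, he looked "
         "completely worn out. lle h-:: strength")
clean = "Lindbergh did it."

print(noise_ratio(dirty) > noise_ratio(clean))
```

A heuristic like this misses misspellings that still look like words (such as "bach" for "back"), so it can only flag the worst passages.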

Page 27: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Scanned Archive

Indexing Metadata

Headings: People, Places, Organizations, Subject

Abstracts: Concise summary of the facts in the article
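
The pieces of the scanned archive listed on the preceding slides (headline, lede paragraph, "dirty" ASCII, indexing headings, abstract) can be pictured as one record per article. The sketch below is a hypothetical data structure for illustration; the class name, field names, and sample values are assumptions and do not reflect the actual archive schema or index entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScannedArticle:
    """One article in the scanned archive (hypothetical field names)."""
    headline: str
    lede: str
    ocr_text: str                      # the "dirty" ASCII produced by OCR
    abstract: str = ""                 # concise summary of the facts
    # Indexing headings, keyed by type: people, places, organizations, subject.
    headings: Dict[str, List[str]] = field(default_factory=dict)


# Illustrative values only, drawn from the text quoted on the slides.
article = ScannedArticle(
    headline="CROWD ROARS THUNDEROUS WELCOME",
    lede="PARIS, May 21. -- Lindbergh did it.",
    ocr_text="...Lifte Fro'm His Cockpit. As he was lifted to the ground...",
    headings={"people": ["Lindbergh"], "places": ["Paris"]},
)
print(sorted(article.headings))
```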

Page 28: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Demo

TimesMachine Version 2.0

Page 29: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Archive Transcription

Page 30: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Problem

• As a subscriber exclusive, TimesMachine does not appear in Google Search results.

• Lack of full text before 1980 makes it difficult for articles to rank, or even appear, in Google results.

• For example, in 1945 The Times published 161,961 articles, and only a tiny fraction appear in Google results.

Page 31: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Solution

• Transcribe articles from archival scans and publish these assets as searchable pages on nytimes.com.

• Transcribe and publish 1964 as a pilot.

• If that works, transcribe and publish all remaining articles between 1960-1980.

Page 32: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Progress & Results

• All articles between 1960-1980 transcribed.

• All articles between 1970-1979 available on nytimes.com with more to come.

• Google now indexing 672,500 new assets published between 1970-1979!

• Plans to publish 1960-1969, and to monitor performance of new pages.

Page 33: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Online Archive Modernization

Page 34: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Online Archive Modernization

Page 35: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Archival Content on NYTimes.com

Page 36: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Archival Content on NYTimes.com

Page 37: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Initial Solution

print data (XML) → new format for CMS (JSON)
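
A minimal sketch of the XML-to-JSON step, assuming a made-up print XML layout. The element and attribute names (`article`, `date`, `headline`, `body`, `p`) and the JSON keys are hypothetical; the slides do not show the real print schema or the CMS format.

```python
import json
import xml.etree.ElementTree as ET


def print_xml_to_json(xml_text: str) -> str:
    """Convert one print-archive XML article into a JSON document."""
    root = ET.fromstring(xml_text)
    record = {
        "headline": (root.findtext("headline") or "").strip(),
        "publication_date": root.get("date", ""),
        # One list entry per paragraph of body text.
        "body": [(p.text or "").strip() for p in root.iterfind("body/p")],
        "source": "print",
    }
    return json.dumps(record, indent=2)


sample = """<article date="1996-01-01">
  <headline>Example Headline</headline>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</article>"""

print(print_xml_to_json(sample))
```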

Page 38: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Case Of The Missing Articles

Page 39: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Case Of The Missing Articles

web data (HTML) + print data (XML) → new format for CMS (JSON)

Page 40: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Case of the Missing Articles

1. What is the complete list of article URLs from 1996-2006?

2. How do we identify which of the missing web articles correspond to existing print articles so that we can combine them and avoid duplicate content?

3. Which articles are web-only and not in our print archive at all, and how do we scrape those pages for content & metadata?

4. Can we build a system that will process all the data for each year easily & efficiently?
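
For question 3, scraping a legacy page might look something like the sketch below, which uses Python's standard `html.parser`. It assumes the headline lives in `<title>` and the body in `<p>` elements; that layout, the `ArticleScraper` class, and the `scrape` function are assumptions for illustration, not a description of the pages or the scraper actually used.

```python
from html.parser import HTMLParser


class ArticleScraper(HTMLParser):
    """Collects the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def scrape(html: str) -> dict:
    """Return the headline and non-empty body paragraphs of a page."""
    parser = ArticleScraper()
    parser.feed(html)
    parser.close()
    body = [p.strip() for p in parser.paragraphs if p.strip()]
    return {"headline": parser.title.strip(), "body": body}


page = ("<html><head><title>Example Headline</title></head>"
        "<body><p>First paragraph.</p><p></p></body></html>")
print(scrape(page))
```

A page whose `body` list comes back empty would be a candidate for a "no article body" bucket rather than for conversion.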

Page 41: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Definitive List of Articles

4 different sources:

1. Print archive

2. Site analytics (from the past 6 months)

3. Movie, theater, and restaurant reviews

4. Sitemaps
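
Combining the four sources amounts to taking a union of URL lists after normalizing each URL so that the same article is not counted twice. The sketch below is one possible approach; the normalization rules (lowercase host, drop scheme, query string, and fragment) and the `example.com` URLs are assumptions, not the rules used in the actual migration.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivially different forms compare equal."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path}"


def definitive_list(*sources):
    """Union any number of URL collections into one sorted, de-duplicated list."""
    urls = set()
    for source in sources:
        urls.update(normalize(u) for u in source if u.strip())
    return sorted(urls)


# Hypothetical inputs standing in for the four sources.
print_archive = ["http://www.example.com/2004/01/01/a.html"]
analytics = ["http://WWW.example.com/2004/01/01/a.html?pagewanted=all"]
reviews = ["http://www.example.com/2004/01/02/review.html"]
sitemaps = ["https://www.example.com/2004/01/03/b.html", ""]

print(definitive_list(print_archive, analytics, reviews, sitemaps))
```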

Page 42: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Archive Migration Pipeline For A Given Year

[Diagram: flow of data through the pipeline. Nodes: archive XML; definitive list of URLs; extracted URLs; missing URLs; missing HTML; URLs with no article body; XML to HTML matches; unmatched HTML; JSON from XML and HTML; JSON from unmatched HTML; skipped files; JSON with no duplicate.]
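
Several of the pipeline stages can be read as set operations on URLs. The sketch below shows one plausible reading, in which the "missing URLs" are those in the definitive list that were not extracted from the archive XML, and the missing pages then split into matched and unmatched groups. The function name, its arguments, and this interpretation of the diagram are assumptions.

```python
def plan_year(definitive_urls, extracted_urls, matches):
    """Partition one year's URLs into pipeline buckets.

    `matches` maps a web URL to the ID of the print XML article it
    corresponds to (how matches are found is out of scope here).
    """
    definitive = set(definitive_urls)
    extracted = set(extracted_urls)
    missing = definitive - extracted                  # "missing URLs"
    matched = {u for u in missing if u in matches}    # "XML to HTML matches"
    unmatched = missing - matched                     # "unmatched HTML"
    return {"missing": missing, "matched": matched, "unmatched": unmatched}


buckets = plan_year(
    definitive_urls=["/a.html", "/b.html", "/c.html"],
    extracted_urls=["/a.html"],
    matches={"/b.html": "xml-0001"},
)
print(sorted(buckets["unmatched"]))
```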

Page 48: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Archive Migration Pipeline

2004 Articles (116K total)

• Print Archive (56K): 48.3%

• Print Archive and Web (42K): 36.2%

• Web-only (15K): 12.9%

• Bad URLs (3K): 3%
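
The percentages follow from the rounded counts on the slide. A quick check, assuming the counts are exact thousands (the slide only gives them to the nearest thousand):

```python
# Counts as shown on the slide, rounded to the nearest thousand.
counts = {
    "Print Archive": 56_000,
    "Print Archive and Web": 42_000,
    "Web-only": 15_000,
    "Bad URLs": 3_000,
}


def shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for name, pct in shares(counts).items():
    print(f"{name}: {pct}%")
```

With these rounded inputs the first three shares reproduce the slide's figures, and the "Bad URLs" share comes out at 2.6%, which the slide shows as 3%.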

Page 49: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

All The Little Things…

• 1996

• Article Matching

• Better URLs

• Quality Assurance

• Next Steps

Page 50: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Article Matching: Fusion

[Diagram: the archive migration pipeline for a given year, with the same nodes as on the earlier pipeline slide.]

Page 51: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Fusion Explained

web data (HTML)

print data (XML)
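
The slides do not say how web and print versions of an article are matched or merged. As a sketch under stated assumptions, one simple approach is to compare normalized headlines with `difflib.SequenceMatcher` from the Python standard library and then fill gaps in the print record with web fields. The function names, the 0.9 threshold, and the field names below are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def best_match(web_headline, print_headlines, threshold=0.9):
    """Return the ID of the most similar print headline, or None."""
    best_id, best_score = None, 0.0
    target = normalize(web_headline)
    for article_id, headline in print_headlines.items():
        score = SequenceMatcher(None, target, normalize(headline)).ratio()
        if score > best_score:
            best_id, best_score = article_id, score
    return best_id if best_score >= threshold else None


def fuse(print_record: dict, web_record: dict) -> dict:
    """Keep print fields; fill in empty or absent ones from the web record."""
    fused = dict(print_record)
    for key, value in web_record.items():
        if value and not fused.get(key):
            fused[key] = value
    return fused


print_headlines = {"xml-1": "EXAMPLE HEADLINE", "xml-2": "Something Else"}
print(best_match("Example Headline!", print_headlines))
print(fuse({"headline": "Example Headline", "url": ""},
           {"headline": "Example Headline!", "url": "/a.html"}))
```

Headline similarity alone would mis-pair recurring titles such as column names, so a real matcher would likely also compare dates or body text.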

Page 52: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Search Engine Optimization

27iht-scoutus.t.html

Page 53: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Search Engine Optimization

curb-violates-free-speech-supreme-court-rules-72-justices-void-internet.html
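
A descriptive file name like the second one can be derived from an article's headline. The sketch below is one guess at such a rule: lowercase the headline, strip punctuation, and join the words with hyphens. The `slugify` function, its `max_words` limit, and the sample headline are assumptions, not the rule actually used for these URLs.

```python
import re


def slugify(headline: str, max_words: int = 12) -> str:
    """Build a hyphenated, lowercase file name from a headline."""
    # Remove everything except letters, digits, and whitespace,
    # so "7-2" becomes "72" and trailing punctuation disappears.
    cleaned = re.sub(r"[^a-z0-9\s]", "", headline.lower())
    words = cleaned.split()
    return "-".join(words[:max_words]) + ".html"


print(slugify("Supreme Court Rules, 7-2"))
```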

Page 54: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Case Of The Missing Sections

Page 55: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Case Of The Missing Sections

Page 56: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Next Steps

[Figure: timeline of the archive divided into periods: 1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016. Legend: Full Text, No Full Text.]

Page 57: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Next Steps

Photos

Page 58: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Next Steps

Digital preservation

Page 59: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

To Conclude…

Page 60: Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

Thank You!

Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler

The New York Times