International Internet Preservation Consortium Research Slides from Ian Milligan


Upload: ian-milligan

Post on 11-Jul-2015


TRANSCRIPT

Page 1: International Internet Preservation Consortium Research Slides from Ian Milligan

Ian Milligan, PhD Assistant Professor of History [email protected]

‘An Infinite Archive?’ Historical Explorations in the Internet Archive’s Wide Web Scrape

Page 2:

[http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg]

Page 3:

Why?

Historians need to think about Computational Methods in an era of web archives.

Page 4:

“.... [n]ow expectations have inverted. Everything may be recorded and preserved*, at least potentially.”

- James Gleick, The Information

* an overstatement, of course, but a useful one

Page 5:

We have too much information to make sense of with normal methods.

Page 6:

The 80TB Wide Web Scrape

[March - December 2011]

Page 7:

ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 - 462 880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz

Page 8:
Page 9:

Methods (or the fun of playing with WARC files themselves)
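To give a sense of what working with WARC files directly can involve, here is a minimal Python sketch that walks the records in a WARC stream using only the standard library. It is an illustration under simplifying assumptions, not the tooling used for this talk: it assumes well-formed records with a Content-Length header and ignores folded header lines. The names `iter_warc_records`, `open_warc`, and `make_record` are hypothetical helpers, and the example.com records are made up for the demonstration.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def open_warc(path):
    """Open a .warc or .warc.gz file; gzip reads across per-record members."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def make_record(target_uri, body):
    """Build a tiny, made-up WARC record for demonstration purposes only."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        "WARC-Target-URI: %s\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (target_uri, len(body))
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two hypothetical records, parsed from memory instead of a real crawl file.
sample = make_record("http://example.com/", b"hello") + make_record(
    "http://example.com/about", b"<html>about</html>"
)
for headers, payload in iter_warc_records(io.BytesIO(sample)):
    print(headers["warc-target-uri"], len(payload))
```

For a real crawl file, `iter_warc_records(open_warc(path))` would iterate the same way. A dedicated WARC library is likely the better choice for anything beyond exploration, since it handles edge cases this sketch skips.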

Page 10:
Page 11:
Page 12:
Page 13:
Page 14:
Page 15:
Page 16:

Named Entity Recognition as another approach?

Page 17:

Countries Mentioned in .ca TLD (excluding Canada)

Page 18:

Provinces Mentioned in .ca TLD

Page 19:

Countries Mentioned in .mil TLD

Page 20:

Countries Mentioned in .gov TLD

Page 21:

Countries Mentioned in .edu TLD

Page 22:

Countries Mentioned in .uk TLD (excluding UK)

Page 23:

.ca montage

Page 24:

.ca montage (zoomed in)

Page 25:

.mil montage

Page 26:

.mil montage (zoomed in)

Page 27:

.cn montage

Page 28:

.cn montage (zoomed in)

Page 29:

Ian Milligan, PhD Assistant Professor of History [email protected]

Thank you!
