Archive-It Blog · ait.blog.archive.org/files/2020/05/introduction-to-the-warc_2020052… · 20/05/2020...

Post on 20-Oct-2020

  • https://bit.ly/warc-intro

  • crawler → W/ARC → Wayback

    collect, then render

  • .warc

  • 1996:

    2005:

    2009:

    2017:

  • .warc

    warcinfo request response revisit

    resource conversion continuation metadata
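
    The eight record types named on this slide can be written down as an enumeration, so that a misspelled WARC-Type value fails loudly instead of passing through unnoticed. This is only a minimal Python sketch; the names `WarcType` and `record_type` are illustrative and do not come from the slides or from any WARC library.

```python
from enum import Enum


class WarcType(Enum):
    """The eight WARC record types listed on the slide."""
    WARCINFO = "warcinfo"
    REQUEST = "request"
    RESPONSE = "response"
    REVISIT = "revisit"
    RESOURCE = "resource"
    CONVERSION = "conversion"
    CONTINUATION = "continuation"
    METADATA = "metadata"


def record_type(header_value: str) -> WarcType:
    """Map a WARC-Type header value to the enum (hypothetical helper).

    Raises ValueError for any value outside the eight types above.
    """
    return WarcType(header_value.strip())


print(record_type("revisit").name)
```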

  • WARC-Type: warcinfo
    WARC-Record-ID:
    WARC-Filename: ARCHIVEIT-8232-TEST_CRAWL-JOB1111215-SEED2166618-20200320173416774-00000-xqtcu3m8.warc.gz
    WARC-Date: 2020-03-20T17:34:16Z
    Content-Type: application/warc-fields
    Content-Length: 116

    software: warcprox 2.4.26
    hostname: wbgrp-svc408.us.archive.org
    ip: 207.241.232.59
    format: WARC File Format 1.0
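
    Both the record header and the `application/warc-fields` block in this warcinfo record are plain `Name: value` lines, and Content-Length counts the bytes of the block that follows the header. The Python sketch below illustrates that relationship using the values from the slide. It assumes CRLF line endings and leaves out WARC-Record-ID, whose value is missing from the transcript; `parse_named_fields` is a hypothetical helper, not a library function.

```python
def parse_named_fields(text: str) -> dict[str, str]:
    """Parse 'Name: value' lines into a dict, skipping blank lines."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a named field: {line!r}")
        fields[name.strip()] = value.strip()
    return fields


# Header values copied from the slide (WARC-Record-ID omitted: value not in transcript).
HEADER = (
    "WARC-Type: warcinfo\r\n"
    "WARC-Date: 2020-03-20T17:34:16Z\r\n"
    "Content-Type: application/warc-fields\r\n"
    "Content-Length: 116\r\n"
)

# Content block copied from the slide, assuming CRLF line endings.
BLOCK = (
    "software: warcprox 2.4.26\r\n"
    "hostname: wbgrp-svc408.us.archive.org\r\n"
    "ip: 207.241.232.59\r\n"
    "format: WARC File Format 1.0\r\n"
)

header = parse_named_fields(HEADER)
block = parse_named_fields(BLOCK)

declared = int(header["Content-Length"])  # length stated in the header
actual = len(BLOCK.encode("utf-8"))       # length of the block in bytes

print(header["WARC-Type"], declared, actual)  # prints: warcinfo 116 116
```

    Under these assumptions the four block lines add up to exactly the declared 116 bytes.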

  • WARC-Type: request
    WARC-Record-ID:
    WARC-Target-URI: https://www.netpreserve.org/
    WARC-Date: 2020-03-20T17:34:14Z
    WARC-Concurrent-To:
    WARC-Block-Digest: sha1:YQQEFRPXTLNBTEYX5VJBMK6M27KRP4UY
    Content-Type: application/http;msgtype=request
    Content-Length: 420

  • GET /blog/ HTTP/1.1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.83 Safari/537.36
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Accept-Encoding: gzip, deflate
    Accept-Language: en-us,en;q=0.5
    X-Forwarded-For: 6.214.43.172
    Host: www.netpreserve.org
    Via: 1.1 warcprox

  • WARC-Type: response
    WARC-Record-ID:
    WARC-Target-URI: https://widgets.wp.com/likes/index.html
    WARC-Date: 2020-03-19T18:57:07Z
    WARC-Payload-Digest: sha1:CORSFN5DKI2BBKIK7XKYQMVYTTT3MKIC
    Content-Type: application/http; msgtype=response
    Content-Length: 386

  • HTTP/1.1 200 OK
    Server: nginx
    Date: Thu, 19 Mar 2020 18:57:07 GMT
    Content-Type: text/html
    Content-Length: 126
    Accept-Ranges: bytes

  • WARC-Type: revisit
    WARC-Record-ID:
    WARC-Target-URI: https://s0.wp.com/i/favicon.ico
    WARC-Date: 2020-03-19T18:57:29Z
    WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
    WARC-Payload-Digest: sha1:AWHWSLHL7LFI7DOK4QXUVH323OGGHPS3
    WARC-Refers-To:
    WARC-Refers-To-Target-URI: https://s1.wp.com/i/favicon.ico
    WARC-Refers-To-Date: 2020-03-19T18:56:21Z
    Content-Type: application/http; msgtype=response
    Content-Length: 362

  • HTTP/1.1 200 OK
    Server: nginx
    Date: Thu, 19 Mar 2020 18:57:29 GMT
    Content-Type: image/x-icon
    Content-Length: 5430
    Connection: close
    Last-Modified: Fri, 13 Nov 2015 04:17:50 GMT
    Vary: Accept-Encoding
    ETag: "5645646e-1536"
    Expires: Fri, 28 Aug 2020 04:10:03 GMT
    Cache-Control: max-age=31536000
    Accept-Ranges: bytes
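
    The `identical-payload-digest` profile in this revisit record means the payload's digest matched that of an earlier capture, so only the headers are stored again. The digest values on these slides look like Base32-encoded SHA-1 hashes with a `sha1:` label. The Python sketch below shows how such a label could be computed and compared, under that assumption; `payload_digest` and `is_identical_payload` are hypothetical names, not warcprox or Archive-It code.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a 'sha1:<Base32>' label for the given payload bytes."""
    raw = hashlib.sha1(payload).digest()  # 20 raw bytes
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def is_identical_payload(new_payload: bytes, earlier_digest: str) -> bool:
    """True when a new capture could be stored as a revisit record."""
    return payload_digest(new_payload) == earlier_digest


first = payload_digest(b"favicon bytes")  # stand-in payload, not the real icon
print(is_identical_payload(b"favicon bytes", first))
print(is_identical_payload(b"different bytes", first))
print(payload_digest(b""))  # sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```

    A 20-byte SHA-1 hash always encodes to 32 Base32 characters with no padding, which matches the length of the digests shown in the records.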

  • WARC-Type: resource
    WARC-Record-ID:
    WARC-Target-URI: screenshot:http://www.netpreserve.org/blog/
    WARC-Date: 2019-12-19T17:53:11Z
    WARC-Block-Digest: sha1:GCDC2JZRN52NG2SNE6V52HC5BWUAFNNC
    WARC-Payload-Digest: sha1:GCDC2JZRN52NG2SNE6V52HC5BWUAFNNC
    Content-Type: image/jpeg
    Content-Length: 182270

    [binary JPEG image data, truncated]

  • WARC-Type: conversion
    WARC-Record-ID:
    WARC-Target-URI: http://www.archive.org/images/logoc.jpg
    WARC-Date: 2016-09-19T19:00:40Z
    WARC-Block-Digest: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK
    WARC-Refers-To:
    Content-Type: image/neoimg
    Content-Length: 934

    ….

  • WARC-Type: continuation
    WARC-Record-ID:
    WARC-Target-URI: http://www.archive.org/images/logoc.jpg
    WARC-Date: 2016-09-19T17:20:24Z
    WARC-Block-Digest: sha1:T7HXETFVA92MSS7ZENMFZY6ND6WF7KB7
    WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
    WARC-Segment-Origin-ID:
    WARC-Segment-Number: 2
    WARC-Segment-Total-Length: 1902
    WARC-Identified-Payload-Type: image/jpeg
    Content-Length: 302

    ….
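
    A continuation record carries one segment of a record that was split across files. WARC-Segment-Number gives its position, and the final segment's WARC-Segment-Total-Length gives the length of all segment blocks joined together. The Python sketch below shows how segments might be reassembled and checked. The `Segment` class and `reassemble` function are hypothetical, and the 1600-byte first segment is inferred only from the numbers on the slide (1902 minus 302).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    number: int                         # WARC-Segment-Number (starts at 1)
    block: bytes                        # content block of this segment
    total_length: Optional[int] = None  # WARC-Segment-Total-Length, last segment only


def reassemble(segments: list[Segment]) -> bytes:
    """Join segments in order and verify the declared total length."""
    if not segments:
        raise ValueError("no segments given")
    ordered = sorted(segments, key=lambda s: s.number)
    numbers = [s.number for s in ordered]
    if numbers != list(range(1, len(ordered) + 1)):
        raise ValueError(f"missing or duplicate segments: {numbers}")
    data = b"".join(s.block for s in ordered)
    expected = ordered[-1].total_length
    if expected is not None and expected != len(data):
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return data


# Segment 2 mirrors the slide: 302 bytes, total length 1902.
parts = [Segment(2, b"b" * 302, total_length=1902), Segment(1, b"a" * 1600)]
print(len(reassemble(parts)))  # prints: 1902
```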

  • WARC-Type: metadata
    WARC-Record-ID:
    WARC-Target-URI: https://netpreserveblog.files.wordpress.com/iipc_logo_fullcolor.png
    WARC-Date: 2020-03-19T18:56:17Z
    Content-Type: application/warc-fields
    Content-Length: 476

    force-fetch:
    via: https://netpreserveblog.wordpress.com/2020/02/13/cdg-collection-novel-coronavirus/
    hopsFromSeed: I
    fetchTimeMs: 61
    charsetForLinkExtraction: ISO-8859-1

  • WARC specification repository - IIPC
    https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

    WARC file format specification 28500:2017 - ISO
    https://www.iso.org/standard/68004.html

    WARC File Format (ISO 28500) Pre-print drafts - BnF
    http://bibnum.bnf.fr/WARC/

    WARC format description - Library of Congress
    https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml

    Details for WARC 1.0 - PRONOM
    https://www.nationalarchives.gov.uk/pronom/fmt/289

    Storage and preservation policy - Archive-It
    https://support.archive-it.org/hc/en-us/articles/208117536-Archive-It-Storage-and-Preservation-Policy

    Find and download your WARC files - Archive-It
    https://support.archive-it.org/hc/en-us/articles/360015225051-Find-and-download-your-WARC-files-with-WASAPI
