

Making Web Archiving Work for Streaming Media:

Archiving the Websites of Contemporary Young Composers

Howard Besser, NYU
http://besser.tsoa.nyu.edu/howard/Talks/

Besser-Berkeley Seminar 1/19/2018 1

Making Web Archiving Work for Streaming Media

•  Background issues and problems
•  The Project
  –  Our Technical Collaboration
  –  Our Collaboration with Content Creators & restrictions
  –  Architectures & Workflows
  –  How things may look
  –  Evaluation
•  Impact beyond this Project

BACKGROUND ISSUES AND PROBLEMS

Web Archiving poses challenges

•  Any given web page may be updated frequently
•  Web links constantly break (404 errors)
•  Few tools/services exist for “Curated” web archiving (Archive-It, CDL’s WAS), and they require significant training/experience to learn, but we do have an internationally accepted format (WARC)

Many parameters need to be set for Web Archiving

•  Frequency of crawls
•  Depth of crawls (# of hops)
•  Starting points of crawls (seeds)
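To make the three parameters concrete, here is a minimal sketch of how seeds and a hop limit bound what a crawl collects. The `CrawlConfig` class, the `pages_in_scope` function, and the toy link graph are all hypothetical illustrations, not the configuration format of Archive-It or any other service; crawl frequency is only carried as a setting and is not scheduled here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlConfig:
    """Hypothetical container for the three parameters named on the slide."""
    seeds: tuple          # starting points of the crawl
    max_hops: int         # depth of the crawl, counted in link hops from a seed
    frequency_days: int   # how often the crawl is repeated (not used below)


def pages_in_scope(links, config):
    """Breadth-first walk over `links`, stopping max_hops away from any seed."""
    hops = {seed: 0 for seed in config.seeds}
    queue = deque(config.seeds)
    while queue:
        url = queue.popleft()
        if hops[url] == config.max_hops:
            continue  # depth limit reached, do not follow links from this page
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Toy link graph standing in for a composer's site (made-up URLs).
LINKS = {
    "site/": ["site/works", "site/bio"],
    "site/works": ["site/works/audio"],
    "site/works/audio": ["cdn/stream.mp3"],
}

config = CrawlConfig(seeds=("site/",), max_hops=2, frequency_days=30)
in_scope = pages_in_scope(LINKS, config)
```

With a limit of two hops the audio page is reached but the media file it links to is not, which is one way a crawl can miss the content that matters most for this project.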

Other issues for developing good crawls

•  Quality control/assurance
•  Workflows
•  Fidelity to original web pages
•  How end user will navigate and view it


Archive-It

•  The leading application/service for curated web archiving in North America
•  Run by the Internet Archive, and is much more targeted and curated than their Wayback Machine
•  Is based on crawler software developed by IA (Heritrix) in 2003-2004
•  Is very poor at capturing streaming audio or video as well as inserting it properly into a composed web page

Archive-It Issues w/ Streaming Media

Archive-It screenshots generated as part of our project

•  By Lorena Ramirez-López

Archive-It Issues w/ Streaming Media
Firefox version 39.0. Screenshot of Tarik O’Regan’s site taken 2015/10/05



Archive-It Issues w/ Streaming Media
Firefox version 39.0. Screenshot of Ted Hearne’s website taken 2015/10/05

Some sources of streaming issues

•  Problems with capturing resources residing on 3rd party services (YouTube, Vimeo, Soundcloud)
•  Problems with how faithfully the A/V materials are captured and placed by Archive-It
•  Problems with websites generated through site building platforms such as Squarespace

Other Issues we’re trying to solve

•  Discovering URLs generated by Javascript
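A small sketch of why this is hard for a link-following crawler: a parser that reads only the static HTML never sees a URL that a script assembles at run time. The page content, the `player.example.com` address, and the `LinkCollector` class are made up for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href/src attribute values, as a link-following crawler would."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


# Made-up page: one ordinary link, plus a player URL assembled by a script.
PAGE = """
<a href="/works.html">Works</a>
<div id="player"></div>
<script>
  var id = "12345";
  document.getElementById("player").innerHTML =
      "<iframe src='https://player.example.com/track/" + id + "'></iframe>";
</script>
"""

collector = LinkCollector()
collector.feed(PAGE)
static_urls = collector.urls
```

Only `/works.html` is found. The player URL exists only after a browser executes the script, which is the gap a browser-driven crawler is meant to close.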

THE PROJECT


Archiving Composer Websites
http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html

•  Collect, preserve, & make available Websites of Composers
•  $480,000 grant from Mellon in 2015 to NYU Library/MIAP/Internet Archive
•  Dealing with the issue that contemporary composer websites go up and down (and also incorporate relationship-building btwn composer and fans)
•  Addressing the problems of collecting streaming media
•  Also selectively collecting high-quality versions that are used to generate the streams, and allowing future researchers to see/hear the higher quality versions

Archiving Composer Websites

•  Develop good and ongoing relationships btwn Libraries and Composers
•  Develop Trust
  –  for developing collections, and continuing to add to them
  –  for Policy reasons
•  Examine what type of errors take place
  –  how faithfully audiovisual materials are being captured
  –  how resources that reside on third-party web-services (YouTube, Vimeo, Soundcloud) are (not) displayed within Archive-It’s interface
  –  Issues w/ websites generated through site building platforms such as Squarespace
•  Find ways to fix those errors

Metrics Accomplished (as of Jan 2016)

•  172 Composer sites crawled, scoped, assessed for quality, & analyzed for problems (feeding into IA development work)
•  800 QA/QC reports generated
•  Initial web archiving agreement from 165 Composers (25 from NPR’s 100)
•  Identified website infrastructures encountered and created a classification matrix

Website Infrastructure encountered

Project Team

•  Jefferson Bailey (Internet Archive)
•  Howard Besser (MIAP)
•  Lori Donovan (Internet Archive)
•  April Hathcock (Lib/ScholComm)
•  Nicole Greenhouse (Lib/ACM)
•  Carol Kassel (Lib/DLTS)
•  Scott Statland (MIAP)
•  Donald Mennerich (Lib/ACM/DLTS)
•  David Millman (Lib/DLTS)
•  Courtney Mumma (Internet Archive)
•  Robin Preiss (Lib/AFC)
•  Lorena Ramirez (MIAP), special thanks!
•  Michael Stoller (Lib/C&RS)
•  Kent Underwood (Lib/AFC)
•  Chela Scott Weber (Lib/AFC), departed

OUR TECHNICAL COLLABORATION: CRAWLING


NYU/IA Collaboration

Traditional Crawlers

•  Archive-It and other web archives use Heritrix
•  Follow links, capture most web content
•  Less successful with streaming video and dynamic content executed in the browser
•  Umbra helps

BROZZLER!

“browser” | “crawler” = BROZZLER

Logo: Noah Levitt

Brozzler System Architecture v1


Brozzler Model

•  job: collection of seeds
•  seed: principal unit of crawl configuration
  –  one browser works on one seed at a time (politeness)
  –  seed has its own configuration, also inherits from parent job
•  page: atomic unit of crawling from brozzler perspective
•  url: only browsers, warcprox have to deal with every url
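The job/seed relationship above can be sketched with two small classes. This is an illustration of the inheritance idea only: the class names, the `effective_config` method, and the setting names `max_hops` and `capture_video` are assumptions made for the example and are not Brozzler's actual data model or configuration keys.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """A job is a collection of seeds plus job-wide settings."""
    config: dict = field(default_factory=dict)
    seeds: list = field(default_factory=list)


@dataclass
class Seed:
    """A seed carries its own settings and falls back to its parent job."""
    url: str
    job: Job
    config: dict = field(default_factory=dict)

    def effective_config(self):
        # Seed-level values override inherited job-level values.
        return {**self.job.config, **self.config}


job = Job(config={"max_hops": 2, "capture_video": True})
seed = Seed(url="http://composer.example/", job=job, config={"max_hops": 4})
job.seeds.append(seed)
settings = seed.effective_config()
```

Here the seed overrides the hop limit but inherits the video setting from its job, so one composer's site can be crawled more deeply without changing the rest of the collection.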

Warcprox: WARC-writing http proxy

•  man-in-the-middle for https
•  asynchronous: WarcWriterThread
  –  writes warc records
  –  saves deduplication info
  –  updates statistics
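The three duties listed for the writer thread can be sketched with Python's standard `threading` and `queue` modules. This is not warcprox's code: the `WriterThread` class, its in-memory `records` list (standing in for a WARC file), and the choice of SHA-1 as the payload digest are assumptions for illustration. The "revisit" label borrows the WARC record type used for content already captured.

```python
import hashlib
import queue
import threading


class WriterThread(threading.Thread):
    """Drains captured payloads from a queue so capture never waits on storage."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.records = []         # stand-in for records appended to a WARC file
        self.seen_digests = {}    # deduplication info: digest -> first URL seen
        self.stats = {"written": 0, "duplicates": 0}

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:      # sentinel: no more captures
                break
            url, payload = item
            digest = hashlib.sha1(payload).hexdigest()
            if digest in self.seen_digests:
                # Same bytes already stored: note a pointer instead of a copy.
                self.records.append(("revisit", url, self.seen_digests[digest]))
                self.stats["duplicates"] += 1
            else:
                self.seen_digests[digest] = url
                self.records.append(("response", url, digest))
                self.stats["written"] += 1


writer = WriterThread()
writer.start()
writer.inbox.put(("http://a.example/track.mp3", b"audio-bytes"))
writer.inbox.put(("http://b.example/copy.mp3", b"audio-bytes"))
writer.inbox.put(None)
writer.join()
stats = writer.stats
```

The second capture has the same bytes as the first, so it is recorded as a pointer to the earlier URL rather than stored again.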

Other pieces

•  python wayback
•  RethinkDB (distributed document store)

Stream capture relies on youtube-dl
https://rg3.github.io/youtube-dl/supportedsites.html

OUR COLLABORATION WITH CONTENT CREATORS, IP ISSUES

Young Composers Corpus

•  Began with NPR’s 2011 list of “100 Composers Under 40”
•  91 of 100 have own self-contained sites
•  As of 5/2016 had written agreements with 165 Composers (25 of them from NPR’s list)
•  Will recruit 10 of them for enhanced archiving (uncompressed; better than what is on website)
  –  This will require an added appendix to contract/agreement (which may involve dark archiving and/or restricted access)


Building relationships with Composers

•  Engage them with the idea of preserving their Website
•  Are they willing to give us richer versions of content on their site?
•  Are they willing to make all (or just part) of the content freely accessible? Do they want to embargo some content in a dark archive?
•  Donor Agreement/Contract

Donor Agreement/Contract

•  Have been working on this with lawyers for over a year
•  Have had fairly stable language in it and some contracts already signed and returned
•  Does default to allowing us complete rights for reformatting and for allowing researchers to see/hear all high quality versions at minimum on-site
  –  And thus far all Composers contacted have agreed to those principles (but not necessarily to the contractual language)

Contract Intro tentative language

•  NYU and Composer wish to establish long-term preservation of the materials listed at the highest possible quality. The Parties wish to enter into this Agreement to establish guidelines and standards with regard to ongoing and future library processes related to such preservation.

Elements in the Contract

•  What is being acquired
•  Terms of Transfer
•  Terms of user Access
•  Rights & Responsibilities (both NYU & Composer)
•  Appendix describing each item (format, content, amount, other pertinent descriptors)
•  Appendix with Access Restrictions

4 possible Levels of Streaming Access

•  Available for copy-protected streaming from the NYU Libraries’ website with unrestricted access by the general public.
•  Available for copy-protected streaming from the NYU Libraries’ website
  –  with access limited to registered NYU faculty and students and
  –  to external researchers with eligibility to use NYU Libraries’ archival resources according to NYU Libraries’ general access policies, with password authentication, on or off campus.
•  Available for copy-protected streaming on NYU Libraries premises, at designated workstations, with access mediated by NYU Libraries personnel.
•  Not available for streaming until a designated future date.
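As a rough sketch, the four levels could be encoded as an enumeration with a single check per request. The names `Access` and `may_stream`, the boolean request flags, and the rule that an embargoed item simply opens once its date has passed are all assumptions for illustration; the slide does not say how the levels are implemented or what applies after an embargo ends.

```python
from datetime import date
from enum import Enum


class Access(Enum):
    PUBLIC = 1            # unrestricted streaming for the general public
    AUTHENTICATED = 2     # NYU faculty/students and eligible external researchers
    ON_PREMISES = 3       # designated workstations, mediated by library staff
    EMBARGOED = 4         # no streaming until a designated future date


def may_stream(level, *, authenticated=False, on_premises=False,
               today=None, release_date=None):
    """Return True if a request satisfies the item's access level."""
    if level is Access.PUBLIC:
        return True
    if level is Access.AUTHENTICATED:
        return authenticated
    if level is Access.ON_PREMISES:
        return on_premises
    # EMBARGOED: closed until the release date has arrived (assumed behavior).
    return release_date is not None and today is not None and today >= release_date


embargo_ok = may_stream(Access.EMBARGOED,
                        today=date(2018, 1, 19),
                        release_date=date(2020, 1, 1))
```

Keeping the decision in one function means the Archive-It side and the finding-aid side could apply the same rule wherever a user arrives from.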

Tentative pieces of the Contract

•  The uncompressed master files of Materials licensed for inclusion will be made available to the Libraries to enable the research and development of higher quality tools and processes for archiving on the Web and successor technology. The resultant high-quality copies of Composer’s website, incorporating the best quality media files, will be preserved as historical documents in the archive, which will be accessible worldwide on the Web or successor technology as a storehouse of cultural memory and a vehicle for research and scholarship. Composer retains existing rights to his or her Materials, subject to the license granted in this Agreement.


Tentative pieces of the Contract

•  non-exclusive worldwide, perpetual, irrevocable, royalty-free right to produce, use, copy, and distribute Derivative Works
•  strictly limited to reformatted digital files or to excerpts and abridgements (such as thumbnails) created for the technical purposes of building, preserving, and providing access to the Web archive over the World Wide Web or its successor
•  may be used only for the non-profit educational and research purposes provided under this Agreement
•  Agreement does not affect or transfer any copyrights or other intellectual property rights

ARCHITECTURE & WORKFLOWS

Architecture & Workflows

•  The Finding Aids are generated from ArchivesSpace (which contains rich metadata)
•  There is an overall Composers Finding Aid, as well as a separate Finding Aid for each composer (listing inventory and web archives, and link to assets)
•  Web archive is stored in Archive-It; richer content in NYU Repository
•  Connections built off of ArchivesSpace back-end API; Demo Site

Software & Service Components

•  IA’s Archive-It
•  NYU digital library internal components
  –  Aeon for workflow management
  –  ArchivesSpace
  –  EAD

Recent Development work

•  Supplying a separate audio player?
•  Hiring a Digital Archivist
•  Still working on precise forms of navigation btwn ArchivesSpace, Archive-It, and richer content within NYU’s digital repository
•  Example of work done on IA’s API

Interim work on API to IA

•  What IA needs from NYU API
  –  API URL
  –  Credentials (username, password) -> Authentication Token()
  –  Repository ID
  –  Resource ID
•  What IA will return as JSON array
  –  Unit Title
  –  Creator
  –  Data Expression
  –  Extent Statement
  –  Tech Characteristics
  –  [Something Based on Access Restriction, i.e. can it be streamed]???
•  We Speak Etruscan, 1993 May 21, 23.5 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
•  The Dream of Innocence III, 1998 March 26, 150 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
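One possible shape for the returned JSON array, using the two example items from the slide. The key spellings are taken from the field names listed above, but the actual key names, value types, and nesting of the interim API are not given in the talk, so everything here is an assumption. Creator is left empty because the examples do not state one, and the undecided access-restriction field is omitted.

```python
import json

# Field names as listed on the slide; real API keys may differ.
FIELDS = ("UnitTitle", "Creator", "DataExpression",
          "ExtentStatement", "TechCharacteristics")


def to_record(values):
    """Pair the slide's field names with one item's values."""
    return dict(zip(FIELDS, values))


items = [
    to_record(("We Speak Etruscan", None, "1993 May 21", "23.5 MB",
               "1 AIFF file Stereo uncompressed 16 bit/44.1K")),
    to_record(("The Dream of Innocence III", None, "1998 March 26", "150 MB",
               "1 AIFF file Stereo uncompressed 16 bit/44.1K")),
]

payload = json.dumps(items, indent=2)   # what would travel over the wire
decoded = json.loads(payload)           # what the receiving side would parse
```

A flat array of small records like this would be enough for the archived page to show a title, date, size, and format line for each higher-quality file.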


HOW THINGS MAY LOOK

Query paths still under development

One option for User Queries

•  User browses through Archive-It
•  User sees that A/V content exists (and in some cases, it will include richer content, but some of that might be access-restricted)
•  Archive-It hands off user to NYU (either directly to A/V content, or to Finding Aid)

One option for Queries

One option for high quality content

•  On archived website page listing composer’s content, user sees a message that higher quality content is available, with:
  –  Access restrictions, if applicable
  –  Link to relevant finding aid
  –  (looking like following image)


Demo from API side
http://composers.dlib.nyu.edu/

From the Library Finding Aid side
http://dlib.nyu.edu/findingaids/html/fales/mss_479/

From the Library Finding Aid side (cont)

From the Library Finding Aid side (Container List)

From the Library Finding Aid side
http://dlib.nyu.edu/findingaids/html/fales/mss_460/dscaspace_7951feea619b6c41436c556e0674d1c8.html

From the Archive-It side
https://archive-it.org/collections/7872


From the Archive-It side
https://archive-it.org/collections/7872?q=http%3A%2F%2Fwww.bitrosie.com&show=SeedVideos&fc=seedId%3A1157594

From any direction, user might need to authenticate

SOME OTHER INTERNAL TRACKING

Crawl Records

EVALUATION


Evaluation for Improvement

•  Composers and their satisfaction with the ways in which audiences will be able to view archives of their websites
•  Researchers, and whether the content and functionality of these web archives work for them
•  Tweaking what we do in order to better serve Creators and Researchers

Schedule and Methodology for Evaluation

•  Jan 2018: Schedule one-on-one interviews with sets of composers and Researchers
•  Feb-Mar 2017: One hour individual sessions with 10 Composers and also with 10 Researchers, having them look at the user interface and conduct queries
  –  Composers: Are they satisfied with how audiences will be able to view the archival copies of their websites? Is it better or worse than their own live sites? Are they satisfied with the audio and video placement and quality (as well as options)? Are they content with the Donor Agreement? What changes/improvements might be made to any of these?
  –  Researchers: Can they find what they need in the web archive? Is it difficult (clunky) to use? What parts don’t work well or aren’t intuitive? We want to identify what changes in the content, functionality, or navigation features would improve their user experience
•  Apr-May 2017: Construction of Evaluation Summary containing the list of improvements/changes that should be made to the Archiving project
•  June-Aug 2017: Implement the changes

IMPACT BEYOND THIS PROJECT

Impact Beyond this Project

•  Archive-It will be able to better handle streaming media, and display it in proper context
•  We will have architectures and workflows for Archive-It to interact with richer local resources (as well as examples of how interaction and navigation can proceed btwn Archive-It, ArchivesSpace, Finding Aids, and an internal digital repository)
•  Models for interaction btwn creators and collecting organizations will have been developed (incl donor agreements)
•  We will have preserved 100+++ websites of young composers

Making Web Archiving Work for Streaming Media

•  http://besser.tsoa.nyu.edu/howard/Talks/
•  http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html
•  http://archive.org/~nlevitt/reveal.js/
•  http://composers.dlib.nyu.edu/
•  https://rg3.github.io/youtube-dl/supportedsites.html