Web archiving poses challenges · besser.tsoa.nyu.edu/howard/talks/18composers-buckland.pdf ·...
1/18/18
Making Web Archiving Work for Streaming Media:
Archiving the Websites of Contemporary Young Composers
Howard Besser, NYU
http://besser.tsoa.nyu.edu/howard/Talks/
Besser-Berkeley Seminar 1/19/2018
Making Web Archiving Work for Streaming Media
• Background issues and problems
• The Project
– Our Technical Collaboration
– Our Collaboration with Content Creators & restrictions
– Architectures & Workflows
– How things may look
– Evaluation
• Impact beyond this Project
BACKGROUND ISSUES AND PROBLEMS
Web Archiving poses challenges
• Any given webpage may be updated frequently
• Web links constantly break (404 errors)
• Few tools/services exist for “Curated” web archiving (Archive-It, CDL’s WAS), and they require significant training/experience to learn, but we do have an int’l-accepted format (WARC)
Many parameters need to be set for Web Archiving
• Frequency of crawls
• Depth of crawls (# of hops)
• Starting points of crawls (seeds)
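To make the three crawl parameters concrete, here is a minimal Python sketch of how seeds and a hop limit bound a crawl. It is an illustration only, not how Archive-It or Heritrix is configured: `CrawlConfig`, `crawl_once`, and the `example.org` link graph are all hypothetical, and the crawl walks an in-memory dictionary rather than fetching pages.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    """Hypothetical container for the three parameters named on the slide."""
    seeds: list          # starting points of the crawl
    max_hops: int        # depth of the crawl, counted in link hops from a seed
    frequency_days: int  # how often the crawl is repeated (unused by crawl_once)


def crawl_once(config, links):
    """Breadth-first walk over `links` (url -> outgoing urls), bounded by max_hops.

    Returns a dict mapping each captured URL to its hop distance from a seed.
    """
    captured = {}
    queue = deque((seed, 0) for seed in config.seeds)
    while queue:
        url, hops = queue.popleft()
        if url in captured:
            continue
        captured[url] = hops
        if hops < config.max_hops:
            for target in links.get(url, []):
                if target not in captured:
                    queue.append((target, hops + 1))
    return captured


# Made-up site structure, used only for illustration.
LINKS = {
    "http://example.org/": ["http://example.org/works", "http://example.org/bio"],
    "http://example.org/works": ["http://example.org/works/piece1"],
    "http://example.org/works/piece1": ["http://example.org/works/piece1/audio"],
}

config = CrawlConfig(seeds=["http://example.org/"], max_hops=2, frequency_days=30)
captured = crawl_once(config, LINKS)
```

With a limit of two hops the audio page, three hops from the seed, is never reached. That is one reason the depth setting matters for sites that bury media several links down.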
Other issues for developing good crawls
• Quality control/assurance
• Workflows
• Fidelity to original webpages
• How end user will navigate and view it
Archive-It
• The leading application/service for curated web archiving in North America
• Run by the Internet Archive, and is much more targeted and curated than their Wayback Machine
• Is based on Crawler software developed by IA (Heritrix) in 2003-2004
• Is very poor at capturing streaming audio or video as well as inserting it properly into a composed webpage
Archive-It Issues w/ Streaming Media
Archive-It screenshots generated as part of our project
• By Lorena Ramirez-López
Archive-It Issues w/ Streaming Media
Firefox version 39.0. Screenshot of Tarik O’Regan’s site taken 2015/10/05
Archive-It Issues w/ Streaming Media
Firefox version 39.0. Screenshot of Ted Hearne’s website taken 2015/10/05
Some sources of streaming issues
• Problems with capturing resources residing on 3rd party services (YouTube, Vimeo, SoundCloud)
• Problems with how faithfully the A/V materials are captured and placed by Archive-It
• Problems with websites generated through site building platforms such as Squarespace
Other Issues we’re trying to solve
• Discovering URLs generated by JavaScript
THE PROJECT
Archiving Composer Websites
http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html
• Collect, preserve, & make available Websites of Composers
• $480,000 grant from Mellon in 2015 to NYU Library/MIAP/Internet Archive
• Dealing with the issue that contemporary composer websites go up and down (and also incorporate relationship-building btwn composer and fans)
• Addressing the problems of collecting streaming media
• Also selectively collecting high-quality versions that are used to generate the streams, and allowing future researchers to see/hear the higher quality versions
Archiving Composer Websites
• Develop good and ongoing relationships btwn Libraries and Composers
• Develop Trust
– for developing collections, and continuing to add to them
– for Policy reasons
• Examine what type of errors take place
– how faithfully audiovisual materials are being captured
– how resources that reside on third-party web-services (YouTube, Vimeo, SoundCloud) are (not) displayed within Archive-It’s interface
– Issues w/ websites generated through site building platforms such as Squarespace
• Find ways to fix those errors
Metrics Accomplished (as of Jan 2016)
• 172 Composer sites crawled, scoped, assessed for quality, & analyzed for problems (feeding into IA development work)
• 800 QA/QC reports generated
• Initial web archiving agreement from 165 Composers (25 from NPR’s 100)
• Identified website infrastructures encountered and created a classification matrix
Website Infrastructure encountered
Project Team
• Jefferson Bailey (Internet Archive)
• Howard Besser (MIAP)
• Lori Donovan (Internet Archive)
• April Hathcock (Lib/ScholComm)
• Nicole Greenhouse (Lib/ACM)
• Carol Kassel (Lib/DLTS)
• Scott Statland (MIAP)
• Donald Mennerich (Lib/ACM/DLTS)
• David Millman (Lib/DLTS)
• Courtney Mumma (Internet Archive)
• Robin Preiss (Lib/AFC)
• Lorena Ramirez (MIAP) --- special thanks!
• Michael Stoller (Lib/C&RS)
• Kent Underwood (Lib/AFC)
• Chela Scott Weber (Lib/AFC) -- departed
OUR TECHNICAL COLLABORATION: CRAWLING
NYU/IA Collaboration
Traditional Crawlers
• Archive-It and other web archives use Heritrix
• Follow links, capture most web content
• Less successful with streaming video and dynamic content executed in the browser
• Umbra helps
BROZZLER!
“browser” | “crawler” = BROZZLER
Logo: Noah Levitt
Brozzler System Architecture v1
Brozzler Model
• job: collection of seeds
• seed: principal unit of crawl configuration
– one browser works on one seed at a time (politeness)
– seed has its own configuration, also inherits from parent job
• page: atomic unit of crawling from brozzler perspective
• url: only browsers, warcprox have to deal with every url
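The job/seed relationship described on this slide can be sketched in a few lines of Python. This is not Brozzler's code: the `Job` and `Seed` classes, the `claim_seed` helper, and the configuration keys (`max_hops`, `capture_video`) are hypothetical names chosen for illustration. The sketch shows only two ideas from the slide: a seed's settings override those inherited from its parent job, and a seed is handed to at most one browser at a time.

```python
from dataclasses import dataclass, field


@dataclass
class Seed:
    """A seed URL plus any per-seed configuration overrides (hypothetical)."""
    url: str
    overrides: dict = field(default_factory=dict)


@dataclass
class Job:
    """A collection of seeds sharing job-level configuration (hypothetical)."""
    config: dict
    seeds: list = field(default_factory=list)

    def seed_config(self, seed):
        # Job settings are the defaults; the seed's own settings win on conflict.
        return {**self.config, **seed.overrides}


def claim_seed(job, busy_urls):
    """Return the first seed no browser is currently working on, or None."""
    for seed in job.seeds:
        if seed.url not in busy_urls:
            return seed
    return None


job = Job(config={"max_hops": 2, "capture_video": True})
job.seeds.append(Seed("http://example.org/a", overrides={"max_hops": 1}))
job.seeds.append(Seed("http://example.org/b"))

first = job.seed_config(job.seeds[0])
next_seed = claim_seed(job, busy_urls={"http://example.org/a"})
```

Here the first seed crawls only one hop deep while still inheriting `capture_video` from the job, and a second browser asking for work is given the other seed.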
Warcprox: WARC-writing http proxy
• man-in-the-middle for https
• asynchronous: WarcWriterThread
– writes warc records
– saves deduplication info
– updates statistics
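The asynchronous writer pattern named on this slide can be illustrated with a queue and a background thread. The Python sketch below is not warcprox's implementation: `WriterThread`, its in-memory `records` list (standing in for a WARC file), and the SHA-1 digest key are assumptions made for the example. It shows the three duties from the slide: write a record, remember what has been seen for deduplication, and update statistics.

```python
import hashlib
import queue
import threading


class WriterThread(threading.Thread):
    """Hypothetical stand-in for an asynchronous WARC writer thread."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()  # the proxy side puts (url, payload) here
        self.records = []           # stands in for a WARC file on disk
        self.seen = {}              # payload digest -> URL of first capture
        self.stats = {"urls": 0, "new_bytes": 0, "duplicates": 0}

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:        # sentinel: no more captures
                break
            url, payload = item
            digest = hashlib.sha1(payload).hexdigest()
            self.stats["urls"] += 1
            if digest in self.seen:
                # Same bytes already stored: note a revisit instead of a new copy.
                self.records.append(("revisit", url, self.seen[digest]))
                self.stats["duplicates"] += 1
            else:
                self.seen[digest] = url
                self.records.append(("response", url, digest))
                self.stats["new_bytes"] += len(payload)


writer = WriterThread()
writer.start()
writer.inbox.put(("http://example.org/a.mp3", b"audio-bytes"))
writer.inbox.put(("http://example.org/b.mp3", b"audio-bytes"))
writer.inbox.put(None)
writer.join()
```

Because the writing happens on its own thread, the proxy can keep relaying traffic to the browser while records are stored.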
Other pieces
• python wayback
• Rethinkdb (distributed document store)
Stream capture relies on youtube-dl
https://rg3.github.io/youtube-dl/supportedsites.html
OUR COLLABORATION WITH CONTENT CREATORS, IP ISSUES
Young Composers Corpus
• Began with NPR’s 2011 list of “100 Composers Under 40”
• 91 of 100 have own self-contained sites
• As of 5/2016 had written agreements with 165 Composers (25 of them from NPR’s list)
• Will recruit 10 of them for enhanced archiving (uncompressed; better than what is on website)
– This will require an added appendix to contract/agreement (which may involve dark archiving and/or restricted access)
Building relationships with Composers
• Engage them with the idea of preserving their Website
• Are they willing to give us richer versions of content on their site?
• Are they willing to make all (or just part) of the content freely accessible? Do they want to embargo some content in a dark archive?
• Donor Agreement/Contract
Donor Agreement/Contract
• Have been working on this with lawyers for over a year
• Have had fairly stable language in it and some contracts already signed and returned
• Does default to allowing us complete rights for reformatting and for allowing researchers to see/hear all high quality versions at minimum on-site
– And thus far all Composers contacted have agreed to those principles (but not necessarily to the contractual language)
Contract Intro tentative language
• NYU and Composer wish to establish long-term preservation of the materials listed at the highest possible quality. The Parties wish to enter into this Agreement to establish guidelines and standards with regard to ongoing and future library processes related to such preservation.
Elements in the Contract
• What is being acquired
• Terms of Transfer
• Terms of user Access
• Rights & Responsibilities (both NYU & Composer)
• Appendix describing each item (format, content, amount, other pertinent descriptors)
• Appendix with Access Restrictions
4 possible Levels of Streaming Access
• Available for copy-protected streaming from the NYU Libraries’ website with unrestricted access by the general public.
• Available for copy-protected streaming from the NYU Libraries’ website
– with access limited to registered NYU faculty and students and
– to external researchers with eligibility to use NYU Libraries’ archival resources according to NYU Libraries’ general access policies, with password authentication, on or off campus.
• Available for copy-protected streaming on NYU Libraries premises, at designated workstations, with access mediated by NYU Libraries personnel.
• Not available for streaming until a designated future date.
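The four access levels lend themselves to a small decision function. The Python sketch below is one possible reading, not NYU's implementation: the `AccessLevel` names and the `may_stream` signature are hypothetical, and it assumes that embargoed material becomes streamable once the designated date has passed, a point the slide leaves open.

```python
from datetime import date
from enum import Enum


class AccessLevel(Enum):
    """The four levels from the slide; the names are made up for this sketch."""
    PUBLIC = 1               # unrestricted streaming for the general public
    NYU_AND_RESEARCHERS = 2  # password-authenticated NYU users and eligible researchers
    ON_SITE = 3              # designated workstations on NYU Libraries premises
    EMBARGOED = 4            # no streaming until a designated future date


def may_stream(level, *, authenticated=False, on_premises=False,
               today=None, release_date=None):
    """Return True if a stream may be served under the given access level."""
    if level is AccessLevel.PUBLIC:
        return True
    if level is AccessLevel.NYU_AND_RESEARCHERS:
        return authenticated
    if level is AccessLevel.ON_SITE:
        return on_premises
    if level is AccessLevel.EMBARGOED:
        # Assumption: streaming opens once the embargo date is reached.
        return (today is not None and release_date is not None
                and today >= release_date)
    return False


public_ok = may_stream(AccessLevel.PUBLIC)
offsite_denied = may_stream(AccessLevel.ON_SITE, authenticated=True)
embargo_lifted = may_stream(AccessLevel.EMBARGOED,
                            today=date(2030, 1, 1), release_date=date(2025, 1, 1))
```

Keeping the level as a single field per item matches the contract's access-restrictions appendix, where each item is assigned one level.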
Tentative pieces of the Contract
• The uncompressed master files of Materials licensed for inclusion will be made available to the Libraries to enable the research and development of higher quality tools and processes for archiving on the Web and successor technology. The resultant high-quality copies of Composer’s website, incorporating the best quality media files, will be preserved as historical documents in the archive, which will be accessible worldwide on the Web or successor technology as a storehouse of cultural memory and a vehicle for research and scholarship. Composer retains existing rights to his or her Materials, subject to the license granted in this Agreement.
Tentative pieces of the Contract
• non-exclusive worldwide, perpetual, irrevocable, royalty-free right to produce, use, copy, and distribute Derivative Works
• strictly limited to reformatted digital files or to excerpts and abridgements (such as thumbnails) created for the technical purposes of building, preserving, and providing access to the Web archive over the World Wide Web or its successor
• may be used only for the non-profit educational and research purposes provided under this Agreement
• Agreement does not affect or transfer any copyrights or other intellectual property rights
ARCHITECTURE & WORKFLOWS
Architecture & Workflows
• The Finding Aids are generated from ArchivesSpace (which contains rich metadata)
• There is an overall Composers Finding Aid, as well as a separate Finding Aid for each composer (listing inventory and web archives, and link to assets)
• Web archive is stored in Archive-It; richer content in NYU Repository
• Connections built off of ArchivesSpace back-end API
Demo Site
Software & Service Components
• IA’s Archive-It
• NYU digital library internal components
– Aeon for workflow management
– ArchivesSpace
– EAD
Recent Development work
• Supplying a separate audio player?
• Hiring a Digital Archivist
• Still working on precise forms of navigation btwn ArchivesSpace, Archive-It, and richer content within NYU’s digital repository
• Example of work done on IA’s API
Interim work on API to IA
• What IA needs from NYU API
– API URL
– Credentials (username, password) -> AuthenticationToken()
– Repository ID
– Resource ID
• What IA will return as JSON array
– UnitTitle
– Creator
– DataExpression
– ExtentStatement
– TechCharacteristics
– [Something Based on Access Restriction, i.e. can it be streamed] ???
• We Speak Etruscan, 1993 May 21, 23.5 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
• The Dream of Innocence III, 1998 March 26, 150 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
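As a rough illustration of the JSON array described on this slide, the Python sketch below turns the two example items into records keyed by the field names listed there. The mapping of each comma-separated part to a field is an assumption, the `parse_item` helper is hypothetical, and `Creator` is left out because the slide's examples do not show one. The real API's output may be shaped differently.

```python
import json

# Field names as they appear on the slide, in the order the example items
# seem to follow. This ordering is an assumption.
FIELDS = ["UnitTitle", "DataExpression", "ExtentStatement", "TechCharacteristics"]


def parse_item(line):
    """Split an example line into at most four comma-separated fields."""
    parts = [part.strip() for part in line.split(",", 3)]
    return dict(zip(FIELDS, parts))


items = [
    "We Speak Etruscan, 1993 May 21, 23.5 MB, "
    "1 AIFF file Stereo uncompressed 16 bit/44.1K",
    "The Dream of Innocence III, 1998 March 26, 150 MB, "
    "1 AIFF file Stereo uncompressed 16 bit/44.1K",
]

records = [parse_item(line) for line in items]
payload = json.dumps(records, indent=2)  # the JSON array a client would receive
```

Splitting on commas works only because neither example title contains a comma. A real response would carry the fields separately rather than as one string.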
HOW THINGS MAY LOOK
Query paths still under development
One option for User Queries
• User browses through Archive-It
• User sees that A/V content exists (and in some cases, it will include richer content, but some of that might be access-restricted)
• Archive-It hands off user to NYU (either directly to A/V content, or to Finding Aid)
One option for Queries
One option for high quality content
• On archived website page listing composer’s content, user sees a message that higher quality content is available, with:
– Access restrictions, if applicable
– Link to relevant finding aid
– (looking like following image)
Demo from API side
http://composers.dlib.nyu.edu/
From the Library Finding Aid side
http://dlib.nyu.edu/findingaids/html/fales/mss_479/
From the Library Finding Aid side (cont)
From the Library Finding Aid side (Container List)
From the Library Finding Aid side
http://dlib.nyu.edu/findingaids/html/fales/mss_460/dscaspace_7951feea619b6c41436c556e0674d1c8.html
From the Archive-It side
https://archive-it.org/collections/7872
From the Archive-It side
https://archive-it.org/collections/7872?q=http%3A%2F%2Fwww.bitrosie.com&show=SeedVideos&fc=seedId%3A1157594
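The Archive-It URL on this slide packs its parameters into a percent-encoded query string. The short Python sketch below decodes it with the standard library so the parts are readable. The interpretation of each parameter (seed URL, view, facet) is a guess from the names and is not taken from Archive-It documentation.

```python
from urllib.parse import parse_qs, urlsplit

url = ("https://archive-it.org/collections/7872"
       "?q=http%3A%2F%2Fwww.bitrosie.com&show=SeedVideos&fc=seedId%3A1157594")

parts = urlsplit(url)
params = parse_qs(parts.query)  # percent-decoding happens here

# The last path segment appears to be the collection number.
collection_id = parts.path.rstrip("/").split("/")[-1]

seed_url = params["q"][0]   # the composer site being looked up
view = params["show"][0]    # which listing is requested
facet = params["fc"][0]     # a "name:value" filter
```

Decoded, the query asks collection 7872 for `http://www.bitrosie.com`, with `show` set to `SeedVideos` and the facet `seedId:1157594`.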
From any direction, user might need to authenticate
SOME OTHER INTERNAL TRACKING
Crawl Records
EVALUATION
Evaluation for Improvement
• Composers and their satisfaction with the ways in which audiences will be able to view archives of their websites
• Researchers, and whether the content and functionality of these web archives works for them
• Tweaking what we do in order to better serve Creators and Researchers
Schedule and Methodology for Evaluation
• Jan 2018: Schedule one-on-one interviews with sets of composers and Researchers
• Feb-Mar 2017: One hour individual sessions with 10 Composers and also with 10 Researchers, having them look at the user interface and conduct queries
– Composers: Are they satisfied with how audiences will be able to view the archival copies of their websites? Is it better or worse than their own live sites? Are they satisfied with the audio and video placement and quality (as well as options)? Are they content with the Donor Agreement? What changes/improvements might be made to any of these?
– Researchers: Can they find what they need in the web archive? Is it difficult (clunky) to use? What parts don’t work well or aren’t intuitive? We want to identify what changes in the content, functionality, or navigation features would improve their user experience
• Apr-May 2017: Construction of Evaluation Summary containing the list of improvements/changes that should be made to the Archiving project
• June-Aug 2017: Implement the changes
IMPACT BEYOND THIS PROJECT
Impact Beyond this Project
• Archive-It will be able to better handle streaming media, and display it in proper context
• We will have architectures and workflows for Archive-It to interact with richer local resources (as well as examples of how interaction and navigation can proceed btwn Archive-It, ArchivesSpace, Finding Aids, and an internal digital repository)
• Models for interaction btwn creators and collecting organizations will have been developed (incl donor agreements)
• We will have preserved 100+++ websites of young composers
Making Web Archiving Work for Streaming Media
• http://besser.tsoa.nyu.edu/howard/Talks/
• http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html
• http://archive.org/~nlevitt/reveal.js/
• http://composers.dlib.nyu.edu/
• https://rg3.github.io/youtube-dl/supportedsites.html