Archiving websites containing streaming media: the Music Composer Project
Howard Besser, NYU http://besser.tsoa.nyu.edu/howard/Talks/
Besser-IIPC 13/11/2018 1
• The Problem with Heritrix and Archive-It
• The Project
  – Our Technical Collaboration
  – Our Collaboration with Content Creators & restrictions
  – Architectures & Workflows
  – How things look
  – Evaluation
• Impact beyond this Project
• Caveat I: This is an in-progress report; the project is unfinished
• Caveat II: I am not involved in system architecture & hand-offs, so may not be able to answer detailed questions in these areas
PROBLEMS WITH HERITRIX AND ARCHIVE-IT
Archive-It
• The leading application/service for curated web archiving in North America
• Run by the Internet Archive, and is much more targeted and curated than their Wayback Machine
• Is based on crawler software developed by IA (Heritrix) in 2003-2004
• Is very poor at capturing streaming audio or video, as well as at inserting it properly into a composed web page
Archive-It Issues w/ Streaming Media (three screenshot slides)
Archive-It screenshots generated as part of our project
• By Lorena Ramirez-López
Archive-It Issues w/ Streaming Media
• Firefox version 39.0. Screenshot of Tarik O’Regan’s site taken 2015/10/05 (three slides)
• Firefox version 39.0. Screenshot of Ted Hearne’s website taken 2015/10/05
Some sources of streaming issues
• Problems with capturing resources residing on 3rd party services (YouTube, Vimeo, SoundCloud)
• Problems with how faithfully the A/V materials are captured and placed by Archive-It
• Problems with websites generated through site-building platforms such as Squarespace
Other Issues we’re trying to solve
• Discovering URLs generated by JavaScript
THE PROJECT
Archiving Composer Websites
http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html
• Collect, preserve, & make available Websites of Composers
• $480,000 grant from Mellon in 2015 to NYU Library/MIAP/Internet Archive
• Dealing with the issue that contemporary composer websites go up and down (and also incorporate relationship-building between composer and fans)
• Addressing the problems of collecting streaming media
• Also selectively collecting high-quality versions that are used to generate the streams, and allowing future researchers to see/hear the higher quality versions
Archiving Composer Websites
• Develop good and ongoing relationships between Libraries and Composers
• Develop Trust
  – for developing collections, and continuing to add to them
  – for Policy reasons
• Examine what types of errors take place
  – how faithfully audiovisual materials are being captured
  – how resources that reside on third-party web services (YouTube, Vimeo, SoundCloud) are (not) displayed within Archive-It’s interface
  – issues with websites generated through site-building platforms such as Squarespace
• Find ways to fix those errors
Some methods used
• Began with NPR’s list of 100 important composers under 40, and augmented the list with faculty and librarian suggestions
• Identified website infrastructures encountered and created a classification matrix
Website Infrastructure encountered
Project Team
• Jefferson Bailey (Internet Archive)
• Howard Besser (MIAP)
• Lori Donovan (Internet Archive)
• April Hathcock (Lib/ScholComm)
• Nicole Greenhouse (Lib/ACM)
• Carol Kassel (Lib/DLTS)
• Scott Statland (MIAP)
• Donald Mennerich (Lib/ACM/DLTS)
• David Millman (Lib/DLTS)
• Courtney Mumma (Internet Archive)
• Robin Preiss (Lib/AFC)
• Lorena Ramirez (MIAP), special thanks!
• Michael Stoller (Lib/C&RS)
• Kent Underwood (Lib/AFC)
• Chela Scott Weber (Lib/AFC), departed
OUR TECHNICAL COLLABORATION: CRAWLING
NYU/IA Collaboration
Traditional Crawlers
• Archive-It and other web archives use Heritrix
• Follow links, capture most web content
• Less successful with streaming video and dynamic content executed in the browser
• Umbra helps
BROZZLER!
“browser” | “crawler” = BROZZLER
Logo: Noah Levitt
Brozzler System Architecture v1
Brozzler Model
• job: collection of seeds
• seed: principal unit of crawl configuration
  – one browser works on one seed at a time (politeness)
  – seed has its own configuration, also inherits from parent job
• page: atomic unit of crawling from brozzler perspective
• url: only browsers, warcprox have to deal with every url
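The job/seed relationship in the Brozzler model can be sketched as a small data model. This is a minimal illustration in Python, not brozzler's actual code: the class layout, the `effective_config`, `claim`, and `release` methods, and the configuration keys (`time_limit`, `user_agent`) are all hypothetical names chosen for the example. It shows a seed inheriting settings from its parent job and the "one browser per seed" politeness rule.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Job:
    """A job is a collection of seeds that share default configuration."""
    config: Dict[str, Any] = field(default_factory=dict)
    seeds: List["Seed"] = field(default_factory=list)

    def add_seed(self, url: str, **overrides: Any) -> "Seed":
        seed = Seed(url=url, job=self, config=dict(overrides))
        self.seeds.append(seed)
        return seed


@dataclass
class Seed:
    """A seed is the principal unit of crawl configuration."""
    url: str
    job: Job = field(repr=False, compare=False)
    config: Dict[str, Any] = field(default_factory=dict)
    claimed_by: Optional[str] = None  # at most one browser per seed

    def effective_config(self) -> Dict[str, Any]:
        # Start from the parent job's settings, then apply the seed's own.
        return {**self.job.config, **self.config}

    def claim(self, browser_id: str) -> bool:
        # Politeness: refuse a second browser while one is already working.
        if self.claimed_by is not None:
            return False
        self.claimed_by = browser_id
        return True

    def release(self) -> None:
        self.claimed_by = None


# Hypothetical job-level defaults, with one seed overriding the time limit.
job = Job(config={"time_limit": 3600, "user_agent": "example-crawler"})
seed = job.add_seed("http://www.bitrosie.com", time_limit=600)
```

Keeping the override merge in one method means a seed never copies job settings, so a later change to the job's defaults is picked up by every seed that has not overridden that key.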
Warcprox: WARC-writing http proxy
• man-in-the-middle for https
• asynchronous: WarcWriterThread
  – writes warc records
  – saves deduplication info
  – updates statistics
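The asynchronous writer pattern named on this slide can be sketched as a queue drained by a background thread. This is a toy stand-in, not warcprox's implementation: only the name `WarcWriterThread` comes from the slide, while the in-memory `records` list (in place of a WARC file), the SHA-1 payload digest used for deduplication, and the statistics keys are assumptions made for illustration.

```python
import hashlib
import queue
import threading


class WarcWriterThread(threading.Thread):
    """Toy writer: drains captured responses from a queue in the background."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.records = []        # stands in for the WARC file on disk
        self.seen_digests = {}   # payload digest -> URL of first capture
        self.stats = {"urls": 0, "bytes": 0, "duplicates": 0}

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:     # sentinel value: stop the thread
                break
            url, payload = item
            digest = hashlib.sha1(payload).hexdigest()
            if digest in self.seen_digests:
                # Same payload seen before: record a pointer, not the bytes.
                self.records.append(("revisit", url, self.seen_digests[digest]))
                self.stats["duplicates"] += 1
            else:
                self.records.append(("response", url, payload))
                self.seen_digests[digest] = url
            self.stats["urls"] += 1
            self.stats["bytes"] += len(payload)


writer = WarcWriterThread()
writer.start()
# The proxy side only enqueues and returns immediately.
writer.inbox.put(("http://example.org/a.mp3", b"audio-bytes"))
writer.inbox.put(("http://example.org/b.mp3", b"audio-bytes"))  # same payload
writer.inbox.put(None)
writer.join()
```

Because the proxy only enqueues, slow disk writes or deduplication lookups would not hold up the browser's requests in this arrangement.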
Other pieces
• python wayback
• RethinkDB (distributed document store)
Stream capture relies on youtube-dl
https://rg3.github.io/youtube-dl/supportedsites.html
OUR COLLABORATION WITH CONTENT CREATORS, IP ISSUES
Young Composers Corpus
• Began with NPR’s 2011 list of “100 Composers Under 40”
• 91 of 100 have their own self-contained sites
• Within a year of starting we had written agreements with 165 Composers (25 of them from NPR’s list)
• Planned to recruit 10 of them for enhanced archiving (uncompressed; better than what is on website)
  – This will require an added appendix to contract/agreement (which may involve dark archiving and/or restricted access)
Building relationships with Composers
• Engage them with the idea of preserving their Website
• Are they willing to give us richer versions of content on their site?
• Are they willing to make all (or just part) of the content freely accessible? Do they want to embargo some content in a dark archive?
• Donor Agreement/Contract
Donor Agreement/Contract
• Worked on this with lawyers for well over a year
• Have had fairly stable language in it, and many contracts already signed and returned
• Does default to allowing us complete rights for reformatting and for allowing researchers to see/hear all high quality versions, at minimum on-site
  – And thus far all Composers contacted have agreed to those principles (but not necessarily to the contractual language)
Contract Intro tentative language
• NYU and Composer wish to establish long-term preservation of the materials listed at the highest possible quality. The Parties wish to enter into this Agreement to establish guidelines and standards with regard to ongoing and future library processes related to such preservation.
Elements in the Contract
• What is being acquired
• Terms of Transfer
• Terms of user Access
• Rights & Responsibilities (both NYU & Composer)
• Appendix describing each item (format, content, amount, other pertinent descriptors)
• Appendix with Access Restrictions
4 possible Levels of Streaming Access
• Available for copy-protected streaming from the NYU Libraries’ website with unrestricted access by the general public.
• Available for copy-protected streaming from the NYU Libraries’ website
  – with access limited to registered NYU faculty and students and
  – to external researchers with eligibility to use NYU Libraries’ archival resources according to NYU Libraries’ general access policies, with password authentication, on or off campus.
• Available for copy-protected streaming on NYU Libraries premises, at designated workstations, with access mediated by NYU Libraries personnel.
• Not available for streaming until a designated future date.
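The four access levels can be sketched as an enumeration plus a single decision function. This is a hedged illustration only: the names `AccessLevel` and `can_stream` and the boolean inputs are hypothetical, and the slide does not say which level applies once an embargo date passes, so the sketch simply assumes the item becomes streamable on that date.

```python
from datetime import date
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    PUBLIC = 1         # unrestricted streaming from the NYU Libraries' website
    AUTHENTICATED = 2  # NYU faculty/students and eligible external researchers
    ON_SITE = 3        # designated workstations, mediated by library personnel
    EMBARGOED = 4      # not available until a designated future date


def can_stream(level: AccessLevel, *, authenticated: bool = False,
               on_site: bool = False,
               embargo_until: Optional[date] = None,
               today: Optional[date] = None) -> bool:
    """Return True if a request with these properties may stream the item."""
    if level is AccessLevel.PUBLIC:
        return True
    if level is AccessLevel.AUTHENTICATED:
        return authenticated
    if level is AccessLevel.ON_SITE:
        return on_site
    # EMBARGOED: assumption for this sketch, streamable once the date arrives.
    today = today or date.today()
    return embargo_until is not None and today >= embargo_until
```

For example, `can_stream(AccessLevel.ON_SITE, authenticated=True)` is false under this sketch, because a password alone does not satisfy the on-premises requirement.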
Tentative pieces of the Contract
• The uncompressed master files of Materials licensed for inclusion will be made available to the Libraries to enable the research and development of higher quality tools and processes for archiving on the Web and successor technology. The resultant high-quality copies of Composer’s website, incorporating the best quality media files, will be preserved as historical documents in the archive, which will be accessible worldwide on the Web or successor technology as a storehouse of cultural memory and a vehicle for research and scholarship. Composer retains existing rights to his or her Materials, subject to the license granted in this Agreement.
Tentative pieces of the Contract
• non-exclusive worldwide, perpetual, irrevocable, royalty-free right to produce, use, copy, and distribute Derivative Works
• strictly limited to reformatted digital files or to excerpts and abridgements (such as thumbnails) created for the technical purposes of building, preserving, and providing access to the Web archive over the World Wide Web or its successor
• may be used only for the non-profit educational and research purposes provided under this Agreement
• Agreement does not affect or transfer any copyrights or other intellectual property rights
ARCHITECTURE & WORKFLOWS
Architecture & Workflows
• The Finding Aids are generated from ArchivesSpace (which contains rich metadata)
• There is an overall Composers Finding Aid, as well as a separate Finding Aid for each composer (listing inventory and web archives, and link to assets)
• Web archive is stored in Archive-It; richer content in NYU Repository
• Connections built off of ArchivesSpace back-end API (Demo Site)
Software & Service Components
• IA’s Archive-It
• NYU digital library internal components
  – Aeon for workflow management
  – ArchivesSpace
  – EAD
Unfinished Development work
• Supplying a separate audio player?
• Still working on precise forms of navigation between ArchivesSpace, Archive-It, and richer content within NYU’s digital repository
• What will be on the workstation for items that need to be looked at on-site?
• Issues with streams that were not captured
• Example of work done on IA’s API
Interim work on API to IA
• What IA needs from NYU API
  – API URL
  – Credentials (username, password) -> AuthenticationToken()
  – Repository ID
  – Resource ID
• What IA will return as JSON array
  – UnitTitle
  – Creator
  – DataExpression
  – ExtentStatement
  – TechCharacteristics
  – [Something Based on Access Restriction, i.e. can it be streamed]???
• We Speak Etruscan, 1993 May 21, 23.5 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
• The Dream of Innocence III, 1998 March 26, 150 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
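The JSON array described on this slide might look like the output of the sketch below, which turns the two example lines into records. Several things here are assumptions rather than the project's actual API: the mapping of the comma-separated parts onto field names is inferred from their order, `DataExpression` is copied as spelled on the slide and is assumed to hold the date, and `Creator` is left out because the examples give none. The helper name `to_record` is hypothetical.

```python
import json

# Field names as listed on the slide, in the order the examples appear to use.
FIELDS = ["UnitTitle", "DataExpression", "ExtentStatement", "TechCharacteristics"]

EXAMPLES = [
    "We Speak Etruscan, 1993 May 21, 23.5 MB, "
    "1 AIFF file Stereo uncompressed 16 bit/44.1K",
    "The Dream of Innocence III, 1998 March 26, 150 MB, "
    "1 AIFF file Stereo uncompressed 16 bit/44.1K",
]


def to_record(line: str) -> dict:
    """Split one example line into at most four parts and label them."""
    parts = [part.strip() for part in line.split(",", 3)]
    return dict(zip(FIELDS, parts))


records = [to_record(line) for line in EXAMPLES]
payload = json.dumps(records, indent=2)  # the JSON array as a string
print(payload)
```

The still-undecided access-restriction element from the slide would presumably become one more key per record once its form is settled.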
HOW THINGS MAY LOOK
Query paths still under development
One option for User Queries
• User browses through Archive-It
• User sees that A/V content exists (and in some cases, it will include richer content, but some of that might be access-restricted)
• Archive-It hands off user to NYU (either directly to A/V content, or to Finding Aid)
One option for Queries
One option for high quality content
• On archived website page listing composer’s content, user sees a message that higher quality content is available, with:
  – Access restrictions, if applicable
  – Link to relevant finding aid
  – (looking like following image)
Demo from API side http://composers.dlib.nyu.edu/
From the Library Finding Aid side http://dlib.nyu.edu/findingaids/html/fales/mss_479/
From the Library Finding Aid side (cont)
From the Library Finding Aid side (Container List)
From the Library Finding Aid side http://dlib.nyu.edu/findingaids/html/fales/mss_460/dscaspace_7951feea619b6c41436c556e0674d1c8.html
From the Archive-It side https://archive-it.org/collections/7872
From the Archive-It side https://archive-it.org/collections/7872?q=http%3A%2F%2Fwww.bitrosie.com&show=SeedVideos&fc=seedId%3A1157594
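The second Archive-It URL packs a percent-encoded seed address and two further parameters into its query string. The sketch below decodes it with Python's standard `urllib.parse`; what each parameter means to Archive-It (`q` as the seed URL, `show` as the view, `fc` as a facet filter) is an interpretation of the names, not documented behavior.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://archive-it.org/collections/7872"
    "?q=http%3A%2F%2Fwww.bitrosie.com&show=SeedVideos&fc=seedId%3A1157594"
)

parts = urlsplit(url)
params = parse_qs(parts.query)                   # percent-decoding is applied
collection_id = parts.path.rsplit("/", 1)[-1]    # last path segment

print(collection_id)   # the collection number from the path
print(params["q"])     # the decoded seed address
print(params["fc"])    # the decoded facet expression
```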
From any direction, user might need to authenticate
SOME OTHER INTERNAL TRACKING
Crawl Records
EVALUATION
Evaluation for Improvement
• Composers and their satisfaction with the ways in which audiences will be able to view archives of their websites (improving usability)
• Researchers, and whether the content and functionality of these web archives work for them (content presentation)
• Tweaking what we do in order to better serve Creators and Researchers
• Finding out whether captures really worked
Findings still being analyzed
• Streaming captures appear more successful, but we still experience some streaming capture problems
• Need further exploration to see the precise cause of the crawler/capture issues (& rectify them if possible)
Crawler Issues (broken header links)
Crawler Issues (failed video capture)
Crawler Issues (video capture failure)
Crawler Issues (Flash video issue)
Crawler Issues (video captured without audio)
Crawler Issues (broken video links)
Crawler Issues (1 audio not captured)
Crawler Issues (audio not captured)
Crawler Issues (audio failure & anchor problem)
Crawler Issues (partial capture failure)
Crawler Issues (incomplete loading)
Crawler Issues (Capture issues)
Crawler Issues (unknown problems)
Crawler Issues
• Campjulie.com:
  – Any capture date: If very slow load time, hard to tell if it was working or not, so some subjects gave up. [Site owner says this is inherent to site, so might not be a capture problem.]
  – Discrepancies between when one hop out is captured or not.
• Kmariekim.com:
  – Sep 26, 2017 capture (latest capture): Attempts to play music from archived tumblr page from various platforms (youtube, soundcloud, etc.).
• Bitrosie.com:
  – All capture dates: links take roughly 5 minutes (assumed broken at first)
• Adelefournet.com/video/:
  – Sep 12, 2017 capture: Video error after roughly 10 seconds. Stops playing "Berets of Mary Jean Place", and starts playing another video with opening title "Barranco District, Lima, Peru". The rest of the videos on the page do not play. Link to "Berets of Mary Jean Place" on the Internet Archive also plays incorrect video ("Barranco District, Lima, Peru").
• Michael Robinson archived website: Error message
Evaluation Results
• The subjects were basically satisfied with the captures, but had very many suggestions for improvements with labeling, searching, display, and performance. Most also wanted additional functionality.
• Many of the subjects were confused between captured sites and the Finding Aids for them. In addition, the words “Papers of” in collection titles baffled people when they were looking for recordings, not papers.
• Both users and site owners were unclear about the scope of content that had actually been collected. One site owner expressed disappointment that reviews that they linked to were not captured. And only one subject figured out how to navigate to a suggested “live web” page that had not been archived.
Functionality requested by users
• Most subjects wanted more metadata displayed. Examples included: displaying a description of the Composers Project and likely contents on the initial start page; display of audio/video run-time instead of file size; description, thumbnails, excerpts for material restricted to onsite use (so that they could decide whether or not they really needed to make a site visit); more fields shown in various displays (both in lists and in links to essence).
• Both site owners responded positively to the idea of providing a sitemap with a collapsing menu of links.
• Most subjects wanted a search box. And most wanted to be able to immediately sort a multi-column display list by any column of their choosing.
• One subject found it misleading when a restricted object linked to a new page.
• One site owner preferred that their digital objects be organized by project, rather than in an undifferentiated list of every digital object on their site.
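One of the requests on this slide, showing run-time instead of file size, can be approximated for uncompressed audio from the technical characteristics alone. The sketch below applies the standard PCM relationship (bytes per second = sample rate × bytes per sample × channels) to the stereo 16 bit/44.1K AIFF examples from the interim API slide. It assumes "MB" means 10^6 bytes and ignores file-header overhead, so the figures are estimates; the function names are hypothetical.

```python
def pcm_runtime_seconds(size_bytes: float, sample_rate: int = 44_100,
                        bit_depth: int = 16, channels: int = 2) -> float:
    """Estimate the duration of uncompressed PCM audio from its size."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return size_bytes / bytes_per_second


def as_minutes_seconds(seconds: float) -> str:
    """Format a duration as M:SS, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


# The two extents quoted in the API examples, read as decimal megabytes.
print(as_minutes_seconds(pcm_runtime_seconds(23.5e6)))  # 2:13
print(as_minutes_seconds(pcm_runtime_seconds(150e6)))   # 14:10
```

This only works for uncompressed material; compressed streams would need the duration read from the file itself.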
Functionality requested for local workstations
• Ability to take screen grabs
• Access to additional browser window
• Preview frame when scrubbing (fast forwarding) through video material
• Use of their own laptop or another window
• Display of timecode
• And 2 subjects specifically requested the
  – ability to slow video/audio file to transcribe
  – ability to drop pin/attach notes to specific point in video/audio file
IMPACT BEYOND THIS PROJECT
Impact Beyond this Project
• There will be an alternative to Heritrix for capturing streaming media, and Archive-It will ideally be able to better handle streaming media, and display it in proper context
• We will have architectures and workflows for Archive-It to interact with richer local resources (as well as examples of how interaction and navigation can proceed between Archive-It, ArchivesSpace, Finding Aids, and an internal digital repository)
• Models for interaction between creators and collecting organizations will have been developed (including donor agreements)
• We have preserved 100+++ websites of young composers
Archiving websites containing streaming media: the Music Composer Project
• http://besser.tsoa.nyu.edu/howard/Talks/
• http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html
• http://archive.org/~nlevitt/reveal.js/
• http://composers.dlib.nyu.edu/
• https://rg3.github.io/youtube-dl/supportedsites.html