Archiving websites containing streaming media: the Music Composer Project

Howard Besser, NYU
http://besser.tsoa.nyu.edu/howard/Talks/

Besser-IIPC 13/11/2018 1

Archiving websites containing streaming media: the Music Composer Project

•  The Problem with Heritrix and Archive-It
•  The Project
–  Our Technical Collaboration
–  Our Collaboration with Content Creators & restrictions
–  Architectures & Workflows
–  How things look
–  Evaluation
•  Impact beyond this Project

•  Caveat I: This is an in-progress report; the project is unfinished
•  Caveat II: I am not involved in system architecture & hand-offs, so may not be able to answer detailed questions in these areas

PROBLEMS WITH HERITRIX AND ARCHIVE-IT

Archive-It

•  The leading application/service for curated web archiving in North America
•  Run by the Internet Archive, and is much more targeted and curated than their Wayback Machine
•  Is based on crawler software developed by IA (Heritrix) in 2003-2004
•  Is very poor at capturing streaming audio or video, as well as inserting it properly into a composed web page

Archive-It Issues w/ Streaming Media

Archive-It screenshots generated as part of our project

•  By Lorena Ramirez-López

Archive-It Issues w/ Streaming Media

Firefox version 39.0. Screenshot of Tarik O’Regan’s site taken 2015/10/05
Firefox version 39.0. Screenshot of Ted Hearne’s website taken 2015/10/05

Some sources of streaming issues

•  Problems with capturing resources residing on 3rd party services (YouTube, Vimeo, Soundcloud)
•  Problems with how faithfully the A/V materials are captured and placed by Archive-It
•  Problems with websites generated through site building platforms such as Squarespace

Other Issues we’re trying to solve

•  Discovering URLs generated by JavaScript

THE PROJECT

Archiving Composer Websites
http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html

•  Collect, preserve, & make available Websites of Composers
•  $480,000 grant from Mellon in 2015 to NYU Library/MIAP/Internet Archive
•  Dealing with the issue that contemporary composer websites go up and down (and also incorporate relationship-building btwn composer and fans)
•  Addressing the problems of collecting streaming media
•  Also selectively collecting high-quality versions that are used to generate the streams, and allowing future researchers to see/hear the higher quality versions

Archiving Composer Websites

•  Develop good and ongoing relationships btwn Libraries and Composers
•  Develop Trust
–  for developing collections, and continuing to add to them
–  for Policy reasons
•  Examine what type of errors take place
–  how faithfully audiovisual materials are being captured
–  how resources that reside on third-party web-services (YouTube, Vimeo, Soundcloud) are (not) displayed within Archive-It’s interface
–  Issues w/ websites generated through site building platforms such as Squarespace
•  Find ways to fix those errors

Some methods used

•  Began with NPR’s list of 100 important composers under 40, and augmented the list with faculty and librarian suggestions
•  Identified website infrastructures encountered and created a classification matrix

Website Infrastructure encountered

Project Team

•  Jefferson Bailey (Internet Archive)
•  Howard Besser (MIAP)
•  Lori Donovan (Internet Archive)
•  April Hathcock (Lib/ScholComm)
•  Nicole Greenhouse (Lib/ACM)
•  Carol Kassel (Lib/DLTS)
•  Scott Statland (MIAP)
•  Donald Mennerich (Lib/ACM/DLTS)
•  David Millman (Lib/DLTS)
•  Courtney Mumma (Internet Archive)
•  Robin Preiss (Lib/AFC)
•  Lorena Ramirez (MIAP) (special thanks!)
•  Michael Stoller (Lib/C&RS)
•  Kent Underwood (Lib/AFC)
•  Chela Scott Weber (Lib/AFC) (departed)

OUR TECHNICAL COLLABORATION: CRAWLING

NYU/IA Collaboration

Traditional Crawlers

•  Archive-It and other web archives use Heritrix
•  Follow links, capture most web content
•  Less successful with streaming video and dynamic content executed in the browser
•  Umbra helps
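
To illustrate why link-following alone struggles with content executed in the browser, here is a minimal sketch of a static link extractor. It is not Heritrix's actual extraction logic; the class name, the sample page, and the media.example.org URL are all hypothetical. The point is only that a URL assembled by a script never appears as an attribute in the markup, so a parser that does not run the script cannot see it.

```python
from html.parser import HTMLParser


class StaticLinkExtractor(HTMLParser):
    """Collects href/src attribute values that appear literally in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


# Hypothetical page: one ordinary link, plus an audio element whose URL
# only exists after the script has run in a browser.
PAGE = """
<a href="/works.html">Works</a>
<div id="player"></div>
<script>
  var audio = document.createElement("audio");
  audio.src = "https://media.example.org/stream/" + "track7.mp3";
  document.getElementById("player").appendChild(audio);
</script>
"""


def extract_static_links(html):
    """Return the links a non-executing parser can find in the HTML."""
    parser = StaticLinkExtractor()
    parser.feed(html)
    return parser.links


print(extract_static_links(PAGE))
```

The script-built stream URL is absent from the result, which is the gap that browser-based crawling is meant to close.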

BROZZLER!

“browser” | “crawler” = BROZZLER

Logo: Noah Levitt

Brozzler System Architecture v1

Brozzler Model

•  job: collection of seeds
•  seed: principal unit of crawl configuration
–  one browser works on one seed at a time (politeness)
–  seed has its own configuration, also inherits from parent job
•  page: atomic unit of crawling from brozzler perspective
•  url: only browsers, warcprox have to deal with every url
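
The job/seed relationship above can be sketched roughly as follows. This is an illustration of the model as described on the slide, not brozzler's own classes: the names `Job`, `Seed`, `effective_config`, `claim`, and the configuration keys `time_limit` and `capture_video` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    """A job is a collection of seeds plus configuration shared by all of them."""
    job_id: str
    config: dict = field(default_factory=dict)
    seeds: list = field(default_factory=list)

    def add_seed(self, url, **overrides):
        seed = Seed(url=url, job=self, overrides=overrides)
        self.seeds.append(seed)
        return seed


@dataclass
class Seed:
    """A seed is the principal unit of crawl configuration."""
    url: str
    job: Job = field(repr=False, compare=False)
    overrides: dict = field(default_factory=dict)
    claimed_by: Optional[str] = None

    def effective_config(self):
        # Start from the parent job's settings, then apply the seed's own.
        merged = dict(self.job.config)
        merged.update(self.overrides)
        return merged

    def claim(self, browser_id):
        """Allow only one browser per seed at a time (politeness)."""
        if self.claimed_by is not None:
            return False
        self.claimed_by = browser_id
        return True

    def release(self):
        self.claimed_by = None


job = Job("composers", config={"time_limit": 600, "capture_video": True})
seed = job.add_seed("http://example.org/", time_limit=60)
print(seed.effective_config())
```

Keeping the overrides separate from the job's configuration means a later change to the job is still picked up by every seed that has not overridden that setting.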

Warcprox: WARC-writing http proxy

•  man-in-the-middle for https
•  asynchronous: WarcWriterThread
–  writes warc records
–  saves deduplication info
–  updates statistics
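
The asynchronous writer pattern named on this slide can be sketched as a queue drained by a single background thread. This is not warcprox's code: the class `WriterThread`, its in-memory `records` list (standing in for a WARC file), and the tuple layout are assumptions made for illustration, and real WARC records carry many more fields.

```python
import hashlib
import queue
import threading


class WriterThread(threading.Thread):
    """Drains captured responses from a queue so the proxy never blocks on writes."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.records = []        # stand-in for records appended to a WARC file
        self.seen_digests = {}   # deduplication info: payload digest -> first URL
        self.stats = {"urls": 0, "bytes": 0, "duplicates": 0}

    def submit(self, url, payload):
        """Called by the proxy side; returns immediately."""
        self.inbox.put((url, payload))

    def close(self):
        """Signal the end of input and wait for pending writes to finish."""
        self.inbox.put(None)
        self.join()

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:
                break
            url, payload = item
            digest = hashlib.sha1(payload).hexdigest()
            if digest in self.seen_digests:
                # Same payload seen before: note a revisit instead of a full copy.
                self.records.append(("revisit", url, digest))
                self.stats["duplicates"] += 1
            else:
                self.seen_digests[digest] = url
                self.records.append(("response", url, digest))
            self.stats["urls"] += 1
            self.stats["bytes"] += len(payload)


writer = WriterThread()
writer.start()
writer.submit("http://example.org/a.mp3", b"audio-bytes")
writer.submit("http://example.org/", b"page")
writer.submit("http://example.org/b.mp3", b"audio-bytes")
writer.close()
print(writer.stats)
```

Because a single thread owns the records, the deduplication table, and the counters, no locking is needed around them.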

Other pieces

•  python wayback
•  RethinkDB (distributed document store)

Stream capture relies on youtube-dl
https://rg3.github.io/youtube-dl/supportedsites.html

OUR COLLABORATION WITH CONTENT CREATORS, IP ISSUES

Young Composers Corpus

•  Began with NPR’s 2011 list of “100 Composers Under 40”
•  91 of 100 have own self-contained sites
•  Within a year of starting we had written agreements with 165 Composers (25 of them from NPR’s list)
•  Planned to recruit 10 of them for enhanced archiving (uncompressed; better than what is on website)
–  This will require an added appendix to contract/agreement (which may involve dark archiving and/or restricted access)

Building relationships with Composers

•  Engage them with the idea of preserving their Website
•  Are they willing to give us richer versions of content on their site?
•  Are they willing to make all (or just part) of the content freely accessible? Do they want to embargo some content in a dark archive?
•  Donor Agreement/Contract

Donor Agreement/Contract

•  Worked on this with lawyers for well over a year
•  Have had fairly stable language in it and many contracts already signed and returned
•  Does default to allowing us complete rights for reformatting and for allowing researchers to see/hear all high quality versions at minimum on-site
–  And thus far all Composers contacted have agreed to those principles (but not necessarily to the contractual language)

Contract Intro tentative language

•  NYU and Composer wish to establish long-term preservation of the materials listed at the highest possible quality. The Parties wish to enter into this Agreement to establish guidelines and standards with regard to ongoing and future library processes related to such preservation.

Elements in the Contract

•  What is being acquired
•  Terms of Transfer
•  Terms of user Access
•  Rights & Responsibilities (both NYU & Composer)
•  Appendix describing each item (format, content, amount, other pertinent descriptors)
•  Appendix with Access Restrictions

4 possible Levels of Streaming Access

•  Available for copy-protected streaming from the NYU Libraries’ website with unrestricted access by the general public.
•  Available for copy-protected streaming from the NYU Libraries’ website
–  with access limited to registered NYU faculty and students and
–  to external researchers with eligibility to use NYU Libraries’ archival resources according to NYU Libraries’ general access policies, with password authentication, on or off campus.
•  Available for copy-protected streaming on NYU Libraries premises, at designated workstations, with access mediated by NYU Libraries personnel.
•  Not available for streaming until a designated future date.
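
The four levels above lend themselves to a small decision function. The sketch below is only one way of encoding them; the names `AccessLevel` and `may_stream` and the keyword flags are hypothetical. The slide does not say which level applies once an embargo date has passed, so treating the item as streamable from that date on is an assumption.

```python
from datetime import date
from enum import Enum


class AccessLevel(Enum):
    PUBLIC = 1         # unrestricted streaming from the library website
    AUTHENTICATED = 2  # NYU users and eligible external researchers, with password
    ON_SITE = 3        # designated workstations, mediated by library personnel
    EMBARGOED = 4      # no streaming until a designated future date


def may_stream(level, *, authenticated=False, on_site_mediated=False,
               release_date=None, today=None):
    """Return True if a request with these properties may stream the item."""
    if level is AccessLevel.PUBLIC:
        return True
    if level is AccessLevel.AUTHENTICATED:
        return authenticated
    if level is AccessLevel.ON_SITE:
        return on_site_mediated
    if level is AccessLevel.EMBARGOED:
        # Assumption: without a designated date the item stays closed, and
        # after the date it is simply open.
        if release_date is None:
            return False
        return (today or date.today()) >= release_date
    raise ValueError(f"unknown access level: {level!r}")


print(may_stream(AccessLevel.AUTHENTICATED, authenticated=True))
```

Keeping the level as an explicit enumeration makes it straightforward to record one value per item in the access-restrictions appendix of an agreement.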

Tentative pieces of the Contract

•  The uncompressed master files of Materials licensed for inclusion will be made available to the Libraries to enable the research and development of higher quality tools and processes for archiving on the Web and successor technology. The resultant high-quality copies of Composer’s website, incorporating the best quality media files, will be preserved as historical documents in the archive, which will be accessible worldwide on the Web or successor technology as a storehouse of cultural memory and a vehicle for research and scholarship. Composer retains existing rights to his or her Materials, subject to the license granted in this Agreement.

Tentative pieces of the Contract

•  non-exclusive worldwide, perpetual, irrevocable, royalty-free right to produce, use, copy, and distribute Derivative Works
•  strictly limited to reformatted digital files or to excerpts and abridgements (such as thumbnails) created for the technical purposes of building, preserving, and providing access to the Web archive over the World Wide Web or its successor
•  may be used only for the non-profit educational and research purposes provided under this Agreement
•  Agreement does not affect or transfer any copyrights or other intellectual property rights

ARCHITECTURE & WORKFLOWS

Architecture & Workflows

•  The Finding Aids are generated from ArchivesSpace (which contains rich metadata)
•  There is an overall Composers Finding Aid, as well as a separate Finding Aid for each composer (listing inventory and web archives, and link to assets)
•  Web archive is stored in Archive-It; richer content in NYU Repository
•  Connections built off of ArchivesSpace back-end API

Demo Site

Software & Service Components

•  IA’s Archive-It
•  NYU digital library internal components
–  Aeon for workflow management
–  ArchivesSpace
–  EAD

Unfinished Development work

•  Supplying a separate audio player?
•  Still working on precise forms of navigation btwn ArchivesSpace, Archive-It, and richer content within NYU’s digital repository
•  What will be on the workstation for items that need to be looked at on-site?
•  Issues with streams that were not captured
•  Example of work done on IA’s API

Interim work on API to IA

•  What IA needs from NYU API
–  API URL
–  Credentials (username, password) -> Authentication Token()
–  Repository ID
–  Resource ID
•  What IA will return as JSON array
–  Unit Title
–  Creator
–  Data Expression
–  Extent Statement
–  Tech Characteristics
–  [Something Based on Access Restriction, i.e. can it be streamed]???
•  We Speak Etruscan, 1993 May 21, 23.5 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
•  The Dream of Innocence III, 1998 March 26, 150 MB, 1 AIFF file Stereo uncompressed 16 bit/44.1K
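
A JSON array carrying the fields listed above could be consumed along these lines. The slide gives field labels, not a wire format, so the JSON key names, the `ResourceItem` class, the split between extent and technical characteristics, and the placeholder creator value are all assumptions; only the two sample rows come from the slide.

```python
import json
from dataclasses import dataclass


@dataclass
class ResourceItem:
    unit_title: str
    creator: str
    data_expression: str       # slide label; the sample values are dates
    extent_statement: str
    tech_characteristics: str


# Hypothetical response body built from the two example rows on the slide.
SAMPLE_RESPONSE = json.dumps([
    {"unit_title": "We Speak Etruscan",
     "creator": "(not given on slide)",
     "data_expression": "1993 May 21",
     "extent_statement": "23.5 MB",
     "tech_characteristics": "1 AIFF file Stereo uncompressed 16 bit/44.1K"},
    {"unit_title": "The Dream of Innocence III",
     "creator": "(not given on slide)",
     "data_expression": "1998 March 26",
     "extent_statement": "150 MB",
     "tech_characteristics": "1 AIFF file Stereo uncompressed 16 bit/44.1K"},
])


def parse_items(body):
    """Turn the JSON array into typed records, failing loudly on missing keys."""
    return [ResourceItem(**entry) for entry in json.loads(body)]


def summarize(item):
    """Format one record the way the slide's example lines are laid out."""
    return ", ".join([item.unit_title, item.data_expression,
                      item.extent_statement, item.tech_characteristics])


for item in parse_items(SAMPLE_RESPONSE):
    print(summarize(item))
```

A streamability flag derived from the access restriction, which the slide marks as undecided, would be one more field on the record.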

HOW THINGS MAY LOOK

Query paths still under development

One option for User Queries

•  User browses through Archive-It
•  User sees that A/V content exists (and in some cases, it will include richer content, but some of that might be access-restricted)
•  Archive-It hands off user to NYU (either directly to A/V content, or to Finding Aid)

One option for Queries

One option for high quality content

•  On archived website page listing composer’s content, user sees a message that higher quality content is available, with:
–  Access restrictions, if applicable
–  Link to relevant finding aid
–  (looking like following image)

Demo from API side
http://composers.dlib.nyu.edu/

From the Library Finding Aid side
http://dlib.nyu.edu/findingaids/html/fales/mss_479/

From the Library Finding Aid side (cont)

From the Library Finding Aid side (Container List)

From the Library Finding Aid side
http://dlib.nyu.edu/findingaids/html/fales/mss_460/dscaspace_7951feea619b6c41436c556e0674d1c8.html

From the Archive-It side
https://archive-it.org/collections/7872

From the Archive-It side
https://archive-it.org/collections/7872?q=http%3A%2F%2Fwww.bitrosie.com&show=SeedVideos&fc=seedId%3A1157594
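
The percent-encoded query string in the second Archive-It address is easier to read once decoded. The sketch below uses only Python's standard library; the helper name `describe_query` is hypothetical, and what Archive-It does with the `show` and `fc` parameters is not documented here.

```python
from urllib.parse import parse_qs, urlsplit

URL = ("https://archive-it.org/collections/7872"
       "?q=http%3A%2F%2Fwww.bitrosie.com&show=SeedVideos&fc=seedId%3A1157594")


def describe_query(url):
    """Split a URL into its path and a dict of decoded query parameters."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; each key occurs once here.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


path, params = describe_query(URL)
print(path)
for key, value in params.items():
    print(f"{key} = {value}")
```

Decoded, `q` holds the seed site's address and `fc` a `seedId:` filter, which suggests the page lists the videos captured for that one seed.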

From any direction, user might need to authenticate

SOME OTHER INTERNAL TRACKING

Crawl Records

EVALUATION

Evaluation for Improvement

•  Composers and their satisfaction with the ways in which audiences will be able to view archives of their websites (improving usability)
•  Researchers, and whether the content and functionality of these web archives works for them (content presentation)
•  Tweaking what we do in order to better serve Creators and Researchers
•  Finding out whether captures really worked

Findings still being analyzed

•  Streaming captures appear more successful, but we still experience some streaming capture problems
•  Need further exploration to see the precise cause of the crawler/capture issues (& rectify them if possible)

Crawler Issues (broken header links)

Crawler Issues (failed video capture)

Crawler Issues (video capture failure)

Crawler Issues (Flash video issue)

Crawler Issues (video captured without audio)

Crawler Issues (broken video links)

Crawler Issues (1 audio not captured)

Crawler Issues (audio not captured)

Crawler Issues (audio failure & anchor problem)

Crawler Issues (partial capture failure)

Crawler Issues (incomplete loading)

Crawler Issues (Capture issues)

Crawler Issues (unknown problems)

Crawler Issues

•  Campjulie.com:
–  Any capture date: If very slow load time, hard to tell if was working or not, so some subjects gave up. [Site owner says this is inherent to site, so might not be a capture problem.]
–  Discrepancies between when one hop out is captured or not.
•  Kmariekim.com:
–  Sep 26, 2017 capture (latest capture): Attempts to play music from archived tumblr page from various platforms (youtube, soundcloud, etc.).
•  Bitrosie.com:
–  All capture dates: links take roughly 5 minutes (assumed broken at first)
•  Adelefournet.com/video/:
–  Sep 12, 2017 capture: Video error after roughly 10 seconds. Stops playing "Berets of Mary Jean Place", and starts playing another video with opening title "Barranco District, Lima, Peru". The rest of the videos on the page do not play. Link to "Berets of Mary Jean Place" on the Internet Archive also plays incorrect video ("Barranco District, Lima, Peru").
•  Michael Robinson archived website: Error message

Evaluation Results

•  The subjects were basically satisfied with the captures, but had very many suggestions for improvements with labeling, searching, display, and performance. Most also wanted additional functionality.
•  Many of the subjects were confused between captured sites and the Finding Aids for them. In addition, the words “Papers of” in collection titles baffled people when they were looking for recordings, not papers.
•  Both users and site owners were unclear about the scope of content that had actually been collected. One site owner expressed disappointment that reviews that they linked to were not captured. And only one subject figured out how to navigate to a suggested “live web” page that had not been archived.

Functionality requested by users

•  Most subjects wanted more metadata displayed. Examples included: displaying a description of the Composers Project and likely contents on the initial start page; display of audio/video run-time instead of file size; description, thumbnails, excerpts for material restricted to onsite use (so that they could decide whether or not they really needed to make a site visit); more fields shown in various displays (both in lists and in links to essence).
•  Both site owners responded positively to the idea of providing a site map with a collapsing menu of links.
•  Most subjects wanted a search box. And most wanted to be able to immediately sort a multi-column display list by any column of their choosing.
•  One subject found it misleading when a restricted object linked to a new page.
•  One site owner preferred that their digital objects be organized by project, rather than in an undifferentiated list of every digital object on their site.

Functionality requested for local workstations

•  Ability to take screen grabs
•  Access to additional browser window
•  Preview frame when scrubbing (fast forwarding) through video material
•  Use of their own laptop or another window
•  Display of timecode
•  And 2 subjects specifically requested the
–  ability to slow video/audio file to transcribe
–  ability to drop pin/attach notes to specific point in video/audio file

IMPACT BEYOND THIS PROJECT

Impact Beyond this Project

•  There will be an alternative to Heritrix for capturing streaming media, and Archive-It will ideally be able to better handle streaming media, and display it in proper context
•  We will have architectures and workflows for Archive-It to interact with richer local resources (as well as examples of how interaction and navigation can proceed btwn Archive-It, ArchivesSpace, Finding Aids, and an internal digital repository)
•  Models for interaction btwn creators and collecting organizations will have been developed (incl donor agreements)
•  We have preserved 100+++ websites of young composers

Archiving websites containing streaming media: the Music Composer Project

•  http://besser.tsoa.nyu.edu/howard/Talks/
•  http://www.nyu.edu/about/news-publications/news/2015/03/27/nyu-libraries-to-team-with-internet-archive-to-preserve-high-quality-musical-content-on-the-web.html
•  http://archive.org/~nlevitt/reveal.js/
•  http://composers.dlib.nyu.edu/
•  https://rg3.github.io/youtube-dl/supportedsites.html