gors appropriate
TRANSCRIPT
Appropri-ut in the sense of not inappropriate – the “right thing” to use – as well as appropri-ate, as in co-opt, or use for something it perhaps wasn’t originally intended for.
1
So for example, one thing I do is appropriate openly licensed media resources for my own slides. In this case, I want to set the scene for this presentation as one in which I haven’t been afraid to get my hands dirty, but I have also played with and explored a particular medium – in this case, various digital technologies – and created my own things which may also, ultimately, be of direct use to others. You might also say they’re at best half-baked, if not completely unbaked ;-)
2
The tools I’m going to talk about are situated within a data context. I spend a lot of time playing with openly licensed datasets, working across the whole data pipeline. This example, taken from the third year undergrad equivalent OU course TM351 “Data Analysis and Management”, provides a simplistic view of some of the processes involved in working with data. (We all know it’s not quite that straightforward, and often involves a lot of iteration or backtracking, but as well as “The role of the academic [making] everything less simple”, as Mary Beard put it in an Observer interview a few weeks ago, the academic also simplifies and idealises through abstraction and revisionist storytelling, particularly when it comes to describing processes.) So what I plan to do is spend a few minutes showing you some of the tools and emerging approaches I use working across the various steps of this pipeline.
3
So – the first thing to note is that I’m a technology optimist: I believe technology can help make our lives simpler, even if at first it may look as if we are making it more complex by introducing yet more tools to learn – and install on computers that our IT department would rather we left under their control. Taking control of your computing destiny is another theme of this talk… In this example, the box diagram I showed on the first line was /written/ rather than drawn. If I want to add steps, or have sub-branches added to the diagram, I don’t need to start faffing around in Powerpoint or Word figures trying to line things up and get them sized right and so on. I let the machine do it. In this particular online tool (you can see the URL in the screenshot at the top of the slide – I’ll pop a copy of the annotated slides online, and also let Alan have a copy) – so, in this particular tool, blockdiag, there are other diagram types available. The underlying code is also open source and available as a python package, so you can write diagrams such as these in a Jupyter notebook, for example. I’ll have more to say about Jupyter notebooks later.
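As a rough illustration of what "written rather than drawn" means, the sketch below builds blockdiag source text from a list of steps. The step names and the `blockdiag_source` helper are hypothetical, not taken from the slide; only the `blockdiag { A -> B; }` text format is blockdiag's own. Rendering the text to an image is left to the blockdiag tool itself.

```python
def blockdiag_source(steps, branches=None):
    """Return blockdiag source text for a linear pipeline.

    steps: ordered node names for the main chain.
    branches: optional dict mapping a node to a list of side-branch nodes.
    """
    branches = branches or {}
    lines = ["blockdiag {"]
    # The main chain is a single edge statement: a -> b -> c;
    lines.append("  " + " -> ".join(steps) + ";")
    # Each side branch is one extra edge from its parent node.
    for parent, children in branches.items():
        for child in children:
            lines.append(f"  {parent} -> {child};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical step names, for illustration only.
steps = ["acquire", "clean", "reshape", "analyse", "present"]
source = blockdiag_source(steps, branches={"clean": ["archive"]})
print(source)
```

Adding a step or a sub-branch is then a one-word change to the list or the dict, with layout and sizing left to the renderer.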
4
One other point to note – and a bit of blatant self-promotion here – most of the individual slides within this talk are backed up by one or more posts on my personal blog, Ouseful.info. I’ve been writing this blog for many years and it represents a reasonably complete notebook of a lot of the ideas I’ve explored over that time. In many cases, the posts are comprehensive and self-contained: they record all the steps I took to do something in case I need to remind myself later.
5
So, the pipeline. The first step, acquisition, relates to how we get hold of data. This may be from downloaded data files – Excel spreadsheet documents (which are actually zip files – you know you can change the xlsx suffix to zip and unzip them, right? Same with docx Word document files and pptx Powerpoint files), databases, online APIs (application programming interfaces) – but it may also be scraped from other sorts of document. Web pages, for example, or PDF documents (even though PDF documents are horrible, it’s often quite easy to extract data tables from them). I’m not going to talk about the mechanics of scraping, but journalism lecturer Paul Bradshaw has a good intro to a variety of tools and techniques in his Leanpub book “Scraping for Journalists”.
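The point about xlsx files being zip archives can be checked with nothing more than the Python standard library. The sketch below lists the parts inside an archive. To stay self-contained it builds a small stand-in archive in memory, with part names that mimic a workbook's layout; it is not a valid Excel file, and `list_parts` is a hypothetical helper name. Pointing `list_parts` at a real .xlsx, .docx or .pptx path should work the same way.

```python
import io
import zipfile


def list_parts(file_or_path):
    """Return the names of the files packed inside a zip-based document."""
    with zipfile.ZipFile(file_or_path) as archive:
        return archive.namelist()


# Stand-in archive (an assumption for the demo, not a real workbook).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

buffer.seek(0)
parts = list_parts(buffer)
print(parts)
```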
6
I will briefly mention a couple of tools I use though – morph.io is a site hosted by an Australian open data group that is actually a fork of a tool by UK Liverpudlian start-up, Scraperwiki. Morph.io will run a scraper of your own writing, hosted on Github, once a day and pop the results into a SQLite database that you can download. The slide shows a scraper I use for scraping licence applications made to the Isle of Wight council.
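The "run daily, save to SQLite" pattern can be sketched with the standard library alone. The table name, column names and example rows below are all invented for illustration, and the scraping step itself is omitted; this is not the scraper shown on the slide. The primary key plus `INSERT OR REPLACE` means that a repeated daily run updates existing rows rather than duplicating them.

```python
import sqlite3


def save_rows(conn, rows):
    """Upsert scraped rows, keyed on the application reference."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data ("
        "ref TEXT PRIMARY KEY, applicant TEXT, received TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO data (ref, applicant, received) "
        "VALUES (:ref, :applicant, :received)",
        rows,
    )
    conn.commit()


# Invented example rows standing in for the output of a scraper.
rows = [
    {"ref": "APP/001", "applicant": "Example Cafe", "received": "2016-05-01"},
    {"ref": "APP/002", "applicant": "Example Bar", "received": "2016-05-03"},
]

conn = sqlite3.connect(":memory:")
save_rows(conn, rows)
save_rows(conn, rows)  # a second "daily run" over the same data
count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(count)
```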
7
Another tool I use a lot is Tabula. Tabula is a Java application with a browser based user interface that will extract data tables from PDF documents. You simply drag to select the area of the page you want to scrape (you can mirror the same area over multiple pages or define different areas on each).
8
The heart of the application is actually a command line engine, recently wrapped by the R tabulizr package. This means you can automate the use of tabula in order to scrape tabular data from PDF documents within R, getting the data back as an R dataframe. That’s tabulizr – very nice; and the developer (on Github) is quite responsive.
9
Another tool I use from time to time is Apache Tika – this can extract text from PDFs, Word documents and so on, as well as from images. There are quite a few online OCR services now, many of them appearing as part of “AI toolsets”, offering a range of commodity AI API services – IBM, Microsoft and Google all have them, for example. So as well as OCR text extraction, they do face and emotion detection in images, semantic tagging / entity labeling within documents, automatic image tagging, speech to text, and so on. All with varying degrees of success. But all of them steadily improving.
10
After data acquisition, we’re often faced with cleaning a dataset. A tool I use for cleaning data is another Java application, again accessed via a browser, called OpenRefine. OpenRefine will open a wide range of document types – spreadsheets, csv or tabbed data files, XML, JSON, HTML – either locally or from the web, and present the data in a spreadsheet style UI. A wide range of options are provided for applying a particular transformation to each cell in a particular column – you can also script your own in a custom scripting language, or Python – as well as tools for faceting and filtering the display of rows based on values within one or more columns. The clustering tools are useful for finding and correcting partial matches – so for example, you can normalise My Co Ltd, with My Co Ltd., with My Co Limited, and so on.
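One way such clustering can work is "key collision": reduce each value to a normalised key and group values whose keys collide. The sketch below is a simplified take on that idea, not OpenRefine's actual implementation. The `SUFFIXES` table mapping "limited" to "ltd" is an extra assumption added here so that the example names from the text fall into one group.

```python
import re
from collections import defaultdict

# Hypothetical synonym table; not part of OpenRefine.
SUFFIXES = {"limited": "ltd"}


def fingerprint(name):
    """Lowercase, drop punctuation, normalise suffixes, sort unique tokens."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    tokens = {SUFFIXES.get(token, token) for token in cleaned.split()}
    return " ".join(sorted(tokens))


def cluster(names):
    """Group names that share a fingerprint; keep groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


names = ["My Co Ltd", "My Co Ltd.", "My Co Limited", "Other Co Ltd"]
clusters = cluster(names)
print(clusters)
```

Each cluster is only a candidate for merging; a person still decides which spelling becomes the canonical one.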
11
OpenRefine can also provide support for a limited range of data reshaping actions. I’ve described a few of them in this post, which takes a messy local election results dataset and shows how to clean and reshape it. OpenRefine also has a templated export – so we can generate simple ‘line at a time’ reports from a filtered dataset.
12
One of the things I try to look for in applications is whether they are open source and whether they provide a browser based UI – if you can use it via a browser, you should be able to use it on your own local machine or from a remotely hosted version accessed over the web. OpenRefine meets both these criteria, which means it’s no problem for someone like IBM to make it available via their Data Scientist Workbench site. (It’s also not too hard to roll your own version of something like this site.) The other tools currently provided by this site are RStudio, a powerful – and friendly – IDE for the R programming language, and Jupyter notebooks.
13
One reason why it’s getting easier to expose these applications over the web in a scaleable way is through containerisation. Containerisation is a form of application virtualisation where one or more applications can be wired together and isolated from each other within a multi-tenanted virtual machine. Docker containers offer the promise of being able to “run anywhere” – or at least, anywhere where the container platform can operate. Docker is the most popular route to this at the moment. The application shown here is called Kitematic. It lets you search for public application containers, and download them and run them locally on your own computer. The example shows various containers I’ve put together for OpenRefine (some are different versions, others are experiments/demos I really should delete). So rather than install Java on your computer and then download and install OpenRefine, you can just one-click in Kitematic and it will get a prepackaged OpenRefine container for you that includes all that OpenRefine needs to run.
14
One of the spin-offs from the early days of OpenRefine was the notion of a “reconciliation service”, whereby you could look up each item in an OpenRefine column against a web service that would try to match it to – reconcile it with – a known entity. A partial/fuzzy matching lookup against a controlled vocabulary, essentially. OpenCorporates, the open data international company lookup service, offers a reconciliation endpoint. It’s easy enough to package up your own lookup tables and this recipe describes how to do it using a homebrewed reconciliation container. I did one for MPs, for example.
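The core of a reconciliation lookup, fuzzy matching a value against a controlled vocabulary, can be sketched with the standard library's `difflib`. This is only an illustration of the matching step under stated assumptions: the vocabulary entries, the `reconcile` helper and the 0.6 cutoff are all hypothetical, and a real reconciliation service also wraps the lookup in a web API that OpenRefine can call.

```python
import difflib


def reconcile(value, vocabulary, cutoff=0.6):
    """Return the closest canonical name for value, or None if nothing is close."""
    # Compare case-insensitively, but hand back the canonical spelling.
    lookup = {name.lower(): name for name in vocabulary}
    matches = difflib.get_close_matches(
        value.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# Hypothetical controlled vocabulary.
vocabulary = [
    "Isle of Wight Council",
    "Southampton City Council",
    "Portsmouth City Council",
]

best = reconcile("Ilse of Wight Council", vocabulary)
print(best)
```

Raising the cutoff trades missed matches for fewer false positives, which is why reconciliation results are usually reviewed rather than accepted blindly.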
15
Just as an aside, when putting together reconciliation services, we ideally want a canonical list of entities or entity names we want to reconcile against. Registers can be a good source of these. But it’s also worth noting that registers can also be used to generate derived datasets. For example, I wanted a list of UK prisons with location information. In the absence of a single openly licensed dataset with this information (a website with one prison per page was the closest I found, which I could have scraped but chose not to), I instead did a lookup via the Food Standards Agency, which has inspection information for public food outlets. (Another source might have been the CQC, with a search for health surgeries or dental treatment centres, filtered by “HMP” or “prison”.)
16
RStudio is another application that can be freely redistributed and exposed via a browser. These posts show how to run an RStudio application in the cloud using a simple container management dashboard formerly known as Tutum, now available as Docker Cloud. I’ve also described how to package a Shiny application in a container so you can deploy it anywhere. Does anyone use Shiny? Shiny is a rapid prototyping tool for building browser-based, HTML5 interactive applications and dashboards – RStudio released a new dashboarding framework over the last couple of weeks – that makes it relatively easy to build interactive data exploration tools against an R environment.
17
One really nice component of the Docker ecosystem is docker-compose, formerly known as fig, which allows you to orchestrate the launch of several interlinked containers, so you can easily access one from another. The example here shows how to link RStudio and Jupyter notebooks to a neo4j database.
18
I’ve mentioned Jupyter a few times – does anyone use Jupyter notebooks? IPython notebooks? The browser based notebook UI lets you enter text (as markdown) and executable code (in a variety of languages) and then run the code and display the results of the code execution back in the notebook. One thing I’ve been exploring recently is a way of calling command line application functions packaged in a container from a notebook cell, and returning the output of the containerised command line function as a shared file. This post describes how I package the Contentmine tools – a set of tools for harvesting scientific journal papers and extracting knowledge from them, and which are a real pain to set up normally – and then use them via a notebook.
19
Just by the by, if you want to try the notebooks out, there’s a live demo available. (I also did a post on “Seven Ways to Run Jupyter Notebooks” which describes several alternative ways of running the notebooks.) The code example here shows all the code needed to open an Excel file containing average travel times to GP surgeries by LSOA, filter the data down to a particular local authority area, pull in an openly licensed geojson shapefile for that area, and then plot (and embed) an interactive choropleth map via the folium python package (using Google maps, I think, though it may be OpenStreetMap?)
20
One problem with producing interactive maps is that sometimes you actually want an image. It turns out that web testing frameworks like Selenium make it easy to grab screenshots from test pages rendered in a test browser, so I co-opted the idea to produce a routine that lets me grab a png snapshot of a map.
21
That example was actually created for a side project I dabbled with with our hyperlocal news outlet on the Isle of Wight called OnTheWight. OnTheWight have been reporting monthly job figures for years, so I thought I’d have a go at automating the production of the reports from nomis data, as well as producing a few charts. The report is just a literal reporting, although I do try to add some colour and a tiny amount of analysis, for example by using directional and magnitude terms – “the numbers went UP SLIGHTLY from last month, although they are SIGNIFICANTLY DOWN from the same time last year”. And so on.
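Directional and magnitude terms of this kind can be generated by comparing the relative change against a couple of thresholds. The sketch below is one simple way to do that, not the code behind the reports described above. The function name, the 2% and 10% thresholds and the figures are all made up for illustration.

```python
def describe_change(current, previous, slight=0.02, significant=0.10):
    """Describe the change from previous to current in direction/magnitude terms.

    slight and significant are hypothetical thresholds on the relative change.
    """
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    change = (current - previous) / previous
    if change == 0:
        return "UNCHANGED"
    direction = "UP" if change > 0 else "DOWN"
    size = abs(change)
    if size < slight:
        magnitude = "SLIGHTLY"
    elif size >= significant:
        magnitude = "SIGNIFICANTLY"
    else:
        magnitude = ""  # a middling change gets no qualifier
    return f"{magnitude} {direction}".strip()


# Made-up figures: this month, last month, same month last year.
this_month, last_month, last_year = 1010, 1000, 1200
sentence = (
    f"The numbers went {describe_change(this_month, last_month)} from last month, "
    f"although they are {describe_change(this_month, last_year)} "
    "from the same time last year."
)
print(sentence)
```

Where the thresholds sit is an editorial decision as much as a statistical one, which fits the idea of a journalist reviewing the generated text before publication.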
22
On my own site, I started trying to pull out some geographical insight, automatically reporting on areas with noticeably high unemployment compared to other areas by gender. The map does look like a population map, but the unemployment rate is actually higher in some of the more heavily populated areas!
23
Just a side note – the idea of being able to build something once then deploy it more widely for no extra effort really appeals to me. In the case of national datasets broken down to local level, building a solution for a local area you know about and understand helps get you started on automatically detecting and pulling out stories or features – but the same code can then run for other areas.
24
But if you automate a pain point away for one local area, you’ve solved the problem for all of them. The approach I’ve been taking is to think in terms of producing press releases rather than finished stories, relying on the journalist, or some other editorial role, to act as the final arbiter of the quality and relevance of the press release style communication. The implication is also that more work needs to be done checking and working up the press release for the final story (if, indeed, there is any story).
26
So picking up on this idea of reuse – or laziness – the nomis data to text engine can be easily wrapped to provide a conversational UI for it. In this example, I can ask the service for the latest JSA figures in a particular area. Although not shown, you can put in a postcode, for example, and get the figures back for the local authority area containing that postcode. At the time I did this demo, I was half thinking of trying to persuade Johnston Press to give me some pin money to play with, so I scraped a list of Johnston Press papers, found the postcode of their office, and used it as the basis for a lookup of jobless figures by newspaper title area.
27
Having got some machinery set up to work with Slack, I could also use it as an interface for a simple “spreadsheet row to paragraph of text” toy I was trying to put together. So here, for example, I’m looking up latest figures for CQC care home inspections. (Actually, I think this is based on a scraper of the CQC website rather than a data file download.)
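A "spreadsheet row to paragraph of text" converter can be as small as a sentence template filled in from each row. The sketch below assumes hypothetical column names and invented rows; it does not reflect the real CQC data layout or the toy described above.

```python
import csv
import io

# Hypothetical template; the field names must match the column headers.
TEMPLATE = (
    "{name} in {town} was rated {rating} "
    "following an inspection on {date}."
)


def row_to_text(row, template=TEMPLATE):
    """Fill the sentence template from one spreadsheet row (a dict)."""
    return template.format(**row)


# Invented rows standing in for a downloaded or scraped spreadsheet.
sheet = io.StringIO(
    "name,town,rating,date\n"
    "Example House,Newport,Good,2016-03-14\n"
    "Sample Lodge,Ryde,Requires improvement,2016-04-02\n"
)

paragraphs = [row_to_text(row) for row in csv.DictReader(sheet)]
for paragraph in paragraphs:
    print(paragraph)
```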
28
The original experiments had the Slack bot code running on my personal computer. More recently, I started looking at how things like Amazon AWS Lambda functions, essentially serverless remote procedure calls, could be used to host the bot. The examples here make use of the UK Parliament API to provide the content, allowing me to look up recent reports, or committee memberships, for example.
29
The data2text area is a rich one, and one thing I find reflecting on my own exploratory data activities is that I often look to charts (which are often custom, multilayered charts of my own devising – ggplot is great for that) for inspiration. Working in education, where we have a legal requirement to make our teaching materials accessible, charts and figures often require written descriptions. So one thing I’ve started wondering recently is whether we can introspect on chart objects created using things like ggplot as a “data basis” for a textualisation of the chart components (and then do data2text analysis for the simple analytics insight reporting). And it seems we can – ggplot chart objects, for example, have a ggplot_build() introspector, and we can also get access directly to chart objects.
30
When I posted about my ggplot2text experiment, I idly wondered whether we could do the same for matplotlib chart objects. And it seems we can, as this demo shared via a commenter shows. #LazywebFtw, you might say :-)
31
As I was looking at the Parliament API backend for a simple conversational search agent, the ONS Beta website became the live site. One of the nice things about the new ONS site is that a JSON feed alternative is available for much of the HTML content on the site. Which means we can repurpose that website content directly as a response to a conversational search.
32
Finally, I want to return to the Jupyter ecosystem. I absolutely love the notebook environment: it provides a great environment for writing literate, reproducible data analysis scripts (several news outlets are starting to publish Jupyter notebooks showing the analysis behind their news stories – Buzzfeed is a great example of this, as with their recent tennis match fixing/betting scam, for example), as well as providing a great environment for documenting exploratory data analyses. But the Jupyter ecosystem is already much richer than that. I haven’t described the dashboard toolkit for creating live dashboards, the slideshow view that lets you create interactive slides with live code execution, the range of programming language kernels (not just Python and R) or the kernel wrapper that lets you define an API via a notebook. But I do just want to quickly mention remote kernels.
33
At the moment, we’re rewriting a day long residential school activity that uses Lego robots. Until this year, we’ve used the original yellow Lego Mindstorms RCX brick. This year, we’re using the Lego EV3 brick, which has wifi and can be set up to run Linux and a python shell that can access the robot’s bits. The approach I’ve been exploring is to run a remote IPython kernel on the brick, and a Jupyter server on a desktop machine, and then connect a notebook to the remote kernel via the Jupyter server. Running the notebook server on the desktop machine removes the load of running the server from the brick. (The same approach can be – and is – used to run large tasks on supercomputer clusters.) The notebooks also allow us to create simple interactive UIs – just like R has the Shiny framework, the Jupyter notebooks can run interactive ipywidgets directly wired to python state. In the example above, I have a slider for controlling motor speed, for example (actually, the duty cycle of the stepper motor) and another that displays the value being seen by a particular sensor. (Again, there’s a tiny element of simplistic data2text contextualisation in the display.)
34
So that’s me done. Some of the tools and technologies that I think are appropriate for, or can be appropriated for, data related tasks. Sometimes a pen will do as well as a spoon.
35
And finally, a last bit of blatant self-promotion. In the same way that maths has recreational maths – fun puzzles in the Sunday papers – I engage in recreational data activities. And as with the blog, I keep a record of what I’ve done. Several years ago, I started to learn R, and used Formula One results and timing sheets data as context for that. Over the years, I’ve pulled various tricks and techniques together into this evolving book. (Actually, the book was also another experiment – Leanpub encourages you to publish as you write, and uses markdown for the manuscript. I was looking for an opportunity to explore whether we might be able to use something like RStudio, and in particular Rmd (R-markdown), for authoring OU course materials, so this gave me a reason – and a context – for exploring such a workflow.) It’s still a work in progress, but at over 400 pages already it represents a reasonably deep dive into the different things you can do with a limited range of datasets on a particular topic, as well as exploring a variety of ways of using – and appropriating – R to help us find stories in data.
36