gors appropriate


Upload: tony-hirst

Post on 18-Jan-2017


TRANSCRIPT

Appropri-ut in the sense of not inappropriate – the “right thing” to use, as well as appropri-ate, as in co-opt, or use for something it perhaps wasn’t originally intended for.

1

So for example, one thing I do is appropriate openly licensed media resources for my own slides. In this case, I want to set the scene for this presentation as one in which I haven’t been afraid to get my hands dirty, but I have also played with and explored a particular medium – in this case, various digital technologies – and created my own things which may also, ultimately, be of direct use to others. You might also say they’re at best half-baked, if not completely unbaked ;-)

2

The tools I’m going to talk about are situated within a data context. I spend a lot of time playing with openly licensed datasets, working across the whole data pipeline. This example, taken from the third year undergrad equivalent OU course TM351 “Data Analysis and Management”, provides a simplistic view of some of the processes involved in working with data. (We all know it’s not quite that straightforward, and often involves a lot of iteration or backtracking, but as well as “The role of the academic [making] everything less simple”, as Mary Beard put it in an Observer interview a few weeks ago, the academic also simplifies and idealises through abstraction and revisionist storytelling, particularly when it comes to describing processes.) So what I plan to do is spend a few minutes showing you some of the tools and emerging approaches I use working across the various steps of this pipeline.

3

So – the first thing to note is that I’m a technology optimist: I believe technology can help make our lives simpler, even if at first it may look as if we are making it more complex by introducing yet more tools to learn – and install on computers that our IT department would rather we left under their control. Taking control of your computing destiny is another theme of this talk… In this example, the box diagram I showed on the first line was /written/ rather than drawn. If I want to add steps, or have sub-branches added to the diagram, I don’t need to start faffing around in Powerpoint or Word figures trying to line things up and get them sized right and so on. I let the machine do it. In this particular online tool (you can see the URL in the screenshot at the top of the slide – I’ll pop a copy of the annotated slides online, and also let Alan have a copy) – so, in this particular tool, blockdiag, there are other diagram types available. The underlying code is also open source and available as a Python package, so you can write diagrams such as these in a Jupyter notebook, for example. I’ll have more to say about Jupyter notebooks later.
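As a rough illustration of what "writing" a diagram can look like, here is a minimal Python sketch that builds blockdiag source text from a list of steps. The step names are placeholders rather than the ones on the slide, and blockdiag_source is a hypothetical helper, not part of the blockdiag package. It assumes single-word labels so that no quoting is needed.

```python
def blockdiag_source(steps):
    """Return blockdiag source text that chains the given steps left to right.

    Hypothetical helper for illustration; assumes each step is a single word.
    """
    chain = " -> ".join(steps)
    return "blockdiag {\n  " + chain + ";\n}"


# Placeholder step names (an assumption, not taken from the slide).
pipeline = ["Acquire", "Clean", "Reshape", "Analyse", "Present"]
source = blockdiag_source(pipeline)
print(source)
```

Adding a step is then a matter of adding a word to the list and regenerating the text, which can be pasted into the online tool or saved to a file for the blockdiag command line program.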

4

One other point to note – and a bit of blatant self-promotion here – most of the individual slides within this talk are backed up by one or more posts on my personal blog, Ouseful.info. I’ve been writing this blog for many years and it represents a reasonably complete notebook of a lot of the ideas I’ve explored over that time. In many cases, the posts are comprehensive and self-contained: they record all the steps I took to do something in case I need to remind myself later.

5

So, the pipeline. The first step, acquisition, relates to how we get hold of data. This may be from downloaded data files – Excel spreadsheet documents (which are actually zip files – you know you can change the xlsx suffix to zip and unzip them, right? Same with docx Word document files and pptx Powerpoint files), databases, online APIs (application programming interfaces) – but it may also be scraped from other sorts of document. Web pages, for example, or PDF documents (even though PDF documents are horrible, it’s often quite easy to extract data tables from them). I’m not going to talk about the mechanics of scraping, but journalism lecturer Paul Bradshaw has a good intro to a variety of tools and techniques in his Leanpub book “Scraping for Journalists”.
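The point about xlsx, docx and pptx files being zip archives can be checked with Python's standard zipfile module. This is a minimal sketch: list_office_parts is a hypothetical helper name, and the tiny in-memory archive below is a stand-in for a real workbook so that the example runs without any file to hand.

```python
import io
import zipfile


def list_office_parts(file_obj):
    """List the member files inside an xlsx, docx or pptx package."""
    with zipfile.ZipFile(file_obj) as package:
        return package.namelist()


# Build a tiny stand-in package in memory (not a valid workbook, just a zip
# with workbook-like member names) so the sketch is self-contained.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("xl/workbook.xml", "<workbook/>")
    package.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
demo.seek(0)

parts = list_office_parts(demo)
print(parts)
```

With a real spreadsheet you would pass a path such as a downloaded .xlsx file to list_office_parts instead of the in-memory buffer.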

6

I will briefly mention a couple of tools I use though – morph.io is a site hosted by an Australian open data group that is actually a fork of a tool by UK Liverpudlian start-up, Scraperwiki. Morph.io will run a scraper of your own writing, hosted on Github, once a day and pop the results into a SQLite database that you can download. The slide shows a scraper I use for scraping licence applications made to the Isle of Wight council.
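The "pop the results into a SQLite database" step can be sketched with Python's built-in sqlite3 module. This is not the scraper shown on the slide: the column names and the two rows are invented placeholders, and an in-memory database is used here where a hosted scraper would write to a database file. A table called data is, as far as I recall, the morph.io convention, but treat that as an assumption too.

```python
import sqlite3


def save_rows(conn, rows):
    """Upsert scraped rows, keyed on a reference, so daily re-runs do not duplicate."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data "
        "(ref TEXT PRIMARY KEY, applicant TEXT, received TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO data (ref, applicant, received) "
        "VALUES (:ref, :applicant, :received)",
        rows,
    )
    conn.commit()


# Invented placeholder rows standing in for whatever the scraper extracts.
scraped = [
    {"ref": "LIC/0001", "applicant": "Example Cafe", "received": "2017-01-03"},
    {"ref": "LIC/0002", "applicant": "Example Bar", "received": "2017-01-05"},
]

conn = sqlite3.connect(":memory:")
save_rows(conn, scraped)
save_rows(conn, scraped)  # simulate the next day's run over the same page
count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(count)
```

Keying on a reference column means the scraper can be re-run every day and only new applications add rows.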

7

Another tool I use a lot is Tabula. Tabula is a Java application with a browser based user interface that will extract data tables from PDF documents. You simply drag to select the area of the page you want to scrape (you can mirror the same area over multiple pages or define different areas on each).

8

The heart of the application is actually a command line engine, recently wrapped by the R tabulizr package. This means you can automate the use of Tabula in order to scrape tabular data from PDF documents within R, getting the data back as an R dataframe. That’s tabulizr – very nice; and the developer (on Github) is quite responsive.

9

Another tool I use from time to time is Apache Tika – this can extract text from PDFs, Word documents and so on, as well as from images. There are quite a few online OCR services now, many of them appearing as part of “AI toolsets”, offering a range of commodity AI API services – IBM, Microsoft and Google all have them, for example. So as well as OCR text extraction, they do face and emotion detection in images, semantic tagging / entity labeling within documents, automatic image tagging, speech to text, and so on. All with varying degrees of success. But all of them steadily improving.

10

After data acquisition, we’re often faced with cleaning a dataset. A tool I use for cleaning data is another Java application, again accessed via a browser, called OpenRefine. OpenRefine will open a wide range of document types – spreadsheets, csv or tabbed data files, XML, JSON, HTML – either locally or from the web, and present the data in a spreadsheet style UI. A wide range of options are provided for applying a particular transformation to each cell in a particular column – you can also script your own in a custom scripting language, or Python – as well as tools for faceting and filtering the display of rows based on values within one or more columns. The clustering tools are useful for finding and correcting partial matches – so for example, you can normalise My Co Ltd, with My Co Ltd., with My Co Limited, and so on.
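To give a feel for how clustering of near-identical names can work, here is a minimal Python sketch of a "fingerprint" style key: lowercase the value, strip punctuation, then sort the unique words. It is a simplified stand-in for the idea, not OpenRefine's actual implementation, and the function names are my own. Note that a key like this groups My Co Ltd with My Co Ltd. but would not, on its own, catch My Co Limited; that needs a looser matching method or an abbreviation lookup.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Key a string by its lowercased, punctuation-free, sorted unique words."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [members for members in groups.values() if len(set(members)) > 1]


names = ["My Co Ltd", "My Co Ltd.", "my co ltd", "Ltd My Co", "Other Co Ltd"]
clusters = cluster(names)
print(clusters)
```

Each cluster is then a candidate for being collapsed to a single canonical spelling, which is the decision OpenRefine hands back to the user.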

11

OpenRefine can also provide support for a limited range of data reshaping actions. I’ve described a few of them in this post, which takes a messy local election results dataset and shows how to clean and reshape it. OpenRefine also has a templated export – so we can generate simple ‘line at a time’ reports from a filtered dataset.

12

One of the things I try to look for in applications is whether they are open source and whether they provide a browser based UI – if you can use it via a browser, you should be able to use it on your own local machine or from a remotely hosted version accessed over the web. OpenRefine meets both these criteria, which means it’s no problem for someone like IBM to make it available via their Data Scientist Workbench site. (It’s also not too hard to roll your own version of something like this site.) The other tools currently provided by this site are RStudio, a powerful – and friendly – IDE for the R programming language, and Jupyter notebooks.

13

One reason why it’s getting easier to expose these applications over the web in a scaleable way is through containerisation. Containerisation is a form of application virtualisation where one or more applications can be wired together and isolated from each other within a multi-tenanted virtual machine. Docker containers offer the promise of being able to “run anywhere” – or at least, anywhere where the container platform can operate. Docker is the most popular route to this at the moment. The application shown here is called Kitematic. It lets you search for public application containers, and download them and run them locally on your own computer. The example shows various containers I’ve put together for OpenRefine (some are different versions, others are experiments/demos I really should delete). So rather than install Java on your computer and then download and install OpenRefine, you can just one-click in Kitematic and it will get a prepackaged OpenRefine container for you that includes all that OpenRefine needs to run.

14

One of the spin-offs from the early days of OpenRefine was the notion of a “reconciliation service”, whereby you could look up each item in an OpenRefine column against a web service that would try to match it to – reconcile it with – a known entity. A partial/fuzzy matching lookup against a controlled vocabulary, essentially. OpenCorporates, the open data international company lookup service, offers a reconciliation endpoint. It’s easy enough to package up your own lookup tables and this recipe describes how to do it using a homebrewed reconciliation container. I did one for MPs, for example.
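The core of a reconciliation lookup, a fuzzy match against a controlled vocabulary, can be sketched with Python's standard difflib module. This is only the matching step, not the web service protocol that OpenRefine talks to; the reconcile function, the cutoff value and the company names are all made up for illustration.

```python
import difflib


def reconcile(name, vocabulary, cutoff=0.6):
    """Return the closest canonical entry for name, or None if nothing is close enough."""
    # Compare case-insensitively but return the canonical spelling.
    lookup = {entry.lower(): entry for entry in vocabulary}
    matches = difflib.get_close_matches(
        name.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# Invented canonical list standing in for a register or lookup table.
canonical = ["Acme Widgets Limited", "Beta Holdings PLC"]

matched = reconcile("Acme Widgets Ltd.", canonical)
unmatched = reconcile("Xyzzy Quux", canonical)
print(matched, unmatched)
```

A real service would typically return several scored candidates per item so that a person can confirm the match, rather than a single best guess.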

15

Just as an aside, when putting together reconciliation services, we ideally want a canonical list of entities or entity names we want to reconcile against. Registers can be a good source of these. But it’s also worth noting that registers can also be used to generate derived datasets. For example, I wanted a list of UK prisons with location information. In the absence of a single openly licensed dataset with this information (a website with one prison per page was the closest I found, which I could have scraped but chose not to), I instead did a lookup via the Food Standards Agency, which has inspection information for public food outlets. (Another source might have been the CQC, with a search for health surgeries or dental treatment centres, filtered by “HMP” or “prison”.)

16

RStudio is another application that can be freely redistributed and exposed via a browser. These posts show how to run an RStudio application in the cloud using a simple container management dashboard formerly known as Tutum, now available as Docker Cloud. I’ve also described how to package a Shiny application in a container so you can deploy it anywhere. Does anyone use Shiny? Shiny is a rapid prototyping tool for building browser-based, HTML5 interactive applications and dashboards – RStudio released a new dashboarding framework over the last couple of weeks – that makes it relatively easy to build interactive data exploration tools against an R environment.

17

One really nice component of the Docker ecosystem is docker-compose, formerly known as fig, which allows you to orchestrate the launch of several interlinked containers, so you can easily access one from another. The example here shows how to link RStudio and Jupyter notebooks to a neo4j database.

18

I’ve mentioned Jupyter a few times – does anyone use Jupyter notebooks? IPython notebooks? The browser based notebook UI lets you enter text (as markdown) and executable code (in a variety of languages) and then run the code and display the results of the code execution back in the notebook. One thing I’ve been exploring recently is a way of calling command line application functions packaged in a container from a notebook cell, and returning the output of the containerised command line function as a shared file. This post describes how I package the Contentmine tools – a set of tools for harvesting scientific journal papers and extracting knowledge from them, and which are a real pain to set up normally – and then use them via a notebook.
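One way to picture calling a containerised command line tool from a notebook cell is to build a docker run command that mounts a shared directory, so that whatever the tool writes ends up where the notebook can read it. This sketch only constructs the command. The image name, tool name and paths are hypothetical placeholders, not the Contentmine setup described in the post; --rm, -v and -w are standard docker run options.

```python
import shlex


def docker_run_command(image, tool_args, host_dir, container_dir="/data"):
    """Build a docker run command that shares host_dir with the container."""
    return [
        "docker", "run", "--rm",                  # remove the container afterwards
        "-v", "{}:{}".format(host_dir, container_dir),  # shared directory for output
        "-w", container_dir,                      # run the tool inside the shared dir
        image,
    ] + list(tool_args)


# Hypothetical image, tool and path names, for illustration only.
cmd = docker_run_command(
    "example/cli-tool",
    ["some-tool", "--output", "result.txt"],
    "/home/me/notebook-share",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

In a notebook cell you might then pass cmd to subprocess.run(cmd, check=True) and read result.txt back from the shared directory.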

19

Just by the by, if you want to try the notebooks out, there’s a live demo available. (I also did a post on “Seven Ways to Run Jupyter Notebooks” which describes several alternative ways of running the notebooks.) The code example here shows all the code needed to open an Excel file containing average travel times to GP surgeries by LSOA, filter the data down to a particular local authority area, pull in an openly licensed geojson shapefile for that area, and then plot (and embed) an interactive choropleth map via the folium python package (using Google maps, I think, though it may be OpenStreetmap?)

20

One problem with producing interactive maps is that sometimes you actually want an image. It turns out that web testing frameworks like Selenium make it easy to grab screenshots from test pages rendered in a test browser, so I co-opted the idea to produce a routine that lets me grab a png snapshot of a map.

21

That example was actually created for a side project I dabbled in with our hyperlocal news outlet on the Isle of Wight called OnTheWight. OnTheWight have been reporting monthly job figures for years, so I thought I’d have a go at automating the production of the reports from nomis data, as well as producing a few charts. The report is just a literal reporting, although I do try to add some colour and a tiny amount of analysis, for example by using directional and magnitude terms – “the numbers went UP SLIGHTLY from last month, although they are SIGNIFICANTLY DOWN from the same time last year”. And so on.
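The directional and magnitude terms can be generated with a few lines of Python. This is a minimal sketch rather than the code behind the reports: the thresholds for "slightly" and "significantly" (2% and 10%) and the figures are arbitrary assumptions chosen for illustration.

```python
def describe_change(current, previous, slight=0.02, significant=0.10):
    """Turn a pair of figures into a direction and magnitude phrase.

    The slight and significant thresholds are illustrative assumptions.
    """
    if previous == 0:
        raise ValueError("previous figure must be non-zero")
    change = (current - previous) / previous
    if change == 0:
        return "UNCHANGED"
    direction = "UP" if change > 0 else "DOWN"
    size = abs(change)
    if size < slight:
        return direction + " SLIGHTLY"
    if size >= significant:
        return direction + " SIGNIFICANTLY"
    return direction


# Made-up figures: this month, last month, same month last year.
this_month, last_month, last_year = 1010, 1000, 1300

sentence = "The numbers went {} from last month, and are {} on the same time last year.".format(
    describe_change(this_month, last_month),
    describe_change(this_month, last_year),
)
print(sentence)
```

Where the thresholds sit is an editorial judgement, which is one reason to treat the generated text as a draft for a journalist to check.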

22

On my own site, I started trying to pull out some geographical insight, automatically reporting on areas with noticeably high unemployment compared to other areas by gender. The map does look like a population map, but the unemployment rate is actually higher in some of the more heavily populated areas!

23

Just a side note – the idea of being able to build something once then deploy it more widely for no extra effort really appeals to me. In the case of national datasets broken down to local level, building a solution for a local area you know about and understand helps get you started on automatically detecting and pulling out stories or features – but the same code can then run for other areas.

24

The pain points often come in splitting the data down to local areas and then generating the stories.

25

But if you automate a pain point away for one local area, you’ve solved the problem for all of them. The approach I’ve been taking is to think in terms of producing press releases rather than finished stories, relying on the journalist, or some other editorial role, to act as the final arbiter of the quality and relevance of the press release style communication. The implication is also that more work needs to be done checking and working up the press release for the final story (if, indeed, there is any story).

26

So picking up on this idea of reuse – or laziness – the nomis data to text engine can be easily wrapped to provide a conversational UI for it. In this example, I can ask the service for the latest JSA figures in a particular area. Although not shown, you can put in a postcode, for example, and get the figures back for the local authority area containing that postcode. At the time I did this demo, I was half thinking of trying to persuade Johnston Press to give me some pin money to play with, so I scraped a list of Johnston Press papers, found the postcodes of their offices, and used them as the basis for a lookup of jobless figures by newspaper title area.

27

Having got some machinery set up to work with Slack, I could also use it as an interface for a simple “spreadsheet row to paragraph of text” toy I was trying to put together. So here, for example, I’m looking up the latest figures for CQC care home inspections. (Actually, I think this is based on a scraper of the CQC website rather than a data file download.)
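The “spreadsheet row to paragraph of text” idea can be sketched as a template filled in from a row's column values. This is a minimal Python illustration, not the toy itself: the column names, the template wording and the example row are invented placeholders rather than real CQC data.

```python
# Hypothetical template; the column names are assumptions for illustration.
TEMPLATE = (
    "{name} in {town} was rated {rating} "
    "at its inspection on {date}."
)


def row_to_text(row, template=TEMPLATE):
    """Fill the template from a row, given as a column-name to value mapping."""
    return template.format(**row)


# Invented placeholder row, as it might come from csv.DictReader.
row = {
    "name": "Example House",
    "town": "Newport",
    "rating": "Good",
    "date": "1 November 2016",
}

paragraph = row_to_text(row)
print(paragraph)
```

Because the template is just text, the same rows can be rendered as a chat reply, a line in a report or a sentence in a press release style draft.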

28

The original experiments had the Slack bot code running on my personal computer. More recently, I started looking at how things like Amazon AWS Lambda functions, essentially serverless remote procedure calls, could be used to host the bot. The examples here make use of the UK Parliament API to provide the content, allowing me to look up recent reports, or committee memberships, for example.

29

The data2text area is a rich one, and one thing I find reflecting on my own exploratory data activities is that I often look to charts (which are often custom, multilayered charts of my own devising – ggplot is great for that) for inspiration. Working in education, where we have a legal requirement to make our teaching materials accessible, charts and figures often require written descriptions. So one thing I’ve started wondering recently is whether we can introspect on chart objects created using things like ggplot as a “data basis” for a textualisation of the chart components (and then do data2text analysis for the simple analytics insight reporting). And it seems we can – ggplot chart objects, for example, have a ggplot_build() introspector, and we can also get access directly to chart objects.

30

When I posted about my ggplot2text experiment, I idly wondered whether we could do the same for matplotlib chart objects. And it seems we can, as this demo shared via a commenter shows. #Lazywebftw, you might say :-)

31

As I was looking at the Parliament API backend for a simple conversational search agent, the ONS Beta website became the live site. One of the nice things about the new ONS site is that a JSON feed alternative is available for much of the HTML content on the site. Which means we can repurpose that website content directly as a response to a conversational search.

32

Finally, I want to return to the Jupyter ecosystem. I absolutely love the notebook environment: it provides a great environment for writing literate, reproducible data analysis scripts (several news outlets are starting to publish Jupyter notebooks showing the analysis behind their news stories – Buzzfeed is a great example of this, as with their recent tennis match fixing/betting scam story, for example), as well as providing a great environment for documenting exploratory data analyses. But the Jupyter ecosystem is already much richer than that. I haven’t described the dashboard toolkit for creating live dashboards, the slideshow view that lets you create interactive slides with live code execution, the range of programming language kernels (not just Python and R) or the kernel wrapper that lets you define an API via a notebook. But I do just want to quickly mention remote kernels.

33

At the moment, we’re rewriting a day long residential school activity that uses Lego robots. Until this year, we’ve used the original yellow Lego Mindstorms RCX brick. This year, we’re using the Lego EV3 brick, which has wifi and can be set up to run Linux and a python shell that can access the robot’s bits. The approach I’ve been exploring is to run a remote IPython kernel on the brick, and a Jupyter server on a desktop machine, and then connect a notebook to the remote kernel via the Jupyter server. Running the notebook server on the desktop removes the load of running the server from the brick. (The same approach can be – and is – used to run large tasks on supercomputer clusters.) The notebooks also allow us to create simple interactive UIs – just like R has the shiny framework, the Jupyter notebooks can run interactive ipywidgets directly wired to python state. In the example above, I have a slider for controlling motor speed, for example (actually, the duty cycle for the stepper motor) and another that displays the value being seen by a particular sensor. (Again, there’s a tiny element of simplistic data2text contextualisation in the display.)

34

So that’s me done. Some of the tools and technologies that I think are appropriate for, or can be appropriated for, data related tasks. Sometimes a pen will do as well as a spoon.

35

And finally, a last bit of blatant self-promotion. In the same way that maths has recreational maths – fun puzzles in the Sunday papers – I engage in recreational data activities. And as with the blog, I keep a record of what I’ve done. Several years ago, I started to learn R, and used Formula One results and timing sheets data as context for that. Over the years, I’ve pulled various tricks and techniques together into this evolving book. (Actually, the book was also another experiment – Leanpub encourages you to publish as you write, and uses markdown for the manuscript. I was looking for an opportunity to explore whether we might be able to use something like RStudio, and in particular Rmd (R-markdown), for authoring OU course materials, so this gave me a reason – and a context – for exploring such a workflow.) It’s still a work in progress, but at over 400 pages already it represents a reasonably deep dive into the different things you can do with a limited range of datasets on a particular topic, as well as exploring a variety of ways of using – and appropriating – R to help us find stories in data.

36