automated content production the swedes love their …

38
3 AUTOMATED CONTENT PRODUCTION The Swedes love their soccer. So much so that in mid- 2016 the Swedish local media company Östgöta Media decided to launch a new site called “Klackspark” to cover every local soccer game in the eastern province of Östergötland. “It’s like the national sport of Sweden and you play it even if you’re not that good at it,” Nils Olauson, the publisher of Klackspark, told me. “I played in Division 6, one of the lower leagues, maybe until I was thirty-four just because it’s fun and there’s still a lot of prestige in the game. You want to be the best in your neighborhood, you want to be the best in your part of the town, and you get your rivals even in that kind of low league.” The sport has a high cultural significance. And when practically every neighborhood has its own team, spanning six divisions of local men’s soccer and four divisions of local women’s soccer, not to mention the major national and international leagues, that’s a lot of games. To reach that breadth of coverage, Klackspark strategically employs automated software writing algorithms, which take structured data about each local game and automatically write and publish a short, roughly one- hundred-word factual summary of what happened in the game. It’s not too fancy really. Any given story might recount who scored the goals in addition to the history and league standing of the teams that played. But the automation provides a foundational breadth to the coverage—anyone looking for the quick facts about a local match can find that story on the site. Olauson described how Klackspark’s automated stories are orchestrated with fourteen sports reporters who then add a bit of “spice,” layering on details and human interest to the stories on the site. The reporters receive alerts when the software detects something newsworthy or unique in a lower-league game. “When a girl had ten goals in the same game, we had one of our reporters call her up and talk to her. He wrote an article about it, and that article was one of the most read pieces on Klackspark that week.” The automation is serving a dual purpose: writing straight factual stories that are directly published and, as

Upload: others

Post on 25-Oct-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

3AUTOMATEDCONTENTPRODUCTION

TheSwedeslovetheirsoccer.Somuchsothatinmid-2016theSwedishlocalmediacompanyÖstgötaMediadecidedtolaunchanewsitecalled“Klackspark”tocovereverylocalsoccergameintheeasternprovinceofÖstergötland.“It’s like thenationalsportofSwedenandyouplay iteven ifyou’renotthatgoodatit,”NilsOlauson,thepublisherofKlackspark,toldme.“IplayedinDivision6,oneofthelowerleagues,maybeuntilIwasthirty-fourjustbecauseit’sfunandthere’sstillalotofprestigeinthegame.Youwanttobethebestinyourneighborhood,youwanttobethebestinyourpartofthetown,andyou get your rivals even in that kind of low league.” The sport has a highcultural significance. And when practically every neighborhood has its ownteam, spanning six divisions of localmen’s soccer and four divisions of localwomen’s soccer, not to mention the major national and international leagues,that’salotofgames.Toreachthatbreadthofcoverage,Klacksparkstrategicallyemploys automated software writing algorithms, which take structured dataabouteachlocalgameandautomaticallywriteandpublishashort,roughlyone-hundred-wordfactualsummaryofwhathappenedinthegame.It’snottoofancyreally.Any given storymight recountwho scored the goals in addition to thehistory and league standing of the teams that played. But the automationprovidesafoundationalbreadth tothecoverage—anyonelookingforthequickfactsaboutalocalmatchcanfindthatstoryonthesite.

OlausondescribedhowKlackspark’sautomatedstoriesareorchestratedwithfourteensportsreporterswhothenaddabitof“spice,” layeringondetailsandhuman interest to the storieson the site.The reporters receive alertswhen thesoftware detects something newsworthy or unique in a lower-league game.“Whenagirlhadtengoalsinthesamegame,wehadoneofourreporterscallherupandtalktoher.Hewroteanarticleaboutit,andthatarticlewasoneofthemost read pieces onKlackspark thatweek.”The automation is serving a dualpurpose: writing straight factual stories that are directly published and, as

detailedinChapter2,alertinghumanreporterstowhatcouldbeajuicystoryifonly they did some additional reporting, got some quotes, and fleshed it out.Straight automation provides breadth of coverage, and automation plusprofessionalreportersaddsdepthtocoverage.Theautomationdoestheroutinework, and the reporters get to focus onmore interesting stories. No jobs losteither,atleastnotyet.

ThistypeofhybridscenariowasthenormforthenewsorganizationsIspoketoabouttheiruseofcontentautomation.Newsroomsseecontentautomationasbeing largely complementary to journalists’ work.1 Yes, there are instances incontentproductionwhere there iscompleteautomation,and ifyousquint,youmightevensaythereisartificialintelligenceoperatinginnarrowtargetedareas.Butthestateoftheartisstillfarfromautonomouslyoperatingintheunboundedenvironmentof theworldandfromdoing thecontextualized interpretationandnuancedcommunicationrequiredofjournalists.

Thischapterfocusesonthecontentcreationphaseofeditorialproduction.Asvariouscontentcreationtasksaredelegatedtoautomationandalgorithms,newopportunities emerge for reinventing editorial processes and practices. I firstexaminethetechnicalcapabilitiesandpotentialofautomatedcontentproductionalgorithms. Then I detail how algorithms enable faster, larger scale, moreaccurate, and more personalized journalism, which creates new businessopportunitiesfornewsorganizations.Atthesametime,thischapteralsopointsout that automated content production has some very real limitations, such asdata contingencies and difficulties matching human flexibility and quality inreportingon a dynamicworld.This leads to the next topic, a considerationofhowpeopleandautomationwillworktogether,bothinthedesignandoperationofthesesystems,aswellasinhowthiscollaborationwillimpacttheevolutionof human tasks and roles.This chapter concludesby exploringwhatmight benextforautomatedcontentproduction.

HowAutomatedTextWritingWorksThebasicpremiseofautomatedtextproductionistotakestructureddata,asonemightfindinadatabaseorspreadsheet,andhaveanalgorithmtranslatethatdatainto written text. This process is referred to as “natural language generation”(NLG). At the simpler end of NLG are rule-based techniques that work like“MadLibs”—that is, there are prewritten templateswith gapswhere numbersaredynamicallyinsertedfromadatasetaccordingtomanuallycraftedrules(seeFigure3.1).Moreadvancedtemplateapproachesareimbuedwithrule-setsthat

incorporatelinguisticknowledgeandfacilitatemoresophisticatedanddynamictext production. Such techniques can conjugate verbs in different tenses ordecline nouns to make them grammatical. Sophisticated templates can beblendedinwithsimplerones.Forinstance,asimpletemplatecouldbeusedforaheadline so that it’smoreattentiongetting,but thenadynamic templatecoulddrivemoreheavilydescriptivepartsofthestory.2Manyrule-basedNLGsystemsare built according to a standard model that includes three distinct stages:documentplanning,microplanning,anddocumentrealization.3

Thedocumentplanningstageconsistsofdeterminingandselectingwhat tocommunicate and then how to structure that information in paragraphs andsentences. Deciding what to communicate is impacted by what the reader isinterested in,what thewriter is trying to accomplish (for example, explain orpersuade), and by constraints such as the available space and data.Documentstructurereflectstheeditorialpriorityandimportanceofinformationaswellashigher-level discourse objectives such as telling a story versus explaining atimeline.Inthedomainofweatheradocumentplanmightprioritizethesalienceor ordering of information related to warnings, winds, visibility, andtemperature, among other factors.4 In election result articles it may take intoaccount the“interestingness”ofparticularcandidatesormunicipalities,aswellaswhetherawinorlossisastatisticalaberration.5Documentplanningworkstoenumerate all the things the data could say and then prioritizes those factsaccordingtonewsworthinesscriteria.6Moresophisticatedsystemsmayidentify“angles”forstoriesinordertohelpstructurethenarrativebasedonrareeventsor domain-specific metrics.7 Depending on what the data indicates, exampleanglesforasportsstorymightbe“back-and-forthhorserace,”“heroicindividualperformance,”“strongteameffort,”or“cameoutofaslump.”8

Figure3.1.  AnexcerptofanautomaticallygeneratedarticleonPollyVote.comreportingontheresultsofaUSelectionpollfrom2016.Underlinestyleindicatesdifferenttypesofdynamictext.Source:AndreasGraefe,“ComputationalCampaignCoverage,ColumbiaJournalismReview,July5,2017(usedwithpermissionoftheauthor).

Thenextstage,microplanning,consistsofmakingwordandsyntaxchoicesatthelevelofsentencesandphrases.Thisphaseoftextgenerationisimportantbecauseitimpactsthevariabilityandcomplexityofthelanguageoutput,aswellas how publication or genre-specific style guides, tones, or reading levels are

produced. Microplanning entails nuanced choices, such as deciding amongdifferent templates for conveying the same information or among referringexpressions, which specify different ways of mentioning the same person orcompany inanarticle.The first timeaplayer ismentioned ina soccerarticle,their whole name might be used, the second time just the last name mightsuffice, and if there is a third time, it might be more interesting to use anattributeofthepersonsuchas“the25year-old.”9Additionalvariabilitycanbeinjected into output texts by integrating synonym selections from wordontologies,whichdefinestructuredrelationshipsbetweendifferentwords(seeanexample in figure 3.1). Careful word choice can steer the text away frommonotonybyblockingtheuseofcertainverbsorphrasesifthey’vealreadybeenincluded in a text. In the case of rule-based systemsmicroplanning decisionsreflecteditorialchoicesdeliberatelycodedintothealgorithm.

Finally,therealizationstageofNLGwalksthroughalinguisticspecificationfor the planned document to generate the actual text. This stage satisfiesgrammaticalconstraints,suchasmakingasubjectandverbagree,declininganadjective, making sure a noun is pluralized correctly, or rendering a questionwithaquestionmark.Therealizationstageisthemostrobustandwell-studiedaspect ofNLG,withmature toolkits such as SimpleNLGbeing used by newsorganizations to generate text.10 Many of the decisions at the level of textrealization relate to grammar and so there is less potential for journalists toimbueeditorialvaluesinthealgorithmsdrivingthisstage.

An alternative to the standard approach to NLG involves data-drivenstatistical techniques that learn patterns of language use from large corpora ofexamples. For instance, machine learning can automatically classify whichpointsofdatatoincludeinanAmericanfootballrecaparticle.Thiscontributesto the document planning stage by considering the context of data points andhowthataffectsinclusiondecisions.11Suchtechniquesrequireahighdegreeofengineering toproduceoutput in closelyconstrained scenarios.Moreover, richdatasetsof content anddataneed tobeavailable. Inonecase, awind forecaststatistical NLG system required a parallel corpus of manually written windforecast texts as well as the aligned source data for each text.12 NewsorganizationswithlargecorporaofarticlescouldpotentiallyutilizethosetextsasinputstotrainstatisticalNLGsystems.

Purely statistical approaches may, however, come up against barriers toadoptionbecausetheycanintroduceunpredictableerrors.Intheshorterterm,afruitful path forward appears to be the marriage of template- and statistically

basedtechniques.In2013Thomson-Reutersdemonstratedaresearchsystemthatcould extract and cluster templates from a corpus of examples, creating atemplate database.13 Then the system could generate a new text by iterativelyselectingthebesttemplategiventheavailabledataforeachsuccessivesentence.Thetextproductionqualitywascompetitivewiththeoriginaltexts,andhadtheadditional benefit of introducingmore variability into the output texts, whichaddressedaweaknessofpurelytemplate-basedapproaches.

BeyondAutomatedWritingAutomatedcontentproductionisnotlimitedtowritingtexts,andcanworkwithdifferentinputsbesidesstructureddata,includingthefullrangeofunstructuredtexts,photos,andvideosproliferatingonline.Tobeuseful,however,thesemediamust often first be converted into more structured data. Algorithms extractsemantics,tags,orannotationsthatarethenusedtostructureandguidecontentproduction.Fortextthisinvolvesnaturallanguageunderstanding(NLU)andforimages or video this involves computer vision to detect or classify visualproperties of interest. For instance,Wibbitz, a system that semi-automates theproductionofvideos,usescomputervisiontoidentifyfacesininputphotosandvideossoitcanappropriatelyframeandcropthevisualoutput.

Another typeofcontentautomationissummarization.Giventhe inundationof socialmedia content, there’s a lot of potential value to using algorithms tocrunchthingsdownintooutputsummariesthatpeoplecanskim.AlgorithmscansummarizeeventsasdiverseastheFacebookinitialpublicoffering,theBritishPetroleum oil spill in 2010, or a World Cup match by curating sets ofrepresentative media, such as tweets or photos.14 Summarization can alsogenerateaheadlinetoshareonsocialmedia,acompactpresentationforanewsbrowsingapp,orasetofimportanttake-awaysfromthestory.15Summarizationapproaches can be either extractive or abstractive.16 Extractive summarizationcorresponds to the task of finding the most representative sentences or textfragments from a document or set of documents, whereas abstractivesummarizationcansynthesizeentirelynewsentencesandwordsthatdidn’texistin the original text. Summarization algorithms embed a range of meaningfuleditorial decisions, such as prioritizing information from inputs and thenselectinginformative,readable,anddiversevisualcontent.Whilesummarizationtechnology is advancing, it’s sometimes hard to tell whether one summary ismuchbetter thananother.Wibbitzuses summarizationalgorithms to reduceaninputtextintoasetofpointstobeillustratedinvideo,butasNevilleMehtafrom

LawStreetMedia explained, “Sometimesyouget something that’s shorterbutit’snotdoingtheoriginalpiecejustice.…Thereisquiteabitofhandeditingthatgoesintoitjusttomaintainthelevelofqualitythatwewant.”

Videogenerationisconsiderablymorechallengingthantextbecauseitentailsautomaticallytailoringthestyle,cropping,motion,andcutsofthevisual,whilealso considering aspects such as the timing between the visual and textualoverlaystothevideo.AsystemlikeWibbitzmustcuratevisualmaterialfromadatabase(suchasGettyImages)andtheneditthatmaterialtogethercoherently.Earlyresearchsystemscapableofeditingandsynthesizingvideostoriesbeganto emerge towards the late 2000s,17 but only in the last few years has thetechnology been commercialized by the likes of Wibbitz and its competitor,Wochit.Wibbitz is further advancing the fieldby integratingmachine learningthat can learn editorial importance from user interactions with the tool. Ashuman editors tune rough cuts, the system learns which content is mostinterestingtoincludeforothervideosonsimilartopics.

Another output medium for automated content production is datavisualization. Visualizations are increasingly used in the news media forstorytelling and involvemapping data to visual representations such as charts,maps,timelines,ornetworkstohelpconveynarrativesandengageanaudience.18Datavisualizationsarealsoquitemultimodalandoftenrequire incorporatingahealthydoseof text toaid interpretation.19Weaving text andgraphics togetherraises new challenges about which medium to use to convey which types ofinformation,howtoreferbackandforthbetweenthevisualizationandthetext,anddecidewhichaspectsaremostimportanttovisualizefortheoverallstory.20Automatically generating captions or other descriptive text goes hand in handwith automating data visualization.21 For instance, systems can automaticallyproduce annotated stock visualizations or annotated maps to augment newsarticles(seeFigure3.2).22Theeditorialdecisionsencodedintothesealgorithmsare not only in selectingwhat to show, but also how to show it, andwhat tomake most visually salient through labeling, highlighting, or additionalannotation. The Associated Press (AP) produces automatically generatedgraphicsontopicssuchastheOlympics,finance,andboxofficenumbers,whichare then distributed on the wire. Other publishers such as Der Spiegel andReutersarebeginningtoexperimentwithautomateddatavisualizationsthatcanaugmentorevenanchorarticles.23

Figure3.2.  AmapvisualizationofobesityratesintheUnitedStates,includingannotationsofvariousareasofinterest.Themapwasautomaticallygeneratedbasedonaninputarticleonobesityanddiabetes.TheNewsViewstoolwasoriginallydescribedinT.Gao,J.Hullman,E.Adar,B.Hecht,andN.Diakopoulos,“NewsViews:AnAutomatedPipelineforCreatingCustomGeovisualizationsforNews,”ProceedingsoftheConferenceonHumanFactorsinComputingSystems(CHI)(NewYork:ACM,2014).

Machine learning is openingup fascinatingnewpossibilities for automatedcontent,includingthewholesalesynthesisofnewimages,videos,ortextsonthebasis of training data. Research prototypes have generated video using othervideosoraudio,aswellassynthesized imagesbasedonphotographiccorpora,voicesusingaudioofpotentiallyimportantsourcessuchaspoliticians,andtexttomimicuser-generatedcomments.24Oneprototypedemonstratedasystemthatcan takeordinaryvideofootageofaperson’s facefromawebcam,understandthefacialexpressionsmade,andthenmapthoseontoanothervideo’sfaceinrealtime. The demo is provocative: videos show the face of Putin or Trumpmimicking the same facial movements as an actor’s.25 Faces can now beswapped from one body to another, creating what are popularly known as“deepfakes.”26Othersystemstakeinanaudioclipandsynthesizephoto-realisticinterviewvideo that synchronizes theaudio to themouth shapeof the speaker

basedontrainingfootage.27Photoscanalsobesynthesizedentirelyfromdata.Amachine-learning model trained on a database of 30,000 high resolutionphotographs of celebrities can output photos of faces for people who don’texist.28Thephotoshavethelookandfeelofcelebrityheadshotsandarestrikingintheirquality(seeFigure3.3). In termsof text,neuralnetworkscangenerateYelp reviews that are rated just as “useful” as reviews from real people—essentiallygoingunnoticed.29

Figure3.3.  Automaticallysynthesizedfacesofpeoplewhodonotexistcreatedusinganeuralnetworkmodeltrainedoncelebrityphotos.Source:T.Karras,T.Aila,S.Laine,J.Lehtinen,“ProgressiveGrowingofGANsforImprovedQuality,Stability,andVariation,”presentedattheInternationalConferenceonLearningRepresentations,Vancouver,Canada,2018,https://arxiv.org/abs/1710.10196,licensedviaCreativeCommonsAttribution-NonCommercial4.0International,https://creativecommons.org/licenses/by-nc/4.0/.

Algorithms thatsynthesizeentirelynewimages,videos,and textschallengetheveracityofvisualmediaandtheauthenticityofwrittenmedia.Thecreationof fake videos showing sources saying something they did not could wreakhavocontheuseofvisualdocumentaryevidenceinjournalism.30Photoshophasbeen undermining trust in visualmedia for years, but these new technologiescreateawholenewpotentialforscaleandpersonalization.Theoutputof thesesystemsisrapidlyadvancinginbelievability,thoughtoatrainedeyetheremaybe subtle signs that the videos aren’t quite right, such as the flicker of animperfectly synthesizedmouth or a glazed look in the eyes. As the synthesis

technology moves ahead quickly, forensics technology that can identifysynthesizedfacialimageryorsynthesizedtextsisalsobeingdeveloped.31Evenifsynthesized content might sometimes fool the human eye, the forensicalgorithm’sstatisticaleyewillknowit’sfaked.Still,anarmsraceistakingshapeinwhichjournalistswillneedtobeequippedwithsophisticatedforensicstoolsinordertoevaluatetheauthenticityofpotentiallysynthesizedmedia.

AutomatedContentinUseThe news industry is still actively exploring different domains and use-caseswhere there is themost strategic potential for content automation. Automatedtext writing has been used for more than two decades to produce weatherforecasts.32 In finance and markets, Bloomberg has been automaticallygenerating written news alerts for its terminal for more than a decade, andReuters currently uses automation to facilitate the writing of market recapreportsonadailybasis.In2011,ForbesrolledoutautomatedcorporateearningspreviewsandreportsusingtechnologyfromNarrativeScience,andin2014,theAP began publishing automated earnings reports as well.LeMonde deployedautomatedwritingtohelpreportFrenchelectionresultsin2015,andin2017inFinlandthepublicbroadcasterYleaswellasthenewspaperHelsinginSanomatand academic project Valtteri reported the results of the Finnish municipalelectionsusingautomation.33Intherunuptothe2016USelection,automationwas used by election forecasting site PollyVote to renderwritten stories aboutthenearlydailyelectionpollsthatwerecomingout.34

Amid-2017surveyoffourteennewsagenciesfoundthatelevenwerealreadyusing automation or were actively developing it, mostly in the domains offinanceandsports.35Variouspublishershavepursueduse-casesinspecificsportssuch as soccer, ice hockey, and baseball.36 TheWashington Post has used thetechnology to cover events ranging from the 2016Olympics to local footballgames,aswellastocoverthe2016USelections.37ProPublicaandtheLATimeshave dabbled in generating automatic content in domains such as education,crime, and earthquake reporting.38 MittMedia in Sweden writes articles aboutlocal real-estate transactions automatically.39 Beyond articles, theWashingtonPost and other publications have explored automated headline writing,40 andMcClatchyhasusedthetechniquetoconveyinformationfromadatabaseaspartof a larger investigative story.41 Some outlets, including the UK PressAssociation in collaboration with Urbs Media, are exploring the coverage oftopics such as road conditions, public safety, crime, and health using open

municipaldata.

Automation’sAdvantagesOver time innovators are expanding the scope of content genres and domainswhere automation offers a benefit. In some cases automation enables a newactivity that people simplywould not be capable of,whereas in other cases itoffloads repetitive tasks from journalists or allows new scenarios of coveragethatwhilepossibleforpeopletoperformwouldbecostprohibitive.Automationaffordsoneormoreofseveraladvantagesincontentcreationbyenablingspeed,scale,accuracy,andpersonalization.42

Beingthefirsttobreakanewsstoryisoftenacompetitivestrategyfornewsorganizations.They see speed as away to drive traffic and enhance authority,despitetherenotnecessarilybeingthesamedemandorurgencyforspeedfromthe audience.43 But in some scenarios, and for some end users of newsinformation,speedreallyisthedefiningfactor.AtReutersaswellasBloomberg,which both sell specialized information terminals to stock traders, automationparses text documents, such as earnings releases, and almost instantaneouslygeneratesandpublishesaheadlinetotheterminalinterfacethatreflectswhetherthe company beat ormissed earnings expectations. It’s simple stuffwritten tospecificstandardizedtexttemplates,suchas“IBMmissedexpectationsby0.12:3.02versus3.14.”Butwhennewfinancialinformationcanmoveastock,beingthefirsttohaveaccesstoinformationcanmeanprofit—andsospeedisessentialfor traders. Finance appears to be the sole domain where full automation—complete autonomyacross the entireproductionpipeline—hasgained traction.Tointerjectahumanintheloopwoulddiminishthevalueoftheinformationtotradersbyslowing itdown.Evenanon-zeromarginoferror isoutweighedbythesheerdemandforspeed.Automationforspeedinthiscaseisaboutprovidingaservicethatsimplywouldnotbepossibletoprovideusinghumanlabor.

Thedomainoffinanceitselfisnotnecessarilythedefiningfactorinmakingspeed critical, however. The importance of speed is more a reflection of theinformation needs of particular end users. Take for instance the AP’simplementationofquarterlyearningsreportsdescribedintheintroductiontothisbook.UnliketheheadlinespushedtotheReutersandBloombergterminals,AP’sreportsarenotpublishedwithinmillisecondsofacorporateearningsrelease,butinsteadmaybepushedtothewireanywherefromfivetotenminutesorevenanhourafteranearnings release.Thereafter, theymaybepublishedby localAP-affiliatednewspapersforthesakeoflocalretailinvestorsorindividualswhoare

interested in a particular company. In determining whether speed will be anadvantage in aparticular automated content scenario, designers should look tohow end users will actually use the information. For instance, the LA Times’QuakeBot,whichreportsonearthquakesautomatically,generatesandpublishesstorystubsveryquicklybecausethiscanservetoalertreaderstopotentialrisksorsafetyissuesforthemselvesorlovedoneswhomaybewithintheearthquakezone.Forbothprescheduledevents(suchasearningsreleases)andunscheduledones (such as earthquakes) automation for speed depends on the needs of theaudiencefortheinformation.44

Thepotentialforscale,ofincreasingcoverageandbreadth,isanotherdrivingfactorintheadoptionofautomatedcontent.Acomputeralgorithmcanchurnouthundreds or thousands of variations of a story quite easily once set up,parameterized,andfedwithastreamofdata.Automatedcontentusecasesareemerging in areas such as weather, sports, and elections, where the cost ofdevelopinganautomatedsystemcanbeamortizedoverahighvolumeofoutputwith a long tail that might not otherwise attract dedicated attention fromjournalists. For instance, instead of covering only 15 percent of congressionalracesasithadinthepast,inthe2016USelectionstheWashingtonPostwasabletocoverelectionsinallstates,including435houseraces,34senateseats,and12gubernatorialraces.That’sadramaticexpansionofcoverage.AndalthoughthePost couldhave reached that scaleof coverageusinghuman reporters inpriorelections, the costs would have been prohibitive. Automation offers thepossibilitytocreatecontentforeversmalleraudiences,relievingthepressuretomake editorial decisions that balance labor availability with newsworthinessjudgments about things like the magnitude of the event.45 What constitutes“newsworthy”changeswhenit’scheapandeasytocoverbasicallyeverything.46

Increased breadth of coverage can allow for more widespread access toinformationwhichcaninturnimpactothersocietalprocesses.Forinstance,onestudyshowedthattheearningsreportsrolledoutbytheAPhaveaffectedstock-tradingbehavior.Theresearchexamined2,433companiesontheUSexchangesthatdidn’thaveanyfinancialearningscoveragebyAPpriortotheintroductionofautomation inOctober2014.47By the endof the sampleperiod, inOctober2015,about two-thirdsof thecompanies trackedhadreceivedsomeautomatedcoverage and exhibited higher liquidity and trading volume in comparison tothose that didn’t receive automated coverage. The relative increase in volumewasabout11percent.

Scale can be considered across different dimensions, including categorical

units such as sports teams (for instance, Klackspark’s soccer coverage),geographicunits(suchastheWashingtonPost’selectionscoverage),ortemporalunits (such as periodic polls in politics).The idea of scaling over timehas anadditional benefit: rather thanwidening coverage over a conceptual or spatialdomain,itallowsforconsistentanduniformcoverageofthesametypeofeventsover time.TheLATimes’QuakeBot,whichautomaticallywritesstoriesaboutearthquakes using data provided by the United States Geological Survey(USGS), demonstrates this idea.48 Earthquakes are always happening inCalifornia,andwhiletherewasaudiencedemandtoknowwhentheyhappened,theLA Times’ coveragewas inconsistent. No reporter reallywants to have towrite a repetitive pro forma story about a local, relatively minor earthquake.Automation allowed the paper to offload the tedium of writing those basicarticleswhileincreasingtheconsistencyofcoverage,sincebasicrulesaboutthemagnitudeofquakes thatwarranteda storywerebaked intocode.BenWelsh,theheadof thedata teamat theLATimes, remarked that thereporterwhohadbeen tasked towrite those stories in the pastwas now able to focus onmoreimportantinvestigations,suchaswhatbuildingsinthecitywereunsafeforthenext big earthquake. Scaling over time can contribute to time savings andpotentiallytolaborreallocationthatcanbuttressmoresubstantivecoverageofatopic.

Accuracyoftextcanbeanotheradvantageofautomatedcontentproduction.AsAndreasGraefewrites in theGuide toAutomatedJournalism, “Algorithmsdonotgettiredordistracted,and—assumingthattheyareprogramedcorrectlyand the underlying data are accurate—they do notmake simplemistakes likemisspellings,calculationerrors,oroverlookingfacts.”Theremainingerrorsareusuallyattributabletoissueswiththeunderlyingdata.49LisaGibbs,thebusinesseditorattheAPwhohelpedwiththeroll-outoftheautomatedearningsreports,concurred:“Theerrorrateislowerthanitwaswithhumanreporters.…Robotsdonotmaketyposandtheydonotmakematherrors.”Andifanerrorisdetectedintheoutput,thesoftwareneedstobefixedonlyonceandthenitcangobacktooperatingaccuratelyacrossitsthousandsofoutputs.Sooncethesystemsaresetupanddebugged,they’reconsistentintheirapplicationofaprocedure.Butthatdoesn’t always translate into more accuracy per se. Yes, automated contentproduction canmitigate the clerical errors related to typos andmathmistakes,but different types of errors and accuracy issues can crop up. The differentaccuracyprofileofautomatedcontentinturncreatesnewdemandsoneditorstorecognizenottheerrorsofsloppyfingers,butofmissingdata,unaccountedfor

context,orotheralgorithmiclimitations.50Speed, scale, and accuracy are all possible reasons for adopting automated

content, but there’s another affordance of the technology that is still nascent:personalization. A variety of avenues have been explored for newspersonalization, such as adaptive navigation, dynamic recommendations, geo-targetededitions,orjustreorderingofcontentonapagetomatchanindividual’sinterests.51Theideaofnewspersonalizationhasbeenaroundalmostaslongasthe World Wide Web,52 and many modern reading interfaces, from apps, tonewsfeedsandhomepages,routinelymakeuseofthesestrategies.

HereIwanttofocusonpersonalizationatthelevelofthecontentitself.Thismight, for instance, entail automatically rewriting the words of an article toappeal best to an individual’s interests or to the characteristics of a particularpublication’s audience. For instance, financial content can be written quitedifferentlyforageneralpurposeaudienceconsumingtheAP’searningsreportsas compared to the expert audience reading a niche financial news site.Personalization in this contextmeans any adaptation of a communication to auser.Morespecifically,contentpersonalizationis“anautomatedchangetoasetof facts that appear inanarticle’scontentbasedonpropertiesof the reader.”53Adaptationsoccurwithrespecttosomemodeloftheuser,whichcouldincludeage, gender, education level, location, political affiliation, topical interests, orotherattributesoftheindividual.Whenausersetshisorherownmodel,thisisoften referred to as “customization,” whereas when the model is implicitlydetermined (such as by observing behavior), this is referred to as“personalization.”54Oftentimesthereissomecombinationofcustomizationandpersonalizationwherebymediaisadaptedsomewhatthroughpersonalizationandthentheusercanmakefurtheradjustmentstocustomizeit.Researchonbaseballstories thatcanbe interactivelycustomizedbyendusers to focusonparticularplayers or parts of the game showed that such adaptive articleswere rated asmore informative, interesting, and pleasant to read than nonadaptive articles.55It’sunclearwhethersuchresultsholdforimplicitlypersonalizedarticles,butiftheydo,thiscouldimprovetheuserexperienceofnewsconsumption.

Localization isamorenarrowtypeofpersonalizationreflectingadaptationsbased specifically on geography or language. For instance, localized weatherreportsareproducedforthousandsoflocationsthroughouttheworld.56 In theirpilot, Urbs Media localized articles for the thirty-three boroughs of Londonusingopendatatowriteanarticleaboutthediabetesincidenceineachborough.In a project together with the U.K. Press Association called “RADAR”

(Reporters andData andRobots),Urbs is expanding this approach to providelocalized reports for a range of topics including health, education, crime,transportation, housing, and the environment.57 A staff of four writers canproduce fifteen story templateseachweek,eachofwhichcan in turngenerate250localizedversions.58Similarly,Hoodline,alocaloutletinSanFrancisco,isexperimenting with automatically localized stories about new or stand-outneighborhood restaurants using Yelp data.59 Meanwhile, the Norwegian NewsAgency is considering localizing its automated soccer articles by creatingdifferent versions and angles on a story depending on which team a localetypicallycheersfor.Soahometeamlossmightbeframedinasofterwaythanwhen that same team crushes a crosstown rival. Another aspect of coveragelocalization relates to adapting content to a desired language of consumption.This has advantages for global organizations that publish content in multiplelanguagesaroundtheworld.

Many of these applications of personalized content generation are in anascent or developing stage. There are, however, already examples ofpersonalized or customized content integrated into larger data-driven articlesfromtheNewYorkTimes,theWashingtonPost,andVox.60Forexample,aPostarticleentitled“America’sGreatHousingDivide:AreYouaWinnerorLoser?”maps and explains how real-estate prices have shifted over the last decade,showing the dip and uneven recovery after the 2008 financial crisis on a zip-code by zip-code basis. The article highlights several nationally interestinglocations such as the San FranciscoBayArea, but it also presents amap andsome text showing the localityof theviewerof the article.Using the locationinformationmodernwebbrowsersinferbasedonauser’sinternetprotocol(IP)address,thearticleisabletoadaptthepresentationofthecontenttothenearestmetropolitanareaandzipcode.WhenIaccessedthepageinWashington,DC,itread, “In 2015, a single-family home there was worth $637,165 on average,about79percentmorethanin2004.It’sadenselypopulated,mostlyblackarea.ThehomevaluesaretypicallyhigherthanmostofthehomesintheWashington-Arlington-Alexandria,DC-VA-area.”ThetextisstraightforwardbutisadaptedtomeassomeonewhoatthetimewaslivingintheLeDroitParkneighborhoodofWashington,DC,makingitmorecontextuallyrelevant.

TheBusinessCase“For us it’s not about saving money—it’s about doing stuff we couldn’totherwisedo,”underscoredBenWelshaboutthepossibilitiesforautomationin

CL
Highlight

the LA Times newsroom. Whether it’s speeding up production, increasingbreadth,enhancingaccuracy,orenablingnewtypesofpersonalization, there isgreatpotentialforapplyingautomatedcontentproductioninjournalism.Butaseditorial possibilities are explored, the business incentives and valuepropositionsalsoneedtobeworkedout.Canautomatedcontenthelpwithsomeofthesustainabilityconcernsthatfacenewsroomswithshrinkingbudgets?

Some organizations are already seeing the business impacts of automatedcontent in terms of competitiveness based on speed or breadth. Law StreetMediaisanichepublisheroflawandpolicytopicsforamillennialaudience.Itdoesn’thaveabig—orreallyany—videoproductionstaff,butithasbeenabletouse theWibbitzsemi-automatedproduction tool toadaptcontent foravisuals-seeking audience and thus to expand its reach.Wibbitz, the toolmaker,makesmoneythroughrevenuesharing.“Insteadofhaving…400textarticleswritteneverydayandonlyhavingavideoontwentyofthem,they[publishers]cannowhavevideoon90percentof them,”ZoharDayan, the founderofWibbitz toldme.Expandingthebreadthofvideocontentcreatesmoreadvertisinginventory,andinsomecasesWibbitzgetsacutoftherevenuethattheadvertisingcreates.Valueaccruesbothtothecreatoroftheautomationtoolandtothepublisher.

Automated content can also create business value by reducing or shiftinglabor costs. Sören Karlsson, the CEO of United Robots, estimated that theirsoccer recap-writing softwarehas savedoneof its clients about$150,000 in ayear.Somefreelancerswerenolongerneededtocallarenastofindouthoweachgameturnedout,forinstance.Thisismoreofashiftoflaborthananythingelsesince theeffort (andcost)needed tocollect thedatahasbeensubtractedawayfromeach local newsroomand centralizedby a newbusiness entity that itselfemploys people to gather the data. There can be some cost savings bymediaorganizationslookingtoshiftlaborontoperhapsmoreefficientandspecializedtechnologyprovidercompanies.

Keybusinessmetricscanalsobefavorablyimpactedbyautomatedcontent.TheWashington Post found that automatically produced articles can drive asubstantial number of clicks. The paper published about 500 automaticallywrittenarticlesduringthe2016USelections,whichintotalgeneratedmorethan500,000pageviews.61InChina,Xinhuazhiyun,ajointventurebetweenAlibabaandXinhuaNewsAgency,createdanautomatedsoccervideohighlightsystemand deployed it during the World Cup in 2018. The 37,581 clips produced,including things likegoal highlights and coach reactions, generatedmore than120millionviews in total.Klackspark’s automatedarticles arealsogenerating

new traffic for Östgöta Media: the site had about 30,000 unique visitors permonth in early 2017, and in an indication of a higher level of engagement,visitors now read more articles per visit on Klackspark. These numbers havemadeitfairlyeasytoattractsponsorshipforthesitefromorganizationsthatwanttospeaktosoccer-lovingaudiences.Klacksparkisharnessingtheattentionandvalue the automation provides to move toward a subscription model, whereaccesstothesiteispackagedwithaccesstooneofÖstgötaMedia’slocalmediaproperties.It’sessentiallyseenasavalue-addtothelocalnewspackage,alongthe lines of: “Subscribe to our local paper andget access to this great soccer-news resource as part of the deal.” Automated articles can also help drivesubscriptionconversiondirectly.MittMedia’sautomaticallyproducedrealestatearticles have driven hundreds of new subscriptions in the first fewmonths ofoperation.62

Automation can, in some cases, also enhance the visibility of content onsearchenginessuchasGoogle.Totheextentthatsearchenginerankingcriteriaare known, these factors can be baked into how content is automaticallyproduced so that it is rankedmore highly by the search engine.This providesmore visibility and traffic to the content and, theoretically at least, moreadvertisingrevenueasaresult.Forinstance,Googlerankseditorialcontentmorehighlywhenapictureisincluded.Realizingthis,theGermanautomatedcontentproviderRetrescoautomaticallyselectsanimageorpicturetogoalongwiththewrittenfootballrecaparticlesitgenerates.

Automatedcontentalsocreatesnewopportunitiesforproducts.Inthefutureperhapswewillseenativecontent(thatis,contentsponsoredbyanorganizationwithastakeinthetopic)mass-localizedfordifferentmarketsusingautomation.But aswith any new technical advance, designers should ask if an innovationcomesintotensionwithdesiredvalues.Automationforscaleappearstoassumea“moreisbetter”standardfornewscontent.InthecaseoffinancialnewsattheAP,thathashelpedliquidityaroundstocks.Butismorecontentalwaysbetter—for society, or for business? For thatmatter, ismore speed or personalizationalwaysbetter?Practitionerswillneedtoweighthebenefitsandtrade-offsastheybalancebusinessandjournalisticcommitments.

BarrierstoAdoptionInnovationinautomatedcontentcontinuestopushtheboundaryofwhatcanbeaccomplishedbytakingadvantageofspeed,scale,accuracy,andpersonalization.Andwhile therewill be a growing number of use caseswhere these benefits

make sense, either financially or competitively, a range of drawbacks willprevent a fully automated newsroom of the future. The limitations of contentautomationincludeaheavyrelianceondata,thedifficultyofmovingbeyondthefrontierofitsinitialdesign,cognitivedisadvantagessuchasalackofabilitytointerpret and explain, and boundedwriting quality. Some of these limitations,such as the dependency on data, are inherent to algorithmic productionprocesses,whereasotherissuesmayeventuallysuccumbtotheforwardmarchoftechnicalprogressasartificialintelligenceadvances.

Data,Data,DataTheavailabilityofdataisperhapsthemostcentrallimitingfactorforautomatedcontentproduction.Whethernumericaldata,textualorvisualmediacorpora,orknowledgebases,automatedcontentisallaboutthedatathat’savailabletodrivetheprocess.Datafication—theprocessofcreatingdatafromobservationsoftheworld—becomesastricturethatholdsbackmorewidespreaduseofautomationsimplybecauseaspectsoftheworldthataren’tdigitizedandrepresentedasdatacannot be algorithmically manipulated into content. The quality, breadth, andrichness of available data all impact whether the automated content turns outcompellingorbland.63Dataalsobecomeacompetitivedifferentiator:exclusivedata mean exclusive content.64 Automated content runs the risk of becominghomogenous and undifferentiated if every news organization simply relies onaccess to the same data feeds or open data sources. In the content landscape,newsorganizationsmayseenewcompetitors(orcollaborators)inorganizationsthatalreadyhavedeepanduniquedatabasesandcancheaplytransformthatdatainto content. Despite a reluctance among some news organizations,65 theacquisitionofqualitynumericaldatastreamsandknowledgebasesisrelativelynew territory where news organizations will need to invest to remaincompetitive.

Investment in data production could involve everything from sensorjournalism—the deployment of cheap sensors to gather data—to creating newtechniques for digitizing documents acquired during investigations.66 Publicrecordsrequestsareamethodmanydatajournalistsusetoacquiredatafortheirinvestigations;howeveritstillremainstobeseenwhetherautomatedcontentcanbereliablybuiltaroundthisformofdataacquisition.Streamsofdataareoftenthe most compelling and valuable for automated content since they facilitatescaleandamortizationofdevelopmentcostsovertime.Butengineeringastreamofdatarequirestheentirechainfromacquisitiontoeditingandqualitycontrolto

besystematized,andperhapsautomatedaswell.Dataqualityisessentialbutisstillmostly relianton iterativehumanattentionandcare.67Newsorganizationsable to innovate processes that can pipeline data acquired via public recordsrequestsorothersourceswillhaveacompetitiveadvantageinprovidinguniqueautomatedcontent.Theuseofmoresophisticatedstructureddataandknowledgebasesisanotherpromisingavenueforadvancingthecapabilitiesofautomation.68The deliberate and editorially oriented design and population of event andknowledge abstractions through structured reporting processes allow for themeticulousdataficationofparticulareventsfromtheworld.Thesedatacanthenbeusedtodrivemoresophisticatednarrativegeneration.

The textual outputs of an automated writing system are influenced by theeditorial decisions about the data they are fed. There’s a certain “algorithmicobjectivity” or even “epistemic purity” that has been attributed to automatedcontent—an adherence to a consistent factual rendition that confers a halo ofauthority.69 But the apparent authority of automated content belies the messy,complex realityofdatafication,which contorts thebeautiful complexityof theworld into a structured data scheme that invariably excludes nuance andcontext.70 The surface realization of content may be autonomous or nearlyautonomous,buteditorialdecisionsdrivenbypeopleororganizationswiththeirown values, concerns, and priorities suffuse the data coursing through thesystem. The ways data are chosen, evaluated, cleaned, standardized, andvalidated all entail editorial decision-making that still largely lies outside therealm of automated writing, and in some cases demands closer ethicalconsideration.71 For instance, standard linguistic resources that an automatedwritingsystemmightrelyoncanembedsocietalbiases(suchasraceorgenderbias)thatthenrefractthroughthesystem.72

Biascanemergeinautomatedcontentinseveraldifferentways.Thebiasofadatasourcerelatestoahostofissuesthatcanariseduetothediversityofintentsandmethodsdifferentactorsmayuseinthedataficationprocess.Thelimitationsof datasets must be clearly understood before journalists employ automation.Corporationsandgovernmentsthatcreatedatasetsarenotcreatingthemforthesakeofobjectivejournalism.Whyweretheycreated,andhowaretheyintendedtobeused?Datacanbeeasilypushedpastitslimitsorintendeduses,leadingtovalidityissues.

Anotherissueiscoveragebias.Ifautomationhasastrongdependencyontheavailabilityofdata,thentheremaybeatendencytoautomaticallycovertopicsordomainsonlyifdataareavailable.Andevenifdataaregenerallyavailable,a

missingroworgapcouldpreventcoverageofaspecificevent.Wealreadyseethistypeofbiasintermsoftheinitialusecasesforautomation:weather,sports,andfinanceallbeingdata-richdomains.Othertopicsmaynotreceiveautomatedcoverageifdataarenotavailable,leadingtocoveragebias.“Justbecausetherearealotofdatainsomedomains,itdoesn’tmeanthatthatisactuallysomethingthatisinterestingorimportanttothesociety.Weneedtowatchoutsothatyoudon’t let the data control what you would cover,” explained Karlsson fromUnitedRobots.Dataqualityalsoneeds tobeconsidered, sincedata journalistshavebeenknowntopreferdatasetsthatareeasilyreadableanderror-free.73Asaresultautomationeditorsmaywanttoconsciouslyconsiderhowdataavailabilityandqualityinfluencecoverageusingautomation.

The data that feed automated content production are also subject to little-discussed security concerns that could affect quality. It’s not unreasonable toimaginebadactorsmanipulatingorhackingdatastreamstoinjectmaliciousdatathatwouldresultinmisleadingcontentbeinggenerated.Asasimpleexample,ifa hackerwere able tomake the data delivered to theAP for the earnings pershareofacompanyappeartobeatestimatesinsteadofmissingthem,thiscouldhave an impact on investors’ trading decisions and allow the hacker to profit.TheAPdoeshavechecksinplacetocatchwildanomaliesinthedata—suchasastock dropping by 90 percent—as well as means to alert an editor, but moresubtleandundetectedmanipulationscouldstillbepossible.Additionalresearchis needed to understand the security vulnerabilities of automated contentsystems.

Data are the primary and in many cases the only information source forautomatedcontent.Asaresult,dataficationputsimportantlimitsontherangeofcontent that can be feasibly automated. Data must be available to produceautomated content, but sometimes information is locked away in nondigitizeddocuments, difficult-to-indexdigitizeddocuments such ashandwritten forms,74orevenmoreproblematically,inhumanheads.Ifabeat,story,ortopicreliesoninformation tied up in human brains (that is, not recorded in some digital ordigitizablemedium), then interviews are needed to draw out that information.The current generation of automated journalism is most applicable when thestoryrestsoninformationeitherdirectlyinthedata,orderivablefromthedata,butnotoutsideof thatdata.Unlesstheinputdatacapturesall thenuancesofasituation,which ishighlyunlikely, therewillnecessarilybecontext loss in theautomated content. New ways to overcome context loss and lack of nuanceinherent to datafication are needed in order to expand the range of utility for

automatedcontent.In cases where necessary information is coming directly from people, and

when it may be “sensitive, complex, uncertain, and susceptible tomisunderstanding, requiring intimacy, trust, assessment of commitment, anddetectionof lies,”75 the interjectionof reporters—people taskedwith acquiringthat information—will be essential. Handling sensitive leaks and meetings indarkalleyswithshadysourceswillbethepurviewofhumanjournalistsforsometimetocome.Interviewingisrelational.Evenjustgettingaccessmaydependonbuilding a rapport with a particular source over time.Moreover, an interviewmayrequirenotjustblindlyrecordingresponsesbutinsteaddemandadversarialpush-back on falsehoods, follow-ups to clarify facts, or reframing unansweredquestions to press for responses. For all of these reasons, automationmay bedeeplychallenged in termsof itscapacity toengage inmeaningful journalisticinterviews.76

Buttherearesomehintsforhowautomatedcontentcouldbeintegratedwithhuman question asking, such as when the AP has reporters do follow-upinterviews to get a quote to augment some automated earnings stories. Thispattern of collaboration could even be systematized by inserting gaps intemplates where information acquired from human collaborators would beexpected. For instance, a templatemight include an “insert quote from soccercoachhere”flagandsendataskrequesttoareportertoacquireaquotebeforepublishingastory.Insuchmixedinitiativeinterfacesanalgorithmmightpromptthereporterforcertaininputsitdeemsnecessary.77 InoneexperimentpotentialsourcesweretargetedonTwitterandaskedaquestionaboutthewaitingtimeinan airport queue (42 percent of users even responded).78 But the gap betweenaskingsimplequestionslikethisandthemoresubstantivequestionsrequiredofhumanreporting isvast.Meaningful researchawaits tobedoneondevelopingautomationthatcanaskmeaningfulquestionsandreceivemeaningfulresponses.Automationmustlearntodatifytheworld.

TheFrontierofDesignandCapabilityAutomatedcontentproductionsystemsaredesignedandengineeredtofittheusecasesforwhichthey’redeployed.Rule-basedautomationisparticularlybrittle,oftenfunctioningreliablyonlywithintheboundariesofwhereitsdesignershadthoughtthroughtheproblemenoughtowritedownrulesandprogramthelogic.In certain domains such as sports, finance, weather, elections, or even localcoverageofscheduledmeetings,thereisafairbitofroutineinjournalism.Such

events are often scheduled in advance and unfold according to a set ofexpectations that are not particularly surprising. Sports have their own well-defined rules and boundaries. It is in routine events like these where anautomatedalgorithm,itselfahighlystructuredroutine,willbemostuseful.But,almost by definition, routine events in which something nonroutine happenscouldbeextremelynewsworthy—likesay,stadiumrafterscollapsingonasoccermatch.Unlikeautomatedsystems,human journalistsareadeptat adaptingandimprovisingincaseslikethese.79

Automationfundamentallylackstheflexibilitytooperatebeyondthefrontierofitsowndesign.Automatedsystemsdon’tknowwhattheydon’tknow.Theylackameta-thinkingabilitytoseeholesorgapsintheirdataorknowledge.Thismakesitdifficulttocopewithnoveltyintheworld—asevereweaknessforthedomainofnewsinformationgiventhatnoveltyisquiteoftennewsworthy.Whenthe NorwegianNewsAgency implemented its soccer writing algorithm, therewasalotofhand-wringingaboutwhattodoinoutliersituations.Attheendofthedaythedecisionwastosimplynotcreateanautomatedstoryforeventsthathad some kind of exceptional, out-of-bounds aspect to them.But that kind ofrecognitionitselfrequireshumanmeta-thinking;humanoversightmustbebuiltintothesystem.

The fragility of automated content production algorithms also leads toreliabilityissues.Yes,onceit’ssetup,analgorithmwillrunoverandoveragaininthesameway.Itisreliableinthesensethatitisconsistent.Butiftheworldchangesevenjustalittlebit,suchasadatasourcechangingitsformat,thiscouldlead to an error. Reliability in the sense of being dependable is an issue foradoptionofautomatedsystemsbecauseitimpactstrust.80Thisintroduceslimitson how far news organizations are willing to push automation: “The moresophisticatedyoutrytogetwithit,themoreyouincreasethechanceoferror,”said Lisa Gibbs from the AP.81 In high-risk, high-reward use cases such asfinance,whicharebuiltaroundaneedforspeed,organizationssuchasReutershavedoubleandtripleredundancyontheirautomation,includinginsomecasesahumanfallback.

The resilience of people allows them to accommodate error conditions andmake corrections on the fly.Automated systems, by contrast, continue blindlypowering forwarduntilanengineerhits thekill switch,debugs themachine tofixtheerror,andthenrestartstheprocess.Asaresult,automatedcontentneedseditorswhocanmonitorandcheckthereliabilityoftheprocessonanongoingbasis. Oftentimes the errors that crop up with automation are data-related,

thoughsometimestheycanalsobeintroducedbyalgorithmsthataren’tquiteuptothetask.82BenWelshattheLATimesdescribedanerrorthattheirearthquakereportingsystemmade.ItpublishedareportbasedonUSGSdata,butthentenminuteslatertheUSGSsentacorrectionsayingitwasaghostreadingcausedbyaftershocks in Japan.TheLATimes updated thepost and thenwrote a secondstory about ghost earthquakes that are sometimes incorrectly reflected in theUSGSdata.Errorsofthisilkkeepcroppingup,though:inJune,2017theQuakeBot relayed another bogus report from faultyUSGS data, this time because adate-timebugattheUSGSsentoutafaultyalert.83Theseinstancespointouttheweaknessofrelyingonasingle,untriangulatedstreamofdata,albeitonethatisoftenreliable.Similardata-drivenerrorshavealsoarisenattheAP,includingaprominenterrormadeinJuly2015,whenoneofitsearningsreportserroneouslyindicated that Netflix’s share price was down, when instead it was up. Thereasonfortheerrorwastracedtothefactthatthecompanyhadundergonea7–1stock split, but the algorithm didn’t understand what a stock split was.84Thankfully, outlets such as the LA Times and AP have human oversight andcorrectionspoliciesthathelptomitigateorremediatesuchalgorithmicerrors.

Errors in data aside, algorithms have gotten reasonably proficient atautomatedcontentcreationwithintheboundsoftheirdesign.Buttheyalsostillhavedifficultywithmoreadvancedcognitive tasks.Beyond simplydescribingwhathappened,algorithmsstrugglewithexplainingwhythingshappened.Theyhavedifficulty interpreting information innewwaysbecause they lackcontextandcommonsense. In thenomenclatureof theclassic sixWsof journalism—who,what,where,when,why, and how—automation is practicablewithwho,what,where, andwhen, particularlywhen given the right data and knowledgebases, but it still struggleswith thewhy andhow, which demand higher-levelinterpretationandcausalreasoningabilities.

Generating explanations for why something is happening in society isdaunting, even for people. While data-driven methods do exist for causalinference, it remains to be seen whether they rise to the level of reliabilitypublisherswouldrequire inorder toautomaticallypublishsuch inferences. It’smore likely that any explanation produced by an automated systemwould befurthervettedoraugmentedbypeople.Explainingacomplexsocialphenomenamaydemanddata,context,orsocialunderstandingthatissimplyinaccessibletoan automated content production system. One of the critiques of masslocalization has been that if statistics aren’t contextualized by reporters withlocal knowledge, articles canmiss the bigger social reality.85 Entire strands of

journalism, for instance thosepertaining tocultural interpretation,mayalsobedifficult for automation to address.86 The lack of commonsense reasoningprevents systems from making inferences that would be easy to make for ahumanreporter,whichfurthercurtailstheirusageoutsideofthenarrowdomainswherethey’vebeenengineeredwiththerequisiteknowledge.

One specific type of knowledge and reasoning that automated content stilllacksisthelegalkind.Asaresult,algorithmshavethepotentialtoviolatemedialawswithout realizing it.Consider foramoment thepossibilityofalgorithmicdefamation,definedas“afalsestatementoffactthatexposesapersontohatred,ridicule or contempt, lowers him in the esteemof his peers, causes him to beshunned,orinjureshiminhisbusinessortrade.”87InorderforastatementtobeconsidereddefamationintheUnitedStates,itmustbefalsebutbeperceivedasfactandmustharmthereputationofanindividualorentity.88TheUnitedStateshas some of the most permissive free speech laws in the world, and for adefamation suit toholdup incourt against apublic figure, the statementmusthavebeenmadewith“knowledgeofitsfalsityorrecklessdisregardforitstruthorfalsity”—the“actualmalice”standard.89Presumablythiscouldbeprovenifaprogrammer or news organization deliberately acted with malice to create anautomatedcontentalgorithmthatwouldspewdefamatorystatements.Butthere’sa lower standard for defamatory speech against private individuals. If a newsorganization publishes an automatically produced libelous statement against aprivate individual, it could be liable if it is shown that the organization wasnegligent, which might include failing to properly clean data or fact checkautomated outputs.90 The potential for algorithmic missteps that lead to legalliability isyetanotherreason tohavehumaneditors in the loop,orat theveryleast toengineersystemsawayfrombeingable tomakepublicstatements thatcouldhurtthereputationofprivateindividuals.

WritingQualityDespite being able to output perfectly readable and understandable texts, thequalityofautomatedwritingstillhasawaytogobeforeitcanreflecthuman-likestandards of nuance and variability, not tomention complex uses of languagesuch as metaphor and humor.91 Variability is a key concern, especially forcontent domains such as sports,where it’s likely for a single user to consumemorethanonepieceofcontentatatime.Repetitioncouldleadtoboredom.“Youreally need to be able to express the game outcome in many different ways.Otherwiseyouwillseethatthisisautomatedtextandyouwillfindthatthisis

repeating itself,” explainedKarlsson fromUnitedRobots. In domains such asfinance theremay be advantages to keeping the language straightforward andmore robot-like—verbal parsimony and repetition can straightforwardlycommunicateessentialfacts tousers—whereasindomainssuchassportsusersmaywantmoreentertaining,emotional,orlivelytext.

Text quality of course varies greatlywith themethodused to automate thewriting.Template-basedmethodshave thehighestqualitybecause theyencodethefluencyofthehumanwriterof thetemplate.Yettemplatesalsoreducetextvariability, increasing repetitionand thepotential tobore readers.Studieshavebegun to examine the perceptionof automaticallywritten texts and found thatreaders can’t always differentiate automatic from human-written texts but thattherecanbesubstantivedifferencesinreadability,credibility,andperceptionofquality.92Anearlystudyfoundthathuman-writtenarticlesweremore“pleasanttoread”thantheirautomatedcounterparts.93Amorerecentstudyofmorethan1,100 people inGermany found that computer-written articleswere scored asless readable, but more credible than their human-written counterparts.94 Thedifferenceinreadabilitywassubstantial:themeanratingforthehuman-writtenarticles was 34 percent higher than for computer-written articles. But theabsolute differences for credibility were quite small, only about 7 percent.Anotherstudyof300Europeannewsreadersfoundthatmessagecredibilityofsports (but not finance) articles was about 6 percent higher for automatedcontent.95 In Finland, a study of the Valtteri system compared automaticallygeneratedmunicipalelectionarticlestojournalist-writtenarticlesandfoundthatthe automated articles were perceived to be of lower credibility, likability,quality,andrelevance.96Feedbackfromthe152studyparticipantssuggestedthattheautomatedarticleswereboring, repetitive,andmonotonous,aswellas thatthere were grammatical mistakes and issues with the writing style. Unlikeprevious studies which evaluated texts generated using templated sentences,Valtteri used phrase-level templates, phrase aggregation, referring expressiongeneration, and automated document planning based on the detection andrankingofnewsworthinessoffactstoincludeinthestory.Inotherwords,it’samore complex NLG engine. But this complexity appears to introduce thepotential for grammatical errors that may undermine the credibility of texts.Moreover, thephrase-level templatesuseddidn’tprovideenoughvariability tomakethetextsinterestingtousers.

We can contrast template-driven automated content of perhaps passable (ifnot entirely compelling) quality to even lower-quality statistical NLG

approaches.Let’s lookat an abstractive summarization system trainedonNewYork Times articles using a machine-learning technique.97 Researchers askedhumanreaderstoratethereadabilityofthesummariesproducedinanevaluationof the system.Here is one example of the output of themodel that producedsummarieswith the highest readability of those tested, run on an input articleaboutaFormulaOnecarrace:

Buttonwasdeniedhis100thraceforMcLaren.TheERSpreventedhimfrommakingittothestart-line.Buttonwas his teammate in the 11 races inBahrain.Hequizzed afterNicoRosberg accusedLewisHamiltonofpullingoffsuchamanoeuvreinChina.

Thoughsomewhatrecognizableastopicaltext,thesummaryhasawkwardverbs,suchas“quizzed,”whichdonotfitthecontext,itdoesn’texplainacronyms,andithasfragmentsofnonsensicaltext.

Thefluencyoftextualoutputproducedbysimpletemplates(seeFigure3.1),aswellasbeingmorestraightforwardtounderstandin termsofconcreterules,helps explain its adoption by news agencies in lieu of statistical methods.98Publishinggarbledtextconsistingofungrammaticalnonsensewouldcastalongshadowonthecredibilityofanewsoutlet. Ifmoreadvancedmachine-learningapproaches are to be integrated into content production, additional technicaladvancesareneeded.

AutomatedContentIsPeople!In 2016 theWhite House predicted that as the result of advances in artificialintelligence,“Manyoccupationsarelikelytochangeassomeoftheirassociatedtasksbecomeautomatable.”99Sowhatdoesautomationmeanforhumantasksinnews production? Even in scenarios where content is produced entirelyautonomously with no human in the loop,100 there is still a lot of humaninfluencerefractedthroughthesystem.Automationisdesigned,maintained,andsupervised by people. If people are inserted at key points in the productionprocess, they can oftentimes compensate for the shortcomings of automationdescribed in the last section.And so determining how best to infuse or blendhumanintelligenceintothehybridprocessisakeyquestion.“Thevalueoftheseautomated systems is the degree to which you can imbue themwith editorialacumen,”explainedJeremyGilbertattheWashingtonPost.HereIconsiderhowthateditorialacumenisexpressedviathedesign,development,andoperationofautomated systems, as well as what this means for how human roles maychange.

DesignandDevelopmentThedesignanddevelopmentprocessforautomatedcontentproductionsystemsisrifewithopportunitiesforbakingineditorialdecision-makingaccordingtotheeditorial values and domain knowledge ofwhoever is involved in that designprocess, including reporters, editors, computational linguists, data scientists,software engineers, product managers, and perhaps even end users. Valuesbecome embedded into systems through everything from how constructs aredefined and operationalized to how knowledge bases and data are structured,collected,oracquired,howcontentisannotated,howtemplatesorfragmentsoftext are written and translated for various output languages, and how generalknowledge about the genre or domain comes to be encoded so that it can beeffectivelyutilizedbyanalgorithm.

Oneofthekeyeditorialdecisionsthatautomatedcontentsystemsmustmakeis how they define what information is included or excluded. Whether thatdepends on some notion of “newness,” “importance,” “relevance,”“significance,”or“unexpectedness,”datascientistswillneedtocarefullydefineandmeasure thatconstruct inorder touse it in thealgorithm.Decisionsaboutthese typesof editorial criteriawillhave implications forhowpeopleperceivethe content. “Some critics of the Quake Bot think it writes too many storiesabout smaller earthquakes,” Ben Welsh explained. This was the result of adecision by the city editor, who simply declared that the bot should covereverythingthat’samagnitude3.0andhigherearthquakeintheLosAngelesarea.These types of human decisions get enshrined in code and repeated over andoverbytheautomation,soit’simportanttothinkthemthroughcarefullyinthedesign process.101 Otherwise different or even conflicting notions ofnewsworthinessmaybecomeunconsciouslybuilt intohowalgorithmscome to“think” about such inclusion and exclusion decisions. It’s often necessary forpeopleinvolvedinthedesignofthesesystemstobeabletomaketheirimplicitjudgmentsexplicit,andwithclearrationale.

People alsoheavily influence automatedcontentproduction in termsof thecreationandprovisionofdata.Forunstructureddataor forsystemsrelyingonmachine learning, human annotation of data is often an important step.Structuredannotationsareakeyenablerofalgorithmicproduction.Forinstance,automated video creation systems need a corpus of photos that is reliablyannotatedsothatwhenapersonisreferredtointhescript,aphotoofthecorrectperson can be shown. In early systems for producing automated video suchannotations were supplied by people using controlled vocabularies of

concepts.102Newersystems,suchas thoseofWochitandWibbitz, relyonvastcorpora of content that have been tagged by professional photo agencies. Incaseswhere yetmore classification algorithms are used to automatically tag aphotoorvideowithconcepts, thosealgorithmshave tobe trainedondata thatwas originally judged by people as containing those concepts. Automatedsystemsarebuiltatoplayersandlayersofhumanjudgmentreflectedinconceptannotations.

Domain-specific and general knowledge reflecting an understanding of thegenre,narrativestyle,choiceofwords,orexpectationsfortonealsoneedtobeencoded for automated systems. For specific content areas, such as politics,sports,orfinance,expertsneedtoabstractandencodetheirdomainknowledgeandeditorialthinkingforusebythemachine.Forinstance,semanticenrichmentofvalenceaboutaparticulardomainmightallowanautomatedsystemtowritethat a trend foraparticulardata series“improved” insteadof just “increased.”Themeaningofwordsmayalsodependonconstantlyevolvingknowledgeofadomain:forexample, inastoryaboutelectionpolls, themeaningoftermsthatarenotwelldefined,suchas“lead,”“trend,”or“momentum,”coulddependoncontext.103Retrescokeepsformereditorsonstaffsothatdifferentgenresoftextcanbematchedintermsofchoicesingrammarandwordchoice.TomMeagherattheMarshallProjectexplainedhowthedatabasetheydesignedtosupporttheautomation in their “Next toDie”projectwas influencedbypartnerswhohadspentdozensofyearscoveringcapitalpunishment in theirownstatesandasaresultknewwhatfactorstoquantifyfortheautomation.Wibbitzhasanin-houseeditorial teamof video editors and ex-journalistswhoworkwith research anddevelopmentgroupstoincorporatetheirknowledgeintothetechnology.“That’soneofthemostimportantpartsofthecompany,”accordingtoitsfounder,ZoharDayan.Akeyenablerofthenextgenerationofautomatedcontentsystemswillbe better user interfaces that will allow experts to impart their journalisticknowledgeanddomainexpertisetothesystem.

CollaborationinOperationOnceanautomatedcontentproductionsystemhasbeendesignedanddeveloped,itmovesintoanoperationalphase.Inthisstageaswell,peopleareinconstantcontactandcollaborationwiththeautomationastheyupdateit,augmentit,editit,validateit,andotherwisemaintainandsuperviseitsoverallfunctioning.Therearesometaskshumanscan’tdo,thereareothertasksthatautomationcan’tdo,andsoacollaboration is themostobviouspath forward.This raises important

questionsaboutthenatureofthathuman-computerinteractionandcollaboration,suchashowtoensurecommongroundandunderstandingbetweenahumanuserandtheautomatedsystem,howtomonitortheautomationtoensureitsreliabilityortooverrideit,howcontributionsbypeopleandcomputerscanbeseamlesslyinterleaved, and when and how trust is developed when systems are fairlyreliablebutnot100percentreliable.104Aspeopleinteractwiththesesystems,thenatureoftheirskills,tasks,roles,andjobswillnecessarilyevolve,mostlikelytoprivilegeabstractthinking,creativity,andproblem-solving.105

Given the preponderance of template-driven approaches to automatedcontent,writingisoneoftheareaswherepeoplewillneedtoevolvetheircraft.Templatewriters need to approach a storywith an understanding ofwhat theavailabledatacouldsay—inessencetoimaginehowthedatacouldgiverisetodifferent angles and stories and delineate the logic that would drive thosevariations.Related is theability to transformtheavailabledataandorganize itmoresuitablyoradvantageouslytothewayitwillbeused.GaryRogersofUrbsLondon, who creates templates to tell stories using open data in the UnitedKingdom, explained it thisway: “When you’rewriting a template for a story,you’renotwriting thestory—you’rewriting thepotential foreveryeventualityofthestory.”Thecombinatoricsofthedataneedtobeunderstoodinrelationtowhatwouldbeinterestingtoconveyinastory,somethingthatiscloselyrelatedtothecomputationalthinkingconceptofparameterizationdescribedinChapter1.Peoplewillalsoneedtobeonhandtoadaptorreparameterizestorytemplatesinviewofdatacontingenciesorshiftsindataavailabilityfordifferentcontexts,suchasanationalsoccermatchwithlotsofrichdataandalocalsoccermatchwitharelativepaucityofdata.Availableinterfacesfortemplatewritingdon’tyetsupportwriters in seeing themultitudeofpossibilities inagivendataset,or inparameterizingastoryintermsoftheavailabledata.

But user interfaces are, in general, evolving to better support the template-writing task. Automated Insights, whose technology drives the AP earningsreports,marketsatoolcalled“Wordsmith,”whileArria,whichisusedbyUrbsMedia, has a tool called “Arria Studio.” These are essentially user-interfaceinnovations wrapped around an automated content production algorithm. Youcouldthinkofitasanalienwordprocessorthatletstheauthorwritefragmentsoftextcontrolledbydata-drivenif-then-elserules(seeFigure3.4).Automatingthe document planning and microplanning stages of NLG is complex anddemands a fair amount of both engineering and domain knowledge. Theseinterfaces offload the most complex of the document planning and

microplanning decisions to a human writer who authors a meta-document ofrulesandtextfragments.Updatingandmaintainingaportfoliooftemplatesthisway,however,requiresafairbitofneweditorialwork.

Figure3.4.  TheArriaStudiouserinterfaceshowingastorydraftedusingdatafromtheWashingtonPostfatalpoliceshootingsdatabase.Notethetoprowofbuttonsinthetoolbar,usedforinsertingvariables,conditionalstatements,synonyms,orothercomputationalfunctionstotransformthetext.Analternativeviewprovidesdirectaccesstothetemplatinglanguage.Attheleftaretabsthatallowtheusertoexplorethedata,tochooseamongvariablesforthepurposesofwriting,andtopreviewthetemplate’soutputwithdifferentrowsoftestdata.Source:ArriaStudio(https://app.studio.arria.com/).

If there’sanyareawheretherewillbesteadygrowthinemployment, it’s inthevariousshadesofongoingmaintenanceworkthatautomatedcontentsystemsrequire. People need to be involved in remediating errors that crop up inautomatedoutputs,forinstance.Thismayverywellinvolvesomeexcitingandcreative detective work in debugging the system. But then there is the moreprosaicmaintenance:makingsurethedatastreamsareupdatedwhenthesoccerleaguesannouncechangestoteams,editingdatabasesorspreadsheetstoreflectnew corporate details, keeping track of new or updated open data sources, ortweakinganyofthemyriadrule-setsforstructure,genre,orstylethathavebeenbaked into the design of the system. Lisa Gibbs at the AP toldme that their

introductionofautomatedstorieshasledtonewtasksforhercolleagues:“Oneoftheresponsibilitiesofaneditoronourbreakingnewsdesknowisessentiallyautomation maintenance,” which involves tracking when companies moveheadquarters, split their stock, and so on. She estimates this takes about twohourspermonthofstaff time,soit’sstillasmallsliceofeffort,butyoucouldimaginethatgrowingasmoreandmoreautomationcomesonline.WhentheLATimes’ Quake Bot missed a noticeable earthquake out in the middle of thePacificOcean,thegeographicboundingboxofthebothadtobeadjustedsothatitsfilterstretchedfartheroutintotheocean.Updateslikethiswillbeneededasautomatedcontentisputthroughitspacesandexposedtosomeofthecorneroredge cases that designers maybe didn’t capture in their initial rules. From abusiness perspective, automated content probably won’t be competitive if it’sstaticfortoolong.Peoplewillneedtobeinvolvedincontinuouslyassessingandreassessinghowitcouldbeimproved.

Supervision and management of automated systems are also increasinglyoccupying humans in the newsroom. As systems are developed, automationeditorsareneededtoassessthequalityofthecontentuntilitisgoodenoughtopublish,ideallyforeseeingandrectifyingpotentialerrors in thedesignprocess.Regardingthedevelopmentofsoccerarticle-writingsoftware,HelenVogtfromtheNorwegianNewsAgency said, “I thinkwe had at least 70 or 80 versionsbeforewewerereadytogolive.Andthey[thesportseditors]hada lotoffuntesting it and tweaking it.” In the production of its Opportunity Gap project,ProPublica’sScottKleinnotedthatediting52,000automaticallywrittenstoriesbecame a new challenge: “Editing one narrative does notmean you’ve editedthem all.”106 Sensible edits in one case might not entirely fit in another.Automatedcheckscancontributetoqualityassurance,buttheprovisionofhigh-quality text still requires that representative samples of outputs be manuallyassessed.107 At the RADAR project about one in ten stories is still manuallychecked to ensure that the raw data and story flow are accurately reflected inoutputs.Muchasmethodsfor“distant reading”havehelpeddigitalhumanitiesscholars understand corpora of texts via computational methods andvisualizations,108building tools to support thenotionof“distantediting”couldalloweditorstoworkonamuchlargerscaleoftexts.

There’s a distinct quality control or editing function that people mustcontributeduringdevelopmentofautomation.Peoplemustalsobeavailable tomake and post corrections in case the automation doesn’t behave as expectedonce it’s put into operation. When the Washington Post ran its automated

electionsarticlesforthe2016USelections,peopleweretaskedwithmonitoringsomeofthestoriesusingtoolslikevirtualprivatenetworks(VPNs)tosimulateloading the stories fromdifferent locations in order to see if the articleswereadaptingasexpected.Toolmakershavenoted thatabarrier to theadoptionofpersonalized content is the effort needed to edit and check consistency of themanypotentialoutputs.109Asnewsroomsexpandtheiruseofautomation,peoplewill need to keep an eye on the big picture—to know when to deploy,decommission,orredevelopasystemovertime.

ShiftingRolesThe collaborations that emerge between journalists and automated contentsystemsmay lead todeskilling (lossof skills in theworkforce),upskilling (anincrease in skills to meet new demands from sophisticated tools), reskilling(retraining),orotherwiseshifttheroleofhumanactorsastheyadjusttoanewalgorithmic presence.110 Maintenance work related to updating datasets andknowledge bases, or annotating media, may not ultimately be that sexy,cognitivelydemanding,orhighskill.Thiscouldhavetheeffectofcreatingmoreentry-level jobsforwhich therewouldbemore laborsupplyand lowerwages.Atthesametime,automationcanoffloadandsubstitutesomeofthetediumof,say, determining data-driven facts for a story. This may create more time forhigh-skilledjournaliststodowhatthey’retrainedfor,namelyreporting,findingandspeakingtonewsources,andwritingincreativeorcompellingways.111Theneedforhigh-skilledreporters is reinforcedby theshortcomingsof thecurrentstate-of-the-art technology,whichcannotgetat informationwhere therearenodataoranswerquestionssuchaswhyandhowaneventfitsintothesocialfabricof society. The design, development, configuration, and supervision ofautomationmayentailtheneedforadifferentkindofhigh-skilledindividualaswell—one who has mastered computational thinking, is highly creative, andunderstandsthelimitsofthetechnology.

The reskilling of the existing journalistic workforce will involve learningtaskssuchashowtowritetemplatestofeedtheautomation.Itwillalsoincludesomelightcodingoratleastfamiliaritywithcoding.“Youhavetobeabletosayhere’saquestionareporterwouldask,oraframingofastory,oranangleandhowdo I encode that in computer code, orhowcould computer code ask andanswer that question on its own, and return an answer that can then berepublished,”Welshtoldme.Writingtemplatesdemandsdomainknowledgeandafluencyinwritingforthatdomain,aswellasanessentialunderstandingofthe

possibilities of what the technology affords. At theWashington Post, JeremyGilbert observed that more experienced editors are often better at writing thetemplatesbecausetheyalreadyhaveexperiencehelpingtoshapethewritingofreportersandsothey“haveabettersenseofstructurallywhatisorshouldbe.”Astructured journalism pilot project asked ten reporters to populate event andknowledge data structures, and found that the key skill that enabled some ofthosereporterstoadaptquicklywasa“generalcomfortwithabstraction.”112Butthisisskillenhancement,notreplacement;reportingskillsarejustasimportantas ever. On the “Next to Die” project at the Marshall Project TomMeagherexplainedthatforstructuredjournalism“thecoretaskiskeepingthedatabaseupto date and adding every new change to a case,” but that “in someways thereporting is almost exactly the same.” In otherwords, reporting skills are stillnecessary,buttheoutputofthosereportingskillsistoastructureddatabasethatthenfeedstheautomatedoutput.

These shifts are already to some extent reflected in new roles, both at thedeskilledandtheupskilledendsofthespectrum.Wibbitzclaimsthattheresultsof its automatic process are a “rough cut” that’s about 80 percent of thewaythere.Peoplemanage the last20percent,which involves fine tuning thestory,swapping images based on relevance, or adjusting the transitions or captiontiming.Thepeoplewhodothisdon’thavetobehighlyskilledvideoeditors—onemediaorganizationIspoketoreferredtothepersontaskedwitheditingthevideosalldayas“juniorstaff,”anintern.Ontheotherhand,thesportsreporterwho worked on the design of the first soccer story-writing software at theNorwegianNewsAgencydecided to takeon a new role as “newsdeveloper,”andhenowlooksat tools,workflows,andopportunitieswhereautomationcanprovidevaluetotheorganization.Reutershasanautomationbureauchiefwhoissimilarly tasked with both thinking through how to manage the existingautomationinuseandwithenvisioningnewpossibilities.AndattheAP,anewpositionwascreatedin2014foranautomationeditorwhoengages inmuchofthesupervision,management,andstrategyaroundautomationusebytheagency.These roles need editorial thinking as much as they need a capacity tounderstand data and the capabilities of state-of-the-art content automationtechnology.Inotherwordstheycouldbeappropriateforareskilledjournalistorareskilleddatascientist.Theimplicationsoflaborshiftswillbeborneoutovertime. Future researchwill need to observe how automated content productionaffects thecompositionofdeskilling,reskilling,andupskillingandhowthat inturnimpactsissuessuchasworkerautonomyandskillseducation.113

WhitherAutomatedContent?Automated content production is still in a nascent state. It’s beginning to gaintractionasdifferentorganizationsrecognizethepotentialforspeedandscale,butpersonalization is still largely on the horizon. There remains untappedcommunicative potential for combining text, video, and visualization incompellingways.As limitationsondataprovision and automation capabilitiesgive way to advances in engineering, what should we expect for automatedcontentmovingforward?Whereshouldorganizationslookforopportunityasthetechnologyadvances?

GenreExpansionOne of the criticisms of automated journalism is that it’s less “automatedjournalism”than“automateddescription.”114Atypologyofjournalisticformsisusefulforseeingwhereelseautomationcouldexpand.Onesuchtypologydrawson rhetorical theory and offers five types of journalistic output: description,narration,exposition,argumentation,andcriticism.115Descriptionisessentialinallof thesemodesand iswell-suited toautomation,given that the stateof theworld can be rendered factually based on collected data. For aspects ofdescription that need to be more phenomenological, human collaborators canstep in to add “flavor” such as quotes or other human-interest observations.Narrationbuildsondescriptionbutoffersmorestructureinpresentinganevent,perhaps even making it come across as more of a story by establishingconnectionsbetweenfacts,events,andcharacters.Thisiswhatcompaniessuchas Narrative Science strive for: to be smart about document planning so thatdescription starts to look more like narrative. The current state of the art incontentautomationgivesusdescriptionandnarrative.

Exposition starts to be more challenging. Exposition seeks to explain,interpret,predict,oradviseandisfamiliarfromformatssuchasop-edcolumnsornewsanalyses.This typeofcomposition isharder forautomation to render.Notable research prototypes of heavily engineered systems have, however,createdexpositoryessaysordocumentaries.116AsexplanatoryAIbecomesmoresophisticatedandknowledgestructuresareengineered to incorporatecausality,we may see production-level automated content integrating aspects ofexplanation.Still,someaspectsofexposition,suchascommentaryflavoredbyindividual opinion, will remain outside the purview of automation, or willextrude human commentary that is parameterized and amplified through themachinery of the system. User-interfaces could allow for an opinionated

configurationofthecontentproductionpipeline.Personalinterpretationsofthedefinition of newsworthiness, for instance, as well as the word choices,templates, and tone used could then infuse what would be a more overtlysubjectiveexposition.

Argumentationgoesbeyondexpositionbyaddingthepurposeofpersuasion,oftryingtoswayareader’sattitudesandultimatelybehaviors.Providedthattheargument is coming fromahumanauthor andcouldbe encoded in a templatethroughhuman-drivendocumentplanningandthatthereisdatathatsupportstheargument,therearenorealbarrierstoproducingautomatedcontentthatreflectsthat argument. Automated argumentation is an area that is ready to colonize,though perhaps less so by traditional journalists than by other strategiccommunicators.

A strong human collaborator enables automation to also participate in thecompositionofcriticism.Criticismrelieson theexerciseofpersonal judgmentandtasteinappraisingsomenewsworthyobjectorevent.Thisispreciselywhatsophisticated sports commentatorsdowhen they critique theperformanceof ateamorathlete.OneexampleinthisdirectionisaprojectfromDerSpiegelthatautomaticallyproducestacticaldatavisualizations,ormapsofhowaparticularsoccer team has performed.117 But the maps are practically unpublishable ontheir own. A squad of sports reporters interpret the maps and add their ownknowledge and critique of the game as they view the automatically producedvisualization through their own critical lens. The automated visualizationbecomesahelpfultoolforgroundingaparticularformofcriticism.

TopicalOpportunitiesThere is certainly room for automation to explore opportunities in exposition,argumentation, and criticism, aswell as to deepen and expanddescriptive andnarrative capabilities. But there are also opportunities to apply existingtechniques to new topics. We’ve already seen automation gaining traction inareas such as sports, finance,weather, and electionswhere data are available.Where else might there be demand for automated content if only data wereavailable?Andwhereisthereenoughofapotentialforscaleintermsofeitherraw scale of interest in publication of an event by different publishers or forscaleovertimebecauseaneventisroutineorrepetitive?

To examine these questions I gathered data fromEventRegistry, a contentaggregator that as of mid-2017 had slurped up more than 180 million newsarticles from more than 30,000 news publishers worldwide. You might have

guessedthatEventRegistryisaboutevents.Thedatabaseclustersnewsarticlesthat are about the same event, not unlike how Google News groups articles.EventRegistry then enriches these events by automatically tagging themwithcategories—high-levelgroupingssuchas“Health”or“Science”butalsomuchmorefine-grainedonessuchas“ClimateChange.”Icollectedallof theeventsfrom the Event Registry database for June 2017—more than 20,000—thatreferredtoeventsintheUnitedStatesandhadatleastsomecontentinEnglish.Fromthese I tabulated thedominantcategoriesofcontentbasedonnumberofeventstoseewhichcategoriesshowpromiseforscaleovertime.Ialsolookedattheaveragenumberofarticlesperevent to seewhichcategoriesmightbenefitfromscalingcoverageoverarangeofdifferentaudiences.

Table3.1liststhetopcategoriesofcontentaccordingtothenumberofeventsobserved. Categories such as “Law Enforcement,” “Crime and Justice,”“Murder,” “Sex Offenses,” and “Fire and Security” all clocked hundreds tothousands of events over the course of a month, with the average number ofarticlespereventbeingintherangeoffifteentotwenty.Thedominanceofthesecategoriesisperhapsunsurprisinggiventheoldjournalismadage,“Ifitbleeds,itleads.” It is also no surprise that categories relating to sports hit a sweet spotacross theboard—agreatvarietyof sports-related topics receivedboth routineand fairly widespread coverage. These include American favorites such as“Baseball,”“ProfessionalBasketball,”and“IceHockey”aswellasnewclassicssuchas“CompetitionShooting”and“AutoRacing.”“MartialArts”wasanotherstandout: therewereabout forty-sixarticlesonaverage foreachof theevents,suggestinganopportunityforbreadth.Onesurpriseinthedatawastheapparentextent of media produced about video game events, which includes e-sports,videogametournaments,andfantasysports.However,uponcloserinspection,itappearsEventRegistrymiscategorizedsomeregular sportingeventsas“VideoGame”eventsmakingitdifficulttosayhowprevalentcoverageofvideogameeventsreallywas.

Besidescrime,sports,andpotentiallygamesthereareseveralothercategoriesofcontentthatcouldtakeadvantageofthepotentialforscale.Service-orientedtopicsrelatedtotransportationsuchas“Aviation”arepromising,with118eventsandabout21articlesperevent,asis“Health”with178eventsand16articlesperevent.Policy-orientedtopicsalsohaveapotentialforscale:“Immigration”had109 events with an average of 19 articles per event, “Environment” had 103eventsalsowith19articlesonaverage,“Economic”issueshad79eventswithanaverage of 24 articles, and “Campaigns& Elections” had 110 eventswith 21

articlesonaverage.Thereisarangeofopenlyavailabledatarelatedtoeconomicissues, as well as for immigration and the environment, suggesting that thesetopicscouldperhapsbecoveredusingautomatedcontenttechniques.Ingeneral,however,catalogingtheavailabledataresourcesforeachofthesetopicswouldbenecessarytoidentifywherethebestopportunitieslie.

Table3.1.    CategoriesofnewsinEventRegistrywithmorethan100distincteventsinJune2017.*

CategoryEventCount

MeanArticleCount

MedianArticleCount

Society/Law/LawEnforcement 1,440 17 8

Reference/Education/Colleges&Universities 767 15 8

Business/Business_Services/Fire&Security 657 14 8

Games/Video_Games/Recreation 564 21 10

Games/Video_Games/Browser_Based 495 30 11

Society/Religion&Spirituality/Christianity 447 15 8

Society/Issues/Crime&Justice 389 18 8

Sports/Football/American 381 15 10

Society/Law/Services 355 26 9

Society/Law/LegalInformation 235 30 9

Business/IndustrialGoods&Services/IndustrialSupply

221 13 7

Sports/Basketball/Professional 194 31 11

Recreation/Guns/CompetitionShooting 190 31 8

Health/PublicHealth&Safety/EmergencyServices 188 12 8

Society/Issues/Health 178 16 8

Shopping/Sports/Baseball 178 25 9

Home/Family/Parenting 176 14 10

Sports/Hockey/IceHockey 174 16 10

Arts/PerformingArts/Theater 160 24 8

Science/Biology/Flora&Fauna 155 18 8

Sports/Baseball/Youth 154 15 8

Society/Issues/Business 152 41 10

Society/Crime/Murder 148 17 8

Games/Video_Games/Downloads 140 25 10

Society/Issues/Transportation 139 15 7

Computers/Software/Accounting 129 20 9

Society/Crime/SexOffenses 128 46 7

Society/People/Generations&AgeGroups 120 14 7.5

Society/Issues/Warfare&Conflict 120 20 10

Business/Transportation&Logistics/Aviation 118 21 8

Computers/Internet/OntheWeb 117 25 11

Science/Technology/Energy 114 13 8

Sports/Motorsports/AutoRacing 114 19 8

Reference/Education/Kthrough12 113 10 8

Society/Politics/Campaigns&Elections 110 21 8

Society/Issues/Immigration 109 19 10

Arts/Television/Programs 108 16 8

Business/Energy/Utilities 108 18 8

Society/Issues/Environment 103 19 9

Recreation/Pets/Dogs 103 21 9

*EachcategoryisshownwiththemeanandmediannumberofEnglisharticlespublishedforeachevent.

Ofcourse, thereare limitations toEventRegistrydata in termsofcoverageand theextentof representationofmedia.Also, thenumbersnotedabovedealonlywiththesupplysideofcontent,andthestrategicdeploymentofautomatedcontent should also consider reader demand. But these data do provide hintsaboutwhatcategoriesor topicsmightbepromising forautomation. It’snotanentirelyunreasonableassumptionthatthecoverageofthesetopicsisinresponsetodemandthat’sbeenexpressedinsomewaybymediaconsumers.Thenumberssuggest that theremay still be opportunities for finding underexplored nicheswheredataavailability,technicalcapability,editorialinterest,andsomeformofneedforscale—eitherovertimeorforbreadth—allintersect.

Automatedcontentproductionoffersahostofnewopportunitiestoincreasethespeed,scale,accuracy,andpersonalizationofnewsinformation,someofwhicharealreadycreatingrevenueandcompetitiveadvantagesfornewsorganizations.At the same time, human roles and tasks are evolving to accommodate thedesign,maintenance,andsupervisionofautomatedcontentsystems.Farfromapanaceafornewsproduction, thereareahostof limitations thatgoalongwithcontent automation. A dependency on data, as well as bounds on flexibility,interpretive ability, and writing quality place real constraints on how widelyautomatednewsproductioncanspread.Still,somedomainsofcontentmayyetbe colonized by automated news content, as journalists find where end-userdemand intersects with technical capabilities and constraints. Another milieuwheretheseintersectionsareincreasinglybeingexploredissocialmedia,whereautomatedcontenttakesonasocialfaceintheformofnewsbots.