newsviews: an automated pipeline for creating custom … · 2018. 3. 30. · multiple possible...

10
A I o m d g n in d a N c t ta f w a k to ta A N o A H M I O s n o d im a P ro p o sp C C h Ne Tong Gao, Ey School Universi {gaoto eadar} ABSTRACT Interactive visu online news a most prevalent designers capa geovisualizatio novel automate nteractive, ann designers. New and data compa NewsViews sy concepts and lo ential annotati abases, and tec fy and create v work, we deve annotated, map key features fo o context, vari ation and anno Author Keywo Narrative inform online news; ge ACM Classific H.5.m. Informa Miscellaneous. NTRODUCTIO Online news c social, and pol national, and gl of novel mech dissemination mages, videos and integrated Permission to make dioom use is granted wor commercial advanta page. Copyrights for c ored. Abstracting with ervers or to redistribu permissions from perm CHI 2014, April 26–M Copyright ©ACM 97 http://dx.doi.org/10.1 ewsVie Cu Jessica Hul ytan Adar of Informati ity of Michig ong, jhullman }@umich.ed ualizations ad rticles. Geogr form of these able of produc ns are scarce ed news visua notated maps w wsViews’ map arisons relevan ystem leverage ocations discus ons), an exten chniques adapt visually “intere elop and evalu p generation an or successful r iable selection, otation quality) ords mation visualiz eovisualization ation Keyword ation interfaces ON overs a gamut itical topics (a lobal scales. It hanisms to pro of additional s, and informa into news art gital or hard copies of ithout fee provided tha age and that copies be omponents of this wo h credit is permitted. ute to lists, requires pr m[email protected]. May 1, 2014, Toronto, 8-1-4503-2473-1/14/ 145/2556288.255722 ews: An ustom G llman, ion gan n, du d rich, data-b aphic maps a e visualizations cing high-qual e. We present alization system without requir ps support tren nt to a given n es text mining ssed in articles nsive repository ted from cartog esting” themat uate key criter nd experiment representations , “interestingne . zation; interact n; text summari ds s and presentat t of informatio among others), ts digital nature ovide context, information. ation graphics ticle presentati f all or part of this wo at copies are not made ear this notice and the ork owned by others t To copy otherwise, o rior specific permissio , Ontario, Canada. /04...$15.00. 28 n Autom Geovis Br Comp & E Universi bhecht based context re currently th s. Unfortunatel lity, customiz t NewsViews, m that generat ring profession nd identificatio news article. Th to identify k s (as well as p y of “found” d graphy to iden tic maps. In th ria in automati tally validate th (e.g., relevan ess” of represe tive maps; ization. ion (e.g., HCI) on on economi , spanning loca e affords a rang navigation, an Media such can be creat ions to augme ork for personal or cla e or distributed for pro e full citation on the fi than ACM must be ho or republish, to post on and/or a fee. Requ mated P ualizati rent Hecht puter Scienc Engineering ity of Minne t@cs.umn.ed to he ly, ed a tes nal on he ey po- da- nti- his ic, he nce en- ): ic, al, ge nd as ed ent the rea ticated summa ing sta more t indirec map su econom ograph only th map ba for ref come m cerning tions a points The an are typ ists. A data th propria design tations Figure geograp come br ass- ofit irst on- on uest Pipeline ions fo ce esota du ading experienc d data visualiza arize additiona atic and interact tangible depict ctly implied in upplementing a mic mobility in hic trends at a hrough text (Fi ased on the arti ferenced locati mobility, or m g income mobi are often annot and add contex nnotated visua pically the wor Among other t hat enhances a ate representat possibilities a s. Each decisio 1: Thematic m phical differenc racket [19]. e for Cr r News Nichola School Columb nad2141 ce. Of particula ations used to p al context for t tive data “view tion of trends n an article. F an article abou n the United S level of detail igure. 1). A ne icle content an ions with part making persona ility in her hom ated [13, 25] t xt for the story alizations that rk of skilled de tasks, the desi and complemen tions and enco and select usef on in this pipel map by the New ces in the likelih reating s as Diakopou of Journalis bia Universi @columbia. ar interest are present relevan these articles. B ws,” visualizati explicitly me For example, ut geographic v States might sh that is difficul ews reader can nd data, compar ticularly high ally-relevant qu metown. These to guide the rea y being told. appear in new esigners and da igner must fin nts the article, odings, mainta ful and approp line has an effe w York Times hood of moving ulos sm ity .edu the sophis- nt data and By provid- ions offer a entioned or a thematic variation of how the ge- lt to obtain n query the ring values or low in- ueries con- e visualiza- ader to key ws contexts ata journal- nd relevant , select ap- ain various riate anno- ect on sub- describing g up an in- Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada 3005

Upload: others

Post on 27-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

AIomdgnindaNcttafwaktota

ANo

AHM

IOsnodima

ProoposepCCh

Ne

Tong Gao, Ey

School Universi

{gaotoeadar}

ABSTRACT Interactive visuonline news amost prevalent designers capageovisualizationovel automatenteractive, ann

designers. Newand data compaNewsViews syconcepts and loential annotatiabases, and tec

fy and create vwork, we deveannotated, mapkey features foo context, variation and anno

Author KeywoNarrative informonline news; ge

ACM ClassificH.5.m. InformaMiscellaneous.

NTRODUCTIOOnline news csocial, and polnational, and glof novel mechdissemination mages, videos

and integrated

Permission to make digoom use is granted wi

or commercial advantapage. Copyrights for cored. Abstracting withervers or to redistribu

permissions from permCHI 2014, April 26–MCopyright ©ACM 97http://dx.doi.org/10.1

ewsVieCu

Jessica Hulytan Adar of Informatiity of Michig

ong, jhullman}@umich.ed

ualizations adrticles. Geogrform of these

able of producns are scarceed news visuanotated maps wwsViews’ maparisons relevan

ystem leverageocations discusons), an extenchniques adaptvisually “intereelop and evalup generation anor successful riable selection,otation quality)

ords mation visualizeovisualization

ation Keywordation interfaces

ON overs a gamutitical topics (alobal scales. It

hanisms to proof additional

s, and informainto news art

gital or hard copies ofithout fee provided thaage and that copies beomponents of this wo

h credit is permitted. ute to lists, requires pr

[email protected]. May 1, 2014, Toronto,

8-1-4503-2473-1/14/145/2556288.255722

ews: Anustom Gllman,

ion gan n,

du

d rich, data-baphic maps a

e visualizationscing high-quale. We presentalization systemwithout requir

ps support trennt to a given nes text mining ssed in articles

nsive repositoryted from cartogesting” thematuate key criternd experimentrepresentations, “interestingne.

zation; interactn; text summari

ds s and presentat

t of informatioamong others),ts digital natureovide context,

information. ation graphics ticle presentati

f all or part of this woat copies are not madeear this notice and theork owned by others tTo copy otherwise, orior specific permissio

, Ontario, Canada. /04...$15.00. 28

n AutomGeovis

BrComp

& EUniversi

bhecht

based context re currently ths. Unfortunatellity, customizt NewsViews, m that generatring professionnd identificationews article. Th

to identify ks (as well as py of “found” dgraphy to identic maps. In thria in automatitally validate th (e.g., relevaness” of represe

tive maps; ization.

ion (e.g., HCI)

on on economi, spanning locae affords a rangnavigation, anMedia such can be creat

ions to augme

ork for personal or clae or distributed for proe full citation on the fithan ACM must be hoor republish, to post on and/or a fee. Requ

mated Pualizatirent Hechtputer SciencEngineeringity of [email protected]

to he ly, ed

a tes nal on he ey

po-da-nti-his ic, he

nce en-

):

ic, al, ge nd as ed

ent

the reaticatedsummaing stamore tindirecmap sueconomographonly thmap bafor refcome mcerningtions apoints

The anare typists. Adata thpropriadesign tations

Figure geograpcome br

ass-ofitfirston-on

uest

Pipelineions fo

ce

esota du

ading experiencd data visualizaarize additionaatic and interacttangible depictctly implied inupplementing amic mobility inhic trends at a hrough text (Fiased on the artiferenced locatimobility, or mg income mobi

are often annotand add contex

nnotated visuapically the worAmong other that enhances aate representatpossibilities a

s. Each decisio

1: Thematic mphical differencracket [19].

e for Crr News

NicholaSchool Columb

nad2141

ce. Of particulaations used to pal context for ttive data “viewtion of trends n an article. Fan article aboun the United Slevel of detail igure. 1). A neicle content anions with part

making personaility in her homated [13, 25] txt for the story

alizations that rk of skilled detasks, the desiand complementions and encoand select usefon in this pipel

map by the Newces in the likelih

reating s as Diakopou

of Journalisbia Universi@columbia.

ar interest are present relevanthese articles. Bws,” visualizati

explicitly meFor example, ut geographic vStates might sh

that is difficulews reader can

nd data, comparticularly high ally-relevant qumetown. Theseto guide the reay being told.

appear in newesigners and daigner must finnts the article,odings, maintaful and appropline has an effe

w York Times hood of moving

ulos sm ity .edu

the sophis-nt data and By provid-ions offer a entioned or a thematic

variation of how the ge-lt to obtain n query the ring values or low in-ueries con-e visualiza-ader to key

ws contexts ata journal-nd relevant , select ap-ain various riate anno-ect on sub-

describing g up an in-

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3005

Page 2: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

sequent steps and frequently requires time-consuming itera-tion. As such, resource limitations and process complexity constrain the creation processes of professionals, making it challenging to provide visualizations for the large collec-tions of articles available on news websites today.

Automation has been employed in news visualizations, yet these few solutions tend to be specific to small subsets of news articles. For example, applications generate annotated stock time series [12] or graphics to accompany articles that follow a strict narrative template [22].

We explore how a complex visualization pipeline, or se-quence of decisions and algorithms for solving them, can be automated to scale the creation of popular types of news geovisualizations. Our work contributes to automated news visualization research by identifying key criteria for a news map generation pipeline. We describe the NewsViews sys-tem, which implements this pipeline to automatically gen-erate visualizations to accompany news articles using geo-graphic information from multiple sources. Criteria include pairing relevant data variables to article text, finding loca-tions of interest, choosing relevant annotations, setting the map extent, and modeling maps’ visual interestingness.

We secondly contribute a demonstration of how these crite-ria can be operationalized, employing text mining and ex-traction techniques as well as visual-spatial pattern analysis to realize the NewsViews pipeline. NewsViews identifies relevant features of the article (topics, locations, etc.) auto-matically, and uses these features to select from hundreds of data variables and ~670,000 spatiotemporal “cases” (i.e., database cells). Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base) are generated and visualized. A novel ranking tech-nique selects the visualization to maximize the relevance of selected data and annotations to the article and leverage statistical features to align with users’ perceptions of map “interestingness.” Our final contribution is an evaluation of users’ reactions to our operationalized criteria. We evaluate our criteria as operationalized both independently and in combination, including a novel perceptual evaluation of Moran’s I as a measure of visual “interestingness.”

RELATED WORK

Narrative Visualization Research around narrative visualizations—visualizations that guide users’ interpretations using forms of visual, tex-tual and procedural rhetoric [13, 25]—informs the design of NewsViews. Many exemplary artifacts motivating scholarly discussion come from news settings, where interactive vis-ualizations presented with text articles help explain news [5]. Text annotations play an important role in guiding us-ers’ interactions and interpretations [25, 13]. Annotations are used in two primary ways: to emphasize data-based observations (“observational” use), such as a maximum value in a series, or, as applied in NewsViews, to add in-formation that is not otherwise presented (“additive”) [12].

Typically, the annotated narrative visualizations that appear in online news contexts are the work of designers with ex-pertise in graphics, statistics, and journalism. The time these experts spend creating graphics also makes it difficult to scale their process to the massive amounts of new and archived news content, motivating automated solutions.

Automatic Visualization Generation and Annotation Various recent projects contribute methods for automatical-ly generating or supplementing visualizations, such as by depicting a chain of news stories connecting two articles [27], or summarizing relations between scientific papers using metro-style maps [26]. Google News Timeline and Google Finance present automatically-generated visual summaries of articles, though typically using relatively simple chronological parameters. Narrative Science (http://narrativescience.com) produces narrative style re-ports and graphics to walk end-users through datasets of interest. Visualization researchers have contributed methods for adding annotations to existing visualizations, including graphical and textual annotations to emphasize trends, max-imum points, or means, among others [16], and text annota-tions on interesting trends in point data (e.g., [15]).

Our earlier system, Contextifier [12], produces narrative visualizations for online business news. Textual features of a news article about a company are used to generate a cus-tomized line graph of that company’s stock. Additive anno-tations drawn from other articles provide information rele-vant to the company and context, yet external to the original article or data. These annotations are selected and placed using a combination of topical relevancy features, features suggesting important (historical) company events, and sali-ency features computed on the data series. Most notably, whereas Contextifier focused on one data type and one spe-cific instantiation (i.e., a stock time series) NewsViews is designed to work with articles from arbitrary domains, se-lecting appropriate data of various types (i.e., georeferenced data, time series) and between possible visualizations. Like Contextifier, NewsViews generates annotation content from a news corpus but can also support observational annotation of extreme values (e.g., min/max).

Wu et al. [21] proposed the semi-automated MuckRaker system for connecting news readers to a database of rele-vant context using a visualization interface. Unlike NewsViews, MuckRaker leverages the crowd to improve automatic presentation features, yet still calls for explicit input from the user. Other work focuses on the generation and searching of very large tables of structured data, but does not concern visualization [4].

Geovisualization and News Maps often appear with news articles to improve readers’ understanding of a story’s geographic context, to make ge-ographic trends personally relatable, or simply to attract readers’ attention [5, 9, 31]. A survey identifies maps as the most prevalent visual format used in a recent sample of narrative news contexts [12]. Presenting maps with articles

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3006

Page 3: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

incb

TeBRcTopp

SWc

UAasizhcapc(eadclautiu

ncreases a newcoded learningbenefits to larg

The cartographegories: refereBoth forms freqReference mapcific locations Thematic mapsone or more gports both typepropriate type g

SYSTEM OVERWe briefly deschitecture below

User ExperienA news reader article as part site. If the articzed NewsView

her reading alocontextualized about the resulperformance, thchoropleth map(see Figure 2).education attaiareas in propoduring the timecases, the articlayer), but may

unfamiliar to thion is a “refer

understand the

Figure 2: Terence map

ws reader’s ge [9]. NewsViee numbers of a

hy literature divence/locator mquently appears are dedicatedsuch as boun

s are used to “eeographic varies of maps andgiven a news st

RVIEW scribe the user w.

ce interacts with Nof her normal

cle mentions onws visualizatioongside the art

with data, suclts of a nationwhe presented mp, showing ed In this view, inment of resirtion to the me frame impliele may not impy still mention he user. In thisrence map” wspatial contex

The NewsView p is shown to th

eography knowews automaticaarticles.

vides maps intmaps and themr in news geovid to the commundaries or po

emphasize the siables” [29]. Nd intelligently tory.

experience an

NewsViews asinteraction wi

ne or more locaon is presentedicle. If the artich as in the cawide evaluatiomap is a custoducation attain

the thematic idents by cou

measured leveled by the articply relevant da

locations thats case, the pres

which the user xt of mentione

interface, depiche lower left.

wledge via duaally scales the

o two broad camatic maps [29

isualizations [5unication of sp

oints of interespatial pattern NewsViews su

chooses the a

nd high level a

s she reads a teith a news weations, a customd to supplemeicle topic is bease of an articn of educationomized themat

nment by counmap depicts th

unty by colorinl of the variabcle text. In othata (i.e., thematt are likely to bsented visualizcan leverage

ed locations. F

cting an input a

al-ese

at-9]. 5]. pe-st. of

up-ap-

ar-

ext eb-m-ent est cle nal tic

nty he ng ble her tic be

za-to

For

exampyield acontex

The mdemanof locadetailearticle tion ofthe entperson

For newmore tCustomFor exeducatiin the describscores.tional idata, oarticlesread thized torelatedto usin

Other fto extrtions inannotaeral “ovalues system

article and custo

ple, an article aba zoomed in mxt of better-kno

map supports annd [28] allow tations mentioned perspective

alone, (2) shef educational atire U.S., and

nal interest, like

ws readers to ftextual contexmized annotatioxample, the usetional trends inarticle. Or, she

bing why som. Annotations information suor details on ps referred to inhese articles ino that article ind articles via thng text-based li

features. Obsereme values arn the news [12

ations of extremoutlier” locatio

(see Fig. 4, bm to create furth

om annotated th

bout a marathomap of Missourown nearby citi

nalysis in sevethe user to com

ned in the articthan can typic

e can infer the attainment by (3) she can z

e her hometow

fully understanxt than is avaions can provider may have l

n general or thee may be curio

me regions or affixed to map

uch as locationparticular testsn annotations cn full, and to vnstead. Thus, thhe graphical inists or search fe

ervational annore also commo2]. NewsViewsme values in caons display unbottom). We dher visualizatio

hematic data. A

on in Nixa, Miri that places Nies.

eral ways: (1) mpare the speccle, providing hcally be gainedbroader spatiaexamining levoom to specif

wn or current re

nd an issue oftelable in a sing

de this additionlittle prior knoe specific tests ous for more instates have m

p locations pron-specific explas used. The tican be used to nview a new mahe reader can nnterface as an eatures.

otations that simon in narratives presents the rases for maps wncommonly hidiscuss extensions below.

An example of a

ssouri may Nixa in the

details-on-cific levels her a more d from the al distribu-vels across fic areas of esidence.

en calls for gle article.

nal context. owledge on

mentioned nformation

much lower ovide addi-anations of itles of the navigate to ap custom-navigate to alternative

mply point e visualiza-reader with where sev-igh or low ions of the

a ref-

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3007

Page 4: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

TNfta

NTn1pftics

TTrpp“ntwFthetr(mloin

Wf[locabd

Fv

The NewsViewNewsViews’ abfor news is drivabular informa

News Corpus The news corpnews articles fr1994 to Decempaper [24]. Sinfirst matched aies, states, citi

cles that mentsampled ~1M o

Table DatabaseThe table databreferenced variployment, diseprevalence of “found” tables noisy and uncowo primary so

Freebase. We ehis database. T

entity as the mariplets that ind

(e.g., “pollutionmeasurement docation and geng all location

We manually afrom the Statis32] and the Bocations chang

can be referencanother, and a between these dia/DBPedia, a

From these datvetted by one a

Figu

ws Corpus bility to produven by severalation, which fir

pus we use conrom the New Y

mber 2006, covnce we focus oarticles againsties, and regiontioned at leastof these for the

e and Table Crbase is a colleiables, such asease statistics,

different indu(e.g., Excel fi

ontrolled. For ources of data:extracted tableThe Freebase sain page (e.g.,

dicate a relationn in”) and va

date). We extraenerated one tas for which we

augmented thisstical Abstract

Bureau of Laboge across dataced using a namGeoName id iIDs by comb

and the GeoNam

tasets we selecauthor of this

ure 3: The New

uce high qualit large databaserst describe her

nsists of a set York Times ra

vering all sectioon location-relat a complete lin names. This t one U.S. loc NewsViews c

rawler ection of temps state and cou, or statistics ustries. Our siles); howeverevaluation pu

the United Sts from Freebastructure incluSan Francisco

nship to the hialue (the numbacted these tripable for each ve had a measur

s collection wof the United

or Statistics. Ba sources (e.g.me in one tablein a third), we bining the Fremes dataset.

cted 155 “highpaper (with ex

sViews system

ty visualizatioes of textual anre.

of ~1.9 millioanging from Juons of the newated context, wst of U.S. couleft ~1.5M ar

cation. We sucorpus.

porally and geunty-level unem

describing thsystem suppor, these are oft

urposes we ustates Census anse by “pivotin

udes a high levo, CA) with RDgher-level enti

ber, units, andplets for each Uariable (contaie).

with recent tabld States (SAUBecause keys f., San Francise, a FIPS code built a mappin

eebase, Wikip

h quality” tablxpertise in geo

architecture.

ns nd

on uly ws-we un-rti-ub-

eo-m-he rts en ed nd g”

vel DF ity

d a US in-

les US) for co in ng

pe-

les og-

raphy afix varties (eployme“cleanewould tables incomemanufascience

The NeThe Npoints table don the articlesgeograWe notion imrelevanhow an

1. QueThe qusubseqploymelow annalismnews aappliedare ext

To identractor and or[18], wWikipepages in the entities

and cartographriables that were.g., percent uent counts). Wed” as these ar

be transformspanned topi

e, population,facturing, diseae and technolo

ewsViews PipNewsViews pi

(see Fig. 3, Tdatabase. We fo

challenges ins that contain aphically anchoote that dependmpact “downsnce and visuand why NewsV

ery Extraction uery extractor quently used tent by county)n approach bas

m, which statesarticle often cod in [12]). Noutracted as seeds

entify locationsr extracts U.S. rganization namwhich identifieedia page. Thifor entities wipage. The qu

s to a geograph

NewsV

1 Wim

2 Wrequ

3 Warvi

4 WasM

5 Wth

6 Wimin

Table 1: N

hy). This also are best express

unemployment We ensured thre used by our

med to unempics including , transportatioases and healthgy.

peline ipeline negotiTable 1) that ocus our discucreating themtoponyms, or

ored entities (edencies on locstream” decisial interestingneViews prioritize

identifies impto infer related) and entities (sed on the “invs that the firstontain the mainun phrases froms for topic iden

s mentioned instate names, cmes. We emp

es all entities ins is beneficial ith geographic uery extractor hic footprint.

Views: Visualiza

What are the topmplied comparison

What table views eference maps) uery?

What mark types (gre compatible witews?

What is the optimssignment given t

M views?What annotations he topic and visua

Which annotated vmized relevance angness?

NewsViews visu

allowed us to idsed as normaliz

rather than rhat variable na

algorithms (e.ployment). The

education, emon, agricultureh, wealth distrib

iates six mainleverage the cssion of the pip

matic maps to aplace names,

e.g., Harvard Uation and variions such as ess. We descres these decisio

portant attributd variables (e(e.g., locationsverted pyramidt several sente

n point of the am the first threentification.

n an article, thecounty names, ploy the Wikifn text with linbecause most coordinates rethus “anchors

ation Pipeline

ic, locations, andns? (data graphics omatch extracted

graphical formatsth each of the M

al retinal variablethe data in each o

add context tolizations? visualization maxnd visual interest

ualization pipeli

dentify and zed quanti-raw unem-ames were .g., unemp. e resulting

mployment, e, climate, bution, and

n decision corpus and peline here accompany as well as

University). iable selec-annotation

ribe below ons.

tes that are e.g., unem-s). We fol-d” of jour-ences of a article ([7], e sentences

e query ex-city names fier system nks to their

Wikipedia ecord these s” detected

d

r d

) M

e f

o

--

ine.

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3008

Page 5: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

Wikifier outputs multiple possible links for ambiguous enti-ties. To boost the accuracy of NewsView’s location tagging accuracy for these cases, we combine Wikifier’s output with that of OpenCalais [23] (a second named entity detec-tor). OpenCalais is used to filter entities tagged as locations by Wikifier but persons by OpenCalais.

2. View Generation (Column and Row Selection) Key to a successful visualization is the choice of an appro-priate data variable. The view generator takes the infor-mation outputted by the query extractor (consisting of ‘top-ical’ noun phrases, identified locations, plus the article time stamp) and applies further analyses to generate “data views” (representing a selection of columns and rows) on a large table database. The NewsViews database consists of columns that reflect a particular piece of data (e.g., unem-ployment, education, etc.). Rows are keyed by location for US states and counties (with counties as a default). If there are multiple samples of a variable from different periods (e.g., unemployment from 2009, 2010, 2011, etc.) a differ-ent column is assigned to each. Retaining historical values is useful as it allows us to generate other visualization types (e.g., a time series for a given location) but also allows us to default to data from the time the article was written—a 2005 article on unemployment plot data from 2013.

A set (of size M) relevant views identified by the view gen-eration step is then be passed to the visualization generator. The specific mechanisms for using the query information to select from all possible variables are detailed below.

3, 4, and 5. Visualization Generation The visualization generator takes the filtered set of M views identified by the view generator and uses visualization best practices to generate a set of N annotated visualizations with D3. A mark generator (3) first assigns the best visuali-zation mark type (e.g. point, line, polygon). In the default NewsViews implementation, these are polygons depicted using a Mercator projection. However, the pipeline is ex-tensible to other mark types such as lines, bars, or small multiples, discussed further below. A retinal variable as-signor (4) maps the selected data to a retinal variable (e.g., sequential color scheme with optimal binning; see Thematic Map Generation for details). The retinal variable assignor also identifies the default geographic extent (zoom level). An annotation engine (5) selects the descriptive annotations for (explicitly mentioned) map points from our news cor-pus. The generator may produce more than M visualizations (i.e., N ≥ M) to allow for cases where more than one visual-ization of a table view is possible (e.g., when the annotation content differs between two maps, different bins are used, or different zoom levels are proposed).

6. Visualization Ranking The visualization ranker ranks the N visualizations pro-duced by the visualization generator. Three criteria are used: the quality of the variable selection as captured by pointwise mutual information (PMI) between article text and variable labels, the annotation relevancy as captured by

cosine similarity, and the visual interestingness as captured by Moran’s I. Our ranking (detailed below) prioritizes var-iable match to article (via PMI), followed by visual interest-ingness. Our evaluation of annotation relevance and visual interestingness (Fig. 7) supports this ordering.

ALGORITHMS FOR FINDING THE RIGHT MAP Below, we detail the specific algorithms used to model each step in the NewsViews pipeline for map generation (Fig. 3).

Identifying the Best Variable for a Thematic Map A critical decision in the map creation pipeline concerns the data to be depicted with an article. The noun phrases from the first three sentences of an article (e.g., “job hunters” or “undergraduate degree”) rarely map directly to the name of variable likely to be most suitable (e.g., “percent with bach-elors”). We devise an algorithm to negotiate this decision using a measure of mutual information.

We rank the 155 variables (i.e., column descriptors) by their relevancy, calculating the co-occurrence score between the noun phrase and variable name weighted by Pointwise Mu-tual Information (PMI) [2]. PMI measures how often relat-ed two terms are by comparing how often both terms appear together in an article together versus independently. The more articles that mention both phrases simultaneously (i.e., the variable name and the extracted noun phrase), the high-er the PMI. In practice, we convert the original variable description into a sub-phrase set and determine the mean PMI between these and the article-derived noun phrases.

Isolating descriptive variable names: Each variable is rep-resented as a descriptive phrase, e.g. “fast-food-restaurants-by-county,” which may be too specific to match any given article. To improve recall we automatically identify the noun phrases from the original variable string, e.g. ‘fast food restaurants’, ‘fast food’, ‘food’, ‘restaurant’ and ‘coun-ty’. For each phrase, we count the matching documents in our corpus (as a fraction of corpus size). We discarded terms with relatively high frequencies (>0.05) such as ‘county’ (0.057), and those with relatively low frequencies (<0.0001) such as ‘educational attainment’ (0.00009). We retained those with moderate score: ‘fast food restaurant’ (0.00052) in the variable ‘fast-food-restaurants-by-county’, and ‘bachelor’s degree’ (0.00054) in the variable ‘educa-tional-attainment-25-years-and-over-bachelor's-degree-or-higher.’ The filtered set of noun phrases is retained (we refer to this set as Variable Phrases, or VarP).

Capture Potential Topic From Articles: For a given article, all the noun phrases from the first three sentences are ex-tracted from the query extractor. We remove phrases that have low tf-idf scores (<0.001), as these are less likely to carry important information about the article topic (weighting is calculated based on global frequency of the phrases in the news corpus collection). We refer to the re-maining noun phrases as “NP terms.”

Calculate PMI: We can then calculate the PMI between each NP term and VarP term and aggregate these as the

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3009

Page 6: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

mean PMI for all pairings. For efficiency, all articles in our dataset are indexed in a Lucene database. For each pair, we issue 3 queries, one for NP, one for VarP, and one for both VarP and NP simultaneously (the number of matching arti-cles are retained as NPcount, VarP count, and NP_VarPcount respectively). From this, we calculate PMI as follows: ( , ) = ( , )( ) ( )

= _ ⁄( ) × ( ⁄ )⁄ The mean PMI is calculated by determining the PMI for each phrase pairing (i.e., one item from NP and one from VarP) and normalizing by the number of such pairings (i.e., NPcount * VarPcount. Variables are ranked by their mean PMI relative to the extracted noun phrases. The intuition here is that the higher a variable’s PMI value is relative to the arti-cle text, the more relevant the variable to that content.

Thresholding to Focus Variable Relevance: Given our ex-pectations that the match between article content and a rel-evant data set will most drive map usefulness, we threshold the PMI-ranked list. This involves finding the threshold based on where PMI values decrease sharply in the ranked list and then filtering to the high PMI articles.

Reference Map Default: NewsViews supports reference as well as thematic maps. A heuristic is used to create refer-ence maps if no PMI score for any variable surpasses a threshold (set to 2.5 after qualitative experimentation). If the threshold is not met, a reference map is created. The map is zoomed to the locations mentioned in the article, annotated with simple place-markers (Fig. 2, lower left).

Generating Thematic Maps Having selected a set of georeferenced data from the table database, a retinal variable assignor determines the optimal mappings of a retinal variables (e.g., color, size, etc.) to visualize the data on a map. In the current implementation, we classify the data into 7 stepwise classes with the Jenks natural breaks classification method [14], an iterative data classification method to determine the best binning of val-ues given the observed distances between sets of values. Following Brewer’s color guidelines for mapping and visu-alization [3], NewsViews selects sequential schemes for the Jenk’s derived classes in which lightness steps dominate, using light colors to depict low data values and dark colors for higher values. Future work might incorporate alternative schemes (e.g., gradients or semantically-driven colors [20]).

Identifying Related Locations and Geographic Extent Primary Location Identification: Defining an appropriate map extent (or default zoom region) allows users to focus on the critical area that is relevant to the article. We first identify the “primary location” that represents the main location of interest for that article by extracting “seed loca-tions” from the title and the first three sentences. The occur-rence of each seed location within the article is counted.

Seed locations are ranked based on occurrence, and the most often mentioned seed is assumed as primary.

Identifying Related Locations: We then detected the rela-tionship between the primary and other locations (li, the ith location in the set of other locations mentioned in the article text). Three distance measures are calculated and combined. A spatial distance captures the Euclidean (coor-dinate) distance between location li and the primary loca-tion. A hierarchy distance captures the distance between the two locations on the geo-hierarchy tree, prioritizing local extent. This allows us to capture the idea of “sibling” rela-tionships (two counties in Virginia), enclosure relationships (Michigan Washtenaw County Ann Arbor), and other longer range relationships (e.g., Ann Arbor, MI and Alex-andria County, VA). Different weights can be assigned to different “hops,” (e.g., state to county versus county to city) though we use an equal weight. Finally, a weighted location co-occurrence distance corresponds to the number of times the two locations are mentioned together in the same cor-pus. We use the PMI(li,lmain) function, as described above, to calculate this weight.

The three measures are combined in a final location similar-ity score for each location and the primary location sim(li,lprimary) by dividing the weighted location co-occurrence difference (which reflects a strong relationship between the two locations), by the sum of the spatial and semantic distances (where larger numbers indicate dissimi-larity). Other implementations of these three forms of loca-tion similarity are possible. We stress the importance of the gross relationships between them over specific operational-izations and combination functions.

Filtering Locations and Setting Extent: We experimented with different thresholds for filtering the location set. Spe-cifically, we kept the locations that had high scores (sim(li,lprimary) ≥ 0.1), and discarded those with low scores (sim(li,lprimary) < 0.1). We set the map extent to the area that exactly contains all the high scoring locations (as well as the primary location).

Selecting and Placing Text Annotations NewsViews supplements each generated thematic map with additive annotations—textual descriptions that provide re-lated information (context) about the map, data, and input article topic where the source of that information is not the data or input article itself [12]. That is, we find text in other articles that are related to locations and data on the map.

For example, NewsViews considers the related locations (calculated above) for which we have articles that focus on the location and variable of interest. Thus, if the context article is about unemployment in Springfield, MA, NewsViews will identify Westfield, MA (a nearby town that often co-occurs in articles about Springfield) and Prov-idence, RI (a similarly sized New England town that also co-occurs in Springfield-related articles). NewsViews will then isolate news articles about these locations (Westfield

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3010

Page 7: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

atha

FloloTthcTr

Dblmrththusa

Appzb

Aoote

VOcticrin

and Providencehose that also m

articles, conten

Finding Locatiocation we quocation and t

These articles ahe bag-of-wor

cle, and the aThese search srelated to the va

Descriptive Sebeen identifiedected from the

mention the lorelevancy. To dhe possible sehe VarP term.

used as annotaspecificity by nand location bu

Annotation Plaplacing map anple custom plazations (includible screen spac

Annotations neother articles. Fof relevant annional annotatio

est points with

Visual InterestOne of the bencommunicate rion approache

captures the exregionalization nterestingness.

e) that focus mention Spring

nt will be extrac

ion-Topic Articuery the corputhe topic wordare sorted baserds article reprearticle with thsteps ensure thariable of inter

ntence Selectid for a locatione article text bcation and rando so, we iden

entence and caThe sentence

ation content. not only selectiut finding the b

acement: Manynnotations withcement algoriting the paramee.

eed not be drivFor thematic mnotations, Newons, such as pocorresponding

tingness (Salienefits of themaregional patternes [29]. We hyxtent to which

would serve a. Measures of

on unemploymgfield). From tcted as annotat

cle Matches: Fs for articles tds (recall, thed on their cosiesentation) wit

he highest simhe related artirest and the con

on: Once the n, a descriptiveby extracting anking the sententify the nounalculate their m

with the higheThis further

ing articles relbest sentence fr

y possible algohout overlap. Wthm to match aeters of the tex

ven by contenmaps with a smwsViews also cointers to the h

g values (Fig. 4

ency) Analysiatic cartographyns better than ypothesized th

a spatial distras a good proxspatial autocor

Figual ato eandatedplyipar

ment (preferrinthis set of relattion material.

For each relatthat mention the “VarP” termne similarity (oth the input ar

milarity selecteicles are in fantext.

best article he sentence is s

all sentences thences by topic-phrase terms

mean PMI givest mean PMIenforces topi

lated to the toprom that article

orithms exist fWe devise a simannotation rea

xt box) to avail

t extracted fromall number (<

creates observhighest and low

4, left).

s y is its ability other visualiz

hat a metric thribution exhibixy of that maprrelation accom

ure 4: Observaannotation poinextreme value (ld line graph gend for an article ing temporal co

risons (below).

ng ed

ed he

m). on rti-ed. act

has se-hat cal in en is

ic-pic e.

for m-ali-la-

om <2) va-w-

to za-hat its

p’s m-

plish tto the nearbyther apcalled little isble’s pdistribupercep

Final MThe finmost liing dalocatioto inclThe minteresWe thetation ranker relevanness, w

EVALUWe expperceivrecall ment oand (4)

The fNewsVview gazon’s

1)

2)

To invinteresods formore “ate annall quaNewsVa givenble intvisual ty of Nvisualla user’

LocatiWe evtion tag

tion-ting left), ner-im-

om-

this. Broadly sextent to whi

y each other arpart. We compuMoran’s I for

s known aboutproperties and ution [10], we

ptions in Evalua

Map Selectionnal map selectiikely to displa

ata with topicaons on the maplude only high

map ranker comstingness of eaen sort the posrelevancy of tselects as the

nce visualizatiwhile staying ab

UATION pect at least foved quality of of article loca

of data variable) prioritization

first two facViews’ map segenerator. We e

Mechanical T

Do extracted sessments of imDo the rankinumn selectionman raters eva

vestigate whethstingness, and rr selecting ann“holistic” evalunotation contenality of maps, wViews algorithmn feature type terdependence interestingness

NewsViews maly interesting ms annotation ra

on Tagging Evaluate the pregging as imple

speaking, spatiich two spatialre more similaute a measure r each of our 1t the relationshthe “interestin

e evaluate howation, below.

n ion process seeay highly relevally appropriatp. We thresholh PMI scoringmputes Moran’ach high PMI mssible visualizathe visualizatie final visualizion that maximbove a thresho

our aspects of thf NewsViews’ ation tagging, es, (3) selectio

n of visually int

ctors play aelection proceevaluate two q

Turk studies wit

locations alignmportant locat

ngs produced n) align with thaluating the dat

ther user percerelevance align

notations and inuation. In our snt, visual interwhich we genems or a varianis not includebetween the a

s as contributoaps (e.g., a ratmight be unintating).

valuation ecision and recemented in the

ial autocorrelal features (e.g

ar than those thof spatial auto

155 variables. hip a given spngness” of a mw Moran’s I a

eks to find the mvant and visualte annotations ld possible visg variables (e.g’s I to predict map, retaining ations by the mon’s annotatiozation that higmizes visual i

old annotation r

he algorithm tomaps: (1) pre(2) ranking a

on of annotatioteresting maps.

a fundamentaless as they comquestions in septh human rater

n with human tions in articlesin data assignhe relative ratita to article ma

eptions’ of man with NewsVinteresting mapstudy, participrestingness, anderate using thent of the systemed. This allowsannotation releors to the perceting of whethetentionally infl

call of NewsVquery extracto

ation refers g. counties) hat are far-ocorrelation

Given that atial varia-

map of that affects user

map that is ly interest-to explain

sualizations g.., top 3). the visual this score.

mean anno-on set. The gh variable interesting-relevance.

o affect the ecision and and assign-on content , .

l role in mprise the parate Am-rs:

raters’ as-s? nment (col-ings of hu-atch?

ap quality, iews meth-s, we use a ants evalu-d the over-e combined m in which s for possi-evancy and eived quali-er a map is fluenced by

Views loca-or.

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3011

Page 8: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

TWthtataalyw

RPthththw

V

TWfvuliPrthPwc

Adoruainbefa

Iab

FtE Task and MateWe randomly she query extraen locations w

authors used aagged location

at least 2 codery extracted lo

were mentioned

Results Precision (the he raters confihe set of 47 ahat raters belie

were not captur

Variable Assig

Task and MaterWe combined for which taggview generatorumns) for eachist of 155 rank

PMI was knowread the articlehe article alon

PMI. For each with the classifcontent (yes/no

A second set odisplayed each only six variabrate the relevausing a Likert-at All) to 3 (Hncluded a mix

ble was randomed list, a secondfourth variable and a fifth and

In each case, Uand above werebase reward of

Figure 5: Accuren variables ra

Error bars displ

erials selected 47 artiactor to tag relewere identifiedassessed the n set (each artirs). Coders readocations, and ad in the article

percentage ofirmed as accurarticles. Recaleved were of inred by tagging)

gnment Evalua

rials the 47 above

ged locations wr to assign and h of the resultiked variables

w. A first groupes in separate Hng with the toarticle, workerfication of the

o question).

of HITs basedwith a list of

bles were showance of each v-style radio bu

Highly Relevanof PMI-based

mly drawn fromd variable fromfrom between

sixth variable

U.S. workers we eligible to cof $0.10. A wor

racy of variableanked by PMIlay standard de

icles from the evant locationsd in each articprecision and icle-location sed the article, idadded up to 1but not extract

f system-taggerate) was 92.7%l (the percentanterest to the a) was 42.6% (

ation

articles with were evaluaterank data vari

ing 66 articlesfor each articlp of MechanicHITs, each of op ten ranked rs indicate whe variable as re

d on the same f variables. Hown, and workervariable to thetton list from

nt). The presend ranks. Specifim the first five m the first ten sn slot 11 and thfrom the lower

with an approvaomplete up to arker received a

e assignments tI, aggregated aeviation.

corpus and uss in each. One cle. Four of th

recall of eaets evaluated b

dentified correc0 locations thted.

ed locations th% ( =14.2) ovage of locatioarticle but whic

=23.9).

19 more articld, and used thiables (table cos. This createdle for which th

cal Turk workewhich displayvariables usin

ether they agreelevant to artic

66 articles alwever, this timrs were asked e article conte0 (Not Releva

nted variable liically, one varislots in the sorslots, a third anhe list midpoinr half of the lis

al rating of 95all 66 HITs fora bonus of $0.0

to article for fiacross 66 articl

ed to he ch by ct-hat

hat ver ns ch

les he ol-d a he ers ed ng ed cle

so me to

ent ant ist ia-rt-nd nt, t.

5% r a 02

for eacmajorit

Results30 andbetwee6600 aFigure of the Resultsracy an

We usratingsble decpossiblvance d

Annota

Task aWe ranmatch three m(FULLconsidtered tthat theable ouis likelWe cocandidcosine notatio

Two ptive cotizing (NoCOcosine is chosused to(NoSAsion, bvery losolutioble, bu

Figure for all

irstles.

ch variable forty of other wor

s d 41 unique woen 1 and 66 HIand 5940 ind5 depicts the variables in ths indicate a cond rank by PM

sed the results s to examine hcreases as the vle variables. Adrops off at ap

ations, Interes

and Materials ndomly selecteevaluation. F

map visualizatiL) applies annerations to mato the top three maps presentut of these toply to be predic

ompute annotadate articles m

similarity andon as described

partial solutionontributions ofvisually intere

OS) map is gesimilarity is n

sen from the uo create annota

AL) also followbut with one eow saliency mon constant, inut presented

e 6: Loess curvvariables, aggr

r which her rarkers for that a

orkers took parITs apiece ( =dividual variab

aggregated ache top ten posionsistent negat

MI.

of the secondhow the perceivvariables rank As shown in Fpproximately ra

stingness, and

ed 10 articles For each of thions for compnotation relevap creation. Pe matched vart relevant data

p three for whicted to be high

ations for eachmatching a locad using the topd above.

ns are also creaf having annotesting maps. Aenerated for eanot considered. unranked list oation content. ws the same prexception desi

maps. We held ncluding the la

randomly-gen

e applied to meregated across 6

ating matched article and varia

rt in the two ta=24) to providble ratings, reccuracy for theitions from thetive trend betw

d task’s Likertved relevance by PMI decre

Figure 6 perceank 30.

d Quality Eval

for location anhese articles, wparison. A “fulancy and maossible variabriables by PMI. We then seleich visual interhest by using

h location by ration and Varp article to cre

ated to examintation content A “no cosine ach article whInstead, a rand

of candidate aA “no saliencyrocess as the Fgned to allowall aspects of

abeling of the erated artifici

ean relevance b66 articles.

that of the able.

asks, doing de a total of espectively. e relevance e first task. ween accu-

t relevancy of a varia-ases for all eived rele-

luation

nd variable we created ll solution” ap saliency les are fil-I to ensure ct the vari-restingness Moran’s I. ranking all rP term by eate the an-

ne the rela-and priori-similarity”

here article dom article articles and y” solution FULL ver-

w us to test f the FULL data varia-al data in

by PMI rank

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3012

Page 9: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

pwFthu

E3mcth

R1pfOfNv3btv(0F3bu

Tramvcme

Fs

place of the trwith the name FULL version.he spatial aut

users appraisals

Each of 10 use30 article-visuamade aware ofcial. The task ahe map, answe

1) Annotatithe anno

2) Visual S(Do the v

3) Overall helping e

Results 10 university gparticipating vifor each of the Overall ratingsfollowed by TuNoSAL visualivant annotation3.6, NoSAL = 3both padj = 0.00erestingness)

visualizations (FULL = 4.2, C

0.001, both ad

FULL maps (309) = 11, p < both features (cusefulness.

The Pearson’s ratings overall annotation, ovemay have prodvalues than migcorrelation. Stimap interestingevaluative work

Figure 7: Mean tandard deviati

rue values in tof the actual

The intentiontocorrelation ths of how visual

ers completed aalization pairs f the possibilityasked the user tering three que

ion Relevance:otations to the aSaliency: How vvisual patternsRating: How

explain the con

graduate studenia an online intAnnotation R by the featureukeyHSD testsizations were ens than the N.6, NoCOS = 200). Similarly,were highest and lowest wCOS = 4.0, NoS

dj = 0.000). OvFULL = 3.4, N

0.001, both pcosine similari

correlation bewas fairly higherall ratings). duced visualizght occur in reill, the correlatgness is encourk concerning M

Ratings by Feaion.

the map but ladata variable

n of this was tohat Moran’s Illy interesting

a rating task thin random or

y that the mapto read the artistions:

How relevant article topic anvisually interes incite your cuuseful is the

ntent in the arti

nts earned $15terface. Figure elevance, Visue combination s indicated thaeach rated as ha

NoCOS visualiz.6, F(2, 309) = ratings of viswith the FUL

with the NoSASAL = 2.3; F(2, verall ratings wNoSAL = 2.8, N

adj = 0.001). Wity and Moran’

etween Moran’h (0.67, vs. -0.2Our randomiz

zations with loeal data with lition between traging, as we kMoran’s I.

ature Types. Cr

abeling the dadisplayed in tho see how muI detects affeca map is.

hat presented thrder. Users wep data was artiicle and examin

is the content d details? sting is the map

uriosity?) map overall

icle?

5 for voluntari5 depicts mea

ual Saliency, anused. ANOVA

at the FULL anaving more relzations (FULL

= 54, p < 0.00ual saliency (i

LL and NoCOAL visualizatio

309) = 149, pwere highest fNoCOS = 2.6, F(

We conclude th’s I) benefit ma

s I and salien26 and 0.08 wized data methoower Moran’sittle spatial autthe measure anknow of no pri

ossbars depict

ata he ch cts

he ere fi-ne

of

p?

in

ily ans nd As nd le-

L = 01, in-OS on <

for (2, hat ap

cy ith od s I to-nd ior

As thevaried these fof an aPearsoPMI an< 0.05resulteinclude

DISCU

LimitaThe Nthat ditions. Tarticle the fina databthus criteratiowhichpublishimprovgenerasuch ations. Warticle-matchethe texgeneracated m“stats-h

PipelinThe Nnews acorrespand timcriteriadatabaeralizeto captextracttypes apipelinquire a

Incorpothe Nethat adporal, compaated fachievscore” focusetimes. techniq

e PMI for labebetween map

factors did not association bet

on correlation vnd ratings (all ). We expect t

ed from the smed to avoid con

USSION

ations and ChaNewsViews’ piirectly impact The variables text (relative

nal visualizationbase that is brritical. These pons via the aut

could simplifhed data fromvements couldating maps for as by citing theWe experimen-map data aliges with a variaxt, and checkiated thematic mmethods for enheavy” articles

ne GeneralizatNewsViews’ aparticle implies ponding to a vmes (rows) ana of variable-se search, and

e to other data tture variable ttion techniqueas well. Otherne, such as moadapting the sp

orating additioewsViews pipeddress differen

temporal-geoarisons. Figure for an articleed using Heidcapturing the

s on comparisA “location s

ques currently

eled variables as (min: 2, maximpact ratings

tween ratings avalues > -0.05 Pearson correlthe lack of cor

mall range of hignfounds from P

allenges ipeline displaythe quality of that have the to other poss

n that the systroad, recent, anproperties can btomation of a fy updating th

m reputable sod address cha

articles that pe levels of a vnted with simpgnment such able name withing that these map. There is nsuring article-s.

tion and Futurpproach assum

a particular sview on a tablend variables (c-to-article simd data assignmtypes. Our demto article text es should applr high level crodeling visual pecific methods

onal data extraeline to be ex

nt implied comographic, and

4, bottom, dise focused on delTime [30] e relative degrsons between score” can like

implemented

and the annotax: 7), we confs. We found nand annotationand < 0.12), nolation values >rrelation for PMgh PMI variabPMI difference

ys various depf the generatedhighest PMI s

sible variables)tem generates. nd accurate inbe strengthenetable crawler

he database wources. Furtheallenges assocprovide explicivariable at diffle heuristics foas by looking

h surrounding svalues aligneroom for mor

-map alignmen

re Work mes that the coset of data come comprised o

columns). Our milarity (e.g., rment methods shmonstration of relevance, andly to other viriteria in the N

interestingness used (see, e.g

action algorithxtended to vis

mparisons, inclulimited grou

splays a line grtemporal co

to produce a ree to which aan indicator aewise be achiein NewsView

ation count firmed that

no evidence n count (all or between

> -0. 51 and MI to have

bles that we es.

pendencies d visualiza-scores with ) constrain Achieving

n its data is ed in future

(as in [4]) with newly er pipeline ciated with it statistics, ferent loca-or ensuring g for exact statistics in d with the re sophisti-nt for these

ontent of a mparisons,

of locations high level

relevance), hould gen-using PMI

d our topic isualization NewsViews ss, may re-g., [12]).

hms allows sualizations uding tem-

up-category raph gener-mparisons, “temporal

article text at different eved using

ws. For ex-

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3013

Page 10: NewsViews: An Automated Pipeline for Creating Custom … · 2018. 3. 30. · Multiple possible “views” (each represent-ing a set of salient columns from a multi-topic table data-base)

ample, calculating the number of locations mentioned in relation to the anticipated familiarity of those locations to users (inferred from population data, for example). This enables prioritizing map creation where locations are less likely to be known, and where a map’s locational af-fordances are more useful.

CONCLUSION We present an automated pipeline for generating relevant geovisualizations given context articles. By mining a con-text article locations and topic, NewsViews can select from a range of data views. We demonstrate that the system is able to generate visualizations that are both “interesting” and relevant. By automating visualization construction, our system can be applied in resource constrained environments that nonetheless have a huge corpus of articles.

ACKNOWLEDGEMENTS This work is partially supported by NSF Grant SES-1131500.

REFERENCES 1. ArcGIS Resource Center. Map the Data. ArcGIS Re-

source Center, 2013. http://bit.ly/18bKiEw 2. Bouma, Gerlof. Normalized (Pointwise) Mutual Infor-

mation in Collocation Extraction. Proc. of the Biennial GSCL Conference, (2009), 31-40.

3. Brewer, C. Guidelines for Use of the Perceptual Dimen-sions of Color for Mapping and Visualization, Proc. SPIE ’94 vol. 2171 (1994), 54-63.

4. Cafarella, M.J., Halevy, A., Wang, D.Z., Wu, E., and Zhang, Y. WebTables: Exploring the Power of Tables on the Web. Proc. VLDB Endow. 1, 1 (2008), 538–549.

5. Cairo, A. The Functional Art: An Introduction to Infor-mation Graphics and Visualization. New Riders, 2012.

6. Chang, K. Introduction to Geographic Information Sys-tems. McGraw-Hill, New York, NY, 2012.

7. Franklin, B. Key Concepts in Journalism Studies. SAGE Publications, London; Thousand Oaks, 2005.

8. Gabrilovich, E., Dumais, S., and Horvitz, E. Newsjunk-ie: Providing Personalized Newsfeeds via Analysis of Information Novelty. Proc. of WWW (2004), 482–490.

9. Griffin, J. and Stevenson, R. The Effectiveness of Loca-tor Maps in Increasing Reader Understanding of the Geography of Foreign News. Journalism & Mass Com-mun. Quarterly 71 (4) (1994), 937-946.

10. Hecht, B., Carton, S.H., Quaderi, M., et al. Explanatory Semantic Relatedness and Explicit Spatialization for Exploratory Search. Proc. SIGIR ‘12, (2012), 415–424.

11. Heer, J., Viegas, F.B., and Wattenberg, M. Voyagers and voyeurs: Supporting Asynchronous Collaborative Visualization. Commun. ACM 52, 1 (2009), 87–97.

12. Hullman, J., Diakopoulos, N., and Adar, E. Contextifier: Automatic Generation of Annotated Stock Visualiza-tions. Proc. CHI ’13 (2013), 2707–2716.

13. Hullman, J. and Diakopoulos, N. Visualization Rheto-ric: Framing Effects in Narrative Visualization. IEEE TVCG 17, 12 (2011), 2231–2240.

14. Jenks, G. and Caspall, F.C. Error on Choroplethic Maps: Definition, Measurement, Reduction. Annals of the Assoc. of Amer. Geographers 61, 2 (1971), 217–244

15. Kandogan, E. Just-in-time Annotation of Clusters, Out-liers, and Trends in Point-Based Data Visualizations. IEEE VAST '12, (2012), 73-82.

16. Kong, N. and Agrawala, M. Perceptual Interpretation of Ink Annotations on Line Charts. Proc. UIST ‘09 (2009), 233–236.

17. Kong, N. and Agrawala, M. Graphical Overlays: Using Layered Elements to Aid Chart Reading. IEEE TVCG 18, 12 (2012), 2631–2638.

18. Ratinov, L., Roth, D., Downey, D., and Anderson, M. Local and Global Algorithms for Disambiguation to Wikipedia. ACL (2011).

19. Leonhardt, D., Bostock, M., Carter, S., et al. In Climb-ing Income Ladder, Location Matters - NYTimes.com. The New York Times, 2013. http://nyti.ms/1dPGWbN.

20. Lin, S., Fortuna, J., Kulkarni, C., Stone, M., and Heer, J. Selecting Semantically-Resonant Colors for Data Visu-alization.

21. Marcus, A., Wu, E., and Madden, S. Data In Context: Aiding News Consumers while Taming Dataspaces. DBCrowd 2013, (2013), 47.

22. Narrative Science. http://narrativescience.com/. 23. OpenCalais, http://www.opencalais.com/ 24. Sandhaus, E. The New York Times Annotated Corpus.

LDC, 2008. http://bit.ly/1a4ObME. 25. Segel, E. and Heer, J. Narrative Visualization: Telling

Stories with Data. IEEE TVCG 16, 6 (2010), 1139–1148.

26. Shahaf, D., Guestrin, C., and Horvitz, E. Metro maps of information. SIGWEB Newsl., 4, (2013), 1–4.

27. Shahaf, D. and Guestrin, C. Connecting Two (or Less) Dots: Discovering Structure in News Articles. KDD 5, 4 (2012), 1–24.

28. Shneiderman, B. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. IEEE Symp. on Visual Languages ‘96. (1996), 336–343.

29. Slocum, T.A. and Slocum, T.A. Thematic Cartography and Geovisualization. Pearson Prentice Hall, Upper Saddle River, NJ, 2009.

30. Strötgen, J. and Gertz, M. HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions. Proc. IWSE ‘10 (2010), 321–324.

31. Tenore, M.J. Explainer Maps Locate, Contextualize and Localize News from Libya, Japan. Poynter, 2011. http://www.poynter.org/how-tos/newsgathering-storytelling/125206/explainer-maps-locate-contextualize-and-localize-news-from-libya-japan/.

32. U.S. Census Bureau. Statistical Abstract of the United States: 2012 (131st ed.) Washington, DC, 2011; http://www.census.gov/compendia/statab/.

Session: Journalism and Social News CHI 2014, One of a CHInd, Toronto, ON, Canada

3014