

DRAFT

The Lexicocalorimeter: Gauging public health through caloric input and output on social media

Sharon E. Alajajian,1,∗ Jake Ryland Williams,1,† Andrew J. Reagan,1,‡ Stephen C. Alajajian,2,§ Morgan R. Frank,3,¶ Lewis Mitchell,4,∗∗ Jacob Lahne,5,†† Christopher M. Danforth,1,‡‡ and Peter Sheridan Dodds1,§§

1 Department of Mathematics & Statistics, Vermont Complex Systems Center, Computational Story Lab, & the Vermont Advanced Computing Core, The University of Vermont, Burlington, VT 05401.

2 Women, Infants and Children, East Boston, MA 02128.

3 Center for Computational Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139

4 School of Mathematical Sciences, North Terrace Campus, The University of Adelaide, SA 5005, Australia

5 Culinary Arts and Food Science, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104.

(Dated: April 11, 2018)

We propose and develop a Lexicocalorimeter: an online, interactive instrument for measuring the “caloric content” of social media and other large-scale texts. We do so by constructing extensive yet improvable tables of food- and activity-related phrases, and respectively assigning them sourced estimates of caloric intake and expenditure. We show that for Twitter, our naive measures of “caloric input”, “caloric output”, and the ratio of these measures, “caloric balance”, are all strong correlates with health and well-being demographics for the contiguous United States. Our caloric balance measure outperforms both its constituent quantities; is tunable to specific demographic measures such as diabetes rates; provides a real-time signal reflecting a population’s health; and has the potential to be used alongside traditional survey data in the development of public policy and collective self-awareness. Because our Lexicocalorimeter is a linear superposition of principled phrase scores, we also show we can move beyond correlations to explore what people talk about in collective detail, and assist in the understanding and explanation of how population-scale conditions vary, a capacity unavailable to black-box type methods.

PACS numbers:

I. INTRODUCTION

Online instruments designed to measure social, psychological, and physical well-being at a population level are becoming essential for public policy purposes and public health monitoring [1, 2]. These data-centric gauges both empower the general public with information to allow comparisons of communities at all scales, and naturally complement the broad, established set of more readily measurable socioeconomic indicators such as wage growth, crime rates, and housing prices.

Overall well-being, or quality of life, depends on many factors and is complex to measure [3]. Existing techniques for estimating population well-being range from traditional surveys [1, 4] to estimates of smile-to-frown ratios captured automatically on camera in public spaces [5], and vary widely in the types of data they amass, collection methods, cost, time scales involved, and degree of intrusion. Many measures are composite in nature with two examples being the Gallup Well-Being Index, which is based on factors such as life evaluation, emotional health, physical health, healthy behavior, work environment, and basic access to necessary resources [4]; and the Living Conditions measure developed by the United States Census Bureau, which is derived from housing conditions, neighborhood conditions, basic needs met, a “full set” of appliances, and access to help if needed [6].

With the explosive growth of online activity and social media around the world, the massive amount of real-time data created directly by populations of interest has become an increasingly attractive and fruitful source for analysis. Despite the limitation that social media users in the United States are not a random sample of the US population [7], there is a wealth of information in these data sets and uneven sampling can often be accommodated.

Indeed, online activity is now considered by many to be a promising data source for detecting health conditions [8, 9] and gathering public-health information [10, 11], and within the last decade, researchers have constructed a range of online public-health instruments with varying degrees of success.

We cover a few relevant examples in the following section, and then turn to measuring “caloric content” of text, the focus of our present work.

∗ Electronic address: [email protected]
† Electronic address: [email protected]
‡ Electronic address: [email protected]
§ Electronic address: [email protected]
¶ Electronic address: [email protected]
∗∗ Electronic address: [email protected]
†† Electronic address: [email protected]
‡‡ Electronic address: [email protected]
§§ Electronic address: [email protected]

Typeset by REVTEX

arXiv:1507.05098v1 [physics.soc-ph] 17 Jul 2015


A. Previous work

In the difficult realm of predicting pandemics [12], Google Flu Trends [13], initially based very simply on search terms, enjoyed early success and acclaim but proved (unsurprisingly) to be imperfect and in need of a more sophisticated approach [14].

In work by several of the current authors and colleagues, Mitchell et al. measured the happiness of tweets across the US and found strong correlations with other indices of well-being at city and state level, such as the Gallup Well-being Index; the Peace Index; the America’s Health Ranking composite index of Behavior, Community and Environment, Policy and Clinical Care metrics; and gun violence (negative correlation) [15]. Using the same instrument, the Hedonometer, in 10 languages, we have also shown that the emotional content of tweets tracks major world events [2, 16].

Paul and Dredze found that states with higher obesity rates have more tweets about obesity, and states with higher smoking rates have more tweets about cancer [11]. They also found a negative correlation between exercise and frequency of tweeting about ailments, suggesting “Twitter users are less likely to become sick in states where people exercise.” They further found health care coverage rates to be negatively correlated with likelihood of posting tweets about diseases.

Chunara et al. recently found that activity-related interests on Facebook are negatively correlated with being overweight and obese, while interest in television is positively correlated with the same [17].

In an analysis of online recipe queries, West et al. found that the number of patients admitted to the emergency room of a major urban hospital in Washington, DC for congestive heart failure (CHF) each month was significantly correlated with average sodium per recipe searched for on the Web in the same month [18].

Eichstaedt and colleagues [19] have demonstrated that psychological language on Twitter outperforms certain composite socioeconomic indices in predicting heart disease at the county level. They were able to show in particular that the expression of negative emotions such as anger on Twitter could be taken as a kind of risk factor at the population scale.

Finally, in work directly related to our present study, Abbar et al. [20] have recently performed a similar analysis of translating food terms used on Twitter into calories. They found a correlation between Twitter calories and obesity and diabetes rates for the US, and explored how food-themed interactions over social networks vary with connectedness, finding suggestions of social contagion. While our approaches and results are largely sympathetic, our work incorporates estimates of physical activity which we will show provides essential extra information regarding health; introduces a phrase extraction method we call serial partitioning; and leads to an online implementation of a real-time instrument as part of our proposed ‘panometer.’

B. Lexicocalometrics

It has thus become clear that we can estimate population-scale levels of health and well-being through social media.

Here, we examine the words and phrases people post publicly about food and physical activity on Twitter on a statewide level for the contiguous United States (48 states along with the District of Columbia). As we explain fully below in Sec. II A and Methods and Materials, Sec. IV, we group categorically similar words and phrases into lemmas, and we then assign caloric values to these lemmas using the terms and notation “caloric input” for food, Cin, and “caloric output” for activity, Cout. We define the ratio of caloric output to caloric input to be a third quantity, “caloric balance”:

$$C_{\text{bal}} = \frac{C_{\text{out}}}{C_{\text{in}}}. \tag{1}$$

We use “phrase shifts” [2] to show how specific lemmas (e.g., “apples”, “cake with frosting”, “white water rafting”, “knitting”, and “watching tv or movie”) contribute to the caloric texture of states across the contiguous US. We then correlate all three values with 37 demographic measures, and we find statistically strong correlations with quantities such as high blood pressure, inactivity, diabetes levels, and obesity rates. For ease of language, we will generally speak of phrases rather than lemmas.

We have also generated an accompanying online, interactive instrument for exploring health patterns through the lens of “Twitter calories”: the Lexicocalorimeter. An initial version of the instrument may be accessed at this paper’s Online Appendices, http://compstorylab.org/share/papers/alajajian2015a/, and will be housed within our larger measurement platform http://panometer.org at http://panometer.org/instruments/lexicocalorimeter. We note that while our online instrument is based on Twitter, it may in principle be used on any sufficiently large text source, social media or otherwise, such as Facebook.

From this point, we structure the core of our paper as follows. In Sec. II, we establish and discuss our findings in depth. Specifically, we: (1) Outline our text analysis of a Twitter corpus from 2011–2012 (Sec. II A), reserving full details for Methods and Materials in Sec. IV; (2) Present caloric maps of the contiguous US contrasting the 48 states and DC through histograms and phrase shifts (Sec. II B); and (3) Examine how Cin, Cout, and Cbal correlate with a suite of demographic measures. In the Supporting Information, we provide a sample of confirmatory figures as well as all shareable data sets. We offer concluding thoughts in Sec. III.


II. ANALYSIS AND RESULTS

A. Estimating calories from phrases

From a sample of around 50 million geotagged tweets made during 2011 and 2012 in each of the 48 continental states and the District of Columbia, we counted the total number of times each food and physical activity phrase in our database was tweeted about (see Methods and Materials, Sec. IV, and Supporting Information). We then used these counts to determine the average caloric input Cin from food phrase tweets and the average caloric output Cout from physical activity phrase tweets as follows.

First, we equate each food phrase s with the calories per 100 grams of that food, using the notation Cin(s). (We also explored serving sizes but the databases available proved far from complete.) We then compute the caloric input for a given text T as:

$$C_{\text{in}}(T) = \frac{\sum_{s \in S_{\text{in}}} C_{\text{in}}(s)\, f(s|T)}{\sum_{s} f(s|T)} = \sum_{s \in S_{\text{in}}} C_{\text{in}}(s)\, p(s|T), \tag{2}$$

where f(s|T) is the frequency of phrase s in text T, p(s|T) is the normalized version, and Sin is the set of all food phrases in our database.
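As a minimal sketch of Eq. (2), assume phrase counts are already available as a dictionary. The calorie values, the counts, and the function name `caloric_input` are hypothetical illustrations, not entries from the authors' database. Phrases without a calorie score are ignored, so normalization here runs over database phrases only, which is our assumption about how p(s|T) is formed.

```python
def caloric_input(counts, calories_per_100g):
    """Frequency-weighted mean of calories per 100 g over scored phrases (Eq. 2)."""
    # Keep only phrases that have a calorie score in the (hypothetical) table.
    scored = {s: f for s, f in counts.items() if s in calories_per_100g}
    total = sum(scored.values())
    if total == 0:
        raise ValueError("text contains no scored food phrases")
    # p(s|T) = f(s|T) / total, and C_in(T) = sum_s C_in(s) * p(s|T).
    return sum(calories_per_100g[s] * f / total for s, f in scored.items())


# Hypothetical calorie values (kcal per 100 g) and phrase counts, for illustration only.
calories_per_100g = {"apples": 50.0, "bacon": 500.0}
counts = {"apples": 3, "bacon": 1, "hello": 10}  # "hello" is not a food phrase

c_in = caloric_input(counts, calories_per_100g)
print(c_in)
```

Because the score is a weighted average, it depends only on the relative mix of food phrases and not on how many tweets a region produces.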

Second, for each tweeted physical activity phrase, we use an estimate of the Metabolic Equivalent of Tasks, or METs, which we then convert to calories expended per hour, assuming a weight of 80.7 kilograms, the average weight of a North American adult [21]. Analogous to Cin(T) above, we then have

$$C_{\text{out}}(T) = \sum_{s \in S_{\text{out}}} C_{\text{out}}(s)\, p(s|T), \tag{3}$$

where now Sout is the set of all phrases in our activity database.
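The activity side can be sketched in the same way. The text does not spell out the conversion formula, so the code below assumes the common convention that 1 MET corresponds to roughly 1 kcal per kilogram of body weight per hour. The MET values, counts, and function names are hypothetical.

```python
BODY_WEIGHT_KG = 80.7  # average adult weight assumed in the text


def met_to_kcal_per_hour(met, weight_kg=BODY_WEIGHT_KG):
    # Assumed convention: 1 MET is about 1 kcal per kg of body weight per hour.
    return met * weight_kg


def caloric_output(counts, mets):
    """Frequency-weighted mean of kcal per hour over scored activity phrases (Eq. 3)."""
    scored = {s: f for s, f in counts.items() if s in mets}
    total = sum(scored.values())
    if total == 0:
        raise ValueError("text contains no scored activity phrases")
    return sum(met_to_kcal_per_hour(mets[s]) * f / total for s, f in scored.items())


# Hypothetical MET values and phrase counts, for illustration only.
mets = {"running": 8.0, "sitting": 1.0}
counts = {"running": 1, "sitting": 3}

c_out = caloric_output(counts, mets)
print(round(c_out, 3))
```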

We emphasize that both our food and exercise phrase data sets and Twitter databases are necessarily incomplete in nature. The values of Cin and Cout are thus not meaningful as absolute numbers but rather have power for comparisons. We also acknowledge that our equivalences are crude (e.g., each mention of a specific food is naively turned into the calories associated with 100 grams of that food) and later on we address our choices in more depth. Nevertheless, our method is pragmatic yet, as we will show, effective, and offers clear directions for future improvement.

For simplicity and ultimately because the results are sufficiently strong, we did not filter tweets beyond their geographic location. Tweets may thus come from individuals, restaurants, sports stores, resorts, news outlets, marketers, fitness apps, tourists, and so on, and further improvements and refinements may be achieved by appropriately constraining the Twitter corpus.

Finally, we take the ratio of Cout(T) to Cin(T) to obtain the text’s caloric balance Cbal(T). In general, we observe that a higher value of Cbal(T) at the population scale would appear to be intuitively better, up to some limit indicating negative energy balance. We note that Cbal = 1 is not salient and should not be taken to mean a population is ‘balanced calorically’. Using the difference Cout − Cin generates similar results but, from a framing perspective, we have reservations in creating a scale with a 0 point given the approximate nature of our measures.
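For concreteness, the ratio can be evaluated for the two states whose Cin and Cout values are quoted in the caption of Fig. 4. The function name `caloric_balance` is illustrative; only the four numbers come from the paper.

```python
def caloric_balance(c_out, c_in):
    """Ratio of caloric output to caloric input (Eq. 1)."""
    if c_in <= 0:
        raise ValueError("caloric input must be positive")
    return c_out / c_in


# (C_in, C_out) pairs as quoted in the caption of Fig. 4.
states = {"Colorado": (257.4, 203.5), "Mississippi": (271.7, 161.3)}

balance = {name: caloric_balance(c_out, c_in) for name, (c_in, c_out) in states.items()}
for name, value in sorted(balance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

Both values are well below 1, which illustrates why Cbal = 1 carries no special meaning for this instrument: the ratio is useful for comparing regions, not as an absolute energy budget.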

B. Caloric maps of the contiguous US

We now move to our central analysis and exploration of how our lexicocalorimetric measure varies geographically. We start with visual representations and then continue on to more detailed comparisons. In Fig. 1, we show three choropleth maps of our overall 2011–2012 measures of Twitter’s caloric input Cin, caloric output Cout and caloric balance Cbal, for the contiguous US. For all three maps, quantities increase as colors move from light to dark green.

These maps immediately allow for some basic observations which we will delve into and harden up as our analysis proceeds. For the food calories map, we see Cin is generally largest in the Midwest and the south while Colorado and Maine stand out as states with the lowest calories.

We see a different texture in the activity calories map with the highest calories appearing in the three-state block of Wyoming, Colorado, and Utah, as well as Vermont. Tweet-based caloric output drops to a low in Mississippi and the surrounding states, while Michigan also appears to have a low value of Cout.

Visually, the third map shows the highest values of Cbal are found for Colorado, Wyoming, and Vermont, and secondarily for Maine, Minnesota, Oregon, and Utah. Low values of Cbal appear in the region of Mississippi, Louisiana, Alabama, and Arkansas, as well as West Virginia.

Visually, we see that Cout is clearly better aligned with Cbal than Cin is. The reason is that for the present version of the Lexicocalorimeter, Cout has a larger dynamic range than Cin, roughly 160 to 210 versus 250 to 285, giving ratios of 210/160 ≃ 1.31 and 285/250 ≃ 1.14. We could assert that Cin is fundamentally less informative but:

1. In Sec. II E, we will find that some demographic measures correlate more strongly with Cin and Cout;

2. We may adjust the dynamic range of either measure by rescaling, introducing a kind of tunability [2] to the instrument (a feature we will reserve for future iterations); and

3. Because our food phrase database is a factor of 10 smaller than our activity phrase one, revisions of our instrument may elevate the power of Cin.

[Figure 1: three choropleth maps of the contiguous US, each state labeled with a phrase. Panel A: Calories in (color bar: deviation from national food-caloric avg.). Panel B: Calories out (color bar: deviation from national activity-caloric avg.). Panel C: Caloric Balance (color bar: deviation from national avg. caloric balance).]

FIG. 1: Choropleth maps indicating (A) caloric input Cin, (B) caloric output Cout, and (C) caloric balance Cbal in the contiguous United States (including the District of Columbia) based on 50 million geotagged tweets taken from 2011-2012. Darker means higher values as per the color bars. The histograms in Figs. 3, S2, and S3 show the specific rankings according to the three variables. The phrases in (A) and (B) are those whose increased usage contributes the most to a population’s Cin and Cout differing from the average (see Sec. II D).

[Figure 2: scatter plot of Cout against Cin with one labeled point per state and DC. Legend: ρ̂p: −0.13, p-val: 0.37, m: −1.69.]

FIG. 2: Plots for the contiguous US showing the lack of correlation between caloric input Cin and caloric output Cout, demonstrating their separate value as they bear different kinds of information. The Pearson correlation coefficient ρ̂p is -0.13 and the best line of fit slope is m = -1.69. Fig. S1 adds plots of Cbal as a function of Cin and Cout.

To provide some support for point 1, we compare Cout and Cin in Fig. 2 (see also Fig. S1). Importantly, we see that the two measures are indeed not well correlated, indicating they contain different kinds of information (Pearson correlation coefficient ρ̂p ≃ −0.13, p-value = 0.37). This demonstrates why we might expect Cin or Cout to separately correlate more strongly with other population-level measures, and justifies the incorporation of both Cin and Cout into some kind of composite measure.
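The two summary statistics reported in Fig. 2, a Pearson coefficient and a least-squares slope, can be computed as in the sketch below. The per-state values are not listed in this section, so the data here are hypothetical, and which variable the authors regressed on which is not stated; the code regresses the second argument on the first.

```python
import math


def pearson_and_slope(x, y):
    """Return (Pearson r, least-squares slope of y regressed on x)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy), sxy / sxx


# Hypothetical regional scores, for illustration only.
c_in = [250.0, 260.0, 270.0, 280.0]
c_out = [200.0, 170.0, 190.0, 180.0]

r, m = pearson_and_slope(c_in, c_out)
print(f"r = {r:.2f}, slope = {m:.2f}")
```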

Regarding point 2 above, we have evidently made a number of choices in computing Cin and Cout that mean we have already introduced an arbitrary tuning of the ratio Cbal (e.g., assuming 100 grams of a food and an hour’s worth of activity). Having no principled way of rescaling (i.e., one that is not a function of the data set being studied), we have chosen to leave the measures as computed, but we envisage that introducing tunability of the dynamic ranges of Cin and Cout (altering the bias of the measure toward food or activity) will allow the Lexicocalorimeter to be refined for a range of purposes such as estimating correlates of diabetes levels versus cancer rates (see Sec. II E).

For the food and activities maps, we also overlay the specific phrase whose increased prevalence most contributes to moving each population’s Twitter calorie scores away from the overall average for the contiguous US. We will return to how we determine these phrases later with phrase shifts (Sec. II D), but for now we see a diverse spread of terms.

We find a number of phrases make for apparently reasonable representations:

• “lobster” in Maine and Massachusetts;

• “grits” in Georgia;

• “skiing” in Vermont, New Hampshire, and Utah;

• and “running” in Colorado and a number of other locations.

Prototypical unhealthy foods rise to the top in various states:

• “donuts” in Texas;

• “cake” in Mississippi;

• “chocolate candy” in Louisiana;

• and “cookies” in Indiana.

By contrast, a few “virtuous” foodstuffs appear such as “green beans” in Oregon and “tomato” in California.

Our activity list also includes some rather low intensity ones and we see:

• “eating” rising to the top in Texas, the south, and a number of other states;

• “watching tv or movie” in Pennsylvania and elsewhere;

• “sitting” in Tennessee;

• “talking on the phone” in Delaware;

• “getting my nails done” in New Jersey;

• and simply “lying down” in Michigan.

Now, we do not pretend that these phrases all come from individuals diligently recording their present meals or activities. And some phrases are problematic in their generality of meaning, most especially “running” (the word “run” currently has the most meanings in the Oxford English Dictionary). Nevertheless, as we dig deeper into all the phrases found for a particular state, we will continue to find commonsensical lexical patterns.

C. Rankings for the contiguous US

Having taken in the maps of our three measures Cin, Cout, and Cbal, we now explore the rankings quantitatively, first through the histograms shown in Fig. 3. We order the 48 states and DC by Cbal (rightmost plot) and all bars are relative to the overall average of the specific measure. Numeric rankings for each measure are given next to each bar. In Figs. S2 and S3, we present the same histograms re-sorted respectively by Cin and Cout.

[Figure 3: three horizontal bar-chart panels sharing the axis label “Deviations from national averages”: Balance (axis from −0.12 to 0.13), Food (axis from −15 to 17), and Activity (axis from −15 to 30). The rank labels printed beside the bars are listed below in the plotted order, from rank 49 to rank 1 in Cbal.]

Balance: Mississippi, 49; Louisiana, 48; Alabama, 47; Arkansas, 46; West Virginia, 45; Delaware, 44; Michigan, 43; Kentucky, 42; Ohio, 41; Georgia, 40; Maryland, 39; North Carolina, 38; South Carolina, 37; Texas, 36; New Jersey, 35; Indiana, 34; Tennessee, 33; Pennsylvania, 32; Connecticut, 31; Virginia, 30; Kansas, 29; Illinois, 28; Oklahoma, 27; North Dakota, 26; South Dakota, 25; Missouri, 24; New Mexico, 23; Rhode Island, 22; Iowa, 21; Massachusetts, 20; District of Columbia, 19; Florida, 18; Idaho, 17; Nevada, 16; Nebraska, 15; Arizona, 14; Wisconsin, 13; Washington, 12; California, 11; New York, 10; Montana, 9; New Hampshire, 8; Oregon, 7; Minnesota, 6; Maine, 5; Utah, 4; Vermont, 3; Wyoming, 2; Colorado, 1.

Food: MS, 16; LA, 14; AL, 11; AR, 13; WV, 2; DE, 33; MI, 15; KY, 6; OH, 4; GA, 35; MD, 43; NC, 20; SC, 23; TX, 12; NJ, 19; IN, 8; TN, 29; PA, 22; CT, 25; VA, 32; KS, 10; IL, 34; OK, 9; ND, 3; SD, 1; MO, 40; NM, 18; RI, 31; IA, 7; MA, 38; DC, 46; FL, 41; ID, 27; NV, 42; NE, 17; AZ, 39; WI, 45; WA, 30; CA, 28; NY, 37; MT, 5; NH, 47; OR, 36; MN, 44; ME, 49; UT, 24; VT, 21; WY, 26; CO, 48.

Activity: MS, 49; LA, 48; AL, 44; AR, 46; WV, 41; DE, 47; MI, 42; KY, 37; OH, 32; GA, 43; MD, 45; NC, 39; SC, 40; TX, 35; NJ, 36; IN, 27; TN, 38; PA, 34; CT, 33; VA, 30; KS, 22; IL, 31; OK, 20; ND, 16; SD, 7; MO, 28; NM, 17; RI, 23; IA, 9; MA, 24; DC, 29; FL, 26; ID, 19; NV, 25; NE, 12; AZ, 18; WI, 21; WA, 13; CA, 10; NY, 11; MT, 5; NH, 15; OR, 6; MN, 8; ME, 14; UT, 4; VT, 3; WY, 1; CO, 2.

FIG. 3: Histograms of caloric intake Cin (food), caloric output Cout (activity), and caloric balance Cbal for the states of the contiguous US, all ranked by decreasing Cbal. Bars indicate the difference in the three quantities from the overall average with colors corresponding to those used in Fig. 1. We provide the same set of histograms re-sorted by Cin and Cout in Figs. S2 and S3.

As was the case for the maps, we again see that Cbal is more strongly driven by Cout than Cin due to the former’s larger dynamic range. The states with the highest values of Cbal achieve their scores through high levels of Cout but more variable levels of Cin. Wyoming (26), Vermont (21), and Utah (24) are all middling in Cin while Colorado (48) and Maine (49) have the lowest ranks for caloric intake. At the trailing end, we see by contrast that low activity ranks are coupled with high ranks for caloric intake.

A few of the more anomalous states are evident both in the Cin and Cout histograms and as the points lying furthest from the best line of fit in the scatter plot of Fig. 2. South Dakota has high values of both Cin and Cout (ranks of 1 and 7) that combine to give it a ranking of 25 for Cbal. Maryland, ranking 43rd and 45th in Cin and Cout, is the only state in the ‘bottom’ 10 of both measures.
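The rank statements above can be checked mechanically against the labels in Fig. 3. The sketch below transcribes those labels by hand (so any transcription slip is ours, not the authors') and looks for regions that fall in the bottom 10 of both Cin and Cout; the helper name `bottom_n` is illustrative.

```python
# (C_in rank, C_out rank, C_bal rank) per region, transcribed from Fig. 3;
# rank 1 is the highest value of the measure.
RANKS = {
    "MS": (16, 49, 49), "LA": (14, 48, 48), "AL": (11, 44, 47), "AR": (13, 46, 46),
    "WV": (2, 41, 45), "DE": (33, 47, 44), "MI": (15, 42, 43), "KY": (6, 37, 42),
    "OH": (4, 32, 41), "GA": (35, 43, 40), "MD": (43, 45, 39), "NC": (20, 39, 38),
    "SC": (23, 40, 37), "TX": (12, 35, 36), "NJ": (19, 36, 35), "IN": (8, 27, 34),
    "TN": (29, 38, 33), "PA": (22, 34, 32), "CT": (25, 33, 31), "VA": (32, 30, 30),
    "KS": (10, 22, 29), "IL": (34, 31, 28), "OK": (9, 20, 27), "ND": (3, 16, 26),
    "SD": (1, 7, 25), "MO": (40, 28, 24), "NM": (18, 17, 23), "RI": (31, 23, 22),
    "IA": (7, 9, 21), "MA": (38, 24, 20), "DC": (46, 29, 19), "FL": (41, 26, 18),
    "ID": (27, 19, 17), "NV": (42, 25, 16), "NE": (17, 12, 15), "AZ": (39, 18, 14),
    "WI": (45, 21, 13), "WA": (30, 13, 12), "CA": (28, 10, 11), "NY": (37, 11, 10),
    "MT": (5, 5, 9), "NH": (47, 15, 8), "OR": (36, 6, 7), "MN": (44, 8, 6),
    "ME": (49, 14, 5), "UT": (24, 4, 4), "VT": (21, 3, 3), "WY": (26, 1, 2),
    "CO": (48, 2, 1),
}


def bottom_n(ranks, index, n=10):
    """Regions whose rank for the measure at `index` is among the n lowest values."""
    cutoff = len(ranks) - n  # ranks strictly above this number are in the bottom n
    return {region for region, r in ranks.items() if r[index] > cutoff}


low_in = bottom_n(RANKS, 0)
low_out = bottom_n(RANKS, 1)
print(sorted(low_in & low_out))
```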

D. Phrase shifts

In any study involving word- or phrase-based measures of texts, it is plainly essential that we demonstrate that the measures involved behave reasonably by exploring the effects of changes in word frequencies (and yet many papers fail to do so, e.g., [23]). Such a requirement limits the scientific value of potentially powerful “black box” type instruments, but as our measure is linear in nature, the lexicocalometric contributions of individual terms can be laid bare.

In our work on measuring happiness, we have developed and extensively used “word shifts” to show which words make a given text appear more positive than another text in aggregate (see [2] and [16]). Such visualizations not only provide our necessary test, but also allow us to draw insight from the lexical tapestry of texts. Here, we will explain and use analogously constructed phrase shifts for both Cin and Cout to examine the states at the extremes of our Cbal rankings, Colorado and Mississippi. Interactive food and activity phrase shifts for the 49 regions of the contiguous US form a central part of our online Lexicocalorimeter: http://panometer.org/instruments/lexicocalorimeter.

[Figure 4: four phrase-shift panels. A: Colorado, food. B: Mississippi, food. C: Colorado, activity. D: Mississippi, activity.]

FIG. 4: Phrase shifts showing which food phrases and physical activity phrases have the most influence on Colorado and Mississippi’s top and bottom ranking for caloric balance, when compared with the average for the contiguous United States. Note that phrases are lemmas representing phrase categories. Overall, Colorado scores lower on Twitter food calories (257.4 versus 271.7) and higher on physical activity calories (203.5 versus 161.3) than Mississippi. We provide interactive phrase shifts as part of the paper’s online appendix at http://compstorylab.org/share/papers/alajajian2015a/ and at http://panometer.org/instruments/lexicocalorimeter. We explain phrase (word) shifts in the main text (see Eqs. 4 and 5), and in full depth in [2] and [16] and online at http://hedonometer.org [22].

We start with two texts: a base “reference text” $T_{\text{ref}}$, and a “comparison text” $T_{\text{comp}}$ which we wish to compare to $T_{\text{ref}}$. In this paper, we will use the contiguous US as the reference text (weighting the phrase distributions of each state equally), but in principle any text can be used (e.g., in comparing two states, one would be selected as a reference). Our interest is in determining which words or phrases most contribute to or go against the difference in estimated calories, $C_{i/o}(T_{\text{comp}}) - C_{i/o}(T_{\text{ref}})$, where $i/o$ stands for in or out. Following [2] and using Eq. (2), we can express the difference as

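$$
C_{i/o}(T_{\text{comp}}) - C_{i/o}(T_{\text{ref}})
= \sum_{s \in S_{i/o}} C_{i/o}(s) \left[ p(s|T_{\text{comp}}) - p(s|T_{\text{ref}}) \right]
= \sum_{s \in S_{i/o}} \left[ C_{i/o}(s) - C^{(\text{ref})}_{i/o} \right] \left[ p(s|T_{\text{comp}}) - p(s|T_{\text{ref}}) \right].
\tag{4}
$$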

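The second equality holds provided both phrase distributions are normalized over $S_{i/o}$: the differences $p(s|T_{\text{comp}}) - p(s|T_{\text{ref}})$ then sum to zero, so subtracting the constant $C^{(\text{ref})}_{i/o}$ (the average for the reference text) leaves the total unchanged. We now have a sum of contributions due to all phrases. We normalize these contributions as percentages and annotate their structure as follows: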

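$$
\delta C_{i/o}(s) = \frac{100}{\left| C^{(\text{comp})}_{i/o} - C^{(\text{ref})}_{i/o} \right|}
\underbrace{\left[ C_{i/o}(s) - C^{(\text{ref})}_{i/o} \right]}_{+/-}
\underbrace{\left[ p^{(\text{comp})}_{s} - p^{(\text{ref})}_{s} \right]}_{\uparrow/\downarrow},
\tag{5}
$$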

where $\sum_{s \in S_{i/o}} \delta C_{i/o}(s) = \pm 100$. We use the symbols $+/-$ and $\uparrow/\downarrow$ to respectively encode whether the calories of a phrase exceed the average of the reference text, and whether a phrase is being used more or less in the comparison text. We call $\delta C_{i/o}(s)$ the “per food/activity phrase caloric expenditure shift”. Finally, we sort phrases by the absolute value of $\delta C_{i/o}(s)$ to create each phrase shift.
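The computation in Eqs. (4) and (5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the online instrument: the function name `phrase_shifts`, the dictionary layout, and the calorie and frequency values are all hypothetical, and the sketch assumes both phrase distributions are normalized over the same phrase set.

```python
def phrase_shifts(calories, p_ref, p_comp):
    """Per-phrase percentage shifts (Eq. 5), sorted by absolute size.

    calories: dict phrase -> calorie value C(s)   (hypothetical values)
    p_ref, p_comp: dict phrase -> normalized frequency in each text
    """
    # Average calories of each text: C(T) = sum_s C(s) p(s|T)
    c_ref = sum(c * p_ref.get(s, 0.0) for s, c in calories.items())
    c_comp = sum(c * p_comp.get(s, 0.0) for s, c in calories.items())
    total = c_comp - c_ref
    if total == 0:
        raise ValueError("texts have identical average calories")

    shifts = []
    for s, c in calories.items():
        cal_diff = c - c_ref                               # sign gives +/-
        freq_diff = p_comp.get(s, 0.0) - p_ref.get(s, 0.0)  # sign gives up/down
        delta = 100.0 / abs(total) * cal_diff * freq_diff
        label = ("+" if cal_diff > 0 else "-") + ("↑" if freq_diff > 0 else "↓")
        shifts.append((s, delta, label))

    # Largest contributions (in absolute value) come first.
    shifts.sort(key=lambda item: abs(item[1]), reverse=True)
    return shifts


# Made-up example: four phrases, uniform reference text.
calories = {"cookies": 480.0, "pasta": 130.0, "bacon": 540.0, "banana": 90.0}
p_ref = {"cookies": 0.25, "pasta": 0.25, "bacon": 0.25, "banana": 0.25}
p_comp = {"cookies": 0.35, "pasta": 0.15, "bacon": 0.30, "banana": 0.20}

for phrase, delta, label in phrase_shifts(calories, p_ref, p_comp):
    print(f"{phrase:8s} {label} {delta:6.2f}")
```

Because the frequency differences sum to zero, the individual shifts add up to +100 or -100, depending on whether the comparison text has more or fewer average calories than the reference.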

In Fig. 4, we present phrase shifts which help to illustrate why:

• Colorado ranks 48/49 for caloric input Cin (Fig. 4A),

• Mississippi ranks 16/49 for caloric input Cin (Fig. 4B),

• Colorado ranks 2/49 for caloric output Cout (Fig. 4C),

• and Mississippi ranks 49/49 for caloric output Cout (Fig. 4D).

These shifts display phrases that fall into four categories:

+↑, yellow: Phrases representing above average quantities (here calories) being used more often. Examples: “cookies” for Mississippi in Fig. 4B and “rock climbing” for Colorado in Fig. 4C.

-↓, pale blue: Phrases representing below average quantities being used less often. Examples: “watching tv or movie” for Mississippi in Fig. 4B and “laying down” for Colorado in Fig. 4C.

+↓, pale yellow: Phrases representing above average quantities being used less often. Examples: “chocolate candy” for Colorado in Fig. 4A and “running” for Mississippi in Fig. 4D.

-↑, blue: Phrases representing below average quantities being used more often. Examples: “reading” for Colorado in Fig. 4A and “catfish” for Mississippi in Fig. 4B.

Note that depending on the quantity, higher or lower may be “better” and the four categories flip signs in their support. For example, Cin and Cout increase with +↑ phrases; after we examine correlations with demographics in Sec. II E, we will be able to interpret this as “bad” for Cin and “good” for Cout.

At the top of each phrase shift, the bars indicate the total contribution of each of the four types of phrases, and the black bar the net change. We see that the four net changes arise in different ways.

• Fig. 4A: Colorado is lower than average for Cin largely due to tweeting more about relatively low calorie (per 100 grams) foods: “noodles”, “egg”, “pasta”, and “turkey”. We also find fewer tweets about high calorie foods such as “candy”, “cake”, and “cookies.” Going against these phrases, we see Colorado does tweet relatively more about “bacon” and “olive oil”, and less about some relatively lower calorie foods: “chicken”, “ice cream”, “shrimp”, and “corn”. We note that this does not mean these foods are low calorie in absolute terms (“ice cream” is a good example), just that 100 grams of them are low calorie in comparison to the US baseline.

• Fig. 4B: Mississippi almost equally tweets less about a variety of low calorie foods, e.g., “pasta”, “banana”, and “crab” (pale blue bar) while also tweeting more about the complementary range of such foods including “shrimp”, “peaches”, and “pineapple” (dark blue bar). The modest net gain is mostly due to a small increase in tweeting about high calorie foods such as “cake”, “cookies”, and “sausage”.


[Figure 5: six scatter plots of demographic quantities (vertical axes) against caloric balance (horizontal axis), with points labeled by state abbreviation. Inset values by panel: Physical Inactivity, ρ̂s = −0.78, q = 2.5 × 10^−9; High Blood Pressure, ρ̂s = −0.77, q = 3.5 × 10^−9; Diabetes, ρ̂s = −0.74, q = 1.4 × 10^−8; Overweight, ρ̂s = −0.74, q = 1.6 × 10^−8; Heart Disease Death, ρ̂s = −0.73, q = 2.8 × 10^−8; Life Expectancy, ρ̂s = 0.68, q = 5.7 × 10^−7.]

FIG. 5: Six demographic quantities compared with caloric balance Cbal for the contiguous US. The inset values are the Spearman correlation coefficient ρ̂s, and the Benjamini-Hochberg q-value. See Tab. I for a full summary of the 37 demographic quantities studied here.

• Fig. 4C: For physical activity, tweets from Colorado show a preponderance of relatively high caloric expenditure phrases (+↑, yellow) including “running”, “skiing”, “hiking”, “snowboarding” and so on. Tweeting less about low effort activities is the only other contribution of any substance: Colorado tweets less about “eating”, “laying down”, and “watching tv or movie”.

• Fig. 4D: Mississippi’s low ranking in activity is largely due to tweeting less about high output activities (+↓, pale yellow): less “running”, “dancing”, “walking”, and “biking”. The second most important category is an increase in low output activity phrases such as “eating”, “attending church”, and “talking on the phone.”

In Figs. S4, S5, S6, and S7, we complement the four phrase shifts of Fig. 4 by showing the top 23 phrases for each of four ways phrases may contribute. Interactive phrase shifts for all of the contiguous US are housed at http://panometer.org/instruments/lexicocalorimeter.

Overall, we find the lexical texture afforded by our phrase shifts is generally convincing, but we expect future improvements in our food and activity data sets will iron out some oddities (we again use the example of ice cream). We also note that phrase shifts are very sensitive, that terms which appear to be evaluated incorrectly may easily be removed from the phrase set, and that doing so will minimally change the overall score for sufficiently large texts.

E. Correlations with other health and well-being measures

We now turn to a suite of statistical comparisons between our three measures (caloric input, caloric output, and caloric balance) and a collection of demographic quantities.

We use Spearman’s correlation coefficient ρ̂s to examine relationships between Cin, Cout, and Cbal and 37 variables variously relating to food and physical activity, “Big Five” personality traits, and health and well-being rankings (a total of 111 comparisons). To correct for multiple comparisons, we calculate the q-value for each correlation coefficient using the Benjamini-Hochberg step-up procedure [34] (the q-value is to be interpreted in the same way as a p-value). We then consider correlations in reference to the standard significance levels of 0.01 and 0.05.

| Demographic quantity | ρ̂s for Cbal | q-val | ρ̂s for Cin | q-val | ρ̂s for Cout | q-val |
|---|---|---|---|---|---|---|
| 1. % high blood pressure [24] | -0.78 | 2.64 × 10^-09 | 0.29 | 5.76 × 10^-02 | -0.78 | 2.64 × 10^-09 |
| 2. % no physical activity in past 30 days [24] | -0.78 | 2.64 × 10^-09 | 0.55 | 1.63 × 10^-04 | -0.66 | 1.51 × 10^-06 |
| 3. % have been physically active in past 30 days [24] | 0.78 | 2.64 × 10^-09 | -0.55 | 1.85 × 10^-04 | 0.67 | 1.29 × 10^-06 |
| 4. Adult diabetes rate [25] | -0.76 | 4.80 × 10^-09 | 0.29 | 6.17 × 10^-02 | -0.77 | 2.73 × 10^-09 |
| 5. CNBC quality of life ranking [26] (lower is better) | -0.76 | 6.09 × 10^-09 | 0.27 | 8.64 × 10^-02 | -0.77 | 3.60 × 10^-09 |
| 6. Heart disease death rate [27] | -0.74 | 1.67 × 10^-08 | 0.32 | 3.81 × 10^-02 | -0.73 | 2.07 × 10^-08 |
| 7. % adult overweight/obesity [27] | -0.72 | 3.99 × 10^-08 | 0.53 | 2.34 × 10^-04 | -0.59 | 3.07 × 10^-05 |
| 8. % adult obesity [25] | -0.72 | 4.78 × 10^-08 | 0.53 | 2.34 × 10^-04 | -0.59 | 2.94 × 10^-05 |
| 9. Gallup Wellbeing score [4] (higher is better) | 0.72 | 5.23 × 10^-08 | -0.3 | 5.41 × 10^-02 | 0.73 | 3.99 × 10^-08 |
| 10. America’s Health Rankings, overall [24] (lower is better) | -0.72 | 3.40 × 10^-07 | 0.41 | 7.07 × 10^-03 | -0.67 | 2.77 × 10^-06 |
| 11. Life expectancy at birth [27] (higher is better) | 0.69 | 3.52 × 10^-07 | -0.38 | 1.11 × 10^-02 | 0.65 | 2.64 × 10^-06 |
| 12. % who eat fruit less than once a day [28] | -0.66 | 1.29 × 10^-06 | 0.59 | 2.94 × 10^-05 | -0.51 | 5.74 × 10^-04 |
| 13. % child overweight/obesity [27] | -0.64 | 2.96 × 10^-06 | 0.25 | 1.02 × 10^-01 | -0.64 | 3.06 × 10^-06 |
| 14. % who eat vegetables less than once a day [28] | -0.61 | 1.69 × 10^-05 | 0.49 | 8.06 × 10^-04 | -0.46 | 1.57 × 10^-03 |
| 15. Median daily intake of fruits [28] | 0.6 | 2.36 × 10^-05 | -0.6 | 1.94 × 10^-05 | 0.41 | 5.45 × 10^-03 |
| 16. Smoking rate [27] | -0.59 | 2.73 × 10^-05 | 0.49 | 8.06 × 10^-04 | -0.48 | 1.08 × 10^-03 |
| 17. Median household income [27] | 0.5 | 6.46 × 10^-04 | -0.51 | 5.74 × 10^-04 | 0.4 | 8.51 × 10^-03 |
| 18. Median daily intake of vegetables [28] | 0.49 | 8.06 × 10^-04 | -0.55 | 1.33 × 10^-04 | 0.31 | 4.41 × 10^-02 |
| 19. % high cholesterol [24] | -0.49 | 9.01 × 10^-04 | 0.22 | 1.61 × 10^-01 | -0.48 | 9.21 × 10^-04 |
| 20. Brain health ranking [29] (lower is better) | -0.49 | 9.21 × 10^-04 | 0.6 | 2.77 × 10^-05 | -0.29 | 5.70 × 10^-02 |
| 21. % with bachelor’s degree or higher [6] | 0.46 | 1.84 × 10^-03 | -0.54 | 2.05 × 10^-04 | 0.33 | 2.85 × 10^-02 |
| 22. Colorectal cancer rate [25] | -0.44 | 4.24 × 10^-03 | 0.51 | 6.46 × 10^-04 | -0.27 | 8.44 × 10^-02 |
| 23. US Census Gini index score [30] (lower is better) | -0.42 | 5.05 × 10^-03 | -0.04 | 8.09 × 10^-01 | -0.5 | 6.08 × 10^-04 |
| 24. Avg # poor mental health days, past 30 days [24] | -0.42 | 5.05 × 10^-03 | 0.11 | 4.97 × 10^-01 | -0.48 | 1.06 × 10^-03 |
| 25. Neuroticism Big Five personality trait [31] | -0.38 | 1.24 × 10^-02 | 0.19 | 2.44 × 10^-01 | -0.37 | 1.42 × 10^-02 |
| 26. Binge drinking rate [24] | 0.38 | 1.36 × 10^-02 | -0.15 | 3.41 × 10^-01 | 0.41 | 5.94 × 10^-03 |
| 27. Avg # poor physical health days, past 30 days [24] | -0.35 | 2.25 × 10^-02 | 0.18 | 2.47 × 10^-01 | -0.38 | 1.15 × 10^-02 |
| 28. Farmers markets per 100,000 in pop. [28] | 0.34 | 2.53 × 10^-02 | 0.05 | 8.06 × 10^-01 | 0.42 | 5.05 × 10^-03 |
| 29. Strolling of the Heifers locavore score (lower is better) [32] | -0.3 | 5.69 × 10^-02 | -0.3 | 5.69 × 10^-02 | -0.45 | 2.94 × 10^-03 |
| 30. Extraversion Big Five personality trait [31] | -0.28 | 6.76 × 10^-02 | 0.04 | 8.33 × 10^-01 | -0.29 | 5.69 × 10^-02 |
| 31. % schools offering fruit/veg at celebrations [28] | 0.24 | 1.25 × 10^-01 | -0.47 | 1.47 × 10^-03 | 0.05 | 7.98 × 10^-01 |
| 32. Openness Big Five personality trait [31] | 0.23 | 1.44 × 10^-01 | -0.5 | 6.46 × 10^-04 | 0.04 | 8.09 × 10^-01 |
| 33. % cropland harvested for fruits/veg [28] | 0.19 | 2.44 × 10^-01 | -0.6 | 2.48 × 10^-05 | -0.04 | 8.09 × 10^-01 |
| 34. Conscientiousness Big Five personality trait [31] | -0.12 | 4.83 × 10^-01 | 0.18 | 2.54 × 10^-01 | -0.05 | 8.01 × 10^-01 |
| 35. % census tracts, healthy food retailer within 1/2 mile [28] | -0.03 | 8.41 × 10^-01 | -0.51 | 5.51 × 10^-04 | -0.24 | 1.30 × 10^-01 |
| 36. George Mason overall freedom ranking [33] (lower is freer) | -0.03 | 8.63 × 10^-01 | -0.12 | 4.70 × 10^-01 | -0.1 | 5.64 × 10^-01 |
| 37. Agreeableness Big Five personality trait [31] | -0.01 | 9.66 × 10^-01 | 0.23 | 1.40 × 10^-01 | 0.08 | 6.47 × 10^-01 |

TABLE I: Spearman correlation coefficients, ρ̂s, and Benjamini-Hochberg q-values for correlations of caloric input Cin, caloric output Cout, and caloric balance Cbal = Cout/Cin with demographic data related to food and physical activity, Big Five personality traits [31], health and well-being rankings by state, and socioeconomic status, ordered from strongest to weakest Spearman correlation with caloric balance. Significance levels of 0.01 and 0.05 for the Benjamini-Hochberg q of Cbal correspond to the first 24 demographic quantities and then the next four, numbers 25 to 28. The bottom 9 quantities were not significantly correlated with Cbal according to our tests.
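As a rough illustration of the statistics behind Tab. I, the following pure-Python sketch computes a Spearman rank correlation and applies the Benjamini-Hochberg step-up adjustment to a list of p-values. It is a minimal sketch under stated assumptions, not the analysis code used for the paper: the function names are hypothetical, the example numbers are made up, and the p-value for each correlation is assumed to come from a separate significance test that is not shown here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q[idx] = running_min
    return q


# Made-up example values, for illustration only.
print(spearman_rho([0.55, 0.61, 0.70, 0.79], [11.2, 10.1, 8.4, 6.9]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A q-value below 0.01 or 0.05 would then be read against the significance levels used in the text.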

We must first acknowledge that many of the variables we test against our measures are highly correlated with each other. The food and physical activity-related variables are in the areas of physical activity levels, produce intake and availability rates (including trends in public schools), chronic disease rates, and rates of unhealthy habits. Many of these variables are well known to be influenced by diet and physical activity (e.g., obesity rates [25]), and others may be less directly related (e.g., percent of cropland in each state harvested for fruits and vegetables [28]).

To give some grounding for the full set of comparisons, we show in Fig. 5 how six demographic quantities vary with caloric balance Cbal. We see strong correlations with |ρ̂s| ≥ 0.68, and the highest Benjamini-Hochberg q-value is 5.7 × 10^−7.

We present a summary of all results in Tab. I where we have ordered and numbered demographic quantities in terms of ascending Benjamini-Hochberg q-values for Cbal.

Surveying the health-based demographics, we found Cbal was significantly correlated with all chronic disease-related rates we tested against: high blood pressure (#1), adult diabetes (#4), heart disease deaths (#6), adult overweight and obesity (#7), adult obesity (#8), childhood overweight and obesity (#13), high cholesterol (#19), and colorectal cancer (#22). All of these but colorectal cancer rate were also significantly correlated with Cout.

Caloric input Cin results were more mixed. Rates of high blood pressure, adult diabetes, childhood overweight and obesity, and high cholesterol were not significantly correlated with Cin after correcting for multiple comparisons.

The variables relating to unhealthy habits (smoking (#16) and binge drinking rates (#26)) both correlated significantly with all three of our measures with the one exception of binge drinking and caloric input. The directions of the correlations for these two habits are opposite to each other (e.g., negative for smoking and Cbal, positive for binge drinking and Cbal), consistent with recent work on alcohol consumption [35].

The two variables relating to physical activity rates (percent of population that has had no physical activity in past 30 days (#2), and percent of population that has been physically active in past 30 days (#3)) correlated significantly with all three of our measures. The two measures relating to rates of physical and mental health (average number of poor mental health days in past 30 days (#24), and average number of poor physical health days in past 30 days (#27)) correlated significantly with both Cout and Cbal, but did not correlate significantly with Cin.

The four variables relating to fruit and vegetable consumption rates all correlated significantly with all three of our measures, except for median daily intake of vegetables (#18) with caloric output Cout. The variables relating to presence of produce in the state (percent of cropland in each state harvested for fruits and vegetables (#33), percent of census tracts with a healthy food retailer within one-half mile (#35), and percent of schools offering fruits and vegetables at celebrations (#31)) were significantly correlated with Cin but were not correlated with Cout or Cbal. Variables relating to local food (number of farmers markets per 100,000 people (#28) and Strolling of the Heifers locavore score (#29)) were not significantly correlated with Cin, but were significantly correlated with Cout.

Our health and well-being ranking variables included the CNBC quality of life ranking (#5), Gallup Wellbeing ranking (#9), America’s Health Ranking overall state rank (#10), life expectancy ranking (#11), Brain Health ranking (#20), Gini index score (#23), and George Mason’s overall freedom ranking (#36). Caloric balance correlated with all of these variables except for George Mason’s freedom ranking (which did not correlate with any of our three measures). Cout correlated significantly with all of these measures except for the Brain Health ranking and the freedom ranking. Caloric input Cin did not correlate significantly with the CNBC quality of life ranking, Gini index score, or freedom ranking.

Regarding correlations with the Big Five personality traits, Pesta et al. noted that “Neuroticism...emerged as the only consistent Big Five predictor of epidemiologic outcomes (e.g., rates of heart disease or high blood pressure) and health-related behaviors (e.g., rates of smoking or exercise)” [36]. Additionally, “neuroticism correlates with many health-related variables, including depression and anxiety disorders, mortality, coping skill, death from cardiovascular disease, and whether one smokes tobacco” [36]. Here, in keeping with these observations, we found that neuroticism (#25) was indeed the only Big Five personality trait that correlated significantly and negatively with caloric balance.

We also tested our three measures against two mea-sures of socioeconomic status—median income (#17) andpercent of state with a bachelor’s degree or higher levelof education (#21)—and found these correlations weresignificant for all three of our measures.

III. CONCLUDING REMARKS

When applied to Twitter, our Lexicocalorimeter has thus revealed a range of strong, commonsensical patterns and correlations for the contiguous US. We invite the reader to explore our online instrument, a screenshot of which is shown in Fig. 6.

Given the complex relationships between health, well-being, happiness, and various measures of socioeconomic status, it is rather difficult to say that we are only measuring health or only measuring well-being. We are also measuring socioeconomic status to some extent. However, the correlations between caloric balance and measures of socioeconomic status are not as strong as the correlation of caloric balance with many of the other measures. Given the above, we believe that the caloric content of tweets can be used successfully, along with other well-being and quality of life measures, to help gauge overall well-being in a population.

[Figure 6: screenshot of the interactive dashboard, titled “How do I look in these tweets? Gauging well-being through "caloric content" of tweets” (Sharon E. Alajajian, Jake R. Williams, Andrew J. Reagan, Stephen C. Alajajian, Morgan R. Frank, Lewis Mitchell, Jacob Lahne, Christopher M. Danforth, and Peter Sheridan Dodds). Panels: a “Caloric Balance” view labeled by state abbreviation; a food phrase shift for Vermont (“Why Vermont consumes more calories on average”: average US calories = 267.92, Vermont calories = 268.66, rank 29 out of 49; axis: per food phrase caloric shift); an activity phrase shift for Vermont (“Why Vermont expends more calories on average”: average US caloric expenditure = 176.60, Vermont caloric expenditure = 203.22, rank 3 out of 49; axis: per activity phrase caloric expenditure shift); and a bar chart of caloric balance against state rank, from 1. Colorado to 49. Mississippi. Visualization by @andyreagan.]

FIG. 6: Screenshot of the interactive dashboard for our prototype Lexicocalorimeter site (taken 2015/07/03). An archived development version can be found as part of our paper’s Online Appendices at http://compstorylab.org/share/papers/alajajian2015a/maps.html, and a full dynamic implementation will be part of our Panometer project at http://panometer.org/instruments/lexicocalorimeter.

There are many potential forward directions. A promising avenue is to incorporate tunability into the Lexicocalorimeter by manipulating the dynamic range of Cin and Cbal. Though a universal approach is unclear (one constraint is that such an approach must be independent of the particular data set being studied), we may profit from this versatility when focusing on a single demographic. For example, if we are interested in diabetes rates, we could tune the instrument to obtain the best correlation with known levels, and thereby create a real-time estimator. To do so, we would introduce a tunable parameter α into the definition of caloric balance, as in Eq. (6):

$$
C^{(\alpha)}_{\text{bal}} = \frac{\alpha C_{\text{out}}}{C_{\text{in}}},
\tag{6}
$$

and find the value of α that gives the highest correlation between $C^{(\alpha)}_{\text{bal}}$ and diabetes rates for a given set of populations. Of course, we could use a “black box” method to generate a better fit, but in basing our instrument on food and activity words, we have a far more principled approach that grants us the opportunity not just to mimic but to understand and explain patterns that we find. In particular, our word shifts will be of great use in showing why our hypothetical estimate of diabetes is varying across populations.

We fully recognize that the Twitter population is not the same as the general population; Twitter users differ from the general population in terms of race, age, and urbanity [7]. However, we currently have no reliable way to know, for example, the true age, race, gender, and education level of individual users and as such, are not able to adjust for these factors. While we were able to vet our food and physical activity lists to some extent (as described in Methods and Materials), we could not realistically go through every tweet to be certain that the phrase was being used in the way that we thought. We realize that even if the phrases are being used as we imagine, it does not necessarily mean that the person who tweeted actually performed the physical activity or ate the tweeted-about food (West et al. address a similar issue in inferring food consumption from accessing recipes online [18]).

We also currently do not know at what point our metric breaks down at smaller time scales (e.g., months or weeks) or at the level of smaller spatial regions (e.g., city or county). Our preliminary research shows that the physical activity metric on its own may be quite effective at the city level, but the food measure may not be accurate on a smaller scale. We have also found the physical activity list to be robust to random partitioning [37], whereas the food list was not. We believe that these preliminary findings may be due to several factors: (a) the size of the food list (just over 1400 phrases) is much smaller than the physical activity phrase list (just over 13,400 phrases); (b) there are generally more tweets about physical activities in our list than the foods in our food list; and (c) the amount of data within a city may not be a large enough sample for any food-based Twitter metric. We note that we have not tried using the metric on counties or Census block or tract groups, and it may be that these are more conducive to the metric.

We propose to use crowdsourcing as a way to build a more comprehensive food phrase list that includes commonly eaten foods with brand names as well as food slang that we did not capture here. Ideally, we would arrive at a food phrase database similar in scale to that of our existing physical activity phrase list. However we move forward, we believe it is clear that the Lexicocalorimeter we have designed and implemented is already of some potency and may be improved substantively in the future.

IV. METHODS AND MATERIALS

To estimate the “caloric content” of text-extracted phrases [37] relating to food (caloric input) and physical activity (caloric output), we needed comprehensive lists of foods and physical activities and their respective caloric content and expenditure information. Here, we explain in detail how we constructed these phrase lists and assigned calories to each phrase.

We provide all data in the Supporting Information and with the paper’s Online Appendices: http://compstorylab.org/share/papers/alajajian2015a/.

A. Calorie estimates for phrases

We used the USDA National Nutrient Database [38] to approximate the caloric content of foods, and the Compendium of Physical Activities from Arizona State University and the National Cancer Institute [39] to approximate average Metabolic Equivalent of Tasks (METs) for physical activities, which we converted to calories expended per hour of activity [39]. Because the foods listed in the USDA National Nutrient Database are not described in the way that people talk about food, we created a list of food phrases used on Twitter by starting with a kernel of basic food terms from the USDA’s MyPlate website’s food group pages [40]. If the food phrase was not specific, such as “cereal”, we chose the most popular version of that food in the United States via an informal Google search at the time of the study (in this instance, Cheerios). If a brand name food was not in the USDA National Nutrient Database, we chose the closest match we could find. (Please note that this means that data in the appendix may be inaccurate when searching brand name items.)
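To make the MET conversion concrete, here is a minimal Python sketch. It relies on the standard definition that 1 MET corresponds to roughly 1 kcal per kilogram of body mass per hour; the body mass used in the example (70 kg) and the MET value (8.0) are illustrative assumptions, not values taken from this paper, and the function names are hypothetical. The caloric balance helper uses the ratio Cbal = Cout/Cin given in the caption of Tab. I, with the Colorado values quoted in the caption of Fig. 4.

```python
def met_to_kcal_per_hour(met, body_mass_kg):
    """Convert a MET value to kcal expended per hour.

    Uses the standard approximation 1 MET ~ 1 kcal / kg / hour.
    body_mass_kg is an assumed parameter, not a value from the paper.
    """
    return met * body_mass_kg


def caloric_balance(c_out, c_in):
    """Caloric balance as the ratio of caloric output to caloric input."""
    return c_out / c_in


# Illustrative MET value and body mass (both assumptions).
print(met_to_kcal_per_hour(8.0, 70.0))

# Colorado's Cout and Cin as quoted in the caption of Fig. 4.
print(round(caloric_balance(203.5, 257.4), 3))
```

Because the same body mass would scale every activity phrase equally, this choice affects the absolute size of Cout and Cbal but not how regions rank against one another.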

This approach yielded examples of foods in the food groups of fruits, vegetables, grains, proteins, dairy, oils, solid fats, and “empty calories” (e.g., junk food), and built up a list of nearly 1400 food phrases used on Twitter. For this study, we did not include sodas or juices in our list. Soups, ice creams, oils and other items that may act as liquids may be excluded in future versions of these analyses.

For physical activity, we used the physical activities listed in the Compendium to build up a list of nearly 14,000 physical activity phrases used on Twitter. The order-of-magnitude difference between the lengths of the two lists arises from the difference in the number of terms that went into creating each list and the rates at which people tweet about foods vs. physical activities.

B. Phrase extraction

A major obstacle to the development of the food and physical activity lists is the determination of those phrases used by individuals that most accurately represent a food or physical activity. Various methods exist which may help one ascertain information about the frequency of usage of higher-order lexical units [37]. However, we require one that not only determines reasonable estimates of frequency of usage, but further, does so with nuance regarding context. For example, one should not count the phrase “apple” as having occurred if it appeared within a larger phrase that was recognized as meaningful, such as “you’re the apple of my eye.” Hence, we impose meaningful partitions of text by utilizing what we call an “inner context model” [41]. In particular, we consider two phrases to be members of a common context when they are maximally similar according to a normalized edit distance [42] (allowing for just insertions and deletions). Our basic measure for a phrase’s strength of context is then the local likelihood (denoted $\ell_s$ for a phrase $s$), or rate of occurrence of the phrase amongst all others present in the context. A phrase of length (number of words) N is a member of N contexts (setting q = 1 in the context model). We therefore take the likelihood with respect to whichever of the phrase’s contexts occurs most frequently.

To describe our method, serial partitioning, we use a mock-up example. Suppose we wish to partition the clause “you’re the apple of my eye”, and assume further that among all the phrases contained within the clause, one knows

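$\ell_{\text{you're}} < \ell_{\text{you're the}}$ and $\ell_{\text{you're the}} > \ell_{\text{you're the apple}}$

and that

$\ell_{\text{apple}} < \ell_{\text{apple of}} < \ell_{\text{apple of my}} < \ell_{\text{apple of my eye}}$,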


as this information is sufficient to exemplify the algorithm. We would not have included “you’re the apple of my eye” in our food phrase list, as it does not use apple in a way that pertains to eating food.

A complete step-by-step illustration of the serial partitioning algorithm for the clause “you’re the apple of my eye” would be:

1. you’re the apple of my eye    $\ell_{\phi} = 0$

2. you’re the apple of my eye    $\ell_{\text{you're}} > \ell_{\phi}$

3. you’re the apple of my eye    $\ell_{\text{you're the}} > \ell_{\text{you're}}$

4. you’re the apple of my eye    $\ell_{\text{you're the apple}} < \ell_{\text{you're the}}$

5. you’re the apple of my eye    $\ell_{\phi} = 0$

6. you’re the apple of my eye    $\ell_{\text{apple}} > \ell_{\phi}$

...

9. you’re the apple of my eye    $\ell_{\text{apple of my eye}} > \ell_{\text{apple of my}}$

10. you’re the apple of my eye.

with each step explained in detail:

1. The first component begins with the null phrase (likelihood 0).

2. The first word is concatenated onto our component, producing the phrase “you’re”, whose positive likelihood keeps the process going.

3. The word “the” is then concatenated to form “you’re the”, whose increased likelihood keeps the process going.

4. Once the word “apple” is concatenated onto the growing phrase, the likelihood drops.

5. Since the likelihood has dropped, the previous phrase “you’re the” is partitioned.

6. The process resumes with the null word, and subsequent concatenation forms the new working phrase, “apple”.

7–9. Concatenation of the words “of”, “my”, and “eye” results in increased local likelihood and so perpetuates the process.

10. Since the phrase “apple of my eye” is bounded by punctuation, the phrase is partitioned, whereupon the algorithm moves on to the next clause.
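The steps above can be sketched in Python as follows. This is a minimal sketch rather than the authors' code: the likelihood table is hypothetical and chosen only to satisfy the inequalities assumed in the example, phrases missing from the table are assumed to have likelihood 0, and a tie is treated like a drop (the text only specifies that an increase keeps the process going).

```python
def serial_partition(words, likelihood):
    """Greedily split a clause into phrases of increasing local likelihood.

    `likelihood` maps a phrase (words joined by spaces) to its local
    likelihood; missing phrases are assumed to have likelihood 0.
    """
    parts = []
    current = []        # words of the working phrase (null phrase at start)
    current_l = 0.0     # likelihood of the null phrase
    for word in words:
        candidate = current + [word]
        candidate_l = likelihood.get(" ".join(candidate), 0.0)
        if current and candidate_l <= current_l:
            # Likelihood did not increase: partition the previous phrase
            # and restart from the null phrase with the current word.
            parts.append(" ".join(current))
            current = [word]
            current_l = likelihood.get(word, 0.0)
        else:
            current = candidate
            current_l = candidate_l
    if current:
        # The end of the clause (punctuation) closes the last phrase.
        parts.append(" ".join(current))
    return parts


# Hypothetical likelihoods consistent with the inequalities in the example.
toy_likelihood = {
    "you're": 0.1,
    "you're the": 0.2,
    "you're the apple": 0.05,
    "apple": 0.1,
    "apple of": 0.2,
    "apple of my": 0.3,
    "apple of my eye": 0.4,
}
clause = "you're the apple of my eye".split()
partition = serial_partition(clause, toy_likelihood)
```

With these toy values the clause splits into “you're the” and “apple of my eye”, so the bare word “apple” is never counted on its own.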

We manually applied the following criteria for constructing both food and exercise phrase lists. For a phrase to be included, it had to be a phrase that used the food or physical activity word(s) in a way that pertained to eating or physical activity; we excluded phrases that were part of hashtags, Twitter user names, song lyrics, or names of organizations or businesses, and phrases that appeared four or fewer times were not included. Misspellings and alternate spellings were included if we happened upon them (for example, “mash potatoes” instead of “mashed potatoes”), but we did not go out of our way to search for them. We queried questionable phrases to be sure that the majority of their uses were referring to the item of interest. Because we were building up from a small list, some specific versions of foods were included while more general forms were not. For example, because we built phrases up from “strawberry,” “strawberry jam” was included while we did not conduct a larger search for “jam”. In another example, in building phrases up from “bacon,” “bacon wrapped dates” turned up so we included those dates but did not conduct a larger search for all possible “dates”. (Note: We removed the physical activities category ‘sexual activity’ from the study because the task of determining meaning and context was too difficult.)
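Two of the criteria above are mechanical and could be automated; the Python sketch below illustrates them under stated assumptions. The criteria were applied manually in this study, so the helper names, the regular expression, and the example counts are all hypothetical, and the judgement-based criteria (song lyrics, business names, meaning in context) are not covered.

```python
import re

MIN_COUNT = 5  # phrases appearing four or fewer times are not included


def strip_tags_and_mentions(tweet):
    """Lowercase a tweet and blank out hashtags and @user names,
    so that phrases inside them are never matched."""
    return re.sub(r"[#@]\w+", " ", tweet.lower())


def filter_by_frequency(phrase_counts, min_count=MIN_COUNT):
    """Keep only phrases observed at least `min_count` times."""
    return {p: c for p, c in phrase_counts.items() if c >= min_count}


# Hypothetical counts, for illustration only.
kept = filter_by_frequency({"mash potatoes": 5, "bacon wrapped dates": 4})
```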

We searched for phrases containing the physical activities in multiple tenses in order to capture as much information as possible. For example, for the activity type shoveling snow, we searched for the forms of shovel, shoveling, and shoveled. Tweets were initially converted to all lowercase text, so we were assured that we were not missing data due to capitalization. To match each food phrase with its closest caloric data, we found the most closely corresponding food from the USDA National Nutrient Database, counting all vegetables and fruits in their raw form unless the phrase indicated otherwise. Similarly, we entered meats as roasted or cooked with dry heat, not fried, unless the phrase indicated otherwise or there was no homemade option. We used the nutrition content of homemade versions of foods (for example, baked goods) rather than store-bought foods unless the phrase indicated otherwise. Our approach, while systematic, was not exhaustive, nor is it the only way of taking on this challenge; there are certainly other methods that we expect to yield similar results.
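The multi-tense search on lowercased tweets can be illustrated with a short Python sketch. It assumes the tense forms are listed by hand, as in the shoveling example; the helper names and the sample tweet are hypothetical.

```python
import re


def tense_pattern(forms):
    """Compile a whole-word pattern matching any of the given forms."""
    # Longest forms first so that "shoveling" is tried before "shovel".
    ordered = sorted(forms, key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def find_forms(tweet, pattern):
    """Lowercase the tweet, then return every matched form in order."""
    return pattern.findall(tweet.lower())


SHOVEL = tense_pattern(["shovel", "shoveling", "shoveled"])
matches = find_forms("Shoveled the driveway, still SHOVELING snow", SHOVEL)
```

Lowercasing before matching mirrors the preprocessing described above, so capitalized mentions are not missed.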

Finally, we lemmatized the food phrases by their code in the USDA National Nutrient Database. Within each set of phrases that held the same code, if one phrase was more general than the others, we used the more general phrase as the lemma.

We lemmatized the activity phrases by their METs and activity category. Activity categories were largely the same as listed in the Compendium, with slight changes due to items in the Compendium being listed in a Miscellaneous category, etc. This yielded instances in which physical activity phrases that were in the same activity category and had the same METs, but were otherwise very different, were included in the same lemma. From this level of lemmatization, we then used our best judgement to break these lemmas down further until proper phrases were included in each lemma.
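The first, automatic level of activity lemmatization (grouping by METs and activity category) can be sketched in Python as below. This is only an illustration: the entries, MET values, and category names are hypothetical placeholders rather than values from the Compendium, and the subsequent judgement-based splitting of lemmas is not represented.

```python
from collections import defaultdict


def group_by_mets_and_category(entries):
    """Group (phrase, METs, category) entries under a (METs, category) key."""
    lemmas = defaultdict(list)
    for phrase, mets, category in entries:
        lemmas[(mets, category)].append(phrase)
    return dict(lemmas)


# Hypothetical entries; the MET values are illustrative placeholders.
entries = [
    ("shoveling snow", 5.0, "lawn and garden"),
    ("shoveled snow", 5.0, "lawn and garden"),
    ("jogging", 7.0, "running"),
]
groups = group_by_mets_and_category(entries)
```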

[1] Health-related quality of life: Well-being concepts (2013), http://www.cdc.gov/hrqol/wellbeing.htm, accessed March 29, 2014.

[2] P. S. Dodds, K. D. Harris, I. M. Kloumann, C. A. Bliss, and C. M. Danforth, PLoS ONE 6, e26752 (2011), draft version available at http://arxiv.org/abs/1101.5120v4, accessed November 15, 2014.

[3] E. Diener, M. Diener, and C. Diener, Journal of Personality and Social Psychology 69(5), 851 (1995).

[4] State of the States, http://www.gallup.com/poll/125066/State-States.aspx, accessed March 29, 2014.

[5] Stimmungsgasometer (2013), http://xn--fhlometer-q9a.de/, accessed March 29, 2014.

[6] J. Siebens, Extended measures of well-being: Living conditions in the United States: 2011 (2013), URL http://www.census.gov/prod/2013pubs/p70-136.pdf.

[7] M. Duggan and J. Brenner, The demographics of social media users—2012 (2013), URL http://www.pewinternet.org/files/old-media//Files/Reports/2013/PIP_SocialMediaUsers.pdf.

[8] A. Signorini, A. M. Segre, and P. M. Polgreen, PLoS ONE 6(5): e19467 (2011).

[9] V. M. Prieto, S. Matos, M. Alvarez, F. Cacheda, and J. L. Oliveira, PLoS ONE 9(1): e86191 (2014).

[10] S. C. Walpole, D. Prieto-Merino, P. Edwards, J. Cleland, and G. Stevens, PLoS ONE 5(11): e14118 (2010).

[11] M. J. Paul and M. Dredze, ICWSM (2011).

[12] D. J. Watts, R. Muhamad, D. Medina, and P. S. Dodds, Proc. Natl. Acad. Sci. 102, 11157 (2005).

[13] Google Flu Trends, https://www.google.org/flutrends/, accessed March 1, 2015.

[14] D. Lazer, R. Kennedy, G. King, and A. Vespignani, Science Magazine 343, 1203 (2014).

[15] L. Mitchell, M. R. Frank, P. S. Dodds, and C. M. Danforth, PLoS ONE 8, e64417 (2013).

[16] P. S. Dodds, E. M. Clark, S. Desu, M. R. Frank, A. J. Reagan, J. R. Williams, L. Mitchell, K. D. Harris, I. M. Kloumann, J. P. Bagrow, et al., Proc. Natl. Acad. Sci. 112, 2389 (2015).

[17] R. Chunara, L. Bouton, J. W. Ayers, and J. S. Brownstein, PLoS ONE 8(4): e61373 (2013).

[18] R. West, R. W. White, and E. Horvitz, Proceedings of the 22nd international conference on World Wide Web. International World Wide Web Conferences Steering Committee pp. 1399–1410 (2013).

[19] J. C. Eichstaedt, H. A. Schwartz, M. L. Kern, G. Park, D. R. Labarthe, R. M. Merchant, S. Jha, M. Agrawal, L. A. Dziurzynski, M. Sap, et al., Psychological Science (2015).

[20] S. Abbar, Y. Mejova, and I. Weber, You tweet what you eat: Studying food consumption through Twitter (2014), available at http://arxiv.org/abs/1412.4361.

[21] C. Chew and G. Eysenbach, BMC Public Health 12:439 (2012).

[22] C. S. L. blog, Hedonometer 2.0: Measuring happiness and using word shifts (2014).

[23] S. A. Golder and M. W. Macy, Science Magazine 333, 1878 (2011).

[24] Americas Health Rankings report—State Health Statistics, http://AmericasHealthRankings.org, accessed March 15, 2014.

[25] Centers for Disease Control and Prevention, http://www.cdc.gov, accessed March 15, 2014.

[26] CNBC overall rankings 2012, http://www.cnbc.com/id/100016697, accessed March 15, 2014.

[27] State Health Facts—The Henry J. Kaiser Family Foundation, http://kff.org/statedata, accessed March 15, 2014.

[28] State indicator report on fruits and vegetables. National Center for Chronic Disease Prevention and Health Promotion, Division of Nutrition, Physical Activity, and Obesity. Centers for Disease Control and Prevention, US Department of Health and Human Services (2013), http://www.cdc.gov/nutrition/downloads/State-Indicator-Report-Fruits-Vegetables-2013.pdf, accessed March 15, 2014.

[29] America’s Brain Health Index, http://www.beautiful-minds.com/AmericasBrainHealthIndex, accessed March 15, 2014.

[30] US Census American FactFinder, http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml, accessed March 15, 2014.

[31] P. J. Rentfrow, S. D. Gosling, M. Jokela, D. Stillwell, M. Kosinski, and J. Potter, Journal of Personality and Social Psychology 105(6), 996 (2013).

[32] Strolling of the Heifers Locavore Index, http://www.strollingoftheheifers.com/locavoreindex/, accessed March 15, 2014.

[33] Freedom in the 50 States, Mercatus Center, George Mason University (2013), http://freedominthe50states.org/, accessed March 15, 2014.

[34] Y. Benjamini and Y. Hochberg, Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 289 (1995).

[35] M. T. French, I. Popovici, and J. C. Maclean, Am J Health Promot 24, 2 (2009).

[36] B. J. Pesta, S. Bertsch, M. A. McDaniel, C. B. Mahoney, and P. J. Poznanski, Intelligence 40, 107 (2012).

[37] J. R. Williams, P. R. Lessard, S. Desu, E. Clark, J. P. Bagrow, C. M. Danforth, and P. S. Dodds (2014), draft version available at arXiv:1406.5181, accessed August 1, 2014.

[38] U.S. Department of Agriculture, Agricultural Research Service, USDA National Nutrient Database for Standard Reference, release 25 (2013), URL http://www.ars.usda.gov/ba/bhnrc/ndl.

[39] B. E. Ainsworth, W. L. Haskell, S. D. Herrmann, N. Meckes, D. R. Bassett Jr., C. Tudor-Locke, J. L. Greer, J. Vezina, M. C. Whitt-Glover, and A. S. Leon, The Compendium of Physical Activities Tracking Guide. Healthy Lifestyles Research Center, College of Nursing & Health Innovation, Arizona State University (2013), URL https://sites.google.com/site/compendiumofphysicalactivities/.

[40] USDA MyPlate food groups, accessed May 15, 2015, URL http://www.choosemyplate.gov/food-groups/.

[41] J. R. Williams, E. M. Clark, J. P. Bagrow, C. M. Danforth, and P. S. Dodds, Identifying missing dictionary entries with frequency-conserving context models (2015), available online at http://arxiv.org/abs/1503.02120.

[42] V. Levenshtein, Soviet Physics Doklady 10, 707 (1966).


[Figure S1 plot area. Left panel: $C_{\text{bal}}$ versus $C_{\text{in}}$, with $\hat{\rho}_p = -0.47$, p-value $= 0.00059$, $m = -0.0071$. Right panel: $C_{\text{bal}}$ versus $C_{\text{out}}$, with $\hat{\rho}_p = 0.93$, p-value $= 0$, $m = 0.0042$. Points are labeled by state abbreviation.]

FIG. S1: Plots for the contiguous US showing the relationships $C_{\text{bal}}$ versus $C_{\text{in}}$ (left), and $C_{\text{bal}}$ versus $C_{\text{out}}$ (right). The Pearson correlation coefficient $\hat{\rho}_p$ is included in each plot, showing a low correlation between $C_{\text{in}}$ and $C_{\text{out}}$. With its larger range, caloric output $C_{\text{out}}$ is more tightly coupled with the ratio $C_{\text{bal}}$.


[Figure S2 panels: Balance (axis range −0.12 to 0.13), Food (−15 to 17), and Activity (−15 to 30). Bars are labeled by state and rank. Shared axis label: Deviations from national averages.]

FIG. S2: Histograms as per Fig. 3 with states sorted by food rank. The bar colors correspond to those used for the choropleth maps in Fig. 1.


[Figure S3 panels: Balance (axis range −0.12 to 0.13), Food (−15 to 17), and Activity (−15 to 30). Bars are labeled by state and rank. Shared axis label: Deviations from national averages.]

FIG. S3: Histograms as per Fig. 3 with states sorted by activity rank. The bar colors correspond to those used for the choropleth maps in Fig. 1.


[Figure S4 panels: A. High calorie foods mentioned more; B. Low calorie foods mentioned less; C. High calorie foods mentioned less; D. Low calorie foods mentioned more.]

FIG. S4: Food phrase shifts for Colorado, broken down into the four ways phrases may contribute to a shift, as described in Sec. II D. See Fig. 4A for the combined shift.


[Figure S5 panels: A. High calorie foods mentioned more; B. Low calorie foods mentioned less; C. High calorie foods mentioned less; D. Low calorie foods mentioned more.]

FIG. S5: Food phrase shifts for Mississippi, broken down into the four ways phrases may contribute to a shift, as described in Sec. II D. See Fig. 4B for the combined shift.


[Figure S6 panels: A. High calorie activities mentioned more; B. Low calorie activities mentioned less; C. High calorie activities mentioned less; D. Low calorie activities mentioned more.]

FIG. S6: Activity phrase shifts for Colorado, broken down into the four ways phrases may contribute to a shift, as described in Sec. II D. See Fig. 4C for the combined shift.


[Figure S7 panels: A. High calorie activities mentioned more; B. Low calorie activities mentioned less; C. High calorie activities mentioned less; D. Low calorie activities mentioned more.]

FIG. S7: Activity phrase shifts for Mississippi, broken down into the four ways phrases may contribute to a shift, as described in Sec. II D. See Fig. 4D for the combined shift.