modeling social data, lecture 1: case studies
TRANSCRIPT
Introduction and Overview
APAM E4990
Modeling Social Data
Jake Hofman & Chris Wiggins
Columbia University
January 23, 2013
Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 1 / 52
Housekeeping
• Who are we:
  • Jake: PhD, now at Microsoft Research NYC
  • Chris: APMA faculty, also NYT Chief Data Scientist
  • E-Dean Fung: Cornell Undergrad, now APAM student
• Grading:
  • 3 homeworks, each 20 percent of final grade
  • 1 final project, 30 percent of final grade
  • Class participation, 10 percent of final grade
Course overview
Modeling social data requires an understanding of:
1. How to obtain data produced by online human interactions,
2. What questions we typically ask about human-generated data,
3. How to reframe these questions as mathematical models, and
4. How to interpret the results of these models in ways that address our questions.
Topics we will cover
Figure: modelingsocialdata.org
Mise-en-place
(Things you will need in your setup)
1. Bash shell
2. Git
3. Account on GitHub
Computational social science
An emerging discipline at the intersection of the social sciences, statistics, and computer science
• Social sciences (motivating questions)
• Statistics (fitting large, potentially sparse models)
• Computer science (parallel processing for filtering and aggregating data)
Questions
Many long-standing questions in the social sciences are notoriously difficult to answer, e.g.:
• “Who says what to whom in what channel with what effect?” (Lasswell, 1948)
• How do ideas and technology spread through cultures? (Rogers, 1962)
• How do new forms of communication affect society? (Singer, 1970)
• . . .
Large-scale data
Recently available electronic data provide an unprecedented opportunity to address these questions at scale
[Figure: three types of data: Demographic, Behavioral, Network]
The clean real story
“We have a habit in writing articles published in
scientific journals to make the work as finished as
possible, to cover all the tracks, to not worry about the
blind alleys or to describe how you had the wrong idea
first, and so on. So there isn’t any place to publish, in
a dignified manner, what you actually did in order to
get to do the work ...”
- Richard Feynman, Nobel Lecture¹, 1965
¹ http://bit.ly/feynmannobel
Examples!
Now some examples of social data we’ve worked with and
will be discussing
But first: questions?
Topics
Exploratory Data Analysis
Regression
Classification
Networks
Exploratory Data Analysis
(a.k.a. counting and plotting things)
Demographic diversity on the Web
with Irmak Sirer and Sharad Goel (ICWSM 2012)
[Figure: daily per-capita pageviews (0 to 70) by group. Income: Over $25k, Under $25k. Race: Black & Hispanic, White. Education: No College, Some College. Age: Over 65, Under 65. Sex: Female, Male.]
Motivation
Previous work is largely survey-based and focuses on group-level differences in online access
“As of January 1997, we estimate that 5.2 million
African Americans and 40.8 million whites have ever used
the Web, and that 1.4 million African Americans and
20.3 million whites used the Web in the past week.”
- Hoffman & Novak (1998)
Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic groups?
Data
• Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel²
• Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information (age, education, income, race, sex, etc.)
² Special thanks to Mainak Mazumdar
Data
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
• Normalize pageviews to at most three domain levels, sans www, e.g. www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
• Aggregate activity at the site, group, and user levels
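The first two preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline actually used on the panel data: the function names and the (user_id, url) log format are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_domain(url, max_levels=3):
    """Reduce a URL to at most `max_levels` domain labels, dropping a leading www."""
    # urlsplit only fills in the hostname when the URL has a scheme or starts with "//"
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    if labels and labels[0] == "www":
        labels = labels[1:]
    return ".".join(labels[-max_levels:])


def top_sites(pageviews, k):
    """Return the k sites with the most unique visitors.

    `pageviews` is an iterable of (user_id, url) pairs (a hypothetical log format).
    """
    visitors = defaultdict(set)
    for user_id, url in pageviews:
        visitors[normalize_domain(url)].add(user_id)
    # Sort by descending unique-visitor count, breaking ties alphabetically
    ranked = sorted(visitors, key=lambda site: (-len(visitors[site]), site))
    return ranked[:k]
```

Counting unique visitors with a set per site keeps a single very active user from inflating a site's popularity.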
Aggregate usage patterns
How do users distribute their time across different categories?
[Figure: fraction of total pageviews (0.05 to 0.25) by category: Social Media, E-mail, Games, Portals, Search]
All groups spend the majority of their time in the top five most popular categories
[Figure: fraction of pageviews in category (0.05 to 0.30) vs. user rank by daily activity (10% to 90%); lines for Social Media, E-mail, Games, Portals, Search]
Highly active users devote nearly twice as much of their time to social media relative to typical individuals
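A comparison like this one comes down to counting. The sketch below ranks users by total activity, splits them into bins, and computes each category's share of pageviews within a bin. The (user_id, category) input format and the function names are assumptions for illustration, and pooling pageviews within a bin is one choice among several.

```python
from collections import Counter, defaultdict


def category_shares(categories):
    """Map each category to its fraction of the pageviews in `categories`."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def shares_by_activity_rank(pageviews, n_bins=10):
    """Group users into bins by total activity and compute category shares per bin.

    `pageviews` is an iterable of (user_id, category) pairs (hypothetical format).
    Bin 0 holds the least active users, bin n_bins - 1 the most active.
    """
    per_user = defaultdict(list)
    for user_id, category in pageviews:
        per_user[user_id].append(category)
    # Rank users from least to most active, breaking ties by user id
    ranked = sorted(per_user, key=lambda u: (len(per_user[u]), u))
    bins = defaultdict(list)
    for rank, user_id in enumerate(ranked):
        bins[rank * n_bins // len(ranked)].extend(per_user[user_id])
    return {b: category_shares(views) for b, views in sorted(bins.items())}
```

Averaging per-user shares instead of pooling would weight every user in a bin equally, which may be preferable when activity within a bin is itself skewed.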
Group-level activity
How does browsing activity vary at the group level?
[Figure: daily per-capita pageviews (0 to 70) by group. Income: Over $25k, Under $25k. Race: Black & Hispanic, White. Education: No College, Some College. Age: Over 65, Under 65. Sex: Female, Male.]
Large differences exist even at the aggregate level (e.g. women on average generate 40% more pageviews than men)
Younger and more educated individuals are both more likely to access the Web and more active once they do
Group-level activity
All demographic groups spend the majority of their time in the same categories
[Figure: fraction of total pageviews by category (Social Media, E-mail, Games, Portals, Search) in five panels. Age: 5 to 80. Education: Grammar School, Some High School, High School Graduate, Some College, Associate Degree, Bachelor's Degree, Post Graduate Degree. Sex: Female, Male. Income: $0-25k, $25-50k, $50-75k, $75-100k, $100-150k, $150k+. Race: Other, Hispanic, Black, White, Asian.]
Older, more educated, male, wealthier, and Asian Internet users spend a smaller fraction of their time on social media
Lower social media use by these groups is often accompanied by higher e-mail volume
Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) for News, Health, Reference in four panels. Education: Grammar School, Some High School, High School Graduate, Some College, Associate Degree, Bachelor's Degree, Post Graduate Degree. Sex: Female, Male. Income: $0-25k, $25-50k, $50-75k, $75-100k, $100-150k, $150k+. Race: Other, Hispanic, Black, White, Asian.]
Post-graduates spend three times as much time on health sites as adults with only some high school education
Asians spend more than 50% more time browsing online news than do other race groups
Even when less educated and less wealthy groups gain access to the Web, they utilize these resources relatively infrequently
Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) by education (High School Graduate through Post Graduate Degree) in three panels: News, Health, Reference; lines for Asian, Black, Hispanic, White]
Controlling for other variables, effects of race and gender largely disappear, while education continues to have a large effect
$p_i = \sum_j \alpha_j x_{ij} + \sum_j \sum_k \beta_{jk} x_{ij} x_{ik} + \sum_j \gamma_j x_{ij}^2 + \epsilon_i$
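The regression on this slide combines linear, pairwise-interaction, and squared terms of each user's demographic features. A minimal sketch of that feature expansion follows. It assumes each unordered pair of features contributes one interaction term, which the slide does not spell out, and `expand_features` and `predict` are illustrative names rather than anything from the course code.

```python
from itertools import combinations


def expand_features(x):
    """Return linear, pairwise-interaction, and squared terms for one user's features."""
    linear = list(x)
    # Assumption: each unordered pair (j, k) with j < k appears once
    interactions = [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    squares = [value ** 2 for value in x]
    return linear + interactions + squares


def predict(x, coefficients):
    """Linear predictor: dot product of the expanded features with the coefficients."""
    features = expand_features(x)
    if len(features) != len(coefficients):
        raise ValueError("expected %d coefficients" % len(features))
    return sum(c * f for c, f in zip(coefficients, features))
```

Because the model stays linear in its coefficients, any ordinary least-squares routine can fit it once the features are expanded this way.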
Revisiting the digital divide
How does usage of news, health, and reference vary with demographics?
[Figure: average pageviews per month (0 to 12) on Health sites by education (High School Graduate through Post Graduate Degree); lines for Female, Male]
[Figure: distribution of monthly pageviews on health sites (20 to 100); Female, Male]
However, women spend considerably more time on health sites compared to men, although means can be misleading
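A toy example shows why a mean can mislead for skewed activity data. The numbers below are made up for illustration and are not taken from the panel.

```python
from statistics import mean, median

# Made-up monthly pageview counts on health sites, for illustration only
typical_users = [0, 0, 1, 1, 2, 2, 3, 3, 6]
# Same group, but the most active user is replaced by one very heavy user
with_one_heavy_user = typical_users[:-1] + [96]

summary = {
    "typical": (mean(typical_users), median(typical_users)),
    "heavy": (mean(with_one_heavy_user), median(with_one_heavy_user)),
}
```

A single heavy user multiplies the mean several times over while the median stays put, so comparing group means alone says little about the typical member of each group.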
Summary
• Highly active users spend disproportionately more of their time on social media and less on e-mail relative to the overall population
• Access to research, news, and healthcare is strongly related to education, not as closely to ethnicity
• User demographics can be inferred from browsing activity with reasonable accuracy
• “Who Does What on the Web”, Goel, Hofman & Sirer, ICWSM 2012
Classification
(a.k.a. modeling discrete things)
example:
millions of views per hour
interpretable supervised learning
super cool stuff
Regression
(a.k.a. modeling continuous things)
Predicting consumer activity with Web search
with Sharad Goel, Sebastien Lahaie, David Pennock, Duncan Watts
[Figure: weekly rank (40 to 10) of "Right Round", Mar-09 through Aug-09; lines for Billboard, Search]
Search predictions
Summary
• Relative performance and value of search vary across domains
• Search provides a fast, convenient, and flexible signal across domains
• “Predicting consumer activity with Web search”, Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010
Networks
(a.k.a. counting complicated things)
The structural virality of online diffusion
with Ashton Anderson, Sharad Goel, Duncan Watts (Management Science 2015)
Information diffusion
Summary
• Most cascades fail, resulting in fewer than two adoptions, on average
• Of the hits that do succeed, we observe a wide range of diverse diffusion structures
• It’s difficult to say how something spread given only its popularity
• “The structural virality of online diffusion”, Anderson, Goel, Hofman & Watts (Under review.)
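To see why popularity alone does not reveal how something spread, compare two cascades of the same size. The sketch assumes structure is summarized by the average shortest-path distance between all pairs of nodes in the cascade tree; the slides do not define the measure, so treat this as one plausible choice rather than the authors' exact method.

```python
from collections import deque
from itertools import combinations


def average_pairwise_distance(edges):
    """Mean shortest-path distance over all node pairs in a connected, undirected tree.

    Assumes at least two nodes; `edges` is a list of (parent, child) pairs.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def distances_from(source):
        # Breadth-first search gives shortest-path lengths in an unweighted graph
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist

    all_dist = {node: distances_from(node) for node in neighbors}
    pairs = list(combinations(neighbors, 2))
    return sum(all_dist[u][v] for u, v in pairs) / len(pairs)


# Two hypothetical cascades with the same popularity (5 nodes) but different shapes
broadcast = [(0, 1), (0, 2), (0, 3), (0, 4)]  # one source reaches everyone directly
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]      # each adopter passes it on to one other
```

Both cascades reach five nodes, yet the chain has a larger average distance than the broadcast, so a size count alone cannot tell them apart.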