modeling social data, lecture 1: case studies

49
Introduction and Overview APAM E4990 Modeling Social Data Jake Hofman & Chris Wiggins Columbia University January 23, 2013 Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 1 / 52

Upload: jakehofman

Post on 15-Jul-2015

922 views

Category:

Science


0 download

TRANSCRIPT

Page 1: Modeling Social Data, Lecture 1: Case Studies

Introduction and OverviewAPAM E4990

Modeling Social Data

Jake Hofman & Chris Wiggins

Columbia University

January 23, 2013

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 1 / 52

Page 2: Modeling Social Data, Lecture 1: Case Studies

Housekeeping

• Who are we:

• Jake: PhD, now at Microsoft Research NYC• Chris: APMA faculty, also NYT Chief Data Scientist• E-Dean Fung: Cornell Undergrad, now APAM student

• Grading:

• 3 homeworks, each 20 percent of final grade• 1 final project, 30 percent of final grade• Class participation, 10 percent of final grade

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 2 / 52

Page 3: Modeling Social Data, Lecture 1: Case Studies

Course overview

Modeling social data requires an understanding of:

1 How to obtain data produced by online human interactions,2 What questions we typically ask about human-generated data,3 How to reframe these questions as mathematical models, and4 How to interpret the results of these models in ways that

address our questions.

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 3 / 52

Page 4: Modeling Social Data, Lecture 1: Case Studies

Topics we will cover

Figure : modelingsocialdata.org

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 4 / 52

Page 5: Modeling Social Data, Lecture 1: Case Studies

Mise-en-place

(Things you will need in your setup)

1 Bash shell2 Git3 Account on Github

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 5 / 52

Page 6: Modeling Social Data, Lecture 1: Case Studies

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52

Page 7: Modeling Social Data, Lecture 1: Case Studies

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

(motivating questions)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52

Page 8: Modeling Social Data, Lecture 1: Case Studies

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

(fitting large, potentially sparse models)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52

Page 9: Modeling Social Data, Lecture 1: Case Studies

Computational social science

An emerging discipline at the intersection of the social sciences,statistics, and computer science

(parallel processing for filtering and aggregating data)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 6 / 52

Page 10: Modeling Social Data, Lecture 1: Case Studies

Questions

Many long-standing questions in the social sciences are notoriouslydi�cult to answer, e.g.:

• “Who says what to whom in what channel with what e↵ect”?(Laswell, 1948)

• How do ideas and technology spread through cultures?(Rogers, 1962)

• How do new forms of communication a↵ect society?(Singer, 1970)

• . . .

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 7 / 52

Page 11: Modeling Social Data, Lecture 1: Case Studies

Large-scale data

Recently available electronic data provide an unprecedentedopportunity to address these questions at scale

Demographic Behavioral Network

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 8 / 52

Page 12: Modeling Social Data, Lecture 1: Case Studies

The clean real story

“We have a habit in writing articles published in

scientific journals to make the work as finished as

possible, to cover all the tracks, to not worry about the

blind alleys or to describe how you had the wrong idea

first, and so on. So there isn’t any place to publish, in

a dignified manner, what you actually did in order to

get to do the work ...”

-Richard FeynmanNobel Lecture

1, 1965

1http://bit.ly/feynmannobel

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 9 / 52

Page 13: Modeling Social Data, Lecture 1: Case Studies

Examples!

Now some examples of social data we’ve worked with and

will be discussing

But first: questions?

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 10 / 52

Page 14: Modeling Social Data, Lecture 1: Case Studies

Topics

Exploratory Data Analysis

Regression

Classification

Networks

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 11 / 52

Page 15: Modeling Social Data, Lecture 1: Case Studies

Exploratory Data Analysis

(a.k.a. counting and plotting things)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 12 / 52

Page 16: Modeling Social Data, Lecture 1: Case Studies

Demographic diversity on the Webwith Irmak Sirer and Sharad Goel (ICWSM 2012)

Dai

ly P

er−C

apita

Pag

evie

ws

0

10

20

30

40

50

60

70

●●

Over $25k

Under $25k

Black&

Hispanic

White

No College

Some College

Over 65

Under 65

Female

Male

Income Race Education Age Sex

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 13 / 52

Page 17: Modeling Social Data, Lecture 1: Case Studies

Motivation

Previous work is largely survey-based and focuses and group-leveldi↵erences in online access

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 14 / 52

Page 18: Modeling Social Data, Lecture 1: Case Studies

Motivation

“As of January 1997, we estimate that 5.2 million

African Americans and 40.8 million whites have ever used

the Web, and that 1.4 million African Americans and

20.3 million whites used the Web in the past week.”

-Ho↵man & Novak (1998)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 14 / 52

Page 19: Modeling Social Data, Lecture 1: Case Studies

Motivation

Focus on activity instead of access

How diverse is the Web?

To what extent do online experiences vary across demographicgroups?

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 15 / 52

Page 20: Modeling Social Data, Lecture 1: Case Studies

Data

• Representative sample of 265,000 individuals in the US, paidvia the Nielsen MegaPanel2

• Log of anonymized, complete browsing activity from June2009 through May 2010 (URLs viewed, timestamps, etc.)

• Detailed individual and household demographic information(age, education, income, race, sex, etc.)

2Special thanks to Mainak Mazumdar

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 16 / 52

Page 21: Modeling Social Data, Lecture 1: Case Studies

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans wwwe.g. www.yahoo.com ! yahoo.com,us.mg2.mail.yahoo.com/neo/launch ! mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52

Page 22: Modeling Social Data, Lecture 1: Case Studies

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans wwwe.g. www.yahoo.com ! yahoo.com,us.mg2.mail.yahoo.com/neo/launch ! mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52

Page 23: Modeling Social Data, Lecture 1: Case Studies

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans wwwe.g. www.yahoo.com ! yahoo.com,us.mg2.mail.yahoo.com/neo/launch ! mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52

Page 24: Modeling Social Data, Lecture 1: Case Studies

Data

# ls -alh nielsen_megapanel.tar

-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

• Normalize pageviews to at most three domain levels, sans wwwe.g. www.yahoo.com ! yahoo.com,us.mg2.mail.yahoo.com/neo/launch ! mail.yahoo.com

• Restrict to top 100k (out of 9M+ total) most popular sites(by unique visitors)

• Aggregate activity at the site, group, and user levels

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 17 / 52

Page 25: Modeling Social Data, Lecture 1: Case Studies

Aggregate usage patterns

How do users distribute their time across di↵erent categories?

Frac

tion

of to

tal p

agev

iew

s

0.05

0.10

0.15

0.20

0.25●

● ●

Social

Media

E−mail

Games

Portals

Search

All groups spend the majority of their time in the top five mostpopular categories

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 18 / 52

Page 26: Modeling Social Data, Lecture 1: Case Studies

Aggregate usage patterns

How do users distribute their time across di↵erent categories?

User Rank by Daily Activity

Frac

tion

of P

agev

iew

s in

Cat

egor

y

0.05

0.10

0.15

0.20

0.25

0.30

● ● ● ●●

10% 30% 50% 70% 90%

● Social MediaE−mailGamesPortalsSearch

Highly active users devote nearly twice as much of their time tosocial media relative to typical individuals

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 18 / 52

Page 27: Modeling Social Data, Lecture 1: Case Studies

Group-level activity

How does browsing activity vary at the group level?

Dai

ly P

er−C

apita

Pag

evie

ws

0

10

20

30

40

50

60

70

●●

Over $25k

Under $25k

Black&

Hispanic

White

No College

Some College

Over 65

Under 65

Female

Male

Income Race Education Age Sex

Large di↵erences exist even at the aggregate level(e.g. women on average generate 40% more pageviews than men)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 19 / 52

Page 28: Modeling Social Data, Lecture 1: Case Studies

Group-level activity

How does browsing activity vary at the group level?

Dai

ly P

er−C

apita

Pag

evie

ws

0

10

20

30

40

50

60

70

●●

Over $25k

Under $25k

Black&

Hispanic

White

No College

Some College

Over 65

Under 65

Female

Male

Income Race Education Age Sex

Younger and more educated individuals are both more likely toaccess the Web and more active once they do

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 19 / 52

Page 29: Modeling Social Data, Lecture 1: Case Studies

Group-level activity

All demographic groups spend the majority of their time in thesame categories

Age

Frac

tion

of to

tal p

agev

iew

s

0.0

0.1

0.2

0.3

0.4

0.5

●●

● ●

●●

● ●

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

● Social MediaE−mailGamesPortalsSearch

Fr

actio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4Education

● ●

●●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

●● ●

●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●● ●

Other

Hispan

icBlack

White

Asian

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 20 / 52

Page 30: Modeling Social Data, Lecture 1: Case Studies

Group-level activity

Older, more educated, male, wealthier, and Asian Internet usersspend a smaller fraction of their time on social media

Age

Frac

tion

of to

tal p

agev

iew

s

0.0

0.1

0.2

0.3

0.4

0.5

●●

● ●

●●

● ●

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

● Social MediaE−mailGamesPortalsSearch

Fr

actio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4Education

● ●

●●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

●● ●

●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●● ●

Other

Hispan

icBlack

White

Asian

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 20 / 52

Page 31: Modeling Social Data, Lecture 1: Case Studies

Group-level activity

Lower social media use by these groups is often accompanied byhigher e-mail volume

Age

Frac

tion

of to

tal p

agev

iew

s

0.0

0.1

0.2

0.3

0.4

0.5

●●

● ●

●●

● ●

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80

● Social MediaE−mailGamesPortalsSearch

Fr

actio

n of

tota

l pag

evie

ws

0.0

0.1

0.2

0.3

0.4Education

● ●

●●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

●● ●

●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●● ●

Other

Hispan

icBlack

White

Asian

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 20 / 52

Page 32: Modeling Social Data, Lecture 1: Case Studies

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iews

per

mon

th

0

2

4

6

8

10

12Education

● ●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

● ● ●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●●

Other

Hispan

icBlack

White

Asian

● NewsHealthReference

Post-graduates spend three times as much time on health sitesthan adults with only some high school education

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 21 / 52

Page 33: Modeling Social Data, Lecture 1: Case Studies

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iews

per

mon

th

0

2

4

6

8

10

12Education

● ●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

● ● ●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●●

Other

Hispan

icBlack

White

Asian

● NewsHealthReference

Asians spend more than 50% more time browsing online news thando other race groups

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 21 / 52

Page 34: Modeling Social Data, Lecture 1: Case Studies

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iews

per

mon

th

0

2

4

6

8

10

12Education

● ●

Grammar

Schoo

l

Some H

igh Sch

ool

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Sex

Female Male

Income

● ● ●●

$0−25k

$25−50k

$50−75k

$75−100k

$100−150k

$150k+

Race

● ●●

Other

Hispan

icBlack

White

Asian

● NewsHealthReference

Even when less educated and less wealthy groups gain access tothe Web, they utilize these resources relatively infrequently

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 21 / 52

Page 35: Modeling Social Data, Lecture 1: Case Studies

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iew

s pe

r mon

th

0

2

4

6

8

10

12News

● ●

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Health

●● ●

●●

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

Reference

●● ●

● ●

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

AsianBlackHispanicWhite

Controlling for other variables, e↵ects of race and gender largelydisappear, while education continues to have large e↵ect

pi =X

j

↵jxij +X

j

X

k

�jkxijxik +X

j

�jx2ij + ✏i

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 22 / 52

Page 36: Modeling Social Data, Lecture 1: Case Studies

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

A

vera

ge p

agev

iew

s pe

r mon

th

0

2

4

6

8

10

12Health

●● ●

● ●

High Sch

ool G

radua

te

Some C

ollege

Associa

te Deg

ree

Bache

lor's D

egree

Post G

radua

te Deg

ree

FemaleMale

However, women spend considerably more time on health sitescompared to men

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 23 / 52

Page 37: Modeling Social Data, Lecture 1: Case Studies

Revisiting the digital divide

How does usage of news, health, and reference vary withdemographics?

Monthly pageviews on health sites

20 40 60 80 100

FemaleMale

However, women spend considerably more time on health sitescompared to men, although means can be misleading

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 23 / 52

Page 38: Modeling Social Data, Lecture 1: Case Studies

Summary

• Highly active users spend disproportionately more of theirtime on social media and less on e-mail relative to the overallpopulation

• Access to research, news, and healthcare is strongly related toeducation, not as closely to ethnicity

• User demographics can be inferred from browsing activity withreasonable accuracy

• “Who Does What on the Web”, Goel, Hofman & Sirer,ICWSM 2012

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 24 / 52

Page 39: Modeling Social Data, Lecture 1: Case Studies

Classification

(a.k.a. modeling discrete things)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 25 / 52

Page 40: Modeling Social Data, Lecture 1: Case Studies

example:

millions of views per hour

Page 41: Modeling Social Data, Lecture 1: Case Studies
Page 42: Modeling Social Data, Lecture 1: Case Studies
Page 43: Modeling Social Data, Lecture 1: Case Studies

interpretable supervised learning

supe

r co

ol s

tuff

Page 44: Modeling Social Data, Lecture 1: Case Studies

Regression

(a.k.a. modeling continuous things)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 26 / 52

Page 45: Modeling Social Data, Lecture 1: Case Studies

Predicting consumer activity with Web searchwith Sharad Goel, Sebastien Lahaie, David Pennock, Duncan Watts

"Right Round"

Week

Ran

k

40

30

20

10

cccccccccccccccccccccccccccccccccccccccccc

Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09

BillboardSearch

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 27 / 52

Page 46: Modeling Social Data, Lecture 1: Case Studies

Search predictionsSummary

• Relative performance and value of search varies acrossdomains

• Search provides a fast, convenient, and flexible signal acrossdomains

• “Predicting consumer activity with Web search”Goel, Hofman, Lahaie, Pennock & Watts, PNAS 2010

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 37 / 52

Page 47: Modeling Social Data, Lecture 1: Case Studies

Networks

(a.k.a. counting complicated things)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 38 / 52

Page 48: Modeling Social Data, Lecture 1: Case Studies

The structural virality of online di↵usionwith Ashton Anderson, Sharad Goel, Duncan Watts (Management Science 2015)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 39 / 52

Page 49: Modeling Social Data, Lecture 1: Case Studies

Information di↵usionSummary

• Most cascades fail, resulting in fewer than two adoptions, onaverage

• Of the hits that do succeed, we observe a wide range ofdiverse di↵usion structures

• It’s di�cult to say how something spread given only itspopularity

• “The structural virality of online di↵usion”, Anderson, Goel,Hofman & Watts (Under review.)

Hofman & Wiggins (Columbia University) Introduction and Overview January 23, 2013 52 / 52