
P. Smyth: Networks MURI Meeting, Jan 10th 2012


Scalable Methods for the Analysis of Network-Based Data
Annual Review Meeting
Principal Investigator: Professor Padhraic Smyth
Department of Computer Science
University of California, Irvine
Additional project information online at www.datalab.uci.edu/muri


Today’s Annual Review Meeting

• Goals
  – Review our research progress
  – Discussion, questions, interaction
  – Feedback from visitors

• Format
  – Introduction
  – Research talks
    • 20-minute talks + 10 minutes/session for questions/discussion
  – Poster session
    • 1 to 2:30 (in this room)
  – Questions/discussion encouraged during talks
  – Several breaks


MURI Project Timeline

• Initial 3-year period
  – May 1, 2008 to April 30, 2011
  – Funding arrived at universities in October 2008

• 2-year extension
  – May 1, 2011 to April 30, 2013

• Meetings (all at UC Irvine)
  – Kickoff Meeting, November 2008
  – Working Meetings, April 2009, August 2009
  – Annual Review, December 2009
  – Working Meeting, May 2010
  – Annual Review, November 2010
  – Working Meeting, June 2011
  – Annual Review, January 2012
  – … plus many smaller meetings involving subsets of the research team


Motivation

2007: interdisciplinary interest in analysis of large network data sets
Many of the available techniques were descriptive and could not handle
- Prediction
- Missing data
- Covariates, etc.

2007: significant body of statistical theory available on network modeling
Many of the available techniques did not scale up to large data sets and were not widely known, understood, or used

Goal of this MURI project
Develop new statistical network models and algorithms to broaden their scope of application to large, complex, dynamic real-world network data sets


Key Aspects of Our Technical Approach

– Foundational statistical theory for network data

– New methods to handle heterogeneous network data (with time, text, ..)

– Efficient algorithms and data structures for scalable statistical estimation

– Applications to large real-world data sets

– Open-source software for others to build on


Example: Network Dynamics in Classrooms
Chris DuBois, Carter Butts, Padhraic Smyth, Dan McFarland (Stanford)


Example: Email Communication Data

Data: Count matrix of 200,000 email messages among 3000 individuals over 3 months
Problem: Understand communication patterns and predict future communication activity
Challenges: sparse data, missing data, non-stationarity, unseen covariates

C. DuBois, J. Foulds, P. Smyth, ICWSM, 2011


Example: Time Evolution of Emergency Responder Organizational Network for Hurricane Katrina

C. T. Butts, R. Acton, and C. Marcum, Interorganizational collaboration in the hurricane Katrina response, Journal of Social Structure, 2010


MURI Team

Investigator | University | Department | Expertise | Number of PhD Students | Number of Postdocs
Padhraic Smyth (PI) | UC Irvine | Computer Science | Machine learning | 6 | 1
Carter Butts | UC Irvine | Sociology | Statistical social network analysis | 6 |
Mark Handcock | UCLA | Statistics | Statistical social network analysis | 2 | 1
Dave Hunter | Penn State | Statistics | Computational statistics | 2 | 2
David Eppstein | UC Irvine | Computer Science | Graph algorithms | 2 |
Michael Goodrich | UC Irvine | Computer Science | Algorithms and data structures | 2 | 1
Dave Mount | U Maryland | Computer Science | Algorithms and data structures | 2 |
TOTALS | | | | 22 | 5


Collaboration Network

(Figure: collaboration network among the investigators, circa 2007. Nodes: Padhraic Smyth, Dave Hunter, Mark Handcock, Dave Mount, Mike Goodrich, David Eppstein, Carter Butts.)


Collaboration Network

(Figure: current collaboration network. Node labels recovered from the slide are listed below.)

Investigators: Padhraic Smyth, Dave Hunter, Mark Handcock, Dave Mount, Mike Goodrich, David Eppstein, Carter Butts

Other people: Emma Spiro, Lorien Jasny, Zack Almquist, Chris Marcum, Sean Fitzhugh, Ragupathyraj Vallyvan, Ryan Acton, Chris DuBois, Minkyoung Cho, Eunhui Park, Miruna Petrescu-Prahova, Arthur Asuncion, Jimmy Foulds, Duy Vu, Ruth Hummel, Michael Schweinberger, Ranran Wang, Nick Navaroli, Krista Gile, Darren Strash, Lowell Trott, Maarten Loffler, Joe Simons, Pavel Pszona, Ian Fellows, Romain Thibaux, Pavel Krivitsky

Organizations: Facebook, U Mass Amherst Computational Social Science Initiative, Google, Intel, RAND, University of Utrecht


Mapping the Project Terrain

(Diagram of project components: Domain Theory, Data Collection, Network Modeling, Statistical Theory, Data Structures and Algorithms, Inference Algorithms, Simulation, Hypothesis Testing, Prediction/Forecasting, Decision Support)


Statistical Network Modeling Approaches

• Exponential Random Graph Models (ERGMs)
  – “Canonical” representation for statistical models of networks
  – Can model edge dependencies in very flexible ways
  – Fitting of the model can be computationally difficult

• Latent Variable Models
  – Edges are conditionally independent given the latent variables
  – Can lead to much simpler estimation algorithms than regular ERGMs
  – Model interpretation can be difficult

• Event-Based Models
  – Edges have time-stamps, models based on survival analysis
  – Surprisingly, can be much easier to fit than models for “static” networks
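
The note above that ERGM fitting can be computationally difficult can be illustrated with a small sketch. An ERGM assigns a graph y the probability exp(θ·s(y)) / Z(θ), where s(y) is a vector of sufficient statistics and Z(θ) sums over every possible graph. The Python below is a minimal illustration only, assuming an undirected graph and two statistics (edge and triangle counts) chosen for the example; the names `graph_stats`, `ergm_log_prob`, `theta_edge`, and `theta_tri` are hypothetical and are not part of Statnet or any other library.

```python
from itertools import combinations, product
from math import exp, log


def graph_stats(edges, n):
    """Return (edge count, triangle count) for an undirected graph on n nodes."""
    # Store each edge as a sorted pair so (1, 0) and (0, 1) are the same edge.
    edge_set = {tuple(sorted(e)) for e in edges}
    triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if {(a, b), (a, c), (b, c)} <= edge_set
    )
    return len(edge_set), triangles


def ergm_log_prob(edges, n, theta_edge, theta_tri):
    """Exact log-probability of a graph under a toy edge+triangle ERGM.

    The normalizing constant is computed by brute force over all
    2^(n choose 2) graphs, which is only feasible for very small n.
    """
    dyads = list(combinations(range(n), 2))

    def score(es):
        n_edges, n_tri = graph_stats(es, n)
        return theta_edge * n_edges + theta_tri * n_tri

    # Sum exp(score) over every on/off assignment of the dyads.
    z = sum(
        exp(score([d for d, on in zip(dyads, mask) if on]))
        for mask in product([0, 1], repeat=len(dyads))
    )
    return score(edges) - log(z)


if __name__ == "__main__":
    # A triangle plus an isolated node, with a positive triangle effect.
    print(ergm_log_prob([(0, 1), (1, 2), (0, 2)], 4, -1.0, 0.5))
```

With 4 nodes the sum runs over 64 graphs; with 20 nodes it would run over 2^190, which is why practical estimation relies on approximations rather than this direct enumeration.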


Impact: Software

• R Language and Environment
  – Open-source, high-level environment for statistical computing
  – Default standard among research statisticians, increasingly being adopted by others
  – Estimated 250k to 1 million users

• Statnet
  – R libraries for analysis of network data
  – New contributions from this MURI project:
    • Missing data (Gile and Handcock, 2010)
    • Relational event models (Butts, 2008-2011)
    • Latent-class models (DuBois, 2010)
    • Fast clique-finding (Strash, 2011)
    • + more…


Impact: Publications

• Approximately 60 peer-reviewed publications
  – Across computer science, statistics, and social science
  – High visibility
    • Science, Butts, 2009
    • Journal of the American Statistical Association, Schweinberger, in press
    • Annals of Applied Statistics, Gile and Handcock, 2010
    • Journal of the ACM, da Fonseca and Mount, 2010
    • Journal of Machine Learning Research, Asuncion, Smyth, et al., 2010
  – Highly selective conferences
    • ACM SIGKDD 2010 (16% accept rate)
    • Neural Information Processing Systems (NIPS) Conference 2009, 2011 (25% accept rate)
    • IEEE Infocom 2010 (17.5% accept rate)
    • Best paper and best poster awards

• Cross-pollination
  – Exposing computer scientists to statistical and social networking ideas
  – Exposing social scientists and statisticians to computational modeling ideas


Impact: Workshops and Invited Talks

• 2010 Political Networks Conference
  – Workshop on Network Analysis
  – Presented and run by Butts and students Spiro, Fitzhugh, Almquist

• Invited Talks: Conferences and Workshops
  – useR! 2010 Conference at NIST (Handcock, 2010)
  – 2010 Summer School on Social Networks (Butts)
  – Mining and Learning with Graphs Workshop (Smyth, 2010)
  – NSF/SFI Workshop on Statistical Methods for the Analysis of Network Data (Handcock, 2009)
  – International Workshop on Graph-Theoretic Methods in Computer Science (Eppstein, 2009)
  – Quantitative Methods in Social Science (QMSS) Seminar, Dublin (Almquist, 2010)
  – + many more…

• Invited Talks: Universities
  – Stanford, UCLA, Georgia Tech, U Mass, Brown, etc.


Impact: the Next Generation

• Where students have gone…
  – Academia: University of Massachusetts, Karlsruhe, Utrecht
  – Research Labs/Industry: RAND, Google, Facebook

• Students speaking at major conferences
  – Sunbelt International Social Network Meetings
    • Jasny, Spiro, Fitzhugh, Almquist, DuBois
  – American Sociological Association Meetings
    • Marcum, Jasny, Spiro, Fitzhugh, Almquist
  – 2010 ACM SIGKDD Conference (DuBois)
  – 2011 International Conference on Machine Learning (Vu)
  – 2011 Neural Information Processing Conference (Asuncion)
    • Only 20 talks selected for presentation out of 1400 submissions

• Best paper awards or nominations (Spiro, Hummel, Almquist)

• National fellowships: DuBois (NDSEG), Asuncion (NSF), Navaroli (NDSEG)


… and the Old Generation

• Carter Butts
  – American Sociological Association, Leo A. Goodman award, 2010
  – Highest award to young methodological researchers in social science

• David Eppstein
  – ACM Fellow, 2011

• Michael Goodrich
  – ACM Fellow, IEEE Fellow, 2009

• Mark Handcock
  – Fellow of the American Statistical Association, 2009

• Padhraic Smyth
  – ACM SIGKDD Innovation Award 2009
  – AAAI Fellow 2010


What Next?

• Extending algorithmic advances into statistical modeling
  – Will allow us to scale existing algorithms to much larger data sets

• Develop network models with richer representational power
  – Geographic data, temporal events, text data, actor covariates, heterogeneity, etc.

• Systematically evaluate and test different approaches
  – Evaluate ability of models to predict over time, to impute missing values, etc.

• Apply these approaches to high visibility problems and data sets
  – e.g., online social interaction such as email, Facebook, Twitter, blogs

• Make software publicly available


SESSION 1:
9:20 External-Memory Network Analysis Algorithms for Naturally Sparse Graphs
  Michael Goodrich, Professor, Computer Science, UC Irvine
9:40 New Models for Exponential Family Random Networks
  Ian Fellows, PhD student, Statistics, UCLA
10:00 Set-Differencing Data Structures
  David Eppstein, Professor, Computer Science, UC Irvine
10:30 BREAK

SESSION 2:
10:50 Hierarchical Statistical Models for Event-Based Social Network Data
  Chris DuBois, PhD student, Statistics, UC Irvine
11:10 Scalable Statistical Estimation Methods for Large Time-Varying Networks
  Dave Hunter, Professor, Statistics, Penn State
11:30 Large-Scale Social Network Analysis of Facebook Data
  Emma Spiro, PhD student, Sociology, UC Irvine


12:00 – 1:00 LUNCH
  PIs + visitors at the University Club
  Students + postdocs in 6011

1:00 – 2:30 POSTER SESSION with PhD students and postdoctoral fellows

SESSION 3 (2:30 – 3:40):
2:30 Order-Stable Parametrizations for ERGMs
  Carter Butts, Professor, Sociology, UC Irvine
2:50 ERGMs for Rank-Order Statistics
  Pavel Krivitsky, Postdoctoral Fellow, Statistics, Penn State
3:10 Estimating the Size of Hidden Populations based on Partially-Observed Network Data
  Mark Handcock, Professor, Statistics, UCLA

3:40 WRAP-UP, CLOSING COMMENTS (+ BEVERAGE BREAK)
4:00 ADJOURN
5:00 ADJOURN


Logistics

• All talks and posters in this room

• Wireless

• Restrooms


Additional Resources

Project Web site: http://www.datalab.uci.edu/muri/

Slides and Posters from AHM: http://www.datalab.uci.edu/muri/june2011/

Publications: http://www.datalab.uci.edu/muri/publications.php

Software: http://csde.washington.edu/statnet/

Data Sets: http://networkdata.ics.uci.edu/resources.php


QUESTIONS?