Scalable Methods for the Analysis of Network-Based Data
P. Smyth: Networks MURI Meeting, Jan 10th 2012
Scalable Methods for the Analysis of Network-Based Data
Annual Review Meeting
Principal Investigator: Professor Padhraic Smyth
Department of Computer Science, University of California, Irvine
Additional project information online at www.datalab.uci.edu/muri
Today’s Annual Review Meeting
• Goals
– Review our research progress
– Discussion, questions, interaction
– Feedback from visitors
• Format
– Introduction
– Research talks
• 20-minute talks + 10 minutes/session for questions/discussion
– Poster session
• 1 to 2:30 (in this room)
– Questions/discussion encouraged during talks
– Several breaks
MURI Project Timeline
• Initial 3-year period
– May 1, 2008 to April 30, 2011
– Funding arrived at universities in October 2008
• 2-year extension
– May 1, 2011 to April 30, 2013
• Meetings (all at UC Irvine)
– Kickoff Meeting, November 2008
– Working Meetings, April 2009, August 2009
– Annual Review, December 2009
– Working Meeting, May 2010
– Annual Review, November 2010
– Working Meeting, June 2011
– Annual Review, January 2012
– … plus many smaller meetings involving subsets of the research team
Motivation
2007: interdisciplinary interest in analysis of large network data sets
Many of the available techniques were descriptive and could not handle
- Prediction
- Missing data
- Covariates, etc.
2007: significant body of statistical theory available on network modeling
Many of the available techniques did not scale up to large data sets and were not widely known, understood, or used
Goal of this MURI project
Develop new statistical network models and algorithms to broaden their scope of application to large, complex, dynamic real-world network data sets
Key Aspects of Our Technical Approach
– Foundational statistical theory for network data
– New methods to handle heterogeneous network data (with time, text, ..)
– Efficient algorithms and data structures for scalable statistical estimation
– Applications to large real-world data sets
– Open-source software for others to build on
Example: Network Dynamics in Classrooms
Chris DuBois, Carter Butts, Padhraic Smyth, Dan McFarland (Stanford)
Example: Email Communication Data
Data: Count matrix of 200,000 email messages among 3000 individuals over 3 months
Problem: Understand communication patterns and predict future communication activity
Challenges: sparse data, missing data, non-stationarity, unseen covariates
C. DuBois, J. Foulds, P. Smyth, ICWSM, 2011
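The count-matrix representation in the email example can be sketched as follows. This is a minimal illustration, assuming the messages are available as (sender, recipient) pairs, that counts are directed, and that each message has a single recipient; the function names and toy events are hypothetical and do not come from the original study. It also hints at why sparsity is listed as a challenge: under these assumptions, 200,000 messages can touch at most about 2.2% of the roughly 9 million ordered pairs among 3000 individuals.

```python
from collections import Counter


def count_matrix(events):
    """Aggregate (sender, recipient) events into sparse directed dyad counts."""
    return Counter((sender, recipient) for sender, recipient in events)


def max_nonzero_fraction(n_messages, n_people):
    """Upper bound on the fraction of ordered pairs with a nonzero count,
    assuming one recipient per message (an assumption, not from the slides)."""
    n_ordered_pairs = n_people * (n_people - 1)
    return min(1.0, n_messages / n_ordered_pairs)


# Hypothetical toy events, for illustration only.
events = [("ann", "bob"), ("ann", "bob"), ("bob", "cat"), ("cat", "ann")]
counts = count_matrix(events)

# Figures from the slide: 200,000 messages among 3000 individuals.
bound = max_nonzero_fraction(200_000, 3000)
```

Storing only the observed dyads in a dictionary-like structure, rather than a dense 3000 x 3000 array, is one common way to handle data this sparse.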
Example: Time Evolution of Emergency Responder Organizational Network for Hurricane Katrina
C. T. Butts, R. Acton, and C. Marcum, Interorganizational collaboration in the Hurricane Katrina response, Journal of Social Structure, 2010
MURI Team
Investigator | University | Department | Expertise | Number of PhD Students | Number of Postdocs
Padhraic Smyth (PI) | UC Irvine | Computer Science | Machine learning | 6 | 1
Carter Butts | UC Irvine | Sociology | Statistical social network analysis | 6 | -
Mark Handcock | UCLA | Statistics | Statistical social network analysis | 2 | 1
Dave Hunter | Penn State | Statistics | Computational statistics | 2 | 2
David Eppstein | UC Irvine | Computer Science | Graph algorithms | 2 | -
Michael Goodrich | UC Irvine | Computer Science | Algorithms and data structures | 2 | 1
Dave Mount | U Maryland | Computer Science | Algorithms and data structures | 2 | -
TOTALS | | | | 22 | 5
Collaboration Network (circa 2007)
Padhraic Smyth
Dave Hunter
Mark Handcock
Dave Mount
Mike Goodrich
David Eppstein
Carter Butts
Collaboration Network
Investigators: Padhraic Smyth, Dave Hunter, Mark Handcock, Dave Mount, Mike Goodrich, David Eppstein, Carter Butts
Other team members: Emma Spiro, Lorien Jasny, Zack Almquist, Chris Marcum, Sean Fitzhugh, Ragupathyraj Vallyvan, Ryan Acton, Chris DuBois, Minkyoung Cho, Eunhui Park, Miruna Petrescu-Prahova, Arthur Asuncion, Jimmy Foulds, Duy Vu, Ruth Hummel, Michael Schweinberger, Ranran Wang, Nick Navaroli, Krista Gile, Darren Strash, Lowell Trott, Maarten Loffler, Joe Simons, Pavel Pszona, Ian Fellows, Romain Thibaux, Pavel Krivitsky
Collaboration Network (same team members, with institution labels added to the figure: U Mass Amherst Computational Social Science Initiative, Intel, RAND, University of Utrecht)
Mapping the Project Terrain
• Domain Theory
• Data Collection
• Network Modeling
• Statistical Theory
• Inference Algorithms
• Data Structures and Algorithms
• Simulation
• Hypothesis Testing
• Prediction/Forecasting
• Decision Support
Statistical Network Modeling Approaches
• Exponential Random Graph Models (ERGMs)
– “Canonical” representation for statistical models of networks
– Can model edge dependencies in very flexible ways
– Fitting of the model can be computationally difficult
• Latent Variable Models
– Edges are conditionally independent given the latent variables
– Can lead to much simpler estimation algorithms than regular ERGMs
– Model interpretation can be difficult
• Event-Based Models
– Edges have time-stamps, models based on survival analysis
– Surprisingly, can be much easier to fit than models for “static” networks
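To make the computational difficulty of ERGM fitting concrete, here is a minimal sketch, assuming an undirected graph and a hypothetical model with only edge and triangle statistics, so that P(y) is proportional to exp(theta_edges * edges(y) + theta_triangles * triangles(y)). It computes exact probabilities by enumerating every possible graph, which is feasible only for very small n because the normalizing constant sums over 2^(n(n-1)/2) graphs. The function names are illustrative, and this is not the estimation method used in the project or in Statnet.

```python
import itertools
import math


def edge_and_triangle_counts(n, edges):
    """Return (number of edges, number of triangles) for an undirected graph.

    Edges are tuples (a, b) with a < b over nodes 0..n-1.
    """
    edge_set = set(edges)
    triangles = sum(
        1
        for a, b, c in itertools.combinations(range(n), 3)
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set
    )
    return len(edge_set), triangles


def ergm_distribution(n, theta_edges, theta_triangles):
    """Exact ERGM probabilities by brute-force enumeration of all graphs on n nodes."""
    dyads = list(itertools.combinations(range(n), 2))
    weights = {}
    for mask in itertools.product([0, 1], repeat=len(dyads)):
        edges = tuple(d for d, on in zip(dyads, mask) if on)
        n_edges, n_tri = edge_and_triangle_counts(n, edges)
        # Unnormalized weight exp(theta . s(y)) for this graph.
        weights[edges] = math.exp(theta_edges * n_edges + theta_triangles * n_tri)
    # Normalizing constant: a sum over 2^(n choose 2) graphs.
    z = sum(weights.values())
    return {graph: w / z for graph, w in weights.items()}


# With both parameters at zero, all 8 graphs on 3 nodes are equally likely.
uniform = ergm_distribution(3, 0.0, 0.0)

# A positive triangle parameter shifts probability toward the complete graph.
clustered = ergm_distribution(3, 0.0, 1.0)
```

For n = 3 there are only 8 graphs, but for n = 10 there are already 2^45, which is why practical fitting relies on approximations such as MCMC-based estimation rather than enumeration.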
Impact: Software
• R Language and Environment
– Open-source, high-level environment for statistical computing
– Default standard among research statisticians, increasingly being adopted by others
– Estimated 250k to 1 million users
• Statnet
– R libraries for analysis of network data
– New contributions from this MURI project:
• Missing data (Gile and Handcock, 2010)
• Relational event models (Butts, 2008-2011)
• Latent-class models (DuBois, 2010)
• Fast clique-finding (Strash, 2011)
• + more…
Impact: Publications
• Approximately 60 peer-reviewed publications
– Across computer science, statistics, and social science
– High visibility
• Science, Butts, 2009
• Journal of the American Statistical Association, Schweinberger, in press
• Annals of Applied Statistics, Gile and Handcock, 2010
• Journal of the ACM, da Fonseca and Mount, 2010
• Journal of Machine Learning Research, Asuncion, Smyth, et al., 2010
– Highly selective conferences
• ACM SIGKDD 2010 (16% accept rate)
• Neural Information Processing Systems (NIPS) Conference 2009, 2011 (25% accept rate)
• IEEE Infocom 2010 (17.5% accept rate)
• Best paper and best poster awards
• Cross-pollination
– Exposing computer scientists to statistical and social networking ideas
– Exposing social scientists and statisticians to computational modeling ideas
Impact: Workshops and Invited Talks
• 2010 Political Networks Conference
– Workshop on Network Analysis
– Presented and run by Butts and students Spiro, Fitzhugh, Almquist
• Invited Talks: Conferences and Workshops
– useR! 2010 Conference at NIST (Handcock, 2010)
– 2010 Summer School on Social Networks (Butts)
– Mining and Learning with Graphs Workshop (Smyth, 2010)
– NSF/SFI Workshop on Statistical Methods for the Analysis of Network Data (Handcock, 2009)
– International Workshop on Graph-Theoretic Methods in Computer Science (Eppstein, 2009)
– Quantitative Methods in Social Science (QMSS) Seminar, Dublin (Almquist, 2010)
– + many more…
• Invited Talks: Universities
– Stanford, UCLA, Georgia Tech, U Mass, Brown, etc.
Impact: the Next Generation
• Where students have gone…
– Academia: University of Massachusetts, Karlsruhe, Utrecht
– Research Labs/Industry: RAND, Google, Facebook
• Students speaking at major conferences
– Sunbelt International Social Network Meetings
• Jasny, Spiro, Fitzhugh, Almquist, DuBois
– American Sociological Association Meetings
• Marcum, Jasny, Spiro, Fitzhugh, Almquist
– 2010 ACM SIGKDD Conference (DuBois)
– 2011 International Conference on Machine Learning (Vu)
– 2011 Neural Information Processing Conference (Asuncion)
• Only 20 talks selected for presentation out of 1400 submissions
• Best paper awards or nominations (Spiro, Hummel, Almquist)
• National fellowships: DuBois (NDSEG), Asuncion (NSF), Navaroli (NDSEG)
… and the Old Generation
• Carter Butts
– American Sociological Association, Leo A. Goodman award, 2010
– Highest award to young methodological researchers in social science
• David Eppstein
– ACM Fellow, 2011
• Michael Goodrich
– ACM Fellow, IEEE Fellow, 2009
• Mark Handcock
– Fellow of the American Statistical Association, 2009
• Padhraic Smyth
– ACM SIGKDD Innovation Award 2009
– AAAI Fellow 2010
What Next?
• Extending algorithmic advances into statistical modeling
– Will allow us to scale existing algorithms to much larger data sets
• Develop network models with richer representational power
– Geographic data, temporal events, text data, actor covariates, heterogeneity, etc.
• Systematically evaluate and test different approaches
– Evaluate ability of models to predict over time, to impute missing values, etc.
• Apply these approaches to high visibility problems and data sets
– e.g., online social interaction such as email, Facebook, Twitter, blogs
• Make software publicly available
SESSION 1:
9:20 External-Memory Network Analysis Algorithms for Naturally Sparse Graphs
Michael Goodrich, Professor, Computer Science, UC Irvine
9:40 New Models for Exponential Family Random Networks
Ian Fellows, PhD student, Statistics, UCLA
10:00 Set-Differencing Data Structures
David Eppstein, Professor, Computer Science, UC Irvine
10:30 BREAK
SESSION 2:
10:50 Hierarchical Statistical Models for Event-Based Social Network Data
Chris DuBois, PhD student, Statistics, UC Irvine
11:10 Scalable Statistical Estimation Methods for Large Time-Varying Networks
Dave Hunter, Professor, Statistics, Penn State
11:30 Large-Scale Social Network Analysis of Facebook Data
Emma Spiro, PhD student, Sociology, UC Irvine
12:00 – 1:00 LUNCH
PIs + visitors at the University Club
Students + postdocs in 6011
1:00 to 2:30 POSTER SESSION with PhD students and postdoctoral fellows
2:30 – 3:40: SESSION 3
2:30 Order-Stable Parametrizations for ERGMs
Carter Butts, Professor, Sociology, UC Irvine
2:50 ERGMs for Rank-Order Statistics
Pavel Krivitsky, Postdoctoral Fellow, Statistics, Penn State
3:10 Estimating the Size of Hidden Populations based on Partially-Observed Network Data
Mark Handcock, Professor, Statistics, UCLA
3:40 WRAP-UP, CLOSING COMMENTS (+ BEVERAGE BREAK)
4:00 ADJOURN
5:00 ADJOURN
Logistics
• All talks and posters in this room
• Wireless
• Restrooms
Additional Resources
Project Web site: http://www.datalab.uci.edu/muri/
Slides and Posters from AHM: http://www.datalab.uci.edu/muri/june2011/
Publications: http://www.datalab.uci.edu/muri/publications.php
Software: http://csde.washington.edu/statnet/
Data Sets: http://networkdata.ics.uci.edu/resources.php