Enhancing a Social Science
Model-building Workflow with
Interactive Visualisation
Cagatay Turkay, Aidan Slingsby,
Kaisa Lahtinen, Sarah Butt and Jason Dykes
giCentre & Centre for Comparative Social Surveys at City University London
ESANN 2016, 29 April 2016
“We (social scientists) need (data-based)
models that we can understand and
explain so that we can defend them to
our peers in full confidence.”
A quote that motivates this work (from collaborators within our AddResponse project)
Image from: Lahtinen, K. et al. (2015). Informing Non-Response Bias Model Creation in Social
Surveys with Visualisation. Poster VIS 2015
Numerical models predict phenomena or act as a
simulation of the phenomena being investigated
Good predictive power is often desired in models, BUT (in
some fields) explanatory power is also crucial (see Shmueli, 2010 for a detailed
discussion)
[*] Shmueli, Galit. "To explain or to predict?" Statistical Science (2010): 289-310.
AddResponse Project -- https://blogs.city.ac.uk/addresponse/
… utilise organically generated auxiliary data (from commercial
transactions, public administration and other sources) to understand propensity
to respond and eventually tackle nonresponse bias (i.e., the bias that arises when
respondents differ from nonrespondents).
AddResponse - Details
• European Social Survey (ESS) UK 2012-13
• 4,520 households
• linked to auxiliary data from:
• administrative sources
• commercial consumer profiling
• open-source data
• 401 auxiliary variables
• 32 survey response variables (only for the respondents)
e.g., Proportion of house-sharing adults
e.g., Sports facilities within walking distance
Existing workflow
• Iteratively add and/or remove variables from a
logistic regression model
• Assess the changes through model fit metrics
(e.g., AIC, McFadden's pseudo-R²)
• Put up a sticker!
• Highly manual but involved!
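The two fit metrics named above can be computed from log-likelihoods alone. The following is a minimal sketch, not the project's own scripts: the outcome vector, the fitted probabilities and the parameter count are made-up illustrative values, and the function names are hypothetical.

```python
import math


def log_likelihood(outcomes, probabilities):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probabilities)
    )


def aic(ll, n_params):
    """Akaike information criterion: lower is better, extra parameters are penalised."""
    return 2 * n_params - 2 * ll


def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R²: 1 minus the ratio of model to intercept-only log-likelihood."""
    return 1.0 - ll_model / ll_null


# Illustrative data only (assumption): 1 = household responded, 0 = did not.
responded = [1, 1, 0, 1, 0, 0, 1, 0]
# Hypothetical fitted probabilities from some logistic regression model.
fitted = [0.8, 0.7, 0.3, 0.9, 0.2, 0.4, 0.6, 0.1]

# The intercept-only (null) model predicts the overall response rate for everyone.
base_rate = sum(responded) / len(responded)
ll_null = log_likelihood(responded, [base_rate] * len(responded))
ll_model = log_likelihood(responded, fitted)

print("AIC:", aic(ll_model, n_params=3))
print("McFadden pseudo-R2:", mcfadden_r2(ll_model, ll_null))
```

Comparing these two numbers before and after adding or removing a variable is the step that the existing workflow repeats by hand.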
Key roles for interactive visualisation
• Incorporating Theory
• Exploring variables
• Interactively building models
• Considering Geography
• Recording the model-building process, i.e., provenance
Two prototypes: VarXplorer and ModelBuilder
Prototype-1: VarXplorer
Co-variation plot
Correlations with indicators
Theory-related meta-data
Interactive modelling
Link to the Video: http://goo.gl/XNiOIX
Exploring variables – 1: Investigate Covariation
- Compute pairwise correlations among all
401 variables
- Use these as a distance matrix and
project to 2D (using MDS)
- Visualise on a scatterplot where each
point is a variable
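The steps above can be sketched in a few lines. This is a hedged illustration rather than the VarXplorer implementation: the slides do not say how correlations are turned into distances, so the sketch assumes 1 - |r|, and it uses classical (Torgerson) MDS solved with power iteration so that it needs only the standard library. All function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distances(columns):
    """Distance between variables i and j, ASSUMED here to be 1 - |r_ij|."""
    k = len(columns)
    return [
        [1.0 - abs(pearson(columns[i], columns[j])) for j in range(k)]
        for i in range(k)
    ]


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Classical MDS: double-centre squared distances, keep the top eigenvectors."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # B = -1/2 * (D^2 - row means - column means + grand mean); D is symmetric.
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):  # power iteration towards the dominant eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        # A negative eigenvalue (non-Euclidean distances) contributes no coordinate.
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][d] = v[i] * scale
        for i in range(n):  # deflate so the next pass finds the next eigenvector
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Three toy variables (made-up values): the first two move together, the third does not.
columns = [
    [1, 2, 3, 4, 5, 6],
    [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    [2, -1, 3, -2, 1, 0],
]
points = classical_mds(correlation_distances(columns))
for name, (px, py) in zip(["var_a", "var_b", "var_c"], points):
    print(name, round(px, 3), round(py, 3))
```

Each output point stands for one variable, so strongly correlated variables land close together on the scatterplot.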
Exploring variables – 2: Correlation with indicators
- Compute correlations of each variable with all 32
response variables + response rate
- Use this as meta-data on variables to
check whether they relate to indicators
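As a rough sketch of this step, each auxiliary variable can be given a small profile of correlations, one per indicator, which then serves as meta-data for filtering or ranking. The variable names, indicator names and values below are invented for illustration, and the sketch assumes all series are aligned over the same units.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def indicator_profile(aux_vars, indicators):
    """Map each auxiliary variable to its correlation with every indicator."""
    return {
        name: {ind: pearson(values, ind_values) for ind, ind_values in indicators.items()}
        for name, values in aux_vars.items()
    }


def rank_by_indicator(profile, indicator):
    """Variable names ordered by absolute correlation with one indicator, strongest first."""
    return sorted(profile, key=lambda name: abs(profile[name][indicator]), reverse=True)


# Hypothetical names and made-up values, aligned over the same five units.
aux_vars = {
    "sports_facilities_walkable": [1, 2, 3, 4, 5],
    "prop_house_sharing_adults": [5, 3, 4, 1, 2],
}
indicators = {
    "happiness": [2, 4, 6, 8, 10],
    "response_rate": [0.5, 0.4, 0.6, 0.3, 0.5],
}

profile = indicator_profile(aux_vars, indicators)
print(rank_by_indicator(profile, "happiness"))
```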
Incorporating Theory-related data
- Associate variables to social-science
concepts and theory
- Concepts relate to theories
- Variables act as proxies for concepts
- Use these as meta-data on variables
and visualise through histograms
Concepts, e.g., deprivation or quality of life
Theories, e.g., social isolation or social disorganisation
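One possible way to hold this meta-data is a pair of lookup tables: variables to concepts, and concepts to theories. The sketch below is an assumption about structure only. The concept and theory names come from the slide, but the variable names and every specific association are invented for illustration and are not the project's actual coding.

```python
from collections import Counter

# ASSUMED, illustrative associations: variable -> concepts it acts as a proxy for.
variable_concepts = {
    "prop_house_sharing_adults": ["quality of life"],
    "sports_facilities_walkable": ["quality of life"],
    "benefit_claimant_rate": ["deprivation"],
}

# ASSUMED, illustrative associations: concept -> theories it relates to.
concept_theories = {
    "deprivation": ["social disorganisation"],
    "quality of life": ["social isolation"],
}


def theories_for(variable):
    """Theories reachable from a variable through its concepts."""
    found = []
    for concept in variable_concepts.get(variable, []):
        for theory in concept_theories.get(concept, []):
            if theory not in found:
                found.append(theory)
    return found


def concept_histogram(selected_variables):
    """Count how many selected variables map to each concept (the bars of a histogram)."""
    counts = Counter()
    for variable in selected_variables:
        counts.update(variable_concepts.get(variable, []))
    return counts


selection = ["prop_house_sharing_adults", "sports_facilities_walkable"]
print(concept_histogram(selection))
print(theories_for("benefit_claimant_rate"))
```

Recomputing the counts whenever the selection changes is what would let a histogram respond to brushing in the scatterplot.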
Prototype-2: ModelBuilder
Variable selection
Model provenance
Interactive modelling (through R)
Model quality metrics
Interactively building models & evaluating them
- R scripts are called with the variable
selections and the variable to predict
(response or ESS variable)
- Quality metrics (AIC, McFadden) &
variable weights are visualised
Interactive model building is also available in VarXplorer, with variable weights
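The hand-off from the interface to R could look roughly like the following. This is a sketch under stated assumptions: the script name `fit_logistic.R` and the argument layout are hypothetical, and the command is only constructed here, not executed. `Rscript` itself is R's standard command-line front end.

```python
def build_formula(target, predictors):
    """Build an R model formula such as 'response ~ var_a + var_b'."""
    if not predictors:
        raise ValueError("select at least one predictor variable")
    if target in predictors:
        raise ValueError("the variable to predict cannot also be a predictor")
    return f"{target} ~ {' + '.join(predictors)}"


def build_rscript_command(target, predictors, script="fit_logistic.R"):
    """Argument list for Rscript; the script name is a hypothetical placeholder."""
    return ["Rscript", script, build_formula(target, predictors)]


# Hypothetical variable names standing in for a user's current selection.
command = build_rscript_command(
    "response",
    ["prop_house_sharing_adults", "sports_facilities_walkable"],
)
print(command)
# With R installed and such a script present, the list could be passed to
# subprocess.run(command, capture_output=True, text=True) and the printed
# metrics parsed from stdout.
```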
Considering Geography
- Facet data (geographically) into 12 regions
- Build local models
- Evaluate locally
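The facet, fit and evaluate loop can be sketched as below. It is an illustration only: it uses a single predictor, fits each logistic regression by plain gradient ascent to stay dependency-free, and scores each region with McFadden's pseudo-R². The region labels and data rows are made up, and each region is assumed to contain both outcomes.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, steps=2000, lr=0.1):
    """Fit p = sigmoid(a + b*x) by gradient ascent on the mean log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def log_likelihood(xs, ys, a, b):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def local_models(rows):
    """Group (region, x, y) rows by region, fit one model per region, return pseudo-R²."""
    by_region = defaultdict(list)
    for region, x, y in rows:
        by_region[region].append((x, y))
    scores = {}
    for region, pairs in by_region.items():
        xs = [x for x, _ in pairs]
        ys = [y for _, y in pairs]
        a, b = fit_logistic(xs, ys)
        rate = sum(ys) / len(ys)  # assumes 0 < rate < 1 within the region
        ll_null = len(ys) * (rate * math.log(rate) + (1 - rate) * math.log(1 - rate))
        scores[region] = 1.0 - log_likelihood(xs, ys, a, b) / ll_null
    return scores


# Made-up rows: (region label, predictor value, responded 0/1).
rows = [("London", x, y) for x, y in zip(range(6), [0, 0, 1, 0, 1, 1])]
rows += [("Wales", x, y) for x, y in zip(range(6), [1, 0, 1, 0, 1, 0])]

for region, score in local_models(rows).items():
    print(region, round(score, 3))
```

A map or small-multiple view of these per-region scores shows where the same set of variables explains response well and where it does not.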
Model provenance & annotations
- Save and analyse the model-building
trail
- Mark dead-ends and good models
- Attach notes to models
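A provenance trail of this kind can be represented as a list of model records that point back to the model they were derived from. The class and field names below are hypothetical, a minimal sketch of one possible structure rather than ModelBuilder's internals, and the metric values are invented.

```python
from dataclasses import dataclass, field
from typing import Optional

STATUSES = {"candidate", "dead-end", "good"}


@dataclass
class ModelRecord:
    model_id: int
    variables: list
    target: str
    metrics: dict
    parent_id: Optional[int] = None  # the model this one was derived from
    status: str = "candidate"
    notes: list = field(default_factory=list)


class ProvenanceTrail:
    def __init__(self):
        self.records = []

    def record(self, variables, target, metrics, parent_id=None):
        """Save one step of the model-building trail and return its id."""
        rec = ModelRecord(len(self.records), list(variables), target, dict(metrics), parent_id)
        self.records.append(rec)
        return rec.model_id

    def mark(self, model_id, status):
        """Flag a model as a dead-end or a good model."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.records[model_id].status = status

    def annotate(self, model_id, note):
        """Attach a free-text note to a model."""
        self.records[model_id].notes.append(note)

    def lineage(self, model_id):
        """Ids from the first ancestor down to the given model."""
        chain = []
        current = model_id
        while current is not None:
            chain.append(current)
            current = self.records[current].parent_id
        return chain[::-1]


# Made-up example: variable names and AIC values are illustrative only.
trail = ProvenanceTrail()
first = trail.record(["var_a"], "response", {"AIC": 120.4})
second = trail.record(["var_a", "var_b"], "response", {"AIC": 118.9}, parent_id=first)
third = trail.record(["var_a", "var_c"], "response", {"AIC": 125.0}, parent_id=first)
trail.mark(second, "good")
trail.mark(third, "dead-end")
trail.annotate(third, "var_c overlaps with var_a; no improvement")
print(trail.lineage(second))
```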
A brief example of the modelling process
1. Select two concepts: economic circumstances and quality of life
2. Select variables that are distinct and relevant
3. Select variables that correlate with an ESS indicator (happiness)
3.1 Observe that they relate to “Social Isolation”
4. Use these variables as a starting point, check local variations and plug into existing scripts
4.1 Model performs “better” in South-East UK and in Greater London
Lessons learned
• Enhanced analysis through informed use of computation
• Interactive visual methods improve reliability and
interpretability
• Improved trust in models
• Tight integration enables quick hypothesis prototyping
• Important to communicate the certainty of the findings
Looking into the future
• Explanatory models, not only predictive models
• Incorporating more complex methods (already
incorporated random forests)
• Other ways to make models more accessible?
• Use models & findings as scientific evidence?
Acknowledgments
• giCentre team @ City
• ADDResponse project funded by the UK Economic
and Social Research Council (grant ES/L013118/1)
Thank you!
@cagatay_turkay
http://staff.city.ac.uk/cagatay.turkay.1/
https://blogs.city.ac.uk/addresponse/
http://www.gicentre.net/
!! We are hiring !!
* Researcher in visualisation of cyber-security data
(H2020 funded RIA)
* PhD studentships
Deadlines in late May and June
check giCentre.net