e-social science


Upload: arlen

Post on 06-Jan-2016


Page 1: E-Social Science

E-Social Science

• What is e-Science?

• E-Science and e-Social Science

• E-Social Science and Longitudinal Data

• Examples of the Computational Problems we Currently Face (BHPS, YCS)

• Existing Web Based Tools and Possible New Tools

• Need for a VRE

• The e-Social Science Program

Page 2: E-Social Science

What is e-Science?

The technology: the three exponentials

• Computer speed doubles every 18 months. Our ability to model and simulate complex systems increases at the same rate;

• Storage density doubles every 12 months. Some groups are already talking about data sets that are a Petabyte in size and that will grow by 10 Petabytes/year in 5 years' time;

• Network bandwidth doubles every 9 months.
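To make the three doubling times above concrete, the following sketch compounds each one over a fixed horizon. The 5-year horizon and the function name `growth_factor` are illustrative choices, not part of the original slides.

```python
def growth_factor(years, doubling_months):
    """Multiplicative growth after `years` if a quantity doubles every `doubling_months`."""
    return 2.0 ** (12.0 * years / doubling_months)

# Doubling times quoted on the slide (in months).
doubling_times = [
    ("computer speed", 18),
    ("storage density", 12),
    ("network bandwidth", 9),
]

for name, months in doubling_times:
    print(f"{name}: x{growth_factor(5, months):.0f} over 5 years")
```

Over 5 years this gives roughly a 10-fold, 32-fold and 100-fold increase respectively, which is why bandwidth and storage are expected to outpace raw computer speed.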

Page 3: E-Social Science

What is e-Science?

The GRID concept puts all three components together and makes them even more important. There will be several different types of GRID, e.g.

1. Computational GRIDs for high-performance computation;

2. Access GRIDs for collaborative visualization involving distant researchers;

3. Data GRIDs for moving large volumes of data;

4. Sensor GRIDs for real-time monitoring (e.g. traffic and pedestrian flows, electronic transactions).

Page 4: E-Social Science
Page 5: E-Social Science

[Figure: map of the UK e-Science Grid, showing centres at Cambridge, Newcastle, Edinburgh (NeSC), Oxford, Glasgow (NeSC), Manchester, Cardiff, Southampton, London, Belfast, DL, RAL and Hinxton.]

Page 6: E-Social Science

Some Features of Social Science Research

• We want to develop evidence based substantive theory. We want to know “what determines what”, e.g. long term unemployment and social exclusion

• We want to explore the consequences of policy changes on individual behaviour, e.g. encouragement to stay on at school on educational attainment, truancy, and social exclusion

• Data may be small (<10GB) but they are complex

Page 7: E-Social Science

Some Generic Features of Social Research

Observational Data, usually full of holes: missing data; measurement error; dropout

Substantive Theory (what determines what): not comprehensive; often contradictory

Methodology: only partially developed

Page 8: E-Social Science

Cluster Effects (CE)

• Most large scale longitudinal surveys use multi-stage sample designs to obtain 'representative' samples; this procedure often creates cluster effects, e.g. BHPS (households), YCS (schools).

• Elaborate procedures have been developed to take cluster effects into account by means of shared random effects in the model e.g. MLwiN, Stata (Gllamm).

• The estimation of non-identity link CE models, e.g. probit, is computationally demanding. The quick approximations, e.g. conditional estimators, do not work in the presence of endogenous variables.
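To illustrate why shared random effects make probit models expensive, here is a minimal sketch of the marginal likelihood of a single cluster under a random-intercept probit. It is not the algorithm used by MLwiN or gllamm: the midpoint rule, the function names and the example numbers are all assumptions made for illustration. The point is that every cluster needs its own numerical integral at every function evaluation.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def cluster_likelihood(ys, xbs, sigma, n_points=400, limit=8.0):
    """Marginal likelihood of one cluster under a random-intercept probit.

    ys    : binary responses (0 or 1) for the members of the cluster
    xbs   : linear predictors for the same members
    sigma : standard deviation of the shared random effect
    The integral over the random effect u ~ N(0, 1) is approximated by
    the midpoint rule on [-limit, limit].
    """
    step = 2.0 * limit / n_points
    total = 0.0
    for k in range(n_points):
        u = -limit + (k + 0.5) * step
        prob = 1.0
        for y, xb in zip(ys, xbs):
            sign = 1.0 if y == 1 else -1.0
            prob *= normal_cdf(sign * (xb + sigma * u))
        total += prob * normal_pdf(u) * step
    return total

# One hypothetical household with three members (illustrative numbers only).
print(cluster_likelihood([1, 0, 1], [0.3, -0.2, 0.5], sigma=1.0))
```

With one observation per cluster the integral has a closed form, but with several members sharing the same random effect it does not, so the cost grows with the number of clusters times the number of integration points.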

Page 9: E-Social Science

Measurement Errors (ME)

• Ignoring ME can seriously mislead the quantification of the link between explanatory and response variables;

• In observational studies, it is rarely possible to measure all relevant covariates accurately, e.g. age, educational attainment;

• ME in one covariate can bias the association between other covariates and the response variable, even if those other covariates are measured without error;

• Repeated measures and longitudinal data provide the opportunity to deal with ME in explanatory variables, but this adds to the computational demands of the analysis.
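The first point above can be seen in a small simulation. This is a sketch under simple assumptions (one covariate, classical additive error, linear regression), with simulated rather than survey data; the function names and parameter values are hypothetical.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def simulate_attenuation(n=20000, beta=1.0, me_sd=1.0, seed=1):
    """Return the slope estimated with the true covariate and with a noisy one."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    # Observed covariate = true covariate + independent measurement error.
    x_obs = [x + rng.gauss(0.0, me_sd) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_obs, y)

clean, noisy = simulate_attenuation()
print(f"slope with true covariate:     {clean:.2f}")
print(f"slope with mismeasured covariate: {noisy:.2f}")
```

With equal variances for the true covariate and the error, the estimated slope shrinks towards roughly half its true value, which is the kind of misleading quantification the slide warns about.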

Page 10: E-Social Science

Missing Data, Dropout and Selection

• All of the major data sets available to the British social science community, such as the YCS, BHPS and NCDS, contain missing data and dropout.

• It is mostly non-ignorable, and non-ignorable missing data and dropout are a source of bias.

• When there is missing data it is important to try and model, as realistically as possible, the process by which the observed subjects have been retained in the sample, otherwise we will not know whether the selection process has only retained subjects with certain characteristics.

• Some sample designs create selection effects.

• These features add to the computational demands of the analysis.
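The bias from non-ignorable dropout can be sketched with simulated data. The retention rules below (a constant 50% chance versus a logistic function of the outcome itself) are assumptions chosen for illustration, not a model of the YCS, BHPS or NCDS.

```python
import math
import random

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def simulate_dropout(n=20000, seed=2):
    """Compare the full-sample mean with means after two kinds of dropout."""
    rng = random.Random(seed)
    outcomes = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Ignorable dropout: every subject is retained with the same probability.
    ignorable = [y for y in outcomes if rng.random() < 0.5]
    # Non-ignorable dropout: retention probability rises with the outcome itself.
    nonignorable = [y for y in outcomes
                    if rng.random() < 1.0 / (1.0 + math.exp(-y))]
    return mean(outcomes), mean(ignorable), mean(nonignorable)

full, ign, nonign = simulate_dropout()
print(f"full sample:           {full:.2f}")
print(f"ignorable dropout:     {ign:.2f}")
print(f"non-ignorable dropout: {nonign:.2f}")
```

Only the non-ignorable case shifts the mean away from the full-sample value, because the retained subjects are no longer representative.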

Page 11: E-Social Science

Endogenous effects

• The curse of endogenous effects: everything seems to depend on everything else.

• We need multiprocess models to disentangle this complexity, which adds to the computation.

• Longitudinal data can provide the opportunity to disentangle endogenous effects from correlated errors.

Page 12: E-Social Science

Take Just One of These Complications (Endogenous effects)

• The YCS is a multi-stage stratified random sample of individuals aged 16-17.

• These individuals were contacted by post three times at annual intervals, at age 16-17, 17-18 and 18-19 (sweeps 1, 2 and 3, respectively).

• I use YCS6, which covers young people eligible to leave school in 1990-91, who are then observed over the 1992-94 period.

Page 13: E-Social Science

Part-time work and truancy are potential determinants of educational attainment

• A comprehensive model will allow us to disentangle the observable, direct, effects of truancy on educational attainment from any effects that arise from correlation in the errors (unobserved effects).

Page 14: E-Social Science

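Trivariate Ordered Probit Model (Path Diagram)

[Path diagram: each observed response (Yp, Yt, Yq) is driven by a latent variable (Y*p, Y*t, Y*q) with its own error term (ep, et, eq).]

Independent Errors (ep, et, eq)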

Page 15: E-Social Science

Correlated Errors

[Path diagram: the same trivariate model, with the errors ep, et and eq now allowed to be correlated.]
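The path diagrams can be written out as equations. The slides do not define the subscripts or give the model formally, so what follows is a sketch under assumptions: p, t and q are read as part-time work, truancy and qualifications (attainment), the covariate vector x_i, the coefficients and the cut-points tau are notation introduced here, and the direction of the endogenous effects follows the statement that part-time work and truancy are potential determinants of attainment.

```latex
% Latent equations for individual i (notation assumed, not from the slides)
\begin{align*}
Y^*_{ip} &= x_i'\beta_p + e_{ip}, \\
Y^*_{it} &= x_i'\beta_t + e_{it}, \\
Y^*_{iq} &= x_i'\beta_q + \gamma_p Y_{ip} + \gamma_t Y_{it} + e_{iq}.
\end{align*}

% Ordered observation rule for each response j in {p, t, q}
\[
Y_{ij} = k \quad\text{if}\quad \tau_{j,k-1} < Y^*_{ij} \le \tau_{j,k}.
\]

% Independent errors versus correlated errors
\[
(e_{ip}, e_{it}, e_{iq})' \sim N(0, I_3)
\qquad\text{versus}\qquad
(e_{ip}, e_{it}, e_{iq})' \sim N(0, R),\quad
R = \begin{pmatrix}
1 & \rho_{pt} & \rho_{pq} \\
\rho_{pt} & 1 & \rho_{tq} \\
\rho_{pq} & \rho_{tq} & 1
\end{pmatrix}.
\]
```

With independent errors the likelihood factorises into three univariate ordered probits. With correlated errors each individual's contribution is a trivariate normal rectangle probability, which is where the trivariate integrals come from.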

Page 16: E-Social Science

Comprehensive Model Results

• The direct effect of part-time work on attainment changes sign.

• The correlation between part-time work and attainment has a different sign to the direct effect of part-time work on attainment; the direct effect has also become significant.

• The correlation between truancy and attainment has the same sign as the direct effect of truancy on attainment, and the direct effect is reduced.

Page 17: E-Social Science

Problems and Model Extensions

• The model takes up to a month to estimate on a P4: it has 3 linear predictors and 169 parameters, and needs 8,496 trivariate integrals for each function evaluation.

• Results change as our model becomes more comprehensive.

• Need to explore other directions for the endogenous effects.

Page 18: E-Social Science

Going Parallel

• If we farm out the calculations for the integrals to different processors, we get linear improvements in speed;

• e.g. if it takes T on one processor it takes T/200 on 200 processors, i.e. 1 month goes to less than 4 hours;

• This improvement is present all the way up to one processor per sample member, at which point 1 month goes to 6 minutes.
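The farming-out idea can be sketched as follows. This is not the author's implementation: a one-dimensional integral stands in for each trivariate integral, the names `individual_integral`, `chunk_loglik`, `parallel_loglik` and `ideal_hours` are hypothetical, and a thread pool is used only to keep the sketch portable. On a real cluster the chunks would go to separate processes or MPI ranks, since Python threads do not speed up CPU-bound work.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def individual_integral(task, n_points=200, limit=8.0):
    """Midpoint-rule integral for one individual (stand-in for a trivariate integral)."""
    xb, sigma = task
    step = 2.0 * limit / n_points
    total = 0.0
    for k in range(n_points):
        u = -limit + (k + 0.5) * step
        density = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        total += normal_cdf(xb + sigma * u) * density * step
    return total

def chunk_loglik(tasks):
    """Log-likelihood contribution of one chunk of individuals."""
    return sum(math.log(individual_integral(t)) for t in tasks)

def parallel_loglik(tasks, n_workers):
    """Split individuals across workers and add up the partial log-likelihoods."""
    chunks = [tasks[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(chunk_loglik, chunks))

def ideal_hours(serial_hours, n_workers):
    """Run time under perfectly linear speed-up."""
    return serial_hours / n_workers

tasks = [(0.01 * i, 1.0) for i in range(100)]
print(abs(chunk_loglik(tasks) - parallel_loglik(tasks, n_workers=4)) < 1e-9)
print(ideal_hours(30 * 24, 200))  # a 30-day month on 200 processors, in hours
```

The speed-up is linear because the integrals for different individuals are independent: each worker needs only its own chunk of data and returns a single number to be summed.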

Page 19: E-Social Science

• Rweb allows users to submit R jobs and get output back to their web session;

• Rweb needs more menus; the extensive R statistical library is not used.

(Existing web based tools)

Page 20: E-Social Science

• Allows 66 major datasets to be explored online;

• Only uses one data set at a time;

• Has very limited facilities for sub-setting and none for fusing;

• Restricted statistical facilities, e.g. descriptive analysis, linear regression;

• No facilities for handling missing data.

(Existing web based tools)

Page 21: E-Social Science

New Tools: Joining Up in the Analysis Cycle

[Diagram of the analysis cycle, with boxes labelled: Main ESDS Data Sets; Select Data Set and Appropriate Variables; Contextual Data (TTWA Data, NOMIS); Merge Files: Add Variables; Working Data; Results.]
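The "Merge Files: Add Variables" step in the cycle above can be sketched as a keyed join between survey records and area-level contextual data. The field names (`ttwa_code`, `unemployment_rate`, `truant`) and the values are hypothetical, and nothing here reflects an actual ESDS or NOMIS interface.

```python
def merge_contextual(survey_rows, contextual_rows, key):
    """Add area-level variables to every survey record that shares the key."""
    lookup = {row[key]: row for row in contextual_rows}
    merged = []
    for row in survey_rows:
        # Records without a matching area keep only their survey variables.
        extra = lookup.get(row[key], {})
        merged.append({**extra, **row})
    return merged

# Hypothetical individual-level survey records.
survey = [
    {"person_id": 1, "ttwa_code": "A", "truant": 0},
    {"person_id": 2, "ttwa_code": "B", "truant": 1},
    {"person_id": 3, "ttwa_code": "Z", "truant": 0},
]
# Hypothetical contextual data, one row per travel-to-work area.
contextual = [
    {"ttwa_code": "A", "unemployment_rate": 6.1},
    {"ttwa_code": "B", "unemployment_rate": 9.4},
]

working_data = merge_contextual(survey, contextual, key="ttwa_code")
for record in working_data:
    print(record)
```

Keeping unmatched survey records, rather than dropping them, avoids introducing a further selection effect at the data-management stage.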

Page 22: E-Social Science

New Tools: Linking Components

[Diagram: Data Management components A, B and C and Analysis components A, B and C, linked through Middleware.]

Page 23: E-Social Science

New Tools: Simultaneous Analysis

[Diagram: the National Pupils Database at the centre, analysed simultaneously by several groups: Psychologists Analysis; Geographers Analysis; Locational Analysis B; Economists Analysis; Educationalists Analysis.]

Page 24: E-Social Science

Check Step

We need to keep our focus on the priority ordering:

Scientific challenge > software > hardware

• Software is more important than hardware.

• Software lasts longer than hardware.

• Software development is lagging behind that of hardware: the 'software gap'.

Page 25: E-Social Science

Problems with using the GRID

• Currently requires heroic effort to use it;

• GT2 is very complicated and difficult to install;

• Can make other University services vulnerable if not properly managed;

• User requirements not fully articulated;

• Human factors not addressed: it needs a familiar GUI, pull-down menus, etc.

Page 26: E-Social Science

We need a Virtual Research Environment

“to make the use of e-Science technologies, methodologies and resources easier and more transparent for [researchers] than simply developing bespoke applications on an infrastructure toolkit (such as GT2).”

JISC is interested in funding the VRE.

Page 27: E-Social Science

Functionality/Content of the VRE

[Diagram: components of the VRE: Middleware/Software Library; Access GRID; Security, Authorisation, Authentication; Text Mining/Data services; UK GRID Services; JISC Portal; Portal Management; Semantic GRID Services; VLE Portal; VRE Portal; Awareness Raising Resources; Workshops.]

Page 28: E-Social Science

What might a VRE look like?

Page 29: E-Social Science

UK E-Social Science Programme

There is a growing body of work and projects in this area:

• Centre of Excellence in e-Social Science – DTI Core Programme;

• Pilot projects – ESRC;

• ReDReSS: Resource Discovery for Researchers in e-Social Science – JISC/ESRC;

• Agenda Setting Workshops – JISC/ESRC;

• UK National Grid Service + e-Science Grid – JCSR and DTI Core Programme;

• NCeSS: National Centre for e-Social Science – ESRC;

• QeSSSS: Quantitative e-Social Science Support Service – ESRC (+ future NCeSS nodes).