The geographically varying correlates of car non-ownership in 2001 Census output areas of England
INTRODUCTION
The data give the results of an experimental geographically weighted regression (GWR) model (Fotheringham, Brunsdon & Charlton 2002) fitted using the statistical package R to Census Output Areas in England. The model uses 2001 Census data and an implementation of the spgwr library available at http://cran.r-project.org/ adapted for the distributed (parallel) computing environment available as part of the UK’s National Grid Service (http://www.grid-support.ac.uk/).
Specifically, a simple GWR analysis has been undertaken to predict the proportion of households without a car (or van) in n = 165 665 Output Areas. Car non-ownership is generally regarded as an indicator of material and social disadvantage (reflecting an inability to afford and insure a vehicle which both causes and sustains disadvantage in the job market where access to employment becomes an issue) (Clark & Wang 2005). However, that is not true everywhere: car ownership is lower in London, for example, presumably because public transport offers a credible alternative (Harris, Sleight & Webber 2005, pp.219-220).
The predictor variables incorporated social, economic, demographic and ethnicity information and were:
X1: Proportion of persons of working age unemployed
X2: Proportion of households in public housing
X3: Proportion of households that are lone parent households
X4: Proportion of persons 16 or above that are single
X5: Proportion of persons that are “white British”
ABOUT GWR
At its simplest, GWR can be understood as treating the global regression model
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon \tag{1}$$
as a special case of the model
$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \tag{2}$$
where the difference between (1) and (2) is that the former is spatially invariant whereas, in (2), $\beta_k(u_i, v_i)$ is a realisation of the continuous function $\beta_k(u, v)$ at point i (Fotheringham, Brunsdon & Charlton 2002).
UK Data Archive Study Number 6100 - Geographically Varying Correlates of Car Non-Ownership in Census Output Areas of England, 2001
In other words, GWR assumes the nature of the relationship between Y and the Xs to vary continuously across space, an assumption which is the opposite of a standard regression methodology that takes the relationship to be everywhere the same. This, the global model, for the data is:
Y = 0.09 + 1.60X1 + 0.46X2 – 0.32X3 + 0.38X4 – 0.07X5 + ε
Each of the predictor variables is significant at greater than 99% confidence, but this is hardly surprising and not especially instructive: it is a consequence of the large size of n. What we are interested in, and what the GWR data tell us, is how the regression coefficients vary across England. The data reveal geographical variation in the correlates of car non-ownership.
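The contrast between a global and a local fit can be illustrated with a minimal base R sketch. It uses simulated data rather than the Census data, and the Gaussian kernel, the bandwidth of 1500 and the helper name local_fit are assumptions made for illustration; it is not the spgwr implementation.

```r
# Minimal GWR-style sketch in base R on simulated data (not the Census data).
set.seed(1)
n <- 200
u <- runif(n, 0, 10000)          # simulated eastings (metres)
v <- runif(n, 0, 10000)          # simulated northings (metres)
x <- runif(n)                    # a single simulated predictor
beta1 <- 0.2 + u / 10000         # the true slope drifts from west to east
y <- 0.1 + beta1 * x + rnorm(n, sd = 0.05)

# Weighted least squares centred on one location (u0, v0).
# local_fit is a hypothetical helper; the Gaussian kernel is an assumed choice.
local_fit <- function(u0, v0, bandwidth) {
  d <- sqrt((u - u0)^2 + (v - v0)^2)   # distance from the fit point to each observation
  w <- exp(-0.5 * (d / bandwidth)^2)   # nearer observations receive larger weights
  coef(lm(y ~ x, weights = w))
}

global <- coef(lm(y ~ x))                          # one slope for the whole study area
west   <- local_fit(1000, 5000, bandwidth = 1500)  # local slope in the west
east   <- local_fit(9000, 5000, bandwidth = 1500)  # local slope in the east
print(rbind(global, west, east))
```

The global model reports a single slope, whereas the two local fits recover different slopes at different locations, which is the kind of variation the GWR coefficients in this data set describe.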
SUMMARY OF THE DATA
The data confirm that the correlates of car non-ownership vary across England. For example, whereas the global model predicted a 10% increase in the proportion of lone parent households would be associated with an average decrease in car non-ownership of 3.2%, the GWR model suggests the decrease could typically be from 9.6% to 1.5%, depending upon the location. Because of the double negative, it is easier to interpret the results as showing that as rates of lone parenthood increase so too do rates of car ownership, but that the effect is greater in some places than in others.
The GWR methodology fits weighted regression models at each of 165 665 locations separately. The regression coefficients obtained at each of those locations are contained in the data set. A summary of their variation is given in Table 1. Note that there were 6 168 locations where no model could be fitted.
            Minimum   1st Quartile   Median   Mean      3rd Quartile   Maximum   NAs
Intercept   -50.48    0.05           0.21     0.2392    0.39           37.45     6168
X1          -28.97    0.16           0.56     0.5729    1.01           35.41     6168
X2          -4.1      0.46           0.53     0.5211    0.59           4.67      6168
X3          -136.1    -1             -0.58    -0.5587   -0.13          97.61     6168
X4          -3.59     0.13           0.28     0.2692    0.42           12.62     6168
X5          -37.02    -0.33          -0.15    -0.1773   0              50.73     6168
Table 1. Comparing the coefficients of a GWR model predicting car non-ownership for n = 165,665 census output areas in England and Wales.
FORMAT OF THE DATA
The data are saved as comma-separated values with the following headers and meanings:
Easting The National Grid Reference of a centroid within the Output Area
Northing The National Grid Reference of a centroid within the Output Area
Intercept The intercept for the weighted regression model at the centroid location
beta1 The regression coefficient for X1 at the location
beta2 The regression coefficient for X2 at the location
beta3 The regression coefficient for X3 at the location
beta4 The regression coefficient for X4 at the location
beta5 The regression coefficient for X5 at the location
Note: the GWR model was fitted with a fixed bandwidth of 2643.3 metres.
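As a minimal sketch of how a file in this format might be read into R: the helper name read_gwr, the temporary demonstration file and its values are all made up for illustration, and it is an assumption that locations with no fitted model are stored as NA.

```r
# read_gwr is a hypothetical helper: it reads the CSV and checks the documented headers.
read_gwr <- function(path) {
  coefs <- read.csv(path, header = TRUE)
  expected <- c("Easting", "Northing", "Intercept", paste0("beta", 1:5))
  stopifnot(all(expected %in% names(coefs)))
  coefs
}

# Demonstration on a tiny made-up file (the values are NOT taken from the data set).
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Easting,Northing,Intercept,beta1,beta2,beta3,beta4,beta5",
  "530000,180000,0.21,0.56,0.53,-0.58,0.28,-0.15",
  "407000,287000,0.05,0.16,0.46,-1.00,0.13,-0.33",
  "433000,279000,NA,NA,NA,NA,NA,NA"
), demo_path)

coefs  <- read_gwr(demo_path)
fitted <- coefs[complete.cases(coefs), ]   # keep locations where a model was fitted (NA coding assumed)
print(summary(fitted$beta3))               # spread of the lone parent coefficient
```

To work with the deposited data, the path of the downloaded file would replace demo_path; summary() applied to each coefficient column should then give figures comparable to those in Table 1.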
SUGGESTED APPLICATIONS OF THE DATA
For teaching, to suggest how standard regression techniques may conceal geographical differences.
Example
Figures 1 and 2 show some of the spatial variation in the coefficient for the lone parent variable. Figure 1 is for London, and Figure 2 is for Birmingham and Coventry. Both are cartograms. Cartograms are produced by warping a Euclidean view of geographic space to permit the size of each circle to be proportional to the population density at the location that circle represents (Dorling 1996). Consequently, the positions of the motorways are indicative, included only to aid interpretation of the maps.
The interesting areas are those shaded yellow or red, as these are the places where an increase in lone parenthood is least associated with increased car ownership. In Birmingham and Coventry these places are near to the city centres; in London they are more dispersed but prevalent to the East of the city. If there is an advantage in the job market to be had by owning a car, then the results might suggest rather different experiences (or meanings) of lone parenthood across geographical space.
Figure 1. A cartogram showing the spatial variation in the lone parent coefficient across London.
Figure 2. A cartogram showing the spatial variation in the lone parent coefficient across Birmingham and Coventry.
ACKNOWLEDGMENT
The research was funded by the ESRC as part of the National Centre for e-social science’s small grants projects (RES-149-25-1041).
REFERENCES
Clark, W.A.V. & Wang, W.W., 2005. Job Access and Commute Penalties: Balancing Work and Residence in Los Angeles. Urban Geography, 26(7), 610-626.
Dorling, D., 1996. Area Cartograms: Their Use and Creation, Norwich: Environmental Publications.
Fotheringham, A.S., Brunsdon, C. & Charlton, M., 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, Chichester: John Wiley & Sons.
Harris, R., Sleight, P. & Webber, R., 2005. Geodemographics: GIS and Neighbourhood Targeting, Chichester: John Wiley & Sons.
Grid Enabled Spatial Regression Models
With Application to Deprivation Indices
Background
Geographically Weighted Regression (GWR) (Fotheringham, Brunsdon, & Charlton 2002), like many other methods of spatial analysis, is characterised by multiple repeat testing as the data are divided into geographical regions and also randomly redistributed many times to simulate the likelihood that the results obtained from the analysis are actually due to chance. Each of these tests requires computer time so, given a large dataset such as the UK Census statistics, running the analysis on a standard machine can take a long time – in the order of days or weeks. This is far from ideal when the purpose of many spatial statistics is to be exploratory: allowing the user to interact with data and find spatial patterns of association within them.
Consequently, the application of high performance computing to spatial analysis has long been of interest to social and geographical scientists. Of particular note is the pioneering work undertaken by Stan Openshaw at the University of Newcastle and at the Centre for Computational Geography at Leeds University, of which an exemplar is the Geographical Analysis Machine (GAM) (Openshaw et al. 1987). More recently, Martin (2005) has identified the potential for geocomputation to develop under the rubric of high performance computer (grid) networks and e- (electronic) social science. He identifies four essential research issues for e-social science: automated data mining; visualization of spatial data uncertainty; incorporation of an explicitly spatial dimension into simulation modeling; and neighborhood classification from multi-source distributed datasets.
Missing from Martin's list is the explicit use of parallelization to speed up the calculations associated with spatial statistics. What GAM, GWR and other methods of spatially localized analysis have in common is a general sequence of:
1. calibrating the size of the kernel or search window to the amount of spatial autocorrelation found in the attributes of the data being examined;
2. creating spatially overlapping subsets of the data to reflect this;
3. allowing the kernel to pass from one subset to the next, applying a statistical test in each;
4. simulating confidence intervals for the statistical result by detaching the data attributes from the geographical coordinates at which they were captured, then repeatedly reattaching the attributes to randomly selected locations and applying the test again.
For many spatial statistical procedures, each of the stages of calibration, fitting and assessing significance can be parallelized with processes that will operate without communication with the others (since, for example, the outcome of a model fitted to one spatial subset of the data does not affect or modify the outcome of a model fitted to another). Thus, each of the processes can be sent to separate computational nodes, their outputs pooled and then incorporated into the overall calculation.
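Step 4 of the sequence above can be sketched in base R. The simulated data, the kernel-weighted mean used as the local test, the bandwidth and the helper name local_mean are all assumptions chosen for illustration; they are not part of GAM or GWR as such.

```r
set.seed(42)
n <- 300
u <- runif(n)                                   # simulated x coordinates
v <- runif(n)                                   # simulated y coordinates
in_cluster <- sqrt((u - 0.5)^2 + (v - 0.5)^2) < 0.15
z <- rnorm(n) + 2 * in_cluster                  # attribute with high values near the centre

# Local statistic (hypothetical helper): kernel-weighted mean of z around (u0, v0).
local_mean <- function(z, u0, v0, bandwidth = 0.1) {
  w <- exp(-0.5 * ((u - u0)^2 + (v - v0)^2) / bandwidth^2)
  sum(w * z) / sum(w)
}

observed <- local_mean(z, 0.5, 0.5)

# Detach the attribute from the coordinates, reattach it at random, and test again.
n_sim <- 999
simulated <- vapply(seq_len(n_sim),
                    function(i) local_mean(sample(z), 0.5, 0.5),
                    numeric(1))

# Pseudo p-value: share of runs (counting the observed one) at least as large as observed.
p_value <- (1 + sum(simulated >= observed)) / (n_sim + 1)
print(p_value)
```

Each of the 999 randomisations is independent of the others, which is why this stage lends itself to being split across separate processors.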
To cite this output: Harris, Richard et al (2007). Grid Enabled Spatial Regression Models (With Application to Deprivation Indices): Full Research Report ESRC End of Award Report, RES-149-25-1041. Swindon: ESRC
By ‘grid-enabling’ GWR we hope to balance its computationally intensive requirements with the need of users for faster run times, and to showcase it as an example of how methods of spatial analysis can be operated on the UK’s National Grid infrastructure.
Objectives
The overarching objective of the research was:
To develop a prototype, grid-enabled implementation of GWR that can be used by other researchers and which builds upon existing e-Science infrastructure (the National Grid Service).
This objective was completed in full.
Other aims were:
(1) to demonstrate the use of GWR and grid technologies with regards to the important social and policy issue of understanding and measuring the spatially distributed correlates of deprivation;
(2) to work collaboratively across institutions (the University of Bristol, the University of Leicester and the National University of Ireland, Maynooth) and across disciplines (geographical and computational science), to foster research networks;
(3) to ‘connect’ and liaise with currently funded research projects that complement this proposal; and
(4) to foster the continued professional development of the research team and particularly the researcher employed to carry forward and develop e-science within the social sciences.
Of these, (3) and (4) were fully addressed, (2) was met in kind, and (1) was met in part and is on-going (see Results section below).
Very early in the project’s life, collaboration had been explored with the Lancaster University Centre for e-Science. An opportunity to formalise this arrangement came with the unexpected departure of the research assistant at Bristol to other employment, and the subsequent failure to recruit a suitable replacement. Collaboration therefore focused on three institutions: the University of Bristol, the University of Leicester and the University of Lancaster, though the work was also presented at the National University of Ireland, Maynooth.
Methods
At the outset of the project four versions of GWR were available to us to develop. The first was a Windows-based version with a graphical user interface. This is the software produced by the GWR development team at the National University of Ireland, Maynooth. The second was the ‘raw’ Fortran 77 underpinning the Windows delivery. The third was an existing implementation of GWR written in R by one of the research team and originators of GWR (Professor Chris Brunsdon). The fourth was an open source library for R: the spgwr package developed by Bivand and Yu and hosted on the Comprehensive R Archive Network.
The spgwr package for R provides functions for calibration of the bandwidth and calculation of the regression parameters using the methods of Geographically Weighted Regression. Clearly, it would be advantageous to re-use this well-used and supported package as much as possible when developing a parallel version of the GWR methods. Doing so would minimize the additional skills required by existing users of spgwr when adapting to a parallel implementation. In addition, it would reduce the overall development effort required to implement parallel GWR.
In fact, this approach to parallelising GWR has already been taken by the authors of the spgwr package to make GWR available on multiprocessor systems. This was achieved using the snow package which provides a set of methods for evaluating R functions in parallel using PVM, sockets or threads. However, snow does not provide the means of employing a large number of distributed systems such as are typically encountered in a grid environment.
R is an open source package for statistical computing and graphics and has a large and growing user base, many of whom provide libraries (or ‘add-ins’) extending its functionality. A prior National Centre for e-social science (NCeSS) project called SABRE in R had involved the Lancaster University Centre for e-Science developing a parallel implementation of SABRE (a program for the statistical analysis of binary, ordinal and count recurrent events) as R Objects. That project had used GROWL “to provide user friendly access to GRID resources for applications accessible from desktop computer” (www.ncess.ac.uk/research/quantitative/cqess/growl/). The method of choice became to develop a version of GWR that could run the existing spgwr library from R on a desktop computer but do the processing remotely on the National Grid infrastructure.
A package, entitled multiR, was developed for this purpose, using GROWL technology. Unlike snow, multiR does provide a client R interface for parallel computing in a high-throughput distributed computing environment. The package, multiR, is a client/server system which provides a means of submitting a group of tasks for processing on multiple systems that are remote from the client system. The remote systems could be processors on a local high performance cluster, a Condor pool or combinations of these and possibly many other types of system. The multiR client interface is distributed as a package for R and its usage is similar in many respects to that of the R function lapply. The multiR concept is to provide a means of specifying an R function for multiple invocation with varying arguments where the function is evaluated on multiple processors. By doing so it allows R to become a programming environment for coarse-grained parallel processing.
The multiR client/server system is based on a three-tier architecture. It is implemented in this way because such an architectural design pattern overcomes many of the difficulties associated with providing and administrating a secure service where the resources employed to implement the service are manifold, varied and constantly changing. Figure 1 outlines the principle of the architecture. Clients use R to define the functions that require evaluation and use multiR to submit a job (the function invocations) to the multiR server. The multiR server then delegates these tasks to whatever resources it employs. The progress of jobs that have been submitted by a client may be monitored within R and the results “harvested” by commands provided within the multiR package. The function invocations which comprise the job are evaluated within R sessions invoked on the host systems, which act as proxies for the client R session.
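The lapply-style pattern described above can be sketched with R's bundled parallel package standing in for multiR. This is only an analogy: multiR's own functions are not used or reproduced here, the two local worker processes are a substitute for remote grid resources, and fit_one is a hypothetical task.

```r
library(parallel)

# A hypothetical task: one function, to be invoked many times with varying arguments.
fit_one <- function(seed) {
  set.seed(seed)
  x <- runif(50)
  y <- 1 + 2 * x + rnorm(50, sd = 0.1)
  unname(coef(lm(y ~ x))[2])       # return the fitted slope
}

seeds <- 1:8

# Sequential evaluation on the desktop.
sequential <- lapply(seeds, fit_one)

# The same call pattern, evaluated by two local worker processes.
cl <- makeCluster(2)
distributed <- parLapply(cl, seeds, fit_one)
stopCluster(cl)

print(all.equal(sequential, distributed))
```

Because each invocation depends only on its own argument, the results do not depend on where or in what order the invocations are evaluated, which is the property a coarse-grained system of this kind relies on.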
Figure 1. The three-tier client/server architecture employed by multiR.
Results
The main ‘result’ of the research was the spgwr.dist and multiR packages for R which were then used to fit a GWR model of car non-ownership using 165,665 data points. These are now described.
The spgwr.dist package for R contains the functions required for grid-enabled GWR (dist is an abbreviation of distributed, i.e. it is designed for distributed computing). It uses a further R package called multiR which is installed locally but sets up R to run on a distributed computing platform by identifying a remote multiR server by ‘name’ and by the port number on which the service is hosted. The multiR package and server are the middleware between the user’s desktop and the grid system on which the GWR analysis will be completed. The multiR session requires three security credentials to be supplied: a multiR proxy certificate, a certificate validating the multiR server and the user’s proxy credentials for the National Grid Service (NGS). The last of these is generated using multiR’s create.proxy function from the user’s certificate key pair issued by the UK e-Science Certification Authority. (The actual certificate obtained from https://ca.grid-support.ac.uk is exported from a web browser in .p12 format; that file then needs converting into two separate but paired files by using the OpenSSL toolkit: see www.grid-support.ac.uk/content/view/67/184/ for detail.) Specifying the multiR certificate will shortly become unnecessary and the associated argument will be deprecated in future versions of multiR.
Currently, a typical session in R begins as:
> library(spgwr.dist) # loads the spgwr.dist and multiR packages
> session <- multiR.session("stats-grid.hpc.lancs.ac.uk", "50000",
+ "~/multiR.CA.pem", "~/grid.proxy.pem")
The analysis then continues in much the same way as for the existing spgwr package. Where, in spgwr, the bandwidth for GWR is calculated on the user’s desktop using a function of the form
> bw = gwr.sel(y~x, data, coords)
for the grid-enabled version we use
> bw = gwr.sel.dist(session, y~x, data, coords, max.processors)
Similarly, where the model is fitted in spgwr using
> gwr.model = gwr(y~x, data, coords, bw)
it is fitted in spgwr.dist using
> gwr.model = gwr.dist(session, y~x, data, coords, bw,
+ max.processors)
The only difference, from the user’s perspective, is that the additional parameter “session” contains the information required to connect to the multiR server, and the parameter “max.processors” (which is optional) specifies the maximum number of processors the GWR fit should run on.
Imagine a comma delimited file called “census.csv” containing six columns of data. The first four are attribute data, headed Y, X1, X2, X3, and the remaining two define a point coordinate associated with where the data were collected. Those are headed Easting and Northing. To fit a GWR model on the grid system at Lancaster, exploring the geographically varying relationship $y_{(i,j)} \sim x_{1(i,j)} + x_{2(i,j)} + x_{3(i,j)}$, the process would be:
> mydata = read.csv("census.csv", header=TRUE)
> locations = cbind(mydata$Easting, mydata$Northing)
> bw = gwr.sel.dist(session, Y~X1+X2+X3, data=mydata,
+ coords=locations, max.processors=20)
> gwr.model = gwr.dist(session, Y~X1+X2+X3, data=mydata,
+ coords=locations, bandwidth=bw, max.processors=20)
There is little sense in using the spgwr.dist package for ‘small’ datasets of about 1000 observations or less. For those, the Windows-based software or the existing spgwr package in R will be a better choice: faster, because of the greater need for communication and data exchange that the use of a distributed system introduces.
However, GWR does not scale well. The reason is that GWR fits a distance-weighted regression model, usually of the form
$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$
to each of m points within a continuous, geographic space: $(u_i, v_i)$ denotes the geographic coordinates of the i-th of the m points. For a model that examines a regression relationship at each of 100,000 census zones, n = 100,000 and it would appear that there are 100,000 regression surfaces which need to be calculated. Whilst true, there are also prior calculations to be completed.
First, because the regression is distance-weighted, the distances between the points need to be calculated. In the example, an n by n matrix is required. More generally, because the fit points need not be the same locations as those for which the data are collected, then given a GWR model with n data points and m fit points, the distance matrix, D, is of size m by n. Nevertheless, the number of calculations required to obtain the distance matrix approximates to the order of $n^2$, D: $O(n^2)$.
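A minimal base R sketch of the m by n distance matrix follows. The helper name distance_matrix and the coordinates are made up for illustration, and Euclidean distance on grid coordinates is assumed.

```r
# Euclidean distances between m fit points and n data points (an m-by-n matrix).
# distance_matrix is a hypothetical helper, not a function from spgwr.
distance_matrix <- function(fit_xy, data_xy) {
  dx <- outer(fit_xy[, 1], data_xy[, 1], "-")   # m-by-n differences in easting
  dy <- outer(fit_xy[, 2], data_xy[, 2], "-")   # m-by-n differences in northing
  sqrt(dx^2 + dy^2)
}

data_xy <- cbind(c(0, 3, 0), c(0, 4, 10))   # n = 3 made-up data points
fit_xy  <- cbind(c(0, 3), c(0, 4))          # m = 2 made-up fit points
D <- distance_matrix(fit_xy, data_xy)
print(dim(D))      # m rows, n columns
print(D[1, 2])     # distance from fit point 1 to data point 2
```

Every fit point is paired with every data point, so the number of entries grows as m × n; with m = n = 165,665 the full matrix would hold more than 10^10 distances, which is why the calculation is worth dividing between processors.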
Having calculated the distance matrix, the m (or n) regression models are fitted. However, this is not sufficient. First the bandwidth controlling the distance weighting must be found and optimised (using a cross-validation technique or based on the Akaike information criterion, AIC). If it takes g iterations for the optimisation procedure to converge on a preferred bandwidth, then there are actually g × m regression models to fit.
Returning to the example of census zones, where m = n = 100,000 (which is about two thirds of the total number of 2001 census output areas in England and Wales), we estimate that using a desktop implementation of GWR it would take about half a day to derive D and about two weeks to obtain the bandwidth. This is ‘do-able’ but conflicts with the notion of using GWR as a tool for exploratory data analysis (to in some sense ‘interact’ with the data). As the times for the various stages suggest, the main bottleneck is not in finding the distance matrix but in calibrating the bandwidth: each iteration is of order $O(n^3)$.
It is unsurprising to discover that prior to this research (and to the best of our knowledge) the largest dataset for which GWR has been attempted was of size n = 12,493 (Fotheringham, Brunsdon, & Charlton 2002). Here, we have demonstrated the potential for grid-enabled GWR by using a dataset with greater than ten times that number of observations.
Specifically, a simple analysis has been undertaken to predict the proportion of households without a car (or van) in n = 165,665 output areas using data drawn from the 2001 Census. The predictor variables incorporate social, economic, demographic and ethnicity information and are:
X1: Proportion of persons of working age unemployed
X2: Proportion of households in public housing
X3: Proportion of households that are lone parent households
X4: Proportion of persons 16 or above that are single
X5: Proportion of persons that are “white British”
The reason for modelling car non-ownership is that it is generally regarded as an indicator of material and social disadvantage (reflecting an inability to afford and insure a vehicle which both causes and sustains disadvantage in the job market where access to employment becomes an issue) (Clark & Wang 2005). However, that is not true everywhere: car ownership is lower in London, for example, presumably because public transport offers a credible alternative (Harris, Sleight, & Webber 2005, p.219-220).
The regression coefficients for a standard, ordinary least squares regression model fitted to all of the 165,665 observations are 1.61, 0.46, -0.32, 0.38 and -0.07, respectively. Each is significant at greater than 99% confidence, but this is hardly surprising and not especially instructive: it is a consequence of the large size of n.
More interesting is how the coefficients vary spatially, as estimated by GWR and indicated in Table 1 by the interquartile range for each $\beta_k(u, v)$. For example, whereas the general model predicts a 10% increase in the proportion of lone parent households would be associated with an average decrease in car non-ownership of 3.2%, the GWR model suggests a decrease in the (interquartile) range from 9.6% to 1.5%. Because of the double negative, it is easier to interpret the results as showing that as rates of lone parenthood increase so too do rates of car ownership, but that the effect is greater in some places than in others.
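The arithmetic behind these figures can be written out as a short worked calculation. It assumes that a "10% increase" means a change of 0.10 in the lone parent proportion, and it uses the global coefficient and the GWR first and third quartiles for the lone parent variable as reported in Table 1.

```latex
\begin{align*}
\Delta y &= \beta_3 \,\Delta x_3, \qquad \Delta x_3 = 0.10 \\
\text{Global:} \quad \Delta y &= -0.32 \times 0.10 = -0.032 \\
\text{GWR, first quartile:} \quad \Delta y &= -0.96 \times 0.10 = -0.096 \\
\text{GWR, third quartile:} \quad \Delta y &= -0.15 \times 0.10 = -0.015
\end{align*}
```

These correspond to the decreases of 3.2%, 9.6% and 1.5% quoted in the text.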
                 Global   GWR β(u, v):
                 β        Q1      Median   Mean    Q3      IQR
intercept        0.09     0.08    0.22     0.24    0.36    0.28
unemployment     1.61     0.21    0.61     0.62    1.01    0.80
public housing   0.46     0.47    0.52     0.52    0.58    0.11
lone parents     -0.32    -0.96   -0.58    -0.56   -0.15   0.81
single           0.38     0.18    0.29     0.29    0.42    0.24
white British    -0.07    -0.31   -0.16    -0.19   -0.04   0.27
Table 1. Comparing the coefficients of a standard linear model and a GWR model predicting car non-ownership for n = 165,665 census output areas in England and Wales.
Figures 2 and 3 show some of the spatial variation in the coefficient for the lone parent variable. Figure 2 is for London, and Figure 3 is for Birmingham and Coventry. Both are cartograms. Cartograms are produced by warping a Euclidean view of geographic space to permit the size of each circle to be proportional to the population density at the location that circle represents (Dorling 1996). Consequently, the positions of the motorways are indicative, included only to aid interpretation of the maps.
Figure 2. A cartogram showing the spatial variation in the lone parent coefficient across London.
Figure 3. A cartogram showing the spatial variation in the lone parent coefficient across Birmingham and Coventry.
The interesting areas are those shaded yellow or red, as these are the places where an increase in lone parenthood is least associated with increased car ownership. In Birmingham and Coventry these places are near to the city centres; in London they are more dispersed but prevalent to the East of the city. If there is an advantage in the job market to be had by owning a car, then the results might suggest rather different experiences (or meanings) of lone parenthood across geographical space.
The GWR model for the n = m = 165,665 fit points took about three hours to calculate using the North West Grid Service (at Lancaster). Clearly this is not ‘immediate’ but also not unreasonable from the user’s perspective (especially given that it is not running or consuming resources on their own PC).
In a sense, however, we ‘cheated’. We estimate that it takes about 1.5 seconds to fit a single regression surface using generalised geographically weighted regression. If it takes 50 iterations to find the GWR bandwidth for 100,000 fit points and the calculation is distributed over 100 processors, then the total time to obtain the model would be about 1.5 × 50 × (100,000 / 100) seconds – about 20 hours. Whether it is really necessary to calibrate the bandwidth using all the fit points is a moot point and an area for further study – the effects of sampling on GWR need to be more fully understood. In any case, a random sample of about 50,000 was used (the gwr.sel.dist function can generate a random sample of the points if desired).
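The estimate above can be reproduced with a few lines of R. The function name calibration_hours is hypothetical, the inputs are the figures quoted in the text, and the calculation assumes the work divides evenly between processors with no communication overhead.

```r
# Wall-clock estimate for bandwidth calibration spread evenly over several processors.
# calibration_hours is a hypothetical helper; it ignores communication overhead.
calibration_hours <- function(seconds_per_fit, iterations, fit_points, processors) {
  seconds_per_fit * iterations * (fit_points / processors) / 3600
}

full    <- calibration_hours(1.5, 50, 100000, 100)   # figures quoted in the text
sampled <- calibration_hours(1.5, 50, 50000, 100)    # same assumptions, 50,000 sampled points
print(round(c(full = full, sampled = sampled), 1))
```

Under these assumptions, halving the number of points used for calibration halves the estimate; the 50 iterations and 100 processors are the illustrative values from the text, not a record of the actual run.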
A number of points follow:
• A generalised geographically weighted regression was used to fit the model of car non-ownership (primarily to check it worked). However, it is the more basic (weighted least squares and Gaussian) model which is described with the spgwr.dist package, above. It will run faster. (In fact, it takes about 14 hours to run on the entire data set – about one third faster.)
• If it is satisfactory to use a sampling strategy when calibrating the bandwidth then it may also be sufficient when investigating spatial variation in the regression coefficients. This would seem appropriate for the exploratory stages of an analysis.
• Recall that the processing ‘bottleneck’ is the regression fit. Many other spatial statistics (for example various types of hot spot analysis) are simpler than GWR in that, basically, they compare the rate, incidence or density of an event or feature at one place against the corresponding values for other places across the study region. The derivation of such statistics can still be treated as embarrassingly parallel (with different processors operating on different subsets of the data) and because they are more descriptive than explanatory, they will run considerably faster – there is no regression required.
Activities
The research and project have been presented at the second and third international conferences on e-social science (Manchester 2006 and Ann Arbor, Michigan 2007, respectively), and at the 9th International Conference on GeoComputation (Maynooth, Ireland, 2007). It has also been presented at the recent NCeSS Showcase (Manchester 2008) and will be at the forthcoming Digital Geography in a Web 2.0 World conference (London 2008) as well as at the R User’s conference (Dortmund, Germany 2008). A free training workshop in using grid-enabled GWR was undertaken (Lancaster 2007).
The research was genuinely collaborative, involving members of the University of Leicester’s SPLINT (Spatial Literacy in Teaching) group and, especially, the Lancaster University Centre for e-Science. The latter collaboration was not envisioned in the original proposal and was largely serendipitous; it was also extremely successful and may represent something of a model by which computer and social scientists may collaborate.
We are also grateful for the input of Professor Roger Bivand, a member of the R core development team, with whom time was spent in Bergen, Norway.
Outputs
Papers are being prepared for the International Journal of Geographical Information Science, focusing on the more technical aspects of how spatial statistics may be grid-enabled, and also for the Transactions in GIS journal, providing a more applied case study. A further paper is being prepared for the Journal of Statistical Software and we hope to produce a short feature for the Scientific Computing World magazine.
Nevertheless, the main outputs are the multiR and spgwr.dist packages for R which are being ‘cleaned’ to make them freely accessible on CRAN (the Comprehensive R Archive Network). Beta versions may be requested from members of the project team.
The training manual will be uploaded to a suitable website – initially by updating the content at http://www.esrcsocietytoday.ac.uk/ESRCInfoCentre/Minisite/gwr/index.html
Impacts
The development of the multiR package and server is not specific to GWR but provides a more general link between (desktop) R and grid resources. It is a development of the existing GROWL software and further enhances the use of the North West Grid as a hub for statistical operations of relevance to social scientists.
Future research priorities
There are four lines of priority which arise from the project.
• Methodological: the impact of sampling on GWR needs to be better understood, as may the impact of multicollinearity and correlation among local regression coefficients in geographically weighted regression (Wheeler & Tiefelsdorf 2005). More positively, there is a possibility to resolve one of the simplifying assumptions of basic GWR: that a single measure of spatial autocorrelation (one bandwidth) is sufficient for the entire study region. The simple possibility is to regionalise the data, process each region separately, and compare the bandwidths.
• Developmental: the application of multiR is not limited to GWR. A toolbox of statistical operations could be offered running in an R grid environment, including types of hot spot analysis and geostatistical operations including kriging: in fact, almost any process that can be separated into subsets (not necessarily spatial) of the data.
• Data linkage: to census and other data via the National Grid Service. See the GEMS project at http://pascal.mvc.mcc.ac.uk:9080/gems for example.
• Collaborative: to extend the collaborative model of working between computer and social scientists, for example by ‘discipline hopping’ funding.
References
Clark, W.A.V. & Wang, W.W., 2005. Job Access and Commute Penalties: Balancing Work and Residence in Los Angeles. Urban Geography, 26(7), p.610-626.
Dorling, D., 1996. Area Cartograms: Their Use and Creation, Norwich: Environmental Publications.
Fotheringham, A.S., Brunsdon, C., & Charlton, M., 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, Chichester: John Wiley & Sons.
Harris, R., Sleight, P., & Webber, R., 2005. Geodemographics: GIS and Neighbourhood Targeting, Chichester: John Wiley & Sons.
Martin, D., 2005. Socioeconomic GeoComputation and E-Social Science. Transactions in GIS, 9(1), p.1-3.
Openshaw, S. et al., 1987. A Mark I Geographical Analysis Machine for the Automated Analysis of Point Datasets. International Journal of Geographical Information Systems, 1(4), p.335-358.
Wheeler, D. & Tiefelsdorf, M., 2005. Multicollinearity and correlation among local regression coefficients in geographically weighted regression. Journal of Geographical Systems, 7(2), p.161-187.