
The geographically varying correlates of car non-ownership in 2001 Census output areas of England

INTRODUCTION

The data give the results of an experimental geographically weighted regression (GWR) model (Fotheringham, Brunsdon & Charlton 2002) fitted to Census Output Areas in England using the statistical package R. The model uses 2001 Census data and an implementation of the spgwr library, available at http://cran.r-project.org/, adapted for the distributed (parallel) computing environment available as part of the UK’s National Grid Service (http://www.grid-support.ac.uk/).

Specifically, a simple GWR analysis has been undertaken to predict the proportion of households without a car (or van) in n = 165 665 Output Areas. Car non-ownership is generally regarded as an indicator of material and social disadvantage: it reflects an inability to afford and insure a vehicle, which both causes and sustains disadvantage in the job market where access to employment becomes an issue (Clark & Wang 2005). However, that is not true everywhere: car ownership is lower in London, for example, presumably because public transport offers a credible alternative (Harris, Sleight & Webber 2005, pp.219-220).

The predictor variables incorporated social, economic, demographic and ethnicity information and were:

X1: Proportion of persons of working age unemployed

X2: Proportion of households in public housing

X3: Proportion of households that are lone parent households

X4: Proportion of persons 16 or above that are single

X5: Proportion of persons that are “white British”

UK Data Archive Study Number 6100 - Geographically Varying Correlates of Car Non-Ownership in Census Output Areas of England, 2001

ABOUT GWR

At its simplest, GWR can be understood as treating the global regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon \tag{1}$$

as a special case of the model

$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \tag{2}$$

where the difference between (1) and (2) is that the former is spatially invariant whereas, in (2), $\beta_k(u_i, v_i)$ is a realisation of the continuous function $\beta_k(u, v)$ at point i (Fotheringham, Brunsdon & Charlton 2002).

In other words, GWR assumes that the relationship between Y and the Xs varies continuously across space. This is the opposite of a standard regression methodology, which takes the relationship to be the same everywhere. The standard (global) model fitted to the data is:

Y = 0.09 + 1.60X1 + 0.46X2 – 0.32X3 + 0.38X4 – 0.07X5 + ε

Each of the predictor variables is significant at greater than 99% confidence, but this is hardly surprising and not especially instructive: it is a consequence of the large size of n. What we are interested in, and what the GWR data tell us, is how the regression coefficients vary across England. The data reveal geographical variation in the correlates of car non-ownership.
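As a minimal sketch of what fitting a regression "at each location" involves, the following base R code fits a weighted least squares model at every point of a small synthetic dataset and compares the local slopes with the single global slope. The simulated data, the function name local_fit, the bandwidth of 2500 and the Gaussian weighting function exp(-0.5 (d/h)^2) are all assumptions made for illustration; they are not the deposited data, and the study's own weighting scheme may differ.

```r
# Illustrative only: synthetic data, not the Census data described in the text.
set.seed(1)
n     <- 200
east  <- runif(n, 0, 10000)
north <- runif(n, 0, 10000)
x1    <- runif(n)
beta1 <- 0.5 + east / 10000                # true slope drifts from west to east
y     <- 0.1 + beta1 * x1 + rnorm(n, sd = 0.05)

# Fit a distance-weighted regression centred on point i (hypothetical helper).
local_fit <- function(i, bandwidth) {
  d <- sqrt((east - east[i])^2 + (north - north[i])^2)
  w <- exp(-0.5 * (d / bandwidth)^2)       # assumed Gaussian kernel
  coef(lm(y ~ x1, weights = w))
}

global_coef <- coef(lm(y ~ x1))            # one slope for the whole study area
local_coefs <- t(sapply(seq_len(n), local_fit, bandwidth = 2500))

print(global_coef)
print(summary(local_coefs[, "x1"]))        # the spread of the local slopes
```

The global fit returns one coefficient per predictor, whereas the local fits return one per predictor per location, which is the kind of output summarised in the data set.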

SUMMARY OF THE DATA

The data confirm that the correlates of car non-ownership vary across England. For example, whereas the global model predicts that a 10% increase in the proportion of lone parent households would be associated with an average decrease in car non-ownership of 3.2%, the GWR model suggests the decrease could typically be anywhere from 9.6% to 1.5%, depending upon the location. Because of the double negative, it is easier to interpret the results as showing that as rates of lone parenthood increase so too do rates of car ownership, but that the effect is greater in some places than in others.

The GWR methodology fits weighted regression models at each of 165 665 locations separately. The regression coefficients obtained at each of those locations are contained in the data set. A summary of their variation is given in Table 1. Note that there were 6 168 locations where no model could be fitted.

            Minimum   1st Quartile   Median   Mean      3rd Quartile   Maximum   NAs
Intercept   -50.48    0.05           0.21     0.2392    0.39           37.45     6168
X1          -28.97    0.16           0.56     0.5729    1.01           35.41     6168
X2          -4.1      0.46           0.53     0.5211    0.59           4.67      6168
X3          -136.1    -1             -0.58    -0.5587   -0.13          97.61     6168
X4          -3.59     0.13           0.28     0.2692    0.42           12.62     6168
X5          -37.02    -0.33          -0.15    -0.1773   0              50.73     6168

Table 1. Comparing the coefficients of a GWR model predicting car non-ownership for n = 165,665 census output areas in England and Wales.

FORMAT OF THE DATA

The data are saved as comma-separated values with the following headers and meanings:

Easting The National Grid Reference of a centroid within the Output Area

Northing The National Grid Reference of a centroid within the Output Area

Intercept The intercept for the weighted regression model at the centroid location

beta1 The regression coefficient for X1 at the location





Note: the GWR model was fitted with a fixed bandwidth of 2643.3 metres.
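A minimal sketch of reading data in this format into R and separating the locations where no model could be fitted is given below. The four rows are invented stand-ins that use only the documented headers; the assumption that unfitted locations are coded as NA should be checked against the deposited file.

```r
# Hypothetical stand-in for the deposited file: invented rows, documented headers.
csv_text <- "Easting,Northing,Intercept,beta1
530000,180000,0.21,0.56
531000,181000,0.05,0.16
407000,287000,NA,NA
433000,279000,0.39,1.01"

coefs <- read.csv(text = csv_text, header = TRUE)

# Keep only locations with a fitted model (assumes missing fits are NA).
fitted_only <- coefs[complete.cases(coefs), ]

print(summary(fitted_only$beta1))   # distribution of the X1 coefficient
print(sum(is.na(coefs$beta1)))      # number of locations without a fit
```

With the real file, read.csv would be given the file path in place of the text argument.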

SUGGESTED APPLICATIONS OF THE DATA

For teaching, to suggest how standard regression techniques may conceal geographical differences.

Example

Figures 1 and 2 show some of the spatial variation in the coefficient for the lone parent variable. Figure 1 is for London, and Figure 2 is for Birmingham and Coventry. Both are cartograms. Cartograms are produced by warping a Euclidean view of geographic space to permit the size of each circle to be proportional to the population density at the location that circle represents (Dorling 1996). Consequently, the positions of the motorways are indicative, included only to aid interpretation of the maps.

The interesting areas are those shaded yellow or red, as these are the places where an increase in lone parenthood is least associated with increased car ownership. In Birmingham and Coventry these places are near to the city centres; in London they are more dispersed but prevalent to the East of the city. If there is an advantage in the job market to be had by owning a car, then the results might suggest rather different experiences (or meanings) of lone parenthood across geographical space.

Figure 1. A cartogram showing the spatial variation in the lone parent coefficient across London.

Figure 2. A cartogram showing the spatial variation in the lone parent coefficient across Birmingham and Coventry.

ACKNOWLEDGMENT

The research was funded by the ESRC as part of the National Centre for e-Social Science’s small grants projects (award RES-149-25-1041).

REFERENCES

Clark, W.A.V. & Wang, W.W., 2005. Job Access and Commute Penalties: Balancing Work and Residence in Los Angeles. Urban Geography, 26(7), 610-626.

Dorling, D., 1996. Area Cartograms: Their Use and Creation, Norwich: Environmental Publications.

Fotheringham, A.S., Brunsdon, C. & Charlton, M., 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, Chichester: John Wiley & Sons.

Harris, R., Sleight, P. & Webber, R., 2005. Geodemographics: GIS and Neighbourhood Targeting, Chichester: John Wiley & Sons.

Grid Enabled Spatial Regression Models (With Application to Deprivation Indices)

Background

Geographically Weighted Regression (GWR) (Fotheringham, Brunsdon, & Charlton 2002), like many other methods of spatial analysis, is characterised by multiple repeat testing: the data are divided into geographical regions and also randomly redistributed many times to simulate the likelihood that the results obtained from the analysis are actually due to chance. Each of these tests requires computer time so, given a large dataset such as the UK Census statistics, running the analysis on a standard machine can take a long time, in the order of days or weeks. This is far from ideal when the purpose of many spatial statistics is to be exploratory: allowing the user to interact with data and find spatial patterns of association within them.

Consequently, the application of high performance computing to spatial analysis has long been of interest to social and geographical scientists. Of particular note is the pioneering work undertaken by Stan Openshaw at the University of Newcastle and at the Centre for Computational Geography at Leeds University, of which an exemplar is the Geographical Analysis Machine (GAM) (Openshaw et al. 1987). More recently, Martin (2005) has identified the potential for geocomputation to develop under the rubric of high performance computer (grid) networks and e-(electronic) social science. He identifies four essential research issues for e-social science: automated data mining; visualization of spatial data uncertainty; incorporation of an explicitly spatial dimension into simulation modeling; and neighborhood classification from multi-source distributed datasets.

Missing from Martin's list is the explicit use of parallelization to speed up the calculations associated with spatial statistics. What GAM, GWR and other methods of spatially localized analysis have in common is a general sequence of:

1. calibrating the size of the kernel or search window to the amount of spatial autocorrelation found in the attributes of the data being examined;

2. creating spatially overlapping subsets of the data to reflect this;

3. allowing the kernel to pass from one subset to the next, applying a statistical test in each;

4. simulating confidence intervals for the statistical result by detaching the data attributes from the geographical coordinates at which they were captured, then repeatedly reattaching the attributes to randomly selected locations and applying the test again.

For many spatial statistical procedures, each of the stages of calibration, fitting and assessing significance can be parallelized into processes that operate without communicating with one another (since, for example, the outcome of a model fitted to one spatial subset of the data does not affect or modify the outcome of a model fitted to another). Thus, each of the processes can be sent to a separate computational node, and their outputs pooled and then incorporated into the overall calculation.
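The pattern of independent tasks followed by pooling can be sketched in base R as below. This is an illustration under stated assumptions rather than the project's implementation: the data are simulated, the ten "regions" and the function name task are hypothetical, and lapply runs the tasks serially as a stand-in for sending each one to a separate node.

```r
# Illustrative only: simulated values split into ten hypothetical spatial subsets.
set.seed(42)
values  <- rnorm(1000)
region  <- rep(1:10, each = 100)
subsets <- split(values, region)

# Each task uses only its own subset, so no task needs to talk to another.
task <- function(v) c(mean = mean(v), n = length(v))

results <- lapply(subsets, task)      # serial stand-in for distributed evaluation
pooled  <- do.call(rbind, results)    # pool the outputs into one table

print(pooled)
```

Because no task reads another task's result, the order in which the tasks complete does not matter; only the pooling step needs all of them.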

To cite this output: Harris, Richard et al (2007). Grid Enabled Spatial Regression Models (With Application to Deprivation Indices): Full Research Report ESRC End of Award Report, RES-149-25-1041. Swindon: ESRC

By ‘grid enabling’ GWR we hope to balance its computationally intensive requirements with the need of users for faster run times, and to showcase it as an example of how methods of spatial analysis can be operated on the UK’s National Grid infrastructure.

Objectives

The overarching objective of the research was:

To develop a prototype, grid-enabled implementation of GWR that can be used by other researchers and which builds upon existing e-Science infrastructure (the National Grid Service).

This objective was completed in full.

Other aims were:

(1) to demonstrate the use of GWR and grid technologies with regard to the important social and policy issue of understanding and measuring the spatially distributed correlates of deprivation;

(2) to work collaboratively across institutions (the University of Bristol, the University of Leicester and the National University of Ireland, Maynooth) and across disciplines (geographical and computational science), to foster research networks;

(3) to ‘connect’ and liaise with currently funded research projects that complement this proposal; and

(4) to foster the continued professional development of the research team and particularly the researcher employed to carry forward and develop e-science within the social sciences.

Of these, (3) and (4) were fully addressed, (2) was met in kind, and (1) was met in part and is ongoing (see Results section below).

Very early in the project’s life, collaboration had been explored with the Lancaster University Centre for e-Science. An opportunity to formalise this arrangement came with the unexpected departure of the research assistant at Bristol to other employment, and the subsequent failure to recruit a suitable replacement. Collaboration therefore focused on three institutions: the University of Bristol, the University of Leicester and the University of Lancaster, though the work was also presented at the National University of Ireland, Maynooth.

Methods

At the outset of the project four versions of GWR were available to us to develop. The first was a Windows-based version with a graphical user interface. This is the software produced by the GWR development team at the National University of Ireland, Maynooth. The second was the ‘raw’ Fortran 77 underpinning the Windows delivery. The third was an existing implementation of GWR written in R by one of the research team and originators of GWR (Professor Chris Brunsdon). The fourth was an open source library for R: the spgwr package developed by Bivand and Yu and hosted on the Comprehensive R Archive Network.


The spgwr package for R provides functions for calibration of the bandwidth and calculation of the regression parameters using the methods of Geographically Weighted Regression. Clearly, it would be advantageous to re-use this well-used and supported package as much as possible when developing a parallel version of the GWR methods. Doing so would minimize the additional skills required by existing users of spgwr when adapting to using a parallel implementation. In addition, it would reduce the overall development effort required to implement parallel GWR.

In fact, this approach to parallelising GWR has already been taken by the authors of the spgwr package to make GWR available on multiprocessor systems. This was achieved using the snow package, which provides a set of methods for evaluating R functions in parallel using PVM, sockets or threads. However, snow does not provide the means of employing a large number of distributed systems such as are typically encountered in a grid environment.

R is an open source package for statistical computing and graphics and has a large and growing user base, many of whom provide libraries (or ‘add-ins’) extending its functionality. A prior National Centre for e-Social Science (NCeSS) project called SABRE in R had involved the Lancaster University Centre for e-Science developing a parallel implementation of SABRE (a program for the statistical analysis of binary, ordinal and count recurrent events) as R Objects. That project had used GROWL “to provide user-friendly access to GRID resources for applications accessible from desktop computer” (www.ncess.ac.uk/research/quantitative/cqess/growl/). The method of choice therefore became to develop a version of GWR that could run the existing spgwr library on a desktop computer using R but do the processing remotely on the National Grid infrastructure.

A package, entitled multiR, was developed for this purpose, using GROWL technology. Unlike snow, multiR does provide a client R interface for parallel computing in a high-throughput distributed computing environment. The package is a client/server system which provides a means of submitting a group of tasks for processing on multiple systems that are remote from the client system. The remote systems could be processors on a local high performance cluster, a Condor pool or combinations of these and possibly many other types of system. The multiR client interface is distributed as a package for R and its usage is similar in many respects to that of the R function lapply. The multiR concept is to provide a means of specifying an R function for multiple invocation with varying arguments where the function is evaluated on multiple processors. By doing so it allows R to become a programming environment for coarse-grained parallel processing.

The multiR client/server system is based on a three-tier architecture. It is implemented in this way because such an architectural design pattern overcomes many of the difficulties associated with providing and administrating a secure service where the resources employed to implement the service are manifold, varied and constantly changing. Figure 1 outlines the principle of the architecture. Clients use R to define the functions that require evaluation and use multiR to submit a job (the function invocations) to the multiR server. The multiR server then delegates these tasks to whatever resources it employs. The progress of jobs that have been submitted by a client may be monitored within R and the results “harvested” by commands provided within the multiR package. The function invocations which comprise the job are evaluated within R sessions invoked on the host systems, which act as proxies for the client R session.

Figure 1. The three-tier client/server architecture employed by multiR.

Results

The main ‘result’ of the research was the spgwr.dist and multiR packages for R, which were then used to fit a GWR model of car non-ownership using 165,665 data points. These are now described.

The spgwr.dist package for R contains the functions required for grid-enabled GWR (dist is an abbreviation of distributed, i.e. it is designed for distributed computing). It uses a further R package called multiR which is installed locally but sets up R to run on a distributed computing platform by identifying a remote multiR server by ‘name’ and by the port number on which the service is hosted. The multiR package and server are the middleware between the user’s desktop and the grid system on which the GWR analysis will be completed. The multiR session requires three security credentials to be supplied: a multiR proxy certificate, a certificate validating the multiR server and the user’s proxy credentials for the National Grid Service (NGS). The last of these is generated using multiR’s create.proxy function from the user’s certificate key pair issued by the UK e-Science Certification Authority. (The actual certificate obtained from https://ca.grid-support.ac.uk is exported from a web browser in .p12 format; that file then needs converting into two separate but paired files by using the OpenSSL toolkit: see www.grid-support.ac.uk/content/view/67/184/ for detail.) Specifying the multiR certificate will shortly become unnecessary and the associated argument will be deprecated in future versions of multiR.

Currently, a typical session in R begins as:

> library(spgwr.dist) # loads the spgwr.dist and multiR packages
> session <- multiR.session("stats-grid.hpc.lancs.ac.uk", "50000",
+ "~/multiR.CA.pem", "~/grid.proxy.pem")

The analysis then continues in much the same way as for the existing spgwr package. Where, in spgwr, the bandwidth for GWR is calculated on the user’s desktop using a function of the form

> bw = gwr.sel(y~x, data, coords)

for the grid-enabled version we use

> bw = gwr.sel.dist(session, y~x, data, coords, max.processors)

Similarly, where the model is fitted in spgwr using

> gwr.model = gwr(y~x, data, coords, bw)

it is fitted in spgwr.dist using

> gwr.model = gwr.dist(session, y~x, data, coords, bw,
+ max.processors)

The only difference, from the user’s perspective, is that the additional parameter “session” contains the information required to connect to the multiR server, and the parameter “max.processors” (which is optional) specifies the maximum number of processors the GWR fit should run on.

Imagine a comma-delimited file called “census.csv” containing six columns of data. The first four are attribute data, headed Y, X1, X2, X3, and the remaining two define a point coordinate associated with where the data were collected. Those are headed Easting and Northing. To fit a GWR model on the grid system at Lancaster, exploring the geographically varying relationship

$$y_{(i,j)} \sim x_{1(i,j)} + x_{2(i,j)} + x_{3(i,j)}$$

the process would be:

> mydata = read.csv("census.csv", header=TRUE)

> locations = cbind(mydata$Easting, mydata$Northing)

> bw = gwr.sel.dist(session, Y~X1+X2+X3, data=mydata,
+ coords=locations, max.processors=20)

> gwr.model = gwr.dist(session, Y~X1+X2+X3, data=mydata,
+ coords=locations, bandwidth=bw, max.processors=20)

There is little sense in using the spgwr.dist package for ‘small’ datasets of about 1000 observations or fewer. For those, the Windows-based software or the existing spgwr package in R will be a better choice: they will be faster, because they avoid the additional communication and data exchange that the use of a distributed system introduces.


However, GWR does not scale well. The reason is that GWR fits a distance-weighted regression model, usually of the form

$$y_i = \beta_0(u_i, v_i) + \sum_k \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i$$

to each of m points within a continuous, geographic space: $(u_i, v_i)$ denotes the geographic coordinates of the i-th of the m points. For a model that examines a regression relationship at each of 100,000 census zones, n = 100,000 and it would appear that there are 100,000 regression surfaces which need to be calculated. Whilst true, there are also prior calculations to be completed.

First, because the regression is distance-weighted, the distances between the points need to be calculated. In the example, an n by n matrix is required. More generally, because the fit points need not be the same locations as those for which the data are collected, given a GWR model with n data points and m fit points, the distance matrix D is of size m by n. Nevertheless, the number of calculations required to obtain the distance matrix approximates to the order of $n^2$, D: $O(n^2)$.
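A small worked example of the m by n distance matrix, written in base R, is sketched below. The coordinates are invented for illustration (two fit points, three data points) and the object names are hypothetical.

```r
# Illustrative coordinates: n = 3 data points and m = 2 fit points.
data_pts <- cbind(east = c(0, 3, 0), north = c(0, 4, 10))
fit_pts  <- cbind(east = c(0, 3),    north = c(0, 0))

# Row i, column j holds the Euclidean distance from fit point i to data point j.
D <- sqrt(outer(fit_pts[, "east"],  data_pts[, "east"],  "-")^2 +
          outer(fit_pts[, "north"], data_pts[, "north"], "-")^2)

print(dim(D))   # m rows, n columns
print(D)
```

Storing such a matrix in double precision for m = n = 100,000 would need 100,000 × 100,000 × 8 bytes, or about 80 GB, which is one reason the full matrix is awkward to hold on a desktop machine.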

Having calculated the distance matrix, the m (or n) regression models are fitted. However, this is not sufficient. First the bandwidth controlling the distance weighting must be found and optimised (using a cross-validation technique or based on the Akaike information criterion, AIC). If it takes g iterations for the optimisation procedure to converge on a preferred bandwidth, then there are actually g × m regression models to fit.
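To make the g × m count concrete, the following base R sketch scores a handful of candidate bandwidths by leave-one-out cross-validation on a small synthetic dataset. It is an illustration under assumptions, not the procedure implemented in spgwr or spgwr.dist: the data, the Gaussian weighting function, the function names loo_error and cv_score and the fixed grid of candidates (in place of an iterative optimiser) are all hypothetical.

```r
# Illustrative only: synthetic data on the unit square.
set.seed(7)
n <- 60
u <- runif(n); v <- runif(n)
x <- runif(n)
y <- (1 + u) * x + rnorm(n, sd = 0.1)     # slope varies with the u coordinate

# Prediction error at point i from a local fit that leaves point i out.
loo_error <- function(i, h) {
  d <- sqrt((u - u[i])^2 + (v - v[i])^2)
  w <- exp(-0.5 * (d / h)^2)              # assumed Gaussian kernel
  w[i] <- 0                               # exclude observation i
  b <- coef(lm(y ~ x, weights = w))
  y[i] - (b[1] + b[2] * x[i])
}

# Cross-validation score for one bandwidth: one local fit per point.
cv_score <- function(h) sum(sapply(seq_len(n), loo_error, h = h)^2)

candidates <- c(0.2, 0.4, 0.8, 1.6)       # stand-in for g optimiser iterations
scores     <- sapply(candidates, cv_score)
best       <- candidates[which.min(scores)]

print(data.frame(bandwidth = candidates, cv = scores))
print(length(candidates) * n)             # g x m local regressions were fitted
```

Every candidate bandwidth requires a fresh set of local fits, which is why calibration dominates the run time for large n.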

Returning to the example of census zones, where m = n = 100,000 (which is about two thirds of the total number of 2001 census output areas in England and Wales), we estimate that using a desktop implementation of GWR it would take about half a day to derive D and about two weeks to obtain the bandwidth. This is ‘do-able’ but conflicts with the notion of using GWR as a tool for exploratory data analysis (to in some sense ‘interact’ with the data). As the times for the various stages suggest, the main bottleneck is not in finding the distance matrix but in calibrating the bandwidth: each iteration is of order $O(n^3)$.

It is unsurprising to discover that prior to this research (and to the best of our knowledge) the largest dataset for which GWR has been attempted was of size n = 12,493 (Fotheringham, Brunsdon, & Charlton 2002). Here, we have demonstrated the potential for grid-enabled GWR by using a dataset with greater than ten times that number of observations.

Specifically, a simple analysis has been undertaken to predict the proportion of households without a car (or van) in n = 165,665 output areas using data drawn from the 2001 Census. The predictor variables incorporate social, economic, demographic and ethnicity information and are:

• X1: Proportion of persons of working age unemployed

• X2: Proportion of households in public housing

• X3: Proportion of households that are lone parent households

• X4: Proportion of persons 16 or above that are single

• X5: Proportion of persons that are “white British”

The reason for modelling car non-ownership is that it is generally regarded as an indicator of material and social disadvantage: it reflects an inability to afford and insure a vehicle, which both causes and sustains disadvantage in the job market where access to employment becomes an issue (Clark & Wang 2005). However, that is not true everywhere: car ownership is lower in London, for example, presumably because public transport offers a credible alternative (Harris, Sleight, & Webber 2005, pp.219-220).

The regression coefficients for a standard, ordinary least squares regression model fitted to all of the 165,665 observations are 1.61, 0.46, -0.32, 0.38 and -0.07, respectively. Each is significant at greater than 99% confidence, but this is hardly surprising and not especially instructive: it is a consequence of the large size of n.

More interesting is how the coefficients vary spatially, as estimated by GWR and indicated in Table 1 by the interquartile range for each $\beta_k(u, v)$. For example, whereas the general model predicts that a 10% increase in the proportion of lone parent households would be associated with an average decrease in car non-ownership of 3.2%, the GWR model suggests a decrease in the (interquartile) range from 9.6% to 1.5%. Because of the double negative, it is easier to interpret the results as showing that as rates of lone parenthood increase so too do rates of car ownership, but that the effect is greater in some places than in others.
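The arithmetic behind these figures can be set out as follows, reading the lone parent coefficient (written here as $\beta_3$, following the numbering of X3) from Table 1 and taking the 10% increase as a change of 0.10 in the proportion. The notation $\Delta y$ and $\Delta x_3$ is introduced here for illustration only.

```latex
\begin{align*}
\text{global:} \quad & \Delta y = \beta_3 \, \Delta x_3 = -0.32 \times 0.10 = -0.032 \\
\text{GWR, first quartile:} \quad & \Delta y = -0.96 \times 0.10 = -0.096 \\
\text{GWR, third quartile:} \quad & \Delta y = -0.15 \times 0.10 = -0.015
\end{align*}
```

These correspond to decreases in the proportion of households without a car of 3.2, 9.6 and 1.5 percentage points respectively.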

                 Global    GWR β(u, v):
                 β         Q1       Median   Mean     Q3       IQR
intercept        0.09      0.08     0.22     0.24     0.36     0.28
unemployment     1.61      0.21     0.61     0.62     1.01     0.80
public housing   0.46      0.47     0.52     0.52     0.58     0.11
lone parents     -0.32     -0.96    -0.58    -0.56    -0.15    0.81
single           0.38      0.18     0.29     0.29     0.42     0.24
white British    -0.07     -0.31    -0.16    -0.19    -0.04    0.27

Table 1. Comparing the coefficients of a standard linear model and a GWR model predicting car non-ownership for n = 165,665 census output areas in England and Wales.

Figures 2 and 3 show some of the spatial variation in the coefficient for the lone parent variable. Figure 2 is for London, and Figure 3 is for Birmingham and Coventry. Both are cartograms. Cartograms are produced by warping a Euclidean view of geographic space to permit the size of each circle to be proportional to the population density at the location that circle represents (Dorling 1996). Consequently, the positions of the motorways are indicative, included only to aid interpretation of the maps.

Figure 2. A cartogram showing the spatial variation in the lone parent coefficient across London.

Figure 3. A cartogram showing the spatial variation in the lone parent coefficient across Birmingham and Coventry.


The interesting areas are those shaded yellow or red, as these are the places where an increase in lone parenthood is least associated with increased car ownership. In Birmingham and Coventry these places are near to the city centres; in London they are more dispersed but prevalent to the east of the city. If there is an advantage in the job market to be had by owning a car, then the results might suggest rather different experiences (or meanings) of lone parenthood across geographical space.

The GWR model for the n = m = 165,665 fit points took about three hours to calculate using the North West Grid Service (at Lancaster). Clearly this is not ‘immediate’ but also not unreasonable from the user’s perspective (especially given that it is not running or consuming resources on their own PC).

In a sense, however, we ‘cheated’. We estimate that it takes about 1.5 seconds to fit a single regression surface using generalised geographically weighted regression. If it takes 50 iterations to find the GWR bandwidth for 100,000 fit points and the calculation is distributed over 100 processors, then the total time to obtain the model would be about 1.5 × 50 × (100,000 / 100) seconds, which is about 20 hours. Whether it is really necessary to calibrate the bandwidth using all the fit points is a moot point and an area for further study: the effects of sampling on GWR need to be more fully understood. In any case, a random sample of about 50,000 was used (the gwr.sel.dist function can generate a random sample of the points if desired).
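The run-time estimate can be reproduced with a few lines of R. The variable names are introduced here for illustration; the numbers are those quoted in the text.

```r
# Figures quoted in the text; variable names are illustrative.
seconds_per_fit <- 1.5      # time to fit one regression surface
iterations      <- 50       # bandwidth search iterations
fit_points      <- 100000   # number of fit points
processors      <- 100      # processors sharing the work

total_seconds <- seconds_per_fit * iterations * (fit_points / processors)
total_hours   <- total_seconds / 3600

print(total_seconds)
print(round(total_hours, 1))
```

Changing fit_points or processors in this sketch shows how directly the estimate depends on the size of the calibration sample and on the number of nodes available.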

A number of points follow:

• A generalised geographically weighted regression was used to fit the model of car non-ownership (primarily to check it worked). However, it is the more basic (weighted least squares and Gaussian) model which is described with the spgwr.dist package, above. It will run faster. (In fact, it takes about 14 hours to run on the entire data set, about one third faster.)

• If it is satisfactory to use a sampling strategy when calibrating the bandwidth then it may also be sufficient when investigating spatial variation in the regression coefficients. This would seem appropriate for the exploratory stages of an analysis.

• Recall that the processing ‘bottleneck’ is the regression fit. Many other spatial statistics (for example various types of hot spot analysis) are simpler than GWR in that, basically, they compare the rate, incidence or density of an event or feature at one place against the corresponding values for other places across the study region. The derivation of such statistics can still be treated as embarrassingly parallel (with different processors operating on different subsets of the data) and, because they are more descriptive than explanatory, they will run considerably faster: there is no regression required.

Activities

The research and project have been presented at the second and third international conferences on e-social science (Manchester 2006 and Ann Arbor, Michigan 2007, respectively), and at the 9th International Conference on GeoComputation (Maynooth, Ireland, 2007). It has also been presented at the recent NCeSS Showcase (Manchester 2008) and will be at the forthcoming Digital Geography in a Web 2.0 World conference (London 2008) as well as at the R User’s conference (Dortmund, Germany 2008). A free training workshop in using grid-enabled GWR was undertaken (Lancaster 2007).

The research was genuinely collaborative, involving members of the University of Leicester’s SPLINT (Spatial Literacy in Teaching) group and, especially, the Lancaster University Centre for e-Science. The latter collaboration was not envisioned in the original proposal and was largely serendipitous; it was also extremely successful and may represent something of a model by which computer and social scientists may collaborate.

We are also grateful for the input of Professor Roger Bivand, a member of the R core development team, with whom time was spent in Bergen, Norway.

Outputs

Papers are being prepared for the International Journal of Geographical Information Science, focusing on the more technical aspects of how spatial statistics may be grid-enabled, and also for the Transactions in GIS journal, providing a more applied case study. A further paper is being prepared for the Journal of Statistical Software and we hope to produce a short feature for the Scientific Computing World magazine.

Nevertheless, the main outputs are the multiR and spgwr.dist packages for R, which are being ‘cleaned’ to make them freely accessible on CRAN (the Comprehensive R Archive Network). Beta versions may be requested from members of the project team.

The training manual will be uploaded to a suitable website, initially by updating the content at http://www.esrcsocietytoday.ac.uk/ESRCInfoCentre/Minisite/gwr/index.html

Impacts

The development of the multiR package and server is not specific to GWR but provides a more general link between (desktop) R and grid resources. It is a development of the existing GROWL software and further enhances the use of the North West Grid as a hub for statistical operations of relevance to social scientists.

Future research priorities

There are four lines of priority which arise from the project.

• Methodological: the impact of sampling on GWR needs to be better understood, as may the impact of multicollinearity and correlation among local regression coefficients in geographically weighted regression (Wheeler & Tiefelsdorf 2005). More positively, there is a possibility to resolve one of the simplifying assumptions of basic GWR: that a single measure of spatial autocorrelation (one bandwidth) is sufficient for the entire study region. The simple possibility is to regionalise the data, process the regions separately, and compare the bandwidths.

• Developmental: the application of multiR is not limited to GWR. A toolbox of statistical operations could be offered running in an R grid environment, including types of hot spot analysis and geostatistical operations including kriging: in fact, almost any process that can be separated into subsets (not necessarily spatial) of the data.

• Data linkage: to census and other data via the National Grid Service. See the GEMS project at http://pascal.mvc.mcc.ac.uk:9080/gems for example.

• Collaborative: to extend the collaborative model of working between computer and social scientists, for example by ‘discipline-hopping’ funding.

References

Clark, W.A.V. & Wang, W.W., 2005. Job Access and Commute Penalties: Balancing Work and Residence in Los Angeles. Urban Geography, 26(7), pp.610-626.

Dorling, D., 1996. Area Cartograms: Their Use and Creation, Norwich: Environmental Publications.

Fotheringham, A.S., Brunsdon, C., & Charlton, M., 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, Chichester: John Wiley & Sons.

Harris, R., Sleight, P., & Webber, R., 2005. Geodemographics: GIS and Neighbourhood Targeting, Chichester: John Wiley & Sons.

Martin, D., 2005. Socioeconomic GeoComputation and E-Social Science. Transactions in GIS, 9(1), pp.1-3.

Openshaw, S. et al., 1987. A Mark I Geographical Analysis Machine for the Automated Analysis of Point Datasets. International Journal of Geographical Information Systems, 1(4), pp.335-358.

Wheeler, D. & Tiefelsdorf, M., 2005. Multicollinearity and correlation among local regression coefficients in geographically weighted regression. Journal of Geographical Systems, 7(2), pp.161-187.

