

UNCORRECTED PROOFS


6 Uncertainty

Uncertainty in geographic representation arises because, of necessity, almost all representations of the world are incomplete. As a result, data in a GIS can be subject to measurement error, out of date, excessively generalized, or just plain wrong. This chapter identifies many of the sources of geographic uncertainty and the ways in which they operate in GIS-based representations. Uncertainty arises from the way that GIS users conceive of the world, how they measure and represent it, and how they analyze their representations of it. This chapter investigates a number of conceptual issues in the creation and management of uncertainty, before reviewing the ways in which it may be measured using statistical and other methods. The propagation of uncertainty through geographical analysis is then considered. Uncertainty is an inevitable characteristic of GIS usage, and one that users must learn to live with. In these circumstances, it becomes clear that all decisions based on GIS are also subject to uncertainty.

Geographic Information Systems and Science, 2nd edition. Paul Longley, Michael Goodchild, David Maguire and David Rhind. 2005 John Wiley & Sons, Ltd. ISBNs: 0-470-87000-1 (HB); 0-470-87001-X (PB)


128 PART II PRINCIPLES

Learning Objectives

By the end of this chapter you will:

■ Understand the concept of uncertainty, and the ways in which it arises from imperfect representation of geographic phenomena;

■ Be aware of the uncertainties introduced in the three stages (conception, measurement and representation, and analysis) of database creation and use;

■ Understand the concepts of vagueness and ambiguity, and the uncertainties arising from the definition of key GIS attributes;

■ Understand how and why scale of geographic measurement and analysis can both create and propagate uncertainty.

6.1 Introduction

GIS-based representations of the real world are used to reconcile science with practice, concepts with applications, and analytical methods with social context. Yet, almost always, such reconciliation is imperfect, because, necessarily, representations of the world are incomplete (Section 3.4). In this chapter we will use uncertainty as an umbrella term to describe the problems that arise out of these imperfections. Occasionally, representations may approach perfect accuracy and precision (terms that we will define in Section 6.3.2.2) – as might be the case, for example, in the detailed site layout layer of a utility management system, in which strenuous efforts are made to reconcile fine-scale multiple measurements of built environments. Yet perfect, or nearly perfect, representations of reality are the exception rather than the rule. More usually, the inherent complexity and detail of our world makes it virtually impossible to capture every single facet, at every possible scale, in a digital representation. (Neither is this usually desirable: see the discussion of sampling in Section 4.4.) Furthermore, different individuals see the world in different ways, and in practice no single view is likely to be seen universally as the best or to enjoy uncontested status. In this chapter we discuss how the processes and procedures of abstraction create differences between the contents of our (geographic and attribute) database and real world phenomena. Such differences are almost inevitable, and understanding them can help us to manage uncertainty, and to live with it.

It is impossible to make a perfect representation of the world, so uncertainty about it is inevitable.

Various terms are used to describe differences between the real world and how it appears in a GIS, depending on the context. The established scientific notion of measurement error focuses on differences between observers or between measuring instruments. As we saw in a previous chapter (Section 4.7), the concept of error in multivariate statistics arises in part from omission of some relevant aspects of a phenomenon – as in the failure to fully specify all of the predictor variables in a multiple regression model, for example. Similar problems arise when one or more variables are omitted from the calculation of a composite indicator – as, for example, in omitting road accessibility in an index of land value, or omitting employment status from a measure of social deprivation (see Section 16.2.1 for a discussion of indicators). More generally, the Dutch geostatistician Gerard Heuvelink (who we will introduce in Box 6.1) has defined accuracy as the difference between reality and our representation of reality. Although such differences might principally be addressed in formal mathematical terms, the use of the word our acknowledges the varying views that are generated by a complex, multi-scale, and inherently uncertain world.

Yet even this established framework is too simple for understanding quality or the defining standards of geographic data. The terms ambiguity and vagueness identify further considerations which need to be taken into account in assessing the quality of a GIS representation. Quality is an important topic in GIS, and there have been many attempts to identify its basic dimensions. The US Federal Geographic Data Committee's various standards list five components of quality: attribute accuracy, positional accuracy, logical consistency, completeness, and lineage. Definitions and other details on each of these and several more can be found on the FGDC's Web pages (www.fgdc.gov). Error, inaccuracy, ambiguity, and vagueness all contribute to the notion of uncertainty in the broadest sense, and uncertainty may thus be defined as a measure of the user's understanding of the difference between the contents of a dataset and the real phenomena the data are believed to represent. This definition implies that phenomena are real, but includes the possibility that we are unable to describe them exactly. In GIS, the term uncertainty has come to be used as the catch-all term to describe situations in which the digital representation is simply incomplete, and as a measure of the general quality of the representation.

Many geographic representations depend upon inherently vague definitions and concepts.

The views outlined in the previous paragraph are themselves controversial, and a rich ground for endless philosophical discussions. Some would argue that uncertainty can be inherent in phenomena themselves, rather than just in their description. Others would argue for distinctions between vagueness, uncertainty, fuzziness, imprecision, inaccuracy, and many other terms that most people use as if they were essentially synonymous. Information


Figure 6.1 A conceptual view of uncertainty. The three filters, U1 (conception), U2 (measurement and representation), and U3 (analysis), distort the way in which the complexity of the real world is conceived, measured and represented, and analyzed in a cumulative way

scientist Peter Fisher (1999) has provided a useful and wide-ranging discussion of these terms. We take the catch-all view here, and leave these arguments to further study.

In this chapter, we will discuss some of the principal sources of uncertainty and some of the ways in which uncertainty degrades the quality of a spatial representation. The way in which we conceive of a geographic phenomenon very much prescribes the way in which we are likely to set about measuring and representing it. The measurement procedure, in turn, heavily conditions the ways in which it may be analyzed within a GIS. This sequence of events, in which conception prescribes measurement and representation, which in turn prescribes analysis, is a succinct way of summarizing much of the content of this chapter, and is summarized in Figure 6.1. In this diagram, U1, U2 and U3 each denote filters that selectively distort or transform the representation of the real world that is stored and analyzed in GIS: a later chapter (Section 13.*) introduces a fourth filter that mediates interpretation of analysis, and the ways in which feedback may be accommodated through improvements in representation.

6.2 U1: Uncertainty in the conception of geographic phenomena

6.2.1 Units of analysis

Our discussion of Tobler's Law (Section 3.1) and of spatial autocorrelation (Section 4.6) established that geographic data handling is different from all other classes of non-spatial applications. A further characteristic that sets geographic information science apart from almost every other science is that it is only rarely founded upon natural units of analysis. What is the natural unit of measurement for a soil profile? What is the spatial extent of a pocket of high unemployment, or a cluster of cancer cases? How might we delimit an environmental impact study of spillage from an oil tanker (Figure 6.2)? The questions become still more difficult in bivariate (two variable) and multivariate (more than two variable) studies. At what scale is it appropriate to investigate any relationship between background radiation and the incidence of leukemia? Or to assess any relationship between labor-force qualifications and unemployment rates?

In many cases there are no natural units of geographic analysis.


Figure 6.2 How might the spatial impact of an oil tanker spillage be delineated? We can measure the dispersion of the pollutants, but their impacts extend far beyond these narrowly-defined boundaries


The discrete-object view of geographic phenomena is much more reliant upon the idea of natural units of analysis than the field view. Biological organisms are almost always natural units of analysis, as are groupings such as households or families – though even here there are certainly difficult cases, such as the massive networks of fungal strands that are often claimed to be the largest living organisms on Earth, or extended families of human individuals. Things we manipulate, such as pencils, books, or screwdrivers, are also obvious natural units. The examples listed in the previous paragraph fall almost entirely into one of two categories – they are either instances of fields, where variation can be thought of as inherently continuous in space, or they are instances of poorly defined aggregations of discrete objects. In both of these cases it is up to the investigator to make the decisions about units of analysis, making the identification of the objects of analysis inherently subjective.

6.2.2 Vagueness and ambiguity

6.2.2.1 Vagueness

The frequent absence of objective geographic individual units means that, in practice, the labels that we assign to zones are often vague best guesses. What absolute or relative incidence of oak trees in a forested zone qualifies it for the label oak woodland (Figure 6.3)? Or, in a developing-country context in which aerial photography rather than ground enumeration is used to estimate population size, what rate of incidence of dwellings identifies a zone of dense population? In each of these instances, it is expedient to transform point-like events (individual trees or individual dwellings) into area objects, and pragmatic decisions must be taken in order to create a working definition of a spatial distribution. These decisions have no absolute validity, and raise two important questions:

■ Is the defining boundary of a zone crisp and well-defined?

■ Is our assignment of a particular label to a given zone robust and defensible?


Figure 6.3 Seeing the wood for the trees: what absolute or relative incidence rate makes it meaningful to assign the label 'oak woodland'?

Uncertainty can exist both in the positions of the boundaries of a zone and in its attributes.

The questions have statistical implications (can we put numbers on the confidence associated with boundaries or labels?), cartographic implications (how can we convey the meaning of vague boundaries and labels through appropriate symbols on maps and GIS displays?), and cognitive implications (do people subconsciously attempt to force things into categories and boundaries to satisfy a deep need to simplify the world?).
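These questions can be made concrete with a small sketch (the zones, oak fractions, and thresholds below are hypothetical, not taken from this chapter): two analysts who adopt different but equally defensible crisp cut-offs for the label 'oak woodland' classify some of the same zones differently, without either making a measurement error.

```python
# Hypothetical fraction of oak trees observed in five forested zones.
oak_fraction = {"zone_a": 0.82, "zone_b": 0.55, "zone_c": 0.48,
                "zone_d": 0.30, "zone_e": 0.57}

def label_zones(fractions, threshold):
    """Assign the crisp label 'oak woodland' wherever the oak
    fraction meets an analyst's chosen working threshold."""
    return {zone: ("oak woodland" if frac >= threshold else "mixed woodland")
            for zone, frac in fractions.items()}

# Two equally defensible working definitions of 'oak woodland'...
analyst_1 = label_zones(oak_fraction, threshold=0.50)
analyst_2 = label_zones(oak_fraction, threshold=0.60)

# ...disagree over every zone whose oak fraction falls between the cut-offs.
disagreements = [z for z in oak_fraction if analyst_1[z] != analyst_2[z]]
print(disagreements)  # ['zone_b', 'zone_e']
```

The disagreement is not resolvable by better measurement: both labelings are internally consistent, and only the (vague) definition of the class differs.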

6.2.2.2 Ambiguity

Many objects are assigned different labels by different national or cultural groups, and such groups perceive space differently. Geographic prepositions like across, over, and in (used in the Yellow Pages query in Figure 1.17) do not have simple correspondences with terms in other languages. Object names and the topological relations between them may thus be inherently ambiguous. Perception, behavior, language, and cognition all play a part in the conception of real-world entities and the relationships between them. GIS cannot present a value-neutral view of the world, yet it can provide a formal framework for the reconciliation of different worldviews. The geographic nature of this ambiguity may even be exploited to identify regions with shared characteristics and worldviews. To this end, Box 6.1 describes how different surnames used to describe essentially the same historic occupations provide an enduring measure in region building.

Many English terms used to convey geographic information are inherently ambiguous.

Ambiguity also arises in the conception and construction of indicators (see also Section 16.2.1). Direct indicators are deemed to bear a clear correspondence with a mapped phenomenon. Detailed household income figures, for example, provide a direct indicator of the likely geography of expenditure and demand for goods and services; tree diameter at breast height can be used to estimate stand value; and field nutrient measures can be used to estimate agronomic yield. Indirect indicators are used when the best available measure is a perceived surrogate link with the phenomenon of interest. Thus the incidence of central heating amongst households, or rates of multiple car ownership, might provide a surrogate for (unavailable) household income data, while local atmospheric measurements of nitrogen dioxide might provide an indirect indicator of environmental health. Conception of the (direct or indirect) linkage between any indicator and the phenomenon of interest is subjective, hence ambiguous. Such measures will create (possibly systematic) errors of measurement if the correspondence between the two is imperfect. So, for example, differences in the conception of what hardship and deprivation entail can lead to specification of different composite indicators, and different geodemographic systems include different cocktails of census variables (Section 2.3.3). With regard to the natural environment, conception of critical defining properties


Applications Box 6.1

Historians need maps of our uncertain past

In the study of history, there are many ways in which 'spatial is special' (Section 1.1.1). For example, it is widely recognized that although what our ancestors did (their occupations) and the social groups (classes) to which they belonged were clearly important in terms of demographic behavior, location and place were of equal if not greater importance. Although population changes occur in particular socioeconomic circumstances, they are also strongly influenced by the unique characteristics, or 'cultural identities', of particular places. In Great Britain today, as almost everywhere else in the world, most people still think of their nation as made up of 'regions', and their stability and defining characteristics are much debated by cultural geographers and historians.

Yet analyzing and measuring human activity by place creates particular problems for historians. Most obviously, the past was very much less data rich than the present, and few systematic data sources survive. Moreover, the geographical administrative units by which the events of the past were recorded are both complex and changing. In an ideal world, perhaps, physical and cultural boundaries would always coincide, but physical features alone rarely provide appropriate indicators of the limits of socioeconomic conditions and cultural circumstance.

Unfortunately many mapped historical data are still presented using high-level aggregations, such as counties or regions. This achieves a measure of standardization but may depict demography in only the most arbitrary of ways. If data are forced into geographical administrative units that were delineated for other purposes, regional maps may present nothing more than misleading, or even meaningless, spatial means (see Box 1.9: Stewart Fotheringham).

In England and in many other countries, the daily activities of most individuals historically revolved around small numbers of contiguous civil parishes, of which there were more than 16 000 in the 19th century. These are the smallest administrative units for which data are systematically available. They provide the best available building blocks for meaningful region building. But how can we group parishes in order to identify non-overlapping geographical territories to which people felt that they belonged? And what indicators of regional identity are likely to have survived for all individuals in the population?

Historian Kevin Schurer (Box *.*) has investigated these questions using a historical GIS to map digital surname data from the 1881 Census of England and Wales. The motivation for the GIS arises from the observation that many surnames contain statements of regional identity, and the suggestion that distinct zones of similar surnames might be described as homogeneous regions. The digitized records of the 1881 Census for England and Wales (*) cover some 26 million people: although some 41 000 different surnames are recorded, a fifth of the population shared just under 60 surnames, and half of the population were accounted for by some 600 surnames. Schurer suggests that these aggregate statistics conceal much that we might learn about regional identity and diversity.

Many surnames of European origin are formed from occupational titles. Occupations often have uneven regional distributions and sometimes similar occupations are described using different names in different places (today's 'realtors' in the US perform much the same functions as their 'estate agent' counterparts in the UK, for example). Schurer has investigated the 1881 geographical distribution of three occupational surnames – Fuller, Tucker, and Walker. These essentially refer to the same occupation; namely someone who, from around the 14th century onwards, worked in the preparation of textiles by scouring or beating cloth as a means of finishing or cleansing it. Using GIS, Schurer confirms that the geographies of these 14th century surnames remained of enduring importance in defining the regional geography of England in 1881. Figure 6.4 illustrates that in 1881 Tuckers remained concentrated in the West Country, while Fullers occurred principally in the east and Walkers resided in the Midlands and north. This map also shows that there was not much mixing of the surnames in the transition zones between names, suggesting that the maps provide a useful basis for region building.

The enduring importance of surnames as evidence of the strength and durability of regional cultures has been confirmed in an update to the work by Daryl Lloyd at University College London: Lloyd used the 2003 UK Electoral Register to map the distribution of the same three surnames (Figure 6.5) and identified persistent regional concentrations.


Figure 6.4 The 1881 geography of the Fullers, Tuckers, and Walkers; the legend distinguishes zones with none, one, or combinations of the three surnames. Source: 1881 Census of Population (courtesy Kevin Schurer and Daryl Lloyd)

Figure 6.5 The 2003 geography of the Fullers, Tuckers, and Walkers, using the same legend classes as Figure 6.4. Source: Electoral Register (courtesy Daryl Lloyd)

of soils can lead to inherent ambiguity in their classification (see Section 6.2.3).

Ambiguity is introduced when imperfect indicators of phenomena are used instead of the phenomena themselves.
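The ambiguity of composite indicators described above can be sketched in a few lines (all areas, variables, and weights below are invented for illustration): two different conceptions of deprivation, expressed as different weightings of the same surrogate census-style variables, rank the same areas in different orders.

```python
# Hypothetical rates of three surrogate variables for three areas.
areas = {
    "area_1": {"no_central_heating": 0.40, "no_car": 0.10, "unemployed": 0.05},
    "area_2": {"no_central_heating": 0.10, "no_car": 0.35, "unemployed": 0.12},
    "area_3": {"no_central_heating": 0.25, "no_car": 0.20, "unemployed": 0.09},
}

def composite_index(variables, weights):
    """One possible 'cocktail': a weighted sum of the surrogate variables."""
    return {area: sum(weights[v] * values[v] for v in weights)
            for area, values in variables.items()}

def ranking(index):
    """Areas ordered from most to least 'deprived' under a given index."""
    return sorted(index, key=index.get, reverse=True)

# Two defensible conceptions of deprivation weight the surrogates differently...
cocktail_a = composite_index(areas, {"no_central_heating": 0.6,
                                     "no_car": 0.2, "unemployed": 0.2})
cocktail_b = composite_index(areas, {"no_central_heating": 0.2,
                                     "no_car": 0.5, "unemployed": 0.3})

# ...and so rank the same areas in different orders.
print(ranking(cocktail_a))  # ['area_1', 'area_3', 'area_2']
print(ranking(cocktail_b))  # ['area_2', 'area_3', 'area_1']
```

Neither ranking is 'wrong': the difference reflects the subjective conception of the linkage between the surrogates and the phenomenon, which is exactly the ambiguity at issue.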

Fundamentally, GIS has upgraded our abilities to generalize about spatial distributions. Yet our abilities to do so may be constrained by the different taxonomies that are conceived and used by data-collecting organizations within our overall study area. A study of wetland classification in the US found no fewer than six agencies engaged in mapping the same phenomena over the same geographic areas, and each with their own definitions of wetland types (see Section 1.*). If wetland maps are to be used in regulating the use of land, as they are in many areas, then uncertainty in mapping clearly exposes regulatory agencies to potentially damaging and costly lawsuits. How might soils data classified according to the UK national classification be assimilated within a pan-European soils map, which uses a classification honed to the full range and diversity of soils found across the

European continent rather than those just on an offshore island? How might different national geodemographic classifications be combined into a form suitable for a pan-European marketing exercise? These are all variants of the question:

■ How may mismatches between the categories of different classification schema be reconciled?

Differences in definitions are a major impediment to integration of geographic data over wide areas.

Like the process of pinning down the different nomenclatures developed in different cultural settings, the process of reconciling the semantics of different classification schema is an inherently ambiguous procedure. Ambiguity arises in data concatenation when we are unsure regarding the meta-category to which a particular class should be assigned.

6.2.3 Fuzzy approaches

One way of resolving the assignment process is to adopt a probabilistic interpretation. If we take a statement like


'the database indicates that this field contains wheat, but there is a 0.17 probability (or 17% chance) that it actually contains barley', there are at least two possible interpretations:

(a) If 100 randomly chosen people were asked to make independent assessments of the field on the ground, 17 would determine that it contains barley, and 83 would decide it contains wheat.

(b) Of 100 similar fields in the database, 17 actually contained barley when checked on the ground, and 83 contained wheat.

Of the two, we probably find the second more acceptable, because the first implies that people cannot correctly determine the crop in the field. But the important point is that, in conceptual terms, both of these interpretations are frequentist, because they are based on the notion that the probability of a given outcome can be defined as the proportion of times the outcome occurs in some real or imagined experiment, when the number of tests is very large. But while this is reasonable for classic statistical experiments, like tossing coins or drawing balls from an urn, the geographic situation is different – there is only one field with precisely these characteristics, and one observer, and in order to imagine a number of tests we have to invent more than one observer, or more than one field (the problems of imagining larger populations for some geographic samples are discussed further in Section 15.4).
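Interpretation (b) can be illustrated with a short simulation (a sketch only: the 0.17 probability comes from the statement above, everything else is invented). Drawing imagined ground-truth crops for a large number of database fields recorded as wheat reproduces the stated proportion of barley, which is precisely the frequentist reading of the probability.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

def ground_truth_sample(n_fields, p_barley=0.17):
    """Interpretation (b): of many database fields recorded as 'wheat',
    about 17% turn out to contain barley when checked on the ground."""
    return ["barley" if random.random() < p_barley else "wheat"
            for _ in range(n_fields)]

fields = ground_truth_sample(100_000)
proportion_barley = fields.count("barley") / len(fields)
print(proportion_barley)  # close to 0.17 for a large number of fields
```

The catch the text identifies remains: the 100 000 'similar fields' are imagined, because the database contains only the one field actually observed.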

In part because of this problem, many people prefer the subjectivist conception of probability – that it represents a judgment about relative likelihood that is not the result of any frequentist experiment, real or imagined. Subjective probability is similar in many ways to the concept of fuzzy sets, and the latter framework will be used here to emphasize the contrast with frequentist probability.

Suppose we are asked to examine an aerial photograph to determine whether a field contains wheat, and we decide that we are not sure. However, we are able to put a number on our degree of uncertainty, by putting it on a scale from 0 to 1. The more certain we are, the higher the number. Thus we might say we are 0.90 sure it is wheat, and this would reflect a greater degree of certainty than 0.80. This degree of belonging to the class wheat is termed the fuzzy membership, and it is common though not necessary to limit memberships to the range 0 to 1. In effect, we have changed our view of membership in classes, and abandoned the notion that things must either belong to classes or not belong to them – in this new world, the boundaries of classes are no longer clean and crisp, and the set of things assigned to a set can be fuzzy.

In fuzzy logic, an object's degree of belonging to a class can be partial.

One of the major attractions of fuzzy sets is that they appear to let us deal with sets that are not precisely defined, and for which it is impossible to establish membership cleanly. Many such sets or classes are found in GIS applications, including land use categories, soil types, land cover classes, and vegetation types. Classes used for maps are often fuzzy, such that two people asked to classify the same location might disagree, not because of measurement error, but because the classes themselves are not perfectly defined and because opinions vary. As such, mapping is often forced to stretch the rules of scientific repeatability, which require that two observers will always agree. Box 6.2 shows a typical extract from the legend of a soil map, and it is easy to see how two people might disagree, even though both are experts with years of experience in soil classification.

Figure 6.6 shows an example of mapping classes using the fuzzy methods developed by A-Xing Zhu of the

Technical Box 6.2

Fuzziness in classification: description of a soil class

Following is the description of the Limerick series of soils from New England, USA (the type location is in Chittenden County, Vermont), as defined by the National Cooperative Soil Survey. Note the frequent use of vague terms such as 'very', 'moderate', 'about', 'typically', and 'some'. Because the definition is so loose it is possible for many distinct soils to be lumped together in this one class – and two observers may easily disagree over whether a given soil belongs to the class, even though both are experts. The definition illustrates the extreme problems of defining soil classes with sufficient rigor to satisfy the criterion of scientific repeatability.

The Limerick series consists of very deep, poorly drained soils on flood plains. They formed in loamy alluvium. Permeability is moderate. Slope ranges from 0 to 3 percent. Mean annual precipitation is about 34 inches and mean annual temperature is about 45 degrees F.

Depth to bedrock is more than 60 inches. Reaction ranges from strongly acid to neutral in the surface layer and moderately acid to neutral in the substratum. Textures are typically silt loam or very fine sandy loam, but lenses of loamy very fine sand or very fine sand are present in some pedons. The weighted average of fine and coarser sands, in the particle-size control section, is less than 15 percent.


134 PART II PRINCIPLES

Figure 6.6 (A) Membership map for bare soils in the Upper Lake McDonald basin, Glacier National Park. High membership values are in the ridge areas where active colluvial and glacier activities prevent the establishment of vegetation. (B) Membership map for forest. High membership values are in the middle to lower slope areas where the soils are both stable and better drained. (C) Membership map for alpine meadows. High membership values are on gentle slopes at high elevation where excessive soil water and low temperature prevent the growth of trees. (D) Spatial distribution of the three cover types from hardening the membership maps (courtesy of A-Xing Zhu)

University of Wisconsin-Madison, USA, which take both remote sensing images and the opinions of experts as inputs. There are three classes, and each map shows the fuzzy membership values in one class, ranging from 0 (darkest) to 1 (lightest). This figure also shows the result of converting to crisp categories, or hardening – to obtain Figure 6.6(D), each pixel is colored according to the class with the highest membership value.
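The hardening step can be sketched as follows. This is a minimal illustration, not A-Xing Zhu's actual implementation: each pixel simply takes the class whose membership value is largest.

```python
# Minimal sketch of "hardening" fuzzy membership maps: each pixel is
# assigned the index of the class with the highest membership value.
import numpy as np

def harden(memberships):
    """memberships: array of shape (n_classes, rows, cols), values in [0, 1].

    Returns an integer array of shape (rows, cols) giving, for each pixel,
    the index of the class with the largest membership value."""
    return np.argmax(memberships, axis=0)
```

Applied to three membership rasters like those of Figure 6.6(A)–(C), this yields a single crisp class map like Figure 6.6(D), but it discards the information about how confident each assignment was.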

Fuzzy approaches are attractive because they capture the uncertainty that many of us feel about the assignment of places on the ground to specific categories. But researchers have struggled with the question of whether they are more accurate. In a sense, if we are uncertain about which class to choose then it is more accurate to say so, in the form of a fuzzy membership, than to be forced into assigning a class without qualification. But that does not address the question of whether the fuzzy membership value is accurate. If Class A is not well defined, it is hard to see how one person's assignment of a fuzzy membership of 0.83 in Class A can be meaningful to another person, since there is no reason to believe that the two people share the same notions of what Class A means, or of what 0.83 means, as distinct from 0.91, or 0.74. So while fuzzy approaches make sense at an intuitive level, it is more difficult to see how they could be helpful in the



process of communication of geographic knowledge from one person to another.

6.2.4 The scale of geographic individuals

There is a sense in which vagueness and ambiguity in the conception of usable (rather than natural) units of analysis undermines the very foundations of GIS. How, in practice, may we create a sufficiently secure base to support geographic analysis? Geographers have long grappled with the problems of defining systems of zones and have marshaled a range of deductive and inductive approaches to this end (see Section 4.9 for a discussion of what deduction and induction entail). The long-established regional geography tradition is fundamentally concerned with the delineation of zones characterized by internal homogeneity (with respect to climate, economic development, or agricultural land use, for example), within a zonal scheme which maximizes between-zone heterogeneity, such as the map illustrated in Figure 6.7. Regional geography is fundamentally about delineating uniform zones, and many practitioners employ multivariate statistical techniques such as cluster analysis to supplement, or post-rationalize, intuition.

Identification of homogeneous zones and spheres of influence lies at the heart of traditional regional geography as well as contemporary data analysis.

Other geographers have tried to develop functional zonal schemes, in which zone boundaries delineate the breakpoints between the spheres of influence of adjacent facilities or features – as in the definition of travel-to-work areas (Figure 6.8) or the definition of a river catchment. Zones may be defined such that there is maximal interaction within zones, and minimal between zones. The scale at which uniformity or functional integrity is conceived clearly conditions the ways it is measured – in terms of the magnitude of within-zone heterogeneity that must be accommodated in the case of uniform zones, and the degree of leakage between the units of functional zones.

Scale has an effect, through the concept of spatial autocorrelation outlined in Section 4.3, upon the outcome of geographic analysis. This was demonstrated more than half a century ago in a classic paper by Yule and Kendall, where the correlation between wheat and potato yields was shown systematically to increase as English county units were amalgamated through a succession of coarser scales (Table 6.1). A succession of subsequent research papers has reaffirmed the existence of similar scale effects in multivariate analysis. However, rather discouragingly, scale effects in multivariate cases do not follow any consistent or predictable trends. This theme of dependence of results on the geographic units of analysis is pursued further in Section 6.4.

Relationships typically grow stronger when based on larger geographic units.

GIS appears to trivialize the task of creating composite thematic maps. Yet inappropriate conception of the scale of geographic phenomena can mean that apparent spatial patterning (or the lack of it) in mapped data may be

[Map: 'Physiographic Regions of Russia', showing eight numbered physiographic regions]

Figure 6.7 The regional geography of Russia (source: de Blij H.J. and Muller P.O. 2000 Geography: Realms, Regions and Concepts (9th edn). New York: Wiley, p. 113)


Figure 6.8 Dominant functional regions of Great Britain, showing the extent of major functional regions (source: Champion T. et al 1996 The Population of Britain in the 1990s. Oxford: Oxford University Press, p. 9)

Table 6.1 In 1950 Yule and Kendall used data for wheat and potato yields from the (then) 48 counties of England to demonstrate that correlation coefficients tend to increase with scale. They aggregated the 48-county data into zones so that there were first 24, then 12, then 6, and finally just 3 zones. The range of their results, from near zero (no correlation) to over 0.99 (almost perfect positive correlation), demonstrates the range of results that can be obtained, although subsequent research has suggested that this range of values is atypical

No. of geographic areas    Correlation
48                         0.2189
24                         0.2963
12                         0.5757
 6                         0.7649
 3                         0.9902

oversimplified, crude, or even illusory. It is also clearly inappropriate to conceive of boundaries as crisp and well-defined if significant leakage occurs across them (as happens, in practice, in the delineation of most functional regions), or if geographic phenomena are by nature fuzzy, vague, or ambiguous.
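The scale effect reported by Yule and Kendall can be illustrated with a small constructed example using entirely hypothetical data: two variables share the same zone-level signal but carry opposing within-zone noise, so their correlation strengthens when fine units are amalgamated into zone means.

```python
# Hypothetical illustration of the scale effect: x and y share a coarse
# zonal signal but have opposing fine-scale noise, so their correlation
# increases when fine units are amalgamated into zones.
import numpy as np

def correlations_at_two_scales(n_zones=12):
    zone_size = 4  # four fine units per zone, matching the noise pattern
    z = np.repeat(np.arange(n_zones, dtype=float), zone_size)  # zonal signal
    noise = np.tile([1.0, -1.0, 0.5, -0.5], n_zones)  # sums to zero per zone
    x, y = z + noise, z - noise
    r_fine = np.corrcoef(x, y)[0, 1]
    # Amalgamate the fine units into zone means, as counties were amalgamated
    x_zone = x.reshape(n_zones, zone_size).mean(axis=1)
    y_zone = y.reshape(n_zones, zone_size).mean(axis=1)
    r_coarse = np.corrcoef(x_zone, y_zone)[0, 1]
    return r_fine, r_coarse
```

Here the fine-scale correlation is about 0.90, while the zone means correlate perfectly, because the opposing noise cancels on aggregation; real data behave less neatly, but the direction of the effect matches Table 6.1.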

6.3 U2: Further uncertainty in the measurement and representation of geographic phenomena

6.3.1 Measurement and representation

The conceptual models (fields and objects) that were introduced in Chapter 3 impose very different filters upon reality, and their usual corresponding representational models (raster and vector) are characterized by different uncertainties as a consequence. The vector model enables a range of powerful analytical operations to be performed (see Chapters 14 through 16), yet it also requires a priori conceptualization of the nature and extent of geographic individuals and the ways in which they nest together into higher-order zones. The raster model defines individual elements as square cells, with boundaries that bear no relationship at all to natural features, but nevertheless provides a convenient and (usually) efficient structure for data handling within a GIS. However, in the absence of effective automated pattern recognition techniques, human interpretation is usually required to discriminate between real-world spatial entities as they appear in a rasterized image.

Although quite different representations of reality, vector and raster data structures are both attractive in their logical consistency, the ease with which they are able to handle spatial data, and (once the software is written) the ease with which they can be implemented in GIS. But neither abstraction provides easy measurement fixes and there is no substitute for robust conception of geographic units of analysis (Section 6.2). This said, however, the conceptual distinction between fields and discrete objects is often useful in dealing with uncertainty. Figure 6.9 shows a coastline, which is often conceptualized as a discrete line object. But suppose we recognize that its position is uncertain. For example, the coastline shown on a 1:2 000 000 map is a gross generalization, in which major liberties are taken, particularly in areas where the coast is highly indented and irregular. Consequently the 1:2 000 000 version leaves substantial uncertainty about the true location of the shoreline. We might approach this by changing from a line to an area, and mapping the area where the actual coastline lies, as shown in the figure. But another approach would be to reconceptualize the coastline as a field, by mapping a variable whose value represents the probability that a point is land. This is shown in the figure as a raster representation. This would have far more information content, and consequently much more value in many applications. But at the same time it would be difficult to find an appropriate data source for the representation – perhaps a fuzzy classification of an air photo, using one of an increasing number of techniques designed to produce


Figure 6.9 The contrast between discrete object (top) and field (bottom) conceptualizations of an uncertain coastline. In the discrete object view the line becomes an area delimiting where the true coastline might be. In the field view a continuous surface defines the probability that any point is land

representations of the uncertainty associated with objects discovered in images.

Uncertainty can be measured differently under field and discrete object views.

Indeed, far from offering quick fixes for eliminating or reducing uncertainty, the measurement process can actually increase it. Given that the vector and raster data models impose quite different filters on reality, it is unsurprising that they can each generate additional uncertainty in rather different ways. In field-based conceptualizations, such as those that underlie remotely sensed images expressed as rasters, spatial objects are not defined a priori. Instead, the classification of each cell into one or other category builds together into a representation. In remote sensing, when resolution is insufficient to detect all of the detail in geographic phenomena, the term mixel is often used to describe raster cells that contain more than one class of land – in other words, elements in which the outcome of statistical classification suggests the occurrence of multiple land cover categories. The total area of cells classified as mixed should decrease as the resolution of the satellite sensor increases, assuming the number of categories remains constant, yet a completely mixel-free classification is very unlikely at any level of resolution. Even where the Earth's surface is covered with perfectly homogeneous areas, such as agricultural fields growing uniform crops, the failure of real-world crop boundaries to line up with pixel edges ensures the presence of at least some mixels. Neither does higher resolution imagery solve all problems: medium resolution data (defined as pixel size of between 30 m × 30 m and 1000 m × 1000 m) are typically classified using between 3 and 7 bands, while high resolution data (pixel sizes 10 m × 10 m or smaller) are typically classified using between 7 and 256 bands, and this can generate much greater heterogeneity of spectral values with attendant problems for classification algorithms.

A pixel whose area is divided among more than one class is termed a mixel.

The vector data structure, by contrast, defines spatial entities and specifies explicit topological relations (see Section 3.6) between them. Yet this often entails transformations of the inherent characteristics of spatial objects (Section 14.4). In conceptual terms, for example, while the true individual members of a population might each be defined as point-like objects, they will often appear in a GIS dataset only as aggregate counts for apparently uniform zones. Such aggregation can be driven by the need to preserve confidentiality of individual records, or simply by the need to limit data volume. Unlike the field conceptualization of spatial phenomena, this implies that there are good reasons for partitioning space in a particular way. In practice, partitioning of space is often made on grounds that are principally pragmatic, yet are rarely completely random (see Section 6.4). In much of socio-economic GIS, for example, zones which are designed to preserve the anonymity of survey respondents may be largely ad hoc containers. Larger aggregations are often used for the simple reason that they permit comparisons of measures over time (see Box 6.1). They may also reflect the way that a cartographer or GIS interpolates a boundary between sampled points, as in the creation of isopleth maps (Box 4.3).

6.3.2 Statistical models of uncertainty

Scientists have developed many widely used methods for describing errors in observations and measurements, and these methods may be applicable to GIS if we are willing to think of databases as collections of measurements. For example, a digital elevation model consists of a large number of measurements of the elevation of the Earth's surface. A map of land use is also in a sense a collection of measurements, because observations of the land surface have resulted in the assignment of classes to locations. Both of these are examples of observed or measured attributes, but we can also think of location as a property that is measured.

A geographic database is a collection of measurements of phenomena on or near the Earth's surface.


Here we consider errors in nominal class assignment, such as of types of land use, and errors in continuous (interval or ratio) scales, such as elevation (see Section 3.4).

6.3.2.1 Nominal case

The values of nominal data serve only to distinguish an instance of one class from an instance of another, or to identify an object uniquely. If classes have an inherent ranking they are described as ordinal data, but for purposes of simplicity the ordinal case will be treated here as if it were nominal.

Consider a single observation of nominal data – for example, the observation that a single parcel of land is being used for agriculture (this might be designated by giving the parcel Class A as its value of the 'Land Use Class' attribute). For some reason, perhaps related to the quality of the aerial photography being used to build the database, the class may have been recorded falsely as Class G, Grassland. A certain proportion of parcels that are truly Agriculture might be similarly recorded as Grassland, and we can think of this in terms of a probability, that parcels that are truly Agriculture are falsely recorded as Grassland.

Table 6.2 shows how this might work for all of the parcels in a database. Each parcel has a true class, defined by accurate observation in the field, and a recorded class as it appears in the database. The whole table is described as a confusion matrix, and instances of confusion matrices are commonly encountered in applications dominated by class data, such as classifications derived from remote sensing or aerial photography. The true class might be determined by ground check, which is inherently more accurate than classification of aerial photographs, but much more expensive and time-consuming.

Ideally all of the observations in the confusion matrix should be on the principal diagonal, in the cells that correspond to agreement between true class and database class. But in practice certain classes are more easily confused than others, so certain cells off the diagonal will have substantial numbers of entries.

A useful way to think of the confusion matrix is as a set of rows, each defining a vector of values.

Table 6.2 Example of a misclassification or confusion matrix. A grand total of 304 parcels have been checked. The rows of the table correspond to the land use class of each parcel as recorded in the database, and the columns to the class as recorded in the field. The numbers appearing on the principal diagonal of the table (from top left to bottom right) reflect correct classification

          A    B    C    D    E  Total
A        80    4    0   15    7    106
B         2   17    0    9    2     30
C        12    5    9    4    8     38
D         7    8    0   65    0     80
E         3    2    1    6   38     50
Total   104   36   10   99   55    304

The vector for row i gives the proportions of cases in which what appears to be Class i is actually Class 1, 2, 3, etc. Symbolically, this can be represented as a vector {p1, p2, . . ., pi, . . ., pn}, where n is the number of classes, and pi represents the proportion of cases for which what appears to be the class according to the database is actually Class i.

There are several ways of describing and summarizing the confusion matrix. If we focus on one row, then the table shows how a given class in the database falsely records what are actually different classes on the ground. For example, Row A shows that of 106 parcels recorded as Class A in the database, 80 were confirmed as Class A in the field, but 15 appeared to be truly Class D. The proportion of instances in the diagonal entries represents the proportion of correctly classified parcels, and the total of off-diagonal entries in the row is the proportion of entries in the database that appear to be of the row's class but are actually incorrectly classified. For example, there were only 9 instances of agreement between the database and the field in the case of Class C. If we look at the table's columns, the entries record the ways in which parcels that are truly of that class are actually recorded in the database. For example, of the 10 instances of Class C found in the field, 9 were recorded as such in the database and only 1 was misrecorded as Class E.

The columns have been called the producer's perspective, because the task of the producer of an accurate database is to minimize entries outside the diagonal cell in a given column, and the rows have been called the consumer's perspective, because they record what the contents of the database actually mean on the ground; in other words, the accuracy of the database's contents.

Users and producers of data look at misclassification in distinct ways.

For the table as a whole, the proportion of entries in diagonal cells is called the percent correctly classified (PCC), and is one possible way of summarizing the table. In this case 209/304 cases are on the diagonal, for a PCC of 68.8%. But this measure is misleading for at least two reasons. First, chance alone would produce some correct classifications, even in the worst circumstances, so it would be more meaningful if the scale were adjusted such that 0 represents chance. In this case, the number of chance hits on the diagonal in a random assignment is 76.2 (the sum of the row total times the column total divided by the grand total for each of the five diagonal cells). So the actual number of diagonal hits, 209, should be compared to this number, not 0. The more useful index of success is the kappa index, defined as:

$$\kappa = \frac{\sum_{i=1}^{n} c_{ii} \;-\; \sum_{i=1}^{n} c_{i\cdot}\, c_{\cdot i} / c_{\cdot\cdot}}{c_{\cdot\cdot} \;-\; \sum_{i=1}^{n} c_{i\cdot}\, c_{\cdot i} / c_{\cdot\cdot}}$$

where $c_{ij}$ denotes the entry in row i column j, the dots indicate summation (e.g., $c_{i\cdot}$ is the summation over

Page 13: Uncertainty - CSISS · Many geographic representations depend upon inherently vague definitions and concepts The views outlined in the previous paragraph are them-selves controversial,

UNCORRECTED PROOFS

123456789

1011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162

63646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124

CHAPTER 6 UNCERTAINTY 139

all columns for row i, that is, the row i total, and c.. is the grand total), and n is the number of classes. The first term in the numerator is the sum of all the diagonal entries (entries for which the row number and the column number are the same). To compute PCC we would simply divide this term by the grand total (the first term in the denominator). For kappa, both numerator and denominator are reduced by the same amount, an estimate of the number of hits (agreements between field and database) that would occur by chance. This involves taking each diagonal cell, multiplying the row total by the column total, and dividing by the grand total. The result is summed for each diagonal cell. In this case kappa evaluates to 58.3%, a much less optimistic assessment than PCC.
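The calculation can be sketched directly from Table 6.2. This is a simple illustration of the definitions above, not any particular package's implementation.

```python
# Sketch: PCC and the kappa index computed from a confusion matrix whose
# rows are database classes and whose columns are field (true) classes.
def pcc_and_kappa(matrix):
    n = len(matrix)
    grand = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[i][i] for i in range(n))
    # Chance agreement: row total times column total divided by grand total,
    # summed over the diagonal cells
    chance = sum(
        sum(matrix[i]) * sum(matrix[j][i] for j in range(n)) / grand
        for i in range(n)
    )
    pcc = diagonal / grand
    kappa = (diagonal - chance) / (grand - chance)
    return pcc, kappa

# The confusion matrix of Table 6.2, without the marginal totals
table_6_2 = [
    [80,  4, 0, 15,  7],
    [ 2, 17, 0,  9,  2],
    [12,  5, 9,  4,  8],
    [ 7,  8, 0, 65,  0],
    [ 3,  2, 1,  6, 38],
]
```

For Table 6.2 this reproduces the values in the text: 209 diagonal hits against 76.2 expected by chance, giving a PCC of 68.8% and a kappa of 58.3%.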

The second issue with both of these measures concerns the relative abundance of different classes. In the table, Class C is much less common than Class A. The confusion matrix is a useful way of summarizing the characteristics of nominal data, but to build it there must be some source of more accurate data. Commonly this is obtained by ground observation, and in practice the confusion matrix is created by taking samples of more accurate data, by sending observers into the field to conduct spot checks. Clearly it makes no sense to visit every parcel, and instead a sample is taken. Because some classes are commoner than others, a random sample that made every parcel equally likely to be chosen would be inefficient, because too many data would be gathered on common classes, and not enough on the relatively rare ones. So, instead, samples are usually chosen such that a roughly equal number of parcels are selected in each class. Of course these decisions must be based on the class as recorded in the database, rather than the true class. This is an instance of sampling that is stratified by class (see Section 4.4).

Sampling for accuracy assessment should pay greater attention to the classes that are rarer on the ground.
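A class-stratified sample of parcels for field checking might be drawn as follows. This is a sketch with hypothetical parcel identifiers; a real accuracy assessment involves many further design decisions.

```python
# Sketch of sampling stratified by database class: draw a roughly equal
# number of parcels from each recorded class for field checking.
import random

def stratified_sample(parcels, per_class, seed=42):
    """parcels: iterable of (parcel_id, database_class) pairs."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_class = {}
    for parcel_id, cls in parcels:
        by_class.setdefault(cls, []).append(parcel_id)
    chosen = []
    for cls in sorted(by_class):
        ids = by_class[cls]
        # Equal sample size per stratum, capped by the stratum's size
        chosen.extend((pid, cls) for pid in rng.sample(ids, min(per_class, len(ids))))
    return chosen
```

Note that the strata are the classes as recorded in the database, since the true classes are unknown until the field visits are made.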

Parcels represent a relatively easy case, if it is reasonable to assume that the land use class of a parcel is uniform over the parcel, and class is recorded as a single attribute of each parcel object. But as we noted in Sections 4.2 and 4.3, more difficult cases arise in sampling natural areas, for example in the case of vegetation cover class, where parcel boundaries do not exist. Figure 6.10 shows a typical vegetation cover class map, and is obviously highly generalized. If we were to apply the previous strategy, then we would test each area to see if its assigned vegetation cover class checks out on the ground. But unlike the parcel case, in this example the boundaries between areas are not fixed, but are themselves part of the observation process, and we need to ask whether they are correctly located. Error in this case has two forms: misallocation of an area's class and mislocation of an area's boundaries. In some cases the boundary between two areas may be fixed, because it coincides with a clearly defined line on the ground; but in other cases, the boundary's location is as much a matter of judgment as the allocation of an area's class. Burrough and Frank (1996)

Figure 6.10 An example of a vegetation cover map. Two strategies for accuracy assessment are available: to check by area (polygon), or to check by point. In the former case a strategy would be devised for field checking each area, to determine the area's correct class. In the latter, points would be sampled across the state and the correct class determined at each point

have discussed many of the implications of uncertain boundaries in GIS.

Errors in land cover maps can occur in the locations of boundaries of areas, as well as in the classification of areas.

In such cases we need a different strategy, one that captures the influence both of mislocated boundaries and of misallocated classes. One way to deal with this is to think of error not in terms of classes assigned to areas, but in terms of classes assigned to points. In a raster dataset, the cells of the raster are a reasonable substitute for individual points. Instead of asking whether area classes are confused, and estimating errors by sampling areas, we ask whether the classes assigned to raster cells are confused, and define the confusion matrix in terms of misclassified cells. This is often called per-pixel or per-point accuracy assessment, to distinguish it from the previous strategy of per-polygon accuracy assessment. As before, we would want to stratify by class, to make sure that relatively rare classes were sampled in the assessment.

6.3.2.2 Interval/ratio case

The second case addresses measurements that are made on interval or ratio scales. Here, error is best thought of not as a change of class, but as a change of value, such that the observed value x′ is equal to the true value


x plus some distortion δx, where δx is hopefully small. δx might be either positive or negative, since errors are possible in both directions. For example, the measured and recorded elevation at some point might be equal to the true elevation, distorted by some small amount. If the average distortion is zero, so that positive and negative errors balance out, the observed values are said to be unbiased, and the average value will be true.

Error in measurement can produce a change of class, or a change of value, depending on the type of measurement.

Sometimes it is helpful to distinguish between accuracy, which has to do with the magnitude of δx, and precision. Unfortunately there are several ways of defining precision in this context, at least two of which are regularly encountered in GIS. Surveyors and others concerned with measuring instruments tend to define precision through the performance of an instrument in making repeated measurements of the same phenomenon. A measuring instrument is precise according to this definition if it repeatedly gives similar measurements, whether or not these are actually accurate. So a GPS receiver might make successive measurements of the same elevation, and if these are similar the instrument is said to be precise. Precision in this case can be measured by the variability among repeated measurements. But it is possible that all of the measurements are approximately 5 m too high, in which case the measurements are said to be biased, even though they are precise, and the instrument is said to be inaccurate. Figure 6.11 illustrates this meaning of precise, and its relationship to accuracy.

The other definition of precision is more common in science generally. It defines precision as the number of digits used to report a measurement, and again it is not necessarily related to accuracy. For example, a GPS receiver might measure elevation as 51.3456 m. But if the

Figure 6.11 The term precision is often used to refer to the repeatability of measurements. In both diagrams six measurements have been taken of the same position, represented by the center of the circle. In (A) successive measurements have similar values (they are precise), but show a bias away from the correct value (they are inaccurate). In (B), precision is lower but accuracy is higher

receiver is in reality only accurate to the nearest 10 cm, three of those digits are spurious, with no real meaning. So, although the precision is one ten thousandth of a meter, the accuracy is only one tenth of a meter. Box 6.3 summarizes the rules that are used to ensure that reported measurements do not mislead by appearing to have greater accuracy than they really do.

To most scientists, precision refers to the number of significant digits used to report a measurement, but it can also refer to a measurement's repeatability.

In the interval/ratio case, the magnitude of errors is described by the root mean square error (RMSE), defined as the square root of the average squared error, or:

$$\left[\sum \delta x^2 / n\right]^{1/2}$$

Technical Box 6.3

Good practice in reporting measurements

Here are some simple rules that help to ensure that people receiving measurements from others are not misled by their apparently high precision.

1. The number of digits used to report a measurement should reflect the measurement's accuracy. For example, if a measurement is accurate to 1 m then no decimal places should be reported. The measurement 14.4 m suggests accuracy to one tenth of a meter, as does 14.0, but 14 suggests accuracy to 1 m.

2. Excess digits should be removed by rounding. Fractions above one half should be rounded up, fractions below one half should be rounded down. The following examples reflect rounding to two decimal places:

14.57803 rounds to 14.58

14.57397 rounds to 14.57

14.57999 rounds to 14.58

14.57499 rounds to 14.57

3. These rules are not effective to the left of the decimal place – for example, they give no basis for knowing whether 1400 is accurate to the nearest unit, or to the nearest hundred units.

4. If a number is known to be exactly an integer or whole number, then it is shown with no decimal point.
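Rules 1 and 2 above can be sketched in a few lines of code. This is an illustrative fragment, not from the text: the function name `report` is invented, and Python's string formatting rounds the underlying binary value half-to-even rather than strictly half-up, which agrees with rule 2 for the examples in the box but can differ on exact halves.

```python
import math

def report(value: float, accuracy: float) -> str:
    """Format a value with decimal places consistent with its accuracy
    (e.g. accuracy 0.01 -> two decimal places, accuracy 1 -> none)."""
    decimals = max(0, -int(math.floor(math.log10(accuracy))))
    return f"{value:.{decimals}f}"

print(report(14.57803, 0.01))  # 14.58
print(report(14.57397, 0.01))  # 14.57
print(report(14.4, 1.0))       # 14
```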


CHAPTER 6 UNCERTAINTY 141

where the summation is over the values of δx for all of the n observations. The RMSE is similar in a number of ways to the standard deviation of observations in a sample. Although RMSE involves taking the square root of the average squared error, it is convenient to think of it as approximately equal to the average error in each observation, whether the error is positive or negative. The US Geological Survey uses RMSE as its primary measure of the accuracy of elevations in digital elevation models, and published values range up to 7 m.
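The formula is a one-liner in code. This is a minimal sketch with invented sample errors, not data from the text:

```python
import math

def rmse(errors):
    """Root mean square error: [sum of squared errors / n] ** 0.5."""
    return math.sqrt(sum(dx * dx for dx in errors) / len(errors))

# hypothetical elevation errors (dx) in meters
print(rmse([3.0, -4.0, 0.0, 5.0]))  # about 3.54 m
```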

Although the RMSE can be thought of as capturing the magnitude of the average error, many errors will be greater than the RMSE, and many will be less. It is useful, therefore, to know how errors are distributed in magnitude – how many are large, how many are small. Statisticians have developed a series of models of error distributions, of which the commonest and most important is the Gaussian distribution, otherwise known as the error function, the 'bell curve', or the Normal distribution. Figure 6.12 shows the curve's shape. If observations are unbiased, then the mean error is zero (positive and negative errors cancel each other out), and the RMSE is also the distance from the center of the distribution (zero) to the points of inflection on either side, as shown in the figure. Let us take the example of a 7 m RMSE on elevations in a USGS digital elevation model; if error follows the Gaussian distribution, this means that some errors will be more than 7 m in magnitude, some will be less, and also that the relative abundance of errors of any given size is described by the curve shown. 68% of errors will be between −1.0 and +1.0 RMSEs, or −7 m and +7 m. In practice many distributions of error do follow the Gaussian distribution, and there are good theoretical reasons why this should be so.

The Gaussian distribution predicts the relative abundances of different magnitudes of error.

To emphasize the mathematical formality of the Gaussian distribution, its equation is shown below. The symbol σ denotes the standard deviation, µ denotes the mean (in Figure 6.12 these values are 1 and 0 respectively), and


Figure 6.12 The Gaussian or Normal distribution. The height of the curve at any value of x gives the relative abundance of observations with that value of x. The area under the curve between any two values of x gives the probability that observations will fall in that range. The range between −1 standard deviation and +1 standard deviation is in blue. It encloses 68% of the area under the curve, indicating that 68% of observations will fall between these limits

exp is the exponential function, or '2.71828 to the power of'. Scientists believe that it applies very broadly, and that many instances of measurement error adhere closely to the distribution, because it is grounded in rigorous theory. It can be shown mathematically that the distribution arises whenever a large number of random factors contribute to error, and the effects of these factors combine additively – that is, a given effect makes the same additive contribution to error whatever the specific values of the other factors. For example, error might be introduced in the use of a steel tape measure over a large number of measurements because some observers consistently pull the tape very taut, or hold it very straight, or fastidiously keep it horizontal, or keep it cool, and others do not. If the combined effects of these considerations always contribute the same amount of error (e.g., +1 cm, or −2 cm), then this contribution to error is said to be additive.

f(x) = (1 / (σ√(2π))) exp[−(x − µ)² / (2σ²)]

We can apply this idea to determine the inherent uncertainty in the locations of contours. The US Geological Survey routinely evaluates the accuracies of its digital elevation models (DEMs), by comparing the elevations recorded in the database with those at the same locations in more accurate sources, for a sample of points. The differences are summarized in an RMSE, and in this example we will assume that errors have a Gaussian distribution with zero mean and a 7 m RMSE. Consider a measurement of 350 m. According to the error model, the truth might be as high as 360 m, or as low as 340 m, and the relative frequencies of any particular error value are as predicted by the Gaussian distribution with a mean of zero and a standard deviation of 7. If we take error into account, using the Gaussian distribution with an RMSE of 7 m, it is no longer clear that a measurement of 350 m lies exactly on the 350 m contour. Instead, the truth might be 340 m, or 360 m, or 355 m. Figure 6.13 shows the implications of this in terms of the location of this contour in a real-world example. 95% of errors would put the contour within the colored zone. In areas colored red the observed value is less than 350 m, but the truth might be 350 m; in areas colored green the observed value is more than 350 m, but the truth might be 350 m. There is a 5% chance that the true location of the contour lies outside the colored zone entirely.
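The percentages used in this error model follow directly from the Gaussian distribution and can be checked with the standard error function. A minimal sketch, using only the stdlib `math.erf`:

```python
import math

def within(k):
    """P(|error| < k * RMSE) under a zero-mean Gaussian error model."""
    return math.erf(k / math.sqrt(2.0))

print(round(within(1.0), 3))   # 0.683: the 68% quoted for +/-1 RMSE
print(round(within(1.96), 2))  # 0.95: the basis of the 95% zone in Figure 6.13
```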

6.3.3 Positional error

In the case of measurements of position, it is possible for every coordinate to be subject to error. In the two-dimensional case, a measured position (x′, y′) would be subject to errors in both x and y; specifically, we might write x′ = x + δx, y′ = y + δy, and similarly in the three-dimensional case where all three coordinates are measured, z′ = z + δz. The bivariate Gaussian distribution describes errors in the two horizontal dimensions, and it can be generalized to the three-dimensional case. Normally, we would expect the RMSEs of x and y to be


142 PART II PRINCIPLES

Figure 6.13 Uncertainty in the location of the 350 m contour in the area of State College, Pennsylvania, generated from a US Geological Survey DEM with an assumed RMSE of 7 m. According to the Gaussian distribution with a mean of 350 m and a standard deviation of 7 m, there is a 95% probability that the true location of the 350 m contour lies in the colored area, and a 5% probability that it lies outside (Source: Hunter G. J. and Goodchild M. F. 1995 'Dealing with error in spatial databases: a simple case study'. Photogrammetric Engineering and Remote Sensing 61: 529–37)

the same, but z is often subject to errors of quite different magnitude, for example in the case of determinations of position using GPS. The bivariate Gaussian distribution also allows for correlation between the errors in x and y, but normally there is little reason to expect correlations.

Because it involves two variables, the bivariate Gaussian distribution has somewhat different properties from the simple (univariate) Gaussian distribution. 68% of cases lie within one standard deviation for the univariate case (Figure 6.12). But in the bivariate case with equal standard errors in x and y, only 39% of cases lie within a circle of this radius. Similarly, 95% of cases lie within two standard deviations for the univariate distribution, but it is necessary to go to a circle of radius equal to 2.15 times the x or y standard deviations to enclose 90% of the bivariate distribution, and 2.45 times the standard deviations for 95%.
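These bivariate figures can be verified from the closed form: for independent, equal-sigma Gaussian errors in x and y, the probability that the error vector falls within a circle of radius k standard deviations is 1 − exp(−k²/2) (the Rayleigh distribution). A short check, assuming that error model:

```python
import math

def circle_prob(k):
    """P(error vector within a circle of radius k*sigma), equal-sigma
    independent Gaussian errors in x and y (Rayleigh distribution)."""
    return 1.0 - math.exp(-k * k / 2.0)

print(round(circle_prob(1.0), 2))   # 0.39: only 39% within one sigma
print(round(circle_prob(2.15), 2))  # 0.90
print(round(circle_prob(2.45), 2))  # 0.95
```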

National Map Accuracy Standards often prescribe the positional errors that are allowed in databases. For example, the 1947 US National Map Accuracy Standard specified that 95% of errors should fall below 1/30 inch (0.85 mm) for maps at scales of 1:20 000 and finer (more detailed), and 1/50 inch (0.51 mm) for other maps (coarser, less detailed levels of granularity than 1:20 000). A convenient rule of thumb is that positions measured from maps are subject to errors of up to 0.5 mm at the scale of the map. Table 6.3 shows the distance on the ground corresponding to 0.5 mm for various common map scales.

A useful rule of thumb is that features on maps are positioned to an accuracy of about 0.5 mm.

6.3.4 The spatial structure of errors

The confusion matrix, or more specifically a single row of the matrix, along with the Gaussian distribution, provide convenient ways of describing the error present in a single


Table 6.3 A useful rule of thumb is that positions measured from maps are accurate to about 0.5 mm on the map. Multiplying this by the scale of the map gives the corresponding distance on the ground

Map scale      Ground distance corresponding to 0.5 mm map distance
1:1250         62.5 cm
1:2500         1.25 m
1:5000         2.5 m
1:10 000       5 m
1:24 000       12 m
1:50 000       25 m
1:100 000      50 m
1:250 000      125 m
1:1 000 000    500 m
1:10 000 000   5 km
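The rule of thumb behind Table 6.3 is a one-line calculation: multiply 0.5 mm by the map scale denominator. A small sketch (the helper name is invented):

```python
def ground_error_m(scale_denominator, map_error_mm=0.5):
    """Ground distance (m) corresponding to a map distance in mm
    at the given scale, e.g. 24000 for a 1:24 000 map."""
    return map_error_mm / 1000.0 * scale_denominator

print(ground_error_m(24_000))  # about 12 m, the 1:24 000 row of the table
print(ground_error_m(1_250))   # about 0.625 m, i.e., 62.5 cm
```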

observation of a nominal or interval/ratio measurement respectively. When a GIS is used to respond to a simple query, such as 'tell me the class of soil at this point', or 'what is the elevation here?', then these methods are good ways of describing the uncertainty inherent in the response. For example, a GIS might respond to the first query with the information 'Class A, with a 30% probability of Class C', and to the second query with the information '350 m, with an RMSE of 7 m'. Notice how this makes it possible to describe nominal data as accurate to a percentage, but it makes no sense to describe a DEM, or any measurement on an interval/ratio scale, as accurate to a percentage. For example, we cannot meaningfully say that a DEM is '90% accurate'.

However, many GIS operations involve more than the properties of single points, and this makes the analysis of error much more complex. For example, consider the query 'how far is it from this point to that point?'. Suppose the two points are both subject to error of position, because their positions have been measured using GPS units with mean distance errors of 50 m. If the two measurements were taken some time apart, with different combinations of satellites above the horizon, it is likely that the errors are independent of each other, such that one error might be 50 m in the direction of North, and the other 50 m in the direction of South. Depending on the locations of the two points, the error in distance might be as high as +100 m. On the other hand, if the two measurements were made close together in time, with the same satellites above the horizon, it is likely that the two errors would be similar, perhaps 50 m North and 40 m North, leading to an error of only 10 m in the determination of distance. The difference between these two situations can be measured in terms of the degree of spatial autocorrelation, or the interdependence of errors at different points in space (Section 4.6).

The spatial autocorrelation of errors can be as important as their magnitude in many GIS operations.

Spatial autocorrelation is also important in errors in nominal data. Consider a field that is known to contain a single crop, perhaps wheat. When seen from above, it is possible to confuse wheat with other crops, so there may be error in the crop type assigned to points in the field. But since the field has only one crop, we know that such errors are likely to be strongly correlated. Spatial autocorrelation is almost always present in errors to some degree, but very few efforts have been made to measure it systematically, and as a result it is difficult to make good estimates of the uncertainties associated with many GIS operations.

An easy way to visualize spatial autocorrelation and interdependence is through animation. Each frame in the animation is a single possible map, or realization of the error process. If a point is subject to uncertainty, each realization will show the point in a different possible location, and a sequence of images will show the point shaking around its mean position. If two points have perfectly correlated positional errors, then they will appear to shake in unison, as if they were at the ends of a stiff rod. If errors are only partially correlated, then the system behaves as if the connecting rod were somewhat elastic.

The spatial structure or autocorrelation of errors is important in many ways. DEM data are often used to estimate the slope of terrain, and this is done by comparing elevations at points a short distance apart. For example, if the elevations at two points 10 m apart are 30 m and 35 m respectively, the slope along the line between them is 5/10, or 0.5. (A somewhat more complex method is used in practice, to estimate slope at a point in the x and y directions in a DEM raster, by analyzing the elevations of nine points – the point itself and its eight neighbors. The equations in Section 14.4 detail the procedure.)

Now consider the effects of errors in these two elevation measurements on the estimate of slope. Suppose the first point (elevation 30 m) is subject to an RMSE of 2 m, and consider possible true elevations of 28 m and 32 m. Similarly the second point might have true elevations of 33 m and 37 m. We now have four possible combinations of values, and the corresponding estimates of slope range from (33 − 32)/10 = 0.1 to (37 − 28)/10 = 0.9. In other words, a relatively small amount of error in elevation can produce wildly varying slope estimates.
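The four combinations above can be enumerated directly. A minimal sketch of this worked example:

```python
from itertools import product

# True elevations one RMSE (2 m) either side of the measured 30 m and 35 m,
# points 10 m apart; slope = (z2 - z1) / distance.
slopes = [(z2 - z1) / 10.0
          for z1, z2 in product((28.0, 32.0), (33.0, 37.0))]

print(min(slopes), max(slopes))  # 0.1 to 0.9, as in the text
```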

The spatial autocorrelation between errors in geographic databases helps to minimize their impacts on many GIS operations.

What saves us in this situation, and makes estimation of slope from DEMs a practical proposition at all, is spatial autocorrelation among the errors. In reality, although DEMs are subject to substantial errors in absolute elevation, neighboring points nevertheless tend to have similar errors, and errors tend to persist over quite large areas. Most of the sources of error in the DEM production process tend to produce this kind of persistence of error over space, including errors due to misregistration of aerial photographs. In other words, errors in DEMs exhibit strong positive spatial autocorrelation.


Another important corollary of positive spatial autocorrelation can also be illustrated using DEMs. Suppose an area of low-lying land is inundated by flooding, and our task is to estimate the area of land affected. We are asked to do this using a DEM, which is known to have an RMSE of 2 m (compare Figure 6.13). Suppose the data points in the DEM are 30 m apart, and preliminary analysis shows that 100 points have elevations below the flood line. We might conclude that the area flooded is the area represented by these 100 points, or 900 × 100 sq m, or 9 hectares. But because of errors, it is possible that some of this area is actually above the flood line (we will ignore the possibility that other areas outside this may also be below the flood line, also because of errors), and it is possible that all of the area is above. Suppose the recorded elevation for each of the 100 points is 2 m below the flood line. This is one RMSE (recall that the RMSE is equal to 2 m) below the flood line, and the Gaussian distribution tells us that the chance that the true elevation is actually above the flood line is approximately 16% (see Figure 6.12). But what is the chance that all 100 points are actually above the flood line?

Here again the answer depends on the degree of spatial autocorrelation among the errors. If there is none, in other words if the error at each of the 100 points is independent of the errors at its neighbors, then the answer is (0.16)^100, or 1 chance in 1 followed by roughly 80 zeroes. But if there is strong positive spatial autocorrelation, so strong that all 100 points are subject to exactly the same error, then the answer is 0.16. One way to think about this is in terms of degrees of freedom. If the errors are independent, they can vary in 100 independent ways, depending on the error at each point. But if they are strongly spatially autocorrelated, the effective number of degrees of freedom is much less, and may be as few as 1 if all errors behave in unison. Spatial autocorrelation has the effect of reducing the number of degrees of freedom in geographic data below what may be implied by the volume of information, in this case the number of points in the DEM.
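The two extremes just described amount to a one-line calculation each. A sketch of the contrast:

```python
p_single = 0.16               # chance one point is really above the flood line
independent = p_single ** 100 # all 100 points, errors fully independent
correlated = p_single         # all 100 points share exactly the same error

print(f"{independent:.1e}")   # on the order of 1e-80: vanishingly small
print(correlated)             # 0.16
```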

Spatial autocorrelation acts to reduce the effective number of degrees of freedom in geographic data.

6.4 U3: Further uncertainty in the analysis of geographic phenomena

6.4.1 Internal and external validation through spatial analysis

In Chapter 1 we identified one remit of GIS as the resolution of scientific or decision-making problems through spatial analysis, which we defined in Section 1.7 as 'the process by which we turn raw spatial data

into useful spatial information'. Good science needs secure foundations, yet Sections 6.2 and 6.3 have shown the conception and measurement of many geographic phenomena to be inherently uncertain. How can the outcome of spatial analysis be meaningful if it has such uncertain foundations?

Uncertainties in data lead to uncertainties in the results of analysis.

Once again, there are no easy answers to this question, although we can begin by examining the consequences of accommodating possible errors of positioning, or of aggregating clearly defined units of analysis into artificial geographic individuals (as when people are aggregated by census tracts, or disease incidences are aggregated by county). In so doing, we will illustrate how potential problems might arise, but will not present any definitive solutions – for the simple reason that the truth is inherently uncertain. The conception, measurement, and representation of geographic individuals may distort the outcome of spatial analysis by masking or accentuating apparent variation across space, or by restricting the nature and range of questions that can be asked of the GIS.

There are three ways of dealing with this risk. First, although we can only rarely tackle the source of distortion (we are rarely empowered to collect new, completely disaggregate data, for example), we can quantify the way in which it is likely to operate (or propagate) within the GIS, and can gauge the magnitude of its likely impacts. Second, although we may have to work with areally aggregated data, GIS allows us to model within-zone spatial distributions in order to ameliorate the worst effects of artificial zonation. Taken together, GIS allows us to gauge the effects of scale and aggregation through simulation of different possible outcomes. This is internal validation of the effects of scale, point placement, and spatial partitioning.

Because of the power of GIS to merge diverse data sources, it also provides a means of external validation of the effects of zonal averaging. In today's advanced GIService economy (Section 1.5.3), there may be other data sources that can be used to gauge the effects of aggregation upon our analysis. In Chapter 13 we will refine the basic model that was presented in Figure 6.1 to consider how GIS provides a medium for visualizing models of spatial distributions and patterns of homogeneity and heterogeneity.

GIS gives us maximum flexibility when working with aggregate data, and helps us to validate our data with reference to other available sources.

6.4.2 Internal validation: error propagation

The examples of Section 6.3.4 are cases of error propagation, where the objective is to measure the effects of known levels of data uncertainty on the outputs of


Figure 6.14 Error in the measurement of the area of a square 100 m on each side. Each of the four corner points has been surveyed; the errors are subject to bivariate Gaussian distributions with standard deviations in x and y of 1 m (dashed circles). The red polygon shows one possible surveyed square (one realization of the error model)

GIS operations. We have seen how the spatial structure of errors plays a role, and how the existence of strong positive spatial autocorrelation reduces the effects of uncertainty upon estimates of properties such as slope or area. Yet the cumulative effects of error can also produce impacts that are surprisingly large, and some of the examples in this section have been chosen to illustrate the substantial uncertainties that can be produced by apparently innocuous data errors.

Error propagation measures the impacts of uncertainty in data on the results of GIS operations.

In general two strategies are available for evaluating error propagation. The examples in the previous section were instances in which it was possible to obtain a complete description of error effects based upon known measures of likely error. These enable a complete analysis of uncertainty in slope estimation, and can be applied in the DEM flooding example described in Section 6.3.4. Another example that is amenable to analysis is the calculation of the area of a polygon given knowledge of the positional uncertainties of its vertices.

For example, Figure 6.14 shows a square approximately 100 m on each side. Suppose the square has been surveyed by determining the locations of its four corner points using GPS, and suppose the circumstances of the measurements are such that there is an RMSE of 1 m in both coordinates of all four points, and that errors are independent. Suppose our task is to determine the area of the square. A GIS can do this easily, using a standard algorithm (see Figure 14.9). Computers are precise (in the sense of Box 6.3), and capable of working to many significant digits, so the calculation might be reported by printing out a number to eight digits, such as 10014.603 sq m, or even more. But the number of significant digits will have been determined by the precision of the machine, and not by the accuracy of the determination. Box 6.3 summarized some simple rules for ensuring that the precision used to report a measurement reflects as far as possible its accuracy, and clearly those rules will have been violated if the area is reported to eight digits. But what is the appropriate precision?

In this case we can determine exactly how positional accuracy affects the estimate of area. It turns out that area has an error distribution which is Gaussian, with a standard deviation (RMSE) in this case of 200 sq m – in other words, each attempt to measure the area will give a different result, the variation between them having a standard deviation of 200 sq m. This means that the five rightmost digits in the estimate are spurious, including two digits to the left of the decimal point. So if we were to follow the rules of Box 6.3, we would print 10 000 rather than 10014.603 (note the problem with standard notation here, which doesn't let us omit digits to the left of the decimal point even if they are spurious, and so leaves some uncertainty about whether the tens and units digits are certain or not – and note also the danger that if the number is printed as an integer it may be interpreted as exactly the whole number). We can also turn the question around and ask how accurately the points would have to be measured to justify eight digits, and the answer is approximately 0.01 mm, far beyond the capabilities of normal surveying practice.
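This example can also be explored by brute force. The sketch below is an independent-errors Monte Carlo experiment, not the book's analytic derivation: it perturbs each corner coordinate with Gaussian noise of 1 m, computes the area with the standard shoelace formula (a common polygon-area algorithm; compare Figure 14.9), and looks at the spread of the results, which comes out on the same order of magnitude as the standard deviation quoted in the text.

```python
import random
import statistics

def shoelace_area(pts):
    """Area of a simple polygon from its vertices (shoelace formula)."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0

corners = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
random.seed(42)
areas = [shoelace_area([(x + random.gauss(0, 1), y + random.gauss(0, 1))
                        for x, y in corners])
         for _ in range(10_000)]

print(round(statistics.mean(areas)))   # close to 10 000 sq m
print(round(statistics.stdev(areas)))  # spread on the order of 10^2 sq m
```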

Analysis of this kind can be applied to many other GIS operations, and Gerard Heuvelink (Box 6.4) discusses several further examples in his excellent text on error propagation in GIS (Heuvelink 1998). But analysis is a difficult strategy when spatial autocorrelation of errors is present, and many problems of error propagation in GIS are not amenable to analysis. This has led many researchers to explore a more general strategy of simulation to evaluate the impacts of uncertainty on results.

In essence, simulation requires the generation of a series of realizations, as defined earlier, and it is often called Monte Carlo simulation in reference to the realizations that occur when dice are tossed or cards are dealt in various games of chance. For example, we could simulate error in a single measurement from a DEM by generating a series of numbers with a mean equal to the measured elevation, a standard deviation equal to the known RMSE, and a Gaussian distribution. Simulation uses everything that is known about a situation, so if any additional information is available we would incorporate it in the simulation. For example, we might know that elevations must be whole numbers of meters, and would simulate this by rounding the numbers obtained from the Gaussian distribution. With a mean of 350 m and an RMSE of 7 m the results of the simulation might be 341, 352, 356, 339, 349, 348, 355, 350, . . .
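The single-measurement simulation just described is a few lines of code. This sketch draws Gaussian realizations with the measured elevation as mean and the RMSE as standard deviation, rounded to whole meters as suggested; the particular values it prints depend on the random seed and will not match the illustrative sequence in the text.

```python
import random

random.seed(0)
# Realizations of a 350 m measurement with a 7 m RMSE, whole meters only
realizations = [round(random.gauss(350.0, 7.0)) for _ in range(8)]
print(realizations)
```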

Simulation is an intuitively simple way of getting the uncertainty message across.

Because of spatial autocorrelation, it is impossible in most circumstances to think of databases as decomposable into component parts, each of which can be independently disturbed to create alternative realizations, as in the previous example. Instead, we have to think of



Figure 6.15 Three realizations of a model simulating the effects of error on a digital elevation model. The three datasets differ only to a degree consistent with known error. Error has been simulated using a model designed to replicate the known error properties of this dataset – the distribution of error magnitude, and the spatial autocorrelation between errors. (Courtesy of Ashton Shortridge)

the entire database as a realization, and create alternative realizations of the database's contents that preserve spatial autocorrelation. Figure 6.15 shows an example, simulating the effects of uncertainty on a digital elevation model. Each of the three realizations is a complete map, and the simulation process has faithfully replicated the strong correlations present in errors across the DEM.

6.4.3 Internal validation: aggregation and analysis

We have seen already that a fundamental difference between geography and other scientific disciplines is that the definition of its objects of study is only rarely unambiguous and, in practice, rarely precedes our


Biographical Box 6.4

Gerard Heuvelink, geostatistician

Figure 6.16 Gerard Heuvelink, geostatistician

Understanding the limitations of spatial data and spatial models is essential both for managing environmental systems effectively and for encouraging safe use of GIS. Gerard Heuvelink (Figure 6.16) of the Wageningen University and Research Centre, the Netherlands, has dedicated much of his scientific career to this end, through statistical modeling of the uncertainty in spatial data and analysis of the ways in which uncertainty is propagated through GIS.

Trained as a mathematician, Gerard undertook a Ph.D. in Physical Geography working with Professor Peter Burrough of Utrecht University. His 1998 research monograph Error Propagation in Environmental Modelling with GIS has subsequently become the key reference in spatial uncertainty analysis. Gerard is firmly of the view that GI scientists should pay more attention to statistical validation and exploration of data, and he is actively involved in a series of symposia on 'Spatial Accuracy Assessment in Natural Resources and Environmental Sciences' (www.spatial-accuracy.org).

Gerard's background in mathematics and statistics has left him with the view that spatial uncertainty analysis requires a sound statistical basis. In his view, understanding uncertainty in the position of spatial objects and in their attribute values entails use of probability distribution functions, and measuring spatial autocorrelation (Section 4.6) with uncertainties in other objects in spatial databases. He says: 'I remain disappointed with the amount of progress made in understanding the fundamental problems of uncertainty over the last fifteen years. We have moved forward in the sense that we now have a broader view of various aspects of spatial data quality. The 1980s and early 1990s were dedicated to technical topics such as uncertainty propagation in map overlay operations and the development of statistical models for representing positional uncertainty. More recently the research community has addressed a range of user-centric topics, such as visualization and communication of uncertainty, decision making under uncertainty and the development of error-aware GIS. But these developments do not hide the fact that we still do not have the statistical basics right. Until this is achieved, we run the risk of building elaborate representations on weak and uncertain foundations.'

Gerard and co-worker James Brown of the University of Amsterdam are working to fill this gap by developing a general probabilistic framework for characterizing uncertainty in the positions and attribute values of spatial objects.

attempts to measure their characteristics. In socioeconomic GIS applications, these objects of study (geographic individuals) are usually aggregations, since the spaces that human individuals occupy are geographically unique, and confidentiality restrictions usually dictate that uniquely attributable information must be anonymized in some way. Even in natural-environment applications, the nature of sampling in the data collection process (Section 4.4) often makes it expedient to collect data pertaining to aggregations of one kind or another. Thus in socioeconomic and environmental applications alike, the measurement of geographic individuals is unlikely to be determined with the end point of particular spatial-analysis applications in mind. As a consequence, we cannot be certain in ascribing even dominant characteristics of areas to true individuals or point locations in those areas. This source of uncertainty is known as the ecological fallacy, and has long bedevilled the analysis of spatial distributions (the opposite of the ecological fallacy is the atomistic fallacy, in which the individual is considered in isolation from his or her environment). This is illustrated in Figure 6.17.

Inappropriate inference from aggregate data about the characteristics of individuals is termed the ecological fallacy.

We have also seen that the scale at which geographic individuals are conceived conditions our measures of association between the mosaic of zones represented within a GIS. Yet even when scale is fixed, there is a multitude of ways in which basic areal units of analysis can be aggregated into zones, and the requirement of spatial contiguity represents only a weak constraint upon the huge combinatorial range. This gives rise to the related aggregation or zonation problem, in which different combinations of a given number of geographic individuals into coarser-scale areal units can yield widely different results. In a classic 1984 study, the geographer Stan

UNCORRECTED PROOFS

148 PART II PRINCIPLES

Figure 6.17 The problem of ecological fallacy. (A) Map of Anytown showing Chinatown and the footwear factory (scale bar: 1 km); (B) unemployment (>12%, 6–12%, <6%); (C) Chinese ethnic origin (>10%, 2–10%, <2%). Before it closed down, the Anytown footwear factory drew its labor from blue-collar neighborhoods in its south and west sectors. Its closure led to high local unemployment, but not amongst the residents of Chinatown, who remain employed in service industries. Yet comparison of choropleth maps B and C suggests a spurious relationship between Chinese ethnicity and unemployment

Openshaw applied correlation and regression analysis to the attributes of a succession of zoning schemes. He demonstrated that the constellation of elemental

zones within aggregated areal units could be used to manipulate the results of spatial analysis to a wide range of quite different prespecified outcomes. These numerical experiments have some sinister counterparts in the real world, the most notorious example of which is the political gerrymander of 1812 (see Section 14.3.2). Chance or design might therefore conspire to create apparent spatial distributions which are unrepresentative of the scale and configuration of real-world geographic phenomena. The outcome of multivariate spatial analysis is similarly sensitive to the particular zonal scheme that is used. Taken together, the effects of scale and aggregation are generally known as the Modifiable Areal Unit Problem (MAUP).

The ecological fallacy and the MAUP have long been recognized as problems in applied spatial analysis and, through the concept of spatial autocorrelation (Section 4.3), they are also understood to be related problems. Increased technical capacity for numerical processing and innovations in scientific visualization have refined the quantification and mapping of these measurement effects, and have also focused interest on the effects of within-area spatial distributions upon analysis.
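The sensitivity of aggregate statistics to the choice of zoning scheme can be illustrated with a small numerical sketch in the spirit of Openshaw's experiments. Everything here is invented for illustration: sixteen hypothetical elemental zones on a 4 × 4 grid, two synthetic attributes, and two alternative contiguous aggregations of the same cells into four coarser zones.

```python
import numpy as np

rng = np.random.default_rng(42)

# Attribute values for 16 elemental zones arranged in a 4 x 4 grid
x = rng.normal(50, 10, 16)      # hypothetical attribute, e.g., % aged over 65
y = x + rng.normal(0, 15, 16)   # a second, weakly related attribute

def aggregate(values, zoning):
    """Average elemental-zone values within each coarser zone."""
    return np.array([values[list(zone)].mean() for zone in zoning])

# Two alternative contiguous groupings of the 16 cells into 4 zones
zoning_a = [(0, 1, 4, 5), (2, 3, 6, 7), (8, 9, 12, 13), (10, 11, 14, 15)]  # 2x2 blocks
zoning_b = [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11), (12, 13, 14, 15)]  # rows

r_elemental = np.corrcoef(x, y)[0, 1]
r_a = np.corrcoef(aggregate(x, zoning_a), aggregate(y, zoning_a))[0, 1]
r_b = np.corrcoef(aggregate(x, zoning_b), aggregate(y, zoning_b))[0, 1]

print(f"r (16 elemental zones): {r_elemental:.2f}")
print(f"r (zoning A, 4 zones):  {r_a:.2f}")
print(f"r (zoning B, 4 zones):  {r_b:.2f}")
```

Running this shows that the same underlying data yield different correlation coefficients under different aggregations, which is the essence of the zonation component of the MAUP; with enough candidate zonings, a desired result can often be engineered.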

6.4.4 External validation: data integration and shared lineage

Goodchild and Longley (1999) use the term concatenation to describe the integration of two or more different data sources, such that the contents of each are accessible in the product. The polygon overlay operation that will be discussed in Section 14.4.3, and its field-view counterpart, is one simple form of concatenation. The term conflation is used to describe the range of functions that attempt to overcome differences between datasets, or to merge their contents (as with rubber-sheeting: see Section 9.*). Conflation thus attempts to replace two or more versions of the same information with a single version that reflects the pooling, or weighted averaging, of the sources.

The individual items of information in a single geographic dataset often share lineage, in the sense that more than one item is affected by the same error. This happens, for example, when a map or photograph is registered poorly, since all of the data derived from it will have the same error. One indicator of shared lineage is the persistence of error, because all points derived from the same misregistration will be displaced by the same, or a similar, amount. Because neighboring points are more likely to share lineage than distant points, errors tend to exhibit strong positive spatial autocorrelation.

Conflation combines the information from two data sources into a single source.

When two datasets that share no common lineage are concatenated (for example, they have not been subject to the same misregistration), then the relative positions of objects inherit the absolute positional errors of both, even over the shortest distances. While the shapes of objects in each dataset may be accurate, the relative locations

of pairs of neighboring objects may be wildly inaccurate when drawn from different datasets. The anecdotal history of GIS is full of examples of datasets which were perfectly adequate for one application, but which failed completely when an application required that they be merged with some new dataset that had no common lineage. For example, merging GPS measurements of point positions with streets derived from the US Bureau of the Census TIGER files may lead to surprises where points appear on the wrong sides of streets. If the absolute positional accuracy of a dataset is 50 m, as it is with parts of the TIGER database, points located less than 50 m from the nearest street will frequently appear to be misregistered.

Datasets with different lineages often reveal unsuspected errors when overlaid.
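The difference between shared and independent lineage can be sketched with a short Monte Carlo experiment. The 50 m error figure echoes the TIGER example above; the Gaussian model of positional error is our assumption, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000      # Monte Carlo trials
sigma = 50.0    # absolute positional error (m), standard deviation per axis

# Case 1: shared lineage -- two neighboring points displaced by the SAME
# registration error, so their relative positions are preserved exactly
shared_shift = rng.normal(0, sigma, (n, 2))
rel_shared = np.linalg.norm(shared_shift - shared_shift, axis=1)  # always zero

# Case 2: no common lineage -- each point inherits an independent error,
# so the relative error is large even over the shortest distances
err_p1 = rng.normal(0, sigma, (n, 2))
err_p2 = rng.normal(0, sigma, (n, 2))
rel_indep = np.linalg.norm(err_p1 - err_p2, axis=1)

print(f"mean relative error, shared lineage:      {rel_shared.mean():6.1f} m")
print(f"mean relative error, independent lineage: {rel_indep.mean():6.1f} m")
```

Under shared lineage the relative error is zero, however large the absolute displacement; with independent errors, the mean relative displacement is substantially larger than either dataset's own accuracy figure (for this Gaussian model, close to 89 m), which is why points can end up on the wrong sides of streets.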

Figure 6.18 shows an example of the consequences of overlaying data with different lineages. In this case, two datasets of streets produced by different commercial vendors using their own processes fail to match in position by amounts of up to 100 m, and also fail to match in the names of many streets, and even in the existence of some streets.

The integrative functionality of GIS makes it an attractive possibility to generate multivariate indicators from diverse sources. Yet such data are likely to have been collected at a range of different scales, and for a range of areal units as diverse as census tracts, river catchments, land ownership parcels, travel-to-work areas, and market research surveys. Established procedures of statistical inference can only be used to reason from representative samples to the populations from which they were drawn. Yet these procedures do not regulate the assignment of inferred values to (usually smaller) zones, or their apportionment to ad hoc regional categorizations. There is an emergent tension within the socioeconomic realm, for there is a limit to the uses of inferences drawn from conventional, scientifically valid data sources which are

Figure 6.18 Overlay of two street databases for part of Goleta, California, USA. The red and green lines fail to match by as much as 100 m. Note also that in some cases streets in one dataset fail to appear in the other, or have different connections. The background is dark where the fit is best and white where it is poorest (it measures the average distance locally between matched intersections)

frequently out-of-date, zonally coarse, and irrelevant to what is happening in modern societies. Yet the alternative of using new rich sources of marketing data may be profoundly unscientific in its inferential procedures.

6.4.5 Internal and external validation: induction and deduction

Reformulation of the MAUP into a geocomputational (Box 1.* and Section 16.1) approach to zone design has been one of the key contributions of geographer Stan Openshaw. Central to this is inductive use of GIS to seek patterns through repeated scaling and aggregation experiments, alongside much better external validation, deduced using the multitude of new datasets that are a hallmark of the information age.

The Modifiable Areal Unit Problem can be investigated through simulation of large numbers of alternative zoning schemes.

Neither of these approaches, used in isolation, is likely to resolve the uncertainties inherent in spatial analysis. Zone design experiments are merely playing with the MAUP, and most of the new sources of external validation are unlikely to sustain full scientific scrutiny, particularly if they were assembled through non-rigorous survey designs. The conception and measurement of elemental zones, the geographic individuals, may be ad hoc, but it is rarely wholly random either. Can our recognition and understanding of the empirical effects of the MAUP help us to neutralize its effects? Not really. In measuring the distribution of all possible zonally averaged outcomes ('simple random zoning', in analogy to simple random sampling in Section 4.4), there is no tenable analogy with the established procedures of statistical inference and its concepts of precision and error. And even if there were, as we have seen in Section 4.7 there are limits to the application of classic statistical inference to spatial data.

Zoning seems similar to sampling, but its effects are very different.

The way forward seems to be to complement our new-found abilities to customize zoning schemes in GIS with external validation of data and clearer application-centered thinking about the likely degree of within-zone heterogeneity that is concealed in our aggregated data. In this view, the MAUP will disappear if GIS analysts understand the particular areal units that they wish to study. There is also a sense here that resolution of the MAUP requires acknowledgement of the uniqueness of places. There is also a practical recognition that the areal objects of study are ever-changing, and our perceptions of what constitutes their appropriate definition will change. And finally, within the socio-economic realm, the act of defining zones can also be self-validating if the allocation of individuals affects the interventions they receive (be they a mail-shot about a shopping opportunity or aid under an areal policy intervention). Spatial discrimination affects

Applications Box 6.5

Uncertainty and town center definition

Although the locus of retail activity in many parts of the US long ago shifted to suburban locations, traditional town centers ('downtowns') remain vibrant and are cherished in most of the rest of the world. Indeed, many nations vigorously defend existing retail centers and through various planning devices seek to regulate 'out of center' development.

Therefore, many interests in the planning and retail sectors are naturally interested in learning the precise extent of existing town centers. The pressure to devise standard definitions across its national jurisdiction led the UK's central government planning agency (the Office of the Deputy Prime Minister) to initiate a five-year research program to define and monitor changes in the shape, form, and internal geography of town centers across the nation. The work has been based at the Centre for Advanced Spatial Analysis (CASA) at University College London.

Town centers present classic examples of geographic phenomena with uncertain boundaries. Moreover, the extent of any given town center is likely to change over time, for example in response to economic fortunes consequent upon national business cycles. Candidate indicator variables of town centeredness might include tall buildings, pedestrian traffic, high levels of retail employment, and high retail floorspace figures.

After a consultation period with user groups (in the spirit of public participation in GIS, PPGIS: see Section 13.*), a set of the most pertinent indicators was agreed (the conception stage). These indicator variables measured retail and hospitality industry employment, shop and office floorspace, and retail, leisure, and service employment. The indicator measures were standardized, weighted, and summed into a summary index measure. This measure, mapped for all town centers, was the principal deliverable

Figure 6.19 (A) Camden Town Center, London (courtesy Jamie Quinn); (B) a data surface representing the index of town center activity (the darker shades of red indicate greater levels of retail activity); and (C) the Camden Town Center report: the Camden Town Center boundary is blue, whilst the orange lines denote the retail core of Camden and of nearby town centers; the darker shades of red again indicate greater levels of retail activity


Figure 6.19 (continued)

of the research. After further consultation, the CASA team chose to represent the 'degree of town centeredness' as a field variable. This choice reflected various priorities, including the need to maintain confidentiality of those data that were not in the public domain, an attempt to avoid the worst effects of the MAUP, and the need to communicate the rather complex concept of the 'degree of town centeredness' to an audience of 'spatially unaware professionals' (see Section 1.4.3.2). The datasets used in the projects each represent populations (not samples), and so kernel density estimation (Section 14.4.5) was used to create the

composite surfaces: the size of the kernel was subjectively set at 300 meters, because of the resolution of the data and on the basis of the empirical observation that this is the maximum distance that most shoppers are prepared to walk.
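A minimal sketch of the kernel density estimation step follows. The point locations, employment counts, grid extent, and the choice of a quartic (biweight) kernel are all our illustrative assumptions; only the 300 m bandwidth comes from the text, and the CASA work itself does not specify this kernel function.

```python
import numpy as np

# Hypothetical retail sites: (x, y) coordinates in meters, plus employment counts
points = np.array([[100, 200], [350, 240], [400, 600], [900, 450]])
weights = np.array([120, 80, 200, 60])  # employees per site (invented)

bandwidth = 300.0  # meters, as in the town-center study

def kde_surface(points, weights, bandwidth, extent=1000, cell=50):
    """Weighted kernel density surface on a square grid, quartic kernel:
    K(d) = 3/(pi*h^2) * (1 - (d/h)^2)^2 for d < h, else 0."""
    coords = np.arange(0, extent + cell, cell)
    gx, gy = np.meshgrid(coords, coords)
    surface = np.zeros_like(gx, dtype=float)
    for (px, py), w in zip(points, weights):
        d = np.hypot(gx - px, gy - py)
        k = np.where(d < bandwidth, (1 - (d / bandwidth) ** 2) ** 2, 0.0)
        surface += w * 3 / (np.pi * bandwidth**2) * k
    return coords, surface

coords, surface = kde_surface(points, weights, bandwidth)
print(f"peak density: {surface.max():.5f} employees per square meter")
```

Each site spreads its employment count smoothly over a 300 m neighborhood, and summing the spreads yields a continuous 'degree of town centeredness' surface whose crisp boundaries can then be interpolated as contours.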

An example of the composite surface 'index of town center activity' for the town center of Camden, London (Figure 6.19A) is shown in Figure 6.19B. For reasons to be explored in our discussion of geovisualization (Chapter 13), most users prefer maps to have crisp and not graduated or uncertain boundaries. Thus crisp boundaries were subsequently interpolated, as shown in Figure 6.19C.

spatial behavior, and so the principles of zone design are of much more than academic interest.

Many of the issues of uncertainty in conception, measurement, representation, and analysis come together in the definition of town center boundaries (see Box 6.5).

6.5 Consolidation

Uncertainty is certainly much more than error. Just as the amount of available digital data and our abilities to process it have developed, so our understanding of the quality of digital depictions of reality has broadened. It is one of the supreme ironies of contemporary GIS that as we accrue more and better data and have more computational power at our disposal, so we seem to become more uncertain about the quality of our digital representations and the adequacy of our areal units of analysis. Richness of representation and computational power only make us more aware of the range and variety of established uncertainties, and challenge us to integrate new ones. The only way beyond this impasse is to advance hypotheses about the structure of data, in a spirit of humility rather than conviction. But this implies greater a priori understanding about the structure in spatial as well as attribute data. There are some general rules to guide us here, and statistical measures such as spatial autocorrelation provide further structural clues (Section 4.3). The developing range of context-sensitive spatial analysis methods provides a bridge between such general statistics and methods of specifying place or local (natural) environment (Box 1.*). Geocomputation helps too, by allowing us to gauge the sensitivity of outputs to inputs, but, unaided, is unlikely to provide any unequivocal best solution. The fathoming of uncertainty requires a combination of the cumulative development of a priori knowledge (we should expect scientific research to be cumulative in its findings), external validation of data sources, and inductive generalization

in the fluid, eclectic data-handling environment that is contemporary GIS.

More pragmatically, here are some rules for how to live with uncertainty. First, since there can be no such thing as perfectly accurate GIS analysis, it is essential to acknowledge that uncertainty is inevitable. It is better to take a positive approach, by learning what one can about uncertainty, than to pretend that it does not exist. To behave otherwise is unconscionable, and can also be very expensive in terms of lawsuits, bad decisions, and the unintended consequences of actions (see Chapter 17).

Second, GIS analysts often have to rely on others to provide data, through government-sponsored mapping programs like those of the US Geological Survey or the UK Ordnance Survey, or commercial sources. Data should never be taken as the truth: instead, it is essential to assemble all that is known about the quality of the data, and to use this knowledge to assess whether the data are actually fit for use. Metadata (Section 11.2.1) are designed specifically for this purpose, and will often include assessments of quality. When these are not present, it is worth spending the extra effort to contact the creators of the data, or other people who have tried to use them, for advice on quality. Never trust data that have not been assessed for quality, or data from sources that do not have good reputations for quality.

Third, the uncertainties in the outputs of GIS analysis are often much greater than one might expect given knowledge of input uncertainties, because many GIS processes are highly non-linear. Other processes dampen uncertainty, rather than enhance it. Given this, it is important to gain some impression of the impacts of input uncertainty on output.
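One practical way to gain that impression is Monte Carlo simulation in the spirit of Heuvelink (1998): perturb the inputs according to an assumed error distribution, rerun the operation many times, and summarize the spread of outputs. The sketch below is ours, not from the chapter; the slope calculation, elevation values, and DEM error magnitude are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# A simple non-linear GIS operation: percent slope from two elevations
# measured at points a fixed ground distance apart
def slope_pct(z1, z2, spacing=30.0):
    return abs(z2 - z1) / spacing * 100

z1, z2 = 104.0, 106.0   # measured elevations (m), hypothetical
sigma_z = 1.5           # assumed RMSE of the DEM (m)

# Propagate the input uncertainty: perturb both elevations independently
# and recompute the slope many times
n = 100_000
s = slope_pct(z1 + rng.normal(0, sigma_z, n),
              z2 + rng.normal(0, sigma_z, n))

print(f"nominal slope:          {slope_pct(z1, z2):.1f}%")
print(f"mean of simulations:    {s.mean():.1f}%")
print(f"std dev of simulations: {s.std():.1f}%")
```

With these assumed error magnitudes, the simulated mean exceeds the nominal slope: the absolute value makes the operation non-linear, so input error biases the output as well as spreading it, which is exactly the kind of surprise that simple error budgeting misses.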

Fourth, rely on multiple sources of data whenever you can. It may be possible to obtain maps of an area at several different scales, or to obtain several different vendors' databases. Raster and vector datasets are often complementary (e.g., combine a remotely sensed image with a topographic map). Digital elevation models can often be augmented with spot elevations, or GPS measurements.

Finally, be honest and informative in reporting the results of GIS analysis. It is safe to assume that GIS designers will have done little to help in this respect: results will have been reported to high apparent precision, with more significant digits than are justified by actual accuracy, and lines will have been drawn on maps with widths that reflect relative importance, rather than uncertainty of position. It is up to you as the user to redress this imbalance, by finding ways of communicating

what you know about accuracy, rather than relying on the GIS to do so. It is wise to put plenty of caveats into reported results, so that they reflect what you believe to be true, rather than what the GIS appears to be saying. As someone once said, when it comes to influencing people 'numbers beat no numbers every time, whether or not they are right', and the same is certainly true of maps (see Chapters 12 and 13).

Questions for further study

1. What tools do GIS designers build into their products to help users deal with uncertainty? Take a look at your favorite GIS from this perspective. Does it allow you to associate metadata about data quality with datasets? Is there any support for propagation of uncertainty? How does it determine the number of significant digits when it prints numbers? What are the pros and cons of including such tools?

2. Using aggregate data for Iowa counties, Openshaw (1984) found a strong positive correlation between the proportion of people over 65 and the proportion who were registered voters for the Republican party. What, if anything, does this tell us about the tendency for older people to register as Republicans?

3. Find out about the five components of data quality used in GIS standards, from the information available at www.fgdc.gov. How are the five components applied in the case of a standard mapping agency data product, such as the US Geological Survey's Digital Orthophoto Quarter-Quadrangle program (search the Web for the appropriate documents)?

4. You are a senior retail analyst for Safemart, which is contemplating expansion from its home US state to three others in the Union. Assess the relative merits of your own company's store loyalty card data (which you can assume are similar to those collected by any retail chain with which you are familiar) and of data from the 2001 Census in planning this strategic initiative. Pay particular attention to issues of survey content, the representativeness of population characteristics, and problems of scale and aggregation. Suggest ways in which the two data sources might complement one another in an integrated analysis.

Further reading

Burrough P.A. and Frank A.U. (eds) 1996 Geographic Objects with Indeterminate Boundaries. London: Taylor and Francis.

Fisher P.F. 1999 'Models of uncertainty in spatial data.' In Longley P.A., Goodchild M.F., Maguire D.J. and Rhind D.W. (eds) Geographical Information Systems: Principles, Techniques, Management and Applications. New York: Wiley, pp. 191–205.

Goodchild M.F. and Longley P.A. 1999 'The future of GIS and spatial analysis.' In Longley P.A., Goodchild M.F., Maguire D.J. and Rhind D.W. (eds) Geographical Information Systems: Principles, Techniques, Management and Applications. New York: Wiley, pp. 567–580.

Heuvelink G.B.M. 1998 Error Propagation in Environmental Modelling with GIS. London: Taylor and Francis.

Openshaw S. and Alvanides S. 1999 'Applying geocomputation to the analysis of spatial distributions.' In Longley P.A., Goodchild M.F., Maguire D.J. and Rhind D.W. (eds) Geographical Information Systems: Principles, Techniques, Management and Applications. New York: Wiley, pp. 267–282.

Zhang J.X. and Goodchild M.F. 2002 Uncertainty in Geographical Information. New York: Taylor and Francis.
