MEASURING ERROR IN MANUALLY DIGITIZED MAPS
A Thesis Submitted to the Faculty of Graduate Studies and Research
In Partial Fulfillment of the Requirements for the Degree of Master of Science
in Geography
University of Regina
Michael Wayne Frith Redlands, California
May, 1997
© 1997 Michael W. Frith
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
ABSTRACT
Many spatial databases have been created from the conversion of analog maps to
their digital representation. Manual conversion introduces error into the database, but the
amount of error is often left unmeasured. Estimates of this error can be made using Perkal
epsilon bands. Three aspects of error associated with the manual conversion of analog to
digital maps are estimated using an epsilon band: registration error, digitizing error and
inherent map error. Since these estimates are independent, they can be added for a total
error estimate.
To determine the feasibility of establishing a total error estimate, a sample line
was constructed and independently digitized by eight operators in both point and stream
digitizing modes. This approach yielded a sample data set in which the perpendicular
distances between the two lines generated in point and stream modes were used as the
digitizing error. The registration error was returned from the digitizing software. The
inherent map error was based on the Canadian federal government's Energy, Mines and
Resources Mapping Branch horizontal accuracy standards. Those standards stipulate that
any point on a map must be within 0.5 mm of the location of the true point at map scale.
Results indicated that the maximum digitizing error is not usually representative
of a manually digitized data set. Using an epsilon based on the median of the digitizing
error is also not a particularly valid approach since 50% of the line may not be present in
the epsilon band. An epsilon band based on the mean deviation and its accompanying
statistics appears to provide the most reasonable measure of digitizing accuracy.
ACKNOWLEDGEMENTS
I would like to express my sincere gratitude to Dr. David Gauthier for his
encouragement and guidance through this extended journey. I would also like to thank the
Department of Geography and the Faculty of Graduate Studies and Research for the
funding received to aid in this endeavour. Partial funding to assist this project was also
received through a Social Sciences and Humanities Research Grant of Dr. Gauthier.
Thanks are also extended to Environmental Systems Research Institute for
allowing me the flexibility and resources to complete my degree. In particular, I wish to
thank Bill Moreland.
Thanks also to my friend and confidant Becky who put up with me the last 2
years.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. NATURE OF STUDY
CHAPTER 2. LITERATURE REVIEW
CHAPTER 3. METHODOLOGY
    3.1 DATA
    3.2 DATA COLLECTION
        3.2.1 Registration
        3.2.2 Digitizing
    3.3 CALCULATION OF EPSILON
        3.3.1 Registration
        3.3.2 Inherent Map Error
        3.3.3 Digitizing Error
        3.3.4 Total Epsilon
CHAPTER 4. RESULTS AND DISCUSSION
    4.1 OPERATOR BACKGROUNDS
    4.2 DIGITIZING RESULTS
        4.2.1 Registration
        4.2.2 Digitizing Error
        4.2.3 Test of Independence
    4.3 TOTAL EPSILON CALCULATION
    4.4 DISCUSSION
        4.4.1 Assessment of the Three Measures
        4.4.2 Improvements to the Methodology
        4.4.3 Areas of Further Research
APPENDIX A
LITERATURE CITED
LIST OF FIGURES
Figure 2.1. Digitized line with epsilon band and the "control" line
Figure 2.2. Point mode digitizing is a sampling method that creates a new line using the points selected by the operator
Figure 2.3. Perpendicular error distance is from a vertex on the digitized line to the control line
Figure 2.4. Two lines representing the same feature but digitized at different scales
Figure 2.5. The RMS error is calculated from the differences in the tic table coordinates and the transformed coordinates
Figure 2.6. "Control" line with concave epsilon band
Figure 2.7. Possible representations of the "true" line from pairs of random points generated within specified distance
LIST OF TABLES
Table 4.1. RMS error in table units and map units
Table 4.2. Statistics based on perpendicular error distance between the point mode digitized line and the stream mode digitized line
Table 4.3. Calculation of total epsilon distances (metres)
Table 4.4. Examination of standard deviation digitizing error for individual operators and the operators as a group
CHAPTER 1. NATURE OF STUDY
"...it is not the map that is in 'error', it merely contains a considerable level of uncertainty ..."
Fisher, 1987, p. 315
1.1 Purpose
The purpose of this thesis is to construct a method for estimating the amount of
uncertainty introduced when analog maps are converted to digital data in a vector based
geographic information system (GIS). There is a substantial need to obtain reliable
estimates of error associated with digital data to better inform data analysis, interpretation
and decision-making. The main priority of research into issues of accuracy and
uncertainty associated with GIS is to "develop an adequate means of representing and
modeling the uncertainty and error characteristics of spatial data and to develop GIS
related methods and techniques that can explicitly take error into account during their
operations with spatial data" (Openshaw, 1989, p.265).
1.2 Background
Maps have been the main source of data for geographic analysis for many years.
"The primacy of maps is an unquestioned premise of the field." (Chrisman, 1982a, p.3).
However, maps serve not only geographers. The shift from analog to digital maps has
expanded the utility of maps for many disciplines. The digital map allows for the
manipulation of data in ways that are not possible with paper maps. Analog maps have
traditionally been difficult to use for overlay analysis and are frequently, but incorrectly,
treated as being 100% accurate (Aronoff, 1989). "For the first time we are
emancipated from the tyranny of paper map sheet with finite size and depth, although
[...]
seamless and scale variable cartographic databases." (Muller, 1992, p. 2).
The increased speed and power of computers allows spatial information to be
processed faster than ever before. The transition from analog to digital map has benefited
many areas of map analysis but has raised substantive issues regarding spatial data
handling. Chrisman (1984, p.81) noted that "numbers in a database create an illusion of
accuracy and the computer opens new ways of potential abuse." Goodchild (1996) argues
that users expect GIS databases to be developed with the "principles of scientific
measurement". Much of the necessary research into spatial processes and spatial statistics
and resulting conflicts is just beginning.
One of the major issues in the use and analysis of digital data is that of accuracy
and the lack of knowledge regarding spatial error. Accurate data is important for GIS but
is often overlooked or dismissed in applications and is very rarely specified on output
products. The map user generally has no idea of map accuracy and output is often
assumed to represent a higher level of accuracy than it actually contains (Chrisman, 1984;
Keefer et al., 1988).
A GIS allows users to produce maps and models by combining various sets of
spatial data. Two GIS capabilities that excite enthusiasm among potential users are the
ability to change map scales and the ability to efficiently overlay maps in any order
desired by the user. It is this ability to manipulate mapped information that makes a GIS
so valuable. However, researchers and decision-makers can be misled due to
misunderstandings in the imprecision inherent in cartographic forms of representation and
the compounding of errors when map scales are changed and when maps are merged
[...]
products has arisen for several reasons: (1) they are a requirement of many spatial data
transfer standards; (2) public agencies require accurate estimates of error to support
decisions based on spatial data; (3) accurate error estimates help to preserve public
confidence; and (4) estimates of error assist in the resolution of litigation disputes
(Goodchild, 1993).
Digital spatial data should not be subject to the same limited methods of
determining accuracy that are associated with paper maps. Many spatial databases have
been converted from paper maps without considering the uses of the resulting data or the
intended use of the paper map. The paper map is a communication device transmitting the
cartographer's view of reality to the map reader. Initial cartographic research focused on
this communication model and gave scant regard to accuracy. "That model diverted
attention away from the data gatherer and the map maker toward the transmission
process; it thereby down played problems of data accuracy and precision and those of
representation." (Woodward, 1992, p.52). "Cartographers feel little need to communicate
information on accuracy, except indirectly through map quality statements or in detailed
legends." (Goodchild, 1991, p.2). Openshaw (1989, p.263) states "there is a remarkable
lack of information about the level of errors in maps and remotely sensed data and, there
are seemingly no available tools for measuring error in the outputs, and no methodology
for assessing their significance."
One of the main features of a GIS is the ability to produce "new" information. The
users of maps and other GIS products want to combine information from many sources to
aid in decision making and they want the information to contain as little uncertainty as
possible. Researchers and decision-makers [...]
least be aware of the limits. If there is no quantified estimate of data quality then users
may rightly be cautious in the use of that data. For example, knowledge of error is
important to researchers that use data analysis to refine research directions or decision-
makers who may be held liable for damages incurred as a result of poor decisions. If a
user understands limits associated with the data, the risk of damages as a result of poor
decisions is reduced.
The determination of an accuracy level should be based on the intended use of the
information. The acceptable level of accuracy is "that level where the costs of making the
wrong decision are equal to the costs of acquiring more accurate information" (Aronoff,
1989, p.55). Inaccuracies can lead to false perceptions about the data (Bailey, 1988)
which can lead to faulty decisions (Mead, 1982; Chrisman, 1984; Hudson, 1988).
Little has been written in articles dealing with applications of GIS in regard to
accuracy determination and levels of confidence in the output data. The most plausible
explanations for that lack are that users either do not know how much error is in the data,
or they have no way of quantifying the error, or they have some reason to believe it is not
significant. The first two explanations are no longer valid for certain types of errors.
Many managers use the "best" data available or data that is "good enough" for their
application, though they may have no quantified estimate of its accuracy. There are no
commercial GIS or other tools that can determine the amount of error in a data layer or
incorporate the error during overlay procedures (Openshaw, 1989).
Chapter 2 provides a literature review of relevant concepts and approaches to
error estimation and analysis. Chapter 3 describes the methodology used in this thesis to
measure and depict error sources. Chapter 4 presents the results and discusses
conclusions and possible approaches and methods that can be used to assist users of
digital mapped data to account for and manage the error and uncertainty in spatial
databases.
CHAPTER 2. LITERATURE REVIEW
"Digitizing is usually the most expensive part of a GIS, yet the error introduced into maps by digitizing is often overlooked or assumed to be negligible"
Keefer et al., 1991, p.957
2.1 Introduction
There are two basic types of error in spatial databases: positional and attribute
(Amrhein and Griffith, 1991). Either the coordinates of the feature are wrong and/or the
content or description of the feature is wrong. Error in these spatial databases occurs from
two main sources, the encoding process and the source documents (Veregin, 1993). The
first source of error is distinguished by discrepancies between the source document and
the digital data derived from that source. The second source of error is associated with
error in the source document.
There are two approaches to accuracy determination: testing and simulation.
Testing is the comparison of the collected data to data of a known and higher accuracy
(Vonderohe and Chrisman, 1985; Chrisman, 1989). Simulation seeks to develop a
procedure with stochastic modeling that can produce a "random" data set. This "random"
data set is based on the original or generated data set but is perturbed according to some
mathematical or probability function. The "random" data set will share many
characteristics of the original data set but is only a possible version of what the real world
may be like. Usually, multiple "random" data sets are created and combined to assess
final output results. With simulation, predictions about accuracy are made from
assumptions about the true data (Chrisman, 1989). Testing is a more accurate method, but
[...]
1984).
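The simulation approach described above can be sketched in a few lines of Python. This is a minimal illustration rather than a method taken from the cited studies. It assumes independent Gaussian noise on every vertex, and the function names, the sample line and the value of sigma are hypothetical. It generates several "random" versions of a line and summarizes the resulting line lengths.

```python
import math
import random
import statistics

def line_length(points):
    """Sum of straight-line distances between consecutive vertices."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def perturb_line(points, sigma, rng):
    """Return one "random" version of the line.

    Assumption: each coordinate is shifted by independent Gaussian
    noise with standard deviation sigma (map units).
    """
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in points]

def simulate_lengths(points, sigma, runs, seed=0):
    """Lengths of `runs` perturbed versions of the line."""
    rng = random.Random(seed)
    return [line_length(perturb_line(points, sigma, rng)) for _ in range(runs)]

# Hypothetical "original" data set: an L-shaped line of length 20.
original = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
lengths = simulate_lengths(original, sigma=0.5, runs=200)
print(line_length(original))        # length of the unperturbed line
print(statistics.mean(lengths))     # mean simulated length
print(statistics.stdev(lengths))    # spread across the simulated data sets
```

The spread of the simulated lengths gives a rough indication of how sensitive a derived measure is to the assumed positional error.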
Three indices of database accuracy are layer-based, feature-based and domain
specific indices. Layer based indices provide a quick summary of data quality over an
entire layer. A "layer" is a representation of a single type of entity, e.g. soils, hydrology,
land use. The disadvantage of this approach is that data quality varies significantly across
space (Burroughs, 1986; Veregin, 1991). Feature based indices provide information about
variations across space but at a high cost in terms of storage and management. A
"feature" is a single entity in a layer, e.g. a lake in a hydrology layer. Intermediate to these
two approaches are "domain specific" indices in which the spatial or thematic domain is
subdivided into discrete classes (Veregin, 1991). Spatial domains define areas on the map
that have different values for the same attribute, e.g. areas of the map that have more
recent information or have been surveyed at different times. Thematic domains refer to
similar features in a layer, e.g. urban areas in a land use layer. Ideally, maps should
include not only a total map uncertainty index but also domain specific errors.
There are numerous methods to reduce positional error in a spatial database. If
new databases are being created, then experienced digitizers and operational feedback
have been found to reduce input error (Jenks, 1981; Otawa, 1987). The use of survey
control points allows for the "rubber-sheeting" or stretching of the map to a more
accurate position (Star and Estes, 1990). The use of high level surveying, e.g. global
positioning systems, can reduce positional error during raw data collection.
Most spatial databases are created from the conversion of analog maps (Marble,
1996). The conversion process requires the registration of the map or "map separates"
[...]
depict a single layer, e.g. hydrology, roads. A skilled operator then digitizes the map or it
can be electronically scanned using an automated scanner.
It is possible to measure the error and uncertainty introduced by the components
of the map conversion process: digitizing, registration and map compilation. The true
location of any point on the map should lie within a radius equal to the composite error
value. This distance is called the epsilon distance and when applied to linear features
forms an error band called an epsilon band (Figure 2.1). While the epsilon band model
does not provide a determination of the location of the true line, it does provide an
estimate of the deviation of a digitized line from the true line.
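The epsilon band can be sketched as a simple distance test. The following is a minimal Python illustration, not part of the method described in this thesis. The function names and the sample coordinates are hypothetical, and the band is assumed to have a constant width of epsilon on both sides of the digitized line.

```python
import math

def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:                      # degenerate segment
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_line(p, line):
    """Shortest distance from p to any segment of a polyline."""
    return min(point_segment_distance(p, a, b) for a, b in zip(line, line[1:]))

def in_epsilon_band(p, line, epsilon):
    """True if p lies within epsilon of the digitized line."""
    return distance_to_line(p, line) <= epsilon

# Hypothetical digitized line and two candidate locations.
digitized = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
print(in_epsilon_band((5.0, 1.5), digitized, epsilon=2.0))   # True
print(in_epsilon_band((5.0, 3.0), digitized, epsilon=2.0))   # False
```

Clamping the projection to the segment means the band is rounded at the ends of the line, which matches the idea of rolling a circle of radius epsilon along the line.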
2.2 Digitizing
Digitizing is a strenuous event that requires intense concentration and places a
premium on an operator's psychological and physiological ability to discern the centre of
a line and follow the centre of the line with the cursor. There are two methods of manual
digitizing: point mode and stream mode. Both involve the operator moving the cursor or
"puck" along the features to be collected. The difference in the two modes lies in the
procedure of collecting those features. In point mode, the operator moves the cursor to
any point that the operator considers to be important in defining that feature, e.g. a bend
in a river or road. The operator then enters the location of that point into the database.
Point mode digitizing requires intelligent interpretation of each feature by the operator. It
is a tedious process and does not always give a true representation of the feature being
digitized (Douglas and Peucker, 1973; Burroughs, 1986) (Figure 2.2). It is a sampling
[...]
accuracy of the resulting data (Blakemore, 1984; Klinkenberg and Xiao, 1990).

Figure 2.1. Digitized line with epsilon band and the "true" line.

Figure 2.2. Point mode digitizing is a sampling method that creates a new line using the points selected by the operator (adapted from Burroughs, 1986).

Alternatively, in stream mode digitizing, the operator moves the cursor along the
feature, e.g. polygon boundary, and points are automatically entered without judgement
by the operator using algorithms that are based on the distance traveled along the line or
the time elapsed. Although stream mode digitizing is faster and easier than point mode,
operators tend to undercut or overshoot corners and have to make corrections as they
digitize (Jenks, 1981).
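The distance-based variant of stream mode can be illustrated with a short Python sketch. It is an assumption about how such a rule might work, not a description of any particular digitizing package: the function name, the cursor track and the threshold are hypothetical. A point is recorded each time the distance travelled along the cursor track since the last recorded point reaches a fixed threshold.

```python
import math

def stream_sample(track, min_distance):
    """Record a cursor position each time the distance travelled along
    the track since the last recorded point reaches min_distance."""
    if not track:
        return []
    recorded = [track[0]]
    travelled = 0.0
    for prev, cur in zip(track, track[1:]):
        travelled += math.dist(prev, cur)
        if travelled >= min_distance:
            recorded.append(cur)
            travelled = 0.0
    return recorded

# Hypothetical cursor track: positions reported every 0.5 units along a line.
track = [(i * 0.5, 0.0) for i in range(11)]      # x = 0.0 .. 5.0
print(stream_sample(track, min_distance=2.0))
# [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
```

In this sketch the final cursor position is dropped when it falls short of the threshold, and no operator judgement is involved in choosing the recorded points.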
Traylor (1979) had 15 subjects digitize a 6" x 4.5" generalized representation of
Australia in stream mode. Using perpendicular error distances from the digitized line to
the original line (Figure 2.3), Traylor found that stream mode digitizing error does not
occur randomly, but is correlated to the direction of cursor travel and an inability of the
operators to correct their mistakes even though they know they are not following the line.
Traylor suggested that a digitizing signature could be created for each operator and data
sets created by them could be modified according to that signature. Jenks (1981) also
found stream mode digitizing dominated by "latitudinal" errors, i.e. operators would
'overshoot' or 'undercut' corners and realizing that they had strayed from the line would
slowly make their way back to the line rather than make an abrupt correction.
Honeycutt (1985) examined the effect of cartographic generalization on positional
uncertainty from four map scales: 1:24 000, 1:62 500, 1:100 000, and 1:250 000. Eight
stream channels were digitized in point mode at each of the four scales, with the largest
scale version acting as the base line. The generalized versions of the line were digitized
by one operator and overlain with the base line resulting in polygons that represented the
[...]
polygon with area weighted average and variance. It was found that generalization caused
a bimodal distribution of location error about the cartographic line. The operator was not
able to follow the centre of the line but veered to the right and left of the line. It was
reasoned that overshooting and undercutting, based on Traylor's findings, were
responsible for this distribution. Generalization reduces the number of points required to
represent a feature causing the shape of the feature to change. It is implausible that
digitizing error causes the bimodal distribution. The transected polygons represent the
amount of deviation between the base line and the smaller scale representation (see
Figure 2.4), not the inability of the operator to follow the line. The comparison of
digitized lines at different scales does not provide a suitable measure of digitizing error
and, therefore, the comparison was invalid.

Figure 2.3. Perpendicular error distance is from a vertex on the digitized line to the true line.

Otawa (1987) studied the variability of digitizing error of 14 people. Using the
same map and hardware, various sized polygons with varying complexities were
digitized. The subjects had little or no prior experience with digitizing. Analysis consisted
of comparing polygon area from operator to operator. It was concluded that manual
digitizing created more error than expected and that the larger the polygon to be digitized,
the larger the error.
Keefer et al. (1988) used a method similar to Traylor (1979) to examine digitizing
error. Map-like features were digitized in point mode and used as the "control" line. The
features were then digitized again in stream mode and the perpendicular error distance
from the sample line to the control line was calculated (see Figure 2.3). The data was
found to be non-random with a high correlation or serial dependence from point to point.
[...]
average (ARMA) model to simulate stream mode digitizing error. Line length and
polygon area output from the program were compared with the original data. The authors
concluded that "time series analysis is a very effective method of studying the effect of
digitizing upon map accuracy" (p.482), although they did not mention the size of the
digitizing error.

Figure 2.4. Two lines representing the same feature but digitized at two different scales (source: Honeycutt, 1985).

A study by Maffini et al. (1989) examined the distribution of error from digitizing
discrete and continuous features. All features were digitized in point mode at three
different scales under three time constraints: comfortable, hurried, and very hurried. Not
surprisingly, the largest scale and slowest digitizing speed provided the most accurate
data.
2.3 Registration Error
Registration is the process of defining points to correlate features in one
coordinate system to another coordinate system. This process is done in anticipation of
transforming the features from one coordinate system to the other system. In the case of
digitizing, the transformation is from table coordinates to map coordinates. The points
used during registration are called "tics". Registration fitness or acceptability is
determined by measuring the error between the output tic coordinates and the transformed
coordinates using root mean square analysis (see Figure 2.5). The distance deviation
between the original coordinates compared to the transformed coordinates determines the
"root mean square" (RMS) error.

Figure 2.5. The RMS error is calculated from the differences in the tic table coordinates and the transformed coordinates. [Diagram labels: output coordinate (tic entered by digitizing operator); transformed input coordinate (original "true" geographic registration tic); the distance between the two is used for RMS calculations.]

$$\text{RMS Error} = \sqrt{\frac{\sum \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]}{n}}$$

where $x_i, y_i$ are the tic table coordinates, $x_j, y_j$ are the transformed input tic coordinates, $n$ is the number of tics, and the sum runs over all $n$ tics.

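The RMS calculation can be sketched in Python as follows. The sketch assumes the usual form of the measure, the square root of the mean squared distance between each tic's table coordinates and its transformed input coordinates. The function name and the tic values are hypothetical, and the code is not taken from ARC/INFO.

```python
import math

def rms_error(table_tics, transformed_tics):
    """Root mean square of the distances between paired tic coordinates."""
    if len(table_tics) != len(transformed_tics) or not table_tics:
        raise ValueError("need two non-empty tic lists of equal length")
    squared = [(xi - xj) ** 2 + (yi - yj) ** 2
               for (xi, yi), (xj, yj) in zip(table_tics, transformed_tics)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical tic coordinates, both lists in the same units.
table_tics = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
transformed_tics = [(0.3, 0.4), (10.0, 0.0), (10.0, 10.5), (0.0, 10.0)]
print(round(rms_error(table_tics, transformed_tics), 4))
```

Two of the four hypothetical tics are displaced by 0.5 units and two match exactly, so the result is smaller than the largest single displacement.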
Bolstad et al. (1990) simulated the registration process by having four operators
digitize a series of points. The mean deviation around the points was 0.068 mm or 1.7 m
at a map scale of 1:24 000. Rogowski (1995) used the affine transformation in ESRI
Corporation's ARC/INFO GIS software resulting in an RMS error of 1.0 metre. It is the
ARC/INFO method that is used in this research.
2.4 Inherent Map Error
There are several sources of inherent map error. Primary data capture (surveying,
geodesy and photogrammetry) introduces human, instrumental and environmental errors.
Human error results from observers not reading instruments correctly or not positioning
equipment correctly. Instrumental errors occur from poorly constructed equipment or lack
of proper calibration. Environmental errors are caused by humidity, temperature,
pressure, magnetic variations, obstruction of signals, wind, and illumination.
Observations from primary data capture are subjected to rigorous statistical and
mathematical modeling to remove most of the error.
Additional sources of error include those caused by plotting control points for
map production, drawing of the features, generalization of the features, error in colour
registration of map separates, feature exaggeration for communication reasons, definition
[...]
and Bossler, 1992).
2.5 Epsilon Band Model
The epsilon model of positional uncertainty for cartographic lines was adapted by
Chrisman (1982a) based on work done by Perkal (1956). Perkal used a circle of diameter
epsilon to determine an approximate length of a line. Chrisman (1982) modified this
approach concluding that somewhere within this epsilon distance the true line exists (see
Figure 2.1). "The epsilon model provides a conservative, generalized model directed at
unifying all sources of error." (Chrisman, 1982a, p.61).
The epsilon model can be used in either a probabilistic or deterministic method
(Goodchild, 1988). In a deterministic method, the probability of the true line lying within
the epsilon band is 1.0. Theoretically, this means that there can be no error outside of the
epsilon distance. The probabilistic approach assumes that the error around a line is
represented by a normal distribution (Maffini et al., 1989). With the probabilistic method,
epsilon can take on any probabilistic measure, such as standard deviation, which implies,
for example, that there is a 68% chance that the true line is within the epsilon band. Mark
and Csillag (1989) used a probabilistic epsilon band to define a probability surface
between two polygons to determine the probability of a sample point belonging to one or
the other polygon.
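The 68% figure follows from the normal distribution. As a minimal sketch, and assuming that the error perpendicular to the line is normally distributed with zero mean, the probability that the true line lies within an epsilon of k standard deviations can be computed with the error function. The function name is hypothetical.

```python
import math

def prob_within(k):
    """Probability that a normally distributed error lies within
    k standard deviations of the mean (two-sided)."""
    return math.erf(k / math.sqrt(2.0))

for k in (1.0, 2.0, 3.0):
    print(k, round(prob_within(k), 4))
# 1.0 0.6827
# 2.0 0.9545
# 3.0 0.9973
```

Under this assumption, an epsilon of one standard deviation leaves roughly a third of the line outside the band, while an epsilon of two standard deviations leaves less than 5% outside.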
Chrisman (1982b) used epsilon bands to study systematic errors associated with
the United States Geological Survey's (USGS) Geographic Information Retrieval and
Analysis System (GIRAS) digital land use/land cover series. His determination of epsilon
was based on inherent map or scale error, digitizing error and round-off error and resulted
[...]
interpretation and registration errors that were not included. Because of these missing
values, the 20 m epsilon was considered to be quite conservative. The results showed 7
percent of a 100 000 hectare database fell in this epsilon band. The area in the epsilon
band represented a possibility that that area was a different land use/land cover class.
Blakemore (1984) studied the number of industrial establishments that existed in
Employment Office Areas (EOA) in northwest England. To determine within which EOA
an establishment was located, the establishments were "geocoded" to the EOA base map.
The base map had a 1 km grid square resolution yielding an epsilon value of 0.7071 km.
Point-in-polygon overlay was conducted to determine which establishments fell within
which EOA. The results were not encouraging as approximately 40 percent of the sample
points that were tested fell within the epsilon band and could not be definitely assigned to
a polygon.
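The 0.7071 km value coincides with half the diagonal of a 1 km grid square, that is, the farthest a point can lie from the centre of its cell. Reading the value that way is an assumption made here for illustration. The sketch below also shows how a point could be flagged as uncertain when it lies within epsilon of a polygon boundary; both function names are hypothetical.

```python
import math


def epsilon_from_grid(resolution_km):
    """Half the diagonal of a square grid cell of the given resolution."""
    return resolution_km * math.sqrt(2.0) / 2.0


def assignment(distance_to_boundary_km, epsilon_km):
    """'definite' if the point is farther than epsilon from the boundary."""
    if distance_to_boundary_km > epsilon_km:
        return "definite"
    return "uncertain"


eps = epsilon_from_grid(1.0)
print(round(eps, 4))
print(assignment(0.5, eps), assignment(2.0, eps))
```

A point 0.5 km from the nearest boundary falls inside the band and cannot be assigned with confidence, whereas a point 2 km away can.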
Dunn et al. (1990) used epsilon bands for a study on the amount of error in digital
databases associated with the Monitoring Landscape Change project in England and
Wales. Several lines in the database were digitized twice providing administrators with an
opportunity to examine digitizing error. The values for epsilon were based on the
maximum range between the two lines and the interquartile range (IQR), the range
between the 25th and 75th percentiles. The difference between these values was quite
large: 20 m for the range and 3.1 m for the IQR. The area of uncertainty (the area in the
epsilon band) for the polygons in the study varied from 10.0% to 15.8% for the range
epsilon and 1.6% to 2.5% for the IQR epsilon.
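The gap between the two epsilons arises because the range is set by the most extreme deviations while the IQR ignores them. A minimal sketch, assuming a list of deviations between two digitizations of the same line is already available; the sample values are invented and the "inclusive" quartile method is one of several possible conventions.

```python
import statistics


def range_and_iqr(deviations):
    """Return (range, interquartile range) of a list of deviations."""
    ordered = sorted(deviations)
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[-1] - ordered[0], q3 - q1


# Hypothetical deviations in metres, with one large outlier.
sample = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 20.0]
value_range, iqr = range_and_iqr(sample)
print(value_range, iqr)
```

A single outlier inflates the range epsilon while leaving the IQR epsilon almost unchanged, which matches the pattern reported above.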
with the band being closer to the "true" line at the center of the line segment than at the
end (Figure 2.6). Random pairs of locations were generated about the endpoints of a line
within a specified distance of the endpoint (Figure 2.7). These pairs of locations
simulated possible representations of the "true" line. Dutton found that the standard
deviation of these representations varied along the length of the line with dispersion being
greatest at the endpoints and least at the midpoint. The problem with Dutton's proposal is
that the digitized points must be independent of each other, but stream mode digitized
points are not independent.
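The shape of this band can be reproduced with a small simulation. This is a sketch under stated assumptions, not Dutton's implementation: endpoints are drawn independently and uniformly within a circle of a given radius, the "true" line is horizontal so that the y offset is the perpendicular offset, and all names and parameter values are hypothetical.

```python
import math
import random
import statistics


def random_point_in_circle(cx, cy, radius, rng):
    """Uniform random point inside a circle, using polar sampling."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return cx + r * math.cos(theta), cy + r * math.sin(theta)


def dispersion_along_line(p0, p1, radius, trials=5000, steps=5, seed=1):
    """Std dev of the perpendicular offset at evenly spaced positions."""
    rng = random.Random(seed)
    offsets = [[] for _ in range(steps)]
    for _ in range(trials):
        a = random_point_in_circle(*p0, radius, rng)
        b = random_point_in_circle(*p1, radius, rng)
        for i in range(steps):
            t = i / (steps - 1)
            # Linear interpolation between the two simulated endpoints.
            offsets[i].append((1.0 - t) * a[1] + t * b[1])
    return [statistics.pstdev(column) for column in offsets]


spread = dispersion_along_line((0.0, 0.0), (100.0, 0.0), radius=2.0)
print([round(s, 2) for s in spread])
```

With independent endpoint errors the dispersion at the midpoint is roughly 0.71 times that at the endpoints, which produces the concave band. The result depends on the independence assumption, which is exactly the condition that stream mode points fail to meet.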
Goodchild and Dubuc (1987) discussed problems with the use of the epsilon band.
They suggested that:
there should be no upper limit for error; the epsilon band does not provide the distribution of error within the band; and although epsilon provides a model of deviation for the line, it does not model the line itself.
These difficulties are mediated, however, by the choice of epsilon distance.
Careful measurement of the components involved in the digitizing process will provide
an appropriate epsilon distance. Additionally, the epsilon band model is relatively easily
implemented by most users and provides at least a coarse measure of the components of
error that are understandable to most users.
The objective of this research is to develop a quick and reasonably accurate
method to estimate how much uncertainty exists in newly created spatial databases. The
assignment of an uncertainty measurement that assesses the percentage of the true line
within an epsilon distance from the observed line is, therefore, a realistic goal.
Figure 2.6. "True" line with concave epsilon band.
Figure 2.7. Possible representations of the "true" line from pairs of random points generated within a specified distance.
Error introduced from digitizing is from three sources: registration error,
digitizing error and inherent map error. These error sources are assumed to be
independent and can, therefore, be summed, resulting in a total measure of error called
"epsilon". To assess characteristics of these sources of error, eight individuals were
chosen to digitize standardized line information under controlled conditions. The results
were assessed and compared relative to the three sources of error and epsilon.
3.1 Data
The data consisted of registration points and a sinuous line (Appendix A). The
data were plotted with a thickness of 1 mm on mylar to limit the amount of stretch and
distortion. Mylar is a more stable medium than paper, which can change in size as humidity
and temperature vary.
3.2 Data Collection
The entry of information from the constructed data set required two steps from
each operator: registration and digitizing.
3.2.1 Registration
An empty data file was created for each operator, each possessing the same "tic
table" (see below). The empty data files were identical in terms of the geographic space
they represented. The geographic space of each data set was referenced according to four
known coordinates corresponding to the minimum x and y values, the minimum x and
maximum y values, the maximum x and minimum y values, and the maximum x and y
values. These four known geographic registration points are known as "tics". The
the tics. The coordinates of the tics are stored in the computer in a data file known as a
"tic table". Each time an operator begins a digitizing session, the tics (normally
represented on a map with large cross-hairs) must be re-entered from the map using an
electronic puck or cursor. The coordinates the operator enters are compared with those
already in the computer to determine the root mean square (RMS) error.
The operator enters the identification number (ID) of the tic and then places the
cursor over the crosshair as accurately as possible and enters the tic location. When all the
tics have been entered, the computer determines the RMS error from the deviation of the tics
entered by the operator relative to those in the "tic table". RMS
values greater than 0.003 (in table units, here inches) are usually not acceptable and the map must be registered again.
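The comparison can be illustrated with a simplified calculation. This is only a sketch: it compares stored and entered tic positions directly in the same units, whereas the GIS software computes its RMS after fitting a transformation. The function names and tic coordinates are hypothetical; the 0.003 threshold is the one quoted above.

```python
import math


def rms_error(stored_tics, entered_tics):
    """Root mean square of the distances between matching tic positions."""
    squared = [
        (sx - ex) ** 2 + (sy - ey) ** 2
        for (sx, sy), (ex, ey) in zip(stored_tics, entered_tics)
    ]
    return math.sqrt(sum(squared) / len(squared))


def needs_reregistration(rms, threshold=0.003):
    """True when the RMS exceeds the acceptance threshold."""
    return rms > threshold


# Hypothetical tic positions in table inches.
stored = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
entered = [(0.002, 0.0), (0.0, 10.001), (10.001, 0.0), (10.0, 9.998)]
rms = rms_error(stored, entered)
print(round(rms, 4), needs_reregistration(rms))
```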
3.2.2 Digitizing
The digitizing was conducted over three days. The mylar was not removed from
the digitizing table ensuring that no distortion resulted from re-taping the map to the
table. The temperature and humidity were controlled by the local environment system.
Once the map was registered to the table, each operator digitized the sample line in a
dense point mode; this was the control line. Each operator then digitized the sample line
in stream mode.
3.3 Calculation of Epsilon
An epsilon distance was calculated for each procedure. Each procedure distorts
the data set before it is passed to the next procedure. Because of the independence of each
procedure, it was valid to sum the epsilon distances to obtain a total epsilon distance.
3.3.1 Registration
during the registration process. That value was a measure of how well the map was
registered to the table. The lower the RMS error, the closer the entered tics align with the
map tics.
3.3.2 Inherent Map Error
The epsilon distance for inherent map error was based on standards from the
Canadian federal Department of Energy, Mines and Resources, producers of the National
Topographic Series. The accepted standard states that 90 percent of well-defined features
measured from the map fall within 0.5 mm relative to their true position (Energy, Mines
and Resources, 1976). In other words, a feature represented on the map will be found
within a radius of 0.5 mm at the map scale on the earth's surface. The epsilon distance was
calculated by multiplying the radius by the map scale.
$$\varepsilon_m = \text{radius} \times \text{map scale}$$
where $\varepsilon_m$ is the epsilon distance for inherent map error.
The map scale used for this project was 1:50 000, producing an inherent map error epsilon
distance of 25 m. This value is the maximum map error value.
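The arithmetic behind the 25 m value is a unit conversion from map millimetres to ground metres. A minimal sketch, in which the function name is hypothetical and the 1:250 000 case is included only to show how the value changes with scale:

```python
def inherent_map_epsilon(radius_mm, scale_denominator):
    """Ground distance in metres for a map-space radius in millimetres."""
    return radius_mm * scale_denominator / 1000.0


print(inherent_map_epsilon(0.5, 50_000))   # 25.0
print(inherent_map_epsilon(0.5, 250_000))  # 125.0
```

The same 0.5 mm tolerance therefore corresponds to a ground distance five times larger on a 1:250 000 map than on the 1:50 000 map used here.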
3.3.3 Digitizing Error
An approach similar to that taken by Keefer et al. (1988) was used. They
determined digitizing error by having operators digitize the feature in a dense point mode.
The operators then digitized the feature in stream mode and the perpendicular error
distance between the two lines was considered the digitizing error. The approach used
here was as follows:
1. Each operator digitized the sample line in a dense point mode; this point mode
digitized line was taken as the "control" line.
2. Each operator then re-digitized their point mode sample line in stream mode.
3. A computer program was written to calculate the perpendicular distance from each
vertex entered in stream mode to the point mode ("control") line. Each distance
represented the deviation by the operator at that vertex. The distance between the
two lines is the digitizing error.
4. Three measures (maximum deviation, mean deviation, and the median deviation)
were calculated using the Statistica computer package (StatSoft, 1994). The
maximum epsilon represents the maximum error that occurs. The median and
standard deviation epsilons represent a probabilistic value. The standard deviation
represents 68% of the digitizing error while the median represents the 50th percentile of
the digitizing error. With the median, the chance of a point being in the epsilon band
is the same as being outside of it.
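The steps above can be sketched in a few functions. This is not the program written for the thesis, which is not reproduced in the text; it assumes that the deviation of a vertex is its shortest distance to any segment of the control line, and all names and sample coordinates are hypothetical.

```python
import math
import statistics


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # keep the foot of the perpendicular on the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def deviations(stream_vertices, control_line):
    """Distance from every stream mode vertex to the nearest control segment."""
    segments = list(zip(control_line, control_line[1:]))
    return [
        min(point_segment_distance(v, a, b) for a, b in segments)
        for v in stream_vertices
    ]


def summarize(devs):
    """Maximum, mean and median deviation."""
    return {
        "maximum": max(devs),
        "mean": statistics.mean(devs),
        "median": statistics.median(devs),
    }


control = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
stream = [(2.0, 1.0), (5.0, -2.0), (11.0, 5.0), (9.0, 3.0)]
print(summarize(deviations(stream, control)))
```

Because the distances are unsigned, the mean computed this way corresponds to the mean of absolute deviations rather than a signed mean.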
3.3.4 Total Epsilon
Since each source of error (inherent map error, registration error, digitizing error)
is independent of the other, the epsilon values for each procedure were summed to obtain
the total epsilon value. The total epsilon is:
$$\varepsilon_t = \varepsilon_r + \varepsilon_m + \varepsilon_d$$
where: $\varepsilon_t$ is the total epsilon
$\varepsilon_r$ is the epsilon from registration
$\varepsilon_m$ is the epsilon from inherent map error
$\varepsilon_d$ is the epsilon from digitizing
- - - - - - - I - -
epsilon distance. To differentiate between the three total epsilons for each operator, each
total epsilon was named according to the type of digitizing epsilon used. For example, the
maximum epsilon is comprised of the maximum digitizing deviation epsilon distance, the
inherent map error epsilon and the registration epsilon.
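As a worked illustration of the summation, the sketch below adds the three components for each digitizing measure. The function name is hypothetical; the input values are those reported for Operator 1 in Table 4.3.

```python
def total_epsilon(registration, map_error, digitizing):
    """Sum of the three epsilon components, all in metres."""
    return registration + map_error + digitizing


# Operator 1: registration 1.978 m, inherent map error 25 m.
operator_1_digitizing = {"maximum": 49.305, "mean": 13.230, "median": 11.362}
for measure, eps_d in operator_1_digitizing.items():
    print(measure, round(total_epsilon(1.978, 25.0, eps_d), 3))
```

The results agree with the Operator 1 totals in Table 4.3 (76.283 m, 40.208 m and 38.340 m).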
CHAPTER 4. RESULTS AND DISCUSSION
4.1 Operator Backgrounds
The operators had varied backgrounds: two were computer programmers
who had never digitized before, three had considerable digitizing experience, and three
had moderate experience. The operators were somewhat nervous, with one operator
reporting hands shaking more than normal. The operators were told this digitizing was
for a thesis, which may have contributed to their nervousness. The only constraint placed on
the operators was that they produce a low RMS error during map registration. The
accepted norm for map registration is 0.003 or lower.
4.2 Digitizing Results
4.2.1 Registration
The ARC/INFO registration process uses an affine transformation, that is, a
linear transformation of the table coordinates to map space. This process yields the root
mean square error in table coordinates as well as in map space coordinates. The
output units are those defined by the values in the tic file prior to registration; in this
project the output units are metres. Table 4.1 shows the RMS error values for each
operator. Similar values for the Table RMS error produce different Map RMS errors due
to rounding by the ARC/INFO transformation software. All operators were able to
achieve RMS errors less than or equal to 0.003 except for Operator 3 who, after several
attempts, could achieve an RMS of only 0.004.
Table 4.1. RMS error in table units and map units.
Operator    RMS Error, Table Units (inches)    RMS Error, Map Units (metres)
1           0.002                              1.978
2           0.003                              3.271
3           0.004                              4.562
4           0.002                              2.163
5           0.002                              2.321
6           0.001                              1.276
7           0.002                              2.072
8           0.002                              2.246
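To show how a map-unit RMS arises from an affine registration, the following sketch fits a least-squares affine transformation from table coordinates to map coordinates and reports the RMS of the residuals. It is a generic illustration, not the ARC/INFO algorithm; dividing by the number of tics is an assumption, and the tic coordinates and function names are hypothetical.

```python
import math


def solve3(m, v):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def fit_affine(table_pts, map_pts):
    """Least-squares affine transform, returned as ((a, b, c), (d, e, f))."""
    rows = [(x, y, 1.0) for x, y in table_pts]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in range(2):  # k = 0 fits map x, k = 1 fits map y
        rhs = [sum(r[i] * p[k] for r, p in zip(rows, map_pts)) for i in range(3)]
        coeffs.append(tuple(solve3(normal, rhs)))
    return tuple(coeffs)


def map_rms(table_pts, map_pts, coeffs):
    """RMS of the residuals, in map units, after applying the transform."""
    (a, b, c), (d, e, f) = coeffs
    squared = [
        (a * x + b * y + c - mx) ** 2 + (d * x + e * y + f - my) ** 2
        for (x, y), (mx, my) in zip(table_pts, map_pts)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical tics: a 10 inch square on the table, 1 inch = 1000 m on the ground.
map_tics = [(0.0, 0.0), (0.0, 10000.0), (10000.0, 0.0), (10000.0, 10000.0)]
entered = [(0.003, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
print(round(map_rms(entered, map_tics, fit_affine(entered, map_tics)), 3))
```

With four tics and six transformation parameters, part of any placement error is absorbed by the fit, so the residual RMS understates the raw misplacement of an individual tic.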
The perpendicular error distances (Figure 2.2) between the point mode version of
the sample line and the stream mode version of the sample line for each operator were
used in the calculation of the mean, maximum, and median deviations. Table 4.2 shows
the mean, maximum, and median deviations for each operator.
Using the raw values should produce a mean close to zero; error on one side of the
"control" line should be equal to error on the other side of the "control" line. A mean
distant from zero would suggest: (1) that the operator spent more time on one side of the
line; (2) that the operator may have had trouble digitizing the feature, especially curves;
and/or (3) that the operator had a tendency to undercut or overshoot curves in a particular
direction. Absolute values of the deviations were used in the calculations. Table 4.2
shows the digitizing error for each operator. There may be some error introduced through
calculations by Statistica, but the amount, if any, is unknown.
4.2.3 Test of Independence
Spearman's Rank Order Correlation was applied to each operator's registration
error and each measure of digitizing error. Using the mean digitizing error, Spearman's r
value was -.357 with a p level of .385. The maximum digitizing error had a Spearman's r
value of -.595 with a p level of .120. The median digitizing error had a Spearman's r value
of -.524 with a p level of .183. Since none of these correlations is significant at the .05
level, it is concluded that the digitizing error and
registration error are independent.
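The correlation coefficients can be recomputed from the values in Tables 4.1 and 4.3. The sketch below uses the textbook rank-difference formula, which is valid here because none of the values are tied; it is not the Statistica routine, and it does not compute the p levels.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation from squared rank differences."""
    n = len(x)
    d_sq = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_sq / (n * (n * n - 1))


# Map-unit RMS registration error per operator (Table 4.1), in metres.
registration = [1.978, 3.271, 4.562, 2.163, 2.321, 1.276, 2.072, 2.246]
# Digitizing error per operator (Table 4.3), in metres.
digitizing = {
    "mean": [13.230, 7.876, 6.278, 9.206, 6.196, 9.176, 5.571, 9.444],
    "maximum": [49.305, 31.441, 20.042, 33.124, 20.323, 31.693, 26.566, 39.018],
    "median": [11.362, 6.468, 5.317, 7.621, 5.756, 7.529, 4.797, 6.511],
}
for measure, values in digitizing.items():
    print(measure, round(spearman_rho(registration, values), 3))
```

The three coefficients come out at -0.357, -0.595 and -0.524, in agreement with the values reported above.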
4.3 Total Epsilon Calculation
The total epsilon distance is the sum of the digitizing epsilon, registration epsilon
and the inherent map error epsilon. Table 4.3 shows the calculation of the total epsilon
Table 4.2. Statistics based on perpendicular error distance between the point mode digitized line and the stream mode digitized line.
Operator    # of Points    Mean Deviation (m)    Median Deviation (m)    Maximum Deviation (m)
Table 4.3. Calculation of total epsilon distances (metres).
Operator  Component             Maximum Deviation  Mean Deviation  Median Deviation
                                Epsilon            Epsilon         Epsilon
1         Registration Error     1.978              1.978           1.978
          Map Error             25.000             25.000          25.000
          Digitizing Error      49.305             13.230          11.362
          Total Epsilon         76.283             40.208          38.340
2         Registration Error     3.271              3.271           3.271
          Map Error             25.000             25.000          25.000
          Digitizing Error      31.441              7.876           6.468
          Total Epsilon         59.712             36.147          34.739
3         Registration Error     4.562              4.562           4.562
          Map Error             25.000             25.000          25.000
          Digitizing Error      20.042              6.278           5.317
          Total Epsilon         49.604             35.840          34.879
4         Registration Error     2.163              2.163           2.163
          Map Error             25.000             25.000          25.000
          Digitizing Error      33.124              9.206           7.621
          Total Epsilon         60.287             36.369          34.784
5         Registration Error     2.321              2.321           2.321
          Map Error             25.000             25.000          25.000
          Digitizing Error      20.323              6.196           5.756
          Total Epsilon         47.644             33.517          33.077
6         Registration Error     1.276              1.276           1.276
          Map Error             25.000             25.000          25.000
          Digitizing Error      31.693              9.176           7.529
          Total Epsilon         57.969             35.452          33.805
7         Registration Error     2.072              2.072           2.072
          Map Error             25.000             25.000          25.000
          Digitizing Error      26.566              5.571           4.797
          Total Epsilon         53.638             32.643          31.869
8         Registration Error     2.246              2.246           2.246
          Map Error             25.000             25.000          25.000
          Digitizing Error      39.018              9.444           6.511
          Total Epsilon         66.264             36.690          33.757
table shows that experienced operators introduce less error than beginners. Registration
error is not affected by experience; it is simply the ability of the operator to place the
crosshair of the cursor on a point. Stream mode digitizing is a psychomotor event;
operators must have good eye-hand coordination. Table 4.4 provides a closer examination
of the standard deviation listing the standard error and confidence limits for each operator
and for all the operators as a group.
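The quantities listed in Table 4.4 can be derived from a list of deviations as sketched below. The thesis does not state how Statistica computed the confidence limits, so the normal approximation with z = 1.96 is an assumption, and the sample deviations are invented.

```python
import math
import statistics


def mean_confidence_limits(devs, z=1.96):
    """Mean, sample std dev, standard error and approximate 95% limits."""
    n = len(devs)
    mean = statistics.mean(devs)
    sd = statistics.stdev(devs)
    se = sd / math.sqrt(n)
    return mean, sd, se, (mean - z * se, mean + z * se)


# Hypothetical absolute deviations in metres.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, sd, se, limits = mean_confidence_limits(sample)
print(round(mean, 3), round(sd, 3), round(se, 3), [round(v, 3) for v in limits])
```

The standard error shrinks with the square root of the number of points, so an operator with several hundred vertices yields much tighter limits than this eight-value example.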
4.4 Discussion
4.4.1 Assessment of the Three Measures
The maximum deviation represents the largest error that would occur. An epsilon
band based on this value will contain the largest percentage of the "control" line.
However, an epsilon based on this value will be quite large. The maximum deviation is
that small part of the digitized line that may have occurred when the operator lost
concentration and had trouble following the line, or from the jerking motion that may
occur as the cursor sticks moving over the table. The median deviation is a crude index of
central tendency that excludes the extremes at either end of the scale. The median epsilon
will tend to show the smallest epsilon band width but contain a lower percentage of the
line. The mean deviation is a better measure of central tendency in which all values are
taken into account.
Table 4.4. Examination of standard deviation digitizing error for individual operators and the operators as a group.
Operator    # of Points    Mean Deviation    Standard Deviation    Standard Error    Confidence Limits (95%)
Therefore, users will have to choose among the three measures depending upon
their objectives and the strengths and weaknesses of each measure. If a user requires
greater certainty that the largest amount of the line is contained within the epsilon band
width, then the maximum deviation epsilon would be chosen, although there will be
greater uncertainty about the true position of the line with that measure. If the user wishes
to reduce the uncertainty regarding the position of the line, then the median epsilon is
best, although a smaller percentage of the line will be contained within the epsilon band.
Clearly the maximum deviation allows the most variability of the digitized line
relative to the "control" line giving the widest area of uncertainty. The line must occur
somewhere within that range. If the median deviation is used, a mistake may occur since
50% of the range is ignored. Users will not normally use this measure for that reason. The
mean and standard deviation are statistical measures that provide the best estimate of the
accuracy of the digitized line. Researchers or organizations that produce digital data to
provide a statistical measurement of the positional accuracy of that data can use this
approach.
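The trade-off between band width and the share of the line captured can be made concrete with a short calculation. This is a sketch with invented deviations; it treats the share of vertices inside the band as a stand-in for the percentage of the line, which is an assumption.

```python
import statistics


def coverage(devs, epsilon):
    """Fraction of deviations that fall inside a band of half-width epsilon."""
    return sum(1 for d in devs if d <= epsilon) / len(devs)


def compare_measures(devs):
    """Epsilon and coverage for the maximum, mean and median measures."""
    candidates = {
        "maximum": max(devs),
        "mean": statistics.mean(devs),
        "median": statistics.median(devs),
    }
    return {name: (eps, coverage(devs, eps)) for name, eps in candidates.items()}


# Hypothetical absolute deviations in metres.
sample = [1.0, 2.0, 3.0, 4.0, 20.0]
for name, (eps, share) in compare_measures(sample).items():
    print(f"{name}: epsilon = {eps:.1f} m, coverage = {share:.0%}")
```

In this example the maximum epsilon captures every vertex at the cost of a 20 m band, while the median epsilon gives a 3 m band that captures only a little over half of them.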
This thesis used a scale of 1:50 000 as this is a common scale used for digitizing.
The relative importance of scale must be mentioned. At global scales, error from
digitizing and other sources is not usually a concern. At more detailed scales, error
becomes an important issue.
4.4.2 Improvements to the Methodology
Although the test was conducted on a group of 8 subjects, providing a sample of
approximately 2,000 points, a larger sample is likely necessary to provide a better profile
of digitizing error, particularly if stratified according to the experience of the operators.
experience. Furthermore, each operator in this test was not fatigued, i.e. they had not been
digitizing before they did the test. A better approach would be to have the operators
digitize the sample line, spend some time digitizing a map and then digitize the sample
line again. This would provide more realistic data as it would better represent the
digitizing process.
4.4.3 Areas of Further Research
The objective, as defined in Chapter 1, "to construct a method for estimating the
amount of uncertainty introduced when analog maps are converted to digital data in a
vector based GIS" has been achieved. The methodology was developed and applied to a
set of data to establish the feasibility of the application. There are, however, important
issues that have arisen in considering the application of this research.
1. Although these measures can be readily applied with today's GIS software, the
software does not readily allow for any measures of accuracy to be stored with the
data set. There would have to be a written record of this information or a text file that
could accompany the data, recording the original information and subsequent
modifications to the data set.
2. One readily apparent problem with the interpretation of spatial polygonal data is that
lines are often seen as a hard edge, such as the lines representing boundaries on a soil
map. However, in most cases, there is a buffer or transition zone between polygons
that is not well interpreted by an infinitely thin boundary line. The epsilon band can
incorporating this transition zone in decision-making.
3. While the epsilon band gives a uniform band about a line, error introduced from
digitizing is not uniform. Future research should consider the direction of digitizing
as well as the sinuosity of the feature being digitized. For example, straight-line
segments should have less error than curved segments. Furthermore, a variable width
epsilon would provide a better estimate of error. Operator characteristics could be
measured based on the direction of digitizing and the quantity of error that is
introduced. Once the digitizing characteristics of operators have been quantified, the
data set can be altered based on those values. For example, if the operator has a
tendency to undercut right to left curves by a certain average distance, a
transformation can be developed and applied that will alter all right to left curves by
the specified amount.
4. If epsilon error information is stored with a map data set, concerns arise as to the
disposition and interpretation of that information with each subsequent overlay of that
map with other maps. What happens to the epsilon bands during map overlay? If one
data set has epsilon bands and the other does not, are the bands removed or are they
applied to the output and in what ways are they modified? If both data sets have
epsilon bands, which bands take precedence when lines from each data set represent
the same feature? For example, if a soils data set was overlain with a hydrology data
set, shore and river boundaries would be present in each data set, but which shore and
river boundaries are to be used in the output?
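The text file suggested in item 1 could take the form of a small structured record kept alongside the data set. The following is one possible sketch only: the field names, the JSON format and the helper functions are all hypothetical, and the example values are those of Operator 1 using the mean deviation epsilon.

```python
import json


def accuracy_record(dataset, scale_denominator, eps_r, eps_m, eps_d, measure):
    """Build a dictionary describing the epsilon components of a data set."""
    return {
        "dataset": dataset,
        "source_scale": f"1:{scale_denominator}",
        "digitizing_measure": measure,
        "epsilon_metres": {
            "registration": eps_r,
            "inherent_map_error": eps_m,
            "digitizing": eps_d,
            "total": round(eps_r + eps_m + eps_d, 3),
        },
        "history": [],
    }


def add_history(record, note):
    """Append a note describing a later modification to the data set."""
    record["history"].append(note)
    return record


def to_text(record):
    """Serialize the record as text that can be saved beside the data."""
    return json.dumps(record, indent=2, sort_keys=True)


def from_text(text):
    """Read a record back from its text form."""
    return json.loads(text)


record = accuracy_record("sample_line", 50000, 1.978, 25.0, 13.230, "mean")
add_history(record, "digitized in stream mode")
print(to_text(record))
```

Such a record does not answer the overlay questions raised in item 4, but it would keep the component epsilons available for whatever rule is eventually adopted.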
Appendix A. Sample line used for digitizing based on hand drawn line at 1:50 000.
LITERATURE CITED
Abler, R.F. 1987. "The NSF NCGIA". International Journal of Geographic Information Systems, 1 (4):303-326.
Amrhein, C.G. and Griffith, D.A. 1991. "A Model for Statistical Quality Control of Spatial Data in a GIS" in Proceedings, GIS 91, Canadian Conference, pp.91-103.
Aronoff, S. 1989. Geographic Information Systems: A Management Perspective. WDL Publications, Ottawa, Canada.
Bailey, R.G. 1988. "Problems with Using Overlay Mapping for Planning and Their Implications for Geographic Information Systems". Environmental Management, 12(1):11-17.
Blakemore, M. 1984. "Generalisation and Error in Spatial Data Bases" Cartographica, 21(2+3): 131-139.
Bolstad, P.V., Gessler, P. and Lillesand, T.M. 1990. "Positional uncertainty in manually digitized map data". International Journal of Geographic Information Systems, 4(4):399-412.
Burroughs, P.A. 1986. Principles of Geographical Information Systems for Land Resources Assessment. Clarendon Press, Oxford.
Chrisman, N.R. 1982a. "Methods of Spatial Analysis Based on Error in Categorical Maps", unpublished Ph.D. dissertation, University of Bristol.
Chrisman, N.R. 1982b. "A Theory of Cartographic Error and Its Measurement in Digital Data Bases". Proceedings Auto-Carto 5, Environmental Assessment and Resource Management, Foreman, J. (ed), American Society of Photogrammetry and American Congress of Surveying and Mapping, pp.159-168.
Chrisman, N.R. 1984. "The Role of Quality Information in the Long-Term Functioning of a Geographic Information System". Cartographica, pp.521-529.
Chrisman, N.R. 1989. "Error in Categorical Maps: Testing versus Simulation". Auto-Carto 9, pp.521-529.
Douglas, D.H. and Peucker, T.K. 1973. "Algorithms for the reduction of the number of points required to represent a digitized line or its caricature". Canadian Cartographer, 10:112-122.
error in digital databases of land use: an empirical study". International Journal of Geographic Information Systems, 4(4):385-398.
Dutton, G. 1992. "Handling Positional Uncertainty in Spatial Databases". Proceedings, 5th International Symposium on Spatial Data Handling, pp.460-469.
Energy, Mines and Resources 1976. "A Guide to the Accuracy of Maps". Technical Report Series, Ottawa.
Fisher, P.F. 1987. "The Nature of Soil Data in GIS -- Error or Uncertainty". IGIS Symposium: The Research Agenda, Vol. 3, pp.307-318, Arlington, VA. NASA.
Goodchild, M.F. 1988. "The Issue of Accuracy in Global Databases", in Building Databases for Global Science, Mounsey, H. and Tomlinson, R.F. eds. Taylor and Francis.
Goodchild, M.F. 1991. "Keynote address". Proceedings, Symposium on Spatial Database Accuracy. Department of Surveying and Land Information, University of Melbourne, pp.1-16.
Goodchild, M.F. 1993. "Data Models and Data Quality: Problems and Prospects", in Environmental Modeling with GIS, Goodchild, M.F., Parks and Steyaert Eds. Oxford University Press.
Goodchild, M.F. 1996. "Generalization, Uncertainty, and Error Modeling". GIS/LIS '96, pp.765-774.
Goodchild, M.F. and Dubuc, O. 1987. "A Model of Error for Choropleth Maps with Applications to GIS". Auto-Carto 8, pp.165-174.
Honeycutt, D.M. 1985. "Epsilon, Generalization, and Probability in Spatial Data Bases", Research Paper, Dept. of Geography, UCSB
Hudson, D. 1988, "Some Comments on Data Quality in a GIS". Technical Papers, ACSM-ASPRS Annual Convention, Volume 2.
Jenks, G.F. 1981. "Lines, Computers, and Human Frailties". AAAG, 71(1):1-10.
Keefer, B.J., Smith, J.L. and Gregoire, T.G. 1988. "Simulating Manual Digitizing Error with Statistical Models". GIS/LIS '88, pp.475-483.
Keefer, B.J., Smith, J.L. and Gregoire, T.G. 1991. "Modeling and Evaluating the Effects of Stream Mode Digitizing Errors on Map Variables". Photogrammetric Engineering and Remote Sensing, 57(7):957-963.
Klinkenberg, B. and Xiao, Y. 1990. "Some Conceptual Definitions in Error Analysis in GIS". GIS '90, Canadian Symposium, pp.1124-1130, CISM.
Maffini, G., Arno, M. and Bitterlich, W. 1989. "Observations and comments on the generation and treatment of error in digital GIS data". Accuracy of Spatial Databases, Goodchild, M.F. and Gopal, S. Eds. Taylor and Francis.
Marble, D. 1996. Personal Communication.
Mark, D.M. and Csillag, F. 1989. "The Nature of Boundaries on 'Area-Class' Maps". Cartographica, 26(1):65-78.
Mead, D. A. 1982. "Assessing data quality in geographic information systems" in Remote sensing for resource management, Johannsen, C.J. and Sander, J.L. (eds.), Ankeny, Iowa.
Muller, J.C. 1992. "Towards an Integrated Cartographic Research Model: Suggestions and Priorities". Computers, Environment and Urban Systems, 16:249-259.
Openshaw, S. 1989. "Learning to live with errors in spatial databases" in Accuracy of Spatial Databases, pp.263-276.
Otawa, T. 1987. "Accuracy of Digitizing: Overlooked Factor in GIS Operations" in GIS '87, pp.295-299.
Perkal, J. 1956. "On the epsilon length". Bulletin de l'Académie Polonaise des Sciences, 4(7):399-403.
Rogowski, A.S. 1995. "Quantifying soil variability in GIS applications 1. Estimates of position". International Journal of Geographic Information Systems, 9(1):81-94.
StatSoft. 1994. CSS (Complete Statistical System): Statistica. StatSoft Incorporated, Tulsa, Oklahoma.
Star, J. and Estes, J. 1990. Geographic Information Systems: An Introduction. Prentice Hall, Englewood Cliffs, New Jersey.
Thapa, K. and Bossler, J. 1992. "Accuracy of Spatial Data Used in Geographic Information Systems". Photogrammetric Engineering and Remote Sensing, 58(6):835-841.
Traylor, C.T. 1979. "The evaluation of a methodology to measure manual digitizing error in cartographic data bases", unpublished Ph.D. dissertation, University of Kansas.
- . 89- 12, Santa Barbara, california.
Veregin, H. 1991. "GIS Data Quality Evaluation For Coverage Documentation Systems". Report for the Environmental Protection Agency, Las Vegas, Nevada.
Veregin, H. 1993. "Quality assurance for GIS databases". Research in Contemporary & Applied Geography: a Discussion Series - State University of New York at Binghamton, v17 n2, 18 pp.
Veregin, H. 1994. GIS Quality Assurance Research. Lockheed Engineering and Sciences Company/ Environmental Monitoring Systems Laboratory, US Environmental Protection Agency.
Vonderohe, A.P. and Chrisman, N.R. 1985. "Tests to Establish the Quality of Digital Cartographic Data: Some Examples From the Dane County Land Records Project" in Proceedings Auto-Carto 7, pp.552-559.
Woodward, D. 1992. "The Representation of the World" in Geography's Inner Worlds: Pervasive Themes in Contemporary American Geography, Abler, R.F., Marcus, M.G. and Olson, J.M., Rutgers University Press, New Brunswick, NJ, 50-73.