Texas A&M University

CVEN 658 – Civil Engineering Applications of GIS

Hotspot Analysis of Highway Accident Spatial Pattern Based on Network Spatial Weights

Instructor: Dr. Francisco Olivera

Author: Yanru Zhang

Zachry Department of Civil Engineering

December 06, 2010

ABSTRACT

Traffic accidents are increasingly being recognized as a social and public health challenge due to the increased mobility of today’s society. Factors that influence traffic accidents include equipment failure, roadway design, poor roadway maintenance, and driver behavior. Empirical experience shows that traffic incidents exhibit spatial patterns: some places are more likely to have accidents than others, because of poor roadway design or more aggressive driving in the area. Identifying the areas where traffic incidents frequently occur helps road managers allocate resources to those areas, either to improve roadway conditions or to develop strategies that discourage aggressive driving. In this paper, I apply spatial statistics techniques in ArcGIS to study the spatial relationships among highway incidents in Houston. The study involves three major steps: constructing a network dataset, generating a spatial weights matrix, and conducting the statistical analysis. The Generate Network Spatial Weights tool is used to obtain spatial weights between incidents that are based on the road network rather than on straight-line distances. The resulting spatial weights matrix is then used in Hot Spot Analysis (Getis-Ord Gi*) to obtain the final results.

INTRODUCTION

Due to the increased mobility of today’s society, road traffic accident analysis and prevention are increasingly recognized as important topics. In the United States in 2009, there were 30,797 fatal crashes in which 33,808 people died, which is approximately four deaths per hour (33,808 / 8,760 hours ≈ 3.9). A study released by the World Health Organization estimates that 50,000,000 people are injured and 1,200,000 people are killed in road crashes each year worldwide, and that accidents will increase by an estimated 65% over the next 20 years unless new prevention actions are taken. Because highway accidents are distributed over all sections of the road network, inspecting every single location where an accident happens is impractical. Identifying the areas (or road sections) where traffic accidents frequently happen helps road managers allocate resources to those areas, either to improve roadway conditions or to develop strategies that avoid road accidents or reduce losses. Factors that contribute to road accidents include traffic volume, roadway design, weather, the configuration of the highway network, and highway maintenance, and these factors all exhibit strong spatial patterns (Xie and Yan 2008). Investigating the spatial patterns of traffic accidents is therefore a crucial step in understanding how, where, and when traffic accidents happen. To identify the areas where traffic accidents frequently happen, this study uses cluster analysis. A cluster occurs when a feature and its neighboring features have similarly high or similarly low values. Identifying the locations of accident clusters can help to identify the causes of the accidents, and comparing them with locations where no cluster occurs can reveal further contributing causes.

GIS technology, a valuable tool that combines spatial information with other data, has been widely used in road accident analysis to visualize accident data and analyze hotspots on highways. Here, an accident hotspot is a cluster of individual accidents. A GIS can hold a large amount of data that can be easily stored, shared, analyzed, and managed (Erdogan et al. 2007). Many existing studies consider only geometric (straight-line) distance and do not take the road network into consideration. However, accidents occur on the network, so the road network should be taken into account when measuring the distance between two accident locations. In this study, distances between accidents are defined by the configuration of the road network, so that the spatial relationships among accident data are based on the highway network. To do this, a network dataset is first created as the basis for the accident analysis. The Generate Network Spatial Weights tool is then used to calculate the weights matrix for the accident data, and Hot Spot Analysis (Getis-Ord Gi*) is used to obtain the spatial relationships among the traffic accident data.

LITERATURE REVIEW

GIS-based accident information systems provide a platform for spatial analyses of accident data that are almost impossible with a non-spatial database. Since 1990, GIS technologies and their applications to traffic safety and accident analysis have gained popularity among agencies and researchers. Erdogan et al. (2007) summarized existing methods used in traffic accident analysis, which include intersection or segment analysis, proximity analysis, spatial query analysis, cluster analysis, and density analysis. They also used two statistical methods, kernel density analysis and repeatability analysis, to analyze the accidents and determine accident hot spots. Their results showed that crossroads and junction points are places where accidents frequently happen. Erdogan (2009) studied inter-province differences in traffic accidents and mortality. GIS was used to extract features that can influence accidents, such as the day, temperature, humidity, weather conditions, and month in which the accidents occurred. The CFS method was applied to select the important features, and a support vector machine (SVM) and an artificial neural network (ANN) were used to classify the traffic accident dataset. The results show that the proposed model predicts traffic accidents better than the SVM or ANN models alone. Anderson (2009) used GIS and kernel density estimation to study the spatial relationships among injury-related accident data and then applied a K-means clustering algorithm to identify accident hot spots. Based on collision and accident attribute data for London, UK, five groups and 15 clusters were created.

There is no universally accepted definition of an accident hotspot. Hauer (1997) describes two widely used methods for ranking accident locations: one is based on accident rate and the other on accident frequency. Road accident hotspot analyses usually focus on road segments or junctions; area-based analyses are seldom used in existing studies. A comprehensive understanding of the factors that contribute to accidents, for example the severity of the accident and the surrounding environment, is important in hotspot analysis. Because a GIS platform can link a large number of disparate databases, it supports both historical and statistical analysis of traffic accidents. The most commonly used functionality in traffic accident analysis is the spatial analysis extension, which provides various ways to conduct accident analysis.

METHODOLOGY

The purpose of studying the distribution of traffic accidents is to find clusters of accidents that share a feature, such as clearance time, number of people injured, or number of deaths. In this study, clearance time is used as the attribute: an accident with a long clearance time is treated as a serious accident, and one with a short clearance time as a minor accident. The basic idea of network-based hotspot analysis is to first calculate the network spatial weights between every pair of accidents and then use the Hot Spot Analysis (Getis-Ord Gi*) tool in ArcGIS to find the locations where accidents with long clearance times occur. Three steps are involved:

Data Preparation

The data used for network-based accident hotspot analysis include accident data and road network data. The accident data include the longitude and latitude of each accident location, the roadway name, the cross street name, and the clearance time. The road network data should include the line features of the road network, the length of each road section, the longitude and latitude of the roads, and the turn features.

Network Dataset

Before generating network spatial weights, a network dataset is needed to represent the distances among accident locations. To create a network dataset, first enable the Network Analyst extension in ArcCatalog. In ArcCatalog, go to the directory where the road network shapefile is located and choose New Network Dataset to start defining the attributes of the network dataset. The following steps define the name of the network dataset, the network connectivity, the elevation field settings, the turn information, and the driving directions. After all the settings are defined, click Yes to build the network, and then close ArcCatalog. The resulting network dataset is a representation of the transportation network and can model impedances, restrictions, and hierarchy for the network. A network dataset includes three files: two shapefiles (one holding the line features that represent the roadways and one holding the junctions where two roadways intersect) and one shapefile-based network dataset.

Generate Network Spatial Weights

Unlike traditional statistical methods, spatial statistics take space and spatial relationships into consideration. Spatial weights are a conceptualization of the spatial relationship between any two points and are very important in hotspot analysis, because different definitions of the weights lead to different results. Euclidean distance, contiguity, and fixed or inverse distance are the most commonly used weighting schemes. Because the spatial relationships among traffic accidents are closely tied to the road network, defining spatial relationships in terms of the real road network should be more accurate. In this study, the weights among accidents are therefore calculated based on the road network, using the recently developed Generate Network Spatial Weights tool in ArcGIS. Figure 1 illustrates the different conceptualizations of spatial relationships.

Fig. 1. Most commonly used spatial weights (panels: Inverse Distance, Distance Band, Zone of Indifference, Network Spatial Weights)

Inverse distance assumes that every feature is related to every other feature and that the relationship weakens as the distance between features grows. A fixed distance band lets one specify a threshold distance: features within that distance are treated as related, and features beyond it as unrelated. The weight is therefore a fixed number within the threshold and drops immediately to zero beyond it. The zone of indifference combines the inverse distance and distance band methods: the weight is a fixed number within the threshold distance and decreases gradually toward zero beyond it. Network spatial weights differ from the previous three methods in that the weights among objects are defined on a network dataset. Since traffic accidents are network based, it is more appropriate to define the distances among accident points using network spatial weights.
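
The three distance-based conceptualizations described above can be sketched as simple weight functions. This is only an illustration: the function names are hypothetical, and the decay used beyond the threshold in the zone of indifference (threshold divided by distance) is an assumed form chosen for continuity, not necessarily the formula ArcGIS uses.

```python
def inverse_distance_weight(d, eps=1e-9):
    """Weight shrinks as 1/d; eps guards against division by zero."""
    return 1.0 / max(d, eps)


def fixed_band_weight(d, threshold):
    """Weight is 1 inside the distance band and 0 outside it."""
    return 1.0 if d <= threshold else 0.0


def zone_of_indifference_weight(d, threshold):
    """Weight is 1 inside the threshold, then decays toward zero.

    The decay threshold / d is an assumption made for this sketch.
    """
    if d <= threshold:
        return 1.0
    return threshold / d


# Compare the three schemes for a few distances with a threshold of 1.0.
for d in [0.5, 1.0, 2.0, 4.0]:
    print(d,
          inverse_distance_weight(d),
          fixed_band_weight(d, 1.0),
          zone_of_indifference_weight(d, 1.0))
```

For network spatial weights, the same functions could in principle be applied with d measured along the road network instead of in a straight line.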

Hot Spot Analysis (Getis-Ord Gi*)

After the network spatial weights are generated, the next step is the traffic accident hotspot analysis. The Hot Spot Analysis tool in ArcGIS applies the Getis-Ord Gi* statistic and calculates a z-score at each location, which indicates whether features with high values or features with low values are clustered there. In this study, accident duration is used as the criterion to identify where accidents with longer durations cluster and where accidents with shorter durations cluster. The Getis-Ord Gi* statistic is defined as follows:

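$$G_i^* = \frac{\sum_{j=1}^{n} w_{i,j} x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^2 - \left(\sum_{j=1}^{n} w_{i,j}\right)^2}{n-1}}}$$

$$\bar{X} = \frac{\sum_{j=1}^{n} x_j}{n}, \qquad S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2}$$

Where

$x_j$: The attribute value for feature $j$.

$n$: Sample size.

$w_{i,j}$: Spatial weight between features $i$ and $j$.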

The outcome of the Gi* statistic is a z-score for each feature. A high positive z-score indicates a cluster of accidents that last for a longer period, while a low negative z-score indicates that a large number of accidents with shorter durations are located around the feature.
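
As a minimal sketch of how the Gi* z-score could be computed outside ArcGIS, the function below follows the standard Getis-Ord Gi* form for a dense weights matrix. The function name, the toy durations, and the binary "adjacent along a line" weights are all assumptions for illustration; they are not the Houston data or the weights used in this study.

```python
import math


def getis_ord_gi_star(values, weights):
    """Return one Gi* z-score per feature.

    values  : list of attribute values x_j
    weights : n x n list of lists, weights[i][j] = w_ij (self included)
    Assumes the values are not all identical and that no row gives the
    same weight to every feature, otherwise the denominator is zero.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(x * x for x in values) / n - mean ** 2)
    z_scores = []
    for i in range(n):
        row = weights[i]
        sum_w = sum(row)
        sum_w2 = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        z_scores.append(numerator / denominator)
    return z_scores


# Toy example: six accidents along one road, long durations at one end.
durations = [9.0, 8.0, 7.0, 1.0, 2.0, 1.0]
count = len(durations)
# Each accident is related to itself and to its immediate neighbours.
weights = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(count)]
           for i in range(count)]
z_scores = getis_ord_gi_star(durations, weights)
print([round(z, 2) for z in z_scores])
```

In this toy case the first accidents receive positive z-scores (long durations clustered together) and the last ones negative z-scores.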

Hot spot analysis begins with the null hypothesis that no spatial pattern exists among the studied features. In this study, the null hypothesis is that there is no spatial correlation among traffic accidents. If the null hypothesis is true, the z-scores should approximately follow the standard normal distribution. The z-score is used as the criterion to decide whether the null hypothesis should be rejected, while the p-value gives the probability of wrongly rejecting the null hypothesis, that is, of seeing a pattern this extreme when none exists.


Fig. 2. Normal distribution, the p-values and z-scores

At the tails of the normal distribution, z-scores are either very high or very low and the p-values are small. In these situations the null hypothesis is unlikely to be true, which means that a spatial pattern exists. The outcome of the hotspot analysis is a z-score and a p-value for each accident. Thus, if most accidents in an area have high z-scores and low p-values, the area is very likely an accident-prone area, and actions are needed to prevent accidents there or to relieve their effects.
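
The relationship between z-scores and p-values can be illustrated with a short sketch that converts a z-score into a two-tailed p-value under the standard normal distribution. The function names and the 0.05 significance level are assumptions chosen for illustration.

```python
import math


def two_tailed_p_value(z):
    """Two-tailed p-value of a z-score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def classify(z, alpha=0.05):
    """Label a feature as a hot spot, cold spot, or not significant."""
    if two_tailed_p_value(z) >= alpha:
        return "not significant"
    return "hot spot" if z > 0 else "cold spot"


for z in [-2.5, 1.0, 1.96, 2.5]:
    print(z, round(two_tailed_p_value(z), 4), classify(z))
```

A z-score near ±1.96 corresponds to a p-value of about 0.05, which is why values in the tails are taken as evidence against the null hypothesis.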

APPLICATION

I chose Houston highway accident data for the accident hotspot analysis. The Getis-Ord Gi* statistic is used to obtain the p-values and z-scores. Network spatial weights and Euclidean distance are used as two different ways to measure the spatial distance between accidents. Before conducting the hotspot analysis, one first needs to construct the network dataset, which provides the basic structure for calculating the network spatial weights. The Generate Network Spatial Weights tool is then used to obtain the spatial weights. The last step is the Getis-Ord Gi* analysis of accident hotspots.

Data Description

The data used in this study are Houston highway accident data and a Houston highway network shapefile. Accident data can be obtained from police reports and should include basic accident attributes, for example geographic coordinates, accident duration, and the corresponding street information. Figure 3 shows the basic information in the accident data, which includes the latitude and longitude of the accident location, the incident ID, the incident duration, and so on. The accident data are stored in an Excel file, so the Excel data must first be added through the Add Data dialog box. To display the accident locations on the map, one then uses the Make XY Event Layer tool to create a point feature layer that can be exported as a shapefile.
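
Outside ArcGIS, the same idea of turning tabular accident records into point features can be sketched in plain Python. The column names and the two sample rows below are made up for illustration and are not the schema or contents of the Houston data; the sketch also reads CSV text rather than an Excel file.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions for this sketch.
SAMPLE_CSV = """IncidentID,Latitude,Longitude,Duration
1,29.7604,-95.3698,35
2,29.7355,-95.4100,120
"""


def load_accident_points(text):
    """Parse accident rows into point records (x = longitude, y = latitude)."""
    points = []
    for row in csv.DictReader(io.StringIO(text)):
        points.append({
            "id": row["IncidentID"],
            "x": float(row["Longitude"]),
            "y": float(row["Latitude"]),
            "duration": float(row["Duration"]),
        })
    return points


accident_points = load_accident_points(SAMPLE_CSV)
print(accident_points)
```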

Fig. 3. The basic information for accident data

The road network data should contain the basic information needed to create a network dataset. The Houston highway network data were obtained from the Houston-Galveston Area Council website and contain the basic information required to create the network dataset.

Generate Network Dataset

The Houston highway network I obtained is a simple line feature file, which contains one network impedance value: distance. To create a network dataset, one needs to start ArcCatalog, enable the Network Analyst extension, and then create the network by choosing the New Network Dataset option, as shown in Figure 4. I named the new network dataset hgac_majthrfare_ND, used global turns, and chose the length of the road as the cost. The summary of the newly created network is shown in Figure 5. If everything is correct, choose Finish to generate the network dataset.


Fig. 4. New Network Dataset function

Fig. 5. The summary of the new network dataset


After the network dataset is successfully created, three files are present: two shapefiles and one shapefile-based network dataset. Figure 6 shows the three files. The hgac_majthrfare_ND file contains the basic network dataset information, and the network analysis functions operate on this shapefile-based network dataset. The hgac_majthrfare_ND_Junctions shapefile represents the intersections of the road network.

Fig. 6. Shapefiles of network dataset

After the network dataset is created in ArcCatalog, it can be opened in ArcMap. The distance between two points is then calculated along the network instead of as a straight-line distance. Figure 7 shows the travel distance between point 1 and point 2, which is longer than the straight-line distance. If point 1 and point 2 are two accident locations, it is more reasonable to use this distance to represent their spatial relationship, since road accidents are closely related to the road network.
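
The difference between network distance and straight-line distance can be illustrated with a small shortest-path sketch. The three-junction network below is hypothetical, and Dijkstra's algorithm is used here only as a generic way to measure distance along a network; it is not a description of how ArcGIS computes it internally.

```python
import heapq
import math


def network_distance(graph, source, target):
    """Shortest travel distance from source to target (Dijkstra)."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, length in graph[node]:
            candidate = dist + length
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return math.inf  # target not reachable


# Hypothetical toy network: three junctions joined by two road links.
coords = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (3.0, 4.0)}
graph = {
    "A": [("B", 3.0)],
    "B": [("A", 3.0), ("C", 4.0)],
    "C": [("B", 4.0)],
}

along_network = network_distance(graph, "A", "C")
straight_line = math.hypot(coords["C"][0] - coords["A"][0],
                           coords["C"][1] - coords["A"][1])
print(along_network, straight_line)
```

Here the travel distance from A to C along the links is longer than the straight-line distance, which mirrors the situation shown for point 1 and point 2.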


Fig. 7. Network distance between two points

Generate Network Spatial Weights

After the road network dataset is generated, the network spatial weights can be calculated. To generate network spatial weights, a point feature class is needed to represent both the feature origins and the feature destinations. In this case, the accident locations serve as both. The Generate Network Spatial Weights tool first locates the accidents on the highway network and then uses travel distance to calculate the weight between each accident location and every other accident location. Figure 8 shows the process of creating the network spatial weights for the accident data. The accident data and the Houston highway network data are first displayed on the map, and then the Generate Network Spatial Weights tool is opened. The input feature class is the accident shapefile, the input network is the Houston highway network, and the impedance attribute is distance in miles.

Fig. 8. The generate network spatial weights function

The output of the Generate Network Spatial Weights tool is a spatial weights matrix file, which contains the spatial relationships among all objects. Figure 9 shows the spatial weights matrix file in table format: FieldID is the “from” feature ID, NID is the “to” feature ID, and WEIGHT represents the spatial relationship between the “from” feature and the “to” feature. This file is used to represent the spatial relationships among accident points in Hot Spot Analysis (Getis-Ord Gi*).
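
The table format described above, one row per (from ID, to ID, weight) triple, can be turned into a lookup structure with a few lines of code. The sample rows and function names are hypothetical and serve only to illustrate the layout; pairs that do not appear in the table are treated as having zero weight, which is an assumption of this sketch.

```python
def build_weight_lookup(rows):
    """Group (from_id, to_id, weight) rows into {from_id: {to_id: weight}}."""
    lookup = {}
    for from_id, to_id, weight in rows:
        lookup.setdefault(from_id, {})[to_id] = weight
    return lookup


def weight_between(lookup, from_id, to_id):
    """Return the stored weight, or 0.0 when the pair is not in the table."""
    return lookup.get(from_id, {}).get(to_id, 0.0)


# Hypothetical rows in the order FieldID, NID, WEIGHT.
rows = [(1, 2, 0.5), (1, 3, 0.25), (2, 1, 0.5), (3, 1, 0.25)]
lookup = build_weight_lookup(rows)
print(weight_between(lookup, 1, 3), weight_between(lookup, 2, 3))
```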


Fig. 9. Network based spatial weights matrix

Hot Spot Analysis (Getis-Ord Gi*)

Several tools in ArcGIS can conduct accident hotspot analysis; one of them is Hot Spot Analysis (Getis-Ord Gi*). This tool calculates the Getis-Ord Gi* statistic for each accident to show where accidents with long clearance times cluster and where accidents with short clearance times cluster. In this study, I compute the Getis-Ord Gi* statistic of the accident data with two different definitions of spatial weights: the commonly used Euclidean distance and the network spatial weights. Figure 10 shows the Hot Spot Analysis (Getis-Ord Gi*) tool in ArcGIS. The input feature class is the accident data; different choices for the conceptualization of spatial relationships lead to different results. The available options are inverse distance, inverse distance squared, fixed distance band, zone of indifference, and get spatial weights from file. A distance band or threshold distance can also be specified.


Fig. 10. Hot spot analysis (Getis-Ord Gi*) function

I first chose inverse distance as the conceptualization of spatial relationships and Euclidean distance as the distance method, so that the relationships among accidents are calculated from the inverse Euclidean distance. That is, nearby accidents have a closer relationship than accidents located far away. The results of this hotspot analysis are shown in Figure 11. The blue points indicate where accidents with shorter clearance times are clustered, while the red points indicate where accidents with longer clearance times are clustered. The Euclidean distance based hotspot analysis can thus identify the areas where accidents with long clearance times cluster and the areas where accidents with short clearance times cluster.


Fig. 11. Euclidean distance based hot spot analysis

I then conducted the network-based hotspot analysis by choosing the get spatial weights from file option and using the network spatial weights (.swm) file created earlier to define the spatial relationships of the accident data. The distance between any two accidents is thus calculated along the network. Figure 12 shows the results of the hotspot analysis: the red points indicate the locations where accidents with longer clearance times are clustered, and the blue points indicate the locations where accidents with shorter clearance times are clustered. The network-based hotspot analysis is able to identify the road links where accidents frequently happen.


Fig. 12. Network based hot spot analysis

COMPARISON WITH OTHER METHODS

Two other methods for studying the spatial distribution of accident data are the central feature method and the point density method. The Central Feature tool identifies the most centrally located feature in the accident data. Figure 13 shows how the Central Feature tool works. The input features are the highway accidents; I chose Euclidean distance to calculate the distance between each pair of features and used the roadway to group features. The output is a point feature located at the center of the studied objects; in this study, it is the center of the accidents that happen on the same road section. Figure 14 shows the accident center of each road section. If one wants to find the best location from which to respond to potential future accidents, the Central Feature tool can be used.
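
The idea behind the central feature method can be sketched as follows: within each group, pick the accident whose total Euclidean distance to all other accidents in the group is smallest. The roadway names and coordinates are hypothetical, and the function names are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def central_feature(points):
    """Return the point with the smallest summed distance to all others."""
    def total_distance(p):
        return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for q in points)
    return min(points, key=total_distance)


def central_feature_by_group(records):
    """Find the central accident separately for each roadway."""
    groups = defaultdict(list)
    for roadway, x, y in records:
        groups[roadway].append((x, y))
    return {roadway: central_feature(pts) for roadway, pts in groups.items()}


# Hypothetical accidents as (roadway, x, y).
records = [
    ("I-10", 0.0, 0.0), ("I-10", 1.0, 0.0), ("I-10", 5.0, 0.0),
    ("US-59", 0.0, 0.0), ("US-59", 0.0, 2.0), ("US-59", 0.0, 3.0),
]
centers = central_feature_by_group(records)
print(centers)
```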


Fig. 13. Central feature function


Fig. 14. Accident center of each link

The point density method shows where accidents are concentrated by displaying an accident attribute on the map. This analysis can be performed with the Point Density tool in ArcMap, as shown in Figure 15. The input point feature is the accident data, the population field is LnDuration, and the output cell size is 500. Figure 16 shows the accident density map. This map offers a general view of where accidents are densely located, but in the Houston accident analysis it offers very little information: because the highway network is densest at its center, the accident data are also densely located there.
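
A simplified version of a density surface can be sketched by binning accidents into square cells and dividing the summed population field by the cell area. This is a deliberately reduced illustration: it uses plain grid cells rather than a moving neighborhood, so it is not the same computation as the ArcMap tool, and the coordinates and values are made up.

```python
from collections import defaultdict


def grid_density(points, cell_size):
    """Sum the value of points per grid cell and divide by cell area.

    points    : iterable of (x, y, value), value playing the role of
                the population field
    cell_size : edge length of a square cell, in map units
    """
    totals = defaultdict(float)
    for x, y, value in points:
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell] += value
    area = cell_size * cell_size
    return {cell: total / area for cell, total in totals.items()}


# Made-up accidents as (x, y, value) with a cell size of 500 map units.
sample_points = [(100.0, 100.0, 2.0), (400.0, 450.0, 3.0), (900.0, 100.0, 1.0)]
density = grid_density(sample_points, 500.0)
print(density)
```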

Fig. 15. Point density feature function


Fig. 16. Accidents density map

CONCLUSIONS

Hotspot analysis performs better than the central feature and point density methods in identifying accident-prone areas. The central feature method can only point out the center of the studied accidents and cannot show where accidents frequently happen. The point density method can show the areas where accidents frequently happen, but it only displays a density map and offers just a general view of where accidents are more likely to occur. Hotspot analysis identifies the locations where accidents frequently happen using a statistical test, and is therefore more reliable.

Network-based hotspot analysis identifies the road sections where accidents happen, while Euclidean distance based hotspot analysis can only point out the general area where accidents frequently happen. Because accidents are closely related to the road network, it is more reasonable to calculate the spatial pattern of traffic accidents based on the network. The results of this project show that network-based hotspot analysis is able to point out the links on which accidents happen.

FUTURE RESEARCH

The network dataset should be refined to reflect real highway network conditions. Because of a lack of data, the Houston highway network dataset was simplified: the cost for the highway network is based only on the length of each link, and other impedance factors were not taken into consideration. In future research, to make the results more accurate, the network dataset should be refined if the relevant information is available.

This study focuses only on identifying the locations where accidents frequently happen. The next step is to study the factors that may influence traffic accidents. One way to identify these factors is to study the similarities among accident-prone areas, so that transportation agencies can take proper actions to prevent accidents by controlling these factors.

REFERENCES

Xie, Z., and Yan, J. (2008). “Kernel density estimation of traffic accidents in a network space.” Computers, Environment and Urban Systems, 32(5), 396-406.

Erdogan, S., Yilmaz, I., Baybura, T., and Gullu, M. (2007). “Geographical information systems aided traffic accident analysis system case study: City of Afyonkarahisar.” Accident Analysis and Prevention, 40(1), 174-181.

Erdogan, S. (2009). “Explorative spatial analysis of traffic accident statistics and road mortality among the provinces of Turkey.” Journal of Safety Research, 40(5), 341-351.

Anderson, T. K. (2009). “Kernel density estimation and K-means clustering to profile road accident hotspots.” Accident Analysis and Prevention, 41(3), 359-364.

Hauer, E. (1997). Observational before-after studies in road safety. Pergamon, Oxford.
