A Survey of Methods to Extract Buildings from High-Resolution Satellite Images
Ryan Friese
CS510
There are many things in this world that are simply amazing. Some are crafted by nature, such as the mighty Grand Canyon, the lush rain forests of the tropics, and even a tiny flower on a high mountaintop sending its roots down through cracks in the rock. Others are man-made; the Great Pyramids of Egypt, the Taj Mahal, and the Sistine Chapel all come to mind. While all these things are amazing in their own right, something even more amazing may be going on behind the scenes. I am of course talking about the power of human vision, the ability that lets us easily comprehend patterns, colors, and shapes. It was not until I started looking into the field of computer vision that I realized that what allows humans to comprehend images and scenes with little thought is not as trivial or as simple as it may seem.
Computer vision is a broad area of study, with research happening in many areas such as face detection, video tracking, object recognition, scene restoration, and countless others. Each of these areas can be further refined into more specific sub-areas. In this paper I will discuss a specific sub-area of object recognition. First, though, let us look at exactly what object recognition is. In its most basic form, object recognition is exactly what it sounds like: the ability to recognize objects. We as humans perform it quickly and efficiently, even when objects are seen from different viewpoints, at different sizes, under different lighting, and even when they are partially obscured from view. Despite being so easy for humans, object recognition remains a challenge for computer vision systems. Sub-areas within object recognition range from reading text off a printed circuit board to finding a human face in a picture, and there are many applications. In this paper I address the problem of extracting buildings from high-resolution satellite images.
Extracting buildings from high-resolution satellite images is an open research topic in the field of photogrammetry. Many applications require accurate information about the data contained in satellite images; this information is needed by land management agencies, urban development planners, the U.S. military, and many others. Several problems complicate building extraction, including different viewing angles, surrounding vegetation, shadows, and other objects that obscure the edges of the buildings being detected. Many strategies have been used to solve this building detection problem. In this paper I give a brief survey of several different techniques, compare their similarities and differences, and discuss their results. I will examine five different solutions: texture classification and analysis, used by (1) and (2); neural networks, used by (3); active contour models, used by (4); local Gabor features, used by (5); and model-based detection, used by (6).
The goal of all these solutions is to extract vector data representing buildings, which can then be used by the various applications that consume such data. These solutions aim to generate the vector data automatically, with as little human editing as possible. Let us first look at what each technique involves, and then begin comparing and analyzing the approaches. We will start with the texture classification and analysis described in (1) and (2). The idea behind a textural analysis approach is to use one or more specific texture patterns, which are compared against the satellite image. The algorithm looks for areas of the image that are similar to the texture, allowing the image to be partitioned into specific areas such as buildings, vegetation, and water. To separate these areas from each other, there must be a way of measuring similarity between the texture and the image. There are various ways to measure similarity, as seen in the approaches of (1) and (2). While both approaches identify buildings using textures, they perform this measurement differently. In the approach used by (1), each specific area a texture represents is considered a class; for example, there could be classes for building textures, vegetation textures, water textures, and so on. These classes are modeled as probability distributions over a “texton” library. The textons are essentially the averaged combination of all the image patches in a given class, which are supplied from a training set.
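To make this kind of texton classification concrete, here is a minimal sketch of nearest-neighbor matching with the χ² statistic over class histograms, with a rejection threshold. The class models, values, and threshold are all hypothetical illustrations, not values from (1).

```python
# Sketch of texton-style classification: classes modeled as normalized
# histograms, matched with a chi-squared nearest-neighbor rule and a
# rejection threshold. All numbers here are made up for illustration.

def chi_squared(h1, h2, eps=1e-10):
    """0.5 * sum((a-b)^2 / (a+b)) -- a standard histogram distance."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def classify(patch_hist, class_hists, threshold):
    """Return the nearest class name, or None if no class is close enough."""
    best_name, best_dist = None, float("inf")
    for name, hist in class_hists.items():
        d = chi_squared(patch_hist, hist)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist < threshold else None

# Toy class models: normalized texton-response histograms per class.
classes = {
    "building":   [0.70, 0.20, 0.10],
    "vegetation": [0.10, 0.30, 0.60],
    "water":      [0.05, 0.85, 0.10],
}

print(classify([0.65, 0.25, 0.10], classes, threshold=0.2))   # building
print(classify([0.33, 0.33, 0.34], classes, threshold=0.05))  # None (rejected)
```

The rejection case mirrors the learned threshold in (1): a patch that is not close enough to any class is simply not labeled.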
Training sets are also used in approaches (3) and (4). The textons are then compared to 5×5 image patches in the satellite image, allowing each region of the image to be classified with a nearest-neighbor classifier using the χ² statistic. A region is considered similar only if its distance to some class is less than a learned threshold; otherwise the region is rejected. This pattern of training and threshold testing is used in (3) and (5) as well. The approach in (2) uses the GLCM (Grey Level Co-occurrence Matrix) and GLDV (Grey Level Difference Vector) to perform the similarity measure. A GLCM texture considers the relation between two neighboring pixels, a second-order texture. The co-occurrence matrix is formed from the target image by running a kernel mask over the image; the mask can be 3×3, 5×5, 7×7, and so on. The neighboring pixels can lie in one or more of eight defined directions, typically 0°, ±45°, ±90°, ±135°, and 180°. The use of multi-directional masks is also seen in (5). Because the masks are directed, the GLCM texture depends on both directionality and kernel size. The values in the GLCM texture can be interpreted using several known measures: the angular second moment, which measures the extent of pixel orderliness; contrast, which measures how many elements lie off the main diagonal of the GLCM; dissimilarity, which measures how different the elements are from each other; entropy, which measures randomness; energy, which measures the extent of pixel-pair repetitions; and homogeneity, which measures the uniformity of the co-occurrence matrix. Homogeneity is also an important measurement in (4). The GLDV measure is the sum of the diagonals of the GLCM. When the GLCM and GLDV textures are transformed back into image space, the authors of (2) state that both textures support the detection of shadow zones, the classification of building types, and the recognition of pavement. Comparing (1) and (2), the use of textures is integral to both algorithms, yet the two are vastly different. (1) creates an averaged texture from a training image and compares it to the test image; it selects areas that fall under a given similarity threshold, but it does not outline actual building boundaries, instead giving a general area where a building is. (2) uses a kernel mask to turn the entire image into a texture that highlights various zones in the image; this does not in and of itself outline building boundaries, but it does make the edges more apparent.
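To illustrate the GLCM idea, the following sketch builds a co-occurrence matrix for a single direction (horizontal neighbors) and computes two of the measures mentioned above, contrast and homogeneity. Real systems combine several directions and kernel sizes; the toy patches here are my own, not data from (2).

```python
# Sketch of a grey-level co-occurrence matrix (GLCM) for one direction
# (0 degrees, i.e. the horizontal neighbor), plus two standard measures.

def glcm(image, levels):
    """Normalized co-occurrence counts for horizontally adjacent pixels."""
    m = [[0.0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for a, b in zip(row, row[1:]):
            m[a][b] += 1
            total += 1
    return [[v / total for v in row] for row in m]

def contrast(m):
    """Weights elements by squared distance from the main diagonal."""
    n = len(m)
    return sum(m[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

def homogeneity(m):
    """High when mass concentrates on or near the main diagonal."""
    n = len(m)
    return sum(m[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))

flat  = [[1, 1, 1, 1]] * 4   # a uniform, rooftop-like patch
noisy = [[0, 3, 0, 3]] * 4   # an alternating, high-contrast patch

assert contrast(glcm(flat, 4)) == 0.0                       # perfectly ordered
assert contrast(glcm(noisy, 4)) > contrast(glcm(flat, 4))   # off-diagonal mass
assert homogeneity(glcm(flat, 4)) > homogeneity(glcm(noisy, 4))
```

The uniform patch puts all its mass on the GLCM's main diagonal, which is exactly why homogeneity is high and contrast is zero for building-like rooftop regions.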
Another viable approach to building extraction is the use of artificial neural networks, as proposed by (3). Like (1) and (4), this approach must first be trained on a set of training images before it can perform the extraction. It begins by segmenting the image with a seeded region growing algorithm: seed points are evenly distributed across the entire image, each seed point’s value is compared to its neighboring pixels, and neighbors are added to the region if the difference is below a set threshold. The region growing algorithm continues recursively with the newly added pixels until no more neighboring pixels fall below the threshold. This algorithm finds the homogeneous roof regions in the image. After region growing, the approach uses many features to classify the regions of the image, including area, perimeter, mean color and intensity, roundness, compactness, and structural features. The values of these features help determine whether a region is a building. For example, if the area of a region is more than 10,000 pixels, that region is assumed not to be a building. The roundness feature can help identify buildings by calculating the ratio of a region’s area to the square of its perimeter and discarding any region with a low ratio or a high roundness; a high roundness means the object could be something like a tree.
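As a rough illustration of the area and shape tests, the sketch below uses the common compactness formulation 4π·area/perimeter², which equals 1.0 for a perfect circle; the exact formula and thresholds used in (3) may differ, and the cutoff values here are my own assumptions.

```python
import math

# Sketch of region shape filtering: reject regions that are too large or
# too circular (tree-crown-like). The 0.9 compactness cutoff is hypothetical.

def compactness(area, perimeter):
    """4*pi*A/P^2: exactly 1.0 for a circle, lower for elongated shapes."""
    return 4 * math.pi * area / perimeter ** 2

def looks_like_building(area, perimeter, max_area=10000, max_compactness=0.9):
    if area > max_area:          # the >10,000-pixel rejection rule from (3)
        return False
    return compactness(area, perimeter) < max_compactness

# A 30x40 rectangular rooftop vs. a near-circular crown of the same area.
rect_area, rect_perim = 30 * 40, 2 * (30 + 40)
r = math.sqrt(1200 / math.pi)
circ_area, circ_perim = 1200, 2 * math.pi * r

print(looks_like_building(rect_area, rect_perim))  # True  (rectangular roof)
print(looks_like_building(circ_area, circ_perim))  # False (too round, tree-like)
```

The rectangle scores roughly 0.77 on compactness and passes, while the circle scores 1.0 and is discarded, mirroring the tree-rejection behavior described above.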
Since individual buildings normally maintain the same color and intensity across their rooftops, the mean color and intensity features measure variations in color, allowing detection of continuous building areas. The features for a given region are then fed as input to the neural network; if the network’s output exceeds a specified threshold, the region is recorded as a building. Thresholds are also used in approaches (1) and (5). This approach is capable of finding and outlining the boundaries of buildings.
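The seeded region growing step described above might be sketched as follows. The grid values, the threshold, and the choice of 4-connectivity are illustrative assumptions, not details from (3).

```python
# Sketch of seeded region growing: starting from a seed pixel, absorb
# 4-connected neighbors whose intensity difference from the seed value
# is below a threshold, continuing with newly added pixels.

def grow_region(image, seed, threshold):
    rows, cols = len(image), len(image[0])
    base = image[seed[0]][seed[1]]          # compare against the seed's value
    region, stack = set(), [seed]
    while stack:
        r, c = stack.pop()
        if (r, c) in region:
            continue
        region.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - base) < threshold):
                stack.append((nr, nc))
    return region

roof = [
    [200, 201, 199,  50],
    [198, 200, 202,  55],
    [ 60,  58, 201,  52],
]
region = grow_region(roof, seed=(0, 0), threshold=10)
print(len(region))  # 7 -- the seven bright, connected "roof" pixels
```

Note how the dark background pixels never enter the region even though they touch it: this is what makes the grown regions correspond to homogeneous rooftops.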
We will now examine the approach presented in (4), which uses an active contour model to perform the building extraction. An active contour model, also known as a snake, is a model used to delineate object outlines in images; it accomplishes this by minimizing the energy associated with the current contour. This approach uses a homogeneity and similarity measure similar to that of (2): it segments the image into regions so that the pixels inside each region have maximal homogeneity and similarity. The model can extract objects without obvious edges and is not sensitive to noise. It finds building boundaries by determining whether a given contour curve is inside or outside a building region, iterating until the curve reaches the minimal energy of the region, which indicates it has hit the boundary. This is similar to the way approach (6) operates. The actual building extraction runs through four steps. The first step runs the image through a Gaussian smoothing filter; approach (5) must also apply a preprocessing filter to its images. The next step takes the smoothed image and, as with (1) and (3), introduces training data into the system; this training data contains points that lie inside the boundaries of the buildings. The third step applies the active contour model, and the building boundaries are extracted. The fourth and final step is an accuracy assessment. This approach, along with (3), is capable of finding and outlining actual building boundaries.
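As a toy illustration of the energy minimization principle, consider a one-dimensional intensity profile: the boundary that minimizes the combined within-region squared deviation falls exactly at the intensity step. This is only an analogue of the homogeneity-driven energy, not the actual contour model in (4).

```python
# One-dimensional analogue of region-homogeneity energy minimization:
# pick the boundary position that makes both sides maximally homogeneous.

def region_energy(values):
    """Sum of squared deviations from the region mean (low = homogeneous)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_boundary(profile):
    """Split position minimizing inside-energy + outside-energy."""
    best_k, best_e = None, float("inf")
    for k in range(1, len(profile)):
        e = region_energy(profile[:k]) + region_energy(profile[k:])
        if e < best_e:
            best_k, best_e = k, e
    return best_k

# Bright rooftop pixels followed by darker background: the minimum-energy
# boundary lands exactly at the intensity step.
profile = [210, 208, 212, 209, 90, 95, 92]
print(best_boundary(profile))  # 4
```

In two dimensions, the evolving contour plays the role of the split position: it moves until the regions it separates can no longer become more homogeneous.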
Moving on to our next approach for building extraction, we are presented with (5). This approach uses Gabor filters and spatial voting to find and extract the centers of buildings in a satellite image. As we will see, I believe this approach could be used in conjunction with (4), but more on that later. The approach uses Gabor filter sets to detect building properties in the image. A two-dimensional Gabor filter is mathematically the product of a two-dimensional Gaussian and a complex exponential function; in other words, it can be thought of as a Gaussian filter modulated by a sinusoidal wave. As in (4), some preprocessing must be done to the image, here in the form of a 5×5 median filter that removes some of the noise which might be present, helping to prevent false detections. There is no guarantee that all the buildings in a given image will be oriented the same way; to account for this, approach (5) generates a set of six Gabor filters covering multiple directions. The same image must therefore be convolved with a Gabor filter six times, once for each filter in the set, resulting in six separate Gabor spaces. Each Gabor space is examined for local maxima, which indicate a building property; to prevent false detections, a threshold is set to eliminate weak maxima. The collected building properties are then used to generate descriptor vectors. For each feature, three descriptors are extracted: the feature location, the possible distance to the center of the building, and the feature’s dominant orientation. These descriptors describe the direction toward the building’s center. The authors of (5) note that this approach only works when buildings are brighter than the background they sit on: when the building is brighter, the second-highest edge lies close to the building center, allowing the direction to the center to be found. If the building is not brighter, the second-highest edge need not be close to the center, and no information about the direction toward the building’s center can be recovered. Once the direction toward each building center has been calculated, all the features vote for possible building center locations, producing a voting matrix. Building locations can then be determined from where local maxima occur in the matrix; again, to avoid false positives, a threshold is enforced to eliminate weak maxima. This approach differs from (3) and (4) in that it cannot detect the outlines of buildings, instead detecting an approximate center location for each building.
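The two key pieces of this approach, a bank of six oriented Gabor kernels and a voting matrix, might be sketched as follows. The kernel parameters, feature tuples, and grid sizes are illustrative assumptions, not values from (5).

```python
import math

# (a) A 2-D Gabor kernel as a Gaussian modulated by a sinusoid, built for
# six orientations; (b) a voting matrix where each feature votes along its
# estimated direction toward a building center.

def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0):
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)   # rotated coords
            yr = -x * math.sin(theta) + y * math.cos(theta)
            gauss = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(gauss * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

# Six orientations, as in (5): 0, 30, 60, 90, 120, 150 degrees.
bank = [gabor_kernel(7, k * math.pi / 6) for k in range(6)]

def vote(features, shape):
    """Each feature = (row, col, distance, angle); votes toward the center."""
    matrix = [[0] * shape[1] for _ in range(shape[0])]
    for r, c, dist, ang in features:
        vr = int(round(r + dist * math.sin(ang)))
        vc = int(round(c + dist * math.cos(ang)))
        if 0 <= vr < shape[0] and 0 <= vc < shape[1]:
            matrix[vr][vc] += 1
    return matrix

# Three features on different sides of the same building all vote for (5, 5).
feats = [(5, 0, 5, 0.0), (0, 5, 5, math.pi / 2), (5, 10, 5, math.pi)]
votes = vote(feats, (11, 11))
print(votes[5][5])  # 3
```

The accumulated peak at (5, 5) is exactly the kind of local maximum the voting matrix is thresholded for when declaring a building center.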
The final approach we will examine is presented in (6). This is a model-based building detection algorithm: it uses a prior-based variational framework to extract multiple buildings from a single image, which means the approach tries to align evolving contours with prior building shapes. Evolving contours are also used in (4). There are multiple prior building shapes, used to represent differently shaped buildings, and they can be scaled to fit differently sized buildings. The approach works by first applying an arbitrary curve to the entire image. It then tries to minimize the energy, similarly to (4), in the bounded region by matching one of the prior building shapes to the region, resulting in one or more regions bounded by a prior shape. The algorithm continues this same process of minimizing energy in a bounded region and matching prior building shapes to contours until it converges and all the energy in the image is minimized. Each final building shape is then assigned the best-fitting prior building shape. Like (3) and (4), this approach extracts building boundaries from the image, but since the resulting boundaries are best-fitted prior shapes, there may be slight variations from the actual boundary of the building.
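As a loose illustration of the final shape-selection idea, the sketch below scales a square prior over a detected binary region and keeps the scale with the best overlap (intersection over union). The actual framework in (6) is variational and evolves contours; this only illustrates fitting a scalable prior to a region, and every value here is hypothetical.

```python
# Sketch of "best fitting prior shape" selection: try a square prior at
# several scales against a binary region mask and keep the best IoU.

def rect_mask(rows, cols, top, left, h, w):
    """Set of (row, col) cells covered by an axis-aligned rectangle."""
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)
            if 0 <= r < rows and 0 <= c < cols}

def iou(a, b):
    """Intersection over union of two cell sets."""
    return len(a & b) / len(a | b)

def best_prior_fit(region, rows, cols, top, left, scales):
    """Return the prior scale whose mask overlaps the region best."""
    return max(scales,
               key=lambda s: iou(region, rect_mask(rows, cols, top, left, s, s)))

# A detected 4x4 blob; candidate square priors of side 2, 4, and 6.
blob = rect_mask(10, 10, 2, 2, 4, 4)
print(best_prior_fit(blob, 10, 10, 2, 2, scales=[2, 4, 6]))  # 4
```

The best-fitting prior reproduces the region only approximately in general, which is the source of the slight boundary deviations noted above.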
After examining all six approaches, we can notice a few common trends and similarities across the techniques. One of the most common is the requirement that training data be introduced into the system before building extraction can be performed; training data is needed in approaches (1), (3), (4), and (6). The extent of the training data changes from approach to approach. For example, both (1) and (3) require training images representing the buildings they will be trying to extract, (4) needs information on where seeding points are located in the image, and (6) just needs general building templates used to approximate building shapes in the image. The use of training data allows these approaches to better identify the objects they are searching for, at the cost of some initial setup and manual data creation. Another problem these algorithms try to solve is the detection of false positives: regions highlighted as buildings when they actually are not. Approaches (1), (3), and (5) all tackle this problem by introducing thresholds into their systems; these thresholds force regions in the image to meet a certain condition, otherwise they are thrown out and not considered buildings. After further analyzing the approaches, I found an interesting relationship: approaches (3), (4), and (6) all have an iteration mechanism as part of their systems, and these three are also the only approaches that actually find and highlight the boundaries of the buildings. In (4) and (6) an energy minimization algorithm finds the building edges, while (3) implements a region growing algorithm.
Having discussed the results of those approaches, let us examine what the other three provide. Approach (1) fails to provide a concrete outline of the buildings; instead it supplies regions that matched specific textures, and since 5×5 pixel patches are used in the comparison, the regions are blocky. Approach (2) states that it can produce outlines of buildings, but its authors fail to fully explain how; I assume they take the filtered images, which highlight and define separate areas, and perform additional analysis to extract the building regions and boundaries. Approach (5) is the last approach that fails to concretely provide building boundaries. It instead provides information about where the expected center of a building is located. If the exact shape of the building is not important, this information can be used to create generic shapes representing buildings in the image. There might also be a way this data about building centers could be used with another approach. As I stated earlier, approach (4) requires a set of training data containing seed points that lie inside building regions in the image. This seems to be exactly what the output of approach (5) provides. A new system could therefore combine approach (5) with approach (4), eliminating the need to manually create the training data and making the system more automatic.
The task of extracting building boundaries from high-resolution satellite images is an easy yet tedious and time-consuming one for humans to perform, and the extracted data is an essential part of many applications used by agencies around the world. If the extraction of building boundaries could be automated, it would speed up the collection of this data and remove the need for a human to manually find and create the boundaries. Throughout this paper I have reviewed and analyzed several different approaches attempting to automatically extract building boundaries from high-resolution images. I believe there are some very promising results from individual approaches, such as those illustrated in (3), (4), and (6). I also believe that a combination of several approaches could lead to a more fully automated system, all but eliminating the need for human interaction. In closing, after performing this research I have a greater understanding of object detection and, in a grander scope, of computer vision as a whole.
Bibliography
1. Computer Aided Generation of Stylized Maps. Adabala, Neeharika, Varma, Manik and Toyama, Kentaro. s.l. : Microsoft Research India, 2007.
2. Urban Feature Characterization Using High-Resolution Satellite Imagery: Texture Analysis Approach. Jeon, So Hee, Kwon, Byung-Doo and Lee, Kiwon. 2007.
3. Automated Building Extraction from High-Resolution Satellite Imagery Using Spectral and Structural Information Based on Artificial Neural Networks. Lari, Zahra and Ebadi, Hamid. 2008.
4. Automatic Building Extraction from High-Resolution Aerial Images Using Active Contour Model. Ahmady, Salman, et al. 2008.
5. Building Detection Using Local Gabor Features in Very High Resolution Satellite Images. Sirmacek, Beril and Unsalan, Cem. 2009.
6. Automatic Model-Based Building Detection from Single Panchromatic High Resolution Images. Karantzalos, Konstantinos and Paragios, Nikos. 2008.