A Survey of Methods to Extract Buildings from High-Resolution Satellite Images
Ryan Friese
CS510
There are many things in this world that are simply amazing. Some are crafted by nature, such as the mighty Grand Canyon, the lush rain forests of the tropics, and even a tiny flower on a high mountaintop sending its roots down through cracks in the rock. Others are man-made; the Great Pyramids of Egypt, the Taj Mahal, and the Sistine Chapel all come to mind. While all these things are amazing in their own right, something even more amazing may be going on behind the scenes. I am of course talking about the power of human vision, the ability that lets us easily comprehend patterns, colors, and shapes. It was not until I started looking into the field of computer vision that I realized that what allows humans to comprehend images and scenes with little thought is not as trivial or as simple as it may seem.
Computer vision is a broad area of study, with research happening in many areas such as face detection, video tracking, object recognition, scene restoration, and countless others. Each of these areas can be further refined into more specific sub-areas. In this paper I will discuss a specific sub-area of object recognition. First, though, let us look at exactly what object recognition is. In its most basic form, object recognition is exactly what it sounds like: the ability to recognize objects. We as humans perform it quickly and efficiently, even when objects are seen from different viewpoints, at different sizes, under different lighting, and even when they are partially obscured from view. Despite being so easy for humans, object recognition remains a challenge for computer vision systems. Sub-areas within object recognition range from reading text off a printed circuit board to finding a human face in a picture, and there are many applications. In this paper I address the problem of extracting buildings from high-resolution satellite images.
Extracting buildings from high-resolution satellite images is an open research topic in the field of photogrammetry. Many applications require accurate information about the data contained in satellite images; this information is needed by land management agencies, urban development planners, the U.S. military, and many others. Several problems complicate building extraction, including different viewing angles, surrounding vegetation, shadows, and other objects that obscure the edges of the buildings being detected. Many strategies have been used to solve this building detection problem. In this paper I give a brief survey of several different techniques, compare their similarities and differences, and discuss their results. I will examine five different solutions: texture classification and analysis, used by (1) and (2); neural networks, used by (3); active contour models, used by (4); local Gabor features, used by (5); and model-based detection, used by (6).
The goal of all these solutions is to extract vector data representing buildings, which can then be used by the various applications that consume such data. These solutions aim to generate the vector data automatically, with as little human editing as possible. Let us first look at what each technique involves, and then begin comparing and analyzing the approaches. We will start with the texture classification and analysis described in (1) and (2). The idea behind a textural analysis approach is to use one or more specific texture patterns, which are compared against the satellite image. The algorithm looks for areas of the image that are similar to the texture, allowing the image to be partitioned into specific areas such as buildings, vegetation, and water. To separate these areas from each other, there must be a way of measuring similarity between the texture and the image. There are various ways to measure similarity, as seen in the approaches of (1) and (2). While both approaches identify buildings using textures, they perform this measurement differently. In the approach used by (1), each specific area a texture represents is considered a class; for example, there could be classes for building textures, vegetation textures, water textures, and so on. These classes are modeled as probability distributions over a “texton” library. The textons are essentially the averaged combination of all the image patches in a given class, which are supplied from a training set.
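To make this kind of texton classification concrete, here is a minimal sketch of nearest-neighbor matching with the χ² statistic over class histograms, with a rejection threshold. The class models, values, and threshold are all hypothetical illustrations, not values from (1).

```python
# Sketch of texton-style classification: classes modeled as normalized
# histograms, matched with a chi-squared nearest-neighbor rule and a
# rejection threshold. All numbers here are made up for illustration.

def chi_squared(h1, h2, eps=1e-10):
    """0.5 * sum((a-b)^2 / (a+b)) -- a standard histogram distance."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def classify(patch_hist, class_hists, threshold):
    """Return the nearest class name, or None if no class is close enough."""
    best_name, best_dist = None, float("inf")
    for name, hist in class_hists.items():
        d = chi_squared(patch_hist, hist)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist < threshold else None

# Toy class models: normalized texton-response histograms per class.
classes = {
    "building":   [0.70, 0.20, 0.10],
    "vegetation": [0.10, 0.30, 0.60],
    "water":      [0.05, 0.85, 0.10],
}

print(classify([0.65, 0.25, 0.10], classes, threshold=0.2))   # building
print(classify([0.33, 0.33, 0.34], classes, threshold=0.05))  # None (rejected)
```

The rejection case mirrors the learned threshold in (1): a patch that is not close enough to any class is simply not labeled.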
Training sets are also used in approaches (3) and (4). The textons are then compared to 5×5 image patches in the satellite image, allowing each region of the image to be classified with a nearest-neighbor classifier using the χ² statistic. A region is considered similar only if its distance to some class is less than a learned threshold; otherwise the region is rejected. This pattern of training and threshold testing is used in (3) and (5) as well. The approach in (2) uses the GLCM (Grey Level Co-occurrence Matrix) and GLDV (Grey Level Difference Vector) to perform the similarity measure. A GLCM texture considers the relation between two neighboring pixels, a second-order texture. The co-occurrence matrix is formed from the target image by running a kernel mask over the image; the mask can be 3×3, 5×5, 7×7, and so on. The neighboring pixels can lie in one or more of eight defined directions, typically 0°, ±45°, ±90°, ±135°, and 180°. The use of multi-directional masks is also seen in (5). Because the masks are directed, the GLCM texture depends on both directionality and kernel size. The values in the GLCM texture can be interpreted using several known measures: the angular second moment, which measures the extent of pixel orderliness; contrast, which measures how many elements lie off the main diagonal of the GLCM; dissimilarity, which measures how different the elements are from each other; entropy, which measures randomness; energy, which measures the extent of pixel-pair repetitions; and homogeneity, which measures the uniformity of the co-occurrence matrix. Homogeneity is also an important measurement in (4). The GLDV measure is the sum of the diagonals of the GLCM. When the GLCM and GLDV textures are transformed back into image space, the authors of (2) state that both textures support the detection of shadow zones, the classification of building types, and the recognition of pavement. Comparing (1) and (2), the use of textures is integral to both algorithms, yet the two are vastly different. (1) creates an averaged texture from a training image and compares it to the test image; it selects areas that fall under a given similarity threshold, but it does not outline actual building boundaries, instead giving a general area where a building is. (2) uses a kernel mask to turn the entire image into a texture that highlights various zones in the image; this does not in and of itself outline building boundaries, but it does make the edges more apparent.
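To illustrate the GLCM idea, the following sketch builds a co-occurrence matrix for a single direction (horizontal neighbors) and computes two of the measures mentioned above, contrast and homogeneity. Real systems combine several directions and kernel sizes; the toy patches here are my own, not data from (2).

```python
# Sketch of a grey-level co-occurrence matrix (GLCM) for one direction
# (0 degrees, i.e. the horizontal neighbor), plus two standard measures.

def glcm(image, levels):
    """Normalized co-occurrence counts for horizontally adjacent pixels."""
    m = [[0.0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for a, b in zip(row, row[1:]):
            m[a][b] += 1
            total += 1
    return [[v / total for v in row] for row in m]

def contrast(m):
    """Weights elements by squared distance from the main diagonal."""
    n = len(m)
    return sum(m[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

def homogeneity(m):
    """High when mass concentrates on or near the main diagonal."""
    n = len(m)
    return sum(m[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))

flat  = [[1, 1, 1, 1]] * 4   # a uniform, rooftop-like patch
noisy = [[0, 3, 0, 3]] * 4   # an alternating, high-contrast patch

assert contrast(glcm(flat, 4)) == 0.0                       # perfectly ordered
assert contrast(glcm(noisy, 4)) > contrast(glcm(flat, 4))   # off-diagonal mass
assert homogeneity(glcm(flat, 4)) > homogeneity(glcm(noisy, 4))
```

The uniform patch puts all its mass on the GLCM's main diagonal, which is exactly why homogeneity is high and contrast is zero for building-like rooftop regions.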
Another viable approach to building extraction is the use of artificial neural networks, as proposed by (3). Like (1) and (4), this approach must first be trained on a set of training images before it can perform the extraction. It begins by segmenting the image with a seeded region growing algorithm: seed points are evenly distributed across the entire image, each seed point’s value is compared to its neighboring pixels, and neighbors are added to the region if the difference is below a set threshold. The region growing algorithm continues recursively with the newly added pixels until no more neighboring pixels fall below the threshold. This algorithm finds the homogeneous roof regions in the image. After region growing, the approach uses many features to classify the regions of the image, including area, perimeter, mean color and intensity, roundness, compactness, and structural features. The values of these features help determine whether a region is a building. For example, if the area of a region is more than 10,000 pixels, that region is assumed not to be a building. The roundness feature can help identify buildings by calculating the ratio of a region’s area to the square of its perimeter and discarding any region with a low ratio or a high roundness; a high roundness means the object could be something like a tree.
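As a rough illustration of the area and shape tests, the sketch below uses the common compactness formulation 4π·area/perimeter², which equals 1.0 for a perfect circle; the exact formula and thresholds used in (3) may differ, and the cutoff values here are my own assumptions.

```python
import math

# Sketch of region shape filtering: reject regions that are too large or
# too circular (tree-crown-like). The 0.9 compactness cutoff is hypothetical.

def compactness(area, perimeter):
    """4*pi*A/P^2: exactly 1.0 for a circle, lower for elongated shapes."""
    return 4 * math.pi * area / perimeter ** 2

def looks_like_building(area, perimeter, max_area=10000, max_compactness=0.9):
    if area > max_area:          # the >10,000-pixel rejection rule from (3)
        return False
    return compactness(area, perimeter) < max_compactness

# A 30x40 rectangular rooftop vs. a near-circular crown of the same area.
rect_area, rect_perim = 30 * 40, 2 * (30 + 40)
r = math.sqrt(1200 / math.pi)
circ_area, circ_perim = 1200, 2 * math.pi * r

print(looks_like_building(rect_area, rect_perim))  # True  (rectangular roof)
print(looks_like_building(circ_area, circ_perim))  # False (too round, tree-like)
```

The rectangle scores roughly 0.77 on compactness and passes, while the circle scores 1.0 and is discarded, mirroring the tree-rejection behavior described above.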
Since individual buildings normally maintain the same color and intensity across their rooftops, the mean color and intensity features measure variations in color, allowing detection of continuous building areas. The features for a given region are then fed as input to the neural network; if the network’s output exceeds a specified threshold, the region is recorded as a building. Thresholds are also used in approaches (1) and (5). This approach is capable of finding and outlining the boundaries of buildings.
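The seeded region growing step described above might be sketched as follows. The grid values, the threshold, and the choice of 4-connectivity are illustrative assumptions, not details from (3).

```python
# Sketch of seeded region growing: starting from a seed pixel, absorb
# 4-connected neighbors whose intensity difference from the seed value
# is below a threshold, continuing with newly added pixels.

def grow_region(image, seed, threshold):
    rows, cols = len(image), len(image[0])
    base = image[seed[0]][seed[1]]          # compare against the seed's value
    region, stack = set(), [seed]
    while stack:
        r, c = stack.pop()
        if (r, c) in region:
            continue
        region.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - base) < threshold):
                stack.append((nr, nc))
    return region

roof = [
    [200, 201, 199,  50],
    [198, 200, 202,  55],
    [ 60,  58, 201,  52],
]
region = grow_region(roof, seed=(0, 0), threshold=10)
print(len(region))  # 7 -- the seven bright, connected "roof" pixels
```

Note how the dark background pixels never enter the region even though they touch it: this is what makes the grown regions correspond to homogeneous rooftops.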
We will now examine the approach presented in (4), which uses an active contour model to perform the building extraction. An active contour model, also known as a snake, is a model used to delineate object outlines in images; it accomplishes this by minimizing the energy associated with the current contour. This approach uses a homogeneity and similarity measure similar to that of (2): it segments the image into regions so that the pixels inside each region have maximal homogeneity and similarity. The model can extract objects without obvious edges and is not sensitive to noise. It finds building boundaries by determining whether a given contour curve is inside or outside a building region, iterating until the curve reaches the minimal energy of the region, which indicates it has hit the boundary. This is similar to the way approach (6) operates. The actual building extraction runs through four steps. The first step runs the image through a Gaussian smoothing filter; approach (5) must also apply a preprocessing filter to its images. The next step takes the smoothed image and, as with (1) and (3), introduces training data into the system; this training data contains points that lie inside the boundaries of the buildings. The third step applies the active contour model, and the building boundaries are extracted. The fourth and final step is an accuracy assessment. This approach, along with (3), is capable of finding and outlining actual building boundaries.
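As a toy illustration of the energy minimization principle, consider a one-dimensional intensity profile: the boundary that minimizes the combined within-region squared deviation falls exactly at the intensity step. This is only an analogue of the homogeneity-driven energy, not the actual contour model in (4).

```python
# One-dimensional analogue of region-homogeneity energy minimization:
# pick the boundary position that makes both sides maximally homogeneous.

def region_energy(values):
    """Sum of squared deviations from the region mean (low = homogeneous)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_boundary(profile):
    """Split position minimizing inside-energy + outside-energy."""
    best_k, best_e = None, float("inf")
    for k in range(1, len(profile)):
        e = region_energy(profile[:k]) + region_energy(profile[k:])
        if e < best_e:
            best_k, best_e = k, e
    return best_k

# Bright rooftop pixels followed by darker background: the minimum-energy
# boundary lands exactly at the intensity step.
profile = [210, 208, 212, 209, 90, 95, 92]
print(best_boundary(profile))  # 4
```

In two dimensions, the evolving contour plays the role of the split position: it moves until the regions it separates can no longer become more homogeneous.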
Moving on to our next approach for building extraction, we are presented with (5). This approach uses Gabor filters and spatial voting to find and extract the centers of buildings in a satellite image. As we will see, I believe this approach could be used in conjunction with (4), but more on that later. The approach uses Gabor filter sets to detect building properties in the image. A two-dimensional Gabor filter is mathematically the product of a two-dimensional Gaussian and a complex exponential function; in other words, it can be thought of as a Gaussian filter modulated by a sinusoidal wave. As in (4), some preprocessing must be done to the image, here in the form of a 5×5 median filter that removes some of the noise which might be present, helping to prevent false detections. There is no guarantee that all the buildings in a given image will be oriented the same way; to account for this, approach (5) generates a set of six Gabor filters covering multiple directions. The same image must therefore be convolved with a Gabor filter six times, once for each filter in the set, resulting in six separate Gabor spaces. Each Gabor space is examined for local maxima, which indicate a building property; to prevent false detections, a threshold is set to eliminate weak maxima. The collected building properties are then used to generate descriptor vectors. For each feature, three descriptors are extracted: the feature location, the possible distance to the center of the building, and the feature’s dominant orientation. These descriptors describe the direction toward the building’s center. The authors of (5) note that this approach only works when buildings are brighter than the background they sit on: when the building is brighter, the second-highest edge lies close to the building center, allowing the direction to the center to be found. If the building is not brighter, the second-highest edge need not be close to the center, and no information about the direction toward the building’s center can be recovered. Once the direction toward each building center has been calculated, all the features vote for possible building center locations, producing a voting matrix. Building locations can then be determined from where local maxima occur in the matrix; again, to avoid false positives, a threshold is enforced to eliminate weak maxima. This approach differs from (3) and (4) in that it cannot detect the outlines of buildings, instead detecting an approximate center location for each building.
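The two key pieces of this approach, a bank of six oriented Gabor kernels and a voting matrix, might be sketched as follows. The kernel parameters, feature tuples, and grid sizes are illustrative assumptions, not values from (5).

```python
import math

# (a) A 2-D Gabor kernel as a Gaussian modulated by a sinusoid, built for
# six orientations; (b) a voting matrix where each feature votes along its
# estimated direction toward a building center.

def gabor_kernel(size, theta, sigma=2.0, wavelength=4.0):
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)   # rotated coords
            yr = -x * math.sin(theta) + y * math.cos(theta)
            gauss = math.exp(-(xr * xr + yr * yr) / (2 * sigma * sigma))
            row.append(gauss * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

# Six orientations, as in (5): 0, 30, 60, 90, 120, 150 degrees.
bank = [gabor_kernel(7, k * math.pi / 6) for k in range(6)]

def vote(features, shape):
    """Each feature = (row, col, distance, angle); votes toward the center."""
    matrix = [[0] * shape[1] for _ in range(shape[0])]
    for r, c, dist, ang in features:
        vr = int(round(r + dist * math.sin(ang)))
        vc = int(round(c + dist * math.cos(ang)))
        if 0 <= vr < shape[0] and 0 <= vc < shape[1]:
            matrix[vr][vc] += 1
    return matrix

# Three features on different sides of the same building all vote for (5, 5).
feats = [(5, 0, 5, 0.0), (0, 5, 5, math.pi / 2), (5, 10, 5, math.pi)]
votes = vote(feats, (11, 11))
print(votes[5][5])  # 3
```

The accumulated peak at (5, 5) is exactly the kind of local maximum the voting matrix is thresholded for when declaring a building center.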
The final approach we will examine is presented in (6). This is a model-based building detection algorithm: it uses a prior-based variational framework to extract multiple buildings from a single image, which means the approach tries to align evolving contours with prior building shapes. Evolving contours are also used in (4). There are multiple prior building shapes, used to represent differently shaped buildings, and they can be scaled to fit differently sized buildings. The approach works by first applying an arbitrary curve to the entire image. It then tries to minimize the energy, similarly to (4), in the bounded region by matching one of the prior building shapes to the region, resulting in one or more regions bounded by a prior shape. The algorithm continues this same process of minimizing energy in a bounded region and matching prior building shapes to contours until it converges and all the energy in the image is minimized. Each final building shape is then assigned the best-fitting prior building shape. Like (3) and (4), this approach extracts building boundaries from the image, but since the resulting boundaries are best-fitted prior shapes, there may be slight variations from the actual boundary of the building.
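As a loose illustration of the final shape-selection idea, the sketch below scales a square prior over a detected binary region and keeps the scale with the best overlap (intersection over union). The actual framework in (6) is variational and evolves contours; this only illustrates fitting a scalable prior to a region, and every value here is hypothetical.

```python
# Sketch of "best fitting prior shape" selection: try a square prior at
# several scales against a binary region mask and keep the best IoU.

def rect_mask(rows, cols, top, left, h, w):
    """Set of (row, col) cells covered by an axis-aligned rectangle."""
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)
            if 0 <= r < rows and 0 <= c < cols}

def iou(a, b):
    """Intersection over union of two cell sets."""
    return len(a & b) / len(a | b)

def best_prior_fit(region, rows, cols, top, left, scales):
    """Return the prior scale whose mask overlaps the region best."""
    return max(scales,
               key=lambda s: iou(region, rect_mask(rows, cols, top, left, s, s)))

# A detected 4x4 blob; candidate square priors of side 2, 4, and 6.
blob = rect_mask(10, 10, 2, 2, 4, 4)
print(best_prior_fit(blob, 10, 10, 2, 2, scales=[2, 4, 6]))  # 4
```

The best-fitting prior reproduces the region only approximately in general, which is the source of the slight boundary deviations noted above.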
After examining all six approaches, we can notice a few common trends and similarities across the techniques. One of the most common is the requirement that training data be introduced into the system before building extraction can be performed; training data is needed in approaches (1), (3), (4), and (6). The extent of the training data changes from approach to approach. For example, both (1) and (3) require training images representing the buildings they will be trying to extract, (4) needs information on where seeding points are located in the image, and (6) just needs general building templates used to approximate building shapes in the image. The use of training data allows these approaches to better identify the objects they are searching for, at the cost of some initial setup and manual data creation. Another problem these algorithms try to solve is the detection of false positives: regions highlighted as buildings when they actually are not. Approaches (1), (3), and (5) all tackle this problem by introducing thresholds into their systems; these thresholds force regions in the image to meet a certain condition, otherwise they are thrown out and not considered buildings. After further analyzing the approaches, I found an interesting relationship: approaches (3), (4), and (6) all have an iteration mechanism as part of their systems, and these three are also the only approaches that actually find and highlight the boundaries of the buildings. In (4) and (6) an energy minimization algorithm finds the building edges, while (3) implements a region growing algorithm.
Having discussed the results of those approaches, let us examine what the other three provide. Approach (1) fails to provide a concrete outline of the buildings; instead it supplies regions that matched specific textures, and since 5×5 pixel patches are used in the comparison, the regions are blocky. Approach (2) states that it can produce outlines of buildings, but its authors fail to fully explain how; I assume they take the filtered images, which highlight and define separate areas, and perform additional analysis to extract the building regions and boundaries. Approach (5) is the last approach that fails to concretely provide building boundaries. It instead provides information about where the expected center of a building is located. If the exact shape of the building is not important, this information can be used to create generic shapes representing buildings in the image. There might also be a way this data about building centers could be used with another approach. As I stated earlier, approach (4) requires a set of training data containing seed points that lie inside building regions in the image. This seems to be exactly what the output of approach (5) provides. A new system could therefore combine approach (5) with approach (4), eliminating the need to manually create the training data and making the system more automatic.
The task of extracting building boundaries from high-resolution satellite images is an easy yet tedious and time-consuming one for humans to perform, and the extracted data is an essential part of many applications used by agencies around the world. If the extraction of building boundaries could be automated, it would speed up the collection of this data and remove the need for a human to manually find and create the boundaries. Throughout this paper I have reviewed and analyzed several different approaches attempting to automatically extract building boundaries from high-resolution images. I believe there are some very promising results from individual approaches, such as those illustrated in (3), (4), and (6). I also believe that a combination of several approaches could lead to a more fully automated system, all but eliminating the need for human interaction. In closing, after performing this research I have a greater understanding of object detection and, in a grander scope, of computer vision as a whole.
Bibliography
1. Computer Aided Generation of Stylized Maps. Adabala, Neeharika, Varma, Manik and Toyama, Kentaro. s.l. : Microsoft Research India, 2007.
2. Urban Feature Characterization Using High-Resolution Satellite Imagery: Texture Analysis Approach. Jeon, So Hee, Kwon, Byung-Doo and Lee, Kiwon. 2007.
3. Automated Building Extraction from High-Resolution Satellite Imagery Using Spectral and Structural Information Based on Artificial Neural Networks. Lari, Zahra and Ebadi, Hamid. 2008.
4. Automatic Building Extraction from High-Resolution Aerial Images Using Active Contour Model. Ahmady, Salman, et al. 2008.
5. Building Detection Using Local Gabor Features in Very High Resolution Satellite Images. Sirmacek, Beril and Unsalan, Cem. 2009.
6. Automatic Model-Based Building Detection from Single Panchromatic High Resolution Images. Karantzalos, Konstantinos and Paragios, Nikos. 2008.