
Multimedia at Work (editor: Wenjun Zeng, University of Missouri, [email protected])

Finding the Needle in the Image Stack: Performance Metrics for Big Data Image Analysis

Kieran Miller and Patricia Morreale, Kean University

On 15 April 2013 at 2:49 p.m. Eastern Standard Time in the United States city of Boston, Massachusetts, two pressure cooker bombs exploded, killing three people and injuring more than 250. The blasts occurred near the finish line of the Boston Marathon, the world's oldest annual marathon. Almost immediately following the explosions, the US Federal Bureau of Investigation (FBI) enlisted the help of the public, including spectators, media, and public and private closed-circuit surveillance systems, in its investigation. Anyone who had taken pictures or video during the race was encouraged to submit them to the FBI for review. On 18 April 2013, three days later, the FBI released photographs and video showing two suspects, identified as suspect 1 and suspect 2. The suspects were also referred to as "black hat" and "white hat" because of the color of the baseball caps they were wearing in the footage. Although an official account detailing the FBI's operational analysis has yet to be released, computer software probably played an important role in identifying the suspects among the many thousands of public image and video submissions.

How did the FBI comb through likely terabytes of data and close in on a pair of suspects? Did it use software to identify and filter out individuals near the finish line with backpacks of sufficient size to hold the explosives? Once the suspects were identified, how did the FBI use those images to analyze the submissions from the public and locate additional images or video frames that contained the suspects?

This article explores these questions to determine whether it is possible to do similar analysis on a smaller scale. In the interest of transferring our results to other applications, we also look at how visual data, which has become ubiquitous in all aspects of modern society, can be used and analyzed on a large scale.

State of Image Analysis

Fueled by the growth in camera-enabled cell phones and the commoditization of computer storage such as hard drives and other media, the volume of images and video being consumed and stored is enormous. Cisco, a major networking equipment manufacturer, projects that consumer Internet video traffic will be 69 percent of all consumer Internet traffic in 2017, up from 57 percent in 2012.¹ This trend is expected to continue in the foreseeable future as the adoption and use of the Internet and mobile devices increase.

Consumers are a significant driver of the growth in image and video use and storage, but they are not alone. Private companies and governments around the world rely on images and video as investigative, monitoring, and forensic tools to search for patterns or characteristics and to document and identify individuals. This type of image analysis is done both in real time, as the image or video is being captured, and in a post-processing investigative setting.

Regardless of the method of capture and when the media is processed, images and videos require large amounts of space for storage and computing power for analysis. Consider that, with a moderate compression algorithm, an average minute of video recorded on an iPhone at 640 × 480 resolution is approximately 40 Mbytes. In a metropolitan area with a high population density, a presumably large number of people recording images and video, and public safety and private closed-circuit television (CCTV) monitoring systems, the computing requirements necessary to store and analyze all the images and video captured in one area in a short period of time quickly balloon.


Analyzing image and video data from multiple devices and sources is a daunting task, and it highlights the need for computer software and algorithms that can turn raw bits into meaningful, actionable information for law enforcement. Without them, combing through terabytes of data would be an exercise in futility. The explosives detonated at the Boston Marathon in 2013 provide an excellent case study, albeit a chilling one, of the role that images, video, and analytical software can play in an investigation. By identifying the performance measurements involved in working with and analyzing images on a consumer-grade machine, it is possible to establish the extent to which an individual, using open source software, can identify or track people in images or video frames.

Analysis Environment

To begin a performance analysis of video and image data, performance benchmarks must be established. Factors that affect system and software performance include the operating system and its version, CPU clock rate, number of cores, type of hard drive, and, if applicable, hard drive rotations per minute (RPM). Other factors include the programming language, framework, and language runtime used.

For consistency, a single machine was used for all of the analysis we describe here. Specifically, we used the 64-bit Windows 7 Service Pack 1 operating system, Microsoft's .NET Framework runtime 4.0, and the C# programming language.
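Because these factors determine how comparable two benchmark runs are, it can help to log them with each run. The following small sketch is our own addition, not part of the study, and uses only standard .NET 4.0 APIs:

```csharp
using System;

class EnvironmentReport
{
    static void Main()
    {
        // Record the factors that affect benchmark results so that
        // runs on different machines can be compared fairly.
        Console.WriteLine("OS version:    " + Environment.OSVersion);
        Console.WriteLine("64-bit OS:     " + Environment.Is64BitOperatingSystem);
        Console.WriteLine("Logical cores: " + Environment.ProcessorCount);
        Console.WriteLine("CLR version:   " + Environment.Version);
    }
}
```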

Benchmarking Background

Before we analyze performance metrics, we must review two important points.

First, in the programming language and environment used to evaluate performance, it is necessary to identify what represents a unit of time and how elapsed time is calculated. In Microsoft's .NET Framework, the environment used in this research, the smallest unit of measurement is a tick, expressed as a property in the System.DateTime class. One tick represents "one hundred nanoseconds or one ten-millionth of a second."² We measure elapsed ticks with the Stopwatch class defined in the System.Diagnostics namespace.³
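As a minimal sketch of how such a measurement can be taken (the helper and the workload below are our own illustration, not code from the study), the Stopwatch class reports elapsed time that can be read back in 100-nanosecond ticks:

```csharp
using System;
using System.Diagnostics;

class TickTimingDemo
{
    // Runs an action and returns the elapsed time in ticks,
    // where one tick is 100 ns (one ten-millionth of a second).
    static long MeasureTicks(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed.Ticks; // TimeSpan ticks are 100-ns units
    }

    static void Main()
    {
        long ticks = MeasureTicks(() => System.Threading.Thread.Sleep(10));
        Console.WriteLine("Elapsed: {0} ticks ({1} ms)", ticks, ticks / 10000.0);
    }
}
```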

Second, when conducting performance tests involving images, it is important that the dataset of images be as random as possible. The Internet is ideal for this, in particular the wealth of images available on the Wikimedia.org family of websites. Conveniently, there is a specific URL that will display a random image from one of those sites: http://commons.wikimedia.org/wiki/Special:Random/File.

For the sample images used in this research, we downloaded images for later analysis using this URL. This process was repeated until we retrieved approximately 2,900 images, consuming 2.9 Gbytes of space.
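The article does not show its download code. A minimal sketch of one way to build such a dataset follows; note that Special:Random/File returns a file description page rather than the raw image, so the image URL must be scraped from the HTML. The regular expression and file-naming scheme here are our assumptions about that markup:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class RandomImageDownloader
{
    const string RandomFileUrl =
        "http://commons.wikimedia.org/wiki/Special:Random/File";

    static void Main()
    {
        Directory.CreateDirectory("images");
        using (var client = new WebClient())
        {
            for (int i = 0; i < 2900; i++)
            {
                // Each request redirects to a random file description page.
                string html = client.DownloadString(RandomFileUrl);

                // Scrape the raw image URL from the page (assumed markup).
                Match m = Regex.Match(html,
                    "fullImageLink[^>]*>\\s*<a href=\"([^\"]+)\"");
                if (!m.Success) continue;

                string imageUrl = m.Groups[1].Value;
                if (imageUrl.StartsWith("//"))   // protocol-relative link
                    imageUrl = "http:" + imageUrl;

                string path = Path.Combine("images",
                    i + Path.GetExtension(imageUrl));
                client.DownloadFile(imageUrl, path);
            }
        }
    }
}
```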

Benchmark Image Tests

As part of this research, it was important to establish a set of tests to provide benchmarks for how much time a machine takes to perform simple tasks as well as the time needed for image processing.

The analysis began by assessing the performance of a basic programming construct: the for-loop. This is particularly relevant to images because, in a simplified manner, an image can be thought of as a two-dimensional array of pixels and a video as a sequence of many images.

A for-loop is concerned with iteration, repeating a section of code a specified number of times. The methodology used to measure its performance is straightforward: with varying values for the upper bound, how long does it take to run the for-loop to completion? The logical values for the upper bound are increasingly large powers of 2, starting with 2¹ and going up to 2²⁴ (16.78 million). Table 1 summarizes the results. The highest count, 2²⁴, took an average of 4.7325 milliseconds over 100 runs to execute the loop.

Table 1. Time to execute the for-loop for various iteration counts.

Number of iterations    Average ticks    Max ticks    Min ticks
2                       2.74             273.00       0.00
4                       0.06             1.00         0.00
8                       0.22             1.00         0.00
16                      0.07             1.00         0.00
32                      0.24             1.00         0.00
64                      0.33             1.00         0.00
128                     0.51             1.00         0.00
256                     0.86             2.00         0.00
512                     1.56             2.00         1.00
1,024                   2.95             4.00         2.00
2,048                   5.78             9.00         5.00
4,096                   11.93            34.00        11.00
8,192                   22.73            28.00        22.00
16,384                  46.95            93.00        45.00
32,768                  91.01            145.00       90.00
65,536                  182.51           228.00       180.00
131,072                 363.88           422.00       360.00
262,144                 742.15           829.00       721.00
524,288                 1,456.75         1,533.00     1,443.00
1,048,576               2,923.30         3,146.00     2,886.00
2,097,152               5,941.81         6,345.00     5,773.00
4,194,304               11,839.82        12,634.00    11,546.00
8,388,608               23,618.03        24,562.00    23,142.00
16,777,216              47,325.69        49,108.00    46,385.00
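The benchmark harness itself is not included in the article; a sketch in the spirit of this methodology (the structure and names are our own) might look like the following:

```csharp
using System;
using System.Diagnostics;

class ForLoopBenchmark
{
    static void Main()
    {
        const int runs = 100;
        for (int exp = 1; exp <= 24; exp++)
        {
            long upperBound = 1L << exp; // 2^exp iterations
            long total = 0, max = long.MinValue, min = long.MaxValue;

            for (int run = 0; run < runs; run++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                for (long i = 0; i < upperBound; i++)
                {
                    // Empty body: only the loop overhead is timed.
                }
                sw.Stop();

                long ticks = sw.Elapsed.Ticks; // 100-ns units
                total += ticks;
                if (ticks > max) max = ticks;
                if (ticks < min) min = ticks;
            }

            Console.WriteLine("{0,10}: avg {1,10:F2} ticks, max {2}, min {3}",
                upperBound, total / (double)runs, max, min);
        }
    }
}
```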

Having established benchmark measures of the performance of the standard for-loop, we next move on to benchmarks involving images. With a random sample of images retrieved from the Internet as a dataset, we were able to establish a benchmark of how long it takes to load an image into memory. For this test, 1,000 images were selected at random from the sample set and loaded into memory one at a time, and that pass was repeated 10 times. The end result is data measuring the time it takes to load 10,000 images.

With an average image size of 1.49 Mbytes, it took an average of 167,122.99 ticks, or 16.71 ms, to load each image.
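A sketch of such a measurement is shown below. This is our own construction, not the study's harness, and System.Drawing's Bitmap is only one way to load an image into memory:

```csharp
using System;
using System.Diagnostics;
using System.Drawing;
using System.IO;

class ImageLoadBenchmark
{
    static void Main()
    {
        // Files from the randomly downloaded sample set.
        string[] files = Directory.GetFiles("images");
        var rng = new Random();
        long totalTicks = 0;
        const int passes = 10, perPass = 1000;

        for (int i = 0; i < passes * perPass; i++)
        {
            string file = files[rng.Next(files.Length)];
            Stopwatch sw = Stopwatch.StartNew();
            using (var bmp = new Bitmap(file))
            {
                int width = bmp.Width; // touch the bitmap so it is decoded
            }
            sw.Stop();
            totalTicks += sw.Elapsed.Ticks;
        }

        double avg = totalTicks / (double)(passes * perPass);
        Console.WriteLine("Average load: {0:F2} ticks ({1:F2} ms)",
            avg, avg / 10000.0);
    }
}
```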

Finding Pedestrians in Images

With benchmarks identified and established for image processing in its most basic form, the objective is to illustrate the extent to which it is possible to identify or track a pedestrian from a known sample set of images.

The mathematics involved in detecting people in an image is complex and beyond the scope of this article, but it is nevertheless explored here using the open source toolkit Emgu CV, a .NET wrapper for the OpenCV project. OpenCV was developed by Intel beginning in 1999, and its first public release was made at the IEEE Conference on Computer Vision and Pattern Recognition in 2000.

To obtain images for analysis, we recorded video of an individual wearing two different colored hats at five different locations on the Kean University campus in Union, New Jersey, at various distances and angles. The data was recorded with a Samsung Galaxy S3 phone. Each section of video was categorized by its location and direction and programmatically split into image frames, which we then analyzed.
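The frame-splitting code is not shown in the article; a minimal sketch using the Emgu CV 2.x Capture class might look like this, where the file name and the one-frame-per-second sampling rate are our assumptions:

```csharp
using System;
using Emgu.CV;
using Emgu.CV.Structure;

class FrameSplitter
{
    static void Main()
    {
        // Hypothetical file name; one clip per location and direction.
        using (var capture = new Capture("location1_leftToRight.mp4"))
        {
            double fps = capture.GetCaptureProperty(
                Emgu.CV.CvEnum.CAP_PROP.CV_CAP_PROP_FPS);
            int frameIndex = 0, saved = 0;

            Image<Bgr, byte> frame;
            while ((frame = capture.QueryFrame()) != null)
            {
                // Keep roughly one frame per second of video.
                if (frameIndex % (int)Math.Round(fps) == 0)
                {
                    frame.Save(string.Format("frame_{0:D4}.jpg", saved++));
                }
                frameIndex++;
            }
        }
    }
}
```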

OpenCV returns rectangular regions that represent sections of an image that its algorithms determine may contain pedestrians.
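The article does not name the specific detector it called. OpenCV's standard pedestrian detector is a histogram-of-oriented-gradients (HOG) descriptor paired with a default people-detecting linear SVM, and a sketch of invoking it through the Emgu CV 2.x API (assuming that version's method signatures) looks like this:

```csharp
using System.Drawing;
using Emgu.CV;
using Emgu.CV.Structure;

class PedestrianDetector
{
    static void Main()
    {
        using (var hog = new HOGDescriptor())
        using (var image = new Image<Bgr, byte>("frame_0001.jpg"))
        {
            // Load OpenCV's built-in people-detecting linear SVM.
            hog.SetSVMDetector(HOGDescriptor.GetDefaultPeopleDetector());

            // Each rectangle is a region the algorithm believes
            // may contain a pedestrian.
            Rectangle[] regions = hog.DetectMultiScale(image);

            // Draw a red box around each located region, as the
            // article describes.
            foreach (Rectangle region in regions)
            {
                image.Draw(region, new Bgr(Color.Red), 2);
            }

            image.Save("frame_0001_detected.jpg");
        }
    }
}
```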

Here we show examples of the analysis of one of the video frames analyzed at Kean (see Figure 1) as well as one from a photo of a busy New York City street scene (see Figure 2) to show how this looks in practice. Red rectangular boxes were drawn programmatically on the images after analysis to illustrate the sections the software located.

Figure 1. Locating a pedestrian in an image: (a) before and (b) after analysis. A red box is drawn on the image after analysis to illustrate the located sections.

Figure 2. Locating pedestrians on a busy New York City street: (a) before and (b) after analysis.

Table 2 lists the video samples taken from five locations across Kean's campus with an approximation of distance.

Table 2. Locations of video capture on Kean University's campus.

Location ID    Direction of pedestrian    Building or location                        Approx. distance (yards)
1              Left to right              Outside Vaughn-Eames                        15
1              Right to left              Outside Vaughn-Eames                        15
2              Left to right              Between Vaughn-Eames and Wilkins Theatre    50
3              Left to right              Front of Wilkins Theatre from Bridge        10
3              Right to left              Front of Wilkins Theatre from Bridge        10
4              Left to right              Front of Nancy Thompson Library             5–15
5              Left to right              Outside Hennings Hall                       5–20
5              Right to left              Outside Hennings Hall                       5–20

After the video was captured, it was split into frames at approximately one frame per second. This resulted in a total of 517 image frames across the various locations. We performed analysis on each and manually evaluated the results using two primary criteria:

- Was the pedestrian correctly identified in the image?
- Did the identified rectangular region surround the entire person, or was the person cut off?

The overall success rate was modest, with roughly 58 percent of frames correctly identifying the pedestrian, as Figure 3 shows.

Figure 3. Overall success rate for all locations. The pedestrian was successfully identified in 300 frames and missed or incorrectly identified in 217, a success rate of roughly 58 percent.


Figure 4 displays the findings from the tests across each of these locations on campus. The graph shows, for each location and hat color, the relationship between the total number of frames and how successful each was on the two evaluation criteria.

Figure 4. Frame analysis success rate by location. For each location and direction, the graph plots the total number of frames, the frames in which the pedestrian was successfully identified, and the frames in which the whole person was enclosed, separately for the black and white hats. Location 2 performed exceptionally poorly, likely because the pedestrian was roughly 50 yards from the camera, which resulted in a smaller number of pixels.

Figure 4 clearly shows that the color of the hat and the direction the pedestrian was moving (right to left versus left to right) had little impact on the relative success rates. Location 2 stands out as performing exceptionally poorly, which can be attributed in part to the distance of the pedestrian from the camera, roughly 50 yards. This makes sense because it is presumably more difficult for OpenCV's algorithms to detect the features that identify a set of pixels as representing a human when the person is made up of a smaller number of pixels.

Locations 4 and 5 had the best success rates, likely due to a combination of two factors: short distances and the orientation of the pedestrian relative to the camera. Locations 1 through 3 consisted of an individual walking left to right or right to left, perpendicular to the camera, so that they were viewed from the side. Location 4 consisted of the pedestrian walking toward the camera at an angle of approximately 20 degrees, while location 5 consisted of the pedestrian walking toward and away from the camera at an angle of approximately 60 degrees. As a result, more of the frontal profile of the person was viewable in locations 4 and 5. It is logical that success rates improve when more of the person's defining characteristics, such as two arms and two legs extending from the body, are visible from the front. These conclusions are speculative, and further testing and analysis of the inner workings of OpenCV's algorithms are necessary to quantify the impact of the viewing angle on their ability to identify a pedestrian.

Applications

A broad range of industries could apply this technology and analysis. Here we review example scenarios in two industries.

Health Care/Hospital Security

Modern hospitals typically employ radio wristbands or some other sort of radio-frequency device to restrict patients to certain areas and prevent them from entering unauthorized areas. This is particularly important in mental health facilities. Although radio devices are effective, imaging software could be used as a secondary level of security for cases in which radio devices are lost or stolen. Such a device can be deactivated if it is known to be misplaced, but the system should also function in the event a device is unknowingly lost or stolen. For example, sensitive areas of the hospital could be equipped with cameras and software that, in addition to verifying access through a radio device, could scan each person and look for an ID badge that is required to be prominently displayed.

This assumes that OpenCV would have similar levels of success identifying people wearing hospital attire. Without clearly visible legs, however, the software may be less accurate.

Public Transportation

Public transportation is one of the most practical settings for this type of technology when applied to law enforcement and crime/terrorism investigation and prevention. Software similar to what we have discussed here, albeit more sophisticated, could be used to sift through gigabytes of images from CCTV cameras to identify a suspect with known characteristics.

The technology could also be used in a more preventive manner, such as looking for people with backpacks large enough to carry an explosive device. It could also be used to identify unusual or suspicious patterns. For example, if the software could hook into the train signaling system, it would know when and which trains arrived and departed on a particular track.


If an individual remains on a platform for a lengthy period of time after the trains arrived and departed, the software could flag the suspicious behavior. This would require the software to be able to track an individual pedestrian and not just identify a random person.

Conclusion

At relatively short distances, the results are impressive. Overall, the software was able to identify a pedestrian in 300 of the 517 frames. At location 5, at a distance of between five and 15 yards, the success rate was greater than 90 percent. This is comparable to the short-distance image captures of the closed-circuit cameras located in and around public spaces in the US as well as elsewhere in the world.

The results that can be obtained using a consumer-grade machine and open source software are promising, which opens up additional avenues for other applications. Desktop image processing could be used regularly by law enforcement, for instance. With the resources available to the federal government, including thousands of industry- and research-level computing machines, the FBI would be able to perform this type of analysis against terabytes of data.

References

1. "Cisco Visual Networking Index: Forecast and Methodology, 2012–2017," Cisco Systems, 29 May 2013; www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.html.

2. "System.DateTime.Ticks Property," Microsoft Developer Network, .NET Framework 4.0 documentation; http://msdn.microsoft.com/en-us/library/system.datetime.ticks(v=vs.100).aspx.

3. "System.Diagnostics.Stopwatch Class," Microsoft Developer Network, .NET Framework 4.0 documentation; http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch(v=vs.100).aspx.

Kieran Miller is a research student in the Department of Computer Science at Kean University. Contact him at [email protected].

Patricia Morreale is an associate professor in the Department of Computer Science at Kean University. Contact her at [email protected].
