CSC4000W – HONOURS IN COMPUTER SCIENCE
HONOURS PROJECT REPORT
VIRTUAL PANNING FOR LECTURE RECORDING
STITCHING COMPONENT
Author:
Terry Tsen (TSNTER002)
Supervisor:
A/Prof. Patrick Marais
Category | Min | Max | Chosen
1 Requirement Analysis and Design | 0 | 20 | 18
2 Theoretical Analysis | 0 | 25 | 0
3 Experiment Design and Execution | 0 | 20 | 0
4 System Development and Implementation | 0 | 15 | 12
5 Results, Findings and Conclusion | 10 | 20 | 17
6 Aim Formulation and Background Work | 10 | 15 | 13
7 Quality of Report Writing and Presentation | 10 | 10 | 10
8 Adherence to Project Proposal and Quality of Deliverables | 10 | 10 | 10
9 Overall General Project Evaluation | 0 | 10 | 0
Total marks | | 80 | 80
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF CAPE TOWN
2014
Abstract
This report investigates the viability of a virtual panning system for lecture recording, focusing in detail on the stitching component of the system. Many lectures are recorded using a pan-tilt-zoom (PTZ) camera which automatically identifies and tracks the presenter to keep them within the field of view. Cameramen are also used to achieve the same purpose. However, both of these solutions are expensive, and the cameraman solution is additionally prone to human error. The Centre for Innovation in Learning and Teaching at the University of Cape Town has proposed a system which emulates the results of the PTZ camera. The system involves placing fixed cameras in the lecture room that view different parts of the front of the venue. After the lecture is recorded, the videos are post-processed in three stages: stitching, tracking and panning. The videos are stitched into one panoramic video, the lecturer is identified and tracked within the panoramic video, and finally a cropped frame of the panoramic video is taken to produce a video similar to one recorded by a PTZ camera. The stitching component is evaluated in terms of its execution time and stitch quality. Although the execution time is not ideal, the proposed system is a cheaper alternative to the PTZ camera and cameraman solutions.
ACKNOWLEDGEMENTS
I would like to thank A/Prof. Patrick Marais of the Department of Computer Science at the University of Cape Town for his input and guidance on this project, drawing on his extensive experience in computer vision.
Secondly, I would like to acknowledge Stephen Marquard and his team at the Centre for
Innovation in Learning and Teaching for providing system specifications as well as lecture
video recordings on which we tested our system.
Thirdly, I would like to thank my team members Jared Norman and Chris Pocock for
providing motivation to keep us going through the toughest of times.
Lastly, I would like to thank my parents for allowing me the opportunity to study at the
University of Cape Town which allowed me to be the best that I can be.
Table of Contents

List of Figures
1 Introduction
  1.1 Virtual Panning for Lecture Recording
  1.2 Stitching
    1.2.1 Image Stitching
    1.2.2 Video Stitching
  1.3 The Stitching Pipeline
  1.4 Aim
  1.5 Report Overview
2 Background
  2.1 Applications of Image/Video Stitching
  2.2 Overview of Stitching Algorithms
  2.3 Overview of the Stitching Pipeline
  2.4 Feature Detection
  2.5 Feature Matching
  2.6 Homography Estimation
  2.7 Bundle Adjustment
  2.8 Image Warping
  2.9 Gain Compensation
  2.10 Blending
  2.11 Evaluation of Visual Quality of Stitched Images
  2.12 Summary
3 Design
  3.1 System Overview
    3.1.1 System Constraints
    3.1.2 Functional Requirements
    3.1.3 Non-functional Requirements
    3.1.4 Assumptions
    3.1.5 Key Success Factors
    3.1.6 Input/output of the Components in the System
  3.2 Stitching Component
    3.2.1 Aim
    3.2.2 Stages of the Stitching Pipeline
    3.2.3 First Design
    3.2.4 Second Design
    3.2.5 Third Design
    3.2.6 Fourth Design
4 Implementation
  4.1 First Iteration
  4.2 Second Iteration
  4.3 Third Iteration
  4.4 Fourth Iteration
  4.5 Other Related Works
5 Results
  5.1 Stitching Pipeline Default Settings
    5.1.1 Warp Type
  5.2 Execution Time
    5.2.1 Re-execution of the pipeline disabled
    5.2.2 Re-execution of the pipeline enabled
  5.3 Stitch Quality
    5.3.1 Blend Type
    5.3.2 Stitch Quality Results
6 Conclusion
7 Future Work
References
List of Figures

FIGURE 1: Feature detection performed on the left side of a lecture room.
FIGURE 2: Feature detection performed on the right side of a lecture room.
FIGURE 3: Illustration of keypoint descriptors.
FIGURE 4: Feature matching performed on a lecture room.
FIGURE 5: A panoramic image without gain compensation.
FIGURE 6: A panoramic image with gain compensation.
FIGURE 7: Illustration of a high-frequency band-pass component.
FIGURE 8: Illustration of a medium-frequency band-pass component.
FIGURE 9: Illustration of a low-frequency band-pass component.
FIGURE 10: Illustration of a multi-band blended image.
FIGURE 11: A panoramic image with gain compensation and multi-band blending.
FIGURE 12: OpenCV logo.
FIGURE 13: A bird’s-eye view of the setup and location of the cameras.
FIGURE 14: Illustration of the different components of the proposed system.
FIGURE 15: Diagram illustrating the OpenCV pipeline.
FIGURE 16: Class diagram illustrating the different classes in the stitching pipeline.
FIGURE 17: Diagram illustrating the different stages of the stitching pipeline.
FIGURE 18: Diagram showing the use of image masks.
FIGURE 19: Diagram illustrating the second design of the stitching pipeline.
FIGURE 20: Diagram illustrating the third design of the stitching pipeline.
FIGURE 21: Diagram illustrating the fourth design of the stitching pipeline.
FIGURE 22: Pseudocode for the first iteration of the stitching pipeline.
FIGURE 23: Pseudocode for the second iteration of the stitching pipeline.
FIGURE 24: Pseudocode for the third iteration of the stitching pipeline.
FIGURE 25: Pseudocode for the fourth iteration of the stitching pipeline.
FIGURE 26: Screenshot taken from two videos recorded by two cameras.
FIGURE 27: The result of the StitcHD code.
FIGURE 28: Table showing the different options of the stitching pipeline.
FIGURE 29: Diagram displaying the results of different compositing surfaces.
FIGURE 30: Table showing the times acquired from the multi-threaded system.
FIGURE 31: Table showing the times acquired from the single-threaded system.
FIGURE 32: Table showing times acquired from the multi-threaded system with the stitching pipeline re-executed every 10 seconds.
FIGURE 33: Table showing times acquired from the single-threaded system with the stitching pipeline re-executed every 10 seconds.
FIGURE 34: Diagram displaying the results of different blending algorithms.
FIGURE 35: Example of a frame where the lecturer is standing in the stitch seam.
FIGURE 36: Example of a frame where the stitching pipeline is re-executed.
FIGURE 37: Example showing a visually distracting stitch seam.
FIGURE 38: Example of a good blended frame.
FIGURE 39: Alternate positioning of the cameras.
1 Introduction
Lecture recordings have become a common educational tool. Pan-tilt-zoom (PTZ) cameras are used to record lectures while keeping the lecturer within the field of view, and cameramen have been employed to the same effect. A cheaper solution is to stitch together the videos from two or more static cameras, producing a panoramic video of the front of the lecture room.
1.1 Virtual Panning for Lecture Recording
The Centre for Innovation in Learning and Teaching (CILT) is a unit within the University of Cape Town (UCT) that addresses teaching and learning challenges within the university. The unit is responsible, among many other things, for the recording of lectures. Many lecture venues have multiple cameras recording different parts of the front of the venue. CILT has proposed a system that will:
- Stitch the videos from the multiple cameras together, effectively creating a “panoramic” video.
- Identify and track the lecturer within the panoramic video.
- Cut a cropped video out of the panoramic video, which pans along with the lecturer as the lecturer moves.
Note that this occurs after the lecture has been recorded, i.e. as post-processing, so the proposed system need not run in real-time.
A pan-tilt-zoom (PTZ) camera can be used to achieve the result described above. Compared to a single, static camera it has the advantage that the lecturer is always kept in the field of view, which allows viewers to see what the lecturer may be writing on the blackboard. However, PTZ cameras are very expensive compared to, for example, three static cameras [1]. Additionally, software that manages the PTZ camera needs to be purchased as well, making the cost of such a solution for lecture recording even higher.
Another method which produces the same result is to have a cameraman sit in the lecture and pan the camera as the lecturer walks. Although this may be cheaper than the PTZ camera solution, it is still more expensive than having two or more fixed cameras placed at the front of the lecture room. Additionally, the cameraman solution is subject to human error, which an automated system avoids.
Although not tackled in the proposed system, post-processing also allows a video’s colour, white balance and contrast to be improved, compensating for the poor lighting often found in lecture venues. Post-processing can also correct image perspectives and lens artefacts [1].
Since the videos are post-processed, emphasis is placed on producing a quality panoramic video rather than on execution time. However, the execution time must still be reasonable for the quality of video produced; taking a day to stitch a 50-minute lecture video, for example, would be unreasonable.
1.2 Stitching
Image and video stitching is a topic in the field of computer vision that has been
extensively researched since the 1990s. Many efficient algorithms have since been
developed and most of the algorithms can be run in real-time, especially with today’s
computing power.
1.2.1 Image Stitching
Image stitching involves taking two or more images as input and outputting a single, panoramic image. Since the field of view of a camera is smaller than the human field of view [2], stitching images together can provide a view of the environment as humans would see it. It also provides a way of presenting a large object in a single picture when multiple camera shots are needed to photograph the entire object.
1.2.2 Video Stitching
In some cases, panoramic images are created from video clips: a single video is taken as input and a panoramic image is output. A person records a short video of a large building or landmark, for example, and the program extracts certain frames from the video and stitches these frames into a panoramic image. This method is documented by Drew Steedly et al. [3]. Richard Szeliski mentions that using videos in this way allows for the creation of virtual reality environments, computer game settings and movie special effects [4]. This is possible because the motion of the camera while recording the video allows for depth recovery and limited 3D rendering.
1.3 The Stitching Pipeline
The stitching pipeline occurs in stages, with each successive step relying on the output of the previous one (a minimal OpenCV sketch of the complete pipeline follows this list):
- Features must first be identified in each of the input images.
- The identified features must then be matched between the input images.
- The matched features are then used to calculate a homography (projection) matrix between matched images.
- All matched images then undergo bundle adjustment together, which solves for the camera parameters jointly.
- A compositing surface must be chosen on which the images are stitched. Some examples of stitching surfaces include planes and cylinders.
- At this stage, the images undergo gain compensation: the brightness and contrast of the images are normalised to minimise the visibility of stitch seams.
- During the stitching of the images, blending is often done to minimise the visibility of the stitch seams even further.
- The panoramic stitched image is output.
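As a concrete illustration, OpenCV’s high-level cv::Stitcher interface runs this entire pipeline in a few calls. The following is a minimal sketch against OpenCV 2.4.x (the version used in this project), not the VIRPAN code itself; the input file names are hypothetical.

#include <iostream>
#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/stitching/stitcher.hpp>

int main() {
    // Load the overlapping input images (hypothetical file names).
    std::vector<cv::Mat> images;
    images.push_back(cv::imread("lecture_left.jpg"));
    images.push_back(cv::imread("lecture_right.jpg"));

    // createDefault(false) builds a CPU-only pipeline that performs feature
    // detection and matching, homography estimation, bundle adjustment,
    // warping, exposure (gain) compensation and blending internally.
    cv::Stitcher stitcher = cv::Stitcher::createDefault(false);

    cv::Mat panorama;
    cv::Stitcher::Status status = stitcher.stitch(images, panorama);
    if (status != cv::Stitcher::OK) {
        std::cerr << "Stitching failed, status code " << status << std::endl;
        return 1;
    }
    cv::imwrite("panorama.jpg", panorama);
    return 0;
}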
1.4 Aim
The aim of the proposed system is to take in two or more videos as input and output a
video that is qualitatively comparable to a video that is recorded by a PTZ camera or a
cameraman.
In terms of the stitching component, the aim is to produce a panoramic video that has
minimal visual artefacts and/or distortions, in order to give the illusion that the panoramic
video was taken with a single, static camera. Evaluation of the stitched, panoramic video is
done qualitatively by visual inspection. The execution time of the stitching pipeline is also measured to determine whether it is reasonable.
1.5 Report Overview
In section 2, an overview of image stitching algorithms is given as well as some
applications of stitching. The stitching pipeline is then described in detail. Section 3
describes the design of the system with the constraints of the system listed. The design of
the stitching component of the system will be shown in detail, with reasons described for
the iterations of the stitching component. Section 4 presents the implementation of the stitching component, with pseudocode to illustrate the process. In Section 5, run times as well as stitch quality are examined. Sections 6 and 7 detail the viability of the proposed system and the future work that can be done in this area.
2 Background
Before the stitching pipeline is covered in detail, some applications of image stitching and general stitching algorithms are discussed.
2.1 Applications of Image/Video Stitching
Image stitching has been used to create panoramic views of Jupiter and Saturn from the multiple images taken of them by the two Voyager spacecraft. Landsat photographs have also been stitched together to create panoramic views of Earth. Peter J. Burt et al. [5] mention, with reference to these two examples, that image stitching allows for the construction of images with a larger field of view or level of detail than could be obtained with a single photograph.
Heung-Yeung Shum et al. [6] give some applications of image stitching, including creating panoramic views of aerial and satellite photographs, performing scene stabilisation and change detection, and creating virtual environments and virtual travel. They also mention techniques for capturing panoramic images of real-world scenes directly: using a panoramic camera to capture a cylindrical panoramic image onto a long film strip, using a fisheye lens with a very large field of view, or using mirrored pyramids and parabolic mirrors [6]. These techniques require expensive hardware, which is sometimes not feasible. A cheaper alternative is to use a single camera to photograph different parts of the scene and pass the photographs through a stitching algorithm to produce a panoramic image [6].
2.2 Overview of Stitching Algorithms
Stitching algorithms fall broadly into two camps: direct methods and feature based methods. Direct methods try to match pixels between images, iteratively refining estimates of the camera parameters. This approach has the advantage of using all of the available data, but it requires some form of initialisation and is very susceptible to varying brightness conditions [7] [8].
Feature based methods find features and the correspondences between them in order to find image matches. This approach does not require initialisation, but the features found are usually susceptible to illumination, zoom and rotation changes [7] [8].
Once the images are matched using one of the two methods, most stitching algorithms follow the same steps thereafter: all matched images undergo bundle adjustment, then are warped and blended onto a compositing surface [3] [7] [8].
2.3 Overview of the Stitching Pipeline
The stitching pipeline involves many stages, with each stage making use of the information found in the previous stage to fulfil its purpose in the pipeline.
- Feature detection: detect features within all input images.
- Feature matching: match features between the different images, with the intention of finding which images are part of the same panorama. Note that for stitching to succeed there must be some area of overlap between the images; if there is no overlap, this step will find no features in common and the pipeline cannot continue.
- Homography estimation: take the matched features from the previous step and estimate the camera parameters between pairs of images.
- Bundle adjustment: solve for all the camera parameters jointly. This is done because the camera parameters in the previous step are estimated between pairs of images, which can cause errors to accumulate in panoramas involving more than two images. An example is a 360 degree panorama, where accumulated errors can cause the ends of the panorama to not join up [7].
- Image warping: choose a stitching surface onto which the panorama is stitched.
- Gain compensation: solve for a photometric parameter. Essentially, the brightness and contrast of all the images are normalised in order to minimise the visibility of stitch seams.
- Blending: blend pixels along the stitch seams to minimise their visibility even further.
2.4 Feature Detection
Features (also known as interest points) are points in the image where the signal changes
two-dimensionally [9]. These points include corners of objects, dots and any location with
significant 2D texture.
Interest point detectors and corner detectors can be divided into three categories: contour based methods, intensity based methods and parametric model methods [9]. Contour based methods extract contour lines from the image and search for maximal curvature or inflexion points along the lines [9]. Intensity based methods find features by examining pixel grey values [9]. Parametric model methods fit a parametric intensity model to the image signal [9].
Cordelia Schmid et al. [9] describe two criteria for evaluating interest point detectors:
repeatability and information content. Repeatability measures how independent the detector is of changes in imaging conditions, such as illumination; a high score means that the detector is largely unaffected by such changes and repeats the same results under different conditions. Information content measures how distinctive an interest point is; a detector that scores low on this metric finds features that can be matched to almost any other feature, so a high score is preferred. By subjecting many detectors to these criteria, they found that the Harris detector scores highly on both.
However, Matthew Brown et al. [7] note that the Harris detector is not invariant to changes in image scale. David Lowe [10] has thus proposed a method for “extracting
distinctive invariant features to perform reliable matching”. His approach is titled the Scale
Invariant Feature Transform (SIFT) and features found using this method are invariant to
scale and rotation, and are robust to affine distortion, change in viewpoint, noise and
change in illumination [10]. This is done by transforming found features into “scale-
invariant coordinates relative to local features” [10].
FIGURE 1: Features detected on the left side of the lecture room. The picture is in greyscale because features are easier to detect when there is only one colour channel. Feature detection done using OpenCV.
FIGURE 2: Features detected on the right side of the lecture room.
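To make this step concrete, the sketch below detects SIFT keypoints and computes their descriptors on a greyscale frame with OpenCV 2.4.x. This is a minimal sketch, assuming the nonfree module is available (SIFT lives there in this version); the file name is hypothetical.

#include <vector>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>  // SIFT lives in the nonfree module in 2.4.x

int main() {
    // Load the frame in greyscale, as in FIGURE 1.
    cv::Mat frame = cv::imread("lecture_left.png", CV_LOAD_IMAGE_GRAYSCALE);

    // Detect scale- and rotation-invariant keypoints.
    cv::SiftFeatureDetector detector;
    std::vector<cv::KeyPoint> keypoints;
    detector.detect(frame, keypoints);

    // Compute the descriptor vector for each keypoint (used later for matching).
    cv::SiftDescriptorExtractor extractor;
    cv::Mat descriptors;
    extractor.compute(frame, keypoints, descriptors);

    // Visualise the detected features.
    cv::Mat visual;
    cv::drawKeypoints(frame, keypoints, visual);
    cv::imwrite("features.png", visual);
    return 0;
}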
2.5 Feature Matching
Once features are detected using the SIFT algorithm, they are matched between images on the basis of their location, scale and orientation [10]. SIFT features can be matched with good probability because the keypoint descriptor associated with each feature is highly distinctive.
FIGURE 3: An example of how keypoint descriptors are created [10]. In this example, the image gradients are grouped into a 2x2 array and summed together. The lengths of the arrows denote the magnitudes of the gradients.
Since the same scene point may appear in multiple images, each feature is matched to its k nearest neighbours using a k-d tree [8].
Drew Steedly et al. [3] mention that feature detectors often identify several hundred features in each image, which results in many matches between images with large overlaps. To improve efficiency, they group together features within a small area and replace the feature points in the group with their centroid.
FIGURE 4: The green lines represent the features that are matched with a high confidence level.
Feature matching done using OpenCV.
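A k-nearest-neighbour matching step of this kind can be written against OpenCV’s FLANN-based matcher, which indexes descriptors with k-d trees. The sketch below keeps a match only when the best neighbour is clearly closer than the second best, a common acceptance rule (Lowe’s ratio test [10]); the 0.8 threshold is an assumption, not a value from this project.

#include <vector>
#include <opencv2/features2d/features2d.hpp>

// Match SIFT descriptors from two frames: find each feature's k = 2 nearest
// neighbours, then apply the ratio test to discard ambiguous matches.
std::vector<cv::DMatch> matchFeatures(const cv::Mat& descriptors1,
                                      const cv::Mat& descriptors2) {
    cv::FlannBasedMatcher matcher;  // approximate k-d tree search
    std::vector<std::vector<cv::DMatch> > knnMatches;
    matcher.knnMatch(descriptors1, descriptors2, knnMatches, 2);

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knnMatches.size(); ++i) {
        if (knnMatches[i].size() == 2 &&
            knnMatches[i][0].distance < 0.8f * knnMatches[i][1].distance) {
            good.push_back(knnMatches[i][0]);
        }
    }
    return good;  // high-confidence matches, as drawn in FIGURE 4
}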
2.6 Homography Estimation
Homography estimation is the process of estimating the image transformation parameters that best fit the features matched between images. Matthew Brown et al. [8] use a random sampling approach called random sample consensus (RANSAC), which estimates the homography matrix using direct linear transformations. The sampling is done over a number of iterations, and the solution chosen is the one that gives the largest number of RANSAC inliers: feature matches that are geometrically consistent [8]. These solutions also contain RANSAC outliers, features that lie inside the area of overlap but are not consistent [8].
They then perform a verification test using a probabilistic model that determines whether the set of inliers/outliers was generated by a correct match or by a false match.
Once the images are correctly paired together, they are passed to the next stage of the
stitching pipeline.
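In OpenCV this step is a single call. The sketch below estimates the homography from the matched keypoints with RANSAC; the inlier mask it returns marks the geometrically consistent matches described above. The 3-pixel reprojection threshold is an assumption.

#include <vector>
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/features2d/features2d.hpp>

// Estimate the 3x3 homography mapping frame 1 onto frame 2 from matched
// keypoints. On return, inlierMask marks the RANSAC inliers.
cv::Mat estimateHomography(const std::vector<cv::KeyPoint>& keypoints1,
                           const std::vector<cv::KeyPoint>& keypoints2,
                           const std::vector<cv::DMatch>& matches,
                           cv::Mat& inlierMask) {
    std::vector<cv::Point2f> points1, points2;
    for (size_t i = 0; i < matches.size(); ++i) {
        points1.push_back(keypoints1[matches[i].queryIdx].pt);
        points2.push_back(keypoints2[matches[i].trainIdx].pt);
    }
    // A match counts as an inlier if its reprojection error is below 3 pixels.
    return cv::findHomography(points1, points2, CV_RANSAC, 3.0, inlierMask);
}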
2.7 Bundle Adjustment
The previous step estimated the camera parameters (homography) between pairs of
matching images. This can cause errors to accumulate, causing 360 degree panoramas to
not join up at the ends. This step aims to minimise the accumulated errors by
simultaneously minimizing the misregistration between overlapping images [6] [11].
Bill Triggs et al. [12] define bundle adjustment as “the problem of refining a visual
reconstruction to produce jointly optimal 3D structure and viewing parameter estimates”.
Optimal means that the errors are minimised according to some cost function and jointly
means that the solution found is optimal across all images and camera parameters.
Images are added to the bundle adjuster one by one, with the most recent image being
the best matching image (maximum number of consistent matches). This image is
initialised with the same rotation and focal length as the image to which it best matches
[7] [8]. The parameters are then updated using a cost function which must be minimised
at each step to give optimal camera parameters. This is done repeatedly until all images
have been added to the bundle adjuster.
2.8 Image Warping
Once all the input images have been registered with respect to each other, a compositing
surface needs to be chosen on which to stitch the images. If there are only a few images to stitch, one image is usually chosen as the reference frame and the rest of the images are warped into its coordinate frame. This is known as a flat panorama [11], as the images are stitched onto a planar surface.
Shmuel Peleg et al. [2] mention that if the camera undergoes a pure translation while taking pictures, then a plane is a good surface to stitch on; if the camera rotates about a point, a cylinder is a good choice. However, cameras rarely undergo such perfect transformations, and the stitching surface is almost always a combination of a plane and a cylinder.
Richard Szeliski [4] gives a solution in which the images are divided into several large, overlapping regions, each represented by a planar surface onto which the images are stitched. A similar solution is to select a base frame and calculate each frame’s position with respect to it, choosing a new base frame periodically to perform the alignment.
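For the flat-panorama case, warping one frame into the reference frame is a single OpenCV call, sketched below. The canvas size must be chosen large enough to contain both frames; OpenCV’s stitching module also provides cylindrical and spherical warpers in its detail namespace for the non-planar cases.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Warp src onto a planar compositing surface using the homography H
// (as estimated in the previous stages).
void warpOntoPlane(const cv::Mat& src, const cv::Mat& H,
                   const cv::Size& canvasSize, cv::Mat& dst) {
    cv::warpPerspective(src, dst, H, canvasSize);
}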
2.9 Gain Compensation
Matthew Brown et al. [8] describe this step as calculating a “photometric parameter”, as opposed to the previous steps, which calculate “geometric parameters”. Gain compensation aims to normalise the brightness and contrast across all the images so that the visibility of the stitch seams is minimised. This is necessary because cameras often adjust the brightness and contrast of the scene automatically when taking pictures.
FIGURE 5: A panoramic image without gain compensation [8]. The stitch seams are clearly visible
due to varying light conditions and contrast.
FIGURE 6: A panoramic image with gain compensation [8]. The stitch seams are not as visible. Stitch
seams are still visible near the sun due to unmodelled artefacts.
2.10 Blending
After gain compensation, there are still a number of unmodelled artefacts such as
vignetting (image brightness decreases towards the edges), parallax effects, mis-
registration errors, radial distortion [8] and ghosting (due to moving objects). Blending
aims to minimise these artefacts.
One method of blending is to simply take the average pixel value of all the images in the area of overlap. This is known as feather blending, and it does not handle the above-mentioned artefacts very well [11].
Peter J. Burt et al. [5] describe a blending algorithm, titled multi-band blending, that
handles the mentioned artefacts well. The idea behind the algorithm is to “blend low
frequencies over a large spatial range and high frequencies over a short range” [8]. This
means that fine details (such as the edges of objects) are blended within a short range
and coarse details are blended over a larger range. This is done by decomposing the
images into band-pass (frequency) component images [5].
Removing the ghosting caused by moving objects involves using pixel values from only one of the images contributing to the area of overlap [13], ensuring that only one copy of the object appears in the final panorama.
FIGURE 7: An image of an apple that represents a high-frequency band-pass component [5]. Note
that the detail of the apple does not “bleed” across the middle seam.
FIGURE 8: An image of an apple that represents a medium-frequency band-pass component [5].
The detail of the apple bleeds slightly across the middle seam.
FIGURE 9: An image of an apple that represents a low-frequency band-pass component [5]. Notice
the detail of the apple bleeds significantly across the middle seam.
FIGURE 10: An image of an apple that is the sum of the previous three figures [5]. This image is
ready to be blended with another image.
FIGURE 11: A panoramic image with gain compensation and multi-band blending [8]. Notice that the stitch seams near the sun have completely disappeared.
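As a point of comparison with multi-band blending, the sketch below implements the simple feather blend mentioned at the start of this section for two already-aligned frames: each pixel in the overlap is a weighted average, with weights ramping linearly across the overlap. This is a minimal sketch under the assumption of a purely horizontal overlap region.

#include <opencv2/core/core.hpp>

// Feather-blend two aligned 8-bit colour images whose overlap spans
// columns [x0, x1). Left of x0 only `left` contributes; right of x1 only
// `right` does; in between the weight ramps linearly.
cv::Mat featherBlend(const cv::Mat& left, const cv::Mat& right, int x0, int x1) {
    CV_Assert(left.size() == right.size() && left.type() == CV_8UC3);
    cv::Mat out = left.clone();
    for (int y = 0; y < out.rows; ++y) {
        for (int x = x0; x < out.cols; ++x) {
            float alpha = (x >= x1) ? 1.0f
                                    : static_cast<float>(x - x0) / (x1 - x0);
            cv::Vec3b l = left.at<cv::Vec3b>(y, x);
            cv::Vec3b r = right.at<cv::Vec3b>(y, x);
            for (int c = 0; c < 3; ++c) {
                out.at<cv::Vec3b>(y, x)[c] =
                    cv::saturate_cast<uchar>((1.0f - alpha) * l[c] + alpha * r[c]);
            }
        }
    }
    return out;
}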
2.11 Evaluation of Visual Quality of Stitched Images
Although stitch quality is evaluated by visual inspection in this report, there are quantitative methods for evaluating stitch seams. Wei Xu et al. [14] describe a good blending algorithm as one that transfers colours “from the overlapped area to the full target image without creating visual artefacts”.
They propose two criteria to evaluate the quality of stitched images: a colour similarity
index and a structure similarity index. The structure similarity index, for example, measures
if straight lines in the input images are kept as straight lines in the stitched output image.
A stitched image should have colours and structures similar to the original input images.
The colour similarity index is measured as a function of the peak signal-to-noise ratio
between the input and output images. A high score in this index means that the colours
are more similar between the images [14].
The structure similarity index is measured as a function of luminance, contrast and
structure components. This index gives a value between 0 and 1 with 1 being that there is
no structural difference between the input and output images [14].
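For reference, the standard formulations behind these two indices are shown below; Xu et al.’s exact variants may differ in detail:

\[
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}}\right),
\qquad
\mathrm{SSIM}(x, y) =
\frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}
     {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
\]

where \(\mathrm{MAX}_I\) is the maximum pixel value (255 for 8-bit images), \(\mathrm{MSE}\) is the mean squared error between the input and output images, \(\mu\), \(\sigma^2\) and \(\sigma_{xy}\) are the local means, variances and covariance of the two image patches \(x\) and \(y\), and \(C_1\), \(C_2\) are small stabilising constants.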
2.12 Summary
The following table summarises the papers discussed in this section by categorising them according to the parts of the stitching pipeline they describe. The abbreviations are: FD – Feature Detection, FM – Feature Matching, HE – Homography Estimation, BA – Bundle Adjustment, IW – Image Warping, GC – Gain Compensation, B – Blending.

Paper | Author(s) | Pipeline stages (FD FM HE BA IW GC B)
A Multiresolution Spline With Application to Image Mosaics [5] | Peter J. Burt et al. | ✔
Eliminating Ghosting and Exposure Artifacts in Image Mosaics [13] | Matthew Uyttendaele et al. | ✔
Bundle Adjustment – A Modern Synthesis [12] | Bill Triggs et al. | ✔
Distinctive Image Features from Scale-Invariant Keypoints [10] | David G. Lowe | ✔ ✔
Evaluation of Interest Point Detectors [9] | Cordelia Schmid et al. | ✔
Efficiently Registering Video into Panoramic Mosaics [3] | Drew Steedly et al. | ✔
Image Alignment and Stitching: A Tutorial [11] | Richard Szeliski | ✔ ✔ ✔ ✔ ✔ ✔
Panoramic Mosaics by Manifold Projection [2] | Shmuel Peleg et al. | ✔ ✔
Recognising Panoramas [7] | Matthew Brown et al. | ✔ ✔ ✔ ✔ ✔
Systems and Experiment Paper: Construction of Panoramic Image Mosaics with Global and Local Alignment [6] | Heung-Yeung Shum et al. | ✔ ✔ ✔
Automatic Panoramic Image Stitching using Invariant Features [8] | Matthew Brown et al. | ✔ ✔ ✔ ✔ ✔ ✔
Video Mosaics for Virtual Environments [4] | Richard Szeliski | ✔ ✔
3 Design
An overview of the system is given before the design of the stitching pipeline is elaborated.
3.1 System Overview
The system, titled Virtual Panning for Lecture Recording (VIRPAN), takes in two or more videos recorded by static cameras and processes them in three stages:
- Stitching: the videos are stitched to create one panoramic video.
- Tracking: the lecturer is identified and tracked in the panoramic video.
- Panning: the panoramic video is cropped to the area of interest, namely the lecturer. The cropped frame pans with the lecturer as the lecturer moves around.
The output of the entire system is the video containing the cropped frames that pan with the lecturer.
The system is developed using OpenCV, an open source computer vision library distributed under the BSD license. By using an open source library we are able to develop the system without incurring any costs, and we can release our source code once the system is completed, allowing other people to contribute. We use OpenCV version 2.4.9 and the system is programmed in C++.
FIGURE 12: The logo used by OpenCV (Open Source Computer Vision).
FIGURE 13: A bird’s-eye view of the setup and location of the cameras. Note that there has to be
some degree of overlap for stitching to be successful.
3.1.1 System Constraints
CILT has stated that the system should adhere to the following specifications:
- The system must run on Ubuntu.
- The system must not use a GPU to speed up processing time.
- The system must produce the completed output video in a time not longer than three times the length of one of the input videos. If the recorded lecture is 45 minutes long, the system must not take longer than 135 minutes to produce an output.
- The system post-processes the videos, so it need not run in real-time.
3.1.2 Functional Requirements
Taking the above constraints into account and what the system should accomplish, the
functional requirements are the following:
- The system must take two or more videos as input.
- The system must produce a cropped video which pans with the lecturer.
FIGURE 14: A diagram illustrating the different components of the system as well as the inputs and
output. It also illustrates the passing of information between the different components.
3.1.3 Non-functional Requirements
Taking the system constraints into account and what the system should accomplish, the
non-functional requirements are the following:
- The system must be able to run on Ubuntu.
- The system cannot use a GPU to accelerate processing times.
- The system cannot execute in a time longer than three times the length of one of the input videos.
3.1.4 Assumptions
Due to the nature of video recordings and the way the cameras can be set up in the lecture room, some assumptions need to be made to ensure that the system can produce a quality output video. The assumptions are as follows:
- All input videos are in sync, i.e. all videos start recording at the same time. One video cannot be “behind” the others, as this would make the videos almost impossible to stitch together.
- All input videos have the same frame rate (FPS). Videos with different frame rates have “sampled” the scene at different frequencies and may be misregistered during the stitching part of the system.
- All input videos have some degree of overlap, for stitching purposes.
- All input videos have a similar zoom level. This allows the videos to be stitched together without warping them too much.
- The cameras taking the videos are static and fixed in place.
- Since we are dealing with the visual quality of the videos, sound is not considered.
3.1.5 Key Success Factors
One success factor is that the system adheres to the system constraints. If CILT is satisfied with the output videos the system produces, and thus uses the system to post-process lecture recordings within UCT, then the system will also be considered successful. However, even if the system cannot adhere to the constraints, if CILT recognises the value the system can have within the realm of lecture videos and continues its development in the future, then this is also considered a successful outcome.
3.1.6 Input/output of the Components in the System
The inputs and outputs of the three components of the system are listed here to indicate how the different parts of the system pass information to each other.
Stitching:
- Input: two or more videos.
- Output: one panoramic video stream.
Tracking:
- Input: one panoramic video stream.
- Output: co-ordinates of the lecturer, written to a text file.
Panning:
- Input: one panoramic video stream and the text file containing the co-ordinates of the lecturer.
- Output: one cropped video that pans with the lecturer.
3.2 Stitching Component
The design of the stitching component will be detailed here. Since OpenCV is used, the
design of the stitching component is very similar to the design of the OpenCV stitching
pipeline.
3.2.1 Aim
The aim of the stitching component of the system is to combine two or more videos into one panoramic video, with minimal visual artefacts and/or distortions, in order to give the impression that the panoramic video had been taken with a single, static camera.
FIGURE 15: A diagram illustrating the OpenCV stitching pipeline, taken from the OpenCV website1. The stages of the stitching pipeline are based on this diagram.
3.2.2 Stages of the Stitching Pipeline
The seven stages of the stitching pipeline have been elaborated in Section 2.3. They are:
- Feature Detection
- Feature Matching
- Homography Estimation
- Bundle Adjustment
- Image Warping
- Gain Compensation
- Blending
A class was created for each stage of the pipeline and the system makes calls to them as
necessary.
1 http://docs.opencv.org/modules/stitching/doc/introduction.html
FIGURE 16: A class diagram illustrating the different classes in the stitching pipeline. The stitching
class creates one instance of each class and makes calls to them as necessary.
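A hypothetical skeleton of this one-class-per-stage structure is sketched below; the report does not give the actual class names, so these are illustrative only.

#include <vector>
#include <opencv2/core/core.hpp>

class FeatureDetection     { /* stage 1: find features in each frame              */ };
class FeatureMatching      { /* stage 2: match features between frames            */ };
class HomographyEstimation { /* stage 3: estimate pairwise homographies           */ };
class BundleAdjustment     { /* stage 4: refine all camera parameters jointly     */ };
class ImageWarping         { /* stage 5: warp frames onto the compositing surface */ };
class GainCompensation     { /* stage 6: normalise brightness and contrast        */ };
class Blending             { /* stage 7: blend frames along the stitch seams      */ };

// The stitching class owns one instance of each stage class and calls them
// in pipeline order as necessary.
class Stitching {
public:
    cv::Mat stitchFrames(const std::vector<cv::Mat>& frames);
private:
    FeatureDetection detection;
    FeatureMatching matching;
    HomographyEstimation estimation;
    BundleAdjustment adjustment;
    ImageWarping warping;
    GainCompensation compensation;
    Blending blending;
};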
3.2.3 First Design
The initial design of the stitching component involved extracting a frame from each of the input videos and passing the extracted frames through the entire stitching pipeline. Once the frames were stitched, the panoramic frame was written to the output video. This process was repeated until all the frames from the input videos were processed.
This method of stitching videos was extremely slow. Another problem was that the stitched frame produced in each iteration could have different dimensions from the one before it, making it unwritable to the output video file: the output video expects frames of a fixed dimension, and frames not conforming to those dimensions are simply ignored. OpenCV does provide a mechanism to resize frames, but this method of stitching is already extremely slow, so resizing every frame would only exacerbate the problem.
FIGURE 17: A diagram illustrating the different stages of the stitching pipeline. This is also the first
design of the stitching pipeline.
3.2.4 Second Design
The next design of the stitching pipeline reuses the image masks calculated during the first iteration of the pipeline. The idea behind this design is that once the features have been found and the homography matrices estimated, there is no need to recalculate them for subsequent frames.
During the first iteration, a frame from each video is extracted and passed through the entire stitching pipeline. All subsequent frames reuse the image masks and homography matrices calculated in the first iteration, so only the blending stage is re-executed.
This method is much faster than the previous design, since the first six stages of the pipeline are executed only once. However, the execution time still falls short of the requirement that it be at most three times the length of one of the input videos.
FIGURE 18: A diagram showing the use of image masks. The image masks tell the blender which
part of an image to use in the final panoramic image, as shown in the last frame. The image
masks are used repeatedly to blend the rest of the video.
FIGURE 19: A diagram illustrating the second design of the stitching pipeline. The base stitching
pipeline is found in FIGURE 17. Image masks and homography matrices are reused to speed up
execution time.
3.2.5 Third Design
The third design multithreads the processing of frames. After the first iteration of the pipeline, blending is the only part of the pipeline that is re-executed.
To enable multithreading, frames are extracted from the input videos in batches, as opposed to one at a time as in the previous design. Each thread is given a batch of frames to blend, and once the threads have finished executing, the main thread writes all blended frames to the output video file.
Results from this design are discussed in Section 5.
FIGURE 20: A diagram illustrating the third design of the stitching pipeline. Image masks and
homography matrices are passed by reference because they never change.
3.2.6 Fourth Design
This design was necessitated by the fact that the initially calculated image masks and homography matrices may not be optimal. This occurs when the lecturer happens to stand in a stitch seam, which creates a distorted and unpleasant-looking stitched frame. This design allows users of the system to specify a time interval, in seconds, at which the stitching pipeline is re-executed.
Since re-executing the pipeline creates different image masks and homography matrices, the pipeline extracts the number of frames that share the same masks and matrices before re-executing. For example, if the input videos have 25 FPS and the user has specified that the entire pipeline be re-executed every 4 seconds, the pipeline is re-executed every 100 frames. Every batch of 100 frames must then be resized to the dimensions of the output video file, since it is highly unlikely that all blended frames have the same dimensions.
Each thread is given a number of frames and their corresponding image masks and
homography matrices. It is therefore possible to have multiple threads each blending
input frames using different image masks and homography matrices. Once the threads
have executed completely the main thread will write all blended frames to the output
video file.
Results from this design are discussed in Section 5.
FIGURE 21: A diagram illustrating the fourth design of the stitching pipeline. Image masks and homography matrices are passed by value, because they can change between batches. Note that if the thread vector is not full, there are no more frames that use the same image masks; otherwise the vector would have been filled.
4 Implementation
The implementation of the stitching pipeline is detailed here. The various iterations of the implementation are documented to show the evolution of the stitching pipeline, with reasons why each iteration was necessary.
Since the stitching pipeline is based on the OpenCV stitching pipeline, the pseudocode shown is similar to the OpenCV stitching code, which can be found on the OpenCV GitHub page2. As mentioned in Section 3, C++ is the programming language used and the Eclipse IDE is used for developing the VIRPAN system.
4.1 First Iteration
The first implementation of the stitching component involved extracting a frame from
each video and passing them through the entire stitching pipeline. The stitched
panoramic frame is then written to the output video file. This is repeatedly done until all
frames are stitched and written to the output file.
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Find the features in the frames
    Match the features
    Estimate the homography of the frames
    Perform bundle adjustment on the frames
    Warp the frames onto a surface
    Compensate for exposure and gain
    Blend the frames together
    if (this is the first iteration)
        Create an output video file
    Write stitched panoramic frame to output video file
    Increment frameNumber by 1
END while
FIGURE 22: Pseudocode for the first iteration of the stitching pipeline.
2 https://github.com/Itseez/opencv
Having each frame run through the entire stitching pipeline in every iteration is extremely slow. This prompted us to examine the pipeline in detail to see whether some steps could be skipped. We noticed that the blending stage uses the image masks and homography matrices calculated in the earlier stages of the pipeline. Since the cameras recording the videos are fixed in place, the camera parameters (such as rotation values) remain constant for the entire duration of the recording, so there is no need to recalculate the image masks and homography matrices.
4.2 Second Iteration
The second implementation of the stitching component involved passing the first frames
of each video through the pipeline in order to calculate the image masks and
homography matrices needed to stitch the frames together. Once this is done, the rest of
the input frames are simply blended together, using the masks and matrices calculated
during the first iteration of the pipeline.
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
Extract one frame from each video and store in a vector
Find the features in the frames
Match the features
Estimate the homography of the frames
Perform bundle adjustment on the frames
Warp the frames onto a surface
Compensate for exposure and gain
Blend the frames together
Create an output video file
Write stitched panoramic frame to output video file
Increment frameNumber by 1
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Blend the frames together, reusing the image masks and homography matrices
    Write stitched panoramic frame to output video file
    Increment frameNumber by 1
END while
FIGURE 23: Pseudocode for the second iteration of the stitching pipeline.
4.3 Third Iteration
Although the second iteration gave better execution times than the first, it still did not meet the constraint that the execution time be at most three times the length of one of the input videos. Multithreading was the next option for making execution faster.
Once the image masks and the homography matrices have been calculated, the pipeline re-executes only the blending stage for the rest of the input frames. This can be multithreaded, as each frame can be blended independently of the others. The constraint here is that the frames need to be written in sequence to the output video file.
To facilitate this, a vector is created in the main thread to store blended frames. Each
thread receives a reference to the vector and a number which states the index at which
the thread starts storing the blended frames. Once the threads have executed completely,
the main thread iterates through the vector sequentially and writes the blended frames to
the output video file.
The stitching pipeline creates at most 10 threads at any given time with each thread
blending at most 100 frames. These numbers were chosen arbitrarily.
Create a results vector to store blended frames
Initialise start_index to 0  // Index at which a thread starts storing frames
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Find the features in the frames
    Match the features
    Estimate the homography of the frames
    Perform bundle adjustment on the frames
    Warp the frames onto a surface
    Compensate for exposure and gain
    Blend the frames together
    if (this is the first iteration)
        Create an output video file
    Store stitched/blended frame into results[start_index]
    Increment frameNumber by 1
    Increment start_index by 1
    int frames_to_stitch = number of frames to stitch - 1
    while (frames_to_stitch > 0)
        if (frames_to_stitch <= 100)
            if (number of active threads = 10)
                Wait for all threads to finish
                Write all blended frames in results to output video
                Set start_index to 0
            END if
            Extract (frames_to_stitch) number of frames, store in vector
            Create a thread to blend the frames
            start_index += frames_to_stitch
            frameNumber += frames_to_stitch
            Set frames_to_stitch to 0
        ELSE if (frames_to_stitch <= 1000)
            int complete_threads = frames_to_stitch / 100
            if (number of active threads + complete_threads > 10)
                Wait for all threads to finish
                Write all blended frames in results to output video
                Set start_index to 0
            END if
            for (0 to complete_threads)
                Extract 100 frames each from input videos
                Create a thread to blend the frames
                start_index += 100
                frameNumber += 100
                frames_to_stitch -= 100
            END for
        ELSE  // 10 full threads can be created
            if (number of active threads != 0)
                Wait for all threads to finish
                Write all blended frames in results to output video
                Set start_index to 0
            END if
            for (0 to 10)
                Extract 100 frames each from input videos
                Create a thread to blend the frames
                start_index += 100
                frameNumber += 100
                frames_to_stitch -= 100
            END for
        END if-else branch
    END while
END while
if (there are still active threads running)
    Wait for all threads to finish
    Write all blended frames in results to output video
    Set start_index to 0
END if
FIGURE 24: Pseudocode for the third iteration of the stitching pipeline. The results vector is passed
by reference into all threads and each thread gets a number (start_index) at which to write
blended frames. This ensures that all frames in the results vector are in order.
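Condensed into C++11, the batching pattern looks roughly like the sketch below. The helper blendBatch is hypothetical (the report does not show the real function); the key points are that each thread writes into a disjoint slot range of a preallocated results vector, so no locking is needed, and the main thread later writes the vector out in order.

#include <functional>
#include <thread>
#include <vector>
#include <opencv2/core/core.hpp>

// Hypothetical helper: blend each element of `batch` (one frame per camera)
// with the precomputed masks and homographies, storing the panorama into
// results[startIndex + i]. The body here is a placeholder.
void blendBatch(const std::vector<std::vector<cv::Mat> >& batch,
                std::vector<cv::Mat>& results, int startIndex) {
    for (size_t i = 0; i < batch.size(); ++i) {
        results[startIndex + i] = batch[i][0].clone();  // placeholder blend
    }
}

// Launch one thread per batch; threads write to disjoint ranges of `results`,
// which must be preallocated to the total frame count.
void blendAll(const std::vector<std::vector<std::vector<cv::Mat> > >& batches,
              std::vector<cv::Mat>& results) {
    std::vector<std::thread> threads;
    int startIndex = 0;
    for (size_t b = 0; b < batches.size(); ++b) {
        threads.push_back(std::thread(blendBatch, std::cref(batches[b]),
                                      std::ref(results), startIndex));
        startIndex += static_cast<int>(batches[b].size());
    }
    for (size_t t = 0; t < threads.size(); ++t) {
        threads[t].join();  // main thread then writes results[] out in order
    }
}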
4.4 Fourth Iteration
After executing the third design on a number of videos, it was found that sometimes the
calculated image masks and homography matrices were not optimal. This occurs when
the lecturer stands in the stitch seam, creating a distorted stitched frame which is
unpleasant to view. This iteration adds the ability to re-execute the stitching pipeline every x seconds, where x is configurable by the user.
The design of this iteration is discussed in Section 3.2.6 and the pseudocode is shown below.
Create a results vector to store blended frames
Initialise start_index to 0
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
Determine number of frames to re-execute stitching pipeline (FPS * seconds to re-execute stitching pipeline)
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Find the features in the frames
    Match the features
    Estimate the homography of the frames
    Perform bundle adjustment on the frames
    Warp the frames onto a surface
    Compensate for exposure and gain
    Blend the frames together
    if (this is the first iteration)
        Create an output video file
    if (stitched frame needs to be resized)
        Resize frame
    Store stitched/blended frame into results[start_index]
    Increment frameNumber by 1
    Increment start_index by 1
    int frames_to_stitch = number of frames to re-execute pipeline - 1
    while (frames_to_stitch > 0)
        if (frames_to_stitch <= 100)
            // Create thread
        ELSE if (frames_to_stitch <= 1000)
            // Create threads
        ELSE    // 10 full threads can be created
            // Create threads
        END if else branch
    END while
END while
if (there are still active threads running)
    Wait for all threads to finish
    Write all blended frames in results to output video
    Set start_index to 0
END if
FIGURE 25: Pseudocode for the fourth iteration of the stitching pipeline. The “create threads”
comment contains the same code within the if else branch as FIGURE 24. It is important to note that
due to image masks and homography matrices changing, every blended frame will need to be
resized in order for it to be written to the output video file.
It is important to note that at any given point in time, it is possible that there are multiple
threads running, each using different image masks and different homography matrices to
blend the frames together. It is also possible that 10 threads are never created at any
point in time and that a thread may have fewer than 100 frames to stitch.
An example of this is if the user decides that the pipeline should be re-executed every 10
seconds for a video that has 25 FPS. This means that the pipeline is re-executed every 250
frames. The first frame is blended during the re-execution of the pipeline itself, leaving 249 frames to be blended. Three threads are created to blend these 249 frames using the same image masks and homography matrices; the last thread has only 49 frames to blend.
While the three threads are running, the main thread re-executes the stitching pipeline to
calculate the image masks and homography matrices for the next 250 frames. Three
threads are then created to blend the 249 frames. This process is repeated until nine
threads are created, at which point the main thread will have to wait for them to complete
execution and write the frames to the output video file, before continuing to create more
threads.
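The partitioning in this example can be expressed as a small helper. The following is an illustrative sketch rather than code from the pipeline; batchSizes is a hypothetical name.

#include <algorithm>
#include <cstddef>
#include <vector>

// Split a run of frames into chunks of at most maxPerThread frames,
// one chunk per blending thread.
std::vector<std::size_t> batchSizes(std::size_t framesToBlend,
                                    std::size_t maxPerThread = 100) {
    std::vector<std::size_t> sizes;
    while (framesToBlend > 0) {
        std::size_t n = std::min(framesToBlend, maxPerThread);
        sizes.push_back(n);
        framesToBlend -= n;
    }
    return sizes;
}
// batchSizes(249) yields {100, 100, 49}: three threads, the last with 49 frames.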
4.5 Other Related Works
While trying to create a solution that would speed up the stitching process, a real-time stitching solution was encountered on GitHub. The project, titled “StitcHD” [15], is a NASA-funded project to develop a video system that will replace windows on NASA’s spacecraft.
A subset of their code was quickly implemented to see the results of their stitching
solution. Their solution involves calculating a homography matrix and warping the images
based on this matrix. The blending algorithm used was a simple averaging method, where
pixel colours between two overlapping images were averaged to form a final pixel.
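For illustration, an averaging blend of this kind might look as follows in OpenCV. This is a sketch under the assumption that the two images have already been warped onto a common canvas and that maskA and maskB mark their valid pixels; it is not StitcHD’s actual code.

#include <opencv2/core/core.hpp>

// Average the pixels where both warped images overlap; elsewhere take
// whichever image covers the pixel.
cv::Mat averageBlend(const cv::Mat& imgA, const cv::Mat& imgB,
                     const cv::Mat& maskA, const cv::Mat& maskB) {
    cv::Mat overlap;
    cv::bitwise_and(maskA, maskB, overlap);

    cv::Mat result = cv::Mat::zeros(imgA.size(), imgA.type());
    imgA.copyTo(result, maskA);
    imgB.copyTo(result, maskB);

    cv::Mat mean;
    cv::addWeighted(imgA, 0.5, imgB, 0.5, 0.0, mean); // per-pixel average
    mean.copyTo(result, overlap);                     // only in the overlap
    return result;
}

Any moving object in the overlap appears in both images at slightly different positions, so the average shows two half-intensity copies of it: the ghosting visible in FIGURE 27.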
FIGURE 26: A screenshot taken from two videos recorded by two cameras: one fixed camera is on
the left of the room and one fixed camera is on the right. Note that the lecturer’s arm on the right
screenshot is more angled compared to the left screenshot due to perspective distortion.
FIGURE 27: The result of the StitcHD code.
The ghosting effect is not caused by the motion of the hand, but by the fact that their blending algorithm averages pixels. Another contributing factor is that the fixed cameras recording the videos are very close to the blackboards, so the degree of perspective distortion is significant. This algorithm might have worked if the fixed cameras were located at the back of the lecture room, but that could make the writing on the blackboard illegible.
Due to the ghosting effect, their implementation was not investigated further and thus no timing tests were taken.
5 Results
The performance of the stitching component of the system is evaluated in terms of its
execution time as well as the stitch quality of the videos. The results for these two categories are detailed here.
5.1 Stitching Pipeline Default Settings
The stitching pipeline has various configurable options, for example which blending method to use and which compositing surface the images are warped onto. These settings are stored in a text file which the user can edit before executing the system. The available options are listed in the table below.
Category | Default Value | Other Options
Use GPU | No | Yes
Work Megapixels | 0.6 | N/A
Seam Megapixels | 0.1 | N/A
Confidence Threshold | 0.5 | N/A
Feature Detector Type | surf | orb
Bundle Adjustment Cost Function | ray | reproj
Bundle Adjustment Refinement Mask | xxxxx | N/A
Perform Wave Correction | Yes | No
Wave Correction Type | Horizontal | Vertical
Warp Type | paniniA2B1 | plane, cylindrical, spherical, fisheye, stereographic, compressedPlaneA2B1, compressedPlaneA1.5B1, compressedPlanePortraitA2B1, compressedPlanePortraitA1.5B1, paniniA1.5B1, paniniPortraitA2B1, paniniPortraitA1.5B1, mercator, transverseMercator
Exposure Compensator Type | gainblocks | no, gain
Match Confidence | 0.3 | N/A
Seam Estimation Method | gccolor | no, voronoi, gccolorgrad, dpcolor, dpcolorgrad
Blend Type | multiband | no, feather
Blend Strength | 5 | N/A
Show Stitch Seams | No | Yes
Seconds of Video to Stitch | Variable | N/A
Seconds Between Pipeline Re-executions | Variable | N/A
FIGURE 28: Table showing the different options of the stitching pipeline.
The execution time and the stitch quality results use the default values of the table unless
otherwise stated.
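The exact layout of the settings file is not reproduced in this report. A plausible key-value layout covering the options above might look like the following; the key names are hypothetical, not the project’s actual keys.

use_gpu no
work_megapix 0.6
seam_megapix 0.1
conf_thresh 0.5
features surf
ba_cost_func ray
ba_refine_mask xxxxx
wave_correct horizontal
warp paniniA2B1
expos_comp gainblocks
match_conf 0.3
seam gccolor
blend multiband
blend_strength 5
show_seams no
seconds_to_stitch 60
seconds_to_reexecute 10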
5.1.1 Warp Type
Due to the number of options that OpenCV offers for warp type, an image was stitched
using all the different options. It was found that paniniA2B1 offered the best result and
thus is set as the default warp type. Note that the pairs compressedPlaneA2B1/compressedPlaneA1.5B1, compressedPlanePortraitA2B1/compressedPlanePortraitA1.5B1 and paniniPortraitA2B1/paniniPortraitA1.5B1 each produced results similar to their counterpart.
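These warp type names correspond to warper creators in OpenCV 2.4’s stitching module. A sketch of the mapping, following the pattern of OpenCV’s stitching_detailed.cpp sample (only a few of the options from the table are shown), might be:

#include <opencv2/stitching/warpers.hpp>
#include <string>

// Map a warp-type name onto an OpenCV 2.4 warper creator.
cv::Ptr<cv::WarperCreator> makeWarper(const std::string& type) {
    if (type == "plane")        return new cv::PlaneWarper();
    if (type == "cylindrical")  return new cv::CylindricalWarper();
    if (type == "spherical")    return new cv::SphericalWarper();
    if (type == "fisheye")      return new cv::FisheyeWarper();
    if (type == "paniniA1.5B1") return new cv::PaniniWarper(1.5f, 1.0f);
    return new cv::PaniniWarper(2.0f, 1.0f); // paniniA2B1, the default used here
}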
[FIGURE 29 panels: compressedPlaneA2B1, compressedPlanePortraitA2B1, cylindrical, fisheye, mercator, paniniA1.5B1, paniniPortraitA2B1, plane, spherical, stereographic, transverseMercator, paniniA2B1]
FIGURE 29: A diagram displaying the results of different compositing surfaces provided by OpenCV.
PaniniA2B1 is chosen as the default warp type, since it provides the least distortion.
5.2 Execution Time
The specifications of the computer used and the properties of the input videos are listed below, followed by the execution time results.
Computer specifications
CPU: Intel Core i5-3470 @ 3.2 GHz (4 hardware threads)
RAM: 2 x 2 GB DDR3 1333 MHz
Linux swap space: 4000 MiB

Input video properties
Resolution (x2 videos): 640 x 360
Frames per second: 25

Output video properties
Stitched panoramic video resolution: 1283 x 758
5.2.1 Re-execution of the pipeline disabled
Reading # | 1 minute (s) | 2 minutes (s) | 5 minutes (s) | 10 minutes (s)
1 | 246.51 | 779.05 | 2425.07 | 7248.17
2 | 241.04 | 735.28 | 2838.96 | 8698.35
3 | 239.30 | 694.90 | 2336.81 | 6072.11
4 | 241.27 | 713.70 | 2669.71 | 8688.84
5 | 241.03 | 790.82 | 3432.42 | 8428.09
Average | 241.83 | 742.75 | 2740.59 | 7827.11
Standard deviation | 2.44 | 36.93 | 388.82 | 1027.43
Ratio | 4.03 times | 6.19 times | 9.14 times | 13.05 times
FIGURE 30: Table showing the times acquired from the multi-threaded system. The ratio is
calculated by the following formula: average / video length (in seconds).
Reading # | 1 minute (s) | 2 minutes (s) | 5 minutes (s)
1 | 794.60 | 1588.77 | 3964.65
2 | 794.83 | 1589.81 | 3976.51
3 | 794.42 | 1590.08 | 3985.30
4 | 794.20 | 1588.94 | 3975.64
5 | 795.05 | 1612.63 | 3974.71
Average | 794.62 | 1594.05 | 3975.36
Standard deviation | 0.30 | 9.31 | 6.56
Ratio | 13.24 times | 13.28 times | 13.25 times
FIGURE 31: Table showing the times acquired from the single-threaded system. The ratio is
calculated by the following formula: average / video length (in seconds).
The tables show the times achieved by the system when the entire stitching pipeline is
executed only once and the rest of the frames are simply blended together using the
calculated image masks and homography matrices.
The single-threaded design provides consistent results, as evidenced by its ratio of about 13.25 times across all video lengths. The multi-threaded design, however, does not provide consistent results: its ratio ranges from about 4 times for a 1 minute video to 13.05 times for a 10 minute video.
Extrapolating these results, the single-threaded design would stitch a 50 minute lecture video in considerably less time than the multi-threaded design. This seems contradictory; however, looking at the computer specifications, there are only 4 hardware threads available to share between the 10 threads the design creates. Additionally, the amount of RAM available is too little for the multi-threaded design, causing the swap space to be used repeatedly. It is worth mentioning that the standard deviation increases dramatically as the length of the video to be stitched increases. This could be the result of the computer constantly moving data between the RAM and the swap space, producing varying execution times. The combination of swap space usage and thread swapping slows down execution dramatically, making the single-threaded design the better choice for stitching 50 minute lecture videos. None of the results here achieve the system constraint that the execution time be at most three times the length of one of the input videos.
It is important to note that these results were run on a computer with limited RAM and
CPU threads. Results should be different when executed on a mainframe or a server
computer, where RAM is more plentiful and more hardware threads are present. Testing
on more powerful computers is left for future work.
5.2.2 Re-execution of the pipeline enabled
The results shown here were obtained with the system re-executing the stitching pipeline every 10 seconds (or every 250 frames).
Reading # | 1 minute (s)
1 | 1369.64
2 | 1377.21
3 | 1381.47
4 | 1368.46
5 | 1355.11
Average | 1370.38
Standard deviation | 9.02
Ratio | 22.84 times
FIGURE 32: Table showing times acquired from the multi-threaded system with the stitching
pipeline re-executed every 10 seconds. The ratio is calculated by the following formula: average /
video length (in seconds).
Reading # | 1 minute (s) | 2 minutes (s)
1 | 2836.24 | 8200.81
2 | 2963.40 | 8212.73
3 | 2831.47 | 8247.67
4 | 2832.57 | 8210.75
5 | 2834.19 | 8204.96
Average | 2859.57 | 8215.38
Standard deviation | 51.94 | 16.69
Ratio | 47.66 times | 68.46 times
FIGURE 33: Table showing times acquired from the single-threaded system with the stitching
pipeline re-executed every 10 seconds. The ratio is calculated by the following formula: average /
video length (in seconds).
It is immediately noticeable that the ratios are very high; the system’s execution time constraint is not satisfied at all. While attempting to acquire times for the multi-threaded design when stitching a 2 minute video, an OpenCV error occurred:
OpenCV Error: Insufficient memory (Failed to allocate 1493639172 bytes) in OutOfMemoryError, file /home/ttsen/Downloads/opencv-2.4.9/modules/core/src/alloc.cpp, line 52 terminate called after throwing an instance of 'cv::Exception'
what(): /home/ttsen/Downloads/opencv-2.4.9/modules/core/src/alloc.cpp:52: error: (-4) Failed to allocate 1493639172 bytes in function OutOfMemoryError
This error occurred during the gain compensation stage of the stitching pipeline. It is most likely related to the fact that the multi-threaded version passes image masks and homography matrices by value to threads, combined with the limited RAM of the computer, resulting in all available memory being allocated. However, this does not explain the single-threaded version, whose times are also extremely high.
We noticed that in both the single- and multi-threaded versions, the pipeline would occasionally “stall” when reaching the gain compensation stage. Compensating for gain usually takes less than 1 second to execute, but when the pipeline does stall, times in this stage vary from 100 seconds to 1000 seconds. Presumably this is because the computer is waiting for memory to free up before it can allocate new memory.
This issue might be related to multi-threading, but there are definitely memory allocation problems. Future work could include destroying vectors that are no longer needed, checking for memory leaks and testing the system on a more powerful computer.
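As a minimal illustration of the “destroying vectors” suggestion, assuming the results vector holds the blended frames:

#include <opencv2/core/core.hpp>
#include <vector>

void releaseResults(std::vector<cv::Mat>& results) {
    // clear() drops the elements but may keep the capacity allocated;
    // swapping with an empty temporary releases the backing storage too.
    std::vector<cv::Mat>().swap(results);
}

Called immediately after the frames are written to the output video file, this would return the frame memory to the allocator before the next batch of threads is created.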
5.3 Stitch Quality
Various videos were passed through the stitching pipeline to see the stitch quality in
different environments.
5.3.1 Blend Type
OpenCV offers three methods for blending: multiband, feather and no blending. Videos were stitched using all three blending methods, and the multiband method gave the best visual results, so multiband is set as the default blend type. It is important to note that multiband blending takes the longest to execute, whereas no blending gives the quickest execution time.
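In OpenCV 2.4 (the version in use here, judging by the error message in Section 5.2.2), the three blend types map onto the detail API roughly as follows. This is a sketch of the standard API usage, not necessarily the project’s actual code.

#include <opencv2/stitching/detail/blenders.hpp>
#include <string>

// Create the blender matching the "Blend Type" setting.
cv::Ptr<cv::detail::Blender> makeBlender(const std::string& type, bool try_gpu) {
    using cv::detail::Blender;
    if (type == "multiband")
        return Blender::createDefault(Blender::MULTI_BAND, try_gpu); // best quality, slowest
    if (type == "feather")
        return Blender::createDefault(Blender::FEATHER, try_gpu);    // middle ground
    return Blender::createDefault(Blender::NO, try_gpu);             // fastest, visible seam
}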
[FIGURE 34 panels: no blending, feather blending, multiband blending]
FIGURE 34: A diagram displaying the results of different blending algorithms. The images in the right column have had their contrast and brightness adjusted in order to show the stitch seams more clearly. Multiband blending is chosen as the default blend type because its stitch seam is the least visually distracting.
5.3.2 Stitch Quality Results
Multiband blending provides good results in the sense that the stitch seam is barely visible. However, once the lecturer walks across the lecture room, the stitch seam becomes clearly visible. This is not because the blending algorithm is inadequate, but because the cameras are so close to the blackboards that the perspective distortion between the left and right cameras is significant. From FIGURE 13, we can see that if anything moves in front of the blackboard in the area of overlap, the object will either be sliced in half or not appear in the stitched video at all.
FIGURE 35: An example of a frame where the lecturer is standing in the stitch seam. This is quite
visually distracting.
FIGURE 36: If the pipeline is re-executed, the image masks and the homography matrices will
update to include the lecturer in the frame. As mentioned before, due to the closeness of the
cameras to the blackboard, perspective distortion is significant. Although the lecturer is fully in the
frame, the blackboard edges are not straight, which is visually distracting.
FIGURE 37: An example where the lecturer is standing in the stitch seam at the beginning of the video. When the lecturer moves away, a visually distracting stitch seam is left behind; this is what necessitated the fourth design of the stitching pipeline. If the stitching pipeline is re-executed once the lecturer moves away, the blended frame looks good.
FIGURE 38: A screenshot of a frame with writing on both blackboards. The stitch seam is apparent when looking at the tables, but on the blackboards themselves the blending algorithm has done a good job.
6 Conclusion
Lecture recording is a common tool many universities use for open courseware and education purposes. Lectures can be recorded using a PTZ camera, an automatic camera that is able to detect the lecturer and pan along with them. Alternatively, a cameraman can be used to record the lecture, a solution that is prone to human error.
Both methods stated above are expensive solutions for lecture recording. The Centre for Innovation in Learning and Teaching (CILT) has proposed a system where two or more fixed cameras are placed at the front of the lecture room, each viewing a different portion of the front of the room. The system takes the recorded videos as input and post-processes them in three stages: stitching, tracking and panning. Stitching combines all input videos into one panoramic video, tracking identifies and tracks the lecturer, and panning crops out a region of interest, namely the lecturer, from the panoramic frame. The cropped frame pans along with the lecturer, simulating both the PTZ camera and the cameraman solution. The cost of the proposed system is the cost of the fixed cameras used, which is much cheaper than either alternative.
The library used for the system is OpenCV. Since OpenCV is distributed under an open source BSD license, no costs were incurred in programming the system. Additionally, the code we produce is open source, allowing other users to contribute to the system and thus promoting open courseware initiatives.
This paper focused on the stitching component of the proposed system. Various iterations
of the design of the stitching component were discussed. The first design involved
executing the entire stitching pipeline for each and every frame. This made execution
times extremely long. The second design involved executing the pipeline once and
reusing the calculated image masks and homography matrices. Execution times were
better, but still not good enough. The third design involved multithreading the blending
process in order to further improve execution times. The fourth design was necessitated
by the fact that sometimes the initial calculated image masks and homography matrices
were not optimal. This occurs when the lecturer stands in the stitch seams. This design
gave the user the option to re-execute the entire stitching pipeline every x seconds.
The stitching pipeline was evaluated in terms of its execution times and stitch quality. Execution times were acquired with re-execution of the pipeline both enabled and disabled.
With re-execution of the pipeline disabled, execution times vary wildly for the multi-threaded solution, whereas the single-threaded solution gave consistent timings. This was due to the computer used having only 4 hardware threads and 4 GB of RAM. Even though its timings vary wildly, the multi-threaded solution did stitch videos of 1 minute, 2 minutes and 5 minutes in length faster than the single-threaded solution.
Few timings were acquired with re-execution of the pipeline enabled, because the execution times were extremely high. This may be due to the computer having only 4 GB of RAM, but there are definitely memory leaks occurring.
Multiband blending gives the best stitch quality results, although it also takes the longest to execute. The stitch seam is visible only when the lecturer walks across it; the option to re-execute the stitching pipeline is available to rectify this problem. It is important to note that the writing on the blackboards is legible and undistorted, with no evidence of ghosting effects. Since the writing on the blackboard is the most important content, the lecturer being cut in half at the stitch seam is probably not a big issue.
With better execution times and slightly better stitch quality results, the system is a viable option for lecture recording: it simulates both the PTZ camera and the cameraman solution while being cheaper than either.
7 Future Work
Since the stitching pipeline has many parameters, there is a possibility that a different
combination of parameters could give better results. Experimenting with the different
parameters could also yield better execution times.
In the multi-threaded design of the stitching pipeline, at most 10 threads are created at any point in time, with each thread stitching at most 100 frames. Allowing the user to choose the number of threads, as well as how many frames each thread stitches, would enable the system to run well on a variety of computer specifications.
The stitching component has been programmed to accept two or more videos as input.
However, the system has only been tested on at most two input videos. More testing is
required when accepting three or more videos.
One of the drawbacks of having the cameras so close to the blackboard is that the perspective distortion between the cameras is significant. It would be useful to record videos with the cameras at different positions to see what effect this has on the stitch seams.
FIGURE 39: A method to position the cameras to try to reduce perspective distortion when the
lecturer stands in the stitch seam. This is left for future work.
Due to time constraints, the system has not been executed on a full-length lecture video (about 45 – 50 minutes). The system needs to be tested on such videos in order to give a clearer picture of execution times.
Re-execution of the stitching pipeline will not always yield better results; this occurs when the lecturer walks from one side of the lecture room to the other. A way to counter this is to incorporate machine learning into the stitching pipeline. By having a learning algorithm learn what a good image mask and a good homography matrix look like, the results of a re-execution that gives worse output can simply be discarded. This would allow the stitching of frames to improve over time.
References
[1] S. Marquard, “Matterhorn 2014 Unconference: Ideas for automated post-recording
video handling.,” 19 March 2014. [Online]. Available:
http://www.slideshare.net/smarquard/matterhorn-unconf2014-fixitinpost/. [Accessed
20 October 2014].
[2] S. Peleg and J. Herman, “Panoramic mosaics by manifold projection,” Computer
Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society
Conference on, pp. 338-343, IEEE, 1997.
[3] D. Steedly, C. Pal and R. Szeliski, “Efficiently registering video into panoramic
mosaics,” Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on,
vol. 2, pp. 1300-1307, 2005.
[4] R. Szeliski, “Video mosaics for virtual environments,” Computer Graphics and
Applications, IEEE, vol. 16, no. 2, pp. 22-30, 1996.
[5] P. J. Burt and E. H. Adelson, “A multiresolution spline with application to image
mosaics,” ACM Transactions on Graphics (TOG), vol. 2, no. 4, pp. 217-236, 1983.
[6] H.-Y. Shum and R. Szeliski, “Systems and experiment paper: Construction of
panoramic image mosaics with global and local alignment,” International Journal of
Computer Vision, vol. 36, no. 2, pp. 101-130, 2000.
[7] M. Brown and D. G. Lowe, “Recognising panoramas,” ICCV, vol. 3, p. 1218, 2003.
[8] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant
features,” International journal of computer vision, vol. 74, no. 1, pp. 59-73, 2007.
[9] C. Schmid, R. Mohr and C. Bauckhage, “Evaluation of interest point detectors,”
International Journal of computer vision, vol. 37, no. 2, pp. 151-172, 2000.
[10] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International
journal of computer vision, vol. 60, no. 2, pp. 91-110, 2004.
[11] R. Szeliski, “Image alignment and stitching: A tutorial,” Foundations and Trends® in
Computer Graphics and Vision, vol. 2, no. 1, pp. 1-104, 2006.
[12] B. Triggs, P. McLauchlan, R. Hartley and A. Fitzgibbon, “Bundle adjustment—a
modern synthesis,” Vision algorithms: theory and practice, pp. 298-372, 2000.
[13] M. Uyttendaele, A. Eden and R. Szeliski, “Eliminating ghosting and exposure artifacts
in image mosaics,” Computer Vision and Pattern Recognition, 2001. CVPR 2001.
Proceedings of the 2001 IEEE Computer Society Conference on, vol. 2, pp. II-509, 2001.
[14] W. Xu and J. Mulligan, “Performance evaluation of color correction approaches for
automatic multi-view image and video stitching,” Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on, pp. 263-270, 2010.
[15] lukeyeager, “lukeyeager/StitcHD · GitHub,” 8 May 2012. [Online]. Available:
https://github.com/lukeyeager/StitcHD. [Accessed 28 October 2014].