CSC4000W – HONOURS IN COMPUTER SCIENCE
HONOURS PROJECT REPORT
VIRTUAL PANNING FOR LECTURE RECORDING
STITCHING COMPONENT
Author:
Terry Tsen (TSNTER002)
Supervisor:
A/Prof. Patrick Marais
Category | Min | Max | Chosen
1 Requirement Analysis and Design | 0 | 20 | 18
2 Theoretical Analysis | 0 | 25 | 0
3 Experiment Design and Execution | 0 | 20 | 0
4 System Development and Implementation | 0 | 15 | 12
5 Results, Findings and Conclusion | 10 | 20 | 17
6 Aim Formulation and Background Work | 10 | 15 | 13
7 Quality of Report Writing and Presentation | 10 | 10 | 10
8 Adherence to Project Proposal and Quality of Deliverables | 10 | 10 | 10
9 Overall General Project Evaluation | 0 | 10 | 0
Total marks | | 80 | 80
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF CAPE TOWN
2014
Abstract
This report investigates the viability of a virtual panning system for lecture recording, focusing in detail on the stitching component of the system. Many lectures are recorded using a pan-tilt-zoom (PTZ) camera which automatically identifies and tracks the presenter to keep them within the field of view. Cameramen are also used to achieve the same purpose. However, both of these solutions are expensive, and the cameraman solution is additionally prone to human error. The Centre for Innovation in Learning and Teaching at the University of Cape Town has proposed a system which emulates the results of the PTZ camera. The system involves placing fixed cameras in the lecture room that view different parts of the front of the venue. After the lecture is recorded, the videos are post-processed in three stages: stitching, tracking and panning. The videos are stitched into one panoramic video, the lecturer is identified and tracked within the panoramic video, and finally a cropped frame of the panoramic video is taken to produce a video similar to one recorded by a PTZ camera. The stitching component is evaluated in terms of its execution time and stitch quality. Although the execution time is not ideal, the proposed system is a cheaper alternative to the PTZ camera and cameraman solutions.
ACKNOWLEDGEMENTS
I would like to thank A/Prof. Patrick Marais of the Department of Computer Science at the University of Cape Town for his input and guidance on this project, drawing on his extensive experience in computer vision.
Secondly, I would like to acknowledge Stephen Marquard and his team at the Centre for
Innovation in Learning and Teaching for providing system specifications as well as lecture
video recordings on which we tested our system.
Thirdly, I would like to thank my team members Jared Norman and Chris Pocock for
providing motivation to keep us going through the toughest of times.
Lastly, I would like to thank my parents for allowing me the opportunity to study at the
University of Cape Town which allowed me to be the best that I can be.
Table of Contents

List of Figures
1 Introduction
  1.1 Virtual Panning for Lecture Recording
  1.2 Stitching
    1.2.1 Image Stitching
    1.2.2 Video Stitching
  1.3 The Stitching Pipeline
  1.4 Aim
  1.5 Report Overview
2 Background
  2.1 Applications of Image/Video Stitching
  2.2 Overview of Stitching Algorithms
  2.3 Overview of the Stitching Pipeline
  2.4 Feature Detection
  2.5 Feature Matching
  2.6 Homography Estimation
  2.7 Bundle Adjustment
  2.8 Image Warping
  2.9 Gain Compensation
  2.10 Blending
  2.11 Evaluation of Visual Quality of Stitched Images
  2.12 Summary
3 Design
  3.1 System Overview
    3.1.1 System Constraints
    3.1.2 Functional Requirements
    3.1.3 Non-functional Requirements
    3.1.4 Assumptions
    3.1.5 Key Success Factors
    3.1.6 Input/output of the Components in the System
  3.2 Stitching Component
    3.2.1 Aim
    3.2.2 Stages of the Stitching Pipeline
    3.2.3 First Design
    3.2.4 Second Design
    3.2.5 Third Design
    3.2.6 Fourth Design
4 Implementation
  4.1 First Iteration
  4.2 Second Iteration
  4.3 Third Iteration
  4.4 Fourth Iteration
  4.5 Other Related Works
5 Results
  5.1 Stitching Pipeline Default Settings
    5.1.1 Warp Type
  5.2 Execution Time
    5.2.1 Re-execution of the pipeline disabled
    5.2.2 Re-execution of the pipeline enabled
  5.3 Stitch Quality
    5.3.1 Blend Type
    5.3.2 Stitch Quality Results
6 Conclusion
7 Future Work
References
List of Figures

FIGURE 1: Feature detection performed on the left side of a lecture room.
FIGURE 2: Feature detection performed on the right side of a lecture room.
FIGURE 3: Illustration of keypoint descriptors.
FIGURE 4: Feature matching performed on a lecture room.
FIGURE 5: A panoramic image without gain compensation.
FIGURE 6: A panoramic image with gain compensation.
FIGURE 7: Illustration of a high-frequency band-pass component.
FIGURE 8: Illustration of a medium-frequency band-pass component.
FIGURE 9: Illustration of a low-frequency band-pass component.
FIGURE 10: Illustration of a multi-band blended image.
FIGURE 11: A panoramic image with gain compensation and multi-band blending.
FIGURE 12: OpenCV logo.
FIGURE 13: A bird’s-eye view of the setup and location of the cameras.
FIGURE 14: Illustration of the different components of the proposed system.
FIGURE 15: Diagram illustrating the OpenCV pipeline.
FIGURE 16: Class diagram illustrating the different classes in the stitching pipeline.
FIGURE 17: Diagram illustrating the different stages of the stitching pipeline.
FIGURE 18: Diagram showing the use of image masks.
FIGURE 19: Diagram illustrating the second design of the stitching pipeline.
FIGURE 20: Diagram illustrating the third design of the stitching pipeline.
FIGURE 21: Diagram illustrating the fourth design of the stitching pipeline.
FIGURE 22: Pseudocode for the first iteration of the stitching pipeline.
FIGURE 23: Pseudocode for the second iteration of the stitching pipeline.
FIGURE 24: Pseudocode for the third iteration of the stitching pipeline.
FIGURE 25: Pseudocode for the fourth iteration of the stitching pipeline.
FIGURE 26: Screenshot taken from two videos recorded by two cameras.
FIGURE 27: The result of the StitcHD code.
FIGURE 28: Table showing the different options of the stitching pipeline.
FIGURE 29: Diagram displaying the results of different compositing surfaces.
FIGURE 30: Table showing the times acquired from the multi-threaded system.
FIGURE 31: Table showing the times acquired from the single-threaded system.
FIGURE 32: Table showing times acquired from the multi-threaded system with the stitching pipeline re-executed every 10 seconds.
FIGURE 33: Table showing times acquired from the single-threaded system with the stitching pipeline re-executed every 10 seconds.
FIGURE 34: Diagram displaying the results of different blending algorithms.
FIGURE 35: Example of a frame where the lecturer is standing in the stitch seam.
FIGURE 36: Example of a frame where the stitching pipeline is re-executed.
FIGURE 37: Example showing a visually distracting stitch seam.
FIGURE 38: Example of a good blended frame.
FIGURE 39: Alternate positioning of the cameras.
1 Introduction
Lecture recordings have become a common educational tool. Pan-tilt-zoom (PTZ) cameras are used to record lectures while keeping the lecturer within the field of view, and cameramen have been employed to the same effect. A cheaper solution is to stitch together the videos from two or more static cameras, producing a panoramic video of the front of the lecture room.
1.1 Virtual Panning for Lecture Recording
The Centre for Innovation in Learning and Teaching (CILT) is a unit within the University of Cape Town (UCT) that addresses teaching and learning challenges within the university. The unit is responsible, among many other things, for the recording of lectures. Many lecture venues have multiple cameras recording different parts of the front of the venue. CILT has proposed a system that will:
- Stitch the videos from the multiple cameras together, effectively creating a “panoramic” video.
- Identify and track the lecturer within the panoramic video.
- Cut a cropped video out of the panoramic video, which pans along with the lecturer as the lecturer moves.
Note that this occurs after the lecture has been recorded, i.e. as post-processing, so the proposed system need not run in real-time.
A pan-tilt-zoom (PTZ) camera can be used to achieve the result described above. Compared to a single, static camera it has the advantage that the lecturer is always kept in the field of view, which allows viewers to see what the lecturer may be writing on the blackboard. However, PTZ cameras are very expensive compared to, for example, three static cameras [1]. Additionally, software that manages the PTZ camera needs to be purchased as well, making the cost of such a solution for lecture recording even higher.
Another method which produces the same result is to have a cameraman sit in the lecture and pan the camera as the lecturer walks. Although this may be cheaper than the PTZ camera solution, it is still more expensive than having two or more fixed cameras placed at the front of the lecture room. Additionally, the cameraman solution is subject to human error, which an automated system avoids.
Although not tackled in the proposed system, post-processing also allows a video’s colour, white balance and contrast to be improved, compensating for the poor lighting often found in lecture venues. Post-processing can also correct image perspectives and lens artefacts [1].
Since the videos are post-processed, emphasis is placed on producing a quality panoramic video rather than on execution time. However, the execution time must still be reasonable for the quality of video produced; taking a day to stitch a 50-minute lecture video, for example, would be unreasonable.
1.2 Stitching
Image and video stitching is a topic in the field of computer vision that has been
extensively researched since the 1990s. Many efficient algorithms have since been
developed and most of the algorithms can be run in real-time, especially with today’s
computing power.
1.2.1 Image Stitching
Image stitching involves taking two or more images as input and outputting a single, panoramic image. Since the field of view of a camera is smaller than the human field of view [2], stitching images together can provide a view of the environment as humans would see it. It also provides a way of presenting a large object in a single picture when multiple camera shots are needed to photograph the entire object.
1.2.2 Video Stitching
In some cases, panoramic images are created from video clips: a single video is taken as input and a panoramic image is output. A person records a short video of a large building or landmark, for example, and the program extracts certain frames from the video and stitches these frames into a panoramic image. This method is documented by Drew Steedly et al. [3]. Richard Szeliski mentions that using videos in this way allows for the creation of virtual reality environments, computer game settings and movie special effects [4]. This is possible because the motion of the camera while recording the video allows for depth recovery and limited 3D rendering.
1.3 The Stitching Pipeline
The stitching pipeline occurs in stages, with each successive step relying on the output of the previous one (a minimal OpenCV sketch of the complete pipeline follows this list):
- Features must first be identified in each of the input images.
- The identified features must then be matched between the input images.
- The matched features are then used to calculate a homography (projection) matrix between matched images.
- All matched images then undergo bundle adjustment together, which solves for the camera parameters jointly.
- A compositing surface must be chosen on which the images are stitched. Some examples of stitching surfaces include planes and cylinders.
- At this stage, the images undergo gain compensation: the brightness and contrast of the images are normalised to minimise the visibility of stitch seams.
- During the stitching of the images, blending is often done to minimise the visibility of the stitch seams even further.
- The panoramic stitched image is output.
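As a concrete illustration, OpenCV’s high-level cv::Stitcher interface runs this entire pipeline in a few calls. The following is a minimal sketch against OpenCV 2.4.x (the version used in this project), not the VIRPAN code itself; the input file names are hypothetical.

#include <iostream>
#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/stitching/stitcher.hpp>

int main() {
    // Load the overlapping input images (hypothetical file names).
    std::vector<cv::Mat> images;
    images.push_back(cv::imread("lecture_left.jpg"));
    images.push_back(cv::imread("lecture_right.jpg"));

    // createDefault(false) builds a CPU-only pipeline that performs feature
    // detection and matching, homography estimation, bundle adjustment,
    // warping, exposure (gain) compensation and blending internally.
    cv::Stitcher stitcher = cv::Stitcher::createDefault(false);

    cv::Mat panorama;
    cv::Stitcher::Status status = stitcher.stitch(images, panorama);
    if (status != cv::Stitcher::OK) {
        std::cerr << "Stitching failed, status code " << status << std::endl;
        return 1;
    }
    cv::imwrite("panorama.jpg", panorama);
    return 0;
}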
1.4 Aim
The aim of the proposed system is to take in two or more videos as input and output a
video that is qualitatively comparable to a video that is recorded by a PTZ camera or a
cameraman.
In terms of the stitching component, the aim is to produce a panoramic video that has
minimal visual artefacts and/or distortions, in order to give the illusion that the panoramic
video was taken with a single, static camera. Evaluation of the stitched, panoramic video is
done qualitatively by visual inspection. The execution time of the stitching pipeline is also measured to determine whether it is reasonable.
1.5 Report Overview
In section 2, an overview of image stitching algorithms is given as well as some
applications of stitching. The stitching pipeline is then described in detail. Section 3
describes the design of the system with the constraints of the system listed. The design of
the stitching component of the system will be shown in detail, with reasons described for
the iterations of the stitching component. Section 4 presents the implementation of the stitching component, with pseudocode to illustrate the process. In Section 5, run times as well as stitch quality are examined. Sections 6 and 7 detail the viability of the proposed system and the future work that can be done in this area.
2 Background
Before the stitching pipeline is covered in detail, some applications of image stitching and general stitching algorithms are discussed.
2.1 Applications of Image/Video Stitching
Image stitching has been used to create panoramic views of Jupiter and Saturn from the multiple images taken of them by the two Voyager spacecraft. Landsat photographs have also been stitched together to create panoramic views of Earth. Peter J. Burt et al. [5] mention, with reference to these two examples, that image stitching allows for the construction of images with a larger field of view or level of detail than could be obtained with a single photograph.
Heung-Yeung Shum et al. [6] give some applications of image stitching, including creating panoramic views of aerial and satellite photographs, performing scene stabilisation and change detection, and creating virtual environments and virtual travel. They also mention techniques for capturing panoramic images of real-world scenes directly: using a panoramic camera to capture a cylindrical panoramic image onto a long film strip, using a fisheye lens with a very large field of view, or using mirrored pyramids and parabolic mirrors [6]. These techniques require expensive hardware, which is sometimes not feasible. A cheaper alternative is to use a single camera to photograph different parts of the scene and pass the photographs through a stitching algorithm to produce a panoramic image [6].
2.2 Overview of Stitching Algorithms
Stitching algorithms fall broadly into two camps: direct methods and feature based methods. Direct methods try to match pixels between images, iteratively refining estimates of the camera parameters. This approach has the advantage of using all of the available data, but it requires some form of initialisation and is very susceptible to varying brightness conditions [7] [8].
Feature based methods find features and the correspondences between them in order to find image matches. This approach does not require initialisation, but the features found are usually susceptible to illumination, zoom and rotation changes [7] [8].
Once the images are matched using one of the two methods, most stitching algorithms follow the same steps thereafter: all matched images undergo bundle adjustment, then are warped and blended onto a compositing surface [3] [7] [8].
2.3 Overview of the Stitching Pipeline
The stitching pipeline involves many stages, with each stage making use of the information found in the previous stage to fulfil its purpose in the pipeline.
- Feature detection: detect features within all input images.
- Feature matching: match features between the different images, with the intention of finding which images are part of the same panorama. Note that for stitching to succeed there must be some area of overlap between the images; if there is no overlap, this step will find no features in common and the pipeline cannot continue.
- Homography estimation: take the matched features from the previous step and estimate the camera parameters between pairs of images.
- Bundle adjustment: solve for all the camera parameters jointly. This is done because the camera parameters in the previous step are estimated between pairs of images, which can cause errors to accumulate in panoramas involving more than two images. An example is a 360 degree panorama, where accumulated errors can cause the ends of the panorama to not join up [7].
- Image warping: choose a stitching surface onto which the panorama is stitched.
- Gain compensation: solve for a photometric parameter. Essentially, the brightness and contrast of all the images are normalised in order to minimise the visibility of stitch seams.
- Blending: blend pixels along the stitch seams to minimise their visibility even further.
2.4 Feature Detection
Features (also known as interest points) are points in the image where the signal changes
two-dimensionally [9]. These points include corners of objects, dots and any location with
significant 2D texture.
Interest point detectors and corner detectors can be divided into three categories: contour based methods, intensity based methods and parametric model methods [9]. Contour based methods extract contour lines from the image and search for maximal curvature or inflexion points along the lines [9]. Intensity based methods find features by examining pixel grey values [9]. Parametric model methods fit a parametric intensity model to the image signal [9].
Cordelia Schmid et al. [9] describe two criteria for evaluating interest point detectors:
repeatability and information content. Repeatability measures how independent the detector is of changes in imaging conditions, such as illumination; a high score means that the detector is largely unaffected by such changes and repeats the same results under different conditions. Information content measures how distinctive an interest point is; a detector that scores low on this metric finds features that can be matched to almost any other feature, so a high score is preferred. By subjecting many detectors to these criteria, they found that the Harris detector scores highly on both.
However, Matthew Brown et al. [7] note that the Harris detector is not invariant to changes in image scale. David Lowe [10] has thus proposed a method for “extracting
distinctive invariant features to perform reliable matching”. His approach is titled the Scale
Invariant Feature Transform (SIFT) and features found using this method are invariant to
scale and rotation, and are robust to affine distortion, change in viewpoint, noise and
change in illumination [10]. This is done by transforming found features into “scale-
invariant coordinates relative to local features” [10].
FIGURE 1: Features detected on the left side of the lecture room. The picture is in greyscale because features are easier to detect when there is only one colour channel. Feature detection done using OpenCV.
FIGURE 2: Features detected on the right side of the lecture room.
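To make this step concrete, the sketch below detects SIFT keypoints and computes their descriptors on a greyscale frame with OpenCV 2.4.x. This is a minimal sketch, assuming the nonfree module is available (SIFT lives there in this version); the file name is hypothetical.

#include <vector>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp>  // SIFT lives in the nonfree module in 2.4.x

int main() {
    // Load the frame in greyscale, as in FIGURE 1.
    cv::Mat frame = cv::imread("lecture_left.png", CV_LOAD_IMAGE_GRAYSCALE);

    // Detect scale- and rotation-invariant keypoints.
    cv::SiftFeatureDetector detector;
    std::vector<cv::KeyPoint> keypoints;
    detector.detect(frame, keypoints);

    // Compute the descriptor vector for each keypoint (used later for matching).
    cv::SiftDescriptorExtractor extractor;
    cv::Mat descriptors;
    extractor.compute(frame, keypoints, descriptors);

    // Visualise the detected features.
    cv::Mat visual;
    cv::drawKeypoints(frame, keypoints, visual);
    cv::imwrite("features.png", visual);
    return 0;
}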
2.5 Feature Matching
Once features are detected using the SIFT algorithm, they are matched between images on the basis of their location, scale and orientation [10]. SIFT features can be matched with good probability because the keypoint descriptor associated with each feature is highly distinctive.
FIGURE 3: An example of how keypoint descriptors are created [10]. In this example, the image gradients are grouped into a 2x2 array and summed together. The lengths of the arrows denote the magnitudes of the gradients.
Since the same scene point may appear in multiple images, each feature is matched to its k nearest neighbours using a k-d tree [8].
Drew Steedly et al. [3] mention that feature detectors often identify several hundred features in each image, which results in many matches between images with large overlaps. To improve efficiency, they group together features within a small area and replace the feature points in the group with their centroid.
FIGURE 4: The green lines represent the features that are matched with a high confidence level.
Feature matching done using OpenCV.
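A k-nearest-neighbour matching step of this kind can be written against OpenCV’s FLANN-based matcher, which indexes descriptors with k-d trees. The sketch below keeps a match only when the best neighbour is clearly closer than the second best, a common acceptance rule (Lowe’s ratio test [10]); the 0.8 threshold is an assumption, not a value from this project.

#include <vector>
#include <opencv2/features2d/features2d.hpp>

// Match SIFT descriptors from two frames: find each feature's k = 2 nearest
// neighbours, then apply the ratio test to discard ambiguous matches.
std::vector<cv::DMatch> matchFeatures(const cv::Mat& descriptors1,
                                      const cv::Mat& descriptors2) {
    cv::FlannBasedMatcher matcher;  // approximate k-d tree search
    std::vector<std::vector<cv::DMatch> > knnMatches;
    matcher.knnMatch(descriptors1, descriptors2, knnMatches, 2);

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knnMatches.size(); ++i) {
        if (knnMatches[i].size() == 2 &&
            knnMatches[i][0].distance < 0.8f * knnMatches[i][1].distance) {
            good.push_back(knnMatches[i][0]);
        }
    }
    return good;  // high-confidence matches, as drawn in FIGURE 4
}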
2.6 Homography Estimation
Homography estimation is the process of estimating the image transformation parameters that best fit the features matched between images. Matthew Brown et al. [8] use a random sampling approach called random sample consensus (RANSAC), which estimates the homography matrix using direct linear transformations. The sampling is done over a number of iterations, and the solution chosen is the one that gives the largest number of RANSAC inliers: feature matches that are geometrically consistent [8]. These solutions also contain RANSAC outliers, features that lie inside the area of overlap but are not consistent [8].
They then perform a verification test using a probabilistic model that determines whether the set of inliers/outliers was generated by a correct match or by a false match.
Once the images are correctly paired together, they are passed to the next stage of the
stitching pipeline.
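In OpenCV this step is a single call. The sketch below estimates the homography from the matched keypoints with RANSAC; the inlier mask it returns marks the geometrically consistent matches described above. The 3-pixel reprojection threshold is an assumption.

#include <vector>
#include <opencv2/calib3d/calib3d.hpp>
#include <opencv2/features2d/features2d.hpp>

// Estimate the 3x3 homography mapping frame 1 onto frame 2 from matched
// keypoints. On return, inlierMask marks the RANSAC inliers.
cv::Mat estimateHomography(const std::vector<cv::KeyPoint>& keypoints1,
                           const std::vector<cv::KeyPoint>& keypoints2,
                           const std::vector<cv::DMatch>& matches,
                           cv::Mat& inlierMask) {
    std::vector<cv::Point2f> points1, points2;
    for (size_t i = 0; i < matches.size(); ++i) {
        points1.push_back(keypoints1[matches[i].queryIdx].pt);
        points2.push_back(keypoints2[matches[i].trainIdx].pt);
    }
    // A match counts as an inlier if its reprojection error is below 3 pixels.
    return cv::findHomography(points1, points2, CV_RANSAC, 3.0, inlierMask);
}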
2.7 Bundle Adjustment
The previous step estimated the camera parameters (homography) between pairs of
matching images. This can cause errors to accumulate, causing 360 degree panoramas to
not join up at the ends. This step aims to minimise the accumulated errors by
simultaneously minimizing the misregistration between overlapping images [6] [11].
Bill Triggs et al. [12] define bundle adjustment as “the problem of refining a visual
reconstruction to produce jointly optimal 3D structure and viewing parameter estimates”.
Optimal means that the errors are minimised according to some cost function and jointly
means that the solution found is optimal across all images and camera parameters.
Images are added to the bundle adjuster one by one, with the most recent image being
the best matching image (maximum number of consistent matches). This image is
initialised with the same rotation and focal length as the image to which it best matches
[7] [8]. The parameters are then updated using a cost function which must be minimised
at each step to give optimal camera parameters. This is done repeatedly until all images
have been added to the bundle adjuster.
2.8 Image Warping
Once all the input images have been registered with respect to each other, a compositing
surface needs to be chosen on which to stitch the images. If there are only a few images to stitch, one image is usually chosen as the reference frame and the rest of the images are warped into its coordinate frame. This is known as a flat panorama [11], as the images are stitched onto a planar surface.
Shmuel Peleg et al. [2] mention that if the camera undergoes a pure translation while taking pictures, then a plane is a good surface to stitch on; if the camera rotates about a point, a cylinder is a good choice. However, cameras rarely undergo such perfect transformations, and the stitching surface is almost always a combination of a plane and a cylinder.
Richard Szeliski [4] gives a solution in which the images are divided into several large, overlapping regions, each represented by a planar surface onto which the images are stitched. A similar solution is to select a base frame and calculate each frame’s position with respect to it, choosing a new base frame periodically to perform the alignment.
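For the flat-panorama case, warping one frame into the reference frame is a single OpenCV call, sketched below. The canvas size must be chosen large enough to contain both frames; OpenCV’s stitching module also provides cylindrical and spherical warpers in its detail namespace for the non-planar cases.

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Warp src onto a planar compositing surface using the homography H
// (as estimated in the previous stages).
void warpOntoPlane(const cv::Mat& src, const cv::Mat& H,
                   const cv::Size& canvasSize, cv::Mat& dst) {
    cv::warpPerspective(src, dst, H, canvasSize);
}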
2.9 Gain Compensation
Matthew Brown et al. [8] describe this step as calculating a “photometric parameter”, as opposed to the previous steps, which calculate “geometric parameters”. Gain compensation aims to normalise the brightness and contrast across all the images so that the visibility of the stitch seams is minimised. This is necessary because cameras often adjust the brightness and contrast of the scene automatically when taking pictures.
FIGURE 5: A panoramic image without gain compensation [8]. The stitch seams are clearly visible
due to varying light conditions and contrast.
FIGURE 6: A panoramic image with gain compensation [8]. The stitch seams are not as visible. Stitch
seams are still visible near the sun due to unmodelled artefacts.
2.10 Blending
After gain compensation, there are still a number of unmodelled artefacts such as
vignetting (image brightness decreases towards the edges), parallax effects, mis-
registration errors, radial distortion [8] and ghosting (due to moving objects). Blending
aims to minimise these artefacts.
One method of blending is to simply take the average pixel value of all the images in the area of overlap. This is known as feather blending, and it does not handle the above-mentioned artefacts very well [11].
Peter J. Burt et al. [5] describe a blending algorithm, titled multi-band blending, that
handles the mentioned artefacts well. The idea behind the algorithm is to “blend low
frequencies over a large spatial range and high frequencies over a short range” [8]. This
means that fine details (such as the edges of objects) are blended within a short range
and coarse details are blended over a larger range. This is done by decomposing the
images into band-pass (frequency) component images [5].
Removing the ghosting caused by moving objects involves using pixel values from only one of the images contributing to the area of overlap [13], ensuring that only one copy of the object appears in the final panorama.
FIGURE 7: An image of an apple that represents a high-frequency band-pass component [5]. Note
that the detail of the apple does not “bleed” across the middle seam.
FIGURE 8: An image of an apple that represents a medium-frequency band-pass component [5].
The detail of the apple bleeds slightly across the middle seam.
FIGURE 9: An image of an apple that represents a low-frequency band-pass component [5]. Notice
the detail of the apple bleeds significantly across the middle seam.
FIGURE 10: An image of an apple that is the sum of the previous three figures [5]. This image is
ready to be blended with another image.
FIGURE 11: A panoramic image with gain compensation and multi-band blending [8]. Notice that the stitch seams near the sun have completely disappeared.
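As a point of comparison with multi-band blending, the sketch below implements the simple feather blend mentioned at the start of this section for two already-aligned frames: each pixel in the overlap is a weighted average, with weights ramping linearly across the overlap. This is a minimal sketch under the assumption of a purely horizontal overlap region.

#include <opencv2/core/core.hpp>

// Feather-blend two aligned 8-bit colour images whose overlap spans
// columns [x0, x1). Left of x0 only `left` contributes; right of x1 only
// `right` does; in between the weight ramps linearly.
cv::Mat featherBlend(const cv::Mat& left, const cv::Mat& right, int x0, int x1) {
    CV_Assert(left.size() == right.size() && left.type() == CV_8UC3);
    cv::Mat out = left.clone();
    for (int y = 0; y < out.rows; ++y) {
        for (int x = x0; x < out.cols; ++x) {
            float alpha = (x >= x1) ? 1.0f
                                    : static_cast<float>(x - x0) / (x1 - x0);
            cv::Vec3b l = left.at<cv::Vec3b>(y, x);
            cv::Vec3b r = right.at<cv::Vec3b>(y, x);
            for (int c = 0; c < 3; ++c) {
                out.at<cv::Vec3b>(y, x)[c] =
                    cv::saturate_cast<uchar>((1.0f - alpha) * l[c] + alpha * r[c]);
            }
        }
    }
    return out;
}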
2.11 Evaluation of Visual Quality of Stitched Images
Although stitch quality is evaluated by visual inspection in this report, there are quantitative methods for evaluating stitch seams. Wei Xu et al. [14] describe a good blending algorithm as one that transfers colours “from the overlapped area to the full target image without creating visual artefacts”.
They propose two criteria to evaluate the quality of stitched images: a colour similarity
index and a structure similarity index. The structure similarity index, for example, measures
if straight lines in the input images are kept as straight lines in the stitched output image.
A stitched image should have colours and structures similar to the original input images.
The colour similarity index is measured as a function of the peak signal-to-noise ratio
between the input and output images. A high score in this index means that the colours
are more similar between the images [14].
The structure similarity index is measured as a function of luminance, contrast and
structure components. This index gives a value between 0 and 1 with 1 being that there is
no structural difference between the input and output images [14].
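For reference, the standard formulations behind these two indices are shown below; Xu et al.’s exact variants may differ in detail:

\[
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}}\right),
\qquad
\mathrm{SSIM}(x, y) =
\frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}
     {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
\]

where \(\mathrm{MAX}_I\) is the maximum pixel value (255 for 8-bit images), \(\mathrm{MSE}\) is the mean squared error between the input and output images, \(\mu\), \(\sigma^2\) and \(\sigma_{xy}\) are the local means, variances and covariance of the two image patches \(x\) and \(y\), and \(C_1\), \(C_2\) are small stabilising constants.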
2.12 Summary
The following table summarises the papers discussed in this section by categorising them according to the parts of the stitching pipeline they describe. The abbreviations are: FD – Feature Detection, FM – Feature Matching, HE – Homography Estimation, BA – Bundle Adjustment, IW – Image Warping, GC – Gain Compensation, B – Blending.

Paper | Author(s) | Pipeline stages (FD FM HE BA IW GC B)
A Multiresolution Spline With Application to Image Mosaics [5] | Peter J. Burt et al. | ✔
Eliminating Ghosting and Exposure Artifacts in Image Mosaics [13] | Matthew Uyttendaele et al. | ✔
Bundle Adjustment – A Modern Synthesis [12] | Bill Triggs et al. | ✔
Distinctive Image Features from Scale-Invariant Keypoints [10] | David G. Lowe | ✔ ✔
Evaluation of Interest Point Detectors [9] | Cordelia Schmid et al. | ✔
Efficiently Registering Video into Panoramic Mosaics [3] | Drew Steedly et al. | ✔
Image Alignment and Stitching: A Tutorial [11] | Richard Szeliski | ✔ ✔ ✔ ✔ ✔ ✔
Panoramic Mosaics by Manifold Projection [2] | Shmuel Peleg et al. | ✔ ✔
Recognising Panoramas [7] | Matthew Brown et al. | ✔ ✔ ✔ ✔ ✔
Systems and Experiment Paper: Construction of Panoramic Image Mosaics with Global and Local Alignment [6] | Heung-Yeung Shum et al. | ✔ ✔ ✔
Automatic Panoramic Image Stitching using Invariant Features [8] | Matthew Brown et al. | ✔ ✔ ✔ ✔ ✔ ✔
Video Mosaics for Virtual Environments [4] | Richard Szeliski | ✔ ✔
3 Design
An overview of the system is given before the design of the stitching pipeline is elaborated.
3.1 System Overview
The system, titled Virtual Panning for Lecture Recording (VIRPAN), takes in two or more videos recorded by static cameras and processes them in three stages:
- Stitching: the videos are stitched to create one panoramic video.
- Tracking: the lecturer is identified and tracked in the panoramic video.
- Panning: the panoramic video is cropped to the area of interest, namely the lecturer. The cropped frame pans with the lecturer as the lecturer moves around.
The output of the entire system is the video containing the cropped frames that pan with the lecturer.
The system is developed using OpenCV, an open source computer vision library distributed under the BSD license. By using an open source library we are able to develop the system without incurring any costs, and we can release our source code once the system is completed, allowing other people to contribute. We use OpenCV version 2.4.9 and the system is programmed in C++.
FIGURE 12: The logo used by OpenCV (Open Source Computer Vision).
FIGURE 13: A bird’s-eye view of the setup and location of the cameras. Note that there has to be
some degree of overlap for stitching to be successful.
3.1.1 System Constraints
CILT has stated that the system should adhere to the following specifications:
- The system must run on Ubuntu.
- The system must not use a GPU to speed up processing time.
- The system must produce the completed output video in a time not longer than three times the length of one of the input videos. If the recorded lecture is 45 minutes long, the system must not take longer than 135 minutes to produce an output.
- The system post-processes the videos, so it need not run in real-time.
3.1.2 Functional Requirements
Taking the above constraints into account and what the system should accomplish, the
functional requirements are the following:
- The system must take two or more videos as input.
- The system must produce a cropped video which pans with the lecturer.
FIGURE 14: A diagram illustrating the different components of the system as well as the inputs and
output. It also illustrates the passing of information between the different components.
3.1.3 Non-functional Requirements
Taking the system constraints into account and what the system should accomplish, the
non-functional requirements are the following:
- The system must be able to run on Ubuntu.
- The system cannot use a GPU to accelerate processing times.
- The system cannot execute in a time longer than three times the length of one of the input videos.
3.1.4 Assumptions
Due to the nature of video recordings and the way the cameras can be set up in the lecture room, some assumptions need to be made to ensure that the system can produce a quality output video. The assumptions are as follows:
- All input videos are in sync, i.e. all videos start recording at the same time. One video cannot be “behind” the others, as this would make the videos almost impossible to stitch together.
- All input videos have the same frame rate (FPS). Videos with different frame rates have “sampled” the scene at different frequencies and may be misregistered during the stitching part of the system.
- All input videos have some degree of overlap, for stitching purposes.
- All input videos have a similar zoom level. This allows the videos to be stitched together without warping them too much.
- The cameras taking the videos are static and fixed in place.
- Since we are dealing with the visual quality of the videos, sound is not considered.
3.1.5 Key Success Factors
One success factor is that the system adheres to the system constraints. If CILT is satisfied with the output videos the system produces, and thus uses the system to post-process lecture recordings within UCT, then the system will also be considered successful. However, even if the system cannot adhere to the constraints, if CILT recognises the value the system can have within the realm of lecture videos and continues its development in the future, then this is also considered a successful outcome.
3.1.6 Input/output of the Components in the System
The inputs and outputs of the three components of the system are listed here to indicate how the different parts of the system pass information to each other.
Stitching:
- Input: two or more videos.
- Output: one panoramic video stream.
Tracking:
- Input: one panoramic video stream.
- Output: co-ordinates of the lecturer, written to a text file.
Panning:
- Input: one panoramic video stream and the text file containing the co-ordinates of the lecturer.
- Output: one cropped video that pans with the lecturer.
3.2 Stitching Component
The design of the stitching component will be detailed here. Since OpenCV is used, the
design of the stitching component is very similar to the design of the OpenCV stitching
pipeline.
3.2.1 Aim
The aim of the stitching component of the system is to combine two or more videos into one panoramic video, with minimal visual artefacts and/or distortions, in order to give the impression that the panoramic video had been taken with a single, static camera.
FIGURE 15: A diagram illustrating the OpenCV stitching pipeline, taken from the OpenCV website1. The stages of the stitching pipeline are based on this diagram.
3.2.2 Stages of the Stitching Pipeline
The seven stages of the stitching pipeline have been elaborated in Section 2.3. They are:
- Feature Detection
- Feature Matching
- Homography Estimation
- Bundle Adjustment
- Image Warping
- Gain Compensation
- Blending
A class was created for each stage of the pipeline and the system makes calls to them as
necessary.
1 http://docs.opencv.org/modules/stitching/doc/introduction.html
FIGURE 16: A class diagram illustrating the different classes in the stitching pipeline. The stitching
class creates one instance of each class and makes calls to them as necessary.
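A hypothetical skeleton of this one-class-per-stage structure is sketched below; the report does not give the actual class names, so these are illustrative only.

#include <vector>
#include <opencv2/core/core.hpp>

class FeatureDetection     { /* stage 1: find features in each frame              */ };
class FeatureMatching      { /* stage 2: match features between frames            */ };
class HomographyEstimation { /* stage 3: estimate pairwise homographies           */ };
class BundleAdjustment     { /* stage 4: refine all camera parameters jointly     */ };
class ImageWarping         { /* stage 5: warp frames onto the compositing surface */ };
class GainCompensation     { /* stage 6: normalise brightness and contrast        */ };
class Blending             { /* stage 7: blend frames along the stitch seams      */ };

// The stitching class owns one instance of each stage class and calls them
// in pipeline order as necessary.
class Stitching {
public:
    cv::Mat stitchFrames(const std::vector<cv::Mat>& frames);
private:
    FeatureDetection detection;
    FeatureMatching matching;
    HomographyEstimation estimation;
    BundleAdjustment adjustment;
    ImageWarping warping;
    GainCompensation compensation;
    Blending blending;
};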
3.2.3 First Design
The initial design of the stitching component involved extracting a frame from each of the input videos and passing the extracted frames through the entire stitching pipeline. Once the frames were stitched, the panoramic frame was written to the output video. This process was repeated until all the frames from the input videos were processed.
This method of stitching videos was extremely slow. Another problem was that the stitched frame produced in each iteration could have different dimensions from the one before it, making it unwritable to the output video file: the output video expects frames of a fixed dimension, and frames not conforming to those dimensions are simply ignored. OpenCV does provide a mechanism to resize frames, but this method of stitching is already extremely slow, so resizing every frame would only exacerbate the problem.
FIGURE 17: A diagram illustrating the different stages of the stitching pipeline. This is also the first
design of the stitching pipeline.
3.2.4 Second Design
The next design of the stitching pipeline reuses the image masks calculated during the first iteration of the pipeline. The idea behind this design is that once the features have been found and the homography matrices estimated, there is no need to recalculate them for subsequent frames.
During the first iteration, a frame from each video is extracted and passed through the entire stitching pipeline. All subsequent frames reuse the image masks and homography matrices calculated in the first iteration, so only the blending stage is re-executed.
This method is much faster than the previous design, since the first six stages of the pipeline are executed only once. However, the execution time still falls short of the requirement that it be at most three times the length of one of the input videos.
FIGURE 18: A diagram showing the use of image masks. The image masks tell the blender which
part of an image to use in the final panoramic image, as shown in the last frame. The image
masks are used repeatedly to blend the rest of the video.
FIGURE 19: A diagram illustrating the second design of the stitching pipeline. The base stitching
pipeline is found in FIGURE 17. Image masks and homography matrices are reused to speed up
execution time.
3.2.5 Third Design
The third design multithreads the processing of frames. After the first iteration of the pipeline, blending is the only part of the pipeline that is re-executed.
To enable multithreading, frames are extracted from the input videos in batches, as opposed to one at a time as in the previous design. Each thread is given a batch of frames to blend, and once the threads have finished executing, the main thread writes all blended frames to the output video file.
Results from this design are discussed in Section 5.
FIGURE 20: A diagram illustrating the third design of the stitching pipeline. Image masks and
homography matrices are passed by reference because they never change.
3.2.6 Fourth Design
This design was necessitated by the fact that the initially calculated image masks and homography matrices may not be optimal. This occurs when the lecturer happens to stand in a stitch seam, which creates a distorted and unpleasant-looking stitched frame. This design allows users of the system to specify a time interval, in seconds, at which the stitching pipeline is re-executed.
Since re-executing the pipeline creates different image masks and homography matrices, the pipeline extracts the number of frames that share the same masks and matrices before re-executing. For example, if the input videos have 25 FPS and the user has specified that the entire pipeline be re-executed every 4 seconds, the pipeline is re-executed every 100 frames. Every batch of 100 frames must then be resized to the dimensions of the output video file, since it is highly unlikely that all blended frames have the same dimensions.
Each thread is given a number of frames and their corresponding image masks and
homography matrices. It is therefore possible to have multiple threads each blending
input frames using different image masks and homography matrices. Once the threads
have executed completely the main thread will write all blended frames to the output
video file.
Results from this design are discussed in Section 5.
FIGURE 21: A diagram illustrating the fourth design of the stitching pipeline. Image masks and homography matrices are passed by value, because they can change between batches. Note that if the thread vector is not full, there are no more frames that use the same image masks; otherwise the vector would have been filled.
4 Implementation
The implementation of the stitching pipeline is detailed here. The various iterations of the implementation are documented to show the evolution of the stitching pipeline, with reasons why each iteration was necessary.
Since the stitching pipeline is based on the OpenCV stitching pipeline, the pseudocode shown is similar to the OpenCV stitching code, which can be found on the OpenCV GitHub page2. As mentioned in Section 3, C++ is the programming language used and the Eclipse IDE is used for developing the VIRPAN system.
4.1 First Iteration
The first implementation of the stitching component involved extracting a frame from
each video and passing them through the entire stitching pipeline. The stitched
panoramic frame is then written to the output video file. This is repeatedly done until all
frames are stitched and written to the output file.
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Find the features in the frames
    Match the features
    Estimate the homography of the frames
    Perform bundle adjustment on the frames
    Warp the frames onto a surface
    Compensate for exposure and gain
    Blend the frames together
    if (this is the first iteration)
        Create an output video file
    Write stitched panoramic frame to output video file
    Increment frameNumber by 1
END while
FIGURE 22: Pseudocode for the first iteration of the stitching pipeline.
2 https://github.com/Itseez/opencv
Having each frame run through the entire stitching pipeline in every iteration is extremely slow. This prompted us to examine the pipeline in detail to see whether some steps could be skipped. We noticed that the blending stage uses the image masks and homography matrices calculated in the earlier stages of the pipeline. Since the cameras recording the videos are fixed in place, the camera parameters (such as rotation values) remain constant for the entire duration of the recording, so there is no need to recalculate the image masks and homography matrices.
4.2 Second Iteration
The second implementation of the stitching component involved passing the first frames
of each video through the pipeline in order to calculate the image masks and
homography matrices needed to stitch the frames together. Once this is done, the rest of
the input frames are simply blended together, using the masks and matrices calculated
during the first iteration of the pipeline.
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
Extract one frame from each video and store in a vector
Find the features in the frames
Match the features
Estimate the homography of the frames
Perform bundle adjustment on the frames
Warp the frames onto a surface
Compensate for exposure and gain
Blend the frames together
Create an output video file
Write stitched panoramic frame to output video file
Increment frameNumber by 1
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Blend the frames together, reusing the image masks and homography matrices
    Write stitched panoramic frame to output video file
    Increment frameNumber by 1
END while
FIGURE 23: Pseudocode for the second iteration of the stitching pipeline.
4.3 Third Iteration
Although the second iteration gave better execution times than the first, it still did not meet the constraint that the execution time be at most three times the length of one of the input videos. Multithreading was the next option for making execution faster.
Once the image masks and the homography matrices have been calculated, the pipeline re-executes only the blending stage for the rest of the input frames. This can be multithreaded, as each frame can be blended independently of the others. The constraint here is that the frames need to be written in sequence to the output video file.
To facilitate this, a vector is created in the main thread to store blended frames. Each
thread receives a reference to the vector and a number which states the index at which
the thread starts storing the blended frames. Once the threads have executed completely,
the main thread iterates through the vector sequentially and writes the blended frames to
the output video file.
The stitching pipeline creates at most 10 threads at any given time with each thread
blending at most 100 frames. These numbers were chosen arbitrarily.
Create a results vector to store blended frames
Initialise start_index to 0  // Index at which a thread starts storing frames
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Find the features in the frames
    Match the features
    Estimate the homography of the frames
    Perform bundle adjustment on the frames
    Warp the frames onto a surface
    Compensate for exposure and gain
    Blend the frames together
    if (this is the first iteration)
        Create an output video file
    Store stitched/blended frame into results[start_index]
    Increment frameNumber by 1
    Increment start_index by 1
    int frames_to_stitch = number of frames to stitch - 1
    while (frames_to_stitch > 0)
        if (frames_to_stitch <= 100)
            if (number of active threads = 10)
                Wait for all threads to finish
                Write all blended frames in results to output video
                Set start_index to 0
            END if
            Extract (frames_to_stitch) number of frames, store in vector
            Create a thread to blend the frames
            start_index += frames_to_stitch
            frameNumber += frames_to_stitch
            Set frames_to_stitch to 0
        ELSE if (frames_to_stitch <= 1000)
            int complete_threads = frames_to_stitch / 100
            if (number of active threads + complete_threads > 10)
                Wait for all threads to finish
                Write all blended frames in results to output video
                Set start_index to 0
            END if
            for (0 to complete_threads)
                Extract 100 frames each from input videos
                Create a thread to blend the frames
                start_index += 100
                frameNumber += 100
                frames_to_stitch -= 100
            END for
        ELSE  // 10 full threads can be created
            if (number of active threads != 0)
                Wait for all threads to finish
                Write all blended frames in results to output video
                Set start_index to 0
            END if
            for (0 to 10)
                Extract 100 frames each from input videos
                Create a thread to blend the frames
                start_index += 100
                frameNumber += 100
                frames_to_stitch -= 100
            END for
        END if-else branch
    END while
END while
if (there are still active threads running)
    Wait for all threads to finish
    Write all blended frames in results to output video
    Set start_index to 0
END if
FIGURE 24: Pseudocode for the third iteration of the stitching pipeline. The results vector is passed
by reference into all threads and each thread gets a number (start_index) at which to write
blended frames. This ensures that all frames in the results vector are in order.
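Condensed into C++11, the batching pattern looks roughly like the sketch below. The helper blendBatch is hypothetical (the report does not show the real function); the key points are that each thread writes into a disjoint slot range of a preallocated results vector, so no locking is needed, and the main thread later writes the vector out in order.

#include <functional>
#include <thread>
#include <vector>
#include <opencv2/core/core.hpp>

// Hypothetical helper: blend each element of `batch` (one frame per camera)
// with the precomputed masks and homographies, storing the panorama into
// results[startIndex + i]. The body here is a placeholder.
void blendBatch(const std::vector<std::vector<cv::Mat> >& batch,
                std::vector<cv::Mat>& results, int startIndex) {
    for (size_t i = 0; i < batch.size(); ++i) {
        results[startIndex + i] = batch[i][0].clone();  // placeholder blend
    }
}

// Launch one thread per batch; threads write to disjoint ranges of `results`,
// which must be preallocated to the total frame count.
void blendAll(const std::vector<std::vector<std::vector<cv::Mat> > >& batches,
              std::vector<cv::Mat>& results) {
    std::vector<std::thread> threads;
    int startIndex = 0;
    for (size_t b = 0; b < batches.size(); ++b) {
        threads.push_back(std::thread(blendBatch, std::cref(batches[b]),
                                      std::ref(results), startIndex));
        startIndex += static_cast<int>(batches[b].size());
    }
    for (size_t t = 0; t < threads.size(); ++t) {
        threads[t].join();  // main thread then writes results[] out in order
    }
}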
4.4 Fourth Iteration
After executing the third design on a number of videos, it was found that sometimes the
calculated image masks and homography matrices were not optimal. This occurs when
the lecturer stands in the stitch seam, creating a distorted stitched frame which is
unpleasant to view. This iteration adds the ability to re-execute the stitching pipeline every x seconds, where x is configurable by the user.
The design of this iteration is discussed in Section 3.2.6 and the pseudocode is shown below.
Create a results vector to store blended frames
Initialise start_index to 0
Read stitching pipeline settings from text file
Read videos from command line arguments
Determine number of frames to stitch (length of video to stitch * FPS)
Initialise frameNumber to 0
Determine number of frames to re-execute stitching pipeline (FPS * seconds to re-execute stitching pipeline)
while (frameNumber < number of frames to stitch)
    Extract one frame from each video and store in a vector
    Find the features in the frames
    Match the features
    Estimate the homography of the frames
    Perform bundle adjustment on the frames
    Warp the frames onto a surface
    Compensate for exposure and gain
    Blend the frames together
    if (this is the first iteration)
        Create an output video file
    if (stitched frame needs to be resized)
        Resize frame
    Store stitched/blended frame into results[start_index]
    Increment frameNumber by 1
    Increment start_index by 1
    int frames_to_stitch = number of frames to re-execute pipeline - 1
    while (frames_to_stitch > 0)
        if (frames_to_stitch <= 100)
            // Create thread
        ELSE if (frames_to_stitch <= 1000)
            // Create threads
        ELSE    // 10 full threads can be created
            // Create threads
        END if else branch
    END while
END while
if (there are still active threads running)
    Wait for all threads to finish
    Write all blended frames in results to output video
    Set start_index to 0
END if
FIGURE 25: Pseudocode for the fourth iteration of the stitching pipeline. The “create threads”
comment contains the same code within the if else branch as FIGURE 24. It is important to note that
due to image masks and homography matrices changing, every blended frame will need to be
resized in order for it to be written to the output video file.
It is important to note that at any given point in time, it is possible that there are multiple
threads running, each using different image masks and different homography matrices to
blend the frames together. It is also possible that 10 threads are never created at any
point in time and that a thread may have fewer than 100 frames to stitch.
An example of this is if the user decides that the pipeline should be re-executed every 10
seconds for a video that has 25 FPS. This means that the pipeline is re-executed every 250
frames. The first frame is blended during the re-execution of the pipeline itself, leaving 249 frames to be blended. Three threads are created to blend these 249 frames using the same image masks and homography matrices; the last thread has only 49 frames to blend.
While the three threads are running, the main thread re-executes the stitching pipeline to
calculate the image masks and homography matrices for the next 250 frames. Three
threads are then created to blend the 249 frames. This process is repeated until nine
threads are created, at which point the main thread will have to wait for them to complete
execution and write the frames to the output video file, before continuing to create more
threads.
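The partitioning in this example can be expressed as a small helper. The following is an illustrative sketch rather than code from the pipeline; batchSizes is a hypothetical name.

#include <algorithm>
#include <cstddef>
#include <vector>

// Split a run of frames into chunks of at most maxPerThread frames,
// one chunk per blending thread.
std::vector<std::size_t> batchSizes(std::size_t framesToBlend,
                                    std::size_t maxPerThread = 100) {
    std::vector<std::size_t> sizes;
    while (framesToBlend > 0) {
        std::size_t n = std::min(framesToBlend, maxPerThread);
        sizes.push_back(n);
        framesToBlend -= n;
    }
    return sizes;
}
// batchSizes(249) yields {100, 100, 49}: three threads, the last with 49 frames.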
4.5 Other Related Works
While trying to create a solution that would speed up the stitching process, a real-time stitching solution was encountered on GitHub. The project, titled “StitcHD” [15], is a NASA-funded project to develop a video system that will replace windows on NASA’s spacecraft.
A subset of their code was quickly implemented to see the results of their stitching
solution. Their solution involves calculating a homography matrix and warping the images
based on this matrix. The blending algorithm used was a simple averaging method, where
pixel colours between two overlapping images were averaged to form a final pixel.
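For illustration, an averaging blend of this kind might look as follows in OpenCV. This is a sketch under the assumption that the two images have already been warped onto a common canvas and that maskA and maskB mark their valid pixels; it is not StitcHD’s actual code.

#include <opencv2/core/core.hpp>

// Average the pixels where both warped images overlap; elsewhere take
// whichever image covers the pixel.
cv::Mat averageBlend(const cv::Mat& imgA, const cv::Mat& imgB,
                     const cv::Mat& maskA, const cv::Mat& maskB) {
    cv::Mat overlap;
    cv::bitwise_and(maskA, maskB, overlap);

    cv::Mat result = cv::Mat::zeros(imgA.size(), imgA.type());
    imgA.copyTo(result, maskA);
    imgB.copyTo(result, maskB);

    cv::Mat mean;
    cv::addWeighted(imgA, 0.5, imgB, 0.5, 0.0, mean); // per-pixel average
    mean.copyTo(result, overlap);                     // only in the overlap
    return result;
}

Any moving object in the overlap appears in both images at slightly different positions, so the average shows two half-intensity copies of it: the ghosting visible in FIGURE 27.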
FIGURE 26: A screenshot taken from two videos recorded by two cameras: one fixed camera is on
the left of the room and one fixed camera is on the right. Note that the lecturer’s arm on the right
screenshot is more angled compared to the left screenshot due to perspective distortion.
FIGURE 27: The result of the StitcHD code.
The ghosting effect is not caused by the motion of the hand, but by the fact that their blending algorithm averages pixels. Another contributing factor is that the fixed cameras recording the videos are very close to the blackboards, so the degree of perspective distortion is significant. This algorithm might have worked if the fixed cameras were located at the back of the lecture room, but that could make the writing on the blackboard illegible.
Due to the ghosting effect, their implementation was not investigated further and thus no timing tests were taken.
5 Results
The performance of the stitching component of the system is evaluated in terms of its
execution time as well as the stitch quality of the videos. The results for these two categories are detailed here.
5.1 Stitching Pipeline Default Settings
The stitching pipeline has various configurable options, for example which blending method to use and which compositing surface the images are warped onto. These settings are stored in a text file which the user can edit before executing the system. The available options are listed in the table below.
Category | Default Value | Other Options
Use GPU | No | Yes
Work Megapixels | 0.6 | N/A
Seam Megapixels | 0.1 | N/A
Confidence Threshold | 0.5 | N/A
Feature Detector Type | surf | orb
Bundle Adjustment Cost Function | ray | reproj
Bundle Adjustment Refinement Mask | xxxxx | N/A
Perform Wave Correction | Yes | No
Wave Correction Type | Horizontal | Vertical
Warp Type | paniniA2B1 | plane, cylindrical, spherical, fisheye, stereographic, compressedPlaneA2B1, compressedPlaneA1.5B1, compressedPlanePortraitA2B1, compressedPlanePortraitA1.5B1, paniniA1.5B1, paniniPortraitA2B1, paniniPortraitA1.5B1, mercator, transverseMercator
Exposure Compensator Type | gainblocks | no, gain
Match Confidence | 0.3 | N/A
Seam Estimation Method | gccolor | no, voronoi, gccolorgrad, dpcolor, dpcolorgrad
Blend Type | multiband | no, feather
Blend Strength | 5 | N/A
Show Stitch Seams | No | Yes
Seconds of Video to Stitch | Variable | N/A
Seconds Between Pipeline Re-executions | Variable | N/A
FIGURE 28: Table showing the different options of the stitching pipeline.
The execution time and the stitch quality results use the default values of the table unless
otherwise stated.
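The exact layout of the settings file is not reproduced in this report. A plausible key-value layout covering the options above might look like the following; the key names are hypothetical, not the project’s actual keys.

use_gpu no
work_megapix 0.6
seam_megapix 0.1
conf_thresh 0.5
features surf
ba_cost_func ray
ba_refine_mask xxxxx
wave_correct horizontal
warp paniniA2B1
expos_comp gainblocks
match_conf 0.3
seam gccolor
blend multiband
blend_strength 5
show_seams no
seconds_to_stitch 60
seconds_to_reexecute 10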
5.1.1 Warp Type
Due to the number of options that OpenCV offers for warp type, an image was stitched
using all the different options. It was found that paniniA2B1 offered the best result and
thus is set as the default warp type. Note that the pairs compressedPlaneA2B1/compressedPlaneA1.5B1, compressedPlanePortraitA2B1/compressedPlanePortraitA1.5B1 and paniniPortraitA2B1/paniniPortraitA1.5B1 each produced results similar to their counterpart.
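These warp type names correspond to warper creators in OpenCV 2.4’s stitching module. A sketch of the mapping, following the pattern of OpenCV’s stitching_detailed.cpp sample (only a few of the options from the table are shown), might be:

#include <opencv2/stitching/warpers.hpp>
#include <string>

// Map a warp-type name onto an OpenCV 2.4 warper creator.
cv::Ptr<cv::WarperCreator> makeWarper(const std::string& type) {
    if (type == "plane")        return new cv::PlaneWarper();
    if (type == "cylindrical")  return new cv::CylindricalWarper();
    if (type == "spherical")    return new cv::SphericalWarper();
    if (type == "fisheye")      return new cv::FisheyeWarper();
    if (type == "paniniA1.5B1") return new cv::PaniniWarper(1.5f, 1.0f);
    return new cv::PaniniWarper(2.0f, 1.0f); // paniniA2B1, the default used here
}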
[FIGURE 29 panels: compressedPlaneA2B1, compressedPlanePortraitA2B1, cylindrical, fisheye, mercator, paniniA1.5B1, paniniPortraitA2B1, plane, spherical, stereographic, transverseMercator, paniniA2B1]
FIGURE 29: A diagram displaying the results of different compositing surfaces provided by OpenCV.
PaniniA2B1 is chosen as the default warp type, since it provides the least distortion.
5.2 Execution Time
The specifications of the computer used and the properties of the input videos are listed below, followed by the execution time results.
Computer specifications
CPU: Intel Core i5-3470 @ 3.2 GHz (4 hardware threads)
RAM: 2 x 2 GB DDR3 1333 MHz
Linux swap space: 4000 MiB

Input video properties
Resolution (x2 videos): 640 x 360
Frames per second: 25

Output video properties
Stitched panoramic video resolution: 1283 x 758
5.2.1 Re-execution of the pipeline disabled
Reading # | 1 minute (s) | 2 minutes (s) | 5 minutes (s) | 10 minutes (s)
1 | 246.51 | 779.05 | 2425.07 | 7248.17
2 | 241.04 | 735.28 | 2838.96 | 8698.35
3 | 239.30 | 694.90 | 2336.81 | 6072.11
4 | 241.27 | 713.70 | 2669.71 | 8688.84
5 | 241.03 | 790.82 | 3432.42 | 8428.09
Average | 241.83 | 742.75 | 2740.59 | 7827.11
Standard deviation | 2.44 | 36.93 | 388.82 | 1027.43
Ratio | 4.03 times | 6.19 times | 9.14 times | 13.05 times
FIGURE 30: Table showing the times acquired from the multi-threaded system. The ratio is
calculated by the following formula: average / video length (in seconds).
Reading # | 1 minute (s) | 2 minutes (s) | 5 minutes (s)
1 | 794.60 | 1588.77 | 3964.65
2 | 794.83 | 1589.81 | 3976.51
3 | 794.42 | 1590.08 | 3985.30
4 | 794.20 | 1588.94 | 3975.64
5 | 795.05 | 1612.63 | 3974.71
Average | 794.62 | 1594.05 | 3975.36
Standard deviation | 0.30 | 9.31 | 6.56
Ratio | 13.24 times | 13.28 times | 13.25 times
FIGURE 31: Table showing the times acquired from the single-threaded system. The ratio is
calculated by the following formula: average / video length (in seconds).
The tables show the times achieved by the system when the entire stitching pipeline is
executed only once and the rest of the frames are simply blended together using the
calculated image masks and homography matrices.
The single-threaded design provides consistent results, as evidenced by its ratio of about 13.25 times across all video lengths. The multi-threaded design, however, does not provide consistent results: its ratio ranges from about 4 times for a 1 minute video to 13.05 times for a 10 minute video.
Extrapolating these results, the single-threaded design would stitch a 50 minute lecture video in considerably less time than the multi-threaded design. This seems contradictory; however, looking at the computer specifications, there are only 4 hardware threads available to share between the 10 threads the design creates. Additionally, the amount of RAM available is too little for the multi-threaded design, causing the swap space to be used repeatedly. It is worth mentioning that the standard deviation increases dramatically as the length of the video to be stitched increases. This could be the result of the computer constantly moving data between the RAM and the swap space, producing varying execution times. The combination of swap space usage and thread swapping slows down execution dramatically, making the single-threaded design the better choice for stitching 50 minute lecture videos. None of the results here achieve the system constraint that the execution time be at most three times the length of one of the input videos.
It is important to note that these results were run on a computer with limited RAM and
CPU threads. Results should be different when executed on a mainframe or a server
computer, where RAM is more plentiful and more hardware threads are present. Testing
on more powerful computers is left for future work.
5.2.2 Re-execution of the pipeline enabled
The results shown here were obtained with the system re-executing the stitching pipeline every 10 seconds (or every 250 frames).
Reading # | 1 minute (s)
1 | 1369.64
2 | 1377.21
3 | 1381.47
4 | 1368.46
5 | 1355.11
Average | 1370.38
Standard deviation | 9.02
Ratio | 22.84 times
FIGURE 32: Table showing times acquired from the multi-threaded system with the stitching
pipeline re-executed every 10 seconds. The ratio is calculated by the following formula: average /
video length (in seconds).
Reading # | 1 minute (s) | 2 minutes (s)
1 | 2836.24 | 8200.81
2 | 2963.40 | 8212.73
3 | 2831.47 | 8247.67
4 | 2832.57 | 8210.75
5 | 2834.19 | 8204.96
Average | 2859.57 | 8215.38
Standard deviation | 51.94 | 16.69
Ratio | 47.66 times | 68.46 times
FIGURE 33: Table showing times acquired from the single-threaded system with the stitching
pipeline re-executed every 10 seconds. The ratio is calculated by the following formula: average /
video length (in seconds).
It is immediately noticeable that the ratios are very high; the system’s execution time constraint is not satisfied at all. While attempting to acquire times for the multi-threaded design when stitching a 2 minute video, an OpenCV error occurred:
OpenCV Error: Insufficient memory (Failed to allocate 1493639172 bytes) in OutOfMemoryError, file /home/ttsen/Downloads/opencv-2.4.9/modules/core/src/alloc.cpp, line 52 terminate called after throwing an instance of 'cv::Exception'
what(): /home/ttsen/Downloads/opencv-2.4.9/modules/core/src/alloc.cpp:52: error: (-4) Failed to allocate 1493639172 bytes in function OutOfMemoryError
This error occurred during the gain compensation stage of the stitching pipeline. It is most likely related to the fact that the multi-threaded version passes image masks and homography matrices by value to threads, combined with the limited RAM of the computer, resulting in all available memory being allocated. However, this does not explain the single-threaded version, whose times are also extremely high.
We noticed that in both the single- and multi-threaded versions, the pipeline would occasionally “stall” when reaching the gain compensation stage. Compensating for gain usually takes less than 1 second to execute, but when the pipeline does stall, times in this stage vary from 100 seconds to 1000 seconds. Presumably this is because the computer is waiting for memory to free up before it can allocate new memory.
This issue might be related to multi-threading, but there are definitely memory allocation problems. Future work could include destroying vectors that are no longer needed, checking for memory leaks and testing the system on a more powerful computer.
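As a minimal illustration of the “destroying vectors” suggestion, assuming the results vector holds the blended frames:

#include <opencv2/core/core.hpp>
#include <vector>

void releaseResults(std::vector<cv::Mat>& results) {
    // clear() drops the elements but may keep the capacity allocated;
    // swapping with an empty temporary releases the backing storage too.
    std::vector<cv::Mat>().swap(results);
}

Called immediately after the frames are written to the output video file, this would return the frame memory to the allocator before the next batch of threads is created.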
5.3 Stitch Quality
Various videos were passed through the stitching pipeline to see the stitch quality in
different environments.
5.3.1 Blend Type
OpenCV offers three methods for blending: multiband, feather and no blending. Videos were stitched using all three blending methods, and the multiband method gave the best visual results, so multiband is set as the default blend type. It is important to note that multiband blending takes the longest to execute, whereas no blending gives the quickest execution time.
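In OpenCV 2.4 (the version in use here, judging by the error message in Section 5.2.2), the three blend types map onto the detail API roughly as follows. This is a sketch of the standard API usage, not necessarily the project’s actual code.

#include <opencv2/stitching/detail/blenders.hpp>
#include <string>

// Create the blender matching the "Blend Type" setting.
cv::Ptr<cv::detail::Blender> makeBlender(const std::string& type, bool try_gpu) {
    using cv::detail::Blender;
    if (type == "multiband")
        return Blender::createDefault(Blender::MULTI_BAND, try_gpu); // best quality, slowest
    if (type == "feather")
        return Blender::createDefault(Blender::FEATHER, try_gpu);    // middle ground
    return Blender::createDefault(Blender::NO, try_gpu);             // fastest, visible seam
}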
[FIGURE 34 panels: no blending, feather blending, multiband blending]
FIGURE 34: A diagram displaying the results of different blending algorithms. The images in the right column have had their contrast and brightness adjusted in order to show the stitch seams more clearly. Multiband blending is chosen as the default blend type because its stitch seam is the least visually distracting.
5.3.2 Stitch Quality Results
Multiband blending provides good results in the sense that the stitch seam is barely visible. However, once the lecturer walks across the lecture room, the stitch seam becomes clearly visible. This is not because the blending algorithm is inadequate, but because the cameras are so close to the blackboards that the perspective distortion between the left and right cameras is significant. From FIGURE 13, we can see that if anything moves in front of the blackboard in the area of overlap, the object will either be sliced in half or not appear in the stitched video at all.
FIGURE 35: An example of a frame where the lecturer is standing in the stitch seam. This is quite
visually distracting.
FIGURE 36: If the pipeline is re-executed, the image masks and the homography matrices will
update to include the lecturer in the frame. As mentioned before, due to the closeness of the
cameras to the blackboard, perspective distortion is significant. Although the lecturer is fully in the
frame, the blackboard edges are not straight, which is visually distracting.
FIGURE 37: An example where the lecturer is standing in the stitch seam at the beginning of the video. When the lecturer moves away, a visually distracting stitch seam is left behind; this is what necessitated the fourth design of the stitching pipeline. If the stitching pipeline is re-executed once the lecturer moves away, the blended frame looks good.
FIGURE 38: A screenshot of a frame with writing on both blackboards. The stitch seam is apparent when looking at the tables, but on the blackboards themselves the blending algorithm has done a good job.
6 Conclusion
Lecture recording is a common tool many universities use for open courseware and education purposes. Lectures can be recorded using a PTZ camera, an automatic camera that is able to detect the lecturer and pan along with them. Alternatively, a cameraman can be used to record the lecture, a solution that is prone to human error.
Both methods stated above are expensive solutions for lecture recording. The Centre for Innovation in Learning and Teaching (CILT) has proposed a system where two or more fixed cameras are placed at the front of the lecture room, each viewing a different portion of the front of the room. The system takes the recorded videos as input and post-processes them in three stages: stitching, tracking and panning. Stitching combines all input videos into one panoramic video, tracking identifies and tracks the lecturer, and panning crops out a region of interest, namely the lecturer, from the panoramic frame. The cropped frame pans along with the lecturer, simulating both the PTZ camera and the cameraman solution. The cost of the proposed system is the cost of the fixed cameras used, which is much cheaper than either alternative.
The library used for the system is OpenCV. Since OpenCV is distributed under an open source BSD license, no costs were incurred in programming the system. Additionally, the code we produce is open source, allowing other users to contribute to the system and thus promoting open courseware initiatives.
This paper focused on the stitching component of the proposed system. Various iterations
of the design of the stitching component were discussed. The first design involved
executing the entire stitching pipeline for each and every frame. This made execution
times extremely long. The second design involved executing the pipeline once and
reusing the calculated image masks and homography matrices. Execution times were
better, but still not good enough. The third design involved multithreading the blending
process in order to further improve execution times. The fourth design was necessitated
by the fact that sometimes the initial calculated image masks and homography matrices
were not optimal. This occurs when the lecturer stands in the stitch seams. This design
gave the user the option to re-execute the entire stitching pipeline every x seconds.
The stitching pipeline was evaluated in terms of its execution times and stitch quality. Execution times were acquired with re-execution of the pipeline both enabled and disabled.
With re-execution of the pipeline disabled, execution times vary wildly for the multi-threaded solution, whereas the single-threaded solution gave consistent timings. This was due to the computer used having only 4 hardware threads and 4 GB of RAM. Even though its timings vary wildly, the multi-threaded solution did stitch videos of 1 minute, 2 minutes and 5 minutes in length faster than the single-threaded solution.
Few timings were acquired with re-execution of the pipeline enabled, because the execution times were extremely high. This may be due to the computer having only 4 GB of RAM, but there are definitely memory leaks occurring.
Multiband blending gives the best stitch quality results, although it also takes the longest to execute. The stitch seam is visible only when the lecturer walks across it; the option to re-execute the stitching pipeline is available to rectify this problem. It is important to note that the writing on the blackboards is legible and undistorted, with no evidence of ghosting effects. Since the writing on the blackboard is the most important content, the lecturer being cut in half at the stitch seam is probably not a big issue.
With better execution times and slightly better stitch quality results, the system is a viable option for lecture recording: it simulates both the PTZ camera and the cameraman solution while being cheaper than either.
7 Future Work
Since the stitching pipeline has many parameters, there is a possibility that a different
combination of parameters could give better results. Experimenting with the different
parameters could also yield better execution times.
In the multi-threaded design of the stitching pipeline, at most 10 threads are created at any point in time, with each thread stitching at most 100 frames. Allowing the user to choose the number of threads, as well as how many frames each thread stitches, would enable the system to run well on a variety of computer specifications.
The stitching component has been programmed to accept two or more videos as input.
However, the system has only been tested on at most two input videos. More testing is
required when accepting three or more videos.
One of the drawbacks of having the cameras so close to the blackboard is that the perspective distortion between the cameras is significant. It would be useful to record videos with the cameras at different positions to see what effect this has on the stitch seams.
FIGURE 39: A method to position the cameras to try to reduce perspective distortion when the
lecturer stands in the stitch seam. This is left for future work.
Due to time constraints, the system has not been executed on a full-length lecture video (about 45 – 50 minutes). The system needs to be tested on such videos in order to give a clearer picture of execution times.
Re-execution of the stitching pipeline will not always yield better results; this occurs when the lecturer walks from one side of the lecture room to the other. A way to counter this is to incorporate machine learning into the stitching pipeline. By having a learning algorithm learn what a good image mask and a good homography matrix look like, the results of a re-execution that gives worse output can simply be discarded. This would allow the stitching of frames to improve over time.
References
[1] S. Marquard, “Matterhorn 2014 Unconference: Ideas for automated post-recording
video handling.,” 19 March 2014. [Online]. Available:
http://www.slideshare.net/smarquard/matterhorn-unconf2014-fixitinpost/. [Accessed
20 October 2014].
[2] S. Peleg and J. Herman, “Panoramic mosaics by manifold projection,” Computer
Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society
Conference on, pp. 338-343, IEEE, 1997.
[3] D. Steedly, C. Pal and R. Szeliski, “Efficiently registering video into panoramic
mosaics,” Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on,
vol. 2, pp. 1300-1307, 2005.
[4] R. Szeliski, “Video mosaics for virtual environments,” Computer Graphics and
Applications, IEEE, vol. 16, no. 2, pp. 22-30, 1996.
[5] P. J. Burt and E. H. Adelson, “A multiresolution spline with application to image
mosaics,” ACM Transactions on Graphics (TOG), vol. 2, no. 4, pp. 217-236, 1983.
[6] H.-Y. Shum and R. Szeliski, “Systems and experiment paper: Construction of
panoramic image mosaics with global and local alignment,” International Journal of
Computer Vision, vol. 36, no. 2, pp. 101-130, 2000.
[7] M. Brown and D. G. Lowe, “Recognising panoramas,” ICCV, vol. 3, p. 1218, 2003.
[8] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant
features,” International journal of computer vision, vol. 74, no. 1, pp. 59-73, 2007.
[9] C. Schmid, R. Mohr and C. Bauckhage, “Evaluation of interest point detectors,”
International Journal of computer vision, vol. 37, no. 2, pp. 151-172, 2000.
[10] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International
journal of computer vision, vol. 60, no. 2, pp. 91-110, 2004.
[11] R. Szeliski, “Image alignment and stitching: A tutorial,” Foundations and Trends® in
Computer Graphics and Vision, vol. 2, no. 1, pp. 1-104, 2006.
[12] B. Triggs, P. McLauchlan, R. Hartley and A. Fitzgibbon, “Bundle adjustment—a
modern synthesis,” Vision algorithms: theory and practice, pp. 298-372, 2000.
[13] M. Uyttendaele, A. Eden and R. Szeliski, “Eliminating ghosting and exposure artifacts
in image mosaics,” Computer Vision and Pattern Recognition, 2001. CVPR 2001.
Proceedings of the 2001 IEEE Computer Society Conference on, vol. 2, pp. II-509, 2001.
[14] W. Xu and J. Mulligan, “Performance evaluation of color correction approaches for
automatic multi-view image and video stitching,” Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on, pp. 263-270, 2010.
[15] lukeyeager, “lukeyeager/StitcHD · GitHub,” 8 May 2012. [Online]. Available:
https://github.com/lukeyeager/StitcHD. [Accessed 28 October 2014].