Efficient Implementation and Processing of a Real-time Panorama Video Pipeline

Marius Tennøe 1, Espen Helgedagsrud 1, Mikkel Næss 1, Henrik Kjus Alstad 1, Håkon Kvale Stensland 1, Vamsidhar Reddy Gaddam 1, Dag Johansen 2, Carsten Griwodz 1, Pål Halvorsen 1

1 University of Oslo / Simula Research Laboratory   2 University of Tromsø

Abstract—High resolution, wide field of view video generated from multiple camera feeds has many use cases. However, processing the different steps of a panorama video pipeline in real-time is challenging due to the high data rates and the stringent requirements of timeliness.

We use panorama video in a sport analysis system where video events must be generated in real-time. In this respect, we present a system for real-time panorama video generation from an array of low-cost CCD HD video cameras. We describe how we have implemented different components and evaluated alternatives. We also present performance results with and without co-processors like graphics processing units (GPUs), and we evaluate each individual component and show how the entire pipeline is able to run in real-time on commodity hardware.

I. INTRODUCTION

Panoramic images are often used in applications that require a wide field of view. Some application scenario examples include surveillance, navigation, scenic views, infotainment kiosks, mobile e-catalogs, educational exhibits and sports analysis. In such systems, video feeds are captured using multiple cameras, and the frames are processed and stitched into a single unbroken frame of the whole surrounding region. This process is, however, very resource hungry. Each individual frame must be corrected for barrel distortion, rotated to have the same angle, warped to the same plane, corrected for color differences, stitched into one large panorama image, encoded to save storage space and transfer bandwidth and, finally, written to disk. Additionally, the stitching includes searching for the best possible seam in the overlapping areas of the individual frames to avoid seams through objects of interest in the video.

In this paper, we describe our implementation of a real-time panorama video pipeline for an arena sports application. Here, we use a static array of low-cost CCD HD video cameras, each pointing in a different direction, to capture the wide field of view of the arena. The different views overlap slightly in order to facilitate the stitching of the videos into a panoramic video. Several similar non-real-time stitching systems exist (e.g., [17]), and a simple non-real-time version of this system has earlier been described and demonstrated at the functional level [8], [20]. Our initial prototype is the first sports application to successfully integrate per-athlete sensors, an expert annotation system and a video system, but due to the non-real-time stitching, the panorama video was only available to the coaches some time after a game. The previous version also did not use any form of color correction or dynamic seam detection. Hence, the static seam did not take moving objects (such as players) into account, and the seam was therefore very visible. However, new requirements like real-time performance and better visual quality have resulted in a new and improved pipeline. Using our new real-time pipeline, such systems can be used during the game.

In particular, this paper therefore focuses on the details of the whole pipeline, from capturing images from the cameras via various correction steps for panorama generation to encoding and storage of both the panorama video and the individual camera streams on disk. We describe how we have evaluated different implementation alternatives (both algorithms and implementation options), and we benchmark the performance with and without graphics processing units (GPUs) as co-processors. We evaluate each individual component, and we show how the entire pipeline is able to run in real-time on a low-cost 6-core machine with a GPU, i.e., moving the 1 frame per second (fps) system to 30 fps, enabling game analysis during the ongoing event.

The remainder of the paper is structured as follows: Next, in section II, we give a brief overview of the basic idea of our system. Then, we briefly look at some related work (section III), before we describe and evaluate our real-time panorama video pipeline in section IV. Section V provides a brief discussion of various aspects of the system before we conclude the paper in section VI.

II. OUR SPORTS ANALYSIS SYSTEMS

Today, a large number of (elite) sports clubs spend a large amount of resources to analyze their game performance, either manually or using one of the many existing analytics tools. For example, in the area of soccer, several systems enable trainers and coaches to analyze the gameplay in order to improve the performance. For instance, Interplay-sports, ProZone, STATS SportVU Tracking Technology and Camargus provide very nice video technology infrastructures. These systems can present player statistics, including speed profiles, accumulated distances, fatigue, fitness graphs and coverage maps using different charts, 3D graphics and animations. Thus, there exist several tools for soccer analysis. However, to the best of our knowledge, there does not exist a system that fully integrates all desired features in real-time, and existing systems still require manual work to move data between different components. In this respect, we have presented Bagadus [8], [20], which integrates a camera array video capture system with a sensor-based sport tracking system for player statistics and a system for human expert annotations. Our system allows the game analysts to automatically play out a tagged game event or to extract a video of events identified from the statistical player data. This means that we, for example, can query for all sprints faster than X, or all situations where a player is in the center circle. Using the exact player positions provided by the sensors, a trainer can also follow individuals or groups of players, where the videos are presented either using a stitched panorama view of the entire field or by (manually or automatically) switching between the different camera views.

Our prototype is currently deployed at an elite club stadium. We use a dataset captured at a premier league game to experiment and to perform benchmarks on our system. In previous versions of the system, the panorama video had to be generated offline, and it had static seams [8]. For comparison with the new pipeline presented in section IV, we next present the camera setup and the old pipeline.

A. Camera setup

To record high resolution video of the entire soccer field, we have installed a camera array consisting of four Basler industry cameras with 1/3-inch image sensors supporting 30 fps at a resolution of 1280x960. The cameras are synchronized by an external trigger signal in order to enable a video stitching process that produces a panorama video picture. The cameras are mounted close to the middle line, under the roof of the stadium covering the spectator area, approximately 10 meters from the side line and 10 meters above the ground. With a 3.5 mm wide-angle lens, each camera covers a field-of-view of about 68 degrees, and full coverage of the field, with sufficient overlap to identify the common features necessary for camera calibration and stitching, is achieved using the four cameras. Calibration is done via a classic chessboard pattern [25].

B. The offline, static stitching pipeline

Our first prototype focused on integrating the different subsystems. We therefore did not put large efforts into real-time performance, resulting in an unoptimized, offline panorama video pipeline that combined images from multiple, trigger-synchronized cameras as described above. The general steps in this stitching pipeline are: 1) correct the images for lens distortion in the outer parts of the frame, caused by the wide-angle fish-eye lenses; 2) rotate and morph the images into the panorama perspective, since the different camera positions cover different areas of the field; 3) rotate and stitch the video images into a panorama image; and 4) encode and store the stitched video to persistent storage. Several implementations were tested for the stitching operation, such as the OpenCV planar projection, cylindrical projection and spherical projection algorithms, but due to the processing performance and the quality of the output image, the chosen solution is a homography-based algorithm.

Figure 1. Camera setup at the stadium.

The first step before executing the pipeline is to find corresponding pixel points in order to compute the homography between the camera planes [9], i.e., between the head camera plane and each of the remaining camera planes. When the homography is calculated, an image can be warped (step 2) in order to fit the plane of the second image. The images must be padded to have the same size, and the seams for the stitching must be found in the overlapping regions (our first pipeline used static seams). Figure 2(a) shows the process of merging four warped images into a single, stitched panorama image, leading to the image shown in figure 2(b). We also calculate the homography between the sensor data plane and the camera planes to find the mapping between sensor data coordinates and pixel positions.
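To make this step concrete, the following sketch (using OpenCV, which the pipeline already relies on) shows how a homography can be estimated from manually selected point correspondences and used to warp a frame onto the head camera plane; function and variable names are illustrative and not taken from the Bagadus code.

    // Sketch of the offline calibration step: estimate the homography that maps
    // one camera plane onto the head camera plane from point correspondences,
    // then warp a frame into the (padded) panorama plane.
    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    cv::Mat warpToHeadPlane(const cv::Mat& frame,
                            const std::vector<cv::Point2f>& ptsThisCam,
                            const std::vector<cv::Point2f>& ptsHeadCam,
                            const cv::Size& paddedSize)
    {
        // Homography between the two camera planes (RANSAC rejects bad matches).
        cv::Mat H = cv::findHomography(ptsThisCam, ptsHeadCam, cv::RANSAC);

        // Warp the frame into the padded panorama plane of the head camera.
        cv::Mat warped;
        cv::warpPerspective(frame, warped, H, paddedSize);
        return warped;
    }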

As can be seen in the figure, the picture is not perfect, but the main challenge is the high execution time. On a machine with an Intel Core i7-2600 @ 3.4 GHz and 8 GB of memory, the stitching operation consumed 974 ms of CPU time to generate each 7000x960 pixel panorama image [8]. Taking into account that the target display rate is 30 fps, i.e., requiring a new panorama image every 33 ms, there are large performance issues that must be addressed in order to bring the panorama pipeline from a 1 fps system to a 30 fps system. However, the stitching operations can be parallelized and parts of them offloaded to external devices such as GPUs, which, as we will see in section IV, results in performance good enough for real-time, online processing and generation of a panorama video.

III. RELATED WORK

Real-time panorama image stitching is becoming common. For example, many have proposed systems for panorama image stitching (e.g., [6], [11], [14]-[16]), and modern operating systems for smart phones like Apple iOS and Google Android support generation of panorama pictures in real-time. However, the definition of real-time is not necessarily the same for all applications, and in this case, real-time means roughly "within a second or two". For video, real-time has another meaning: a panorama picture must be generated at the same speed as the display frame rate, e.g., every 33 ms for a 30 frame-per-second (fps) video.

One of these existing systems is Camargus [1]. The people developing this system claim to deliver high definition panorama video in real-time from a setup consisting of 16 cameras (ordered in an array), but since this is a commercial system, we have no insight into the details. Another example is Immersive Cockpit [22], which aims to generate a panorama for tele-immersive applications. They generate a stitched video which captures a large field-of-view, but their main goal is not to give output with high visual quality.


(a) Processing the images for stitching. Each of the 4 camera frames, represented by the top four layers, is warped and padded to a common horizontal line (the view of the second camera). Then, these 4 views are superimposed on each other. The highlighted areas in the bottom image show where the views overlap (where the seams are found).

(b) Result of homography based stitching with static seams.

Figure 2. The initial homography-based stitcher

Although they are able to generate video at a frame rate of about 25 fps for 4 cameras, there are visual limitations to the system which make it not well suited for our scenario.

Moreover, Baudisch et al. [5] present an application for creating panoramic images, but the system is highly dependent on user input. Their definition of real-time is "panorama construction that offers a real-time preview of the panorama while shooting", but they are only able to produce about 4 fps (far below our 30 fps requirement). A system similar to ours is presented in [4], which computes stitch-maps on a GPU, but the presented system produces low resolution images (and is limited to two cameras). The performance is within our real-time requirement, but the timings are based on the assumption that the user accepts a lower quality image than the cameras can produce.

Haynes [3] describes a system by the Content Interface Corporation that creates ultra high resolution videos. The Omnicam system from the Fascinate project [2], [21] also produces high resolution videos. However, both these systems use expensive and specialized hardware. The system described in [3] also makes use of static stitching. A system for creating panoramic videos from already existing video clips is presented in [7], but it does not manage to create panorama videos within our definition of real-time. As far as we know, the same real-time issue is also present in [5], [10], [17], [23].

In summary, existing systems (e.g., [7], [10], [17], [22], [23]) do not meet our demand of being able to generate the video in real-time, and commercial systems (e.g., [1], [3]) as well as the systems presented in [2], [21] often do not fit our goal of creating a system with limited resource demands. The system presented in [4] is similar to our system, but we require high quality results from processing a minimum of four camera streams at 30 fps. Thus, due to the lack of low-cost implementations fulfilling our demands, we have implemented our own panorama video processing pipeline, which utilizes processing resources on both the CPU and the GPU.

IV. A REAL-TIME PANORAMA STITCHER

In this paper, we do not focus on selecting the best algorithms etc., as this is mostly covered in [8]. The focus here is to describe the panorama pipeline and how the different components in the pipeline are implemented in order to run in real-time. We will also point out various performance trade-offs.

As depicted in figure 3, the new and improved panorama stitcher pipeline is separated into two main parts: one part running on the CPU, and the other running on a GPU using the CUDA framework. The decision to use a GPU as part of the pipeline was due to the parallel nature of the workload, and this decision has affected the architecture to a large degree. Unless otherwise stated (we have tested several CPUs and GPUs), our test machine for the new pipeline is an Intel Core i7-3930K (6-core), based on the Sandy Bridge-E architecture, with 32 GB RAM and an Nvidia GeForce GTX 680 GPU.

A. The Controller module

The single-threaded Controller is responsible for initializing the pipeline, synchronizing the different modules, handling global errors and frame drops, and transferring data between the different modules. After initialization, it waits for and gets the next set of frames from the camera reader (CamReader) module (see below). Next, it controls the flow of data from the output buffers of module N to the input buffers of module N+1. This is done primarily by pointer swapping, with memory copies as an alternative. It then signals all modules to process the new input and waits for them to finish processing. Next, the controller continues waiting for the next set of frames from the reader. Another important task of the Controller is to check the execution speed. If an earlier step in the pipeline runs too slowly, and one or more frames have been lost from the cameras, the controller tells the modules in the pipeline to skip the delayed or dropped frame and reuse the previous frame.
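A minimal sketch of the Controller's main loop is shown below, assuming each module exposes input/output buffer pointers and a signal/wait interface; the actual module interface is not described in the paper, so the names are illustrative.

    // Sketch of the Controller loop: wait for the CamReader, hand buffers
    // downstream by pointer swapping, then signal all modules and wait.
    #include <vector>
    #include <utility>
    #include <cstddef>

    struct Module {
        void* inputBuffer  = nullptr;
        void* outputBuffer = nullptr;
        virtual void signalNewInput() = 0;   // start processing the current input
        virtual void waitUntilDone()  = 0;   // block until the module finishes
        virtual ~Module() = default;
    };

    void controllerLoop(std::vector<Module*>& pipeline, bool& running) {
        Module* reader = pipeline.front();   // CamReader, running in its own threads
        while (running) {
            reader->waitUntilDone();         // block until a new frame set arrives

            // Move data downstream: the output of module i becomes the input of
            // module i+1, primarily by pointer swapping instead of memcpy.
            for (std::size_t i = pipeline.size() - 1; i > 0; --i)
                std::swap(pipeline[i]->inputBuffer, pipeline[i - 1]->outputBuffer);

            // Signal the downstream modules to process the new input and wait
            // for the slowest one before handling the next frame set.
            for (std::size_t i = 1; i < pipeline.size(); ++i) pipeline[i]->signalNewInput();
            for (std::size_t i = 1; i < pipeline.size(); ++i) pipeline[i]->waitUntilDone();
        }
    }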

B. The CamReader module

The CamReader module is responsible for retrieving frames from the cameras. It consists of one dedicated reader thread per camera. Each of these threads waits for the next frame and then writes the retrieved frame to an output buffer, overwriting the previous frame. The cameras provide a single frame in YUV 4:2:2 format, and the retrieval rate of frames in the CamReader is what determines the real-time threshold for the rest of the pipeline. As described above, the camera shutter synchronization is controlled by an external trigger box.


Figure 3. Panorama stitcher pipeline architecture. The modules are: 1) CamReader, 2) Converter (YUV 4:2:2 to RGBA), 3) Debarreler, 4) SingleCamWriter, 5) Uploader, 6) BackgroundSubtractor, 7) Warper, 8) Color-corrector, 9) Stitcher, 10) Converter (RGBA to YUV 4:2:0), 11) Downloader and 12) PanoramaWriter, coordinated by the Controller and fed player coordinates from the ZXY sensor database. The BackgroundSubtractor, Warper, Color-corrector, Stitcher and RGBA-to-YUV converter run on the GPU; the remaining modules run on the CPU.

In our current configuration, the cameras deliver a frame rate of 30 fps; the real-time threshold, and thus the CamReader processing budget, is therefore 33 ms.
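The following sketch outlines one such reader thread; openCamera and grabNextFrame are placeholders for the actual camera driver API, which is not described in the paper, and are only declared here so the sketch is self-contained.

    #include <atomic>
    #include <cstdint>
    #include <cstddef>
    #include <vector>

    void* openCamera(int cameraId);                                    // placeholder driver call
    void grabNextFrame(void* camera, uint8_t* dst, std::size_t size);  // placeholder driver call

    struct FrameBuffer {
        std::vector<uint8_t> yuv422;          // one 1280x960 YUV 4:2:2 frame
        std::atomic<uint64_t> sequence{0};    // bumped when a new frame is ready
    };

    void camReaderThread(int cameraId, FrameBuffer& out, std::atomic<bool>& running) {
        void* camera = openCamera(cameraId);
        uint64_t seq = 0;
        while (running) {
            // Blocks until the externally triggered camera delivers the next frame
            // (30 fps, i.e., every 33 ms), then overwrites the previous frame.
            grabNextFrame(camera, out.yuv422.data(), out.yuv422.size());
            out.sequence.store(++seq);
        }
    }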

C. The Converter module

The CamReader module outputs frames in YUV 4:2:2 format. However, because of the requirements of the stitching pipeline, we must convert the frames to RGBA. This is done by the Converter module using ffmpeg and swscale. The processing time for these conversions on the CPU, as seen later in figure 6, is well below the real-time requirement, so this operation can run as a single thread.
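A conversion of this kind with libswscale could look roughly as follows; the exact source pixel format enum depends on how the camera driver packs YUV 4:2:2 (UYVY is assumed here), and in the real module the context would of course be created once and reused rather than per frame.

    #include <cstdint>
    extern "C" {
    #include <libswscale/swscale.h>
    #include <libavutil/pixfmt.h>
    }

    void convertYuv422ToRgba(const uint8_t* src, uint8_t* dst, int width, int height) {
        SwsContext* ctx = sws_getContext(width, height, AV_PIX_FMT_UYVY422,
                                         width, height, AV_PIX_FMT_RGBA,
                                         SWS_BILINEAR, nullptr, nullptr, nullptr);
        const uint8_t* srcPlanes[1] = { src };
        int srcStride[1] = { 2 * width };      // 2 bytes per pixel in packed 4:2:2
        uint8_t* dstPlanes[1] = { dst };
        int dstStride[1] = { 4 * width };      // 4 bytes per pixel in RGBA
        sws_scale(ctx, srcPlanes, srcStride, 0, height, dstPlanes, dstStride);
        sws_freeContext(ctx);
    }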

D. The Debarreler module

Due to the wide-angle lenses used with our cameras in order to capture the entire field, the delivered images suffer from barrel distortion, which needs to be corrected. We found the existing debarreling implementation in the old stitching pipeline to perform fast enough when executed as a dedicated thread per camera. The Debarreler module is therefore still based on OpenCV's debarreling function, using nearest neighbor interpolation.
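A per-camera debarreler along these lines can be sketched with OpenCV as follows, assuming the intrinsic matrix and distortion coefficients come from the chessboard calibration; precomputing the remap tables once keeps the per-frame cost to a single nearest-neighbor remap.

    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>

    struct Debarreler {
        cv::Mat mapX, mapY;

        Debarreler(const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs, cv::Size size) {
            // Build the undistortion lookup tables once per camera.
            cv::initUndistortRectifyMap(cameraMatrix, distCoeffs, cv::Mat(),
                                        cameraMatrix, size, CV_32FC1, mapX, mapY);
        }

        void apply(const cv::Mat& src, cv::Mat& dst) const {
            // Per-frame work: one nearest-neighbor remap.
            cv::remap(src, dst, mapX, mapY, cv::INTER_NEAREST);
        }
    };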

E. The SingleCamWriter module

In addition to storing the stitched panorama video, we also want to store the video from the separate cameras. This storage operation is done by the SingleCamWriter, which runs as a dedicated thread per camera. As we can see in [8], storing the videos as raw data proved to be impractical due to the size of the uncompressed data. The different CamWriter modules (here SingleCamWriter) therefore encode and compress frames into 3 second long H.264 files, which proved to be very efficient. Due to the use of H.264, every SingleCamWriter thread starts by converting from RGBA to YUV 4:2:0, which is the input format required by the x264 encoder. The threads then encode the frames and write the results to disk.
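The sketch below illustrates such a per-camera encoding loop with libx264, writing raw Annex-B NAL units to a new file every 3 seconds (90 frames at 30 fps); container muxing, flushing of delayed frames and error handling are omitted, and the encoder settings and file names are illustrative rather than the ones used in the pipeline.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <x264.h>

    void encodeSegments(int width, int height, int numFrames) {
        x264_param_t param;
        x264_param_default_preset(&param, "veryfast", "zerolatency");
        param.i_width = width;
        param.i_height = height;
        param.i_fps_num = 30;
        param.i_fps_den = 1;
        param.i_csp = X264_CSP_I420;
        x264_t* enc = x264_encoder_open(&param);

        x264_picture_t pic, picOut;
        x264_picture_alloc(&pic, X264_CSP_I420, width, height);

        FILE* out = nullptr;
        for (int i = 0; i < numFrames; ++i) {
            if (i % 90 == 0) {                       // start a new 3 second segment
                if (out) fclose(out);
                out = fopen(("cam_segment_" + std::to_string(i / 90) + ".h264").c_str(), "wb");
            }
            // Fill pic.img.plane[0..2] with the YUV 4:2:0 data of frame i here.
            pic.i_pts = i;
            pic.i_type = (i % 90 == 0) ? X264_TYPE_IDR : X264_TYPE_AUTO;  // keyframe per segment
            x264_nal_t* nals; int numNals;
            int bytes = x264_encoder_encode(enc, &nals, &numNals, &pic, &picOut);
            if (bytes > 0) fwrite(nals[0].p_payload, 1, bytes, out);
        }
        if (out) fclose(out);
        x264_picture_clean(&pic);
        x264_encoder_close(enc);
    }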

F. The Uploader module

The Uploader is responsible for transferring data from the CPU to the GPU. In addition, the Uploader is also responsible for executing the CPU part of the BackgroundSubtractor (BGS) module (see section IV-G). The Uploader consists of a single CPU thread that first runs the player pixel lookup map creation needed by the BGS. Next, it transfers the current RGBA frames and the corresponding player pixel maps to the GPU. This is done using double buffering and asynchronous transfers.
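A double-buffered, asynchronous upload can be sketched with the CUDA runtime API as follows, assuming page-locked (pinned) host buffers so that cudaMemcpyAsync can overlap with kernel execution; buffer names and sizes are illustrative.

    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstddef>

    struct UploadSlot {
        uint8_t* hostRgba = nullptr;   // pinned host buffer (cudaMallocHost)
        uint8_t* devRgba  = nullptr;   // device buffer (cudaMalloc)
        cudaStream_t stream;
    };

    void initSlot(UploadSlot& s, std::size_t bytes) {
        cudaMallocHost(reinterpret_cast<void**>(&s.hostRgba), bytes);  // pinned => async copies
        cudaMalloc(reinterpret_cast<void**>(&s.devRgba), bytes);
        cudaStreamCreate(&s.stream);
    }

    // Called once per frame set: alternate between the two slots so the GPU can
    // consume one buffer while the next one is being filled and transferred.
    void uploadFrame(UploadSlot slots[2], int frameIndex, std::size_t bytes) {
        UploadSlot& s = slots[frameIndex % 2];
        cudaMemcpyAsync(s.devRgba, s.hostRgba, bytes, cudaMemcpyHostToDevice, s.stream);
        // GPU modules for this frame are later launched on s.stream, so they only
        // start after the copy has completed.
    }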

G. The BackgroundSubtractor module

Background subtraction is the process of determining which pixels of a video belong to the foreground and which belong to the background. The BackgroundSubtractor module, running on the GPU, generates a foreground mask (for moving objects like players) that is later used in the Stitcher module to avoid placing seams through players. Our BackgroundSubtractor can run like traditional systems, searching the entire image for foreground objects. However, we can also exploit information gained through the tight integration with the player sensor system. Through the sensor system, we know the player coordinates, which can be used to improve both the performance and the precision of the module. By first retrieving the player coordinates for a frame, we can create a player pixel lookup map where only the players' pixels, including a safety margin, are set to 1. The creation of these lookup maps is executed on the CPU as part of the Uploader. The BGS on the GPU then uses this lookup map to only process pixels close to a player, which reduces the GPU kernel processing time from 811.793 microseconds to 327.576 microseconds on average on a GeForce GTX 680. When run in a pipelined fashion, the processing delay caused by the lookup map creation is also eliminated. The sensor system coordinates are retrieved by a dedicated slave thread that continuously polls the sensor system database for new samples.
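The CPU-side lookup map creation can be sketched as follows, assuming the sensor coordinates have already been mapped to pixel positions via the sensor-plane homography; the size of the safety margin is illustrative.

    #include <vector>
    #include <cstdint>
    #include <algorithm>

    struct PixelPos { int x, y; };

    // Returns a width*height mask where pixels near a player are 1, all others 0.
    std::vector<uint8_t> buildPlayerLookupMap(const std::vector<PixelPos>& players,
                                              int width, int height,
                                              int margin /* e.g. 32 pixels */) {
        std::vector<uint8_t> map(static_cast<std::size_t>(width) * height, 0);
        for (const PixelPos& p : players) {
            int x0 = std::max(0, p.x - margin), x1 = std::min(width  - 1, p.x + margin);
            int y0 = std::max(0, p.y - margin), y1 = std::min(height - 1, p.y + margin);
            for (int y = y0; y <= y1; ++y)
                std::fill(map.begin() + y * width + x0,
                          map.begin() + y * width + x1 + 1, uint8_t(1));
        }
        return map;
    }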

Even though we enhance the background subtraction with sensor data input, there are several implementation alternatives. When determining which algorithm to implement, we evaluated two alternatives: Zivkovic [26], [27] and KaewTraKulPong [13]. Even though the CPU implementation was slower (see table I), Zivkovic provided the best visual results and was therefore selected for further modification. Furthermore, the Zivkovic algorithm proved to be fast enough when modified with input from the sensor system data. The GPU implementation, based on [19], proved to be even faster, and the final performance numbers for a single camera stream can be seen in table I. A visual comparison of the unmodified Zivkovic implementation and the sensor system-modified version is shown in figure 4.


BGS version                               Min    Max    Mean
KaewTraKulPong CPU - unmodified           40.0    66.4   48.9
Zivkovic CPU - unmodified                 50.4   106.8   79.3
Zivkovic CPU - coordinate modification    10.0    51.1   23.3
Zivkovic GPU - coordinate modification*    4.6    11.1    5.5

Table I. Execution time (ms) of alternative algorithms for the BackgroundSubtractor module (1 camera stream).

(a) Unmodified Zivkovic BGS. (b) Player sensor data modification of Zivkovic BGS.

Figure 4. Background subtraction comparison.

As seen in the upper parts of the pictures in figure 4, the sensor coordinate modification reduces the noise.

H. The Warper module

The Warper module is responsible for warping the camera frames to fit the stitched panorama image. By warping we mean twisting, rotating and skewing the images to fit the common panorama plane. As we have seen from the old pipeline, this is necessary because the stitcher assumes that its input images are perfectly warped and aligned for stitching into the large panorama. Executing on the GPU, the Warper also warps the foreground masks provided by the BGS module. This is because the Stitcher module will later use the masks and therefore expects them to fit the corresponding warped camera frames perfectly. Here, we use the Nvidia Performance Primitives (NPP) library for an optimized implementation.
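To illustrate what such a warp does internally (the actual module uses NPP), the following CUDA kernel applies the inverse homography to every output pixel and fetches the nearest source pixel; H_inv is the inverse of the 3x3 homography, stored row-major.

    #include <cuda_runtime.h>
    #include <cstdint>

    __global__ void warpRgbaKernel(const uint8_t* src, int srcW, int srcH,
                                   uint8_t* dst, int dstW, int dstH,
                                   const float* H_inv /* 9 floats, row-major */) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= dstW || y >= dstH) return;

        // Map the output pixel back into the source image plane.
        float w  = H_inv[6] * x + H_inv[7] * y + H_inv[8];
        float sx = (H_inv[0] * x + H_inv[1] * y + H_inv[2]) / w;
        float sy = (H_inv[3] * x + H_inv[4] * y + H_inv[5]) / w;

        int isx = static_cast<int>(sx + 0.5f);      // nearest neighbor
        int isy = static_cast<int>(sy + 0.5f);

        uint8_t* out = dst + 4 * (y * dstW + x);
        if (isx < 0 || isx >= srcW || isy < 0 || isy >= srcH) {
            out[0] = out[1] = out[2] = 0; out[3] = 255;   // padding outside the view
        } else {
            const uint8_t* in = src + 4 * (isy * srcW + isx);
            out[0] = in[0]; out[1] = in[1]; out[2] = in[2]; out[3] = in[3];
        }
    }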

I. The Color-corrector module

When recording frames from several different cameras pointing in different directions, it is nearly impossible to calibrate the cameras to output exactly the same colors due to the different lighting conditions. This means that, to generate the best panorama videos, we need to correct the colors of all the frames to remove color disparities. In our panorama pipeline, this is done by the Color-corrector module running on the GPU.

We choose to do the color correction after warping the images. The reason is that locating the overlapping regions is easier with aligned images, and the overlap is also needed when stitching the images together. The algorithm is executed on the GPU, enabling fast color correction within our pipeline. The implementation is based on the algorithm presented in [24], but with some minor modifications. We calculate the color differences between the images for every single set of frames delivered from the cameras.

                                Min      Max      Mean
CPU (Intel Core i7-2600)        1995.8   2008.3   2001.2
GPU (Nvidia GeForce GTX 680)    0.0      41.0     23.2

Table II. Color correction timings (ms).

Currently, we color-correct each image in sequence, meaning that each image is corrected according to the overlapping frame to its left. The algorithm is easy to parallelize and does not make use of pixel-to-pixel mapping, which makes it well suited for our scenario. Table II shows a comparison between running the algorithm on the CPU and on a GPU.
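The following CPU sketch conveys the principle in a strongly simplified form: estimate a per-channel gain from the region overlapping the already-corrected image to the left and apply it to the current image. The actual GPU module follows [24] and is more elaborate; both images are assumed here to have the same dimensions and RGBA layout.

    #include <vector>
    #include <cstdint>

    void correctTowardsLeftNeighbor(std::vector<uint8_t>& img, int width, int height,
                                    const std::vector<uint8_t>& leftImg,
                                    int overlapWidth /* columns shared with left image */) {
        double sumLeft[3] = {0, 0, 0}, sumCur[3] = {0, 0, 0};
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < overlapWidth; ++x)
                for (int c = 0; c < 3; ++c) {
                    // Overlap: right edge of the left image vs. left edge of this image.
                    sumLeft[c] += leftImg[4 * (y * width + (width - overlapWidth + x)) + c];
                    sumCur[c]  += img[4 * (y * width + x) + c];
                }

        for (int c = 0; c < 3; ++c) {
            double gain = sumCur[c] > 0 ? sumLeft[c] / sumCur[c] : 1.0;
            for (std::size_t i = c; i < img.size(); i += 4) {   // skip the alpha channel
                double v = img[i] * gain;
                img[i] = static_cast<uint8_t>(v > 255.0 ? 255.0 : v);
            }
        }
    }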

J. The Stitcher module

As in the old pipeline, we use a homography-based stitcher where we simply create seams between the overlapping camera frames and then copy pixels from the images based on these seams. These frames need to follow the same homography, which is why they have to be warped. Our old pipeline used static cuts for the seams, meaning that a fixed rectangular area from each frame is copied directly to the output frame. Static cut panoramas are faster, but can introduce graphical errors in the seam area, especially when there is movement in the scene (illustrated in figure 5(a)).

To make a better seam with a better visual result, we therefore introduce a dynamic cut stitcher instead of the static cut. The dynamic cut stitcher creates seams by first creating a rectangle of adjustable width over the static seam area. It then treats all pixels within the seam area as graph nodes. The graph is directed from the bottom to the top in such a way that each pixel points to the three adjacent pixels above (the left-most and right-most pixels only point to the two available). The weight of each of these edges is calculated by a custom function that compares the absolute color difference between the corresponding pixels in the two frames we are trying to stitch. The weight function also checks the foreground masks from the BGS module to see if any player is in the pixel, and if so, it adds a large weight to the node. In effect, both these steps make edges between nodes where the colors differ or players are present have much larger weights. We then run Dijkstra's algorithm on the graph to create a minimal cost route. Since the path is directed upwards, we can only move up or diagonally from each node, and we will only get one node per row of the seam area.
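The sketch below shows the seam search. While the module runs Dijkstra, the seam graph is a DAG directed strictly upwards, so a simple bottom-up dynamic program computes the same minimum-cost path; the per-pixel costs are assumed to already combine the absolute color difference between the overlapping frames and a large penalty where the warped foreground mask marks a player.

    #include <vector>
    #include <algorithm>

    // costs[y*w + x]: per-pixel cost inside the seam rectangle (w columns, h rows,
    // row 0 at the bottom). Returns the seam x-position for every row.
    std::vector<int> findSeam(const std::vector<int>& costs, int w, int h) {
        const int INF = 1 << 30;
        std::vector<int> dist(costs);                              // row 0: base case
        std::vector<int> from(static_cast<std::size_t>(w) * h, -1);

        for (int y = 1; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                int best = INF, bestX = -1;
                for (int dx = -1; dx <= 1; ++dx) {                 // three pixels below
                    int px = x + dx;
                    if (px < 0 || px >= w) continue;
                    if (dist[(y - 1) * w + px] < best) { best = dist[(y - 1) * w + px]; bestX = px; }
                }
                dist[y * w + x] = best + costs[y * w + x];
                from[y * w + x] = bestX;
            }

        // Pick the cheapest end point in the top row and backtrack the path.
        int x = static_cast<int>(std::min_element(dist.begin() + (h - 1) * w, dist.end())
                                 - (dist.begin() + (h - 1) * w));
        std::vector<int> seam(h);
        for (int y = h - 1; y >= 0; --y) { seam[y] = x; if (y > 0) x = from[y * w + x]; }
        return seam;
    }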

                                Min    Max    Mean
CPU (Intel Core i7-2600)        3.5    4.2    3.8
GPU (Nvidia GeForce GTX 680)    0.0    23.9   4.8

Table III. Dynamic stitching (ms).

An illustration of how the final seam looks can be seen in figure 5, where the seams without and with color correction are shown in figures 5(b) and 5(c), respectively. Timings for the dynamic stitching module can be seen in table III.


                                Min    Max    Mean
CPU (Intel Core i7-2600)        33.2   61.1   42.1
GPU (Nvidia GeForce GTX 680)    0.2    36.5   13.4

Table IV. RGBA to YUV 4:2:0 conversion (ms).

The CPU version is currently slightly faster than our GPU version (searches and branches are often more efficient on traditional CPUs), but further optimization of the CUDA code will likely improve the GPU performance. Note that the min and max numbers for the GPU are skewed by dropped frames (no processing) and by the initial run being slower.

K. The YuvConverter module

Before storing the stitched panorama frames, we need to convert back from RGBA to YUV 4:2:0 for the H.264 encoder, just as in the SingleCamWriter module. However, due to the size of the output panorama, this conversion is not fast enough on the CPU, even with the highly optimized swscale. This module is therefore implemented on the GPU. Table IV shows the performance of the CPU based implementation versus the optimized GPU based version.

Nvidia NPP contains several conversion primitives, but not a direct conversion from RGBA to YUV 4:2:0. The GPU based version therefore first uses NPP to convert from RGBA to YUV 4:4:4, and then self-written CUDA code to convert from YUV 4:4:4 to YUV 4:2:0.
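The second, self-written step can be sketched as a CUDA kernel that averages each 2x2 block of chroma samples, since the luma plane is identical in YUV 4:4:4 and YUV 4:2:0 and can simply be reused; the launch configuration mentioned in the trailing comment is illustrative.

    #include <cuda_runtime.h>
    #include <cstdint>

    __global__ void chroma444to420(const uint8_t* c444, uint8_t* c420, int width, int height) {
        // One thread per output (subsampled) chroma sample.
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. width/2 - 1
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // 0 .. height/2 - 1
        if (x >= width / 2 || y >= height / 2) return;

        int sx = 2 * x, sy = 2 * y;
        int sum = c444[sy * width + sx]       + c444[sy * width + sx + 1] +
                  c444[(sy + 1) * width + sx] + c444[(sy + 1) * width + sx + 1];
        c420[y * (width / 2) + x] = static_cast<uint8_t>((sum + 2) / 4);  // rounded average
    }
    // Launched once for the U plane and once for the V plane, e.g. with a
    // (16,16) thread block and a grid covering width/2 x height/2 samples.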

L. The Downloader module

Before we can write the stitched panorama frames to disk, we need to transfer them back to the CPU, which is done by the Downloader module. It runs as a single CPU thread that copies a frame synchronously to the CPU. We could have implemented the Downloader as an asynchronous transfer with double buffering like the Uploader, but since the performance, as seen in figure 6, is very good, this is left as future work.

M. The PanoramaWriter module

The last module, executing on the CPU, is the Writer that writes the panorama frames to disk. The PanoramaWriter uses the same settings as the SingleCamWriter (three second H.264 video files).

N. Pipeline performance

In order to evaluate the performance of our pipeline, we used an off-the-shelf PC with an Intel Core i7-3930K processor and an Nvidia GeForce GTX 680 GPU. We have benchmarked each individual component and the pipeline as a whole, capturing, processing and storing 1000 frames from the cameras.

In the initial pipeline [8], the main bottleneck was the panorama creation (warping and stitching). This operation alone used 974 ms per frame. As shown by the breakdown into individual component performance in figure 6, the new pipeline is greatly improved. Note that all individual components run in real-time while running concurrently on the same set of hardware. Adding up all the component times, however, gives a total far larger than 33 ms.

Figure 6. Improved stitching pipeline performance, module overview (GeForce GTX 680 and Core i7-3930K).

Figure 7. Inter-departure time of frames when running the entire pipeline. In a real-time scenario, the output rate should follow the input rate (given here by the trigger box) at 30 fps (33 ms). (a) Inter-departure time for all processed frames (SingleCamWriter and PanoramaWriter plotted against the real-time threshold). (b) Core count scalability (mean Reader time for 4 to 16 cores). (c) Core frequency scalability (mean Reader time at 3.2 to 4.4 GHz).

The reason why the pipeline still runs in real-time is that several frames are processed in parallel. Note here that all CUDA kernels execute at the same time on a single GPU, so the performance of each GPU module is affected by the performance of the other GPU modules. On earlier GPUs like the GTX 280, this was not possible, but concurrent CUDA kernel execution was introduced with the Fermi architecture [18] (GTX 480 and above). Thus, since the Controller module schedules the other modules according to the input rate of 30 fps, the amount of resources is sufficient for real-time execution.

For the pipeline to be real-time, the output rate should follow the input rate, i.e., deliver all output frames (both the 4 single camera streams and the 1 panorama) at 30 fps. Thus, to give an idea of how often a frame is written to file, figure 7 shows individual and average frame inter-departure rates. The figures show the time difference between consecutive writes for the generated panorama as well as for the individual camera streams. Operating system calls, interrupts and disk accesses will most likely cause small spikes in the write times (as seen in the scatter plot in figure 7(a)), but as long as the average times are equal to the real-time threshold, the pipeline can be considered real-time. As we can see in figures 6, 7(b) and 7(c), the average frame inter-arrival time (Reader) is equal to the average frame inter-departure time (both SingleCamWriter and PanoramaWriter).


(a) The original stitch pipeline in [8]: a fixed cut stitch with a straight vertical seam, i.e., showing a player getting distorted in the seam.

(b) Dynamic stitch with no color correction. In the left image, one can see the seam search area between the red lines, and the seam in yellow. In the right image, one can clearly see the seam going outside the player, but there are still color differences.

(c) Dynamic stitch with color correction. In the left image, one can see the seam search area between the red lines, and the seam in yellow. In the right image, one cannot see the seam, and there are no color differences. (Note that the seam is also slightly different with and without color correction due to the change of pixel values when searching for the best seam after color correction.)

Figure 5. Stitcher comparison - improving the visual quality with dynamic seams and color correction.

This is also the case when testing other CPU frequencies and numbers of available cores. Thus, the pipeline runs in real-time.

As mentioned above and seen in figure 7(a), there is a small latency in the panorama pipeline compared to writing the single camera streams immediately. The total panorama pipeline latency, i.e., the end-to-end delay of a frame from when it is read from the camera until it is written to disk, is equal to 33 ms per sequential module (as long as the modules perform fast enough) plus a 5 second input buffer (the input buffer is needed because the sensor system has a latency of at least 3 seconds before its data is ready for use). The 33 ms per module is caused by the camera frame rate of 30 fps, meaning that even though a module may finish before the threshold, the Controller will make it wait until the next set of frames arrives before it is signaled to re-execute. This means that the pipeline latency is 5.33 seconds per frame on average (the 5 second buffer plus roughly ten sequential 33 ms steps).

V. DISCUSSION

Our soccer analysis application integrates a sensor system, soccer analytics annotations and video processing of a camera array. There already exist several components that can be used, and we have investigated several alternatives in our research. Our first prototype aimed at full integration at the system level, rather than being optimized for performance. In this paper, however, our challenge has been an order of magnitude harder: making the system run in real-time on low-cost, off-the-shelf hardware.

The new real-time capability also enables future enhancements with respect to functionality. For example, several systems have already shown the ability to serve panorama video to the masses [10], [17], and by also generating the panorama video live, the audience can mark and follow particular players and events. Furthermore, ongoing work also includes machine learning on sensor and video data to extract player and team statistics for evaluation of physical and tactical performance. We can also use this information to automatically make video playlists [12], giving a video summary of extracted events.

Figure 8. Core count scalability (mean processing time of the Controller, Converter, Debarreler, Uploader, SingleCamWriter and PanoramaWriter modules with 4 to 16 cores, plotted against the real-time threshold).

Figure 9. CPU frequency scalability (mean processing time of the same CPU modules at 3.2, 3.5, 4.0 and 4.4 GHz).

Figure 10. GPU comparison (mean processing time of the Uploader, BGS, Warper, Color-corrector, Stitcher, YUVConverter and Downloader modules on the GTX 280, GTX 480, GTX 580 and GTX 680).

Due to limited availability of resources, we have not been able to test our system with more cameras or higher resolution cameras. However, to still get an impression of the scalability capabilities of our pipeline, we have performed several benchmarks changing the number of available cores, the processor clock frequency and the GPU (testing cards with different architectures and amounts of compute resources). Figure 8 shows the results of changing the number of available cores that can process the many concurrent threads in the CPU part of the pipeline (figure 7(b) shows that the pipeline is still real-time). As we can observe from the figure, every component runs in real-time using more than 4 cores, and the pipeline as a whole using 8 or more cores.

(1) Note that this experiment was run on a machine with more available cores (16), each at a lower clock frequency (2.0 GHz), compared to the machine installed at the stadium which was used for all other tests.


Furthermore, the CPU pipeline contains a large, but configurable, number of threads (86 in the current setup), and due to the many threads of this embarrassingly parallel workload, the pipeline seems to scale well with the number of available cores. Similar conclusions can be drawn from figure 9, where the processing time is reduced with a higher processor clock frequency, i.e., the pipeline runs in real-time already at 3.2 GHz, and there is an almost linear scaling with CPU frequency (figure 7(c) shows that the pipeline is still real-time). In particular, the H.264 encoder scales very well with the CPU frequency. With respect to the GPU part of the pipeline, figure 10 plots the processing times using different GPUs. The high-end GPUs, GTX 480 and above (compute capability 2.x and higher), all achieve real-time performance on the current setup. The GTX 280 only has compute capability 1.3, which does not support the concurrent CUDA kernel execution of the Fermi architecture [18], and its performance is therefore slower than real-time. As expected, more powerful GPUs reduce the processing time. Since one GPU currently fulfills our real-time requirement, we did not experiment with multiple GPUs, but the GPU processing power can easily be increased by adding multiple cards. Thus, based on these results, we believe that our pipeline can easily be scaled up to both a higher number of cameras and higher resolution cameras.

VI. CONCLUSIONS

In this paper, we have presented a prototype of a real-time panorama video processing system. The panorama prototype is used as a sub-component in a real sports analysis system where the target is automatic processing and retrieval of events at a sports arena. We have described the pipeline in detail, where we use both the CPU and a GPU for offloading. Furthermore, we have provided experimental results which prove the real-time properties of the pipeline on a low-cost 6-core machine with a commodity GPU, both for each component and for the combination of the different components forming the entire pipeline.

The entire system is under constant development, and new functionality is added all the time. Ongoing work therefore also includes scaling the panorama system up to a higher number of cameras and to higher resolution cameras. So far, the pipeline scales nicely with the CPU frequency, the number of cores and the GPU resources. We plan to use PCI Express-based interconnect technology from Dolphin Interconnect Solutions for low latency and fast data transfers between machines. Experimental results in this respect are, however, ongoing work and out of scope for this paper.

REFERENCES

[1] Camargus - Premium Stadium Video Technology Infrastructure. http://www.camargus.com/.
[2] Live ultra-high resolution panoramic video. http://www.fascinate-project.eu/index.php/tech-section/hi-res-video/. [Online; accessed 04-march-2012].
[3] Software stitches 5k videos into huge panoramic video walls, in real time. http://www.sixteen-nine.net/2012/10/22/software-stitches-5k-videos-huge-panoramic-video-walls-real-time/, 2012. [Online; accessed 05-march-2012].
[4] M. Adam, C. Jung, S. Roth, and G. Brunnett. Real-time stereo-image stitching using GPU-based belief propagation. 2009.
[5] P. Baudisch, D. Tan, D. Steedly, E. Rudolph, M. Uyttendaele, C. Pal, and R. Szeliski. An exploration of user interface designs for real-time panoramic photography. Australasian Journal of Information Systems, 13(2):151, 2006.
[6] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74(1):59-73, 2007.
[7] D.-Y. Chen, M.-C. Ho, and M. Ouhyoung. VideoVR: A real-time system for automatically constructing panoramic images from video clips. In Proc. of CAPTECH, pages 140-143, 1998.
[8] P. Halvorsen, S. Sægrov, A. Mortensen, D. K. C. Kristensen, A. Eichhorn, M. Stenhaug, S. Dahl, H. K. Stensland, V. R. Gaddam, C. Griwodz, and D. Johansen. Bagadus: An integrated system for arena sports analytics - a soccer case study. In Proc. of MMSys, pages 48-59, 2013.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[10] K. Huguenin, A.-M. Kermarrec, K. Kloudas, and F. Taiani. Content and geographical locality in user-generated content sharing systems. In Proc. of NOSSDAV, 2012.
[11] J. Jia and C.-K. Tang. Image stitching using structure deformation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4):617-631, 2008.
[12] D. Johansen, H. Johansen, T. Aarflot, J. Hurley, A. Kvalnes, C. Gurrin, S. Sav, B. Olstad, E. Aaberg, T. Endestad, H. Riiser, C. Griwodz, and P. Halvorsen. DAVVI: A prototype for the next generation multimedia entertainment platform. In Proc. of ACM MM, pages 989-990, Oct. 2009.
[13] P. KaewTraKulPong and R. Bowden. An improved adaptive background mixture model for realtime tracking with shadow detection, 2001.
[14] A. Levin, A. Zomet, S. Peleg, and Y. Weiss. Seamless image stitching in the gradient domain. Computer Vision - ECCV 2004, pages 377-389, 2004.
[15] Y. Li and L. Ma. A fast and robust image stitching algorithm. In Proc. of WCICA, volume 2, pages 9604-9608, 2006.
[16] A. Mills and G. Dudek. Image stitching with dynamic elements. Image and Vision Computing, 27(10):1593-1602, 2009.
[17] O. A. Niamut, R. Kaiser, G. Kienast, A. Kochale, J. Spille, O. Schreer, J. R. Hidalgo, J.-F. Macq, and B. Shirley. Towards a format-agnostic approach for production, delivery and rendering of immersive media. In Proc. of MMSys, pages 249-260, 2013.
[18] Nvidia. Nvidia's next generation CUDA compute architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf, 2010. [Online; accessed 08-march-2013].
[19] V. Pham, P. Vo, V. T. Hung, and L. H. Bac. GPU implementation of extended Gaussian mixture model for background subtraction. 2010.
[20] S. Sægrov, A. Eichhorn, J. Emerslund, H. K. Stensland, C. Griwodz, D. Johansen, and P. Halvorsen. Bagadus: An integrated system for soccer analysis (demo). In Proc. of ICDSC, Oct. 2012.
[21] O. Schreer, I. Feldmann, C. Weissig, P. Kauff, and R. Schafer. Ultra high-resolution panoramic imaging for format-agnostic video production. Proceedings of the IEEE, 101(1):99-114, Jan.
[22] W.-K. Tang, T.-T. Wong, and P.-A. Heng. A system for real-time panorama generation and display in tele-immersive applications. IEEE Transactions on Multimedia, 7(2):280-292, 2005.
[23] C. Weissig, O. Schreer, P. Eisert, and P. Kauff. The ultimate immersive experience: Panoramic 3D video acquisition. In K. Schoeffmann, B. Merialdo, A. Hauptmann, C.-W. Ngo, Y. Andreopoulos, and C. Breiteneder, editors, Advances in Multimedia Modeling, volume 7131 of Lecture Notes in Computer Science, pages 671-681. Springer Berlin Heidelberg, 2012.
[24] Y. Xiong and K. Pulli. Color correction for mobile panorama imaging. In Proc. of ICIMCS, pages 219-226, 2009.
[25] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 666-673, 1999.
[26] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In Proc. of ICPR, pages 28-31 (vol. 2), Aug. 2004.
[27] Z. Zivkovic and F. van der Heijden. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters, 27(7):773-780, 2006.