
The University of Manchester

Claudia-Ioana Ivan

BSc Computer Science

School of Computer Science

Final Year Project Report

Video Texture Synthesis

Supervisor: Dr. Aphrodite Galata

May 2016


Abstract

This report details the re-implementation of the

SIGGRAPH 2000 paper and it describes the whole creation

process of a new type of medium which is known as video

textures. Created from an input video of fixed length, it

combines the advantages of both images and video and it

revolutionises work in the area of both computer graphics

and vision. All the steps taken towards automatically

generating video textures are extensively explained and the

obtained results are then examined. To sum everything up,

further work is discussed, along with a reflection on what

has been achieved during the allocated time span.


Acknowledgements

I would like to thank my supervisor, Dr. Aphrodite Galata,

for offering guidance and help throughout the development

of my final year project.

I would also like to thank my family, as without their help

and support my attendance at The University of

Manchester would not have happened.


Table of contents

1. Context

1.1. Introduction
1.2. Goal of the Project
1.3. Related Work
1.3.1. SIGGRAPH 2000
1.3.2. Video Rewrite
1.3.3. Video Sprites
1.3.4. Panoramic Video Texture
1.4. Report Structure

2. Development

2.1. System Overview
2.2. Pre-processing the Input Video
2.3. Extracting the Frames
2.4. Analysing the Input Video
2.4.1. Representation
2.4.2. Euclidean Distance
2.4.3. From Distances to Probabilities
2.4.4. Preserving Dynamics
2.4.5. Further Optimisations
2.5. Synthesis
2.5.1. Random Play
2.5.2. Video Loops
2.6. Rendering


3. Implementation and Evaluation

3.1. Organising the Tasks
3.2. Implementation
3.3. Evaluation

4. Reflection and Conclusion

4.1. Reflection
4.2. Future Work and Conclusion

References


Chapter 1

Context

1.1 Introduction

Although, as the famous English idiom says, “a picture is worth a thousand words”, static images cannot do justice to natural or man-made phenomena such as waterfalls, candle flames, splashing water, or a person performing jumping jacks. Less than a hundred years since the idiom was first used, with the help of constantly progressing technology, we have evolved from replacing descriptions with images to replacing images with videos.

Videos have the power to capture the attention and immerse the viewer in the story being told to a greater extent than pictures can. Yet despite their strong visual and auditory impact on people's feelings and thoughts, videos are always of finite length, as they must be stored on a computer or some other kind of storage device. They have a fixed structure, and the recording always follows chronological order, even though a video can be edited to change this natural order, e.g. by interchanging scenes.

Page "1

Page 7: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Video textures help overcome this drawback. A combination of image and video, they are generated from a finite-length video (and therefore a finite number of frames) by randomly changing the order in which the recorded frames are played.

This new type of medium can replace both images and regular videos, as it has numerous and diverse applications. For example, video textures could be used on a website as an interactive way of presenting the company members, much like the moving portraits in the Harry Potter films. This lively replacement can stand in for both images and looping videos. Dynamic backgrounds are a great way of advertising a business: an interactive fish tank can promote an aquarium, and continuously moving carousels and roller-coasters can offer a glimpse of what a theme park looks like. Video textures have also been used in video games, where video sprites better convey reality and liveliness and enhance the player's experience.

1.2 Goal of the Project

The ultimate goal of the project is to overcome the problems involved in creating video textures and to generate an infinitely long new video from an input one.

The core of video texture synthesis is a rather simple algorithm which detects good transitions between frames, i.e. places where we can jump from one frame to another

Page "2

Page 8: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

while keeping the jump unnoticeable, or at least barely noticeable, to the viewer. As Szeliski mentions in an article on video textures, “Once the transition points have been discovered and catalogued, we can improve the quality of the resulting animation using tricks such as blending and morphing to disguise the transitions” [1].

Szeliski also suggests in the same article that the video can be analysed by region (an individual area of the video) when several objects move independently or in complex scenes. This approach reduces the number of samples needed to generate the video texture, and the result is of far better quality than what would have been produced using the whole video frame.

Since the aim is for the transitions to be as unnoticeable as possible, the second challenge of the project is to use the previously found transition points to compile a sequence that follows the overall arrangement of the initial video. The fact that a transition could cause a jump to a dead end has to be taken into consideration, as this might lead to an unexpected and abrupt termination of the video texture. Additionally, because we might be unable to find perfect transitions, we sometimes resort to manually smoothing visually disturbing disruptions.

Individually, the above-mentioned challenges have rather straightforward solutions. It is of utmost importance to compile, from the start, a solid transition table (matrix) which stores the likelihood that each transition is good. Beyond a doubt, this table ensures a well-generated video texture.

Page "3

Page 9: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

1.3 Related Work

Initially developed by Richard Szeliski of Microsoft Research, together with colleagues Arno Schödl and Irfan Essa at Georgia Tech and David Salesin at the University of Washington, this new type of medium was introduced in the SIGGRAPH 2000 paper [2]. Although this third-year project is mainly based on that paper, extensive research has been carried out not only by the above-mentioned authors but by me as well. This subsection also includes information from both the computer graphics and computer vision fields that is strongly related to video texture synthesis.

1.3.1 SIGGRAPH 2000

The fact that the use of images of scenes and objects was increasing, and that computer graphics was rapidly turning towards image-based modelling and leaving behind the need for geometry, was a starting point in developing this paper. The authors of the SIGGRAPH 2000 paper explain their interest as follows:

“As Debevec points out [13], this trend parallels the one that occurred in music synthesis a decade ago, when sample-based synthesis replaced more algorithmic approaches like frequency modulation. To date, image-based rendering

techniques have mostly been applied to still scenes such as architecture [8, 18], although they have also been used to cache and accelerate the renderings produced by conventional graphics hardware [27, 32].”

Page "4

Page 10: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

The transition from image-based rendering to video textures, which can be seen “as a kind of video-based rendering” [2], emphasises the evolution of both computer graphics and computer vision. The reuse of captured video was not a widely explored area at the time the Video Textures paper was researched and written.

1.3.2 Video Rewrite — SIGGRAPH 1997

Video Rewrite [3] is the closest work to the idea of video textures: it manipulates an existing video to create a new one of a person mouthing words they have not previously spoken. It was the first facial-animation system to automate all labelling. It uses a multitude of computer vision techniques to localise the mouth during the training phase (head orientation and the position and shape of the jaw are also important features that are localised and analysed), and morphing techniques to blend the mouth's movements for the final video.

This technique is widely used in dubbed foreign films, where the dubbed audio must be synchronised with the original dialogue. Because the lip movement of the translation did not match the original footage, Video Rewrite revolutionised the field and made it possible to resync the existing sequence to the newly recorded audio.

Figure 1 below gives an overview of the analysis stage of Video Rewrite. During this step, the audio track is segmented into triphones, and the chosen selections make it possible to synchronise the mouth movement with any subsequently recorded audio.

Page "5

Page 11: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Figure 1: Overview of analysis stage in Video Rewrite

1.3.3 Video Sprites — SIGGRAPH 2002

Following the development of video textures, various ways of applying them have been explored. Again using computer vision techniques, objects can be abstracted from the background and rendered independently. These 2D abstractions are known as video sprites [4]. They may be placed anywhere, used interactively and in real time, and combined, resulting in extremely complex and sophisticated scenes.

Arno Schödl and Irfan A. Essa, two of the authors of the Video Textures paper, presented a follow-up at the SIGGRAPH 2002 conference titled “Controlled Animation of Video Sprites” [4]. It introduces an algorithm that helps create animations of realistic characters, especially animals, which are difficult to train in the real world and to model in 3D.

Page "6

Page 12: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Figure 2 below offers an overview of the process of creating such an animation. The similarities between video textures and video sprites are easy to observe, as the latter is just an extension of the concept of video textures. The only difference is the aim, which in the case of video sprites is to animate moving objects.

Figure 2: Creating an animation from video sprite samples

1.3.4 Panoramic Video Texture — SIGGRAPH 2005

Panoramic Video Textures [5] were introduced at SIGGRAPH 2005 by researchers from the University of Washington, the University of Massachusetts Amherst, and Microsoft Research. The work carries on the ideas behind video textures, but it differs in that it requires a much larger set of data.

Just like video textures, PVTs select fragments of the input video which can later be combined together. The novelty lies in using a panning shot taken from a fixed position, rather than a fixed camera perspective: although only a small section of the whole scene is being captured at any moment, the resulting PVT conveys the idea of movement throughout the whole scene.

Page "7

Page 13: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

1.4 Report Structure

Thus far, the focus of this report has been on giving a broad overview of the concept behind video textures, on the need to develop this new type of medium, and on how it has influenced further projects in the fields of computer graphics and vision.

The remainder of the report focuses on the development of video textures, the means of testing, and reflections and conclusions. Starting from the most important requirement, an input video, it covers the prerequisite technical and mathematical knowledge, then moves on to the software, frameworks and libraries used in the making of the system. Both high- and low-level details are given, together with the approach taken towards implementing the project. The way in which the system was tested, and how its correctness and quality were analysed, is described in the third chapter.

The report ends with a reflection on what was achieved during the implementation stage of the final year project, examining the initial plan presented at the seminar back in November and discussing the tasks that were not completed, whether because of technical or time limitations. In the last chapter, apart from the introspection on the aspects that changed during the six months allocated for practical work on the project, I also look at the knowledge gained in the fields of computer graphics and vision.

Page "8

Page 14: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Chapter 2

Development

2.1 System Overview

Before looking into how video textures are created, it is necessary to take a step back and see the overall picture. It is well understood that they are created from a (rather short) finite-length input video and result in “a continuous, infinitely varying stream of video images” [2]. But why is the term “textures” used?

When it comes to computer graphics, the term can easily get confusing: it does not necessarily mean the general pattern (the repeated geometry or design) of a 2D/3D object. As Helmut Cantzler suggests in “An overview of image texture” [6], a more suitable term would be colour map, because the texture is mapped onto an already existing surface.

Additionally, these visual patterns repeat in what is known as a quasi-periodic way. According to Wikipedia, quasi-periodicity is “the property of a system that displays irregular periodicity” [7]. Periodicity can be defined as something happening at regular intervals, e.g. every two weeks. Consequently, quasi-periodic behaviour has a pattern of repetition, but it also comprises an element

Page "9

Page 15: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

of fluctuation/inconsistency. Having all of the above in mind, it is safe to conclude that video textures are just quasi-periodic repetitions of frames which are (randomly) rearranged and blended together.

As the main goal of this final year project was the reimplementation of the Video Textures paper [2], a solid list containing all the requirements and steps had to be compiled. This also proved useful in estimating the time needed for each component, an estimation later used to create the initial plan.

The general approach that is to be taken when creating video textures is to find places where transitions from one frame to another can be made without the viewer noticing discontinuities. Consequently, the system can be organised in three components, as seen below in Figure 3.

Figure 3: System overview (analyse the input video → synthesise the new video from the analysed one → put together the frames/frame pieces)


2.2 Pre-processing the input video

Depending on the input video being used, the pre-processing step can sometimes be skipped. Otherwise, we either equalise the brightness, which is easily done by using background portions that remain the same, or remove any jitter caused by the camera not being stable at recording time. If necessary, Adobe After Effects can be used to stabilise the video.

Additionally, if the object of interest does not occupy a large area of the input video, cropping it is an optimal solution: it saves a lot of storage space and increases performance, since the number of pixels to be compared in each frame can sometimes even be halved without losing any important information (Figure 4).

Figure 4: Cropping video area

As can be observed in Figure 4, the object of interest occupies only a small portion of the whole video, so

Page "11

Page 17: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

it is safe to remove the part of the background that does not contribute at all to the creation of the video texture.

If, when rendering the video texture, we want it to have the same width and height as the initial input video, we can split and store both the cropped and the uncropped videos, but during the analysis and synthesis steps we only use the cropped one.
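If the cropping is done per frame rather than in a video editor, a minimal MATLAB sketch might look as follows (the region coordinates and file names are illustrative assumptions, not values from this project):

% Crop one extracted frame to the region of interest.
roi = [200 150 640 360];               % [x y width height], hypothetical values
img = imread('frames/frame_0001.png'); % illustrative file name
cropped = imcrop(img, roi);            % requires the Image Processing Toolbox
imwrite(cropped, 'frames_cropped/frame_0001.png');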

2.3 Extracting the frames

Although this step is not actually included in the system overview, it is of utmost importance to extract the frames from the input video before starting the analysis phase. It is best to store all the frame data, because it has to be accessed numerous times during the implementation.

Instead of using the OpenCV library, which does have a class named VideoCapture that reads a video as a sequence of frames, I chose the ffmpeg command line tool, as it was faster and did not involve a lot of code: the extraction can be done in a single line, as seen in Figure 5.

$ ffmpeg -i inputFile -r noOfDesiredFramesPerSecond outputFolder/frame_%04d.png

Figure 5: Extracting all the frames using ffmpeg

Page "12

Page 18: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

The only drawback of ffmpeg compared with OpenCV is that the exact number of desired frames per second has to be supplied when the command is run. However, knowing that video is typically displayed at 24 FPS, and taking into consideration the need for smooth transitions, I chose a higher frame rate, more precisely 27.

This number was chosen after multiple tries with different frame rates. At 27 frames per second, ffmpeg was neither duplicating nor dropping frames, which indicates that the chosen value matches the source material. After the frames have been extracted and stored as individual images for further use, the actual process of creating video textures can begin.

2.4 Analysing the input video

2.4.1 Representation

Although video textures are essentially frames strategically stitched together to create a continuous video, they can also be seen as Markov processes, meaning that they satisfy the Markov property. In this representation, each state corresponds to a single frame, and each probability translates into the likelihood of making a transition to a given frame, possibly even to itself.

The Markov property can be explained in other words as the “memoryless property of a stochastic process” [8]. This basically means that the probability of future states of the

Page "13

Page 19: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

current process depends only on the current state, and not on the sequence of states that came before, as seen below in Figure 6.

Figure 6: Example of a Markov process.

Consequently, two ways of storing the information needed for the video texture are available. The more advantageous one, and the one I decided to work with, is a simple matrix of probabilities (Figure 7). The alternative is a set of links (Figure 8) from frame i to frame j, where each link has a probability attached to it.

The probability matrix is fundamentally a similarity measurement between each pair of frames, including the similarity of each frame with itself. Each element Pij of the matrix corresponds to the probability of making a transition from frame i to frame j, and it is based on the Euclidean distance.

2.4.2 Euclidean Distance — Distance Matrix

Now that all the frames have been extracted from the input video, it is possible to start looking for good transitions between them. In this implementation, the L2 distance is used, as it is a simple yet effective measure.

Page "14

Page 20: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

The Euclidean distance is the distance between two points; in our case, it represents the distance (and hence the similarity) between each pair of images.

If point p = (p1, p2,…, pn) and point q = (q1, q2,…,qn) are two points in Euclidean n-space, the distance from point p to point q is given by the Pythagorean formula:

d(p, q) = √((q1 − p1)² + (q2 − p2)² + … + (qn − pn)²)

Figure 9: Euclidean Distance formula

The function can be described as the sum of the squared differences between each pair of pixels in each pair of frames; the square root of this sum is then calculated and stored in the distance matrix as follows:

Dij = ||Ii - Ij||2

which denotes the L2 distance between frame Ii and frame Ij. Later on, during the synthesis phase, we will want to transition from frame i to frame j whenever frame i + 1 is similar to frame j. The value of each distance ranges from 0 (self-comparison or identical frames) up to 255 (the maximum per-pixel difference) multiplied by the size of the frame.
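As an illustration, the following MATLAB sketch (the folder layout and file names are assumptions, not the exact code used in this project) computes the full distance matrix over a set of extracted grayscale frames:

% Build the L2 distance matrix between all pairs of frames. Zero-padded
% file names (frame_0001.png, ...) ensure dir() returns them in temporal order.
files = dir(fullfile('frames', '*.png'));
N = numel(files);
frames = cell(N, 1);
for i = 1:N
    frames{i} = double(imread(fullfile('frames', files(i).name)));
end
D = zeros(N, N);
for i = 1:N
    for j = 1:N
        delta = frames{i} - frames{j};
        D(i, j) = sqrt(sum(delta(:).^2));   % Dij = ||Ii - Ij||2
    end
end

Since D is symmetric, computing only the upper triangle and mirroring it would roughly halve the running time.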

After the distances are computed, it is possible to prune the frames, especially in places where we have long consecutive sequences of identical frames.

Page "15

Page 21: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

2.4.3 From distances to probabilities

The next step of the analysis stage is to map the Euclidean distances to probabilities, resulting in a Markov chain representation of the frames. This is a fundamental step before moving to the synthesis phase, because good transitions between frames are selected based on it. At run time, the next frame to display is chosen based solely on Pij, so the correctness of this step is critical: even the smallest mistake can completely change the resulting video texture.

Smaller distances are translated into higher probabilities, whereas bigger distances are mapped to lower ones. The simplest way to do this (and the one suggested by the paper’s authors) is to use an exponential function, such as:

Pij ∝ exp(-Dij/𝜎)

The 𝜎 parameter controls the mapping between the L2

distances and the probability of using a given transition. The smaller 𝜎 is, the better the transitions will be, but the number of available ones might significantly decrease (as can be seen in Figure 10). Larger values allow for a bigger variety while sacrificing the quality of the transitions. It is advisable to set the value of 𝜎 to a small multiple of the average value of the distance matrix rather than to an arbitrary value, so that the probability at any given frame is fairly low and based on the actual input. Figure 10 below shows what a major role this parameter plays.

Page "16

Page 22: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Figure 10: 𝜎 multiplier of 0.7, 0.4 and 0.2 respectively

After the Euclidean distances between each pair of frames have been translated into probabilities, each row of the new probability matrix is normalised so that ∑j Pij = 1 (Figure 11 below).

Figure 11: Normalised probability matrices
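A minimal MATLAB sketch of this mapping and row normalisation, assuming the distance matrix D from the previous step (the 𝜎 multiplier is an illustrative choice):

% Map distances to probabilities, then normalise each row to sum to 1.
sigma = 0.4 * mean(D(:));               % small multiple of the mean distance
P = exp(-D / sigma);                    % Pij ∝ exp(-Dij / sigma)
P = bsxfun(@rdivide, P, sum(P, 2));     % row-wise normalisation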

2.4.4 Updating distances while preserving dynamics

After the L2 distances have been mapped to probabilities, the resulting matrix could be passed to the synthesis stage, but, depending on the nature of the input video, the result might be rather disappointing and quite displeasing to the eye of the viewer.

Page "17

Page 23: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

For instance, if we consider the pendulum video, the derived video texture would probably contain multiple dead ends and unnatural movements, as the swinging motion might present uncoordinated jumps. In addition to preserving similarity, it is recommended to preserve the dynamics of the video as well.

This problem can easily be solved by considering not only the immediately following frame, but a sequence of temporally adjacent frames within a weighted window. This sequence matching can be achieved by diagonally filtering the distance matrix with a weighted kernel [w−m, …, wm−1].

Figure 12: Different dynamics but similar frames

The binomial weights can be further modified in order to give different or equal priority to each distance within the window.

D′ij = ∑ (k = −m to m − 1) wk · D(i+k)(j+k)

Figure 13: Filtered distances function

Page "18

Page 24: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

The kernel half-width is m, usually set to one or two, corresponding to a 2-tap or a 4-tap filter respectively. After the distance matrix has been filtered, the probability matrix must be recomputed with the same exponential function shown above and then renormalised.
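A sketch of this diagonal filtering in MATLAB, with m = 2 and binomial weights; the simple clamping at the sequence boundaries is an assumption of this sketch, as the paper treats the edges more carefully:

% Diagonally filter D with binomial weights to preserve dynamics.
m = 2;
w = [1 3 3 1] / 8;                      % binomial weights for a 4-tap filter
Dfilt = zeros(N, N);
for i = 1:N
    for j = 1:N
        s = 0;
        for k = -m:(m - 1)
            ii = min(max(i + k, 1), N); % clamp indices at the boundaries
            jj = min(max(j + k, 1), N);
            s = s + w(k + m + 1) * D(ii, jj);
        end
        Dfilt(i, j) = s;
    end
end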

2.4.5 Further optimisations

Although at this point we could move on to the next step, further optimisations can be made to increase the quality of the video texture that is to be generated.

One available option is to select only the local maxima in the probability matrix for a given destination frame. Keeping only the local maxima identifies “sweet spots” in the matrix of transitions: the neighbourhoods of good transitions are often very similar to other neighbourhoods in the same probability matrix, and by pruning the transitions, only the best ones are kept.

To find the local maxima, the neighbouring transitions surrounding each transition are checked for the largest probability. The region size usually varies with the matrix size, but a 3x3 window is typically used. After the local maximum has been found in a region, the remaining probabilities in that region are set to 0 and we move on to the next transition. The algorithm is repeated until all transitions have been checked, leaving only the optimal ones.
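A minimal sketch of this pruning over a 3x3 window (the border rows and columns are skipped for simplicity, which is an assumption of this sketch):

% Keep only transitions that are local maxima of P in a 3x3 neighbourhood.
Ppruned = zeros(N, N);
for i = 2:N-1
    for j = 2:N-1
        window = P(i-1:i+1, j-1:j+1);
        if P(i, j) > 0 && P(i, j) == max(window(:))
            Ppruned(i, j) = P(i, j);
        end
    end
end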

Page "19

Page 25: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Another option, which can either be combined with the first one or used on its own, is to prune the probabilities using a threshold value. This method is especially useful for video loops (explained further in the next section), as in that case we want to find sequences of frames that can be played continuously. The threshold can be chosen depending on the number of desired transitions, so there is no single correct way of choosing this value.

In practice, though, with a very high threshold the number of transitions from each frame might be fairly low, sometimes even leaving frames with no available transitions at all. Since this behaviour is undesirable, a well-considered value should be chosen.
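A sketch of the threshold pruning, where the threshold value itself is purely illustrative:

% Zero out weak transitions, then renormalise the surviving rows.
theta = 0.01;                          % illustrative threshold
P(P < theta) = 0;
rowSums = sum(P, 2);
rowSums(rowSums == 0) = 1;             % avoid dividing empty rows by zero
P = bsxfun(@rdivide, P, rowSums);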

After all the optimisations have been performed, and the transitions have been pruned, we can move on to the second stage in our implementation.

Page "20

Page 26: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

2.5 Synthesis

Once the good transitions have been identified in the previous stage, it is time to decide on the order in which the frames will be played in the video texture. There are two available algorithms to choose from: random play and video loops.

2.5.1 Random play

This method is easy to both describe and apply, as it is a simple and straightforward Monte Carlo approach. Monte Carlo methods are used in three different situations: optimisation, numerical integration, and generating draws from a probability distribution.

The approach proves useful in impromptu situations where video textures are created without much analysis, yet it still generates high-quality results. The video texture can begin at any point before the last transition that has a non-zero probability. The next frame j to be played after the current frame i is chosen according to the probability matrix Pij. Although frame i + 1 tends to have the highest probability in most cases, it is recommended to avoid it and to use it only when there is no better transition.
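A minimal sketch of random play, assuming the pruned, row-normalised matrix P; sampling uses an inverse-CDF lookup so that no extra toolbox is needed:

% Generate a sequence of frame indices by random play.
numOutputFrames = 500;                 % illustrative output length
current = 1;                           % starting frame
sequence = zeros(1, numOutputFrames);
for t = 1:numOutputFrames
    sequence(t) = current;
    cdf = cumsum(P(current, :));
    current = find(cdf >= rand, 1);    % sample the next frame from row i of P
end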

2.5.2 Video Loops

The second available method for synthesising video textures is known as video loops. A video loop consists of multiple

Page "21

Page 27: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

primitive loops which are combined and form a compound one.

A primitive loop is a loop with only a single transition, from frame i back to frame j, where i is the source frame and j is the destination frame. Every primitive loop has an associated cost, equal to the filtered distance between the pair of frames i and j.
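As a sketch, and under the assumption that a backward transition (j < i) is what closes a loop, the primitive loops and their costs could be enumerated from the pruned probabilities and the filtered distance matrix of Section 2.4.4:

% List primitive loops: backward transitions i -> j with their costs.
loops = zeros(0, 3);                   % rows of [source, destination, cost]
for i = 1:N
    for j = 1:i-1
        if Ppruned(i, j) > 0
            loops(end+1, :) = [i, j, Dfilt(i, j)]; %#ok<AGROW>
        end
    end
end
loops = sortrows(loops, 3);            % cheapest (best) loops first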

Additional cyclic sequences, named compound loops, can be created by combining multiple primitive loops; in order to add one loop to another, their ranges need to overlap. The resulting range is the union of the ranges of the two combined loops (either primitive or compound), and the length and cost are the sums of the initial lengths and costs respectively. A compound loop might contain multiple repetitions of the same primitive loop, so it can be represented as a multiset, in which the ordering of the loops does not matter.

The best compound loop of a given length can be found by enumerating all multisets of transitions that have an overall length L, after which the best ones are selected and kept.

Optimal loops can be generated using a dynamic programming algorithm which constructs a table of L rows and N columns, where L is the maximum length to be considered and N is the number of transitions (primitive loops) to be considered. The algorithm's goal is to compute a list of the best compound loops of each length that contain at least one instance of the primitive loop listed at the top of

Page "22

Page 28: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

each column. This is achieved by updating cells one row at a time.

The next step is to schedule the primitive loops in an order that generates a valid compound loop. We repeatedly choose a transition that starts at the end of the current sequence; removing primitive loops from the set in this way may break the remaining ones into one or more sets of continuous ranges, and the process continues until all primitive loops have been scheduled.

When using this algorithm, it is worth keeping in mind its complexity, which is quadratic in the number of transitions in the compound loop. Compared with the Monte Carlo sequencing algorithm, it produces far more complex video textures, but because of its complexity it might not be the right choice in every situation.

Page "23

Page 29: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

2.6 Rendering

The last stage in creating video textures is the rendering of the chosen transitions. Although transitions that do not introduce major discontinuities are preferred, there are situations where no ideal transitions can be found. This section describes ways in which visible discontinuities can be masked, along with techniques for blending frames together.

Cross-fading is a way of blending frames together across a transition, and it comes as a replacement for the basic jump from one frame to another. The process is fairly simple: the frames near the start of a transition are gradually faded out, while the ones near its end are linearly faded in. The fade is positioned so that it is halfway complete at the point where the transition is scheduled.
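A minimal sketch of such a linear cross-fade over a transition window of T frames; the cell arrays of outgoing and incoming frames are assumptions standing in for the two sequences being joined:

% Linearly cross-fade T outgoing frames into T incoming frames.
T = numel(outgoing);
blended = cell(T, 1);
for t = 1:T
    alpha = t / (T + 1);               % fade is halfway complete mid-transition
    blended{t} = uint8((1 - alpha) * double(outgoing{t}) ...
                       + alpha * double(incoming{t}));
end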

However, even though this technique removes abrupt jumps, it blurs the images if misalignments between frames are present. Thus, even though the viewer will not notice the jumps, the change from sharp to blurry and back to sharp can be noticeable. A constant level of blur can be achieved by taking frequent transitions, so that frames are constantly being cross-faded together.

Another approach that reduces the blur of the transitions is to morph the two sequences together, so that their common features become aligned.

In the case of video textures consisting of multiple individually analysed regions, the rendering stage blends

Page "24

Page 30: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

them smoothly by using the feathering approach, which smooths or blurs the edges of each region.

Compared with the analysis step, the rendering phase does not involve intricate or crucial work that affects the whole system. It simply stitches together the transitions carefully computed during the first two stages and wraps everything up.

Page "25

Page 31: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Chapter 3

Implementation and

Evaluation

3.1 Organising the tasks

The very first step taken towards the implementation of the Video Textures paper was to identify all the steps to be followed and to use this information to create a plan to guide the work (Figure 14). The plan was also presented at the seminar back in November, and even though it was carefully followed, it did suffer minor tweaks.

Figure 14: Initial Plan

Page "26

Page 32: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Because the plan did not include much detail, the time allocated to each task did not prove very realistic; for example, the computation of the Euclidean distances and their mapping to probabilities turned out to be a much more difficult task than originally estimated.

However, some phases, e.g. rendering, took less time than expected, and in addition, more tasks were later introduced to ensure the finalisation and success of the project.

3.2 Implementation

A key piece in the creation of video textures is, of course, the input video. Because I felt the need to experiment and not depend on what had previously been done, I decided to film my own video and work with it in addition to the pendulum one. This proved highly instructive: since I had no reference material to compare the resulting matrices against, extra time was spent making sure everything performed as expected.

As mentioned in Section 2.3, I chose the ffmpeg command line tool over the OpenCV library to extract all the frames from the input video. This saved a lot of time, as working with the library for this purpose was a tedious and arduous process.

However, due to the fact that it is a very powerful cross-platform open source Computer Vision library, OpenCV

Page "27

Page 33: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

was extensively used in the final rendering stage, when the video texture was stitched together. MATLAB was used to generate the distance and probability matrices, as it contains not only a large number of tools and functions for matrix manipulation, but also image processing commands. Before actually computing the matrices, in order to increase the performance of the analysis step, I converted each frame from RGB to grayscale. The channel reduction (from three channels, red, green and blue, to one) cut the computation time of the Euclidean distances by at least half. Even though some information, e.g. subtle differences in the background, might have been lost, the decision to use grayscale images over RGB ones proved to be the right one.
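The conversion itself is a one-liner per frame in MATLAB (file names are illustrative):

% Convert an extracted RGB frame to grayscale before analysis.
img = imread('frames/frame_0001.png');
gray = rgb2gray(img);                  % 3 channels -> 1 channel
imwrite(gray, 'frames_gray/frame_0001.png');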

Although working with MATLAB was not difficult, as I had previous experience with it, I did encounter a few issues in the most crucial part of the implementation, the computation of the distance matrix. Because all the frames had to be read first and then processed, I overlooked the possibility of them being read in the wrong order. This was easily corrected, and once the frames were read in the right order, the distance matrix came out right as well.

3.3 Evaluation

Because of the nature of the project, it was absolutely necessary to make sure throughout development that the correct operations were being performed. Since it is next to impossible to ensure that the generated video texture is the most accurate and error-free one (for each

Page "28

Page 34: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

transition, there are at least two or three frames that could be used), generating a video texture from one of the sequences used in the Video Textures paper [2] and then comparing the results seemed to be the only way to assess the accuracy of my implementation.

The pendulum sequence proved to be a suitable choice, as it allowed complex analysis to be performed on it. The distance matrices, or more precisely their diagonals, were compared, and since they looked identical, the correctness of my implementation was confirmed.

Figure 15: My distance matrix vs paper distance matrix

The same approach was taken with the probability matrices, and because my matrix looked like the inverse of the distance one, it was clear I had obtained the desired matrix.

Figure 16: My probability matrix vs paper probability matrix

Page "29

Page 35: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Only after I was certain that my implementation was correct did I switch to working on the video that I had recorded.

Figure 17: Distance and probability matrix for own video

The white diagonal, which indicates values of 0 (each frame is identical to itself), was yet another sign that the result was as expected.

Unfortunately, no bitmaps were provided for the dynamics-preserving step or for the optimisation and pruning stages. However, the pruning functions were not hard to test: the local maxima algorithm proved to work correctly, as it maintained the peaks in the probability matrix, and the threshold optimisation was tested with different values, since in this case there is no optimal choice.

To conclude, the only part of the whole system that could be rigorously tested is the analysis stage, as it involves mathematical work whose accuracy can be checked easily. The fact that the Video Textures paper [2] includes a large number of visualisations of both distance and probability matrices was extremely beneficial and reassured me that what I was doing was indeed right.

Page "30

Page 36: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

Chapter 4

Reflection and conclusion

4.1 Reflection

Looking back at what has been achieved during the last seven months, I am most proud of the amount of knowledge gained through research. Before starting this project I had little to no experience in the field of computer vision, but the MATLAB skills I already had proved extremely useful. Also, when choosing my final year project, I looked for something challenging yet fun, and I feel that the synthesis of video textures was the right option.

Because not much work has been done in this area, there were times when I felt I lacked information, especially as there are not many ways to test the correctness of a texture. However, all difficulties were overcome, and the completion of the project brought a great deal of satisfaction.

Additionally, because the initial plan was rather shallow, many changes were made to it in terms of the time allocated to each phase. Because the analysis phase proved harder than expected, and because of its importance, much more time was spent ensuring that the matrices were indeed correct. Consequently, even though the implementation part officially began somewhere around

Page "31

Page 37: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

December, most of the work was done after the January exams, until the deadline for the code submission.

Reflecting on the skills gained during the time spent on implementation, it is safe to say that not only am I now more confident in my C and MATLAB skills, but, as this was the first big project I worked on and completed before a deadline, I also had to stay organised and keep myself motivated.

Regardless of the challenges faced and the time spent fixing bugs and issues, I stayed focused and aimed to resolve all of them patiently, as every problem has a solution.

4.2 Future work and conclusion

As the report reflects, all goals set during the compilation of the initial plan have been successfully achieved. The idea of video textures can be expanded further, but due to time constraints the focus has been on developing the key features.

As suggested in the Video Textures paper [2], areas where additional work can be done include, among others, improved distance metrics, better blending techniques and better tools for creative control.

To conclude, even though extensions were looked into but unfortunately not completed, the paper [2] has been implemented from scratch, and a large amount of work went into ensuring that everything performed as

Page "32

Page 38: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

expected. The skills gained since the start of the academic year are an invaluable asset for future projects, which is why I am grateful to have had the chance to work on this project.

Page "33

Page 39: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

References

[1] "Computer Graphics World Magazine”. Computer

Graphics World Magazine. N.p., 2000. Web. 25 Apr. 2016.

[2] Szeliski, R., Schold, A., Essa, I. and Salesin, D. (2000).

“Video Textures”, SIGGRAPH 2000.

[3] Bregler, C., Covell, M., and Slaney, M., “Video rewrite:

Driving visual speech with audio”. Computer Graphics

(SIG- GRAPH’97), pages 353–360, August 1997.

[4] A. Schodl and I. Essa, “Controlled Animation of Video

Sprites,” [Online]. Available: http://www.cc.gatech.edu/cpl/

pubs/PDF/ACM-SCA02.pdf. [Accessed 2 May 2016].

[5] A. Agarwala and K. C. e. a. Zheneg, “Panoramic Video

Textures,” 2005. [Online]. Available: http://

grail.cs.washington.edu/projects/panovidtex/

panovidtex.pdf. [Accessed 2 May 2016].

[6] Cantzler, H., “An overview of image texture”. [Online]

Available: http://homepages.inf.ed.ac.uk/rbf/CVonline/

LOCAL_COPIES/CANTZLER2/texture.html [Accessed 2

May 2016].

Page 40: The University of Manchesterstudentnet.cs.manchester.ac.uk/resources/library/3rd-year-projects/... · The University of Manchester Claudia-Ioana Ivan ... Dr. Aphrodite Galata, for

[7] “Wikipedia,” 2016. [Online]. Available: https://

en.wikipedia.org/wiki/Quasiperiodicity. [Accessed 2 May

2016].

[8] “Wikipedia”, 2016. [Online]. https://en.wikipedia.org/

wiki/Markov_property [Accessed 2 May 2016].