Study and Evaluation of Text Inpainting Techniques for Images and Video

Adélcio Rosa, nº 59160
Instituto Superior Técnico
Av. Rovisco Pais, 1, 1049-001 Lisboa
Email: [email protected]


Abstract— In this paper, a set of inpainting algorithms is implemented and evaluated in order to identify the most appropriate one for the restoration of small regions resulting from the extraction of subtitles in images and video. The initial state-of-the-art review suggested that diffusion-based techniques are the most suitable for the application in view. As such, four inpainting strategies based on the diffusion principle were implemented and assessed: the Bertalmio method, the TV method, the Oliveira method and the Telea method; the four methods were applied to still images and video sequences. In the video case, a purely spatial approach was used, i.e., the inpainting was applied to each frame as a still image. When applied to video text inpainting, the results are generally not acceptable: although the quality of each individual frame is high, when the videos are displayed at their actual frame rate, a flicker effect becomes visible on the restored area. This is due to the fact that the algorithms do not consider the time dimension of the video. In an attempt to minimize the flicker effect, a temporal-interpolation-based inpainting method is proposed; in this method, the regions of the video resulting from text extraction are restored through motion estimation/compensation based on neighboring frames, combined with a simple spatial diffusion technique. The proposed algorithm gives good results for most of the tested videos, mainly when these have a high spatial activity and do not contain scene changes; in particular, the flicker effect which occurs with a purely spatial inpainting approach is eliminated.

Index Terms— Digital inpainting, Text inpainting, Image restoration, Video restoration, Image quality, Video quality.

I. INTRODUCTION

The term inpainting refers, generally, to the restoration process of missing or damaged areas in images. The origin of the term dates back to the Renaissance, when the first manual restoration techniques for medieval pictures were developed (Figure 1). Nowadays, with the development of digital image processing techniques, methods that enable the automatic restoration of images and videos have emerged. The concept of "digital inpainting" was introduced by Bertalmio et al. in [1]. In mathematical terms, digital inpainting can be defined as a 2D (images) or 3D (video) estimation problem. The goal of inpainting is to estimate the unknown pixel values from known pixels (typically belonging to a spatial and/or temporal neighbourhood of the pixels to be estimated). Accordingly, image/video inpainting is an ill-posed inverse problem, i.e., the observed data do not uniquely constrain the solution.

Figure 1 - Artwork restoration by manual inpainting: (a) degraded picture; (b) restored picture.

To solve the problem it is therefore necessary to introduce a priori conditions on the pixels to estimate. All inpainting methods are guided by the assumption that pixels in the known and unknown parts of the image share the same statistical properties or geometrical structures. This assumption translates into different local or global conditions, with the goal of producing a plausible image with good perceptual quality.

In recent years, the area of digital inpainting has been the subject of intense research activity, motivated by several applications such as the automatic restoration of image and video regions after text removal, error concealment and foreground removal. In this paper, the target application is the restoration of small text areas in images and videos, like those resulting from the extraction and repositioning of movie subtitles. The main goal is to perform a comparative study of inpainting algorithms, identifying the most suitable technique for the restoration of small text areas, considering both the quality of the resulting images/video and the associated computational cost.

The remainder of this paper is organized as follows. Section II describes the main inpainting methods proposed in the literature, detailing those more adequate to the restoration of small areas. The details of the implemented spatial and temporal inpainting methods are presented in Section III. Section IV discusses the experimental results, and Section V draws conclusions and outlines future work.

II. RELATED WORKS

Image inpainting estimates a missing or damaged image area so that the restored image becomes as plausible as possible, with the restored parts undetectable to human vision (Figure 1). Several methods have been proposed in the literature for this purpose.

The first category of methods, known as diffusion-based inpainting, introduces smoothness priors via partial differential equations (PDEs) to propagate local structures from the exterior to the interior of the unknown image region, Ω. Many variants of this category, using different diffusion models (linear, non-linear, isotropic and anisotropic), can be found in the literature; they favor the propagation in a particular direction or according to the structure present in a local neighborhood. These methods are naturally well suited to completing straight and curved lines, and to inpainting small regions. In the following, three different diffusion-based methods are reviewed: the Total Variation (TV) method [2], the Bertalmio method [1] and the Oliveira method [3].

The Total Variation (TV) method was first proposed for noise removal in [4], and later adapted to the problem of image inpainting [2]. This method assumes that an image can be modeled by a function of limited total variation. In this case, the optimal solution is obtained by minimizing the TV energy (given by the local gradient magnitude) of the region to be inpainted, while keeping a smooth propagation of the image intensity. This minimization is given by:

J_TV[I] = ∬ |∇I(i,j)| dx dy + (λ_s/2) ∬ [I(i,j) − I_0(i,j)]² dx dy , (1)

where the first integral term represents the total gradient (or variation) of the image and the second term is a data term measuring the fidelity of the reconstructed image to the input image I_0. The minimization problem can also be rewritten as (for simplicity, the pixel position index (i,j) was omitted):

I^(t+1) = I^t + ρ [ (I_xx I_y² − 2 I_x I_y I_xy + I_yy I_x²) / (I_x² + I_y² + ε)^(3/2) + λ_s (I_0 − I^t) ] , (2)

where I^(t+1) and I^t are the solutions at iterations t+1 and t, respectively; I_x and I_y are the image gradients along x and y; I_xx and I_yy are the Laplacians along x and y; I_xy is the cross derivative over x and y simultaneously; the parameter ε ensures a nonzero denominator; and λ_s is given by:

λ_s = λ, (i,j) ∈ S ; λ_s = 0, (i,j) ∈ Ω , (3)

where S is the known image region.

Bertalmio et al. [1] pioneered a digital image inpainting algorithm based on PDEs. The main goal of this method is to minimize the edge blurring that may result from simple isotropic or anisotropic diffusion methods. The algorithm propagates the isophote lines (contours of equal luminance in an image) from the outside into the area to be inpainted. The propagation direction is given by the isophote direction (normal to the gradient vector), represented in Figure 2.

Figure 2 - (a) Isophote lines; (b) Image intensity propagation along the isophote directions.

The Bertalmio algorithm can be written as:

I^(t+1)(i,j) = I^t(i,j) + Δt · I_n^t(i,j) , ∀(i,j) ∈ Ω , (4)

where I^t(i,j) represents the image intensity at position (i,j) of the region to be inpainted, at iteration t; Δt is the change rate and I_n^t(i,j) is the update to apply to I^t(i,j). This update is given by:

I_n^t(i,j) = δL^t(i,j) · N^t(i,j) , (5)

where δL^t(i,j) is a measure of the information variation (given by the image Laplacian), which is projected on the propagation direction, N^t(i,j).

Keeping the diffusion-based inpainting concepts, Oliveira et al. [3] proposed an inpainting technique using very simple models; nevertheless, it may achieve results close to those of much more sophisticated (and complex) methods. This technique propagates the edge information into the unknown region through the iterative convolution of the region to be inpainted with a diffusion filter (Figure 3), producing good results in image areas where the contours do not have high contrast.

Figure 3 - Diffusion filter: a = 0.073235; b = 0.176765; c = 0.125.

Telea proposed, in [5], a diffusion-based algorithm without the computational overhead of the Bertalmio method. As in the previous methods, the algorithm propagates the image intensity from a neighborhood of the region to be inpainted; however, the process is based on a first-order approximation of the image values in the unknown region.

Exemplar-based algorithms are a second category of inpainting methods. These methods are based on texture synthesis techniques, whose basic principle is to produce a textured image region from a known texture model. Exemplar-based inpainting was first introduced in [6] and can be described by the following steps (Figure 4):


1. Definition of a patch size.
2. Computation of a priority for each boundary pixel of the unknown region, and setting of the filling order of these pixels.
3. Search, in the known image area, for the patch closest to each patch to be filled in.
4. Recalculation of the inpainting region boundary and of its pixel priorities.

Figure 4 - Structure of exemplar-based techniques.
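The four steps above can be sketched as a toy greedy fill. This is an illustrative simplification, not the exact algorithm of [6]: the priority of step 2 is reduced to counting already-known neighbours, and the patch search of step 3 is a brute-force sum of squared differences over known pixels only.

```python
import numpy as np

def exemplar_inpaint(img, mask, patch=3):
    """Toy greedy exemplar-based fill: repeatedly pick the unknown pixel with
    the most known neighbours and copy the centre of the best-matching patch."""
    img = img.astype(float).copy()
    unknown = mask.astype(bool).copy()          # True = pixel still to fill
    r = patch // 2
    h, w = img.shape
    while unknown.any():
        # step 2: priority = number of known pixels in the patch window
        ys, xs = np.where(unknown)
        pri = [(np.count_nonzero(~unknown[max(y - r, 0):y + r + 1,
                                          max(x - r, 0):x + r + 1]), y, x)
               for y, x in zip(ys, xs)]
        _, y, x = max(pri)                      # highest-priority pixel first
        # step 3: best fully-known candidate patch (SSD over known pixels)
        best, best_val = None, None
        for cy in range(r, h - r):
            for cx in range(r, w - r):
                if unknown[cy - r:cy + r + 1, cx - r:cx + r + 1].any():
                    continue                    # candidate must be fully known
                ssd = 0.0
                for dy in range(-r, r + 1):
                    for dx in range(-r, r + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w and not unknown[yy, xx]:
                            d = img[yy, xx] - img[cy + dy, cx + dx]
                            ssd += d * d
                if best is None or ssd < best:
                    best, best_val = ssd, img[cy, cx]
        # step 4: fill the pixel; boundary and priorities are then recomputed
        img[y, x] = best_val
        unknown[y, x] = False
    return img
```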

All the described inpainting techniques can be directly applied to video, by considering each frame as a still image. However, although the quality of the resulting images can be perceptually acceptable when evaluating each frame separately, this may not be the case when evaluating the video quality as a whole. In fact, spatial inpainting may produce small differences when applied on a frame-by-frame basis, and these differences cause flicker effects in the video, with a high subjective impact. To minimize this flicker effect, the temporal domain must be taken into account. Lee et al. proposed, in [7], a spatio-temporal inpainting technique to restore text regions in videos. Another video inpainting technique was proposed in [8], in order to fill in the missing areas using the patch-based inpainting method proposed in [6]. The first stage of this method separates the static background and the moving foreground through a threshold mechanism; the second stage performs the inpainting of both components; the third stage reconstructs the video through the composition of foreground and background.

III. IMPLEMENTATION

From the literature review, it was concluded that diffusion-based algorithms present good results for the inpainting of small regions. For this reason, three diffusion-based spatial algorithms were selected for implementation, in order to determine the best one in terms of resulting quality and processing time. This section presents the implementation details of each algorithm. In order to minimize the flicker effect that results from a purely spatial approach to video inpainting, a temporal-interpolation-based algorithm is also proposed.

A. Spatial Inpainting

As mentioned above, all the implemented spatial inpainting methods are PDE-based, and propagate the information from the boundary of the inpainting area, ∂Ω, into the inpainting area, Ω, through an iterative process.

1. Total Variation Method

The function which minimizes the total image variation energy is described as:

I^(t+1) = I^t + ρ [ (I_xx I_y² − 2 I_x I_y I_xy + I_yy I_x²) / (I_x² + I_y² + ε)^(3/2) + λ_s (I_0 − I^t) ] , (6)

where I^(t+1), I^t and I_0 represent, respectively, the resulting image at the current iteration, the resulting image at the previous iteration and the input image; ρ is an update ratio and ε ensures a non-zero denominator; I_x and I_y are the image gradients in the x and y directions, respectively, and can be computed by:

I_x(i,j) = [I(i+1,j) − I(i−1,j)] / 2 ,
I_y(i,j) = [I(i,j+1) − I(i,j−1)] / 2 ; (7)

I_xx, I_yy and I_xy are the image second derivatives, and can be computed by:

I_xx(i,j) = I(i+1,j) − 2 I(i,j) + I(i−1,j) ,
I_yy(i,j) = I(i,j+1) − 2 I(i,j) + I(i,j−1) ,
I_xy(i,j) = [I(i+1,j+1) − I(i+1,j−1) − I(i−1,j+1) + I(i−1,j−1)] / 4 . (8)

The λ_s parameter is given by:

λ_s = λ, (i,j) ∈ S ; λ_s = 0, (i,j) ∈ Ω . (9)
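Under the discretisation of eqs. (6)-(9), one TV iteration can be written in a few vectorised lines. The sketch below is a minimal illustration, with two assumptions not taken from the paper: the fidelity term is dropped (λ_s = 0 inside Ω, and known pixels are simply kept fixed), and values are clipped to [0,1] as an extra stabilisation.

```python
import numpy as np

def tv_inpaint(img, mask, n_iter=300, rho=0.05, eps=0.01):
    """Curvature-driven TV update of eq. (6), applied only inside the mask."""
    I = img.astype(float).copy()
    m = mask.astype(bool)
    for _ in range(n_iter):
        # centred differences, eq. (7); np.roll wraps at the borders, which is
        # harmless here because only interior (masked) pixels are updated
        Ix = (np.roll(I, -1, 0) - np.roll(I, 1, 0)) / 2.0
        Iy = (np.roll(I, -1, 1) - np.roll(I, 1, 1)) / 2.0
        # second derivatives, eq. (8)
        Ixx = np.roll(I, -1, 0) - 2.0 * I + np.roll(I, 1, 0)
        Iyy = np.roll(I, -1, 1) - 2.0 * I + np.roll(I, 1, 1)
        Ixy = (np.roll(np.roll(I, -1, 0), -1, 1) - np.roll(np.roll(I, -1, 0), 1, 1)
               - np.roll(np.roll(I, 1, 0), -1, 1) + np.roll(np.roll(I, 1, 0), 1, 1)) / 4.0
        # curvature term of eq. (6); eps keeps the denominator nonzero
        num = Ixx * Iy ** 2 - 2.0 * Ix * Iy * Ixy + Iyy * Ix ** 2
        den = (Ix ** 2 + Iy ** 2 + eps) ** 1.5
        upd = rho * num / den
        I[m] = np.clip(I[m] + upd[m], 0.0, 1.0)   # known pixels stay untouched
    return I
```

On a dark square hole in a brighter constant image, the closed level lines around the hole shrink under this curvature motion, so the hole is progressively filled from its corners inwards.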

2. Bertalmio Method

The Bertalmio method uses two processes, inpainting and anisotropic diffusion; each global iteration is composed of T_d iterations of anisotropic diffusion, followed by T_i iterations of inpainting. The inpainting process is described by the discrete equation

I^(t_i+1)(i,j) = I^(t_i)(i,j) + Δt · I_n^(t_i)(i,j) , ∀(i,j) ∈ Ω , (10)


where I^(t_i)(i,j) is the resulting image at iteration t_i, I_n^(t_i)(i,j) represents the update of the pixel intensity at position (i,j), and Δt is an update rate; this parameter is an algorithm input. Given I^(t_i)(i,j), the update I_n^(t_i)(i,j) is given by:

I_n^(t_i)(i,j) = δL^(t_i)(i,j) · [N(i,j,t_i) / |N(i,j,t_i)|] · |∇I^(t_i)(i,j)| , (11)

where N(i,j,t_i) / |N(i,j,t_i)| is the isophote direction at position (i,j), at iteration t_i, with

N(i,j,t_i) = ( −I_y^(t_i)(i,j) , I_x^(t_i)(i,j) ) , (12)

and

N(i,j,t_i) / |N(i,j,t_i)| = ( −I_y^(t_i)(i,j) / √[(I_x^(t_i)(i,j))² + (I_y^(t_i)(i,j))²] , I_x^(t_i)(i,j) / √[(I_x^(t_i)(i,j))² + (I_y^(t_i)(i,j))²] ) , (13)

with I_x^(t_i)(i,j) and I_y^(t_i)(i,j) representing the gradients in the x and y directions, respectively:

I_x(i,j) = [I(i+1,j) − I(i−1,j)] / 2 ,
I_y(i,j) = [I(i,j+1) − I(i,j−1)] / 2 . (14)

The Laplacian variation used in eq. (11), δL^(t_i)(i,j), is given by:

δL^(t_i)(i,j) = ( ΔI^(t_i)(i+1,j) − ΔI^(t_i)(i−1,j) , ΔI^(t_i)(i,j+1) − ΔI^(t_i)(i,j−1) ) , (15)

where ΔI^(t_i)(i,j) results from the convolution of the filter in Figure 5 with the resulting image at iteration t_i.

Figure 5 - 2D filter used to compute the image Laplacian.

The image gradient magnitude, |∇I^(t_i)(i,j)|, is computed with a slope-limited scheme:

|∇I^(t_i)(i,j)| = √[(I_xbm^(t_i))² + (I_xfM^(t_i))² + (I_ybm^(t_i))² + (I_yfM^(t_i))²] , when β^(t_i)(i,j) > 0 ;
|∇I^(t_i)(i,j)| = √[(I_xbM^(t_i))² + (I_xfm^(t_i))² + (I_ybM^(t_i))² + (I_yfm^(t_i))²] , when β^(t_i)(i,j) < 0 , (16)

where I_x and I_y are backward or forward image gradients; the subscripts b and f denote, respectively, backward and forward differences, and m and M denote, respectively, the minimum and the maximum between zero and the gradient. β^(t_i) represents the Laplacian variation along the isophote direction:

β^(t_i)(i,j) = δL^(t_i)(i,j) · N(i,j,t_i) / |N(i,j,t_i)| . (17)

The anisotropic diffusion is determined by

I^(t_d+1) = I^(t_d) + ρ · [ I_xx^(t_d) (I_y^(t_d))² + I_yy^(t_d) (I_x^(t_d))² ] / [ (I_x^(t_d))² + (I_y^(t_d))² + ε ] , (18)

where ρ is the diffusion coefficient; I_xx and I_yy are the image Laplacians in x and y, respectively; I_x and I_y are the image gradients in x and y, respectively; and the parameter ε is a positive value that guarantees a non-null denominator. The gradients and Laplacians are given by centered differences:

I_x^(t_d) = [I^(t_d)(i+1,j) − I^(t_d)(i−1,j)] / 2 ,
I_y^(t_d) = [I^(t_d)(i,j+1) − I^(t_d)(i,j−1)] / 2 , (19)

and

I_xx^(t_d) = I^(t_d)(i+1,j) − 2 I^(t_d)(i,j) + I^(t_d)(i−1,j) ,
I_yy^(t_d) = I^(t_d)(i,j+1) − 2 I^(t_d)(i,j) + I^(t_d)(i,j−1) . (20)
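The inpainting update of eqs. (11)-(15) can be sketched as below. One simplification is assumed for brevity: the slope-limited gradient of eq. (16) is replaced by a plain central-difference magnitude, so the sketch illustrates the information flow, not the exact numerical scheme.

```python
import numpy as np

def bertalmio_update(I, eps=1e-8):
    """One inpainting update I_n: the variation of the Laplacian, projected on
    the isophote direction N/|N| and scaled by the gradient magnitude."""
    # Laplacian (equivalent to convolving with the filter of Figure 5)
    L = (np.roll(I, -1, 0) + np.roll(I, 1, 0) +
         np.roll(I, -1, 1) + np.roll(I, 1, 1) - 4.0 * I)
    # delta L, eq. (15): change of the Laplacian along x and y
    dLx = np.roll(L, -1, 0) - np.roll(L, 1, 0)
    dLy = np.roll(L, -1, 1) - np.roll(L, 1, 1)
    # gradients, eq. (14), and isophote direction, eqs. (12)-(13)
    Ix = (np.roll(I, -1, 0) - np.roll(I, 1, 0)) / 2.0
    Iy = (np.roll(I, -1, 1) - np.roll(I, 1, 1)) / 2.0
    norm = np.sqrt(Ix ** 2 + Iy ** 2 + eps)
    Nx, Ny = -Iy / norm, Ix / norm
    beta = dLx * Nx + dLy * Ny                    # projection, cf. eq. (17)
    grad = np.sqrt(Ix ** 2 + Iy ** 2)             # simplified |grad I|, cf. eq. (16)
    return beta * grad                            # I_n of eq. (11)
```

On a constant image the Laplacian vanishes, so the update is identically zero: there is no information to propagate.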

3. Oliveira Method

The Oliveira algorithm is a quite simple method, consisting of the iterative convolution between a filter and the image to be restored. The restoration process can be written iteratively as

I^(t+1)(i,j) = a · I_1^t(i,j) + b · I_2^t(i,j) , (21)

where a and b are the filter coefficient values (Figure 3), and

I_1^t(i,j) = I^t(i−1,j−1) + I^t(i−1,j+1) + I^t(i+1,j−1) + I^t(i+1,j+1) ,
I_2^t(i,j) = I^t(i−1,j) + I^t(i+1,j) + I^t(i,j−1) + I^t(i,j+1) . (22)
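Eqs. (21)-(22) reduce to one vectorised averaging step per iteration; a minimal sketch, updating only the masked pixels so that the known region stays fixed:

```python
import numpy as np

def oliveira_inpaint(img, mask, a=0.073235, b=0.176765, n_iter=200):
    """Iterative convolution of eqs. (21)-(22): each unknown pixel becomes a
    weighted average of its diagonal (a) and 4-connected (b) neighbours."""
    I = img.astype(float).copy()
    m = mask.astype(bool)
    for _ in range(n_iter):
        # I1: sum of the four diagonal neighbours, eq. (22)
        I1 = (np.roll(np.roll(I, 1, 0), 1, 1) + np.roll(np.roll(I, 1, 0), -1, 1) +
              np.roll(np.roll(I, -1, 0), 1, 1) + np.roll(np.roll(I, -1, 0), -1, 1))
        # I2: sum of the four 4-connected neighbours, eq. (22)
        I2 = (np.roll(I, 1, 0) + np.roll(I, -1, 0) +
              np.roll(I, 1, 1) + np.roll(I, -1, 1))
        I[m] = a * I1[m] + b * I2[m]              # eq. (21), masked pixels only
    return I
```

Since 4a + 4b = 1, the filter is normalised and the iteration converges to a smooth interpolation of the surrounding intensities.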

B. Temporal Inpainting

In order to minimize the flicker effect resulting from a purely spatial inpainting of video sequences, a temporal-interpolation-based inpainting algorithm is proposed. The inpainting is performed only in the region resulting from text removal, using motion estimation and compensation between the frame to be processed and its two neighboring frames.

The proposed method starts by determining the smallest rectangular region that contains all the pixels to be inpainted (shown in red in Figure 6); the regions immediately above and below the rectangular area are then divided into blocks, as shown in Figure 6.

Figure 6 - Block division in the inpainting region neighborhood.

In the next step, the motion of each block is estimated; this estimate can be backward (using the previous frame) or forward (using the following frame), and is performed by a conventional block matching algorithm (BMA), whose working principle is shown in Figure 8. For each block, a vector (B or F) is obtained through the best matching block in the previous or next frame.

Finally, the inpainting of the unknown pixels is accomplished through motion compensation, using, for each pixel, the six closest vectors (following a strategy proposed in [9]), as shown in Figure 7.

Figure 7 - Motion vectors (v1 to v6) considered for the inpainting of the pixels highlighted in orange.

1. Motion Estimation

As mentioned above, the block motion estimation is performed through a BMA, using the previous frame (backward estimate) or the following frame (forward estimate), and considering only the luminance component of the video. Figure 8 represents the BMA for the backward case, where the block size is N×N and the search area size is Ri×Rj; this area is defined in the previous (or following) frame, around the position of the block whose vector is to be estimated. To find the motion vector, the block is compared to all the blocks of the search area. The comparison is performed using one of four similarity metrics: mean absolute error (MAE); structural similarity index (SSIM); MAE+SMD (side match distortion); SSIM+SMD.

Figure 8 - BMA basics.

The difference between the position of the block whose vector is to be estimated (represented in blue in Figure 8) and the position of the most similar block (represented in red in Figure 8) is taken as the block motion vector; this difference is given by the vector (d_i, d_j).

The MAE metric is defined as:

MAE(d_i, d_j) = (1/N²) Σ_{(i,j) ∈ block} |I_t(i,j) − I_{t−1}(i+d_i, j+d_j)| ; (23)
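A full-search BMA with the MAE metric of eq. (23) can be sketched as follows; the SSIM and SMD variants only change the comparison function, and the block position and search range used below are illustrative parameters:

```python
import numpy as np

def block_match(cur, ref, top, left, bsize, search):
    """Full-search block matching: return the displacement (di, dj) of the
    block of `cur` at (top, left) that minimises the MAE of eq. (23) in `ref`."""
    block = cur[top:top + bsize, left:left + bsize]
    best = None
    for di in range(-search, search + 1):
        for dj in range(-search, search + 1):
            r, c = top + di, left + dj
            if r < 0 or c < 0 or r + bsize > ref.shape[0] or c + bsize > ref.shape[1]:
                continue                          # candidate outside the frame
            cand = ref[r:r + bsize, c:c + bsize]
            mae = np.abs(block - cand).mean()     # eq. (23), the 1/N^2 included
            if best is None or mae < best[0]:
                best = (mae, di, dj)
    return best[1], best[2]
```

As a sanity check, if the reference frame is the current frame shifted by a known offset, the estimated vector recovers that offset exactly.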

The SMD metric [9] is the mean absolute error between the internal boundary pixels of the reference block (blue in Figure 8) and the external boundary pixels of the candidate block. The combination of MAE or SSIM with SMD is performed by

Metric_MAE&SMD = µ · MAE + (1 − µ) · SMD ,
Metric_SSIM&SMD = µ · SSIM + (1 − µ) · SMD , (24)

where µ sets the relative weight of the metrics; before combination, all the metrics are normalized to the interval [0,1].

2. Inpainting with motion compensation

For each pixel to be inpainted, the six closest motion vectors (Figure 7) are considered. Let I_n(i,j) be the resulting intensity of the pixel at position (i,j) after motion compensation with motion vector v_n; the resulting inpainted value of the pixel is given by

I(i,j) = Σ_{n=1..6} w_n(i,j) · I_n(i,j) , (25)

where w_n is a weight that depends on the distance between the pixel and vector v_n (eq. 26).

Finally, in order to minimize the visibility of the restored region boundary, some cycles of anisotropic spatial diffusion are applied.
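The exact weight expressions of eq. (26) are illegible in the source scan; the sketch below therefore assumes generic inverse-distance weights, normalised to sum to one, which matches the stated intent of eq. (25) (closer vectors weigh more) but is not the paper's exact formula:

```python
import numpy as np

def blend_candidates(candidates, dists):
    """Weighted combination of eq. (25) with assumed inverse-distance weights."""
    # +1 in the denominator avoids division by zero at distance 0
    w = 1.0 / (np.asarray(dists, dtype=float) + 1.0)
    w /= w.sum()                                  # weights sum to one
    return float(np.dot(w, np.asarray(candidates, dtype=float)))
```

With equal motion-compensated candidates the blend returns that common value, and a candidate from a nearer vector dominates the result.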


where w_1(i,j), …, w_6(i,j) are bilinear functions of the pixel position (i,j) and of the block dimensions, each weight decreasing with the distance between the pixel and the block associated with the corresponding motion vector. (26)

IV. RESULTS

This section presents the inpainting results (both in terms of quality and processing time) for the algorithms described in the previous section. The first results concern the spatial inpainting methods (TV, Bertalmio, Oliveira and Telea), applied to still images and video, in order to determine the best algorithm for the restoration of small text areas. Next, the results of the temporal interpolation method are presented.

Figure 9 presents one frame of each test video; the first frame of each video was used as a still image. The original videos and all restored images and videos can be found in [10]. Figures 10 and 12 show the masks used to simulate the regions to be reconstructed.

Figure 9 - Test videos and images: (a) Bigships; (b) Oldtown; (c) Parkjoy; (d) Raven; (e) Snowmontain; (f) Soccer; (g) Station; (h) Sunflower.

Figure 10 - Masks used in the inpainting process: (a) Mask1; (b) Mask2.

All the algorithms, except the Telea method, were implemented in Matlab; for the Telea method, a software module available in the OpenCV library was used. The algorithms were executed on a computer with 8 GBytes of RAM and an Intel® Core™ i7 processor running at 2.8 GHz.

A. Spatial Inpainting

Based on the parameter values proposed in the original papers, and also on our own experimental tests, the following parameters were set:

TV method: 1500 iterations; ρ = 0.2; λ = 0.05; ε = 0.01.
Bertalmio method: 100 global iterations, each composed of 2 diffusion iterations and 15 inpainting iterations; ρ = 0.2; ε = 0.001; Δt = 0.1.
Oliveira method: 100 iterations; a = 0.073235, b = 0.176765.

The comparison between the techniques takes into account the resulting quality of the inpainted region and the associated execution time. For the quality assessment, two objective metrics were used: peak signal-to-noise ratio (PSNR) and SSIM. For the former, only the restored pixels were considered; for the latter, the metric considers all the pixels belonging to the minimal rectangle containing the mask. For converting the PSNR values to subjective scores, Table 1 was used. SSIM is a normalized metric that can take values between zero (restored and original image are completely different) and one (restored and original image have exactly the same pixel values). Only the luminance component has been considered.

Table 1 - PSNR values versus subjective quality.

PSNR [dB]   Quality
< 30        Low
30 – 36     Medium
36 – 42     High
> 42        Excellent
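The PSNR restricted to the restored pixels can be computed as in the sketch below; `masked_psnr` is an illustrative helper (not the paper's code), assuming 8-bit images with peak value 255:

```python
import numpy as np

def masked_psnr(orig, restored, mask, peak=255.0):
    """PSNR computed only over the restored (masked) pixels; assumes the
    restored image differs from the original inside the mask (MSE > 0)."""
    diff = (orig.astype(float) - restored.astype(float))[mask.astype(bool)]
    mse = float(np.mean(diff ** 2))               # MSE over the mask only
    return 10.0 * np.log10(peak ** 2 / mse)
```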

Table 2 presents the resulting PSNR values for each method, with Mask1. This table shows that the images resulting from the Bertalmio and Oliveira methods present high or even excellent quality (PSNR clearly above 36 dB), except for Snowmontain, which presents medium quality. This is confirmed by Table 3, where the SSIM values for these two methods are higher than 0.9 for all images. The Telea and TV methods show, in every case, worse results than the Bertalmio and Oliveira methods (although most of them also have high/excellent quality).

If a mask contains several inpainting regions (e.g., Mask2), the inpainting results differ from region to region (Figure 11), according to the size and/or the spatial content of the region to be inpainted (e.g., textured regions are more difficult to restore properly than uniform regions).

Table 2 - PSNR values for still images (Mask1).

Image         PSNR [dB]
              TV      Bertalmio  Oliveira  Telea
Bigships      35.95   39.93      39.87     35.41
Oldtown       35.29   40.08      40.08     35.13
Parkjoy       37.91   49.63      49.64     45.81
Raven         40.88   45.51      45.50     39.65
Snowmontain   30.66   34.43      34.56     30.44
Soccer        37.10   41.93      41.78     37.34
Station       43.09   48.44      48.54     43.17
Sunflower     45.13   55.11      55.39     43.59

Table 3 - SSIM values for still images (Mask1).

Image         SSIM
              TV     Bertalmio  Oliveira  Telea
Bigships      0.79   0.93       0.93      0.87
Oldtown       0.82   0.93       0.93      0.87
Parkjoy       0.63   0.98       0.98      0.95
Raven         0.82   0.95       0.95      0.89
Snowmontain   0.82   0.92       0.92      0.87
Soccer        0.80   0.93       0.93      0.87
Station       0.78   0.97       0.97      0.92
Sunflower     0.88   0.99       0.99      0.95

Figure 11 - PSNR and SSIM evolution curves using the Bertalmio method, the Soccer still image and Mask2.

Another important criterion for the evaluation of the methods is the processing time, even if most inpainting applications have no real-time processing requirement. If two methods yield similar image/video quality, the decision criterion should be the processing time (or the complexity associated with the method). Table 4 presents the processing time (in seconds) for each method, using Mask1; the indicated time corresponds to the iteration at which the quality of the restored image reaches a stationary state. As mentioned previously, for the Telea method a function developed in C++ (OpenCV) was used, with optimized code. The other methods were implemented in the Matlab 2015a environment, without concerns about code optimization.

Table 4 - Processing time for still images (Mask1).

Image         Processing Time [s/frame]
              TV    Bertalmio  Oliveira  Telea
Bigships      350   80         1.5       1.23
Oldtown       300   75         1.3       1.02
Parkjoy       110   77         1.3       1.07
Raven         280   80         1.4       1.24
Snowmontain   300   70         1.1       1.22
Soccer        300   80         1.1       1.28
Station       200   90         1.4       1.05
Sunflower     200   90         1.5       1.09

The Oliveira method presents the shortest processing time of all the considered methods. Taking into account the quality of the inpainted images as well, it follows that the Oliveira method, proposed in [3], is the most effective for text inpainting with character dimensions similar to those used in movie subtitles.

For video inpainting, the considered methods were applied to each video frame. Although the quality remains high or excellent when each frame is observed individually, while playing the restored video a flicker effect becomes visible on the restored area. Some examples of this phenomenon can be found in [10]. The phenomenon is due to small frame-to-frame variations in the restored region.

B. Temporal Inpainting

For temporal video inpainting, Mask3 (Figure 12) was initially used. Table 5 shows the resulting SSIM values for each video and for the different similarity metrics that can be used in the motion estimation; in these results, the diffusion filter was disabled. The values are averages of the SSIM values obtained for each video frame. Motion estimation was performed with one-pixel resolution, and the search window size, Wsize, was chosen according to the temporal activity of each video, in order to keep test times low. The remaining parameters were set as follows: Bsize = 16 and µ = 0.2.
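The motion estimation used here is a full-search block matching. The sketch below uses the MAE cost for brevity; the names `block_match`, `bsize` and `wsize` are illustrative (the latter two mirror the paper's Bsize and Wsize), and the other tested metrics (SSIM, and the SMD variants) would be plugged in via the `metric` argument.

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two equally-sized blocks."""
    return np.mean(np.abs(a - b))

def block_match(ref, cur, top, left, bsize=16, wsize=5, metric=mae):
    """Full-search block matching sketch: find the displacement (dy, dx),
    with |dy|, |dx| <= wsize, that minimises `metric` between the block of
    `cur` at (top, left) and the candidate block in the reference frame."""
    block = cur[top:top + bsize, left:left + bsize]
    best, best_mv = np.inf, (0, 0)
    for dy in range(-wsize, wsize + 1):
        for dx in range(-wsize, wsize + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue                   # candidate falls outside the frame
            cost = metric(ref[y:y + bsize, x:x + bsize], block)
            if cost < best:
                best, best_mv = cost, (dy, dx)
    return best_mv
```

The quadratic growth of the candidate count with `wsize` is what makes the window size the dominant factor in the processing times reported below.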


Figure 12 - Mask3, used in the video inpainting process.

Table 5 - SSIM results for the four similarity metrics used in motion estimation, without diffusion filter.

Video         Wsize   SSIM    MAE     SSIM+SMD   MAE+SMD
Bigships      5       0.722   0.636   0.734      0.653
Parkjoy       40      0.843   0.623   0.851      0.674
Raven         10      0.899   0.803   0.893      0.795
Snowmontain   5       0.684   0.752   0.702      0.578
Soccer        40      0.831   0.675   0.834      0.728
Station       10      0.853   0.800   0.857      0.759
Sunflower     10      0.971   0.956   0.953      0.865

Based on Table 5 it can be concluded that, in general, the SSIM metric presents the best results, particularly when combined with SMD. To analyze the impact of the diffusion filter on the inpainting quality, the tests were repeated with the same parameters, for the SSIM+SMD similarity metric, applying 50 diffusion iterations. The results are shown in Table 6 and objectively confirm an improvement in the quality of the restored videos. The improvement is most evident in the Bigships and Snowmontain sequences. These sequences are particularly difficult because they contain a scene transition implemented as a dissolve, and therefore the vectors estimated in the temporal window where the dissolve occurs do not correspond to the real video motion. Consequently, purely temporal inpainting produces poor results, as can be seen in Figure 13; these are attenuated by the diffusion filter. However, the temporal interpolation process can recover from the distortion caused by the dissolve after some frames, as can be seen in Figure 14. Figures 15 and 16 show the evolution of PSNR and SSIM for the Sunflower and Bigships video sequences, respectively. For the Bigships sequence, the PSNR and SSIM values confirm the lower subjective quality; even so, the resulting quality is very close to the original in some time intervals.
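The temporal interpolation step amounts to copying motion-compensated pixels from a neighbouring frame into the masked region. A minimal sketch, assuming backward compensation from the previous frame and an already-estimated vector mv = (dy, dx) for the block (the function name is hypothetical):

```python
import numpy as np

def compensate_block(prev, frame, mask, top, left, bsize, mv):
    """Temporal inpainting sketch: overwrite the masked pixels of the
    block at (top, left) in `frame` with motion-compensated pixels taken
    from the previous frame, displaced by mv = (dy, dx). Known pixels
    are left untouched."""
    dy, dx = mv
    out = frame.copy()
    for y in range(top, top + bsize):
        for x in range(left, left + bsize):
            if mask[y, x]:
                out[y, x] = prev[y + dy, x + dx]   # borrow from neighbour frame
    return out
```

When the estimated vectors follow the real motion, this reconstruction is temporally consistent and hence flicker-free; when they do not (e.g. during a dissolve), the copied pixels are wrong, which is exactly the failure mode attenuated by the diffusion filter.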

Table 6 - SSIM results using SSIM+SMD for motion estimation, with diffusion filter.

Video         Wsize   Niter=0   Niter=50
Bigships      5       0.734     0.790
Parkjoy       40      0.851     0.855
Raven         10      0.893     0.899
Snowmontain   5       0.702     0.792
Soccer        40      0.834     0.839
Station       10      0.857     0.887
Sunflower     10      0.953     0.961


Figure 13 - Temporal inpainting results for the Bigships video during the dissolve: (a) masked video frame at the beginning of the dissolve; (b) video frame during the dissolve; (c) video frame at the end of the dissolve; (d) crop of the inpainting region of the masked frame; (e) crop of the inpainting region of the frame restored by temporal inpainting.


Figure 14 - Inpainting result for the Bigships video after the dissolve: (a) masked frame; (b) crop of the inpainting region; (c) inpainting result of (b).


Figure 15 - PSNR (a) and SSIM (b) for the Sunflower video, using Mask3.

Figure 16 - PSNR (a) and SSIM (b) for the Bigships video, using Mask3.

Table 7 presents the processing time (seconds per frame) of the motion estimation procedure, for several Bsize and Wsize values and for the different similarity metrics considered. For Mask3 there are 20 blocks with Bsize = 16×16, and 12 blocks with Bsize = 32×32. One might expect the processing time to be higher for the SSIM+SMD and MAE+SMD metrics, but these two metrics actually perform fewer comparisons than SSIM and MAE, respectively: the number of comparisons decreases from (2·Wsize − 1)² to (2·Wsize − 3)², due to the way SMD was implemented. Finally, the impact of the number of diffusion iterations was analyzed; the results are shown in Table 8. The diffusion filter is applied only to the mask pixels. To obtain the total processing time, the motion estimation time has to be added to the filtering time.

Table 7 - Processing times (seconds per frame) of the motion estimation procedure for different Bsize and Wsize values, and for all similarity metrics.

Bsize   Wsize   Processing Time [s/frame]
                SSIM   MAE    SSIM+SMD   MAE+SMD
16      10      4.2    1.5    3.6        1.4
16      20      15.7   5.3    14.8       5.3
16      30      34.7   11.9   34.4       12.1
16      40      61.0   21.0   59.7       21.2
32      10      3.5    3.4    3.0        2.9
32      20      12.7   12.1   12.1       11.2
32      30      27.5   27.3   27.3       25.7
32      40      48.3   46.1   50.5       46.7

Table 8 - Processing time of the diffusion filter.

Niter   Processing Time [s/frame]
50      0.30
100     0.59
150     0.89
200     1.20
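As a sanity check on the comparison counts mentioned above, the number of candidate positions evaluated by the full search can be computed directly; the function below simply encodes the two expressions from the text, and its name is illustrative.

```python
def comparisons(wsize, with_smd=False):
    """Number of candidate displacements evaluated by the full search,
    per the expressions in the text: (2*Wsize - 1)**2 without the SMD
    term, (2*Wsize - 3)**2 with it (an artefact of how SMD was
    implemented)."""
    side = 2 * wsize - (3 if with_smd else 1)
    return side * side

# e.g. for Wsize = 40 (the largest tested window):
# comparisons(40) -> 6241, comparisons(40, True) -> 5929
```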

V. CONCLUSION

The aim of this work was to study and evaluate a set of techniques for text inpainting in still images and videos. From the literature overview, it was concluded that diffusion-based techniques were the most suitable for the application in view. Accordingly, four diffusion-based strategies were implemented: TV, Bertalmio, Oliveira and Telea (for the last, an existing OpenCV implementation was used). These methods were applied to still images and video sequences; for the videos, a purely spatial approach was used, i.e., the inpainting was applied to each frame as if it were a still image. The four methods were compared both in terms of the quality of the restored image/video and in terms of processing time. All the methods produced good results when applied to still images. The method proposed by Oliveira et al. [3] presents the best results in terms of quality versus processing time. When the methods are applied to video text inpainting, the results are generally not acceptable: although the quality of each individual frame is high, when the videos are displayed at their actual frame rate, a flicker effect becomes visible on the restored area. This is due to the fact that the algorithms do not consider the time dimension of the video. In an attempt to minimize the flicker effect, a temporal-interpolation-based inpainting method was proposed; in this method, the regions of the video resulting from text extraction are restored through motion estimation/compensation based on neighboring frames, combined with a simple spatial diffusion technique. In the motion estimation procedure, based on a simple block matching algorithm (BMA), four similarity measures were evaluated; the "SSIM + SMD" measure was the one whose vectors led to the best video inpainting quality. The proposed algorithm gives good results for most of the tested videos, mainly when these have a high spatial activity and do not contain scene changes; in particular, the flicker effect that occurs with a purely spatial inpainting approach is eliminated.

For future work, it is suggested to complement the proposed method with a scene change detection technique [11] and one of the spatial inpainting methods studied (e.g. Oliveira's method); in this way, the inpainting technique becomes a hybrid spatial-temporal method. For each frame, the weights of the spatial and temporal inpainting components should depend on the existence (or not) of a scene change at that frame. Another aspect that can be exploited in determining the weights of the temporal and spatial components is the confidence of the motion


vectors – the higher the confidence, the higher the temporal

component weight.

REFERENCES

[1] M. Bertalmio, G. Sapiro, V. Caselles and C. Ballester, “Image

Inpainting,” in 27th Conf. on Computer Graphics and Interactive

Techniques, New York, USA, 2000.

[2] T. Chan and J. Shen, “Mathematical Models for Local Deterministic Inpainting,” SIAM Journal on Applied Mathematics, pp. 1-11, March 2000.

[3] M. Oliveira, B. Bowen, R. McKenna and Y.-S. Chang, “Fast

Digital Image Inpainting,” in International Conference on

Visualization, Imaging and Image Processing (VIIP 2001),

Marbella, Spain, 2001.

[4] S. Masnou and J. M. Morel, “Level-Lines Based Disocclusion,”

in International Conference on Image Processing, Chicago,

USA, October, 1998.

[5] A. Telea, “An image inpainting technique based on the fast marching method,” Journal of Graphics Tools, vol. 9, no. 1, pp. 25-36, January 2004.

[6] A. Criminisi, P. Perez and K. Toyama, “Region filling and object

removal by exemplar-based inpainting,” IEEE Transactions on

Image Processing, vol. 13, no. 9, pp. 1200-1212, September

2004.

[7] C. W. Lee, K. Jung and H. J. Kim, “Automatic text detection and removal in video sequences,” Pattern Recognition Letters, vol. 24, no. 15, pp. 2607-2623, November 2003.

[8] A. Koochari and M. Soryani, “Exemplar-based video inpainting

with large patches,” Journal of Zhejiang University-SCIENCE C

(Computers & Electronics), vol. 11, no. 4, pp. 270-277, April

2010.

[9] S. Tsekeridou, F. A. Cheikh, M. Gabbouj and I. Pitas,

“Application of vector rational interpolation to erroneous motion

field estimation for error concealment,” IEEE Transactions on

Multimedia, vol. 6, no. 6, pp. 876-885, December 2004.

[10] A. Rosa, “Dropbox,” 03 10 2015. [Online]. Available:

https://www.dropbox.com/sh/is0wcny4bapg38t/AAB2gSv1NbZk

NDSTslYcmh4Ua?dl=0.

[11] R. Almeida, “Deteção Automática de Descontinuidade

Temporais em Sequências de Video Digital,” Lisboa, Outubro

2015.