Study and Evaluation of Text Inpainting Techniques for Images and Video
Adélcio Rosa, nº 59160
Instituto Superior Técnico
Av. Rovisco Pais, 1, 1049-001 Lisboa
Email: [email protected]
Abstract— In this paper, a set of inpainting algorithms is
implemented and evaluated, in order to identify the most
appropriate one for the restoration of small regions resulting
from the extraction of subtitles from images and video. The initial
state of the art review suggested that the diffusion-based
techniques are the most suitable for the application in view. As
such, four inpainting strategies based on the diffusion principle
were implemented and assessed: Bertalmio method, TV method,
Oliveira method and Telea method; the four methods were
applied to still images and video sequences. In the video case, a
purely spatial approach was used, i.e., the inpainting was applied
to each frame as a still image. When they are applied to video text
inpainting, the results are not generally acceptable; although the
quality of each individual frame is high, when the videos are
displayed at their actual frequency, a flicker effect becomes
visible on the restored area. This is due to the fact that the
algorithms do not consider the time dimension of the video. In an
attempt to minimize the flicker effect, a temporal interpolation
based inpainting method was proposed; in this method, the
regions of the video resulting from text extraction are restored
through motion estimation/compensation based on neighboring
frames, combined with a simple spatial diffusion technique. The
proposed algorithm gives good results for most of the tested
videos, mainly when these have a high spatial activity and do
not contain scene changes; in particular, the flicker effect which
occurs with a purely spatial inpainting approach is eliminated.
Index Terms— Digital inpainting, Text inpainting, Image
restoration, Video restoration, Image quality, Video quality.
I. INTRODUCTION
The term inpainting refers, in general, to the process of
restoring missing or damaged areas in images. The origin of
the term dates back to the Renaissance when the first manual
restoration techniques of medieval pictures were developed
(Figure 1). Nowadays, with the development of digital
image processing techniques, methods that enable the
automatic restoration of images and videos have emerged.
The concept of “digital inpainting” was introduced by
Bertalmio et al., in [1]. In mathematical terms, digital
inpainting can be defined as a 2D (images) or 3D (video)
estimation problem. The goal of inpainting is to estimate the
unknown pixel values from known pixels (which typically belong
to a spatial and/or temporal neighbourhood of the pixels to be
estimated). Accordingly, image/video inpainting is an
ill-posed inverse problem, i.e., the observed data do not
(a) (b)
Figure 1- Artwork restoration by manual inpainting: (a) degraded
picture; (b) restored picture.
uniquely constrain the solution. To solve this problem it is
therefore necessary to introduce a priori conditions on the
pixels to be estimated. All inpainting methods are guided by the
assumption that pixels in the known and unknown parts of the
image share the same statistical properties or geometrical
structures. This assumption translates into different local or
global conditions, with the goal of obtaining a plausible
image with good perceptual quality.
In recent years, the area of digital inpainting has been
the subject of intense research activity, motivated by several
applications such as the automatic restoration of image and video
regions after text removal, error concealment and foreground
removal. In this paper, the target application is the restoration
of small text areas in images and videos, like those resulting
from the extraction and repositioning of movie subtitles. The
main goal is to perform a comparative study of inpainting
algorithms, identifying the most suitable technique for the
restoration of small text areas, considering both the quality of the
resulting images/video and the associated computational cost.
The remainder of this paper is organized as follows.
Section II describes the main inpainting methods proposed in
the literature, detailing those most adequate for the restoration
of small areas. The details of the implemented spatial and
temporal inpainting methods are presented in Section III.
Section IV discusses the experimental results, and Section V
draws conclusions and outlines future work.
II. RELATED WORK
Image inpainting estimates a missing or damaged image
area, so that the restored image becomes as plausible as
possible, with the restored parts undetectable to human
vision (Figure 1). Several methods have been proposed in the
literature for this purpose.
The first category of methods, known as diffusion-based
inpainting, introduces smoothness priors via partial
differential equations (PDEs) to propagate local structures
from the exterior to the interior of the unknown image region,
Ω. Many variants of this category can be found in the
literature, using different diffusion models (linear, non-linear,
isotropic and anisotropic) that favor propagation in a
particular direction or according to the structure present in a
local neighborhood. These methods are naturally well suited
for completing straight and curved lines, and for inpainting
small regions. In the following, three different diffusion-based
methods are reviewed: Total Variation (TV) method [2],
Bertalmio method [1] and Oliveira method [3].
The Total Variation (TV) method was first proposed for
noise removal in [4], and later adapted to the problem of
image inpainting [2]. This method assumes that an image can
be modeled by a function whose total variation is limited.
In this case, the optimal solution minimizes the TV energy
(given by the local gradient modulus) of the region to be inpainted,
while keeping the goal of a smooth propagation of the image
intensity. This minimization is given by:
J_{TV}(I) = \int_S |\nabla I(i,j)| \,dx\,dy + \lambda \int_S |I(i,j) - I_0(i,j)|^2 \,dx\,dy ,   (1)
where the first integral term represents the total gradient (or
variation) of the image and the second term is a data term
measuring the fidelity of the reconstructed image to the input
image. The minimization problem can also be rewritten as (for
simplicity, the pixel position index (i,j) was omitted):
I^{t+1} = I^t + \rho \, \frac{I_{xx} I_y^2 - 2 I_x I_y I_{xy} + I_{yy} I_x^2}{(I_x^2 + I_y^2 + \varepsilon)^{3/2}} + \lambda_s (I^0 - I^t) ,   (2)
where I^{t+1} and I^t are the solutions at iterations t+1 and t,
respectively; I_x and I_y are the image gradients over x and y,
I_{xx} and I_{yy} the corresponding second derivatives, and I_{xy}
the second-order cross derivative; the parameter ε
ensures a nonzero denominator and λ_s is given by:
\lambda_s = \begin{cases} \lambda, & (i,j) \in S \\ 0, & (i,j) \in \Omega \end{cases}   (3)
Bertalmio et al. [1] pioneered a digital image inpainting
algorithm based on PDEs. The main goal of this method is
to minimize the edge-blurring problem that may result from
simple isotropic or anisotropic diffusion methods. The
algorithm propagates the isophote lines (contours of equal
luminance in an image) from the outside into the area to be
inpainted. The propagation direction is given by the isophote
direction (normal to the gradient vector), represented in
Figure 2.
(a) (b)
Figure 2 - (a) Isophote lines; (b) Image intensity propagation over
isophotes directions.
The Bertalmio algorithm can be written as:
I^{t+1}(i,j) = I^t(i,j) + \Delta t \, I_n^t(i,j), \quad (i,j) \in \Omega ,   (4)
where I^t(i,j) represents the image intensity at position (i,j) of the
region to be inpainted, at iteration t; ∆t is the update rate
and I_n^t(i,j) is the update applied to I^t(i,j). This update is given
by:
I_n^t(i,j) = \delta L^t(i,j) \cdot N^t(i,j) ,   (5)
where \delta L^t(i,j) is a measure of the information variation (given by
the image Laplacian), which is projected on the propagation
direction, N^t(i,j).
Following the diffusion-based inpainting
concepts, Oliveira et al., in [3], proposed an inpainting
technique using very simple models which, nevertheless, may achieve
results close to those of much more sophisticated (and
complex) methods. This technique propagates the edge
information into the unknown region through the iterative
convolution of the region to be inpainted with a diffusion
filter (Figure 3), producing good results in image areas where
the contours do not have high contrast.
Figure 3 - Diffusion filter: a = 0.073235; b = 0.176765; c = 0.125.
Telea proposed, in [5], a diffusion-based algorithm without
the computational overhead of the Bertalmio method. As in the
previous methods, the algorithm propagates the image
intensity from a neighborhood of the region to be inpainted;
however, the process is based on a first-order approximation of
the image values in the unknown region.
Exemplar-based algorithms are a second category of
inpainting methods. These methods are based on texture
synthesis techniques, whose basic principle is to produce a
textured image region from a known texture model. Exemplar-
based inpainting was first introduced in [6] and can be
described by the following steps (Figure 4):
1. Definition of a patch size.
2. Computation of a priority for each boundary pixel of the
unknown region, setting the filling order of these
pixels.
3. Search, in the known image area, for the closest patch
to each patch to be filled in.
4. Recalculation of the inpainting region boundary and
of its pixel priorities.
(a) (b)
(c) (d)
Figure 4- Structure of exemplar-based techniques.
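The four steps above can be sketched as follows. This simplified version replaces the priority ordering of [6] with a plain boundary scan, so it only illustrates the patch-search-and-copy mechanism; it assumes the unknown region lies at least patch//2 pixels from the image border and that the known area contains at least one fully known patch.

```python
# Simplified sketch of the exemplar-based filling loop (steps 1-4);
# the priority term of [6] is replaced by a fixed scan order.
import numpy as np

def exemplar_fill(img, known, patch=5):
    """img: 2D float array; known: boolean mask (True = known pixel)."""
    img, known = img.copy(), known.copy()
    h = patch // 2
    while not known.all():
        # Step 2 (simplified): pick any unknown pixel near the boundary.
        ys, xs = np.nonzero(~known)
        i = np.clip(ys[0], h, img.shape[0] - h - 1)
        j = np.clip(xs[0], h, img.shape[1] - h - 1)
        tgt = img[i-h:i+h+1, j-h:j+h+1]       # target patch (view)
        tmask = known[i-h:i+h+1, j-h:j+h+1]   # its known pixels
        # Step 3: exhaustive search of the fully known area for the
        # closest patch (SSD over the already-known target pixels).
        best, best_cost = None, np.inf
        for y in range(h, img.shape[0] - h):
            for x in range(h, img.shape[1] - h):
                if not known[y-h:y+h+1, x-h:x+h+1].all():
                    continue
                cand = img[y-h:y+h+1, x-h:x+h+1]
                cost = ((cand - tgt)[tmask] ** 2).sum()
                if cost < best_cost:
                    best, best_cost = cand, cost
        # Copy the unknown part of the best patch; step 4: update mask.
        tgt[~tmask] = best[~tmask]
        known[i-h:i+h+1, j-h:j+h+1] = True
    return img
```

On a periodic texture (e.g. a checkerboard) the known pixels of each target patch fix the texture phase, so the hole is recovered exactly.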
All the described inpainting techniques can be directly applied
to video, considering each frame as a still image. However,
although the quality of the resulting images can be
perceptually acceptable when evaluating each frame
separately, this may not be the case when evaluating the video
quality as a whole. In fact, spatial inpainting may produce only
small differences when applied on a frame-by-frame basis, but
these differences cause flicker effects in the video, with a
high subjective impact. To minimize this flicker effect, the
temporal domain must be taken into account. Lee et al.
proposed, in [7], a spatio-temporal based inpainting technique
to restore text regions in videos. Another video inpainting
technique was proposed in [8], in order to fill in the missing
areas using the patch-based inpainting method proposed in [6].
The first stage of this method is to separate the static
background and moving foreground through a threshold
mechanism; the second stage is to perform the inpainting of
both components; the third stage reconstructs the video
through a foreground and background composition.
III. IMPLEMENTATION
From the literature review, it was concluded that
diffusion-based algorithms present good results for the
inpainting of small regions. For this reason, three diffusion-
based spatial algorithms were selected for implementation,
in order to determine the best one in terms of resulting
quality and processing time. This section presents the
implementation details of each algorithm. In order to minimize
the flicker effect that results from a purely spatial
approach to video inpainting, a temporal interpolation based
algorithm is also proposed.
A. Spatial Inpainting
As mentioned above, all implemented spatial inpainting
methods are PDE-based, and propagate the information from
the inpainting area boundary, ∂Ω, into the inpainting area, Ω,
through an iterative process.
1. Total Variation Method
The function which minimizes the total image variation
energy is described as:
I^{t+1} = I^t + \rho \, \frac{I_{xx} I_y^2 - 2 I_x I_y I_{xy} + I_{yy} I_x^2}{(I_x^2 + I_y^2 + \varepsilon)^{3/2}} + \lambda_s (I^0 - I^t) ,   (6)
where I^{t+1}, I^t and I^0 represent, respectively, the resulting image
at the current iteration, the resulting image at the previous iteration
and the input image; ρ is an update rate and ε ensures a nonzero
denominator; I_x and I_y are the image gradients in the x and y
directions, respectively, and can be computed by:
I_x(i,j) = \frac{I(i+1,j) - I(i-1,j)}{2} , \qquad I_y(i,j) = \frac{I(i,j+1) - I(i,j-1)}{2} ;   (7)
Ixx, Iyy and Ixy are the image Laplacians, and can be computed
by:
I_{xx}(i,j) = I(i+1,j) - 2 I(i,j) + I(i-1,j)
I_{yy}(i,j) = I(i,j+1) - 2 I(i,j) + I(i,j-1)
I_{xy}(i,j) = \frac{I(i+1,j+1) - I(i+1,j-1) - I(i-1,j+1) + I(i-1,j-1)}{4} .   (8)
The λ_s parameter is given by:
\lambda_s = \begin{cases} \lambda, & (i,j) \in S \\ 0, & (i,j) \in \Omega \end{cases}   (9)
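The TV update of eqs. (6)-(9) can be sketched as below; numpy is used in place of the Matlab implementation, the finite differences follow eqs. (7)-(8), and the parameter defaults follow the values listed in Section IV.

```python
# Sketch of one TV inpainting iteration (eqs. 6-9), with the central
# differences of eqs. (7)-(8); defaults follow Section IV
# (rho=0.2, lambda=0.05, eps=0.01). Borders wrap (np.roll) for brevity.
import numpy as np

def tv_step(I, I0, omega, rho=0.2, lam=0.05, eps=0.01):
    """I: current estimate; I0: observed image; omega: True inside the
    inpainting region Omega (where the fidelity weight lambda_s is 0)."""
    Ix  = (np.roll(I, -1, 1) - np.roll(I, 1, 1)) / 2.0
    Iy  = (np.roll(I, -1, 0) - np.roll(I, 1, 0)) / 2.0
    Ixx = np.roll(I, -1, 1) - 2 * I + np.roll(I, 1, 1)
    Iyy = np.roll(I, -1, 0) - 2 * I + np.roll(I, 1, 0)
    Ixy = (np.roll(np.roll(I, -1, 1), -1, 0) - np.roll(np.roll(I, -1, 1), 1, 0)
           - np.roll(np.roll(I, 1, 1), -1, 0) + np.roll(np.roll(I, 1, 1), 1, 0)) / 4.0
    # Curvature-driven diffusion term of eq. (6).
    num = Ixx * Iy**2 - 2 * Ix * Iy * Ixy + Iyy * Ix**2
    den = (Ix**2 + Iy**2 + eps) ** 1.5
    lam_s = np.where(omega, 0.0, lam)  # eq. (9)
    return I + rho * num / den + lam_s * (I0 - I)
```

A constant image is a fixed point of this update (all derivatives vanish and the fidelity term is zero), and iterating it on a masked image propagates the boundary values into the hole.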
2. Bertalmio Method
The Bertalmio method uses two processes: inpainting and
anisotropic diffusion; each global iteration is
composed of T_d iterations of anisotropic diffusion, followed
by T_i iterations of inpainting. The inpainting process is
described by the discrete equation
I^{t_i+1}(i,j) = I^{t_i}(i,j) + \Delta t \, I_n^{t_i}(i,j), \quad (i,j) \in \Omega ,   (10)
where I^{t_i}(i,j) is the resulting image at iteration t_i, I_n^{t_i}(i,j)
represents an update of the pixel intensity at position (i,j) and
Δt is an update rate, given as an input parameter of the
algorithm. Given I^{t_i}(i,j), I_n^{t_i}(i,j) is computed as:
I_n^{t_i}(i,j) = \left( \delta L^{t_i}(i,j) \cdot \frac{N(i,j,t_i)}{|N(i,j,t_i)|} \right) |\nabla I^{t_i}(i,j)| ,   (11)
where N(i,j,t_i)/|N(i,j,t_i)| is the isophote direction at position (i,j),
at iteration t_i, and is given by:
N(i,j,t_i) = \left( -I_y^{t_i}(i,j),\ I_x^{t_i}(i,j) \right)   (12)
and
\frac{N(i,j,t_i)}{|N(i,j,t_i)|} = \left( \frac{-I_y^{t_i}(i,j)}{\sqrt{I_x^{t_i}(i,j)^2 + I_y^{t_i}(i,j)^2}},\ \frac{I_x^{t_i}(i,j)}{\sqrt{I_x^{t_i}(i,j)^2 + I_y^{t_i}(i,j)^2}} \right) ,   (13)
with I_x^{t_i}(i,j) and I_y^{t_i}(i,j) representing the gradients in the x and y
directions, respectively:
I_x(i,j) = \frac{I(i+1,j) - I(i-1,j)}{2} , \qquad I_y(i,j) = \frac{I(i,j+1) - I(i,j-1)}{2} .   (14)
The Laplacian variation described by eq. (11), \delta L^{t_i}, is
given by:
\delta L^{t_i}(i,j) = \left( \Delta I^{t_i}(i+1,j) - \Delta I^{t_i}(i-1,j),\ \Delta I^{t_i}(i,j+1) - \Delta I^{t_i}(i,j-1) \right) ,   (15)
where ∆I^{t_i}(i,j) results from the convolution of the filter in
Figure 5 with the resulting image at iteration t_i.
Figure 5- 2D filter used to compute the image Laplacian.
The image gradient modulus, |\nabla I^{t_i}(i,j)|, is given by
|\nabla I^{t_i}(i,j)| = \begin{cases} \sqrt{ (I_{xbm}^{t_i})^2 + (I_{xfM}^{t_i})^2 + (I_{ybm}^{t_i})^2 + (I_{yfM}^{t_i})^2 }, & \beta^{t_i}(i,j) > 0 \\ \sqrt{ (I_{xbM}^{t_i})^2 + (I_{xfm}^{t_i})^2 + (I_{ybM}^{t_i})^2 + (I_{yfm}^{t_i})^2 }, & \beta^{t_i}(i,j) < 0 \end{cases}   (16)
where I_x and I_y are backward or forward image gradients; the
subscripts b and f denote, respectively, backward and forward, and
m and M denote, respectively, the minimum and the maximum between
zero and the gradient. \beta^{t_i} represents the Laplacian variation
along the isophote direction:
\beta^{t_i}(i,j) = \delta L^{t_i}(i,j) \cdot \frac{N(i,j,t_i)}{|N(i,j,t_i)|} .   (17)
The anisotropic diffusion is determined by
I^{t_d+1} = I^{t_d} + \rho \, \frac{ I_{xx}^{t_d} (I_y^{t_d})^2 + I_{yy}^{t_d} (I_x^{t_d})^2 }{ (I_x^{t_d})^2 + (I_y^{t_d})^2 + \varepsilon } ,   (18)
where ρ is the diffusion coefficient; I_{xx} and I_{yy} are the image
second derivatives over x and y, respectively; I_x and I_y are the image
gradients over x and y, respectively; and ε is a positive
parameter that guarantees a non-null denominator. The
gradients and Laplacians are given by centered differences:
I_x^{t_d} = \frac{I^{t_d}(i+1,j) - I^{t_d}(i-1,j)}{2} , \qquad I_y^{t_d} = \frac{I^{t_d}(i,j+1) - I^{t_d}(i,j-1)}{2} ,   (19)
and
I_{xx}^{t_d} = I^{t_d}(i+1,j) - 2 I^{t_d}(i,j) + I^{t_d}(i-1,j)
I_{yy}^{t_d} = I^{t_d}(i,j+1) - 2 I^{t_d}(i,j) + I^{t_d}(i,j-1) .   (20)
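For reference, the inpainting update of eqs. (10)-(15) can be sketched as below. For brevity, the slope-limited gradient of eq. (16) is replaced by a plain central-difference magnitude, and the interleaved anisotropic diffusion of eq. (18) is omitted, so this is a simplified sketch rather than the full method.

```python
# Simplified sketch of the Bertalmio inpainting update (eqs. 10-15):
# transport of the Laplacian along the isophote direction, applied only
# inside Omega. Borders wrap (np.roll) for brevity.
import numpy as np

def bertalmio_step(I, omega, dt=0.1, eps=1e-8):
    def dx(A): return (np.roll(A, -1, 1) - np.roll(A, 1, 1)) / 2.0
    def dy(A): return (np.roll(A, -1, 0) - np.roll(A, 1, 0)) / 2.0
    # Information to propagate: the image Laplacian (Figure 5 filter).
    L = (np.roll(I, -1, 1) + np.roll(I, 1, 1)
         + np.roll(I, -1, 0) + np.roll(I, 1, 0) - 4 * I)
    dLx, dLy = dx(L), dy(L)                      # eq. (15)
    Ix, Iy = dx(I), dy(I)
    norm = np.sqrt(Ix**2 + Iy**2 + eps)
    Nx, Ny = -Iy / norm, Ix / norm               # eqs. (12)-(13)
    beta = dLx * Nx + dLy * Ny                   # eq. (17)
    In = beta * np.sqrt(Ix**2 + Iy**2)           # eq. (11), simplified
    out = I.copy()
    out[omega] = I[omega] + dt * In[omega]       # eq. (10): update Omega
    return out
```

Note that, unlike the diffusion methods, the update is restricted to Ω, so the known pixels are left untouched.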
3. Oliveira Method
The Oliveira algorithm is a quite simple method,
consisting of the iterative convolution of a filter with the
image to be restored. The restoration process can be written
iteratively as
I^{t+1}(i,j) = a \, I_1^t(i,j) + b \, I_2^t(i,j) ,   (21)
where a and b are the filter coefficient values (Figure 3), and
I_1^t(i,j) = I^t(i-1,j-1) + I^t(i-1,j+1) + I^t(i+1,j-1) + I^t(i+1,j+1)
I_2^t(i,j) = I^t(i-1,j) + I^t(i+1,j) + I^t(i,j-1) + I^t(i,j+1) .   (22)
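A sketch of this iteration, using the filter coefficients of Figure 3 (a = 0.073235 for the diagonal neighbors and b = 0.176765 for the horizontal/vertical ones); only pixels in Ω are updated, and iterating the convolution diffuses the boundary values into the hole.

```python
# Sketch of the Oliveira iteration (eqs. 21-22) with the Figure 3
# coefficients; np.roll shifts implement the neighbor sums.
import numpy as np

def oliveira_step(I, omega, a=0.073235, b=0.176765):
    diag = (np.roll(np.roll(I, 1, 0), 1, 1) + np.roll(np.roll(I, 1, 0), -1, 1)
            + np.roll(np.roll(I, -1, 0), 1, 1) + np.roll(np.roll(I, -1, 0), -1, 1))
    cross = (np.roll(I, 1, 0) + np.roll(I, -1, 0)
             + np.roll(I, 1, 1) + np.roll(I, -1, 1))
    out = I.copy()
    out[omega] = a * diag[omega] + b * cross[omega]  # eq. (21)
    return out

def oliveira_inpaint(I, omega, iters=100):
    for _ in range(iters):
        I = oliveira_step(I, omega)
    return I
```

Since 4a + 4b = 1, the filter is a weighted average, and the iteration converges toward the boundary values of the hole.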
B. Temporal Inpainting
In order to minimize the flicker effect resulting from a
purely spatial inpainting of video sequences, a temporal
interpolation based inpainting algorithm is proposed. The
inpainting is performed only in the region resulting from text
removal, using motion estimation and compensation between
the frame to be processed and its two neighboring frames.
The proposed method starts by determining the
smallest rectangular region that contains all the
pixels to be inpainted (shown in red in Figure 6); the regions
immediately above and below the rectangular area are then
divided into blocks, as shown in Figure 6.
Figure 6- Block division in the inpainting region neighborhood.
In the next step, the motion of each block is estimated; this
estimate can be backward (using the previous frame) or forward
(using the following frame), and is performed by a
conventional block matching algorithm (BMA), whose
working principle is shown in Figure 8. For each block, a
vector (B or F) is obtained through the best matching block in
the previous or next frame.
Finally, the inpainting of the unknown pixels is
accomplished through motion compensation, using, for
each pixel, the six closest vectors (following a strategy
proposed in [9]), as shown in Figure 7.
Figure 7- Motion vectors (v1 to v6) considered for the inpainting of
the pixels highlighted in orange.
1. Motion Estimation
As mentioned above, the block motion estimation is
performed through BMA, using the previous frame (backward
estimate) or the following frame (forward estimate), and
considering only the luminance component of the video.
Figure 8 represents the BMA algorithm for the backward case,
where the block size is N×N and the search area
size is Ri×Rj; this area is defined in the previous
(or following) frame, around the position of the block whose vector is
to be estimated. To find the motion vector, the block is
compared to all the blocks of the search area. The comparison
is performed using one of four similarity metrics: mean
absolute error (MAE); structural similarity index (SSIM);
MAE+SMD (side match distortion); SSIM+SMD.
Figure 8- BMA basics.
The difference between the position of the block whose
vector has to be estimated (represented in blue in Figure 8)
and the position of the most similar block (represented in red
in Figure 8) is taken as the block motion vector; this
difference is given by the vector (d_i, d_j).
The MAE metric is defined as:
MAE(d_i, d_j) = \frac{1}{N^2} \sum_{(i,j) \in \mathrm{block}} \left| I_t(i,j) - I_{t-1}(i + d_i,\, j + d_j) \right| ;   (23)
The SMD metric [9] is the mean absolute error between the
internal boundary pixels of the reference block (blue in Figure
8) and the external boundary pixels of the candidate block.
The combination of a metric (MAE or SSIM) with SMD is
performed by
Metric_{MAE\&SMD} = \mu \, MAE + (1 - \mu) \, SMD
Metric_{SSIM\&SMD} = \mu \, SSIM + (1 - \mu) \, SMD ,   (24)
where µ sets the relative weight of the metrics; before
combination, all the metrics are normalized to the interval
[0,1].
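The backward block-matching search with the MAE criterion of eq. (23) can be sketched as follows; the default block size follows the Bsize = 16 used in Section IV, while the search range R is an illustrative choice.

```python
# Sketch of backward block-matching motion estimation with the MAE
# criterion of eq. (23): exhaustive search over a (2R+1)x(2R+1) window.
import numpy as np

def bma_mae(curr, prev, i0, j0, N=16, R=8):
    """Return the (di, dj) minimizing the MAE between the NxN block of
    `curr` at (i0, j0) and candidate blocks of `prev` within range R."""
    block = curr[i0:i0+N, j0:j0+N].astype(float)
    best, best_mae = (0, 0), np.inf
    for di in range(-R, R + 1):
        for dj in range(-R, R + 1):
            y, x = i0 + di, j0 + dj
            if y < 0 or x < 0 or y + N > prev.shape[0] or x + N > prev.shape[1]:
                continue  # candidate block falls outside the frame
            cand = prev[y:y+N, x:x+N].astype(float)
            mae = np.abs(block - cand).mean()  # eq. (23)
            if mae < best_mae:
                best, best_mae = (di, dj), mae
    return best
```

On a frame that is a pure translation of the previous one, the exhaustive search recovers the true displacement exactly (MAE = 0 at the correct offset).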
2. Inpainting with motion compensation
For each pixel to be inpainted, the six closest motion
vectors (Figure 7) are considered. Let I_n(i,j) be the resulting
intensity of the pixel at position (i,j), after motion
compensation with motion vector v_n; the resulting inpainted
value for the pixel is given by
I(i,j) = \sum_{n=1}^{6} w_n(i,j) \, I_n(i,j) ,   (25)
where w_n is a weight that depends on the distance between the
pixel and vector v_n (eq. 26).
Finally, in order to minimize the visibility of the restored
region boundary, some cycles of anisotropic spatial diffusion
are applied.
Equation (26) gives each weight w_n(i,j), n = 1, …, 6, as a function of
the horizontal and vertical distances between the pixel (i,j) and the
block associated with the motion vector v_n, so that closer vectors
receive larger weights.   (26)
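The blending of eq. (25) can be sketched as below. Since the exact closed forms of the weights in eq. (26) follow the interpolation strategy of [9], an illustrative normalized inverse-distance weighting stands in for them here; it is an assumption for the sketch, not the paper's exact formula.

```python
# Sketch of the motion-compensated blending of eq. (25). The weights
# here are an illustrative stand-in for eq. (26): normalized inverse
# distances to the six vector positions.
import numpy as np

def blend_candidates(candidates, distances):
    """candidates: the 6 motion-compensated intensities I_n(i, j);
    distances: distance of pixel (i, j) to each vector v_n."""
    w = 1.0 / (np.asarray(distances, dtype=float) + 1.0)
    w /= w.sum()                       # weights sum to one
    return float(np.dot(w, candidates))
```

With equal distances, the six weights are all 1/6 and the result reduces to the plain average of the candidates.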
IV. RESULTS
This section presents the inpainting results (both in terms
of quality and processing time) for the algorithms described in
the previous section. The first results concern the spatial
inpainting methods (TV, Bertalmio, Oliveira and Telea),
applied to still images and video, in order to determine the
best algorithm for the restoration of small text areas. Next, the results
of the temporal interpolation method are presented.
Figure 9 presents one frame of each test video; the first
frame of each video was used as a still image. The original
videos and all restored images and videos can be found in [10].
Figures 10 and 12 show the masks used to simulate the regions
to be reconstructed.
(a) (b) (c) (d)
(e) (f) (g) (h)
Figure 9 - Test Videos and images: (a) Bigships; (b) Oldtown;
(c) Parkjoy; (d) Raven; (e) Snowmontain; (f) Soccer; (g) Station;
(h) Sunflower.
(a) (b)
Figure 10 - Masks used in inpainting process: (a) Mask1; (b) Mask2.
All the algorithms, except the Telea method, were
implemented in Matlab; for the Telea method, a software
module available in the OpenCV library was used. The
algorithms were executed on a computer with 8 GBytes
of RAM and an Intel® Core™ i7 processor running at 2.8 GHz.
A. Spatial Inpainting
Based on the parameter values proposed in the original
papers and also on our own experimental tests, the following
parameters were set:
TV method: 1500 iterations; ρ=0.2; λ=0.05; ε=0.01.
Bertalmio method: 100 global iterations, each one
composed by 2 diffusion iterations and 15
inpainting iterations; ρ=0.2; ε=0.001; Δt=0.1.
Oliveira method: 100 iterations; a=0.073235,
b=0.176765.
The comparison between the techniques takes into account
the resulting quality of the inpainted region and the associated
execution time. For the quality assessment, two objective
metrics were used: peak signal-to-noise ratio (PSNR) and
SSIM; in the former, only the restored pixels were considered;
in the latter, the metric considers all the pixels belonging to
the minimal rectangle containing the mask. For converting the
PSNR values to subjective scores, Table 1 was used. SSIM is
a normalized metric that takes values between zero
(restored and original image are completely different) and one
(restored and original image have exactly the same pixel
values). Only the luminance component has been considered.
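A sketch of the PSNR computation restricted to the restored pixels, as used in this evaluation (8-bit luminance assumed, so the peak value is 255):

```python
# PSNR computed only over the restored pixels (boolean mask),
# assuming an 8-bit luminance component (peak value 255).
import numpy as np

def psnr_masked(original, restored, mask):
    diff = (original.astype(float) - restored.astype(float))[mask]
    mse = (diff ** 2).mean()
    if mse == 0:
        return float('inf')  # identical restored region
    return 10 * np.log10(255.0 ** 2 / mse)
```

For example, a constant error of 5 grey levels over the mask gives an MSE of 25 and hence a PSNR of about 34 dB ("medium" quality in Table 1).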
Table 1 - PSNR values versus subjective quality ratio.
PSNR [dB] Quality
< 30 Low
30 – 36 Medium
36 – 42 High
> 42 Excellent
Table 2 presents the resulting PSNR values for each
method, with Mask1. This table shows that the images
resulting from the Bertalmio and Oliveira methods present
high or even excellent quality (PSNR clearly above 36
dB), except for Snowmontain, which presents medium quality.
This is confirmed by Table 3, where the SSIM values are
higher than 0.9 for all images and algorithms. The Telea and TV
methods show, in every case, worse results than the Bertalmio and
Oliveira methods (although most of them also have
high/excellent quality).
If a mask contains several inpainting regions (e.g., Mask2),
the inpainting results are different for each region (Figure 11),
according to their size and/or the spatial content of the region
to be inpainted (e.g., textured regions are more difficult to
restore properly than uniform regions).
Table 2 - Values of PSNR for still images (Mask1).
Image PSNR [dB]
TV Bertalmio Oliveira Telea
Bigships 35.95 39.93 39.87 35.41
Oldtown 35.29 40.08 40.08 35.13
Parkjoy 37.91 49.63 49.64 45.81
Raven 40.88 45.51 45.50 39.65
Snowmontain 30.66 34.43 34.56 30.44
Soccer 37.10 41.93 41.78 37.34
Station 43.09 48.44 48.54 43.17
Sunflower 45.13 55.11 55.39 43.59
Table 3 - Values of SSIM for still images (Mask1).
Image SSIM
TV Bertalmio Oliveira Telea
Bigships 0.79 0.93 0.93 0.87
Oldtown 0.82 0.93 0.93 0.87
Parkjoy 0.63 0.98 0.98 0.95
Raven 0.82 0.95 0.95 0.89
Snowmontain 0.82 0.92 0.92 0.87
Soccer 0.80 0.93 0.93 0.87
Station 0.78 0.97 0.97 0.92
Sunflower 0.88 0.99 0.99 0.95
(a) (b)
Figure 11 - PSNR and SSIM evolution curves using Bertalmio
method, Soccer still image and Mask2.
Another important criterion for the evaluation of the methods is
the processing time, even if in most inpainting
applications there is no requirement for real-time processing.
If two methods yield similar image/video quality, then the
decision criterion should be the processing time (or the
complexity associated with the method). Table 4 presents the
processing time (in seconds) for each method, using Mask1;
the indicated time corresponds to the iteration at which the
resulting quality of the restored images reaches a stationary
state. As mentioned previously, for the Telea method a
function developed in C++ (OpenCV) was used, with
optimized code. The other methods were implemented in the
Matlab2015a environment, without concerns about code
optimization.
Table 4 - Processing time for still images (Mask1).
Image Processing Time [s/frame]
TV Bertalmio Oliveira Telea
Bigships 350 80 1.5 1.23
Oldtown 300 75 1.3 1.02
Parkjoy 110 77 1.3 1.07
Raven 280 80 1.4 1.24
Snowmontain 300 70 1.1 1.22
Soccer 300 80 1.1 1.28
Station 200 90 1.4 1.05
Sunflower 200 90 1.5 1.09
The Oliveira method presents the shortest processing time
of all the considered methods. Taking into account the quality of
the inpainted image, it follows that the Oliveira method, proposed
in [3], is the most effective for text inpainting, for text
characters with dimensions similar to those used in movie
subtitles.
For video inpainting, the considered methods were applied
to each video frame. Although the quality remains high or
excellent when each frame is observed individually, while
playing the restored video a flicker effect becomes visible in the
restored area. Some examples of this phenomenon can be
found in [10]. This phenomenon is due to small variations,
from frame to frame, in the restored region.
B. Temporal Inpainting
For temporal video inpainting, Mask3 (Figure 12) was
initially used. Table 5 shows the resulting SSIM values for
each video, and for the different similarity metrics that can be
used in the motion estimation; in these results, the diffusion
filter was disabled. These values are the average of the SSIM
values obtained for each video frame. The motion estimation
was performed with one-pixel resolution, and the search
window size, Wsize, was chosen according to the video's
temporal activity, in order to minimize the testing time. The
remaining parameters were set as follows: Bsize = 16 and
µ = 0.2.
Figure 12 - Mask3 used on video inpainting process.
Table 5 - SSIM results for the four similarity metrics used in motion
estimation, without the diffusion filter.
Video Wsize Similarity metrics
SSIM MAE SSIM + SMD
MAE + SMD
Bigships 5 0.722 0.636 0.734 0.653
Parkjoy 40 0.843 0.623 0.851 0.674
Raven 10 0.899 0.803 0.893 0.795
Snowmontain 5 0.684 0.752 0.702 0.578
Soccer 40 0.831 0.675 0.834 0.728
Station 10 0.853 0.800 0.857 0.759
Sunflower 10 0.971 0.956 0.953 0.865
Based on Table 5 it can be concluded that, in general, the
SSIM metric presents the best results, particularly when
combined with SMD. In order to analyze the impact of the
diffusion filter on the inpainting quality, the tests were
repeated using the same parameters, for the similarity metric
SSIM+SMD, applying 50 diffusion iterations. The results
are shown in Table 6, confirming objectively an improvement
in the quality of the restored videos. This improvement is more
evident in the Bigships and Snowmontain sequences. These
sequences are particularly difficult because they contain a scene
transition implemented through dissolves, and therefore the
vectors estimated in the temporal window wherein the
dissolve occurs do not correspond to the real video
motion. Consequently, the purely temporal inpainting
produces poor results, as can be seen in Figure 13, which are
attenuated by the diffusion filter. However, the temporal
interpolation process can recover from the distortion caused
by dissolves after some frames, as can be seen in Figure 14.
Figures 15 and 16 show the evolution of PSNR and SSIM for
the Sunflower and Bigships video sequences, respectively. For
the Bigships sequence, the PSNR and SSIM values confirm
the lower subjective quality; however, the resulting quality is
very close to the original in some time intervals.
Table 6 - SSIM results using SSIM+SMD for motion estimation, with
the diffusion filter enabled.
Video Wsize SSIM+SMD
Niter=0 Niter=50
Bigships 5 0.734 0.790
Parkjoy 40 0.851 0.855
Raven 10 0.893 0.899
Snowmontain 5 0.702 0.792
Soccer 40 0.834 0.839
Station 10 0.857 0.887
Sunflower 10 0.953 0.961
(a) (b)
(c)
(d)
(e)
Figure 13 - Temporal inpainting result for Bigships video during the
dissolve: (a) video frame with mask at the beginning of the dissolve; (b)
video frame during the dissolve; (c) video frame at the end of the
dissolve; (d) crop of the inpainting region of the masked frame; (e) crop
of the inpainting region of the frame restored using temporal inpainting.
(a)
(b)
(c)
Figure 14 - Inpainting result for Bigships video after the dissolve: (a)
masked frame; (b) crop of the inpainting region; (c) inpainting result of (b).
(a) (b)
Figure 15 - PSNR (a) and SSIM (b) for video Sunflower using mask3.
(a) (b)
Figure 16 - PSNR (a) and SSIM (b) for video Bigships using mask3.
Table 7 presents the processing time (seconds per frame)
of the motion estimation procedure, for several Bsize and Wsize
values and for the different considered similarity metrics. For
Mask3 and Bsize=16×16 there are 20 blocks, whereas for
Bsize=32×32 there are 12 blocks. One would expect the
processing time to be higher for the SSIM+SMD and
MAE+SMD metrics, but in fact these two metrics perform
fewer comparisons than SSIM and MAE,
respectively; the number of comparisons decreases from
(2 Wsize − 1)² to (2 Wsize − 3)² due to the way SMD was
implemented. Finally, the impact of the number of diffusion
iterations was analyzed, and the results are shown in Table 8.
The diffusion filter is only applied to the mask pixels. To
obtain the total processing time, the motion estimation time
has to be added to the filtering time.
Table 7 - Processing times (seconds per frame) of the motion
estimation procedure for different Bsize and Wsize values, and for all
similarity metrics.
Bsize Wsize
Processing Time [s/frame]
SSIM MAE SSIM + SMD
MAE + SMD
16
10 4.2 1.5 3.6 1.4
20 15.7 5.3 14.8 5.3
30 34.7 11.9 34.4 12.1
40 61.0 21.0 59.7 21.2
32
10 3.5 3.4 3.0 2.9
20 12.7 12.1 12.1 11.2
30 27.5 27.3 27.3 25.7
40 48.3 46.1 50.5 46.7
Table 8 - Processing time of diffusion filter
Niter Processing Time
[s/frame]
50 0.30
100 0.59
150 0.89
200 1.20
V. CONCLUSION
The aim of this work was to study and evaluate a set of
techniques for text inpainting in still images and videos. From
the literature overview, it was concluded that diffusion-
based techniques were the most suitable for the application in
view. Accordingly, four diffusion-based strategies were
implemented: TV, Bertalmio, Oliveira and Telea (for the last,
an existing implementation in OpenCV was used). These
methods were applied to still images and video sequences; for
the videos, a purely spatial approach was used, i.e., the
inpainting was applied to each frame as if it were a still image.
The four methods were compared both in terms of the resulting
quality of the restored image/video and in terms of processing
time. All the methods produced good results when applied to
still images. The method proposed by Oliveira et al. [3]
presents the best results in terms of quality versus processing
time. When the methods are applied to video text inpainting,
the results are not generally acceptable; although the quality of
each individual frame is high, when the videos are displayed at
their actual frequency, a flicker effect becomes visible in the
restored area. This is due to the fact that the algorithms do not
consider the time dimension of the video. In an attempt to
minimize the flicker effect, a temporal interpolation based
inpainting method was proposed; in this method, the regions
of the video resulting from text extraction are restored through
motion estimation/compensation based on neighboring frames,
combined with a simple spatial diffusion technique. In the
motion estimation procedure, based on a simple block
matching algorithm (BMA), four similarity measures were
evaluated; the "SSIM+SMD" measure was the one whose
vectors led to the best video inpainting quality. The proposed
algorithm gives good results for most of the tested videos,
mainly when these have a high spatial activity and do not
contain scene changes; in particular, the flicker effect which
occurs with a purely spatial inpainting approach is eliminated.
For future work, it is suggested to complement the
proposed method with a technique for scene change detection
[11] and one of the studied spatial inpainting methods (e.g.,
Oliveira's method) – in this way, the inpainting technique will
become a hybrid spatial-temporal method. For each frame, the
weights of the spatial and temporal inpainting components should
depend on the existence (or not) of a scene change in
that frame. Another aspect that can be exploited in the
determination of the weights associated with the temporal and
spatial inpainting components is the confidence of the motion
vectors – the higher the confidence, the higher the temporal
component weight.
REFERENCES
[1] M. Bertalmio, G. Sapiro, V. Caselles and C. Ballester, “Image
Inpainting,” in 27th Conf. on Computer Graphics and Interactive
Techniques, New York, USA, 2000.
[2] T. Chan and J. Shen, “Mathematical models for Local
Deterministic Inpainting,” SIAM Journal on Applied
Mathematics, pp. 1-11, March 2000.
[3] M. Oliveira, B. Bowen, R. McKenna and Y.-S. Chang, “Fast
Digital Image Inpainting,” in International Conference on
Visualization, Imaging and Image Processing (VIIP 2001),
Marbella, Spain, 2001.
[4] S. Masnou and J. M. Morel, “Level-Lines Based Disocclusion,”
in International Conference on Image Processing, Chicago,
USA, October, 1998.
[5] A. Telea, “An image inpainting technique based on the fast
marching method,” Journal of Graphics Tools, vol. 9, no. 1,
pp. 25-36, January 2004.
[6] A. Criminisi, P. Perez and K. Toyama, “Region filling and object
removal by exemplar-based inpainting,” IEEE Transactions on
Image Processing, vol. 13, no. 9, pp. 1200-1212, September
2004.
[7] C. W. Lee, K. Jung and H. J. Kim, “Automatic text detection and
removal in video sequences,” Journal Pattern Recognition
Letters, vol. 24, no. 15, pp. 2607 - 2623, November 2003.
[8] A. Koochari and M. Soryani, “Exemplar-based video inpainting
with large patches,” Journal of Zhejiang University-SCIENCE C
(Computers & Electronics), vol. 11, no. 4, pp. 270-277, April
2010.
[9] S. Tsekeridou, F. A. Cheikh, M. Gabbouj and I. Pitas,
“Application of vector rational interpolation to erroneous motion
field estimation for error concealment,” IEEE Transactions on
Multimedia, vol. 6, no. 6, pp. 876-885, December 2004.
[10] A. Rosa, “Dropbox,” 03 10 2015. [Online]. Available:
https://www.dropbox.com/sh/is0wcny4bapg38t/AAB2gSv1NbZk
NDSTslYcmh4Ua?dl=0.
[11] R. Almeida, “Deteção Automática de Descontinuidade
Temporais em Sequências de Video Digital,” Lisboa, Outubro
2015.