
To appear in ACM TOG 0(0).

Evaluation of Video Artifact Perception Using Event-Related Potentials

Lea Lindemann    Stephan Wenger    Marcus Magnor
Computer Graphics Lab, TU Braunschweig

Abstract

When new computer graphics algorithms for image and video editing, rendering or compression are developed, the quality of the results has to be evaluated and compared. Since the produced media are usually to be presented to an audience, it is important to predict image and video quality as it would be perceived by a human observer. This can be done by applying some image quality metric or by expensive and time-consuming user studies. Typically, statistical image quality metrics do not correlate with quality perceived by a human observer. More sophisticated HVS-inspired algorithms often do not generalize to arbitrary images. A drawback of user studies is that perceived image or video quality is filtered by a decision process, which, in turn, may be influenced by the performed task and chosen quality scale. To get an objective view on (subjectively) perceived image quality, electroencephalography can be used. In this paper we show that artifacts appearing in videos elicit a measurable brain response which can be analyzed using the event-related potentials technique. Since electroencephalography itself requires an elaborate procedure, we aim to find a minimal setup to reduce the time and participants needed to conduct a reliable study of image and video quality. As a first step we demonstrate that the reaction to a video with or without an artifact can be identified by an off-the-shelf support vector machine, which is trained on a set of previously recorded responses, with a reliability of up to 80% from a single recorded electroencephalogram.

© ACM, 2011. This is the authors' version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.

Keywords: video, artifact detection, perception, electroencephalography (EEG), event-related potentials (ERP), support vector machines (SVM)

1 Introduction

In computer vision and graphics the human visual system (HVS) and perception are of great interest [McNamara et al. 2010]. Many algorithms for image and video generation and encoding exploit properties of the HVS, like masking effects or color sensitivity, to create images and videos with higher perceived quality. The performance of these algorithms is evaluated either using an image quality metric or by a user study [Kosara et al. 2003]. An issue with user studies is that they require a large number of participants to give statistically significant results, and even then quality ratings may suffer from a high variance [Wang et al. 2004]. Another problem is that the quality ratings obtained by user studies are always filtered by some decision process which, in turn, may be influenced by the task and/or rating scale the participants are given [Ponomarenko et al. 2009]. Instead of assessing image quality using user studies, electroencephalography could be used. This would eliminate the influence of metrics or choices on the resulting quality rating.

An electroencephalograph measures voltages produced by neural activity. It records an electroencephalogram (EEG) containing the always-present random spontaneous EEG, in which is embedded the activity elicited by a specific stimulus. If a participant is placed in a controlled environment and presented a stimulus multiple times, signals for these trials can be averaged to reduce noise from the spontaneous EEG and extract the stimulus-specific reaction, called an event-related potential (ERP) [Luck 2005]. ERPs represent overlapping activity of different brain regions during the processing of a stimulus, known as components. These components are named after their polarity and the approximate time they peak after the stimulus in milliseconds; for example, P300 denotes a positive-going deflection peaking at 300 ms post stimulus. Additionally, each component appears in a specific scalp region. Extensive studies have related components to different processing steps and stimulus properties, so that ERPs also show in which way stimuli are perceived differently. Many methods have been applied to detect an ERP in a single trial [Mouraux and Iannetti 2008], but most approaches require an application- and subject-specific training [Kapoor et al. 2008], [Sajda et al. 2010].
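As an illustration, this across-trial averaging can be written in a few lines of NumPy; the array layout (trials x channels x time samples) and variable names below are assumptions for illustration, not the authors' data format.

import numpy as np

def event_related_potential(epochs):
    """Average single-trial epochs into an ERP.
    epochs: array of shape (n_trials, n_channels, n_times), each trial
    timelocked to the same stimulus event."""
    # Averaging over trials cancels the random spontaneous EEG; with N trials
    # its contribution shrinks roughly by 1/sqrt(N), leaving the ERP.
    return epochs.mean(axis=0)          # -> (n_channels, n_times)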

Since using an EEG requires an elaborate technical and experimental setup, our long-term goal is to find a minimal setup for the assessment of image and video quality. To reduce the number of electrodes that have to be applied, we want to investigate which brain areas react strongest to different types of visual artifacts. We also want to explore different methods to reduce the number of trials needed to evaluate a brain response. Finally, by showing that ERP responses are comparable for different people, we aim to reduce the number of subjects needed for a study. In the process of acquiring a minimal setup, we will also investigate which experimental procedure, i.e. stimulus presentation and task, is best suited to assess image quality with as little influence as possible on the participants' reaction to the stimulus.

As presented in [Pazo-Alvarez et al. 2003], the HVS reacts measurably to semantic visual anomalies, encouraging an analysis of image and video artifacts using EEG. Taking the example of JPEG-compressed images, [Lindemann and Magnor 2011] demonstrates that the influence of image quality on image perception can be evaluated using ERPs. In this paper we present a first evaluation of artifacts in videos using EEG. We use simple video stimuli to verify that videos can be assessed using the ERP technique. By comparing ERPs for videos without and with two different types of artifacts, we show that the emergence of an artifact in a video produces a measurable ERP. Its shape depends on the conspicuousness of the artifact. In the second part of our work we use the data obtained in the experiment to train a support vector machine (SVM) with a polynomial kernel. Cross-validation shows that detecting strong artifacts in a single-trial EEG has a reliability of about 80%.

2 Related Work

While effortlessly performed by the human mind, image and video quality assessment is a difficult task for computers. Many different evaluation methods have been employed, depending on whether a reference was provided or not. Basic measures like mean squared error (MSE) or peak signal-to-noise ratio (PSNR) compare an image to a reference using statistical or correlation-based methods [Eskicioglu and Fisher 1995]. However, these classical algorithms often produce quality ratings that do not match the quality perceived by a human observer. Thus perceptually motivated methods, adopting functionality of the HVS, have been gaining popularity [Engelke and Zepernick 2007], [Seshadrinathan et al. 2010].


Figure 1: The three test scenes ramp, cafe, and girl with less obvious artifacts (A2) in all 4 quadrants united in one image.

One of the earliest algorithms implementing properties of the HVS to assess image quality is [Daly 1993].

In many applications human image processing and pattern matching capabilities are superior to computer vision algorithms, and so recent research tries to provide these capabilities to computers using brain computer interfaces (BCI). Often BCIs are employed for searching (rare) targets in a large collection of images using Rapid Serial Visual Presentation (RSVP). In [Sajda et al. 2010] Hierarchical Discriminant Component Analysis (HDCA) is used in two systems for prioritizing images from a large image collection, sorting images depending on how well they match a known or unknown target. The more specific task of sorting images into one of three categories (faces, animals, and inanimate objects) is tackled in [Kapoor et al. 2008]. Assuming that the capabilities of computer vision and human vision are complementary, they are combined by kernel alignment. For a set of training images a Pyramid Match Kernel (PMK) and kernels for ERPs of multiple subjects are computed and combined in a support vector machine. After that, new images can be assigned a category with higher accuracy than with ERPs or PMK alone. In [Koelstra et al. 2009] the mismatch negativity (N400) is used for implicit video tagging. A user is shown a short video clip followed by a "matching" or "not matching" tag. Results show that semantically mismatching tags elicit a larger N400 component than matching ones.

While image and video quality assessment and BCI-based image classification have received much attention, not much has been done to combine these two fields of research. In [Hayashi et al. 2000] EEG data were used to evaluate the pleasingness of high resolution images, based on the assumption that images with higher quality produce a higher amount of alpha-waves.

3 Experiment

3.1 Stimuli and Trial Timing

Videos are very complex targets for an evaluation by ERPs because every change in a visual stimulus may cause a new brain response, stopping the processing of the previous stimulus. Also, surprising scene content may induce a reaction similar to the reaction to an (unexpected) artifact. Another issue is eye movements, which produce artifacts (voltage peaks) in the EEG at frontal electrode sites. To analyze the ERP signal evoked by a continuously changing complex stimulus (video), we decided to start by zooming into the center of a static scene, the zoom simulating a camera movement, the still image simulating a static scene, and focusing on the center to avoid reflexive eye movement. In future experiments an experimental setup and data processing methods allowing for more natural viewing conditions will be explored. Three images with different complexity were used (Figure 1). The ramp scene primarily contains uniformly colored surfaces, while the cafe scene contains much more detail. Finally, the girl scene shows a person, which could influence overall perception. For each scene a sequence of images consisting of the original image cropped by an increasing margin was produced. The resulting images were scaled down to a resolution of 512 × 512, producing the final video frames. After that, multiple versions of the three videos with different artifacts were produced.

Each video stimulus showed a 3-second zoom into one of the three scenes during which either no artifact (G), an obvious artifact (A1), or a less obvious artifact (A2) appeared for 125 ms. An obvious artifact was a magenta 40 × 40 pixel block; for a less obvious artifact, the scene was blurred in a 40 × 40 pixel region. If necessary, the lightness of weak artifacts was adjusted to make them easier to detect. As a result, all A2 artifacts were about equally salient. Each trial started playing back a video from a random offset of ±5 frames, equaling ±125 ms, to avoid adaptation to the time when the artifact appeared. All artifacts were positioned at the center of one of the image's quadrants (a = upper left, b = upper right, c = lower left, d = lower right). These four positions were chosen to avoid bias towards artifacts occurring in the upper, lower, left, or right image half. Also, all artifacts had the same distance to the image center so they occurred in the same range of peripheral vision.
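The following Python sketch illustrates how such zoom stimuli and block/blur artifacts could be generated with Pillow. It is not the authors' tool: the file name, the per-frame crop step, and the blur radius are assumptions, and a frame rate of 40 fps is inferred from 5 frames equaling 125 ms.

from PIL import Image, ImageFilter

FPS, DURATION_S, SIZE, BLOCK = 40, 3, 512, 40          # 5 frames = 125 ms implies 40 fps
QUADRANT_CENTERS = {"a": (128, 128), "b": (384, 128),  # artifact positions a-d in pixels
                    "c": (128, 384), "d": (384, 384)}

def zoom_frames(scene_path, crop_per_frame=4):
    """Simulate a camera zoom by cropping a growing margin and rescaling to 512 x 512."""
    scene = Image.open(scene_path).convert("RGB")
    w, h = scene.size
    for i in range(FPS * DURATION_S):
        m = i * crop_per_frame                          # margin grows each frame
        yield scene.crop((m, m, w - m, h - m)).resize((SIZE, SIZE))

def add_artifact(frame, kind, quadrant):
    """Paste an obvious (A1, magenta block) or subtle (A2, local blur) artifact."""
    cx, cy = QUADRANT_CENTERS[quadrant]
    box = (cx - BLOCK // 2, cy - BLOCK // 2, cx + BLOCK // 2, cy + BLOCK // 2)
    if kind == "A1":
        patch = Image.new("RGB", (BLOCK, BLOCK), (255, 0, 255))
    else:                                               # A2: blur the region in place
        patch = frame.crop(box).filter(ImageFilter.GaussianBlur(radius=3))
    frame.paste(patch, box)
    return frame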

A trial consisted of a 500 ms fixation interval (gray screen), followed by a 3-second video. After that, participants responded with a button click whether they had seen an artifact (left mouse button) or not (right mouse button), triggering the start of the next trial. During the whole experiment a white dot was shown in the center of the screen to keep the participants' gaze fixed. Videos were displayed so that they spanned about 5.4 × 5.4 degrees of visual angle.

3.2 Procedure

Eight right-handed healthy subjects (5 female, 3 male) with an average age of 25 and normal or corrected-to-normal vision participated after signing an informed consent. After preparation, participants were given oral instructions to fixate the white dot displayed on the screen and to minimize any movement while it was being displayed. They were asked to report with a button click whether they had seen an artifact or not after a video had finished. It was pointed out that answering carefully and correctly was more important than answering fast.


Figure 2: The start of a video (0 ms) elicits an ERP whose shape varies slightly with scene content. After the initial response, stimulus-specific brain activity recedes.

Afterwards they completed a training session consisting of 27 trials (using other images than the three test scenes) before starting with the main experiment. Each artifact class (G, A1, A2) was repeated 40 times for each scene, resulting in 360 trials per participant. All artifact-position configurations (artifact sub-classes) were repeated equally often, i.e. each combination was repeated 10 times. All stimuli were presented in pseudo-random order, and the same stimulus was never shown twice in a row.

During the experiment an EEG was recorded with the BioSemi ActiveTwo system from 32 scalp sites according to the international 10-20 system, with a 256 Hz sampling rate and 24-bit digitization. Additionally, the horizontal and vertical electrooculogram (EOG) were recorded, as well as EEG at both mastoids that was later used as reference. After recording, the acquired data were filtered with a two-way least-squares FIR high-pass filter with a cutoff frequency of 0.1 Hz to remove low-frequency voltage drifts. After segmentation, trials with artifacts (blinks, saccades, ...) were rejected manually. The average amplitude of the 200 ms interval preceding the timelocked stimulus was used for baseline correction. For presentation the data were low-pass filtered with a cutoff frequency of 30 Hz.
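A preprocessing pipeline along these lines could be sketched with MNE-Python; the paper does not state its tooling, and the file name, mastoid and trigger channel names, event codes, and the amplitude threshold below are assumptions used as a stand-in for the manual rejection.

import mne

raw = mne.io.read_raw_bdf("subject01.bdf", preload=True)     # BioSemi ActiveTwo recording (hypothetical file)
raw.set_eeg_reference(["M1", "M2"])                           # re-reference to the mastoids (assumed channel names)
raw.filter(l_freq=0.1, h_freq=None, method="fir",
           phase="zero-double")                               # two-pass FIR high-pass, 0.1 Hz cutoff

events = mne.find_events(raw, stim_channel="Status")          # artifact-onset triggers (assumed trigger channel)
epochs = mne.Epochs(raw, events, event_id={"G": 1, "A1": 2, "A2": 3},
                    tmin=-0.2, tmax=1.35,                     # 200 ms pre-stimulus baseline window
                    baseline=(-0.2, 0.0), preload=True)
epochs.drop_bad(reject=dict(eeg=100e-6))                      # amplitude-based rejection (stand-in for manual rejection)
epochs.filter(l_freq=None, h_freq=30.0)                       # 30 Hz low-pass for presentation
erp_A1 = epochs["A1"].average()                               # per-condition ERP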

4 Experimental Results

Subjects detected obvious artifacts (A1) 99.9% of the time, while less obvious artifacts (A2) were detected 93.7% of the time. On 3.7% of ground truth trials (G) participants claimed to have seen an artifact, i.e. 96.3% of these trials were correctly classified as not having shown an artifact. 89% of undetected A2-trials were caused by three participants, while all other participants detected A2 artifacts with only single errors. Table 1 shows the percentage of trials in each A2 artifact sub-class that were wrongly reported not to have contained an artifact. Whether scene content or artifact position is more important for misclassification is yet to be examined. The first column shows the false positives for ground truth trials. They were more evenly distributed over all participants and were most likely caused by the fact that the subjects were actively looking for artifacts.

Figure 2 shows grand mean ERPs at occipital electrode sites (O1, Oz, O2) timelocked to video start (marked as 0 ms). In this case, all trials showing the same scene were averaged independently of artifact class and participant response. As can be seen, the start of a video elicits different ERPs depending on scene content.

Figure 3: The appearance of an artifact (0 ms) causes an ERP. The ERP for weak artifacts (A2) rises later and has a smaller peak amplitude, which is most likely caused by a higher variance in the latency of reactions to subtle artifacts.

Scene   G     A2a   A2b    A2c     A2d
ramp    2.8   7.5   3.75   1.25    1.25
cafe    5.6   7.5   6.25   2.5     1.25
girl    2.5   7.5   10     26.25   1.25

Table 1: The percentage of false positives for ground truth trials and false negatives for different artifact positions for each image.

After this initial reaction, brain potentials show no activity until the artifact sets in. Since the time of appearance of the artifact is varied by about 250 ms, no clear ERP can be seen in the averaged data, but the positive deflection implies that some reaction occurs.

The ERPs timelocked to the appearance of the artifact (marked as 0 ms), averaged over all participants and all test scenes, can be seen in Figure 3. It shows that ground truth videos do not elicit a response, unlike videos with artifacts. While obvious artifacts (A1) elicit a clearly shaped ERP, the ERP for less obvious artifacts (A2) is longer and shows less significant deflections. This is most likely caused by a larger variance in the latency of brain responses to less visible artifacts, as it takes more time to detect and process them. The topographies in Figure 4 visualize the difference in the ERPs for different time steps. They confirm that the reaction to less visible artifacts occurs later and stretches longer. Still, less visible artifacts introduce a considerable ERP, too. Figure 5 shows that the reaction to A1-type artifacts is similar for all images. Figure 6 shows the differences of the ERPs for the artifact sub-classes. Although there are minor differences in the ERPs, all artifacts reliably elicit an ERP. Not enough data were collected for a statistically significant analysis of the differences in ERPs between subjects, but the data do show that for all subjects an ERP is present for A1- and A2-videos.

5 ERP Classification

To reduce the data for classification via SVM, principal component analysis (PCA) was performed first on electrode channels and then in the temporal domain. Experiments with reduced data sets showed that 12 PCA components for channels and 16 PCA components for time allowed a classification with a reliability equal to a classification where all data were used, but with significantly reduced computation time. The input was EEG data from artifact onset at 0 ms to 1350 ms post stimulus.
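The paper does not give implementation details of this two-stage reduction; one plausible reading, sketched below with scikit-learn, applies PCA first across channels and then across time samples, yielding a 12 × 16 feature matrix per trial. The array shapes, variable names, and exact staging are assumptions for illustration.

from sklearn.decomposition import PCA

def reduce_epochs(epochs_data, n_chan_comp=12, n_time_comp=16):
    """Two-stage PCA reduction of single trials.
    epochs_data: array of shape (n_trials, n_channels, n_times)."""
    n_trials, n_channels, n_times = epochs_data.shape

    # Stage 1: PCA over channels, treating every trial/time point as a sample.
    x = epochs_data.transpose(0, 2, 1).reshape(-1, n_channels)
    x = PCA(n_components=n_chan_comp).fit_transform(x)
    x = x.reshape(n_trials, n_times, n_chan_comp)

    # Stage 2: PCA over time, treating every trial/channel component as a sample.
    x = x.transpose(0, 2, 1).reshape(-1, n_times)
    x = PCA(n_components=n_time_comp).fit_transform(x)
    x = x.reshape(n_trials, n_chan_comp, n_time_comp)

    return x.reshape(n_trials, -1)      # one feature vector per trial (12 x 16 = 192 values)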


Figure 4: Topographies for ERPs timelocked to artifact appearance at different latencies.

Figure 5: There are no considerable differences in ERPs elicited by A1-type artifacts in different scenes.

For single-trial classification, the SVM by Chang and Lin [2001] with a polynomial kernel was used. Cross-validation with a leave-20-out test was performed for different configurations. The first configuration tested G-trials versus A1-trials. Only trials with correct participant responses were used. Results showed that 80.5% of the G-trials and 76.5% of the A1-trials were classified correctly. G- and A1-trials occurred about equally often in the data used; the classifications "G" and "A1" were correct in 78.0% and 79.1% of the cases, respectively. For an equivalent test with G- and A2-trials, 76.1% and 69.7% of the trials were classified correctly, resulting in a reliability of 72.5% and 73.5% for G- and A2-trials. For three groups with G-, A1-, and A2-trials, respectively, 71.3% of G-trials were identified correctly, as well as 53.8% of A1- and 49.1% of A2-trials. The corresponding reliabilities for SVM categorization were 62.7%, 58.9%, and 52.1%. This shows that the performance of the SVM for binary classification is significantly better than for a larger number of groups.
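For reference, an equivalent setup can be sketched with scikit-learn's libsvm-based SVC; the kernel degree and the repeated 20-trial test folds below are assumptions approximating the cited LIBSVM configuration, not the authors' exact procedure.

from sklearn.svm import SVC
from sklearn.model_selection import ShuffleSplit, cross_val_score

def classification_reliability(features, labels):
    """Estimate single-trial classification reliability with repeated 20-trial test folds."""
    clf = SVC(kernel="poly", degree=3)                  # polynomial-kernel SVM (libsvm-based)
    cv = ShuffleSplit(n_splits=100, test_size=20, random_state=0)
    return cross_val_score(clf, features, labels, cv=cv).mean()

# Usage (hypothetical): features from the PCA reduction above,
# labels e.g. ["G", "A1", ...] with one entry per trial.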

Table 2 shows that A1-trials and G-trials were most often misclassified as A2, while A2-trials were confused with both A1 and G. The relative frequencies of misclassifications indicate an underlying continuous relationship between the strength of image artifacts and the EEG response. Another test was conducted in which an SVM was trained to distinguish the three test scenes based on all A1-trials. 39.5%, 36.9%, and 28.2% of the trials, respectively, were identified correctly for the three scenes.

              Classified as
Trial type    A1       A2       G
A1            53.9%    26.5%    19.6%
A2            21.9%    50.8%    27.3%
G             11.0%    13.8%    75.1%

Table 2: Results of SVM classification between images with strong artifacts (A1), weak artifacts (A2), and no artifacts (G) from central, parietal, and occipital electrodes only. The relative frequencies of misclassifications indicate an underlying continuous relationship between the strength of image artifacts and the EEG response.

Since these classification rates are comparable to random classification, the scene content seems to influence the EEG response a lot less than the type of artifact that is presented.

As the strongest reaction to artifacts appears at central, parietal, and occipital sites, the same classifications as above were run without frontal and temporal electrodes. Fp1, Fp2, AF3, AF4, F7, F8, all FC electrodes, as well as T7 and T8 were removed prior to PCA reduction. In the two-group G-A1 configuration, the reliability of a G- or A1-classification increased slightly to 79.3% and 82.6%, and in the G-A2 configuration to 72.8% and 75.2% for G- and A2-trials. This is likely due to the fact that the frontal electrodes do not react directly to visual stimuli, but generally show a large amount of activity unrelated to the artifacts, which causes overfitting of the SVM classifier. Numbers for the simultaneous comparison of G, A1 and A2, as well as for the comparison of different scenes, however, did not change considerably.
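Continuing the MNE-based preprocessing sketch above, dropping the frontal and temporal channels before the PCA reduction could look as follows; the channel labels assume a 32-channel 10-20 montage, and the specific FC labels are assumptions.

# Remove frontal and temporal sites before feature extraction (assumed labels).
FRONTAL_TEMPORAL = ["Fp1", "Fp2", "AF3", "AF4", "F7", "F8",
                    "FC1", "FC2", "FC5", "FC6", "T7", "T8"]
epochs_posterior = epochs.copy().drop_channels(FRONTAL_TEMPORAL)   # keep central/parietal/occipital sites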

6 Conclusion and Future Work

We have shown that the ERP technique can be applied to assess artifacts in simple videos, as artifacts elicit a measurable ERP, independent of artifact position and scene content. In this experimental setup the amplitude of the P300, known to be distinct if a (rare) target stimulus is detected, is increased because videos with artifacts were a targeted stimulus. Therefore, another experiment has to evaluate how ERPs behave if the participants do not explicitly look for artifacts but are involved in a distractor task. In our experiments, artifacts reliably elicited an ERP in every participant. Further investigation has to show whether ERPs for different kinds of artifacts behave comparably for all participants. The results of our experiments with single-trial SVM classification show that a trial can be identified as containing a reaction or not with 70% to 80% reliability; the classification performance drops if multiple classes of artifacts are to be distinguished.


Figure 6: Differences between ERPs for artifacts appearing in different image quadrants are negligibly small.

The kind of misclassifications occurring in the latter case suggests that a continuous scale of perceptual impact might be realized, for example using support vector regression (SVR). While single-trial EEG classification is generally considered a very difficult task, further investigation of different machine learning algorithms may increase the reliability to a level where the classification error is of the same order of magnitude as other experimental error sources, making EEG classification a fast and objective alternative to traditional user studies.
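As a purely illustrative sketch of that SVR idea (the ordinal coding of artifact strength and the random stand-in features are assumptions, not experimental results), a regressor could be trained on the same PCA-reduced trial features:

import numpy as np
from sklearn.svm import SVR

# Hypothetical ordinal coding of artifact strength: G = 0, A2 = 1, A1 = 2.
features = np.random.randn(120, 12 * 16)        # stand-in for PCA-reduced single trials
strength = np.repeat([0.0, 1.0, 2.0], 40)

reg = SVR(kernel="poly", degree=3)
reg.fit(features, strength)
perceived_impact = reg.predict(features)        # continuous estimate of perceptual impact per trial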

Acknowledgements

This work has been funded by the European Research Council ERC under contract No. 256941 "Reality CG".

References

CHANG, C.-C., AND LIN, C.-J. 2001. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.

DALY, S. 1993. The visible differences predictor: an algorithm for the assessment of image fidelity. In Digital images and human vision, A. B. Watson, Ed. MIT Press, 179–206.

ENGELKE, U., AND ZEPERNICK, H.-J. 2007. Perceptual-based quality metrics for image and video services: A survey. In 3rd EuroNGI Conference on Next Generation Internet Networks, 190–197.

ESKICIOGLU, A. M., AND FISHER, P. S. 1995. Image quality measures and their performance. IEEE Transactions on Communications 43, 12, 2959–2965.

HAYASHI, H., SHIRAI, H., KAMEDA, M., KUNIFUJI, S., AND MIYAHARA, M. 2000. Assessment of extra high quality images using both EEG and assessment words on high order sensations. In IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, 1289–1294.

KAPOOR, A., SHENOY, P., AND TAN, D. 2008. Combining brain computer interfaces with vision for object categorization. In IEEE Conference on Computer Vision and Pattern Recognition, 1–8.

KOELSTRA, S., MUHL, C., AND PATRAS, I. 2009. EEG analysis for implicit tagging of video data. In Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009, 27–32.

KOSARA, R., HEALEY, C. G., INTERRANTE, V., LAIDLAW, D. H., AND WARE, C. 2003. Thoughts on user studies: Why, how, and when. IEEE Computer Graphics and Applications 23, 4, 20–25.

LINDEMANN, L., AND MAGNOR, M. 2011. Assessing the quality of compressed images using EEG. In Proceedings of the IEEE International Conference on Image Processing. To appear.

LUCK, S. J. 2005. An introduction to the event-related potential technique. MIT Press, Cambridge, MA.

MCNAMARA, A., MANIA, K., BANKS, M., AND HEALEY, C. 2010. Perceptually-motivated graphics, visualization and 3D displays. In ACM SIGGRAPH 2010 Courses, 7:1–159.

MOURAUX, A., AND IANNETTI, G. D. 2008. Across-trial averaging of event-related EEG responses and beyond. Magnetic Resonance Imaging 26, 7, 1041–1054.

PAZO-ALVAREZ, P., CADAVEIRA, F., AND AMENEDO, E. 2003. MMN in the visual modality: a review. Biological Psychology 63, 3, 199–236.

PONOMARENKO, N., LUKIN, V., ZELENSKY, A., EGIAZARIAN, K., ASTOLA, J., CARLI, M., AND BATTISTI, F. 2009. TID2008 – A database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10, 10, 30–45.

SAJDA, P., POHLMEYER, E., WANG, J., PARRA, L. C., CHRISTOFOROU, C., DMOCHOWSKI, J., HANNA, B., BAHLMANN, C., SINGH, M. K., AND CHANG, S.-F. 2010. In a blink of an eye and a switch of a transistor: Cortically coupled computer vision. Proceedings of the IEEE 98, 3, 462–478.

SESHADRINATHAN, K., SOUNDARARAJAN, R., BOVIK, A., AND CORMACK, L. 2010. Study of subjective and objective quality assessment of video. IEEE Transactions on Image Processing 19, 6, 1427–1441.

WANG, Z., BOVIK, A. C., SHEIKH, H. R., AND SIMONCELLI, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4, 600–612.
