Improving Perception from
Electronic Visual Prostheses
Justin Robert Boyle BEng (Mech) Hons1
Image and Video Research Laboratory
School of Engineering Systems
Queensland University of Technology
Submitted as a requirement for the degree of
Doctor of Philosophy
February 2005
Keywords: image processing, visual prostheses, bionic eye, artificial human vision, visual perception, subjective testing, visual information
Abstract

This thesis explores methods for enhancing digital image-like sensations similar to
those which might be experienced by blind users of electronic visual prostheses.
Visual prostheses, otherwise referred to as artificial vision systems or bionic eyes,
may operate at ultra low image quality and information levels as opposed to more
common electronic displays such as televisions, for which our expectations of image
quality are much higher. The scope of the research is limited to enhancement by
digital image processing: that is, by manipulating the content of images presented to
the user. The work was undertaken to improve the effectiveness of visual prostheses
in representing the visible world.
Presently visual prosthesis development is limited to animal models in Australia and
prototype human trials overseas. Consequently this thesis deals with simulated
vision experiments using normally sighted viewers. The experiments involve an
original application of existing image processing techniques to the field of low
quality vision anticipated from visual prostheses.
Resulting from this work are, firstly, recommendations for effective image processing
methods for enhancing viewer perception when using visual prosthesis prototypes.
Although the images are limited to low quality, recognition of some objects can still
be achieved, and it is useful for a viewer to be presented with several variations of
the image representing different processing methods. Scene understanding can be
improved by incorporating Region-of-Interest techniques that identify salient areas
within images and allow a user to zoom into that area of the image. Also there is
some benefit in tailoring the image processing depending on the type of scene.
Secondly, the research involved the construction of a metric for the basic information
required for the interpretation of a visual scene at low image quality. The amount of
information content within an image was quantified using inherent attributes of the
image and shown to be positively correlated with the ability of the image to be
recognised at low quality.
Table of Contents

Abstract
List of Figures
List of Tables
Statement of Original Authorship
Acknowledgements
Publications

Chapter 1 Introduction
  1.1 Overview
  1.2 Aim
  1.3 Scope
  1.4 Thesis Structure
  1.5 Contributions

Chapter 2 Image Quality and Visual Perception
  2.1 Introduction
  2.2 Visual Perception Physiology
  2.3 A Visual Hierarchy Model
    2.3.1 Early Vision Effects
    2.3.2 Cognitive Effects
  2.4 Region-of-Interest
  2.5 Visual Information
  2.6 Chapter Summary

Chapter 3 Visual Prosthesis Application
  3.1 Overview
  3.2 General Introduction to the Application
  3.3 Current Visual Prosthesis Research
    3.3.1 Retinal Systems
    3.3.2 Optic Nerve Systems
    3.3.3 Visual Cortex Systems
  3.4 Image Processing specifically related to Bionic Eye Projects
    3.4.1 Vision Chip Developments
    3.4.2 CCD-based Systems
    3.4.3 Receptive Field Modeling
    3.4.4 Multiple Resolution Work
  3.5 Digital Imaging Applicable to Visual Prostheses
    3.5.1 Digital Imaging and Human Vision
    3.5.2 Image Characteristics and Visual Understanding
  3.6 Thesis Research Questions and Approach
    3.6.1 Image Processing Requirements
    3.6.2 Testing Method
  3.7 Chapter Summary

Chapter 4 Recognition Performance
  4.1 Overview
  4.2 Subjective Tests to Determine Useful Processing Methods
    4.2.1 Methodology
    4.2.2 Images Chosen
    4.2.3 Results
    4.2.4 Test Conclusions
  4.3 Subjective Tests to Determine Influence of Image Type
    4.3.1 Methodology
    4.3.2 Images Chosen
    4.3.3 Results
    4.3.4 Test Conclusions
  4.4 Chapter Conclusions

Chapter 5 Quantifying Information Content
  5.1 Introduction
  5.2 Perceived Information Content in Images
    5.2.1 Images Used
    5.2.2 Multidimensional Visual Information Model
    5.2.3 Test Method
    5.2.4 Test Participants and Instructions
    5.2.5 Test Results
    5.2.6 Strong Visual Information Rankings
    5.2.7 Test Conclusions
  5.3 Information Content Model Fitting
    5.3.1 Possible Image Attributes for a Visual Information Metric
    5.3.2 Metric Development for a Specific Image Quality Class
    5.3.3 Information Content Metric for all Image Quality Classes
  5.4 Correlations Between Recognition Rate And Perceived Information Content
  5.5 Chapter Summary

Chapter 6 Scene Specific Imaging
  6.1 Overview
  6.2 Characteristics of Simple Scenes
    6.2.1 Office
    6.2.2 Home
    6.2.3 Street
    6.2.4 Outdoors
    6.2.5 Head and Shoulders
    6.2.6 Café/Restaurant
    6.2.7 Public Toilets
  6.3 Image Processing targeted to Scene Type
  6.4 Subjective Tests for Scene Weighted Processing
  6.5 Chapter Summary

Chapter 7 A comparison of ROI methods for low quality images
  7.1 Overview
  7.2 ROI Processing applied to Entire Image
    7.2.1 Image Preparation
    7.2.2 Processing Methods Compared
    7.2.3 Images Used
    7.2.4 Experiment
    7.2.5 Results
  7.3 Digital Zoom
    7.3.1 Automatic Zoom Methods
    7.3.2 Results of Automatic Zoom Experiment
  7.4 Chapter Summary

Chapter 8 Discussion, Conclusion and Future Work
  8.1 Discussion and Conclusion
  8.2 Future Work
    8.2.1 Motion
    8.2.2 Colour
    8.2.3 Device interfacing
    8.2.4 Supplementary/Symbolic Information
    8.2.5 Range Indication
    8.2.6 Simulating Techniques
    8.2.7 Other Testing Techniques
  8.3 Final Word

References

Appendix A Section 4.2 Experiment
  A.1 Example Test Stimulus
  A.2 Booklet Design
  A.3 Borderline Recognition Assessment for Section 4.2 Experiment
Appendix B Section 4.3 Experiment
  B.1 Example Test Stimulus
  B.2 Borderline Recognition Assessment for Section 4.3 Experiment
Appendix C Chapter 5 Experiments
  C.1 Example Test Stimulus – 7 Images Presented all at same time
  C.2 Example Test Stimulus – 3 Images Presented all at same time
  C.3 Example Test Stimulus – Paired Comparison Experiments
  C.4 Booklet Design
Appendix D Chapter 6 Experiment
  D.1 Example Test Stimulus
  D.2 Booklet Design
Appendix E Chapter 7 Experiments
  E.1 Training Image Database
  E.2 Example Test Stimuli for Section 7.2 Experiment
  E.3 Example Test Stimuli for Section 7.3 Experiment
  E.4 Booklet Order for Chapter 7 Experiments
List of Figures

Figure 1.1: Mean Square Error figures between reference image and low quality versions
Figure 3.1: Basis of Visual Prostheses
Figure 3.2: Pixelised vision; top – greyscale, bottom – binary images
Figure 3.3: Circular pixelised vision
Figure 3.4: Alternate stimulation strategies
Figure 3.5: Simulating the effect of modulating phosphene brightness
Figure 3.6: Importance Mapping concept
Figure 3.7: Safety Post enhancement with advanced image processing techniques
Figure 3.8: Enhancing the information content of a low quality image of stairs
Figure 4.1: Image set used in the psychophysical testing
Figure 4.2: Image Processing techniques used in the psychophysical testing
Figure 4.3: Recognition rate for objects in the image set
Figure 4.4: Effect of spatial resolution and grey-scale on object recognition
Figure 4.5: Images with 3 grey levels (white, grey, black) were not significantly more recognisable than black & white images
Figure 4.6: Comparing resolution and grey scale
Figure 4.7: Significantly higher recognition is achieved with increased spatial resolution (right) over increased greyscale resolution (left)
Figure 4.8: Object recognition rate for various processing methods (n=110)
Figure 4.9: Edge images were not well recognised
Figure 4.10: Subjective preferences between image and its inverse – some subjects preferred white on black, others black on white
Figure 4.11: Test Objective – Obtain a Recognition-Quality curve
Figure 4.12: The nine image quality classes used in the tests
Figure 4.13: Test image set
Figure 4.14: Recognition-Quality Envelope of recognition for all images in test set
Figure 4.15: Variation in recognition among image types
Figure 4.17: Recognition rates for each object type
Figure 5.1: Two images with different amounts of visual information content
Figure 5.2: The nine image quality classes used in the tests
Figure 5.3: Multidimensional Visual Information Model
Figure 5.4: Images containing high information content for high quality images
Figure 5.5: Images containing high information content for low quality images
Figure 5.6: Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
Figure 5.7: Calculating Fractal Dimension for Binary Images
Figure 5.8: Calculating Fractal Dimension for Greyscale Images
Figure 5.9: Determining image similarity and symmetry – pixel matching
Figure 5.10: Determining image similarity and symmetry – pixel difference and average value
Figure 5.11: Correlation between 15 image attributes and perceived information content
Figure 5.12: Metric performance for Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
Figure 5.13: Example relationship between recognition and information content (25x25 Binary Paired Comparison data)
Figure 6.1: Visual stimuli used to gauge perception of low quality images
Figure 7.1: Image preparation
Figure 7.2: Shaded areas do not necessarily correlate with scene objects in images that have had grey levels equalised (right-most image)
Figure 7.3: Processing methods used in tests (refer text for details)
Figure 7.4: Example Distribution – Size Map distribution for Beach training images
Figure 7.5: Images used in comparison tests
Figure 7.6: When presenting the entire image, results indicate a clear preference for no Importance Processing (n=96)
Figure 7.7: Digital zoom concept – the most salient area is identified in an image and resized to the maximum display resolution
Figure 7.8: Trim method to select zoom window
Figure 7.9: Scope Box method to select zoom window
Figure 7.10: Saliency Map developed by University of Southern California
Figure 7.11: Zoom window selected from central 25% of image
Figure 7.12: Zoom window selected from central-bottom 25% of image
Figure 7.13: Image Preparation for Digital Zoom Tests
Figure 7.14: Example stimulus showing detail of zoom window border
Figure 7.15: Preferences for methods to automatically zoom into an image (n=96)
Figure 8.1: Halftone representation
List of Tables

Table 3.1: Thesis experiments
Table 4.1: Analysis of Variance for various processing methods
Table 4.2: Correct image identification (n=25)
Table 5.1: Perceived information content for comparing 7 different object types
Table 5.2: Pattern analysis for information content rankings
Table 5.3: Dominant visual information viewer preferences
Table 5.4: Correlation coefficients for variables considered for metric for 256x256 greyscale images
Table 5.5: Candidate models for metric for 256x256 greyscale images
Table 5.6: Model Predictions of 256x256 Greyscale Image Set using model: f(Edges, Entropy)
Table 5.7: Candidate models for a metric for 10x10 binary images
Table 5.8: Summary of metric performance
Table 5.9: Predictive performance of metric proposed for all image qualities
Table 5.10: The number of correct metric predictions of images with the highest information content
Table 5.11: Correlation coefficients between recognition rate and perceived information content
Table 6.1: Image Processing descriptors of different scene types
Table 6.2: Attentional feature weights for each scene type
Table 6.3: Preferred ranking for image representation
Statement of Original Authorship
The work contained in this thesis has not been previously submitted for a degree or
diploma at any other higher education institution. To the best of my knowledge and
belief, the thesis contains no material previously published or written by another
person except where due reference is made.
Signed:
Date:
Acknowledgements

I would like to thank Anthony Maeder, first for his enthusiasm and willingness to
conduct the research at QUT and also for his supervision and guidance, textbooks,
time, motivational prods, positive feedback, attention to detail and financial support.
I could not wish for a better supervisor. The comments and advice from my
associate supervisor Wageeh Boles are also appreciated.
Massive thanks to QUT for a Postgraduate Research Award and the Faculty of Built
Environment and Engineering for a top-up scholarship. Financial assistance with
conference attendance from the SAIVT program director Sridha Sridharan and the
QUT Grants-in-Aid office is also appreciated. Thanks to the faculty administration
staff especially Scott Allberry, for assistance with preparing questionnaires.
I am grateful to Wilfried Osberger and Laurent Itti and Dirk Walther of iLab for
permission to implement variations of their importance map and saliency codes in
my research. A big thank you to all the volunteers who participated in the subjective
testing, including students at Brisbane State High School and their coordinating
teachers, as well as family members and colleagues in the Image and Speech lab who
were roped in. The efforts of Jason Pelecanos towards soccer matches, BBQs and
other inclusive activities to invigorate the research lab are appreciated.
I thank Melissa, Jeremy and Ruth: it has been a big chunk of our lives with some
major ups and downs. I really appreciate the child care & washing (!), mental
support and lifts into uni dropping Dad off over those speed-bumps. Thanks to my
parents for their silent and not-so-silent encouragement with research and
publications.
This work is dedicated to Terry – may you some day see your wife, children and
rainforest paradise.
Publications

The research has resulted in the following fully refereed publications (or abstract
refereed only where indicated by an asterisk).
Boyle J, Maeder A, Boles W, Digital Imaging Challenges for Artificial Human
Vision, South African Computer Journal (26), pp.222-227, 2000
Boyle J, Maeder A, Boles W, Image Processing and Artificial Human Vision
Systems, WoSPA2000 - 3rd Australasian Workshop on Signal Processing
Applications, Brisbane, 2000
* Boyle J, Maeder A, Boles W, Challenges in Digital Imaging for Artificial Human
Vision, in Human Vision and Electronic Imaging IV, Rogowitz T, Pappas T,
Editors, Proceedings of SPIE Vol 4299, pp.533-543, 2001
Boyle J, Maeder A, Boles W, Static Image Simulation of Electronic Visual
Prostheses, ANZIIS 2001 – Proceedings of the 7th Australian and New Zealand
Intelligent Information Systems, Perth, pp.85-88, 2001
{1st prize student paper competition}
Boyle J, Maeder A, Boles W, Image Enhancement for Electronic Visual Prostheses,
Australian Physical & Engineering Sciences in Medicine Journal 25(2), pp.81-86,
2002
* Boyle J, Maeder A, Boles W, Visual Perception with Electronic Visual Prostheses,
Physical Sciences and Engineering in Medicine Queensland Branch Local
Symposium, Brisbane, June 2002
{1st prize student paper competition}
Boyle J, Maeder A, Boles W, Visual Perception of Low Quality Images, Proceedings
of the 9th International Conference on Neural Information Processing, Singapore,
2002
Boyle J, Maeder A, Boles W, Inherent Visual Information for Low Quality Image
Presentation, WDIC2003 - Proceedings of the 2003 APRS Workshop on Digital
Image Computing, Theme: Medical Applications of Image Analysis, Brisbane,
pp.51-56, 2003
Boyle J, Maeder A, Boles W, Can Environmental Knowledge Improve Perception
with Electronic Visual Prostheses?, WC2003 – Proceedings of the World Congress
on Medical Physics and Biomedical Engineering, Sydney, 2003
Boyle J, Maeder A, Boles W, Scene Specific Imaging for Bionic Vision Implants,
ISPA2003 – Proceedings of the 3rd International Symposium in Image and Signal
Processing and Analysis, Rome, pp.423-427, 2003
Chapter 1 Introduction
1.1 Overview
Blindness affects millions of people worldwide and over 100,000 Australians. This
research supports quality-of-life improvements for them by exploring appropriate
image processing techniques for electronic visual prosthesis systems: so-called
“bionic eyes”. These systems consist of a vision chip or camera that records the
visual world and transmits this information via electric pulses to implanted electrodes
in contact with the retina, optic nerve or visual cortex. These three sites provide
lower resolution opportunities for synthetic image presentation to the human visual
system.
Existing mobility aids for the visually impaired include canes, guide dogs and sonar
glasses. However, it is anticipated that the richness of sensory substitution would be
much greater with a visual prosthesis. Although unlikely to
recreate the full experience of vision, visual prostheses may provide enough visual
cues for blind people to perform every-day tasks such as navigation, recognition, and
reading.
The term “low quality” used in this thesis refers to images that can contain only
relatively little visual information content. Knowledge of human perception
of low quality (eg. low resolution) images, such as those expected from visual
prostheses, is very limited. While researchers have worked extensively in
characterising high quality image perception, most of this work is not relevant or
useful for low quality images. Yet it is in this low resolution regime that the most
immediate gains in artificial vision can be made. Ways to identify the sparse
information that is important for viewer understanding of scenes when presented in
low quality images are thus needed.
1.2 Aim

The overall aim of this research is to develop simple image processing techniques
that improve perception for users of electronic visual prostheses. This can be broken
down into several component elements of investigation:
• Determine useful image processing methods for artificial human vision systems
• Allow wider implementation of recently developed Region-of-Interest image
processing routines beyond previous applications; these routines identify
important and salient areas within an image
• Facilitate further understanding of the human visual system, specifically
perception performance from low quality visual information
• Provide a basis from which more complex and beneficial (eg. real time) image
processing units can be developed such that a prosthesis may provide maximum
benefits to the blind
1.3 Scope
The research described in this thesis is bounded by the following limitations and
assumptions:
1. Psychophysical sensations of what might be seen with a prosthesis are simulated
by presenting visual stimuli to normally sighted viewers. Prosthesis development
in Australia is currently limited to animal models and data from implanted
patients is not yet available. It is anticipated that some of the experiments
comparing image processing techniques described in this thesis could be repeated
using implanted patients when available.
2. Perception experiments undertaken in the project were based on static/still
images. Improved perception is anticipated if the techniques are applied to image
sequences/video as a user would be offered a richer representation of a scene, in
addition to moving about to see how scene elements (background/foreground)
interact.
3. The images presented to subjects in simulation experiments were ordered pixel
arrays in a square pattern (equal image height and width); a sketch of this
simulation follows the list. The reported evoked visual field of implanted patients
is not a regularly ordered array and varies from patient to patient. Due to this
wide variability it was decided to present a symmetric image representation to
gain some understanding of low quality image perception. It is anticipated that
implant users would undergo tuning and training through post-operative
exercises, similar to auditory implant programs, to use viable electrodes efficiently.
4. There exist other techniques related to implant electrode stimulation which may
produce different psychophysical sensations. One such technique could be using
different electrode current flow and return paths to create wide variations in
perceived visual sensations. This thesis does not consider such techniques and is
instead based on manipulating conventional pixel-based images digitally (digital
image processing) to improve visual perception.
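For concreteness, the sketch below shows one way a conventional digital image can
be reduced to a square, ordered array of phosphene-like pixels as described in point
3 above. It is an illustrative sketch only, written in Python and assuming the numpy
and Pillow libraries; the grid size and number of grey levels are example parameters,
not necessarily those used in the thesis experiments.

    import numpy as np
    from PIL import Image

    def pixelise(path, grid=25, levels=2, display=250):
        """Simulate prosthetic vision: downsample a greyscale image to a
        grid x grid array and quantise it to a small number of grey levels."""
        img = Image.open(path).convert("L").resize((grid, grid), Image.BILINEAR)
        arr = np.asarray(img, dtype=float) / 255.0
        # levels=2 gives a binary (black and white) stimulus.
        quantised = np.round(arr * (levels - 1)) / (levels - 1)
        out = Image.fromarray((quantised * 255).astype(np.uint8))
        # Nearest-neighbour upsampling keeps each simulated phosphene square.
        return out.resize((display, display), Image.NEAREST)

    stimulus = pixelise("scene.png", grid=25, levels=2)  # 25x25 binary stimulus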
1.4 Thesis Structure

The thesis is structured as two main sections: Background (Chapters 1 – 3) and New
Work (Chapters 4 – 7), followed by a Conclusion (Chapter 8).
A review of the literature is presented first to establish the fundamental theory
relevant to image understanding and visual prostheses. Chapter 2 describes research in the area
of image quality and visual perception. Chapter 3 describes current artificial vision
research including image processing activities specifically related to bionic eye
projects. At the end of Chapter 3, several requirements for prosthetic image
processing systems are identified. These requirements drive research questions
around which the remaining thesis chapters are based.
Chapter 4 explores low quality recognition performance and establishes the
applicability of a computationally cheap Region-of-Interest (ROI) processing
technique to low quality images. A model for characterising low quality images on
the basis of how much visual information they contain is developed in Chapter 5.
An approach to tailor image processing depending on the type of scene is explored in
Chapter 6. Finally Chapter 7 compares several ROI methods against each other to
identify which may be most helpful when moving through a scene. The applicability
of all image processing techniques explored in the thesis is tested using normally
sighted human perception (subjective testing) experiments.
Chapter 8 provides a discussion and conclusion of the work and provides some
commentary on how the research can be extended.
1.5 Contributions
Resulting from this work are two significant original contributions to knowledge,
which are summarised below. These contributions are developed through the
examination of a number of related research questions, which are set out in detail
in Chapter 3, Section 3.6.
First, investigations based on early vision aspects of digital images are used to
provide recommendations for effective image processing methods for enhancing
viewer perception when using visual prosthesis prototypes.
The hypothesis that low level processing can improve scene understanding was verified
with several subjective experiments. Although bounded by ultra low quality
environments, prostheses can facilitate some recognition performance. By including a
range of image processing routines or modes of operation, users can gain as many visual
cues from a scene as possible.
When considering prototype resolution, enhanced perception is achieved with
increased spatial resolution of implant electrodes over increased greyscale resolution.
Thus a fundamental aspect of implant design is maximising the spatial resolution.
There are also considerations to be made with respect to context. The scene type
most easily recognised by users of vision prostheses is the face. The spatial pattern
of two eyes and an underlying mouth is easily recognised even at the lowest image
quality environments tested in the thesis experiments. Thus enhanced perception is
likely when viewing simple face scenes (eg. TV newsreader) compared to other
scene types.
Incorporating a digital zoom function in prosthesis designs could lead to enhanced
perception, and Region-of-Interest processing techniques (which automatically
identify salient areas within images) should be used to obtain the zoomed image.
This is an original application of these techniques beyond traditional applications
such as image compression.
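As an indication of how such a zoom might operate, the sketch below crops a
window around the peak of a saliency map and resamples it to the display
resolution. This is a minimal sketch under stated assumptions: Python with numpy
and Pillow, the saliency map taken as given, and the window size and function
names purely illustrative.

    import numpy as np
    from PIL import Image

    def roi_zoom(image, saliency, window=0.5, out_size=(25, 25)):
        """Crop a window centred on the saliency peak and resample it to
        the low display resolution of the prosthesis simulation."""
        h, w = saliency.shape
        cy, cx = np.unravel_index(np.argmax(saliency), saliency.shape)
        wh, ww = int(h * window), int(w * window)
        # Clamp the window so it stays inside the image boundaries.
        top = min(max(cy - wh // 2, 0), h - wh)
        left = min(max(cx - ww // 2, 0), w - ww)
        crop = image.crop((left, top, left + ww, top + wh))
        return crop.resize(out_size, Image.BILINEAR)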
The second contribution is the construction of a metric for basic information content
required for interpretation of visual scenes at low image quality.
The ability of a low quality image to be recognised was found to be positively
correlated with the amount of perceived information content in that image. Thus an
image with high information content can be expected to be more easily recognised at
low quality than an image containing low information content.
Experiments reported here have identified that a simple face with no surrounding
clutter is most visually informative amongst those image types tested. Also, higher
visual information content results from:
(a) more objects in the scene;
(b) closer objects;
(c) strong edges, arising from high intensity contrast.
Finally visual information can be quantified using the number of edges in the image.
Thus, image perception can be enhanced by maximising the number of edges within
the image.
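To illustrate this edge-based view of information content, the following sketch
counts strong-gradient pixels. It is an assumption-laden example (Python with
numpy and scipy); the Sobel operator and the threshold value are illustrative
choices, not necessarily those used for the thesis metric.

    import numpy as np
    from scipy import ndimage

    def edge_count(grey, threshold=0.2):
        """Rough proxy for visual information content: the fraction of
        pixels whose Sobel gradient magnitude exceeds a threshold."""
        g = grey.astype(float) / 255.0
        gx = ndimage.sobel(g, axis=1)  # horizontal gradient
        gy = ndimage.sobel(g, axis=0)  # vertical gradient
        magnitude = np.hypot(gx, gy)
        return float(np.mean(magnitude > threshold))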
The above contributions, while somewhat general in nature, are seen as forming the
essential elements of any visual prosthesis system capable of providing only low
quality images to the observer.
Chapter 2 Image Quality and Visual Perception
2.1 Introduction

This chapter provides some background on the overlap of visual perception and
imaging, as a framework for improving perception through electronic visual
prostheses. As a visual response is the goal of visual prostheses, it is relevant to
explore research in visual perception within a framework of image quality. Image
quality is a term used to denote the amount of information retained in an image that
has been degraded in some way from its ideal form.
The study of the interaction between human visual perception and electronic imaging
is one of the key growth areas in imaging science [81]. Concerning quality of images,
research to date has been focused on the high quality still images and video
associated with modern multi-media environments. Perceptual models for image
quality using characteristics of the human visual system have been developed
[1,105], and with them, perceptually based image compression techniques (eg. [69]).
However, this work is based on high quality images and there is no similar quality
characterisation for the emerging field of visual prosthetics.
The rest of this chapter is presented in four sections. Section 2.2 provides an
overview of perception physiology: what happens in the eye during visual
perception. In Section 2.3, a hierarchical model for perception and imaging is
presented. Research is described first in the areas of low level, or early vision. The
discussion includes previous work in the field of image quality and how quality has
been traditionally assessed. Following from this, research in higher levels or
cognitive vision is described. Section 2.4 presents an area of work that incorporates
both low and high levels of vision, known as Region-of-Interest processing. Finally
Section 2.5 provides an overview of some research approaches concerning visual
understanding and complexity.
2.2 Visual Perception Physiology
An important component in understanding how to design visual prostheses is the
physiology of visual perception [59]. The retina is the innermost layer of the back of
the eye, and is organised into layers that contain photoreceptors, interneurons and
blood vessels. The embryonic development of the retina results in an inside-out design,
so that the photoreceptors are nearest the back of the eye and light must pass through
the retinal interneurons and blood vessels to reach the photoreceptors. The two types
of photoreceptors, cones and rods, contain visual photopigment. The first step in
photoreception is photopigment bleaching: when light activates visual pigment
molecules. Bleaching initiates a sequence of events leading to a change in the cell
membrane potential. Ganglion cells are the retinal cells whose axons form the optic
nerve, so their output is the final product of the information processing that occurs in
the retina. Ganglion cell axons enter the optic nerve in an orderly fashion, so that
adjacent axons in the nerve correspond to adjacent receptive fields on the retinal
surface. The pathway ascends to the lateral geniculate nucleus (LGN) of the
thalamus and then projects to the primary visual cortex in the occipital lobes of the
brain.
Studies of anatomy, physiology, and human perception (eg. stroke victims) conclude
that the human visual system is subdivided into several separate parts whose
functions are quite distinct [48]. There are two subdivisions in the visual pathway
(LGN + visual cortex): parvocellular and magnocellular. Both have inputs from the
same rods and cones, but have differences in the way the photoreceptor inputs are
combined. Their receptive fields (regions of the retina over which impulse activity
can be influenced) are circularly symmetric and show centre-surround arrangement
(also in retinal bipolar cells). These cells are configured to convert information from
photoreceptors into information about spatial discontinuities in light patterns. Some
cells are excited (impulse rate speeded up) by illumination of a small retinal region
and inhibited (impulse rate slowed down) by illumination of large surrounding
region, while others are the reverse of this.
The magno and parvo divisions differ in four ways, which implies that they contribute
to different aspects of vision:
1. colour: 90% of parvo layer cells are wavelength sensitive (they combine cone
inputs in effect to subtract them); the magno system is colour blind, or
wavelength insensitive (they sum inputs of 3 cone types so the response to
illumination is on or off at all wavelengths). For example, to the magno system two
different colours such as red and green at the same relative brightness are indistinguishable.
2. acuity: magno cells have larger receptive field centres than parvo (by a factor of
2 or 3) ie. lower spatial resolution. For both magno and parvo cells, the receptive
field size increases with distance from the fovea. This is consistent with
differences in acuity between foveal and peripheral vision.
3. speed: magno cells respond faster and more transiently than parvo (which
suggests a role in detecting movement). Many cells at higher levels in this
pathway are selective for orientation and for direction of movement.
4. contrast: magno cells are more sensitive to low contrast stimuli than parvo.
At higher stages the continuations of these pathways are selective for different
aspects of vision (form, colour, movement, stereopsis). The extended M/P pathways
are described further in the visual cortex (blob and interblob systems) and properties
include:
• orientation selective
• selective for direction of movement
• end-stopped (respond to short but not long line/edge stimuli)
• colour information and colour contrast information in separate systems (eg.
colour contrast used to determine borders, but not information of colours forming
the borders)
• brightness selective
• respond poorly when either eye is stimulated alone but vigorously when both
eyes stimulated together (stereoscopic depth).
Hubel and Livingstone [48] argue that end stopping (like the centre-surround system)
is an efficient way of encoding information about shape. They also propose that
different kinds of visual tasks differ in their colour, temporal, acuity and contrast
characteristics. Other findings by the authors are as follows:
• People follow brightness alterations much faster than pure colour alterations (the
magno system is colour blind and faster than the parvo)
• Movement perception reflects magno characteristics: colour blindness, quickness,
high contrast sensitivity and low acuity (which has been proved with perceptual
experiments).
• Motion perception, stereoscopic depth perception, the ability to use relative
motion as a depth cue, shading as a depth cue, and perspective as a depth cue are
all lost at equiluminance.
• The retinal image is two dimensional. In order to capture three dimensions, the
human visual system uses many kinds of cues besides stereopsis and relative
motion: perspective, gradients of texture, shading, occlusion and relative position
within the image.
• Magno system is more primitive than parvo. Parvo is only well developed in
primates who can scrutinise in much more detail shape, colour and surface
properties of objects ie. visual identification and association.
• Results are presented that suggest that only luminance contrast and not colour
differences are used to link parts comprising a scene together.
It is difficult to predict if the aspects of vision described above (eg. brightness
selective, movement perception) will be similar for artificially induced vision. As
the parvocellular and magnocellular pathways extend through to the visual cortex,
any similar induced perceptions would probably be independent of the stimulation
site of the prosthesis.
2.3 A Visual Hierarchy Model
In this section a hierarchical model of imaging and visual perception is used to
describe research activities in this area. A good review of the research concerning
the interplay between visual perception and electronic imaging is given by Rogowitz
et al [81]. Their review draws on observations made during human interaction
research (see for example [15,21,69,70,75]). Arising from these observations is a
visual hierarchy, ordered from higher to lower levels of vision:
• aesthetic & emotional aspects
• cognitive effects: memory, semantic categorisation and visual representation
• perceptual effects: colour constancy, suprathreshold pattern & texture analysis
• visual phenomena mediated by the threshold sensitivity of low-level spatial,
temporal and colour mechanisms.
2.3.1 Early Vision Effects
At the bottom of the hierarchy are low-level, or “early vision” effects. Many early
vision models have been proposed to characterise image quality on the basis of these
low level effects. It is argued in this thesis that these early vision models cannot be
extended to the poor quality images anticipated from visual prostheses (eg. 10x10 or
25x25 resolution, black and white images). The early vision models have arisen to
address distortions, or artifacts, that have been caused by electronic imaging
processes, such as compression and halftoning. Such distortions include blurring,
granular noise, jerkiness and blocking. The body of knowledge on image quality is
based on reducing the visibility of these distortions. Many vision-based algorithms
have consequently been developed in the fields of still image and video compression,
image enhancement, restoration and reconstruction, image halftoning and rendering,
image and video quantisation and display.
Traditional quality models are based on measuring the differences between an
original “perfect” quality image and an altered image having undergone an imaging
process. The low spatial resolution binary images anticipated from visual prostheses
are so dramatically different from the perfect reference image that the quality models
developed to date in the literature do not apply.
To illustrate that these quality models are not useful, consider the popular quality
metric of Mean Square Error (MSE), defined as:

    MSE = (1/MN) · Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} (x_{ij} − x̂_{ij})²        (Equation 1)

where M and N are the number of horizontal and vertical pixels, x_{ij} is the value of
the original pixel at position (i,j), and x̂_{ij} is the value of the distorted pixel at
position (i,j).
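For reference, Equation 1 is straightforward to compute; a minimal sketch in
Python, assuming the two images are equally sized greyscale numpy arrays (a
10x10 version, for instance, would first be resampled back to the reference size
so the arrays align):

    import numpy as np

    def mse(original, distorted):
        """Mean Square Error of Equation 1: the mean squared difference
        between corresponding pixels of two equally sized images."""
        x = original.astype(float)
        x_hat = distorted.astype(float)
        return float(np.mean((x - x_hat) ** 2))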
Figure 1.1 shows the MSE from a reference perfect image for several distorted
versions of the reference. As can be seen, the MSE metric is best suited where the
differences are small, and the MSE for the low quality 10x10 binary image is
approaching that of a grey stripe pattern which did not originate from the Reference
Image.
[Figure 1.1: Mean Square Error figures between reference image and low quality
versions – black spot, 25x25 binary, 25x25, 10x10, and stripe pattern;
MSE = 338, 760, 4665, 6727, 8919, 10673]
Janssen [43] has proposed an alternate description of image quality which differs
from the traditional “perceived distance to the original”. He states that quality is how
well an observer is able to employ an image as a source of information about the
outside world, and proposes the following metric:
Quality = (λ1⋅N) + (λ2⋅C) + λ3
where N = naturalness, C = brightness contrast and λ1, λ2 and λ3 are scalar constants
determined from experiments.
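Purely to make the form of this model concrete, a toy evaluation is sketched below;
the λ weights are invented placeholders, not Janssen's experimentally determined
constants.

    def janssen_quality(naturalness, contrast, l1=0.6, l2=0.3, l3=0.1):
        """Janssen-style linear quality model: l1*N + l2*C + l3.
        The weights here are hypothetical, for illustration only."""
        return l1 * naturalness + l2 * contrast + l3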
Much of the image quality literature is focussed on reducing the visibility of image
artifacts (distortions), and so some of the approaches used for this purpose are
included here. Rogowitz et al [81] summarise methods for characterising these
image artifacts:
1. physical measurement: measuring key image parameters; comparing images on a
pixel-by-pixel basis using a metric such as mean squared error
2. use human observers to judge perceived quality
3. develop metrics based on experiments measuring human visual characteristics to
estimate human judgements
These approaches are recognised in other sources [24,53,67] and are described below
in further detail.
2.3.1.1 Physical Measurement Metrics
Commonly used physical measurement metrics include Signal-to-Noise Ratio (SNR),
Peak Signal-to-Noise Ratio (PSNR), Mean Absolute Error (MAE), Mean Squared
Error (MSE), local MSE, and distortion contrast. These metrics are easy to use in
that information on viewing conditions is not needed and they are computationally
simple. However, the methods are considered poor (even for high quality images) in
that they do not work well on images with different content (eg. edges/textured
regions) [24] and they treat all impairments as equally important [67].
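As an example of how computationally simple these metrics are, a sketch of PSNR
built directly on the MSE of Equation 1 (assuming 8-bit images, for which the peak
value is 255):

    import numpy as np

    def psnr(original, distorted, peak=255.0):
        """Peak Signal-to-Noise Ratio in decibels. Higher values mean the
        distorted image is numerically closer to the original."""
        err = np.mean((original.astype(float) - distorted.astype(float)) ** 2)
        if err == 0:
            return float("inf")  # the images are identical
        return 10.0 * np.log10(peak ** 2 / err)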
2.3.1.2 Human Observers as Judges of Perceived Quality
Human observers could be used to directly obtain feedback on image quality, and
indeed the ‘gold standard’ in determining image quality is the human observer. By
nature, this feedback incorporates human visual system considerations, but is
expensive and the results may not generalise [81]. However this method can be of
use in evaluating low quality images and is therefore of relevance to this thesis.
Traditionally this feedback, obtained either from trained experts or via psychological
experiments, has been used to compare a distorted image, resulting from an imaging
process, against its original.
If the artifacts are visible (ie. suprathreshold), quality can be assessed via:
• a rating scale 1-5, eg. bad, poor, fair, good, excellent; however this will only
characterise large quality differences and may produce inconsistent results when
evaluating images with different types of artifacts
• paired comparison experiments, where a 7 point scale ranges from –3 to 3 for
much worse, worse, slightly worse, same, slightly better, better, much better
• two stimulus forced choice scales which ask which image has the higher quality;
comparisons can be made using images with different types of artifacts
If the artifacts are just on the visual threshold, just noticeable difference (JND)
testing is often used to assess quality. JND tests are not biased by differences in
types of artifacts, and are often used to predict the visually lossless point between a
compressed image and the original. Display time and learning effects affect the JND
point, especially if an observer is given hints about where to look. One can also
maintain a visible difference map and have a user specify image quality for different
regions in an image [24].
Quality has also been assessed via the receiver operating characteristic method,
which measures the performance of an observer in making decisions using an altered
(eg. compressed) image. The rate of true positive decisions is compared with the rate
of false positive decisions, and the decisions can be subject to fatigue and time of
day. Therefore the reliability of ROC studies is dependent on a large number of
tests conducted under a range of human circumstances [53]. Typical studies are done
with a number of controlled observers.
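The two rates involved can be made concrete with a small sketch; the boolean
decision arrays here are hypothetical, and a full ROC study would sweep the
observer's decision criterion to trace out the curve.

    import numpy as np

    def tpr_fpr(decisions, truth):
        """True and false positive rates for one set of observer decisions.
        `decisions` and `truth` are equal-length boolean arrays."""
        tpr = np.sum(decisions & truth) / max(np.sum(truth), 1)
        fpr = np.sum(decisions & ~truth) / max(np.sum(~truth), 1)
        return float(tpr), float(fpr)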
A detailed discussion of human observer research methods in electronic imaging is
contained in Snyder and Trejo [90]. This covers psychophysical (acuity,
discrimination), physiological methods (electroretinogram, positron emission
tomography, cerebral blood flow) and behavioural methods (search time, legibility,
response time, which includes subtasks of visual perception, decision making and
motor response). Psychophysical tests are of most relevance in this thesis as the
research hypothesis is validated by such methods. Nemine [62] defines
psychophysical techniques as methods used to measure characteristics of the
environment in terms of their psychological value. Travis [98] observes that
psychophysics is non-invasive, and involves investigation of a system by the study of
the psychological response to physical stimuli. The physics refers to measurement of
the stimuli, while the psychology refers to the measurement of sensation.
Recommendation 500 of the International Telecommunication Union is often cited in
the field of image quality evaluation [41]. This document covers methods and
viewing conditions for assessing perceived quality in a standardised way. More
recently, the ITU has developed Recommendation P.910 to standardise methods for
multimedia quality assessment. The premise behind subjective assessment is the use
of human observers to rate video sequences (usually short clips), and thus it may be
impractical to use these methods for the in-service continuous evaluation of image
quality. A Video Quality Experts Group (VQEG) has been established to provide
other objective methods for video image quality evaluation [101].
Obtaining subjects’ perceptions of the low quality images anticipated from visual
prostheses has similarities with the classical inkblot tests used in psychology, known
as the Rorschach [27]. Clear guidelines for psychological assessment have been
established [30], covering seating, verbal instructions, recording and enquiry on
responses. Error can be introduced by censorship by the subject, scoping errors, poor
handling of subtleties of interpretation, incorrect incorporation of age or education
and examiner bias (illusion correlation). Alterations in wording, rapport and
encouragement can significantly alter responses. A large number of variables are
likely to produce spurious random significance. Side-by-side seating is
recommended if possible.
Tests used in this thesis aim to determine the intelligibility of low quality
images, via a viewer’s ability to recognise an object. Specifically related to these
tests are the interpretation and questioning used in ink-blot testing. For example, if
shown a pattern, typical questions would be:
Q. What is this object?
Q. What about it made it look like [ _________ ]?
Interpretation of the answers has been assisted by reference codes and categories. A
location attribute is used to categorise the area of the image used to draw the
viewer’s conclusions:
• whole response (W) – if the entire image was used
• unusual detail (Dd) – location responses made by <5% of subjects
• common detail (D) – a frequently identified area.
Determinants are noted for any style or characteristic of the image eg. shape, texture.
Are there any arbitrary contours created where none exist? Finally the content of the
answer is allocated a category. These include whole human (real person) eg. Lena,
the image processing test image; whole human (myth/fictional) eg. ghost, angels;
human detail eg. person without head; whole animal; anatomy eg. lungs, stomach; art
eg. statues, jewellery; botany eg. plants; clothing eg. hat; clouds; food; household;
landscape; science.
Other cognitive issues in image quality measurement are given by De Ridder [19].
He states that methods for assessing perceived image quality can produce biased
estimates of a viewer’s quality impression. These include instructions given to
subjects, and choice of rating scale, such as:
- single stimulus scaling: 1 (lowest sharpness) to 10 (highest sharpness)
- double stimulus scaling: as above, but with a reference image introduced
- comparison scaling: from –10 (1st image much sharper than 2nd) to +10
(2nd image much sharper than 1st).
2.3.1.3 HVS-based Metrics
The next category for consideration is metrics that are based on experiments
measuring human visual characteristics. To overcome the shortcomings of the
simple physical measurement metrics such as Peak Signal-to-Noise Ratio or Mean
Absolute Error, a large number of often complex perceptual quality metrics have
been proposed. A good summary of perceptual metrics as applied to image quality
research is contained in Ahumada [1]. Perceptual metrics can be classified as
follows:
• Metrics with simple characteristics; these include contrast sensitivity function
and luminance adaptation (refer below);
• Metrics incorporating preferences for suprathreshold artifacts; these attempt to
model the objectionability of imaging artifacts;
• Threshold perceptual models; these metrics predict the visibility of distortions
near the visual threshold. Visible-difference maps can be created which specify
the probability of seeing a difference between two images at each pixel location.
Other models give the visibility of errors within each frequency band; overall
image quality is then determined by identifying the frequency band with the most
visible artifacts.
There are several visual factors used in perceptual image quality metrics [24]:
1. Contrast sensitivity functions
The contrast threshold function (CTF), or its inverse, the contrast sensitivity
function (CSF), is the most widely used attribute for metrics. The CTF indicates
when frequency components just become visible, and specifies the internal noise
level across spatial frequencies; ie. it identifies the amount of quantisation that
can be applied near the visibility threshold for the same perceptual error. The
CSF is a measure of the spatial resolution of the human visual system. When
used for image compression applications, the CSF is often assumed to be a low
pass function to ensure quantisation artifacts become less visible at increased
viewing distance (a CSF-weighting sketch is given after this list).
2. Luminance adaptation
Luminance adaptation is the second most commonly used attribute. The Weber-
Fechner law describes how the just-noticeable luminance difference in a stimulus
is proportional to the mean luminance of the stimulus (ie. the contrast threshold
remains constant for increasing luminance levels). As the background luminance
increases, the sensitivity to errors decreases proportionately. Again for
compression applications, as local luminance increases, an increased level of
quantisation can be tolerated. Luminance adaptation can be implemented in the
spatial or frequency domain.
3. Linear transforms
Psychophysical studies have shown that the human visual system has visual
channels selective to spatial frequency (with approximately an octave bandwidth)
and to orientation (with bandwidths of 15 deg to 60 deg). Several desirable
properties are used when choosing linear transforms to model the human visual
system: frequency and orientation selectivity, linear/quadrature phase, minimum
overlap between adjacent channels, shift invariance and scale invariance.
4. Masking: contrast; noise; & mutual
Contrast/pattern masking is where a signal is masked (its visibility reduced) by
the presence of another signal or noise. For compression applications, it is
desired to have the image content (eg. textured areas) mask the quantisation
noise. For masking to occur, the image and noise signals need to be in the same
spatial location, have the same spatial frequency and have the same orientation.
Incorrect prediction of contrast masking is the major reason why perceptual
metrics fail. A major problem is that metrics model contrast masking with a
computationally complex multiple-frequency decomposition; perhaps simpler
single-channel metrics could be used. Osberger [67] states that masking may be
one of the most influential components of a vision quality model for complex
natural images (more than the choice of single versus multiple channels).
5. Summation of errors
Summation of errors is done across frequency bands and across space to reduce
dimensionality arising from many channels (an excess of information) into a
single map and perhaps even a single number.
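To make these components concrete, the following is a minimal sketch of how items 1, 2 and 5 might be combined into a toy metric. It is illustrative rather than any published metric: the Mannos-Sakrison CSF approximation, the 0.02 Weber fraction and the assumed display resolution of 32 pixels/degree are all stand-in choices.

import numpy as np

def csf(f):
    # Mannos-Sakrison (1974) contrast sensitivity approximation,
    # with f in cycles/degree.
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)

def perceptual_error(ref, test, pixels_per_degree=32.0, beta=4.0):
    # 1. CSF weighting: filter the error image's spectrum by the CSF.
    err = test.astype(float) - ref.astype(float)
    fy = np.fft.fftfreq(err.shape[0]) * pixels_per_degree
    fx = np.fft.fftfreq(err.shape[1]) * pixels_per_degree
    f = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))
    visible = np.real(np.fft.ifft2(np.fft.fft2(err) * csf(f)))
    # 2. Luminance adaptation: divide by a Weber-law threshold that
    #    grows with mean luminance (0.02 is an illustrative fraction).
    visible /= 0.02 * max(float(ref.mean()), 1.0)
    # 5. Minkowski summation collapses the error map to one number;
    #    beta around 2-4 is typical, beta -> infinity is worst-case pooling.
    return float((np.abs(visible) ** beta).mean() ** (1.0 / beta))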
If the artifacts are above the threshold of visibility (for example at high compression
ratios), the objectionability of the artifacts depends on the personal preferences of an
observer. This could be incorporated into a model, but most metrics ignore the issue.
There is no consensus regarding distortion levels at which observer preferences play
a role [24].
Metrics should be able to characterise spatial variations in quality across an image.
Therefore some perceptual metrics provide a two-dimensional quality map, which
assigns a level of perceived distortion to each location [106]. There is also the desire
expressed by these metric designers to collapse the map into a single number that
reflects overall quality, so a user can specify a single quality number for the entire
image. The predictive ability of a single number is very dependent on the
psychophysical methods used to validate the metric.
2.3.2 Cognitive Effects
At the higher end of the hierarchical model are higher-level cognitive effects. In
this thesis, it is desired to determine how little information can make up a scene and
still be of use. In order to determine the lowest visual ‘primitives’, or building blocks that
make up an image, an analogy can be made to the process of understanding words if
alphabetical characters are removed from the word. Different thresholds of
understanding may be found depending on the approach:
• Replace letters randomly, and ask subject if it still makes sense
• Build words from letters (randomly), and ask the subject if it makes sense.
Several viewpoints on top down perceptual effects are collected in Cantoni [10].
Yarbus points out that visual exploration of a picture by a human observer follows
different paths according to the particular task that has been assigned. Savina states
that what is important in scene understanding is to collect, in the shortest time
possible, the information that allows one to perform the assigned task, leaving aside
the rest. To answer some question about a scene, one needs to analyse a few restricted
areas for a long time, meaning perhaps that the extraction of certain kinds of
information is difficult compared to the extraction of others. Gibson maintains that
the world around us acts as a huge external repository of information necessary to
act, and we directly extract from time to time elements that are needed.
The human response to visual stimuli is covered well by Hendee and Wells [34].
Bottom up versus top down visual processing is compared. Bottom up models
describe a scene as being built up from individual features, with finer details
obtained by slower scanning of the scene, requiring much time. In top down models,
an overall impression of the entire scene is obtained and features are filled in later.
Studies of visual scan paths indicate that real world knowledge (physical laws), past
experiences and expectancies affect eye fixation. Areas with high information
content (contours, non-homogeneous areas) are fixated by an observer. A perceptual
cycle/search plan is proposed:
[Figure: the perceptual cycle. Available stimulus information modifies the schema;
the schema directs visual exploration; exploration samples the available stimulus
information.]
It is not possible to separate the mechanisms of detection, recognition and
interpretation of visual images. Instead there is a single process of constant interplay
between perception and cognition. Vision can be regarded as having a preattentive
phase and an attentive phase, and terms such as ‘useful field of view’, ‘visual lobe’
and ‘functional visual field’ are used to describe the gathering of information from an
area extending beyond the fovea. Preattentive stimuli are immediately detected by the
parallel preattentive system. Attentive processing of stimuli requires a serial search
by a disc of focal attention.
Preattentive processing is further described by Callaghan [9]. This processing
involves parallel and independent registration of features in the visual field. Features
are registered on separate ‘maps’ and linked to a master location map. Further
attentional analysis is required for identifying an object (the information in the
master map is linked together). Callaghan suggests that perhaps there are no
preattentive/attentive stages but an attention continuum during perceptual processing.
Texture segregation and popout are easily produced from visual primitives eg. line
orientation, hue, brightness, form, line terminators, line length, curvature and
closure. The author describes experiments conducted to support the proposal that
within-region segregation is an important factor in texture segregation. Observers
were presented with arrays of elements of varying hue and form (eg. circles &
squares) and asked to identify boundaries between elements (ie. natural scenes were
not used in the experiments).
Other perceptual studies are reported in [71]. The visual perception program at SRI
includes studies of reduced field of view, limited spatial resolution, system-produced
distortions, system delay and system update rates. Most relevant to the application
addressed by this thesis were limited spatial resolution studies, where a stereoscopic
display was used to present images on 2 colour monitors that the viewer saw as a
fused stereoscopic image. Each monitor had a resolution of 1280x1024 pixels, and
when viewed at 57cm, produced a pixel subtense of 1.5 min arc, equivalent to a
visual acuity of 20/30. Coarser spatial resolution was achieved by programming the
display to use selected small numbers of pixels instead of single pixels to produce an
image point.
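As a check on these numbers, the pixel pitch implied by a 1.5 arcmin subtense at 57 cm is about 0.25 mm. The pixel size used below is inferred from the stated subtense rather than given in the source:

import math

def pixel_subtense_arcmin(pixel_size_cm, viewing_distance_cm):
    # Visual angle subtended by one pixel, in minutes of arc.
    return math.degrees(math.atan2(pixel_size_cm, viewing_distance_cm)) * 60.0

print(pixel_subtense_arcmin(0.025, 57.0))   # ~1.5 arcmin, ie. 20/30 acuity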
Information presented at a rate higher than 30 images/second is not integrated as a
discrete sequence of images due to limitations of brain neural networks [34]. The
extraction of basic attributes of light fields (ie. characteristic features of an image) is
the central issue in biological vision. There is no mathematical theory of image
enhancement available because we are unable to objectively describe the perception-
cognition relationships. The simplest way to approach this is to focus on image
contrast and detail. In practice signals are not band-limited and sampling is a finite
duration operation. More samples are needed than theoretically predicted.
Semantic categorisation research is also of relevance to cognitive perception. Image
semantic researchers [60,80] have found that colour seems to play a significant role
in the perceptual organisation of images. Colour was found to be important in
natural scenes, but not with people or manmade objects/environments (where spatial
organisation, spatial frequency and shape features are more influential). The
presence of strong colours (bright red, lime green, pink, pure white) can indicate
man-made objects, especially when combined with spatial and regional features.
Image segmentation into regions of uniform colour or texture gave opposite results
for man-made scenes (straight lines & boundaries, geometric shapes, sharp edges)
and natural scenes (ragged boundaries, random edge distributions). Semantic
categories appear to correlate with image descriptors (eg. indoor scenes = brownish,
low light levels, many straight edges), and so attempts have been made at a metric
for semantic categorisation. For example, a feature combination to capture semantics
for the ‘cityscapes’ category: skin = no, face = no, silhouette = no, nature = no,
number of regions = large, region size = small/medium, central object = no, energy =
high, number of edges = large, details = yes, colour = brown/grey.
Studies of eye movement paths, or scanpath theory, have also contributed to the
understanding of visual perception [92]. The researchers present eye movement
studies as an approach to describe how we see in our mind’s eye (top-down). When
subjects were asked to first look at a picture and then later asked to visualise that
same picture from memory, the scanpaths were very similar. This provides evidence
for the top down scanpath theory of vision, since there is no external world available
to satisfy the bottom up concept that the external world enters the brain and controls
visual perception. The researchers recognise there is a problem in matching the
bottom up signals coming from the wide peripheral visual field (low resolution and
high sensitivity for moving objects) and from multiple high resolution glimpses by
the centrally located fovea. These regions of interest (ROIs) are sequentially visited
by a string of fixations and saccades (rapid eye-movement jumps), and are
simultaneously matched by top down representations of the hypothesised image. The authors remark
that when the retinal field is mapped onto the visual cortex, there is considerable
magnification of the signals coming from the fovea (ROIs), and a reduction of
signals coming from the low resolution periphery (only colour and textural
segmentation of large areas). These foveal and peripheral representations indicate
the kind of bottom up information entering the visual brain.
Stark and Privitera also conjecture a meeting point for top down/bottom up
processing [92]. They propose that where top down inputs to levels I, II, and III of
the visual cortex meet bottom up visual signal information going to levels IV and V
in the visual cortex, this is the site of the matching between top down subfeature
representation with incoming bottom up sensory signal flows. After matching, they
then propose that the scanpath continues to the next ROI, and in this way, the top
down model moves, fixates and foveates the eye, to bring forward successive
subfeatures for checking.
A final point of interest concerning this research group is their study of scanpath
eye movements with dynamic scenes. Smooth pursuit eye movements play a large
role in scanpaths while subjects are observing dynamic scenes and have an
interesting characteristic: they maintain the fovea over the moving object as long as
this is possible and as long as the moving object is one the top down spatial cognitive
model continues to address.
Other researchers [76] recording the eye positions of human subjects viewing natural
scenes found that subjects looked at image regions that had high spatial contrast, and
in these regions, the intensities of nearby image points (pixels) were less correlated
with each other than in images selected at random.
As Region-of-Interest concepts feature highly in the above discussion of cognitive
perception, the topic is expanded in the next section, where several research activities
in this area are described.
2.4 Region-of-Interest
This section provides commentary on an area of work incorporating both low and
higher levels of vision. Deficiencies of modelling vision using only early vision
phenomena have been identified previously [67] in that model parameters need to be
chosen to reflect human response to complex natural scenes – not simple artificial
stimuli, and that higher level and cognitive factors need to be employed.
A common goal of models described in this section is that they identify Regions-of-
Interest (ROIs) within an image in an attempt to predict where the human eye will
fixate in the image. When compared against subjective tests using eye-tracking
machines or similar attention-recording devices, these region-of-interest algorithms
provide a high degree of correlation with human observer behaviour. These ROI
techniques have found application in advertising [40], military surveillance [64] and
visually lossless compression (where uninteresting areas of the image are
compressed more than others so compression artefacts are placed in these areas only)
[69,75].
There are several factors influencing attention: motion, contrast, size, eccentricity
and location, shape, foreground/background, edges and texture, prior instructions and
context of viewing, people, gestalt properties (closure, orientation, proximity,
similarity, symmetry), clutter and complexity, unusual or unpredictable stimuli (eg.
high information content), and interactions between basic features [67].
Schill et al [85] have recorded eye movements when subjects viewed natural scenes.
They analysed spatial statistics of fixated regions with higher order statistics
(bispectra), and found a clear bias for subjects to fixate on regions with frequency
components of multiple orientations (eg. regions with curved edges or occlusion
patterns). Using this information as candidate features of informative regions, the
authors have developed a system attempting to automatically select informative
regions in saccadic scene analysis. The system integrates a simplified bottom-up
mechanism to a task-oriented top-down mechanism. The cognitive (top-down) stage
infers knowledge based on the Dempster-Shafer theory for uncertain reasoning. The
bottom up/early visual system computation is a neural network processing stage,
where features are extracted by linear orientation selective filters. They conject that
top down and bottom up processing relies on a common principle: “information
gain”. The systems they have developed computes the most informative region
which should be selected by the next eye movement. In visual prostheses
applications, a human user would shift fixation point, and thus there is not the need
to model the top-down voluntarily controlled attention shifts.
Osberger [67] has defined a quality metric using the notion of importance maps.
This metric has improved performance over traditional quality metrics which assume
the whole of a scene is viewed foveally (representing image fidelity). An
importance-map weighted metric is more representative - traditional metrics
overemphasise visual distortion in textured areas and don’t account for strong
masking in these areas. The metric has been extended to a perceptually based
compression [69] and a new model for automatic detection of regions of interest in
complex video sequences [70]. Features of the new model include motion (the
model can distinguish camera motion eg. pan, tilt, zoom, from true object motion,
and has adaptive motion thresholds for different video scenes), colour, intensity
contrast, size, shape, location, background, and skin tone (a narrow range of Hue
Saturation Value). Feature maps representing individual factors were correlated with
eye movements from 24 viewers to quantify weights for each factor. 75% of
viewer’s fixations occurred in the 30% of the scene total area that was estimated by
the model as being the most important.
A technique that avoids the segmentation of the above model is the context-free
region-of-interest algorithm presented by Nguyen et al [64]. The technique is aimed
to be useful for images with interpretable content that varies with resolution and field
of view. There are three stages to the algorithm:
1. Quadtree feature maps are generated for 4 visual factors (contrast, relative
brightness, variance, edge density). Each level of quadtree decomposition
narrows the field of view over which the feature is examined. If the feature
persists in a region, the region gets further divided.
2. Each region is assigned an importance value [0 1] based on region detail (higher
importance if region keeps splitting into narrower fields of view)
3. An overall normalised [0 1] importance map is generated from the weighted
combination of importance-weighted quadtree feature maps.
From the overall importance map, an integer number of bits can be assigned to each
pixel in proportion to the importance map values, rather than applying a uniform
number of bits per pixel. Context-dependent criteria could be integrated to improve
the technique.
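A minimal sketch of this splitting idea follows. It is not the authors' implementation: it splits on a single feature (intensity variance, against an arbitrary threshold) where the paper generates quadtree maps for four features, and it scores importance in [0 1] simply by splitting depth:

import numpy as np

def quadtree_importance(img, max_depth=4, var_thresh=100.0):
    # Recursively split a region while the feature (here, intensity
    # variance) persists at narrower fields of view; regions that keep
    # splitting receive higher importance in [0 1].
    imp = np.zeros(img.shape, dtype=float)

    def split(y0, y1, x0, x1, depth):
        region = img[y0:y1, x0:x1]
        if (depth == max_depth or min(y1 - y0, x1 - x0) < 2
                or region.var() < var_thresh):
            imp[y0:y1, x0:x1] = depth / max_depth
            return
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        for ys, ye, xs, xe in ((y0, ym, x0, xm), (y0, ym, xm, x1),
                               (ym, y1, x0, xm), (ym, y1, xm, x1)):
            split(ys, ye, xs, xe, depth + 1)

    split(0, img.shape[0], 0, img.shape[1], 0)
    return imp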
If colour is taken into account as well as intensity, more information can be obtained
about the image contents. Similar to the above models, colour importance has been
defined at each pixel location in an image, and used to weight the results of image
analysis tasks [54]. It is difficult to extract colour information due to poorly defined
data dependencies between colour bands. Spectral differences become important in
regions where the difference in luminance is negligible. Shadows and highlights can
cause sharp changes in luminance producing undesirably strong edges. Colour
correction (eg. joining panoramic photos) and colour quantization are also situations
where colour has a significant influence. An importance measure (normalised to
[0, 1]) was constructed by combining global and local factors on a case-by-case
basis. Global factors included the probability of a colour in the image, the probability
of a colour group in the image, the probability of a colour within its colour group,
and variability (if low, discrimination is limited and may need to be enhanced). Local
factors were similar to the global ones but acting in a regular m x n neighbourhood or
an irregular segmented region. It is not possible to modulate colour in the visual
prosthesis application, so information about colour as described here is not highly
assistive to this thesis.
A progressive technique for human face archiving and retrieval uses a similar
importance measure [7]. The compression technique described has 1.5 times the
compression rate of the JPEG standard and is more visually acceptable, as
compressed images do not suffer from blockiness and the visually important
information (edges) is reconstructed first.
Another region of interest algorithm has been proposed to detect the main subject in
photographs [50]. Advantages of the approach are that the model includes semantic
(human skin and face, sky, cloud, grass, tree) as well as geometric features (no
motion or depth features as the application is only for photographic images). An
image is segmented into regions and the following features ‘believed to influence
visual attention’ computed for each region:
• Low-level: colour, brightness and texture (self saliency – by itself, and relative
saliency – in competition for each)
• Geometric: centrality, borderness, surroundedness, size, shape, symmetry
• Semantic: skin, face, sky, and grass
All features are plotted on an “effectiveness-complexity” graph in relation to main
subject detection. For example, a face is a strong indicator of the main subject but is
less effective than the location features (centrality, borderness) because of the low
likelihood of face regions among all regions.
Itti and Koch have proposed many models for visual attention: multi-scale feature
maps to detect local discontinuities in intensity, colour, orientation & optical flow,
and biologically plausible models such as a centre-surround mechanism modelled on
visual receptive fields (cortex transform) [40]. Receptive field properties can be
modelled using difference-of-Gaussian filters (non-oriented features) or Gabor filters
(for oriented features). Feature maps (normalised to [0 1]) are produced for intensity,
colour, and orientation (0, 45, 90, 135 deg), with 6 feature maps for each at different
spatial resolutions. The feature maps are then combined into a master or saliency map
using one of four methods. In the final saliency map, the most salient location is
suppressed or inhibited, so the system can focus on the next most salient location. A
circular focus of attention (rather than the actual object) is identified (the radius was
80 or 64 pixels depending on the image set). The average number of false detections
(mean ± standard deviation) was reported for each method used on a database of
traffic signs (ie. still images). The authors state that template matching algorithms are
much simpler, but their method is independent of the nature of the targets (ie. context
free).
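For illustration only, a bare-bones version of the intensity channel of such a model might be written as below; the scale choices are arbitrary, and the published model's map normalisation and colour/orientation channels are omitted:

import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_saliency(img, centre_sigmas=(1, 2), surround_deltas=(3, 4)):
    # Centre-surround differences approximated as differences of
    # Gaussian-blurred copies at several scales, normalised and summed.
    img = img.astype(float)
    saliency = np.zeros_like(img)
    for c in centre_sigmas:
        centre = gaussian_filter(img, c)
        for d in surround_deltas:
            fmap = np.abs(centre - gaussian_filter(img, c + d))
            if fmap.max() > 0:
                fmap /= fmap.max()
            saliency += fmap
    return saliency / max(saliency.max(), 1e-12)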
The above saliency model has been extended to include object recognition [58]. The
new model combines a fast visual attention front-end which rapidly selects the few
most conspicuous image locations and a slower object recognition back-end which
identifies objects at the selected locations. The object recognition back-end is trained
on target features which are simple stimuli only (eg. circle vs rectangle), with the
hope of extending this in future to natural colour images (eg. pedestrians). The
relevance of such a system to visual prostheses might be to give a speech
interpretation of the scene when walking down the street, eg. tree, post, sign etc.
2.5 Visual Information
This section provides background to visual information contained in images. Cooper
et al in their paper Causal scene understanding [16], asked some intriguing questions
pertinent to visual prostheses:
“What is visual understanding? What does it mean to look at a scene and
understand what it is about?”
Since understanding is, in large part, the preparation we make for acting, the question
can be reformulated in this way: “What knowledge about a scene would a visually
impaired person need in order to take intelligent action in that scene?” A comment is
made that every picture tells a story and visual understanding consists of figuring out
what that story is. In Cooper et al’s paper and others like it [78], scene
understanding is described from the point of view of a robot – to predict what is
going to happen. Many intelligent agent systems have been developed, for example
robots that can pick up mugs with handles (vertical lifting force plus rotational torque
to counteract mug rotation) and vision based robot corridor-cleaners. In the visual
prosthesis application, there is a functioning human brain to interpret visual signals
and understand the significance of elements in a scene and the relationships between
those elements, unlike robot/knowledge-based computer vision applications, where
an agent does the interpretation. Therefore, visual prosthesis systems should perhaps
mainly focus on bottom-up processing, trying to get the most out of a scene, while
using knowledge of top-down cognitive interpretation, eg. magno/parvo channels etc.
Experiments are described in Chapter 6 of this thesis that quantify the amount of
visual information inherent in an image. Previous research in this area includes that
concerning visual complexity. Riglis [77] has reported on the following measures
for estimating visual complexity:
• The number of words used when describing a picture;
• Estimators from pattern recognition systems; straight lines, smooth curves = low
complexity, right angles = medium complexity, intersections of lines at acute
angles = high complexity;
• Geometrical characteristics derived from Gestalt psychologists: symmetry of
stimuli, & symmetry of curves present in image, similarity of objects in image,
saliency of curves present in image, smoothness of curves present in image;
• Other geometrical characteristics: area of figure, value of angles, number of
revealed elements, diversity of angles and sides, symmetry;
• In a study involving the perceived beauty of forests, high image complexity was
found to be related to a high number of colours, a high number of edges, high fractal
dimension, high standard deviation, high entropy, and larger file sizes (image
encoding including Huffman encoding & run-length encoding);
• Klinger-Salingaros complexity: temperature (internal contrast), harmony
(symmetry), life (product of temp and harmony), complexity = temp * (constant
– harmony)
Except for the perceived beauty of forests study, none of the above measures was
tested with real images. Riglis undertook experiments in which subjects ranked
images from low to high complexity, and determined relationships with fractal
dimension, fractal image format compression, GIF compression, JPEG compression,
TIFF compression, pixel mean, pixel median, pixel standard deviation, and his
understanding/implementation of K-S harmony, life and temperature. He found a
positive correlation in one out of three experiments for fractal compression, K-S life
and pixel standard deviation (the other two experiments showed no correlation; one
was conducted on-line with no control images for comparison).
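Two of the simpler measures above, grey-level entropy and compressed file size, are easily computed. The sketch below uses zlib as a stand-in for the Huffman and run-length coders mentioned in the text:

import zlib
import numpy as np

def complexity_proxies(img):
    # Grey-level entropy in bits/pixel, and compressed size in bytes;
    # both tend to grow with the visual complexity of the image.
    hist, _ = np.histogram(img, bins=256, range=(0, 256), density=True)
    p = hist[hist > 0]
    entropy = float(-(p * np.log2(p)).sum())
    compressed = len(zlib.compress(img.astype(np.uint8).tobytes()))
    return entropy, compressed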
Meletiou has estimated the complexity of scenes using Osberger/Maeder importance
maps [57]. After segmentation and importance classification (based on contrast,
size, shape, location and background), the extracted regions were grouped into 10
categories according to importance level. Complexity was then computed as the
importance-weighted region count: complexity = sum over i = 1 to 10 of (i x number
of regions in category i); ie. the more regions of high importance an image contains,
the more complex it is taken to be. Observers were asked to describe a presented
image, and then rank (from 1-5) the difficulty of verbally describing it. Reaction
time, number of words and ranking were compared with the complexity metric
(reaction time and number of words are probably better suited to exploring the
threshold of sensitivity of verbalisation: some subjects could talk for hours on a
simple image). All measures were statistically significant. Meletiou conjectures that
perhaps a high fractal dimension relates to a simple image, and a low fractal
dimension to a complex image.
Other researchers have used the term ‘visual complexity’ to describe the running
time of algorithms [5]. Visual complexity has been proposed as the sum of the
number of edges in the scene, the screen resolution and the number of visible edge
crossings (wire mesh rendering application).
Stange undertook several subjective experiments using simple geometric models
[91]. Visual complexity was modelled with 4 parameters:
• each individual object’s colour
• each individual object’s size
• each individual object’s shape
• number of individual objects in the image
The only parameter that had a statistically significant correlation was the number of
individual objects in the image.
2.6 Chapter Summary
This chapter has described research in image quality and visual understanding.
Physiological aspects were provided to gain a brief overview of how visual
perception is achieved in the human visual system. The large body of knowledge
concerning image quality was noted to be largely based on addressing imaging
distortions.
A hierarchical vision structure was then described, starting at early vision effects and
incorporating increasing levels of cognitive or higher level factors. Region-of-
Interest techniques were described which combine advantages of early vision and
cognitive vision models.
Finally some previous studies in the area of visual information were described, which
has high relevance to an application where visual information needs to be
maximised.
Chapter 3 Visual Prosthesis Application
3.1 Overview
This chapter contains five further sections. The first introduces the application of
artificial vision. Section 3.3 provides an overview of current visual prosthesis
projects. The image processing aspects of these projects are discussed separately in
Section 3.4. Section 3.5 frames the field of digital image processing for this
application. The poor quality of anticipated images produced by artificial vision
systems is described along with several processing techniques that are compatible
with the anticipated evoked visual sensations of visual prostheses. Finally Section
3.6 presents several image processing requirements for visual prostheses, which
drive research questions to be addressed in this thesis.
3.2 General Introduction to the Application
Biomedical in-vivo applications of computing, especially where computing systems
are superimposed on or integrated with human systems, offer enormous challenges to
researchers to develop new or better solutions which can improve our quality of life.
The development of intelligent, reprogrammable devices for insertion into the body
(such as pacemakers) is an example. Bold new projects have emerged, such as the
MIT “wearable computer” or “thinking cap”, where computer systems interface very
closely with the user’s body [73].
Several international research teams are currently developing artificial human vision
("bionic eye") systems that have the potential to restore some visual faculties to blind
persons. While the approaches by the various teams differ, a common element is that
they all require a system that converts a visual scene into electronic pulses that
stimulate nerve cells in the visual pathway (eg. via implanted electrodes), resulting in
a crude induced “image” being formed in the visual regions of the brain. The utility
of the induced image depends on how much visual information is presented, which in
turn is determined by image quality and image processing considerations.
Little human trialing of visual prostheses has yet been conducted from which to draw
conclusions on image quality. The perceived quality of an image is dependent on the
number of electrodes in the implant, with higher numbers of electrodes giving higher
spatial resolution of images. At present, size and manufacturing constraints place
limitations on the numbers of electrodes in a given implant. An open question from
an image processing point of view is how to optimise the amount of useful visual
information obtainable from the relatively few electrodes in the implants.
The next section gives an overview of research in the area of artificial human vision.
It describes the categories or general areas of research and summarises the
approaches of the various research teams. Details of their respective designs are not
covered in depth, and the interested reader is referred to the project websites or
publications listed in the text. This background on vision research provides a
framework for the image processing methods suggested later, and gives the reader an
appreciation of the challenging nature of the application.
3.3 Current Visual Prosthesis Research
Good reviews of the history and present state of the art in visual prosthesis systems are
presented by Warren and Normann [104], Margalit et al [55], Suaning et al [93] and
Lysaght et al [51]. The basis of all visual prostheses is an image sensing device (video
camera or vision chip and lens) that records the visual world and transmits this
information in real-time to the upper level visual processes (refer Figure 3.1). An image
acquired by the camera is processed or manipulated to be in a form matching the implant
device. The processed image is then sent as electronic pulses to implanted electrodes
within a blind patient.
[Image sensing device → Processing unit → Implanted electrode array]
Figure 3.1: Basis of Visual Prostheses
When undergoing electrical stimulation, patients have reported the perception of spots of
light in their visual field, referred to as phosphenes. Although unlikely to recreate
perfect vision, artificial vision systems may evoke enough phosphene perception to
perform every-day tasks such as navigation, recognition and reading.
Visual prosthesis research can be categorised according to the intended stimulation
site of the implant; moving down the list gives increasing proximity to the brain and
an increasing number of potential beneficiaries:
• Retinal
• Optic nerve
• Visual cortex
The visual cortex holds the potential to assist the largest number of blind persons, as
prostheses designed to stimulate the retina or optic nerve require the rest of the visual
pathway to the brain to be intact. However, the surgical risk to a patient with an
otherwise healthy brain may be higher for visual cortex prostheses.
A brief overview of current artificial vision projects is presented below, along with
project websites and sample reference papers where available.
3.3.1 Retinal Systems
3.3.1.1 University of Southern California (Mark Humayun, Gislin Dagnelie, Eugene de Juan) [37]
Ophthalmologists at the University of Southern California (Doheny Retina Institute) have
implanted permanent retinal prostheses into several patients, as part of an FDA-approved
feasibility trial. A wafer-thin silicon microchip, embedded with photosensor cells and
electrodes, is powered by an external laser beam. Photosensor cells receive and convert
light images from the pupil to electrical impulses. These impulses can then drive action
potentials in the remaining ganglion cells of patients with retinal disease.
http://www.usc.edu/hsc/doheny/ (accessed 21/1/05)
3.3.1.2 MIT-Harvard (Joseph Rizzo, John Wyatt), USA [79]
This is a joint collaboration between the Massachusetts Eye and Ear Infirmary and
the Massachusetts Institute of Technology. Their prosthesis consists of a power
source, a small, fixed-direction laser with 820 nm wavelength, and a data source, a
tiny charge-coupled-device (CCD) camera whose output amplitude modulates the
laser beam. A signal-processing microchip in the data source converts the visual
information to an electronic code that is carried on the laser beam. Both the power
and data source is mounted on a pair of spectacles.
http://www.bostonretinalimplant.org/ (accessed 1/6/04)
3.3.1.3 Tübingen University (Eberhart Zrenner), Germany [87] and Bonn University (Rolf Eckmiller), Germany [26]
These two research groups are funded by the German Federal Ministry of Education
and Science. In the SUB-RET (Tübingen) approach researchers are working on a
device consisting of microphotodiodes which are to be placed underneath the retina
to stimulate postsynaptic retinal cells directly by converting light to electric energy.
In the EPI-RET (Bonn) approach scientists develop a microcontact array which is
mounted onto the retinal surface to stimulate retinal ganglion cells.
http://www.uak.medizin.uni-tuebingen.de/depii/groups/subret/
http://www.nero.uni-bonn.de/ri/retina-en.html (accessed 1/6/04)
3.3.1.4 Nagoya University (Tohru Yagi), Japan [109]
The research at Nagoya University could be termed biohybrid, in that there is a
combination of biological and man-made elements in the construction of the implant.
The research aims are to develop devices in which cultured neural cells and a
photoelectric device are combined. This technique is similar to other retinal implant
techniques in that electrical components are being placed directly into contact with
the retina. However the use of nerve cells as a part of the implant make for a
potentially more reliable system.
http://www.nidek.com/artificial_vision.html (accessed 1/6/04)
3.3.1.5 Optobionics (Alan & Vincent Chow), USA [72]
This device is powered solely by incident light and does not require the use of external
wires or batteries. An artificial silicon retina is implanted under the retina (subretina
space) and is designed to mimic the photoreceptor layer. The research effort is mentioned
here for completeness, although there is no image sensing device (shown in Figure 2.1)
and hence no opportunities for perception enhancement by image processing.
http://www.optobionics.com (accessed 1/6/04)
3.3.2 Optic Nerve Systems
3.3.2.1 Université Catholique de Louvain (Claude Veraart), Belgium [100]
The techniques are based on optic nerve stimulation using a self-sizing spiral cuff
electrode. In preliminary testing to date, with the help of blind human volunteers,
researchers have been able to produce phosphenes throughout the visual field.
Stimulation at this location would be suited to patients who have non-functioning rods
and cones in the retina but a healthy optic nerve.
http://www.md.ucl.ac.be/gren/Projets/vision.html (accessed 1/6/04)
3.3.2.2 University of New South Wales/University of Newcastle (Nigel Lovell, Gregg Suaning), Australia [94]
The visual prosthesis system consists of a camera, StrongARM microprocessor
system and an implantable electrode array connected by a radio frequency link. This
prevents the need to pass wires through the skin. The current design consists of a 10
x 10 array of electrodes, giving the potential for 100 stimulation sites. Recent work
has tended to redirect this project from optic nerve electrode cuffs towards retinal-
stimulation.
http://rambler.newcastle.edu.au/~greggs/ (accessed 1/6/04)
3.3.3 Visual Cortex Systems
3.3.3.1 Dobelle Institute (William Dobelle), Portugal [20]
The research team has successfully implanted a 64-electrode array on the visual cortex of
a patient using wires passing through the skin. The patient is claimed to be able to read
two inch tall letters at a distance of five feet, representing a visual acuity of about 20/400.
Although the electrode array produces tunnel vision, the patient is also claimed to be able
to navigate in unfamiliar environments.
http://www.dobelle.com/index.html (accessed 1/6/04)
3.3.3.2 National Institutes of Health - Washington D.C. (Edward Schmidt), USA [86]
NIH researchers have implanted microelectrode arrays into the visual cortex and
recorded stimulation parameters and characteristics of artificially created
phosphenes. The Neural Prosthesis Program within the Division of Stroke, Trauma
and Neurodegenerative Disorders addresses many types of neural stimulation, not
just related to nerves in the visual system.
http://www.ninds.nih.gov/npp/ (accessed 1/6/04)
3.3.3.3 University of Utah (Richard Normann) USA [66]
This research is based at the John Moran Laboratories in Applied Vision and Neural
Sciences at the University of Utah. The design of the cortical prosthesis employs
penetrating microelectrodes rather than surface electrodes. The developers of
penetrating cortical electrode arrays claim that the closer spacing of electrodes
compared to surface cortical arrays result in increased spatial resolution and lower
currents to induce visible perception, and are therefore less likely to induce seizures
from overstimulation.
http://www.bioen.utah.edu/cni/projects/blindness.htm (accessed 1/6/04)
3.3.3.4 University of New South Wales (John Morley, Minas Coroneo) Australia
An animal model has been developed where one side of the brain is electrically
stimulated and responses measured in the other side of the brain. Funding sources for the
research include the National Health and Medical Research Council and the Brain
Foundation.
http://medicalsciences.med.unsw.edu.au/medsciences.nsf/website/researchactivities.labor
atories.vision_cognition (accessed 1/6/04)
3.4 Image Processing specifically related to Bionic Eye Projects
In this section the hardware and some image processing considerations are described
for research specifically relating to artificial vision. Research is described in 4 areas:
1. vision chip developments;
2. CCD-based systems;
3. receptive field modelling;
4. multiple resolution work.
3.4.1 Vision Chip Developments
Researchers at the University of Newcastle and University of New South Wales in
Australia (refer Section 3.3.2.2) use an OmniVision CMOS image chip to acquire
visual scenes for their portable prosthesis prototype [95]. They have proposed a
regular hexagonal mosaic of electrodes in the implantable array rather than a
rectangular layout, which allows better separation between electrodes [31]. The
expectation is that this will increase visual acuity and minimize aliasing in the
evoked artificial image. The researchers also conjecture that, from an information
theory standpoint, modulating the size and modulating the intensity of a phosphene
are psychophysically equivalent.
Japanese researchers at Nagoya University (refer Section 3.3.1.4) have developed a
vision chip/artificial retina comprising parallel arrays of simple analogue circuits
together with parallel array sensors [110]. The authors review previous
developments in vision chips, and mention that these chips have not experienced
wide application as the outputs of these chips are not sufficiently accurate under
natural illumination due to low sensitivity of photosensors. They have overcome this
problem with a light-adaptive one-dimensional 100 pixel line sensor. Spatial
filtering properties of the vision chip have been tested by mounting a camera lens to
focus an image on the photosensor array. The spatial distribution of the output
voltages of the chip showed a Laplacian-of-Gaussian-like receptive field.
The team above have recognised the importance of depth information in visual
information processing and have consequently incorporated depth perception in the
vision chip [111]. Again, the chip has 100 analogue sensors connected laterally by
resistors, giving a one-dimensional (line) 100 pixel sensor, which allows parallel
processing in real time. The output of the circuit is a serial signal representing depth.
Depth is computed from the disparity between two vision chips (fitted with lenses)
which are 120mm apart and turned inward at 6 degrees. Zero crossings (edges) are
detected from the left and right vision chips and used in determining the disparity.
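For intuition, under the simplifying assumption of parallel (non-verged) cameras, depth follows from triangulation; the 6 degree vergence of the actual chips complicates the exact geometry, and the focal length below is assumed, so this is illustrative only:

def depth_from_disparity(disparity_px, baseline_mm=120.0, focal_px=500.0):
    # Parallel-camera triangulation: depth = baseline * focal / disparity.
    # baseline_mm matches the 120 mm chip separation; focal_px is an
    # assumed lens focal length expressed in pixels.
    return baseline_mm * focal_px / disparity_px

print(depth_from_disparity(10.0))   # a 10 px disparity maps to 6000 mm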
Further work on vision chips is presented by Kyuma et al [46]. An impediment to
real time processing has been the separation of image sensing (camera) and image
processing (computer) functions. Consequently, system performance is limited by
slow camera frame rate and low transmission rate between the camera and computer.
‘Artificial retina’ chips developed by the authors are described, that can
simultaneously sense and process images ie. more akin to the parallel real time
processing of the human visual system. These artificial retinas consist of a two
dimensional variable sensitivity photodetection cell array, with sensitivity similar to
commercially available CCDs. The chips are 12mmx12mm (256x256 resolution) or
6.5mmx6.5mm (32x32 resolution), have a dynamic range of 40dB (input light
power) and variable frame rate (1msec – 1000msec). A variety of on-chip image
processing can be achieved by changing a control voltage pattern on the chip. These
processing functions include image sensing, edge extraction, image smoothing,
random access (only a partial image projected onto the chip), pattern matching, and
image compression/recognition. The application of these vision chips to prostheses
is not stated by the authors beyond ‘man-machine interfaces for multimedia
systems’, and instead general industrial applications are cited eg. automotive,
avionics. The authors have used the vision chips to control computer game
characters by hand gesture recognition.
3.4.2 CCD-based Systems
The processing hardware for a retinal prosthesis project at the University of Southern
California (refer Section 3.3.1.1) is an FPGA/EPLD (Field Programmable Gate
Array / Electrically Programmable Logic Device) [18]. The device allows easier
implementation of highly parallel algorithms/hardware needed for concurrent
processing than a single processor. Three SRAM memories serve as frame buffers to
support the storage of images delivered from the camera. Two SRAMs support
dual-buffered video, where a current image in transit from the camera can be stored,
while a prior image can be simultaneously processed from a second memory. Once
the camera has completed delivery of a transit image, the roles of the memories are
reversed, so that the new image is processed, while a fresh image is stored in the
alternate RAM. A third frame buffer is available for intermediate computations that
may occur in algorithms such as spatial convolution. An 8-bit pipeline A/D
converter supports cameras which provide only analogue video. The whole board
can be worn in a shirt pocket or clipped to a belt.
The Humayun team make some interesting comments regarding image processing
required for prosthetic devices to:
• match the crude resolution of the implant array
• accommodate the limited dynamic range of the array
• simplify image aspects such as brightness and colour gradients which cannot
be faithfully represented by the array [17].
They conjecture that the sacrifice in resolution (spatially and in contrast) would be
acceptable in view of a wider operating range (field size and light/dark adaptation)
that would be achieved. Further they comment that the wearer of a prosthesis would
achieve a significant degree of learning, compensating in higher visual processing for
the detail lost at the input. A comparison is made between the processing required
for a retinal versus a cortical prosthesis. At the level of the visual cortex, the neural
information stream has already undergone several transformations (dynamic range
compression, edge and colour recoding, and translation of analogue information into
spike trains) and thus the processor for a cortical prosthesis may have to be more
powerful and more trainable than that for a retinal prosthesis.
Researchers involved with optic nerve stimulation (refer Section 3.3.2.1) describe a
resolution reduction algorithm based on image segmentation by growth of zones
(requiring less computational power than other segmentation algorithms) and its
implementation in a low-power VLSI device [28]. The authors propose an algorithm based on the
extraction of the main features of the image, with transmitted information being only
the position and form of the relevant entities in a scene. However, their current
implementation appears only to be based on intensity. They propose to give a blind
person the ability to control the segmentation level (adjustable threshold values),
producing areas of uniform illuminance matching corresponding objects or surfaces.
Due to the nature of their segmentation algorithm, they report undesirable fast
transitions (eg. merging of 2 zones) when segmenting with successive images.
Harvey and Sawan [32] describe their efforts in two areas: a cortical implant (silicon
die mounted on the back of an electrode array) and an external system (scene
acquisition, processing, RF communication). The completed prototype allows the
testing of various stimulation algorithms and strategies. CCD array (336x244 pixels)
output is sent via an analogue to digital converter to a commercially available
processor board. The extent of image processing appears to be resolution reduction
to 25x25 pixels and colour histogram equalisation.
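Histogram equalisation itself is a standard operation; a greyscale sketch is given below (the cited system equalises a colour histogram, so this is a simplification):

import numpy as np

def equalise(img):
    # Global histogram equalisation of an 8-bit greyscale image:
    # map grey levels through the normalised cumulative histogram.
    hist, _ = np.histogram(img.flatten(), bins=256, range=(0, 256))
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min()) * 255.0
    return cdf[img.astype(np.uint8)].astype(np.uint8)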
Werblin and Jacobs [107] propose a cellular nonlinear network as a retinal camera,
using photodetectors/conventional CCD as input. Various image processing
operations can be performed across the CNN array by changing values of a set of
amplifiers. The authors have used the CNN system to predict patterns of activity at
retinal output. Beneficial features for a retinal camera incorporating a CNN array are
proposed:
• Battery powered chip array
• Onboard stored program of image processing algorithms that can be invoked
remotely or on the basis of the characteristics of the visual scene (twilight or
bright sun, for reading or navigation, high or low resolution)
• Variety of output available: edge detection, motion detection, contrast
enhancement
• Background normalised, giving high contrast near the ambient background
level
3.4.3 Receptive Field Modeling
German researchers developing retinal prostheses (refer Section 3.3.1.3) have
proposed a system that approximates receptive field properties of primate retinal
ganglion cells [26]. While still preliminary in nature, the research is based on a set
of individually tuneable spatiotemporal receptive field filters, acting on input from a
photosensor array. Each receptive field filter is individually tuneable to a wide range
of physiologically plausible spatial and temporal frequencies. Details of the
receptive field function proposed are contained in [6]. Input data is fed into 2
distinct filter pathways, one for the centre computation and one for the surround.
Each pathway performs a spatial scalar product of the pixel data, and a two
dimensional Gaussian, whose width determines the spatial extent of the receptive
field. The resulting signals are then processed by a temporal low pass filter. The
surround pathway signal can be optionally delayed, and then signals from both
pathways converge at a mixer component. Finally a gain factor enables range
adaptation and switching between on-off and off-on (centre-surround) behaviour.
The resulting signal is then used to stimulate nerve cells. The authors also describe a
concept for training the system using visual perception feedback from human
subjects: the subject suggests functional changes to the system via a neural net
module, based on the difference between the actually perceived visual pattern and the
expected perception. This feedback is anticipated as an essential step in the future
for tuning a prosthesis to the needs of an individual.
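The spatial part of such a centre-surround filter can be sketched as a difference of Gaussians, as below; the temporal low-pass and delay stages described above are omitted, and the sigma values are illustrative:

import numpy as np
from scipy.ndimage import gaussian_filter

def centre_surround(img, sigma_centre=1.0, sigma_surround=3.0,
                    gain=1.0, on_centre=True):
    # Centre and surround pathways as Gaussian-weighted sums of the
    # pixel data, combined at a 'mixer' with a gain factor; switching
    # on_centre flips between on-off and off-on behaviour.
    centre = gaussian_filter(img.astype(float), sigma_centre)
    surround = gaussian_filter(img.astype(float), sigma_surround)
    response = centre - surround if on_centre else surround - centre
    return gain * response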
The hardware for the receptive field processing above is described in [88]. Image
acquisition consists of a CMOS image sensor chip with a high dynamic range with
respect to illumination intensity (140dB). This full dynamic range can be used
within a single image frame without any distortions like blooming, smearing or time
lag. Two designs have been developed – a 128x128 pixel sensor arranged on a
hexagonal grid structure and a rectangular 400x300 pixel sensor. Signal processing
is carried out on-chip, so an additional frame buffer is not required unlike
conventional CCD devices. The spatial filter used for on- and off-centre receptive
field functions is inserted between the sensor chip and the signal processor. The
developers propose to house the sensor chip in a package with integrated focusing
optics, mounted on a spectacle frame, along with the telemetry unit required for
wireless transmission of stimulus data and power for electrode stimulation.
Similar hardware design has very recently been completed in Switzerland [112]. The
Swiss researchers have manufactured a thinned CMOS chip which is intended to be
placed in the sub-retinal space and remotely powered by an external coil. The output
from the system mimics the ganglion response to light: bipolar voltage pulses with
light-modulated frequency. The chip has not yet been tested physiologically.
The significance and importance of visual receptive fields in visual processing is
supported by Hungenahally [38,39], who has attempted to emulate visual receptive
fields and their implementation for image processing in an artificial retina. He has
proposed a family of differentio-aggregation functions for information extraction
from two dimensional spatial images. He demonstrates the use of these mathematical
functions in removing sensory noise from medical images and in extracting
dimensionally selective information.
3.4.4 Multiple Resolution Work
Amerijckx et al [2] describe a remapping algorithm using two CCD cameras and its
implementation on a VLSI chip. One tele-lens camera produces high resolution in the
central area of the image, while a second wide-angle camera captures peripheral
image areas. This system processes these two images in real time to obtain a
resulting image with high resolution at the centre, similar to the central part of the
retina.
Belgian researchers (refer Section 3.3.2.1) have extended their work on prosthetics to
sensory substitution devices converting vision to sound [11]. The image processing
involved in their models is based on their identification of the main features of the
primary visual system: lateral inhibition and graded resolution. Lateral inhibition is
implemented by an edge detection filter and graded resolution is modelled using a
multi-resolution artificial retina based on the filtered image. An example of this
graded resolution is given – in a grid of 8x8 large pixels, the 16 central pixels are
each divided into four pixels to build a medium resolution grid of 8x8 pixels. In this
second grid, the 16 central pixels are again divided into four pixels to build a high
resolution grid of 8x8 pixels etc.
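The nested-grid construction can be sketched as follows, assuming a square image whose side is divisible by the grid sizes involved; this illustrates the idea rather than reproducing the authors' implementation:

import numpy as np

def block_average(img, n):
    # Average an image onto an n x n grid, expanded back to full size.
    h, w = img.shape
    small = img.reshape(n, h // n, n, w // n).mean(axis=(1, 3))
    return np.kron(small, np.ones((h // n, w // n)))

def graded_resolution(img, base=8, levels=3):
    # Render the whole image as a base x base grid, then re-render the
    # central half-sized region at double the pixel density, and so on.
    out = block_average(img.astype(float), base)
    h, w = img.shape
    y0, x0 = 0, 0
    for _ in range(1, levels):
        y0, x0, h, w = y0 + h // 4, x0 + w // 4, h // 2, w // 2
        out[y0:y0 + h, x0:x0 + w] = block_average(
            img[y0:y0 + h, x0:x0 + w].astype(float), base)
    return out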
This foveal vision representation has also been implemented in a head mounted
display unit coupled with an eye tracking system [42]. The authors claim that
conventional HMDs suffer from a narrow field of view and low resolution and
consequently cannot be used for applications such as tele-microsurgery. Their HMD
displays high resolution at a subject’s view point (obtained by an eye tracker) and
low resolution at the periphery, thereby displaying images at a higher perceived
resolution over a wider viewing angle.
The multiple resolution approaches described above may have application where
bandwidth is limited. However, it is most likely that the pixel density in a device
would be fixed at the maximum possible under manufacturing and size constraints.
Improved scene understanding is expected when the entire electrode layout is used
rather than applying low resolution image sections to some parts of the implant.
3.5 Digital Imaging Applicable to Visual Prostheses

This section discusses digital imaging applied to visual prostheses and reviews useful
image processing methods that could enhance visual information presented to
visually impaired users.
3.5.1 Digital Imaging and Human Vision
There are many parallels between digital imaging environments and the human
visual system. Visual sensations are preprocessed from over 100 million rods and
cones (the photoreceptors in the retina), to approximately 1.5 million optic nerve
fibres, with conduction time from sub-retina to the lateral geniculate nucleus in the
order of 1-5ms [102]. The capability of the human visual system for resolving fine
detail and edges and ignoring uniform regions has been shown to be biologically
hard-wired into our retinas. Connected directly with the rods and cones of the retina
are two layers of processing neurons that perform an operation very similar to the
Laplacian operator that highlights the points, lines and edges in an image and
suppresses uniform and smoothly varying regions [83]. Furthermore, the processes
that occur in the visual cortex when a person examines a visual scene make use of
feature extraction and object recognition processes, mimicked by computer vision
techniques [47]. This complexity suggests that in artificial vision systems, image
processing and manipulation would have a more significant role than simply a
camera and display package.
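As an illustration of the Laplacian analogy above (a toy sketch in Python/NumPy with
SciPy, not a model of retinal circuitry), convolving an image with a 3x3 Laplacian
kernel responds strongly at points, lines and edges and is near zero over uniform
regions:

import numpy as np
from scipy.ndimage import convolve

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0                 # a uniform bright square

response = convolve(img, laplacian, mode='nearest')
# response is ~0 inside the square and on the background (uniform areas)
# and non-zero only along the square's border, i.e. at the edges.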
The image processing aspects of the artificial vision systems under development are
largely based on manipulating a pixelised representation, where a scene is
represented as an organised phosphene array. For example, the mandrill image
represented in various pixelised versions of different spatial resolutions is shown
below in Figure 3.2.
Figure 3.2: Pixelised vision at 64x64, 32x32, 16x16 and 8x8 pixels; top – greyscale, bottom – binary images
Each picture element, or pixel, would ideally correspond to a stimulating electrode in
the implant. The top row shows greyscale images which are unlikely in prosthesis
prototypes. More likely are binary images (bottom row), where a pixel is either ON
or OFF. The viewing distance also affects image interpretation - the coarser
resolution versions above (16 x 16, 8 x 8) are more comprehensible from greater
viewing distances.
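A minimal sketch of producing such pixelised versions, assuming a NumPy greyscale
image and a simple global mean threshold for the binary case (the thesis does not
prescribe a particular implementation):

import numpy as np

def pixelise(img, n, binary=False):
    """Block-average an image down to an n x n phosphene grid."""
    h, w = img.shape
    bh, bw = h // n, w // n
    grid = img[:bh * n, :bw * n].reshape(n, bh, n, bw).mean(axis=(1, 3))
    if binary:
        grid = (grid >= grid.mean()).astype(float)  # each pixel ON or OFF
    return grid

img = np.random.rand(256, 256)              # stand-in for the mandrill image
grey_16 = pixelise(img, 16)                 # 16 x 16 greyscale version
binary_16 = pixelise(img, 16, binary=True)  # 16 x 16 binary version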
The pixels shown in Figure 3.2 are drawn as squares touching adjacent pixels
along each border. Patients undergoing electrical stimulation report visual sensations
as a ‘spot of light’. Therefore it may be more representative to model pixels as
circular with a gap between adjacent pixels as shown in Figure 3.3 below.
Figure 3.3: Circular pixelised vision – 25x25 greyscale (square pixels), 25x25 circular greyscale, 25x25 circular binary
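A minimal sketch of rendering a grid this way, assuming square cells with each value
drawn as a disc of matching intensity (the cell size and disc radius are illustrative
values):

import numpy as np

def render_phosphenes(grid, cell=10, radius=4):
    """Draw each grid value as a circular phosphene with a gap to neighbours."""
    n, m = grid.shape
    canvas = np.zeros((n * cell, m * cell))
    yy, xx = np.mgrid[0:cell, 0:cell] - (cell - 1) / 2.0
    disc = (xx ** 2 + yy ** 2) <= radius ** 2   # circular mask for one cell
    for i in range(n):
        for j in range(m):
            canvas[i * cell:(i + 1) * cell,
                   j * cell:(j + 1) * cell][disc] = grid[i, j]
    return canvas

phosphene_image = render_phosphenes(np.random.rand(25, 25))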
Given that a limited number of stimulating electrodes is physically possible, it is
evident that some type of information content enhancing processing is required, as
described later.
An immediate problem is selecting a useful number and pattern of phosphenes for a
prosthesis. Until clinical testing progresses with human subjects, the degree of
success of an implant in creating phosphenes in the visual field of the subject is
unknown. For example, if a patient is fitted with a 100-electrode implant, will they
be able to see more or less than 100 phosphenes in their visual field? To date, there
are no physiological stimulation models in the literature that predict the number of
phosphenes that will be produced from a given number of electrodes, and reported outcomes vary:
the Dobelle research team [20] report that each implant electrode produces one or
more phosphenes in the visual field, while Schmidt et al [86] report that 34
phosphenes were produced from 38 electrodes.
The issue is compounded by the different physical stimulation strategies followed.
In addition to varying the stimulation parameters, such as current amplitude or pulse
width duration, there is the potential to make use of different current flow and return
paths. For example, a single stimulating electrode and single return electrode may
give rise to a small high intensity dot in the visual field. Using the same single
stimulating electrode but with two or several return electrodes may give a different
charge density profile on the tissue which may give rise to a broader low intensity
patch of light (see Figure 3.4).
Figure 3.4: Alternate stimulation strategies – (a) single electrode pair; (b) single stimulating electrode and multiple return electrodes
Frequency encoding, where an image is transformed to the frequency domain, might be
another possibility. Here, image information is represented as signals having
various amplitude, frequency and phase characteristics, which could be delivered to
different locations within an electrode array. This has similarities with auditory
implants, where an audio signal is split into frequency bands and delivered to
different locations within the cochlea for improved perception [4,65]. While
important to the final design of prostheses, the issues of prediction of phosphene
numbers and patterns described above require physiological testing and modelling
and thus would be impacted by the other physiological properties of implants. These
aspects, such as biocompatibility and encapsulation of the internal electronics, and
evaluation of acceptable stimulation waveforms that prevent tissue damage, are
outside the scope of this thesis.
3.5.2 Image Characteristics and Visual Understanding
There is a wide range of features, or characteristics of digital images that give us
visual understanding to varying degrees [83]. However, many characteristics are not
compatible with the modelling of anticipated evoked visual sensations of visual
prostheses. For example, it may not be physically possible to control the colour of a
phosphene – making it red on one stimulation, blue on the next, then green, and so
on. Colour processing, along with other future possibilities, is discussed further in
Chapter 8.
Image characteristics that are compatible with simulating artificial vision and have
variations that can be tested are the following:
• Spatial resolution
• Brightness
• Contrast
• Edges
• Distance information
• ‘Importance’ mapping (using the notion of combining several of the above
factors to add value to the information content of an image).
Results of subjective tests using these image characteristics are presented later in
Chapter 4, and the next sections provide background to these features.
3.5.2.1 Spatial Resolution
Maximising the number of electrodes in an implant to give high spatial resolution would
certainly enhance the information content of images. Early researchers in visual
prosthesis systems conjectured the following three approaches in the 1970s [99]:
1. small matrix size – coded information: Due to the small matrix size, 10x10 or less,
information must be categorised and encoded to maximise information delivery. This
system cannot effectively provide a direct two dimensional representation of space,
but must extract the significant environmental features and present them in coded
format. This approach places a heavy demand on the learning capacity of the user.
2. intermediate matrix – preprocessed input: with a matrix size of between 20x20 (400
points) and 32x32 (1024 points), an effective two dimensional display can be
achieved. Simulation experiments carried out on sighted viewers [8] suggested that a
phosphene matrix containing 600 points (24x24) would be sufficient to permit a
reading speed of 120 words per minute, where 10 letters were presented at a time to
subjects. The combination of a suitable field range for detection of peripheral hazards
with adequate central resolution for useful object identification presents a severe
challenge at this matrix size.
3. maximum density matrix – direct spatial display: a 4000 point (64x64) display can
provide a fairly good image of a face.
Other simulation work to determine how many electrodes would be needed to provide
useful vision has been done by Cha et al at the University of Utah [12,13,14]. Normally
sighted human subjects wore a video camera attached to a head-mounted visor which
simulated pixelised vision. The tests covered visual acuity, reading speed and mobility
performance. Images were pixelised and projected on to a small monochromatic monitor.
To create the illusion of phosphenes, perforated masks that represented different pixel
densities and field sizes were placed between the eye and the monitor. The conclusions
drawn from these studies were:
• The most important factor in visual acuity was pixel density (spacing). However
the most important factor in reading speed was pixel number, not spacing.
• When using low density masks, acuity was increased with voluntary head
movements.
• A 25 x 25 array provided a visual acuity of 20/30, which allowed a reading speed
of 100 word/min and good obstacle avoidance.
More recent (2003) studies simulating vision performance at different spatial
resolutions have been carried out at Johns Hopkins University and the University of
Southern California. One study involved presenting pixelised face representations of
10x10 to 32x32 spatial resolution [96]. The researchers found that parameters such as
contrast, grid size, dot size, dot gap, drop out rate and greyscale resolution had a
significant effect on facial recognition speed and accuracy. In a separate study [33],
4x4, 6x10 and 16x16 electrode arrays were simulated in a number of performance
tasks including four choice orientation discrimination of a Sloan letter E, object
recognition and discrimination, a cutting task, a pouring task, symbol recognition and
two reading tasks. Subjects performed best using the 16x16 array which
corresponded to a visual acuity of 20/420, although simple objects and symbols
could still be recognised sporadically at the lowest resolution array.
Thus, given that an implant could deliver sufficiently high phosphene numbers to the
visual field, the ability to read and navigate around obstacles is achievable. It should
be noted that the performance stated above is based on the assumption that each
electrode produces a corresponding phosphene in an ordered array in the visual field.
While more electrodes ideally equate to improved perception, the upper limit on
electrode numbers may be determined by the small space available to implant the
array, along with the minimum electrode spacing required to achieve adequate
phosphene resolution.
This is a manufacturing constraint and is outside the scope of this thesis. Spatial
resolution is mentioned here as a technique to explore low quality image perception.
3.5.2.2 Brightness Modulation
Multiple brightness levels in images may be highly informative. Figure 3.5 shows
how brightness modulation might be simulated for the mandrill image. The top
images show circular pixelated versions using eight and three greylevels,
compared with the original at 256 greylevels. The bottom images show variations of
a halftoning technique [82] with different pixel radii and dot orientation. The goal of
halftoning is to preserve the visual impression of grey tones in spite of the fact that
pixel-by-pixel the image is ideally black or white. Increasing the number of
greylevels is considered to be equivalent psychophysically to increasing the
dot/phosphene size [49].
Figure 3.5: Simulating the effect of modulating phosphene brightness – top: original 128x128, 25x25 with 8 grey levels, 25x25 with 3 grey levels; bottom: halftoned at 4 pixel radius/45 degrees, 6 pixel radius/45 degrees, 6 pixel radius/0 degrees
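A minimal sketch of the grey-level reduction in the top row, assuming a 0-255 NumPy
image and uniform quantisation (the halftoning variants of [82] are not reproduced
here):

import numpy as np

def quantise(img, k):
    """Uniformly quantise a 0-255 image to k grey levels."""
    levels = np.linspace(0, 255, k)
    idx = np.round(img / 255.0 * (k - 1)).astype(int)
    return levels[idx]

img = np.linspace(0, 255, 64 * 64).reshape(64, 64)
eight_grey = quantise(img, 8)   # 3-bit greyscale
three_grey = quantise(img, 3)   # black, mid-grey, white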
Physiologically, the brightness of induced phosphenes has been found to vary with
stimulus amplitude, frequency and pulse duration [86] and, in other studies, to be
logarithmically related to stimulus current amplitude [35]. This suggests that in
principle, several (2 – 4) bits of greyscale/size variance should be achievable. For
early prototypes however, only a 1-bit grey scale might be possible, producing only
binary (black and white) images.
3.5.2.3 Contrast
Contrast affects the detection of many kinds of image features (eg. regions, edges,
textures) and is known to be a fundamental early vision characteristic in human
vision [3]. In some visual environments, such as reading black text on a white
background, it may be better to deliver negative or inverse images. In any case, the
ability to enhance contrast may prove useful for highlighting image content that
would otherwise be much harder to see.
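As a sketch of these two operations, assuming a 0-255 NumPy greyscale image (an
illustrative implementation, not one prescribed by the thesis experiments):

import numpy as np

def stretch_contrast(img):
    """Linearly rescale the image to span the full 0-255 range."""
    lo, hi = img.min(), img.max()
    return (img - lo) / max(hi - lo, 1e-9) * 255.0

def invert(img):
    """Produce the negative (reverse contrast) image."""
    return 255.0 - img

img = np.clip(np.random.normal(128, 20, (64, 64)), 0, 255)
enhanced = stretch_contrast(img)   # full 0-255 range restored
negative = invert(img)             # reverse-contrast version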
3.5.2.4 Edge Detection
In scene recognition and interpretation, edges play a fundamental role [56]. Edges
assist in the formation of a primal sketch to derive shape information from images.
Also there are biological mechanisms for detecting oriented zero-crossing segments
(edges) in retinal ganglion cells. An essential function of an artificial vision system
would be to highlight the edges of objects. The Dobelle research team [20] expect
improved results for patients with the implementation of Sobel filters for edge
detection. The prominence of uniformly shaded areas could be decreased, while
edges that might otherwise be hardly noticeable could be highlighted.
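A minimal sketch of Sobel edge detection of this kind, using SciPy's Sobel operator
(the Dobelle team's actual implementation is not described in the source):

import numpy as np
from scipy.ndimage import sobel

img = np.zeros((64, 64))
img[20:44, 20:44] = 255.0            # bright square on a dark background

gx = sobel(img, axis=1)              # horizontal gradient
gy = sobel(img, axis=0)              # vertical gradient
edges = np.hypot(gx, gy)             # gradient magnitude: strong at borders
edges = edges / edges.max() * 255.0  # rescale for display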
3.5.2.5 Distance Indication
An artificial vision system that conveys the distance to an object would be
particularly useful. While sonar distance aids have been common for many years,
the auditory signal emitted by these devices can interfere with important surrounding
environmental noises. The ability to convey distance visually rather than audibly
would therefore be desirable. Distance information can be obtained by computing
depth from the disparity between two cameras, or by using ultrasonic or laser
rangefinders.
Distances could then be mapped to intensities, where the nearest object is shown
with the highest intensity. If the device display only supports a 1-bit grey scale, only
the nearest object need be displayed. This distance 'mode of operation' could be
quite useful in combination with a standard image of luminance intensities.
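A minimal sketch of this distance mode, assuming a depth map is already available
from one of the sources above; the function names and the nearest-object tolerance
are illustrative assumptions:

import numpy as np

def distance_to_intensity(depth):
    """Map depth to intensity (0-255), nearest points brightest."""
    d = (depth - depth.min()) / max(depth.max() - depth.min(), 1e-9)
    return (1.0 - d) * 255.0

def nearest_only(depth, tolerance=0.1):
    """1-bit display: keep only pixels within `tolerance` of the minimum depth."""
    return (depth <= depth.min() + tolerance * np.ptp(depth)).astype(float)

depth = np.random.rand(25, 25) * 5.0   # stand-in depth map in metres
greyscale_view = distance_to_intensity(depth)
binary_view = nearest_only(depth)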
3.5.2.6 Importance Extraction
A feature of an efficient artificial vision system would be importance extraction - to
present only the most important object in a scene and disregard the uninteresting or
homogenous elements. Section 2.4 covered several region-of-interest algorithms
which aim to predict where the human eye fixates on an image. An extension of
these algorithms is the concept of assigning an importance score or weighting to each
area in an image to generate an “importance map” [52,68]. This importance ranking
has previously been applied in visually lossless compression, where improved
compression ratios have been achieved with high perceived image quality.
This importance ranking could be used in artificial vision systems to identify the
most important object in a scene and present only this object, discarding the
remainder. The definition of importance may comprise some combination of motion,
location, contrast, contrast, size and shape. The components that comprise
“importance” may also be adjusted for different viewer situations eg. home,
entertainment and mobility.
Importance ranking could also be used to optimise the bandwidth for data transfer in
artificial vision systems. If the bandwidth is limited, one could apply variable
resolution to the image on the basis of importance. Homogenous or uninteresting
scene elements would be displayed at low resolution, while important areas, such as
edges and moving objects, would be displayed in high resolution. Thus a decreased
bit-rate could be used to present an image of high perceived quality.
Figure 3.6 depicts the importance map process as an example of extracting important
areas from within an image. An image is first segmented into regions of similar
properties. A split and merge segmentation algorithm is used based on grey level
variance. Feature maps/images are then constructed from the segmented image
corresponding to five features known to influence attention:
1. closeness – the closer an object, the more important
2. intensity contrast – regions of high intensity contrast from surrounding regions
are more important
3. shape – elongated regions are more important than round regions
4. size – the larger a region the more important
5. centralness – regions in the centre of the viewing area are more important
Each region in the feature map is assigned an importance score, normalised from 0
(not important) to 1 (very important). That is, lighter areas in the feature maps
should grab a viewer’s attention more than darker areas. An overall Importance Map
is then created by combining the five feature maps using a normalised sum of
squares, as indicated below:
R.I. = [ω1·(M1)² + ω2·(M2)² + ω3·(M3)² + ω4·(M4)² + ω5·(M5)²] / R.I.max    (Equation 2)

where:
R.I. = Region Importance
ωi = weight applied to feature map Mi
M1 = Closeness Map
M2 = Contrast Map
M3 = Shape Map
M4 = Size Map
M5 = Central Map
R.I.max = maximum Region Importance (normalising factor)
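A minimal sketch of the Equation 2 combination step (Python/NumPy), assuming
per-region feature scores have already been computed and using equal weights as a
default assumption (the thesis does not fix the weight values here):

import numpy as np

def region_importance(maps, weights):
    """maps: five arrays of per-region scores in 0-1; weights: five floats."""
    ri = sum(w * m ** 2 for w, m in zip(weights, maps))
    return ri / ri.max()   # normalise by the maximum region importance

# Illustrative feature scores for a hypothetical image of three regions:
closeness = np.array([0.9, 0.4, 0.1])
contrast  = np.array([0.7, 0.8, 0.2])
shape     = np.array([0.8, 0.3, 0.3])
size      = np.array([0.5, 0.9, 0.6])
central   = np.array([0.9, 0.5, 0.1])

importance = region_importance(
    [closeness, contrast, shape, size, central],
    weights=[1.0, 1.0, 1.0, 1.0, 1.0])   # equal weights as a default assumption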
Figure 3.6: Importance Mapping concept – the original image is segmented; five weighted feature maps (Closeness ω1, Contrast ω2, Shape ω3, Size ω4, Central ω5) are computed and combined into the final Importance Map
Figure 3.7 below shows a post and chain which would pose a hazard to a blind
person. A 16 x 16 resolution copy of the image is shown adjacent to the original. It
should be noted that the 16 x 16 image is shown in full grey-scale which is unlikely
to be possible in vision prostheses. Although one can discern the dark blob of a post,
the shadow of the post provides a confusing visual cue, and the safety chain attached
to the top of the post is not evident.
Figure 3.7: Safety Post enhancement with advanced image processing techniques – top: original and 16x16 image (full grey scale); bottom: 16x16 Importance Map and 16x16 Distance Map
The bottom left image in Figure 3.7 shows the enhancement provided by mapping
‘importance’ to intensity. The image is shown with 4 grey levels, as might be
achieved in vision prostheses. Regions assessed as important (high contrast, large in
size, long and slender, central to the image etc.) are represented with the highest
intensity. It is noted that the safety chain is now evident but the shadow of the post is
also present.
The bottom right image in Figure 3.7 shows the distance mapping discussed in
Section 3.5.2.5. The closest regions to the viewer are presented with the highest
intensity. It is noted that the chain is evident but the post shadow is not discernable.
Another example of the same processing with another outdoor scene is shown below
in Figure 3.8. It is believed that a beneficial image processing system would provide
several of these ‘modes of operation’ to gain as many visual cues from the low
quality image as possible. Experiments described in Chapter 4 will test this
conjecture.
Figure 3.8: Enhancing the information content of a low quality image of stairs – original; 16x16 image (full grey scale); 16x16 Distance Map; 16x16 Importance Map
3.6 Thesis Research Questions and Approach
In consideration of the visual prostheses literature presented in this chapter and
within the image quality framework described in Chapter 2, several issues can be
identified to drive research questions. These research questions are described in the
next section followed by an outline of how these questions will be addressed in the
remaining thesis chapters.
3.6.1 Image Processing Requirements
Within the scope of this thesis, there are several image processing requirements for
visual prostheses. Visual prostheses need to:
1. facilitate some recognition performance while bounded by an ultra low image
quality regime;
2. allow a user to gain as many visual cues from a scene as possible;
3. use simple low level processing to improve scene understanding;
4. convey maximum scene information; and
5. deal with different scene types.
These requirements drive several research questions around which the remaining
thesis chapters are based:
Q1: Although limited to low quality images anticipated from visual prostheses, can
recognition of some objects be achieved?
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
Q3: Can a model be constructed for basic information required for the interpretation
of a visual scene at low image quality?
Q4: Should the processing techniques be adjusted depending on the scene type?
In the low image quality domain where spatial resolution is limited, an effective
approach to improve scene understanding is Region-Of-Interest (ROI) modelling to
extract salient or important areas within an image. It is reasonable to expect ROI
processing to be an effective approach as these methods trim away information that
may not be relevant to scene understanding. Other application areas, such as image
compression, military imaging and advertising, use ROI processing to extract features
and regions where the human eye might fixate in an image. Therefore we expect that
such techniques could be incorporated into visual prostheses to trim away the large
amounts of data in an input image. Thus the limited number of display pixels
(implant electrodes) would be used most efficiently by presenting to blind users only
the important elements of a scene.
It is expected that ROI processing will provide an improved outcome over the
standard (or Base Case) type of processing in prostheses, which consists of
subsampling to match the spatial resolution of the electrode array and binarisation.
The Importance Map ROI technique discussed in Section 3.5.2.6 is selected for the
thesis experiments because it is computationally cheap and variations can be
constructed around a standard model to alter the appearance and hence the
interpretability of the final processed image.
It is also expected that a model can be constructed for basic information required for
the interpretation of a visual scene at low image quality. Image quality is
characterised thoroughly in the literature for high quality images but not for low
quality images.
3.6.2 Testing Method
The research questions will be tested by experiments with normally sighted viewers.
As explained at the commencement of the thesis in Section 1.3 - Scope, prosthesis
development in Australia is currently limited to animal models. Thus use of
normally sighted viewers is considered the only option for simulation studies at this
time.
Several simulation experiments are presented as follows:
Q1 (Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved?) – Ch4, Recognition experiments: Section 4.2, processing techniques compatible with visual prostheses; Section 4.3, recognition and the influence of image type.

Q2 (Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?) – Section 4.2, a computationally cheap region-based (Importance Map) method; Ch7, several ROI methods compared.

Q3 (Can a metric be constructed for basic information required for the interpretation of a visual scene at low image quality?) – Ch5, Quantifying Information Content: Section 5.3, information content model; Section 5.4, recognition and information content.

Q4 (Should the processing techniques be adjusted depending on the scene type?) – Ch6, scene specific imaging.

Table 3.1 – Thesis experiments
Q1 is tested through recognition experiments described in Chapter 4. In Section 4.2,
several image processing techniques which may lead to improved perception of low
quality images are assessed by normally sighted viewers. This testing is to obtain an
understanding of low quality image perception and uses several processing
techniques described in Section 3.5.2 as being compatible with visual prostheses.
Section 4.3 describes a separate experiment assessing perception of low quality
images and the influence of image type.
Q2 concerns Region-of-Interest (ROI) processing applied to low quality images.
ROI processing was presented in Section 2.4 as a powerful perception modelling tool
using a combination of early vision and cognitive effects. Section 4.2 experiments
establish the applicability of a computationally cheap region-based (Importance
Map) technique to low quality images. Several variations of this method are
compared with a pixel-based (Saliency Map) technique in Chapter 7.
The construction of a metric in response to Q3 is described in Chapter 5. This
Chapter expands previous work in the area of visual complexity (Section 2.5) and
links visual complexity with information content. A robust metric to predict
perceived information content is developed from one series of subjective data and
tested against additional data. Also correlations are made between subjective
information content and object recognition for low quality images.
Q4 concerns the influence of different environments for the visual prosthesis user.
The concept of tailoring image processing to the scene type is tested in Chapter 6.
There is a lack of fundamental theory relating to the specifics of image
understanding, and consequently the above research questions represent
opportunities to refine this knowledge by subjective testing. A variety of viewer
behaviour is expected due to individual preferences influenced by past experiences
and expectancies, and the instructions given to viewers also influence these
variations. In the experiments described in the following chapters, viewers were advised
that:
The images appear as just a range of blocks – you may not be able to see
anything in the images at all. However this quality level is similar to what a
blind person might see with a bionic eye.
A final comment relating to the subjective testing is that there were variations in
the experiment sample sizes due to the availability of volunteers, which ranged from
n = 20 to n = 247. The strength of the findings is influenced by the sample size
(results from the smaller samples are more suggestive than conclusive). In addition,
some experiments tested several factors simultaneously, so the sample assessing one
factor was much smaller than the total number of participants in the experiment. For
example, 225 participants spread across 9 image quality classes represents a sample
size of only 25 per class.
3.7 Chapter Summary
This chapter has described the research activities underway internationally in the
field of electronic visual prostheses. Research efforts were described for approaches
aimed at stimulating the retina, optic nerve and visual cortex. Image processing
aspects for many of these projects were also described.
The chapter also identified processing techniques that are compatible with the
anticipated evoked visual sensations of visual prostheses. These included spatial
resolution, brightness modulation, contrast, edges, distance information and
importance mapping.
Finally several image processing requirements for visual prostheses were identified
which drive research questions to be addressed by thesis experiments. An outline
was given of the subjective testing proposed for the remaining thesis chapters.
Chapter 4 Recognition Performance
4.1 Overview
This chapter describes two preliminary experiments aimed to explore low quality
image perception. It aims to answer the research questions:
Q1: Although limited to low quality images anticipated from visual prostheses, can
recognition of some objects be achieved?
It is anticipated that some recognition is possible and that different types of images
result in varied recognition.
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
There is reason to believe that ROI processing will trim away unnecessary
information resulting in improved perception.
Section 3.5.2 discussed some image characteristics that are compatible with the
anticipated evoked visual sensations of visual prostheses. Adjustment of these
characteristics may result in enhanced information delivery to blind users of visual
prostheses. However, the extent of their success is difficult to quantify without
experimentation. Therein lies the framework for the first experiment described in
Section 4.2.
The second experiment described in Section 4.3 aims to quantify recognition
performance for low quality images by constructing an envelope of recognition. The
effect of the type of image is also determined from this experiment.
4.2 Subjective Tests to Determine Useful Processing Methods
As acknowledged above, there is a need to quantify the perception performance of
adjusting various image characteristics for improving scene recognition and
understanding. This section describes psychophysical testing on possible operating
modes for an artificial vision system, to identify the most informative image
adjustments that could be made for improved understanding of picture content. The
experiments aim to assess the performance of the proposed processing techniques.
This assessment is achieved by presenting degraded images to normally sighted
viewers and asking them to identify the scene and make use of the data. The images
presented have varying levels of resolution, greyscale, edge detection, importance
extraction and distance mapping.
4.2.1 Methodology
The subjective testing was undertaken by way of a booklet questionnaire survey
issued to participants. The booklet contained 20 pages of test patterns. Each page or
test pattern contained between 4 and 9 images, which were different versions of the
same object. An example of a booklet page is shown in Appendix Section A.1. The
different versions represented the various processing methods (edge detection,
distance mapping, inverse image, importance mapping) which were to be compared
against each other. The subjects were asked to write a description of the object and
rank the top three images that they believed showed the object most clearly.
Participants were drawn on a voluntary basis from senior high school students.
School students were chosen as subjects due to the large numbers available and to
reduce the likelihood of familiarity with image processing issues (eg. holding a low spatial
resolution image at a distance to discern objects). The subjects had been given no
prior background information as to the nature of the images except that the pictures
‘may be similar to what a blind person might see with a bionic eye and are scenes
that people are likely to see when walking about’. Viewing conditions for the
experiment were not controlled.
4.2.2 Images Chosen
While there are several good image databases available to the computer vision
community, it was desired to produce unique images for the project which subjects
would not have seen before. This reduces the inconsistencies that a priori knowledge
of the images could introduce into the results.
The image set was composed of chairs, doorways, posts, steps, and faces which were
considered to form mobility hazards within a visually-impaired person’s
environment. Two different types of each hazard were included (see Figure 4.1).
Figure 4.1: Image set used in the psychophysical testing – chair 1, chair 2, doorway 1, doorway 2, face 1, post 1, post 2, steps 1, steps 2, face 2
Variations of image characteristics that are applicable to visual prostheses were
applied to the images. The spatial resolution and greyscale of images were
representative of current prototypes – 10x10, 16x16, and 25x25 pixel images with
either 2 (black and white) or 3 grey levels (black, grey, white). An inverse (reverse
contrast) was included along with an edge image, and Distance and Importance
Maps.
A phosphene mask was applied to each image to create the illusion of phosphenes
(ie. pixels were circular and did not touch each other along their borders). An
example of the types of images presented is shown below in Figure 4.2.
Figure 4.2: Image processing techniques used in the psychophysical testing – original; spatial resolution variations (25x25, 16x16, 10x10); inverse; 3 grey levels; edges; Distance Mapping; ‘Importance’ Mapping
Only two pages with the same object appeared in one booklet to minimise learning
effects. In establishing the order of the two images in the booklet, the object version
with the lower resolution was always presented first. The booklet page order was
chosen to ensure substantial differences in appearance between successive booklet
sheets (refer Appendix Section A.2).
4.2.3 Results
The questionnaire survey was completed by 174 high school students in their
penultimate year of study. In analysing the results of the questionnaire surveys, it
could not be assumed that a blank entry counted the same as ‘Don’t know’.
Therefore the sample size was reduced to exclude blank entries. The experiments
were designed to answer a number of questions which appear in italics in this
section.
Q. What were the most recognisable objects in the survey?
Object recognition was assessed by analysing the respondent’s guess of the object.
In the analysis, a range of descriptions was accepted when assessing the recognition
of objects, as the context or environment of the user would contribute perception
cues. Users of artificial vision systems would be (presumably) aware of their
surrounds and would consequently be able to place objects in their context. Also,
more powerful interpretation and increased understanding of a scene would be gained
from a rapid succession of images (ie. a movie versus a single image) and from
moving about to see how various objects interact.
Many of the responses in the survey indicated that the person was able to recognise
the object as having certain properties, but when it came to naming the object, the
description was wrong. These descriptions were deemed to be correct (contextual
recognition) given that the person interpreting the image would have knowledge of
the context of the object. Examples of this are ‘Post 1’, which some respondents
named as ‘cactus in the desert’. Its height and slender form identified it as an object
to be avoided. If the same image was viewed on a city street, the description would
be likely to be more representative of the actual object. Other examples are the face
images, wherein a respondent was able to recognise a face and head but associated
the wrong gender with the face. Appendix Section A.3 documents the recognition
assessment for all images. Note the listing comprises only borderline responses and
is not a complete list.
Combining the results from all test patterns (multiple resolutions and grey scale, and
different processing methods), the recognition rate was as follows:
Figure 4.3: Recognition rate for objects in the image set (n=168) – face 1: 98%, face 2: 98%, chair 1: 92%, post 1: 69%, chair 2: 33%, door 2: 31%, post 2: 20%, door 1: 19%, steps 2: 16%, steps 1: 4%
Figure 4.3 includes 95% confidence intervals around mean recognition rates across
all test patterns. The most recognisable objects were the two faces with 98%
recognition. Chair 1 was also highly recognised. Face recognition has been
identified as one of the foremost visual learning steps in the human baby [29]. In
studies where the eye positions of babies were monitored when presented with visual
stimuli, the babies spent longer looking at true face pictures than at other stimulus
patterns, including patterns where the same face components (mouth, eyes etc) were
present but rearranged spatially. This finding may give evidence of an immediate
visual response to biologically important objects. The result also agrees with
studies of the specialised processing required for face recognition [97]. There is
neurophysiological evidence for the existence of neurons in the temporal lobe of
monkeys, sheep and humans which respond selectively to faces. Particular neurons
are sensitive to the direction of gaze and have a maximum response if the face is
viewed straight on. Faces are encoded by their differences from a prototypic
‘average’ face/caricature, where differences are assessed relative to a norm.
Interestingly, Chair 2 was difficult to recognise, and its round features contributed to
many animal impressions in the responses. As mentioned above and in other sources
[34], past experiences and expectancies influence visual perception. Had subjects
been told they were in a room containing office equipment, one would expect no
responses such as “animals”. Thus perception performance in this assessment can be
considered a worst case: no context or hints were provided and static (still) images
were viewed. The ability of the brain to interpret low-information images even in
this worst case is apparent.
Analysis of Variance (ANOVA) on the data shown in Figure 4.3 reveals significant
differences in recognition rate for the ten images tested. The test was based on 12
observations (refer to the booklet layout shown in Appendix Section A.2) and resulted
in F(9,110) = 200 > Fcritical = 2.0 at α = 0.05, with P = 5.8E-64 (highly significant). Thus there
are significant differences between the mean recognition rates when averaged across
all test patterns (multiple resolutions and grey scale, and different processing methods).
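For reference, a one-way ANOVA of this kind can be reproduced with SciPy as sketched
below; the recognition samples are illustrative stand-ins, not the thesis data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Twelve observed recognition rates per image, for ten images:
groups = [rng.normal(loc=mu, scale=0.05, size=12)
          for mu in [0.98, 0.98, 0.92, 0.69, 0.33,
                     0.31, 0.20, 0.19, 0.16, 0.04]]

f_stat, p_value = stats.f_oneway(*groups)
# Reject H0 (equal mean recognition across images) when p_value < 0.05.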
Another fundamental aspect to determine from the testing was the effect of spatial
resolution on object recognition. The result found is represented graphically below
in Figure 4.4. The plot shows 95% confidence intervals around mean recognition
rates for the ten images used in the test.
Mean recognition rates by spatial resolution and grey scale (n=140):

                        10x10   16x16   25x25
B&W images               48%     44%     49%
3 grey level images      49%     50%     53%

Figure 4.4: Effect of spatial resolution and grey-scale on object recognition.
Q. How does greyscale affect recognition?
Although the differences in Figure 4.4 are fairly small, it can be seen that at a
particular resolution, images with 3 grey levels (white, mid-grey and black) are more
recognisable than black and white images (Figure 4.5).
Statistical testing shows however that the differences are not significant. A two
sample t-test was performed using 30 observations (3 spatial resolutions across 10
different images) at α = 0.05. The test assessed the hypotheses:
H0: There is no recognition difference when adding greyscale;
H1: The addition of greyscale achieves significantly different recognition
results; ie. a two-tailed t-test.
A t-statistic of -0.4 was obtained, the magnitude of which was much less than the
critical t value of 2.0 for 58 degrees of freedom. The significance of this value
was P = 0.71, and since this is greater than 0.05, H0 was not rejected: adding
greyscale does not significantly change recognition results.
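The corresponding two-sample t-test can be sketched with SciPy as follows; the
samples are illustrative stand-ins for the 30 recognition observations per
condition, not the thesis data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
bw_images   = rng.normal(loc=0.47, scale=0.10, size=30)  # B&W recognition
grey_images = rng.normal(loc=0.51, scale=0.10, size=30)  # 3-grey recognition

t_stat, p_value = stats.ttest_ind(bw_images, grey_images)
# Two-tailed p-value; H0 (no difference from adding greyscale) is
# retained when p_value >= 0.05.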
So while small improvements with adding greyscale were noted in the results, for
constant spatial resolution, images with 3 grey levels (white, mid-grey and black)
were not significantly more recognisable than black and white images (Figure 4.5).
Figure 4.5: Images with 3 grey levels (white, grey, black) were not significantly more recognisable than black & white images.
Q. How does spatial resolution affect recognition?
Again referring to Figure 4.4, images with 3 grey levels were more easily recognised
as spatial resolution increased. However for black and white images this was not
always so. Thus resolution is still important for object recognition – not absolute
resolution, but resolution relative to the size of the object one is trying to show.
Analysis of Variance (ANOVA) of the data shown in Fig 4.4 reveals that the
differences when increasing spatial resolution were not significant. This analysis
compared the hypotheses:
H0: µ 10x10 = µ 16x16 = µ 25x25
H1: At least two of the means are not equal, at α = 0.05.
The test was performed for 20 observations (10 images with 3 greylevels and 10
B&W images) and also for averaged results for B&W images and 3-grey images (2
observations). For 20 observations, F(2,57)=0.07 < Fcritical (3.1) with P = 0.93, while
results for 2 observations were F(2,3)= 1.2 < Fcritical (9.6) with P = 0.40. Thus as both
P values were above 0.05, H0 was not rejected: mean results did not differ
significantly as spatial resolution increased.
Q. Given the choice between increased spatial resolution and increased intensity
resolution (grey scale), which would give higher recognition?
One aspect of the testing was designed to analyse the effect of resolution versus grey
scale. A subject was simultaneously presented with images at a low resolution with
3 grey levels (ie. white, grey, black) and higher resolution black and white images.
The test analysed the following issues:
• 10x10 3grey compared with 16x16 B&W
• 10x10 3grey compared with 25x25 B&W
• 16x16 3grey compared with 25x25 B&W
The findings from this testing are shown in Figures 4.6 and 4.7.
Figure 4.6: Comparing resolution and grey scale (n=110) – % recognition for each of the three comparison groups, broken down by processing method (b&w: normal, inverse, distance, importance; 3-grey: normal, inverse, distance, importance).
Figure 4.6 comprises data for only those subjects who could correctly identify the
object. Three groups of bars are shown corresponding to the 3 bullet points above.
Each group shows a distribution of processing method chosen by subjects as showing
the object most clearly (ie. Rank 1 on the bottom of the test sheet shown in Appendix
Section A.1). The bars in each of the three groupings add to 100%. The plot also
shows individual 95% confidence intervals for each of the processing methods
obtained across the ten images used in the test. The four bars at the top of each
grouping refer to the higher resolution black and white images, while the bottom four
represent lower resolution images with 3 grey levels. It is clear that higher
recognition is achieved with the higher resolution black and white images over lower
resolution images with 3 grey levels. This indicates higher recognition is achieved
with increased spatial resolution rather than increased greyscale resolution.
Statistical testing of this data shows these differences in recognition are significant.
Two sample t-tests were performed for each of the three groupings using 10
observations (10 different images) at α = 0.05. The test assessed the hypotheses:
H0: There is no recognition difference between low resolution images with 3
greylevels compared with higher resolution black and white images;
H1: Higher resolution black and white images are more easily recognised; ie.
a one-tailed t-test.
In all three cases (10x10 3grey compared with 16x16 B&W, 10x10 3grey compared
with 25x25 B&W, 16x16 3grey compared with 25x25 B&W), P values were less
than 0.05, indicating recognition rates for the lower resolution 3-grey images are
significantly lower than the higher resolution black and white images.
4.2 Subjective Tests to Determine Useful Processing Methods
Significantly increased
understanding
Figure 4.7: Significantly higher recognition is achieved with increased spatial resolution (Right) over increased greyscale resolution (Left)
82
Q. What are the presentation modes, or processing methods, that show objects most clearly?
On the questionnaire test sheets, the subjects were asked to rank the top 3 images in
the order that they thought showed the object most clearly (refer Appendix Section
A.1). While all 3 rankings show a trend of the most useful processing methods, it
was of prime interest to determine what subjects nominated as their first choice,
which they may have felt most strongly about. The first choice nominations are
graphically represented in Figure 4.8 below. The plot shows individual 95%
confidence intervals for each of the processing methods obtained across the ten
images used in the test.
First choice nominations by spatial resolution (black & white images):

              10x10   16x16   25x25
normal          9%     19%     24%
inverse         6%     21%     16%
distance       30%     30%     27%
importance     52%     28%     20%
edges           3%      2%      2%

First choice nominations by spatial resolution (3 grey level – black, grey & white – images):

              10x10   16x16   25x25
normal         17%     14%     33%
inverse        15%      8%      7%
distance       23%     25%     12%
importance     35%     42%     29%

Figure 4.8: Object recognition rate for various processing methods (n=110)
Analysis of Variance (ANOVA) of the data shown in Figure 4.8 testing the
hypothesis that the means are equal at α = 0.05 for 10 observations, reveals the
following significance levels:

Greyscale resolution       Degrees of    10x10        16x16      25x25
                           freedom
Black and white images     (4,45)        P = 1.0E-5   P = 0.11   P = 0.28
3-grey level images        (3,36)        P = 0.22     P = 0.02   P = 0.05

Table 4.1: Analysis of Variance for various processing methods (columns show spatial resolution)
Table 4.1 indicates that there are significant differences between the means for 10x10
B&W images and 16x16 images with 3 grey levels (P<0.05). From Figure 4.8, it can
be seen that distance and importance processing were more commonly nominated as
showing the object clearly in these cases. These processing methods were
significantly higher than ‘Normal’ presentation modes. However, there were no
significant differences between means for 25x25 images. This also suggests that
several presentation modes should be used in artificial vision systems rather than a
single mode of operation.
Q. How does edge enhancement affect recognition?
The upper plot of Figure 4.8 also shows that edge-processed images were not well
recognised (Figure 4.9). At the low resolutions used in the tests, edges comprised
too large a percentage of the total image pixels. For example, in a 10x10 image, a
vertical edge would comprise an entire column representing a tenth of the image.
Figure 4.9: Edge images were not well recognised
Statistical testing confirms that results for edges are significantly lower than the
average of other methods. Two sample t-tests were performed for each of the three
spatial resolution groups (10x10, 16x16, 25x25) using 10 observations (10 different
images) at α = 0.05. The test assessed the hypotheses:
H0: There is no recognition difference between edge-processed images
compared with the average recognition of the other processing methods (the
average of normal, inverse, distance and importance);
H1: Edge-processed images are significantly less easily recognised; ie. a one-
tailed t-test.
In all three cases, P values were less than 0.05, indicating recognition rates for the
edge-processed images were significantly lower than the average of other processing
methods.
Q. What effect does image content (type of scene) have on recognition?
The test image set reflected diversity in scene content, and it was found that image
content is important to recognition ability. It would therefore be beneficial to
have adaptive processing for different scenes. For recognising chairs and doorways,
distance and importance processing was best, while for human faces, normal (or
inverse) processing was most beneficial. Interestingly, there appeared to be a
subjective difference between inverse and normal images, which differed between
individuals.
4.2.4 Test Conclusions
This section has described subjective experiments undertaken to determine useful
image processing methods for visual prosthetic applications and to provide a
framework for prototype development. A condensed summary of the results is as follows:
• at a particular resolution, images with 3 grey levels (white, grey and black) were
not significantly more recognisable than black and white images;
• higher recognition was achieved with increased resolution rather than increased
grey scale;
• the most recognisable objects were images of human faces with 98% recognition;
• the test image set reflected diversity in scene content, and it was found that image
content is important to recognition ability – it is therefore beneficial to have
device-switchable processing for different scenes;
• at lower spatial resolutions, one or two processing methods were quite useful
(importance and distance processing);
• edge-processed images were not well recognised; at the low resolutions used in
the tests, edges comprised too large a percentage of the total image pixels;
• there appeared to be a subjective difference between inverse and normal images
(Figure 4.10);
Figure 4.10: Subjective preferences between image and its inverse – some subjects preferred white on black, others black on white.
• for recognising chairs/doorways – distance & importance processing was best;
for human faces: normal & inverse; and
• resolution is still important for object recognition – not absolute resolution,
but resolution relative to the size of the object one is trying to show.
4.3 Subjective Tests to Determine Influence of Image Type
In this section further results are presented on subjective tests simulating what might
be seen by users of low quality vision systems. A group of 225 normally sighted
subjects viewed a set of low quality (low spatial resolution and low grey-scale
resolution) static images. The aim for this testing was to quantify
intelligibility/recognition for low quality images and determine the effect of the type
of image. Results from this testing form part of an image quality model to assess the
usefulness of low quality images.
4.3.1 Methodology
Part of the research involves assessing visual perception at this low end of the image
quality spectrum. Chapter 2 described numerous models assessing the human visual
system and image quality. However, these models apply to the high end of the image
quality spectrum (see Watson [105] for a good compilation). There is a need to fill
this void to assess image quality for emerging implant designs. The work extends
upon the previous subjective tests on normally sighted viewers described in the
preceding section which determined the impact of several image processing
techniques on object recognition.
The simulation tests were undertaken to provide insight into human perception of
low quality images and were aimed at simulating artificially-induced low quality
vision.
The objective was to obtain a Recognition-Quality envelope (see Fig 4.11), where a
subject was able to use the information presented to draw an intelligible conclusion
about the image. This section introduces the concept of recognition-quality curves
which show recognition performance plotted against image quality.
Figure 4.11: Test Objective – obtain a Recognition-Quality curve (recognition plotted against image quality). The annotated questions on the plot are: ‘Is there a threshold of minimum quality required for intelligible viewing?’ and ‘Is there a variation in the degree of recognition possible for different images of the same quality?’
It was anticipated that as image quality was increased, there would be an increase in
the ability of an object to be recognised. However, for a given image quality,
recognition performance was expected to vary among viewers, thus producing an
‘envelope’ of recognition as opposed to a straight line response. This may also
indicate that the ability of an object to be recognised may not improve within a range
of image qualities.
Participation was on a voluntary basis and comprised 271 senior high school students
and 11 mature age respondents. Invalid data resulted in the rejection of 57
questionnaires (21%). Thus the final sample size was 225, representing sample sizes
of 25 for each of the 9 image quality classes.
Participants had no prior knowledge of the images. Booklet instructions stated that a
range of high quality and low quality images could be expected, and although the
low quality images might just appear as a range of blocks, they may be similar to
what a blind person might see with a bionic eye.
4.3.2 Images Chosen
There were 9 Image Quality classes tested (see Fig 4.12). Original images were
256x256 pixels representing a range of scene types. A decreasing image quality
scale was presented using spatial resolutions typical of visual prosthesis designs
(25x25, 16x16, 10x10) and reducing the grey levels from full greyscale to binary. It
was also of interest to expose the structure of an image by presenting image edges.
Figure 4.12: The nine image quality classes used in the tests – 256x256 full greyscale, 256x256 edge (image structure) and 256x256 binary; 25x25 greyscale and binary; 16x16 greyscale and binary; 10x10 greyscale and binary
Reduced quality image sets were prepared for the images shown in Fig 4.13.
Figure 4.13: Test image set – tree, flower, balloon, lighthouse, face, buildings, capsicum, gorilla, duck
The subject was presented with 9 different images on the one page (tree, flower,
balloon, lighthouse, face, buildings, capsicum, gorilla, rubber duck) corresponding to
an image quality class described above. An example of the test stimuli is shown in
Appendix Section B.1.
4.3.3 Results
Responses indicated by subjects were collated to determine recognition rates. Most
subject responses were easy to classify as "Yes, this person has correctly recognised
the object" or otherwise. However, where a subject's response was borderline,
Appendix Section B.2 was constructed to maintain consistent judgements on whether
images where correctly recognised. Responses were accepted if they had similar
context to the answer. Note the table includes only borderline responses and is not a
complete listing.
Table 4.2 shows the proportion of respondents who could correctly identify all (9/9)
images presented to them and two-thirds (6/9) of the images shown to them. 6/9 was
chosen to reflect recognition performance clearly over 50%.
QUALITY CLASS          9/9 correct    % correct    6/9 correct    % correct
                       (out of 25)    (9/9)        (out of 25)    (6/9)
10 x 10 Binary          0              0%           0              0%
10 x 10 Greyscale       0              0%           4              16%
16 x 16 Binary          0              0%           1              4%
16 x 16 Greyscale       1              4%           7              28%
25 x 25 Binary          0              0%           4              16%
25 x 25 Greyscale       2              8%           20             80%
256 x 256 Binary        4              16%          25             100%
256 x 256 Edge          24             96%          25             100%
256 x 256 Greyscale     25             100%         25             100%

Table 4.2: Correct image identification – respondents correctly identifying all nine (9/9) and two-thirds (6/9) of the images (n=25 per quality class)
None of the respondents viewing the low quality binary images (10x10, 16x16, 25x25) were able to correctly identify all 9 of the presented images. Surprisingly, even at high resolution, only 16% of respondents viewing the binary versions of the originals correctly identified all images. Even when considering identification of more than half (6/9) of the image set, recognition performance was still low for respondents viewing the low quality binary images. In fact, the same number of people could identify two thirds (6/9) of the 25x25 binary image set as could do so with the 10x10 greyscale images. 80% of viewers of the 25x25 greyscale image set could correctly identify two thirds (6/9) of the image set. This value of useful spatial resolution agrees with previous simulation work by others ([99]; refer Section 3.5.2.1), which found that effective two dimensional displays can be achieved with matrix sizes of between 20x20 and 32x32.
When considering an average across all image types, it was possible to construct an
envelope of recognition for the test set as shown below in Fig 4.14, which has a
similar shape to the envelope proposed in Fig 4.11. The plot shows 95% confidence
intervals around mean recognition rates for the nine image quality classes used in the
test. Maximum and minimum curves have been added to indicate the range of values
obtained. Although the maximum and minimum values are shown joined with a line
to form an envelope, they do not imply that the x-axis is always ordered in the image
quality order as shown. (The order below is by increasing mean recognition rate across all image types.)
[Plot: All Images (n=225). x-axis: Image Quality (10 Bin, 16 Bin, 10 F/G, 25 Bin, 16 F/G, 25 F/G, 256 Bin, 256 Edges, 256 F/G); y-axis: % Recognition (0-100%).]
Error bars denote 95% confidence intervals around mean recognition rate.
Figure 4.14: Recognition-Quality Envelope of recognition for all images in test set
Analysis of Variance (ANOVA) was performed on the data shown in Fig 4.14 to
compare the hypotheses:
H0: µ(10 Bin) = µ(16 Bin) = … = µ(256 F/G)
H1: at least two of the means are not equal, at α = 0.05.
The test resulted in an F-value of 111, which exceeded the critical F-value (1.98) for the number of degrees of freedom in the data (8, 216), and was highly significant at P = 2.64E-72. Thus H0 was rejected and it was concluded that recognition rates were
significantly different for the image quality classes used in the test.
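For illustration, a minimal Python sketch of this one-way ANOVA (the data here are synthetic stand-ins generated for the example; the thesis data were the per-subject recognition rates, 25 per quality class):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical stand-in data: 25 per-subject recognition rates (0 to 1)
    # for each of the 9 image quality classes.
    class_means = [0.05, 0.10, 0.15, 0.25, 0.35, 0.60, 0.75, 0.95, 0.98]
    groups = [np.clip(rng.normal(m, 0.1, size=25), 0, 1) for m in class_means]

    # One-way ANOVA: H0 is rejected when F exceeds the critical value
    # F(8, 216) at alpha = 0.05 (1.98 for the thesis data).
    f_value, p_value = stats.f_oneway(*groups)
    f_crit = stats.f.ppf(0.95, dfn=len(groups) - 1,
                         dfd=sum(len(g) for g in groups) - len(groups))
    print(f"F = {f_value:.1f}, critical F = {f_crit:.2f}, p = {p_value:.2g}")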
Recognition-Quality curves for specific object types are shown in Fig 4.16. It can be
seen that the x-axes have different ordering. One conclusion from these results is
that recognition performance varies widely depending on many factors, one of which
is the type of image. Had these curves been plotted with the same x-axis ordering
(say on increasing recognition rate of values averaged across image types), the
recognition plot of Fig 4.15 below would be obtained. The data points are shown
joined to demonstrate the jaggedness of the curves, highlighting that recognition
performance varies with type of image.
[Plot: Specific Images (n=25). x-axis: Image Quality (10 Bin, 16 Bin, 10 F/G, 25 Bin, 16 F/G, 25 F/G, 256 Bin, 256 Edges, 256 F/G); y-axis: % Recognition (0-100%); one curve per image: Mean, Lighthouse, Buildings, Tree, Gorilla, Capsicum, Face, Flower, Balloon, Rubber Duck.]
Figure 4.15: Variation in recognition among image types
In general, one might expect recognition to improve as spatial resolution and the
number of greylevels increase. The results here validate the experiments of Section 4.2.3 and those by others [36], showing that recognition rate/perceived quality depends on image type and that there is interplay between greylevel and spatial resolution.
Average recognition rate (averaged across viewers and image quality classes) is shown for each image. Each chart plots % Recognition (0-100%) against the nine image quality classes, ordered from highest to lowest recognition:
Gorilla (n=25), avge 44%: Orig, Orig Bin, Edges, 25 F/G, 16 F/G, 16 Bin, 25 Bin, 10 F/G, 10 Bin
Face (n=25), avge 85%: Orig, Orig Bin, Edges, 25 F/G, 16 F/G, 25 Bin, 10 F/G, 16 Bin, 10 Bin
Balloon (n=25), avge 51%: Orig, Edges, 25 F/G, 10 F/G, 16 F/G, Orig Bin, 25 Bin, 10 Bin, 16 Bin
Buildings (n=25), avge 50%: Orig, Orig Bin, Edges, 25 Bin, 25 F/G, 16 Bin, 10 F/G, 10 Bin, 16 F/G
Tree (n=25), avge 53%: Orig, Orig Bin, Edges, 25 F/G, 10 Bin, 25 Bin, 16 Bin, 10 F/G, 16 F/G
Capsicum (n=25), avge 50%: Orig, Orig Bin, Edges, 25 F/G, 25 Bin, 16 F/G, 10 F/G, 16 Bin, 10 Bin
Rubber Duck (n=25), avge 57%: Orig, Edges, Orig Bin, 25 F/G, 16 F/G, 10 F/G, 25 Bin, 16 Bin, 10 Bin
Lighthouse (n=25), avge 66%: Orig, Edges, Orig Bin, 25 F/G, 25 Bin, 16 F/G, 10 F/G, 16 Bin, 10 Bin
Flower (n=25), avge 72%: Orig, Orig Bin, Edges, 25 F/G, 25 Bin, 16 F/G, 10 F/G, 16 Bin, 10 Bin
Figure 4.16: Recognition-Image Quality curves for each test image
For 5 of the 9 images (face, flower, lighthouse, duck, capsicum), as more information
was presented in the way of either greyscale or spatial resolution, the recognition rate
increased.
For example, recognition performance for
• the flower image set = 10Bin < 16 Bin < 10 F/G < 16 F/G < 25 Bin < 25 F/G etc
• the face image set = 10Bin < 16 Bin < 10 F/G < 25Bin < 16 F/G < 25 F/G etc
where “10Bin < 16 Bin” indicates 16x16 Binary images were more easily recognised
than 10x10 Binary images.
On the other hand, the gorilla image set resulted in the following quality class ordering: 10 Bin < 10 F/G < 25 Bin < 16 Bin < 16 F/G < 25 F/G etc.
This order appears illogical, because a lower recognition rate was achieved for 25x25 Binary images than for 16x16 Binary images. However, the actual recognition rates for
the gorilla image set are very low. It can be conjectured that images with low overall
recognition rates give spurious results due to guessing. In contrast, the 4 images that
were most highly recognised across all image quality classes (face=85%,
flower=72%, lighthouse=66%, duck=57%) all had logical quality class ordering as
recognition rate increased.
Mean recognition rates for each object type are shown over in Fig 4.17. The plot of
Fig 4.17 shows 95% confidence intervals around mean values (average across all
image quality classes) and maximum and minimum values to show the range of
recognition obtained.
[Plot: All Image Quality Classes (n=225). x-axis: object type (Face, Flower, Lighthouse, Rubber Duck, Tree, Balloon, Buildings, Capsicum, Gorilla); y-axis: % Recognition (0-100%).]
Figure 4.17: Recognition rates for each object type
Similar to the results found in Section 4.2.3, the face image had the highest mean
recognition rate when averaged across all image quality classes. Again this agrees
with the literature where face recognition has been recognised as neurologically
programmed [97] and one of the foremost visual learning steps in the human baby
[29]. For low quality presentations, images that were highly recognised were the
face and flower (greyscale images) and face, lighthouse and flower (binary images).
The gorilla, duck and capsicum images were not recognised well at low quality.
Analysis of Variance (ANOVA) was performed on the data shown in Fig 4.17 to
compare the hypotheses:
H0: µ(Face) = µ(Flower) = … = µ(Gorilla)
H1: at least two of the means are not equal, at α = 0.05.
The test reports an F-value = (found variation of the group averages)/(expected variation of the group averages); if the H0 hypothesis is correct, the F-value is about 1. Interestingly, the test resulted in an F-value of 1.12, which was less than the critical F-value (2.07) for the number of degrees of freedom in the data (8, 72). This F-value has a significance of P = 0.36, and thus H0 could not be rejected.
Thus when combining recognition rates for all quality classes, there was no
significant difference based on image type.
4.3.4 Test Conclusions
This section described an experiment to further understanding of recognition of low quality images and determine the effect of the type of image on object recognition.
Recognition was found to vary depending on the type of image, but differences were
not significant when averaged across the different image qualities assessed in the
test. The face image had the highest mean recognition rate across all image qualities.
The number of respondents correctly identifying two thirds (6/9) of the 25x25 binary image set equalled the number correctly identifying two thirds of the 10x10 greyscale image set (4 of 25 in each case).
80% of respondents could correctly identify more than half of the 25x25 greyscale images, indicating that reasonable vision is achieved at this level. It must be remembered
that these test images were static and improved perception could be expected with
presentation of image sequences – ie. a movie versus a single image. Also a visual
prosthesis user would be able to move about to see how various objects interact.
There is an interplay between greyscale resolution and spatial resolution – for some
objects, higher recognition is achieved with increased greyscale over spatial
resolution, while the reverse applies for other objects.
For those objects which were highly recognised (above 55% averaged across all
image quality classes: face, flower, lighthouse, duck) it was possible to obtain a
recognition curve that increased as image quality increased. However for the
remaining images which were not well recognised (much guessing), recognition rates
both increased and decreased as image quality increased.
This work is extended in the next Chapter by:
1. Correlating several image statistics, such as fractal dimension, symmetry, number
of edges, and number of segments, with the images in these tests to determine if
recognition can be automatically predicted.
2. Constructing a visual information model for low quality images comprising
several dimensions, in addition to the actual object as considered in this chapter.
4.4 Chapter Conclusions
At the commencement of this chapter, two research questions were stated which can
now be answered following the experiments described in this chapter:
Q1: Although limited to low quality images anticipated from visual prostheses, can
recognition of some objects be achieved?
A1: Results indicated that greyscale images were easier to identify: 80% of
respondents could correctly identify more than half of a 25x25 greyscale image set,
while only 16% of respondents correctly identified more than half of a 25x25 binary
image set. Spatial resolution was more important for recognition performance than
greyscale resolution.
Recognition was found to vary depending on the type of image, with face images
being the most easily recognised. It would be beneficial to have device switchable
processing for different scenes. Further exploration of the idea of adjusting image processing depending on the scene type is presented in Chapter 6.
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
A2: Results indicated that there may be some benefit in pursuing ROI methods in
further detail, especially for very low (10 x 10) resolution images. For recognising
chairs and doorways – ROI and distance processing were best, while for human faces
– standard/Base Case and inverse processing were best. Access to a range of processing routines
was therefore advisable. Further assessment of the applicability of ROI techniques to
low quality image perception is presented in Chapter 7.
Chapter 5 Quantifying Information Content
5.1 Introduction
One of the desired functions of visual prostheses is to convey maximum scene information through the limited number of electrodes in implants. How can one tell if there is
maximum scene information in the conveyed image, or even, how does one quantify
the amount of visual information in an image? Can a metric be developed that can
rank images (like the two shown in Figure 5.1) for the amount of visual information
they contain?
Figure 5.1: Two images with different amounts of visual information content
This chapter attempts to answer these questions and describes in detail the
construction of a metric for visual information. It aims to answer the research
question proposed at the end of Chapter 3:
Q3: Can a metric be constructed for basic information required for the interpretation
of a visual scene at low image quality?
This knowledge would result in a new way to characterise low quality images on the
basis of providing maximum information. Images could be characterised on the ratio
of perceived information they convey (the human user's concept) to their representative information (a raw measure of intrinsic information, typically in bits).
Assuming that information content in images can be quantified, how can this
knowledge be used in the visual prostheses application? This chapter proposes that
image content can be manipulated in a way so that the resulting image to be
conveyed to implant electrodes contains maximum information. One means to do
this is using the Importance Map method described previously in Section 3.5.2.6.
This method involves the combination of several feature maps/images representing
attentional features to form an overall importance map. Using the knowledge of
what constitutes visual information, weights for each feature map (intensity, edges, colour contrast, etc.) could be adjusted iteratively to maximise the amount of
visual information in the resulting importance map.
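To make this concrete, a hedged Python sketch of such an iterative adjustment (the three feature maps, the coarse grid search and the edge-count scorer are all illustrative assumptions, standing in for whatever feature set and information measure are adopted):

    import numpy as np

    def combine(feature_maps, weights):
        # Weighted sum of feature maps (each assumed normalised to [0, 1]),
        # renormalised to form an overall importance map.
        total = sum(w * f for w, f in zip(weights, feature_maps))
        return total / max(float(total.max()), 1e-9)

    def information_score(importance_map, thresh=0.5):
        # Stand-in information measure: count of level transitions in the
        # thresholded map, a crude proxy for the number of edges.
        binary = (importance_map > thresh).astype(np.int8)
        return (np.count_nonzero(np.diff(binary, axis=0)) +
                np.count_nonzero(np.diff(binary, axis=1)))

    # Coarse search over weights for three feature maps (e.g. intensity,
    # edges, colour contrast) to maximise the resulting information score.
    maps = [np.random.rand(25, 25) for _ in range(3)]
    best = max(((w1, w2, 1.0 - w1 - w2)
                for w1 in np.linspace(0, 1, 11)
                for w2 in np.linspace(0, 1, 11) if w1 + w2 <= 1.0),
               key=lambda w: information_score(combine(maps, w)))
    print("best weights:", np.round(best, 2))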
This Chapter comprises three further sections:
Section 5.2: Identify perceived visual information content; construct 8 subjective data sets.
Section 5.3: Propose a metric using 1 of the 8 subjective data sets (both a metric specific to a particular image quality and one metric for all image qualities); validate the metric against the other 7 subjective data sets.
Section 5.4: Correlate visual information content with perception.
Section 5.2 describes a subjective experiment for perceived information content in
images. Subjective rankings are presented for eight visual ‘dimensions’. Patterns
among rankings and viewer preferences are noted to gain insight into subjective
visual information. The development of a robust metric is detailed in Section 5.3 and
predictive performance of the metric examined. Finally, Section 5.4 determines
whether high perceived information content in images actually corresponds to high recognition rates: is it true that an image with high information content can be recognised more easily at low quality than an image with low information content?
5.2 Perceived Information Content in Images
An experiment was conducted to rank the amount of inherent visual information in
images. In the experiments images were compared with each other to obtain a
ranking from most to least visually informative. In addition to using the results to
propose a metric to quantify visual information, an additional benefit is determining
how perceived information content changes as image quality decreases.
5.2.1 Images Used
Similar to the experiments discussed in Section 4.3 and shown below in Figure 5.2, there were 9 image quality classes tested. Original images were 256x256 pixels representing a range of scene types.
Figure 5.2: The nine image quality classes used in the tests (as in Figure 4.12: full greyscale and binary versions at 256x256, 25x25, 16x16 and 10x10 resolutions, plus a 256x256 edge representation)
A decreasing image quality scale was presented using spatial resolutions typical of
visual prosthesis designs (25x25, 16x16, 10x10) and reducing the grey levels from
full 256 levels of greyscale to binary. It was also of interest to expose the structure
of an image by presenting image edges. Reduced quality image sets across the nine
classes were prepared for each of the images shown in Figure 5.3.
5.2.2 Multidimensional Visual Information Model
Eight aspects/dimensions were explored to determine what impact, if any, they had on perceived information content. The first issue (Actual Objects) was assessed by comparing 7 images against each other, while the other dimensions were assessed with only 3 images each.
1. Actual Objects
7 images representing a range of different scene types: tree, flower, balloon,
lighthouse, face, buildings, capsicum. There was no implicit ranking concerning
visual information.
2. Number of Objects
3 images of increasing object number for similar scene type. The first image
contained one balloon, the second image three or four balloons and finally an image
containing many balloons. This set had an expected ranking of visual information in
proportion to the number of objects.
3. Angle of Object
3 images of a fruit bowl at 90° (top-down), 45° (angled) and 0° (side-on). Perceived
visual information may vary due to occlusion and distortion of objects.
Figure 5.3: Multidimensional Visual Information Model (continued over)
4. Distance to Object
3 images of a couple on a bicycle with decreasing distance to the couple’s faces. The
first image is a whole of body image, the second shows a half-body view and the
final image consists of head and shoulders only. This set had an expected ranking of
visual information with distance ie. higher visual information where the whole of the
scene can be viewed.
5. Connection between Image Objects
3 images of different couples with decreasing connection between the couple. One
image shows the cheeks touching, the next shoulders touching, while the final image
shows space between the couple. It was expected that images showing space
between the couple may indicate more of what was happening in the scene, and thus
information content would be greater with increased separation.
6. Image Detail
3 images of the same face with different edge detail. The first image shows the face
alone, the second includes a phone, while the third shows part of an additional face.
Information content was expected to be greater for images containing higher detail
ie. the additional face and phone images would be more visually informative.
Figure 5.3: Multidimensional Visual Information Model (continued over)
7. Contrast between Objects & Surround
3 images of capsicums with varying contrast. Green, red and yellow capsicums gave
varying contrast against a light background when viewed as greyscale images. It was
expected that images with higher contrast would be ranked as containing higher
information content.
8. Variety of Object Types
3 images comparing different object types. The first image contained an orange and
sunglasses, the second depicted an orange and a mug, while the third showed scissors
and a mug. There was no expected ranking of information content.
Figure 5.3: Multidimensional Visual Information Model
5.2.3 Test Method
Two questionnaire-based methods were used:
1) Images presented all on one page
An example test stimulus is shown in Appendix Section C.1 (this shows the
presentation of 7 images for assessing the “Actual Objects” test set) and Section C.2
(showing the presentation of 3 images for assessing the “Distance to Object” test
set).
For assessing the first issue (Actual Objects), the 7 images were presented on the
same page, and subjects were asked to rank from 1 to 7. When considering such a
large number of comparisons, this method gives strong responses for the extremes
(most and least visually informative) and weaker responses for the mid-lying images.
Thus a paired comparison (binary decision) test was performed on the 7 image set.
2) Paired comparison (binary decision) questionnaire test
An example test stimulus is shown in Appendix Section C.3.
Considerable effort was required in the design of the questionnaires to ensure variety
(avoid boredom), reduce the chance of learning effects from multiple viewings of the
same object, and to keep the questionnaires short (avoid fatigue). To achieve this, 9
booklet versions were produced (Books A, B, C etc) with the format shown in
Appendix Section C.4. Conditions for viewing the experiment (ambient illumination
etc) were not controlled.
5.2.4 Test Participants and Instructions
Participation was on a voluntary basis and comprised 271 Year 11 students and 11
mature age respondents. Invalid data resulted in the rejection of 57 questionnaires
(21%). Thus the final sample size was 225, representing sample sizes of 25 for each
of the 9 image quality classes.
Participants had no prior knowledge of the images. Booklet instructions stated that a
range of high quality and low quality images could be expected, and although the
low quality images might just appear as a range of blocks, they may be similar to
what a blind person might see with a bionic eye.
In assessing visual information using human viewers, it was anticipated that there
would be a varied understanding and interpretation of the concept of visual
information. In addition to the above comment that viewers were advised of the
bionic eye application, the following example question was provided to all viewers:
WHICH IMAGE APPEARS TO CONTAIN MORE INFORMATION?
In other words, which image could you answer the most questions about? (eg.
What is the scene? How many objects?) If you had to rely on only one of the
images to perform a task which would it be?
Beyond these comments viewers made their own interpretation of visual information.
5.2.5 Test Results
Eight factors were analysed. The first factor was determining the effect of the actual
object shown in the image on perceived information content. The seven different
objects were compared against each other. Two ranking schemes were used: 1)
images were presented all at the same time 2) paired comparison tests. Both methods
gave similar results and the rankings for each method are shown below. The table
shows images ranked from highest perceived information content (1) to lowest
information content (7) for the nine image quality classes and a ranking combining
all image quality classes.
Visual Information Ranking (1 = highest, 7 = lowest) - Images presented all at same time
All Quality Classes: Face, Flower, Tree, Buildings, Lighthouse, Capsicum, Balloon
256 F/G:   Face, Buildings, Tree, Lighthouse, Flower, Balloon, Capsicum
256 Bin:   Buildings, Face, Tree, Flower, Lighthouse, Capsicum, Balloon
256 Edges: Buildings, Face, Tree, Flower, Lighthouse, Balloon, Capsicum
25 F/G:    Face, Flower, Capsicum, Tree, Balloon, Lighthouse, Buildings
25 Bin:    Face, Flower, Tree, Buildings, Lighthouse, Capsicum, Balloon
16 F/G:    Face, Flower, Capsicum, Balloon, Tree, Lighthouse, Buildings
16 Bin:    Face, Flower, Tree, Buildings, Capsicum, Lighthouse, Balloon
10 F/G:    Face, Flower, Capsicum, Tree, Balloon, Buildings, Lighthouse
10 Bin:    Tree, Flower, Face, Buildings, Lighthouse, Capsicum, Balloon

Visual Information Ranking (1 = highest, 7 = lowest) - Paired comparison presentations
All Quality Classes: Face, Flower, Tree, Buildings, Capsicum, Lighthouse, Balloon
256 F/G:   Buildings, Face, Lighthouse, Tree, Flower, Balloon, Capsicum
256 Bin:   Buildings, Face, Tree, Flower, Lighthouse, Capsicum, Balloon
256 Edges: Buildings, Face, Tree, Flower, Lighthouse, Balloon, Capsicum
25 F/G:    Face, Flower, Capsicum, Balloon, Tree, Lighthouse, Buildings
25 Bin:    Face, Flower, Tree, Buildings, Capsicum, Lighthouse, Balloon
16 F/G:    Face, Flower, Capsicum, Tree, Balloon, Lighthouse, Buildings
16 Bin:    Face, Flower, Tree, Buildings, Lighthouse, Capsicum, Balloon
10 F/G:    Face, Flower, Tree, Capsicum, Balloon, Lighthouse, Buildings
10 Bin:    Tree, Face, Flower, Buildings, Capsicum, Lighthouse, Balloon
Table 5.1: Perceived information content for comparing 7 different object types
When considering the ranking for all quality classes (n=225) both methods gave the
following near identical ranking order:
Face > Flower > Tree > Buildings > Lighthouse/Capsicum > Balloon.
Ie. the face image has higher subjective information content than the flower etc.
The high quality Top 3 (256x256, 256x256_Binary, 256x256_Edge) were Face,
Buildings and Tree:
Figure 5.4: Images containing high information content for high quality images
The low quality Top 3 were Face, Flower and Tree (binary), and Face, Flower and Capsicum (greyscale):
Figure 5.5: Images containing high information content for low quality images
The effect of the other factors/dimensions on visual information is presented in Table 5.2 over. Patterns that emerged in the visual information ranking are noted along with the number of image quality classes (out of 9) with that ranking. Strong viewer preferences are defined as the pattern chosen by 70% or more of the sample. Although it may appear arbitrary, the 70% level was chosen from careful inspection of the data and the fact that in a normal distribution, 68% of cases fall within 1 standard deviation of the mean.
Table 5.2 lists, for each dimension: the dominant visual information ranking patterns (highest to lowest), whether any very strong viewer preferences existed (a pattern chosen by >70% of the sample), how many of the 9 image quality classes showed that pattern, and whether the original (256 F/G) images were ranked that way.

Number of Objects in Scene
1st dominant pattern: very strong preferences in 5/9 classes (16Bin 88%, 25Bin 92%, Edges 84%, 256Bin 96%, 256F/G 96%); 6/9 quality classes ranked this way (10Bin 64%, 16Bin 88%, 25Bin 92%, Edges 84%, 256Bin 96%, 256F/G 96%); originals ranked like this: yes.
2nd pattern: no strong preferences; 2/9 classes (16F/G 32%, 25F/G 36%); originals: no.
Comments: A strong pattern was clear in the results which confirmed expectations: the more objects in the scene, the higher the visual information. 6/9 image quality classes (two thirds of the quality classes) were ranked in this way. Five of the nine image quality classes had very strong viewer preferences for this ordering. For two low quality classes, the image of the single balloon was favoured highest, but preferences were not strong.
Angle of Object
1st dominant pattern: no strong preferences; 5/9 classes (10F/G 36%, 16Bin 32%, 16F/G 40%, Edges 36%, 256F/G 52%); originals ranked like this: yes.
2nd pattern: no strong preferences; 3/9 classes (10Bin 60%, 16Bin 32%, 25Bin 44%); originals: no.
Comments: The dominant pattern indicates highest information is in a top down view (90 degrees) of the fruit bowl, where almost the entire bowl circumference is visible. The contents of the bowl can be most easily seen in this top-down view. This pattern was ranked more visually informative for high quality and greyscale images. When limited to binary representation, the side on view (0 degrees) was ranked higher, perhaps due to a sharper profile of the bananas against the background.
Distance to Object
1st dominant pattern: no strong preferences; 5/9 classes (10F/G 32%, 16Bin 28%, 16F/G 32%, Edges 48%, 256F/G 68%); originals ranked like this: yes.
2nd pattern: no strong preferences; 4/9 classes (10Bin 40%, 16Bin 28%, 25F/G 64%, 256Bin 36%); originals: no.
Comments: The dominant pattern, including the Original image set, is ranked in increasing distance to the viewer, ie. more visual information where you can see more of the image and background. However the second pattern, which includes rankings for low quality binary images, indicates the closest view of faces contains more information.
Connection between image objects
1st dominant pattern: no strong preferences; 5/9 classes (10F/G 36%, 16Bin 32%, 16F/G 36%, Edges 68%, 256Bin 36%); originals: no.
2nd pattern: very strong preference in 1/9 classes (25F/G 80%); 4/9 classes (10Bin 36%, 25Bin 32%, 25F/G 80%, 256F/G 48%); originals ranked like this: yes.
Comments: The first dominant pattern indicates viewers rated the image with the most separation the most informative, perhaps because more of the occupation of the couple (card game) was visible, or maybe the background picture and table edge contributed highly. However the 2nd pattern included a strong (80%) preference in decreasing connection between the couple for 25x25 greyscale images.
Image Detail
1st dominant pattern: very strong preference in 1/9 classes (16F/G 80%); 4/9 classes (10Bin 52%, 10F/G 64%, 16F/G 80%, 25F/G 60%); originals: no.
2nd pattern: no strong preferences; 3/9 classes (16Bin 64%, 25Bin 40%, 256Bin 36%); originals: no.
3rd pattern: no strong preferences; 2/9 classes (Edges 48%, 256F/G 64%); originals ranked like this: yes.
Comments: A simple face with no surrounding clutter was most visually informative for the low quality images (1st & 2nd patterns). The influence of the mobile phone in our society may be reflected in the ranking of the high quality images (3rd pattern) where the face with the phone appeared to contain more information.
Contrast between objects and surround
1st dominant pattern: very strong preference in 1/9 classes (Edges 72%); 5/9 classes (10F/G 44%, 16F/G 64%, 25F/G 56%, Edges 72%, 256F/G 56%); originals ranked like this: yes.
2nd pattern: no strong preferences; 2/9 classes (10Bin 40%, 256Bin 64%); originals: no.
3rd pattern: no strong preferences; 2/9 classes (16Bin 44%, 25Bin 28%); originals: no.
Comments: Strong edges correspond with high perceived information content. The dominant pattern was for high quality and the low quality greyscale images. When greyscale is available, the stalk and capsicum form/contours may cause this ranking.
Variety of Object types
1st dominant pattern: no strong preferences; 4/9 classes (10Bin 40%, 10F/G 60%, 16Bin 48%, 16F/G 32%); originals: no.
2nd pattern: no strong preferences; 3/9 classes (25Bin 32%, 25F/G 40%, 256F/G 44%); originals ranked like this: yes.
3rd pattern: no strong preferences; 2/9 classes (Edges 52%, 256Bin 52%); originals: no.
Comments: The ordering of the dominant pattern is the same as presented to viewers on the questionnaire sheets. It is interesting that the lowest quality image classes make up this dominant pattern. Perhaps viewers of the low quality images were not able to make an intelligible distinction between images and ranked the images in order of appearance.
Table 5.2: Pattern analysis for information content rankings
5.2.6 Strong Visual Information Rankings
63 visual information rankings were obtained (7 additional factors/dimensions x 9
image quality classes). Dominant patterns (ie. the most frequently specified ordering
in terms of perceived information content) were identified for each case. The
strength of the dominant patterns (ie. the frequency with which that pattern was
specified by observers) ranged from 96% (24 of 25 respondents ranking images in
that order) to 28% (only 7 of 25 respondents). The number of cases in each ten-percentile class was as follows:
Strength and number of cases for dominant viewer patterns (63 in total), from strong to weak:
90-100%: 3
80-89%: 4
70-79%: 1
60-69%: 12
50-59%: 6
40-49%: 16
30-39%: 19
20-29%: 2
10-19%: 0
0-9%: 0
Table 5.3: Dominant visual information viewer preferences
It was of interest to further examine strong dominant viewer patterns in the data.
Eight of the 63 rankings had 70% or above consensus among viewers. Five of these
related to the number of objects in the scene.
Strong viewer preferences are shown over in Figure 5.6.
Number of Objects in Scene: 5 image quality classes: 16x16_Binary (88%), 25x25_Binary (92%), 256x256_Edges (84%), 256x256_Binary (96%), 256x256 (96%)
Closeness between image objects: 1 image quality class: 25x25 greyscale set (80%)
Image Detail: 1 image quality class: 16x16 greyscale set (80%)
Contrast between Objects & Surround: 1 image quality class: 256x256_Edge set (72%)
Figure 5.6: Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
5.2.7 Test Conclusions
Four conclusions can be drawn from Figure 5.6 and the perceived information
content experiment:
1. the more objects in the scene, the higher the visual information
2. the closer the objects in the scene, the higher the visual information
3. a simple face with no surrounding clutter was most visually informative at low
resolution levels
4. strong edges, arising from high intensity contrast, correspond with high perceived
visual information content
These viewer preferences now need to be checked against predictions from a visual
information metric which is undertaken in the next section.
5.3 Information Content Model Fitting
Described above are experiments to assess perceived information content in eight
visual dimensions. Subjective rankings from one of these eight dimensions (Actual
Objects) are now used to construct a metric to quantify visual information in images.
The metric is then validated against the subjective results of the other 7 dimensions.
5.3.1 Possible Image Attributes for a Visual Information Metric
After consideration of the literature on visual information content (refer Section 2.5)
15 image attributes were considered for the visual information metric:
1. file size
2. standard deviation
3. maximum standard deviation in 4 image quadrants
4. variance
5. maximum variance in 4 image quadrants
6. entropy
7. number of edges
8. number of segments
9. fractal dimension
10.-12. image internal similarity measures (three measures)
13.-15. image symmetry measures (three measures)
Descriptions are provided below for these attributes.
File Size
Size on disk (bytes) from Windows/DOS
Standard deviation
Standard deviation for image pixels using:

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$, where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i$ and n = number of elements

Maximum standard deviation in 4 image quadrants

$\max\{s_1, s_2, s_3, s_4\}$, where $s_1 \dots s_4$ are the standard deviations of the four image quadrants  (Equation 3)

Variance
Variance of image pixels = $s^2$; squaring this term emphasises different parts.

Maximum variance in 4 image quadrants

$\max\{s_1^2, s_2^2, s_3^2, s_4^2\}$, the maximum of the variances of the four image quadrants  (Equation 4)

Entropy
While this term is also used in the field of thermodynamics, its use here refers to the image processing context of describing the probability of each possible grey level occurring in an image (for greyscale images this is 0 to 255; for binary images, 0 and 255 only):

$h = -\sum_{g=0}^{255} p(g)\log_2 p(g)$, where $p(g) = \frac{(\text{pixels})_g}{256 \times 256}$

and $(\text{pixels})_g$ = no. of pixels at that greylevel.
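As a concrete illustration, a minimal Python sketch of this entropy measure (assuming the image is an 8-bit greyscale NumPy array; the function name is for illustration only):

    import numpy as np

    def grey_level_entropy(img):
        # p(g): probability of each grey level 0-255 occurring in the image.
        counts = np.bincount(img.ravel(), minlength=256)
        p = counts / img.size
        p = p[p > 0]                # zero-probability levels contribute nothing
        return float(-np.sum(p * np.log2(p)))

    # A binary image with equal numbers of 0 and 255 pixels gives h = 1 bit.
    img = np.zeros((256, 256), dtype=np.uint8)
    img[:, 128:] = 255
    print(grey_level_entropy(img))  # -> 1.0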
Number of edges
Sobel edge detection (Matlab version 6.5 Release 13) – horizontal and vertical edges.
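A rough Python analogue of this attribute (Matlab's edge() chooses its own threshold; the half-of-maximum threshold used here is an assumption):

    import numpy as np
    from scipy import ndimage

    def edge_count(img, frac=0.5):
        # Horizontal and vertical Sobel responses combined into a gradient
        # magnitude; pixels above frac * max magnitude count as edge pixels.
        g = np.hypot(ndimage.sobel(img.astype(float), axis=1),
                     ndimage.sobel(img.astype(float), axis=0))
        return int(np.count_nonzero(g > frac * g.max()))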
Number of segments
The image is segmented using quadtree decomposition; this segments an image on
the basis that a block is split into 4 smaller blocks if the maximum value in the block
minus the minimum value in the block is greater than a threshold (200/255 was
used). Block splitting continues until max value - min value is not greater than the
threshold. Blocks are then merged with neighbours if similar in value.
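A minimal sketch of the splitting step just described (the final neighbour-merging step is omitted; names are illustrative):

    import numpy as np

    def quadtree_leaves(img, threshold=200):
        # Split a block into 4 quadrants while its spread (max - min)
        # exceeds the threshold; return (row, col, size) leaf blocks.
        leaves = []
        def split(r, c, size):
            block = img[r:r + size, c:c + size]
            if size > 1 and int(block.max()) - int(block.min()) > threshold:
                half = size // 2
                for dr in (0, half):
                    for dc in (0, half):
                        split(r + dr, c + dc, half)
            else:
                leaves.append((r, c, size))
        split(0, 0, img.shape[0])     # assumes a square, power-of-2 image
        return leaves

    img = (np.random.rand(256, 256) * 255).astype(np.uint8)
    print(len(quadtree_leaves(img)))  # number of segments before merging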
Fractal Dimension
The Box Counting Method [45] was used for binary images.
The image is covered with a grid of square cells with cell size r. Fractal dimension is
determined from functions of cell size as shown in Figure 5.7.
r = cell side length; N(r) = number of cells containing a portion of the image. The count is performed over a range of box sizes: 128x128, 64x64, 32x32, ..., 1x1. The fractal dimension is the slope of log(N(r)) plotted against log(1/r).
Figure 5.7: Calculating Fractal Dimension for Binary Images
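A compact sketch of the box counting calculation, assuming a square binary image with a power-of-two side length and a non-empty foreground:

    import numpy as np

    def box_counting_dimension(binary_img):
        n = binary_img.shape[0]
        sizes, counts = [], []
        r = n // 2                    # 128, 64, ..., 1 for a 256x256 image
        while r >= 1:
            # Cover the image with cells of side r and count those that
            # contain any foreground pixel.
            cells = binary_img.reshape(n // r, r, n // r, r)
            sizes.append(r)
            counts.append(int(cells.any(axis=(1, 3)).sum()))
            r //= 2
        # Fractal dimension = slope of log N(r) against log(1/r).
        slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)),
                              np.log(np.array(counts)), 1)
        return slope

    img = np.random.rand(256, 256) > 0.5   # stand-in binary image
    print(box_counting_dimension(img))     # close to 2 for a filled image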
Fractal dimension for greyscale images (refer Figure 5.8) was determined from an analysis of a pixel's environment at different square size r [45]:
• Min & max grey values are determined within a square of size r and assigned to the central pixel, giving a 2D max function and a 2D min function for each square size r
• The difference in volume between the max and min functions is determined for the entire image, V(r):

$V(r) = \int_{x=1}^{256}\int_{y=1}^{256}\int_{z=f_1(x,y)}^{z=f_2(x,y)} dz\,dy\,dx$  (Equation 5)

where the boundaries in the x-plane and y-plane span the image, and for every (x,y) in the region z may extend from the lower (min) surface $f_1(x,y)$ to the upper (max) surface $f_2(x,y)$. The slope of ln(V(r)) plotted against ln(r) over a range of square sizes (r = 5, 7, 9, ...) then gives:

fractal dimension = 3 - (slope/2)

Figure 5.8: Calculating Fractal Dimension for Greyscale Images
Image similarity and symmetry
Three measures were used for image internal similarity (exact match across x and y axes) and image symmetry (mirror match across x and y axes):
1. Exact pixel match (Fig 5.9) - no sub-block analysis (same result operating on big
or small block)
• Exact pixel match across y-axis (same)
• Exact pixel match across y-axis (mirror)
• Exact pixel match across x-axis (same)
• Exact pixel match across x-axis (mirror)
Figure 5.9: Determining image similarity and symmetry – pixel matching
2. Shaded pixel difference between blocks - 5 level subblock analysis (objects
might be in a different position within a block)
3. Average pixel value - 5 level sub-block analysis
For the sub-block analysis used in measures (2) and (3) above, five levels were used as depicted below.

LEVEL:      1    2    3    4      5
CONFIG:     2x2  4x4  8x8  16x16  32x32
No. PIXELS: 128  64   32   16     8

Finer sub-block levels are weighted more heavily (weight = 128 / block-pixels).
Figure 5.10: Determining image similarity and symmetry – pixel difference and average value
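One plausible reading of measure (1) in Python (assuming even image dimensions; "same" compares the two halves directly, "mirror" reflects one half first):

    import numpy as np

    def exact_match_scores(img):
        # Fraction of exactly matching pixels between the two image halves.
        h, w = img.shape
        left, right = img[:, :w // 2], img[:, w // 2:]
        top, bottom = img[:h // 2, :], img[h // 2:, :]
        return {
            "sim_y": float(np.mean(left == right)),           # same, across y-axis
            "sym_y": float(np.mean(left == right[:, ::-1])),  # mirror, across y-axis
            "sim_x": float(np.mean(top == bottom)),           # same, across x-axis
            "sym_x": float(np.mean(top == bottom[::-1, :])),  # mirror, across x-axis
        }

    img = (np.random.rand(256, 256) * 255).astype(np.uint8)
    print(exact_match_scores(img))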
There were two approaches to developing the metric to then compare predictions
against subjective data:
1. develop a metric for each image quality class (25x25 greyscale, 256x256 binary
etc) to be applied only to images of that quality
2. develop a metric that is stable across all image quality classes (not just the ones
used in these tests).
5.3.2 Metric Development for a Specific Image Quality Class
Stepwise regression was used to search for the optimum subset of variables. The
procedure was based on sequentially introducing variables into a regression model
one at a time and testing the significance of all variables at each stage.
15 image attributes were considered for the visual information model. The addition
of any single variable from the above list will increase the regression sum of squares,
or SSR (amount of variation in y-values explained by the model) and reduce the error
sum of squares (variation about the regression line). The use of unimportant
variables reduces the effectiveness of the model by increasing the variance of the
estimated response.
The stepwise regression procedure, taken from [103] was as follows:
STEP 1
Simple linear regression was performed with each variable. The variable giving the
largest regression sum of squares, or largest value of R2, with significance (tested
using the F-statistic) was chosen as the initial variable, x1 say.
STEP 2
Each variable was inserted along with x1. The variable giving the largest significant
increase in R2, in the presence of x1, over the R2 found in step 1 was then selected as
x2.
This process was continued until the most recent variable inserted failed to induce a
significant increase in the explained regression. Such an increase was determined
using the F-test.
It was quite possible that a variable entering the regression equation at an early stage
might have been rendered unimportant or redundant because of relationships that
exist between it and other variables entering the later stages. Therefore at each stage
in which a new variable was entered in the regression equation through a significant
increase in R2 as determined by the F-test, all the variables already in the model were
subjected to F-tests in light of this new variable, and were deleted if they did not
display a significant F-value. The procedure was continued until a stage was reached in which no additional variables could be inserted or deleted.
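A hedged Python sketch of this forward/backward procedure (helper names are hypothetical; note that the thesis samples were very small, n = 7, so the partial F-tests have few degrees of freedom):

    import numpy as np
    from scipy import stats

    def sse(X, y):
        # Error sum of squares for a least-squares fit with intercept.
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return float(resid @ resid)

    def partial_f_p(X_small, X_big, y):
        # Significance of the increase in explained regression when one
        # variable is added: F-test with (1, n - k - 1) degrees of freedom.
        n, k = len(y), X_big.shape[1]
        f = (sse(X_small, y) - sse(X_big, y)) / (sse(X_big, y) / (n - k - 1))
        return 1.0 - stats.f.cdf(f, 1, n - k - 1)

    def stepwise(X, y, alpha_in=0.05, alpha_out=0.05):
        selected = []
        while True:
            trials = [(partial_f_p(X[:, selected], X[:, selected + [j]], y), j)
                      for j in range(X.shape[1]) if j not in selected]
            if not trials or min(trials)[0] > alpha_in:
                break
            selected.append(min(trials)[1])
            # Re-test earlier variables in light of the new one; delete
            # any that no longer show a significant F-value.
            for j in list(selected[:-1]):
                rest = [k for k in selected if k != j]
                if partial_f_p(X[:, rest], X[:, selected], y) > alpha_out:
                    selected.remove(j)
        return selected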
Model development and comparisons of metric predictions with subjective data are
illustrated below for 2 image quality classes: 256x256 greyscale and 10x10 binary,
which represent the two extremes of the image quality classes tested.
5.3.2.1 Example 1: Construction of a model for the 256x256 greyscale quality image set
The table below shows sample (Pearson product-moment) correlation coefficients (=
Multiple R) for each variable along with the level where the model is significant as
tested by the F statistic.
                 y1 – Images presented at one time           y2 – Paired comparison tests
Variable         Ref   R      Level where significant       Ref   R      Level where significant
                              for F(1,7-1-1) test                        for F(1,7-1-1) test
File size        1     0.74   0.06                          A     0.63   0.13
SD               2     0.74   0.06                          B     0.71   0.08
Quad max SD      3     0.64   0.12                          C     0.58   0.18
Variance         4     0.74   0.06                          D     0.71   0.08
Quad max var     5     0.63   0.14                          E     0.56   0.20
Entropy          6     0.56   0.20                          F     0.43   0.33
Edges            7     0.90   0.01                          G     0.86   0.02
Segments         8     0.28   0.55                          H     0.20   0.68
Sim_pixels       9     0.08   0.87                          I     0.06   0.90
Sim_shaded       10    0.35   0.45                          J     0.23   0.63
Sim_mean         11    0.70   0.08                          K     0.65   0.12
Sym_pixels       12    0.10   0.84                          L     0.00   1.00
Sym_shaded       13    0.43   0.34                          M     0.28   0.54
Sym_mean         14    0.70   0.08                          N     0.68   0.10
Fractal dim      15    0.76   0.05                          O     0.70   0.08
Table 5.4: Correlation coefficients for variables considered for metric for 256x256 greyscale images
Initially a model will be developed for y1 data, where images were presented to
subjects at one time. This will be compared to y2 – Paired Comparison data. In the
discussion that follows, the reference numbers (column 2) or letters (column 5) for
the image attributes tabulated above are contained within angle brackets.
We start with Edges <7> – this variable has the highest regression sum of squares
SSR, correlation coefficient R and R2. The model is significant at the 0.005 level
(highly significant).
We now test all variables with Edges already in the model. We need to find the
largest increase in SSR, in the presence of Edges, over the SSR found for Edges
alone. ie. We need to find the variable xj, for which R(βj|β7) = R(β7,βj) - R(β7) is
largest, where R(β) denotes regression sum of squares for a model with variable β.
The combination of Edges and Entropy <7,6> is significant at the 0.006 level and has
the highest increase in SSR above the model with Edges alone. This SSR increase is
significant at the α=0.084 level (F(1,7-2-1) test). In order for this increase to be
significant at the α=0.05 level we would need the sample size to be 12, not 7. Now
when subjecting edges in the presence of entropy to a significance test ie. R(β7|β6),
P = 0.005, which is highly significant, so Edges can be retained. Thus looking at the
α=0.1 level of confidence Entropy can be included along with Edges.
We now require checking all other variables with Edges and Entropy already in the
model ie. R(βj|β7,β6) = R(β7,β6,βj) - R(β7,β6). The combination of Edges, Entropy
and Sym_mean <7,6,14> gives a model significance of 0.017 and the largest increase
in SSR. However this increased regression is only significant at the α=0.255 level of
confidence (F(1,7-3-1) test). Other variable models are significant at the following
levels of confidence: α=0.012 <14>, α=0.062 <7>, and α=0.008<6>. Thus if
considering variables up to the α=0.26 level then this third variable can be included.
For four variables, we require R(βj|β7,β6,β14) = R(β7,β6,β14,βj) - R(β7,β6,β14), and
need to check increases in SSR with a F(1,7-4-1) test. The largest increase in SSR is
with variable <9> - Sim_pixels. However this significance is very low: P = 0.466!
Other variable models are significant at α=0.017 <9>, α=0.034 <14>, α=0.098 <7>,
and α=0.025 <6> levels of confidence. Thus if considering variables up to the α=0.5
level then this fourth variable can be included (overall model significance = 0.067).
A check will also be made using the variable with the second highest R2 value as the
first term in the model. We start with Fractal Dimension <15>, which has a
correlation coefficient of 0.76 and an F statistic that is significant at the α=0.05 level.
5.3 Information Content Model Fitting
2 variable model: Fractal Dimension + Standard Deviation <15,2> gives the highest
increase in SSR, which is significant at the α=0.06 level.
3 variable model: Fractal Dimension + Standard Deviation + Sim_mean <15,2,11> is
significant at the α=0.195 level. Also Fractal Dimension + Standard Deviation +
Variance <15,2,4> is easier to compute and is significant at the α=0.215 level.
4 variable model: <15,2,4,14> is significant at the α=0.23 level, while <15,2,11,9> is
significant at the α=0.184 level of confidence.
5 variable model: <15,2,11,9,10> is significant only at the α=0.419 level of
confidence.
These models with low significance (α > 0.05, representing low confidence) are of limited use in developing a metric to predict information content.
For completeness a check is made on using the variable with the third highest R2
value as the first term in the model. We start with File size <1>, which has a
correlation coefficient of 0.74 and an F statistic that is significant at the α=0.055
level.
2 variable model: File size + Edges <1,7> is significant at the α=0.173 level, which
is clearly inferior to models proposed above. Thus there is no need for further
analysis down this path.
Several candidate models were proposed above with varying numbers of model
terms. A simple model is a consideration that cannot be ignored but it is not desired
to underfit the model. The Cp statistic can be used to consider compromise between
excessive bias incurred when one underfits the model (chooses too few model terms)
and excessive prediction variance when one overfits (has redundancies in the model).
The Cp statistic is a function of the total number of parameters (p) in the candidate
model and the error mean square:
$C_p = p + \frac{(s^2 - \sigma^2)(n - p)}{\sigma^2}$  (Equation 6)

where $\sigma^2$ is the error mean square for the most complete model, $s^2$ is the error mean square for the candidate model, and p is the number of model parameters.
Cp > p indicates a model that is biased due to being an underfitted model, while Cp ≈
p indicates a reasonable model.
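Equation 6 in code form. As a check, taking σ² as the error mean square of the most complete model in each family of the candidate table below (e.g. 164.32 with p = 5 for the first family) reproduces the tabulated Cp values:

    def mallows_cp(s2, sigma2, n, p):
        # Cp = p + (s^2 - sigma^2)(n - p) / sigma^2   (Equation 6)
        return p + (s2 - sigma2) * (n - p) / sigma2

    # e.g. the <7,6> model: s^2 = 190.23, p = 3, n = 7 observations,
    # sigma^2 = 164.32 from the most complete model <7,6,14,9>:
    print(round(mallows_cp(190.23, 164.32, 7, 3), 1))  # -> 3.6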
Candidate models are listed in the table below.

Variables      No. of parameters  R2    Error mean square  Model significance  Sig. of increased regression  Cp
7              2                  0.82  352.33             0.005               -                             7.7
7,6            3                  0.92  190.23             0.006               0.084                         3.6
7,6,14         4                  0.95  153.22             0.017               0.255                         3.8
7,6,14,9       5                  0.97  164.32             0.067               0.466                         5.0
15             2                  0.58  803.41             0.046               -                             20.5
15,2           3                  0.84  374.44             0.024               0.060                         7.8
15,2,4         4                  0.91  274.58             0.041               0.215                         5.8
15,2,4,14      5                  0.96  170.58             0.070               0.235                         5.0
15             2                  0.58  803.41             0.046               -                             38.2
15,2           3                  0.84  374.44             0.024               0.060                         14.4
15,2,11        4                  0.92  259.74             0.038               0.195                         9.0
15,2,11,9      5                  0.97  130.38             0.053               0.184                         5.7
15,2,11,9,10   6                  0.99  97.52              0.170               0.419                         6.0
Table 5.5: Candidate models for metric for 256x256 greyscale images
The best model is <7,6> - Edges and Entropy, which has a high R2, high overall
model significance, significant increase in regression over <7> and a Cp statistic that
indicates a reasonable model (not underfitted or overfitted). The simple model of
<7> - Edges has high significance but the Cp value indicates it is underfitted.
Thus a modelling function f is proposed such that:
Information Content = f(Edges, Entropy)
Actual Equation: Information Content = 295.6 – (0.047*Edges) – (23.43*Entropy)
Prediction performance will still be checked with the simpler underfitted model:
Information Content = f(Edges)
Actual Equation: Information Content = 195.5 – (0.052*Edges)
We also wished to compare the two types of experiment data, where images were
presented at one time (y1) against paired comparison data (y2). So for a model based
on y2- paired comparison experiments, we start with Edges <G> which has the
highest SSR and significant F statistic. The 2 variable model of Edges + SD <G,B>
gives the highest increase in regression. While the overall 2 variable model is
significant at P=0.047, the increase in regression over the 1 variable model is not
highly significant (P=0.407). The 2 variable model Edges + Entropy <G,F> which is
equivalent to the <7,6> model developed for y1 data, has an overall significance of
P=0.048 but the regression increase is not significant over the 1 variable model
(P=0.422).
Model Performance of 256x256 greyscale metric
The model was built using the responses from subjects viewing the Actual Objects
test set, containing 7 images of that particular image quality class. Subjective
responses were also collated in the experiment for other image sets for that quality
class. Example test stimuli for this data collection are shown in Appendix Section
C.2. The predictive performance of the model is now checked against rankings of
the additional test image sets. The additional test sets comprise only 3 images each.
Model predictions are presented over in Table 5.6.
Using the model Information Content = f(Edges, Entropy), the dominant ranking
as selected by 25 test respondents, was only predicted for 1/7 test cases: the number
of objects in scene. This was a very strong preference with 96% of respondents
selecting this order. Using the simpler model Information Content = f(Edges), the
dominant ranking was predicted for 2/7 test cases: {the number of objects in scene}
+ {contrast between objects & surround}.
An additional example will be presented outlining the construction of a metric
specifically for the 10x10 Binary image set to determine if there is similar poor
predictive performance.
Dominant patterns arising from tests: for each dimension, the table lists the number of test responses in the dominant pattern, whether the model correctly predicted that dominant ranking, the number of test responses matching the model, and the other quality classes whose dominant pattern was predicted by the model.

Number of Objects in Scene: 24/25 in dominant pattern; predicted by model: Y; 24/25 responses same as model; other classes with model-predicted dominant pattern: 6/9 (10B, 16B, 25B, 256E, 256B, 256F/G)
Angle of Object: 13/25; predicted: N; 3/25 same as model; 3/9 (25B, 16B, 10B)
Distance to Object: 17/25; predicted: N; 1/25 same as model; 4/9 (256B, 25F, 16B, 10B)
Connection between image objects: 12/25; predicted: N; 0/25 same as model; 0/9
Image Detail: 16/25; predicted: N, but same image ranked #1; 1/25 same as model; 1/9 (256B)
Contrast between objects and surround: 14/25; predicted: N; 8/25 same as model; 2/9 (16B, 25B)
Variety of Object types: 11/25; predicted: N; 3/25 same as model; 0/9
Table 5.6: Model Predictions of 256x256 Greyscale Image Set using model: f(Edges, Entropy)
5.3.2.2 Example 2: Construction of a model for the 10x10 Binary image set
Candidate models are listed in the table below.

Variables      No. of parameters  R2    Error mean square  Model significance  Sig. of increased regression  Cp
8              2                  0.88  337.29             0.002               -                             46.41
8,12           3                  0.96  127.10             0.001               0.038                         13.90
8,12,10        4                  0.99  70.66              0.003               0.133                         7.21
8,12,10,15     5                  1.00  31.91              0.009               0.164                         4.87
8,12,10,15,11  6                  1.00  34.13              0.082               0.522                         6.00
Table 5.7: Candidate models for a metric for 10x10 binary images
Model <8,12> - Segments & Sym_pixels has a significant increase in regression over
<8> but a high Cp statistic. Model <8,12,10,15> - Segments, Sym_pixels, Sim_shaded,
& Fractal dimension has a better Cp value but the increased regression over simpler
model is not as significant as Model <8,12>.
Using the model Information Content = f(Segments, Sym_pixels), the dominant
ranking as selected by 25 test respondents, was predicted for 4/7 test cases. The
more detailed model Information Content = f(Segments, Sym_pixels,
Sim_shaded, Fractal dim) predicted 3/7 cases.
Using the models developed for the 256x256 greyscale test set on the 10x10 Binary image sets yielded 4/7 correct predictions for Information Content = f(Edges, Entropy), and 3/7 correct predictions for Information Content = f(Edges).
That is, similar predictive performance was experienced between a model developed
specifically from data from that image quality class (10x10 binary) and data from
much higher quality images. Thus there is motivation to pursue the development of
one metric applicable across all image quality classes.
5.3.3 Information Content Metric for all Image Quality Classes
It was desired to develop a metric that was stable across all image quality classes, as
predictive power of specifically tailored metrics for a particular image quality class
appears arbitrary as described above. Again subjective rankings from one of the
eight visual dimensions (Actual Objects) were used to construct this global metric.
The metric was then validated against the subjective results of the other 7
dimensions.
Correlations between the 15 image attributes discussed above and perceived
information content rankings are shown over in Figure 5.11. The vertical axis
represents average correlations for the two different ranking schemes used: Images
presented all at once and Paired Comparison tests. The stepwise regression model
developed in the previous section systematically grouped attributes based on
increased significance of regression. Now it is desirable to choose one attribute
applicable for all image quality classes. From the plots of Fig 5.11, it is evident that
the “Edge” attribute features in the uppermost one or two correlation curves for both
binary and greyscale images at all spatial resolutions tested in the experiment. Thus
edges are proposed as a dominant indicator of information content across both low
and high image quality classes.
This determination supports Marr’s emphasis of zero crossing (edge) detection in
producing images of the external world [56]. The role of edges in scene recognition
and interpretation was discussed in Section 3.5.2.4.
A metric based on the number of edges in an image is now validated by comparing
metric predictions with perceived information rankings for the remaining 7 data sets.
63 dominant viewer rankings were compared – 7 visual dimensions x 9 image quality
classes.
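As an illustration of applying the edge-based metric, a small sketch that ranks an image set by Sobel edge count (random arrays stand in for the test images; the counter repeats the assumed thresholding of the earlier sketch):

    import numpy as np
    from scipy import ndimage

    def edge_count(img, frac=0.5):
        g = np.hypot(ndimage.sobel(img.astype(float), axis=1),
                     ndimage.sobel(img.astype(float), axis=0))
        return int(np.count_nonzero(g > frac * g.max()))

    # Rank a set of images from highest to lowest predicted information
    # content, for comparison against the dominant viewer ranking.
    images = {name: (np.random.rand(256, 256) * 255).astype(np.uint8)
              for name in ("image1", "image2", "image3")}
    ranking = sorted(images, key=lambda k: edge_count(images[k]), reverse=True)
    print(ranking)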
[Two plots - Greyscale Images and Binary Images: correlation (0 to 1) between each of the 15 image attributes (Filesize, Standard Dev, quad max SD, Variance, quad var, Entropy, Edges, Segments, Sim_pixels, Sim_shaded, Sim_mean, Sym_pixels, Sym_shaded, Sym_mean, Fractal dim) and perceived information content, plotted against spatial resolution (10x10, 16x16, 25x25, 256x256).]
Figure 5.11: Correlation between 15 image attributes and perceived information content.
The performance of the edge metric in predicting subjective dominant viewer
patterns is shown over in Table 5.9 and in summary form below in Table 5.8.
Strength and number of cases      Frequency of image with       Frequency of exact
for dominant viewer patterns      highest info content being    ranking being
(63 in total)                     predicted by metric           predicted by metric
90-100%: 3                        100%                          100%
80-89%: 4                         75%                           75%
70-79%: 1                         100%                          100%
60-69%: 12                        67%                           25%
50-59%: 6                         83%                           50%
40-49%: 16                        38%                           19%
30-39%: 19                        32%                           21%
20-29%: 2                         100%                          100%
10-19%: 0                         -                             -
0-9%: 0                           -                             -
Table 5.8: Summary of metric performance
Out of the 63 test cases examined, three cases had 90% or above consensus from
subjects viewing the sample set. For each of these cases, the metric successfully
predicted not only which of the 3 images had the highest information content (2nd
column above) but also the ranking order chosen by subjects (3rd column above).
Metric performance at weaker subject consensus levels are also shown.
There were several cases where the metric prediction in the 2nd column above was low. However, this was for cases where there was low consensus amongst the sample regarding the preferred ranking order: ie. if human subjects could not agree unanimously on a preferred ranking order, it was difficult to expect a metric to do so. What is important is whether the metric could predict those cases where there was strong agreement among the sample.
Predictive Metric Performance
A metric has been proposed from subject responses to one of eight visual dimensions (Actual Objects). Here it is validated against the other seven dimensions: Number, Angle, Distance, Connectivity, Detail, Contrast, Variety. For each quality class and dimension, the table lists the ranking of images 1, 2 and 3 chosen by the highest number of respondents, the number of respondents (out of 25) choosing it with the corresponding %, and whether the metric predicted it. "***Predicted" denotes the exact ranking of all 3 images being predicted by the metric; "#1 Predicted" denotes only the image with the highest info content (#1 in the rank order) being predicted. Strong viewer preferences (70% or more of respondents, ie. 18 or more out of 25, choosing that pattern) are marked [strong].

10 Bin: Number 3-2-1 (16, 64%, ***Predicted); Angle 3-1-2 (15, 60%, #1 Predicted); Distance 3-1-2 (10, 40%, ***Predicted); Connectivity 1-2-3 (9, 36%); Detail 1-3-2 (13, 52%, ***Predicted); Contrast 3-2-1 (10, 40%); Variety 1-2-3 (10, 40%)
10 F/G: Number 2-3-1 (12, 48%, #1 Predicted); Angle 1-2-3 (9, 36%); Distance 1-2-3 (8, 32%); Connectivity 2-1-3 (9, 36%); Detail 1-3-2 (16, 64%); Contrast 2-1-3 (11, 44%); Variety 1-2-3 (15, 60%)
16 Bin: Number 3-2-1 (22, 88%, ***Predicted) [strong]; Angle 3-1-2 (8, 32%, #1 Predicted); Distance 3-1-2 (7, 28%, ***Predicted); Connectivity 3-1-2 (8, 32%, #1 Predicted); Detail 1-2-3 (16, 64%, #1 Predicted); Contrast 1-2-3 (11, 44%, #1 Predicted); Variety 1-2-3 (12, 48%)
16 F/G: Number 1-2-3 (8, 32%); Angle 1-2-3 (10, 40%); Distance 1-2-3 (8, 32%); Connectivity 3-1-2 (9, 36%); Detail 1-3-2 (20, 80%) [strong]; Contrast 2-1-3 (16, 64%, #1 Predicted); Variety 1-2-3 (8, 32%)
25 Bin: Number 3-2-1 (23, 92%, ***Predicted) [strong]; Angle 3-1-2 (11, 44%, #1 Predicted); Distance 2-1-3 (8, 32%); Connectivity 1-2-3 (8, 32%); Detail 1-2-3 (10, 40%, ***Predicted); Contrast 1-2-3 (7, 28%, ***Predicted); Variety 1-3-2 (8, 32%)
25 F/G: Number 1-2-3 (9, 36%); Angle 2-3-1 (12, 48%); Distance 3-2-1 (16, 64%); Connectivity 1-2-3 (20, 80%, ***Predicted) [strong]; Detail 1-3-2 (15, 60%, #1 Predicted); Contrast 2-1-3 (14, 56%, #1 Predicted); Variety 1-3-2 (10, 40%)
256 Edges: Number 3-2-1 (21, 84%, ***Predicted) [strong]; Angle 1-2-3 (9, 36%); Distance 1-2-3 (12, 48%); Connectivity 3-1-2 (17, 68%); Detail 3-1-2 (12, 48%); Contrast 2-1-3 (18, 72%, ***Predicted) [strong]; Variety 2-3-1 (13, 52%)
256 Bin: Number 3-2-1 (24, 96%, ***Predicted) [strong]; Angle 3-2-1 (9, 36%); Distance 3-2-1 (9, 36%, ***Predicted); Connectivity 3-1-2 (9, 36%, ***Predicted); Detail 1-2-3 (9, 36%, ***Predicted); Contrast 3-2-1 (16, 64%, ***Predicted); Variety 2-3-1 (13, 52%, ***Predicted)
256 F/G: Number 3-2-1 (24, 96%, ***Predicted) [strong]; Angle 1-2-3 (13, 52%); Distance 1-2-3 (17, 68%); Connectivity 1-2-3 (12, 48%); Detail 3-1-2 (16, 64%, #1 Predicted); Contrast 2-1-3 (14, 56%, ***Predicted); Variety 1-3-2 (11, 44%)
Table 5.9: Predictive performance of metric proposed for all image qualities
Of main interest was metric performance for strong dominant viewer patterns in the
data. Eight of the 63 rankings had 70% or above consensus among viewers. These
were mentioned above in Section 5.2.6, and are reproduced below in Figure 5.12
with the inclusion of metric performance.
Number of Objects in Scene
5 image quality classes: 16x16 Binary (88%), 25x25 Binary (92%), 256x256 Edges (84%), 256x256 Binary (96%), 256x256 Greyscale (96%)
All 5 cases predicted by metric?: Yes

Closeness Between Image Objects
1 image quality class: 25x25 Greyscale set (80%)
Single case predicted by metric?: Yes

Image Detail
1 image quality class: 16x16 Greyscale set (80%)
Single case predicted by metric?: No
(Metric prediction: phone > 2 faces > single face)

Contrast Between Objects & Surround
1 image quality class: 256x256 Edge set (72%)
Single case predicted by metric?: Yes

Figure 5.12: Metric performance for strong viewer preferences (70% or above consensus among viewers), showing images ranked from highest to lowest perceived information content
The visual information metric predicted 7 of the 8 strong viewer preferences (70% or
above consensus level). Viewers of the 16x16 greyscale Image Detail set ranked a
simple face as containing most visual information, while the metric ranked the image
of the phone and two faces ahead of the single face. The familiarity and strong
recognition of the human face at low levels of image quality may cause viewers to
select it over others containing unrecognisable blobs.
The metric was found to work best with binary images, which are expected from at
least early prototype designs. (Limited greyscale may be possible by modulating
stimulus amplitude, frequency and pulse duration as discussed in Section 3.5.2.2).
The number of ranking cases where the metric was able to predict the image with the
highest information content is shown in Table 5.10 below. There are a total of seven
ranking cases for each image quality class, corresponding to each visual dimension
explored.
10x10 Binary set: 4/7      10x10 Greyscale set: 1/7
16x16 Binary set: 6/7      16x16 Greyscale set: 1/7
25x25 Binary set: 4/7      25x25 Greyscale set: 3/7
256x256 Binary set: 6/7    256x256 Greyscale set: 3/7
256x256 Edge set: 6/7
Table 5.10: The number of correct metric predictions of images with the highest information content
The metric’s weaker performance on greyscale images may be another reason why
its prediction for the 16x16 greyscale Image Detail set did not agree with the
ranking chosen by 80% of viewers. Table 5.10 shows that for 16x16 greyscale
images, the metric was successful in predicting the image with the highest
information content in only 1 out of 7 cases, whereas for 16x16 binary images the
prediction was correct for 6 out of 7 cases. It should be remembered that the
strength of the dominant patterns on which metric performance is assessed ranges
from 96% down to 28%. At high levels of viewer consensus, the metric is accurate
in predicting images with the highest information content, and is thus considered
acceptable for this application.
Therefore, it can be stated that visual information content in images can be quantified,
and a mechanism for achieving this with a reasonable level of performance has been
proposed here. However, does maximising information content in low quality
images result in enhanced perception of that image? In order to answer this question,
it is necessary to analyse the relationship between low quality image recognition and
information content in images; ie. is the measure for information content an adequate
pointer to how well an image might be recognised? This relationship is explored in
the next section.
5.4 Correlations Between Recognition Rate And Perceived Information Content
It was desired to determine if there was any relationship between recognition rates
and the amount of visual information as perceived by viewers.
Previous experiments described in Section 4.3 assessed perception performance for
these same images (ie. the ability of these images to be correctly recognised). The
subjects were first asked to describe the objects (eg. Appendix Section B.1) and then
to assess the images for the amount of visual information they contained (refer test
stimulus C.1). The questionnaire booklet design in Appendix Section C.4 includes a
section labelled “PART 3 Check if correlated with recogn”, which relates to this
correlation of information content with recognition. As with other aspects of this
experiment, care was taken in the booklet design to reduce learning effects, fatigue
and boredom.
Relationships between correct object recognition and subjective information content
scores were obtained for each image quality class. For example, Figure 5.13 shows
the relationship for 25x25 binary Paired Comparison experiments. The horizontal
axis shows a subjective score for visual information developed from the numeric
ranking scheme used (higher numeric score = higher perceived information content).
[Figure 5.13 plots recognition rate (proportion of the n=25 subjects correctly identifying the object, 0 to 1) against the subjective score for visual information content (0 to 150).]
Figure 5.13: Example relationship between recognition and information content (25x25 Binary Paired Comparison data)
The significance of these relationships was then assessed. Linear regression models
for each quality class were developed for two series of data:
1. where images were presented at one time
2. paired comparison data
The significance of the models and correlation coefficients appears in Table 5.11
below.
Image Quality Class | Images presented at one time: Correlation Coeff. R, Significance (F(1,7-1-1) test) | Paired Comparison: Correlation Coeff. R, Significance (F(1,7-1-1) test)
10x10 Binary    | 0.76, 0.05 | 0.76, 0.05
10x10 Greyscale | 0.54, 0.21 | 0.36, 0.43
16x16 Binary    | 0.69, 0.09 | 0.71, 0.07
16x16 Greyscale | 0.70, 0.08 | 0.61, 0.15
25x25 Binary    | 0.90, 0.01 | 0.85, 0.01
25x25 Greyscale | 0.75, 0.05 | 0.81, 0.03
256x256 Edge    | 0.70, 0.08 | 0.66, 0.10
256x256 Binary  | 0.69, 0.09 | 0.73, 0.06
Table 5.11: Correlation coefficients between recognition rate and perceived information content
There was some evidence of correlation between ranked information content and
recognition rates, with significance levels ranging from P=0.05 to P=0.1 for all but
the 10x10 greyscale image set. Thus if the information content of an image is
maximised (by maximising the number of edges), enhanced perception is expected.
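The regression analysis behind Table 5.11 can be reproduced with standard tools. The following is a minimal sketch using SciPy’s linear regression on illustrative (not actual) values for the seven images of one quality class; for a simple linear model the p-value reported by linregress corresponds to the significance test used above.

```python
from scipy import stats

# Subjective information scores and recognition rates for the seven
# images of one quality class (illustrative values, not the thesis data).
info_score = [20, 35, 55, 70, 90, 110, 130]
recog_rate = [0.10, 0.20, 0.45, 0.50, 0.70, 0.80, 0.90]

result = stats.linregress(info_score, recog_rate)
print(f"R = {result.rvalue:.2f}, P = {result.pvalue:.3f}")
```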
5.5 Chapter Summary
In the field of low quality vision, there is a need for delivering maximum scene
information to a limited number of display electrodes/pixels. In this chapter a
method is proposed to enhance recognition using importance maps weighted to
maximise the “information content” in the resulting importance map.
An experiment was described to quantify the term information content. 15 image
attributes were correlated with subjective rankings of visual information. Initially,
metrics were developed tailored to a specific image quality (eg. a 10x10 binary
metric). However, their inconsistent predictive performance led to the construction of a
metric that was stable across a wide range of image quality classes. The number of
edges in an image was found to be a dominant indicator of perceived information
content. An edge metric was tested on additional subjective data and found to be
appropriate in assessing information content. Finally it was shown that subjective
information content was significantly related to object recognition.
Thus it was possible to construct a model for basic information required for the
interpretation of a visual scene at low image quality.
This finding can now be applied to generating importance maps containing higher
information content. Chapter 8 compares such a method with others to determine
preferred presentation options.
Chapter 6 Scene Specific Imaging
6.1 Overview
As discussed in preceding chapters of this thesis, in order to make best use of the
limited number of electrodes in visual prostheses, it is proposed to first process
images to extract more information from the scene. One of the conclusions of the
preliminary experiments discussed in Chapter 4 was that it might be beneficial to
have device switchable processing for different scenes. This chapter then aims to
answer the research question:
Q4: Should the processing techniques be adjusted depending on the scene type?
Characteristics of several scene types are outlined in this chapter and then
categorised in image processing terms. A simple experiment involving 20 normally
sighted viewers tests if there is some benefit in scene-dependent processing to deliver
enhanced perception of low quality images.
6.2 Characteristics of Simple Scenes
This section lists characteristics of simple scenes that a patient fitted with a visual
implant may experience.
6.2.1 Office
Many office environments have fluorescent lighting. Work spaces might be defined
by partitions that are up to two metres high or floor-to-ceiling walls, or a
combination of walls and partitions. A person’s working range would be of the order
of one metre, but a visual range of five metres would be useful. Objects in the
environment may be located on a horizontal desk surface (phone, computer,
documents). Objects could be distinguished by intensity and colour contrast on the
desk. The rest of the office would probably not have much colour except for office
plants, pictures, and people. The user is mostly stationary in this environment.
6.2.2 Home
Although the viewer is familiar with the home environment, potential hazards
abound, including room and cupboard doors left open and obstacles on the floor.
The kitchen is usually comprised of a (reflective) sink, benches and cupboard.
Bedrooms would contain a bed, cupboard and dresser. Chairs and possibly a
television are found in a lounge room, while a dining room contains table and chairs.
Bathrooms may contain a bathtub, vanity unit and toilet.
6.2.3 Street
The viewer is likely to be moving in a streetscape environment, either as a vehicle
passenger or on foot. Possible obstacles include posts, street signs, curbs, people,
and construction works (including holes and fencing). Many edges are contained in
this man-made (constructed) environment, including footpaths and shopfronts. There
is limited colour variation, with the predominant building and ground colours being
greys, and pedestrians and parked cars represented by coloured regions. There is a
combination of natural lighting (sunlight) and shaded portions. The viewer may
require a working range of five metres (when moving) and a visual range of up to
fifty metres.
6.2.4 Outdoors
Outdoor environments contain natural scenes, such as trees, plants, grasses, bushes,
seats, beach, ocean. There are limited edges and natural lighting (sunlight). For park
areas there is limited colour variation (mostly greens) and alternating intensity from
shade and sun patches. For beach scenes there is high intensity glare, with many
reflections from water and white sand. Beach environments also usually have few
colours, with blue ocean and white or yellow sand. Smaller coloured regions in the
outdoor environment may correspond to people or signs. A viewer may require a
working range up to ten metres and a visual range as much as 100 metres.
6.2.5 Head and Shoulders
A special case of scene type exists for situations where the viewer is engaged in
conversation or communication with others. In these close contact situations the
viewer requires an image of the head and shoulders alone. The visual range need not
be more than two metres. Both the scene and viewer are mostly stationary. Faces
and other skin areas are detected within a range of pink hues, while the hair may be
darker (eg. browns and black).
6.2.6 Café/Restaurant
This scene type often has indoor lighting conditions (fluorescent/incandescent).
There are usually tables approximately one metre high separated by small spaces
(navigation gaps). Tables could be circular or rectangular. Chairs are positioned
around the tables with chair backs typically higher than the surface of the table.
Cutlery and plates may lie on the table, with glasses, cups or jugs projecting up to
200mm above the table surface. Cutlery and glass on the table are highly reflective.
A payment area may comprise a desk at waist-chest height with a horizontal top for
signing cheques etc. A viewer would need to be able to locate toilets and the café
exit. The exit is usually signed and may be a two metre high large rectangle of
different intensity, perhaps with a door.
6.2.7 Public Toilets
Gender differentiation is required for the viewer. The entrance to a public toilet is
often through doors, with a handle at waist height on the left or right of the door. A
90 degree or 180 degree turn is then made to the left or right, along a floor to ceiling
wall which is often tiled. Cubicles are built out from the wall and may be timber or
rendered concrete with doors containing a lock at waist height. There may be a
urinal, consisting of either separate units or a continuous unit along a wall sometimes
with a step up. The toilets should contain a wash basin with soap or a liquid soap
dispenser located above the basin. There may be a hand dryer/towel dispenser with a
waste bin located below.
6.3 Image Processing targeted to Scene Type
It is proposed to present maximum information to implant electrodes targeted to a
user’s environment. This can be achieved by applying varying processing routines to
the input images.
In this section the scene types mentioned in the previous section are described in
image processing terms to identify which image processing routines to apply. Table
6.1 shows these scene type descriptions.
SCENE | Motion in scene | Motion of viewer | Dominant colours | Colour variation | Number of edges | Edge types | Number of regions | Working range | Visual range
Office | Low | Low | White, pastel | Low-med | High | Straight | Med | 1m | 5m
Home | Low | Med | All | Low-med | Med | Straight, curved | Med | 2m | 5m
Street | High | High | Grey | Med | High | Straight | High | 5m | 50m
Outdoor | Low | Med | Green, white, yellow, blue | Low | Low | Curved | Low | 10m | 100m
Head & Shoulders | Low | Low | Pink hues | Low | Med | Curved | Low | 1m | 2m
Café | Med | Low | Silver | Med | High | Straight, curved | High | 1m | 2m
Toilets | Low | Med | White, silver | Low-med | High | Straight | Med | 2m | 4m
Table 6.1: Image processing descriptors of different scene types
In order to translate these scene descriptions into processing algorithms the scene
first needs to be categorised. There have been some advancements in the area of
automatic scene categorization. For example Chernyak and Stark [15] have
developed a model for sequential knowledge acquisition using Bayes’ theorem
(probability based). Segment features, such as average colour, aspect ratio and
position are obtained from training sets of images covering scene categories (eg.
“office”, “construction”, “children playing”). The algorithm then attempts to guess
the scene category of a test image. Although useful for robot applications designed
to minimize human intervention, automatic categorization would not be essential for
visual prostheses applications. Human users would presumably be aware of their
environment, and would be able to manually select the scene type suited to their
surrounds.
Once the scene type is known, it is proposed to apply context/scene dependent
importance weighting to the image. Chapter 5 described a means to manipulate
image content using Importance Weighting to produce higher information content in
the image conveyed to prosthesis electrodes. This chapter proposes a similar method
of image manipulation again using the Importance Map method (refer Section
3.5.2.6). However, here it is proposed that weights for feature maps are selected
according to their scene type, and proposed feature weights for several scene types
are shown below in Table 6.2. Rather than all features having the same effect on the
resulting importance map, we propose to vary their contribution depending on the
scene type. The percentage weights shown in Table 6.2 indicate the weight to apply
to that feature map to produce the resulting importance map. A 50% level would
indicate a neutral leaning/bias.

ATTENTIONAL FEATURE: Closeness (foreground 100% vs background 0%) | Intensity Contrast (lots of contrast 100% vs little contrast 0%) | Shape (long & skinny 100% vs broad & round 0%) | Size (large regions 100% vs small regions 0%) | Viewing Area (most viewing in central view 100% vs periphery 0%)

SCENE | Closeness | Intensity Contrast | Shape | Size | Viewing Area
Office | 90% | 70% | 90% | 25% | 100%
Home | 70% | 90% | 50% | 50% | 95%
Street | 25% | 20% | 100% | 50% | 50%
Outdoors | 25% | 80% | 10% | 100% | 50%
Head & Shoulders | 100% | 80% | 50% | 25% | 100%
Café | 100% | 20% | 50% | 50% | 80%
Toilets | 90% | 30% | 10% | 50% | 80%
Table 6.2: Attentional feature weights for each scene type
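As a concrete illustration of this scene-weighted combination, the following is a minimal sketch assuming five normalised feature maps (values 0 to 1) ordered as in Table 6.2; the dictionary simply transcribes the table’s weights, and the final renormalisation step is an assumption about how the weighted sum is scaled back to the 0–1 range.

```python
import numpy as np

# Feature weights per scene type, transcribed from Table 6.2:
# (closeness, intensity contrast, shape, size, viewing area)
SCENE_WEIGHTS = {
    "office":         (0.90, 0.70, 0.90, 0.25, 1.00),
    "home":           (0.70, 0.90, 0.50, 0.50, 0.95),
    "street":         (0.25, 0.20, 1.00, 0.50, 0.50),
    "outdoors":       (0.25, 0.80, 0.10, 1.00, 0.50),
    "head_shoulders": (1.00, 0.80, 0.50, 0.25, 1.00),
    "cafe":           (1.00, 0.20, 0.50, 0.50, 0.80),
    "toilets":        (0.90, 0.30, 0.10, 0.50, 0.80),
}

def importance_map(feature_maps, scene):
    """Combine five normalised feature maps into one importance map
    using the scene-specific weights of Table 6.2, then renormalise
    so the result again lies in the range 0..1."""
    weights = SCENE_WEIGHTS[scene]
    combined = sum(w * np.asarray(fm, dtype=float)
                   for w, fm in zip(weights, feature_maps))
    peak = combined.max()
    return combined / peak if peak > 0 else combined
```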
6.4 Subjective Tests for Scene Weighted Processing
It was desired to test the proposal that context or scene weighted processing can
improve perception of low quality images. The images shown in Figure 6.1 taken
from the test stimulus in Appendix Section D.1 were shown to 20 normally sighted
volunteers. Two images were shown, one representing an "outdoor" scene, and the
other an "office" scene. Low quality versions of the images were presented
alongside, representing quality levels typical of current prosthesis prototypes (25x25
spatial resolution, binary images). Booklet design is shown in Appendix Section
D.2. Four low quality versions of the original were shown:
1. subsampled to 25x25 and binarised – this represents the standard or base case
level of image processing present in most implant designs (no importance
processing);
2. subsampled, importance processing with features weighted equally, then
binarised;
3. subsampled, importance processing with features weighted according to the
correct scene type (eg. for the lighthouse image, weights selected for "outdoor"
scenes in Table 6.2), then binarised;
4. subsampled, importance processing with weights applied corresponding to a
different scene type (eg. applying "office" weights to lighthouse image), then
binarised.
Figure 6.1: Visual stimuli used to gauge perception of low quality images. A = original 256x256 image; B = subsampling to 25x25 binary; C = importance mapping with all feature weights equal; D = importance mapping with weights selected for the “outdoor” scene type; E = importance mapping with “office” weights
Participants were asked to rank the images for how best (ie. most informatively) they
represented the original scene. The images nominated as best representing the
originals were as per Table 6.3.
Lighthouse image: “outdoor” weights 10/20 (50%); “office” weights 7/20 (35%); no processing 3/20 (15%); equal weights 0/20 (0%)
Chair image: no processing 12/20 (60%); “office” weights 6/20 (30%); equal weights 2/20 (10%); “outdoor” weights 0/20 (0%)
Table 6.3: Preferred ranking for image representation
For the lighthouse image, half of the sample size (10/20) chose the “outdoor”
weighted image as best representing the original scene. This is in line with the
expectation that improved perception may be obtained with processing images with
respect to scene type. However for the chair image, most respondents found the
image with no importance processing was better at representing the original image.
Feedback from participants suggested that this image was closest in grey level values
to the original, and if the importance-processed images had been inverted (ie. a black
chair on a white background) they would have chosen that image. For a subsequent
thesis experiment described in the next chapter, the importance-processed images
were inverted to be in a similar form to the original.
It should be noted that the image inversion recommendation arises from experiments
with sighted viewers with sophisticated expectations. This may not be the same for
visually impaired persons with a simplified understanding of the world. Other
criteria relating to electrode stimulation may dominate once these systems are more
common. This might include always stimulating the smaller number of electrodes
irrespective of dominant greylevel information to avoid long term tissue damage, or
sharing of electrodes to obtain longer life from the electrode arrays.
6.5 Chapter Summary
By first processing images presented to implant electrodes, it may be possible to
provide enhanced presentation compared with subsampling alone. In this chapter, some of the
characteristics of simple scenes have been listed and described in image processing
terms. A simple experiment was conducted to determine the effect of processing
images depending on the scene type. Expectations were confirmed for a test image
containing an outdoor scene, but not for an office scene. Image inversion to best
match an original high quality scene is recommended.
This chapter aimed to address the following research question:
Q4: Should the processing techniques be adjusted depending on the scene type?
Experiment results show that improved perception may be obtained with processing
images with respect to scene type. Scene dependent importance mapping is a
powerful tool to use in the automatic optimisation of low quality images for human
viewers. The next chapter compares this method with others to determine preferred
presentation options.
Chapter 7 A comparison of ROI methods for low quality images
7.1 Overview
So far the thesis has reviewed image processing strategies for presenting the most
useful information to implanted users, given the severe information loss imposed by
low quality displays.
The previous two chapters described methods of manipulating image content based
on a region of interest image processing technique known as Importance Maps.
Chapter 5 concerned adjusting feature weights so that the resulting Importance Map
contains maximum scene information. Chapter 6 proposed that feature weights are
set for the particular scene type. Experiments described here in Chapter 7 aim to
assess these proposals along with other methods to determine which method best
helps users move through a scene.
The work in this chapter aims to extend upon the findings of Chapter 4 concerning
the research question:
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
It is anticipated that ROI processing will trim away unnecessary information
resulting in improved perception.
The region-of-interest/importance framework of the processing used in this Chapter
was described in Section 3.5.2.6. The following section describes the comparison
experiment, showing the types and range of images used and instructions given to
participants. Results follow which indicate a clear trend. Two experiments were
conducted:
1. Presentation of entire region-of-interest processed image
2. Presentation of only the salient area found from ROI processing – a digital zoom.
7.2 ROI Processing applied to Entire Image
7.2.1 Image Preparation
Images used in the tests were prepared as shown in Fig 7.1.
[Figure 7.1 is a flowchart: a 256x256 test image is processed by one of methods 1) – 6) described in the text below, resized to 25x25 (nearest neighbour), histogram equalised (one test set only), thresholded at the 128 grey level, and possibly inverted to appear like the original.]
Figure 7.1: Image preparation
Six image processing methods were tested: four variations of Importance Mapping,
edge detection and a non-processed ‘base-case’. In all methods the final image was
Nearest-Neighbour resized to 25x25 spatial resolution which is representative of
electrode numbers in prosthesis prototypes. One test case had greylevels equalised
before thresholding at the 128 grey level, while a second test set had only
thresholding at the 128 level with no histogram equalisation. Histogram equalisation
spreads the grey levels out across the full greyscale range and it is intuitive to apply
this equalisation to use the full dynamic range obtainable from the image. Other
methods of histogram transformations (eg stretch, uniform) may be considered, but
as the eventual image is reduced to very few shades, the differences are unlikely to
be influential. Histogram equalisation depends on illumination and object shades of
grey and can introduce spurious shadings in the thresholded image which do not
actually represent image objects (see Fig 7.2). Hence it was desired to find if there
were any differences in preferred processing algorithm using histogram equalisation.
Figure 7.2: Shaded areas do not necessarily correlate with scene objects in images that have had grey levels equalised (panels: original, followed by 25x25 thresholded versions; note the right-most image).
Finally, the thresholded images were compared with the original 256x256 test image
and inverted if necessary to most closely match the original grey level. This
inversion was the result of recommendations made after previous experiments
described at the close of Chapter 6.
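The preparation pipeline of Fig 7.1 is simple enough to sketch in full. The following is a minimal sketch assuming Pillow and NumPy; the equalisation step is the standard cumulative-histogram mapping, and the inversion decision is left to the caller (in the experiments it was made by comparison with the original image).

```python
import numpy as np
from PIL import Image

def prepare_stimulus(path, equalise=True, invert=False, size=25):
    """Sketch of the Fig 7.1 pipeline: load a greyscale test image,
    resize to size x size with nearest-neighbour sampling, optionally
    histogram-equalise, threshold at grey level 128, and optionally
    invert to match the polarity of the original."""
    img = Image.open(path).convert("L").resize((size, size), Image.NEAREST)
    pixels = np.asarray(img, dtype=np.uint8)
    if equalise:
        # spread grey levels so the cumulative histogram is linear
        hist = np.bincount(pixels.ravel(), minlength=256)
        cdf = hist.cumsum()
        lut = np.round(255.0 * (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1))
        pixels = lut.astype(np.uint8)[pixels]
    binary = pixels >= 128
    return ~binary if invert else binary
```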
7.2.2 Processing Methods Compared
Six methods of processing an original high quality image were compared:
Figure 7.3: Processing methods used in tests (panels: original; 1) IM eq; 2) IM sc; 3) IM tr; 4) IM opt; 5) Edge; 6) No IP – refer text for details)
1. Importance Mapping with all features weighted equally: ωcontrast = ωsize = ωshape
etc.
2. Importance Mapping with weights selected depending on the image scene type.
Experiments described in Chapter 6 showed that improved perception may be
obtained by processing images with respect to scene type in this way.
3. Weights selected in accordance with a training set of images from that scene
type. A training database was developed consisting of 15 images of each scene
category used in the tests (refer Appendix E.1). Feature maps for each image
were determined along with the percentage of pixels in the top 25% of each
feature map (ie. between 0.75 and 1.00 in the normalised images). This gave a
measure of the strength of that feature for that image. For some images, there
would be no pixels in the range 0.75 – 1.00 for that feature, while for others
100% of the image pixels might lie in this 0.75 – 1.00 importance range. Feature
distributions were made for each scene category. Pixels in this upper 0.75 – 1.00
range would be determined for a test image and its position within the
distribution determined. Weights were selected according to position of the test
image within the feature distribution (for example refer Fig 7.4).
[Figure 7.4 plots the percentage of images in the training database (0–100%) against feature weight (0–1).]
Figure 7.4: Example distribution – Size Map distribution for beach training images. If the size map for a test image had 50% of pixels in the range 0.75 – 1.00, this would be greater than only one other image in the training set; the weight applied to that feature map when combining feature maps is the value interpolated between the nearest images, ie. ~12%.
4. Weights iteratively adjusted to give the highest number of edges in the
resulting Importance Map (a sketch of this optimisation appears after this list).
Experiments described in Chapter 5 found that a subject’s ability to recognise
objects in low quality images is correlated with the number of edges in that
image. A medium-scale Quasi-Newton line search optimisation routine was used
to adjust the five weights to maximise the number of edges in the Importance Map.
5. Considering that the number of edges was found to correlate with correct object
recognition, it was desired to present an edge map alone. Images were prepared
using the Canny edge detection method operating on 25x25 spatial resolution
images.
6. Finally an image was presented with no importance processing applied, as a base
comparison case.
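For method 4, the thesis used a medium-scale Quasi-Newton line search routine; the following is a minimal sketch of the same idea under looser assumptions. Because the raw edge count is a step function of the weights (and therefore has no useful gradient), the sketch substitutes SciPy’s derivative-free Nelder-Mead simplex search, and it reuses the hypothetical edge_information_metric function from the Chapter 5 sketch.

```python
import numpy as np
from scipy import optimize

def negative_edge_count(weights, feature_maps):
    """Objective for the optimiser: minus the number of edge pixels in
    the importance map built from the given feature weights, so that
    minimising this maximises the edge count."""
    w = np.clip(weights, 0.0, 1.0)
    imp = sum(wi * fm for wi, fm in zip(w, feature_maps))
    if imp.max() > 0:
        imp = imp / imp.max()
    return -edge_information_metric(imp)   # hypothetical metric from the Chapter 5 sketch

def optimise_weights(feature_maps):
    """Iteratively adjust the five feature weights to maximise the
    number of edges in the resulting Importance Map."""
    x0 = np.full(5, 0.5)                   # start from neutral weights
    result = optimize.minimize(negative_edge_count, x0,
                               args=(feature_maps,), method="Nelder-Mead")
    return np.clip(result.x, 0.0, 1.0)
```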
7.2.3 Images Used
Chapter 6 described several scene categories that a blind person might encounter.
Six of these were chosen for this experiment, with 4 images for each category (Figure
7.5). Images were selected on the basis of functional mobility problems. Dowling
[21] has reviewed previous efforts in enhancing mobility for visually impaired
persons, including the following mobility problems:
• Lighting conditions and glare
• Changes in terrain and depth (stairs, curbs)
• Unwanted contacts (bumps)
• Street crossings
• Visual clutter
Figure 7.5: Images used in comparison tests, covering six scene categories: (i) beach, (ii) street, (iii) office, (iv) home, (v) café, (vi) head & shoulders
7.2.4 Experiment
A group of 242 volunteers participated in the experiment. From this, 50 samples
were discarded (21%) due to either incomplete responses or subjects who normally
wore glasses/contact lenses but who were not wearing them at that time. This left
192 normally sighted or corrected-to-normal viewers. Half the sample (n=96)
viewed the images that were greylevel equalised, while the other half viewed the
non-equalised images. Subjects were presented with test stimuli as shown in
Appendix Section E.2. An original high quality (256x256 greyscale) image was
shown along with the six different versions of the image. The design of the booklets
is shown in Appendix Section E.4 – Part A. Viewing conditions for the experiment
were not controlled.
7.2.5 Results
Figure 7.6 shows the breakdown of viewer preferences. There was a
clear preference in both the equalised and non-equalised viewer group for no
importance processing (base-case). This was the most chosen method for six of the
six scene types, especially for faces, where 85% of subjects chose that processing
method. Error bars representing 95% confidence intervals are shown in the plot
below, and were obtained from average preferences across the six scene types.
[Figure 7.6 comprises two plots. The upper plot shows the percentage of subjects choosing each processing method (IM eq, IM sc, IM tr, IM opt, Edge, No IP) for the equalised and non-equalised groups; the lower plot shows the number of samples ranked (4608 in total) for each processing method, broken down by scene type (beach, office, face, house, street, café).]
Figure 7.6: Viewer preferences when presenting the entire image; results indicate a clear preference for no Importance Processing (n=96 per group)
To ensure the ‘clear’ preference for no importance processing (‘No IP’) was
statistically significant, Analysis of Variance (ANOVA) was performed on the data
shown in Fig 7.6 to compare the hypotheses:
H0: µ(IM eq) = µ(IM sc) = … = µ(No IP)
H1: At least two of the means are not equal, at α = 0.05.
Viewer preferences across each scene type formed the basis of observations for the
ANOVA: 6 observations (beach, office, face, house, street, café) comparing 6
different processing methods. The test resulted in F-values of {19 & 57} for
{equalised & non-equalised data} which exceeded the critical F-value (2.53) for the
number of degrees of freedom in the data (5, 30), and was highly significant at
{P=1.5E-8 & 1.8E-14}. Thus H0 was rejected and it was concluded that at
least two of the means were not equal. The ANOVA was then repeated but this time
excluding data for ‘No IP’. This time F-values were {1.6 & 1.3} for {equalised &
non-equalised data} which were less than the critical F-value (2.76) for the number
of degrees of freedom in the data (4, 25), and {P=0.20 & 0.30}. This indicates that
when ‘No IP’ data is excluded, the means of the other processing methods are not
significantly different.
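The one-way ANOVA used here is available directly in SciPy. The following is a minimal sketch on illustrative per-scene preference percentages (not the thesis data): each list holds six observations, one per scene type, for one processing method.

```python
from scipy import stats

# Percentage of subjects preferring each processing method, one
# observation per scene type (beach, office, face, house, street,
# café) -- illustrative values, not the thesis data.
im_eq  = [10,  8,  3,  9,  7,  6]
im_sc  = [ 9,  7,  4,  8,  6,  7]
im_tr  = [ 8,  9,  3,  7,  8,  6]
im_opt = [ 7,  6,  2,  8,  7,  5]
edge   = [ 6,  8,  3, 15, 14,  7]
no_ip  = [60, 62, 85, 53, 58, 69]

f_value, p_value = stats.f_oneway(im_eq, im_sc, im_tr, im_opt, edge, no_ip)
print(f"F(5, 30) = {f_value:.1f}, P = {p_value:.2e}")
```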
Another issue of interest was histogram equalisation. A two sample t-test was
performed using 36 observations (6 processing methods and average results of each
of 6 scene types) at α = 0.05. The test assessed the hypotheses:
H0: There is no difference in histogram equalisation (differences shown in the
upper plot of Fig 7.6 were due to sampling errors or chance only);
H1: Histogram equalisation achieves significantly different results; ie. a one-
tailed t-test (a directional test showing results as higher or lower was not of
interest).
A t-statistic of 1.76E-15 was obtained which was much less than the critical t value
1.67 for 62 degrees of freedom. The significance of this value for a one-tail test was
P=0.5 and since this is greater than 0.05, H0 was not rejected: histogram equalisation
does not result in significant differences.
Fig 7.6 also shows processing methods divided into scene type. The plot indicates
which scene types may be better suited for a particular processing method. For
example, one of the processing methods compared in the experiment was edge
detection. Section 3.5.2.4 described how a research team developing cortical implant
devices expected improved perception results with the implementation of edge
detection processing. The data shown in Fig 7.6 indicate that low quality edge maps
were best recognised for house and street scenes. ANOVA testing using 8
observations (4 of each image type across 2 equalisation/non-equalised data sets) of
6 image types shows significantly higher results for house and street scenes
(P=0.0005).
This experiment had several conclusions:
(i) The Base Case (‘No IP’) data was significantly better than the
other processing methods.
(ii) There was no significant difference between the remaining five
processing methods. This indicates there is no real advantage in
tuning feature weights for the importance map method for low
quality images, a worthwhile conclusion as it allows the
computational overhead required for this processing to be used
elsewhere in prosthesis systems.
(iii) There was no significant difference in histogram equalisation
before thresholding images.
(iv) Low quality edge maps were best recognised for house and street
scenes.
7.3 Digital Zoom
The results discussed in the previous section indicate that the base case was best for
presenting images – ie. presenting subsampled and binarised images only without
any region-of-interest processing. However, rather than presenting an entire ROI-
processed image, improved perceptual results might be obtained by using ROI
processing to identify salient areas within an image and presenting those areas alone
(in a subsampled and binarised form). In effect the approach is to find interesting
areas within the image and perform a ‘digital zoom’, enlarging those salient areas
to the resolution limit set by the implant electrode array (refer Figure 7.7). It is
anticipated that digital zoom would be a common and easily-implemented prosthesis
function, and it would be useful to make this zoom method automatic for a blind
user.
Figure 7.7: Digital zoom concept – the most salient area is identified in an image and resized to the maximum display resolution
7.3.1 Automatic Zoom Methods
An additional test was conducted comparing seven methods of zooming into an
image. For the purposes of this exercise, the original image was 256x256 spatial
resolution.
1. IM_trim (Fig 7.8); A trimmed version of an Importance Map to only include
elements above a threshold. Pixels were trimmed around each border: top, right,
bottom, left, top, right etc. until a pixel value above the threshold was found. As
only square images were presented, the final image was made into a square of
dimension equal to the maximum dimension of the trimmed box. The smaller
dimension was expanded until image dimensions were equal, and the expansion
direction was on the basis that pixels of more important regions were added.
Figure 7.8: Trim method to select zoom window (panels: original; Importance Map; square with trimmed box; trimmed box)
2. IM_scope (Fig 7.9); A 128x128 box containing the highest greylevel values
in a 256x256 Importance Map, ie. one quarter of the image area. The 128x128
box was moved pixel by pixel across the image until it contained the highest sum
of pixel values (a sketch of this search appears after this list).
Figure 7.9: Scope Box method to select zoom window
3. The trim method described in 1) above applied to a Saliency Map generated by
code obtained from iLab at the University of Southern California [40,84]. This
Region-of-Interest research was discussed in Section 2.4. A Saliency Map is
created from combining three feature maps corresponding to colour, intensity and
orientation at six spatial scales. Unlike the Importance Map concept which
segments images first into regions, the saliency feature maps are created from
Difference-of-Gaussian (Mexican-hat) operators applied direct to pixel data
(Figure 7.10). Default values for the code implementation were used.
Figure 7.10: Saliency Map developed by University of Southern California
Top: Difference of Gaussians filter applied to 3 feature maps; Bottom: Saliency Map output showing Regions-of-Interest
4. The 128x128 box scope method described in 2) above applied to a Saliency Map.
5. A 128x128 box containing the horizontal and vertical centre of the image (Figure
7.11). This method has no dependence on image content and relies on spatial
position within the image only. It assumes that the centremost part of an image
may be the area worth zooming into.
Figure 7.11: Zoom window selected from central 25% of image
6. Similarly to 5) in that there is no dependence on image content, this method crops
a 128x128 box aligned at the bottom centre of the image. This area may be
significant for a viewer especially when mobile, as it contains the foreground
immediately in front of the camera (Fig 7.12).
Figure 7.12: Zoom window selected from central-bottom 25% of image
7. For reference, an option of “No Zoom” was also included, where the whole
256x256 image was represented.
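The scope-box search of methods 2 and 4 is easy to express with a summed-area table, which makes each window sum an O(1) lookup. The following is a minimal sketch assuming a normalised importance or saliency map as input; the function name and structure are illustrative, not the thesis implementation.

```python
import numpy as np

def scope_box(importance_map, box=128):
    """Slide a box x box window over an importance (or saliency) map
    and return the top-left corner of the window containing the
    highest sum of pixel values (methods 2 and 4 above)."""
    h, w = importance_map.shape
    # Summed-area table: sat[i, j] holds the sum of map[:i, :j].
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = importance_map.cumsum(axis=0).cumsum(axis=1)
    # Sum over every window position in one vectorised expression.
    sums = (sat[box:, box:] - sat[:-box, box:]
            - sat[box:, :-box] + sat[:-box, :-box])
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return r, c        # the window is importance_map[r:r+box, c:c+box]
```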
For all the above methods, the stimulus presented to viewers was the cropped
zoomed version from the original resized to 25x25 spatial resolution (Fig 7.13). One
test case had greylevels equalised before thresholding at the 128 grey level, while a
second test set had only thresholding at the 128 level with no histogram equalisation.
[Figure 7.13 is a flowchart: the original 256x256 image has a zoom window selected with one of the six methods, the window is cropped, optionally histogram equalised, thresholded at the 128 grey level, and subsampled to 25x25.]
Figure 7.13: Image preparation for digital zoom tests
The same subjects who viewed the earlier described experiment (refer Section 7.2)
also viewed the variations on zoom method. Half the sample (n=96) viewed the
images that were greylevel equalised, while the other half viewed the non-equalised
images. An example of the test stimuli presented to subjects is shown in Appendix
Section E.3 and the design of the test booklet is shown in Section E.4 – Part B.
Viewing conditions for the experiment were not controlled. Subjects were shown a
zoom window overlaid on the original image, in addition to a 25x25 black and white
version of the zoom window. When overlaid on the original image, the zoom
window was shown as a white square bordered on the inside and outside by a black
square to maximise visibility on all background greylevels (refer Figure 7.14).
Figure 7.14: Example stimulus showing detail of zoom window border
7.3.2 Results of Automatic Zoom Experiment
Viewer preferences are shown in Figure 7.15. Error bars representing 95%
confidence intervals are shown on the upper plot, and were obtained from average
preferences for the six scene types. ANOVA testing on the seven processing
methods resulted in strongly significant differences between the means (P=7.09E-8
and 2.33E-6 for non-equalised and equalised datasets respectively). The trim method
applied to Saliency Maps (“Sal trim”) had the highest preference for automatically
zooming into a part of the image. This method was best overall and for four of the
six scene types. For beach scenes the trim method applied to importance maps (“IM
trim”) was best, while for café scenes, which contained high clutter, “No Zoom” was
best.
[Figure 7.15 comprises two plots. The upper plot shows the percentage of subjects choosing each zoom method (IM trim, IM scope, Sal trim, Sal scope, centre, bottom, none) for the equalised and non-equalised groups; the lower plot shows the number of samples ranked (4608 in total) for each zoom method, broken down by scene type (beach, office, face, house, street, café).]
Figure 7.15: Preferences for methods to automatically zoom into an image (n=96 per group)
Again, results were independent of whether histogram equalisation was applied to
images. A two sample t-test was performed using 42 observations (7 processing
methods and average results of each of 6 scene types) at α = 0.05. The test assessed
the hypotheses:
H0: There is no difference in histogram equalisation (differences shown in Fig
7.15 were due to sampling errors or chance only);
H1: Histogram equalisation achieves significantly different results; ie. a one-
tailed t-test (not interested in a directional test which would show results as
higher or lower).
A t-statistic of 1.85E-15 was obtained which was much less than the critical t value
1.66 for 80 degrees of freedom. The significance of this value for a one-tail test was
P=0.5 and since this is greater than 0.05, H0 was not rejected: histogram equalisation
does not result in significant differences.
The trim methods (“Sal trim” and “IM trim”) were approximately twice as good as
the scope methods (“Sal scope” and “IM scope”). This may be due to the scope box
method having a fixed box size (equal to one quarter of the image area) while the
box size for the trim method varied depending on the image, potentially returning a
more useful zoomed image.
Thus if a digital zoom function were to be employed in a prosthesis design to
highlight areas which may help a visually impaired user, favourable results are most
likely to be achieved with the Saliency Map method. The trim method on
importance maps (“IM trim”) is also slightly better than zoom windows based on a
geometric part of the image which do not consider image content.
7.4 Chapter Summary
This chapter described a comparison of region-of-interest processing methods for
low quality image presentation. The experiments showed that it is better to use
Importance Map/Region-of-Interest processing to select a region within the image
and present that alone, rather than presenting the actual Importance/Salience
representation for the entire image.
So in response to the research question:
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
It can be seen that ROI processing does improve scene understanding when used in a
zoom application, but not if applied to the entire image.
Chapter 8 Discussion, Conclusion and Future Work
8.1 Discussion and Conclusion
Electronic visual prostheses, or bionic eyes, are likely to provide some coarse visual
sensations to blind patients who have these systems implanted. The quality of
artificially induced vision is anticipated to be very poor initially. Research described
in this thesis has explored image processing techniques that improve perception for
users of visual prostheses.
The work has focussed on improving perception via image processing techniques.
Images are just data, and image processing is simply manipulating that data. There
are potentially other techniques that may result in improved perception that are
outside the scope of this research. These include using different electrode paths to
create variations of charge density patterns, and delivering preconditioning stimulus
to selectively excite deeper axons away from those in contact with electrodes.
Useful image processing methods were determined by way of subjective experiments
with normally sighted viewers. These experiments provide a basis from which more
complex and beneficial vision prostheses may be derived. For example, the tests
involved presentation of static (still) images to viewers and a logical extension to the
work is to conduct similar experiments on image sequences (video). Thus a body of
knowledge has been established from which real-time processing units can be
developed such that a prosthesis may provide maximum benefits to the blind.
The work has also facilitated further understanding of the human visual system,
specifically perception from low quality visual information. The field of image
quality, including how quality is measured and characterised, is vast. However work
to date focuses on high quality images associated with modern multimedia
environments. This research contributes to the image quality literature by
characterising low quality images. The amount of visual information carried by
images has been quantified, and in this way a new means of characterising low
quality images can be stated on the basis of presenting maximum information.
Finally the research has involved an original application of Region-of-Interest
processing routines beyond traditional applications such as image compression.
Region-of-Interest processing was presented in Section 2.4 as a powerful perception
modeling tool using a combination of early vision and cognitive effects. Prosthesis
researchers had not previously recognised the benefit of applying techniques to
automatically identify important areas in images. Thesis experiments validated the
use of computationally cheap Region-of-Interest techniques in visual prosthesis
designs.
Detailed research findings are now discussed in the order of their presentation in the thesis.
Some preliminary experiments are discussed in Chapter 4. In an experiment to
determine useful image processing methods, it was found that some types of images
are better recognised at low quality using Importance and Distance Map methods,
while for others, ‘Base Case/Normal’ processing is better. It is recommended to
have switchable modes of operation to allow user selection of the processing routine.
Results also indicated that it is better for a device to have increased spatial resolution
rather than increased greyscale resolution, and that faces are the most easily
understood type of image. These results were reinforced in a separate experiment
determining the effect of image type on perception.
In Chapter 5, a stable measure was developed to quantify the amount of visual
information in an image. This was found to be the number of edges in an image. It
was also found that high information content in images correlates with higher
perception.
Chapter 6 described an experiment that showed there is some benefit in processing
images according to their scene type (office, home etc). It is recommended to invert
the processed binary images before presentation if necessary to most closely match
the original images.
In Chapter 7, a comparison of Region-of-Interest methods was presented. ‘Base
Case/Normal’ processing was best when presenting the entire image, but ROI
processing has some benefit over Base Case processing with a zoom type function.
If a Region-Of-Interest technique was to be implemented in a zoom function, a
technique known as the Saliency Map (pixel based) gives the most favourable result.
This method was better than region-based Region-of-Interest methods, ‘Base Case’
and methods using the geometric location within an image (where image content is
not taken into account). The experiments found that there was no benefit in tuning
feature weights in the Importance Map method when such low quality images are
used. Finally there was no difference in the results if images are histogram-equalised
prior to binarisation or not.
In exploring image processing techniques that improve perception from visual
prostheses, four research questions were addressed:
Research Question | Findings

Q1: Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved?
Findings: Basic recognition can be achieved for low quality images, although this is dependent on spatial and greyscale resolutions and image type. Face environments are most easily recognised.

Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?
Findings: Improved scene understanding can be expected if Region-of-Interest processing is used to zoom into interesting areas within the image.

Q3: Can a model be constructed for basic information required for the interpretation of a visual scene at low image quality?
Findings: Maximising the number of edges in an image results in higher perceived information content and higher recognition performance.

Q4: Should the processing techniques be adjusted depending on the scene type?
Findings: Improved perception may be obtained by processing images with respect to scene type.
8.2 Future Work
The research presented in this thesis can be extended in several areas.
8.2.1 Motion
The algorithms described in this thesis were employed on static images, and a further
extension of this work is to image sequences/video. It is anticipated that enhanced
perception would be achieved over that experienced with static images, as image
sequences convey higher scene information and subjects would be able to move
about to see how various scene elements (background/foreground) interact.
It is likely that an optimum frame presentation rate exists to maximise visual
understanding. Nauseating disorientation effects may arise if head movements are
not matched with visual information. Time delay effects in helmet/head mounted
displays resulting in limitations from image lag are covered by Dudfield [22] and
Nelson [61].
8.2.2 Colour
As identified by Suaning et al [93], colour would further enhance images delivered to
subjects. While patients undergoing cortical stimulation have reported seeing white,
yellow, red, and blue phosphenes [86], the successful modulation of colour has not yet
been reported in the literature. From an image processing viewpoint, colour filtering
could be relatively easily applied to present monochromatic images of only a selected
colour, for example green filtering to locate green apples in a bin of green and red
apples. One paper [108] discussing alternatives to colour for transmitting information
in images (eg. monochromatic coding, size, flashing stimulus) identifies a
disadvantage of colour: it cannot depict relative importance.
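Such a colour filter is simple to sketch. The following assumes an RGB image as a NumPy array; the dominance margin of 30 grey levels is an illustrative choice, not a recommended value.

```python
import numpy as np

def green_filter(rgb, margin=30):
    """Return a monochrome image that is bright only where green
    dominates the red and blue channels (eg. to pick out green
    apples among red ones)."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    mask = (g > r + margin) & (g > b + margin)
    return np.where(mask, rgb[..., 1], 0).astype(np.uint8)
```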
8.2.3 Device interfacing
The ability to connect an artificial vision system to a television or computer would be
desirable. This concept has been incorporated in the design of the Dobelle implant
which allows for a television/computer/internet interface and remote video
screen/VCR monitor [20]. Other routine image processing functions could include
the possibility to record a video clip/snapshot of an interesting landmark, route or
environment, and transmit the clip/snapshot to another viewer or an external device.
8.2.4 Supplementary/Symbolic Information
Given that phosphenes can be produced in the visual field, increased information
transfer might be achieved if use were made of these phosphenes not just for
representing one part of a scene, but for “coding” of associated information. For
example, a particular phosphene in the top left hand corner of the visual field could
represent the close proximity to the left of a tall and narrow object (such as a lamp-
post), which may prevent an injury. Similarly, the middle right-most phosphene
might represent fast movement from the right. In addition to using a selected
phosphene for information representation, other modes of transfer might cover
phosphene brightness and blink rate. It is feasible that supplementary sensory
information (eg. distance) might be conveyed in such a manner to efficiently use the
small number of phosphenes.
The ability to convey alphanumeric character patterns directly, rather than capturing
the text via camera, may be beneficial in improving reading speeds and visual acuity.
A 5 x 7 phosphene array can be used to create a full set of symbols, similar to dot
matrix displays, where the 35 dots can overlap or appear as discrete elements without
affecting legibility [74,89]. While a 5 x 7 matrix is generally quite adequate for
groups of characters presented in context, it can result in some confusion when single
characters are used (eg. 2 called Z, B called 8). This may lead to the use of 7 x 9
matrix fonts, but matrices larger than 7 x 9 do not result in meaningful improvements
[89].
The use of supplementary stimulus could also be applied for colour analysis.
Although artificially-created phosphenes might not represent the true colour of the
camera image, it should be possible to overlay some colour identification
information. For example, a colour identification function would be advantageous to
selectively give an indication of the colour detected in the centre of the camera view.
8.2.5 Range Indication
The distance to an object would be particularly useful information to convey through
an artificial vision system. The use of sonar distance aids has been common for
many years (eg. [23]). However these devices emit an auditory signal to convey
distance information which can interfere with important surrounding environmental
noises. It is thus desirable to incorporate distance information visually through an
artificial vision system. Distance information can be obtained using ultrasonic
rangefinders or by computing depth from the disparity between two cameras.
Distances could then be mapped to intensities, where the nearest object is shown
with the highest intensity. If the device display only supports a 1-bit grey scale, only
the nearest object need be displayed. Newman and Jain [63] state that the greatest
advantage of range data is in explicitly representing surface information.
Binary/silhouette data provides negligible information about surface and intensity
data can contain significant variations in reflected light.
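A minimal sketch of the distance-to-intensity mapping described above follows; the maximum range and quantisation are assumptions for illustration.

```python
import numpy as np

def depth_to_intensity(depth, max_range=10.0, bits=1):
    """Map per-pixel distances (metres) to display intensities so the
    nearest objects appear brightest. For a 1-bit display, only the
    nearest object is shown."""
    nearness = np.clip(1.0 - depth / max_range, 0.0, 1.0)
    if bits == 1:
        return (nearness >= nearness.max()).astype(np.uint8)  # nearest only
    levels = 2 ** bits - 1
    return np.round(nearness * levels).astype(np.uint8)
```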
In combination with a standard image of luminance intensities, this distance 'mode of
operation' could be quite useful. The literature contains several works from a
computer graphic modelling viewpoint which claim that through the combination of
range and intensity, a fuller description of the model can be achieved which takes
advantage of the strengths and weaknesses of each method in isolation [44,113].
8.2.6 Simulating Techniques
Many of the experiments undertaken in this thesis involved the presentation of
binary images to viewers that did not include greyscale effects. The use of
halftoning techniques to create the illusion of greyscale was mentioned in Sections
3.5.2.2 and 3.4.1 along with the theory that modulating the size and intensity of a
phosphene are equivalent psychophysically. Future experiments in simulating
artificial vision could explore the use of this technique (refer Figure 8.1).
Figure 8.1: Halftone representation (panels: original; halftone with max radius = 2; halftone with max radius = 4; halftone with max radius = 10)
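As a minimal sketch of the dot-size modulation underlying Figure 8.1, the Python fragment below renders each cell of a coarse grey image as a disc whose radius grows with the local grey level; the cell size and maximum radius are illustrative assumptions.

    # Sketch: simulate greyscale with size-modulated phosphene dots.
    import numpy as np

    def halftone(gray, cell=11, max_radius=4):
        """Render a small grey image (values 0..1) as variable-size dots."""
        h, w = gray.shape
        canvas = np.zeros((h * cell, w * cell))
        yy, xx = np.mgrid[0:cell, 0:cell] - cell // 2
        dist = np.sqrt(xx**2 + yy**2)     # distance from the cell centre
        for i in range(h):
            for j in range(w):
                radius = gray[i, j] * max_radius
                disc = ((dist <= radius) & (radius > 0)).astype(float)
                canvas[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = disc
        return canvas

    gray = np.array([[0.0, 0.5],
                     [0.75, 1.0]])
    print(halftone(gray).shape)   # (22, 22): dot area grows with grey level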
8.2.7 Other Testing Techniques
The recognition experiments conducted in this thesis were also open-ended, in that no hints or clues were provided. Other ways to assess recognition performance might include asking task-dependent questions about the scene, such as “circle the doorway” or “identify the obstacle location”. Assessing perception in this way requires careful design and a consistent determination of correct responses across subjects. Objective measures, such as tracking tasks, maze navigation, reading speed, and visual acuity scores based on Landolt Cs and Sloan Es, could also be used, similar to previous attempts to quantify visual processing methods (eg. [12,13,14,33]).
8.3 Final Word
My motivation for undertaking this work was growing up with a blind parent and
witnessing frequent injuries from collisions with obstacles: dishwasher doors being
left open, cupboard doors ajar, corners of brick walls. There is always the hope that one day some visual perception may return, to a level sufficient to avoid such collisions and to allow the amazing world so familiar to us sighted people to be seen.
In the initial design of this research project, a scope of work was set that was achievable. It would not be realistic to expect to achieve sight restoration as a result of a PhD. However, with the facilities and testing methods available, some ideas could be explored in the area of image processing. It is hoped that this work may contribute in a small way to the numerous international efforts to develop a safe and useful electronic visual prosthesis for blind people.
References

1. Ahumada A, "Computational image quality metrics: A review", SID Digest of Technical Papers, 24, pp.305-308, 1993
2. Amerijckx C, Legat J, Trullemans C, "Design and implementation of a remapping algorithm for visual prosthesis", Proceedings of Vision Interface '99, Canadian Image Processing & Pattern Recognition Society, Toronto, Canada, pp.380-385, 1999
3. Barten P, Contrast Sensitivity of the Human Eye and its Effects on Image Quality, SPIE Press, Washington, 1999
4. Baskent D, Shannon R, "Frequency-place compression and expansion in cochlear implant listeners", Journal of the Acoustical Society of America, 116(5), pp.3130-3140, 2004
5. Beatty J, Booth K, Matthies L, "Revisiting Watkins' algorithm (computer graphics)", 7th Canadian Man-Computer Communications Conference, pp.359-370, 1981
6. Becker M, Braun M, Eckmiller R, "Retina implant adjustment with reinforcement learning", Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), IEEE, New York, USA, Vol.2, pp.1181-1184, 1998
7. Bell D, Maeder A, "Progressive technique for human face archiving and retrieval", Journal of Electronic Imaging, 5(2), pp.191-197, 1996
8. Brindley G, "The number of information channels needed for efficient reading", Journal of Physiology, 177, p.46, 1964
9. Callaghan T, "Interference and dominance in texture segregation: Hue, geometric form, and line orientation", Perception & Psychophysics, 46(4), pp.299-311, 1989
10. Cantoni V (ed), Human and Machine Vision – Analogies and Divergencies, Proceedings of the 3rd International Workshop on Perception, Plenum Press, 1994
11. Capelle C, Faik C, Trullemans C, Veraart C, "Real time experimental visual prosthesis using sensory substitution of vision by audition", Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, New York, USA, Vol.1, pp.255-256, 1994
12. Cha K, Horch K, Normann R, "Mobility performance with a pixelised vision system", Vision Research, 32(7), pp.1367-1372, 1992
13. Cha K, Horch K, Normann R, "Simulation of a phosphene-based visual field: Visual acuity in a pixelized vision system", Annals of Biomedical Engineering, 20(4), pp.439-449, 1992
14. Cha K, Horch K, Normann R, "Reading speed with a pixelized vision system", Journal of the Optical Society of America A – Optics & Image Science, 9(5), pp.673-677, 1992
15. Chernyak D, Stark L, "Top-down guided eye movements: Peripheral model", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.349-360, 2001
16. Cooper P, Birnbaum L, Brand M, "Causal scene understanding", Computer Vision and Image Understanding, 62(2), pp.215-231, 1995
17. Dagnelie G, Humayun M, Greenberg R, de Juan E, "The physiological connection: stimulating the human and amphibian retina", Proceedings of the International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2321-2326, 1997
18. DeMarco S, Clements M, Vichienchom K, Liu W, Humayun M, Weiland J, "An epi-retinal visual prosthesis implementation", Proceedings of the First Joint BMES/EMBS Conference, IEEE, Piscataway, USA, Vol.1, p.475, 1999
19. De Ridder H, "Cognitive issues in image quality measurement", Journal of Electronic Imaging, 10(1), pp.47-55, 2001
20. Dobelle W, "Artificial vision for the blind by connecting a television camera to the visual cortex", ASAIO (American Society of Artificial Internal Organs) Journal, 46(1), pp.3-9, 2000
21. Dowling J, Maeder A, Boles W, "Mobility enhancement and assessment for a visual prosthesis", in Human Vision and Electronic Imaging IX, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.5369, pp.780-791, 2004
22. Dudfield H, Hardiman T, Selcon S, "Human factors issues in the design of helmet-mounted displays", in Helmet- and Head-Mounted Displays and Symbology Design Requirements II, Lewandowski R, Stephens W, Haworth L (eds), Proceedings of SPIE Vol.2465, pp.132-141, 1995
23. Easton R, "Inherent problems of attempts to apply sonar and vibrotactile sensory aid technology to the perceptual needs of the blind", Optometry and Vision Science, 69(1), pp.3-14, 1992
24. Eckert M, Bradley A, "Perceptual quality metrics applied to still image compression", Signal Processing, 70, pp.177-200, 1998
25. Eckmiller R, Becker M, Hunermann R, "Towards a learning retina implant with epiretinal contacts", IEEE International Conference on Systems, Man, and Cybernetics, Vol.4, pp.396-399, 1999
26. Eckmiller R, Becker M, Hunermann R, "Dialog concepts for learning retina encoders", Proceedings of the International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2315-2320, 1997
27. Exner J, The Rorschach: A Comprehensive System, Vol.2: Current Research and Advanced Interpretation, Wiley, New York, 1978
28. Gilmont T, Verians X, Legat J, Veraart C, "Resolution reduction by growth of zones for visual prosthesis", Proceedings of the International Conference on Image Processing, IEEE, New York, USA, Vol.1, pp.299-302, 1996
29. Gregory R, Eye and Brain – The Psychology of Seeing, World University Library, London, 1966
30. Groth-Marnat G, Handbook of Psychological Assessment, 3rd ed, John Wiley & Sons, New York, 1997
31. Hallum L, Taubman D, Suaning G, Morley J, Lovell N, "A filtering approach to artificial vision: A phosphene visual tracking task", Proceedings of the World Congress on Medical Physics and Biomedical Engineering (WC2003), 2003
32. Harvey J, Sawan M, "Image acquisition and reduction dedicated to a visual implant", Proceedings of the 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, New York, USA, Vol.1, pp.403-404, 1997
33. Hayes J, Yin V, Piyathaisere D, Weiland J, Humayun M, Dagnelie G, "Visually guided performance of simple tasks using simulated prosthetic vision", Artificial Organs, 27(11), pp.1016-1028, 2003
34. Hendee W, Wells P (eds), The Perception of Visual Information, 2nd ed, Springer-Verlag, New York, 1997
35. Henderson D, Evans J, Dobelle W, "The relationship between stimulus parameters and phosphene threshold/brightness during stimulation of human visual cortex", Transactions – American Society for Artificial Internal Organs, 25, pp.367-371, 1979
36. Huang T, "PCM picture transmission", IEEE Spectrum, 2(12), pp.57-63, 1965
37. Humayun M, Weiland J, Fujii G, Greenberg R, Williamson R, Little J, Mech B, Cimmarusti V, Van Boemel G, Dagnelie G, de Juan E, "Visual perception in a blind subject with a chronic microelectronic retinal prosthesis", Vision Research, 43(24), pp.2573-2581, 2003
38. Hungenahally S, "Differentio-aggregation functions for perceptual sub-band coding of images: Emulation of visual receptive fields", IEEE International Conference on Systems, Man and Cybernetics, IEEE, USA, Vol.3, pp.2420-2425, 1994
39. Hungenahally S, "Mathematical basis for the design of an artificial retina: visual prosthesis for the retinally blind", IEEE International Conference on Systems, Man and Cybernetics, IEEE, New York, USA, Vol.3, pp.2396-2401, 1995
40. Itti L, Koch C, "Feature combination strategies for saliency-based visual attention systems", Journal of Electronic Imaging, 10(1), pp.161-169, 2001
41. ITU-R Recommendation BT.500-7, 10/1995, http://www.itu.ch/, access date: 4/6/04
42. Iwamoto K, Tanie K, "Development of an eye movement tracking type head mounted display: capturing and displaying real environment images with high reality", Proceedings of the 1997 IEEE International Conference on Robotics and Automation, IEEE, New York, USA, Vol.4, pp.3385-3390, 1997
43. Janssen R, Computational Image Quality, SPIE Press Monograph Vol.PM101, SPIE – The International Society for Optical Engineering, 2001
44. Kay G, Caelli T, "Inverting an illumination model from range and intensity maps", CVGIP: Image Understanding, 59(2), pp.183-201, 1994
45. Kraft R, Kauer J, "Estimating the fractal dimension from digitized images", Munich University of Technology – Weihenstephan, Freising, Germany, http://www.wzw.tum.de/ane/algorithms/algorithms.html, access date: 20/4/04, 1995
46. Kyuma K, Miyake Y, Kage H, "Artificial retina chips", IEEE International Conference on Neural Networks, Vol.4, pp.2304-2308, 1997
47. Levine M, Vision in Man and Machine, McGraw-Hill, New York, 1985
48. Livingstone M, Hubel D, "Segregation of form, color, movement, and depth: Anatomy, physiology, and perception", Science, 240, pp.740-749, 1988
49. Loce R, Roetling P, Lin Y, "Digital halftoning for printing and display of electronic images", in Electronic Imaging Technology, Dougherty E (ed), SPIE – The International Society for Optical Engineering, pp.225-288, 1999
50. Luo J, Etz S, Singhal A, Gray R, "Performance-scalable computational approach to main subject detection in photographs", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.494-505, 2001
51. Lysaght M, Vogelstein J, Lockhart N, Cheswick Thide C, Nallari M, Caulkins C, Artificial Vision (compiled 1999), Brown University, Providence, Rhode Island, http://biomed.brown.edu/Courses/BI108/BI108_1999_Groups/Vision_Team/Vision.htm, access date: 20 April 2004
52. Maeder A, "Human understanding limits in visualization", International Journal of Pattern Recognition & Artificial Intelligence, 11(2), pp.229-237, 1997
53. Maeder A, Eckert M, "Medical image compression: Quality and performance issues", Proceedings of New Approaches in Medical Image Analysis, SPIE – The International Society for Optical Engineering, Washington, Vol.3747, 1999
54. Maeder A, Pham B, "A colour importance measure for colour image analysis", IS&T and SID's Color Imaging Conference: Transforms & Transportability of Colour, Phoenix, pp.232-237, 1993
55. Margalit E, Maia M, Weiland J, et al, "Retinal prosthesis for the blind", Survey of Ophthalmology, 47(4), pp.335-356, 2002
56. Marr D, Vision, W.H. Freeman & Company, New York, 1982
57. Meletiou A, Measurement of Complexity in Visual Images, MSc in Human Computer Interaction project, Department of Computing & Electrical Engineering, Heriot-Watt University, 1999
58. Miau F, Itti L, "A neural model combining attentional orienting to object recognition: Preliminary explorations on the interplay between where and what", Proceedings of the IEEE Engineering in Medicine and Biology Society (EMBS), Istanbul, Turkey, 2001
59. Moffett D, Moffett S, Schauf C, Human Physiology – Foundations and Frontiers, 2nd ed, Mosby-Year Book Inc, St Louis, pp.268-283, 1993
60. Mojsilovic A, Rogowitz B, "A psychophysical approach to modeling image semantics", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.470-477, 2001
61. Nelson W, Hettinger L, Haas M, Russell C, Warm J, Dember W, Stoffregen T, "Compensation for the effects of time delay in a helmet-mounted display: perceptual adaptation versus algorithmic prediction", in Helmet- and Head-Mounted Displays and Symbology Design Requirements II, Lewandowski R, Stephens W, Haworth L (eds), Proceedings of SPIE Vol.2465, pp.132-141, 1995
62. Nemine K, "Calibration and evaluation of virtual environment displays", Proceedings of the Symposium on Research Frontiers in Virtual Reality, IEEE Computer Society Press, San Jose, USA, 1993
63. Newman T, Jain A, "A survey of automated visual inspection", Computer Vision and Image Understanding, 61(2), pp.231-262, 1995
64. Nguyen A, Chandran V, Sridharan S, Prandolini R, "Importance assignment to regions in surveillance imagery to aid visual examination and interpretation of compressed images", Proceedings of the International Symposium on Intelligent Multimedia, Video & Speech Processing (ISIMP), Hong Kong, pp.385-388, 2001
65. Nie K, Stickney G, Zeng F, "Encoding frequency modulation to improve cochlear implant performance in noise", IEEE Transactions on Biomedical Engineering, 52(1), pp.64-73, 2005
66. Normann R, Maynard E, Rousche P, Warren D, "A neural interface for a cortical vision prosthesis", Vision Research, 39(15), pp.2577-2587, 1999
67. Osberger W, Perceptual Vision Models for Picture Quality Assessment and Compression Applications, PhD thesis, Space Centre for Satellite Navigation, School of Electrical and Electronic Engineering, QUT, 1999
68. Osberger W, Maeder A, "Automatic identification of perceptually important regions in an image using a model of the human vision system", 14th International Conference on Pattern Recognition, Brisbane, Australia, pp.701-704, 1998
69. Osberger W, Maeder A, Bergmann N, "A perceptually based quantisation technique for MPEG encoding", in Human Vision and Electronic Imaging III, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.3299, San Jose, USA, 1998
70. Osberger W, Rohaly A, "Automatic detection of regions of interest in complex video sequences", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.361-372, 2001
71. Pausch R, Shackelford M, Proffitt D, "A user study comparing head-mounted and stationary displays", Proceedings of the Symposium on Research Frontiers in Virtual Reality, IEEE Computer Society Press, San Jose, USA, 1993
72. Peachey N, Chow A, "Subretinal implantation of semiconductor-based photodiodes: progress and challenges", Journal of Rehabilitation Research & Development, 36(4), pp.371-376, 1999
73. Pentland A, "Wearable computers", IEEE Microcomputers, 19(6), pp.9-11, 1999
74. Perez R, Electronic Display Devices, TAB Professional and Reference Books, Pennsylvania, USA, pp.196-202, 1988
75. Privitera C, Stark L, "Focused JPEG encoding based upon automatic pre-identified regions-of-interest", in Human Vision and Electronic Imaging IV, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.3644, pp.552-558, 1999
76. Reinagel P, Zador A, "Natural scene statistics at the centre of gaze", Network: Computation in Neural Systems, 10(4), pp.341-350, 1999
77. Riglis E, Modeling Visual Complexity in Images, First Year PhD Report, Image Systems Engineering Laboratory, School of Mathematical and Computer Sciences, Heriot-Watt University, 1998
78. Rivlin E, Rosenfeld A, "Navigational functionalities", Computer Vision and Image Understanding, 62(2), pp.232-244, 1995
79. Rizzo J, Wyatt J, Loewenstein J, Kelly S, Shire D, "Methods and perceptual thresholds for short-term electrical stimulation of human retina with microelectrode arrays", Investigative Ophthalmology & Visual Science, 44(12), pp.5355-5361, 2003
80. Rogowitz B, Frese T, Smith J, Bouman C, Kalin E, "Perceptual image similarity experiments", in Human Vision and Electronic Imaging III, Proceedings of SPIE Vol.3299, San Jose, USA, pp.576-590, 1998
81. Rogowitz B, Pappas T, Allebach J, "Human vision and electronic imaging", Journal of Electronic Imaging, 10(1), pp.10-19, 2001
82. Rosenberg D, Color Halftone Version 7.0, an Adobe Photoshop filter module which simulates an enlarged print color halftone effect, Adobe Systems, 2002
83. Russ J, The Image Processing Handbook, 3rd ed, CRC Press, Florida, USA, pp.242-247, 1999
84. Saliency Map source code, iLab, University of Southern California, http://ilab.usc.edu/toolkit/, access date: 12 August 2003
85. Schill K, Umkehrer E, Beinlich S, Krieger G, Zetzsche C, "Scene analysis with saccadic eye movements: Top-down and bottom-up modeling", Journal of Electronic Imaging, 10(1), pp.152-160, 2001
86. Schmidt E, Bak M, Hambrecht F, Kufta C, O'Rourke D, Vallabhanath P, "Feasibility of a visual prosthesis for the blind based on intracortical microstimulation of the visual cortex", Brain, 119, pp.507-522, 1996
87. Schubert M, Stelzle M, Graf M, Stert A, Nisch W, Graf H, Hammerle H, Gabel V, Hofflinger B, Zrenner E, "Subretinal implants for the recovery of vision", IEEE International Conference on Systems, Man, and Cybernetics, Vol.4, pp.376-381, 1999
88. Schwarz M, Hauschild R, Hosticka B, Huppertz J, Kneip T, Kolnsberg S, Mokwa W, Trieu H, "Single chip CMOS image sensors for a retina implant system", IEEE, Vol.6, pp.645-648, 1998
89. Sherr S, Electronic Displays, John Wiley & Sons, New York, pp.29-37, 1979
90. Snyder H, Trejo L, "Research methods", in Colour in Electronic Displays, Widdel H, Post D (eds), Plenum Press, 1992
91. Stange K, "A 4-parameter model of visual complexity in abstract images and a computer program for the empirical investigation of complexity, pleasingness and interestingness of images based on the model", XVI Congress of the International Association of Empirical Aesthetics, New York, 2000
92. Stark L, Privitera C, Yang H, Azzariti M, Fai Ho Y, Blackman T, Chernyak D, "Representation of human vision in the brain: How does human perception recognise images?", Journal of Electronic Imaging, 10(1), pp.123-151, 2001
93. Suaning G, Lovell N, Schindhelm K, Coroneo M, "The bionic eye (electronic visual prosthesis): A review", Australian and New Zealand Journal of Ophthalmology, 26, pp.195-202, 1998
94. Suaning G, Lovell N, Kwok C, "Fabrication of platinum spherical electrodes in an intra-ocular prosthesis using high-energy electrical discharge", Sensors and Actuators A: Physical, 108, pp.155-161, 2003
95. Suaning G, Lovell N, "CMOS neurostimulation system with 100 electrodes and radio frequency telemetry", Inaugural Conference of the IEEE EMBS (Vic), Melbourne, pp.37-40, 1999
96. Thompson R, Barnett G, Humayun M, Dagnelie G, "Facial recognition using simulated prosthetic pixelized vision", Investigative Ophthalmology & Visual Science, 44(11), pp.5035-5042, 2003
97. Thorpe S, "Image processing by the human visual system", Eurographics '90 Tutorial Note EG.90TN4, Eurographics Technical Report Series, 1990
98. Travis D, Effective Colour Displays – Theory and Practice, Academic Press, 1991
99. Vaughan H, Schimmel H, "Feasibility of electrocortical visual prosthesis", in Visual Prosthesis – The Interdisciplinary Dialogue (Proceedings of the Second Conference on Visual Prosthesis), Sterling T, Bering E, Pollack S, Vaughan H (eds), Academic Press, New York, pp.65-79, 1971
100. Veraart C, Wanet-Defalque M, Gérard B, Vanlierde A, Delbeke J, "Pattern recognition with the optic nerve visual prosthesis", Artificial Organs, 27(11), pp.996-1004, 2003
101. VQEG – Video Quality Experts Group, Institute for Telecommunication Sciences, U.S. Department of Commerce, Colorado, http://www.its.bldrdoc.gov/vqeg/, access date: 4/6/04
102. Wandell B, Foundations of Vision, Sinauer Associates Inc, Massachusetts, USA, pp.124-126, 1995
103. Walpole R, Myers R, Probability and Statistics for Engineers and Scientists, 5th ed, Macmillan Publishing Company, New York, 1993
104. Warren D, Normann R, "Visual neuroprostheses", in Handbook of Neuroprosthetic Methods, Finn W, LoPresti P (eds), The Biomedical Engineering Series, CRC Press, 2003
105. Watson A (ed), Digital Images and Human Vision, MIT Press, 1993
106. Watson A, et al, The DCTune Algorithm, Vision Science and Technology Group, NASA Ames Research Center, http://vision.arc.nasa.gov/dctune/, access date: 29 May 2004
107. Werblin F, Jacobs A, "The cellular neural network as a retinal camera for visual prosthesis", Proceedings of the International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2327-2332, 1997
108. Widdel H, Post D (eds), Colour in Electronic Displays, Plenum Press, 1992
109. Yagi T, Ito Y, Kanda H, Tanaka S, Watanabe M, Uchikawa Y, "Hybrid retinal implant: fusion of engineering and neuroscience", IEEE International Conference on Systems, Man, and Cybernetics, Vol.4, pp.382-385, 1999
110. Yagi T, Kameda S, Hayashida Y, Li L, "An artificial retina with adaptive mechanisms and its application to retinal prostheses", IEEE, Vol.4, pp.418-423, 1999
111. Yamakawa T, Shimonomura K, Udono T, Yagi T, "Depth perception circuit employing serial output signals from two vision chips", IEEE, Vol.4, pp.390-395, 1999
112. Ziegler D, Linderholm P, Mazza M, Ferazzutti S, Bertrand D, Ionescu A, Renaud, "An active microphotodiode array of oscillating pixels for retinal stimulation", Sensors and Actuators A: Physical, 110(1-3), pp.11-17, 2004
113. Zhang G, Wallace A, "Physical modelling and combination of range and intensity edge data", CVGIP: Image Understanding, 58(2), pp.191-220, 1993
Appendix A Section 4.2 Experiment
A.1 Example Test Stimulus
All images are different versions of the same object.
[Five versions of the image, labelled A to E]
1. DESCRIBE THE OBJECT: _________________________________________________________________
2. RANK THE TOP 3 IMAGES THAT YOU THINK SHOW THE OBJECT MOST CLEARLY: 1)_________ 2)_________ 3)_________
A.2 Booklet Design

Test attributes (N = normal, I = inverse, D = distance, Im = importance, E = edges):

Ref  Spatial res & grey levels    Images per page  Image characteristics shown per page
1    10x10 B&W                    5                5 x B&W (N,I,D,Im,E)
2    10x10 3 grey                 4                4 x 3grey (N,I,D,Im)
3    10x10 B&W vs 3grey           9                9 = (5 x B&W) + (4 x 3grey)
4    16x16 B&W                    5                5 x B&W (N,I,D,Im,E)
5    16x16 3 grey                 4                4 x 3grey (N,I,D,Im)
6    16x16 B&W vs 3grey           9                9 = (5 x B&W) + (4 x 3grey)
7    25x25 B&W                    5                5 x B&W (N,I,D,Im,E)
8    25x25 3 grey                 4                4 x 3grey (N,I,D,Im)
9    25x25 B&W vs 3grey           9                9 = (5 x B&W) + (4 x 3grey)
10   10x10 3 grey vs 16x16 B&W    8                8 = (4 x B&W) + (4 x 3grey)
11   10x10 3 grey vs 25x25 B&W    8                8 = (4 x B&W) + (4 x 3grey)
12   16x16 3 grey vs 25x25 B&W    8                8 = (4 x B&W) + (4 x 3grey)

Image presentation criteria:
1. Images with lower spatial resolution should be presented before higher resolution versions of the same image, to reduce learning effects.
2. Try to obtain as large a difference as possible, so pair the following attribute reference numbers from the above table: 1 – 9, 2 – 7, 3 – 8, 4 – 10, 5 – 12, 6 – 11.

Six booklets A – F measuring different combinations of image attributes:

Book  Chair1  Chair2  Post1  Post2  Steps1  Steps2  Face1  Face2  Door1  Door2
A     1-9     2-7     3-8    4-10   5-12    6-11    1-9    2-7    3-8    4-10
B     2-7     3-8     4-10   5-12   6-11    1-9     2-7    3-8    4-10   5-12
C     3-8     4-10    5-12   6-11   1-9     2-7     3-8    4-10   5-12   6-11
D     4-10    5-12    6-11   1-9    2-7     3-8     4-10   5-12   6-11   1-9
E     5-12    6-11    1-9    2-7    3-8     4-10    5-12   6-11   1-9    2-7
F     6-11    1-9     2-7    3-8    4-10    5-12    6-11   1-9    2-7    3-8

Booklet page order, chosen to ensure variety and reduce learning effects (20 pages):

Page:  1       2      3       4       5       6       7       8      9       10
       Chair1  Post1  Steps1  Face1   Door1   Steps2  Post2   Face2  Chair2  Door2
Page:  11      12     13      14      15      16      17      18     19      20
       Face1   Post1  Door1   Steps1  Chair1  Post2   Chair2  Door2  Face2   Steps2
A.3 Borderline Recognition Assessment for Section 4.2 Experiment
Contextually Accepted Rejected Accepted Toilet Pidgeon Park bench Chair on skateboard Baby pram Chair with armrests Vulture chairlift Giraffe/dog Flamingo Rooster Emu
Contextually Accepted Rejected Accepted Lounge chair a rocking [?] facing to the left
that’s sitting up straight Posture chair
High chair Rocking horse 3d chair on a rock Electric chair Baby’s pram Rocking chair Child’s pram Reclining chair Baby stroller Chair with wheels Kids move chair stool
Contextually Accepted Rejected Accepted Pillar holding up roof Tower Door of a house with welcome
mat Gravestone Foot Corner of wall Wall from side or column (building support)
Torch Door in the right corner
Closeups of walls in a maze Buildings/skyscraper Vase/box which is a very different colour to the walls of the room
Jug in the corner of the room A square Wall Vase / bottle Window Block Grass bush Rectangle A pile on Large rectangular window
Candle holder Corridors Toilet paper Can of food Cup on table Elevation shaft Fire hydrant Block of chocolate
Lighthouse
Object on a table (eg. mug, salt shaker)
pole
Tree trunk Statue on stand O/head view of street
Contextually Accepted Rejected Accepted
Block of dry ice Beaker Window / Window with tree outside
Computer screen Container
Gameboy screen Swimming pool Wall / Wall with window & curtain on right
Piece of paper Building Box (rectangular shaped object); A box (not clear enough to describe); Block or box
Blackboard Bucket A block that someone can sit on Cup / mug Square
Top view of a table/ desk Block of chocolate An enclosed space t.v. screen The sky Floor mat Fish bowl Towel hanging
Football field Hole / Hole in the ground / A square ditch
Piece of cloth (rectangular) hanging up on something
Box, possibly a floor pan
Contextually Accepted Rejected Accepted “If gender noted was wrong” eg. man with afro
George Washington
The back of a head Early American president Gorilla’s face/monkey Footballer with helmet
Mozart Beethoven
Shrub / face Bob Marley Person with lots of facial hair Ben Harper William Shakespeare Elizabeth Taylor Jimmy Hendrix Mon Lisa painting Artist
Contextually Accepted Rejected “If gender noted was wrong” eg. ladies face
Head of a bird Child molester from ch.7 news
An old man’s head in profile facing left
Flower Dero person
Mr. Ed the talking horse Side view of a person facing left Old person with glass with string attached (crying)
Crying person Side view of a dog’s head Guy with long hair Face with long hair Barbie
Accepted
Back of a persons head Human face, child
Contextually Accepted Rejected Accepted Water pump with hose on right Long thing with protrudence Cactus ‘L’ End of a pier Big flower Axe Powerpole with powerline
extending to right Streetlight/lamp Half an anchor Telephone pole Flower stem Sign pointing up Umbrella Diving board/platform Tree with branches/leaves Powerline tree Crane Joining of posts (T) Basketball hoop Winder on old clothesline Arrow pointing down Stick figure pointing gun to right
A tree with monkeys in it
Corn / maize Ladder Saxophone Spear with hook underneath Street sign Office building with stairs Traffic lights Skyscraper Flag facing right Waving finger Hockey stick Submarine Pencil? Windsurfer Flagpole Traffic light Tower? Road markings Lighthouse? Hand basin Sword pipes
Fish hook
Contextually Accepted Rejected Accepted Tree beside a hole/ditch in the ground to the left
Tall building A line
Worm/slug/caterpillar City with buildings Flagpole Driveway Birds eye view of coin box Something sticking up Road at night (with reflective median strip)
Caterpillar crawling A stick
Perspective view of a road White thing sticking up Tree / tree out of the ground / a lone tree
A stage with curtains pulled back on either side and pole in middle
Mountain with stick Street lamp / light Traffic lights Tower
Person standing out in the open Straight white line behind black surface
Pole with cave Tall archway/corridor Pole in foreground, mountain in background
Mountains with a pole at top
Contextually Accepted Rejected Accepted Multi-storey building with stepped storeys
Building
Contextually Accepted Rejected Accepted Stick Swimming pool Stairs leading up to a tree A very tall skinny tree in a raised garden bed
Tree
Flagpole/yellow & red beach poles
Clothesline
Life savers flag Powerline pole lightpole Flag on a golf course Street footpath Golf hole A stick in the ground Weed/seedling coming from ground
Something pole-like with a house/shelter in the distance on the right
Pole in ground
Appendix B Section 4.3 Experiment
B.1 Example Test Stimulus
CAN YOU TELL WHAT IS SHOWN IN EACH IMAGE?
1) Write a word under each image to describe the main object or content of the scene. If you can’t tell what is shown, write “Can’t tell”.
2) Put a circle around the images that you are confident about.
B.2 Borderline Recognition Assessment for Section 4.3 Experiment
Image Accepted Rejected
Lighthouse
tower, well, powerhouse, buildings, horizon, house on
cliff, post, high-rise building, chimney, house,
mineshaft, oil rig, sky, pole, jetty with post, watch
tower, small building, landscape
Buildings
houses, factory, households stairs, steps, trees, mountain,
hill, forest, house with
smoking chimney
Tree
plant, canyon, gully
Gorilla
woman's back, dog (common for 256x256_Binary
images), men, person sitting, person, teddy, bear toy,
animal, someone eating, godzilla, dinosaur, lady, person
leaning over, hair & head, man bending over
R2D2, people kissing, 2
people, duck
Capsicum
apple, fruit, pumpkin, jack-o-lantern ball, rose, balloon, love
heart, fist
Face
monster, side view of a head 2 people looking at jumping
dog, tiger, bird, duck
Flower
splat, flowers, centre of fruit eye, shooting target, food
plate, donut, letter 'O', 'Q',
box, clock, sign, square,
wheel, ball, door handle,
apple
Balloon
lightbulb, plane, sun, cloud, bird, aeroplane, moon dot, tennis ball, heart, block,
window, rectangle, wall
with window, star, footprint,
square, light, firefly
Duck
man, smiley face, person,
dog
Appendix C Chapter 5 Experiments

C.1 Example Test Stimulus – 7 Images Presented at the Same Time

Rank the images for how much visual information they contain: 1 = contains most visual information; 7 = contains least visual information.
C.2 Example Test Stimulus – 3 Images Presented at the Same Time

Rank the images for how much visual information they contain: 1 = contains most visual information; 3 = contains least visual information.
C.3 Example Test Stimulus – Paired Comparison Experiments
WHICH IMAGE APPEARS TO CONTAIN MORE INFORMATION?
[Rating boxes shown beneath the image pair:]
DEFINITELY LEFT IMAGE | SLIGHTLY LEFT IMAGE | SAME | SLIGHTLY RIGHT IMAGE | DEFINITELY RIGHT IMAGE

The subjects were also given the following comments relating to this question:

There are 5 boxes under the image pair to be compared:
Box 1 = left image has much more information than right image
Box 2 = left image has slightly more information than right image
Box 3 = images have same amount of visual information
Box 4 = right image has slightly more information than left image
Box 5 = right image has much more information than left image
C.4 Booklet Design

Booklet design was chosen to minimize learning effects: 9 booklets A – I, with 3 parts in each booklet.

PART 1 – Develop metric; 2 methods: 1) paired comparison, 2) presenting 7 images all at once.
PART 2 – Validate metric; 7 additional visual dimensions; 3 images were presented to subjects at one time.
PART 3 – Check if correlated with recognition.
BOOK  Image Set  Paired Comparison (1, 2, 3)  All at once  OBJECT  NUMBER  ANGLE  DISTANCE  CONN.  DETAIL  CONTR  VARIETY
A     a          10B  16F  OB                 25F          10B     25F     O      16F       OB     25B     OE     25B
B     b          "    "    "                  25B          10F     OE      10B    25B       O      25F     OB     25F
C     c          "    "    "                  OE           16B     OB      10F    25F       10B    OE      O      16B
D     a          16B  25F  OE                 10F          16F     O       16B    OE        10F    OB      10B    O
E     b          "    "    "                  10B          25B     10B     16F    OB        16B    O       10F    OB
F     c          "    "    "                  O            25F     10F     25B    O         16F    10B     16B    10F
G     a          25B  10F  O                  16B          OE      16B     25F    10B       25B    10F     16F    16F
H     b          "    "    "                  16F          OB      16F     OE     10F       25F    16B     25B    OE
I     c          "    "    "                  OB           O       25B     OB     16B       OE     16F     25F    10B

10B = 10x10 binary; 10F = 10x10 full grey; 16B = 16x16 binary; 16F = 16x16 full grey; 25B = 25x25 binary; 25F = 25x25 full grey; OB = original (256x256) binary; OE = original (256x256) edge; O = original (256x256)

Image Set (a): Balloon – Caps; Balloon – Flower; Caps – Tree; People – Tree; People – Building; Building – Lighthouse; Flower – Lighthouse
Image Set (b): Balloon – People; Balloon – Lighthouse; Caps – Building; Caps – Lighthouse; People – Flower; Building – Tree; Flower – Tree
Image Set (c): Balloon – Building; Balloon – Tree; Caps – Flower; Caps – People; People – Lighthouse; Building – Flower; Lighthouse – Tree

Binary comparison of 7 different objects = 21 comparisons.
Appendix D Chapter 6 Experiment
D.1 Example Test Stimulus
RANK THE BOXES 1 → 4 ACCORDING TO HOW THEY BEST (IE MOST INFORMATIVELY) REPRESENT THE ORIGINAL SCENE SHOWN ON THE LEFT
1 = IMAGE THAT REPRESENTS THE ORIGINAL IMAGE THE BEST*; 4 = THE WORST
D.2 Booklet Design

Two images were chosen – lighthouse (outdoor image) and chair (office image). Four low quality versions were shown beside an original image. The order of presentation differed between the two images.

Lighthouse images shown in Section D.1, from left to right:
ORIGINAL; "OUTDOOR" WEIGHTS; BASE CASE (no importance processing); EQUAL WEIGHTS; "OFFICE" WEIGHTS

Chair images shown in Section D.1, from left to right:
ORIGINAL; BASE CASE (no importance processing); EQUAL WEIGHTS; "OFFICE" WEIGHTS; "OUTDOOR" WEIGHTS
Appendix E Chapter 7 Experiments
E.1 Training Image Database
Streetscape
Café/Restaurant
Heads/Shoulders
Beach
Office
Home
E.2 Example Test Stimuli for Section 7.2 Experiment

If you were trying to move through this scene, which version would you find most helpful?

[Six versions of the scene, labelled a to f]
E.3 Example Test Stimuli for Section 7.3 Experiment

Zoom window shown on original:
25x25 black and white version of zoom window:

Consider the same image again. Imagine you could zoom in to one part of the image. Which zoomed version(s) shown on the bottom row would you find most helpful if you were moving through the scene? The part of the original image from which the zoomed image has been taken is shown above for interest.

[Zoomed versions labelled a to g; version g is NO ZOOM]
E.4 Booklet Order for Chapter 7 Experiments
Processing methods (referred to by numbers 1 – 6 in the placement table below):

No.  PART A                     PART B
1    IM eq                      sal_trim
2    IM opt                     sal_scope
3    IM sc                      IM_trim
4    tr IM                      IM_scope
5    No importance processing   centre
6    edge                       bottom

Random placement of images (columns are booklets A – F):

IMAGE TYPE  IMAGE No.  A  B  C  D  E  F
café 1      C1         1  2  3  4  5  6
street 1    S1         3  5  4  6  1  2
house 1     H1         2  1  6  5  3  4
face 1      F1         5  4  2  3  6  1
office 1    O1         4  6  1  2  3  5
beach 1     B1         6  4  5  1  2  3

PART A – ROI applied to entire image
PART B – ROI used for automatic zoom
The table above is repeated for all 4 image sets (café, street, house, face, office and beach images 2, 3 and 4).
Numbers 1 – 6 refer to the processing method used (refer table above).
Booklets had 48 pages – 24 for PART A and 24 for PART B.
PART A and PART B were shown in consecutive page order for each image.