Improving Perception from
Electronic Visual Prostheses
Justin Robert Boyle BEng (Mech) Hons1
Image and Video Research Laboratory
School of Engineering Systems
Queensland University of Technology
Submitted as a requirement for the degree of
Doctor of Philosophy
February 2005
Keywords: image processing, visual prostheses, bionic eye, artificial human vision, visual perception, subjective testing, visual information
Abstract

This thesis explores methods for enhancing digital image-like sensations similar to
those which might be experienced by blind users of electronic visual prostheses.
Visual prostheses, otherwise referred to as artificial vision systems or bionic eyes,
may operate at ultra low image quality and information levels as opposed to more
common electronic displays such as televisions, for which our expectations of image
quality are much higher. The scope of the research is limited to enhancement by
digital image processing: that is, by manipulating the content of images presented to
the user. The work was undertaken to improve the effectiveness of visual prostheses
in representing the visible world.
Presently visual prosthesis development is limited to animal models in Australia and
prototype human trials overseas. Consequently this thesis deals with simulated
vision experiments using normally sighted viewers. The experiments involve an
original application of existing image processing techniques to the field of low
quality vision anticipated from visual prostheses.
Resulting from this work are, firstly, recommendations for effective image processing
methods for enhancing viewer perception when using visual prosthesis prototypes.
Although the images are limited to low quality, recognition of some objects can still
be achieved, and it is useful for a viewer to be presented with several variations of
the image representing different processing methods. Scene understanding can be
improved by incorporating Region-of-Interest techniques that identify salient areas
within images and allow a user to zoom into that area of the image. Also there is
some benefit in tailoring the image processing depending on the type of scene.
Secondly, the research involved the construction of a metric for the basic information
required for the interpretation of a visual scene at low image quality. The amount of
information content within an image was quantified using inherent attributes of the
image and shown to be positively correlated with the ability of the image to be
recognised at low quality.
Table of Contents

Abstract
List of Figures
List of Tables
Statement of Original Authorship
Acknowledgements
Publications

Chapter 1 Introduction
  1.1 Overview
  1.2 Aim
  1.3 Scope
  1.4 Thesis Structure
  1.5 Contributions

Chapter 2 Image Quality and Visual Perception
  2.1 Introduction
  2.2 Visual Perception Physiology
  2.3 A Visual Hierarchy Model
    2.3.1 Early Vision Effects
    2.3.2 Cognitive Effects
  2.4 Region-of-Interest
  2.5 Visual Information
  2.6 Chapter Summary

Chapter 3 Visual Prosthesis Application
  3.1 Overview
  3.2 General Introduction to the Application
  3.3 Current Visual Prosthesis Research
    3.3.1 Retinal Systems
    3.3.2 Optic Nerve Systems
    3.3.3 Visual Cortex Systems
  3.4 Image Processing specifically related to Bionic Eye Projects
    3.4.1 Vision Chip Developments
    3.4.2 CCD-based Systems
    3.4.3 Receptive Field Modeling
    3.4.4 Multiple Resolution Work
  3.5 Digital Imaging Applicable to Visual Prostheses
    3.5.1 Digital Imaging and Human Vision
    3.5.2 Image Characteristics and Visual Understanding
  3.6 Thesis Research Questions and Approach
    3.6.1 Image Processing Requirements
    3.6.2 Testing Method
  3.7 Chapter Summary

Chapter 4 Recognition Performance
  4.1 Overview
  4.2 Subjective Tests to Determine Useful Processing Methods
    4.2.1 Methodology
    4.2.2 Images Chosen
    4.2.3 Results
    4.2.4 Test Conclusions
  4.3 Subjective Tests to Determine Influence of Image Type
    4.3.1 Methodology
    4.3.2 Images Chosen
    4.3.3 Results
    4.3.4 Test Conclusions
  4.4 Chapter Conclusions

Chapter 5 Quantifying Information Content
  5.1 Introduction
  5.2 Perceived Information Content in Images
    5.2.1 Images Used
    5.2.2 Multidimensional Visual Information Model
    5.2.3 Test Method
    5.2.4 Test Participants and Instructions
    5.2.5 Test Results
    5.2.6 Strong Visual Information Rankings
    5.2.7 Test Conclusions
  5.3 Information Content Model Fitting
    5.3.1 Possible Image Attributes for a Visual Information Metric
    5.3.2 Metric Development for a Specific Image Quality Class
    5.3.3 Information Content Metric for all Image Quality Classes
  5.4 Correlations Between Recognition Rate And Perceived Information Content
  5.5 Chapter Summary

Chapter 6 Scene Specific Imaging
  6.1 Overview
  6.2 Characteristics of Simple Scenes
    6.2.1 Office
    6.2.2 Home
    6.2.3 Street
    6.2.4 Outdoors
    6.2.5 Head and Shoulders
    6.2.6 Café/Restaurant
    6.2.7 Public Toilets
  6.3 Image Processing targeted to Scene Type
  6.4 Subjective Tests for Scene Weighted Processing
  6.5 Chapter Summary

Chapter 7 A comparison of ROI methods for low quality images
  7.1 Overview
  7.2 ROI Processing applied to Entire Image
    7.2.1 Image Preparation
    7.2.2 Processing Methods Compared
    7.2.3 Images Used
    7.2.4 Experiment
    7.2.5 Results
  7.3 Digital Zoom
    7.3.1 Automatic Zoom Methods
    7.3.2 Results of Automatic Zoom Experiment
  7.4 Chapter Summary

Chapter 8 Discussion, Conclusion and Future Work
  8.1 Discussion and Conclusion
  8.2 Future Work
    8.2.1 Motion
    8.2.2 Colour
    8.2.3 Device interfacing
    8.2.4 Supplementary/Symbolic Information
    8.2.5 Range Indication
    8.2.6 Simulating Techniques
    8.2.7 Other Testing Techniques
  8.3 Final Word

References

Appendix A Section 4.2 Experiment
  A.1 Example Test Stimulus
  A.2 Booklet Design
  A.3 Borderline Recognition Assessment for Section 4.2 Experiment
Appendix B Section 4.3 Experiment
  B.1 Example Test Stimulus
  B.2 Borderline Recognition Assessment for Section 4.3 Experiment
Appendix C Chapter 5 Experiments
  C.1 Example Test Stimulus – 7 Images Presented all at same time
  C.2 Example Test Stimulus – 3 Images Presented all at same time
  C.3 Example Test Stimulus – Paired Comparison Experiments
  C.4 Booklet Design
Appendix D Chapter 6 Experiment
  D.1 Example Test Stimulus
  D.2 Booklet Design
Appendix E Chapter 7 Experiments
  E.1 Training Image Database
  E.2 Example Test Stimuli for Section 7.2 Experiment
  E.3 Example Test Stimuli for Section 7.3 Experiment
  E.4 Booklet Order for Chapter 7 Experiments
List of Figures

Figure 1.1: Mean Square Error figures between reference image and low quality versions
Figure 3.1: Basis of Visual Prostheses
Figure 3.2: Pixelised vision; top – greyscale, bottom – binary images
Figure 3.3: Circular pixelised vision
Figure 3.4: Alternate stimulation strategies
Figure 3.5: Simulating the effect of modulating phosphene brightness
Figure 3.6: Importance Mapping concept
Figure 3.7: Safety Post enhancement with advanced image processing techniques
Figure 3.8: Enhancing the information content of a low quality image of stairs
Figure 4.1: Image set used in the psychophysical testing
Figure 4.2: Image Processing techniques used in the psychophysical testing
Figure 4.3: Recognition rate for objects in the image set
Figure 4.4: Effect of spatial resolution and grey-scale on object recognition
Figure 4.5: Images with 3 grey levels (white, grey, black) were not significantly more recognisable than black & white images
Figure 4.6: Comparing resolution and grey scale
Figure 4.7: Significantly higher recognition is achieved with increased spatial resolution (right) over increased greyscale resolution (left)
Figure 4.8: Object recognition rate for various processing methods (n=110)
Figure 4.9: Edge images were not well recognised
Figure 4.10: Subjective preferences between image and its inverse – some subjects preferred white on black, others black on white
Figure 4.11: Test Objective – Obtain a Recognition-Quality curve
Figure 4.12: The nine image quality classes used in the tests
Figure 4.13: Test image set
Figure 4.14: Recognition-Quality Envelope of recognition for all images in test set
Figure 4.15: Variation in recognition among image types
Figure 4.17: Recognition rates for each object type
Figure 5.1: Two images with different amounts of visual information content
Figure 5.2: The nine image quality classes used in the tests
Figure 5.3: Multidimensional Visual Information Model
Figure 5.4: Images containing high information content for high quality images
Figure 5.5: Images containing high information content for low quality images
Figure 5.6: Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
Figure 5.7: Calculating Fractal Dimension for Binary Images
Figure 5.8: Calculating Fractal Dimension for Greyscale Images
Figure 5.9: Determining image similarity and symmetry – pixel matching
Figure 5.10: Determining image similarity and symmetry – pixel difference and average value
Figure 5.11: Correlation between 15 image attributes and perceived information content
Figure 5.12: Metric performance for Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
Figure 5.13: Example relationship between recognition and information content (25x25 Binary Paired Comparison data)
Figure 6.1: Visual stimuli used to gauge perception of low quality images
Figure 7.1: Image preparation
Figure 7.2: Shaded areas do not necessarily correlate with scene objects in images that have had grey levels equalised (right-most image)
Figure 7.3: Processing methods used in tests (refer text for details)
Figure 7.4: Example Distribution – Size Map distribution for Beach training images
Figure 7.5: Images used in comparison tests
Figure 7.6: When presenting the entire image, results indicate a clear preference for no Importance Processing (n=96)
Figure 7.7: Digital zoom concept – the most salient area is identified in an image and resized to the maximum display resolution
Figure 7.8: Trim method to select zoom window
Figure 7.9: Scope Box method to select zoom window
Figure 7.10: Saliency Map developed by University of Southern California
Figure 7.11: Zoom window selected from central 25% of image
Figure 7.12: Zoom window selected from central-bottom 25% of image
Figure 7.13: Image Preparation for Digital Zoom Tests
Figure 7.14: Example stimulus showing detail of zoom window border
Figure 7.15: Preferences for methods to automatically zoom into an image (n=96)
Figure 8.1: Halftone representation
List of Tables

Table 3.1: Thesis experiments
Table 4.1: Analysis of Variance for various processing methods
Table 4.2: Correct image identification (n=25)
Table 5.1: Perceived information content for comparing 7 different object types
Table 5.2: Pattern analysis for information content rankings
Table 5.3: Dominant visual information viewer preferences
Table 5.4: Correlation coefficients for variables considered for metric for 256x256 greyscale images
Table 5.5: Candidate models for metric for 256x256 greyscale images
Table 5.6: Model Predictions of 256x256 Greyscale Image Set using model: f(Edges, Entropy)
Table 5.7: Candidate models for a metric for 10x10 binary images
Table 5.8: Summary of metric performance
Table 5.9: Predictive performance of metric proposed for all image qualities
Table 5.10: The number of correct metric predictions of images with the highest information content
Table 5.11: Correlation coefficients between recognition rate and perceived information content
Table 6.1: Image Processing descriptors of different scene types
Table 6.2: Attentional feature weights for each scene type
Table 6.3: Preferred ranking for image representation
Statement of Original Authorship
The work contained in this thesis has not been previously submitted for a degree or
diploma at any other higher education institution. To the best of my knowledge and
belief, the thesis contains no material previously published or written by another
person except where due reference is made.
Signed:
Date:
Acknowledgements

I would like to thank Anthony Maeder, first for his enthusiasm and willingness to
conduct the research at QUT and also for his supervision and guidance, textbooks,
time, motivational prods, positive feedback, attention to detail and financial support.
I could not wish for a better supervisor. The comments and advice from my
associate supervisor Wageeh Boles are also appreciated.
Massive thanks to QUT for a Postgraduate Research Award and the Faculty of Built
Environment and Engineering for a top-up scholarship. Financial assistance with
conference attendance from the SAIVT program director Sridha Sridharan and the
QUT Grants-in-Aid office is also appreciated. Thanks to the faculty administration
staff especially Scott Allberry, for assistance with preparing questionnaires.
I am grateful to Wilfried Osberger and Laurent Itti and Dirk Walther of iLab for
permission to implement variations of their importance map and saliency codes in
my research. A big thank you to all the volunteers who participated in the subjective
testing, including students at Brisbane State High School and their coordinating
teachers, as well as family members and colleagues in the Image and Speech lab who
were roped in. The efforts of Jason Pelecanos towards soccer matches, BBQs and
other inclusive activities to invigorate the research lab are appreciated.
I thank Melissa, Jeremy and Ruth: it has been a big chunk of our lives with some
major ups and downs. I really appreciate the child care & washing (!), mental
support and lifts into uni dropping Dad off over those speed-bumps. Thanks to my
parents for their silent and not-so-silent encouragement with research and
publications.
This work is dedicated to Terry – may you some day see your wife, children and
rainforest paradise.
Publications

The research has resulted in the following fully refereed publications (or abstract
refereed only where indicated by an asterisk).
Boyle J, Maeder A, Boles W, Digital Imaging Challenges for Artificial Human
Vision, South African Computer Journal (26), pp.222-227, 2000
Boyle J, Maeder A, Boles W, Image Processing and Artificial Human Vision
Systems, WoSPA2000 - 3rd Australasian Workshop on Signal Processing
Applications, Brisbane, 2000
* Boyle J, Maeder A, Boles W, Challenges in Digital Imaging for Artificial Human
Vision, in Human Vision and Electronic Imaging IV, Rogowitz T, Pappas T,
Editors, Proceedings of SPIE Vol 4299, pp.533-543, 2001
Boyle J, Maeder A, Boles W, Static Image Simulation of Electronic Visual
Prostheses, ANZIIS 2001 – Proceedings of the 7th Australian and New Zealand
Intelligent Information Systems, Perth, pp.85-88, 2001
{1st prize student paper competition}
Boyle J, Maeder A, Boles W, Image Enhancement for Electronic Visual Prostheses,
Australian Physical & Engineering Sciences in Medicine Journal 25(2), pp.81-86,
2002
* Boyle J, Maeder A, Boles W, Visual Perception with Electronic Visual Prostheses,
Physical Sciences and Engineering in Medicine Queensland Branch Local
Symposium, Brisbane, June 2002
{1st prize student paper competition}
Boyle J, Maeder A, Boles W, Visual Perception of Low Quality Images, Proceedings
of the 9th International Conference on Neural Information Processing, Singapore,
2002
Boyle J, Maeder A, Boles W, Inherent Visual Information for Low Quality Image
Presentation, WDIC2003 - Proceedings of the 2003 APRS Workshop on Digital
Image Computing, Theme: Medical Applications of Image Analysis, Brisbane,
pp.51-56, 2003
Boyle J, Maeder A, Boles W, Can Environmental Knowledge Improve Perception
with Electronic Visual Prostheses?, WC2003 – Proceedings of the World Congress
on Medical Physics and Biomedical Engineering, Sydney, 2003
Boyle J, Maeder A, Boles W, Scene Specific Imaging for Bionic Vision Implants,
ISPA2003 – Proceedings of the 3rd International Symposium in Image and Signal
Processing and Analysis, Rome, pp.423-427, 2003
Chapter 1 Introduction
1.1 Overview
Blindness affects millions of people worldwide and over 100,000 Australians. This
research supports quality-of-life improvements for them by exploring appropriate
image processing techniques for electronic visual prosthesis systems: so-called
“bionic eyes”. These systems consist of a vision chip or camera that records the
visual world and transmits this information via electric pulses to implanted electrodes
in contact with the retina, optic nerve or visual cortex. These three sites provide
lower resolution opportunities for synthetic image presentation to the human visual
system.
Existing mobility aids for the visually impaired include canes, guide dogs and sonar
glasses. However, it is anticipated that the richness of sensory substitution would be
much greater with a visual prosthesis. Although unlikely to
recreate the full experience of vision, visual prostheses may provide enough visual
cues for blind people to perform every-day tasks such as navigation, recognition, and
reading.
The term “low quality” used in this thesis refers to images that can contain only
relatively little visual information content. Knowledge of human perception
of low quality (eg. low resolution) images, such as those expected from visual
prostheses, is very limited. While researchers have worked extensively in
characterising high quality image perception, most of this work is not relevant or
useful for low quality images. Yet it is in this low resolution regime that the most
immediate gains in artificial vision can be made. Ways to identify the sparse
information that is important for viewer understanding of scenes when presented in
low quality images are thus needed.
1.2 Aim

The overall aim of this research is to develop simple image processing techniques
that improve perception for users of electronic visual prostheses. This can be broken
down into several component elements of investigation:
• Determine useful image processing methods for artificial human vision systems
• Allow wider implementation of recently developed Region-of-Interest image
processing routines beyond previous applications; these routines identify
important and salient areas within an image
• Facilitate further understanding of the human visual system, specifically
perception performance from low quality visual information
• Provide a basis from which more complex and beneficial (eg. real time) image
processing units can be developed such that a prosthesis may provide maximum
benefits to the blind
1.3 Scope
The research described in this thesis is bounded by the following limitations and
assumptions:
1. Psychophysical sensations of what might be seen with a prosthesis are simulated
by presenting visual stimuli to normally sighted viewers. Prosthesis development
in Australia is currently limited to animal models and data from implanted
patients is not yet available. It is anticipated that some of the experiments
comparing image processing techniques described in this thesis could be repeated
using implanted patients when available.
2. Perception experiments undertaken in the project were based on static/still
images. Improved perception is anticipated if the techniques are applied to image
sequences/video as a user would be offered a richer representation of a scene, in
addition to moving about to see how scene elements (background/foreground)
interact.
3. The images presented to subjects in simulation experiments were ordered pixel
arrays in a square pattern (equal image height and width); a sketch of this
simulation follows the list. The reported evoked visual field of implanted patients
is not a regularly ordered array and varies from patient to patient. Due to this
wide variability it was decided to present a symmetric image representation to
gain some understanding of low quality image perception. It is anticipated that
implant users would undergo tuning and training through post-operative
exercises, similar to auditory implant programs, to use viable electrodes efficiently.
4. There exist other techniques related to implant electrode stimulation which may
produce different psychophysical sensations. One such technique could be using
different electrode current flow and return paths to create wide variations in
perceived visual sensations. This thesis does not consider such techniques and is
instead based on manipulating conventional pixel-based images digitally (digital
image processing) to improve visual perception.
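For concreteness, the sketch below shows one way a conventional digital image can
be reduced to a square, ordered array of phosphene-like pixels as described in point
3 above. It is an illustrative sketch only, written in Python and assuming the numpy
and Pillow libraries; the grid size and number of grey levels are example parameters,
not necessarily those used in the thesis experiments.

    import numpy as np
    from PIL import Image

    def pixelise(path, grid=25, levels=2, display=250):
        """Simulate prosthetic vision: downsample a greyscale image to a
        grid x grid array and quantise it to a small number of grey levels."""
        img = Image.open(path).convert("L").resize((grid, grid), Image.BILINEAR)
        arr = np.asarray(img, dtype=float) / 255.0
        # levels=2 gives a binary (black and white) stimulus.
        quantised = np.round(arr * (levels - 1)) / (levels - 1)
        out = Image.fromarray((quantised * 255).astype(np.uint8))
        # Nearest-neighbour upsampling keeps each simulated phosphene square.
        return out.resize((display, display), Image.NEAREST)

    stimulus = pixelise("scene.png", grid=25, levels=2)  # 25x25 binary stimulus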
1.4 Thesis Structure

The thesis is structured as two main sections: Background (Chapters 1 – 3) and New
Work (Chapters 4 – 7), followed by a Conclusion (Chapter 8).
A review of the literature is presented first to establish the fundamental theory
relevant to image understanding and visual prostheses. Chapter 2 describes research in the area
of image quality and visual perception. Chapter 3 describes current artificial vision
research including image processing activities specifically related to bionic eye
projects. At the end of Chapter 3, several requirements for prosthetic image
processing systems are identified. These requirements drive research questions
around which the remaining thesis chapters are based.
Chapter 4 explores low quality recognition performance and establishes the
applicability of a computationally cheap Region-of-Interest (ROI) processing
technique to low quality images. A model for characterising low quality images on
the basis of how much visual information they contain is developed in Chapter 5.
An approach to tailor image processing depending on the type of scene is explored in
Chapter 6. Finally Chapter 7 compares several ROI methods against each other to
identify which may be most helpful when moving through a scene. The applicability
of all image processing techniques explored in the thesis is tested using normally
sighted human perception (subjective testing) experiments.
Chapter 8 provides a discussion and conclusion of the work and provides some
commentary on how the research can be extended.
1.5 Contributions
Resulting from this work are two significant original contributions to knowledge,
which are summarised below. These contributions are developed through the
examination of a number of related research questions, which are set out in detail
in Chapter 3, Section 3.6.
First, investigations based on early vision aspects of digital images are used to
provide recommendations for effective image processing methods for enhancing
viewer perception when using visual prosthesis prototypes.
The hypothesis that low level processing can improve scene understanding was verified
with several subjective experiments. Although bounded by ultra low quality
environments, prostheses can facilitate some recognition performance. By including a
range of image processing routines or modes of operation, users can gain as many visual
cues from a scene as possible.
When considering prototype resolution, enhanced perception is achieved with
increased spatial resolution of implant electrodes over increased greyscale resolution.
Thus a fundamental aspect of implant design is maximising the spatial resolution.
There are also considerations to be made with respect to context. The scene type
most easily recognised by users of vision prostheses is the face. The spatial pattern
of two eyes and an underlying mouth is easily recognised even at the lowest image
quality environments tested in the thesis experiments. Thus enhanced perception is
likely when viewing simple face scenes (eg. TV newsreader) compared to other
scene types.
Incorporating a digital zoom function in prosthesis designs could lead to enhanced
perception, and Region-of-Interest processing techniques (which automatically
identify salient areas within images) should be used to obtain the zoomed image.
This is an original application of these techniques beyond traditional applications
such as image compression.
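As an indication of how such a zoom might operate, the sketch below crops a
window around the peak of a saliency map and resamples it to the display
resolution. This is a minimal sketch under stated assumptions: Python with numpy
and Pillow, the saliency map taken as given, and the window size and function
names purely illustrative.

    import numpy as np
    from PIL import Image

    def roi_zoom(image, saliency, window=0.5, out_size=(25, 25)):
        """Crop a window centred on the saliency peak and resample it to
        the low display resolution of the prosthesis simulation."""
        h, w = saliency.shape
        cy, cx = np.unravel_index(np.argmax(saliency), saliency.shape)
        wh, ww = int(h * window), int(w * window)
        # Clamp the window so it stays inside the image boundaries.
        top = min(max(cy - wh // 2, 0), h - wh)
        left = min(max(cx - ww // 2, 0), w - ww)
        crop = image.crop((left, top, left + ww, top + wh))
        return crop.resize(out_size, Image.BILINEAR)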
The second contribution is the construction of a metric for basic information content
required for interpretation of visual scenes at low image quality.
The ability of a low quality image to be recognised was found to be positively
correlated with the amount of perceived information content in that image. Thus an
image with high information content can be expected to be more easily recognised at
low quality than an image containing low information content.
Experiments reported here have identified that a simple face with no surrounding
clutter is most visually informative amongst those image types tested. Also, higher
visual information content results from:
(a) more objects in the scene;
(b) closer objects;
(c) strong edges, arising from high intensity contrast.
Finally visual information can be quantified using the number of edges in the image.
Thus, image perception can be enhanced by maximising the number of edges within
the image.
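To illustrate this edge-based view of information content, the following sketch
counts strong-gradient pixels. It is an assumption-laden example (Python with
numpy and scipy); the Sobel operator and the threshold value are illustrative
choices, not necessarily those used for the thesis metric.

    import numpy as np
    from scipy import ndimage

    def edge_count(grey, threshold=0.2):
        """Rough proxy for visual information content: the fraction of
        pixels whose Sobel gradient magnitude exceeds a threshold."""
        g = grey.astype(float) / 255.0
        gx = ndimage.sobel(g, axis=1)  # horizontal gradient
        gy = ndimage.sobel(g, axis=0)  # vertical gradient
        magnitude = np.hypot(gx, gy)
        return float(np.mean(magnitude > threshold))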
The above contributions, while somewhat general in nature, are seen as forming the
essential elements of any visual prosthesis system capable of providing only low
quality images to the observer.
Chapter 2 Image Quality and Visual Perception
2.1 Introduction

This chapter provides some background on the overlap of visual perception and
imaging, as a framework for improving perception through electronic visual
prostheses. As a visual response is the goal of visual prostheses, it is relevant to
explore research in visual perception within a framework of image quality. Image
quality is a term used to denote the amount of information retained in an image that
has been degraded in some way from its ideal form.
The study of the interaction between human visual perception and electronic imaging
is one of the key growth areas in imaging science [81]. Concerning quality of images,
research to date has been focused on the high quality still images and video
associated with modern multi-media environments. Perceptual models for image
quality using characteristics of the human visual system have been developed
[1,105], and with them, perceptually based image compression techniques (eg. [69]).
However, this work is based on high quality images and there is no similar quality
characterisation for the emerging field of visual prosthetics.
The rest of this chapter is presented in four sections. Section 2.2 provides an
overview of perception physiology: what happens in the eye during visual
perception. In Section 2.3, a hierarchical model for perception and imaging is
presented. Research is described first in the areas of low level, or early vision. The
discussion includes previous work in the field of image quality and how quality has
been traditionally assessed. Following from this, research in higher levels or
cognitive vision is described. Section 2.4 presents an area of work that incorporates
both low and high levels of vision, known as Region-of-Interest processing. Finally
Section 2.5 provides an overview of some research approaches concerning visual
understanding and complexity.
2.2 Visual Perception Physiology
An important component in understanding how to design visual prostheses is the
physiology of visual perception [59]. The retina is the innermost layer of the back of
the eye, and is organised into layers that contain photoreceptors, interneurons and
blood vessels. The embryonic development of the retina results in an inside-out design,
so that the photoreceptors are nearest the back of the eye and light must pass through
the retinal interneurons and blood vessels to reach the photoreceptors. The two types
of photoreceptors, cones and rods, contain visual photopigment. The first step in
photoreception is photopigment bleaching: when light activates visual pigment
molecules. Bleaching initiates a sequence of events leading to a change in the cell
membrane potential. Ganglion cells are the retinal cells whose axons form the optic
nerve, so their output is the final product of the information processing that occurs in
the retina. Ganglion cell axons enter the optic nerve in an orderly fashion, so that
adjacent axons in the nerve correspond to adjacent receptive fields on the retinal
surface. The pathway ascends to the lateral geniculate nucleus (LGN) of the
thalamus and then projects to the primary visual cortex in the occipital lobes of the
brain.
Studies of anatomy, physiology, and human perception (eg. stroke victims) conclude
that the human visual system is subdivided into several separate parts whose
functions are quite distinct [48]. There are two subdivisions in the visual pathway
(LGN + visual cortex): parvocellular and magnocellular. Both have inputs from the
same rods and cones, but have differences in the way the photoreceptor inputs are
combined. Their receptive fields (regions of the retina over which impulse activity
can be influenced) are circularly symmetric and show centre-surround arrangement
(also in retinal bipolar cells). These cells are configured to convert information from
photoreceptors into information about spatial discontinuities in light patterns. Some
cells are excited (impulse rate speeded up) by illumination of a small retinal region
and inhibited (impulse rate slowed down) by illumination of large surrounding
region, while others are the reverse of this.
The magno and parvo divisions differ in four ways, which implies that they contribute
to different aspects of vision:
1. colour: 90% of parvo layer cells are wavelength sensitive (they combine cone
inputs in effect to subtract them); the magno system is colour blind, or
wavelength insensitive (they sum inputs of 3 cone types so the response to
illumination is on or off at all wavelengths). For example, to the magno system two
different colours such as red and green at the same relative brightness are indistinguishable.
2. acuity: magno cells have larger receptive field centres than parvo (by a factor of
2 or 3) ie. lower spatial resolution. For both magno and parvo cells, the receptive
field size increases with distance from the fovea. This is consistent with
differences in acuity between foveal and peripheral vision.
3. speed: magno cells respond faster and more transiently than parvo (which
suggests a role in detecting movement). Many cells at higher levels in this
pathway are selective for orientation and for direction of movement.
4. contrast: magno cells are more sensitive to low contrast stimuli than parvo.
At higher stages the continuations of these pathways are selective for different
aspects of vision (form, colour, movement, stereopsis). The extended M/P pathways
are described further in the visual cortex (blob and interblob systems) and properties
include:
• orientation selective
• selective for direction of movement
• end-stopped (respond to short but not long line/edge stimuli)
• colour information and colour contrast information in separate systems (eg.
colour contrast used to determine borders, but not information of colours forming
the borders)
• brightness selective
• respond poorly when either eye is stimulated alone but vigorously when both
eyes stimulated together (stereoscopic depth).
Hubel and Livingstone [48] argue that end stopping (like the centre-surround system)
is an efficient way of encoding information about shape. They also propose that
different kinds of visual tasks differ in their colour, temporal, acuity and contrast
characteristics. Other findings by the authors are as follows:
• People follow brightness alterations much faster than pure colour alterations (the
magno system is colour blind and faster than the parvo)
• Movement perception reflects magno characteristics: colour blindness, quickness,
high contrast sensitivity and low acuity (which has been proved with perceptual
experiments).
• Motion perception, stereoscopic depth perception, the ability to use relative
motion as a depth cue, shading as a depth cue, and perspective as a depth cue are
all lost at equiluminance.
• The retinal image is two dimensional. In order to capture three dimensions, the
human visual system uses many kinds of cues besides stereopsis and relative
motion: perspective, gradients of texture, shading, occlusion and relative position
within the image.
• Magno system is more primitive than parvo. Parvo is only well developed in
primates who can scrutinise in much more detail shape, colour and surface
properties of objects ie. visual identification and association.
• Results are presented that suggest that only luminance contrast and not colour
differences are used to link parts comprising a scene together.
It is difficult to predict if the aspects of vision described above (eg. brightness
selective, movement perception) will be similar for artificially induced vision. As
the parvocellular and magnocellular pathways extend through to the visual cortex,
any similar induced perceptions would probably be independent of the stimulation
site of the prosthesis.
2.3 A Visual Hierarchy Model
In this section a hierarchical model of imaging and visual perception is used to
describe research activities in this area. A good review of the research concerning
the interplay between visual perception and electronic imaging is given by Rogowitz
et al [81]. Their review draws on observations made during human interaction
research (see for example [15,21,69,70,75]). Arising from these observations is a
visual hierarchy, ordered from higher to lower levels of vision:
• aesthetic & emotional aspects
• cognitive effects: memory, semantic categorisation and visual representation
• perceptual effects: colour constancy, suprathreshold pattern & texture analysis
• visual phenomena mediated by the threshold sensitivity of low-level spatial,
temporal and colour mechanisms.
2.3.1 Early Vision Effects
At the bottom of the hierarchy are low-level, or “early vision” effects. Many early
vision models have been proposed to characterise image quality on the basis of these
low level effects. It is argued in this thesis that these early vision models cannot be
extended to the poor quality images anticipated from visual prostheses (eg. 10x10 or
25x25 resolution, black and white images). The early vision models have arisen to
address distortions, or artifacts, that have been caused by electronic imaging
processes, such as compression and halftoning. Such distortions include blurring,
granular noise, jerkiness and blocking. The body of knowledge on image quality is
based on reducing the visibility of these distortions. Many vision-based algorithms
have consequently been developed in the fields of still image and video compression,
image enhancement, restoration and reconstruction, image halftoning and rendering,
image and video quantisation and display.
Traditional quality models are based on measuring the differences between an
original “perfect” quality image and an altered image having undergone an imaging
process. The low spatial resolution binary images anticipated from visual prostheses
are so dramatically different from the perfect reference image that the quality models
developed to date in the literature do not apply.
To illustrate that these quality models are not useful, consider the popular quality
metric of Mean Square Error (MSE), defined as:

    MSE = (1/MN) · Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} (x_{ij} − x̂_{ij})²        (Equation 1)

where M and N are the number of horizontal and vertical pixels, x_{ij} is the value of
the original pixel at position (i,j), and x̂_{ij} is the value of the distorted pixel at
position (i,j).
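For reference, Equation 1 is straightforward to compute; a minimal sketch in
Python, assuming the two images are equally sized greyscale numpy arrays (a
10x10 version, for instance, would first be resampled back to the reference size
so the arrays align):

    import numpy as np

    def mse(original, distorted):
        """Mean Square Error of Equation 1: the mean squared difference
        between corresponding pixels of two equally sized images."""
        x = original.astype(float)
        x_hat = distorted.astype(float)
        return float(np.mean((x - x_hat) ** 2))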
Figure 1.1 shows the MSE from a reference perfect image for several distorted
versions of the reference. As can be seen, the MSE metric is best suited where the
differences are small, and the MSE for the low quality 10x10 binary image is
approaching that of a grey stripe pattern which did not originate from the Reference
Image.
[Figure 1.1: Mean Square Error figures between reference image and low quality
versions – black spot, 25x25 binary, 25x25, 10x10, and stripe pattern;
MSE = 338, 760, 4665, 6727, 8919, 10673]
Janssen [43] has proposed an alternate description of image quality which differs
from the traditional “perceived distance to the original”. He states that quality is how
well an observer is able to employ an image as a source of information about the
outside world, and proposes the following metric:
Quality = (λ1⋅N) + (λ2⋅C) + λ3
where N = naturalness, C = brightness contrast and λ1, λ2 and λ3 are scalar constants
determined from experiments.
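Purely to make the form of this model concrete, a toy evaluation is sketched below;
the λ weights are invented placeholders, not Janssen's experimentally determined
constants.

    def janssen_quality(naturalness, contrast, l1=0.6, l2=0.3, l3=0.1):
        """Janssen-style linear quality model: l1*N + l2*C + l3.
        The weights here are hypothetical, for illustration only."""
        return l1 * naturalness + l2 * contrast + l3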
Much of the image quality literature is focussed on reducing the visibility of image
artifacts (distortions), and so some of the approaches used for this purpose are
included here. Rogowitz et al [81] summarise methods for characterising these
image artifacts:
1. physical measurement: measuring key image parameters; comparing images on a
pixel-by-pixel basis using a metric such as mean squared error
2. use human observers to judge perceived quality
3. develop metrics based on experiments measuring human visual characteristics to
estimate human judgements
These approaches are recognised in other sources [24,53,67] and are described below
in further detail.
2.3.1.1 Physical Measurement Metrics
Commonly used physical measurement metrics include Signal-to-Noise Ratio (SNR),
Peak Signal-to-Noise Ratio (PSNR), Mean Absolute Error (MAE), Mean Squared
Error (MSE), local MSE, and distortion contrast. These metrics are easy to use in
that information on viewing conditions is not needed and they are computationally
simple. However, the methods are considered poor (even for high quality images) in
that they do not work well on images with different content (eg. edges/textured
regions) [24] and they treat all impairments as equally important [67].
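As an example of how computationally simple these metrics are, a sketch of PSNR
built directly on the MSE of Equation 1 (assuming 8-bit images, for which the peak
value is 255):

    import numpy as np

    def psnr(original, distorted, peak=255.0):
        """Peak Signal-to-Noise Ratio in decibels. Higher values mean the
        distorted image is numerically closer to the original."""
        err = np.mean((original.astype(float) - distorted.astype(float)) ** 2)
        if err == 0:
            return float("inf")  # the images are identical
        return 10.0 * np.log10(peak ** 2 / err)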
2.3.1.2 Human Observers as Judges of Perceived Quality
Human observers could be used to directly obtain feedback on image quality, and
indeed the ‘gold standard’ in determining image quality is the human observer. By
nature, this feedback incorporates human visual system considerations, but is
expensive and the results may not generalise [81]. However this method can be of
use in evaluating low quality images and is therefore of relevance to this thesis.
Traditionally this feedback, obtained either from trained experts or via psychological
experiments, has been used to compare a distorted image, resulting from an imaging
process, against its original.
If the artifacts are visible (ie. suprathreshold), quality can be assessed via:
• a rating scale 1-5, eg. bad, poor, fair, good, excellent; however this will only
characterise large quality differences and may produce inconsistent results when
evaluating images with different types of artifacts
• paired comparison experiments, where a 7 point scale ranges from –3 to 3 for
much worse, worse, slightly worse, same, slightly better, better, much better
• two stimulus forced choice scales which ask which image has the higher quality;
comparisons can be made using images with different types of artifacts
If the artifacts are just on the visual threshold, just noticeable difference (JND)
testing is often used to assess quality. JND tests are not biased by differences in
types of artifacts, and are often used to predict the visually lossless point between a
compressed image and the original. Display time and learning effects affect the JND
point, especially if an observer is given hints about where to look. One can also
maintain a visible difference map and have a user specify image quality for different
regions in an image [24].
Quality has also been assessed via the receiver operating characteristic method,
which measures the performance of an observer in making decisions using an altered
(eg. compressed) image. The rate of true positive decisions is compared with the rate
of false positive decisions, and the decisions can be subject to fatigue and time of
day. Therefore the reliability of ROC studies is dependent on a large number of
tests conducted under a range of human circumstances [53]. Typical studies are done
with a number of controlled observers.
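The two rates involved can be made concrete with a small sketch; the boolean
decision arrays here are hypothetical, and a full ROC study would sweep the
observer's decision criterion to trace out the curve.

    import numpy as np

    def tpr_fpr(decisions, truth):
        """True and false positive rates for one set of observer decisions.
        `decisions` and `truth` are equal-length boolean arrays."""
        tpr = np.sum(decisions & truth) / max(np.sum(truth), 1)
        fpr = np.sum(decisions & ~truth) / max(np.sum(~truth), 1)
        return float(tpr), float(fpr)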
A detailed discussion of human observer research methods in electronic imaging is
contained in Snyder and Trejo [90]. This covers psychophysical (acuity,
discrimination), physiological methods (electroretinogram, positron emission
tomography, cerebral blood flow) and behavioural methods (search time, legibility,
response time, which includes subtasks of visual perception, decision making and
motor response). Psychophysical tests are of most relevance in this thesis as the
research hypothesis is validated by such methods. Nemine [62] defines
psychophysical techniques as methods used to measure characteristics of the
environment in terms of their psychological value. Travis [98] observes that
psychophysics is non-invasive, and involves investigation of a system by the study of
the psychological response to physical stimuli. The physics refers to measurement of
the stimuli, while the psychology refers to the measurement of sensation.
Recommendation 500 of the International Telecommunication Union is often cited in
the field of image quality evaluation [41]. This document covers methods and
viewing conditions for assessing perceived quality in a standardised way. More
recently, the ITU has developed Recommendation P.910 to standardise methods for
multimedia quality assessment. The premise behind subjective assessment is the use
of human observers to rate video sequences (usually short clips), and thus it may be
impractical to use these methods for the in-service continuous evaluation of image
quality. A Video Quality Experts Group (VQEG) has been established to provide
other objective methods for video image quality evaluation [101].
Obtaining subjects’ perceptions of the low quality images anticipated from visual
prostheses has similarities with the classical inkblot tests used in psychology, known
as the Rorschach [27]. Clear guidelines for psychological assessment have been
established [30], covering seating, verbal instructions, recording and enquiry on
responses. Error can be introduced by censorship by the subject, scoping errors, poor
handling of subtleties of interpretation, incorrect incorporation of age or education
and examiner bias (illusion correlation). Alterations in wording, rapport and
encouragement can significantly alter responses. A large number of variables are
likely to produce spurious random significance. Side-by-side seating is
recommended if possible.
Tests used in this thesis aim to determine the intelligibility of low quality
images, via a viewer’s ability to recognise an object. Specifically related to these
tests are the interpretation and questioning used in ink-blot testing. For example, if
shown a pattern, typical questions would be:
Q. What is this object?
Q. What about it made it look like [ _________ ]?
Interpretation of the answers has been assisted by reference codes and categories. A
location attribute is used to categorise the area of the image used to draw the
viewer’s conclusions:
• whole response (W) – if the entire image was used
• unusual detail (Dd) – location responses made by <5% of subjects
• common detail (D) – a frequently identified area.
Determinants are noted for any style or characteristic of the image eg. shape, texture.
Are there any arbitrary contours created where none exist? Finally the content of the
answer is allocated a category. These include whole human (real person) eg. Lena,
the image processing test image; whole human (myth/fictional) eg. ghost, angels;
human detail eg. person without head; whole animal; anatomy eg. lungs, stomach; art
eg. statues, jewellery; botany eg. plants; clothing eg. hat; clouds; food; household;
landscape; science.
Other cognitive issues in image quality measurement are given by De Ridder [19].
He states that methods for assessing perceived image quality can produce biased
estimates of a viewer’s quality impression. These include instructions given to
subjects, and choice of rating scale, such as:
- single stimulus scaling: 1 (lowest sharpness) to 10 (highest sharpness)
- double stimulus scaling: as above, but with a reference image introduced
- comparison scaling: from –10 (1st image much sharper than 2nd) to +10
(2nd image much sharper than 1st).
2.3.1.3 HVS-based Metrics
The next category for consideration is metrics that are based on experiments
measuring human visual characteristics. To overcome the shortcomings of the
simple physical measurement metrics such as Peak Signal-to-Noise Ratio or Mean
Absolute Error, a large number of often complex perceptual quality metrics have
been proposed. A good summary of perceptual metrics as applied to image quality
research is contained in Ahumada [1]. Perceptual metrics can be classified as
follows:
• Metrics with simple characteristics; these include contrast sensitivity function
and luminance adaptation (refer below);
• Metrics incorporating preferences for suprathreshold artifacts; these attempt to
model the objectionability of imaging artifacts;
• Threshold perceptual models; these metrics predict the visibility of distortions
near the visual threshold. Visible-difference maps can be created which specify
the probability of seeing a difference between two images at each pixel location.
Other models give the visibility of errors within each frequency band; overall
image quality is then determined by identifying the frequency band with the most
visible artifacts.
There are several visual factors used in perceptual image quality metrics [24]:
1. Contrast sensitivity functions
The contrast threshold function (CTF), or its inverse, the contrast sensitivity
function (CSF), is the most widely used attribute for metrics. The CTF indicates
when frequency components just become visible, and specifies the internal noise
level across spatial frequencies; ie. it identifies the amount of quantisation that
can be applied near the visibility threshold for the same perceptual error. The
CSF is a measure of the spatial resolution of the human visual system. When
used for image compression applications, the CSF is often assumed to be a low
pass function to ensure quantisation artifacts become less visible at increased
viewing distance (a CSF-weighting sketch is given after this list).
2. Luminance adaptation
Luminance adaptation is the second most commonly used attribute. The Weber-
Fechner law describes how the just-noticeable luminance difference in a stimulus
is proportional to the mean luminance of the stimulus (ie. the contrast threshold
remains constant for increasing luminance levels). As the background luminance
increases, the sensitivity to errors decreases proportionately. Again for
compression applications, as local luminance increases, an increased level of
quantisation can be tolerated. Luminance adaptation can be implemented in the
spatial or frequency domain.
3. Linear transforms
Psychophysical studies have shown that the human visual system has visual
channels selective to spatial frequency (with approximately an octave bandwidth)
and to orientation (with bandwidths of 15 deg to 60 deg). Several desirable
properties are used when choosing linear transforms to model the human visual
system: frequency and orientation selectivity, linear/quadrature phase, minimum
overlap between adjacent channels, shift invariance and scale invariance.
4. Masking: contrast; noise; & mutual
Contrast/pattern masking is where a signal is masked (its visibility reduced) by
the presence of another signal or noise. For compression applications, it is
desired to have the image content (eg. textured areas) mask the quantisation
noise. For masking to occur, the image and noise signals need to be in the same
spatial location, have the same spatial frequency and have the same orientation.
Incorrect prediction of contrast masking is the major reason why perceptual
metrics fail. A major problem is that metrics model contrast masking with a
computationally complex multiple-frequency decomposition; perhaps simpler
single-channel metrics could be used. Osberger [67] states that masking may be
one of the most influential components of a vision quality model for complex
natural images (more than the choice of single versus multiple channels).
5. Summation of errors
Summation of errors is done across frequency bands and across space to reduce
dimensionality arising from many channels (an excess of information) into a
single map and perhaps even a single number.
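To make these components concrete, the following is a minimal sketch of how items 1, 2 and 5 might be combined into a toy metric. It is illustrative rather than any published metric: the Mannos-Sakrison CSF approximation, the 0.02 Weber fraction and the assumed display resolution of 32 pixels/degree are all stand-in choices.

import numpy as np

def csf(f):
    # Mannos-Sakrison (1974) contrast sensitivity approximation,
    # with f in cycles/degree.
    return 2.6 * (0.0192 + 0.114 * f) * np.exp(-(0.114 * f) ** 1.1)

def perceptual_error(ref, test, pixels_per_degree=32.0, beta=4.0):
    # 1. CSF weighting: filter the error image's spectrum by the CSF.
    err = test.astype(float) - ref.astype(float)
    fy = np.fft.fftfreq(err.shape[0]) * pixels_per_degree
    fx = np.fft.fftfreq(err.shape[1]) * pixels_per_degree
    f = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))
    visible = np.real(np.fft.ifft2(np.fft.fft2(err) * csf(f)))
    # 2. Luminance adaptation: divide by a Weber-law threshold that
    #    grows with mean luminance (0.02 is an illustrative fraction).
    visible /= 0.02 * max(float(ref.mean()), 1.0)
    # 5. Minkowski summation collapses the error map to one number;
    #    beta around 2-4 is typical, beta -> infinity is worst-case pooling.
    return float((np.abs(visible) ** beta).mean() ** (1.0 / beta))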
If the artifacts are above the threshold of visibility (for example at high compression
ratios), the objectionability of the artifacts depends on the personal preferences of an
observer. This could be incorporated into a model, but most metrics ignore the issue.
There is no consensus regarding distortion levels at which observer preferences play
a role [24].
Metrics should be able to characterise spatial variations in quality across an image.
Therefore some perceptual metrics provide a two-dimensional quality map, which
assigns a level of perceived distortion to each location [106]. There is also the desire
expressed by these metric designers to collapse the map into a single number that
reflects overall quality, so a user can specify a single quality number for the entire
image. The predictive ability of a single number is very dependent on the
psychophysical methods used to validate the metric.
2.3.2 Cognitive Effects
At the higher end of the hierarchical model are higher-level cognitive effects. In
this thesis, it is desired to determine how little information can make up a scene and
still be of use. In order to determine the lowest visual ‘primitives’, or building blocks that
make up an image, an analogy can be made to the process of understanding words if
alphabetical characters are removed from the word. Different thresholds of
understanding may be found depending on the approach:
• Replace letters randomly, and ask subject if it still makes sense
• Build words from letters (randomly), and ask the subject if it makes sense.
Several viewpoints on top down perceptual effects are collected in Cantoni [10].
Yarbus points out that visual exploration of a picture by a human observer follows
different paths according to the particular task that has been assigned. Savina states
that what is important in scene understanding is to collect, in the shortest time
possible, the information that allows one to perform the assigned task, leaving aside
the rest. To answer some question about a scene, one needs to analyse a few restricted
areas for a long time, meaning perhaps that the extraction of certain kinds of
information is difficult compared to the extraction of others. Gibson maintains that
the world around us acts as a huge external repository of information necessary to
act, and we directly extract from time to time elements that are needed.
The human response to visual stimuli is covered well by Hendee and Wells [34].
Bottom up versus top down visual processing is compared. Bottom up models
describe a scene as being built up from individual features, with finer details
obtained by slower scanning of the scene, requiring much time. In top down models,
an overall impression of the entire scene is obtained and features are filled in later.
Studies of visual scan paths indicate that real world knowledge (physical laws), past
experiences and expectancies affect eye fixation. Areas with high information
content (contours, non-homogeneous areas) are fixated by an observer. A perceptual
cycle/search plan is proposed:
[Figure: the perceptual cycle. Available stimulus information modifies the schema;
the schema directs visual exploration; exploration samples the available stimulus
information.]
It is not possible to separate the mechanisms of detection, recognition and
interpretation of visual images. Instead there is a single process of constant interplay
between perception and cognition. Vision can be regarded as having a preattentive
phase and an attentive phase, and terms such as ‘useful field of view’, ‘visual lobe’
and ‘functional visual field’ are used to describe the gathering of information from an
area extending beyond the fovea. Preattentive stimuli are immediately detected by the
parallel preattentive system. Attentive processing of stimuli requires a serial search
by a disc of focal attention.
Preattentive processing is further described by Callaghan [9]. This processing
involves parallel and independent registration of features in the visual field. Features
are registered on separate ‘maps’ and linked to a master location map. Further
attentional analysis is required for identifying an object (the information in the
master map is linked together). Callaghan suggests that perhaps there are no
preattentive/attentive stages but an attention continuum during perceptual processing.
Texture segregation and popout are easily produced from visual primitives eg. line
orientation, hue, brightness, form, line terminators, line length, curvature and
closure. The author describes experiments conducted to support the proposal that
within-region segregation is an important factor in texture segregation. Observers
were presented with arrays of elements of varying hue and form (eg. circles &
squares) and asked to identify boundaries between elements (ie. natural scenes were
not used in the experiments).
Other perceptual studies are reported in [71]. The visual perception program at SRI
includes studies of reduced field of view, limited spatial resolution, system-produced
distortions, system delay and system update rates. Most relevant to the application
addressed by this thesis were limited spatial resolution studies, where a stereoscopic
display was used to present images on 2 colour monitors that the viewer saw as a
fused stereoscopic image. Each monitor had a resolution of 1280x1024 pixels, and
when viewed at 57cm, produced a pixel subtense of 1.5 min arc, equivalent to a
visual acuity of 20/30. Coarser spatial resolution was achieved by programming the
display to use selected small numbers of pixels instead of single pixels to produce an
image point.
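As a check on these numbers, the pixel pitch implied by a 1.5 arcmin subtense at 57 cm is about 0.25 mm. The pixel size used below is inferred from the stated subtense rather than given in the source:

import math

def pixel_subtense_arcmin(pixel_size_cm, viewing_distance_cm):
    # Visual angle subtended by one pixel, in minutes of arc.
    return math.degrees(math.atan2(pixel_size_cm, viewing_distance_cm)) * 60.0

print(pixel_subtense_arcmin(0.025, 57.0))   # ~1.5 arcmin, ie. 20/30 acuity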
Information presented at a rate higher than 30 images/second is not integrated as a
discrete sequence of images due to limitations of brain neural networks [34]. The
extraction of basic attributes of light fields (ie. characteristic features of an image) is
the central issue in biological vision. There is no mathematical theory of image
enhancement available because we are unable to objectively describe the perception-
cognition relationships. The simplest way to approach this is to focus on image
contrast and detail. In practice signals are not band-limited and sampling is a finite
duration operation. More samples are needed than theoretically predicted.
Semantic categorisation research is also of relevance to cognitive perception. Image
semantic researchers [60,80] have found that colour seems to play a significant role
in the perceptual organisation of images. Colour was found to be important in
natural scenes, but not with people or manmade objects/environments (where spatial
organisation, spatial frequency and shape features are more influential). The
presence of strong colours (bright red, lime green, pink, pure white) can indicate
man-made objects, especially when combined with spatial and regional features.
Image segmentation into regions of uniform colour or texture gave opposite results
for man-made scenes (straight lines & boundaries, geometric shapes, sharp edges)
and natural scenes (ragged boundaries, random edge distributions). Semantic
categories appear to correlate with image descriptors (eg. indoor scenes = brownish,
low light levels, many straight edges), and so attempts have been made at a metric
for semantic categorisation. For example, a feature combination to capture semantics
for the ‘cityscapes’ category: skin = no, face = no, silhouette = no, nature = no,
number of regions = large, region size = small/medium, central object = no, energy =
high, number of edges = large, details = yes, colour = brown/grey.
Studies of eye movement paths, or scanpath theory, have also contributed to the
understanding of visual perception [92]. The researchers present eye movement
studies as an approach to describe how we see in our mind’s eye (top-down). When
subjects were asked to first look at a picture and then later asked to visualise that
same picture from memory, the scanpaths were very similar. This provides evidence
for the top down scanpath theory of vision, since there is no external world available
to satisfy the bottom up concept that the external world enters the brain and controls
visual perception. The researchers recognise there is a problem in matching the
bottom up signals coming from the wide peripheral visual field (low resolution and
high sensitivity for moving objects) and from multiple high resolution glimpses by
the centrally located fovea. These regions of interest (ROIs) are sequentially visited
by a string of fixations and saccades (rapid eye-movement jumps), and are
simultaneously matched by top down representations of the hypothesised image. The authors remark
that when the retinal field is mapped onto the visual cortex, there is considerable
magnification of the signals coming from the fovea (ROIs), and a reduction of
signals coming from the low resolution periphery (only colour and textural
segmentation of large areas). These foveal and peripheral representations indicate
the kind of bottom up information entering the visual brain.
Stark and Privitera also conjecture a meeting point for top down/bottom up
processing [92]. They propose that where top down inputs to levels I, II, and III of
the visual cortex meet bottom up visual signal information going to levels IV and V
in the visual cortex, this is the site of the matching between top down subfeature
representation with incoming bottom up sensory signal flows. After matching, they
then propose that the scanpath continues to the next ROI, and in this way, the top
down model moves, fixates and foveates the eye, to bring forward successive
subfeatures for checking.
A final point of interest concerning this research group is their study of scanpath
eye movements with dynamic scenes. Smooth pursuit eye movements play a large
role in scanpaths while subjects are observing dynamic scenes and have an
interesting characteristic: they maintain the fovea over the moving object as long as
this is possible and as long as the moving object is one the top down spatial cognitive
model continues to address.
Other researchers [76] recording the eye positions of human subjects viewing natural
scenes found that subjects looked at image regions that had high spatial contrast, and
in these regions, the intensities of nearby image points (pixels) were less correlated
with each other than in images selected at random.
As Region-of-Interest concepts feature highly in the above discussion of cognitive
perception, the topic is expanded in the next section, where several research activities
in this area are described.
2.4 Region-of-Interest
This section provides commentary on an area of work incorporating both low and
higher levels of vision. Deficiencies of modelling vision using only early vision
phenomena have been identified previously [67] in that model parameters need to be
chosen to reflect human response to complex natural scenes – not simple artificial
stimuli, and that higher level and cognitive factors need to be employed.
A common goal of models described in this section is that they identify Regions-of-
Interest (ROIs) within an image in an attempt to predict where the human eye will
fixate in the image. When compared against subjective tests using eye-tracking
machines or similar attention-recording devices, these region-of-interest algorithms
provide a high degree of correlation with human observer behaviour. These ROI
techniques have found application in advertising [40], military surveillance [64] and
visually lossless compression (where uninteresting areas of the image are
compressed more than others so compression artefacts are placed in these areas only)
[69,75].
There are several factors influencing attention: motion, contrast, size, eccentricity
and location, shape, foreground/background, edges and texture, prior instructions and
context of viewing, people, gestalt properties (closure, orientation, proximity,
similarity, symmetry), clutter and complexity, unusual or unpredictable stimuli (eg.
high information content), and interactions between basic features [67].
Schill et al [85] have recorded eye movements when subjects viewed natural scenes.
They analysed spatial statistics of fixated regions with higher order statistics
(bispectra), and found a clear bias for subjects to fixate on regions with frequency
components of multiple orientations (eg. regions with curved edges or occlusion
patterns). Using this information as candidate features of informative regions, the
authors have developed a system attempting to automatically select informative
regions in saccadic scene analysis. The system integrates a simplified bottom-up
mechanism to a task-oriented top-down mechanism. The cognitive (top-down) stage
infers knowledge based on the Dempster-Shafer theory for uncertain reasoning. The
bottom up/early visual system computation is a neural network processing stage,
where features are extracted by linear orientation selective filters. They conject that
top down and bottom up processing relies on a common principle: “information
gain”. The systems they have developed computes the most informative region
which should be selected by the next eye movement. In visual prostheses
applications, a human user would shift fixation point, and thus there is not the need
to model the top-down voluntarily controlled attention shifts.
Osberger [67] has defined a quality metric using the notion of importance maps.
This metric has improved performance over traditional quality metrics which assume
the whole of a scene is viewed foveally (representing image fidelity). An
importance-map weighted metric is more representative - traditional metrics
overemphasise visual distortion in textured areas and don’t account for strong
masking in these areas. The metric has been extended to a perceptually based
compression [69] and a new model for automatic detection of regions of interest in
complex video sequences [70]. Features of the new model include motion (the
model can distinguish camera motion eg. pan, tilt, zoom, from true object motion,
and has adaptive motion thresholds for different video scenes), colour, intensity
contrast, size, shape, location, background, and skin tone (a narrow range of Hue
Saturation Value). Feature maps representing individual factors were correlated with
eye movements from 24 viewers to quantify weights for each factor. 75% of
viewer’s fixations occurred in the 30% of the scene total area that was estimated by
the model as being the most important.
A technique that avoids the segmentation of the above model is the context-free
region-of-interest algorithm presented by Nguyen et al [64]. The technique is aimed
to be useful for images with interpretable content that varies with resolution and field
of view. There are three stages to the algorithm:
1. Quadtree feature maps are generated for 4 visual factors (contrast, relative
brightness, variance, edge density). Each level of quadtree decomposition
narrows the field of view over which the feature is examined. If the feature
persists in a region, the region gets further divided.
2. Each region is assigned an importance value [0 1] based on region detail (higher
importance if region keeps splitting into narrower fields of view)
3. An overall normalised [0 1] importance map is generated from the weighted
combination of importance-weighted quadtree feature maps.
From the overall importance map, an integer number of bits can be assigned to each
pixel in proportion to the importance map values, rather than applying a uniform
number of bits per pixel. Context-dependent criteria could be integrated to improve
the technique.
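A minimal sketch of this splitting idea follows. It is not the authors' implementation: it splits on a single feature (intensity variance, against an arbitrary threshold) where the paper generates quadtree maps for four features, and it scores importance in [0 1] simply by splitting depth:

import numpy as np

def quadtree_importance(img, max_depth=4, var_thresh=100.0):
    # Recursively split a region while the feature (here, intensity
    # variance) persists at narrower fields of view; regions that keep
    # splitting receive higher importance in [0 1].
    imp = np.zeros(img.shape, dtype=float)

    def split(y0, y1, x0, x1, depth):
        region = img[y0:y1, x0:x1]
        if (depth == max_depth or min(y1 - y0, x1 - x0) < 2
                or region.var() < var_thresh):
            imp[y0:y1, x0:x1] = depth / max_depth
            return
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        for ys, ye, xs, xe in ((y0, ym, x0, xm), (y0, ym, xm, x1),
                               (ym, y1, x0, xm), (ym, y1, xm, x1)):
            split(ys, ye, xs, xe, depth + 1)

    split(0, img.shape[0], 0, img.shape[1], 0)
    return imp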
If colour is taken into account as well as intensity, more information can be obtained
about the image contents. Similar to the above models, colour importance has been
defined at each pixel location in an image, and used to weight the results of image
analysis tasks [54]. It is difficult to extract colour information due to poorly defined
data dependencies between colour bands. Spectral differences become important in
regions where the difference in luminance is negligible. Shadows and highlights can
cause sharp changes in luminance producing undesirably strong edges. Colour
correction (eg. joining panoramic photos) and colour quantization are also situations
where colour has a significant influence. An importance measure (normalised to
[0, 1]) was constructed by combining global and local factors on a case-by-case
basis. Global factors included the probability of a colour in the image, the probability
of a colour group in the image, the probability of a colour within its colour group,
and variability (if low, discrimination is limited and may need to be enhanced). Local
factors were similar to the global ones but acting in a regular m x n neighbourhood or
an irregular segmented region. It is not possible to modulate colour in the visual
prosthesis application, so information about colour as described here is not highly
assistive to this thesis.
A progressive technique for human face archiving and retrieval uses a similar
importance measure [7]. The compression technique described has 1.5 times the
compression rate of the JPEG standard and is more visually acceptable, as
compressed images do not suffer from blockiness and the visually important
information (edges) is reconstructed first.
Another region of interest algorithm has been proposed to detect the main subject in
photographs [50]. Advantages of the approach are that the model includes semantic
(human skin and face, sky, cloud, grass, tree) as well as geometric features (no
motion or depth features as the application is only for photographic images). An
image is segmented into regions and the following features ‘believed to influence
visual attention’ computed for each region:
• Low-level: colour, brightness and texture (self saliency – by itself, and relative
saliency – in competition for each)
• Geometric: centrality, borderness, surroundedness, size, shape, symmetry
• Semantic: skin, face, sky, and grass
All features are plotted on an “effectiveness-complexity” graph in relation to main
subject detection. For example, a face is a strong indicator of the main subject but is
less effective than the location features (centrality, borderness) because of the low
likelihood of face regions among all regions.
Itti and Koch have proposed many models for visual attention: multi-scale feature
maps to detect local discontinuities in intensity, colour, orientation & optical flow,
and biologically plausible models such as a centre-surround mechanism modelled on
visual receptive fields (cortex transform) [40]. Receptive field properties can be
modelled using difference-of-Gaussian filters (non-oriented features) or Gabor filters
(for oriented features). Feature maps (normalised to [0 1]) are produced for intensity,
colour, and orientation (0, 45, 90, 135 deg), with 6 feature maps for each at different
spatial resolutions. The feature maps are then combined into a master or saliency map
using one of four methods. In the final saliency map, the most salient location is
suppressed or inhibited, so the system can focus on the next most salient location. A
circular focus of attention (rather than the actual object) is identified (the radius was
80 or 64 pixels depending on the image set). The average number of false detections
(mean ± standard deviation) was reported for each method used on a database of
traffic signs (ie. still images). The authors state that template matching algorithms are
much simpler, but their method is independent of the nature of the targets (ie. context
free).
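For illustration only, a bare-bones version of the intensity channel of such a model might be written as below; the scale choices are arbitrary, and the published model's map normalisation and colour/orientation channels are omitted:

import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_saliency(img, centre_sigmas=(1, 2), surround_deltas=(3, 4)):
    # Centre-surround differences approximated as differences of
    # Gaussian-blurred copies at several scales, normalised and summed.
    img = img.astype(float)
    saliency = np.zeros_like(img)
    for c in centre_sigmas:
        centre = gaussian_filter(img, c)
        for d in surround_deltas:
            fmap = np.abs(centre - gaussian_filter(img, c + d))
            if fmap.max() > 0:
                fmap /= fmap.max()
            saliency += fmap
    return saliency / max(saliency.max(), 1e-12)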
The above saliency model has been extended to include object recognition [58]. The
new model combines a fast visual attention front-end which rapidly selects the few
most conspicuous image locations and a slower object recognition back-end which
identifies objects at the selected locations. The object recognition back-end is trained
on target features which are simple stimuli only (eg. circle vs rectangle), with the
hope of extending this in future to natural colour images (eg. pedestrians). The
relevance of such a system to visual prostheses might be to give a speech
interpretation of the scene when walking down the street, eg. tree, post, sign etc.
2.5 Visual Information
This section provides background to visual information contained in images. Cooper
et al in their paper Causal scene understanding [16], asked some intriguing questions
pertinent to visual prostheses:
“What is visual understanding? What does it mean to look at a scene and
understand what it is about?”
Since understanding is, in large part, the preparation we make for acting, the question
can be reformulated in this way: “What knowledge about a scene would a visually
impaired person need in order to take intelligent action in that scene?” A comment is
made that every picture tells a story and visual understanding consists of figuring out
what that story is. In Cooper et al’s paper and others like it [78], scene
understanding is described from the point of view of a robot – to predict what is
going to happen. Many intelligent agent systems have been developed, for example
robots that can pick up mugs with handles (vertical lifting force plus rotational torque
to counteract mug rotation) and vision based robot corridor-cleaners. In the visual
prosthesis application, there is a functioning human brain to interpret visual signals
and understand the significance of elements in a scene and the relationships between
those elements, unlike robot/knowledge-based computer vision applications, where
an agent does the interpretation. Therefore, visual prosthesis systems should perhaps
mainly focus on bottom-up processing, trying to get the most out of a scene, while
using knowledge of top-down cognitive interpretation, eg. magno/parvo channels etc.
Experiments are described in Chapter 6 of this thesis that quantify the amount of
visual information inherent in an image. Previous research in this area includes that
concerning visual complexity. Riglis [77] has reported on the following measures
for estimating visual complexity:
• The number of words used when describing a picture;
• Estimators from pattern recognition systems; straight lines, smooth curves = low
complexity, right angles = medium complexity, intersections of lines at acute
angles = high complexity;
• Geometrical characteristics derived from Gestalt psychologists: symmetry of
stimuli, & symmetry of curves present in image, similarity of objects in image,
saliency of curves present in image, smoothness of curves present in image;
• Other geometrical characteristics: area of figure, value of angles, number of
revealed elements, diversity of angles and sides, symmetry;
• In a study involving the perceived beauty of forests, high image complexity was
found to be related to a high number of colours, a high number of edges, high fractal
dimension, high standard deviation, high entropy, and larger file sizes (image
encoding including Huffman encoding & run-length encoding);
• Klinger-Salingaros complexity: temperature (internal contrast), harmony
(symmetry), life (product of temp and harmony), complexity = temp * (constant
– harmony)
Except for the perceived beauty of forests study, none of the above measures was
tested with real images. Riglis undertook experiments in which subjects ranked
images from low to high complexity, and determined relationships with fractal
dimension, fractal image format compression, GIF compression, JPEG compression,
TIFF compression, pixel mean, pixel median, pixel standard deviation, and his
understanding/implementation of K-S harmony, life and temperature. He found a
positive correlation in one out of three experiments for fractal compression, K-S life
and pixel standard deviation (the other two experiments showed no correlation; one
was conducted on-line with no control images for comparison).
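Two of the simpler measures above, grey-level entropy and compressed file size, are easily computed. The sketch below uses zlib as a stand-in for the Huffman and run-length coders mentioned in the text:

import zlib
import numpy as np

def complexity_proxies(img):
    # Grey-level entropy in bits/pixel, and compressed size in bytes;
    # both tend to grow with the visual complexity of the image.
    hist, _ = np.histogram(img, bins=256, range=(0, 256), density=True)
    p = hist[hist > 0]
    entropy = float(-(p * np.log2(p)).sum())
    compressed = len(zlib.compress(img.astype(np.uint8).tobytes()))
    return entropy, compressed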
Meletiou has estimated the complexity of scenes using Osberger/Maeder importance
maps [57]. After segmentation and importance classification (based on contrast,
size, shape, location and background), the extracted regions were grouped into 10
categories according to importance level. Complexity was then computed as the
importance-weighted region count: complexity = sum over i = 1 to 10 of (i x number
of regions in category i); ie. the more regions of high importance an image contains,
the more complex it is taken to be. Observers were asked to describe a presented
image, and then rank (from 1-5) the difficulty of verbally describing it. Reaction
time, number of words and ranking were compared with the complexity metric
(reaction time and number of words are probably better suited to exploring the
threshold of sensitivity of verbalisation: some subjects could talk for hours on a
simple image). All measures were statistically significant. Meletiou conjectures that
perhaps a high fractal dimension relates to a simple image, and a low fractal
dimension to a complex image.
Other researchers have used the term ‘visual complexity’ to describe the running
time of algorithms [5]. Visual complexity has been proposed as the sum of the
number of edges in the scene, the screen resolution and the number of visible edge
crossings (wire mesh rendering application).
Stange undertook several subjective experiments using simple geometric models
[91]. Visual complexity was modelled with 4 parameters:
• each individual object’s colour
• each individual object’s size
• each individual object’s shape
• number of individual objects in the image
The only parameter that had a statistically significant correlation was the number of
individual objects in the image.
2.6 Chapter Summary
This chapter has described research in image quality and visual understanding.
Physiological aspects were provided to gain a brief overview of how visual
perception is achieved in the human visual system. The large body of knowledge
concerning image quality was noted to be largely based on addressing imaging
distortions.
A hierarchical vision structure was then described, starting at early vision effects and
incorporating increasing levels of cognitive or higher level factors. Region-of-
Interest techniques were described which combine advantages of early vision and
cognitive vision models.
Finally some previous studies in the area of visual information were described, which
has high relevance to an application where visual information needs to be
maximised.
Chapter 3 Visual Prosthesis Application
3.1 Overview
This chapter contains five further sections. The first introduces the application of
artificial vision. Section 3.3 provides an overview of current visual prosthesis
projects. The image processing aspects of these projects are discussed separately in
Section 3.4. Section 3.5 frames the field of digital image processing for this
application. The poor quality of anticipated images produced by artificial vision
systems is described along with several processing techniques that are compatible
with the anticipated evoked visual sensations of visual prostheses. Finally Section
3.6 presents several image processing requirements for visual prostheses, which
drive research questions to be addressed in this thesis.
3.2 General Introduction to the Application
Biomedical in-vivo applications of computing, especially where computing systems
are superimposed on or integrated with human systems, offer enormous challenges to
researchers to develop new or better solutions which can improve our quality of life.
The development of intelligent, reprogrammable devices for insertion into the body
(such as pacemakers) is an example. Bold new projects have emerged, such as the
MIT “wearable computer” or “thinking cap”, where computer systems interface very
closely with the user’s body [73].
Several international research teams are currently developing artificial human vision
("bionic eye") systems that have the potential to restore some visual faculties to blind
persons. While the approaches by the various teams differ, a common element is that
they all require a system that converts a visual scene into electronic pulses that
stimulate nerve cells in the visual pathway (eg. via implanted electrodes), resulting in
a crude induced “image” being formed in the visual regions of the brain. The utility
of the induced image depends on how much visual information is presented, which in
turn is determined by image quality and image processing considerations.
Little human trialing of visual prostheses has yet been conducted from which to draw
conclusions on image quality. The perceived quality of an image is dependent on the
number of electrodes in the implant, with higher numbers of electrodes giving higher
spatial resolution of images. At present, size and manufacturing constraints place
limitations on the numbers of electrodes in a given implant. An open question from
an image processing point of view is how to optimise the amount of useful visual
information obtainable from the relatively few electrodes in the implants.
The next section gives an overview of research in the area of artificial human vision.
It describes the categories or general areas of research and summarises the
approaches of the various research teams. Details of their respective designs are not
covered in depth, and the interested reader is referred to the project websites or
publications listed in the text. This background on vision research provides a
framework for the image processing methods suggested later, and gives the reader an
appreciation of the challenging nature of the application.
3.3 Current Visual Prosthesis Research
Good reviews of the history and present state of the art in visual prosthesis systems are
presented by Warren and Normann [104], Margalit et al [55], Suaning et al [93] and
Lysaght et al [51]. The basis of all visual prostheses is an image sensing device (video
camera or vision chip and lens) that records the visual world and transmits this
information in real-time to the upper level visual processes (refer Figure 3.1). An image
acquired by the camera is processed or manipulated to be in a form matching the implant
device. The processed image is then sent as electronic pulses to implanted electrodes
within a blind patient.
[Image sensing device → Processing unit → Implanted electrode array]
Figure 3.1: Basis of Visual Prostheses
When undergoing electrical stimulation, patients have reported the perception of spots of
light in their visual field, referred to as phosphenes. Although unlikely to recreate
perfect vision, artificial vision systems may evoke enough phosphene perception to
perform every-day tasks such as navigation, recognition and reading.
Visual prosthesis research can be categorised according to the intended stimulation
site of the implant; moving down the list gives increasing proximity to the brain and
an increasing number of potential beneficiaries:
• Retinal
• Optic nerve
• Visual cortex
The visual cortex holds the potential to assist the largest number of blind persons, as
prostheses designed to stimulate the retina or optic nerve require the rest of the visual
pathway to the brain to be intact. However, the surgical risk to a patient with an
otherwise healthy brain may be higher for visual cortex prostheses.
A brief overview of current artificial vision projects is presented below, along with
project websites and sample reference papers where available.
3.3.1 Retinal Systems
3.3.1.1 University of Southern California (Mark Humayun, Gislin Dagnelie, Eugene de Juan) [37]
Ophthalmologists at the University of Southern California (Doheny Retina Institute) have
implanted permanent retinal prostheses into several patients, as part of an FDA-approved
feasibility trial. A wafer-thin silicon microchip, embedded with photosensor cells and
electrodes, is powered by an external laser beam. Photosensor cells receive and convert
light images from the pupil to electrical impulses. These impulses can then drive action
potentials in the remaining ganglion cells of patients with retinal disease.
http://www.usc.edu/hsc/doheny/ (accessed 21/1/05)
3.3.1.2 MIT-Harvard (Joseph Rizzo, John Wyatt), USA [79]
This is a joint collaboration between the Massachusetts Eye and Ear Infirmary and
the Massachusetts Institute of Technology. Their prosthesis consists of a power
source, a small, fixed-direction laser with 820 nm wavelength, and a data source, a
tiny charge-coupled-device (CCD) camera whose output amplitude modulates the
laser beam. A signal-processing microchip in the data source converts the visual
information to an electronic code that is carried on the laser beam. Both the power
and data source is mounted on a pair of spectacles.
http://www.bostonretinalimplant.org/ (accessed 1/6/04)
3.3.1.3 Tübingen University (Eberhart Zrenner), Germany [87] and Bonn University (Rolf Eckmiller), Germany [26]
These two research groups are funded by the German Federal Ministry of Education
and Science. In the SUB-RET (Tübingen) approach researchers are working on a
device consisting of microphotodiodes which are to be placed underneath the retina
to stimulate postsynaptic retinal cells directly by converting light to electric energy.
In the EPI-RET (Bonn) approach scientists develop a microcontact array which is
mounted onto the retinal surface to stimulate retinal ganglion cells.
http://www.uak.medizin.uni-tuebingen.de/depii/groups/subret/
http://www.nero.uni-bonn.de/ri/retina-en.html (accessed 1/6/04)
3.3.1.4 Nagoya University (Tohru Yagi), Japan [109]
The research at Nagoya University could be termed biohybrid, in that there is a
combination of biological and man-made elements in the construction of the implant.
The research aims are to develop devices in which cultured neural cells and a
photoelectric device are combined. This technique is similar to other retinal implant
techniques in that electrical components are being placed directly into contact with
the retina. However the use of nerve cells as a part of the implant make for a
potentially more reliable system.
http://www.nidek.com/artificial_vision.html (accessed 1/6/04)
3.3.1.5 Optobionics (Alan & Vincent Chow), USA [72]
This device is powered solely by incident light and does not require the use of external
wires or batteries. An artificial silicon retina is implanted under the retina (subretina
space) and is designed to mimic the photoreceptor layer. The research effort is mentioned
here for completeness, although there is no image sensing device (shown in Figure 2.1)
and hence no opportunities for perception enhancement by image processing.
http://www.optobionics.com (accessed 1/6/04)
3.3.2 Optic Nerve Systems
3.3.2.1 Université Catholique de Louvain (Claude Veraart), Belgium [100]
The techniques are based on optic nerve stimulation using a self-sizing spiral cuff
electrode. In preliminary testing to date, with the help of blind human volunteers,
researchers have been able to produce phosphenes throughout the visual field.
Stimulation at this location would be suited to patients who have non-functioning rods
and cones in the retina but a healthy optic nerve.
http://www.md.ucl.ac.be/gren/Projets/vision.html (accessed 1/6/04)
3.3.2.2 University of New South Wales/University of Newcastle (Nigel Lovell, Gregg Suaning), Australia [94]
The visual prosthesis system consists of a camera, StrongARM microprocessor
system and an implantable electrode array connected by a radio frequency link. This
prevents the need to pass wires through the skin. The current design consists of a 10
x 10 array of electrodes, giving the potential for 100 stimulation sites. Recent work
has tended to redirect this project from optic nerve electrode cuffs towards retinal-
stimulation.
http://rambler.newcastle.edu.au/~greggs/ (accessed 1/6/04)
3.3.3 Visual Cortex Systems
3.3.3.1 Dobelle Institute (William Dobelle), Portugal [20]
The research team has successfully implanted a 64-electrode array on the visual cortex of
a patient using wires passing through the skin. The patient is claimed to be able to read
two inch tall letters at a distance of five feet, representing a visual acuity of about 20/400.
Although the electrode array produces tunnel vision, the patient is also claimed to be able
to navigate in unfamiliar environments.
http://www.dobelle.com/index.html (accessed 1/6/04)
3.3.3.2 National Institutes of Health - Washington D.C. (Edward Schmidt), USA [86]
NIH researchers have implanted microelectrode arrays into the visual cortex and
recorded stimulation parameters and characteristics of artificially created
phosphenes. The Neural Prosthesis Program within the Division of Stroke, Trauma
and Neurodegenerative Disorders addresses many types of neural stimulation, not
just related to nerves in the visual system.
http://www.ninds.nih.gov/npp/ (accessed 1/6/04)
3.3.3.3 University of Utah (Richard Normann) USA [66]
This research is based at the John Moran Laboratories in Applied Vision and Neural
Sciences at the University of Utah. The design of the cortical prosthesis employs
penetrating microelectrodes rather than surface electrodes. The developers of
penetrating cortical electrode arrays claim that the closer spacing of electrodes
compared to surface cortical arrays result in increased spatial resolution and lower
currents to induce visible perception, and are therefore less likely to induce seizures
from overstimulation.
http://www.bioen.utah.edu/cni/projects/blindness.htm (accessed 1/6/04)
3.3.3.4 University of New South Wales (John Morley, Minas Coroneo) Australia
An animal model has been developed where one side of the brain is electrically
stimulated and responses measured in the other side of the brain. Funding sources for the
research include the National Health and Medical Research Council and the Brain
Foundation.
http://medicalsciences.med.unsw.edu.au/medsciences.nsf/website/researchactivities.labor
atories.vision_cognition (accessed 1/6/04)
3.4 Image Processing specifically related to Bionic Eye Projects
In this section the hardware and some image processing considerations are described
for research specifically relating to artificial vision. Research is described in 4 areas:
1. vision chip developments;
2. CCD-based systems;
3. receptive field modelling;
4. multiple resolution work.
3.4.1 Vision Chip Developments
Researchers at the University of Newcastle and University of New South Wales in
Australia (refer Section 3.3.2.2) use an OmniVision CMOS image chip to acquire
visual scenes for their portable prosthesis prototype [95]. They have proposed a
regular hexagonal mosaic of electrodes in the implantable array rather than a
rectangular layout, which allows better separation between electrodes [31]. The
expectation is that this will increase visual acuity and minimize aliasing in the
evoked artificial image. The researchers also conjecture that, from an information
theory standpoint, modulating the size and modulating the intensity of a phosphene
are psychophysically equivalent.
Japanese researchers at Nagoya University (refer Section 3.3.1.4) have developed a
vision chip/artificial retina comprising parallel arrays of simple analogue circuits
together with parallel array sensors [110]. The authors review previous
developments in vision chips, and mention that these chips have not experienced
wide application as the outputs of these chips are not sufficiently accurate under
natural illumination due to low sensitivity of photosensors. They have overcome this
problem with a light-adaptive one-dimensional 100 pixel line sensor. Spatial
filtering properties of the vision chip have been tested by mounting a camera lens to
focus an image on the photosensor array. The spatial distribution of the output
voltages of the chip showed a Laplacian-of-Gaussian-like receptive field.
The team above have recognised the importance of depth information in visual
information processing and have consequently incorporated depth perception in the
vision chip [111]. Again, the chip has 100 analogue sensors connected laterally by
resistors, giving a one-dimensional (line) 100 pixel sensor, which allows parallel
processing in real time. The output of the circuit is a serial signal representing depth.
Depth is computed from the disparity between two vision chips (fitted with lenses)
which are 120mm apart and turned inward at 6 degrees. Zero crossings (edges) are
detected from the left and right vision chips and used in determining the disparity.
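For intuition, under the simplifying assumption of parallel (non-verged) cameras, depth follows from triangulation; the 6 degree vergence of the actual chips complicates the exact geometry, and the focal length below is assumed, so this is illustrative only:

def depth_from_disparity(disparity_px, baseline_mm=120.0, focal_px=500.0):
    # Parallel-camera triangulation: depth = baseline * focal / disparity.
    # baseline_mm matches the 120 mm chip separation; focal_px is an
    # assumed lens focal length expressed in pixels.
    return baseline_mm * focal_px / disparity_px

print(depth_from_disparity(10.0))   # a 10 px disparity maps to 6000 mm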
Further work on vision chips is presented by Kyuma et al [46]. An impediment to
real time processing has been the separation of image sensing (camera) and image
processing (computer) functions. Consequently, system performance is limited by
slow camera frame rate and low transmission rate between the camera and computer.
‘Artificial retina’ chips developed by the authors are described, that can
simultaneously sense and process images ie. more akin to the parallel real time
processing of the human visual system. These artificial retinas consist of a two
dimensional variable sensitivity photodetection cell array, with sensitivity similar to
commercially available CCDs. The chips are 12mmx12mm (256x256 resolution) or
6.5mmx6.5mm (32x32 resolution), have a dynamic range of 40dB (input light
power) and variable frame rate (1msec – 1000msec). A variety of on-chip image
processing can be achieved by changing a control voltage pattern on the chip. These
processing functions include image sensing, edge extraction, image smoothing,
random access (only a partial image projected onto the chip), pattern matching, and
image compression/recognition. The application of these vision chips to prostheses
is not stated by the authors beyond ‘man-machine interfaces for multimedia
systems’, and instead general industrial applications are cited eg. automotive,
avionics. The authors have used the vision chips to control computer game
characters by hand gesture recognition.
3.4.2 CCD-based Systems
The processing hardware for a retinal prosthesis project at the University of Southern
California (refer Section 3.3.1.1) is an FPGA/EPLD (Field Programmable Gate
Array / Electrically Programmable Logic Device) [18]. The device allows easier
implementation of highly parallel algorithms/hardware needed for concurrent
processing than a single processor. Three SRAM memories serve as frame buffers to
support the storage of images delivered from the camera. Two SRAMs support
dual-buffered video, where a current image in transit from the camera can be stored,
while a prior image can be simultaneously processed from a second memory. Once
the camera has completed delivery of a transit image, the roles of the memories are
reversed, so that the new image is processed, while a fresh image is stored in the
alternate RAM. A third frame buffer is available for intermediate computations that
may occur in algorithms such as spatial convolution. An 8-bit pipeline A/D
converter supports cameras which provide only analogue video. The whole board
can be worn in a shirt pocket or clipped to a belt.
The Humayun team make some interesting comments regarding image processing
required for prosthetic devices to:
• match the crude resolution of the implant array
• accommodate the limited dynamic range of the array
• simplify image aspects such as brightness and colour gradients which cannot
be faithfully represented by the array [17].
They conjecture that the sacrifice in resolution (spatially and in contrast) would be
acceptable in view of a wider operating range (field size and light/dark adaptation)
that would be achieved. Further they comment that the wearer of a prosthesis would
achieve a significant degree of learning, compensating in higher visual processing for
the detail lost at the input. A comparison is made between the processing required
for a retinal versus a cortical prosthesis. At the level of the visual cortex, the neural
information stream has already undergone several transformations (dynamic range
compression, edge and colour recoding, and translation of analogue information into
spike trains) and thus the processor for a cortical prosthesis may have to be more
powerful and more trainable than that for a retinal prosthesis.
Researchers involved with optic nerve stimulation (refer Section 3.3.2.1) describe a
resolution reduction algorithm based on image segmentation by growth of zones
(requiring less computational power than other segmentation algorithms) and its
implementation in a low-power VLSI device [28]. The authors propose an algorithm based on the
extraction of the main features of the image, with transmitted information being only
the position and form of the relevant entities in a scene. However, their current
implementation appears only to be based on intensity. They propose to give a blind
person the ability to control the segmentation level (adjustable threshold values),
producing areas of uniform illuminance matching corresponding objects or surfaces.
Due to the nature of their segmentation algorithm, they report undesirable fast
transitions (eg. merging of 2 zones) when segmenting with successive images.
Harvey and Sawan [32] describe their efforts in two areas: a cortical implant (silicon
die mounted on the back of an electrode array) and an external system (scene
acquisition, processing, RF communication). The completed prototype allows the
testing of various stimulation algorithms and strategies. CCD array (336x244 pixels)
output is sent via an analogue to digital converter to a commercially available
processor board. The extent of image processing appears to be resolution reduction
to 25x25 pixels and colour histogram equalisation.
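Histogram equalisation itself is a standard operation; a greyscale sketch is given below (the cited system equalises a colour histogram, so this is a simplification):

import numpy as np

def equalise(img):
    # Global histogram equalisation of an 8-bit greyscale image:
    # map grey levels through the normalised cumulative histogram.
    hist, _ = np.histogram(img.flatten(), bins=256, range=(0, 256))
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min()) * 255.0
    return cdf[img.astype(np.uint8)].astype(np.uint8)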
Werblin and Jacobs [107] propose a cellular nonlinear network as a retinal camera,
using photodetectors/conventional CCD as input. Various image processing
operations can be performed across the CNN array by changing values of a set of
amplifiers. The authors have used the CNN system to predict patterns of activity at
retinal output. Beneficial features for a retinal camera incorporating a CNN array are
proposed:
• Battery powered chip array
• Onboard stored program of image processing algorithms that can be invoked
remotely or on the basis of the characteristics of the visual scene (twilight or
bright sun, for reading or navigation, high or low resolution)
• Variety of output available: edge detection, motion detection, contrast
enhancement
• Background normalised, giving high contrast near the ambient background
level
3.4.3 Receptive Field Modeling
German researchers developing retinal prostheses (refer Section 3.3.1.3) have
proposed a system that approximates receptive field properties of primate retinal
ganglion cells [26]. While still preliminary in nature, the research is based on a set
of individually tuneable spatiotemporal receptive field filters, acting on input from a
photosensor array. Each receptive field filter is individually tuneable to a wide range
of physiologically plausible spatial and temporal frequencies. Details of the
receptive field function proposed are contained in [6]. Input data is fed into 2
distinct filter pathways, one for the centre computation and one for the surround.
Each pathway performs a spatial scalar product of the pixel data, and a two
dimensional Gaussian, whose width determines the spatial extent of the receptive
field. The resulting signals are then processed by a temporal low pass filter. The
surround pathway signal can be optionally delayed, and then signals from both
pathways converge at a mixer component. Finally a gain factor enables range
adaptation and switching between on-off and off-on (centre-surround) behaviour.
The resulting signal is then used to stimulate nerve cells. The authors also describe a
concept for training the system using visual perception feedback from human
subjects: the subject suggests functional changes to the system via a neural net
module, based on the difference between the actually perceived visual pattern and the
expected perception. This feedback is anticipated as an essential step in the future
for tuning a prosthesis to the needs of an individual.
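The spatial part of such a centre-surround filter can be sketched as a difference of Gaussians, as below; the temporal low-pass and delay stages described above are omitted, and the sigma values are illustrative:

import numpy as np
from scipy.ndimage import gaussian_filter

def centre_surround(img, sigma_centre=1.0, sigma_surround=3.0,
                    gain=1.0, on_centre=True):
    # Centre and surround pathways as Gaussian-weighted sums of the
    # pixel data, combined at a 'mixer' with a gain factor; switching
    # on_centre flips between on-off and off-on behaviour.
    centre = gaussian_filter(img.astype(float), sigma_centre)
    surround = gaussian_filter(img.astype(float), sigma_surround)
    response = centre - surround if on_centre else surround - centre
    return gain * response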
The hardware for the receptive field processing above is described in [88]. Image
acquisition consists of a CMOS image sensor chip with a high dynamic range with
respect to illumination intensity (140dB). This full dynamic range can be used
within a single image frame without any distortions like blooming, smearing or time
lag. Two designs have been developed – a 128x128 pixel sensor arranged on a
hexagonal grid structure and a rectangular 400x300 pixel sensor. Signal processing
is carried out on-chip, so an additional frame buffer is not required unlike
conventional CCD devices. The spatial filter used for on- and off-centre receptive
field functions is inserted between the sensor chip and the signal processor. The
developers propose to house the sensor chip in a package with integrated focusing
optics, mounted on a spectacle frame, along with the telemetry unit required for
wireless transmission of stimulus data and power for electrode stimulation.
Similar hardware design has very recently been completed in Switzerland [112]. The
Swiss researchers have manufactured a thinned CMOS chip which is intended to be
placed in the sub-retinal space and remotely powered by an external coil. The output
from the system mimics the ganglion response to light: bipolar voltage pulses with
light-modulated frequency. The chip has not yet been tested physiologically.
The significance and importance of visual receptive fields in visual processing is
supported by Hungenahally [38,39], who has attempted to emulate visual receptive
fields and their implementation for image processing in an artificial retina. He has
proposed a family of differentio-aggregation functions for information extraction
from two dimensional spatial images. He demonstrates the use of these mathematical
functions in removing sensory noise from medical images and in extracting
dimensionally selective information.
3.4.4 Multiple Resolution Work
Amerijckx et al [2] describe a remapping algorithm using two CCD cameras and its
implementation on a VLSI chip. One tele-lens camera produces high resolution in the
central area of the image, while a second wide-angle camera captures peripheral
image areas. This system processes these two images in real time to obtain a
resulting image with high resolution at the centre, similar to the central part of the
retina.
Belgian researchers (refer Section 3.3.2.1) have extended their work on prosthetics to
sensory substitution devices converting vision to sound [11]. The image processing
involved in their models is based on their identification of the main features of the
primary visual system: lateral inhibition and graded resolution. Lateral inhibition is
implemented by an edge detection filter and graded resolution is modelled using a
multi-resolution artificial retina based on the filtered image. An example of this
graded resolution is given – in a grid of 8x8 large pixels, the 16 central pixels are
each divided into four pixels to build a medium resolution grid of 8x8 pixels. In this
second grid, the 16 central pixels are again divided into four pixels to build a high
resolution grid of 8x8 pixels etc.
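The nested-grid construction can be sketched as follows, assuming a square image whose side is divisible by the grid sizes involved; this illustrates the idea rather than reproducing the authors' implementation:

import numpy as np

def block_average(img, n):
    # Average an image onto an n x n grid, expanded back to full size.
    h, w = img.shape
    small = img.reshape(n, h // n, n, w // n).mean(axis=(1, 3))
    return np.kron(small, np.ones((h // n, w // n)))

def graded_resolution(img, base=8, levels=3):
    # Render the whole image as a base x base grid, then re-render the
    # central half-sized region at double the pixel density, and so on.
    out = block_average(img.astype(float), base)
    h, w = img.shape
    y0, x0 = 0, 0
    for _ in range(1, levels):
        y0, x0, h, w = y0 + h // 4, x0 + w // 4, h // 2, w // 2
        out[y0:y0 + h, x0:x0 + w] = block_average(
            img[y0:y0 + h, x0:x0 + w].astype(float), base)
    return out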
This foveal vision representation has also been implemented in a head mounted
display unit coupled with an eye tracking system [42]. The authors claim that
conventional HMDs suffer from a narrow field of view and low resolution and
consequently cannot be used for applications such as tele-microsurgery. Their HMD
displays high resolution at a subject’s view point (obtained by an eye tracker) and
low resolution at the periphery, thereby displaying images at a higher perceived
resolution over a wider viewing angle.
The multiple resolution approaches described above may have application where
bandwidth is limited. However, it is most likely that the pixel density in a device
would be fixed at the maximum possible under manufacturing and size constraints.
Improved scene understanding is expected when the entire electrode layout is used
rather than applying low resolution image sections to some parts of the implant.
3.5 Digital Imaging Applicable to Visual Prostheses

This section discusses digital imaging applied to visual prostheses and reviews useful
image processing methods that could enhance visual information presented to
visually impaired users.
3.5.1 Digital Imaging and Human Vision
There are many parallels between digital imaging environments and the human
visual system. Visual sensations are preprocessed from over 100 million rods and
cones (the photoreceptors in the retina), to approximately 1.5 million optic nerve
fibres, with conduction time from sub-retina to the lateral geniculate nucleus in the
order of 1-5ms [102]. The capability of the human visual system for resolving fine
detail and edges and ignoring uniform regions has been shown to be biologically
hard-wired into our retinas. Connected directly with the rods and cones of the retina
are two layers of processing neurons that perform an operation very similar to the
Laplacian operator that highlights the points, lines and edges in an image and
suppresses uniform and smoothly varying regions [83]. Furthermore, the processes
that occur in the visual cortex when a person examines a visual scene make use of
feature extraction and object recognition processes, mimicked by computer vision
techniques [47]. This complexity suggests that in artificial vision systems, image
processing and manipulation would have a more significant role than simply a
camera and display package.
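As an illustration of the Laplacian analogy above (a toy sketch in Python/NumPy with
SciPy, not a model of retinal circuitry), convolving an image with a 3x3 Laplacian
kernel responds strongly at points, lines and edges and is near zero over uniform
regions:

import numpy as np
from scipy.ndimage import convolve

laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0                 # a uniform bright square

response = convolve(img, laplacian, mode='nearest')
# response is ~0 inside the square and on the background (uniform areas)
# and non-zero only along the square's border, i.e. at the edges.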
The image processing aspects of the artificial vision systems under development are
largely based on manipulating a pixelised representation, where a scene is
represented as an organised phosphene array. For example, the mandrill image
represented in various pixelised versions of different spatial resolutions is shown
below in Figure 3.2.
Figure 3.2: Pixelised vision at 64x64, 32x32, 16x16 and 8x8 pixels; top – greyscale, bottom – binary images
Each picture element, or pixel, would ideally correspond to a stimulating electrode in
the implant. The top row shows greyscale images which are unlikely in prosthesis
prototypes. More likely are binary images (bottom row), where a pixel is either ON
or OFF. The viewing distance also affects image interpretation - the coarser
resolution versions above (16 x 16, 8 x 8) are more comprehensible from greater
viewing distances.
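A minimal sketch of producing such pixelised versions, assuming a NumPy greyscale
image and a simple global mean threshold for the binary case (the thesis does not
prescribe a particular implementation):

import numpy as np

def pixelise(img, n, binary=False):
    """Block-average an image down to an n x n phosphene grid."""
    h, w = img.shape
    bh, bw = h // n, w // n
    grid = img[:bh * n, :bw * n].reshape(n, bh, n, bw).mean(axis=(1, 3))
    if binary:
        grid = (grid >= grid.mean()).astype(float)  # each pixel ON or OFF
    return grid

img = np.random.rand(256, 256)              # stand-in for the mandrill image
grey_16 = pixelise(img, 16)                 # 16 x 16 greyscale version
binary_16 = pixelise(img, 16, binary=True)  # 16 x 16 binary version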
The pixels shown in Figure 3.2 are drawn as squares touching adjacent pixels
along each border. Patients undergoing electrical stimulation report visual sensations
as a ‘spot of light’. Therefore it may be more representative to model pixels as
circular with a gap between adjacent pixels as shown in Figure 3.3 below.
Figure 3.3: Circular pixelised vision – 25x25 greyscale (square pixels), 25x25 circular greyscale, 25x25 circular binary
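A minimal sketch of rendering a grid this way, assuming square cells with each value
drawn as a disc of matching intensity (the cell size and disc radius are illustrative
values):

import numpy as np

def render_phosphenes(grid, cell=10, radius=4):
    """Draw each grid value as a circular phosphene with a gap to neighbours."""
    n, m = grid.shape
    canvas = np.zeros((n * cell, m * cell))
    yy, xx = np.mgrid[0:cell, 0:cell] - (cell - 1) / 2.0
    disc = (xx ** 2 + yy ** 2) <= radius ** 2   # circular mask for one cell
    for i in range(n):
        for j in range(m):
            canvas[i * cell:(i + 1) * cell,
                   j * cell:(j + 1) * cell][disc] = grid[i, j]
    return canvas

phosphene_image = render_phosphenes(np.random.rand(25, 25))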
Given that a limited number of stimulating electrodes is physically possible, it is
evident that some type of information content enhancing processing is required, as
described later.
An immediate problem is selecting a useful number and pattern of phosphenes for a
prosthesis. Until clinical testing progresses with human subjects, the degree of
success of an implant in creating phosphenes in the visual field of the subject is
unknown. For example, if a patient is fitted with a 100-electrode implant, will they
be able to see more or less than 100 phosphenes in their visual field? To date, there
are no physiological stimulation models in the literature that predict the number of
phosphenes that will be produced from a given number of electrodes, and reported outcomes vary:
the Dobelle research team [20] report that each implant electrode produces one or
more phosphenes in the visual field, while Schmidt et al [86] report that 34
phosphenes were produced from 38 electrodes.
The issue is compounded by the different physical stimulation strategies followed.
In addition to varying the stimulation parameters, such as current amplitude or pulse
width duration, there is the potential to make use of different current flow and return
paths. For example, a single stimulating electrode and single return electrode may
give rise to a small high intensity dot in the visual field. Using the same single
stimulating electrode but with two or several return electrodes may give a different
charge density profile on the tissue which may give rise to a broader low intensity
patch of light (see Figure 3.4).
Figure 3.4: Alternate stimulation strategies – (a) single electrode pair; (b) single stimulating electrode and multiple return electrodes
Frequency encoding, where an image is transformed to the frequency domain, might be
another possibility. Here, image information is represented as signals having
various amplitude, frequency and phase characteristics, which could be delivered to
different locations within an electrode array. This has similarities with auditory
implants, where an audio signal is split into frequency bands and delivered to
different locations within the cochlea for improved perception [4,65]. While
important to the final design of prostheses, the issues of prediction of phosphene
numbers and patterns described above require physiological testing and modelling
and thus would be impacted by the other physiological properties of implants. These
aspects, such as biocompatibility and encapsulation of the internal electronics, and
evaluation of acceptable stimulation waveforms that prevent tissue damage, are
outside the scope of this thesis.
3.5.2 Image Characteristics and Visual Understanding
There is a wide range of features, or characteristics of digital images that give us
visual understanding to varying degrees [83]. However, many characteristics are not
compatible with the modelling of anticipated evoked visual sensations of visual
prostheses. For example, it may not be physically possible to control the colour of a
phosphene – making it red on one stimulation, blue on the next, then green, and so
on. Colour processing, along with other future possibilities, is discussed further in
Chapter 8.
Image characteristics that are compatible with simulating artificial vision and have
variations that can be tested are the following:
• Spatial resolution
• Brightness
• Contrast
• Edges
• Distance information
• ‘Importance’ mapping (using the notion of combining several of the above
factors to add value to the information content of an image).
Results of subjective tests using these image characteristics are presented later in
Chapter 4, and the next sections provide background to these features.
3.5.2.1 Spatial Resolution
Maximising the number of electrodes in an implant to give high spatial resolution would
certainly enhance the information content of images. Early researchers in visual
prosthesis systems conjectured the following three approaches in the 1970s [99]:
1. small matrix size – coded information: Due to the small matrix size, 10x10 or less,
information must be categorised and encoded to maximise information delivery. This
system cannot effectively provide a direct two dimensional representation of space,
but must extract the significant environmental features and present them in coded
format. This approach places a heavy demand on the learning capacity of the user.
2. intermediate matrix – preprocessed input: with a matrix size of between 20x20 (400
points) and 32x32 (1024 points), an effective two dimensional display can be
achieved. Simulation experiments carried out on sighted viewers [8] suggested that a
phosphene matrix containing 600 points (24x24) would be sufficient to permit a
reading speed of 120 words per minute, where 10 letters were presented at a time to
subjects. The combination of a suitable field range for detection of peripheral hazards
with adequate central resolution for useful object identification presents a severe
challenge at this matrix size.
3. maximum density matrix – direct spatial display: a 4000 point (64x64) display can
provide a fairly good image of a face.
Other simulation work to determine how many electrodes would be needed to provide
useful vision has been done by Cha et al at the University of Utah [12,13,14]. Normally
sighted human subjects wore a video camera attached to a head-mounted visor which
simulated pixelised vision. The tests covered visual acuity, reading speed and mobility
performance. Images were pixelised and projected on to a small monochromatic monitor.
To create the illusion of phosphenes, perforated masks that represented different pixel
densities and field sizes were placed between the eye and the monitor. The conclusions
drawn from these studies were:
• The most important factor in visual acuity was pixel density (spacing). However
the most important factor in reading speed was pixel number, not spacing.
• When using low density masks, acuity was increased with voluntary head
movements.
• A 25 x 25 array provided a visual acuity of 20/30, which allowed a reading speed
of 100 word/min and good obstacle avoidance.
More recent (2003) studies simulating vision performance at different spatial
resolutions have been carried out at Johns Hopkins University and the University of
Southern California. One study involved presenting pixelised face representations of
10x10 to 32x32 spatial resolution [96]. The researchers found that parameters such as
contrast, grid size, dot size, dot gap, drop out rate and greyscale resolution had a
significant effect on facial recognition speed and accuracy. In a separate study [33],
4x4, 6x10 and 16x16 electrode arrays were simulated in a number of performance
tasks including four choice orientation discrimination of a Sloan letter E, object
recognition and discrimination, a cutting task, a pouring task, symbol recognition and
two reading tasks. Subjects performed best using the 16x16 array which
corresponded to a visual acuity of 20/420, although simple objects and symbols
could still be recognised sporadically at the lowest resolution array.
Thus, given that an implant could deliver sufficiently high phosphene numbers to the
visual field, the ability to read and navigate around obstacles is achievable. It should
be noted that the performance stated above is based on the assumption that each
electrode produces a corresponding phosphene in an ordered array in the visual field.
While more electrodes ideally equate to improved perception, the upper limit on
electrode numbers may be determined by the small space available to implant the
array, along with the minimum electrode spacing required to achieve adequate
phosphene resolution.
This is a manufacturing constraint and is outside the scope of this thesis. Spatial
resolution is mentioned here as a technique to explore low quality image perception.
3.5.2.2 Brightness Modulation
Multiple brightness levels in images may be highly informative. Figure 3.5 shows
how brightness modulation might be simulated for the mandrill image. The top
images show circular pixelated versions using eight and three greylevels,
compared with the original at 256 greylevels. The bottom images show variations of
a halftoning technique [82] with different pixel radii and dot orientation. The goal of
halftoning is to preserve the visual impression of grey tones in spite of the fact that
pixel-by-pixel the image is ideally black or white. Increasing the number of
greylevels is considered to be equivalent psychophysically to increasing the
dot/phosphene size [49].
Figure 3.5: Simulating the effect of modulating phosphene brightness – top: original 128x128, 25x25 with 8 grey levels, 25x25 with 3 grey levels; bottom: halftoned at 4 pixel radius/45 degrees, 6 pixel radius/45 degrees, 6 pixel radius/0 degrees
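A minimal sketch of the grey-level reduction in the top row, assuming a 0-255 NumPy
image and uniform quantisation (the halftoning variants of [82] are not reproduced
here):

import numpy as np

def quantise(img, k):
    """Uniformly quantise a 0-255 image to k grey levels."""
    levels = np.linspace(0, 255, k)
    idx = np.round(img / 255.0 * (k - 1)).astype(int)
    return levels[idx]

img = np.linspace(0, 255, 64 * 64).reshape(64, 64)
eight_grey = quantise(img, 8)   # 3-bit greyscale
three_grey = quantise(img, 3)   # black, mid-grey, white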
Physiologically, the brightness of induced phosphenes has been found to vary with
stimulus amplitude, frequency and pulse duration [86] and, in other studies, to be
logarithmically related to stimulus current amplitude [35]. This suggests that in
principle, several (2 – 4) bits of greyscale/size variance should be achievable. For
early prototypes however, only a 1-bit grey scale might be possible, producing only
binary (black and white) images.
3.5.2.3 Contrast
Contrast affects the detection of many kinds of image features (eg. regions, edges,
textures) and is known to be a fundamental early vision characteristic in human
vision [3]. In some visual environments, such as reading black text on a white
background, it may be better to deliver negative or inverse images. In any case, the
ability to enhance contrast may prove useful for highlighting image content that
would otherwise be much harder to see.
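As a sketch of these two operations, assuming a 0-255 NumPy greyscale image (an
illustrative implementation, not one prescribed by the thesis experiments):

import numpy as np

def stretch_contrast(img):
    """Linearly rescale the image to span the full 0-255 range."""
    lo, hi = img.min(), img.max()
    return (img - lo) / max(hi - lo, 1e-9) * 255.0

def invert(img):
    """Produce the negative (reverse contrast) image."""
    return 255.0 - img

img = np.clip(np.random.normal(128, 20, (64, 64)), 0, 255)
enhanced = stretch_contrast(img)   # full 0-255 range restored
negative = invert(img)             # reverse-contrast version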
3.5.2.4 Edge Detection
In scene recognition and interpretation, edges play a fundamental role [56]. Edges
assist in the formation of a primal sketch to derive shape information from images.
Also there are biological mechanisms for detecting oriented zero-crossing segments
(edges) in retinal ganglion cells. An essential function of an artificial vision system
would be to highlight the edges of objects. The Dobelle research team [20] expect
improved results for patients with the implementation of Sobel filters for edge
detection. The prominence of uniformly shaded areas could be decreased, while
edges that might otherwise be hardly noticeable could be highlighted.
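A minimal sketch of Sobel edge detection of this kind, using SciPy's Sobel operator
(the Dobelle team's actual implementation is not described in the source):

import numpy as np
from scipy.ndimage import sobel

img = np.zeros((64, 64))
img[20:44, 20:44] = 255.0            # bright square on a dark background

gx = sobel(img, axis=1)              # horizontal gradient
gy = sobel(img, axis=0)              # vertical gradient
edges = np.hypot(gx, gy)             # gradient magnitude: strong at borders
edges = edges / edges.max() * 255.0  # rescale for display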
3.5.2.5 Distance Indication
An artificial vision system that conveys the distance to an object would be
particularly useful. While sonar distance aids have been common for many years,
the auditory signal emitted by these devices can interfere with important surrounding
environmental noises. The ability to convey distance visually rather than audibly
would therefore be desirable. Distance information can be obtained by computing
depth from the disparity between two cameras, or by using ultrasonic or laser
rangefinders.
Distances could then be mapped to intensities, where the nearest object is shown
with the highest intensity. If the device display only supports a 1-bit grey scale, only
the nearest object need be displayed. This distance 'mode of operation' could be
quite useful in combination with a standard image of luminance intensities.
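A minimal sketch of this distance mode, assuming a depth map is already available
from one of the sources above; the function names and the nearest-object tolerance
are illustrative assumptions:

import numpy as np

def distance_to_intensity(depth):
    """Map depth to intensity (0-255), nearest points brightest."""
    d = (depth - depth.min()) / max(depth.max() - depth.min(), 1e-9)
    return (1.0 - d) * 255.0

def nearest_only(depth, tolerance=0.1):
    """1-bit display: keep only pixels within `tolerance` of the minimum depth."""
    return (depth <= depth.min() + tolerance * np.ptp(depth)).astype(float)

depth = np.random.rand(25, 25) * 5.0   # stand-in depth map in metres
greyscale_view = distance_to_intensity(depth)
binary_view = nearest_only(depth)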
3.5.2.6 Importance Extraction
A feature of an efficient artificial vision system would be importance extraction - to
present only the most important object in a scene and disregard the uninteresting or
homogenous elements. Section 2.4 covered several region-of-interest algorithms
which aim to predict where the human eye fixates on an image. An extension of
these algorithms is the concept of assigning an importance score or weighting to each
area in an image to generate an “importance map” [52,68]. This importance ranking
has previously been applied in visually lossless compression, where improved
compression ratios have been achieved with high perceived image quality.
This importance ranking could be used in artificial vision systems to identify the
most important object in a scene and present only this object, discarding the
remainder. The definition of importance may comprise some combination of motion,
location, contrast, contrast, size and shape. The components that comprise
“importance” may also be adjusted for different viewer situations eg. home,
entertainment and mobility.
Importance ranking could also be used to optimise the bandwidth for data transfer in
artificial vision systems. If the bandwidth is limited, one could apply variable
resolution to the image on the basis of importance. Homogenous or uninteresting
scene elements would be displayed at low resolution, while important areas, such as
edges and moving objects, would be displayed in high resolution. Thus a decreased
bit-rate could be used to present an image of high perceived quality.
Figure 3.6 depicts the importance map process as an example of extracting important
areas from within an image. An image is first segmented into regions of similar
properties. A split and merge segmentation algorithm is used based on grey level
variance. Feature maps/images are then constructed from the segmented image
corresponding to five features known to influence attention:
1. closeness – the closer an object, the more important
2. intensity contrast – regions of high intensity contrast from surrounding regions
are more important
3. shape – elongated regions are more important than round regions
4. size – the larger a region the more important
5. centralness – regions in the centre of the viewing area are more important
Each region in the feature map is assigned an importance score, normalised from 0
(not important) to 1 (very important). That is, lighter areas in the feature maps
should grab a viewer’s attention more than darker areas. An overall Importance Map
is then created by combining the five feature maps using a normalised sum of
squares, as indicated below:
R.I. = [ω1·(M1)² + ω2·(M2)² + ω3·(M3)² + ω4·(M4)² + ω5·(M5)²] / R.I.max    (Equation 2)

where:
R.I. = Region Importance
ωi = weight applied to feature map Mi
M1 = Closeness Map
M2 = Contrast Map
M3 = Shape Map
M4 = Size Map
M5 = Central Map
R.I.max = maximum Region Importance (normalising factor)
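A minimal sketch of the Equation 2 combination step (Python/NumPy), assuming
per-region feature scores have already been computed and using equal weights as a
default assumption (the thesis does not fix the weight values here):

import numpy as np

def region_importance(maps, weights):
    """maps: five arrays of per-region scores in 0-1; weights: five floats."""
    ri = sum(w * m ** 2 for w, m in zip(weights, maps))
    return ri / ri.max()   # normalise by the maximum region importance

# Illustrative feature scores for a hypothetical image of three regions:
closeness = np.array([0.9, 0.4, 0.1])
contrast  = np.array([0.7, 0.8, 0.2])
shape     = np.array([0.8, 0.3, 0.3])
size      = np.array([0.5, 0.9, 0.6])
central   = np.array([0.9, 0.5, 0.1])

importance = region_importance(
    [closeness, contrast, shape, size, central],
    weights=[1.0, 1.0, 1.0, 1.0, 1.0])   # equal weights as a default assumption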
Figure 3.6: Importance Mapping concept – the original image is segmented; five weighted feature maps (Closeness ω1, Contrast ω2, Shape ω3, Size ω4, Central ω5) are computed and combined into the final Importance Map
Figure 3.7 below shows a post and chain which would pose a hazard to a blind
person. A 16 x 16 resolution copy of the image is shown adjacent to the original. It
should be noted that the 16 x 16 image is shown in full grey-scale which is unlikely
to be possible in vision prostheses. Although one can discern the dark blob of a post,
the shadow of the post provides a confusing visual cue, and the safety chain attached
to the top of the post is not evident.
Figure 3.7: Safety Post enhancement with advanced image processing techniques – top: original and 16x16 image (full grey scale); bottom: 16x16 Importance Map and 16x16 Distance Map
The bottom left image in Figure 3.7 shows the enhancement provided by mapping
‘importance’ to intensity. The image is shown with 4 grey levels, as might be
achieved in vision prostheses. Regions assessed as important (high contrast, large in
size, long and slender, central to the image etc.) are represented with the highest
intensity. It is noted that the safety chain is now evident but the shadow of the post is
also present.
The bottom right image in Figure 3.7 shows the distance mapping discussed in
Section 3.5.2.5. The closest regions to the viewer are presented with the highest
intensity. It is noted that the chain is evident but the post shadow is not discernable.
Another example of the same processing with another outdoor scene is shown below
in Figure 3.8. It is believed that a beneficial image processing system would provide
several of these ‘modes of operation’ to gain as many visual cues from the low
quality image as possible. Experiments described in Chapter 4 will test this
conjecture.
Figure 3.8: Enhancing the information content of a low quality image of stairs – original; 16x16 image (full grey scale); 16x16 Distance Map; 16x16 Importance Map
3.6 Thesis Research Questions and Approach
In consideration of the visual prostheses literature presented in this chapter and
within the image quality framework described in Chapter 2, several issues can be
identified to drive research questions. These research questions are described in the
next section followed by an outline of how these questions will be addressed in the
remaining thesis chapters.
3.6.1 Image Processing Requirements
Within the scope of this thesis, there are several image processing requirements for
visual prostheses. Visual prostheses need to:
1. facilitate some recognition performance while bounded by an ultra low image
quality regime;
2. allow a user to gain as many visual cues from a scene as possible;
3. use simple low level processing to improve scene understanding;
4. convey maximum scene information; and
5. deal with different scene types.
These requirements drive several research questions around which the remaining
thesis chapters are based:
Q1: Although limited to low quality images anticipated from visual prostheses, can
recognition of some objects be achieved?
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
Q3: Can a model be constructed for basic information required for the interpretation
of a visual scene at low image quality?
Q4: Should the processing techniques be adjusted depending on the scene type?
In the low image quality domain where spatial resolution is limited, an effective
approach to improve scene understanding is Region-Of-Interest (ROI) modelling to
extract salient or important areas within an image. It is reasonable to expect ROI
processing to be an effective approach as these methods trim away information that
may not be relevant to scene understanding. Other application areas, such as image
compression, military imaging and advertising, use ROI processing to extract features
and regions where the human eye might fixate in an image. Therefore we expect that
such techniques could be incorporated into visual prostheses to trim away the large
amounts of data in an input image. Thus the limited number of display pixels
(implant electrodes) would be used most efficiently by presenting to blind users only
the important elements of a scene.
It is expected that ROI processing will provide an improved outcome over the
standard (or Base Case) type of processing in prostheses, which consists of
subsampling to match the spatial resolution of the electrode array and binarisation.
The Importance Map ROI technique discussed in Section 3.5.2.6 is selected for the
thesis experiments because it is computationally cheap and variations can be
constructed around a standard model to alter the appearance and hence the
interpretability of the final processed image.
It is also expected that a model can be constructed for basic information required for
the interpretation of a visual scene at low image quality. Image quality is
characterised thoroughly in the literature for high quality images but not for low
quality images.
3.6.2 Testing Method
The research questions will be tested by experiments with normally sighted viewers.
As explained at the commencement of the thesis in Section 1.3 - Scope, prosthesis
development in Australia is currently limited to animal models. Thus use of
normally sighted viewers is considered the only option for simulation studies at this
time.
Several simulation experiments are presented as follows:
Q1 (Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved?) – Ch4, Recognition experiments: Section 4.2, processing techniques compatible with visual prostheses; Section 4.3, recognition and the influence of image type.

Q2 (Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?) – Section 4.2, a computationally cheap region-based (Importance Map) method; Ch7, several ROI methods compared.

Q3 (Can a metric be constructed for basic information required for the interpretation of a visual scene at low image quality?) – Ch5, Quantifying Information Content: Section 5.3, information content model; Section 5.4, recognition and information content.

Q4 (Should the processing techniques be adjusted depending on the scene type?) – Ch6, scene specific imaging.

Table 3.1 – Thesis experiments
Q1 is tested through recognition experiments described in Chapter 4. In Section 4.2,
several image processing techniques which may lead to improved perception of low
quality images are assessed by normally sighted viewers. This testing is to obtain an
understanding of low quality image perception and uses several processing
techniques described in Section 3.5.2 as being compatible with visual prostheses.
Section 4.3 describes a separate experiment assessing perception of low quality
images and the influence of image type.
Q2 concerns Region-of-Interest (ROI) processing applied to low quality images.
ROI processing was presented in Section 2.4 as a powerful perception modelling tool
using a combination of early vision and cognitive effects. Section 4.2 experiments
establish the applicability of a computationally cheap region-based (Importance
Map) technique to low quality images. Several variations of this method are
compared with a pixel-based (Saliency Map) technique in Chapter 7.
The construction of a metric in response to Q3 is described in Chapter 5. This
Chapter expands previous work in the area of visual complexity (Section 2.5) and
links visual complexity with information content. A robust metric to predict
perceived information content is developed from one series of subjective data and
tested against additional data. Also correlations are made between subjective
information content and object recognition for low quality images.
Q4 concerns the influence of different environments for the visual prosthesis user.
The concept of tailoring image processing to the scene type is tested in Chapter 6.
There is a lack of fundamental theory relating to the specifics of image
understanding, and consequently the above research questions represent
opportunities to refine this knowledge by subjective testing. A variety of viewer
behaviour is expected due to individual preferences influenced by past experiences
and expectancies, and the instructions given to viewers also influence these
variations. In the experiments described in the following chapters, viewers were advised
that:
The images appear as just a range of blocks – you may not be able to see
anything in the images at all. However this quality level is similar to what a
blind person might see with a bionic eye.
A final comment relating to the subjective testing is that there were variations in
the experiment sample sizes due to the availability of volunteers, which ranged from
n = 20 to n = 247. The strength of the findings is influenced by the sample size
(results from the smaller samples are more suggestive than conclusive). In addition,
some experiments tested several factors simultaneously, so the sample assessing one
factor was much smaller than the total number of participants in the experiment. For
example, 225 participants spread across 9 image quality classes represents a sample
size of only 25 per class.
3.7 Chapter Summary
This chapter has described the research activities underway internationally in the
field of electronic visual prostheses. Research efforts were described for approaches
aimed at stimulating the retina, optic nerve and visual cortex. Image processing
aspects for many of these projects were also described.
The chapter also identified processing techniques that are compatible with the
anticipated evoked visual sensations of visual prostheses. These included spatial
resolution, brightness modulation, contrast, edges, distance information and
importance mapping.
Finally several image processing requirements for visual prostheses were identified
which drive research questions to be addressed by thesis experiments. An outline
was given of the subjective testing proposed for the remaining thesis chapters.
Chapter 4 Recognition Performance
4.1 Overview
This chapter describes two preliminary experiments aimed to explore low quality
image perception. It aims to answer the research questions:
Q1: Although limited to low quality images anticipated from visual prostheses, can
recognition of some objects be achieved?
It is anticipated that some recognition is possible and that different types of images
result in varied recognition.
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
There is reason to believe that ROI processing will trim away unnecessary
information resulting in improved perception.
Section 3.5.2 discussed some image characteristics that are compatible with the
anticipated evoked visual sensations of visual prostheses. Adjustment of these
characteristics may result in enhanced information delivery to blind users of visual
prostheses. However, the extent of their success is difficult to quantify without
experimentation. Therein lies the framework for the first experiment described in
Section 4.2.
The second experiment described in Section 4.3 aims to quantify recognition
performance for low quality images by constructing an envelope of recognition. The
effect of the type of image is also determined from this experiment.
4.2 Subjective Tests to Determine Useful Processing Methods
As acknowledged above, there is a need to quantify the perception performance of
adjusting various image characteristics for improving scene recognition and
understanding. This section describes psychophysical testing on possible operating
modes for an artificial vision system, to identify the most informative image
adjustments that could be made for improved understanding of picture content. The
experiments aim to assess the performance of the proposed processing techniques.
This assessment is achieved by presenting degraded images to normally sighted
viewers and asking them to identify the scene and make use of the data. The images
presented have varying levels of resolution, greyscale, edge detection, importance
extraction and distance mapping.
4.2.1 Methodology
The subjective testing was undertaken by way of a booklet questionnaire survey
issued to participants. The booklet contained 20 pages of test patterns. Each page or
test pattern contained between 4 and 9 images, which were different versions of the
same object. An example of a booklet page is shown in Appendix Section A.1. The
different versions represented the various processing methods (edge detection,
distance mapping, inverse image, importance mapping) which were to be compared
against each other. The subjects were asked to write a description of the object and
rank the top three images that they believed showed the object most clearly.
Participants were drawn on a voluntary basis from senior high school students.
School students were chosen as subjects due to the large numbers available and to
reduce the likelihood of familiarity with image processing issues (eg. holding a low spatial
resolution image at a distance to discern objects). The subjects had been given no
prior background information as to the nature of the images except that the pictures
‘may be similar to what a blind person might see with a bionic eye and are scenes
that people are likely to see when walking about’. Viewing conditions for the
experiment were not controlled.
4.2.2 Images Chosen
While there are several good image databases available to the computer vision
community, it was desired to produce unique images for the project which subjects
would not have seen before. This reduces the inconsistencies that a priori knowledge
of the images could introduce into the results.
The image set was composed of chairs, doorways, posts, steps, and faces which were
considered to form mobility hazards within a visually-impaired person’s
environment. Two different types of each hazard were included (see Figure 4.1).
Figure 4.1: Image set used in the psychophysical testing – chair 1, chair 2, doorway 1, doorway 2, face 1, post 1, post 2, steps 1, steps 2, face 2
Variations of image characteristics that are applicable to visual prostheses were
applied to the images. The spatial resolution and greyscale of images were
representative of current prototypes – 10x10, 16x16, and 25x25 pixel images with
either 2 (black and white) or 3 grey levels (black, grey, white). An inverse (reverse
contrast) was included along with an edge image, and Distance and Importance
Maps.
A phosphene mask was applied to each image to create the illusion of phosphenes
(ie. pixels were circular and did not touch each other along their borders). An
example of the types of images presented is shown below in Figure 4.2.
Figure 4.2: Image processing techniques used in the psychophysical testing – original; spatial resolution variations (25x25, 16x16, 10x10); inverse; 3 grey levels; edges; Distance Mapping; ‘Importance’ Mapping
Only two pages with the same object appeared in one booklet to minimise learning
effects. In establishing the order of the two images in the booklet, the object version
with the lower resolution was always presented first. The booklet page order was
chosen to ensure substantial differences in appearance between successive booklet
sheets (refer Appendix Section A.2).
4.2.3 Results
The questionnaire survey was completed by 174 high school students in their
penultimate year of study. In analysing the results of the questionnaire surveys, it
could not be assumed that a blank entry counted the same as ‘Don’t know’.
Therefore the sample size was reduced to exclude blank entries. The experiments
were designed to answer a number of questions which appear in italics in this
section.
Q. What were the most recognisable objects in the survey?
Object recognition was assessed by analysing the respondent’s guess of the object.
In the analysis, a range of descriptions was accepted when assessing the recognition
of objects, as the context or environment of the user would contribute perception
cues. Users of artificial vision systems would be (presumably) aware of their
surrounds and would consequently be able to place objects in their context. Also,
more powerful interpretation and increased understanding of a scene would be gained
from a rapid succession of images (ie. a movie versus a single image) and from
moving about to see how various objects interact.
Many of the responses in the survey indicated that the person was able to recognise
the object as having certain properties, but when it came to naming the object, the
description was wrong. These descriptions were deemed to be correct (contextual
recognition) given that the person interpreting the image would have knowledge of
the context of the object. Examples of this are ‘Post 1’, which some respondents
named as ‘cactus in the desert’. Its height and slender form identified it as an object
to be avoided. If the same image was viewed on a city street, the description would
be likely to be more representative of the actual object. Other examples are the face
images, wherein a respondent was able to recognise a face and head but associated
the wrong gender with the face. Appendix Section A.3 documents the recognition
assessment for all images. Note the listing comprises only borderline responses and
is not a complete list.
Combining the results from all test patterns (multiple resolutions and grey scale, and
different processing methods), the recognition rate was as follows:
Figure 4.3: Recognition rate for objects in the image set (n=168) – face 1: 98%, face 2: 98%, chair 1: 92%, post 1: 69%, chair 2: 33%, door 2: 31%, post 2: 20%, door 1: 19%, steps 2: 16%, steps 1: 4%
Figure 4.3 includes 95% confidence intervals around mean recognition rates across
all test patterns. The most recognisable objects were the two faces with 98%
recognition. Chair 1 was also highly recognised. Face recognition has been
identified as one of the foremost visual learning steps in the human baby [29]. In
studies where the eye positions of babies were monitored when presented with visual
stimuli, the babies spent longer looking at true face pictures than at other stimulus
patterns, including patterns where the same face components (mouth, eyes etc) were
present but rearranged spatially. This finding may give evidence of an immediate
visual response to biologically important objects. The result also agrees with
studies of the specialised processing required for face recognition [97]. There is
neurophysiological evidence for the existence of neurons in the temporal lobe of
monkeys, sheep and humans which respond selectively to faces. Particular neurons
are sensitive to the direction of gaze and have a maximum response if the face is
viewed straight on. Faces are encoded by their differences from a prototypic
‘average’ face/caricature, where differences are assessed relative to a norm.
Interestingly, Chair 2 was difficult to recognise, and its round features contributed to
many animal impressions in the responses. As mentioned above and in other sources
[34], past experiences and expectancies influence visual perception. Had subjects
been told they were in a room containing office equipment, one would expect no
responses such as “animals”. Thus perception performance in this assessment can be
considered a worst case: no context or hints were provided and static (still) images
were viewed. The ability of the brain to interpret low-information images even in
this worst case is apparent.
Analysis of Variance (ANOVA) on the data shown in Figure 4.3 reveals significant
differences in recognition rate for the ten images tested. The test was based on 12
observations (refer to the booklet layout shown in Appendix Section A.2) and resulted
in F(9,110) = 200 > Fcritical = 2.0 at α = 0.05, with P = 5.8E-64 (highly significant). Thus there
are significant differences between the mean recognition rates when averaged across
all test patterns (multiple resolutions and grey scale, and different processing methods).
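For reference, a one-way ANOVA of this kind can be reproduced with SciPy as sketched
below; the recognition samples are illustrative stand-ins, not the thesis data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Twelve observed recognition rates per image, for ten images:
groups = [rng.normal(loc=mu, scale=0.05, size=12)
          for mu in [0.98, 0.98, 0.92, 0.69, 0.33,
                     0.31, 0.20, 0.19, 0.16, 0.04]]

f_stat, p_value = stats.f_oneway(*groups)
# Reject H0 (equal mean recognition across images) when p_value < 0.05.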
Another fundamental aspect to determine from the testing was the effect of spatial
resolution on object recognition. The result found is represented graphically below
in Figure 4.4. The plot shows 95% confidence intervals around mean recognition
rates for the ten images used in the test.
Mean recognition rates by spatial resolution and grey scale (n=140):

                        10x10   16x16   25x25
B&W images               48%     44%     49%
3 grey level images      49%     50%     53%

Figure 4.4: Effect of spatial resolution and grey-scale on object recognition.
Q. How does greyscale affect recognition?
Although the differences in Figure 4.4 are fairly small, it can be seen that at a
particular resolution, images with 3 grey levels (white, mid-grey and black) are more
recognisable than black and white images (Figure 4.5).
Statistical testing shows however that the differences are not significant. A two
sample t-test was performed using 30 observations (3 spatial resolutions across 10
different images) at α = 0.05. The test assessed the hypotheses:
H0: There is no recognition difference when adding greyscale;
H1: The addition of greyscale achieves significantly different recognition
results; ie. a two-tailed t-test.
A t-statistic of -0.4 was obtained, the magnitude of which was much less than the
critical t value of 2.0 for 58 degrees of freedom. The significance of this value
was P = 0.71, and since this is greater than 0.05, H0 was not rejected: adding
greyscale does not significantly change recognition results.
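The corresponding two-sample t-test can be sketched with SciPy as follows; the
samples are illustrative stand-ins for the 30 recognition observations per
condition, not the thesis data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
bw_images   = rng.normal(loc=0.47, scale=0.10, size=30)  # B&W recognition
grey_images = rng.normal(loc=0.51, scale=0.10, size=30)  # 3-grey recognition

t_stat, p_value = stats.ttest_ind(bw_images, grey_images)
# Two-tailed p-value; H0 (no difference from adding greyscale) is
# retained when p_value >= 0.05.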
So while small improvements with adding greyscale were noted in the results, for
constant spatial resolution, images with 3 grey levels (white, mid-grey and black)
were not significantly more recognisable than black and white images (Figure 4.5).
Figure 4.5: Images with 3 grey levels (white, grey, black) were not significantly more recognisable than black & white images.
Q. How does spatial resolution affect recognition?
Again referring to Figure 4.4, images with 3 grey levels were more easily recognised
as spatial resolution increased. However for black and white images this was not
always so. Thus resolution is still important for object recognition – not absolute
resolution, but resolution relative to the size of the object one is trying to show.
Analysis of Variance (ANOVA) of the data shown in Fig 4.4 reveals that the
differences when increasing spatial resolution were not significant. This analysis
compared the hypotheses:
H0: µ 10x10 = µ 16x16 = µ 25x25
H1: At least two of the means are not equal, at α = 0.05.
The test was performed for 20 observations (10 images with 3 greylevels and 10
B&W images) and also for averaged results for B&W images and 3-grey images (2
observations). For 20 observations, F(2,57)=0.07 < Fcritical (3.1) with P = 0.93, while
results for 2 observations were F(2,3)= 1.2 < Fcritical (9.6) with P = 0.40. Thus as both
P values were above 0.05, H0 was not rejected: mean results did not differ
significantly as spatial resolution increased.
Q. Given the choice between increased spatial resolution and increased intensity
resolution (grey scale), which would give higher recognition?
One aspect of the testing was designed to analyse the effect of resolution versus grey
scale. A subject was simultaneously presented with images at a low resolution with
3 grey levels (ie. white, grey, black) and higher resolution black and white images.
The test analysed the following issues:
• 10x10 3grey compared with 16x16 B&W
• 10x10 3grey compared with 25x25 B&W
• 16x16 3grey compared with 25x25 B&W
The findings from this testing are shown in Figures 4.6 and 4.7.
Figure 4.6: Comparing resolution and grey scale (n=110) – % recognition for each of the three comparison groups, broken down by processing method (b&w: normal, inverse, distance, importance; 3-grey: normal, inverse, distance, importance).
Figure 4.6 comprises data for only those subjects who could correctly identify the
object. Three groups of bars are shown corresponding to the 3 bullet points above.
Each group shows a distribution of processing method chosen by subjects as showing
the object most clearly (ie. Rank 1 on the bottom of the test sheet shown in Appendix
Section A.1). The bars in each of the three groupings add to 100%. The plot also
shows individual 95% confidence intervals for each of the processing methods
obtained across the ten images used in the test. The four bars at the top of each
grouping refer to the higher resolution black and white images, while the bottom four
represent lower resolution images with 3 grey levels. It is clear that higher
recognition is achieved with the higher resolution black and white images over lower
resolution images with 3 grey levels. This indicates higher recognition is achieved
with increased spatial resolution rather than increased greyscale resolution.
Statistical testing of this data shows these differences in recognition are significant.
Two sample t-tests were performed for each of the three groupings using 10
observations (10 different images) at α = 0.05. The test assessed the hypotheses:
H0: There is no recognition difference between low resolution images with 3
greylevels compared with higher resolution black and white images;
H1: Higher resolution black and white images are more easily recognised; ie.
a one-tailed t-test.
In all three cases (10x10 3grey compared with 16x16 B&W, 10x10 3grey compared
with 25x25 B&W, 16x16 3grey compared with 25x25 B&W), P values were less
than 0.05, indicating recognition rates for the lower resolution 3-grey images are
significantly lower than the higher resolution black and white images.
4.2 Subjective Tests to Determine Useful Processing Methods
Significantly increased
understanding
Figure 4.7: Significantly higher recognition is achieved with increased spatial resolution (Right) over increased greyscale resolution (Left)
82
Q. What are the presentation modes, or processing methods, that show objects most clearly?
On the questionnaire test sheets, the subjects were asked to rank the top 3 images in
the order that they thought showed the object most clearly (refer Appendix Section
A.1). While all 3 rankings show a trend of the most useful processing methods, it
was of prime interest to determine what subjects nominated as their first choice,
which they may have felt most strongly about. The first choice nominations are
graphically represented in Figure 4.8 below. The plot shows individual 95%
confidence intervals for each of the processing methods obtained across the ten
images used in the test.
First choice nominations by spatial resolution (black & white images):

              10x10   16x16   25x25
normal          9%     19%     24%
inverse         6%     21%     16%
distance       30%     30%     27%
importance     52%     28%     20%
edges           3%      2%      2%

First choice nominations by spatial resolution (3 grey level – black, grey & white – images):

              10x10   16x16   25x25
normal         17%     14%     33%
inverse        15%      8%      7%
distance       23%     25%     12%
importance     35%     42%     29%

Figure 4.8: Object recognition rate for various processing methods (n=110)
Analysis of Variance (ANOVA) of the data shown in Figure 4.8 testing the
hypothesis that the means are equal at α = 0.05 for 10 observations, reveals the
following significance levels:

Greyscale resolution       Degrees of    10x10        16x16      25x25
                           freedom
Black and white images     (4,45)        P = 1.0E-5   P = 0.11   P = 0.28
3-grey level images        (3,36)        P = 0.22     P = 0.02   P = 0.05

Table 4.1: Analysis of Variance for various processing methods (columns show spatial resolution)
Table 4.1 indicates that there are significant differences between the means for 10x10
B&W images and 16x16 images with 3 grey levels (P<0.05). From Figure 4.8, it can
be seen that distance and importance processing were more commonly nominated as
showing the object clearly in these cases. These processing methods were
significantly higher than ‘Normal’ presentation modes. However, there were no
significant differences between means for 25x25 images. This also suggests that
several presentation modes should be used in artificial vision systems rather than a
single mode of operation.
Q. How does edge enhancement affect recognition?
The upper plot of Figure 4.8 also shows that edge-processed images were not well
recognised (Figure 4.9). At the low resolutions used in the tests, edges comprised
too large a percentage of the total image pixels. For example, in a 10x10 image, a
vertical edge would comprise an entire column representing a tenth of the image.
Figure 4.9: Edge images were not well recognised
Statistical testing confirms that results for edges are significantly lower than the
average of other methods. Two sample t-tests were performed for each of the three
spatial resolution groups (10x10, 16x16, 25x25) using 10 observations (10 different
images) at α = 0.05. The test assessed the hypotheses:
H0: There is no recognition difference between edge-processed images
compared with the average recognition of the other processing methods (the
average of normal, inverse, distance and importance);
H1: Edge-processed images are significantly less easily recognised; ie. a one-
tailed t-test.
In all three cases, P values were less than 0.05, indicating recognition rates for the
edge-processed images were significantly lower than the average of other processing
methods.
Q. What effect does image content (type of scene) have on recognition?
The test image set reflected diversity in scene content, and it was found that image
content is important to recognition ability. It would therefore be beneficial to
have adaptive processing for different scenes. For recognising chairs and doorways,
distance and importance processing was best, while for human faces, normal (or
inverse) processing was most beneficial. Interestingly, there appeared to be a
subjective difference between inverse and normal images, which differed between
individuals.
4.2.4 Test Conclusions
This section has described subjective experiments undertaken to determine useful
image processing methods for visual prosthetic applications and to provide a
framework for prototype development. A condensed summary of the results is as follows:
• at a particular resolution, images with 3 grey levels (white, grey and black) were
not significantly more recognisable than black and white images;
• higher recognition was achieved with increased resolution rather than increased
grey scale;
• the most recognisable objects were images of human faces with 98% recognition;
• the test image set reflected diversity in scene content, and it was found that image
content is important to recognition ability – it is therefore beneficial to have
device-switchable processing for different scenes;
• at lower spatial resolutions, one or two processing methods were quite useful
(importance and distance processing);
• edge-processed images were not well recognised; at the low resolutions used in
the tests, edges comprised too large a percentage of the total image pixels;
• there appeared to be a subjective difference between inverse and normal images
(Figure 4.10);
Figure 4.10: Subjective preferences between image and its inverse – some subjects preferred white on black, others black on white.
• for recognising chairs/doorways – distance & importance processing was best;
for human faces: normal & inverse; and
• resolution is still important for object recognition – not absolute resolution,
but resolution relative to the size of the object one is trying to show.
4.3 Subjective Tests to Determine Influence of Image Type
In this section further results are presented on subjective tests simulating what might
be seen by users of low quality vision systems. A group of 225 normally sighted
subjects viewed a set of low quality (low spatial resolution and low grey-scale
resolution) static images. The aim for this testing was to quantify
intelligibility/recognition for low quality images and determine the effect of the type
of image. Results from this testing form part of an image quality model to assess the
usefulness of low quality images.
4.3.1 Methodology
Part of the research involves assessing visual perception at this low end of the image
quality spectrum. Chapter 2 described numerous models assessing the human visual
system and image quality. However, these models apply to the high end of the image
quality spectrum (see Watson [105] for a good compilation). There is a need to fill
this void to assess image quality for emerging implant designs. The work extends
upon the previous subjective tests on normally sighted viewers described in the
preceding section which determined the impact of several image processing
techniques on object recognition.
The simulation tests were undertaken to provide insight into human perception of
low quality images and were aimed at simulating artificially-induced low quality
vision.
The objective was to obtain a Recognition-Quality envelope (see Fig 4.11), where a
subject was able to use the information presented to draw an intelligible conclusion
about the image. This section introduces the concept of recognition-quality curves
which show recognition performance plotted against image quality.
Figure 4.11: Test Objective – obtain a Recognition-Quality curve (recognition plotted against image quality). The annotated questions on the plot are: ‘Is there a threshold of minimum quality required for intelligible viewing?’ and ‘Is there a variation in the degree of recognition possible for different images of the same quality?’
It was anticipated that as image quality was increased, there would be an increase in
the ability of an object to be recognised. However, for a given image quality,
recognition performance was expected to vary among viewers, thus producing an
‘envelope’ of recognition as opposed to a straight line response. This may also
indicate that the ability of an object to be recognised may not improve within a range
of image qualities.
Participation was on a voluntary basis and comprised 271 senior high school students
and 11 mature age respondents. Invalid data resulted in the rejection of 57
questionnaires (21%). Thus the final sample size was 225, representing sample sizes
of 25 for each of the 9 image quality classes.
Participants had no prior knowledge of the images. Booklet instructions stated that a
range of high quality and low quality images could be expected, and although the
low quality images might just appear as a range of blocks, they may be similar to
what a blind person might see with a bionic eye.
4.3.2 Images Chosen
There were 9 Image Quality classes tested (see Fig 4.12). Original images were
256x256 pixels representing a range of scene types. A decreasing image quality
scale was presented using spatial resolutions typical of visual prosthesis designs
(25x25, 16x16, 10x10) and reducing the grey levels from full greyscale to binary. It
was also of interest to expose the structure of an image by presenting image edges.
Figure 4.12: The nine image quality classes used in the tests – 256x256 full greyscale, 256x256 edge (image structure) and 256x256 binary; 25x25 greyscale and binary; 16x16 greyscale and binary; 10x10 greyscale and binary
Reduced quality image sets were prepared for the images shown in Fig 4.13.
Figure 4.13: Test image set – tree, flower, balloon, lighthouse, face, buildings, capsicum, gorilla, duck
The subject was presented with 9 different images on the one page (tree, flower,
balloon, lighthouse, face, buildings, capsicum, gorilla, rubber duck) corresponding to
an image quality class described above. An example of the test stimuli is shown in
Appendix Section B.1.
4.3.3 Results
Responses indicated by subjects were collated to determine recognition rates. Most
subject responses were easy to classify as "Yes, this person has correctly recognised
the object" or otherwise. However, where a subject's response was borderline,
Appendix Section B.2 was constructed to maintain consistent judgements on whether
images where correctly recognised. Responses were accepted if they had similar
context to the answer. Note the table includes only borderline responses and is not a
complete listing.
Table 4.2 shows the proportion of respondents who could correctly identify all (9/9)
images presented to them and two-thirds (6/9) of the images shown to them. 6/9 was
chosen to reflect recognition performance clearly over 50%.
QUALITY CLASS          9/9 correct    % correct    6/9 correct    % correct
                       (out of 25)    (9/9)        (out of 25)    (6/9)
10 x 10 Binary          0              0%           0              0%
10 x 10 Greyscale       0              0%           4              16%
16 x 16 Binary          0              0%           1              4%
16 x 16 Greyscale       1              4%           7              28%
25 x 25 Binary          0              0%           4              16%
25 x 25 Greyscale       2              8%           20             80%
256 x 256 Binary        4              16%          25             100%
256 x 256 Edge          24             96%          25             100%
256 x 256 Greyscale     25             100%         25             100%

Table 4.2: Correct image identification – respondents correctly identifying all nine (9/9) and two-thirds (6/9) of the images (n=25 per quality class)
None of the respondents viewing the low quality binary images (10x10, 16x16, 25x25) were able to correctly identify all 9 of the presented images. Surprisingly, even at high resolution, only 16% of respondents viewing the binary versions of the originals correctly identified all images. Even when considering identification of more than half (6/9) of the image set, recognition performance was still low for respondents viewing the low quality binary images. In fact, the same number of people could identify two thirds (6/9) of the 25x25 binary image set as could do so with the 10x10 greyscale images. 80% of viewers of the 25x25 greyscale image set could correctly identify two thirds (6/9) of the image set. This value of useful spatial resolution agrees with previous simulation work by others ([99]; refer Section 3.5.2.1), which found that effective two dimensional displays can be achieved with matrix sizes of between 20x20 and 32x32.
When considering an average across all image types, it was possible to construct an
envelope of recognition for the test set as shown below in Fig 4.14, which has a
similar shape to the envelope proposed in Fig 4.11. The plot shows 95% confidence
intervals around mean recognition rates for the nine image quality classes used in the
test. Maximum and minimum curves have been added to indicate the range of values
obtained. Although the maximum and minimum values are shown joined with a line
to form an envelope, they do not imply that the x-axis is always ordered in the image
quality order as shown. (The order below is by increasing mean recognition rate across all image types.)
[Plot: All Images (n=225). x-axis: Image Quality (10 Bin, 16 Bin, 10 F/G, 25 Bin, 16 F/G, 25 F/G, 256 Bin, 256 Edges, 256 F/G); y-axis: % Recognition (0-100%).]
Error bars denote 95% confidence intervals around mean recognition rate.
Figure 4.14: Recognition-Quality Envelope of recognition for all images in test set
Analysis of Variance (ANOVA) was performed on the data shown in Fig 4.14 to
compare the hypotheses:
H0: µ(10 Bin) = µ(16 Bin) = … = µ(256 F/G)
H1: at least two of the means are not equal, at α = 0.05.
The test resulted in an F-value of 111, which exceeded the critical F-value (1.98) for the number of degrees of freedom in the data (8, 216), and was highly significant at P = 2.64E-72. Thus H0 was rejected and it was concluded that recognition rates were
significantly different for the image quality classes used in the test.
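For illustration, a minimal Python sketch of this one-way ANOVA (the data here are synthetic stand-ins generated for the example; the thesis data were the per-subject recognition rates, 25 per quality class):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical stand-in data: 25 per-subject recognition rates (0 to 1)
    # for each of the 9 image quality classes.
    class_means = [0.05, 0.10, 0.15, 0.25, 0.35, 0.60, 0.75, 0.95, 0.98]
    groups = [np.clip(rng.normal(m, 0.1, size=25), 0, 1) for m in class_means]

    # One-way ANOVA: H0 is rejected when F exceeds the critical value
    # F(8, 216) at alpha = 0.05 (1.98 for the thesis data).
    f_value, p_value = stats.f_oneway(*groups)
    f_crit = stats.f.ppf(0.95, dfn=len(groups) - 1,
                         dfd=sum(len(g) for g in groups) - len(groups))
    print(f"F = {f_value:.1f}, critical F = {f_crit:.2f}, p = {p_value:.2g}")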
Recognition-Quality curves for specific object types are shown in Fig 4.16. It can be
seen that the x-axes have different ordering. One conclusion from these results is
that recognition performance varies widely depending on many factors, one of which
is the type of image. Had these curves been plotted with the same x-axis ordering
(say on increasing recognition rate of values averaged across image types), the
recognition plot of Fig 4.15 below would be obtained. The data points are shown
joined to demonstrate the jaggedness of the curves, highlighting that recognition
performance varies with type of image.
[Plot: Specific Images (n=25). x-axis: Image Quality (10 Bin, 16 Bin, 10 F/G, 25 Bin, 16 F/G, 25 F/G, 256 Bin, 256 Edges, 256 F/G); y-axis: % Recognition (0-100%); one curve per image: Mean, Lighthouse, Buildings, Tree, Gorilla, Capsicum, Face, Flower, Balloon, Rubber Duck.]
Figure 4.15: Variation in recognition among image types
In general, one might expect recognition to improve as spatial resolution and the
number of greylevels increase. The results here validate the experiments of Section 4.2.3 and those by others [36], showing that recognition rate/perceived quality depends on image type and that there is interplay between greylevel and spatial resolution.
Average recognition rate (averaged across viewers and image quality classes) is shown for each image. Each chart plots % Recognition (0-100%) against the nine image quality classes, ordered from highest to lowest recognition:
Gorilla (n=25), avge 44%: Orig, Orig Bin, Edges, 25 F/G, 16 F/G, 16 Bin, 25 Bin, 10 F/G, 10 Bin
Face (n=25), avge 85%: Orig, Orig Bin, Edges, 25 F/G, 16 F/G, 25 Bin, 10 F/G, 16 Bin, 10 Bin
Balloon (n=25), avge 51%: Orig, Edges, 25 F/G, 10 F/G, 16 F/G, Orig Bin, 25 Bin, 10 Bin, 16 Bin
Buildings (n=25), avge 50%: Orig, Orig Bin, Edges, 25 Bin, 25 F/G, 16 Bin, 10 F/G, 10 Bin, 16 F/G
Tree (n=25), avge 53%: Orig, Orig Bin, Edges, 25 F/G, 10 Bin, 25 Bin, 16 Bin, 10 F/G, 16 F/G
Capsicum (n=25), avge 50%: Orig, Orig Bin, Edges, 25 F/G, 25 Bin, 16 F/G, 10 F/G, 16 Bin, 10 Bin
Rubber Duck (n=25), avge 57%: Orig, Edges, Orig Bin, 25 F/G, 16 F/G, 10 F/G, 25 Bin, 16 Bin, 10 Bin
Lighthouse (n=25), avge 66%: Orig, Edges, Orig Bin, 25 F/G, 25 Bin, 16 F/G, 10 F/G, 16 Bin, 10 Bin
Flower (n=25), avge 72%: Orig, Orig Bin, Edges, 25 F/G, 25 Bin, 16 F/G, 10 F/G, 16 Bin, 10 Bin
Figure 4.16: Recognition-Image Quality curves for each test image
For 5 of the 9 images (face, flower, lighthouse, duck, capsicum), as more information
was presented in the way of either greyscale or spatial resolution, the recognition rate
increased.
For example, recognition performance for
• the flower image set = 10Bin < 16 Bin < 10 F/G < 16 F/G < 25 Bin < 25 F/G etc
• the face image set = 10Bin < 16 Bin < 10 F/G < 25Bin < 16 F/G < 25 F/G etc
where “10Bin < 16 Bin” indicates 16x16 Binary images were more easily recognised
than 10x10 Binary images.
On the other hand, the gorilla image set resulted in the following quality class ordering: 10 Bin < 10 F/G < 25 Bin < 16 Bin < 16 F/G < 25 F/G etc.
This order appears illogical, because a lower recognition rate was achieved for 25x25 Binary images than for 16x16 Binary images. However, the actual recognition rates for
the gorilla image set are very low. It can be conjectured that images with low overall
recognition rates give spurious results due to guessing. In contrast, the 4 images that
were most highly recognised across all image quality classes (face=85%,
flower=72%, lighthouse=66%, duck=57%) all had logical quality class ordering as
recognition rate increased.
Mean recognition rates for each object type are shown over in Fig 4.17. The plot of
Fig 4.17 shows 95% confidence intervals around mean values (average across all
image quality classes) and maximum and minimum values to show the range of
recognition obtained.
[Plot: All Image Quality Classes (n=225). x-axis: object type (Face, Flower, Lighthouse, Rubber Duck, Tree, Balloon, Buildings, Capsicum, Gorilla); y-axis: % Recognition (0-100%).]
Figure 4.17: Recognition rates for each object type
Similar to the results found in Section 4.2.3, the face image had the highest mean
recognition rate when averaged across all image quality classes. Again this agrees
with the literature where face recognition has been recognised as neurologically
programmed [97] and one of the foremost visual learning steps in the human baby
[29]. For low quality presentations, images that were highly recognised were the
face and flower (greyscale images) and face, lighthouse and flower (binary images).
The gorilla, duck and capsicum images were not recognised well at low quality.
Analysis of Variance (ANOVA) was performed on the data shown in Fig 4.17 to
compare the hypotheses:
H0: µ(Face) = µ(Flower) = … = µ(Gorilla)
H1: at least two of the means are not equal, at α = 0.05.
The test reports an F-value = (found variation of the group averages)/(expected variation of the group averages); if the H0 hypothesis is correct, the F-value is about 1. Interestingly, the test resulted in an F-value of 1.12, which was less than the critical F-value (2.07) for the number of degrees of freedom in the data (8, 72). This F-value has a significance of P = 0.36, and thus H0 could not be rejected.
Thus when combining recognition rates for all quality classes, there was no
significant difference based on image type.
4.3.4 Test Conclusions
This section described an experiment to further understanding of recognition of low quality images and determine the effect of the type of image on object recognition.
Recognition was found to vary depending on the type of image, but differences were
not significant when averaged across the different image qualities assessed in the
test. The face image had the highest mean recognition rate across all image qualities.
The number of respondents correctly identifying two thirds (6/9) of the 25x25 binary image set equalled the number correctly identifying two thirds of the 10x10 greyscale image set (4 of 25 in each case).
80% of respondents could correctly identify more than half of the 25x25 greyscale images, indicating that reasonable vision is achieved at this level. It must be remembered
that these test images were static and improved perception could be expected with
presentation of image sequences – ie. a movie versus a single image. Also a visual
prosthesis user would be able to move about to see how various objects interact.
There is an interplay between greyscale resolution and spatial resolution – for some
objects, higher recognition is achieved with increased greyscale over spatial
resolution, while the reverse applies for other objects.
For those objects which were highly recognised (above 55% averaged across all
image quality classes: face, flower, lighthouse, duck) it was possible to obtain a
recognition curve that increased as image quality increased. However for the
remaining images which were not well recognised (much guessing), recognition rates
both increased and decreased as image quality increased.
This work is extended in the next Chapter by:
1. Correlating several image statistics, such as fractal dimension, symmetry, number
of edges, and number of segments, with the images in these tests to determine if
recognition can be automatically predicted.
2. Constructing a visual information model for low quality images comprising
several dimensions, in addition to the actual object as considered in this chapter.
4.4 Chapter Conclusions
At the commencement of this chapter, two research questions were stated which can
now be answered following the experiments described in this chapter:
Q1: Although limited to low quality images anticipated from visual prostheses, can
recognition of some objects be achieved?
A1: Results indicated that greyscale images were easier to identify: 80% of
respondents could correctly identify more than half of a 25x25 greyscale image set,
while only 16% of respondents correctly identified more than half of a 25x25 binary
image set. Spatial resolution was more important for recognition performance than
greyscale resolution.
Recognition was found to vary depending on the type of image, with face images
being the most easily recognised. It would be beneficial to have device switchable
processing for different scenes. Further exploration of the idea of adjusting image processing depending on the scene type is presented in Chapter 6.
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
A2: Results indicated that there may be some benefit in pursuing ROI methods in
further detail, especially for very low (10 x 10) resolution images. For recognising
chairs and doorways – ROI and distance processing were best, while for human faces
– standard/Base Case and inverse processing were best. Access to a range of processing routines
was therefore advisable. Further assessment of the applicability of ROI techniques to
low quality image perception is presented in Chapter 7.
Chapter 5 Quantifying Information Content
5.1 Introduction
One of the desired functions of visual prostheses is to convey maximum scene information through the limited number of electrodes in implants. How can one tell if there is
maximum scene information in the conveyed image, or even, how does one quantify
the amount of visual information in an image? Can a metric be developed that can
rank images (like the two shown in Figure 5.1) for the amount of visual information
they contain?
Figure 5.1: Two images with different amounts of visual information content
This chapter attempts to answer these questions and describes in detail the
construction of a metric for visual information. It aims to answer the research
question proposed at the end of Chapter 3:
Q3: Can a metric be constructed for basic information required for the interpretation
of a visual scene at low image quality?
This knowledge would result in a new way to characterise low quality images on the
basis of providing maximum information. Images could be characterised on the ratio
of perceived information they convey (the human user's concept) to their representative information (a raw measure of intrinsic information, typically in bits).
Assuming that information content in images can be quantified, how can this
knowledge be used in the visual prostheses application? This chapter proposes that
image content can be manipulated in a way so that the resulting image to be
conveyed to implant electrodes contains maximum information. One means to do
this is using the Importance Map method described previously in Section 3.5.2.6.
This method involves the combination of several feature maps/images representing
attentional features to form an overall importance map. Using the knowledge of
what constitutes visual information, weights for each feature map (intensity, edges, colour contrast, etc.) could be adjusted iteratively to maximise the amount of
visual information in the resulting importance map.
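To make this concrete, a hedged Python sketch of such an iterative adjustment (the three feature maps, the coarse grid search and the edge-count scorer are all illustrative assumptions, standing in for whatever feature set and information measure are adopted):

    import numpy as np

    def combine(feature_maps, weights):
        # Weighted sum of feature maps (each assumed normalised to [0, 1]),
        # renormalised to form an overall importance map.
        total = sum(w * f for w, f in zip(weights, feature_maps))
        return total / max(float(total.max()), 1e-9)

    def information_score(importance_map, thresh=0.5):
        # Stand-in information measure: count of level transitions in the
        # thresholded map, a crude proxy for the number of edges.
        binary = (importance_map > thresh).astype(np.int8)
        return (np.count_nonzero(np.diff(binary, axis=0)) +
                np.count_nonzero(np.diff(binary, axis=1)))

    # Coarse search over weights for three feature maps (e.g. intensity,
    # edges, colour contrast) to maximise the resulting information score.
    maps = [np.random.rand(25, 25) for _ in range(3)]
    best = max(((w1, w2, 1.0 - w1 - w2)
                for w1 in np.linspace(0, 1, 11)
                for w2 in np.linspace(0, 1, 11) if w1 + w2 <= 1.0),
               key=lambda w: information_score(combine(maps, w)))
    print("best weights:", np.round(best, 2))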
This Chapter comprises three further sections:
Section 5.2: Identify perceived visual information content; construct 8 subjective data sets.
Section 5.3: Propose a metric using 1 of the 8 subjective data sets (both a metric specific to a particular image quality and one metric for all image qualities); validate the metric against the other 7 subjective data sets.
Section 5.4: Correlate visual information content with perception.
Section 5.2 describes a subjective experiment for perceived information content in
images. Subjective rankings are presented for eight visual ‘dimensions’. Patterns
among rankings and viewer preferences are noted to gain insight into subjective
visual information. The development of a robust metric is detailed in Section 5.3 and
predictive performance of the metric examined. Finally, Section 5.4 determines
whether high perceived information content in images actually corresponds to high recognition rates: is it true that an image with high information content can be recognised more easily at low quality than an image with low information content?
5.2 Perceived Information Content in Images
An experiment was conducted to rank the amount of inherent visual information in
images. In the experiments images were compared with each other to obtain a
ranking from most to least visually informative. In addition to using the results to
propose a metric to quantify visual information, an additional benefit is determining
how perceived information content changes as image quality decreases.
5.2.1 Images Used
Similar to the experiments discussed in Section 4.3 and shown below in Figure 5.2, there were 9 image quality classes tested. Original images were 256x256 pixels representing a range of scene types.
Figure 5.2: The nine image quality classes used in the tests (as in Figure 4.12: full greyscale and binary versions at 256x256, 25x25, 16x16 and 10x10 resolutions, plus a 256x256 edge representation)
A decreasing image quality scale was presented using spatial resolutions typical of
visual prosthesis designs (25x25, 16x16, 10x10) and reducing the grey levels from
full 256 levels of greyscale to binary. It was also of interest to expose the structure
of an image by presenting image edges. Reduced quality image sets across the nine
classes were prepared for each of the images shown in Figure 5.3.
5.2.2 Multidimensional Visual Information Model
Eight aspects/dimensions were explored to determine what impact, if any, they had on perceived information content. The first issue (Actual Objects) was assessed by comparing 7 images against each other, while the other dimensions were assessed with only 3 images each.
1. Actual Objects
7 images representing a range of different scene types: tree, flower, balloon,
lighthouse, face, buildings, capsicum. There was no implicit ranking concerning
visual information.
2. Number of Objects
3 images of increasing object number for similar scene type. The first image
contained one balloon, the second image three or four balloons and finally an image
containing many balloons. This set had an expected ranking of visual information in
proportion to the number of objects.
3. Angle of Object
3 images of a fruit bowl at 90° (top-down), 45° (angled) and 0° (side-on). Perceived
visual information may vary due to occlusion and distortion of objects.
Figure 5.3: Multidimensional Visual Information Model (continued over)
4. Distance to Object
3 images of a couple on a bicycle with decreasing distance to the couple’s faces. The
first image is a whole of body image, the second shows a half-body view and the
final image consists of head and shoulders only. This set had an expected ranking of
visual information with distance ie. higher visual information where the whole of the
scene can be viewed.
5. Connection between Image Objects
3 images of different couples with decreasing connection between the couple. One
image shows the cheeks touching, the next shoulders touching, while the final image
shows space between the couple. It was expected that images showing space
between the couple may indicate more of what was happening in the scene, and thus
information content would be greater with increased separation.
6. Image Detail
3 images of the same face with different edge detail. The first image shows the face
alone, the second includes a phone, while the third shows part of an additional face.
Information content was expected to be greater for images containing higher detail
ie. the additional face and phone images would be more visually informative.
Figure 5.3: Multidimensional Visual Information Model (continued over)
7. Contrast between Objects & Surround
3 images of capsicums with varying contrast. Green, red and yellow capsicums gave
varying contrast against a light background when viewed as greyscale images. It was
expected that images with higher contrast would be ranked as containing higher
information content.
8. Variety of Object Types
3 images comparing different object types. The first image contained an orange and
sunglasses, the second depicted an orange and a mug, while the third showed scissors
and a mug. There was no expected ranking of information content.
Figure 5.3: Multidimensional Visual Information Model
5.2.3 Test Method
Two questionnaire-based methods were used:
1) Images presented all on one page
An example test stimulus is shown in Appendix Section C.1 (this shows the
presentation of 7 images for assessing the “Actual Objects” test set) and Section C.2
(showing the presentation of 3 images for assessing the “Distance to Object” test
set).
For assessing the first issue (Actual Objects), the 7 images were presented on the
same page, and subjects were asked to rank from 1 to 7. When considering such a
large number of comparisons, this method gives strong responses for the extremes
(most and least visually informative) and weaker responses for the mid-lying images.
Thus a paired comparison (binary decision) test was performed on the 7 image set.
2) Paired comparison (binary decision) questionnaire test
An example test stimulus is shown in Appendix Section C.3.
Considerable effort was required in the design of the questionnaires to ensure variety
(avoid boredom), reduce the chance of learning effects from multiple viewings of the
same object, and to keep the questionnaires short (avoid fatigue). To achieve this, 9
booklet versions were produced (Books A, B, C etc) with the format shown in
Appendix Section C.4. Conditions for viewing the experiment (ambient illumination
etc) were not controlled.
5.2.4 Test Participants and Instructions
Participation was on a voluntary basis and comprised 271 Year 11 students and 11
mature age respondents. Invalid data resulted in the rejection of 57 questionnaires
(21%). Thus the final sample size was 225, representing sample sizes of 25 for each
of the 9 image quality classes.
Participants had no prior knowledge of the images. Booklet instructions stated that a
range of high quality and low quality images could be expected, and although the
low quality images might just appear as a range of blocks, they may be similar to
what a blind person might see with a bionic eye.
In assessing visual information using human viewers, it was anticipated that there
would be a varied understanding and interpretation of the concept of visual
information. In addition to the above comment that viewers were advised of the
bionic eye application, the following example question was provided to all viewers:
WHICH IMAGE APPEARS TO CONTAIN MORE INFORMATION?
In other words, which image could you answer the most questions about? (eg.
What is the scene? How many objects?) If you had to rely on only one of the
images to perform a task which would it be?
Beyond these comments viewers made their own interpretation of visual information.
5.2.5 Test Results
Eight factors were analysed. The first factor was determining the effect of the actual
object shown in the image on perceived information content. The seven different
objects were compared against each other. Two ranking schemes were used: 1)
images were presented all at the same time 2) paired comparison tests. Both methods
gave similar results and the rankings for each method are shown below. The table
shows images ranked from highest perceived information content (1) to lowest
information content (7) for the nine image quality classes and a ranking combining
all image quality classes.
Visual Information Ranking (1 = highest, 7 = lowest) - Images presented all at same time
All Quality Classes: Face, Flower, Tree, Buildings, Lighthouse, Capsicum, Balloon
256 F/G:   Face, Buildings, Tree, Lighthouse, Flower, Balloon, Capsicum
256 Bin:   Buildings, Face, Tree, Flower, Lighthouse, Capsicum, Balloon
256 Edges: Buildings, Face, Tree, Flower, Lighthouse, Balloon, Capsicum
25 F/G:    Face, Flower, Capsicum, Tree, Balloon, Lighthouse, Buildings
25 Bin:    Face, Flower, Tree, Buildings, Lighthouse, Capsicum, Balloon
16 F/G:    Face, Flower, Capsicum, Balloon, Tree, Lighthouse, Buildings
16 Bin:    Face, Flower, Tree, Buildings, Capsicum, Lighthouse, Balloon
10 F/G:    Face, Flower, Capsicum, Tree, Balloon, Buildings, Lighthouse
10 Bin:    Tree, Flower, Face, Buildings, Lighthouse, Capsicum, Balloon

Visual Information Ranking (1 = highest, 7 = lowest) - Paired comparison presentations
All Quality Classes: Face, Flower, Tree, Buildings, Capsicum, Lighthouse, Balloon
256 F/G:   Buildings, Face, Lighthouse, Tree, Flower, Balloon, Capsicum
256 Bin:   Buildings, Face, Tree, Flower, Lighthouse, Capsicum, Balloon
256 Edges: Buildings, Face, Tree, Flower, Lighthouse, Balloon, Capsicum
25 F/G:    Face, Flower, Capsicum, Balloon, Tree, Lighthouse, Buildings
25 Bin:    Face, Flower, Tree, Buildings, Capsicum, Lighthouse, Balloon
16 F/G:    Face, Flower, Capsicum, Tree, Balloon, Lighthouse, Buildings
16 Bin:    Face, Flower, Tree, Buildings, Lighthouse, Capsicum, Balloon
10 F/G:    Face, Flower, Tree, Capsicum, Balloon, Lighthouse, Buildings
10 Bin:    Tree, Face, Flower, Buildings, Capsicum, Lighthouse, Balloon
Table 5.1: Perceived information content for comparing 7 different object types
When considering the ranking for all quality classes (n=225) both methods gave the
following near identical ranking order:
Face > Flower > Tree > Buildings > Lighthouse/Capsicum > Balloon.
Ie. the face image has higher subjective information content than the flower etc.
The high quality Top 3 (256x256, 256x256_Binary, 256x256_Edge) were Face,
Buildings and Tree:
Figure 5.4: Images containing high information content for high quality images
The low quality Top 3 were Face, Flower and Tree (binary), and Face, Flower and Capsicum (greyscale):
Figure 5.5: Images containing high information content for low quality images
The effect of the other factors/dimensions on visual information is presented in Table 5.2 over. Patterns that emerged in the visual information ranking are noted along with the number of image quality classes (out of 9) with that ranking. Strong viewer preferences are defined as the pattern chosen by 70% or more of the sample. Although it may appear arbitrary, the 70% level was chosen from careful inspection of the data and the fact that in a normal distribution, 68% of cases fall within 1 standard deviation of the mean.
Table 5.2 lists, for each dimension: the dominant visual information ranking patterns (highest to lowest), whether any very strong viewer preferences existed (a pattern chosen by >70% of the sample), how many of the 9 image quality classes showed that pattern, and whether the original (256 F/G) images were ranked that way.

Number of Objects in Scene
1st dominant pattern: very strong preferences in 5/9 classes (16Bin 88%, 25Bin 92%, Edges 84%, 256Bin 96%, 256F/G 96%); 6/9 quality classes ranked this way (10Bin 64%, 16Bin 88%, 25Bin 92%, Edges 84%, 256Bin 96%, 256F/G 96%); originals ranked like this: yes.
2nd pattern: no strong preferences; 2/9 classes (16F/G 32%, 25F/G 36%); originals: no.
Comments: A strong pattern was clear in the results which confirmed expectations: the more objects in the scene, the higher the visual information. 6/9 image quality classes (two thirds of the quality classes) were ranked in this way. Five of the nine image quality classes had very strong viewer preferences for this ordering. For two low quality classes, the image of the single balloon was favoured highest, but preferences were not strong.
Angle of Object
1st dominant pattern: no strong preferences; 5/9 classes (10F/G 36%, 16Bin 32%, 16F/G 40%, Edges 36%, 256F/G 52%); originals ranked like this: yes.
2nd pattern: no strong preferences; 3/9 classes (10Bin 60%, 16Bin 32%, 25Bin 44%); originals: no.
Comments: The dominant pattern indicates highest information is in a top down view (90 degrees) of the fruit bowl, where almost the entire bowl circumference is visible. The contents of the bowl can be most easily seen in this top-down view. This pattern was ranked more visually informative for high quality and greyscale images. When limited to binary representation, the side on view (0 degrees) was ranked higher, perhaps due to a sharper profile of the bananas against the background.
Distance to Object
1st dominant pattern: no strong preferences; 5/9 classes (10F/G 32%, 16Bin 28%, 16F/G 32%, Edges 48%, 256F/G 68%); originals ranked like this: yes.
2nd pattern: no strong preferences; 4/9 classes (10Bin 40%, 16Bin 28%, 25F/G 64%, 256Bin 36%); originals: no.
Comments: The dominant pattern, including the Original image set, is ranked in increasing distance to the viewer, ie. more visual information where you can see more of the image and background. However the second pattern, which includes rankings for low quality binary images, indicates the closest view of faces contains more information.
Connection between image objects
1st dominant pattern: no strong preferences; 5/9 classes (10F/G 36%, 16Bin 32%, 16F/G 36%, Edges 68%, 256Bin 36%); originals: no.
2nd pattern: very strong preference in 1/9 classes (25F/G 80%); 4/9 classes (10Bin 36%, 25Bin 32%, 25F/G 80%, 256F/G 48%); originals ranked like this: yes.
Comments: The first dominant pattern indicates viewers rated the image with the most separation the most informative, perhaps because more of the occupation of the couple (card game) was visible, or maybe the background picture and table edge contributed highly. However the 2nd pattern included a strong (80%) preference in decreasing connection between the couple for 25x25 greyscale images.
Image Detail
1st dominant pattern: very strong preference in 1/9 classes (16F/G 80%); 4/9 classes (10Bin 52%, 10F/G 64%, 16F/G 80%, 25F/G 60%); originals: no.
2nd pattern: no strong preferences; 3/9 classes (16Bin 64%, 25Bin 40%, 256Bin 36%); originals: no.
3rd pattern: no strong preferences; 2/9 classes (Edges 48%, 256F/G 64%); originals ranked like this: yes.
Comments: A simple face with no surrounding clutter was most visually informative for the low quality images (1st & 2nd patterns). The influence of the mobile phone in our society may be reflected in the ranking of the high quality images (3rd pattern) where the face with the phone appeared to contain more information.
Contrast between objects and surround
1st dominant pattern: very strong preference in 1/9 classes (Edges 72%); 5/9 classes (10F/G 44%, 16F/G 64%, 25F/G 56%, Edges 72%, 256F/G 56%); originals ranked like this: yes.
2nd pattern: no strong preferences; 2/9 classes (10Bin 40%, 256Bin 64%); originals: no.
3rd pattern: no strong preferences; 2/9 classes (16Bin 44%, 25Bin 28%); originals: no.
Comments: Strong edges correspond with high perceived information content. The dominant pattern was for high quality and the low quality greyscale images. When greyscale is available, the stalk and capsicum form/contours may cause this ranking.
Variety of Object types
1st dominant pattern: no strong preferences; 4/9 classes (10Bin 40%, 10F/G 60%, 16Bin 48%, 16F/G 32%); originals: no.
2nd pattern: no strong preferences; 3/9 classes (25Bin 32%, 25F/G 40%, 256F/G 44%); originals ranked like this: yes.
3rd pattern: no strong preferences; 2/9 classes (Edges 52%, 256Bin 52%); originals: no.
Comments: The ordering of the dominant pattern is the same as presented to viewers on the questionnaire sheets. It is interesting that the lowest quality image classes make up this dominant pattern. Perhaps viewers of the low quality images were not able to make an intelligible distinction between images and ranked the images in order of appearance.
Table 5.2: Pattern analysis for information content rankings
5.2.6 Strong Visual Information Rankings
63 visual information rankings were obtained (7 additional factors/dimensions x 9
image quality classes). Dominant patterns (ie. the most frequently specified ordering
in terms of perceived information content) were identified for each case. The
strength of the dominant patterns (ie. the frequency with which that pattern was
specified by observers) ranged from 96% (24 of 25 respondents ranking images in
that order) to 28% (only 7 of 25 respondents). The number of cases in each ten-percentile class was as follows:
Strength and number of cases for dominant viewer patterns (63 in total), from strong to weak:
90-100%: 3
80-89%: 4
70-79%: 1
60-69%: 12
50-59%: 6
40-49%: 16
30-39%: 19
20-29%: 2
10-19%: 0
0-9%: 0
Table 5.3: Dominant visual information viewer preferences
It was of interest to further examine strong dominant viewer patterns in the data.
Eight of the 63 rankings had 70% or above consensus among viewers. Five of these
related to the number of objects in the scene.
Strong viewer preferences are shown over in Figure 5.6.
Number of Objects in Scene: 5 image quality classes: 16x16_Binary (88%), 25x25_Binary (92%), 256x256_Edges (84%), 256x256_Binary (96%), 256x256 (96%)
Closeness between image objects: 1 image quality class: 25x25 greyscale set (80%)
Image Detail: 1 image quality class: 16x16 greyscale set (80%)
Contrast between Objects & Surround: 1 image quality class: 256x256_Edge set (72%)
Figure 5.6: Strong viewer preferences (70% or above consensus among viewers) showing images ranked from highest to lowest perceived information content
5.2.7 Test Conclusions
Four conclusions can be drawn from Figure 5.6 and the perceived information
content experiment:
1. the more objects in the scene, the higher the visual information
2. the closer the objects in the scene, the higher the visual information
3. a simple face with no surrounding clutter was most visually informative at low
resolution levels
4. strong edges, arising from high intensity contrast, correspond with high perceived
visual information content
These viewer preferences now need to be checked against predictions from a visual
information metric which is undertaken in the next section.
5.3 Information Content Model Fitting
Described above are experiments to assess perceived information content in eight
visual dimensions. Subjective rankings from one of these eight dimensions (Actual
Objects) are now used to construct a metric to quantify visual information in images.
The metric is then validated against the subjective results of the other 7 dimensions.
5.3.1 Possible Image Attributes for a Visual Information Metric
After consideration of the literature on visual information content (refer Section 2.5)
15 image attributes were considered for the visual information metric:
1. file size
2. standard deviation
3. maximum standard deviation in 4 image quadrants
4. variance
5. maximum variance in 4 image quadrants
6. entropy
7. number of edges
8. number of segments
9. fractal dimension
10.-12. image internal similarity measures (three measures)
13.-15. image symmetry measures (three measures)
Descriptions are provided below for these attributes.
File Size
Size on disk (bytes) from Windows/DOS
Standard deviation
Standard deviation for image pixels using:

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$, where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i$ and n = number of elements

Maximum standard deviation in 4 image quadrants

$\max\{s_1, s_2, s_3, s_4\}$, where $s_1 \dots s_4$ are the standard deviations of the four image quadrants  (Equation 3)

Variance
Variance of image pixels = $s^2$; squaring this term emphasises different parts.

Maximum variance in 4 image quadrants

$\max\{s_1^2, s_2^2, s_3^2, s_4^2\}$, the maximum of the variances of the four image quadrants  (Equation 4)

Entropy
While this term is also used in the field of thermodynamics, its use here refers to the image processing context of describing the probability of each possible grey level occurring in an image (for greyscale images this is 0 to 255; for binary images, 0 and 255 only):

$h = -\sum_{g=0}^{255} p(g)\log_2 p(g)$, where $p(g) = \frac{(\text{pixels})_g}{256 \times 256}$

and $(\text{pixels})_g$ = no. of pixels at that greylevel.
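As a concrete illustration, a minimal Python sketch of this entropy measure (assuming the image is an 8-bit greyscale NumPy array; the function name is for illustration only):

    import numpy as np

    def grey_level_entropy(img):
        # p(g): probability of each grey level 0-255 occurring in the image.
        counts = np.bincount(img.ravel(), minlength=256)
        p = counts / img.size
        p = p[p > 0]                # zero-probability levels contribute nothing
        return float(-np.sum(p * np.log2(p)))

    # A binary image with equal numbers of 0 and 255 pixels gives h = 1 bit.
    img = np.zeros((256, 256), dtype=np.uint8)
    img[:, 128:] = 255
    print(grey_level_entropy(img))  # -> 1.0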
Number of edges
Sobel edge detection (Matlab version 6.5 Release 13) – horizontal and vertical edges.
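A rough Python analogue of this attribute (Matlab's edge() chooses its own threshold; the half-of-maximum threshold used here is an assumption):

    import numpy as np
    from scipy import ndimage

    def edge_count(img, frac=0.5):
        # Horizontal and vertical Sobel responses combined into a gradient
        # magnitude; pixels above frac * max magnitude count as edge pixels.
        g = np.hypot(ndimage.sobel(img.astype(float), axis=1),
                     ndimage.sobel(img.astype(float), axis=0))
        return int(np.count_nonzero(g > frac * g.max()))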
Number of segments
The image is segmented using quadtree decomposition; this segments an image on
the basis that a block is split into 4 smaller blocks if the maximum value in the block
minus the minimum value in the block is greater than a threshold (200/255 was
used). Block splitting continues until max value - min value is not greater than the
threshold. Blocks are then merged with neighbours if similar in value.
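A minimal sketch of the splitting step just described (the final neighbour-merging step is omitted; names are illustrative):

    import numpy as np

    def quadtree_leaves(img, threshold=200):
        # Split a block into 4 quadrants while its spread (max - min)
        # exceeds the threshold; return (row, col, size) leaf blocks.
        leaves = []
        def split(r, c, size):
            block = img[r:r + size, c:c + size]
            if size > 1 and int(block.max()) - int(block.min()) > threshold:
                half = size // 2
                for dr in (0, half):
                    for dc in (0, half):
                        split(r + dr, c + dc, half)
            else:
                leaves.append((r, c, size))
        split(0, 0, img.shape[0])     # assumes a square, power-of-2 image
        return leaves

    img = (np.random.rand(256, 256) * 255).astype(np.uint8)
    print(len(quadtree_leaves(img)))  # number of segments before merging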
Fractal Dimension
The Box Counting Method [45] was used for binary images.
The image is covered with a grid of square cells with cell size r. Fractal dimension is
determined from functions of cell size as shown in Figure 5.7.
r = cell side length; N(r) = number of cells containing a portion of the image. The count is performed over a range of box sizes: 128x128, 64x64, 32x32, ..., 1x1. The fractal dimension is the slope of log(N(r)) plotted against log(1/r).
Figure 5.7: Calculating Fractal Dimension for Binary Images
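A compact sketch of the box counting calculation, assuming a square binary image with a power-of-two side length and a non-empty foreground:

    import numpy as np

    def box_counting_dimension(binary_img):
        n = binary_img.shape[0]
        sizes, counts = [], []
        r = n // 2                    # 128, 64, ..., 1 for a 256x256 image
        while r >= 1:
            # Cover the image with cells of side r and count those that
            # contain any foreground pixel.
            cells = binary_img.reshape(n // r, r, n // r, r)
            sizes.append(r)
            counts.append(int(cells.any(axis=(1, 3)).sum()))
            r //= 2
        # Fractal dimension = slope of log N(r) against log(1/r).
        slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)),
                              np.log(np.array(counts)), 1)
        return slope

    img = np.random.rand(256, 256) > 0.5   # stand-in binary image
    print(box_counting_dimension(img))     # close to 2 for a filled image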
Fractal dimension for greyscale images (refer Figure 5.8) was determined from an analysis of a pixel's environment at different square size r [45]:
• Min & max grey values are determined within a square of size r and assigned to the central pixel, giving a 2D max function and a 2D min function for each square size r
• The difference in volume between the max and min functions is determined for the entire image, V(r):

$V(r) = \int_{x=1}^{256}\int_{y=1}^{256}\int_{z=f_1(x,y)}^{z=f_2(x,y)} dz\,dy\,dx$  (Equation 5)

where the boundaries in the x-plane and y-plane span the image, and for every (x,y) in the region z may extend from the lower (min) surface $f_1(x,y)$ to the upper (max) surface $f_2(x,y)$. The slope of ln(V(r)) plotted against ln(r) over a range of square sizes (r = 5, 7, 9, ...) then gives:

fractal dimension = 3 - (slope/2)

Figure 5.8: Calculating Fractal Dimension for Greyscale Images
Image similarity and symmetry
Three measures were used for image internal similarity (exact match across x and y axes) and image symmetry (mirror match across x and y axes):
1. Exact pixel match (Fig 5.9) - no sub-block analysis (same result operating on big
or small block)
• Exact pixel match across y-axis (same)
• Exact pixel match across y-axis (mirror)
• Exact pixel match across x-axis (same)
• Exact pixel match across x-axis (mirror)
Figure 5.9: Determining image similarity and symmetry – pixel matching
2. Shaded pixel difference between blocks - 5 level subblock analysis (objects
might be in a different position within a block)
3. Average pixel value - 5 level sub-block analysis
For the sub-block analysis used in measures (2) and (3) above, five levels were used as depicted below.

LEVEL:      1    2    3    4      5
CONFIG:     2x2  4x4  8x8  16x16  32x32
No. PIXELS: 128  64   32   16     8

Finer sub-block levels are weighted more heavily (weight = 128 / block-pixels).
Figure 5.10: Determining image similarity and symmetry – pixel difference and average value
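One plausible reading of measure (1) in Python (assuming even image dimensions; "same" compares the two halves directly, "mirror" reflects one half first):

    import numpy as np

    def exact_match_scores(img):
        # Fraction of exactly matching pixels between the two image halves.
        h, w = img.shape
        left, right = img[:, :w // 2], img[:, w // 2:]
        top, bottom = img[:h // 2, :], img[h // 2:, :]
        return {
            "sim_y": float(np.mean(left == right)),           # same, across y-axis
            "sym_y": float(np.mean(left == right[:, ::-1])),  # mirror, across y-axis
            "sim_x": float(np.mean(top == bottom)),           # same, across x-axis
            "sym_x": float(np.mean(top == bottom[::-1, :])),  # mirror, across x-axis
        }

    img = (np.random.rand(256, 256) * 255).astype(np.uint8)
    print(exact_match_scores(img))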
There were two approaches to developing the metric to then compare predictions
against subjective data:
1. develop a metric for each image quality class (25x25 greyscale, 256x256 binary
etc) to be applied only to images of that quality
2. develop a metric that is stable across all image quality classes (not just the ones
used in these tests).
5.3.2 Metric Development for a Specific Image Quality Class
Stepwise regression was used to search for the optimum subset of variables. The
procedure was based on sequentially introducing variables into a regression model
one at a time and testing the significance of all variables at each stage.
15 image attributes were considered for the visual information model. The addition
of any single variable from the above list will increase the regression sum of squares,
or SSR (amount of variation in y-values explained by the model) and reduce the error
sum of squares (variation about the regression line). The use of unimportant
variables reduces the effectiveness of the model by increasing the variance of the
estimated response.
The stepwise regression procedure, taken from [103] was as follows:
STEP 1
Simple linear regression was performed with each variable. The variable giving the
largest regression sum of squares, or largest value of R2, with significance (tested
using the F-statistic) was chosen as the initial variable, x1 say.
STEP 2
Each variable was inserted along with x1. The variable giving the largest significant
increase in R2, in the presence of x1, over the R2 found in step 1 was then selected as
x2.
This process was continued until the most recent variable inserted failed to induce a
significant increase in the explained regression. Such an increase was determined
using the F-test.
It was quite possible that a variable entering the regression equation at an early stage
might have been rendered unimportant or redundant because of relationships that
exist between it and other variables entering the later stages. Therefore at each stage
in which a new variable was entered in the regression equation through a significant
increase in R2 as determined by the F-test, all the variables already in the model were
subjected to F-tests in light of this new variable, and were deleted if they did not
display a significant F-value. The procedure was continued until a stage was reached in which no additional variables could be inserted or deleted.
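A hedged Python sketch of this forward/backward procedure (helper names are hypothetical; note that the thesis samples were very small, n = 7, so the partial F-tests have few degrees of freedom):

    import numpy as np
    from scipy import stats

    def sse(X, y):
        # Error sum of squares for a least-squares fit with intercept.
        A = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return float(resid @ resid)

    def partial_f_p(X_small, X_big, y):
        # Significance of the increase in explained regression when one
        # variable is added: F-test with (1, n - k - 1) degrees of freedom.
        n, k = len(y), X_big.shape[1]
        f = (sse(X_small, y) - sse(X_big, y)) / (sse(X_big, y) / (n - k - 1))
        return 1.0 - stats.f.cdf(f, 1, n - k - 1)

    def stepwise(X, y, alpha_in=0.05, alpha_out=0.05):
        selected = []
        while True:
            trials = [(partial_f_p(X[:, selected], X[:, selected + [j]], y), j)
                      for j in range(X.shape[1]) if j not in selected]
            if not trials or min(trials)[0] > alpha_in:
                break
            selected.append(min(trials)[1])
            # Re-test earlier variables in light of the new one; delete
            # any that no longer show a significant F-value.
            for j in list(selected[:-1]):
                rest = [k for k in selected if k != j]
                if partial_f_p(X[:, rest], X[:, selected], y) > alpha_out:
                    selected.remove(j)
        return selected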
Model development and comparisons of metric predictions with subjective data are
illustrated below for 2 image quality classes: 256x256 greyscale and 10x10 binary,
which represent the two extremes of the image quality classes tested.
5.3.2.1 Example 1: Construction of a model for the 256x256 greyscale quality image set
The table below shows sample (Pearson product-moment) correlation coefficients (=
Multiple R) for each variable along with the level where the model is significant as
tested by the F statistic.
                 y1 – Images presented at one time           y2 – Paired comparison tests
Variable         Ref   R      Level where significant       Ref   R      Level where significant
                              for F(1,7-1-1) test                        for F(1,7-1-1) test
File size        1     0.74   0.06                          A     0.63   0.13
SD               2     0.74   0.06                          B     0.71   0.08
Quad max SD      3     0.64   0.12                          C     0.58   0.18
Variance         4     0.74   0.06                          D     0.71   0.08
Quad max var     5     0.63   0.14                          E     0.56   0.20
Entropy          6     0.56   0.20                          F     0.43   0.33
Edges            7     0.90   0.01                          G     0.86   0.02
Segments         8     0.28   0.55                          H     0.20   0.68
Sim_pixels       9     0.08   0.87                          I     0.06   0.90
Sim_shaded       10    0.35   0.45                          J     0.23   0.63
Sim_mean         11    0.70   0.08                          K     0.65   0.12
Sym_pixels       12    0.10   0.84                          L     0.00   1.00
Sym_shaded       13    0.43   0.34                          M     0.28   0.54
Sym_mean         14    0.70   0.08                          N     0.68   0.10
Fractal dim      15    0.76   0.05                          O     0.70   0.08
Table 5.4: Correlation coefficients for variables considered for metric for 256x256 greyscale images
Initially a model will be developed for y1 data, where images were presented to
subjects at one time. This will be compared to y2 – Paired Comparison data. In the
discussion that follows, the reference numbers (column 2) or letters (column 5) for
the image attributes tabulated above are contained within angle brackets.
We start with Edges <7> – this variable has the highest regression sum of squares
SSR, correlation coefficient R and R2. The model is significant at the 0.005 level
(highly significant).
We now test all variables with Edges already in the model. We need to find the
largest increase in SSR, in the presence of Edges, over the SSR found for Edges
alone. ie. We need to find the variable xj, for which R(βj|β7) = R(β7,βj) - R(β7) is
largest, where R(β) denotes regression sum of squares for a model with variable β.
The combination of Edges and Entropy <7,6> is significant at the 0.006 level and has
the highest increase in SSR above the model with Edges alone. This SSR increase is
significant at the α=0.084 level (F(1,7-2-1) test). In order for this increase to be
significant at the α=0.05 level we would need the sample size to be 12, not 7. Now
when subjecting edges in the presence of entropy to a significance test ie. R(β7|β6),
P = 0.005, which is highly significant, so Edges can be retained. Thus looking at the
α=0.1 level of confidence Entropy can be included along with Edges.
We now require checking all other variables with Edges and Entropy already in the
model ie. R(βj|β7,β6) = R(β7,β6,βj) - R(β7,β6). The combination of Edges, Entropy
and Sym_mean <7,6,14> gives a model significance of 0.017 and the largest increase
in SSR. However this increased regression is only significant at the α=0.255 level of
confidence (F(1,7-3-1) test). Other variable models are significant at the following
levels of confidence: α=0.012 <14>, α=0.062 <7>, and α=0.008<6>. Thus if
considering variables up to the α=0.26 level then this third variable can be included.
For four variables, we require R(βj|β7,β6,β14) = R(β7,β6,β14,βj) - R(β7,β6,β14), and
need to check increases in SSR with a F(1,7-4-1) test. The largest increase in SSR is
with variable <9> - Sim_pixels. However this significance is very low: P = 0.466!
Other variable models are significant at α=0.017 <9>, α=0.034 <14>, α=0.098 <7>,
and α=0.025 <6> levels of confidence. Thus if considering variables up to the α=0.5
level then this fourth variable can be included (overall model significance = 0.067).
A check will also be made using the variable with the second highest R2 value as the
first term in the model. We start with Fractal Dimension <15>, which has a
correlation coefficient of 0.76 and an F statistic that is significant at the α=0.05 level.
5.3 Information Content Model Fitting
2 variable model: Fractal Dimension + Standard Deviation <15,2> gives the highest
increase in SSR, which is significant at the α=0.06 level.
3 variable model: Fractal Dimension + Standard Deviation + Sim_mean <15,2,11> is
significant at the α=0.195 level. Also Fractal Dimension + Standard Deviation +
Variance <15,2,4> is easier to compute and is significant at the α=0.215 level.
4 variable model: <15,2,4,14> is significant at the α=0.23 level, while <15,2,11,9> is
significant at the α=0.184 level of confidence.
5 variable model: <15,2,11,9,10> is significant only at the α=0.419 level of
confidence.
These models with low significance (α > 0.05, representing low confidence) are of limited use in developing a metric to predict information content.
For completeness a check is made on using the variable with the third highest R2
value as the first term in the model. We start with File size <1>, which has a
correlation coefficient of 0.74 and an F statistic that is significant at the α=0.055
level.
2 variable model: File size + Edges <1,7> is significant at the α=0.173 level, which
is clearly inferior to models proposed above. Thus there is no need for further
analysis down this path.
Several candidate models were proposed above with varying numbers of model
terms. A simple model is a consideration that cannot be ignored but it is not desired
to underfit the model. The Cp statistic can be used to consider compromise between
excessive bias incurred when one underfits the model (chooses too few model terms)
and excessive prediction variance when one overfits (has redundancies in the model).
The Cp statistic is a function of the total number of parameters (p) in the candidate
model and the error mean square:
$C_p = p + \frac{(s^2 - \sigma^2)(n - p)}{\sigma^2}$  (Equation 6)

where $\sigma^2$ is the error mean square for the most complete model, $s^2$ is the error mean square for the candidate model, and p is the number of model parameters.
Cp > p indicates a model that is biased due to being an underfitted model, while Cp ≈
p indicates a reasonable model.
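Equation 6 in code form. As a check, taking σ² as the error mean square of the most complete model in each family of the candidate table below (e.g. 164.32 with p = 5 for the first family) reproduces the tabulated Cp values:

    def mallows_cp(s2, sigma2, n, p):
        # Cp = p + (s^2 - sigma^2)(n - p) / sigma^2   (Equation 6)
        return p + (s2 - sigma2) * (n - p) / sigma2

    # e.g. the <7,6> model: s^2 = 190.23, p = 3, n = 7 observations,
    # sigma^2 = 164.32 from the most complete model <7,6,14,9>:
    print(round(mallows_cp(190.23, 164.32, 7, 3), 1))  # -> 3.6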
Candidate models are listed in the table below.

Variables      No. of parameters  R2    Error mean square  Model significance  Sig. of increased regression  Cp
7              2                  0.82  352.33             0.005               -                             7.7
7,6            3                  0.92  190.23             0.006               0.084                         3.6
7,6,14         4                  0.95  153.22             0.017               0.255                         3.8
7,6,14,9       5                  0.97  164.32             0.067               0.466                         5.0
15             2                  0.58  803.41             0.046               -                             20.5
15,2           3                  0.84  374.44             0.024               0.060                         7.8
15,2,4         4                  0.91  274.58             0.041               0.215                         5.8
15,2,4,14      5                  0.96  170.58             0.070               0.235                         5.0
15             2                  0.58  803.41             0.046               -                             38.2
15,2           3                  0.84  374.44             0.024               0.060                         14.4
15,2,11        4                  0.92  259.74             0.038               0.195                         9.0
15,2,11,9      5                  0.97  130.38             0.053               0.184                         5.7
15,2,11,9,10   6                  0.99  97.52              0.170               0.419                         6.0
Table 5.5: Candidate models for metric for 256x256 greyscale images
The best model is <7,6> - Edges and Entropy, which has a high R2, high overall
model significance, significant increase in regression over <7> and a Cp statistic that
indicates a reasonable model (not underfitted or overfitted). The simple model of
<7> - Edges has high significance but the Cp value indicates it is underfitted.
Thus a modelling function f is proposed such that:
Information Content = f(Edges, Entropy)
Actual Equation: Information Content = 295.6 – (0.047*Edges) – (23.43*Entropy)
Prediction performance will still be checked with the simpler underfitted model:
Information Content = f(Edges)
Actual Equation: Information Content = 195.5 – (0.052*Edges)
We also wished to compare the two types of experiment data, where images were
presented at one time (y1) against paired comparison data (y2). So for a model based
on y2- paired comparison experiments, we start with Edges <G> which has the
highest SSR and significant F statistic. The 2 variable model of Edges + SD <G,B>
gives the highest increase in regression. While the overall 2 variable model is
significant at P=0.047, the increase in regression over the 1 variable model is not
highly significant (P=0.407). The 2 variable model Edges + Entropy <G,F> which is
equivalent to the <7,6> model developed for y1 data, has an overall significance of
P=0.048 but the regression increase is not significant over the 1 variable model
(P=0.422).
Model Performance of 256x256 greyscale metric
The model was built using the responses from subjects viewing the Actual Objects
test set, containing 7 images of that particular image quality class. Subjective
responses were also collated in the experiment for other image sets for that quality
class. Example test stimuli for this data collection are shown in Appendix Section
C.2. The predictive performance of the model is now checked against rankings of
the additional test image sets. The additional test sets comprise only 3 images each.
Model predictions are presented over in Table 5.6.
Using the model Information Content = f(Edges, Entropy), the dominant ranking
as selected by 25 test respondents, was only predicted for 1/7 test cases: the number
of objects in scene. This was a very strong preference with 96% of respondents
selecting this order. Using the simpler model Information Content = f(Edges), the
dominant ranking was predicted for 2/7 test cases: {the number of objects in scene}
+ {contrast between objects & surround}.
An additional example will be presented outlining the construction of a metric
specifically for the 10x10 Binary image set to determine if there is similar poor
predictive performance.
Dominant patterns arising from tests: for each dimension, the table lists the number of test responses in the dominant pattern, whether the model correctly predicted that dominant ranking, the number of test responses matching the model, and the other quality classes whose dominant pattern was predicted by the model.

Number of Objects in Scene: 24/25 in dominant pattern; predicted by model: Y; 24/25 responses same as model; other classes with model-predicted dominant pattern: 6/9 (10B, 16B, 25B, 256E, 256B, 256F/G)
Angle of Object: 13/25; predicted: N; 3/25 same as model; 3/9 (25B, 16B, 10B)
Distance to Object: 17/25; predicted: N; 1/25 same as model; 4/9 (256B, 25F, 16B, 10B)
Connection between image objects: 12/25; predicted: N; 0/25 same as model; 0/9
Image Detail: 16/25; predicted: N, but same image ranked #1; 1/25 same as model; 1/9 (256B)
Contrast between objects and surround: 14/25; predicted: N; 8/25 same as model; 2/9 (16B, 25B)
Variety of Object types: 11/25; predicted: N; 3/25 same as model; 0/9
Table 5.6: Model Predictions of 256x256 Greyscale Image Set using model: f(Edges, Entropy)
5.3.2.2 Example 2: Construction of a model for the 10x10 Binary image set
Candidate models are listed in the table below.

Variables      No. of parameters  R2    Error mean square  Model significance  Sig. of increased regression  Cp
8              2                  0.88  337.29             0.002               -                             46.41
8,12           3                  0.96  127.10             0.001               0.038                         13.90
8,12,10        4                  0.99  70.66              0.003               0.133                         7.21
8,12,10,15     5                  1.00  31.91              0.009               0.164                         4.87
8,12,10,15,11  6                  1.00  34.13              0.082               0.522                         6.00
Table 5.7: Candidate models for a metric for 10x10 binary images
Model <8,12> - Segments & Sym_pixels has a significant increase in regression over
<8> but a high Cp statistic. Model <8,12,10,15> - Segments, Sym_pixels, Sim_shaded,
& Fractal dimension has a better Cp value but the increased regression over simpler
model is not as significant as Model <8,12>.
Using the model Information Content = f(Segments, Sym_pixels), the dominant
ranking as selected by 25 test respondents, was predicted for 4/7 test cases. The
more detailed model Information Content = f(Segments, Sym_pixels,
Sim_shaded, Fractal dim) predicted 3/7 cases.
Using the models developed for the 256x256 greyscale test set on the 10x10 Binary image sets yielded 4/7 correct predictions for Information Content = f(Edges, Entropy), and 3/7 correct predictions for Information Content = f(Edges).
That is, similar predictive performance was experienced between a model developed
specifically from data from that image quality class (10x10 binary) and data from
much higher quality images. Thus there is motivation to pursue the development of
one metric applicable across all image quality classes.
5.3.3 Information Content Metric for all Image Quality Classes
It was desired to develop a metric that was stable across all image quality classes, as
predictive power of specifically tailored metrics for a particular image quality class
appears arbitrary as described above. Again subjective rankings from one of the
eight visual dimensions (Actual Objects) were used to construct this global metric.
The metric was then validated against the subjective results of the other 7
dimensions.
Correlations between the 15 image attributes discussed above and perceived
information content rankings are shown over in Figure 5.11. The vertical axis
represents average correlations for the two different ranking schemes used: Images
presented all at once and Paired Comparison tests. The stepwise regression model
developed in the previous section systematically grouped attributes based on
increased significance of regression. Now it is desirable to choose one attribute
applicable for all image quality classes. From the plots of Fig 5.11, it is evident that
the “Edge” attribute features in the uppermost one or two correlation curves for both
binary and greyscale images at all spatial resolutions tested in the experiment. Thus
edges are proposed as a dominant indicator of information content across both low
and high image quality classes.
This determination supports Marr’s emphasis of zero crossing (edge) detection in
producing images of the external world [56]. The role of edges in scene recognition
and interpretation was discussed in Section 3.5.2.4.
A metric based on the number of edges in an image is now validated by comparing
metric predictions with perceived information rankings for the remaining 7 data sets.
63 dominant viewer rankings were compared – 7 visual dimensions x 9 image quality
classes.
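As an illustration of applying the edge-based metric, a small sketch that ranks an image set by Sobel edge count (random arrays stand in for the test images; the counter repeats the assumed thresholding of the earlier sketch):

    import numpy as np
    from scipy import ndimage

    def edge_count(img, frac=0.5):
        g = np.hypot(ndimage.sobel(img.astype(float), axis=1),
                     ndimage.sobel(img.astype(float), axis=0))
        return int(np.count_nonzero(g > frac * g.max()))

    # Rank a set of images from highest to lowest predicted information
    # content, for comparison against the dominant viewer ranking.
    images = {name: (np.random.rand(256, 256) * 255).astype(np.uint8)
              for name in ("image1", "image2", "image3")}
    ranking = sorted(images, key=lambda k: edge_count(images[k]), reverse=True)
    print(ranking)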
[Two plots - Greyscale Images and Binary Images: correlation (0 to 1) between each of the 15 image attributes (Filesize, Standard Dev, quad max SD, Variance, quad var, Entropy, Edges, Segments, Sim_pixels, Sim_shaded, Sim_mean, Sym_pixels, Sym_shaded, Sym_mean, Fractal dim) and perceived information content, plotted against spatial resolution (10x10, 16x16, 25x25, 256x256).]
Figure 5.11: Correlation between 15 image attributes and perceived information content.
The performance of the edge metric in predicting subjective dominant viewer
patterns is shown over in Table 5.9 and in summary form below in Table 5.8.
Strength and number of cases      Frequency of image with       Frequency of exact
for dominant viewer patterns      highest info content being    ranking being
(63 in total)                     predicted by metric           predicted by metric
90-100%: 3                        100%                          100%
80-89%: 4                         75%                           75%
70-79%: 1                         100%                          100%
60-69%: 12                        67%                           25%
50-59%: 6                         83%                           50%
40-49%: 16                        38%                           19%
30-39%: 19                        32%                           21%
20-29%: 2                         100%                          100%
10-19%: 0                         -                             -
0-9%: 0                           -                             -
Table 5.8: Summary of metric performance
Out of the 63 test cases examined, three cases had 90% or above consensus from
subjects viewing the sample set. For each of these cases, the metric successfully
predicted not only which of the 3 images had the highest information content (2nd
column above) but also the ranking order chosen by subjects (3rd column above).
Metric performance at weaker subject consensus levels are also shown.
There were several cases where the metric prediction in the 2nd column above was low. However, this was for cases where there was low consensus amongst the sample regarding the preferred ranking order: ie. if human subjects could not agree unanimously on a preferred ranking order, it was difficult to expect a metric to do so. What is important is whether the metric could predict those cases where there was strong agreement among the sample.
Predictive Metric Performance
A metric has been proposed from subject responses to one of eight visual dimensions (Actual Objects). Here it is validated against the other seven dimensions: Number, Angle, Distance, Connectivity, Detail, Contrast, Variety. For each quality class and dimension, the table lists the ranking of images 1, 2 and 3 chosen by the highest number of respondents, the number of respondents (out of 25) choosing it with the corresponding %, and whether the metric predicted it. "***Predicted" denotes the exact ranking of all 3 images being predicted by the metric; "#1 Predicted" denotes only the image with the highest info content (#1 in the rank order) being predicted. Strong viewer preferences (70% or more of respondents, ie. 18 or more out of 25, choosing that pattern) are marked [strong].

10 Bin: Number 3-2-1 (16, 64%, ***Predicted); Angle 3-1-2 (15, 60%, #1 Predicted); Distance 3-1-2 (10, 40%, ***Predicted); Connectivity 1-2-3 (9, 36%); Detail 1-3-2 (13, 52%, ***Predicted); Contrast 3-2-1 (10, 40%); Variety 1-2-3 (10, 40%)
10 F/G: Number 2-3-1 (12, 48%, #1 Predicted); Angle 1-2-3 (9, 36%); Distance 1-2-3 (8, 32%); Connectivity 2-1-3 (9, 36%); Detail 1-3-2 (16, 64%); Contrast 2-1-3 (11, 44%); Variety 1-2-3 (15, 60%)
16 Bin: Number 3-2-1 (22, 88%, ***Predicted) [strong]; Angle 3-1-2 (8, 32%, #1 Predicted); Distance 3-1-2 (7, 28%, ***Predicted); Connectivity 3-1-2 (8, 32%, #1 Predicted); Detail 1-2-3 (16, 64%, #1 Predicted); Contrast 1-2-3 (11, 44%, #1 Predicted); Variety 1-2-3 (12, 48%)
16 F/G: Number 1-2-3 (8, 32%); Angle 1-2-3 (10, 40%); Distance 1-2-3 (8, 32%); Connectivity 3-1-2 (9, 36%); Detail 1-3-2 (20, 80%) [strong]; Contrast 2-1-3 (16, 64%, #1 Predicted); Variety 1-2-3 (8, 32%)
25 Bin: Number 3-2-1 (23, 92%, ***Predicted) [strong]; Angle 3-1-2 (11, 44%, #1 Predicted); Distance 2-1-3 (8, 32%); Connectivity 1-2-3 (8, 32%); Detail 1-2-3 (10, 40%, ***Predicted); Contrast 1-2-3 (7, 28%, ***Predicted); Variety 1-3-2 (8, 32%)
25 F/G: Number 1-2-3 (9, 36%); Angle 2-3-1 (12, 48%); Distance 3-2-1 (16, 64%); Connectivity 1-2-3 (20, 80%, ***Predicted) [strong]; Detail 1-3-2 (15, 60%, #1 Predicted); Contrast 2-1-3 (14, 56%, #1 Predicted); Variety 1-3-2 (10, 40%)
256 Edges: Number 3-2-1 (21, 84%, ***Predicted) [strong]; Angle 1-2-3 (9, 36%); Distance 1-2-3 (12, 48%); Connectivity 3-1-2 (17, 68%); Detail 3-1-2 (12, 48%); Contrast 2-1-3 (18, 72%, ***Predicted) [strong]; Variety 2-3-1 (13, 52%)
256 Bin: Number 3-2-1 (24, 96%, ***Predicted) [strong]; Angle 3-2-1 (9, 36%); Distance 3-2-1 (9, 36%, ***Predicted); Connectivity 3-1-2 (9, 36%, ***Predicted); Detail 1-2-3 (9, 36%, ***Predicted); Contrast 3-2-1 (16, 64%, ***Predicted); Variety 2-3-1 (13, 52%, ***Predicted)
256 F/G: Number 3-2-1 (24, 96%, ***Predicted) [strong]; Angle 1-2-3 (13, 52%); Distance 1-2-3 (17, 68%); Connectivity 1-2-3 (12, 48%); Detail 3-1-2 (16, 64%, #1 Predicted); Contrast 2-1-3 (14, 56%, ***Predicted); Variety 1-3-2 (11, 44%)
Table 5.9: Predictive performance of metric proposed for all image qualities
Of main interest was metric performance for strong dominant viewer patterns in the
data. Eight of the 63 rankings had 70% or above consensus among viewers. These
were mentioned above in Section 5.2.6, and are reproduced below in Figure 5.12
with the inclusion of metric performance.
Number of Objects in Scene
5 image quality classes: 16x16 Binary (88%), 25x25 Binary (92%), 256x256 Edges (84%), 256x256 Binary (96%), 256x256 Greyscale (96%)
All 5 cases predicted by metric?: Yes

Closeness Between Image Objects
1 image quality class: 25x25 Greyscale set (80%)
Single case predicted by metric?: Yes

Image Detail
1 image quality class: 16x16 Greyscale set (80%)
Single case predicted by metric?: No
(Metric prediction: phone > 2 faces > single face)

Contrast Between Objects & Surround
1 image quality class: 256x256 Edge set (72%)
Single case predicted by metric?: Yes

Figure 5.12: Metric performance for strong viewer preferences (70% or above consensus among viewers), showing images ranked from highest to lowest perceived information content
The visual information metric predicted 7 of the 8 strong viewer preferences (70% or
above consensus level). Viewers of the 16x16 greyscale Image Detail set ranked a
simple face as containing most visual information, while the metric ranked the image
of the phone and two faces ahead of the single face. The familiarity and strong
recognition of the human face at low levels of image quality may cause viewers to
select it over others containing unrecognisable blobs.
The metric was found to work best with binary images, which are expected from at
least early prototype designs. (Limited greyscale may be possible by modulating
stimulus amplitude, frequency and pulse duration as discussed in Section 3.5.2.2).
The number of ranking cases where the metric was able to predict the image with the
highest information content is shown in Table 5.10 below. There are a total of seven
ranking cases for each image quality class, corresponding to each visual dimension
explored.
10x10 Binary set: 4/7      10x10 Greyscale set: 1/7
16x16 Binary set: 6/7      16x16 Greyscale set: 1/7
25x25 Binary set: 4/7      25x25 Greyscale set: 3/7
256x256 Binary set: 6/7    256x256 Greyscale set: 3/7
256x256 Edge set: 6/7
Table 5.10: The number of correct metric predictions of images with the highest information content
The metric’s weaker performance on greyscale images may be another reason why
its prediction for the 16x16 greyscale Image Detail set did not agree with the
ranking chosen by 80% of viewers. Table 5.10 shows that for 16x16 greyscale
images, the metric was successful in predicting the image with the highest
information content in only 1 out of 7 cases, whereas for 16x16 binary images the
prediction was correct for 6 out of 7 cases. It should be remembered that the
strength of the dominant patterns on which metric performance is assessed ranges
from 96% down to 28%. At high levels of viewer consensus, the metric is accurate
in predicting images with the highest information content, and is thus considered
acceptable for this application.
Therefore, it can be stated that visual information content in images can be quantified,
and a mechanism for achieving this with a reasonable level of performance has been
proposed here. However, does maximising information content in low quality
images result in enhanced perception of that image? In order to answer this question,
it is necessary to analyse the relationship between low quality image recognition and
information content in images; ie. is the measure for information content an adequate
pointer to how well an image might be recognised? This relationship is explored in
the next section.
5.4 Correlations Between Recognition Rate And Perceived Information Content
It was desired to determine if there was any relationship between recognition rates
and the amount of visual information as perceived by viewers.
Previous experiments described in Section 4.3 assessed perception performance for
these same images (ie. the ability of these images to be correctly recognised). The
subjects were first asked to describe the objects (eg. Appendix Section B.1) and then
to assess the images for the amount of visual information they contained (refer test
stimulus C.1). The questionnaire booklet design in Appendix Section C.4 includes a
section labelled “PART 3 Check if correlated with recogn”, which relates to this
correlation of information content with recognition. As with other aspects of this
experiment, care was taken in the booklet design to reduce learning effects, fatigue
and boredom.
Relationships between correct object recognition and subjective information content
scores were obtained for each image quality class. For example, Figure 5.13 shows
the relationship for 25x25 binary Paired Comparison experiments. The horizontal
axis shows a subjective score for visual information developed from the numeric
ranking scheme used (higher numeric score = higher perceived information content).
[Figure 5.13 plots recognition rate (proportion of the n=25 subjects correctly identifying the object, 0 to 1) against the subjective score for visual information content (0 to 150).]
Figure 5.13: Example relationship between recognition and information content (25x25 Binary Paired Comparison data)
The significance of these relationships was then assessed. Linear regression models
for each quality class were developed for two series of data:
1. where images were presented at one time
2. paired comparison data
The significance of the models and correlation coefficients appears in Table 5.11
below.
Image Quality Class | Images presented at one time: Correlation Coeff. R, Significance (F(1,7-1-1) test) | Paired Comparison: Correlation Coeff. R, Significance (F(1,7-1-1) test)
10x10 Binary    | 0.76, 0.05 | 0.76, 0.05
10x10 Greyscale | 0.54, 0.21 | 0.36, 0.43
16x16 Binary    | 0.69, 0.09 | 0.71, 0.07
16x16 Greyscale | 0.70, 0.08 | 0.61, 0.15
25x25 Binary    | 0.90, 0.01 | 0.85, 0.01
25x25 Greyscale | 0.75, 0.05 | 0.81, 0.03
256x256 Edge    | 0.70, 0.08 | 0.66, 0.10
256x256 Binary  | 0.69, 0.09 | 0.73, 0.06
Table 5.11: Correlation coefficients between recognition rate and perceived information content
There was some evidence of correlation between ranked information content and
recognition rates, with significance levels ranging from P=0.05 to P=0.1 for all but
the 10x10 greyscale image set. Thus if the information content of an image is
maximised (by maximising the number of edges), enhanced perception is expected.
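The regression analysis behind Table 5.11 can be reproduced with standard tools. The following is a minimal sketch using SciPy’s linear regression on illustrative (not actual) values for the seven images of one quality class; for a simple linear model the p-value reported by linregress corresponds to the significance test used above.

```python
from scipy import stats

# Subjective information scores and recognition rates for the seven
# images of one quality class (illustrative values, not the thesis data).
info_score = [20, 35, 55, 70, 90, 110, 130]
recog_rate = [0.10, 0.20, 0.45, 0.50, 0.70, 0.80, 0.90]

result = stats.linregress(info_score, recog_rate)
print(f"R = {result.rvalue:.2f}, P = {result.pvalue:.3f}")
```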
5.5 Chapter Summary
In the field of low quality vision, there is a need for delivering maximum scene
information to a limited number of display electrodes/pixels. In this chapter a
method is proposed to enhance recognition using importance maps weighted to
maximise the “information content” in the resulting importance map.
An experiment was described to quantify the term information content. 15 image
attributes were correlated with subjective rankings of visual information. Initially,
metrics were developed tailored to a specific image quality (eg. a 10x10 binary
metric). However, their inconsistent predictive performance led to the construction of a
metric that was stable across a wide range of image quality classes. The number of
edges in an image was found to be a dominant indicator of perceived information
content. An edge metric was tested on additional subjective data and found to be
appropriate in assessing information content. Finally it was shown that subjective
information content was significantly related to object recognition.
Thus it was possible to construct a model for basic information required for the
interpretation of a visual scene at low image quality.
This finding can now be applied to generating importance maps containing higher
information content. Chapter 8 compares such a method with others to determine
preferred presentation options.
Chapter 6 Scene Specific Imaging
6.1 Overview
As discussed in preceding chapters of this thesis, in order to make best use of the
limited number of electrodes in visual prostheses, it is proposed to first process
images to extract more information from the scene. One of the conclusions of the
preliminary experiments discussed in Chapter 4 was that it might be beneficial to
have device switchable processing for different scenes. This chapter then aims to
answer the research question:
Q4: Should the processing techniques be adjusted depending on the scene type?
Characteristics of several scene types are outlined in this chapter and then
categorised in image processing terms. A simple experiment involving 20 normally
sighted viewers tests if there is some benefit in scene-dependent processing to deliver
enhanced perception of low quality images.
6.2 Characteristics of Simple Scenes
This section lists characteristics of simple scenes that a patient fitted with a visual
implant may experience.
6.2.1 Office
Many office environments have fluorescent lighting. Work spaces might be defined
by partitions that are up to two metres high or floor-to-ceiling walls, or a
combination of walls and partitions. A person’s working range would be of the order
of one metre, but a visual range of five metres would be useful. Objects in the
environment may be located on a horizontal desk surface (phone, computer,
documents). Objects could be distinguished by intensity and colour contrast on the
desk. The rest of the office would probably not have much colour except for office
plants, pictures, and people. The user is mostly stationary in this environment.
6.2.2 Home
Although the viewer is familiar with the home environment, potential hazards
abound, including room and cupboard doors left open and obstacles on the floor.
The kitchen is usually comprised of a (reflective) sink, benches and cupboard.
Bedrooms would contain a bed, cupboard and dresser. Chairs and possibly a
television are found in a lounge room, while a dining room contains table and chairs.
Bathrooms may contain a bathtub, vanity unit and toilet.
6.2.3 Street
The viewer is likely to be moving in a streetscape environment, either as a vehicle
passenger or on foot. Possible obstacles include posts, street signs, curbs, people,
and construction works (including holes and fencing). Many edges are contained in
this man-made (constructed) environment, including footpaths and shopfronts. There
is limited colour variation, with the predominant building and ground colours being
greys, and pedestrians and parked cars represented by coloured regions. There is a
combination of natural lighting (sunlight) and shaded portions. The viewer may
require a working range of five metres (when moving) and a visual range of up to
fifty metres.
6.2.4 Outdoors
Outdoor environments contain natural scenes, such as trees, plants, grasses, bushes,
seats, beach, ocean. There are limited edges and natural lighting (sunlight). For park
areas there is limited colour variation (mostly greens) and alternating intensity from
shade and sun patches. For beach scenes there is high intensity glare, with many
reflections from water and white sand. Beach environments also usually have few
colours, with blue ocean and white or yellow sand. Smaller coloured regions in the
outdoor environment may correspond to people or signs. A viewer may require a
working range up to ten metres and a visual range as much as 100 metres.
6.2.5 Head and Shoulders
A special case of scene type exists for situations where the viewer is engaged in
conversation or communication with others. In these close contact situations the
viewer requires an image of the head and shoulders alone. The visual range need not
be more than two metres. Both the scene and viewer are mostly stationary. Faces
and other skin areas are detected within a range of pink hues, while the hair may be
darker (eg. browns and black).
6.2.6 Café/Restaurant
This scene type often has indoor lighting conditions (fluorescent/incandescent).
There are usually tables approximately one metre high separated by small spaces
(navigation gaps). Tables could be circular or rectangular. Chairs are positioned
around the tables with chair backs typically higher than the surface of the table.
Cutlery and plates may lie on the table, with glasses, cups or jugs projecting up to
200mm above the table surface. Cutlery and glass on the table are highly reflective.
A payment area may comprise a desk at waist-chest height with a horizontal top for
signing cheques etc. A viewer would need to be able to locate toilets and the café
exit. The exit is usually signed and may be a two metre high large rectangle of
different intensity, perhaps with a door.
6.2.7 Public Toilets
Gender differentiation is required for the viewer. The entrance to a public toilet is
often through doors, with a handle at waist height on the left or right of the door. A
90 degree or 180 degree turn is then made to the left or right, along a floor to ceiling
wall which is often tiled. Cubicles are built out from the wall and may be timber or
rendered concrete with doors containing a lock at waist height. There may be a
urinal, consisting of either separate units or a continuous unit along a wall sometimes
with a step up. The toilets should contain a wash basin with soap or a liquid soap
dispenser located above the basin. There may be a hand dryer/towel dispenser with a
waste bin located below.
6.3 Image Processing targeted to Scene Type
It is proposed to present maximum information to implant electrodes targeted to a
user’s environment. This can be achieved by applying varying processing routines to
the input images.
In this section the scene types mentioned in the previous section are described in
image processing terms to identify which image processing routines to apply. Table
6.1 shows these scene type descriptions.
SCENE | Motion in scene | Motion of viewer | Dominant colours | Colour variation | Number of edges | Edge types | Number of regions | Working range | Visual range
Office | Low | Low | White, pastel | Low-med | High | Straight | Med | 1m | 5m
Home | Low | Med | All | Low-med | Med | Straight, curved | Med | 2m | 5m
Street | High | High | Grey | Med | High | Straight | High | 5m | 50m
Outdoor | Low | Med | Green, white, yellow, blue | Low | Low | Curved | Low | 10m | 100m
Head & Shoulders | Low | Low | Pink hues | Low | Med | Curved | Low | 1m | 2m
Café | Med | Low | Silver | Med | High | Straight, curved | High | 1m | 2m
Toilets | Low | Med | White, silver | Low-med | High | Straight | Med | 2m | 4m
Table 6.1: Image processing descriptors of different scene types
In order to translate these scene descriptions into processing algorithms the scene
first needs to be categorised. There have been some advancements in the area of
automatic scene categorization. For example Chernyak and Stark [15] have
developed a model for sequential knowledge acquisition using Bayes’ theorem
(probability based). Segment features, such as average colour, aspect ratio and
position are obtained from training sets of images covering scene categories (eg.
“office”, “construction”, “children playing”). The algorithm then attempts to guess
the scene category of a test image. Although useful for robot applications designed
to minimize human intervention, automatic categorization would not be essential for
visual prostheses applications. Human users would presumably be aware of their
environment, and would be able to manually select the scene type suited to their
surrounds.
Once the scene type is known, it is proposed to apply context/scene dependent
importance weighting to the image. Chapter 5 described a means to manipulate
image content using Importance Weighting to produce higher information content in
the image conveyed to prosthesis electrodes. This chapter proposes a similar method
of image manipulation again using the Importance Map method (refer Section
3.5.2.6). However, here it is proposed that weights for feature maps are selected
according to their scene type, and proposed feature weights for several scene types
are shown below in Table 6.2. Rather than all features having the same effect on the
resulting importance map, we propose to vary their contribution depending on the
scene type. The percentage weights shown in Table 6.2 indicate the weight to apply
to that feature map to produce the resulting importance map. A 50% level would
indicate a neutral leaning/bias.

ATTENTIONAL FEATURE: Closeness (foreground 100% vs background 0%) | Intensity Contrast (lots of contrast 100% vs little contrast 0%) | Shape (long & skinny 100% vs broad & round 0%) | Size (large regions 100% vs small regions 0%) | Viewing Area (most viewing in central view 100% vs periphery 0%)

SCENE | Closeness | Intensity Contrast | Shape | Size | Viewing Area
Office | 90% | 70% | 90% | 25% | 100%
Home | 70% | 90% | 50% | 50% | 95%
Street | 25% | 20% | 100% | 50% | 50%
Outdoors | 25% | 80% | 10% | 100% | 50%
Head & Shoulders | 100% | 80% | 50% | 25% | 100%
Café | 100% | 20% | 50% | 50% | 80%
Toilets | 90% | 30% | 10% | 50% | 80%
Table 6.2: Attentional feature weights for each scene type
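As a concrete illustration of this scene-weighted combination, the following is a minimal sketch assuming five normalised feature maps (values 0 to 1) ordered as in Table 6.2; the dictionary simply transcribes the table’s weights, and the final renormalisation step is an assumption about how the weighted sum is scaled back to the 0–1 range.

```python
import numpy as np

# Feature weights per scene type, transcribed from Table 6.2:
# (closeness, intensity contrast, shape, size, viewing area)
SCENE_WEIGHTS = {
    "office":         (0.90, 0.70, 0.90, 0.25, 1.00),
    "home":           (0.70, 0.90, 0.50, 0.50, 0.95),
    "street":         (0.25, 0.20, 1.00, 0.50, 0.50),
    "outdoors":       (0.25, 0.80, 0.10, 1.00, 0.50),
    "head_shoulders": (1.00, 0.80, 0.50, 0.25, 1.00),
    "cafe":           (1.00, 0.20, 0.50, 0.50, 0.80),
    "toilets":        (0.90, 0.30, 0.10, 0.50, 0.80),
}

def importance_map(feature_maps, scene):
    """Combine five normalised feature maps into one importance map
    using the scene-specific weights of Table 6.2, then renormalise
    so the result again lies in the range 0..1."""
    weights = SCENE_WEIGHTS[scene]
    combined = sum(w * np.asarray(fm, dtype=float)
                   for w, fm in zip(weights, feature_maps))
    peak = combined.max()
    return combined / peak if peak > 0 else combined
```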
6.4 Subjective Tests for Scene Weighted Processing
It was desired to test the proposal that context or scene weighted processing can
improve perception of low quality images. The images shown in Figure 6.1 taken
from the test stimulus in Appendix Section D.1 were shown to 20 normally sighted
volunteers. Two images were shown, one representing an "outdoor" scene, and the
other an "office" scene. Low quality versions of the images were presented
alongside, representing quality levels typical of current prosthesis prototypes (25x25
spatial resolution, binary images). Booklet design is shown in Appendix Section
D.2. Four low quality versions of the original were shown:
1. subsampled to 25x25 and binarised – this represents the standard or base case
level of image processing present in most implant designs (no importance
processing);
2. subsampled, importance processing with features weighted equally, then
binarised;
3. subsampled, importance processing with features weighted according to the
correct scene type (eg. for the lighthouse image, weights selected for "outdoor"
scenes in Table 6.2), then binarised;
4. subsampled, importance processing with weights applied corresponding to a
different scene type (eg. applying "office" weights to lighthouse image), then
binarised.
Figure 6.1: Visual stimuli used to gauge perception of low quality images. A = original 256x256 image; B = subsampling to 25x25 binary; C = importance mapping with all feature weights equal; D = importance mapping with weights selected for the “outdoor” scene type; E = importance mapping with “office” weights
Participants were asked to rank the images for how best (ie. most informatively) they
represented the original scene. The images nominated as best representing the
originals were as per Table 6.3.
Lighthouse image: “outdoor” weights 10/20 (50%); “office” weights 7/20 (35%); no processing 3/20 (15%); equal weights 0/20 (0%)
Chair image: no processing 12/20 (60%); “office” weights 6/20 (30%); equal weights 2/20 (10%); “outdoor” weights 0/20 (0%)
Table 6.3: Preferred ranking for image representation
For the lighthouse image, half of the sample size (10/20) chose the “outdoor”
weighted image as best representing the original scene. This is in line with the
expectation that improved perception may be obtained with processing images with
respect to scene type. However for the chair image, most respondents found the
image with no importance processing was better at representing the original image.
Feedback from participants suggested that this image was closest in grey level values
to the original, and if the importance-processed images had been inverted (ie. a black
chair on a white background) they would have chosen that image. For a subsequent
thesis experiment described in the next chapter, the importance-processed images
were inverted to be in a similar form to the original.
It should be noted that the image inversion recommendation arises from experiments
with sighted viewers with sophisticated expectations. This may not be the same for
visually impaired persons with a simplified understanding of the world. Other
criteria relating to electrode stimulation may dominate once these systems are more
common. This might include always stimulating the smaller number of electrodes
irrespective of dominant greylevel information to avoid long term tissue damage, or
sharing of electrodes to obtain longer life from the electrode arrays.
6.5 Chapter Summary
By first processing images presented to implant electrodes, it may be possible to
provide enhanced presentation compared with subsampling alone. In this chapter, some of the
characteristics of simple scenes have been listed and described in image processing
terms. A simple experiment was conducted to determine the effect of processing
images depending on the scene type. Expectations were confirmed for a test image
containing an outdoor scene, but not for an office scene. Image inversion to best
match an original high quality scene is recommended.
This chapter aimed to address the following research question:
Q4: Should the processing techniques be adjusted depending on the scene type?
Experiment results show that improved perception may be obtained with processing
images with respect to scene type. Scene dependent importance mapping is a
powerful tool to use in the automatic optimisation of low quality images for human
viewers. The next chapter compares this method with others to determine preferred
presentation options.
Chapter 7 A comparison of ROI methods for low quality images
7.1 Overview
So far the thesis has reviewed image processing strategies for presenting the most
useful information to implanted users, given the severe information loss imposed by
low quality displays.
The previous two chapters described methods of manipulating image content based
on a region of interest image processing technique known as Importance Maps.
Chapter 5 concerned adjusting feature weights so that the resulting Importance Map
contains maximum scene information. Chapter 6 proposed that feature weights are
set for the particular scene type. Experiments described here in Chapter 7 aim to
assess these proposals along with other methods to determine which method best
helps users move through a scene.
The work in this chapter aims to extend upon the findings of Chapter 4 concerning
the research question:
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
It is anticipated that ROI processing will trim away unnecessary information
resulting in improved perception.
The region-of-interest/importance framework of the processing used in this Chapter
was described in Section 3.5.2.6. The following section describes the comparison
experiment, showing the types and range of images used and instructions given to
participants. Results follow which indicate a clear trend. Two experiments were
conducted:
1. Presentation of entire region-of-interest processed image
2. Presentation of only the salient area found from ROI processing – a digital zoom.
7.2 ROI Processing applied to Entire Image
7.2.1 Image Preparation
Images used in the tests were prepared as shown in Fig 7.1.
[Figure 7.1 is a flowchart: a 256x256 test image is processed by one of methods 1) – 6) described in the text below, resized to 25x25 (nearest neighbour), histogram equalised (one test set only), thresholded at the 128 grey level, and possibly inverted to appear like the original.]
Figure 7.1: Image preparation
Six image processing methods were tested: four variations of Importance Mapping,
edge detection and a non-processed ‘base-case’. In all methods the final image was
Nearest-Neighbour resized to 25x25 spatial resolution which is representative of
electrode numbers in prosthesis prototypes. One test case had greylevels equalised
before thresholding at the 128 grey level, while a second test set had only
thresholding at the 128 level with no histogram equalisation. Histogram equalisation
spreads the grey levels out across the full greyscale range and it is intuitive to apply
this equalisation to use the full dynamic range obtainable from the image. Other
methods of histogram transformations (eg stretch, uniform) may be considered, but
as the eventual image is reduced to very few shades, the differences are unlikely to
be influential. Histogram equalisation depends on illumination and object shades of
grey and can introduce spurious shadings in the thresholded image which do not
actually represent image objects (see Fig 7.2). Hence it was desired to find if there
were any differences in preferred processing algorithm using histogram equalisation.
Figure 7.2: Shaded areas do not necessarily correlate with scene objects in images that have had grey levels equalised (panels: original, followed by 25x25 thresholded versions; note the right-most image).
Finally, the thresholded images were compared with the original 256x256 test image
and inverted if necessary to most closely match the original grey level. This
inversion was the result of recommendations made after previous experiments
described at the close of Chapter 6.
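The preparation pipeline of Fig 7.1 is simple enough to sketch in full. The following is a minimal sketch assuming Pillow and NumPy; the equalisation step is the standard cumulative-histogram mapping, and the inversion decision is left to the caller (in the experiments it was made by comparison with the original image).

```python
import numpy as np
from PIL import Image

def prepare_stimulus(path, equalise=True, invert=False, size=25):
    """Sketch of the Fig 7.1 pipeline: load a greyscale test image,
    resize to size x size with nearest-neighbour sampling, optionally
    histogram-equalise, threshold at grey level 128, and optionally
    invert to match the polarity of the original."""
    img = Image.open(path).convert("L").resize((size, size), Image.NEAREST)
    pixels = np.asarray(img, dtype=np.uint8)
    if equalise:
        # spread grey levels so the cumulative histogram is linear
        hist = np.bincount(pixels.ravel(), minlength=256)
        cdf = hist.cumsum()
        lut = np.round(255.0 * (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1))
        pixels = lut.astype(np.uint8)[pixels]
    binary = pixels >= 128
    return ~binary if invert else binary
```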
7.2.2 Processing Methods Compared
Six methods of processing an original high quality image were compared:
Figure 7.3: Processing methods used in tests (panels: original; 1) IM eq; 2) IM sc; 3) IM tr; 4) IM opt; 5) Edge; 6) No IP – refer text for details)
1. Importance Mapping with all features weighted equally: ωcontrast = ωsize = ωshape
etc.
2. Importance Mapping with weights selected depending on the image scene type.
Experiments described in Chapter 6 showed that improved perception may be
obtained by processing images with respect to scene type in this way.
3. Weights selected in accordance with a training set of images from that scene
type. A training database was developed consisting of 15 images of each scene
category used in the tests (refer Appendix E.1). Feature maps for each image
were determined along with the percentage of pixels in the top 25% of each
feature map (ie. between 0.75 and 1.00 in the normalised images). This gave a
measure of the strength of that feature for that image. For some images, there
would be no pixels in the range 0.75 – 1.00 for that feature, while for others
100% of the image pixels might lie in this 0.75 – 1.00 importance range. Feature
distributions were made for each scene category. Pixels in this upper 0.75 – 1.00
range would be determined for a test image and its position within the
distribution determined. Weights were selected according to position of the test
image within the feature distribution (for example refer Fig 7.4).
[Figure 7.4 plots the percentage of images in the training database (0–100%) against feature weight (0–1).]
Figure 7.4: Example distribution – Size Map distribution for beach training images. If the size map for a test image had 50% of pixels in the range 0.75 – 1.00, this would be greater than only one other image in the training set; the weight applied to that feature map when combining feature maps is the value interpolated between the nearest images, ie. ~12%.
4. Weights iteratively adjusted to give the highest number of edges in the
resulting Importance Map (a sketch of this optimisation appears after this list).
Experiments described in Chapter 5 found that a subject’s ability to recognise
objects in low quality images is correlated with the number of edges in that
image. A medium-scale Quasi-Newton line search optimisation routine was used
to adjust the five weights to maximise the number of edges in the Importance Map.
5. Considering that the number of edges was found to correlate with correct object
recognition, it was desired to present an edge map alone. Images were prepared
using the Canny edge detection method operating on 25x25 spatial resolution
images.
6. Finally an image was presented with no importance processing applied, as a base
comparison case.
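For method 4, the thesis used a medium-scale Quasi-Newton line search routine; the following is a minimal sketch of the same idea under looser assumptions. Because the raw edge count is a step function of the weights (and therefore has no useful gradient), the sketch substitutes SciPy’s derivative-free Nelder-Mead simplex search, and it reuses the hypothetical edge_information_metric function from the Chapter 5 sketch.

```python
import numpy as np
from scipy import optimize

def negative_edge_count(weights, feature_maps):
    """Objective for the optimiser: minus the number of edge pixels in
    the importance map built from the given feature weights, so that
    minimising this maximises the edge count."""
    w = np.clip(weights, 0.0, 1.0)
    imp = sum(wi * fm for wi, fm in zip(w, feature_maps))
    if imp.max() > 0:
        imp = imp / imp.max()
    return -edge_information_metric(imp)   # hypothetical metric from the Chapter 5 sketch

def optimise_weights(feature_maps):
    """Iteratively adjust the five feature weights to maximise the
    number of edges in the resulting Importance Map."""
    x0 = np.full(5, 0.5)                   # start from neutral weights
    result = optimize.minimize(negative_edge_count, x0,
                               args=(feature_maps,), method="Nelder-Mead")
    return np.clip(result.x, 0.0, 1.0)
```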
7.2.3 Images Used
Chapter 6 described several scene categories that a blind person might encounter.
Six of these were chosen for this experiment, with 4 images for each category (Figure
7.5). Images were selected on the basis of functional mobility problems. Dowling
[21] has reviewed previous efforts in enhancing mobility for visually impaired
persons, including the following mobility problems:
• Lighting conditions and glare
• Changes in terrain and depth (stairs, curbs)
• Unwanted contacts (bumps)
• Street crossings
• Visual clutter
Figure 7.5: Images used in comparison tests, covering six scene categories: (i) beach, (ii) street, (iii) office, (iv) home, (v) café, (vi) head & shoulders
7.2.4 Experiment
A group of 242 volunteers participated in the experiment. From this, 50 samples
were discarded (21%) due to either incomplete responses or subjects who normally
wore glasses/contact lenses but who were not wearing them at that time. This left
192 normally sighted or corrected-to-normal viewers. Half the sample (n=96)
viewed the images that were greylevel equalised, while the other half viewed the
non-equalised images. Subjects were presented with test stimuli as shown in
Appendix Section E.2. An original high quality (256x256 greyscale) image was
shown along with the six different versions of the image. The design of the booklets
is shown in Appendix Section E.4 – Part A. Viewing conditions for the experiment
were not controlled.
7.2.5 Results
Figure 7.6 shows the breakdown of viewer preferences. There was a
clear preference in both the equalised and non-equalised viewer group for no
importance processing (base-case). This was the most chosen method for six of the
six scene types, especially for faces, where 85% of subjects chose that processing
method. Error bars representing 95% confidence intervals are shown in the plot
below, and were obtained from average preferences across the six scene types.
[Figure 7.6 comprises two plots. The upper plot shows the percentage of subjects choosing each processing method (IM eq, IM sc, IM tr, IM opt, Edge, No IP) for the equalised and non-equalised groups; the lower plot shows the number of samples ranked (4608 in total) for each processing method, broken down by scene type (beach, office, face, house, street, café).]
Figure 7.6: Viewer preferences when presenting the entire image; results indicate a clear preference for no Importance Processing (n=96 per group)
To ensure the ‘clear’ preference for no importance processing (‘No IP’) was
statistically significant, Analysis of Variance (ANOVA) was performed on the data
shown in Fig 7.6 to compare the hypotheses:
H0: µ(IM eq) = µ(IM sc) = … = µ(No IP)
H1: At least two of the means are not equal, at α = 0.05.
Viewer preferences across each scene type formed the basis of observations for the
ANOVA: 6 observations (beach, office, face, house, street, café) comparing 6
different processing methods. The test resulted in F-values of {19 & 57} for
{equalised & non-equalised data} which exceeded the critical F-value (2.53) for the
number of degrees of freedom in the data (5, 30), and was highly significant at
{P=1.5E-8 & 1.8E-14}. Thus H0 was rejected and it was concluded that at
least two of the means were not equal. The ANOVA was then repeated but this time
excluding data for ‘No IP’. This time F-values were {1.6 & 1.3} for {equalised &
non-equalised data} which were less than the critical F-value (2.76) for the number
of degrees of freedom in the data (4, 25), and {P=0.20 & 0.30}. This indicates that
when ‘No IP’ data is excluded, the means of the other processing methods are not
significantly different.
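The one-way ANOVA used here is available directly in SciPy. The following is a minimal sketch on illustrative per-scene preference percentages (not the thesis data): each list holds six observations, one per scene type, for one processing method.

```python
from scipy import stats

# Percentage of subjects preferring each processing method, one
# observation per scene type (beach, office, face, house, street,
# café) -- illustrative values, not the thesis data.
im_eq  = [10,  8,  3,  9,  7,  6]
im_sc  = [ 9,  7,  4,  8,  6,  7]
im_tr  = [ 8,  9,  3,  7,  8,  6]
im_opt = [ 7,  6,  2,  8,  7,  5]
edge   = [ 6,  8,  3, 15, 14,  7]
no_ip  = [60, 62, 85, 53, 58, 69]

f_value, p_value = stats.f_oneway(im_eq, im_sc, im_tr, im_opt, edge, no_ip)
print(f"F(5, 30) = {f_value:.1f}, P = {p_value:.2e}")
```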
Another issue of interest was histogram equalisation. A two sample t-test was
performed using 36 observations (6 processing methods and average results of each
of 6 scene types) at α = 0.05. The test assessed the hypotheses:
H0: There is no difference in histogram equalisation (differences shown in the
upper plot of Fig 7.6 were due to sampling errors or chance only);
H1: Histogram equalisation achieves significantly different results; ie. a one-
tailed t-test (a directional test showing results as higher or lower was not of
interest).
A t-statistic of 1.76E-15 was obtained which was much less than the critical t value
1.67 for 62 degrees of freedom. The significance of this value for a one-tail test was
P=0.5 and since this is greater than 0.05, H0 was not rejected: histogram equalisation
does not result in significant differences.
Fig 7.6 also shows processing methods divided into scene type. The plot indicates
which scene types may be better suited for a particular processing method. For
example, one of the processing methods compared in the experiment was edge
detection. Section 3.5.2.4 described how a research team developing cortical implant
devices expected improved perception results with the implementation of edge
detection processing. The data shown in Fig 7.6 indicate that low quality edge maps
were best recognised for house and street scenes. ANOVA testing using 8
observations (4 of each image type across 2 equalisation/non-equalised data sets) of
6 image types shows significantly higher results for house and street scenes
(P=0.0005).
This experiment had several conclusions:
(i) The Base Case (‘No IP’) data was significantly better than the
other processing methods.
(ii) There was no significant difference between the remaining five
processing methods. This indicates there is no real advantage in
tuning feature weights for the importance map method for low
quality images, a worthwhile conclusion as it allows the
computational overhead required for this processing to be used
elsewhere in prosthesis systems.
(iii) There was no significant difference in histogram equalisation
before thresholding images.
(iv) Low quality edge maps were best recognised for house and street
scenes.
7.3 Digital Zoom
The results discussed in the previous section indicate that the base case was best for
presenting images – ie. presenting subsampled and binarised images only without
any region-of-interest processing. However, rather than presenting an entire ROI-
processed image, improved perceptual results might be obtained by using ROI
processing to identify salient areas within an image and presenting those areas alone
(in a subsampled and binarised form). In effect the approach is to find interesting
areas within the image and perform a ‘digital zoom’, enlarging those salient areas
to the resolution limit set by the implant electrode array (refer Figure 7.7). It is
anticipated that digital zoom would be a common and easily-implemented prosthesis
function, and it would be useful to make this zoom method automatic for a blind
user.
Figure 7.7: Digital zoom concept – the most salient area is identified in an image and resized to the maximum display resolution
7.3.1 Automatic Zoom Methods
An additional test was conducted comparing seven methods of zooming into an
image. For the purposes of this exercise, the original image was 256x256 spatial
resolution.
1. IM_trim (Fig 7.8); A trimmed version of an Importance Map to only include
elements above a threshold. Pixels were trimmed around each border: top, right,
bottom, left, top, right etc. until a pixel value above the threshold was found. As
only square images were presented, the final image was made into a square of
dimension equal to the maximum dimension of the trimmed box. The smaller
dimension was expanded until image dimensions were equal, and the expansion
direction was on the basis that pixels of more important regions were added.
Figure 7.8: Trim method to select zoom window (panels: original; Importance Map; square with trimmed box; trimmed box)
2. IM_scope (Fig 7.9); A 128x128 box containing the highest greylevel values
in a 256x256 Importance Map, ie. one quarter of the image area. The 128x128
box was moved pixel by pixel across the image until it contained the highest sum
of pixel values (a sketch of this search appears after this list).
Figure 7.9: Scope Box method to select zoom window
3. The trim method described in 1) above applied to a Saliency Map generated by
code obtained from iLab at the University of Southern California [40,84]. This
Region-of-Interest research was discussed in Section 2.4. A Saliency Map is
created from combining three feature maps corresponding to colour, intensity and
orientation at six spatial scales. Unlike the Importance Map concept which
segments images first into regions, the saliency feature maps are created from
Difference-of-Gaussian (Mexican-hat) operators applied direct to pixel data
(Figure 7.10). Default values for the code implementation were used.
Figure 7.10: Saliency Map developed by University of Southern California
Top: Difference of Gaussians filter applied to 3 feature maps; Bottom: Saliency Map output showing Regions-of-Interest
4. The 128x128 box scope method described in 2) above applied to a Saliency Map.
5. A 128x128 box containing the horizontal and vertical centre of the image (Figure
7.11). This method has no dependence on image content and relies on spatial
position within the image only. It assumes that the centremost part of an image
may be the area worth zooming into.
Figure 7.11: Zoom window selected from central 25% of image
6. Similarly to 5) in that there is no dependence on image content, this method crops
a 128x128 box aligned at the bottom centre of the image. This area may be
significant for a viewer especially when mobile, as it contains the foreground
immediately in front of the camera (Fig 7.12).
Figure 7.12: Zoom window selected from central-bottom 25% of image
7. For reference, an option of “No Zoom” was also included, where the whole
256x256 image was represented.
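The scope-box search of methods 2 and 4 is easy to express with a summed-area table, which makes each window sum an O(1) lookup. The following is a minimal sketch assuming a normalised importance or saliency map as input; the function name and structure are illustrative, not the thesis implementation.

```python
import numpy as np

def scope_box(importance_map, box=128):
    """Slide a box x box window over an importance (or saliency) map
    and return the top-left corner of the window containing the
    highest sum of pixel values (methods 2 and 4 above)."""
    h, w = importance_map.shape
    # Summed-area table: sat[i, j] holds the sum of map[:i, :j].
    sat = np.zeros((h + 1, w + 1))
    sat[1:, 1:] = importance_map.cumsum(axis=0).cumsum(axis=1)
    # Sum over every window position in one vectorised expression.
    sums = (sat[box:, box:] - sat[:-box, box:]
            - sat[box:, :-box] + sat[:-box, :-box])
    r, c = np.unravel_index(np.argmax(sums), sums.shape)
    return r, c        # the window is importance_map[r:r+box, c:c+box]
```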
For all the above methods, the stimulus presented to viewers was the cropped
zoomed version from the original resized to 25x25 spatial resolution (Fig 7.13). One
test case had greylevels equalised before thresholding at the 128 grey level, while a
second test set had only thresholding at the 128 level with no histogram equalisation.
[Figure 7.13 is a flowchart: the original 256x256 image has a zoom window selected with one of the six methods, the window is cropped, optionally histogram equalised, thresholded at the 128 grey level, and subsampled to 25x25.]
Figure 7.13: Image preparation for digital zoom tests
The same subjects who viewed the earlier described experiment (refer Section 7.2)
also viewed the variations on zoom method. Half the sample (n=96) viewed the
images that were greylevel equalised, while the other half viewed the non-equalised
images. An example of the test stimuli presented to subjects is shown in Appendix
Section E.3 and the design of the test booklet is shown in Section E.4 – Part B.
Viewing conditions for the experiment were not controlled. Subjects were shown a
zoom window overlaid on the original image, in addition to a 25x25 black and white
version of the zoom window. When overlaid on the original image, the zoom
window was shown as a white square bordered on the inside and outside by a black
square to maximise visibility on all background greylevels (refer Figure 7.14).
Figure 7.14: Example stimulus showing detail of zoom window border
7.3.2 Results of Automatic Zoom Experiment
Viewer preferences are shown in Figure 7.15. Error bars representing 95%
confidence intervals are shown on the upper plot, and were obtained from average
preferences for the six scene types. ANOVA testing on the seven processing
methods resulted in strongly significant differences between the means (P=7.09E-8
and 2.33E-6 for non-equalised and equalised datasets respectively). The trim method
applied to Saliency Maps (“Sal trim”) had the highest preference for automatically
zooming into a part of the image. This method was best overall and for four of the
six scene types. For beach scenes the trim method applied to importance maps (“IM
trim”) was best, while for café scenes, which contained high clutter, “No Zoom” was
best.
[Figure 7.15 comprises two plots. The upper plot shows the percentage of subjects choosing each zoom method (IM trim, IM scope, Sal trim, Sal scope, centre, bottom, none) for the equalised and non-equalised groups; the lower plot shows the number of samples ranked (4608 in total) for each zoom method, broken down by scene type (beach, office, face, house, street, café).]
Figure 7.15: Preferences for methods to automatically zoom into an image (n=96 per group)
Again, results were independent of whether histogram equalisation was applied to
images. A two sample t-test was performed using 42 observations (7 processing
methods and average results of each of 6 scene types) at α = 0.05. The test assessed
the hypotheses:
H0: There is no difference in histogram equalisation (differences shown in Fig
7.15 were due to sampling errors or chance only);
H1: Histogram equalisation achieves significantly different results; ie. a one-
tailed t-test (not interested in a directional test which would show results as
higher or lower).
A t-statistic of 1.85E-15 was obtained which was much less than the critical t value
1.66 for 80 degrees of freedom. The significance of this value for a one-tail test was
P=0.5 and since this is greater than 0.05, H0 was not rejected: histogram equalisation
does not result in significant differences.
The trim methods (“Sal trim” and “IM trim”) were approximately twice as good as
the scope methods (“Sal scope” and “IM scope”). This may be due to the scope box
method having a fixed box size (equal to one quarter of the image area) while the
box size for the trim method varied depending on the image, potentially returning a
more useful zoomed image.
Thus if a digital zoom function were to be employed in a prosthesis design to
highlight areas which may help a visually impaired user, favourable results are most
likely to be achieved with the Saliency Map method. The trim method on
importance maps (“IM trim”) is also slightly better than zoom windows based on a
geometric part of the image which do not consider image content.
7.4 Chapter Summary
This chapter described a comparison of region-of-interest processing methods for
low quality image presentation. The experiments showed that it is better to use
Importance Map/Region-of-Interest processing to select a region within the image
and present that alone, rather than presenting the actual Importance/Salience
representation for the entire image.
So in response to the research question:
Q2: Does Region-of-Interest processing improve scene understanding beyond
standard/Base Case processing?
It can be seen that ROI processing does improve scene understanding when used in a
zoom application, but not if applied to the entire image.
Chapter 8 Discussion, Conclusion and Future Work
8.1 Discussion and Conclusion
Electronic visual prostheses, or bionic eyes, are likely to provide some coarse visual
sensations to blind patients who have these systems implanted. The quality of
artificially induced vision is anticipated to be very poor initially. Research described
in this thesis has explored image processing techniques that improve perception for
users of visual prostheses.
The work has focussed on improving perception via image processing techniques.
Images are just data, and image processing is simply manipulating that data. There
are potentially other techniques that may result in improved perception that are
outside the scope of this research. These include using different electrode paths to
create variations of charge density patterns, and delivering preconditioning stimulus
to selectively excite deeper axons away from those in contact with electrodes.
Useful image processing methods were determined by way of subjective experiments
with normally sighted viewers. These experiments provide a basis from which more
complex and beneficial vision prostheses may be derived. For example, the tests
involved presentation of static (still) images to viewers and a logical extension to the
work is to conduct similar experiments on image sequences (video). Thus a body of
knowledge has been established from which real-time processing units can be
developed such that a prosthesis may provide maximum benefits to the blind.
The work has also facilitated further understanding of the human visual system,
specifically perception from low quality visual information. The field of image
quality, including how quality is measured and characterised, is vast. However work
to date focuses on high quality images associated with modern multimedia
environments. This research contributes to the image quality literature by
characterising low quality images. The amount of visual information carried by
images has been quantified, and in this way a new means of characterising low
quality images can be stated on the basis of presenting maximum information.
Finally the research has involved an original application of Region-of-Interest
processing routines beyond traditional applications such as image compression.
Region-of-Interest processing was presented in Section 2.4 as a powerful perception
modeling tool using a combination of early vision and cognitive effects. Prosthesis
researchers had not previously recognised the benefit of applying techniques to
automatically identify important areas in images. Thesis experiments validated the
use of computationally cheap Region-of-Interest techniques in visual prosthesis
designs.
Detailed research findings are now discussed in the order of their presentation in the thesis.
Some preliminary experiments are discussed in Chapter 4. In an experiment to
determine useful image processing methods, it was found that some types of images
are better recognised at low quality using Importance and Distance Map methods,
while for others, ‘Base Case/Normal’ processing is better. It is recommended to
have switchable modes of operation to allow user selection of the processing routine.
Results also indicated that it is better for a device to have increased spatial resolution
rather than increased greyscale resolution, and that faces are the most easily
understood type of image. These results were reinforced in a separate experiment
determining the effect of image type on perception.
In Chapter 5, a stable measure was developed to quantify the amount of visual
information in an image. This was found to be the number of edges in an image. It
was also found that high information content in images correlates with higher
perception.
Chapter 6 described an experiment that showed there is some benefit in processing
images according to their scene type (office, home etc). It is recommended to invert
the processed binary images before presentation if necessary to most closely match
the original images.
In Chapter 7, a comparison of Region-of-Interest methods was presented. ‘Base
Case/Normal’ processing was best when presenting the entire image, but ROI
processing has some benefit over Base Case processing with a zoom type function.
If a Region-Of-Interest technique was to be implemented in a zoom function, a
technique known as the Saliency Map (pixel based) gives the most favourable result.
This method was better than region-based Region-of-Interest methods, ‘Base Case’
and methods using the geometric location within an image (where image content is
not taken into account). The experiments found that there was no benefit in tuning
feature weights in the Importance Map method when such low quality images are
used. Finally there was no difference in the results if images are histogram-equalised
prior to binarisation or not.
In exploring image processing techniques that improve perception from visual
prostheses, four research questions were addressed:
Research Question | Findings

Q1: Although limited to low quality images anticipated from visual prostheses, can recognition of some objects be achieved?
Findings: Basic recognition can be achieved for low quality images, although this is dependent on spatial and greyscale resolutions and image type. Face environments are most easily recognised.

Q2: Does Region-of-Interest processing improve scene understanding beyond standard/Base Case processing?
Findings: Improved scene understanding can be expected if Region-of-Interest processing is used to zoom into interesting areas within the image.

Q3: Can a model be constructed for basic information required for the interpretation of a visual scene at low image quality?
Findings: Maximising the number of edges in an image results in higher perceived information content and higher recognition performance.

Q4: Should the processing techniques be adjusted depending on the scene type?
Findings: Improved perception may be obtained by processing images with respect to scene type.
8.2 Future Work
The research presented in this thesis can be extended in several areas.
8.2.1 Motion
The algorithms described in this thesis were employed on static images, and a further
extension of this work is to image sequences/video. It is anticipated that enhanced
perception would be achieved over that experienced with static images, as image
sequences convey higher scene information and subjects would be able to move
about to see how various scene elements (background/foreground) interact.
It is likely that an optimum frame presentation rate exists to maximise visual
understanding. Nauseating disorientation effects may arise if head movements are
not matched with visual information. Time delay effects in helmet/head mounted
displays resulting in limitations from image lag are covered by Dudfield [22] and
Nelson [61].
8.2.2 Colour
As identified by Suaning et al [93], colour would further enhance images delivered to
subjects. While patients undergoing cortical stimulation have reported seeing white,
yellow, red, and blue phosphenes [86], the successful modulation of colour has not yet
been reported in the literature. From an image processing viewpoint, colour filtering
could be relatively easily applied to present monochromatic images of only a selected
colour, for example green filtering to locate green apples in a bin of green and red
apples. One paper [108] discussing alternatives to colour for transmitting information
in images (eg. monochromatic coding, size, flashing stimulus) identifies a
disadvantage of colour: it cannot depict relative importance.
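Such a colour filter is simple to sketch. The following assumes an RGB image as a NumPy array; the dominance margin of 30 grey levels is an illustrative choice, not a recommended value.

```python
import numpy as np

def green_filter(rgb, margin=30):
    """Return a monochrome image that is bright only where green
    dominates the red and blue channels (eg. to pick out green
    apples among red ones)."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    mask = (g > r + margin) & (g > b + margin)
    return np.where(mask, rgb[..., 1], 0).astype(np.uint8)
```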
8.2.3 Device interfacing
The ability to connect an artificial vision system to a television or computer would be
desirable. This concept has been incorporated in the design of the Dobelle implant
which allows for a television/computer/internet interface and remote video
screen/VCR monitor [20]. Other routine image processing functions could include
the possibility to record a video clip/snapshot of an interesting landmark, route or
environment, and transmit the clip/snapshot to another viewer or an external device.
8.2.4 Supplementary/Symbolic Information
Given that phosphenes can be produced in the visual field, increased information
transfer might be achieved if use were made of these phosphenes not just for
representing one part of a scene, but for “coding” of associated information. For
example, a particular phosphene in the top left hand corner of the visual field could
represent the close proximity to the left of a tall and narrow object (such as a lamp-
post), which may prevent an injury. Similarly, the middle right-most phosphene
might represent fast movement from the right. In addition to using a selected
phosphene for information representation, other modes of transfer might cover
phosphene brightness and blink rate. It is feasible that supplementary sensory
information (eg. distance) might be conveyed in such a manner to efficiently use the
small number of phosphenes.
The ability to convey alphanumeric character patterns directly, rather than capturing
the text via camera, may be beneficial in improving reading speeds and visual acuity.
A 5 x 7 phosphene array can be used to create a full set of symbols, similar to dot
matrix displays, where the 35 dots can overlap or appear as discrete elements without
affecting legibility [74,89]. While a 5 x 7 matrix is generally quite adequate for
groups of characters presented in context, it can result in some confusion when single
characters are used (eg. 2 called Z, B called 8). This may lead to the use of 7 x 9
matrix fonts, but matrices larger than 7 x 9 do not result in meaningful improvements
[89].
The use of supplementary stimulus could also be applied for colour analysis.
Although artificially-created phosphenes might not represent the true colour of the
camera image, it should be possible to overlay some colour identification
information. For example, a colour identification function would be advantageous to
selectively give an indication of the colour detected in the centre of the camera view.
8.2.5 Range Indication
The distance to an object would be particularly useful information to convey through
an artificial vision system. The use of sonar distance aids has been common for
many years (eg. [23]). However these devices emit an auditory signal to convey
distance information which can interfere with important surrounding environmental
noises. It is thus desirable to incorporate distance information visually through an
artificial vision system. Distance information can be obtained using ultrasonic
rangefinders or by computing depth from the disparity between two cameras.
Distances could then be mapped to intensities, where the nearest object is shown
with the highest intensity. If the device display only supports a 1-bit grey scale, only
the nearest object need be displayed. Newman and Jain [63] state that the greatest
advantage of range data is in explicitly representing surface information.
Binary/silhouette data provides negligible information about surface and intensity
data can contain significant variations in reflected light.
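A minimal sketch of the distance-to-intensity mapping described above follows; the maximum range and quantisation are assumptions for illustration.

```python
import numpy as np

def depth_to_intensity(depth, max_range=10.0, bits=1):
    """Map per-pixel distances (metres) to display intensities so the
    nearest objects appear brightest. For a 1-bit display, only the
    nearest object is shown."""
    nearness = np.clip(1.0 - depth / max_range, 0.0, 1.0)
    if bits == 1:
        return (nearness >= nearness.max()).astype(np.uint8)  # nearest only
    levels = 2 ** bits - 1
    return np.round(nearness * levels).astype(np.uint8)
```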
In combination with a standard image of luminance intensities, this distance 'mode of
operation' could be quite useful. The literature contains several works from a
computer graphic modelling viewpoint which claim that through the combination of
range and intensity, a fuller description of the model can be achieved which takes
advantage of the strengths and weaknesses of each method in isolation [44,113].
8.2.6 Simulating Techniques
Many of the experiments undertaken in this thesis involved the presentation of
binary images to viewers that did not include greyscale effects. The use of
halftoning techniques to create the illusion of greyscale was mentioned in Sections
3.5.2.2 and 3.4.1 along with the theory that modulating the size and intensity of a
phosphene are equivalent psychophysically. Future experiments in simulating
artificial vision could explore the use of this technique (refer Figure 8.1).
Figure 8.1: Halftone representation (panels: original; halftone with max radius = 2; halftone with max radius = 4; halftone with max radius = 10)
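As a minimal sketch of the dot-size modulation underlying Figure 8.1, the Python fragment below renders each cell of a coarse grey image as a disc whose radius grows with the local grey level; the cell size and maximum radius are illustrative assumptions.

    # Sketch: simulate greyscale with size-modulated phosphene dots.
    import numpy as np

    def halftone(gray, cell=11, max_radius=4):
        """Render a small grey image (values 0..1) as variable-size dots."""
        h, w = gray.shape
        canvas = np.zeros((h * cell, w * cell))
        yy, xx = np.mgrid[0:cell, 0:cell] - cell // 2
        dist = np.sqrt(xx**2 + yy**2)     # distance from the cell centre
        for i in range(h):
            for j in range(w):
                radius = gray[i, j] * max_radius
                disc = ((dist <= radius) & (radius > 0)).astype(float)
                canvas[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = disc
        return canvas

    gray = np.array([[0.0, 0.5],
                     [0.75, 1.0]])
    print(halftone(gray).shape)   # (22, 22): dot area grows with grey level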
8.2.7 Other Testing Techniques
The recognition experiments conducted in this thesis were also open-ended, in that no hints or clues were provided. Other ways to assess recognition performance might include asking task-dependent questions about the scene, such as “circle the doorway” or “identify the obstacle location”. Assessing perception in this way requires careful design and a consistent determination of correct responses across subjects. Objective measures, such as tracking tasks, maze navigation, reading speed, and visual acuity scores based on Landolt Cs and Sloan Es, could also be used, similar to previous attempts to quantify visual processing methods (eg. [12,13,14,33]).
8.3 Final Word
My motivation for undertaking this work was growing up with a blind parent and
witnessing frequent injuries from collisions with obstacles: dishwasher doors being
left open, cupboard doors ajar, corners of brick walls. There is always the hope that one day some visual perception may return, to a level sufficient to avoid such collisions and to allow the amazing world so familiar to us sighted people to be seen.
In the initial design of this research project, a scope of work was set that was achievable. It would not be realistic to expect to achieve sight restoration as a result of a PhD. However, with the facilities and testing methods available, some ideas could be explored in the area of image processing. It is hoped that this work may contribute in a small way to the numerous international efforts to develop a safe and useful electronic visual prosthesis for blind people.
References

1. Ahumada A, "Computational image quality metrics: A review", SID Digest of Technical Papers, 24, pp.305-308, 1993
2. Amerijckx C, Legat J, Trullemans C, "Design and implementation of a remapping algorithm for visual prosthesis", Proceedings of Vision Interface '99, Canadian Image Processing & Pattern Recognition Society, Toronto, Canada, pp.380-385, 1999
3. Barten P, Contrast Sensitivity of the Human Eye and its Effects on Image Quality, SPIE Press, Washington, 1999
4. Baskent D, Shannon R, "Frequency-place compression and expansion in cochlear implant listeners", Journal of the Acoustical Society of America, 116(5), pp.3130-3140, 2004
5. Beatty J, Booth K, Matthies L, "Revisiting Watkins' algorithm (computer graphics)", 7th Canadian Man-Computer Communications Conference, pp.359-370, 1981
6. Becker M, Braun M, Eckmiller R, "Retina implant adjustment with reinforcement learning", Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), IEEE, New York, USA, Vol.2, pp.1181-1184, 1998
7. Bell D, Maeder A, "Progressive technique for human face archiving and retrieval", Journal of Electronic Imaging, 5(2), pp.191-197, 1996
8. Brindley G, "The number of information channels needed for efficient reading", Journal of Physiology, 177, p.46, 1964
9. Callaghan T, "Interference and dominance in texture segregation: Hue, geometric form, and line orientation", Perception & Psychophysics, 46(4), pp.299-311, 1989
10. Cantoni V (ed), Human and Machine Vision – Analogies and Divergencies, Proceedings of the 3rd International Workshop on Perception, Plenum Press, 1994
11. Capelle C, Faik C, Trullemans C, Veraart C, "Real time experimental visual prosthesis using sensory substitution of vision by audition", Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, New York, USA, Vol.1, pp.255-256, 1994
12. Cha K, Horch K, Normann R, "Mobility performance with a pixelised vision system", Vision Research, 32(7), pp.1367-1372, 1992
13. Cha K, Horch K, Normann R, "Simulation of a phosphene-based visual field: Visual acuity in a pixelized vision system", Annals of Biomedical Engineering, 20(4), pp.439-449, 1992
14. Cha K, Horch K, Normann R, "Reading speed with a pixelized vision system", Journal of the Optical Society of America A – Optics & Image Science, 9(5), pp.673-677, 1992
15. Chernyak D, Stark L, "Top-down guided eye movements: Peripheral model", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.349-360, 2001
16. Cooper P, Birnbaum L, Brand M, "Causal scene understanding", Computer Vision and Image Understanding, 62(2), pp.215-231, 1995
17. Dagnelie G, Humayun M, Greenberg R, de Juan E, "The physiological connection: stimulating the human and amphibian retina", Proceedings of the International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2321-2326, 1997
18. DeMarco S, Clements M, Vichienchom K, Liu W, Humayun M, Weiland J, "An epi-retinal visual prosthesis implementation", Proceedings of the First Joint BMES/EMBS Conference, IEEE, Piscataway, USA, Vol.1, p.475, 1999
19. De Ridder H, "Cognitive issues in image quality measurement", Journal of Electronic Imaging, 10(1), pp.47-55, 2001
20. Dobelle W, "Artificial vision for the blind by connecting a television camera to the visual cortex", ASAIO (American Society of Artificial Internal Organs) Journal, 46(1), pp.3-9, 2000
21. Dowling J, Maeder A, Boles W, "Mobility enhancement and assessment for a visual prosthesis", in Human Vision and Electronic Imaging IX, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.5369, pp.780-791, 2004
22. Dudfield H, Hardiman T, Selcon S, "Human factors issues in the design of helmet-mounted displays", in Helmet- and Head-Mounted Displays and Symbology Design Requirements II, Lewandowski R, Stephens W, Haworth L (eds), Proceedings of SPIE Vol.2465, pp.132-141, 1995
23. Easton R, "Inherent problems of attempts to apply sonar and vibrotactile sensory aid technology to the perceptual needs of the blind", Optometry and Vision Science, 69(1), pp.3-14, 1992
24. Eckert M, Bradley A, "Perceptual quality metrics applied to still image compression", Signal Processing, 70, pp.177-200, 1998
25. Eckmiller R, Becker M, Hunermann R, "Towards a learning retina implant with epiretinal contacts", IEEE International Conference on Systems, Man, and Cybernetics, Vol.4, pp.396-399, 1999
26. Eckmiller R, Becker M, Hunermann R, "Dialog concepts for learning retina encoders", Proceedings of the International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2315-2320, 1997
27. Exner J, The Rorschach: A Comprehensive System, Vol.2: Current Research and Advanced Interpretation, Wiley, New York, 1978
28. Gilmont T, Verians X, Legat J, Veraart C, "Resolution reduction by growth of zones for visual prosthesis", Proceedings of the International Conference on Image Processing, IEEE, New York, USA, Vol.1, pp.299-302, 1996
29. Gregory R, Eye and Brain – The Psychology of Seeing, World University Library, London, 1966
30. Groth-Marnat G, Handbook of Psychological Assessment, 3rd ed, John Wiley & Sons, New York, 1997
31. Hallum L, Taubman D, Suaning G, Morley J, Lovell N, "A filtering approach to artificial vision: A phosphene visual tracking task", Proceedings of the World Congress on Medical Physics and Biomedical Engineering (WC2003), 2003
32. Harvey J, Sawan M, "Image acquisition and reduction dedicated to a visual implant", Proceedings of the 18th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, New York, USA, Vol.1, pp.403-404, 1997
33. Hayes J, Yin V, Piyathaisere D, Weiland J, Humayun M, Dagnelie G, "Visually guided performance of simple tasks using simulated prosthetic vision", Artificial Organs, 27(11), pp.1016-1028, 2003
34. Hendee W, Wells P (eds), The Perception of Visual Information, 2nd ed, Springer-Verlag, New York, 1997
35. Henderson D, Evans J, Dobelle W, "The relationship between stimulus parameters and phosphene threshold/brightness during stimulation of human visual cortex", Transactions – American Society for Artificial Internal Organs, 25, pp.367-371, 1979
36. Huang T, "PCM picture transmission", IEEE Spectrum, 2(12), pp.57-63, 1965
37. Humayun M, Weiland J, Fujii G, Greenberg R, Williamson R, Little J, Mech B, Cimmarusti V, Van Boemel G, Dagnelie G, de Juan E, "Visual perception in a blind subject with a chronic microelectronic retinal prosthesis", Vision Research, 43(24), pp.2573-2581, 2003
38. Hungenahally S, "Differentio-aggregation functions for perceptual sub-band coding of images: Emulation of visual receptive fields", IEEE International Conference on Systems, Man and Cybernetics, IEEE, USA, Vol.3, pp.2420-2425, 1994
39. Hungenahally S, "Mathematical basis for the design of an artificial retina: visual prosthesis for the retinally blind", IEEE International Conference on Systems, Man and Cybernetics, IEEE, New York, USA, Vol.3, pp.2396-2401, 1995
40. Itti L, Koch C, "Feature combination strategies for saliency-based visual attention systems", Journal of Electronic Imaging, 10(1), pp.161-169, 2001
41. ITU-R Recommendation BT.500-7, 10/1995, http://www.itu.ch/, access date: 4/6/04
42. Iwamoto K, Tanie K, "Development of an eye movement tracking type head mounted display: capturing and displaying real environment images with high reality", Proceedings of the 1997 IEEE International Conference on Robotics and Automation, IEEE, New York, USA, Vol.4, pp.3385-3390, 1997
43. Janssen R, Computational Image Quality, SPIE Press Monograph Vol.PM101, SPIE – The International Society for Optical Engineering, 2001
44. Kay G, Caelli T, "Inverting an illumination model from range and intensity maps", CVGIP: Image Understanding, 59(2), pp.183-201, 1994
45. Kraft R, Kauer J, "Estimating the fractal dimension from digitized images", Munich University of Technology – Weihenstephan, Freising, Germany, http://www.wzw.tum.de/ane/algorithms/algorithms.html, access date: 20/4/04, 1995
46. Kyuma K, Miyake Y, Kage H, "Artificial retina chips", IEEE International Conference on Neural Networks, Vol.4, pp.2304-2308, 1997
47. Levine M, Vision in Man and Machine, McGraw-Hill, New York, 1985
48. Livingstone M, Hubel D, "Segregation of form, color, movement, and depth: Anatomy, physiology, and perception", Science, 240, pp.740-749, 1988
49. Loce R, Roetling P, Lin Y, "Digital halftoning for printing and display of electronic images", in Electronic Imaging Technology, Dougherty E (ed), SPIE – The International Society for Optical Engineering, pp.225-288, 1999
50. Luo J, Etz S, Singhal A, Gray R, "Performance-scalable computational approach to main subject detection in photographs", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.494-505, 2001
51. Lysaght M, Vogelstein J, Lockhart N, Cheswick Thide C, Nallari M, Caulkins C, Artificial Vision (compiled 1999), Brown University, Providence, Rhode Island, http://biomed.brown.edu/Courses/BI108/BI108_1999_Groups/Vision_Team/Vision.htm, access date: 20 April 2004
52. Maeder A, "Human understanding limits in visualization", International Journal of Pattern Recognition & Artificial Intelligence, 11(2), pp.229-237, 1997
53. Maeder A, Eckert M, "Medical image compression: Quality and performance issues", Proceedings of New Approaches in Medical Image Analysis, SPIE – The International Society for Optical Engineering, Washington, Vol.3747, 1999
54. Maeder A, Pham B, "A colour importance measure for colour image analysis", IS&T and SID's Color Imaging Conference: Transforms & Transportability of Colour, Phoenix, pp.232-237, 1993
55. Margalit E, Maia M, Weiland J, et al, "Retinal prosthesis for the blind", Survey of Ophthalmology, 47(4), pp.335-356, 2002
56. Marr D, Vision, W.H. Freeman & Company, New York, 1982
57. Meletiou A, Measurement of Complexity in Visual Images, MSc in Human Computer Interaction project, Department of Computing & Electrical Engineering, Heriot-Watt University, 1999
58. Miau F, Itti L, "A neural model combining attentional orienting to object recognition: Preliminary explorations on the interplay between where and what", Proceedings of the IEEE Engineering in Medicine and Biology Society (EMBS), Istanbul, Turkey, 2001
59. Moffett D, Moffett S, Schauf C, Human Physiology – Foundations and Frontiers, 2nd ed, Mosby-Year Book Inc, St Louis, pp.268-283, 1993
60. Mojsilovic A, Rogowitz B, "A psychophysical approach to modeling image semantics", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.470-477, 2001
61. Nelson W, Hettinger L, Haas M, Russell C, Warm J, Dember W, Stoffregen T, "Compensation for the effects of time delay in a helmet-mounted display: perceptual adaptation versus algorithmic prediction", in Helmet- and Head-Mounted Displays and Symbology Design Requirements II, Lewandowski R, Stephens W, Haworth L (eds), Proceedings of SPIE Vol.2465, pp.132-141, 1995
62. Nemine K, "Calibration and evaluation of virtual environment displays", Proceedings of the Symposium on Research Frontiers in Virtual Reality, IEEE Computer Society Press, San Jose, USA, 1993
63. Newman T, Jain A, "A survey of automated visual inspection", Computer Vision and Image Understanding, 61(2), pp.231-262, 1995
64. Nguyen A, Chandran V, Sridharan S, Prandolini R, "Importance assignment to regions in surveillance imagery to aid visual examination and interpretation of compressed images", Proceedings of the International Symposium on Intelligent Multimedia, Video & Speech Processing (ISIMP), Hong Kong, pp.385-388, 2001
65. Nie K, Stickney G, Zeng F, "Encoding frequency modulation to improve cochlear implant performance in noise", IEEE Transactions on Biomedical Engineering, 52(1), pp.64-73, 2005
66. Normann R, Maynard E, Rousche P, Warren D, "A neural interface for a cortical vision prosthesis", Vision Research, 39(15), pp.2577-2587, 1999
67. Osberger W, Perceptual Vision Models for Picture Quality Assessment and Compression Applications, PhD thesis, Space Centre for Satellite Navigation, School of Electrical and Electronic Engineering, QUT, 1999
68. Osberger W, Maeder A, "Automatic identification of perceptually important regions in an image using a model of the human vision system", 14th International Conference on Pattern Recognition, Brisbane, Australia, pp.701-704, 1998
69. Osberger W, Maeder A, Bergmann N, "A perceptually based quantisation technique for MPEG encoding", in Human Vision and Electronic Imaging III, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.3299, San Jose, USA, 1998
70. Osberger W, Rohaly A, "Automatic detection of regions of interest in complex video sequences", in Human Vision and Electronic Imaging VI, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.4299, pp.361-372, 2001
71. Pausch R, Shackelford M, Proffitt D, "A user study comparing head-mounted and stationary displays", Proceedings of the Symposium on Research Frontiers in Virtual Reality, IEEE Computer Society Press, San Jose, USA, 1993
72. Peachey N, Chow A, "Subretinal implantation of semiconductor-based photodiodes: progress and challenges", Journal of Rehabilitation Research & Development, 36(4), pp.371-376, 1999
73. Pentland A, "Wearable computers", IEEE Microcomputers, 19(6), pp.9-11, 1999
74. Perez R, Electronic Display Devices, TAB Professional and Reference Books, Pennsylvania, USA, pp.196-202, 1988
75. Privitera C, Stark L, "Focused JPEG encoding based upon automatic pre-identified regions-of-interest", in Human Vision and Electronic Imaging IV, Rogowitz B, Pappas T (eds), Proceedings of SPIE Vol.3644, pp.552-558, 1999
76. Reinagel P, Zador A, "Natural scene statistics at the centre of gaze", Network: Computation in Neural Systems, 10(4), pp.341-350, 1999
77. Riglis E, Modeling Visual Complexity in Images, First Year PhD Report, Image Systems Engineering Laboratory, School of Mathematical and Computer Sciences, Heriot-Watt University, 1998
78. Rivlin E, Rosenfeld A, "Navigational functionalities", Computer Vision and Image Understanding, 62(2), pp.232-244, 1995
79. Rizzo J, Wyatt J, Loewenstein J, Kelly S, Shire D, "Methods and perceptual thresholds for short-term electrical stimulation of human retina with microelectrode arrays", Investigative Ophthalmology & Visual Science, 44(12), pp.5355-5361, 2003
80. Rogowitz B, Frese T, Smith J, Bouman C, Kalin E, "Perceptual image similarity experiments", in Human Vision and Electronic Imaging III, Proceedings of SPIE Vol.3299, San Jose, USA, pp.576-590, 1998
81. Rogowitz B, Pappas T, Allebach J, "Human vision and electronic imaging", Journal of Electronic Imaging, 10(1), pp.10-19, 2001
82. Rosenberg D, Color Halftone Version 7.0, an Adobe Photoshop filter module which simulates an enlarged print color halftone effect, Adobe Systems, 2002
83. Russ J, The Image Processing Handbook, 3rd ed, CRC Press, Florida, USA, pp.242-247, 1999
84. Saliency Map source code, iLab, University of Southern California, http://ilab.usc.edu/toolkit/, access date: 12 August 2003
85. Schill K, Umkehrer E, Beinlich S, Krieger G, Zetzsche C, "Scene analysis with saccadic eye movements: Top-down and bottom-up modeling", Journal of Electronic Imaging, 10(1), pp.152-160, 2001
86. Schmidt E, Bak M, Hambrecht F, Kufta C, O'Rourke D, Vallabhanath P, "Feasibility of a visual prosthesis for the blind based on intracortical microstimulation of the visual cortex", Brain, 119, pp.507-522, 1996
87. Schubert M, Stelzle M, Graf M, Stert A, Nisch W, Graf H, Hammerle H, Gabel V, Hofflinger B, Zrenner E, "Subretinal implants for the recovery of vision", IEEE International Conference on Systems, Man, and Cybernetics, Vol.4, pp.376-381, 1999
88. Schwarz M, Hauschild R, Hosticka B, Huppertz J, Kneip T, Kolnsberg S, Mokwa W, Trieu H, "Single chip CMOS image sensors for a retina implant system", IEEE, Vol.6, pp.645-648, 1998
89. Sherr S, Electronic Displays, John Wiley & Sons, New York, pp.29-37, 1979
90. Snyder H, Trejo L, "Research methods", in Colour in Electronic Displays, Widdel H, Post D (eds), Plenum Press, 1992
91. Stange K, "A 4-parameter model of visual complexity in abstract images and a computer program for the empirical investigation of complexity, pleasingness and interestingness of images based on the model", XVI Congress of the International Association of Empirical Aesthetics, New York, 2000
92. Stark L, Privitera C, Yang H, Azzariti M, Fai Ho Y, Blackman T, Chernyak D, "Representation of human vision in the brain: How does human perception recognise images?", Journal of Electronic Imaging, 10(1), pp.123-151, 2001
93. Suaning G, Lovell N, Schindhelm K, Coroneo M, "The bionic eye (electronic visual prosthesis): A review", Australian and New Zealand Journal of Ophthalmology, 26, pp.195-202, 1998
94. Suaning G, Lovell N, Kwok C, "Fabrication of platinum spherical electrodes in an intra-ocular prosthesis using high-energy electrical discharge", Sensors and Actuators A: Physical, 108, pp.155-161, 2003
95. Suaning G, Lovell N, "CMOS neurostimulation system with 100 electrodes and radio frequency telemetry", Inaugural Conference of the IEEE EMBS (Vic), Melbourne, pp.37-40, 1999
96. Thompson R, Barnett G, Humayun M, Dagnelie G, "Facial recognition using simulated prosthetic pixelized vision", Investigative Ophthalmology & Visual Science, 44(11), pp.5035-5042, 2003
97. Thorpe S, "Image processing by the human visual system", Eurographics '90 Tutorial Note EG.90TN4, Eurographics Technical Report Series, 1990
98. Travis D, Effective Colour Displays – Theory and Practice, Academic Press, 1991
99. Vaughan H, Schimmel H, "Feasibility of electrocortical visual prosthesis", in Visual Prosthesis – The Interdisciplinary Dialogue (Proceedings of the Second Conference on Visual Prosthesis), Sterling T, Bering E, Pollack S, Vaughan H (eds), Academic Press, New York, pp.65-79, 1971
100. Veraart C, Wanet-Defalque M, Gérard B, Vanlierde A, Delbeke J, "Pattern recognition with the optic nerve visual prosthesis", Artificial Organs, 27(11), pp.996-1004, 2003
101. VQEG – Video Quality Experts Group, Institute for Telecommunication Sciences, U.S. Department of Commerce, Colorado, http://www.its.bldrdoc.gov/vqeg/, access date: 4/6/04
102. Wandell B, Foundations of Vision, Sinauer Associates Inc, Massachusetts, USA, pp.124-126, 1995
103. Walpole R, Myers R, Probability and Statistics for Engineers and Scientists, 5th ed, Macmillan Publishing Company, New York, 1993
104. Warren D, Normann R, "Visual neuroprostheses", in Handbook of Neuroprosthetic Methods, Finn W, LoPresti P (eds), The Biomedical Engineering Series, CRC Press, 2003
105. Watson A (ed), Digital Images and Human Vision, MIT Press, 1993
106. Watson A, et al, The DCTune Algorithm, Vision Science and Technology Group, NASA Ames Research Center, http://vision.arc.nasa.gov/dctune/, access date: 29 May 2004
107. Werblin F, Jacobs A, "The cellular neural network as a retinal camera for visual prosthesis", Proceedings of the International Conference on Neural Networks, IEEE, New York, USA, Vol.4, pp.2327-2332, 1997
108. Widdel H, Post D (eds), Colour in Electronic Displays, Plenum Press, 1992
109. Yagi T, Ito Y, Kanda H, Tanaka S, Watanabe M, Uchikawa Y, "Hybrid retinal implant: fusion of engineering and neuroscience", IEEE International Conference on Systems, Man, and Cybernetics, Vol.4, pp.382-385, 1999
110. Yagi T, Kameda S, Hayashida Y, Li L, "An artificial retina with adaptive mechanisms and its application to retinal prostheses", IEEE, Vol.4, pp.418-423, 1999
111. Yamakawa T, Shimonomura K, Udono T, Yagi T, "Depth perception circuit employing serial output signals from two vision chips", IEEE, Vol.4, pp.390-395, 1999
112. Ziegler D, Linderholm P, Mazza M, Ferazzutti S, Bertrand D, Ionescu A, Renaud, "An active microphotodiode array of oscillating pixels for retinal stimulation", Sensors and Actuators A: Physical, 110(1-3), pp.11-17, 2004
113. Zhang G, Wallace A, "Physical modelling and combination of range and intensity edge data", CVGIP: Image Understanding, 58(2), pp.191-220, 1993
Appendix A Section 4.2 Experiment
A.1 Example Test Stimulus
All images are different versions of the same object.
[Five versions of the image, labelled A to E]
1. DESCRIBE THE OBJECT: _________________________________________________________________
2. RANK THE TOP 3 IMAGES THAT YOU THINK SHOW THE OBJECT MOST CLEARLY: 1)_________ 2)_________ 3)_________
A.2 Booklet Design

Test attributes (N = normal, I = inverse, D = distance, Im = importance, E = edges):

Ref  Spatial res & grey levels    Images per page  Image characteristics shown per page
1    10x10 B&W                    5                5 x B&W (N,I,D,Im,E)
2    10x10 3 grey                 4                4 x 3grey (N,I,D,Im)
3    10x10 B&W vs 3grey           9                9 = (5 x B&W) + (4 x 3grey)
4    16x16 B&W                    5                5 x B&W (N,I,D,Im,E)
5    16x16 3 grey                 4                4 x 3grey (N,I,D,Im)
6    16x16 B&W vs 3grey           9                9 = (5 x B&W) + (4 x 3grey)
7    25x25 B&W                    5                5 x B&W (N,I,D,Im,E)
8    25x25 3 grey                 4                4 x 3grey (N,I,D,Im)
9    25x25 B&W vs 3grey           9                9 = (5 x B&W) + (4 x 3grey)
10   10x10 3 grey vs 16x16 B&W    8                8 = (4 x B&W) + (4 x 3grey)
11   10x10 3 grey vs 25x25 B&W    8                8 = (4 x B&W) + (4 x 3grey)
12   16x16 3 grey vs 25x25 B&W    8                8 = (4 x B&W) + (4 x 3grey)

Image presentation criteria:
1. Images with lower spatial resolution should be presented before higher resolution versions of the same image, to reduce learning effects.
2. Try to obtain as large a difference as possible, so pair the following attribute reference numbers from the above table: 1 – 9, 2 – 7, 3 – 8, 4 – 10, 5 – 12, 6 – 11.

Six booklets A – F measuring different combinations of image attributes:

Book  Chair1  Chair2  Post1  Post2  Steps1  Steps2  Face1  Face2  Door1  Door2
A     1-9     2-7     3-8    4-10   5-12    6-11    1-9    2-7    3-8    4-10
B     2-7     3-8     4-10   5-12   6-11    1-9     2-7    3-8    4-10   5-12
C     3-8     4-10    5-12   6-11   1-9     2-7     3-8    4-10   5-12   6-11
D     4-10    5-12    6-11   1-9    2-7     3-8     4-10   5-12   6-11   1-9
E     5-12    6-11    1-9    2-7    3-8     4-10    5-12   6-11   1-9    2-7
F     6-11    1-9     2-7    3-8    4-10    5-12    6-11   1-9    2-7    3-8

Booklet page order, chosen to ensure variety and reduce learning effects (20 pages):

Page:  1       2      3       4       5       6       7       8      9       10
       Chair1  Post1  Steps1  Face1   Door1   Steps2  Post2   Face2  Chair2  Door2
Page:  11      12     13      14      15      16      17      18     19      20
       Face1   Post1  Door1   Steps1  Chair1  Post2   Chair2  Door2  Face2   Steps2
A.3 Borderline Recognition Assessment for Section 4.2 Experiment
Contextually Accepted Rejected Accepted Toilet Pidgeon Park bench Chair on skateboard Baby pram Chair with armrests Vulture chairlift Giraffe/dog Flamingo Rooster Emu
Contextually Accepted Rejected Accepted Lounge chair a rocking [?] facing to the left
that’s sitting up straight Posture chair
High chair Rocking horse 3d chair on a rock Electric chair Baby’s pram Rocking chair Child’s pram Reclining chair Baby stroller Chair with wheels Kids move chair stool
Contextually Accepted Rejected Accepted Pillar holding up roof Tower Door of a house with welcome
mat Gravestone Foot Corner of wall Wall from side or column (building support)
Torch Door in the right corner
Closeups of walls in a maze Buildings/skyscraper Vase/box which is a very different colour to the walls of the room
Jug in the corner of the room A square Wall Vase / bottle Window Block Grass bush Rectangle A pile on Large rectangular window
Candle holder Corridors Toilet paper Can of food Cup on table Elevation shaft Fire hydrant Block of chocolate
Lighthouse
Object on a table (eg. mug, salt shaker)
pole
Tree trunk Statue on stand O/head view of street
Contextually Accepted Rejected Accepted
Block of dry ice Beaker Window / Window with tree outside
Computer screen Container
Gameboy screen Swimming pool Wall / Wall with window & curtain on right
Piece of paper Building Box (rectangular shaped object); A box (not clear enough to describe); Block or box
Blackboard Bucket A block that someone can sit on Cup / mug Square
Top view of a table/ desk Block of chocolate An enclosed space t.v. screen The sky Floor mat Fish bowl Towel hanging
Football field Hole / Hole in the ground / A square ditch
Piece of cloth (rectangular) hanging up on something
Box, possibly a floor pan
Contextually Accepted Rejected Accepted “If gender noted was wrong” eg. man with afro
George Washington
The back of a head Early American president Gorilla’s face/monkey Footballer with helmet
Mozart Beethoven
Shrub / face Bob Marley Person with lots of facial hair Ben Harper William Shakespeare Elizabeth Taylor Jimmy Hendrix Mon Lisa painting Artist
Contextually Accepted Rejected “If gender noted was wrong” eg. ladies face
Head of a bird Child molester from ch.7 news
An old man’s head in profile facing left
Flower Dero person
Mr. Ed the talking horse Side view of a person facing left Old person with glass with string attached (crying)
Crying person Side view of a dog’s head Guy with long hair Face with long hair Barbie
Accepted
Back of a persons head Human face, child
Contextually Accepted Rejected Accepted Water pump with hose on right Long thing with protrudence Cactus ‘L’ End of a pier Big flower Axe Powerpole with powerline
extending to right Streetlight/lamp Half an anchor Telephone pole Flower stem Sign pointing up Umbrella Diving board/platform Tree with branches/leaves Powerline tree Crane Joining of posts (T) Basketball hoop Winder on old clothesline Arrow pointing down Stick figure pointing gun to right
A tree with monkeys in it
Corn / maize Ladder Saxophone Spear with hook underneath Street sign Office building with stairs Traffic lights Skyscraper Flag facing right Waving finger Hockey stick Submarine Pencil? Windsurfer Flagpole Traffic light Tower? Road markings Lighthouse? Hand basin Sword pipes
Fish hook
Contextually Accepted Rejected Accepted Tree beside a hole/ditch in the ground to the left
Tall building A line
Worm/slug/caterpillar City with buildings Flagpole Driveway Birds eye view of coin box Something sticking up Road at night (with reflective median strip)
Caterpillar crawling A stick
Perspective view of a road White thing sticking up Tree / tree out of the ground / a lone tree
A stage with curtains pulled back on either side and pole in middle
Mountain with stick Street lamp / light Traffic lights Tower
Person standing out in the open Straight white line behind black surface
Pole with cave Tall archway/corridor Pole in foreground, mountain in background
Mountains with a pole at top
Contextually Accepted Rejected Accepted Multi-storey building with stepped storeys
Building
Contextually Accepted Rejected Accepted Stick Swimming pool Stairs leading up to a tree A very tall skinny tree in a raised garden bed
Tree
Flagpole/yellow & red beach poles
Clothesline
Life savers flag Powerline pole lightpole Flag on a golf course Street footpath Golf hole A stick in the ground Weed/seedling coming from ground
Something pole-like with a house/shelter in the distance on the right
Pole in ground
Appendix B Section 4.3 Experiment
B.1 Example Test Stimulus
CAN YOU TELL WHAT IS SHOWN IN EACH IMAGE?
1) Write a word under each image to describe the main object or content of the scene. If you can’t tell what is shown, write “Can’t tell”.
2) Put a circle around the images that you are confident about.
B.2 Borderline Recognition Assessment for Section 4.3 Experiment
Image Accepted Rejected
Lighthouse
tower, well, powerhouse, buildings, horizon, house on
cliff, post, high-rise building, chimney, house,
mineshaft, oil rig, sky, pole, jetty with post, watch
tower, small building, landscape
Buildings
houses, factory, households stairs, steps, trees, mountain,
hill, forest, house with
smoking chimney
Tree
plant, canyon, gully
Gorilla
woman's back, dog (common for 256x256_Binary
images), men, person sitting, person, teddy, bear toy,
animal, someone eating, godzilla, dinosaur, lady, person
leaning over, hair & head, man bending over
R2D2, people kissing, 2
people, duck
Capsicum
apple, fruit, pumpkin, jack-o-lantern ball, rose, balloon, love
heart, fist
Face
monster, side view of a head 2 people looking at jumping
dog, tiger, bird, duck
Flower
splat, flowers, centre of fruit eye, shooting target, food
plate, donut, letter 'O', 'Q',
box, clock, sign, square,
wheel, ball, door handle,
apple
Balloon
lightbulb, plane, sun, cloud, bird, aeroplane, moon dot, tennis ball, heart, block,
window, rectangle, wall
with window, star, footprint,
square, light, firefly
Duck
man, smiley face, person,
dog
Appendix C Chapter 5 Experiments

C.1 Example Test Stimulus – 7 Images Presented at the Same Time

Rank the images for how much visual information they contain: 1 = contains most visual information; 7 = contains least visual information.
C.2 Example Test Stimulus – 3 Images Presented at the Same Time

Rank the images for how much visual information they contain: 1 = contains most visual information; 3 = contains least visual information.
C.3 Example Test Stimulus – Paired Comparison Experiments
WHICH IMAGE APPEARS TO CONTAIN MORE INFORMATION?
[Rating boxes shown beneath the image pair:]
DEFINITELY LEFT IMAGE | SLIGHTLY LEFT IMAGE | SAME | SLIGHTLY RIGHT IMAGE | DEFINITELY RIGHT IMAGE

The subjects were also given the following comments relating to this question:

There are 5 boxes under the image pair to be compared:
Box 1 = left image has much more information than right image
Box 2 = left image has slightly more information than right image
Box 3 = images have same amount of visual information
Box 4 = right image has slightly more information than left image
Box 5 = right image has much more information than left image
C.4 Booklet Design

Booklet design was chosen to minimize learning effects: 9 booklets A – I, with 3 parts in each booklet.

PART 1 – Develop metric; 2 methods: 1) paired comparison, 2) presenting 7 images all at once.
PART 2 – Validate metric; 7 additional visual dimensions; 3 images were presented to subjects at one time.
PART 3 – Check if correlated with recognition.
BOOK  Image Set  Paired Comparison (1, 2, 3)  All at once  OBJECT  NUMBER  ANGLE  DISTANCE  CONN.  DETAIL  CONTR  VARIETY
A     a          10B  16F  OB                 25F          10B     25F     O      16F       OB     25B     OE     25B
B     b          "    "    "                  25B          10F     OE      10B    25B       O      25F     OB     25F
C     c          "    "    "                  OE           16B     OB      10F    25F       10B    OE      O      16B
D     a          16B  25F  OE                 10F          16F     O       16B    OE        10F    OB      10B    O
E     b          "    "    "                  10B          25B     10B     16F    OB        16B    O       10F    OB
F     c          "    "    "                  O            25F     10F     25B    O         16F    10B     16B    10F
G     a          25B  10F  O                  16B          OE      16B     25F    10B       25B    10F     16F    16F
H     b          "    "    "                  16F          OB      16F     OE     10F       25F    16B     25B    OE
I     c          "    "    "                  OB           O       25B     OB     16B       OE     16F     25F    10B

10B = 10x10 binary; 10F = 10x10 full grey; 16B = 16x16 binary; 16F = 16x16 full grey; 25B = 25x25 binary; 25F = 25x25 full grey; OB = original (256x256) binary; OE = original (256x256) edge; O = original (256x256)

Image Set (a): Balloon – Caps; Balloon – Flower; Caps – Tree; People – Tree; People – Building; Building – Lighthouse; Flower – Lighthouse
Image Set (b): Balloon – People; Balloon – Lighthouse; Caps – Building; Caps – Lighthouse; People – Flower; Building – Tree; Flower – Tree
Image Set (c): Balloon – Building; Balloon – Tree; Caps – Flower; Caps – People; People – Lighthouse; Building – Flower; Lighthouse – Tree

Binary comparison of 7 different objects = 21 comparisons.
Appendix D Chapter 6 Experiment
D.1 Example Test Stimulus
RANK THE BOXES 1 → 4 ACCORDING TO HOW THEY BEST (IE MOST INFORMATIVELY) REPRESENT THE ORIGINAL SCENE SHOWN ON THE LEFT
1 = IMAGE THAT REPRESENTS THE ORIGINAL IMAGE THE BEST*; 4 = THE WORST
D.2 Booklet Design

Two images were chosen – lighthouse (outdoor image) and chair (office image). Four low quality versions were shown beside an original image. The order of presentation differed between the two images.

Lighthouse images shown in Section D.1, from left to right:
ORIGINAL; "OUTDOOR" WEIGHTS; BASE CASE (no importance processing); EQUAL WEIGHTS; "OFFICE" WEIGHTS

Chair images shown in Section D.1, from left to right:
ORIGINAL; BASE CASE (no importance processing); EQUAL WEIGHTS; "OFFICE" WEIGHTS; "OUTDOOR" WEIGHTS
Appendix E Chapter 7 Experiments
E.1 Training Image Database
Streetscape
Café/Restaurant
Heads/Shoulders
Beach
Office
Home
E.2 Example Test Stimuli for Section 7.2 Experiment

If you were trying to move through this scene, which version would you find most helpful?

[Six versions of the scene, labelled a to f]
E.3 Example Test Stimuli for Section 7.3 Experiment

Zoom window shown on original:
25x25 black and white version of zoom window:

Consider the same image again. Imagine you could zoom in to one part of the image. Which zoomed version(s) shown on the bottom row would you find most helpful if you were moving through the scene? The part of the original image from which the zoomed image has been taken is shown above for interest.

[Zoomed versions labelled a to g; version g is NO ZOOM]
E.4 Booklet Order for Chapter 7 Experiments
Processing methods (referred to by numbers 1 – 6 in the placement table below):

No.  PART A                     PART B
1    IM eq                      sal_trim
2    IM opt                     sal_scope
3    IM sc                      IM_trim
4    tr IM                      IM_scope
5    No importance processing   centre
6    edge                       bottom

Random placement of images (columns are booklets A – F):

IMAGE TYPE  IMAGE No.  A  B  C  D  E  F
café 1      C1         1  2  3  4  5  6
street 1    S1         3  5  4  6  1  2
house 1     H1         2  1  6  5  3  4
face 1      F1         5  4  2  3  6  1
office 1    O1         4  6  1  2  3  5
beach 1     B1         6  4  5  1  2  3

PART A – ROI applied to entire image
PART B – ROI used for automatic zoom
The table above is repeated for all 4 image sets (café, street, house, face, office and beach images 2, 3 and 4).
Numbers 1 – 6 refer to the processing method used (refer table above).
Booklets had 48 pages – 24 for PART A and 24 for PART B.
PART A and PART B were shown in consecutive page order for each image.