
UNIVERSITY OF GRONINGEN
JOHANN BERNOULLI INSTITUTE FOR MATHEMATICS AND COMPUTER SCIENCE

UNIVERSITY OF SALERNO
DEPARTMENT OF INFORMATION AND ELECTRICAL ENGINEERING AND APPLIED MATHEMATICS

BIO-INSPIRED ALGORITHMS FOR PATTERN RECOGNITION IN AUDIO AND IMAGE PROCESSING

A dissertation supervised by promotors

PROF. DR. SC. TECHN. NICOLAI PETKOV

PROF. DR. MARIO VENTO

and submitted by

NICOLA STRISCIUGLIO

in fulfillment of the requirements for the Degree of

PHILOSOPHIÆ DOCTOR (PH.D.)

May 2016

ISBN: 978-90-367-8931-8 (ISBN ebook: 978-90-367-8932-5)


Bio-inspired algorithms for pattern recognition in audio and image processing

PhD thesis

to obtain the degree of PhD at the University of Groningen
on the authority of the Rector Magnificus Prof. E. Sterken
and in accordance with the decision by the College of Deans.

This thesis will be defended in public on

Friday 10 June 2016 at 09.00 hours

by

Nicola Strisciuglio

born on 16 November 1987 in Nocera Inferiore, Salerno, Italy


Supervisors
Prof. N. Petkov
Prof. M. Vento

Co-supervisor
Dr. G. Azzopardi

Assessment committee
Prof. A.C. Telea
Prof. C.N. Schizas
Prof. V. Loia
Prof. X. Jiang


This research has been conducted at the Intelligent Systems group of the Johann Bernoulli Institute for Mathematics and Computer Science (Onderzoeksinstituut: JBI) of the University of Groningen and at the MIVIA research group of the Department of Information and Electrical Engineering and Applied Mathematics (DIEM) of the University of Salerno.

This research has been supported by the University of Groningen through an "Ubbo Emmius" scholarship for international sandwich PhD programs and by the Department of Information and Electrical Engineering and Applied Mathematics of the University of Salerno through a research grant on the project "Embedded systems in critical domains" (cod. 4-17-12, P.O.R. Campania FSE 2007-2013).

Bio-inspired algorithms for pattern recognition in audio and image processing
Nicola Strisciuglio
ISBN: 978-90-367-8931-8 (printed version)
ISBN: 978-90-367-8932-5 (electronic version)


To my lovely family


Abstract

This thesis investigates the construction of pattern recognition systems that are based on the computation of features inspired by the characteristics of the human auditory and visual systems. The thesis addresses two important applications in the fields of intelligent audio surveillance and medical image analysis. In particular, we propose two algorithms for the detection of audio events that can occur with various levels of signal-to-noise ratio (SNR) and two algorithms for the delineation of blood vessels in retinal fundus images.

Audio analysis for the detection of events of interest has recently attracted considerable interest in the pattern recognition community, due to the increasing demand for safety in public and private environments and the consequent need for improved surveillance systems. Traditional applications of audio analysis concern speech recognition, speaker identification and music classification. They usually require that the sound source is close to the microphone, which implies a low influence of noise on the functioning of the overall system. In applications like event detection for audio surveillance, the source of the sound of interest can be at any distance from the microphone. Thus, the detection system has to be able to detect events at various levels of SNR, sometimes also negative. Another key requirement for an audio surveillance system is the ability to detect events of interest when they are mixed with different kinds of background noise. Such constraints make the problem at hand very different from traditional applications of audio analysis. Intelligent audio surveillance is a recent research field, and at the time of this work no public data sets were available for testing event detection algorithms. Thus, we constructed and publicly released two new data sets of abnormal events that can occur in everyday life, which we called the MIVIA audio event and MIVIA road event data sets.
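To make the SNR requirement concrete, the following Python sketch shows one standard way of superimposing an event clip on a background recording at a prescribed SNR, similar in spirit to how such test material can be produced; the function name and the white-noise stand-ins are illustrative, not taken from this work.

```python
import numpy as np

def mix_at_snr(event: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Superimpose `event` on `background`, scaled to a target SNR in dB.

    SNR is defined on signal powers: SNR_dB = 10 * log10(P_event / P_background).
    A negative snr_db makes the event weaker than the background.
    """
    background = background[: len(event)]
    p_event = np.mean(event ** 2)
    p_background = np.mean(background ** 2)
    # Gain applied to the event so that the mixture reaches the requested SNR.
    gain = np.sqrt(p_background / p_event * 10 ** (snr_db / 10.0))
    return gain * event + background

# Example: an event buried in background noise at -5 dB SNR.
rng = np.random.default_rng(0)
event = rng.standard_normal(16000)        # stand-ins for real audio clips
background = rng.standard_normal(16000)
mixture = mix_at_snr(event, background, snr_db=-5.0)
```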

We start from the consideration that an audio stream is composed of small, atomic units of sound, similarly to a piece of text that is composed of a number of words. We propose a system for the detection of audio events based on the bag of features approach. Since the events of interest can be mixed with various types of background noise, we tailored the training phase of the proposed method in order to build a system that is robust to such variability. We tested the system for the detection of glass breaking, gun shot and scream events in public and private environments by using the audio clips in the MIVIA audio event data set. We achieved a high recognition rate (up to 86.7%) with a very low false positive rate (2.1% on the whole test set). Subsequently, we extended the system to the monitoring and surveillance of roads, with the aim of detecting anomalous situations such as car crash and tire skidding events. We designed a deployment strategy for different kinds of road (from very calm country roads to very busy cities or motorways), based on an internationally accepted road noise model. We carried out experiments on the MIVIA road event data set and achieved a recognition rate (82%) and a false positive rate (2.85%) that confirm the performance achieved on the MIVIA audio event data set.
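As a minimal sketch of the bag-of-features idea described above (scikit-learn and k-means dictionary learning are assumptions for illustration; the thesis does not specify its implementation here): frame-level descriptors are quantized against a learned dictionary of "audio words", each clip is summarized by a normalized word histogram, and a classifier is trained on the histograms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def clip_histogram(frame_features: np.ndarray, dictionary: KMeans) -> np.ndarray:
    """Map a clip's frame-level features to a normalized word-count histogram."""
    words = dictionary.predict(frame_features)          # one "word" per frame
    hist = np.bincount(words, minlength=dictionary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_bag_of_features(train_clips, train_labels, n_words=64):
    """train_clips: list of (n_frames_i, n_features) arrays; one label per clip."""
    all_frames = np.vstack(train_clips)
    dictionary = KMeans(n_clusters=n_words, random_state=0).fit(all_frames)
    histograms = np.array([clip_histogram(c, dictionary) for c in train_clips])
    classifier = SVC(kernel="rbf").fit(histograms, train_labels)
    return dictionary, classifier
```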

In a further study, we take inspiration from some characteristics of the human auditory system to propose trainable filters, which we call CoPE filters, that automatically determine the important features from training audio samples. One of the critical steps in the construction of a pattern recognition system is, indeed, the choice of the most appropriate set of features for the particular problem at hand, i.e. a feature engineering step. The CoPE filters are trainable, as their structure is not fixed in the implementation but is instead learned during a configuration process from training samples. This eliminates the need for a feature engineering step. The important features are learned directly from the events of interest, making the system easily adaptable to different sound recognition tasks and requiring less knowledge about the specific domain of application. We employ the responses of a bank of CoPE filters to build feature vectors that we use to describe the input audio stream. We train a classifier with such feature vectors in order to perform the detection task. We carried out experiments on the MIVIA audio event and MIVIA road event data sets, achieving a recognition rate (higher than 94%) and a false positive rate (less than 4%) that are considerably better than the results achieved by the approach based on the bag of features architecture.
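As a rough illustration of the trainable-filter idea (a loose sketch under simplifying assumptions, not the CoPE implementation of Chapter 4; the handling of tolerances and peak weighting is simplified), a filter can be configured by recording the positions of the strongest energy peaks in a time-frequency representation of a prototype sound, and its response can be computed by combining the blurred energy map at those learned positions:

```python
import numpy as np
from scipy.ndimage import maximum_filter, gaussian_filter

def configure_filter(energy: np.ndarray, n_points: int = 8):
    """Record (channel, time offset) pairs for the strongest local peaks of a
    prototype's time-frequency energy map; offsets are taken relative to the
    time of the highest peak (the reference point)."""
    is_peak = energy == maximum_filter(energy, size=5)
    coords = np.argwhere(is_peak)
    order = np.argsort(energy[is_peak])[::-1][:n_points]
    strongest = coords[order]
    t_ref = strongest[0][1]
    return [(int(f), int(t - t_ref)) for f, t in strongest]

def filter_response(energy: np.ndarray, points, tolerance: float = 2.0):
    """Response over time: geometric mean of the blurred energy sampled at the
    learned (channel, time offset) positions. Blurring tolerates small
    deviations; wrap-around at the borders is ignored in this sketch."""
    blurred = gaussian_filter(energy, sigma=tolerance)
    response = np.ones(energy.shape[1])
    for f, dt in points:
        response *= np.maximum(np.roll(blurred[f], -dt), 1e-12)
    return response ** (1.0 / len(points))
```

The responses of a bank of such filters, each configured on a different prototype, can then be concatenated into the feature vectors mentioned above.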

In the second part of the thesis we address an important application in the field of medical image analysis, i.e. the segmentation of blood vessels in retinal fundus images. Retinal fundus imaging is a non-invasive tool that is widely employed by medical experts to diagnose various pathologies such as glaucoma, age-related macular degeneration, diabetic retinopathy and atherosclerosis. There is also evidence that such images may contain signs of non-eye-related pathologies, including cardiovascular and systemic diseases. In recent years, medical communities have given particular attention to the early diagnosis and monitoring of diabetic retinopathy, since it is one of the principal causes of blindness in the world. The manual inspection of retinal fundus images requires highly skilled people, which results in an expensive and time-consuming process. Thus, the mass screening of a population is not feasible without the use of computer-aided diagnosis systems. Such systems could be used to refer only the patients with suspicious signs of disease to medical experts.

We introduce a novel method for the automatic segmentation of vessel trees in retinal fundus images. We propose a filter that selectively responds to vessels, which we call B-COSFIRE, with B standing for bar, an abstraction of a vessel. It is based on the existing COSFIRE (Combination Of Shifted Filter Responses) approach. A B-COSFIRE filter achieves orientation selectivity by computing the weighted geometric mean of the output of a pool of Difference-of-Gaussians filters, whose supports are aligned in a collinear manner. It achieves rotation invariance efficiently by simple shifting operations. The proposed filter is versatile, as its selectivity is determined from any given vessel-like prototype pattern in an automatic configuration process. The results that we achieve on three publicly available data sets (DRIVE: Se = 0.7655, Sp = 0.9704; STARE: Se = 0.7716, Sp = 0.9701; CHASE_DB1: Se = 0.7585, Sp = 0.9587) are better than those of many state-of-the-art methods.
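A minimal sketch of this core computation, assuming a vertical-bar prototype and hypothetical parameter values (the actual filter of Chapter 5 configures the shift positions and blur automatically from a prototype, and obtains rotation invariance by evaluating rotated sets of positions and taking the maximum response):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_response(image: np.ndarray, sigma: float) -> np.ndarray:
    """Half-wave rectified, center-on Difference-of-Gaussians response."""
    r = gaussian_filter(image, 0.5 * sigma) - gaussian_filter(image, sigma)
    return np.maximum(r, 0.0)

def b_cosfire_like(image: np.ndarray, sigma: float = 2.4,
                   rhos=(0, 2, 4, 6), sigma0: float = 2.0, alpha: float = 0.7):
    """Response to a vertical bar: blur the DoG response (more for positions
    farther from the center), shift it from positions along the bar axis back
    to the center, and combine the shifted maps by a weighted geometric mean."""
    dog = dog_response(image, sigma)
    sigma_hat = max(rhos) * 2 / 3          # hypothetical spread of the weighting
    result = np.ones_like(image)
    total_weight = 0.0
    for rho in rhos:
        blurred = gaussian_filter(dog, sigma0 + alpha * rho)
        weight = np.exp(-rho ** 2 / (2 * sigma_hat ** 2))
        for shift in ({0} if rho == 0 else {rho, -rho}):
            result *= np.maximum(np.roll(blurred, shift, axis=0), 1e-12) ** weight
            total_weight += weight
    return result ** (1.0 / total_weight)
```

The geometric mean acts as a soft AND: the filter responds strongly only where all collinear positions respond, which is what gives the selectivity to elongated structures.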

In the last part of the thesis, we further investigate the flexibility and adaptability of the proposed B-COSFIRE filters and propose to employ them within a classification pipeline. The framework that we propose automatically determines the most appropriate subset of filters for the application at hand. Initially, we configure a bank of B-COSFIRE filters and use the responses obtained on training retinal images to form pixel-wise feature vectors, which describe vessel and non-vessel pixels. Then, we employ various techniques based on information theory and machine learning to select an optimal subset of B-COSFIRE filters. We finally train a classifier by using feature vectors constructed with the responses of the selected filters and employ it to classify every pixel in a test image. The improvement of the results that we achieve on the DRIVE and STARE data sets with respect to the unsupervised B-COSFIRE filters is statistically significant.

We studied the computational requirements of the proposed algorithms in order to evaluate their applicability in real-world systems and their fulfillment of the real-time constraints posed by the considered problems.

This thesis contributes to the development of bio-inspired algorithms for audio and image processing and promotes their use in higher-level pattern recognition systems.


Samenvatting

This thesis investigates the construction of pattern recognition systems based on features inspired by the properties of the human visual and auditory systems. The thesis addresses two important applications in the fields of intelligent audio surveillance and medical image analysis. In particular, we propose two algorithms for the detection of audio events that can occur at various levels of signal-to-noise ratio, and two algorithms for the segmentation of blood vessels in retinal fundus images.

Audio analysis for the detection of events of interest has recently gained interest in the field of pattern recognition, owing to the growing need for safety in public and private domains and the resulting demand for better surveillance systems. Typical applications of audio analysis include speech recognition, speaker identification and music classification. These applications normally require the sound source to be close to the microphone, in order to limit the influence of noise on the functioning of the overall system. In applications such as event detection for audio surveillance, the source of the sound can be at any distance from the microphone. The detection system therefore has to be able to detect events at various SNR levels. Another requirement for an audio surveillance system is the ability to detect events of interest when they are mixed with different kinds of background noise. Such constraints make the problem at hand very different from classical applications of audio analysis. Intelligent audio surveillance is a recent research field, and at the time of this research no public data sets were available for testing event detection algorithms. We have therefore designed and released two new data sets of abnormal events. The data sets, named MIVIA audio event and MIVIA road event, contain abnormal events that can occur in everyday life.

Our starting point is the consideration that an audio stream consists of small, atomic units of sound, just as a piece of text consists of a number of words. We propose a system for the detection of audio events based on the bag of features approach. Since the events of interest can be mixed with different kinds of background noise, we tuned the training phase of the proposed method in order to design a system that can withstand such variability. The system was tested on the detection of breaking glass, gun shots and screams in public and private environments, using the audio clips in the MIVIA audio event data set. We achieved a high recognition rate (up to 86.7%) with a very low false positive rate (2.1% on the whole test set). We subsequently extended the system for deployment in road surveillance, with the aim of detecting anomalous situations such as collisions or tire skidding. We designed a deployment strategy for different types of road (from very quiet country roads to busy cities or motorways), based on an internationally recognized road noise model. The experiments with the MIVIA road event data set yielded a recognition rate (82%) and a false positive rate (2.85%) that confirm the results achieved with the MIVIA audio event data set.

In a follow-up study inspired by some properties of the human auditory system, we propose trainable filters, called CoPE filters, which automatically determine the important features from training audio samples. A crucial step in the construction of a pattern recognition system is the choice of the most appropriate set of features for the task at hand, i.e. the feature engineering step. The CoPE filters are trainable, since their structure is not fixed in the implementation; it is instead learned during a configuration process from training samples. This makes a feature engineering step superfluous. The important features are acquired directly from the events of interest, which makes the system adaptable to different sound recognition tasks and reduces the required knowledge of the specific application domain. We use the responses of a bank of CoPE filters to build feature vectors with which we describe the input audio stream. A classifier is trained with such feature vectors to carry out the detection task. We performed the experiments on the MIVIA audio event and MIVIA road event data sets, which yielded a recognition rate (higher than 94%) and a false positive rate (less than 4%) that considerably improve on the results achieved with the approach based on the bag of features design.

In the second part of this thesis we address an important application in the field of medical image analysis, namely the segmentation of blood vessels in retinal fundus images. Retinal fundus imaging is a non-invasive tool that is widely used by medical specialists to diagnose various diseases, including glaucoma, age-related macular degeneration, diabetic retinopathy and atherosclerosis. There is also evidence that such images may contain signs of non-eye-related conditions, including cardiovascular and systemic diseases. In recent years, medical communities have paid particular attention to the early diagnosis and monitoring of diabetic retinopathy, since it is one of the principal causes of blindness in the world. The manual inspection of retinal fundus images requires highly skilled personnel, which makes it a very expensive and time-consuming process. Mass screening of a population is therefore not feasible without the use of computer-aided diagnosis systems. Such systems could be used to refer only the patients with suspicious signs of disease to medical specialists.

We introduce a new method for the automatic segmentation of vessel trees in retinal fundus images. We propose a filter that selectively responds to blood vessels, called B-COSFIRE, the B referring to bar, an abstraction of a blood vessel. It is based on the existing COSFIRE (Combination Of Shifted Filter Responses) approach. A B-COSFIRE filter achieves orientation selectivity by computing the weighted geometric mean of the output of a pool of Difference-of-Gaussians filters whose supports are aligned in a collinear manner. It achieves rotation invariance efficiently by means of simple shift operations. The proposed filter is versatile, since its selectivity is determined by any given vessel-like prototype pattern in an automatic configuration process. The results that we achieved on three publicly available data sets (DRIVE: Se = 0.7655, Sp = 0.9704; STARE: Se = 0.7716, Sp = 0.9701; CHASE_DB1: Se = 0.7585, Sp = 0.9587) are better than those of many state-of-the-art methods.

The last part of the thesis describes a follow-up study on the flexibility and adaptability of the proposed B-COSFIRE filters, in which we propose to employ them within a classification pipeline. The framework that we propose automatically determines the most suitable subset of filters for the application at hand. First, we configure a bank of B-COSFIRE filters and use the responses obtained on training retinal images to form pixel-wise feature vectors that describe vessel and non-vessel pixels. We then apply various techniques based on information theory and machine learning to select an optimal subset of B-COSFIRE filters. The improvement of the results that we achieve on the DRIVE and STARE data sets with respect to the unsupervised B-COSFIRE filters is statistically significant.

We studied the computational requirements of the proposed algorithms in order to evaluate both their applicability in real-world applications and their fulfillment of the real-time constraints posed by the problems considered.

This thesis contributes to the development of bio-inspired algorithms for audio and image processing and promotes their application in higher-level pattern recognition systems.


Contents

List of Figures

List of Tables

Acknowledgements

1 Introduction
  1.1 Scope
  1.2 Thesis Organization

2 Audio events detection in noisy environments
  2.1 Introduction
  2.2 The proposed method
    2.2.1 Short-time and long-time descriptors
    2.2.2 The classifier
  2.3 Experimental results
    2.3.1 Performance evaluation
    2.3.2 Performance comparison
    2.3.3 Sensitivity analysis
  2.4 Conclusions

3 Design of a practical system for audio surveillance of roads
  3.1 Introduction
  3.2 Method
    3.2.1 Low-level features extraction
    3.2.2 Dictionary learning
    3.2.3 High-level representation
    3.2.4 Classification architecture
  3.3 Deployment Architecture
    3.3.1 Intensity level of the event of interest
    3.3.2 Intensity level of the traffic noise
    3.3.3 Architecture discussion
  3.4 Experimental results
    3.4.1 The data set
    3.4.2 Experimental setup
    3.4.3 Performance evaluation
    3.4.4 Sensitivity analysis
    3.4.5 Real-time performance
  3.5 Conclusions

4 Trainable CoPE filters for audio events detection
  4.1 Introduction
  4.2 Rationale
  4.3 Method
    4.3.1 Gammatone filterbank
    4.3.2 CoPE filter
    4.3.3 A bank of CoPE filters
    4.3.4 Classifier
  4.4 Data sets
    4.4.1 MIVIA audio events
    4.4.2 MIVIA road events
  4.5 Experiments
    4.5.1 Performance and results
    4.5.2 Sensitivity analysis
    4.5.3 Results comparison
  4.6 Discussion
  4.7 Conclusion

5 Retinal vessel delineation using trainable B-COSFIRE filters
  5.1 Introduction
  5.2 Proposed method
    5.2.1 Overview
    5.2.2 Detection of Changes in Intensity
    5.2.3 Configuration of a B-COSFIRE Filter
    5.2.4 Blurring and Shifting DoG Responses
    5.2.5 Response of a B-COSFIRE Filter
    5.2.6 Achieving Rotation Invariance
    5.2.7 Detection of Bar Endings
  5.3 Results
    5.3.1 Data Sets and Ground Truth
    5.3.2 Pre-processing
    5.3.3 Performance Measurements
    5.3.4 Results
  5.4 Discussion
  5.5 Conclusions

6 Automatic selection of an optimal set of B-COSFIRE filters
  6.1 Introduction
  6.2 Method
    6.2.1 B-COSFIRE filters
    6.2.2 A bank of B-COSFIRE filters
    6.2.3 Feature transformation and rescaling
    6.2.4 Automatic subset selection of B-COSFIRE filters
    6.2.5 Classification
    6.2.6 Application phase
  6.3 Materials
    6.3.1 Data sets
    6.3.2 B-COSFIRE implementation
  6.4 Experiments
    6.4.1 Pre-processing
    6.4.2 Evaluation
    6.4.3 Results
    6.4.4 Statistical analysis
    6.4.5 Comparison with existing methods
  6.5 Discussion
  6.6 Conclusions

7 Summary and Outlook
  7.1 Summary
  7.2 Outlook

Bibliography

Research Activities

List of Figures

1.1 Architecture of a pattern recognition system.
2.1 Overview of the system architecture.
2.2 ROC curves.
2.3 Comparison of miss and false positive rates.
3.1 Histogram construction.
3.2 Multi-class SVM classifier.
3.3 Scheme of the deployment architecture.
3.4 SNR variation with respect to distance.
3.5 Performance results and robustness analysis.
3.6 ROC curves of the proposed system.
4.1 Overview of the proposed CoPE approach.
4.2 Gammatonegrams of typical events of interest.
4.3 Detection Error Trade-off curves.
4.4 Examples of CoPE filters response.
5.1 Examples of retinal images.
5.2 Sketch of the proposed B-COSFIRE filter.
5.3 Example prototype line patterns.
5.4 Configuration of a B-COSFIRE filter.
5.5 Orientation-tuned B-COSFIRE filters responses.
5.6 Configuration of an asymmetric B-COSFIRE filter.
5.7 Examples of response to bar endings.
5.8 Pre-processing step.
5.9 ROC curves for the considered data sets.
5.10 Examples of segmented images.
5.11 Analysis of robustness to noise.
6.1 Examples of fundus images.
6.2 Configuration of a B-COSFIRE filter.
6.3 Response of B-COSFIRE filters at various scales.
6.4 Feature relevances.
6.5 Sketch of the application phase of the proposed method.
6.6 MCC as a function of the top performing features.
6.7 ROC curves achieved on the DRIVE and STARE data sets.


List of Tables

2.1 Short-time descriptors.
2.2 MIVIA audio events data set details.
2.3 Results of the proposed system on the test set.
2.4 Results at different SNR values.
2.5 Comparison of area under ROC curves.
2.6 Results achieved by Conte et al. (2012).
2.7 Performance on the data set of Aurino et al. (2014).
2.8 Performance achieved by Aurino et al. (2014).
2.9 Sensitivity analysis.
3.1 Features sets.
3.2 Noise model parameters.
3.3 MIVIA road events data set details.
3.4 Results for K = 64 clusters.
3.5 Classification matrices.
3.6 Results sensitivity.
4.1 Extended MIVIA audio events data set details.
4.2 MIVIA road events data set details.
4.3 Classification matrix on MIVIA audio events data set.
4.4 Detailed SNR performance results.
4.5 Average classification results on MIVIA road events data set.
4.6 Sensitivity with respect to the parameters of the filter.
4.7 Performance comparison on MIVIA audio events data set.
4.8 Performance comparison on MIVIA road events data set.
5.1 Performance results of the symmetric B-COSFIRE filter.
5.2 Performance results of the asymmetric B-COSFIRE filter.
5.3 Performance results of the combined B-COSFIRE filters.
5.4 Results comparison on DRIVE data set.
5.5 Results comparison on STARE data set.
5.6 Results comparison on CHASE_DB1 data set.
5.7 Sensitivity to the parameters of the B-COSFIRE filter.
5.8 Comparative analysis of the processing time.
6.1 Results comparison on the DRIVE data set.
6.2 Results comparison on the STARE data set.
6.3 Comparison with results of existing methods.


Acknowledgements

In these last almost four years I have learned that a doctorate is not a personal, solitary path towards the thesis, but rather a life experience that involves many people from whom you learn better who you are and, probably, who you want to become. All the people I met along this experience have taught me something, about science or about life, and thus I will be thankful to them forever.

Thanks to my doctor-fathers Nicolai Petkov and Mario Vento. You taught me what a scientist should be and how he should live his scientific life. Thanks for the interesting discussions, in the labs and outside the labs, thanks for the suggestions and for your invaluable guidance, thanks for the opportunities you gave me. Thanks for arranging this amazing sandwich PhD degree for me. You will always be inspiring persons to me and I will be forever grateful to you.

I want to especially thank George Azzopardi, my co-promotor but first of all my great friend, and Alessia Saggese. It is because of you that my passion for research exploded and became huge, and it is also because of you that I managed to complete this experience. Thank you for the scientific and personal chats and for being such great friends and collaborators.

I am grateful to Professors Alexander C. Telea, Christos Schizas, Vincenzo Loia and Xiaoyi Jiang for their service as members of the assessment committee of this thesis. Thanks for your reviews and for your insightful comments and suggestions.

It has been an honor for me to share the labs in Groningen and in Salerno with great scientists and teachers. I want to express my gratitude to Michael Biehl, Pasquale Foggia, Michael Wilkinson, Gennaro Percannella and Pierluigi Ritrovato. You have all contributed to my scientific growth, directly or indirectly.

I want to express my sincere thanks to Laura and Walter, my lovely paranymphs, for accepting to stand by my side on one of the days that will mark my life. Laura, with you I shared almost the whole PhD trip, happiness and worries, and I feel very honored and proud of that. We became great friends, and exchanging the paranymph role with each other definitely and "officially" tightens our friendship (even if it is already tightened by our scars). Walter, you and I became friends all of a sudden and in a very intense and pure way. I thank you for being such a good friend and for all the great moments we had and will have together.

Thanks to my fellow PhD students and friends, both in Groningen and in Salerno, because I always felt part of a big international family at work: Astone and Jiapan, Daniel and Renata, Andreas, Manuel, the great Maestro Ugo, Danilo, Estefania, Fritz, Sophia, Laura, Christian, Charmaine, Ahmad, Antonio "Tonino", Raffaele "Paffo" and Rosario. A special thanks goes to Vincenzo Carletti, because we started our PhD together in Salerno and shared most of our research discussions and troubles. I gently offered you many moments of distraction, but you also gave me the possibility of doing that.

Thanks to my student collaborators who helped me during these four years of study and research: Giuseppe D'Alessio, Walter Petretta, Raffaele Cafaro, Francesco De Martino, Nicola Apicella, Ilario Nenna and Marco Gerardi.

I am grateful to Kitty de Vries, who helped me with the translation of the abstract into the Dutch samenvatting and whom I am discovering to be a good friend.

I am thankful to the administrative staff of the University of Groningen and the University of Salerno, who made my bureaucratic life easier and managed to complete the double degree agreement: Janieta de Jong-Schlukebir, Esmee Elshof, Ineke Schelhaas, Desiree Hansen, Ingrid Veltman, Annette Korringa, Barbara Visser, Elena Caracciolo, Nunzia Fraiese and Silvia Governatori.

I want to thank my great friend Raffaele Fiducioso, who designed the cover of my thesis. My friends are a big part of my life and they all contributed to this thesis with their support and love. The cover is a witness that friendship fills your life and helps you achieve your goals. I want to express my gratitude and my love to all my friends from Italy and to all those I have met during my stays in Groningen, who are now spread all over the world.

I cannot express with words how much I am grateful to my parents. Thank you, Mum and Dad, for raising me with love and patience, for educating me and allowing me to study, with freedom of choice and with passion. It is thanks to you that today I am completing my doctorate. And it is thanks to you, who placed your trust in me, that today I am working to realize my dream. Words are not enough to tell you how much I love you and how grateful I am to you. The best parents I could have. Thanks to my lovely sister, Mary, for your pure love and for never having stopped being on my side.

Last, but not least, I want to express my immense gratitude and love to Arianna. You have been a strong shoulder when I was in difficulty during the PhD studies and a great partner in many moments of fun. You have supported and loved me constantly, and still today you are supporting and helping me to realize my dreams. Thank you from the bottom of my heart.

Nicola Strisciuglio
May 17, 2016


Chapter 1

Introduction

From a very young age, we can learn new concepts very quickly and easily distinguish between different kinds of objects or sounds. If we see a single object or hear a particular sound, we are then able to recognize that sample, or even different versions of it, in other scenarios. As an example, if one sees an iron chair and associates the object with the general concept of "chairs", he will also be able to detect and recognize wooden or wicker chairs. Similarly, when we hear the sound of a particular event, such as a scream, we are then able to recognize other kinds of scream that occur in other, different environments. We learn and store models of reality, which we then unconsciously apply in new situations in order to understand the surrounding environment. Such learning abilities can be attributed to the power and flexibility of the brain, which exploits the functions and characteristics of the auditory and visual systems. These systems have been studied over the years and have inspired the work of many researchers in different disciplines such as cognitive neuroscience, psychology, and visual and auditory perception.

In hearing research, on the one hand, a great deal of information about the single components of the outer, middle and inner ear has been collected. Existing models aim at explaining the psychology of hearing from what we know about the functions of each anatomical component. For instance, Poveda and Meddis (2001) and Meddis (2006) modeled the conversion of sound pressure waves into electrical stimuli on the auditory nerve in the outer and middle ear. Jeffress (1948), instead, proposed an early model of the superior olivary nucleus of the brainstem, which inspired the work of researchers on sound source localization (Blauert, 2013). Nowadays, a complete understanding of hearing is not available, but the abundance of models of the single parts and the use of computers are helping to reach that goal.

On the other hand, great interest has historically been dedicated by the scientific community to the study of the visual system, starting from the seminal work of Hubel and Wiesel (1962), awarded the Nobel Prize in medicine, on the discovery of neurons devoted to the detection of bars and edges in the primary visual cortex of cats. Their work opened a fruitful research path that led to the understanding of the complete organization of the visual system. Moreover, it inspired computer vision researchers who proposed models of the neurons in area V1 of the primary visual cortex, such as the Gabor model (Daugman, 1985) and the CORF model (Azzopardi and Petkov, 2012). Other researchers, instead, recognized the importance of the hierarchical organization of the visual system and applied it to computer vision (Geman and Geman, 1984).

[Figure 1.1: block diagram with stages input data, pre-processing, feature extraction, feature learning, feature selection, model training, classification/rejection, decision]

Figure 1.1: Architectural scheme of a pattern recognition system. The input data are pre-processed, according to the specific needs of the system, and then features are computed to extract important properties from the data. The features to be computed can be determined by an engineering process or can be learned from the data (feature learning). Feature selection procedures are usually involved in order to determine a subset of discriminant features that are then used to train a classifier, which determines a model of the training data. Such a model is then used in the operating phase of the system, where a classifier is employed to take decisions on the input data.

In this thesis, we start from the consideration that our brain, and in particular the visual and auditory systems, are as sophisticated as they are fascinating pattern recognition systems. The architecture of a pattern recognition system is drawn in Figure 1.1. At a first stage in the auditory and visual systems, basic and simple characteristics are extracted from the scene, which are then combined and elaborated in later stages in the brain in order to extract relevant information and take decisions. This is reminiscent of the data acquisition and feature extraction stages in traditional pattern recognition systems, followed by modules where the data are further processed (feature learning and selection, model training) in order to take decisions, such as detection, recognition, classification or rejection. Quite simply, humans are amazing pattern-recognition machines.
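For illustration only, the stages of Figure 1.1 map naturally onto a processing pipeline. The sketch below expresses them with scikit-learn, a stand-in choice not used in the thesis; the feature extraction/learning stage is assumed to have already produced the feature matrix X, and k=20 is an arbitrary value.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC

# Each named step mirrors one block of Figure 1.1; X is assumed to hold the
# features extracted (or learned) from the raw input data.
pattern_recognition_system = Pipeline([
    ("pre_processing", StandardScaler()),
    ("feature_selection", SelectKBest(mutual_info_classif, k=20)),
    ("model_training", SVC(kernel="rbf")),
])
# Training phase:  pattern_recognition_system.fit(X_train, y_train)
# Operating phase: decisions = pattern_recognition_system.predict(X_new)
```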


1.1 Scope

In this thesis, we investigate the construction of pattern recognition systems that are based on the computation of features inspired by the characteristics of the human auditory and visual systems. We specifically focus on the task of learning a representation of the pattern of interest directly from the input data (feature learning in Figure 1.1) by using biologically inspired approaches, and on selecting discriminant features for improved classification (feature selection in Figure 1.1). The main research questions that motivated the work presented in this thesis are:

• Can biology-inspired filters detect important properties of acoustic and visual signals in a more effective way than traditional approaches?

• How would such filters perform when used as feature extractors in pattern recognition systems for real-world applications?

The first question has led to the proposal, in Chapters 4 and 5, of two biologically-inspired filters for processing acoustic signals and for image analysis, respectively. These two filters are inspired by some properties of the outer and middle ear in the auditory system and of simple cell neurons in area V1 of the primary visual cortex, respectively.

In order to address the second research question, we employed the proposed filters in two important real-world applications, namely the detection of acoustic events for intelligent surveillance and the delineation of blood vessels in retinal fundus images. Since intelligent audio surveillance is a recent research field and no public data sets are available for testing event detection algorithms, we constructed two data sets of abnormal events that can occur in everyday life. We present them in Chapters 2 and 3, together with an approach based on traditional pattern recognition techniques. Moreover, in Chapter 3 we design a framework for the deployment of a real-world system for audio surveillance of roads.

As mentioned in the previous section, one of the interesting properties of human biological systems is the capability of quickly learning the model of new objects or sounds. Such models are flexible enough to allow the recognition of similar versions of the learned samples. This posed the third research question:

• Can the structure of a filter be learned from a single training sample and then applied for the recognition of similar patterns?

This question is addressed in Chapters 4 and 5. One of the characteristics of the proposed filters is their trainability, as their structure is learned in an automatic configuration process given a prototype pattern of interest. In the application phase they allow for variations of the patterns of interest. We further investigate the flexibility of using such filters by proposing, in Chapter 6, a method for the automatic selection of a set of filters that optimizes the performance on the delineation task.

Applied research has the purpose of solving relevant problems and producing innovation that can improve the quality of our lives. One of the constraints that always has to be taken into account is the amount of computational resources that will be required by the approach under investigation. This posed the last, but not least, question that has guided this work:

• Is the processing time of the proposed filters appropriate for their application in practical real-world systems?

Attention to the amount of resources required by pattern recognition tools plays a strategic role in the development of innovative systems. To this extent, during the development of the filters, we evaluated the impact that each operation has on the processing of the input signal and took the overall required processing time into account.

1.2 Thesis Organization

The chapters of this thesis are organized as follows.

In Chapter 2, we introduce the problem of audio event detection for intelligent surveillance applications and propose a system for the detection of glass breaking, gun shot and scream sounds in public noisy environments. The proposed system is based on the bag-of-features classification approach (Joachims, 1998), which is able to learn a feature set directly from the data (feature learning). Since audio surveillance is a relatively new research field and public data sets were not available, in this chapter we also present a new large data set of audio events that we made available for benchmark purposes. We provide a quantitative analysis of the performance that compares the results of the proposed method with those achieved by existing methods on the new data set.

In Chapter 3 we present an application of the method proposed in Chapter 2 to the surveillance of roads for the automatic detection of hazardous situations. In particular, we design a practical system and provide a deployment strategy for different kinds of roads (from very calm country roads to very busy cities or motorways). The proposed architecture is based on models of the noise caused by vehicles moving on roads. Moreover, we present a data set of tire skidding and car crash sounds for the evaluation of the detection algorithm.

In Chapter 4 we present novel filters, which we call CoPE (Combination of Peaks of Energy) filters, for feature extraction in audio signals. The proposed filters are inspired by the way sound waves are converted into neural impulses on the auditory nerve in the human auditory system. An important characteristic of CoPE filters is their trainability: their structure is determined in an automatic configuration process on a given sample of interest. The process of learning important features is addressed in this chapter in a more direct way than in previous chapters. We employ the proposed trainable features in a system for audio event detection, using CoPE filters to automatically extract important features of the events of interest. The proposed system outperforms the method based on the bag-of-features approach. This confirms that biology-inspired CoPE filters are suitable for audio analysis.

In Chapter 5 we address an important application in medical image analysis, that is, the delineation of blood vessels in retinal fundus images. We introduce trainable bar-selective COSFIRE filters, or B-COSFIRE for brevity. Their selectivity is not predefined in the implementation, but rather determined from a user-specified prototype pattern (e.g. a straight vessel) in an automatic configuration process. We evaluate the performance of the proposed method on three data sets: DRIVE (Staal et al., 2004), STARE (Hoover et al., 2000) and CHASE_DB1 (Owen et al., 2009).

In Chapter 6 we present an approach for the automatic selection of the set of B-COSFIRE filters (feature selection in the scheme of Figure 1.1) that are the most relevant for the application at hand. The discriminant qualities of the filters are evaluated by using machine learning and information theory techniques. In particular, we employ Generalized Matrix Learning Vector Quantization (Schneider et al., 2009b), genetic algorithms and an entropy score. We apply the proposed method to the delineation of blood vessels in retinal fundus images and evaluate its performance on two benchmark data sets: DRIVE (Staal et al., 2004) and STARE (Hoover et al., 2000).
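As a hedged illustration of the selection step (mutual information is used here as a simple stand-in for the GMLVQ relevances, genetic algorithms and entropy score actually employed in Chapter 6; all names are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_filter_subset(X: np.ndarray, y: np.ndarray, max_k: int = 20):
    """Rank the columns of X (e.g. per-pixel responses of a filter bank) by an
    information-theoretic relevance score, then keep the prefix of the
    ranking that maximizes cross-validated accuracy."""
    ranking = np.argsort(mutual_info_classif(X, y))[::-1]
    best_k, best_score = 1, -np.inf
    for k in range(1, max_k + 1):
        score = cross_val_score(SVC(), X[:, ranking[:k]], y, cv=3).mean()
        if score > best_score:
            best_k, best_score = k, score
    return ranking[:best_k]
```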

The five chapters introduced above have been submitted to peer-reviewed journals. Chapters 2, 3 and 5 are published, while Chapter 6 has been accepted for publication and Chapter 4 is currently under review.


Published as:

Pasquale Foggia, Nicolai Petkov, Alessia Saggese, Nicola Strisciuglio, Mario Vento, "Reliable detection of audio events in highly noisy environments," Pattern Recognition Letters, Volume 65, 1 November 2015, Pages 22-28, ISSN 0167-8655, http://dx.doi.org/10.1016/j.patrec.2015.06.026

Chapter 2

Audio events detection in noisy environments

Abstract

In this chapter we propose a novel method for the detection of audio events for surveillance applications. The method is based on the bag of words approach, adapted to deal with the specific issues of audio surveillance: the need to recognize both short and long sounds, the presence of a significant noise level and of superimposed background sounds of intensity comparable to the audio events to be detected. In order to test the proposed method in complex, realistic scenarios, we have built a large, publicly available dataset of audio events. The dataset has allowed us to evaluate the robustness of our method with respect to varying levels of the Signal-to-Noise Ratio; the experimentation has confirmed its applicability in real-world conditions and has shown a significant performance improvement with respect to other methods from the literature.

2.1 Introduction

Audio analysis has traditionally been focused on the recognition of speech (Anusuya and Katti, 2010; Besacier et al., 2014), speaker identification (Cordella et al., 2003; Chetty and Wagner, 2005; Saquib et al., 2010; Roy et al., 2012) and scene categorization (Cai et al., 2008; Pancoast and Akbacak, 2012). Recently, research in the area of intelligent surveillance systems shifted its attention to the automatic detection of abnormal or dangerous events through the analysis of audio streams acquired by microphones. Indeed, there are kinds of events (gun shots, screams and glass breaking) that can be effectively detected by using audio sensors but are much less evident by looking at the video stream. Audio analytics systems can be easily and inexpensively employed together with existing surveillance infrastructures, today mainly based on video analytics algorithms that use object tracking techniques (Di Lascio et al., 2013). Indeed, many IP surveillance cameras are already equipped with, or predisposed to be connected to, a microphone, making a joint analysis of audio and video streams possible (Cristani et al., 2007).


One of the problems in audio surveillance applications is that the sounds of interest are superimposed on significant background sounds, often with very different values of the signal to noise ratio (SNR). Thus, it might be difficult to separate the noise to be ignored from the sounds to be recognized. Moreover, the properties of those events might be evident at different time scales: for instance, a gun shot is an impulsive sound and its spectrum distribution over time is very different from that of a scream, which is an exemplary sustained sound.

The state-of-the-art audio surveillance methods (see Crocco et al. (2014) for a comprehensive review) can be categorized in two main groups, depending on the architecture employed for classification. In the first group, the approach is to extract characteristic features (Mel-Frequency Cepstral Coefficients or Wavelet-based coefficients) from the small audio frames into which the input signal is divided, and to use them in combination with a classifier to take decisions. Vacher et al. (2004) and Clavel et al. (2005) employ Gaussian Mixture Model (GMM) based classifiers trained on different sets of features in order to detect screams or gun shots, while Valenzise et al. (2007) use them to address the problem of modeling the background sounds. In order to reduce the influence of the background sounds on the classification results, Rabaoui et al. (2008) adopted a pool of One Class Support Vector Machines (OC-SVM) with a novel dissimilarity measure. Performing only a short-time analysis, these methods display limited capabilities when confronted with both sustained and impulsive sounds, and a low robustness to background sound variations.

In the second group, more complex architectures have been proposed to increase the reliability of the systems. Rouas et al. (2006) propose an approach that combines GMMs and Support Vector Machines (SVM) for detecting screams in outdoor environments, together with an adaptive thresholding on sound intensity for limiting the number of false detections. Ntalampiras et al. (2009) propose a two-stage GMM based classifier in which the first stage aims to separate normal and abnormal sounds, while in the second stage they are classified in one of the classes of interest. Conte et al. (2012) present a method in which impulsive and sustained sounds are analyzed by means of two classifiers that work at different time scales. The method uses a quantitative estimation of the reliability of each classification to reduce the false detections by rejecting the classifications that are not considered sufficiently reliable. A reject option for a pool of OC-SVM classifiers has been defined by Aurino et al. (2014). The temporal sequence of symbols that represent spectral shapes has been taken into account by Chin and Burred (2012), who classify the audio events by matching sub-sequences of the reference events using the Genetic Motif Discovery technique. The event detection task is formulated by Foggia et al. (2014a) as an object detection problem in the Gammatone image of the sound.

Generally, more complex classification architectures require a ground truth defined both at short- and long-time level, increasing the human labor time needed to label the data set. Complex architectures achieve stronger robustness to the background noise, but require higher computational resources. When, instead, a high-level representation of the data is used, the discriminative power of the system improves while the classification scheme is kept simple.

In this chapter we present a system for audio analysis based on the bag of words approach, as an extension of the paper of Carletti et al. (2013). The bag of words paradigm has been successfully applied in other fields, ranging from textual document retrieval (Joachims, 1998) to human action recognition (Foggia et al., 2013), video-based object detection (Sivic and Zisserman, 2009) and music classification (Fu et al., 2011a).

The application of this paradigm to audio has been pioneered by Pancoast and Akbacak (2012). The underlying idea is that the audio stream can be thought of as being composed of small perceptual units of hearing, which we call aural words, whose distribution over a finite interval of time allows to characterize the type of sound. While a single aural word describes the short-time characteristics of the audio signal, the presence of certain words together is likely representative of the occurrence of a given event. In the paper of Pancoast and Akbacak (2012), this paradigm is applied to the classification and indexing of multimedia assets: namely, the authors classify a set of video clips into different kinds of scenes on the basis of their audio content. That problem is significantly different from audio surveillance, for the following reasons: the scene to be recognized lasts at least several seconds, while in surveillance the sound of interest can be very short (e.g. a gun shot can last for less than 200 milliseconds); the audio quality is usually good, while audio surveillance has to deal with noise introduced during both the acquisition (e.g. because of low quality microphones and of distance) and the transmission of the sound (e.g. compressed transmission over a low bandwidth network); the whole scene has to be recognized, while in a surveillance scenario, instead, the event of interest is one of many sounds simultaneously present in the environment, and not necessarily the loudest one, since other background sounds can be produced by sources that are closer to the microphone.

The specific characteristics of the surveillance problem strongly impact how the bag of words approach must be tailored in order to be effectively applied. First, the fact that the sounds of interest will usually occur superimposed with other background sounds must be considered during the system training. Second, the high level of noise means that an exact matching of the aural words may fail, giving erroneous detections.

In this chapter we present a system for audio surveillance based on a specific adaptation of the bag of words approach. The system has been validated on a large data set introduced in this thesis, especially designed for benchmarking in realistic environmental conditions, and made publicly available (http://mivia.unisa.it/). The main contributions and the differences of this work with respect to Carletti et al. (2013) are: a) the design and realization of a wide and challenging data set of sounds, with highly noisy background, occurring at different SNRs ranging from 5dB to 30dB; b) a technique for reducing the required training time without affecting the accuracy; c) an improvement of the robustness to noise through the use of soft assignment to bags, to limit the errors due to the exact matching of aural words; d) a detailed analysis of the robustness of the proposed approach to significant variations of the SNR and with very different background sounds. The newly proposed data set is composed of 6000 events (glass breakings, gun shots and screams) that occur in various environmental conditions.

2.2 The proposed method

Given M classes of events of interest C1, . . . , CM and a class of background sounds C0, the system has to detect if and when a certain audio event occurs and effectively distinguish it from the background sounds. The audio stream is first divided in small frames of the order of milliseconds, for which short-time, low-level features, that we call aural words, are computed and then used to construct a higher-level feature vector whose elements are indicators of the occurrence of such short-time features. The set of words is obtained by means of a clustering process that quantizes the original space of short-time features. The detection of audio events is performed in a time window of m seconds that moves along the audio stream. For each time window, the histogram of the occurrences of the aural words is constructed and is used as a feature vector to be fed to a pool of SVM classifiers, one for each class Ci. In the operating phase, the decisions taken independently by each SVM are combined together to obtain the output class of the sound.

2.2.1 Short-time and long-time descriptors

In contrast to video signals, in which a scene can persist even for several seconds, an audio signal might show huge temporal variations within a few milliseconds. Thus, in order to take into account the short-time variability, the input audio stream is first segmented into groups of N partially overlapping frames of duration TF, windowed by a Hamming window. The choice of TF is influenced by two contrasting effects: if the value is too short, the frame will be unable to accurately represent low-frequency components of the sounds. Conversely, if it is too long, the frame will not adequately represent short-time changes in the audio signal. We found a value of TF = 32 ms to be a reasonable compromise for a reliable analysis of audio streams sampled at 32 kHz.


Category            Features
Spectral features   spectral centroid, spectral spread, spectral rolloff, spectral flux
Energy features     energy, 4 sub-bands energy ratios, volume
Temporal features   zero-crossing rate (ZCR)

Table 2.1: Feature set used to build the short-time descriptor.

Every frame is built by advancing the frame window by TF/4 and contains L = 1024 PCM samples. A set of spectral and temporal features (Peeters, 2004) and energy features (Liu et al., 1998), listed in Table 2.1, are used to build the short-time descriptor. A complete explanation with mathematical formulations is reported in Carletti et al. (2013). In the same way a text cannot be classified from a single word but rather from the occurrence of different words, our hypothesis is that a given audio event is characterized by the occurrence of specific basic sound units. In order to derive a finite set of atomic sounds, which we call aural words to emphasize the fact that they are related to perceptual units of hearing and not to linguistic units, we quantized the space of the short-time descriptors using the K-means clustering algorithm. Since the space of short-time descriptors is dense, a uniform down-sampling of the space by a factor 2 (i.e., we randomly selected half of the short-time descriptors using a uniform probability distribution) allowed us to considerably speed up the clustering process (up to 90% less time), which produced a set of K well-representative clusters without influencing the final performance of the system. The centroids of the obtained clusters are collected in the set of words wi that constitutes the dictionary W = {w1, . . . , wK} of the system. Each vector wi in the dictionary is representative of a recurrent atomic sound unit, whose occurrence increases the statistical evidence of being in the presence of a given sound. Of course, a single word is not expected to be representative of the presence of an event of interest; thus we performed the detection at a time scale of the order of seconds, by looking at the occurrences of certain aural words that are distinctive for the sounds of interest. Since the K-means algorithm only requires unlabeled samples, for the training of the proposed system it is not strictly necessary to have a ground truth with a granularity of a single frame. It is, instead, sufficient to define the true labels, which indicate the presence or not of a specific event, only at a longer time scale, corresponding to time windows of the order of seconds. This is a significant advantage with respect to methods that require a ground truth at frame level, as it reduces the human labor time.
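As an illustration of this stage, the following sketch (in Python, with NumPy and scikit-learn as stand-ins for the actual implementation, which is not reproduced in this thesis text) frames a 32 kHz signal, computes a small illustrative subset of the Table 2.1 features, and learns the dictionary with K-means after the 50% down-sampling described above; all function and variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

FS = 32000        # sampling rate (Hz)
L = 1024          # frame length: T_F = 32 ms at 32 kHz
HOP = L // 4      # the frame window advances by T_F / 4

def frame_signal(x):
    """Split a mono signal into partially overlapping, Hamming-windowed frames."""
    n_frames = 1 + (len(x) - L) // HOP
    idx = np.arange(L)[None, :] + HOP * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(L)

def short_time_descriptor(frame):
    """Illustrative subset of the Table 2.1 features for a single frame."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(L, d=1.0 / FS)
    p = spec / (spec.sum() + 1e-12)                           # normalized magnitude spectrum
    centroid = (freqs * p).sum()                              # spectral centroid
    spread = np.sqrt((((freqs - centroid) ** 2) * p).sum())   # spectral spread
    k = min(np.searchsorted(np.cumsum(p), 0.9), len(freqs) - 1)
    rolloff = freqs[k]                                        # 90% spectral rolloff
    energy = (frame ** 2).mean()                              # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2        # zero-crossing rate
    return np.array([centroid, spread, rolloff, energy, zcr])

def learn_dictionary(signals, K=1024, seed=0):
    """Quantize the short-time feature space into K aural words (the set W)."""
    descs = np.vstack([np.apply_along_axis(short_time_descriptor, 1, frame_signal(x))
                       for x in signals])
    # uniform down-sampling by a factor of 2 considerably speeds up the clustering
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(descs), size=len(descs) // 2, replace=False)
    return KMeans(n_clusters=K, n_init=3).fit(descs[keep]).cluster_centers_
```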


In the conventional bag of words approach, the construction of the descriptors uses the so-called hard assignment technique: for each feature vector vi, the dictionary is searched for the closest word wj to vi. The long-time descriptor H = (h1, . . . , hK) is then calculated as the histogram of the occurrences of the different words:

h_j = \sum_{i=1}^{N} a_{ij}, \qquad j = 1, \dots, K, \qquad (2.1)

where a_{ij} is an indicator function with value 1 if the closest word to v_i is w_j:

a_{ij} = \begin{cases} 1 & \text{if } j = \arg\min_{l} D(v_i, w_l), \quad l = 1, \dots, K \\ 0 & \text{otherwise} \end{cases} \qquad (2.2)

with D(vi, wj) a distance measure between the i-th vector in the time window and the j-th word of the dictionary (for uniformity with the distance metric employed in the K-means algorithm, the Euclidean distance was used).

While hard assignment may work well in contexts with low to moderate noise, the effect of noise on the vectors vi is somewhat amplified by the quantization introduced by this rule. Since in audio surveillance we expect the noise to be significant, in order to counter this effect we have adopted the so-called soft assignment described by Liu et al. (2011); Equation 2.2 is replaced by:

a_{ij} = \frac{\exp(-\beta D(v_i, w_j))}{\sum_{l} \exp(-\beta D(v_i, w_l))} \qquad (2.3)

with the parameter β used to control the “softness” of the assignment. The value of β has to be chosen depending on the density of the low-level feature space. In fact, for a dense space a too low value of β smooths the histogram, which then loses its descriptive power. For β → ∞, instead, Eq. 2.3 corresponds to hard assignment. In this work we set β = 10 in order to give weight to a limited neighborhood of clusters.

It is worth noting that such a representation is invariant with respect to the position of the event of interest within the considered time interval. Indeed, since the temporal arrangement of the aural words does not affect the construction of the histogram, a target sound that occurs at the beginning of an interval and another one that occurs at the end can be modeled with the same histogram, contributing to the stability and simplicity of the representation.
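The computation of the long-time descriptor of Eqs. (2.1)-(2.3) can be sketched as follows; descs is the N x d matrix of short-time descriptors falling in the current m-seconds window and words is the K x d dictionary. The names and the numerical-stability shift are ours, not from the thesis implementation.

```python
import numpy as np

def long_time_descriptor(descs, words, beta=10.0, soft=True):
    """Histogram H = (h_1, ..., h_K) of aural-word occurrences in one time window."""
    # pairwise Euclidean distances D(v_i, w_j), shape (N, K)
    D = np.linalg.norm(descs[:, None, :] - words[None, :, :], axis=2)
    if soft:
        # soft assignment (Eq. 2.3): each frame spreads its vote over nearby words;
        # subtracting the per-row minimum before exponentiating rescales numerator
        # and denominator alike, and avoids numerical underflow
        A = np.exp(-beta * (D - D.min(axis=1, keepdims=True)))
        A /= A.sum(axis=1, keepdims=True)
    else:
        # hard assignment (Eq. 2.2): a one-hot vote for the closest word
        A = np.zeros_like(D)
        A[np.arange(len(D)), D.argmin(axis=1)] = 1.0
    return A.sum(axis=0)    # Eq. 2.1: h_j = sum_i a_ij
```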


Figure 2.1: Overview of the system architecture. The input audio signal (a) is divided in small frames (b) and a short-time descriptor is computed for each of them. A dictionary of aural words is created during the training phase (green box) by means of a clustering process (c). Then the histogram of the occurrences of the aural words in a m-seconds time window (long-time descriptor H) is computed (d) and is fed to a multiclass SVM classifier (e) that performs the detection of events.


2.2.2 The classifier

The long-time descriptors are used to train a pool of SVM classifiers (Cortes and Vapnik, 1995) with a ground truth defined at interval time-scale. The motivation of this choice lies in the ability of a SVM classifier, like other classifiers based on discriminant analysis, to construct a decision function that gives a non-zero weight only to a subset of the features. In more detail, it means that the classifier learns which aural words are really discriminative for a certain class of events, ignoring the others. We have used the original, linear version of the SVM, and not the kernelized one, since it provided satisfactory results in our experiments. Since the proposed system has to face a multi-class classification problem and the SVM is essentially a binary classifier, we adopted the 1-vs-all SVM classification scheme (Fig. 2.1e). Namely, we have a pool of M + 1 1-vs-all SVM classifiers (where M is the number of classes to be recognized). The i-th classifier (with i = 0, . . . , M) is trained using as positive examples the samples from class Ci and as negative examples all the samples from the other classes. We observed that employing a SVM classifier also for the background sounds allows to reduce the detection of false positive events. During the operating phase, each input pattern (long-time descriptor) is fed to the pool of SVM classifiers; each classifier gives as output a score si, which indicates its confidence, higher for more robust decisions. The final class C is assigned to the input pattern H through the following combination rule:

C = \begin{cases} C_0, & \text{if } s_i < \tau \;\; \forall i = 0, \dots, M \\ \arg\max_{i} s_i, & \text{otherwise.} \end{cases} \qquad (2.4)

Namely, if at least one of the scores is higher than a threshold τ, the vector is assigned to the class whose SVM gives the maximum score. Otherwise, if all the classifiers give a score si < τ, the vector is assigned to the background class C0. The overall architecture of the proposed system is depicted in Fig. 2.1. Only during the training phase, the short-time descriptors extracted from the input signals are used to build a dictionary of aural words (Fig. 2.1c), which then serves to compute the long-time descriptors (Fig. 2.1d). The SVM classifiers (Fig. 2.1e) learn which aural words are specific for a given class of sounds, while the detection of events is performed during the testing phase through the combination of the scores of the SVMs.
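A minimal sketch of the pool of 1-vs-all linear SVMs and of the combination rule of Eq. (2.4), using scikit-learn's LinearSVC as a stand-in for the SVM implementation actually used (which the chapter does not specify); class index 0 plays the role of the background class C0, and the class and method names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

class AuralWordsClassifier:
    """Pool of M+1 1-vs-all linear SVMs combined as in Eq. (2.4)."""

    def __init__(self, n_classes, tau=0.0):
        self.tau = tau
        self.svms = [LinearSVC() for _ in range(n_classes)]  # one SVM per class, background included

    def fit(self, H, y):
        # the i-th SVM sees class i as positive and all the other classes as negative
        for i, svm in enumerate(self.svms):
            svm.fit(H, (y == i).astype(int))
        return self

    def predict(self, H):
        # s_i = signed distance to the i-th hyperplane, used as a confidence score
        S = np.column_stack([svm.decision_function(H) for svm in self.svms])
        out = S.argmax(axis=1)              # class of the maximum score ...
        out[S.max(axis=1) < self.tau] = 0   # ... unless all scores fall below tau (background C_0)
        return out
```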

2.3 Experimental results

The proposed system has been experimentally validated considering a typical application of audio surveillance in which three classes of audio events have to be detected: scream, glass breaking and gun shot.

A key requisite for an audio surveillance system is the ability to detect events of interest even when they are mixed with different kinds of background sounds at different energy levels. Thus, the approach followed in other application domains of training a system by using a set of training samples, each containing only one kind of sound (either an event of interest or a background sound), will give a poor performance during the test in realistic conditions.

In order to address this problem, and to provide a quantitative assessment of its impact, we have decided to construct a training set (and a test set) in which the individual sounds are not present isolated but are already superimposed to each other. After collecting a large number of audio clips, we combined them in several ways, obtaining an extended data set, so as to produce very challenging detection tasks, with low SNR and with the events mixed with a plurality of background noises.

To the best of our knowledge, there are no publicly available data sets for the benchmarking of audio surveillance applications. Thus, we constructed our own dataset of PCM audio clips sampled at 32 kHz and with a sample resolution of 16 bits. The data set, available at http://mivia.unisa.it, contains highly noisy environmental sounds with events of interest superimposed at different values of the SNR (in our case, 6 different values), making the detection and classification of events very challenging tasks. The intensity of the background sound is modulated in order to obtain low levels of SNR and simulate events that occur at various distances from the microphone. Originally, we collected a total of 650 audio clips, 271 of which correspond to sounds that belong to the three classes of interest, while the others correspond to a wide variety of different sounds both from indoor and outdoor environments (silence and Gaussian noise, rain, whistles, crowded ambiance, vehicles, household appliances, bells, applauses and claps). Thus, even though the classes of interest are only three, the number of different types of sound the system must deal with is significantly higher; furthermore, some of the background sounds are similar to the classes of interest (e.g. the voices of people in a crowded ambiance are similar to screams). All the audio clips have been recorded with an Axis P8221 Audio Module and an Axis T83 omnidirectional microphone for audio surveillance.

The audio clips from the original data set have been normalized so that they all have the same overall energy, and then they were split in two disjoint groups comprising 70% and 30% of the total amount of sounds from the original set, respectively. We used the clips from the first group to build the training set and the ones from the second group to build the test set. The procedure described in the following was applied both for the training and the test set.

First, the sound B(n) of a complex environment is created by mixing a randomly defined number d ∈ {1, 2, 3} of the above mentioned background sounds, as follows:

B_j(n) = \sum_{i=1}^{d} b_i(n), \qquad (2.5)

where bi(n) are the d background sounds used to create the complex environmental sound. All the audio files in the created data set have a duration of about 3 minutes; in case an original background sound has a duration shorter than 3 minutes, it is replicated in order to fit the established length.

Once the environmental sound has been created, a number Ne of foreground events is randomly chosen from the original data set and superimposed to the environmental sound, in order to simulate the occurrence of an event in a real and complex environment. In this way, an event can be present in the final data set a plurality of times, but every time it appears with a different background noise. Moreover, since in real situations the source of a target event can be at different distances from the microphone, different values of the SNR of each event have been produced in the creation of the final data set. In particular, when a foreground sound is mixed with the environmental sound, the energy of the foreground sound is amplified or attenuated according to a specific value of the SNR, SNR_p with p = {5dB, 10dB, 15dB, 20dB, 25dB, 30dB}, for the target sound. The rule for the construction of the audio event y_j^p(n) at a certain SNR value is defined as follows:

y_j^p(n) = \sum_{i=1}^{N_e} \left\{ B_j(n) \oplus_{[s_i, e_i]} A_p x_i(n) \right\}, \qquad (2.6)

where

A_p = 10^{SNR_p / 20} \, \frac{rms(B_j(n))}{rms(x_i(n))}. \qquad (2.7)

The amplification (or attenuation) coefficient Ap depends on the specific SNRp value and on the root mean square values (rms) of the environmental sound and of the foreground sound. With ⊕[si,ei] we denote an operator that mixes the signal Apxi(n) with the signal Bj(n) in the interval delimited by [si, ei], the starting and ending points of the target sound respectively.

The final data set consists of a training set and a test set that contain, respectively, 396 and 184 audio files of about 3 minutes, each of them containing a sequence of events at a specific SNR value. The total duration is about 20 hours for the training set and about 9 hours for the test set, making the database very large. A total of 6000 events per class have been collected, 4200 in the training set and 1800 in the test set. In the following we will refer to the different classes of events with the abbreviations GB for glass breaking, GS for gun shot, S for scream and BN for background noise. A summary of the composition of the data set is reported in Table 2.2.


Data set description

Class   Training set             Test set
        #Events   Duration (s)   #Events   Duration (s)
BN      -         58371.6        -         25036.8
GB      4200      6024.8         1800      2561.7
GS      4200      1883.6         1800      743.5
S       4200      5488.8         1800      2445.4

Table 2.2: Summary of the composition of the data set.


2.3.1 Performance evaluation

For evaluating the algorithm performance, we considered two measures: the recognition rate of the events of interest and the false positive rate (FPR), i.e. events of interest detected when only background sound is present.

An event is considered as correctly detected if it is detected in at least one of the sliding time windows that overlap with its occurrence. Table 2.3 shows the classification matrix for three versions of the system: the first one is trained using isolated sounds; the second one uses the mixed training sounds with hard assignment, and the last one uses the mixed sounds with soft assignment. As can be seen, the use of mixed sounds in the training is essential for achieving good results on the test set. Soft assignment also yields a significant improvement, especially for the gun shot class, which is more sensitive to errors in the codeword assignment because its samples are very short. In the rest of the discussion, we will refer to the soft assignment version, unless otherwise specified.

The average correct classification rate of foreground events, at different SNR values, achieved on the whole test set is 86.7%. The classification matrix also shows that most of the errors are mainly directed to the background noise class (missed detections), while confusion between different classes of interest is low. We count a false positive (FP) when an event of interest is erroneously detected in a time window that contains only background noise; it is worth pointing out that if a foreground event is detected in two consecutive time windows, we count, as it should be, a single false positive occurrence. Thus, the FPR is computed as the ratio of the detected false positive events to the total number of intervals between two foreground sounds. For the whole test set we achieved a FPR equal to 2.1%.


Training with isolated sounds

              Guessed class
True class    GB       GS       S        Miss
GB            84.3%    0%       0.1%     15.6%
GS            29.6%    0%       1.7%     68.7%
S             22.4%    0%       28%      49.6%

Proposed method – Hard assignment

              Guessed class
True class    GB       GS       S        Miss
GB            93.6%    0.2%     0.2%     6%
GS            3.3%     81.6%    0.5%     14.6%
S             2.8%     0.9%     79.3%    17%

Proposed method – Soft assignment

              Guessed class
True class    GB       GS       S        Miss
GB            94.4%    0.2%     0.2%     5.2%
GS            3.5%     84.9%    0.5%     11.1%
S             2.6%     0.9%     80.8%    15.7%

Table 2.3: Results of the proposed system on the test set.

The false positives break down into 0.83% falsely detected glass breakings, 0.74% gun shots and 0.53% screams. It is worth noting that more than 70% of the FPs are concentrated in about one hour of background sound, composed mainly of household appliance and rain sounds, while the remaining 30% are distributed along more than 6 hours of other background sounds. In real conditions, it is difficult to have, in the same environment, both rain and household appliance sounds, and a lower number of false alarms is thus expected. Moreover, the system could be optimized for particular kinds of environments, for instance by using a pre-filter, in order to decrease the FPR.
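The evaluation protocol described above (an event counts as detected if at least one overlapping window predicts its class; consecutive positive windows count as one false positive) can be formalized as in the sketch below, which is our own reading of the protocol rather than the evaluation code used for the thesis.

```python
def is_detected(ev_start, ev_end, ev_class, windows):
    """An event counts as detected if at least one overlapping sliding window
    predicts its class; `windows` is a list of (start, end, predicted_class)."""
    return any(s < ev_end and e > ev_start and c == ev_class
               for (s, e, c) in windows)

def count_false_positives(windows, background_class=0):
    """Over background-only audio, a run of consecutive non-background windows
    counts as a single false positive occurrence."""
    fp, prev_fired = 0, False
    for (_, _, c) in windows:
        fired = (c != background_class)
        if fired and not prev_fired:
            fp += 1
        prev_fired = fired
    return fp
```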

A more detailed analysis of the performance of the system on audio clips with different SNR values of the foreground events is reported in Table 2.4. As expected, when the sounds of interest have higher values of SNR (and thus a lower level of the noise), the influence of the background noise is reduced, leading to an improvement of the recognition rate and a reduction of the FPR and miss rate.


Proposed method - Detailed results

SNR       Recognition   Miss     Error    False Positive
5dB       81.1%         12%      6.9%     11.5%
10dB      85%           12.1%    2.9%     2.4%
15dB      87%           10.9%    2.1%     1.3%
20dB      88.4%         9.9%     1.7%     1.2%
25dB      88.7%         9.9%     1.4%     1.2%
30dB      90%           9.2%     0.8%     1%
Average   86.7%         10.7%    2.3%     2.6%

Conte et al. (2012) - Detailed results

SNR       Recognition   Miss     Error    False Positive
5dB       71.4%         0.1%     28.4%    27.9%
10dB      81.2%         1.8%     17%      21.1%
15dB      86.2%         3.6%     10.2%    9.7%
20dB      87.6%         4.7%     7.8%     9.3%
25dB      88.2%         5.1%     6.7%     7.2%
30dB      88.9%         4.9%     6.2%     7.6%
Average   83.9%         3.4%     12.7%    13.8%

Table 2.4: Detailed results achieved by using the proposed classifier and by Conte et al. (2012) for different values of the SNR of the foreground sounds.

It is also worth noting that the recognition rate for events at 5dB SNR is only 5% lower than the average recognition rate on the whole data set and about 9% lower than the best value, achieved for the events at 30dB SNR. The correct classification rate and the false positive rate achieved in different SNR conditions prove that the proposed system is robust to large background noise variations.

We have also performed the analysis of the performance versus SNR for the bag of words implemented with hard assignment. The result is that the performance increment due to soft assignment is roughly equivalent to a 5dB increment of the SNR (e.g. soft assignment at 10dB of SNR has approximately the same performance as hard assignment at 15dB).



Figure 2.2: ROC curve of the proposed system (solid line) compared with that of Conte et al. (2012) (dashed line).

2.3.2 Performance comparison

We compared the performance of the proposed system with that achieved by the method of Conte et al. (2012), which employs a single-level representation and aggregates short-time decisions, taken by an LVQ classifier trained with the same short-time descriptors, at time window level using a rejection rule. The system attributes an interval to the class Ci that obtains the highest score z_i = (n_i − n̄_i)/n̄_i, where n_i is the number of frames in the interval assigned to the class Ci and n̄_i is a threshold that indicates a limit under which the i-th class is not considered as a candidate for the final decision. An interval is rejected (classified as background) if z_i < 0 ∀i = 1, . . . , M.

First, we compare the performance of the two methods by using the receiver operating characteristic (ROC) curves, which give an overall evaluation of the classification performance. The ROC curves, obtained by varying the value of τ for the proposed method and of n̄_i for Conte et al. (2012), are depicted in Fig. 2.2. The proposed method clearly outperforms the other one, as the corresponding curve is closer to the left and top borders of the quadrant. We consider the area under the ROC curve (AUC), which is equal to 1 for a perfect classification, as a measure of the performance of the two methods and report the results in Table 2.5. The higher this measure, the better the overall performance of the system. We observe that the proposed method (solid line) generally outperforms Conte et al. (2012) (dashed line), achieving an average AUC that is about 7.2% higher.


        Proposed Method   Conte et al. (2012)
GB      0.954             0.872
GS      0.968             0.886
S       0.966             0.938

Table 2.5: Comparison of the proposed method with Conte et al. (2012) in terms of AUC for the foreground classes.

Conte et al. (2012) - Classification matrix

              Guessed class
True class    GB       GS       S        Miss
GB            91.3%    5.3%     1.4%     1.9%
GS            12.1%    80.6%    3.9%     3.4%
S             7.6%     7.9%     79.8%    4.7%

Table 2.6: Results achieved by Conte et al. (2012) on the data set.


The introduction of a second level of representation exploits the long-time properties of the signal and contextual information about the environmental sounds, limiting the effects of the background noise on the detection of events. Due to the combination of short- and long-time analysis, the performance is generally better than that of a method based on a single representation level, which considers only the short-time properties of the audio signal. In fact, when a decision is taken only at the lower level, as in the case of Conte et al. (2012), the reliability of the system is penalized by the effect of the background noise. Thus, the proposed system can be useful for audio surveillance due to its robustness to the environmental noise and the consequently lower false alarm rate, even at low SNR.

In order to compare the performance of the two systems in operating conditions, we determined the value of the threshold τ = 0 for the bag of aural words classifier and the values of n̄_BN = 266, n̄_GB = 95, n̄_GS = 26, n̄_S = 62 for Conte et al. (2012), through a validation on the training set. The value τ = 0 determined in the training phase allows to achieve a low false positive rate together with a reasonable miss rate. In Fig. 2.3, we show how the miss rate and false positive rate vary as a function of the threshold τ on the test set. In Table 2.6, we report the recognition results achieved by Conte et al. (2012) on the proposed data set.



Figure 2.3: Variation of the miss rate (dashed line) and false positive rate (solid line) as a function of the classification threshold τ. The choice of τ = 0 guarantees a low false positive rate and a reasonable miss rate.

The average recognition rate of Conte et al. (2012) is 83.9%, which is lower than the 86.7% achieved by the proposed method. The difference is much more substantial for low values of the SNR: at 5dB, the recognition rate of Conte et al. (2012) is 71.4%, about 10% less than that of the proposed method, and the false positive rate is about 10% higher. This is evident from the comparison of the detailed results of the proposed method and of Conte et al. (2012) (Table 2.4). The improvement of performance with respect to the method based on hard assignment and the one by Conte et al. (2012) is statistically significant with a confidence greater than 99%, as confirmed by a t-test.

We also evaluated the effectiveness of the proposed method on the data set described in the paper of Aurino et al. (2014), with whose results we compare. In Table 2.7 and Table 2.8, we report the classification matrices achieved by the proposed approach and by Aurino et al. (2014), respectively. We obtain an average recognition rate of 100%, in contrast with Aurino et al. (2014), who achieve a recognition rate of 93.54% and show lower robustness in the detection of gun shots and screams.

We conclude this section with some information on the computational cost of the proposed method. The processing time of the training phase is significant: on a 2.6 GHz Opteron processor, the preparation of the codebook requires about 7 hours (reduced from about 70 hours needed before we introduced the technique described in Section 2.2.1); the training of the pool of SVMs requires about 2.5 hours. Once the system is trained, the execution of the algorithm is quite fast: on the processor used for the training, the execution in real time at a 32 kHz sound sampling rate requires about 3% of the time of a single CPU core. The system also runs in real time on an embedded Raspberry Pi board, making its deployment very inexpensive.


Proposed method - Classification matrix

        BN      GB      GS      S
BN      100%    0%      0%      0%
GB      0%      100%    0%      0%
GS      0%      0%      100%    0%
S       0%      0%      0%      100%

Table 2.7: Performance results of the proposed method on the data set used by Aurino et al. (2014).

Aurino et al. (2014) - Classification matrix

        BN      GB       GS       S         Rej.
BN      100%    0%       0%       0%        0%
GB      0%      100%     0%       0%        0%
GS      0%      0%       87.5%    0%        12.5%
S       10%     3.33%    0%       86.67%    0%

Table 2.8: Performance achieved by Aurino et al. (2014).

2.3.3 Sensitivity analysis

We performed an analysis of the sensitivity of the proposed system with respect to the number of clusters and the length of the low-level window. In order to perform this analysis, we constructed a version of the data set suited for cross-validation experiments, which we made publicly available (http://mivia.unisa.it). The data set has been divided into k = 5 folds, each of them containing 200 events from each class of interest mixed with typical background sounds. In turn, k − 1 folds are used as a training set, and the remaining fold is used as a test set. The results of the k tests are then averaged.

In Table 2.9 we report the recognition rates achieved with different numbers K of clusters on the 5-folds data set, together with their standard deviation. The number of clusters employed in the training phase influences the recognition capabilities of the system and its generalization capabilities. For the application at hand we set K = 1024, which is a reasonable compromise between a high recognition rate and the real-time response of the system.

As discussed in Section 2.2.1, the length Tf of the time window for the short-time analysis is chosen to effectively describe low- and high-frequency components of the sound at the same time.


Sensitivity to the number of clusters

K       64        128       256       512      1024     2048
Rec.    51.02%    62.55%    70.68%    76.14%   80.32%   82.21%
σ       17.96     14.63     14.73     9.50     6.51     5.6

Sensitivity to the length of the low-level window

Tf      16ms      32ms      64ms
Rec.    51.02%    80.32%    70.68%
σ       17.96     6.51      14.73

Table 2.9: Sensitivity analysis with respect to the number K of clusters (first table) and to the length of the low-level time window (second table).

In Table 2.9, we also report the results of the sensitivity analysis with respect to the parameter Tf. We experimented with three values of Tf (namely 16, 32 and 64 ms). The value Tf = 32 ms is confirmed to effectively describe both low- and high-frequency components for sounds sampled at 32 kHz, while for the values 16 ms and 64 ms the low-frequency and high-frequency components, respectively, are not well analyzed.

2.4 Conclusions

In this chapter we proposed a system based on the bag of aural words approach for the detection of events in audio streams for surveillance applications. The bag of words approach has been tailored and adapted to the specificities of the application domain, such as the high noise level and the need to deal with loud background sounds superimposed on the events of interest. We experimentally validated the system on a large and challenging audio data set that we have made publicly available for benchmarking purposes. The performance results, compared with a state of the art approach (Conte et al., 2012), confirm the robustness of the proposed method with respect to the background noise and its applicability to real environments. In general, the main advantage of the proposed system is that the bag of words approach intrinsically takes into account contextual information about the environment to build a model of the events of interest and learns which features of the sound are distinctive for a specific class of events. This leads to more reliable performance and to a higher robustness to the background noise, even for highly noisy environments and low SNR levels.


Published as:

Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., Vento, M., “Audio Surveillance of Roads: A System for Detecting Anomalous Sounds,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 1, pp. 279-288, 2016, doi: 10.1109/TITS.2015.2470216.

Chapter 3

Design of a practical system for audio surveillance of roads

Abstract

In the last decades several systems based on video analysis have been proposed for automatically detecting accidents on the roads, so as to ensure a quick intervention of emergency teams. However, in some situations the visual information is not sufficient or sufficiently reliable, while the use of microphones and audio event detectors can significantly improve the overall reliability of surveillance systems. In this paper we propose a novel method for detecting road accidents by analyzing audio streams, so as to identify hazardous situations like tire skidding and car crashes. Our method is based on a two-layer representation of the audio stream: at a low level, the system extracts a set of features able to capture the discriminant properties of the events of interest; a high-level representation based on the bag of words approach is then exploited in order to detect both short and sustained events. The deployment architecture for using the system in real environments is discussed, together with an experimental analysis carried out on a data set made publicly available for benchmarking purposes. The obtained results confirm the effectiveness of the proposed approach.

3.1 Introduction

In the last years, a need for more security and safety in public environments has arisen due to the increasing number of people and transportation vehicles that move around cities. Road traffic monitoring involves, for instance, the detection of accidents or road disruptions to quickly ensure the intervention of emergency teams and to guarantee the safety of the people (Gandhi and Trivedi, 2007). In fact, it has been shown (Rauscher et al., 2009; White et al., 2011) that the reduction of the time between the moment in which an accident occurs and the moment in which the emergency team is dispatched substantially decreases the mortality rate (approximately by 6%). Within this context, cameras have been widely used to control the behavior of vehicles by tracking their trajectories (Sivaraman et al., 2013; Sivaraman and Trivedi, 2013; Brun et al., 2014; Wang et al., 2014) near traffic lights or in proximity of road crossings, in order to detect abrupt maneuvers, or on motorways to monitor the traffic flow and detect long queues (Lv et al., 2014; Abadi et al., 2014).

However, in certain cases, the visual information is not sufficient to reliably understand the activity of vehicles or to detect possibly hazardous situations. For instance, a tire skidding on the road has a very distinctive acoustic signature that is not detectable from video streams but can be evidence of an anomalous situation (an accident or a dangerous state of the road) that requires human intervention to ensure safety. Furthermore, abnormal events can happen outside the field of view of the camera, making it impossible for them to be detected either by a human operator or by an automatic video analytics system. In such cases, the use of microphones and the processing of the audio stream as a complementary tool to the video analysis may improve the detection abilities of security systems (Cristani et al., 2007; Marmaroli et al., 2013) and, in general, the reaction time of the emergency teams. As a matter of fact, nowadays IP cameras used for surveillance are normally equipped with embedded microphones, which facilitates the deployment of audio analysis systems.

One of the main advantages of audio analysis systems is that they do not have to deal with variations in illumination conditions and can be equally employed during day and night. However, the problem of detection of audio events in open environments is very challenging: one of the main issues is that the events of interest are superimposed with a significant level of background noise; furthermore, it is difficult to model a priori all the possible background sounds that may occur in road environments. Think, for example, about a very busy highway where an accident occurs: an audio event detector needs to be able to separate the background noise due to the vehicle flow from the car crash (the event of interest) potentially occurring at a significant distance from the microphone. In such a case, the signal to noise ratio (SNR) is very low, thus making the recognition of such events a very complex task. Another typical problem that audio analysis systems have to face is related to the duration of the events of interest: a tire skidding, for instance, is typically a sustained sound and may last several seconds, while a car crash is an impulsive sound and its duration is very limited in time.

In the last decades a large number of methods dealing with the analysis of audio streams has been proposed, ranging from speech recognition (Anusuya and Katti, 2010; Besacier et al., 2014) and scene classification (Cai et al., 2008; Malik, 2013) to speaker identification (Cordella et al., 2003; Saquib et al., 2010). More recently, a growing interest for audio analysis has also been shown in surveillance applications, in order to detect crimes for public transport security (Vu et al., 2006; Zajdel et al., 2007; Rouas et al., 2006), the maximum speed of vehicles for security reasons (Borkar and Malik, 2013; Marmaroli et al., 2013; Barnwal et al., 2013) or accidents on the roads (Pham et al., 2010; White et al., 2011; Rauscher et al., 2009).

In this chapter, we focus on the problem of road surveillance, and we propose a system tailored for the automatic detection of two hazardous situations, namely tire skidding and car crashes, by analyzing the sound captured by microphones. A comprehensive review of the state of the art approaches focusing on surveillance systems has been recently proposed by Crocco et al. (2014), where it is highlighted that the detection of audio events can be considered as a traditional pattern recognition problem. In fact, the common idea is that the data to be analyzed are described by means of a set of features, whose values are used to form a vector representation of the pattern of interest. A feature is a salient characteristic of the pattern to be detected or classified, while a feature set aims at effectively describing patterns from different classes: similar patterns in the real world should have very close vectors in the feature space. The feature vectors are thus used to train a classifier, which creates models of the patterns of different classes through a learning process. It then employs such models to classify newly observed patterns in the testing phase (Vacher et al., 2004; Clavel et al., 2005; Gerosa et al., 2007; Valenzise et al., 2007; Rabaoui et al., 2008; Ntalampiras et al., 2011; Li et al., 2013; Foggia et al., 2014a). In the last years traditional classification schemes have been improved, and more sophisticated architectures have been proposed in order to increase the overall reliability of the audio detector (Rouas et al., 2006; Ntalampiras et al., 2009) or to take into account the different time resolutions of the events of interest (Conte et al., 2012; Chin and Burred, 2012).

In this chapter we present a detection system based on a high-level representation of the audio stream, able to take into account both the short- and long-time properties of the events of interest. Thanks to the use of a bag-of-words approach, our method learns which short-time characteristics of an event are discriminant for that event on a longer time scale and differentiate it from the background sound. This is a very important property, especially in the considered domain. In fact, in the case of the application at hand, a car crash sound is characterized by an abrupt variation of energy in time, while a skidding tire is a sustained sound whose energy is concentrated in a narrow interval of frequencies.

We validated the system on a data set¹ that we made available for benchmarking purposes. In the proposed data set, the sounds of interest are not isolated but superimposed on different typical background sounds of roads and traffic jams, in order to consider the occurrence of such abnormal events in real-world conditions.

¹ The dataset is available at the url http://mivia.unisa.it/


3.2 Method

The purpose of the proposed system is to distinguish audio events of interest from the background sound and classify them into one of M classes. The rationale of the proposed approach is based on the consideration that a sound is composed of small, atomic audio units, similarly to a text that is composed of a number of words, and that the occurrence of particular units in a given time interval is an indicator of the presence of a certain event.

In order to build a description of the audio stream based on this assumption, a classification architecture exploiting the bag of words approach is employed. The bag of words technique has been widely applied for text categorization, in which the datum to be classified is represented by counting the occurrences of low-level features (words) and constructing a (high-level) vector whose dimensionality corresponds to the number of possible words contained in a dictionary. The high-level vector thus corresponds to the histogram of occurrences of words, used for the classification of the text.

In the proposed architecture for audio analysis, the following layers have been defined: 1) extraction of low-level features, 2) learning of a dictionary of basic audio words, 3) construction of a high-level vector and 4) classification. Below, a detailed explanation of each layer is provided.

3.2.1 Low-level features extraction

In contrast with video streams, an audio signal can show abrupt variations within a few milliseconds. Thus, in order to take into account its short-time variability, the audio stream is framed in small, partially overlapping chunks (frames) of duration Tf. The value of Tf has to be chosen to take into account the analysis of both low and high frequency components at the same time: with a very short frame, for instance, the system will not be able to consider low-frequency components; conversely, with a very long frame, high-frequency components will be averaged over a long time interval. For each frame, the system computes a vector of low-level features.

Three sets of low-level features have been considered and experimented with, namely the Mel-frequency cepstral coefficients (MFCC) (Zheng et al., 2001), energy ratios in Bark sub-bands (Zwicker, 1961) and features based on temporal and spectral characteristics of the signal (Liu et al., 1998; Peeters, 2004), previously employed in Carletti et al. (2013). More details on the three feature sets are reported in Table 3.1.


Set   Type               Description                              Ref.
1     Temporal and       volume, energy, zero-crossing rate;      Carletti et al. (2013),
      spectral           spectral centroid, spectral spread,      Liu et al. (1998),
                         roll-off frequency, spectral flux;       Peeters (2004)
                         energy ratio in 4 sub-bands
2     Cepstral           13 Mel-frequency cepstral                Zheng et al. (2001)
                         coefficients (MFCC)
3     Psychoacoustical   energy ratio in the first 24             Zwicker (1961)
                         critical bands of hearing

Table 3.1: Details of the three low-level feature sets used for the experiments.
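As an illustration of the third (psychoacoustical) set, the sketch below computes energy ratios in the Bark critical bands; the band edges used are the standard Zwicker approximation, and the function itself is an assumption of ours, not the thesis code.

```python
import numpy as np

# approximate edges (Hz) of the first 24 critical bands of hearing (Zwicker, 1961)
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]

def bark_energy_ratios(frame, fs=32000):
    """Fraction of the frame energy falling in each of the 24 critical bands."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = power.sum() + 1e-12
    return np.array([power[(freqs >= lo) & (freqs < hi)].sum() / total
                     for lo, hi in zip(BARK_EDGES[:-1], BARK_EDGES[1:])])
```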

3.2.2 Dictionary learning

The low-level feature space is continuous and theoretically infinite, thus not suitable for detecting the presence of specific relevant atomic units of sound (hereinafter audio words). In order to derive a finite set of audio words, we use the K-means algorithm, which clusters the vectors on the basis of their similarity. The output of the K-means algorithm is a set of K points that correspond to the centroids of the clusters. Since a centroid is representative of a group of similar low-level vectors, we consider the set D = {u1, . . . , uK} of the centroids as the dictionary of basic audio words.

3.2.3 High-level representation

In Figure 3.1, a sketch of the process of construction of the high-level representation is shown. Given the dictionary D, for each low-level vector vi the closest audio word uj is determined. The occurrences of each word uj in a time-limited interval are used to build a high-level feature vector. This vector corresponds to the histogram H = (h1, . . . , hK), whose bins are computed as:

h_j = \sum_{i=1}^{N} \delta(b_i, j), \qquad j = 1, \dots, K, \qquad (3.1)



Figure 3.1: Construction of the high-level representation. For each vector vi the nearest audio word uj in the dictionary is determined (b). Then, the occurrence counts of the single audio words are stored in a histogram, whose bins hj (j = 1, . . . , K) constitute the high-level vector (c). In the example, vectors v1 and v2 have u2 as the nearest audio word in the dictionary. Thus, the second bin of the histogram has a value equal to 2. In the same way, audio word u3 has only one close vector, resulting in a value equal to 1 for the third bin of the histogram. Audio words u1 and u4, instead, have no occurrences.

where δ(·) is the Kronecker delta and bi is the index of a word within the set D, determined as:

b_i = \arg\min_{j} d(v_i, u_j), \qquad j = 1, \dots, K, \qquad (3.2)

where d(vi, uj) is a dissimilarity measure between the low-level vector vi and the audio word uj (the Euclidean distance is considered).
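Eqs. (3.1)-(3.2) amount to a nearest-neighbour quantization followed by a counting step, as in this short sketch (variable names are ours): V holds the N low-level vectors of the interval and D the K audio words.

```python
import numpy as np

def high_level_vector(V, D):
    """Histogram H of audio-word occurrences in a time interval (Eqs. 3.1-3.2)."""
    # b_i: index of the closest word u_j for each low-level vector v_i (Eq. 3.2)
    dists = np.linalg.norm(V[:, None, :] - D[None, :, :], axis=2)
    b = dists.argmin(axis=1)
    # h_j: number of vectors whose closest word is u_j (Eq. 3.1)
    return np.bincount(b, minlength=len(D))
```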

3.2.4 Classification architecture

Our hypothesis is that certain classes of sounds have distinctive audio words that allow the system to differentiate such sounds from the other classes. A pool of M + 1 Support Vector Machines (SVM), each of them dedicated to the detection of a certain class of sounds (M events of interest plus the background sounds), has been trained with the high-level feature vectors. The SVM classifier is particularly suited for the employed sound representation, since it is able to learn which words are relevant for a particular class of events and to discard, by giving them a very low weight, those words that do not contribute to an effective classification. We employed SVMs with a linear kernel, which gives satisfactory results in our experiments coupled with fast processing, which is important for real-time responses.

The SVM is, originally, a binary classifier. Thus, a pool of SVMs (Figure 3.2) is realized in order to face the multi-class problem at hand. The i-th classifier is trained


using as positive examples the samples from the class Ci and as negative examples all the samples from the other classes. During the testing phase, each classifier computes a score si, which is a measure of the confidence of the classification, higher for more reliable decisions. The final class C is chosen as the one of the SVM that gives the highest score above a certain threshold λ:

C = \begin{cases} C_0, & \text{if } s_i < \lambda \ \forall i = 0, \ldots, M \\ \arg\max_i s_i, & \text{otherwise.} \end{cases} \qquad (3.3)

If all the classifiers give a confidence score si < λ, the time interval is classified as a background sound in class C0. For our experiments the threshold is set to λ = 0. The use of an SVM classifier for the background class increases the robustness of the proposed system with respect to background noise and entails a reduction of false alarms.
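The decision rule of Eq. (3.3) is straightforward to express in code. A minimal sketch, assuming scikit-learn-style classifiers that expose decision_function; names are illustrative:

```python
def classify(H, svms, lam=0.0):
    """Eq. (3.3): return the index of the SVM with the highest score,
    or 0 (the background class C0) if no score reaches the threshold.
    `svms` is the pool of M+1 trained one-vs-all classifiers."""
    scores = [clf.decision_function([H])[0] for clf in svms]
    best = int(np.argmax(scores))
    return best if scores[best] >= lam else 0
```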


Figure 3.2: Architecture of the classifier. The scores of the SVM classifiers are combined in order to determine the final class to be assigned to the input vector H.

3.3 Deployment Architecture

Our hypothesis for the deployment of the system is that we have a set R = {ri | i = 1, . . . , Nm} of Nm microphones installed on one side of the road, located at a distance of m meters from each other and at a height of hr meters (see Figure 3.3). The choice of the distance m strongly depends on two factors: 1) the sound intensity of the events to be detected; 2) the maximum distance d from the microphone at which an event can still be detected by the system. Of course, d depends on the kind of environment the system has to work in: we expect this value to be higher


for a country road (where only a few vehicles travel along the street at low speed) than for a highway, where the number of vehicles and their speed are significantly higher.

In order to better understand the impact of the environment on the coverage capabilities of the microphones, we consider that the signal-to-noise ratio (SNR) of the sound acquired by a microphone (expressed in decibels) is computed as follows:

SNR = Ls(d)− Ln, (3.4)

where Ls(d) represents the intensity level, expressed in decibels, of the event of interest occurring at a distance d from the microphone, while Ln is the noise, in decibels, introduced by the traffic. In the following, more information about the computation of these two contributions is provided.


Figure 3.3: A sketch of the deployment of the proposed system: a set R of microphones is located at a distance of m meters from each other and at a height hr. An event of interest can be recognized at a maximum distance of d meters from the closest microphone.

3.3.1 Intensity level of the event of interest

Since the propagation of the sound is affected by spreading, absorption, ground configuration and so on, the intensity of the audio event acquired by the microphone is attenuated by a factor A(d):

Ls(d) = Ls(d0)−A(d), (3.5)


where Ls(d0) is the sound intensity at a reference distance d0. According to the standard ISO 9613-2 (Technical Committee ISO/TC 43, 1996), the attenuation can be computed as a combination of four contributions, which strongly depend on the environment where the sound is propagating:

A(d) = A_{div}(d) + A_{atm}(d) + A_{gr}(d) + A_{bar}(d). \qquad (3.6)

Each of these contributions is determined by particular characteristics of the environment. In particular:

• Adiv is due to the geometrical divergence; we suppose a spherical spreading from the source, whose sound is radiated equally in all directions; thus, the sound level is reduced by 6 dB for each doubling of the distance from the source:

A_{div}(d) = 20 \log \frac{d}{d_0} + 11, \qquad (3.7)

where 11, computed as 10 · log(4 · π), is a constant that models the spherical spreading factor.

• Aatm is due to the atmospheric absorption during the propagation of the sound waves and can be computed as follows:

A_{atm}(d) = \frac{\alpha \cdot d}{1000}, \qquad (3.8)

where α is the atmospheric attenuation coefficient, which is a function of the temperature, the humidity and the nominal frequency. According to Technical Committee ISO/TC 43 (1996), α = 32.8 dB/Km, assuming a temperature around 10 °C and a nominal frequency of 4 kHz.

• The ground attenuation Agr is the result of sound reflected by the ground surface interfering with the sound propagating directly from the source (the vehicle causing the sound of interest) to the receiver (the microphone).

Let hr and hs be the receiver height and the source height, respectively. In order to compute Agr, the standard Technical Committee ISO/TC 43 (1996) suggests partitioning the area between the source and the receiver into three regions: the source region (whose size is 30 · hs), around the source, which determines the attenuation As; the middle region, which determines the attenuation Am; and the receiver region (whose size is 30 · hr), around the receiver, which determines the attenuation Ar.


Agr is thus computed as:

A_{gr}(d) = A_s + A_m(d) + A_r. \qquad (3.9)

In particular, at the nominal band of 4 kHz, Ar and As can be computed as follows:

Ar = As = 1.5 · (1−G) = 1.5. (3.10)

According to the standard, the G value is equal to 0, since we suppose that the road is a hard ground. Conversely, Am can be computed as:

Am(d) = 3 · q(d) · (1−G), (3.11)

where

q(d) = \begin{cases} 0, & d \le 30(h_s + h_r) \\ 1 - \frac{30(h_s + h_r)}{d}, & d > 30(h_s + h_r) \end{cases}

• Finally, Abar is due to the presence of barriers. Considering that the microphones are mounted directly on the road, this factor can be neglected in our scenarios. A sketch that combines the remaining contributions of Eq. (3.6) is given after this list.
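The following minimal sketch puts Eqs. (3.7)-(3.11) together at the 4 kHz nominal band, assuming base-10 logarithms; the parameter defaults mirror Table 3.2 and the function name is illustrative:

```python
import math

def attenuation_db(d, d0=1.0, alpha=32.8, hs=1.0, hr=4.0, G=0.0):
    """Total attenuation A(d) of Eq. (3.6), with Abar neglected."""
    a_div = 20.0 * math.log10(d / d0) + 11.0             # Eq. (3.7)
    a_atm = alpha * d / 1000.0                           # Eq. (3.8), alpha in dB/Km
    a_s = a_r = 1.5 * (1.0 - G)                          # Eq. (3.10)
    q = 0.0 if d <= 30 * (hs + hr) else 1.0 - 30 * (hs + hr) / d
    a_m = 3.0 * q * (1.0 - G)                            # Eq. (3.11)
    return a_div + a_atm + (a_s + a_m + a_r)             # Agr from Eq. (3.9)
```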

3.3.2 Intensity level of the traffic noise

In the last decades, the scientific community has proposed several approaches for modeling traffic noise, since it is considered very important in order to evaluate the acoustic impact both for environment management and for urban planning. As shown by Steele (2001) and Quartieri et al. (2009), there is no commonly adopted rule; rather, each country adopts its own standard: for instance, the CoRTN procedure (United Kingdom Department of Environment and Welsh Office Joint Publication, HMSO, 1975) has been adopted in England, the RLS 90 model (fur Verkehr, 1981) in Germany, the C.N.R. model (Canelli et al., 1983) in Italy and the NMPB in France (SETRA et al., 1995).

A common idea of such methodologies is to take into account the traffic flow, both of light and heavy vehicles, the typology of the road surface and the distance between the microphone and the carriage generating the noise. In particular, in this chapter we apply the CoRTN model in order to evaluate the traffic noise generated in different scenarios, taking advantage of the online application provided by the National Physical Laboratory (2015). The CoRTN model evaluates the so-called L10 (from now on Ln), that is, the noise level exceeded for just 10% of the time over a period of one hour.


The main idea is to partition the road into a set of S segments (such that within one segment the noise level variation is lower than 2 dB) and to separately evaluate for each i-th segment the basic noise level Li, taking into account the attenuation due to the distance as well as to the particular environment. Finally, the contributions of all the segments are combined so as to obtain the overall noise Ln.

According to the CoRTN model, the noise Li for the i-th segment, evaluated with a given traffic flow q, is computed as follows:

Li = 42.2 + 10 log10 q + C, (3.12)

where C is the correction factor required for different values of the speed v, the percentage of heavy vehicles p and the gradient of the road g. In fact, the basic computation of Li (with C = 0) considers an average speed v = 75 Km/h, a percentage of heavy vehicles p = 0% and a gradient of the road g = 0 degrees.

In order to simulate scenarios different from the basic one, a proper correction C = C1 + C2 needs to be applied. In particular, C1 is the correction for v and p:

C_1 = 33 \log_{10}\left(v + 40 + \frac{500}{v}\right) + 10 \log_{10}\left(1 + \frac{5p}{v}\right) - 68.8 \qquad (3.13)

while C2 is the correction for the gradient of the road and is computed as:

C2 = 0.3 · g. (3.14)

Finally, the contributions of the S segments are combined in order to calculate the overall traffic noise Ln:

L_n = 10 \log_{10} \sum_{i=1}^{S} 10^{L_i/10} \qquad (3.15)
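A sketch combining Eqs. (3.12)-(3.15), under the simplifying assumption that all S segments share the same traffic parameters (it reuses the math import above; the function name is illustrative):

```python
def cortn_noise_level(q, v=75.0, p=0.0, g=0.0, S=1):
    """Overall CoRTN traffic noise Ln for S identical road segments with
    traffic flow q (vehicles/h), speed v (Km/h), heavy-vehicle
    percentage p and road gradient g (%)."""
    c1 = (33 * math.log10(v + 40 + 500 / v)
          + 10 * math.log10(1 + 5 * p / v) - 68.8)       # Eq. (3.13)
    c2 = 0.3 * g                                          # Eq. (3.14)
    li = 42.2 + 10 * math.log10(q) + c1 + c2              # Eq. (3.12)
    return 10 * math.log10(S * 10 ** (li / 10))           # Eq. (3.15)
```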

3.3.3 Architecture discussion

The simulation has been performed by considering different scenarios in which our system can work. In particular, we evaluate how the SNR varies depending on the following parameters: the distance d, the vehicle speed v in the set {50, 70, 100, 130} Km/h and the number of vehicles per hour q in the set {100, 500, 1000, 4000} vehicles/h.

In Table 3.2 we report the values of the parameters considered in the simulation, while the obtained results are reported in Figure 3.4: in particular, each plot shows how the SNR (y-axis) varies with respect to the distance (x-axis) for a fixed value of q. The curves on the same plot refer to different values of v. As expected, it is evident how the SNR significantly decreases by increasing the speed, the traffic


Parameter   Value         Description
Ls          140 dB        intensity level of the source
hs          1 meter       height of the source
hr          4 meter       height of the receiver
d0          1 meter       reference distance
G           0             ground coefficient
α           32.8 dB/Km    atmospheric attenuation coefficient
p           5 %           percentage of heavy vehicles
g           3 %           gradient of the road
S           5             number of segments

Table 3.2: Summary of the values of the parameters used for the evaluation of the distance d.

flow and the distance.

Although the considered model allows us to simulate the behavior of the proposed system in several environments, by combining various traffic flows and vehicle speeds with different distance values, we decided to focus on the following two scenarios, representing the best and the worst case in which the proposed system can work: (1) a country road, where vehicles typically have a limited speed (around 50 Km/h) and the flow is very low (less than 100 vehicles/h); (2) a highway, where in the rush hours the vehicle flow may be very high (around 4000 vehicles/h), as well as the vehicle speed (around 100 Km/h).

Taking into account, as we explain in detail in Section 3.2, that an event of interest with an SNR of 10dB can be reliably detected by the proposed system, we designed the positioning of the microphones.

In Figure 3.4a and Figure 3.4d, we depict the attenuation of the SNR with the distance at a fixed traffic flow of q = 100 and q = 4000, respectively. In the first case, we observe that the SNR of the sounds of interest is about 10dB at a distance of 120 meters, while in the second case an SNR of 10dB is achieved at a distance of about 25 meters. This implies that for a country road the microphones can be placed at about 240 meters from each other. The highway scenario, instead, is definitely more challenging, due to the high number of vehicles crossing the road, and the optimal distance between microphones is approximately m = 50 meters.


Figure 3.4: Variation of the SNR (expressed in dB) with respect to the distance d (expressed in meters), the average speed v (expressed in Km/h) and the traffic flow q (expressed in vehicles/h). Panels: (a) best case, q = 100; (b) average case, q = 500; (c) average case, q = 1000; (d) worst case, q = 4000; each panel shows curves for v = 50, 70, 100 and 130 Km/h.

3.4 Experimental results

3.4.1 The data set

To the best of our knowledge, there are no publicly available data sets for road surveillance applications. Thus, we created a data set that contains two classes of hazardous road events, namely crashes and tire skidding. The audio clips are sampled at 32 KHz, with a resolution of 16 bits per PCM sample; the whole data set was made publicly available at http://mivia.unisa.it for benchmarking purposes.

An audio-based system for road surveillance has to deal with different kinds of background sounds, ranging from very quiet backgrounds (e.g. on country roads) to highly noisy traffic jams (e.g. in the center of a big city) and highways. Thus, in the proposed data set the events of interest are superimposed on different background sounds in order to simulate their occurrence in various environments.


We originally collected 59 samples of crashes and 45 of tire skidding, together with the sounds of 23 different road locations. We adopted a procedure to combine the original sounds, which we explain in the following.

The audio clips x(n) have been initially normalized so that they all have the same overall energy:

\bar{x}(n) = \frac{x(n)}{x_{rms}(n)}, \qquad (3.16)

where xrms(n) is the root mean square (RMS) value of the clip. A background clip b(n) of about one minute duration is randomly selected from the typical traffic sounds. Then a number Ne of foreground events is randomly chosen from the original data set and superimposed on the background sound, in order to account for the occurrence of events of interest in a real environment. The selected events are mixed with the background sound as follows:

out_j(n) = \sum_{i=1}^{N_e} \left\{ b_j(n) \oplus_{[s_i, e_i]} \left[ A \cdot x_i(n) \right] \right\}, \qquad (3.17)

where ⊕[si,ei] is an operator that combines the signal xi(n) with the signal bj(n) in the interval delimited by [si, ei], the starting and ending points of the sound of interest, respectively. The point ei is separated from the starting point of the next sound, si+1, by an interval of 4 to 7 seconds in which only background sound is present. The attenuation (or amplification) factor A is determined so as to achieve a signal-to-noise ratio of 15dB.
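A minimal sketch of the mixing step of Eqs. (3.16)-(3.17) for a single event, assuming NumPy arrays; the gain A is derived from the target SNR and the background RMS (names are illustrative):

```python
def mix_event(background, event, start, target_snr_db=15.0):
    """Superimpose an RMS-normalized event on a background clip at
    sample index `start`, scaled to reach the target SNR."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    event = event / rms(event)                     # Eq. (3.16)
    A = rms(background) * 10 ** (target_snr_db / 20.0)
    out = background.copy()
    out[start:start + len(event)] += A * event     # the ⊕ operator of Eq. (3.17)
    return out
```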

The final data set is composed of 57 audio clips of about one minute each, created with the procedure defined above. Each clip contains a sequence of events of interest: in total, 200 events per class are present. The produced clips are organized into N = 4 folds, each of them containing 50 events from each class of interest, which overlap various traffic background sounds. The samples contained in a fold (both background and events of interest) are not present in the remaining folds, which are thus completely independent from each other. Moreover, high variability in the data is ensured by the heterogeneous background sounds on which the events of interest are superimposed. Within a given fold, the same event can be present mixed with different backgrounds, in order to better represent various real situations. In the rest of the text, we refer to the different classes with the following abbreviations: BN for the background noise, CC for car crashes and TS for tire skidding. The details of the composition of the data set are reported in Table 3.3.


Data set details

      #Events   Duration (s)
BN    -         2732
CC    200       326.3
TS    200       522.5

Table 3.3: Details on the composition of the data set. The total duration of the sounds is expressed in seconds.

3.4.2 Experimental setup

For the computation of the low-level features, the audio stream is divided into frames of Tf = 32 milliseconds, corresponding to 1024 PCM samples. We found that the choice of Tf = 32ms is a reasonable compromise to take into account both the low- and the high-frequency properties of the signal and to perform a reliable short-time analysis of an audio stream sampled at 32KHz. Two consecutive frames are overlapped for 75% of their length in order to ensure continuity in the analysis of the audio stream. Different values of the number K of clusters (from 64 to 1024) have been considered for the experiments in order to evaluate the sensitivity of the system.

The high-level feature vector is computed for a time window of 3 seconds that shifts forward by 1 second. Two consecutive time windows thus overlap by two seconds. In this way, the continuity of analysis is ensured also at a time resolution of the order of seconds: events that occur at the end of one window fall roughly in the middle of the next one.
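For concreteness, these framing and windowing parameters translate directly into constants; a small sketch (the helper name is illustrative):

```python
SAMPLE_RATE = 32_000
FRAME_LEN = 1024                 # Tf = 32 ms at 32 KHz
FRAME_HOP = FRAME_LEN // 4       # 75% overlap between consecutive frames
WINDOW_LEN = 3 * SAMPLE_RATE     # 3 s high-level analysis window
WINDOW_HOP = 1 * SAMPLE_RATE     # the window shifts forward by 1 s

def segment_starts(num_samples, length, hop):
    """Start indices of overlapped analysis segments."""
    return range(0, num_samples - length + 1, hop)
```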

For the experiments, N-fold cross-validation is used. Cross-validation is a technique for the assessment of the performance of a pattern recognition system and of its generalization capabilities to different data. It consists in the separation of a data set into a number of folds, which are independent from each other in terms of samples: the samples contained in one fold are not present in the other folds. Cross-validation is often used to estimate how accurately a system will work in practice and how stable it will be under different conditions. In turn, N − 1 folds are used as a training set to learn the classification model, and the remaining fold is used as a test set. The results of the N tests obtained in this way are then averaged.

3.4.3 Performance evaluation

We evaluate the performance of the proposed system by measuring the recognition rate (true positive rate, TPR), i.e. the rate of correctly detected events of interest, and


the false positive rate (FPR), i.e. the rate of wrongly detected events of interest when only background sound is present. A correct classification is counted when at least one of the time windows overlapping the event is correctly classified. A false positive occurrence, which corresponds to a false alarm in a real system, is counted if an event of interest is detected when only background sound is present. In the case that the same event of interest is detected in two consecutive background time windows, only one false positive occurrence is counted.

Furthermore, we compute the receiver operating characteristic (ROC) curve, which is widely used to evaluate the overall performance of a classification system. It is a plot of the trade-off between the TPR and the FPR of a classifier as its discrimination threshold is varied. The closer a ROC curve is to the top-left corner of the plane, the better the performance. We consider the area under the ROC curve (AUC), which is equal to 1 for a perfect system, as an overall measure of the performance.

In Figure 3.5 we report the performance of the proposed system (red solid line) in terms of recognition rate on the data set. We studied the variation of the recognition rate with respect to the number of basic audio words (clusters) learned during the training phase. In the left column of Figure 3.5, the performance of the SVM classifier is depicted for the three considered sets of low-level features. We achieve an average recognition rate of 82%, 80.25% and 75%, with a standard deviation of 1.5, 1.64 and 2.4, by employing as low-level features the set proposed by Carletti et al. (2013), the MFCC and the Bark features, respectively. Moreover, we estimated the variance of the generalization error for the 4-fold cross-validation using the evaluation method proposed by Nadeau and Bengio (2003). We observed that the estimated variance is from 25 to 50 times smaller than the average error, thus confirming the statistical significance of the experiments on N = 4 folds.

In addition to the SVM classifier, we employed a k-Nearest Neighbor (kNN) classifier in order to evaluate the generalization capabilities of the proposed high-level representation. We show a plot of the performance results achieved by the kNN classifier in the right column of Figure 3.5. The value of k has been experimentally set to 5. Although the performance results of the SVM-based classifier are stable with respect to the number of clusters, the performance achieved with the kNN classifier suggests that an increasing number of audio words causes a worsening of the generalization capabilities. Thus, if too many words are used in the training phase, the system becomes specialized in the recognition of the events from the training set. However, for the application at hand, the number of clusters is not a critical parameter as long as it is kept below 128.

In Table 3.4 we report a summary of the results achieved by the system configured with K = 64 clusters, which is the value that gives the highest generalization.²


Figure 3.5: Recognition rate (y-axis) versus number of clusters (x-axis, from 64 to 1024) for the SVM classifier (left column: (a) features by Carletti et al. (2013), (c) MFCC, (e) Bark) and for the kNN classifier (right column: (b) features by Carletti et al. (2013), (d) MFCC, (f) Bark). In the left column, the recognition rate of the proposed system (solid red line) for the three considered sets of features is reported. The solid lines represent the performance obtained on a test set with the same SNR as the training set (15dB). The green and blue dashed lines show the results when the signal intensity (and the SNR) is reduced by 3dB and 6dB, respectively. In the right column, the performance achieved with the kNN classifier demonstrates a loss in the generalization capabilities of the proposed method when a too high number of basic audio words is chosen.


² With K = 64 clusters, the SVM classifiers learn for the classes BN, CC and TS the following average numbers of support vectors: (60, 55, 50) for the Bark feature set, (55, 70, 60) for the MFCC feature set and (50, 60, 55) for the feature set by Carletti et al. (2013).


Results on the data set

                          Rec. Rate   Miss Rate   Error Rate   FPR
Bark                      75%         21%         4%           10.96%
MFCC                      80.25%      19%         0.75%        5.48%
Carletti et al. (2013)    82%         17.75%      0.25%        2.85%

Table 3.4: Detailed results achieved by the proposed system configured with K = 64 basic audio words.

In Table 3.5, instead, we report the classification matrices achieved by the proposed system. We can note that the features proposed by Carletti et al. (2013) and the MFCC features show a higher robustness to traffic noise than the Bark features. As a consequence, the system achieves a larger false positive rate when the Bark feature set is used, due to the difficulty in differentiating the basic units of the sounds of interest from the background in very noisy traffic conditions. However, further studies on the temporal integration of basic audio units could improve the robustness to noise and the detection capabilities.

3.4.4 Sensitivity analysis

In a real environment, the sound source can be located at various distances from the microphone, resulting in the acquisition of signals with different intensities and signal-to-noise ratios. We performed a sensitivity analysis of the proposed system with respect to the signal intensity and the number of clusters. We decreased the intensity of the signal by −3dB and −6dB, in order to evaluate the detection capabilities at a distance of 25 and 120 meters, depending on the scenario, according to the analysis presented in Section 3.3. In practice, we trained the system on the events in the original data set and then tested it on events whose intensity is −3dB and −6dB of the original signal.

As observed in the previous paragraph, the number of basic audio words learned during the training process influences the generalization abilities of the system, while the trend of the recognition rate on the attenuated versions of the sounds (green and blue dashed lines for −3dB and −6dB, respectively) is coherent with the one on the original data set.

Conversely, it is worth noting that the performance of the system with respect to different distances of the sound source depends mostly on the low-level representation of the audio signal. When temporal features based on the intensity and energy of the signal are used to describe the audio frames, in fact, the performance


Bark

        Guessed
True    CC       TS       Miss
CC      86.0%    4.5%     9.5%
TS      2.0%     64.0%    34.0%

MFCC

        Guessed
True    CC       TS       Miss
CC      89.5%    1.0%     9.5%
TS      0.5%     71.0%    28.5%

Carletti et al. (2013)

        Guessed
True    CC       TS       Miss
CC      89.0%    0%       11.0%
TS      0.5%     75.0%    24.5%

Table 3.5: Classification matrices achieved by the proposed method on the data set with the three considered sets of low-level features.

Sensitivity analysis

              Carletti et al. (2013)   MFCC              Bark
              Rec.      σ              Rec.      σ       Rec.      σ
SVM (Orig.)   84.50%    1.50           82.65%    1.64    78.20%    2.40
SVM (Att.)    78.95%    4.94           79.80%    2.84    78.20%    2.20
kNN (Orig.)   72.20%    8.84           69%       9.49    77.30%    5.72
kNN (Att.)    69.52%    7.86           67.45%    8.45    77.30%    5.50

Table 3.6: Results of the performed sensitivity analysis. For both employed classifiers, the average recognition rate and its standard deviation are reported for the classification of the events in the proposed data set (Orig.) and of their attenuated versions (Att.).

inevitably decreases with an increasing distance of the events from the microphone (blue and green lines in Figures 3.5a, 3.5c, 3.5b and 3.5d).


Figure 3.6: ROC curves (true positive rate versus false positive rate) of the proposed system configured with the three considered sets of features: Carletti et al. (2013), MFCC and BARK.

In such cases, when the energy of an event of interest decreases, it becomes comparable with that of the background noise and it is more difficult to discriminate such events. The MFCC features, widely used for several audio recognition tasks like speech recognition or speaker identification, are sensitive to additive noise. However, they show a higher robustness to different signal-to-noise ratios, resulting in more stable results, as can be seen in Figure 3.6. From Figure 3.5e and Figure 3.5f, it is evident that the low-level features based on the distribution of the spectral energy in sub-bands are robust with respect to decreasing values of the power of the signal.

In Table 3.6 we report the average recognition rate and its standard deviation achieved by the proposed system when varying the number of clusters, for the test on the original data set and for the one that also considers the attenuated versions of the signals. The results obtained with the kNN classifier are highly influenced by the loss in generalization capabilities when a high number of clusters is configured. In Figure 3.6, instead, we compare the ROC curves achieved by using the three sets of low-level features. The areas under the curves (AUC) are equal to 0.80, 0.90 and 0.86 for the features used by Carletti et al. (2013), the MFCC and the BARK features, respectively. The ROC analysis confirms that the features based on the intensity and energy of the signal are inadequate for the recognition of sounds at various distances, while features based on frequency analysis have a higher robustness to different SNRs.


3.4.5 Real-time performance

The algorithm utilizes about 3% of the resources of a single Intel i5 CPU core to process audio streams sampled at 32 KHz. It has also been implemented and runs in real-time on an STM32F4 board, making its deployment very inexpensive.

3.5 Conclusions

In this chapter we proposed a system for detecting hazardous situations on roads by analyzing the audio stream acquired by surveillance microphones. We carried out the experiments on a data set that we created and made publicly available, with the aim of studying the sensitivity of the proposed system with respect to its configuration parameters. Furthermore, we conducted a careful design analysis in order to understand the potential of the proposed architecture, in terms of the maximum distance at which an event of interest can still be recognized in different kinds of environments, ranging from country roads to highways.

The achieved results confirm that the proposed system can be effectively used in noisy road environments, with an average accuracy of 78.95% at a maximum distance of 120 meters on country roads and of 25 meters on highways. Furthermore, its overall processing load is compatible with low-cost systems, thus encouraging its porting to embedded systems with limited hardware resources. This property allows the realization of road surveillance systems with a low deployment cost, also in combination with already existing surveillance architectures that provide audio acquisition sensors.


Submitted as:

Nicola Strisciuglio, Nicolai Petkov, Mario Vento, "CoPE: Trainable filters for feature extraction in audio signals," submitted to IEEE Transactions on Multimedia, 2016

Chapter 4

Trainable CoPE filters for audio eventsdetection

Abstract

Audio signal processing has mainly focused on vocal sounds and music analysis. The employed methodologies are not suitable for the analysis of audio streams with uncontrolled background noise, in many cases with an energy comparable to that of the signal of interest.

In this chapter, we introduce a novel method for the detection of events of interest in audio signals. We propose a filter that we call CoPE (Constellation of Peaks of Energy). It is versatile, as its structure is determined in an automatic configuration process given a prototype sound of interest. We construct a bank of CoPE filters, configured on a set of training events. Then we take their responses to build feature vectors that we use in combination with a classifier to perform the detection task. In general, the proposed CoPE filter can be thought of as a feature extractor that is not constructed a priori but rather automatically configured on training data.

We carried out experiments on two publicly available data sets, namely the MIVIA audio events and the MIVIA road events data sets, specifically made for testing event detection applications in environments with various background sounds. The results that we achieve (recognition rate > 90% and false positive rate < 5%) confirm the effectiveness of the proposed method and outperform the ones obtained by state-of-the-art approaches. The CoPE filter has a high robustness to variations of the SNR and is also very efficient: real-time response is achieved even when a large bank of filters is processed.


4.1 Introduction

In the last years, research on audio analysis has mainly focused on speech recognition (Besacier et al., 2014), speaker identification (Roy et al., 2012) and music classification (Fu et al., 2011b). The state-of-the-art methods are far from proposing a general framework for audio processing and audio pattern recognition. In the case of voice analysis, the features and the classification methodologies are established (Hidden Markov Models or Gaussian Mixture Models in combination with spectral or cepstral features). However, the human voice has very specific frequency characteristics that are not evident in other kinds of audio signals, such as interesting events for surveillance applications: gun shots, glass breakings, screams or car accidents. The distribution of the energy of such sounds in the spectral domain is very different from that of typical speech signals, usually involving high-frequency components. Moreover, in speech recognition and speaker identification systems, the sound source is assumed to be very close to the microphone. This implies a low influence of noise on the functioning of the overall system. In other applications, such as event detection for audio surveillance, the source of the sound of interest can be at any distance from the microphone. Thus, the analysis algorithms have to deal with very low or even negative values of signal-to-noise ratio (SNR). Due to the high complexity and variability of audio recognition problems, current methods are not suitable for facing every kind of audio analysis task. Moreover, a particular task usually involves a feature engineering process with the purpose of choosing a set of features that describe specific characteristics of the problem at hand.

In this chapter, we propose a method for audio event detection that automatically determines the features from training samples and deals with variable and uncontrolled background noise. We focus on the problem of audio event detection in the field of intelligent surveillance applications. Recently, indeed, an increasing demand for safety and security in public environments has prompted the signal processing and pattern recognition communities to focus on developing systems for intelligent surveillance applications. Traditionally, such systems have been involved in the processing of surveillance videos for behavior analysis (Sivaraman and Trivedi, 2013), human action recognition (Poppe, 2010; Foggia et al., 2014b) and traffic monitoring (Abadi et al., 2014). In the last years, attention towards the analysis of audio streams as a complementary or alternative tool to video analysis has grown. As a matter of fact, there are cases in which video analysis cannot be used for privacy reasons (e.g. public toilets) or suffers from illumination problems or sudden changes of light. Moreover, particular events, like screams or gun shots, are very difficult to detect in video streams. In such cases, the combination of video and audio analysis can be used for improving the reliability of the automatic surveillance


system (Cristani et al., 2007).

A systematic review of the state-of-the-art methods in the field of audio surveillance,

ranging from background subtraction to event detection and sound source localization, was recently proposed by Crocco et al. (2014). The approaches for audio event detection can be organized into two main groups on the basis of the complexity of their classification architecture and data representation.

Methods in the first group compute typical audio features, such as spectral moments, Mel-frequency cepstral coefficients (MFCC), wavelet coefficients, etc., and use them in combination with a classifier. Gaussian Mixture Model (GMM) based classifiers were employed to detect abnormal events (Vacher et al., 2004; Clavel et al., 2005) or to model the background sound (Valenzise et al., 2007). In order to limit the influence of the background noise, One-Class SVM classifiers were employed by Rabaoui et al. (2008) and Lecomte et al. (2011). Ntalampiras et al. (2011) proposed three probabilistic novelty detection methodologies to help human operators react to hazardous situations. These methods are based on short-time analysis only and are generally influenced by high variations of the background noise.

In the second group, instead, there are methods that employ more sophisticated classification architectures or representations of the data. An architecture with a cascade of two GMM-based classifiers was proposed by Ntalampiras et al. (2009): the first stage aims at separating events of interest from the background sound, while the second stage assigns the detected abnormal events to one of the classes of interest. Conte et al. (2012) and Aurino et al. (2014) introduced a reject option module for Learning Vector Quantization and Support Vector Machines, respectively, in order to increase the reliability of prediction and the robustness to background noise. The events of interest were hypothesized to be composed of small, atomic units of sound by Carletti et al. (2013) and Foggia et al. (2015b), where the bag of features and weighted bag of features classification paradigms were employed. However, the temporal arrangement of the basic audio units is important in order to increase the reliability of the detectors. This was taken into account by Grzeszick et al. (2015) and by Chin and Burred (2012), where a feature augmentation scheme and a classifier based on Genetic Motif Discovery were proposed, respectively. Audio phrases composed of sequences of basic audio units (also called audio words) are also employed by Phan et al. (2015). Foggia et al. (2014a), instead, formulated the audio event detection problem as an object detection task in time-frequency spectrogram-like images.

Generally, the methods in the second group have a better performance than the simpler methods in the first group. The use of more complex classification architectures or the construction of more complex data representations increases the reliability of such approaches.

However, state-of-the-art methods base their analysis on the computation of


standard sets of audio features, each describing specific characteristics of the audio signal, which makes them not suitable for general audio analysis tasks. As an example, some of the state-of-the-art methods are based on the computation of MFCC features. These have been demonstrated to be effective for human voice analysis, but are very sensitive to additive noise. Thus, although they are employed in audio surveillance applications, they do not ensure high description capabilities in cases where the noise has a high energy. A particular audio analysis task (a pattern recognition system in general) requires a process of feature engineering, aiming at finding a suitable set of features for the problem at hand. The choice of the right combination of features thus becomes a critical step in the design of an audio analysis system.

4.2 Rationale

We present a method for the analysis and classification of audio signals, based on the use of a novel trainable filter. The filter that we propose is called CoPE (Combination of Peaks of Energy) and takes inspiration from some characteristics of the human auditory system. The proposed filter is trainable, as its structure is not determined in advance but is rather learned in an automatic configuration procedure that is performed on a prototype sound of interest. The concept of trainable filters has been previously introduced for visual pattern recognition: the COSFIRE filters have been applied to several image processing tasks (Azzopardi and Petkov, 2012; Azzopardi et al., 2015; Shi et al., 2015; Guo et al., 2015), achieving promising results.

The proposed CoPE filter takes inspiration from how the outer human auditory system processes sound. The sound pressure wave that reaches the ear is directed to the cochlea membrane, which vibrates along time according to the frequency of the sound wave. The back of the cochlea is pervaded by neurons, called inner hair cells (IHC), that fire when the energy of the vibrations is higher than a certain threshold, generating neural activity patterns along time on the fibers of the auditory nerve. Different types of sound generate distinctive neural activity patterns, which we consider as a description of the sounds of interest. We employ the Gammatone filterbank as a model of the vibrations of the cochlea (Patterson and Moore, 1986), whose output is a spectrogram-like image called auditory map.

Humans can recognize sounds even in highly noisy environments because the auditory system is able to detect known neural activity patterns generated by particular vibrations of the cochlea even in the presence of noise. We take the points of highest local energy in the auditory map as the locations at which the IHCs fire. The use of local maxima points is motivated by their property of being robust to additive noise (Avery Lichun, 2003) and to variations of the signal-to-noise ratio.


Figure 4.1: Architectural schema of the proposed method. (a) The Gammatonegram representation of the training audio samples is used, in the training phase (dashed arrow), to construct (b) a bank of CoPE filters. Such filters are used in the application phase to (c) process the input sound and (d) construct feature vectors with their responses. (e) A multi-class SVM classifier is, finally, employed to detect events of interest.

We consider the relative arrangement of such points as a description of the underlying neural activity pattern and, thus, as a discriminant feature of the sound of interest. A particular CoPE filter relies on the detection of a certain constellation of local maxima points, whose structure is determined during an automatic configuration process performed on a training example. In the application phase, the filter detects the constellation pattern of interest and is tolerant to eventual distortions due to noise.

The basic concept of the filter is simple, and its trainable character allows one to avoid a feature engineering step, which is usually a delicate process in pattern recognition. The important features are learned directly from the events of interest, making the system easily adaptable to different sound recognition tasks and not requiring domain knowledge about the specific problem. In this chapter, we configure a bank of CoPE filters, trained to detect various events of interest, and take their responses to build feature vectors, used together with a classifier to perform the detection task.

We experimentally validated the proposed CoPE filter by testing its performance on two publicly available data sets for audio event detection and classification tasks, namely the MIVIA audio events (Foggia et al., 2015b) and the MIVIA road events (Foggia et al., 2015a) data sets. We also extended the MIVIA audio events data set by including events of interest at null and negative signal-to-noise ratio (SNR) values, in order to test the performance of the proposed method in highly noisy environments. Moreover, we perform an analysis of the sensitivity of the performance of the proposed method with respect to the parameters of the CoPE filter.


4.3 Method

In Fig. 4.1, we show a sketch of the architecture of the proposed method. The proposed CoPE filter detects a specific constellation of energy peaks (local maxima) in a time-frequency representation of the input audio signal. During an automatic configuration process, it creates a model of such a constellation of energy peaks from the Gammatonegram image of a prototype sound. The CoPE filter introduces tolerance in the detection of the considered pattern of interest, in order to account for slight deformations due to delay or distortion of the sound. For the application at hand, we configure a bank of CoPE filters (Fig. 4.1b) trained on different prototype events (Fig. 4.1a) and use their responses to build a feature vector (Fig. 4.1c-d). Such a vector is, then, used in combination with a classifier in order to perform the detection task (Fig. 4.1e).

4.3.1 Gammatone filterbank

In signal processing, it is common to represent the frequency distribution of the energy of the audio signal as it varies along time. We process the input signal by using a Gammatone filterbank, which is a biologically-inspired model of the response of the cochlea membrane in the outer human auditory system (Patterson and Moore, 1986). Different parts of the cochlea membrane, indeed, vibrate according to the energy of the frequency components of the sound pressure waves that arrive at the ear, drawing a so-called auditory image (Patterson et al., 1992).

The traditional and most used time-frequency representation of sound is the spectrogram, in which the energy distribution over the frequencies is computed by dividing the frequency axis into sub-bands having the same bandwidth. However, there are important differences between the way in which the ear analyzes the information and how it is done in the spectrogram. In the human auditory system, the resolution in the perception of differences in frequency is not constant, but rather inversely proportional to the base frequency of the sound. Thus, at low frequencies the band-pass filters have a narrower bandwidth than the ones at high frequencies. This implies a higher time resolution of the filters at high frequencies, which are able to better catch fast variations of the signal along time. The bandwidth of the band-pass filters in the Gammatone filterbank increases with increasing central frequency, in contrast with the spectrogram, where the filters all have the same bandwidth.

The impulse response of a Gammatone filter is the product of a gamma distribution with a sinusoidal tone:

g(t) = a \, t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi) \qquad (4.1)


where fc is the central frequency of the filter and φ is the phase, while the constant a controls the gain and n is the order of the filter. Finally, b is the decay factor, which determines the bandwidth of the considered band-pass filter and the duration of the impulse response. The center frequencies of the Gammatone filters in the filterbank are distributed along the frequency axis in proportion to their bandwidth, as determined by the Equivalent Rectangular Bandwidth (ERB) scale:

ERB = 24.7 + 0.108 \, f_c \qquad (4.2)

We divide the input audio signal into frames of Tf milliseconds and process every frame by a bank of Gammatone filters, in order to capture the short-time properties of the distribution of the energy. Two consecutive frames are overlapped by 50% of their length, so that the continuity of analysis is ensured and border effects are avoided. We finally construct a Gammatonegram image Xgt(t, f) by concatenating the responses of the Gammatone filterbank, represented as column vectors. Thus, each column of the Gammatonegram image is the response of the Gammatone filterbank at time instant t. In Figure 4.2, we show examples of the Gammatonegram representation of a glass breaking (a), a gun shot (c) and a scream (e). It is evident how the time-frequency distributions of the energy of the events differ from each other and can be employed as a basic representation to effectively distinguish such events.
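A minimal sketch of Eqs. (4.1)-(4.2) for a single Gammatone channel, assuming NumPy; the order n = 4 and the tie b = 1.019 · ERB are common choices in the literature, not values stated in the text:

```python
def gammatone_impulse_response(fc, fs=32_000, duration=0.025,
                               n=4, a=1.0, phi=0.0):
    """Sampled impulse response g(t) of Eq. (4.1), with the decay
    factor b derived from the ERB bandwidth of the channel (Eq. 4.2)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 + 0.108 * fc                  # Eq. (4.2)
    b = 1.019 * erb                          # common choice for the decay factor
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) \
             * np.cos(2 * np.pi * fc * t + phi)
```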

4.3.2 CoPE filter

In the following, we explain the configuration process of a CoPE filter and how it is applied to Gammatonegram images for the purpose of detecting audio events of interest. The CoPE filter takes as input the positions of the energy peaks in the Gammatonegram image and searches for the constellation of peak points that best corresponds to the one determined in its model during the configuration step. The CoPE filter introduces tolerance in the evaluation of the time-frequency positions of the energy peaks, thus accounting for deformations of the pattern of interest due to noise or distortion.

Energy peaks selection

The energy peaks in a Gammatonegram image Xgt(t, f) have the property of a high robustness with respect to the presence of additive noise (Avery Lichun, 2003). Moreover, the constellation of a set of such points approximately describes the time-frequency distribution of the energy of the sound of interest. A time-frequency point is considered to be a peak if it has a higher energy content than its neighboring points.


Figure 4.2: Gammatonegram representations of a glass breaking (a), a gun shot (c) and a scream (e), with the positions of the corresponding energy peaks (b, d, f). The highest energy point is represented with a small circle.

Thus, we suppress the non-maximum points in the Gammatonegram image and obtain an energy peak map, as follows:


P_{gt}(t, f) = \begin{cases} X_{gt}(t, f), & \text{if } X_{gt}(t, f) = \max\limits_{\substack{t - \Delta t \le t' \le t + \Delta t \\ f - \Delta f \le f' \le f + \Delta f}} X_{gt}(t', f') \\ 0, & \text{otherwise,} \end{cases} \qquad (4.3)

where ∆t and ∆f determine the size, in terms of time and frequency, of the neighborhood around a time-frequency point in which the local maximum intensity is evaluated.
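The non-maximum suppression of Eq. (4.3) can be written compactly with a moving-maximum filter. A minimal sketch, assuming SciPy and a Gammatonegram array with frequency channels on the rows and time frames on the columns; the neighborhood sizes are illustrative:

```python
from scipy.ndimage import maximum_filter

def energy_peaks(Xgt, dt=3, df=3):
    """Eq. (4.3): keep only the points of Xgt that are local maxima in a
    (2*df+1) x (2*dt+1) frequency-time neighborhood; zero out the rest."""
    local_max = maximum_filter(Xgt, size=(2 * df + 1, 2 * dt + 1))
    return np.where(Xgt == local_max, Xgt, 0.0)
```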

Configuration of a CoPE filter

Given the constellation of energy peaks of a sound of interest and a reference point (in our case, the point that corresponds to the highest peak of energy), we determine the model of a CoPE filter in an automatic configuration process by considering the positions of such peaks with respect to the reference point. In order to configure a CoPE filter, one has to choose only the support size of the filter, i.e. the length of a time interval around the reference point in which the energy peaks are selected as part of the model. In the proposed CoPE filter, every point pi that is selected with the method mentioned above is described by a tuple of three parameters (∆ti, fi, ei): ∆ti is the temporal offset of the considered point with respect to the reference point, fi represents the i-th frequency channel of the Gammatone filterbank and ei is the energy contained in its bandwidth. We denote by S = {(∆ti, fi, ei) | i = 1, . . . , P} the set of 3-tuples of a CoPE filter, where P is the number of peaks considered within the support of the filter.

As an example, in the images in the right column of Figure 4.2, we show a sketch of the configuration process for some events of interest. The energy peaks (small spots) are extracted from the corresponding Gammatonegram images in the left column, and their positions with respect to the highest energy point (small circle) are used to build the model S of the corresponding CoPE filters. In this work, we configure a filter for each event of interest in the training set, so obtaining a bank of CoPE filters.
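A minimal sketch of the configuration process under the same array-layout assumptions as above (the function name is illustrative):

```python
def configure_cope(Pgt, support):
    """Build the tuple set S = {(dt_i, f_i, e_i)} of a CoPE filter from an
    energy-peak map, relative to the highest-energy reference point."""
    f_ref, t_ref = np.unravel_index(np.argmax(Pgt), Pgt.shape)
    S = []
    for f, t in zip(*np.nonzero(Pgt)):
        if abs(t - t_ref) <= support:            # keep peaks within the support
            S.append((t - t_ref, f, Pgt[f, t]))  # (Δt_i, f_i, e_i)
    return S
```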

CoPE filter response

The configuration process results in a set of tuples that describe a constellation of energy peaks in the Gammatonegram image of a sound. Each tuple describes one sub-component of the filter. Formally, we define the response relative to the i-th tuple of the model as:

s_i(t) = \max_{t', f'} \left\{ \psi(f', t') \, G_{\sigma'}(t + \Delta t_i - t', \; f_i - f') \right\}, \quad -\Delta t \le t + \Delta t_i - t' \le \Delta t, \; -\Delta f \le f_i - f' \le \Delta f \qquad (4.4)

where the function Gσ′(·, ·) is a Gaussian weighting function that allows for some tolerance in the expected position of the i-th time-frequency point in the model. This choice is supported by the evidence, in the auditory system, that vibrations of the cochlea membrane due to a sound wave of a certain frequency excite the neurons specifically tuned for that frequency as well as neighboring neurons (Palmer and Russell, 1986). The size of the tolerance region is determined by the standard deviation σ′ of the function Gσ′. Its general form is σ′ = (σ0 + αρi)/2, where σ0 and α are constants, while ρi is the distance of the i-th energy peak from the reference point of the filter. We consider the same tolerance value for the positions of all the energy peaks (which correspond to the neurons excited by the cochlea vibrations) by setting α = 0. The function ψ(f, t) is a measure of the similarity between the energy peak point in the Gammatonegram and the one in the model. We set ψ(f, t) = Pgt(f, t), so as to account only for the relative positions of the peak points in the constellation.

We define the response of a CoPE filter as the geometric mean of the responses of its sub-components:

r(t) = \left| \left( \prod_{i=1}^{|S|} s_i(t) \right)^{1/|S|} \right|_{t_1}, \qquad (4.5)

where t1 is a threshold value. We consider t1 = 0, thus not suppressing the response of the CoPE filter.
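A direct (unoptimized) sketch of Eqs. (4.4)-(4.5), under the same array-layout assumptions as above and with a fixed tolerance sigma (i.e. α = 0); a real implementation would vectorize the inner loops:

```python
def cope_response(Pgt, S, sigma, dt, df):
    """Response r(t): geometric mean over the tuples of S of the
    Gaussian-weighted maximum peak response found in a tolerance
    window around each expected time-frequency position."""
    n_freq, n_time = Pgt.shape
    r = np.ones(n_time)
    for (dti, fi, _ei) in S:
        s_i = np.zeros(n_time)
        for t in range(n_time):
            t0 = t + dti                       # expected time of this peak
            t_lo, t_hi = max(0, t0 - dt), min(n_time, t0 + dt + 1)
            f_lo, f_hi = max(0, fi - df), min(n_freq, fi + df + 1)
            if t_lo >= t_hi:
                continue
            window = Pgt[f_lo:f_hi, t_lo:t_hi]
            tt, ff = np.meshgrid(np.arange(t_lo, t_hi) - t0,
                                 np.arange(f_lo, f_hi) - fi)
            weights = np.exp(-(tt ** 2 + ff ** 2) / (2 * sigma ** 2))
            s_i[t] = (window * weights).max()
        r *= s_i
    return r ** (1.0 / len(S))
```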

4.3.3 A bank of CoPE filters

We configure a set of CoPE filters on L training audio samples from different classes, so as to construct a bank of filters. We construct a feature vector that describes a sound in the interval [T1, T2] as follows:

v_{[T_1, T_2]} = \left[ \bar{r}_1, \bar{r}_2, \ldots, \bar{r}_L \right], \qquad (4.6)

where

\bar{r}_i = \max_{t \in [T_1, T_2]} r_i(t) \qquad (4.7)

is the maximum response of the i-th CoPE filter in the filterbank within the interval [T1, T2]. We then use the vectors constructed in this way to train a classifier, which is able to detect the occurrence of events of interest.
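The feature vector of Eq. (4.6) then follows directly from the response function sketched above (names are illustrative):

```python
def cope_feature_vector(Pgt, filterbank, T1, T2, sigma, dt, df):
    """Eq. (4.6): maximum response of each CoPE filter in the bank
    within the analysis window [T1, T2] (Eq. 4.7)."""
    return np.array([cope_response(Pgt, S, sigma, dt, df)[T1:T2].max()
                     for S in filterbank])
```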


4.3.4 Classifier

Considering M classes of interest, we build a multi-class SVM classifier by combining the predictions of a pool of M + 1 linear SVM classifiers (one for the background sounds and M for the events of interest). The SVM is able to learn which features are significant for the correct detection of a particular class of events and gives them higher weight in the classification process. Moreover, it finds an optimal separating hyperplane between sample points of different classes and has high generalization capabilities. We train the i-th SVM using as positive examples the samples from the class Ci and as negative examples all the samples from the other classes (one-vs-all scheme). Each classifier assigns a score mi to the example under test. We combine the score values by choosing the class that corresponds to the SVM that gives the highest classification score. We choose the reject class C0 (background) in case all the scores are negative:

C = C0 if mi < 0 ∀ i = 0, . . . , M;   C = arg max_i mi otherwise.   (4.8)
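A hedged sketch of this decision rule, assuming scikit-learn's LinearSVC and integer labels in which the first element of classes is the background class C0 (the class name CopeEventClassifier is our own):

import numpy as np
from sklearn.svm import LinearSVC

class CopeEventClassifier:
    def fit(self, X, y, classes):
        # one linear SVM per class, trained in a one-vs-all scheme
        self.classes = list(classes)
        self.svms = [LinearSVC().fit(X, (y == c).astype(int)) for c in self.classes]
        return self

    def predict(self, X):
        # m_i: signed score of the i-th SVM; combination as in Eq. 4.8
        scores = np.stack([svm.decision_function(X) for svm in self.svms], axis=1)
        labels = np.asarray(self.classes)[np.argmax(scores, axis=1)]
        labels[np.all(scores < 0, axis=1)] = self.classes[0]   # reject to C0
        return labels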

4.4 Data sets

We carried out experiments on two publicly available data sets, specifically made for testing audio event detection algorithms for surveillance applications: the MIVIA audio events (Foggia et al., 2015b) and the MIVIA road events (Foggia et al., 2015a) data sets1.

4.4.1 MIVIA audio events

Typical events of interest for intelligent surveillance applications are glass breakings, gun shots and screams. In the MIVIA audio events data set, such events are provided superimposed on various background sounds, in order to simulate their occurrence in different environments. In the original MIVIA audio events data set, each event is provided with 6 levels of signal to noise ratio ({5, . . . , 30} dB), so as to simulate sound sources at various distances from the microphone.

In this work, we extended the original data set by including cases in which the energy of the events of interest is equal to or lower than that of the background sound, so obtaining null or negative SNR values. Thus, adopting the same procedure described by Foggia et al. (2015b), we created two more versions of the audio clips, containing audio events at 0 dB and −5 dB SNR. We made the extended data set publicly available at http://mivia.unisa.it.

1The data sets are available for download at http://www.mivia.unisa.it

The final data set contains a total of 8000 events for each class, divided into 5600 events for training and 2400 events for testing, equally distributed over the considered values of SNR. The audio clips are PCM sampled at 32 kHz with a resolution of 16 bits per sample. Hereinafter we refer to glass breaking as GB, to gun shots as GS and to screams as S. In Table 4.1, we report the details of the composition of the data set.

4.4.2 MIVIA road events

The MIVIA road events data set contains car crash and tire skidding events mixed with typical road background sounds such as traffic jams, passing vehicles and crowds. In the data set, a total of 400 events (200 car crashes and 200 tire skiddings) are superimposed on various road background sounds, ranging from very quiet environments (e.g. country roads) to highly noisy traffic jams (e.g. in the center of a big city) and highways. The events of interest are distributed over a total of 57 audio clips of about one minute each. Such audio clips are divided into four independent folds (each fold contains 50 events per class) for cross-validation experiments. The audio signals are sampled at 32 kHz with a resolution of 16 bits per PCM sample. In the rest of the chapter, we refer to car crash as CC and to tire skidding as TS. In Table 4.2, we report the details of the composition of the data set.

4.5 Experiments

We adopted the experimental protocol defined by Foggia et al. (2015b): the detection of the events of interest is performed within a time window of 3 seconds that shifts forward over the audio signal by 1 second. We apply the proposed method on every time window in order to detect the presence or absence of an event of interest. We consider an event as correctly detected if it is detected in at least one of the windows that overlap with it.
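The protocol can be sketched as follows (helper names are our own; classify_window stands for the full CoPE feature extraction plus SVM classification of one window):

def detect_events(signal, fs, classify_window, win_s=3.0, hop_s=1.0):
    # 3 s analysis window shifted forward by 1 s, as in Foggia et al. (2015b)
    win, hop = int(win_s * fs), int(hop_s * fs)
    detections = []
    for start in range(0, len(signal) - win + 1, hop):
        label = classify_window(signal[start:start + win])
        if label != 0:                          # 0 denotes the background class
            detections.append((start / fs, label))
    return detections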

For an event detection task, the recognition rate or the classification matrix alone do not constitute sufficiently valid and complete methods of evaluation. Indeed, there are two types of error that we have to be aware of: the detection of an event of interest when only background sound is present (false positive) and the case when an event of interest occurs but is not detected (missed detection). We measured the performance of the proposed approach by calculating the recognition rate (RR), the false positive rate (FPR), the error rate (ER) and the miss detection rate (MDR).

MIVIA audio events data set

            Training set              Test set
         #Events   Duration (s)   #Events   Duration (s)
BN          -        77828.8         -        33382.4
GB        5600        8033.1       2400        3415.6
GS        5600        2511.5       2400         991.3
S         5600        7318.4       2400        3260.5

Table 4.1: Details of the composition of the MIVIA audio events data set. The total duration of the sounds is expressed in seconds.

MIVIA road events data set

       #Events   Duration (s)
BN        -         2732
CC       200         326.3
TS       200         522.5

Table 4.2: Details of the composition of the MIVIA road events data set. The total duration of the sounds is expressed in seconds.

Moreover, in order to assess the overall performance of the proposed method, we compute the Detection Error Trade-off (DET) curve. It is a graphical plot that measures the trade-off between the false positive rate and the miss detection rate and gives insight into the performance of a classifier in terms of its errors. In contrast with the ROC curve, in the DET curve the axes are non-linearly mapped in order to highlight differences between classifiers in the critical operating region. The closer the curve to the point (0, 0), the better the performance of the system.
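For illustration, the points of a DET curve can be computed as below (a sketch of ours, not the evaluation code used in the experiments); the non-linear axes are obtained with the normal-deviate transform norm.ppf from SciPy.

import numpy as np
from scipy.stats import norm

def det_curve(scores_pos, scores_neg, thresholds):
    fpr = np.array([np.mean(scores_neg >= th) for th in thresholds])  # false alarms
    mdr = np.array([np.mean(scores_pos < th) for th in thresholds])   # misses
    eps = 1e-6  # avoid infinities at 0 and 1 before the axis transform
    return norm.ppf(np.clip(fpr, eps, 1 - eps)), norm.ppf(np.clip(mdr, eps, 1 - eps))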

We carried out an analysis of the sensitivity of the performance with respect to the parameters of the CoPE filter. In particular, we show how the performance of the proposed method varies in relation to different values of the parameter σ0, which controls the tolerance for the position of each energy peak in the constellation. In order to perform such analysis, we use a version of the MIVIA events data set specifically built for cross-validation performance assessment (Foggia et al., 2015b). The data set is divided into k = 5 folds. Each of them contains 200 events of interest (times 8 versions of the SNR, as in the original MIVIA events data set).


Results - MIVIA audio events data set

                      Guessed class
              GB        GS         S        MDR
True   GB   95.33%    2.13%      1.25%     1.29%
class  GS    4.33%   89.25%      2.58%     3.83%
       S     1.5%     4.92%     87.79%     5.79%

Table 4.3: Classification matrix achieved on the extended MIVIA audio events data set.

4.5.1 Performance and results

In Table 4.3, we report the classification matrix achieved by the proposed method on the extended version of the MIVIA events data set. In the column MDR we report the rate of events of interest of a certain class that are not detected, but rather considered as background noise. The average recognition rate on the three classes of interest is 90.7%, while the miss detection rate and the error rate are 3.7% and 5.6%, respectively. We computed the false positive rate as the ratio between the number of false detections and the total number of time intervals in which only background sound is present. We obtained an overall FPR equal to 7.1%. In detail, 1.25% are falsely detected glass breakings, 2.74% gun shots and 3.11% screams. If a false detection is obtained in two consecutive time windows, we count only one FP.

A detailed analysis of the performance at different values of SNR is, instead, reported in Table 4.4. We observe high stability of the recognition and miss rates when the events of interest have positive (even very low) SNR. The very high robustness of the proposed method with respect to variations of the SNR, provided it remains positive, is attributable to the choice of the local energy peaks as basic input for the CoPE filter, which are robust to additive noise. The average performance achieved when only the events of interest with positive SNR values are taken into account is very high. The proposed method achieved a 95.38% recognition rate and a 3.8% miss rate, while the error rate is reduced to 0.8%. The FPR is also very low (2.2%), which confirms the robustness of the proposed CoPE filter as a feature extractor. Conversely, when the SNR value decreases below zero, the energy of the background noise introduces high deformation of the energy peak constellations of the patterns of interest. In such cases, however, most of the wrong classifications are due to errors in the guessed class, which means that the proposed method is still able to detect a hazardous event and raise an alarm, even if the detected event is not well-categorized.

In Table 4.5 we report the results achieved by the proposed method on the MIVIA road events data set. The average recognition rate over the four folds is 94% with a standard deviation of 4.32, while the average FPR is 3.94% with a standard deviation of 1.82. These performance results are in line with the ones achieved on the MIVIA audio events data set. The low standard deviation of the recognition rate, moreover, is indicative of good generalization capabilities of the features computed using the proposed CoPE filters.


Detailed results - MIVIA audio events data set

SNR                    RR        MR       ER       FPR
−5 dB                72.7%      0.4%    26.9%    23.8%
0 dB                 81.3%      5.9%    12.8%    20.03%
5 dB                 95.44%     2.45%    2.11%   11.4%
10 dB                95.44%     4%       0.56%    1%
15 dB                95.11%     4.33%    0.56%    0.1%
20 dB                95.11%     4.33%    0.56%    0.1%
25 dB                94.78%     4.66%    0.56%    0.9%
30 dB                95.44%     4%       0.56%    0.1%
Average              90.7%      3.7%     5.6%     7.2%
Average (SNR > 0)    95.2%      4%       0.8%     2.1%

Table 4.4: Detailed results for different values of the SNR, achieved by the proposed method on the MIVIA events data set.

Results - MIVIA road events

                 Guessed class
              CC       TS      Miss
True   CC     92%      2%       6%
class  TS     0.5%    96%      3.5%

Table 4.5: Average classification matrix achieved by the proposed method on the MIVIA road events data set.


4.5.2 Sensitivity analysis

For the configuration of a CoPE filter, the user has to choose the size of its support, i.e. the length of the time interval around the reference point in which the energy peaks are considered for the construction of the model. We observed from experiments that different sizes of the support of the filter, namely st = {200, 300, 400} ms, do not significantly influence the performance of the proposed system. We set for all the filters a support of 200 ms, which involves a confined number of energy peaks (i.e. tuples) in the model. Compared to a larger support size, it requires fewer processing resources in the application phase. One could choose, for instance, a support length st = 400 ms, achieving performance very close to the case in which st = 200 ms. The drawback, then, is the necessity of computing and combining the responses of a higher number of sub-components, which increases the processing time of each filter.


Sensitivity analysis

      MIVIA audio events (cross-validation)        MIVIA road events
σ0      RR       σRR      FPR      σFPR        RR       σRR      FPR      σFPR
1     64.02%    13.13    19.21%    7.29       71.5%     9.15    17.09%    7.66
2     73.4%      6.21    18.75%    3.16       81%       5.48    21.46%   11.14
3     86.16%     2.22    13.97%    2.85       95.25%    6.24    18.78%   12.5
4     86.17%     3.32    11.53%    2.4        95.25%    4.35     7.44%    4.39
5     85.3%      3.52     9.91%    2.2        94%       4.32     3.94%    1.82
6     85.63%     4.8     10.23%    2.1        93.75%    4.79     3.94%   11.31

Table 4.6: Sensitivity of the performance of the proposed method with respect to its parameter σ0, which regulates the weighting of the energy peaks. The higher the value of σ0, the larger the variations from the model that are tolerated.


We studied the sensitivity of the performance of the proposed CoPE filter with respect to its parameter σ0, which controls the degree of tolerance on the position of the energy peaks within the constellation. For these experiments, we considered a version of the MIVIA audio events data set specifically made for cross-validation. In Table 4.6, we report the recognition rates and the false positive rates, together with their standard deviations in cross-validation experiments, achieved by the proposed method on both the considered data sets, as the parameter σ0 varies. The performance of the proposed system is sensitive to varying values of the tolerance degree, mostly when they are kept very low. For higher values (σ0 = 4, 5, 6), instead, the performance metrics are stable. A higher tolerance degree for the detection of the energy peak positions involves a stronger robustness to background noise, which causes deformations of the constellation patterns of interest.

4.5.3 Results comparison

We compared the results achieved by the proposed method with the ones achieved by the methods described by Foggia et al. (2015b) and Foggia et al. (2015a) on the MIVIA audio events and on the MIVIA road events data sets, respectively.


Results comparison on MIVIA audio events data set

                          Test TP1                         Test TP2
Method            RR      MR      ER      FPR      RR      MR      ER     FPR
CoPE            91.71%   2.61%   5.68%   9.2%    90.7%    3.7%    5.6%   7.2%
CoPE (snr>0)    96%      3.1%    0.9%    4.3%    95.2%    4%      0.8%   2.2%
BoWh            76.4%   11.64%  11.96%   5.9%    56.07%  36.43%   7.5%   5.3%
BoWh (snr>0)    84.8%   12.5%    2.7%    2.1%    64.63%  31%      4.4%   4.2%
BoWs            77.81%  10.65%  11.54%   6.6%    59.11%  32.97%   7.92%  5.3%
BoWs (snr>0)    86.7%   10.7%    2.6%    3.1%    68.74%  26.4%    4.9%   4.5%

Table 4.7: Comparison of the performance results achieved by the proposed method with the ones achieved by state of the art approaches on the MIVIA audio events data set.

Foggia et al. (2015b) performed the audio event detection task by means of a system based on the bag of features classification paradigm. Two high-level representations are used: one based on hard vector assignment (BoWh) and the other based on soft vector assignment (BoWs). The latter is more robust to background noise than the former, at the cost of a higher computational load. In order to compare the performance of the proposed CoPE filter with the one achieved by these two approaches, we experimented by training the systems in two different ways.

The first training (TP1) was performed by considering only the audio events with positive SNR. In the second case the systems under test were trained by considering all the audio clips in the data set (TP2), including the events with null or negative SNR. In Table 4.7, we report the results achieved by the proposed method, compared with the ones achieved by existing methods on the MIVIA audio events data set. For each test, we also isolate the results that concern the classification of the events of interest with only positive SNR values, according to the analysis that we discussed in Section 4.5.1. The results highlight a higher stability of the proposed method, compared with the other methods, with respect to variations of the SNR of the events of interest. Moreover, the presence or absence of the negative SNR audio events in the training process of the system does not significantly influence the performance (recognition rate of 96% vs 95.39%, respectively). The performance of the approaches described by Foggia et al. (2015b), instead, is dependent on the SNR of the samples involved in the training process and on the SNR values of the events of interest to be detected. When sounds with only positive SNR values are used for training, the recognition rate achieved by state of the art methods is about 15% lower than the one obtained by the proposed method. The performance of such approaches decreases even more when events with negative SNR values are included in the model, i.e. in the test TP2 (more than 24% lower). The recognition rate achieved by the proposed system remains stable and is significantly higher than that of state of the art methods. This means that the method proposed in this chapter is more likely to work effectively in real environments, where the variability of the background sounds and of the SNR of the events of interest is generally high.


Results comparison on MIVIA road events data set

                                                      RR       MR       ER      FPR
Proposed method                                      94%      4.75%    1.25%    3.95%
  σ                                                   4.32     4.92     1.26     1.82
BoW - bark (Foggia et al., 2015a)                    80.25%  21.75%    3.25%   10.96%
  σ                                                   7.75     8.96     2.5      8.43
BoW - mfcc (Foggia et al., 2015a)                    80.25%  19%       0.75%    7.69%
  σ                                                  11.64    11.63     0.96     5.92
BoW (Carletti et al., 2013; Foggia et al., 2015a)    82%     17.75%    0.25%    2.85%
  σ                                                   7.79     8.06     1        2.52

Table 4.8: Comparison of the results achieved on the MIVIA road events data set with respect to the methods proposed in Foggia et al. (2015a).


In Table 4.8, we compare the results achieved on the MIVIA road events data set with the ones reported by Foggia et al. (2015a), where different sets of standard audio features have been employed as low-level descriptors in a system for the detection of hazardous situations on roads. We also include in Table 4.8 the value of the standard deviation of each measure of performance achieved in the cross-validation experiments. We obtain an average recognition rate (94%) that outperforms the other methods on the MIVIA road events data set by more than 10%, with a lower standard deviation. The improvement of the results confirms the robustness of the proposed method also on other kinds of events of interest. In Fig. 4.3a and Fig. 4.3b, we plot the DET curves achieved by the proposed method on the MIVIA audio events and MIVIA road events data sets, respectively. We also plot the curves related to the approaches described by Foggia et al. (2015b) and Foggia et al. (2015a). The false and miss probabilities in the plots refer to the performance of the classifier that takes decisions on the single time window that slides over the input audio signal. The results reported in Tables 4.7 and 4.8 are obtained by aggregating the decisions on the single time windows according to the protocol specified by Foggia et al. (2015b). The curve relative to the system based on the proposed CoPE filters (solid line) is closer to the point (0, 0), so confirming the generally better performance compared to the state of the art methods based on the use of traditional audio features.


[Figure 4.3 plots: DET curves, False Alarm probability (in %) vs. Miss probability (in %), both axes ranging from 0.1 to 40. (a) MIVIA audio events data set: cope, BoWh, BoWs. (b) MIVIA road events data set: cope, bark, mfcc, Carletti et al. (2013).]

Figure 4.3: Detection Error Trade-off curves of the proposed method compared to curves determined by the state of the art methods, on the MIVIA audio events (a) and MIVIA road events (b) data sets.



4.6 Discussion

The high recognition capabilities of the proposed method that emerged from the experiments are attributable to the trainable character and the versatility of the CoPE filter, which easily adapts to different types of events of interest. Indeed, one can configure a filter to detect any constellation of energy peaks in Gammatonegram images, similar to the one determined in the automatic configuration process on a training sound. It is noteworthy that the proposed trainable CoPE filter does not strictly relate to template matching techniques, which are influenced by variations with respect to the prototype pattern. The tolerance introduced in the detection of the position of every energy peak in the constellation allows for the detection of the prototype pattern used for configuration and also of modified versions of it, due mainly to noise or distortion. Moreover, the energy peaks used as basic inputs for the configuration and the processing of a CoPE filter have the property of stability with respect to additive noise, as reported by Avery Lichun (2003). In Figure 4.4b, we show the response r(t) along time of a CoPE filter configured on the glass breaking event of Fig. 4.4a. In Fig. 4.4c, instead, we show a detail of the response of such filter on the same glass breaking event at different values of SNR. It is noticeable how the response remains stable for positive, even very low, values of SNR and slightly decreases for null or negative SNR values, in accordance with the results reported in Table 4.4. This is due to the effect of background sound with higher energy than the event of interest. It determines, indeed, strong changes of the position of the energy peaks with respect to the ones included in the model, which reduce the detection performance of the CoPE filter.

One important advantage of using CoPE filters is the possibility of avoiding a feature engineering step, in which standard features (e.g. MFCC, spectral and temporal features, Wavelets, etc.) are usually chosen and combined together to overcome specific problems of the audio signals. On the contrary, the CoPE filter is used to determine the important features through an automatic configuration process, without the need of manually creating a feature set to describe the sound. A configured CoPE filter can be thought of as a specific feature extractor, determined directly from the data. Thus, it could be employed in other audio processing applications: music analysis for genre recognition (Sturm, 2014), ornamentation detection and recognition (Neocleous et al., 2015), audio fingerprinting (Cano et al., 2005), etc.

In this work, we configured a bank of CoPE filters on a set of training audio events and then considered their responses to form a feature vector that we used in combination with a classifier to perform the detection task. In the current system, the filters are all considered equally significant for the description of the input audio signal.


[Figure 4.4 plots: (a) gammatonegram of a glass breaking event (30 dB), freq/kHz (0.2 to 12) vs. time (seconds, 0 to 1.92); (b) response of the CoPE filter along time; (c) detail of the response at different SNR values (−5 dB to 30 dB).]

Figure 4.4: Example of the response along time of a CoPE filter (b) configured on a prototype glass breaking event of interest (a). A detailed view of the filter response on the same event of interest at different values of SNR (c) highlights the stability of the response for positive values of SNR: it decreases for null or negative SNR values.



The SVMs in the multiclass classifier learn the weight that the response of each filter has for the classification of events from different classes. In order to improve the performance and reduce the computational load of the CoPE filterbank, the selection procedure described by Strisciuglio et al. (2015a) can be employed. It allows discarding those filters that do not relevantly contribute to the classification task, thus reducing the number of filters in the filterbank without decreasing the performance and the detection capabilities of the overall system.

The computation of the response of a single CoPE filter is very efficient. Given a Gammatonegram image (of size N × M pixels), one 2-D convolution is required to detect the local energy peak points, with a time complexity of O(N × M). We configure all the CoPE filters in the filterbank with the same value of the parameter σ0, which implies the same standard deviation of the weighting function for all the filters. Thus, the weighted peak response can be computed only once before the processing of the CoPE filterbank. It requires one 2-D convolution that has O(N × M) time complexity. The response of one CoPE filter consists of the multiplication of the |S| blurred and shifted peak responses in its model. The number |S| is, however, negligible with respect to the N × M operations required for the weighting of the peak responses. In practice, the straightforward MATLAB implementation2 that we used for the experiments takes an average time of 0.965 seconds (with a standard deviation equal to 0.007 over 200 trials) to compute the response of a bank of 200 CoPE filters on an audio signal of 3 seconds. It is worth noting that the computation of the response of each CoPE filter in a filterbank is independent of the others. Thus, the proposed CoPE filterbank can be implemented in parallel mode, so as to further speed up the computation of the filterbank output.

4.7 Conclusion

We proposed a novel method for feature extraction in audio signals based on trainable filters, which we call CoPE filters, and demonstrated its effectiveness in the task of event detection for intelligent surveillance applications.

The CoPE filter is versatile, as its structure is determined by means of an automatic configuration process given a prototype pattern of interest. Moreover, it introduces tolerance in the detection of the configured pattern of interest and in the process of extraction of characteristic information from the audio signal. This accounts for generalization capabilities in the final decision process. The results that we achieved on two publicly available data sets demonstrate the effectiveness of the proposed method and the improvement with respect to state of the art approaches.

2The Matlab implementation is available at http://matlabserver.cs.rug.nl/


Published as:

George Azzopardi, Nicola Strisciuglio, Mario Vento, Nicolai Petkov, "Trainable COSFIRE filters for vessel delineation with application to retinal images", Medical Image Analysis, Volume 19, Issue 1, January 2015, Pages 46-57, ISSN 1361-8415, http://dx.doi.org/10.1016/j.media.2014.08.002.

Chapter 5

Retinal vessel delineation using trainable B-COSFIRE filters

Abstract

Retinal imaging provides a non-invasive opportunity for the diagnosis of several medical pathologies. The automatic segmentation of the vessel tree is an important pre-processing step which facilitates subsequent automatic processes that contribute to such diagnosis.

We introduce a novel method for the automatic segmentation of vessel trees in retinal fundus images. We propose a filter that selectively responds to vessels and that we call B-COSFIRE, with B standing for bar, which is an abstraction for a vessel. It is based on the existing COSFIRE (Combination Of Shifted Filter Responses) approach. A B-COSFIRE filter achieves orientation selectivity by computing the weighted geometric mean of the output of a pool of Difference-of-Gaussians filters, whose supports are aligned in a collinear manner. It achieves rotation invariance efficiently by simple shifting operations. The proposed filter is versatile, as its selectivity is determined from any given vessel-like prototype pattern in an automatic configuration process. We configure two B-COSFIRE filters, namely symmetric and asymmetric, that are selective for bars and bar-endings, respectively. We achieve vessel segmentation by summing up the responses of the two rotation-invariant B-COSFIRE filters followed by thresholding.

The results that we achieve on three publicly available data sets (DRIVE: Se = 0.7655, Sp = 0.9704; STARE: Se = 0.7716, Sp = 0.9701; CHASE DB1: Se = 0.7585, Sp = 0.9587) are higher than many of the state-of-the-art methods. The proposed segmentation approach is also very efficient, with a time complexity that is significantly lower than that of existing methods.


5.1 Introduction

Retinal fundus images (Fig. 5.1a) are routinely used for the diagnosis of various pathologies, including age-related macular degeneration and diabetic retinopathy – the two leading causes of blindness among people of the Western World (Abramoff, Garvin and Sonka, 2010) – as well as glaucoma, hypertension, arteriosclerosis and multiple sclerosis.

The computer analysis of retinal fundus images is an alternative to direct ophthalmoscopy, where a medical specialist visually inspects the fundus of the retina. Although ophthalmoscopy provides an effective means of analysing the retina, there is evidence that fundus photographs are more reliable than ophthalmoscopy, for instance, in the diagnosis of diabetic retinal lesions (Harding et al., 1995; von Wendt et al., 1999). Moreover, retinal fundus photography provides the possibility of analysing the produced images in batch mode. The manual analysis of retinal images is, on the other hand, time-consuming and expensive. The automation of certain processing steps is thus important: it facilitates the subsequent decisions by specialists and provides a basis for further automatic steps in the early diagnosis of specific diseases.

The automatic segmentation of blood vessels from the background (Fig. 5.1b) is one of the basic steps required for the analysis of retinal fundus images. This is a challenging process, mainly due to the width variability of vessels and to the low quality of retinal images, which typically contain noise and changes of brightness. Several methods have already been proposed for the segmentation of blood vessels in such images, which can be divided into the following two categories: unsupervised and supervised methods. Supervised methods use pixel-wise feature vectors to train a classifier in order to discriminate between vessel and non-vessel pixels, while unsupervised methods do not use classifiers but rely on thresholding filter responses or other rule-based techniques.

Vessel tracking techniques (Liu and Sun, 1993; Zhou et al., 1994; Chutatape et al., 1998; Tolias and Panas, 1998) are unsupervised methods that use an initial set of points, chosen either manually or automatically, to obtain the vascular tree by following the vessel center lines. Other unsupervised methods, which use a priori information about the profile structure of the vessels, employ mathematical morphology to segment the vessel tree from the background (Zana and Klein, 2001; Heneghan et al., 2002; Mendonca and Campilho, 2006). Moreover, morphological operators have been used to enhance the vessels, combined with curvature analysis and linear filtering to discriminate the vessels from the background (Fang et al., 2003). Matched filtering techniques (Chauduri et al., 1989; Hoover et al., 2000; Gang et al., 2002; Al-Rawi et al., 2007) model the profile of the vessels by using a two-dimensional (2D) kernel with a Gaussian cross-section.


Figure 5.1: (a) Example of a coloured retinal fundus image (of size 565 × 584 pixels) and (b) the corresponding manually segmented vessel tree, from the DRIVE data set (Staal et al., 2004).

Growing a "Ribbon of Twins" active contour model has been used both for segmentation and for width measurement of the vessels (Al-Diri et al., 2009). Martinez-Perez et al. (2007) proposed a method based on multiscale analysis to obtain vessel width, size and orientation information, which are used to segment the vessels by means of a growing procedure. Lam et al. (2010) proposed a multiconcavity modeling approach with a differentiable concavity measure to handle both healthy and unhealthy retinal images simultaneously.

On the other hand, supervised methods have been used to automatically label pixels as either vessel or non-vessel. In such methods, classifiers are trained by pixel-wise feature vectors that are extracted from training retinal images, whose ground truth labels are given in the corresponding manually labeled images. For instance, a k-Nearest Neighbor (kNN) approach was used by Niemeijer et al. (2004) and Staal et al. (2004) to classify the feature vectors that were constructed by a multiscale Gaussian filter and by a ridge detector, respectively. Soares et al. (2006) used a Bayesian classifier in combination with multiscale analysis of Gabor wavelets. A multilayer neural network was applied by Marin et al. (2011) to classify pixels based on moment-invariant features. Ricci and Perfetti (2007) proposed a rotation-invariant line operator, both as an unsupervised method and in combination with a support vector machine (SVM) with a linear kernel. Fraz et al. (2012) employed a classification scheme based on an ensemble of boosted and bagged decision trees.

Most of the existing unsupervised methods are based on filtering techniques that rely on linear operations using predefined kernels. In particular, the output of those filtering methods is essentially a summation of weighted neighboring pixels (template matching). For instance, Al-Rawi et al. (2007) convolve (weighted sum) a pre-processed retinal image with second-derivative Gaussian kernels, and Ricci and Perfetti (2007) use a set of fixed hand-crafted templates. Template matching methods are sensitive to slight deformations from the expected pattern.



In this chapter, we introduce a novel method for the automatic segmentation of blood vessels in retinal fundus images. It is based on the Combination of Receptive Fields (CORF) computational model of a simple cell in visual cortex (Azzopardi and Petkov, 2012) and its implementation called Combination of Shifted Filter Responses (COSFIRE) (Azzopardi and Petkov, 2013b). We propose a bar-selective COSFIRE filter, or B-COSFIRE for brevity, that can be effectively used to detect bar-shaped structures such as blood vessels. The B-COSFIRE filter that we propose is non-linear, as it achieves orientation selectivity by multiplying the output of a group of Difference-of-Gaussians (DoG) filters, whose supports are aligned in a collinear manner. It is tolerant to rotation variations and to slight deformations. Moreover, unlike the hand-crafted methods mentioned above, COSFIRE is a trainable filter approach. This means that the selectivity of the filter is not predefined in the implementation but is determined from a user-specified prototype pattern (e.g. a straight vessel, a bifurcation or a crossover point) in an automatic configuration process.

We evaluated the proposed method on the following three publicly available data sets: DRIVE (Staal et al., 2004), STARE (Hoover et al., 2000) and CHASE DB1 (Owen et al., 2009). Other data sets, such as the REVIEWDB (Al-Diri et al., 2008) and BioImLab (Grisan et al., 2008) data sets, have not been considered in the evaluation of our method, as they are designed to evaluate vessel width measurement and tortuosity estimation algorithms, respectively.

5.2 Proposed method

5.2.1 Overview

In the following, we explain how the proposed B-COSFIRE filter is configured and used for the segmentation of the vessel tree in a given retinal fundus image. Before we apply the B-COSFIRE filter, we perform a pre-processing step to enhance the contrast of the vessels and to smooth the border of the field-of-view (FOV) of the retina. We elaborate on this pre-processing aspect in Section 5.3.2. We also demonstrate how tolerance to rotation is achieved by simply manipulating some parameters of the model.

Fig. 5.2 illustrates the design principle of the proposed B-COSFIRE filter, here configured to be selective for a vertical bar. It uses as input the responses of center-on DoG filters at certain positions with respect to the center of its area of support. Such DoG filters give high responses to intensity changes in the input image. Each gray circle in Fig. 5.2 represents the area of support of a center-on DoG filter. The response of a B-COSFIRE filter is computed as the weighted geometric mean, essentially the product, of the responses of the concerned DoG filters in the centers of the corresponding circles. The positions at which we take their responses are determined by an automatic analysis of the response of a DoG filter to a prototype bar structure; we explain this procedure below.


Figure 5.2: Sketch of the proposed B-COSFIRE filter. The black spot in the middle of the white bar indicates the center of the filter support, which is illustrated as a dashed ellipse. A B-COSFIRE filter combines the responses from a group of DoG filters (represented by the solid circles) by multiplication.


5.2.2 Detection of Changes in Intensity

We denote by DoGσ(x, y) a center-on DoG function with an excitatory (i.e. positive) central region and an inhibitory (i.e. negative) surround:

DoGσ(x, y) def= (1/(2πσ²)) exp(−(x² + y²)/(2σ²)) − (1/(2π(0.5σ)²)) exp(−(x² + y²)/(2(0.5σ)²))

where σ is the standard deviation of the Gaussian function that determines the extent of the surround. This type of function is an accepted computational model of some cells in the lateral geniculate nucleus (LGN) of the brain (Rodieck, 1965). Motivated by the results of electrophysiological studies of LGN cells in owl monkey (Irvine et al., 1993; Xu et al., 2002), we set the standard deviation of the inner Gaussian function to 0.5σ.


Figure 5.3: (a) Synthetic input image (of size 100 × 100 pixels) of a vertical line (5 pixels wide) and (b) the corresponding response image of a center-on DoG filter (here σ = 2.6).


Figure 5.4: Example of the configuration of a B-COSFIRE filter to be selective for a vertical bar. The center of the area of support is indicated by the cross marker. The enumerated spots represent the positions at which the strongest DoG responses are achieved along the concentric circles of given radii.

For a given location (x, y) and a given intensity distribution I(x′, y′) of an image I, the response cσ(x, y) of a DoG filter with a kernel function DoGσ(x − x′, y − y′) is computed by convolution:

cσ(x, y) def= |I ⋆ DoGσ|⁺   (5.1)

where |·|⁺ denotes half-wave rectification1. Fig. 5.3b shows the response image of a DoG filter that is applied to the synthetic input image shown in Fig. 5.3a.

1Half-wave rectification is an operation that suppresses (sets to 0) the negative values.
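As a minimal sketch of Eq. 5.1 (ours, not the code used in this thesis), the DoG response can be computed with two Gaussian blurs; following the center-on description above, the narrower (0.5σ) Gaussian is taken as the excitatory term, a sign convention that varies across implementations.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_response(img, sigma):
    # difference of two Gaussian-smoothed images = convolution with a DoG kernel
    response = gaussian_filter(img, 0.5 * sigma) - gaussian_filter(img, sigma)
    return np.maximum(response, 0.0)   # half-wave rectification |.|^+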


5.2.3 Configuration of a B-COSFIRE Filter

Fig. 5.4 illustrates an example of the automatic configuration of a B-COSFIRE filter. We use a synthetic image that contains a vertical bar, such as the one shown in Fig. 5.3a. We choose the center point, labeled by '1' in Fig. 5.4, and in an automatic process we analyse its local neighbourhood as described below.

We apply a center-on DoG filter with a given σ to the prototype input pattern. We then consider the DoG filter responses cσ(x, y) along a number (in general k) of concentric circles around the center point (labelled by '1' in Fig. 5.4). The positions along these circles at which these responses reach significant local maxima are the positions of the points that characterize the dominant intensity variations around the point of interest. For the considered example, there are two such positions for each of the two circles. These points are labelled from '2' to '5' in Fig. 5.4. The number of such points depends on the number k of concentric circles we consider and on the specified prototype pattern.

In the proposed B-COSFIRE filter, every point i that is selected with the method mentioned above is described by a tuple of three parameters (σi, ρi, φi): σi represents the standard deviation of the DoG filter that responds most strongly and that provides the input, while ρi and φi are the polar coordinates with respect to the center of support of the B-COSFIRE filter.

We denote by S = {(σi, ρi, φi) | i = 1, . . . , n} the set of 3-tuples of a B-COSFIRE filter, where n stands for the number of considered DoG responses. In Eq. 5.2 we report the parameter values of a set S that are determined by the automatic analysis of the input pattern shown in Fig. 5.4.

S = { (σ1 = 2.6, ρ1 = 0, φ1 = 0),
      (σ2 = 2.6, ρ2 = 2, φ2 = 1.57),
      (σ3 = 2.6, ρ3 = 2, φ3 = 4.71),
      (σ4 = 2.6, ρ4 = 4, φ4 = 1.57),
      (σ5 = 2.6, ρ5 = 4, φ5 = 4.71) }   (5.2)
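The analysis along concentric circles can be sketched as follows (our own simplification; in particular, the criterion for a "significant" local maximum, here a fraction frac of the strongest response on the circle, is an assumption):

import numpy as np

def configure_b_cosfire(c, center, radii, sigma, n_angles=360, frac=0.75):
    # c: DoG response image; center: (row, col) of the point of interest
    y0, x0 = center
    S = [(sigma, 0.0, 0.0)]                      # tuple for the center point
    for rho in radii:
        angles = 2 * np.pi * np.arange(n_angles) / n_angles
        vals = np.array([c[int(round(y0 + rho * np.sin(a))),
                           int(round(x0 + rho * np.cos(a)))] for a in angles])
        for j in range(n_angles):                # local maxima along the circle
            if (vals[j] > max(vals[j - 1], vals[(j + 1) % n_angles])
                    and vals[j] >= frac * vals.max()):
                S.append((sigma, rho, angles[j]))
    return S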

5.2.4 Blurring and Shifting DoG Responses

The above configuration process results in a B-COSFIRE filter that is selective for the collinear spatial arrangement of five strong intensity variations. We use the DoG responses at the determined positions to compute the output of the B-COSFIRE filter.

First, we blur the DoG responses in order to allow for some tolerance in the position of the respective points. We define the blurring operation as the computation of the maximum value of the weighted thresholded responses of a DoG filter. For the weighting, we multiply the responses of the DoG filter by the coefficients of a Gaussian function Gσ′(x′, y′), whose standard deviation σ′ is a linear function of the distance ρi from the support center of the filter:


σ′ = σ′0 + αρi,   (5.3)

where σ′0 and α are constants.

Second, we shift each blurred DoG response by a distance ρi in the direction opposite to φi, so that they meet at the support center of the B-COSFIRE filter. The concerned shift vector is (∆xi, ∆yi), where ∆xi = −ρi cos φi and ∆yi = −ρi sin φi.

We denote by sσi,ρi,φi(x, y) the blurred and shifted response of a DoG filter for each tuple (σi, ρi, φi) in the set S. Formally, we define the i-th blurred and shifted DoG response as:

sσi,ρi,φi(x, y) = max_{x′,y′} { cσi(x − ∆xi − x′, y − ∆yi − y′) Gσ′(x′, y′) },   (5.4)

where −3σ′ ≤ x′, y′ ≤ 3σ′.
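A direct (and deliberately slow) Python transcription of Eqs. 5.3-5.4 follows; it is a sketch of ours, with np.roll used as a simple stand-in for proper border handling.

import numpy as np

def blur_and_shift(c, rho, phi, sigma0p, alpha):
    sigma_p = sigma0p + alpha * rho              # Eq. 5.3
    half = int(3 * sigma_p)
    blurred = np.zeros_like(c)
    for dy in range(-half, half + 1):            # Gaussian-weighted maximum
        for dx in range(-half, half + 1):
            w = np.exp(-(dx**2 + dy**2) / (2.0 * sigma_p**2))
            blurred = np.maximum(blurred, w * np.roll(np.roll(c, dy, 0), dx, 1))
    # shift by rho in the direction opposite to phi, so responses meet at the center
    dxi = -int(round(rho * np.cos(phi)))
    dyi = -int(round(rho * np.sin(phi)))
    return np.roll(np.roll(blurred, dyi, 0), dxi, 1)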

5.2.5 Response of a B-COSFIRE Filter

We define the output of a B-COSFIRE filter as the weighted geometric mean of all the blurred and shifted DoG responses that correspond to the tuples in the set S:

rS(x, y) def= | ( ∏_{i=1}^{|S|} (sσi,ρi,φi(x, y))^{ωi} )^{1/Σ_{i=1}^{|S|} ωi} |_t   (5.5)

with ωi = exp(−ρi²/(2σ²)) and σ = (1/3) max_{i∈{1,...,|S|}} {ρi},

where |·|t stands for thresholding the response at a fraction t (0 ≤ t ≤ 1) of the maximum response. The weighted geometric mean is an AND-type function, that is, a B-COSFIRE filter achieves a response only when all the afferent blurred and shifted responses sσi,ρi,φi(x, y) are greater than zero. The contribution of the blurred and shifted responses decreases with an increasing distance from the center of the support of the B-COSFIRE filter. A B-COSFIRE filter is selective for a bar of a given preferred orientation, the one of the prototype bar structure that was used for its configuration.
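In code (a sketch of ours), the weighted geometric mean of Eq. 5.5 is conveniently computed in the log domain; s_list holds the blurred and shifted responses and rhos the corresponding ρi values, assumed not all zero.

import numpy as np

def b_cosfire_response(s_list, rhos, t=0.2):
    sigma_gm = max(rhos) / 3.0
    omegas = np.exp(-np.asarray(rhos)**2 / (2.0 * sigma_gm**2))   # weights
    # product of s_i^omega_i, normalized by the sum of the weights (Eq. 5.5)
    log_r = sum(w * np.log(np.maximum(s, 1e-12)) for w, s in zip(omegas, s_list))
    r = np.exp(log_r / omegas.sum())
    r[r < t * r.max()] = 0.0                     # thresholding |.|_t
    return r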


5.2.6 Achieving Rotation Invariance

The orientation preference of a B-COSFIRE filter, as described above, depends on the orientation of the bar structure used as input for the configuration of the filter. One can configure a filter with a different orientation preference by presenting a rotated bar. Alternatively, one can manipulate the parameters in the set S, which corresponds to the orientation preference 0◦, to obtain a new set Rψ(S) with orientation preference ψ:

Rψ(S) = {(σi, ρi, φi + ψ) | ∀ (σi, ρi, φi) ∈ S} (5.6)

In order to detect bars at multiple orientations, we merge the responses of B-COSFIRE filters with different orientation preferences by taking the maximum value at every location (x, y):

r̂S(x, y) def= max_{ψ∈Ψ} { rRψ(S)(x, y) }   (5.7)

where Ψ is a set of nr equidistant orientations, given as Ψ = { (π/nr) i | 0 ≤ i < nr }. Fig. 5.5c shows the response image of a rotation-invariant B-COSFIRE filter (12 values of ψ: ψ ∈ {0, π/12, π/6, . . . , 11π/12}) to the retinal image shown in Fig. 5.5a. The images in Fig. 5.5d to Fig. 5.5o show the response images of B-COSFIRE filters achieved for different ψ values.

The computation of the output of a rotation-invariant B-COSFIRE filter is very efficient. It involves one convolution (O(n log n), where n is the number of pixels in a given image) of the input image with a center-on DoG filter (Eq. 5.1), followed by a maximum weighted operation (Eq. 5.4) using a separable Gaussian filter (O(kn), where k is the number of pixels in a given kernel) for each of the unique values of the ρ parameter. The response for each considered orientation is achieved by two linear-time (O(n)) operations: appropriate shifting of the pre-computed weighted DoG responses followed by the weighted geometric mean. Finally, the rotation-invariant response is achieved by another linear-time operation that takes the pixel-wise maximum of all orientation response maps.
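The orientation sweep itself is a few lines (sketch of ours; response_fn stands for the evaluation of one oriented filter via Eq. 5.5):

import numpy as np

def rotation_invariant_response(S, img, response_fn, nr=12):
    best = None
    for i in range(nr):
        psi = np.pi * i / nr                                  # equidistant orientations
        S_rot = [(s, rho, phi + psi) for (s, rho, phi) in S]  # Eq. 5.6
        r = response_fn(S_rot, img)
        best = r if best is None else np.maximum(best, r)     # Eq. 5.7
    return best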

5.2.7 Detection of Bar Endings

The AND-type output function of a B-COSFIRE filter achieves a response only when all the afferent inputs are activated. In principle, a B-COSFIRE filter does not achieve a response at bar (or vessel) endings. In practice, however, due to noisy backgrounds in retinal images, a B-COSFIRE filter also achieves a response at the end of the vessels, but one that is much lower than the response achieved in the middle of a vessel.

We address this matter by configuring a new B-COSFIRE filter using the prototype bar ending shown in Fig. 5.6a. Here, we use the point that lies on the end of the line as the point of interest. In order to distinguish between the two types of filters, we refer to this filter as an asymmetric B-COSFIRE filter and to the previously defined filter as a symmetric B-COSFIRE filter.



Figure 5.5: (a) Input retinal image (of size 565 × 584 pixels) and (b) its pre-processed version (the pre-processing procedure is thoroughly explained in Section 5.3.2). (c) The maximum superimposed response image r̂S(x, y) of a rotation-invariant B-COSFIRE filter (ρ ∈ {2, 4, 6, 8}, σ = 2.5, α = 0.7, σ′0 = 4) that is applied with 12 equidistant values of the parameter ψ: (d) ψ = 0, (e) ψ = π/12, (f) ψ = π/6, (g) ψ = π/4, (h) ψ = π/3, (i) ψ = 5π/12, (j) ψ = π/2, (k) ψ = 7π/12, (l) ψ = 2π/3, (m) ψ = 3π/4, (n) ψ = 10π/12, (o) ψ = 11π/12. The response images are inverted for clarity reasons.


Fig. 5.7d demonstrates that, in contrast to a symmetric B-COSFIRE filter, an asymmetric B-COSFIRE filter achieves a much stronger response at the end of the vessel shown in Fig. 5.7a.



Figure 5.6: (b) Example of the configuration of an asymmetric B-COSFIRE filter by a prototype bar-ending (a). The point of interest of the B-COSFIRE filter is indicated by the cross marker and lies on the end of the prototype line.


Figure 5.7: Responses to bar endings. (a) An enlarged area of the green channel of a retinal fundus image and (b) the corresponding ground truth image illustrating a single vessel ending. (c-d) The response images obtained by a symmetric and an asymmetric B-COSFIRE filter, respectively. The black spots (a-b) and the white spots (c-d) indicate the position of the vessel ending.

5.3 Results

5.3.1 Data Sets and Ground Truth

We use three publicly available data sets of retinal fundus images, called DRIVE (Staal et al., 2004), STARE (Hoover et al., 2000) and CHASE DB1 (Owen et al., 2009). These data sets have gained particular popularity as they comprise the corresponding ground truth images that are manually segmented by different observers.

The DRIVE data set consists of 40 images (divided into a training set and a test set, each of which contains 20 images). For each image, the DRIVE data set contains a mask that delineates the FOV area, together with the corresponding binary segmentation of the vessel tree. The images in the training set have been manually segmented by one human observer, while the images in the test set have been segmented by two other observers.



The STARE data set comprises 20 color retinal fundus images, 10 of which contain signs of pathologies. The data set contains two groups of manually segmented images prepared by two different observers.

The CHASE DB1 data set contains 28 colour images of retinal fundus from 14 patients in the Child Heart And Health Study in England. The data set contains two groups of manually segmented images provided by two observers.

For all three data sets, the performance of the proposed method is measured by comparing the automatically generated binary images with the ones that are manually segmented by the first observer, used as ground truth.

5.3.2 Pre-processing

For our experiments we only consider the green channel of retinal images. This decision is supported by previous works (Niemeijer et al., 2004; Staal et al., 2004; Mendonca and Campilho, 2006; Soares et al., 2006; Ricci and Perfetti, 2007), which found that the contrast between the vessels and the background is better defined in the green channel. Conversely, the red channel has low contrast and the blue channel shows a small dynamic range. Mendonca and Campilho (2006) assessed the performance of different color representations, such as the green component of the original RGB image, the luminance channel of the National Television Systems Committee (NTSC) color space and the a∗ component of the L∗a∗b∗ representation. They found that the highest contrast between vessels and background is, in general, shown in the green channel of the RGB image.

Due to the strong contrast around the FOV of the retinal images, the pixels close to the circumference might cause the detection of false vessels. Thus, we use the pre-processing algorithm proposed by Soares et al. (2006) to smoothen the strong contrast around the circular border of the FOV area. It uses a region of interest (ROI) determined by the FOV-mask of the retina. For the images in the DRIVE data set we use the provided FOV-mask images to determine the initial ROI. Since the STARE and the CHASE DB1 data sets do not provide the FOV-mask images, we compute them by thresholding2 the luminosity plane of the CIELab3 version of the original RGB image.

Next, we dilate the border with the following iterative procedure. In the first iteration, we consider every black pixel that lies just on the exterior boundary of the FOV-mask. We then replace every such pixel with the mean value of the pixels of its 8-neighbours that are inside the ROI. After the first iteration, the radius of the ROI is increased by 1 pixel. We repeat this procedure 50 (in general m) times, which is sufficient to avoid the false detection of lines around the border of the FOV of the retina.

2The thresholds are 0.5 and 0.1 for the STARE and CHASE DB1 data sets, respectively.
3CIELab is a color space specified by the International Commission on Illumination.



Finally, we enhance the image by using the contrast-limited adaptive histogram equalization (CLAHE) algorithm (Pizer et al., 1987)4. The CLAHE algorithm, which is commonly used as a pre-processing step in the analysis of retinal images (Fadzil et al., 2009; Setiawan et al., 2013), allows the improvement of the local contrast while avoiding the over-amplification of noise in relatively homogeneous regions. In Fig. 5.8 we illustrate all the pre-processing steps.
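The whole chain can be sketched as follows (our own simplified rendition, assuming SciPy and scikit-image; the published pre-processing follows Soares et al. (2006) and may differ in detail):

import numpy as np
from scipy.ndimage import binary_dilation
from skimage.exposure import equalize_adapthist

def preprocess(rgb, fov, m=50):
    green = rgb[:, :, 1].astype(float)
    roi = fov.copy()                              # boolean FOV mask
    for _ in range(m):                            # grow the ROI one pixel at a time
        ring = binary_dilation(roi) & ~roi        # pixels just outside the ROI
        for y, x in zip(*np.nonzero(ring)):
            nb = green[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            nb_in = roi[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            if nb_in.any():                       # mean of 8-neighbours inside ROI
                green[y, x] = nb[nb_in].mean()
        roi |= ring
    return equalize_adapthist(green / green.max())   # CLAHE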

5.3.3 Performance Measurements

For each test image we threshold the responses of a rotation-invariant B-COSFIRE filter by varying the value of the parameter t (in Eq. 5.5) between 0 and 1 in steps of 0.01. This threshold operation divides the pixels into two classes: vessels and non-vessels. Then, we compare every resulting binary image with the corresponding ground truth by computing the following four performance measurements: the pixels that belong to a vessel in the ground truth image and that are classified as vessels are counted as true positives (TP), otherwise they are counted as false negatives (FN). The pixels that belong to the background and that are classified as non-vessels are counted as true negatives (TN), otherwise they are counted as false positives (FP).

In order to compare the performance of the proposed method with other state of the art algorithms, we compute the accuracy (Acc), sensitivity (Se), specificity (Sp) and Matthews correlation coefficient (MCC). These metrics are defined as follows:

Acc = (TP + TN)/N,   Se = TP/(TP + FN),   Sp = TN/(TN + FP),

MCC = (TP/N − S × P) / √(P × S × (1 − S) × (1 − P)),

where N = TN + TP + FN + FP, S = (TP + FN)/N and P = (TP + FP)/N.

The MCC is a measure of the quality of a binary classification. It is suitable even when the number of samples in the two classes varies substantially. As a matter of fact, this is the situation that we have at hand: the non-vessel pixels outnumber the vessel pixels by seven times. The MCC values vary between −1 and +1; the higher the value, the better the prediction.

⁴Other authors (Mendonca and Campilho, 2006; Marin et al., 2011) report that they pre-process the images with background homogenization algorithms based on wide Gaussian kernels, while others (Soares et al., 2006; Fraz et al., 2012) explicitly state that they do not employ any pre-processing step.


Figure 5.8: Step-by-step illustration of the pre-processing techniques applied to retinal fundus images. (a) Original RGB retinal image (of size 565 × 584 pixels), of which only (b) the green channel is considered. Then, (c) we dilate the regions around the FOV circumference of the green channel, and finally (d) we apply contrast-limited adaptive histogram equalization (CLAHE). Here, we show the inverted green channel for better clarity. The images in (f), (g) and (h) illustrate in detail the smoothing effect of the pre-processing algorithm around the border of the FOV area. (e) A mask of the FOV area is automatically computed for each image of the STARE and CHASE DB1 data sets by thresholding the luminosity channel of the CIELab version of the original image.


The higher the value, the better the prediction is. A value of +1 indicates a perfect prediction, 0 indicates a prediction that is equivalent to random, and −1 indicates a completely wrong prediction.
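The four counts and the derived measures can be computed as in this NumPy sketch (ours; the restriction to the pixels inside the FOV anticipates the remark below):

```python
import numpy as np

def segmentation_metrics(binary, truth, fov):
    """Acc, Se, Sp and MCC of a binary vessel map against the ground truth,
    computed only on the pixels inside the FOV."""
    pred, gt = binary[fov].astype(bool), truth[fov].astype(bool)
    tp, tn = np.sum(pred & gt), np.sum(~pred & ~gt)
    fp, fn = np.sum(pred & ~gt), np.sum(~pred & gt)
    n = tp + tn + fp + fn
    s, p = (tp + fn) / n, (tp + fp) / n
    mcc = (tp / n - s * p) / np.sqrt(p * s * (1 - s) * (1 - p))
    return (tp + tn) / n, tp / (tp + fn), tn / (tn + fp), mcc
```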

Furthermore, for comparison purposes, we compute the receiver operating characteristic (ROC) curve, a method that is widely used to evaluate segmentation algorithms applied to retinal images. It is a tool that allows the analysis of the tradeoff between sensitivity and specificity. It is a 2D plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. For every threshold value we compute a 2D point on the ROC curve. The point represents the false positive rate (FPR = 1 − Sp) on the x-axis and the true positive rate (TPR = Se) on the y-axis. The closer a ROC curve approaches the top-left corner, the better the performance of the algorithm is. For a perfect classification the ROC curve passes through the point (0, 1). We consider the area under the ROC curve (AUC), which is equal to 1 for a perfect system, as a single measure to quantify the performance.
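A sketch of the threshold sweep that yields the ROC points and the AUC (trapezoidal integration is our implementation choice):

```python
import numpy as np

def roc_and_auc(response, truth, fov):
    """Vary the threshold t from 0 to 1 in steps of 0.01 and collect the
    (FPR, TPR) points of the ROC curve; the AUC is the area under it."""
    r, gt = response[fov], truth[fov].astype(bool)
    fpr, tpr = [], []
    for t in np.arange(0.0, 1.01, 0.01):
        pred = r > t
        tpr.append(np.sum(pred & gt) / np.sum(gt))    # TPR = Se
        fpr.append(np.sum(pred & ~gt) / np.sum(~gt))  # FPR = 1 - Sp
    order = np.argsort(fpr)
    auc = np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order])
    return np.asarray(fpr), np.asarray(tpr), auc
```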

Since the pixels of the dark background outside the FOV area of the retina are easily detected, we consider only the pixels within the FOV area for the computation of the performance metrics.

5.3.4 Results

We carried out several experiments by considering different sets of parameters for the configuration and application of the B-COSFIRE filters. For the evaluation of the proposed method, we split each data set into evaluation and test subsets. We used the evaluation set to determine the best parameters of the B-COSFIRE filters. For the DRIVE data set we used the training images for evaluation. The other two data sets, STARE and CHASE DB1, contain only one set of images. For each of these two sets we used the first half of the images for evaluation, and then we tested the proposed method on the entire data set.

First, we ran several experiments on the evaluation sets in order to determine the best parameters (ρ, σ, σ0, α) of a symmetric B-COSFIRE filter by conducting a grid search, and chose the set of parameters that gives the highest average MCC value on the training images. We let the value of the parameter ρ increase in intervals of 2 pixels, a step size that was determined empirically. In Table 5.1 we report the results that we achieved using a symmetric B-COSFIRE filter on the test sets, along with the corresponding parameters that were determined in the evaluation phase.
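The grid search can be sketched as follows; evaluate stands for a user-supplied function that returns the average MCC on the evaluation images, and the candidate value grids are illustrative assumptions, not the exact grids used in the experiments:

```python
import itertools
import numpy as np

def grid_search(evaluate):
    """Exhaustive search over (sigma, rho_max, sigma_0, alpha); evaluate()
    must return the average MCC on the evaluation set. rho always grows in
    steps of 2 pixels: rho = {0, 2, ..., rho_max}."""
    best_params, best_score = None, -np.inf
    for sigma, rho_max, sigma0, alpha in itertools.product(
            np.arange(1.5, 5.1, 0.3),    # candidate DoG scales (assumed)
            range(8, 27, 2),             # candidate maximal rho values (assumed)
            (1, 2, 3),                   # candidate sigma_0 values (assumed)
            (0.1, 0.2, 0.4, 0.6, 0.7)):  # candidate alpha values (assumed)
        score = evaluate(sigma, rho_max, sigma0, alpha)
        if score > best_score:
            best_params, best_score = (sigma, rho_max, sigma0, alpha), score
    return best_params, best_score
```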

Second, we trained an asymmetric B-COSFIRE filter on the evaluation sets in order to achieve stronger responses along the vessel endings. We observed that an asymmetric filter achieves higher response values in the proximity of the bar endings, but is less robust to noise than a symmetric B-COSFIRE filter.


Symmetric B-COSFIRE filter

                 DRIVE            STARE            CHASE DB1
Results
  Accuracy       0.9427           0.9467           0.9411
  AUC            0.9571           0.9487           0.9434
  Specificity    0.9707           0.9689           0.9651
  Sensitivity    0.7526           0.7543           0.7257
  MCC            0.7395           0.7188           0.6791
Parameters
  σ              2.4              2.7              4.8
  ρ              {0,2,4,...,8}    {0,2,4,...,12}   {0,2,4,...,18}
  σ0             3                1                3
  α              0.7              0.6              0.2

Table 5.1: (Top) Experimental results obtained by symmetric B-COSFIRE filters, one for each of the three data sets. (Bottom) The corresponding parameters that are automatically determined in an evaluation stage.

Asymmetric B-COSFIRE filter

                 DRIVE            STARE            CHASE DB1
Results
  Accuracy       0.9422           0.9430           0.9270
  AUC            0.9537           0.9536           0.9376
  Specificity    0.9621           0.9742           0.9445
  Sensitivity    0.7499           0.7765           0.7685
  MCC            0.7369           0.7082           0.6433
Parameters
  σ              1.9              2.4              4.7
  ρ              {0,2,4,...,12}   {0,2,4,...,22}   {0,2,4,...,26}
  σ0             2                1                2
  α              0.1              0.1              0.1

Table 5.2: (Top) Experimental results obtained by using only asymmetric B-COSFIRE filters, one for each of the three data sets. (Bottom) The corresponding parameters that are automatically determined in an evaluation stage.

In Table 5.2 we report the results that we achieved using only the asymmetric B-COSFIRE filter, along with its configuration parameters for the three data sets.

Finally, we ran further experiments by summing the responses of symmetric and asymmetric B-COSFIRE filters.



Figure 5.9: ROC curves for the DRIVE (solid line), STARE (dashed line) and CHASE DB1 (dotted line) data sets. The results of the manual segmentation of the second observers are also reported on the graph. The square marker (□) is the performance of the second observer on the DRIVE data set, while the diamond (◊) and the dot markers (•) are the results for the STARE and CHASE DB1 data sets, respectively.

We determined a set of parameters for the asymmetric B-COSFIRE filter that best complements those of the symmetric B-COSFIRE filter. The addition of the asymmetric B-COSFIRE filter contributes a statistically significant improvement in the MCC performance. This is confirmed by a right-tailed paired t-test statistic (DRIVE: t(19) = 3.4791, p < 0.01; STARE: t(19) = 5.9276, p < 10⁻⁵). In Table 5.3 we report the results that we achieved by combining the responses of the two B-COSFIRE filters, along with the corresponding sets of parameters.
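The significance test can be reproduced with SciPy (mcc_combined and mcc_symmetric stand for the 20 per-image MCC values of the two configurations):

```python
from scipy import stats

def right_tailed_paired_ttest(mcc_combined, mcc_symmetric):
    """Right-tailed paired t-test on per-image MCC values: is the combined
    symmetric + asymmetric filter significantly better?"""
    t, p_two_sided = stats.ttest_rel(mcc_combined, mcc_symmetric)
    p = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return t, p
```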

It is worth noting that the value of σ (the standard deviation of the outer Gaussian function of a DoG filter) is specific to each data set. Generally, as defined in (Petkov and Visser, 2005), the σ value is a function of the width L of the vessel of interest: $\sigma = L\sqrt{(1 - \gamma^2)/(-\ln \gamma)}$, where γ is a fraction of the σ value (here we set γ = 0.5). The variation in the image resolutions of the three data sets requires the use of different σ values. The fact that vessels are narrower at the end results in the configuration of an asymmetric filter with a σ value smaller than that of the symmetric filter.
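In code, this width-to-scale relation reads:

```python
import numpy as np

def dog_sigma(width, gamma=0.5):
    """Outer-Gaussian sigma of the DoG as a function of the vessel width L
    (Petkov and Visser, 2005), with gamma = 0.5 as in the text."""
    return width * np.sqrt((1 - gamma**2) / (-np.log(gamma)))
```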


                    DRIVE                          STARE                          CHASE DB1
Results
  Accuracy          0.9442                         0.9497                         0.9387
  AUC               0.9614                         0.9563                         0.9487
  Specificity       0.9704                         0.9701                         0.9587
  Sensitivity       0.7655                         0.7716                         0.7585
  MCC               0.7475                         0.7335                         0.6802

B-COSFIRE filters   Symmetric     Asymmetric       Symmetric       Asymmetric      Symmetric       Asymmetric
Parameters
  σ                 2.4           1.8              2.7             2.1             4.8             4.3
  ρ                 {0,2,4,6,8}   {0,2,4,...,22}   {0,2,4,...,12}  {0,2,4,...,24}  {0,2,4,...,18}  {0,2,4,...,34}
  σ0                3             2                1               1               3               1
  α                 0.7           0.1              0.6             0.1             0.2             0.1

Table 5.3: (Top) Experimental results for the three benchmark data sets obtained by summing up the responses of symmetric and asymmetric B-COSFIRE filters. (Bottom) The corresponding sets of parameter values.


The sensitivity, specificity and accuracy values for each data set are obtained for a specific value of the threshold t, namely the one that contributed to the maximum average MCC value of the corresponding data set. It is computed as follows. For a given test image and threshold t we compute the MCC value. Subsequently, we compute the average of the MCC values of all the images to obtain a single performance measure denoted by $\overline{\mathrm{MCC}}$. Several $\overline{\mathrm{MCC}}$ values are obtained by varying the value of t from 0 to 1 in steps of 0.01. Finally, we choose the threshold value t for a given data set that provides the maximum value of $\overline{\mathrm{MCC}}$.

The ROC curves for the DRIVE, STARE and CHASE DB1 data sets are depicted in Fig. 5.9. For reference purposes, we also show the segmentation results of the second observer when compared to what is considered the gold segmentation standard, which is provided by the first observer.

The left column in Fig. 5.10 shows examples of segmented retinal images, one for each of the three data sets, that are automatically obtained by the proposed approach. The right column shows the corresponding ground truth images that were manually segmented by the first observer of each data set.

5.4 Discussion

The method that we propose is a trainable filter approach that we apply in an unsupervised way. The performance results that we achieve on the DRIVE, STARE and CHASE DB1 benchmark data sets are better than those of many state-of-the-art unsupervised and supervised algorithms (Tables 5.4, 5.5 and 5.6). We evaluate the performance of the proposed method using the MCC value because it is a measure of the quality of a binary classification that is suitable when the samples in the two classes are unbalanced. For the sake of comparison, we move along the ROC curves in order to evaluate the performance of the B-COSFIRE approach with respect to the best results achieved by other unsupervised methods. For the DRIVE data set and for the same specificity (Sp = 0.9764) reported by Mendonca and Campilho (2006), we achieve a sensitivity of 0.7376, which is marginally better. Similarly, for the STARE data set and for the same specificities reported by Mendonca and Campilho (2006) and Al-Diri et al. (2009) (Sp = 0.9730 and Sp = 0.9681), we achieve sensitivities of 0.7554 and 0.7848, respectively, which is a significantly better result. We also achieve the best AUC value for the DRIVE data set with respect to all other unsupervised approaches. As to the CHASE DB1 data set, there are no other state-of-the-art unsupervised approaches to which we can compare.

Figure 5.10: Examples of segmented vessel trees that are automatically obtained by the proposed B-COSFIRE filter approach from images taken from the (a) DRIVE (Se = 0.8746, Sp = 0.9662 and MCC = 0.8005), (b) STARE (Se = 0.8575, Sp = 0.9754 and MCC = 0.8097) and (c) CHASE DB1 (Se = 0.8360, Sp = 0.9548 and MCC = 0.7180) data sets. Images in (d), (e) and (f) show the corresponding manually segmented images by the first observer.

While our method is unsupervised, and it is thus more appropriate to compare it to other unsupervised methods, we observe here that its performance is comparable to that of some supervised methods and slightly lower than that of others. Supervised methods are based on machine learning techniques and typically require high-dimensional pixel-wise feature vectors to train a classifier that discriminates vessel from non-vessel pixels. For instance, Fraz et al. (2012) characterize every pixel by a feature vector of nine elements, obtained by various filtering and morphological operations. Time complexity grows with an increasing number of dimensions, as more operations need to be performed. In our approach we characterize every pixel with only two values, one from a vessel-selective filter and one from a vessel-ending-selective filter, and combine them by summation. This results in a very efficient way of processing a retinal image. It achieves performance results that are close to those of the supervised methods, which require high-dimensional feature vectors followed by an extensive training algorithm to learn the best separation margin in the feature space. For the DRIVE data set and for the same specificity reported by Marin et al. (2011) (Sp = 0.9801), we achieve a sensitivity of 0.7040, which is comparable with the one (0.7067) that they report. For the STARE data set and for the same specificities reported by Marin et al. (2011) and Fraz et al. (2012), we achieve comparable sensitivities of 0.6908 and 0.7393, respectively. The ROC curves in Fig. 5.9 indicate that the performance of our method on the DRIVE and STARE data sets is comparable to that of the second human observer. For the CHASE DB1 data set, however, the second human observer achieves a performance that is better than that of our method.

DRIVE

Method                              Se       Sp       AUC      Acc
Unsupervised
  B-COSFIRE                         0.7655   0.9704   0.9614   0.9442
  Chauduri et al. (1989)            -        -        0.7878   0.8773
  Mendonca and Campilho (2006)      0.7344   0.9764   -        0.9463
  Martinez-Perez et al. (2007)      0.7246   0.9655   -        0.9344
  Al-Rawi et al. (2007)             -        -        0.9435   0.9535
  Ricci and Perfetti (2007)         -        -        0.9558   0.9563
  Al-Diri et al. (2009)             0.7282   0.9551   -        -
  Cinsdikici and Aydin (2009)       -        -        0.9407   0.9293
  Lam et al. (2010)                 -        -        0.9614   0.9472
Supervised
  Niemeijer et al. (2004)           -        -        0.9294   0.9416
  Staal et al. (2004)               -        -        0.9520   0.9441
  Soares et al. (2006)              0.7332   0.9782   0.9614   0.9466
  Ricci and Perfetti (2007)         -        -        0.9633   0.9595
  Marin et al. (2011)               0.7067   0.9801   0.9588   0.9452
  Fraz et al. (2012)                0.7406   0.9807   0.9747   0.9480

Table 5.4: Performance results of the proposed unsupervised B-COSFIRE filter approach on the DRIVE data set compared to other methods.

The B-COSFIRE filter that we propose is versatile, as its selectivity is not predefined in the implementation but rather is determined from a given prototype pattern in an automatic configuration process. As a matter of fact, here we demonstrated the configuration of two kinds of B-COSFIRE filters, namely symmetric and asymmetric, that give strong responses along vessels and at vessel endings, respectively. The configuration of these two types of filters was achieved by using two different prototype patterns. In both cases we used prototype bar patterns of constant width, and consequently the filters resulted in sets of tuples with the same value of the parameter σ.


STARE

Method                              Se       Sp       AUC      Acc
Unsupervised
  B-COSFIRE                         0.7716   0.9701   0.9563   0.9497
  Hoover et al. (2000)              0.6747   0.9565   0.7590   0.9275
  Jiang and Mojon (2003)            -        -        0.9298   0.9009
  Mendonca and Campilho (2006)      0.6996   0.9730   -        0.9479
  Martinez-Perez et al. (2007)      0.7506   0.9569   -        0.9410
  Al-Rawi et al. (2007)             -        -        0.9467   0.9090
  Ricci and Perfetti (2007)         -        -        0.9602   0.9584
  Al-Diri et al. (2009)             0.7521   0.9681   -        -
  Lam et al. (2010)                 -        -        0.9739   0.9567
Supervised
  Staal et al. (2004)               -        -        0.9614   0.9516
  Soares et al. (2006)              0.7207   0.9747   0.9671   0.9480
  Ricci and Perfetti (2007)         -        -        0.9680   0.9646
  Marin et al. (2011)               0.6944   0.9819   0.9769   0.9526
  Fraz et al. (2012)                0.7548   0.9763   0.9768   0.9534

Table 5.5: Performance results of the proposed unsupervised B-COSFIRE filter approach on the STARE data set compared to other methods.

CHASE DB1

Method                              Se       Sp       AUC      Acc
Unsup.  B-COSFIRE                   0.7585   0.9587   0.9487   0.9387
Sup.    Fraz et al. (2012)          0.7224   0.9711   0.9712   0.9469

Table 5.6: Performance results of the proposed unsupervised (unsup.) B-COSFIRE filter approach on the CHASE DB1 data set compared to the only other existing supervised (sup.) method.

One may, however, use tapered bar structures as prototype patterns, which would result in tuples with different σ values. In principle, one can also configure B-COSFIRE filters that are selective for more complicated structures, such as bifurcations or crossovers. The B-COSFIRE filters used in this work are selective for elongated structures and achieve strong responses along the vessels and their endings, but show slightly lower responses at bifurcations and crossovers. One may configure other COSFIRE filters selective for various types of bifurcations and crossovers and combine their responses with those of the vessel- and vessel-ending-selective filters that we propose.

The B-COSFIRE filters can also be employed to detect vessels in other medical images, such as mammography and computed tomography images. The proposed segmentation approach may also be used in non-medical applications that contain vessel-like structures, such as palmprint segmentation for biometric systems (Kong et al., 2009).

We aim to extend this work in various aspects. One direction for further investigation will focus on the configuration of a set of COSFIRE filters selective for different patterns, such as vessels, bifurcations and crossovers at different space-scales. The responses of such filters can be used to form a pixel-wise descriptor and then to train a supervised classifier in order to discriminate vessel from non-vessel pixels. This will allow us to perform a thorough analysis with respect to tiny vessels, vessels around the optic disc, and the differentiation of vessels from abnormalities. Another direction for future work is to consider the depth dimension and configure COSFIRE filters for 3D vessel structures, which can be applied, for instance, to detect the blood vessels in angiography images of the brain.

The configuration and application of a B-COSFIRE filter is conceptually simple and easy to implement: it involves convolution with DoG filters, blurring the DoG responses, shifting the blurred responses towards the center of the concerned B-COSFIRE filter and computing a point-wise weighted geometric mean. The center-on DoG filters that we used in this work are not intrinsic to the method. Elsewhere we demonstrated that COSFIRE filters can be configured with Gabor filters for the detection of patterns characterized by multiple lines/edges of different orientations (Azzopardi and Petkov, 2013b). In particular, Azzopardi and Petkov (2013a) demonstrated that COSFIRE filters can be used to detect vascular bifurcations in retinal fundus images. Moreover, we showed that by using a collection of center-off and center-on DoG filters we can effectively configure a contour operator that we call CORF (Azzopardi and Petkov, 2012).

The performance of a B-COSFIRE filter is affected by the values of the parameters (σ0, σ, α) that are automatically selected in the configuration stage. In Table 5.7 we report a one-at-a-time sensitivity analysis, which shows that the parameter σ0 is the least sensitive, followed by σ and then by α.

The application of a rotation-invariant B-COSFIRE filter is very efficient. A B-COSFIRE filter is defined as a set of 3-tuples and each tuple requires the blurring and shifting of a DoG filter response at a certain position. The computation of one blurred and shifted response (for the same values of σ and ρ), for instance $s_{\sigma,\rho,\phi=0}(x, y)$, is sufficient: the result of $s_{\sigma,\rho,\phi}(x, y)$ for any value of φ can be obtained from the output of $s_{\sigma,\rho,\phi=0}(x, y)$ by appropriate shifting. Therefore, the number of computations required depends on the number of unique combinations of the values of the parameters σ and ρ. In practice, the Matlab implementation⁵ that we used for our experiments takes less than 10 seconds to process each image from the DRIVE (565 × 584 pixels) and STARE (700 × 605 pixels) data sets and less than 25 seconds to process an image from the CHASE DB1 data set (1280 × 960 pixels), on a personal computer equipped with a 2 GHz processor. The reported times include the processing of both the symmetric and asymmetric B-COSFIRE filters applied to a single image. Note that the implementation that we use here is based on sequential processing. The B-COSFIRE approach can, however, be implemented in parallel, such that the blurring and shifting operations for different pairs of (σ, ρ) are processed on multiple processors simultaneously. It is worth noting that on images taken from the DRIVE and STARE data sets our implementation is significantly less time-consuming than other approaches tested on similar hardware, as reported in Table 5.8.

The proposed B-COSFIRE approach differs mainly in the following three aspects from other unsupervised approaches. First, a B-COSFIRE filter is trainable, as its selectivity is not predefined in the implementation but rather is determined from a given prototype pattern in an automatic configuration process. Other methods use a predefined set of kernels to filter the input image. Second, a B-COSFIRE filter achieves rotation invariance in a very efficient way. It only requires the appropriate shifting of the blurred DoG responses followed by the weighted geometric mean. On the contrary, other methods (Chauduri et al., 1989; Hoover et al., 2000; Al-Rawi et al., 2007; Ricci and Perfetti, 2007) achieve rotation invariance by convolving the input image with rotated versions of the original kernel. For those methods the number of convolutions depends on the number of orientations for which the operator is applied. Third, the weighted geometric mean that we use is a nonlinear function which produces a response only when all the filter-defining points of interest are present. This type of function is more robust to noise than other methods (Hoover et al., 2000; Ricci and Perfetti, 2007) that rely on weighted summations (convolutions).

⁵The Matlab implementation is available at http://matlabserver.cs.rug.nl/

offset       -0.5   -0.4   -0.3   -0.2   -0.1   0   0.1    0.2    0.3    0.4    0.5

σ0  t(19)    -0.34  -0.06   0.26   0.30   0.33  -   -0.35  -0.96  -1.36  -1.68  -1.87
σ   t(19)    -2.24  -1.94  -1.42  -0.84  -0.43  -   -0.24  -0.63  -1.26  -1.66  -2.35
α   t(19)    -4.71  -3.28  -2.41  -1.34  -0.88  -   -0.07  -0.78  -1.55  -1.80  -2.72

Table 5.7: Sensitivity analysis of the hyper-parameters σ0, σ, and α using a symmetric B-COSFIRE filter on the 20 test images of the DRIVE data set. The hyper-parameters are analyzed one at a time by changing their optimal values in offset intervals of 0.1 in both directions. For each combination, a two-tailed paired t-test statistic is performed to compare the resulting 20 MCC values with those obtained by the optimal parameters (i.e. offset of 0). The t-values rendered in bold in the original indicate statistical significance with p < 0.05.


Method                              Processing time

B-COSFIRE                           10 seconds
Jiang and Mojon (2003)              20 seconds
Staal et al. (2004)                 15 minutes
Mendonca and Campilho (2006)        2.5 minutes
Soares et al. (2006)                3 minutes
Al-Diri et al. (2009)               11 minutes
Lam et al. (2010)                   13 minutes
Marin et al. (2011)                 1.5 minutes
Fraz et al. (2012)                  2 minutes

Table 5.8: Comparative analysis of the methods in terms of the time required to process a single image from the DRIVE and STARE data sets.


In Fig. 5.11, we show how the response of a B-COSFIRE filter that is computed as the weighted geometric mean of the afferent DoG responses is less affected by noise than the response of a B-COSFIRE filter with the same support structure that uses the weighted arithmetic mean. We compute the signal-to-noise ratio as SNR = μ_sig/σ_bg, where μ_sig and σ_bg are the average signal value and the background standard deviation, respectively. For a rather straight vessel (Fig. 5.11f) and using the weighted geometric mean, the filter achieves an SNR of 8.35, which is higher than that achieved by using the weighted arithmetic mean (SNR = 7.02). The difference is observed also for highly tortuous vessels (Fig. 5.11c), for which an SNR of 7.09 is achieved by the weighted geometric mean, again higher than its counterpart (SNR = 6.47).
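The SNR computation is a two-line sketch (the vessel and background masks, which delimit the two regions, are assumed given):

```python
import numpy as np

def snr(response, vessel_mask, background_mask):
    """SNR = mu_sig / sigma_bg: mean response on the vessel pixels over the
    standard deviation of the response on the background pixels."""
    return response[vessel_mask].mean() / response[background_mask].std()
```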

5.5 Conclusions

The results that we achieve on three benchmark data sets, DRIVE (Se = 0.7655, Sp = 0.9704), STARE (Se = 0.7716, Sp = 0.9701) and CHASE DB1 (Se = 0.7585, Sp = 0.9587), are higher than those of many of the state-of-the-art methods. The high effectiveness of the approach that we propose is coupled with high efficiency: the proposed method is the most time-efficient algorithm for blood vessel segmentation in retinal fundus images published so far.

Figure 5.11: The use of the weighted geometric mean exhibits an improvement of the signal-to-noise ratio with respect to filters based on weighted summation. For two regions of a retinal image (a) that contain a tortuous (c) and a straight vessel (f), respectively, we show the improvement of the SNR by using the weighted geometric mean (d, g) with respect to the weighted summation (e, h).


The B-COSFIRE filter is versatile, as it can be configured, in an automatic process, to detect any given vessel-like pattern. As its response is computed as the weighted geometric mean of the responses of collinearly aligned DoG filters, the B-COSFIRE filter shows higher robustness to noise than methods based on weighted summation or convolution (template matching). Besides vessel segmentation, it can also be used to detect features of interest, such as vascular bifurcations and crossovers.


Submitted as:

Nicola Strisciuglio, George Azzopardi, Mario Vento, Nicolai Petkov, “Supervised vessel delineation in retinal fundus images with the automatic selection of B-COSFIRE filters”, accepted for publication in Machine Vision and Applications, 2016

Chapter 6

Automatic selection of an optimal set ofB-COSFIRE filters

Abstract

The inspection of retinal fundus images allows medical doctors to diagnose various pathologies. Computer aided diagnosis systems can be used to assist in this process. As a first step, such systems delineate the vessel tree from the background. We propose a method for the delineation of blood vessels in retinal images that is effective for vessels of different thickness. In the proposed method we employ a set of B-COSFIRE filters selective for vessels and vessel-endings. Such a set is determined in an automatic selection process and can adapt to different applications. We compare the performance of different selection methods based upon machine learning and information theory. The results that we achieve by performing experiments on two public benchmark data sets, namely DRIVE and STARE, demonstrate the effectiveness of the proposed approach.

6.1 Introduction

Retinal fundus imaging is a non-invasive tool that is widely employed by medical experts to diagnose various pathologies, such as glaucoma, age-related macular degeneration, diabetic retinopathy and atherosclerosis. There is also evidence that such images may contain signs of non-eye-related pathologies, including cardiovascular (Liew G, 2008) and systemic diseases (Sree and Rao, 2014). Fig. 6.1 shows examples of two retinal fundus images and their corresponding manually segmented vessel trees. In the last years, particular attention has been given by medical communities to the early diagnosis and monitoring of diabetic retinopathy, since it is one of the principal causes of blindness in the world (Abramoff, Garvin and Sonka, 2010).

The manual inspection of retinal fundus images requires highly skilled people, which results in an expensive and time-consuming process. Thus, the mass screening of a population is not feasible without the use of computer aided diagnosis systems.


Figure 6.1: Examples of fundus images of (a) healthy and (b) unhealthy retinas, together with (c, d) the corresponding manually segmented vessels, taken from the STARE data set (Hoover et al., 2000).

Such systems could be used to refer to medical experts only the patients with suspicious signs of disease (Abramoff, Garvin and Sonka, 2010; Abramoff, Niemeijer and Russell, 2010). In this way, the medical professionals could focus on the most problematic cases and on the treatment of the pathologies.

The automatic segmentation of the blood vessel tree in retinal images is a basic step before further processing and the formulation of diagnostic hypotheses. This means that the quality of the vessel segmentation influences the reliability of the subsequent diagnostic steps. It is, therefore, of utmost importance to obtain accurate measurements of the geometrical structure of the vessels. After segmenting the vessel tree, it is common for many methodologies to detect candidate lesions and then to classify them as healthy or not. The better the segmentation, the fewer false candidate lesions will be detected.

In the last years, this challenge has attracted wide interest from many image processing and pattern recognition researchers. Existing methods can generally be divided into two groups: unsupervised methods are based on filtering, vessel tracking techniques or modeling, while supervised methods train binary classification models using pixel-wise feature vectors.



In the unsupervised approaches, mathematical morphology techniques are used in combination with a-priori knowledge about the vessel structure (Zana and Klein, 2001; Mendonca and Campilho, 2006) or with curvature analysis (Fang et al., 2003). Vessel tracking-based methods start from an automatically or manually chosen set of points and segment the vessels by following their center-lines (Liu and Sun, 1993; Zhou et al., 1994; Chutatape et al., 1998; Bekkers et al., 2014). Methods based on matched filtering techniques, instead, assume that the profile of vessels can be modeled with a 2-dimensional Gaussian kernel (Chauduri et al., 1989; Hoover et al., 2000; Al-Rawi et al., 2007), also in combination with an orientation score (Zhang et al., 2015). Martinez-Perez et al. (2007) exploited information about the size, orientation and width of the vessels in a region-growing procedure. A model of the vessels based on their concavity, built by using a differentiable concavity measure, was proposed by Lam et al. (2010). In previous works (Azzopardi et al., 2015; Strisciuglio et al., 2015b), we introduced trainable filters selective for vessels and vessel-endings. We demonstrated that by combining their responses we could build an effective unsupervised delineation technique. A method for the construction of an orientation map of the vessels was proposed by Frucci et al. (2015). The information about the topology of the vessels was used in a graph-based approach (Chen et al., 2015).

On the other hand, supervised methods are based on computing pixel-wise feature vectors and using them to train a classification model that can distinguish between vessel and non-vessel pixels. Different types of features have been studied in combination with various classification techniques. A k-NN classifier was used in combination with the responses of multi-scale Gaussian filters or ridge detectors by Niemeijer et al. (2004) and Staal et al. (2004), respectively. Multi-scale features, computed by means of Gabor wavelets, were also used to train a Bayesian classifier by Soares et al. (2006). A feature vector composed of the response of a line operator, together with information about the green channel and the line width, was proposed by Ricci and Perfetti (2007) and used to train a support vector machine (SVM) classifier. Marin et al. (2011) applied a multilayer neural network to classify pixels based on moment-invariant features. An ensemble of bagged and boosted decision trees was employed by Fraz et al. (2012).

Generally, unsupervised approaches are very efficient, but at the expense of lower effectiveness when compared to their supervised counterparts. Supervised methods, although well-performing, require a thorough feature-engineering step based upon domain knowledge. The sets of features, indeed, are built with the purpose of overcoming specific problems of retinal fundus images, such as the presence of red or bright lesions and luminosity variations, among others.


For instance, multi-scale Gabor filters can be used to eliminate red lesions (Soares et al., 2006), while morphological transformations can be used to reduce the effects of bright lesions in the segmentation task (Fraz et al., 2012). Such methods, however, are suited to the processing of specific kinds of images and cannot be easily applied to delineate elongated structures in other applications (e.g. river segmentation in aerial images (Zhang et al., 2009) or wall-crack detection (Muduli and Pati, 2013)).

We propose to address the problem of segmenting elongated structures, such as blood vessels in retinal fundus images, by using a set of B-COSFIRE filters of the type proposed by Azzopardi et al. (2015), selective for vessels of various thickness. The B-COSFIRE filter approach was originally proposed for the delineation of retinal vessels. Such filters were also employed within a pipeline for the analysis of computed tomography angiography (CTA) images (Zhu et al., 2015). This demonstrates their suitability for various applications. Two B-COSFIRE filters, one specific for the detection of vessels and the other for the detection of vessel-endings, were combined together by simply summing up their responses (Azzopardi et al., 2015). The parameters of the vessel-ending filter were chosen in such a way as to maximize the performance of the two filters. This implies a dependence of the configuration of the vessel-ending detector upon the vessel detector. Moreover, the configuration parameters of each filter were chosen in order to perform best on the most common thickness among all vessels.

In this chapter, we propose to determine a subset of B-COSFIRE filters, selective for vessels of different thickness, by means of information theory and machine learning. We compare the performance achieved by the system with different feature selection methods, including Generalized Matrix Learning Vector Quantization (GMLVQ) (Schneider et al., 2009a), class entropy and a genetic algorithm.

The rest of this chapter is organized as follows. In Section 6.2 we present the B-COSFIRE filters and the feature selection procedure. In Section 6.3 we introduce the data sets and the tools that we use for the experiments, while in Section 6.4 we report the experimental results. After providing a comparison of the achieved results with those of existing methods and a discussion in Section 6.5, we draw conclusions in Section 6.6.

6.2 Method

The main idea is to configure a bank of B-COSFIRE filters and to employ information theory and machine learning techniques to determine a subset of filters that maximizes the performance in the segmentation task. We consider approaches that take into account the contribution of each feature individually, as well as approaches that also evaluate their combined contribution.



6.2.1 B-COSFIRE filters

B-COSFIRE filters are trainable and in (Azzopardi et al., 2015) they were configured to be selective for bar-like structures. Such a filter takes as input the responses of a Difference-of-Gaussians (DoG) filter at certain positions with respect to the center of its area of support. The term trainable refers to the ability to determine these positions in an automatic configuration process by using a prototypical vessel or vessel-ending. Figure 6.2a shows a synthetic horizontal bar, which we use as a prototypical vessel to configure a B-COSFIRE filter.

For the configuration, we first convolve (the convolution is denoted by $\star$) an input image $I$ with a DoG function of a given standard deviation¹ σ:

$$c_\sigma \stackrel{\mathrm{def}}{=} \left| I \star \mathrm{DoG}_\sigma \right|^{+} \qquad (6.1)$$

where $|\cdot|^{+}$ denotes half-wave rectification². In Figure 6.2b, we show the response image of a DoG filter with σ = 2.5 applied to the prototype in Figure 6.2a. We then consider the DoG responses along concentric circles around a given point of interest, and select from them the ones that have local maximum values (Figure 6.2c). We describe each point i by three parameters: the standard deviation σᵢ of the DoG filter, and the polar coordinates (ρᵢ, φᵢ) at which we consider its response with respect to the center. We form a set $S = \{(\sigma_i, \rho_i, \phi_i) \mid i = 1, \ldots, n\}$ that defines a B-COSFIRE filter with a selectivity preference for the given prototype. The value of n represents the number of configured tuples.
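Equation (6.1) can be sketched with SciPy as follows (a center-on DoG with the inner Gaussian at 0.5σ, as stated in footnote 1; the response is half-wave rectified):

```python
import numpy as np
from scipy import ndimage

def dog_response(image, sigma):
    """Half-wave rectified response to a center-on DoG of scale sigma
    (Eq. 6.1); the inner Gaussian has standard deviation 0.5 * sigma."""
    img = np.asarray(image, dtype=float)
    dog = (ndimage.gaussian_filter(img, 0.5 * sigma)
           - ndimage.gaussian_filter(img, sigma))
    return np.maximum(dog, 0.0)  # half-wave rectification |.|+
```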

For the application of the resulting filter, we first convolve an input image with a DoG function that has the standard deviation specified in the tuples of the set S. Then, we blur the DoG responses in order to allow for some tolerance in the preferred positions of the concerned points. The blurring operation takes the maximum DoG response in a local neighbourhood, weighted by a Gaussian function $G_{\sigma'}(x', y')$, whose standard deviation σ′ is a linear function of the distance ρᵢ from the support center of the filter: σ′ = σ′₀ + αρᵢ (Figure 6.2d). The values of σ′₀ and α are constants and we tune them according to the application.

We then shift every blurred DoG response by a vector of length ρᵢ in the direction towards the center of the area of support, that is, along the angle complementary to φᵢ. The concerned shift vector is (Δxᵢ, Δyᵢ), where Δxᵢ = −ρᵢ cos φᵢ and Δyᵢ = −ρᵢ sin φᵢ.

¹The standard deviation of the inner Gaussian function is 0.5σ.
²Half-wave rectification is an operation that suppresses (sets to 0) the negative values.


Figure 6.2: Example of the configuration of a B-COSFIRE filter using (a) a horizontal synthetic prototype vessel. We compute (b) the corresponding DoG filter response image and select (c) the local maxima of the DoG responses along concentric circles around a point of interest (identified by the cross marker in the center). (d) A sketch of the resulting filter: the sizes of the blobs correspond to the standard deviations of the Gaussian blurring functions.

We define the blurred and shifted DoG response for the tuple $(\sigma_i, \rho_i, \phi_i)$ as:

$$s_{\sigma_i,\rho_i,\phi_i}(x, y) = \max_{x',y'} \left\{ c_{\sigma_i}(x - \Delta x_i - x', y - \Delta y_i - y')\, G_{\sigma'}(x', y') \right\} \qquad (6.2)$$
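A direct, unoptimized rendering of Eq. (6.2): a Gaussian-weighted local maximum followed by the shift towards the filter center. This is a sketch under the stated definitions, not the original implementation:

```python
import numpy as np

def blur_and_shift(c_sigma, rho, phi, sigma0, alpha):
    """s_{sigma,rho,phi} of Eq. (6.2): weighted local maximum with a Gaussian
    of std sigma' = sigma0 + alpha * rho, then a shift by rho towards the
    filter center (Delta x = -rho cos(phi), Delta y = -rho sin(phi))."""
    c_sigma = np.asarray(c_sigma, dtype=float)
    sigma_p = sigma0 + alpha * rho
    radius = int(np.ceil(3 * sigma_p))
    blurred = np.zeros_like(c_sigma)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            w = np.exp(-(dx**2 + dy**2) / (2 * sigma_p**2))
            shifted = np.roll(np.roll(c_sigma, dy, axis=0), dx, axis=1)
            blurred = np.maximum(blurred, w * shifted)  # weighted local max
    dx_i = int(round(-rho * np.cos(phi)))
    dy_i = int(round(-rho * np.sin(phi)))
    return np.roll(np.roll(blurred, dy_i, axis=0), dx_i, axis=1)
```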

We denote by $r_S(x, y)$ the response of a B-COSFIRE filter, obtained by combining the involved blurred and shifted DoG responses by geometric mean:

$$r_S(x, y) \stackrel{\mathrm{def}}{=} \left( \prod_{i=1}^{|S|} s_{\sigma_i,\rho_i,\phi_i}(x, y) \right)^{1/|S|} \qquad (6.3)$$
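Combining the tuples into the filter response of Eq. (6.3), reusing the dog_response and blur_and_shift sketches above:

```python
import numpy as np

def bcosfire_response(image, tuples, sigma0, alpha):
    """r_S of Eq. (6.3): geometric mean of the blurred and shifted DoG
    responses of all tuples (sigma_i, rho_i, phi_i) in the set S."""
    maps = [blur_and_shift(dog_response(image, s), r, p, sigma0, alpha)
            for (s, r, p) in tuples]
    return np.prod(np.stack(maps), axis=0) ** (1.0 / len(maps))
```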


The procedure described above configures a B-COSFIRE filter that is selective for horizontally-oriented vessels. In order to achieve multi-orientation selectivity, one can configure a number of B-COSFIRE filters by using prototype patterns in different orientations. Alternatively, we manipulate the parameter φ of each tuple and create a new set $R_\psi(S) = \{(\sigma_i, \rho_i, \phi_i + \psi) \mid i = 1, \ldots, n\}$ that represents a B-COSFIRE filter with an orientation preference offset by ψ radians from that of the original filter S. We achieve a rotation-tolerant response in a location (x, y) by taking the maximum response of a group of B-COSFIRE filters with different orientation preferences:

$$r_S(x, y) \stackrel{\mathrm{def}}{=} \max_{\psi \in \Psi} \left\{ r_{R_\psi(S)}(x, y) \right\} \qquad (6.4)$$

where $\Psi = \{0, \frac{\pi}{12}, \frac{\pi}{6}, \ldots, \frac{11\pi}{12}\}$.
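Rotation tolerance (Eq. 6.4) then amounts to a pixel-wise maximum over the twelve rotated tuple sets:

```python
import numpy as np

def rotation_tolerant_response(image, tuples, sigma0, alpha, n_orient=12):
    """Eq. (6.4): maximum over the responses of R_psi(S) for
    psi in {0, pi/12, ..., 11*pi/12}."""
    out = None
    for k in range(n_orient):
        psi = k * np.pi / n_orient
        rotated = [(s, r, p + psi) for (s, r, p) in tuples]
        resp = bcosfire_response(image, rotated, sigma0, alpha)
        out = resp if out is None else np.maximum(out, resp)
    return out
```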

6.2.2 A bank of B-COSFIRE filters

The thickness of the vessels in retinal fundus images may vary from 1 pixel to a number of pixels that depends on the resolution of the input images. For this reason, we configure a large bank of B-COSFIRE filters consisting of 21 vessel detectors $\{S_1, \ldots, S_{21}\}$ and 21 vessel-ending detectors $\{S_{22}, \ldots, S_{42}\}$, which are selective for vessels of different thickness.

In Figure 6.3 we show the response images of B-COSFIRE filters that are selective for (left column) vessels and (right column) vessel-endings. In particular, we configure filters selective for thin (second row), medium (third row) and thick (fourth row) vessels. It is noticeable that the large-scale filters are selective for thick vessels (Figure 6.3g and Figure 6.3h) and are robust to background noise, but achieve low responses along thin vessels. Conversely, the small-scale filters (Figure 6.3c and Figure 6.3d) show higher selectivity for thin vessels but are less robust to background noise. The combination of their responses promises to achieve better delineation performance at various scales (Strisciuglio et al., 2015a).

We construct a pixel-wise feature vector $v(x, y)$ for every image location $(x, y)$ with the responses of the 42 B-COSFIRE filters in the filterbank, plus the intensity value $g(x, y)$ of the green channel of the RGB retinal image:

$$v(x, y) = \left[ g(x, y), r_1(x, y), \ldots, r_{42}(x, y) \right]^T \qquad (6.5)$$

where $r_i(x, y)$ is the rotation-tolerant response of the B-COSFIRE filter $S_i$. The inclusion of the intensity value of the green channel is suggested by many existing approaches (Staal et al., 2004; Soares et al., 2006; Ricci and Perfetti, 2007; Fraz et al., 2012; Strisciuglio et al., 2015a).


Figure 6.3: Response images obtained by B-COSFIRE filters that are selective for (left column) vessels and (right column) vessel-endings of different thickness. We consider filters selective for thin (c-d), medium (e-f) and thick (g-h) vessels.


6.2.3 Feature transformation and rescaling

Before classification, we apply the inverse hyperbolic sine transformation function (Johnson, 1949) to each element of the feature vector. It reduces the skewness in the data and is defined as:

$$f(v_i, \theta) = \frac{\sinh^{-1}(\theta v_i)}{\theta} \qquad (6.6)$$

For large values of $v_i$ and θ > 0, the function behaves like a log transformation³. As θ → 0, $f(v_i, \theta) \rightarrow v_i$. We then compute the Z-score to standardize each of the 43 features. As suggested by Ricci and Perfetti (2007), we apply the Z-score normalization procedure separately to each image in order to compensate for illumination variations between the images.
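Both steps in a short sketch (NumPy's arcsinh implements the inverse hyperbolic sine; the Z-score is applied per image):

```python
import numpy as np

def transform_and_rescale(features, theta=1000.0):
    """Eq. (6.6) followed by a Z-score; theta = 1000 as determined on the
    training sets (footnote 3). features is the per-image N x 43 matrix."""
    f = np.arcsinh(theta * features) / theta
    return (f - f.mean(axis=0)) / f.std(axis=0)
```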

6.2.4 Automatic subset selection of B-COSFIRE filters

The filterbank that we designed in the previous section is overcomplete and might contain many redundant filters. We investigate various feature selection approaches to determine the smallest subset of features that maximizes the performance of the vessel tree delineation. We use as input the training data, which consists of a matrix of size N × 43, where N corresponds to the number of randomly selected pixels (half of them vessel pixels, the other half non-vessel pixels) from the training images, and the number of columns corresponds to the size of the filterbank plus the green channel.

Entropy score ranking

Entropy characterizes the uncertainty about a source of information. The rarer a response in a specific range is, the more information it provides when it occurs. We use a filter approach that computes the entropy E of each of the 43 features:

$$E = \sum_{i=1}^{n} \sum_{j=1}^{c} P\!\left(y = j \mid x = \tfrac{i}{20}\right) \log P\!\left(y = j \mid x = \tfrac{i}{20}\right) \qquad (6.7)$$

where y is the class label (vessel or non-vessel), c is the number of classes (in this case c = 2), x is a vector of quantized feature values rounded to the nearest 0.05 increment and n = 20. Before computing the entropy we first rescale and shift the Z-scored values to the range [0, 1], such that the minimum value becomes 0 and the maximum value becomes 1.
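A sketch of the entropy score of a single feature, following Eq. (6.7); the exact binning details are our assumption:

```python
import numpy as np

def entropy_score(feature, labels, n=20):
    """Eq. (6.7) for one feature already rescaled to [0, 1]: quantize to
    multiples of 1/n and accumulate p*log(p) of the class posteriors."""
    levels = np.round(feature * n).astype(int)
    e = 0.0
    for i in range(n + 1):
        y = labels[levels == i]
        if y.size == 0:
            continue  # no samples at this quantization level
        for j in np.unique(labels):
            p = np.mean(y == j)
            if p > 0:
                e += p * np.log(p)
    return e
```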

³The value of θ was experimentally determined on a training set (40000 feature vectors) and set to 1000 for both the DRIVE and STARE data sets.


We rank the 43 features using the reciprocals of their corresponding entropy values, and select the k highest-ranked features that contribute to the maximum accuracy on the training set.

Genetic algorithm

The nature-inspired genetic algorithms are a family of search heuristics that can be used to solve optimization problems (Goldberg, 1989; Matouš et al., 2000). We use a genetic algorithm to search for the best-performing subset of features among the enormous number of possible combinations. We initialize a population of 400 chromosomes, each with 43 random bits. The positions of the one-bits indicate the columns (i.e. the green channel and the 42 B-COSFIRE filters) to be considered in the given matrix.

The fitness function computes the average accuracy in a 10-fold cross-validation on the training data with the selected columns. In each fold we configure an SVM classifier with a linear kernel by using 90% of the training set and apply it to the remaining 10%. After every epoch we sort the chromosomes in descending order of their fitness scores and keep only the top 40 (i.e. 10%) of the population. We use this elite group of chromosomes to generate 360 offspring chromosomes by applying a crossover operation to randomly selected pairs of elite chromosomes. Every bit of the newly generated chromosomes has a probability of 10% of being mutated (i.e. changing the bit from 1 to 0 or from 0 to 1). We run these iterative steps until the elite group of chromosomes stops changing.

Finally, we choose the filters that correspond to the positions of the one-bits in the chromosome with the highest fitness score and the minimum number of one-bits.
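The search loop can be sketched as follows; fitness stands for the 10-fold cross-validated accuracy of a linear SVM on the columns selected by a chromosome, and the tie-break on the minimum number of one-bits is omitted for brevity:

```python
import numpy as np

def genetic_selection(fitness, n_bits=43, pop_size=400, n_elite=40,
                      p_mut=0.1, seed=0):
    """GA sketch: keep the top 10% of chromosomes, breed 360 offspring by
    one-point crossover of random elite pairs, mutate each bit with
    probability 0.1, and stop when the elite group no longer changes."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, (pop_size, n_bits), dtype=np.int8)
    prev_elite = None
    while True:
        scores = np.array([fitness(c.astype(bool)) for c in pop])
        elite = pop[np.argsort(scores)[::-1][:n_elite]]
        if prev_elite is not None and np.array_equal(elite, prev_elite):
            return elite[0].astype(bool)  # best chromosome of the elite
        prev_elite = elite.copy()
        offspring = []
        for _ in range(pop_size - n_elite):
            a, b = elite[rng.integers(n_elite)], elite[rng.integers(n_elite)]
            cut = rng.integers(1, n_bits)           # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child[rng.random(n_bits) < p_mut] ^= 1  # per-bit mutation
            offspring.append(child)
        pop = np.vstack([elite] + offspring)
```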

GMLVQ

The Generalized Matrix Learning Vector Quantization (GMLVQ) (Schneider et al., 2009a,b) computes the pairwise relevances of all features with respect to the classification problem. It generates a full matrix Λ of relevances that describe the importance of the individual features and of pairs of features in the classification task.

We consider the diagonal elements $\Lambda_{ii}$ as the ranking (relevance) scores of the features. The higher the score, the more relevant the corresponding feature is in comparison to the others. In Figure 6.4 we show the feature relevances obtained from the training images of the DRIVE data set. In the following, we investigate the selection of the subset of relevant features in two different ways.

Relevance peaks. We select only the features that achieve relevance peaks. For instance, from the feature relevances shown in Figure 6.4 we select the features [r3, r8, r10, r17, r21, r24, r27, r31, r33, r36, r38, r42]. It is worth noting that this approach can be used when the feature vector elements are in a systematic order and thus can be compared with their neighboring elements. In our case the feature vector is constructed from the responses of B-COSFIRE filters whose thickness preference increases systematically, plus the green channel.

Figure 6.4: A bar plot of the relevances of the features (g, r1, ..., r42) on the DRIVE data set.

Relevance ranking. We sort the 43 features in descending order of their relevance scores and select the features with the top k relevances. We then determine the value of k that maximizes the accuracy on the training set.

6.2.5 Classification

We use the selected features to train an SVM classifier with a linear kernel. The SVM classifier is particularly suited for binary classification problems, since it finds an optimal separation hyperplane that maximizes the margin between the classes (Joachims, 2000).
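With scikit-learn (our library choice; the thesis provides Matlab scripts), the training step reads:

```python
from sklearn.svm import SVC

def train_vessel_classifier(X_train, y_train, selected):
    """Linear-kernel SVM on the selected feature columns; the probability
    scores are thresholded later to obtain the binary vessel map."""
    clf = SVC(kernel="linear", probability=True)
    clf.fit(X_train[:, selected], y_train)
    return clf
```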

6.2.6 Application phase

In Figure 6.5 we depict the architectural scheme of the application phase of the proposed method. First, we preprocess a given retinal fundus image, Figure 6.5(a-b). We discuss the preprocessing procedure in Section 6.4.1. For each pixel, we construct a feature vector by considering the features selected during the training phase (i.e. possibly the green channel and the responses of a subset of k B-COSFIRE filters), Figure 6.5(c-d). Then, we transform and rescale the features and use an SVM classifier to determine the vesselness of each pixel in the input image, Figure 6.5(e-f).

Figure 6.5: Sketch of the application phase of the proposed method. The (a) input retinal image is first (b) preprocessed. Then, (c) the responses of the bank of selected B-COSFIRE filters and, possibly, the green channel are used to form a (d) feature vector. After (e) transforming and rescaling the features, (f) an SVM classifier is then used to classify every pixel in the input image and obtain (g) a response map. (h) The binary output is obtained by thresholding the SVM probability scores.

Finally, we compute the binary vessel map by thresholding the output score of the SVM, Figure 6.5(g-h).

6.3 Materials

6.3.1 Data sets

We performed experiments on two data sets of retinal fundus images that are publicly available for benchmarking purposes: DRIVE (Staal et al., 2004) and STARE (Hoover et al., 2000).

The DRIVE data set is composed of 40 images (of size 565 × 584 pixels), divided into a training and a test set of 20 images each. The images in the training set were manually labeled by one human observer, while the images in the test set were labeled by two different observers. For each image in the data set, a binary mask of the field of view (FOV) of the retina is also provided.

The STARE data set consists of 20 retinal fundus images (of size 700 × 605 pixels), 10 of which contain signs of pathology. Each image in the data set was manually labeled by two different human observers.

For both data sets, we consider the manual segmentation provided by the first observer as the gold standard and use it as the reference ground truth for the performance evaluation of the algorithms. We use the second set of manually labeled images to compute the performance of the second human observer with respect to the gold standard.

6.3.2 B-COSFIRE implementation

We used the existing implementation of the B-COSFIRE filtering⁴ to compute the responses of the involved vessel-selective and vessel-ending-selective filters. Moreover, we provide a new set of Matlab scripts⁵ of the proposed supervised delineation technique, including the automatic feature selection.

6.4 Experiments

6.4.1 Pre-processing

In our experiments, we considered only the green channel of the RGB retinal images, since it shows the highest contrast between vessels and background (Staal et al., 2004; Niemeijer et al., 2004; Mendonca and Campilho, 2006). The blue channel has a small dynamic range, while the red channel has low contrast.

⁴http://www.mathworks.com/matlabcentral/fileexchange/49172
⁵The new package of Matlab scripts can be downloaded from http://matlabserver.cs.rug.nl



We pre-processed the retinal images in the DRIVE and STARE data sets in order to avoid false detection of vessels around the FOV and to further enhance the contrast in the green channel. Due to the high contrast on the border of the FOV of the retina, the B-COSFIRE filters might detect false vessels. We applied the pre-processing step proposed by Soares et al. (2006), which dilates the FOV by iteratively enlarging the radius of the region of interest by one pixel at a time. In each iteration, we selected the pixels in the outer border of the FOV and replaced them with the average value of the intensities of the 8-neighbor pixels contained inside the FOV. We iterated this procedure 50 times, as it was sufficient to avoid false detection of lines around the border of the FOV of the retina.

Finally, we applied the contrast-limited adaptive histogram equalization (CLAHE) algorithm (Pizer et al., 1987) in order to enhance the contrast between vessels and background. The CLAHE algorithm improves the local contrast and avoids the over-amplification of the noise in homogeneous regions.
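The two pre-processing steps can be sketched in MATLAB as follows, assuming that I is the RGB image and fov the logical FOV mask distributed with the data. The dilation loop is a plain re-implementation of the procedure described above, not the authors' code; adapthisteq is MATLAB's CLAHE routine (Image Processing Toolbox).

    G = im2double(I(:, :, 2));                    % keep only the green channel
    M = fov;                                      % logical field-of-view mask
    for it = 1:50                                 % enlarge the FOV radius 50 times
        B = imdilate(M, strel('disk', 1)) & ~M;   % pixels on the outer border of the FOV
        S = conv2(G .* M, ones(3), 'same');       % sum of neighbor intensities inside the FOV
        C = conv2(double(M), ones(3), 'same');    % count of neighbors inside the FOV
        idx = B & C > 0;
        G(idx) = S(idx) ./ C(idx);                % average of the 8-neighbors inside the FOV
        M = M | B;                                % the FOV grows by one pixel
    end
    G = adapthisteq(G);                           % CLAHE contrast enhancement (Pizer et al., 1987)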

6.4.2 Evaluation

For the DRIVE data set, we construct the training set by selecting 1000 vessel and 1000 non-vessel pixels from each image of the training set, which corresponds to a total of 40000 feature vectors.

The STARE data set does not have separate training and test sets. Thus, we construct the training set by randomly choosing 40000 pixels from all the 20 images in the data set (1000 vessel pixels and 1000 non-vessel pixels from each image). As suggested by Ricci and Perfetti (2007) and Fraz et al. (2012), since the size of the selected training set is very small (< 0.5% of the entire data set), we evaluate the performance on the whole set of images.
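As an illustration, the per-image sampling could look as follows in MATLAB, where gt is the gold-standard segmentation and fov the FOV mask (both assumed to be logical matrices); the variable names are chosen here for the example only.

    vesselIdx = find(gt & fov);                         % vessel pixels inside the FOV
    bgIdx     = find(~gt & fov);                        % background pixels inside the FOV
    selV = vesselIdx(randperm(numel(vesselIdx), 1000)); % 1000 random vessel pixels
    selB = bgIdx(randperm(numel(bgIdx), 1000));         % 1000 random non-vessel pixels
    trainIdx = [selV; selB];                            % row indices into the feature matrix X
    trainLbl = [ones(1000, 1); zeros(1000, 1)];         % class labels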

The output of the SVM classifier is continuous (in the range [0, 1]) and indicates the degree of vesselness of each pixel in a given image: the higher this value, the more likely a pixel is part of a vessel. We thresholded the output of the classifier in order to obtain the binary segmented image. The threshold operation separates the pixels into two categories: vessels and non-vessels.

When comparing the segmented image with the ground truth image, each pixel contributes to the calculation of one of the following measures: a vessel pixel in the segmented image is a true positive (TP) if it is also a vessel pixel in the ground truth, while it is a false positive (FP) if it is a background pixel in the ground truth; a background pixel in the segmented image that is part of the background also in the ground truth image is a true negative (TN), otherwise it is a false negative (FN). In order to evaluate the performance of the proposed method and compare it with those of existing methods, we computed the sensitivity (Se), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC), which are defined as follows:

Acc = (TP + TN) / N,   Se = TP / (TP + FN),   Sp = TN / (TN + FP)

and

MCC = (TP/N − S × P) / √(P × S × (1 − S) × (1 − P)),

where N = TN + TP + FN + FP, S = (TP + FN)/N and P = (TP + FP)/N.

For a binary classification problem, as in our case, the computation of the accuracy is influenced by the cardinality of the two classes. In the problem at hand, the number of non-vessel pixels is roughly seven times the number of vessel pixels. Therefore, the accuracy is biased by the number of true negative pixels. For this reason we computed the MCC, which quantifies the quality of a binary classifier even when the two classes are imbalanced. It achieves a value of 1 for a perfect classification and a value of −1 for a completely wrong classification. The value 0 indicates a random-guess classifier.

Besides the above-mentioned measures, we also generated a receiver operating characteristic (ROC) curve and computed the area under it (AUC). The ROC curve is a plot that shows the trade-off between the rate of false positives and the rate of true positives as the classification threshold varies. The higher the AUC, the better the performance of the classification system.
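These measures translate directly into a few lines of MATLAB. The sketch below is a transcription of the formulas above, computed from a binary segmentation BW and the gold standard gt, and is not code from the released package.

    TP = nnz( BW &  gt);   FP = nnz( BW & ~gt);   % confusion-matrix counts
    TN = nnz(~BW & ~gt);   FN = nnz(~BW &  gt);
    N   = TP + TN + FP + FN;
    Se  = TP / (TP + FN);                         % sensitivity
    Sp  = TN / (TN + FP);                         % specificity
    Acc = (TP + TN) / N;                          % accuracy
    S   = (TP + FN) / N;   P = (TP + FP) / N;
    MCC = (TP / N - S * P) / sqrt(P * S * (1 - S) * (1 - P));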

6.4.3 Results

For a given test image and a threshold value t we computed the MCC. Then, we computed the average MCC across all test images and obtained a single performance measure for every threshold t. We varied the threshold from 0 to 1 in steps of 0.01. Finally, for a given data set, we chose the threshold t* that provides the maximum value of the average MCC.
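A minimal sketch of this threshold selection follows, assuming cell arrays maps and gts with the per-image response maps and gold standards, and a helper mccOf() implementing the MCC of Section 6.4.2; these names are placeholders for the example.

    thresholds = 0:0.01:1;
    meanMCC = zeros(size(thresholds));
    for k = 1:numel(thresholds)
        mccs = zeros(numel(maps), 1);
        for j = 1:numel(maps)
            mccs(j) = mccOf(maps{j} > thresholds(k), gts{j});  % binarize and score
        end
        meanMCC(k) = mean(mccs);                 % average MCC over all test images
    end
    [~, best] = max(meanMCC);
    tStar = thresholds(best);                    % t* maximizing the average MCC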

In Table 6.1 and Table 6.2 we report the results that we achieved with the proposed supervised approach on the DRIVE and STARE data sets, respectively. In order to evaluate the effects of the different feature selection methods, we used as baseline the results (MCC = 0.7492 for the DRIVE data set and MCC = 0.7537 for the STARE data set) that we obtained by a linear SVM classifier trained with the responses of the bank of 42 B-COSFIRE filters plus the intensity value in the green channel. This naïve supervised approach achieved better performance than the unsupervised B-COSFIRE filter approach (Azzopardi et al., 2015), whose results are


Results on the DRIVE data set

Method                            Se      Sp      AUC     Acc     MCC     #Features  Processing time
Unsup. (Azzopardi et al., 2015)   0.7655  0.9706  0.9614  0.9442  0.7475   2          10 s
Supervised:
  No feature selection            0.7901  0.9675  0.9602  0.9437  0.7492  43         200 s
  Genetic algorithm               0.7754  0.9704  0.9594  0.9453  0.7517  17         110 s
  GMLVQ - relevance peaks         0.7777  0.9702  0.9597  0.9454  0.7525  12          75 s
  GMLVQ - relevance ranking       0.7857  0.9673  0.9602  0.9439  0.7487  11          70 s
  Class entropy ranking           0.7731  0.9708  0.9593  0.9453  0.7513  16          90 s

Table 6.1: Comparison of results with different B-COSFIRE approaches on the DRIVE data set.

Results on the STARE data set

Method                            Se      Sp      AUC     Acc     MCC     #Features  Processing time
Unsup. (Azzopardi et al., 2015)   0.7716  0.9701  0.9563  0.9497  0.7335   2          10 s
Supervised:
  No feature selection            0.7449  0.9810  0.9639  0.9561  0.7537  43         210 s
  Genetic algorithm               0.7928  0.9734  0.9638  0.9542  0.7548   7          60 s
  GMLVQ - relevance peaks         0.8046  0.9710  0.9638  0.9534  0.7536  10          80 s
  GMLVQ - relevance ranking       0.7737  0.9716  0.9590  0.9507  0.7384  11          85 s
  Class entropy ranking           0.7668  0.9711  0.9577  0.9495  0.7280  19         150 s

Table 6.2: Comparison of results with different B-COSFIRE approaches on the STARE data set.


also reported in the two tables. The use of machine learning or information theory techniques that compute a score of the importance of each feature gives the possibility to select the best performing group of features and, at the same time, to reduce the overall processing time.

[Figure 6.6: plots of MCC versus number of features for (a) DRIVE and (b) STARE; curves: class entropy and GMLVQ (relevance ranking).]

Figure 6.6: The plots in (a) and (b) show the MCC as a function of the number of top-performing features for the DRIVE and STARE data sets, respectively.


[Figure 6.7: ROC plots for (a) DRIVE and (b) STARE; axes: true positive rate (sensitivity) versus false positive rate (1 − specificity).]

Figure 6.7: ROC curves achieved on (a) the DRIVE and (b) the STARE data sets by the selection methods based upon GMLVQ relevance peaks (solid line) and a genetic algorithm (dashed line), and by the unsupervised B-COSFIRE filters (dotted line).

For the methods based on feature ranking, namely GMLVQ and class entropy,


we report the results achieved when considering the set of the k top-scored features. We chose the value of k which provided the highest accuracy on the training set. With this method we selected 11 features for both the DRIVE and STARE data sets by using GMLVQ with relevance ranking. On the other hand, when we ranked the features on the basis of their class entropy score, we selected 16 features for DRIVE and 19 for STARE. In Figure 6.6a and Figure 6.6b we show how the MCC on the DRIVE and STARE data sets varies with an increasing number of features involved in the classification process. We only show the 19 most discriminant features, since the performance improvement achieved by further features is negligible. Moreover, the required processing time becomes too high and comparable to the one required to compute the full set of features. We performed the experiments on a machine equipped with a 1.8 GHz Intel i7 processor and 4 GB of RAM. In Figure 6.7, we show the ROC curves obtained by the GMLVQ with relevance peaks (solid line) and by the genetic algorithm (dashed line) feature selection methods, in comparison with that of the unsupervised B-COSFIRE filters (dotted line). A substantial improvement of performance is evident for the STARE data set.
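A sketch of this choice of k is given below, assuming that ranking holds the feature ordering produced by one of the ranking methods, that Xtrain and yTrain are the training features and labels, and that trainAndScore() is a placeholder that trains a linear SVM and returns its training-set accuracy.

    bestAcc = -inf;
    kBest = 1;
    for k = 1:19                                  % only the 19 most discriminant features
        acc = trainAndScore(Xtrain(:, ranking(1:k)), yTrain);
        if acc > bestAcc                          % keep the k with highest training accuracy
            bestAcc = acc;
            kBest = k;
        end
    end
    selected = ranking(1:kBest);                  % indices of the selected features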

6.4.4 Statistical analysis

We used the right-tailed paired t-test statistic to quantify the performance improvement that we achieved with the proposed supervised method with respect to the unsupervised B-COSFIRE approach. For each data set and for each method we used the MCC values computed from all test images, as explained in Section 6.4.

A significant improvement of the results is confirmed for the feature selection method based on GMLVQ with relevance peaks (DRIVE: t(19) = 1.33, p < 0.1; STARE: t(19) = 2.589, p < 0.01) and for the approach based on a genetic algorithm (DRIVE: t(19) = 1.13, p < 0.15; STARE: t(19) = 2.589, p < 0.01). On the contrary, the feature selection methods based on ranking the features by their relevance or their class entropy score do not significantly improve the performance results.

For both data sets, the GMLVQ with relevance peaks and the genetic algorithm provide the best performance results; indeed, there is no statistically significant difference between the two methods.

6.4.5 Comparison with existing methods

With the proposed approach we achieve better results than many existing methods, which we report in Table 6.3. The direct evaluation of the results from Table 6.3 is not trivial. Thus, for comparison purposes, we move along the ROC curves in Figure 6.7 and, for the same specificity values achieved by other methods, we compare the sensitivity values that we achieve with theirs.


Performance comparison

                                        DRIVE                           STARE
Method                                  Se      Sp      AUC     Acc     Se      Sp      AUC     Acc
Unsupervised:
  B-COSFIRE (Azzopardi et al., 2015)    0.7655  0.9704  0.9614  0.9442  0.7716  0.9701  0.9563  0.9497
  Hoover et al. (2000)                  -       -       -       -       0.6747  0.9565  0.7590  0.9275
  Mendonca and Campilho (2006)          0.7344  0.9764  -       0.9463  0.6996  0.9730  -       0.9479
  Martinez-Perez et al. (2007)          0.7246  0.9655  -       0.9344  0.7506  0.9569  -       0.9410
  Al-Rawi et al. (2007)                 -       -       0.9435  0.9535  -       -       0.9467  0.9090
  Ricci and Perfetti (2007)             -       -       0.9558  0.9563  -       -       0.9602  0.9584
  Lam et al. (2010)                     -       -       0.9614  0.9472  -       -       0.9739  0.9567
Supervised:
  Staal et al. (2004)                   -       -       0.9520  0.9441  -       -       0.9614  0.9516
  Soares et al. (2006)                  0.7332  0.9782  0.9614  0.9466  0.7207  0.9747  0.9671  0.9480
  Ricci and Perfetti (2007)             -       -       0.9633  0.9595  -       -       0.9680  0.9646
  Marin et al. (2011)                   0.7067  0.9801  0.9588  0.9452  0.6944  0.9819  0.9769  0.9526
  Fraz et al. (2012)                    0.7406  0.9807  0.9747  0.9480  0.7548  0.9763  0.9768  0.9534
  Proposed method                       0.7777  0.9702  0.9597  0.9454  0.8046  0.9710  0.9638  0.9534

Table 6.3: Comparison of the performance results achieved by the proposed approach with the ones achieved by other existing methods.


We refer to the performance achieved by the GMLVQ with relevance peaks feature selection. For the DRIVE data set and for the same specificity reported by Soares et al. (2006) (Sp = 0.9782) and by Marin et al. (2011) (Sp = 0.9801) we achieve better sensitivity: 0.7425 and 0.7183, respectively. For the same specificity reported by Fraz et al. (2012) (Sp = 0.9807) we achieve a lower value of the sensitivity (Se = 0.7181). Similarly, for the STARE data set and for the same specificity values reported by Soares et al. (2006), Marin et al. (2011) and Fraz et al. (2012) (Sp = 0.9747, Sp = 0.9819 and Sp = 0.9763) we achieve better sensitivity: 0.7806, 0.7316 and 0.7697, respectively.

6.5 Discussion

The main contribution of this chapter is a supervised method for vessel delineation based on the automatic selection of a subset of B-COSFIRE filters selective for vessels of different thickness. We applied various feature selection techniques to a bank of B-COSFIRE filters and compared their performance. The versatility of the B-COSFIRE filters, together with the use of a feature selection procedure, showed high flexibility and robustness in the task of delineating elongated structures in retinal images. The proposed method can also be applied in other applications, such as the quantification of the length and width of cracks in walls (Muduli and Pati, 2013) for earthquake damage estimation or the monitoring of the flow of rivers in order to prevent flooding disasters (Zhang et al., 2009).

The versatility of the B-COSFIRE filters lies in their trainable character and thus in their being domain-independent. They can be automatically configured to be selective for various prototype patterns of interest. In this chapter we configured filters on some vessel-like prototype patterns. This avoids the need to manually create a feature set to describe the pixels in the retinal images, which is an operation that requires skills and knowledge of the specific problem. This is in contrast to other methods that use hand-crafted features and thus domain knowledge. For instance, the features proposed by Fraz et al. (2012) are specifically designed to deal with particular issues of the retinal fundus images, such as bright and dark lesions or non-uniform illumination of the FOV. A specific B-COSFIRE filter is configured to detect patterns that are equivalent or similar to the prototype pattern used for its configuration; in our case, it detects blood vessels of specific thickness. One may also, however, configure B-COSFIRE filters selective for other kinds of patterns, such as bifurcations and crossovers (Azzopardi and Petkov, 2013a,b), and add them to the filterbank.

Although the difference between the performance achieved by the genetic algorithm


and that achieved by the GMLVQ with relevance peaks is not statistically significant, the latter method seems more stable, as it selects a comparable number of features in both data sets. Furthermore, the reduced bank of features improves the classification performance while also reducing the required processing time. As a matter of fact, the GMLVQ approach selected a subset of 12 features for the DRIVE data set and 10 features for the STARE data set. The technique based on a genetic algorithm selected a set of 17 features for the DRIVE data set and 7 features for the STARE data set.

For the DRIVE data set, we selected five vessel and seven vessel-ending B-COSFIRE filters6. The value of the green channel was not relevant for this data set. For the STARE data set, instead, we found that the value of the green channel is important. Thus, we constructed the feature vectors with the intensity value of the green channel plus the responses of four vessel- and five vessel-ending B-COSFIRE filters7.

The output of a genetic algorithm is crisp, as the selected features all have the same weighting. In contrast, the GMLVQ approach shows higher flexibility, since it provides a measure of the relevance (in the range [0, 1]) that each filter has in the classification task. The genetic algorithm, however, evaluates the combined contribution of many features, exploring a larger space of solutions, while the GMLVQ considers only the contribution of two features at a time.

Although the two approaches based on GMLVQ and the one based on a genetic algorithm construct different sets of B-COSFIRE filters, we achieve a statistically significant improvement of the performance results with respect to the unsupervised method. This demonstrates that the proposed B-COSFIRE filterbank is robust to the feature selection approach used. The flexibility and generalization capabilities of the B-COSFIRE filters, together with a feature selection procedure, allow the construction of a system that can adapt to any delineation problem.

The method based on the computation of the class entropy score of each feature and the selection of the k top-ranked features does not improve the performance substantially. In this approach, the features are assumed to be statistically independent and their contribution to the classification task is evaluated individually. This reduces the effectiveness of the selection procedure, since it does not take into account possible mutual contributions of pairs or groups of features to the classification task.

6 The selected scales for the DRIVE data set are σ1 = 1.6, σ2 = 2.1, σ3 = 2.3, σ4 = 3 and σ5 = 3.4 for the vessel-selective filters, and σ6 = 1.1, σ7 = 1.4, σ8 = 1.8, σ9 = 2, σ10 = 2.3, σ11 = 2.5 and σ12 = 2.9 for the vessel-ending-selective filters. We set σ0 = 3 and α = 0.7 for the vessel-selective filters, and σ0 = 2 and α = 0.1 for the vessel-ending-selective filters.

7 The selected scales for the STARE data set are σ1 = 1.8, σ2 = 2.2, σ3 = 3 and σ4 = 3.7 for the vessel-selective filters, and σ5 = 1.7, σ6 = 1., σ7 = 2.2, σ8 = 2.5 and σ9 = 3.2 for the vessel-ending-selective filters. We set σ0 = 1 and α = 0.5 for the vessel-selective filters, and σ0 = 1 and α = 0.1 for the vessel-ending-selective filters.


The application of a single B-COSFIRE filter is very efficient (Azzopardi et al., 2015): it takes from 3 to 5 seconds (on a 1.8 GHz Intel i7 processor with 4 GB of RAM) to process an image from the DRIVE and STARE data sets. The responses of a bank of B-COSFIRE filters are computed independently of each other. Therefore, the computation of such responses can be parallelized so as to further reduce the required processing time.
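As a sketch, such a parallelization is a small change in MATLAB using a parfor loop (Parallel Computing Toolbox); applyBCOSFIRE is the same hypothetical placeholder used earlier.

    responses = cell(1, numel(bank));
    parfor i = 1:numel(bank)
        responses{i} = applyBCOSFIRE(G, bank(i));   % each filter response is independent
    end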

6.6 Conclusions

The supervised method that we propose for the segmentation of blood vessels in retinal images is versatile and highly effective. The results that we achieve on two public benchmark data sets (DRIVE: Se = 0.7777, Sp = 0.9702 and MCC = 0.7525; STARE: Se = 0.8046, Sp = 0.9710 and MCC = 0.7536) are better than those of many existing methods. The proposed approach couples the generalization capabilities of the B-COSFIRE filters with an automatic procedure (GMLVQ with relevance peaks) that selects the best performing ones. The delineation method that we propose can be employed in any application in which the delineation of elongated structures is required.


Chapter 7

Summary and Outlook

7.1 Summary

In this thesis we proposed novel pattern recognition methodologies for audio and image analysis based on the use of biology-inspired filters. The filters that we propose draw inspiration from some properties of the components of the middle ear in the human auditory system and from the characteristics of simple cells in area V1 of the primary visual cortex. We employed the proposed filters in two important applications, namely the detection of audio events in surveillance systems and the delineation of blood vessels in retinal fundus images.

Audio surveillance is a very recent research field and the state-of-the-art methods did not yet provide a reasonable approach to the problem of audio event detection in open environments. Moreover, to the best of our knowledge, there were no publicly available data sets for the benchmarking of detection algorithms. Thus, we created and made publicly available for algorithm evaluation two data sets, namely the MIVIA audio events and the MIVIA road events data sets. The MIVIA audio events data set is composed of glass breaking, gun shot and scream events, mixed with various background sounds and with different values of signal-to-noise ratio (SNR). The MIVIA road events data set, instead, contains events of car crashes and skidding tires, which are abnormal events that can occur on roads.

In Chapter 2, we proposed an approach for audio event detection in noisy environments, which is based on the bag-of-features classification paradigm. The basic idea of the proposed approach is that the events of interest are composed of small perceptual units of hearing, which we call aural words. While a single aural word describes short-time characteristics of the audio signal, the occurrence and distribution of such aural words within a larger interval of time is likely representative of the presence of a given event. The classification results that we obtained on the MIVIA audio events data set were higher than the ones achieved by existing methods and motivated the implementation of a prototypical system for the live analysis of sound.

Along the same lines, in Chapter 3 we proposed an application of the developed


audio event recognition system to road surveillance. To this end, we designed an architecture for the practical deployment of an audio surveillance system. The proposed architecture is built on an accepted model of traffic noise, namely the CoRTN model (United Kingdom Department of Environment and Welsh Office Joint Publication, HMSO, 1975), and provides a tool for the positioning of microphones so as to monitor long stretches of road. We evaluated the performance of the proposed system in different scenarios. In particular, we studied how the SNR and the detection rate vary depending on the distance of the sound source, the speed of the vehicles and the number of passing vehicles per hour. Moreover, we collected and made publicly available a data set of abnormal events that can occur on roads. The proposed system for audio event detection achieved better performance than other existing methods and showed a reasonable robustness to variations of the SNR level of the events of interest.

Humans, however, are able to recognize sounds even when the SNR is negative, due to the characteristics of the auditory system. Starting from this consideration, and inspired by some properties of the outer and middle ear in the human auditory system, in Chapter 4 we designed novel filters for audio stream analysis, which we called CoPE filters. The proposed CoPE filters outperformed existing methods and demonstrated high stability with respect to variations of the SNR of the events of interest. It is worth pointing out that the proposed filters performed well even in the case of zero or negative SNR values. One important property of such filters is trainability, as they determine the important features by an automatic configuration process on a prototype sound of interest. Thus, they do not require domain knowledge for manually creating a feature set to describe the sounds of interest. A configured CoPE filter can be thought of as a particular feature extractor, determined directly from the data.

In Chapter 5, instead, we focused on an important image analysis task and proposed B-COSFIRE filters for the delineation of blood vessels in retinal fundus images. Unlike the original COSFIRE filters, which combine the output of Gabor filters (Azzopardi and Petkov, 2013b), B-COSFIRE filters take as input the responses of DoG filters, whose spatial arrangement is determined during an automatic configuration process performed on a given prototype pattern of interest. We demonstrated that they are highly suitable for the delineation of bar-like structures in images. For the delineation of blood vessels in retinal images, we proposed to combine the responses of B-COSFIRE filters that are selective for different kinds of vessels, namely straight vessels and vessel-endings. This combination contributed to outperforming existing unsupervised methods and to achieving results comparable to approaches that are based on machine learning techniques.

We further extended the adaptation capabilities of the B-COSFIRE filters in Chapter 6.


We proposed to employ machine learning and information theory techniques to determine an optimal set of B-COSFIRE filters (both vessel- and vessel-ending-selective) for improving the quality of the delineation output. We configured a large set of B-COSFIRE filters selective for vessels of various thickness. Then, we employed different feature selection methods, including Generalized Matrix Learning Vector Quantization (GMLVQ) (Schneider et al., 2009a), class entropy and a genetic algorithm. The good results achieved by the proposed approach are attributable to the combination of the generalization capabilities of the B-COSFIRE filter with an automatic procedure for the selection of the filters that are important for the application at hand.

The proposed CoPE and B-COSFIRE filters share the same working principle. They are both trainable, as their structure can be automatically configured to be selective for a specific pattern of interest. They take as input, and then combine, the responses of simpler filters (i.e. time-frequency energy peaks for CoPE and DoG responses for B-COSFIRE). In the application phase, the input filter responses are blurred so as to be tolerant to deformations of the pattern of interest. Thus, they have intrinsic generalization capabilities that can be exploited in complex classification systems.

The work proposed in this thesis draws a research path from traditional to biology-inspired pattern recognition. It contributes to the design of robust and flexible pattern recognition systems that exploit the fundamental working principles of the human auditory and visual systems. The proposed trainable filters aim at reducing the feature engineering effort usually required for the design of a new system.

7.2 Outlook

In the following, we propose some research lines along which the work contained in this thesis can be extended.

Audio event detection in environments with uncontrolled noise sources has recently raised great interest, driven by the increasing demand for safer living environments. Background noise can be composed of various sounds, which are typical for some environments but not for others. In this respect, the work proposed in this thesis could be extended towards improved robustness with respect to various kinds of background noise. It is usually not trivial to foresee the kinds of sound that can occur in every environment, which makes the creation of a general model of background noise impossible. For the construction of a complete and effective audio surveillance system, the proposed audio event detectors (the one based on bag-of-features in Chapter 2 and the one based on CoPE filters in Chapter 4) could


be extended with a background sound modeling and subtraction module (Ntalampiras et al., 2011; Crocco et al., 2014). This would strengthen the detection system and make it automatically adaptable to different environments.

The proposed CoPE filters are inspired by some physiological evidence of the functioning of the middle ear of the human auditory system. In particular, they are inspired by the way the cochlear membrane vibrates when it is stimulated by incoming sound pressure waves and by how such vibrations cause a firing activity of inner hair cells (IHC) towards the auditory nerve. There is neuro-physiological evidence that the firing of IHCs is subject to an inhibition mechanism that avoids the short-time firing of the same neurons (Lopez-Poveda and Eustaquio-Martín, 2006). At a larger scale, this means that the effect of an acoustic event depends on what other acoustic events have occurred in the recent past. The investigation of such an inhibition mechanism in the selection of the sub-components of CoPE filters could contribute to stronger robustness to noise and higher generalization abilities. Only the significant peaks that correspond to IHC firing activity would thus be selected in the configuration and application phases of the filter.

One important property of the CoPE filters is trainability. Their structure is determined during an automatic configuration process given a pattern of interest. This corresponds to a high flexibility of the proposed filters, which can be employed in other audio analysis applications. For instance, a song is composed of repeated patterns, usually rhythmic but also melodic. The principle of CoPE filters can be used to model and then detect such patterns in unknown audio streams. Furthermore, different musical genres typically involve different rhythmic structures and sometimes even different instruments, with specific frequency characteristics. Thus, the proposed CoPE filters can be employed in various applications such as music genre recognition (Sturm, 2014), ornamentation detection and recognition (Neocleous et al., 2015), audio fingerprinting (Cano et al., 2005), etc.

The B-COSFIRE filters proposed in Chapter 5 are effective for the delineation of blood vessels in retinal fundus images. We configured filters selective for bar-like patterns, such as vessels and vessel-endings. However, the selectivity of the B-COSFIRE filters is not fixed a priori but rather determined during an automatic configuration process, performed on a prototype pattern of interest. The quality of the delineation output can thus be improved by configuring filters that are selective for other patterns, such as bifurcations or crossovers, and combining their responses with those of the bar detectors.

B-COSFIRE filters reproduce some properties of simple cells in area V1 of the primary visual cortex, which are devoted to the detection of edges and bars. Some of these cells receive inhibition from receptive fields with opposite polarity, which produces sharper responses to bar-like stimuli. One direction for the extension of the


B-COSFIRE filters is to include this inhibition mechanism, called push-pull inhibition. It is expected to strengthen the response of the filter by suppressing the noise around the borders of thicker bars and along thinner bars.

Another direction for future work is to consider the depth dimension and configure B-COSFIRE filters for 3D bar structures, which can be applied, for instance, to detect the blood vessels in angiography images of the brain. B-COSFIRE filters can also be employed in other applications where the delineation of elongated structures is required, such as the quantification of the length and width of cracks in walls (Muduli and Pati, 2013) for earthquake damage estimation or the monitoring of the flow of rivers from aerial images in order to prevent flooding disasters (Zhang et al., 2009).

The procedure proposed in Chapter 6 is a framework for the automatic selection of a set of B-COSFIRE filters whose combined responses improve the quality of the delineation output. The original set of B-COSFIRE filters used in Chapter 6 can be extended to include, for instance, filters selective for bifurcations and crossovers. This will increase the flexibility of the proposed approach and its capability of determining a set of filters optimized for the application at hand. Moreover, the method can be considered as a general approach for filter selection and employed, for instance, for the optimization of the CoPE filterbank proposed in Chapter 4.

Both the CoPE and the B-COSFIRE filters are implemented in MATLAB and run in sequential mode. Although they have already been shown to be fast and to require limited computational resources, their response time can be further improved by a parallel implementation. Most of the computations are, indeed, independent and can be performed on different cores, for instance on a CUDA architecture. This will contribute to faster processing and to the possibility of building larger banks of filters.


Bibliography

Abadi, A., Rajabioun, T. and Ioannou, P.: 2014, Traffic flow prediction for road transportation networks with limited traffic data, IEEE Trans. Intell. Transp. Syst. PP(99), 1–10.

Abramoff, M. D., Niemeijer, M. and Russell, S. R.: 2010, Automated detection of diabetic retinopathy: barriers to translation into clinical practice, Expert Review of Medical Devices 7, 287–296.

Abramoff, M., Garvin, M. and Sonka, M.: 2010, Retinal imaging and image analysis, Biomedical Engineering, IEEE Reviews in 3, 169–208.

Al-Diri, B., Hunter, A. and Steel, D.: 2009, An active contour model for segmenting and measuring retinal vessels, Medical Imaging, IEEE Transactions on 28(9), 1488–1497.

Al-Diri, B., Hunter, A., Steel, D., Habib, M., Hudaib, T. and Berry, S.: 2008, Review - a reference data set for retinal vessel profiles, Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE, pp. 2262–2265.

Al-Rawi, M., Qutaishat, M. and Arrar, M.: 2007, An improved matched filter for blood vessel detection of digital retinal images, Computers in Biology and Medicine 37(2), 262–267.

Anusuya, M. A. and Katti, S. K.: 2010, Speech recognition by machine, a review, CoRR abs/1001.2267. URL: http://arxiv.org/abs/1001.2267

Aurino, F., Folla, M., Gargiulo, F., Moscato, V., Picariello, A. and Sansone, C.: 2014, One-class SVM based approach for detecting anomalous audio events, INCoS 2014.

Avery Lichun, W.: 2003, An industrial-strength audio search algorithm, Proceedings of the 4th International Conference on Music Information Retrieval.

Azzopardi, G. and Petkov, N.: 2012, A CORF computational model of a simple cell that relies on LGN input outperforms the Gabor function model, Biological Cybernetics 106(3), 177–189.


Azzopardi, G. and Petkov, N.: 2013a, Automatic detection of vascular bifurcations in segmented retinal images using trainable COSFIRE filters, Pattern Recognition Letters 34, 922–933.

Azzopardi, G. and Petkov, N.: 2013b, Trainable COSFIRE filters for keypoint detection and pattern recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 490–503.

Azzopardi, G., Strisciuglio, N., Vento, M. and Petkov, N.: 2015, Trainable COSFIRE filters for vessel delineation with application to retinal images, Medical Image Analysis 19(1), 46–57.

Barnwal, S., Barnwal, R., Hegde, R., Singh, R. and Raj, B.: 2013, Doppler based speed estimation of vehicles using passive sensor, IEEE ICMEW, pp. 1–4.

Bekkers, E., Duits, R., Berendschot, T. and ter Haar Romeny, B.: 2014, A multi-orientation analysis approach to retinal vessel tracking, Journal of Mathematical Imaging and Vision 49(3), 583–610.

Besacier, L., Barnard, E., Karpov, A. and Schultz, T.: 2014, Automatic speech recognition for under-resourced languages: A survey, Speech Commun. 56(0), 85–100.

Blauert, J.: 2013, The Technology of Binaural Listening, Modern Acoustics and Signal Processing.

Borkar, P. and Malik, L.: 2013, Review on vehicular speed, density estimation and classification using acoustic signal, Int. Journal for Traffic and Transport Engineering.

Brun, L., Saggese, A. and Vento, M.: 2014, Dynamic scene understanding for behavior analysis based on string kernels, IEEE Trans. Circuits Syst. Video Technol. 24(10), 1669–1681.

Cai, R., Lu, L. and Hanjalic, A.: 2008, Co-clustering for auditory scene categorization, IEEE Trans. Multimedia 10(4), 596–606.

Canelli, G. B., Gluck, K. and A., S. S.: 1983, A mathematical model for evaluation and prediction of mean energy level of traffic noise in Italian towns, Acustica.

Cano, P., Batlle, E., Kalker, T. and Haitsma, J.: 2005, A review of audio fingerprinting, Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology 41(3), 271–284.

Carletti, V., Foggia, P., Percannella, G., Saggese, A., Strisciuglio, N. and Vento, M.: 2013, Audio surveillance using a bag of aural words classifier, IEEE AVSS, pp. 81–86.

Chauduri, S., Chatterjee, S., Katz, N., Nelson, M. and Goldbaum, M.: 1989, Detection of blood-vessels in retinal images using two-dimensional matched-filters, IEEE Transactions on medical imaging 8(3), 263–269.

Chen, L., Huang, X. and Tian, J.: 2015, Retinal image registration using topological vascular tree segmentation and bifurcation structures, Biomedical Signal Processing and Control 16, 22–31.


Chetty, G. and Wagner, M.: 2005, Investigating feature-level fusion for checking liveness in face-voice authentication, Proc. of the 8th International Symposium on Signal Processing and its Applications, ISSPA-2005, pp. 66–69.

Chin, M. and Burred, J.: 2012, Audio event detection based on layered symbolic sequence representations, IEEE ICASSP, pp. 1953–1956.

Chutatape, O., Liu Zheng and Krishnan, S.: 1998, Retinal blood vessel detection and tracking by matched Gaussian and Kalman filters, in H. Chang and Y. Zhang (eds), Proc. 20th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBS'98), Vol. 17, pp. 3144–9.

Cinsdikici, M. G. and Aydin, D.: 2009, Detection of blood vessels in ophthalmoscope images using MF/ant (matched filter/ant colony) algorithm, Computer Methods and Programs in Biomedicine pp. 85–95.

Clavel, C., Ehrette, T. and Richard, G.: 2005, Events detection for an audio-based surveillance system, ICME, pp. 1306–1309.

Conte, D., Foggia, P., Percannella, G., Saggese, A. and Vento, M.: 2012, An ensemble of rejecting classifiers for anomaly detection of audio events, IEEE AVSS, pp. 76–81.

Cordella, L., Foggia, P., Sansone, C. and Vento, M.: 2003, A real-time text-independent speaker identification system, ICIAP, pp. 632–637.

Cortes, C. and Vapnik, V.: 1995, Support-vector networks, Machine Learning 20(3), 273–297.

Cristani, M., Bicego, M. and Murino, V.: 2007, Audio-visual event recognition in surveillance video sequences, IEEE Trans. Multimedia 9(2), 257–267.

Crocco, M., Cristani, M., Trucco, A. and Murino, V.: 2014, Audio surveillance: a systematic review, CoRR abs/1409.7787.

Daugman, J. G.: 1985, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, J. Opt. Soc. Am. A 2(7), 1160–1169.

Di Lascio, R., Foggia, P., Percannella, G., Saggese, A. and Vento, M.: 2013, A real time algorithm for people tracking using contextual reasoning, CVIU 117(8), 892–908.

Fadzil, M., Nugroho, H., Nugroho, H. and Iznita, I.: 2009, Contrast enhancement of retinal vasculature in digital fundus image, Digital Image Processing, 2009 International Conference on, pp. 137–141.

Fang, B., Hsu, W. and Lee, M.: 2003, Reconstruction of vascular structures in retinal images, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429), Vol. 3, IEEE Signal Process. Soc, pp. II–157–60.

Foggia, P., Percannella, G., Saggese, A. and Vento, M.: 2013, Recognizing human actions by a bag of visual words, Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conf. on, pp. 2910–2915.


Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N. and Vento, M.: 2015a, Audio surveillance of roads: A system for detecting anomalous sounds, Intelligent Transportation Systems, IEEE Transactions on PP(99), 1–10.

Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N. and Vento, M.: 2015b, Reliable detection of audio events in highly noisy environments, Pattern Recognition Letters 65, 22–28.

Foggia, P., Saggese, A., Strisciuglio, N. and Vento, M.: 2014a, Cascade classifiers trained on gammatonegrams for reliably detecting audio events, IEEE AVSS, pp. 50–55.

Foggia, P., Saggese, A., Strisciuglio, N. and Vento, M.: 2014b, Exploiting the deep learning paradigm for recognizing human actions, IEEE AVSS 2014, pp. 93–98.

Fraz, M., Remagnino, P., Hoppe, A., Uyyanonvara, B., Rudnicka, A., Owen, C. and Barman, S.: 2012, An ensemble classification-based approach applied to retinal blood vessel segmentation, IEEE Transactions on Biomedical Engineering 59(9), 2538–2548.

Frucci, M., Riccio, D., di Baja, G. S. and Serino, L.: 2015, Severe: Segmenting vessels in retina images, Pattern Recognition Letters.

Fu, Z., Lu, G., Ting, K. M. and Zhang, D.: 2011a, Music classification via the bag-of-features approach, Pattern Recognition Letters 32(14), 1768–1777.

Fu, Z., Lu, G., Ting, K. M. and Zhang, D.: 2011b, A survey of audio-based music classification and annotation, Multimedia, IEEE Transactions on 13(2), 303–319.

für Verkehr, B.: 1981, Richtlinien für den Lärmschutz an Straßen.

Gandhi, T. and Trivedi, M.: 2007, Pedestrian protection systems: Issues, survey, and challenges, IEEE Trans. Intell. Transp. Syst. 8(3), 413–430.

Gang, L., Chutatape, O. and Krishnan, S.: 2002, Detection and measurement of retinal vessels in fundus images using amplitude modified second-order Gaussian filter, IEEE Transactions on biomedical engineering 49(2), 168–172.

Geman, S. and Geman, D.: 1984, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6(6), 721–741.

Gerosa, L., Valenzise, G., Tagliasacchi, M., Antonacci, F. and Sarti, A.: 2007, Scream and gunshot detection in noisy environments, Proc. EURASIP European Signal Processing Conf., Poznan, Poland.

Goldberg, D. E.: 1989, Genetic Algorithms in Search, Optimization and Machine Learning, 1st edn, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Grisan, E., Foracchia, M. and Ruggeri, A.: 2008, A novel method for the automatic grading of retinal vessel tortuosity, Medical Imaging, IEEE Transactions on 27(3), 310–319.


Grzeszick, R., Plinge, A. and Fink, G.: 2015, Temporal acoustic words for online acoustic event detection, in J. Gall, P. Gehler and B. Leibe (eds), Pattern Recognition, Vol. 9358 of Lecture Notes in Computer Science, Springer International Publishing, pp. 142–153.

Guo, J., Shi, C., Azzopardi, G. and Petkov, N.: 2015, Recognition of Architectural and Electrical Symbols by COSFIRE Filters with Inhibition, Springer International Publishing, chapter Computer Analysis of Images and Patterns: 16th International Conference, CAIP 2015, Valletta, Malta, September 2-4, 2015, Proceedings, Part II, pp. 348–358.

Harding, S., Broadbent, D., Neoh, C., White, M. and Vora, J.: 1995, Sensitivity and specificity of photography and direct ophthalmoscopy in screening for sight threatening eye disease - The Liverpool diabetic eye study, British Medical Journal 311(7013), 1131–1135.

Heneghan, C., Flynn, J., O'Keefe, M. and Cahill, M.: 2002, Characterization of changes in blood vessel width and tortuosity in retinopathy of prematurity using image analysis, Medical Image Analysis 6(4), 407–429.

Hoover, A., Kouznetsova, V. and Goldbaum, M.: 2000, Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response, IEEE Transactions on medical imaging 19(3), 203–210.

Hubel, D. and Wiesel, T.: 1962, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, Journal of Physiology-London 160(1), 106–154.

Irvine, G., Casagrande, V. and Norton, T.: 1993, Center surround relationships of magnocellular, parvocellular, and koniocellular relay cells in primate lateral geniculate-nucleus, Visual Neuroscience 10(2), 363–373.

Jeffress, L. A.: 1948, A place theory of sound localization, Journal of Comparative and Physiological Psychology 41(1), 35–39.

Jiang, X. and Mojon, D.: 2003, Adaptive local thresholding by verification-based multithreshold probing with application to vessel detection in retinal images, IEEE Trans. Pattern Anal. Mach. Intell. 25(1), 131–137.

Joachims, T.: 1998, Text categorization with support vector machines: Learning with many relevant features, Proc. of the 10th European Conf. on Machine Learning, Springer-Verlag, pp. 137–142.

Joachims, T.: 2000, Estimating the generalization performance of an svm efficiently, Proceedings of the 17th Int. Conf. on Machine Learning, ICML '00, pp. 431–438.

Johnson, N. L.: 1949, Systems of frequency curves generated by methods of translation, Biometrika 36(1-2), 149–176.

Kong, A., Zhang, D. and Kamel, M.: 2009, A survey of palmprint recognition, Pattern Recognition 42(7), 1408–1418.


Lam, B., Gao, Y. and Liew, A.-C.: 2010, General retinal vessel segmentation using regularization-based multiconcavity modeling, IEEE Transactions on Medical Imaging 29(7), 1369–1381.

Lecomte, S., Lengelle, R., Richard, C., Capman, F. and Ravera, B.: 2011, Abnormal events detection using unsupervised one-class svm - application to audio surveillance and evaluation, IEEE AVSS, pp. 124–129.

Li, Q., Zhang, M. and Xu, G.: 2013, A novel element detection method in audio sensor networks, International Journal of Distributed Sensor Networks.

Liew, G., Wang, J. J., Mitchell, P. and Wong, T. Y.: 2008, Retinal vascular imaging: a new tool in microvascular disease research, Circ Cardiovasc Imaging 1, 156–61.

Liu, I. and Sun, Y.: 1993, Recursive tracking of vascular networks in angiograms based on the detection-deletion scheme, IEEE Transactions on medical imaging 12(2), 334–341.

Liu, L., Wang, L. and Liu, X.: 2011, In defense of soft-assignment coding, Computer Vision (ICCV), 2011 IEEE Int. Conf. on, pp. 2486–2493.

Liu, Z., Wang, Y. and Chen, T.: 1998, Audio Feature Extraction and Analysis for Scene Segmentation and Classification, The Journal of VLSI Signal Processing 20(1), 61–79.

Lopez-Poveda, E. A. and Eustaquio-Martín, A.: 2006, A biophysical model of the inner hair cell: The contribution of potassium currents to peripheral auditory compression, Journal of the Association for Research in Otolaryngology 7(3), 218–235. URL: http://dx.doi.org/10.1007/s10162-006-0037-8

Lv, Y., Duan, Y., Kang, W., Li, Z. and Wang, F.-Y.: 2014, Traffic flow prediction with big data: A deep learning approach, IEEE Trans. Intell. Transp. Syst. PP(99), 1–9.

Malik, H.: 2013, Acoustic environment identification and its applications to audio forensics, IEEE Trans. Inf. Forensics Security 8(11), 1827–1837.

Marin, D., Aquino, A., Emilio Gegundez-Arias, M. and Manuel Bravo, J.: 2011, A New Supervised Method for Blood Vessel Segmentation in Retinal Images by Using Gray-Level and Moment Invariants-Based Features, IEEE Transactions on medical imaging 30(1), 146–158.

Marmaroli, P., Carmona, M., Odobez, J.-M., Falourd, X. and Lissek, H.: 2013, Observation of vehicle axles through pass-by noise: A strategy of microphone array design, IEEE Trans. Intell. Transp. Syst. 14(4), 1654–1664.

Martinez-Perez, M. E., Hughes, A. D., Thom, S. A., Bharath, A. A. and Parker, K. H.: 2007, Segmentation of blood vessels from red-free and fluorescein retinal images, Medical Image Analysis 11(1), 47–61.

Matouš, K., Lepš, M., Zeman, J. and Šejnoha, M.: 2000, Applying genetic algorithms to selected topics commonly encountered in engineering practice, Computer Methods in Applied Mechanics and Engineering 190(13-14), 1629–1650.


Meddis, R.: 2006, Auditory-nerve first-spike latency and auditory absolute threshold: A computer model, The Journal of the Acoustical Society of America 119(1).

Mendonca, A. M. and Campilho, A.: 2006, Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction, IEEE Transactions on Medical Imaging 25(9), 1200–1213.

Muduli, P. and Pati, U.: 2013, A novel technique for wall crack detection using image fusion, Computer Communication and Informatics (ICCCI), 2013 International Conference on, pp. 1–6.

Nadeau, C. and Bengio, Y.: 2003, Inference for the generalization error, Machine Learning 52(3), 239–281.

National Physical Laboratory: 2015, Technical guides - calculation of road traffic noise 1988. URL: http://resource.npl.co.uk/acoustics/techguides/crtn/

Neocleous, A., Azzopardi, G., Schizas, C. and Petkov, N.: 2015, Filter-based approach for ornamentation detection and recognition in singing folk music, in G. Azzopardi and N. Petkov (eds), Computer Analysis of Images and Patterns, Vol. 9256 of Lecture Notes in Computer Science, Springer International Publishing, pp. 558–569.

Niemeijer, M., Staal, J., van Ginneken, B., Loog, M. and Abramoff, M.: 2004, Comparative study of retinal vessel segmentation methods on a new publicly available database, Proc. of the SPIE - The International Society for Optical Engineering, pp. 648–56. Medical Imaging 2004. Image Processing, 16-19 Feb. 2004, San Diego, CA, USA.

Ntalampiras, S., Potamitis, I. and Fakotakis, N.: 2009, An adaptive framework for acoustic monitoring of potential hazards, EURASIP J. Audio Speech Music Process. 2009, 13:1–13:15.

Ntalampiras, S., Potamitis, I. and Fakotakis, N.: 2011, Probabilistic novelty detection for acoustic surveillance under real-world conditions, IEEE Trans. Multimedia 13(4), 713–719.

Owen, C. G., Rudnicka, A. R., Mullen, R., Barman, S. A., Monekosso, D., Whincup, P. H., Ng, J. and Paterson, C.: 2009, Measuring retinal vessel tortuosity in 10-year-old children: validation of the computer-assisted image analysis of the retina (CAIAR) program, Invest Ophthalmol Vis Sci 50(5), 2004–10.

Palmer, A. and Russell, I.: 1986, Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells, Hearing Research 24(1), 1–15.

Pancoast, S. and Akbacak, M.: 2012, Bag-of-audio-words approach for multimedia event classification, Proc. of the Interspeech 2012 Conf.

Patterson, R. D. and Moore, B. C. J.: 1986, Auditory filters and excitation patterns as representations of frequency resolution, Frequency selectivity in hearing pp. 123–177.

Patterson, R. D., Robinson, K., Holdsworth, J., Mckeown, D., Zhang, C. and Allerhand, M.: 1992, Complex Sounds and auditory images, in Y. Cazals, L. Demany and K. Honer (eds), Auditory Physiology and Perception, Pergamon, Oxford, pp. 429–443.


Peeters, G.: 2004, A large set of audio features for sound description (similarity and classification) in the CUIDADO project, Tech. rep., IRCAM.

Petkov, N. and Visser, W. T.: 2005, Modifications of center-surround, spot detection and dot-pattern selective operators, Technical Report CS 2005-9-01, Institute of Mathematics and Computing Science, University of Groningen, The Netherlands.

Pham, Q.-C., Lapeyronnie, A., Baudry, C., Lucat, L., Sayd, P., Ambellouis, S., Sodoyer, D., Flancquart, A., Barcelo, A.-C., Heer, F., Ganansia, F. and Delcourt, V.: 2010, Audio-video surveillance system for public transportation, IEEE IPTA, pp. 47–53.

Phan, H., Hertel, L., Maass, M., Mazur, R. and Mertins, A.: 2015, Audio phrases for audio event recognition, 23rd European Signal Processing Conference, EUSIPCO 2015.

Pizer, S., Amburn, E., Austin, J., Cromartie, R., Geselowitz, A., Greer, T., Ter Haar Romeny, B., Zimmerman, J. and Zuiderveld, K.: 1987, Adaptive Histogram Equalization and its Variations, Computer Vision Graphics and Image Processing 39(3), 355–368.

Poppe, R.: 2010, A survey on vision-based human action recognition, Image and Vision Computing 28(6), 976–990.

Poveda, E. A. L. and Meddis, R.: 2001, A human nonlinear cochlear filterbank, J. Acoust. Soc. Am. 110(6), 3107–18.

Quartieri, J., N., M., Iannone, G., Guarnaccia, C., S., D., A, T. and T., L.: 2009, A review of traffic noise predictive models, Recent Advances in Applied and Theoretical Mechanics, pp. 72–80.

Rabaoui, A., Davy, M., Rossignol, S. and Ellouze, N.: 2008, Using one-class svms and wavelets for audio surveillance, IEEE Trans. Inf. Forensics Security 3(4), 763–775.

Rauscher, S., Messner, G. and Baur, P.: 2009, Enhanced automatic collision notification system - improved rescue care due to injury prediction - first field experience, ESV, pp. 1–10.

Ricci, E. and Perfetti, R.: 2007, Retinal blood vessel segmentation using line operators and support vector classification, IEEE Transactions on medical imaging 26(10), 1357–1365.

Rodieck, R. W.: 1965, Quantitative analysis of cat retinal ganglion cell response to visual stimuli, Vision research 5(23), 583–601.

Rouas, J.-L., Louradour, J. and Ambellouis, S.: 2006, Audio events detection in public transport vehicle, IEEE ITSC, pp. 733–738.

Roy, A., Magimai-Doss, M. and Marcel, S.: 2012, A fast parts-based approach to speaker verification using boosted slice classifiers, IEEE Trans. Inf. Forensics Security 7(1), 241–254.

Saquib, Z., Salam, N., Nair, R., Pandey, N. and Joshi, A.: 2010, A survey on automatic speaker recognition systems, Signal Processing and Multimedia, Vol. 123 of Communications in Computer and Information Science, Springer Berlin Heidelberg, pp. 134–145.

Page 161: UNIVERSITY OF GRONINGEN UNIVERSITY OF …nick/strisciuglio_phd.pdfUniversity of Salerno through a research grant on the project ”Embedded systems in critical domains” (cod. 4-17-12,

BIBLIOGRAPHY 135

Schneider, P., Biehl, M. and Hammer, B.: 2009a, Adaptive relevance matrices in learning vec-tor quantization, Neural Comput. 21(12), 3532–3561.

Schneider, P., Biehl, M. and Hammer, B.: 2009b, Distance learning in discriminative vectorquantization, Neural Computation 21(10), 2942–2969.

Setiawan, A., Mengko, T., Santoso, O. and Suksmono, A.: 2013, Color retinal image enhance-ment using clahe, ICT for Smart Society (ICISS), 2013 International Conference on, pp. 1–3.

SETRA, CERTU, LCPC and CSTB: 1995, Bruit des infrastructures routieres : meethode decalcul incluant les effets meteorologiques, version experimentale, nmpb-routes-96.

Shi, C., Guo, J., Azzopardi, G., Meijer, J. M., Jonkman, M. F. and Petkov, N.: 2015, AutomaticDifferentiation of u- and n-serrated Patterns in Direct Immunofluorescence Images, Springer In-ternational Publishing, chapter Computer Analysis of Images and Patterns: 16th Interna-tional Conference, CAIP 2015, Valletta, Malta, September 2-4, 2015 Proceedings, Part I,pp. 513–521.

Sivaraman, S., Morris, B. and Trivedi, M.: 2013, Observing on-road vehicle behavior: Issues,approaches, and perspectives, IEEE ITSC, pp. 1772–1777.

Sivaraman, S. and Trivedi, M.: 2013, Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis, IEEE Trans. Intell. Transp. Syst.14(4), 1773–1795.

Sivic, J. and Zisserman, A.: 2009, Efficient visual search of videos cast as text retrieval, IEEETrans. Pattern Anal. Mach. Intell 31(4), 591–606.

Soares, J. V. B., Leandro, J. J. G., Cesar, Jr., R. M., Jelinek, H. F. and Cree, M. J.: 2006, Reti-nal vessel segmentation using the 2-D Gabor wavelet and supervised classification, IEEETransactions on medical imaging 25(9), 1214–1222.

Sree, V. and Rao, P.: 2014, Diagnosis of ophthalmologic disordersin retinal fundus images,ICADIWT, 2014 5th Int. Conf. on the, pp. 131–136.

Staal, J., Abramoff, M., Niemeijer, M., Viergever, M. and van Ginneken, B.: 2004, Ridge-basedvessel segmentation in color images of the retina, IEEE Transactions on medical imaging23(4), 501–509.

Steele, C.: 2001, A critical review of some traffic noise prediction models, Applied Acoustics62(3), 271 – 287.

Strisciuglio, N., Azzopardi, G., Vento, M. and Petkov, N.: 2015a, Multiscale blood vesseldelineation using B-COSFIRE filters, Computer Analysis of Images and Patterns, Vol. 9257 ofLecture Notes in Computer Science, Springer International Publishing, pp. 300–312.

Strisciuglio, N., Azzopardi, G., Vento, M. and Petkov, N.: 2015b, Unsupervised delineation ofthe vessel tree in retinal fundus images, Computational Vision and Medical Image ProcessingVIPIMAGE 2015, pp. 149–155.

Page 162: UNIVERSITY OF GRONINGEN UNIVERSITY OF …nick/strisciuglio_phd.pdfUniversity of Salerno through a research grant on the project ”Embedded systems in critical domains” (cod. 4-17-12,

136 BIBLIOGRAPHY

Sturm, B.: 2014, A survey of evaluation in music genre recognition, Adaptive Multimedia Re-trieval: Semantics, Context, and Adaptation, Vol. 8382 of Lecture Notes in Computer Science,Springer International Publishing, pp. 29–66.

Technical Committee ISO/TC 43, Acoustics, S.: 1996, Iso 9613-2 - acoustics - attenuation ofsound during propagation outdoors.

Tolias, Y. and Panas, S.: 1998, A fuzzy vessel tracking algorithm for retinal images based onfuzzy clustering, IEEE Transactions on medical imaging 17(2), 263–273.

United Kingdom Department of Environment and welsh Office Joint Publication, HMSO:1975, Calculation of road traffic noise.

Vacher, M., Istrate, D., Besacier, L., Serignat, J. F. and Castelli, E.: 2004, Sound Detection andClassification for Medical Telesurvey, in C. ACTA Press (ed.), Proc. 2nd ICBME, Innsbruck,Austria, pp. 395–398.

Valenzise, G., Gerosa, L., Tagliasacchi, M., Antonacci, F. and Sarti, A.: 2007, Scream and gun-shot detection and localization for audio-surveillance systems, IEEE AVSS, pp. 21–26.

von Wendt, G., Heikkila, K. and Summanen, P.: 1999, Assessment of diabetic retinopathyusing two-field 60 degrees fundus photography. A comparison between red-free, black-and-white prints and colour transparencies, Acta Ophthalmologica Scandinavica 77(6), 638–647.

Vu, V.-T., Bremond, F., Davini, G., Thonnat, M., Pham, Q.-C., Allezard, N., Sayd, P., Rouas,J.-L., Ambellouis, S. and Flancquart, A.: 2006, Audio-video event recognition system forpublic transport security, Crime and Security, 2006. The Institution of Engineering and Technol-ogy Conference on, pp. 414–419.

Wang, L., Yung, N. and Xu, L.: 2014, Multiple-human tracking by iterative data associationand detection update, IEEE Trans. Intell. Transp. Syst. 15(5), 1886–1899.

White, J., Thompson, C., Turner, H., Dougherty, B. and Schmidt, D.: 2011, Wreckwatch: Au-tomatic traffic accident detection and notification with smartphones, Mobile Networks andApplications 16(3), 285–303.

Xu, X., Bonds, A. and Casagrande, V.: 2002, Modeling receptive-field structure of koniocel-lular, magnocellular, and parvocellular LGN cells in the owl monkey (Aotus trivigatus),Visual Neuroscience 19(6), 703–711.

Zajdel, W., Krijnders, J., Andringa, T. and Gavrila, D.: 2007, Cassandra: audio-video sensorfusion for aggression detection, IEEE AVSS, pp. 200–205.

Zana, F. and Klein, J.: 2001, Segmentation of vessel-like patterns using mathematical mor-phology and curvature evaluation, IEEE Transactions on medical imaging 10(7), 1010–1019.

Page 163: UNIVERSITY OF GRONINGEN UNIVERSITY OF …nick/strisciuglio_phd.pdfUniversity of Salerno through a research grant on the project ”Embedded systems in critical domains” (cod. 4-17-12,

BIBLIOGRAPHY 137

Zhang, J., Bekkers, E., Abbasi, S., Dashtbozorg, B. and ter Haar Romeny, B.: 2015, Robust andfast vessel segmentation via gaussian derivatives in orientation scores, Image Analysis andProcessing - ICIAP 2015, Vol. 9279 of Lecture Notes in Computer Science, Springer InternationalPublishing, pp. 537–547.

Zhang, L., Zhang, Y., Wang, M. and Li, Y.: 2009, Adaptive river segmentation in sar images,Journal of Electronics (China) 26(4), 438–442.

Zheng, F., Zhang, G. and Song, Z.: 2001, Comparison of different implementations of mfcc,Journal of Computer Science and Technology 16(6), 582–589.

Zhou, L., Rzeszotarski, M., Singerman, L. and Chokreff, J.: 1994, The detection and quan-tification of retinopathy using digital angiograms, IEEE Transactions on medical imaging13(4), 619–626.

Zhu, W.-B., Li, B., Tian, L.-F., Li, X.-X. and Chen, Q.-L.: 2015, Topology adaptive vessel net-work skeleton extraction with novel medialness measuring function, Computers in Biologyand Medicine 64, 40 – 61.

Zwicker, E.: 1961, Subdivision of the audible frequency range into critical bands (frequenz-gruppen), The Journal of the Acoustical Society of America 33(2), 248–248.

Page 164: UNIVERSITY OF GRONINGEN UNIVERSITY OF …nick/strisciuglio_phd.pdfUniversity of Salerno through a research grant on the project ”Embedded systems in critical domains” (cod. 4-17-12,
Page 165: UNIVERSITY OF GRONINGEN UNIVERSITY OF …nick/strisciuglio_phd.pdfUniversity of Salerno through a research grant on the project ”Embedded systems in critical domains” (cod. 4-17-12,

Research Activities

Journal Papers

• George Azzopardi, Nicola Strisciuglio, Mario Vento, Nicolai Petkov, “Trainable COSFIRE filters for vessel delineation with application to retinal images”, Medical Image Analysis, Volume 19, Issue 1, January 2015, Pages 46-57, ISSN 1361-8415, http://dx.doi.org/10.1016/j.media.2014.08.002

• Pasquale Foggia, Nicolai Petkov, Alessia Saggese, Nicola Strisciuglio, Mario Vento, “Reliable detection of audio events in highly noisy environments”, Pattern Recognition Letters, Volume 65, 1 November 2015, Pages 22-28, ISSN 0167-8655, http://dx.doi.org/10.1016/j.patrec.2015.06.026

• Pasquale Foggia, Nicolai Petkov, Alessia Saggese, Nicola Strisciuglio, Mario Vento, “Audio Surveillance of Roads: A System for Detecting Anomalous Sounds”, IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 1, pp. 279-288, 2016, doi: 10.1109/TITS.2015.2470216

• Nicola Strisciuglio, George Azzopardi, Mario Vento, Nicolai Petkov, “Supervised vessel delineation in retinal fundus images with the automatic selection of B-COSFIRE filters”, accepted for publication in Machine Vision and Applications, 2016

• Nicola Strisciuglio, Nicolai Petkov, Mario Vento, “CoPE: Trainable filters for feature extraction in audio signals”, submitted 2016

Conference Proceedings

• Vincenzo Carletti, Pasquale Foggia, Gennaro Percannella, Alessia Saggese, Nicola Strisciuglio, Mario Vento, Audio surveillance using a bag of aural words classifier, Advanced Video and Signal Based Surveillance (AVSS), 2013 10th IEEE International Conference on, 2013

• George Azzopardi, Nicola Strisciuglio, Mario Vento, Nicolai Petkov, Vessels delineation in retinal images using COSFIRE filters, Netherlands Conference on Computer Vision, 2014

• Pasquale Foggia, Alessia Saggese, Nicola Strisciuglio, Mario Vento, Exploiting the deep learning paradigm for recognizing human actions, Advanced Video and Signal Based Surveillance (AVSS), 2014 11th IEEE International Conference on, 2014

• Pasquale Foggia, Alessia Saggese, Nicola Strisciuglio, Mario Vento, Cascade classifiers trained on Gammatonegrams for reliably detecting Audio Events, Advanced Video and Signal Based Surveillance (AVSS), 2014 11th IEEE International Conference on, 2014

• Pasquale Foggia, Alessia Saggese, Nicola Strisciuglio, Mario Vento, Nicolai Petkov, Car crashes detection by audio analysis in crowded roads, Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, 2015

• Nicola Strisciuglio, George Azzopardi, Mario Vento, Nicolai Petkov, Multiscale Blood Vessel Delineation Using B-COSFIRE Filters, Computer Analysis of Images and Patterns, 2015

• Nicola Strisciuglio, George Azzopardi, Mario Vento, Nicolai Petkov, Unsupervised delineation of the vessel tree in retinal fundus images, Computational Vision and Medical Image Processing V: Proceedings of the 5th Eccomas Thematic Conference on Computational Vision and Medical Image Processing (VipIMAGE 2015, Tenerife, Spain, October 19-21, 2015), 2015

Awards

• Best Paper Award: Nicola Strisciuglio, George Azzopardi, Mario Vento, Nicolai Petkov, Unsupervised delineation of the vessel tree in retinal fundus images, Computational Vision and Medical Image Processing V: Proceedings of the 5th Eccomas Thematic Conference on Computational Vision and Medical Image Processing (VipIMAGE 2015, Tenerife, Spain, October 19-21, 2015), 2015

Attended Conferences

• 1st Netherlands Conference on Computer Vision (NCCV), Ermelo, The Netherlands, 2014

• Italian National Conference GIRPR (Gruppo Italiano Ricercatori in Pattern Recognition), Ascea Marina (Salerno), 2014

• 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Seoul, South Korea, 2014

• 16th International Conference on Computer Analysis of Images and Patterns, CAIP 2015, Valletta, Malta, 2015

Summer Schools

• ICVSS, International Computer Vision Summer School, Ragusa, Sicily, July 2014.

Committees

• Local committee - Italian National Conference GIRPR (Gruppo Italiano Ricercatori in Pattern Recognition), Ascea Marina (Salerno), 2014

Referee

Serving as referee for the following journals:

• Medical Image Analysis, Elsevier

• IEEE Transactions on Medical Imaging

• Pattern Recognition, Elsevier

• Image and Vision Computing, Elsevier

• PLOS ONE

• Pattern Recognition Letters, Elsevier

• IEEE Transactions on Intelligent Transportation Systems

• Electronic Letters on Computer Vision and Image Analysis

• Journal on Multimodal User Interfaces, Springer

• Expert Systems with Applications, Elsevier

• South African Computer Journal
