

Using MFCC and Spectro-Temporal Gabor Filter Bank Features for Acoustic Event Detection

Umair Zafar Khan1, Abdul Wahid2, Dr. Usman Akram3

1 College of Electrical and Mechanical Engineering, NUST, Rawalpindi, Pakistan
2 Superior University, Lahore, Pakistan
3 College of Electrical and Mechanical Engineering, NUST, Rawalpindi, Pakistan

ABSTRACT

Acoustic event detection (AED) is concerned with the recognition of sounds produced by humans, by objects handled by humans, or by nature. Detecting acoustic events is an important task for intelligent systems, which should recognize not only speech but also the sounds of indoor and outdoor environments; applications include information retrieval, audio-based surveillance and monitoring systems. Current systems for detecting and classifying everyday monophonic sounds are mature enough to extract features and detect isolated events fairly accurately, but accuracy drops sharply for large datasets and for noisy or overlapping audio events. Most real-life sounds are polyphonic, and events partially overlap, which makes them harder to detect. In this work we discuss these issues in the detection and feature extraction of acoustic events. We use the DCASE dataset published for the international IEEE AASP challenge on acoustic event detection, which includes the “office live” recordings prepared in an office environment. MFCC is a technique commonly used for feature extraction in speech and acoustic event recognition. We propose to use a Gabor filter bank in addition to MFCC coefficients for feature analysis. For classification we use a decision tree algorithm, which gives better classification and detection results. Finally, we compare our proposed system with each system submitted for the DCASE dataset and conclude that our technique gives the best F-score for event detection.

Index Terms— Acoustic event detection (AED), feature extraction, MFCC, Gabor filter bank, polyphonic, monophonic, classification

1. INTRODUCTION

Given the strong recognition and classification results of automatic speech recognition frameworks, it is necessary to address the problem of acoustic event detection, which has seen comparatively limited progress. Security cameras are normally used for monitoring and surveillance, but in the absence of light it is difficult to monitor unusual events.

Most alarm systems work through cameras by capturing unusual gestures or movement, which works well in good lighting but can fail in darkness or low light. For that particular problem, acoustic sensors can be used alongside cameras for monitoring and security surveillance. Many monitoring systems use video cameras for fire detection in forests or for traffic accidents, but at times they cannot work well in special situations, especially without sufficient light or when the sightline is blocked. Under these circumstances, audio sensors can provide sufficient information. It is becoming increasingly important to use audio sensors to improve the effectiveness of monitoring systems, especially when video cameras cannot work effectively. Acoustic-based monitoring systems have been studied for many years. In [37] a novel method is used to detect human coughing in an office, and in [40] an SVM-based method is used for monitoring an office environment, in which the system detects impulsive sounds such as door alarms and human crying. Others use HMM-based acoustic systems for the detection of gunshots and car crashes [39]. In a specialized monitoring system, the threats of interest depend on the location; for example, a forest monitoring system should detect animal screams, whereas a street or urban system need not. It is necessary to judge whether an event is expected to happen at a specific time and a specific location, so we first collect the required sounds and use them for monitoring and security purposes [36]. Most work on acoustic event detection has been done in quiet environments, e.g. office buildings, meeting rooms, patient rooms and monitoring of elderly people [35], but such systems do not work well in noisy conditions or when multiple sounds overlap.

Our research is aimed at using acoustic signals as additional information for the automatic detection and analysis of acoustic situations. Compared with acoustic event classification, acoustic event detection (AED) is a more motivating and, so far, a more complicated task, because we need to find not only the characteristics of the events but


also their position in time. The speech recognition problem is largely solved, but the acoustic event detection problem still needs more work, because environmental sounds carry noise, large intra-class variations, largely overlapping frequency spectra and event overlaps, all of which make the task more difficult. In Section 2 we present the background, discussing earlier methods, reviewing different feature extraction approaches and describing the dataset. In Section 3 our proposed technique for feature extraction using both the 2-D Gabor filter bank and MFCC is briefly described, and Section 4 compares it with the different approaches submitted for the DCASE dataset. Finally, in Section 5, conclusions are drawn.

2. BACKGROUND

For humans, speech is typically the most informative acoustic event, but environmental sounds form a rich variety of acoustic events that occur around us in our surroundings; they carry vital information and cues and should not be ignored. We cannot use an automatic speech recognition (ASR) framework directly for acoustic event detection, because the spectra of environmental sounds are distinctive and are not detected properly by such a framework. Environmental and other sounds are treated as “acoustic events”, which have properties such as onset, duration and offset times, with a frequency content that characterizes the source of the sound. The acoustic characteristics of speech and music differ from those of acoustic events: the number of classes, window length and shift length are well defined for speech and music but not for acoustic events; the bandwidth is narrow in speech and music but broad for acoustic events; and the sound is almost stationary in speech and music, whereas environmental sounds vary considerably. That is why new feature extraction and learning approaches are used for acoustic event detection. It is common to extract acoustic features frame by frame and then use Hidden Markov Models (HMMs) to find the most likely sequence of phonemes for the given features [3]. In the case of speaker recognition, acoustic features are extracted as in speech, but the frames are not decoded into a sequence of words. Instead, it is common to use clustering techniques to identify different speakers, especially as the number of speakers may not be known in advance.

For acoustic event recognition, the foundation is similar, and it is common to use the same acoustic features and pattern recognition systems found in speech and speaker recognition [14]. However, the scope of acoustic events is much broader, as it includes environmental sounds, which have a wider range of characteristics. In addition, the environments in which acoustic events occur are considered unstructured, meaning there may be background noise, multiple sound sources and reverberation, which makes recognition much harder [7]. Acoustic event recognition systems are therefore based on these principles and often incorporate different techniques. Given a short audio clip, a recognition system must determine which acoustic event in its training database is the closest match to the new sound. For that purpose we need a good feature extraction technique. Most systems follow the same detection-by-classification scheme used in speech and event detection systems. Fig. 1 shows the complete procedure of this technique [28], in which one portion is for training and the second for testing.

Fig.1. Diagram of an acoustic event detection system

A continuous signal is first sampled and windowed; the next step is to compute a spectrum or spectrogram, depending on the feature extraction technique. After the features are found, a classifier is trained on them, and classification is applied to the training dataset to check the classification result on the individual dataset. After these steps, performance is measured on the testing dataset, which is passed through the trained classifier, where event patterns are matched and events are detected. These are the steps used in most event detection and speech recognition systems.
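As an illustration of the front end described above, the sketch below frames a sampled signal, applies a window, and computes a magnitude spectrogram. The frame length, hop size and Hamming window here are illustrative choices, not values prescribed by this paper.

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split a sampled signal into overlapping frames (frame_len and hop are illustrative)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def spectrogram(x, frame_len=1024, hop=512):
    """Hamming-windowed magnitude spectrogram of a 1-D signal."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: (n_frames, frame_len // 2 + 1)

# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
S = spectrogram(np.sin(2 * np.pi * 440 * t))
print(S.shape)
```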

2.1. Activity Detection

Activity detection concerns finding the start and end points of acoustic events in a continuous audio stream, so that the classification system only deals with active segments. The process can also be called acoustic event detection or voice


activity detection (VAD), especially in the case of speech/non-speech segmentation. This is only required for live audio, when the input is a continuous stream; during development it is common to evaluate an acoustic event recognition system on sound clips containing isolated events. The outline of a typical system, taken from [28], is shown in Fig. 2, where features are first extracted from the continuous signal, a decision is made, and post-processing then smooths the detector output. Algorithms for the decision module often fall into two groups: frame-threshold detection and detection-by-classification [23]. The former makes a decision based on a frame-level feature that indicates whether the frame contains activity or noise. The decision module is simply a threshold: if the feature output is greater than a defined value, the decision is positive. It extracts the active segments from the continuous audio stream to pass into a classification system. The simplest feature for such a system is the frame power level, where a frame is marked as active if its total power exceeds a threshold. However, this is very simplistic and is prone to errors in non-stationary noise.

Fig.2. Block Diagram of an Activity Detector

Other possible features include pitch estimation, zero-crossing rate or higher-order statistics, with further improvements in performance reported when the features use a longer time window, as with spectral divergence features [16]. The advantages are low computational cost and real-time processing, but there are disadvantages, such as the choice of threshold, which is crucial and may vary over time, and the size of the smoothing window, which may need to be relatively long to obtain a robust decision. Detection-by-classification methods do not suffer from such problems, as a classifier, rather than a threshold, is used to label each segment as noise or non-noise. In this configuration, a sliding window is passed over the signal and a set of features is extracted from each window. These are passed to a classifier trained to discriminate between noise and other non-noise events. The classifier must go through a training phase, in which it learns which features represent noise. A simple, unsupervised decision system could be based on clustering and Gaussian Mixture Modelling (GMM). Here, a short audio segment containing both noise and non-noise is clustered in the training phase, so that one cluster should contain the noise and the other clusters the events. A GMM can then be fitted to the distribution of each cluster, so that future frames are compared with each GMM and the most likely one is chosen as the label.
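A minimal sketch of that unsupervised idea, assuming scikit-learn is available and that the first feature column is a per-frame log-energy (a simplifying assumption; any frame feature set could be used):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def label_frames_gmm(frame_features, n_clusters=2):
    """Unsupervised frame labelling: fit a GMM and treat the lowest-energy
    component as 'noise', the rest as 'event' (a simplifying assumption)."""
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(frame_features)
    labels = gmm.predict(frame_features)
    noise_cluster = np.argmin(gmm.means_[:, 0])   # assumes column 0 is log-energy
    return labels != noise_cluster                # True = active (non-noise) frame

# Toy data: 250 low-energy frames followed by 250 high-energy frames
frames = np.random.randn(500, 1) + np.repeat([[0.0], [5.0]], 250, axis=0)
active = label_frames_gmm(frames)
print(active.mean())
```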

2.2. Feature Extraction

The objective of feature extraction is to compress the audio signal into a vector that is representative of the class of acoustic event it is trying to characterize. A good feature should be insensitive to external influences such as environmental noise, and should emphasize the differences between different classes of sounds. This makes the classification task easier, as it is simple to discriminate between classes of sounds that are separable. There are two approaches to extracting features, which vary according to the time extent covered by the features [12]. They are either global, where the descriptor is generated over the whole signal, or instantaneous, where a descriptor is generated from each short time frame of around 30-60 ms over the duration of the signal. The second method is the most popular and is often called the bag-of-frames approach [19]. The sequence of feature vectors contains information about the short-term nature of the signal and hence needs to be aggregated over time, for example using HMMs. As speech recognition is the dominant field in audio pattern recognition, it is common for feature extraction methods developed for speech to be used directly for generic acoustic events. The most popular feature is the Mel-Frequency Cepstral Coefficients (MFCC), although others, such as Linear Prediction Cepstral Coefficients (LPCC), are also used. However, there are many ways in which the signal can be analyzed [4], so a wide variety of other features have been developed to capture the information contained in the signal. These usually fall into the following categories [12]:

2.2.1. Low-level Audio Features

The sound of an event can be described through a common set of characteristics, such as pitch, volume, duration and timbre [6]. While common ASR features can capture these to an extent, they often do so implicitly rather than by designing a feature to capture a specific characteristic directly. For example, MFCCs capture loudness through the zeroth coefficient, and pitch and timbre are represented in the remaining coefficients. Hence, low-level audio features


may not provide the best representation for the full range of sound characteristics, despite providing a good sound event recognition (SER) baseline performance [11]. In particular, timbre is important, as it represents a range of distinctive characteristics of the sound. Therefore, several works focus on extracting novel features that characterize aspects of the sound event timbre. In particular, these aim to capture elements such as spectral brightness, roll-off, bandwidth and harmonicity. Brightness is defined as the centroid of the power spectrum, while spectral roll-off measures the frequency below which a certain percentage of the power resides; both measure the high-frequency content of the signal. Bandwidth captures the spread of the spectral information around the centroid, while harmonicity measures the deviation of the sound event from a perfect harmonic spectrum. Many of these novel sound features are standardized in the MPEG-7 framework to provide a unified interface for modelling audio information [13]. MPEG-7 also includes an audio spectrum projection (ASP) feature, which can be seen as a generalization of the traditional MFCC approach, with a class-specific basis projection used in place of the DCT [8]. However, it was shown in [14] that MFCC features can still outperform MPEG-7 ASP features on a simple sound recognition task. Given such a large set of audio features to choose from, other works have focused on feature selection. These approaches aim to automate the selection of a suitable feature set for a given sound class that discriminates well against other sounds. For example, in [15], 138 low-level audio features are extracted and a decision tree classifier is used to pick appropriate features for modelling each sound object. Another example, in [17], uses the correlation-based feature selection (CFS) method on a base set of 79 features, using the implementation in the WEKA toolkit [26]. Such feature selection approaches try to determine the best subset of features from a predefined set.
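As an illustration of the descriptors just mentioned, the following sketch computes brightness (spectral centroid), roll-off and bandwidth from a single magnitude-spectrum frame; the 85% roll-off threshold is a common but arbitrary choice.

```python
import numpy as np

def spectral_descriptors(mag_frame, freqs, rolloff_pct=0.85):
    """Brightness (spectral centroid), roll-off and bandwidth of one magnitude-spectrum frame."""
    power = mag_frame ** 2
    total = power.sum() + 1e-12
    centroid = (freqs * power).sum() / total                      # brightness
    cumulative = np.cumsum(power)
    rolloff = freqs[np.searchsorted(cumulative, rolloff_pct * total)]
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)
    return centroid, rolloff, bandwidth

# Toy frame: random magnitude spectrum for a 1024-point FFT at 16 kHz
freqs = np.fft.rfftfreq(1024, d=1 / 16000)
mag = np.random.rand(freqs.size)
print(spectral_descriptors(mag, freqs))
```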

2.2.2. Temporal Features

The varied nature of acoustic signals means that representing their frequency content alone may not be sufficient for classification. A simple approach is to combine frame-based spectral features, such as MFCCs, with their delta and delta-delta coefficients to capture local temporal transitions. However, other features aim specifically to capture the important information in the temporal domain. Temporal information can be extracted across a range of different time and frequency scales, including capturing both spectral and temporal information in the feature. An early approach to temporal feature extraction for ASR is called “temporal patterns” (TRAPS), which extracts features over a long temporal window from each frequency subband. More recently, numerous related approaches have been proposed for AED. One example can be found in [30], where the aim is to characterize sound events through a “morphological” description of the temporal information in the signal. This includes properties such as the dynamic profile, e.g. whether the sound has an ascending or descending energy profile, the melodic profile, describing the change in pitch, and the complex-iterative nature of sound repetitions. The advantage of this approach is that it naturally describes sound events in a form similar to human description, making the technique useful for indexing sounds for an audio search engine. A different approach is proposed in [15], where the aim is to characterize sound events through a parametric representation of their subband temporal envelope. This captures the distinctive spectro-temporal signature of the sound events and allows sounds to be compared using a distance measure based on the parameterized representation. Yet another way of capturing temporal information is temporal feature integration, which transforms a set of frame-based features into a segment-level feature vector for classification [23]. This is more common in AED than in ASR, as acoustic events occur in isolation more often than the connected phonemes of speech do. Typically a statistical model of the temporal information is used, capturing parameters such as the mean, variance and higher-order statistics [32]. However, these simple statistics ignore the temporal dynamics among successive feature vectors; one solution is to model the temporal information by fitting an autoregressive (AR) model to the sequence [21]. A sketch of this kind of integration is given below.
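A minimal sketch of temporal feature integration under simple assumptions: per-dimension mean and variance plus least-squares AR coefficients, collapsing a sequence of frame features into one segment-level vector.

```python
import numpy as np

def integrate_temporal(frame_feats, ar_order=2):
    """Collapse a (n_frames, n_dims) sequence of frame features into one
    segment-level vector: per-dimension mean, variance, and AR coefficients
    fitted to each dimension (a simple stand-in for richer temporal models)."""
    mean = frame_feats.mean(axis=0)
    var = frame_feats.var(axis=0)
    ar_coeffs = []
    for dim in frame_feats.T:
        # least-squares fit of x[t] from its ar_order previous values
        X = np.column_stack([dim[i:len(dim) - ar_order + i] for i in range(ar_order)])
        y = dim[ar_order:]
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        ar_coeffs.append(coeffs)
    return np.concatenate([mean, var, np.concatenate(ar_coeffs)])

segment = np.random.randn(200, 13)          # e.g. 200 frames of 13 MFCCs
print(integrate_temporal(segment).shape)    # 13 + 13 + 13*2 = 52
```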

2.2.3. Spectro-Temporal Features

A natural extension to both spectral and temporal feature extraction is to consider features that jointly model the spectro-temporal information. For ASR this approach has been used to improve the capture of certain characteristics of speech, such as formants and formant transitions. The most common approach is to use the correlation between a set of wavelet functions and the time-frequency base representation to extract a conventional feature for classification. The most popular wavelet representation is based on complex Gabor functions, two-dimensional sine-modulated Gaussian functions that can be tuned to model a range of spectro-temporal patterns. Recently, there has been interest in extracting information from local time-frequency regions of the spectrogram. One method is to take the two-dimensional Discrete Cosine Transform (2D-DCT) of each local time-frequency patch on a regular grid, which is equivalent to correlating with a set of 2D-DCT bases [24]. The results can then be concatenated to form a frame-based feature, which can be improved further by removing the higher-order components to provide both smoothing and dimensionality reduction [31]; a small sketch of this is given below. While such approaches often use a fixed set of basis functions, a recent proposal aims to learn the spectro-temporal modulation functions from the data [33]. Independent Component Analysis (ICA) is used for this purpose, and it is shown that this approach can learn functions that give improved performance.
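A minimal sketch of the local 2D-DCT idea, assuming SciPy is available; the patch size, grid step and number of retained low-order coefficients are illustrative choices.

```python
import numpy as np
from scipy.fft import dctn

def patch_dct_features(log_spec, patch=(9, 9), step=4, keep=3):
    """2D-DCT of local time-frequency patches on a regular grid, keeping only
    the low-order (keep x keep) coefficients for smoothing and dimensionality
    reduction. Patch size, step and 'keep' are illustrative choices."""
    n_frames, n_bins = log_spec.shape
    feats = []
    for t in range(0, n_frames - patch[0] + 1, step):
        row = []
        for f in range(0, n_bins - patch[1] + 1, step):
            block = log_spec[t:t + patch[0], f:f + patch[1]]
            coeffs = dctn(block, norm='ortho')[:keep, :keep]
            row.append(coeffs.ravel())
        feats.append(np.concatenate(row))
    return np.asarray(feats)

log_spec = np.log(np.random.rand(100, 40) + 1e-6)   # toy log-spectrogram
print(patch_dct_features(log_spec).shape)
```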


A related approach is based on decomposition of the time domain signal using Matching Pursuit (MP). This provides an efficient way of selecting a small basis set that represents the signal with only a small residual error. As before, a Gabor wavelet dictionary is commonly used [25], as it can better capture the non-stationary time-frequency characteristics in the signal compared to the one-dimensional Haar or Fourier bases [29]. It has also been noted that the Gabor bases are more effective at reconstructing a signal from only a small number of bases [27].
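A toy sketch of greedy matching pursuit over a small dictionary of one-dimensional Gabor atoms; the atom parameters and dictionary size are arbitrary illustrative values, not those of any cited system.

```python
import numpy as np

def gabor_atom(n, center, freq, width):
    """One-dimensional Gabor atom: Gaussian envelope times a cosine carrier."""
    t = np.arange(n)
    atom = np.exp(-0.5 * ((t - center) / width) ** 2) * np.cos(2 * np.pi * freq * t)
    return atom / (np.linalg.norm(atom) + 1e-12)

def matching_pursuit(x, dictionary, n_atoms=10):
    """Greedy MP: repeatedly pick the atom most correlated with the residual."""
    residual = x.astype(float).copy()
    decomposition = []
    for _ in range(n_atoms):
        correlations = dictionary @ residual
        best = np.argmax(np.abs(correlations))
        coeff = correlations[best]
        residual -= coeff * dictionary[best]
        decomposition.append((best, coeff))
    return decomposition, residual

n = 256
dictionary = np.array([gabor_atom(n, c, f, 20.0)
                       for c in range(0, n, 32) for f in (0.01, 0.05, 0.1)])
x = 0.8 * dictionary[5] + 0.3 * dictionary[12] + 0.01 * np.random.randn(n)
atoms, residual = matching_pursuit(x, dictionary)
print(atoms[:2], np.linalg.norm(residual))
```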

2.3. Pattern Classification

After feature extraction, the pattern can be classified as belonging to one of the classes presented during training, and the label is applied to the audio segment. As mentioned earlier, it is vital for the features of different classes to be divergent and distinguishable, as it is not possible for a classifier to differentiate between overlapping sound classes. The main classification methods, Hidden Markov Models (HMM), Support Vector Machines (SVM) and Gaussian Mixture Models (GMM), are discussed in detail elsewhere; further methods are summarized as follows [9]. k-Nearest Neighbours (k-NN): a simple algorithm that assigns a class label to a test pattern by a majority vote of its k nearest training patterns. It is often described as a lazy algorithm, as all computation is deferred until test time, so it can be slow for a large number of training samples; in the 1-NN case the method achieves 100% recall on the training patterns, which is unmatched. Dynamic Time Warping (DTW): an algorithm that can find the similarity between two sequences that vary in timing or speed. It works well with the bag-of-frames approach and can decode the same word spoken at different speeds, although it has largely been superseded by HMMs for automatic speech recognition (ASR); a small DTW sketch is given after this paragraph. Artificial Neural Networks (ANN): this technique, also referred to as the Multilayer Perceptron (MLP), is a computational model inspired by the neurons in the brain. Given enough hidden neurons it is known to be a universal approximator, but it is frequently criticized for being a black box, as the purpose of each neuron in the network is hard to interpret. It also suffers from difficulties in training, as the most common method, back-propagation, can get caught in local minima. Hidden Markov Models (HMM): a Markov model consists of a set of connected states, where transitions between states are governed by a set of probabilities and the next state depends only on the current state of the model. In a Hidden Markov Model only the output, or observation, from each state is visible to an observer. Hence, knowing the transition and emission probability distributions, we can compute from a sequence of observations the most likely sequence of states that could account for them [3].
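A classic dynamic-time-warping distance, written out as a small sketch to make the "varying speed" idea concrete; the absolute-difference local cost and the toy sine sequences are illustrative choices.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D sequences,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# The same shape traced at two different speeds still matches closely.
slow = np.sin(np.linspace(0, 2 * np.pi, 80))
fast = np.sin(np.linspace(0, 2 * np.pi, 50))
print(dtw_distance(slow, fast), dtw_distance(slow, -fast))
```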

2.4. Proposed Dataset (D-CASE Acoustic Event Database)

We address the difficulty of detecting acoustic events in an office setting, making use of the office environment available within Queen Mary University of London. The problem statement is the same as for the CLEAR dataset used in [34], which also tackled AED in an office environment. To encourage researchers, attract wide participation and explore the challenge of overlapping audio scenes, there are two subtasks: detection of non-overlapping sounds, and detection of overlapping sounds. In both cases the system has to detect the prominent events in the presence of background noise. The development (validation) and test datasets, called Office Live (OL), consist of recordings of approximately one minute of everyday audio events in several office environments (rooms of different size and absorbent quality, different numbers of people in the rooms and varied sounds). The audio events include:

Audio Event Name    Duration
Alert               40 sec
Clear throat        23 sec
Cough               23 sec
Door slam           44 sec
Drawer              33 sec
Keyboard            1 min 16 sec
Keys                41 sec
Knock               26 sec
Laugh               30 sec
Mouse               29 sec
Page turns          1 min 03 sec
Pen drop            16 sec
Phone               3 min 05 sec
Printer             7 min 01 sec
Speech              1 min
Switch              10 sec
TOTAL               18 min 49 sec

Table.1: Structure of the utilized dataset.


In the second, overlapping-sound (OS) dataset, the recorded audio events overlap substantially with one another.

3. PROPOSED TECHNIQUE

This section describes the proposed method for the detection and classification of acoustic events. We use MFCC and a Gabor filter bank for feature extraction; these methods work as follows.

3.1. MEL FREQUENCY CEPSTRAL COEFFICIENTS (MFCC)

The Mel frequency cepstral coefficient (MFCC) is a commonly used method for extracting features from acoustic signals. Its main disadvantage is sensitivity to noise, owing to its dependence on the spectral shape. The Mel scale is an approximately nonlinear frequency scale that is roughly linear up to about 1 kHz and logarithmic above 1 kHz. It is motivated by the fact that the human auditory system becomes less frequency-selective as frequency increases above 1 kHz. MFCC features correspond to the cepstrum of the log filter-bank energies. The energy of each filter is computed from the DFT of the frame:

E_t(m) = \sum_{n=0}^{N-1} |X_t(n)|^2 \, H_m(n), \qquad m = 1, \dots, M \qquad (1)

where X_t(n) is the DFT of the t-th frame, H_m(n) is the frequency response of the m-th filter in the filter bank, N is the size of the transform window and M is the number of filters. Next, the discrete cosine transform (DCT) of the log energies is computed:

c_t(j) = \sum_{m=1}^{M} \log\big(E_t(m)\big)\,\cos\!\Big(\frac{\pi j}{M}\big(m - \tfrac{1}{2}\big)\Big) \qquad (2)

Since the human auditory system is sensitive to the evolution over time of the frequency content of the signal, an effort is often made to include this information as part of the feature analysis. To capture the change in the coefficients, the first- and second-difference coefficients are calculated, respectively, as

\Delta c_t(j) = c_{t+\tau}(j) - c_{t-\tau}(j), \qquad \Delta\Delta c_t(j) = \Delta c_{t+\tau}(j) - \Delta c_{t-\tau}(j) \qquad (3)

These dynamic coefficients are concatenated with the static coefficients to give the final result of the feature analysis, which represents the audio frame t:

y_t = \big[c_t^{\mathsf T}, \; \Delta c_t^{\mathsf T}, \; \Delta\Delta c_t^{\mathsf T}\big]^{\mathsf T} \qquad (4)
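A compact sketch of this computation under the assumptions above, using NumPy and SciPy; the Mel filter bank here is a random placeholder standing in for real triangular filters, and the simple symmetric differences of Eq. (3) are used for the dynamic coefficients.

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_filterbank(power_spec, mel_fb, n_ceps=13):
    """MFCCs from a power spectrogram and a Mel filter bank, following
    Eqs. (1)-(2): filter-bank energies, log, then DCT."""
    energies = power_spec @ mel_fb.T                       # Eq. (1), shape (n_frames, M)
    return dct(np.log(energies + 1e-12), type=2, axis=1, norm='ortho')[:, :n_ceps]  # Eq. (2)

def deltas(c, tau=2):
    """First differences of the cepstral trajectory (Eq. (3)); apply twice for delta-deltas."""
    padded = np.pad(c, ((tau, tau), (0, 0)), mode='edge')
    return padded[2 * tau:] - padded[:-2 * tau]

# Toy example: random power spectrogram and a random placeholder Mel filter bank
power_spec = np.random.rand(100, 513)        # 100 frames, 513 FFT bins
mel_fb = np.random.rand(26, 513)             # 26 triangular filters in practice
c = mfcc_from_filterbank(power_spec, mel_fb)
features = np.hstack([c, deltas(c), deltas(deltas(c))])   # Eq. (4)
print(features.shape)                                      # (100, 39)
```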

3.2. GABOR FILTERBANK

The Gabor filter is a two-dimensional filter originally proposed in [1]. It was later used as a model of the biological visual system [2] and of the auditory system [10], and has found many applications in computer vision [5]. To study its potential for the detection and classification of acoustic events, we propose a Gabor filter bank for feature extraction, containing a set of two-dimensional Gabor filters. Each filter is defined by a carrier and an envelope function in the time and frequency dimensions. Let m denote the spectral (Mel-band) index and l the time-frame index, (m_0, l_0) the spectral and temporal position of the filter centre, \omega_m and \omega_l the spectral and temporal modulation frequencies, and \nu_m and \nu_l the number of semi-cycles under the envelope. The filter is then defined as

g(m, l) = s_{\omega_m}(m - m_0)\, s_{\omega_l}(l - l_0)\, h_{\nu_m}(m - m_0)\, h_{\nu_l}(l - l_0) \qquad (5)


with carrier functions

s_\omega(x) = \exp(i\,\omega x) \qquad (6)

and envelope functions

h_\nu(x) = 0.5 - 0.5\cos\!\Big(\frac{2\pi x}{b_\nu + 1}\Big), \qquad -\tfrac{b_\nu}{2} \le x \le \tfrac{b_\nu}{2} \qquad (7)

A Hann window is used because of its finite support; outside that support the filter coefficients are zero. An example Gabor filter illustrating these parameters is shown in Fig. 5. The peak value of the Gabor filter lies at its centre m_0, and the spectral and temporal extents of the filter are proportional to the number of semi-cycles, \nu_m \pi / \omega_m and \nu_l \pi / \omega_l respectively, which determine the envelope support b_\nu. The filter bank is designed to give approximately uniform coverage of the spectro-temporal modulation domain: its spectral and temporal centre modulation frequencies are \omega_{m,1}, \dots, \omega_{m,N_m} and \omega_{l,1}, \dots, \omega_{l,N_l}, where N_m and N_l denote the respective numbers of centre frequencies.
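A minimal sketch, under the assumptions above, of constructing one complex 2-D Gabor filter as a complex sinusoidal carrier multiplied by a Hann envelope; the modulation frequencies and semi-cycle counts are illustrative, not the parameter set used in this paper.

```python
import numpy as np

def gabor_filter_2d(omega_l, omega_m, nu_l=3.5, nu_m=3.5):
    """Complex 2-D Gabor filter: complex sinusoidal carrier times a Hann envelope
    whose support spans nu semi-cycles of the carrier (illustrative parameters)."""
    def axis(omega, nu):
        half = int(np.round(nu * np.pi / (2 * max(abs(omega), 1e-3))))
        x = np.arange(-half, half + 1)
        envelope = 0.5 + 0.5 * np.cos(np.pi * x / max(half, 1))   # Hann window
        carrier = np.exp(1j * omega * x)
        return carrier * envelope
    return np.outer(axis(omega_l, nu_l), axis(omega_m, nu_m))

g = gabor_filter_2d(omega_l=0.2, omega_m=0.3)
print(g.shape, np.abs(g).max())
```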

3.3. Feature Extraction with MFCCs and the Gabor Filter Bank

To obtain a better classification, a good choice of feature extraction technique is essential. It must be robust, stable and physically interpretable. Using a Gabor filter bank in addition to MFCCs is very effective for the classification system: the combination of the Gabor filter bank and MFCCs is able to capture the intrinsic structure of each type of environmental sound. Our goal is to use the Gabor filter bank together with MFCCs as the feature extraction tool for classification. The combination of MFCCs and the Gabor filter bank provides a large improvement in recognition results compared with those obtained when using the Gabor filter bank alone. We noted earlier that MFCCs are among the most suitable audio features. In [27], the combination of MFCC and Matching Pursuit obtained the best classification rate compared with other audio features such as short-time energy, zero-crossing rate and spectral flux. Using MFCCs in addition to wavelet-based temporal characteristics likewise improves system performance. The Gabor filter bank is parameterized in frequency and orientation; it has the advantage of extracting localized and oriented frequency information [22], providing excellent spatial position and frequency information simultaneously. The overall framework is shown in Fig. 3.

Fig.3. Block Diagram of Proposed framework


They have another significant property, namely the ability to decompose a spectrogram into its dominant spatial and spectral components.

Fig.4. Block Diagram of MFCC Processor

The event signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). Start and stop indices are specified for the section of the event samples to be taken. The first frame consists of the first N samples; the next frame starts M samples after the first and therefore overlaps it by N − M samples. Likewise, the third frame starts 2M samples after the first (or M samples after the second) and overlaps it by N − 2M samples. This procedure continues until every part of the acoustic event is covered by one or more frames. The next step is to window each individual frame so as to reduce the signal discontinuities at the beginning and end of each frame; the idea is that spectral distortion is minimized by tapering the signal to zero at the frame boundaries. The subsequent processing stage is the fast Fourier transform, which converts each frame of N samples from the time domain to the frequency domain. The next step, Mel filtering, is applied in the frequency domain and simply amounts to weighting the spectrum with triangular windows. The Mel log spectrum is then converted back to the time domain, and the result is the Mel frequency cepstral coefficients (MFCC). The cepstral representation of the spectrum is a good representation of the local spectral properties of the signal for the given frame analysis. Because the Mel spectrum coefficients are real numbers, they can be converted to the time domain using the discrete cosine transform (DCT). Log-Gabor filters are relevant descriptors for two reasons. First, log-Gabor functions have no DC component, which helps to improve the contrast of edges and boundaries in the spectrogram. Second, the transfer function of the log-Gabor function has an extended tail towards high frequencies, preserving broad spectral information with localized spatial extent and thus allowing the true edge structure of the spectrogram to be captured [18]. An important aspect of the log-Gabor function is that its frequency response is symmetric on a logarithmic axis. A Gabor filter bank can be built with a chosen bandwidth, and this bandwidth can be optimized to produce filters with minimal spatial extent. It can be shown that log-Gabor functions, with their extended high-frequency tails, can encode spectrograms with a better representation of the high-frequency components. The discrete signal y(n) is windowed with a Hamming window

w(n) = 0.54 - 0.46\cos\!\Big(\frac{2\pi n}{N-1}\Big), \qquad 0 \le n \le N-1 \qquad (8)

of length N. The windows are shifted by n_s samples, and the resulting blocks are converted to the spectral domain by a discrete Fourier transform (DFT):

Y_{l,k} = \sum_{n=0}^{N-1} y(n + l\,n_s)\, w(n)\, e^{-i 2\pi k n / N} \qquad (9)

The magnitude |Y_{l,k}| of the resulting spectrogram is Mel-warped by triangular Mel filters F_{k,m}, with Mel bands indexed by m, and log-compressed, giving the (log-scaled Mel) spectrogram

M_{l,m} = \log\!\Big(\sum_{k} |Y_{l,k}|\, F_{k,m}\Big) \qquad (10)

The log-scaled Mel spectrogram is filtered by the complex Gabor filter bank, and the real part, imaginary part or magnitude of the outputs is used for classification. Applying the GFB, with its 41 filters, at all centres of a Mel spectrogram with 31 Mel bands results in 1271-dimensional features.
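A small sketch of this filtering stage under stated assumptions: real-valued cosine-carrier Gabor filters (rather than the complex filters above) are convolved with a toy log-Mel spectrogram and the outputs concatenated per frame; the filter sizes and modulation frequencies are illustrative only.

```python
import numpy as np
from scipy.signal import convolve2d

def hann_gabor(size_t, size_f, omega_t, omega_f):
    """Small real-valued 2-D Gabor filter: cosine carrier times a Hann envelope."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    env = np.outer(np.hanning(size_t), np.hanning(size_f))
    carrier = np.cos(omega_t * t)[:, None] * np.cos(omega_f * f)[None, :]
    return env * carrier

def gfb_features(log_mel_spec, filters):
    """Filter the log-Mel spectrogram with every Gabor filter and concatenate the
    outputs along the band axis; 31 bands x 41 filters would give 1271 dims per frame."""
    outs = [convolve2d(log_mel_spec, g, mode='same') for g in filters]
    return np.concatenate(outs, axis=1)

log_mel = np.log(np.random.rand(200, 31) + 1e-6)          # 200 frames, 31 Mel bands
filters = [hann_gabor(21, 11, wt, wf) for wt in (0.2, 0.5) for wf in (0.3, 0.8)]
print(gfb_features(log_mel, filters).shape)                # (200, 31 * 4) = (200, 124)
```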


Fig.5. Real and imaginary parts of symmetric two-dimensional Gabor filter bank filters with standard parameters.

4. EXPERIMENTAL RESULTS

Experimental results show that combining MFCC and Gabor filter bank features is well suited to discriminating these events, and analysis of the filter outputs best explains the suitability of MFCC and Gabor filters for these types of events. Differences between the feature dimensions can otherwise hinder discrimination, and some temporal information is lost when averaging over time. For better detection results, a classifier such as a decision tree is combined with the MFCC+GFB features. In a recent paper [40] the authors use the Gabor filter bank for feature extraction, which gives very good results for classification and event detection. Feature extraction is a very important step in identifying the specific class and the specific event, so choosing the best feature extraction method is essential. In this work we use both the Gabor filter bank and MFCC to compute the features of the training data: in the first step we use MFCC to obtain features, and the output of these features is then further processed by the Gabor filter bank, so that the final, best features are used to train the classification model. We use five-fold cross-validation for the classification task to estimate the accuracy for different parameter sets.
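A minimal sketch of this classification step, assuming scikit-learn is available; the feature matrix, number of classes and tree depth are placeholders, with real inputs coming from the MFCC+GFB extraction stage described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data: rows = analysis frames/segments, columns = MFCC+GFB features,
# labels = event classes (alert, cough, keyboard, ...).
X = np.random.randn(500, 60)
y = np.random.randint(0, 16, size=500)      # 16 event classes in the OL dataset

clf = DecisionTreeClassifier(max_depth=20, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')   # five-fold cross-validation
print(scores.mean(), scores.std())
```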

4.1. Comparison of Systems for Acoustic Event Detection

Different systems have been applied to the DCASE dataset to obtain the best detection and classification accuracy. MFCC is widely used for feature extraction, but Gabor filters also achieve very good results.

Table.2: F-scores of the DCASE challenge for the different systems.


We add MFCC to the Gabor filter bank features, which gives the best result for feature extraction; a decision tree is used as the classifier.

MFCC is a familiar approach used in AED. A comparison has been made between MFCC features, Gabor features and the combined MFCC+GFB features with a classifier for event detection. Table 2 shows the outcomes of the different systems used for event detection on the DCASE dataset. The proposed system, in which we use the GFB + MFCC features, obtained an F-score of 0.85, which is the highest compared with all the other methods. This is due to the feature extraction method used in our proposed technique.
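For reference, a minimal frame-based F-score calculation of the kind used to score detection output (the DCASE challenge also defines event-based variants); the reference and detection arrays below are toy values.

```python
import numpy as np

def frame_f_score(reference, detected):
    """Frame-based F-score: reference and detected are boolean arrays marking,
    for each frame, whether an event of a given class is active."""
    tp = np.sum(reference & detected)
    fp = np.sum(~reference & detected)
    fn = np.sum(reference & ~detected)
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    return 2 * precision * recall / (precision + recall + 1e-12)

ref = np.array([0, 1, 1, 1, 0, 0, 1, 1], dtype=bool)
det = np.array([0, 1, 1, 0, 0, 1, 1, 1], dtype=bool)
print(round(frame_f_score(ref, det), 2))   # 0.8 for this toy example
```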

5. CONCLUSION

This work has focused on the topic of acoustic event detection (AED), where the aim is to detect and classify the rich array of acoustic information present in many environments. This is a challenging task in unstructured environments, where there may be many overlapping events, an unknown number and location of sources, and noise or channel effects. Many state-of-the-art AED systems perform poorly in such situations, particularly those that rely on frame-based features, where each feature can contain a mixture of multiple sources or noise. This work has developed approaches to address some of the challenges faced by AED systems, in particular by drawing inspiration from image processing. We proposed a new feature extraction technique for AED in conditions with substantial noise and event overlap. Compared with other classical systems used for AED, we believe the proposed technique, based on a combination of the Gabor filter bank and MFCCs, ranks first (F-score 0.85). It gives good results with any classifier, because classifier performance increases when the features supplied to it are good. The results in the table above show that Gabor filter bank features alone are not able to distinguish between classes effectively, whereas the mixture of MFCC and Gabor filter bank features separates some classes successfully. Combinations that include the spectro-temporal domain are helpful because they merge information from the two complementary domains: MFCCs are spectral features that distinguish frequency content, whereas Gabor filter bank features provide both temporal and spectral information and are especially informative at high frequencies.

6. REFERENCES

[1] D. Gabor, “Theory of communication,” J. Inst. Elect. Eng., vol. 93, pp. 429–457, 1946.
[2] S. Marcelja, “Discriminative learning of receptive fields from responses to non-Gaussian stimulus ensembles,” J. Opt. Soc. Amer., vol. 70, no. 11, pp. 1297–1300, Nov. 1980.
[3] L.R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[4] J.W. Picone, “Signal modeling techniques in speech recognition,” Proceedings of the IEEE, vol. 81, no. 9, pp. 1215–1247, 1993.
[5] T. Aach, A. Kaup, and R. Mester, “On texture analysis: Local energy transforms versus quadrature filters,” Signal Process., vol. 45, no. 2, pp. 173–181, 1995.

[6] E. Wold, T. Blum, D. Keislar, and J. Wheaten, “Content-based classification, search, and retrieval of audio,” IEEE Multimedia, vol. 3, no. 3, pp. 27–36, 1996.


[7] K. El-Maleh, A. Samouelian, and P. Kabal, “Frame level noise classification in mobile environments,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.
[8] M. Casey, “MPEG-7 sound-recognition tools,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 737–747, June 2001.
[9] M. Cowling, R. Sitte, and T. Wysocki, “Analysis of speech recognition techniques for use in a non-speech sound recognition system,” Digital Signal Processing for Communication Systems, Sydney-Manly, 2002.
[10] M. Cowling and R. Sitte, “Comparison of techniques for environmental sound recognition,” Pattern Recognition Letters, vol. 24, pp. 2895–2907, Nov. 2003.
[11] A. Qiu, C. E. Schreiner, and M. A. Escabí, “Gabor analysis of auditory midbrain receptive fields: Spectro-temporal and binaural composition,” J. Neurophysiol., vol. 90, no. 1, pp. 456–476, 2003.
[12] G. Peeters, “A large set of audio features for sound description (similarity and classification) in the CUIDADO project,” CUIDADO I.S.T. Project Report, pp. 1–25, 2004.
[13] G. Peeters, “A large set of audio features for sound description (similarity and classification) in the CUIDADO project,” CUIDADO IST Project Report, pp. 1–25, 2004.
[14] H. G. Kim, J. J. Burred, and T. Sikora, “How efficient is MPEG-7 for general sound recognition?,” in AES 25th International Conference on Metadata for Audio, 2004.
[15] D. Hoiem, Y. Ke, and R. Sukthankar, “SOLAR: Sound Object Localization and Retrieval in Complex Audio Environments,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 429–432, IEEE, 2005.
[16] D. Hoiem, Y. Ke, and R. Sukthankar, “SOLAR: sound object localization and retrieval in complex audio environments,” in Acoustics, Speech, and Signal Processing, 2005. Proceedings (ICASSP’05), IEEE International Conference on, vol. 5, IEEE, 2005.
[17] S. J. Barry, A. D. Dane, A. H. Morice, and A. D. Walmsley, “The automatic recognition and counting of cough,” Cough (London, England), vol. 2, p. 8, Jan. 2006.
[18] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X.A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.4), Cambridge, U.K.: Cambridge Univ. Press, 2006.
[19] F. Pachet and P. Roy, “Exploring billions of audio features,” in Content-Based Multimedia Indexing, 2007. CBMI’07, International Workshop on, IEEE, 2007, pp. 227–235.
[20] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, et al., “Automatic speech recognition and speech variability: A review,” Speech Communication, vol. 49, no. 10-11, pp. 763–786, 2007.
[21] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen, “Temporal feature integration for music genre classification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1654–1664, July 2007.
[22] S.M. Lajevardi and M. Lech, “Facial expression recognition using a bank of neural networks and logarithmic Gabor filters,” DICTA08, Canberra, Australia, 2008.
[23] A. Temko, “Acoustic Event Detection and Classification,” 2008.
[24] J. Bouvrie, T. Ezzat, and T. Poggio, “Localized spectro-temporal cepstral analysis of speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4733–4736, IEEE, Mar. 2008.
[25] S. Chu, S. Narayanan, and C.-C. J. Kuo, “Environmental sound recognition using MP-based features,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1–4, IEEE, Mar. 2008.
[26] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software,” ACM SIGKDD Explorations Newsletter, vol. 11, p. 10, Nov. 2009.
[27] S. Chu, S. Narayanan, and C. J. Kuo, “Environmental sound recognition with time-frequency audio features,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, pp. 1142–1158, Aug. 2009.
[28] J. Ramírez, J.M. Górriz, and J.C. Segura, “Voice activity detection: fundamentals and speech recognition system robustness,” in M. Grimm and K. Kroschel (eds.), Robust Speech Recognition and Understanding, pp. 1–22, 2010.
[29] N. Yamakawa, T. Kitahara, T. Takahashi, K. Komatani, T. Ogata, and H. G. Okuno, “Effects of modelling within- and between-frame temporal variations in power spectra on non-verbal sound recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pp. 2342–2345, Sept. 2010.
[30] G. Peeters and E. Deruty, “Sound indexing using morphological description,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, pp. 675–687, Mar. 2010.
[31] G. Kovacs and L. Toth, “Localized spectro-temporal features for noise-robust speech recognition,” in Proceedings of the 2010 International Joint Conference on Computational Cybernetics and Technical Informatics, pp. 481–485, IEEE, 2010.
[32] T. Butko, Feature Selection for Multimodal Acoustic Event Detection, PhD thesis, Universitat Politecnica de Catalunya, 2011.
[33] M. Heckmann, X. Domont, F. Joublin, and C. Goerick, “A hierarchical framework for spectro-temporal feature extraction,” Speech Communication, vol. 53, no. 5, pp. 736–752, 2011.
[34] “D-CASE: Detection and classification of acoustic scenes and events,” 2013. [Online]. Available: http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/


[35] P. Guyot, J. Pinquier, and R. André-Obrecht, “Water sounds recognition based on physical models,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 793–797, 2013.
[36] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, “Context-dependent sound event detection,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2013, no. 1, p. 1, 2013.
[37] M. Cobos, J. J. Perez-Solano, S. Felici-Castell, J. Segura, and J.M. Navarro, “Cumulative-sum-based localization of sound events in low-cost wireless acoustic sensor networks,” IEEE/ACM Transactions on Speech and Language Processing, 2014.
[38] T. Sandhan, S. Sonowal, and J. Y. Choi, “Audio bank: a high-level acoustic signal representation for audio event recognition,” in Proceedings of the 14th International Conference on Control, Automation and Systems (ICCAS ’14), pp. 82–87, IEEE, Seoul, Republic of Korea, October 2014.
[39] S. E. Kucukbay and M. Sert, “Audio-based event detection in office live environments using optimized MFCC-SVM approach,” in Proceedings of the IEEE International Conference on Semantic Computing (ICSC ’15), pp. 475–480, Anaheim, California, USA, February 2015.
[40] J. Schröder, S. Goetze, and J. Anemüller, “Spectro-temporal Gabor filterbank features for acoustic event detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, December 2015.
