
Research Article

Image Classification Based on Convolutional Denoising Sparse Autoencoder

Shuangshuang Chen,1,2 Huiyi Liu,1 Xiaoqin Zeng,1 Subin Qian,1,2 Jianjiang Yu,2 and Wei Guo2

1 Institute of Intelligence Science and Technology, Hohai University, No. 8 West Focheng Road, Nanjing 211100, China

2 School of Information Science and Technology, Yancheng Teachers University, Yancheng 224002, China

Correspondence should be addressed to Shuangshuang Chen; chenss@yctu.edu.cn

Received 15 March 2017; Accepted 16 September 2017; Published 26 November 2017

Academic Editor: Lotfi Senhadji

Copyright © 2017 Shuangshuang Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Image classification aims to group images into corresponding semantic categories. Due to the difficulties of interclass similarity and intraclass variability, it is a challenging issue in computer vision. In this paper, an unsupervised feature learning approach called convolutional denoising sparse autoencoder (CDSAE) is proposed based on the theory of visual attention mechanism and deep learning methods. Firstly, a saliency detection method is utilized to get training samples for unsupervised feature learning. Next, these samples are sent to the denoising sparse autoencoder (DSAE), followed by a convolutional layer and a local contrast normalization layer. Generally, a prior in a specific task is helpful for solving the task. Therefore, a new pooling strategy, spatial pyramid pooling (SPP) fused with center-bias prior, is introduced into our approach. Experimental results on two common image datasets (STL-10 and CIFAR-10) demonstrate that our approach is effective in image classification. They also demonstrate that none of these three components (local contrast normalization, SPP fused with center-bias prior, and $\ell_2$ vector normalization) can be excluded from our proposed approach. They jointly improve image representation and classification performance.

1. Introduction

In recent years, image classification has been an active and important research topic in the field of computer vision and machine learning applications. The basic image classification algorithm is generally introduced in [1–3] and involves three main stages in sequence: (1) image sampling, (2) feature extraction, and (3) classifier designing. Among these stages, feature extraction plays an important role [4], and efficiently extracted features may increase the separation between spectrally similar classes, resulting in improved classification performance.

The feature extraction part is commonly accomplished by a wide spectrum of different local or global descriptors, for example, scale invariant feature transform (SIFT) [5], histogram of oriented gradients (HOG) [6], and local binary pattern (LBP) [7]. Although these hand-crafted features lead to reasonable results in various applications, they are only suitable for a particular data type or research domain and would result in dismal performance on other unknown usage [2]. Recently, there is a growing consensus that utilizing deep learning methods to obtain machine-learned features is an alternative approach for image classification. These deep learning methods aim to extract general purpose features for any images rather than learning domain adaptive feature descriptors particularly for certain tasks.

Up to this point, typical deep learning methods include convolutional neural network (CNN) [8, 9], sparse coding [10], deep belief network (DBN) [11], and stacked autoencoder (AE) [12]. Among these models, CNN is one of the main models in deep learning methods; it is a hierarchical model that outperforms many algorithms on visual recognition tasks. One of its properties is the alternating use of convolution [13] and pooling [14] structures. The convolution operation shares weights and keeps the relative location of features and thus can preserve spatial information of the input data. Despite its apparent success, there remains a major drawback: CNN requires large quantities of labeled data, which are very expensive to obtain. Stacked AE is another notable learning method, which exploits a particular type of neural network, the AE (also called autoassociator [15]), as component or monitoring device. It can be effectively used for unsupervised feature learning on a dataset for which it is difficult to obtain labeled samples [16]. Beyond simply learning features by AE, there is a need for reinforcing the sparsity of weights and increasing its robustness to noise. Ng [17] introduced the sparse autoencoder (SAE), which is a variant of AE. Sparsity is a useful constraint when the number of hidden units is large: an SAE has very few neurons that are active. Another variant of AE is the denoising autoencoder (DAE) [18], which minimizes the error in reconstructing the input from a stochastically corrupted transformation of the input. The stochastic corruption process consists in randomly setting some of the inputs (as many as half of them) to zero. Comparative experiments clearly show the surprising advantage of DAE on a pattern classification benchmark suite.

Hindawi, Mathematical Problems in Engineering, Volume 2017, Article ID 5218247, 16 pages, https://doi.org/10.1155/2017/5218247

With the development of deep learning, AE and its variants are widely used in the field of image recognition. Xu et al. [19] presented a stacked SAE for nuclei patch classification on breast cancer histopathology. They extracted two classes of 34 × 34 patches from the histopathology images: nuclei and nonnuclei patches. These two kinds of patches were used to construct the training set and testing set. The authors of [20] proposed a method called stacked DAE based on [18], which is a straightforward variation on the stacked ordinary AE. Stacked DAE was tested on the MNIST dataset, which contains 28 × 28 gray-scale images. As with the method based on stacked SAE, the training and testing data fed into the models are relatively low in resolution, such as small image patches and low resolution images (e.g., hand-written digits). Both SAE and DAE are common fully connected networks, which cannot scale well to realistically sized high-dimensional inputs (e.g., 256 × 256 images) in terms of computational complexity [21]. They both ignore the 2D image structure.

In order to overcome these limitations, this paper introduces an approach called CDSAE (convolutional denoising sparse autoencoder) that scales well to high-dimensional inputs. This approach can effectively integrate the advantages of SAE, DAE, and CNN. This hybrid structure forces our model to learn more abstract and noise-resistant features, which will help to improve the model's representation learning performance. CDSAE can map images to feature representations without any label information, while CNN requires large quantities of labeled data. Besides, it differs from conventional SAE and DAE in that its weights are shared among all locations in the input images and thus it preserves spatial locality.

Besides the feature extraction mentioned above, the sampler is another critical component which has a great influence on the results. Ideally, it should focus attention on the image regions that are the most informative for classification [22]. Recently, selective attention models have drawn a lot of research attention [23, 24]. The idea in selective attention is that not all parts of an image give us information. If we can attend only to the relevant parts, we can recognize the image more quickly and using fewer resources [23]. People place an object on the fovea with fixations when the gaze is concentrated on the object and get most information through fixations [25]. In contrast to traditional approaches that use a random sampling strategy, we introduce a sampling strategy that samples fixations from the image, inspired by human selective attention. Moreover, studies on human eye fixations demonstrate that there is a tendency in humans to look towards the image center, which is called the center bias [26]. It is worth mentioning that incorporating a center-bias prior into saliency estimation has been previously investigated by a number of researchers [27–29]. In our work, the center-bias prior is incorporated into SPP in our image classification model.

To summarize, the key contributions of this paper are as follows:

(1) A sampling strategy based on eye fixations is proposed, which is inspired by the human visual system. The fixation points and nonfixation points of images can be obtained by utilizing a saliency detection model.

(2) A CDSAE model with a local contrast normalization operation is proposed. In this overall model, a single-layer DSAE is used for unsupervised feature learning, which can effectively extract features without using any labeled data. Compared to conventional deep models, a single-layer DSAE has a smaller computational learning cost and fewer hyperparameters to tune.

(3) An SPP incorporating a center-bias prior is proposed. This not only maintains spatial information by pooling in local spatial bins but also fully utilizes prior knowledge of the image dataset. To the best of our knowledge, this is the first work that incorporates prior knowledge into pooling for image classification.

The remainder of this paper is organized as follows. In Section 2, we review related works in the literature. Section 3 introduces a sampling strategy based on the human vision attention system. Section 4 describes CDSAE, and Section 5 provides the overall classification framework. The details of our experiments and the results are presented in Section 6, followed by a discussion and future work.

2. Related Work

Other researchers have also made some headway on constructing the convolutional autoencoder (CAE), an unsupervised feature extractor that can scale well to high-dimensional input images. Masci et al. [21] propose a kind of CAE which directly takes the high-dimensional image data as the input by training the AE convolutionally. Though this convolution structure can preserve local relevance of the inputs, training the AE convolutionally is not easy. To address this problem, Coates et al. [30] first extract patches from the input images and use patch-wise training to optimize the weights of a basic SAE in place of convolutional training. Besides, they further show that, even with a single-layer network in unsupervised feature learning, it is possible to achieve state-of-the-art performance. In our method, we adopt this idea and construct a single-layer network for unsupervised feature learning. Due to its simplicity and efficiency, single-layer SAE has a wide range of applications. Luo et al. [3] utilize single-layer SAE for natural scene classification; this idea is analogous to Coates et al.'s work [30]. A similar method is used for remote sensing image classification in [31]. These locally connected SAEs through convolution [3, 30, 31] present many similarities with each layer of a CNN, such as the use of convolution and pooling.

There are several differences between these works and ours. Firstly, we adopt the theory of DAE, which can learn more noise-resistant features. Hence, unlike previous works, which only use sparsity, our model also gains robustness to noise. Secondly, a local contrast normalization layer is embedded before the pooling layer in our model. In [32], He et al. introduce spatial pyramid pooling (SPP), which shows great strength in object detection. In contrast to [3, 30, 31], which only use single-level pooling, we instead propose an SPP fused with center-bias prior. Bias is mainly proposed for image saliency detection in computer vision. It is often closely related to the application task and could be deliberately utilized as a prior in a specific task to improve the performance of the task [33].

Another branch of related work is human selective attention models. Many attention models have been proposed in both natural language processing and computer vision. In [34], Wang et al. show that humans read sentences by making a sequence of fixations and saccades. They explore attention models over single sentences with guidance of human attention. In the computer vision area, the core concept of attention models is to focus on the important parts of the input image instead of giving all pixels the same weight [34]. Inspired by the theory of visual attention mechanism, we propose a sampling strategy based on eye fixations and the human visual system. Our work is also closely related to the work of Judd et al. [35], who train a model of saliency directly from human fixation data. The saliency map computed by saliency detection models is significantly correlated with human fixation patterns [36].

3. Sampling Strategy Based on Human Vision Attention System

Saliency detection methods are selective attention models which simulate the visual attention system. They can be used to measure the conspicuity of a location or the likelihood of a location to attract the attention of human observers [35]. The saliency map represents the saliency of each pixel, and it can be thresholded such that a given percent of the image pixels are classified as fixated and the rest are classified as not fixated [35]. In this paper, we adopt a saliency detection model, context-aware saliency, to guide our sampling task [37]. It is a type of saliency detection algorithm which manages to detect the pixels on the salient objects and only them. In this method, a pixel $i$ is considered salient if the appearance of the patch $p_i$ centered at pixel $i$ is distinctive with respect to all other image patches. $d_{\text{color}}(p_i, p_j)$ is the Euclidean distance between the patches $p_i$ and $p_j$ in the CIE $L^*a^*b^*$ color space, normalized to the range $[0, 1]$. If $d_{\text{color}}(p_i, p_j)$ is high for all $j$, then pixel $i$ is considered salient. $d_{\text{position}}(p_i, p_j)$ denotes the Euclidean distance between the positions of patches $p_i$ and $p_j$, normalized by the larger image dimension. A dissimilarity measure is defined between a pair of patches as

$$d(p_i, p_j) = \frac{d_{\text{color}}(p_i, p_j)}{1 + c \cdot d_{\text{position}}(p_i, p_j)}, \tag{1}$$

where $c = 3$ in our paper. This dissimilarity measure is proportional to the distance in color space and inversely proportional to the positional distance. For every patch $p_i$, we search for the $L$ most similar patches $\{q_j\}_{j=1}^{L}$ in the image (if the most similar patches are highly different from $p_i$, then clearly all image patches are highly different from $p_i$). As stated before, a pixel $i$ is salient when $d(p_i, q_j)$ is high for all $j \in [1, L]$. Therefore, the single-scale saliency value of pixel $i$ at scale $r$ can be defined as

$$S_i^r = 1 - \exp\left\{-\frac{1}{L} \sum_{l=1}^{L} d(p_i^r, q_l^r)\right\}. \tag{2}$$

Furthermore, we also use four scales (100%, 80%, 50%, and 30%) of the original image to measure the saliency in a multiscale image. The saliency at pixel $i$ is taken as the mean of its saliency at different scales (more details can be found in [37]).
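Equations (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes the pairwise color and position distances between patches have already been computed and normalized as described above; the function names and the matrix-based layout are illustrative choices, not part of the context-aware saliency implementation of [37].

```python
import numpy as np

def dissimilarity(d_color, d_position, c=3.0):
    """Eq. (1): color distance damped by positional distance."""
    return d_color / (1.0 + c * d_position)

def single_scale_saliency(d_color, d_position, L, c=3.0):
    """Eq. (2) for every patch at one scale.

    d_color, d_position: (n, n) float arrays of pairwise patch
    distances (hypothetical inputs), with n > L.
    """
    d = dissimilarity(d_color, d_position, c)
    np.fill_diagonal(d, np.inf)  # a patch is not compared with itself
    # The L most similar patches have the smallest dissimilarity.
    nearest = np.sort(d, axis=1)[:, :L]
    return 1.0 - np.exp(-nearest.mean(axis=1))

def multiscale_saliency(per_scale):
    """Mean of the single-scale saliency values of each pixel."""
    return np.mean(per_scale, axis=0)
```

In this sketch a patch that has at least one near-identical neighbor receives a saliency close to 0, while a patch whose $L$ closest matches are all distant approaches 1.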

About 60% of the ground truth human fixations are within the top 5% salient areas of a saliency map, and 90% are within the top 20% salient locations [35]. The saliency map can be thresholded such that a given percent of the image pixels are classified as fixations and the rest are classified as nonfixations. Figure 1 shows the saliency detection results for three images of the STL-10 dataset. To avoid missing the nonfixations corresponding to the images, we also sample some nonfixations. Figure 1(c) shows the fixations and the nonfixations in each image. Thus, we first randomly select one image of the $M$ images and then extract a given percent of the fixations and nonfixations. For each image, the total number of the fixations and nonfixations is $N$. This can be represented as a vector in $\mathbb{R}^C$ of the pixel intensity values, with $C = N \times 3$ (the input image has three channels: R, G, and B). Therefore, a dataset $X \in \mathbb{R}^{C \times M}$ can be constructed, where each column denotes the pixel intensity values of the fixations and nonfixations sampled from each image.
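The construction of one column of $X$ can be sketched as follows. The threshold (top 20% of pixels treated as fixations) and the split between fixations and nonfixations are assumptions chosen for illustration; the text only states that a given percent is used and that each image contributes $N$ points in total.

```python
import numpy as np

def sample_points(saliency, image, fix_percent=20.0, n_fix=6, n_nonfix=2, seed=0):
    """Threshold a saliency map and sample fixated / non-fixated pixels.

    saliency: (H, W) array; image: (H, W, 3) RGB array.
    Returns a vector of length C = N * 3 with N = n_fix + n_nonfix.
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    thresh = np.percentile(saliency, 100.0 - fix_percent)
    flat = saliency.ravel()
    fix_idx = np.flatnonzero(flat >= thresh)   # most salient pixels
    non_idx = np.flatnonzero(flat < thresh)    # remaining pixels
    chosen = np.concatenate([
        rng.choice(fix_idx, size=n_fix, replace=False),
        rng.choice(non_idx, size=n_nonfix, replace=False),
    ])
    pixels = image.reshape(-1, 3)[chosen]      # (N, 3) RGB values
    return pixels.reshape(-1).astype(float)    # length C = N * 3
```

Stacking the vectors of $M$ images as columns, for example with `np.stack(columns, axis=1)`, gives an array of shape $(C, M)$ that plays the role of $X$.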

4. Convolutional Denoising Sparse Autoencoder

CDSAE can be divided into three stages: feature learning, feature extraction, and classification. These stages correspond to (1) training the single-layer DSAE; (2) convolution, local contrast normalization, and SPP fused with center-bias prior; and (3) support vector machine (SVM) classification. The power of DSAE lies in its reconstruction-oriented training, where the hidden units can conserve efficient features to represent the input data. In order to get a better representation, the convolution operation is introduced to encode the input images with the features extracted by DSAE. A local contrast normalization layer is embedded after the convolution operation. This can improve feature invariance and increase sparsity (Large-Scale Visual Recognition with Deep Learning, http://cvgl.stanford.edu/teaching/cs231a_winter1314/lectures/lecture_guest_ranzato.pdf). Following the local contrast normalization, pooling is conducted to select significant features and decrease the spatial resolution. Several types of pooling methods have been proposed to subsample the features, for example, average pooling [46], max pooling [47], stochastic pooling [44], and spatial pyramid pooling [32]. We propose a new form of SPP which seamlessly incorporates the center-bias prior.

Figure 1: (a) Sample images from the STL-10 dataset. (b) Saliency maps for the original images. (c) Several human fixations and nonfixations of the images. (The green circles denote fixations and the red diamonds denote nonfixations.)

Figure 2 shows how the CDSAE works.

4.1. Feature Learning. Recently, increasing attention has been drawn to the study of single-layer networks for unsupervised feature learning [3, 31]. Coates et al. [30] have shown that simple but fast algorithms can be highly competitive, while more complex algorithms may have greater complexity and expense. In order to extract appropriate and sufficient features with low computational cost, a single-layer DSAE model is proposed in this work. The DSAE is a simple but effective extension of the classical SAE. The main idea of this approach is to train a sparse AE which can reconstruct the input data from a version corrupted by manually added random noise.

4.1.1. Sparse Autoencoder. AE is a symmetrical neural network structurally defined by three layers: input layer, hidden layer, and output layer. It can be used to learn the features of a dataset in an unsupervised manner. The aim of the AE is to learn a latent or compressed representation of the input data by minimizing the reconstruction error between the input at the encoding layer and its reconstruction at the decoding layer.

Figure 2: The flowchart of the CDSAE (input image, convolution, convolved feature map, LCN, LCN feature map, SPP fused with center-bias prior, pooled feature map, full connection, classification).

During the encoding step, an input vector $x_i \in \mathbb{R}^C$ is processed by applying a linear deterministic mapping and a nonlinear activation function $l$ as follows:

$$\alpha_i = f(x_i) = l(W_1 x_i + b_1), \tag{3}$$

where $W_1 \in \mathbb{R}^{K \times C}$ is a weight matrix with $K$ features and $b_1 \in \mathbb{R}^K$ is the encoding bias. In this study, we consider a leaky rectified linear unit (LReLU) activation function for $l(x)$. LReLU has been reported to perform better than ReLU and is widely used in the field of deep learning [48–50]. It can be represented as

$$y = \begin{cases} x, & \text{if } x \ge 0, \\ \varpi x, & \text{if } x < 0, \end{cases} \tag{4}$$

and the slope $\varpi$ of the LReLU is set to 0.01 [48]. Then we decode a vector using a separate linear decoding matrix:

$$z_i = W_2 \alpha_i + b_2, \tag{5}$$

where $W_2 \in \mathbb{R}^{C \times K}$ and $b_2 \in \mathbb{R}^C$ are a decoding weight matrix and a bias vector, respectively. Feature extractors are learned by minimizing the reconstruction error of the cost function in (6). The first term in the cost function is the error term. The second term is a regularization term (aka a weight decay term):

$$L(X, Z) = \frac{1}{2} \sum_{i=1}^{M} \|x_i - z_i\|^2 + \frac{\lambda}{2} \|W\|^2, \tag{6}$$

where $X$ and $Z$ represent the training and reconstructed data, respectively.
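Equations (3)–(6) can be written compactly in NumPy. The sketch below processes all $M$ training vectors at once, with $X$ stored column-wise as in Section 3; treating $\|W\|^2$ as the sum of the squared entries of both $W_1$ and $W_2$ is an assumption, since the text does not spell out which weights the decay term covers.

```python
import numpy as np

def lrelu(x, slope=0.01):
    """Eq. (4): leaky rectified linear unit."""
    return np.where(x >= 0, x, slope * x)

def encode(X, W1, b1):
    """Eq. (3), applied to every column of X (shape C x M)."""
    return lrelu(W1 @ X + b1[:, None])

def decode(A, W2, b2):
    """Eq. (5): linear decoding of the hidden activations (K x M)."""
    return W2 @ A + b2[:, None]

def reconstruction_cost(X, W1, b1, W2, b2, lam):
    """Eq. (6): squared reconstruction error plus weight decay."""
    Z = decode(encode(X, W1, b1), W2, b2)
    error = 0.5 * np.sum((X - Z) ** 2)
    # Assumption: the weight-decay norm covers both W1 and W2.
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return error + decay
```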

To make the hidden units sparse, the method of [51] is introduced to constrain the expected activation of hidden nodes. We add a regularization term that penalizes the values of hidden units such that only a few of them are bigger than the sparsity parameter $\rho$ and most values of hidden units are much smaller than $\rho$. $\mathrm{KL}(\rho \,\|\, \hat{\rho})$ is the sparse penalty term, which can be denoted by the following formula:

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}) = \rho \log \frac{\rho}{\hat{\rho}} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}}, \tag{7}$$

where $\mathrm{KL}(\cdot)$ is the Kullback–Leibler divergence [52]. We recall that $\alpha$ denotes the activation of hidden units in the autoencoder; let $\hat{\rho} = (1/M) \sum_{i=1}^{M} \alpha_i$ be the average activation of $\alpha$ over the training set $X \in \mathbb{R}^{C \times M}$. Then our objective function in the sparse autoencoder learning can be written as follows:

$$L(X, Z) + \beta \sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}). \tag{8}$$

With the introduction of the KL divergence weighted by a sparsity penalty parameter $\beta$ in the objective function, we penalize a large average activation of $\alpha$ over the training samples by setting $\rho$ small. This penalization drives many of the hidden units' activations to be close or equal to zero, resulting in sparse connections between layers.
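The sparsity penalty of (7) and the objective of (8) can be sketched as below. The average activation is computed per hidden unit and then summed over the $K$ units. Clipping $\hat{\rho}$ into $(0, 1)$ is an added assumption to keep the logarithms finite, since LReLU activations are not confined to that interval.

```python
import numpy as np

def kl_sparsity(rho, rho_hat, eps=1e-8):
    """Eq. (7), evaluated element-wise for each hidden unit.

    Assumption: rho_hat is clipped into (0, 1) for numerical safety.
    """
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_objective(recon_cost, A, rho, beta):
    """Eq. (8): reconstruction cost plus the weighted KL penalty.

    A: (K, M) matrix of hidden activations, one column per sample.
    """
    rho_hat = A.mean(axis=1)  # average activation of each hidden unit
    return recon_cost + beta * np.sum(kl_sparsity(rho, rho_hat))
```

The penalty vanishes when every unit's average activation equals $\rho$ and grows as the averages move away from it.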

4.1.2. Denoising Sparse Autoencoder. In order to force the hidden layer to learn more robust features and prevent it from simply discovering the sparsity, we train a DSAE, which is an extension of SAE, to reconstruct the input from a corrupted version of it. Its objective function is the same as that of SAE. The only difference is that we feed the corrupted input into the input layer. The structure of the DSAE is demonstrated in Figure 3. Three basic types of noise can be utilized to corrupt the input of the DSAE. The zero-masking noise [18] is employed in our model. The key idea of DSAE is to learn a sparse but robust bank of local features, which can also be called "convolution kernels." They can be used to convolve the whole image in the next convolution layer. The training procedure of the DSAE is summarized in Algorithm 1.
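Zero-masking noise can be sketched in a few lines. The corruption fraction below is an illustrative default (the introduction mentions that as many as half of the inputs may be set to zero); the value actually used for the DSAE is not stated in this section.

```python
import numpy as np

def zero_mask(X, fraction=0.5, seed=0):
    """Set a random subset of the entries of X to zero.

    fraction: probability that an entry is zeroed (assumed value).
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(X.shape) >= fraction  # True where the entry survives
    return X * keep
```

The corrupted array is fed to the encoder, while the reconstruction error is still measured against the clean `X`.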

In this paper, we view DSAE as a "feature extractor" that takes training data $X$ and outputs a function $f: \mathbb{R}^C \to \mathbb{R}^K$ that can map an input vector $x_i$ to a new feature vector via the $K$ features, where $K$ is the number of hidden units of DSAE.

Algorithm 1: The training procedure of DSAE.
Input:
    Training set $X$
    Weight decay parameter $\lambda$, weight of sparse penalty term $\beta$, sparse parameter $\rho$
Procedure:
    Initialize parameters $(W_1, b_1)$, $(W_2, b_2)$
    Get $\tilde{x}_i$ by stochastically corrupting the input vector $x_i$
    FOR $j = 1$ to $T$ DO
        Loss $= L(X, Z) + \beta \sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho})$
        Use the L-BFGS algorithm [58] to update $(W_1, b_1)$, $(W_2, b_2)$
    END FOR
Output: $(W_1, b_1)$, which is utilized for convolution kernels

Figure 3: Illustration of a single-layer DSAE (input data $x_1, \ldots, x_n$, corrupted data, features $y_1, \ldots, y_k$, output data $z_1, \ldots, z_n$). Neurons with a cross denote the corrupted input neural units.

4.2. Feature Extraction. The above DSAE algorithm yields a function $f$ that transforms an input vector $x_i \in \mathbb{R}^C$ to a new feature representation $\alpha_i = f(x_i) \in \mathbb{R}^K$. In this section, we apply this feature extractor to our (labeled) training and testing images for classification.

4.2.1. Image Convolution. In order to extract appropriate and sufficient features from training and testing images, convolution is utilized to construct a locally connected DSAE network. Each hidden unit connects to only a small contiguous region of pixels in the input images. Sounds, natural images, and, more generally, signals that display translation invariance in any dimension can be better represented using convolutional dictionaries [53]. The convolution operator enables the system to model local structures that appear anywhere in the signal [53]. It was first used in the field of natural images by LeCun et al. [54]. Figure 4 illustrates the significance of the convolution operation. Figure 4(a) is a source image of the STL-10 dataset, (b)–(d) are the convolution kernels (aka bases) trained by DSAE, and (e)–(g) are the features extracted from the source image through the convolution operation.

Given an image of $u$-by-$u$ pixels (with $F$ channels), we can define a $((u - w)/s + 1)$-by-$((u - w)/s + 1)$ image representation (with $K$ channels) by applying our $w$-by-$w$ convolution kernels across the image with some step size (or "stride") $s$ equal to or greater than 1; for $s = 1$, this is $(u - w + 1)$-by-$(u - w + 1)$. This is illustrated in Figure 5.
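The patch-wise application of the learned features can be sketched with NumPy's `sliding_window_view`. The sketch assumes that each row of $W_1$ can be read as a flattened $w \times w \times F$ kernel (so that $C = w \cdot w \cdot F$) and that patches are flattened in channel-first order; both are illustrative assumptions about the data layout.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_output_size(u, w, s=1):
    """Side length of the convolved map: (u - w) / s + 1."""
    return (u - w) // s + 1

def convolve_features(image, W1, b1, w, s=1, slope=0.01):
    """Apply K learned features to every w-by-w patch of an image.

    image: (u, u, F); W1: (K, w*w*F); b1: (K,).
    Returns an (out, out, K) array with out = (u - w) // s + 1.
    """
    # All w-by-w windows, subsampled by the stride: (out, out, F, w, w).
    patches = sliding_window_view(image, (w, w), axis=(0, 1))[::s, ::s]
    out = patches.shape[0]
    flat = patches.reshape(out, out, -1)   # one row vector per patch
    pre = flat @ W1.T + b1                 # linear part of Eq. (3)
    return np.where(pre >= 0, pre, slope * pre)  # LReLU of Eq. (4)
```

For a 96 × 96 image and 8 × 8 kernels with stride 1, `conv_output_size(96, 8)` gives 89, matching $u - w + 1$.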

4.2.2. Local Contrast Normalization. The local contrast normalization layer is inspired by computational neuroscience models [55]. It performs local subtractive and divisive normalizations, enforcing a kind of local competition between adjacent features in a feature map and between features at the same spatial location in different feature maps [56]. The subtractive normalization operation removes the weighted average of neighboring neurons from the current neuron. For a given site $(i, j)$ of the $k$th feature map, it computes

$$v_{kij} = x_{kij} - \sum_{kpq} w_{pq} \cdot x_{k,i+p,j+q}, \tag{9}$$

where $w_{pq}$ is a Gaussian weighting window (of size 9 × 9 in this work) normalized so that $\sum_{kpq} w_{pq} = 1$. Based on the result of subtractive normalization, the divisive normalization computes $y_{kij} = v_{kij} / \max(\tau, \sigma_{ij})$, where $\sigma_{ij} = \left(\sum_{kpq} w_{pq} \cdot v_{k,i+p,j+q}^2\right)^{1/2}$. In our experiments, the constant $\tau$ is set to $\mathrm{mean}(\sigma_{ij})$.

As mentioned above, we can obtain $(u - w + 1) \times (u - w + 1) \times K$ feature maps through the convolution operation for a given image. Local subtractive and divisive normalizations are performed over these $K$ feature maps by the local contrast normalization layer.
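The two normalization steps can be sketched as follows. The Gaussian width, the reflect padding at the borders, and the small guard against division by zero are assumptions made for the sketch; the text fixes only the 9 × 9 window size, the normalization of the weights, and $\tau = \mathrm{mean}(\sigma_{ij})$.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gaussian_window(size=9, sigma=2.0):
    """Odd-sized Gaussian window that sums to 1 (sigma is an assumed value)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def weighted_local_sum(maps, window):
    """sum over k, p, q of w_pq * maps[i+p, j+q, k]; maps: (H, W, K)."""
    K = maps.shape[2]
    r = window.shape[0] // 2
    padded = np.pad(maps, ((r, r), (r, r), (0, 0)), mode="reflect")
    views = sliding_window_view(padded, window.shape, axis=(0, 1))
    # Dividing by K makes the weights sum to 1 over k, p, and q.
    return np.einsum("ijkpq,pq->ij", views, window) / K

def local_contrast_normalize(maps, size=9, sigma=2.0, eps=1e-12):
    window = gaussian_window(size, sigma)
    # Subtractive step, Eq. (9).
    v = maps - weighted_local_sum(maps, window)[:, :, None]
    # Divisive step with tau = mean(sigma_ij).
    sigma_ij = np.sqrt(weighted_local_sum(v ** 2, window))
    tau = sigma_ij.mean()
    denom = np.maximum(np.maximum(tau, sigma_ij), eps)
    return v / denom[:, :, None]
```

Under these assumptions the output is unchanged when the input maps are multiplied by a positive constant or shifted by a constant offset, which is the invariance the layer is meant to provide.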

4.2.3. SPP Fused with Center-Bias Prior. Bias is often highly related to the application task and sometimes can be deliberately used as a prior in a specific task to improve the performance of the task [33]. When humans take pictures, they naturally tend to frame an object of interest near the center of the image. For this reason, we incorporate the center-bias prior in our work, which indicates the distance of each pixel to the center. In particular, this specific prior is generated by a 2D Gaussian heatmap, as shown in Figure 6.

Figure 4: Examples of convolutional feature extraction. (a) is the source image, (b)–(d) are the convolution kernels learned by the single-layer DSAE, and (e)–(g) are the features extracted from the source image.

Figure 5: Illustration showing feature extraction using a $w$-by-$w$ convolution kernel and a stride of $s$. The $u$-by-$u$ input image with $F$ channels is mapped by the locally connected DSAE network through convolution to a convolved feature map with $K$ channels and side length $(u - w)/s + 1$.
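A center-bias prior of this kind can be generated as a 2D Gaussian heatmap. In the sketch below the standard deviation is set to a quarter of each image dimension and the peak is scaled to 1; both choices are assumptions, as the text does not give the parameters of the Gaussian.

```python
import numpy as np

def center_bias_prior(height, width, sigma_frac=0.25):
    """2D Gaussian heatmap peaking at the image center.

    sigma_frac: standard deviation as a fraction of each dimension
    (an assumed value for illustration).
    """
    ys = np.arange(height) - (height - 1) / 2.0
    xs = np.arange(width) - (width - 1) / 2.0
    sy, sx = sigma_frac * height, sigma_frac * width
    prior = np.exp(-(ys[:, None] ** 2 / (2.0 * sy ** 2)
                     + xs[None, :] ** 2 / (2.0 * sx ** 2)))
    return prior / prior.max()  # largest weight is 1 at the center
```

Pixels near the center receive weights close to 1, and the weight decays smoothly towards the borders.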

SPP (aka spatial pyramid matching) is an extension of the Bag-of-Words (BoW) model, which is one of the key methods in computer vision. SPP has long been an important component in the competition-winning and leading models for image classification [32]. After obtaining features using local contrast normalization as described earlier, SPP partitions the feature map into divisions from finer to coarser levels. The coarsest pyramid level has a single bin that covers the entire feature map. Figure 7(a) illustrates an example configuration of 3-level pyramid pooling (3 × 3, 2 × 2, and 1 × 1) in our method. In each spatial bin of every pyramid level, we pool the responses of each feature map (throughout this paper we use mean pooling). The bin sizes can be precomputed for spatial pyramid pooling. After local contrast normalization, the feature maps have a size of $a \times a$. With a pyramid level of $h \times h$ bins, we implement this pooling level as a sliding window pooling, where the window size is win $= \lceil a/h \rceil$ and the stride is str $= \lfloor a/h \rfloor$ ($\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denote ceiling and floor operations). With a 3-level pyramid, we implement 3 such layers. The output of each pyramid pooling level is a $KM$-dimensional vector, with the number of bins denoted as $M$ ($K$ is the number of feature maps in the local contrast normalization layer). In our work, we use a 3-level pyramid (3 × 3, 2 × 2, and 1 × 1). So level$_{3 \times 3}$, level$_{2 \times 2}$, and level$_{1 \times 1}$ will generate $9K$-, $4K$-, and $K$-dimensional vectors, respectively.

According to the sizes of the three vectors mentioned above, we generate the corresponding center-bias prior feature maps. Then we reshape the feature maps of the three scales into column vectors, which are used for the element-wise product with the three column vectors obtained after SPP. This calculation process is shown in Figure 7 ($\odot$ denotes the element-wise product between vectors in Figure 7). $\ell_2$ vector normalization is usually used to further improve FV performance [57]. The final image representation is then obtained by concatenating the results of all local column vectors from level$_{3 \times 3}$ to level$_{1 \times 1}$ (followed by $\ell_2$ vector normalization). The process of SPP fused with center-bias prior is summarized in Algorithm 2.
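The pooling and fusion steps can be sketched end to end. The sketch uses mean pooling with win $= \lceil a/h \rceil$ and str $= \lfloor a/h \rfloor$ as described above; the Gaussian prior helper, its width, and the way one $h \times h$ prior map is shared by all $K$ feature maps are illustrative assumptions.

```python
import math
import numpy as np

def center_prior(h, sigma_frac=0.25):
    """h x h Gaussian center-bias weights (assumed form, peak scaled to 1)."""
    ax = np.arange(h) - (h - 1) / 2.0
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2.0 * (sigma_frac * h) ** 2))
    return g / g.max()

def pool_level(maps, h):
    """Mean-pool (a, a, K) maps into h x h bins."""
    a, _, K = maps.shape
    win, stride = math.ceil(a / h), a // h
    out = np.empty((h, h, K))
    for i in range(h):
        for j in range(h):
            r, c = i * stride, j * stride
            out[i, j] = maps[r:r + win, c:c + win].mean(axis=(0, 1))
    return out

def l2_normalize(v, eps=1e-12):
    return v / max(np.linalg.norm(v), eps)

def spp_center_bias(maps, levels=(3, 2, 1)):
    """Concatenate prior-weighted, l2-normalized pooled vectors of all levels."""
    parts = []
    for h in levels:
        pooled = pool_level(maps, h)                       # (h, h, K)
        fused = pooled * center_prior(h)[:, :, None]       # element-wise product
        parts.append(l2_normalize(fused.reshape(-1)))      # K*h*h values
    return l2_normalize(np.concatenate(parts))             # length 14 * K
```

With $K$ feature maps, the three levels contribute $9K + 4K + K = 14K$ values to the final representation.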

5. Overall Architecture of Image Classification

This section describes the overall architecture of the proposed method for image classification. Our method consists of four main parts, as shown in Figure 8: (1) obtaining samples for unsupervised feature learning, (2) feature learning, (3) feature extraction, and (4) classification.

(1) First, we adopt the context-aware saliency detection model to compute saliency maps of the image dataset, which are thresholded to get the fixated and unfixated points of the images. We first randomly select one image of the $M$ images and then extract a given percent of the fixations and the nonfixations. For each image, the total number of the fixations and the nonfixations extracted is $N$. This can be represented as a vector in $\mathbb{R}^C$ of the pixel intensity values, with $C = N \times 3$ (the inputs are natural color images). Therefore, a dataset $X \in \mathbb{R}^{C \times M}$ can be constructed.

(2) Then the dataset $X$ is fed into a $K$-hidden-unit network, which is used for unsupervised feature learning of $K$ feature extractors according to the DSAE model.

(3) After the unsupervised feature learning, convolution is utilized to construct a locally connected DSAE network. We can extract appropriate features from the training and testing images using the learned feature extractors. By using the local contrast normalization method, we can increase feature sparsity and improve optimization of the model. SPP fused with center-bias prior is utilized to obtain the final image representation.

Figure 6: (a) A sample of the STL-10 dataset. (b) Center-bias prior feature map.

Figure 7: Illustration of spatial pyramid pooling fused with center-bias prior. (a) Spatial pyramid pooling layer applied to the $K$ local contrast normalized feature maps of size $a \times a$, producing $9K$-, $4K$-, and $K$-dimensional vectors. (b) Multiscale feature maps based on center-bias prior.

(4) Finally, our proposed method is combined with a linear support vector machine (SVM) to predict the label. For multiclass prediction we use LIBLINEAR, an open source library for large-scale linear classification that supports logistic regression and linear SVMs. In our experiments we apply the L2-loss linear SVM to the classification task. The regularization parameter $C$ of the linear SVM classifier is determined by fivefold cross-validation over the range $[2^{-4}, 2^{-3}, \ldots, 2^{6}]$. A detailed description can be found in [59].
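The parameter search in step (4) can be sketched as follows. This is an illustrative outline only: it assumes the grid consists of consecutive integer powers of two, uses contiguous folds, and leaves the actual SVM training to a caller-supplied `evaluate` function (a hypothetical placeholder for a LIBLINEAR call, which is not shown here).

```python
def c_grid(lo=-4, hi=6):
    """Candidate values 2^-4, 2^-3, ..., 2^6 (integer exponents are an assumption)."""
    return [2.0 ** k for k in range(lo, hi + 1)]

def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous (train, validation) index pairs."""
    folds = []
    for i in range(k):
        start, stop = i * n_samples // k, (i + 1) * n_samples // k
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train, val))
    return folds

def select_c(n_samples, evaluate, grid=None, k=5):
    """Return the C with the highest mean validation score.

    evaluate(C, train_idx, val_idx) is supplied by the caller; it should train
    a linear SVM on train_idx and return the accuracy on val_idx.
    """
    grid = grid or c_grid()
    folds = k_fold_indices(n_samples, k)

    def mean_score(c):
        return sum(evaluate(c, tr, va) for tr, va in folds) / k

    return max(grid, key=mean_score)
```

The selected $C$ would then be used to retrain the classifier on the full training set before testing.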

6. Experimental Setup and Results

All experiments were conducted on a computer with an Intel Core i5-4430 CPU (3.00 GHz, 3.20 GHz), Windows 7, and MATLAB R2015b. To speed up the experiments, we used the MATLAB Parallel Computing Toolbox.

In this section we first describe the datasets used for the experiments and give the detailed parameter settings of the proposed method. STL-10 [30] and CIFAR-10 [60] are standard datasets for unsupervised feature learning and


(1) Input:
(2)   An input image $I$
(3)   $K$ feature maps after the local contrast normalization layer
(4) Procedure:
(5)   Generate center-bias prior feature maps based on $I$
(6)   FOR $h := 1$ to $3$ DO
(7)     For the current pyramid level of $h \times h$ bins, compute win $= \lceil a/h \rceil$ and str $= \lfloor a/h \rfloor$
(8)     Implement this pooling level and output the $K \times h \times h$-dimensional vector $\xi_{h \times h}$
(9)     Reshape the center-bias prior feature map to generate the column vector $H_{h \times h}$
(10)    $f_{h \times h} \leftarrow \xi_{h \times h} \odot H_{h \times h}$
(11)    $f_{h \times h} \leftarrow f_{h \times h} / \|f_{h \times h}\|_2$
(12)  ENDFOR
(13)  Concatenate $f_{h \times h}$ ($1 \le h \le 3$) to form the final spatial pyramid representation $f$
(14)  $f \leftarrow f / \|f\|_2$
(15) Output: $f$

Algorithm 2: The pipeline of SPP fused with center-bias prior.

deep learning networks. Figure 9 shows ten examples from each image set. In this part, the classification results of different models on these two datasets are presented and analyzed. In the next part, the main techniques used in our model are evaluated on these two datasets.

6.1. Experiment and Results Analysis of STL-10 Dataset. The STL-10 dataset is a natural image set for developing deep learning and unsupervised feature learning algorithms. Each class has 500 training images and 800 testing images. The primary challenge is the small number of labeled training examples (100 per class for each training fold). An additional 100,000 unlabeled images are provided for unsupervised learning. The dataset contains ten classes, (1) airplane, (2) car, (3) bird, (4) cat, (5) dog, (6) deer, (7) horse, (8) monkey, (9) ship, and (10) truck, at a resolution of 96 × 96. Figure 9(a) shows some examples from the STL-10 dataset, which can be obtained at http://cs.stanford.edu/~acoates/stl10. We follow the standard setting in [30, 61]: (1) performing unsupervised feature learning on the unlabeled data, (2) performing supervised learning on the labeled data using the ten predefined folds of the training data (100 examples per class in each fold), and (3) reporting the average accuracy on the full test set.

First, we used the context-aware saliency detection method to compute saliency maps for the 100,000 unlabeled images of STL-10. The saliency maps were thresholded so that a given percentage of the image pixels were classified as fixations and the rest as nonfixations. To sample the fixated and nonfixated points, we followed the method of Judd et al. [35]. We chose samples from the top 5% and bottom 30% of the saliency map so that the samples were strongly positive and strongly negative, and we avoided samples on the boundary between the two. We did not choose any samples within 5 pixels of the border of the unlabeled images. We experimentally set the total number of sample points in each image to 64, and in each image the ratio of negative to positive samples was set to 1:4. Because the images of the STL-10 dataset are RGB images, the pixel intensity values of all the samples collected from each image were expressed as a column vector in $\mathbb{R}^{192}$ ($192 = 64 \times 3$). The pixel intensity values were stored in row-major order, one channel at a time: the first 64 values were the red channel, the next 64 were green, and the last 64 were blue. A dataset $X \in \mathbb{R}^{192 \times 100000}$ was thus constructed and subsequently used to train the DSAE.
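The sampling procedure just described can be sketched as below. The function names are hypothetical, and several details are assumptions made only for illustration: the 5% and 30% thresholds are computed over all pixels of the saliency map, the 1:4 negative-to-positive ratio is rounded to 13 negatives and 51 positives for 64 points, and points are drawn uniformly at random from the eligible pixels.

```python
import random

def sample_points(saliency, n_total=64, top=0.05, bottom=0.30, margin=5, seed=0):
    """Pick n_total (y, x) locations: positives from the most salient pixels,
    negatives from the least salient ones, skipping a border of `margin` pixels."""
    h, w = len(saliency), len(saliency[0])
    ranked = sorted(((saliency[y][x], y, x) for y in range(h) for x in range(w)),
                    reverse=True)
    n_top, n_bottom = int(top * len(ranked)), int(bottom * len(ranked))

    def interior(y, x):
        return margin <= y < h - margin and margin <= x < w - margin

    positives = [(y, x) for _, y, x in ranked[:n_top] if interior(y, x)]
    negatives = [(y, x) for _, y, x in ranked[len(ranked) - n_bottom:] if interior(y, x)]

    n_neg = round(n_total / 5)      # negative : positive = 1 : 4 (rounded)
    n_pos = n_total - n_neg
    rng = random.Random(seed)
    # rng.sample raises ValueError if too few eligible pixels are available.
    return rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)

def to_channel_major(image, points):
    """image[y][x] = (r, g, b). Returns all red values, then green, then blue."""
    return [image[y][x][c] for c in range(3) for (y, x) in points]
```

For a 96 × 96 image and 64 points, `to_channel_major` returns a 192-dimensional vector, which would form one column of $X$.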

At present there is no complete theoretical basis for selecting the structure of a DSAE model, so we determined the network structure experimentally. To measure classification performance on the STL-10 dataset, we first compared the classification accuracies obtained with different numbers of features and different sparsity parameter values, varying both over a wide range to study their sensitivity. Figure 10 shows the respective results. To evaluate the classification performance under different feature numbers, we considered representations of 400, 600, 800, 1000, and 1200 learned features. Figure 10(a) shows the effect of increasing the number of learned features. The analysis indicated that a feature size of 1000 produced good accuracy on this dataset, so we fixed the feature size at 1000 when tuning the sparsity value. Figure 10(b) shows the accuracy over a wide range of sparsity values; the best classification performance was obtained at a sparsity value of 0.02. The other hyperparameters were set as follows: InputZeroMaskedFraction = 0.5, lambda = 0.003, beta = 5, and convolutional kernel size = 8 × 8 × 3. We tried two different 3-level pyramids, (3 × 3, 2 × 2, 1 × 1) and (4 × 4, 2 × 2, 1 × 1); the former achieved better accuracy. In the SVM training we intentionally did not use any data augmentation (multiview or flip). $\ell_2$ vector normalization was applied to the features for SVM training.
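To make the role of these hyperparameters concrete, the sketch below evaluates a denoising sparse autoencoder cost on a small batch. It is an assumption-laden illustration rather than the paper's exact objective: it uses the standard sparse autoencoder cost from the lecture notes in [17] (squared reconstruction error, weight decay weighted by lambda, and a KL sparsity penalty weighted by beta) together with masking noise as in [18], sigmoid units in both layers, and hypothetical function names.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mask_input(x, fraction, rng):
    """Masking noise: each component is set to zero with probability `fraction`."""
    return [0.0 if rng.random() < fraction else v for v in x]

def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def dsae_cost(X, W1, b1, W2, b2, rho=0.02, lam=0.003, beta=5.0,
              mask_fraction=0.5, seed=0):
    """Cost over a batch X (list of input vectors).
    W1: K x C encoder weights, W2: C x K decoder weights."""
    rng = random.Random(seed)
    K = len(W1)
    recon, mean_act = 0.0, [0.0] * K
    for x in X:
        x_noisy = mask_input(x, mask_fraction, rng)          # corrupt the input
        hidden = [sigmoid(sum(w * v for w, v in zip(W1[j], x_noisy)) + b1[j])
                  for j in range(K)]
        out = [sigmoid(sum(W2[i][j] * hidden[j] for j in range(K)) + b2[i])
               for i in range(len(x))]
        recon += 0.5 * sum((o - v) ** 2 for o, v in zip(out, x))  # clean target
        mean_act = [m + a / len(X) for m, a in zip(mean_act, hidden)]
    decay = 0.5 * lam * (sum(w * w for row in W1 for w in row)
                         + sum(w * w for row in W2 for w in row))
    sparsity = beta * sum(kl_divergence(rho, r) for r in mean_act)
    return recon / len(X) + decay + sparsity
```

In this form, a smaller sparsity target `rho` pushes the average hidden activations toward zero, and `mask_fraction` plays the role of InputZeroMaskedFraction.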


Figure 8: Overall architecture of the proposed method with all the bells and whistles. Training and testing images pass through the same stages: obtaining training samples and DSAE training (a single-layer DSAE model), obtaining image features (locally connected DSAE network through convolution, local contrast normalization, and SPP fused with center-bias prior), and classification (SVM classifier with optimal parameter).

Figure 9: Samples of the two image datasets used in our experiments. (a) STL-10. (b) CIFAR-10.


Figure 10: The effect of the feature number and sparsity parameter value on the classification accuracy (%) with the STL-10 dataset. (a) Feature number varied over a wide range of sizes (400 to 1200). (b) Sparsity parameter value varied over a wide range (0.02 to 0.2).

Table 1: Comparison of average test accuracies (%) on all folds of STL-10.

Method                                     Accuracy (%)
ICA (complete) [38]                        48.0 ± 1.47
Random weight baseline [38]                50.2 ± 1.08
$K$-means (triangle) [30]                  51.5 ± 1.73
3 layer features from CDBN + SVM [39]      51.10
Our method                                 51.8 ± 0.01

We then compare the performance of our method with previous studies on this dataset. The classification accuracies are listed in Table 1. The state-of-the-art results for STL-10 can be improved by augmenting the training set with flips and other methods; we have not done so here, and we report results only for methods that do not do so either. As Table 1 shows, our single-layer model outperforms the $K$-means clustering algorithm of [30], a classic single-layer network that achieved high performance on image classification. Moreover, compared with the 3-layer features from CDBN + SVM [39], our shallow model shows strength in simplicity and effectiveness.

6.2. Experiment and Results Analysis of CIFAR-10 Dataset. We applied the full pipeline to CIFAR-10, which can be regarded as a lower-resolution counterpart of the STL-10 images (32 × 32 pixels). The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images in ten classes (i.e., airplane, bird, automobile, deer, cat, frog, dog, ship, horse, and truck). These classes are completely mutually exclusive. Figure 9(b) shows some examples from this dataset. Because CIFAR-10 has a lower resolution than STL-10, we set the total number of sample points in each image to 36 and the convolutional kernel size to 6 × 6. All other parameters, including inputZeroMaskedFraction, lambda, and beta, were the same as for STL-10. As before, we first compared the classification performance for varied feature numbers and sparsity parameter values. Figure 11 shows the classification accuracies at

Table 2: Comparison of accuracy (%) of the methods on CIFAR-10 with no data augmentation.

Method                                          Accuracy (%)
L2 sparse filtering [40]                        63.89
3-way factored RBM (3 layers) [41]              65.30
Mc RBM (3 layers) [42]                          71.00
Tiled CNN [43]                                  73.10
Stochastic pooling ConvNet [44]                 84.87
Deep networks with stochastic depth [45]        95.09
Our method                                      74.18

different feature numbers and sparsity parameter values. The results indicated that a feature size of 1200 produced the best accuracy on this dataset, so for all experiments we fixed the feature number at 1200 when tuning the sparsity parameter. To evaluate the classification performance with different sparsity parameter values, we measured the overall classification accuracy for values ranging from 0.01 to 0.2. The analysis showed that a sparsity parameter value of 0.02 produced high accuracy on CIFAR-10. As with STL-10, small values of the sparsity parameter and large feature sizes resulted in high accuracy. This is mainly because the two datasets contain images of similar complexity, with CIFAR-10 resembling a lower-resolution counterpart of STL-10.

We now compare our final test results with some of the best published results on CIFAR-10. The comparison is provided in Table 2. Our method is less accurate than the state-of-the-art supervised method [45], which increases the depth of residual networks to beyond 1200 layers, a very large number. Although the performance of our fully unsupervised and extremely simple CDSAE falls short here, there is much room to exploit network depth. Meanwhile, we still believe that our model has merits of its own. In particular, owing to its simple architecture, it does not require modern computers with state-of-the-art GPUs or very large clusters [62] to be trained.


Figure 11: The effect of the feature number and sparsity parameter value on the classification accuracy (%) with the CIFAR-10 dataset. (a) Feature number varied over a wide range of sizes (600 to 1200). (b) Sparsity parameter value varied over a wide range (0.01 to 0.2).

Our simple network has the advantage that information can flow efficiently forward and backward, so it can be trained effectively and within a reasonable amount of time. It also has few hyperparameters to tune compared with increasingly complex deep models, whereas ever deeper CNN architectures have much greater model complexity and are very difficult to train in practice. Finally, our method is a fully unsupervised feature learning method, which, though currently underperforming, remains an appealing paradigm: it can make use of raw unlabeled images, which are readily available in virtually unlimited amounts. Last but not least, our model incorporates the theory of saliency detection and center prior in computer vision, which are not included in the papers listed in Table 2. The accuracy on CIFAR-10 is much higher than that on the comparable STL-10 (51.8% ± 0.1), whose labeled training set is small. This indicates that the proposed model is strong when large labeled training sets are available, as with CIFAR-10.

6.3. Analysis of Computational Complexity. To show that our method has a lower computational cost than some state-of-the-art methods, we focus on two representative baselines: Stochastic pooling ConvNet [44] and Deep networks with stochastic depth [45]. Both are state-of-the-art CNNs that outperform our method in classification accuracy. We compare their computational complexity with ours. The computational complexity mainly comprises two parts: (1) the complexity of the optimizer and (2) the complexity of the convolutions.

Le et al. [63] introduce three off-the-shelf optimization algorithms: Stochastic Gradient Descent (SGD), Limited memory BFGS (L-BFGS), and Conjugate Gradient (CG). Our proposed method is implemented with L-BFGS, whereas SGD is used for training in [44, 45]. In [64] the authors show that the computational cost of SGD is $O(n)$ per iteration, where $n$ denotes the number of variables in the system (the same definition of $n$ is used below). They also conclude that L-BFGS reduces the cost of BFGS to $O(mn)$ per iteration, where $m$ is the number of updates stored by L-BFGS and is specified by the user [65]. In practice one would rarely wish to use $m$ greater than 15, and typical empirical values of $m$ are 5, 7, and 9 [58]. Because $m$ is much smaller than the very large number of variables $n$, the computational cost of L-BFGS reduces to linear complexity $O(n)$.
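The argument above can be restated as a short worked bound. This is only a restatement of the reasoning in the text, using the practical upper limit $m \le 15$ quoted above; it is not taken from [64, 65].

```latex
\[
C_{\mathrm{SGD}}(n) = O(n), \qquad
C_{\mathrm{L\text{-}BFGS}}(n) = O(mn), \qquad
m \le 15 \;\Rightarrow\; mn \le 15\,n \;\Rightarrow\; C_{\mathrm{L\text{-}BFGS}}(n) = O(n).
\]
```

Both optimizers are therefore linear in $n$ per iteration, although L-BFGS carries a constant factor of roughly $m$.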

We now turn to the complexity of the convolutions. He and Sun [66] give the theoretical time complexity of all convolutional layers, which can be represented as

$$O\left(\sum_{g=1}^{G} t_{g-1} \cdot h_g^{2} \cdot t_g \cdot \kappa_g^{2}\right), \qquad (10)$$

where $g$ is the index of a convolutional layer and $G$ is the number of convolutional layers; $t_g$ is the number of filters in the $g$th layer, and $t_{g-1}$ is the number of input channels of the $g$th layer; $h_g$ is the spatial size (side length) of the filter; and $\kappa_g$ is the spatial size of the output feature map. The fully connected layers and pooling layers often take 5–10% of the computational time, so their cost is not included in the above formulation. Our comparison follows this benchmark.

Table 3 briefly lists the overall complexity of the compared algorithms (for one iteration).

In the following we analyze the complexity of these models in detail.

(1) Stochastic pooling ConvNet [44] has 3 convolutional layers with 5 × 5 filters and 64 filter banks per layer. All of the pooling layers summarize a 3 × 3 neighborhood and use a stride of 2. The authors use a single fully connected layer with soft-max outputs to produce the network's class predictions. As noted above, the computational cost of SGD is $O(n)$ per iteration. The number of variables $n$ is estimated as 13M parameters (this number includes the parameters of the convolutional layers and the fully connected layer). Based on the description of the model in [44], we obtain $O(\sum_{g=1}^{G} t_{g-1} \cdot h_g^{2} \cdot t_g \cdot \kappa_g^{2})$ = 2.1119e+09 ($G = 3$).

(2) Deep networks with stochastic depth [45] use the architecture described by He et al. [67] and increase the depth of the residual network to 1202 layers while still yielding meaningful improvements on CIFAR-10. The residual network with 1202


Table 3: Optimizer utilized and the total complexity of the models.

Model                   Optimizer   Complexity                                                                      Remarks
ConvNet [44]            SGD         $O(n) + O(\sum_{g=1}^{G} t_{g-1} \cdot h_g^{2} \cdot t_g \cdot \kappa_g^{2})$   $G = 3$, $n$ is 13M
Stochastic depth [45]   SGD         $O(n) + O(\sum_{g=1}^{G} t_{g-1} \cdot h_g^{2} \cdot t_g \cdot \kappa_g^{2})$   $G = 1202$, $n$ is 19.4M
Ours                    L-BFGS      $O(n) + O(t_{g-1} \cdot h_g^{2} \cdot t_g \cdot \kappa_g^{2})$                  $G = 1$, $n$ is 0.013M

layers has 19.4M parameters [67]. However, because the relevant parameter settings are not specified, we can only evaluate the complexity of its convolutional layers qualitatively. The authors of [45] further demonstrated that very deep networks have much greater model complexity, are very difficult to train in practice, and require a lot of time. Intuitively, we can infer that the complexity of this 1202-layer network is much higher than that of our single-layer model.

(3) In our proposed method, the dimensions of the input vector and the feature are 108 and 1200, respectively. In addition, we used 6 × 6 filters for convolution on CIFAR-10. L-BFGS is used to train our network, with $m \ll n$. The number of variables $n$ is calculated as 0.013M parameters. Comparing the $O(n)$ terms therefore suggests that the complexity of our optimizer is lower than that of [44, 45]. For the convolutional complexity of our model, we calculated the computational cost as $O(\sum_{g=1}^{G} t_{g-1} \cdot h_g^{2} \cdot t_g \cdot \kappa_g^{2}) = O(t_{g-1} \cdot h_g^{2} \cdot t_g \cdot \kappa_g^{2})$ = 9.4478e+07 ($G = 1$).
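The convolutional cost in (10) is straightforward to evaluate once the layer shapes are known. The sketch below is an illustrative helper (the function name is hypothetical); for our single layer it assumes 3 input channels, 6 × 6 filters, 1200 features, and an output map of size $\kappa = 32 - 6 + 1 = 27$, that is, a stride-1 convolution without padding on 32 × 32 CIFAR-10 images. Under that assumption the result reproduces the figure quoted in (3).

```python
def conv_cost(layers):
    """Sum of t_in * h^2 * t_out * kappa^2 over convolutional layers.

    layers: list of (t_in, h, t_out, kappa) tuples, one per layer, where
    t_in is the number of input channels, h the filter side length,
    t_out the number of filters, and kappa the output map side length.
    """
    return sum(t_in * h * h * t_out * kappa * kappa
               for t_in, h, t_out, kappa in layers)

# Single CDSAE layer on CIFAR-10 (kappa = 27 is an assumption, see lead-in).
ours = conv_cost([(3, 6, 1200, 32 - 6 + 1)])
print(f"{ours:.4e}")  # 9.4478e+07
```

The same helper could be applied to a multi-layer network by listing one tuple per convolutional layer.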

The analysis above indicates that our method has a lower computational cost than the compared state-of-the-art methods.

6.4. Analysis of CDSAE's Properties. In this section we analyze the influence of the techniques and structures designed in the proposed algorithm. The key structures that contribute to the success of our network are the local contrast normalization layer, SPP fused with center-bias prior, and $\ell_2$ vector normalization. We evaluate the impact of each of these three improvements separately.

Impact of Local Contrast Normalization Layer. We start by studying the influence of the local contrast normalization layer, which is a simple but important ingredient for good accuracy on object recognition benchmarks [56]. Local contrast normalization is key to obtaining good results: without it, the accuracy is 50.78% (±0.6) for STL-10 and 72.22% for CIFAR-10. Adding it improves the accuracy by about 1% and 2%, respectively.

Impact of Center-Bias Prior. People use a lot of prior knowledge when interpreting an image, and prior knowledge can be used in a specific task to improve performance [68]. SPP fused with center-bias prior is effective in image classification: it raises the accuracy on STL-10 from 49.18% to 51.8% and on CIFAR-10 from 74.01% to 74.18%. The gain on CIFAR-10 is only slight; an intuitive interpretation is that CIFAR-10 images have a lower resolution than STL-10 images. The center-bias prior is therefore most advantageous on datasets with high resolution.

Impact of $\ell_2$ Vector Normalization. We now evaluate the influence of $\ell_2$ normalization of the high-dimensional vectors before SVM training. $\ell_2$ vector normalization improves performance by about 5% on STL-10 (46.92% → 51.8%) and about 6% on CIFAR-10 (68.31% → 74.18%), a dramatic improvement over no normalization. Our experiments show that these three complementary factors together raise the classification accuracy of CDSAE. They are all indispensable to our model, as accuracy usually drops substantially when one of them is removed.

7. Conclusion and Future Work

In this paper, an arguably simple but effective image classification approach called CDSAE is proposed. It is an improvement of the existing successful networks DAE, SAE, and CNN. CDSAE integrates the following components: combining DAE and SAE to construct DSAE, adding a local contrast normalization layer after the convolution operation, and, most importantly, building a spatial pyramid pooling fused with center-bias prior in a natural way. CDSAE has the advantages of low computational cost and fewer hyperparameters to tune, at the price of reduced performance relative to some state-of-the-art methods.

In our experiments we find that the following are particularly important.

(1) Local Contrast Normalization. It clearly improves performance compared with not using it.

(2) Center-Bias Prior. It effectively captures the center-prior information of datasets and is particularly appropriate for object-centered images with high resolution.

(3) $\ell_2$ Vector Normalization. $\ell_2$ vector normalization is more effective than no normalization.

In future research, more investigation can be done on the proposed framework. Firstly, we plan to extend this approach to learn hierarchical features of images, from low-level to high-level feature representations. Secondly, the center-bias prior is specifically suited to cases where the classified object is in the center of the image; other, more general prior knowledge about images can be introduced into the framework.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Grant nos. 61603326 and 61602400) and the Natural Science Foundation of Yancheng Teachers University (Grant nos. 15YCKLQ011 and 15YCKLQ012).

References

[1] L. Shao, L. Liu, and X. Li, "Feature learning for image classification via multiobjective genetic programming," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 7, pp. 1359–1371, 2014.

[2] H. Yin, X. Jiao, Y. Chai, and B. Fang, "Scene classification based on single-layer SAE and SVM," Expert Systems with Applications, vol. 42, no. 7, pp. 3368–3380, 2015.

[3] Y. Luo, Y. Wen, D. Tao, J. Gui, and C. Xu, "Large margin multi-modal multi-task feature extraction for image classification," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 414–427, 2016.

[4] J. F. Serrano-Talamantes, C. Aviles-Cruz, J. Villegas-Cortez, and J. H. Sossa-Azuela, "Self organizing natural scene image retrieval," Expert Systems with Applications, vol. 40, no. 7, pp. 2398–2409, 2013.

[5] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 1, pp. 886–893, June 2005.

[7] T. Ahonen, A. Hadid, and M. Pietikainen, "Face recognition with local binary patterns," in Proceedings of the European Conference on Computer Vision (ECCV), vol. 3021, pp. 469–481, Prague, Czech Republic, 2004.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.

[9] J. Schmidhuber, "Deep learning in neural networks: an overview," Neural Networks, vol. 61, pp. 85–117, 2015.

[10] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1794–1801, June 2009.

[11] G. E. Hinton, "Deep belief networks," Scholarpedia, vol. 4, no. 5, article 5947, 2009.

[12] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–27, 2009.

[13] K. Fukushima, "Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[14] J. Weng, N. Ahuja, and T. Huang, "Cresceptron: a self-organizing neural network which grows adaptively," in Proceedings of the IJCNN International Joint Conference on Neural Networks, pp. 576–581, Baltimore, MD, USA, 1992.

[15] H. Bourlard and Y. Kamp, "Auto-association by multilayer perceptrons and singular value decomposition," Biological Cybernetics, vol. 59, no. 4-5, pp. 291–294, 1988.

[16] H.-C. Shin, M. R. Orton, D. J. Collins, S. J. Doran, and M. O. Leach, "Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1930–1943, 2013.

[17] A. Ng, Sparse Autoencoder, vol. 72 of CS294A Lecture Notes, 2011.

[18] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, ACM, Helsinki, Finland, July 2008.

[19] J. Xu, L. Xiang, R. Hang, and J. Wu, "Stacked Sparse Autoencoder (SSAE) based framework for nuclei patch classification on breast cancer histopathology," in Proceedings of the 2014 IEEE 11th International Symposium on Biomedical Imaging, ISBI 2014, pp. 999–1002, Beijing, China, May 2014.

[20] P. Vincent, H. Larochelle, I. Lajoie, and P. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[21] J. Masci, U. Meier, D. Ciresan, and J. Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," in Proceedings of the International Conference on Artificial Neural Networks (ICANN), pp. 52–59, Springer, Berlin, Germany, 2011.

[22] E. Nowak, F. Jurie, and B. Triggs, "Sampling strategies for bag-of-features image classification," in Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, Eds., vol. 3954 of Lecture Notes in Computer Science, pp. 490–503, Springer, Berlin, Germany, 2006.

[23] A. A. Salah, E. Alpaydin, and L. Akarun, "A selective attention-based method for visual pattern recognition with application to handwritten digit recognition and face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 420–425, 2002.

[24] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2013.

[25] E. Mazurova, Accuracy of Measurements of Eye-Tracking of a human perception on the screen [Degree thesis], Department of International Business, Arcada - Nylands svenska yrkeshogskola, 2014.

[26] E. Erdem and A. Erdem, "Visual saliency estimation by nonlinearly integrating features using region covariances," Journal of Vision, vol. 13, no. 4, article 11, 2013.

[27] P.-H. Tseng, R. Carmi, I. G. M. Cameron, D. P. Munoz, and L. Itti, "Quantifying center bias of observers in free viewing of dynamic natural scenes," Journal of Vision, vol. 9, no. 7, article 4, 2009.


[28] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, "SUN: a Bayesian framework for saliency using natural statistics," Journal of Vision, vol. 8, no. 7, article 32, 2008.

[29] B. W. Tatler, "The central fixation bias in scene viewing: selecting an optimal viewing position independently of motor biases and image feature distributions," Journal of Vision, vol. 7, no. 14, article 4, 2007.

[30] A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 215–223, 2011.

[31] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2175–2184, 2015.

[32] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.

[33] A. Borji, M. M. Cheng, H. Jiang, and J. Li, "Salient object detection: a survey," https://arxiv.org/abs/1411.5878.

[34] S. Wang, J. Zhang, and C. Zong, "Learning Sentence Representation with Guidance of Human Attention," https://arxiv.org/abs/1609.09189.

[35] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 2106–2113, IEEE, Kyoto, Japan, October 2009.

[36] W. Kienzle, F. A. Wichmann, B. Scholkopf, and M. O. Franz, "A nonparametric approach to bottom-up visual saliency," in Proceedings of the 20th Annual Conference on Neural Information Processing Systems, NIPS 2006, pp. 689–696, MIT Press, Vancouver, Canada, December 2006.

[37] S. Goferman, Z. M. Lihi, and A. Tal, "Context-aware saliency detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 1915–1926, 2012.

[38] J. Ngiam, Z. Chen, and A. S. Bhaskar, "Sparse filtering," Advances in Neural Information Processing Systems, pp. 1125–1133, 2011.

[39] J. Lee, J. H. Lim, H. Choi, and D. S. Kim, "Multiple Kernel Learning with Hierarchical Feature Representations," in Proceedings of the International Conference on Neural Information Processing (ICNIP), pp. 517–524, Springer, Berlin, Germany, 2013.

[40] Z. Yang, L. Jin, D. Tao, S. Zhang, and X. Zhang, "Single-layer Unsupervised Feature Learning with L2 regularized sparse filtering," in Proceedings of the 2nd IEEE China Summit and International Conference on Signal and Information Processing, IEEE ChinaSIP 2014, pp. 475–479, July 2014.

[41] A. Krizhevsky and G. E. Hinton, "Factored 3-way restricted boltzmann machines for modeling natural images," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 621–628, 2010.

[42] M. Ranzato and G. E. Hinton, "Modeling pixel means and covariances using factorized third-order Boltzmann machines," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2010, pp. 2551–2558, San Francisco, Calif, USA, June 2010.

[43] J. Ngiam, Z. Chen, D. Chia, P. W. Koh, Q. V. Le, and A. Y. Ng, "Tiled convolutional neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 1279–1287, Vancouver, Canada, 2010.

[44] M. D. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," in Proceedings of the 1st International Conference on Learning Representations (ICLR), Scottsdale, Ariz, USA, 2013.

[45] G. Huang, Y. Sun, and Z. Liu, "Deep networks with stochastic depth," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 646–661, Springer International Publishing, Amsterdam, Netherlands, 2016.

[46] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.

[47] Y. Boureau, J. Ponce, and Y. Lecun, "A theoretical analysis of feature pooling in visual recognition," in Proceedings of the 27th International Conference on Machine Learning (ICML '10), pp. 111–118, Haifa, Israel, June 2010.

[48] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 2013.

[49] H. H. Aghdam, E. J. Heravi, and D. Puig, "Recognizing Traffic Signs Using a Practical Deep Neural Network," in Proceedings of the Robot 2015: Second Iberian Robotics Conference (ROBOT), pp. 399–410, Springer, Lisbon, Portugal, 2016.

[50] C. Zhang and P. C. Woodland, "Parameterised sigmoid and ReLU hidden activation functions for DNN acoustic modelling," in Proceedings of the 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015, pp. 3224–3228, Dresden, Germany, September 2015.

[51] H Lee C Ekanadham and A Y Ng ldquoSparse deep belief netmodel for visual area V2rdquo in Proceedings of the Advances inneural information processing systems (NIPS) pp 873ndash880 MITPress 2008

[52] S. Kullback and R. A. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics, vol. 22, pp. 79–86, 1951.

[53] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. L. LeCun, “Learning convolutional feature hierarchies for visual recognition,” in Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS ’10), pp. 1090–1098, Curran Associates Inc., Vancouver, Canada, December 2010.

[54] Y. LeCun, B. Boser, J. S. Denker et al., “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.

[55] N. Pinto, D. D. Cox, and J. J. DiCarlo, “Why is real-world visual object recognition hard?” PLoS Computational Biology, vol. 4, no. 1, pp. 0151–0156, 2008.

[56] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV ’09), pp. 2146–2153, Kyoto, Japan, October 2009.

[57] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Fisher networks for large-scale image classification,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 163–171, Curran Associates, South Lake Tahoe, Calif, USA, 2013.

[58] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, vol. 45, no. 3, pp. 503–528, 1989.

[59] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.


[60] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images [M.S. thesis], Department of Computer Science, University of Toronto, 2009.

[61] A. Coates and A. Y. Ng, “Selecting receptive fields in deep networks,” in Proceedings of the Advances in Neural Information Processing Systems (NIPS), pp. 2528–2536, Granada, Spain, 2011.

[62] Q. V. Le, “Building high-level features using large scale unsupervised learning,” in Proceedings of the 38th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’13), pp. 8595–8598, Vancouver, Canada, May 2013.

[63] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng, “On optimization methods for deep learning,” in Proceedings of the 28th International Conference on Machine Learning (ICML ’11), pp. 265–272, Bellevue, Wash, USA, July 2011.

[64] N. N. Schraudolph, J. Yu, and S. Gunter, “A stochastic quasi-Newton method for online convex optimization,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 436–443, San Juan, Puerto Rico, 2007.

[65] T. N. Sainath, L. Horesh, B. Kingsbury, A. Y. Aravkin, and B. Ramabhadran, “Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling,” in Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2013, pp. 303–308, Olomouc, Czech Republic, December 2013.

[66] K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’15), pp. 5353–5360, Boston, Mass, USA, June 2015.

[67] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778, Las Vegas, Nev, USA, July 2016.

[68] H. Z. Ai and Y. C. Su, Image Processing, Analysis, and Machine Vision, Tsinghua University Press, Beijing, China, 2011.


expensive to obtain. Stacked AE is another notable learning method, which exploits a particular type of neural network, the AE (also called autoassociator [15]), as a component or monitoring device. It can be effectively used for unsupervised feature learning on a dataset for which labeled samples are difficult to obtain [16]. Beyond simply learning features with an AE, there is a need to reinforce the sparsity of weights and to increase robustness to noise. Ng [17] introduced the sparse autoencoder (SAE), which is a variant of AE. Sparsity is a useful constraint when the number of hidden units is large: in an SAE, only very few neurons are active. Another variant of AE, the denoising autoencoder (DAE) [18], minimizes the error in reconstructing the input from a stochastically corrupted transformation of the input. The stochastic corruption process consists of randomly setting some of the inputs (as many as half of them) to zero. Comparative experiments clearly show the advantage of DAE on a pattern classification benchmark suite.
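The corruption step used by a DAE can be illustrated with a short sketch. It assumes masking noise applied independently to each input entry with a fixed probability; the names `mask_corrupt` and `drop_prob` are illustrative and are not taken from [18].

```python
import random

def mask_corrupt(x, drop_prob=0.5, rng=None):
    """Return a copy of x in which each entry is set to 0 with probability drop_prob."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < drop_prob else v for v in x]

rng = random.Random(0)                      # fixed seed so the run is repeatable
clean = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]      # toy input vector (made up)
noisy = mask_corrupt(clean, drop_prob=0.5, rng=rng)
print(noisy)                                # some entries are zeroed, the rest are unchanged
```

A DAE would then be trained to reconstruct `clean` from `noisy`, so the learned features cannot rely on any single input entry.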

With the development of deep learning, AE and its variants are widely used in the field of image recognition. Xu et al. [19] presented a stacked SAE for nuclei patch classification on breast cancer histopathology. They extracted two classes of 34 × 34 patches from the histopathology images: nuclei and nonnuclei patches. These two kinds of patches were used to construct the training set and the testing set. The authors of [20] proposed a method called stacked DAE, based on [18], which is a straightforward variation on the stacked ordinary AE. The stacked DAE was tested on the MNIST dataset, which contains 28 × 28 gray-scale images. As with the method based on stacked SAE, the training and testing data fed into the models are relatively low in resolution, such as small image patches and low-resolution images (e.g., handwritten digits). Both SAE and DAE are ordinary fully connected networks, which cannot scale well to realistically sized high-dimensional inputs (e.g., 256 × 256 images) in terms of computational complexity [21]. They both ignore the 2D image structure.

In order to overcome these limitations, this paper introduces an approach called CDSAE (convolutional denoising sparse autoencoder) that scales well to high-dimensional inputs. This approach integrates the advantages of SAE, DAE, and CNN. This hybrid structure forces our model to learn more abstract and noise-resistant features, which helps to improve the model's representation learning performance. CDSAE can map images to feature representations without any label information, while CNN requires large quantities of labeled data. Besides, it differs from conventional SAE and DAE in that its weights are shared among all locations in the input images, and thus it preserves spatial locality.

Besides the feature extraction mentioned above, the sampler is another critical component, which has a great influence on the results. Ideally, it should focus attention on the image regions that are the most informative for classification [22]. Recently, selective attention models have drawn a lot of research attention [23, 24]. The idea in selective attention is that not all parts of an image give us information. If we can attend only to the relevant parts, we can recognize the image more quickly and using fewer resources [23]. People place an object on the fovea with fixations when the gaze is concentrated on the object and get most information through fixations [25]. Compared to the traditional approaches, which use a random sampling strategy, we introduce a strategy that samples fixations from the image and is inspired by human selective attention. Moreover, studies on human eye fixations demonstrate that there is a tendency in humans to look towards the image center, which is called the center bias [26]. It is worth mentioning that incorporating a center-bias prior into saliency estimation has previously been investigated by a number of researchers [27–29]. In our work, the center-bias prior is incorporated into SPP in our image classification model.
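As a rough illustration of saliency-driven sampling, the sketch below splits pixel locations into fixation and nonfixation candidates by thresholding a saliency map. The threshold rule, the name `split_fixations`, and the toy map are assumptions made for illustration only; they are not the paper's saliency model or selection rule.

```python
def split_fixations(saliency, threshold=0.5):
    """Split (row, col) locations into fixation / nonfixation lists by saliency value."""
    fixations, nonfixations = [], []
    for r, row in enumerate(saliency):
        for c, s in enumerate(row):
            # Locations at or above the (assumed) threshold count as fixations.
            (fixations if s >= threshold else nonfixations).append((r, c))
    return fixations, nonfixations

# Toy 3 x 3 saliency map with values in [0, 1] (made up).
saliency = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.6],
    [0.0, 0.4, 0.2],
]
fix, nonfix = split_fixations(saliency, threshold=0.5)
print(fix)
```

Patches centered on the fixation locations would then serve as training samples, in place of patches drawn at random positions.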

To summarize, the key contributions of this paper are as follows:

(1) A sampling strategy based on eye fixations is proposed, which is inspired by the human visual system. The fixation points and nonfixation points of images can be obtained by utilizing a saliency detection model.

(2) A CDSAE model with a local contrast normalization operation is proposed. In this overall model, a single-layer DSAE is used for unsupervised feature learning, which can effectively extract features without using any labeled data. Compared to conventional deep models, a single-layer DSAE has a smaller computational learning cost and fewer hyperparameters to tune.

(3) An SPP incorporating a center-bias prior is proposed. This not only maintains spatial information by pooling in local spatial bins but also fully utilizes prior knowledge of the image dataset. To the best of our knowledge, this is the first work that absorbs prior knowledge for pooling in image classification.

The remainder of this paper is organized as follows. In Section 2, we review related work in the literature. Section 3 introduces a sampling strategy based on the human visual attention system. Section 4 describes CDSAE, and Section 5 provides the overall classification framework. The details of our experiments and the results are presented in Section 6, followed by a discussion and future work.

2. Related Work

Other researchers have also made headway on constructing the convolutional autoencoder (CAE), an unsupervised feature extractor that can scale well to high-dimensional input images. Masci et al. [21] propose a kind of CAE which directly takes the high-dimensional image data as input by training the AE convolutionally. Though this convolutional structure can preserve the local relevance of the inputs, training the AE convolutionally is not easy. To address this problem, Coates et al. [30] first extract patches from the input images and use patch-wise training to optimize the weights of a basic SAE in place of convolutional training. Besides, they further propose that even with a single-layer network in unsupervised feature learning it is possible to achieve











$$d\left(p_i^{r}, q_l^{r}\right) \tag{2}$$














[Figure: recoverable labels are "LCN", "SPP fused with center-bias prior", and the dimension symbols M and K.]




$$\alpha_i = f(x_i) = l(W_1 x_i + b_1) \tag{3}$$

$$y = \begin{cases} x, & \text{if } x \ge 0 \\ \varpi x, & \text{if } x \le 0 \end{cases} \tag{4}$$

$$z_i = W_2 \alpha_i + b_2 \tag{5}$$
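Equations (3) to (5) describe an encoder with a leaky rectifier followed by a linear decoder. The sketch below is one way to write that forward pass in plain Python; the function names, the slope value, and the toy weights are assumptions chosen for illustration, not values from the paper.

```python
def leaky_relu(v, slope=0.01):
    """Equation (4): identity for non-negative inputs, a small slope otherwise."""
    return [x if x >= 0 else slope * x for x in v]

def affine(W, x, b):
    """Compute W x + b for a matrix W given as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def encode(x, W1, b1, slope=0.01):
    """Equation (3): hidden representation alpha = l(W1 x + b1)."""
    return leaky_relu(affine(W1, x, b1), slope)

def decode(alpha, W2, b2):
    """Equation (5): linear reconstruction z = W2 alpha + b2."""
    return affine(W2, alpha, b2)

# Toy example: 2 inputs, 3 hidden units (all numbers are made up).
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0]

x = [1.0, 2.0]
alpha = encode(x, W1, b1, slope=0.1)
z = decode(alpha, W2, b2)
print(alpha, z)
```

In a denoising setting, `encode` would receive the corrupted input while the reconstruction `z` is compared against the clean one.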


$$L(X, Z) = \frac{1}{2} \sum_{i=1}^{M} \lVert x_i - z_i \rVert^{2} + \frac{\lambda}{2} \lVert W \rVert^{2} \tag{6}$$

$$L(X, Z) + \beta \sum_{j=1}^{K} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) \tag{8}$$
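The two objective terms in (6) and (8) can be sketched as follows. The sketch assumes that the weight-decay term sums the squared entries of all weight matrices, that the KL term is the usual divergence between two Bernoulli distributions, and that the average activations passed in lie strictly between 0 and 1; these are assumptions, as are all function and parameter names.

```python
import math

def reconstruction_loss(X, Z, weights, lam):
    """Equation (6): half the squared reconstruction error plus weight decay."""
    sq_err = sum((x - z) ** 2 for xi, zi in zip(X, Z) for x, z in zip(xi, zi))
    decay = sum(w * w for W in weights for row in W for w in row)
    return 0.5 * sq_err + 0.5 * lam * decay

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), both in (0, 1)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparse_objective(X, Z, weights, lam, rho, rho_hats, beta):
    """Equation (8): reconstruction loss plus the weighted sparsity penalty."""
    penalty = sum(kl_bernoulli(rho, r) for r in rho_hats)
    return reconstruction_loss(X, Z, weights, lam) + beta * penalty

# Toy numbers (made up): one sample, one 1 x 2 weight matrix, two hidden units.
X = [[1.0, 2.0]]
Z = [[1.0, 0.0]]
weights = [[[1.0, 1.0]]]
base = reconstruction_loss(X, Z, weights, lam=2.0)
total = sparse_objective(X, Z, weights, lam=2.0, rho=0.1, rho_hats=[0.1, 0.5], beta=3.0)
print(base, total)
```

The penalty is zero for a hidden unit whose average activation equals the target `rho` and grows as the two diverge, which is what pushes most units toward inactivity.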








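[Figure: autoencoder diagram. Recoverable labels: inputs x1 ... xn (shown twice, linked by qD), weights W1 and W2, hidden units y1 ... yk ("Feature"), outputs z1 ... zn ("Output data"), and bias units +1.]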








$$v_{kij} = x_{kij} - \sum_{k,p,q} w_{pq} \cdot x_{k,i+p,j+q} \tag{9}$$
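Equation (9) subtracts a weighted local mean, taken over all feature maps and a spatial window, from every response. The sketch below is a direct, unoptimized rendering in plain Python. It assumes a square window whose weights sum to 1 over the window and all K maps, and it clamps indices at the image border; the border rule and all names are assumptions.

```python
def subtractive_lcn(x, w):
    """Subtract the weighted neighborhood mean (over maps and window) from each value.

    x: list of K feature maps, each an H x W list of lists.
    w: (2r+1) x (2r+1) window weights; summed over the window and the K maps
       they are assumed to total 1.
    """
    K, H, Wd = len(x), len(x[0]), len(x[0][0])
    r = len(w) // 2
    out = [[[0.0] * Wd for _ in range(H)] for _ in range(K)]
    for i in range(H):
        for j in range(Wd):
            mean = 0.0
            for k in range(K):
                for p in range(-r, r + 1):
                    for q in range(-r, r + 1):
                        ii = min(max(i + p, 0), H - 1)   # clamp rows at the border
                        jj = min(max(j + q, 0), Wd - 1)  # clamp columns at the border
                        mean += w[p + r][q + r] * x[k][ii][jj]
            for k in range(K):
                out[k][i][j] = x[k][i][j] - mean
    return out

# Toy input: K = 2 constant maps of size 3 x 3 and a uniform 3 x 3 window.
K = 2
maps = [[[5.0] * 3 for _ in range(3)] for _ in range(K)]
window = [[1.0 / (9 * K)] * 3 for _ in range(3)]
normalized = subtractive_lcn(maps, window)
print(normalized[0][1][1])
```

For constant input the local mean equals the value itself, so the output is zero up to floating-point rounding; only local deviations survive the normalization.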







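[Figure: convolution with the kernel obtained from the DSAE. Recoverable labels: "DSAE", "Convolution kernel", "Convolution", the symbols u, s, and w, and the output size (u − w)/s + 1.]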
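The output-size expression in the convolution figure, read here as (u − w)/s + 1 for an input of side u, a kernel of side w, and a stride s, can be checked with a small helper. The reading of the garbled expression, the helper name, and the kernel sizes in the example are assumptions; the 96 × 96 and 32 × 32 inputs match the image sizes of STL-10 and CIFAR-10.

```python
def conv_output_size(u, w, s=1):
    """Side length of a valid (unpadded) convolution output: (u - w) / s + 1."""
    if w > u:
        raise ValueError("kernel larger than input")
    if (u - w) % s != 0:
        raise ValueError("stride does not tile the input evenly")
    return (u - w) // s + 1

stl_side = conv_output_size(96, 8, 1)     # 96 x 96 input, hypothetical 8 x 8 kernel
cifar_side = conv_output_size(32, 8, 2)   # 32 x 32 input, hypothetical stride 2
print(stl_side, cifar_side)
```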












[Figure: spatial pyramid pooling. Recoverable labels: K feature maps of size a × a; pooled vectors of size K-d, 4 × K-d, and 9 × K-d, each combined through ⊙.]
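The pooled sizes K, 4 × K, and 9 × K in the pooling figure are consistent with a three-level pyramid of 1 × 1, 2 × 2, and 3 × 3 bins per feature map. The sketch below implements plain max pooling over such a pyramid. The choice of max pooling and all names are assumptions, and the center-bias weighting (the ⊙ step) is deliberately left out because its exact form is not recoverable from this section.

```python
def spp_max(fmap, levels=(1, 2, 3)):
    """Max-pool one H x W feature map over n x n grids for each n in levels."""
    H, W = len(fmap), len(fmap[0])
    feats = []
    for n in levels:
        for bi in range(n):
            r0, r1 = bi * H // n, (bi + 1) * H // n      # row range of this bin
            for bj in range(n):
                c0, c1 = bj * W // n, (bj + 1) * W // n  # column range of this bin
                feats.append(max(fmap[r][c]
                                 for r in range(r0, r1)
                                 for c in range(c0, c1)))
    return feats

def spp_features(maps, levels=(1, 2, 3)):
    """Concatenate the pyramid features of all K maps into one vector."""
    return [v for fmap in maps for v in spp_max(fmap, levels)]

# Toy 6 x 6 map whose entries encode their own position.
toy = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = spp_max(toy)
vector = spp_features([toy, toy])   # K = 2 maps -> 2 * (1 + 4 + 9) values
print(len(pooled), len(vector))
```

Each map must be at least as large as the finest grid so that no bin is empty; the output length depends only on K and the pyramid levels, not on the map size.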










(14) $f \leftarrow f / \lVert f \rVert_2$

(15) Output $f$
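Steps (14) and (15) amount to ℓ2 normalization of the final feature vector. A minimal sketch follows; the small constant `eps`, which guards against division by zero, is an addition that does not appear in the algorithm.

```python
import math

def l2_normalize(f, eps=1e-12):
    """Scale f to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in f))
    return [v / max(norm, eps) for v in f]

unit = l2_normalize([3.0, 4.0])
print(unit)
```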








[Figure: classification framework. Recoverable labels: Training, Testing, Training images, Testing images, Feature learning & abstraction, Classification, Decision.]



[Figure: (a) accuracy (%) versus feature number; (b) accuracy (%) versus sparsity.]











[Figure: (a) accuracy (%) versus feature number; (b) accuracy (%), x-axis label missing.]







$O\left(\sum_{g=1}^{G} \cdots\right)$

ConvNet [44], SGD: $O(n) + O\left(\sum_{g=1}^{G} \cdots\right)$























Acknowledgments


References













































































