
Robust and Lightweight Ensemble Extreme Learning Machine Engine Based on Eigenspace Domain for Compressed Learning

Huai-Ting Li, Student Member, IEEE, Ching-Yao Chou, Student Member, IEEE, Yi-Ta Chen, Sheng-Hui Wang, Student Member, IEEE, and An-Yeu Wu, Fellow, IEEE

Abstract— Compressive sensing (CS) is applied to an electrocardiography (ECG) telemonitoring system to address the energy constraint of signal acquisition in sensors. In addition, on-sensor analysis that transmits only abnormal data further reduces the energy consumption. To combine both advantages, "on-CS-sensor-analysis" can be achieved by compressed learning (CL), which analyzes signals directly in the compressed domain. The extreme learning machine (ELM) provides an effective solution to achieve the goal of low-complexity CL. However, a single ELM model has limited accuracy and is sensitive to the quality of data. Furthermore, hardware non-idealities in CS sensors degrade the learning performance. In this work, we propose the ensemble of sub-eigenspace-ELM (SE-ELM), which includes two novel approaches: 1) we develop an eigenspace transformation for compressed noisy data and further utilize a subspace-based dictionary to remove the interference, and 2) a hardware-friendly design for the ensemble of ELM provides high accuracy while maintaining low complexity. Simulation results on ECG-based atrial fibrillation detection show that the SE-ELM achieves the highest accuracy with 61.9% savings in the required multiplications compared with conventional methods. Finally, we implement this engine in TSMC 90 nm technology. The post-layout results show that the proposed CL engine provides competitive area and energy efficiency compared to existing machine learning engines.

Index Terms— Compressed learning, extreme learning machine, machine learning, noise tolerance, very large scale integration (VLSI).

I. INTRODUCTION

WITH the advances in machine learning techniques and wearable devices, real-time healthcare monitoring can be realized by processing streaming physiological signals obtained from different sensors. One of the most important tasks is the electrocardiography (ECG) telemonitoring system [1].

Manuscript received February 24, 2019; revised June 21, 2019 and August 5, 2019; accepted August 29, 2019. This work was supported by the Ministry of Science and Technology (MOST), Taiwan, under Grant MOST 106-2221-E-002-205-MY3 and Grant MOST 106-2622-8-002-013-TA. This article was recommended by Associate Editor D. Comminiello. (Corresponding author: An-Yeu Wu.)

The authors are with the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 106, Taiwan, and also with the Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2019.2940642

Fig. 1. Different methods for low-power front-end design: (a) conventional CS with reconstruction for inference; (b) on-sensor analysis, which only transmits abnormal data after simple analytics; (c) proposed on-CS-sensor-analysis by robust and lightweight inference in the compressed domain without reconstruction.

Since ECG measures the electrical activity of the heart, it has been utilized to diagnose various diseases. In practical situations, this system must be both low-latency and high-accuracy to achieve rapid screening or accurate diagnosis [2]. Furthermore, for wearable devices, small form factor and low energy consumption are two serious issues.

For the low-power front-end design of a telemonitoring system, compressive sensing (CS) is a technique that addresses the energy constraint [3]. As shown in Fig. 1(a), CS acquires the compressed ECG signal, x̂, with fewer measurements than the Nyquist rate requires. Therefore, a CS-based sensor can greatly reduce the sampling rate of the high-speed analog-to-digital converter and reduce the transmission power, achieving a 37.1% extension in node lifetime compared with a conventional compression technique [4]. However, to analyze the original signal, x, we must first reconstruct it with extremely energy-intensive algorithms [5], which have been shown to be impractical for sensor nodes. On the other hand, several recent works have proposed on-sensor analysis that transmits only abnormal data after simple analytics [6]–[8], as shown in Fig. 1(b). In [6], the authors spent extra computation on proper feature extraction to significantly reduce the computation in the support vector machine (SVM) engine. By executing simple preprocessing, feature extraction, and classification in the sensor node, on-sensor analysis can not only reduce the transmission power but also provide timely detection.



To achieve extremely low energy consumption and real-time analysis, in this paper we aim at on-CS-sensor-analysis, a combination of both aforementioned methods. As shown in Fig. 1(c), analyzing signals directly in the compressed domain avoids the huge computational cost of reconstruction. Compressed learning (CL) has recently been developed to perform learning and classification in the measurement domain without reconstructing the signals [9], [10]. To make CL feasible on sensor nodes, a lightweight inference engine is indispensable.

To reduce the energy consumption of each classification from microjoules to nanojoules, the extreme learning machine (ELM), a single-layer feedforward neural network with extremely low complexity and fast training speed [11], has become popular for fast preliminary screening and has been implemented in different ways [7], [12], [13]. ELM provides an efficient way to achieve rapid screening with low energy consumption; however, implementing CL directly on a sensor node with an ELM model faces the following challenges:

1) Learnability degradation caused by hardware non-idealities in CS sensor nodes: The sensors suffer from thermal noise, which can be modeled as white Gaussian noise [15]. In addition, the compressed ECG signals are contaminated by various types of interference [16]. Baseline wander, which complicates the analysis of ECG, is considered a particularly critical problem. The performance of ELM is very sensitive to the quality of data [17]. Therefore, noise and interference are becoming dominant concerns, especially in the compressed domain.

2) Infeasible hardware cost to implement ensemble learning in sensor nodes: Learning directly in the compressed domain with ELM, a very simple machine learning model, suffers from degraded accuracy. By combining multiple ELM models with the AdaBoost algorithm [14], the inference accuracy can be improved to achieve the goal of accurate diagnosis. However, implementing an ensemble of ELMs with previously presented architecture designs incurs a huge area overhead that is infeasible for sensor nodes.

In this work, we propose a robust and lightweight ensemble-of-ELM design for ECG-based atrial fibrillation detection, covering both algorithm and architecture, as shown in Fig. 2. First, we propose the eigenspace-ELM (E-ELM) to improve the learnability of data by utilizing an eigenspace transformation. It reduces the required number of ELM classifiers and thus yields a lightweight model. Furthermore, we observe that the most significant eigenvectors of the interference and of the cleaned ECG signals are almost orthogonal to each other. Exploiting this insight, we propose the sub-eigenspace-ELM (SE-ELM) to make the lightweight learning model robust against interference. More importantly, this eigenspace transformation keeps the inference framework from requiring the high-complexity sparse coding method for data reconstruction. The proposed design can not only achieve CL in noisy situations but is also suitable for VLSI implementation, which makes on-CS-sensor-analysis feasible. With the hardware-friendly design of the ensemble-of-ELM inference engine, we achieve high inference accuracy while maintaining low complexity. Our detailed contributions are as follows:

1) Sub-eigenspace transformation for improving learnability and removing interference directly in the compressed domain: In the off-line training stage, we extract the detrended ECG signal from the original signal. Then, we apply the proposed two-stage principal component analysis (PCA) on the detrended ECG signal to generate the dictionaries. By combining the CS sensing matrix and the dictionaries, in the inference stage we can project compressed signals onto the sub-eigenspace with the most information preserved, as shown in Fig. 2(a). In the sub-eigenspace, the data has a better signal-to-noise ratio (SNR) and needs fewer ELM models to achieve the target accuracy. In the case of white noise plus baseline wander, the proposed SE-ELM achieves the same accuracy with more than 61.9% savings in the required multiplications compared with conventional ELM-based methods.

2) Hardware-friendly design for the inference stage of the eigenspace transformation and ensemble ELM: As shown in Fig. 2(b), to reduce the required multiplications in the ELM inference model, we utilize a linear feedback shift register (LFSR) to generate the random input weights. In addition, a hardware sharing architecture is proposed for the eigenspace transformation and the ELM engine to mitigate hardware cost. We implement this CL engine in TSMC 90 nm technology. The core size is 0.1475 mm² at 5 MHz operating frequency.

The rest of this paper is organized as follows. In Section II, we present the background on the CS and ELM algorithms and introduce related works in this specific topic area. Section III presents the proposed E-ELM and SE-ELM frameworks. Section IV shows the numerical experiments and the analysis of complexity. The hardware architecture and VLSI implementation results are shown in Section V. Finally, we conclude this paper in Section VI.

II. BACKGROUND AND RELATED WORKS

A. Compressive Sensing (CS) [3]

CS measures signals with fewer measurements than the Nyquist rate requires by exploiting the sparsity of signals. Its encoding procedure can be modeled in matrix form as [15]

x̂ = Φ(x + n),   (1)

where x̂ ∈ ℝ^{M×1} is the vector of compressed measurements, Φ ∈ ℝ^{M×N} is the CS sensing matrix, whose entries are independent identically distributed (i.i.d.) samples from the uniform distribution or the normal distribution [18], x ∈ ℝ^{N×1} is the original signal, and n ∈ ℝ^{N×1} is additive noise or interference. This sensing framework is very low-cost and greatly reduces the energy consumption of sensor nodes.
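For illustration, the following NumPy sketch mimics the encoding in (1); the dimensions follow the paper's later setup (N = 512, M = 128), while the signal itself is a random placeholder rather than real ECG data:

```python
import numpy as np

# Sketch of the CS encoding in (1); x is a placeholder, not a real ECG window.
N, M = 512, 128
rng = np.random.default_rng(0)

Phi = rng.choice([-1.0, 1.0], size=(M, N))   # random Bernoulli sensing matrix
x = rng.standard_normal(N)                   # stand-in for an original signal
n = 0.01 * rng.standard_normal(N)            # additive noise / interference

x_hat = Phi @ (x + n)                        # compressed measurements, per (1)
print(x_hat.shape)                           # -> (128,)
```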

As for the reconstruction of x, the underdetermined problem can be solved by sparse coding, since x is sparse in some domain. However, the computational complexity of sparse coding grows exponentially with the signal length N.


Fig. 2. Two parts of the proposed robust and lightweight CL engine design: (a) sub-eigenspace transformation from x̂ to ŝ and (b) hardware-friendly design for the ensemble of ELM models.

Hence, it is impractical to implement the reconstruction in sensor nodes, even with the most efficient approach.

B. Compressed Learning (CL) [9]

From the viewpoint of machine learning, the random projection in (1) can be considered an efficient dimensionality reduction from the data domain, ℝ^N, to the compressed measurement domain, ℝ^M. By utilizing the near-isometry property satisfied by the sensing matrix in CS, Calderbank et al. showed that CL, i.e., learning directly in the measurement domain without paying the cost of reconstruction, is possible [9]. They provided theoretical bounds guaranteeing that the accuracy of a linear-kernel SVM in the measurement domain is close to that in the data domain.

CL not only avoids the reconstruction but also markedly reduces the cost of the learning process, because the dimensionality is smaller (M < N). However, the computational complexity of SVM-based CL is not feasible for resource-limited sensors, because most non-linear feature extraction algorithms cannot be performed in the compressed domain. CL without feature extraction presents significant challenges to realizing on-sensor analysis, because M is still much higher than the dimensionality of the feature space (k ≪ M < N, where k is the number of extracted features). Furthermore, the number of support vectors (SVs) becomes much larger, because the learnability in the compressed domain is lower than that in the feature space. Therefore, instead of SVM, we need another extremely lightweight model whose inference complexity is not directly proportional to the data dimension.

C. Extreme Learning Machine (ELM) [11]

ELM, first proposed by Huang et al., is an algorithm for training a single-hidden-layer feedforward neural network [11]. Unlike gradient-based algorithms that tune each parameter iteratively, the input weights and hidden biases of ELM are chosen arbitrarily, and the output weights are solved as a simple linear system. Therefore, ELM has an extremely high learning speed and tends to provide good generalization performance.

Given D labeled training data (d_i, t_i), where d_i ∈ ℝ^{1×n} is an n-dimensional input feature vector and t_i ∈ ℝ^{1×m} is the label vector of an m-class problem for i from 1 to D, the target matrix T ∈ ℝ^{D×m} consisting of all t_i, an activation function g(x), and the number of hidden neurons L, the training algorithm consists of the following steps:

Step 1: Randomly assign arbitrary a_i = [a_{1i}, a_{2i}, ..., a_{ni}]^T, the input weight vector connecting the input neurons to the i-th hidden neuron, and the bias of the i-th hidden neuron, b_i, for i from 1 to L.

Step 2: Calculate the hidden layer output matrix H ∈ ℝ^{D×L} as

H = [ g(d_1·a_1 + b_1)  ⋯  g(d_1·a_L + b_L)
      ⋮                 ⋱  ⋮
      g(d_D·a_1 + b_1)  ⋯  g(d_D·a_L + b_L) ].   (2)

Step 3: Calculate the output weight matrix β as

β = H†T,   (3)

where H† is the Moore-Penrose generalized inverse of the matrix H.
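As a concrete illustration of Steps 1–3, the following minimal NumPy sketch trains an ELM with a sigmoid activation; the function and variable names are ours, not part of the original design:

```python
import numpy as np

def train_elm(D_mat, T, L, rng):
    """Minimal ELM training sketch following Steps 1-3.
    D_mat: (D, n) training inputs; T: (D, m) targets; L: hidden neurons."""
    n = D_mat.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n, L))      # Step 1: random input weights
    b = rng.uniform(-1.0, 1.0, size=L)           #         and hidden biases
    H = 1.0 / (1.0 + np.exp(-(D_mat @ A + b)))   # Step 2: hidden outputs, (2)
    beta = np.linalg.pinv(H) @ T                 # Step 3: Moore-Penrose, (3)
    return A, b, beta
```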

Next, we examine the structure of the ELM classification engine that concerns us in this work. In real applications, data arrives as an input vector, and we can likewise define three steps of ELM classification for each input vector.

Step 1: Compute the input weighting multiplication as

z = [d 1] · A,   (4)

where A is the input weight matrix consisting of all the a_i and b_i used in training, defined as

A = ( a_11  a_12  ⋯  a_1L
      ⋮     ⋮         ⋮
      a_n1  a_n2  ⋯  a_nL
      b_1   b_2   ⋯  b_L ).   (5)

Step 2: Compute the activation function as

h = g(z).   (6)

Step 3: Compute the output weighting multiplication as

y = h · β.   (7)

Page 4: Robust and Lightweight Ensemble Extreme Learning Machine ...access.ee.ntu.edu.tw/Publications/Journal/2019-ELM.pdf · components analysis (PCA) on detrended ECG signal to generate

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS

In these steps, d is the input vector, z is the input vector of the hidden layer, h is the output vector of the hidden layer, and y is the output vector. For classification, if y_i is the element with the largest value in y, the i-th class is the final result of the classifier.
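A corresponding inference sketch of (4)–(7), again with illustrative names only (the bias is kept as a separate vector rather than an appended row of A):

```python
import numpy as np

def elm_classify(d, A, b, beta):
    """Inference Steps 1-3 for a single input vector d."""
    z = d @ A + b                     # (4), bias handled as a separate vector
    h = 1.0 / (1.0 + np.exp(-z))      # (6), sigmoid as g(.)
    y = h @ beta                      # (7)
    return int(np.argmax(y))          # index of the largest output = class
```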

D. Ensemble ELM With AdaBoost Algorithm

The AdaBoost algorithm, one of the well-known ensemble methods, combines a series of iteratively trained classifiers to form a stronger model [14]. To increase the diversity among classifiers, each classifier is trained on the same data, but each datum receives a different weight, w_j for j from 1 to D, in the cost function. The weight of each datum is scaled up or down according to whether the most recently trained classifier misclassifies it.

In [19], an ELM-based AdaBoost (A-ELM) algorithm is proposed. During each iteration of the boosting algorithm, the least squares problem of solving the output weight β_i in (3) for the i-th ELM model is replaced with the weighted least squares problem

β_i = (H_i^T W_i H_i)^{-1} H_i^T W_i T,   (8)

where W_i consists of the instance weights of the training data in the i-th AdaBoost iteration; it is a D × D diagonal matrix defined as

W_i = ( w_1  0    ⋯  0
        0    w_2  ⋯  0
        ⋮    ⋮    ⋱  ⋮
        0    ⋯    0  w_D ).   (9)

Then, the ensemble weight α_i of each classifier in the AdaBoost algorithm, for i from 1 to C, where C is the number of classifiers, is decided by its weighted training accuracy. A higher weighted training accuracy leads to a larger α_i, i.e., a larger contribution to the final output. After the training algorithm finishes, the final output vector of the ensemble model is computed as

y_ensemble = Σ_{i=1}^{C} α_i y_i.   (10)

Equation (10) is the element-wise weighted sum of the output vectors of all C classifiers.
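The following sketch combines (8)–(10) with a standard AdaBoost weight update; the exact update rule and numerical safeguards in [19] may differ:

```python
import numpy as np

def adaboost_elm(D_mat, T, labels, L, C, rng):
    """Sketch of A-ELM: weighted least squares (8) per round plus a
    conventional AdaBoost instance-weight update."""
    D = D_mat.shape[0]
    w = np.full(D, 1.0 / D)                         # instance weights
    models, alphas = [], []
    for _ in range(C):
        A = rng.uniform(-1.0, 1.0, size=(D_mat.shape[1], L))
        b = rng.uniform(-1.0, 1.0, size=L)
        H = 1.0 / (1.0 + np.exp(-(D_mat @ A + b)))
        W = np.diag(w)                              # (9)
        beta = np.linalg.solve(H.T @ W @ H, H.T @ W @ T)   # (8)
        pred = np.argmax(H @ beta, axis=1)
        err = max(w[pred != labels].sum(), 1e-12)   # weighted error rate
        alpha = 0.5 * np.log((1.0 - err) / err)     # ensemble weight alpha_i
        w *= np.exp(np.where(pred != labels, alpha, -alpha))
        w /= w.sum()                                # renormalize the weights
        models.append((A, b, beta))
        alphas.append(alpha)
    return models, np.array(alphas)                 # combine outputs per (10)
```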

E. Related Works on Noise-Tolerant Machine Learning

Compressed data is sensitive to noise and interference [20]. To mitigate noise and interference in ECG signals, several signal processing techniques have been proposed in the original data domain [21], [22]. However, before applying these techniques, we would first have to reconstruct the compressed data, which should be avoided in the scenario of on-CS-sensor-analysis. Instead, to deal with noise and interference in compressed ECG signals without reconstruction, some general machine learning methods can be applied, such as regularized learning models [17] and data-driven hardware resilience (DDHR) [23].

1) Probabilistic Regularized ELM (PR-ELM) [17]: In [17], PR-ELM is proposed to improve model performance on noisy data and outliers. Regularization is a popular process of introducing additional information to solve an ill-posed problem or to prevent overfitting. By adding a penalty on the model parameters, regularization reduces the model complexity; hence, the model is less likely to fit the noise in the training data. To minimize not only the modeling error variance but also the modeling error mean, the authors of [17] proposed a new objective function for training the ELM model:

min_{β, e_i, ζ}  (1/2)‖β‖² + (γ/D)·(1/2) Σ_{i=1}^{D} (e_i − ζ)² + (C/2)ζ²
s.t.  h_i · β = t_i − e_i,  i = 1, ..., D,   (11)

where e_i is the modeling error, ζ is the global error factor that builds a bridge among the errors, and both γ and C are regularization parameters that can be decided by cross-validation. This method incorporates the distribution of the modeling error into the training process, allowing more robust performance than the traditional ELM model.

2) Data-Driven Hardware Resilience (DDHR) [23]: To obtain an inference model for sensed signals in the presence of errors caused by hardware non-idealities, Zhuo et al. proposed the DDHR approach, which utilizes data-driven training to construct an error-aware model [23]. With DDHR, the effects of interference can be regarded as altering the statistics of the sensed signals in the form of new class distributions. As a result, the inference performance can be retained as long as the mutual information is retained in the error-affected data and the new distributions can be learned by the error-aware model. That is, for DDHR to be effective, the statistics of the training data must match those of the testing data.

F. Design Issues in Prior Works

Applying an ELM model to perform CL can achieve the goal of extremely low energy consumption for on-CS-sensor-analysis. However, classifying compressed measurements with ELM, a very simple machine learning model, suffers from degraded accuracy. Furthermore, the performance of the ELM model is very sensitive to the quality of data [17]. Therefore, noise and interference are becoming dominant concerns, especially in the compressed domain. The DDHR approach cannot deal with white Gaussian noise, because white noise is a sequence of random numbers that cannot be learned by the error-aware model [23]. PR-ELM only mitigates the effects of non-idealities by modifying the regularization terms of the optimization for a given input space [17]. These related works restore the performance only to a certain level, which is not enough for an accurate diagnosis, because they lack a way to find, or transform the data to, a more informative and cleaner space for the learning task. The comparison among related works is shown in Table I.

III. PROPOSED ROBUST ENSEMBLE OF ELM WITH EIGENSPACE TRANSFORMATION

From the viewpoint of feature space transformation, we propose an eigenspace transformation which can be directly


TABLE I
THE COMPARISON BETWEEN OUR PROPOSED METHOD AND THE RELATED WORKS THAT CAN ACHIEVE ROBUST CL

performed in the compressed domain to extract information from noisy data. Furthermore, to remove the interference from compressed signals, we introduce a subspace-based dictionary to project compressed signals onto the sub-eigenspace with the most information preserved. Although the idea of transformation using subspace dictionaries was already proposed in our earlier work [10], in this work we propose a more systematic approach for obtaining both the subspace-based dictionaries and the transformation matrix. With the proposed sub-eigenspace transformation, the number of ELM models required in the ensemble method can be reduced. Therefore, compared to the prior works, the proposed SE-ELM achieves the highest accuracy with the lowest computational cost.

A. Proposed Eigenspace-ELM (E-ELM) Framework

To enhance the learning performance by information extraction and noise mitigation, the E-ELM framework is proposed as shown in Fig. 3. It consists of two stages: 1) off-line training and 2) on-line inference. The first stage utilizes a small amount of reconstructed data to learn a dictionary by PCA. Using this dictionary and the CS sensing matrix, we then obtain an eigenspace transformation matrix, which is used in the inference stage. After transforming all training data with the learned transformation matrix, we train the ELM models in the eigenspace. The second stage is performed directly in the measurement domain. The compressed signals are transformed onto the eigenspace by the transformation matrix obtained in the training stage. Then, the projected data, s, is further analyzed by the ELM models directly in the eigenspace.

1) PCA-Based Dictionary: The computational complexity of most machine learning inference algorithms is highly correlated with the data dimension. Therefore, dimension reduction is an important stage in on-device inference. Many works [24], [25] have utilized PCA in different applications. PCA exploits the property that the intrinsic dimension of a dataset is much smaller than the data dimension. The dataset can thus be represented by a small number of vectors, the principal components (PCs). PCs are linear transformations of the original set of variables which are orthogonal and ordered by the variation of the data along the corresponding basis. By choosing the first few PCs as the basis, we can transform the dataset onto a lower-dimensional subspace while retaining most of the information.

Fig. 3. Proposed E-ELM framework, including (a) PCA-based dictionary learning and construction of the eigenspace transformation matrix in the off-line training stage and (b) utilizing the transformation matrix in the on-line inference stage.

While PCA has been used primarily as a sensing matrix [26], in this work we exploit it as a transformation dictionary instead. Assume a dataset of D reconstructed training data X = [x_1 x_2 ... x_D], where each x_i ∈ ℝ^{N×1}. The first step of PCA is to compute the covariance matrix, Σ, of the dataset. Then, the eigendecomposition of Σ is performed, and we obtain a matrix U whose i-th column is the eigenvector of Σ with the i-th largest eigenvalue. By choosing the first k_s important eigenvectors, the main data information is kept and the noise is mitigated in the low-dimensional subspace. The proposed PCA-based dictionary ψ consists of the first k_s columns of U:

ψ = U(:, 1:k_s).   (12)

The PCA-based dictionary keeps the most significant information with the smallest dimension. The remaining eigenvectors, which are not in ψ, are considered noise. We utilize the data statistics to eliminate the redundant basis vectors and build an overdetermined dictionary.
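A minimal sketch of the dictionary construction in (12), assuming the training signals are stacked as columns of X:

```python
import numpy as np

def pca_dictionary(X, ks):
    """PCA-based dictionary of (12); X holds one training signal per column."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center the dataset
    Sigma = np.cov(Xc)                       # N x N covariance matrix
    eigvals, U = np.linalg.eigh(Sigma)       # eigh returns ascending order
    U = U[:, np.argsort(eigvals)[::-1]]      # reorder by descending eigenvalue
    return U[:, :ks]                         # psi = U(:, 1:ks)
```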

2) Eigenspace Transformation Matrix: To improve the performance of CL without the huge computational overhead of reconstruction, we propose an eigenspace transformation matrix, combining the information in the PCA-based dictionary with the CS sensing matrix, to transform the compressed signal. For a dataset whose intrinsic dimension is much smaller than the data dimension, the variance of most PCs is close to zero. This means the dataset is sparse in the eigenspace. Therefore, we can find a low-dimensional subspace with a fixed basis in this sparse space while preserving most of the information. By utilizing the proposed PCA-based dictionary in (12), we can represent the compressed signal, x̂, as

x̂ ≅ Φψs = Θs,   (13)

where Θ ∈ ℝ^{M×k_s} is the product of the sensing matrix and the proposed PCA-based dictionary, and s is the representation of x in the eigenspace. To estimate s, we solve the optimization problem

min_s ‖x̂ − Θs‖₂.   (14)


Fig. 4. Proposed SE-ELM framework, including (a) subspace-based dictionary division and construction of the sub-eigenspace transformation matrix in the off-line training stage and (b) utilizing the transformation matrix in the on-line inference stage.

Therefore, we can obtain ŝ, the estimate of s, in the least-squares sense as

ŝ = Θ†x̂ = (Θ^T Θ)^{-1} Θ^T x̂.   (15)

Consequently, we define the eigenspace transformation matrix T as

T = Θ† = (Θ^T Θ)^{-1} Θ^T.   (16)

This eigenspace transformation requires only a matrix multiplication, because Θ in (13) is an overdetermined dictionary that can be inverted in the least squares sense. The matrix is calculated in the off-line training stage, and we utilize it to perform the eigenspace transformation on the received compressed signals in the on-line inference stage.
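A sketch of (13)–(16) in NumPy; since Θ is overdetermined (M > k_s), the normal-equations inverse exists whenever Θ has full column rank:

```python
import numpy as np

def eigenspace_transform(Phi, psi):
    """T of (16): least-squares inverse of the overdetermined Theta = Phi psi."""
    Theta = Phi @ psi                                # (13), M x ks with M > ks
    return np.linalg.inv(Theta.T @ Theta) @ Theta.T  # (15)-(16)

# On-line inference then costs one matrix-vector product: s_hat = T @ x_hat.
```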

B. Proposed Sub-Eigenspace-ELM (SE-ELM) Framework

The proposed E-ELM can improve the performance of CL under white noise thanks to the noise mitigation provided by PCA. However, when facing interference in the compressed domain, the proposed E-ELM degrades in accuracy, because the interference spreads over the basis vectors of the PCA-based dictionary. Therefore, as shown in Fig. 4, we propose the SE-ELM to divide the aforementioned PCA-based dictionary into two subspaces, for the signal information and the interference, respectively. In this paper, we choose baseline wander as a case study of the interference issue. When dealing with other types of interference, this framework still works well as long as we replace the baseline extraction block in Fig. 4 with a proper preprocessing method.

1) Two-Stage PCA-Based Dictionary: In the off-line training stage, we utilize the reconstructed data contaminated by interference to learn the dictionaries. First, we find the PCA-based dictionary, ψ_E, as discussed in Sec. III-A by using the contaminated data in the original domain. Therefore, some vectors in the basis of this dictionary contain the statistical information of the interference.

On the other hand, the contaminated data in the original domain can be handled by conventional interference removal methods. In this work, we exploit wavelet adaptive filtering (WAF) [22] to decompose the ECG signal and the baseline wander. The ECG signals of the training set are decomposed up to 7 levels using the wavelet transform (WT).

Fig. 5. The illustration of signal representation using the proposed subspace-based dictionaries.

The frequency content of the 7th-level approximation coefficients lies in the range of 0–1.4 Hz, which contains the baseline signal. Thus, the 7th-level approximation coefficients are optimized by an adaptive filter. The filtered output and the detail coefficients are used for the inverse WT. We thereby obtain the detrended data.
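As a rough illustration only, the sketch below zeroes the level-7 approximation instead of adaptively filtering it as WAF [22] does; it assumes the PyWavelets package and an illustrative mother wavelet:

```python
import numpy as np
import pywt  # assumes the PyWavelets package

def detrend_ecg(ecg, wavelet="db4", level=7):
    """Simplified stand-in for WAF: drop (rather than adaptively filter)
    the coarsest approximation, which holds the slow baseline drift."""
    w = pywt.Wavelet(wavelet)
    lvl = min(level, pywt.dwt_max_level(len(ecg), w.dec_len))
    coeffs = pywt.wavedec(ecg, w, level=lvl)
    coeffs[0] = np.zeros_like(coeffs[0])   # zero the baseline approximation
    return pywt.waverec(coeffs, w)[: len(ecg)]
```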

Next, we apply the transpose of the PCA-based dictionary, ψ_E, to project all D detrended data of the training set, X_d ∈ ℝ^{N×D}, to Y_d ∈ ℝ^{k_s×D} as

Y_d = ψ_E^T X_d.   (17)

To divide this projected eigenspace into two subspaces, a signal space and a baseline space, we apply PCA again on Y_d. As discussed in Sec. III-A.1), we obtain a matrix U_Y ∈ ℝ^{k_s×k_s} whose i-th column is the i-th eigenvector of the covariance matrix of Y_d. Because Y_d is projected from the detrended data, the more important eigenvectors are regarded as the basis of the projected signal space, while the remaining eigenvectors are regarded as the basis of the projected baseline space. That is, we separate the matrix U_Y into two sub-matrices representing signal and baseline as

U_s = U_Y(:, 1:k_ss)   (18)

and

U_b = U_Y(:, k_ss+1:k_s),   (19)

where k_ss is the dimension of the signal space, decided by the percentage of cumulative variance. Then, as shown in Fig. 5, the signal in the original domain can be represented as

x ≅ ψ_E [U_s U_b] [s_s; s_b] = [ψ_s ψ_b] [s_s; s_b],   (20)

where s_s is the signal component of the contaminated data in the eigenspace, s_b is the baseline component of the contaminated data in the eigenspace, and ψ_s = ψ_E U_s and ψ_b = ψ_E U_b are the proposed subspace-based dictionaries for the signal and the baseline wander, respectively.
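A sketch of the two-stage PCA in (17)–(20), with our own function names:

```python
import numpy as np

def split_dictionaries(psi_E, Xd, kss):
    """Two-stage PCA of (17)-(20): project the detrended data, run PCA again,
    and split the eigenspace into signal and baseline dictionaries."""
    Yd = psi_E.T @ Xd                                    # (17), ks x D
    Sigma_Y = np.cov(Yd - Yd.mean(axis=1, keepdims=True))
    eigvals, UY = np.linalg.eigh(Sigma_Y)
    UY = UY[:, np.argsort(eigvals)[::-1]]                # descending eigenvalues
    Us, Ub = UY[:, :kss], UY[:, kss:]                    # (18)-(19)
    return psi_E @ Us, psi_E @ Ub                        # psi_s, psi_b in (20)
```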

2) Sub-Eigenspace Transformation Matrix: By utilizing the proposed subspace-based dictionaries, we can represent the compressed signal, x̂, as

x̂ ≅ Φ[ψ_s ψ_b] [s_s; s_b] = [Θ_s Θ_b] [s_s; s_b],   (21)

where Θ_s = Φψ_s ∈ ℝ^{M×k_ss} and Θ_b = Φψ_b ∈ ℝ^{M×(k_s−k_ss)}. We can solve for both s_s and s_b in the least-squares sense described in (13)–(15). However, in the on-line inference stage we apply machine learning to s_s only. Transforming x̂ to s_s can be viewed as baseline removal in the compressed domain, because the baseline component is left in s_b. According to [27], the Moore-Penrose inverse of


a columnwise partitioned matrix can be formulated as

[Θ_s Θ_b]† = [ (P⊥_Θb Θ_s)†
               (P⊥_Θs Θ_b)† ],   (22)

where the orthogonal projectors are specified as

P⊥_Θs = I_M − Θ_s Θ_s†   (23)

and

P⊥_Θb = I_M − Θ_b Θ_b†.   (24)

Therefore, we can derive ŝ_s, the estimate of s_s, as

ŝ_s = (P⊥_Θb Θ_s)† x̂ = T_SE x̂,   (25)

where T_SE = (P⊥_Θb Θ_s)† ∈ ℝ^{k_ss×M} is the proposed sub-eigenspace transformation matrix. With this matrix, we can transform the compressed measurement x̂ to the signal component ŝ_s without the computational cost of calculating the baseline component, which reduces the required multiplications in the transformation by (k_s − k_ss) × M.

The proposed SE-ELM framework benefits from the conventional interference removal methods in the original data domain without the huge overhead of reconstruction in the on-line inference stage. With the proposed sub-eigenspace transformation matrix, we achieve the effect of interference removal directly in the compressed domain.
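A compact sketch of (21)–(25); np.linalg.pinv stands in for the Moore-Penrose inverses:

```python
import numpy as np

def sub_eigenspace_transform(Phi, psi_s, psi_b):
    """T_SE of (25): project out the baseline subspace, then pseudo-invert."""
    Theta_s, Theta_b = Phi @ psi_s, Phi @ psi_b                # (21)
    M = Phi.shape[0]
    P_perp_b = np.eye(M) - Theta_b @ np.linalg.pinv(Theta_b)   # (24)
    return np.linalg.pinv(P_perp_b @ Theta_s)                  # (25)

# On-line use: s_s_hat = T_SE @ x_hat, i.e., baseline removal in the
# compressed domain with a single matrix-vector product.
```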

IV. NUMERICAL EXPERIMENTS AND ANALYSIS OF COMPUTATIONAL COMPLEXITY

A. Experimental Settings for Different Noisy Scenarios

ECG-based atrial fibrillation (AF) detection is used to validate the proposed frameworks for binary classification of sensor data. The raw ECG signals were recorded in the stroke intensive care unit (ICU) of National Taiwan University Hospital (NTUH) with a sampling frequency of 512 Hz. We set 1 second as the input window; therefore, each datum is a time series of 512 samples containing baseline wander. The data were visually checked and labeled as AF or non-AF by doctors. For each label, there are 2500 training data and 1000 inference data. To obtain the compressively sensed signal with measurement length M = 128, we designed the sensing matrix Φ ∈ ℝ^{128×512} as a random Bernoulli matrix whose entries are drawn uniformly from {+1, −1}. Because of the randomness in the sensing matrix, the simulation for each framework was run for 100 trials. The experimental setup is summarized in Table II.

We compare the proposed frameworks with SVM and PR-ELM. To decide the model parameters for these machine learning models, we performed 5-fold cross-validation and selected the settings with the highest accuracy. The detailed search ranges for the parameters are summarized in Table III. To evaluate the performance of each framework, the metric used in this paper is the inference accuracy, defined as

Accuracy = (Number of correct classification results) / (Total number of inference data).   (26)

TABLE II
EXPERIMENTAL SETTINGS

TABLE III
PARAMETER SETTINGS FOR MACHINE LEARNING MODELS

TABLE IV
DIFFERENT NOISY SCENARIOS

For the analysis of the computational complexity, we calculated the required number of multiplications in the inference stage, as discussed in Section IV-B.

There are three different noisy scenarios in the experiments, as shown in Table IV: the ideal case, the white noise case, and the worst case, i.e., white noise plus baseline wander. In the ideal case, the recorded raw ECG signals are processed by the interference removal algorithms; the input signal is therefore the detrended ECG signal, x_d. In the white noise case, to simulate the thermal noise of CS sensors, we add white Gaussian noise at different SNRs to the detrended ECG signals. In the worst case, we add white Gaussian noise at different SNRs to the raw ECG signals, which is equivalent to adding both interference and white Gaussian noise to the detrended ECG signals.

B. Analysis of Computational Complexity

We analyze the computational complexity of SVM, PR-ELM, and the proposed frameworks for inference in the compressed domain. Because the input vector to each model is the compressed signal, the input dimension is M. For SVM with an RBF kernel, the kernel size is decided by the number of support vectors. Therefore, the computational complexity of RBF-SVM given by [8] is #SV × M, where #SV is the number of support vectors.


TABLE V
REQUIRED MULTIPLICATIONS BY SVM AND THE PROPOSED E-ELM FOR INFERENCE IN THE IDEAL CASE

As for the ELM-based models, the required multiplications are n × L for the ELM input weighting multiplications and L × m for the ELM output weighting multiplications, where n is the input dimension of the ELM model, L is the number of hidden neurons, and m is the number of classes. With the proposed hardware-friendly ELM design detailed in Section V-A, we can complete the input weighting multiplications with only adders and multiplexers. Therefore, we can ignore the n × L term standing for the input weighting multiplications. Furthermore, for a binary classification problem, because one column of the output weight matrix β in (3) is the additive inverse of the other, the computational complexity can be further reduced to only L. Consequently, for all C classifiers in the ensemble of ELM models, the number of required multiplications is L × C.

In addition, the overheads of the proposed E-ELM and SE-ELM in the inference stage are the matrix multiplications for the eigenspace transformation in (15) and the sub-eigenspace transformation in (25). Therefore, the numbers of required multiplications of E-ELM and SE-ELM are M × k_s + L × C_E and M × k_ss + L × C_SE, respectively, where k_s and k_ss are the dimensions of the eigenspace and the sub-eigenspace, and C_E and C_SE are the required numbers of ELM models for E-ELM and SE-ELM.
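These counts can be tabulated directly; the sketch below uses the parameter values reported later in this section (k_s = 39, k_ss = 29, L = 400):

```python
# Multiplication counts of Section IV-B with the paper's parameter values.
M, ks, kss, L = 128, 39, 29, 400

def mults_e_elm(C_E):      # eigenspace transform + output weighting
    return M * ks + L * C_E

def mults_se_elm(C_SE):    # sub-eigenspace transform + output weighting
    return M * kss + L * C_SE

print(mults_se_elm(11))    # 128*29 + 400*11 = 8112, as in Section IV-D
```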

C. Evaluation of E-ELM Framework

First, we evaluated the enhancement of the learning performance provided by the E-ELM framework in the ideal case, as shown in Fig. 6. As a powerful and high-complexity model, one SVM model can achieve an accuracy of 94.56%. However, the required number of multiplications of the SVM model is extremely large, as shown in Table V. The dotted lines with different colors in Fig. 6 show the inference accuracy over the increasing number of ELM classifiers achieved by AdaBoost-ELM with different L. The inference accuracy increases as we combine more models in AdaBoost-ELM. The solid lines in Fig. 6 show the inference accuracy achieved by the proposed ensemble of E-ELM, which is superior to conventional AdaBoost-ELM. This improvement comes from the signal information extraction by PCA. In the proposed E-ELM, the dimension of the PCA-based dictionary, k_s, was chosen as 39 with the criterion that the percentage of cumulative variance be larger than 85%. When L is equal to 400, the proposed E-ELM with 50 classifiers achieves an inference accuracy of 93.51%, which is competitive with SVM. As shown in Table V, E-ELM obtains a 94.07% saving of the required multiplications compared to the SVM model in the ideal case.

Fig. 7 presents the comparison of the inference accuracy over the increasing number of ELM classifiers between

Fig. 6. Comparison of the inference accuracy over the increasing number of ELM classifiers between SVM, E-ELM, and ELM models with different numbers of hidden neurons (L) in the ideal case.

TABLE VI
COMPARISON OF MEASURED SNR IN THE COMPRESSED DOMAIN AND THE EIGENSPACE IN THE WHITE NOISE CASE

different frameworks in the white noise case under different SNRs. Although we only show the case where L equals 400, simulations with different settings of L show consistent performance. PR-ELM considers both the modeling error variance and the modeling error mean in the training stage. Therefore, compared to AdaBoost-ELM, AdaBoost-PR-ELM has higher inference accuracy when the compressed signals are contaminated by white Gaussian noise. The proposed E-ELM outperforms PR-ELM, because the eigenspace transformation achieves signal information extraction and noise mitigation at the same time. As shown in Table VI, the SNR measured in the eigenspace is much higher than that in the compressed measurement domain when we inject the same level of white Gaussian noise, which illustrates that the learnability of the compressed noisy data is improved by the proposed eigenspace transformation.

The proposed E-ELM framework shows competitive inference accuracy as discussed above, which is accurate enough for fast screening. More importantly, the multiplications required by both ensembles of ELM-based models, E-ELM and PR-ELM, are significantly fewer than those required by SVM. Although E-ELM needs the overhead of the eigenspace transformation, it requires far fewer classifiers in the ensemble method to achieve the same inference accuracy compared to PR-ELM. Consequently, the overall number of required multiplications (M × k_s + L × C_E) is reduced. To achieve the inference accuracy provided by 50 PR-ELM classifiers, as shown in Table VII, E-ELM needs only 23, 19, 16, and 9 classifiers when the SNR is 20, 15, 10, and 5 dB, respectively. The results show that E-ELM is more robust against white noise as the SNR worsens. Therefore, compared to PR-ELM, the proposed E-ELM obtains a 41.5% saving of the required multiplications on average over the different SNR levels.

D. Evaluation of SE-ELM Framework

Fig. 8 shows the comparison of the inference accuracy over the increasing number of ELM classifiers between


Fig. 7. Comparison of the inference accuracy over the increasing number of ELM classifiers between SVM, E-ELM, PR-ELM, and ELM models with 400 hidden neurons in the white noise case when the SNR is (a) 5, (b) 10, (c) 15, and (d) 20 dB.

TABLE VII
THE REQUIRED NUMBER OF CLASSIFIERS IN E-ELM TO ACHIEVE THE ACCURACY PROVIDED BY 50 PR-ELM CLASSIFIERS AND THE COMPARISON OF THE REQUIRED MULTIPLICATIONS IN THE WHITE NOISE CASE

different frameworks in the worst case under different SNRs. By comparing Fig. 7 and Fig. 8, we can observe that SVM, E-ELM, PR-ELM, and ELM all suffer a certain accuracy degradation because of the interference in the compressed signals. To improve the performance of the SVM model, we applied the DDHR concept to it. Even though the interference effect on the signals can be learned to a certain degree by the error-aware model, DDHR-SVM has only about 2% higher accuracy than SVM. This is because the space used for the learning task still contains task-unrelated information about the interference. The same reason explains the accuracy degradation of PR-ELM and ELM. On the other hand, in E-ELM, even though the compressed signals are transformed to the eigenspace, the interference component spreads out over the eigenspace learned from the cleaned data. Therefore, the proposed eigenspace division in the SE-ELM is required when there are unnecessary components with a specific distribution in the compressed signals.

The SE-ELM framework removes the interference directly in the compressed domain, and thus it achieves the highest inference accuracy in the worst case. There is a tradeoff between the achieved accuracy and the energy savings. As shown in Fig. 7 and Fig. 8, the ensemble methods achieve higher accuracy as more models are used. In the proposed SE-ELM, the dimension of the signal space, k_ss, was chosen as 29 with the criterion that the percentage of cumulative variance be larger than 85%. As shown in Fig. 8, when L is equal to 400, SE-ELM needs only 7, 10, 11, and 11 classifiers to achieve the inference accuracy provided by DDHR-SVM when the SNR is 20, 15, 10, and 5 dB, respectively. With these parameter values, we can obtain and summarize the required multiplications in Table VIII. For example, the required number of multiplications of SE-ELM is 128 × 29 + 400 × 11 = 8112 when the SNR is 20 dB. Therefore, SE-ELM obtains more than 66.8%

TABLE VIII
THE REQUIRED NUMBER OF CLASSIFIERS IN ELM-BASED MODELS TO ACHIEVE THE ACCURACY PROVIDED BY DDHR-SVM AND THE COMPARISON OF THE REQUIRED MULTIPLICATIONS IN THE WORST CASE

and 61.9% savings of the required multiplications on average over the different SNR levels compared to E-ELM and PR-ELM, respectively, and thus achieves the lowest computational complexity for on-CS-sensor-analysis.

V. ARCHITECTURE DESIGN AND VLSI IMPLEMENTATION

In this work, the main concern of the hardware architecture design is an area- and energy-efficient implementation of the inference stage of the proposed SE-ELM framework. To reduce the hardware cost of implementing A-ELM, we propose a hardware-friendly ensemble of ELM. The implementation results show that the proposed engine is not only reconfigurable in the number of classifiers but also cost-effective.

A. Hardware-Friendly Ensemble of ELM

We adopt the cost-sensitive method in [19] to achieve A-ELM. To further increase the diversity among different ELM classifiers, in each iteration of the AdaBoost algorithm we assign a different input weight matrix A_i to the i-th ELM model, as shown in Fig. 9. With different input weight matrices, the hidden layer output matrices H in different iterations have different values even though the training data is the same. Therefore, the obtained output weight matrices β of the different ELM models are more diverse and the ensemble model is more accurate. However, this requires computing the different input weighting multiplications in (4), which results in k_ss × L × C additional multiplications. The input


Fig. 8. Comparison of the inference accuracy over the increasing number of ELM classifiers between DDHR-SVM, SVM, SE-ELM, E-ELM, PR-ELM, and ELM models with 400 hidden neurons in the worst case when the SNR is (a) 5, (b) 10, (c) 15, and (d) 20 dB.

Fig. 9. Hardware-friendly design for the ensemble of ELM: the instance weights of each training datum are updated with the AdaBoost algorithm, and different input weights for each ELM model are generated with a dedicated LFSR design.

weighting multiplications account for the highest proportion of the overall required multiplications.

To make the input weighting multiplication more hardware-friendly, we construct the input weight matrices as random Bernoulli matrices in which each element is either 1 or −1. By doing so, we can use only adders and multiplexers to complete the input weighting multiplication without a significant drop in accuracy. Furthermore, storing all the random Bernoulli matrices in the inference engine would require an additional (k_ss + 1) × L × C bits of memory (+1 for the bias of each hidden neuron). Therefore, we utilize a linear-feedback shift register (LFSR) to generate pseudo-random patterns, with the details as follows:

(a_{111}, a_{121}, ..., a_{1n1}, a_{1(n+1)1}) = seed of the (n+1)-bit LFSR.   (27)

a_{ijk} = a_{i(j−1)(k−1)}, for i from 1 to C, j from 2 to n+1, and k from 2 to L.   (28)

a_{ij1} = a_{(i−1)(j−1)L}, for i from 2 to C and j from 2 to n+1.   (29)

a_{i1k} = the new bit generated by the (n+1)-bit LFSR in the ((i−1) × L + k)-th cycle, for i from 1 to C and k from 1 to L, excluding (i, k) = (1, 1).   (30)

In the above equations, a_{ijk} is the (j, k)-th element of A_i, n is the dimension of the input layer, L is the dimension of the hidden layer, and C is the number of ELM models in the ensemble. Briefly, the columns of A_i are the values held in the (n+1)-bit LFSR in successive clock cycles: the first column of A_1 is the initial state of the LFSR, the second column of A_1 is the LFSR state in the second cycle, the first column of A_i is generated by clocking the LFSR from the L-th column of A_{i−1}, and so on.
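A behavioral sketch of (27)–(30); the feedback tap positions are illustrative, not the polynomial implemented in the chip:

```python
def lfsr_input_weights(n, L, C, seed, taps=(0, 3)):
    """Successive LFSR states form the columns of A_1 ... A_C, per (27)-(30).
    seed must be a nonzero (n+1)-bit integer; taps are illustrative only."""
    state = [(seed >> i) & 1 for i in range(n + 1)]   # (27): seed = 1st column
    for _ in range(C):
        cols = []
        for _ in range(L):
            cols.append([1 if bit else -1 for bit in state])  # map {0,1}->{+1,-1}
            fb = 0
            for t in taps:                            # XOR feedback bit
                fb ^= state[t]
            state = [fb] + state[:-1]                 # shift, realizing (28)-(30)
        yield cols                                    # the columns of this A_i
```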

B. Architecture Design for the Ensemble of SE-ELM

As discussed in the previous sections, the signal dimension M, the dimension of the signal space k_ss, and the number of hidden neurons L are 128, 29, and 400, respectively. We set the maximum number of classifiers to 16 in this design, as this provides a reasonable tradeoff between inference accuracy and computational cost.

The overall architecture of the inference stage of the AdaBoost-based SE-ELM is shown in Fig. 10. It consists of several memory buffers and four main computing blocks: hardware sharing for the eigenspace transformation and the ELM output weighting multiplication (Fig. 10(a)), the ELM input weighting multiplication (EIW) (Fig. 10(b)), the ensemble output unit (Fig. 10(c)), and the data control unit (DCU) (Fig. 10(d)). In addition, Fig. 11 shows the execution flow of the proposed SE-ELM engine. It consists of five main execution stages: the configuring stage, the data collecting stage, the eigenspace transformation, the ELM computing stage, and the weighted voting for the final results.

1) Hardware Sharing Architecture for Eigenspace Transformation and ELM Output Weighting Multiplication: The eigenspace transformation and the ELM output weighting multiplication are both matrix multiplications. In addition, these two computations are performed in different execution stages. Therefore, to reduce the number of implemented multipliers, we design a hardware reuse method with a folding technique. As shown in Fig. 10(a), the multiplexers at the inputs of the multiplier, controlled by the DCU, select the proper inputs to compute the element-by-element multiplication.

After 128 cycles of collecting the input data into 128 shift registers, in the eigenspace transformation stage the multiplier reads its inputs from the input shift registers and the T_SE memory buffers. During the eigenspace transformation stage, the compressed data cyclically shifts through the 128 registers. In the ELM computing stage, on the other hand, the multiplier reads its inputs from the output of the sigmoid unit and the β memory buffers.

2) ELM Input Weighting Multiplication (EIW): The EIW computes the input weighting multiplication and the activation function of the ELM algorithm, as described in (4)–(6). Since the shared multiplier can complete one of the element-by-element multiplications of the output weighting multiplication, the EIW block must generate the result of one hidden neuron in each


Fig. 10. Proposed hardware architecture design for the ensemble of SE-ELM, including (a) hardware sharing for eigenspace transformation and ELM output weighting multiplication, (b) ELM input weighting multiplication (EIW), (c) ensemble output unit, and (d) data control unit (DCU).

Fig. 11. Execution flow of the proposed ensemble of SE-ELM engine.

Therefore, we utilize the multiplexers and adder tree without hardware sharing. On the other hand, because the achievable clock frequency without pipelining is already fast enough for real-time ECG-based AF detection, we do not insert any pipeline registers in the adder tree, saving hardware cost. Although the transformation requires kss additional registers for storing the transformed data, it greatly reduces the size of the adder tree in the ELM input weighting multiplication. Therefore, from the whole-framework perspective, the eigenspace transformation reduces the hardware complexity.

As shown in Fig. 10(b), for the input weighting multiplications, we utilize the LFSR to generate the input weight matrix, which reduces both the computational complexity and the memory requirements.

Since kss is chosen as 29 in this design, we implemented a 30-bit LFSR whose 30 register values serve as one of the column vectors of the input weight matrix in (5). In each clock cycle, these 30 binary values control the multiplexers in EIW to perform multiplication by 1 or −1. Therefore, we can complete the multiplications using only multiplexers and adders. As for the activation function, we implemented the sigmoid unit using a piecewise linear approximation [29].
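The following sketch mirrors this datapath in software: LFSR bits select add or subtract in place of true multiplications, and the activation is a piecewise linear sigmoid. The PLAN breakpoints shown follow our reading of [29] and should be treated as an assumption rather than the engine's exact tables.

```python
# Sketch of the EIW datapath: +/-1 "multiplications" reduce to sign
# selection feeding an adder tree, followed by a piecewise linear sigmoid.
# Breakpoints below are our reading of the PLAN scheme in [29] (assumed).

def pwl_sigmoid(z):
    """Piecewise linear sigmoid with power-of-two slopes (shift-friendly)."""
    a = abs(z)
    if a >= 5.0:
        y = 1.0
    elif a >= 2.375:
        y = 0.03125 * a + 0.84375
    elif a >= 1.0:
        y = 0.125 * a + 0.625
    else:
        y = 0.25 * a + 0.5
    return y if z >= 0 else 1.0 - y          # odd symmetry about (0, 0.5)

def hidden_neuron(ss, bias_abs, lfsr_bits):
    """One hidden-neuron output per cycle: the first kss LFSR bits select
    add or subtract for the signal components (the multiplexers), the last
    bit signs the stored absolute bias, and the adder tree accumulates."""
    acc = sum(s if b else -s for s, b in zip(ss, lfsr_bits[:-1]))
    acc += bias_abs if lfsr_bits[-1] else -bias_abs
    return pwl_sigmoid(acc)
```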

3) Ensemble Output Unit: After collecting the results from the different ELM models, the ensemble output unit computes the ensemble average in (10) to obtain the final output. For binary classification, the classification result of each ELM model is the sign bit of its output weighting multiplication result. The DCU provides the appropriate ensemble weight α for the corresponding ELM model. The ensemble average is then obtained by accumulating the ensemble weights selected by the most significant bit (MSB) of each output weighting multiplication result, and the final classification result is the MSB of the accumulated ensemble average.
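A minimal sketch of this decision rule, assuming binary labels and the AdaBoost weights α loaded during the configuring stage:

```python
# Sketch of the ensemble output unit: each ELM contributes only the sign
# (MSB) of its output-weighting result, which selects +alpha or -alpha;
# the final label is the sign of the accumulated total, as in (10).

def ensemble_decision(elm_outputs, alphas):
    """elm_outputs: raw output-weighting results, one per ELM model.
    alphas: ensemble weights. Returns the binary class label."""
    total = 0.0
    for y, alpha in zip(elm_outputs, alphas):
        total += alpha if y >= 0 else -alpha   # MSB selects the sign
    return 1 if total >= 0 else 0              # MSB of the accumulation
```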

4) Data Control Unit (DCU): The control mechanism in this design is based on counters. Given the configured model parameters, the DCU controls the multiplexers in different parts of the architecture to decide the data flow, including the 128 compressed input data in Fig. 10(a), the 29 transformed data in Fig. 10(b), and the resets of the accumulators in both Fig. 10(a) and Fig. 10(c). The DCU also controls the memory addresses to select the model parameters for the specified ELM model or the eigenspace transformation. To achieve a reconfigurable design, the DCU further decides when to terminate the ensemble method according to the specified number of classifiers, and then lets the input buffer collect new data.
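The counter-driven sequencing can be summarized by the following software analogy. This is not the RTL state machine; the stage names are illustrative, and the (kss, M, L) defaults simply restate the parameters given above.

```python
# Software analogy of counter-based DCU sequencing: 128 transform cycles
# per eigenspace component, L hidden-neuron cycles per classifier, and
# termination after the configured (reconfigurable) number of classifiers.

def dcu_schedule(num_classifiers, kss=29, M=128, L=400):
    """Yield (stage, classifier, step) tuples in execution order."""
    for comp in range(kss):
        for cyc in range(M):
            yield ("transform", None, (comp, cyc))  # shared multiplier busy
    for c in range(num_classifiers):                # reconfigurable count
        for neuron in range(L):
            yield ("elm", c, neuron)                # EIW + output weighting
        yield ("vote", c, None)                     # accumulate +/- alpha
```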

C. Execution Flow of the Proposed Engine

The execution flow of the proposed engine is shown in Fig. 11.


Fig. 12. Inference accuracy of the proposed ensemble of SE-ELM in the worst case with different fixed-point precisions.

TABLE IX

SUMMARY OF THE PROPOSED ENGINE

First, in the configuring stage, the parameters of the ensemble model, including the number of classifiers, the seed of the LFSR, the absolute value of the bias in the input weight, and the ensemble weights, are stored in the registers of the DCU. Then, the input shift registers start to collect 128 compressed samples of input data (a 1-second ECG signal) and forward them to the shared multiplier to perform the eigenspace transformation. In each clock cycle, one of the element-by-element multiplications in (25) is completed by the multiplier with the proper model parameter fed from the TSE memory buffers. After every 128 cycles, one of the transformed data is completed and fed into the EIW block.

After the eigenspace transformation is completed, the signal components are kept in the kss + 1 registers in the EIW block. Among them, the (kss + 1)-th register holds the absolute value of the bias for the input weighting multiplication. In each clock cycle, the EIW block computes the output of one hidden neuron using the random input weights generated by the LFSR and the sigmoid unit. At the same time, the shared multiplier performs the output weighting multiplication in a pipelined manner, using the result from the EIW block and the proper output weight fed from the β memory buffers.

When the computations of an entire ELM model are finished, the DCU decides whether to continue with another ELM model or to let the ensemble output unit calculate the final classification result. The proposed engine outputs the final classification result once the computations for all classifiers in the ensemble model are done. Then, it can start to collect another compressed ECG signal, or update the model parameters if needed.

D. Fixed-Point Analysis

First, we performed a fixed-point analysis to investigate the trade-off between the achieved accuracy and the word lengths of the registers in this engine.

Fig. 13. Layout of the proposed ensemble of SE-ELM engine.

Since the multiplier is reused for the eigenspace transformation and the output weighting multiplication, the bit widths of the input compressed data x and the sigmoid output h must be the same; likewise, the bit widths of the transformation matrix TSE and the output weight matrix β must be the same. To determine the bit widths of x, h, the signal component in the eigenspace ss, TSE, β, the sigmoid input z, and the bias b, we evaluated the inference accuracy of the SE-ELM framework with different fixed-point precisions. The bit widths of TSE and β have the most significant effect on the inference accuracy, as shown in Fig. 12. After exhaustive computer simulation, the bit widths of x, h, ss, TSE, β, z, and b were set to 8, 8, 8, 9, 9, 13, and 8, respectively.
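A sketch of this kind of word-length exploration is shown below. Here `evaluate_accuracy` is a stand-in for the authors' SE-ELM simulation rather than a real API, and the fraction-bit split is an assumed parameter.

```python
# Sketch of a fixed-point sweep: quantize the most sensitive operands
# (TSE and beta, per Fig. 12) to a trial word length and re-run inference.

def quantize(x, total_bits, frac_bits):
    """Round to two's-complement fixed point with saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(x * scale)))
    return q / scale

def sweep_word_lengths(evaluate_accuracy, widths=range(6, 13), frac_bits=6):
    """Report accuracy versus the shared word length of TSE and beta.
    evaluate_accuracy is a user-supplied simulation callback (hypothetical)."""
    results = {}
    for w in widths:
        qfn = lambda x, w=w: quantize(x, w, frac_bits)
        results[w] = evaluate_accuracy(quantize_tse_beta=qfn)
    return results
```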

E. Implementation Results and Comparisons

To validate the function of the proposed architecture design, we implemented the proposed engine in TSMC 90 nm technology using the Cadence SoC Encounter tool. As shown in Fig. 13, the core area of this engine is 384 μm × 384 μm at a 5 MHz operating frequency. The important characteristics are summarized in Table IX.

Table X compares the proposed AdaBoost-based SE-ELM engine with other recently reported machine learning engines. For a fair comparison, we normalized the energy consumption and area to a 90 nm process. Since the design goal of on-CS-sensor-analysis is area and energy efficiency, we define the area-energy-efficiency (AEE, in MAC/(pJ × mm2)) as

$$\mathrm{AEE} = \frac{1}{\text{Energy Efficiency} \times \text{Area}}, \tag{31}$$

which describes the computation capability provided per unit of normalized energy consumption and silicon cost.
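For example, under (31) an engine with a normalized energy efficiency of 2 pJ/MAC and a 0.15 mm2 core (placeholder numbers, not taken from Table X) would score about 3.3 MAC/(pJ × mm2):

```python
# Worked example of the AEE metric in (31), with placeholder numbers:
# energy in pJ/MAC (normalized to 90 nm) and core area in mm^2.

def area_energy_efficiency(pj_per_mac, area_mm2):
    """AEE in MAC/(pJ x mm^2): computations delivered per unit of
    normalized energy and per unit of silicon area."""
    return 1.0 / (pj_per_mac * area_mm2)

print(area_energy_efficiency(2.0, 0.15))   # ~3.33 MAC/(pJ x mm^2)
```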

Since [25] and [30] are both SVM engines that need many more computations for feature extraction or for multiplying the support vectors, their energy consumption per classification is on the order of hundreds of microjoules, which is significantly larger than that of ELM engines. The design in [30] is an SoC-based machine learning accelerator that includes a low-power CPU core for feature extraction, so its area and energy consumption are much larger, as shown in Table X. As for the recently published ELM engines, [7], [13], [31] utilize the mismatches in current mirrors to perform effective random matrix multiplication in the first layer of ELM.


TABLE X

COMPARISON WITH PUBLISHED MACHINE LEARNING ENGINES

Reference [7] targets neural decoding, which demands extremely low-power design, while [13] and [31] are more energy efficient owing to faster operation. In addition, [31] implements dimension expansion and a physical unclonable function (PUF) for general Internet-of-Things applications. However, these designs need one capacitor per row of the current mirror to control the noise, which results in a larger chip area. As shown in Table X, [13] and [31] achieve very good energy efficiency, but their areas are much larger than those of the digital designs. More importantly, the randomness of the first ELM layer in these works stems from fabrication mismatch, so the designs are hard to extend to an ensemble. Note that the reported area only covers the mixed-signal chip for the input layer of ELM; the output layer is implemented off chip on an FPGA. In addition, the temperature dependence of the analog design also degrades the inference accuracy.

In our design, the eigenspace transformation requires accurate computation, so we choose a fully digital design. A recently published work [32] also proposed a VLSI implementation of ensemble ELM that uses a random number generator. In [32], each element of the input weight matrix is represented by multiple bits; therefore, given an n-dimensional input vector, the multiplication for each hidden neuron needs a multiplier executing for n cycles. Our proposed design needs only one cycle per hidden neuron and no multiplier for the input weighting multiplication, by using a random Bernoulli matrix as the input weight matrix. With the proposed hardware-friendly algorithm and hardware sharing architecture, the proposed engine provides a competitive AEE compared with previous designs that implemented only the first stage of ELM, as shown in Table X. Furthermore, our design provides the ensemble functionality, which improves the inference accuracy of the ELM models at the cost of energy efficiency due to the eigenspace transformation.
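The cycle-count advantage can be checked with simple arithmetic using this design's parameters (kss + 1 = 30 inputs, L = 400 hidden neurons); the [32]-style baseline is modeled abstractly as one multiplier running n cycles per neuron:

```python
# Back-of-envelope input-layer cycle counts: a multi-bit-weight design in
# the style of [32] (one multiplier, n cycles per hidden neuron) versus
# the proposed Bernoulli scheme (one neuron per cycle, no multiplier).

n, L = 30, 400                     # kss + 1 inputs, hidden neurons
cycles_multiplier_based = n * L    # 12000 multiplier cycles per model
cycles_proposed = L                # 400 cycles, multiplexers and adders only
print(cycles_multiplier_based, cycles_proposed)  # 30x fewer cycles
```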

VI. CONCLUSION

In this work, we propose a robust and lightweight ensemble of ELM based on eigenspace transformation and a hardware-friendly architecture design. The proposed E-ELM improves the learnability of compressed noisy signals and thus reduces the number of ELM classifiers required for a lightweight model. The proposed SE-ELM further removes the interference directly in the compressed domain, making the lightweight learning model robust against interference. This framework achieves the highest accuracy with the lowest computation under white noise and interference. The ASIC implementation results show that the proposed engine provides competitive area- and energy-efficiency to achieve on-CS-sensor-analysis.

REFERENCES

[1] W. Cheng, M.-F. Yeh, K.-C. Chang, and R.-G. Lee, "Real-time ECG telemonitoring system design with mobile phone platform," Measurement, vol. 41, pp. 463–470, May 2008.

[2] X. Li, R. D. S. Blanton, P. Grover, and D. E. Thomas, "Ultra-low-power biomedical circuit design and optimization: Catching the don't cares," in Proc. Int. Symp. Integr. Circuits (ISIC), Singapore, Dec. 2014, pp. 115–118.

[3] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.

[4] H. Mamaghanian, N. Khaled, D. Atienza, and P. Vandergheynst, "Compressed sensing for real-time energy-efficient ECG compression on wireless body sensor nodes," IEEE Trans. Biomed. Eng., vol. 58, no. 9, pp. 2456–2466, Sep. 2011.

[5] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems," IEEE J. Sel. Topics Signal Process., vol. 1, no. 4, pp. 586–597, Dec. 2007.

[6] M. A. B. Altaf and J. Yoo, "A 1.83 μJ/classification, 8-channel, patient-specific epileptic seizure classification SoC using a non-linear support vector machine," IEEE Trans. Biomed. Circuits Syst., vol. 10, no. 1, pp. 49–60, Feb. 2016.

[7] Y. Chen, E. Yao, and A. Basu, "A 128-channel extreme learning machine-based neural decoder for brain machine interfaces," IEEE Trans. Biomed. Circuits Syst., vol. 10, no. 3, pp. 679–692, Jun. 2016.

[8] A. Bytyn, J. Springer, R. Leupers, and G. Ascheid, "VLSI implementation of LS-SVM training and classification using entropy based subset-selection," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Baltimore, MD, USA, May 2017, pp. 1–4.

[9] R. Calderbank, S. Jafarpour, and R. Schapire, "Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain," Princeton Univ., Princeton, NJ, USA, Tech. Rep., Dec. 2009.

[10] M.-Y. Tsai, C.-Y. Chou, and A.-Y. A. Wu, "Robust compressed analysis using subspace-based dictionary for ECG telemonitoring systems," in Proc. IEEE Int. Workshop Signal Process. Syst., Lorient, France, Oct. 2017, pp. 1–5.


[11] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 2, Jul. 2004, pp. 985–990.

[12] S. Decherchi, P. Gastaldo, A. Leoncini, and R. Zunino, "Efficient digital implementation of extreme learning machines for classification," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 8, pp. 496–500, Aug. 2012.

[13] E. Yao and A. Basu, "VLSI extreme learning machine: A design space exploration," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 1, pp. 60–74, Jan. 2017.

[14] R. E. Schapire and Y. Freund, Boosting: Foundations and Algorithms. Cambridge, MA, USA: MIT Press, 2012.

[15] R. G. Baraniuk, "Compressive sensing," IEEE Signal Process. Mag., vol. 24, no. 4, pp. 118–122, Jul. 2007.

[16] G. M. Friesen, T. C. Jannett, M. A. Jadallah, S. L. Yates, S. R. Quint, and H. T. Nagle, "A comparison of the noise sensitivity of nine QRS detection algorithms," IEEE Trans. Biomed. Eng., vol. 37, no. 1, pp. 85–98, Jan. 1990.

[17] X. Lu, L. Ming, W. Liu, and H.-X. Li, "Probabilistic regularized extreme learning machine for robust modeling of noise data," IEEE Trans. Cybern., vol. 48, no. 8, pp. 2368–2377, Aug. 2018, doi: 10.1109/TCYB.2017.2738060.

[18] E. J. Candès and T. Tao, "Near-optimal signal recovery from random projections: Universal encoding strategies?" IEEE Trans. Inf. Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.

[19] A. Riccardi, F. Fernández-Navarro, and S. Carloni, "Cost-sensitive AdaBoost algorithm for ordinal regression based on extreme learning machine," IEEE Trans. Cybern., vol. 44, no. 10, pp. 1898–1909, Oct. 2014.

[20] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Commun. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, Aug. 2006.

[21] F. A. Afsar, M. S. Riaz, and M. Arif, "A comparison of baseline removal algorithms for electrocardiogram (ECG) based automated diagnosis of coronary heart disease," in Proc. 3rd Int. Conf. Bioinform. Biomed. Eng., Jun. 2009, pp. 1–4.

[22] K. L. Park, K. J. Lee, and H. R. Yoon, "Application of a wavelet adaptive filter to minimise distortion of the ST-segment," Med. Biol. Eng. Comput., vol. 36, pp. 581–586, Sep. 1998.

[23] Z. Wang, K. H. Lee, and N. Verma, "Hardware specialization in low-power sensing applications to address energy and resilience," J. Signal Process. Syst., vol. 78, no. 1, pp. 49–62, Jan. 2015.

[24] W. M. Baranski, A. Wytyczak-Partyka, and T. Walkowiak, "Computational complexity reduction in PCA-based face recognition," Inst. Comput. Eng., Control Robot., Wroclaw Univ. Technol., Wroclaw, Poland, Tech. Rep., 2014.

[25] Z. Chen et al., "An energy-efficient ECG processor with weak-strong hybrid classifier for arrhythmia detection," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 7, pp. 948–952, Jul. 2018.

[26] Y. Weiss, H. S. Chang, and W. T. Freeman, "Learning compressed sensing," in Proc. Snowbird Learn. Workshop, Sep. 2007, pp. 1–7.

[27] J. K. Baksalary and O. M. Baksalary, "Particular formulae for the Moore–Penrose inverse of a columnwise partitioned matrix," Linear Algebra Appl., vol. 421, no. 1, pp. 16–23, 2007.

[28] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, Apr. 2011, Art. no. 27. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[29] H. Amin, K. M. Curtis, and B. R. Hayes-Gill, "Piecewise linear approximation applied to nonlinear function of a neural network," IEE Proc.-Circuits, Devices Syst., vol. 144, no. 6, pp. 313–317, Dec. 1997.

[30] K. H. Lee and N. Verma, "A low-power processor with configurable embedded machine-learning accelerators for high-order and adaptive analysis of medical-sensor signals," IEEE J. Solid-State Circuits, vol. 48, no. 7, pp. 1625–1637, Jul. 2013.

[31] Y. Chen, Z. Wang, A. Patil, and A. Basu, "A 2.86-TOPS/W current mirror cross-bar-based machine-learning and physical unclonable function engine for Internet-of-Things applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 6, pp. 2240–2252, Jun. 2019.

[32] S. K. Bose, B. Kar, M. Roy, P. K. Gopalakrishnan, and A. Basu, "ADEPOS: Anomaly detection based power saving for predictive maintenance using edge computing," in Proc. 24th Asia South Pacific Design Automat. Conf., Jan. 2019, pp. 597–602.

[33] F. Ren and D. Markovic, "18.5 A configurable 12-to-237 KS/s 12.8 mW sparse-approximation engine for mobile ExG data aggregation," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2015, pp. 1–3.

Huai-Ting Li (S'15) received the B.S. degree in electrical engineering and the Ph.D. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2013 and 2018, respectively. His research interests include the architecture and algorithm design of machine learning, 3-D networks-on-chip, and error-resilient system design.

Ching-Yao Chou (S'16) received the B.S. degree in electrical engineering and the Ph.D. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2014 and 2019, respectively. His research focuses on low-complexity signal processing for edge analysis, aiming to link compressive sensing with machine learning.

Yi-Ta Chen received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2017, where he is currently pursuing the Ph.D. degree with the Graduate Institute of Electronics Engineering. His research interests include machine learning engines for affective computing and SW/HW co-design for the SDN data plane.

Sheng-Hui Wang (S'18) received the B.S. degree in electrical engineering and the M.S. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2016 and 2018, respectively. His research interests include the architecture and algorithm design of error-resilient systems, biomedical signal processing, and affective computing.

An-Yeu (Andy) Wu (M'96–SM'12–F'15) received the B.S. degree from National Taiwan University (NTU), Taipei, Taiwan, in 1987, and the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, MD, USA, in 1992 and 1995, respectively. In 2000, he joined the Department of Electrical Engineering and the Graduate Institute of Electronics Engineering (GIEE), NTU, as a faculty member, where he is currently a Distinguished Professor and has been the Director of GIEE since 2016. From 2007 to 2009, he was on leave from NTU, serving as the Deputy General Director of the SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan. His research interests include VLSI architectures for signal processing and communications and adaptive/multirate signal processing. He has published more than 250 refereed journal and conference articles in the above research areas, together with five book chapters and 20 granted U.S. patents. He was elevated to IEEE Fellow in 2015 for his contributions to DSP algorithms and VLSI designs for communication ICs/SoCs. He received the Outstanding EE Professor Award from the Chinese Institute of Electrical Engineering, Taiwan, in 2010. He served as a Board of Governors member of the IEEE Circuits and Systems Society from 2016 to 2018.