
DL-LA: Deep Learning Leakage Assessment
A modern roadmap for SCA evaluations

Felix Wegener∗, Thorben Moos∗, and Amir Moradi

Ruhr University Bochum, Horst Görtz Institute for IT Security, Germany
firstname.lastname@rub.de

∗These authors contributed equally to this work

Abstract. In recent years, deep learning has become an attractive ingredient to side-channel analysis (SCA) due to its potential to improve the success probability or enhance the performance of certain frequently executed tasks. One task that is commonly assisted by machine learning techniques is the profiling of a device's leakage behavior in order to carry out a template attack. Very recently at CHES 2019, deep learning has also been applied to non-profiled scenarios, extending its reach within SCA beyond template attacks for the first time. The proposed method, called DDLA, has some tempting advantages over traditional SCA due to merits inherited from (convolutional) neural networks. Most notably, it greatly reduces the need for pre-processing steps when the SCA traces are misaligned or when the leakage is of a multivariate nature. However, similar to traditional attack scenarios, the success of this approach highly depends on the correct choice of a leakage model and the intermediate value to target.
In this work we explore whether deep learning can similarly be used as an instrument to advance another crucial (non-profiled) discipline of SCA which is inherently independent of leakage models and targeted intermediates, namely leakage assessment. In fact, given the simple classification-based nature of common leakage assessment techniques, in particular distinguishing two groups fixed-vs-random or fixed-vs-fixed, it comes as a surprise that machine learning has not been brought into this context yet. Our contribution is the development of a full leakage assessment methodology based on deep learning which gives the evaluator the freedom to not worry about location, alignment and statistical order of the leakages and that easily covers multivariate and horizontal patterns as well. We test our approach against a number of case studies based on FPGA measurements of the PRESENT block cipher, equipped with state-of-the-art hardware-based countermeasures. Our results clearly show that the proposed methodology and network structure (which remains unchanged between the experiments) outperform the classical detection approaches (t-test and χ²-test) in all considered scenarios.

1 Introduction

In an ideal world, side-channel security evaluations would be able to provide a qualitative and confident answer (pass or fail) to the question whether the device under test (DUT) is vulnerable to physical attacks or not. However, history has shown that this expectation is indeed a utopia. An exhaustive verification of the security of a DUT against all possible attack vectors is simply infeasible. Instead, the concept of leakage assessment has been introduced in order to answer


a slightly, but explicitly, less informative question, namely whether in general any kind of information can be extracted from side-channel measurements of the device under test. Clearly, in case this question is answered positively, no conclusions about the actual vulnerability of the device with respect to key recovery attacks can be drawn (although it is sometimes interpreted as an indication thereof). Yet, in case it is answered negatively (and no false negative occurs) the DUT should be sufficiently secure. In other words, leakage assessment is conceptually capable of providing the initially desired confidence in at least one of the two cases. This possibility inspired the quest for appropriate leakage assessment methods in academia and industry.

The most prominent approach is certainly distinguishing two groups of measurements, one for fixed and one for random inputs, by means of the Welch's t-test [5, 14]. Whenever these two groups are distinguishable with confidence one can conclude that the device reveals input-dependent information. However, this method has some severe limitations, especially when more sophisticated types of side-channel leakage need to be captured. First of all, since each point in time is evaluated independently, the approach inherently expects that any potential side-channel leakage is of a univariate nature and, more generally, that the detection of the leakage does not benefit from a combination of multiple points. Yet, many counterexamples to this assumption can be observed in reality. Although Schneider et al. [14] provide detailed information on how to perform the t-test at arbitrary order and variate, the performance of the test at higher variates either quickly runs into feasibility issues or its success depends highly on the expertise of the evaluator and the prior knowledge about the underlying implementation. On another note, a misalignment of the leaking samples between the individual traces leads to a significantly impaired detection as well. Thus, the Welch's t-test, as it is currently applied as a test vector leakage assessment (TVLA) methodology, is naturally unsuited to cover multivariate and horizontal leakages, as well as (heavily) misaligned traces. In addition to that, it was recently pointed out that the separation of statistical orders, which is often seen as a beneficial feature of the t-test when seeking the smallest key-dependent moment for example, causes false negatives when masked implementations with (very) low noise levels are analyzed or when the leakage is distributed over multiple statistical moments (as it is common for hardware masking schemes like threshold implementations) [11, 16]. Moradi et al. [11] suggested Pearson's χ²-test as a natural complement to the Welch's t-test to aggregate leakages distributed over multiple orders and to analyze the joint information. By combining the two approaches the risk of false negatives, especially in the previously described cases, can significantly be reduced. Yet, in the same manner as the t-test, the χ²-test analyzes the individual points in a leakage trace independently and therefore suffers from the same shortcomings when it comes to multivariate or horizontal patterns and misalignments.

Deep learning has been brought into the context of side-channel analysis mainly in order to improve the effectiveness of template attacks [6]. In a template attack the adversary is in possession of a fully-controlled profiling device, learns the leakage function of a certain cryptographic operation and subsequently uses the acquired knowledge to reveal sensitive information on a structurally identical but distinct target device where the secrets are unknown. Apart from the general suitability of deep learning to build classifiers for profiled side-channel attacks, it has also been demonstrated that certain features and structures of the applied


neural networks offer valuable advantages over classical template attacks. For example, it is shown in [3] that convolutional neural networks (CNNs) can lead to efficient classifiers even when the available side-channel traces suffer from a misalignment. Thus, due to their so-called translation invariance property, CNNs can be utilized to conquer jitter-based countermeasures. Very recently, the first non-profiled deep learning based side-channel attacks were demonstrated in the literature [17]. The proposed method, called DDLA, is based on guessing a part of the key, using it to compute the targeted key-dependent intermediate value, applying a leakage model and labeling the training data according to its result. Assuming that under the correct key hypothesis the differences between the classes implied by the leakage model correlate with the measured leakage traces (and for the incorrect guesses they do not), the impact of the correct key guess is visible in the training loss and the training accuracy respectively and can easily be identified. Although this approach depends on the correct choice of the targeted intermediate value and the applied leakage model just as much as traditional attacks do, it offers some tempting advantages. First of all, in case CNNs are used, the translation invariance property allows to analyze misaligned traces without any pre-processing. Secondly, when the leakage is of a multivariate nature or generally distributed over multiple points no recombination and no prior knowledge about the underlying implementation is required. Hence, deep learning is a powerful tool for non-profiled scenarios as well.

1.1 Our Contribution

For the first time in the literature we evaluate whether deep learning is an eligible strategy for black box leakage detection. To this end, we have developed an approach that is based on the concept of supervised learning. We call it deep learning leakage assessment (DL-LA) in the following. Simply put, we train a neural network with a randomly interleaved sequence of labeled side-channel measurements that have been acquired while supplying the DUT with one of two distinct fixed inputs (fixed-vs-fixed). Afterwards, in the validation phase, the trained network is supplied with unlabeled measurements from both groups and supposed to correctly classify them. Of course, the training set and the validation set are disjoint. In case the network succeeds with a higher percentage of correct classifications than could be achieved by a randomly guessing binary classifier with a non-negligible probability, it can be concluded that indeed enough information was included in the training set to distinguish the two groups. In other words, given the percentage of correctly classified traces and the size of the validation set one can easily calculate a confidence value, i.e., a probability, that the correct classifications were not just a random statistical occurrence. In this way it is possible to directly compare the confidence values achieved by DL-LA with the confidence provided by classical leakage assessment approaches, such as the Welch's t-test and Pearson's χ²-test.

Clearly, in order to qualify as a reasonable strategy for leakage assessment, it should be given that the network which is trained to become the classifier offers a fairly robust and universal performance, independent of the type of side-channel leakage to be detected and independent of exterior parameters such as the trace length. The evaluator should not be required to tune the hyper parameters of the network, since this easily becomes computationally expensive, especially when the result of the evaluation is not known a priori. Thus, one of the main concerns


is whether it is possible to select a network structure that performs robustly when faced with different types of side-channel leakage and characteristics of the traces. Interestingly, we found that a very simple network structure consisting of only a few fully-connected layers with a fixed number of neurons handles this task surprisingly well. We test the chosen network in a total of 7 FPGA-based case studies featuring the PRESENT ultra-lightweight block cipher with different kinds of countermeasures applied. The classification capability of our network does not only withstand misaligned and noisy traces, but is able to deal with univariate and multivariate higher-order leakage as well. In all 7 case studies we compare the success of our method to both the Welch's t-test and Pearson's χ²-test and show that DL-LA outperforms the leakage assessment capabilities of the classical techniques in all considered scenarios (either by requiring fewer traces to achieve the same confidence or by providing a higher confidence for the same amount of traces). We also present one scenario where both the univariate and the multivariate versions of the t-test and the χ²-test fail to detect leaked information with confidence, while DL-LA still succeeds with only half of the available traces. As an unintended byproduct of our practical case studies, we provide the most detailed practical comparison between the Welch's t-test and Pearson's χ²-test that has been reported in the literature so far.

The most outstanding advantage of our approach is clearly that the network is free to combine as many points for the classification of the two groups as necessary. Thus, even in complex scenarios of purely multivariate or horizontal leakages, the traces can simply be fed as training data into the network without any pre-processing or manual selection of points. Accordingly, neither a high expertise is demanded from the evaluator, nor is it required to obtain any prior information about the underlying implementation or the type of leakage that is expected. Additionally, DL-LA entails a much lower risk of false positives than current approaches, as it provides a single confidence value to assess the distinguishability of the groups. The traditional point-wise methods would actually need to normalize their confidence values to the number of points in the traces to provide a correct confidence threshold. However, this inaccuracy is mostly disregarded in their respective methodologies. Even though DL-LA provides only a single confidence value, the approach can still identify the points of interest in side-channel traces that contain leakage, by performing a Sensitivity Analysis (SA) on the trained network. Obviously, the average computation time to perform a DL-LA is significantly higher when compared to simple univariate tests. However, as soon as more complex types of side-channel leakage need to be analyzed, the additional run time quickly pays off, since the effort that otherwise has to be spent in order to make traditional methods recognize those complex patterns (if even possible) grows even bigger and contains several steps that are hard to automate.

1.2 Claims and Non-Claims

In order to avoid any potential confusion regarding our claims, or lack thereof, we explicitly list the most important statements below:

We do not claim that ...

– our chosen network is optimal for leakage detection in general or for any of the considered case studies in particular. We are certain that there is room for improvement, as we intentionally optimized for robustness and simplicity instead of single case performance.

– our chosen network necessarily leads to a classifier that outperforms the t-test or the χ²-test for any given side-channel traces.

– DL-LA should replace all other leakage detection techniques, but rather that it can be extremely helpful especially in cases where detection is non-trivial.

– DL-LA generally causes no or fewer false negatives than the classical approaches.

We do claim that ...

– the chosen network offers some basic universality and robustness. We have tested the network even beyond the case studies presented in this work (and also for extreme cases like a trace length of 1 or greater than 10 000) and it delivered reliable results in all of them.

– the chosen network is able to learn first-order, higher-order, univariate, multivariate and horizontal leakages without requiring any pre-processing or prior knowledge of the underlying implementation.

– DL-LA entails a much lower risk of false positives (when the same confidence threshold is chosen) since it provides one confidence per set of traces instead of one confidence per time sample in the trace set.

2 Background

In this section we introduce the necessary background with respect to the roots and the state-of-the-art of leakage assessment, as well as deep learning and its applications to side-channel analysis.

2.1 Leakage Assessment

Ever since the introduction of side-channel attacks in 1999 [8] the standard approach for assessing the physical vulnerability of a device has been a more or less exhaustive verification of its resistance against known attacks while attempting to cover a broad range of intermediate values and hypothetical leakage models. This approach, however, became less feasible over the years due to the increasing amount of new attack methods and the higher complexity of potential leakage models due to the introduction of countermeasures against physical attacks. Another concern regarding this procedure is that it entails a significant risk of reporting physical security in favor of the DUT while in reality merely a certain attack vector was missed in the process (by mistake or because it was unknown at time of evaluation) that could indeed enable key recovery [14]. Hence, the need for a robust and reliable standard leakage assessment method independent of concrete attack scenarios, targeted intermediates and hypothetical leakage models grew consistently over the years. In an attempt to gather and evaluate promising candidates, the National Institute of Standards and Technology (NIST) hosted a "Non-Invasive Attack Testing Workshop" in 2011. One of the most intriguing proposals at the workshop was the use of the non-specific Welch's t-test [5] for leakage detection. Leakage detection avoids any dependency on the choice of intermediates and leakage models by focusing on the detection of leakage only, without paying any attention to the possibility to exploit said leakage for key recovery. Simply put, the concept is based on supplying the device


under test with different inputs, recording its leakage behavior and evaluating whether a difference can be observed. Thus, such a method is suitable for black box scenarios and allows certification of a device's physical security by third party evaluation labs without the need to test a multitude of different methods and parameter combinations. Seven years later, after some shortcomings of the moment-based nature of the t-test had been identified [16], another popular statistical hypothesis test was proposed for leakage detection purposes, namely the Pearson's χ²-test [11]. Both hypothesis tests, the t-test and the χ²-test, are applied in the field of statistics in order to answer the question whether two sets of data are significantly different from each other. To be more precise, the evaluation of the tests examines the validity of the null hypothesis, which constitutes that both sets of data were drawn from the same population (i.e., they are indistinguishable) [14]. In side-channel analysis contexts, it is usually evaluated whether two groups of measurements can be distinguished with confidence. Traditionally, those two groups are acquired by supplying the DUT either with random (group Q0) or a fixed input (group Q1), selected by coin toss. Later, it has been demonstrated that the careful choice of two distinct fixed inputs (instead of maintaining one group for random inputs) usually leads to a lower data complexity for the distinction [4]. We provide the details on how to conduct the Welch's t-test and Pearson's χ²-test below.

Welch’s t-test. We denote two sets of data by Q0 and Q1, their cardinality byn0 and n1, their respective means by µ0 and µ1 and their standard deviations bys0 and s1. The t-statistics and the degrees of freedom v can then be computedusing the following formulas.

t = \frac{\mu_0 - \mu_1}{\sqrt{\frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}}}, \qquad v = \frac{\left(\frac{s_0^2}{n_0} + \frac{s_1^2}{n_1}\right)^2}{\frac{\left(s_0^2/n_0\right)^2}{n_0 - 1} + \frac{\left(s_1^2/n_1\right)^2}{n_1 - 1}}

Afterwards, the confidence p to accept the null hypothesis can be estimated via the Student's t probability density function, where Γ(·) denotes the gamma function [14, 11].

p = 2 \int_{|t|}^{\infty} f(t, v)\, \mathrm{d}t, \qquad f(t, v) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{\pi v}\, \Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}

In practice, for the sake of simplicity, it is common to only evaluate the t-statistics and to set the confidence threshold for distinguishability to |t| > 4.5. The statistical background of this threshold is that for |t| > 4.5 and v > 1000 the confidence p to accept the null hypothesis is smaller than 0.00001, which is equivalent to a 99.999 % confidence that the two sets were not drawn from the same population. Of course, when the degrees of freedom v is not explicitly evaluated, it can occur that the assumption v > 1000 does not hold. However, practice has shown that this procedure rarely produces false positive results in side-channel analysis contexts. Yet, calculating the actual confidence p is certainly preferable, scientifically correct and can still be efficiently implemented [11]. Since the Welch's t-test is designed to distinguish the means of two distributions, it can only be applied to first-order univariate analyses in its simplest form. Schneider et al. [14] extended the methodology to arbitrary orders and variates and provide the required formulas for incremental one-pass computation of all moments.
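For illustration, a point-wise first-order Welch's t-test following the formulas above could be implemented roughly as follows (a minimal sketch, not the authors' code; it assumes NumPy/SciPy and that q0 and q1 are arrays of shape (number of traces, number of samples) holding the two groups):

import numpy as np
from scipy.stats import t as student_t

def welch_t_test(q0, q1):
    # point-wise means and (unbiased) variances, each divided by the group size
    n0, n1 = q0.shape[0], q1.shape[0]
    v0 = q0.var(axis=0, ddof=1) / n0
    v1 = q1.var(axis=0, ddof=1) / n1
    t_stat = (q0.mean(axis=0) - q1.mean(axis=0)) / np.sqrt(v0 + v1)
    v = (v0 + v1)**2 / (v0**2 / (n0 - 1) + v1**2 / (n1 - 1))
    # two-sided p-value: confidence to accept the null hypothesis
    p = 2 * student_t.sf(np.abs(t_stat), v)
    return t_stat, v, p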


Pearson’s χ2-test. In order to mitigate some of the limitations and shortcomingsof the moment-based nature of the Welch’s t-test, in particular for higher-orderanalyses of masked implementations, Moradi et al. [11] suggested the Pearson’sχ2-test. In contrast to the t-test this hypothesis test analyzes the full distribu-tions and can capture information that lies is multiple statistical moments. Thus,it prevents false negatives when moment-based analyses become suboptimal [11].In a first step a contingency table F has to be constructed from the two sets Q0

and Q1 (basically two histograms). We denote the number of rows by r (= 2,when two sets are compared) and the number of columns by c (number of binsof the histograms). The χ2-statistics x and the degrees of freedom v can then becomputed using the following formulas.

x = \sum_{i=0}^{r-1} \sum_{j=0}^{c-1} \frac{(F_{i,j} - E_{i,j})^2}{E_{i,j}}, \qquad v = (r - 1)\cdot(c - 1)

E_{i,j} denotes the expected frequency for a given cell, with N the total number of samples:

E_{i,j} = \frac{\left(\sum_{k=0}^{c-1} F_{i,k}\right) \cdot \left(\sum_{k=0}^{r-1} F_{k,j}\right)}{N}

Finally, the confidence p to accept the null hypothesis is estimated via the χ² probability density function, where Γ(·) denotes the gamma function [11].

p = \int_{x}^{\infty} f(x, v)\, \mathrm{d}x, \qquad f(x, v) = \begin{cases} \dfrac{x^{\frac{v}{2}-1}\, e^{-\frac{x}{2}}}{2^{\frac{v}{2}}\, \Gamma\left(\frac{v}{2}\right)} & x > 0 \\ 0 & \text{otherwise} \end{cases}

In contrast to the t-test this procedure can easily be extended to more than two sets of data (r > 2), which can be a valuable feature when used as a distinguisher for key recovery attacks. Generally, it can be said that in cases where the χ²-test provides a higher confidence to reject the null hypothesis than the t-test (on the same side-channel data), the analysis of the leakages requires some special attention. This is usually the case when masked implementations with low noise levels are analyzed or when hardware-masking schemes like threshold implementations cause leakages in multiple moments due to physical defaults such as glitches [11].
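As a rough illustration (again not the authors' code), a point-wise χ²-test for a single time sample could be sketched as follows, assuming SciPy and integer-quantized oscilloscope samples so that the histogram bins are well defined:

import numpy as np
from scipy.stats import chi2_contingency

def chi2_test_point(q0_samples, q1_samples, bins=256):
    # contingency table F: one histogram per group (r = 2 rows, c columns)
    f0, edges = np.histogram(q0_samples, bins=bins)
    f1, _ = np.histogram(q1_samples, bins=edges)
    table = np.vstack([f0, f1])
    # drop empty bins to avoid zero expected frequencies
    table = table[:, table.sum(axis=0) > 0]
    x, p, v, _ = chi2_contingency(table, correction=False)
    return x, v, p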

2.2 Deep Learning

We give a brief summary of the history and applications of deep learning and subsequently introduce definitions and explain the underlying principle.

History and Applications. Historically, the field of machine learning dealt with extracting meaningful information from data by applying relatively simple mathematical models, e.g., Bayes Classifiers, Support Vector Machines or Decision Trees, to a sanitized version of the input data. This required manual and time-consuming feature engineering to predetermine which elements might be useful in a given set of raw data and how to best represent them, e.g., Canny edge detection as a first hard-coded step for image classification.
In contrast, deep learning methods are generally capable of learning from raw input data, thereby making the elaborate modeling process unnecessary. Since the breakthrough improvement of classification accuracy on the ImageNet data


set in 2012 [9], deep learning has been successfully applied to many diverse tasks such as speech recognition, drug discovery, natural language processing, visual art style transfer, image classification, autonomous driving and strategy games.

More recently, the side-channel community discovered deep learning as a tool to perform profiled attacks [3, 6, 10] with competitive results compared to classical modeling techniques, e.g., based on a multivariate normal distribution. Apart from our present work, only Timon at CHES 2019 has investigated the use of deep learning in the non-profiled case [17]. The author exploits the correlation between a correct key guess and a steep learning rate to enable key recovery. Unfortunately, the method is computationally intense as a separate model has to be trained for every key guess and its success is highly dependent on the correct labeling of the data, which implies a suitable choice of leakage model and targeted intermediate, making it unsuitable as a starting point for efficient and confident leakage assessment.

Principle and Definitions. In the following we limit ourselves to sequential neural networks (without recurrent elements) used for the purpose of classification. The aim of this description is to give brief definitions for the standard terms in deep learning, while the explanation of principles is intentionally very shallow. A neural network is structured into multiple layers, each containing a matrix of learnable weights w that is linearly applied to its inputs x and a non-linear activation function applied to each coordinate of the result of this matrix multiplication. The output of this combined operation is taken as an input for the subsequent layer. Finally, the output layer of the neural network contains as many output coordinates as classes¹ (c) and uses softmax as an activation function

\mathrm{softmax}(x_j) = \frac{e^{x_j}}{\sum_{i=1}^{c} e^{x_i}},

such that the sum over all outputs is always equal to one, thereby forming a probability distribution over the possible class labels.
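As a quick numerical illustration of this definition (hypothetical values, not taken from the paper):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # subtract the maximum for numerical stability
    return e / e.sum()

scores = np.array([2.0, -1.0])   # raw outputs of a two-class final layer
print(softmax(scores), softmax(scores).sum())  # probabilities that sum to 1.0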

Let us first assume the weights are initialized with some values before an evaluation of the network takes place by applying the function of the first layer to the input sample and subsequently propagating the computed values forward layer by layer until all layers have been evaluated. The prediction y′ consists in the output coordinate of the final layer with the highest value.

In the beginning, the weights in a neural network are initialized with random values. To determine useful weights that achieve accurate prediction values, a training phase is necessary. First, the designer needs to define a metric to measure the distance between a prediction y′ and the actual class label y. This metric is called a loss function which determines the loss score. To perform training, a data set with labeled inputs, i.e., a list of tuples (x, y), is separated into batches of a fixed size b. The neural network is evaluated simultaneously on all samples in a batch, thereby producing loss scores. After each batch an optimization strategy based on Backpropagation is used to adjust all weights in the neural network dependent on the gradient of the loss function. Each iteration through the entire training set is called an epoch. To minimize the loss score, training over multiple epochs is performed, in each of which the training data is randomly regrouped into new batches. For simplicity we assume that the training ends after a predetermined number of user-defined epochs.

1 We limit ourselves to this variant called one-hot encoding.


To judge the quality of the classifier during and after training, the metrics accuracy and validation accuracy should be considered. While accuracy is related only to the training set, validation accuracy takes an entirely separate validation set into account, to ensure that the traits learned are actually generalizable as opposed to rote learning of the specific training set (the latter phenomenon is called overfitting).

When choosing and training a deep learning model, the designer has to determine values for many so-called hyper parameters²; these include the depth of the network, the types and sizes of layers, their activation function, the loss function and the optimizer strategy. We provide the hyper parameters we chose and maintained throughout all of our case studies in Section 3.3.

3 DL-LA: Deep Learning Leakage Assessment

We introduce Deep Learning Leakage Assessment (DL-LA), a novel leakage assessment methodology based on deep learning. Our method is simple to apply and outperforms classical leakage detection approaches such as the Welch's t-test and the more recently proposed Pearson's χ²-test in many cases due to its intrinsically multivariate nature.

3.1 Core Idea of DL-LA

The aim of leakage assessment is to determine whether an attacker is able to extract information from side-channel measurements. The current state of the art for non-profiled adversary models is based on univariate statistical distinction tests (Welch's t-test, Pearson's χ²-test) which are applied to two groups of side-channel measurements collected for two distinct fixed inputs processed by the target implementation (alternatively one group for random inputs and the other one for fixed).

DL-LA maintains the basic idea of distinguishing two groups of side-channel traces from each other (fixed-vs-fixed). Hence, from an evaluator's perspective the entire measurement setup and tool-chain can remain unchanged when adopting our methodology. We apply deep learning to the concept of leakage assessment by training a neural network to serve as a distinguisher between the two groups. This is done in a supervised-learning-based approach by applying labeled data from both groups to the network. This set of data is then called the training set. Afterwards, the classification capabilities of the network are evaluated on a distinct validation set of labeled measurements without revealing the true labels to the network. The success rate of the classification on the validation set quantifies the amount of information that could be extracted from the training set by the neural network in order to provide a better-than-random guess which of the two fixed inputs was processed by the target. In case this classification succeeds with a higher percentage than it could be achieved by randomly guessing with a non-negligible probability, it gives clear evidence for the fact that informative side-channel leakage is present. In this context we present a simple metric to determine an exact p-value that quantifies the statistical confidence in the evidence.

Please note that the number of required traces to extract the information is only related to the training set.

2 In distinction from the parameters, i.e., the concrete weights learned during training.


The size of the validation set can be chosen completely independently and influences the result of the detection only if generalizable features (i.e., side-channel leakage) could be extracted from the training set. Otherwise the percentage of correct classifications will never be significantly different from 50 %, no matter how large the validation set is. We discuss the partition strategy into training set and validation set to decouple the number of traces available to the attacker from the statistical confidence the evaluator wants to obtain in Section 5.

Given an identical number of traces, our proposed DL-LA leads in many cases to a higher statistical confidence than the t-test and χ²-test and is less prone to produce false positives. In the case of an implementation that does not show univariate, but only multivariate leakage, DL-LA outperforms current leakage assessment methods by a large margin without requiring any knowledge about the underlying implementation.

3.2 Overall Methodology

We assume that the recorded traces have already been separated into a set of N training traces and a set of M validation traces, the latter of which should have an equal number of elements from both groups to maximize the statistical confidence value that can be obtained during the evaluation³. First, the evaluator has to pick a confidence level, i.e., an upper bound on the chance that a false positive occurs. We assume the common threshold in SCA evaluations of p_th = 10^-5. Now, let v be the validation accuracy obtained by the neural network, then the total number of correct classifications is computed as s_M = v · M. Considering the null hypothesis H0 where the neural network did not learn anything and classifies randomly (coin flip model), this corresponds to modeling the total number of correct guesses as a random variable following a binomial distribution

H_0: X \sim \mathrm{Binom}(M, 0.5).

The probability that at least s_M correct classifications occur in a purely random classifier is given by P(X ≥ s_M). This probability is easily computed as

P(X \geq s_M) = \sum_{k=s_M}^{M} \binom{M}{k} 0.5^k\, 0.5^{M-k} = 0.5^M \sum_{k=s_M}^{M} \binom{M}{k}.

Now, we say that the implementation leaks information about intermediate values if

P(X \geq s_M) < p_{th}.
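A minimal sketch (assuming SciPy; the variable names are ours, not the authors') of how this confidence could be computed from the validation accuracy:

from scipy.stats import binom

def dl_la_p_value(v_acc, M):
    s_M = int(round(v_acc * M))        # number of correctly classified validation traces
    return binom.sf(s_M - 1, M, 0.5)   # P(X >= s_M) under H0: X ~ Binom(M, 0.5)

# example: 5 300 of 10 000 validation traces classified correctly
p = dl_la_p_value(0.53, 10_000)        # roughly 1e-9
leaks = p < 1e-5                       # compare against the threshold p_th
# for very high accuracies p underflows; binom.logsf(s_M - 1, M, 0.5) avoids this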

In this case the exact location of leakage can be determined subsequently by Sensitivity Analysis (cf. Section 3.4).

In the following, we specify a neural network that showed excellent and robust results during our case studies. Further, we introduce Sensitivity Analysis as a method of leakage localization and preempt several common pitfalls during adoption.

3 We provide a discussion on the size of both sets in Section 5.


3.3 Proposed Network Structure

To perform DL-LA we suggest a generalizable network architecture that is free of any assumptions about the traces to be analyzed or about the underlying implementation of the DUT. We performed all case studies in Section 4 with the same network built within the Python library Keras using TensorFlow as the backend.

Specification. Initially, we determine the (point-wise) mean µ and standard deviation σ of the traces in the training set, thereby standardizing both training and validation sets:

X_i^j := (X_i^j - \mu_i)/\sigma_i,

with j denoting the trace and i the time sample within the trace. This very lightweight pre-processing step is necessary to reach a homogeneous range between all input points and weights, thus enabling efficient training.
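This step could look as follows (a sketch with hypothetical array names; the validation set is standardized with the training-set statistics):

import numpy as np

def standardize(train_traces, val_traces):
    mu = train_traces.mean(axis=0)      # point-wise mean over the training set
    sigma = train_traces.std(axis=0)    # point-wise standard deviation
    return (train_traces - mu) / sigma, (val_traces - mu) / sigma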

The network consists of four fully-connected layers (Dense) with 120, 90, 50, and 2 output neurons, respectively. The input layer and each of the inner layers use a Rectified Linear Unit (ReLU) as an activation function, while the final layer uses softmax. The four Dense layers are each separated by a BatchNormalization layer. In summary, the model can be defined in Python as:

model = Sequential([
    Dense(120, activation='relu', input_shape=(tracelength,)),
    BatchNormalization(),
    Dense(90, activation='relu'),
    BatchNormalization(),
    Dense(50, activation='relu'),
    BatchNormalization(),
    Dense(2, activation='softmax')])

Further, we used the mean squared error as a loss function and adam as an optimizer with the default parameters provided by Keras⁴. We chose the batch size according to the memory restrictions of our Tesla K80 graphics card with 11 GB VRAM as 2 000 samples for traces of length 5 000 points and 20 000 samples for traces of length 500 points.
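Putting these pieces together, training with the stated settings might look roughly as follows (a sketch under the assumption of a Keras 2-style API; X_train, y_train, X_val, y_val are hypothetical names for the standardized traces and their one-hot labels):

model.compile(optimizer='adam',              # Keras default parameters, cf. footnote 4
              loss='mean_squared_error',
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    batch_size=2000,         # e.g., for traces of length 5 000 points
                    epochs=30,
                    validation_data=(X_val, y_val),
                    shuffle=True)            # batches are re-grouped every epoch

# depending on the Keras version the history key is 'val_acc' or 'val_accuracy'
val_accuracy = history.history['val_accuracy'][-1]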

Justification. We chose ReLU, defined as

\mathrm{relu}(x) = \begin{cases} x, & x \geq 0 \\ 0, & x < 0 \end{cases}

as an activation function over other common possibilities, e.g., tanh or sigmoid, because of better results regarding validation accuracy in initial tests as well as for better computational performance when operating on large datasets (which is highly relevant for the evaluation of protected implementations). We chose the softmax activation function of the final layer to create a probability distribution over both classes as explained in Section 2.2. The purpose of each BatchNormalization layer is to decouple the learning process of all Dense layers from each other and additionally provide a means of regularization to prevent overfitting [7].

4 lr = 0.001, β1 = 0.9, β2 = 0.999, ε = 10^-8, decay = 0.0


We confirmed the suitability for univariate and multivariate leakage located in different statistical orders and for traces as short as one point and as long as 10 000 points, as may be encountered during a typical leakage evaluation of symmetric cryptographic primitives. We provide extensive detail on the performance of our leakage assessment approach in different case studies in Section 4.

3.4 Extracting Temporal Information

If leakage is detected, the hardware designer or evaluator is usually interested in exactly pinpointing the leakage locations to report or alleviate the shortcoming, e.g., masking flaws. By applying Sensitivity Analysis (SA) [15, 17] we can exactly locate all points of interest by quantifying how much they contributed to the leakage function learned by the neural network. In short, SA determines the partial derivatives of one output coordinate of the neural network with respect to the network inputs, thereby characterizing the effect of a slight change in each individual input on the classification outcome.

We perform SA on the final network after training has completed by averaging the gradients of one output coordinate (with respect to the network inputs) weighted with the network inputs for all samples in the validation set and subsequently take the absolute value. More precisely, let x_i denote the i-th input of our network, y_0 the first output coordinate of the network and X_i^j the value of the i-th input for trace j in the validation set. Then the sensitivity can be determined as:

s_i = \left| \sum_{j} \frac{\partial y_0}{\partial x_i} \cdot X_i^j \right|.

While the actual value of this expression has to be determined via the chain rule over all network layers, this process is fully automated by TensorFlow such that the remaining effort for the evaluator is a single function call.
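Under the assumption of a TensorFlow 2 backend, such a "single function call" could be approximated as follows (an illustrative sketch, not the authors' code; X_val denotes the standardized validation set):

import numpy as np
import tensorflow as tf

def sensitivity(model, X_val):
    x = tf.convert_to_tensor(X_val, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y0 = model(x)[:, 0]          # first output coordinate, one value per trace
    grads = tape.gradient(y0, x)     # dy0/dx_i for every trace j and sample i
    # input-weighted gradients, summed over the validation set, absolute value
    return np.abs(tf.reduce_sum(grads * x, axis=0).numpy())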

3.5 Common Pitfalls

We discuss the most important differences between the classical detection approaches and DL-LA and aim to preempt common pitfalls an evaluator might encounter with our leakage assessment:

Group Imbalance. While the classical TVLA based on the Welch's t-test as well as the χ²-test can handle groups imbalanced in mean, variance and size, we want to stress that an equalization of group sizes in the validation set is extremely important for DL-LA. If the groups are imbalanced, the test statistic no longer follows the distribution X ∼ Binom(M, 0.5). Instead, always assigning the label of the more common group leads to a classifier which outperforms random guessing, without actually being able to distinguish the groups based on their traces. This discrepancy between actual and theoretical distribution of the test statistic – given a sufficiently large validation set – will lead to false positives. We strongly advise pruning both groups in the validation set to an exact ratio of 50/50.
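Pruning could be as simple as the following sketch (hypothetical variable names; y_val holds the group index 0 or 1 per validation trace):

import numpy as np

def balance_validation_set(X_val, y_val, seed=0):
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y_val == 0)
    idx1 = np.flatnonzero(y_val == 1)
    m = min(len(idx0), len(idx1))                     # size of the smaller group
    keep = np.concatenate([rng.choice(idx0, m, replace=False),
                           rng.choice(idx1, m, replace=False)])
    return X_val[keep], y_val[keep]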


Probability Adaption. An obvious idea to counteract the issue just addressed is the adaption of the success probability in the Binomial distribution. Assume a slight (or even significant) imbalance ε in group sizes

\frac{|G_0|}{|G_0| + |G_1|} = 0.5 + \varepsilon.

The evaluator could simply adapt the distribution of the test statistic to

X ∼ Binom(M, 0.5 + ε).

While this might even show satisfactory results in the case of low noise and unprotected or severely flawed implementations, which in turn lead to a high validation accuracy, we caution against any alteration of the distribution. In all practically relevant cases (protected implementation, moderate noise) a change of the success probability severely lowers the confidence of the statistical test. More specifically, consider a validation set of size 500 000 traces over which a validation accuracy of v = 0.506 has been achieved. In case of a balanced validation set this event is highly statistically significant (10^-17). However, if the validation set contains a small bias of ε = 0.004, no significance can be concluded as the remaining likelihood for this event is only 10^-3. It is obvious that the adapted test loses its statistical power in all interesting cases; hence, false negatives might occur. Therefore, we want to reinforce the previous point to prune the validation set to an exact 50/50 ratio.
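The numbers above can be reproduced with a few lines (an illustrative check using SciPy, not part of the original evaluation):

from scipy.stats import binom

M = 500_000
s_M = int(0.506 * M)                       # 253 000 correct classifications
p_balanced = binom.sf(s_M - 1, M, 0.5)     # ~1e-17: highly significant
p_adapted  = binom.sf(s_M - 1, M, 0.504)   # ~1e-3: no longer significant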

Overfitting. We caution against using an overly complicated neural network as it might lead to overfitting, which is defined by a continuous rise of the training accuracy over the number of epochs while the validation accuracy begins to fall. The underlying cause of this effect is the memorization of the training set as opposed to learning generalizable features of the entire set. Hence, it can be prevented by using a network with a simple structure which does not contain excessively many weights and optionally includes Normalization, Regularization or Dropout layers (cf. Section 3.3).

4 Experimental Results

In the following we provide an experimental verification of the suitability of DL-LA as a black box leakage assessment strategy. In contrast to related works we do not test our approach under circumstances of purely theoretical relevance, such as noise-free simulations or Sbox-only measurements with an extremely high signal-to-noise ratio (SNR). Instead, we strive for a realistic benchmark of our approach with a clear real-world impact. For this reason we chose hardware implementations of the full PRESENT-80 ultra-lightweight block cipher [2] as the common target in our case studies. PRESENT has been developed for ubiquitous and resource-constrained computing environments, which exactly constitutes the type of application that commonly requires side-channel security as a design goal and may be certified by third-party evaluation labs. We target a very compact serialized hardware implementation of the cipher and two protected versions of it which both feature provable first-order security. One of them even provides security at any order against univariate-only attacks. In all scenarios, we compare the leakage assessment capability of DL-LA to the previously introduced state-of-the-art methods Welch's t-test and Pearson's χ²-test. We conclude that DL-LA


Fig. 1. Unprotected serialized PRESENT architecture with a 4-bit data path (state register, key register, PLayer and Sbox S).

is able to confidently detect leakage with fewer traces or a higher confidence on the same amount of traces compared to the conventional methods, which makes it an extremely valuable tool for leakage assessment. Especially in the case studies where only multivariate leakage is present DL-LA outperforms the state of the art by a large margin.

4.1 Measurement Setup

We have implemented the different instances of the PRESENT block cipher on a SAKURA-G board [1] which has specifically been designed for SCA evaluations. The board features two Spartan-6 FPGAs, one as a target and the other as a control interface. We measured the voltage drop over a 1 Ω shunt resistor in the Vdd path of the target FPGA, which is amplified through the built-in AC amplifier, with a digital sampling oscilloscope at a sampling rate of 1 GS/s for the first 5 case studies and 100 MS/s for the last 2. The targets were clocked at a frequency of 6 MHz, with the exception of the clock-randomized case study which is detailed later. For all case studies we measured side-channel traces in a fixed-vs-fixed manner for two arbitrarily selected fixed inputs. We have taken care to follow all rules that have to be considered to avoid false positives in leakage assessment [14], e.g., the measurements of the two groups are randomly interleaved and in the masked cases the communication between the control and the target FPGA is performed in a shared manner (in our case the same holds for the communication with the measurement PC).

Case Study 1: Unprotected PRESENT, aligned Traces

In this first case study we target an unprotected serialized implementation of the PRESENT block cipher. The architecture can be seen in Figure 1 and is similar to profile 1 introduced in [13]. As a first step we evaluate the confidence to distinguish the two groups of measurements (fixed-vs-fixed) by conventional methods. The results of the first-order t-test and the χ²-test can be seen in Figure 2. In both cases we plot the confidence values p instead of relying on the common (and less precise) approach of defining a threshold for the intermediate statistics (e.g., |t| > 4.5). The t-test succeeds in providing a confidence higher than 99.999 % for the distinguishability of the two groups after only 20 traces since it shows a probability below 10^-5 to accept the null hypothesis. The χ²-test requires approximately 90 traces to overcome the desired confidence threshold. In conclusion, neither of the two methods faces any problems to distinguish the leakage distributions with a high confidence when 1 000 traces are considered.

When applying DL-LA to the same traces, the results in Figure 3 are achieved.


Fig. 2. Univariate leakage assessment using 1 000 traces (step size 10) of an unprotected serialized PRESENT-80 implementation. From top to bottom: 1) Sample trace, 2) Overlay of 10 sample traces, 3) t-test results, 4) χ²-test results.

Fig. 3. DL-LA and Sensitivity Analysis using 1 000 traces (step size 10) of an unprotected serialized PRESENT-80 implementation. For each p value 30 epochs and a validation set of 10 000 traces are considered.

We have to state here that a plot as depicted in Figure 3(b) is rather unnatural to obtain using DL-LA. Normally, training and validating the network results in a confidence value after each epoch. Thus, it would be more natural to train the network on a training set of fixed size and to show the p values over the number of epochs to determine how many are required to overcome the threshold. However, in order to offer the best possible comparison between the leakage assessment approaches we repeated this process 100 times for a fixed number of epochs (30) and a training set that increases by 10 traces per step and plotted the maximum confidence over the number of traces. The result shows that a network which is trained on only 10 traces is already capable of providing an extremely high confidence that the two groups are distinguishable (since large -log10(p) values give confidence to reject the null hypothesis). By increasing the size of the training set the confidence is boosted significantly until the p values stagnate in a corridor between 10^-2300 and 10^-3011. Please note that, as the validation set has a size of 10 000 traces, the minimum achievable p value is 0.5^10000 = 10^-3011. Thus, the stagnation in the corridor is simply caused by the fact that (almost) all of the traces in the validation set were classified correctly. By using a larger validation set the -log10(p) values would rise even beyond 3011. We also perform a Sensitivity Analysis on the network to determine the points of interest and obtain a spatial resolution comparable to the univariate tests (cf. Figure 3(a)). The absolute values of the SA are not meaningful and cannot be compared to any threshold. Thus, they are omitted here. In summary, DL-LA outperforms the classical detection approaches in terms of required number of traces and absolute confidence provided. Of course, for the evaluation of DL-LA as performed in this case study, a validation set is required on top of the training set. However, please note that we only chose a validation set of 10 000 traces here in order to show the extremely high magnitude of achievable confidence values, even when considering very small training sets⁵. In fact, the indication of distinguishability relates only to the training set, and, in case the network learned generalizable features from it, the confidence can be arbitrarily boosted by increasing the validation set. If no generalizable features were learned (e.g., because no leakage is present) the percentage of correct classifications will not be significantly different from 0.5. The advantages of decoupling the confidence from the number of traces are discussed in Section 5. In Figure 19 of Appendix A, we additionally provide DL-LA results for the first three case studies where the size of the union of the training and the validation set does not exceed the number of traces considered by the t- and the χ²-test. Even then DL-LA outperforms the classical approaches.

Case Study 2: Unprotected PRESENT, misaligned Traces

This case study is an exact replication of the previous one apart from the fact that we artificially created a misalignment of the traces, as apparent in Figure 4(b). This misalignment was achieved by forcing the oscilloscope to trigger the acquisition of the power traces close to the peak of the rising edge of the trigger signal (in our case at 2.48 V while the peak is at 2.5 V) as opposed to the more stable part in the middle of the edge. Thus, due to the electronic noise, the acquisition is in some cases triggered earlier than in others and the traces are not perfectly aligned anymore. Figure 4 shows that the t- and χ²-test results do not seem to significantly suffer from this misalignment when considering the absolute magnitude of the -log10(p) values. However, the number of traces to overcome the threshold is increased in comparison to the previous case study in both tests. DL-LA also performs similarly as before, as apparent from Figure 5, and outscores the classical detection approaches in required traces and provided confidence.

5 The minimum size of the validation set in order to be able to overcome the detection threshold is 17, since -log10(0.5^17) > 5. However, this assumes a 100 % correct classification by the network, otherwise a larger set needs to be considered.



Fig. 4. Univariate leakage assessment using 1 000 misaligned traces (step size 10) of an unprotected serialized PRESENT-80 implementation. From top to bottom: 1) Sample trace, 2) Overlay of 10 sample traces, 3) t-test results, 4) χ2-test results.


Fig. 5. DL-LA and Sensitivity Analysis using 1 000 misaligned traces (step size 10) of an unprotected serialized PRESENT implementation. For each p value 30 epochs and a validation set of 10 000 traces are considered.

significantly affect the detection capabilities of any of the leakage assessment techniques when unprotected implementations are considered and the number of available traces is not chosen to be extremely small.


Case Study 3: (Unprotected) PRESENT, randomized Clock

Since the artificial delay in the previous case study only slightly increased the data complexity of a leakage detection we now try to test a countermeasure that leads to much more heavily misaligned and noisy traces. In particular, we randomize the clock that drives the targeted PRESENT implementation. This is done by clocking the cipher with the output of a 64-bit LFSR (a small behavioral sketch of this clocking scheme is given at the end of this case study). Hence, in each encryption (and therefore also in the power traces) the same intermediate computations are executed at different times, since the rising edges of the LFSR output occur in a random sequence. The input frequency of the LFSR was set to 24 MHz so that the number of rising edges in a certain frame of time is on average similar to being clocked by a stable 6 MHz clock. In this case the t- and χ2-tests struggle significantly more to detect leakage than in the previous experiments, as apparent in Figure 6. While the t-test requires about 2 000 traces for a detectable


Fig. 6. Univariate leakage assessment using 5 000 traces (step size 50) of a serialized PRESENT-80 implementation with clock randomization. From top to bottom: 1) Sample trace, 2) Overlay of 10 sample traces, 3) t-test results, 4) χ2-test results.

breach of side-channel security, the χ2-test barely overcomes the threshold at


all. DL-LA on the other hand is able to confidently state distinguishability after about 150 traces (cf. Figure 7). Although all three approaches suffer significantly


Fig. 7. DL-LA and Sensitivity Analysis using 5 000 misaligned traces (step size 50) of an unprotected serialized PRESENT implementation with clock randomization. For each p value 30 epochs and a validation set of 10 000 traces are considered.

from the misalignment and the added noise, DL-LA is still able to perform detection on a much smaller number of traces. Please note that, if desired by the evaluator, the confidence can be made arbitrarily larger by increasing the size of the validation set.
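For illustration, the following behavioral sketch models how clocking a cipher from an LFSR output randomizes the cycles in which its computations advance. It is not the RTL of our design: the tap positions and the seed values are illustrative assumptions.

```python
def lfsr64_step(state, taps=(64, 63, 61, 60)):
    """One step of a 64-bit Fibonacci LFSR; returns (new_state, output_bit).
    The taps are one common maximal-length choice and purely illustrative."""
    fb = 0
    for t in taps:
        fb ^= (state >> (t - 1)) & 1
    state = ((state << 1) | fb) & ((1 << 64) - 1)
    return state, state & 1

def randomized_clock_edges(n_cycles, seed):
    """Cycle indices with a rising edge of the LFSR output, i.e. the cycles in
    which a cipher clocked by this signal would advance by one step."""
    state, prev, edges = seed, 0, []
    for i in range(n_cycles):
        state, bit = lfsr64_step(state)
        if bit and not prev:
            edges.append(i)
        prev = bit
    return edges

# Different LFSR phases shift the same cipher operations to different cycles,
# which is what misaligns the recorded power traces.
print(randomized_clock_edges(40, seed=0xACE1))
print(randomized_clock_edges(40, seed=0xBEEF))
```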

Case Study 4: PRESENT Threshold Implementation, aligned Traces

In this case study we target a serialized threshold implementation (TI) [12] of the PRESENT block cipher. The architecture can be seen in Figure 8 and is similar to profile 2 introduced in [13]. The PRESENT Sbox is decomposed into

[Figure 8 diagram: state registers 1–3, key register, PLayer instances and the component functions G1–G3 and F1–F3.]

Fig. 8. Serialized PRESENT threshold implementation architecture with 3 shares and a decomposed Sbox.

two quadratic functions F and G. Both of these decompositions are split into three component functions each, according to the concepts of correctness, non-completeness and uniformity [12]. As apparent from Figure 8 the three shares in the computation of the decomposed Sbox are evaluated in parallel. Thus, no first-order, but univariate higher-order (especially second- and third-order) leakage is expected. We evaluate this assumption in Figure 9. As expected the first-order t-test does not indicate detectable leakage, but the second- and third-order tests do. Interestingly, we can confirm the statements made by the authors



Fig. 9. Univariate leakage assessment using 10 000 000 traces (step size 100 000) of a serialized PRESENT threshold implementation. From top to bottom: 1) Sample trace, 2) Overlay of 10 sample traces, 3) first-order t-test results, 4) second-order t-test results, 5) third-order t-test results, 6) χ2-test results.

of the χ2-test proposal [11] regarding the shortcomings of the moment-based nature of the t-test. Unlike the situation in the previous case studies, the χ2-test


outperforms the t-test here. While the second-order and the third-order t-test require 3 000 000 and 1 500 000 traces for the detection respectively, the χ2-test succeeds after only 500 000 traces and results in a much higher confidence overall.
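For reference, such higher-order univariate t-tests are typically computed by preprocessing each group's traces with centered (order 2) or standardized (order ≥ 3) moments per sample point and then applying Welch's t-test, following the methodology of [14]. The sketch below is one possible implementation of that preprocessing; the function name and interface are ours.

```python
import numpy as np
from scipy.stats import ttest_ind

def higher_order_ttest(traces_a, traces_b, order):
    """Welch's t-test per sample point after raising each group's traces to the
    requested order: raw samples (1), centered squares (2) or standardized
    powers (>= 3). Returns one -log10(p) value per point in time."""
    def preprocess(x):
        if order == 1:
            return x
        centered = x - x.mean(axis=0)
        if order == 2:
            return centered ** 2
        return (centered / x.std(axis=0)) ** order
    _, p = ttest_ind(preprocess(traces_a), preprocess(traces_b),
                     axis=0, equal_var=False)
    return -np.log10(p)

# traces_a / traces_b: (n_traces, n_samples) arrays for the fixed and the random
# group, e.g.: curve = higher_order_ttest(traces_a, traces_b, order=2)
```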

Please note that for this case study and the upcoming ones, where we analyze protected implementations, we change the visualization of the DL-LA results. Due to the large data sets involved it is not feasible to train many different classifiers with a steadily increasing size of the training set over many steps. Instead we visualize the −log10(p) values over the number of epochs (instead of over training traces). We do this twice, once for the minimum number of traces required by the classical assessment method (here the χ2-test with 500 000) and once for a larger training set in order to show that much larger confidence values can be achieved in the protected cases as well. These results are presented in Figure 10. In case of a training set of size 500 000, DL-LA only just


Fig. 10. DL-LA and Sensitivity Analysis using 500 000 (Fig. 10(b)) and 3 000 000 (Fig. 10(a), 10(c)) traces of a serialized PRESENT threshold implementation respectively. For each p value a validation set of 1 500 000 traces is considered.

overcomes the confidence threshold. However, barring the possibility of a false positive result (which is highly unlikely), the confidence could be increased by an evaluator either by considering more epochs or by increasing the validation set. In the case of a training set including 3 000 000 measurements, the confidence that side-channel leakage is present becomes extremely large.
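To make this per-epoch visualization concrete, the sketch below trains for one epoch at a time and converts the validation-set classification count into a −log10(p) value after every epoch. The small fully-connected model is only a stand-in, not the network architecture evaluated in this work, and the binomial helper is the one sketched earlier, re-stated here to keep the snippet self-contained.

```python
import numpy as np
import tensorflow as tf
from math import comb, log10

def neg_log10_p(correct, total):
    # Exact binomial tail for a guessing classifier (success probability 0.5),
    # as sketched earlier; for validation sets of millions of traces a normal
    # approximation is advisable for speed.
    tail = sum(comb(total, k) for k in range(correct, total + 1))
    shift = max(tail.bit_length() - 53, 0)
    return total * log10(2) - (log10(tail >> shift) + shift * log10(2))

# Placeholder network -- NOT the architecture evaluated in this work.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(5000,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

def dlla_curve(x_train, y_train, x_val, y_val, epochs=50):
    """Train one epoch at a time and record -log10(p) on the validation set."""
    curve = []
    for _ in range(epochs):
        model.fit(x_train, y_train, batch_size=256, epochs=1, verbose=0)
        predictions = np.argmax(model.predict(x_val, verbose=0), axis=1)
        curve.append(neg_log10_p(int(np.sum(predictions == y_val)), len(y_val)))
    return curve
```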

Case Study 5: PRESENT Threshold Implementation, misaligned Traces

This case study is equivalent to the previous one apart from the fact that we artificially created a misalignment of the traces, as it was already done for case study 2. As a result of this misalignment the leakage detection approaches require slightly more traces to overcome the confidence threshold than in the aligned case. In particular, as shown in Figure 11, the second-order and the third-order t-test require 3 600 000 and 1 500 000 traces for the detection respectively, while the χ2-test succeeds after only 800 000 traces and again results in a much higher confidence. The DL-LA is again the most powerful leakage detection mechanism



Fig. 11. Univariate leakage assessment using 10 000 000 misaligned traces (step size 100 000) of a serialized PRESENT threshold implementation. From top to bottom: 1) Sample trace, 2) Overlay of 10 sample traces, 3) first-order t-test results, 4) second-order t-test results, 5) third-order t-test results, 6) χ2-test results.

and succeeds for both sizes of the training set (800 000 and 3 000 000) with much higher confidence values than any of the classical approaches (cf. Figure 12).



Fig. 12. DL-LA and Sensitivity Analysis using 800 000 (Fig. 12(b)) and 3 000 000 (Fig. 12(a), 12(c)) misaligned traces of a serialized PRESENT threshold implementation respectively. For each p value a validation set of 1 500 000 traces is considered.

Case Study 6: PRESENT Multivariate Threshold Implementation, aligned Traces

In the final two case studies we concentrate on scenarios where the classical univariate detection approaches are naturally unsuited to detect leakage, namely purely multivariate higher-order leakages. These cases are the primary motivation to apply DL-LA in reality, as all currently known methods fail to capture the whole amount of present side-channel leakage in these scenarios. We provide evidence for this statement in the following.

We constructed a special version of the PRESENT threshold implementation architecture depicted in Figure 8 that does not offer univariate side-channel leakage. To this end we had to ensure that all six component functions (G1, G2, G3, F1, F2 and F3) are evaluated sequentially and not in parallel. We did this by gating their respective inputs with AND gates which are controlled by a finite state machine (FSM). In addition to that we had to make sure that none of the state registers are clocked at the same time. Thus a single Sbox computation takes 7 clock cycles in the resulting hardware design. As expected, our univariate leakage assessment using the classical detection approaches does not indicate the presence of any side-channel leakage (cf. Figure 13). However, a multivariate investigation could still find higher-order leakage if performed at the correct offsets. This requires either white-box knowledge about the implementation or must be determined by exhausting all possibilities. The result for the best offset leading to detectable multivariate leakage is illustrated in Figure 14, which shows leakage in the third order after more than 45 million traces. Please note that we have performed each multivariate second-order t-test, third-order t-test and χ2-test

with the correct offsets (as we know all the implementation details) and none of them was able to detect leakage with fewer traces.
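For reference, a multivariate higher-order t-test of this kind can be sketched as follows: the samples at the chosen time offsets are made mean-free within each group, multiplied into a single combined value per trace, and the two groups of combined values are compared with Welch's t-test. The offsets must be supplied by the evaluator, which is exactly the white-box knowledge (or exhaustive search) referred to above; the offsets in the usage comment are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

def multivariate_ttest(traces_a, traces_b, offsets):
    """Welch's t-test on the centered product of the samples at the given time
    offsets (one combined value per trace); the order of the test equals
    len(offsets). Returns -log10(p)."""
    def centered_product(x):
        sel = x[:, offsets]                          # (n_traces, len(offsets))
        return np.prod(sel - sel.mean(axis=0), axis=1)
    _, p = ttest_ind(centered_product(traces_a), centered_product(traces_b),
                     equal_var=False)
    return -np.log10(p)

# Hypothetical offsets for a third-order combination (the real offsets depend on
# the implementation under test and are not reproduced here):
# conf = multivariate_ttest(traces_fixed, traces_random, offsets=[1502, 1509, 1516])
```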

In stark contrast, as apparent in Figure 15, DL-LA provides a very high confidence level of −log10(p) > 150 for the presence of side-channel leakage after training on only 20 million traces. We still retained the same network architecture as before and made absolutely no assumptions about the leakage and required no white-box knowledge about the offset of the individual evaluation of TI shares. Due to memory and time restrictions we pruned the trace to a length of 500



Fig. 13. Univariate leakage assessment using 50 000 000 traces (step size 500 000) of a serialized multivariate PRESENT threshold implementation. From top to bottom: 1) Sample trace, 2) Overlay of 10 sample traces, 3) first-order t-test results, 4) second-order t-test results, 5) third-order t-test results, 6) χ2-test results.

sample points (1281-1780). Validation took place on 5 million traces. Leakage



Fig. 14. Multivariate third-order t-test using 50 000 000 traces (step size 500 000) of a serialized multivariate PRESENT threshold implementation.


Fig. 15. DL-LA using 20 000 000 traces of a serialized multivariate PRESENT threshold implementation. For each p value a validation set of 5 000 000 traces is considered.

becomes apparent after 25 epochs and continuously increases until our chosen threshold of 50 epochs has been reached.

Case Study 7: PRESENT Multivariate Threshold Implementation, misaligned Traces

Our final case study is a replication of the previous one, but again we misaligned the traces through bad triggering. As shown in Figure 16 no univariate detection of leakage succeeds. In this case however, even the multivariate third-order analysis with the best possible offset for leakage detection in the previous case study does not succeed (cf. Figure 17). In other words, the acquired set of traces does not allow detection of any leakage using conventional methods, at least in case the traces are not re-aligned before the analysis.

DL-LA however detects leakage with high confidence (−log10(p) > 60) after training on only half of the available traces (25 000 000). This result is depicted in Figure 18.

5 Discussion

False Positives. False positives commonly appear as a problem in classical leakage evaluations. This is due to their point-wise independent nature. A threshold of p_th = 10^-5 for each individual point will lead to an aggregation of the error probability over the length of the entire trace, thereby lowering the confidence. More formally, the likelihood that a false positive occurs at least once in a trace of length K can be described as:

P(false positive) = 1 − (1 − p_th)^K.



Fig. 16. Univariate leakage assessment using 50 000 000 misaligned traces (step size 500 000) of a serialized multivariate PRESENT threshold implementation. From top to bottom: 1) Sample trace, 2) Overlay of 10 sample traces, 3) first-order t-test results, 4) second-order t-test results, 5) third-order t-test results, 6) χ2-test results.

For the typical value of K = 5 000 in our case studies and the common threshold of p_th = 10^-5 this formula evaluates to 0.0488, i.e., an effective



Fig. 17. Multivariate third-order t-test using 50 000 000 misaligned traces (step size 500 000) of a serialized multivariate PRESENT threshold implementation.


Fig. 18. DL-LA using 25 000 000 misaligned traces of a serialized multivariate PRESENT threshold implementation. For each p value a validation set of 5 000 000 traces is considered.

false-positive probability of roughly 5% over the whole trace. Hence, a manual investigation of the individual leakage points is often necessary when performing classical leakage detection to exclude false positives.
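This aggregation is easy to check numerically; the snippet below reproduces the roughly 5% figure for the parameters used in our case studies.

```python
def trace_false_positive_rate(p_th=1e-5, trace_length=5000):
    """Probability that at least one of `trace_length` independent point-wise
    tests crosses the threshold purely by chance."""
    return 1 - (1 - p_th) ** trace_length

print(trace_false_positive_rate())   # ~0.0488, i.e. roughly 5%
```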

In contrast, our deep learning based methodology does not produce several individual univariate statistical tests, but produces a single decision metric based on the entirety of points⁶. To practically verify the resilience against false positives, we trained our network several times on randomly generated data with random group assignments. In these evaluations we never observed a p value below 10^-2. Hence, we have high confidence that false positives are far less likely to occur with our methodology if a reasonable threshold is chosen, e.g., p_th = 10^-5.

False Negatives. Any leakage evaluation methodology should primarily aim to prevent false negatives. While we cannot possibly compare our methodology to all existing attacks and most certainly do not claim the impossibility of false negatives, we want to stress the importance of adopting DL-LA as one new element besides the t- and the χ2-test in the evaluator's toolbox to severely reduce false negatives when considering multivariate information leakage.

Confidence Boosting. In a common leakage evaluation a statistical test (t-test, χ2-test) is performed on the entirety of collected traces. Thereby, two different metrics, (i) the number of traces required to extract meaningful information, and (ii) the level of confidence the evaluator wants to achieve, are tightly intertwined. More specifically, under realistic noise conditions t-test and χ2-test

6 Note that pinpointing leakage in the time dimension is still possible due to sensitivity analysis as demonstrated in our case studies.


are fundamentally unable to answer the question: Given a very high confidence threshold of p = 10^-50, can the attacker extract information given only very few traces? In stark contrast, our deep learning based methodology operates on two sets: one training set of size N and one validation set of size M. Here, N and M can be chosen independently from each other. While N represents the actual number of traces available to an attacker, M should be chosen sufficiently large to reach the desired level of statistical confidence. Note that the smallest p value achievable with a given validation set equals 0.5^M, and that under realistic noise conditions the achievable confidence will be considerably lower.
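Under a normal approximation of this binomial tail, the evaluator can also size M up front: given the validation accuracy the network is expected to reach, the smallest M that attains a target p value follows from the corresponding z-score. The sketch below illustrates this; the accuracy value is an example, not a measurement.

```python
from scipy.stats import norm

def min_validation_size(expected_accuracy, p_target):
    """Smallest validation set size M for which the expected accuracy pushes the
    p value (normal approximation of the binomial tail) below p_target."""
    z = norm.isf(p_target)                            # required z-score
    return int((z / (2 * (expected_accuracy - 0.5))) ** 2) + 1

# An expected accuracy of 0.505 needs on the order of 180 000 validation traces
# to clear the usual 10^-5 threshold, and considerably more for p = 10^-50.
print(min_validation_size(0.505, 1e-5))
print(min_validation_size(0.505, 1e-50))
```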

Sensitivity Analysis. As seen in Section 4, computing the gradient of an output component of the neural network with respect to the input values can provide an insight into the dependence of the classification result on each individual time sample. While this seems similar to the result of classical univariate hypothesis tests, which illustrate independent statistical tests on each point in time, there are some crucial differences: DL-LA learns some function of the inputs that minimizes the given loss function. This leads to two effects: (1) Points that do not contribute to leakage may still receive a non-zero component in the gradient, (2) Points that contribute to leakage, but correlate heavily with other points contributing to leakage, might not be learned, as there is no intrinsic incentive for the neural net to learn redundant information. However, all of our practical case studies show that the highest values in the Sensitivity Analysis always correspond to leakages which are also found by point-wise analyses. Yet, we want to caution against the idea that all leakage locations can be found with a single SA. Instead, the process is more iterative: After a design flaw has been identified and fixed, the DL-LA of the next design iteration might reveal new leakage locations of flaws that already persisted in the initial evaluation, but were simply not learned by the classifier.
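Such a sensitivity curve can be obtained directly from the trained network by differentiating one output score with respect to the input, e.g. with TensorFlow's GradientTape. The sketch below assumes a trained Keras model that maps a batch of traces to two group scores; the function name is ours.

```python
import numpy as np
import tensorflow as tf

def sensitivity_curve(model, traces):
    """Mean absolute gradient of one output score with respect to every input
    sample -- one sensitivity value per point in time."""
    x = tf.convert_to_tensor(traces, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x)[:, 0]           # score of one of the two groups
    grads = tape.gradient(score, x)      # shape: (n_traces, n_samples)
    return np.mean(np.abs(grads.numpy()), axis=0)
```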

Validation Accuracy in Isolation. Commonly, neural networks are applied to classification tasks in which the user is actually interested in obtaining a good classifier, e.g., obtain a network to distinguish cat pictures from dog pictures. In those cases a very high validation accuracy (0.99 + ε) is expected from a suitable neural network as each individual sample is noise free and can easily be assigned to one specific group. In contrast, when evaluating side-channel traces, especially of masked implementations, the randomized intermediate values make it impossible to precisely assign each individual sample to a group with high accuracy.⁷ Instead, the aim of the attacker can only be to distinguish different intermediate values statistically, i.e., on average. This leads to a very different expectation (compared to the image classification problem): The aim is to find a network that works better than chance (validation accuracy > 0.5) and does so consistently over a large validation set. Hence, we caution the evaluator against disregarding seemingly small values for the validation accuracy, e.g. 0.505. Instead, the size of the (perfectly balanced) validation set should always be taken into account by computing the correct p value according to the binomial distribution.
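The point is easy to quantify with the same binomial reasoning (here in a normal approximation): an accuracy of 0.505 is statistically meaningless on a validation set of 10 000 traces but overwhelming on one of 1 000 000; the set sizes are illustrative.

```python
import math
from scipy.stats import norm

def neg_log10_p_from_accuracy(accuracy, set_size):
    """-log10(p) for observing `accuracy` on a balanced validation set of
    `set_size` traces when the true success rate is 0.5 (normal approximation
    of the binomial tail)."""
    z = (accuracy - 0.5) * 2 * math.sqrt(set_size)
    return -norm.logsf(z) / math.log(10)

print(neg_log10_p_from_accuracy(0.505, 10_000))     # ~0.8  (below the threshold of 5)
print(neg_log10_p_from_accuracy(0.505, 1_000_000))  # ~23   (well above the threshold)
```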

Test and Validation Set. In deep learning, there is a common distinction between the validation set, used as a feedback mechanism to adjust the hyperparameters,

7 In fact, if we find a neural network with validation accuracy equal to 1.0 an attacker would most likely be able to not only succeed with DPA, but mount a successful Simple Power Analysis (SPA).


and the test set, another completely independent set that is used to assess the accuracy of the final network. This approach is used to prevent implicit information leakage from the validation set into the trained model (through the adjustment of hyperparameters). For our case studies this distinction is not needed, because we performed all evaluations on networks with identical hyperparameters and the chosen network architecture is not special by any means, e.g., we verified that (small) variations in the size of layers will lead to comparable results.

Misalignment. As seen in our case studies, DL-LA is resilient against slight misalignment through bad triggering. However, this robustness is shared with the t- and χ2-test. When operating on severely misaligned traces due to clock randomization both DL-LA and the classical tests lose orders of magnitude of confidence compared to an aligned evaluation. Fortunately, this can be partially offset by increasing the validation set to perform Confidence Boosting. While we performed initial experiments with CNNs, which indeed provide more resilience against misalignment, we chose not to include them into our network suggestion as the usage of convolutions severely hindered the detection of higher-order (univariate or multivariate) leakages. We leave the design of a neural network combining convolutions and fully-connected layers to resist misalignment while detecting higher-order leakages as future work.

Availability. A sample implementation of our evaluation methodology based on Keras and TensorFlow, including an arbitrary precision computation of the classifier, will be made freely available upon publication.

6 Conclusion

We introduced Deep Learning Leakage Assessment (DL-LA), a novel methodology to evaluate leakages in an intrinsically multivariate and horizontal way by training a classifier based on deep neural networks. In the case of univariate leakages our method outperforms the non-specific t-test as well as the χ2-test in several case studies: We detect leakages with fewer traces and with a confidence orders of magnitude higher compared to classical leakage assessment. However, as we cannot provide an indication suggesting the general superiority of our approach over the classical hypothesis tests in these scenarios, we suggest DL-LA as a complement to the t-test and χ2-test and not a replacement thereof.

In the case of multivariate leakage, DL-LA effortlessly learns an accurate classifier, while multivariate extensions of the t- and χ2-test require (i) exhaustive search over all time offsets or (ii) expert-level domain knowledge to choose the correct offset. Most importantly, we demonstrate a case study in which the classical hypothesis tests cannot detect any leakage despite having white-box knowledge about the underlying implementation, while DL-LA indicates the insecurity with overwhelming confidence in a black-box setting, requiring only a part of the available trace.

Our method unifies horizontal and vertical side-channel attacks, is simple to use, broadly applicable and produces results with high statistical confidence. It should be adopted as a valuable addition to the evaluator's toolbox to severely reduce false negatives in multivariate and horizontal settings.


Acknowledgments

The work described in this paper has been supported in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2092 CASA - 390781972 and through the project 271752544 "NaSCA: Nano-Scale Side-Channel Analysis".

References

1. Side-channel AttacK User Reference Architecture. http://satoh.cs.uec.ac.jp/SAKURA/index.html

2. Bogdanov, A., Knudsen, L.R., Leander, G., Paar, C., Poschmann, A., Robshaw, M.J.B., Seurin, Y., Vikkelsoe, C.: PRESENT: an ultra-lightweight block cipher. In: Paillier, P., Verbauwhede, I. (eds.) Cryptographic Hardware and Embedded Systems - CHES 2007, 9th International Workshop, Vienna, Austria, September 10-13, 2007, Proceedings. Lecture Notes in Computer Science, vol. 4727, pp. 450–466. Springer (2007), https://doi.org/10.1007/978-3-540-74735-2_31

3. Cagli, E., Dumas, C., Prouff, E.: Convolutional neural networks with data augmentation against jitter-based countermeasures - profiling attacks without pre-processing. In: Fischer, W., Homma, N. (eds.) Cryptographic Hardware and Embedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan, September 25-28, 2017, Proceedings. Lecture Notes in Computer Science, vol. 10529, pp. 45–68. Springer (2017), https://doi.org/10.1007/978-3-319-66787-4_3

4. Durvaux, F., Standaert, F.: From improved leakage detection to the detection of points of interests in leakage traces. In: Fischlin, M., Coron, J. (eds.) Advances in Cryptology - EUROCRYPT 2016 - 35th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Vienna, Austria, May 8-12, 2016, Proceedings, Part I. Lecture Notes in Computer Science, vol. 9665, pp. 240–262. Springer (2016), https://doi.org/10.1007/978-3-662-49890-3_10

5. Goodwill, G., Jun, B., Jaffe, J., Rohatgi, P.: A testing methodology for side channel resistance validation. In: NIST non-invasive attack testing workshop (2011)

6. Hospodar, G., Gierlichs, B., Mulder, E.D., Verbauwhede, I., Vandewalle, J.: Machine learning in side-channel analysis: a first study. J. Cryptographic Engineering 1(4), 293–302 (2011), https://doi.org/10.1007/s13389-011-0023-x

7. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)

8. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M.J. (ed.) Advances in Cryptology - CRYPTO '99, 19th Annual International Cryptology Conference, Santa Barbara, California, USA, August 15-19, 1999, Proceedings. Lecture Notes in Computer Science, vol. 1666, pp. 388–397. Springer (1999), https://doi.org/10.1007/3-540-48405-1

9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)

10. Masure, L., Dumas, C., Prouff, E.: A comprehensive study of deep learning for side-channel analysis. Cryptology ePrint Archive, Report 2019/439 (2019)

11. Moradi, A., Richter, B., Schneider, T., Standaert, F.: Leakage detection with the χ2-test. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018(1), 209–237 (2018), https://doi.org/10.13154/tches.v2018.i1.209-237

12. Nikova, S., Rechberger, C., Rijmen, V.: Threshold implementations against side-channel attacks and glitches. In: Ning, P., Qing, S., Li, N. (eds.) Information and Communications Security, 8th International Conference, ICICS 2006, Raleigh, NC, USA, December 2006, Proceedings. Lecture Notes in Computer Science, vol. 4307, pp. 529–545. Springer (2006)


13. Poschmann, A., Moradi, A., Khoo, K., Lim, C., Wang, H., Ling, S.: Side-channel resistant crypto for less than 2,300 GE. J. Cryptology 24(2), 322–345 (2011)

14. Schneider, T., Moradi, A.: Leakage Assessment Methodology - A Clear Roadmap for Side-Channel Evaluations. In: CHES 2015. Lecture Notes in Computer Science, vol. 9293, pp. 495–513. Springer (2015)

15. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)

16. Standaert, F.: How (not) to use Welch's t-test in side-channel security evaluations. In: Smart Card Research and Advanced Applications - 16th International Conference, CARDIS 2018, Montpellier, France, November 2018 (2018)

17. Timon, B.: Non-profiled deep learning-based side-channel attacks with sensitivity analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2019(2), 107–131 (2019)

A Appendix


Fig. 19. DL-LA results targeting an unprotected serialized PRESENT-80 implementation, using (up to) half the traces as training set and half the traces as validation set. From top to bottom: 1) aligned traces, 2) misaligned traces, 3) randomized clock. For 1) and 2) the training set ranges from 10 to 500 traces in steps of 10, while the validation set is 500 traces large. For 3) the training set ranges from 50 to 2 500 traces in steps of 50, while the validation set is 2 500 traces large.