arxiv:2007.06786v1 [cs.cv] 14 jul 2020

Meta-rPPG: Remote Heart Rate EstimationUsing a Transductive Meta-Learner

Eugene Lee Evan Chen Chen-Yi Lee

Institute of Electronics, National Chiao Tung University{eugenelet.ee06g,evanchen.ee06}@nctu.edu.tw, [email protected]

Abstract. Remote heart rate estimation is the measurement of heartrate without any physical contact with the subject and is accomplishedusing remote photoplethysmography (rPPG) in this work. rPPG sig-nals are usually collected using a video camera with a limitation of be-ing sensitive to multiple contributing factors, e.g. variation in skin tone,lighting condition and facial structure. End-to-end supervised learningapproach performs well when training data is abundant, covering a dis-tribution that doesn’t deviate too much from the distribution of testingdata or during deployment. To cope with the unforeseeable distributionalchanges during deployment, we propose a transductive meta-learner thattakes unlabeled samples during testing (deployment) for a self-supervisedweight adjustment (also known as transductive inference), providing fastadaptation to the distributional changes. Using this approach, we achievestate-of-the-art performance on MAHNOB-HCI and UBFC-rPPG.

Keywords: Remote heart rate estimation, rPPG, meta-learning, trans-ductive inference

1 Introduction

Remote photoplethysmography (rPPG) is useful in situations where conven-tional approaches for heart rate estimation like electrocardiogram (ECG) andphotoplethysmogram (PPG) that requires physical contact with the subject isinfeasible. The acquisition of rPPG signal is useful for the estimation of phys-iological signal like heart rate (HR) and heart rate variation (HRV) which areimportant parameters for remote health-care. rPPG is usually obtained using avideo camera and a growing number of studies have used rPPG for HR estima-tion [40,39,1,10,24,52,55].

The adoption of deep learning techniques for the measurement of rPPG fromface images is not novel and is supported by numerous studies [8,61,35,60]. Allof the prior work involving deep learning uses an end-to-end supervised learningapproach where a global model is deployed during inference (to the best of ourknowledge), also known as inductive inference. It has been pointed out by therecent advances in deep learning that such approach doesn’t perform well whenthere are changes in the modeled distribution when moving from training datasetto the real world [2,15]. The unpredictability of the changes in environment and

arX

iv:2

007.

0678

6v1

[cs

.CV

] 1

4 Ju

l 202

0

2 Lee et al.

test subjects (skin-tone, facial structure, etc.) hinders the performance of remoteheart rate estimation. To cope with the dynamical changes of the environmentand subjects, we propose a transductive meta-learner that is able to performfast adaption during deployment. Our algorithm introduces a warm-start timeframe (2 seconds) for adaptation (weight update) to cope with the distributionalchanges, resulting in better performance during remote heart rate estimation.

Current advances in meta-learning [14,33,44] have shed light on the tech-niques of designing and training a deep neural network (DNN) that is able toadapt to new tasks through minimal weight changes. As prior work in meta-learning is built on well-defined tasks consisting of the classification of a fewlabeled examples (shots) provided as support set during test time for fast adap-tation, it can’t be directly applied to our context as labeled data are unobtainableduring deployment. In our context, adaptation has to be done in a self-supervisedfashion, hence we incorporate transductive inference techniques [18,26] into thedesign of our learning algorithm. The application of meta-learning and trans-ductive inference to rPPG estimation is not trivial since we have to considerthe modeling of both spatial and temporal information in the formulation of ourmeta-learner. In our work, we split our DNN into two parts: a feature extrac-tor and a rPPG estimator modeled by a convolutional neural network (CNN)and long short-term memory (LSTM) [17] network respectively. We introduce asynthetic gradient generator modeled by a shallow Hourglass network [32] alongwith a novel prototypical distance minimizer to perform transductive inferencewhen labeled data are not available during deployment. Intuitively, the varia-tion in distribution is caused by the visual aspect of the incoming signal, henceadaptation (weight changes) is only performed on the feature extractor while therPPG estimator’s parameters will be frozen during deployment. In summary, ourmain contributions are as follows:

1. In Section 3, we propose a meta-learning framework that exploits spatiotem-poral information for remote heart rate estimation, specifically designed forfast adaptation to new video sequences during deployment.

2. In Section 3.2, we introduce a transductive inference method that makes useof unlabeled data for fast adaptation during deployment using a syntheticgradient generator and a novel prototypical distance minimizer.

3. In Section 3.3, we propose the formulation of rPPG estimation as an ordinalregression problem to cope with the mismatch in temporal domain of bothvisual input and collected PPG data, as studied in [11].

4. In Section 4, we validate our proposed methods empirically on two publicface heart rate dataset (MAHNOB-HCI [48] and UBFC-rPPG [3]), trainedusing our collected dataset. Our experimental results demonstrate that ourapproach outperforms existing methods on remote heart rate estimation.

2 Related Work

Remote Heart Rate Estimation. The study of the measurement of heartrate using a video camera is not novel and has been supported by existing works.

Remote Heart Rate Estimation Using a Transductive Meta-Learner 3

The remote measurement of heart rate is accomplished through the calculationof the cardiac pulse from the extracted rPPG signal which embeds the bloodvolume pulse (BVP). rPPG signal is extracted from the visible spectrum of lightacquired by a video camera [7]. Contact PPG sensors found in wearable devicesand medical equipment estimates the heart rate by acquiring the light reflectedfrom the skin using a light-emitting diode (LED) as its source. The constituentsof the reflected light is composed of the amount of light absorbed by the skin(constant) and a time varying pulse signal contributed by the absorption of lightby blood capillaries beneath the skin. Since existing contact PPG methods arenon-intrusive, light sources with specific wavelengths are used, e.g. green light(525 nm) and infrared light (880 nm), having the properties of good absorptionby blood and has minimal overlap with the visible light spectrum [27].

The feasibility of remote heart rate estimation was first proven by Verkruysseet al. [53], showing that PPG signals can be extracted from videos collectedunder an ambient light setting using webcams. To deal with noise, Poh et al.[40] performed blind source separation on webcam filmed videos to extract pulsesignal. A study is done in [51] revolving the comparison of the absorptivity ofindependent channels from the RGB channels by the human skin. It is concludedthat the green channel is easily absorbed by the human skin, giving high signal-to-noise ratio for PPG signal acquisition. Based on the results in [51], Li etal. [24] used only the green channel for rPPG measurement and introduced anadaptive filtering technique (normalized least mean square) to eliminate motionartifact during deployment.

As it is hypothesized that all three channels (RGB) contain considerable in-formation for the estimation of the rPPG signals, different channels are weightedand summed to receive better results when compared to using only a single chan-nel (green) in [30]. CHROM [10] estimates the rPPG by using a method thatlinearly combines the RGB channels using the knowledge of a skin reflectionmodel to separate pulse signal from motion-induced distortion. The POS [55]method and the SB [56] model used the same skin reflection model as CHROM,but made a different assumption on the distortion model where another projec-tion direction is used to separate the pulse and the distortions. All the mentionedmethods are based on the calculation of the spatial mean of the entire face, as-suming the contribution of each pixel to the estimation of rPPG is equal. Suchassumption is not resilient to noise, rendering it infeasible in extreme conditions.

Recently, deep learning approaches have been introduced for rPPG signalestimation. DeepPhys [8] is an end-to-end supervised method implemented usinga feed-forward CNN. Instead of using a Long Short-Term Memory (LSTM) cellto model the temporal information, an attention mechanism is used to learn thedifference between frames. rPPGNet [61] considers the possibility of differenttypes of video compression affecting the performance of rPPG estimation andproposed a network that handles both video quality reconstruction and rPPGsignal estimation trained in an end-to-end fashion. RhythmNet [35] trained aCNN-RNN model based on a training set that contains diverse illuminationand pose to enhance performance on public dataset. PhysNet [60] constructed

4 Lee et al.

both a 3DCNN-based network and a RNN-based network and compared theirperformance.

Meta-Learning. The motivation for introducing meta-learning to our rPPGestimation framework is to perform fast adaptation of weights when our networkis deployed in a setting that is not covered by our training distribution. Mostof the studies revolving meta-learning are in few-shot classification, which havewell-defined training and testing tasks. The structure of such well-defined tasksis exploited in [29,45,54] where the structure of the network is designed to takeboth training and test samples into consideration during inference. Gradient-based few-shot learning has also been proposed in [14,33,41] with a limitationthat the network capacity has to be small to prevent overfitting due to the smallnumber of training samples. Few-shot learning based on the metric space has alsobeen studied in [21,54,47]. Meta-learning has also been applied to tasks beyondfew-shot classification [12,59,58], showing convincing results.

In our work, the parameters of our network is divided into two parts whereone is responsible for fast adaptation and the other only responsible for estima-tion. Similar update methodology has been introduced in [31,16,18,44,62]. Allthe proposed methods have different weight update orders and network construc-tion but they have the consensus on the effectiveness of maintaining two sets ofweights in a few-shot learning setting.

3 Methodology

Given an input image from the camera, a face is first detected followed by thedetection of facial landmarks. Pixel values within the landmarks are retainedwhile the rest are filled with 0. The face image is then cropped and reshapedinto a K ×K image. We define the i-th video stream and PPG data containinga single subject as x(i) and t(i) respectively (we will drop the superscript iwhenever the context involves a single video stream for brevity) where x(i) andt(i) are sampled from a distribution of tasks (x(i), t(i)) ∼ p(T ). Continuous Tframes of face images are aggregated into a sequence, giving us the input dataof our network, xt:t+T ∈ RK×K×T . The face sequences are paired with the PPGsignal tt:t+T ∈ RT collected using a PPG sensor placed beneath the index finger,where each sample tt is temporally aligned to each frame of the face sequencext during the collection process. The output of our network is a rPPG signaly ∈ RT which is an estimation of the PPG signal having a small temporal offsetcaused by the carotid-radial pulse wave velocity (PWV) [11].

The training of our network is different from typical training practice foundin an end-to-end supervised learning setting. We cast the learning of our networkinto a few-shot learning setting, involving both support set S and query set Q.In a few-shot learning setting, S and Q are sampled from a large pool of tasksdistribution {Sn,Qn} ∼ T where the sampled Sn and Qn share the same setof classes but have disjoint samples for a classification problem. Our learningsetting differs from the existing few-shot classification setting in two ways: 1. we


FeatureExtractor

(�)��

�

�

rPPGEstimator

(�)ℎ�

BidirectionalLSTM

�proto

�

�train

�

ConvolutionalEncoder

SyntheticGradientGenerator

(�)��

=� 1

� ∑�=1

�

��

( , )∇�PROTO � �proto

(�)��

(�, �)∇�ORD�

�

OrdinalRegression

Loss

TransductiveGradient

InformationFlow

DistanceMeasure

HourglassNetwork

(�, �)ORD

�

Fig. 1: Overview of our system for transductive inference. Three modules: featureextractor, rPPG estimator and synthetic gradient generator are found in oursystem. During inference, only the feature extractor and rPPG estimator will beused. Only the parameters of the feature extractor, θ, will be updated duringtransductive learning. Gradients used for the update of θ are shown in dashedlines where the blue dashed line is only present during the training phase ofmeta-learning. zproto is generated using the training set, xtrain.

are solving a regression problem, 2. our input is a sequence of images instead ofstill images that are distributed independent and identically.

In comparison to the typical few-shot learning setting, instead of sampling aset of samples originating from disjoint classes, we sample a sequence of image,xt:t+T , from disjoint video streams, e.g. videos containing different subject orbackground. During each sampling process, we sample N independent videostreams from our collected pool of video streams also known as distribution oftasks, p(T ). For each video stream, we split it into shorter sequences, where eachsequence is further split into V and W frames. In correspondence to the few-shotlearning setting, V is the support set S used for adaptation and W is the queryset Q used to evaluate the performance of the model.

3.1 Network Architecture

Our meta-learning framework for remote heart rate estimation consists of threemodules: convolutional encoder, rPPG estimator and synthetic gradient gener-ator. To infer the rPPG signal, only the convolutional encoder and rPPG esti-mator will be used whereas the synthetic gradient generator will only be usedduring transductive learning. Our network is designed to exploit spatiotemporalinformation by first modeling visual information using a convolutional encoderfollowed by the modeling of the estimation of PPG signal using a LSTM rPPGestimator. We name our approach Meta-rPPG where detailed configuration of

6 Lee et al.

our architecture is shown in Table 1. An overview of our proposed framework isshown in Fig. 1.

Convolutional Encoder. To extract latent features from a stream of images,we use a feature extractor modeled by a CNN, fθ(·). We use a ResNet-alikestructure as the backbone of our convolutional encoder. Given an input streamof T frames, the convolutional encoder is shared across frames:

pθ(zi|xi) = fθ(xi), (1)

giving us T independent distribution of latent features, {pθ(zi|xi)}t+Ti=t , that willbe fed to a rPPG estimator to model the temporal information for the estimationof rPPG signal.

rPPG Estimator. The latent features extracted from the input image streamare then passed to rPPG estimator modeled by a LSTM-MLP module, hφ(·).The intuition behind the split is to separate the parameters responsible for visualmodeling from the parameters responsible for the estimation of rPPG signal,accomplished via the temporal modeling of the latent features:

pφ(y|zt, zt+1, ...,zt+T ) = hφ({pθ(zi|xi)}t+Ti=t ). (2)

The LSTM module is designed to model the temporal information of a fixedsequence of T latent features. The output of each step of the LSTM is followedby a MLP module which is responsible for the estimation of rPPG signal. Aswe model the estimation of rPPG signal as an ordinal regression task, we havea multitask (S tasks) output. Details are deferred to Section 3.3.

Synthetic Gradient Generator. For the fast adaptation of the parametersof our model during deployment, we introduce a synthetic gradient generatormodeled by a shallow Hourglass network [32], gψ(·). The idea of using a syn-thetic gradient generator for transductive inference is not novel as it was firstintroduced in [19] to parallelize the backpropagation of gradient and to augmentbackpropagation-through-time (BPTT) of long sequences found in LSTM. It isthen applied to a few-shot learning framework in [18] to generate gradient forunlabeled samples, giving a significant boost in performance. Our synthetic gra-dient generator attempts to model the gradient at z backpropagated from theordinal regression loss from the output, LORD(y, t) (defined in (13)):

gψ(z) = ∇zLORD(y, t). (3)

Our synthetic gradient generator will be used during transductive inference forthe fast adaptation to new video sequences and at the adaptation phase duringtraining, using the support set S.


Table 1: Conv2DBlocks are composed of Conv2D, Batchnorm, average pooling,and ReLU. Conv1DBlocks are composed of Conv1D, Batchnorm and ReLU.Shortcut connections are added between Conv2DBlocks for the ConvolutionalEncoder. The synthetic gradient generator is designed as a Hourglass network.X indicates the information the layer acts upon. Output size is defined asT×Channels×K ×K for Convolutional Encoder and T×Channels for the rest.

Module Layer Output Size Kernel Size Spatial Temporal

ConvolutionalEncoder

Conv2DBlock 60×32×32×32 3×3 XConv2DBlock 60×48×16×16 3×3 XConv2DBlock 60×64×8×8 3×3 XConv2DBlock 60×80×4×4 3×3 XConv2DBlock 60×120×2×2 3×3 X

AvgPool 60×120 2×2 X

rPPGEstimator

Bidirectional LSTM 60×120 - X XLinear 60×80 - XOrdinal 60×40 - X

SyntheticGradientGenerator

Conv1DBlock 40×120 3×3 X XConv1DBlock 20×120 3×3 X XConv1DBlock 40×120 3×3 X XConv1DBlock 60×120 3×3 X X

3.2 Transductive Meta-Learning

During the deployment of Meta-rPPG, we have to consider the possibility ofobserving input samples that are not modeled by our model, also known asout-of-distribution samples. A possible solution is through the introduction ofa pre-processing step that attempts to project the input data into a commondistribution that is covered by our model. The consideration of an infinitely largedistribution containing all possible scenarios during deployment is near impos-sible for any pre-processing technique. To cope with the shift in distribution, wepropose two methods for transductive inference which provides gradient to ourconvolutional encoder fθ(·) during deployment. The first method is through thegeneration of synthetic gradients using a generator that is modeled during train-ing. The second method attempts to minimize prototypical distance betweendifferent tasks which is based on the hypothesis that the rPPG estimator hφ(·)is only responsible for the estimation of rPPG signal. The modeling of the esti-mation of rPPG signal should not be affected by the visual input and is limitedto a constrained distribution.

Generating Synthetic Gradient. To generate gradients for weight updateof our model during inference, we introduce a synthetic gradient generator thatmodels the backpropagated gradient from the final output that uses an ordinalloss LORD(y, t) from (13), up to the output of the feature extractor z, which

8 Lee et al.

�(1)

�(2)

�(3)

�proto

(a) Prototypical distance minimization.

rPPGEstimator

(�)ℎ�In-Distribution

Out-of-Distribution

BidirectionalLSTM

(b) Distribution of latent space.

Fig. 2: (a) shows the high-level concept on the application of prototypical dis-tance minimization on the latent space. (b) shows that the rPPG estimator onlyperforms well on the in-distribution samples covered by the training datasetwhile performance on out-of-distribution samples is sub-optimal. In-distributionsamples are outlined in green while out-of-distribution samples are outlined inred. The prototypical distance minimizer generates gradient that forces the out-of-distribution samples towards the center of the in-distribution samples.

can be simply put as ∇zLORD(y, t). As our synthetic gradient generator gψ(z)attempts to model ∇zLORD(y, t), we can observe its role in the chain-rule ofgradient update of the parameters of the feature extractor, θ:

θ = θ − α∂LORD(y, t)

∂z

∂z

∂θ(4)

= θ − αgψ(z)∂z

∂θ. (5)

During the learning phase of training, we can update the weights of our syntheticgradient generator by minimizing the following objective function:

LSYN(gψ(z),∇zLORD(y, t)) = ||gψ(z)−∇zLORD(y, t)||22, (6)

where the weight update of ψ is given as:

ψ = ψ − η∇ψLSYN(gψ(z),∇zLORD(y, t)). (7)

Minimizing Prototypical Distance. As rPPG estimation is based on a visualinput, there’s no guarantee that the samples collected for training is consistentwith the samples used during testing due to the broad distribution in the visualspace. In a statistical viewpoint, data provided by the training set modeled byour network can be viewed as in-distribution samples whereas data collectedunder a different setting, e.g. lighting condition, subject and camera settings,are considered as out-of-distribution samples. It is not surprising that deep neu-ral networks doesn’t perform well on out-of-distribution samples as studied in[25,42]. To overcome this limitation, we propose a prototypical distance mini-mization technique that can be applied on video sequences having a property


where the statistical information that needs to be modeled by a neural networkdoesn’t vary too much over time. We show a conceptual illustration in Fig. 2.

First, we consider each video sequences as separate task that needs to bemodeled. We define the prototype of the latent variable of a specific task as:

z(i) =1

T

T∑t=1

pθ(z(i)t |x

(i)t ). (8)

Here, T consecutive samples from task or video i is sampled and the averageis taken across the latent variable generated by the convolutional decoder from(1). We then obtain our first global latent variable prototype as:

zproto = Ex(i)∼p(T )

1

T

T∑t=1

pθ(z(i)t |x

(i)t ) (9)

= Ex(i)∼p(T )

1

T

T∑t=1

fθ(x(i)t ). (10)

As mentioned earlier, we perform Monte Carlo sampling of N tasks or videosfor every training iteration, hence we are unable to obtain the statistical meanof the entire dataset in one shot. A more computationally feasible approach isto update our global latent variable prototype in an iterative fashion as:

zproto = γzproto + (1− γ)Ex(i)∼p(T )

1

T

T∑t=1

fθ(x(i)t ), (11)

which can be understood as the weighted average between the old term andthe newly sampled global latent variable prototype via the introduction of thehyperparameter γ. The global prototype in (11) is obtained during the learningphase of training and transductive gradient is generated by minimizing the loss:

minθLPROTO(z, zproto) = min

θEx(i)∼p(T )

1

T

T∑t=1

||pθ(z(i)t |x

(i)t )− zproto||22. (12)

Training of Meta-Learner. A meta-learner will perform well during testing ifthe setting during testing is similar to the setting during training. As mentionedearlier, we split the N tasks samples from all our collected video into sequencesthat are further split into V and W frames. We use V for the update of θ phrasedas the adaptation phase and W for the update of φ, ψ, θ and zproto phrased asthe learning phase. In general, the split of V and W is put as W > V andW ∩ V = ∅ with the intuition that we attempt to minimize the frames requiredfor adaptation and perform learning on the adapted space or distribution.

The role of the adaptation phase is to train our convolutional encoder fθ(·) tomap input image sequences to a representation or latent space that will performwell when fed to our rPPG estimator hφ(·). During training, gradients from three

10 Lee et al.

sources will be used in the adaptation phase: synthetic gradient generator, pro-totypical distance minimizer and from the ordinal regression loss. During testingor deployment, we don’t have any labeled data available, hence gradient fromthe ordinal regression loss is unattainable. Instead, we use gradients from oursynthetic gradient generator and prototypical distance minimizer for adaptation.Note that the adaptation phase is run for L steps on the same V frames.

The role of the learning phase is to force our model to learn a suitable latentrepresentation for rPPG signal estimation based on its image domain correspon-dence. Supervised learning is required in the learning phase, hence this phasewill only be present during training. Here, θ and φ which corresponds to the con-volutional encoder and rPPG estimator will be trained in an end-to-end fashionwhereas the parameters of the synthetic gradient generator ψ will be updatedusing the gradient backpropagated from the ordinal regression loss. ψ is updatedby minimizing the synthetic loss given in (6).

Before the inclusion of the adaptation phase during training, we first pre-train our network under the learning phase for R epochs. The reason is that thegradient backpropagated to the synthetic gradient generator and the prototypeused for the prototypical minimizer rely on the task at hand (rPPG estimation)rather than using gradient and prototype based on a set of random weights,which could lead to unstability. We summarize the training of our meta-learnerin Algorithm 1.

3.3 Posing rPPG Estimation as an Ordinal Regression Task

Ordinal regression is commonly used in task that requires the prediction of la-bels that contains ordering information, e.g. age estimation [36,6], progressionof various diseases [13,57,50,46], text message advertising [43] and various rec-ommender systems [37]. The motivation of using ordinal regression in our workis because there’s an ordering of rPPG value that can be exploited and therewill always be a temporal and magnitude discrepancy between rPPG and PPGsignal as they originate from different parts of the human body [11].

To cast the estimation of rPPG signal as an ordinal regression problem, wefirst normalize a segment of PPG signal to be within 0 and 1. We then quantizeit uniformly into 40 segments where each segment represents a rank. Using thePPG segment tt:t+T as an example, the quantized t-th sample will be categorizedinto one of the S segments {τ1 ≺ ... ≺ τS}. With a slight abuse of notations,if the categorized value for the t-th sample falls in the s-th segment, we denoteit as tt,s = 1{tt > τs} to keep our formulation concise. 1{·} is an indicatorfunction that returns 1 if the inner condition is true and 0 otherwise. As in [35],our rPPG estimator will have to solve S binary classification problem given as:

LORD(y, t) = − 1

T

T∑t=1

S∑s=1

tt,s log(pφ(yt,s|zt:t+T ))+(1−tt,s) log(1−pφ(yt,s|zt:t+T )).

(13)Note that in (13), the same notational abuse is applied to the output of our rPPGestimator, y. During inference, we can obtain T rPPG samples corresponding to


Algorithm 1 Training of Meta-Learner

1: Input: p(T ): distribution of tasks2: for i← 1, R do . Pre-train network in an end-to-end fashion for R epochs3: Sample batch of tasks Ti ∼ p(T )4: for (x, t) ∼ Ti do5: Update θ and φ by minimizing LORD(y, t) from (13)6: end for7: end for8: while not done do . Begin transductive meta-learning9: Sample batch of tasks Ti ∼ p(T )

10: for (x, t) ∼ Ti do11: {x, t}, {x, t} ← x, t . Split into V and W consecutive frames12: for i← 1, L do . Adaptation phase (run L steps)13: θ ← θ − α(∇θLproto(z, zproto) +∇θLORD(y, t) + fψ(z))14: end for15: ψ = ψ − η∇ψLSYN(fψ(z),∇zLORD(y, t)) . Learning phase16: θ = θ − η∇θLORD(y, t)17: φ = φ− η∇φLORD(y, t)

18: zproto = γzproto + (1− γ)Ex(i)∼x1T

∑Tt=1 fθ(x

(i)t )

19: end for20: end while

T consecutive frames, xt:t+T , fed to our model, giving us:

yt:t+T = hφ(zt:t+T ) ={ S∑s=1

1{pφ(yi,s = 1|zt:t+T ) > 0.5}}Ti=t. (14)

4 Experiments

We show the efficacy of our proposed method by performing empirical simula-tion MAHNOB-HCI [48] and UBFC-rPPG [3] using a model trained using ourcollected dataset. We also show how transductive inference helps in adapting tounseen datasets using a model trained using our own collected training dataset.Source code is available at https://github.com/eugenelet/Meta-rPPG.

4.1 Dataset and Experimental Settings

MAHNOB-HCI dataset [48] is recorded at 61 fps using a resolution of780×580 and includes 527 videos from 27 subjects. Since our training dataset iscollected at 30 fps, we downsample the videos to 30 fps by getting rid half of thevideo frames. To generate ground truth heart rate (HR) for evaluation, we usethe EXG1 signals containing ECG signal. Peaks of ECG signal are found usingscipy.signal.find peaks and distance in samples between peaks is usedfor the calculation of HR. To compare fairly with previous works [8,34,49,60][3, 12, 18], we follow the same routine in their works by using 30 seconds clip(frames 306 to 2135) of each video.

https://github.com/eugenelet/Meta-rPPG

12 Lee et al.

0 5 10 15 20 25 30

Number of Adaptation Steps, L

4

6

8

10

12

14

MA

E

MAHNOB-HCI

CHROM

baseline

inductive

proto

synth

proto + synth

0 5 10 15 20 25 30


6

7

8

9

10

11

12

13

MA

E

UBFC-rPPG

CHROM

baseline

inductive

proto

synth

proto + synth

Fig. 3: MAE obtained using different rPPG estimation methods. Demonstrateshow the number of adaptation steps, L, affects performance.

UBFC-rPPG dataset [3] is relatively new and contains 38 uncompressedvideos along with finger oximeter signals. Ground truth heart rate is provided,hence it can directly be used for evaluation without any additional processing.This dataset has a wider range of HR collected induced by making participantsplay a time-sensitive mathematical game that supposedly raises their HR. Diffusereflections was created by introducing ambient light in the experiment.

Collected Dataset. For the training of our model, we collected our owndataset. A RGB video camera is used for the collection of video at 30 fps usinga resolution of 480×640. PPG signals are collected using our self-designed de-vice [22] where raw data can be sent to another host device via a UART port.Collection of video and PPG signal are synced by connecting both devices to aNvidia Jetson TX2. Since our PPG sensor device collects PPG signal at 100 Hz,we downsample it to 30 Hz to match the visual stream. Video and PPG samplesof length over 2 minutes are collected and are cropped to 2 minutes to matchour proposed meta-learning framework. A total of 19 videos are collected where18 are used for training and 1 for validation.

Experimental Settings. Facial videos and corresponding PPG signals aresynchronized before training. For each video clip, we use the face detector foundin dlib that uses [9] cascaded with a facial landmark detector implementedusing [20]. To keep the results of our face detector consistent, we use a medianflow tracker provided by OpenCV [5]. Pixels within the landmarks are retainedand 5 pixels are used for zero-padding beyond the outermost pixels. The resultingface images are then resized to 64 × 64. All experiments and network trainingare done on a single Nvidia GTX1080Ti using PyTorch [38]. The SGD optimizeris used. We set the learning rate, η = 10−3, the adaptive learning rate, α = 10−5

and the prototype update weight, γ = 0.8. We train all the models for 20 epochs.


Performance Metrics. For the evaluation of public datasets, we report theerror’s standard deviation (SD), mean absolute error (MAE), root mean squareerror (RMSE) and Pearson’s correlation coefficient (R).

4.2 Evaluation on MAHNOB-HCI and UBFC-rPPG

We evaluate 5 different configurations as part of the ablation study of the meth-ods we introduced. The first configuration is the training of our proposed archi-tecture in an end-to-end fashion and perform inductive inference on the test set,End-to-end (baseline). The second configuration is to train our network usingall of our proposed methods but doesn’t perform adaptation during inference,Meta-rPPG (inductive). The third and fourth configuration perform transduc-tive inference be using either prototypical distance minimizer, Meta-rPPG (protoonly), or synthetic gradient generator, Meta-rPPG (synth only), respectively.The final configuration uses both proposed method for transductive inference,Meta-rPPG (proto+synth).

Results of average HR measurement for MAHNOB-HCI and UBFC-rPPGare shown in Table 2 and 3 respectively. We also show MAE results by thevarying number of adaptation steps, L in Fig. 3. Since we use L = 10 duringtraining, setting L = 10 during evaluation also gives us the best results, whichagrees with the common practice used in meta-learning [14,33]. Note that ourproposed architecture used for end-to-end (baseline) training is similar to [60]with a difference that we are using only 18 videos of length 2 minutes for trainingwhereas [60] uses OBF dataset [23] that contains 212 videos of length 5 minutes.Considering the difference in magnitude of dataset size, it’s understandable thattraining end-to-end using our network doesn’t perform as well as in [60]. Thisalso indicates that our performance can be further improved just by collectingmore data. More experimental results are shown in the Supplementary Materials.

Table 2: Results of average HR measurement on MAHNOB-HCI.

MethodHR (bpm)

SD MAE RMSE R

Poh2011 [39] 13.5 - 13.6 0.36CHROM [10] - 13.49 22.36 0.21Li2014 [24] 6.88 - 7.62 0.81SAMC [52] 5.81 - 6.23 0.83SynRhythm [34] 10.88 - 11.08 -HR-CNN [49] - 7.25 9.24 0.51DeepPhys [8] - 4.57 - -PhysNet [60] 7.84 5.96 7.88 0.76STVEN+rPPGNet [61] 5.57 4.03 5.93 0.88

End-to-end (baseline) 7.39 7.47 8.63 0.70Meta-rPPG (inductive) 7.91 7.42 8.65 0.74Meta-rPPG (proto only) 6.89 6.05 6.71 0.77Meta-rPPG (synth only) 5.09 3.88 4.02 0.81Meta-rPPG (proto+synth) 4.90 3.01 3.68 0.85

14 Lee et al.

Table 3: Results of average HR measurement on UBFC-rPPG.

MethodHR (bpm)

SD MAE RMSE R

GREEN [53] 20.2 10.2 20.6 -ICA [39] 18.6 8.43 18.8 -CHROM [10] 19.1 10.6 20.3 -POS [55] 10.4 4.12 10.5 -3D CNN [4] 8.55 5.45 8.64 -

End-to-end (baseline) 13.70 12.78 13.30 0.27Meta-rPPG (inductive) 14.17 13.23 14.63 0.35Meta-rPPG (proto only) 9.17 7.82 9.37 0.48Meta-rPPG (synth only) 11.92 9.11 11.55 0.42Meta-rPPG (proto+synth) 7.12 5.97 7.42 0.53

End-to-end (baseline) Meta-rPPG (inductive) Meta-rPPG (proto+synth)

Fig. 4: Feature activation map visualization of 4 subjects using different trainingmethods. Usage of transductive inference results in activations of higher contrastand covers larger region of facial features that contributes to rPPG estimation.

Visualization of feature activation map shows the importance of trans-ductive inference during evaluation. In Fig. 10, we show feature activation mapsvisualization [28] generated using our proposed methods and supervised end-to-end trained model on images extracted from MAHNOB-HCI. Regions that aremore likely to be useful for remote HR estimation are more pronounced whenour approach is introduced. Results show that transductive inference is usefulwhen applied to data excluded from the training distribution.

5 Conclusion

In this work, we introduce transductive inference into the framework of rPPGestimation. For transductive inference, we propose the use of a synthetic gra-dient generator and protypical distance minimizer to provide gradient to ourfeature extractor when labeled data are unobtainable. By posing the learning ofour network as a meta-learning framework, we see substantial improvements onMAHNOB-HCI and UBFC-rPPG dataset demonstrating state-of-the-art results.

Acknowledgements. This work is supported by Ministry of Science and Tech-nology (MOST) of Taiwan: 107-2221-E-009 -125 -MY3.


References

1. Balakrishnan, G., Durand, F., Guttag, J.: Detecting pulse from head motions invideo. In: Proceedings of the IEEE Conference on Computer Vision and PatternRecognition. pp. 3430–3437 (2013)

2. Bengio, Y., Deleu, T., Rahaman, N., Ke, R., Lachapelle, S., Bilaniuk, O., Goyal, A.,Pal, C.: A meta-transfer objective for learning to disentangle causal mechanisms.arXiv preprint arXiv:1901.10912 (2019)

3. Bobbia, S., Macwan, R., Benezeth, Y., Mansouri, A., Dubois, J.: Unsupervisedskin tissue segmentation for remote photoplethysmography. Pattern RecognitionLetters 124, 82–90 (2019)

4. Bousefsaf, F., Pruski, A., Maaoui, C.: 3d convolutional neural networks for remotepulse rate measurement and mapping from facial video. Applied Sciences 9(20),4364 (2019)

5. Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)6. Cao, W., Mirjalili, V., Raschka, S.: Rank-consistent ordinal regression for neural

networks. arXiv preprint arXiv:1901.07884 (2019)7. Cennini, G., Arguel, J., Aksit, K., van Leest, A.: Heart rate monitoring via re-

mote photoplethysmography with motion artifacts reduction. Optics express 18(5),4867–4875 (2010)

8. Chen, W., McDuff, D.: Deepphys: Video-based physiological measurement usingconvolutional attention networks. In: Proceedings of the European Conference onComputer Vision (ECCV). pp. 349–365 (2018)

9. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:2005 IEEE computer society conference on computer vision and pattern recognition(CVPR’05). vol. 1, pp. 886–893. IEEE (2005)

10. De Haan, G., Jeanne, V.: Robust pulse rate from chrominance-based rppg. IEEETransactions on Biomedical Engineering 60(10), 2878–2886 (2013)

11. Digiglio, P., Li, R., Wang, W., Pan, T.: Microflotronic arterial tonometry for con-tinuous wearable non-invasive hemodynamic monitoring. Annals of biomedical en-gineering 42(11), 2278–2288 (2014)

12. Dou, Q., de Castro, D.C., Kamnitsas, K., Glocker, B.: Domain generalization viamodel-agnostic learning of semantic features. In: Advances in Neural InformationProcessing Systems. pp. 6450–6461 (2019)

13. Doyle, O.M., Westman, E., Marquand, A.F., Mecocci, P., Vellas, B., Tsolaki, M.,K loszewska, I., Soininen, H., Lovestone, S., Williams, S.C., et al.: Predicting pro-gression of alzheimers disease using ordinal regression. PloS one 9(8) (2014)

14. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptationof deep networks. In: Proceedings of the 34th International Conference on MachineLearning-Volume 70. pp. 1126–1135. JMLR. org (2017)

15. Finn, C., Rajeswaran, A., Kakade, S., Levine, S.: Online meta-learning. arXivpreprint arXiv:1902.08438 (2019)

16. Gidaris, S., Komodakis, N.: Dynamic few-shot visual learning without forgetting.In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition. pp. 4367–4375 (2018)

17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation9(8), 1735–1780 (1997)

18. Hu, S.X., Moreno, P., Xiao, Y., Shen, X., Obozinski, G., Lawrence, N., Dami-anou, A.: Empirical bayes transductive meta-learning with synthetic gradients.In: International Conference on Learning Representations (ICLR) (2020), https://openreview.net/forum?id=Hkg-xgrYvH

https://openreview.net/forum?id=Hkg-xgrYvH

https://openreview.net/forum?id=Hkg-xgrYvH

16 Lee et al.

19. Jaderberg, M., Czarnecki, W.M., Osindero, S., Vinyals, O., Graves, A., Silver,D., Kavukcuoglu, K.: Decoupled neural interfaces using synthetic gradients. In:Proceedings of the 34th International Conference on Machine Learning-Volume70. pp. 1627–1635. JMLR. org (2017)

20. Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of re-gression trees. In: Proceedings of the IEEE conference on computer vision andpattern recognition. pp. 1867–1874 (2014)

21. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shotimage recognition. In: ICML deep learning workshop. vol. 2. Lille (2015)

22. Lee, E., Hsu, T.J., Lee, C.Y.: Centralized state sensing using sensor array on wear-able device. In: 2019 IEEE International Symposium on Circuits and Systems(ISCAS). pp. 1–5. IEEE (2019)

23. Li, X., Alikhani, I., Shi, J., Seppanen, T., Junttila, J., Majamaa-Voltti, K., Tulppo,M., Zhao, G.: The obf database: A large face video database for remote physio-logical signal measurement and atrial fibrillation detection. In: 2018 13th IEEEInternational Conference on Automatic Face & Gesture Recognition (FG 2018).pp. 242–249. IEEE (2018)

24. Li, X., Chen, J., Zhao, G., Pietikainen, M.: Remote heart rate measurement fromface videos under realistic situations. In: Proceedings of the IEEE conference oncomputer vision and pattern recognition. pp. 4264–4271 (2014)

25. Liang, S., Li, Y., Srikant, R.: Enhancing the reliability of out-of-distribution imagedetection in neural networks. arXiv preprint arXiv:1706.02690 (2017)

26. Liu, Y., Lee, J., Park, M., Kim, S., Yang, E., Hwang, S.J., Yang, Y.: Learning topropagate labels: Transductive propagation network for few-shot learning. arXivpreprint arXiv:1805.10002 (2018)

27. Maeda, Y., Sekine, M., Tamura, T.: The advantages of wearable green reflectedphotoplethysmography. Journal of medical systems 35(5), 829–834 (2011)

28. Menikdiwela, M., Nguyen, C., Li, H., Shaw, M.: Cnn-based small object detectionand visualization with feature activation mapping. In: 2017 International Con-ference on Image and Vision Computing New Zealand (IVCNZ). pp. 1–5. IEEE(2017)

29. Mishra, N., Rohaninejad, M., Chen, X., Abbeel, P.: A simple neural attentivemeta-learner. arXiv preprint arXiv:1707.03141 (2017)

30. Moco, A.V., Stuijk, S., de Haan, G.: Skin inhomogeneity as a source of error inremote ppg-imaging. Biomedical optics express 7(11), 4718–4733 (2016)

31. Munkhdalai, T., Yu, H.: Meta networks. In: Proceedings of the 34th InternationalConference on Machine Learning-Volume 70. pp. 2554–2563. JMLR. org (2017)

32. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose esti-mation. In: European conference on computer vision. pp. 483–499. Springer (2016)

33. Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms.arXiv preprint arXiv:1803.02999 (2018)

34. Niu, X., Han, H., Shan, S., Chen, X.: Synrhythm: Learning a deep heart rate esti-mator from general to specific. In: 2018 24th International Conference on PatternRecognition (ICPR). pp. 3580–3585. IEEE (2018)

35. Niu, X., Shan, S., Han, H., Chen, X.: Rhythmnet: End-to-end heart rate estima-tion from face via spatial-temporal representation. IEEE Transactions on ImageProcessing (2019)

36. Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Ordinal regression with multipleoutput cnn for age estimation. In: Proceedings of the IEEE conference on computervision and pattern recognition. pp. 4920–4928 (2016)


37. Parra, D., Karatzoglou, A., Amatriain, X., Yavuz, I.: Implicit feedback recommen-dation via implicit-to-explicit ordinal logistic regression mapping. Proceedings ofthe CARS-2011 5 (2011)

38. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T.,Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z.,Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala,S.: Pytorch: An imperative style, high-performance deep learning library. In:Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., Garnett,R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

39. Poh, M.Z., McDuff, D.J., Picard, R.W.: Advancements in noncontact, multiparam-eter physiological measurements using a webcam. IEEE transactions on biomedicalengineering 58(1), 7–11 (2010)

40. Poh, M.Z., McDuff, D.J., Picard, R.W.: Non-contact, automated cardiac pulsemeasurements using video imaging and blind source separation. Optics express18(10), 10762–10774 (2010)

41. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning (2016)42. Ren, J., Liu, P.J., Fertig, E., Snoek, J., Poplin, R., Depristo, M., Dillon, J., Laksh-

minarayanan, B.: Likelihood ratios for out-of-distribution detection. In: Advancesin Neural Information Processing Systems. pp. 14680–14691 (2019)

43. Rettie, R., Grandcolas, U., Deakins, B.: Text message advertising: Response ratesand branding effects. Journal of targeting, measurement and analysis for marketing13(4), 304–312 (2005)

44. Rusu, A.A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S.,Hadsell, R.: Meta-learning with latent embedding optimization. arXiv preprintarXiv:1807.05960 (2018)

45. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learningwith memory-augmented neural networks. In: International conference on machinelearning. pp. 1842–1850 (2016)

46. Sigrist, M.K., Taal, M.W., Bungay, P., McIntyre, C.W.: Progressive vascular cal-cification over 2 years is associated with arterial stiffening and increased mortalityin patients with stages 4 and 5 chronic kidney disease. Clinical Journal of theAmerican Society of Nephrology 2(6), 1241–1248 (2007)

47. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In:Advances in neural information processing systems. pp. 4077–4087 (2017)

48. Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database foraffect recognition and implicit tagging. IEEE transactions on affective computing3(1), 42–55 (2011)

49. Spetlık, R., Franc, V., Matas, J.: Visual heart rate estimation with convolutionalneural network. In: Proceedings of the British Machine Vision Conference, New-castle, UK. pp. 3–6 (2018)

50. Streifler, J.Y., Eliasziw, M., Benavente, O.R., Hachinski, V.C., Fox, A.J., Barnett,H.: Lack of relationship between leukoaraiosis and carotid artery disease. Archivesof neurology 52(1), 21–24 (1995)

51. Takano, C., Ohta, Y.: Heart rate measurement based on a time-lapse image. Med-ical engineering & physics 29(8), 853–857 (2007)

52. Tulyakov, S., Alameda-Pineda, X., Ricci, E., Yin, L., Cohn, J.F., Sebe, N.: Self-adaptive matrix completion for heart rate estimation from face videos under real-

http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf



18 Lee et al.

istic conditions. In: Proceedings of the IEEE Conference on Computer Vision andPattern Recognition. pp. 2396–2404 (2016)

53. Verkruysse, W., Svaasand, L.O., Nelson, J.S.: Remote plethysmographic imagingusing ambient light. Optics express 16(26), 21434–21445 (2008)

54. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networksfor one shot learning. In: Advances in neural information processing systems. pp.3630–3638 (2016)

55. Wang, W., den Brinker, A.C., Stuijk, S., de Haan, G.: Algorithmic principles of re-mote ppg. IEEE Transactions on Biomedical Engineering 64(7), 1479–1491 (2016)

56. Wang, W., den Brinker, A.C., Stuijk, S., de Haan, G.: Robust heart rate fromfitness videos. Physiological measurement 38(6), 1023 (2017)

57. Weersma, R.K., Stokkers, P.C., van Bodegraven, A.A., van Hogezand, R.A.,Verspaget, H.W., de Jong, D.J., Van Der Woude, C., Oldenburg, B., Linskens,R., Festen, E., et al.: Molecular prediction of disease risk and severity in a largedutch crohns disease cohort. Gut 58(3), 388–395 (2009)

58. Wu, Y., Rosca, M., Lillicrap, T.: Deep compressed sensing. arXiv preprintarXiv:1905.06723 (2019)

59. Yu, H., Sun, S., Yu, H., Chen, X., Shi, H., Huang, T.S., Chen, T.: Foal: Fast onlineadaptive learning for cardiac motion estimation. In: Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recognition. pp. 4313–4323 (2020)

60. Yu, Z., Li, X., Zhao, G.: Remote photoplethysmograph signal measurement fromfacial videos using spatio-temporal networks. In: Proc. BMVC. pp. 1–12 (2019)

61. Yu, Z., Peng, W., Li, X., Hong, X., Zhao, G.: Remote heart rate measurement fromhighly compressed facial videos: an end-to-end deep learning solution with videoenhancement. In: Proceedings of the IEEE International Conference on ComputerVision. pp. 151–160 (2019)

62. Zintgraf, L.M., Shiarlis, K., Kurin, V., Hofmann, K., Whiteson, S.: Fast contextadaptation via meta-learning. arXiv preprint arXiv:1810.03642 (2018)


Algorithm 2 Transductive Inference During Deployment

1: Input: x: A single video stream2: x, x ← x . x: first 2 seconds of video, x: rest of video3: for i← 1, L do . Adaptation phase (run L steps)4: θ ← θ − α(∇θLproto(z, zproto) + fψ(z))5: end for6: y ← hφ(fθ(x)) . Estimation of rPPG signal using adapted feature extractor

A Performing Transductive Inference During Deployment

Here, we show the algorithm for transductive inference during deployment inAlgorithm 2. The inference process is similar to the typical inference of a feed-forward deep neural network (the final line). The difference is in the inclusion ofadaptation steps using the first 2 seconds of the video for transductive learningprior to actual inference.

B Performance Comparison Using Different AdaptationSteps

In this section, we study how the number of adaptation steps, L, used for trans-ductive inference affects performance. We report results under different metrics,namely, mean absolute error (MAE), standard deviation of error (SD), rootmean squared error (RMSE) and Pearson correlation coefficient (R). Tabulatedperformances for MAHNOB-HCI are shown in Table 4, 5, 6 and 7 for MAE,SD, RMSE and R respectively. In the same order, tabulated performances forUBFC-rPPG are shown in Table 8, 9, 10 and 11. Again, using the same order,we show comparison plots in Figure 5, 6, 7 and 8 for both MAHNOB-HCI andUBFC-rPPG.

From the results, we can deduce that the number of adaptation steps usedduring transductive inference should match the number of steps used duringtraining. This rule only applies to the generation of synthetic gradients for trans-ductive inference but not on the protypical distance minimizer. We hypothesizethat the prototypical distance minimizer will eventually converge to some valueand doesn’t hurt performance if run for infinite number of adaptation steps.This is intuitive as the idea of prototypical distance minimizer is to pull out-of-distribution samples towards the center of the distribution that is modeled bythe rPPG estimator using the training data.

C Does Joint Adaptation of both Feature Extractor andrPPG Estimator Give Better Results?

We hypothesize that the estimation of rPPG signal is more efficient if we areable to update the weights of the feature extractor during testing for the adapta-tion to the new observed distribution. By doing so, we expect that the features

20 Lee et al.

generated by the feature extractor fall within the distribution covered by therPPG estimator, i.e. the weights of the rPPG estimator is obtained throughthe optimization on the training dataset. One might challenge that the jointadaptation of both feature extractor and rPPG estimator weights might resultin better performance, contradicting our hypothesis. To demonstrate that ourhypothesis holds, we perform an empirical study on whether the joint updateapproach or the sole update of the weights of the feature extractor performsbetter. The implementation is straight forward, the synthetic gradient generatoris moved towards the output for the joint adaptation case. We show experimentson MAHNOB-HCI using an adaptation steps of 10 in Table 12. The empiricalresults support our hypothesis.

Table 4: Results of mean absolute error of HR measurement on MAHNOB-HCIusing different adaptation steps, L.

MethodMAE of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30

End-to-end (baseline) 7.47 7.47 7.47 7.47 7.47Meta-rPPG (proto only) 7.42 6.65 6.05 6.02 6.02Meta-rPPG (synth only) 7.42 5.00 3.88 3.70 6.25Meta-rPPG (proto+synth) 7.42 3.89 3.01 3.02 6.20

Table 5: Results of average standard deviation HR measurement error onMAHNOB-HCI using different adaptation steps, L.

MethodSD of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30


D Visualization of Feature Activation Map UsingDifferent Methods

In this section, we show the visualization for feature activation map using themethods we introduced and is compared with the baseline that uses an end-to-end supervised learning method. Ablation study is also performed by showingfeature activation maps obtained using individual methods that we proposed.


Table 6: Results of root mean squared error of HR measurement on MAHNOB-HCI using different adaptation steps, L.

MethodRMSE of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30


Table 7: Results of Pearson correlation coefficient of HR measurement onMAHNOB-HCI using different adaptation steps, L.

MethodR of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30


Table 8: Results of mean absolute error of HR measurement on UBFC-rPPGusing different adaptation steps, L.

MethodMAE of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30


Table 9: Results of average standard deviation HR measurement error on UBFC-rPPG using different adaptation steps, L.

MethodSD of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30


22 Lee et al.

Table 10: Results of root mean squared error of HR measurement on UBFC-rPPG using different adaptation steps, L.

MethodRMSE of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30


Table 11: Results of Pearson correlation coefficient of HR measurement onUBFC-rPPG using different adaptation steps, L.

MethodR of HR (bpm)

L = 0 L = 5 L = 10 L = 20 L = 30


0 5 10 15 20 25 30


4

6

8

10

12

14

MA

E

MAHNOB-HCI

CHROM

baseline

inductive

proto

synth

proto + synth

0 5 10 15 20 25 30


6

7

8

9

10

11

12

13

MA

E

UBFC-rPPG

CHROM

baseline

inductive

proto

synth

proto + synth

Fig. 5: Mean absolute error, MAE, obtained using different rPPG estimationmethods. Demonstrates how the number of adaptation steps, L, affects perfor-mance.

Table 12: Results of average HR measurement on MAHNOB-HCI comparing thedifference between joint adaptation of both feature extractor and rPPG estimatorand updating the feature extractor only.

MethodHR (bpm)

MAE RMSE R

Joint Adaptation 4.25 6.09 0.81Extractor Only 3.01 3.68 0.85


0 5 10 15 20 25 30


5.0

5.5

6.0

6.5

7.0

7.5

8.0

SD

MAHNOB-HCI

baseline

inductive

proto

synth

proto + synth

0 5 10 15 20 25 30


7

8

9

10

11

12

13

14

SD

UBFC-rPPG

baseline

inductive

proto

synth

proto + synth

Fig. 6: Standard deviation of error, SD, obtained using different rPPG estima-tion methods. Demonstrates how the number of adaptation steps, L, affectsperformance.

0 5 10 15 20 25 30


4

5

6

7

8

RM

SE

MAHNOB-HCI

baseline

inductive

proto

synth

proto + synth

0 5 10 15 20 25 30


8

9

10

11

12

13

14

15

RM

SE

UBFC-rPPG

baseline

inductive

proto

synth

proto + synth

Fig. 7: Root mean squared error, RMSE, obtained using different rPPG esti-mation methods. Demonstrates how the number of adaptation steps, L, affectsperformance.

24 Lee et al.

0 5 10 15 20 25 30


0.70

0.72

0.74

0.76

0.78

0.80

0.82

0.84R

MAHNOB-HCI

baseline

inductive

proto

synth

proto + synth

0 5 10 15 20 25 30


0.30

0.35

0.40

0.45

0.50

R

UBFC-rPPG

baseline

inductive

proto

synth

proto + synth

Fig. 8: Pearson correlation coefficient, R, obtained using different rPPG esti-mation methods. Demonstrates how the number of adaptation steps, L, affectsperformance.

More subjects are also shown here to give a better understanding of the im-portance of transductive inference for rPPG estimation using a deep learningmodel.

E Demonstration Using Video

We show the implementation of our algorithm on videos extracted from MAHNOB-HCI to show its performance during deployment. We demonstrate our algorithmon videos of 3 subjects and a snapshot of a single frame from one of the video isshown in Fig. 9. From the video attached in the supplementary materials, it canbe observed that the feature activation maps corresponding to our transductiveinference method is relatively consistent when compared to a model trained inan end-to-end fashion. Please refer to our video for a better understanding onthe improvements brought by the introduction of transductive inference to rPPGestimation.


Fig. 9: A frame extracted from a video from MAHNOB-HCI. From left to right: 1.Pre-processed face image (zero-ing of pixels outside facial landmarks), 2. featureactivation maps corresponding to end-to-end trained model, 3. feature activationmap corresponding to Meta-rPPG (proto+synth) and 4. plots containing rPPGsignal (top), power spectral density of rPPG signal (middle), predicted (bottom-left) and ground truth heart rate (bottom-right) in beats-per-minute (BPM).

26 Lee et al.

End-to-end(baseline)

Meta-rPPG(inductive)

Meta-rPPG(proto only)

Meta-rPPG(synth only)

Meta-rPPG(proto+synth)

Fig. 10: Feature activation map visualization of 8 subjects (each row correspondsto one subject) using different training methods. Ablation study is performed byinspecting the feature map activation the results upon the application of everyproposed transductive inference method. Usage of transductive inference resultsin activations of higher contrast and covers larger region of facial features thatcontributes to rPPG estimation.

arxiv:2007.06786v1 [cs.cv] 14 jul 2020

Documents