Quality Engineering, ISSN 0898-2112 (Print), 1532-4222 (Online). Journal homepage: http://www.tandfonline.com/loi/lqen20

To cite this article: Fugee Tsung, Ke Zhang, Longwei Cheng & Zhenli Song (2018) Statistical transfer learning: A review and some extensions to statistical process control, Quality Engineering, 30:1, 115-128, DOI: 10.1080/08982112.2017.1373810

To link to this article: https://doi.org/10.1080/08982112.2017.1373810

Accepted author version posted online: 12 Sep 2017. Published online: 12 Sep 2017.




Statistical transfer learning: A review and some extensions to statistical process control

Fugee Tsung, Ke Zhang, Longwei Cheng, and Zhenli Song

Department of Industrial Engineering and Logistics Management, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

KEYWORDS: 3D printing; Bayesian modeling; landslides; quality control; regularization; statistical process control; statistical transfer learning; transfer learning; urban rail transit

ABSTRACT: The rapid development of information technology, together with advances in sensory and data acquisition techniques, has led to the increasing necessity of handling datasets from multiple domains. In recent years, transfer learning has emerged as an effective framework for tackling related tasks in target domains by transferring previously-acquired knowledge from source domains. Statistical models and methodologies are widely involved in transfer learning and play a critical role, which, however, has not been emphasized in most surveys of transfer learning. In this article, we conduct a comprehensive literature review on statistical transfer learning, i.e., transfer learning techniques with a focus on statistical models and statistical methodologies, demonstrating how statistics can be used in transfer learning. In addition, we highlight opportunities for the use of statistical transfer learning to improve statistical process control and quality control. Several potential future issues in statistical transfer learning are discussed.

Introduction

With the remarkable development of information technology in recent years, data mining and machine learning techniques have been widely and successfully applied in various domains and data sources. However, traditional machine learning approaches usually perform well only for single tasks and within the same data distribution (Pan and Yang 2010; Weiss, Khoshgoftaar, and Wang 2016). To this end, transfer learning provides an efficient framework for combining multiple sources and allows the transfer of previously acquired knowledge to tackle related tasks in new domains. With the assistance of transfer learning, information transferred from source domains could improve a learner's performance in the target domain (Weiss, Khoshgoftaar, and Wang 2016). For instance, learning to play the electronic organ may help facilitate learning the piano (Pan and Yang 2010). Similarly, babies first learn to recognize human faces and then build on this knowledge to recognize other objects (Zhang and Yeung 2014). Transfer learning techniques have been demonstrated to be truly beneficial in many real-world

CONTACT Fugee Tsung [email protected] Department of Industrial Engineering and Logistics Management, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/lqen.

applications such as warranty prediction (Tseng, Hsu, and Lin 2016), surface shape prediction (Shao et al. 2017), WiFi localization (Pan et al. 2008), sentiment classification (Blitzer et al. 2007), and collaborative filtering (Pan et al. 2010).

Recent decades have witnessed rapid development in statistical models and methodologies with applications in a variety of fields. Many of these applications increasingly require describing data in different structures via statistical models and methodologies. Many statistical models have been actively studied in a transfer learning framework to integrate multiple data sources and transfer knowledge in specific data types. For example, Jin et al. (2011) investigate a hierarchical Bayesian model to cluster short text messages via transfer learning from auxiliary long text data. Shao et al. (2017) propose a multi-task learning approach for Gaussian processes to predict surface shapes by integrating similar manufacturing processes. In addition to statistical models, statistical methodologies are widely involved and play a critical role in connecting each individual statistical model in transfer learning

© Taylor & Francis



studies. Although transfer learning techniques have been extensively summarized in the field of data mining and machine learning (see Pan and Yang 2010; Lu et al. 2015; Weiss, Khoshgoftaar, and Wang 2016), there currently appears to be a lack of review articles on transfer learning from a statistical perspective.

This study provides a comprehensive review of statistical transfer learning, which refers to the transfer learning literature with a focus on the statistical models and methodologies adopted. Apart from reviewing the literature, we investigate how statistical transfer learning can be utilized in the field of statistical process control (SPC) and quality control via several real-world applications: landslide monitoring using slope sensor systems, passenger inflow forecasting and monitoring in urban rail transit systems, and shape deformation modeling for 3D-printed products.

The rest of the article is organized as follows. In the next section, we provide a brief overview and categorization of transfer learning methods. Then the statistical models and methodologies contained in transfer learning papers are discussed and summarized. After that, we introduce three applications in SPC and quality control based on statistical transfer learning. The conclusion is presented in the last section.

A brief overview of transfer learning

There has been a great deal of research in recent years on transfer learning techniques and applications, and several survey papers on transfer learning have been published in data mining and machine learning. For example, Pan and Yang (2010) introduce a brief history and categorization of transfer learning techniques and present a comprehensive overview of transfer learning for classification, regression, and clustering problems; Taylor and Stone (2009) survey transfer learning for reinforcement learning; Lu et al. (2015) examine transfer learning approaches in the computational intelligence field and cluster them into several categories.

Before we proceed with the detailed review and categorization of transfer learning, it is necessary to clarify the relationship between the closely related concepts of transfer learning, multi-task learning, and self-taught learning. Transfer learning is adopted to extract knowledge from source domains to improve performance in target domains, while in multi-task learning, the roles of the source and target tasks are symmetric: learning the task in each domain can be improved by using shared knowledge gained through other tasks. Multi-task learning approaches are covered in this paper as they are sometimes viewed as a subarea of transfer learning (Xu and Yang 2011) and are widely termed transfer learning techniques (Huang et al. 2012; Zou et al. 2015). Lastly, self-taught learning is transfer learning with an emphasis on utilizing unlabeled data in source domains for predictions in target domains.

Generally speaking, approaches to transfer learning can be divided into three categories based on the form of transferring information from source to target: instance-based, feature-based, and parameter-based transfer learning. A brief review of each transfer learning approach is given in the following.

Instance-based transfer learning is used when the source and the target instances are generated from two different but closely related distributions, so that parts of the source data can be reused in the target task. Dai et al. (2007) present a boosting algorithm, TrAdaBoost, to select the most useful source instances as additional training data for the target task by iterative reweighting. Their procedure enables the construction of a high-quality model for the target task by integrating only a tiny amount of new data and a large amount of old data. Jiang and Zhai (2007) propose a general instance weighting framework to remove misleading training instances from source data and assign more weight to instances in target data than to those in source data. Liao et al. (2005) adopt an active learning strategy to improve task performance by introducing auxiliary variables for each instance in source data. Wu and Dietterich (2004) implement knowledge transfer by minimizing a weighted sum of two separate loss functions corresponding to source and target data, respectively.

The feature-based transfer approach attempts to learn a common feature structure between source and target data, which can be treated as a bridge for knowledge transfer. Argyriou et al. (2007) propose a regularization-based method to learn a low-dimensional function representation shared by source and target tasks. Lee et al. (2007) adopt a probabilistic approach to learn an informed meta-prior over feature relevance. Their model transfers meta-priors between source and target tasks and can thus deal with cases where tasks have non-overlapping features or the relevance of the features varies between tasks. Blitzer et al. (2006) suggest structural correspondence



learning (SCL) to identify correspondences between features from source and target domains by modeling their correlations with pivot features. Although it has been shown experimentally that SCL can reduce the difference between domains, selecting the pivot features remains challenging. Pan et al. (2008) utilize maximum mean discrepancy embedding (MMDE) to find a low-dimensional latent feature space in which the distributions of data in different domains are close to each other. Dai et al. (2008) exploit large amounts of auxiliary data to uncover an improved feature representation to enhance the clustering performance of a small amount of target data.

Parameter-based transfer learning assumes that source and target tasks should share parameters or hyper-parameters of prior distributions. Most research has focused on two aspects: hierarchical Bayesian (HB) frameworks and regularization-based approaches. In the former, the parameters of the models for individual tasks are often assumed to be generated from a common prior distribution. Thus, knowledge can be transferred across domains by learning the common information through the abundant auxiliary data from source domains. Gaussian processes are widely used and appropriate for this situation (Lawrence and Platt 2004; Schwaighofer et al. 2005; Bonilla et al. 2007). Researchers have also implemented parameter transfer with regularization-based approaches. Evgeniou and Pontil (2004) separate the parameter vector in support vector machines (SVMs) for each task into a task-common term and a task-specific term. They present an approach for knowledge transfer based on the minimization of regularization functionals.

Review of statistical transfer learning

As mentioned in the last section, existing surveys of transfer learning have mainly been conducted in the data mining and machine learning fields and have focused mostly on methods of transferring information. Unlike existing surveys, this article reviews the transfer learning literature from a statistical perspective. In particular, transfer learning papers in the fields of statistics and industrial engineering receive additional attention. Recent progress is reviewed and organized from two perspectives: statistical models and statistical methodologies. First, transfer learning approaches for many real-world applications are reviewed based

on their underlying statistical models for each single task/domain, including linear models, Gaussian processes, network models, and statistical language models. We then summarize the various statistical techniques exploited to transfer information across multiple statistical models and data sources, such as boosting-based methods, bagging, Bayesian modeling, and regularization-based approaches.

Statistical models

A variety of statistical models have been developed in past decades to handle different data types in a diverse range of real-world applications. In an application aimed at transferring knowledge across multiple tasks, the first issue is to select an appropriate statistical model for each single task. As the fundamental elements of statistical transfer learning approaches, these underlying statistical models are summarized below with their corresponding applications.

Many transfer learning studies have been conducted based on linear and generalized linear models. For example, to combine Earth System Model outputs for land surface temperature prediction in both South and North America, Gonçalves et al. (2016) adopt a simple linear model for each geographic location and suggest a multi-task learning approach to allow the models to share dependencies. The transfer learning approach is more effective than conducting an ordinary least squares regression for each linear model since it can capture dependences across locations. For generalized linear models, Zou et al. (2015) propose a transfer learning method for logistic regression with an application in degenerate biological systems. Similarly, Zhang et al. (2014) investigate a regularization-based transfer learning approach to capture task relationships for generalized linear models with the incorporation of new tasks. Moreover, Samarov et al. (2015, 2016) consider a linear mixture model to combine multiple outputs, which is applied to handle hyperspectral biomedical images.

Transfer learning has also been employed for Gaussian processes. The key idea of such approaches is to impose a shared prior to connect similar but not identical Gaussian processes. The prediction for each individual Gaussian process benefits from utilizing observations from different but related processes. Many Gaussian process methods based on transfer learning are investigated under various assumptions



ranging from block to non-block designs (Schwaighofer et al. 2004; Bonilla et al. 2007; Yu et al. 2005), with practical applications in compiler performance prediction and exam score prediction. Furthermore, Shao et al. (2017) integrate cutting force variation modeling with a multi-task learning approach to improve surface prediction accuracy by incorporating engineering insight, in which an iterative multi-task Gaussian process learning algorithm is proposed to learn the model parameters.

For network models and graphical models, Huang et al. (2012) propose a transfer learning approach for a Gaussian graphical model. The goal of this method is to learn the brain connectivity network of Alzheimer's disease patients based on functional magnetic resonance imaging (fMRI) data.

Finally, statistical language models (see Zhai et al. 2008 for more details) can also be extended by transfer learning. Although task-specific statistical language models such as latent Dirichlet allocation (Blei et al. 2003) for long text or Twitter-LDA (Zhao et al. 2011) for short text have different structures, the words and corresponding language models naturally share linguistic similarities. In this sense, to improve the performance of Twitter clustering, Jin et al. (2011) suggest an extended Twitter clustering scheme that transfers the knowledge learned from long texts to short Twitter texts.

Statistical methodologies

In the transfer learning literature, assorted statistical methodologies are recommended to connect and transfer knowledge among multiple statistical models and data sources. Here, most of the mainstream statistical methodologies used to build connections between the statistical models in each domain are reviewed.

Boosting-based weighting schemes are often used to conduct instance-based transfer learning. Based on the well-known adaptive boosting method (Freund and Schapire 1995), Dai et al. (2007) present the boosting algorithm TrAdaBoost, reducing the distribution differences between domains by adjusting the weights of instances. Traditional adaptive boosting is an ensemble method that creates a strong classifier from a number of basic classifiers, such as decision trees, where the basic classifiers are combined sequentially by carefully adjusting the weights of the training instances in each iteration. To extend the adaptive boosting

method to conduct transfer learning on both source domains and target domains, TrAdaBoost adopts a different weighting mechanism that decreases the weights of instances in source domains that are dissimilar to target domains. By doing so, TrAdaBoost enhances predictive performance in target domains by using the data in source domains. Following this line, boosting-based methods have been considered under various scenarios, including multimodal TrAdaBoost (Wei et al. 2016) and multi-source TrAdaBoost (Yao and Doretto 2010). Bagging (Breiman 1996) and the bootstrap (Efron and Tibshirani 1994) are also extended as TrBagg (Kamishima et al. 2009) and double-bootstrapping source data selection (Lin et al. 2013) to construct an ensemble of learners in the context of instance transfer.
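A minimal sketch of this weighting mechanism can illustrate the idea. The function below performs one round of TrAdaBoost-style reweighting; the function name, signature, and simplified constants are ours, not taken from Dai et al. (2007):

```python
import numpy as np

def tradaboost_update(weights, misclassified, is_source, eps_t, n_rounds):
    """One round of TrAdaBoost-style instance reweighting (sketch).

    weights       -- current instance weights (1-D array)
    misclassified -- True where the current weak learner erred
    is_source     -- True for source-domain instances
    eps_t         -- weighted error of the weak learner on target data
    n_rounds      -- total number of boosting rounds
    """
    n_src = int(is_source.sum())
    # Source instances: fixed multiplicative factor < 1, so instances the
    # learner keeps getting wrong (dissimilar to the target) fade away.
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_rounds))
    # Target instances: the usual AdaBoost factor, up-weighting mistakes.
    beta_tgt = eps_t / (1.0 - eps_t)
    w = weights.astype(float).copy()
    w[is_source & misclassified] *= beta_src
    w[~is_source & misclassified] /= beta_tgt
    return w / w.sum()
```

After a few rounds, source instances that are repeatedly misclassified contribute almost nothing, which is exactly the down-weighting of dissimilar source data described above.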

Bayesian modeling is one of the most common statistical techniques for transferring information across different tasks and models: information on parameters can easily be transferred between models through shared prior distributions and common hyper-parameters. For instance, to predict products' field return rates during the warranty period, Tseng et al. (2016) propose a hierarchical Bayesian approach to model laboratory and field data collected from multiple products with a similar design. The information sharing between products is efficiently performed using a Dirichlet prior distribution, which leads to improved predictive performance, especially for products with few or even no failures in the laboratory. Hierarchical Bayesian transfer learning methods have been widely investigated under different settings and in various applications. For linear models, Zhang et al. (2010) formulate lq-regularization-based multi-task feature selection in a Bayesian framework, devising expectation-maximization (EM) algorithms to learn the model parameters for every task. To link the model coefficients of old and new domains in degenerate biological systems, Zou et al. (2016) adopt a hierarchical structure to characterize the degeneracy and correlation structure of the domains. Moreover, Bouveyron et al. (2014) suggest a Bayesian approach for a mixture of linear regression models in which Dirichlet and inverse-gamma distributions are imposed as priors. For Gaussian processes, the hierarchical Bayesian framework has been considered to connect multiple related processes (Yu et al. 2005,



Bonilla et al. 2007) and can thus benefit from joint estimation and knowledge transfer. In addition, it is appropriate to adopt Bayesian priors to transfer information between different statistical language models and tasks (Jin et al. 2011), since many language models, such as latent Dirichlet allocation (Blei et al. 2003), are themselves probabilistic generative models constructed in a Bayesian manner. Apart from parameter transfer, Bayesian probabilistic models are also exploited for transfer learning in terms of instance weighting in natural language processing (NLP) applications (Jiang and Zhai 2007).

Additionally, regularization-based methods provide an effective framework for transfer learning by assuming similar patterns for the model parameters between sources; they are commonly exploited to build connections between linear models through numerous penalties. Specifically, suppose that there are K tasks in total contained in the target and source domains. A regularization-based method typically considers the following penalized estimation:

min_{B_k} ∑_k g_k(X_k, Y_k, B_k) + penalty(B),

where g_k is an application-specified loss function with coefficients B_k, the k-th row of B, and (X_k, Y_k) are the training data of the k-th task. The penalty term is chosen so as to transfer information on the coefficients across all tasks and domains. Many penalties have been investigated for different applications under this framework. For instance, Liu, Ji, and Ye (2009) adopt L21-norm regularization for linear models to conduct feature selection across multiple domains by encouraging multiple predictors to share similar sparsity patterns, where the penalty is taken as the sum of the l2-norm over each variable, i.e., penalty(B) = ∑_i √(∑_k B_ki²). Similarly, Liu, Palatucci, and Zhang (2009) propose the multi-task Lasso to select significant variables across related linear regression models by replacing l1-norm regularization with the sum of l∞ regularization, i.e., penalty(B) = ∑_i max_k |B_ki|. Following this line, many variants considering other penalties have been investigated. For example, an adjusted l1-norm regularization weighted by spatial information is given by Samarov et al. (2015), and a mixture of l∞/l1/ridge-based penalties is discussed in Samarov et al. (2016), where the l1-based penalty is ∑_i ∑_k |B_ki| and the ridge-based penalty is ∑_i ∑_k B_ki². Gonçalves et al. (2014) design a regularization-based approach induced by a Gaussian prior that characterizes a sparse dependency structure of tasks, extend linear and logistic regressions under this framework, and later provide a flexible Gaussian copula model that relaxes the Gaussian marginal assumption (Gonçalves et al. 2016). Kernel extensions have been considered for nonlinear models in the context of transfer learning. Zhang et al. (2014) investigate a regularization approach by imposing a matrix-variate Gaussian prior distribution and extend it using kernel methods. Nonlinear multi-task kernels for SVMs have also been studied (Evgeniou and Pontil 2004).
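For concreteness, the four penalties above can be written in a few lines. Here B is a K × p coefficient matrix whose k-th row holds task k's coefficients; the function names are ours:

```python
import numpy as np

def penalty_l21(B):
    """L21-norm: sum over variables i of the l2-norm of column i."""
    return np.sqrt((B ** 2).sum(axis=0)).sum()

def penalty_linf(B):
    """Multi-task Lasso: sum over variables i of max_k |B_ki|."""
    return np.abs(B).max(axis=0).sum()

def penalty_l1(B):
    """Plain l1: sum_i sum_k |B_ki|; no coupling across tasks."""
    return np.abs(B).sum()

def penalty_ridge(B):
    """Ridge: sum_i sum_k B_ki^2."""
    return (B ** 2).sum()
```

Note that the L21 and l∞ penalties couple the tasks through each variable's column of B, which is what encourages the shared sparsity patterns discussed above, while the l1 and ridge penalties act entrywise.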

For unsupervised approaches, Song et al. (2015) investigate a PCA-based transfer learning approach and apply it to speech emotion recognition. As a popular PCA-like algorithm in the data mining field, sparse-coding-based transfer learning has also been extensively studied (Raina et al. 2007; Wei et al. 2016; Maurer et al. 2013). The key idea of sparse coding is to represent data vectors as sparse linear combinations of basic elements, allowing homogeneous representation structures to be shared between tasks.

To better illustrate the connection among statistical models, methodologies, and transfer learning, we summarize the relationship between transfer learning categories and statistical methodologies in Table 1. In Table 2, we list transfer learning literature that adopts various statistical models and methodologies.

Statistical models and methodologies are both critical in transfer learning and have been widely investigated. However, there remains a limited variety of statistical models extended using transfer learning despite their broad applications. As such, statistical transfer learning extensions for SPC and quality control are demonstrated in the following three sections. First, autoregressive models are extended to a transfer learning version to describe non-contemporaneous relationships between sensors and for the rapid detection of landslides and slope failures (Zhang et al. 2017). Then, the prediction and monitoring of passenger inflows

Table 1. Relationship between transfer learning categories and statistical methodologies.

                       Boosting   Bagging/bootstrap   Regularization   Bayesian framework
  Instance-transfer       ✓              ✓                                     ✓
  Feature-transfer                                           ✓
  Parameter-transfer                                         ✓                 ✓



Table 2. Relationship between statistical models and statistical methodologies in transfer learning.

  Boosting, bagging, and bootstrap
    Linear and generalized linear models: Dai et al. (2007); Wei et al. (2016); Yao and Doretto (2010); …
  Regularization-based methods
    Linear and generalized linear models: Zhang and Tsung; Liu et al. (2009); Samarov et al. (2015); Gonçalves et al. (2016); …
  Bayesian frameworks
    Linear and generalized linear models: Zhang et al. (2010); Zou et al. (2016); Bouveyron et al. (2014); …
    Gaussian processes: Shao et al. (2017); Yu et al. (2005); Bonilla et al. (2007); …
    Graphical/network models: Huang et al. (2012)
    Hierarchical Bayesian models: Song et al. (working paper); Jin et al. (2011); …

in a rail transit system will be considered in a statistical transfer learning framework (Song et al., working paper). After that, we introduce a parameter-based transfer learning approach for shape deviation prediction and control of 3D-printed products with distinct shapes, based on geometric error decomposition and modeling that incorporates engineering knowledge and experimental design (Cheng et al. 2017).

Statistical transfer learning for landslide monitoring

Landslides are common geographical activities that result in large quantities of rock, earth, and debris flowing down hillslopes, leading to thousands of casualties and billions of dollars in infrastructure damage every year around the world (Yang et al. 2010). To detect and predict such abnormal geographical behavior, accelerometer-based sensor systems are widely used at landslide-prone sites. Autocorrelated time series are often used to describe sensor readings over time, and autoregressive (AR) models are used for prediction (Pu et al. 2015). SPC procedures for monitoring such autocorrelated processes have also been widely studied (Psarakis and Papaleonida 2007; Castagliola and Tsung 2005).

Multiple time series are collected from several landslide-prone sites, each with multiple sensors assigned. The relationships between such time series can mainly be divided into two categories. The first is contemporaneous relationships, which contain spatially correlated residuals and time-lagged effects. Existing models such as vector autoregressive (VAR) models (Lütkepohl 2005) and spatial-temporal models (Cressie and Wikle 2015) are capable of capturing the contemporaneous relationships between multiple time series. The second is non-contemporaneous relationships,

which means the autoregressive structures of sensors at different sites may share similarities. The similarities may result from the same type of sensors being adopted and from similar geographical activities on site.

In this landslide application, sensor readings are usually recorded during different periods, varying from site to site. Thus, existing statistical models such as AR and VAR can only be used separately for the sensors at each landslide-prone site and may fail to provide accurate modeling for sensors at a site in its early stages with fewer observations. To improve modeling accuracy, it is helpful to jointly model the time series from different sites by considering the non-contemporaneous relationships, which cannot be done using the existing methods above. To this end, a transfer learning approach is proposed to extend autoregressive models for slope failure monitoring, where both contemporaneous and non-contemporaneous relationships of multiple autocorrelated time series are considered. Intuitively, it is expected that information will be transferred from "experienced" sites to early-stage sites by discovering the non-contemporaneous dependency structure in the AR coefficients. The detailed statistical model is as follows.

Suppose that data are collected from K different landslide-prone sites. At the k-th site, p_k sensors are assigned to collect measurements simultaneously. The i-th sensor's readings over time are denoted as the time series {y[k]_{i,t}}, t = 1, ..., T_k. Consider an AR(L) model for the i-th sensor at the k-th site:

y[k]_{i,t} = ∑_{l=1}^{L} β[k]_{i,l} y[k]_{i,t−l} + ε[k]_{i,t}.   [1]

Here, β[k]_{i,l} is the autoregressive coefficient representing the relationship between the i-th sensor's current measurement and its lag-l measurement at site k. We assume that ε[k]_t = (ε[k]_{1,t}, ..., ε[k]_{p_k,t})′ is a p_k × 1



vector of error terms, following N(0, Σ[k]), where the covariance matrix Σ[k] is adopted to characterize the contemporaneous spatial correlation between sensors within site k. On the other hand, similarly to Gonçalves et al. (2016), a Gaussian prior is imposed over the time-lagged coefficients across the AR(L) models of the sensors to transfer information across sites and sensors. Specifically, assume that

β_l = (β[1]_{1,l}, ..., β[1]_{p_1,l}, ..., β[K]_{1,l}, ..., β[K]_{p_K,l})′ ∼ N(0, Σ_M e^{−θl})   [2]

independently for each lag l. Here, Σ_M denotes a hidden dependency structure among the autoregressive models at all sites. Moreover, the decreasing time-lagged effect is accounted for through the exponential factor e^{−θl}, which depends on the lag l and the parameter θ.

The joint likelihood over all sites can be derived to infer the above model parameters from historical data. For site k, let the AR coefficients be B[k] = {β[k]_{i,l} : i = 1, ..., p_k; l = 1, ..., L}. The log-likelihood conditional on B[k] and Σ[k] is written as g_k(Σ[k], B[k]; Y[k]). Considering the imposed prior distribution [2] that characterizes the non-contemporaneous dependency structure, from a Bayesian perspective, if we assume non-informative priors for (Σ, Σ_M, θ), the posterior log-likelihood for the entire transfer learning framework is

∑_k g_k(Σ[k], B[k]; Y[k]) + ∑_l log π(β_l | Σ_M, θ) + const,   [3]

where π(·) denotes the prior distribution of β_l in Eq. [2]. To obtain estimates, we can iteratively update the parameters by maximizing the posterior above. Specifically, with Σ_M and θ fixed, {B[k], Σ[k]}, k = 1, ..., K, can be obtained by maximizing [3] through a coordinate ascent algorithm; with {B[k], Σ[k]} fixed, we can estimate Σ_M and θ. The latter term in [3] can be viewed as a regularization term that links the parameters of all models together.

A Monte Carlo simulation is performed to show the performance of the transfer-learning-extended method in Phase I estimation of autoregressive processes. Assume that there are only two sites in total, with three sensors at each site. The lengths of the observed time series are different: sensors at the first site have only 50 observations, while those at the second have 500 observations. Here the first site can be regarded as an early-stage site with fewer observations, while the

second is an “experienced” site with enough historicaldata. Let the maximum lag be 4. Suppose that the truecoefficients of AR(4)models are generated froma 6 × 6matrix �M , where

�M =

⎡⎢⎢⎣

1 ρ ... ρ ρ

ρ 1 ... ρ ρ

... ... ... ...

ρ ρ ... ρ 1

⎤⎥⎥⎦

6×6

.

A larger ρ in �M means the time series of sensorsin the simulation tend to evolve in a more similarmanner. When ρ= 1, the true AR coefficients in allof the sensors are the same. Here, the non-transferbaseline method is taken as separately estimating anAR model within each site. In this simulation, 10,000replicates are conducted for each ρ ranging from 0–0.99. The transfer learning method and non-transfermethod are compared via their averaged root meansquare errors (RMSEs) at the first site. The result inZhang et al. (working paper) show that AR modelsextended by transfer learning significantly outper-form the non-transfer baseline approach and theirestimation performance is improved with ρ increas-ing, which illustrates the necessity and effectivenessof investigating non-contemporaneous relationshipsand extending the AR model in a statistical transferlearning framework. Further evaluation studies willbe conducted to compare predictive performance withmore baseline methods considered.
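The coefficient-generation step of such a simulation can be sketched as follows (the decay parameter θ and the seed are illustrative choices; the paper specifies only Σ_M, ρ, and the AR(4) order):

```python
import numpy as np

def equicorr(dim, rho):
    """Equicorrelation matrix with unit diagonal, as used for Sigma_M."""
    return (1 - rho) * np.eye(dim) + rho * np.ones((dim, dim))

def draw_true_ar_coeffs(n_sensors=6, max_lag=4, rho=0.8, theta=0.5, seed=0):
    """Draw lag-l coefficient vectors beta_l ~ N(0, Sigma_M * exp(-theta*l))
    jointly for all sensors; rows index sensors, columns index lags."""
    rng = np.random.default_rng(seed)
    Sigma_M = equicorr(n_sensors, rho)
    return np.column_stack([
        rng.multivariate_normal(np.zeros(n_sensors),
                                Sigma_M * np.exp(-theta * l))
        for l in range(1, max_lag + 1)
    ])
```

As ρ approaches 1, the six sensors' coefficients at each lag become nearly identical, which is the regime where pooling information across sites helps most.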

For the monitoring part, statistical transfer learning can provide all-round improvements to extend the existing SPC schemes: more accurate Phase I estimation and more rapid detection in Phase II online monitoring.

For Phase I analysis, the simulation result has shown the effectiveness of estimating in-control AR coefficients for sensors and landslide-prone sites, especially those with limited historical data. Outlier detection and change-point diagnosis can also be investigated in this transfer learning framework under reasonable assumptions.

Adopting statistical transfer learning in Phase II landslide monitoring is more challenging than for Phase I analysis but worthwhile. An improved understanding of the target site/sensor can be obtained with the help of source sites/sensors, thus leading to rapid detection. For Phase II landslide monitoring, shifts in the autoregressive structure B^[k] and the spatial covariance Σ^[k] represent abnormal states and need to be monitored. To monitor the autoregressive structure


of the k-th site, B^[k], a generalized likelihood ratio (GLR)-based SPC scheme may be considered, where the transferred parameters {Σ_M, θ} obtained in Phase I can provide a more precise GLR statistic. For the covariance matrix Σ^[k], a residual-based control chart may be considered, in which the improved estimation of B^[k] using transfer learning can reduce the noise when constructing the statistics.
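A minimal sketch of the residual-based idea for Σ^[k] is a Hotelling-type T² chart on one-step-ahead residuals, with the per-sensor AR fits written in VAR form. This illustrates the general idea only, not the specific scheme under development; the names and the lag-matrix representation are assumptions:

```python
import numpy as np

def residual_t2(Y_new, B, Sigma):
    """Hotelling T^2 on one-step-ahead residuals of new observations under a
    Phase-I autoregressive model.
    Y_new: (T, p) array of new observations for the p sensors of one site;
    B: list of L lag matrices of shape (p, p) (VAR form of the AR fits);
    Sigma: (p, p) in-control residual covariance."""
    L = len(B)
    T, p = Y_new.shape
    Sinv = np.linalg.inv(Sigma)
    t2 = []
    for t in range(L, T):
        # Predict from the last L observations, then chart the residual.
        pred = sum(B[l] @ Y_new[t - 1 - l] for l in range(L))
        r = Y_new[t] - pred
        t2.append(r @ Sinv @ r)
    return np.array(t2)
```

In control, each statistic is approximately chi-square with p degrees of freedom, so a chi-square quantile gives a natural control limit; better Phase I estimates of B^[k] make the residuals, and hence the chart, less noisy.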

Statistical transfer learning for monitoring in urban rail transit systems

With the proliferation of smart cities, public transportation services such as urban railway transit (URT) systems are playing an increasingly important role in commuter mobility. For instance, Hong Kong's MTR carries more than five million passengers every day. Aperiodic incidents and events, such as traffic accidents, traffic controls, celebrations, protests, and disasters, can lead to abnormal passenger streams on public transportation systems, which in extreme cases can result in serious accidents such as stampedes due to overcrowding. It is important to predict passenger flows and conduct monitoring schemes to prevent accidents due to excessive passenger flow within URT systems. A large body of transportation engineering research has analyzed passenger flows to estimate travel times (Jaiswal et al. 2010), to simulate the distribution and movement of passengers in an area (Setti and Hutchinson 1994), and to predict passenger selection behavior (Ren et al. 2012). However, there has been a dearth of research on the influence of passenger crowding on entire URT systems. It is therefore crucial to develop a statistical methodology to understand, predict, and monitor the number of passengers and the degree of crowding in a URT system in real time.

However, there remain several challenges to be addressed. Most notably, the early warning problem aims to make proactive decisions based on predicted future passenger flows, which is significantly different from conventional statistical process control problems that monitor current processes based on past data. Hence the modeling and predictive performance of the methodology are crucial. Furthermore, the large number of stations in real URT systems particularly requires scalable early warning schemes. In this section, a scalable and prediction-based SPC scheme is sought. To begin with, a single model will be built for each station, allowing for high flexibility and satisfactory predictive ability. Moreover, due to the special properties of count data, conventional methods such as functional data analysis cannot be applied to inflow passenger profiles. To tackle this, a hierarchical model is adopted to describe the inflow counting data at each station, where x_k(t) represents the number of passengers entering station k at time t. In particular, x_k(t) is assumed to be a Poisson random variable with intensity parameter λ_k(t). After this, the focus is on log λ_k(t), which is inspired by the log-linear model for categorical data (Li et al. 2009). Specifically, the following state space model is proposed for each station k, k = 1, 2, ..., N:

\[
\log \lambda_k(t) = \alpha^{(k)}_0 + \sum_l \alpha^{(k)}_l \log \lambda_k(t - l) + \varepsilon_k(t), \qquad [4]
\]

where the ε_k(t)'s are independent across stations and follow N(0, σ_k²).

When the above hierarchical model of URT systems is extended in a transfer learning framework, several advantages immediately appear and show promising potential. A URT system typically covers an entire city and connects various areas in a network structure. Hence, the autoregressive structures of the Poisson intensities of different stations can be highly related when the stations are spatially close or of the same category, such as downtown, business districts, and residential areas. Consequently, the parameters of the log-linear models in Eq. [4] for similar stations are expected to be closely related to each other. Isolating each station is not enough to fully utilize the useful information from other related stations. Instead, it is beneficial to learn multiple tasks (stations) simultaneously under a transfer learning framework.
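The station-level model [4] can be simulated directly, which is useful for checking estimation schemes. A minimal sketch, with illustrative parameter values (the intercept, lag coefficients, and noise level below are assumptions, not calibrated to MTR data):

```python
import numpy as np

def simulate_station_counts(alpha0, alphas, sigma, T, seed=0):
    """Simulate inflow counts x_k(t) ~ Poisson(lambda_k(t)) where
    log lambda_k(t) = alpha0 + sum_l alphas[l] * log lambda_k(t-l) + eps_k(t),
    eps_k(t) ~ N(0, sigma^2), as in the state space model [4]."""
    rng = np.random.default_rng(seed)
    L = len(alphas)
    log_lam = np.zeros(T + L)          # zero burn-in values of log-intensity
    x = np.zeros(T + L, dtype=int)
    for t in range(L, T + L):
        log_lam[t] = (alpha0
                      + sum(alphas[l] * log_lam[t - 1 - l] for l in range(L))
                      + rng.normal(0.0, sigma))
        x[t] = rng.poisson(np.exp(log_lam[t]))
    return x[L:]
```

With |Σ α_l| < 1 the log-intensity is stationary, so the simulated counts fluctuate around a stable level rather than exploding.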

In more detail, the coefficients across different stations are assumed to share a common prior distribution, i.e.,

\[
\alpha_l = \left( \alpha^{(1)}_l, \alpha^{(2)}_l, \ldots, \alpha^{(N)}_l \right)^{T} \sim N(0, \Omega).
\]

Here, Ω depicts the inherent relatedness structure among the stations. Accordingly, stations can be clustered into different categories by treating Ω as a similarity measure. In this sense, transferring knowledge across stations is expected to improve estimation, and thus predictive performance, as well as reveal the hidden inherent structure among stations. A similar idea has been raised in the disease mapping problem by borrowing information from neighboring regions (Blangiardo and Cameletti 2015). The major difference in the URT


problem is that knowledge can be shared among stations in a similar category of functional zones, in addition to stations that are spatially close.
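As one illustration of treating the estimated relatedness matrix as a similarity measure, stations might be grouped by thresholding the correlations it implies and taking connected components. This is a naive sketch; the threshold and the connected-components rule are assumptions, and a proper analysis would likely use a dedicated clustering method:

```python
import numpy as np

def cluster_stations(Omega, threshold=0.7):
    """Group stations whose implied coefficient correlation (from the
    relatedness matrix Omega) exceeds a threshold, via connected components
    of the resulting 'similar stations' graph."""
    d = np.sqrt(np.diag(Omega))
    corr = Omega / np.outer(d, d)          # covariance -> correlation
    n = Omega.shape[0]
    labels = -np.ones(n, dtype=int)
    cluster = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack = [i]                        # depth-first search from station i
        labels[i] = cluster
        while stack:
            j = stack.pop()
            for k in range(n):
                if labels[k] < 0 and corr[j, k] >= threshold:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels
```

A block-structured relatedness matrix then recovers the functional-zone grouping: stations in the same block receive the same label.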

To monitor passenger inflow in a URT system, an SPC scheme is required to estimate parameters in Phase I and implement monitoring in Phase II. There are technical challenges in each phase when applying SPC procedures to passenger inflow monitoring. During Phase I, a state space model is adopted for each station to describe the latent Poisson intensity, where the parameters are estimated jointly across stations via a common prior distribution. Estimating the hierarchical model in a transfer learning manner can improve performance, especially for newly activated stations with less data, as the inherent relatedness structure Ω is exploited over the entire URT system. EM algorithms or particle filtering are potential implementations of Phase I parameter inference. During Phase II, a prediction-based monitoring approach is recommended because proactive decisions must be made to avoid accidents. Specifically, for each station, based on a prediction with a given lead time, e.g., 20 min, the personnel in a URT system can decide on a course of action. If the predicted values exceed a pre-specified limit, this signifies that the upcoming passenger inflow poses a considerable challenge to the operation of the station. Hence, an early warning scheme should be activated, and an alarm may be signaled if necessary. Here, the control limits are determined by each station's specific capacity and infrastructure characteristics, in combination with the Phase I analysis. On the other hand, if the predicted values differ considerably from the actual values observed later, model calibration and online updating are needed. Moreover, to address the scalability issue, the approach of Mei (2010) can be adopted to combine transfer learning predictions at multiple stations to obtain SPC statistics for detection throughout the URT network: a local statistic is generated for monitoring based on the predicted passenger flow at each station, and all of these local statistics are then integrated to make a final decision. Further work along these lines needs to be conducted.
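The fusion step can be sketched as follows: a one-sided local CUSUM per station (here on observed or predicted inflow against its in-control level), with the current local statistics summed into one network-level statistic in the spirit of Mei (2010). The reference value and thresholds are illustrative assumptions:

```python
import numpy as np

def local_cusum(x, mu0, k):
    """One-sided CUSUM of deviations of inflow x (observed or predicted)
    from its in-control level mu0, with reference value k."""
    s = np.zeros(len(x) + 1)
    for t, xt in enumerate(x):
        s[t + 1] = max(0.0, s[t] + (xt - mu0) - k)
    return s[1:]

def network_alarm(station_stats, threshold):
    """Sum the current local statistics over all stations and signal when
    the sum crosses a global threshold (the fusion rule of Mei 2010)."""
    total = sum(s[-1] for s in station_stats)
    return total, total > threshold
```

Because each station only contributes one scalar, the scheme scales linearly in the number of stations, which is what makes it attractive for a city-wide URT network.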

Statistical transfer learning for 3D printing quality control

3D printing is one of the most promising manufacturing techniques, since it enables the direct fabrication of products of complex shapes with few design constraints. However, dimensional inaccuracy remains one of the most concerning quality issues limiting the technology's application. Many shape deviation modeling and compensation methods have been proposed to improve the geometric accuracy of fabricated products, including those devised by Huang et al. (2014, 2015), Luan and Huang (2015), and Wang et al. (2017), among others. However, it has been shown that these methods only perform well for products with specific shapes and usually require re-estimating model parameters for new shapes. There are three major challenges to predicting geometric errors and deriving effective compensation plans for new products before fabrication. First, the geometric error-generating mechanism is very complex and there are multiple error sources, which makes it difficult to build an effective model from first principles. Second, as there is a wide variety of complex shapes, it is only feasible to fabricate limited products for limited shapes due to resource constraints; it is therefore infeasible to build a single comprehensive model based on data-driven methods that require large amounts of data. Third, it is hard to establish connections between the shape deviations of products fabricated with distinct shapes.

To tackle the above challenges in quality control, Cheng et al. (2017) propose an in-plane shape deviation modeling scheme from a statistical transfer learning perspective. In this scheme, a parameter-based transfer learning approach is adopted, based on geometric error decomposition and modeling, by incorporating engineering knowledge and experimental design.

Although the error-generating mechanism is complex, the geometric error of a fabricated product can generally be decomposed into two components: (i) a shape-independent error component, meaning the model parameters for this component are the same for different shapes, and (ii) a shape-specific error component, corresponding to a specific term in the deviation model of each different shape. The motivation for the above decomposition is the observation that the deviations of the same point located on the boundaries of two different in-plane shapes are usually different. One cause of the difference in the fused deposition modeling (FDM) process is that the error induced by depositing material is highly related to the moving path of the extruder. The error at a boundary point is thus expected to have two components: one generally shared by all shapes and the other highly related to the shape features. Suppose the input shape


of a designed product is ψ₀ and the final shape of the fabricated product is ψ; then

\[
\psi = \psi_0 + e_0(\psi_0) + e_1(\psi_0) + \varepsilon,
\]

where e₀(ψ₀) is the shape-independent deviation, e₁(ψ₀) is the shape-specific deviation, and ε represents the random error.

Since the measured deviation always contains both error components, it is difficult to isolate the shape-independent error from the shape-specific error with simple data-driven methods. To tackle this, Cheng et al. (2017) propose approximating the shape-independent error with the deviation of a point inside a product, since the shape-specific error is mainly incurred by shape boundary features. Based on this assumption, an experiment was designed to investigate and model the shape-independent error e₀(ψ₀). First, a circular plate with a radius of 60 mm and a square plate with a side length of 100 mm were fabricated via an FDM 3D printer, as shown in Figure 1. On each plate, 81 marks are designed and fabricated in a 9 × 9 grid pattern. The interval of the grid is 10 mm. Each mark is designed as a circular hole with a radius of 1 mm, and its center represents the mark's position. Such a mark is small enough to rarely affect the material shrinkage, and its circular shape facilitates the measurement process for obtaining its position. Since these marks are inside the product, the measured deviations at these locations are rarely related to the shape features and can hence be used to approximate the shape-independent errors and model e₀(ψ₀) in the Cartesian coordinate system. Suppose the designed marks are fabricated at (xᵢ, yᵢ) and their measured locations are denoted as (x′ᵢ, y′ᵢ), i = 1, ..., M. The measurement of the shape-independent error at (xᵢ, yᵢ) can then be represented as (e₀ₓ(xᵢ, yᵢ), e₀ᵧ(xᵢ, yᵢ)) = (x′ᵢ − xᵢ, y′ᵢ − yᵢ).

Figure 1. Fabricated circular plate and square plate for modeling the shape-independent error. Adapted from Cheng et al. (2017).

Based on the significant linear pattern observed in the measurements of the shape-independent error, the following linear regression models are applied to model the shape-independent error in the x-direction and y-direction separately:

\[
\begin{cases}
e_{0x}(x, y) = \beta_{1x} x + \beta_{2x} y + e_x \\
e_{0y}(x, y) = \beta_{1y} x + \beta_{2y} y + e_y.
\end{cases}
\]

The linear coefficients in the above model can be estimated using the data from our measurements. The result shows that this model can accurately predict the shape-independent error. After this step, the above shape-independent error model can be transferred to predict the shape-independent error component for any new shape. Suppose the input shape is ψ₀. The shape incorporating the predicted shape-independent error can then be represented as

\[
\psi' = \left\{ \left( x + e_{0x}(x, y), \; y + e_{0y}(x, y) \right) \mid (x, y) \in \psi_0 \right\}.
\]

To investigate the shape-specific error, the input shape ψ₀, the shape incorporating the shape-independent error ψ′, and the final product shape ψ are represented as r₀(θ), r′(θ), and r(θ) in the polar coordinate system, respectively. Let y(θ) = r(θ) − r₀(θ) denote the measured deviation profile and f₀(θ) = r′(θ) − r₀(θ) denote the predicted shape-independent deviation profile. The shape-specific error is then isolated from the shape-independent error by

\[
y(\theta) - f_0(\theta) = r(\theta) - r'(\theta) = f_1(\theta) + \varepsilon(\theta),
\]

where f₁(θ) denotes the shape-specific error and ε(θ) is the random error. To demonstrate this, two circular products with radii of 10 mm and 30 mm and two square products with side lengths of 20 mm and 60 mm are fabricated, and the corresponding deviation profiles are measured. For each product, the shape-independent deviation profile is predicted and the shape-specific deviation profile is calculated. It is observed that the shape-independent deviation model can capture a major part of the total deviation, and the remaining shape-specific deviation profile has a shape-specific pattern around the 0 line and appears to be rarely affected by the size of the product. This preliminary result shows that the proposed parameter-based transfer learning approach greatly improves the extendibility of the shape deviation model to infer new shapes. Future studies will focus on modeling the relationship between the shape-specific deviation profiles and the shape features, which may further improve the model's predictive performance and thus


increase the shape fidelity of fabricated products via deviation compensation.

In addition to predicting and controlling shape deviation for 3D-printed products with distinct shapes, there is also a great need for statistical transfer learning in 3D printing quality control from source machines to new target machines. When a 3D printing machine changes, the shape deviation model must be revised, which implies regenerating training data. It is thus of great importance to transfer the knowledge acquired from a source machine to a target machine.

Conclusion

The rapid development of information technology, together with advances in sensing and data acquisition techniques, has made it possible to conduct statistical inference based on multiple domains. To share domain knowledge and combine multiple data sources, transfer learning techniques have been investigated and utilized in many real-world applications. In this article, besides a general review of transfer learning, a summary of the transfer learning literature is provided, organized by the statistical models adopted in individual domains and the statistical techniques exploited to conduct knowledge transfer. Furthermore, transfer learning techniques are applied to SPC and quality control applications with various data types: autocorrelated sensor readings for landslide detection, Poisson counting processes for urban rail transit monitoring, and shape deviation prediction and control for 3D-printed products.

Several research issues in the context of statistical transfer learning remain to be addressed. First, transfer learning techniques have so far been applied in only a limited variety of applications. As stated in the above sections, a great number of domain-specific statistical models have the potential to be extended for transfer learning in further applications. For example, as two major dimensions in the information quality framework (Kenett and Shmueli 2016), integrating data and generalizing findings are commonly required for applications that integrate complex surveys (Kenett 2016) and official statistics data (Dalla Valle and Kenett 2015). Developing application-specific transfer learning methods should be of great value for practical use. Second, incorporating engineering knowledge into a statistical transfer learning framework is also of great interest to the fields of statistics and industrial engineering. For example, in the 3D printing application, the shape error decomposition and modeling that incorporate engineering knowledge make it possible to transfer the model to infer new shapes. A tailor-made statistical model for such applications can pave the way to integrating engineering insight and transfer learning approaches. Third, since the above extensions will inevitably lead to more complex models, efforts in both theoretical analysis and numerical studies are increasingly desired for model inference and parameter estimation when conducting statistical transfer learning approaches. For example, in the landslide monitoring application, the transfer learning model for time-lagged regressions can be inferred through empirical Bayes and maximum a posteriori (MAP) estimation. However, it is also possible to undertake model inference in a fully Bayesian manner through Markov chain Monte Carlo methods (Gilks et al. 1995; Rubinstein and Kroese 2016) or variational inference (Wainwright and Jordan 2008; Blei et al. 2016). Further statistical analysis is required to make a choice in such transfer learning applications. Specifically, asymptotic properties of the estimators and convergence properties of the algorithms are needed to support the choice of statistical methodologies for transfer learning applications. Finally, for SPC extensions, as mentioned earlier, there is substantial room for improvement using statistical transfer learning in both Phase I analysis and Phase II monitoring. In Phase I analysis, it is challenging but worthwhile to conduct statistical transfer learning for in-control parameter estimation, outlier detection, and change-point diagnosis, for an improved understanding of in-control situations. During Phase II monitoring, one of the major problems is constructing statistics with the assistance of statistical transfer learning for rapid anomaly detection. Issues such as the online updating of transferred parameters are also worth considering.

About the authors

Fugee Tsung is Professor in the Department of Industrial Engineering and Logistics Management (IELM) and Director of the Quality and Data Analytics Lab at the Hong Kong University of Science and Technology (HKUST). He is a Fellow of the Institute of Industrial Engineers (IIE), a Fellow of the American Society for Quality (ASQ), a Fellow of the American Statistical Association (ASA), an Academician of the International Academy for Quality (IAQ), and a Fellow of the Hong Kong Institution of Engineers (HKIE). He is Editor-in-Chief of the Journal of Quality Technology (JQT), a Department Editor of IIE Transactions, and an Associate Editor of Technometrics. He has authored over


100 refereed journal publications and is the winner of the Best Paper Award for IIE Transactions in 2003 and 2009. He received both his M.Sc. and Ph.D. from the University of Michigan, Ann Arbor, and his B.Sc. from National Taiwan University. His research interests include quality engineering and management in the manufacturing and service industries, statistical process control and monitoring, industrial statistics, and data analytics.

Ke Zhang is a Ph.D. candidate in the Department of Industrial Engineering and Logistics Management at the Hong Kong University of Science and Technology. He received a Bachelor's degree in Statistics from the University of Science and Technology of China in 2014. His research interests include statistical modeling, process control, and data mining.

Longwei Cheng is a Ph.D. candidate in the Department of Industrial Engineering and Logistics Management at the Hong Kong University of Science and Technology. He received a Bachelor's degree in Automation from the University of Science and Technology of China in 2014. His research interests include statistical modeling and quality control.

Zhenli Song is a Ph.D. candidate in the Department of Industrial Engineering and Logistics Management at the Hong Kong University of Science and Technology. He received a Bachelor's degree in Statistics from the University of Science and Technology of China in 2015. His research interests include statistical modeling and data mining.

Acknowledgments

The authors are grateful to Professor Steven Rigdon, Professor Xiaoming Huo, and the two anonymous reviewers for their comments and suggestions, which greatly improved our article.

Funding

Professor Tsung's research was supported by RGC GRF 16203917.

References

Argyriou, A., T. Evgeniou, and M. Pontil. 2007a. Multi-task feature learning. In Advances in neural information processing systems, B. Schölkopf, J. Platt, and T. Hoffman (Eds.), Vol. 19, pp. 41–48. Cambridge: MIT Press.

Blangiardo, M., and M. Cameletti. 2015. Spatial and spatio-temporal Bayesian models with R-INLA. John Wiley & Sons.

Blei, D. M., A. Kucukelbir, and J. D. McAuliffe. 2016. Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670.

Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan):993–1022.

Blitzer, J., M. Dredze, and F. Pereira. 2007, June. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL (Vol. 7, pp. 440–447). Stroudsburg: Association for Computational Linguistics.

Blitzer, J., R. McDonald, and F. Pereira. 2006, July. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 120–128). Stroudsburg: Association for Computational Linguistics.

Bonilla, E. V., K. M. A. Chai, and C. K. Williams. 2007, December. Multi-task Gaussian process prediction. In NIPS (Vol. 20, pp. 153–160). Vancouver: Neural Information Processing Systems.

Bouveyron, C., and J. Jacques. 2014. Adaptive mixtures of regressions: Improving predictive inference when population has changed. Communications in Statistics - Simulation and Computation 43 (10):2570–2592.

Breiman, L. 1996. Bagging predictors. Machine Learning 24 (2):123–140.

Castagliola, P., and F. Tsung. 2005. Autocorrelated SPC for non-normal situations. Quality and Reliability Engineering International 21 (2):131–161.

Cressie, N., and C. K. Wikle. 2015. Statistics for spatio-temporal data. Hoboken: John Wiley & Sons.

Cheng, L., F. Tsung, and A. Wang. 2017. A transfer learning perspective for modeling shape deviations in additive manufacturing. IEEE Robotics and Automation Letters 2 (4):1988–1993.

Dai, W., Q. Yang, G. R. Xue, and Y. Yu. 2008, July. Self-taught clustering. In Proceedings of the 25th International Conference on Machine Learning (pp. 200–207). New York: ACM.

Dai, W., Q. Yang, G. R. Xue, and Y. Yu. 2007, June. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning (pp. 193–200). New York: ACM.

Dalla Valle, L., and R. S. Kenett. 2015. Official statistics data integration for enhanced information quality. Quality and Reliability Engineering International 31 (7):1281–1300.

Efron, B., and R. J. Tibshirani. 1994. An introduction to the bootstrap. CRC Press.

Evgeniou, A., and M. Pontil. 2007. Multi-task feature learning. Advances in Neural Information Processing Systems 19:41.

Evgeniou, T., and M. Pontil. 2004, August. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 109–117). New York: ACM.

Freund, Y., and R. E. Schapire. 1995, March. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory (pp. 23–37). Heidelberg: Springer Berlin Heidelberg.

Gilks, W. R., S. Richardson, and D. Spiegelhalter (Eds.). 1995. Markov chain Monte Carlo in practice. Boca Raton: CRC Press.

Gonçalves, A. R., P. Das, S. Chatterjee, V. Sivakumar, F. J. Von Zuben, and A. Banerjee. 2014, November. Multi-task sparse structure learning. In Proceedings of the 23rd ACM International Conference on Conference on Information


and Knowledge Management (pp. 451–460). New York: ACM.

Gonçalves, A. R., F. J. Von Zuben, and A. Banerjee. 2016. Multi-task sparse structure learning with Gaussian copula models. Journal of Machine Learning Research 17 (33):1–30.

Huang, S., J. Li, K. Chen, T. Wu, J. Ye, X. Wu, and L. Yao. 2012. A transfer learning approach for network modeling. IIE Transactions 44 (11):915–931.

Huang, Q., H. Nouri, K. Xu, Y. Chen, S. Sosina, and T. Dasgupta. 2014. Statistical predictive modeling and compensation of geometric deviations of three-dimensional printed products. Journal of Manufacturing Science and Engineering 136 (6):061008.

Huang, Q., J. Zhang, A. Sabbaghi, and T. Dasgupta. 2015. Optimal offline compensation of shape shrinkage for three-dimensional printing processes. IIE Transactions 47 (5):431–441.

Jaiswal, S., J. Bunker, and L. Ferreira. 2010. Influence of platform walking on BRT station bus dwell time estimation: Australian analysis. Journal of Transportation Engineering 136 (12):1173–1179.

Jiang, J., and C. Zhai. 2007, June. Instance weighting for domain adaptation in NLP. In ACL (Vol. 7, pp. 264–271). Stroudsburg: Association for Computational Linguistics.

Jin, O., N. N. Liu, K. Zhao, Y. Yu, and Q. Yang. 2011, October. Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (pp. 775–784). New York: ACM.

Kamishima, T., M. Hamasaki, and S. Akaho. 2009, December. TrBagg: A simple transfer learning method and its application to personalization in collaborative tagging. In Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on (pp. 219–228). Piscataway: IEEE.

Kenett, R. S. 2016. On generating high InfoQ with Bayesian networks. Quality Technology and Quantitative Management 13 (3).

Kenett, R. S., and G. Shmueli. 2016. Information quality: The potential of data and analytics to generate knowledge. Hoboken: John Wiley and Sons.

Lawrence, N. D., and J. C. Platt. 2004, July. Learning to learn with the informative vector machine. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 65). New York: ACM.

Lee, S. I., V. Chatalbashev, D. Vickrey, and D. Koller. 2007, June. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning (pp. 489–496). New York: ACM.

Li, J., C. Zou, and F. Tsung. 2009, August. Monitoring multivariate binomial data via log-linear models. In Proceedings of the 1st INFORMS International Conference on Service Science. Catonsville: INFORMS.

Liao, X., Y. Xue, and L. Carin. 2005, August. Logistic regression with an auxiliary data source. In Proceedings of the 22nd International Conference on Machine Learning (pp. 505–512). New York: ACM.

Lin, D., X. An, and J. Zhang. 2013. Double-bootstrapping source data selection for instance-based transfer learning. Pattern Recognition Letters 34 (11):1279–1285.

Liu, J., S. Ji, and J. Ye. 2009, June. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 339–348). Corvallis: AUAI Press.

Liu, H., M. Palatucci, and J. Zhang. 2009, June. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 649–656). New York: ACM.

Lu, J., V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang. 2015. Transfer learning using computational intelligence: A survey. Knowledge-Based Systems 80:14–23.

Luan, H., and Q. Huang. 2015, August. Predictive modeling of in-plane geometric deviation for 3D printed freeform products. In Automation Science and Engineering (CASE), 2015 IEEE International Conference on (pp. 912–917). Piscataway: IEEE.

Lütkepohl, H. 2005. New introduction to multiple time series analysis. Springer Science & Business Media.

Maurer, A., M. Pontil, and B. Romera-Paredes. 2013, January. Sparse coding for multitask and transfer learning. In ICML (2) (pp. 343–351). Princeton: The International Machine Learning Society.

Mei, Y. 2010. Efficient scalable schemes for monitoring a large number of data streams. Biometrika 97 (2):419–433.

Pan, S. J., and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10):1345–1359.

Pan, S. J., V. W. Zheng, Q. Yang, and D. H. Hu. 2008, July. Transfer learning for WiFi-based indoor localization. In Association for the Advancement of Artificial Intelligence (AAAI) Workshop (p. 6). Palo Alto: The Association for the Advancement of Artificial Intelligence.

Pan, S. J., J. T. Kwok, and Q. Yang. 2008, July. Transfer learning via dimensionality reduction. In AAAI (Vol. 8, pp. 677–682). Palo Alto: The Association for the Advancement of Artificial Intelligence.

Pan, W., E. W. Xiang, N. N. Liu, and Q. Yang. 2010, July. Transfer learning in collaborative filtering for sparsity reduction. In AAAI (Vol. 10, pp. 230–235). Palo Alto: The Association for the Advancement of Artificial Intelligence.

Pu, F., J. Ma, D. Zeng, X. Xu, and N. Chen. 2015. Early warning of abrupt displacement change at the Yemaomian landslide of the Three Gorge Region, China. Natural Hazards Review 16 (4):04015004.

Psarakis, S., and G. E. A. Papaleonida. 2007. SPC procedures for monitoring autocorrelated processes. Quality Technology and Quantitative Management 4 (4):501–540.

Raina, R., A. Battle, H. Lee, B. Packer, and A. Y. Ng. 2007, June. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning (pp. 759–766). New York: ACM.

Ren, H., J. Long, Z. Gao, and P. Orenstein. 2012. Passenger assignment model based on common route in congested

Dow

nloa

ded

by [

HK

UST

Lib

rary

] at

01:

30 1

5 D

ecem

ber

2017

128 F. TSUNG ET AL.

transit networks. Journal of Transportation Engineering 138(12):1484–1494.

Rubinstein, R. Y., and D. P. Kroese. 2016. Simulation and the Monte Carlo method. Hoboken: John Wiley and Sons.

Samarov, D. V., D. Allen, J. Hwang, Y. Lee, and M. Litorja. 2016. A coordinate descent based approach to solving the sparse group elastic net. Technometrics (just-accepted).

Samarov, D. V., J. Hwang, and M. Litorja. 2015. The spatial lasso with applications to unmixing hyperspectral biomedical images. Technometrics 57 (4):503–513.

Schwaighofer, A., V. Tresp, and K. Yu. 2004. Learning Gaussian process kernels via hierarchical Bayes. In Advances in Neural Information Processing Systems (pp. 1209–1216). Vancouver: Neural Information Processing Systems.

Setti, J. R., and B. G. Hutchinson. 1994. Passenger-terminal simulation model. Journal of Transportation Engineering 120 (4):517–535.

Shao, C., J. Ren, H. Wang, J. J. Jin, and S. J. Hu. 2017. Improving machined surface shape prediction by integrating multi-task learning with cutting force variation modeling. Journal of Manufacturing Science and Engineering 139 (1):011014.

Song, P., W. Zheng, J. Liu, J. Li, and X. Zhang. 2015, November. A novel speech emotion recognition method via transfer PCA and sparse coding. In Chinese Conference on Biometric Recognition (pp. 393–400). Berlin: Springer International Publishing.

Song, Z., K. Zhang, and F. Tsung. 2017. A multi-task learning approach for improved forecast of passenger inflow rates into a URT system. Working paper.

Taylor, M. E., and P. Stone. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research 10 (Jul):1633–1685.

Tseng, S. T., N. J. Hsu, and Y. C. Lin. 2016. Joint modeling of laboratory and field data with application to warranty prediction for highly reliable products. IIE Transactions 48 (8):710–719.

Wainwright, M. J., and M. I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1 (1–2):1–305.

Wang, A., S. Song, Q. Huang, and F. Tsung. 2017. In-plane shape-deviation modeling and compensation for fused deposition modeling processes. IEEE Transactions on Automation Science and Engineering 14 (2):968–976.

Wei, Y., Y. Zheng, and Q. Yang. 2016, August. Transfer knowledge between cities. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1905–1914). New York: ACM.

Weiss, K., T. M. Khoshgoftaar, and D. Wang. 2016. Transfer learning techniques. In Big Data Technologies and Applications (pp. 53–99). Cham: Springer International Publishing.

Wu, P., and T. G. Dietterich. 2004, July. Improving SVM accuracy by training on auxiliary data sources. In Proceedings of the Twenty-First International Conference on Machine Learning (p. 110). New York: ACM.

Xu, Q., and Q. Yang. 2011. A survey of transfer and multitask learning in bioinformatics. Journal of Computing Science and Engineering 5 (3):257–268.

Yang, X., and L. Chen. 2010. Using multi-temporal remote sensor imagery to detect earthquake-triggered landslides. International Journal of Applied Earth Observation and Geoinformation 12 (6):487–495.

Yao, Y., and G. Doretto. 2010, June. Boosting for transfer learning with multiple sources. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (pp. 1855–1862). Piscataway: IEEE.

Yu, K., V. Tresp, and A. Schwaighofer. 2005, August. Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning (pp. 1012–1019). New York: ACM.

Zhai, C. 2008. Statistical language models for information retrieval. Synthesis Lectures on Human Language Technologies 1 (1):1–141.

Zhang, K., and F. Tsung. 2017. A multi-task learning framework integrating non-contemporaneous autoregressive models. Working paper.

Zhang, Y., and D. Y. Yeung. 2014. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD) 8 (3):12.

Zhang, Y., D. Y. Yeung, and Q. Xu. 2010. Probabilistic multi-task feature selection. In Advances in Neural Information Processing Systems (pp. 2559–2567). Vancouver: Neural Information Processing Systems.

Zhao, W. X., J. Jiang, J. Weng, J. He, E. P. Lim, H. Yan, and X. Li. 2011, April. Comparing Twitter and traditional media using topic models. In European Conference on Information Retrieval (pp. 338–349). Heidelberg: Springer Berlin Heidelberg.

Zou, N., Y. Zhu, J. Zhu, M. Baydogan, W. Wang, and J. Li. 2015. A transfer learning approach for predictive modeling of degenerate biological systems. Technometrics 57 (3):362–373.


QUALITY ENGINEERING, VOL. , NO. , –https://doi.org/./..

Discussion on “Statistical transfer learning: A review and some extensions to statistical process control”

Xuemin Zi (a) and Changliang Zou (b)

(a) School of Science, Tianjin University of Technology and Education, China; (b) Institute of Statistics and LPMC, Nankai University, China

We would like to congratulate the authors on an interesting and important review of industrial statistics applications of transfer learning techniques. This discussion is timely because there is an urgent need to deal with datasets from multiple domains in various industries (not only manufacturing but also service) and to handle related tasks in target domains by transferring previously acquired knowledge from source domains. In this discussion, we will focus on some recent developments in the field of statistics that may essentially be categorized as transfer learning, so that similar methodologies can be applied in related applications.

Regularization-based model estimation across multiple classes

As the authors' survey in the third section indicates, an effective and popular framework for transfer learning is to apply regularization-based methods, assuming that models from different sources share certain characteristics. Some commonly used penalties were summarized there. Recently, the combination of fused lasso (Tibshirani et al. 2005) and standard lasso penalties has been shown to be quite efficient for model estimation across multiple classes (Danaher, Wang, and Witten 2014). For example, consider the problem of estimating multiple related Gaussian graphical models from a high-dimensional dataset with observations belonging to distinct classes. Graphical models are especially of interest in the analysis of modern social network data and can provide a useful tool for visualizing the relationships between individuals and for generating social hypotheses. The standard formulation for estimating a Gaussian graphical model assumes that each observation

CONTACT Changliang Zou [email protected] Institute of Statistics and LPMC, Nankai University, Nankai Qu, China.

is drawn from the same distribution (Friedman, Hastie, and Tibshirani 2007). However, in many datasets the observations may correspond to several distinct classes (sources or domains), so the assumption that all observations are drawn from the same distribution is inappropriate. Estimating separate graphical models for the distinct classes does not exploit the similarity between the true graphical models. In addition, estimating a single graphical model from the pooled samples ignores the fact that we do not expect the true graphical models to be identical, and that the differences between the graphical models may be of interest. Guo et al. (2011) proposed a penalized log-likelihood approach whose penalty term is essentially of the form

P(B) = λ ∑_i ( ∑_k B_{ki}² )^{1/2},

where we slightly abuse notation here by using the same symbols as the authors' in the third section, but this should not cause any confusion. Clearly, this is basically an idea of statistical transfer learning, and the penalty function is similar to the penalties mentioned by the authors.

Danaher et al. (2014) pointed out that the approach using P(B) has some disadvantages. One is that it uses just one tuning parameter and so cannot separately control the sparsity level and the extent of network similarity. The other is that, in cases where we expect edge values as well as the network structure to be similar between classes, the proposal of Guo et al. (2011) may not be well suited, because it encourages shared patterns of sparsity but does not encourage similarity in the signs and values of the non-zero edges. Danaher et al. (2014) instead employ the generalized fused lasso penalty

P̃(B) = λ₁ ∑_i ∑_k |B_{ki}| + λ₂ ∑_i ∑_{k<k′} |B_{ki} − B_{k′i}|,

© Taylor & Francis



where λ₁ and λ₂ are non-negative tuning parameters. Like the lasso, the use of P̃(·) results in sparse estimates B1, …, BK when the tuning parameter λ₁ is large. In addition, many elements of B1, …, BK will be identical across classes when the tuning parameter λ₂ is large (Tibshirani et al. 2005). Thus, P̃(B) borrows information aggressively across classes, encouraging not only a similar network structure but also similar edge values.
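To make the behavior of P̃(B) concrete, its two terms can be evaluated directly. The sketch below is ours, not an implementation from Danaher et al. (2014): the function name and the toy matrix are illustrative, with the K class-wise estimates stacked as rows of a NumPy array.

```python
import numpy as np

def fused_lasso_penalty(B, lam1, lam2):
    """Generalized fused lasso penalty in the spirit of Danaher et al. (2014).

    B: array of shape (K, p) holding, for each of K classes, the p edge
       parameters B_{ki} being penalized.
    Returns lam1 * sum_i sum_k |B_ki|
          + lam2 * sum_i sum_{k<k'} |B_ki - B_k'i|.
    """
    K = B.shape[0]
    sparsity = np.abs(B).sum()            # lasso term: drives entries to zero
    similarity = 0.0
    for k in range(K):
        for kp in range(k + 1, K):        # all pairs k < k'
            similarity += np.abs(B[k] - B[kp]).sum()
    return lam1 * sparsity + lam2 * similarity
```

Increasing lam1 pushes individual entries to zero, while increasing lam2 pushes corresponding entries to agree across classes, which is exactly the "similar structure and similar edge values" behavior described above.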

Similar treatments are also applicable when the underlying assumption that graphical models do not change over time is violated. There is growing evidence that network structures are often non-stationary, so there is a clear need to quantify dynamic changes in network structure over time. Specifically, there is a need to estimate a network at each observation in order to accurately quantify temporal diversity. Clearly, estimating time-varying networks or graphical models can again be viewed as a statistical transfer learning problem, because we need to borrow information about the data structure across time. To date, the most commonly used approach to achieve this goal involves the use of sliding windows or kernel-based methods (Esposito et al. 2003). However, as pointed out by some authors, such as Monti et al. (2014), while sliding windows are a valuable tool for investigating the high-level dynamics of networks, there are two main issues associated with their use. First, the window length can be a difficult parameter to tune. Second, the use of sliding windows must be accompanied by an additional mechanism to determine whether variations in edge structure are significant. In light of this, Monti et al. (2014) used a variant of P̃(B),

P̌(B) = λ₁ ∑_i ∑_k |B_{ki}| + λ₂ ∑_i ∑_{k≥2} |B_{ki} − B_{k−1,i}|,

which encourages the estimation procedure to produce estimates with two properties: sparsity and temporal homogeneity. With the help of P̌(B), we are able to obtain individual estimates of graphical models at each time point, as opposed to a single model for the entire time series, allowing one to fully characterize the dynamic evolution of networks over time.
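The temporal variant differs from P̃(B) only in which pairs are fused: consecutive time points instead of all pairs of classes. A minimal sketch under the same illustrative conventions (hypothetical function name; rows of B indexed by time):

```python
import numpy as np

def temporal_fused_penalty(B, lam1, lam2):
    """Temporal fused penalty in the spirit of Monti et al. (2014):
    lasso sparsity plus fusion of each estimate to its immediate
    predecessor in time.

    B: array of shape (T, p); row k holds the p edge parameters at time k.
    Returns lam1 * sum |B_ki| + lam2 * sum_{k>=2} sum_i |B_ki - B_{k-1,i}|.
    """
    sparsity = np.abs(B).sum()
    # np.diff along axis 0 gives the consecutive differences B_k - B_{k-1}
    temporal = np.abs(np.diff(B, axis=0)).sum()
    return lam1 * sparsity + lam2 * temporal
```

A large lam2 here forces consecutive networks to coincide, producing piecewise-constant network evolution with abrupt, interpretable change points, instead of the smooth averaging a sliding window induces.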

Inference with transfer-learning techniques

The aforementioned studies focus mainly on model estimation by connecting and transferring knowledge among multiple statistical models and data sources. We would like to briefly discuss some potential uses of transfer learning in statistical hypothesis testing. Here, we take large-scale simultaneous hypothesis testing problems as an example, in which thousands or even tens of thousands of cases are considered together. This problem has become a familiar feature in scientific fields such as biology, medicine, genetics, neuroscience, economics, and finance; it has been well studied, and many solutions have been proposed (Storey and Tibshirani 2003). The most commonly used approaches start with a list of N p-values pi, one for each hypothesis Hi, and reject all hypotheses with a p-value below a (possibly random, i.e., data-dependent) threshold q∗. The goal is to control a measure of Type I error at level α. Traditionally, this measure has been the Family-Wise Error Rate (FWER), but for many applications this is too stringent, and over the last 20 years the False Discovery Rate (FDR) has become a popular choice, as it is more permissive and adaptive (Benjamini and Hochberg 1995).

In many real-world applications, beyond p-values, side information, represented by covariates Xi, is often available for each hypothesis. This side information may be related to the different power of the tests, or to different prior probabilities of the null hypothesis being true. For example, previous studies may suggest that some null hypotheses are more or less likely to be false; similarly, in spatially structured problems, non-null hypotheses are more likely to be clustered than true null hypotheses. It is thus anticipated that exploiting structural prior information will improve the performance of conventional multiple testing procedures. Such covariates are often apparent to domain scientists. While side information is likely to be irrelevant in the context of single hypothesis testing, failing to take it into account in large-scale simultaneous testing problems tends to sacrifice power for detecting statistical significance. Covariate-adjusted, or side-information-exploiting, FDR estimation and control is thus essentially a statistical transfer learning problem. It has recently become an active research topic, and several attempts have been made in the literature to incorporate side information. For instance, methods that up-weight or down-weight hypotheses appeared in Genovese, Roeder, and Wasserman (2006) and Hu, Zhao, and Zhou (2010),


among many others, for example using

Q_i = p_i / ω_i,   i = 1, …, N,

instead of the pi's themselves with the popular BH procedure (Benjamini and Hochberg 1995), where the ωi's are pre-chosen weights (related to Xi) satisfying ∑ωi = 1. If the weights are chosen a priori, that is, without looking at the p-values, then the weighted BH procedure also controls the FDR. In particular, most of the literature to date considers only the case of deterministic weights. These results are very valuable, since weighted multiple testing procedures have been shown to be robust to weight misspecification (Genovese, Roeder, and Wasserman 2006): choosing good weights can lead to huge increases in power, yet bad weights will only slightly decrease power compared to the unweighted procedure. In contrast, some works allow the weights to depend on the p-values in a data-driven way, while still controlling the FDR (Scott et al. 2015). A different, two-stage approach, arising mainly from the microarray literature, uses the prior information in a filtering stage to remove a subset of features that seem to generate uninformative signals, and then applies a multiple testing procedure to the remaining features that have passed the filter in the selection stage; see, for example, Bourgon, Gentleman, and Huber (2010), Sarkar, Chen, and Guo (2013), and the references therein. The success of these solutions is mainly due to the assistance of transfer learning: information transferred from source domains can improve the performance of statistical inference in the target domain. In the statistical literature, recent interest concentrates on the optimal selection of tuning parameters, such as the filtering thresholds in the two-stage procedure; see, for example, Du and Zhang (2014).
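As a sketch of how such weights enter the computation, the weighted BH step can be written down in a few lines. This is an illustrative implementation of the generic weighted BH idea, not code from any of the cited papers; following the common convention, the weights are rescaled to average one before dividing:

```python
import numpy as np

def weighted_bh(pvals, weights, alpha=0.1):
    """Weighted Benjamini-Hochberg: apply the BH step-up rule to the
    weighted p-values Q_i = p_i / w_i.

    Weights are rescaled to average one (the usual convention for
    weighted BH). Returns a boolean rejection mask, index-aligned
    with pvals.
    """
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w * len(w) / w.sum()                 # rescale so the mean weight is 1
    q = p / w                                # weighted p-values
    n = len(q)
    order = np.argsort(q)
    ranked = q[order]
    # BH: find the largest k with Q_(k) <= alpha * k / n
    thresh = alpha * np.arange(1, n + 1) / n
    below = ranked <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True        # reject the k smallest Q_i
    return reject
```

With all weights equal this reduces to the ordinary BH procedure; up-weighting a hypothesis shrinks its Q_i and makes it easier to reject, which is precisely how prior knowledge about likely non-nulls is transferred into the testing step.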

In summary, given that advances in data acquisition techniques have made it increasingly necessary to handle datasets from multiple domains/classes, statistical modeling and inference that can make efficient use of previously acquired knowledge from source domains or across classes have become critical in a variety of industrial applications. The review of statistical models and methodologies in this direction is very timely, and more research effort is needed to establish statistical properties in a systematic framework that ensures the right use of transferred knowledge.

About the authors

Xuemin Zi is Professor in the School of Science at Tianjin University of Technology and Education. Her research interests include statistical process control and design of experiments.

Changliang Zou is Professor in the Institute of Statistics at Nankai University. He has authored over 90 refereed journal publications in statistics and industrial engineering, in journals such as Journal of the American Statistical Association, Annals of Statistics, Biometrika, Technometrics, Journal of Quality Technology, Naval Research Logistics, and IIE Transactions. He currently serves as an associate editor of Technometrics and Statistica Sinica, and is on the editorial board of the Journal of Quality Technology.

Acknowledgments

Finally, we would like to thank the authors, and the editors, for the opportunity to comment.

Funding

We would like to express our appreciation for the support of the NNSF of China, Grant 11771332.

References

Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57:289–300.

Bourgon, R., R. Gentleman, and W. Huber. 2010. Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107:9546–51.

Danaher, P., P. Wang, and D. M. Witten. 2014. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B 76:373–97.

Du, L., and C. M. Zhang. 2014. Single-index modulated multiple testing. Annals of Statistics 42:1262–1311.

Esposito, F., E. Seifritz, E. Formisano, R. Morrone, T. Scarabino, G. Tedeschi, and F. Di Salle. 2003. Real-time independent component analysis of fMRI time-series. NeuroImage 20:2209–24.

Friedman, J., T. Hastie, and R. Tibshirani. 2007. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9:432–41.

Genovese, C. R., K. Roeder, and L. Wasserman. 2006. False discovery control with p-value weighting. Biometrika 93:509–24.

Guo, J., E. Levina, G. Michailidis, and J. Zhu. 2011. Joint estimation of multiple graphical models. Biometrika 98:1–15.


Hu, J. X., H. Zhao, and H. Zhou. 2010. False discovery rate control with groups. Journal of the American Statistical Association 105:1215–27.

Monti, R. P., P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. 2014. Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage 103:427–43.

Sarkar, S. K., J. Chen, and W. Guo. 2013. Multiple testing in a two-stage adaptive design with combination tests controlling FDR. Journal of the American Statistical Association 108:1385–1401.

Scott, J. G., R. C. Kelly, M. A. Smith, P. Zhou, and R. E. Kass. 2015. False discovery-rate regression: An application to neural synchrony detection in primary visual cortex. Journal of the American Statistical Association 110:459–71.

Storey, J. D., and R. Tibshirani. 2003. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100:9440–5.

Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B 67:91–108.



Discussion of “Statistical transfer learning: A review and some extensions to statistical process control”

Panagiotis Tsiamyrtzis

Department of Statistics, Athens University of Economics and Business, Athens, Greece

I would like to start by congratulating the authors on their nice presentation of the very interesting topic of Statistical Transfer Learning (STL), and especially for bringing to the fore its relation to the area of Statistical Process Control (SPC). In what follows, some further aspects of statistical transfer learning will be presented, and others discussed in more depth, aiming to provide a more well-rounded view of this area.

Aspects of statistical transfer learning

Transfer learning attempts to carry over information among tasks and improve the learning procedure. This sharing of information across tasks (depending on the problem under study) can be done either in parallel or sequentially (see Figure 1).

Parallel STL: When various tasks need to be learned simultaneously. These tasks might be different spatial/temporal cases of the same task, or tasks from different domains. So knowledge transfer is multi-directional, as in the landslide monitoring example.

Sequential STL: When we have one or more tasks that we have already learned and we are interested in transferring this available knowledge to the learning of a new task. So we have unidirectional transfer, as in the 3D printing quality control example.

The former (also known as multi-task learning in machine learning) has the advantage that we have a bigger dataset to work with, but the learning must be done simultaneously. The latter, on the other hand, allows knowledge attained from past (source) data to be carried over to the analysis of new (target) data.

CONTACT Panagiotis Tsiamyrtzis [email protected] Department of Statistics, Athens University of Economics and Business, Patission Street, Athens, Greece.

Negative transfer learning

In learning a new (target) task, one can either:

(i) learn from scratch, that is, use just the available data and ignore any related knowledge; or

(ii) use transfer learning to carry over information from the source task(s) in an attempt to improve performance in learning the target task.

Is transfer learning always preferable? The answer is no. The scenario where transfer learning can potentially make things worse is known as negative transfer (Torrey and Shavlik 2009): attempting to transfer knowledge from the source to the target task not only fails to improve performance but may actually decrease it. To avoid negative transfer, the user needs to take care that the tasks are similar enough and that the transfer method is well leveraged. This is quite demanding for an autonomous system, and methods designed to protect against negative transfer (allowing only “safe” transfer) will most likely reduce the benefit of transfer learning, compared to a method that does “aggressive” transfer learning, which will have excellent performance on similar tasks but will allow negative transfer in dissimilar scenarios.

Statistical methods in transfer learning

Transfer learning is not really a new concept in the area of statistics. The Bayesian approach, for example, can be seen as a transfer learning mechanism. The idea of a prior distribution, along with hierarchical modeling, naturally fits this purpose. As a representative example


Figure . Parallel (a) versus sequential (b) statistical transfer learning.

one can refer to the power priors (Ibrahim and Chen 2000), which play the role of a (sequential) statistical transfer learning mechanism: if D0 are the source task data, we form the prior:

π(θ | D0, α0) ∝ [L(θ | D0)]^{α0} π0(θ)

and then upon observing the target data D1 we obtainthe posterior:

π(θ | D0, D1, α0) ∝ L(θ | D1) [L(θ | D0)]^{α0} π0(θ)

where the value of α0 ∈ [0, 1] determines the effect (reflecting the similarity between the source and the target task) of the source data (D0) on the posterior distribution of the parameter θ, once we use the target data (D1).
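For a conjugate normal-mean model the power-prior update has a closed form, which makes the role of α0 transparent. The sketch below is purely illustrative (known variance, flat initial prior π0), not an example from the discussion itself:

```python
import numpy as np

def power_prior_posterior(d0, d1, alpha0, sigma2=1.0):
    """Posterior mean and variance of theta for N(theta, sigma2) data
    under a power prior [L(theta | D0)]^alpha0 with a flat pi0(theta).

    alpha0 in [0, 1] discounts the source data D0: alpha0 = 0 ignores
    D0 entirely, while alpha0 = 1 fully pools D0 with the target data D1.
    """
    n0, n1 = len(d0), len(d1)
    # Discounted-sample-size interpretation: D0 contributes alpha0 * n0
    # effective observations to the posterior precision and mean.
    prec = (alpha0 * n0 + n1) / sigma2
    mean = (alpha0 * n0 * np.mean(d0) + n1 * np.mean(d1)) / (alpha0 * n0 + n1)
    return mean, 1.0 / prec
```

Sliding α0 from 0 to 1 interpolates between learning the target task from scratch and fully pooling the source data, which is exactly the transfer-versus-no-transfer trade-off discussed here.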

Can the Bayesian approach allow negative transfer? If the source and target tasks have been generated from very different parameter values, then the posterior in the target task can be negatively affected by the prior (which was set from the source task), mainly when we have low volumes of data, since with big data the effect of the prior diminishes. In any case, prior sensitivity analysis can help examine whether the prior used affects the posterior negatively or not.

Transfer learning and SPC

The article demonstrates ways in which SPC methods can provide tools for statistical transfer learning. An interesting question arises if we invert this and ask whether SPC methods can benefit from the use of statistical transfer learning. For example, in frequentist-based control charting (Shewhart charts, CUSUM, EWMA, etc.) a standard practice is to employ a Phase I/II split, where in Phase I we perform learning (calibration) while in Phase II we perform testing. So learning stops at the end of Phase I. The transfer learning philosophy would suggest carrying learning over into Phase II, incorporating the information from new data as they become available. Such a proposal is feasible via a Bayesian SPC scheme (see, e.g., Tsiamyrtzis and Hawkins 2005, 2010), which can be set up as a sequentially updated mechanism allowing parameter transfer learning as data become available progressively, providing solutions even when we have small amounts of data and breaking free from the usual Phase I/II constraint.

About the author

Panagiotis Tsiamyrtzis received the BSc degree in mathematics from the Aristotle University of Thessaloniki, Greece. He obtained the MSc and PhD degrees in statistics from the School of Statistics at the University of Minnesota, USA, working with Douglas M. Hawkins in the field of SPC. In 2004 he joined the Department of Statistics at the Athens University of Economics and Business, where he is currently Associate Professor. He is also a Research Associate Professor in the Department of Computer Science, University of Houston, TX, USA. His research interests include Bayesian statistical process control and statistical aspects of computational physiology, in which he has published more than 50 papers.

References

Ibrahim, J. G., and M. H. Chen. 2000. Power prior distributions for regression models. Statistical Science 15 (1):46–60.

Torrey, L., and J. Shavlik. 2009. Transfer learning. In Handbook of research on machine learning applications, eds. E. Soria, J. Martin, R. Magdalena, M. Martinez, and A. Serrano, 242–64. IGI Global. http://pages.cs.wisc.edu/~shavlik/abstracts/torrey.handbook09.abstract.html

Tsiamyrtzis, P., and D. M. Hawkins. 2005. A Bayesian scheme to detect changes in the mean of a short-run process. Technometrics 47 (4):446–56. doi:10.1198/004017005000000346.

Tsiamyrtzis, P., and D. M. Hawkins. 2010. Bayesian startup phase mean monitoring of an autocorrelated process that is subject to random sized jumps. Technometrics 52 (4):438–52. doi:10.1198/TECH.2010.08053.


Rejoinder

Fugee Tsung, Ke Zhang, Longwei Cheng, and Zhenli Song

Department of Industrial Engineering and Logistics Management, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong

We would like to thank the discussants for their insightful comments on a number of important areas in Statistical Transfer Learning (STL). Here we would like to take this opportunity to summarize some of their key ideas for further discussion and analysis.

First, Tsiamyrtzis pointed out that STL techniques can be categorized as parallel STL or sequential STL by the way they share information across tasks. We agree with the summary that parallel STL, also known as multi-task learning, is normally multi-directional, while sequential STL involves specific, unidirectional transfer. We also discussed the difference between multi-task learning and typical transfer learning in our original article.

Both discussion articles mention the statistical methods adopted in transfer learning. For Bayesian-based methods in STL, Tsiamyrtzis indicates that the idea of a prior distribution, along with hierarchical Bayesian modeling, naturally fits the purpose of transfer learning: sharing information among parameters. We present the same idea in our article, and we expect an integration of Bayesian methods and STL applications for SPC. For regularization-based methods, we acknowledge Zi and Zou's (2018) introduction to some recent developments of regularization-based methods in high-dimensional statistics that may be further adopted in STL applications, such as the fused lasso (Tibshirani et al. 2005) for network-structured data.

Tsiamyrtzis raises an important issue in transfer learning: negative transfer (Torrey and Shavlik 2009). Indeed, inappropriate transfer across tasks may have negative effects on learning performance, which has been widely discussed in machine learning

CONTACT Fugee Tsung [email protected] Department of Industrial Engineering and Logistics Management, Hong Kong University of Science andTechnology, Clear Water Bay, Hong Kong.

studies (Pan and Yang 2010). The negative effects normally originate from a mismatch between the dataset and the assumed model or pattern. For example, as Tsiamyrtzis specifies, in Bayesian-based transfer learning the transfer is enabled by a prior distribution that assumes similarity between the source and target domains. If the source and target tasks have been generated from very different parameters, the posterior in the target task may be negatively affected by the prior. Similarly, in regularization-based transfer learning, if the tasks do not present a common sparsity pattern as assumed, the transfer learning performance might be compromised or even worsened.

The discussants also mention further applications of STL in statistical inference and SPC. Zi and Zou (2018) suggest the potential use of STL in hypothesis testing. We agree with their comments and are glad to see the potential improvement that STL could bring to multiple testing problems. On the other hand, Tsiamyrtzis notes that SPC methods may benefit from the use of STL by incorporating information across Phase I and Phase II, for example in a Bayesian manner. We appreciate this important suggestion for future research. This idea is similar in spirit to online learning (Hoffman, Bach, and Blei 2010).

In summary, STL provides an efficient framework to extend and improve SPC methods by combining previously acquired knowledge across classes/sources, which meets the increasing necessity of fusing different datasets in modern industrial and service applications. We thank the discussants for their valuable comments, and hope the discussion articles will identify some directions that warrant future research for SPC and quality improvement.


References

Hoffman, M., F. R. Bach, and D. M. Blei. 2010. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, 856–64. Vancouver: Neural Information Processing Systems.

Pan, S. J., and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10):1345–59. doi:10.1109/TKDE.2009.191.

Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B 67:91–108. doi:10.1111/j.1467-9868.2005.00490.x.

Torrey, L., and J. Shavlik. 2009. Transfer learning. In Handbook of research on machine learning applications, eds. E. Soria, J. Martin, R. Magdalena, M. Martinez, and A. Serrano, 242–64. IGI Global.

Zi, X., and C. Zou. 2018. Discussion on “Statistical transfer learning: A review and some extensions to statistical process control.” Quality Engineering 30 (1):129–132.
