
Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators


WILEY SERIES IN PROBABILITY AND STATISTICS

Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels

A complete list of the titles in this series appears at the end of this volume.


Tailen Hsing
Professor, Department of Statistics, University of Michigan, USA

Randall Eubank
Professor Emeritus, School of Mathematical and Statistical Sciences, Arizona State University, USA


This edition first published 2015
© 2015 John Wiley & Sons, Ltd

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data applied for.

A catalogue record for this book is available from the British Library.

ISBN: 9780470016916

Set in 10/12pt TimesLTStd by Laserwords Private Limited, Chennai, India



To Our Families


Contents

Preface

1 Introduction
1.1 Multivariate analysis in a nutshell
1.2 The path that lies ahead

2 Vector and function spaces
2.1 Metric spaces
2.2 Vector and normed spaces
2.3 Banach and 𝕃^p spaces
2.4 Inner product and Hilbert spaces
2.5 The projection theorem and orthogonal decomposition
2.6 Vector integrals
2.7 Reproducing kernel Hilbert spaces
2.8 Sobolev spaces

3 Linear operators and functionals
3.1 Operators
3.2 Linear functionals
3.3 Adjoint operator
3.4 Nonnegative, square-root, and projection operators
3.5 Operator inverses
3.6 Fréchet and Gâteaux derivatives
3.7 Generalized Gram–Schmidt decompositions

4 Compact operators and singular value decomposition
4.1 Compact operators
4.2 Eigenvalues of compact operators
4.3 The singular value decomposition
4.4 Hilbert–Schmidt operators
4.5 Trace class operators
4.6 Integral operators and Mercer's Theorem
4.7 Operators on an RKHS
4.8 Simultaneous diagonalization of two nonnegative definite operators

5 Perturbation theory
5.1 Perturbation of self-adjoint compact operators
5.2 Perturbation of general compact operators

6 Smoothing and regularization
6.1 Functional linear model
6.2 Penalized least squares estimators
6.3 Bias and variance
6.4 A computational formula
6.5 Regularization parameter selection
6.6 Splines

7 Random elements in a Hilbert space
7.1 Probability measures on a Hilbert space
7.2 Mean and covariance of a random element of a Hilbert space
7.3 Mean-square continuous processes and the Karhunen–Loève Theorem
7.4 Mean-square continuous processes in 𝕃^2(E, ℬ(E), μ)
7.5 RKHS valued processes
7.6 The closed span of a process
7.7 Large sample theory

8 Mean and covariance estimation
8.1 Sample mean and covariance operator
8.2 Local linear estimation
8.3 Penalized least-squares estimation

9 Principal components analysis
9.1 Estimation via the sample covariance operator
9.2 Estimation via local linear smoothing
9.3 Estimation via penalized least squares

10 Canonical correlation analysis
10.1 CCA for random elements of a Hilbert space
10.2 Estimation
10.3 Prediction and regression
10.4 Factor analysis
10.5 MANOVA and discriminant analysis
10.6 Orthogonal subspaces and partial cca

11 Regression
11.1 A functional regression model
11.2 Asymptotic theory
11.3 Minimax optimality
11.4 Discretely sampled data

References

Index

Notation Index


Preface

This book aims to provide a compendium of the key mathematical concepts and results that are relevant for the theoretical development of functional data analysis (fda). As such, it is not intended to provide a general introduction to fda per se and, accordingly, we have not attempted to catalog the volumes of fda research work that have flowed at a brisk pace into the statistics literature over the past 15 years or so. Readers might therefore find it helpful to read the present text alongside other books on fda, such as Ramsay and Silverman (2005), which provide more thorough and practical developments of the topic.

This project grew out of our own struggle in acquiring the theoretical foundations for fda research in diverse fields of mathematics and statistics. With that in mind, the book strives to be self-contained. Rigorous proofs are provided for most of the results that we present. Nonetheless, a solid mathematics background at a graduate level is needed to be able to appreciate the content of the text. In particular, the reader is assumed to be familiar with linear algebra and real analysis and to have taken a course in measure theoretic probability. With this proviso, the material in the book would be suitable for a one-semester, special-topics class for advanced graduate students.

Functional data analysis is, from our perspective, the statistical analysis of sample path data observed from continuous time stochastic processes. Thus, we are dealing with random functions whose realizations fall into some suitable (large) collection of functions. This makes an overview of function space theory a natural starting point for our treatment of fda. Accordingly, we begin with that topic in Chapter 2. There we develop essential concepts such as Sobolev and reproducing kernel Hilbert spaces that play pivotal roles in subsequent chapters. We also lay the foundation that is needed to understand the essential mathematical properties of bounded operators on Banach and, in particular, Hilbert spaces.

Our treatment of operator theory is broken into three chapters. The first of these, Chapter 3, deals with basic concepts such as adjoint, inverse, and projection operators. Then, Chapter 4 investigates the spectral theory that underlies compact operators in some detail. Here, we present both the typical eigenvalue/eigenvector expansion for self-adjoint operators and the somewhat less common singular value expansion that applies in the non-self-adjoint case or, more generally, for operators between two different Hilbert spaces. These expansions make it possible to develop the concepts of Hilbert–Schmidt and trace class operators at a level of generality that makes them useful in subsequent aspects of the text.

The treatment of principal components analysis in Chapter 9 requires some understanding of perturbation theory for compact operators. This material is therefore developed in Chapter 5. As was the case for Chapter 4, we do this for both the self-adjoint and the non-self-adjoint scenarios. The latter instance therefore provides an introduction to the less well-documented perturbation theory for singular values and vectors that, for example, can be employed to investigate the properties of canonical correlation estimators.

The fact that sample paths must be digitized for storage entails that data smoothing of some kind often becomes necessary. Smoothing and regularization problems also arise naturally from the approximate solution of operator equations, functional regression, and various other problems that are endemic to the fda setting. Chapter 6 examines a general abstract smoothing or regularization problem that corresponds to what we call a functional linear model. An explicit form is derived for the associated estimator of the underlying functional parameter. The problems of computation and regularization parameter selection are considered for the case of real valued, scalar response data. A special case of our abstract smoothing scenario leads us back to ordinary smoothing splines and we spend some time studying their associated properties as nonparametric regression estimators.

Chapter 7 aims to establish the probabilistic underpinnings of fda. The mean element, covariance operator, and cross-covariance operators are rigorously defined here for random elements of a Hilbert space. The fda case where a random element has a representation as a continuous time stochastic process is given special treatment that, among other factors, clarifies the relationship between its covariance operator and covariance kernel. A brief foray into representation theory produces congruence relationships that prove useful in Chapter 10. The chapter then concludes with selected aspects of the large sample theory for Hilbert space valued random elements that includes both a strong law and central limit theorem.

The large sample behavior of the sample mean element and covariance operator is studied in Chapter 8. This is relevant for cases where functional data is completely observed. When meaningful discretization occurs, smoothing becomes necessary and we look into the large sample performance of two estimation schema that can be used for that purpose: namely, local linear and penalized least-squares smoothing. Chapter 9 is the principal components counterpart of Chapter 8 in that it investigates the properties of eigenvalues and eigenfunctions associated with the covariance operator estimators that were derived in that chapter.

Chapters 10 and 11 both address bivariate situations. In Chapter 10, the focus is canonical correlation and this concept is used to study a variety of fda problems including functional factor analysis and discriminant analysis. Then, Chapter 11 deals with the problem of functional regression with a scalar response and functional predictor. An asymptotically optimal penalized least-squares estimator is investigated in this setting.

We have been fortunate to have talented coworkers and students that have generously shared their ideas and expertise with us on many occasions. A nonexhaustive list of such important contributors includes Toshiya Hoshikawa, Ana Kupresanin, Yehua Li, Heng Lian, Yolanda Munoz-Maldanado, Rosie Renaut, Hyejin Shin, and Jack Spielberg. We sincerely appreciate the invaluable help they have provided in bringing this book to fruition. The inspiration for much of the development in Chapter 10 can be traced to the serendipitous path through academics that brought us into contact with Anant Kshirsagar and Emanuel Parzen. We gratefully acknowledge the profound influence these two great scholars have had on this as well as many other aspects of our writing. TH also wishes to thank Ross Leadbetter for introducing him to the world of research and Ray Carroll for his support which has opened doors to many possibilities, including this book.


1 Introduction

Briefly stated, a stochastic process is an indexed collection of random variables all of which are defined on a common probability space (Ω, ℱ, ℙ). If we denote the index set by E, then this can be described mathematically as

{X(t, ω) : t ∈ E, ω ∈ Ω},

where X(t, ⋅) is an ℱ-measurable function on the sample space Ω. The ω argument will generally be suppressed and X(t, ω) will typically be shortened to just X(t).

Once the X(t) have been observed for every t ∈ E, the process has been realized and the resulting collection of real numbers is called a sample path for the process. Functional data analysis (fda), in the sense of this text, is concerned with the development of methodology for statistical analysis of data that represent sample paths of processes for which the index set is some (closed) interval of the real line; without loss, the interval can be taken as [0, 1]. This translates into observations that are functions on [0, 1] and data sets that consist of a collection of such random curves.

From a practical perspective, one cannot actually observe a functional data set in its entirety; at some point, digitization must occur. Thus, analysis might be predicated on data of the form

x_i(j/r), j = 1, …, r, i = 1, …, n,

involving n sample paths x_1(⋅), …, x_n(⋅) for some stochastic process with each sample path only being evaluated at r points in [0, 1]. When viewed from this perspective, the data is inherently finite dimensional and the temptation is to treat it as one would data in a multivariate analysis (mva) context.
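To make the digitization concrete, here is a minimal simulation sketch (our own illustration, not part of the text): n sample paths of a standard Brownian motion, a process chosen purely for convenience, observed at the grid points j/r.

```python
import numpy as np

# Illustrative only: n digitized sample paths stored as an n x r matrix,
# with row i holding x_i(j/r), j = 1, ..., r. Brownian motion is simulated
# by cumulative sums of independent N(0, 1/r) increments.
rng = np.random.default_rng(0)
n, r = 20, 100                      # n curves, each evaluated at r points
grid = np.arange(1, r + 1) / r      # the evaluation points j/r in (0, 1]
X = rng.normal(scale=np.sqrt(1 / r), size=(n, r)).cumsum(axis=1)
```

In this form, the data set is just an n × r array, which is precisely what invites the multivariate treatment discussed next.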


However, for truly functional data, there will be many more “variables” than observations; that is, r ≫ n. This leads to drastic ill conditioning of the linear systems that are commonplace in mva which has consequences that can be quite profound. For example, Bickel and Levina (2004) showed that a naive application of multivariate discriminant analysis to functional data can result in a rule that always classifies by essentially flipping a fair coin regardless of the underlying population structure.

Rote application of mva methodology is simply not the avenue one should follow for fda. On the other hand, the basic mva techniques are still meaningful in a certain sense. Data analysis tools such as canonical correlation analysis, discriminant analysis, factor analysis, multivariate analysis of variance (MANOVA), and principal components analysis exist because they provide useful ways to summarize complex data sets as well as carry out inference about the underlying parent population. In that sense, they remain conceptually valid in the fda setting even if the specific details for extracting the relevant information from data require a bit of adjustment. With that in mind, it is useful to begin by cataloging some of the multivariate methods and their associated mathematical foundations, thereby providing a roadmap of interesting avenues for study. This is the subject of the following section.

1.1 Multivariate analysis in a nutshell

mva is a mature area of statistics with a rich history. As a result, we cannot (and will not attempt to) give an in-depth overview of mva in this text. Instead, this section contains a terse, mathematical sketch of a few of the methods that are commonly employed in mva. This will, hopefully, provide the reader with some intuition concerning the form and structure of analogs of mva techniques that are used in fda as well as an appreciation for both the similarities and the differences between the two fields of study. Introductions to the theory and practice of mva can be found in a myriad of texts including Anderson (2003), Gittins (1985), Izenman (2008), Jolliffe (2004), and Johnson and Wichern (2007).

Let us begin with the basic set up where we have a p-dimensional random vector X = (X_1, …, X_p)^T having (variance-)covariance matrix

𝒦 = 𝔼[(X − m)(X − m)^T]  (1.1)

with

m = 𝔼X  (1.2)

the mean vector for X. Here, 𝔼 corresponds to mathematical expectation and v^T indicates the transpose of a vector v. The matrix 𝒦 admits an eigenvalue–eigenvector decomposition of the form

𝒦 = ∑_{j=1}^p λ_j e_j e_j^T  (1.3)

for eigenvalues λ_1 ≥ ⋯ ≥ λ_p ≥ 0 and associated orthonormal eigenvectors e_j = (e_{1j}, …, e_{pj})^T, j = 1, …, p, that satisfy

e_i^T 𝒦 e_j = λ_j δ_{ij},

where δ_{ij} is 1 or 0 depending on whether or not i and j coincide. This provides a basis for principal components analysis (pca).

We can use the eigenvectors in (1.3) to define new variables Z_j = e_j^T(X − m), which are referred to as principal components. These are linear combinations of the original variables with the weight or loading e_{ij} that is applied to X_i in the jth component indicating its importance to Z_j; more precisely,

Cov(Z_j, X_i) = λ_j e_{ij}.

In fact,

X = m + ∑_{j=1}^p Z_j e_j  (1.4)

as, if 𝒦 is full rank, e_1, …, e_p provide an orthonormal basis for ℝ^p; this is even true when 𝒦 has less than full rank as e_j^T X is zero with probability one when λ_j = 0. The implication of (1.4) is that X can be represented as a weighted sum of the eigenvectors of 𝒦 with the weights/coefficients being uncorrelated random variables having variances that are the eigenvalues of 𝒦.

In practice, one typically retains only some number q < p of the components and views them as providing a summary of the (covariance) relationship between the variables in X. As with any type of summarization, this results in a loss of information. The extent of this loss can be gauged by the proportion of the total X variance V := trace(𝒦) that is recovered by the principal components that are retained. In this regard, we know that

V = ∑_{j=1}^p λ_j

while the variance of the jth component is

Var(Z_j) = e_j^T 𝒦 e_j = λ_j.

Thus, the jth component accounts for 100 λ_j/V percent of the total variance and 100(1 − ∑_{k=1}^j λ_k/V) is the percentage of variability that is not accounted for by Z_1, …, Z_j.

Principal components possess various optimality features such as the one catalogued in Theorem 1.1.1.

Theorem 1.1.1  Var(Z_j) = max_{e^T e = 1, e^T 𝒦 e_i = 0, i = 1, …, j−1} Var(e^T X).

The proof of this result is, e.g., a consequence of developments in Section 4.2. It can be interpreted as saying that the jth principal component is the linear combination of X that accounts for the maximum amount of the remaining total variance after removing the portion that was explained by Z_1, …, Z_{j−1}.

The discussion to this point has been concerned with only the population aspects of pca. Given a random sample x_1, …, x_n of observations on X, we estimate 𝒦 by the sample covariance matrix

𝒦_n = (n − 1)^{−1} ∑_{i=1}^n (x_i − x̄_n)(x_i − x̄_n)^T  (1.5)

with

x̄_n = n^{−1} ∑_{i=1}^n x_i  (1.6)

the sample mean vector. As 𝒦_n is positive semidefinite, it has the eigenvalue–eigenvector representation

𝒦_n = ∑_{j=1}^p λ_{jn} e_{jn} e_{jn}^T,  (1.7)

where the e_{jn} are orthonormal and satisfy

e_{in}^T 𝒦_n e_{jn} = λ_{jn} δ_{ij}.

This produces the sample principal components z_{jn} = e_{jn}^T(x − x̄_n) for j = 1, …, p with x = (x_1, …, x_p)^T and the associated scores e_{jn}^T(x_i − x̄_n), i = 1, …, n, that provide sample information concerning the Z_j.

Theorems 9.1.1 and 9.1.2 of Chapter 9 can be used to deduce the large sample behavior of the sample eigenvalue–eigenvector pairs (λ_{jn}, e_{jn}), j = 1, …, r. The limiting distributions of √n(λ_{jn} − λ_j) and √n(e_{jn} − e_j) are found to be normal which provides a foundation for hypothesis testing and interval estimation.
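As a concrete companion to (1.5)–(1.7), the following sketch computes the sample mean vector, sample covariance matrix, eigenvalue–eigenvector pairs, and principal component scores with numpy. The simulated Gaussian data and every variable name are our own illustrative choices.

```python
import numpy as np

# Sample pca: estimate K by K_n as in (1.5), eigendecompose as in (1.7),
# and form the scores e_jn^T (x_i - xbar_n). Data are simulated.
rng = np.random.default_rng(1)
n, p = 200, 5
x = rng.multivariate_normal(np.zeros(p), np.diag([5.0, 3.0, 1.0, 0.5, 0.1]), size=n)

xbar = x.mean(axis=0)                       # sample mean vector, (1.6)
K_n = (x - xbar).T @ (x - xbar) / (n - 1)   # sample covariance matrix, (1.5)
lam, e = np.linalg.eigh(K_n)                # eigenvalues/eigenvectors, (1.7)
lam, e = lam[::-1], e[:, ::-1]              # order so lambda_1 >= ... >= lambda_p
scores = (x - xbar) @ e                     # scores e_jn^T (x_i - xbar_n)
explained = lam / lam.sum()                 # proportion lambda_j / V of total variance
```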

The next step is to assume that X consists of two subsets of variables that we indicate by writing X = (X_1^T, X_2^T)^T, where X_1 = (X_{11}, …, X_{1p})^T and X_2 = (X_{21}, …, X_{2q})^T. Questions of interest now concern the relationships that may exist between X_1 and X_2. Our focus will be on those that are manifested in their covariance structure. For this purpose, we partition the covariance matrix 𝒦 for X from (1.1) as

𝒦 = [ 𝒦_1    𝒦_12
      𝒦_21   𝒦_2 ].  (1.8)

Here, 𝒦_1, 𝒦_2 are the covariance matrices for X_1, X_2, respectively, and 𝒦_12 = 𝒦_21^T is sometimes called the cross-covariance matrix.

The goal is now to summarize the (cross-)covariance properties of X_1 and X_2. Analogous to the pca approach, this will be accomplished using linear combinations of the two random vectors. Specifically, we seek vectors a_1 ∈ ℝ^p and a_2 ∈ ℝ^q that maximize

ρ²(a_1, a_2) = Cov²(a_1^T X_1, a_2^T X_2) / [Var(a_1^T X_1) · Var(a_2^T X_2)].  (1.9)

This optimization problem can be readily solved with the help of the singular value decomposition: e.g., Corollary 4.3.2. Assuming that X_1, X_2 contain no redundant variables, both 𝒦_1 and 𝒦_2 will be positive-definite with nonsingular square roots 𝒦_i^{1/2}, i = 1, 2. This allows us to write

ρ²(a_1, a_2) = (ã_1^T ℛ_12 ã_2)² / (ã_1^T ã_1 · ã_2^T ã_2),  (1.10)

where

ℛ_12 = 𝒦_1^{−1/2} 𝒦_12 𝒦_2^{−1/2},  (1.11)

ã_1 = 𝒦_1^{1/2} a_1 and ã_2 = 𝒦_2^{1/2} a_2. The matrix ℛ_12 can be viewed as a multivariate analog of the linear correlation coefficient between two variables. Using the singular value decomposition in Corollary 4.3.2, we can see that (1.10) is maximized by choosing ã_1, ã_2 to be the pair of singular vectors ã_{11}, ã_{21} that correspond to its largest singular value ρ_1. The optimal linear combinations of X_1 and X_2 are therefore provided by the vectors a_{11} = 𝒦_1^{−1/2} ã_{11} and a_{21} = 𝒦_2^{−1/2} ã_{21}. The corresponding random variables U_{11} = a_{11}^T X_1 and U_{21} = a_{21}^T X_2 are called the first canonical variables of the X_1 and X_2 spaces, respectively. They each have unit variance and correlation ρ_1 that is referred to as the first canonical correlation.

The summarization process need not stop after the first canonical variables. If 𝒦_12 has rank r, then there are actually r − 1 additional canonical variables that can be found: namely, for j = 2, …, r, we have

U_{1j} = a_{1j}^T X_1  (1.12)

and

U_{2j} = a_{2j}^T X_2,  (1.13)

where a_{1j} = 𝒦_1^{−1/2} ã_{1j}, a_{2j} = 𝒦_2^{−1/2} ã_{2j} with ã_{1j}, ã_{2j} the other singular vector pairs from ℛ_12 that correspond to its remaining nonzero singular values ρ_2 ≥ ⋯ ≥ ρ_r > 0. For each choice of the index j, the random variable pair (U_{1j}, U_{2j}) is uncorrelated with all the other canonical variable pairs and has corresponding canonical correlation ρ_j. When all this is put together, it gives us

ℛ_12 = 𝒦_1^{1/2} 𝒜_1 𝒟 𝒜_2^T 𝒦_2^{1/2}  (1.14)

with

𝒜_i = [a_{i1}, …, a_{ir}]

the matrix of canonical weight vectors for X_i, i = 1, 2, and

𝒟 = diag(ρ_1, …, ρ_r)

a diagonal matrix containing the corresponding canonical correlations.

There are various other ways of characterizing the canonical correlations and vectors. As they stem from the singular values and vectors of ℛ_12, they are the eigenvalues and eigenvectors obtained from 𝒦_1^{−1/2} 𝒦_12 𝒦_2^{−1} 𝒦_21 𝒦_1^{−1/2} and 𝒦_2^{−1/2} 𝒦_21 𝒦_1^{−1} 𝒦_12 𝒦_2^{−1/2}. For example, the squared canonical correlations and canonical vectors of the X_1 space can be derived from the linear system

𝒦_1^{−1} 𝒦_12 𝒦_2^{−1} 𝒦_21 a_1 = ρ² a_1.  (1.15)

The population canonical correlations and associated canonical vectors can be estimated from the sample covariance matrix (1.5). For this purpose, one partitions 𝒦_n analogous to 𝒦 and carries out the same form of singular value decomposition except using sample entities in place of 𝒦_1, 𝒦_2, and 𝒦_12. The resulting sample canonical correlations have a limiting multivariate normal distribution under various conditions as detailed in Muirhead and Waternaux (1980).
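The sample version of this construction is short enough to sketch directly: partition the sample covariance matrix, form ℛ_12 as in (1.11), and read the canonical correlations and weight vectors off its singular value decomposition. The simulated data, the dimensions, and the use of scipy's fractional_matrix_power for the matrix square roots are all our illustrative choices.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

# Sample cca via the svd of R12 = K1^{-1/2} K12 K2^{-1/2}, cf. (1.9)-(1.11).
rng = np.random.default_rng(2)
n, p, q = 500, 3, 4
z = rng.standard_normal((n, p + q))
z[:, p:p + 2] += z[:, :2]                 # induce some cross-correlation
K = np.cov(z, rowvar=False)               # sample analog of the partition (1.8)
K1, K2, K12 = K[:p, :p], K[p:, p:], K[:p, p:]

R12 = mpow(K1, -0.5) @ K12 @ mpow(K2, -0.5)   # sample version of (1.11)
u, rho, vt = np.linalg.svd(R12)               # singular values: canonical correlations
a1 = mpow(K1, -0.5) @ u                       # canonical weight vectors, X1 space
a2 = mpow(K2, -0.5) @ vt.T                    # canonical weight vectors, X2 space
```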

The vectors X_1 and X_2 are linearly independent when 𝒦_12 is a matrix of all zeros which we now recognize as being equivalent to ρ_1 = ⋯ = ρ_{min(p,q)} = 0. Test statistics for this and other related hypotheses can be developed from the sample canonical correlations.

Canonical correlation occupies a pervasive role in classical multivariate analysis. One place it arises naturally is in the (linear) prediction of X_1 from X_2. A best linear unbiased predictor (BLUP) is provided by the vector β_0 + β_1 X_2, where β_0, β_1 are, respectively, the p × 1 vector and p × q matrix that minimize

𝔼(X_1 − b_0 − b_1 X_2)^T(X_1 − b_0 − b_1 X_2)

as a function of b_0, b_1. The minimizers are easily seen to be

β_0 = m_1 − β_1 m_2,
β_1 = 𝒦_12 𝒦_2^{−1},

wherein m_j = 𝔼X_j, j = 1, 2. From this, we recognize β_1 as the least-squares regression coefficients for the regression of X_1 on X_2. Now, premultiply (1.14) by 𝒦_1^{1/2} and postmultiply by 𝒦_2^{−1/2} to obtain

β_1 = ∑_{j=1}^r ρ_j 𝒦_1 a_{1j} a_{2j}^T,

where, again, r is the rank of 𝒦_12. This establishes a fundamental relationship between the best linear predictor and the canonical variables: namely,

β_0 + β_1 X_2 = β_0 + ∑_{j=1}^r ρ_j 𝒦_1 a_{1j} U_{2j}.  (1.16)

Thus, canonical correlation lies at the heart of the linear prediction of X_1 from X_2. The converse is seen to be true as well by simply interchanging the roles of X_1 and X_2 in the above-mentioned discussion.
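This identity is pure linear algebra and can be checked numerically; the sketch below uses small made-up covariance blocks (illustrative values only) to confirm that 𝒦_12 𝒦_2^{−1} coincides with ∑_j ρ_j 𝒦_1 a_{1j} a_{2j}^T.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

# Check beta_1 = K12 K2^{-1} against its canonical expansion from (1.16).
K1 = np.array([[2.0, 0.3], [0.3, 1.0]])      # illustrative covariance blocks
K2 = np.array([[1.5, 0.2], [0.2, 1.0]])
K12 = np.array([[0.4, 0.1], [0.2, 0.3]])

R12 = mpow(K1, -0.5) @ K12 @ mpow(K2, -0.5)
u, rho, vt = np.linalg.svd(R12)
a1 = mpow(K1, -0.5) @ u                      # canonical weight vectors for X1
a2 = mpow(K2, -0.5) @ vt.T                   # canonical weight vectors for X2

beta1 = K12 @ np.linalg.inv(K2)
beta1_cca = sum(rho[j] * K1 @ np.outer(a1[:, j], a2[:, j]) for j in range(len(rho)))
assert np.allclose(beta1, beta1_cca)
```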

In finding our linear predictor for X_1, we chose a priori to restrict attention to only those that were linear functions of X_2. A related, but distinct, variation on this theme is to presume the existence of a linear model that relates X_1 to X_2 by an expression such as

X_1 = β_0 + β_1 X_2 + ε  (1.17)

with β_0 and β_1 of dimension p × 1 and p × q as before and ε a p × 1 random vector that is uncorrelated with X_2 while having zero mean and covariance matrix

𝒦_ε = 𝔼εε^T.

In this event,

𝒦_12 = β_1 𝒦_2,  (1.18)

𝒦_1 = β_1 𝒦_2 β_1^T + 𝒦_ε = 𝒦_12 𝒦_2^{−1} 𝒦_21 + 𝒦_ε  (1.19)

and the canonical correlations now satisfy the relationship

𝒜_1 𝒟 𝒜_2^T = (β_1 𝒦_2 β_1^T + 𝒦_ε)^{−1} β_1.  (1.20)

Weyl's inequality (e.g., Thompson and Freede, 1971) tells us that the eigenvalues of (β_1 𝒦_2 β_1^T + 𝒦_ε) are at least as large as those for 𝒦_ε. Thus, β_1 is null if and only if 𝒟 in (1.20) is the zero matrix.

Factor analysis is another multivariate analysis method that aims to examine the relationship between the two sets of variables. However, in this setting, only the values of the response variable X_1 are observed while X_2 is viewed as a collection of latent variables whose values represent the object of the analysis. The basic premise is that X_1 and X_2 are linearly related in that

X_1 = X_2 + ε  (1.21)

with ε a vector of zero mean random errors with variance–covariance matrix 𝒦_ε as before. The goal is now to use X_1 to predict the unobserved values of X_2.

Canonical correlation again provides the tool that allows us to make a modicum of progress toward solving the prediction problem posed by model (1.21). In this regard, we can look for a linear combination a_1^T X_1 that is maximally correlated with a linear combination a_2^T X_2 of X_2. There are some simplifications in this instance in that 𝒦_12 = 𝒦_2 and 𝒦_1 = 𝒦_2 + 𝒦_ε lead us to consideration of the objective function

Corr²(a_1^T X_1, a_2^T X_2) = (a_1^T 𝒦_2 a_2)² / (a_1^T 𝒦_1 a_1 · a_2^T 𝒦_2 a_2) = (ã_1^T 𝒦_1^{−1/2} 𝒦_2^{1/2} ã_2)² / (ã_1^T ã_1 · ã_2^T ã_2)

with ã_1 = 𝒦_1^{1/2} a_1, ã_2 = 𝒦_2^{1/2} a_2. Thus, for example, the optimal correlations ρ and choices for a_1 can be obtained as solutions of the eigenvalue problem

𝒦_1^{−1/2} 𝒦_2 𝒦_1^{−1/2} ã_1 = ρ² ã_1.

A little algebra then reveals this to be equivalent to finding solutions of

𝒦_1 a_1 = [1/(1 − ρ²)] 𝒦_ε a_1.  (1.22)

To proceed further, we need to impose some additional structure on X_2. The standard approach is to assume that

X_2 = ΦZ  (1.23)

for some unknown p × r matrix

Φ = {φ_{ij}}_{i=1:p, j=1:r} = [φ_1, …, φ_r]  (1.24)

and Z = (Z_1, …, Z_r)^T a vector of zero mean random variables with

𝔼[ZZ^T] = I.

The elements of Z are referred to as factors while the elements of Φ are called factor loadings. A typical identifiability constraint arising from maximum likelihood estimation is to have

φ_i^T 𝒦_ε^{−1} φ_j = 0, i ≠ j.  (1.25)

This has the consequence of making Φ^T 𝒦_ε^{−1} Φ a diagonal matrix, which we will subsequently presume to be the case.

When (1.23) holds

𝒦_1 = ΦΦ^T + 𝒦_ε = 𝒦_ε^{1/2}(𝒦_ε^{−1/2} ΦΦ^T 𝒦_ε^{−1/2} + I)𝒦_ε^{1/2}.

It is readily verified that under (1.25) the matrix 𝒦_ε^{−1/2} ΦΦ^T 𝒦_ε^{−1/2} has eigenvalues γ_j = φ_j^T 𝒦_ε^{−1} φ_j with associated eigenvectors 𝒦_ε^{−1/2} φ_j. However, this means that 𝒦_ε^{−1/2} 𝒦_1 𝒦_ε^{−1/2} has eigenvalues 1 + γ_j associated with this same set of eigenvectors; i.e.,

[𝒦_ε^{−1/2} 𝒦_1 𝒦_ε^{−1/2} − (1 + γ_j)I] 𝒦_ε^{−1/2} φ_j = 0  (1.26)

or

[𝒦_1 − (1 + γ_j)𝒦_ε] 𝒦_ε^{−1} φ_j = 0.  (1.27)

Comparing this with (1.22) leads to the conclusion that 𝒦_ε^{−1} φ_j is a canonical weight vector for the X_1 space with γ_j = ρ_j²/(1 − ρ_j²) obtained from its corresponding canonical correlation.

To see where these developments might take us consider the unrealistic scenario where we knew 𝒦_1 and 𝒦_ε but not Φ. If that were the case, the coefficient matrix could be recovered from the canonical weight functions for the X_1 space as Φ = 𝒦_ε [a_{11}, …, a_{1r}]. We could then predict Z via the best linear unbiased predictor: namely, the linear transformation of X_1 that minimizes the prediction error 𝔼(Z − ℒX_1)^T(Z − ℒX_1) over all possible choices for the r × p matrix ℒ. The minimum is attained with ℒ = Cov(Z, X_1)𝒦_1^{−1} giving

Ẑ = Φ^T(ΦΦ^T + 𝒦_ε)^{−1} X_1  (1.28)

as the optimal predictor.

Of course, in practice, we will not know either of 𝒦_1 or 𝒦_ε. However, given a random sample of values from X_1, the first of these two quantities is easy to estimate using the sample variance–covariance matrix. While estimation of 𝒦_ε is more problematic, there are various ways this can be accomplished that produce at least acceptable initial estimators. Such estimators can be substituted into (1.27) to obtain an estimator of Φ. This, in turn, provides an update of 𝒦_ε = 𝒦_1 − ΦΦ^T. The result is one possible iterative estimation algorithm that is employed in the factor analysis genre.
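One such iteration is sketched below under the additional simplifying assumption, ours rather than the text's, that 𝒦_ε is diagonal; the simulated data, the crude initial estimator, the clipping, and the fixed iteration count are likewise illustrative choices only.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

# Iterative factor analysis sketch: given K_eps, the eigenproblem (1.26)
# yields Phi; Phi then updates K_eps = K1 - Phi Phi^T (kept diagonal here).
rng = np.random.default_rng(3)
p, r, n = 6, 2, 500
Phi_true = rng.standard_normal((p, r))
x1 = rng.standard_normal((n, r)) @ Phi_true.T + rng.standard_normal((n, p))
K1 = np.cov(x1, rowvar=False)                 # sample estimate of K_1

K_eps = np.diag(np.diag(K1))                  # crude initial estimator
for _ in range(50):
    W = mpow(K_eps, -0.5)
    vals, vecs = np.linalg.eigh(W @ K1 @ W)   # eigenvalues 1 + gamma_j, cf. (1.26)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    gamma = np.clip(vals[:r] - 1, 0, None)
    # phi_j = sqrt(gamma_j) K_eps^{1/2} w_j, so phi_j^T K_eps^{-1} phi_j = gamma_j
    Phi = mpow(K_eps, 0.5) @ vecs[:, :r] * np.sqrt(gamma)
    K_eps = np.diag(np.clip(np.diag(K1 - Phi @ Phi.T), 1e-6, None))

Z_hat = (Phi.T @ np.linalg.inv(Phi @ Phi.T + K_eps) @ x1.T).T   # predictor (1.28)
```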

Our particular development of factor analysis is due to Rao (1955). A detailed treatment of this and many other factor analysis-related topics can be found in Basilevsky (1994).

A case of particular interest that can be treated with a linear model is MANOVA and its predictive analog known as discriminant analysis. To develop these ideas, we begin with the model

X_1 = m + [m_1 − m, …, m_{q+1} − m] X_2 + ε,

where X_2 = (X_{21}, …, X_{2(q+1)})^T has a multinomial distribution with ∑_{j=1}^{q+1} X_{2j} = 1 and success probabilities π_1, …, π_{q+1}, m_1, …, m_{q+1} are the X_1 mean vectors for the q + 1 different populations, m = ∑_{j=1}^{q+1} π_j m_j is the grand mean and ε is a p-variate random vector with mean zero and covariance matrix 𝒦_ε. As π_{q+1} = 1 − ∑_{j=1}^q π_j and X_{2(q+1)} = 1 − ∑_{j=1}^q X_{2j}, the previous model can be equivalently expressed as

X_1 = m_{q+1} + [m_1 − m_{q+1}, …, m_q − m_{q+1}] X_2 + ε  (1.29)

with

X_2 := (X_{21}, …, X_{2q})^T.  (1.30)

This corresponds to (1.17) with β_1 = [(m_1 − m_{q+1}), …, (m_q − m_{q+1})] and β_0 = m_{q+1}.

To apply formula (1.20), we must calculate 𝒦_1, 𝒦_12 and 𝒦_2. In this regard, first note that

𝔼X_2 = (π_1, …, π_q)^T =: π

and

𝒦_2 = Var(X_2) = diag(π_1, …, π_q) − ππ^T.

Then, one may check that

𝒦_2^{−1} = diag(π_1^{−1}, …, π_q^{−1}) + π_{q+1}^{−1} 11^T

for a q-vector 1 of all unit elements. From these identities, we obtain

𝒦_12 = [π_1(m_1 − m), …, π_q(m_q − m)]

and

𝒦_1 = 𝒦_B + 𝒦_ε

with

𝒦_B = ∑_{j=1}^{q+1} π_j(m_j − m)(m_j − m)^T = 𝒦_12 𝒦_2^{−1} 𝒦_21.

Relations (1.18) and (1.15) can now be used to see that in this instance the canonical correlations and canonical vectors of the X_1 space are characterized by

𝒦_ε^{−1} 𝒦_B a = [ρ²/(1 − ρ²)] a.  (1.31)

Thus, all the canonical correlations are zero if and only if 𝒦_B is the zero matrix, which, in turn, is equivalent to the standard MANOVA null hypothesis that all of the q + 1 populations have the same mean vector. Statistical tests for this null model can then be constructed from the sample canonical correlations.
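The covariance identities used above lend themselves to a quick numerical check; the probabilities and population means below are arbitrary illustrative values.

```python
import numpy as np

# Verify the stated form of K2^{-1} and the identity K_B = K12 K2^{-1} K21
# for q + 1 = 3 populations and p = 2.
pi = np.array([0.5, 0.3, 0.2])                           # pi_1, ..., pi_{q+1}
m_pops = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])  # rows m_1, ..., m_{q+1}
q = len(pi) - 1
m = pi @ m_pops                                          # grand mean

K2 = np.diag(pi[:q]) - np.outer(pi[:q], pi[:q])
K2_inv = np.diag(1 / pi[:q]) + np.ones((q, q)) / pi[q]   # claimed inverse
assert np.allclose(K2 @ K2_inv, np.eye(q))

K12 = pi[:q] * (m_pops[:q] - m).T                        # columns pi_j (m_j - m)
K_B = sum(pi[j] * np.outer(m_pops[j] - m, m_pops[j] - m) for j in range(q + 1))
assert np.allclose(K_B, K12 @ K2_inv @ K12.T)
```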

If the mean vectors are the same for all of our q + 1 populations, it will not generally be possible to distinguish between them on the basis of location. However, if the MANOVA null model is rejected, one can expect to have at least some success in categorizing incoming observations according to population membership. The process of doing so is often referred to as discriminant analysis. While there are many discrimination methods that appear in the literature, our focus here will be limited to Fisher's classical proposal. The idea is to find a linear combination, or discriminant function, h^T X_1 that provides the maximum separation between the populations in the sense of maximizing the ratio

h^T 𝒦_B h / h^T 𝒦_ε h.  (1.32)

However, this is just a variation of a problem we already encountered with pca. The solution is the largest eigenvalue λ_1 of 𝒦_ε^{−1/2} 𝒦_B 𝒦_ε^{−1/2} with the optimal discriminant function weight vector h_1 = 𝒦_ε^{−1/2} u_1 obtained from the eigenvector u_1 that corresponds to λ_1. These quantities are characterized by

𝒦_ε^{−1/2} 𝒦_B 𝒦_ε^{−1/2} u_1 = λ_1 u_1

or, equivalently, by

𝒦_ε^{−1} 𝒦_B h_1 = λ_1 h_1.

So, λ_1 = ρ_1²/(1 − ρ_1²) with ρ_1 the first canonical correlation.

Additional discriminant functions are provided by maximizing (1.32) conditional on the resulting linear combinations of variables being uncorrelated with the discriminant functions that have already been determined from this iterative process. The resulting eigenvalues will, of course, enjoy the same relation to their corresponding canonical correlations. If we now opt to retain r ≤ min(p, q) discriminant functions having weight vectors h_1, …, h_r, a new observation x is classified as being from the population whose index minimizes

∑_{j=1}^r (h_j^T x − h_j^T m_i)²  (1.33)

over i = 1, …, q + 1.

As one might expect, the relationship between Fisher's discriminant functions and canonical variables goes much deeper than just the eigenvalue connection. The equivalence of Fisher's discriminant analysis and canonical correlation is revealed by observing that

δ_{ij} = a_{1i}^T 𝒦_B a_{1j} / ρ_i² = h_i^T 𝒦_B h_j / λ_i.

This shows that the h_i and a_{1i} are eigenvectors of 𝒦_B that have been adjusted to have norms λ_i and ρ_i². So, a_{1i} = ρ_i h_i / √λ_i. In particular, this means that (1.33) will produce the same classification as would be obtained using canonical variables with the criterion

∑_{j=1}^r (a_j^T x − a_j^T m_i)² / (1 − ρ_j²).

A typical situation with data would have us observing p dimensional random vectors

X_{ij}, i = 1, …, q + 1, j = 1, …, n_i

with n = ∑_{i=1}^{q+1} n_i. Then, estimators of 𝒦_ε, 𝒦_B are

n^{−1} ∑_{i=1}^{q+1} ∑_{j=1}^{n_i} (X_{ij} − X̄_i)(X_{ij} − X̄_i)^T

and

n^{−1} ∑_{i=1}^{q+1} n_i (X̄_i − X̄)(X̄_i − X̄)^T,

respectively, with X̄_i = n_i^{−1} ∑_{j=1}^{n_i} X_{ij} and X̄ = n^{−1} ∑_{i=1}^{q+1} n_i X̄_i. We then carry out classification by replacing 𝒦_ε, 𝒦_B by their sample analogs in the previous formulation.
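Putting the pieces together, the sketch below forms the sample analogs of 𝒦_ε and 𝒦_B, solves the generalized eigenproblem behind (1.32) with scipy's symmetric solver (our substitute for the explicit 𝒦_ε^{−1/2} construction), and classifies by the rule (1.33); the three simulated Gaussian groups and the choice r = 2 are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

# Fisher's discriminant analysis with sample K_eps (within) and K_B (between).
rng = np.random.default_rng(4)
p = 3
groups = [(np.array([0.0, 0.0, 0.0]), 60),   # (population mean, sample size)
          (np.array([2.0, 0.0, 1.0]), 50),
          (np.array([0.0, 2.0, 2.0]), 40)]
X = [mi + rng.standard_normal((ni, p)) for mi, ni in groups]
n = sum(ni for _, ni in groups)

means = [x.mean(axis=0) for x in X]
grand = sum(x.sum(axis=0) for x in X) / n
K_eps = sum((x - mi).T @ (x - mi) for x, mi in zip(X, means)) / n
K_B = sum(len(x) * np.outer(mi - grand, mi - grand) for x, mi in zip(X, means)) / n

lam, H = eigh(K_B, K_eps)        # generalized eigenproblem K_B h = lam K_eps h
H = H[:, ::-1][:, :2]            # keep r = 2 discriminant weight vectors

def classify(x_new):
    # choose the population index minimizing (1.33)
    return int(np.argmin([((H.T @ (x_new - mi)) ** 2).sum() for mi in means]))
```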


1.2 The path that lies ahead

The basic concepts described in Section 1.1 remain conceptually valid in the context of fda. The first challenge faced by researchers in the area lies in developing them fully and rigorously from a mathematical perspective. This is the essential precursor to the growth and maturation of inferential methodology in general and for fda in particular. One cannot estimate a “parameter” if it is undefined and it is easy to take a misstep in fda formulations that lead to exactly such a conundrum.

The theory of multivariate analysis is inextricably interwoven with matrix theory. If the observations in fda are viewed as vectors of infinite length, then we would anticipate that the infinite dimensional analog of matrices would represent the tools of the trade for advancing our understanding of this emerging field. Such entities are called linear operators and, in particular, compact operators give us the infinite dimensional extension of matrices that arise naturally in fda. Just as one cannot venture far into multivariate analysis without understanding matrix theory, one cannot expect to appreciate the mathematical aspects of fda without a thorough background in compact operators and their Hilbert–Schmidt and trace class variants.

After developing the necessary background on linear spaces in Chapter 2, we accumulate some of the essential ingredients of functional analysis and operator theory in Chapters 3–5. Chapter 3 provides a general overview that introduces linear operators and linear functionals as well as fundamental concepts such as the inverse and adjoint of an operator, nonnegative and projection operators. Compact operators are then treated in Chapter 4 where we develop both eigenvalue and singular value expansions and treat the special cases of Hilbert–Schmidt and trace class operators. This work plays an important role throughout the remainder of this text. Chapter 5 deals with perturbation theory for compact operators and provides key tools for the treatment of functional pca in Chapter 9.

Data smoothing methods tend to be a prominent aspect of most fda inferential methodology making a foray into the mathematical aspects of smoothing somewhat de rigueur for this particular treatise. Our treatment of the topic focuses on an abstract penalized smoothing problem that can be specialized to recover the spline smoothers that are most common in the fda literature. Penalized smoothing methods recur in Chapters 8 and 11 where we study their performance for estimation of certain functional parameters.

Classical statistics is concerned with inference about the distribution of a basic random variable that we are able to sample repeatedly. The same can be said for fda except that the meaning of the “random variable” phrase requires a bit of reinterpretation before it becomes relevant to that setting. What is needed is the concept of a random element of a Hilbert space. That idea along with its associated probabilistic machinery is developed in Chapter 7. Here, we extend the concepts of a mean vector and covariance matrix to the infinite dimensional scenario and develop some asymptotic theory that is relevant for random samples of Hilbert space valued random elements.

Chapters 8–11 will provide extended, detailed illustrations of how the mathematical machinery in Chapters 2–6 can be used to address problems that arise in the fda environment. In Chapter 8, this takes the form of analysis of the large sample properties of three types of estimators of the mean element and covariance function: ones that are based on the sample mean element and covariance operator for completely observed functional data and local linear and penalized least-squares estimators for the discretely observed case. This is followed in Chapter 9 with an investigation of the asymptotic behavior of the principal components estimators that are produced by the covariance estimators introduced in Chapter 8.

Chapter 10 deals with the bivariate case where, for example, one has two stochastic processes and wishes to analyze their dependence structure. Somewhat more generally, it provides a development of abstract canonical correlation for two Hilbert space valued random elements. By specializing this theory to the fda stochastic processes context, we are then able to obtain parallels of results in Section 1.1 for functional analogs of linear prediction, regression, factor analysis, MANOVA, and discriminant analysis.

Finally, Chapter 11 deals with the important case of bivariate data having both a scalar and functional (i.e., stochastic process) response. The large sample properties of a particular penalized least-squares estimator of the regression coefficient function are investigated in this setting.