Probabilistic Latent Factor Induction and Statistical Factor Analysis
A Comparison of Methods
Stefan Conrady
Dr. Lionel Jouffe
April 7, 2011
Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting
Table of Contents
Introduction
About the Authors
  Stefan Conrady
  Lionel Jouffe
Key Concepts from Information Theory
  Entropy
  Chain Rule Theorem
  Conditional Entropy
  Mutual Information
  Relative Entropy (Kullback-Leibler Divergence)
    Example 1
    Example 2
Comparison of Methods
  Approach
  Notation
  Key Terminology
  Data Set
Probabilistic Latent Factor Induction with BayesiaLab
  Data Import
  Variable Clustering
  Latent Factor Induction
Statistical Factor Analysis
  Factor Analysis with STATISTICA
Conclusion
References
Contact Information
  Conrady Applied Science, LLC
  Bayesia SAS
  Copyright
Probabilistic Factor Induction and Statistical Factor Analysis
www.conradyscience.com | www.bayesia.com ii
Introduction
Bayesian networks have been gaining prominence among scientists over the past decade, and the new insights generated by this powerful research approach can now be found in studies that circulate well beyond the academic communities. As a result, many practitioners and managerial decision-makers see more and more references to Bayesian networks in all kinds of scientific and business research, ranging from biostatistics to marketing analytics.
It is not surprising that the new Bayesian network paradigm prompts comparisons to more conventional methods. In the field of market research, for instance, long-established methods such as factor analysis remain in daily use today. Given that there exists a direct counterpart to factor analysis in the Bayesian network framework, we want to highlight similarities as well as fundamental differences. The objective of this paper is to present both methods side-by-side and thus help researchers to correctly compare and interpret the respective results. More specifically, we want to establish the semantic equivalents between the traditional statistical factor analysis approach and BayesiaLab’s method based on Bayesian networks, which we refer to as Probabilistic Latent Factor Induction.
Factor Analysis is a statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors. It is possible, for example, that variations in three or four observed variables mainly reflect the variations in a single unobserved variable, or in a reduced number of unobserved variables. The observed variables can be seen as manifestations of abstract underlying (and unobserved) dimensions or (latent) factors.
Factor analysis originated in psychometrics, and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other applied sciences that deal with a large number of variables in their data.
Probabilistic Latent Factor Induction is a workflow within the BayesiaLab software package, which has the same objective as traditional factor analysis, i.e. variable reduction, but works entirely within the framework of Bayesian networks and is based on principles derived from information theory.
It is important to point out that this comparison is not meant to favor one approach over the other (and to declare a winner and loser), although it is clearly in the authors’ interest to promote Bayesian networks in general and BayesiaLab
in particular. Rather, this paper should serve as reference for research practitioners and those who use research results
in their decision-making processes, so they can correctly interpret insights generated with either approach.
About the Authors
Stefan Conrady
Stefan Conrady is the cofounder and managing partner of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, Conrady Applied Science was appointed the authorized sales and consulting partner of Bayesia SAS for North America.
Stefan Conrady studied Electrical Engineering and has extensive management experience in the fields of product planning, marketing and analytics, working at Daimler and BMW Group in Europe, North America and Asia. Prior to establishing his own firm, he was heading the Analytics & Forecasting group at Nissan North America.
Lionel Jouffe
Dr. Lionel Jouffe is cofounder and CEO of France-based Bayesia SAS. Lionel Jouffe holds a Ph.D. in Computer Science and has been working in the field of Artificial Intelligence since the early 1990s. He and his team have been developing BayesiaLab since 1999 and it has emerged as the leading software package for knowledge discovery, data mining and knowledge modeling using Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of consumer research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, who has deployed BayesiaLab globally since 2007.
Key Concepts from Information Theory
Before we proceed to the direct comparison of methods, it is important to establish several key concepts relating to the knowledge representation in Bayesian networks.
Entropy
The concept of entropy provides the underpinning for all structural learning and analysis algorithms in BayesiaLab.
Entropy measures the uncertainty inherent in the distribution of a random variable.
The entropy H(X) of a random variable X is defined as:
$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x),$$
where x stands for the states that variable X can take. Note that the logarithm is to base 2 and the value of entropy is expressed in bits (0/1).
An example can perhaps illustrate this: If variable X represents the outcome of a coin toss, X can have one of two states, Heads and Tails, i.e. the set of potential outcomes is X={Heads, Tails}. Given that the coin toss is fair, the probability of Heads and of Tails will be 0.5, i.e. p(Heads)=0.5 and p(Tails)=0.5.
We can now compute the entropy H(X_fair), based on these values:
$$H(X_{fair}) = -p(Heads) \log_2 p(Heads) - p(Tails) \log_2 p(Tails) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 0.5 + 0.5 = 1 \text{ bit}$$
This means our uncertainty prior to a fair coin toss is equivalent to an entropy value of 1 bit, which is the maximum entropy for a variable with two states and is reached because the distribution is uniform.
If we had a biased coin instead, with p(Heads)=0.7 and p(Tails)=0.3, it is intuitive to think that the uncertainty would be lower, as one state of the coin toss is more probable. Indeed, computing the entropy H(X_biased) yields a lower value.
$$H(X_{biased}) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 = 0.881 \text{ bits}$$
To complete this idea, we can also plot H(X) as a function of the bias, p(Heads)=1-p(Tails), with p(Heads)∈[0,1], i.e. ranging from impossible, p(Heads)=0, to certain, p(Heads)=1.
[Figure: entropy H(X) plotted against p(Heads), both axes ranging from 0 to 1.0]
Clearly, anything other than a perfectly fair coin reduces the entropy and thus our uncertainty regarding the outcome of the coin toss.
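The coin-toss calculations above can be reproduced with a few lines of Python. This is only an illustrative sketch of the entropy formula; the function name `entropy` and the list-based representation of a distribution are choices made here, not part of BayesiaLab.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    # States with p = 0 are skipped, following the convention 0 * log2(0) = 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair = entropy([0.5, 0.5])      # fair coin from the text
biased = entropy([0.7, 0.3])    # biased coin from the text
certain = entropy([1.0, 0.0])   # outcome known in advance

print(f"fair: {fair:.3f}  biased: {biased:.3f}  certain: {certain:.3f}")
```

Evaluating `entropy([p, 1 - p])` over a grid of values of p between 0 and 1 traces the curve described above, with its maximum at p = 0.5.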
Chain Rule Theorem
The chain rule for joint entropy states that the total uncertainty about the value of X and Y is equal to the uncertainty about X plus the (average) uncertainty about Y once you know X.
$$H(X,Y) = H(X) + H(Y \mid X)$$
The proof of this theorem follows:
$$
\begin{aligned}
H(X,Y) &= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(x,y) \\
&= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \left[ p(y \mid x)\, p(x) \right] \\
&= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(y \mid x) - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(x) \\
&= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(y \mid x) - \sum_{x \in X} p(x) \log_2 p(x) \\
&= H(Y \mid X) + H(X)
\end{aligned}
$$
Conditional Entropy
Perhaps the single most important concept for computations in BayesiaLab is conditional entropy. Conditional entropy refers to the entropy of a random variable when we have information on another variable.
The conditional entropy H(Y|X) is defined as
$$
\begin{aligned}
H(Y \mid X) &= \sum_{x \in X} p(x)\, H(Y \mid X = x) \\
&= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \\
&= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(y \mid x)
\end{aligned}
$$
The conditional entropy of Y conditional on X refers to the expected entropy of Y conditional on the value of X.
Mutual Information
The mutual information I(X,Y) measures how much (on average) the observation of random variable Y tells us about the uncertainty of X, i.e. by how much the entropy of X is reduced if we have information on Y.
$$I(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
Note that the mutual information is a symmetric metric, which reflects the uncertainty reduction of X by knowing Y as well as of Y by knowing X.
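The chain rule, conditional entropy and mutual information can be checked numerically on a small joint distribution. The following sketch uses a made-up joint distribution over two binary variables; the numbers, state labels and helper names are hypothetical and serve only to illustrate the definitions above.

```python
import math

# Hypothetical joint distribution p(x, y) for two binary variables.
joint = {
    ("x0", "y0"): 0.4, ("x0", "y1"): 0.1,
    ("x1", "y0"): 0.2, ("x1", "y1"): 0.3,
}

def marginal(joint, axis):
    """Marginal distribution of the variable at position `axis` (0 for X, 1 for Y)."""
    result = {}
    for states, p in joint.items():
        result[states[axis]] = result.get(states[axis], 0.0) + p
    return result

def entropy(dist):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint, given_axis):
    """H(other | given) = -sum over (x, y) of p(x, y) * log2 p(other | given)."""
    p_given = marginal(joint, given_axis)
    return sum(-p * math.log2(p / p_given[states[given_axis]])
               for states, p in joint.items() if p > 0)

h_x = entropy(marginal(joint, 0))
h_y = entropy(marginal(joint, 1))
h_xy = entropy(joint)
h_y_given_x = conditional_entropy(joint, 0)
h_x_given_y = conditional_entropy(joint, 1)

mi_via_x = h_x - h_x_given_y   # I(X,Y) = H(X) - H(X|Y)
mi_via_y = h_y - h_y_given_x   # I(X,Y) = H(Y) - H(Y|X)

print(f"H(X,Y) = {h_xy:.4f}, H(X) + H(Y|X) = {h_x + h_y_given_x:.4f}")
print(f"I(X,Y) via X = {mi_via_x:.4f}, via Y = {mi_via_y:.4f}")
```

Both printed pairs should agree up to floating-point error, which mirrors the chain rule and the symmetry of the mutual information.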
Relative Entropy (Kullback-Leibler Divergence)
A closely related concept is the relative entropy, also referred to as the Kullback-Leibler Divergence (D_KL) or sometimes cross entropy. The Kullback-Leibler Divergence is a measure of the difference between two probability distributions p and q.
For probability distributions p and q of a discrete random variable X, their Kullback-Leibler divergence is defined to be
$$D_{KL}\big(p(X) \,\|\, q(X)\big) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}$$
In words, it is the expected value, taken with respect to p, of the logarithmic difference between the probability distributions p(X) and q(X). In contrast to the mutual information, the relative entropy is non-symmetric.
Example 1
We once again use tossing coins as an example. By default, we would expect that any given coin is fair and assume a model q(Heads)=q(Tails)=0.5. As it turns out, in repeated coin tosses, we observe a probability of p(Heads)=0.75 and of p(Tails)=0.25. We can now use the Kullback-Leibler Divergence to establish the “distance” or “distortion” between the originally assumed distribution q(x) and the observed distribution p(x).
$$
\begin{aligned}
D_{KL}\big(p(X) \,\|\, q(X)\big) &= \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)} \\
&= p(Heads) \log_2 \frac{p(Heads)}{q(Heads)} + p(Tails) \log_2 \frac{p(Tails)}{q(Tails)} \\
&= 0.75 \log_2 \frac{0.75}{0.5} + 0.25 \log_2 \frac{0.25}{0.5} \\
&= 0.188722 \text{ bits}
\end{aligned}
$$
Example 2
For another illustration we use an example from the field of meteorology. More specifically, we look at the rainfall in two cities in the state of Victoria, Australia. We used daily rainfall data measured at Geelong Airport and at Melbourne Tullamarine Airport, which are approximately 80 kilometers apart, over the entire calendar year of 2010. Given the proximity of those locations, one would generally expect similar weather. Perhaps the Geelong weather isn’t reported in the Melbourne newspapers and so a traveler wants to use the Melbourne weather as a proxy. However, the actual weather station observations tell us that there is rain in Melbourne on 40.3% of the days, whereas Geelong sees rainfall on 47.4% of the days in the year.
We can now compute the Kullback-Leibler Divergence for these two distributions, where p_Geelong(x) stands for the Geelong and p_Melbourne(x) for the Melbourne rain probability distribution.
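$$
\begin{aligned}
D_{KL}\big(p_{Geelong}(X) \,\|\, p_{Melbourne}(X)\big) &= \sum_{x \in X} p_{Geelong}(x) \log_2 \frac{p_{Geelong}(x)}{p_{Melbourne}(x)} \\
&= p_{Geelong}(x = \text{No Rain}) \log_2 \frac{p_{Geelong}(x = \text{No Rain})}{p_{Melbourne}(x = \text{No Rain})} + p_{Geelong}(x = \text{Rain}) \log_2 \frac{p_{Geelong}(x = \text{Rain})}{p_{Melbourne}(x = \text{Rain})} \\
&= 0.526 \log_2 \frac{0.526}{0.597} + 0.474 \log_2 \frac{0.474}{0.403} \\
&= 0.0148958 \text{ bits}
\end{aligned}
$$
$$
\begin{aligned}
D_{KL}\big(p_{Melbourne}(X) \,\|\, p_{Geelong}(X)\big) &= \sum_{x \in X} p_{Melbourne}(x) \log_2 \frac{p_{Melbourne}(x)}{p_{Geelong}(x)} \\
&= p_{Melbourne}(x = \text{Rain}) \log_2 \frac{p_{Melbourne}(x = \text{Rain})}{p_{Geelong}(x = \text{Rain})} + p_{Melbourne}(x = \text{No Rain}) \log_2 \frac{p_{Melbourne}(x = \text{No Rain})}{p_{Geelong}(x = \text{No Rain})} \\
&= 0.403 \log_2 \frac{0.403}{0.474} + 0.597 \log_2 \frac{0.597}{0.526} \\
&= 0.0147077 \text{ bits}
\end{aligned}
$$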
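Both examples can be recomputed with a short Python sketch of the Kullback-Leibler formula. The function name `kl_divergence` and the dictionary representation are choices made here for illustration. The inputs are the rounded probabilities quoted in the text, so the last digits may differ slightly from the values printed above.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same states."""
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p if p[s] > 0)

# Example 1: observed coin versus the assumed fair coin.
coin_observed = {"Heads": 0.75, "Tails": 0.25}
coin_assumed = {"Heads": 0.5, "Tails": 0.5}
d_coin = kl_divergence(coin_observed, coin_assumed)

# Example 2: rain probabilities as rounded in the text.
geelong = {"Rain": 0.474, "No Rain": 0.526}
melbourne = {"Rain": 0.403, "No Rain": 0.597}
d_geelong_melbourne = kl_divergence(geelong, melbourne)
d_melbourne_geelong = kl_divergence(melbourne, geelong)

print(f"coin: {d_coin:.6f} bits")
print(f"Geelong || Melbourne: {d_geelong_melbourne:.4f} bits")
print(f"Melbourne || Geelong: {d_melbourne_geelong:.4f} bits")
```

Swapping the two arguments changes the result, which illustrates the non-symmetry of the relative entropy.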
BayesiaLab’s primary metric, the Arc Force, is directly proportional to the relative entropy and describes the strength of the directional link between two variables. More specifically, it describes the difference between the joint probability distributions with and without the particular arc.
Comparison of Methods
Approach
We believe that we can best facilitate a comparison of the statistical factor analysis and latent factor induction by working through an example. We draw upon the familiar dataset from the previously presented case study from the perfume industry, hereafter referred to as the “Perfume Study.”1
We begin our tutorial with the Data Import process for BayesiaLab, although it is not meant to be at the core of the comparison. It is important though to spell out the data pre-processing steps in BayesiaLab, as they highlight some of the fundamental differences between probabilistic and statistical approaches.
Once the data preparation is complete, we first present the probabilistic latent factor induction workflow with BayesiaLab and then provide an example of a statistical factor analysis. For the statistical factor analysis, we will use STATISTICA 10 as the software platform, although most steps are fairly generic and could be reproduced with a number of other statistical software packages as well.
Notation
To clearly distinguish between natural language, software-specific functions and study-specific variable names, the following notation is used:
• BayesiaLab-specific functions, keywords, commands, etc., are capitalized and shown in bold type.
• Names of attributes, variables, nodes and factors are italicized.
• At appropriate points in the text, grey boxes highlight parallels between the two presented methods:
Probabilistic Latent Factor Induction ↔ Statistical Factor Analysis
Key Terminology
• “Observed” and “manifest” are used interchangeably and describe those random variables that have been measured by the researcher. Each variable measure
• The terms “latent” or “unobserved” are used interchangeably in the context of hidden concepts or factors, which cannot be measured, but can potentially be extracted or induced. In our context, the term factor stands exclusively for latent variables. Consequently, the terms “factor”, “factor variable”, “latent variable” and “unobserved variable” are equivalent.
1 Conrady and Jouffe (2010)
Data Set
The Perfume Study is based on a monadic consumer survey about a range of fragrances, which was conducted in France. In this example we use survey responses from 1,321 women, who have evaluated a total of 11 fragrances on a wide range of attributes:
• 27 ratings on fragrance-related attributes, such as “sweet”, “flowery”, “feminine”, etc., measured on a 1-to-10 scale.
• 12 ratings on projected imagery related to someone who would be wearing the respective fragrance, e.g. “is sexy”, “is modern”, measured on a 1-to-10 scale.
• 1 variable for Intensity, a measure reflecting the level of intensity, measured on a 1-to-5 scale.
• 1 variable for Purchase Intent, measured on a 1-to-6 scale.
• 1 nominal variable, Product, for product identification purposes.
Probabilistic Latent Factor Induction with BayesiaLab
Data Import
To start the process with BayesiaLab, we first import the data set, which is formatted as a CSV file.2 With Data>Open Data Source>Text File, we start the Data Import wizard, which immediately provides a preview of the data file.
The table displayed in the Data Import wizard shows the individual variables as columns and the survey responses as rows. There are a number of options available, e.g. for sampling. However, sampling is not necessary in our example given the relatively small size of the database.
Clicking the Next button prompts a data type analysis, which provides BayesiaLab’s best guess regarding the data type of each variable.
Furthermore, the Information box provides a brief summary regarding the number of records, the number of missing values, filtered states, etc.3
2 CSV stands for “comma-separated values”, a common format for text-based data files.
3 There are no missing values in our database and filtered states are not applicable in this survey.
For this example, we will need to override the default data type for the Product variable, as each value is a nominal product identifier rather than a numerical scale value. We can change the data type by highlighting the Product variable and clicking the Discrete check box, which changes the color of the Product column to red.
We will also define Purchase Intent and Intensity as discrete variables, as the default number of states of these variables is already adequate for our purposes.4
The next screen provides options as to how to treat any missing values. In our case, there are no missing values so the
corresponding panel is grayed-out.
Clicking the small upside-down triangle next to the variable names brings up a window with key statistics of the
selected variable, in this case Fresh.
4 The desired number of variable states is largely a function of the analyst’s judgment.
The next step is the Discretization and Aggregation dialogue, which allows the analyst to determine the type of discretization that must be performed on all continuous variables.5 For this survey, and given the number of observations, it is appropriate to reduce the number of states from the original 10 states (1 through 10) to a smaller number. One could, for instance, bin the 1-10 rating into low, mid and high, or apply any other method deemed appropriate by the analyst.
The screenshot shows the dialogue for the Manual selection of discretization steps, which allows the analyst to select binning thresholds by point-and-click.
5 BayesiaLab requires discrete distributions for all variables.
For this particular example, we select Equal Distances with 5 intervals for all continuous variables. This was the
analyst’s choice in order to be consistent with prior research.
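To make the idea of equal-distance binning concrete, the following Python sketch splits a 1-to-10 rating scale into 5 intervals of equal width. It is a generic illustration and not BayesiaLab's implementation; in particular, the function name `equal_distance_bins` and the choice of the scale endpoints 1 and 10 as interval boundaries are assumptions made here.

```python
def equal_distance_bins(values, n_bins, low, high):
    """Assign each value to one of n_bins equal-width intervals spanning [low, high]."""
    width = (high - low) / n_bins
    bins = []
    for v in values:
        index = int((v - low) / width)
        # The upper boundary of the scale belongs to the last interval.
        bins.append(min(index, n_bins - 1))
    return bins

ratings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
states = equal_distance_bins(ratings, n_bins=5, low=1, high=10)
print(list(zip(ratings, states)))
```

With these assumed boundaries, each of the 5 states covers two adjacent ratings, so the original 10 rating values are reduced to 5 discrete states.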
Clicking Select All Continuous followed by Finish completes the import process and the 49 variables (columns) from our database are now shown as blue nodes in the Graph Panel, which is the main window for network editing. By
default, all variables are represented as nodes. This initial view represents a fully unconnected Bayesian network.
Note
For choosing discretization algorithms beyond this example, the following rule of thumb may be helpful:
• For supervised learning, choose Decision Tree.
• For unsupervised learning, choose, in the order of priority, K-Means, Equal Distances or Equal Frequencies.
In the above graph, two variables play a fundamentally different role. The values of Product represent categories and Purchase Intent is the overall target variable, i.e. the dependent variable of the Perfume Study. Thus both will be excluded from the factor generation process.
While correlation and covariance are the central measures for statistical factor analysis, learning Bayesian networks with BayesiaLab (and thus probabilistic factor induction) is based on measures from information theory, such as the Kullback-Leibler Divergence, which was introduced in the first chapter.
The Kullback-Leibler Divergence can be obtained after learning an initial Bayesian network with one of BayesiaLab’s unsupervised learning algorithms. “Unsupervised” implies that the learning algorithm searches for an overall representation of the joint distribution of the underlying data rather than the characterization of an individual target variable.
In our example, we use BayesiaLab’s EQ algorithm to obtain a Bayesian network.
As this view of the network is not easily readable, BayesiaLab has numerous built-in layout algorithms, of which the
Force Directed Layout is perhaps the most commonly used. It can be invoked by View>Automatic Layout>Force Directed Layout or alternatively through the keyboard shortcut “p”.
The resulting network will look similar to the following screenshot.
Completed Bayesian Network upon EQ Learning
With the network established, we can now further examine the probabilistic relationships between the nodes, which are represented as arcs.6 By selecting Analysis>Graphic>Arc Force, we can show the probabilistic strength of the arcs, which is visualized by the thickness of the arcs.
6 “Arcs” are directed links or edges between nodes, which appear as arrows in the graph.
Network with Arc Force
The numeric values of the Arc Force can be shown by selecting View>Display Arc Comments. In the network shown
below, the Arc Force values are presented in yellow boxes attached to each arc.
Network with Arc Force
Arc Force ↔ Covariance
In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.
Variable Clustering
With Arc Force established as the key measure across the entire network, BayesiaLab can determine clusters of variables that are “close” in a probabilistic sense. This can be initiated from the menu via Analysis>Graphic>Variable Clustering.
The clustering algorithm is iterative and starts with the two variables whose connecting arc has the strongest Arc Force. The following sequence of screenshots illustrates this algorithm conceptually in “slow motion,” as the analyst would not see these individual steps in the actual workflow.
As a starting point, every manifest variable is treated as a distinct cluster and so we have 47 clusters. Using the Kullback-Leibler Divergence as a measure, the “closest” variables are then merged into one concept. As a result, we first obtain 46 clusters, then 45, etc., as shown in the array of dendrograms below. BayesiaLab proposes to conclude this algorithm upon finding 15 clusters. However, the analyst has the ability to override this automatic selection. As the choice of clusters appears to be generally compatible with our interpretation of the variable names, we accept this recommendation.
Sequence of Dendrograms (47, 46, 45, 44, ..., 16 and 15 clusters)
Because of the importance of this process, we will also show it from another angle, i.e. by looking at sequential views of
the graph.
Step 0 - 47 Clusters
Step 1 - 46 Clusters: Pleasure merged with Corresponds
The strongest Arc Force exists between Pleasure and Corresponds and BayesiaLab will form an interim concept from them. The next-highest Arc Force then determines whether another variable is merged with the first concept or whether a new concept is created. In our case, Radiant and In Love are combined as a new concept.
Step 2 - 45 Clusters: Radiant merged with In Love
In the third step, we see Sensual and Romantic merged into a new latent concept, and so on.
Step 3 - 44 Clusters: Sensual merged with Romantic
Upon completion of this process, BayesiaLab forms variable/node clusters from these common concepts and color-codes
them accordingly.
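The merging sequence described in steps 0 through 3 can be mimicked with a simple greedy procedure: sort the arcs by strength and merge the clusters at the two ends of each arc until a target number of clusters remains. The Python sketch below is a simplified illustration of that idea only; it is not BayesiaLab's actual algorithm, and the arc strengths, the extra variable Fresh in the example and the function name `merge_by_strongest_arcs` are hypothetical.

```python
def merge_by_strongest_arcs(variables, arc_forces, target_clusters):
    """Greedily merge clusters along the strongest arcs until target_clusters remain."""
    cluster_of = {v: {v} for v in variables}   # step 0: one cluster per variable
    n_clusters = len(variables)
    strongest_first = sorted(arc_forces.items(), key=lambda item: item[1], reverse=True)
    for (a, b), _force in strongest_first:
        if n_clusters <= target_clusters:
            break
        if cluster_of[a] is cluster_of[b]:
            continue                            # both ends already share a cluster
        merged = cluster_of[a] | cluster_of[b]
        for v in merged:
            cluster_of[v] = merged
        n_clusters -= 1
    unique = {id(c): c for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique.values())

variables = ["Pleasure", "Corresponds", "Radiant", "In Love", "Sensual", "Romantic", "Fresh"]
# Hypothetical arc strengths, chosen only to reproduce the order of merges in the text.
arc_forces = {
    ("Pleasure", "Corresponds"): 0.9,
    ("Radiant", "In Love"): 0.8,
    ("Sensual", "Romantic"): 0.7,
    ("Pleasure", "Radiant"): 0.3,
    ("Fresh", "Sensual"): 0.1,
}
clusters = merge_by_strongest_arcs(variables, arc_forces, target_clusters=4)
print(clusters)
```

In this toy setting the first three merges match the ones shown above (Pleasure with Corresponds, Radiant with In Love, Sensual with Romantic), and the procedure stops once 4 clusters remain, in the same way that the analyst accepts or overrides the proposed number of clusters.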
Network with Color-Coded Variable Clusters
By clicking the Validate Clustering button, we can now formally fix the new latent factor variables. The new latent factors are shown in the following table with their associated observed variables. By default, they are given the name “Factor” plus a numeric suffix.
Latent Factor Induction
Upon definition of the new latent factor variables, we now want to make them available for modeling purposes. Although these latent factors exist as new concepts and are conceptually linked to the manifest variables, the factors do not yet have any values or states.
This will now happen in the Multiple Clustering process, which creates discrete states for each latent factor variable by performing data clustering over the linked manifest variables.
More specifically, the states of each latent factor will be created in such a way that they best summarize the joint probability distribution defined by the manifest variables. Factor 0 and its linked manifest variables are shown below.
Subnetwork for Factor 0
The following Monitors display the marginal probability distributions of the variables associated with Factor 1; Factor 1 itself and its states are highlighted in red. We can see that 5 states were created for Factor 1, labelled C1 through C5, and that each has an expected value, which is shown in parentheses. For instance, state C2 has an expected value of 9.21. This means that, given that C2 is observed, the mean value of the manifest variables, weighted by their relation with C2, is equal to 9.21. In other words, C2 corresponds to high ratings with regard to those 5 dimensions.
By selecting specific states of Factor 0 in the Monitor Panel, we can see the conditional distributions of the manifest variables. The states C2 and C3 are displayed for reference below. They can be easily interpreted by looking at the associated values, e.g. state C2 appears to reflect high ratings of the manifest variables, whereas state C3 captures very low ratings.
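As a rough illustration of how data clustering over manifest variables can yield discrete factor states, the sketch below groups respondents with a plain k-means loop and reports the mean rating per state. K-means is used here as a stand-in technique; the clustering method inside BayesiaLab's Multiple Clustering is not described in this paper, the unweighted mean is a simplification of the weighted expected value mentioned above, and the ratings and the function name `induce_states` are hypothetical.

```python
import numpy as np

def induce_states(ratings, n_states, n_iter=20):
    """Cluster respondent rows into n_states groups with a plain k-means loop."""
    # Deterministic start: pick rows at evenly spaced ranks of the mean rating.
    order = np.argsort(ratings.mean(axis=1))
    picks = order[np.linspace(0, len(order) - 1, n_states).astype(int)]
    centers = ratings[picks].astype(float)
    for _ in range(n_iter):
        # Distance of every respondent to every cluster center.
        distances = np.linalg.norm(ratings[:, None, :] - centers[None, :, :], axis=2)
        states = distances.argmin(axis=1)
        for s in range(n_states):
            if np.any(states == s):
                centers[s] = ratings[states == s].mean(axis=0)
    return states, centers

# Hypothetical ratings: 6 respondents x 3 manifest variables on a 1-to-10 scale.
ratings = np.array([
    [9, 10, 9],
    [8, 9, 10],
    [2, 1, 2],
    [1, 2, 1],
    [5, 6, 5],
    [6, 5, 6],
])
states, centers = induce_states(ratings, n_states=3)
state_means = centers.mean(axis=1)   # unweighted mean rating per state

print("state per respondent:", states.tolist())
print("mean rating per state:", np.round(state_means, 2).tolist())
```

Each respondent ends up with exactly one state, and the per-state mean plays a role loosely comparable to the expected value shown in parentheses next to each factor state.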
A more general analysis of the relationships between manifest variables and latent factors can be obtained through
Analysis>Reports>Relationship Analysis:
This chart summarizes the values of key clustering measures, such as the Kullback-Leibler Divergence, for every manifest variable associated with Factor 0. For reference only, it also includes Pearson’s Correlation Coefficient R.
It is also possible to visualize the mean values of the manifest variables (x-axis) along with the Mutual Information (y-axis, left panel) and the Standardized Total Effect (y-axis, right panel) for the latent factor variable.
Although we have now defined new factor variables, we have not yet seen the original matrix of survey responses in terms of the new factor variables. For instance, every respondent record has a value for Active, Fulfilled, Trust, etc., as these variables were observed and recorded in the survey, but how do we find the values (or states) of the new latent factors for each respondent record?
Actually, at the conclusion of the Multiple Clustering process, BayesiaLab has introduced the new factors into the original network. By using BayesiaLab’s imputation process, which is based on maximum likelihood, they were added as new nodes to the graph and also saved as new columns (or fields) to the database.
Relationship Analysis ↔ Factor Loadings
This summary of clustering measures in the Relationship Analysis allows an interpretation, which is very similar to what is provided with factor loadings.
Latent Factors Introduced into Network
We can easily verify that each new factor has a value for each respondent record. We start Inference>Interactive Inference, which allows us to scroll through the survey records and view the values of any variable, including the values of the new latent factors.
Factor Induction ↔ Saving Factor Scores
Introducing the new latent factors into the network is equivalent to adding the factor scores to the original observation matrix.
For instance, survey record #0 is expressed as state C4 in terms of Factor 0. The states of the manifest variables are
shown for reference.
Record #8, for example, is assigned to state C3:
Now we have the entire set of respondent records re-expressed in terms of 15 latent factors, which allows us to use
them for all kinds of modeling purposes.
Given the importance of latent factors for interpretation, we will assign descriptive labels to each of them. BayesiaLab
can visually aid in this process by showing the latent factors and their relationships to the original manifest variables. This means, we will simply learn a new network, which includes both factor variables and manifest variables.
Network including Latent Factors and Manifest Variables
The emerging network structure clearly lends itself to defining descriptive labels, which are applied to the factors in the following graph.7
7 See Conrady and Jouffe (2010) for a more detailed explanation of the interpretation process.
Network including Latent Factors and Manifest Variables plus Factor Labels
It is important to reiterate that the latent factors generated here are not orthogonal, which means that probabilistic relationships exist between the factors. For illustration purposes, we can highlight the latent factors and exclude the manifest variables from being displayed. In addition, the following graph also displays the Arc Force between each latent factor, providing further confirmation that the latent factors are not independent.
Network with Latent Factors and Arc Forces
Statistical Factor Analysis
Perhaps the most common approach for extracting factors from a set of observed variables is Principal Component Analysis (PCA), and it is frequently considered a synonym for factor analysis.[8] For our purpose, we look at PCA as a prototypical tool for factor extraction, which lends itself to comparison with the latent factor induction with BayesiaLab presented earlier.
Principal Component Analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations, represented by matrix X, of possibly correlated variables into a set of values of uncorrelated variables called principal components, to be represented by a new matrix Y. The goal of this transformation is to minimize redundancy (measured by covariance) and to maximize the signal (measured by variance).
This transformation is defined in such a way that the first principal component has the highest possible variance, i.e. accounting for as much of the variability in the data as possible. In turn, each succeeding component has the next-highest variance while being orthogonal to (uncorrelated with) the preceding components.
Conceptual Illustration of Principal Component Vectors
More formally, PCA creates a re-expression of the original data set on the basis of a new set of orthonormal vectors, replacing the original set of “naive” basis vectors, which resulted from the choice of measurements.[9]
[8] There are differences between PCA and the more general concept of factor analysis, but explaining those goes beyond the scope of this paper.
[9] Any observed variable automatically establishes a basis vector. Measuring 47 variables would thus result in a 47-dimensional coordinate system.
In matrix notation, this can be expressed as follows:
PX = Y
with X being the matrix of original observations and P being a yet-to-be-determined orthonormal matrix that transforms X into Y. Interpreting this geometrically, P is a rotation and stretch to generate Y. The rows of P, {p_1, …, p_m}, are the new set of basis vectors for expressing the columns of X. Writing out the explicit dot products may better illustrate this.
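\[
PX = \begin{pmatrix} p_1 \\ \vdots \\ p_m \end{pmatrix} \begin{pmatrix} x_1 & \cdots & x_n \end{pmatrix}
\]
\[
Y = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix}
\]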
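The dot-product form of Y = PX can be checked numerically. The following is a minimal sketch in Python with NumPy; the toy matrix X and the 45-degree rotation P are hypothetical values chosen for illustration and are not taken from the survey data.

```python
import numpy as np

# Hypothetical toy data: m = 2 variables (rows), n = 3 observations (columns).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# A hypothetical orthonormal P (rotation by 45 degrees).
# Its rows p_1 and p_2 are the new basis vectors.
c = np.sqrt(0.5)
P = np.array([[c, c],
              [-c, c]])

# Matrix form of the transformation.
Y = P @ X

# Explicit form: entry (i, j) of Y is the dot product of row p_i with column x_j.
Y_explicit = np.array([[np.dot(P[i], X[:, j]) for j in range(X.shape[1])]
                       for i in range(P.shape[0])])
```

Both constructions yield the same matrix, which is all the notation above is meant to convey.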
This provides us with the general framework, but we have yet to determine what matrix P should be.
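This is the point where we need to introduce the concept of the covariance matrix (C_X). Assuming that each row of X, i.e. each observed variable, has been centered to zero mean, it is defined as
\[
C_X = \frac{1}{n-1} X X^T
\]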
• C_X is a square and symmetric m × m matrix.
• The elements on the diagonal of C_X represent the variance of the observed variables.
• The off-diagonal elements of C_X represent the covariance between observed variables.
As a result, C_X captures the correlations between all possible pairs of observed variables.
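As a minimal sketch of this definition, the following Python snippet computes the covariance matrix from centered data and compares it with NumPy's built-in `np.cov`. The data are randomly generated for illustration only, and the layout (variables in rows, observations in columns) follows the notation used here rather than any particular software package.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 200                      # m variables (rows), n observations (columns)
X_raw = rng.normal(size=(m, n))    # hypothetical data, not the survey data
X_raw[1] += 0.8 * X_raw[0]         # make two variables correlated

# Center each variable (row) to zero mean, as the formula assumes.
X = X_raw - X_raw.mean(axis=1, keepdims=True)

# Covariance matrix: variances on the diagonal, covariances off the diagonal.
C_X = (X @ X.T) / (n - 1)
```

With rows as variables, `np.cov(X_raw)` returns the same m × m matrix, since it also centers the data and divides by n − 1 by default.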
This obviously relates to our objective of minimizing redundancy (measured by covariance) and maximizing the signal (measured by variance) of the target matrix Y. The optimum achievement of these goals would imply a diagonal covariance matrix of Y, i.e. with all off-diagonal elements being zero, and our objective thus translates into stipulating that C_Y must be diagonal. Fortunately, linear algebra provides several tools for diagonalizing a matrix.
More formally, the objective becomes finding some orthonormal matrix P where Y = PX such that C_Y is diagonalized. The rows of P are then the principal components.
Without providing further detail, the solution is:
• The principal components of X are the eigenvectors of XX^T, which form the rows of P.
• The i-th diagonal value of C_Y is the variance of X along p_i.
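The solution stated above can be illustrated with a short Python sketch based on NumPy's symmetric eigensolver `np.linalg.eigh`. The function name `pca_eig` and the simulated data are assumptions made for this illustration; the sketch uses the eigenvectors of C_X, which are the same as those of XX^T because the two matrices differ only by the factor 1/(n − 1).

```python
import numpy as np

def pca_eig(X_raw):
    """PCA via eigendecomposition; variables in rows, observations in columns."""
    X = X_raw - X_raw.mean(axis=1, keepdims=True)   # center each variable
    n = X.shape[1]
    C_X = (X @ X.T) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(C_X)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # reorder to descending variance
    eigvals = eigvals[order]
    P = eigvecs[:, order].T                         # rows of P are the principal components
    Y = P @ X
    return P, Y, eigvals

# Hypothetical data: 5 observed variables driven by 2 underlying sources plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(2, 500))
mixing = rng.normal(size=(5, 2))
X_raw = mixing @ latent + 0.1 * rng.normal(size=(5, 500))

P, Y, eigvals = pca_eig(X_raw)
C_Y = (Y @ Y.T) / (Y.shape[1] - 1)                  # covariance matrix of the transformed data
```

In this sketch, C_Y comes out diagonal (up to floating-point error), with the eigenvalues on the diagonal, which matches the two bullet points above.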
Factor Analysis with STATISTICA
Upon loading the survey data into STATISTICA, the respondent records will be presented as a data table, with the variable names shown as column headers and case numbers shown as row headers.[10] This represents our observation matrix X.
Observation Matrix X
As a starting point of the PCA process, we can display CX, the covariance matrix of X:
[10] We will skip a detailed description of the data import steps, as they are fairly generic and we assume that readers would use a wide array of statistical programs.
Covariance Matrix
As expected, there is a high amount of covariance, i.e. redundancy, between many of the observed variables. To get a
better sense of the magnitude of these pairwise relationships, it helps to display the correlation matrix for reference:
Arc Force ↔ Covariance
In BayesiaLab, Arc Force, a probabilistic measure based on the Kullback-Leibler Divergence, is the central measure for latent factor induction. In statistical factor analysis, covariance, correlation and, in particular, the covariance matrix play the equivalent role.
Correlation Matrix
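The correlation matrix is simply the covariance matrix rescaled by the standard deviations of the variables. A minimal Python sketch of this relationship follows; the helper name `correlation_from_covariance` and the random data are hypothetical and serve only as an illustration.

```python
import numpy as np

def correlation_from_covariance(C):
    """Rescale a covariance matrix so that every variable has unit variance."""
    std = np.sqrt(np.diag(C))          # standard deviation of each variable
    return C / np.outer(std, std)      # r_ij = c_ij / (s_i * s_j)

# Hypothetical data: 3 variables (rows), 100 observations (columns).
rng = np.random.default_rng(2)
X_raw = rng.normal(size=(3, 100))
X_raw[2] = 0.7 * X_raw[0] + 0.3 * X_raw[2]   # induce some correlation

C_X = np.cov(X_raw)
R_X = correlation_from_covariance(C_X)
```

The result agrees with `np.corrcoef(X_raw)`, and all diagonal entries equal 1, which makes the magnitudes of the pairwise relationships directly comparable.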
STATISTICA, like many other statistical software packages, has built-in routines that can compute the matrix P of principal components automatically. There are several methods available for solving the PCA, including the approach using the eigenvectors of the covariance matrix, which was shown earlier.
Regardless of the computational method used, the solution of the PCA provides as many eigenvalues as there are observed variables. The sum of all eigenvalues equals the number of observed variables, in our case 47, which holds when the variables are standardized, i.e. when the analysis is based on the correlation matrix. This allows us to determine the share of variance attributable to each factor. For instance, the first factor has an eigenvalue of 29.6, which means that it accounts for 29.6/47 = 62.98% of the variance. Proceeding down the list, the eigenvalues decline in value and, correspondingly, so does their contribution to the total variance.
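The arithmetic behind the share of variance can be sketched as follows. The function name `explained_variance_share` and the random data are assumptions for illustration; only the figures 29.6 and 47 come from the text.

```python
import numpy as np

def explained_variance_share(eigenvalues):
    """Share of total variance attributable to each component."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return eigenvalues / eigenvalues.sum()

# Figures quoted in the text: eigenvalue 29.6 out of a total of 47.
first_share = 29.6 / 47            # roughly 0.6298, i.e. 62.98%

# Hypothetical check that correlation-matrix eigenvalues sum to the number of variables.
rng = np.random.default_rng(3)
X_raw = rng.normal(size=(6, 300))  # 6 variables, 300 observations
R = np.corrcoef(X_raw)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
shares = explained_variance_share(eigvals)
```

Because the trace of a correlation matrix equals the number of variables, the eigenvalues in this sketch sum to 6 and the shares sum to 1.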
List of Eigenvalues
Now that we have a measure of how much variance each successive factor extracts, we can return to the question of how many factors to retain, as the overall objective of this exercise is variable reduction. The precise number of factors to be retained is ultimately an arbitrary decision of the analyst, but factors with eigenvalues greater than 1 are typically considered candidates. A scree plot[11] is typically used to illustrate the eigenvalues of the extracted factors. Sometimes this provides a visual indication of a natural cutoff point between higher and lower eigenvalues. Here such a distinction cannot be made easily, so we defer to the rule-of-thumb and retain eigenvalues greater than 1.
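The rule-of-thumb can be written down in a few lines of Python. The function name `retain_by_eigenvalue` and the example eigenvalues are hypothetical; they are not the eigenvalues obtained in this study.

```python
import numpy as np

def retain_by_eigenvalue(eigenvalues, threshold=1.0):
    """Count the components whose eigenvalue exceeds the threshold."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return int(np.sum(eigenvalues > threshold))

# Hypothetical eigenvalues of a 5-variable correlation matrix (they sum to 5).
example_eigenvalues = [2.6, 1.3, 0.6, 0.3, 0.2]
k = retain_by_eigenvalue(example_eigenvalues)   # two eigenvalues exceed 1
```

The intuition is that a standardized variable contributes a variance of 1, so a factor with an eigenvalue below 1 extracts less variance than a single original variable.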
[11] The name “scree plot” is a metaphorical expression, as “scree” is the term for the accumulation of broken rock at the base of mountain cliffs. In the scree plot we want to distinguish the substantial eigenvalues from the “rubble” at the bottom.
Scree Plot
In the next step we turn to the interpretation of the extracted factors. The table below shows the factor loadings, which are the correlations of each observed variable with the extracted factors.
Factor Loadings
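To illustrate that factor loadings are correlations between observed variables and extracted factors, the following Python sketch computes them in two ways for a PCA on the correlation matrix: once as eigenvectors scaled by the square roots of their eigenvalues, and once as directly computed correlations with the component scores. The simulated data are hypothetical, and the sketch is not meant to reproduce STATISTICA's output.

```python
import numpy as np

# Hypothetical data: 4 observed variables built from 2 underlying sources.
rng = np.random.default_rng(4)
n = 400
latent = rng.normal(size=(2, n))
X_raw = np.vstack([
    latent[0] + 0.3 * rng.normal(size=n),
    latent[0] + 0.3 * rng.normal(size=n),
    latent[1] + 0.3 * rng.normal(size=n),
    latent[1] + 0.3 * rng.normal(size=n),
])
m = X_raw.shape[0]

# Standardize each variable, then form the correlation matrix.
Z = (X_raw - X_raw.mean(axis=1, keepdims=True)) / X_raw.std(axis=1, ddof=1, keepdims=True)
R = (Z @ Z.T) / (n - 1)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: column j of the eigenvector matrix scaled by sqrt(eigenvalue j).
loadings = eigvecs * np.sqrt(eigvals)

# Direct route: correlate every variable with every component score.
scores = eigvecs.T @ Z
direct = np.array([[np.corrcoef(Z[i], scores[j])[0, 1] for j in range(m)]
                   for i in range(m)])
```

Both routes give the same table. The squared loadings in each column sum to the corresponding eigenvalue, which ties the loadings back to the list of eigenvalues.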
Given the high eigenvalue of factor 1, it is not surprising that many variables are highly correlated with it. In our particular case, however, this correlation is mostly negative, which may be counterintuitive for interpretation purposes.
It is common practice to rotate factors in order to aid in the interpretation process. Intuitively speaking, the rotation is typically chosen in such a way that the principal factor, i.e. factor 1, aligns with what is commonly understood as the “positive x-axis.”
Such a factor rotation, for which several methods exist, was also performed with STATISTICA and the results appear in
the table below. In addition, factor loadings higher than 0.7 are highlighted.
Loadings on Rotated Factors
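The text does not state which rotation method was used. As one commonly used example, the following Python sketch implements a varimax rotation, which is an assumption on our part and not necessarily the method applied in STATISTICA. The function name `varimax` and the unrotated loadings are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)                      # start with no rotation
    objective = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Gradient of the varimax criterion with respect to the rotation matrix.
        grad = L.T @ (rotated ** 3
                      - rotated @ np.diag(np.sum(rotated ** 2, axis=0)) / p)
        U, s, Vt = np.linalg.svd(grad)
        R = U @ Vt                     # nearest orthogonal matrix to the gradient
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break                      # no meaningful improvement
        objective = new_objective
    return L @ R, R

# Hypothetical unrotated loadings: 4 variables, 2 factors.
unrotated = np.array([[0.7, 0.5],
                      [0.8, 0.4],
                      [0.6, -0.6],
                      [0.5, -0.7]])
rotated, R = varimax(unrotated)
```

Because the rotation matrix is orthogonal, the sum of squared loadings of each variable across the factors is unchanged. Only the distribution of the loadings across factors changes, which is what makes the rotated table easier to read.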
The analyst can now use these factor loadings to assign meaningful names to each factor. Some are quite obvious in their characterization, such as factor 3, which could be called “pleasant” or factor 4, which is quite obviously “classical.” It is also interesting to see that only one variable, i.e. Intensity, has a high loading on factor 2. This implies that perhaps Intensity is a standalone concept, which has little redundancy. On the other extreme, many variables have high loadings on factor 1, which makes identifying a distinct concept more elusive.
Relationship Analysis ↔ Factor Loadings
The summary of clustering measures in BayesiaLab’s Relationship Analysis allows an interpretation, which is very similar to what is provided with factor loadings.
Without completing this interpretation process, we turn to the “reduction” part by introducing the extracted factors as
variables into the original data set, i.e. replacing 47 variables with 6 variables. This is often referred to as “saving factor
scores,” with the factor scores being the values related to the original observations in this new coordinate system created by the extracted factors. Our observations now have new coordinates in a 6-dimensional coordinate system rather than
in one with 47 dimensions.
Factor Scores
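The idea of saving factor scores can be sketched as a projection of the standardized observations onto the retained components. In the Python sketch below, the function name `factor_scores` and the random data are hypothetical; the dimensions 47 and 6 mirror the text. Note that statistical packages may additionally rescale or rotate the scores, which this sketch does not do.

```python
import numpy as np

def factor_scores(X_raw, k):
    """Project standardized data (variables in rows) onto the top-k components."""
    Z = (X_raw - X_raw.mean(axis=1, keepdims=True)) / X_raw.std(axis=1, ddof=1, keepdims=True)
    R = (Z @ Z.T) / (Z.shape[1] - 1)        # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    P_k = eigvecs[:, order].T               # k x m matrix of retained components
    return P_k @ Z                          # k x n matrix of scores

# Hypothetical stand-in for the survey: 47 variables, 300 respondents.
rng = np.random.default_rng(5)
X_raw = rng.normal(size=(47, 300))
scores = factor_scores(X_raw, k=6)
score_cov = np.cov(scores)                  # covariance matrix of the 6 score variables
```

Each respondent is now described by 6 coordinates instead of 47, and the unrotated scores are mutually uncorrelated, as reflected in the diagonal `score_cov`.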
We now have the ability to create a wide range of models, for instance, modeling Purchase Intent as a function of the 6
new factors. This will undoubtedly be easier to interpret than a model, which includes all of the 47 original observed variables.
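As one possible illustration of such a model, the following Python sketch fits an ordinary least-squares regression of a simulated Purchase Intent variable on 6 factor scores using `np.linalg.lstsq`. Linear regression is our assumption here, since the text does not prescribe a model type, and all data and coefficients are simulated rather than taken from the study.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 300, 6

# Hypothetical factor scores, with respondents in rows and factors in columns.
scores = rng.normal(size=(n, k))

# Simulated target: Purchase Intent driven by some of the factors plus noise.
true_coef = np.array([0.8, 0.0, -0.5, 0.3, 0.0, 0.1])
purchase_intent = 3.0 + scores @ true_coef + 0.2 * rng.normal(size=n)

# Design matrix with an intercept column, followed by a least-squares fit.
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, purchase_intent, rcond=None)

intercept, factor_effects = coef[0], coef[1:]
```

With only 6 coefficients to inspect, each of which refers to a labeled factor, the fitted model is considerably easier to read than one with 47 partly redundant predictors.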
Latent Factor Induction ↔ Saving Factor Scores
Introducing the latent factors into the network is equivalent to adding the factor scores to the original observation matrix.
Conclusion
Although fundamentally different in their framework, statistical factor analysis and probabilistic latent factor induction have many parallels, which lend themselves to direct comparative interpretation. Given these parallels, analysts familiar with either domain should find it easy to translate their research workflow from one framework into the other. Equally, end users of research results, who may be less familiar with the underlying computations, should be in a position to interpret the findings from both methods in a very similar manner.
References
Conrady, Stefan, and Lionel Jouffe. “Driver Analysis & Product Optimization, A Case Study from the Perfume Industry”, December 1, 2010. http://www.conradyscience.com/index.php/driver-analysis.
Cover, T. M, and J. A Thomas. “Entropy, relative entropy and mutual information.” Elements of Information Theory (1991): 12–49.
Kachigan, Sam Kash. Multivariate Statistical Analysis: A Conceptual Introduction. 2nd ed. Radius Press, 1991.
MacKay, David J. C. Information Theory, Inference and Learning Algorithms. 1st ed. Cambridge University Press, 2003.
Shlens, J. “A tutorial on principal component analysis.” Systems Neurobiology Laboratory, University of California at San Diego (2005).
StatSoft, Inc. “Electronic Statistics Textbook.” Electronic Statistics Textbook, 2011. http://www.statsoft.com/textbook/.
Contact Information
Conrady Applied Science, LLC
312 Hamlet’s End Way
Franklin, TN 37067
USA
+1 888-386-8383
[email protected]
www.conradyscience.com
Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
www.bayesia.com
Copyright
© 2011 Conrady Applied Science, LLC and Bayesia SAS. All rights reserved.
Any redistribution or reproduction of part or all of the contents in any form is prohibited other than the following:
• You may print or download this document for your personal and noncommercial use only.
• You may copy the content to individual third parties for their personal use, but only if you acknowledge Conrady
Applied Science, LLC and Bayesia SAS as the source of the material.
• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.