Electronic copy available at: https://ssrn.com/abstract=3058881
From “Big Data” to “Smart Data”: Algorithm for Cross-Evaluation (ACE) as a Novel Method for Large-Scale Survey Analysis
Darko Kantoci
KanDar Enterprises, Inc. 250 Commercial St., Suite 3005F
Manchester, NH 03101 USA
E-mail: [email protected]
Emir Džanić*
Cambridge Innovative System Solutions Ltd
CPC1 – Capital Park, Fulbourn
Cambridge CB21 5XE, UK
E-mail: [email protected]
Marcel Bogers
University of Copenhagen Department of Food and Resource Economics
Unit for Innovation, Entrepreneurship and Management Rolighedsvej 25
1958 Frederiksberg C Denmark
E-mail: [email protected]
Accepted for publication in: International Journal of Transitions and Innovation Systems
21 October 2017

Abstract: Current research is increasingly relying on large data analysis to provide insights into trends and patterns across a variety of organizational and business contexts. Existing methods for large-scale data analysis do not fully capture some of the key challenges with data in large data sets, such as non-response rates or missing data. One method that does address these challenges is the SunCore Algorithm for Cross-Evaluation (ACE). ACE provides a view of the whole data set in a multidimensional mathematical space by performing consistency and cluster analysis to fill in the gaps, thereby illuminating trends and patterns previously invisible within such data sets. This approach to data analysis meaningfully complements classical statistical approaches. We argue that the value of the ACE algorithm lies in turning “big data” into “smart data” by predicting gaps in large data sets. We illustrate the use of ACE in connection with a survey on employees’ perception of the innovative ability within their company by looking at consistency and cluster analysis.

Keywords: statistical modelling; statistical algorithm; survey analysis; consistency analysis; cluster analysis; data trends; data patterns; data correlation; non-ignorable missing data; non-response missing data; cross evaluation; big data; smart data; innovation survey; food processing company.

*Corresponding author
1 Introduction
Surveys on the perception of employees or management in different industrial sectors have
been increasingly used to measure different parameters (employee satisfaction, managerial
performance, innovation capabilities, etc.) using classical statistical approaches (Kuroki,
2012; Yen-Ku and Kung-Don, 2010; Klein et al., 1971; Su et al., 2009; Kabaskal et al.,
2006). In the past 15 years, some specific fields of research, such as innovation, have
emerged as important for companies’ strategic positioning (Bigliardi and Gelatti, 2013;
Chesbrough and Bogers, 2014). This prominence has prompted researchers to conduct a
growing number of studies in that domain. Specifically, for innovation, surveys have become
more standardized and advanced
(Dadura and Lee, 2011; OECD, 2005; Cabral, 1998). It also became evident that they can be
used for not only measuring managerial attitude in multiple case studies, such as the one
Dadura and Lee (2011) conducted, but also on employees’ perception as described by
Linaker et al. (2015).
The concept of innovation, as an example, became increasingly interconnected with
other concepts. New approaches to measuring it constantly appear and involve methods that
are new to innovation studies, such as system dynamics (Savitskaya and Kortelainen, 2012)
and the graph theoretic method (Temel, 2016). Besides the introduction of new
methodological approaches, a more complete understanding of the innovation concept
requires involvement of more abstract topics such as organizational culture or organizational
climate (Linaker et al., 2015), as well as practical topics such as the introduction of
performance enhancing HRM practices that can lead to change in organizational behaviour or
culture (Lobanova and Ozolina-Ozola, 2014). This need immediately brings up the survey as
an important tool and source of information for innovation researchers. At the same time, this
area is not free of problems because innovation surveys may only measure innovation on a
generic level while leaving out specific attributes. This issue may be a challenge, for
example, in traditional industries where innovation is mainly aesthetic (Alcaide-Marzal and
Tortajada-Esparza, 2007). For that reason, the usefulness of a survey in a traditional industry
such as food processing can be improved with the use of a proper tool that can address
challenges of survey methodology. Furthermore, the results from such a survey, as described
by Cabral (1998), show that innovation is a very complex problem in the food industry and,
at a firm level, large firms tend to be more innovative than small ones. Finally, an innovation
survey on Italian manufacturing firms (Cesaratto, Mangano and Sirilli, 1991) presents
surprisingly different results regarding the number of R&D manufacturing firms, showing
that there are twice as many such firms as compared to the results of the annual survey on
research and development activities carried out by the Italian Central Statistical Office.
Therefore, the methodology for measuring innovation in the manufacturing industry needs to
be constantly adapted and developed to provide accurate data. In this paper, we will present
the SunCore Algorithm for Cross-Evaluation (ACE) as a novel statistical approach to survey
analysis in the context of employees’ perception of the innovative ability of their company.
To address the above-mentioned issues, we present a survey to analyse intrinsic
factors of a large food processing company’s innovativeness. Our approach will illustrate
how ACE can contribute to a better understanding of the nature of innovation through the
discovery of latent variables. We will illustrate how internal perspective and cultural
elements that are tightly bound to employees’ perception can be captured through such a
survey. We will also show that ACE can be used as an analytical tool that can allow for
repeatable and validated results to emerge for a company’s internal analytics. This approach
can be easily adopted for measurement on different levels, such as in industry.
2 Some current limitations of survey data analysis in innovation studies
Innovation is an increasingly important topic for company leadership and management as
they strive to develop and maintain high levels of competitiveness. Such a complex and
strategic topic involves understanding a complex nexus of products, processes, organizational
structures, culture, marketing, etc. In discovering new understandings of how innovation
arises within this nexus, a more ecological approach to innovation requires organizations to
research the interactions that occur between this nexus and the perceptions of the employees
involved (Dadura and Lee, 2011). However, employees’ perception of the innovativeness of
their company, especially in large companies, is rarely measured. This is in part because
analysis methods have been lacking due to the aforementioned limitations in survey based
research.
We argue that a valuable tool for enabling this type of research is the survey method in
combination with the ACE algorithm. The ACE algorithm has been used successfully in the
natural sciences to uncover correlations between cancer cells and anti-cancer drugs (Kantoci,
1999; Seles et al., 2016). Since the ACE algorithm is generic in nature, it can be applied to
any data set; for example, organizational, managerial, or in industries such as banking to
uncover hidden correlations. When applied to large data sets, ACE can provide value
extraction in conditions where large sets of data are expected to be missing due to a
participant’s unwillingness to reveal their beliefs, or a failure to answer particular
questions/statements because of a lack of information or knowledge. A core assumption is
that complete contribution to value extraction can only be achieved if all collected data can
be included in the analysis, even if that data set contains missing data. In that case, missing
data can contribute to data analysis to provide new understanding and insight for leaders and
management. The methodology we propose can be applied to initial analysis, followed by
actions targeting weak points in the data, and can be further expanded to longitudinal survey
analysis.
In this paper, we specifically address research challenges such as the consistency and
cluster analysis of survey data containing large data sets of non-ignorable, non-response data
(missing data to critical survey questions/statements due to a large proportion of incomplete
answers or no answers at all) by using ACE to process the entire data set at once. Data
analysis using powerful computational techniques serves, more than ever before, as a useful
tool to unveil trends and patterns that are hidden within data. This kind of data value
extraction provides new insights that meaningfully complement classical statistical
approaches, surveys, and other data sources (George et al., 2014). These approaches have the
ability to transform so-called “big data” into “smart data”: a shift from data volume to
data intelligence, focusing not on quantity as the key value but on the insights that can be
drawn from data analysis (George et al., 2014). For that reason, “big data” becomes an
important tool for the redefinition of individuals, organisations and entire ecosystems (Perko
and Ototsky, 2016).
Surveys as a data source, though widely used and highly valuable, face several
significant limitations including data composition (e.g. sample representation, sample
composition, incomplete/missing data) and our explanatory abilities (correlation vs.
causation). Limitations such as the inability to eliminate rival explanations, where one can
only find associations rather than causal relationships between variables (Singleton and
Straits, 2005), have long caused consternation. Another important limitation is connected to
measurement error where respondents may answer questions/statements in the direction of
social desirability rather than their real feelings (Singleton and Straits, 2005) or leave non-
ignorable missing data which is problematic yet potentially informative.
Two illustrative examples of how researchers have approached the non-ignorable
missing data problem, arising from non-responses in survey research include i) Foster and
Smith’s (1998) analysis of the 1992 British general election and ii) Bertoli-Barsotti and
Punzo’s (2014) work on healthcare worker assessments. Foster and Smith, looking at a
sample of the British electorate, suggested the adjustment of the sample size for surveys
where it was expected the data may contain a substantial non-ignorable non-response.
Bertoli-Barsotti and Punzo, using a large data set on the assessment of healthcare workers in
which there was a significant answer refusal rate from informants, applied a Rasch-Rasch
model (RRM) to overcome non-responses. Comparing these two approaches highlights a
significant change in the last 20 years. Researchers are moving from an approach of sample
size adjustment towards implementation of different computational algorithms that can
overcome non-ignorable, missing data. Put simply, researchers are moving from increasing
data size towards more sophisticated algorithmic models that can help fill in the gaps of
missing data.
In addition to these limitations, survey data analysis has other challenges, especially
in the number of dimensions that are evaluated through classical statistical analysis methods.
Familiar analyses such as the t-test, ANOVA, frequency analysis, mean, median, etc. work on
pairs of data rather than offering holistic, multi-dimensional analysis. Many developments
have sought to expand the capacity of such analysis practices. However,
even when advanced statistical methods were used such as multidimensional ANOVA,
clustering, and other algorithms that work on multidimensional data, these algorithms failed
to address:
1. Evaluation of the entire data matrix at once.
2. The discarding of large data sets due to missing data, since missing values skew
other answers (non-responses, spurious answers, etc.), especially in the case of
non-ignorable missing data.
3. The measurement of full data consistency (correlation and clustering), where
standard statistical approaches provide valuable analysis of data consistency
within and between groups but miss correlations with all other data.
To address the aforementioned limitations, this study explores the value of ACE as a
new mathematical model tool to predict answers based on the similarities of answers within a
group across a large data matrix. In so doing, we argue that this approach preserves statistical
power and enables us to use data that would otherwise be discarded (non-ignorable missing
data, spurious answers etc.). This model analyses data per all-with-all correlations, meaning
that each value relates to all other values in the entire data set based on its own value as well
as its distance to all other data in the matrix.
A key application of this approach, and one that we focus on in this paper, is the study
of organizational innovation.
To build the argument for the value of the ACE algorithm, and to substantiate the
methodological approach advocated, this paper mobilizes survey data from a large innovation
survey (126 statements) completed within a European food processing and pharmaceutical
company, yielding n = 495 responses.
This particular data set included enough non-responses to make those responses non-
ignorable. The analysis implies specific limitations and challenges to researchers engaged in
such research. However, with this computational algorithm, we were able to mobilize all of
the collected data, including missing data, to provide new insights through value extraction.
The algorithm allows statistical analysis by correlating all dimensions of the data within the
whole data matrix (including missing and existing data) at once.
The objective of this paper is to illustrate the capabilities of the ACE algorithm when
used to analyse a data matrix originating from an organizational/managerial survey
performed in a large organization. We will show that the ACE algorithm is valid not only for
the analysis of complex natural science cases (Kantoci, 1999; Seles et al., 2016) but also
for social science data that contains equally complex relationships between explicated and
extracted values.
3 At-once statistical analysis of survey data
3.1 Evaluation of the entire data matrix at once: ACE algorithm
Current methodologies in data analysis use classical statistical methods that can work on
narrow data sets, such as comparing two arrays. For example, classical cluster analysis
correlates data by local association factors. One can easily imagine that extracting similarity
patterns from data would provide another analytical dimension that is crucial for large data
samples (the “big data”). Such analysis would allow for the extraction of new parameters that
were not accessible through traditional statistical approaches.
An application of this type of analysis would be the extraction of values that provide
insight into similarities across all the data among groups of survey participants. To
achieve this level of analysis, a data matrix has to be analysed in its totality. This would
require an analysis tool operating in a multidimensional mathematical space that can handle
any number of dimensions simultaneously. The ACE algorithm supplies such a tool.
Moreover, it identifies similarities in data and groups similar data together across all
dimensions. By so doing, similarity patterns can be extracted from the data by analysing the
entire data matrix all at once. In other words, the modelling mechanism can measure the
consistency of answers coming from multi-cluster samples that contain missing data,
analysing similarity patterns, and predicting the value of any missing data.
The application of the ACE algorithm, and its capabilities in full-matrix multi-
dimensional analysis, has been tested before. An earlier version of the algorithm was used
in a correlation analysis of the action of anti-cancer compounds against cancer cells
(Kantoci, 1999), both to find the best possible drug candidates for particular cancers and
to group (cluster) anti-cancer drugs based on their anti-cancer activity and chemical-structure
similarities and link them to
particular cancer cell lines. As described in Seles et al. (2016), ACE was used to determine
the shelf life of food products (Model 2). In considering these variables, ACE creates a non-
linear, multi-dimensional association between the data, and then displays the results in a
matrix and in a graphical representation. ACE evaluates data in such a way that each number
in the matrix “feels” all other numbers in the matrix, after which point the algorithm arranges
the data by calculated association coefficients. The algorithm is based on a complete linkage
approach that accounts for all similarities and variability in a clustering matrix.
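The complete-linkage rule described above can be sketched in a few lines. The following is a generic illustration of complete-linkage agglomerative clustering (the distance between two clusters is the distance between their farthest members), not the ACE implementation; the point values and target cluster count are invented for the example.

```python
def complete_linkage(points, n_clusters):
    """Toy complete-linkage clustering on 1-D points.

    Repeatedly merges the pair of clusters whose FARTHEST members are
    closest, so every similarity and all variability between groups is
    accounted for. Illustrative of the linkage rule only.
    """
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: maximum pairwise squared distance
                d = max((a - b) ** 2 for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

print(complete_linkage([1.0, 1.2, 5.0, 5.3, 9.0], 3))
# → [[1.0, 1.2], [5.0, 5.3], [9.0]]
```

The same rule generalizes to multi-dimensional data by replacing the squared difference with a squared Euclidean distance over all dimensions.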
In this previous study, as well as within this study, the focus is on how the ACE
algorithm follows the “at-once” principle of analysing all data with all other data in the
matrix in squared Euclidean space. It is a so-called “in place” algorithm since original data
values are not moved from their original position in the matrix during calculation. The
algorithm accomplishes these tasks by finding multi-dimensional correlations based on vector
orientations and scalar values. This approach uncovers all levels of correlation, from weakest
to strongest. The final data can be analysed in situ or rearranged to cluster similar values
together in all 3 dimensions as islands of similarities. After this rearrangement, it can be
easily determined which data are correlated and which are not. This can be done through data
frequency analysis that finds islands of identities in 3D space.
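The “all-with-all” evaluation in squared Euclidean space can be illustrated with a minimal sketch: every cell in the data matrix is scored against every other cell, and the score stays at the cell’s original position (the “in place” property). The function name and the simple normalization used here are illustrative assumptions, not the proprietary ACE formulas.

```python
import numpy as np

def all_with_all_association(matrix):
    """Illustrative all-with-all association in squared Euclidean space.

    For each cell, sum the squared differences to every other cell in
    the matrix, then normalize by the cell's own value. A sketch of the
    principle only, not the proprietary ACE code.
    """
    flat = matrix.ravel().astype(float)
    # squared Euclidean distance of each value to every other value
    diffs = (flat[:, None] - flat[None, :]) ** 2
    assoc = diffs.sum(axis=1)
    # "in place": scores keep the original matrix positions
    return (assoc / np.where(flat != 0, flat, 1)).reshape(matrix.shape)

data = np.array([[1.0, 2.0], [3.0, 4.0]])
scores = all_with_all_association(data)
print(scores.shape)  # same shape as the input matrix
```

Because every score depends on the whole matrix, the result can be rearranged afterwards to bring similar values together as “islands of similarities” without losing track of where each value came from.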
As concluded by Kantoci (1999), ACE may be applied to other situations where
details of grouping and relationships need to be extracted from large data sets.
While classical statistical methods can work on narrow data sets and can compare a
limited number of arrays, the ACE algorithm exploits an internal capability to evaluate the
entire data matrix at once to provide new opportunities for value extraction from different
data sources and scales. In what follows, we apply this capability particularly to the challenge
of missing data (non-ignorable non-response) and its impact on statistical analysis.
3.2 Missing data (non-ignorable non-response) impact on statistical analysis
For many researchers, missing data are a widespread problem. Data gathered from different
sources (surveys, experiments, and secondary sources) are often missing some data. Missing
data can impact the results of statistical analysis depending on the mechanism which caused
the data to be missing and the way in which the data analyst deals with it (Grace-Martin,
2001).
Subjects in survey studies often drop out before the study is completed for many
different reasons. Moreover, surveys often suffer missing data when participants refuse to, or
do not know how to answer a question.
Since most statistical procedures require a value for each variable, missing data are
problematic. Researchers face a challenge whenever a data set is incomplete.
When facing missing data, it is common to use complete case analysis (also called
listwise deletion) – analysing only the cases with complete data. Survey participants with
missing data on any variable are accordingly dropped from the analysis. Although this
approach has advantages (it is simple, easy to use, and the default in most statistical
packages), its limitations include a substantial reduction in sample size, leading to a severe
loss of statistical power, especially when many variables have missing data for a few cases. In
such instances, researchers can expect biased results, depending on why the data are missing
(Grace-Martin, 2001).
Processing of such data usually involves mean value imputation (Durrant, 2009). For
missing data (data without a value), imputing the mean is equivalent to applying the
same non-response weight adjustment to all respondents in the same imputation class. It
assumes that non-response is uniform and that non-respondents have similar characteristics to
respondents. This method weakens distribution and multivariate relationships by creating
artificial spikes at the class mean (Government of Canada, 2010).
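The spike-and-shrinkage effect of mean imputation is easy to demonstrate with a toy example (the answer values below are invented for illustration):

```python
import statistics

# Sketch: mean imputation creates an artificial spike at the class mean
# and shrinks the spread of the imputed variable.
responses = [2, 3, 5, 6, None, None, None, 7, 1]  # None = non-response
observed = [r for r in responses if r is not None]
mean_val = statistics.mean(observed)
imputed = [mean_val if r is None else r for r in responses]

print(statistics.pstdev(observed))  # spread of the observed answers
print(statistics.pstdev(imputed))   # smaller: distribution weakened
print(imputed.count(mean_val))      # artificial spike: 3 identical values
```

The imputed series has a lower standard deviation than the observed answers and a cluster of identical values at the class mean, which is exactly the weakening of distributions and multivariate relationships described above.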
There are two important classes of missing data, ignorable and non-ignorable.
Ignorable missing data involves data that is missing completely at random (MCAR), and data
that is missing at random (MAR). Cases of ignorable missing data imply that the probability
of observing data items is independent of the value of that data item (Marlin, 2005).
However, non-ignorable missing data means that the missing data mechanism is related to the
missing values. It commonly occurs when people do not want to reveal something very
personal or unpopular about themselves. For example, if individuals with higher incomes are
less likely to reveal them on a survey than are individuals with lower incomes, the missing
data mechanism for income is non-ignorable.
Whether income is missing or observed is related to its value. Complete case analysis
can give highly biased results for non-ignorable missing data. If proportionally lower and
moderate-income individuals are left in the sample because high income people are missing,
an estimate of the mean income will be lower than the actual population mean. Non-ignorable
missing data are therefore more challenging and require a different approach (Grace-Martin,
2001).
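The income example can be made concrete with a small sketch (the income figures are invented for illustration):

```python
# Illustrative (invented) incomes: high earners are more likely to
# withhold their answer, so non-response depends on the missing value
# itself -- the non-ignorable case described above.
population = [30, 35, 40, 45, 90, 110, 120]   # true incomes (thousands)
reported = [30, 35, 40, 45, None, None, 120]  # high earners mostly refuse

true_mean = sum(population) / len(population)
complete_cases = [x for x in reported if x is not None]
cc_mean = sum(complete_cases) / len(complete_cases)

print(true_mean)  # about 67.1
print(cc_mean)    # 54.0 -- complete-case analysis underestimates the mean
```

Because the refusals are concentrated among high earners, dropping incomplete cases pulls the estimated mean well below the true population mean.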
In the presented survey, informants were given the option of refusing to answer a
statement – “I don’t know/I don’t want to answer” (coded with 0) – or of assigning a value
in a Likert scale (coded 1-7). This option directly connects the cause of non-response to the
latent variable of interest, thereby defining non-response as non-ignorable missing data
(Bertoli-Barsotti and Punzo, 2014). The latent variables of interest in the presented case are assumed
to be linked to an informant’s unwillingness to state an opinion, ignorance due to the
company’s inadequate education or information system, or due to their own beliefs. Survey
statements such as “our company has an R&D department” test informants’ knowledge of the
company’s organizational structure: a refusal to answer suggests that the informant is not well
informed about the existence of an R&D department. By
answering within an offered Likert scale (1-7), an informant confirms the existence of such
an organizational unit, while their refusal to answer (by choosing 0 - do not know) allows an
informant to clearly state a lack of awareness of an R&D department or her/his knowledge on
what the term “R&D department” means, thereby showing a lack of information on this
specific topic. Regardless of the latent variable causing a refusal to answer, researchers can
further investigate reasons for the appearance of missing data and act accordingly by
providing education and information to specific groups of employees. In some survey
statements, informants are stating their beliefs instead of their educated and informed answer,
e.g. in statements such as “I believe that my company will achieve increased sales figures in
the next 3 years” or “our reputation is better than our competitors’ reputation.” These types of
statements allow informants to state their beliefs and attitudes by choosing to support or not
support statements in a 1-7 Likert scale. When an informant chooses not to answer the
question, she/he is refusing to reveal their beliefs. This is missing data. In other words, data in
the presented case study has a relation to its value whether it is missing or observed.
The application of the ACE algorithm to this problem implies an ability to uncover
the most likely answers to unanswered questions/statements through Gaussian, skewness, and
frequency approximations. Based on all other answers, the algorithm can predict a most
likely response. If questions/statements are designed to lead to the same answer and are
scattered across other questions/statements, the likelihood of finding the closest match is
rather high. In the presented case, the probability of correctly predicting an answer reaches
99.6%. For example, survey respondents omitted a significant number of answers for various
reasons, yet the ACE algorithm was able to estimate the correct answers for a particular
statement at p < 0.05 (>99.5%) based on answers given by other respondents who answered
statements in a similar pattern to the respondent.
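The idea of predicting a missing answer from respondents with similar answer patterns can be sketched as a simple nearest-neighbour lookup. This is an illustrative stand-in for the principle, not the proprietary ACE predictor (which also uses Gaussian, skewness, and frequency approximations); all names and values below are invented.

```python
def predict_missing(target, others):
    """Fill a respondent's missing answers (None) from the most similar
    complete respondent. A nearest-neighbour sketch of the idea only."""
    def distance(a, b):
        # squared Euclidean distance over jointly answered statements
        pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return sum((x - y) ** 2 for x, y in pairs)

    nearest = min(others, key=lambda o: distance(target, o))
    return [nearest[i] if v is None else v for i, v in enumerate(target)]

respondent = [7, 6, None, 7, 1]             # None = refused / don't know
others = [[7, 6, 6, 7, 1], [1, 2, 2, 1, 7]]  # complete respondents
print(predict_missing(respondent, others))   # → [7, 6, 6, 7, 1]
```

The more that statements designed to elicit the same answer are scattered through the questionnaire, the more jointly answered positions there are to match on, which is why the likelihood of finding a close match rises with survey redundancy.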
In what follows, it will be demonstrated that the described methodology preserves all
responses and enters predicted values for missing data through ACE. If by prediction the
algorithm introduces high data bias, the clustering algorithm will group biased data together
and move them out of other good correlations.
Having discussed above how missing data presents significant limitations to survey
analysis, especially when non-ignorable non-response missing data is the case, we now turn
to the practice of using new algorithms to manage this challenge, as seen in the study by
Bertoli-Barsotti and Punzo (2014). In this case, we look at the value of the ACE algorithm
and its ability to provide response predictability through consistency and clustering
correlations.
3.3 Consistency: correlation and clustering
In this section, we look at the issue of consistency and clustering. Consistency is defined as
the grouping of similarities between data. Therefore, consistency of data depends not only on
answers that contain a value but also on the missing data that are a part of a data set. In the
present study, consistency in answers is assumed to be dependent on latent variables and
involves a whole data set regardless of data type (missing or existing). In the initial
exploration phase, when no historical data is available such as in the presented case of
employee innovation perception, the consistency data analysis will provide quantification and
metrics that can be used in the characterization of latent variables and for further longitudinal
studies. In other words, regardless of which of the aforementioned variables applies to
specific survey statements, the consistency in answering among informants is what matters for initial
surveys, such as in the example used here. This is especially true for surveys that intend to
reveal initial states in organizations and that face risks of high levels of non-ignorable
missing data.
In the presented case, the factors calculated through the ACE algorithm, as a separate
third dimension, have similar values for the data in both other dimensions (statements vs.
sectors). Therefore, through consistency analysis, correlation and clustering are achieved for
two dependent dimensions (statements vs. sectors). This allows researchers to identify
company sectors that were responding to survey statements most or least consistently, as well
as to identify critical survey statements for each company sector with respect to consistency
in answering.
Consistency analysis is an important result of the ACE algorithm since it is not
evaluating answers per se; it is evaluating relationships between data. The grouping function
of the algorithm is a critical part of data analysis since it will put similar data together
regardless of whether data have high or low scores. This is achieved through frequency and
rank algorithms. Whereas the frequency algorithm identifies identical responses across
medians and groups them together for both data axes, rank graphs are generated to represent
the normalized association patterns in 2-D space.
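The grouping idea behind the frequency step can be sketched as follows: rows (here, sectors) whose answer medians coincide fall into the same “island of similarity”. The sector names and answers are invented, and the rule shown is a simplification of the described mechanism, not the ACE frequency algorithm itself.

```python
from statistics import median
from collections import defaultdict

# Sketch: group rows whose answer medians coincide, regardless of
# whether the shared median is a high or a low score.
rows = {
    "sector A": [5, 5, 6, 5],
    "sector B": [5, 6, 5, 5],
    "sector C": [2, 1, 2, 2],
}
islands = defaultdict(list)
for name, answers in rows.items():
    islands[median(answers)].append(name)

print(dict(islands))  # → {5.0: ['sector A', 'sector B'], 2.0: ['sector C']}
```

Note that sectors A and B are grouped because their patterns match, not because their scores are high; sector C forms its own low-scoring island, exactly as described above.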
In addition, grouped similarities are also sorted by the frequency algorithm, thus
creating islands of similarities based on the scalar component of the ACE algorithm factors.
This means that similar values are grouped together based on their scalar components (ACE
factor). Again, the ACE algorithm approach takes into account all values, numerical as well
as missing, without destruction caused by, for example, the practice of imputing mean values.
4 Methods
4.1 Sample and data collection
In order to test the value of the ACE algorithm, we applied it to a large data set gathered from
a large European food processing and pharmaceutical company. The company, established
more than 70 years ago, is a dominant player in various markets within Europe and also
operates in China and Africa. The data set, as introduced above, arose from a survey on
employee perceptions of the innovation ability of the company.
The survey used was a modified version of the OECD survey on the innovation
abilities of a company developed by Dadura and Lee (2011) (see also OECD, 2002; OECD,
2005). The original questionnaire was applied to study innovation in the Taiwanese food industry.
In the case presented here, it was used to measure the perception of employees towards their
own company’s innovation capability. The questionnaire involves statements on 5 general
areas: products, processes, organization, marketing, and ecologic innovation (sustainability).
The survey was carried out during 2014 using an online version of the questionnaire
distributed through the company’s intranet. The survey comprised 126 statements. All
informants were assured of anonymity, and responses were sought from across specialist and
managerial functions of the organization (specialists and lower management,
middle management, upper middle management, and top management). The total possible
respondent pool was 902 across 7 large sectors of the organization. The allotted time for the
survey was 20 days, during which the response rate was 54% (n = 487).
Respondents were expected to answer statements using a Likert scale (false = 1, true = 7,
and 0 = don’t know/do not want to answer). The survey was controlled for possible multiple
selection or accidental skipping of answers through programmed checks that prevented
respondents from submitting the questionnaire if their responses contained any such error.
This allowed us to control for non-ignorable missing data by only allowing respondents to
generate missing data by selecting the 0 value assigned to the possible response “I don’t
know/I don’t want to answer.”
The original survey data were mutated (obfuscated) for this analysis through a
specially developed algorithm, so as not to disclose real survey information. The obfuscation
(mutation) algorithm is based on weighted randomization formulas. The mutated data kept
their original statistical power.
Each data point was multiplied by a random number scaled by the data point value.
The main requirement for the obfuscation algorithm is that it preserves statistical power and
preserves data relationships as realistically as possible.
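A multiplicative perturbation of this kind can be sketched in a few lines. The scale, seed, and function name below are illustrative assumptions, not the authors’ actual obfuscation parameters or formulas; the point is only that a small perturbation proportional to each value preserves rank order and relationships.

```python
import random

def obfuscate(values, scale=0.05, seed=42):
    """Sketch of weighted-randomization obfuscation: each data point is
    perturbed by a random factor scaled by its own value, so ordering
    and relationships are roughly preserved. Scale and seed are
    illustrative choices, not the authors' actual parameters."""
    rng = random.Random(seed)
    return [v + v * rng.uniform(-scale, scale) for v in values]

original = [1, 2, 3, 4, 5, 6, 7]
mutated = obfuscate(original)
# with a 5% perturbation the relative ordering of the values survives
assert sorted(range(7), key=lambda i: mutated[i]) == list(range(7))
```

Keeping the perturbation proportional to the value (rather than additive and fixed) is what lets the mutated data retain statistical power: large and small answers stay large and small relative to each other.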
4.2 ACE algorithm
The ACE algorithm (KanDar Enterprises, Inc., Manchester, NH, USA) was used for
complete data analysis, clustering, data obfuscation and prediction. It was developed on a
Presario C500 computer (Hewlett-Packard Company, Palo Alto, CA) running Windows
Server 2003 R2 SP2 (Microsoft Corporation, Redmond, WA) with a Microsoft Visual Studio
2010 (Microsoft Corporation, Redmond, WA) platform using C# programming language.
The current version of ACE integrates all calculations and sorting routines for the cluster matrix, frequency plots, and rank graphs under one programming language. The integrated version of the software is marketed under the trade name SunCore (KanDar Enterprises, Inc., Manchester, NH, USA) and is available through the author for commercial and non-commercial use. ACE calculates values for cluster analysis in the following steps:
1. Set statements in columns and respondents in rows.
2. Run the prediction algorithm to replace zeros with predicted data.
3. (For this paper, apply the obfuscation algorithm to mask real data.)
4. Calculate an association value for each data point in squared Euclidean space. The algorithm evaluates all-with-all points in the matrix and then normalizes the calculated parameters by the initial value.
5. Calculate medians in rows and columns. The generated values are referred to as the grouping index (GI): the median of the relevant values for each dimension of the data matrix.
6. Calculate frequency and rank. The frequency algorithm evaluates sector groupings after the clustering algorithm, serving the consistency analysis. The generated values are referred to as clustering factors.
7. Sort medians for statements and sectors.
8. Plot the consistency matrix.
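Steps 4, 5, and 7 above can be sketched roughly as follows. Since the ACE source code is proprietary, the exact association and normalization formulas are assumptions; this sketch only illustrates the shape of the computation (all-with-all squared Euclidean distances, normalization by the initial value, and medians taken as grouping indices).

```python
import numpy as np

def grouping_indices(matrix: np.ndarray):
    """Illustrative reconstruction of steps 4-5 and 7.

    For each data point, compute its mean squared-Euclidean association
    with all other points in the same column (all-with-all), normalize
    by the point's initial value, then take row and column medians as
    grouping indices (GI) and sort them for ranking.
    """
    n_rows, n_cols = matrix.shape
    assoc = np.zeros(matrix.shape, dtype=float)
    for j in range(n_cols):
        col = matrix[:, j].astype(float)
        d2 = (col[:, None] - col[None, :]) ** 2        # all-with-all distances
        assoc[:, j] = d2.mean(axis=1) / np.where(col == 0, 1.0, col)
    gi_rows = np.median(assoc, axis=1)                 # GI per respondent/sector
    gi_cols = np.median(assoc, axis=0)                 # GI per statement
    return np.sort(gi_rows), np.sort(gi_cols)

demo = np.array([[1, 2, 3],
                 [2, 2, 3],
                 [7, 1, 1],
                 [1, 2, 4]])
gi_rows, gi_cols = grouping_indices(demo)
```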
Because the ACE source code is not available for comparative studies, the code itself was not benchmarked against other algorithms here. It is therefore important to note that the ACE application was previously described in pharmacy (Kantoci, 1999) and food science (Seles et al., 2016), where its results were successfully compared to those obtained from other available algorithms.
4.3 Classical statistical analysis
To compare our results with classical statistics, we calculated survey data statistics using standard ANOVA and clustering algorithms. Since there are no directly comparable algorithms in the literature, ACE cannot be compared with them explicitly. To demonstrate ACE's versatility and robustness, we instead calculated clustering for standard shapes depicting correlation
factors in 3D plots (see below). These shapes are considered "natural" shapes (Everitt, 1974, pp. 60-64; Everitt et al., 2011a). As Everitt pointed out, "many [means algorithms] would have difficulty in recovering natural clusters." We therefore selected several natural clusters and applied the ACE clustering algorithm to test whether it could recover them. Because external validation was impossible for lack of similar algorithms, we analysed natural patterns such as the "moon/banana" shape and others, which Everitt identifies as the most difficult shapes to analyse because of their differing levels of symmetry and complexity (in mathematical terms). For comparative analysis, as previously published
(Kantoci, 1999), ACE clustered anti-cancer drugs based on their antineoplastic activity. Subsequent structure-activity relationship (SAR) analysis showed that all clustered drugs share the same basic chemical structure, with variations in the side groups. Furthermore, ACE (Model 2) was used to predict the shelf life of food products and showed better correlation with conventional laboratory methods than an interpolation algorithm (Model 1) did (Seles et al., 2016). This further demonstrates the robustness and versatility of the algorithm.
Classical statistics (ANOVA, frequency/histogram, univariate summary statistics) were calculated with PAST version 3.02 (http://folk.uio.no/ohammer/past) for correlation with the ACE results.
From the available options, we correlated ACE results with classical clustering algorithms such as UPGMA (unweighted pair-group method with arithmetic mean; Everitt et al., 2011b) (Figure 1). The internal ACE groupings are similar or identical (Table 1), although the cluster positioning differs.
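For readers who want to reproduce the classical benchmark, UPGMA is available in standard libraries as hierarchical clustering with "average" linkage. The sector profiles below are randomly generated stand-ins, not the survey data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# hypothetical sector profiles: 26 sectors x 10 statement medians (1-7 scale)
sectors = rng.uniform(1, 7, size=(26, 10))

# UPGMA is hierarchical clustering with "average" linkage
condensed = pdist(sectors, metric="euclidean")
tree = linkage(condensed, method="average")
labels = fcluster(tree, t=6, criterion="maxclust")  # cut dendrogram into 6 groups
```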
Furthermore, we utilized ACE to analyse and evaluate cases of simple and complex natural clusters that can be detected graphically (Everitt et al., 2011c; Everitt, 1974). For this presentation, we left the data in place to show positional matching for a few examples (Figures 2-5).
5 Results and discussion
5.1 Missing data characterization
The technology applied in the survey did not allow the generation of missing-completely-at-random (MCAR) or missing-at-random (MAR) data for unanswered statements or for statements given more than one answer. Missing data were classified as non-ignorable when respondents selected the 0 value ("I don't know/I don't want to answer") in the online survey.
The survey results illustrate well the realities of survey data with respect to non-response. If the common rule of discarding an entire response whenever even one statement is unanswered were applied, only 56 of the 487 participants would remain (those who selected values between 1 and 7 for all statements), a completed-response rate of just 11.5%. In effect, a traditional discard approach would remove almost 90% of respondents; considering the time and effort invested in running a questionnaire and the data workup, close to 90% of that effort would be wasted. In order to select a valid data set for statistical
treatment, the available choices are:
1. Disregard all participants who did not answer every statement with a number 1-7 on the Likert scale, i.e. 88.5% of respondents. This is highly undesirable.
2. Reduce the number of statements so as to include as many participants as possible, thus creating a subset with all answers.
3. Impute mean values and use a classical statistical approach, with all the drawbacks of that approach.
4. Use another method to predict possible answers based on nearest-neighbour statistics.
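The arithmetic behind the discard scenario described above is straightforward:

```python
# Complete-case ("discard") arithmetic for the figures reported above
pool, responded, complete = 902, 487, 56

response_rate = responded / pool       # fraction of the pool that responded (54%)
complete_rate = complete / responded   # fraction usable under a discard rule (11.5%)
discarded = 1 - complete_rate          # fraction of collected responses lost (88.5%)
```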
Option 1 was unacceptable because of the huge reduction in sample size. For option 2, there were no reasonably sized subsets with all statements answered per sector. Option 3 would effectively destroy the structure of the data set. The fourth option, including all participants regardless of whether they answered every statement on the 1-7 Likert scale, was therefore the only viable approach. The final sample consequently included missing data.
5.2 Characterization of statistical sample and predictive modelling
Initial data analysis showed skewed Gaussian distribution patterns in which all types of skew were observed. It was necessary to calculate response factors for each answer per statement across all segments; the answers per statement across all sectors were then grouped by answer value and normalized by the non-zero answered statements.
The ACE algorithm allowed the use of a simplified imputation algorithm based on the data distribution of each statement. Instead of imputing mean values, the algorithm used specifically developed Gaussian, skewness, and frequency analyses to generate imputed values: histogram characteristics (answer frequency and skewness) were used to find the closest matches while preserving the Gaussian distribution of the data. This approach avoids weakening the distribution and creating an artificial spike at the class mean, which is characteristic of mean-value imputation (Government of Canada, 2010), and thereby maintains statistical power.
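A distribution-preserving imputation of the kind described above can be sketched as follows. The actual ACE procedure (its Gaussian and skewness analysis) is not published, so this sketch substitutes a simple draw from each statement's observed answer-frequency histogram, which likewise avoids the artificial spike at the class mean produced by mean imputation.

```python
import numpy as np

def impute_from_histogram(column: np.ndarray, seed: int = 7) -> np.ndarray:
    """Replace 0 ("don't know") entries by draws from the observed
    answer-frequency distribution of the same statement, so the
    histogram shape (skew included) is preserved instead of piling
    all imputed values onto the mean."""
    rng = np.random.default_rng(seed)
    observed = column[column != 0]
    values, counts = np.unique(observed, return_counts=True)
    probs = counts / counts.sum()
    out = column.copy()
    missing = column == 0
    out[missing] = rng.choice(values, size=missing.sum(), p=probs)
    return out

statement = np.array([7, 6, 0, 5, 7, 0, 6, 7, 4, 0])
filled = impute_from_histogram(statement)
```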
As shown in Figure 6, the first survey statement was used to illustrate the prediction model; for this statement, 23 responses were 0 values. The percentage of non-ignorable missing data varied by sector (sectors had different numbers of respondents), ranging from 2.31% to 45.08%, as shown in Table 2. Table 3 shows the same sample tested with p(ANOVA) = 0.9956 and ω² = -0.07692, confirming that the predicted values correctly filled in the missing data, thus preserving statistical power. Results for all other statements fall into a similarly acceptable range, with p(ANOVA) > 0.05 and ω² < 0.06.
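The validation logic can be reproduced with a one-way ANOVA plus the ω² effect size: a high p-value and a near-zero (or negative) ω² indicate that the observed and predicted groups are statistically indistinguishable. The data below are illustrative, not the survey values.

```python
import numpy as np
from scipy.stats import f_oneway

def omega_squared(*groups):
    """Effect size for one-way ANOVA:
    omega^2 = (SSb - dfb * MSw) / (SSt + MSw)."""
    all_x = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand = all_x.mean()
    ssb = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ssw = sum(((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum() for g in groups)
    dfb = len(groups) - 1
    dfw = len(all_x) - len(groups)
    msw = ssw / dfw
    return (ssb - dfb * msw) / (ssb + ssw + msw)

observed = [5.0, 6.0, 4.0, 7.0, 5.0, 6.0, 5.0]   # illustrative responses
predicted = [5.0, 6.0, 5.0, 6.0, 5.0, 6.0, 5.0]  # illustrative imputations
f_stat, p = f_oneway(observed, predicted)
w2 = omega_squared(observed, predicted)
# a p-value near 1 and omega^2 near or below 0 mean the predicted group
# is statistically indistinguishable from the observed one
```

Note that ω² can be slightly negative when the between-group variance is smaller than its expectation under the null, exactly as in Table 3.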
5.3 Data analysis
After obfuscation, data were analysed through ACE consistency (clustering and correlation)
algorithms. Data were grouped from highly correlated (Figure 7, bottom left) to completely
uncorrelated (Figure 7, top right). This data representation can uncover “truth in answers” as
represented in Figure 7, bottom left. Answers that are spurious are depicted in Figure 7, top
right.
Statements are ranked from high to low correlation according to a grouping index (GI), calculated by the ACE algorithm separately for columns and rows. The GI is the median of the relevant values for statements (columns) and sectors (rows) (Figure 7). The smaller the difference between medians, and the smaller the GI itself, the better the expected correlation between answers: all respondents answered similarly, producing a consistent response to the statements (Figure 7, bottom left). This is essential to understanding the overall correlation confidence. The score, as an extracted value, is a function of the complex data matrix, including non-ignorable missing data, and is therefore linked to latent variables. For this reason, the score reflects the depth of respondents' understanding of the current situation within the company, regardless of the answer score (1-7); the insight lies in the consistency comparison with other answers and sectors.
The algorithm successfully correlated answers across statements vs. sectors, where the participants are employees. It not only identified similarities but also found significant discrepancies within answers. For demonstration purposes, two ways of utilizing cluster and correlation analysis were used:
i) differences in clustering and correlation are assumed to be linked to survey participants:
1. Unwillingness to answer truthfully, or other personal reasons: participants gave answers just to finish the questionnaire. This case occurs with statements of belief, such as "I believe that my company will achieve increased sales figures in the next 3 years" or "Our reputation is better than our competitor's reputation." Some respondents may have been grouped as inconsistent because they did not want to reveal their beliefs.
2. Lack of information related to survey statements. For example, for statements such as "Our company is using green certified equipment and technologies," inconsistency was noticed where respondents were educated enough to understand the technology in the statement but were not informed about its application within the company.
3. Lack of education on the topic(s) relevant to answering the statement(s). For the same example statement, inconsistency was noted where respondents were not aware of such technologies and therefore lacked information on the topic.
ii) differences in clustering and correlation are assumed to be linked to survey participants' motivation, as explained by Touré-Tillery and Fishbach (2014), who analysed two dimensions of motivation: outcome-focused and process-focused motivation.
Both cases are illustrated in Figure 7, where statements separated into three equal groups of high, medium, and low correlation show different statement structures that can help reveal new phenomena. Furthermore, the ACE sector clustering shown on the Y axis informs future decisions about which sectors to involve as leaders in activities such as motivation, education, or information. Compared to UPGMA clustering (Figure 1), ACE clustering revealed a wider group of sectors (23 in Cluster B) that contained all 7 sectors of the corresponding UPGMA cluster; among those 7 sectors, 4 would be most appropriate for involvement in the described activities. When compared to classical statistical clustering (UPGMA), the ACE algorithm shows the similarities between sectors given in Table 4: 19 of the 26 sectors had a relative distance of less than 10 (more than 62% similarity).
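The similarity percentages in Table 4 appear to follow directly from the relative rank distance d over the 26 sectors, i.e. similarity = (26 - d)/26; this formula is inferred from the table values rather than stated in the text.

```python
# Similarity between classical (UPGMA) and ACE sector orderings, as in
# Table 4: a relative rank distance d over n sectors gives (n - d) / n.
n_sectors = 26

def similarity_pct(distance: int, n: int = n_sectors) -> int:
    """Percentage similarity implied by a relative rank distance."""
    return round((n - distance) / n * 100)

low = similarity_pct(2)    # e.g. Sector_01 in Table 4
mid = similarity_pct(10)   # e.g. Sector_07
high = similarity_pct(20)  # e.g. Sector_19
```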
In this study, the ACE algorithm allowed the analyst to use all of the data collected in the survey for statistical analysis, regardless of data type (missing or existing). In this specific case, a relatively high level of non-ignorable non-response was expected because of differences in respondents' expected knowledge, their access to information, and the proposed survey statements (e.g. respondents from a financial department are not expected to be experts in the field of ecology). By using the ACE algorithm, non-ignorable non-response was included in the analysis, which would not have been possible using standard statistical analysis methods.
As such, a survey on employees' perception of a company's innovation, comprising holistic and broadly framed statements, turned out to be ideal for elucidating the ACE algorithm's capabilities in extracting new values, including consistency measurements, from a data matrix.
In practical terms, by measuring consistency in sectorial answers, one could easily
identify a sectorial connection to latent variables (sectors lacking motivation, information, or
education). The resulting findings have great potential in informing management to take steps
toward changing group consistency by providing additional motivation, information, or
education to specific groups of employees in order to provoke new innovative behaviour, or
to increase acceptance of external innovation and collaboration.
6 Conclusion
In the present study, we provide evidence of the application of the ACE algorithm in survey
analysis, which we illustrated through a survey on employee perception of an organization’s
innovation ability. The use of ACE allowed for a better understanding of the nature of
innovation through the discovery of latent variables. Furthermore, we have demonstrated that
ACE can be applied to capture internal perspectives and cultural elements that are tightly
bound to employees’ perception. We showed that ACE is an analytical tool that allows for
repeatable and validated results to emerge for a company’s internal analytics.
More specifically, the ACE algorithm was validated for survey analysis, and its ability to extract latent variables was demonstrated. We can conclude that, through positional matching and comparison with classical clustering algorithms (such as UPGMA), ACE was found to correlate well and to meet the study's standard of rigour; it can therefore be used as a valid tool
for the analysis of complex data retrieved from surveys. Furthermore, we showed that when used on real-world data, ACE predicted actual data with p(ANOVA) > 0.05 and ω² < 0.06, indicating that the predictions matched the actual data.
The algorithm successfully correlated answers across statements vs. sectors, where the participants are employees, and found significant discrepancies within answers, giving researchers an opportunity to extract latent variables.
It can be concluded that in cases where non-ignorable non-response data can be easily identified, as in this study, and where the survey design does not allow the generation of any other kind of missing data, the ACE algorithm can successfully treat all collected data as valuable information. That would not be possible using standard statistical analysis methods, in which non-ignorable non-response data would be lost and the sample size radically reduced. In this paper, we demonstrated the value of non-ignorable non-response through the extraction of values using latent variables, values that can constitute useful information for management.
Even though this case was a complex methodological exercise involving a survey, an algorithm, and an innovation study, and the contribution of this work is focused on methodology, it also opens significant opportunities for further study in the fields of innovation and HR. It is of particular interest for organizations to find the most effective way of making HRM decisions that may promote the spread of innovation development (Lobanova and Ozolina-Ozola, 2014). Through a practical illustration, we have demonstrated how rich
source of information from survey data can be, especially when addressing complex
problems. A specific contribution to literature is the application of non-ignorable non-
response data that would otherwise be lost. It allowed for clustering and correlation analysis
to uncover new relationships between data, provide new meaning to those relationships, and
allow for the discovery of latent variables. This study describes a tool that can easily be
adapted to different cases ranging from theory to practice. Therefore, an important
contribution is providing a method to analyse survey data as illustrated through the case of a
survey of employee perception on the innovation ability of the organization. Descriptive
analysis of similar perception surveys, such as in the Linaker et al. (2015) study, can be
supplemented with latent algorithmically-extracted data and used for scholarly and practical
purposes.
Limitations of this work are in the area of algorithmic methodology that can often
only be validated internally due to a lack of standard reference tools and methods with the
same capabilities. In this case, there were no reference methods available to validate the
presented algorithm and therefore external validation of the ACE algorithm was not possible.
The same algorithm was, however, effectively validated externally in another case: Kantoci (1999) focused on biologically active compounds, for which reference methods were available for comparison of results, and that work successfully demonstrated ACE's validity for that specific use.
However, the presented results should be tested and rechecked by other methods or
algorithms when they become available. Nevertheless, the discoveries raised using this
algorithm tool should not be ignored because ACE can provide real insight into trends.
Possible artefactual results can be avoided through iteration of the surveying process and extraction of latent variables. In general, surveys as a data source face significant limitations both in data composition (sampling-process limitations and missing data) and in our ability to explain results (correlation vs. causation). By addressing missing data, this work contributes to the field of data composition but still leaves the explanatory issue open.
The analysis of a temporal pattern of innovative activities was addressed by Jang and
Chen (2010) as an important research field. It can be studied through a survey approach
where future research can include a longitudinal innovation survey study in which the ACE
algorithm can provide trend analysis and prediction. Therefore, a further field of study and
added value for the presented tool would be to assess the effectiveness of a company’s
policies, new methodologies, popularity, market opinions, etc., based on properly structured
27
questionnaires. In certain emerging domains, such as open innovation, this would allow for a
stronger integration of concepts and methods across units of analysis (Bigliardi & Galati,
2013; Bogers et al., 2017; West & Bogers, 2014).
7 References and notes

Alcaide-Marzal, J. and Tortajada-Esparza, E. (2007) 'Innovation assessment in traditional industries. A proposal of aesthetic innovation indicators', Scientometrics, Vol. 72 No. 1, pp. 33-57.

Bertoli-Barsotti, L. and Punzo, A. (2014) 'Refusal to Answer Specific Questions in a Survey: A Case Study', Communications in Statistics: Theory & Methods, Vol. 43 No. 4, pp. 826-838.

Bigliardi, B. and Galati, F. (2013) 'Models of adoption of open innovation within the food industry', Trends in Food Science & Technology, Vol. 30 No. 1, pp. 16-26.

Bogers, M., Zobel, A.-K., Afuah, A., Almirall, E., Brunswicker, S., Dahlander, L., Frederiksen, L., Gawer, A., Gruber, M., Haefliger, S., Hagedoorn, J., Hilgers, D., Laursen, K., Magnusson, M. G., Majchrzak, A., McCarthy, I. P., Moeslein, K. M., Nambisan, S., Piller, F. T., Radziwon, A., Rossi-Lamastra, C., Sims, J. and Ter Wal, A. L. J. (2017) 'The open innovation research landscape: Established perspectives and emerging themes across different levels of analysis', Industry and Innovation, Vol. 24 No. 1, pp. 8-40.

Cabral, J. D. O. (1998) 'Survey on technological innovative behavior in the Brazilian food industry', Scientometrics, Vol. 42 No. 2, pp. 129-169.

Cesaratto, S., Mangano, S. and Sirilli, G. (1991) 'The innovative behaviour of Italian firms: a survey on technological innovation and R&D', Scientometrics, Vol. 21 No. 1, pp. 115-141.

Chan, H. and Perrig, A. (2004) 'ACE: An emergent algorithm for highly uniform cluster formation', in Wireless Sensor Networks, Berlin/Heidelberg: Springer, pp. 154-171.

Chesbrough, H. and Bogers, M. (2014) 'Explicating open innovation: Clarifying an emerging paradigm for understanding innovation', in H. Chesbrough, W. Vanhaverbeke and J. West (Eds.), New Frontiers in Open Innovation, Oxford: Oxford University Press, pp. 3-28.

Dadura, A. M. and Jiun-Shen Lee, T. (2011) 'Measuring the innovation ability of Taiwan's food industry using DEA', Innovation: The European Journal of Social Sciences, Vol. 24 No. 1/2, pp. 151-172.

Durrant, G. B. (2009) 'Imputation methods for handling item-nonresponse in practice: methodological issues and recent debates', International Journal of Social Research Methodology, Vol. 12 No. 4, pp. 293-304.

Everitt, B. S. (1974) Cluster Analysis, 2nd ed., pp. 61-63, New York: Halsted Press.

Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011a) 'An introduction to classification and clustering', in Cluster Analysis, 5th ed., pp. 1-14, London: Wiley.

Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011b) 'Hierarchical clustering', in Cluster Analysis, 5th ed., pp. 71-110, London: Wiley.

Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011c) 'Detecting clusters graphically', in Cluster Analysis, 5th ed., pp. 15-41, London: Wiley.

Forster, J. J. and Smith, P. F. (1998) 'Model-based inference for categorical survey data subject to non-ignorable non-response', Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 60 No. 1, p. 57.

George, G., Haas, M. R. and Pentland, A. (2014) 'Big Data and Management', Academy of Management Journal, Vol. 57 No. 2, pp. 321-326.

Government of Canada (2010) Survey Methods and Practices, Catalogue no. 12-587-X [online], http://www5.statcan.gc.ca/olc-cel/olc.action?objId=12-587-X&objType=2&lang=en&limit=0 (Accessed 2 November 2016).

Grace-Martin, K. (2001) Missing Data Mechanisms, Cornell Statistical Consulting Unit [online], https://www.cscu.cornell.edu/news/statnews/stnews46.pdf (Accessed 2 November 2016).

Jang, S. L. and Chen, J. H. (2010) 'What determines how long an innovative spell will last?', Scientometrics, Vol. 86 No. 1, pp. 65-76.

Kabasakal, H., Asugman, G. and Develioğlu, K. (2006) 'The role of employee preferences and organizational culture in explaining e-commerce orientations', International Journal of Human Resource Management, Vol. 17 No. 3, pp. 464-483.

Kantoci, D. (1999) 'The Algorithm for Cross Evaluation (ACE) of Biologically Active Compounds', Life Sciences, Vol. 65 No. 12, pp. 1305-1315.

Klein, S. M., Kraut, A. I. and Wolfson, A. (1971) 'Employee Reactions to Attitude Survey Feedback: A Study of the Impact of Structure and Process', Administrative Science Quarterly, Vol. 16 No. 4, pp. 497-514.

Kuroki, M. (2012) 'The Deregulation of Temporary Employment and Workers' Perceptions of Job Insecurity', Industrial & Labor Relations Review, Vol. 65 No. 3, pp. 560-577.

Linåker, J., Munir, H., Runeson, P., Regnell, B. and Schrewelius, C. (2015) 'A Survey on the Perception of Innovation in a Large Product-Focused Software Organization', in International Conference of Software Business 2015, June 2015, Springer International Publishing, pp. 66-80.

Lobanova, L. and Ozolina-Ozola, I. (2014) 'Innovative trends in human resource management: a case study of Lithuanian and Latvian organisations', International Journal of Transitions and Innovation Systems, Vol. 3 No. 2, pp. 131-152.

Marlin, B. M., Roweis, S. T. and Zemel, R. S. (2005) 'Unsupervised Learning with Non-Ignorable Missing Data', in AISTATS, January 2005.

Perko, I. and Ototsky, P. (2016) 'Business ecosystems requirements for big data', International Journal of Transitions and Innovation Systems, Vol. 5 Nos. 3/4, pp. 329-352.

Savitskaya, I. and Kortelainen, S. (2012) 'Innovating within the system: the simulation model of external influences on open innovation process', International Journal of Transitions and Innovation Systems, Vol. 2 No. 2, pp. 135-150.

Seles, J., Ranilovic, J., Kantoci, D., Bauman, I., Dzanic, E., Mihaljevic Herman, V. and Cvetkovic, T. (2016) 'Comparison of experimental method with a new mathematical model to determine the shelf life of liquid mixtures for marinating', in 8th Central European Congress on Food 2016 - Food Science for Well-being (CEFood 2016).

Singleton, R. A. and Straits, B. C. (2005) Approaches to Social Research, New York: Oxford University Press.

Su, S., Baird, K. and Blair, B. (2009) 'Employee organizational commitment: the influence of cultural and organizational factors in the Australian manufacturing industry', International Journal of Human Resource Management, Vol. 20 No. 12, pp. 2494-2516.

Temel, T. (2016) 'A methodology for characterising innovation systems - revisiting the agricultural innovation system of Azerbaijan', International Journal of Transitions and Innovation Systems, Vol. 5 Nos. 3/4, pp. 254-298.

Touré-Tillery, M. and Fishbach, A. (2014) 'How to measure motivation: A guide for the experimental social psychologist', Social and Personality Psychology Compass, Vol. 8 No. 7, pp. 328-341.

West, J. and Bogers, M. (2014) 'Leveraging external sources of innovation: A review of research on open innovation', Journal of Product Innovation Management, Vol. 31 No. 4, pp. 814-831.

Yen-Ku, K. and Kung-Don, Y. (2010) 'How employees' perception of information technology application and their knowledge management capacity influence organizational performance', Behaviour & Information Technology, Vol. 29 No. 3, pp. 287-303.
8 Tables, figure captions and figures

8.1 Tables
IDENTITY     Grouping Index    Major cluster group
Sector_05    4.90              1
Sector_22    4.93              2
Sector_03    4.94              2
Sector_11    5.18              2
Sector_01    5.29              2
Sector_04    5.30              2
Sector_08    5.32              2
Sector_10    5.39              2
Sector_23    5.39              2
Sector_02    5.43              2
Sector_07    5.50              3
Sector_06    5.52              3
Sector_15    5.52              3
Sector_18    5.57              3
Sector_20    5.58              3
Sector_24    5.68              3
Sector_17    5.81              3
Sector_16    5.86              4
Sector_19    5.97              4
Sector_26    6.31              4
Sector_09    6.33              5
Sector_25    7.29              5
Sector_13    7.73              5
Sector_12    9.24              5
Sector_14    9.24              5
Sector_21    9.24              6

Table 1 ACE clustering results for sectors showing cluster grouping indexes
Sector    Number of answers per sector*    Number of non-ignorable data (0)    % non-ignorable missing values (0)
Sector_01 8970 1721 19.19
Sector_02 4810 594 12.35
Sector_03 2990 478 15.99
Sector_04 2340 521 22.26
Sector_05 3380 630 18.64
Sector_06 780 48 6.15
Sector_07 3770 896 23.77
Sector_08 2600 493 18.96
Sector_09 910 188 20.66
Sector_10 650 70 10.77
Sector_11 6890 1322 19.19
Sector_12 130 21 16.15
Sector_13 650 293 45.08
Sector_14 260 95 36.54
Sector_15 780 80 10.26
Sector_16 1430 328 22.94
Sector_17 3900 885 22.69
Sector_18 5850 1157 19.78
Sector_19 260 6 2.31
Sector_20 3380 1025 30.33
Sector_21 520 169 32.50
Sector_22 2730 370 13.55
Sector_23 1300 140 10.77
Sector_24 1690 474 28.05
Sector_25 780 271 34.74
Sector_26 1560 288 18.46

Table 2 Percent unanswered statements by sector where sectors had a different number of respondents.
Test for equal means

                  Sum of sqrs    df    Mean square    F           p (same)
Between groups:   3.58E-05       1     3.58E-05       3.14E-05    0.9956
Within groups:    13.7074        12    1.14228
Total:            13.7074        13

ω²: -0.07692
Levene's test for homogeneity of variance, from means: p (same) = 0.9687
Levene's test, from medians: p (same) = 0.9943
Welch F test in the case of unequal variances: F = 3.138E-05, df = 11.99, p = 0.9956

Table 3 ANOVA existing vs. predicted data
Sector       Relative distance between classical and ACE grouping    Similarity % between classical and ACE grouping
Sector_01    2     92%
Sector_20    2     92%
Sector_04    3     88%
Sector_06    4     85%
Sector_23    4     85%
Sector_25    4     85%
Sector_16    5     81%
Sector_22    5     81%
Sector_12    6     77%
Sector_24    6     77%
Sector_10    8     69%
Sector_13    8     69%
Sector_26    8     69%
Sector_02    9     65%
Sector_09    9     65%
Sector_17    9     65%
Sector_18    9     65%
Sector_07    10    62%
Sector_11    10    62%
Sector_08    14    46%
Sector_03    16    38%
Sector_05    16    38%
Sector_21    16    38%
Sector_14    18    31%
Sector_15    19    27%
Sector_19    20    23%

Table 4 Distance comparison between classical and ACE cluster grouping
8.2 Figure captions and figures
Figure 1 Sector cluster Paired Group UPGMA plot (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)
Figure 2 Example of positional matching for a dot-shape base object (a) used for validation of the ACE algorithm and illustrated through positional correlation (b) (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA). [Figure image not reproduced in this transcript.]
Figure 3 Example of positional matching for a doughnut-shape base object (a) used for validation of the ACE algorithm and illustrated through positional correlation (b) (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA). [Figure image not reproduced in this transcript.]
Figure 4 Example of positional matching for a banana (moon) shaped base object (a) used for validation of the ACE algorithm, illustrated through positional correlation (b). (plotted using MS Excel 2003, Microsoft Corporation, Seattle, WA)
Figure 5 Example of positional matching for a random-shaped base object (a) used for validation of the ACE algorithm, illustrated through positional correlation (b). (plotted using MS Excel 2003, Microsoft Corporation, Seattle, WA)
Figure 6 Comparison between actual and predicted data based on the ACE prediction algorithm (Statement 1 prediction model; x-axis: answer, 1–7; y-axis: weighted distribution; series: original data vs. predicted data). (plotted using MS Excel 2003, Microsoft Corporation, Seattle, WA)
Figure 7 SunCore clustering and correlation plot showing (i) type of statement and (ii) motivation dimension structure for the low, medium, and high correlation groups of statements and sectors, clustered using the same ACE algorithm. (plotted using MS Excel 2003, Microsoft Corporation, Seattle, WA)