
From “Big Data” to “Smart Data”: Algorithm for Cross-Evaluation (ACE) as a Novel Method for Large-Scale Survey Analysis

Darko Kantoci
KanDar Enterprises, Inc.
250 Commercial St., Suite 3005F
Manchester, NH 03101, USA
E-mail: [email protected]

Emir Džanić*
Cambridge Innovative System Solutions Ltd
CPC1 – Capital Park, Fulbourn
Cambridge CB21 5XE, UK
E-mail: [email protected]

Marcel Bogers
University of Copenhagen
Department of Food and Resource Economics
Unit for Innovation, Entrepreneurship and Management
Rolighedsvej 25
1958 Frederiksberg C, Denmark
E-mail: [email protected]

Accepted for publication in: International Journal of Transitions and Innovation Systems

21 October 2017

Abstract: Current research increasingly relies on large-scale data analysis to provide insights into trends and patterns across a variety of organizational and business contexts. Existing methods for large-scale data analysis do not fully capture some of the key challenges of such data, such as non-response rates or missing data. One method that does address these challenges is the SunCore Algorithm for Cross-Evaluation (ACE). ACE provides a view of the whole data set in a multidimensional mathematical space by performing consistency and cluster analysis to fill in the gaps, thereby illuminating trends and patterns previously invisible within such data sets. This approach to data analysis meaningfully complements classical statistical approaches. We argue that the value of the ACE algorithm lies in turning “big data” into “smart data” by predicting gaps in large data sets. We illustrate the use of ACE in connection with a survey on employees’ perception of the innovative ability of their company by looking at consistency and cluster analysis.

Keywords: statistical modelling; statistical algorithm; survey analysis; consistency analysis; cluster analysis; data trends; data patterns; data correlation; non-ignorable missing data; non-response missing data; cross evaluation; big data; smart data; innovation survey; food processing company.

*Corresponding author


1 Introduction

Surveys on the perceptions of employees or management in different industrial sectors have increasingly been used to measure parameters such as employee satisfaction, managerial performance, and innovation capabilities using classical statistical approaches (Kuroki, 2012; Yen-Ku and Kung-Don, 2010; Klein et al., 1971; Su et al., 2009; Kabaskal et al., 2006). In the past 15 years, specific fields of research, such as innovation, have emerged as important for companies’ strategic positioning (Bigliardi and Galati, 2013; Chesbrough and Bogers, 2014), and such research has prompted a growing number of studies in the domain. For innovation specifically, surveys have become more standardized and advanced (Dadura and Lee, 2011; OECD, 2005; Cabral, 1998). It also became evident that they can be used not only to measure managerial attitudes in multiple case studies, such as the one conducted by Dadura and Lee (2011), but also employees’ perceptions, as described by Linaker et al. (2015).

The concept of innovation, as an example, has become increasingly interconnected with other concepts. New approaches to measuring it constantly appear and involve methods that are new to innovation studies, such as system dynamics (Savitskaya and Kortelainen, 2012) and the graph theoretic method (Temel, 2016). Besides the introduction of new methodological approaches, a more complete understanding of the innovation concept requires the involvement of more abstract topics such as organizational culture or organizational climate (Linaker et al., 2015), as well as practical topics such as the introduction of performance-enhancing HRM practices that can lead to change in organizational behaviour or culture (Lobanova and Ozolina-Ozola, 2014). This need immediately brings up the survey as an important tool and source of information for innovation researchers. At the same time, this area is not free of problems, because innovation surveys may only measure innovation on a generic level while leaving out specific attributes. This issue may be a challenge, for example, in traditional industries where innovation is mainly aesthetic (Alcaide-Marzal and Tortajada-Esparza, 2007). For that reason, the usefulness of a survey in a traditional industry such as food processing can be improved with a proper tool that addresses the challenges of survey methodology. Furthermore, the results from such a survey, as described by Cabral (1998), show that innovation is a very complex problem in the food industry and that, at the firm level, large firms tend to be more innovative than small ones. Finally, an innovation survey of Italian manufacturing firms (Cesaratto, Mangano and Sirilli, 1991) presents surprisingly different results regarding the number of R&D manufacturing firms, showing twice as many such firms as the annual survey on research and development activities carried out by the Italian Central Statistical Office. Therefore, the methodology for measuring innovation in the manufacturing industry needs to be constantly adapted and developed to provide accurate data. In this paper, we present the SunCore Algorithm for Cross-Evaluation (ACE) as a novel statistical approach to survey analysis in the context of employees’ perception of the innovative ability of their company.

To address the above-mentioned issues, we present a survey to analyse intrinsic factors of a large food processing company’s innovativeness. Our approach illustrates how ACE can contribute to a better understanding of the nature of innovation through the discovery of latent variables. We show how internal perspectives and cultural elements that are tightly bound to employees’ perception can be captured through such a survey. We also show that ACE can be used as an analytical tool that allows repeatable and validated results to emerge for a company’s internal analytics. This approach can easily be adapted for measurement at different levels, such as the industry level.


2 Some current limitations of survey data analysis in innovation studies

Innovation is an increasingly important topic for company leadership and management as they strive to develop and maintain high levels of competitiveness. Such a complex and strategic topic involves understanding a complex nexus of products, processes, organizational structures, culture, marketing, etc. To discover how innovation arises within this nexus, a more ecological approach requires organizations to research the interactions between this nexus and the perceptions of the employees involved (Dadura and Lee, 2011). However, employees’ perception of the innovativeness of their company, especially in large companies, is rarely measured. This is in part because analysis methods have been lacking, due to the aforementioned limitations in survey-based research.

We argue that a valuable tool to enable this type of research is the survey method in combination with the ACE algorithm. The ACE algorithm has been used successfully in the natural sciences to uncover correlations between cancer cells and anti-cancer drugs (Kantoci, 1999; Seles et al., 2016). Since the ACE algorithm is generic in nature, it can be applied to any data set (for example, organizational or managerial data, or data from industries such as banking) to uncover hidden correlations. When applied to large data sets, ACE can provide value extraction in conditions where large sets of data are expected to be missing due to participants’ unwillingness to reveal their beliefs, or a failure to answer particular questions/statements because of a lack of information or knowledge. A core assumption is that complete value extraction can only be achieved if all collected data can be included in the analysis, even if the data set contains missing data. In that case, missing data can contribute to data analysis and provide new understanding and insight for leaders and management. The methodology we propose can be applied to initial analysis, followed by actions targeting weak points in the data, and can be further expanded to longitudinal survey analysis.

In this paper, we specifically address research challenges such as the consistency and cluster analysis of survey data containing large sets of non-ignorable non-response data (missing data for critical survey questions/statements due to a large proportion of incomplete answers or no answers at all) by using ACE to process the entire data set at once. Data analysis using powerful computational techniques serves, more than ever before, as a useful tool to unveil trends and patterns hidden within data. This kind of data value extraction provides new insights that meaningfully complement classical statistical approaches, surveys, and other data sources (George et al., 2014). These approaches have the ability to transform so-called “big data” into “smart data”: a shift from data volume to data intelligence, focusing not on quantity as a key value but on the insights that can be drawn from data analysis (George et al., 2014). For that reason, “big data” becomes an important tool for the redefinition of individuals, organisations and entire ecosystems (Perko and Ototsky, 2016).

Surveys as a data source, though widely used and highly valuable, face several significant limitations, including data composition (e.g. sample representation, sample composition, incomplete/missing data) and our explanatory abilities (correlation vs. causation). Limitations such as the inability to eliminate rival explanations, where one can only find associations rather than causal relationships between variables (Singleton and Straits, 2005), have long caused consternation. Another important limitation relates to measurement error, where respondents may answer questions/statements in the direction of social desirability rather than their real feelings (Singleton and Straits, 2005) or leave non-ignorable missing data, which is problematic yet potentially informative.


Two illustrative examples of how researchers have approached the non-ignorable missing data problem arising from non-responses in survey research are i) Foster and Smith’s (1998) analysis of the 1992 British general election and ii) Bertoli-Barsotti and Punzo’s (2014) work on healthcare worker assessments. Foster and Smith, looking at a sample of the British electorate, suggested adjusting the sample size for surveys where the data were expected to contain substantial non-ignorable non-response. Bertoli-Barsotti and Punzo, using a large data set on the assessment of healthcare workers in which there was a significant answer refusal rate among informants, applied a Rasch-Rasch model (RRM) to overcome non-responses. Comparing these two approaches highlights a significant change over the last 20 years: researchers are moving from sample size adjustment towards different computational algorithms that can overcome non-ignorable missing data. Put simply, researchers are moving from increasing data size towards more sophisticated algorithmic models that can help fill in the gaps of missing data.

In addition to these limitations, survey data analysis faces other challenges, especially in the number of dimensions that can be evaluated through classical statistical analysis methods. Familiar analyses such as the t-test, ANOVA, frequency analysis, mean, and median work on pairs of data rather than offering holistic, multi-dimensional analysis. Many developments have occurred to expand the capacity of such analysis practices. However, even when advanced statistical methods were used, such as multidimensional ANOVA, clustering, and other algorithms that work on multidimensional data, these algorithms failed to address:

1. Evaluation of the entire data matrix at once.

2. The discarding of large data sets due to missing data, since missing values skew other answers (non-responses, spurious answers, etc.), especially for non-ignorable missing data.

3. The measurement of full data consistency (correlation and clustering), where standard statistical approaches provide valuable analysis of data consistency within and between groups but miss correlations with all other data.

To address the aforementioned limitations, this study explores the value of ACE as a new mathematical modelling tool to predict answers based on the similarities of answers within a group across a large data matrix. We argue that this approach preserves statistical power and enables the use of data that would otherwise be discarded (non-ignorable missing data, spurious answers, etc.). The model analyses data through all-with-all correlations, meaning that each value relates to all other values in the entire data set based on its own value as well as its distance to all other data in the matrix.

A key application of this approach, and the one we focus on in this paper, is the study of organizational innovation.

To build the argument for the value of the ACE algorithm, and to substantiate the methodological approach advocated, this paper mobilizes data from a large innovation survey (126 statements) completed within a European food processing and pharmaceutical company by n=487 respondents (see Section 4.1).

This particular data set included enough non-responses to make them non-ignorable, which implies specific limitations and challenges for researchers engaged in such research. However, with this computational algorithm, we were able to mobilize all of the collected data, including missing data, to provide new insights through value extraction. The algorithm allows statistical analysis by correlating all dimensions of the data within the whole data matrix (including missing and existing data) at once.

The objective of this paper is to illustrate the capabilities of the ACE algorithm when used to analyse a data matrix originating from an organizational/managerial survey performed in a large organization. We show that the ACE algorithm is valid not only for the analysis of complex natural science cases (Kantoci, 1999; Seles et al., 2016) but also for social science data that contains equally complex relationships between explicated and extracted values.

3 At-once statistical analysis of survey data

3.1 Evaluation of the entire data matrix at once: ACE algorithm

Current methodologies in data analysis use classical statistical methods that work on narrow data sets, such as comparisons of two arrays. For example, classical cluster analysis correlates data by local association factors. One can easily imagine that extracting similarity patterns from data would provide another analytical dimension, one that is crucial for large data samples (the “big data”). Such analysis would allow the extraction of new parameters that were not accessible through traditional statistical approaches.

An application of this type of analysis is the extraction of values that provide insight into similarities across all the data among groups of survey participants. To achieve this level of analysis, a data matrix has to be analysed in its totality. This requires an analysis tool operating in a multidimensional mathematical space that can handle any number of dimensions simultaneously. The ACE algorithm supplies such a tool. Moreover, it identifies similarities in data and groups similar data together across all dimensions. In so doing, similarity patterns can be extracted from the data by analysing the entire data matrix at once. In other words, the modelling mechanism can measure the consistency of answers coming from multi-cluster samples that contain missing data, analyse similarity patterns, and predict the value of any missing data.

The application of the ACE algorithm, and its capabilities in full-matrix multi-dimensional analysis, has been tested before. A similar model was previously used in a correlation analysis of the action of anti-cancer compounds against cancer cells. In that study (Kantoci, 1999), an earlier version of the algorithm was used to find the best possible drug candidates for particular cancers, as well as to group (cluster) anti-cancer drugs based on their anti-cancer activity and chemical structure similarities and link them to particular cancer cell lines. As described in Seles et al. (2016), ACE was also used to determine the shelf life of food products (Model 2). In considering these variables, ACE creates a non-linear, multi-dimensional association between the data, and then displays the results in a matrix and in a graphical representation. ACE evaluates data in such a way that each number in the matrix “feels” all other numbers in the matrix, after which the algorithm arranges the data by calculated association coefficients. The algorithm is based on a complete linkage approach that accounts for all similarities and variability in a clustering matrix.

In that previous study, as well as in this study, the focus is on how the ACE algorithm follows the “at-once” principle of analysing all data with all other data in the matrix in squared Euclidean space. It is a so-called “in place” algorithm, since original data values are not moved from their positions in the matrix during calculation. The algorithm accomplishes these tasks by finding multi-dimensional correlations based on vector orientations and scalar values. This approach uncovers all levels of correlation, from weakest to strongest. The final data can be analysed in situ or rearranged to cluster similar values together in all 3 dimensions as islands of similarities. After this rearrangement, it can easily be determined which data are correlated and which are not. This can be done through data frequency analysis that finds islands of identities in 3D space.


As concluded by Kantoci (1999), ACE may be applied to other situations where details of grouping and relationships need to be extracted from large data sets.

While classical statistical methods work on narrow data sets and can compare a limited number of arrays, the ACE algorithm exploits its ability to evaluate the entire data matrix at once, providing new opportunities for value extraction from different data sources and scales. In what follows, we apply this capability particularly to the challenge of missing data (non-ignorable non-response) and its impact on statistical analysis.
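For illustration only, the “at once” evaluation described above can be sketched in a few lines. The SunCore source is proprietary and unpublished, so the following Python sketch is our own plausible reading of the description (each value related to every other value via squared Euclidean distance, normalized by the initial value, with results kept “in place”), not the actual ACE implementation; all names are ours.

```python
import numpy as np

def all_with_all_association(data: np.ndarray) -> np.ndarray:
    """Illustrative "at once" evaluation: every cell is related to every
    other cell via squared Euclidean distance, so each number "feels" all
    other numbers in the matrix. A sketch, not the SunCore code."""
    flat = data.ravel().astype(float)
    # Squared Euclidean distance from each value to all values in the matrix.
    sq_dist = (flat[:, None] - flat[None, :]) ** 2
    # Aggregate each value's relation to the whole matrix ...
    assoc = sq_dist.sum(axis=1)
    # ... and normalize by the initial value, keeping results "in place":
    # output cells stay in the same positions as the input cells.
    with np.errstate(divide="ignore", invalid="ignore"):
        assoc = np.where(flat != 0, assoc / np.abs(flat), assoc)
    return assoc.reshape(data.shape)

# Toy example: 4 respondents (rows) x 3 statements (columns), 1-7 Likert codes.
answers = np.array([[1, 7, 4], [2, 6, 4], [7, 1, 3], [1, 7, 5]])
print(all_with_all_association(answers))
```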

3.2 Missing data (non-ignorable non-response) impact on statistical analysis

For many researchers, missing data are a widespread problem. Data gathered from different sources (surveys, experiments, and secondary sources) are often incomplete. Missing data can affect the results of statistical analysis, depending on the mechanism that caused the data to be missing and the way in which the data analyst deals with it (Grace-Martin, 2001).

Subjects in survey studies often drop out before the study is completed, for many different reasons. Moreover, surveys often suffer missing data when participants refuse to, or do not know how to, answer a question.

Since most statistical procedures require a value for each variable, missing data are problematic. Researchers face a challenge whenever a data set is incomplete.

When facing missing data, it is common to use complete case analysis (also called listwise deletion), i.e. to analyse only the cases with complete data. Survey participants with missing data on any variable are accordingly dropped from the analysis. Despite the advantages of this approach (it is easy to use, very simple, and the default in most statistical packages), its limitations include a substantial reduction in sample size, leading to a severe loss of statistical power, especially when many variables each have missing data for a few cases. In such instances, researchers can expect biased results, depending on why the data are missing (Grace-Martin, 2001).

Processing of such data usually involves mean value imputation (Durrant, 2009). For missing data (data without a value), imputing the mean is equivalent to applying the same non-response weight adjustment to all respondents in the same imputation class. It assumes that non-response is uniform and that non-respondents have characteristics similar to respondents. This method weakens the distribution and multivariate relationships by creating artificial spikes at the class mean (Government of Canada, 2010).
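For concreteness, the two conventional treatments just discussed, listwise deletion and mean value imputation, can each be expressed in one line of pandas; the data frame and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; NaN marks a missing response.
df = pd.DataFrame({
    "q1": [5.0, 3.0, np.nan, 7.0],
    "q2": [2.0, np.nan, 4.0, 6.0],
})

# Complete case analysis (listwise deletion): any respondent missing an
# answer on any variable is dropped entirely, shrinking the sample.
complete_cases = df.dropna()

# Mean value imputation: each missing answer is replaced by the column
# mean, creating the artificial spike at the class mean noted above.
mean_imputed = df.fillna(df.mean())
```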

There are two important classes of missing data: ignorable and non-ignorable. Ignorable missing data involves data that is missing completely at random (MCAR) and data that is missing at random (MAR). Ignorable missing data implies that the probability of observing a data item is independent of the value of that item (Marlin, 2005). Non-ignorable missing data, by contrast, means that the missing data mechanism is related to the missing values. It commonly occurs when people do not want to reveal something very personal or unpopular about themselves. For example, if individuals with higher incomes are less likely to reveal them on a survey than individuals with lower incomes, the missing data mechanism for income is non-ignorable.

Whether income is missing or observed is related to its value. Complete case analysis can give highly biased results for non-ignorable missing data. If proportionally more lower- and moderate-income individuals are left in the sample because high-income people are missing, an estimate of the mean income will be lower than the actual population mean. Non-ignorable missing data are therefore more challenging and require a different approach (Grace-Martin, 2001).

In the presented survey, informants were given the option to answer statements by stating refusal (“I don’t know/I don’t want to answer”, coded 0) or to assign a value on a Likert scale (coded 1-7). This option directly connects the cause of non-response to the latent variable of interest, defining non-response as non-ignorable missing data (Bertoli-Barsotti and Punzo, 2014). The latent variables of interest in the presented case are assumed to be linked to an informant’s unwillingness to state an opinion, to ignorance due to the company’s inadequate education or information system, or to the informant’s own beliefs. Survey statements such as “our company has an R&D department” test informants’ knowledge of the company’s organizational structure; we assume that refusing to answer means that the informant is not well informed about the fact that the company has an R&D department. By answering on the offered Likert scale (1-7), an informant confirms the existence of such an organizational unit, while refusing to answer (by choosing 0, do not know) allows an informant to clearly state a lack of awareness of an R&D department or of what the term “R&D department” means, thereby showing a lack of information on this specific topic. Regardless of the latent variable causing a refusal to answer, researchers can further investigate the reasons for the missing data and act accordingly by providing education and information to specific groups of employees.

In some survey statements, informants state their beliefs instead of an educated and informed answer, e.g. in statements such as “I believe that my company will achieve increased sales figures in the next 3 years” or “our reputation is better than our competitors’ reputation.” These types of statements allow informants to state their beliefs and attitudes by choosing to support or not support the statement on a 1-7 Likert scale. When an informant chooses not to answer, she/he is refusing to reveal those beliefs; this, too, is missing data. In other words, whether data in the presented case study is missing or observed is related to its value.

The application of the ACE algorithm to this problem implies an ability to uncover the most likely answers to unanswered questions/statements through Gaussian, skewness, and frequency approximations. Based on all other answers, the algorithm can predict the most likely response. If questions/statements designed to lead to the same answer are scattered among other questions/statements, the likelihood of finding the closest match is rather high. In the presented case, predictability goes as high as a 99.6% probability of predicting the answer. For example, survey respondents omitted a significant number of answers for various reasons. The ACE algorithm was able to estimate correct answers for a particular statement on the order of p < 0.05 (>99.5%), based on answers given by other respondents who answered statements in a pattern similar to the respondent’s.
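The details of ACE’s prediction step are not published. As a minimal sketch of the idea described above, assuming a simple nearest-neighbour reading (find respondents who answered the remaining statements in a similar pattern and take their most frequent answer to the missing statement), one might write the following; the function name, the similarity measure, and the parameter k are our assumptions.

```python
import numpy as np

def predict_missing(data: np.ndarray, row: int, col: int, k: int = 5) -> int:
    """Predict a 0-coded (missing) answer from the k respondents whose
    answer patterns are most similar. A sketch, not the ACE implementation."""
    answered = np.where(data[:, col] > 0)[0]  # respondents who answered this statement
    target = data[row]
    dists = []
    for r in answered:
        # Compare only statements both respondents actually answered,
        # excluding the statement being predicted.
        common = (target > 0) & (data[r] > 0)
        common[col] = False
        dists.append(np.sum((target[common] - data[r][common]) ** 2))
    nearest = answered[np.argsort(dists)[:k]]
    # Most frequent answer (1-7 Likert code) among the nearest neighbours.
    return int(np.bincount(data[nearest, col].astype(int)).argmax())
```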

In what follows, we demonstrate that the described methodology preserves all responses and enters predicted values for missing data through ACE. If the prediction introduces high data bias, the clustering algorithm will group the biased data together and move them away from other good correlations.

Having discussed how missing data presents significant limitations to survey analysis, especially in the case of non-ignorable non-response, we now turn to the practice of using new algorithms to manage this challenge, as seen in the study by Bertoli-Barsotti and Punzo (2014). In this light, we look at the value of the ACE algorithm and its ability to provide response predictability through consistency and clustering correlations.

3.3 Consistency: correlation and clustering

In this section, we look at the issue of consistency and clustering. Consistency is defined as the grouping of similarities between data. The consistency of data therefore depends not only on answers that contain a value but also on the missing data that are part of the data set. In the present study, consistency in answers is assumed to depend on latent variables and involves the whole data set regardless of data type (missing or existing). In the initial exploration phase, when no historical data is available, as in the presented case of employee innovation perception, the consistency analysis provides quantification and metrics that can be used to characterize latent variables and in further longitudinal studies. In other words, regardless of which of the aforementioned variables applies to a specific survey statement, consistency in answering among informants is what matters for initial surveys such as the example used here. This is especially true for surveys that intend to reveal initial states in organizations and that face the risk of high levels of non-ignorable missing data.

In the presented case, the factors calculated through the ACE algorithm, as a separate third dimension, are of similar value for data in both other dimensions (statements vs. sectors). Therefore, through consistency analysis, correlation and clustering are achieved for the two dependent dimensions (statements vs. sectors). This allows researchers to identify the company sectors that responded to survey statements most or least consistently, as well as to identify critical survey statements for each company sector with respect to consistency in answering.

Consistency analysis is an important result of the ACE algorithm, since it evaluates not the answers per se but the relationships between data. The grouping function of the algorithm is a critical part of the data analysis, since it puts similar data together regardless of whether the data have high or low scores. This is achieved through frequency and rank algorithms. Whereas the frequency algorithm calculates the same responses across medians and groups them together for both data axes, rank graphs were generated to represent the normalized association patterns in 2-D space.

In addition, grouped similarities are sorted by the frequency algorithm, creating islands of similarities based on the scalar component of the ACE factors. This means that similar patterns are grouped together based on their scalar components (ACE factor). Again, the ACE algorithm takes into account all values, numerical as well as missing, without the destruction caused by, for example, the practice of imputing mean values.

4 Methods

4.1 Sample and data collection

In order to test the value of the ACE algorithm, we applied it to a large data set gathered from a large European food processing and pharmaceutical company. The company, established more than 70 years ago, is a dominant player in various markets within Europe and also operates in China and Africa. The data set, as introduced above, arose from a survey on employee perceptions of the innovation ability of the company.

The survey used was a modified version of the OECD survey on the innovation abilities of a company developed by Dadura and Lee (2011; see also OECD, 2002; OECD, 2005). The original questionnaire was applied to innovation in the Taiwanese food industry; here, it was used to measure the perception of employees towards their own company’s innovation capability. The questionnaire involves statements on 5 general areas: products, processes, organization, marketing, and ecologic innovation (sustainability).

The survey was carried out during 2014 using an online version of the questionnaire distributed through the company’s intranet. The survey comprised 126 statements. All informants were assured of anonymity, and responses were sought from across the specialist and managerial functions of the organization, including specialists and lower management, middle management, upper middle management, and top management. The total possible respondent pool was 902 across 7 large sectors of the organization. The allotted time for the survey was 20 days, during which the response rate was 54%, or n=487.

Respondents were expected to answer statements using a Likert scale (false=1, true=7, and 0=don’t know/do not want to answer). The survey was controlled for possible multiple selection or accidental skipping of answers through programmed controls that did not allow respondents to submit the questionnaire if their responses contained any such error. This allowed us to control for non-ignorable missing data by only allowing respondents to generate missing data by selecting the 0 value assigned to the response “I don’t know/I don’t want to answer.”

The original survey data were mutated (obfuscated) for this analysis through a specially developed algorithm, so as not to disclose real survey information. The obfuscation (mutation) algorithm is based on weighted randomization formulas; the mutated data kept the original statistical power. Each data point was multiplied by a random number scaled by the data point’s value. The main requirement for the obfuscation algorithm is that it preserves statistical power and data relationships as realistically as possible.
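The exact weighted randomization formulas are not disclosed. Purely as an illustration of the stated requirement (each point multiplied by a random factor scaled by its value, while the statistical profile stays close to the original), a sketch could look as follows; the noise model and the strength parameter are our assumptions, not the algorithm actually used.

```python
import numpy as np

def obfuscate(data: np.ndarray, strength: float = 0.05, seed: int = 42) -> np.ndarray:
    """Illustrative weighted randomization: multiply each data point by a
    random factor whose absolute effect scales with the point's value.
    A sketch only, not the mutation algorithm actually used."""
    rng = np.random.default_rng(seed)
    factors = 1.0 + strength * rng.standard_normal(data.shape)
    mutated = data * factors
    # Sanity check: gross statistics should be roughly preserved.
    print("mean:", data.mean(), "->", mutated.mean())
    print("std: ", data.std(), "->", mutated.std())
    return mutated
```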

4.2 ACE algorithm

The ACE algorithm (KanDar Enterprises, Inc., Manchester, NH, USA) was used for complete data analysis, clustering, data obfuscation, and prediction. It was developed on a Presario C500 computer (Hewlett-Packard Company, Palo Alto, CA) running Windows Server 2003 R2 SP2 (Microsoft Corporation, Redmond, WA) with the Microsoft Visual Studio 2010 (Microsoft Corporation, Redmond, WA) platform, using the C# programming language. The current version of ACE incorporates all calculations and assortments for the cluster matrix, frequency plots, and rank graphs under one programming language. The integrated version of the software is known under the trade name SunCore (KanDar Enterprises, Inc., Manchester, NH, USA) and is available through the author for commercial and non-commercial use. ACE calculates values for cluster analysis using the following steps (a sketch of the grouping-index steps follows the list):

1. Set statements in columns and respondents in rows.

2. Run the prediction algorithm to replace zeros with predicted data.

3. For this paper, apply the obfuscation algorithm to mask real data.

4. Calculate an association value for each data point in squared Euclidean space. The algorithm evaluates all-with-all points in the matrix, then normalizes the calculated parameters by the initial value.

5. Calculate medians in rows and columns. The generated values are referred to as the grouping index (GI), a median of relevant values for the different dimensions of the data matrix.

6. Calculate frequency and rank. The frequency algorithm evaluates sector groupings after the clustering algorithm, serving the analysis of consistency. The generated values are referred to as clustering factors.

7. Sort medians for statements and sectors.

8. Plot the consistency matrix.
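As a sketch of the grouping-index steps (5-7), assuming an association matrix computed along the lines of Section 3.1, the medians per row and per column can be taken as GI values and used to sort sectors and statements for the consistency matrix of step 8. The code below is our own reading of the listed steps, not the SunCore pipeline.

```python
import numpy as np

def grouping_indices(assoc: np.ndarray):
    """Steps 5-7 as we read them: the grouping index (GI) is the median of
    association values along each dimension; sorting by GI arranges the
    matrix for the consistency plot (step 8). Sketch only."""
    gi_statements = np.median(assoc, axis=0)  # one GI per statement (column)
    gi_sectors = np.median(assoc, axis=1)     # one GI per sector (row)
    col_order = np.argsort(gi_statements)     # smaller GI = better correlation
    row_order = np.argsort(gi_sectors)
    # The rearranged matrix clusters "islands of similarities" together.
    rearranged = assoc[np.ix_(row_order, col_order)]
    return rearranged, gi_statements, gi_sectors
```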

Since the computer source code is not available for comparative studies, the ACE code was not compared to other algorithms. For that reason, it is important to note that the application of ACE was previously described in pharmacy (Kantoci, 1999) and food science (Seles et al., 2016), where it was successfully compared to results obtained from other available algorithms.

4.3 Classical statistical analysis

To be able to compare our results with classical statistics, we calculated survey data statistics using standard ANOVA and clustering algorithms. Since there are no comparable algorithms in the literature, we cannot explicitly compare ACE with these algorithms. To demonstrate ACE’s versatility and robustness, we calculated clustering for standard shapes depicting correlation factors in 3D plots (see below). These shapes are considered “natural” shapes (Everitt, 1974, pp. 60-64; Everitt et al., 2011a). As Everitt pointed out, “many [means algorithms] would have difficulty in recovering natural clusters.” We therefore selected several natural clusters and ran the ACE clustering algorithm to test whether ACE could uncover them. Since we were unable to perform external validation due to a lack of similar algorithms, we analysed natural patterns such as the “moon/banana” and other shapes. As Everitt pointed out, these shapes are among the most difficult to analyse because they have different levels of symmetry and complexity (in mathematical terms). For comparative analysis, as previously published (Kantoci, 1999), ACE clustered anti-cancer drugs based on their antineoplastic activity; subsequent structure-activity relationship (SAR) analysis made it apparent that all clustered drugs have the same basic chemical structure with variations in side groups. Furthermore, ACE (Model 2) was used to predict the shelf life of food products and showed better correlation with conventional laboratory methods than an interpolation algorithm (Model 1) (Seles et al., 2016). This further demonstrates the robustness and versatility of the algorithm.

Classical statistics (ANOVA, frequency/histogram, univariate summary statistics) were calculated with PAST version 3.02 (http://folk.uio.no/ohammer/past) to correlate with ACE results.

From the available options, we correlated ACE results with classical clustering algorithms such as UPGMA (the unweighted pair-group method using averages; Everitt et al., 2011b) (Figure 1).
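PAST was the package actually used for the classical benchmark. For readers who prefer Python, an equivalent UPGMA dendrogram can be produced with SciPy, whose “average” linkage method implements the unweighted pair-group average approach; the sector matrix below is a random placeholder, not the survey data.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Placeholder: 7 sectors x 126 aggregated statement scores (1-7).
rng = np.random.default_rng(0)
sectors = rng.integers(1, 8, size=(7, 126)).astype(float)

# UPGMA = hierarchical clustering with average linkage on a distance matrix.
dists = pdist(sectors, metric="euclidean")
tree = linkage(dists, method="average")
leaves = dendrogram(tree, no_plot=True)["leaves"]  # leaf order, cf. Figure 1
print(leaves)
```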

ACE internal groupings are similar or the same (Table 1), although cluster positioning differs. Furthermore, we utilized ACE to analyse and evaluate cases of simple and complex natural clusters that can be detected graphically (Everitt et al., 2011c; Everitt, 1974). For this presentation, we left the data in place to show positional matching for a few examples (Figures 2, 3, 4 and 5).

5 Results and discussion

5.1 Missing data characterization

The technology applied in the original survey did not allow the generation of MCAR or MAR missing data: statements could not be left unanswered or be given more than one answer. Missing data were classified as non-ignorable when respondents selected the 0 value (“I don’t know/I don’t want to answer”) in the online survey.

The survey results nicely represent the realities of survey data with respect to non-responses. If we applied the common rule of discarding an entire response when even one statement is unanswered, the survey would end with an 11.5% rate of completed, valid responses: only 56 out of 487 participants answered all statements by selecting values between 1 and 7. In effect, this approach would have removed almost 90% of respondents under a traditional discard approach. Taking into consideration the time and effort needed to run a questionnaire and data workup, the wasted effort is close to 90%. In order to select a valid data set for statistical treatment, the available choices were:

1. Disregard all participants who did not answer every statement with a number 1-7 on the Likert scale, i.e. 88.5% of respondents. This is highly undesirable.

2. Reduce the number of statements to include as many participants as possible, thus creating a subset with all answers.

3. Impute mean values and use a classical statistical approach, with all the drawbacks of that approach.

4. Use another method to predict possible answers based on closest-neighbour statistics.

Because of the huge reduction in sample size, option 1 was not acceptable. For the second option, there were no reasonably sized subsets with all statements answered per sector. The third option would destroy the data set. The fourth option, including all participants regardless of whether they answered all statements on the Likert scale of 1-7, was therefore the only viable approach. The final sample thus included missing data.

5.2 Characterization of statistical sample and predictive modelling

Initial data analysis shows a skewed Gaussian data distribution pattern in which all types of skew were observed. It was necessary to calculate response factors for each answer per statement across all segments. The answers per statement across all sectors were then grouped by the answered value and normalized by the non-zero answered statements.

The ACE algorithm allowed the use of a simplified imputation algorithm based on the data distribution per statement. Instead of imputing mean values, the algorithm used specifically developed Gaussian, skewness, and frequency analyses to generate imputed values. Histogram characteristics (answer frequency, skewness) were used to calculate the closest matches while keeping the Gaussian distribution of the data. Using this approach, the weakening of the distribution and the artificial spikes at the class mean that are characteristic of the mean value imputation approach (Government of Canada, 2010) were avoided, and statistical power was maintained.
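The Gaussian/skewness/frequency machinery itself is not published. A minimal sketch of the distribution-preserving idea, assuming the simplest variant in which imputed values are drawn from each statement’s observed answer frequencies rather than set to the mean, is given below; the function name and sampling scheme are our assumptions.

```python
import numpy as np

def impute_from_histogram(column: np.ndarray, seed: int = 0) -> np.ndarray:
    """Replace 0-coded non-responses with draws from the statement's
    observed answer distribution (codes 1-7), preserving its shape and
    skew instead of piling imputed values onto the class mean. Sketch only."""
    rng = np.random.default_rng(seed)
    observed = column[column > 0]
    codes, counts = np.unique(observed, return_counts=True)
    out = column.copy()
    missing = out == 0
    out[missing] = rng.choice(codes, size=missing.sum(), p=counts / counts.sum())
    return out
```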

As shown in Figure 6, the first survey statement was used to illustrate the prediction model; for this statement, 23 responses were 0 values among participants. The percentage of non-ignorable missing data varied among statements by sector (sectors had different numbers of respondents), between 2.31% and 45.08%, as shown in Table 2. Furthermore, Table 3 shows the same sample tested with p(ANOVA) = 0.9956 and ω² = -0.07692, confirming that the predicted values correctly filled in the missing data, thus preserving statistical power. Results for all other statements fall into a similarly acceptable range, with p(ANOVA) < 0.05 and ω² < 0.06.
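The kind of check reported in Table 3 can be reproduced in outline with SciPy’s one-way ANOVA together with the standard omega-squared effect size (which, like the value reported above, can be slightly negative when the group effect is negligible). The two groups below, observed versus predicted answers, are placeholders rather than the study’s data.

```python
import numpy as np
from scipy.stats import f_oneway

def omega_squared(*groups) -> float:
    """Standard omega-squared effect size for a one-way ANOVA."""
    all_vals = np.concatenate(groups)
    grand = all_vals.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    ss_total = ((all_vals - grand) ** 2).sum()
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    ms_within = (ss_total - ss_between) / df_within
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)

# Placeholder groups: observed answers vs. ACE-predicted answers.
observed = np.array([4.0, 5.0, 4.0, 6.0, 5.0, 4.0])
predicted = np.array([4.0, 5.0, 5.0, 6.0, 4.0, 4.0])
print(f_oneway(observed, predicted))       # high p => groups indistinguishable
print(omega_squared(observed, predicted))  # near zero => negligible effect
```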

5.3 Data analysis

After obfuscation, data were analysed through the ACE consistency (clustering and correlation) algorithms. Data were grouped from highly correlated (Figure 7, bottom left) to completely uncorrelated (Figure 7, top right). This data representation can uncover “truth in answers”, as represented in Figure 7, bottom left; spurious answers are depicted in Figure 7, top right.

Statements are ranked from high to low correlation according to a grouping index (GI), calculated by the ACE algorithm for columns and rows separately. The grouping index is a median of the relevant values for statements (columns) and sectors (rows) (Figure 7). The lower the difference between medians, the better the expected correlation; the smaller the grouping index, the better the correlation between answers. A small GI means that all respondents answered similarly, producing a consistent response to statements (Figure 7, bottom left). This is very important for understanding overall correlation confidence. The score, as an extracted value, is a function of a complex data matrix including non-ignorable missing data and is therefore linked to latent variables. For this reason, the score reflects the depth of respondents’ understanding of the current situation within the company, regardless of the answer score (1-7); the insight lies in the consistency comparison with other answers and sectors.

The algorithm successfully correlated answers based on statements vs. sectors, where the participants are employees. The algorithm not only correlated similarities but also found significant discrepancies within answers. For demonstration purposes, cluster and correlation analysis were utilized in two different ways:

i) Differences in clustering and correlation are assumed to be linked to survey participants’:

1. Unwillingness to answer truthfully, or other personal reasons, with participants giving answers just to finish the questionnaire. This case is found in matters of belief, i.e. statements like “I believe that my company will achieve increased sales figures in the next 3 years” or “Our reputation is better than our competitors’ reputation.” It is possible that some respondents were grouped as inconsistent for not wanting to reveal their beliefs.

2. Lack of information related to survey statements. For example, for statements such as “Our company is using green certified equipment and technologies,” inconsistency was noticed where respondents were educated enough to understand the technology in the survey statement but were not informed about its application within the company.

3. Lack of education on the topic(s) relevant to answering the statement(s). For the same example statement, inconsistency was noted when respondents were not aware of such technologies and for that reason lacked information on the topic.


ii) Differences in clustering and correlation are assumed to be linked to survey participants’ motivation, as explained by Touré‐Tillery and Fishbach (2014), who analysed two dimensions of motivation: outcome-focused and process-focused motivation.

Both examples are illustrated in Figure 7, where statements separated into 3 equal groups of high, medium, and low correlation show different statement structures that can further help reveal new phenomena. Furthermore, the ACE clustering for sectors shown on the Y axis is informative for future decisions about which sectors to involve as leaders in activities such as motivation, education, or information. When compared to UPGMA clustering (Figure 1), ACE clustering revealed a wider group of sectors (23 in Cluster B) that contained all 7 sectors of the UPGMA cluster; among those 7 sectors, 4 would be most appropriate for involvement in the described activities. Furthermore, comparing the ACE algorithm to classical statistical clustering (UPGMA) shows similarities between sectors, as presented in Table 4: 19 out of 26 sectors showed a relative distance of less than 10 (more than 62% similar).

In this study, the ACE algorithm approach allowed the analyst to use all of the data collected within the survey for statistical analysis, regardless of data type (missing or existing). In the specific case used here, a relatively high level of non-ignorable non-response was expected due to differences between respondents’ expected knowledge, their access to information, and the proposed survey statements (e.g. respondents from a financial department are not expected to be experts in the field of ecology). By using the ACE algorithm, non-ignorable non-response was included in an analysis that would not have been possible with standard statistical analysis methods.

As such, a survey on employees’ perception of a company’s innovation, with holistic and widely framed statements, turned out to be ideal for elucidating the ACE algorithm’s capabilities in the extraction of new values from a data matrix, including consistency measurement.

In practical terms, by measuring consistency in sectorial answers, one can easily identify a sectorial connection to latent variables (sectors lacking motivation, information, or education). The resulting findings have great potential to inform management in taking steps toward changing group consistency by providing additional motivation, information, or education to specific groups of employees, in order to provoke new innovative behaviour or to increase the acceptance of external innovation and collaboration.

6 Conclusion

In the present study, we provide evidence of the application of the ACE algorithm in survey analysis, illustrated through a survey on employee perception of an organization’s innovation ability. The use of ACE allowed for a better understanding of the nature of innovation through the discovery of latent variables. Furthermore, we demonstrated that ACE can be applied to capture internal perspectives and cultural elements that are tightly bound to employees’ perception. We showed that ACE is an analytical tool that allows repeatable and validated results to emerge for a company’s internal analytics.

More specifically, the ACE algorithm was validated for survey analysis, and its ability to perform latent variable extraction was demonstrated. Using positional matching and classical clustering algorithms (such as UPGMA), ACE was found to correlate with and comply with the study’s rigor; it can therefore be used as a valid tool for the analysis of complex data retrieved from surveys. Furthermore, we showed that when used on real-world data, ACE predicted actual data with p(ANOVA) < 0.05 and ω² < 0.06, indicating that the predictions matched the actual data.


The algorithm successfully correlated answers based on statements vs. sectors, where the participants are employees, and found significant discrepancies within answers, giving researchers an opportunity to extract latent variables.

It can be concluded that in cases where non-ignorable non-response data can be easily identified, such as in this study, and where the survey design does not allow the generation of any other kind of missing data, the ACE algorithm can successfully be used to analyse all collected data as valuable information. That would not be possible using standard statistical analysis methods, where non-ignorable non-response data would be lost and the sample size radically reduced. In this paper, we demonstrated the value of non-ignorable non-response through the extraction of values using latent variables, where those values can present valuable information for management.

Even though this case was a complex methodology exercise that involved a survey, an algorithm, and an innovation study, and the contribution of this work is focused on methodology, it also opens significant opportunities for further study in the fields of innovation and HR. It is of particular interest for organizations to find the most effective way of making HRM decisions that may promote the spread of innovation development (Lobanova and Ozolina-Ozola, 2014). Through a practical illustration, we have demonstrated how rich a source of information survey data can be, especially when addressing complex problems. A specific contribution to the literature is the use of non-ignorable non-response data that would otherwise be lost. It allowed clustering and correlation analysis to uncover new relationships between data, provide new meaning to those relationships, and enable the discovery of latent variables. This study describes a tool that can easily be adapted to different cases ranging from theory to practice. Therefore, an important contribution is providing a method to analyse survey data, as illustrated through the case of a survey of employee perceptions of the innovation ability of an organization. Descriptive analysis of similar perception surveys, such as the Linaker et al. (2015) study, can be supplemented with latent, algorithmically extracted data and used for scholarly and practical purposes.

The limitations of this work lie in the area of algorithmic methodology, which can often only be validated internally due to a lack of standard reference tools and methods with the same capabilities. In this case, there were no reference methods available to validate the presented algorithm, and external validation of the ACE algorithm was therefore not possible. The same algorithm was effectively externally validated in another case, where Kantoci (1999) focused on biologically active compounds for which reference methods were available for comparison of results; that work proved ACE’s validity for that specific use. The presented results should nevertheless be tested and rechecked by other methods or algorithms when they become available. Even so, the discoveries made with this algorithmic tool should not be ignored, because ACE can provide real insight into trends, and possible artificial results can be avoided through iteration of the surveying process and extraction of latent variables. In general, surveys as a data source face significant limitations in the field of data composition, comprising sampling process limitations and missing data, and in our ability to explain results, i.e. correlation vs. causation. This work, by addressing missing data, contributes to the field of data composition but still leaves the explanatory ability issue open.

Jang and Chen (2010) identified the analysis of temporal patterns of innovative activities as an important research field. This can be studied through a survey approach: future research could include a longitudinal innovation survey in which the ACE algorithm provides trend analysis and prediction. A further field of study, and added value for the presented tool, would therefore be to assess the effectiveness of a company's policies, new methodologies, popularity, market opinions, etc., based on properly structured


questionnaires. In certain emerging domains, such as open innovation, this would allow for a

stronger integration of concepts and methods across units of analysis (Bigliardi & Galati,

2013; Bogers et al., 2017; West & Bogers, 2014).


7 References and notes

Alcaide-Marzal, J. and Tortajada-Esparza, E. (2007) 'Innovation assessment in traditional industries. A proposal of aesthetic innovation indicators', Scientometrics, Vol. 72 No. 1, pp. 33-57.
Bertoli-Barsotti, L. and Punzo, A. (2014) 'Refusal to Answer Specific Questions in a Survey: A Case Study', Communications in Statistics: Theory & Methods, Vol. 43 No. 4, pp. 826-838.
Bigliardi, B. and Galati, F. (2013) 'Models of adoption of open innovation within the food industry', Trends in Food Science & Technology, Vol. 30 No. 1, pp. 16-26.
Bogers, M., Zobel, A.-K., Afuah, A., Almirall, E., Brunswicker, S., Dahlander, L., Frederiksen, L., Gawer, A., Gruber, M., Haefliger, S., Hagedoorn, J., Hilgers, D., Laursen, K., Magnusson, M.G., Majchrzak, A., McCarthy, I.P., Moeslein, K.M., Nambisan, S., Piller, F.T., Radziwon, A., Rossi-Lamastra, C., Sims, J. and Ter Wal, A.L.J. (2017) 'The open innovation research landscape: Established perspectives and emerging themes across different levels of analysis', Industry and Innovation, Vol. 24 No. 1, pp. 8-40.
Cabral, J.D.O. (1998) 'Survey on technological innovative behavior in the Brazilian food industry', Scientometrics, Vol. 42 No. 2, pp. 129-169.
Cesaratto, S., Mangano, S. and Sirilli, G. (1991) 'The innovative behaviour of Italian firms: a survey on technological innovation and R&D', Scientometrics, Vol. 21 No. 1, pp. 115-141.
Chan, H. and Perrig, A. (2004) 'ACE: An emergent algorithm for highly uniform cluster formation', in Wireless Sensor Networks, Berlin/Heidelberg: Springer, pp. 154-171.
Chesbrough, H. and Bogers, M. (2014) 'Explicating open innovation: Clarifying an emerging paradigm for understanding innovation', in H. Chesbrough, W. Vanhaverbeke and J. West (Eds.), New Frontiers in Open Innovation, Oxford: Oxford University Press, pp. 3-28.
Dadura, A.M. and Jiun-Shen Lee, T. (2011) 'Measuring the innovation ability of Taiwan's food industry using DEA', Innovation: The European Journal of Social Sciences, Vol. 24 No. 1/2, pp. 151-172.
Durrant, G.B. (2009) 'Imputation methods for handling item-nonresponse in practice: methodological issues and recent debates', International Journal of Social Research Methodology, Vol. 12 No. 4, pp. 293-304.
Everitt, B.S. (1974) Cluster Analysis, 2nd ed., pp. 61-63, New York: Halsted Press.
Everitt, B.S., Landau, S., Leese, M. and Stahl, D. (2011a) 'An introduction to classification and clustering', in Cluster Analysis, 5th ed., pp. 1-14, London: Wiley.
Everitt, B.S., Landau, S., Leese, M. and Stahl, D. (2011b) 'Hierarchical clustering', in Cluster Analysis, 5th ed., pp. 71-110, London: Wiley.
Everitt, B.S., Landau, S., Leese, M. and Stahl, D. (2011c) 'Detecting clusters graphically', in Cluster Analysis, 5th ed., pp. 15-41, London: Wiley.
Forster, J.J. and Smith, P.F. (1998) 'Model-based inference for categorical survey data subject to non-ignorable non-response', Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 60 No. 1, p. 57.
George, G., Haas, M.R. and Pentland, A. (2014) 'Big data and management', Academy of Management Journal, Vol. 57 No. 2, pp. 321-326.
Government of Canada (2010) Survey Methods and Practices, Catalogue no. 12-587-X [online], http://www5.statcan.gc.ca/olc-cel/olc.action?objId=12-587-X&objType=2&lang=en&limit=0 (Accessed 2 November 2016).
Grace-Martin, K. (2001) Missing Data Mechanisms, Cornell Statistical Consulting Unit [online], https://www.cscu.cornell.edu/news/statnews/stnews46.pdf (Accessed 2 November 2016).
Jang, S.L. and Chen, J.H. (2010) 'What determines how long an innovative spell will last?', Scientometrics, Vol. 86 No. 1, pp. 65-76.
Kabasakal, H., Asugman, G. and Develioğlu, K. (2006) 'The role of employee preferences and organizational culture in explaining e-commerce orientations', International Journal of Human Resource Management, Vol. 17 No. 3, pp. 464-483.
Kantoci, D. (1999) 'The Algorithm for Cross Evaluation (ACE) of Biologically Active Compounds', Life Sciences, Vol. 65 No. 12, pp. 1305-1315.
Klein, S.M., Kraut, A.I. and Wolfson, A. (1971) 'Employee Reactions to Attitude Survey Feedback: A Study of the Impact of Structure and Process', Administrative Science Quarterly, Vol. 16 No. 4, pp. 497-514.
Kuroki, M. (2012) 'The Deregulation of Temporary Employment and Workers' Perceptions of Job Insecurity', Industrial & Labor Relations Review, Vol. 65 No. 3, pp. 560-577.
Linåker, J., Munir, H., Runeson, P., Regnell, B. and Schrewelius, C. (2015) 'A Survey on the Perception of Innovation in a Large Product-Focused Software Organization', in International Conference of Software Business 2015, June 2015, Springer International Publishing, pp. 66-80.
Lobanova, L. and Ozolina-Ozola, I. (2014) 'Innovative trends in human resource management: a case study of Lithuanian and Latvian organisations', International Journal of Transitions and Innovation Systems, Vol. 3 No. 2, pp. 131-152.
Marlin, B.M., Roweis, S.T. and Zemel, R.S. (2005) 'Unsupervised Learning with Non-Ignorable Missing Data', in AISTATS, January 2005.
Perko, I. and Ototsky, P. (2016) 'Business ecosystems requirements for big data', International Journal of Transitions and Innovation Systems, Vol. 5 Nos. 3/4, pp. 329-352.
Savitskaya, I. and Kortelainen, S. (2012) 'Innovating within the system: the simulation model of external influences on open innovation process', International Journal of Transitions and Innovation Systems, Vol. 2 No. 2, pp. 135-150.
Seles, J., Ranilovic, J., Kantoci, D., Bauman, I., Dzanic, E., Mihaljevic Herman, V. and Cvetkovic, T. (2016) 'Comparison of experimental method with a new mathematical model to determine the shelf life of liquid mixtures for marinating', in 8th Central European Congress on Food – Food Science for Well-being (CEFood 2016), January 2016.
Singleton, R.A. and Straits, B.C. (2005) Approaches to Social Research, New York: Oxford University Press.
Su, S., Baird, K. and Blair, B. (2009) 'Employee organizational commitment: the influence of cultural and organizational factors in the Australian manufacturing industry', International Journal of Human Resource Management, Vol. 20 No. 12, pp. 2494-2516.
Temel, T. (2016) 'A methodology for characterising innovation systems – revisiting the agricultural innovation system of Azerbaijan', International Journal of Transitions and Innovation Systems, Vol. 5 Nos. 3/4, pp. 254-298.
Touré-Tillery, M. and Fishbach, A. (2014) 'How to measure motivation: A guide for the experimental social psychologist', Social and Personality Psychology Compass, Vol. 8 No. 7, pp. 328-341.
West, J. and Bogers, M. (2014) 'Leveraging external sources of innovation: A review of research on open innovation', Journal of Product Innovation Management, Vol. 31 No. 4, pp. 814-831.
Yen-Ku, K. and Kung-Don, Y. (2010) 'How employees' perception of information technology application and their knowledge management capacity influence organizational performance', Behaviour & Information Technology, Vol. 29 No. 3, pp. 287-303.


8 Tables, figure captions and figures

8.1 Tables

IDENTITY     Grouping Index    Major cluster group
Sector_05    4.90              1
Sector_22    4.93              2
Sector_03    4.94              2
Sector_11    5.18              2
Sector_01    5.29              2
Sector_04    5.30              2
Sector_08    5.32              2
Sector_10    5.39              2
Sector_23    5.39              2
Sector_02    5.43              2
Sector_07    5.50              3
Sector_06    5.52              3
Sector_15    5.52              3
Sector_18    5.57              3
Sector_20    5.58              3
Sector_24    5.68              3
Sector_17    5.81              3
Sector_16    5.86              4
Sector_19    5.97              4
Sector_26    6.31              4
Sector_09    6.33              5
Sector_25    7.29              5
Sector_13    7.73              5
Sector_12    9.24              5
Sector_14    9.24              5
Sector_21    9.24              6

Table 1 ACE clustering results for sectors showing cluster grouping indexes
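As a point of comparison with the classical benchmark, the following is a minimal sketch of average-linkage (UPGMA) hierarchical clustering of per-sector answer profiles, cut into major groups as in Table 1; the profile matrix is synthetic and the snippet is not the study's pipeline.

```python
# Classical UPGMA benchmark over per-sector answer profiles (synthetic data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = rng.normal(size=(26, 30))   # 26 sectors x 30 statements, made up

# Average linkage over Euclidean distances is the classical UPGMA method.
Z = linkage(profiles, method="average", metric="euclidean")

# Cut the dendrogram into six major groups, mirroring Table 1's grouping.
groups = fcluster(Z, t=6, criterion="maxclust")
for i, g in enumerate(groups, start=1):
    print(f"Sector_{i:02d}: major cluster group {g}")
```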


Sector      Answers per sector*   Non-ignorable data (0)   % non-ignorable missing values (0)
Sector_01   8970                  1721                     19.19
Sector_02   4810                  594                      12.35
Sector_03   2990                  478                      15.99
Sector_04   2340                  521                      22.26
Sector_05   3380                  630                      18.64
Sector_06   780                   48                       6.15
Sector_07   3770                  896                      23.77
Sector_08   2600                  493                      18.96
Sector_09   910                   188                      20.66
Sector_10   650                   70                       10.77
Sector_11   6890                  1322                     19.19
Sector_12   130                   21                       16.15
Sector_13   650                   293                      45.08
Sector_14   260                   95                       36.54
Sector_15   780                   80                       10.26
Sector_16   1430                  328                      22.94
Sector_17   3900                  885                      22.69
Sector_18   5850                  1157                     19.78
Sector_19   260                   6                        2.31
Sector_20   3380                  1025                     30.33
Sector_21   520                   169                      32.50
Sector_22   2730                  370                      13.55
Sector_23   1300                  140                      10.77
Sector_24   1690                  474                      28.05
Sector_25   780                   271                      34.74
Sector_26   1560                  288                      18.46

Table 2 Percent unanswered statements by sector, where sectors had a different number of respondents
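The Table 2 figures can be derived with a simple per-sector tally: count the answers coded 0 (non-ignorable non-response) and express them as a percentage of all answers in that sector. The sketch below assumes a hypothetical long-form layout (one row per answer) and toy values.

```python
# Per-sector non-response percentages, Table 2 style (toy long-form data).
import pandas as pd

answers = pd.DataFrame({
    "sector": ["Sector_06"] * 4 + ["Sector_13"] * 4,
    "value": [5, 0, 6, 4, 0, 0, 3, 0],   # 0 = non-ignorable non-response
})

summary = answers.groupby("sector")["value"].agg(
    n="size",
    n_zero=lambda v: (v == 0).sum(),
)
summary["pct_zero"] = 100 * summary["n_zero"] / summary["n"]
print(summary)
```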


Test for equal means
                  Sum of sqrs   df   Mean square   F          p (same)
Between groups:   3.58E-05      1    3.58E-05      3.14E-05   0.9956
Within groups:    13.7074      12    1.14228
Total:            13.7074      13
ω²: -0.07692

Levene's test for homogeneity of variance, from means: p (same) = 0.9687
Levene's test, from medians: p (same) = 0.9943
Welch F test in the case of unequal variances: F = 3.138E-05, df = 11.99, p = 0.9956

Table 3 ANOVA of existing vs. predicted data
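For readers who wish to reproduce checks of this kind, here is a minimal sketch (illustrative data, not the study's) of Levene's test from means and from medians, plus a Welch test that does not assume equal variances; with two groups, the Welch F equals the squared Welch t statistic.

```python
# Table 3-style robustness checks on illustrative data.
import numpy as np
from scipy import stats

existing = np.array([3.1, 2.8, 3.4, 3.0, 2.9, 3.2, 3.1])
predicted = np.array([3.0, 2.9, 3.3, 3.1, 2.8, 3.2, 3.2])

lev_mean = stats.levene(existing, predicted, center="mean")
lev_median = stats.levene(existing, predicted, center="median")
t_welch = stats.ttest_ind(existing, predicted, equal_var=False)

print(f"Levene (means):   p = {lev_mean.pvalue:.4f}")
print(f"Levene (medians): p = {lev_median.pvalue:.4f}")
print(f"Welch F = {t_welch.statistic ** 2:.4g}, p = {t_welch.pvalue:.4f}")
```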


Sector      Relative distance between classical and ACE grouping   Similarity % between classical and ACE grouping
Sector_01   2                                                      92%
Sector_20   2                                                      92%
Sector_04   3                                                      88%
Sector_06   4                                                      85%
Sector_23   4                                                      85%
Sector_25   4                                                      85%
Sector_16   5                                                      81%
Sector_22   5                                                      81%
Sector_12   6                                                      77%
Sector_24   6                                                      77%
Sector_10   8                                                      69%
Sector_13   8                                                      69%
Sector_26   8                                                      69%
Sector_02   9                                                      65%
Sector_09   9                                                      65%
Sector_17   9                                                      65%
Sector_18   9                                                      65%
Sector_07   10                                                     62%
Sector_11   10                                                     62%
Sector_08   14                                                     46%
Sector_03   16                                                     38%
Sector_05   16                                                     38%
Sector_21   16                                                     38%
Sector_14   18                                                     31%
Sector_15   19                                                     27%
Sector_19   20                                                     23%

Table 4 Distance comparison between classical and ACE cluster grouping
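The similarity percentages in Table 4 are consistent with a simple rescaling of the relative distance over the 26 sectors, similarity = (1 - d/26) x 100, rounded to the nearest percent. The sketch below shows that reading (an inference from the table, not a formula stated by the authors).

```python
# Inferred relation between Table 4's distance and similarity columns.
distances = {"Sector_01": 2, "Sector_06": 4, "Sector_08": 14, "Sector_19": 20}

for sector, d in distances.items():
    similarity = round((1 - d / 26) * 100)
    print(f"{sector}: distance {d} -> {similarity}% similarity")
# Sector_01: 92%, Sector_06: 85%, Sector_08: 46%, Sector_19: 23%,
# matching Table 4.
```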


8.2 Figure captions and figures

Figure 1 Sector cluster paired-group (UPGMA) plot (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)


Figure 2 Example of positional matching for a dot-shaped base object (a), used for validation of the ACE algorithm and illustrated through positional correlation (b) (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)
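To convey the positional-matching idea behind Figures 2-5, the sketch below places a known binary "base object" on a grid, perturbs it, and scores cell-by-cell agreement. The reconstruction step here is a deliberate stand-in (a noisy copy), not the ACE algorithm itself; the grid dimensions echo the 26 sectors by 30 statements shown in the plots.

```python
# Positional-matching validation, sketched with a stand-in reconstruction.
import numpy as np

rng = np.random.default_rng(1)
grid = np.zeros((26, 30), dtype=int)   # sectors x statements
grid[10:14, 12:16] = 1                 # "dot"-shaped base object

# Stand-in reconstruction: flip a few random cells to mimic prediction error.
recon = grid.copy()
noise = rng.random(grid.shape) < 0.02
recon[noise] ^= 1

matches = (grid == recon).sum()
print(f"positional match: {matches}/{grid.size} cells "
      f"({100 * matches / grid.size:.1f}%)")
```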


Figure 3 Example of positional matching for a doughnut-shaped base object (a), used for validation of the ACE algorithm and illustrated through positional correlation (b) (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)


Figure 4 Example of positional matching for a banana (moon)-shaped base object (a), used for validation of the ACE algorithm and illustrated through positional correlation (b) (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)

Figure 5 Example of positional matching for a random-shaped base object (a), used for validation of the ACE algorithm and illustrated through positional correlation (b) (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)

[Figure 6 chart omitted: title "Statement 1 Prediction Model"; x-axis: Answer (1-7); y-axis: Weighted Distribution; series: Original Data, Predicted Data]

Figure 6 Comparison between actual and predicted data based on the ACE prediction algorithm (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)
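A Figure 6-style comparison can be reproduced with a few lines of plotting code. The sketch below uses two illustrative weighted distributions over the seven answer categories; the numbers are not the study's data.

```python
# Figure 6-style overlay of original vs. predicted weighted distributions.
import matplotlib.pyplot as plt

answers = range(1, 8)
original = [0.4, 0.9, 1.6, 2.8, 3.1, 2.2, 1.0]    # illustrative values
predicted = [0.5, 1.0, 1.5, 2.7, 3.2, 2.1, 1.1]

plt.plot(answers, original, marker="o", label="Original Data")
plt.plot(answers, predicted, marker="s", label="Predicted Data")
plt.xlabel("Answer")
plt.ylabel("Weighted Distribution")
plt.title("Statement 1 Prediction Model")
plt.legend()
plt.show()
```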

Figure 7 SunCore clustering and correlation plot showing (i) type of statement and (ii) motivation dimension structure for the low-, medium-, and high-correlation groups of statements and sectors, clustered using the same ACE algorithm (plotted using MS Excel 2003, Microsoft Corporation, Redmond, WA)