


Vrije Universiteit Amsterdam

QUALITY ASSURANCE ON BIG DATA & ANALYTICS

Proposed measures to increase the quality of the Big Data process

Drs. D.R. Voges

NOVEMBER 2014


QUALITY ASSURANCE ON BIG DATA & ANALYTICS

This research paper examines the unique properties of Big Data, reviews the critical steps in its analysis, explores the usefulness of quality control measures to gain increased certainty over the output of Big Data & Analytics, and investigates the results of implementing these proposed control measures in an actual business case.

Document information Author: Drs. D.R. (Dennis) Voges

University: VU University Amsterdam

Course: Executive Master of IT Auditing

Version: FINAL

Date: 2014-11-11

Thesis guidance VU University: Dr. R.P.H.M. (René) Matthijsse RE

KPMG: Drs. J.J. (Jaap) van Beek RE RA


TABLE OF CONTENTS

Chapter I. INTRODUCTION
  Preliminary literature review
  Relevance
  Motivation
  Research problem
  Questions
  Method
  Scope and delimitations
  Reading guide

Chapter II. BIG DATA & ANALYTICS
  Background
  Big Data’s unique properties
  ETL process: Preparing raw data for inquiry
  Technology: ETL process enablers
  Analytics: Extracting information from data
  Technology: Analytics process enablers
  Decision making: Evaluating analytic outcomes for insight

Chapter III. QUALITY ASSURANCE
  Why Quality Assurance?
  Scoping of quality dimensions
  Maintaining quality in the ETL phase
  Maintaining quality in the Analytics phase
  Security, Privacy and Ethics
  Proposed Quality Assurance framework for Big Data & Analytics

Chapter IV. CASE STUDY
  Method of inquiry
  Obtaining empirical data
  Description of selected business case
  Interview findings

Chapter V. CONCLUSIONS AND RECOMMENDATIONS
  Conclusion
  Recommendations
  Suggestions for further research

Chapter VI. ANSWER TO THE RESEARCH QUESTION

Chapter VII. BIBLIOGRAPHY

Chapter VIII. APPENDIX
  Big Data & Analytics QA controls and expert interview answers


CHAPTER I. INTRODUCTION The term “Big Data” is by now ubiquitous and has been applied to numerous implementations, perhaps not always in the context its original creators proposed. Much has been written on the commercial and technical aspects of Big Data, and the field of data analytics has been gathering steam in recent years. Up to now, however, little has been published on measures – such as quality assurance – by which quality improvements may be achieved in relation to the origin, process and outcome of Big Data & Analytics.1 This study attempts to remedy this situation.

PRELIMINARY LITERATURE REVIEW An informal survey of early publications, performed by the researcher, shows that these focused on Big Data’s potential value. They describe inspiring examples of how to connect previously separate data sets and foretell the potential of Big Data to improve companies’ performance by enabling better decision making, based on the greater amount and detail of information at our disposal (McAfee & Brynjolfsson, 2013). Of course, the age-old adage still applies: measuring is knowing; you cannot manage what you don’t measure. In this day and age, however, the trouble appears not to be one of measurement. For one thing, we measure more than we can digest. By extension, while you cannot manage what you don’t measure, you also cannot manage what you do not analyse. The crucial point is that data does not automatically change state from data to information to that most coveted form of competitive advantage: insight. It requires human intervention and ingenuity.

It follows that after the evangelisation phase, later Big Data publications appear to shift the subject to the application of technologies and techniques, shedding light on the inner workings of the architecture and design by which commercial value can be obtained from data. This is because applying Big Data is not always straightforward; it is processed in different ways than firms are traditionally used to, as this study will show. What is clear is that data analysis techniques are still continuously evolving to deal with Big Data (Bryant, Katz, & Lazowska, 2008) and that every day new technologies and techniques come into being to deal with large, fast-flowing and diverse data sets. However, it is still not clear whether these technological solutions and new techniques comply with the established management and control principles that risk professionals and auditors expect of traditional corporate IT systems. How might they affect the conclusions reached by data analysis and corporate decision making?

This question is now slowly being answered. A recent publication (IBM, 2013) shows the discussion is shifting to the organisational aspects of Big Data, by illustrating what the enablers and showstoppers are. Many of these challenges are only now being discussed and have not yet been solved. This study attempts to nudge the current discussion along further by focussing on the governance, risk and control aspects of Big Data by which quality assurance may be achieved. It results in a proposal for controls which, when implemented, are believed to improve the quality of the Big Data process and its consequences.

RELEVANCE We live in an age where we are surrounded and governed by automation. This has unquestionably had a large impact on our lives. Consider the daily personal use of smartphones and social media, but also of navigation software and online payments. Nearly everyone carries a portable tracking device that doubles as a phone and a media player. Wearable health technology is also entering the market, measuring people’s vitals and movements to provide them with information on their health or latest sporting achievement. Multitudes of sensors, systems and applications continuously log our actions and record related metadata.

1 ISACA, one of the largest professional associations of IT auditors, on 24 April 2013 requested contributions to a whitepaper discussing the topic of Big Data and audit. An article later appeared in the ISACA Journal, but it only discussed how IT auditors can themselves leverage Big Data (Setty & Bakhshi, 2013).


This trove of data is now being investigated for its commercial value and, through the use of Big Data analytics, is leveraged to gain behavioural insights into people. Historical as well as real-time data are used as input for predictions, but also to feed business KPIs, automated algorithms or system triggers. Google’s self-driving cars will be based on Big Data. The price of your insurance policy will be based on your risk-taking behaviour, the measurement and analysis of which is enabled by Big Data.

As more and more of these Big Data applications become the de facto standard and, to enable economies of scale, are fully automated, less and less human involvement is required. It therefore becomes increasingly important to ensure that automated Big Data processes operate correctly and that society and the individuals whose lives are impacted by their algorithms are treated fairly. This paper will attempt to demonstrate how the IT auditor can assist in achieving this goal.

MOTIVATION The researcher is motivated to study the topic of Big Data by his interest in cutting-edge technological innovation and his desire to show the relevance of the profession of the IT auditor and other risk professionals. By contributing insight into this topic, the author hopes to add to the IT auditor’s ability to respond with added value to this new technological development and its application in real life.


RESEARCH PROBLEM The goal of this study is to respond to the issue of how to safeguard the quality of the Big Data process and its outcome, and to the challenge of removing unnecessary uncertainty in relation thereto. This response, in the form of specific control measures, is expected to contribute positively to the Big Data & Analytics outcome. To determine how this may be achieved, the author poses the following central research question:

• In what manner and under which conditions is it feasible to apply effective control measures to Big Data implementations and gain increased certainty over the validity of data analytics outcomes?

QUESTIONS In order to answer the central research problem, the following sub-questions will be investigated:

• Which unique properties distinguish Big Data, what is the analytical process involved, and what are its critical organisational aspects? (Description)

• Which control measures are appropriate for Big Data implementations, and based on a real-life business case study, what control measures have been applied and found effective? (Analysis)

• Can controls strengthen the validity of decisions based on Big Data & Analytics by increasing the quality of the outcome? (Reflection)

METHOD The answers to this study’s central research question and related sub-questions will be obtained through:

a) Desk research in the first part, primarily in the form of a literature review. It is expected that the insights and conclusions from research publications originating from various professional disciplines (such as Business Intelligence, IT audit, Forensics, Data Analytics, Risk Management, Statistics, Computer Science and the Humanities) can be combined to produce many of the answers to this study’s questions.

b) In addition to an exploration of available theory, a real-life Big Data & Analytics business case applied at a Dutch international bank will be investigated to gain empirical evidence on practical applications of controls. By interviewing the experts involved with the business case, this study will try to determine whether the proposed controls have been applied, and obtain feedback on the suitability of their design and effectiveness.2

SCOPE AND DELIMITATIONS This study examines the risk and control aspects of Big Data & Analytics. To do so, it will consider the following aspects of and for Big Data:

• Data: Big Data’s relevant data properties and its sources;

• ETL processing: raw source data’s capture, storage and handling and processed data’s representation

• Analytics: statistical modelling and algorithm development involved in extracting insight from data;

2 It is necessary to clarify that the purpose of this study is not to provide assurance. The researcher will apply accepted qualitative research methods as a means of inquiry. Therefore, no audit activities will be performed and no audit testing will be part of investigating the selected business case.


• Decision making: the visualisation and interpretation of processed data.

• Quality Assurance: What quality perspectives should be covered?

• Governance of IT: what are the organisational enablers and which IT security aspects are involved?

As part of this study’s focus on the assurance aspects of Big Data, a limited number of organisational domains will also be in scope. Technology and People will be considered in order to come to a balanced set of control measures.

This study will not investigate the qualitative properties or the nature of the demand by business and operations for Big Data & Analytics. Nor is the intent to study the relevance or utility of Big Data & Analytics outcomes and their use in KPIs, monitoring variables, etc., or in management decision making. These aspects shall, for the purpose of this study, be considered a given and will therefore be left out of scope. The reader is referred to the following figure, which illustrates the aforementioned scope constraints:

Figure 1: Conceptual scope of research on Quality Assurance on Big Data & Analytics. The figure depicts internal (structured) and external (unstructured) data sources feeding a Big Data platform (e.g. Hadoop), linked to business demand, business KPIs (e.g. thresholds, triggers), decisions and conclusions, with scope annotations for {Big Data properties}, {Algorithm and query techniques}, {Nature of request}, {Relevance or utility} and {Interpretation of answers}.


READING GUIDE The term ‘Big Data’ as used by the general public conflates the raw and processed data with the analysis techniques intended to unleash the information contained within. For the sake of clarity the researcher will attempt to abide by the term Big Data & Analytics. As the term Business Intelligence is used interchangeably with Analytics in the literature, this term will sometimes be referred to as well. Further to this, when referring to:

• Big Data’s “data” (as in the raw or processed data involved in making business decisions and its properties), this study will use the term Big Data set.

• The end-to-end chain of actions of “Capture / Storage / Preparation” (as in preparing raw and processed data for inquiry) and “Analytics / Visualisation / Interpretation” (as in the process of analysing large amounts of data with the help of tooling), this study will refer to as the Big Data process.

The reason for choosing the term analytics over its parent word analysis is that analytics has gained a more inclusive meaning in business and also refers to the use of the technology and the associated tools for data analysis – which is exactly what this study attempts to cover.


CHAPTER II. BIG DATA & ANALYTICS BACKGROUND

Before delving deeper into the subject matter, it is useful to provide some history on the practice of Big Data & Analytics and explain the conceptual context. As the components of the term hint at, data refers to large volumes of records and analytics is related to knowledge discovery. Knowledge discovery in business has been around for a while and goes by many names: Process Mining, Predictive Analytics, Forecasting, Rating Models or just plain Applied Statistics. When its purpose is strategic decision making by management, it has long been referred to as Business Intelligence. In fact, this term was already coined by a 19th-century author to denote the ability of businessmen to successfully act upon received information (Devins, 1865).

Evolution

Attempts have been made to provide a way to categorise paradigm shifts in the evolution, applications and emerging research areas of Big Data analytics.

(Chen, Chiang, & Storey, 2012) have provided a framework for the evolution of analytics, based upon key characteristics and capabilities. The authors describe the different stages of evolution in Business Intelligence and Analytics (BI&A):

• The first phase, BI&A 1.0, is characterised by data management and warehousing: data warehouses (DWH) and ETL tooling are used to support OLAP techniques, which, along with dashboards and metrics, assist in the analysis and visualisation of KPIs.

• The second phase, BI&A 2.0, is defined by the addition of new information sources based on web platforms; e.g. web analytics, user generated content, Web 2.0 social information and crowd-sourcing.

• The third phase, BI&A 3.0, is said to include mobile aspects, resulting in, amongst others, location-aware and person-centred analysis.

It is difficult to say with certainty what the future will bring (predictions are difficult, especially about the future). This is why this study will adhere to using generic process steps and conceptual descriptions as much as possible in order to maintain future relevance.

As stated earlier, Big Data & Analytics is closely related to Business Intelligence (BI). This is apparent when looking at the BI Hype Cycle (Gartner, 2011) illustrated in figure 2. In this predictive model, 13 capabilities are identified that the future was expected to hold for the field of BI.


Figure 2: Business Intelligence Hype Cycle (Gartner, 2011)

At the time of writing (2014) it is unclear whether all predictions made then have held; data quality tools and Excel as a BI front-end should have reached the plateau of productivity by now. It is clear that basic query and search capabilities are currently being used by most organisations, but Business Intelligence SaaS and visualisation-based data discovery tools for real-time decisioning have not hit the mainstream yet. Further away still seem to be enterprise-wide metadata repositories, search-based data discovery tools and natural-language question answering. Many of these avenues still require exploration by Big Data pioneers.

Conceptual model

We have already mentioned Big Data and the process of extracting value from it for decision making. For the purpose of clarity, this study proposes the following simple definition of Big Data & Analytics:

Big Data & Analytics is the process that leads to informed, factual decisions based on scientific analysis of the relationships between multiple data points, creating value for stakeholders through strategic leverage of these insights.

The potential for value generation has been recognised by many. But what is the process which leads to this much-mentioned value? Many conceptual ideas and process descriptions abound. For example, (Accenture, 2013) has modelled this process in their Data Value Continuum, where data changes state from raw data to processed data, to insight, presentation and transactions. IBM and the Big 4 consulting firms also propose their own methods – which is the most accurate? What is apparent is that these models more or less cover the same steps and intend to achieve the same goals. For research purposes it is helpful to adhere to generic terms. That is why this study will adapt the conceptual model for Knowledge Discovery (Chen & Zhang, 2014), illustrated in figure 3, because of the usefulness and flexibility of its generic nature.


Figure 3: Knowledge discovery process (Chen & Zhang, 2014)

While this study broadly follows the lines of this model, it distinguishes more granular process steps which are necessary for the purpose of answering the research question. The following listing shows the correspondence between the activities described in the original model and the (sub-)activities identified by this study.

1) Data Recording is split between Capture and Storage.

2) Data Cleaning / Integration / Representation is synonymous with Conversion & Transformation and Cleansing & Enrichment.

Process steps 1) and 2) of the Chen & Zhang model will be covered in this study’s chapter on “ETL Processing”.

3) Data Analysis is split between Data Profiling, defining the Statistical Model and Algorithm / Query Development.

Process step 3) of the Chen & Zhang model will be covered in this study’s chapter on “Analytics”.

4) Data Visualisation and Interpretation are separated as activities. Decision Making is omitted due to the delimitations stated earlier. This study adds a further step: Automated Big Data.

Process step 4) will be covered in this study’s chapter on “Decision Making”. This mapping and the resulting adapted model which will be used by this study are illustrated in figure 4.

Figure 4: Conceptual Big Data & Analytics process (Voges, 2014)

The three identified phases also align with the process steps distinguished by other researchers, such as (Ceri, Della Valle, & Pedreschi, 2012), and respectively correspond to what these authors call a) Data Preparation, b) Data Analysis and c) Data Evaluation.

The knowledge discovery model focuses on process and does not integrate the technological component. Due to its implicit importance, technology will be covered in a separate paragraph in both chapters.


BIG DATA’S UNIQUE PROPERTIES The use of the word ‘big’ in Big Data evokes the idea that one is dealing with something large and formidable. But what exactly distinguishes big from small? That is, how do we know that we are dealing with actual Big Data and not its smaller cousin? In other words, what is meant by the word “Big” in the term “Big Data”?

Although the term alludes to the large size of these datasets, no actual size thresholds can be defined which distinguish Big Data sets from regularly sized data sets. Instead, the concept has been defined as referring to those “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze” (McKinsey Global Institute, 2011). The concept is linked with the challenges associated with the capture, curation, storage, search, sharing, transfer, analysis and visualisation of very large data sets, whereby the computing techniques at our disposal, and the current state of technology, are the limiting factors to the ability to extract value.

While gaining popular use, Big Data is at the same time surrounded by a lot of mythology. It is even considered by some researchers to be somewhat of a misnomer, because, mainly as a result of technological progress, ‘big’ can no longer be considered the key definer of the concept. According to (Boyd & Crawford, 2012), “Big Data is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets.” What is needed are more discerning criteria along which Big Data can be judged, which go further than size alone.

Gartner is thought to have coined the so-called 3 V’s in relation to Big Data: Volume, Velocity and Variety (Beyer, 2013). The terms were already used earlier in a whitepaper on 3D Data Management (Laney, 2001), but not yet linked to Big Data. It is these three properties that are most popularly used to define Big Data and which have a direct impact on how these data sets must be handled in processing. Later, Veracity was added to this set, as the challenge of managing the trustworthiness of data became apparent to early pioneers. The following section describes each of the “V’s” involved in Big Data and explains their impact on the Big Data & Analytics process.

Volume

That Big Data is associated with large data volumes speaks for itself. More specifically, it is a reference to the limits of hardware and systems in being able to capture, store and manipulate data; it is at this boundary that the term Big Data applies. It is also relative, as what might have been considered Big Data in the early 90’s can now be processed by a child’s toy, as a result of subsequent technological progress. To give an example, in the recent past it required a dedicated team and hardware solution to have a computer beat Garry Kasparov, the then reigning world champion of chess. IBM finally did so in 1997 with Deep Blue. In 2009, Pocket Fritz 4.0, a mobile app, achieved a higher ELO3 rating (2898) than either Mr. Kasparov or Deep Blue and won a Grandmaster tournament in Buenos Aires (Mercosur Cup, 2009).

To point out another well-known (and one of the earliest) applications of Big Data solutions: Google is famous for bringing the vast volume of websites on the internet within reach of our fingertips via its search bar. It is no wonder that Google developed the technology which provided the basis for MapReduce, one of the foundation blocks of Big Data architecture.

This example shows that it is helpful to speak of Big Data in terms of the solutions which have been brought to bear on it, such as the use of parallel computing, distributed computing and cloud computing, in order to fully understand the impact of Big Data volume.

3 The ELO rating system is a method for calculating the relative skill levels of players in competitor-versus-competitor games such as chess.


Where does this volume come from? The internet as a distribution channel is one source. Netflix, for example, sends an enormous amount of video data to customers, which according to a commercial trend analysis takes up about 34.2% of internet traffic in North America (Sandvine Incorporated, 2014). One aspect of Big Data is that it also includes data generated by people. In 2014, there are nearly 3 billion people connected to the internet. By the end of 2013, Facebook had 1.23 billion users. That is a lot of data being generated. According to (Canny & Zhao, 2014), this people-generated data has applications in government, business, education and healthcare and is very often used in commercial data mining. This gets us right back to the value proposition of Big Data, which often involves analysis of behaviour.

Of course, businesses also create vast amounts of data. Consider for example the financial industry, where banks and payment service providers process millions of transactions each day. Equens, a pan-European payments processor, handled 10.4 billion payment transactions in 2012 (Equens S.E., 2013). As we shall see later, payments data is recorded with additional information stored in metadata. Knowing where payments are made, at what time, to whom and for what amount – when correctly interpreted – can provide valuable input for management steering information and decision making.

Velocity

In physics, velocity is a function of time, conveying both the magnitude and the direction of movement. When the term was applied to Big Data, it seems the original authors (Beyer, 2013) wanted to coin an easily remembered marketing term. Velocity in this case refers to the rapid generation and dissemination of new data; that is, the rate at which its volume grows and its throughput increases across communication channels. The term is also closely related to acceleration, which is to say, the positive change in velocity. The latter is perhaps a more adequate appropriation for Big Data: there is a constant acceleration in the growth of recorded data as the number of electronic sensors in use increases and the rate of their adoption by the public and business keeps growing, with no indication yet of when this growth will plateau.

Another researcher (Muthukrishnan S., 2005) does not use the term velocity, but refers to a similar concept: data stream speed. In his view, a high-speed stream of data stresses computing infrastructure when performing transmission, capture or storage. This definition aligns with the view of (McKinsey Global Institute, 2011) and is a very elegant short description of Big Data whereby data is in transit.

More and different kinds of metrics than only the financial are coming to be considered valuable in doing business, leading to their recording and capture. Again, a fine example can be found at media company Netflix, which in 2012 analysed 30 million movies served per day and 4 million movie ratings received per day (Harris, 2012), in order to come up with better viewing recommendations for its customers. Since then, Netflix’s number of customers, global presence and the size of its movie portfolio have only grown.

Velocity impacts how data is processed, which is especially a challenge in applications requiring real-time analysis, where the purpose is “to identify meaningful patterns in one or more data streams and trigger action to respond to them as quickly as possible” (Intelligent Business Strategies, 2012). This study will explain why when it next investigates the required actions and operations on data in the Big Data & Analytics process.
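To make the notion of acting on a data stream more concrete, the following sketch illustrates the general idea of windowed monitoring with a trigger. It is a minimal Python illustration, not a description of any particular streaming product; the event rate, window size and threshold are invented for the example.

```python
from collections import deque

def monitor_stream(events, window_seconds=60, threshold=100):
    """Count events in a sliding time window and yield an alert when the
    count exceeds a threshold - a toy stand-in for 'identify meaningful
    patterns in a data stream and trigger action'."""
    window = deque()                     # timestamps currently inside the window
    for ts, payload in events:           # events: iterable of (timestamp, data)
        window.append(ts)
        # drop timestamps that have fallen out of the window
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > threshold:
            yield ts, len(window)        # trigger: hand the alert to a downstream action

if __name__ == "__main__":
    # synthetic burst of events, 10 per second
    synthetic = [(t * 0.1, "click") for t in range(3000)]
    for alert_time, count in monitor_stream(synthetic, window_seconds=5, threshold=40):
        print(f"t={alert_time:.1f}s: {count} events in the last 5s")
        break
```

In a production setting this logic would typically run inside a stream processing engine rather than a simple loop, but the quality question remains the same: are the window, the threshold and the resulting action appropriate for the decision they trigger?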


Variety

To explain why Variety is also an important discerning feature it is helpful to maintain a distinction between structured and unstructured data.

The use of structured information in business environments, based on relational database management systems (RDBMS), has flourished, especially in the field of finance. RDBMSs are ideally suited to the purposes of transaction recording and further processing into a general ledger for profit accounting. To maintain referential integrity, structured data requires the design of a data schema. Structured data is stored in predictable forms, based on alphanumerical values, usually constrained by the allowances of the database field.

Unstructured data is not so predictable – it comes in the form of text, audio or video, or a combination thereof, and is often difficult to order and classify due to the lack of predefined constraints such as a data schema. The web is one prime example. One way of solving the problem of understanding its coherence is to look at the referential information contained within weblinks and index these. This has given rise to Google’s PageRank algorithm, which this study considers one of the earliest applications of Big Data to become available to the general public. PageRank has empowered Google users in finding what they are searching for on the internet. These links are certainly not as strong as those in relational databases and do not add to the integrity of the web, as links can be broken or changed at will.
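As an aside, the core idea behind PageRank can be illustrated in a few lines. The sketch below is a simplified power-iteration version in Python and is not Google’s actual implementation; the four-page “web” is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                       # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:                                  # pass rank along each outgoing link
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# hypothetical four-page web
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))
```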

To add to the diversity, Variety also applies to the different file container formats which exist for the same kind of media, be their origin proprietary or open source. Apple prefers the use of QuickTime “*.mov” files, while the open-source community has developed the Matroska “*.mkv” container format for the same media. But are we able to easily read out the output of older proprietary SCADA systems? All of these kinds of data must be processed and parsed in order to be interpreted correctly when applied in Big Data situations.

Fortunately, Big Data platforms have gained the capability to deal with most file types. One recent survey, sponsored by IBM, included the following quote by a survey respondent: “Hadoop’s scalability for Big Data volumes is impressive, but the real reason they’re working with Hadoop is its ability to manage a very broad range of data types in its file system, plus process analytic queries via MapReduce across numerous eccentric data types” (TDWI, 2011)
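The MapReduce processing model referred to in the quote can be illustrated with the classic word-count example. The sketch below mimics the map and reduce steps with Python’s multiprocessing module; it is a toy illustration of the programming model, not Hadoop’s actual API, and the text “splits” are invented.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk):
    """Map: turn one chunk of raw text into partial word counts."""
    return Counter(chunk.lower().split())

def reduce_phase(left, right):
    """Reduce: merge two partial results into one."""
    left.update(right)
    return left

if __name__ == "__main__":
    # hypothetical "splits" of a large unstructured input, one per worker
    chunks = [
        "big data is less about data that is big",
        "than about the capacity to search aggregate and cross-reference",
        "large data sets",
    ]
    with Pool() as pool:
        partials = pool.map(map_phase, chunks)       # map step runs in parallel
    totals = reduce(reduce_phase, partials, Counter())
    print(totals.most_common(3))
```

The design point is that the map step can be distributed across many machines holding different parts of the data, while the reduce step combines the partial results into a single answer.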

Another aspect of Variety is related to the data’s origin. There is an uncountable number of sources. The aforementioned SCADA-like systems are one of the traditional sources of digital data records and might go back the longest in time. Having evolved from monolithic systems with no connectivity, they are now networked and partake in the “Internet of Things”. While their application in the early days was mainly for industrial control of large-scale processes, infrastructure and facilities, the information they record is slowly being consumerised. For example, traditional traffic information from road sensors has long been available to the general public for predicting traffic congestion. Information from the internet has become available since the 1990’s. More recently, smart meters provide us with information on water and energy use in homes. Of course, older, non-digital records are being digitised as well, allowing for their analysis.

Newer sources include sensors which individuals continuously carry with them. Since the invention of GPS and the advent of consumer smartphones, all kinds of new environmental data are being recorded. This is because the current generation of smartphones often includes a multitude of sensors: camera, accelerometer, thermometer, altimeter, barometer, compass, chronograph, touch screen and GPS navigation. The data which these sensors record provides information on the geographical location and movement of their users and on their surroundings (e.g. temperature, time of day). In fact, this data is often used to augment the understanding of traditional sources of information.

An upcoming trend in electronics is wearable health gear, such as smartwatches, which expands the scope of recorded data by capturing records of an individual’s bodily functions, such as perspiration, temperature, pulse and blood pressure. Can you imagine Netflix analysing customers’ heartbeat and blood pressure at certain scenes in a movie?

Veracity

Some, including (IBM, 2013), argue that the property Veracity should be added to the above nomenclature. IBM describes the issue of veracity as “data in doubt”, referring to the “uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations”, by which they mean that the authenticity and genuineness of the sources of data records are difficult to confirm. Again, it is useful to discern the difference in sources by differentiating between internally generated data and externally acquired data.

The validation of external data sources is one of the biggest challenges in realising data quality in Big Data. If external data is generated by a reputable vendor, third-party assurance reports have traditionally been used to gain comfort over the design & implementation and the operating effectiveness of controls within the internal control environment of the vendor. While this still does not say anything directly about the quality of the data, it provides a certain level of comfort and accountability.

Sometimes alternative sources are available to further validate data obtained from outside the organisation. For example, investment firms often employ multiple interfaces to financial data vendors to cross-check the validity of data inputs used for valuing assets under management. In the case of social data, however, it is often impossible to perform cross-checks: due to the walled-garden nature of many of these platforms there exists but a single golden source. For such externally generated data no assurances can be provided and often extra treatment and vetting is required before it can be relied on.
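As an illustration of such a cross-check, the sketch below compares two hypothetical vendor price feeds and flags instruments that are missing from one feed or diverge beyond a tolerance. The feed contents, identifiers and tolerance are invented for the example; a real implementation would follow the firm’s own valuation control procedures.

```python
def cross_check_prices(primary, secondary, tolerance=0.005):
    """Flag instruments whose prices from two (hypothetical) data vendors
    diverge by more than a relative tolerance, or are missing from one feed."""
    exceptions = []
    for instrument, price in primary.items():
        other = secondary.get(instrument)
        if other is None:
            exceptions.append((instrument, "missing in secondary feed"))
        elif abs(price - other) / max(abs(price), abs(other)) > tolerance:
            exceptions.append((instrument, f"divergence {price} vs {other}"))
    return exceptions

# illustrative feeds from two fictitious vendors
feed_a = {"INSTR-001": 14.02, "INSTR-002": 101.50, "INSTR-003": 98.10}
feed_b = {"INSTR-001": 14.03, "INSTR-002": 106.80}
for instrument, issue in cross_check_prices(feed_a, feed_b):
    print(instrument, issue)
```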

For data which has been generated internally within a controlled IT environment, certain assurances can be given on the controls which guard the integrity of the data, by determining whether these have been adequately designed and, most importantly, are implemented and operating effectively. The profession of the IT auditor has grown around supporting the safeguarding of the integrity of information systems supporting financial processes. IT control better practices such as application controls, interface monitoring and logical access management are applied to increase control over structured data, which is made possible by the predictable nature of database records.

How many V’s did you say?

Starting with the original term, the catchy “V” nomenclature for describing Big Data has continued to be added to throughout the years; e.g. Value, Variability and Veracity have also been proposed. As discussed earlier, Veracity points out the problem of verifying the trustworthiness of the multitude of sources from which Big Data can be derived, which can be public or private, controlled or non-controlled. Value is sometimes explicitly mentioned to draw focus to the whole purpose of the Big Data exercise. Proponents point out that companies are, for example, turning to Waste Data, previously the by-catch of other processes and unutilised, and are discovering its value in other applications. In fact, (Accenture, 2013) dedicated a report to how firms can monetise this unused Big Data. Variability has been mentioned by some (SAS, 2012) and concerns the variation in when and how much data is generated, which can for example be seasonal or event-triggered. This property has an impact on the capacity of systems and platforms for dealing with peak loads.

The fact is that Big Data spans multiple of the discussed dimensions at once, because it combines different data sources and sizes, structured and unstructured data, and real-time and historical data. This compounds the organisational challenge when dealing with Big Data in the analytics phase. This study will analyse why in the next section.


ETL PROCESS: PREPARING RAW DATA FOR INQUIRY In the previous section we described the unique characteristics Big Data possesses (the ‘V’s). In the following section this study details how these dimensions can impact its handling when preparing data for inquiry, before applying analytics. For the purpose of this study the author will focus on those activities which feature most prominently in this phase.

Before proceeding it is helpful to clarify what is understood by the term processing. This study borrows from the definition used in art. 2b of the EU Data Protection Directive (European Parlement and the Council of the European Union, 1995). When the scope is broadened to also include data other than personal data, this definition reads ‘any operation or set of operations which is performed upon data, whether or not by automatic means, such as collection, recording, organization, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure or destruction’. Thus processing can be considered to cover a broad range of read and write actions, either manual or automated. In this section write actions are particularly important because these carry the risk of overwriting, amending or deleting source data when access is not properly managed. This study assumes that processing performed in the analytics phase takes place within the computing environment of a data scientist’s sandbox and does not involve write operations on source data. Of course, access authorisation must be adequately configured to ensure this.
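The principle that analytics code should read from source data but write only to its own working area can be illustrated with a small application-level guard. The sketch below is a hypothetical Python helper with invented paths; it complements, but does not replace, properly configured access authorisation on the platform itself.

```python
from pathlib import Path

SOURCE_ROOT = Path("/data/source")     # raw, read-only source data (hypothetical path)
SANDBOX_ROOT = Path("/data/sandbox")   # analyst working area (hypothetical path)

def safe_write(path, content):
    """Refuse write operations outside the analyst's sandbox, so source data
    cannot be overwritten, amended or deleted from analytics code."""
    target = Path(path).resolve()
    if SANDBOX_ROOT.resolve() not in target.parents:
        raise PermissionError(f"write outside sandbox refused: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)

# reading from the source area is allowed; writing to it is not:
# safe_write(SOURCE_ROOT / "payments.csv", "...")      -> raises PermissionError
# safe_write(SANDBOX_ROOT / "run1/result.csv", "...")  -> succeeds
```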

Capture and storage

Without its recording or capture there would be no Big Data. Much of it is generated by systems without us being aware. Most people take it for granted that a computer, application or website performs logging or recording of system events and user actions. Oftentimes it is forgotten that it was a conscious decision on the part of the application developer as to what should be logged. This is done primarily in order to fulfil the requirements of the particular software in question. For example, payments software will need to maintain an audit trail of recipients, amounts and authorisers. Websites need to implement security logging for intrusion detection purposes.

It is at this first step that the Big Data process begins, and it is often overlooked. This is because Big Data implementations often build on currently unused sets of historical data, which were probably captured for other, often more limited, purposes (e.g., as in the applied business case, financial transactions held for regulatory or tax purposes). This can be seen as one of the contributing causes of the data quality challenges which are often encountered with internally generated Big Data sets. Historical data sets are often incomplete due to patchy maintenance, faulty archival, or simply because they are not tailored to the purpose of later Big Data questions. While experience shows that these issues cannot be avoided completely, it is helpful to make a conscious data capture design decision when developing new software. Acknowledging this fact does not make it easier, however. According to a (KPMG, 2014) survey, “the biggest challenge to executing a D&A strategy is identifying what data to collect”.

In conclusion, to mitigate the risk that data is unavailable due to lack of measurement, a conscious and documented decision should be made in the design phase to establish which data should be measured and/or collected.

In addition, more often than not, Big Data leverages freely or cheaply available external data sources. This includes public data e.g. data generated by internet users or statistics made available by governments. Both sources might contain errors, inconsistencies or worse, biases.

The IT profession has always relied on identification and authentication to assure the integrity of data. By being able to trace inputs or actions back to individual users, it becomes possible to hold them accountable. This is an important principle in preventing fraud. When data is derived from social media, this approach is not feasible.


While social media platforms such as Facebook require users to use their real-life identity to identify themselves, there is no way of enforcing this on a massive scale. Others, such as Twitter, allow anonymous usage via a pseudonym. Even with authentication and authorisation in place, it is difficult to check the ‘truthfulness’ of the messages posted. When data from an external source has been captured by manual input, as is the case with social media text messages, it is difficult to ascertain its veracity. This data is often free format as well, which comes with its own challenges. More often than not, no input controls are in place, and certainly not those input controls required to increase the quality of later data analysis.

In conclusion, when using external data such as that generated by social media, the integrity of the source must be assessed through other means, because traditional measures, e.g. user identification and authentication controls, which ensure the integrity of data, cannot always be relied upon.

While on the topic of interpreting social media messages, it has been demonstrated that both people and machines do not perform well at detecting jokes or sarcasm (R., S., & N., 2011). Facebook is also used for showing off an individual’s social status, and messages might as a result contain an optimistic bias. Further to this, the data scientist has no binding contracts with the individual contributors to his data set which stipulate that they must provide honest and true data. Worse even, users of social media platforms are often not aware that they are contributing to trend analysis research or similar ventures. We will for now ignore the fact that data protection law explicitly imposes the requirements of purpose limitation and consent on the processing of personal data (European Parlement and the Council of the European Union, 1995). Social media companies often circumvent this requirement by requesting blanket consent from their users. It goes beyond the scope of this paper to evaluate whether this tactic will withstand the scrutiny of legislators and the future legal system as to whether this practice constitutes obtaining informed consent.

Another factor contributing to the problem of bias in a social data set is that sovereign intelligence agencies (Osborne, 2014) and political lobbies have become digitally literate, and there have been signs that grand-scale manipulation of social media is occurring (Benham, Edwards, & al., 2012) (Metaxas & Mustafaraj, 2012). Social media companies themselves do not possess a completely clean ethical sheet either, as a recent Facebook experiment on mood manipulation has shown, whereby users were primed to post either negative or positive messages by having been exposed to only one of the two kinds the week before (The Atlantic, 2014). The psychological experiment actually worked as intended.

Another factor contributing to the challenge of establishing Veracity is the impossibility of verifying the veracity of each record individually, due to the huge number of messages involved. This is a challenge for Big Data implementations which rely on data generated by social media. For example, Twitter is widely used for sentiment analysis. While the insights gained can be applied for marketing purposes, they are also used by the investment industry to gauge investors’ moods, in order to predict stock price movements. But do these firms check whether this data is free from misleading messages produced by ‘bots’, which might be operated by malicious investors in an attempt to perform market manipulation? When dealing with publicly available data which is continuously updated, amended or deleted, this is one issue in reading the data stream. With the help of machine learning the vetting process for large numbers of records might possibly be automated, but at the moment this unfortunately comes with a large margin for error, as machines are still inaccurate at sensing malicious data from the web (Muthukrishnan S., 2010). Using information generated by people on the internet in Big Data applications therefore means having to deal with uncertain data.

In conclusion, when using external data such as that generated through social media or statistics published by governments, some form of profiling of the source data, amongst other measures, is required in order to vet its usefulness as a source of truthful information. This includes vetting the trustworthiness of the data and its source. Due to the varying causes of untruthfulness or bias in data, it is not easy to prescribe specific measures to prevent this risk. An impact assessment might provide a means to further account for uncertainties in the veracity of the data in the analytics phase.
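As a minimal illustration of such profiling, the sketch below computes completeness, distinctness and value ranges for an invented extract of an acquired data set, using the pandas library. The column names and records are hypothetical; actual profiling would be tailored to the source being vetted.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Basic profile of an external source: completeness, uniqueness and ranges,
    as a first step in vetting whether the data can be relied upon."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().round(3) * 100,
        "distinct": df.nunique(),
        "min": df.min(numeric_only=True),   # only meaningful for numeric columns
        "max": df.max(numeric_only=True),
    })

# illustrative extract of an acquired data set (note the negative amount,
# the missing account and the duplicated row - all typical findings)
records = pd.DataFrame({
    "account_id": ["A1", "A2", "A2", None],
    "amount": [120.0, -5.0, -5.0, 75.5],
    "country": ["NL", "NL", "NL", "BE"],
})
print(profile(records))
print("duplicate rows:", records.duplicated().sum())
```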

Coming back to the design decision on what to capture: when considering application development in the Big Data age, the general consensus is to “record everything”. “The current assumption is that all data can be captured, moved and processed, and pervades all of the computing, communication, data acquisition and analysis we have seen the past half century” (Muthukrishnan S., 2010). This is because by limiting the amount of information recorded, one also limits the possible variables available for future analysis. On the other hand, by recording everything it might also become more difficult to separate noise from signal. This has led (Muthukrishnan S., 2010) to suggest that we first need to figure out what minimum amount of data is required.

When capture is not event-driven, and especially when it is automated, one of the crucial dimensions of capturing data is the frequency or sample rate. A weather station will be programmed to measure atmospheric conditions at set intervals, e.g. every 5 minutes, or at noon each day, etc. One may call this the granularity of data capture. A closely related dimension is accuracy. Again taking the example of a weather station, the temperature might be recorded in degrees Celsius with an accuracy of one decimal. This might be useful for reporting the weather on the news, but might not fit the requirements of a sophisticated climate model used to predict the path of a hurricane. And when trying to figure out the peak time of day at which customers draw cash from cash machines, it does not help at all when only the date of the transaction has been recorded and not its time of day.

In conclusion, in order to ensure that data is available for analysis, one must first consider the appropriate measurement, that is, the minimum data set required to come to meaningful answers. A related risk, that of too much information, must also be accounted for. At an early stage, data scientists must consider in what way they will filter noise from signal, while also ensuring that the right granularity is captured.
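The cash machine example above can be made concrete: the question “at which hour of the day do withdrawals peak?” is only answerable if time of day was captured. The sketch below uses pandas with an invented set of withdrawal records to show the dependency on capture granularity.

```python
import pandas as pd

# illustrative ATM withdrawals with full timestamps (the required granularity)
withdrawals = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2014-11-01 09:12", "2014-11-01 12:45", "2014-11-01 12:59",
        "2014-11-02 12:30", "2014-11-02 17:05", "2014-11-03 12:10",
    ]),
    "amount": [50, 100, 20, 70, 150, 60],
})

# the peak hour of day is only answerable because time of day was captured
by_hour = withdrawals.groupby(withdrawals["timestamp"].dt.hour).size()
print("busiest hour:", by_hour.idxmax())   # -> 12 in this invented sample

# had capture been limited to the date alone, the same question has no answer
dates_only = withdrawals["timestamp"].dt.date
```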

There is also a technical challenge in the capture of social media data and the like. When connecting to an external data source, such as Twitter, one must consider that there are Application Programming Interface (API) limitations when using public data. Interface monitoring is therefore also important in Big Data situations making use of live data. Next to this, the usual technical aspects of the interface must be controlled for. This assertion is also made by (Gudipati, Rao, Mohan, & Gajja, 2013), who show that in a typical Big Data architecture setup there are multiple interfaces from source data (e.g. web logs, social network sites, transactional data, etc.) into the Hadoop Distributed File System (HDFS) as data is being loaded for further processing. The authors are proponents of functional data testing, consisting of validation before Hadoop processing, validation of the MapReduce data output and validation of the extracted data before loading into the data warehouse.

Another important factor is that paying customers often receive better access. Twitter has a commercial incentive to provide better access on a for-profit basis to third parties. Recently it has been shown that high-frequency traders enjoy similar privileges with exchanges; in this case the privilege is better timeliness of information in comparison to other traders, which some have termed “computerised front running” (Brown, 2010). This is similar to the wilful obfuscation of the GPS signal for civilian use, which used to be in place to protect US military interests and led to less accurate location measurements being available for commercial use than for the military. The same principle applies to social media data or, in fact, any other external data source. Organisations must consider any deliberately imposed limitations on the quality of, granularity of or connection to the source data.

In conclusion, to ensure the completeness of the transfer of source data into the Big Data platform, one must ensure that the technical aspects of the interface with the source data are accounted for. What is required therefore are measures such as interface mapping, monitoring and input and output validation in the ETL processing phase of Big Data & Analytics. One must also be aware of any deliberate limitations in the data provided, e.g. based on the use of premium subscriptions or not.
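A simple form of the input and output validation mentioned above is reconciling what was extracted from the source with what actually landed on the platform before processing starts. The sketch below compares record counts and a content fingerprint for two invented record lists; it illustrates the control objective, not any specific testing tool.

```python
import hashlib

def _fingerprint(records):
    """Order-independent content hash of a list of text records."""
    return hashlib.sha256("\n".join(sorted(records)).encode("utf-8")).hexdigest()

def reconcile(source_records, loaded_records):
    """Compare record counts and content fingerprints between the source
    extract and what arrived on the platform, before processing starts."""
    return {
        "source_count": len(source_records),
        "loaded_count": len(loaded_records),
        "count_match": len(source_records) == len(loaded_records),
        "content_match": _fingerprint(source_records) == _fingerprint(loaded_records),
    }

# illustrative check: one record was dropped during transfer
extract = ["tx1;100;EUR", "tx2;250;EUR", "tx3;75;EUR"]
landed  = ["tx1;100;EUR", "tx3;75;EUR"]
print(reconcile(extract, landed))
```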

All of the abovementioned factors impact the downstream steps of the Big Data & Analytics process, whereby the data is cleansed or enriched to alleviate any data quality issues. The Big Data philosophy is to never throw data away; usually, information is only added to gain more insight, exemplified by the process step of data enrichment. Superfluous information is not thrown away, but simply not selected for processing in the analytics phase. How does one then deal with the growing volume of data when the capacity of hardware is finite at the time of its installation? The IT industry has traditionally relied on the scaling of individual components such as hard disks and processors. But at a certain point simply adding new storage capacity is insufficient, as one reaches the limitations of the implemented technology, which leads to the need for replacement with expensive new hardware. It is no surprise that this has led to the use of cheap, off-the-shelf components such as those used for distributed computing. With the advent of Cloud computing, further specialisation has occurred, with suppliers focussing specifically on information storage as a service, whereby the geographical location of the storage is rendered irrelevant, at least from an end-user point of view.

But how is this geographically dispersed data coordinated? As mentioned earlier, many of the companies first faced with these challenges were those involved in indexing the web and hosting the content of their users. Due to their requirements for vast amounts of storage capacity, these firms have stood at the basis of new technologies which have come into being to deal with large unstructured data sets. Now other companies are becoming aware and want to harvest the potential of analysing vast data sources, and solutions first developed for taming the web are hitting the mainstream. The most prominent example is Hadoop, which is based on the MapReduce function developed by Google4 (Pavlo, Paulson, & al., 2009). It is a well-known example of how distributed computing is implemented to process Big Data, and it allows firms to offload the storage and querying of massive data sets and to perform analytics in parallel (Davenport, Barth, & Bean, 2012). Alternatives, such as parallel DBMSs (while also allowing for parallel processing), have not proven as popular, perhaps due to the burden of the imposed data schema which those solutions require. As a result, Hadoop is the current platform of choice and has many implementations, of which the most popular is the open source Apache Hadoop. Firms like IBM, Yahoo and Cloudera have put their weight behind it and try to prevent fragmentation of the community, as there are many advantages to a common language or platform being shared between companies. One downside to Hadoop's popularity is that organisations seize on Hadoop for every analytics problem they encounter, which might lead to unnecessarily complicated solutions. It is therefore important to consider whether simpler solutions are available to process large record sets which office automation tools cannot deal with. As one Big Data programmer (Stucchio, 2013) put it: "Too big for Excel is not Big Data", when he used Pandas5 for a problem too big for Excel, but too small for Hadoop.
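
To illustrate the Map and Reduce steps referred to here (and described in footnote 4), the following self-contained word-count sketch imitates the style of a Hadoop Streaming job in Python. A real deployment would split the two functions into separate scripts and let the framework handle sorting and distribution across nodes.

```python
import sys
from itertools import groupby

def map_phase(lines):
    """Map: emit (word, 1) key/value pairs for every word in the input records."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts per key; the sort stands in for the framework's shuffle."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reduce_phase(map_phase(sys.stdin)):
        print("%s\t%d" % (word, total))
```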

In conclusion, what is required to ensure that the goals of Big Data & Analytics are met is the choice of the right Big Data architecture. Currently the golden standard is the Hadoop platform, but the chosen solution must fit the research problem at hand. To prevent unnecessary complexity and/or costs, the solution must fit the requirements of its intended use.

While on the topic of Hadoop, it is worth mentioning that HDFS does not comply with the established ACID properties that auditors expect of database transactions. This especially applies to the capture of data in transit. Because of this, Hadoop is often used in place of an RDBMS, as the latter is not well suited for this application: "Only static data are easily caught with the ER model" (Lumineau, Laforest, Grippay, & Petit, 2012), because "data from streams have a particular lifecycle: they are transient data that should be caught by the running queries when they arrive. Similarly, data from services appear only after service calls, that happen at query time. Such characteristics cannot be presented in existing data modelling conceptual models like the Entity-Relationship model6 or its extensions". This fact could have big implications on the auditability and integrity of Big Data sets in HDFS clusters.

4 One of the attractive qualities of the MapReduce programming model is its simplicity: an MR program consists only of two functions, called Map and Reduce, that are written by a user to process key/value data pairs. The input data set is stored in a collection of partitions in a distributed file system deployed on each node in the cluster. The program is then injected into a distributed processing framework and executed in a manner to be described. The Map function reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of intermediate records in the form of new key/value pairs. As the Map function produces these output records, a "split" function partitions the records into R disjoint buckets by applying a function to the key of each output record. (Pavlo, Paulson, & al., 2009)
5 Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
6 In software engineering, an entity-relationship model (ER model) is a data model for describing the data or information aspects of a business domain or its process requirements, in an abstract way that lends itself to ultimately being implemented in a database such as a relational database. The main components of ER models are entities (things) and the relationships that can exist among them.

For example, if such a data set is continuously updated with a live data stream, and backup technology is not available to provide a complete audit trail, the object effectively becomes a moving target and thus very difficult to audit. How do a firm's IT auditors establish the completeness or veracity of the source data? How can independent experts verify whether assumptions made about the data sets were correct? In order to reproduce the same results, one needs access to the exact same data set which was used earlier. This can be achieved through the source data preservation strategy proposed by (Rothenberg & Bikson, 1999). The principle of reasonable assurance, whereby one does not have to verify the data but only the operations on it, does not apply in the case of Big Data because of the wild variations which small, unknown errors in the data set produce in the analytics phase. It is important that these errors can be identified and dealt with.

In conclusion, to mitigate the risk that source data is not available, either for auditing or forensic purposes, or for revisiting by data scientists when they find themselves in the problem solution phase, what is required is a strategy for source data preservation to ensure that backup copies are made and remain available at a later time.
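
A minimal sketch of such a source data preservation step is given below: a timestamped copy of the source file plus a cryptographic fingerprint, so that auditors or data scientists can later verify they are working with exactly the data set used originally. Paths and naming are assumptions for illustration.

```python
import hashlib
import os
import shutil
import time

def preserve_source(path, archive_dir="archive"):
    """Freeze a copy of the source data set and record its fingerprint."""
    os.makedirs(archive_dir, exist_ok=True)
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    stamp = time.strftime("%Y%m%dT%H%M%S")
    copy_path = os.path.join(archive_dir, "%s_%s" % (stamp, os.path.basename(path)))
    shutil.copy2(path, copy_path)            # the frozen copy used for later audits
    return {"source": path, "copy": copy_path, "sha256": digest, "timestamp": stamp}
```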

Conversion and transformation

We have elaborated on the capture of data in previous sections. Wherever unstructured data is generated or combined, it is highly likely that some form of conversion takes place downstream. It requires classification, structuring and labelling, often through machine learning (Muthukrishnan S., 2010). With traditional databases and data warehouses this process is known as extract, transform and load (ETL), a term which this study has also adopted due to the public's familiarity with the nature of the process. In short, extraction is performed from source(s) and source systems. Next, operations (which may include sorting, aggregation, joining or calculations) are performed on the data so that it is transformed to suit the target system and its requirements. Loading usually consists of uploading the processed data to a DWH. These same operations also apply to Big Data.

Previously omitted was the format in which capture and storage occurs. This might not seem exceptionally problematic due to the many, oftentimes open source, conversion software applications freely available on the internet. But historical data might sometimes be stored in archaic, currently illegible formats, which puts at risk the preservation of the information within. Migration of this data to newer formats is also error-prone because conversion software is not always 100% accurate (Lawrence, Kehoe, Rieger, Walters, & Kenney, 2000). This can become a problem for Big Data research which relies on data generated in the past. When conversion is necessary and not supervised or checked, errors may inadvertently be introduced into the source data, reducing data quality and impacting the subsequent analysis and outcome. Incorrect conversion is also a factor after execution of the MapReduce function. (Gudipati, Rao, Mohan, & Gajja, 2013) illustrate the necessity of validating that transformation rules are correct, performing reconciliations of target table data with HDFS data to ensure they match, and validating the integrity of data in the target DWH system, amongst other measures.

In conclusion, regarding the technical aspects of conversion and transformation, there is a risk that these are performed incorrectly. In order to mitigate this risk, conversion controls must be applied and transformation rules must be validated.
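
The reconciliation controls referred to above can be as simple as comparing record counts and a key-column checksum between the data in HDFS and the target DWH table, as in the hedged sketch below; how the counts are obtained in practice depends entirely on the platform in use.

```python
def reconcile(hdfs_count, dwh_count, hdfs_key_sum=None, dwh_key_sum=None):
    """Compare record counts and an optional key-column checksum between HDFS and the DWH."""
    issues = []
    if hdfs_count != dwh_count:
        issues.append("Record counts differ: HDFS=%d, DWH=%d" % (hdfs_count, dwh_count))
    if hdfs_key_sum is not None and hdfs_key_sum != dwh_key_sum:
        issues.append("Key-column checksums differ; review the transformation rules")
    return issues or ["Reconciliation passed"]

print(reconcile(1000000, 999998))
```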

In the analysis of Big Data sets a different type of conversion often plays a big role, as not only changes in file format can have an impact. The generation of metadata based on source data is also problematic. For example, to enable text-based search by the user, binary data is often converted to text (XML) or, if that proves impossible, accompanied by human-generated metadata to facilitate contextual analysis. Think for example of the pictures smartphones take, which are saved together with GPS data and the time and date taken. Also, social network service providers often request users to further identify individuals in photos uploaded to their platform by having the user "tag" friends, further adding to this metadata. Responses from friends to the uploaded picture are also time-stamped, recorded and displayed for others, by which even further context might be added to the uploaded picture in the form of semantic data (e.g. Mark asks "That's a nice picture John, is that your girlfriend next to you?").

This simple example shows that semantic data can be parsed for information on relationships.

Due to the huge volumes involved, this type of processing is nearly always performed by computers, based on a logical rule set, sometimes referred to as Artificial Intelligence (A.I.). This is often a challenging and multi-iterative process. (Ferguson, 2012) gives an example of how social media sentiment might be processed in multiple passes: first messages are structured, then they are sorted for positive or negative sentiment, and thirdly advanced analytic algorithms are used to correlate sentiment with contacts, followers and relationships. Thinking of the previously explained challenge of "Veracity" which Big Data poses, this begs the question: how can the trustworthiness of this parsed data be assured? Semantic Big Data might be roughly correct, but is it precise enough for all applications, especially when decisions are automated or impact the lives of individuals? Going back to the example: if John does not reply "No Mark, that's my sister" to his friend's question, A.I./machine learning might have assumed that it was John's girlfriend in the picture.

In conclusion: the risk exists that the conversion of (social) media information into information which can be processed by machines does not lead to valid answers. What is therefore required is the profiling of the source data set to understand its origin and characteristics, i.e. how it came about, and what limitations that may pose on its scientific analysis.
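
A minimal profiling sketch of the kind proposed here is shown below (assuming pandas and an illustrative, hypothetical input file): before analysing a parsed social media data set, inspect missing values, duplicates and basic distributions to understand how the data came about and what its limitations are.

```python
import pandas as pd

# Hypothetical file with one JSON record per line, as produced by an earlier parsing step.
df = pd.read_json("parsed_posts.json", lines=True)

profile = {
    "rows": len(df),
    "missing_per_column": df.isnull().sum().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
}
print(profile)
print(df.describe(include="all"))   # distributions, uniqueness and basic statistics
```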

Cleansing and enrichment

Closely related to data quality management is the process of cleansing and enrichment of data before analysis. Every once in a while researchers are accused, and have also been found guilty, of massaging their data to fit pre-determined results. This shows the dangers associated with uncontrolled access to the storage areas where source data is housed, but also with the lack of an audit trail when adding data. What distinguishes this phase from later analytics is that write operations take place on source data, while querying and analysis are usually relegated to read-only operations. Being able to write to the golden data source of subsequent analytics is a risk which must be managed.

Further to this, the Big Data approach is to never throw data away. In most applications, iterative enrichment of data sets occurs until these produce the desired results. In other words: first gather data, then search for useful links; rinse, wash and repeat until the intended outcome is reached. This means that more and more data is added to Hadoop clusters, with the goal of increasing the quality of the search results or the appearance of new correlations. All this data must go through the next steps, and through the same proposed quality measures, as the first source data set.

In conclusion: the cleansing and enrichment phase may introduce the risk of affecting the integrity of the original source data. What is required to mitigate this risk is change management and logging of actions performed on source data. To ensure this, logical access control on the Hadoop platform is also required, as well as the preservation of the original source data, as proposed in the conversion and transformation phase.
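
The change logging proposed above could, in its simplest form, look like the sketch below: every write operation performed on the source data during cleansing or enrichment is appended to an audit trail, together with the user and a fingerprint of the data afterwards. The file layout and field names are assumptions for illustration.

```python
import getpass
import hashlib
import json
import time

def log_enrichment(dataset_path, action, audit_log="enrichment_audit.log"):
    """Append one audit trail entry for a write operation on the source data."""
    with open(dataset_path, "rb") as fh:
        fingerprint = hashlib.sha256(fh.read()).hexdigest()
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "user": getpass.getuser(),
        "dataset": dataset_path,
        "action": action,                  # e.g. "added geo-coding column"
        "sha256_after": fingerprint,
    }
    with open(audit_log, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```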

TECHNOLOGY: ETL PROCESS ENABLERS

Infrastructure

Big Data exacts high performance requirements from enterprise IT infrastructure (Rabl & Sadoghi, 2012). This is because Big Data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. Exceptional technologies often go along with exceptional investments, the latter typically constrained by annual budgets and the allotting of available resources in an organisation. Companies combine these requirements and constraints in relation to obtaining analytical Management Information (MI) by offloading the strain on main systems to data warehouses (DWH), which are powered by arrays of servers consisting of multiple processors and multiple storage media. Sometimes virtualisation plays a role, as in the case of Infrastructure as a Service (IaaS), including Cloud storage (e.g. Amazon S3), or when harnessing processing power in the Cloud (for example with Google's Compute Engine).

A survey on Big Data better practices by (TDWI, 2011), which was sponsored by IBM, cited, as one of the main challenges that businesses face in implementing Big Data, the lack of scalability, performance or data analytic capabilities within the company's existing data warehouse solution. The mismatch with existing DWH capabilities will oblige organisations to look for cost-effective new solutions.

(Accenture, 2012) argues that “All in all, Big Data will require an infrastructure that can store, move and combine data with greater speed and agility—and traditional IT infrastructures are simply not engineered to meet this need. It is technically possible to translate unstructured data into structured form, and then use relational database management systems to manage it, of course. However, that translation process takes a great deal of time, driving up costs and delaying time to results. In general, the problem is not so much technological as financial; it is simply not economically feasible to use the traditional infrastructure to manage Big Data.”

That this challenge was still current in 2014 is demonstrated by a survey performed at the request of (KPMG, 2014) amongst 144 CFOs and CIOs, which showed that "85% of companies are struggling with implementing the solutions to accurately analyse and interpret their existing data".

In conclusion: the risk exists that an organisation's current IT infrastructure is not capable of dealing with Big Data & Analytics. To mitigate this risk, what is required is some form of capacity management to ensure adequate IT infrastructure assets and processing resources are made available.

Architecture

Traditionally, DWHs have relied on DBMSs based on relational databases, with SQL one of the most used languages. The reason for this success is that "the design and evolution of a comprehensive EDW schema serves as the rallying point for disciplined data integration within a large enterprise, rationalizing the outputs and representations of all business processes" (Cohen, B., Dunlap, Hellerstein, & Welton, 2009). However, DBMSs have limitations in processing unstructured or large volumes of data. What has enabled the Big Data revolution is the recent development of MapReduce functionality. The key to the success of MapReduce is the ability to a) map large swaths of data and b) reduce this complex problem to simpler ones, which are then distributed to multiple nodes for solving. Combined with the fact that this delegation can be cascaded, whereby nodes perform a further split and delegation of the problem to be solved, this results in the successful harnessing of the power of parallel processing. MapReduce and its derived implementations can be found at the heart of most Big Data solutions, such as the open source Hadoop, maintained by the Apache Foundation, and its commercial counterparts such as MarkLogic, XQuery etc. (IBM, 2013) structures the technical solutions as follows:

• Stream processing software
• Analytical RDBMSs
• Hadoop solutions (could be on-premise or in the cloud)
• NoSQL DBMSs, e.g. graph DBMSs

To illustrate what one of these solutions looks like, a schematic depiction of a typical Big Data architecture with Hadoop is given in the following figure.

Figure 5: Typical Big Data architecture with Hadoop (sourced from Infosys)

Next to the ability to scale, the other main difference with traditional DBMS systems is the commitment of MapReduce to the "schema later" or even "schema never" paradigm (Pavlo, Paulson, & al., 2009). SQL relies on a database schema which imposes integrity constraints, e.g. through tables. MapReduce has no such constraints; in fact: 'Most NoSQL databases have an important property. Namely, they are commonly schema-free. Indeed, the biggest advantage of schema-free databases is that it enables applications to quickly modify the structure of data and does not need to rewrite tables.' (Chen & Zhang, 2014) The authors name other advantages, such as the flexibility of MapReduce in handling hardware failures: in comparison, an RDBMS assumes that hardware failure is rare, while MapReduce continually assesses whether a response was received from nodes and, if required, redistributes the unsolved problem.

While MapReduce is enjoying increased popularity, many researchers emphasise the lasting importance of SQL, as it is still often married with MapReduce-type solutions and can exist peacefully alongside them. (Cohen, B., Dunlap, Hellerstein, & Welton, 2009) also argue that companies should move "toward more fluid integration or consolidation of traditionally diverse tools including traditional relational databases, column stores, ETL tools, and distributed file systems" so that data sources are combined in one data flow. An example is given by (Chen S., 2010), whereby a SQL-like data warehouse (Cheetah) is built on top of a MapReduce solution.

Performance is an issue when attempting to solve complex problems. In this, the power of distributed computing has come to the rescue of Big Data. Still, within the wide range of possible strategies, informed choices must be made. Researchers (Rabl & Sadoghi, 2012) compared the performance of Apache Cassandra, Apache HBase, Project Voldemort, Redis, VoltDB and a MySQL Cluster. They showed that these systems behave differently in throughput and when scaled. This demonstrates that choosing the right technological solution, one that is 'fit for purpose' for the goal it is applied to, critically influences the execution of a company's Big Data project.

In conclusion: there is the risk that the implemented architecture is not capable of dealing with the Big Data & Analytics problem at hand. What is therefore required, prior to any project, is the choice of a suitable Big Data IT architecture which achieves the required performance levels.

ANALYTICS: EXTRACTING INFORMATION FROM DATA

As mentioned in the introduction, this study uses the term analytics to refer to both the tools and techniques used to gain information and actionable insight from Big Data. Analytics follows the ETL processing phase and leads to processed data. Analytics is closely associated with the mathematical profession, more specifically the field of statistics. Processed data needs to be analysed. How is this done?

Statistical analysis

(McKinsey Global Institute, 2011) lists a number of suitable technologies which can be applied in the Big Data process: "A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, predictive modelling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualisation..." This list shows that statistics features prominently at the heart of Big Data analysis.

It goes beyond the scope of this study to discuss all of the above in detail. It is useful to select one as an example, such as the use of regression analysis.

Regression analysis is a familiar technique for anyone working with statistics. It is used to determine relationships between variables and as such has overlap with machine learning. Many techniques for carrying out regression analysis are available, which can broadly be distinguished in two categories. Methods such as linear regression and ordinary least squares regression are parametric, with the regression function defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression techniques are a form of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. Nonparametric regression requires larger sample sizes than regression based on parametric models, because the data must supply the model structure as well as the model estimates.
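
The difference between the two categories can be illustrated with a minimal sketch (numpy only, synthetic data): a parametric straight-line fit with two estimated parameters versus a nonparametric k-nearest-neighbour estimate whose shape is driven entirely by the data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # nonlinear ground truth

# Parametric: ordinary least squares on a straight line (two parameters).
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = slope * x + intercept

# Nonparametric: average of the k nearest neighbours, no predetermined functional form.
def knn_predict(x_train, y_train, x_new, k=15):
    order = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[order].mean()

knn_pred = np.array([knn_predict(x, y, xi) for xi in x])
print("linear MSE:", np.mean((y - linear_pred) ** 2))
print("knn    MSE:", np.mean((y - knn_pred) ** 2))
```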

The choice for one of these methods depends on the nature of the problem and the data. The subsequent performance of the chosen method depends on how the data used in the model has been generated. Because the data-generating process is not exactly known, assumptions have to be made about it, and these cannot always be tested. While regression models used for prediction are often still useful even when the assumptions are partially incorrect, they may not perform optimally. More importantly, in many applications, especially with small effects or questions of causality based on observational data, there is the risk that regression methods give misleading results.

It is important to note that Big Data attempts to apply the scientific method. The use of highly educated researchers, statistics and the quantitative method all imply that the endeavour for the truth is built on a solid case. But one of the main differences between the commercial application of Big Data and the scientific community is that the latter usually publishes its results and sources for the purpose of peer review. In the same way that open source software code is viewable by millions of pairs of eyes, scientific studies can, in principle, be challenged and mistakes corrected; theories are open to falsification. No such strong mechanism currently exists for vetting Big Data applications by private organisations. The sizeable investments Big Data requires create a strong incentive to limit the number of people "in the know", in order to prevent intellectual property from disseminating before R&D costs can be recovered. To at least attempt to replicate the quality rigours of the scientific method, and not only its process, commercial organisations should ensure that independent, knowledgeable eyes have a look at their Big Data product before they start relying too much on it. What often occurs within companies is that external consultants or auditors are hired to bring independent assessments (after signing an NDA, of course).

In conclusion, it is necessary to challenge the scientific research assumptions made during Big Data analytics, such as those used in the case of regression analysis, in order to prevent misleading results. What is required is the validation of the robustness of the applied scientific method and its assumptions.

This study refers to the chapter on Quality Assurance, where various methods used in professional fields which face similar problems are discussed and evaluated for applicability, e.g. Model Validation.

Programming algorithms

This study views algorithms as distinct from queries, though the terms are sometimes used interchangeably by the general public. Of course, algorithms can be written to perform a query, but they are more effectively used to identify correlations, while queries are limited to retrieving lists of information. This study views algorithms as the operationalisation of statistical modelling with the help of computer code.

It is useful to distinguish between two types of algorithms: algorithms operating on data at rest are different from algorithms operating on live data. The former are similar to queries in that they are usually operated as a batch job. In contrast, algorithms operating on data in movement can be configured to be triggered by - and act upon - real-time events. High Frequency Trading is one commercial domain where algorithms are very popular, because of their potential to realise millions of small profits in minuscule discrete time periods. The discerning feature of these automated algorithms is that they are self-operating. Once brought into production they require no human intervention to fulfil their job, and will continue to do as instructed, unless their developer has incorporated built-in limits (e.g. time, financial) or thresholds which trigger a circuit breaker, or unless they are manually stopped. Researchers in the field of A.I. have pointed out the dangers of self-operating programs. A famous thought experiment is that of the paperclip maximiser (Bostrom, 2003): an A.I. that in the end uses up all available resources in the universe to create paperclips, thereby destroying it. This shows how A.I., even when designed competently and without malice, could ultimately pose a grave danger.

As with any software development, there is always the risk of incorrect code entering production if proper testing procedures are not performed. Testing code for bugs has historically been part of the software development process. This has been embraced by ITIL and within the General IT Controls, whereby the testing process is sometimes extended to include technical acceptance testing, functional acceptance testing and user acceptance testing, and requires business approval before changes may enter production.

To control software development, and algorithm development specifically, an Integrated Development Environment (IDE), such as the Eclipse Platform, can help developers with source code editing, automatic code completion and debugging. Debugging is especially important as it can prevent algorithm execution errors due to programming bugs or invalid data. Advanced debuggers can even execute programs step by step for the purpose of troubleshooting (Aggarwal & Kumar, 2003).

In conclusion, it has been established that Big Data algorithms are developed through computer programming. Because of this, there is the risk of the inadvertent introduction of bugs into the code. This is especially the case for automated algorithms which operate in a live environment. What is therefore required to mitigate this risk is managed software development of queries and algorithms, and testing before promotion to production in case algorithms will operate automatically in a live environment.
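
As a small illustration of testing before promotion to production, the sketch below shows a unit test for a hypothetical built-in limit (a circuit breaker of the kind mentioned above); in practice such tests would run automatically in the development pipeline before any release.

```python
def should_halt_trading(cumulative_loss, loss_limit=10000):
    """Built-in limit: stop the algorithm once losses reach the agreed threshold."""
    return cumulative_loss >= loss_limit

def test_circuit_breaker():
    assert not should_halt_trading(500)      # well within the limit: keep running
    assert should_halt_trading(10000)        # exactly at the limit: halt
    assert should_halt_trading(25000)        # beyond the limit: halt

if __name__ == "__main__":
    test_circuit_breaker()
    print("all tests passed")
```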

Writing queries

Internet search engines have become an integral part of our daily lives. Searching with keywords, however, heavily depends on the success of the sorting algorithm which is implemented to rank search "hits". A similar search technique is used in Big Data. Take for example access to social media data. The methods available often emulate the familiar Structured Query Language (SQL) instruction format. Facebook makes available FQL for accessing its Social Graph. Google has "BigQuery", a cloud service for analysing Big Data. These solutions are sometimes categorised under the term NoSQL, which stands for "Not Only SQL", implying that while SQL-like queries can be used, the underlying technology is not limited to RDBMS principles.

When queries are formulated, just like algorithms, they depend on the underlying data continuing to exhibit the same properties and relationships. These properties are referred to as the ontology. (Kondylakis & Plexousakis, 2012) have shown that the number of invalid responses grows practically linearly with the number of changes,
and propose a way for automatic detection of invalid queries by monitoring ontology evolution, in this case of the processed data.

It is therefore necessary to validate whether queries are retrieving the intended information in the first place. When documented, SQL queries are replicable by independent third parties, and it can be verified whether the ontology produces the exact same results in subsequent runs. It is important that the original source and query formulation are also available. In case streaming data is used, a backup of the relevant streaming data might be considered. As mentioned earlier, the non-compliance of Big Data platforms with the ACID principles might also cause a problem: if the queried data is in transit, query results might vary, as NoSQL solutions favour availability over consistency.

In conclusion: there is a risk that queries do not retrieve the intended information. To mitigate this risk, a number of measures can be taken. Starting with the preservation of the original query, so that errors can be determined at a later time, what is also required is monitoring the evolution of the ontology of processed data in relation to query performance. A measure which might be applicable in live environments is the (automatic) detection of invalid queries.
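
One way to operationalise the preservation of the original query and its results is sketched below: the query text is stored together with a row count and a fingerprint of the result set, so that an independent party can re-run it later and detect whether the evolved data still produces the same answer. The query and values are illustrative assumptions.

```python
import hashlib
import json
import time

QUERY = "SELECT customer_id, COUNT(*) FROM transactions GROUP BY customer_id"

def record_query_run(query, rows):
    """Preserve a query together with a reproducibility fingerprint of its result set."""
    fingerprint = hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()
    return {
        "query": query,
        "executed_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "row_count": len(rows),
        "result_sha256": fingerprint,      # compare against this value on re-execution
    }

print(record_query_run(QUERY, [["c1", 12], ["c2", 7]]))
```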

TECHNOLOGY: ANALYTICS PROCESS ENABLERS

Computer languages and software

No data analysis can go without software dedicated to the task. Big Data algorithms and queries can be self-developed with various statistical and non-statistical programming languages. R is an example of the first, which has even led to the use of the term "pbdR"7. Python is an example of the second category and has set out to become the most popular language according to a poll on a web site for the analytics community (KD Nuggets, 2013). The same source also lists SAS, Java and Matlab as programming languages of choice for Big Data developers. But the venerable SQL database language is also still very popular: for example, on top of the Hadoop architecture, Hive can be loaded, which provides users with SQL-like functionality to analyse data within the Hadoop cluster.

When companies follow the route of developing their own solutions, it means that they must employ people who are well-versed in these programming languages and who can - ideally - also bridge the gap with the business to facilitate the latter's requirements. This challenge is also identified by (Chen S., 2010), who stated that the challenge with self-developed MapReduce implementations is that users have to write their own code to get to the data. Fortunately, Big Data has in the meanwhile reached such a maturity level that vendors are making off-the-shelf solutions available - very often web based. ParStream, MapR and Cloudera are popular product offerings, to name a few (Vance, 2013).

In conclusion: to ensure managed software development of algorithms and queries, in order to mitigate the risk of incorrect code being used or, worse, entering production, what is required is the choice of the most fit-for-purpose programming language. After coding, debugging of the algorithm code is of course required. In case the applications are not self-developed, what is required is a sourcing strategy which considers the requirements of Big Data applications when selecting Independent Software Vendors who provide off-the-shelf solutions.

Moving to the Cloud

A recent survey by (KPMG, 2013) showed that "More than one third of ... respondents currently do not believe that they have the right capabilities in-house to analyse Big Data & Analytics." Because of the sizeable investment required in technology and skills, it is no wonder that the business community is looking towards leveraging the advantages of Cloud computing in making Big Data analytics available to end-users.

7 PbdR stands for Programming Big Data with R.

What queries and algorithms have in common is that they both require a large amount of CPU power and memory bandwidth. The iterative approach of data enrichment which is often applied in Big Data analysis also means that hardware requirements are unpredictable. Scalability and flexibility become important. Both of these characteristics are exhibited by Cloud solutions (IaaS), which (Chung, 2011) refers to as elasticity.

Managers currently at the peak of their careers and employed in decision-making positions are often not endowed with the skills required of data scientists. A management revolution is necessary, but as often goes with bringing science to the masses, tooling and platforms will have to be developed, allowing lay persons to leverage the advantages of Big Data (e.g. crowdsourcing of ideas). Researchers (Zorrilla & García-Saiz, 2013) proposed a cloud-based analytical service (SaaS) and attempt to describe a software architecture which meets the need of non-expert data miners to extract useful and novel knowledge using data mining techniques, in order to obtain patterns which can be used in their business decision-making process. Its main characteristic is the use of templates to answer certain pre-defined questions. Most recently, cloud computing has been expanding to provide Business Intelligence as a Service, or as ISACA refers to it, "Information as a Service" (shortened to "InfoaaS", not to be confused with "IaaS", which denotes Cloud services for infrastructure). Basing itself on COBIT 5, ISACA has recently defined control measures as a means to gain assurance in the Cloud (ISACA, 2014), which might be useful when applied to Big Data & Analytics.

In conclusion: with the use of Big Data & Analytics becoming more commonplace, the risk exists that tooling does not match the needs of business end-users. What is required is the selection of the right tool, one which matches the ability of the intended Big Data & Analytics user. When InfoaaS is sourced from a Cloud vendor, the applicability of Cloud controls must also be considered.

DECISION MAKING: EVALUATING ANALYTIC OUTCOMES FOR INSIGHT

The 43rd president of the United States, George W. Bush, is notorious for his statement: "I am the decider and I decide what is best" (CBS News, 2006). It would only later become apparent that one of his major decisions, the declaration of war on Iraq, was based on faulty and biased information. The rest is history.

One of the responsibilities of management is the making of economic choices, that is, putting the resources at an organisation's disposal to work most effectively in order to achieve its goals. In for-profit companies the principal-agent relationship more often than not results in the explicit goal of maximising shareholder value, with all other goals coming second. It is no wonder that much of the focus of the field of business intelligence is on the financial, be it for financial forecasting, financial modelling or historical financial analysis. This has led to the development of traditional Management Information Systems (MIS), which assist operational and strategic activities (O'Brien, 1999) by combining hardware, software, data and people, and dedicating these to decision making (as opposed to other types of information systems) (Laudon & Laudon, 2010). To combat garbage-in garbage-out issues, MIS are governed by internal controls intended to prevent and detect errors in management information. Research (Chan, Peters, Richardson, & Watson, 2012) has concluded that weak internal controls related to data processing integrity have the strongest impact on forecasting accuracy. As this study views the Big Data process as a modern way of assisting in the knowledge discovery process, it is reasonable to assume that the same principles and requirements apply.

Visualisation

Visualisations are used to draw the viewer to the essential meaning of a message. In auditing, often only simple visualisations are utilised, e.g. risk ratings in the management summary of an audit report are grouped according to the stoplight system (red, amber and green), usually indicating high, medium and low risk impact of findings. In the world of Big Data, many more novel ways of representing information, previously relegated to scientific niches, have hit the mainstream: from whimsical tag clouds, which depict keyword occurrence, to stream graphs, tree charts, three-dimensional scatter plots and spectrograms. All are gaining usage in the field of business administration. Below are some examples.

From top to bottom and from left to right, starting with figure 6: the edits of a Wikipedia bot, whereby each colour corresponds to a different page8; an immersive (3-D made from 2-D) simulation of oil flow through water based on computational fluid dynamics simulations9; an analysis of photo geotags in New York City; Facebook's global social connections10; personal friend network clusters11; and internet browser market share from 2002 to 200912.

Figures 6 to 11: visualisation examples as described above.

The process leading to certain visualisations (e.g. charts depicting peak electricity usage during the day, or average house prices) can be described in terms of its outcome: plotting, charting, etc. But to gain understanding of the intermediate steps involved, it is better to describe visualisation by its core basal components (Amar, Eagan, & Stasko, 2005). Within these core components, the researchers describe straightforward activities such as filtering, sorting and determining range, but also slightly more complex activities such as characterising distribution, finding anomalies, computing values and determining correlation. As Big Data analytics is most often used for pattern analysis in order to come to predictions, it is the latter that again shows how statistics lies at the core of Big Data analytics.

Visualisation has long been used to deal with the human problem of bounded rationality, that is, the problem of making an optimal choice given the information available (Gigerenzer & Selten, 2002). Visualisation is extensively used in Big Data, e.g. as a way to enable collaboration between employees with different professional backgrounds. (Accenture, 2014) argues that "While data visualisation is not necessarily a perfect solution (as all are subjective in some way), on the explanative side its best advantage is that it provides a common language for executives, functional leads and data scientists to have a discussion about data, together". In the age of Big Data the issue of too little information appears to have become smaller. At the other extreme, oversaturation of information introduces new risks, as our brains are flooded with a multitude of confusing and conflicting signals.

8 http://archive.wired.com/science/discoveries/magazine/16-07/pb_visualizing
9 http://www.nsf.gov/news/news_images.jsp?cntn_id=125855&org=NSF
10 http://www.bbc.co.uk/news/science-environment-11989723
11 http://blog.stephenwolfram.com/2013/04/data-science-of-the-facebook-world/
12 http://www.axiis.org/examples/BrowserMarketShare.html

Research shows that information overload leads to poorer decision making, especially if the information is ambiguous, novel, complex or intense (Eppler & Mengis, 2004). The number of dimensions is also a contributory factor to information overload. As Big Data's mainstay is pattern recognition in complex data sets, the risk of information overload is present. As more and more data is used for pattern recognition, it becomes challenging to visualise multiple data points on two-dimensional surfaces such as paper or computer screens.

In conclusion, by using visualisation to help communicate Big Data & Analytics answers, the risk is introduced that the visualisation is confusing to the user, e.g. due to information overload, and can lead to poorer decision making. To mitigate this risk, what is ideally required is the correct choice of visual portrayal for the problem at hand, e.g. one which includes the least amount of complexity, reducing signals to the minimum required to make a valid decision.

Visualisation is extensively used by business users. The use of automated tooling in organisations is dominated by office software, which caters to the need of end-users to perform calculations (e.g. through spreadsheet software), document text (e.g. a word processor) and visually present results (e.g. slide show presentation software). Microsoft PowerPoint is relied on by many companies to present management information to decision makers, due to its versatility in integrating graphics from other office software. Slides are often accompanied by circles, squares and callouts, along with the graphical outputs allowed by the basic functionality of spreadsheet software. The graphics are recognisable in the form of pie charts, bar graphs and spider charts, expressed either in relative (percentages) or absolute (i.e. currency) figures, and sprinkled across slideshow presentations or within reports. What most of these visualisations have in common is that they are simple, and moreover simple to understand by laymen with a general higher education. Usually only two criteria are compared to each other by plotting them on two-dimensional axes. This limited number of dimensions is something they also have in common with traditional BI tools, which are often used for descriptive statistics (Cohen, B., Dunlap, Hellerstein, & Welton, 2009).

Recent technological developments in Big Data have made possible new ways of communicating information to the general public. This has led to the practice of visual data exploration and visual data mining (Keim, Kohlhammer, Ellis, & Mansmann, 2010). For example, news media vie for readership by providing interactive editorials online underpinned by numerical analysis. Think of the increasing popularity of infographics, and of how traditional static charts are being replaced by interactive ones, whereby the user can decide to zoom in on a relevant data subset. As an example thereof, when comparing economic trends (e.g. the development of country GDP versus housing prices), the online edition of the Economist allows readers to select relevant countries for comparison over a time period of choice. These editorials require careful preparation, preceded by a lot of work and the freezing of the data set over a certain period. In traditional business intelligence, transitory data (i.e. credit card transactions) are also off-loaded to a data warehouse (DWH), where they are frozen and archived until applied for further use. Research is currently underway on how to visualise data in real time through innovative techniques which temporarily reduce the processing complexity of visualisation without reducing the validity of the information presented (Keim, Huamin, & Kwan-Liu, Big-Data Vizualisation, 2013). No practical applications thereof seem to have hit the mainstream yet.

In conclusion, the risk exists that business end-users of Big Data visualisations are not trained or equipped with the right skill set to understand newer, more complex Big Data visualisations, in which more data points are portrayed than is usual in business. Newer visuals might awe early or uninitiated Big Data users and lead to incorrect decisions. To mitigate this risk, business end-user training is required in interpreting data analytic outcomes as portrayed by visuals.

Data scientists are also heavy users of visualisation, as it is not only used to express analytic results, but also at an earlier phase in order to understand the make-up of the to-be-analysed data. Exploratory Data Analysis (EDA) is recommended (Tukey, 1977) in order to understand the data's characteristics and to improve its subsequent use to test hypotheses. Tukey also pleaded for using data to search for viable hypotheses, turning around the leading paradigm of the time that statistical data should only support hypotheses (Konold, 1999). EDA has led a generation of students to use statistical visualisation techniques such as the box plot,
histogram, multi-vari chart, run chart, Pareto chart, scatter plot and stem-and-leaf plot. Understanding the nature of the data also helps prevent applying incorrect statistical methods to it. It is with Big Data that traditional representations of quantitative data can now be augmented or replaced with more scientific visualisations, which, as (Friendly, 2009) coins it, are "primarily concerned with the visualisation of three-dimensional phenomena (architectural, meteorological, medical, biological, etc.)". Big Data allows business to discover patterns involving more than the traditional two parameters set against time. This is termed inferential or inductive statistics (Cohen, B., Dunlap, Hellerstein, & Welton, 2009). What is important is that the right portrayal of the data is chosen. Consider an example: by visualising a pie chart at an angle, such as the top chart in figure 12 below13, it becomes apparent that the bottom-front pie segment is visually distorted and appears enlarged:

Figure 12

It is well known that statistical or numerical information and its presentation can be used to manipulate how information is received and interpreted. Now consider how this phenomenon might be an issue in more complex visualisations. In the discussed example, the distortion might be unintentionally introduced in an attempt to give a 3D look to visuals, but distortions can also be introduced on purpose. Think for example of such practices used in advertising and marketing, whereby prices are always rounded down to the nearest 99 cents because the mind mistakes 2,99 for a significantly lower amount than 3,01, priming the observer into thinking that he is getting a better deal while the advantage is not material.

In conclusion, while new visualisation techniques entering the mainstream are making it possible for a larger audience to understand patterns in complex data sets, the risk is introduced that an incorrect portrayal of the data is chosen and that the data is misrepresented, or worse, that a distortion or bias is introduced on purpose to influence the end-user in the interpretation step. When this occurs in the data profiling step performed by data scientists, the risk is even greater.

In summary, what is required is the correct choice of visual representation of processed data, ensuring that visualised data is neither misrepresented nor misleading and that no distortions or bias are introduced.
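
As a small illustration of choosing a less distortion-prone portrayal, the sketch below (assuming matplotlib, with made-up market shares) renders the same data as a pie chart and as a bar chart; bars support accurate length comparison, whereas tilted or exploded pie segments invite exactly the misjudgement described above.

```python
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D"]
share = [35, 30, 20, 15]            # made-up market shares, in percent

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.pie(share, labels=labels, autopct="%d%%")
ax1.set_title("Pie chart (angle/area judgement)")
ax2.bar(labels, share)
ax2.set_title("Bar chart (length judgement)")
plt.tight_layout()
plt.savefig("portrayal_comparison.png")   # write to file rather than display
```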

Interpretation

The famous expression associated with statistics, usually attributed to the author Mark Twain, is the escalating superlative contained in the statement that "there are lies, damned lies and statistics", by which is meant that statistics are often used in an attempt to strengthen weak arguments or to mislead an audience. This idea is not new, and popular statistics books of the 20th century (Huff, 1954) already addressed the issue and explained it in laymen's terms. Prominent examples of the potential misuse of statistics were categorised by the author under chapters named "The Sample with the Built-in Bias", "The Well-Chosen Average", "The Little Figures That Are Not There" and "Much Ado about Practically Nothing" (Steele, 2005).

13 Obtained from http://www.dashboardinsight.com/articles/digital-dashboards/building-dashboards/the-case-against-3d-charts-in-dashboards.aspx

Follow-up studies have concluded that distorting graphics or manipulating underlying data are some ways to lie with statistics, but failing to correctly follow the statistical method, or failing to check whether the applied method is the correct one, also leads to distortions (De Veaux & Hand, 2005). The latter authors go even further and state that even with the correct method and honest intentions, bad data can lead to results which have no validity at all.

While statistics is often an obligatory part of the curriculum of university courses within the social sciences, including business studies, these subjects often only briefly touch upon the most basic statistical principles. Many managerial jobs, or those in the field of accounting, do not require the daily application of statistical knowledge. In fact, when knowledge of statistics is a prerequisite of a job vacancy, recruitment is often sought in other fields such as engineering, actuarial science and econometrics. This suggests that companies are aware that the average statistical knowledge among non-technical employees, including top management, is low. The use of statistics by (IT) auditors is equally limited. This may have been caused by the trend set in the 1980s whereby statistical audit analysis of transactions was eliminated by the preference for substantive testing (Hitzig, 1992). When IT auditors encounter statistics, this is usually limited to ensuring that enough samples are selected based on pre-defined tables.

How then are the current generation of managers and auditors placed to interpret the results of Big Data analysis? Or in the case of an auditor, how can he assure himself that the right statistical method was applied when auditing a Big Data object? Both will have to rely on specialists or subject matter experts. If organisations are to reduce the risk that an inadequate scientific method is applied, Big Data teams will have to be organised in such a way that team members are required to challenge each other on the adequacy of the applied methodology. (Cohen, B., Dunlap, Hellerstein, & Welton, 2009) state that there is a growing need for data scientists, but that "these are often highly trained statisticians, who may have strong software skills but would typically rather focus on deep data analysis than database management."

In conclusion: to mitigate the risk of an incorrect statistical method being applied, what is required is the involvement of multiple experts challenging the applied methods and outcomes in Big Data & Analytics deployments, a process similar to that of scientific peer review.

(Manovich, 2011) makes the wry observation that the people involved in Big Data can be grouped as follows: 'those who create data (both consciously and by leaving digital footprints), those who have the means to collect it, and those who have expertise to analyze it'. For those charged with analysis, he emphasises the importance of complementary venues of research on the same data set and draws an analogy with researching video documentaries, whereby statistical and computational analysis should be complemented by human observation (i.e. watching a sample of selected movies) to come to useful results. This gap is also identified by (IBM, 2013): "The gap between the demand for analytics talent globally and the supply of analytics talent locally is one of the key obstacles to analytics implementations across all organizations ... The largest skills gap is the ability to combine analytic skills with business knowledge. The analyst who both understands the business and performs higher mathematic tasks is the most sought after in the market."

(Boritz, 2005) also places importance on context and argues that some kind of information impairment always occurs during interpretation or recognition. It is important that information producers and information maintainers consult with information users and that allowable tolerances and materiality guidelines are agreed. The skill set of information users is also an important determinant in this. Data is becoming increasingly accessible to individuals who are not specifically trained in its analysis (Boyd & Crawford, 2012). When users are not properly trained to weigh the outcome of analysis, this increases the chance of misinterpretation. (Berry, 2011) goes even further and states that using Big Data to understand reality poses risks to society because of 'destabilising amounts of knowledge and information that lack the regulating force of philosophy'. With this, the author might very well be a proponent of small data, so that humanity's bounded rationality can better deal with it.

How can organisations ensure that the risk of misinterpretation is reduced? (Eppler & Mengis, 2004) list a number of countermeasures which can be taken: "With regard to information itself, information overload can be reduced if efforts are made to assure that it is of high value, that it is delivered in the most convenient way and format (Simpson & Prusak, 1995), that it is visualized, compressed, and aggregated (Ackoff, 1967; Meyer, 1998), and that signals and testimonials are used to minimize the risks associated with information (Herbig & Kramer, 1994). On the individual level, it is important to provide training programs to augment the information literacy of information consumers (Bawden, 2001; Koniger & Janowitz, 1995; Schick et al., 1990) and to give employees the right tools so that they can improve their time (Bawden, 2001) and information management (Edmunds & Morris, 2000) skills."

This shows that organisations can reduce risks by ensuring that employees using the fruits of Big Data & Analytics follow proper training, so that they are able to correctly interpret results. Also, the right tooling must be available for interpreting outcomes, in which users must also be trained. This brings us back to the observations of (Boyd & Crawford, 2012), who assert that misinterpretation results when biases and limitations are not understood and outlined.

In conclusion: to mitigate the risk that Big Data end-users are unable to interpret, or that they misinterpret, Big Data & Analytics output, what is required is providing employees with the skills, training and tooling to understand Big Data & Analytics, thus eliminating as much uncertainty as possible in their interpretation of the analytics model and its outcomes.

CHAPTER III. QUALITY ASSURANCE

In the previous sections this study has dissected Big Data into its information components and its process components. We have further highlighted the importance of people, technology and Governance of Enterprise IT aspects in relation to both of these components. This study’s goal is to search for ways to improve the quality of the Big Data process. What is meant by “quality”, and how can this be ensured and, where possible, increased? In the following section we will explain the choice for Quality Assurance and look into relevant frameworks and their usefulness in application to Big Data & Analytics.

WHY QUALITY ASSURANCE?

Quality assurance can be defined as a way of preventing mistakes or defects in manufactured products, and avoiding problems when delivering solutions or services to customers. The American Society for Quality defines QA as: “The planned and systematic activities implemented in a quality system so that quality requirements for a product or service will be fulfilled.” (ASQ, 2014) Of importance hereby is the emphasis on continuous measurement and monitoring during the delivery of a service or the fabrication of a product. Quality Assurance differs from Quality Control in that the latter focuses on checking process outputs after the fact. Quality Control is more aligned with the traditional view of auditing, which often evaluates audit objects based on control operation during historical periods; test results are either good or bad. While Quality Control might uncover mistakes within a service process, and when coupled with recommendations can prevent mistakes in future, it cannot correct them as they happen and is of little value to Big Data stakeholders during the delivery process. It is for this reason that the author recommends Quality Assurance for Big Data & Analytics as a way to improve quality before finalisation. QA is also a very good fit with Big Data, which due to its iterative nature fits the popular Deming cycle of Plan-Do-Check-Act, which relies on inspection to improve quality (Langley, Moen, Nolan, & Norman, 2009).

Achieving quality standards

Numerous business best practices exist with the aim of ensuring quality or continuously improving it. These include quality management systems such as Lean Six Sigma and Total Quality Management (TQM). International standards have also been developed and companies employ various best practices for quality management systems. For example, the ISO 9001:2008 standard “Quality management systems — Requirements”, of which the (International Standards Organisation, 2014) claims that its application leads to continuous improvement, has proven to be very popular. This might show the increased importance companies place on quality management, next to any positive side effects for the company image which certification might bring (Poksinska, Dahlgaard, & Antoni, 2002). One of the principles of ISO 9001 is that effective decisions should be based on the analysis of data and information, which leads to the prerequisite of “ensuring that data and information are sufficiently accurate and reliable”. The statement applies to any information, but especially to management information, and as Big Data & Analytics takes a prominent place in this domain, the statement will also apply to it. The related ISO/IEC 17025 standard sets out general requirements for the competence of testing and calibration laboratories. Both standards allow for certification. Next to standards, other quality measurement initiatives also exist. For example, at the end of the 20th century, Carnegie Mellon University introduced (and patented) the Capability Maturity Model Integration (CMMI Product Team, 2010) and advocates its use by the public for guiding process improvement across a project, division, or an entire organization.

The pursuit of quality is ultimately a question of risk mitigation. In a recent article in the Wall Street Journal, many of these risks were mentioned (Jordan, 2013), e.g. the lack of skills of the current workforce, tooling which is not yet ready for prime time, difficulties understanding the data, and the need for security of data containing powerful information. More and more companies are becoming aware of the challenges, issues and now also the risks which go along with Big Data. It therefore makes sense that suitable control frameworks become available to deal with these risks. It goes beyond the scope of this study to determine whether these standards can be used in conjunction with the proposed control measures for Big Data & Analytics, but it is useful to establish which other, existing frameworks are in use which address the quality management issue.

SCOPING OF QUALITY DIMENSIONS

How are frameworks scoped?

To explain how control frameworks are scoped, this study starts with general management theory. Since the introduction of the Business Balanced Scorecard (BSC) by (Kaplan & Norton, 1993), the scorecard has enjoyed popularity in its application in business, enabling organizations to measure business results and track their progress against business goals, with the objective of improving financial performance. As the name suggests, the scorecard is conceived to be balanced and considers an evenly weighted set of stakeholders or subject areas in order to achieve strategic goals. It is a practical application of control theory, whereby (process or activity) outputs are measured at certain intervals and compared to expected values. Where necessary, corrective measures are taken. Objections were raised against the original scoping of the BSC perspectives, which according to the opponents placed too much focus on the financial aspect. This study leaves that discussion to the critics, but it shows that scoping should not be assumed to be set in stone, especially if the objective is ultimately something other than financial, and that it can be adapted depending on the objective. For example, numerous applications of the scorecard for achieving Corporate Social Responsibility exist.

The Committee of Sponsoring Organizations of the Treadway Commission, arguably the originator of the concept of Internal Control, also produced a balanced scorecard of sorts. In 1992 it presented to the world the COSO cube framework, which portrays multiple interrelated enterprise risk management components, derived from the way management runs a business (Committee of Sponsoring Organizations of the Treadway Commission, 2004). It combines these components with four categories of business objectives (strategic, operational, reporting and compliance). In turn these are multiplied again across departments, business units and other smaller governed entities. COSO has formed the inspiration for many derivative frameworks, including COBIT. The scoping of COSO is based on the core concepts of risk mitigation objectives and the processes and people involved with business operations. The framework is delimited by the principle of obtaining reasonable assurance.

The complexity of the original framework has led organisations intent on internal control to apply their own, simplified versions of the framework (Shaw, 2006). Combined with the fact that the original risk management components of the COSO framework have increased throughout the years from four to eight, this leads to the conclusion that there is much room for interpretation on the part of the control framework user as to how to scope the framework. This certainly applies to the number of control measures which are actually implemented.

Frameworks addressing IT

Due to the large role of IT in Big Data & Analytics, the frameworks which address it cannot be overlooked. The professional field of Information Security has always focused on the venerable “C.I.A.” triad, that is, the Confidentiality, Integrity and Availability of data or information, and so have its control frameworks. (Boritz, 2005) argues however that current control frameworks which address data integrity are limited to financial information, because most of these frameworks have roots in the accounting and auditing spaces. The author proposes that information integrity controls should be made suitable for other information types on which decisions are based as well. This study agrees with this statement, which is why it has investigated issues which cannot be solved by these frameworks and, where possible, proposed new measures.

The previous section discussed how dimensional scoping is achieved for well-known control frameworks. The same general approach applies to the control frameworks used for maintaining information Confidentiality, Integrity and Availability. For example, General IT Controls (GITCs) can also be viewed as a balanced scorecard for the IT control environment, used to determine whether controls are implemented and operating effectively. Exceptions enable management to execute remediating measures to achieve expected values in future. Again, as with other frameworks, no single exhaustive scoping of controls exists. Sarbanes-Oxley, ISAE 3402 and other regulations and audit standards all refer to the concept of GITCs without providing a delimited scoping. A recent ISACA publication (Singleton, 2013) confirmed this, providing some examples of GITC frameworks with varying scopes, but concluded that while details vary, there is consensus on scope. It is left to the professional judgement of the IT auditor which details and perspectives to apply, as long as the audit goal is achieved. The coverage of GITCs overlaps with ISO 27001 and with COBIT, and addresses the topics of user authorisation, change management, software development, incident management and continuity. COBIT goes further, also addressing enterprise-level controls such as defining IT strategy and managing outsourcing.

This study has established that many aspects of the General IT Controls are important, as Big Data & Analytics builds on top of existing IT infrastructure and control environments. As many aspects have already been covered in previous sections under different names, this study will only look to specific entity-level controls which might contribute to the quality of the Big Data & Analytics process. This topic is covered in the section on Governance of Enterprise IT, contained within this chapter.

MAINTAINING QUALITY IN THE ETL PHASE

This study concludes that the scoping of scorecards and control frameworks in general depends on the objective. Prior to this, this study identified the challenges and issues with Big Data & Analytics alongside the inherent risks of the process. Where recommendations were already well known and part of the field of IT auditing, these were mentioned immediately, albeit in passing. In the following section, we detail some specific challenges which require special consideration because they are not usually covered by the IT profession, and we look to other disciplines for better practices. Also described are those topics deserving special mention due to their impact or importance.

Data Quality management

The saying “Garbage in, garbage out” applies to many input-output processes. Supposing a pristine dataset, traditional methods of maintaining data quality are linked to access control, intended to prevent unauthorised (and by deduction possibly incorrect, be it malicious or unintentional) changes to the data. In financial processes it is common practice to introduce dual controls when data is manually inputted, creating conflicting interests in follow-up workflow activities in order to increase data quality. Still, 100% correct data is unachievable. This is why Data Quality management has traditionally been brought to bear on this challenge (Jonker & Pols, 2014). It is achieved by performing data classification and following steps to check the data’s accuracy, consistency, timeliness, conformity, reasonableness, preciseness and integrity, along with other sanity checks. Data Quality is achieved when data exhibits representational faithfulness, which is achieved “when information is: complete (within limits established by agreement, policy or regulation); current/timely (within limits established by agreement, policy or regulation); accurate/correct (within limits established by agreement, policy or regulation); and valid/authorized (in accordance with policies, standards and business rules established by top management and the Board and applicable laws and regulations established by regulatory agencies or legislative bodies). In addition, the study finds support for a second layer of attributes represented by the following enablers for the core attributes of information integrity: Secure; Available/Accessible; Understandable/Appropriate level of granularity/aggregation; Consistent/Comparable/Standards; Dependable/Predictable; Verifiable/Auditable; Credible/Assured.” (Boritz, 2005). The same researcher further defines data integrity as the process of maintaining and assuring the accuracy and consistency of data over its entire life-cycle. Research further shows that for traditional data warehousing challenges, access control and security are paramount in order to maintain the integrity of source data (Fernandez-Medina & Trujillo, 2006). This means that data cannot be modified in an unauthorized or undetected manner. Therefore, as a prerequisite, Data Quality Management must be combined with other IT Security controls. One huge caveat is that these approaches are only intended to work on internal data.
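
To make the data quality attributes above concrete, the following minimal sketch (Python with pandas) profiles a data set on a few of them. The column names, thresholds and business rule are hypothetical; in practice they would follow the tolerances agreed between information producers and information users.

```python
# Minimal data-quality profiling sketch; column names ('account_id', 'amount',
# 'booking_date') and thresholds are hypothetical, not prescriptive.
import pandas as pd

def quality_report(df: pd.DataFrame, max_age_days: int = 365) -> dict:
    """Simple indicators for completeness, timeliness, validity and consistency."""
    now = pd.Timestamp.now()
    age = now - pd.to_datetime(df["booking_date"])
    return {
        "missing_ratio": df.isna().mean().to_dict(),                    # completeness per column
        "stale_ratio": (age > pd.Timedelta(days=max_age_days)).mean(),  # currency / timeliness
        "negative_amounts": (df["amount"] < 0).mean(),                  # validity vs. a business rule
        "duplicate_rows": int(df.duplicated().sum()),                   # consistency / double capture
    }

# Example usage on a toy data set.
df = pd.DataFrame({
    "account_id": ["A1", "A2", "A2", None],
    "amount": [100.0, -5.0, -5.0, 250.0],
    "booking_date": ["2014-01-10", "2013-02-01", "2013-02-01", "2014-06-30"],
})
print(quality_report(df))
```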

14 ISO/IEC 27001 is a well-known standard in the 2700x family providing requirements for an information security management system (ISMS).

As is clear, however, much of the data of interest to Big Data is externally generated by people in an end-user computing environment. Traditionally, IT auditors have stayed away from providing assertions of assurance over these environments due to the many manual activities and the local lack of a control environment. In these cases it is often not possible to perform traditional quality management measures on source data, as from the start the data set is often in doubt or its generation unknown.

How do we maintain integrity when security mechanisms do not exist for interfaces with public and social data, especially when much of Big Data encompasses social media data as well? (Boyd & Crawford, 2012) argue that with public data such as Twitter information there is no way of knowing whether your data set is complete. Twitter or individuals might have inadvertently deleted offensive tweets. The data made available may also not be complete, except to a few advantaged parties. Still, data scientists insist that there is value to be gleaned from analysing these data sets and it is clear that organisations are applying them. The challenge of source data in doubt is circumvented in the Analytics phase. This is done by making assumptions which take into account identified data quality errors in source data as part of the data profiling. Without amending the original source data, the algorithms and queries operating on it are configured so that they remove outliers or perform missing-data interpolation. Of course, this process is not perfect, which adds to the existing statistical uncertainty. It is important that this uncertainty is expressed as part of the answer of Big Data & Analytics.
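
As a simple illustration of this practice, the sketch below (assuming a pandas series with hypothetical values) masks extreme values using a robust median/MAD z-score and interpolates missing observations on a working copy, leaving the source data untouched. The threshold and data are illustrative only.

```python
# Illustrative sketch of handling known data-quality issues in the Analytics phase
# without amending the source data. Threshold and values are illustrative.
import pandas as pd

def clean_for_analysis(series: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Mask extreme outliers and interpolate missing observations on a copy of the series."""
    s = series.copy()                              # the original source series stays untouched
    mad = (s - s.median()).abs().median()          # median absolute deviation
    robust_z = 0.6745 * (s - s.median()) / mad     # modified z-score (assumes mad != 0)
    s[robust_z.abs() > threshold] = None           # treat extreme values as missing
    return s.interpolate(limit_direction="both")   # fill gaps by interpolation

raw = pd.Series([10.0, 11.0, None, 12.0, 950.0, 13.0])   # one gap, one extreme value
print(clean_for_analysis(raw))
```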

This shows the importance of understanding the source of your data and means that vetting of data sources is required, by means other than the traditional reliance on controls. Through data analysis in the Analytics phase, the properties of the data must be studied to determine their impact on the applied statistical model’s performance and on the validity of the outcomes.

Referential integrity

We mentioned earlier that one of the important characteristics of traditional databases is the management of referential integrity, a principle known as ACID. Big Data’s MapReduce and the extended family of similar architectures famously do not comply with the ACID model. This is not a problem in cases where data is not in transit, for example when applying Big Data to large historical data sets. This is why, for a Proof of Concept which only uses offline data, it is easy to deal with these challenges: they are familiar, and have familiar solutions. Integrity is violated when a message is actively modified in transit. Information security systems typically provide message integrity in addition to data confidentiality.

In conclusion: we do not need to consider referential integrity when dealing with data at rest. However, when data is in transit and this data is fed into the Big Data platform (e.g. Hadoop), the impact of (the lack of) referential integrity should be considered as part of the Capture & Storage step in the ETL processing phase.

15 ACID is the acronym for Atomicity, Consistency, Isolation and Durability, the set of properties that guarantee that database transactions are processed reliably.

MAINTAINING QUALITY IN THE ANALYTICS PHASE

This study has established that the Analytics phase of Big Data, whereby data is modelled, is closely associated with applied statistics. It is therefore useful to name a number of risks which apply to statistical model development.

Statistical risks

A general risk is caused by humans’ built-in ability for pattern recognition, which is both our strength (e.g. weather prediction) and our pitfall (e.g. gambling). As the last example shows, pattern recognition is unfortunately not good at distinguishing correlation from causation. This brings us to a famous risk associated with statistics: the mistake of assuming that correlation implies causation. To illustrate how this can go wrong, (Leinweber, 2007) analysed a huge data set in which he demonstrated a spurious connection between the S&P 500 and Bangladeshi butter production. Of course, no such causal relationship exists. This demonstrates that Big Data analysis can be prone to finding links where there are none. Conclusions on causation should therefore always be challenged.
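
The scale of Big Data makes such spurious findings easier to stumble upon: when enough candidate series are screened against a target, some will correlate strongly by pure chance. The hypothetical sketch below illustrates this with entirely random data.

```python
# Sketch of how spurious correlations emerge when many candidate series are screened:
# among thousands of purely random "predictors", some correlate strongly with the
# target by chance alone. All data below is random and illustrative.
import numpy as np

rng = np.random.default_rng(42)
target = rng.normal(size=60)                   # e.g. 60 periods of an index return
candidates = rng.normal(size=(10_000, 60))     # 10,000 unrelated random series

corrs = np.array([np.corrcoef(c, target)[0, 1] for c in candidates])
print("strongest absolute correlation found:", np.abs(corrs).max())   # typically around 0.5
```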

Another major risk present in statistical modelling is overfitting. In statistics and machine learning, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data. Another cause is that the training data is too noisy. The risk of overfitting is therefore present with Big Data, due to its philosophy of using all available data, also as training data. (Davenport, Barth, & Bean, 2012) note that smaller, static sets of data therefore have value in the development and refinement of analytical models in Big Data.
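
The following sketch, using illustrative synthetic data, shows how a held-out static sample exposes overfitting: an excessively complex model fits the training observations almost perfectly but typically performs poorly on data it has not seen (here the last observations, akin to an out-of-time sample).

```python
# Sketch of using a held-out sample to expose overfitting; data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=30)     # simple linear signal plus noise

train, test = slice(0, 20), slice(20, 30)        # static split kept aside for validation

for degree in (1, 12):                           # modest vs. excessively complex model
    coeffs = np.polyfit(x[train], y[train], degree)
    train_rmse = np.sqrt(np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2))
    test_rmse = np.sqrt(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    print(f"degree {degree}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
```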

Model outputs (i.e. statistical answers) are usually associated with a certain bandwidth of uncertainty, known as a confidence interval. Due to its importance, the confidence interval (i.e. the uncertainty) of the outcome can and should have a big influence on decisions by analytics end-users. If organisations rely on data analytics for strategy formulation, financial decisions or go/no-go choices in implementation, it should be ensured that the risk of misinterpretation is mitigated. Management should therefore ensure that the person(s) responsible for interpretation is/are knowledgeable in interpreting uncertainty. But this alone is not enough to mitigate misinterpretation, because (Boyd & Crawford, 2012) point out that interpretation is at risk of cognitive limitation and bias. This is especially so if the responsibility for interpretation resides with a single person. A second opinion, by way of an independent or expert review of the analytics outcome, can assist in providing certainty that the interpretation is correct.

Model validation

To ascertain which measures contribute positively to a better data analytics outcome, it is helpful to look at other fields for established better practices. One such field is Financial Risk Management, where risk modelling has become the new paradigm for managing and mitigating risks (McKinsey, 2011). As a result of financial regulation, many companies in the financial industry have applied modelling to assess and mitigate risks in order to reduce their risk-weighted capital buffers. Within the insurance industry, the Own Risk and Solvency Assessment (ORSA) may include application of the standard model prescribed by the regulator, or the development and implementation of a suitable internal model. As financial sector firms’ internal models have grown in complexity, so has the risk associated with their use. Considering the sizable financial loss which is at stake in both cases, a model evaluation assessment is a vital part of the process, in order to mitigate what has been dubbed Model Risk.

In a 2011 report by Financial Risk Management experts, the underlying risks of Model Risk are described and include: inadequate documentation around the model; bad data input; erroneous assumptions; and calculation errors and errors in model logic (KPMG, 2011). Statistical robustness is the criterion whereby the quality of these models is measured. This study summarises the approach proposed to mitigate these risks, an activity known as Model Validation.

At a minimum, a Model Validation should cover the core aspects of the model’s performance, including its objectives, output, methodology, data integrity and parameterization. The approach calls for a first-time assessment, right after the model has been developed, and a continuous assessment, after the model has operated in production for a while. A standard sample is used to estimate the current parameters of the model. As part of the Data Profiling exercise, this sample should be documented, e.g. which period is covered, sample size, data history, representativeness and relevance of the data, and the relation between samples. As part of the first-time assessment, an evaluation on out-of-sample and out-of-time datasets is performed: out-of-sample and out-of-time data are fed into the model and its performance is compared to its performance on the standard sample. Continuous assessment is scheduled periodically to validate whether the model is still functioning as intended when conditions might have changed. This is similar to the practice proffered by the interviewed data scientists, who prescribe stress testing of algorithms by introducing extreme values and outliers to induce failure, so that the limits of the model are known and it is only operated within its performance envelope.

Analysis of the performance of the model output is one of the central aspects of the Model Validation. For predictive models, backtesting is said to be useful for analysing the performance of the model outputs against the actual performance of the portfolio. Backtesting is a special type of cross-validation applied to time series data. It is performed to assess (a) discriminatory power, by calculating items such as the Lorenz curve or Gini coefficient, and (b) calibration quality, which is assessed by testing the confidence interval with binomial testing or chi-square testing.
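
To make these two tests concrete, the simplified sketch below computes a rank-based Gini coefficient for discriminatory power and a one-sided binomial p-value for calibration. The scores, observed outcomes and the assumed 20% probability of default are illustrative; this is not a full backtesting procedure.

```python
# Minimal backtesting sketch: (a) discriminatory power via a rank-based Gini (2*AUC - 1),
# (b) calibration via a one-sided binomial test against a hypothetical 20% PD.
import numpy as np
from scipy import stats

def gini(scores, outcomes):
    """Gini coefficient derived from a Mann-Whitney estimate of the AUC."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    ranks = stats.rankdata(scores)                    # ties receive average ranks
    n_pos, n_neg = outcomes.sum(), (1 - outcomes).sum()
    # AUC = probability that a random defaulter scores higher than a random non-defaulter.
    auc = (ranks[outcomes == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1

scores = [0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10]   # higher score = riskier
defaults = [1, 1, 0, 1, 0, 0, 0, 0]                         # observed outcomes
print("Gini:", gini(scores, defaults))

# Calibration: probability of seeing this many defaults or more if the PD were 20%.
k, n, pd_estimate = sum(defaults), len(defaults), 0.20
print("one-sided binomial p-value:", 1 - stats.binom.cdf(k - 1, n, pd_estimate))
```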

In conclusion: Financial Risk Modelling makes use of the same general scientific, mathematical and statistical principles and methods which are used in Big Data & Analytics; in fact, at the basis of both lie applied statistics and scientific research, and both practices share the same risks. The approach to ensuring model quality with Big Data & Analytics should therefore be a similar one. This study proposes that Big Data borrow better practices from the actuarial profession for validating the chosen statistical model and its assumptions. Prescribed hereby is model documentation, to ensure auditability of the statistical model. Data profiling should be performed to determine the ontology of the data and to establish which statistical approach is the best match. Further measures are possible and can include developing protocols for research methods, reducing measurement error, bounds checking of the data, cross tabulation, modelling and outlier detection, verifying data integrity, etc. This is worthy of a study by itself.

Algorithm development and testing

Managed software development was proposed earlier in this study to prevent bugs in Big Data’s algorithmic code. A more specific measure, directly related to the statistical nature of Big Data, is to choose the right language in which to express the statistical model; some are better suited than others. (Canny & Zhao, 2014) argue that, for the sake of speed and efficiency, code used for analytical algorithms applied to data sets should be as close as possible to its mathematical expression, allowing for easier development, debugging and maintenance. The authors list the most common statistical inference algorithms that have been used for behavioural data as: regression, factor models, clustering and some ensemble methods.
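
As a simple illustration of this principle (not taken from Canny & Zhao), the sketch below fits an ordinary least squares regression in a form that reads almost exactly like the normal-equation formula beta = (X'X)^-1 X'y. The data and coefficients are synthetic and illustrative.

```python
# Sketch of analytical code that mirrors its mathematical form: ordinary least squares
# solved via the normal equations. Data is synthetic.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])   # intercept + 2 predictors
true_beta = np.array([0.5, 2.0, -1.0])
y = X @ true_beta + rng.normal(scale=0.1, size=200)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # reads almost exactly like the formula
print(beta_hat)                                # close to [0.5, 2.0, -1.0]
```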

Also important is choosing the right software development method. Gaining popularity are the lightweight development methods of Agile and Scrum; lightweight, because they favour individuals and interactions over processes and tools, working software over comprehensive documentation, customer collaboration over contract negotiation, and responding to change over following a plan. These methods form a good fit with the Big Data & Analytics process, as the two share many characteristics: both are iterative, incremental and evolutionary; both require efficient but intense face-to-face communication; both have very short feedback loops and adaptation cycles; and both are focused on the quality of the outcome.

16 Taken from the Agile Manifesto at: http://agilemanifesto.org/

What is further shared between the two is that lightweight development methods, like Big Data projects, take a holistic view of the process.

In conclusion: next to the application of software development best practices, it is also important to choose the right Big Data language in which to develop queries and algorithms, and to follow a development method which fits the iterative nature of the Big Data & Analytics process, in order not to overburden the creative process involved with the latter with too much bureaucracy.

Coping with project-based Big Data

At the moment, many organisations (especially the established banking and insurance behemoths in the financial sector) are still in the pilot phase of implementing Big Data. Perhaps for some it will never leave this R&D stage. This means that it is not part of business as usual and thus not yet institutionalised within the organisation’s structure and governance. As the selected business case will show, due to the try-out nature of Big Data implementations, these are often run on a project basis. Recognising this fact, this study has analysed the applicability of better-practice project management principles such as PRINCE2 (Office of Government Commerce, 2009) and IT service management principles such as ITIL and COBIT to the Big Data & Analytics process. More specifically, aspects of KPMG Advisory’s GETT (Global Enterprise Transformation Tool) methodology were also evaluated for usefulness, as it combines a number of better practices. The toolkit is applied by determining to what degree certain criteria are covered in an organisation’s programme design. For example, within the People component the user of the toolkit must consider whether future job descriptions are available to users, and whether training requirements have been identified. However, this study established that many of the challenges were generic and not specific to Big Data & Analytics, at least not more so than the issues and risks already identified.

In conclusion: while it is recommended that better-practice project management principles are applied during any project, this study will not propose any specific project controls for Big Data & Analytics implementations. As such, project management is left out of scope of the proposed control framework.

Accounting for Enterprise IT Governance

This study has recognised that Big Data & Analytics and the IT supporting it are part of the greater organisational structure and governance, and build upon those qualities. This study has therefore also investigated IT Governance frameworks. COBIT suggests adequate IT Governance practices which contribute positively to BD&A. One of these is the alignment of IT strategy with the business. In the case of Big Data, it makes sense to align the IT strategy with the Big Data & Analytics requirements of the business. Others agree: for example, (Accenture, 2013) calls for a Big Data strategy as a prerequisite for achieving strategic goals with data, with a survey by IBM echoing that, as data management is an onerous challenge for organisations, “Executives need to establish a business-driven agenda for analytics that enables executive ownership, aligns to enterprise strategy and business goals, and defines any new business capabilities needed to deliver new sources of revenue and efficiencies.” (IBM, 2013). That sounds a lot like a Big Data strategy.

While a lot of companies sit on large mountains of data, many do not know exactly what to do with it. As the importance of IT has increased throughout the years, it became a given that companies appoint CIOs, in recognition of the importance of information technology and its contribution to a company’s success. Considering that Big Data is said to become an important value driver for companies, this would mean that, in order to gain clarity and align the use of Big Data with strategic goals, ownership of data and representation of analytics professionals at C-level should be considered. In that light, (IBM, 2014) even calls for the appointment of a Chief Data Officer, and the authors “.. posit that the addition of a CDO to the organization’s executive team will enable greater focus and optimized use of this critical strategic asset.”

17 GETT is a proprietary toolkit used to provide quality assurance on Global Enterprise (IT) transformation programs and projects. It draws inspiration from Prince2 but also the COBIT standards and the Capability Maturity Model. GETT considers the following dimensions: Programme Governance, Programme Management, Change Management and Performance management, as well as the familiar People, Process and Technology.

(Gartner, 2014) goes so far as to predict that by 2015 a quarter of all companies will have appointed a CDO.

In conclusion: the risk exists that an organisation’s IT department is not aligned with the business requirements of the Big Data & Analytics strategy. This strategy should be formulated so that, in turn, the organisation’s IT strategy may be aligned with it. An important enabler is sponsorship up to the highest level of company management. A powerful example is Big Data & Analytics representation at C-level, whereby a board member is given the mandate to align business, IT and Big Data & Analytics strategy.

SECURITY, PRIVACY AND ETHICS

A recent report by (KPMG, 2014) on the privacy and security challenges of Big Data identified the risks as those “related to identification, re-identification, predictive analysis, the indiscriminate collection of data, and increased risk of a data breach”. The authors propose some familiar measures, such as: Data Governance to establish data ownership and responsibility; Compliance to align analytics processing with company policy and applicable regulation; and Access Management to ensure only authorized access to Big Data sets. The authors also call for careful anonymisation of data sets, and caution when sharing data with or using data from third parties. More novel proposals by these authors include applying Attribute Based Access Control to regulate access to Big Data sets which might contain Personally Identifiable Information. It remains to be seen whether the technical and organizational challenges associated with labelling all data attributes (as part of data classification) can be overcome in practice.
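
To illustrate the idea behind Attribute Based Access Control, the hypothetical sketch below decides on access to a Big Data set based on attributes of the user, the data and the stated purpose. The attribute names and policy rules are invented for illustration and are not taken from the KPMG report.

```python
# Hypothetical Attribute Based Access Control (ABAC) decision for access to a Big Data set.
# Attribute names and policy rules are illustrative only.
def abac_permit(user: dict, dataset: dict, purpose: str) -> bool:
    """Grant access only when user, data and purpose attributes satisfy the example policy."""
    if dataset.get("contains_pii") and user.get("clearance") != "pii-approved":
        return False                                        # PII requires explicit clearance
    if purpose not in dataset.get("approved_purposes", ()):
        return False                                        # purpose must be pre-approved
    return user.get("department") in dataset.get("allowed_departments", ())

user = {"department": "risk-analytics", "clearance": "pii-approved"}
dataset = {
    "contains_pii": True,
    "approved_purposes": ("credit-risk-research",),
    "allowed_departments": ("risk-analytics",),
}
print(abac_permit(user, dataset, purpose="credit-risk-research"))   # True
print(abac_permit(user, dataset, purpose="marketing-campaign"))     # False
```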

The scientific community has long faced the challenge of the requirement to anonymise its research subjects and the data which has been gathered from them, as a means of ensuring honesty of answers by data subjects and later protection from risks. It is obvious that the field of medicine cannot do without personal data. Can we look to this field to borrow better practices as to how an individual’s personal data contributing to this field is safeguarded? Perhaps. For the field of scientific research in medicine, the European Union (EU) has introduced a procedural solution by imposing regulation (EU Directive 2001/20/EC, also known as the Clinical Trials Directive) which makes the appointment of Ethics Boards in EU member states mandatory. In the United States of America (USA), the equivalent supervisory body is the Institutional Review Board, which has also been enacted in law. These bodies oversee which scientific research practices are allowed and which are not.

In some countries privacy law makes the appointment of Data Privacy Officers mandatory for data processors. These regulations never go so far as to stipulate what the responsibilities of the business are with regard to the ethical processing of data. Private organisations outside of the field of medicine thus lack a supervisory body charged with maintaining ethical (as distinct from legal- or compliance-only) choices. Due to the extensive use of personal data in Big Data & Analytics, it is no surprise to see that (KPMG, 2014) proposes the introduction of a process which requires the evaluation and approval of Big Data use cases and the use of data feeds. The authors do not go so far as to state which responsible body should be involved in the supervision. This paper assumes that the data owners, data scientists and the business, together with the data privacy officers, legal, compliance and IT, form a committee so that all required parties are involved in the risk assessment.

One caveat in business is that even when some form of supervision is in place, it lacks the broader focus on social issues which an Ethics Committee might have. Instead the focus is more often on compliance and legal issues, or those which have financial impact (e.g. the Audit Committee or Supervisory Board). Are these existing committees well placed to judge the ethical application of Big Data by the commercial company they are charged to oversee? (IBM, 2013) posits that “Leaders use a rigorous system of enterprise-level standards and strong data management practices to help ensure not only the timeliness and quality of the data, but its security and privacy, as well … These standards cover data management practices from intake to transfers, data storage processes for static and streaming data, and metadata management to ensure data traceability and enterprise data definitions.”

This study is therefore also a proponent of weighing ethical considerations next to commercial and regulatory ones.

As mentioned, privacy is an important part of the equation. Which measures can counter the risk of re-identification? Differential privacy is one such concept, which tries to alleviate the risk that arises from being able to isolate data related to an individual from the outcome of Big Data analysis, thus preventing leakage of information about participating individuals. This could also improve the quality of the data set by promoting honesty in answers, as participants are aware that the risk of discovery is low (McSherry & Talwar, 2007). Differential privacy can be achieved through the use of algorithms which are insensitive to variation in outcome when the data of one individual is subtracted from the analysed data set (Dwork & Lei, 2009).

In the same philosophy, data sets can also be made more privacy-robust by adding random Gaussian noise.
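
A minimal sketch of this noise-adding idea is shown below, using the classic Laplace mechanism on a count query (Gaussian noise can be applied analogously). The epsilon value and data are illustrative only; this is not a hardened differential-privacy implementation.

```python
# Illustrative output perturbation in the spirit of differential privacy: a count query is
# answered with Laplace noise calibrated to the query's sensitivity (1 for a count), so the
# presence or absence of any single individual is hard to infer from the answer.
import numpy as np

rng = np.random.default_rng(7)

def noisy_count(values, predicate, epsilon=0.5):
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)   # scale = sensitivity / epsilon

balances = [1200, -300, 45_000, 80, 999]
print(noisy_count(balances, lambda b: b < 0))   # around 1, plus or minus noise
```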

Not only can specific statistical methods reduce privacy risk; software-architecture-based solutions also exist. For example, MapReduce-based software has been developed with the same intention of reducing privacy risk. Airavat is one such example. This software will take in any untrusted code and output trusted data by introducing a secure reducer into the equation, replacing the untrusted reducer in the mapper-reducer combination of MapReduce software. The reducer component is thereby surrounded by strong access controls which enable end-to-end privacy (Roy, Setty, Kilzer, Shmatikov, & Witchel, 2010). In order not to deviate from the purpose of the research question, this study avoids getting into technical details. The main conclusion is, however, that tooling is available to the public which can reduce the privacy risk of Big Data analytics. It is up to organisations to apply it when necessary. IBM goes so far as to state that “In addition to protecting customer data, strong security – ironically – also enables wider sharing of data within an organization. Once sensitive data is secured through such practices as role-based access, data masking and monitoring, sharing the data becomes less risky.”

18 Airavat is a MapReduce-based system which provides strong security and privacy guarantees for distributed computations on sensitive data. It offers a novel integration of mandatory access control and differential privacy. See also http://z.cs.utexas.edu/users/osa/airavat/

There is a further risk that unsupervised Big Data gathering by companies will lead to more privacy breaches down the road. What happens when firms go bankrupt? How is source data laid to rest? Where is it archived? These are a few of the many unanswered questions within the Big Data lifecycle. In a report on how to fuel growth with Big Data, (Accenture, 2011) also notes the need to consider consumer protection, given the ethical challenges associated with selling consumer personal information for use in Big Data applications.

In conclusion: this study proposes suitable security and privacy controls, such as logical access controls on both the Hadoop cluster in the ETL phase and the data sandbox accessed by data scientists in the Analytics phase, in order to mitigate the risk of unauthorised access. Attribute Based Access Control can assist in providing the security controls which help achieve privacy requirements. To ensure privacy, measures should be implemented such as encryption, data masking and the application of differential privacy in statistical sampling. In addition, to ensure that Big Data & Analytics use cases not only comply with regulatory requirements but also take ethical considerations into account, a supervisory body should be appointed which is responsible for approving Big Data use cases and connections to data feeds.

PROPOSED QUALITY ASSURANCE FRAMEWORK FOR BIG DATA & ANALYTICS

Selected perspectives

Based on the empirical data obtained from the selected business case, the expert interviews and the outcome of the literature review, this study has selected the following domains which have a demonstrable relevance to the ‘balanced scorecard’ of a Big Data & Analytics process. The following matrix illustrates their relationships with the Big Data & Analytics process.

Nr. | Phase | Process step | Challenge | Domains (People / Process / Technology / GEIT)
1 | ETL | Capture and storage | Volume, Velocity, Variety | x x x x
2 | ETL | Conversion and transformation | Variety, Velocity | x x x
3 | ETL | Cleansing and Enrichment | Veracity, Volume | x x x x
4 | Analytics | Data profiling | Variety, Velocity | x x x
5 | Analytics | Statistical modelling | Variety, Veracity | x x x
6 | Analytics | Algorithm / query development | Variety, Velocity, Veracity | x x
7 | Decision Making | Visualisation | Variety, Veracity | x x x
8 | Decision Making | Interpretation | Variety, Volume, Veracity | x
9 | Decision Making | Automation | Volume, Variety, Velocity, Veracity | x x x

Figure 13

For each of these domains, controls are defined with the intention of mitigating or reducing identified risks so that these may attain a level which is acceptable to the end-user. For the purpose of understanding, and in order to be specific, the Big Data control framework will include the following identifying attributes: control name, control description, linked risk and linked domain.

Due to the nature of this study, with its associated time constraints and other boundaries, the controls listed within the framework cannot be assumed to be exhaustive. Instead they are intended to point in the direction of the road which must be taken. Wherever possible, only those controls specific to Big Data & Analytics are mentioned. For example, while there are many enterprise IT controls, the appointment of a Chief Data Officer is one which has a direct relation to the quality of Big Data & Analytics. The framework is ordered by process steps, again using the conceptual model of the knowledge discovery process by (Chen & Zhang, 2014) as adapted for this study. For the complete and detailed control framework, this study refers to the Appendix.
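
Purely as an illustration of how a single entry in such a framework could be recorded, the sketch below models one hypothetical control with the identifying attributes named above. The example control and its wording are illustrative and are not an excerpt from the Appendix.

```python
# Hypothetical representation of a single control record in the proposed framework.
from dataclasses import dataclass

@dataclass
class BigDataControl:
    name: str
    description: str
    linked_risk: str
    domain: str          # People, Process, Technology or GEIT
    process_step: str    # used for ordering, e.g. "Statistical modelling"

example = BigDataControl(
    name="Independent review of the statistical model",
    description="An independent expert challenges the chosen statistical model, its "
                "assumptions and its outcomes before results are released to end-users.",
    linked_risk="Incorrect statistical method applied or outcomes misinterpreted",
    domain="People",
    process_step="Statistical modelling",
)
print(example.name)
```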

CHAPTER IV. CASE STUDY

By review of available literature and expert interviews, this study has laid bare the Big Data & Analytics process and investigated the sources and properties of its data, in order to come to an understanding of the challenges and risks associated with the process which might adversely affect its outcome. Next to existing measures, this study also explored alternative ways of dealing with these risks, drawing analogies with professional fields which face similar issues and where risk countermeasures are well known and more mature.

It has become apparent that new risks exist which currently do not have easy or well-known solutions, and of which both the Big Data practitioner (i.e. the data scientist) and the Big Data end-user (very often the business) must be aware when using Big Data & Analytics outcomes.

Drawing on both existing and newly proposed better control practices, this study has composed a control framework which, when applied to Big Data ventures, is intended to be useful in providing quality assurance. In order to ascertain the practicality and validity of the proposed control measures, this study has investigated a real-life Big Data & Analytics business case in order to determine (a) whether the proposed control measures were implemented, (b) if so, whether they contributed value, and (c) if not implemented, whether the proposed measure would have merit in real life.

METHOD OF INQUIRY

Base expert interviews

Before selection of the Big Data business case and before entering the research phase of this study, subject matter experts in the field of Big Data & Analytics from the KPMG Advisory Big Data department (BD) and financial risk modelling experts from the KPMG Financial Risk Management department (FRM) were interviewed to gain insight into the high-level process involved with Big Data & Analytics. This study would like to thank the following persons for their availability, time and effort:

Fabian Jansen - Consultant BD (KPMG)

Paul Wessel - Senior Manager FRM (KPMG)

Sam van Hensbergen - Manager FRM (KPMG)

Sipho van der Putten - Manager BD (KPMG until July 2014)

Base business case team member interviews

In order to gain an understanding of the background of the project, its scope and the chosen approach, KPMG experts involved in the engagement were initially interviewed in a free format setting. In a subsequent meeting, one business case team member was interviewed with the help of the developed Big Data & Analytics control framework, which was used as a questionnaire to guide discussion and obtain more detailed responses on individual controls which might have been implemented within the business case.

OBTAINING EMPIRICAL DATA

For the purpose of obtaining empirical data from the field, a recent Big Data project was sought as a business case, in order to be able to inquire how quality was maintained and assured within the project’s Big Data & Analytics process. To this purpose, a recent engagement which KPMG Advisory N.V. (hereafter referred to as “KPMG”) performed at a large international bank (hereafter referred to as “the bank”) was chosen because of this business case’s fit with this study’s research question.

Due to the pilot nature of the selected business case, it was not expected that all proposed control measures would be present or applied. For example, the bank’s business case used only internally generated data; external data such as internet data, social data, public data and trend data were not considered, so the risks associated with these did not apply. One could beforehand also exclude gaining any empirical evidence for measures within the step of Big Data Automation, because none of the outcomes (algorithms / code) of the selected business case entered the live environment. For the same reason, while privacy requirements were implemented, no ethical considerations were taken into account by those involved in the business case, because none of the results would be applied to the production environment and thus be able to affect external parties or individuals.

DESCRIPTION OF SELECTED BUSINESS CASE

The focus of this study is not on the questions posed or the answers gained as a result of the performed Big Data & Analytics business case, nor does it focus on the relevance of the gained or intended results. Instead the focus is on the measures which were implemented to improve the quality of the process and the outcome. This makes the content of the analysis secondary to the applied measures. However, to provide the required context to understand why this business case can be considered to be one of Big Data & Analytics, a descriptive summary is provided in the following section.

Engagement summary

The goal of the bank’s Big Data project was to establish a Proof of Concept (POC) and demonstrate the viability of new types of key risk indicators which might be used for predicting credit and liquidity risk. The core of the engagement consisted of analysing a large bucket of historical payment transactions made by the bank’s Small and Medium Enterprise (SME) and Corporate clientele. The size of the bucket was upwards of 10 TB. An external consulting firm, KPMG, was requested to apply Big Data & Analytics to this data set in order to determine whether any useful credit and/or liquidity risk signals could be discovered, based on statistical analysis. The deliverable of the project was a written report describing the results of the search for new and better credit risk signals, so that the business could consider whether these risk signals could be applied in practice.

Project and team structure

The project’s consultant team members were made up of Big Data experts employed by KPMG Advisory N.V., as well as IT personnel and business analysts involved on behalf of the bank. To assist in setting up the Hadoop cluster, the services of GoDataDriven consultants were also procured by the bank. GoDataDriven provided data science and software engineers to set up the Cloudera Hadoop platform in which the analytics would be performed. The bank itself was responsible for providing the KPMG data scientists with access to a Hadoop cluster and the relevant transactional data.

The KPMG team members responsible for day-to-day operations included a manager (Mr. Sipho van der Putten) and an advisor (Mr. Fabian Jansen), both of whom had a technical science background, having obtained their PhDs at NIKHEF. Both team members were previously employed at CERN. Also involved in the business case was a political scientist (Mr. Sam van Hensbergen) with experience in the field of Financial Risk Management, specifically on the subjects of credit and liquidity risk. Mr. Van Hensbergen was responsible for project management, facilitating the interaction between the business and the data scientists, and providing the necessary financial expertise.

Project approach

The first step the bank took was to define a use case in which it described what it would like to research. The use case also described the reason why this research was pursued by the bank, in other words, the possible application of the answers gained. The intended deliverable was also described. The project team first built a statistical model to assess the effectiveness of the signals in the “as-is” situation, which was later combined with historical transaction data in order to perform the analytics, come up with results and draw conclusions.

All of the bank’s retail and commercial current account transactions were selected for the period 2006 to 2011. Transactions were enriched with actual customer information (the historical customer identifier numbers were updated to current customer identifier numbers). Of the approximately 4 billion transactions, 96% were updateable. Based on the available bank transaction grouping, a selection could be made of only commercial transactions (800 million).

19 NIKHEF is the Dutch National Institute for Subatomic Physics and a collaboration between the Stichting voor Fundamenteel Onderzoek der Materie (FOM), Universiteit van Amsterdam, Vrije Universiteit Amsterdam, Radboud Universiteit Nijmegen and the Universiteit Utrecht.
20 CERN is the European Organization for Nuclear Research.

A small percentage of data was not usable and was therefore discarded. Associations and foundations were excluded from the transactions to be analysed.

The team then made a choice to focus on:

• Identifying liquidity patterns and constructing a liquidity prognosis for a select client segment.

• Reconstructing an existing liquidity signal (to validate the data set)

• Optimising the credit risk signal by using liquidity patterns of the selected client segment.

• Validating the credit risk signal with default data.

The following step was to identify and label all available business transactions, as retail customers were excluded from the analysis. Without this step it would not have been possible to analyse the data set and search for correlations. Also, the labelling would be key to being able to distinguish transactions in a live environment, if this was ever decided by the bank.

The transaction code, based on the bank’s existing division of transaction types, was used as the departure point for the analysis of the data set. Another required step before any downstream processing could occur was to determine the quality and meta-content of the data which was to be analysed. As a result of this analysis, gaps were spotted. One example was the fact that the payment transaction data was not contiguous, which meant that there were gaps in the selected five-year period. As a result, wherever this was required and possible, additional data was added and existing data was enriched.

However, the team discovered that they had insufficient data to complete the labelling exercise. Due to privacy concerns, the bank had only made available data which excluded personal information. Name and address data was unavailable, as was the field used for the description of the payment transaction, because this might also contain personal information. This measure decreased the granularity with which the data could be labelled, posing a challenge to the team’s ability to sort the data into meaningful buckets. As a result, the exercise of selecting and labelling the transaction data was more fraught with difficulties. For example, due to the lack of an accompanying description field it was not possible to create more detailed labels based on the transaction type, which would have allowed the data scientists to distinguish invoice payments (for which services or products), PIN transactions (at what time, or to which beneficiary), the reason for the transaction (salary payments for temporary or permanent staff and/or consultants, VAT payments) or other useful information in relation to timing (time of day, periodicity, i.e. recurring or one-time, late payments).

In the end, the business case did prove that valid risk signals could be used to detect credit and liquidity risks. These exact signals will be left unmentioned in order to protect the bank’s intellectual property.

INTERVIEW FINDINGS

For the purpose of this study, a team member (Mr. Jansen, hereafter referred to as “the interviewee”) was selected for inquiry on the applied control measures. The main reason for selecting this team member was his very close involvement in the analytics phase of the Big Data & Analytics proof of concept project at the bank and his frequent communication with the business employees as well as the external vendor.

At the start of the interview, the applicability of the process model to the business case was confirmed with the interviewee. The risks identified by this study, along with the proposed preventive measures resulting from the literature study (hereafter referred to as controls), were discussed with the interviewee for each part of the Big Data & Analytics process. Where the discussion led to new insights, additional controls were added to the respective process steps or to the framework domain. These domains were identified as a result of organising the control framework into discernible and manageable topic areas. In the following section we describe the results of the interview.

ETL process

Capture and storage

Regarding the risk of source data not being available for analysis due to lack of capture, the interviewee indicated that the measure of formally deciding which data to record was not applicable to the bank's business case, as the data set consisted of historical payment transaction data. Based on his experience in the Big Data & Analytics field, however, this measure might prove useful in cases where the Big Data & Analytics implementation does not have a historical data set to begin with and data must first be generated or captured.

Regarding the risk of source data not being available because its makeup or origins are unknown, the interviewee informed us that in the bank's case the source data was well described and known beforehand, which enabled correct loading of the data within the Cloudera platform during the ETL phase. The interviewee did offer an interesting real-life example of where this risk would be very apparent:

“This risk also occurs when using web crawler data. Web site layouts change all the time. It is therefore a challenge to ensure that web crawlers continue to parse information correctly” – Fabian Jansen

Another risk which came up during the interview arises when using Big Data sets provided by external vendors. Online vendors aggregate and prepare certain data sets in predetermined ways and sell these on the open market. As an example, pre-analysed sentiment data sets can be obtained for Twitter messages, or tweets. It is difficult to ascertain whether this pre-processing was performed correctly, and as a result a third-party trust issue arises. Even if the data comes from a renowned party such as a government institute, e.g. the Dutch Bureau of Statistics (CBS), it remains a challenge for data scientists to determine the quality of this source data.

Regarding the risk of source data not being available at a later stage for troubleshooting, forensic or audit purposes due to a lack of retention, the interviewee informed us that this measure was implemented as part of the bank's regular IT security policy, which ensured that source data was retained and available to data scientists at a later time.

Regarding the risk of source data not being available because the data in question is in transit, the challenges in capturing described earlier in this study arise. The interviewee indicated that the continuity of live data streams, when used in Big Data analytics, must be monitored to ensure the algorithm is fed with enough data to perform its job. Since only internally generated data was used in the bank's business case, control measures related to live data were not applicable.

Conversion and transformation

In the conversion and transformation phase, the risk occurs that source data loaded into the Big Data platform (e.g. a Hadoop cluster, or in the business case, the modified version of the Hadoop Distributed File System provided by Cloudera) is processed incorrectly because the data mapping is unknown. This was not an issue in the bank's case, as the internally generated data was already available within a data warehouse environment, and the control activity of profiling the source data to understand its origin, nature and characteristics was not performed because relevant documentation, such as the data schema and data dictionary, was readily available. The interviewee informed us that this measure was applied successfully in another business case (at a large global furniture retailer), as one of the stated requirements of that implementation was that the extracted data had to be exactly the same as the utilized source data.

For the same risk, subsequent measures which this study proposes, namely identifying and documenting the interface mapping, performing job monitoring, and validating input and output of source and extracted data, were not performed within the bank's business case. According to the interviewee this measure has merit in mitigating the risk of incorrect loading and can be performed by monitoring messages from the JobTracker, Hadoop's job monitoring service. Other output validations were performed by the data scientists, such as determining MD5 checksums and performing ad-hoc queries on loaded data and comparing the results to the source.
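
To make this concrete, the sketch below shows how such a reconciliation between source and loaded data might look in Python, assuming the source extract and an export of the loaded data are available as flat files with one transaction per line; file names and the export step are hypothetical and not the bank's actual procedure.

import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 checksum of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_count(path: str) -> int:
    """Count records, assuming one transaction per line."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def validate_load(source_path: str, loaded_export_path: str) -> None:
    """Compare source extract and platform export; raise if they diverge."""
    if md5_of_file(source_path) != md5_of_file(loaded_export_path):
        raise ValueError("Checksum mismatch between source and loaded data")
    if record_count(source_path) != record_count(loaded_export_path):
        raise ValueError("Record count mismatch between source and loaded data")


# Example usage (paths are hypothetical):
# validate_load("transactions_2009_2013.csv", "hdfs_export_transactions.csv")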

Cleansing and enrichment

In this phase new data may be added, or data may be deleted or otherwise omitted from later processing. The risk arises that the data is altered in a way that adversely affects further processing or, in the worst case, invalidates it. This study proposes that in order to mitigate this risk, an audit log is maintained of all administrative actions, particularly write actions, performed on source data during cleansing and enrichment. One important prerequisite is that logical access control has been implemented on the Big Data cluster, so that user actions, for example by IT personnel in charge of maintaining the Hadoop cluster or by data scientists, can be traced back to the responsible individual.

In many Hadoop platforms it is possible to perform manual data entry and alter the loaded source data. This study proposes that input controls, such as a 4-eye principle, are introduced to mitigate the risk of unauthorized or incorrect alteration. Where possible, other input controls should also be placed on manual data entry within the MapReduce data set. This measure was not implemented for the bank's business case, but according to the interviewee it is possible to configure tables in Apache Hive so that these may not be amended without authorization.
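
As an illustration of the kind of Hive configuration the interviewee refers to, the sketch below grants analysts read-only access to a table while reserving write access for a dedicated loader role. It assumes a Hive deployment with SQL-standard based authorization enabled and uses the PyHive client; host, role and table names are hypothetical and this is not the bank's actual configuration.

from pyhive import hive

STATEMENTS = [
    # Analysts may only read; a dedicated loader role performs the writes.
    "GRANT SELECT ON TABLE payment_transactions TO ROLE analysts",
    "GRANT INSERT ON TABLE payment_transactions TO ROLE etl_loader",
]


def restrict_table_writes(host: str = "hive.example.internal") -> None:
    """Apply the illustrative authorization statements above."""
    connection = hive.connect(host=host, port=10000, username="admin")
    cursor = connection.cursor()
    for statement in STATEMENTS:
        cursor.execute(statement)
    connection.close()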

When new data is added, the risk that data is incorrectly mapped recurs, as in the step of conversion and transformation. This measure was not applied by the bank, but according to the interviewee the same principles apply.

Continuous write access to the Hadoop platform is a particularly risky situation. Due to the pilot nature of both the Big Data business case and the tooling which was applied (the Hadoop implementation sourced from Cloudera), it was possible for the data scientists to perform unauthorized changes to the uploaded data. The interviewee confirmed that this could be seen as a legitimate risk and expected that future releases of Cloudera would contain functionality to prevent it.

Analytics

Data profiling

During earlier discussions leading up to the interview, the interviewee indicated that in a Big Data & Analytics application there is a real risk that garbage in leads to garbage out in the data analytics phase. Due to the diverse nature of Big Data, no specific or explicit controls are proposed by this study, other than the high-level activity of plausibility analysis on intermediate and final outcomes by the data scientists handling the data. The interviewee confirmed that this activity was performed within the bank's Big Data implementation. Examples of how this could be performed were ad-hoc queries and reviewing, for example, the number of records which are thrown out by the system in query answers. The suggestion was made that an informal threshold for the maximum number of thrown-away records can be established, based on the capabilities of the statistical model. This threshold would be used to identify suspect answers and trigger more detailed analysis to determine whether the query answer is correct.
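
A minimal sketch of such a threshold check is shown below; the threshold value and the record counts are illustrative.

def plausibility_check(records_in: int, records_used: int,
                       max_discard_fraction: float = 0.05) -> bool:
    """Return True if the answer is plausible, False if it needs detailed review."""
    if records_in == 0:
        return False
    discarded = records_in - records_used
    discard_fraction = discarded / records_in
    if discard_fraction > max_discard_fraction:
        print(f"Suspect answer: {discard_fraction:.1%} of records discarded")
        return False
    return True


# Example: 1,000,000 input transactions, 880,000 actually used in the answer.
plausibility_check(1_000_000, 880_000)   # flags the result for detailed analysis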

Before entering the later step of statistical modelling, it is necessary to understand the nature of the data which will be fed into the statistical model, as this affects the choice of a suitable statistical model. To this purpose, this study proposes a measure which ensures that combined data loaded into the Hadoop platform is profiled to determine its characteristics and to match it to the right statistical method; examples include density checking. This also leads to the requirement that the argumentation for the chosen statistical model, based on the properties of the uploaded data, should be documented. According to the interviewee, the statistical method applied within the bank was well documented. Regarding data profiling, data dictionaries and schemas were available to the data scientists, enabling them to understand the data they were dealing with.
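
A sketch of what such profiling might look like is given below, assuming the combined data can be pulled into a pandas DataFrame; the column name and the use of a normality test as a crude density check are illustrative assumptions, not the bank's procedure.

import pandas as pd
from scipy import stats


def profile_column(series: pd.Series) -> dict:
    """Summarise missing values, cardinality and rough distribution shape."""
    profile = {
        "missing_fraction": series.isna().mean(),
        "distinct_values": series.nunique(),
    }
    if pd.api.types.is_numeric_dtype(series):
        clean = series.dropna()
        profile["skewness"] = clean.skew()
        if len(clean) >= 20:
            # A low p-value suggests methods assuming normally distributed
            # data may be a poor fit for this column.
            _, p_value = stats.normaltest(clean)
            profile["normality_p_value"] = p_value
    return profile


transactions = pd.DataFrame({"amount": [10.0, 25.5, 3000.0, 12.0, None]})
print(profile_column(transactions["amount"]))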

One of the risks identified by the literature study is controlling for the evolution of the ontology of processed data. If the underlying ontology changes, the applied statistical model might be invalidated. According to the interviewee this proposed control, while not applied at the bank, has merit:

“A good example is the Twitter API, whereby not everything a data scientist should be aware of is described within the available documentation. For example, if you use Twitter GPS information, you should be aware of the fact that only about 1% of Dutch Twitter users use GPS data when posting tweets and account for these constraints.” – Fabian Jansen

Statistical modelling

When entering the process step in which a choice has to be made for the statistical method used for analysis, the risk arises that the chosen method is invalid for the data set or for the nature of the problem which the Big Data & Analytics implementation is trying to solve. This study therefore proposes as a quality assurance measure that, prior to and perhaps also after analysis, validation of the statistical model and method is performed by experts. This activity was not performed for the bank's business case, which relied instead on the knowledge and experience of the involved data scientists.

According to the interviewee, the error margin is an extremely important part of any data analytics answer, as it is an important determinant of how the answer should be interpreted and which sound and justified decisions can be made based on it. In his opinion, the error margin might be more important to understand than the answer itself, as the error margin provides information on the reliability and uncertainty of the statistical outcome. The measure therefore has merit in his view.
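
To illustrate the point, the sketch below reports a simple analytics answer together with a bootstrap confidence interval as its error margin; the data and the statistic (a mean payment amount) are made up and do not reflect the bank's model.

import random


def bootstrap_interval(values, statistic=lambda xs: sum(xs) / len(xs),
                       iterations: int = 2000, confidence: float = 0.95):
    """Return the point estimate and a bootstrap confidence interval."""
    estimates = []
    for _ in range(iterations):
        resample = [random.choice(values) for _ in values]
        estimates.append(statistic(resample))
    estimates.sort()
    lower = estimates[int((1 - confidence) / 2 * iterations)]
    upper = estimates[int((1 + confidence) / 2 * iterations) - 1]
    return statistic(values), (lower, upper)


amounts = [120.0, 80.5, 95.0, 110.0, 300.0, 87.5, 102.0, 99.0]
answer, (low, high) = bootstrap_interval(amounts)
print(f"mean amount {answer:.2f}, 95% interval [{low:.2f}, {high:.2f}]")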

Another risk is that, while the chosen statistical method or model might be valid and have a justifiable scientific basis, the performance of the model is not up to expected standards. This might be caused by minor configuration details or otherwise. This study proposes back-testing of the model, based on historical data (when available), as one of the ways to ascertain the statistical model's performance. If historical data is not available, a standard sample can be developed. While this measure was not implemented in the selected business case, it was applied in another business case (a large global furniture retailer) and according to the interviewee is a useful control measure for mitigating this particular risk.
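
A minimal sketch of such a back-test is shown below. The rule being tested (flagging a customer when the monthly balance drop exceeds a threshold) and the hold-out data are purely illustrative assumptions and not the bank's actual risk signal; the point is that model output is scored against known outcomes from a historical period.

def backtest(history, threshold: float = 0.4):
    """history: list of (balance_drop_fraction, defaulted) tuples from a
    historical hold-out period; returns precision and recall of the rule."""
    true_pos = false_pos = false_neg = 0
    for balance_drop, defaulted in history:
        flagged = balance_drop > threshold
        if flagged and defaulted:
            true_pos += 1
        elif flagged and not defaulted:
            false_pos += 1
        elif not flagged and defaulted:
            false_neg += 1
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


holdout = [(0.6, True), (0.1, False), (0.5, False), (0.7, True), (0.2, False)]
print(backtest(holdout))  # compare against the performance expected of the model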

Development of algorithms and queries

In the case of applied algorithms or queries, there is the risk that query responses are not valid. This study has advanced the (automatic) detection of invalid queries as a mitigating measure. For example, as in the activity of plausibility analysis, thresholds can be defined for the maximum number of records which may be omitted in a system response to a query. If a query or algorithm has been automated, it makes sense to automate this monitoring as well where possible.

Another recurring risk is the evolution of the ontology of processed data. This study advances the measure that this evolution should be monitored to ensure that changes in ontology do not affect the applied statistical model and hence the algorithm's performance. According to the interviewee, the data scientists' activity of debugging processed data has merit. While not a formal measure in the bank's business case, the involved data scientists did perform it in practice as part of their work. In the interviewee's experience, data scientists usually make extensive use of intermediary visualisation, including multitudes of histograms, to comprehend the nature of the data, and it is sensible to perform this activity throughout the project.

Because the codification of the statistical method in essence involves programming, the same risks can occur as are usually mitigated by program development controls. According to the interviewee, most current Big Data implementations are part of research and development and not of live production systems.

“Complete Program Development controls are only of value when Big Data goes live and might form a costly administrative hurdle in case an organisation is only researching causal or correlative links in a test environment as part of R&D” – Fabian Jansen

While this measure was not covered in all aspects in the bank case, version control was implemented for a more recent Big Data implementation (at a large global furniture retailer). This choice was made by the team because, during the engagement, when the data scientists unleashed a developed query on their data, the results were oftentimes different than expected due to the particular properties of the data and special cases which needed to be excluded. It was important for troubleshooting that a consistent audit trail was available to analyse and monitor the evolution of queries. The interviewee informed us that debugging (testing) of algorithms is part of KPMG best practice and was applied in the bank case as well.

Decision making

Visualisation

After the analytics phase, the next phase concerns acting upon the analysed data. Before doing so, the results must first be understood and communicated by the data scientists to the stakeholders and especially the decision makers. These reports contain visualisations which must be interpreted by the Big Data & Analytics end-users in the business. According to the interviewee, as we shall also see for another proposed control, the risk exists that the visualisation does not convey sufficient information to completely understand the results and form a well-informed decision. This study proposes as a control measure to check whether all necessary data points have been used. According to the interviewee, visualisation good practices are available and should be applied to correctly portray analytics outcomes. Within the bank's business case this was also applied informally as part of the data scientists' method.

Another risk is that an invalid visual portrayal is used for the analytics answers, for example by using a logarithmic scale without explicit mention thereof, whereby the end-user might mistake the graph for portraying linear results and base a decision on this assumption.
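
A small matplotlib sketch of the corresponding mitigation, explicitly labelling the logarithmic scale so it cannot be mistaken for a linear one, is shown below; the data values are made up.

import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. on a server
import matplotlib.pyplot as plt

transaction_volumes = [10, 100, 1_000, 10_000, 100_000]
months = ["Jan", "Feb", "Mar", "Apr", "May"]

fig, ax = plt.subplots()
ax.plot(months, transaction_volumes, marker="o")
ax.set_yscale("log")                       # the scale choice itself...
ax.set_ylabel("Transactions (log scale)")  # ...and its explicit mention
ax.set_title("Monthly transaction volume (note: logarithmic y-axis)")
fig.savefig("transaction_volume_log_scale.png")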

Visuals, especially complex ones, might also be experienced as confusing by the user, leading to the risk of misinterpretation. To prevent this, this study proposes that the less-is-more principle is applied, whereby a minimalistic approach is taken: the complete required set of data points is included, but no more than that. This can be enforced by way of monitoring.

In the bank's business case, the interviewee indicated that interaction and discussion between the data scientists and the business owners occurred multiple times each week. For each visualisation, an explanation was provided to the business by the data scientists, and the business was given the opportunity to ask questions about the graphical representation and what it represented. This allowed the data scientists to gauge the business owners' understanding.

“As part of KPMG best practice in Big Data projects, one of the goals must be that the business understands the product of the data scientists. It is therefore the responsibility of the data scientists to explain analytic results as objectively and as understandably as possible to end-users” – Fabian Jansen

While this may be an ethical consideration, the above of course does not exonerate the business from its own responsibility to provide its employees with the right kind of training and tooling when they are Big Data & Analytics end-users. To mitigate the risk of misinterpretation, this study proposes that employee training is performed and tooling is provided for understanding the Big Data & Analytics process and its visualisations. According to the interviewee, the business end-user who is responsible for making a decision based on the outcome of analytics must be able to engage SMEs who can explain important aspects or challenge their assumptions.

To prevent the inadvertent or wilful introduction of bias in visualisation, which might skew users towards a certain direction of interpretation and alter their decision, a suggestion was made that an alternative visualisation of the Big Data & Analytics outcome should always be available for comparative purposes. According to the interviewee, this measure might prove useful but would impose an extra work burden. It was not applied within the bank's business case.

Interpretation

As with the step of visualisation, another requirement is that the team members on the business side must be capable, at least to an adequate extent, of understanding statistical graphs. The same goes for the uncertainty level which is coupled to the analytics output. Low-level end-user courses can be provided for that purpose within Big Data engagements. According to the interviewee this was not applied within the bank's business case, but it would be useful if end-users let themselves be advised by experts, and training might be warranted where a gap analysis establishes the need for it.

During the interview, one of the possible measures which came up was the suggestion to put in place a conflict of interest and segregation of duties between the data scientists responsible for analysing data, the person responsible for creating the report and the associated visualisations, and the business end-user who uses the report to arrive at a decision.

The risk of bias in visualisations also occurs in the interpretation phase. Visualisations should be challenged for inducing bias or skewing the direction of decision making by end-users. This measure was successfully implemented in the bank's business case, where the visualisations were challenged by the team itself as a result of continuous interaction between the data scientists and the business. According to the interviewee, this resulted in mistakes being discovered and deviations being explained. This process was documented by way of an issue tracker.

As a suggestion to further improve on visualisation, a proposal was made to create a conflict of interest between the employee responsible for producing the visualisation and the report and the end-user or employee responsible for making decisions based on Big Data. This can be achieved by, for example, having the data scientists' reports and visualisations monitored by a third party or a peer.

Within the interpretation step, there is always the risk that the conclusion is incorrect, even with all the prior control measures in place. Science has taken to peer review as a way of mitigating the risk that falsehoods enter the knowledge base of the scientific community. This is a challenge in the case of Big Data: as previously argued by this study, the inherent commercial value of Big Data & Analytics answers often leads to restricted access for stakeholders who do not share in bearing the costs. According to the interviewee, the report provided by data scientists to business end-users is not always technical in nature (due to the latter's inability to understand it), making it difficult for e.g. auditors to evaluate whether the proper scientific method was applied. In another business case (a global furniture retailer), KPMG did provide a technical report which functioned as a form of assurance and will be reviewed (not audited) by an independent third party (a Big Four auditing firm). The interviewee shared another common practice in the scientific community: the keeping of personal scientific logbooks by data scientists. As a suggestion, these logbooks can be formalised and challenged during a Quality Assurance session.

Automation

In the case of scheduled or automatically operating algorithms, periodic evaluations are performed to ascertain that the model can still deal with the source data, i.e. monitoring is performed to ensure the algorithms are operating within their designed envelope. No live algorithm was put into place for the bank's business case, as the engagement consisted of a proof of concept. According to the interviewee, automated Big Data merits monitoring due to the risk that changes to algorithm input result in invalid outputs which go undetected.
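
The sketch below illustrates one way such monitoring could be set up: comparing simple statistics of an incoming batch against the envelope the model was designed for, and alerting when the input drifts outside it. The envelope bounds and field semantics are illustrative assumptions.

import statistics

DESIGN_ENVELOPE = {
    "min_records": 10_000,         # the model needs enough data to be meaningful
    "mean_amount": (50.0, 500.0),  # range observed during model development
}


def within_envelope(amounts) -> bool:
    """Return False and print an alert when the batch falls outside the envelope."""
    if len(amounts) < DESIGN_ENVELOPE["min_records"]:
        print("Alert: batch too small for the model's designed envelope")
        return False
    low, high = DESIGN_ENVELOPE["mean_amount"]
    mean_amount = statistics.mean(amounts)
    if not low <= mean_amount <= high:
        print(f"Alert: mean amount {mean_amount:.2f} outside designed range")
        return False
    return True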

Governance of Enterprise IT

The literature review of this study identified the risk of misalignment between Big Data endeavours and the goals of the company. As a mitigating measure, it is proposed that organisations develop a Big Data strategy to ensure this alignment. The interviewee did not know whether a separate Big Data & Analytics strategy was in place at the bank or produced as part of the business case for the project. However, the communication section of the bank's website and an article published on the internet21 confirmed that the bank had started in 2011 with the development of multiple Big Data initiatives as part of its Big Data & Analytics strategy.

Big Data requires sizeable investments in resources and effort, as this study has also shown. The risk therein lies that insufficient sponsorship is available to make those resources available to Big Data teams or departments which are exploring the field. What is required is a sponsor for Big Data at C-level to align business, IT and Big Data & Analytics strategy. According to the interviewee, an organisational Big Data & Analytics sponsor was not formally appointed for the bank's case, but the sponsorship was perhaps assigned as a role within the project.

Due to the sizeable investment, the risk exists that the firm's IT infrastructure is unable to meet the demands of a Big Data implementation if it is not supported with money, skills and resources. What is required is the choice of a suitable Big Data IT architecture which achieves the required performance levels; this can be achieved by, for example, information technology capacity management. With regard to this study's business case, the bank had chosen to implement a separate platform supported by an external vendor (GoDataDriven). Mr. Jansen suggested that costs are an inhibiting factor in acquiring technological capacity for Big Data analytics and that this often leads to the choice of outsourcing. In the case of the bank he did not know whether the IT policy was applicable, but KPMG performed an assessment of the analytics platform which was made available to the data scientists, and this was determined to be adequate in relation to the data to be processed. However, he did not encounter any SLAs with Cloudera, which provided the platform and for which GoDataDriven acted as an intermediary. It was unknown whether these external vendors were governed by Service Level Agreements and Service Level Management controls.

Concerning IT, a familiar risk is that of unauthorised access to IT assets and underlying data. To mitigate this, user access control measures must be in place to prevent unauthorised access to both the original sources and the processed data sets. This study proposes that Role Based Access Control be applied to ensure that only authorised users (i.e. data scientists and/or IT personnel) can access data, in order to prevent unauthorised or incorrect manipulation. The interviewee confirmed that access control security measures were in place for both internal bank employees and the externally recruited data scientists from KPMG.

Payment transactions often contain information on individuals, and this was also the case in this business case. Improper use of personal data in Big Data & Analytics source sets may lead to illegal processing and invalidation of the use of Big Data & Analytics results. To mitigate the risk of privacy breaches and non-compliance with applicable privacy laws, this study proposes that privacy enhancing techniques, such as encryption of all data and other Privacy Enhancing Techniques, are applied to ensure that personal data is guarded. Suitable privacy controls such as masking and/or encryption must be in place in case source data contains personal data which might be traced back to individual contributors to the data set. A data classification policy must be in place in order to identify and classify (sensitive) personal data in data sets so that it may receive the proper treatment during processing. The interviewee was not privy to whether the IT organisation or the vendor which hosted the platform implemented all of these measures, but one privacy enhancing technique, which consisted of removing personal data from the source data set (i.e. masking), was performed to ensure that the personal data was guarded.
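
The sketch below illustrates what such masking could look like in practice: pseudonymising an account holder field with a keyed hash and dropping the free-text description field. Field names are hypothetical and this is not the procedure actually used at the bank.

import hmac
import hashlib

SECRET_KEY = b"replace-with-a-key-from-a-key-management-system"


def pseudonymise(value: str) -> str:
    """Keyed hash: consistent across records, not reversible without the key."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record: dict) -> dict:
    masked = dict(record)
    masked["account_holder"] = pseudonymise(record["account_holder"])
    masked.pop("description", None)   # free-text field may contain personal data
    return masked


print(mask_record({"account_holder": "J. Jansen",
                   "description": "salary March",
                   "amount": 2500.00}))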

When processed data consists of personal data which is sensitive to re-engineering of individual contributions to the data set, this study proposes that the chosen statistical method must comply with the requirements of differential privacy. Mr. Jansen was not familiar with this technique, though he was familiar with related work of the Max Planck Institute on the proposed control measure, and referred us to their publications.
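
For illustration, the sketch below shows the Laplace mechanism that is commonly used to achieve differential privacy: noise calibrated to the query's sensitivity and a privacy budget epsilon is added to an aggregate answer, limiting what can be inferred about any individual contributor. The values are illustrative only.

import math
import random


def laplace_noise(sensitivity: float, epsilon: float) -> float:
    """Sample Laplace(0, sensitivity/epsilon) via inverse-CDF sampling."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def private_count(true_count: int, epsilon: float = 0.5) -> float:
    # A counting query has sensitivity 1: one individual changes it by at most 1.
    return true_count + laplace_noise(sensitivity=1.0, epsilon=epsilon)


print(private_count(4213))   # noisy answer; a smaller epsilon gives more noise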

As became apparent with this business case, the bank achieved the goals of its proof of concept by bringing in external vendors. This study proposes that when such services (InfoaaS) are bought from a Cloud vendor, suitable Cloud controls must be implemented. The interviewee was not aware whether Cloud controls were implemented in the case of the contracted external vendor (Cloudera, via GoDataDriven) in this particular business case.

21 The link to the article is known to the author of this study. In order to keep the bank involved in the selected business case anonymous, it is not mentioned in this paper.

People

During the interview, the interviewee indicated that no training was performed for the business end-users in relation to the bank's business case, and that, as far as he was aware, no training was performed for the involved IT employees either, because the activity of setting up the Big Data & Analytics platform was outsourced to a third party (GoDataDriven). For controls related to the People domain, this study refers to the related measures identified within the steps of the Big Data & Analytics process, which include training of end-users.

Technology

Within the technology domain, one of the risks identified was that the programming language might not be a good fit for coding statistical queries. As a control measure, a proposal is made that before starting with Big Data & Analytics, an informed choice is made for the most fit-for-purpose programming language, based on measurable requirement criteria of the business study in question. The interviewee informed us that the bank had provided a technical solution / platform for the intended Big Data proof of concept without input from KPMG. Python and C++ were available to code the algorithms. Prior to this, KPMG had evaluated whether these languages were fit for purpose in relation to the business case, with a positive result. In addition, KPMG requested and was authorized to install ROOT22, an object-oriented framework for large-scale data analysis.

Another risk in this domain is that the vendor's external tooling is unable to provide the service level necessary to achieve the intended analytics results. As a quality assurance control measure, this study proposes that the sourcing strategy which is in place for off-the-shelf applications or Cloud services is also applied to vendors providing Big Data implementations. The interviewee could not confirm whether this measure was applied in the bank's business case. He informed us that there are open source libraries available to assist in Big Data implementations, which are often used in the case of self-development.

“There are vendors which offer off-the-shelf Big Data & Analytics solutions. Examples are Tabulate or Qlickview. The Tabulate platform is an easy and flexible way to present summary data in tabular form. Tables are built from grouping columns, analysis columns, and statistics keywords. Qlickview is a Business Intelligence and Visualisation software package. But these solutions are not always a good fit for your problem or offer complete enough functionality. Data scientists therefore often develop their own tooling” – Fabian Jansen

In the continuously evolving space of Big Data & Analytics software solutions, this study has identified the need for tooling which is user-friendly to business end-users. Much of what is now bespoke development by data scientists may become standardized in the future. A proposal is made to ensure the selection of a software tool which matches the abilities of the intended Big Data & Analytics user. Due to the proof of concept nature of the bank's business case, the data scientists still had a prominent role in researching the problem. The interviewee informed us that in current practice, the building of tooling is left to the data scientists.

“It is the responsibility of the scientist to speak the language of the business user. In order to be successful, close cooperation between the business and data scientists is necessary. This alleviates the need for tooling which is user friendly enough so the end-user might be able to use it” – Fabian Jansen

22 ROOT is a C++ replacement of the popular PAW program developed at CERN

CHAPTER V. CONCLUSIONS AND RECOMMENDATIONS

CONCLUSION

Through interviews with subject matter experts and a literature review, this study determined the conceptual steps of the Big Data & Analytics process and the challenges which the distinguishing properties of Big Data (Volume, Variety, Velocity and Veracity) pose for these steps. This study also proposed a number of counter-measures to mitigate the identified risks within the process and the surrounding IT environment and organisation, and mapped these to a balanced scorecard.

This study also shed light on whether the applied measures achieved positive results in practice by assessing whether the control measures were applied or otherwise present in a real-life business case at a Dutch bank. The results for individual controls are listed in the control table included in the appendix and marked as such. Based on a review of the empirical data, this study concludes with some general remarks on the traditional measures encountered and provides some key observations on the uncovered novel measures.

The data obtained during the empirical research has demonstrated that within the selected business case, many traditional controls were in place, such as those for maintaining security and logical access. The need for proper Governance of Enterprise IT as an enabler has also been established. At the operational level of IT, controls resembling traditional interface and conversion controls, such as output validation by way of reconciliation, were also in place and were used both in the ETL phase and in the data scientists' sandbox in the Analytics phase. In the same phase, managed development of algorithms is called for. While a Big Data & Analytics strategy has been called for, alongside the appointment of a sponsor at board level, this might be seen as a further specification of the IT strategy and of the responsibilities of the CIO.

More interesting were the more novel measures which were implemented in the business case, which might not be as well embedded in the current practice of the business, its risk management function or its IT Governance. What has become apparent is that iterative communication between the business and the data scientists is required to ensure that the understanding of the statistical methods used in Big Data analytics, the resulting answers along with their uncertainty, their visualisations and their inherent complexity is complete, and that any misunderstanding is resolved.

Intimate knowledge of statistics and the validation of the chosen method in the Big Data & Analytics process are not usually part of the current skill set of the IT auditor or of business employees. While this control measure was not present in the selected business case, this study's literature research has demonstrated the importance of involving the right subject matter experts, who may challenge and validate the chosen scientific method within the analytics steps. Drawing from the field of Financial Risk Management, Model Validation measures are proposed to increase the validity of Big Data & Analytics answers.

The increased reliance on the science of applied statistics and its visualisation has also introduced new risks of misapplication or misinterpretation, negatively impacting the outcomes of the Big Data & Analytics process. As a novel measure, this study has proposed creating a conflict of interest between the data scientists responsible for coming up with the answers, the persons responsible for producing the end-user report and visualisations, and the business end-users, in order to prevent bias, mistakes or coercion. This might be achieved by segregation of duties, however these might be imposed in practice.

A number of the implemented control measures were shown not to be effective or necessarily of value in practice. The more stringent requirements of full Program Development controls, such as end-user acceptance testing, might very well form a costly burden if the Big Data research is only used to provide a proof of concept and the algorithm does not enter a live environment.

RECOMMENDATIONS

A number of proposed control measures were not implemented in the chosen business case. These are also listed in the control framework in the appendix and marked as such. Of these, the interviewed subject matter experts were of the opinion that some might be effective if put into practice. Where that is the case, the question has been answered with yes in the control table.

For these controls, however, further research is necessary to determine their validity in practice. An important one is the education of business end-users. While the theoretical literature study has uncovered the need for skills and training, the bank in the selected business case did not apply this measure and relied upon the existing skills of its employees and those of externally hired consultants. This might have been caused by the pilot nature of the case. Based on the literature review and results in other fields, this study recommends that others setting up a Big Data use case carefully consider whether training and tooling are required to help achieve the intended goals.

It was also known from the moment the business case was selected that, while privacy measures were considered and applied (and had a considerable effect on downstream steps of the process), no consideration of the ethics of the Big Data business case was performed. This study's literature review demonstrated that ethical considerations are a mandatory part of other research fields. As a proposed measure, this study finds it necessary to appoint a committee charged with maintaining ethical processing and application of Big Data use cases, in addition to considering compliance with privacy or legal requirements. Additional research might uncover how this committee should be composed and how it can contribute maximum value by limiting the risk of unethical processing. Coming back to the topic of privacy, additional research is required to determine the effectiveness of technical privacy enhancing techniques such as the differential privacy identified in this study.

SUGGESTIONS FOR FURTHER RESEARCH

This study hopes to have provided a stepping stone for further research and recommends that further business cases are explored or, better yet, that the proposed measures are implemented in Big Data & Analytics applications which have yet to be started and their results measured with more precision, perhaps by comparing the results to a control group of business cases in which the measures have not been implemented.

While its relevance in the field of Financial Risk Management is proven, it is worthwhile to study how exactly Model Validation might be performed on a real-life Big Data & Analytics use case. This study has drawn conclusions based on their similarities, but knowledge and insight might be uncovered by identifying where Model Validation approaches (should) differ between the fields of Financial Risk Management and Big Data & Analytics.

Another interesting tangent would be a deep-dive study of the development of Big Data & Analytics algorithms, to determine to what extent formal software development methods could form a hindrance to the creative process and the need for flexibility inherent to the iterative nature of the process. As part of this follow-up research, the applicability and efficacy of Agile development methods should be considered.

Visualisation was shown to have an important role in Big Data & Analytics and is therefore worthy of further study. Answers should be sought to questions such as what constitutes better practice for displaying Big Data & Analytics information, or which end-user visualisation tooling performs best for which use case, along with any other questions which might arise from machine-human interaction and data scientist-business user interaction and might be relevant to the fields of neuroscience and human information processing.

CHAPTER VI. ANSWER TO THE RESEARCH QUESTION

In what manner and under which conditions is it feasible to apply effective control measures to Big Data implementations and gain increased certainty over the validity of data analytics outcomes?

The answer to the question whether control measures can be applied is: yes. Quality assurance control measures can be applied to the Big Data & Analytics process in order to improve the quality of its outcome. This study has established that for the studied, richly described business case, numerous controls were either formally or informally implemented to increase the quality of the analytics answers.

It has been established that these measures must be applied both prior to and during the execution of the process itself: prior, to ensure that the right baseline is in place within the domains of People, Technology and Governance of IT; and during, to ensure that corrections can be applied in the ETL, Analytics and Decision Making phases. Applying corrections only after any of these steps would require revising all subsequent steps, which is often costly or practically impossible.

• Which unique properties distinguish Big Data, what is the analytical process involved, and what are its critical organisational aspects (Description)

This paper has elaborated on the Big Data process, its inputs and outputs and the risks associated with each iterative step, which resulted in the process model presented earlier in this study.

The challenges which the impact of Volume, Velocity, Variety and Veracity poses on the process steps have been described, along with the risks arising from these and the requirements which they place on a firm's IT infrastructure, architecture and governance.

• Which control measures are appropriate for Big Data implementations, and based on a real-life business case study, what control measures have been applied and found effective? (Analysis)

And

• Can controls strengthen the validity of decisions based on Big Data & Analytics by increasing the quality of the outcome? (Reflection)

The business case study revealed a number of quality assurance measures which were applied and contributed positively to the analytics outcome. These controls are listed in the appendix. The limitations of this study are that the actual decisions were not investigated and that only one business case, albeit richly described, was selected for study.

CHAPTER VII. BIBLIOGRAPHY

Accenture. (2011). How Big Data can fuel bigger growth.

Accenture. (2012). Building the Foundation for Big Data.

Accenture. (2013). Data Monetization in the age of Big Data.

Accenture. (2014). Why Big Data Needs Visualisation to Succeed.

Aggarwal, S., & Kumar, M. (2003). Debuggers for Programming Languages. In The Compiler Design Handbook: Optimizations and Machine Code Generation (pp. 295-327). Boca Raton, Florida: CRC Press, 2003.

Amar, R., Eagan, J., & Stasko, J. (2005). Low-Level Components of Analytic Activity in Information Visualization. INFOVIS '05: Proceedings of the 2005 IEEE Symposium on Information Visualization (p. 15). Georgia Institute of Technology.

ASQ. (2014). Overview of Quality Assurance and Quality Control. Retrieved July 27, 2014, from the American Society for Quality website: http://asq.org/learn-about-quality/quality-assurance-quality-control/overview/overview.html

Benham, A., Edwards, E., et al. (2012). Web of Deceit: Misinformation and Manipulation in the Age of Social Media. Information Today, Inc. Medford, NJ, USA.

Berry, D. M. (2011). The Computational Turn, Thinking about the Digital Humanities. Culture Machine Volume 12.

Beyer, M. L. (2013). The Importance of 'Big Data': A Definition. Gartner.

Boritz, J. (2005). IS practitioners’ views on core concepts of information integrity. International Journal of Accounting Information Systems, 260– 279.

Bostrom, N. (2003). Ethical Issues in Advanced Artificial Intelligence. Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence, 12-17.

Boyd, D., & Crawford, K. (2012). Critical Questions for Big Data. Information, Communication & Society, 662-679.

Brown, E. (2010, April 23). Computerized Front-Running. Retrieved July 27, 2014, from CounterPunch: http://www.counterpunch.org/2010/04/23/computerized-front-running/

Bryant, R. E., Katz, R. H., & Lazowska, E. D. (2008). Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society. Computing Community Consortium.

Canny, J., & Zhao, J. (2014). Big Data Analytics with Small Footprint: Squaring the Cloud. Berkeley: University of California.

CBS News. (2006, April 20). Bush, the Decider in Chief. Retrieved from the CBS News website: http://www.cbsnews.com/news/bush-the-decider-in-chief/

Ceri, S., Della Valle, E., & Pedreschi, D. a. (2012). Mega-modeling for Big Data Analytics. Conceptual Modeling, 31st International Conference.

Chan, L., Peters, G., Richardson, V., & Watson, M. (2012). The Consequences of Information Technology Control Weaknesses on Management Information Systems: The Case of Sarbanes-Oxley Internal Control Reports. MIS Quarterly, 179-203.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences 275, 314–347.

Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data To Big Impact. MIS Quarterly Vol. 36 No. 4, pp. 1165-1188.

Chen, S. (2010). Cheetah: A High Performance, Custom Data Warehouse on Top of Mapreduce. Proceedings of the VLDB Endowment, Vol. 3, No. 2. Singapore.

Chung, W. S. (2011). The impact of cloud computing on financial statements. Compact Magazine.

CMMI Product Team. (2010). CMMI for Development, Version 1.3. Software Engineering Institute.

Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J., & Welton, C. (2009). MAD Skills: New Analysis Practices for Big Data. Proceedings of the VLDB Endowment Volume 2 Issue 2, August 2009, (pp. 1481-1492).

Committee of Sponsoring Organizations of the Treadway Commission. (2004). Enterprise Risk Management - Integrated Framework. Retrieved from http://www.coso.org/Publications/ERM/COSO_ERM_ExecutiveSummary.pdf

Davenport, T. H., Barth, P., & Bean, R. (2012). How 'Big' Data is different. MIT Sloan Management Review, 22-24.

De Veaux, R. D., & Hand, D. J. (2005). How to Lie with Bad Data. Statistical Science, 231-238.

Devins, R. M. (1865). Cyclopædia of commercial and business anecdotes: comprising interesting ... D. Appleton and company.

Dwork, C., & Lei, J. (2009). Differential Privacy and Robust Statistics. STOC’09, May 31–June 2, 2009. University of California, Berkeley (Department of Statistics) & Microsoft.

Eppler, J. M., & Mengis, J. (2004). The Concept of Information Overload: A Review of Literature from Organization Science, Accounting, Marketing, MIS, and Related Disciplines. The Information Society: An International Journal, 325-344.

Equens S.E. (2013, March 1). Een mijlpaal: 1 miljard SEPA-transacties door Equens verwerkt [A milestone: 1 billion SEPA transactions processed by Equens]. Retrieved June 9, 2014, from the Equens Blog: http://blog.equens.com/2013/03/een-mijlpaal-1-miljard-sepa-transacties-door-equens-verwerkt/

European Parliament and the Council of the European Union. (1995, October 24). Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Communities, 31-50. Retrieved July 24, 2014, from EUR-Lex: Access to European Union law: http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:31995L0046

Ferguson, M. (2012). Architecting A Big Data Platform for Analytics. Intelligent Business Strategies.

Fernandez-Medina, E., & Trujillo, J. R. (2006). Access control and audit model for the multidimensional modeling of data warehouses. Decision Support Systems 42, 1270-1289.

Friendly, M., & Denis, D. J. (2009). Milestones in the history of thematic cartography, statistical graphics, and data visualization. Retrieved July 25, 2014, from http://euclid.psych.yorku.ca/SCS/Gallery/milestone/

Gartner. (2011). Hype Cycle for Business Intelligence. Gartner.

Gartner. (2014, January 30). By 2015, 25 Percent of Large Global Organizations Will Have Appointed Chief Data Officers. Retrieved from Gartner Newsroom: http://www.gartner.com/newsroom/id/2659215?fnl=search.

Gigerenzer, G., & Selten, R. (2002). Bounded Rationality: The Adaptive Toolbox. Massachusetts Institute of Technology.

Gudipati, M., Rao, S., Mohan, N., & Gajja, N. (2013). Big Data: Testing approach to overcome Quality Challenges. Infosys Lab Briefings Vol 11 No 1, 65-72.

Harris, D. (2012, June 14). Netflix analyzes a lot of data about your viewing-habits. Retrieved June 9, 2014, from GigaOM: http://gigaom.com/2012/06/14/netflix-analyzes-a-lot-of-data-about-your-viewing-habits/

Hitzig, N. B. (1992). Audit Sampling: A Survey of Current Practice. The CPA Journal, Vol. 65, No. 7.

Hubbard, D. (2011). Pulse: The New Science of Harnessing Internet Buzz to Track Threats and Opportunities. John Wiley & Sons.

Huff, D. (1954). How to Lie with Statistics. New York: Norton.

IBM. (2013). Analytics, a Blueprint for Value.

IBM. (2013). Enterprise Information Protection - The Impact of Big Data. Business Intelligence Strategies.

IBM. (2013, April 04). Risk Analytics Applications and Vision: An example of applying BIG data to mitigate an electric grid issue. Retrieved August 8, 2014, from Stanford University Lectures: http://web.stanford.edu/class/archive/ee/ee392n/ee392n.1134/lecture/apr23/Talk_ChrisCouper.pdf

IBM. (2014). The new hero of big data and analytics. IBM Institute for Business Value.

Intelligent Business Strategies. (2012). Architecting A Big Data Platform for Analytics.

International Standards Organisation. (2014). International Standards Organisation web site. Retrieved from http://www.iso.org/iso/qmp_2012.pdf

ISACA. (2014). Controls and Assurance in the Cloud: Using COBIT 5.

Jonker, R., & Pols, A. (2014). Datakwaliteit, basis voor gezonde bedrijfsvoering en kostenbesparing: een win-win situatie [Data quality, the basis for sound business operations and cost savings: a win-win situation]. Compact_ 2014 Vol. 2, 6-12.

Jordan, J. (2013, October 20). The Risks of Big Data for Companies. Retrieved July 31, 2014, from The Wall Street Journal: http://online.wsj.com/news/articles/SB10001424052702304526204579102941708296708

Kaplan, R. S., & Norton, D. P. (1993). Putting the Balanced Scorecard to Work. Harvard Business Review.

KD Nuggets. (2013, August). Poll: Languages for analytics/data mining (Aug 2013). Retrieved August 8, 2014, from KDnuggets: http://www.kdnuggets.com/polls/2013/languages-analytics-data-mining-data-science.html

Keim, D., Huamin, Q., & Kwan-Liu, M. (2013). Big-Data Vizualisation. IEEE Computer Graphics and Applications, 50-51.

Keim, D., Kohlhammer, J., Ellis, G., & Mansmann, F. (2010). Mastering the Information Age: Solving problems with visual analytics.

Kennett, R. S., & Raanan, Y. (2011). Operational Risk Management: A Practical Approach to Intelligent Data Analysis. John Wiley & Sons Ltd.

Kondylakis, H., & Plexousakis, D. (2012). Ontology Evolution: Assisting Query Migration. Conceptual Modeling 31st International Conference, ER 2012, (pp. 331-344).

Konold, C. (1999). Statistics goes to school. Contemporary Psychology, 81-82.

KPMG. (2011). Model Validation: Mitigating Model Risk.

KPMG. (2013). Big Data and Analytics.

KPMG. (2014). Going beyond the data.

KPMG. (2014). Navigating Big Data’s Privacy and Security Challenges.

Laney, D. A. (2001, February 6). 3D data management: Controlling data volume, velocity and variety. Retrieved from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

Langley, G., Moen, R., Nolan, K. N., & Norman, C. P. (2009). The Improvement Guide: A Practical Approach to Enhancing Organizational Performance.

Laptev, N., Zeng, K., & Zaniolo, C. (2012). Early Accurate Results for Advanced Analytics on MapReduce. Proceedings of the VLDB Endowment Vol 5. No. 10, 1028 - 1039.

Laudon, K., & Laudon, J. (2010). Management information systems: Managing the digital firm. Pearson Prentice Hall.

Lawrence, G. W., Kehoe, W. R., Rieger, O. Y., Walters, W. H., & Kenney, A. R. (2000). Risk Management of Digital Information: A File Format Investigation. Washington: Council on Library and Information Resources.

Leinweber, D. (2007). Stupid Data Miner Tricks: Overfitting the S&P500. Caltech.

Lumineau, N., Laforest, F., Grippay, J., & Petit, J. (2012). Extending Conceptual Data Model for Dynamic Environment. Conceptual Modeling, 31st International Conference, (pp. 242-251). Florence, Italy.

Manovich, L. (2011). Trending: The Promises and the Challenges of Big Social Data. Retrieved May 18, 2014, from http://www.manovich.net/DOCS/Manovich_trending_paper.pdf.

McAfee, A., & Brynjolfsson, E. (2013). Big Data: The Management Revolution. Harvard Business Review.

McKinsey. (2011). Risk modeling in a new paradigm: developing new insight and foresight on structural risk.

McKinsey Global Institute. (2011). Big data: The next frontier for innovation, competition, and productivity. The McKinsey Global Institute.

McSherry, F., & Talwar, K. (2007). Mechanism Design via Differential Privacy. Foundations of Computer Science, 2007. FOCS '07. 48th Annual IEEE Symposium on. Microsoft Research.

Mercosur Cup. (2009). Retrieved from HIARCS Chess Software: http://www.hiarcs.com/Games/Mercosur2009/mercosur09.htm

Metaxas, P., & Mustafaraj, E. (2012, October 26). Social Media and the Elections: Manipulation of social media affects perceptions of candidates and compromises decision making. Science, 472-473.

Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science.

Muthukrishnan, S. (2010). Massive Data Streams Research: Where to Go. Rutgers University.

O'Brien, J. (1999). Management Information Systems – Managing Information Technology in the Internetworked Enterprise. Irwin McGraw-Hill.

Office of Government Commerce. (2009). Managing Successful Projects with PRINCE2 (2009 edition). The Stationery Office.

Osborne, C. (2014, July 14). GCHQ's dark arts: Leaked documents reveal online manipulation, Facebook, YouTube snooping. Retrieved July 27, 2014, from ZDNet: http://www.zdnet.com/gchqs-dark-arts-leaked-leaked-documents-reveal-online-manipulation-facebook-and-youtube-snooping-7000031598/

Pavlo, A., Paulson, E., Rasin, A., et al. (2009). A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD’09, June 29–July 2, 2009 (p. 14). Providence, Rhode Island, USA.

Poksinska, B., Dahlgaard, J., & Antoni, M. (2002). The state of ISO 9000 certification: a study of Swedish organizations. The TQM Magazine, 297-306.

González-Ibáñez, R., Muresan, S., & Wacholder, N. (2011). Identifying sarcasm in Twitter: A closer look. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Vol. 2, The Association for Computer Linguistics, (pp. 581–586).

Rabl, T., Sadoghi, M., et al. (2012). Solving Big Data Challenges for Enterprise Application Performance Management. Proceedings of the VLDB Endowment, 1724-1735.

Rothenberg, J., & Bikson, T. (1999). Carrying Authentic, Understandable and usable Digital Records Through Time. Report to the Dutch National Archives and Ministry of the Interior.

Roy, I., Setty, S. T., Kilzer, A., Shmatikov, V., & Witchel, E. (2010). Airavat: Security and Privacy for MapReduce. NSDI (Vol. 10, pp. 297-312).

Sandvine Incorporated. (2014). Global Internet Phenomena. Retrieved June 9, 2014, from https://www.sandvine.com/trends/global-internet-phenomena/

SAS. (2012). Big Data Meets Big Data Analytics.

Setty, K., & Bakhshi, R. (2013). What is Big Data and what does it have to do with IT audit? ISACA Journal 2013 Vol. 3, 23-25.

Shaw, H. (2006, March 15). The Trouble With COSO. CFO Magazine.

Singleton, T. (2013). Auditing the IT auditors. ISACA Journal Vol. 3, 10-14.

Steele, J. M. (2005). Darrell Huff and Fifty Years of How to Lie. Statistical Science, 205-209.

Stucchio, C. (2013, September 16). Don't use Hadoop - your data isn't that big. Retrieved from http://www.chrisstucchio.com/: http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

TDWI. (2011). Best Practices Report: Big Data Analytics. TDWI Research.

The Atlantic. (2014, June 28). Everything we know about Facebook's secret mood manipulation experiment. Retrieved from The Atlantic: http://www.theatlantic.com/technology/archive/2014/06/everything-we-know-about-facebooks-secret-mood-manipulation-experiment/373648/

Tukey, J. (1977). Exploratory data analysis. Addison-Wesley.


Vance, J. (2013, July 16). 10 Top Big Data startups to watch. Retrieved August 8, 2014, from CIO.com: http://www.cio.com/article/2384144/big-data/10-top-big-data-startups-to-watch--final-rankings.html

Zorrilla, M., & García-Saiz, D. (2013). A service oriented architecture to provide data mining services for non-experts. Decision Support Systems 55, 399-411.


Chapter VIII. APPENDIX: BIG DATA & ANALYTICS QA CONTROLS AND EXPERT INTERVIEW ANSWERS

ETL processing

Capture and storage

Objective: Control measures provide a reasonable degree of certainty that source data is correctly captured and available for use by analytics.
Input: External raw source data / internal raw source data / manual input
Output: HDFS / Hadoop data storage
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

1.a Issue / inherent risk: Source data is unavailable due to lack of capture or definition, making independent re-performance impossible.
    Proposed control: In the design phase, a decision is made and documented on what data to measure and record.
    Responses: Applied: n | Positive: n/a | Potential: y

1.b Issue / inherent risk: Source data is incorrectly loaded into the Hadoop platform.
    Proposed control: Source data should be mapped and its characteristics described and documented. This requires profiling of source data to understand its origin and characteristics, including its veracity.
    Responses: Applied: n | Positive: n/a | Potential: y

1.c Issue / inherent risk: Source data is incorrectly loaded into the Hadoop platform.
    Proposed control: When source data is obtained from intermediaries, an understanding should be obtained of how it has been composed.
    Responses: Applied: n | Positive: n/a | Potential: y

1.d Issue / inherent risk: Source data is not available for audit or troubleshooting.
    Proposed control: A strategy for source data preservation is required. A data retention policy is in place which stipulates that, in so far as legally and technically possible, raw source data records are never deleted but archived for later forensic use in the problem-solving phase (see the sketch following this table).
    Responses: Applied: y | Positive: y | Potential: n/a

1.e Issue / inherent risk: Source data integrity cannot be confirmed because it was in transit.
    Proposed control: When data is in transit, the issue of referential integrity should be considered and either accounted for in the ETL phase (applying enrichment) and/or in the Analytics phase (incorporating the statistical uncertainty).
    Responses: Applied: n | Positive: ? | Potential: ?

1.f Issue / inherent risk: The system architecture is not adequate to deal with Velocity and Volume.
    Proposed control: Capacity and demand management is practiced, taking into account the Big Data service level required.
    Responses: Applied: y | Positive: ? | Potential: ?
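The following is a minimal sketch (in Python) of how the capture and preservation controls above (1.b, 1.d) might be supported in practice: at capture time a record count and SHA-256 checksum are computed for each raw source file and stored in a manifest next to the archived copy, so a later audit can re-compute them and confirm the preserved data is still the data that was captured. The file name and manifest format are illustrative assumptions, not part of the examined business case.

    import hashlib
    import json
    import os
    from datetime import datetime, timezone

    def capture_manifest(path, chunk_size=1 << 20):
        """Compute a record count and SHA-256 checksum for a raw source file."""
        sha256 = hashlib.sha256()
        records = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                sha256.update(chunk)
                records += chunk.count(b"\n")   # assumes one record per line
        return {
            "source_file": os.path.abspath(path),
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "record_count": records,
            "sha256": sha256.hexdigest(),
        }

    if __name__ == "__main__":
        # Illustrative: write the manifest next to the archived raw file so a
        # later audit can re-compute the checksum and compare.
        manifest = capture_manifest("raw_source.csv")          # hypothetical file
        with open("raw_source.csv.manifest.json", "w") as out:
            json.dump(manifest, out, indent=2)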


Conversion and transformation

Objective: Control measures provide a reasonable degree of certainty that source data is in a suitable format for use by analytics.
Input: HDFS data storage
Output: MapReduce data set
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

2.a Issue / inherent risk: Source data cannot be converted into a legible format by or for HDFS/Hadoop/Hive/Pig.
    Proposed control: When converting data, e.g. to different formats, conversion controls must be applied. When converting media to machine-readable information, the impact of errors must be considered in the Analytics phase.
    Responses: Applied: n | Positive: n/a | Potential: y

2.b Issue / inherent risk: Source data is incorrectly loaded into HDFS/Hadoop/Hive/Pig.
    Proposed control: When loading source data into Hadoop, interface mapping, job monitoring and input and output validation are performed in the ETL processing phase of Big Data & Analytics (see the sketch following this table).
    Responses: Applied: y | Positive: y | Potential: n/a
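As an illustration of the input and output validation referred to under 2.b, a load job can reconcile the number of records offered by the source extract against the number actually loaded plus the number explicitly rejected, and fail when the unexplained difference exceeds a tolerance. The sketch below takes the counts as plain numbers; how they are obtained (job counters, an HDFS listing, etc.) is deliberately left open.

    class LoadValidationError(Exception):
        """Raised when a load job fails input/output reconciliation."""

    def reconcile_load(source_count: int, loaded_count: int,
                       rejected_count: int = 0, tolerance: float = 0.0) -> None:
        """Check that records in = records loaded + records explicitly rejected.

        tolerance is the fraction of unexplained loss that is still accepted
        (0.0 means every record must be accounted for).
        """
        if source_count == 0:
            raise LoadValidationError("Source extract contained no records.")
        unexplained = source_count - loaded_count - rejected_count
        if unexplained / source_count > tolerance:
            raise LoadValidationError(
                f"{unexplained} of {source_count} records unaccounted for "
                f"(tolerance {tolerance:.1%})."
            )

    # Example: 1,000,000 source records, 999,990 loaded, 10 rejected by parsing rules.
    reconcile_load(1_000_000, 999_990, rejected_count=10)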


Cleansing and enrichment

Objective: Control measures increase the level of certainty that source data is of sufficient quality and completeness for use by analytics.
Input: MapReduce data set
Output: Cleansed and enriched prepared data
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

3.a Issue / inherent risk: Data added or deleted as a result of the cleansing and enrichment step alters the validity of the outcome of the analysis.
    Proposed control: An audit log is maintained of all administrative actions performed on source data, such as cleansing and enrichment. To ensure this, logical access control is also required (see the sketch following this table).
    Responses: Applied: y | Positive: y | Potential: n/a

3.b Issue / inherent risk: Manually added or deleted data renders the original source data inaccurate.
    Proposed control: Where possible, input controls are placed on manual data entry within the MapReduce data set to prevent unauthorized changes to source data by data scientists.
    Responses: Applied: n | Positive: n/a | Potential: y

3.c Issue / inherent risk: Enrichment data is incorrectly loaded into HDFS/Hadoop/Hive/Pig.
    Proposed control: When loading enrichment data into Hadoop, interface mapping, monitoring and input and output validation are performed in the ETL processing phase of Big Data & Analytics.
    Responses: Applied: n/a | Positive: n/a | Potential: n/a
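A minimal sketch of the audit logging proposed under 3.a: each cleansing or enrichment step is wrapped so that the acting user, the action and the number of records before and after are written to an audit trail. The cleansing function shown is purely illustrative.

    import functools
    import getpass
    import logging

    audit_log = logging.getLogger("cleansing.audit")
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(name)s %(message)s")

    def audited(action):
        """Decorator that writes an audit trail entry for each cleansing step."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(records, *args, **kwargs):
                before = len(records)
                result = func(records, *args, **kwargs)
                audit_log.info("user=%s action=%s records_in=%d records_out=%d",
                               getpass.getuser(), action, before, len(result))
                return result
            return wrapper
        return decorator

    @audited("drop_empty_values")
    def drop_empty(records):
        # Illustrative cleansing rule: remove records without a value.
        return [r for r in records if r.get("value") not in (None, "")]

    cleaned = drop_empty([{"value": 1}, {"value": None}, {"value": 3}])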


Analytics phase

Data profiling

Objective: Control measures increase the level of certainty that data characteristics such as density and distribution are understood, so that the correct statistical method is chosen.
Input: MapReduced data
Output: Profiled data
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

4.a Issue / inherent risk: Lack of garbage-in controls may lead to garbage-out due to the inaccuracy of source data.
    Proposed control: In so far as possible, data is classified and checked for accuracy, consistency, timeliness, conformity, reasonableness, preciseness and integrity, along with other sanity checks. Plausibility analyses are performed on intermediate products and outcomes.
    Responses: Applied: y | Positive: y | Potential: n/a

4.b Issue / inherent risk: An incorrect statistical method is used, leading to invalid results.
    Proposed control: Data is profiled to determine its characteristics and match it to the right statistical method. This includes checking relevant properties such as density, distribution, etc. (see the sketch following this table).
    Responses: Applied: y | Positive: y | Potential: n/a

4.c Issue / inherent risk: The statistical model's performance is negatively affected by the changing make-up of data processed in the ETL phase.
    Proposed control: The evolution of the ontology of the source data as well as the processed data must be monitored to ensure it fits within the performance envelope of the applied statistical model.
    Responses: Applied: n | Positive: n/a | Potential: y
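A minimal profiling sketch supporting 4.a and 4.b, using pandas to derive completeness, distribution and skewness figures that can inform the choice of statistical method. The columns and values are invented for illustration; they are not data from the business case.

    import pandas as pd

    # Illustrative data set; in the business case this would be the MapReduced data.
    df = pd.DataFrame({
        "amount": [10.5, 12.0, None, 11.2, 250.0, 9.8],
        "channel": ["web", "web", "branch", "web", "branch", None],
    })

    profile = {
        "row_count": len(df),
        "null_rate": df.isna().mean().to_dict(),               # completeness
        "numeric_summary": df.describe().to_dict(),            # distribution
        "skewness": df.skew(numeric_only=True).to_dict(),      # shape of distribution
        "category_frequencies": df["channel"].value_counts(dropna=False).to_dict(),
    }

    for key, value in profile.items():
        print(key, value)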


Statistical modelling

Objective: Control measures increase the level of certainty that the chosen statistical method is correctly applied.
Input: Profiled data
Output: Modeled data
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

5.a Issue / inherent risk: The applied statistical model is incorrect.
    Proposed control: Validation of the statistical model and method is performed by experts to determine the validity and performance of the chosen model with the data it is intended to be fed.
    Responses: Applied: n | Positive: n/a | Potential: y

5.b Issue / inherent risk: The model and its data inputs cannot be understood by third parties.
    Proposed control: Model documentation is required to ensure auditability of the statistical model, together with documentation of the data profile so the data input (sample data or real data) can be related to the model's performance.
    Responses: Applied: n | Positive: n/a | Potential: y

5.c Issue / inherent risk: The performance of the predictive model is not adequate.
    Proposed control: Back-testing should be performed to estimate how the model would have performed had it been employed during a past period, based on historical data. Model overfitting should, however, be avoided (see the sketch following this table).
    Responses: Applied: n | Positive: n/a | Potential: y
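An illustrative back-test in the spirit of 5.c: the model is fitted only on the earlier part of the historical series and then scored on the later part it has never seen, as if it had been deployed at that point in time. The deliberately simple linear trend model and the synthetic series are assumptions made for the sake of the example, not the model used in the business case.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative monthly history: 36 observations of some target metric.
    history = np.linspace(100, 170, 36) + rng.normal(0, 5, 36)
    months = np.arange(36)

    # Back-test: fit only on the first 24 months, then score the model on the
    # 12 months it has never seen, as if it had been deployed at month 24.
    train_x, train_y = months[:24], history[:24]
    test_x, test_y = months[24:], history[24:]

    slope, intercept = np.polyfit(train_x, train_y, deg=1)   # deliberately simple model
    predictions = slope * test_x + intercept

    mae = np.mean(np.abs(predictions - test_y))
    print(f"Hold-out mean absolute error: {mae:.2f}")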


Development of algorithms and queries

Objective: Control measures increase the level of certainty that the chosen statistical method is correctly programmed and answers are correctly retrieved.
Input: Modeled data
Output: Query answers
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

6.a Issue / inherent risk: Algorithm or query output is invalid.
    Proposed control: (Automatic) detection of invalid queries is required, e.g. thresholds have been defined for the maximum number of records which may be omitted in a system response to a query (see the sketch following this table).
    Responses: Applied: y | Positive: y | Potential: n/a

6.d Issue / inherent risk: The outcome of data analytics cannot be reproduced because the original algorithm or query is unavailable.
    Proposed control: Managed software development is required. The writing of algorithms or queries follows change management principles whereby changes are logged and, in more advanced scenarios, formally tested and approved. This includes preservation of the original query or algorithm (i.e. version control) and testing and debugging of developed algorithms.
    Responses: Applied: n | Positive: n/a | Potential: y
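A sketch of the automatic detection of invalid query answers described under 6.a: the fraction of records omitted from a response is compared against a predefined threshold. The 1% default is illustrative; in practice the threshold would be defined per query.

    def check_query_response(returned: int, omitted: int,
                             max_omitted_fraction: float = 0.01) -> bool:
        """Flag a query response as invalid when too many records were omitted.

        The 1% default threshold is purely illustrative; in practice it would
        be defined per query as part of the control described under 6.a.
        """
        total = returned + omitted
        if total == 0:
            return False  # an empty answer is treated as invalid
        return omitted / total <= max_omitted_fraction

    assert check_query_response(returned=9_950, omitted=50) is True
    assert check_query_response(returned=9_000, omitted=1_000) is False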


Decision Making

Visualisation

Objective: Control measures increase the level of certainty that the portrayal of analytics output is valid and correct.
Input: Analytical model output
Output: Queries, graphs and visualisations
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

7.a Issue / inherent risk: The visualisation is incomplete.
    Proposed control: The visualisation should include all required data points.
    Responses: Applied: y | Positive: y | Potential: n/a

7.b Issue / inherent risk: The visualisation is incorrectly portrayed.
    Proposed control: A valid type of visualisation is chosen for displaying the analysed data in question.
    Responses: Applied: y | Positive: y | Potential: n/a

7.c Issue / inherent risk: The complexity of the visualisation is confusing to the user.
    Proposed control: The "less is more" principle is applied to show only the data points relevant to the purpose of the analysis.
    Responses: Applied: y | Positive: y | Potential: n/a

7.d Issue / inherent risk: The visualisation is misrepresented and/or misleading.
    Proposed control: Challenge for bias and peer review in order to eliminate ambiguity in the interpretation of the analytics model. This requires the involvement of multiple experts challenging methods and outcomes, similar to the scientific process of peer review.
    Responses: Applied: y | Positive: y | Potential: n/a

7.e Issue / inherent risk: The visualisation cannot be interpreted by the user.
    Proposed control: Employee training is performed and tooling is provided for understanding the Big Data & Analytics process and its visualisation.
    Responses: Applied: n | Positive: n/a | Potential: y

7.f Issue / inherent risk: The visualisation is biased.
    Proposed control: For each type of visualisation an alternative visual portrayal should be available to allow for comparison (see the sketch following this table).
    Responses: Applied: n | Positive: n/a | Potential: y
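For 7.f, one illustrative way to provide an alternative portrayal of the same figures so reviewers can compare them: the sketch plots identical data once on a zero-based axis and once on a truncated axis, making any exaggeration of the trend visible. The monthly figures are invented for the example.

    import matplotlib.pyplot as plt

    # Illustrative monthly figures; not data from the business case.
    months = ["Jan", "Feb", "Mar", "Apr"]
    values = [102, 104, 103, 106]

    fig, (ax_full, ax_zoom) = plt.subplots(1, 2, figsize=(8, 3))

    # Portrayal 1: bars on a zero-based axis, which shows the change is modest.
    ax_full.bar(months, values)
    ax_full.set_ylim(0, 120)
    ax_full.set_title("Zero-based axis")

    # Portrayal 2: the same data on a truncated axis, which exaggerates the trend.
    ax_zoom.plot(months, values, marker="o")
    ax_zoom.set_ylim(100, 107)
    ax_zoom.set_title("Truncated axis")

    fig.suptitle("Same data, two portrayals: review both before publishing")
    fig.tight_layout()
    plt.show()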


Interpretation

Objective: Control measures increase the level of certainty that the visualisation of analysed data is not misinterpreted.
Input: Queries, graphs and visualisations
Output: Conclusions
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

8.a Issue / inherent risk: The end-user is not versed in interpreting Big Data & Analytics results.
    Proposed control: Big Data & Analytics users receive adequate training to be able to interpret visual results.
    Responses: Applied: n | Positive: n/a | Potential: y

8.b Issue / inherent risk: Due to bias in Big Data & Analytics end-user reports, the end-user can be coerced into (incorrect) decisions.
    Proposed control: Non-technical end-user reports should be challenged for inducing bias or skewing the direction of decision making, and for their alignment with the technical reports.
    Responses: Applied: y | Positive: y | Potential: n/a

8.c Issue / inherent risk: A conclusion is incorrect or based on an incorrect assumption or interpretation.
    Proposed control: Business end-users' conclusions are peer-reviewed and challenged by Big Data & Analytics experts, such as statisticians, to ensure correct interpretation, including consideration of the confidence interval (see the sketch following this table).
    Responses: Applied: n | Positive: n/a | Potential: y
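In support of 8.c, a small sketch of the confidence interval that reviewers can ask for alongside a point estimate, here using the normal approximation for the mean. The sample values are illustrative.

    import math

    def mean_confidence_interval(values, z=1.96):
        """Approximate 95% confidence interval for the mean (normal approximation).

        For the large samples typical of Big Data this approximation is usually
        reasonable; for small samples a t-distribution would be more appropriate.
        """
        n = len(values)
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / (n - 1)
        margin = z * math.sqrt(variance / n)
        return mean - margin, mean + margin

    low, high = mean_confidence_interval([9.8, 10.1, 10.4, 9.9, 10.2, 10.0])
    print(f"Point estimate lies between {low:.2f} and {high:.2f} with ~95% confidence")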


Automation

Objective: Control measures increase the level of certainty that automated algorithms working on Big Data function as intended in production.
Input: Automated algorithm
Output: Automated conclusions and actions
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

9.a Issue / inherent risk: An algorithm in production functions incorrectly.
    Proposed control: In the case of scheduled or automatically operating algorithms, periodic evaluation and monitoring are performed to ascertain that the model can still handle the source data, i.e. to ensure that algorithms are operating within their designed envelope (see the sketch following this table).
    Responses: Applied: n | Positive: n/a | Potential: y
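A sketch of the envelope monitoring described under 9.a: the distribution of data arriving in production is compared with the distribution the model was designed for, here using the population stability index as one possible drift measure. The 0.2 threshold is a common rule of thumb, not a prescription from the business case, and the reference and incoming samples are synthetic.

    import numpy as np

    def population_stability_index(expected, actual, bins=10):
        """Compare the distribution of new input data against the data the model
        was designed for; a common rule of thumb treats PSI > 0.2 as material drift.
        """
        edges = np.histogram_bin_edges(expected, bins=bins)
        exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        # Avoid division by zero / log of zero for empty buckets.
        exp_pct = np.clip(exp_pct, 1e-6, None)
        act_pct = np.clip(act_pct, 1e-6, None)
        return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

    rng = np.random.default_rng(1)
    reference = rng.normal(50, 10, 10_000)      # data the model was built on
    incoming = rng.normal(58, 12, 10_000)       # data seen in production

    psi = population_stability_index(reference, incoming)
    if psi > 0.2:
        print(f"PSI={psi:.2f}: input data has drifted outside the designed envelope")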


Governance of Enterprise IT

Objective: Governance ensures that stakeholder needs, conditions and options are evaluated to determine balanced, agreed-on enterprise Big Data & Analytics objectives to be achieved; setting direction through prioritisation and decision making; and monitoring performance and compliance against agreed-on direction and objectives.
Input: n/a
Output: n/a
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

10.a Issue / inherent risk: Business goals are not aligned with the Big Data strategy.
     Proposed control: A Big Data & Analytics strategy is required, to which the IT strategy is aligned.
     Responses: Applied: ? | Positive: n/a | Potential: y

10.b Issue / inherent risk: The Big Data investment is unsuccessful due to lack of sponsorship, ownership or understanding by firm management.
     Proposed control: A sponsor for Big Data at C-level is required to align the business, IT and Big Data & Analytics strategies.
     Responses: Applied: ? | Positive: n/a | Potential: y

10.c Issue / inherent risk: The firm's IT infrastructure is unable to meet the demands of the Big Data implementation.
     Proposed control: A suitable Big Data IT architecture must be chosen which achieves the required performance levels; this can be achieved by e.g. information technology capacity management.
     Responses: Applied: y | Positive: n/a | Potential: y

10.d Issue / inherent risk: Risk of unauthorised access to the Hadoop server or the data scientist sandbox.
     Proposed control: User access control measures must be in place to prevent unauthorised access to both the original sources and the processed data sets. Role Based Access Control may be applied to ensure that only authorised users (i.e. data scientists and/or IT personnel) can access data, in order to prevent unauthorised or incorrect manipulation.
     Responses: Applied: y | Positive: y | Potential: n/a

10.e Issue / inherent risk: Improper use of personal data in Big Data & Analytics source sets may lead to illegal processing and invalidation of the use of Big Data & Analytics results.
     Proposed control: Suitable privacy controls such as masking and/or encryption must be in place in case source data contains personal data which might be traced back to individual contributors to the data set. A data classification policy must be in place to identify and classify (sensitive) personal data in data sets so these may receive the proper treatment during processing.
     Responses: Applied: y/n | Positive: y | Potential: y

10.f Issue / inherent risk: Risk of re-identification of private or sensitive information.
     Proposed control: Privacy-enhancing techniques, such as encryption and differential privacy, must be applied to processed data to ensure that personal data is guarded. When processed data consists of personal data which is sensitive to re-engineering of individual contributions to the data set, the chosen statistical method must comply with the requirements of differential privacy (see the sketch following this table).
     Responses: Applied: n | Positive: n | Potential: y

10.g Issue / inherent risk: Risk of the Cloud vendor not delivering the required service or security levels.
     Proposed control: When InfoaaS is bought from a Cloud vendor, Cloud controls are required.
     Responses: Applied: ? | Positive: ? | Potential: ?

10.h Issue / inherent risk: Risk that analytics processing, while legal and compliant, is unethical or unacceptable should it become known to the general public.
     Proposed control: An oversight committee is charged with approving business use cases for Big Data & Analytics. Only after careful consideration and approval may work activity and data gathering be started. Next to legal and compliance considerations, explicit attention must be given to the ethics of analytics data processing.
     Responses: Applied: n | Positive: n | Potential: ?
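An illustrative sketch of the privacy controls referred to under 10.e and 10.f: direct identifiers are pseudonymised with a keyed hash before analysis, and aggregate answers are released with Laplace noise, the basic mechanism of differential privacy. The key, identifier and count below are invented; key management and the choice of epsilon would need to be addressed separately.

    import hashlib
    import hmac
    import numpy as np

    SECRET_KEY = b"replace-with-a-managed-secret"   # illustrative; keep in a key vault

    def pseudonymise(identifier: str) -> str:
        """Mask a direct identifier with a keyed hash so records can still be
        joined, but the original value cannot be read from the data set."""
        return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    def dp_count(true_count: int, epsilon: float = 0.5) -> float:
        """Release a count with Laplace noise (sensitivity 1), the basic
        mechanism of differential privacy referred to under 10.f."""
        return true_count + float(np.random.default_rng().laplace(0.0, 1.0 / epsilon))

    masked = pseudonymise("NL12BANK0123456789")       # hypothetical account number
    noisy_customers = dp_count(true_count=4_213)
    print(masked[:16], round(noisy_customers))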


People

No training was performed for the business end-users in relation to the bank's business case, nor for the IT employees involved, as the setup was outsourced to a third party, Go Data Driven.

For controls related to the People domain, this study refers to the related measures identified within the process steps of the Big Data & Analytics process.


Technology

Objective: Control measures increase the level of certainty that technology is adequate to support the goals of Big Data & Analytics.
Input: n/a
Output: n/a
Interview responses are recorded per control as: Applied (in relation to the selected business case), Positive (contribution to outcome) and Potential (if not applied, does it have merit?).

12.a Issue / inherent risk: The programming language is not fit for coding statistical queries.
     Proposed control: The most fit-for-purpose programming language must be chosen.
     Responses: Applied: y | Positive: y | Potential: n/a

12.b Issue / inherent risk: Externally sourced tooling is unable to provide the service level necessary to achieve the intended analytics results.
     Proposed control: A tool must be selected that matches the ability of the intended Big Data & Analytics user.
     Responses: Applied: n | Positive: n/a | Potential: ?