

IDIMC 2014: Making Connections

Loughborough University, 17th September 2014

Conference Proceedings


ISBN: 978-1-905499-51-9 © LISU. Published by: LISU, Loughborough University, Leicestershire, LE11 3TU



Programme Committee

• Christine L. Borgman, Presidential Chair & Professor of Information Studies, University of California

• Guy Fitzgerald, Professor of Information Systems, Loughborough University

• Bob Galliers, Distinguished Professor in Information Systems, Bentley University

• Mike Myers, Professor of Information Systems, University of Auckland

• Reijo Savolainen, Professor at the Department of Information Studies, University of Tampere

• Philip Woodall, Distributed Information and Automation Laboratory, University of Cambridge

Organising Committee

Centre for Information Management, Loughborough University

• Crispin Coombs

• Louise Cooke

• Claire Creaser

• Mark Hepworth

• Dave Gerrard

• Alexander von Lünen

Distributed Information and Automation Laboratory, University of Cambridge

• Philip Woodall



Contents

Introduction

Invited speakers
Can you make “Agile” work with a global team? – Peter Cooke
Open Data and Linked Data – how can they help your organisation? – Mark Harrison
A new Era of Knowledge Management? Reflections on the implications of ubiquitous computing – Sue Newell
The role of social networks in mobilizing knowledge – Jacky Swan

Contributed papers
UK Data Service: creating economic and social science metadata microcosms – Lucy Bell
The Implementation of Basel Committee BCBS 239: An Industry-Wide Challenge for International Data Management – Malcolm Chisholm
Adopting a situated learning framework for (big) data projects – Martin Douglas and Joe Peppard
Exploring different information needs in Building Information Modelling (BIM) using Soft Systems – Mohammad Mayouf, David Boyd and Sharon Cox
Using Big Data in Banking to Create Financial Stress Indicators and Diagnostics: Lessons from the SNOMED CT Ontology – Alistair Milne and Paul Parboteeah
The use of ontologies to gauge risks associated with the use and reuse of E-Government services – Onyekachi Onwudike, Russell Lock and Iain Phillips
Assessing trustworthiness in Digital Information – Frances Johnson, Laura Sbaffi and Jenny Rowley
Managing the risks of internal uses of Microblogging within Small and Medium Enterprises – Soureh Latif Shabgahi and Andrew Cox
Re-purposing manufacturing data: a survey – Philip Woodall and Anthony Wainman
Exploring vendor’s capabilities for cloud services delivery: A case study of a large Chinese service provider – Gongtao Zhang, Neil Doherty and Mayasandra-Nagaraja Ravishankar
A framework for identifying suitable cases for using market-based approaches in industrial data acquisition – Torben Jess, Philip Woodall and Duncan McFarlane
Balancing Big Data with Data Quality in Industrial Decision-Making – Philip Woodall and Maciej Trzcinski

Appendix 1: Programme


Introduction

Welcome to the inaugural International Data and Information Management Conference at Loughborough University. IDIMC 2014 was hosted by the Centre for Information Management in collaboration with the British Computer Society’s Data Management Specialist Group, and it was our great pleasure to offer an exciting and high-quality programme.

This booklet contains the abstracts of the invited presentations and the papers and posters that were presented at the conference.

This year’s conference theme was ‘making connections’. Academic work can be criticized for operating in silos and failing to bridge disciplines in the way that practitioners must on a daily basis. The information society and knowledge-based economy rely on the organization and retrieval of data and information; the processes associated with knowledge creation; and the knowledge required to design, develop and implement solutions that enable the exploitation of knowledge, data and information. However, it is only when these strands of important research are combined and integrated that their influence has the power to make breakthrough impacts on the information society and knowledge-based economy. We hope that this conference helped to build bridges between academics and practitioners through sharing knowledge and exploring opportunities to connect disciplines and theories.

We would like to express our deep appreciation to everyone who helped make the conference possible. First and foremost, to the organising team – Mark Hepworth, Claire Creaser, Philip Woodall, Louise Cooke, Kristin Meredith-Galley, David Gerrard, Alex von Lünen and Lesley Chikoore – for their hard work, and in particular to Sharon Fletcher, Ruth Cufflin and Ondine Barry for their crucial administrative support; to the members of the programme committee – Christine Borgman, Guy Fitzgerald, Bob Galliers, Michael Myers, Reijo Savolainen and Philip Woodall – for their help with reviewing and promoting the conference; to the BCS Data Management Specialist Group and Facet Publishing for their support; and to you for being part of this new conference!

We enjoyed a successful IDIMC 2014 with plenty of constructive and inspiring debates and, of course, fun and new friendships. We sincerely hope that you enjoyed your visit to our campus, that you enjoyed the conference, and that you took home many wonderful research ideas!

Tom Jackson, Director, Centre for Information Management

Crispin Coombs, Deputy Director, Centre for Information Management and Organizing Chair


Invited speakers

Can you make “Agile” work with a global team?
Peter Cooke, Global Enterprise Architect, Ford Motor Company

Agile development methodologies arose as a way to deliver value faster – but they assumed that developers and customers were collocated. Is it possible to get the benefits of Agile in a global environment where business stakeholders and IT personnel are geographically dispersed around the world?

This lecture briefly describes the history of Ford Motor Company’s experiences in global IS development before summarizing research into virtual requirements elicitation. The final section of the lecture will report on how this research is currently shaping new practices being deployed at Ford.

Open Data and Linked Data – how can they help your organisation? (Opening Keynote)

Mark Harrison, Director, Auto-ID Lab, Distributed Information and Automation Laboratory, Institute for Manufacturing, University of Cambridge

Open Data and Linked Data are somewhat complementary to Big Data, and they are technologies that companies can already use today to tackle some of their Big Data challenges. Increasingly, government departments and agencies are publishing open datasets about government spending, local socio-economic information, geographic information, weather forecasts, roadworks, public transport, health and more. There are opportunities to combine this freely available data with internal data to increase its value and support better decisions, using the additional context these open datasets provide. There is also an opportunity to make the public-facing data about a company’s products and services available as structured open data on the web, so that the company and its products and services can be more easily discovered by search engines, smartphone apps and other software. Linked Data technology (also known as Semantic Web technology) provides the capability to publish structured data together with its semantics, to perform federated queries over multiple data sources, both local and remote, and to navigate end-to-end through the data linkages across those datasets. Illustrations will be given of how Open Data and Linked Data are being used today and how they may be used in the future.
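As a minimal sketch of the federated-query idea only, the snippet below sends a SPARQL 1.1 query with a SERVICE clause to a local triple store, pulling additional context from a remote open-data endpoint. The endpoint URLs, vocabulary and properties are placeholders invented for illustration; they are not services referred to in the talk.

```python
# Illustrative only: a SPARQL 1.1 federated query sent to a local endpoint,
# pulling extra context from a remote open-data endpoint via SERVICE.
# Endpoint URLs and property names are hypothetical placeholders.
import requests

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?product ?label ?extra
WHERE {
  ?product rdfs:label ?label .                  # triples held locally
  SERVICE <http://remote.example.org/sparql> {  # federated call to a remote source
    ?product <http://example.org/vocab/extraInfo> ?extra .
  }
}
LIMIT 10
"""

resp = requests.get(
    "http://localhost:3030/dataset/sparql",   # hypothetical local endpoint
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["label"]["value"], row["extra"]["value"])
```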


A new Era of Knowledge Management? Reflections on the implications of ubiquitous computing
Sue Newell, Professor of Information Systems, University of Sussex

This presentation will focus on how changes in IT (specifically the increasing use of social software and, more generally, the digitization of our everyday lives) are changing organizational approaches to knowledge management. For example, organizations are increasingly relying on ‘the crowd’ to perform the kind of knowledge work that was previously done internally, and they are using big data to examine connections and make predictions rather than relying on the expertise and understanding of knowledge workers. The presentation will focus on how this is affecting knowledge work and, more generally, learning within and across organizations.

The role of social networks in mobilizing knowledge
Jacky Swan, Professor of Organisational Behaviour, University of Warwick

The presentation will focus on the importance of understanding social networks and the relationships between different kinds of social networks and knowledge processes (transfer, translation, transformation).


Contributed papers

UK Data Service: creating economic and social science metadata microcosms
Lucy Bell, Functional Director, Data Access, UK Data Archive, University of Essex

Background

The UK Data Service (http://www.ukdataservice.ac.uk) is a comprehensive resource funded by the Economic and Social Research Council (ESRC) to support researchers, teachers and policymakers who depend on high-quality social and economic data. The Service disseminates data via around 67,000 downloads per year to its 22,000 registered users worldwide. Its contract began in September 2012; however, a data archiving, curation and dissemination service has existed at the University of Essex since 1967, in various incarnations, including the Economic and Social Data Service (ESDS) from 2003 to 2012. The UK Data Service is a national resource, coordinated by the UK Data Archive, with organisational partners at the universities of Manchester, Edinburgh, Leeds and Southampton, and at UCL.

The service preserves, curates and disseminates data collections to its users, following the Open Archival Information System (OAIS) reference model (The Consultative Committee for Space Data Systems, 2012). The data collections are owned by a variety of organisations and individuals:

• National statistical authorities - Office for National Statistics (ONS), National Records of Scotland, Northern Ireland Statistics and Research Agency;

• UK government departments - including the Home Office, Department for Business, Innovation and Skills (BIS) and Department for Work and Pensions (DWP);

• Intergovernmental organisations - including the International Monetary Fund (IMF), Organisation for Economic Co-operation and Development (OECD), and the World Bank;

• Research institutes - including NatCen, Institute for Social and Economic Research and Centre for Longitudinal Studies;

• Individual researchers.

The service’s primary aim is to provide its users with seamless and flexible access to a wide range of data collections to facilitate high quality social and economic research and education.

High-precision metadata retrieval is paramount to this aim. Being able to find and retrieve, quickly and easily, just the right metadata for the data required for research is fundamental to seamless access. The user must be able to move from the metadata of one resource to another with speed and elegance and, at any stage in that journey, to reach the data delivery mechanism for the resource they require.

In order to create seamless access, the Service’s resource discovery mechanisms have been given high technological and developmental priority over the last few years, and activities have been undertaken to link, and make as open as possible, the metadata it manages. This has happened in many different ways, employing SKOS, DOIs and more Service-specific methodologies. Several strands of work to give prominence to the discoverability of open metadata have been undertaken recently. The vision is one of a world of accessible metadata which connects related resources, presenting users with a map of possible pathways through the data.

In the context of open and linked metadata, this paper describes the technologies and metadata management processes that the Service employs across its four academic sites in order to ensure timely and effective information retrieval, and compares these with some of the issues surrounding open data. The paper highlights: the metadata schema used; the innovative, international, SKOS-based thesaurus management development the service is currently undertaking via a 5-year ESRC award; and the data citation methodologies that are a) employed and b) have been designed in-house.

Open data vs. open metadata

So, what are open data? The Open Knowledge Foundation’s Open Government Data Working Group definition of open data (Open Definition, 2014) sets out ten principles, covering access, redistribution, reuse, technological restrictions, attribution, integrity, discrimination and licensing. These can be problematic for social science data, most of which are classed as ‘personal’ – they almost always refer to real people. Issues of consent and anonymisation play a part at every stage of the data journey (Corti et al, 2014). Although the hard sciences are hurtling along the path to open data – quite laudably – the players in the social science open data arena have to consider the rights of the data subjects very carefully. Rather than prevent any ‘open-type’ access to these types of data, it may be better to see access as a continuum, along which data collections may be moved to make them more (or less) open.

Open metadata can be easier to deal with. Discovery1 defines open metadata thus:

“Open metadata creates the opportunity for enhancing impact through the release of descriptive data about library, archival and museum resources. It allows such data to be made freely available and innovatively reused to serve researchers, teachers, students, service providers and the wider community in the UK and internationally.” (Discovery, 2012)

The UK Data Archive, as the lead institution in the UK Data Service, has signed up to these principles. The principles advocate, among other things, that: metadata are, by default, made freely available for use and reuse, unless explicitly precluded by third party rights or licences; all metadata are licensed and that this takes place using a standard, open licensing framework, such as Open Data Commons (ODC) Public Domain Dedication Licence2, Creative Commons CC03 or the UK Open Government Licence (OGL)4.

1 http://www.discovery.ac.uk 2 http://opendatacommons.org/licenses/pddl/1-0/ 3 http://creativecommons.org/choose/zero/ 4 http://www.nationalarchives.gov.uk/information-management/government-licensing/about-the-ogl.htm

Open data and metadata at the UK Data Service

The spectrum of data access formalised by the UK Data Service in its Data Access Policy (UK Data Service, 2014) is as follows:

• Open: freely available with no need for registration or authentication;

• Safeguarded: requires agreement to, at minimum, a standard End User Licence (EUL) and also, sometimes, additional licences or conditions;

• Controlled: requires approved or accredited researcher training and usage approval.

The Service has recently been opening up access to data collections on this spectrum where at all possible. Some data can be made truly open, while others have had, via re-negotiation with the data owners, additional, restrictive conditions of access removed.

Some data in the UK Data Service are truly open. A prominent example is the Census. The Service provides all users with open access to 2001 (England and Wales) and 2011 (UK) Census aggregate statistics under the Open Government Licence, as well as to the UK Data Service back catalogue of aggregate data from the 1971, 1981, 1991 and 2001 Censuses.

Other open data are also available. ReShare5, the UK Data Service’s online self-deposit data repository for the archiving and sharing of research data, offers access to both open and safeguarded data collections for research and learning. Another important suite of data collections released as fully open data appears in the Service’s QualiBank6. These open data collections still cannot have a CC0 licence attached, however; the CC BY-NC-SA 4.0 licence7 is recommended instead.

In relation to metadata, the UK Data Service is slightly hampered by third-party licensing issues, in that it does not own the intellectual property (IP) in all the abstracts in its catalogue. It makes this clear through transparent copyright statements. It nonetheless pushes for re-use of its metadata wherever this is both possible and acceptable, and makes its metadata freely available for inclusion in publicly accessible catalogues. This is agreed with the data owners at the time of deposit and is a condition to which they sign up. The metadata are then made available via an OAI-PMH feed (http://oai.ukdataservice.ac.uk/oai/).
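As a rough illustration of how such a feed can be consumed, the sketch below harvests Dublin Core records over OAI-PMH from the base URL quoted above. The record handling is simplified (no resumption-token paging) and is not the Service’s own tooling.

```python
# A minimal sketch of harvesting Dublin Core metadata over OAI-PMH.
# The base URL is the one given in the text; everything else is illustrative.
import requests
import xml.etree.ElementTree as ET

BASE_URL = "http://oai.ukdataservice.ac.uk/oai/"
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

resp = requests.get(
    BASE_URL,
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
    timeout=60,
)
root = ET.fromstring(resp.content)

# Print title and identifier of each harvested record (first page only;
# a full harvester would follow the OAI-PMH resumptionToken).
for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    title = record.find(".//dc:title", NS)
    identifier = record.find(".//dc:identifier", NS)
    if title is not None:
        print(title.text, "-", identifier.text if identifier is not None else "")
```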

Another way the Service encodes open metadata is via RDF technologies. The UK Data Service employs these extensively for the two thesauri that it manages, both of which are available in SKOS. Each hierarchy is available openly (although in each case the entire thesaurus is not, for IP reasons). See http://lod.data-archive.ac.uk/skoshasset/page/ for access to the SKOS version of the Humanities and Social Science Electronic Thesaurus (HASSET).
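A sketch of how a SKOS-encoded hierarchy such as the open HASSET subset could be explored with rdflib is shown below. The concept URI is hypothetical and would need to be taken from the published linked data pages, and the assumption that the endpoint serves RDF directly is noted in the comments.

```python
# A sketch of exploring a SKOS hierarchy with rdflib. The concept URI below is
# hypothetical; real identifiers come from the published linked data pages.
from rdflib import Graph, URIRef
from rdflib.namespace import SKOS

g = Graph()
# Assumption: the linked data resource serves RDF via content negotiation.
g.parse("http://lod.data-archive.ac.uk/skoshasset/page/")

concept = URIRef("http://lod.data-archive.ac.uk/skoshasset/concept/EMPLOYMENT")  # hypothetical

for label in g.objects(concept, SKOS.prefLabel):
    print("Preferred term:", label)
for broader in g.objects(concept, SKOS.broader):
    print("Broader term:", broader)
for narrower in g.objects(concept, SKOS.narrower):
    print("Narrower term:", narrower)
```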

The Service has also whole-heartedly embraced the principles of open metadata: it is a signatory to the discovery.ac.uk Open Metadata Principles, an endorser of the Joint Declaration on Data Citation Principles8, and it distributes its metadata.

It is clear that open metadata is something which can be achieved, even if licensing or IP issues create a few barriers.

5 http://reshare.ukdataservice.ac.uk/ 6 http://discover.ukdataservice.ac.uk/QualiBank 7 https://creativecommons.org/licenses/by-nc-sa/4.0/ 8 https://www.force11.org/datacitation

Creating metadata microcosms

One of the Service’s primary goals is to make the data it curates discoverable. Without the careful organisation, opening up and linkage of metadata, information retrieval suffers, and users may fail to discover data critical to their research. Links made between information resources within a single repository or organisation are important to the integrity of the user’s journey through that system; but being able to encode metadata so that they can be shared more widely, and so that connections can be made with other, external resources, is vital for wider linkages to be made, enabling even more cross-referencing and allowing what could be called ‘assisted serendipity’ to occur.

Many other initiatives are currently investigating how this can be done, such as the Natural Europe project (Skevakis et al, 2014) and Data without Boundaries (Silberman and Wolf, 2012). This is a wide movement, reflected across many information-related disciplines. It appears in the library world, through the Functional Requirements for Bibliographic Records (FRBR) (International Federation of Library Associations and Institutions, 2009), as well as in the data retrieval world. These activities attempt to plot the relationships between associated resources. In their use of metadata, they are creating microcosms of the respective worlds each schema type represents.

Connecting metadata within a Knowledge Organization System mirrors the power afforded to the data themselves when linkages are made using variables; the difference with the latter being that careful consideration has to be given to the protection of individual data subjects’ privacy when dealing with personal data. Linking metadata, on the other hand, has far fewer restrictions, and can support more powerful retrieval. In being able to work more freely and openly with metadata, the UK Data Service is implementing its vision of connected data pathways via its resource discovery tool, Discover9, adopting what the health literature calls a whole systems approach (Pratt et al, 2005) to metadata.

Implementation

UK Data Service metadata

The metadata schemas the UK Data Service uses include:

• the Data Documentation Initiative Codebook (DDI-C) version 2.5;

• QuDex;

• Text Encoding Initiative (TEI);

• SDMX;

• Dublin Core (DC).

DDI

The primary reference schema used in the Service is the DDI. This is the de facto standard for social science metadata. It is developed and managed by the DDI Alliance10, a cross-border membership organisation. The DDI-C schema contains 351 detailed elements for describing data collections cross-nationally. The Service does not use all 351 elements, but employs all those that are mandatory.

9 http://discover.ukdataservice.ac.uk 10 http://www.ddialliance.org

There are two general flavours of DDI: DDI-Codebook and DDI-Lifecycle (DDI-L). DDI-L is designed to document and manage data across the entire lifecycle, from conceptualisation to data publication and analysis – and beyond. Based on XML Schemas, DDI-Lifecycle is modular and extensible. It allows extensive metadata reuse and the grouping of variables. DDI-C, on the other hand, is a more traditional, linear schema. Historically, the UK Data Service (and its predecessors) has used this more library-like schema. The advantage of this is that it can easily be mapped to MARC 21, ISO 19115 and other references, for record inclusion within external databases. A move to DDI-L is under consideration.
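To give a feel for DDI-C’s document-like structure, the following is a hedged sketch: a simplified, hypothetical codebook fragment and the kind of title/abstract extraction a catalogue ingest step might perform. Real DDI-C 2.5 records are namespaced and far richer than this.

```python
# A simplified, hypothetical DDI-Codebook fragment and a toy extraction step.
# Element names (codeBook, stdyDscr, citation, titlStmt, titl, stdyInfo, abstract)
# follow DDI-C conventions, but the record itself is invented for illustration.
import xml.etree.ElementTree as ET

DDI_FRAGMENT = """
<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Example Survey of Household Finances, 2013</titl>
      </titlStmt>
    </citation>
    <stdyInfo>
      <abstract>An illustrative study description used only for this sketch.</abstract>
    </stdyInfo>
  </stdyDscr>
</codeBook>
"""

root = ET.fromstring(DDI_FRAGMENT)
print("Title:", root.findtext(".//titl"))
print("Abstract:", root.findtext(".//abstract"))
```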

QuDEx

QuDEx is a highly innovative metadata schema for describing qualitative data collections, researched and developed at the UK Data Archive (Corti and Gregory, 2011).

QuDEx was designed in 2005 via a Jisc grant11. It arose from a comparison of the key functionalities of the market-leading qualitative software packages, in the hope of stimulating data exchange and import/export facilities between them. It defined common denominator functions for documenting, organising and analysing qualitative data: coding; classifying; memoing; and relating. These core concepts helped define the QuDEx schema.

The QuDEx standard/schema is a software-neutral format for qualitative data that preserves annotations of, and relationships between, data and other related objects. It can be viewed as the optimal baseline data exchange model for the archiving and interchange of data and metadata.

Text Encoding Initiative (TEI)12

The TEI is in fact a consortium which collectively develops and maintains a standard for the representation of texts in digital form; it has produced a set of guidelines that specify encoding methods for machine-readable texts. The UK Data Service holds such qualitative material, in the form of interview transcripts, diaries and observation notes. These guidelines are used to present the texts online.

SDMX13

The SDMX standard is used within the UK Data Service for the encoding of its large, international macrodata. These include data collections from the World Bank, the OECD, the International Monetary Fund and UNIDO.

Dublin Core14

The Service does not make extensive use of Dublin Core, but does release its metadata in this schema via its OAI-PMH feed, in order to allow as many publicly-accessible database creators as possible to gain access to its metadata.

Discover

All these metadata initiatives coalesce for the users’ benefit via the Service’s search and browse application, Discover. It is here that metadata worlds, microcosms of the wider world represented by the data collections and associated resources themselves, are reproduced.

11 http://www.data-archive.ac.uk/create-manage/projects/qudex 12 http://www.tei-c.org/index.xml 13 http://sdmx.org/ 14 http://dublincore.org/


Figure 1: Discover, the UK Data Service’s search and browse application

Also in the spirit of FRBR, Discover promotes connections between related data, case studies of data use, publications and other outputs, and citations, and is investigating the inclusion of standardised means of identifying organisations and individuals (ORCIDs, ISNIs etc.). The paths between resources are made as complete as possible, with the links between each one representing the relationships that exist – albeit often undocumented – in the real world. The user’s journey is not interrupted at any stage. Discover is a Solr-based search application.


Figure 2: Discover’s functionality

A series of core facets allows the user to move from one resource type (or core) to another. Each core has its own set of dynamic facets, to assist with further browsing. The user may – and is encouraged to – jump from core to core from within each record.
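The following is a rough sketch of the kind of faceted query a Solr-based discovery layer can answer; the endpoint, core name and field names are placeholders rather than Discover’s actual configuration.

```python
# A sketch of a faceted Solr query. Endpoint, core and field names are hypothetical.
import requests

SOLR_SELECT = "http://localhost:8983/solr/discover/select"  # hypothetical core

params = {
    "q": "census",            # free-text query
    "wt": "json",
    "rows": 5,
    "facet": "true",
    "facet.field": ["resource_type", "date_range"],  # hypothetical facet fields
}
resp = requests.get(SOLR_SELECT, params=params, timeout=30)
data = resp.json()

for doc in data["response"]["docs"]:
    print(doc.get("title"))

# Facet counts come back as a flat [value, count, value, count, ...] list per field.
counts = data["facet_counts"]["facet_fields"]["resource_type"]
for value, count in zip(counts[::2], counts[1::2]):
    print(value, count)
```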

Controlled vocabularies (CVs)

The UK Data Service is heavily involved in both the creation and use of controlled vocabularies.

In terms of policy, the Service influences international CVs used within the DDI schema via its inclusion on a cross-border DDI CV working group. Cross-border CVs ensure that equivalent concepts are accurately retrieved in different countries and from different repositories.

The primary way that the Service works with CVs, however, is through the research, development and management of two SKOS-encoded thesauri: the Humanities and Social Science Electronic Thesaurus (HASSET) and its sister product, the European Language Social Science Thesaurus (ELSST).

The UK Data Service has responsibility for both of these:

• HASSET has been developed and in use in the UK Data Archive for over 30 years through funding from both the ESRC and the University of Essex.

• ELSST was originally developed as part of the EU-funded LIMBER project (Etheridge et al, 2002) and has been further enhanced through additional funding from the ESRC and the University of Essex and through subsequent EU grants such as MADIERA (University of Essex, 2006). ELSST takes its shared concepts, hierarchical structure and framework from HASSET, with translations made in eleven European languages, with more on the way. Any concepts specific to British activities and bureaucracy are not included.


The two products have, historically, been managed separately, using different administrative interfaces and with different licences for external users. This has created problems:

• it is time-consuming to have to enter the same information in two places;

• maintaining two separate structures creates the capacity for the two thesauri to diverge – or for errors to be entered – without this being obvious.

A five-year ESRC award has been received by the UK Data Archive, as part of its UK Data Service work, to enhance, manage and link these two thesauri at several levels. The initial part of this work encompasses a large-scale development project, whose aims are:

a) to review and improve their structures;

b) to re-develop their management applications so the two may be managed in tandem.

Innovative development work has taken place to create a single thesaurus application which allows concepts within each tool to be managed seamlessly and simultaneously. This not only saves staff time, but also allows the concepts shared by the two tools to remain in sync or, where they diverge, for these differences to be transparent and clearly identified.

This work is based on, but extends (for the Service’s unique purposes), the principles laid down in the recent ISO 25964 (NISO, 2011 and 2013), in that the concepts held in the two thesauri have been mapped using extended equivalence relationships. This has resulted in metadata linkage between these two powerful products, allowing data users across the globe to search for and retrieve the data they require using twelve languages.

The development work undertaken at the UK Data Archive has created a single application which allows shared concepts to be managed across multiple thesauri and for external users to interact with the two tools15. At present, this only relates to the two existing products, but the potential for expansion has been built in.

While the ideal is for ELSST and HASSET to share identical concepts, divergence has been accommodated in a number of ways, in accordance with SKOS and ISO 25964. The two thesauri have different cultural identities and these are respected. ELSST contains, in the main, broader and more internationally-applicable concepts than HASSET. Examples of divergence for ELSST and HASSET are a) definitional and b) numerical.

There may be cases where a definition needs to change by a single word. An example of this could be SOCIAL ASSISTANCE. The ELSST definition refers to insurance: “ASSISTANCE IN MONEY OR IN KIND TO PERSONS, OFTEN NOT COVERED BY SOCIAL INSURANCE, WHO LACK THE NECESSARY RESOURCES TO COVER BASIC NEEDS.” The HASSET definition does not include this reference, because social insurance does not exist in the UK: “ASSISTANCE IN MONEY OR KIND TO PERSONS WHOSE INCOME IS BELOW A CERTAIN LEVEL AND WHO LACK THE NECESSARY RESOURCES TO COVER BASIC NEEDS.” Both definitions are appropriate and both are correct. They differ, however.

15 See http://elsst.ukdataservice.ac.uk and http://hasset.ukdataservice.ac.uk


HASSET contains more concepts than ELSST. This is entirely expected, as HASSET holds many concepts that are specific to the UK, culturally (for example, GENERAL CERTIFICATE OF SECONDARY EDUCATION). The aim is for ELSST’s concepts to be entirely internationally applicable; however, there may be occasions when it is agreed that a concept that is used in 99% of European countries should still be included as it is so important to the majority of Europe. An example of this is WAGES GUARANTEE FUND, something that does not exist in the UK (or HOUSE HUSBANDS, a concept that did not exist in some other countries some time ago). There are not expected to be many of these concepts, but provision has been made for them to appear only in ELSST and not in HASSET, as long as, structurally, these concepts have no Narrower Terms (NTs) which are shared between the two thesauri.

HASSET and ELSST have been mapped, with the nature of these mappings made clear. Shared concepts are deemed to have either ‘exact equivalence’ or ‘close equivalence’. In order to make exact or close equivalence between shared concepts entirely transparent, the following notation, based on, but stricter than, SKOS and ISO standardised models, is used.

Both exact and close equivalence will always require that shared concepts have the same:

• Preferred Term (PT)

• Broader Term(s) (BTs)

Exact equivalence will be achieved when the following associated metadata are the same:

• Scope notes

• Scope note sources

Close equivalence will be achieved if the PTs and BTs match but one of these pieces of associated metadata is different in any way.
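As a minimal sketch of these rules (not the Service’s thesaurus management application), the snippet below classifies a pair of shared concepts as exact or close equivalents from their Preferred Terms, Broader Terms, scope notes and scope note sources; the example concepts and the Broader Term shown are invented.

```python
# A minimal sketch of the exact/close equivalence rules described above, assuming
# each concept is represented by its Preferred Term, Broader Terms, scope note and
# scope note source. Example data are invented for illustration.
from dataclasses import dataclass

@dataclass
class Concept:
    preferred_term: str
    broader_terms: frozenset
    scope_note: str = ""
    scope_note_source: str = ""

def equivalence(hasset: Concept, elsst: Concept) -> str:
    # Shared concepts must agree on Preferred Term and Broader Terms.
    if (hasset.preferred_term != elsst.preferred_term
            or hasset.broader_terms != elsst.broader_terms):
        return "not shared"
    # Exact equivalence additionally requires identical scope notes and sources.
    if (hasset.scope_note == elsst.scope_note
            and hasset.scope_note_source == elsst.scope_note_source):
        return "exact equivalence"
    return "close equivalence"

a = Concept("SOCIAL ASSISTANCE", frozenset({"SOCIAL WELFARE"}), "UK-specific definition", "HASSET")
b = Concept("SOCIAL ASSISTANCE", frozenset({"SOCIAL WELFARE"}), "International definition", "ELSST")
print(equivalence(a, b))  # -> "close equivalence"
```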

This work to manage two thesauri simultaneously has not only created resource efficiencies in their management, but has also created an environment in which concepts may be quickly and easily shared, mapped and released to the appropriate products, thus enhancing the quality of each tool.

Persistent identification

The last, but perhaps most important, element in the creation of metadata (and data) pathways is persistent identification.

The UK Data Service adds DataCite DOIs to all its data collections, with well-documented versioning practices.

It is also developing methodologies for more granular data citation, at variable level. The first results of this work are visible via the qualitative material citable within QualiBank16, which allows a user to create a paragraph-level citation of interview material, thus providing an even more detailed, atomic view of the metadata. This is a dynamic function, allowing the user to select just those parts of a qualitative text that they would like to cite and generating, on the fly, the relevant reference. This citation is then presented to the user, ready for them to copy and paste into the document of their choice.

16 http://discover.ukdataservice.ac.uk/QualiBank


Figure 3: QualiBank citation generator

The UK Data Service has also been working with the British Library’s ODIN project as a stakeholder to investigate how DOIs may be linked with ORCIDs to create more linkages for the benefit of data users.

The application of DOIs also allows the Service’s metadata to become part of the wider DataCite search. The Service encourages the academic community to support data citation wherever possible and works to make data citation standard practice within academia, pushing for these citations to appear within the mesh of references published in journals.
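As an illustration only, DataCite’s public REST API can be queried for the DOIs and titles of registered datasets. The sketch below is indicative rather than a documented Service workflow: the query string is invented and the response fields shown reflect the API’s general JSON:API shape, which may evolve.

```python
# A sketch of querying the DataCite REST API for registered DOIs and titles.
# The query string is illustrative; response field names may change over time.
import requests

resp = requests.get(
    "https://api.datacite.org/dois",
    params={"query": "ukdataservice", "page[size]": 5},  # illustrative query
    timeout=30,
)
for record in resp.json().get("data", []):
    attrs = record.get("attributes", {})
    titles = attrs.get("titles") or [{}]
    print(attrs.get("doi"), "-", titles[0].get("title", ""))
```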

Conclusions

Social science data provide a rich resource for policy-makers, academics and business in identifying trends and emerging issues in the economic and societal arenas; however, social science data are different from those of the hard sciences, because they are ‘personal’. Access to these types of data must be carefully managed.

There are several initiatives these days which are pushing the open data agenda and the UK Data Service is engaging with many of these. It has released open data where possible and made more open those data which cannot be fully open. Linking social science data can create statistically disclosive variables, which all social science archives must prevent. Linking their metadata, on the other hand, can augment discoverability to the vast benefit of data users globally.

Social science data, in many ways, map the human world. For their discoverability to be optimised, the metadata associated with them must similarly represent the world at large. This is the work that the UK Data Service has been taking forward recently, and with which it will continue.

Keywords: Metadata; Discovery; Persistent Identifiers; Thesauri


References

Chowdhury, G., 2010. Introduction to Modern Information Retrieval, Third Edition. London: Facet.

Etheridge, A., Wilson, M., Sande, T. and Ramfos, A., 2002. LIMBER: Language Independent Metadata Browsing of European Resources. Colchester: University of Essex. Available at: <http://www.data-archive.ac.uk/media/223876/limber_final_report.pdf> [Accessed 22 August 2014].

The Consultative Committee for Space Data Systems, 2012. Recommendation for space data system practices: reference model for an Open Archival Information System (OAIS): recommended practice. Washington: CCSDS. Available at: <http://public.ccsds.org/publications/archive/650x0m2.pdf> [Accessed 11 August 2014].

Corti, L. and Gregory, A., 2011. CAQDAS Comparability. What about CAQDAS Data Exchange? Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 12(1), Art. 35. Available at: <http://nbn-resolving.de/urn:nbn:de:0114-fqs1101352> [Accessed 21 August 2014].

Corti, L., Van den Eynden, V., Bishop, L. and Woollard, M., 2014. Managing and Sharing Research Data: A Guide to Good Practice. London: Sage.

Discovery, 2012. Discovery open metadata principles. [online] Available at: <http://www.discovery.ac.uk/businesscase/principles/> [Accessed 18 August 2014].

International Federation of Library Associations and Institutions, 2009. Functional Requirements for Bibliographic Records: Final Report. The Hague: IFLA.

NISO, 2011. ISO 25964: Thesauri and interoperability with other vocabularies: Part 1: Thesauri for information retrieval. Baltimore: NISO.

NISO, 2013. ISO 25964: Thesauri and interoperability with other vocabularies: Part 2: Interoperability with other vocabularies. Baltimore: NISO.

Open Definition, 2014. Open Definition. [online] Available at: <http://opendefinition.org/od/> [Accessed 18 August 2014].

Pratt, J., Gordon, P. and Plamping, D., 2005. Working whole systems. London: The King’s Fund.

Silberman, R. and Wolf, C., 2012. Data without Boundaries: A European Project to Enhance Access to Official Microdata. European DDI User Meeting, 3-4 December 2012, Bergen, Norway. [online] Available at: <http://www.eddi-conferences.eu/ocs/index.php/eddi/eddi12/paper/view/46/35> [Accessed 19 August 2014].

Skevakis, G., Makris, K., Kalokyri, V., Arapi, P. and Christodoulakis, S., 2014. Metadata management, interoperability and Linked Data publishing support for Natural History Museums. International Journal on Digital Libraries, 14, pp. 127-140.

UK Data Service, 2014. Data Access Policy. Colchester: University of Essex.

University of Essex, 2006. Multilingual Access to Data Infrastructures of the European Research Area: MADIERA: Final report: HPSE-CT-2002-00139. Brussels: European Commission. Available at: <http://www.data-archive.ac.uk/media/1688/MADIERA_finalreport.pdf> [Accessed 22 August 2014].


The Implementation of Basel Committee BCBS 239: An Industry-Wide Challenge for International Data Management
Malcolm Chisholm, Visiting Fellow in Financial Technologies at Loughborough University, and President, AskGet.com Inc ([email protected])

Introduction

“BCBS 239” is a common shorthand term for the paper Principles for effective risk data aggregation and risk reporting (BCBS, January 2013), published by the Basel Committee on Banking Supervision (BCBS). It is a global regulation that challenges Global Systemically Important Banks, and a wide range of other financial services companies, to mature their data governance and data management practices in order to achieve compliance within the next two years. This paper seeks to identify the capabilities that the broad community of data management professionals, consultants and product vendors, as well as academics in financial information, needs to understand in order to plan for compliance. It does so by analyzing the principles outlined in BCBS 239, together with a more detailed set of requirements based on these principles and issued by the BCBS as part of a self-assessment survey. Results were grouped into four main categories: leadership capabilities, organizational capabilities, methodology capabilities and technology capabilities. Interestingly, fewer technology capabilities were found to be required than either organizational or methodology capabilities.

1. The Challenge of BCBS 239

As the theory and practice of data management advance, certain sectors seem to play a leading role. Financial services companies are arguably such a sector, having been early adopters of relational database technology and data modeling techniques. Today, regulators in financial services are becoming increasingly interested in data management concerns, and regulations are becoming more focused on data management. While it is too early to say whether this will be replicated in other economic sectors, it is worth examining the implications of recent financial services regulations to better understand their impact on data management. A major recent regulation that fits into this category is BCBS 239.

“BCBS 239” is a common shorthand term for referring to the paper Principles for effective risk data aggregation and risk reporting (BCBS, January 2013) published by the Basel Committee on Banking Supervision (BCBS). The regulation is also sometimes known as “PERDAR”. It was first published as a consultative paper in June 2012 (BCBS 2012), with comments due by September 2012, and then in its final form in January 2013. The problem that BCBS 239 seeks to address is stated as follows:

1. One of the most significant lessons learned from the global financial crisis that began in 2007 was that banks’ information technology (IT) and data architectures were inadequate to support the broad management of financial risks (BCBS, January 2013).

In order to address the inadequacy of the banks’ IT and data architectures, BCBS 239 outlines a set of supervisory expectations that are to be adopted by Global Systemically Important Banks (“G-SIBs”) by January 2016, and which may at the discretion of national supervisory bodies be applied to Domestic Systemically Important Banks (“D-SIBs”). The expectations are stated as 14 principles, of which 11 apply directly to banks, and the remaining 3 apply to supervisors, although they have some implications for banks.


The BCBS conducted a self-assessment survey in 2013 and received responses from 30 G-SIBs. The results were published in December 2013 (BCBS December 2013). Despite concerns expressed by respondents to the consultation, there has been no extension to the January 2016 compliance deadline.

In addition to the firm deadline, banks need to clearly understand the scope of BCBS 239. Any organisation to which a bank outsources its processing must also comply, as indicated by this statement in the final regulation:

19. All the Principles included in this paper are also applicable to processes that have been outsourced to third parties (BCBS, January 2013).

Given the complex web of interdependencies in processing and reporting transactions, this statement alone must be taken as indicating that BCBS 239 affects a range of organizations far beyond a relatively small number of G-SIBs. Furthermore, the categories of risk assessment to which the regulation applies are open-ended, as indicated by this statement in the final regulation:

17. These Principles also apply to all key internal risk management models, including but not limited to, Pillar 1 regulatory capital models (eg internal ratings-based approaches for credit risk and advanced measurement approaches for operational risk), Pillar 2 capital models and other key risk management models (eg value-at-risk) (BCBS, January 2013).

Thus banks cannot satisfy BCBS 239 by focusing efforts on a few narrow areas, such as credit risk or market risk. Rather, the principles must be implemented wherever any form of risk management exists in a bank.

Additionally, while a self-assessment survey was initially carried out, independent validation of compliance will be needed in the future. This is again clearly stated in the final form of BCBS 239:

29. A bank’s risk data aggregation capabilities and risk reporting practices should be:

(a) Fully documented and subject to high standards of validation. This validation should be independent and review the bank’s compliance with the Principles in this document. The primary purpose of the independent validation is to ensure that a bank’s risk data aggregation and reporting processes are functioning as intended and are appropriate for the bank’s risk profile (BCBS, January 2013).

From this it can clearly be seen that many members of the financial services industry must respond to the challenge of complying with BCBS 239. It will be difficult for them to escape by avoiding the designation of “G-SIB”, by directing efforts to the IT and data architectures of a few special silos, or by attempting to offload responsibilities to some other party. However, even if they have the will to comply, the institutions affected (referred to for brevity here as “banks”) appear to face a considerable challenge in achieving compliance, as indicated by the results of the self-assessment survey. Key to success will be the need for banks to understand the capabilities they will have to acquire or develop.

2. Evolution of Data-Centricity in Financial Regulation and the Position of BCBS 239

BCBS 239 represents another stage in increasing regulatory awareness of, and interest in, matters concerning data management. The Dodd-Frank Act (Dodd-Frank Wall Street Reform and Consumer Protection Act (H.R. 4173), 2010) contained references to “data” in a general way, such as the right to request data from a financial institution, and maintenance of confidentiality of data received by the regulators. It did mention standardization of data, but did not address data management, data governance, or data architecture. Solvency II, although an insurance regulation, was more specific about data quality, introducing the notion of data credibility, meaning proof that the data used for Solvency II reporting are used in the normal operational and decision-making processes of the enterprise (CEIOPS 2009, page 43, Para 3.66). Data credibility is a new dimension of data quality, not mentioned in the traditional data quality dimensions used in the data management industry (e.g. Myers 2013). Yet BCBS 239 breaks new ground by making concerns about data central to the regulation, rather than incidental. It remains to be seen whether this trend will continue in future regulations.

BCBS 239’s position in a series of increasingly data-centric regulations is important to the banks because it is addressing areas that the banks have not had to respond to in a detailed way before. The question therefore arises as to what capabilities banks must possess in order to comply with BCBS 239. These capabilities must address the first 11 of the 14 principles in BCBS 239, which are enumerated in Table 1.

Table 1: Principles of BCBS 239

Theme I: Overarching Governance and Infrastructure

1. Governance – A bank’s risk data aggregation capabilities and risk reporting practices should be subject to strong governance arrangements consistent with other principles and guidance established by the Basel Committee.

2. Data architecture and IT infrastructure – A bank should design, build and maintain data architecture and IT infrastructure which fully supports its risk data aggregation capabilities and risk reporting practices not only in normal times but also during times of stress or crisis, while still meeting the other Principles.

Theme II: Risk Data Aggregation Capabilities

3. Accuracy and Integrity – A bank should be able to generate accurate and reliable risk data to meet normal and stress/crisis reporting accuracy requirements. Data should be aggregated on a largely automated basis so as to minimise the probability of errors.

4. Completeness – A bank should be able to capture and aggregate all material risk data across the banking group. Data should be available by business line, legal entity, asset type, industry, region and other groupings, as relevant for the risk in question, that permit identifying and reporting risk exposures, concentrations and emerging risks.

5. Timeliness – A bank should be able to generate aggregate and up-to-date risk data in a timely manner while also meeting the principles relating to accuracy and integrity, completeness and adaptability. The precise timing will depend upon the nature and potential volatility of the risk being measured as well as its criticality to the overall risk profile of the bank. The precise timing will also depend on the bank-specific frequency requirements for risk management reporting, under both normal and stress/crisis situations, set based on the characteristics and overall risk profile of the bank.

6. Adaptability – A bank should be able to generate aggregate risk data to meet a broad range of on-demand, ad hoc risk management reporting requests, including requests during stress/crisis situations, requests due to changing internal needs and requests to meet supervisory queries.

Theme III: Risk Reporting Practices

7. Accuracy – Risk management reports should accurately and precisely convey aggregated risk data and reflect risk in an exact manner. Reports should be reconciled and validated.

8. Comprehensiveness – Risk management reports should cover all material risk areas within the organisation. The depth and scope of these reports should be consistent with the size and complexity of the bank’s operations and risk profile, as well as the requirements of the recipients.

9. Clarity and usefulness – Risk management reports should communicate information in a clear and concise manner. Reports should be easy to understand yet comprehensive enough to facilitate informed decision-making. Reports should include meaningful information tailored to the needs of the recipients.

10. Frequency – The board and senior management (or other recipients as appropriate) should set the frequency of risk management report production and distribution. Frequency requirements should reflect the needs of the recipients, the nature of the risk reported, and the speed at which the risk can change, as well as the importance of reports in contributing to sound risk management and effective and efficient decision-making across the bank. The frequency of reports should be increased during times of stress/crisis.

11. Distribution – Risk management reports should be distributed to the relevant parties while ensuring confidentiality is maintained.

Theme IV: Supervisory Review, Tools and Cooperation

12. Review – Supervisors should periodically review and evaluate a bank’s compliance with the eleven Principles above.

13. Remedial actions and supervisory measures – Supervisors should have and use the appropriate tools and resources to require effective and timely remedial action by a bank to address deficiencies in its risk data aggregation capabilities and risk reporting practices. Supervisors should have the ability to use a range of tools, including Pillar 2 [the supervisory review process of Basel II].

14. Home/host cooperation – Supervisors should cooperate with relevant supervisors in other jurisdictions regarding the supervision and review of the Principles, and the implementation of any remedial action if necessary.

The principles are grouped into four closely related topics, termed “themes” in Table 1. As can be seen, Theme IV, Supervisory Review, Tools and Cooperation, applies directly to regulators rather than to banks.

3. Approach to Capability Assessment

The principles in Table 1 are somewhat vague, and unfortunately only a few definitions are supplied in the regulation. For example, “Governance” would seem to need to include “Data Governance”, but neither “Governance” nor “Data Governance” is defined.

Fortunately, the survey carried out in 2013 broke down each principle into a total of 87 specific requirements for the first 11 principles (the ones that apply directly to banks). This breakdown can be analyzed in terms of the capabilities needed.


The capabilities were divided into 4 categories:

1. Leadership: the capabilities needed to set vision, specify mission, earmark resources, and bear the final accountability for adherence to the full scope of BCBS 239.

2. Organization: the capabilities for implementing vision and mission, planning, assigning roles and responsibilities, mobilizing resources, recruitment, setting work plans, and ensuring work gets done for the scope of BCBS 239.

3. Methodology: the capabilities for applying knowledge in a structured, documented, and repeatable way to design data stores, design processes, assure quality, and carry out operations, for the scope of BCBS 239.

4. Technology: automated tools that support the scope of BCBS 239.

Each requirement for each principle was analyzed to determine what aspects of it fitted into each of the above four capability groups. This was done by consulting the supporting texts within BCBS 239 to extract specific capabilities, and by comparing the requirements to capabilities recognized in industry practice and generally agreed norms in data management.

Each requirement had its corresponding capabilities in the 4 groups scored as follows:

0 = No to weak involvement
3 = Supporting role
5 = Required

This scoring scale was chosen so that the capabilities required for each principle could be aggregated into the 4 categories listed above. No weighting of capabilities was attempted, nor were possible dependencies among the capabilities analyzed.
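To make the scoring and aggregation step concrete, the minimal Python sketch below scores individual requirements on the 0/3/5 scale across the four categories and sums them to a principle-level total per category. The requirement names and scores shown are purely hypothetical illustrations, not values taken from the study.

```python
# Illustrative sketch of the 0/3/5 capability scoring described above.
# Requirement names and scores are hypothetical, not taken from the study.

CATEGORIES = ["Leadership", "Organization", "Methodology", "Technology"]
SCALE = {0: "No to weak involvement", 3: "Supporting role", 5: "Required"}

# Each requirement of a principle is scored against the four categories.
requirements = {
    "req-1 (hypothetical)": {"Leadership": 0, "Organization": 5, "Methodology": 5, "Technology": 5},
    "req-2 (hypothetical)": {"Leadership": 0, "Organization": 3, "Methodology": 5, "Technology": 3},
}

def aggregate(reqs):
    """Sum the 0/3/5 scores per category to give a principle-level total."""
    totals = {c: 0 for c in CATEGORIES}
    for scores in reqs.values():
        for category, score in scores.items():
            assert score in SCALE, "only 0, 3 or 5 are valid scores"
            totals[category] += score
    return totals

print(aggregate(requirements))
# e.g. {'Leadership': 0, 'Organization': 8, 'Methodology': 10, 'Technology': 8}
```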

The analysis did require interpretation of some of the technical terms employed in the regulation. There is a glossary that is part of it, but not all terms are defined in it. For instance, the term “business owner” is used but not defined. In such cases, the common understanding in data management was used, or the context was consulted to elucidate the meaning.

The results are summarized in Table 2.


Table 2: Assessment of Capabilities for BCBS 239

No. Principle Title | Leadership | Organisation | Methodology | Technology
1 Governance | 58 | 93 | 120 | 31
2 Data architecture and IT infrastructure | 0 | 40 | 65 | 40
Subtotal | 58 | 133 | 185 | 71
3 Accuracy and Integrity | 5 | 60 | 75 | 53
4 Completeness | 11 | 35 | 35 | 20
5 Timeliness | 0 | 10 | 35 | 25
6 Adaptability | 0 | 25 | 40 | 33
Subtotal | 16 | 130 | 185 | 131
7 Accuracy | 0 | 40 | 45 | 30
8 Comprehensiveness | 18 | 25 | 20 | 8
9 Clarity and usefulness | 35 | 33 | 45 | 10
10 Frequency | 0 | 10 | 15 | 0
11 Distribution | 0 | 10 | 5 | 10
Subtotal | 53 | 118 | 130 | 58
12 Review | 0 | 10 | 0 | 5
13 Remedial actions and supervisory measures | 0 | 10 | 5 | 10
14 Home/host cooperation | 0 | 5 | 0 | 0
Subtotal | 0 | 25 | 5 | 15
Total | 127 | 406 | 505 | 275
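As a small arithmetic cross-check, the Python sketch below recomputes the theme subtotals and the grand total from the per-principle rows of Table 2. The scores are copied from the table; the grouping of principles into the four themes follows the subtotal rows shown above.

```python
# Recompute the subtotals and totals in Table 2 from its per-principle rows.
# Each tuple is (Leadership, Organisation, Methodology, Technology).
themes = {
    "I":   {1: (58, 93, 120, 31), 2: (0, 40, 65, 40)},
    "II":  {3: (5, 60, 75, 53), 4: (11, 35, 35, 20), 5: (0, 10, 35, 25), 6: (0, 25, 40, 33)},
    "III": {7: (0, 40, 45, 30), 8: (18, 25, 20, 8), 9: (35, 33, 45, 10), 10: (0, 10, 15, 0), 11: (0, 10, 5, 10)},
    "IV":  {12: (0, 10, 0, 5), 13: (0, 10, 5, 10), 14: (0, 5, 0, 0)},
}

grand_total = [0, 0, 0, 0]
for theme, principles in themes.items():
    subtotal = [sum(col) for col in zip(*principles.values())]
    grand_total = [g + s for g, s in zip(grand_total, subtotal)]
    print(f"Theme {theme} subtotal:", subtotal)
print("Total:", grand_total)  # expected: [127, 406, 505, 275]
```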

4. Conclusion

The overall scores shown in Table 2 indicate that Leadership is the category requiring the fewest capabilities. This is not surprising, as Leadership activities will be oriented to setting up the program to comply with BCBS 239, rather than dealing with detailed operational aspects. Perhaps more surprising is that Technology has the next lowest score. This means that banks are unlikely to be able to address compliance with BCBS 239 simply by purchasing and implementing technology. Clearly, Technology is important, and vital for some areas, but it is not a full solution for BCBS 239 requirements.

The next highest scoring group is Organisation. This implies that many capabilities will need to be provided by staff working within a set of defined relationships and with defined goals. The highest scoring category is Methodology, meaning that organizations must have defined ways of working in particular areas of concern, within defined organizational structures and using defined technology.

The finding that a great deal of the capability banks need to comply with BCBS 239 comes from Methodology, and not from Technology, raises the question of whether the current state of methodology, particularly in Data Governance and Data Management, is adequate to meet this challenge. The January 2016 deadline for compliance with BCBS 239 suggests that the FSB presupposes that methodology today is sufficient for this purpose. However, this presupposition cannot be confirmed, and further research will be needed to determine whether it is well founded.

An example of this can be found within Principle 2, Data Architecture and IT Infrastructure. One of the requirements specified by BCBS in the survey is Data Taxonomies. This is described as follows:

A bank should establish integrated data taxonomies and architecture across the banking group, which includes information on the characteristics of the data (metadata), as well as use of single identifiers and/or unified naming conventions for data including legal entities, counterparties, customers and accounts.
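A minimal sketch of what one entry in such a taxonomy might look like is given below. The Python class, all field names and the example identifier are hypothetical illustrations of "metadata plus single identifiers and unified naming conventions"; they are not drawn from BCBS 239 or from any bank's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaxonomyEntry:
    """One node of an integrated data taxonomy: a single identifier,
    a unified name, descriptive metadata and a place in the hierarchy.
    All field names and example values here are hypothetical."""
    entry_id: str                     # single identifier used group-wide
    preferred_name: str               # unified naming convention
    entry_type: str                   # e.g. "legal_entity", "counterparty", "customer", "account"
    definition: str                   # business definition (metadata)
    parent_id: Optional[str] = None   # position in the taxonomy hierarchy
    synonyms: List[str] = field(default_factory=list)

entry = TaxonomyEntry(
    entry_id="ENTITY-000123",
    preferred_name="Example Bank AG",
    entry_type="legal_entity",
    definition="A consolidated legal entity of the banking group.",
)
print(entry.preferred_name, entry.entry_type)
```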

This requirement was analyzed in terms of the four categories as shown in Table 3.

Table 3: Analysis of Capabilities Needed for Data Taxonomies

Leadership
Capability: No implication, as it is covered by other leadership capabilities.

Organisation
Capability: Governance and management for all taxonomies must be implemented (Score = 5).
Comment: Accountabilities must be established for each taxonomy. Some individual or individuals must be accountable for resolving definitional concerns, revising the taxonomy, and promulgating new versions.

Methodology
Capability: Governance and management for all taxonomies must be implemented (Score = 5).
Comment: Accountabilities must be established for each taxonomy. Some individual or individuals must be accountable for resolving definitional concerns, revising the taxonomy, and promulgating new versions.

Capability: Content of taxonomies must be of high quality (e.g. definitions, terminology) (Score = 5).
Comment: This capability refers to the need to develop taxonomies that can be clearly understood and easily used. Both the substantive and formal aspects of entries in taxonomies must be addressed.

Technology
Capability: Technology to support all taxonomies must be implemented (Score = 5).
Comment: An environment where taxonomies can be documented, consulted, and distributed is needed.

With respect to data taxonomies, technology provides a container where they can be stored. However, the content that goes into such containers cannot be developed by the technology. This must be done by staff, who require defined ways of working to ensure the content is of high quality. Indeed, the value of the technology to the company depends entirely on the quality of the content held within it. This provides a useful illustration of why methodology capabilities are more numerous than technology ones.

The extent to which ad hoc ways of working are used in place of methodologies will likely influence the success with which financial services companies address BCBS 239, as will the attention paid to technology relative to methodology. However, assessing this will have to wait until the deadline for compliance has passed.

References

BCBS (2012). Consultative Document: Principles for effective risk data aggregation and risk reporting. Retrieved from http://www.bis.org/publ/bcbs222.pdf

BCBS (January 2013). Principles for effective risk data aggregation and risk reporting. Basel Committee on Banking Supervision. Retrieved from http://www.bis.org/publ/bcbs239.pdf

BCBS (December 2013). Progress in adopting the principles for effective risk data aggregation and risk reporting. Retrieved from http://www.bis.org/publ/bcbs268.htm

CEIOPS (2009). CEIOPS’ Advice for Level 2 Implementing Measures on Solvency II: Technical Provisions – Article 86 f Standards for Data Quality (former CP 43). Retrieved from https://eiopa.europa.eu/fileadmin/tx_dam/files/consultations/consultationpapers/CP43/CEIOPS-L2-Final-Advice-on-TP-Standard-for-data-quality.pdf

Dodd-Frank Wall Street Reform and Consumer Protection Act (H.R. 4173) (2010). Retrieved July 2010 from https://www.govtrack.us/congress/bills/111/hr4173/text

Myers, D. (2013). Dimensions of Data Quality Under the Microscope. Information Management Magazine. Retrieved from http://www.information-management.com/news/dimensions-of-data-quality-under-the-microscope-10024529-1.html?zkPrintable=1&nopagination=1


Adopting a situated learning framework for (big) data projects

Martin Douglas, Cranfield University, School of Management
Joe Peppard, European School of Management and Technology, Berlin

1. Background

The rapidly growing availability of data, including new forms of data (e.g. social media and location data), is widely seen as a significant opportunity to derive new insights to create value (Davenport: 2014). Related capabilities are also now widely regarded as an important dimension of corporate competitiveness (Kettinger & Marchand: 2011, Davenport: 2009, Marchand, Kettinger & Rollins: 2001, Davenport, Harris, De Long & Jacobson: 2001). While this has prompted many initiatives implementing a variety of Data Analytics technologies (Ranjan & Bhatnagar: 2011, Bose: 2009), many result in mixed outcomes (Marchand & Peppard: 2013, Yeoh & Koronios: 2010, Wixom & Watson: 2001, Cooper, Watson & Wixom: 2000), i.e. generating ’a wealth of Data but a poverty of insight’ (Douglas & Peppard: 2013). While projects often focus on technical implementation, several researchers argue that human and social factors are important (Marchand & Peppard: 2013, Yeoh & Koronios: 2010, Hopkins, Lavalle & Balboni: 2010, Wang & Wang: 2008, Marchand et al: 2001).

This paper focuses on how such human insight is generated from data. This activity can be characterised in several ways, and fundamental to the activity is a clear conceptualisation of ‘data’ and ‘insight’. While there does not appear to be clarity or a consensus on these constructs (Douglas & Peppard: 2013, Kettinger & Li: 2010), this paper uses Checkland and Holwell’s (1998) human-centred ideas as its starting point. However, they do not explore the process of transforming data into knowledge in detail. Turning to other disciplines concerned with knowledge creation as a research focus, several characterisations or perspectives can be identified (Douglas & Peppard: 2013). These are summarised in Figure 1 below, highlighting their typical unit of focus and analysis in relation to the phenomenon.

Figure 1: Various disciplinary perspectives on generating insights (Douglas & Peppard: 2013)

While the fields identified all contribute useful ideas (some indicated in blue italics in Figure 1), this paper focuses on the question: what can be revealed using a situated learning, communities of practice framework? Wenger’s (1998) framework seems to offer a promising ‘lens’ through which to examine the phenomenon at work in a project context. While Weick’s (1995) sensemaking perspective may overlap, it is typically used to focus at the individual rather than the group or team level, although it is nevertheless broadly consistent with the approach taken here.

2. Methodology

Given the exploratory nature of the research, an abductive research strategy was adopted (Blaikie: 2010). While ethnography and Action Research were considered equally good at providing rich access to the phenomenon, ethnography was chosen as it encompasses a range of active participation (Blaikie: 2010, Eden & Huxham: 2002). Ethnography also focuses on translation into theoretical language rather than a co-production of meaning. An idealist ontology (Blaikie: 2010) is adopted, together with an engaged researcher stance, most closely aligning to what Blaikie (2010) describes as a ‘Mediator of Languages’. In ethnographic terms, the focus is mainly etic rather than emic (Hammersley & Atkinson: 2007), reflecting the sensitising role of theory and a pursuit of theoretical meaning and explanation.

2.1 Case selection and overviews

Rich organisational contexts were sought, to elicit the multidisciplinary project team interaction dimensions anticipated across different departments and communities of practice. Projects were selected which explored new, unfamiliar sources of data. Two such organisations (both based in the UK) were found, representing interesting and contrasting contexts:

• PoshCouncil A district council was seeking to develop new propositions, premium versions of existing services and cross-sell services more actively. They envisaged using third party supplied (Acorn) household data about district inhabitants to identify likely customers and markets for such services. One of the authors worked with three groups’ propositions, jointly facilitating workshops for two, to identify what market insights and data validation they required to support their proposed business plans.

• InfraDig A significant rail infrastructure project was committed to build, collect and hand over an integrated set of data about the infrastructure being constructed, described as a ‘virtual railway’. One of the authors participated in two projects within InfraDig: one to specify requirements for a performance management system for the asset data collection effort, and the other to develop their information management strategy.

2.2 Data collected

Data was collected in a natural setting, with a researcher immersed on site in the two case settings for a period of 6 months (between January and July 2013). The researcher typically spent one day a week at the District Council site and two days a week at the InfraDig site. Three sources of qualitative data were used: researcher observation, project participant interviews and project documentation or artefacts.

In-depth, semi-structured interviews were held with project participants to gain a participant perspective on their ‘framing’ of the data project, their roles and the extent to which this might reflect their experience, study or practice backgrounds. The interviews also sought to identify any questions and objectives being addressed by the project (i.e. underlying learning objectives). An interview protocol was developed ahead of fieldwork, as part of formulating the research design. Notes were taken of meetings attended and they were recorded (wherever possible and practical), after gaining consent from participants. These were transcribed or summarised to facilitate subsequent review and coding (using Nvivo). Formal meeting outputs, relevant project documentation and other artefacts were collected (where feasible), e.g. requirement specification drafts and project update reports. These were also uploaded into Nvivo for review and coding.

Of particular importance were researcher field notes, captured in a research diary (Sing & Dickson: 2002, Emerson, Pretz & Shaw: 2001). Field notes were subsequently recaptured in electronic Microsoft Word form, to facilitate reflection and subsequent coding within Nvivo. It was not possible to record the facilitated workshops, given the number of participants, challenges in obtaining permission and technical feasibility, although copies of workshop outputs and some photos of post-it wall-charts were retained for reference and analysis.

Across both cases, 48 days were spent on site undertaking direct observation (with a corresponding number of field note journal entries). While on site, 14 in-depth participant interviews were conducted, 34 project meetings were attended, 23 additional meetings were held or attended, 3 workshops were (co-)facilitated, 49 artefacts were collected and 19 project artefacts or outputs were (co-)produced. Of the meetings attended, 39 out of 71 (55%) were recorded.

2.3 Data analysis

Our data analysis sought to provide a firm and transparent grounding for an explanatory account of the phenomenon (Van Maanen: 2011, Hammersley & Atkinson: 2007). This was based on a synthesis of the data collected rather than an emphasis on ‘pattern-seeking’ analysis and tabulation (Sing & Dickson: 2002, Miles & Huberman: 1994). With this in mind, various strands of analysis and methods were adopted, primarily to facilitate and aid reflection through iterative engagement with the data, from various starting points and at different levels of analysis, to illuminate different aspects of the phenomenon and create opportunities for triangulation (Sing & Dickson: 2002, Eden & Huxham: 2002). Theory played both a sensitising and a creative dialectical role (Blaikie: 2010) during data analysis. This was most explicit in using Wenger’s (1998) communities of practice framework as an a priori coding framework. Reflexiveness was also recognised as important and as occurring at various levels. It was therefore approached pervasively (captured in research diary entries and in memo entries during coding), used dialectically (Davies: 2008) and creatively to generate insights, and supported by seeking opportunities for ‘triangulation’ and by employing Mirroring and Contrasting approaches during reflection (Sing & Dickson: 2002). Our final findings are still in the process of being shared and discussed with participants, which will offer further opportunities for triangulation, dialectical discussion and reflection.


3. Findings

3.1 Emerging framework for generating insights from data

Considerable reflection and analysis revealed a clearer view of the phenomenon, highlighting its key constituent elements or constructs. These are illustrated in Figure 2 below.

Figure 2: Emerging framework for generating insight from data

Practitioner Engagement with a Phenomenon they are interested in (e.g. customer preference) emerges as central, either directly interacting with it (i.e. customers) or indirectly through related Data about it (e.g. Acorn data). Learning may occur through this engagement and new Knowledge is generated (or existing Knowledge confirmed or called into question). This learning can be about the Phenomenon of interest itself (the main objective) but it can also be about the related Data purporting to represent it, as well as how to improve the Engagement processes or activities themselves (e.g. handling, collecting, managing, presenting data, tools used).

Such new Knowledge can lead to ‘reframing’ or refinement of the existing Knowledge about the Phenomenon of interest, triggering new ideas as to Purposes (e.g. new service opportunities), new Questions about the Phenomenon, identifying new Data dimensions and (Data) Engagement ideas (e.g. new data fields, dimensions or categories, approaches to Data collection, designing/presenting sensemaking artefacts to facilitate such Data Engagement, selecting new tools or enhancing existing tools, etc.).

Tools play a mediating role and can facilitate or enable data collection and organisation as well as practitioner engagement with the data to perform analysis, present related findings and generate outputs (e.g. reports, screen displays, visualisations, Excel extracts, etc.). These outputs represent ‘sensemaking artefacts’ for other practitioners when presented to them or selected and used, which can prompt learning through sensemaking engagement with them in relation to their particular purposes and activities. Questions are posed about the Phenomenon (implicitly or explicitly), which are ‘framed’ by an overarching Purpose (e.g. increasing sales) and prior Knowledge and experience. These are situated within a practice and organisational function context.
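As an illustrative aid only, the sketch below encodes the constructs and relationships described above as simple Python types. The class and field names are shorthand introduced here for readability; they are not part of the framework's own notation, and the placeholder function stands in for the learning process rather than implementing it.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative shorthand for the constructs in Figure 2; not a formal model.

@dataclass
class Phenomenon:
    name: str                      # e.g. "customer preference"

@dataclass
class Data:
    about: Phenomenon              # data purports to represent the phenomenon
    source: str                    # e.g. a third-party household dataset

@dataclass
class Engagement:
    practitioner: str
    phenomenon: Phenomenon
    data: List[Data] = field(default_factory=list)   # indirect engagement via data
    tools: List[str] = field(default_factory=list)   # tools mediate collection, analysis, presentation

@dataclass
class Knowledge:
    about: Phenomenon
    # new knowledge can reframe purposes, questions and further data/engagement ideas
    new_purposes: List[str] = field(default_factory=list)
    new_questions: List[str] = field(default_factory=list)
    new_data_dimensions: List[str] = field(default_factory=list)

def learn(engagement: Engagement) -> Knowledge:
    """Placeholder: learning through engagement generates (or revises) knowledge."""
    return Knowledge(about=engagement.phenomenon)
```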

With this emerging framework, we explore InfraDig (one of our cases), paying particular attention to several communities of practice ideas, to highlight the value of using such a situated learning ‘lens’: identifying different practitioner groups, practice (and organisational) boundaries, spanning activity and related boundary artefacts. InfraDig, a rail construction project, is seeking to create an integrated set of data for operators in relation to the rail infrastructure being built (a ’virtual railway’), with which to inform and improve maintenance practice.

3.2 ‘The emperor has few clothes...’

‘We’re building two InfraDigs: we’re building the physical InfraDig and the virtual InfraDig. And it’s as simple as that. The loving care and attention that we pay to creating the physical world we should be giving the same love and attention to the virtual world because it’s the virtual world that often gets used for managing and maintaining the physical world.’ (engineer)

As an overall vision, it is compelling and sparks the imagination. It seems ambitious, futuristic and sexy, especially when illustrated using some three-dimensional prototype software – this allows a virtual tour of the infrastructure, with the ability to remove panels and layers and access data related to elements pointed to or when hovering with a mouse pointer. More pragmatically, value will be created using this data to inform ‘smarter maintenance’ practices during operation, thereby reducing total lifetime maintenance costs.

A huge document collection effort

Peering under the bonnet, to understand the mechanics of realising this vision, reveals a huge data collection effort, as illustrated in Figure 3 below.

Figure 3: A visual summary of the Data effort

Data collection builds on existing construction project practice, broadly involving engineers designing and specifying the physical infrastructure to be built, captured in design documentation (CAD, drawings, component specifications, etc.). These engineers are part of InfraDig, a project organisation constituted to build the infrastructure. InfraDig used this design documentation as a basis for procuring contracts with various construction firms or contractors, who use these design documents to plan and execute construction. InfraDig project managers and engineers assess delivery against these specifications before formal sign-offs and contract payment. ‘As-built’ changes are allowed for but have to be negotiated and agreed. Contractors are also responsible for delivering a final set of ‘as-built’ documentation, highlighting departures from the original design (called ‘red-lining’), as well as operating and maintenance instructions for the assets or components delivered. The ultimate infrastructure operators are also involved, initially in signing off on specifications, and again when finally taking delivery, during a transition process captured in a ‘handover plan’. They are also represented on the project itself, making up the majority of the Operations team, responsible for formulating and executing the handover plan and transition process, and for promoting knowledge transfer during the project.

Turning to the design documentation being collected, this is maintained in a document repository, organised in a complex hierarchy related to functional units of infrastructure, which can be decomposed into constituent components (e.g. an air conditioning system can be made up of various pump and other components). This repository and its structure are maintained by an InfraDig Asset Data team. The Asset Data team use a software package (which they call eB) to keep track of documents and link to them (e.g. to the CAD system supplied by the same vendor). While hosted and supported by the IT department, using an outsourced provider, eB is primarily administered by a super-user in the asset data team. On a day-to-day basis, though, a distributed team of InfraDig document controllers collect documents and update the repository. They are co-located with the InfraDig project staff members at the various construction sites, typically facing off against equivalent contractor staff, and are supported by a small central team of document controllers, who also monitor this activity and document quality through periodic audits. Data requirements are considered part of the specification and vary considerably for different types of equipment, so they have been specified and signed off by class (or domain area) by the relevant infrastructure operator engineers, working with the InfraDig design engineers and asset data team. Some experienced infrastructure asset data management consultants have been retained by InfraDig to assist.

A focus on data collection, tools and artefacts rather than use

While the compelling vision of a virtual railway is actively promoted to InfraDig staff and contractors (e.g. specific induction sessions), how the collected data will help illuminate maintenance challenges has not been fleshed out in detail. Indeed, within the asset data team and document controllers, the phenomenon of maintenance (and total cost of ownership) hasn’t been the focus of much attention, discussion or activity. Instead, the focus has been on the logistics of data collection and organisation. This includes the applications to capture and organise the data (e.g. fields to use, configuration, performance issues and reporting), recruiting staff, challenges to collect data from contractors (for whom this emphasis on data is new), getting project managers and site staff sufficiently focused on ensuring good data quality (in particular, seeing delivery of good data as equally important as delivering the physical infrastructure), and how best to organise the data collected. Certainly, frustration was expressed about delays in getting the operators to sign off on data requirements, but this was discussed in terms of their systems and process incompatibilities, complexity of their systems, the need for more standardisation, and relative data immaturity.

The project and contract frames focus attention on ‘deliverables’ that can be easily specified and assessed in terms of delivery, including the virtual railway data artefact, rather than on related tacit knowledge transfer. While formal training sessions can be shown to have been delivered, and are included for operating equipment, equivalent training (let alone experimentation) in the use of the virtual data asset or artefact isn’t reflected in the handover plan, nor does the operations team include operator asset data staff to encourage tacit knowledge transfer.


However, the infrastructure is being delivered and handed over in stages, which offers opportunities for early engagement and experimentation by operator staff.

Where are Maintenance, Finance (and Sustainability) practitioners?

Closely related to the absence of focus on data use (and associated cost reduction) is the absence of these ultimate users as active participants in InfraDig data initiatives. Some industry guidance emerged during fieldwork, on Building Information Management (BIM Industry Working Group: 2011), which suggests that both infrastructure operator finance and sustainability functions will have an active interest in such data (in addition to maintenance). They may seek to address questions related to maintenance costs and work together to do so, e.g. to build and refine appropriate Total Cost of Ownership (TCO) models grounded in and mapped to the underlying asset data. Their absence results in less opportunity to evolve a richer and more useful set of data through experimentation. Most importantly though, there is no opportunity to build ownership of, trust in, and familiarity with the data being collected, making data use, reflection on practice and the embedding of the data in improved practice less likely in due course.

‘Cross-border’ collaboration

In addition to the question of involving different practitioner communities, the effort is also characterised by working across organisational and contractual boundaries, i.e. ‘cross-border’ as opposed to just ‘cross-discipline’. While InfraDig are keen to demonstrate good/improving practice in terms of BIM compliance, the contractual and organisational structure does not support this. In fact, in the case of contractors, a perverse incentive or conflict of interest can be detected: they are likely to be bidding for subsequent maintenance work, where their unique (tacit) experience and knowledge may be a differentiator, whereas the project is now requesting effective knowledge transfer and codification of considerable knowledge for the operator, which may undermine this advantage. Overcoming such conflicts currently relies heavily on the professionalism of the parties involved, as well as transparency and the involvement and vigilance of the operations team.

Different practitioner groups, different agendas

The mapping exercise highlighted three broad data-related practitioner groupings within InfraDig: those focused on data collection (including the asset data team and the document controllers); the engineers, including designers as well as those engaged in construction project delivery; and IT. Construction contractors are seen as external but closely related to the core engineering and project manager groups, while the maintenance staff and operators are also identified as closely related to the engineers, most closely via the operations delivery team. Interviews with these groups, and observation of them during project meetings, revealed that they each have distinct agendas and concerns, with differing views of the purpose of creating a ‘virtual railway’ data artefact. In particular, the operations delivery team, responsible for the handover plan and closest to the infrastructure operators, has a very different (audit and evidence trail) view of its ultimate purpose from the purpose outlined earlier:

‘I know they mention the term digital and the real railway and distinguish between them... I’m not sure I entirely understand... what it is that this is trying to do and why it is important. I mean I know at a nitty gritty level... to make it easily accessible to our approving body without them having to spend years doing it, and being a legacy system for operators to see how we assured it...’ (operations team handover plan coordinator)

In spite of the critical interdependence of these various teams’ activities, they rarely, if ever, get together as a consolidated group to reflect on the broader ambition and vision, and clarify how to achieve desired outcomes, instead focusing on their particular roles and more immediate activities as they have framed them for themselves. This severely limits the scope for potential learning, reframing and innovation that can emerge to enhance the broader outcomes sought, i.e. collaboration to create and realise a ‘shared vision’.

Reliance on boundary-spanning ‘evangelists’

Considerable reliance is placed on a limited number of boundary spanners (Wenger: 1998) to bridge these communities or groups and to promote the importance of realising the vision outlined. Much of this activity can be characterised either as evangelising or as relationship management and communication. Nearly all of the spanners identified sit within Engineering and might be perceived as outsiders rather than as members of the communities of practitioners they are addressing, representing a potential identity and communication barrier as well. While there is broad CEO support, much of the drive for this activity seems grounded in the personal commitment of these individuals to improve maintenance outcomes, often based on prior personal experience.

‘So for me, when we go into the maintenance world - and one of the reasons I have that dinky little app which shows bits of wall coming off and barcodes, is because I want to get some of our sort of old-fashioned, sort of lever arch file and plan chest for drawings maintenance friends into the... to show them a view of the 21st and 22nd century, and the way which data could be used.’

The boundary spanners do not meet as a group to review and compare notes on their spanning activity, identify possible synergies from sharing intelligence, or take a holistic, ‘joined-up’ approach. Indeed they don’t recognise or frame their roles as boundary-spanning and as a similar or common practice. This points to significant opportunities to recognise, support and coordinate targeted boundary-spanning activity better, informed by clear practice development and learning objectives (e.g. through industry standardisation, training, professional associations and accreditation initiatives).

Boundary artefacts – engagement opportunities abound

The project contract, design, as-built, and operational documentation loom large as boundary artefacts between the project team designers and construction contractors on the one hand, and with the infrastructure operators on the other (together with the handover plan). These have evolved and been institutionalised across the industry over many years and many rail infrastructure projects, mainly between engineers, who bring considerable tacit engineering usage and experience to bear in interpreting them during use.

In the same vein, the proposed virtual railway data artefact, supported and mediated by its underpinning software, will represent an extended and enhanced boundary and sensemaking artefact in relation to the built infrastructure. To the extent the proposed data artefact builds on existing usages and organisation standards, this is likely to aid sensemaking and ease its use and institutionalisation into practice for the recipients, but may require multiple presentation formats to accommodate different operator usages, standards and systems. However, a strictly sequential and ‘deliverables’ approach to handover, e.g. on overall contract or project completion, may limit boundary engagement and iterative learning and may lead to the data being viewed as a creature of the project rather than a co-created artefact identified with their evolving practice.


4. Discussion

4.1 Bringing practitioner learning into focus

For researchers, using the Communities of Practice framework proved useful in two main ways. The first was that it helped to bound or situate the research and phenomenon within a particular practice setting. While this is similar to the idea of focusing on ‘site-shifting’ adopted by Huang, Newell, Huang & Pan (2014) within the strategy-as-practice literature, this framework allows us to manageably examine both group and individual level aspects of the phenomenon. It brings boundaries, related spanning activity, artefacts and the agents involved clearly into view.

The second major contribution was its conceptual integration of learning, knowledge, artefacts and tools, providing very useful constructs and terminology with which to examine, analyse and then explain what was being observed. This sensitised and alerted the researcher to relevant cues during data collection, and prompted reflection during data analysis, generating new insights. In particular, this brought practitioner data use, as well as data’s socio-material nature, into sharper focus, both as a boundary artefact and as a codification or reification of practitioner knowledge about a phenomenon. Importantly, it helped distinguish data from the closely associated algorithmic elements within an IT system or artefact, allowing us to examine data more closely in its own right. This focus on data allows us to build on and extend earlier work by Orlikowski (1991) on the duality and socio-materiality (Orlikowski: 2006) of technology, while sharing an emphasis and focus on situated practitioner knowing and technology use (Orlikowski: 2002, 2000).

It also proved useful for practitioners. While mindsets were identified by participants as a major challenge, they found it difficult to articulate the challenge more specifically. The framework made it easier to ‘frame’ the problem, identify which practitioner groups were involved, what learning was envisaged, and where to focus attention to improve collaborative engagement (e.g. forums being required, lack of engagement, etc.). This accords with and complements work on strategy blindness and cognitive entrenchment (Arvidsson, Holmstrom & Lyytinen: 2014), although with a primary emphasis on practitioner learning rather than on changing practice, though the two are likely to be closely interrelated. In our case, this made the challenge more manageable for practitioners to bound, and shifted the emphasis from broad communication to facilitating more specific engagement. It also offered a common language with which to unpack and discuss challenges and proposed interventions.

However, like all frameworks, the communities of practice framework has limitations. Most importantly in this case setting, it doesn’t address the clear tension and interplay between practice communities and organisational structures, which emerged from our research, e.g. communities that span functions and organisational or contractual boundaries. This points to the need to extend or complement it with other theory, which focuses more explicitly on structural features and mechanisms, as in Huang et al (2014) within the strategy-as-practice literature, as well as considering relevant complementary theory and methods from collaboration and innovation literatures.

4.2 Researchers rather than Statisticians...

While either data or a phenomenon of interest can be a starting point for data projects, their inter-relationship emerges as central for generating new insight. The choice of starting point may be influenced by factors such as the ease of practitioner access to directly observe the phenomenon. However, data represents only a limited ‘snapshot’ of the phenomenon through a particular, theory-laden ‘lens’. It is necessarily reductionist, often involving attributing labels and categorisations. One is reminded of the analogy that ‘the map is not the territory’ (Korzybski: 1931). In this instance, the data is not the phenomenon and, particularly for social phenomena, measurement and validity challenges abound. Similarly, big data initiatives often use external or secondary data, which may also throw up validity considerations. So, rather than emphasise the importance of quantitative analytical skills only, this paper argues for the importance of broader research skills to address epistemological and validity issues, actively considering the role of theory used or being developed (Blaikie: 2010), and assumptions or boundary conditions, making these more explicit and transparent to data users. More sophisticated and appropriate research strategies and mixed methods may also emerge, taking account of the maturity of our understanding of the phenomenon and the data available (e.g. exploratory versus hypothesis testing).

4.3 Limitations

Given the nature of the ethnographic, case-based approach, our work focused on achieving an explanatory rather than a causal outcome. The findings are clearly ‘grounded’ in the cases studied and generalizability is not being sought, although the explanations and related theoretical insights may have value in other settings. While our emerging thinking was transparently discussed with participants throughout the research, we still need to review the findings as discussed and presented here with them more formally, which will provide a further opportunity for dialectical triangulation to enhance our understanding.

We did not seek to assess the relative levels of codification of practice and institutional maturity of the various communities or groups identified, which may prove an interesting line of further inquiry, when applying the communities of practice framework to similar research and practical settings.

5. Conclusion

This paper has presented and illustrated clear value in using the communities of practice framework and adopting a situated learning ‘frame’ for (big) data projects. For researchers, this provides a practitioner-oriented ‘lens’, which highlights several useful features of generating insight from data that may otherwise be missed. In particular, it highlights practitioner domain knowledge as important. Where such endeavours span practitioner communities or functional and organisational boundaries, its use highlights the importance of considering boundaries and related spanning activities. For practitioners, the paper illustrates how such a framing can be practically adopted and the valuable complementary focus it brings to practitioner learning, their evolving knowledge, and the support of related boundary spanners and their activities, in order to increase the chances of realising value from such initiatives.

The paper also demonstrates clear value from engaging with the organisational learning and knowledge management fields when researching the phenomenon of generating insights from data. Together with earlier work (Douglas & Peppard: 2013), this paper also underlines the need for better theorising of data and data use as phenomena, in order to better research and exploit big data opportunities.


References

Arvidsson, V, Holmstrom, J & Lyytinen, K (2014) Information systems use as strategy practice: A multi-dimensional view of strategic information system implementation and use, Journal of Strategic Information Systems, 23, pp.45-61

BIM Industry Working Group (2011) BIM Management for value, cost and carbon improvement – Building Information Management (BIM) Working Party Strategy Paper (A report for the Government Client Construction Group), available at http://www.bimtaskgroup.org/wp-content/uploads/2012/03/BIS-BIM-strategy-Report.pdf

Blaikie, N (2010) Designing Social Research: The Logic of Anticipation (2nd Ed.), Polity Press, Cambridge, United Kingdom

Bose, R (2009) Advanced Analytics: Opportunities and Challenges, Industrial Management & Data Systems, 109(2), pp.155-172

Checkland, P & Holwell, S (1998) Information, Systems and Information Systems – making sense of the field, John Wiley & Sons, Chichester, England

Cooper, B, Watson, H & Wixom, B (2000) Data Warehousing Supports Corporate Strategy at First American Corporation, MIS Quarterly, 24(4), pp.547-567

Davenport, TH (2009) Make Better Decisions, Harvard Business Review, 87(11), pp.117-123

Davenport, TH, Harris, JG, De Long, DW & Jacobson, AL (2001) Data to Knowledge to Results: Building an Analytic Capability, California Management Review, 43(2), pp.117-138

Davies, CA (2008) Reflexive Ethnography: A Guide to Researching Selves and Others, 2nd ed., Routledge, Abingdon, United Kingdom

Douglas, MD & Peppard, J (2013) Theorizing Data, Information and Knowledge constructs and their inter-relationship, UKAIS 2013 conference proceedings, available at http://aisel.aisnet.org/ukais2013/

Eden, C & Huxham, C (2002) Action Research, in Essential Skills for Management Research, Partington, D (Ed), Sage Publishing, London, United Kingdom, pp.254-272

Emerson, RM, Pretz, RI & Shaw, LL (2001) Participant Observation and Fieldnotes, in Handbook of Ethnography, Atkinson, P, Coffey, A, Delamont, S, Lofland, J & Lofland, L (Eds), Sage, London

Hammersley, M & Atkinson, P (2007) Ethnography: Principles in practice, 3rd Edition, Routledge (Taylor & Francis Group), New York, United States of America

Huang, J, Newell, S, Huang, J & Pan, S (2014) Site-shifting as the source of ambidexterity: Empirical insights from the field of ticketing, Journal of Strategic Information Systems, 23, pp.29-44

Hopkins, MS, Lavalle, S & Balboni, F (2010) The New Intelligent Enterprise 10 Insights: A First Look at the New Intelligent Enterprise Survey on Winning with Data 10 Data Points: Information and Analytics at Work, MIT Sloan Management Review, 52(1), pp.22-31

Kettinger, WJ & Li, Y (2010) The Infological Equation Extended: Towards Conceptual Clarity in the Relationship between Data, Information and Knowledge, European Journal of Information Systems, 19 (March), pp.409-421

Kettinger, WJ & Marchand, DA (2011) Information Management Practices (IMP) from the Senior Manager’s Perspective: an Investigation of the IMP Construct and its Measurement, Information Systems Journal, 21, pp.385-406

Korzybski, A (1931) A Non-Aristotelian System and its necessity for rigour in Mathematics and Physics (presented at a meeting of the AAAS), published as Supplement III, Science & Sanity, pp.747-761

Marchand, DA, Kettinger, WJ & Rollins, JD (2001) Information Orientation – The Link to Business Performance, Oxford University Press, Oxford

Marchand, DA & Peppard, J (2013) Why IT Fumbles Analytics, Harvard Business Review, 91(1), pp.104-112

Miles, MB & Huberman, AM (1994) Qualitative Data Analysis, 2nd Edition, Sage Publishing, California, United States of America

Orlikowski, WJ (1991) The Duality of Technology: Rethinking the Concept of Technology in Organizations, Organization Science, 3(3), pp.398-427

Orlikowski, WJ (2000) Using Technology and Constituting Structures: A Practice Lens for Studying Technology in Organizations, Organization Science, 11(4), pp.404-428

Orlikowski, WJ (2002) Knowing in Practice: Enacting a Collective Capability in Distributed Organizing, Organization Science, 13(3), pp.249-273

Orlikowski, WJ (2006) Material Knowing: the Scaffolding of Human Knowledgeability, European Journal of Information Systems, 15(5), pp.460-466

Ranjan, J & Bhatnagar, V (2011) Role of Knowledge Management and Analytical CRM in Business: Data Mining based Framework, The Learning Organization, 18(2), pp.131-148

Sing, V & Dickson, J (2002) Ethnographic Approaches to the Study of Organizations, in Essential Skills for Management Research, Partington, D (Ed), Sage Publishing, London, United Kingdom, pp.117-135

Van Maanen, J (2011) Tales of the Field: On Writing Ethnography, 2nd Edition, University of Chicago Press, Chicago, United States of America

Wixom, BH & Watson, HJ (2001) An Empirical Investigation of the Factors Affecting Data Warehousing Success, MIS Quarterly, 25(1), pp.17-41

Wang, H & Wang, S (2008) A Knowledge Management Approach to Data Mining Process for Business Intelligence, Industrial Management & Data Systems, 108(5), pp.622-634

Weick, K (1995) Sensemaking in Organizations, Sage Publications, Thousand Oaks, California

Wenger, E (1998) Communities of Practice: Learning, Meaning & Identity, Cambridge University Press, New York

Yeoh, W & Koronios, A (2010) Critical Success Factors for Business Intelligence Systems, Journal of Computer Information Systems, 50(3), pp.23-32


Exploring different information needs in Building Information Modelling (BIM) using Soft systems

Mohammad Mayouf and David Boyd, Birmingham School of Built Environment, CEBE Faculty, Birmingham City University
Sharon Cox, School of Computing, Telecommunications and Network, CEBE Faculty, Birmingham City University

Abstract

Managing information in construction projects is a crucial task and information technology (IT) has been employed to tackle this issue. Building information modelling (BIM) is considered to be the first truly global digital construction technology; it supports a process that aims to inform and communicate project decisions through the creation and use of an intelligent 3D model. BIM is claimed to be an effective tool for information exchange, which involves digitally representing the physical and functional characteristics of a building. However, the involvement of interdisciplinary stakeholders within a construction project implies different data and information requirements that need to be supported in BIM. The complexity of data required to deliver the information needed by different stakeholders in construction projects is an on-going issue, thus an understanding of the nature of this complexity is needed. This paper aims to investigate the different information needs from multiple stakeholder perspectives, and raise awareness of the data requirements that BIM needs to incorporate. CATWOE, one of the modelling tools of soft systems, is used to surface the different information requirements of three groups of stakeholders. The data have been obtained using interviews conducted with the building design team, facility management team and occupants of a newly operated building. The paper concludes with a proposed road map suggesting that different data are required to support the design and operation of the building. Further work is needed to assess BIM capabilities in terms of integrating the data to support the information needs of different stakeholders, and whether interoperability issues mean that additional tools are required to support the BIM process. This paper raises awareness about the information needed from BIM by different stakeholders in order to create a more productive building.

Introduction

Conceptually, the term ‘information’ is considered to be ambiguous due to the variety of ways it can be interpreted or used (Buckland, 1991). In fact, humanity has been living with and experiencing various kinds of information since the 4th millennium BC (also called “the Bronze Age”), when writing was first invented (Floridi, 2010). Although it is not intended to focus on this revolution of information, it is important to acknowledge the different beliefs and perspectives about information. In the construction industry, the way that information is managed, presented and interpreted requires a relatively high level of technical knowledge and experience that also considers the various interdisciplinary stakeholders involved (Kassem, Brogden and Dawood, 2012). Moreover, providing the data needed to satisfy the information needs of all stakeholders is a challenging task using traditional methods such as 2D drawings and paper-based documents. Building information modelling (BIM) is part of the advancements in information technology (IT) that have dramatically changed the way that information is managed, exchanged and transformed in the construction industry, allowing more efficient ways for stakeholders to collaborate (Eastman et al., 2011). BIM is underpinned by digital technologies, which unlock more efficient methods of designing, creating and maintaining building assets, providing a collaborative way of working among project stakeholders (HM Government, 2012). An asset can be described as any item, thing or entity that has potential or actual value to an organisation (IAM, 2013). Assets in buildings involve the physical building, systems (mechanical, electrical and plumbing) and facilities (Fallon and Palmer, 2007).

Although BIM provides a collaborative information exchange platform, it is not yet clear how this information can be perceived by stakeholders who are not directly involved in the design process, such as facility managers and end-users (i.e. the building occupants), who need information about the building and should contribute to the building design process. Exploring the potential of BIM to provide information to support different perspectives may contribute to the construction of better buildings. This paper aims to explore how the different information needs of different stakeholders can be accommodated in BIM. Data are collected through interviews with the building design team, a facility manager and end-users. Soft systems is then used to analyse the interview data in order to help surface the different perspectives of each stakeholder using a rich picture, from which a CATWOE analysis and root definitions are derived to capture the core values of each stakeholder. The analysis focuses on the different viewpoints, the information requirements and, finally, BIM in terms of delivering these information requirements.
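For readers unfamiliar with CATWOE, it structures a root definition around Customers, Actors, Transformation, Weltanschauung (worldview), Owner and Environmental constraints. A minimal sketch of how one stakeholder group's analysis might be recorded is shown below; the example values are invented placeholders for illustration only, not findings from the interviews reported later in this paper.

```python
from dataclasses import dataclass

@dataclass
class Catwoe:
    """CATWOE elements used to build a root definition for one stakeholder group.
    The example values below are invented placeholders, not interview findings."""
    customers: str        # who benefits from (or is affected by) the transformation
    actors: str           # who carries out the transformation
    transformation: str   # the input -> output the system performs
    weltanschauung: str   # worldview that makes the transformation meaningful
    owner: str            # who could stop the activity
    environment: str      # constraints taken as given

facility_management_view = Catwoe(
    customers="building occupants",
    actors="facility management team",
    transformation="building condition data -> planned maintenance decisions",
    weltanschauung="a well-maintained building supports productive occupants",
    owner="building owner / estates director",
    environment="budget limits, handover information received from the design team",
)
print(facility_management_view.transformation)
```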

Literature Review

Building Information Modelling (BIM): A general overview In the construction industry, large volumes of information are generated during the building design process, and time is often wasted searching for, sharing and sometimes recreating information (Persson, Malmgren and Johnsson, 2009). There was therefore a need to establish a common data environment, comprising processes and procedures that enable reliable information exchange between project team members and other stakeholders (CIC and BIM Task Group, 2013). The advent of computer technology has supported the rapid, efficient development and management of building deliveries in the built environment, making it possible to share and exchange data, as well as to organise complicated construction processes, within one digital environment. BIM is known as the first truly global digital construction technology and is steadily being deployed in every country (HM Government, 2012). It is considered a procedural and technological shift within architecture, engineering, construction and operations (Succar, 2009).

BIM can be seen as an evolution of Computer Aided Design (CAD) systems, but provides more interoperable and intelligent information (Aranda-Mena et al., 2009). According to Penttilä (2006), BIM is defined as a set of interacting policies, processes and technologies generating a methodology that aims to manage the essential building design and project data in a digital format throughout the building's life cycle. In addition, Aranda-Mena et al. (2009) point out that BIM is an ambiguous term, which can be described as a software application, as a process for designing and documenting building information, or as a whole new approach that requires new policies, contracts and relationships between stakeholders. However, to avoid confusion, BIM has one central role: as a product to produce and manage data models that relate to information about buildings (Björk, 1995).

Construction Information: why BIM? According to Buckland (1991), there are two traditional meanings of 'information': the process of telling something and the thing that is being told. In addition, he identified three uses of the word 'information': information-as-process, information-as-knowledge and information-as-thing. Buckland emphasized that the logic behind considering information-as-thing is that communicated knowledge, beliefs and opinions are subjective, conceptual and even personal. Therefore, the
representation, expression or description would be "information-as-thing" (Buckland, 1991). Historically, information has been treated as a thing (e.g. a drawing) and communicated between architects, different construction parties and stakeholders (Khatib, Chileshe and Sloan, 2007). Today, building architects, apart from their primary design skills, need to create and communicate information in a way that previous generations never had to (Race, 2012). Architects have to be increasingly careful in obtaining and filtering the information they require because of the increasing complexity of buildings; in other words, the information they generate is subjected to immense scrutiny by all members of the project team. Another reason is that information is changing because of the continual update of processes and software as the industry's understanding of BIM evolves (Race, 2012). Construction may be considered one of the most information-intensive industry sectors, as it involves so many parties contributing information to a building project.

BIM represents an approach to creating and managing information over the whole life cycle of a building (Liu, Eybpoosh and Akinci, 2012). The building life cycle consists of the following stages: production, construction, building in use and, finally, the end of the building's life (WBCSD, 2007). BIM is described as information-centric software, providing information modelling, unlike CAD, which provides only a graphic model of a building (Ibrahim and Krawczyk, 2003). BIM does not discriminate between the types of information that can be considered (Race, 2012). It supports the coordination of the following information: construction documentation, visualisation of building design and construction, material and equipment quantities, cost estimates, 4-D construction sequencing and reporting, scheduling, and fabrication data and tool paths (Garber, 2014). The operation of BIM is based on digital databases of building information, and by managing and storing these databases, BIM can capture and present data in ways that are appropriate for designers, contractors, clients or vendors (Ibrahim and Krawczyk, 2003). Succar (2009) describes the data flows in BIM as critical for BIM stakeholders. These data flows include the transfer of structured/computable data objects (e.g. databases), semi-structured data (e.g. spreadsheets) and non-structured/non-computable data (e.g. images). It is important to note that data flows do not only involve sending and receiving semantically rich data objects (the 'smart objects' that are the main components of BIM models; see Figure 1), but also sending and receiving document-based information (Froese, 2003).

Figure 1: BIM and their Objects (Succar, 2009)
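To make the distinction between semantically rich 'smart objects' and document-based information more concrete, the following minimal Python sketch (illustrative only; the class name, attributes and values are assumptions and do not come from any BIM schema) shows an object that carries computable geometry and functional properties alongside a reference to a conventional document.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SmartObject:
    # A semantically rich BIM object: geometry plus functional properties.
    object_id: str
    category: str                       # e.g. "Wall", "AHU", "Door"
    geometry: Dict[str, float]          # simplified bounding-box geometry
    properties: Dict[str, str] = field(default_factory=dict)
    linked_documents: List[str] = field(default_factory=list)  # document-based information

wall = SmartObject(
    object_id="W-101",
    category="Wall",
    geometry={"length_m": 6.0, "height_m": 3.2, "thickness_m": 0.3},
    properties={"fire_rating": "60 min", "u_value_w_m2k": "0.25"},
    linked_documents=["wall-types-OM-manual.pdf"],
)

# Computable data: the object's attributes can be queried and aggregated
# by software, unlike the linked document, which must be read by a person.
print(wall.properties["fire_rating"])

The point of the sketch is simply that the smart object is directly computable, whereas the linked manual is not.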

BIM: Information and stakeholders' needs According to Volk, Stengel and Schultmann (2014), BIM can be seen from two perspectives: a narrow sense and a broader sense (see Figure 2). The narrow sense comprises the digital building model itself as a central information management repository (Eastman et al., 2011). The broader sense offers a more holistic picture, covering functional, informational, technical and organisational aspects (see Figure 2). Depending on the stakeholders' needs and project requirements, a BIM model may be used to support and perform expert services such as energy or environmental analysis for buildings (AIA, 2008). Potential applications and required functionalities of BIM need to suit stakeholder and project needs (Volk, Stengel and Schultmann, 2014). Thus there are two types of expert software that might interact with BIM. The first type is data input applications, providing services of data import, data capture and monitoring, and data processing or transformation of captured data into BIM. The second type is data output applications, providing technical analysis and reports (Volk, Stengel and Schultmann, 2014). Functionalities are based on process maps, which define the logical view of information and activities, and which also define stakeholders' roles in relation to a particular functionality (ISO Standard, 2010). Functionality (also called data output) depends on stakeholder, building and project requirements (e.g. 4-D scheduling). These functionalities are either inherent in BIM or attached to it as independent expert applications.

Figure 2: Broader and Narrow Sense of BIM (Eastman et al., 2011)

Industry Foundation Classes (IFC), the BIM data model, represents functionalities for the building through Information Delivery Manual (IDM) frameworks and Model View Definitions (MVD) to provide relevant information, facilitate data exchange and avoid possible ambiguities (Venugopal et al., 2012). The purpose of IDM and MVD is to specify storage, conversion and information exchange in BIM (see Figure 3). The information flow is described using an exchange requirement model (ERM), which describes the flow with regard to the requesting users in their roles, the relevant information for a particular process, the moment of the information flow, the content, and the receiving user (Venugopal et al., 2012). Data exchange between different BIM systems is enhanced using IFC and the International Framework for Dictionaries (IFD) to minimize information loss (ISO Standard, 2007). However, interoperability issues remain a major obstacle in BIM data exchanges, and data structures are continually being developed to address this (buildingSMART, 2013).
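As a rough illustration of what an exchange requirement implies in practice, the sketch below checks whether model objects carry the attributes a given exchange requires. It is a simplification under stated assumptions: the attribute lists and the check_exchange_requirement helper are hypothetical and are not part of the IFC, IDM or MVD specifications.

from typing import Dict, List

# Hypothetical exchange requirement: attributes each object category must
# provide for a handover to the facility management team.
EXCHANGE_REQUIREMENT = {
    "Pump": ["manufacturer", "model", "install_date", "maintenance_interval"],
    "Space": ["name", "floor", "area_m2"],
}

def check_exchange_requirement(objects: List[Dict]) -> List[str]:
    """Return human-readable messages for objects missing required attributes."""
    issues = []
    for obj in objects:
        required = EXCHANGE_REQUIREMENT.get(obj.get("category"), [])
        missing = [attr for attr in required if attr not in obj]
        if missing:
            issues.append(f"{obj.get('id', '?')} ({obj.get('category')}): missing {', '.join(missing)}")
    return issues

model_objects = [
    {"id": "P-01", "category": "Pump", "manufacturer": "ACME", "model": "X2"},
    {"id": "SP-3.12", "category": "Space", "name": "Seminar Room", "floor": "3", "area_m2": 42.0},
]

for message in check_exchange_requirement(model_objects):
    print(message)  # e.g. "P-01 (Pump): missing install_date, maintenance_interval"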

Simultaneously, although BIM is spreading worldwide, stakeholders such as facility managers and building owners scarcely use BIM, and are not yet fully integrated in BIM development and implementation (Becerik-Gerber and Rice, 2010). Rüppel and Schatz (2011) suggest that BIM should be considered a socio-technical system in which humans form the social aspect and the way they interact with the building forms the technical aspect. Focusing on integrating these aspects in the BIM development process implies the need to consider information from multiple perspectives (Mayouf, Boyd and Cox, 2014), but current applications of BIM do not support the integration of a wide variety of information (Ding, Zhou and Akinci, 2014). Therefore, a multiple-perspective insight into the different information needs of BIM is necessary, not only for the future maintenance and operation of a building, but also to promote greater collaboration at an early design stage and so achieve the design of more efficient buildings (Choi, Choi and Kim, 2011).

Figure 3: Information Delivery Manual (IDM) framework (ISO Standard, 2010)

Methodology This research aims to investigate different information requirements in BIM from multiple perspectives. A soft systems approach has been used as a process of inquiry into a problematic situation, one which acknowledges systemic complexity (Mehregan, Hosseinzadeh and Kazemi, 2012). Soft Systems Methodology (SSM) is a systems-based methodology used to tackle real-world problems; it enables the analyst to understand different perspectives on the situation and proposes a way to improve problematic situations through a learning approach (Checkland, 2000). The analysis of the problem situation is based on the results of interviews with the building delivery team, facility management team and building occupants. CATWOE, one of the modelling tools in SSM, is used to demonstrate the different information requirements of multiple perspectives through the formation of root definitions. CATWOE helps to simplify complex situations and to understand different actors' perspectives and perceptions, leading to a better description of the problem to be addressed (Vacik et al., 2014). To analyse the different information needs in BIM from multiple perspectives, conceptual models will be developed from the root definitions in future work.
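Purely as an illustration of how the CATWOE elements feed a root definition, the sketch below records the six elements as a simple data structure and renders them as a root-definition-style sentence. The Catwoe class and its summarise method are hypothetical conveniences, not part of SSM itself, and the field values paraphrase the FMT perspective reported later in this paper.

from dataclasses import dataclass

@dataclass
class Catwoe:
    customers: str
    actors: str
    transformation: str
    worldview: str
    owners: str
    environment: str

    def summarise(self) -> str:
        # A compact, root-definition-style summary of the record.
        return (f"A system owned by {self.owners}, operated by {self.actors}, "
                f"to {self.transformation}, for {self.customers}, "
                f"on the assumption that {self.worldview}, "
                f"within the constraints of {self.environment}.")

fmt_view = Catwoe(
    customers="the facility management team and building occupants",
    actors="the BIM coordinator and project director",
    transformation="involve the FMT at the design stage and inform them about building systems",
    worldview="adequate O&M information enables proactive facility management",
    owners="the BIM coordinator and project director",
    environment="resistance to change, training needs, cost and time",
)

print(fmt_view.summarise())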

Three parties were involved in the data collection process: the building delivery team, the facility management team and the building occupants. The data were collected using semi-structured interviews, which allowed a consistent approach to be adopted to develop a rich picture of information requirements in BIM. These
parties were selected to represent those who create and manage the BIM model (the building delivery team) and those who are affected by the information created in BIM (the facility management team).

A rich picture of the situation was developed through interviews with the above-mentioned parties, who are involved in the construction, maintenance and occupancy of a newly constructed university building. Interviews were conducted with four members of the building design team (project architect, energy assessor, project manager and BIM coordinator), two members of the facility management team (facility manager and building services supervisor) and three building occupants (members of university teaching staff). The interview questions aimed to review their experience with buildings, the information they acquire from BIM and the role of this information in delivering an effective and productive building environment.

The interview data were analysed and CATWOE was used to develop root definitions for each perspective, derived from the rich picture. Three root definitions were derived from the rich picture to demonstrate the different views of the problem situation being explored in this paper. It is important to mention that the analysis presented is specific to the university building developed, and thus other building projects may adopt different ways of communicating information.

Results This section presents the findings of the interviews with the three parties mentioned earlier in this paper. The results begin with the rich picture (Figure 4), which is derived from the interviews conducted with the three parties. For clarity, Table 1 explains some of the terms used in the rich picture. This is followed by three tables (Tables 2, 3 and 4), which show the root definitions and CATWOE analysis from the three perspectives: the building delivery team, the facility management team and the building occupants.

Figure 4: Rich Picture Developed from Both Interviews and Literature Review

Table 1: Terms Used in the Rich Picture and Their Meanings

Term: What it means
Project information: Briefing information; building specification; building size and layout; building assets and facilities; energy information (rating and consumption).
Building systems: Mechanical, electrical and plumbing (MEP).
Asset information: Manufacturer information; equipment start-up and emergency shut-down dates; operation and maintenance manual.
Aspects in building: Space management; energy control; lighting control, both natural (sunlight) and artificial (lighting system).

Table 2: CATWOE analysis and Root Definitions for the Building Delivery Team

CATWOE:
C (customers): client, contractors, facility management team
A (actors): BIM coordinator, project director, energy assessor
T (transformation): a process to capture data provided by the designer and contractors and to generate information, processes and activities about the building project, coordinated and managed by the BIM coordinator, who liaises continuously with the contractors and project director from the design stage to the execution stage. At the operational stage, all information related to the operation and management of the building is passed to the FMT by the project director.
W (worldview): BIM is a knowledge-sharing platform; information should be shared, updated and accessible to all project stakeholders, should support the delivery of the building, and should allow any changes to be notified to all project stakeholders.
O (owners): architect (designer), contractors
E (environmental constraints): BIM protocols from government, time, training cost, use of BIM software.

Root definition: A system owned by the project architect (designer) and contractors. The system is created by the designers and contractors, but managed and coordinated by the BIM coordinator, who communicates project information with the project director and energy assessor, who monitor changes from the design stage through to building execution. The information generated is used to assist contractors during building execution, who are in direct communication with both the BIM coordinator and the project director. The client is updated and informed by the project director about the project status. On completion of the project, the FMT receives operation and management information from the project director.

Table 3: CATWOE analysis and Root Definitions for Facility Management Team (FMT)

CATWOE:
C (customers): facility management team (FMT), building occupants, architect, contractors
A (actors): BIM coordinator, project director, client
T (transformation): a process to enable the facility management team to be involved at the design stage during the BIM development process, and to be informed about the layout of building systems and facilities so as to identify any possible issues that may affect the building itself or the building occupants.
W (worldview): we should have adequate information about building systems, their operation and maintenance, and we should be informed about our authority over different aspects within the building.
O (owners): BIM coordinator, project director
E (environmental constraints): resistance to change, contractors' policies, necessary training, additional costs, time constraints.

Root definition: A system owned by the BIM coordinator and project director in which information is communicated to and from the FMT in order to understand building systems and manage facilities more efficiently. The client is involved in the decision-making if any changes are to be made. The information generated by this system, such as the layout of building systems, space information and facilities' operation, can be used to help the FMT manage the building's operation and maintenance (O&M) and to work proactively to maintain a good level of occupant satisfaction.

Table 4: CATWOE analysis and Root Definitions for Building Occupants

CATWOE:
C (customers): building occupants, architect, client, contractors, BIM coordinator, energy assessor
A (actors): client, project director, building occupants
T (transformation): a process to provide building occupants with an information platform or another means of communication, to ensure that the needs of occupants are being addressed.
W (worldview): the building should serve the needs and requirements of the people who will occupy it.
O (owners): university management (client)
E (environmental constraints): cost, time constraints, language barriers, software.

Root definition: A system owned by the university management (client) in which graphical and non-graphical information about the building is provided by the BIM coordinator and passed through the project director to the client. The client communicates project information to the building occupants and, in return, the occupants state their needs through an information platform or by any means that can support addressing those needs. The information generated, such as workspaces, facilities provided, way-finding and contacts for reporting faults, can be used to serve the building occupants' needs. This may imply changes that affect the building's layout and the services provided, and, if necessary, a re-evaluation of the energy performance of the whole building.

Discussion The rich picture, CATWOE analysis and root definitions have revealed the embedded complexity of the different information requirements from the three perspectives. In this section, the analysis draws attention to the different viewpoints, clarifies the information requirements and considers whether the different information needs can be accommodated in the BIM environment.

Different viewpoints The rich picture (Figure 4) illustrates the complexity involved in the delivery of a building project. The information provided from each perspective is based on the parties' duties (building delivery team and FMT) or lived experiences (FMT and building occupants). There are several conflicts
identified within the rich picture that are also evident in the CATWOE analysis. To begin with, the client was the main connection between the building delivery team and the building occupants; the client pointed out that selected senior academics were involved to identify user requirements for the building. However, building occupants claimed that their needs were not being addressed. The gap identified is that the senior staff provided information on behalf of the other staff members, who perceive that their requirements have not been addressed.

The second conflict is between the building occupants and the FMT: the building occupants claim that communication with the FMT is inadequate, while the FMT point out that the building occupants should be aware of their environment and know who to contact in the FMT when necessary. Occupants are mainly concerned with what directly affects them, which is often related to their working spaces and the facilities they use frequently. The occupants believe that the FMT do not provide them with the necessary contacts for when a problem occurs.

The third, and probably most significant, conflict, which also has implications for the first two, is between the FMT and the building delivery team. On one side, the project manager and the BIM coordinator claim that the FMT are provided with the information they need to operate and manage the building facilities, which should allow the FMT to work proactively. On the other side, the FMT pointed out that the information they have does not give a clear picture of the building systems and their layout. They added that they should have authority over some aspects of the building, such as public space. The reasons for these conflicts can be summarized as follows:

• Poor involvement of FMT at the design stage during BIM development process,

• FMT report faults or damages reactively,

• There is no clarity over which aspects or areas of the building the FMT should have authority over.

Information requirements One of the major causes of the conflicts described in the previous section is the inability to determine and communicate the required information at the right time; communicating this information at the design stage would avoid many issues. From the rich picture, it can be seen that the key players in managing information and processes among the building delivery team, FMT and building occupants are the project architect (creating information), the BIM coordinator (managing information) and the project director (communicating information). It is important to note that there are two levels of complexity in terms of managing and communicating the information generated from BIM. The first is within the building delivery team, where issues such as the interoperability of different models of building systems, updating the model and communicating all changes can be problematic and may lead to conflicts, but these are not discussed in this paper. The second level is more holistic, as it involves the FMT and the building occupants and concerns how BIM should contribute positively to their satisfaction with the delivered building.

From the FMT's perspective, their involvement during the BIM development process is considered essential and can avoid several issues and conflicts. As BIM forms a central information platform in which all building systems are integrated within one single model, it allows the FMT to understand the layout of building systems through enhanced 3D visualisation, which in turn can alert the building delivery team to any possible issues related to the operation and/or maintenance of these systems or facilities. In addition, once the building starts operating,
the BIM model can be used as a reference for operation and maintenance (O&M) information for facilities. This can replace the enormous O&M manuals, which can be quite problematic and lack reliable information. As the FMT support the management of operations within the building, it is important to clarify their level of authority over various aspects of the building. For instance, authority can play an important role in a certain space (e.g. a social-learning space) in terms of controlling fixed and movable facilities when dealing with a particular event. In this sense, involving the FMT during the BIM process can effectively support their understanding of the functionality and flexibility of the spaces.

From the building occupants' perspective, communicating and understanding their needs at an early stage will increase their level of satisfaction with the building developed. Within the context of a university building, building occupants range from senior management through academic staff to students. Although some senior staff were involved in defining the occupants' needs to the client, different perceptions based on different school requirements and personal views were not taken into consideration. The occupants claim that an information platform through which they could communicate their needs to the client would be useful, so that maximum satisfaction can be achieved. The information platform could take the form of a discussion forum or focus groups in which the client, project director and even the architect can share information. This can promote effective collaboration, making building occupants aware of their built environment and helping to ensure the delivery of an effective and productive building.

BIM Role in satisfying different information requirements Both the rich picture and the CATWOE analysis show that BIM has the potential to extend its benefits to include both the FMT and the building occupants. The shift from 3D CAD, where data are limited to graphical entities such as lines, arcs and circles, to BIM, which integrates semantically rich information including geometry and functional properties to form 'smart objects', has allowed the whole building life cycle to be taken into account (Azhar et al., 2008). However, looking at the conflicts (see 'Different viewpoints') and the information required to resolve them (see 'Information requirements'), it is important to be aware of how, and to what extent, BIM can support solving these issues. In terms of facility management (FM), BIM supports various aspects such as asset management through the use of the Construction Operations Building information exchange (COBie), providing spatial information (spaces and their grouping into floors and other zones) and physical information (components, i.e. facilities, and their grouping into product types and other systems) about different facilities (BIM Task Group, 2013). Other efforts have explored aspects such as improving the operation of building maintenance using BIM (Motawa and Almarshad, 2013). However, these solutions and others did not involve the FMT as part of the BIM development process, which would play an important role in testing the validity and reliability of these methods. Mayouf, Boyd and Cox (2014) have identified that a greater involvement of the FMT in BIM may increase building performance from a facilities management perspective.
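As a hedged sketch of the kind of spatial and physical groupings that COBie-style handover data enables for the FMT, the example below organises a few component records by space and by product type. The field names loosely echo the COBie Component and Space sheets but are a simplified, invented subset rather than the actual schema.

from collections import defaultdict

# Simplified COBie-style component records (illustrative subset, not the schema).
components = [
    {"name": "AHU-01", "type": "Air Handling Unit", "space": "Plant Room 1", "system": "Ventilation"},
    {"name": "LUM-214", "type": "LED Luminaire", "space": "Seminar Room 3.12", "system": "Lighting"},
    {"name": "LUM-215", "type": "LED Luminaire", "space": "Seminar Room 3.12", "system": "Lighting"},
]

# Spatial view: what sits in each space (useful when a fault is reported for a room).
by_space = defaultdict(list)
for c in components:
    by_space[c["space"]].append(c["name"])

# Physical view: grouping by product type (useful for planned maintenance).
by_type = defaultdict(list)
for c in components:
    by_type[c["type"]].append(c["name"])

print(dict(by_space))
print(dict(by_type))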

The satisfaction of end-users (building occupants) is the ultimate aim for the client when a new building project is delivered. This paper has explored the importance of communicating information between the building delivery team and the building occupants through the client. Some applications, such as Customer Interactive Building Information Modelling (CIBIM), have been developed to support the correlation between the digital and built environments for building occupants (Lee and Ha, 2013). Although that application was limited to the design of apartments, it allowed for more effective communication with the designers. However, when considering more complex environments such as public buildings, it is important to establish a
simple but effective way of communicating requirements between building occupants and the building delivery team. Virtual reality (VR) can provide a way of communicating design decisions, allowing building occupants to experience the building in a realistic virtual environment (Ding, Zhou and Akinci, 2014). Another way of communicating project information could be through an application in which the designer, client and building occupants share and update information about the BIM model. However, interoperability, one of BIM's major issues, remains an obstacle, and further work is still needed to improve information communication across the boundary between designers and contractors (Succar, 2009).

Conclusion This paper has explored information requirements from the perspectives of the building delivery team, the facility management team (FMT) and the building occupants. The literature review has explored the complexity of information creation, management and communication in BIM, and emphasized the difficulty of delivering effective buildings that satisfy all stakeholders. Soft systems techniques were used to inquire into the different information requirements from the three specified perspectives, applied within the context of a newly operating university building. The results included both the rich picture and the root definitions, which highlighted the different perspectives of the three stakeholders. The different viewpoints demonstrate that BIM needs to provide both information-as-thing and information-as-process to meet the needs of the different parties in building design. It is concluded that involving the FMT in the design process and establishing a way to communicate the different building occupants' requirements can address some of the conflicts between stakeholders. Further work is needed to define the specific information categories that need to be included in BIM to support the needs of different stakeholders.

References
AIA, F. E. J. (2008) BIG BIM little bim. 2nd edition. Salisbury, MD: 4Site Press.
Aranda-Mena, G., Crawford, J., Chevez, A. and Froese, T. (2009) 'Building information modelling demystified: does it make business sense to adopt BIM?', International Journal of Managing Projects in Business, 2(3), pp. 419–434. doi: 10.1108/17538370910971063.
Becerik-Gerber, B. and Rice, S. (2010) 'The perceived value of building information modeling in the U.S. building industry', ITcon, 15, pp. 185–201.
BIM Task Group (2013) What is COBie UK 2012? Available at: www.bimtaskgroup.org/cobie-uk-2012/ (Accessed: 2 August 2014).
Björk, B.-C. (1995) Requirements and information structures for building product models. Helsinki University of Technology. Available at: itc.scix.net/data/works/att/0fb9.content.01355.pdf.
Buckland, M. K. (1991) 'Information as Thing', Journal of the American Society for Information Science, 42(5), pp. 351–360.
buildingSMART (2013) Start Page of IFC2x4 RC2 Documentation - Scope.
Checkland, P. B. (2000) 'Soft systems methodology: a thirty year retrospective', Systems Research and Behavioral Science, 17, pp. 11–58.
Choi, J., Choi, J. and Kim, I. (2011) 'Development of BIM-based evacuation regulation checking system for high-rise and complex buildings', Automation in Construction, (In Press). doi: 10.1016/j.autcon.2013.12.005.
CIC and BIM Task Group (2013) Outline Scope of Services for the Role of Information Management. London: Construction Industry Council.
Ding, L., Zhou, Y. and Akinci, B. (2014) 'Building Information Modeling (BIM) application framework: The process of expanding from 3D to computable nD', Automation in Construction, (In Press). doi: 10.1016/j.autcon.2014.04.009.
Eastman, C., Teicholz, P., Sacks, R. and Liston, K. (2011) BIM Handbook: A Guide to Building Information Modelling for Owners, Managers, Designers, Engineers and Contractors. 2nd ed. New Jersey: John Wiley & Sons Inc.
Fallon, K. K. and Palmer, M. E. (2007) General Buildings Information Handover Guide: Principles, Methodology and Case Studies. National Institute of Standards and Technology.
Floridi, L. (2010) Information: A Very Short Introduction. Oxford: Oxford University Press.
Froese, T. (2003) Future directions for IFC-based interoperability. Available at: http://www.itcon.org/cgi-bin/works/Show?2003_17 (Accessed: 23 July 2014).
Garber, R. (2014) BIM Design: Realising the Creative Potential of Building Information Modelling. Chichester, West Sussex, United Kingdom: Wiley.
HM Government (2012) Industrial Strategy: government and industry in partnership: Building Information Modelling. United Kingdom: Crown.
IAM (2013) What is Asset Management? The Institute of Asset Management. Available at: http://theiam.org/what-asset-management (Accessed: 29 July 2014).
Ibrahim, M. and Krawczyk, R. (2003) 'The level of knowledge of CAD objects within the building information model', ACADIA 2003 Conference, Muncie, IN, USA, p. 173.
ISO Standard (2007) ISO 12006-3:2007 Building Construction: Organization of Information About Construction Works, Part 3: Framework for Object-oriented Information.
ISO Standard (2010) ISO 29481-1:2010(E): Building Information Modeling - Information Delivery Manual - Part 1: Methodology and Format.
Kassem, M., Brogden, T. and Dawood, N. (2012) 'BIM and 4D planning: a holistic study of the barriers and drivers to widespread adoption', 2(4), pp. 1–10.
Khatib, J. M., Chileshe, N. and Sloan, S. (2007) 'Antecedents and benefits of 3D and 4D modelling for construction planners', Journal of Engineering, Design and Technology, 5(2), pp. 159–172. doi: 10.1108/17260530710833202.
Lee, S. and Ha, M. (2013) 'Customer interactive building information modeling for apartment unit design', Automation in Construction, 35, pp. 424–430. doi: 10.1016/j.autcon.2013.05.026.
Liu, X., Eybpoosh, M. and Akinci, B. (2012) 'Developing As-Built Building Information Model Using Construction Process History Captured by a Laser Scanner and a Camera', American Society of Civil Engineers, pp. 1232–1241. doi: 10.1061/9780784412329.124.
Mayouf, M., Boyd, D. and Cox, S. (2014) 'Different Perspectives on Facilities Management to Incorporate in BIM', in Proceedings of CIB Facilities Management Conference: Using Facilities in an open world creating value for all stakeholders. Copenhagen, pp. 144–153.
Mehregan, M. R., Hosseinzadeh, M. and Kazemi, A. (2012) 'An application of Soft System Methodology', Procedia - Social and Behavioral Sciences (The First International Conference on Leadership, Technology and Innovation Management), 41, pp. 426–433. doi: 10.1016/j.sbspro.2012.04.051.
Motawa, I. and Almarshad, A. (2013) 'A knowledge-based BIM system for building maintenance', Automation in Construction, 29, pp. 173–182. doi: 10.1016/j.autcon.2012.09.008.
Penttilä, H. (2006) 'Describing the changes in architectural information technology to understand design complexity and free-form architectural expression', ITcon, 11, pp. 395–408.
Persson, S., Malmgren, L. and Johnsson, H. (2009) 'Information management in industrial housing design and manufacture', ITcon, 14, pp. 110–122.
Race, S. (2012) BIM Demystified. London: RIBA Publishing.
Rüppel, U. and Schatz, K. (2011) 'Designing a BIM-based serious game for fire safety evacuation simulations', Advanced Engineering Informatics (Special Section: Advances and Challenges in Computing in Civil and Building Engineering), 25(4), pp. 600–611. doi: 10.1016/j.aei.2011.08.001.
Succar, B. (2009) 'Building information modelling framework: A research and delivery foundation for industry stakeholders', Automation in Construction, 18(3), pp. 357–375. doi: 10.1016/j.autcon.2008.10.003.
Vacik, H., Kurttila, M., Hujala, T., Khadka, C., Haara, A., Pykäläinen, J., Honkakoski, P., Wolfslehner, B. and Tikkanen, J. (2014) 'Evaluating collaborative planning methods supporting programme-based planning in natural resource management', Journal of Environmental Management, 144, pp. 304–315. doi: 10.1016/j.jenvman.2014.05.029.
Venugopal, M., Eastman, C. M., Sacks, R. and Teizer, J. (2012) 'Semantics of model views for information exchanges using the industry foundation class schema', Advanced Engineering Informatics (Knowledge based engineering to support complex product design), 26(2), pp. 411–428. doi: 10.1016/j.aei.2012.01.005.
Volk, R., Stengel, J. and Schultmann, F. (2014) 'Building Information Modeling (BIM) for existing buildings: literature review and future needs', Automation in Construction, 38, pp. 109–127. doi: 10.1016/j.autcon.2013.10.023.
WBCSD (2007) Energy Efficiency in Buildings: Business realities and opportunities. World Business Council for Sustainable Development.

Using Big Data in Banking to Create Financial Stress Indicators and Diagnostics: Lessons from the SNOMED CT Ontology

Alistair Milne and Paul Parboteeah
School of Business and Economics, Loughborough University
[email protected], [email protected]

Abstract Big data now presents the opportunity for near real-time access to large information sets from different sources, and it promises to offer new tools for monitoring systemic financial risk. However, evidence from other fields (spatial data analysis, bio-informatics) on data modelling and visualisation demonstrates that, in order to obtain useful insights, the underlying data and the relationships between the data must be understood. In the context of trying to monitor systemic risk, we therefore draw a distinction between financial stress indicators (timely but broad indicators of current or prospective financial stress) and financial stress diagnostics (measures that reveal a specific mechanism that can trigger a systemic problem).

Big data can be used 'as is' for stress indicators, but for stress diagnostics an ontological foundation is required to create a conceptual framework for the underlying data. This paper therefore tries to draw lessons from the development of other ontologies and, in doing so, examines SNOMED CT (a medical ontology), drawing out the lessons for the use of big data for monitoring systemic financial risk. The paper brings together previous research on the development and use of SNOMED CT, provides a critique of its success in improving patient care and drug development, and extracts the lessons that can be learnt. Stemming from this critique, implications are drawn for the creation of an ontology to support financial stress indicators and diagnostics. The main finding is that, in contrast to SNOMED CT, creating a stress diagnostic system is more about mapping the financial system as a whole to identify relationships and dependencies.

Keywords: Big Data, Ontologies, Financial Stress Indicators, Financial Stress Diagnostics

Introduction Big data – the opportunity for near real-time access to large-scale information sets from disparate sources – promises to become a key tool for monitoring and anticipating systemic financial risks. But experience in a variety of fields (e.g. spatial data analysis (Haining, 2003) and bio-informatics (Khatri & Drăghici, 2005)) on data modeling and visualization indicates that a prior requirement for obtaining precise and interpretable results from large-scale data is a clear understanding of underlying data structures and relationships.

In the context of systemic risk monitoring this can be understood by drawing a distinction between financial stress indicators and financial stress diagnostics. By financial stress indicators we mean timely but broad indicators of current or prospective financial stress. By financial stress diagnostics, we mean measures that reveal a specific mechanism that can potentially trigger a systemic problem and thereby suggest an appropriate policy response. (In keeping with the medical perspective we pursue, an analogy for a financial stress indicator might be a temperature or blood pressure reading; while an analogy for a financial stress diagnostic would be disease specific investigations such as a blood or smear test.)

While big data may be useful 'as is' for the creation of financial stress indicators, using big data to create effective financial stress diagnostics that reveal the underlying causes of stress requires something more: a conceptual framework for understanding the granular data on financial instruments and how these can be parsed and aggregated – in modern computer science terminology, an ontological understanding. It is therefore appropriate to try to draw lessons from the development of ontologies in other domains. The aim of this paper is to examine the SNOMED CT medical ontology, drawing out the lessons for the use of big data for monitoring systemic financial risk.

The paper is organized as follows. Section 2 considers the current state of data analysis and modeling in financial services, exploring where and why an ontology may be useful, and where standardization has been developing and is starting to lay the foundation for an ontology of financial activities. Section 3 describes the SNOMED CT ontology, providing an overview of its history and development and, now that it is in place, its usage and how it is supporting innovation and reducing costs in medical research. Section 4 presents and discusses the lessons that can be learnt from SNOMED CT for the financial services industry and the financial authorities, considering how recent calls for a common financial language can be taken further to support a fully fledged ontology. Section 5 concludes.

Common Languages, Standards and Ontologies in Financial Services The need for agreed definitions and an ontology in finance is now widely accepted. A recent speech by Andrew Haldane (Ali et al., 2012) called for a move towards a common financial language, arguing that competing in-house languages and siloed information systems make it difficult to manage risk exposure. Haldane argues that the inability to quickly aggregate information on risk exposure during the financial crisis hampered risk management, ultimately proving fatal for some banks. Taking steps towards creating a common financial language will enable systems both across, and between, organisations to communicate with each other, allowing the effective management of risk both within individual institutions and across the financial system as a whole.

This is related to the distinction we make between financial stress indicators and financial stress diagnostics. Financial stress indicators are already widely used, especially in the context of financial bubbles: for example, Shiller (2000) uses keywords from newspapers as indicators of when housing markets have been gripped by speculation, and Sornette (2013) uses novel data-analytic methods to create financial stress indicators from market prices. There is obvious potential for extending work of this kind using big data, for example discussions on traders' blogs, or financial attitudes revealed by social networks.
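As a hedged sketch of how such a broad stress indicator might be assembled, the example below combines z-scored series into a single composite index. The input series, weights and keyword counts are invented for illustration and are not data or methods from the studies cited above.

import statistics

def zscore(series):
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    return [(x - mean) / sd for x in series]

# Invented illustrative inputs: interbank spread (bps), equity volatility (%),
# and a daily count of stress-related keywords scraped from news or blogs.
interbank_spread = [12, 14, 13, 15, 40, 85, 120]
equity_vol = [11, 12, 12, 14, 22, 31, 38]
keyword_count = [3, 2, 4, 5, 20, 45, 60]

weights = [0.4, 0.4, 0.2]
components = [zscore(interbank_spread), zscore(equity_vol), zscore(keyword_count)]

stress_index = [sum(w * c[t] for w, c in zip(weights, components))
                for t in range(len(interbank_spread))]
print([round(v, 2) for v in stress_index])  # rises sharply in the final periods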

But such stress indicators do not accurately locate the underlying causes of systemic stress. Spurred by the global financial crisis, there is now a good general understanding of the underlying sources of systemic financial stress (see (Besar, Booth, Chan, Milne, & Pickles,
2011) for an extensive literature review). These vulnerabilities include maturity mismatch and illiquidity; inadequate capitalization; common exposures to asset markets (notably residential and commercial property) and unstable network structures. But while we understand these vulnerabilities in outline, we lack adequate tools for their measurement and quantification.

Using big data to create financial stress diagnostics, e.g. identifying common exposures, maturity mismatch, or potential network instabilities, requires semantic understanding. It is not enough simply to collect data and summarise it in a relatively crude fashion. Rather, in order to be used diagnostically, data must be carefully parsed and modeled to build an aggregate and interpretable analysis of the system. This cannot be undertaken with raw data alone; it requires a semantic understanding, which can only be obtained through the use of an ontology.

Ontologies are formalisations of the definitions of concepts and their relationships to other concepts, used so that information systems can work together better. They can be understood as the most advanced form of a common language. A common language can be interpreted at different levels (Chisholm and Milne, 2013): at the most basic level, it is a single set of agreed-upon definitions; at a higher level, a set of standards agreed upon as ways of making or doing things; and at the highest level, an ontology.
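To make the contrast with a mere set of definitions concrete, the minimal sketch below formalises a toy concept hierarchy and uses it to aggregate granular holdings into a common-exposure diagnostic. The concept names, relations and exposure figures are invented and do not correspond to FIBO or any existing ontology.

# A toy concept hierarchy (is-a relations) and instrument-level holdings.
IS_A = {
    "RMBS": "PropertyLinkedSecurity",
    "CMBS": "PropertyLinkedSecurity",
    "CoveredBond": "PropertyLinkedSecurity",
    "PropertyLinkedSecurity": "DebtSecurity",
    "CorporateBond": "DebtSecurity",
}

def ancestors(concept):
    """All concepts a given concept is subsumed by, via the is-a relation."""
    out = set()
    while concept in IS_A:
        concept = IS_A[concept]
        out.add(concept)
    return out

holdings = [
    {"bank": "Bank A", "instrument": "RMBS", "exposure_m": 320},
    {"bank": "Bank A", "instrument": "CorporateBond", "exposure_m": 150},
    {"bank": "Bank B", "instrument": "CMBS", "exposure_m": 410},
    {"bank": "Bank C", "instrument": "CoveredBond", "exposure_m": 280},
]

# Diagnostic question: how exposed is each bank to property-linked assets?
property_exposure = {}
for h in holdings:
    if "PropertyLinkedSecurity" in ancestors(h["instrument"]):
        property_exposure[h["bank"]] = property_exposure.get(h["bank"], 0) + h["exposure_m"]

print(property_exposure)  # {'Bank A': 320, 'Bank B': 410, 'Bank C': 280}

The point is that the common exposure only becomes visible once the granular instruments are classified against shared concept definitions; the raw holdings alone do not reveal it.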

There have been many standards developed in the financial services industry, notably the FIX Protocol, ISO20022, FIBO (the Financial Industry Business Ontology) and the global Legal Entity Identifier system (a global reference data system that uniquely identifies every legal entity or structure, in any jurisdiction, that is party to a financial transaction). These examples can be seen as a first step towards the implementation of a full ontology since standards, by definition, force a common way of describing, and carrying out, processes.

Whilst the financial services industry has benefited from an agreed upon set of definitions and transaction standards (FIX Protocol, ISO20022), standardization is far from complete, and the ultimate aim of creating an ontology is some way off. An initial start has been made in the FIBO (the Financial Industry Business Ontology), but this is still at an early stage of development.

There are also obvious governance challenges to developing standards and ontologies. Whilst there are very substantial economic gains to standardisation in financial markets, progress in achieving these gains has been uneven, in part because the benefits to individual market participants can fall short of the total societal benefits (Houstoun, 2012). It is even possible that individual market participants could suffer losses from the introduction of standards, ontologies or common languages and could seek to hinder their development. Even when there are benefits to individual firms, problems with coordination often mean that the benefits go unrealised.

Learning from other Industries Numerous ontologies already exist, some of which are in the public domain: for instance, the SNOMED CT ontology of healthcare and clinical terms; WordNet, a lexical database of the English language; SWEET, NASA's ontology of Earth and Environmental Terminology; and CIDOC, an ontology for concepts and information in cultural heritage and museum documentation. SNOMED CT was chosen as the case study for this paper because the healthcare domain it covers matches in scope, size and complexity the situation found in finance. For example, SNOMED CT aims to cover body structures, diagnoses, procedures, clinical findings, organisms, pharmaceuticals, medical devices and specimens with approximately 311,000 concepts, compared to the 6,000 concepts in NASA's SWEET. The
academic body of literature on SNOMED CT is also substantially larger than that on SWEET, with SNOMED CT returning 9,700 results on Google Scholar, as opposed to SWEET's 1,690. It is also prudent to point out that the purpose of the study is to uncover some of the lessons to be learnt from using big data and ontologies, not to try to draw direct comparisons.

SNOMED CT is a medical ontology which provides a consistent way of recording information across health care. It includes codes, terms, synonyms and definitions to improve the recording of clinical and research information, with the ultimate aim of improving patient care. The scope of SNOMED CT ranges from recording symptoms and procedures, and defining body structures and organisms, through to diagnoses, medical procedures, treatments and outcomes. Work initially started in the 1960s and the first complete version was released some 40 years later (Cornet and de Keizer, 2008).

The need, and desire, for standardising vocabulary in the medical field is not recent (Bodenreider, 2008). Indeed, medical science and healthcare have a long tradition of structuring their knowledge through controlled vocabularies. The 17th century saw the start, with the London health authorities compiling a list of 200 causes of death, which was later incorporated into the International Classification of Diseases (McCray, 2006). The first comprehensive terminological system was SNOP in 1965 (Cornet and de Keizer, 2008), part of an effort to improve patient information storage and retrieval, billing and accounting activities, the generation of health statistics and semantic interoperability between systems (Schulz and O Klein, 2008). SNOP evolved into SNOMED in 1974, which later saw versions 2 and 3, followed by 3.5 and then RT, before becoming SNOMED CT in 2002. SNOMED CT was developed with three main aims: to enable a consistent way of indexing, storing and aggregating clinical data; to enable the structuring and computerizing of medical records; and to enable automated reasoning to support clinical decision making.

In its most basic form, SNOMED CT is a source of vocabulary, but combined with the identified relationships between concepts it becomes an important resource for natural language processing and for supporting knowledge management initiatives such as information mapping and merging (Bodenreider, 2008). Using a terminological system to describe concepts and their relations means that a standardized and structured approach can be taken to describing information (de Keizer et al., 2000), especially in regard to medical records.

Whilst an incredibly upbeat and beneficial picture of SNOMED CT has been presented, there exists a weakness that needs exploring. That weakness lies in the specification and creation of the ontology according to ontology-creation principles and best practice: SNOMED CT contains several ontological errors (Heja et al., 2008; Schulz et al., 2008; Schulz et al., 2011). These errors exist because SNOMED CT was developed giving high priority to the needs of clinicians (Heja et al., 2008). This sounds like an ironic twist of fate: it is for the benefit of clinicians that the ontology has been created, and yet it is the very act of prioritizing their needs that has resulted in the ontological failings. For example, issues with multiple inheritance mean that from the ontology you would, incorrectly, infer that alcoholic beverage is subsumed by central depressant, ethyl alcohol and substance of abuse. It may seem like a minor point, but these errors actually prevent automatic reasoning by the ontology.
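To show why such errors matter for automated reasoning, the toy subsumption check below computes every ancestor of a concept from its stated parents, so an incorrect parent link propagates silently into every downstream inference. The relation table paraphrases the alcoholic-beverage example above rather than reproducing actual SNOMED CT content.

# Toy is-a graph with multiple inheritance, paraphrasing the example above.
PARENTS = {
    "Alcoholic beverage": ["Beverage", "Ethyl alcohol", "Central depressant", "Substance of abuse"],
    "Beverage": ["Substance"],
    "Ethyl alcohol": ["Organic compound"],
    "Central depressant": ["Pharmacologic substance"],
    "Substance of abuse": ["Substance"],
}

def all_ancestors(concept):
    """Transitive closure of the is-a relation for one concept."""
    seen = set()
    stack = list(PARENTS.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return seen

# An automated reasoner would conclude, incorrectly, that an alcoholic
# beverage *is* a central depressant and a substance of abuse, because the
# stated parents conflate the beverage with its active ingredient.
print(all_ancestors("Alcoholic beverage"))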

The reason for these weaknesses is that, by prioritizing the input from clinicians, the semantics of the ontology become rooted in the human understanding of natural language and people's implicit understanding of the relations between concepts (Schulz and O Klein, 2008). The
problem of the lack of rigid, formal semantics is one that continues to be reduced by developments and improvements to the ontology. At one point, for instance, SNOMED CT used to contain the words “Mars Bar” and “Kit Kat”, but these have now been removed and replaced by the broader term “Chocolate Candy” (Heja et al., 2008).

The benefit of SNOMED CT, and its ultimate aim, is to support decision-making in the pharmaceutical and medical industries (Bodenreider, 2006). Knowledge management initiatives are supported because SNOMED CT enables the annotation of data and resources. Clinical documents can be coded for different purposes (Giannangelo, 2006), and information retrieval is more effective and efficient because of the improved level of detail and the ability to drill down (Bodenreider, 2008). SNOMED CT also supports data integration and semantic interoperability (Mead, 2006). Ontologies enable, and support, the creation of electronic health records, permitting the numerous "islands of data" to be merged. There are two ways SNOMED CT supports data integration: warehousing and mediation. Ontologies provide the standardisation required to facilitate data warehousing and provide the global schema for mapping different datasets.

To conclude, SNOMED CT provides the controlled vocabulary for the annotation of datasets and documents, which facilitates the access to, and retrieval of, information. The standardisation it provides enables the exchange of information across systems and, by providing a representation of the medical domain, enables the integration and merging of different datasets (Bodenreider, 2008). Finally, by enabling the annotation, retrieval, integration and merging of information, SNOMED CT becomes a computable source of domain knowledge that will eventually support natural language processing applications and decision support systems.

Implications for Financial Services Having briefly reviewed SNOMED CT and its development, use, problems and role in supporting clinical decision making, this paper will now explore what lessons can be extracted from the SNOMED CT implementation and applied to the creation of financial stress indicators and diagnostics.

The first lesson from SNOMED CT is that developing an almost industry-wide ontology is a large undertaking. Whilst it took approximately 40 years to arrive at SNOMED CT, efforts now to produce a useful FIBO should be substantially shorter: the technological environment is more advanced and the regulatory pressures are greater.

The SNOMED CT model of development does, however, act as a good practice model for FIBO developers (FIBO being necessary for the semantic analysis needed in financial stress diagnostics). As outlined by Cornet and de Keizer (2008), different medical authorities around the world started their own implementations of terminology/ontology systems before they were combined by a specially created organisation: the International Health Terminology Standards Development Organisation (IHTSDO). IHTSDO is a not-for-profit organisation whose purpose is to own and administer SNOMED CT. As of 2012, 18 countries are members of IHTSDO, with 50 countries using SNOMED CT.

As noted by Heja et al. (2008) among others, SNOMED CT does contain numerous ontological errors that prohibit automated processing and reasoning. A further lesson to be learnt here is that sufficient importance needs to be attached to the rules governing the creation of ontologies in this space, including FIBO. Creating an ontologically sound ontology will then permit the semantic analysis needed for financial stress diagnostics to take place, because inferences will be able to be drawn and decisions taken.

In contrast to SNOMED CT, however, creating a stress-diagnostic system is more about mapping the financial system as a whole to identify relationships and dependencies. SNOMED CT, by comparison, focuses on drawing better conclusions from large cross-sectional data and on better understanding which treatments or risk factors are most closely linked to disease and its progression.

Conclusion
In this paper we have sought to draw a distinction between financial stress indicators, which provide immediate warnings of imminent systemic problems, and financial stress diagnostics, which provide insight into the underlying causes of stress and therefore into the appropriate policy response.

We have argued that, while big data techniques can be used directly to construct financial stress indicators, their use to create financial stress diagnostics requires an ontological understanding of data at the granular level. Only then will it be possible to use big data to provide effective quantitative measurement of key systemic vulnerabilities of the financial system, such as network instabilities or common exposures to property and other asset markets.

There has been progress in the creation of a 'common financial language': definitions, standards and, at least in prototype form, ontologies for finance. But while travel has been in the right direction, the experience from medicine with the SNOMED CT ontology suggests this is still only the beginning of a long road. It will take a sustained effort, involving many stakeholders, to develop a complete financial ontology that can be used to support a complete mapping of financial sector vulnerabilities from big data. Achieving this end will depend on effective co-operation and governance, which is not always easily achieved.

References
Ali, R.D., Haldane, A.G. and Nahai-Williamson, P. (2012) "Towards a Common Financial Language", Presented at the Securities Industry and Financial Markets Association "Building a Global Legal Entity Identifier Framework" Symposium, New York.

Besar, D., Booth, P., Chan, K. K., Milne, A. K. L., & Pickles, J. (2011). Systemic Risk in Financial Services. British Actuarial Journal, 16(02), 195–300.

Bodenreider, O. (2006) "Lexical, terminological and ontological resources for biological text mining". In: Ananiadou, S., McNaught, J. editors. Text Mining for Biology and Biomedicine. Boston: Artech House.

Bodenreider, O. (2008) “Biomedical Ontologies in Action: Role in Knowledge Management, Data Integration and Decision Support”, IMIA Yearbook of Medical Informatics, Vol. 47, Suppl.1, pp. 67-79.

Chisholm, M. and Milne, A. (2013) “The Prospects for Common Language in Wholesale Financial Services”, The Swift Institute, Available at: http://www.swiftinstitute.org/wp-content/uploads/2013/09/SWIFT-Institute-Working-Paper-No.-2012-005-Common-Language-in-Wholesale-Financial-Services-Chisolm-and-Milne_v4-FINAL.pdf.

Cornet, R. and de Keizer, N. (2008) “Forty years of SNOMED: a literature review”, BMC Medical Informatics and Decision Making, Volume 8, Suppl. 1: S2.


De Keizer, N.F., Abu-Hanna, A. and Zwetsloot-Schonk, J.H. (2000) “Understanding terminological systems 1: Terminology and typology”, Methods of Information in Medicine, Vol. 39, pp. 16-21.

Giannangelo, K. (2006) "Healthcare code sets, clinical terminologies and classification systems", Chicago, Ill.: American Health Information Management Association.

Heja, G., Surjan, G. and Varga, P. (2008) “Ontological Analysis of SNOMED CT”, BMC Medical Informatics and Decision Making, Volume 8, Suppl. 1: S2.

Haining, R. P. (2003). Spatial Data Analysis: Theory and Practice, p432, Cambridge University Press.

Houstoun, K. (2012). Standards in Computer Based Trading: A Review. Retrieved from http://www.bis.gov.uk/assets/foresight/docs/computer-trading/12-1091-dr31-standards-in-computer-based-trading.pdf

Khatri, P., & Drăghici, S. (2005). Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (Oxford, England), 21(18), 3587–95. doi:10.1093/bioinformatics/bti565

McCray, A.T. (2006) “Conceptualizing the world: lessons from history”, Journal of Biomedical Informatics. Vol. 39, No. 3, pp. 267-273.

Schulz, S. and O Klein, G. (2008) "SNOMED CT – advances in concept mapping, retrieval and ontological foundations. Selected contributions to the Semantic Mining Conference on SNOMED CT", BMC Medical Informatics and Decision Making, Volume 8, Suppl. 1: S2.

Schulz, S., Marko, K. and Suntisrivaraporn, B. (2008) “Formal representations of complex SNOMED CT expressions”, BMC Medical Informatics and Decision Making, Volume 8, Suppl. 1: S2.

Schulz, S., Cornet, R. and Spackman, K. (2011) “Consolidating SNOMED CT’s ontological commitment”, Applied Ontology, Vol. 6, pp. 1-11.

Sornette, D. (2013) “How we can Predict the Next Financial Crisis”, TED Talks, Available at: http://www.ted.com/talks/didier_sornette_how_we_can_predict_the_next_financial_crisis.

Shiller, R.J. (2000) “Irrational Exuberance”, Princeton University Press: New Jersey.


The use of ontologies to gauge risks associated with the use and reuse of E-Government services
Onyekachi Onwudike, Russell Lock and Iain Phillips
Dept of Computer Science, Loughborough University
[email protected], [email protected], [email protected]

Abstract
E-Government ontologies have been developed for different strata of government over a number of years. However, the majority of these ontologies have had little or no impact on E-Government as a whole. The development of E-Government ontologies in isolation, without wider integration in mind, and the lack of reuse of components present serious challenges for the E-Government domain. Although reuse across ontologies seems a welcome idea with respect to the problem of interoperability, the risks and disadvantages associated with reusing existing solutions, as well as with making certain functionalities sharable between E-Government services, constitute a relatively new area of research.

The reuse of existing solutions may potentially help to foster co-operation amongst E-Government departments, reduce costs, reduce development time and increase the reliability and maintainability of such systems. This paper explores existing E-Government ontologies and assesses the assistance a suitably designed ontology could provide in reducing system development and evolution risks. It incorporates the development of a new ontology for E-Government and explores the role of ontologies in overcoming risks that may be associated with service combinations, such as overlapping services, the uncontrolled reuse of components, monopolies of information across departments and areas of dependency resulting in conflict, amongst others. These scenarios give us the opportunity to investigate whether certain combinations of services are beneficial, especially in cases where services depend on one another. We conclude that the use of ontologies could play a significant role in gauging the risks associated with this.

Keywords: E-Government, Risks, Threats, Ontology, Reuse

1. Introduction
Any large-scale system should be developed to support evolution. Most sustainable systems are subject to ongoing change, and changes to a system can take place for a variety of reasons. The reasons behind the development of a system may be invalidated because of changes which were not foreseen at the initial development of the system, redundancy of the system, expansion of the system to incorporate new services, or changes in user needs and requirements, amongst others. One of the major problems in handling knowledge representation practically is dealing with relationships that change with the passage of time (Welty & Fikes 2006). This problem is made more complex because most modelling languages are limited in their expressive power, being restricted to binary and unary relations, and is made worse because most representation languages overlook the specification of time.

The concept of E-Government is well established, with numerous service providers offering similar services to citizens, businesses and governments. However, most of these services are composed of similar service components. The reuse of domain knowledge is therefore significant in this research, as it can contribute greatly to effort reduction and to quality-of-service improvement. It is currently difficult to answer questions such as "what difficulties or threats can arise when information is exchanged across departmental boundaries?" or "can dependencies among services result in conflict?" In areas such as business and engineering, metrics are used to determine the health of a project and whether the dividends justify the costs. The threats that face any enterprise are critical to the advancement and growth of that enterprise. Looking at the E-Government domain as an enterprise, especially with respect to the delivery of services, there is a possibility of making incorrect or unwise decisions when it comes to the reuse of service components, the threats such reuse and combination of components bring about, and the possible countermeasures. When services are developed in silos they are prone to a number of risks, including a lack of communication between departments, which could introduce shared points of failure and thereby reduce the resilience of the system. It is one thing to chant the chorus of reuse; it is quite another to discover the effects reuse has on the E-Government domain. Since there is no precise definition of the adverse effects a lack of reuse can bring about in the E-Government domain, there is confusion among service providers about which services can be combined and which components of these services can be reused. An ontology to gauge the associated risks and threats is therefore a viable solution to this problem, because the entities in an ontology are defined in a sound manner and relationships, such as the dependencies that exist amongst entities, are precisely specified. Furthermore, an ontology would give decision makers greater insight into why certain decisions should be made concerning electronic services in the E-Government domain. Given that decisions are mainly made by managers who have little or no knowledge of the underlying infrastructure on which the E-Government domain is built, and who base decisions on intuition rather than on defined metrics, an ontology can greatly reduce this problem (Singhal 2010).

Therefore, the key goals of this research are to:

1. develop an ontology which can be used to identify threats that endanger the E-Government domain in areas such as:

• component reuse

• component combination

• asset procurement

2. incorporate countermeasures that can be taken to counteract the threats identified or, better still, reduce the probability of system failure.

The structure of the remainder of the paper is as follows: section 2 introduces E-Government and its corresponding ontologies; section 3 explores component reuse; in section 4 we define constructs for the intended ontology; in section 5 we point out risks that may arise from combining components; section 6 presents the language for developing the ontology; and section 7 concludes the work and outlines future work.

2. E-Government Ontologies
Electronic Government, often known as E-Gov, was established in the late 1990s. The reason behind its establishment was to foster relationships between government and the public so that citizens can effectively and efficiently interact with government (Layne & Lee 2001). The onus lies on governments all over the world to implement E-Government in order to improve the state of governance and the delivery of services to their citizens by eliminating processes that are inefficient and time-consuming.


In line with the reason for its establishment, E-Government has as its main objective the development of technological solutions that can support interactions between citizens and public institutions, improve public participation and social life, and serve as a means of reducing cost (Barbagallo et al. 2010). It is worth noting that E-Government is not limited to the publishing of information on the internet or on websites but involves understanding the structure and operations of different departments and administrations. Often governmental demands for improvements to services clash with citizen requirements; an example can be seen in the development of policies that respond to the needs of individuals as well as their circumstances (Holmes 2011). Therefore, for a government to remain relevant to its citizens, it must take an active role in the implementation of E-Government (Mundy & Musa 2010). The reason behind the practice of public administration is the placing of citizens at the centre of policymakers' considerations, not just as the target of the decisions being made but also as agents and drivers of these decisions. E-Government provides services that are used regularly by service providers as well as service receivers. Over time, E-Government services have become increasingly complex and, correspondingly, require increased management (Stojanovic et al. 2004). The problem with governments is that they are very complex and, because of this complexity, they are subdivided into different departments, with each department offering its own kind of service and operating on a separate budget. A major problem, however, is that a general approach is hardly ever employed in the development or distribution of these services. For example, similar components are built separately by departments such as the Drivers License Department and the department in charge of issuing passports. This has brought about a lack of integration amongst these departments, considerable repetition and the use of the same kinds of components across departments, so that development across government takes place in silos. The potential for reuse and savings across these departments exists, and the use of an ontology is a viable technique for achieving this and for overcoming the problem of silos in government: an ontology can capture the different activities of these departments and model the relationships and interdependencies that exist between them.

An ontology is defined as "an explicit specification of a conceptualization" (Gruber 1993). An E-Government ontology can therefore be described as an explicit description of the E-Government domain containing a common vocabulary with shared understanding. E-Government is a domain which must be carefully considered because it deals with the use of information technologies in providing better government to citizens. Concepts and relations managed by any scientific community need to be formally defined, and the use of ontologies supports their definition. Several E-Government ontologies have been developed, such as SmartGov, EGov, OntoGov and TerreGov. Most of these ontologies are already obsolete and lack semantic consistency, which has led to the loss of critical information. However, Gugliotta et al. (2005) argue that none of these ontologies adopts Semantic Web technologies to represent concepts and actions.

The question arises: why are the previously developed ontologies not being applied today?

According to Fonou-dombeu & Huisman (2011), ontologies are used to describe and specify E-Government services (E-services) primarily because semantic integration and interoperability of E-services are facilitated by their use: composition, matching, mapping and merging of various E-Government services become easier. In the context of this paper, the domain of an E-Government ontology comprises issues that are government-related. The purpose of the E-Government ontology is therefore to facilitate an adequate understanding of the E-Government domain by service providers, so that issues relating to the integration of services, as well as the risks associated with integration in the E-Government domain, can be addressed, and so that the ontology can be used as a prediction tool for future governments. It is practically impossible to develop a single ontology that satisfies all users, especially in the areas of precision, coverage, actuality and individualisation. This can be attributed to the fact that different departments need specific approaches and vocabularies for solving tasks specific to them (Stumme et al. 2000).

Ontologies serve as a platform or a means for defining the services offered by governments, and attempts have been made at the development of E-Government ontologies. The use of ontologies for knowledge representation enhances organisation, communication and reusability, and ontologies serve as building blocks for intelligent systems. This has been of immense benefit, as seen in existing applications.

3. Literature review on component reuse
For a government to meet its objectives in a cost-effective and timely manner, applications should be made reusable by other software. It is also possible for a solution developed by one government to be reused by another government (Ratneshwer & Tripathi 2010). This can be seen in some EU system developments, such as the ECRIS system, which is designed to provide international access to criminal records.

Globally, across most industries, about 85% of the processes that take place across various departments are the same, and this also applies to the processes that take place across government organisations (Anon 2003). It is therefore logical to cultivate the reuse of software where possible, as the reuse of solutions may potentially help to foster co-operation amongst departments, reduce costs, reduce development time and increase the reliability and maintainability of such systems.

Most software is developed in component or modular form, and the act of developing these reusable components is known as Component-Based Development. The Netherlands is a country that ensures collaboration between departments and makes use of component-based development; collaboration is even seen between small municipalities in the Netherlands. This is aimed at eliminating duplicated effort and establishing one shared back-office (Janssen & Wagenaar 2004). Since services cannot always be provided at reduced cost and implemented locally, and since small organisations limited by budgets and expertise cannot develop all the services that are desired, sharing services and expertise among organisations allows a larger number of services to become widely available. Sheng & Lingling (2011) identified information and data sharing as the cornerstone of E-Government. In the process of integrating data we discover that, beyond information sharing, resources can also be shared, which we believe can bridge gaps as well as foster trust in the E-Government domain. Considering that there are varying degrees of information, data and component reuse across departments and even governments, there are also risks and disadvantages associated with reuse, which are identified in the questions this research attempts to answer. Although component reuse across departments is attractive, in the sense that components are built with common functionalities and attributes and can therefore be deployed into a new system with modifications to suit its requirements, such reuse poses inherent challenges.


Reuse may help to greatly reduce development time and costs, and would greatly increase the reliability and maintainability of such systems (Ratneshwer & Tripathi 2010). But the question arises: although the goal of E-Government should be the delivery of services to its citizenry, is there a way for the people behind the delivery of these services to work jointly? On the subject of a joint workforce, Homburg et al. (2002) state that there is often a requirement for information exchange in the back offices of government. It is usually difficult to establish a joint workforce because most ontologies are developed in isolation, sometimes with no possibility of reuse in mind. Collaborative development is key because it allows the influence of modifications to ontologies to be effectively managed (Sunagawa et al. 2003). Although Sunagawa et al. (2003) established the need for collaborative development across E-Government ontologies, can we say this was adhered to? Vasista (2011) also viewed this collaborative problem as "the inability of existing integration strategies to organise and apply the available knowledge to the range of real scientific, business and governance issues", which he believed impacts not only the productivity of a government but also the transparency of information in crucial safety and regulatory applications. He proposed focusing on normative models of E-Government that can assert the integration of data both horizontally and vertically; this form of assertion is intended to be reusable by several E-Government applications.

4. Definition of a service construct for the ontology
In defining a service construct for this ontology, the following areas are key, because the definition of a service for this ontology requires a construct that is generic enough to allow the specification of any kind of service.

a) Cataloguing: Cataloguing is an important aspect of E-services. This aspect of a service should enable users to locate services without having to go through tedious processes. It entails categorising services in the form of informational services, which would aid the easy location of compatible options for sub-services.

b) Combination of services: Services which have similar constructs can be combined, because this would fully support servicing the needs of a customer or citizen. Exploring services that are composed of sub-services also entails mapping services with similar constructs, or mapping similar services, to an integrated or generic service.

c) Change management: Incorporating the changes that occur in the system is another very important aspect of this ontology. By change management we mean the proper documentation of changes such as releases, updates, failures in the system, and the dates of commissioning and decommissioning of the system or parts of it. We consider this an essential part of the ontology. Changes take place all the time, and a mechanism for updating the areas undergoing change must be in place. This aspect of change management provides important information about the inherent dangers certain combinations of services or sub-services could cause.

Table I presents one of the key building blocks of our ontology in development, showing the semantic relationships necessary to specify that a person belongs to a department and that a department offers a service. A service can further be divided into sub-services, which can be made up of similar components. Table II defines a leaf service, that is, a service that stands on its own without being composed of sub-services. Table III presents the axioms that are important in the development of this ontology.


Table I: Defining the ontology construct

Definition                      Description
Person (p)                      p is a person and belongs to d
Department (d)                  d is a department
p → d                           Person offers service and is a member of department
offers service (s, ss)          Where service s has a sub-service ss
has components (c)              service (s, ss) is made up of components (c)

Table II: Defining the service construct

Definition                                              Description
Service (s)                                             s is a service
Has subservice (s, ss)                                  Service s has a sub-service ss
Leaf service (ls) → ((∀s)(¬has subservice(ss, s)))      A leaf service is a service that has no subservice

Table III: Defining axioms for the service construct

Axiom 1. If sb is a subservice of sa, and sc is a subservice of sb, then sc is a subservice of sa: has_subservice(sa, sb) ∧ has_subservice(sb, sc) → has_subservice(sa, sc)

Axiom 2. No service is a sub-service of itself: ¬has_subservice(s, s)

Axiom 3. If s2 is a subservice of s1, then s1 cannot be a subservice of s2: has_subservice(s1, s2) → ¬has_subservice(s2, s1)
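As an illustration of how such axioms could be checked mechanically, the following Python sketch (hypothetical service names, not part of the ontology under development) tests a set of has_subservice assertions against the three axioms in Table III.

# A minimal sketch: check has_subservice assertions against Table III's axioms.
from itertools import product

has_subservice = {
    ("register_birth", "book_appointment"),
    ("book_appointment", "verify_identity"),
}

def violations(pairs):
    problems = []
    # Axiom 2: no service is a sub-service of itself.
    problems += [f"reflexive: {a}" for a, b in pairs if a == b]
    # Axiom 3: the relation must be asymmetric.
    problems += [f"symmetric: {a} <-> {b}" for a, b in pairs if (b, a) in pairs]
    # Axiom 1: transitivity -- report implied pairs that are not yet asserted.
    implied = {(a, d) for (a, b), (c, d) in product(pairs, pairs) if b == c}
    problems += [f"missing transitive pair: {p}" for p in implied - pairs]
    return problems

print(violations(has_subservice))
# ["missing transitive pair: ('register_birth', 'verify_identity')"]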

In an attempt to reuse parts of E-Government ontologies, it is clear that the different ontologies (SmartGov, OntoGov, TerreGov, EGov) have been written by different authors for different purposes, with different assumptions and with different vocabularies. Also, when testing and diagnosing individual or multiple ontologies, it became evident that different authors were using relational arguments in differing orders, so that type constraints were being violated across ontologies. Additionally, if a relation's domain and range constraints were used to conclude additional class membership assertions for the arguments of the relation, those arguments could end up with multiple class memberships that were incorrect.
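The argument-order problem can be illustrated in miniature: if class memberships are inferred from a relation's declared domain and range, a swapped argument order silently yields incorrect types. The sketch below uses invented names and plain Python rather than any particular ontology language.

# A minimal sketch of the argument-order problem described above.
domain_range = {"offers": ("Department", "Service")}

def infer_types(triples):
    """Derive class memberships from a relation's domain/range declarations."""
    types = {}
    for subj, rel, obj in triples:
        dom, rng = domain_range[rel]
        types.setdefault(subj, set()).add(dom)
        types.setdefault(obj, set()).add(rng)
    return types

correct = [("passport_office", "offers", "passport_renewal")]
swapped = [("passport_renewal", "offers", "passport_office")]  # arguments reversed by another author

print(infer_types(correct))  # passport_office -> Department, passport_renewal -> Service
print(infer_types(swapped))  # passport_renewal -> Department: an incorrect class membership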

Ontologies may require small yet pervasive changes in order to allow them to be reused for slightly different purposes. Increasingly, E-Government services are being developed that cut across old departmental lines, and there is an increasing need for intra- and inter-governmental agencies to work more closely together, moving towards joined-up government. With this change comes the need for better communication between people, and for a common vocabulary and understanding of the terms being shared.

5. Designing an ontology to gauge risks and threats associated with e-government

Having an understanding of the kinds of threats that may arise when services are combined gives us insight into the conflict and co-operation that may take place in the back office, especially with respect to the sharing of resources and information property rights.


Homburg et al. (2002) analysed the effects of resource dependence theory and information property rights theory, stating the conflicts that could stem from such mixtures in the network.

The development of modern products relies heavily on the use of IT systems. Woll et al. (2013) identify a major challenge associated with this as the lack of interoperability between different IT systems. Although much research and industrial activity has focused on interoperability in the past, the problem still lingers.

In editing ontologies, attention should be paid to the influence this has on other ontologies. According to Sunagawa et al. (2003), changes in one ontology have the potential to eliminate the consistency that exists between ontologies.

Woll et al. (2013) also outline how approaches to interoperability have been mapped out but lack application in industry, which they attribute to the high cost of linking many different IT systems and the data contained in them.

Successfully building a platform on which E-Government can operate requires the collation of information from the different departments and parastatals that make up the government. Hence there is a lot of replicated data, as data collated for one department may be the same data collated by other departments, even though the modes of collation or delivery may differ. A typical scenario, seen while building this ontology from the UK Government website, is the Births, Deaths, Marriages and Care department, which has Child Benefit as one of the services it offers, and a replication of this same service in the Benefits department. The question is: why can't the Benefits department make use of the framework the Births, Deaths, Marriages and Care department already has? Does the user of the system need to fill in this information independently for each department? Analysis of this scenario based on perceived risks and threats includes the following questions:

1. Could reuse of components or data affect data resourcefulness?

2. Can dependencies amongst services result in conflict?

3. Can the above scenario bring about shifts in power?

4. What difficulties can arise when information is exchanged across departmental boundaries?

5. Does information or resource sharing bring about conflicts amongst departments?

6. What happens to departments that are dependent on other departments for shared resources or information?

7. What threats, risks and conflicts do these pose to such departments? For example, unforeseen dependencies which embody potential single points of failure

8. Can dependencies among services bring about inter departmental co-operation?

(Potential advantage): When concepts are imported into an ontology from other ontologies, the dependencies that exist among them are managed by reproducing the concepts to be imported (Kozaki et al. 2007). In the same vein, when dependencies amongst services exist, all definitions related to the concepts concerned are reproduced. From Figure I, assuming that Ontology B imports concept A3 defined in Ontology A, all the concepts that A2 depends on are reproduced together with the relations among them, and Ontology B imports these reproductions; that is, the system reproduces all definitions related to the concept. In this example these are "the super concepts of A5" (A4 and A1) and "the concept referred to by A5" (A4).

Figure I: Diagram showing dependencies
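Since Figure I is only referenced here, the following Python fragment illustrates the reproduction mechanism in general terms: importing one concept pulls in the transitive closure of the concepts it depends on. The dependency graph is invented for illustration, loosely following the A1–A5 naming of the figure, and is not a reproduction of it.

# A minimal sketch of dependency reproduction when a concept is imported.
depends_on = {
    "A3": {"A5"},   # illustrative only: A3 refers to A5
    "A5": {"A4"},   # A5 refers to A4
    "A4": {"A1"},   # A4's super concept is A1
}

def reproduction_set(concept):
    """Concepts that must be copied into the importing ontology."""
    copied, frontier = set(), {concept}
    while frontier:
        current = frontier.pop()
        for dep in depends_on.get(current, set()):
            if dep not in copied:
                copied.add(dep)
                frontier.add(dep)
    return copied

print(reproduction_set("A3"))  # the set {'A5', 'A4', 'A1'} is reproduced alongside A3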

6. OWL
The choice of OWL to model our ontology gives us the ability to build interoperable systems that enable the production of, reasoning over and visualisation of data in the E-Government domain (OWL 2004). Considering that the E-Government domain is large and the data associated with it are large and complex, there is a need for highly optimised reasoning components, which are made available through off-the-shelf editors such as Protégé (Protege 2005). The main goal of this research is to provide an ontology to support service providers and decision makers in the E-Government domain. The ontology would highlight potential threats and risks which endanger service combinations, component reuse and asset procurement, and what countermeasures can be taken to lower the chances and severity of attacks.

Since most organisations do not develop products that are entirely new at every stage of development, there is an increasing demand for the reuse of existing products and models. Such reuse suffers from a lack of interoperability, especially where tools are missing. Woll et al. (2013) state that the challenge of reusing models lies in choosing the right one to use, adding that the development of a suitable model to reuse depends largely on the requirements and functions of the new product. Considering the knowledge, experience and expectations service providers have of E-Government ontologies, there should be a means by which the performance of these ontologies is evaluated. One of our goals in developing the ontology in terms of reusable components is to make certain functionalities sharable, and to evaluate the information obtained in order to make certain decisions. The question is: by what standards would we be able to evaluate this ontology? We assume, for the purposes of this research, that service providers across government share a common understanding of what a service means to a service receiver. This serves as a platform for designing around the needs of the receiver, and as a yardstick for determining the kind of detail a service provider may be interested in, as well as the amount of detail the proposed ontology should contain. A major area identified is information exchange and dependence in the networks of back offices, which gives us the opportunity to analyse areas of conflict and co-operation that arise from the complex mixture of the services provided. Based on this line of thought, a reasoning engine or tool would be run against the ontology to produce a list of departments offering a particular service, as well as to make inferences on whether certain service combinations are needed and what risks they pose. This would be achieved with the use of semantic technologies to attain interoperability and integration between E-Government systems, and it could serve as a support tool for the stakeholders of the system (government, service providers, software developers) responsible for systems in use. Such a tool should be able to store data obtained from the ontologies as well as computationally solve the problems listed in the above scenarios.
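As a concrete illustration of the kind of query such a tool would answer, the sketch below uses the rdflib Python library (our choice for illustration only; the tooling described in this paper is OWL with the Protégé editor) with invented class, property and individual names, echoing the Child Benefit example from section 5.

# A minimal sketch, not the implementation described above: rdflib is used to
# show how an ontology-backed tool could list the departments offering a given
# service. All names (ex:offersService, the departments, ChildBenefit) are
# illustrative.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/egov#")
g = Graph()

# Two departments both offer the Child Benefit service.
g.add((EX.ChildBenefit, RDF.type, EX.Service))
for dept in (EX.Benefits, EX.BirthsDeathsMarriagesAndCare):
    g.add((dept, RDF.type, EX.Department))
    g.add((dept, EX.offersService, EX.ChildBenefit))

# "Which departments offer Child Benefit?" -- the overlap such a tool would flag.
query = """
    SELECT ?dept WHERE {
        ?dept a ex:Department ;
              ex:offersService ex:ChildBenefit .
    }
"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.dept)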

6.1 Representation of roles in the ontology using OWL
Understanding the meaning of Role in the development of an ontology is very important. Although the usefulness of OWL in ontology representation cannot be under-estimated, semantic interoperability can equally be decreased because of gaps that may exist between developers and OWL in properly defining roles (Kozaki et al. 2007). To overcome this problem, a clear understanding of the roles of a Service Provider in the E-Government domain is needed. Within our ontology, a key concept is that of Service Provider. We analyse the activities of a Service Provider and the roles he plays in the examples below. A Service Provider joins the E-Government domain to publish his service; however, he has to choose an appropriate department to which his service maps. It is also possible for a Service Provider to map to different departments offering different services. If a ServiceProvider chooses to map a service to a non-existent department within the ontology, the ontology allows him to create a new department to map his services to. The ServiceProvider can therefore readily associate with an existing department in the ontology, or he can update the ontology with a new department and associate with that new department. Our model distinguishes between two principal types of role, those of service provider and service receiver. A ServiceProvider is a person able to carry out several different roles; it is therefore worth noting that this role potentially involves a one-to-many relationship.

Figures IIa, IIb and IIc show three examples of our role representation model for the ServiceProvider in the ontology being developed, expressed in OWL. Each example is evaluated according to the requirements for dealing with roles. With respect to our ontology, the ServiceProvider role depends on a department as its domain.

Example 1 (in Figure IIa): the role of a ServiceProvider is dealt with via the serviceProviderOf property. This object property may represent the role that is determined in, for example, the "ServiceProvider-ServiceReceiver relation". However, the context dependency of the role concept remains implicit. A critical problem therefore arises, because the context dependency is essentially related to the role's other characteristics.

Figure IIa: Role Representation model 1

Example 2 (in Figure IIb): This model represents the context of ServiceProvider explicitly. The role is still dealt with as a property, which can complicate the management of role identity in the instance model. For example, it is difficult to describe that, after a particular person resigns from the role of ServiceProvider, another person can fill the same role. This model is not intended to represent the state of the role concept, which means that a vacant role would be difficult to represent or identify.

Figure IIb: Role Representation model 2

Example 3 (in Figure IIc): The hasPart property in this model means that a Department consists of ServiceProvider(s). A restriction placed on the dependOn property in the ServiceProvider class shows that a ServiceProvider depends on a Department as its domain in this context. This model is superior to the previous two models because it solves their problems. However, a ServiceProvider is classified as a Person, which can be confused with the role concept ServiceProvider. According to the semantics of rdfs:subClassOf, an instance of ServiceProvider and its player (an instance of Person) are required to be one and the same instance. The player can therefore not cease to be an instance of ServiceProvider without ceasing to be an instance of Person; that is, deletion of a ServiceProvider instance brings with it the deletion of a Person instance.

Figure IIc: Role Representation model 3
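For readers who want to see what the Model 3 pattern looks like as triples, the following rdflib sketch writes down one possible encoding of the relationships described above (ServiceProvider as a subclass of Person, a dependOn restriction towards Department, and hasPart from Department to ServiceProvider). It is an illustration under our own naming assumptions, not a reproduction of the exact axioms shown in Figure IIc.

# One possible encoding of the Model 3 pattern -- illustrative only.
from rdflib import Graph, Namespace, BNode, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/egov#")
g = Graph()

def some_values_from(prop, cls):
    """Build an owl:Restriction node: 'has at least one <prop> value from <cls>'."""
    r = BNode()
    g.add((r, RDF.type, OWL.Restriction))
    g.add((r, OWL.onProperty, prop))
    g.add((r, OWL.someValuesFrom, cls))
    return r

# ServiceProvider is classified under Person (the rdfs:subClassOf link whose
# side effects on instance deletion are discussed in the text) ...
g.add((EX.ServiceProvider, RDFS.subClassOf, EX.Person))
# ... and depends on a Department in this context.
g.add((EX.ServiceProvider, RDFS.subClassOf, some_values_from(EX.dependOn, EX.Department)))
# A Department consists of ServiceProvider(s) via hasPart.
g.add((EX.Department, RDFS.subClassOf, some_values_from(EX.hasPart, EX.ServiceProvider)))

print(g.serialize(format="turtle"))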

7. Conclusions and future work
In this paper we have discussed the role of ontologies in the delivery of E-Government services, the advantages of reusing the components that cut across these services, and the inherent risks and challenges that a government may face when reusing components. We present the role a suitably designed OWL ontology could play in the delivery, management and evolution of E-Government services. The model presented in this paper focuses predominantly on the specification of a general service construct which represents the various roles and characteristics of a service. The ontology under development contributes to the semantic interoperability that should exist in E-Government ontologies and also provides a novel potential approach to solving the problems caused by the reuse of components.

In terms of future work, we plan to investigate the key problems caused by the reuse of services and service components, as well as to enhance our tool and investigate its resilience on an existing real-world government case study.

References
Anon, 2003. The Greenhouse Effect. Available at: http://www.academon.com/essay/the-greenhouse-effect-18937/ [Accessed August 11, 2014].

Barbagallo, A., De Nicola, A. & Missikoff, M., 2010. eGovernment Ontologies: Social Participation in Building and Evolution. 2010 43rd Hawaii International Conference on System Sciences, pp.1–10. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5428280.


Fonou-dombeu, J.V. & Huisman, M., 2011. Semantic-Driven e-Government: Application of Uschold and King Ontology Building Methodology for Semantic Ontology Models Development, 2(4), pp.1–20.

Gruber, T.R., 1993. Toward Principles for the Design of Ontologies Used for Knowledge Sharing, pp.907–928.

Gugliotta, A., Cabral, L. & Domingue, J., 2005. Knowledge modelling for integrating semantic web services in e-government applications. Conference item.

Holmes, B., 2011. Citizens' engagement in policymaking and the design of public services, (1).

Homburg, V., Bekkers, V. & Rotterdam, N., 2002. The Back-Office of E-Government (Managing Information Domains as Political Economies). Center for Public Management, 00(c), pp.1–9.

Janssen, M. & Wagenaar, R., 2004. Developing Generic Shared Services for e-Government. Electronic Journal of E-Government, 2(1), pp.31–38.

Kozaki, K. et al., 2007. Role Representation Model Using OWL and SWRL.

Layne, K. & Lee, J., 2001. Developing fully functional E-government: A four stage model. Government Information Quarterly, 18(2), pp.122–136. Available at: http://linkinghub.elsevier.com/retrieve/pii/S0740624X01000661.

Mundy, D. & Musa, B., 2010. Towards a Framework for eGovernment Development in Nigeria, 8(2), pp.148–161.

OWL, 2004. OWL Web Ontology Language Overview.

Protege, 2005. The protégé ontology editor and knowledge acquisition system. Available at: http://protege.stanford.edu/.

Ratneshwer & Tripathi, A.K., 2010. Some Component Generation Approaches for E-Governance Systems. International Journal of Public Information Systems, 2, pp.133–147.

Sheng, L. & Lingling, L., 2011. Application of Ontology in E-Government. 2011 Fifth International Conference on Management of e-Commerce and e-Government, pp.93–96. Available at: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6092638 [Accessed April 22, 2014].

Singhal, A., 2010. Ontologies for Modeling Enterprise Level Security Metrics, pp.5–7.

Stojanovic, L. et al., 2004. On Managing Changes in the ontology-based E-Government. In On the Move to Meaningful Internet Systems, pp.1080–1097.

Stumme, G., Studer, R. & Sure, Y., 2000. Towards an Order-Theoretical Foundation for Maintaining and Merging.

Sunagawa, E. et al., 2003. Management of dependency between two or more ontologies in an environment for distributed development.

Vasista, T.G.K., 2011. Semantic Data Integration Approaches, 2(1).

Welty, C. & Fikes, R., 2006. A Reusable Ontology for Fluents in OWL. In Proceedings of the 2006 Conference on Formal Ontology in Information Systems, pp.226–236.

Woll, R., Geißler, C. & Hakya, H., 2013. Modular ontology design for semantic data integration, pp.3–6.


Assessing trustworthiness in Digital Information
Frances Johnson, Laura Sbaffi and Jenny Rowley
Department of Languages, Information & Communications, Manchester Metropolitan University

Abstract/Summary
This paper reports the key findings of an empirical study conducted with undergraduate students to identify and assess the core constructs of online trust, with a focus on digital health information. The study suggests the use of a trust scale in further research into the potential impact of system and information design on users' trust in, and use of, digital information.

1. Introduction
Within the field of Information Management there are very diverse research communities. Connecting the communities of those studying information behaviour and those focusing on systems design aims to build knowledge of information behaviours to inform and improve the design of interactive information systems. In the context of providing information in digital environments, the user's trust formation may be core as a predictor of the 'intention to use' a given piece of information, typically to resolve some underlying need or problem. As the design of information systems and information-based applications becomes more interactive, enhanced with functionality enabled in the web environment (for example, links to related information, recommendations, features for annotation and personalisation), it seems vital that such developments serve a purpose and, at the same time, have an impact on users' judgements of trust. With regard to the conference theme of 'making connections', this paper reviews our recent research on modelling online trust, with a discussion of some of the main findings. Based on this, we propose a trust scale developed for understanding the constructs of trust in digital information contexts; further research will focus on exploring the potential impact of system design on the confident use of digital information.

Trust in digital environments has been widely studied (e.g. Chopra and Wallace, 2003; Ivanov et al., 2012; Kelton et al., 2008; Shekarpour and Katebi, 2010; Belanger and Carter, 2008; Rowley and Johnson, 2013) but, specific to online information, the user has a particular need to fulfil, creating a state of dependence on the information and providing a necessary precondition for trust to be formed (Rousseau et al., 1998). As such, trust is a dynamic concept formed in the context of the information need. It is unlikely that use is made of information that is not trusted and so it is appropriate that, in assessing the information, we look for indicators of its trustworthiness. Within the framework of digital information, indicators of trustworthiness are likely to be numerous and the formation of trust is likely to be subject to various influences, thus becoming multidimensional and difficult to measure directly. The aim of this research is to identify the factors that influence the formation of trust in digital information. This empirical study adopted a quantitative, survey-based research design, in order to develop measurement items and explore the relationships between variables in the process of trust formation.

2. The trust scale
Trust scales have been developed in the past (e.g. the 24-item scale from Sillence et al., 2007), whilst other authors have researched trust formation by gaining insights from interviews and/or qualitative research (Robins et al., 2010; Zhang, 2012). In this study, a questionnaire was chosen to collect data as this approach was deemed the most suitable for gathering large amounts of data and collecting accurate information. The core of the questionnaire was a set of 50 Likert-style statements designed to investigate respondents' (in this case students') perceptions of the relative importance of various aspects of web health information in their evaluation of its trustworthiness. Previous research on the constructs of trust has suggested that factors like style and authority (e.g. Sillence et al., 2007) and criteria such as credibility (e.g. Corritore et al., 2012) influence the formation of trust. Research into how people evaluate information when searching in specific contexts, for example for school coursework or for health information, has identified factors relating to design as influencing trust formation (Sillence et al., 2007). Drawing on these and other previous studies, the set of statements in the questionnaire was chosen to reflect the possible constructs of trust, including information credibility, usefulness, content, authority, style, verification, brand, ease of use and recommendation, all measured on a 5-point scale. For example, the variable 'authority' was indirectly measured from the responses to five items (e.g. 'that the author appears to be knowledgeable' and 'that the author's qualifications and/or expertise are indicated'). Each construct was represented by at least four items. The indicators of trustworthiness designed and selected for the questionnaire ranged from tangible attributes of the information itself, such as content and style, to more peripheral (but still important in the digital context) factors, such as ease of use, as well as factors that relate to the users' assessment of the information itself, such as usefulness and credibility. Ultimately, the implementation of the questionnaire seeks to identify which of these influencing factors are the core constructs of trust formation in digital information.

Participants were 1st and 3rd year undergraduate students at a large metropolitan university in the UK, recruited from the discipline areas of humanities, business and sport. Copies of the questionnaire were distributed in class settings, and participants were asked to think about some information they had recently looked up on the web relating to a health issue they were generally interested in, or to a serious complaint or condition they had in mind. This brief scenario was set to ensure that all participants would be able to recall information found in response to an information need, albeit at different levels of interest (i.e. passing, medium or serious). In total, 531 usable questionnaires were returned out of the 550 originally distributed. The exploration of the data collected involved principal components analysis to test a theoretical model of the core criteria on which trust is formed and to model the observed correlations among the influences, in order to gain a critical evaluation of the information and its context. This approach to investigating the constructs of trust allowed further insight to be gained from the variations observed in the factors influencing trust formation across participants grouped by user characteristics, such as course of study or gender, as well as by task variation, such as a passing or serious interest in the health topic. Further details of these studies, including the design of the instrument alongside a detailed comparison of the items used in the evaluation across the years, are reported in Rowley et al. (2014), while the factor analysis, along with modelling trust based on the criteria of usefulness and credibility and their influencing factors, is reported in Johnson et al. (2014). In this paper we consider only the difference between the two year groups of the students surveyed.

3. The constructs of trust
In order to identify the constructs of trust, principal component analysis (PCA) was used to test the convergent and discriminant validity of the questions (items) and to extract the underlying factors in the data. Subsequently, confirmatory factor analysis (CFA) was carried out on each of the 1st year and 3rd year datasets, in order to compare the resulting measures across the two year groups.


Table 1: Summary of the factors identified with CFA for the 1st and 3rd year students’ datasets

1st Year students

  Factor 1: Ease of Use – Access
    EU1 - How easy it was to access the information (IR 0.73)
    EU3 - The information is free (IR 0.75)
    ST2 - The ease with which I can read the information (IR 0.77)
    EU2 - How easy it was to find the information (IR 0.81)

  Factor 2: Believable Content
    CR1 - Whether I feel I can believe the information (IR 0.70)
    AU4 - The information appears to be objective (i.e. no hidden agenda) (IR 0.73)
    CO4 - The accuracy of the information (such as the absence of errors) (IR 0.75)

  Factor 3: Personal Recommendation
    RE6 - My friends and family use the source (IR 0.71)
    RE1 - Family and friends have recommended the source to me (IR 0.86)

  Factor 4: Branded – Logo
    BR1 - The information source features the logo of a respected brand (IR 0.66)
    BR2 - The information source carries the logo of a well-known brand (IR 0.87)

3rd Year students

  Factor 1: Reliable Content
    AU4 - That the information appears to be objective (i.e. no hidden agendas) (IR 0.65)
    CO3 - The reliability of the information (IR 0.73)
    CO2 - The comprehensiveness of the information (IR 0.74)
    CO4 - The accuracy of the information (such as the absence of errors) (IR 0.77)

  Factor 2: Assessing Credibility
    CR5 - The extent to which the source contains facts rather than opinions (IR 0.66)
    CR3 - The impartiality of the information (IR 0.69)
    CR1 - Whether I feel I can believe the information (IR 0.70)
    CR4 - The quality of the information (IR 0.75)
    CR2 - The objectivity of the information (IR 0.81)

  Factor 3: Personal Recommendation
    RE4 - I have seen recommendations from members of a social network community (IR 0.71)
    RE1 - Family and friends have recommended the source to me (IR 0.73)
    RE6 - My friends and family use the source (IR 0.79)

  Factor 4: Ease of Use – Access
    EU1 - How easy it was to access the information (IR 0.89)
    EU2 - How easy it was to find the information (IR 0.97)

  Factor 5: Assessing Usefulness
    UF1 - That the information tells me most of what I need to know (IR 0.78)
    UF2 - That the information helps me to understand the issue better (IR 0.88)

  Factor 6: Style – Readable
    ST3 - The clarity of the structure of the information (IR 0.67)
    ST1 - The ease with which I can understand the information (IR 0.85)
    ST2 - The ease with which I can read the information (IR 0.94)

  Factor 7: Branded – Logo
    BR1 - The information source features the logo of a respected brand (IR 0.90)
    BR2 - The information source carries the logo of a well-known brand (IR 0.90)

The Cronbach's alpha coefficient was calculated; at 0.937 for the 1st year dataset and 0.933 for the 3rd year dataset, the reliability of the scale within the samples was confirmed (Bryman and Bell, 2011). The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy was calculated and, with values of 0.879 (1st year students) and 0.874 (3rd year students), both samples were well above the recommended value of 0.6 (Kaiser, 1974), confirming that the use of principal components analysis was appropriate. A scree plot was used to identify the number of factors; also, to satisfy convergent validity, and hence to make sure that all items intended to measure a construct did indeed reflect that construct, only factor loadings greater than 0.5 were retained. Items with low loadings or cross-loadings were removed. This resulted in the identification of six factors in the 1st year dataset, explaining a total of 48% of the variance, and seven factors in the 3rd year dataset, explaining a total of 53.6% of the variance. In order to test the measurement model, CFA was performed on the factors and the items. According to Segars and Grover (1998), the measurement model should be evaluated first and then re-specified as necessary to generate the 'best-fit' model. This iterative process led to a refined measurement model with four factors and 11 items in the 1st year dataset and seven factors and 21 items in the 3rd year dataset. Item reliability (IR) ranged from 0.660 to 0.926, thus exceeding the acceptable value of 0.500 recommended by Hair et al. (2006). The average variance extracted (AVE) ranged from 0.576 to 0.624 in the 1st year and from 0.516 to 0.805 in the 3rd year, which for all factors exceeded the threshold value of 0.500 recommended by Fornell and Larcker (1981). The factors from the CFA are shown in Table 1. The labels for the intended constructs were retained for each as follows: Content, Credibility, Recommendation, Ease of Use, Usefulness, Style and Brand, but for clarity each is given a sub-label reflecting the core items forming the factors extracted in the analysis.
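
For concreteness, the following sketch shows how the reliability and sampling-adequacy checks reported above might be computed in Python; the data file, the column layout and the example loadings are assumptions for illustration, not the study's data.

```python
# Hypothetical sketch of the scale checks described above. The item DataFrame
# and the example loadings are placeholders for illustration only.
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

responses = pd.read_csv("trust_items_year3.csv")   # hypothetical item-level data
print("Cronbach's alpha:", round(cronbach_alpha(responses), 3))

# KMO measure of sampling adequacy; values above 0.6 support factor analysis.
kmo_per_item, kmo_total = calculate_kmo(responses)
print("KMO:", round(kmo_total, 3))

# Average variance extracted for one factor: the mean of the squared
# standardised loadings (Fornell and Larcker, 1981), with 0.5 as the usual floor.
example_loadings = np.array([0.72, 0.78, 0.81])   # illustrative values only
ave = np.mean(example_loadings ** 2)
print("AVE for the example factor:", round(ave, 3))
```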

3.1. Core constructs and impacts The results of the principal components and confirmatory factor analyses provide the constructs involved in the formation of trustworthiness judgements of information (in health domains). The seven constructs in the 3rd year data were Assessing Credibility, Assessing Usefulness, Reliable Content, Personal Recommendation, Ease of Access, Style – Readable and Branded – Logo. These explain 53.6% of the variance in the data on the trust scales, suggesting that these factors are quite comprehensive in explaining trust judgements. A smaller number of factors was found for the 1st year students: Ease of Access, Believable Content, Personal Recommendation and Branded – Logo. An in-depth examination of the differences in the factors between the 1st and 3rd years is presented in Rowley et al. (2013; 2014). The general explanation for this difference in the principal components analysis is that the 1st year students relied more on Ease of Access while the 3rd year students demonstrated an increasing sophistication in their evaluation of the information with reference to its Credibility and Usefulness, both of which are criteria on which we might judge the information in forming trust. Based on these findings we speculate that students in their 3rd year of study engage in a more critical assessment of the information and are influenced not only by the features of the information as indicators of trustworthiness, but also by their assessment of the information on criteria such as its Usefulness and Credibility.

Standard multiple regression was performed on both datasets to model the relationship of the factors in determining the judgements of Usefulness and Credibility. Preliminary analyses were conducted to ensure that there was no violation of the assumptions of normality, linearity, multicollinearity and homoscedasticity. The results were very different for the 1st and 3rd years; while for the 1st year no statistically significant outcome was reached, for the 3rd year the results show that the factors Ease of Access, Reliable Content and Brand associate with both Usefulness and Credibility (Figures 1a and 1b), while Style only associates with Usefulness (Figure 1b).
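
A minimal sketch of how such a standard multiple regression might be run in Python with statsmodels is given below, including a variance inflation factor check for multicollinearity; the factor-score file and column names are hypothetical assumptions, not the study's data.

```python
# Hypothetical sketch of the regression step described above. Factor scores
# are assumed to be stored one column per factor, one row per respondent.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

scores = pd.read_csv("year3_factor_scores.csv")    # hypothetical factor scores

predictors = scores[["Style", "Content", "Brand", "EaseOfAccess"]]
X = sm.add_constant(predictors)

# Fit one model per dependent variable, mirroring Figure 1 (a) and (b).
for dv in ["Credibility", "Usefulness"]:
    model = sm.OLS(scores[dv], X).fit()
    print(dv)
    print(model.summary().tables[1])

# Variance inflation factors as a quick multicollinearity check.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)
```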

Figure 1: Standard multiple regression analysis performed on the 3rd year dataset posing Credibility (a) and Usefulness (b) as dependent variables

(a) Dependent variable: Credibility
(Constant): B = 1.633, Std. Error = 0.227, t = 7.197, Sig. = .000
Style: B = -0.011, Std. Error = 0.058, Beta = -0.012, t = -0.194, Sig. = .846
Content: B = 0.392, Std. Error = 0.057, Beta = 0.419, t = 6.912, Sig. = .000
Brand: B = 0.083, Std. Error = 0.038, Beta = 0.115, t = 2.172, Sig. = .031
Ease of Access: B = 0.188, Std. Error = 0.049, Beta = 0.220, t = 3.857, Sig. = .000

(b) Dependent variable: Usefulness
(Constant): B = 1.218, Std. Error = 0.183, t = 6.640, Sig. = .000
Style: B = 0.216, Std. Error = 0.047, Beta = 0.274, t = 4.609, Sig. = .000
Content: B = 0.186, Std. Error = 0.046, Beta = 0.231, t = 4.067, Sig. = .000
Brand: B = 0.112, Std. Error = 0.031, Beta = 0.180, t = 3.633, Sig. = .000
Ease of Access: B = 0.141, Std. Error = 0.039, Beta = 0.192, t = 3.581, Sig. = .000

(B = unstandardized coefficient; Beta = standardized coefficient)

The explanation for this in terms of a model of trust is presented in Johnson et al. (2014), who propose that users' assessments of the information on the constructs of Usefulness and Credibility are antecedents to trust formation and are determined or influenced by various factors relating to the information's Content, Style and Ease of Access. Interestingly, the factors formed in the 1st year data, Recommendation and Brand, were not found to have an association with the judgements of Usefulness and Credibility, which calls these into question as core constructs of trust when trust is actively formed through assessing the information. The 1st year students appear to rely on the more contextual indicators of trustworthiness: Ease of Access, Recommendation and Brand. The assessment of the information based on the criteria of Usefulness and Credibility, with these factors relating, among others, to the Content construct, indicates a more involved and sophisticated assessment in the users' trust formation. With the intention of extending this study of the influencing factors of trustworthiness, a subsequent study, using the same questionnaire, was carried out involving 471 adult members of the public with various educational backgrounds. The early indication from the analysis is that these data factorise closely in line with the 3rd year data; this will be presented, along with the above findings from the earlier studies, at the workshop.

4. Discussion The findings from this study demonstrate that trust formation involves the assessment of the information based on a range of factors. This research identifies the key factors that influence trust judgements as reliability of Content, Readability/Style, Ease of Access, Brand and, to some extent, Recommendation. The factors derived from the analysis of the data position the assessments of Credibility and Usefulness as the most important antecedents to trust. Exploration of the factors offers further insight into the critical evaluation of the information, particularly with a greater range of cues and indicators being brought to bear in judgements formed by students as they progress to the final stages of their studies. The trust scale developed from the 3rd year data includes the core factors Assessing Credibility and Assessing Usefulness along with the influencing factors Reliable Content, Readable Style, Ease of Access and Brand, forming the six constructs of trust as shown in Table 2.

Table 2: The trust scale based on the 3rd year data

Assessing Credibility
The extent to which the source contains facts rather than opinions
The impartiality of the information
Whether I feel I can believe the information
The quality of the information
The objectivity of the information

Assessing Usefulness
That the information tells me most of what I need to know
That the information helps me to understand the issue better

Reliable Content
That the information appears to be objective (i.e. no hidden agendas)
The reliability of the information
The comprehensiveness of the information
The accuracy of the information (such as the absence of errors)

Readable Style
The clarity of the structure of the information
The ease with which I can understand the information
The ease with which I can read the information

Ease of Access
How easy it was to find the information
How easy it was to access the information

Brand
The information source features the logo of a respected brand
The information source carries the logo of a well-known brand

With regard to the development of our research in understanding trust, we intend to use the item scale for the identified factors to explore the influence of user and/or task characteristics on the formation of trust. Sillence et al. (2007) recognised the judgement to be dynamic, as do others such as Lucassen et al. (2013), suggesting that trust formation is further influenced by user characteristics (such as domain expertise), leading to different features of the information being used in trust judgements (Wildemuth, 2004; Hembrooke et al., 2005). Keselman et al. (2008) found that imprecise domain knowledge led consumers to search for information on irrelevant sites. Whilst these investigations into the role of domain expertise focus on the impact on actions taken to find information, Lucassen et al.'s (2013) study on Wikipedia use did show that those familiar with a topic focussed on the semantic features of the information, whilst those who were unfamiliar with the topic paid more attention to surface features. Whilst the present research into trust formation did not gather details on domain knowledge or other user characteristics, a distinction was observed between the 1st and 3rd year students, with the 3rd years drawing on more, and more diverse, factors relating to Content and Style as well as design factors such as Ease of Access as constructs of trust. Further research with the trust scale will explore the effect of domain knowledge and other variables, such as task and type of information, on the assessment of the information in the formation of trust judgements using the content and the design or other contextual indicators.

Further use of the trust scale is proposed with regard to the evaluation of web sites, given the critical importance of trustworthiness in the provision of digital information. Participant responses to the items that formed the factors of trust provide an indication of the level of trust in a given piece of digital information and, in doing so, an evaluation of the impact of its design. Of particular interest is the potential use of the trust scale as a diagnostic, as in more traditional usability testing. Questionnaires used in usability studies, such as the Questionnaire for User Interaction Satisfaction (QUIS) (Chin et al., 1988), provide a series of statements through which users respond to aspects of the design of the site based on usability principles and heuristics. For example, the QUIS scale measures overall reaction ratings of the system and specific factors such as interface, terminology and system feedback. This provides the designer/developer with an evaluation of the usability of the site, which is an important indicator of the users' effective use and experience of the site. Based on usability principles, the questionnaire can be used as a tool to identify where the site is failing to conform to best practice and possibly hindering the overall user experience. It is proposed that the user evaluation of a web site, based on responses to the trust items, would provide a latent measure of their assessment of the information's Usefulness, Credibility, Content (reliable), Style (readable), Brand and Ease of Access, all of which, by influencing trust formation, indicate an evaluation of the information provided. Consequently, the trust scale could also provide a diagnostic of the impact of site design, with insights into how the information is critically evaluated. This is not to imply that trust levels might be improved simply by altering a characteristic of the information, for example by the straightforward addition of a logo. Such a step may, however, have an impact on users in their assessment of the information while making trust judgements. Further research conducted within and across different information types, tasks and user groups, as suggested here, aims to confirm and develop the constructs of trust as an important tool in the evaluation of digital information and its services. The next phase of our investigation will focus on this testing of the trust scale as an instrument for the data collection of user evaluations of health information websites.

5. Conclusion On the basis of these findings, recommendations for future research can be made in terms of further developing and implementing the trust scale as an instrument for advancing understanding of trust formation as an information behaviour in digital information contexts. Not least, further investigation is needed in both user and task contexts to explore the nature of the differences in the factors influencing the formation of trust, with a view to obtaining a theoretical model. In terms of practical application, the scale, originally developed for use in evaluating the impact of design on the trust users form, offers the potential to inform system designers and developers concerned not only with the impact on use and usability but also, and of equal importance, with the users' ability to form critical judgements on the information presented and, specifically, on the particular dimensions or constructs of trust. This paper has presented a large scale study of trust in digital information with reflection on its potential to inform and build collaborative work across relevant research communities in information management. The study of trust formation, when based on the critical evaluation of the information as a key behaviour, can be invaluable in the development and evaluation of system interface design by enabling critical user behaviour.

6. References
Belanger, F. and Carter, L. (2008) Trust and risk in e-government adoption. The Journal of Strategic Information Systems, 17 (2): 165-176.
Bryman, A. and Bell, E. (2011) Business Research Methods. 3rd ed. Oxford University Press.
Chin, J.P., Diehl, V.A. and Norman, K.L. (1988) Development of an instrument measuring user satisfaction of the human-computer interface. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 213-218, ACM.
Chopra, K. and Wallace, W.A. (2003) Trust in electronic environments. In: Proceedings of the 36th Annual Hawaii International Conference on System Sciences, IEEE.
Corritore, C., Wiedenbeck, S., Kracher, B. and Marble, R.P. (2012) Online Trust and Health Information Websites. International Journal of Technology and Human Interaction, 8 (4): 92-115.
Fornell, C. and Larcker, D.F. (1981) Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18: 39-50.
Hair, J.F., Black, W.C., Babin, B.J., Anderson, R.E. and Tatham, R.L. (2006) Multivariate Data Analysis. 6th ed. Upper Saddle River, NJ: Pearson Education.
Hembrooke, H.A., Gay, G.K. and Granka, L.A. (2005) The effects of expertise and feedback on search term selection and subsequent learning. Journal of the American Society for Information Science, 56 (8): 861-871.
Ivanov, I., Vajda, P., Lee, J.S. and Ebrahimi, T. (2012) In tags we trust: Trust modelling in social tagging of multimedia content. IEEE Signal Processing Magazine, 29 (2).
Johnson, F., Rowley, J. and Sbaffi, L. (2014) Modelling trust formation in digital information contexts. Journal of Information Science, in press.
Kaiser, H. (1974) An index of factorial simplicity. Psychometrika, 39: 401-425.
Keselman, A., Browne, A.C. and Kaufman, D.R. (2008) Consumer health information seeking as hypothesis testing. Journal of the American Medical Informatics Association, 15 (4): 484-495.
Kelton, K., Fleischman, K.R. and Wallace, W.A. (2008) Trust in digital information. Journal of the American Society for Information Science and Technology, 59 (3): 363-374.
Lucassen, T., Muilwijk, R., Noordzij, M.L. and Schraagen, J.M. (2013) Topic familiarity and information skills in online credibility evaluation. Journal of the American Society for Information Science and Technology, 64 (2): 254-264.
Robins, D., Holmes, J. and Stansbury, M. (2010) Consumer health information on the Web: The relationship of visual design and perceptions of credibility. Journal of the American Society for Information Science and Technology, 61 (1): 13-29.
Rousseau, D.M., Sitkin, S.B., Burt, R.S. and Camerer, C. (1998) Not so different after all: A cross-discipline view of trust. Academy of Management Review, 23 (3): 393-404.
Rowley, J. and Johnson, F. (2013) Understanding trust formation in digital information sources: The case of Wikipedia. Journal of Information Science, 39 (4): 494-508.
Rowley, J., Johnson, F. and Sbaffi, L. (2013) Insights into trust in digital health information. CARPE Conference, Manchester, UK, 4-6 November 2013.
Rowley, J., Johnson, F. and Sbaffi, L. (2014) Does university education influence students' approach to the evaluation of digital information? A perspective from trust in online health information. British Journal of Educational Technology, in press.
Segars, A.H. and Grover, V. (1998) Strategic Information Systems Planning Success: An Investigation of the Construct and Its Measurement. MIS Quarterly, 22 (2): 139-163.
Shekarpour, S. and Katebi, S.D. (2010) Modelling and evaluation of trust with an extension in semantic web. Web Semantics: Science, Services and Agents on the World Wide Web, 8 (1): 26-36.
Sillence, E., Briggs, P., Harris, P. and Fishwick, L. (2007) How do patients evaluate and make use of online health information? Social Science & Medicine, 64 (9): 1853-1862.
Wildemuth, B.M. (2004) The effects of domain knowledge on search tactic formulation. Journal of the American Society for Information Science and Technology, 55 (3): 246-258.
Zhang, Y. (2012) College students' uses and perceptions of social networking sites for health and wellness information. Information Research – an International Electronic Journal, 17 (3).

Managing the risks of internal uses of Microblogging within Small and Medium Enterprises
Soureh Latif Shabgahi and Andrew Cox
University of Sheffield

Abstract As well as using it for marketing, many organisations use microblogging tools like Twitter or Yammer for internal communication. Previous literature has explored such uses and also perceptions of their risks, but mostly in the context of trials in large organisations. The aim of the research described in this paper is to investigate the risks of microblogging as perceived by SMEs in the UK and what these organisations did to manage such risks. Interviewees in the study did perceive microblogging to be risky, especially for reputation. Suggestions about how to mitigate risk included guidelines on what was microblogged; guidelines on who should do it; review procedures; and training of staff; in addition, some relied on complaints procedures as the way to deal with risk. Organisations did not seem to be doing anything about some of the types of risk they perceived.

Keywords: Microblogging, Enterprise Microblogging, Twitter, Yammer, Risk, Mitigating risk.

Introduction In the last five years, microblogging has been a growing phenomenon and it has naturally become a popular topic of investigation by researchers. In the corporate context, there has been surprisingly little research on microblogging within organisations – in contrast to the growing body of literature focussed on its use in marketing. Such internal uses are sometimes termed enterprise microblogging (EMB) (Álvaro et al., 2010). Internal uses of microblogging include raising awareness among co-workers (Günther et al., 2009) and creating or sustaining a feeling of connectedness between colleagues (Zhao and Rosson, 2009). Also, microblogging can be used for knowledge sharing (Riemer et al., 2011), providing work-related updates (Riemer and Richter, 2010), asking or responding to questions (Ehrlich and Shami, 2010) and recording information for future reference (Riemer and Richter, 2010).

While microblogging use internally within an organisation may have great potential benefits, previous research suggests that organisations are concerned about risks and that they need to take action in order to mitigate these risks (Lee and Warren, 2010). Yet ways of managing the risk remain under-explored in the literature, much of which was conducted in the context of early experiments in large corporations. Thus the focus of this study is to investigate the risks of microblogging as perceived by small to medium enterprises and explore the actions they took to mitigate the risks.

The paper is laid out as follows. Firstly, existing literature in the field of microblogging is examined; this is mainly based on the specific risks of EMB. Details of the research design and methodology are then provided, before presenting the findings of this study. This includes a visual representation of the policies and guidelines of EMB. The last sections present the discussion and conclusions.

Literature review The most familiar example of microblogging is, of course, Twitter. Launched in 2008 and since 2012 owned by Microsoft, Yammer is another example of a microblogging platform, but one specifically designed as a private and collaborative tool for enterprise social networking. Internal corporate uses of microblogging have begun to be the object of research. Most research to date has typically been the result of trials of microblogging tools within large organisations (Riemer et al., 2011). Examples of uses of EMB identified in the literature include creating or sustaining a feeling of connectedness (Ehrlich and Shami, 2010), sharing knowledge or information (Schöndienst et al., 2011; Mayfield, 2009) and asking or responding to questions (Günther et al., 2009). The EMB literature also explores the risks of microblogging. The commonest types of risk identified in the literature are to do with the privacy of employees and the security of the organisation. Notions of privacy may affect the decision to use EMB (Günther et al., 2009; Zhao and Rosson, 2009) for sharing data, contributing content and responding to others (Schöndienst et al., 2011). Case and King (2010) suggest that there is a risk of confidential information about the organisation leaking to outsiders through microblogging. In one study, employees had concerns about spending too much time wading through different pieces of information when using their organisation's internal EMB tool, BlueTwit (Ehrlich and Shami, 2010). Several researchers refer to the high 'noise-to-value ratio' of microblogging (Schöndienst et al., 2011). Günther et al. (2009) also reported that participants had concerns about EMB taking up too much time.

Several researchers have identified the need for organisations to take action in order to mitigate such risks. For example, Case and King (2010) suggest that rules and guidelines may be required in order to balance the benefits of using EMB tools such as Twitter with the risk of leaking private data. Similarly, Schöndienst et al. (2011) suggested that people had concerns about who might see information they posted; by controlling who can receive content, and for how long it is shared, issues of privacy could be minimised. According to Mayfield (2009), for best use of EMB, microblogging needs to be secure and in agreement with the organisation's security guidelines. The research by Zhao and Rosson (2009) showed that individuals would be more prepared to increase their activity on Twitter in relation to work if they sensed that microblogging was a protected area for sharing data within the organisation. Ehrlich and Shami (2010) found that employees avoided posting any confidential information about the organisation publicly through Twitter. Instead, they preferred to share details through BlueTwit, their internal microblogging tool. Most of the recommendations relate to what are perceived to be the main types of risk identified in the literature, which are to do with the privacy of employees and the security of the organisation.

Ehrlich and Shami (2010) have suggested that for extensive use of EMB, there is a need to find intelligent ways of filtering data, such as maintaining data relevant to individual users. Raeth et al. (2009) further suggest that users need to be trained in areas such as software use and business guidelines so that they can use microblogging effectively. They emphasise the need for training early users of microblogging. According to Othman and Siew (2012), organisations should always clearly state what the benefits of their internal system are, as well as continuously educating users on how best to use the tools.

Although the current EMB literature has revealed that organisations perceive microblogging as somewhat risky, and hence see the need for actions to mitigate risks, ways of managing the tools remain under-explored in the literature. Much of the existing literature is based on the early experience of large organisations: how do the issues differ in smaller organisations? The objectives of the study were to identify how risk was perceived in a number of UK SMEs and to identify what actions these organisations take to mitigate such risks.

Design/methodology Empirical data was collected through face-to-face semi-structured interviews with 23 employees of small to medium enterprises (SMEs) in the area of South Yorkshire, UK, during 2013. Semi-structured interviews offer the flexibility to capture opinions (Hove and Anda, 2005) without predetermining the participants' views through an a priori selection of questionnaire categories (Patton, 2001). The researchers investigated the use of microblogging in SMEs in the UK because most studies on EMB have been trials in large organisations. Such studies in large organisations are not necessarily a reliable guide to how other types of organisations, such as SMEs, might use EMB. Also, most research to date about microblogging in businesses has been based on the USA and mainland Europe, and almost nothing has been based on its use by UK organisations.

The participants were selected by directly contacting the organisations by phone or email. They were mostly from organisations in the field of IT, with a few from Consultancy and Sports. The majority of participants were the director or manager of their organisation and they had influenced the decision to adopt microblogging. Guided by an 'interview plan' (Bryman, 2008; Cohen and Crabtree, 2006), the first set of questions asked for background information about the company and the process of microblogging adoption. The second section focused on the benefits and risks of microblogging, i.e. perceptions as well as actual experiences of using EMB. The third section asked questions about how microblogging could be improved, to help minimise issues associated with using the tools. Thematic analysis was selected as the approach to analysing the data, following the '6-phase guide to performing thematic analysis' (Braun and Clarke, 2006). The analysis is not a 'linear process' but rather a 'recursive' one (Braun and Clarke, 2006). Therefore the researchers did not always follow the phases in a specific order; rather, they moved iteratively back and forth through the phases and across the empirical data as required.
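
Although the coding itself is a manual, interpretive process, a simple tabulation of coded extracts can help when moving back and forth between phases; the sketch below, with entirely invented codes, themes and interviewee labels, shows one way such a tally might be kept in Python.

```python
# Hypothetical sketch: a minimal tally of coded interview extracts against
# candidate themes during a thematic analysis. All entries are invented
# placeholders for illustration, not the study's data.
import pandas as pd

coded_extracts = pd.DataFrame([
    {"interviewee": "IT-03", "code": "reputation damage",             "theme": "external risk"},
    {"interviewee": "IT-07", "code": "breach of confidentiality",     "theme": "external risk"},
    {"interviewee": "SP-01", "code": "losing stored information",     "theme": "internal risk"},
    {"interviewee": "IT-03", "code": "review messages before posting", "theme": "mitigation"},
])

# Number of distinct interviewees whose extracts were coded under each theme.
theme_counts = coded_extracts.groupby("theme")["interviewee"].nunique()
print(theme_counts)
```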

Findings The majority of interviewees in the study perceived microblogging to be risky and as a consequence most organisations had strategies in place to mitigate the risks. Some found it highly risky, others less so. A few interviewees indicated that risk did not seem to be an issue. For example, one interviewee commented that they do not really tweet that much about the company and so he did not perceive any risks to the organisation from the tweets that they make. More typical were those who talked at length about the risks of microblogging. The risk of reputation damage is primarily what most interviewees were worried about for their organisation, for example:

“If you are sending out a tweet which was inappropriate, or had some kind of mistake or mislead people in some way, just even accidentally [...] it is a risk because there is no real quality control. Once you publish something, it is done in seconds and it is out there in public. So if you make a mistake, then that could be a problem”.

Internal and external risks As a whole, the risks that interviewees identified could be broadly categorised as internal and external. Three internal risks were identified. One of the IT interviewees referred to a risk of employees talking negatively about each other when using microblogging; they reported that bullying or accusations can reach others through EMB. As more people find out about an accusation or disagreement between employees because the details leak into the public domain through microblogging, the loss of personal privacy and the feeling of discomfort would be higher for the individual. Several participants commented on the risk of losing valuable information when microblogging: if microblogging went offline, organisations would not have access to a lot of useful information that they have stored in the tools.

Several external risks were also identified by interviewees. The commonest type of risk was reputation damage. For example, even something as seemingly trivial as spelling mistakes can influence how knowledgeable, reliable and professional the business is perceived to be.

Breach of confidentiality was another external risk of microblogging mentioned by a number of interviewees. Due to the public nature of Twitter, confidential information about the organisation, such as company code, can leak into the public domain. Misleading information, i.e. sending out incorrect data, and negative media coverage that could be associated with the business were further examples of external risks. Issues related to the security of computer systems, such as hacking and spam, were further risks mentioned.

Given that organisations perceived a range of risks in microblogging, it was natural that many had put in place policies to counteract them. Perhaps surprisingly, some organisations chose not to do so. Generally this was because the business was very small. Another viewpoint was that too many rules could make people feel uncomfortable because they would have to constantly worry about whether they are allowed to post messages. People needed to feel relaxed when using microblogging. There is also the need to trust employees with microblogging.

“My business partner I trust her implicitly as well and if she does something wrong it is still our problem. And we have another person who was fairly accurate with grammar and appropriate in the sorts of things that would be posted.”

In this context, some organisations were happy to allow their employees to use microblogging and trust that they were capable of making appropriate decisions, in terms of what messages should be shared.

However, on the whole organisations perceived microblogging to be risky, so they did take action (see Figure 1). These actions could be broadly divided into rules about what was microblogged; guidelines on who should do it; review procedures; and training of staff; in addition, some relied on complaints procedures as the way to deal with risk.

Figure 1: Visual representation of the actions taken by organisations to mitigate risks

The figure maps the main theme, actions taken to mitigate risks, onto five sub-themes and their examples:

What should be microblogged: use appropriate language/topics (use positive language; use formal language; stay neutral about politics and religion; avoid using racial or sexual comments); not talking about colleagues; not uploading pictures of young people; not interacting with customers; ask permission from customers if talking about them; solve problems by phone.

Who should microblog: have employee contract (contractual obligations; stop privacy breaches); set accounts policy.

Review procedure: review the messages.

Training of staff: show people how to use microblogging; manage people by outcome.

Complaint procedures: complaint process.

What should be microblogged The commonest type of approach was to have rules or guidelines related to what people should microblog. The majority of interviewees were aware of how important it is to share appropriate information about the business and each other, and of the need to have content guidelines. Several interviewees proposed a policy with regard to not talking about colleagues. By clearly stating that no specific names of members of staff are to be mentioned on microblogging, no work is directly linked to individuals and no details about people are shared with the public. As a result, the risk of loss of personal privacy for employees is reduced.

The majority of participants referred to the need to use appropriate language and to select suitable topics when using microblogging. Particularly because of the public nature of Twitter, which means people outside the organisation can view information being shared, the majority of participants found it important for co-workers to be careful about sensitive topics which could influence the reputation of the organisation. Specific examples were provided: for instance, if employees are talking about their organisation, they have to use positive and formal language, they should stay neutral about politics and religion, and they need to avoid making racial or sexual comments.

Another specific rule about what people should tweet is related to not uploading pictures of young people. One of the consultancy interviewees explained that if customers or clients witnessed that privacy of children was not being respected, this could affect the reputation of the organisation and reduce the chances of people ever working with the organisation. A few interviewees also referred to the policy about not interacting with customers.

“People aren’t allowed to interact with customers on any of these sites [...] we have a no interacting with customer’s policy. So staffs aren’t allowed to add customers [...] The Twitter stuff tends to be we’re very strict with all of that because I find that Twitter’s not a business tool.”

Employees should not engage with customers because there is a chance that employees will share their personal thoughts and opinions on Twitter, which could end up being associated with the business. Therefore, the reputation of the organisation and relationships with customers could be negatively affected; if people dislike particular comments, they may feel disrespected and lose trust in the organisation. One IT manager stated that it is also important to ask permission from customers if talking about them; the organisation maintains its reputation in front of its customers by showing that it respects their feelings and privacy.

As well as not engaging with customers, more specifically, problems were not to be discussed through microblogging; a related policy is solving problems by phone. Due to the limited number of characters on Twitter, users are unable to post detailed messages. Therefore, if problems occur, this organisation would contact people directly rather than through EMB. This way people can express themselves better and resolve issues more effectively, and the reputation of the organisation will be maintained as there will be fewer chances of people misinterpreting the short, posted information.

Who should microblog The second type of guideline that organisations used to manage the risk with Twitter was related to who should microblog. The majority of interviewees talked about the need to determine who should engage with microblogging. One IT manager clearly stated that they would like to identify specific individuals to be in charge of microblogging. This would make it easier for the organisation to make sure the tools are being used appropriately, so that the reputation of the business is maintained. For example, interviewees were concerned that employees might share sensitive information and that those outside the organisation, such as competitors, could potentially steal ideas and information for their own benefit. Consequently, the fewer people who microblog, the lower the chances of a breach of confidentiality.

According to one of the IT participants, it would be most appropriate for organisations to make clear to what extent information can be shared through microblogging. The best way to do so would be by presenting people with an employee contract, i.e. an arrangement which clearly explains all the things that should and should not be done. As a result, the possibility of private company data, such as software code, getting into the public domain would be reduced.

“What concerned us was that whilst most of it was supposed to be private, there were slips where it’d come in and ended up in the public domain and I had to get some things removed. So we added bits to the employment contract about social media and blogging and things like that [...] saying you don’t put that on the internet or you don’t put this in the public domain, so you need a contract that up.”

Several interviewees referred to the need to have account policies, i.e. to clearly separate the use of personal and organisational accounts. According to one of the IT managers, there is a need to clearly differentiate between personal and organisational microblogging accounts. Employees using their own accounts need to explicitly indicate that the information they are sharing on Twitter is based on their personal thoughts and is not related to the business; otherwise, the reputation of the organisation can be damaged. One solution is for members of staff not to talk about the organisation, or on behalf of the business, when using their personal accounts.

Review procedures One of the Sports interviewees mentioned that she would have her colleagues review messages before sharing them on Twitter. This is a process put in place to ensure guidelines and rules are followed and employees are posting sensible information on microblogging; employees also learn how to use Twitter appropriately, based on the organisation's expectations. The chances of sharing any misleading information would be reduced, and the reputation of the organisation would be maintained.

“I have just only started using Twitter. Say that I want to put a blog on or re-tweet something I would normally just say to one of the colleagues this is what I want is this alright. And then they would read it through and say yeah I will put in on the website for you, or yeah you can put a link on Twitter or something. So I get it checked first before I put anything on.”

Training of staff A few interviewees explained that, rather than having policies, the best approach is to show people how to use microblogging. A good method to help people learn how to use microblogging appropriately, some interviewees thought, is to show them face to face how to engage with the technology; this way people's confidence will be enhanced, particularly for older people who may be less comfortable with microblogging.

“We have done some work with elective members. They tend to be older and they are very nervous about the use of these things. They are very concerned about their own reputation, making mistakes. The only way you can overcome that with that group of people is to actually sit down set them accounts up and show them how it works [...] In my experience actually showing people some stuff is the best way.”

One participant talked about preferring to manage people by outcome. Instead of having microblogging guidelines, a good strategy would be to manage people according to whether they meet the targets set by the organisation and how well they are achieving the expected goals.

Complaint procedures One practice related to mitigating the risky nature of microblogging was having a complaint process. According to one of the Consultancy interviewees:

“We have got our own we’re a credited by customer first, but we do have our policies, complaint policies and things like that that would kick in if something went wrong.”

This organisation valued their reputation highly but considered that, if any issues arose, for example if people were provided with misleading or incorrect information, those affected would have the chance to express their dissatisfaction with the service or employees through complaint procedures.

Discussion The aim of the research was to discover what risks organisations perceive to be associated with microblogging and what they do to try and mitigate the risks. The majority of interviewees saw microblogging as risky and risks could be divided into internal and external risks. The commonest type of risk was an “external risk”, reputation damage. The majority of organisations actively took specific action to deal with the risk to their reputation. These actions could be broadly divided into guidelines about contents shared on microblogging; specifying who should microblog; review procedures; training of staff; as well as having in place complaints procedures to deal with the consequences of mistakes.

One interesting observation on the findings is a potential mismatch between the concerns people had about microblogging and the mitigating actions they sought to take. It is true that the risk of reputation damage is primarily what the interviewees were worried about for their organisation, and the majority of policies and guidelines were put in place to minimise this risk. Yet interviewees mentioned risks that did not produce any actions. For example, interviewees mentioned information overload or the noise-to-value ratio on Twitter as a problem, a type of issue also recognised in the literature (Günther et al., 2009; Schöndienst et al., 2011). However, interviewees did not provide solutions or much information about how to handle this concern. The interviewees mainly showed that they were aware of these issues, and a few of them suggested that Twitter needs to change, i.e. from a social platform to become more business oriented and focused. Several interviewees stated that microblogging users have to be prepared to spend a lot of time and effort to learn how to use microblogging, to build connections up and to start working with them. One interviewee said that it could take organisations a year to get to the point where microblogging is worthwhile. No specific solutions or actions were identified to suggest how organisations could make better use of microblogging at a faster rate.

Several researchers argue that the main risk of microblogging is breach of confidentiality (Case and King, 2010; Schöndienst et al., 2011; Ehrlich and Shami, 2010). According to Zhao and Rosson (2009), individuals would be more prepared to increase their activity on Twitter in relation to work if they sensed that microblogging was a protected area for sharing private data. In this study the commonest type of risk was reputation damage. There are similarities between the types of policies and guidelines identified in this study and those mentioned in the literature. However, on the whole organisations perceived more specific risks in comparison to the literature, and hence more actions were taken to mitigate risks. This could be because empirical data was collected in 2013, through interviews with participants from SMEs in the area of South Yorkshire, UK. At the time of data collection, more people were familiar with microblogging. Most of the interviewees had been using microblogging for some time, for personal uses and for work purposes. Therefore, they identified more risks and they knew which actions were required to mitigate risks. In contrast, earlier studies on EMB were mostly conducted at the time when microblogging was still in its early adoption stages; typically these pieces of research were the results of trials of the tools within one organisation.

Following from seeing breach of confidentiality as the main risk of microblogging, most actions recommended in the literature to deal with the problem involved controlling who can receive content and for how long (Schöndienst et al., 2011). Ehrlich and Shami (2010) found that employees avoided posting any confidential information about the organisation publicly through Twitter; instead, they preferred to share details through BlueTwit, their internal microblogging tool. In this study more actions were taken to mitigate such risks. The commonest type of approach was to have rules or guidelines related to what people should microblog. For example, several interviewees proposed a policy with regard to not talking about colleagues, and the majority of participants referred to the need to use appropriate language and to select suitable topics when using microblogging. Other specific rules related to not uploading pictures of young people and not interacting with customers. Another policy was solving problems by phone; problems were not to be discussed through microblogging. The majority of interviewees also talked about the need to determine who should microblog. Some interviewees stated that they would like to identify specific individuals to be in charge of microblogging. For example, some interviewees were concerned that employees might share sensitive information and that those outside the organisation, such as competitors, could potentially steal ideas and information for their own benefit. Consequently, the fewer people who microblog, the lower the chances of a breach of confidentiality. According to one of the interviewees, it would be most appropriate for organisations to make clear to what extent information can be shared through microblogging; the best way to do so would be by presenting people with an employee contract. As a result, the possibility of private company data, such as software code, getting into the public domain would be reduced.

The literature does not identify the need to have review procedures and a complaint process in organisations. One of the Sports interviewees would have her colleagues review her messages, before sharing them on Twitter. Another interviewee said that if any issues arise, for example if people were provided with misleading or incorrect information, they have the chance to express their dissatisfaction with the service or employees.

Although there are some differences, there is a similarity between the findings of this study and the literature. A few interviewees explained that, rather than relying on policies, the best approach is to provide training for staff and show them how to use microblogging. Raeth et al. (2009) also suggested that users need to be trained in areas such as software use and business guidelines so that they can use microblogging effectively. According to Othman and Siew (2012), organisations should always clearly state what the benefits of their internal system are, as well as continuously educating users on how best to use the tools.

Conclusions Before this study, most research to date had typically been the result of trials of microblogging tools within one large organisation. Previous studies were not necessarily, therefore, a reliable guide to how other types of organisations such as SMEs might use microblogging, perceive its risks and the types of policies and guidelines they could introduce to address problems they encountered. On the whole, organisations perceived more specific risks in comparison to the literature. The risk of reputation damage is primarily what most interviewees were worried about for their organisation. The actions they took were broadly divided into rules about what was microblogged; guidelines on who should do it; review procedures; and training of staff; in addition, some relied on complaints procedures as the way to deal with risk.

This research was conducted in 2013 when microblogging within organisations was relatively new. By 2015, EMB could be significantly more commonplace. Organisations have learned rapidly about the power and risk of social media as a whole in the last few years. As a result, specific perceptions about the risks of microblogging may have changed, as well as practices of use. More work is needed to explore how the sense of risk is changing. This study was also based on a rather narrow range of organisations in one region of the UK. The results need to be validated by further studies – probably its use has spread beyond early adopters like the IT companies that made up a large part of the sample in this research. Nevertheless, this study can be a useful guide for other researchers to explore how perceptions about risks and policies can change over time. The framework that has been developed in this study can be a useful guide to further investigation of perceptions about risks and responses to risk.

References
Álvaro, G., Córdoba, C., Penela, V., Castagnone, M., Carbone, F., Gómez-Pérez, J. M. & Contreras, J. (2010). "MIKROW, An Intra-Enterprise Semantic Microblogging Tool as a Micro-Knowledge Management Solution". 7th International Conference on Knowledge Management and Information Sharing, Valencia, 2010, pp. 1-8.
Braun, V. & Clarke, V. (2006). "Using thematic analysis in psychology". Qualitative Research in Psychology, 3 (2), 77-101.
Bryman, A. (2008). Social research methods. New York: Oxford University Press.
Case, C. J. & King, D. L. (2010). "Cutting Edge Communication: Microblogging at the Fortune 200, Twitter Implementation and Usage". Issues in Information Systems, 6 (1), 216-223.
Cohen, D. & Crabtree, B. (2006). Qualitative Research Guidelines Project. Robert Wood Johnson Foundation.
Ehrlich, K. & Shami, N. S. (2010). "Microblogging Inside and Outside the Workplace". Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pp. 42-49.
Günther, O., Krasnova, H., Riehle, D. & Schöndienst, V. (2009). "Modeling Microblogging Adoption in the Enterprise". Proceedings of the Fifteenth Americas Conference on Information Systems, San Francisco, California, pp. 1-10.
Hove, S. E. & Anda, B. (2005). "Experiences from Conducting Semi-Structured Interviews in Empirical Software Engineering Research". Proceedings of the 11th IEEE International Software Metrics Symposium, IEEE, Como, Italy, pp. 1-10.
Lee, C.Y. & Warren, M. (2010). "Micro-Blogging in the Workplace". Proceedings of the 8th Australian Information Security Management Conference, Perth, Australia, pp. 41-48.
Mayfield, R. (2009). "Enterprise Microblogging Whitepaper". Socialtext. September.
Othman, S. Z. & Siew, K. B. (2012). "Sharing knowledge through organization's blogs: The role of organization, individual and technology". 3rd International Conference on Business and Economic Research Proceeding, pp. 562-581.
Patton, M.Q. (2001). Qualitative research and evaluation methods. Thousand Oaks, CA: Sage.
Raeth, P., Smolnik, S., Urbach, N. & Butler, B. S. (2009). "Corporate Adoption of Web 2.0: Challenges, Success, and Impact". Proceedings of the 15th AMCIS, San Francisco, California, pp. 1-10.
Riemer, K. & Richter, A. (2010). "Tweet Inside: Microblogging in a Corporate Context". 23rd Bled eConference eTrust: Implications for the Individual, Bled, Slovenia, pp. 1-17.
Riemer, K., Diederich, S., Richter, A. & Scifleet, P. (2011). "Tweet Talking - Exploring The Nature Of Microblogging At Capgemini Yammer". Business and Information Systems, Sydney, Australia, pp. 1-15.
Schöndienst, V., Krasnova, H., Günther, O. & Riehle, D. (2011). "Micro-Blogging Adoption in the Enterprise: An Empirical Analysis". Proceedings of the 10th International Conference on Wirtschaftsinformatik, Zurich, Switzerland, pp. 1-10.
Zhao, D. & Rosson, M. B. (2009). "How and why people Twitter: The role that microblogging plays in informal communication at work". In Proceedings of the ACM 2009 International Conference on Supporting Group Work, ACM, pp. 243-252.


Re-purposing manufacturing data: a survey Philip Woodall* and Anthony Wainman University of Cambridge *corresponding author: [email protected]

Abstract With the advent of areas such as data analytics to improve business intelligence, manufacturing organisations must now consider other uses for their data that are distinct from its original and primary purpose. Once the primary purposes of data in manufacturing-related enterprise information systems have been fulfilled, consigning it to being useless and relegating it to the back shelf is no longer an option for forward-thinking manufacturing organisations. The focus of this research is not reuse, which is using the data again for the same task/decision, but rather the "re-purposing" of data for completely different tasks/decisions from those originally intended. We conducted a survey of various manufacturing companies to determine whether and, if so, how they currently re-purpose data. Our findings indicate that automated solutions to re-purposing data could save manufacturers significant amounts of time and therefore money, and, based on our survey findings, we suggest which problems need to be addressed in order to achieve this.

Introduction Manufacturing organisations currently use data for a variety of purposes, and in many companies operational data for routine tasks is stored in various systems including, for example, Enterprise Resource Planning (ERP) systems and Manufacturing Execution Systems (MES). These may include data about suppliers, deliveries, customers, sales, current production line conditions, etc. Operational uses of this data range from the selection of suppliers for particular materials by the procurement department to the sending of marketing materials by the marketing department.

With the advent of Big Data, and the effort needed to manage large amounts of organisational data, companies are keen to ensure that maximal value is extracted from their data. The operational use of data takes priority and the data must be fit for this purpose. However, many organisations are starting to invest in data analytics programmes in which the operational data is also used by data scientists to reveal business insights that can give a company a competitive advantage (Davenport and Harris, 2007). Unfortunately, a major problem is that many data scientists spend inordinate amounts of time getting data into a form where it can be used for their analysis (Kandel et al., 2012). The data must be transformed in some way before it can be input into data analytics tools and algorithms such as data mining applications. Typical transforms include the following (a small illustrative sketch follows the list):

• conversion of units such as times and dates (e.g. from two week intervals to 1 month intervals) or units of measure (e.g. centimetres to metres)

• coding of values and abbreviations such as United Kingdom to UK

• conversion of aggregated quantities (e.g. 1 box of items may be equivalent to 500 parts)

• handling missing values (e.g. entering a default value, or inferring the missing value based on other data)
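To make these transforms concrete, here is a minimal sketch in Python; the record fields, unit factors and default values are hypothetical and are not taken from any of the surveyed systems.

```python
# Hypothetical raw record as it might be extracted from an operational system.
raw = {"interval_days": 14, "length_cm": 250, "country": "United Kingdom",
       "boxes": 3, "supplier_rating": None}

ABBREVIATIONS = {"United Kingdom": "UK", "United States": "US"}  # coding of values
PARTS_PER_BOX = 500          # assumed conversion factor for aggregated quantities
DEFAULT_RATING = 3           # assumed default used to fill missing values


def transform(record):
    """Apply the four typical transforms listed above to a single record."""
    return {
        # conversion of units: a two-week interval expressed as (approximate) months
        "interval_months": record["interval_days"] / 30.0,
        # conversion of units of measure: centimetres to metres
        "length_m": record["length_cm"] / 100.0,
        # coding of values and abbreviations
        "country": ABBREVIATIONS.get(record["country"], record["country"]),
        # conversion of aggregated quantities: boxes to individual parts
        "parts": record["boxes"] * PARTS_PER_BOX,
        # handling a missing value by entering a default
        "supplier_rating": DEFAULT_RATING if record["supplier_rating"] is None
        else record["supplier_rating"],
    }


print(transform(raw))
```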


In many cases, organisations create and use satellite systems, such as spreadsheets, into which ERP data must be extracted and then manipulated manually in order to carry out the various transforms needed to get the data into the necessary form for analysis (Baskarada, 2012). A problem this causes is that the extracted data often then becomes the "real source of truth" and ends up being of better quality than the data in the ERP system: the spreadsheet becomes the holder of the data that is managed and looked after.

People in other departments, who need the data, end up using the old and dirty data in the ERP system and have no visibility of the transformed data in the satellite systems. In this scenario, there is the potential for other departments to leverage these transforms for other purposes and therefore recoup some of the costs of transforming this data.

Not all manufacturing organisations have a single ERP system, and we therefore broaden our scope to investigate manufacturing-related enterprise information systems and how data could be re-purposed by, and between, the people who use them. Our research question is hence:

“How can manufacturers automatically re-purpose data in manufacturing information systems, which has already been used for a business operation, for other decisions in different parts of the business?”

Note that re-purposing in this case refers to the use of data for a completely different task/decision than the original purpose. This is different from reuse, which would be using the data again for the same or very similar task.

Background

Manufacturing information systems Manufacturing Information Systems are information systems pertaining to the major business operations involved in manufacturing: procurement, scheduling, finance, asset and inventory management, and production systems management. Three different levels of information system can be present within a manufacturing company: Enterprise, Management and Machine. Typical elements of each are, respectively, Enterprise Resource Planning (ERP) systems and their parallel enterprise applications (customer relationship management, human resource management and product lifecycle management); Manufacturing Execution Systems (MES); and Process Control Systems (PCS), as shown in Figure 1.

Figure 1: Manufacturing Information Systems Hierarchy

Enterprise {ERP, SCM, CRM, HRM, PLM}

Management {MES}

Machine {PCS}
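Purely as an illustration of this three-level view (the dictionary below is our own shorthand for the hierarchy in Figure 1, not part of any standard), the levels and their typical systems can be captured in a few lines:

```python
# Illustrative sketch of the three-level manufacturing information systems hierarchy.
MIS_HIERARCHY = {
    "Enterprise": ["ERP", "SCM", "CRM", "HRM", "PLM"],  # office-level systems
    "Management": ["MES"],                              # manufacturing execution
    "Machine": ["PCS"],                                 # process/shop-floor control
}


def level_of(system: str) -> str:
    """Return the hierarchy level that a given system type belongs to."""
    for level, systems in MIS_HIERARCHY.items():
        if system in systems:
            return level
    raise ValueError(f"Unknown system type: {system}")


assert level_of("MES") == "Management"
```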


Fully integrated manufacturing information systems incorporate all three levels (Qiu and Zhou, 2004) with the front end including the office level (Enterprise) and back end including supportive logistics, MESs and shop floor controls (Management and Machine).

As Qiu and Zhou (2004) have shown, a well-integrated system for manufacturers can only be achieved by deploying the information systems as a whole across all three levels. These information systems have become an integral part of manufacturing companies in the last two decades (Melville et al., 2004), and the integration as well as the adoption of such systems often correlate with a plant's performance (Banker et al., 2006).

With the constant aim of keeping costs down, and with storage solutions widely available both online and offline, manufacturers have to "manage the explosion of data that will only continue to grow" (GE Intelligent Platforms, n.d.). Not only can manufacturers adopt the methods and infrastructure for big data analytics developed by Web giants such as Google or Amazon, but there is also strong potential for them to find new purposes for the data they have accumulated. As Zhu and Madnick (2009) state, it is possible for organisations either to sell "private" data or to find new purposes for datasets internally.

A number of areas (research, technology and the Web) are currently developing methods for data re-purposing to leverage the value of datasets multiple times. This, factored with making data easily accessible and "re-purposable" within an organisation or to the public, makes up the basis of how manufacturers can find new ways to use data within their company. Manufacturers that "capitalize on the value of big data will gain insights to improve performance beyond their competitors" (GE Intelligent Platforms, n.d.).

Data re-purposing in research Data re-purposing in the world of research stems from the early days of data curation. The latter involves the "activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and reuse" (Lord et al., 2004). Data curation for potential reuse is pivotal to the future of research (Lord et al., 2004). At a national level, the Digital Curation Centre (DCC) in the UK (http://www.dcc.ac.uk/) looks into increasing the span of knowledge and reducing the duplication of effort through data curation.

Data re-purposing in the Web With the advent of communication through the Web and the ease of information access it has created, web services have emerged as a technology for enabling automated communication between distributed and heterogeneous software applications. These services have created a new market incorporating Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). The separation of responsibilities differs between these, with the aim of giving the user software for end use (SaaS), an application development framework (PaaS), or information technology infrastructure and network architecture (IaaS). With a view to re-purposing data from each of these, regardless of their type, lightweight web Application Programming Interfaces (APIs) have started to surface. These use web-based (HTTP) requests to obtain information in easily consumable formats (JSON, XML, CSV).
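As a hedged illustration of this style of lightweight, HTTP-based access (the endpoint URL and field names below are invented for the example and do not belong to a real service), a client can request a resource and receive JSON that is immediately usable elsewhere:

```python
import json
from urllib.request import urlopen

# Hypothetical lightweight API endpoint exposing production-order data as JSON.
API_URL = "https://api.example-manufacturer.test/v1/production-orders?status=open"


def fetch_open_orders(url: str = API_URL) -> list:
    """Issue an HTTP GET request and decode the JSON payload."""
    with urlopen(url) as response:      # plain web-based (HTTP) request
        payload = json.load(response)   # easily consumable format (JSON)
    return payload.get("orders", [])


if __name__ == "__main__":
    # The decoded records can then be re-purposed directly, for example fed into
    # a scheduling tool or mashed up with data from another service.
    for order in fetch_open_orders():
        print(order["id"], order.get("due_date"))
```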


Figure 2: API Usage Data Flow

This enabling technology allows users to quickly get hold of data, often dynamic, and incorporate it into a work task or other service. Furthermore, as integration is key, the stewardship of the data is often done carefully so that it can be used quickly and in a scalable fashion. This has led to the automation of service integrations, named "services mashups" (Benslimane et al., 2008), focused solely on enabling new ways to re-purpose data.

Barriers to re-purposing data Information systems projects frequently fail. As Dorsey (2005) notes, "up to 80%" of large information systems projects fail. As such, manufacturers seldom have their manufacturing systems well implemented and integrated throughout. From the outset, workarounds are performed by the users of the systems to address their practical data problems, which can stem from misalignments between the systems and their implementation.

“A misalignment [is] defined as any instance where project team members [identify] an organizational requirement that they [feel is] not being addressed by the [MIS] package[s]” (Soh et al., 2003).

Figure 3: Structures embedded in MISs (Soh et al., 2003)

[Figure elements: domain-specific rules, regulations and assumptions; the implementing organization's norms, values and practices; integration, process orientation and flexibility; misalignments.]

These misalignments, and the workarounds they cause, can influence the ease with which data can be re-purposed. For instance, as noted previously, it is common for organisations to extract data from their ERP system into spreadsheets, where the data is ultimately managed and controlled. Any person in the organisation, such as a data scientist in the data analytics business unit, wanting to access ERP data managed by another business unit could find that the data is out-of-date or has other problems such as inaccuracies, because they are not accessing the data that is really used and managed. They may not know about the spreadsheets, which contain the useful and accurate data, and hence find that they need to perform many transformations to get the data from the ERP system into a form suitable for their analysis. A key barrier to re-purposing is therefore the need to transform the data into a useful and relevant form for the secondary purpose.

Research methodology In order to answer our research question (“How can manufacturers automatically re-purpose data in manufacturing information systems, which has already been used for a business operation, for other decisions in different parts of the business?”) we have divided this into the following three sub-questions:

SQ1) How is data currently repurposed in manufacturing organisations?

SQ2) How does the data need to be transformed to be fit for purpose for the new target audience?

SQ3) How do decision makers want this data to be presented?

To answer these questions the second author (AW) conducted interviews with employees from various manufacturing organisations, and the questions in the interview instrument were derived directly from the three sub-questions. The names of interviewees and their respective organisations have been kept confidential. The interviews were deliberately designed in this way so that the respondents would be more likely to speak freely about any problems found in their organisation.

Survey case selection The organisations, and the respondents from them, were selected according to the selection criteria shown in Table 1. A company size of at least 30 people was chosen in order to have medium or larger sized companies. We chose larger organisations as a starting point because, with a smaller organisation, re-purposing data is likely to be less of an issue; for example, an organisation with only 5 people is likely to be agile enough to share and re-purpose data easily because the communication between people is less complex. The time for which the companies were required to have had their information systems in place was set at a year or more, to give them enough experience to report how the systems are used. The interviewees were required to be a Manager, an Engineer (IS, Manufacturing, Process, etc.) or a regular MIS user and, clearly, to have some knowledge of the information systems.


Table 1: Case selection criteria

Criteria – Requirement
Company size – 30 people or more
MIS has to be in place – Yes
Time of MIS implementation – 12 months or longer
Industry – Manufacturing + related (e.g. logistics)
Interviewee – Manager, Engineer or regular MIS user

The following three data sources were used to find the appropriate target organisations and interviewees for the survey:

1. Contacts from both the Distributed Information and Automation Laboratory (DIAL) and Education and Consultancy Services (ECS) group at the Institute for Manufacturing (http://www.ifm.eng.cam.ac.uk/)

2. Contacts known to the author

3. Contacts through the use of the LinkedIn platform

A set of interview questions, including the purpose of the research, as well as a research plan explaining the aims, outcomes and confidentiality of the research, was sent to potential interviewees via email. For those that responded, the case selection criteria were double-checked and, if these were satisfied, the interview was performed either via teleconference or in a face-to-face meeting.

Before this, a pilot interview was first undertaken with an academic with experience of Manufacturing Information Systems. The pilot was used to check for biased and potentially invalid questions, and for any general problems with the interview instrument. Following that, minor changes were made to some of the questions, and then the actual interviews were performed. The interviews were recorded, and notes were taken during them by AW.

Results

Organisations A total of seven interviews across six different cases (case F consisted of two interviews with different personnel from the same company) were conducted, and the details of these are summarised in Table 2. In case A, the respondent was an academic who had worked for various manufacturing organisations in different roles and answered the interview questions by selecting the experiences that were most relevant and useful to each question. The interviews lasted approximately one hour.


Table 2: Case interviewees and organisations

Case – Interviewee – Organisation
Case A – Pilot: Academic – Various
Case B – Business Systems Manager – Print head manufacturer
Case C – Engineering Project Manager – Paint manufacturer
Case D – Director – High precision parts manufacturer
Case E – Senior Supply Chain Analyst – Coatings manufacturer
Case F – Technical Lead Systems Engineer – Aircraft Manufacturer
Case F – Engineer (IT) – Aircraft Manufacturer

Interview results In order to understand the context of each company, the respondents were asked what enterprise systems they use and which departments use the data from these systems. The results are summarised in Table 3 and Figure 4. All organisations have an ERP system and all except case E have a Manufacturing Execution System. Enterprise visualisation tools for data analytics (business insight) are relatively common compared to visualisation tools used for operational factory decisions. These systems are large in scale (being "enterprise" systems) and the number of systems in use indicates that a single system (such as an ERP system) is often not sufficient for organisational needs; only case E relies on a single ERP system with a visualisation tool. As expected, the use of the data within these systems transcends multiple departments, and therefore data reuse and re-purposing could be a potential benefit to these organisations.

Table 3: Enterprise systems and departments

[Matrix of Cases A–F against the enterprise information systems in use (Enterprise Resource Planning (ERP) system; Manufacturing Execution System; Inventory Management Software; Enterprise Visualisation Tool (factory); Enterprise Visualisation Tool (business insight); Product Lifecycle Management Tool; Document Management Tool) and the departments that use them (Production; Planning; Procurement; Finance; Asset and Inventory Management).]


The respondents indicated that they all use transactional data from these systems, which includes customer orders, payment records, observations, sensor data, and other event data (as opposed to master data or data about the organisation's structure, such as information about plants and warehouses).
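To make the distinction concrete, the sketch below contrasts a master-data record with a transactional (event) record; the field names are illustrative only and do not come from the surveyed systems.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Supplier:
    """Master data: slowly changing, describes the organisation's world."""
    supplier_id: str
    name: str
    country: str


@dataclass
class CustomerOrder:
    """Transactional data: one record per business event."""
    order_id: str
    supplier_id: str   # refers back to the master record
    quantity: int
    placed_at: datetime


acme = Supplier("S-001", "Acme Components", "UK")
order = CustomerOrder("O-1042", acme.supplier_id, 250, datetime(2013, 11, 5, 9, 30))
```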

Figure 4: Frequencies of enterprise systems

When asked whether they knew of anyone within their company who has data that they could also use to support their decision making, all cases indicated that they know of other people/departments who have such data. Furthermore, when asked if this data is shared with them, all respondents indicated that it is shared. The types of data shared, how the data is shared (the different methods), and how often the data is shared are summarised in Table 4. Cells that contain a "-" indicate that the response is not available; this could be because the respondent did not answer the question, because insufficient notes were taken and the audio recording is not clear, or for another similar reason.

The respondents required various types of data, including master data and models of the organisational structure, and all required transactional data. Interestingly, the most common method of data sharing is via email (see Figure 5). This indicates that data sharing is not fully automated in organisations and it is likely that the person sharing the data needs to spend time responding to a data request: gathering the data, assembling it into a form that can be shared, and then emailing it to the receiver. Given that data needs to be transformed in various ways, as described later in Table 5 and Figure 6, additional effort is also required to complete this task. The respondents indicated that they need to perform this transformation when they receive data, and hence there is a likelihood that, because they are not the creator/owner of the data, they could make false assumptions and incorrectly transform it – although this is an assertion, because the question was not asked directly. Another common method of data sharing is via the company intranet, although these types of solutions often require the user to search for the relevant data among the vast amount of data in the organisation. Data warehouses and ERP content management systems are also used while, interestingly, cloud services have yet to be adopted. Finally, most respondents required data continuously rather than over longer periods of time.

Any automated solution to enable data re-purposing could reduce the manual workload of sharing data (for the person who sends the email, or the person who needs to search the company intranet). However, the solution would need to operate in real time to satisfy the needs of the decision makers.
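One possible shape for such an automated solution, sketched here with a hypothetical Flask service and invented endpoint and field names purely to illustrate the idea of replacing ad hoc email requests with an always-on interface, is a small internal API that serves an already-transformed extract on demand:

```python
# A minimal sketch, not a production design: an internal service that serves a
# transformed extract of operational data, so colleagues no longer have to email
# the data owner and wait for a manual export.
from flask import Flask, jsonify

app = Flask(__name__)


def load_transformed_extract():
    """Placeholder for reading the latest extract and applying the agreed
    transforms (unit conversions, value coding, etc.) before sharing."""
    return [
        {"order_id": "O-1042", "country": "GBR", "parts": 1500, "length_m": 2.5},
    ]


@app.route("/shared/orders")
def shared_orders():
    # Served on request, so the receiver gets a more-or-less immediate response
    # instead of waiting for the data owner to respond to an email.
    return jsonify(load_transformed_extract())


if __name__ == "__main__":
    app.run(port=5000)
```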

Table 4: Shared data properties, including types of data, methods of sharing and sharing timescales

[Matrix of Cases A–F against the types of shared data (model(s) of the organisation structure; master data; transactional data; other), the methods for obtaining copies of the shared data (ERP system; sharepoint; intranet/server; data warehouse; cloud services; email; USB stick (physical movement); other) and the timescale within which a copy is needed (continuously/real-time; minutes; hours; days).]


Figure 5: Common data sharing methods

The data, once shared, needs to undergo various transformations before it can be used by the receiving party; the various ways in which it needs to be transformed are shown in Table 5. Only case E reported that they do not need to transform their data.

Table 5: Necessary data transforms for shared data

[Matrix of Cases A–F against the transforms that need to be applied to shared data (conversion of units; coding of values and abbreviations; conversion of aggregated quantities; correction of inconsistent terms for named entities; other). No response is available for Case D, and Case E reported that no transforms are needed.]

The most common type of transform that needs to be applied to shared data is the coding of values and abbreviations (see Figure 6). An example of this type of transform is changing a country name, e.g. United Kingdom, into a code or abbreviation such as "GBR" or "826", according to ISO 3166-1. If the person receiving the data already has a dataset with countries referred to by the numeric value, then all values from the shared dataset need to be converted into their numerical form. Otherwise, if these types of transformations are not made, inconsistencies can occur; for example, a report showing revenue for different countries will likely show incorrect results if some countries are missed because of inconsistent naming – see gaps 5 and 6 in Woodall et al. (2014).
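A minimal sketch of this particular transform is shown below; the lookup table contains only a few entries by way of example, whereas a real implementation would use a complete ISO 3166-1 reference list.

```python
# Map free-text country names onto ISO 3166-1 numeric codes so that records from
# a shared dataset line up with the receiver's existing, code-based dataset.
ISO_3166_NUMERIC = {
    "united kingdom": "826",
    "great britain": "826",   # inconsistent naming for the same country
    "france": "250",
}


def to_numeric_code(country_name: str) -> str:
    """Return the ISO 3166-1 numeric code, or raise if the name is unrecognised."""
    key = country_name.strip().lower()
    if key not in ISO_3166_NUMERIC:
        raise KeyError(f"No ISO 3166-1 mapping for {country_name!r}")
    return ISO_3166_NUMERIC[key]


shared_rows = [{"country": "United Kingdom", "revenue": 120000},
               {"country": "Great Britain", "revenue": 45000}]
for row in shared_rows:
    row["country"] = to_numeric_code(row["country"])
# Both rows now carry "826", so a revenue-by-country report no longer splits the
# same country across two inconsistently named entries.
```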

Other transforms that need to be made include conversion of units, examples being converting weeks to months and metric values to imperial. The latter was not done in the case of NASA's Mars Climate Orbiter, which was lost on 23rd September 1999 because Lockheed Martin's engineering team used English units of measurement while the agency's team used the more conventional metric system for a key spacecraft operation (Lloyd, 1999). Conversion of aggregated quantities, such as a box of parts into individual parts, and correction of inconsistently named terms (such as when engineering part names differ for the same part type) are the other types of transformation that need to be performed.

Figure 6: The frequency of types of data transformation applied to shared data

Finally, we asked how the respondents would like shared data that is intended to be re-purposed to be presented. They indicated that graphical dashboards are the preferred visualisation (see Table 6); in some cases these need to be interactive, depending on the needs of the users.



Table 6: Preferred ways of visualising data

[Matrix of Cases A–F against the preferred ways of having data presented: tabular; graphical dashboard; interactive dashboard; static report.]

Summary To answer our original research question ("How can manufacturers automatically re-purpose data in manufacturing information systems, which has already been used for a business operation, for other decisions in different parts of the business?"), we surveyed decision makers who work with industrial data in various large manufacturing organisations. Based on the results of this survey, we now discuss the feasibility and benefits of automating the re-purposing of data in manufacturing organisations.

Currently, the main method of sharing data for re-purposing is via email, which disturbs the data sender by requiring them to collect, assemble and send the data – perhaps unnecessarily. There is, therefore, potential to save effort by reducing this workload on the data sender. If techniques can be developed which automatically determine what data is needed and by whom, then the data sender could be removed from the process entirely. However, the majority of the people we surveyed require data sharing to be in real time (i.e. to obtain a more-or-less immediate response), and so any automated solution needs to operate as fast as possible.

Another complication to enabling automated data re-purposing is that the data needs to be transformed in various different ways before it can be used/repurposed. This is a very time-consuming activity (Kandel et al., 2012), and automated methods which can share the data will need to be able to support the transformation of data, which could potentially save a significant amount of time for organisations.

Finally, graphical dashboards are the preferred option for the presentation of data, according to the respondents of our survey. Any automated solution to enable data re-purposing should therefore consider being able to interface with existing dashboard technology, and could perhaps use this as a platform to interact with the user to determine how and what transformations need to be applied to the data before it can be used.


References
Banker, R. D. et al. (2006). Plant information systems, manufacturing capabilities, and plant performance. MIS Quarterly, 30, pp. 315–337.
Baskarada, S. (2012). How Spreadsheet Applications Affect Information Quality. Journal of Computer Information Systems, 51 (3), pp. 77–84.
Benslimane, D., Dustdar, S. and Sheth, A. (2008). Services mashups: The new generation of web applications. IEEE Internet Computing, 12, pp. 13–15.
Davenport, T. H. and Harris, J. G. (2007). Competing on Analytics: The New Science of Winning. Harvard Business School Press.
Dorsey, P. (2005). Top 10 reasons why systems projects fail. www.dulican.com [Accessed September 5, 2014].
GE Intelligent Platforms (n.d.). The Rise of Industrial Big Data: Leveraging large time-series data sets to drive innovation, competitiveness and growth – capitalizing on the big data opportunity. Available at: http://www.ge-ip.com/library/detail/13170 [Accessed August 10, 2014].
Kandel, S. et al. (2012). Enterprise data analysis and visualization: An interview study. IEEE Transactions on Visualization and Computer Graphics, 18 (12), pp. 2917–2926.
Lloyd, R. (1999). Metric mishap caused loss of NASA orbiter. CNN news report.
Lord, P., Macdonald, A., Lyon, L. and Giaretta, D. (2004). From Data Deluge to Data Curation. In Proceedings of the 3rd UK e-Science All Hands Meeting, pp. 371–375.
Melville, N., Kraemer, K. and Gurbaxani, V. (2004). Information Technology and Organisational Performance: An Integrative Model of IT Business Value. MIS Quarterly, 28, pp. 283–322.
Qiu, R. G. and Zhou, M. Z. M. (2004). Mighty MESs; state-of-the-art and future manufacturing execution systems. IEEE Robotics & Automation Magazine, 11.
Soh, C. et al. (2003). Misalignments in ERP Implementation: A Dialectic Perspective. International Journal of Human-Computer Interaction, 16, pp. 81–100.
Woodall, P., Oberhofer, M. and Borek, A. (2014). A Classification of Data Quality Assessment and Improvement Methods. International Journal of Information Quality, in press.
Zhu, H. and Madnick, S. E. (2009). Finding New Uses for Information. MIT Sloan Management Review, 50, pp. 17–21.


Exploring vendor’s capabilities for cloud services delivery: A case study of a large Chinese service provider Gongtao Zhang, Neil Doherty and Mayasandra-Nagaraja Ravishankar School of Business and Economics, Loughborough University

1. Introduction Cloud computing is a service-based model for the provision of computing resources (Marston et al., 2011; Venters and Whitley, 2012; Willcocks et al., 2014). Recently, the phenomenon has been the subject of much interest in practitioner and IS research. For example, in 2010, Amazon's annual revenue from cloud services was estimated at between $500m and $700m (Economist, 2010), and Forrester predicts that the global market for cloud computing will grow to $240bn by 2020 (Dignan, 2011). Cloud computing has been viewed as an effective solution which improves operational efficiency, simplifies management processes and motivates innovation (Boss et al., 2007; Hayes, 2008; Willcocks et al., 2013). The cloud computing literature suggests that the technology can be exploited for better value creation in different industries (Marston et al., 2011; Willcocks et al., 2013). Yet, in complex and rapidly changing environments, creating value with cloud computing remains a challenging problem.

There are two important gaps in extant research on cloud-computing-led value creation. First, the process by which cloud computing can be deployed to achieve benefits has not been examined in great detail. In other words, most of the current prescriptions for acquiring benefits from the use of cloud services have not been empirically validated (Garrison et al., 2012), and thus appear overly abstract. Secondly, the vendor's perspective on cloud computing has hardly been explored. Understanding the vendor perspective in IT service engagements is crucial, since vendors' capabilities determine the quality of service delivery (Levina and Ross, 2003). Hence, without grasping the nature of the service delivery process and vendors' capabilities, it is difficult to deliver cloud services for value creation in the long term. In this study, we address these gaps by examining how vendor capabilities shape the delivery of cloud services, through a thorough case investigation of Alibaba Cloud Computing (Aliyun) in China.


2. Theoretical background

Cloud computing The term "cloud computing" was coined in 2007. Cloud computing is service-based, highly scalable and applies virtualised resources. Initially, there was little consensus on what exactly cloud computing is, owing to researchers' different backgrounds and expertise. In 2011, Mell and Grance (2011) developed a standard description of cloud computing covering its characteristics, deployment models and service delivery models (see Table 1). Cloud computing is defined as "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction".

Table 1: Standard of Cloud Computing (Mell and Grance, 2011)
Characteristics – on-demand self-service; broad network access; resource pooling; rapid elasticity; measured service.
Deployment – private clouds; public clouds; community clouds; hybrid clouds.
Service delivery model – Software as a Service (SaaS); Platform as a Service (PaaS); Infrastructure as a Service (IaaS).

Resource-based view and dynamic capabilities The resource-based view (RBV) provides a theoretical lens for understanding how a firm's resources and capabilities create competitive advantage. In the extant literature, resources and capabilities have been clearly defined, although the two terms are sometimes used interchangeably. For example, resources are defined as physical IT assets such as hardware, software, applications and network infrastructure (Makadok, 2001), while capabilities are intangible assets such as relationships, leadership or culture that enable a firm to transform inputs into outputs of greater worth (Amit and Schoemaker, 1993). By effectively exploiting resources with VRIN characteristics (valuable, rare, inimitable, non-substitutable), a company can achieve competitive advantage (Barney, 1991).

However, RBV is not sufficient to explain how competitive advantage can be sustained in a rapidly changing context, particularly in a highly competitive industry (Wade and Hulland, 2004). Hence, we also adopt dynamic capability theory to extend our theoretical background. In the existing literature, the dynamic capability concept has been explained as the processes, routines or patterns used to constantly reconfigure, integrate, acquire or eliminate resources (Teece and Pisano, 1997; Eisenhardt and Martin, 2000; Chen et al., 2008; Ambrosini et al., 2009). In addition, some interesting characteristics have been identified. For example, dynamic capability might involve a learning pattern through which the company can modify its operating routines (Zollo and Winter, 2002). The company should be behaviour-oriented, and thus upgrade and rebuild its core capabilities in response to the changing environment (Wang and Ahmed, 2007).

3. Research method The case study approach is particularly appropriate for this study for a number of reasons. First, the case study method is usually used "when 'how' or 'why' questions are being posed, when the investigator has little control over events, and when the focus is on a contemporary phenomenon within some real-life context" (Yin, 2003). The investigation of how vendor capabilities influence the delivery of cloud services satisfies all of these criteria. Second, the case study is a well-established method in IS research, especially for "sticky, practice-based problems" (Benbasat et al., 1987). Hence, the case study method is appropriate for examining the phenomenon by interpreting the shared understanding of the relevant stakeholders (Klein and Myers, 1999).

Based on our research question, several principles guided the case selection. First, the vendor must currently provide cloud services (SaaS, IaaS or PaaS) to firm-level clients. Second, the vendor company must have individuals at multiple levels who can describe management practices and how they deliver cloud services. Third, the vendor company must be willing to share its experience and communicate with external individuals. The case of Aliyun is particularly appropriate for this study due to its success and remarkable achievements in China. The use of a single case study is also advantageous in that many contextual variables are kept constant, which helps to rule out possible alternative interpretations of the data.

There were two stages of data collection, namely pilot interviews and the case investigation. The first stage aimed to formulate proper interview guidelines, test the feasibility of data collection, and gather general market information. A further step was then taken to bridge, improve and shape our ideas in a new context: China. The second stage was the detailed case study. Research access was negotiated and granted in May 2013, and a total of 23 interviews were conducted by June 2013. Seventeen interviews were conducted with middle and top managers in Aliyun. Secondary data was also collected to supplement the interview data, such as internal publications, memos, newspaper articles, industry reports and information from the company website.

Data analysis was carried out by iteratively moving back and forth between the empirical data, the relevant literature, the theoretical lens, and the emerging process model (Eisenhardt, 1989; Walsham, 2006). A detailed narrative was developed to summarise our interpretation of the voluminous amount of data in a more manageable form. Next, the relevant findings were organised into themes using the theoretical lens in order to build the emerging model (Walsham, 1995). The process continued until no additional data could be collected, analysed and added to develop the process model.

4. Case description Headquartered in the capital of Zhejiang Province in Eastern China, Aliyun was established in September 2009, on the occasion of Alibaba's 10th anniversary. Funded by Alibaba Group, Aliyun aimed to offer public cloud services, and focused on delivering these services to SMB clients. In the past five years, the company has successfully developed a unique cloud platform, called Feitian. The platform was deployed to support a variety of Aliyun's cloud services and applications. Today, Aliyun is the largest public cloud vendor in China. It offers public, data-centric, cloud-based services for SMBs, and provides industry-based solutions for applying cloud applications. In addition, Aliyun is also required to deliver cloud services for other divisions within Alibaba Group, including its B2C and C2C portals. "We had to develop our own interpretation of the cloud computing concept," says Dr Jiang Wang, former CEO of Aliyun. "We have a lot of experience and expertise in dealing with SMB clients and delivering online services for them. Cloud computing was a new concept for us in 2009. We developed our understanding based on how we could exploit it, particularly based on our background."

Figure 1: Alibaba’s business units

[Timeline of Alibaba's business units: founded in 1999, the world's largest B2B online market; founded in 2003, the largest B2C and C2C portals in Asia; founded in 2004, the world's largest online payment and escrow service; purchased in 2005, one of the largest Internet portals in China; founded in 2009, the largest cloud service vendor in China.]


Unlike other Alibaba’s well-known divisions such as Taobao or Alipay (see Figure 1), Aliyun was not noticed by the public until the end of 2011. Alibaba initially intended to continue its delivery of online software services by applying the new cloud technology, but soon decided to restructure its resources and found a new firm. The new firm is called “Aliyun”. Literally, “yun” in Chinese means “cloud”. Alibaba hoped that Aliyun’s clients could be attracted with its future new cloud-based services, due to Alibaba’s reputation.

5. Findings In 2009, the concept of cloud computing was still new and vague for all competitors in China, and Aliyun's first challenge was to develop a good understanding of the concept. Motivated by the success of Amazon's AWS and Google, Aliyun managed to develop its own interpretation. "In essence, cloud computing is data-centric. Data has become the most valuable resource for any organization," says Chenxi Lin, senior expert at Aliyun. "Operations such as storage, sharing, processing and application become far more crucial. Cloud computing is to provide data management solutions."

Resource pooling, self-service and measured service are the three major characteristics of cloud computing, according to Aliyun's interpretation. These three characteristics result in a completely new design of computing infrastructure compared to traditional models. First, resources are pooled to serve multiple clients in a multi-tenant manner, and the resources are location-independent. Second, resources can be dynamically assigned or reassigned on demand, and these operations can be conducted automatically without any human interaction. Third, the usage of resources can be monitored and reported for both vendors and clients. Aliyun considered these essential characteristics as a three-point checklist for distinguishing cloud computing from other models.

Since 2012, Aliyun’s business started aggressive expansion. Till April 2013, Aliyun had a large customer base of over 600, 000 clients, and became the largest public cloud vendor in China. The continuous effort of improving products and services led to stable profit margin and growing number of clients. To facilitate business expansion, Aliyun conducted a number of strategies in three ways. First, Aliyun exploited its innovation by developing open-standardized APIs and SaaS applications. For example, Aliyun helped a small company, Versatile, to deliver the first 3D animation film in Chinese filming history, with its new 3D rendering service. Second, Aliyun collaborated with a wide variety of organisations such as extant clients, local authority, individual developers, and internal divisions of Alibaba Group. For example, Aliyun collaborated with local authority to establish the cloud industrial park, and tried to encourage and support more start-up cloud ventures. Third, Aliyun managed to establish distinguished reputation in the market. For example, in January 2013, HiChina, the leading provider of Internet infrastructure services, was purchased to merge into Aliyun. With the acquisition, Aliyun was able to enhance its reputation by acquiring new key resources including new value-added applications, and a strong operating team.

In December 2013, Aliyun was awarded the world's first gold certification for cloud security by the British Standards Institution. In January 2014, Aliyun officially announced the expansion of its business overseas, and it has been selling its cloud services abroad since March 2014.

We developed three distinct time-based phases to demonstrate the evolution of capability development in the case. Each phase is categorised into three themes: 1) the objectives and expectations of capability development, which reflect the antecedent conditions; 2) the process of capability development, manifested in the strategies adopted by Aliyun; and 3) the consequences of capability development, focused on service delivery from a strategic perspective. In each phase, Aliyun adopted various strategies to acquire capabilities. The data is presented according to the sequence of the phase models.

Table 2: How Aliyun obtained resources and developed dynamic capabilities in Phase 1 (2009-2011)

Key organizational strategies Evidence

Leverage unique insight of cloud in Chinese SMB market

“We were successful in the past decade, dealing with thousands of SMB clients, and most of our clients are SMBs. … Cloud computing has provided us a very good opportunity. By delivering public cloud service, our SMB clients can have a more standardised service, with even lower rate; and for us, we can attract more clients and make better use of our resources.”

– by one senior product manager

Acquire reconfiguring capability “Alisoft was merged into Aliyun. The experience from developing SaaS might be highly valuable for Aliyun. Although, as Aliyun stated they will focus on infrastructure and platform development rather than SaaS application, they can pick up the technical skills, and learn the lesson of Alisoft’s failure… In terms of the technical capability, Aliyun is able to develop a platform which can enable collaboration with third-party SaaS vendors.”

– by third-party report

Develop innovating capability “To build the Feitian platform was very challenging. There are so many technical difficulties we never met. We are very proud of what we achieved today. You can imagine how difficult to manage a team of 2,000 software engineers, especially for a start-up company. The process was not just about coding, but involving a lot of design, testing, and endless discussions. More importantly, we made it on time.”

– by one senior technician

Process of capability development Evidence

Develop a new cloud platform including the file system, and the task scheduling module

“We successfully developed our own cloud platform in two years. I believe it is very impressive. Aliyun is the only cloud vendor who developed its own cloud platform in China. The success also means we would have our own cloud infrastructure, applications and standard APIs.”

– by one software engineer

Consequences Evidence

Obtain the primary firm resource and innovation capability

“We successfully developed our own platform. … Our platform has some unique features such as security, real-time protection, supporting both online services and off-line data processing. These features cannot be easily achieved by renting a third-party, large-scale, cloud platform. More importantly, our development team has learnt a lot from the whole process. Such valuable experience is priceless.”

– by one marketing mid-manager


Phase 1: Acquiring primary firm-specific resources (2009 - 2011) In the first phase (see Table 2), Aliyun's objectives focused on technology development based on a unique understanding of cloud computing. The concept of cloud computing was still new and vague for every competitor in the industry in 2009, and Aliyun had to develop a unique understanding that could not be imitated by potential competitors. Aliyun then successfully built a distinctive platform to support its future cloud-based service offerings. Accordingly, a number of strategies were enacted, and these could be broadly categorised into three strategic thrusts. First, Aliyun took advantage of its unique understanding of cloud computing and decided to offer public cloud services in China. Second, Aliyun enhanced its technical capability by acquiring Alisoft's resources and recruiting new technical staff. Third, Aliyun exploited this technical capability to develop a new cloud platform that differentiated it from competitors in the industry. As a result, Aliyun was able to obtain primary resources and capabilities.


Table 3: How Aliyun obtained resources and developed dynamic capabilities in Phase 2 (2011-2012)

Key organizational strategies Evidence

Develop imitating capability “Our understanding of cloud computing was based on the research of Amazon’s AWS. Through the process, we found out that we could imitate their service offerings to develop our own services. We know our competitor also applies the same logic. But what they do not have, is a potential large customer base.”

- by product manager A

Develop ability to improve services by building stakeholder relationships

“Dealing with clients, I think, is always a good opportunity to learn from them. Sometimes clients might ask for the same thing, but in different application scenarios. The more we communicate with clients, the better you can serve.”

– by one client-engagement manager

Process of capability development Evidence

Explore Amazon’s offerings and apply into Chinese market

“We cannot understand how each Amazon’s API works. We can only access some frameworks or diagrams. Those can help us understand how services are designed, and interact with each other. But, we can clearly address the outcomes of using those APIs, and functional modules. We then implemented similar functions on our own platform. Finally, we can manage those services and shape them in Chinese market, according to our experience.”

– by product manager C

Use Alibaba’s reputation to collaborate with interested clients, and consequently improve service quality

“With the good reputation, we could always look for interesting clients and offer great discounts or free trial services. Our clients have provided us a lot of helpful comments, feedbacks and suggestions. The process allowed us to improve our client development, support department, problem solving mechanisms. More importantly, our cloud platform (Feitian) has become robust through the process. It’s still not perfect, but it becomes more and more acceptable.”

– VP of product development

Consequences Evidence

Acquire unique capabilities to implement and improve service offerings

“We admit that enhancing quality-of-service (QoS) is a long-term process, and technology is easily transferrable. But we believe our attitude cannot be imitated; the good relationship between clients cannot be imitated; more importantly, how we develop and improve our service cannot be imitated”

– by one project manager

Phase 2: Obtaining new capabilities (2011 - 2012) Having successfully developed a cloud platform, Aliyun began to realise that there was a gap between its value proposition and clients' requirements, because Aliyun did not know how to design acceptable products for SMBs. Furthermore, the cloud platform had not yet been used by external clients, and therefore required further testing to support various applications. Without well-tested and robust products, Aliyun was not able to attract any clients. In this phase (see Table 3), to pursue its product development objectives, Aliyun adopted strategies in two ways. First, Aliyun imitated Amazon's cloud services (Amazon Web Services) by providing similar products, because Amazon also focused on the public cloud market. Second, Aliyun encouraged many clients to use its products by offering them large discounts; Aliyun then gathered and carefully analysed clients' feedback, complaints and advice to improve the quality of its products and services. Consequently, Aliyun was able to acquire unique capabilities to implement and improve its service offerings.

Phase 3: Reforming core capabilities (2012-present)

The strategies for acquiring new capabilities resulted in aggressive expansion. By October 2013, Aliyun had a large customer base of 600,000 clients and had become the largest public cloud vendor in China. Aliyun was aware that this success was based on capability development. The continuous effort to improve services led to a stable profit margin and a growing number of clients. In this phase (see Table 4), to facilitate business expansion, Aliyun pursued new strategies in three ways. First, Aliyun exploited its innovation by developing open, standardised APIs and SaaS applications. Second, Aliyun collaborated with a wide variety of organisations, such as existing clients, local authorities, individual developers and other internal divisions of Alibaba Group. Third, Aliyun managed to establish a distinguished reputation in the market. Hence, Aliyun was able to expand the business and, more importantly, to obtain trust and reputation.

Table 4: How Aliyun obtained resources and developed dynamic capabilities in Phase 3 (2012-present)

Key organizational strategies | Evidence

Enhance innovation capability

"We regularly invite some loyal clients to join our development process. They usually provide many useful ideas and opinions. … Continuous improvement is the key for us. … We have the pressure from our clients, and we are also self-motivated."

– Client manager

Enhance capability of building stakeholder relationships

"Based on the good relationship with the local authority, we could collaborate with them to build the industrial park. … We hope this project will become our own 'Silicon Valley' and the birthplace of many high-tech cloud computing businesses."

– VP of research

Process of capability development | Evidence

Develop incremental improvements by imitating and collaborating with stakeholders

"First, we keep ourselves up to date on Amazon's or Google's newly released APIs. We can implement those on our Feitian platform. Second, from our clients: by interacting with our clients, we can keep improving our platform and developing new essential applications. Third, from third-party partners: we have a lot of projects going on with local universities, authorities and other companies. We can always learn from them."

– Product manager C

Promote Aliyun's services by hosting or attending public events

"We conducted a lot of events, including ISV meetings, an annual developer tournament and client salons. We also like to attend some public events. These are great opportunities to introduce our services, explore potential collaboration and therefore build great relationships with our stakeholders."

– Business developer

Consequences of capability development | Evidence

Acquire trust and reputation for better service delivery

"The Feitian platform is the most critical component of our cloud infrastructure. All the applications and services are supported by it. We now have the platform, and we still need a lot of effort to maintain it. Aliyun might try developing some SaaS applications, but we will not ignore the platform. Of course, based on Feitian, we can further develop our capabilities, for instance the innovation capability."

– Business analyst

6. Discussion and Conclusion

This study set out to uncover how vendor capabilities support the delivery of cloud services. By moving back and forth between data and literature, we inductively derived a process model to answer this question (see Figure 2). The model highlights the relationships between the capabilities and demonstrates how each capability influences service delivery. It suggests a number of critical processes. First, the vendor should build an effective service and technical development process by developing reconfiguring, imitating and innovating capabilities; this process leads to incremental improvements. Second, based on these incremental improvements, the vendor can build stakeholder relationships. Through interaction with stakeholders, the vendor can understand their requirements and expectations, and identify opportunities to develop new offerings. Third, when the vendor successfully builds relationships with stakeholders, reputation and trust can be achieved. Because of this reputation and trust, clients will continue to use the vendor's services instead of switching to another vendor.

Figure 2: Process model of vendor capabilities and better service delivery

In Aliyun, the development of its services and platform was based on its capacity for recombining assets and resources. For example, Aliyun acquired former Alisoft's operating team, recruited new technical staff, and received feedback from stakeholders. We conceptualise this capacity as a reconfiguring capability (Ambrosini et al., 2009). The reconfiguring capability is common among technology-oriented vendors, which are constantly dealing with new ideas and knowledge. Aliyun's technical development process also includes imitating market leaders to implement new offerings. We conceptualise this capacity as an imitating capability (Kogut and Zander, 1992; Shenkar, 2010). Imitation is not simply replication; it usually also results in the ability to develop incremental improvements (Levitt, 1966; Shenkar, 2010). We conceptualise this ability as an innovating capability based on imitation (Garcia and Calantone, 2002).

Guided by our empirical data, the reconfiguring, imitating and innovating capabilities are linked as a cycle. The reconfiguring capability involves new resources, assets, staff, ideas and stakeholders' feedback, and fosters imitation and innovation. The imitating capability involves collecting and analysing information about market leaders, and facilitates innovation. The innovating capability involves technical development and commercialisation, and leads to a refreshed resource stock for the vendor. The cycle continues indefinitely unless the vendor decides to leave the market.

The service and technical development process results in incremental service improvements. Such improvements provide the foundation that allows the vendor to build relationships with stakeholders. The vendor is able to create, extend or modify its resource base, augmented to include preferred access to the resources of its alliance partners (Helfat, 2007). From the vendor's perspective, this capability can promote service development, allow costs and production facilities to be shared, and provide access to new markets, technologies and resources (Bensaou, 1997; Gulati and Singh, 1998). Based on these incremental service improvements, Aliyun was able to build relationships with a variety of stakeholders to foster its service development process. According to our evidence, trust and reputation are the outcomes when Aliyun successfully builds relationships with its stakeholders. As a result, clients and other stakeholders will continue to use its services over the longer term.

This study makes several important theoretical contributions. First, by examining how a vendor delivers cloud services, it contributes to a resource-based perspective on cloud service delivery. Second, by further examining how a vendor's capabilities are developed, it sheds light on how a firm sustains its competitive position in the cloud services industry. Third, through in-depth case investigation, it provides evidence of how value is delivered. The case study explains how a vendor firm shapes, develops and reforms its capabilities to deliver cloud services in a rapidly changing and intensely competitive context.

References

Ambrosini, V., Bowman, C. & Collier, N. (2009). Dynamic capabilities: An exploration of how firms renew their resource base. British Journal of Management 20 (1), 9-24.

Amit, R. & Schoemaker, P. H. (1993). Strategic assets and organisational rent. Strategic Management Journal 14 (1), 33-46.

Barney, J. B. (1991). Firm resources and sustained competitive advantage. Journal of Management 17 (1), 99-120.

Benbasat, I., Goldstein, D. K. & Mead, M. (1987). The case research strategy in studies of information systems. MIS Quarterly 11 (3), 368-386.

Bensaou, M. (1997). Interorganisational cooperation: the role of information technology. An empirical comparison of U.S. and Japanese supplier relations. Information Systems Research 8 (2), 107-124.

Boss, G., Malladi, P., Quan, D., Legregni, L. & Hall, H. (2007). Cloud Computing, IBM Technical Report: High Performance on Demand Solutions (HiPODS).

Chen, R., Sun, M. M., Helms, W. & Jih, K. (2008). Aligning information technology and business strategy with a dynamic capabilities perspective: A longitudinal study of a Taiwanese semiconductor company. International Journal of Information Management 28, 366-378.

Dignan, L. (2011). Cloud computing market: $241 billion by 2020 [Online]. http://www.zdnet.com/blog/btl/cloud-computing-market-241-billion-in-2020/47702. [Accessed 3 June 2014].

The Economist (2010). Tanks in the cloud [Online]. www.economist.com/node/17797794. [Accessed 3 June 2014].

Eisenhardt, K. M. (1989). Building theories from case study research. Academy of Management Review 14 (4), 532-550.

Eisenhardt, K. M. & Martin, J. A. (2000). Dynamic capabilities: what are they? Strategic Management Journal 21 (10-11), 1105-1121.

Garcia, R. & Calantone, R. (2002). A critical look at technological innovation typology and innovativeness terminology: a literature review. Journal of Product Innovation Management 19 (2), 110-132.

Garrison, G., Kim, S. & Wakefield, R. L. (2012). Success Factors for Deploying Cloud Computing. Communications of the ACM 55 (9), 62-68.

Gulati, R. & Singh, H. (1998). The architecture of cooperation: managing coordination costs and appropriation concerns in strategic alliances. Administrative Science Quarterly 43 (4), 781-814.

Hayes, B. (2008). Cloud computing. Communications of the ACM 51 (7), 9-11.

Helfat, C. (2007). Relational capabilities: drivers and implications. In: Helfat, C., Finkelstein, S., Mitchell, W., Peteraf, M., Singh, H. & Teece, D. J. (eds.) Dynamic Capabilities: Strategic Change in Organisations. Oxford, UK: Blackwell.

Klein, H. K. & Myers, M. D. (1999). A set of principles for conducting and evaluating interpretive field studies in information systems. MIS Quarterly 23 (1), 67-93.

Kogut, B. & Zander, U. (1992). Knowledge of the firm, combinative capabilities and the replication of technology. Organisation Science 3 (3), 383-397.

Levina, N. & Ross, J. W. (2003). From the vendor’s perspective: exploring the value proposition in information technology outsourcing. MIS Quarterly 27 (3), 331-364.

Levitt, T. (1966). Innovative Imitation. Harvard Business Review 44 (5), 63-70.

Makadok, R. (2001). Toward a synthesis of the resource-based and dynamic-capabilities views of rent creation. Strategic Management Journal 22 (5), 387-401.

Marston, S., Li, Z., Bandyopadhyay, S., Zhang, J. & Ghalsasi, A. (2011). Cloud computing - The business perspective. Decision Support Systems 51, 176-189.

Mell, P. & Grance, T. (2011). The NIST Definition of Cloud Computing. Computer Security Division, Information Technology Laboratory. Gaithersburg: U.S. Department of Commerce and National Institute of Standards and Technology.

Shenkar, O. (2010). Imitation Is More Valuable Than Innovation. Harvard Business Review April, 28-29.

Teece, D. J. & Pisano, G. (1997). Dynamic capabilities and strategic management. Strategic Management Journal 18 (7), 509-533.

Venters, W. & Whitley, E. A. (2012). A critical review of cloud computing: researching desires and realities. Journal of Information Technology 27, 179-197.

Wade, M. & Hulland, J. (2004). The resource-based view and information systems research: Review, Extension, and Suggestions for future research. MIS Quarterly 28 (1), 107-142.

Walsham, G. (1995). Interpretive case studies in IS research: Nature and method. European Journal of Information Systems 4 (2), 74-81.

Walsham, G. (2006). Doing interpretive research. European Journal of Information Systems 15 (3), 320-330.

Wang, C. & Ahmed, P. (2007). Dynamic capabilities: a review and research agenda. International Journal of Management Reviews 9 (1), 31-51.

Willcocks, L., Venters, W. & Whitley, E. A. (2013). Cloud sourcing and innovation: slow train coming? - A composite research study. Strategic Outsourcing: An International Journal 6 (2), 184-202.

Willcocks, L., Venters, W. & Whitley, E. A. (2014). Moving to the cloud corporation: How to face the challenges and harness the potential of cloud computing, Hampshire, Palgrave Macmillan.

Yin, R. K. (2003). Case Study Research: Design and Methods, Third Edition, California, SAGE Publications.

Zollo, M. & Winter, S. G. (2002). Deliberate learning and the evolution of dynamic capabilities. Organisation Science 13 (3), 339-351.

Poster presentations

A framework for identifying suitable cases for using market-based approaches in industrial data acquisition Torben Jess ([email protected]), Philip Woodall and Duncan McFarlane Department of Engineering, University of Cambridge

Abstract

Various researchers have suggested concepts for the application of market-based techniques to data management and data acquisition. However, very few of these concepts have been successfully implemented and shown a positive practical impact. This can partially be attributed to the lack of existing techniques for identifying appropriate scenarios. Using knowledge from existing successful market implementations in data acquisition, data management and related areas, this paper aims to close this gap. It develops a framework for identifying scenarios that are suitable for a market-based data acquisition approach. It finds that a heterogeneous environment with distributed decision-making based on partial information, combined with assumptions about the user's knowledge of the data they are using, can give a good indication for the application of market-based data acquisition approaches. The developed framework is applied to two sample scenarios.

Keywords: Market-based algorithms; Data Management; Data Acquisition

1 Introduction

In today's Big Data environment, industrial companies are overloaded with data, and it is extremely difficult for decision makers (users) to find the data they need, especially where multiple datasets need to be combined to produce richer data for decision making. These datasets include various internal datasets as well as external datasets, which could be bought from an external data provider. All datasets have costs associated with their provision, e.g. maintenance for internal datasets and acquisition costs for external datasets. Allocating these datasets can therefore be described as an intra-company resource allocation problem.

Markets have been shown to work well for resource allocation problems. Various researchers have suggested the application of markets to data management or data acquisition17 (Christoffel, 2002; Koroni et al., 2009; Yemini et al., 1998). However, very few of these concepts have been successfully adopted in organisations. The main barrier to adoption is identifying the right data acquisition and data management problems. Identifying suitable scenarios for the application of market-based approaches can therefore solve this problem and lead to a more successful application of market-based techniques in data acquisition.

Using interviews to identify industrial cases and literature reviews to identify typical criteria for using market-based approaches, this paper develops a framework to identify promising scenarios. It explains which characteristics need to be fulfilled so that a market-based approach can provide benefits.

17 This paper focuses on the field of data acquisition in order to reduce the scope of the paper and the analysis. However, various problems in data management can be reduced to problems in data acquisition, making the problem of data acquisition very similar to a range of problems in data management. Selecting a data quality tool for a dataset, with tool A and tool B as options, could for example be reduced to the acquisition of one dataset after treatment with tool A and another dataset after treatment with tool B.

A literature review found the following characteristics of market-based approaches: allocation of resources; abstraction of a problem; fast, efficient, flexible and extensible calculation processes; and incentivising participants for increased efficiency (Brydon, 2006; Koroni et al., 2009; Tucker and Bermany, 1996; Voos and Litz, 2000). This served as input to develop required (definitely needed to apply a market-based approach in data acquisition) and beneficial (not needed, but potentially helpful for finding scenarios where a market provides benefits) characteristics of data acquisition problems:

Required: multiple data sources; a set of alternatives among the datasets; the user knows the value of a dataset combination; different data alternatives have costs associated with them.

Beneficial: partial information with data users and/or data providers; a heterogeneous environment for data users and/or data providers; distributed decision making between data user and data provider.

These characteristics can be checked against a scenario to see if it fits our description of the problem space. For two industrial examples we found good matches using these criteria.

The rest of the paper is structured as follows. Section 2 gives an overview of markets in industrial data acquisition and introduces the research gap, before section 3 presents our framework, followed by two sample cases in section 4. In section 5 we present the conclusion and potential future work.

2 Research background

For the research background we first look at the application of markets in other fields, and then specifically at data acquisition.

2.1 Industrial application of markets

Adam Smith identified the use of markets for resource allocation and value estimation in his book "The Wealth of Nations" in 1776 (Smith, 2012). Various other researchers have shown that markets are good at allocating resources and estimating value (Kaihara and Fujii, 2008; Tucker and Bermany, 1996; Yemini et al., 1998). Markets have been used successfully in various industrial resource allocation problems such as supply chain management, workforce allocation (Virginas et al., 2003), airport traffic control (Jonker et al., 2005), and task scheduling (Reeves et al., 2005). They have also been used for intra-company allocation problems, especially helping to solve the NP-complete resource allocation problem (Brydon, 2006). This variety of internal and external applications of markets suggests that they can also provide benefits in other internal areas such as data management (or, more specifically, data acquisition).

Other application areas related to information systems are robot allocation in manufacturing systems (Dias and Stentz, 2003), bandwidth allocation (Shapiro and Varian, 1998), allocation of CPU and IO capacity (Kwiat, 2002), and supply chain management systems (Fan et al., 2003).

2.2 Markets in data acquisition

Various authors have developed market-based approaches to data management problems, with implementations in related areas. Yemini et al. (1998) introduce markets as a concept for application and service resource management in large-scale information systems, but they do not show a concrete application of this market type to data; rather, they focus on access to resources and services. Koroni et al. (2009) introduce an "internal data market" and the idea that markets can be used for evaluating information, its quality, its costs and the benefits it can create, thereby addressing the key areas within data management. Overall, they indicate some potential benefits and challenges, but they do not show ways to overcome these issues or concrete implementations of "internal data markets". Christoffel (2002) describes a market architecture for data integration, discussing various issues around using markets for data and data integration. However, it has not been implemented or tested, and its applicability to specific test cases is not evaluated.

When reviewing the literature it became apparent that, while various approaches for applying markets to data acquisition have been developed, only very few have been implemented, and these mostly in related areas such as data security (Dailianas et al., 2000) or allocation of data in a network (Wang et al., 2012).

This combination of a large number of concepts, and reasonable arguments for applying markets to specific problems in data acquisition, with so few actual implementations could be due to the complexity of implementing markets, or to markets being applied to the wrong type of scenario, where they did not show any benefits. The large number of successful market-based approaches in other fields (see section 2.1) indicates that the identification of suitable scenarios is the major problem. So far, no framework for identifying appropriate scenarios for using markets in data acquisition has been developed.

3 Characteristics of a market-based approach in data acquisition

The development of our framework relies on two main streams. First, we conduct a review of market-based approaches and the criteria for selecting them. This is then combined with industry knowledge to describe a potential problem space.

3.1 Characteristics of market-based approaches in data acquisition

Our framework is based on a review of the characteristics of various market-based approaches in different fields. We used these to identify the main characteristics potentially applicable to data acquisition. These characteristics can be divided into three categories:

1. Characteristics typical of all techniques for calculating the value of a piece of data (see Table 1).

2. Characteristics often associated with, and used alongside, a market approach (see Table 2).

3. Characteristics that are most likely to give a market some form of advantage over existing techniques for data acquisition (see Table 3).

For the first category, the main aspect is the ability of a market-based approach to obtain similar results to existing value-of-information techniques from decision theory and resource allocation.

Table 1: Abilities that are similar to existing value of information techniques

1. Allocation of resources given value and costs (Koroni et al., 2009; Brydon, 2006; Xu et al., 2006; Voos and Litz, 2000): By combining the value of data to users with the costs of data providers, a market-based approach to data acquisition can calculate prices between the two sides and make sure that the data resource is allocated to the right users in a way that increases the company's overall utility.

Besides these characteristics, there are other characteristics typically associated with a market-based approach. These are techniques to formalise users' values and requirements on one side and suppliers' offerings on the other, and then to merge them. While these techniques are used in combination with a market, they are not standalone abilities of markets. However, they might increase the benefits of a market-based approach due to the additional calculation required for the automation.

Table 2: Characteristics of a problem space where techniques typically associated with a market might be beneficial

2. Formalism and abstraction of problems (Brydon, 2006; Voos and Litz, 2000): Data managers typically speak of formats, quality, structure of data, and so on; users, on the other hand, do not always understand this type of discussion. By letting users buy what they need and letting data managers handle the technical details, a market puts the problem on a more abstract level.

In addition to these criteria used for a market-based approach, there is a range of additional characteristics of a typical problem space. These are likely to be the problem-space characteristics due to which a market is able to outperform existing solutions.

Table 3: Characteristics of the problem space in which a market-based approach might outperform existing solutions

3. Faster and/or more accurate computation (Brydon, 2006; Tucker and Bermany, 1996; Voos and Litz, 2000): A market-based approach for calculating the value of data might outperform other solutions for characteristic 1, the resource allocation and value calculation. Due to the high complexity, it is difficult to calculate how the value of information changes for certain changes in the data; the underlying resource allocation problem is in fact known to be NP-hard. Using a market to transform the resource allocation problem into a winner determination problem (which is still NP-hard, but for which various good heuristics exist) might help to obtain better results (see the sketch after this table).

4. Flexibility and extendibility (Brydon, 2006): A market-based approach can react better to changes in the data offers and user values. This means that a market-based approach could calculate an appropriate, higher overall company-utility solution to the value-of-information and resource allocation problem under changing conditions in the valuations of users and the costs of data offers.

5. Incentivizes participants and increases efficiency (Brydon, 2006; Tucker and Bermany, 1996): The money (actual or virtual) involved in such a trading algorithm gives both sides, data managers and users, an incentive to improve their data. It also works as an honest feedback mechanism for data managers and users: it requires data managers to orient their development towards the needs of the users, and users to develop a realistic understanding of their data requirements.
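To make the winner-determination framing in characteristic 3 concrete, the following minimal Python sketch enumerates which datasets to acquire and which user bundle bids to accept so that the total accepted value minus the acquisition cost is maximised. It is purely illustrative: the user bids, dataset costs and helper names are hypothetical assumptions for a toy problem, not an implementation from this paper, and realistic problem sizes would need the heuristics mentioned above rather than exhaustive search.

```python
from itertools import chain, combinations

# Hypothetical user bundle bids (value of a combination of datasets to a user)
# and dataset acquisition costs; all names and numbers are illustrative.
user_bids = {
    "risk_analysis": {frozenset({"credit_data"}): 12_000,
                      frozenset({"credit_data", "delivery_history"}): 18_000},
    "procurement":   {frozenset({"delivery_history"}): 7_000},
}
dataset_costs = {"credit_data": 10_000, "delivery_history": 4_000}

def powerset(items):
    s = list(items)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def best_allocation(user_bids, dataset_costs):
    """Brute-force winner determination: choose which datasets to acquire and,
    for each user, which single bundle bid to accept, maximising total accepted
    value minus total acquisition cost."""
    best = (0.0, frozenset(), {})
    for subset in powerset(dataset_costs):
        acquired = frozenset(subset)
        cost = sum(dataset_costs[d] for d in acquired)
        value, accepted = 0.0, {}
        for user, bids in user_bids.items():
            # Only bundles fully covered by the acquired datasets are feasible.
            feasible = {b: v for b, v in bids.items() if b <= acquired}
            if feasible:
                bundle, v = max(feasible.items(), key=lambda kv: kv[1])
                value += v
                accepted[user] = bundle
        if value - cost > best[0]:
            best = (value - cost, acquired, accepted)
    return best

surplus, acquired, accepted = best_allocation(user_bids, dataset_costs)
print(f"Acquire {sorted(acquired)} for a net surplus of {surplus:.0f}")
for user, bundle in accepted.items():
    print(f"  {user} is served with {sorted(bundle)}")
```

Note that the sketch treats an acquired dataset as non-rival: once bought, it can satisfy the bids of several users at no extra cost, which is one way a data market differs from markets for physical resources.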

3.2 Characteristics of the problem space for a market-based approach in data acquisition

Generally, a market mechanism is applicable in a data acquisition context to situations where the value of a piece of data is required in order to make better decisions about the allocation of data. Therefore, the following conditions should always be satisfied if a market-based approach is to be considered applicable:

Table 4: Characteristics of a suitable problem space for a market-based approach to data acquisition

1. User is using data from multiple (possibly changing) data sources: The user draws on data from more than one data source, and these sources may change over time.

2. User has a set of offers (from data providers) to acquire more or different data to improve their decisions: The user has offers from data providers from which additional datasets could be acquired to support decision-making. The user needs to select from these offers, and this selection also informs the data provider's decision about which data to provide.

3. User knows the value of a certain piece of data, or a combination of data pieces, in terms of its contribution to a decision: The data influences the decision in a way that allows the user to calculate or estimate its value based on how the decision changes with this information.

4. Data provider knows the costs associated with provision: The provision of data costs money, and the data managers are aware of these one-off and maintenance costs.

These characteristics describe all situations where a market mechanism could potentially be deployed. However, in order to provide benefits to industrial companies, a market-based approach also needs to outperform existing approaches. Therefore, additional characteristics focusing on the difficulties that exist within data acquisition and related applications are required.

Table 5: Characteristics of use cases where a market can outperform existing solutions

1. Partial information with data users and/or data providers: Data providers do not know about users' data interests, and data users do not know about the available data and data offers on the other side. Data providers have no clear idea of how, and for which decisions, the data is used, nor of the value a data offer has for different potential data users. Users have no clear idea of the data that exists or could become available to them, nor of the costs associated with providing it.

2. Heterogeneous environment for data users and/or data providers: Describes situations where users have a range of values for different datasets; data comes from different sources and has a variety of characteristics and options to be adjusted; and users and data managers have a different understanding of problems and approaches.

3. Distributed decision making between data user and data provider: The decision to provide data to the user is made in a different division (or even a different company) from the user's actual business decision based on that data.

Table 6 shows how these different characteristics match the initially identified characteristics of a market-based approach, showing an overall good suitability of markets for the identified problem criteria; the sketch after the table illustrates how the framework can be applied as a simple checklist.

Table 6: Combination of problem space characteristics with market characteristics

Market characteristics (columns): allocation of resources given value and costs; formalism and abstraction of problems; faster and/or more accurate computation; flexibility and extendibility; incentivizes participants and increases efficiency.

Problem space characteristics (rows) and their matches:

• User is using data from multiple (possibly changing) data sources: ✔ ✔ ✔

• User has a set of offers (from data providers) to acquire more or different data to improve their decisions: ✔ ✔

• User knows the value of a certain piece of data or a combination of data pieces in terms of contribution to a decision: ✔ ✔ ✔

• Data provider knows the costs associated with provision: ✔ ✔ ✔

• Partial information with data users and/or data providers: ✔ ✔

• Heterogeneous environment for data users and/or data providers: ✔ ✔

• Distributed decision making between data user and data provider: ✔ ✔ ✔
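As a minimal illustration of how the framework could be operationalised, the sketch below encodes the required characteristics of Table 4 and the beneficial characteristics of Table 5 as boolean flags and applies the obvious decision rule. The field names and assessment messages are our own shorthand, not terminology from the framework itself.

```python
from dataclasses import dataclass

# Hypothetical encoding of the framework as a checklist; the field names
# paraphrase Tables 4 and 5 and are not terminology from the paper itself.
@dataclass
class Scenario:
    # Required characteristics (Table 4)
    multiple_data_sources: bool
    offers_from_providers: bool
    user_knows_value: bool
    provider_knows_costs: bool
    # Beneficial characteristics (Table 5)
    partial_information: bool = False
    heterogeneous_environment: bool = False
    distributed_decision_making: bool = False

def assess(s: Scenario) -> str:
    """Apply the framework: all required characteristics must hold; beneficial
    characteristics indicate where a market may outperform existing solutions."""
    required = [s.multiple_data_sources, s.offers_from_providers,
                s.user_knows_value, s.provider_knows_costs]
    beneficial = [s.partial_information, s.heterogeneous_environment,
                  s.distributed_decision_making]
    if not all(required):
        return "market-based approach not applicable"
    if any(beneficial):
        return "applicable, and likely to outperform existing approaches"
    return "applicable, but unlikely to add benefit over existing approaches"

# A scenario in which all required and all beneficial characteristics hold.
example = Scenario(True, True, True, True,
                   partial_information=True,
                   heterogeneous_environment=True,
                   distributed_decision_making=True)
print(assess(example))
```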

4 Sample cases

In order to test our framework, we interviewed industrial experts from a large manufacturing organisation to gain their expert opinion about potential cases where a market-based approach could be applied. Based on these interviews, two hypothetical scenarios were developed. These scenarios are tested against our framework for initial validation.

Scenario A: Company X is a manufacturing company with 100,000 employees working across a broad range of products and functions. Within this company, employee H. Smith in the IT division has the option to acquire a dataset containing credit data for suppliers at a cost of 10,000 USD per year. He has a one-month sample of the dataset with which to test it in his organisation. H. Smith thinks that this data could be relevant to some employees in the company, but is uncertain about who specifically could be interested. He knows that the supplier risk analysis division of 10 employees could use this data to predict potential problems with suppliers, but quantifying an exact value is difficult. Estimates can be given by the users. H. Smith is unsure whether he should acquire the dataset, especially given various other internal datasets which might help the users reach a similar decision. H. Smith cannot ask all potentially interested divisions in the company, due to the large number of employees, the distances between them, and the fact that he does not know the complete organisation.

Scenario B: Company Y is a large machine manufacturing company with 150,000 employees. The service division has 15,000 employees and mainly provides spare parts, but also repair services, to the companies that own Company Y's large manufactured products. The customers of Company Y asking for potential spare parts or specific services typically rely on Company Y for support. However, Company Y has little insight into its customers' machines and processes and is therefore unsure about orders of spare parts - orders from customers often "surprise" them. Company Y would benefit from further insight into its customers' data. The problem is that customers would only sell their data at a very significant cost, 300,000 USD on average per machine, because the data has high value to them. Manager J. Anderson wants to run a trial by acquiring data from 5 customers for 1.5 million USD. He knows that the supplier forecasting division of 10 employees would benefit from this data, but he is unsure to what extent this data would improve their operation. Each employee in the operations division would be able to give him a rough estimate of the value that this data creates for them. However, he is uncertain about the specific divisions and their value for the different parts of the dataset. He also does not know where else in the organisation this data might be useful.

We checked these scenarios against our initial framework.

Table 7: Characteristics tested against the two example use cases

User is using data from multiple (possibly changing) data sources
Example A: The supplier risk division and the procurement division are very likely to use a broad range of data, which is also very likely to change over time.
Example B: Users have different data offers, and the data provided from different sources might vary. Especially for spare parts orders in Company Y, different data offers could be used to make the predictions; one of them is the customer data.

User has a set of offers (from data providers) to acquire more or different data to improve their decisions
Example A: In this case the users have the chance to use the data about credit evaluations of their suppliers. The risk analysis department in particular, faced with an analytical challenge, also has a broad range of data offers it could use for its problems.
Example B: Users have various other data offers they could use and, in addition, the potential customer data. They could use both sources to forecast future required spare parts, or just the internal data if that is cheaper.

User knows the value of a certain piece of data or a combination of data pieces in terms of contribution to a decision
Example A: Users have an idea of the value that the data, and its potential combinations with other data sources, would provide.
Example B: Each individual user could give estimates once they can use the trial data. This could then be extended to all datasets.

Data provider knows the costs associated with provision
Example A: H. Smith knows the costs.
Example B: J. Anderson knows the cost of the data.

Partial information with data users and/or data providers
Example A: The users and H. Smith are not necessarily aware of each other's existence, or of each other's values and costs.
Example B: Data users know the information exists within the customer. However, they do not have access to it, and they do not know exactly the value that this data would provide to their respective colleagues and throughout the division, making good estimates difficult.

Heterogeneous environment for data users and/or data providers
Example A: The data provider, H. Smith, has a variety of data sources he could look into providing to the users. In addition, the data users have a broad range of potential data offers that can be combined in different ways, and each combination would offer a different utility to the user.
Example B: Each user within Company Y might place a different value on the data, especially when it is combined with other data sources.

Distributed decision making between data user and data provider
Example A: The users and H. Smith work in different departments and make decisions without interacting intensively.
Example B: J. Anderson, the customer and the potential data users all work in different divisions, or even different companies.

Allocation of resources given value and costs
Example A: A market can perform the calculation between the values given by the users and the costs of the dataset known to H. Smith. However, this can also be done by other value-of-information solutions.
Example B: A market could take the value estimates gathered from the users and combine them with the costs of getting the data from each of the customers.

Formalism and abstraction of problems
Example A: A market might deploy techniques to automatically capture the users' valuations of the data and also the users' data requirements. This automation could analyse the credit data and offer it to the users for a trial run; the users would then have to state their value in order to keep the data in the future.
Example B: Similarly to Example A, the users could get a test of the data and then bid to obtain it. The automation could also identify which customer data is interesting for each user, to enable the collection of values from the users.

Faster and/or more accurate computation
Example A: Due to its ability to deal with the NP-hard resource allocation problem, a market-based approach might outperform existing techniques for calculating the best allocation of the potential data to the different users for the supplier risk analysis and procurement challenges.
Example B: Given the five data offers of customer data and the large number of users, there are various potential combinations for allocating the data to the right users. Calculating which of these combinations generates the highest utility is very difficult, especially because some of the users that might be interested are not known from the beginning.

Flexibility and extendibility
Example A: Given an existing market solution for allocating the credit data to the users, a market might calculate a new optimum faster once a user value or a data provider's costs change (for example, another dataset for supplier risk analysis becoming available to the supplier risk analysis division).
Example B: Once the market has done the calculations for the five data offers of customer data and the current users, additional data offers of customer data can be included, potentially building on the knowledge already collected from the existing market.

Incentivizes participants and increases efficiency
Example A: Due to the feedback provided by the users for the data offer of this additional supplier credit information, the data provider has an incentive to test additional data offers and to improve the data provided to all users overall.
Example B: In this case, J. Anderson gets direct feedback from the users on whether the data from the customers is worth its price. If it improves efficiency he can continue acquiring the data; if it does not, he does not need to buy it.

The initial tests show that our framework covers these two scenarios very well. Its main current limitation is that some of the criteria are very general, which can make it difficult to check a scenario against them. A more detailed checklist breaking down each of the main points could help to reduce this problem. Further work requires an actual application (beyond initial industry verification) in each of these scenarios in order to verify the framework further.

5 Conclusion and future work

Our paper developed a framework to help identify suitable scenarios for the application of markets in data acquisition, addressing the lack of existing techniques in this field. Our initial framework showed promising indications for the two sample cases presented in section 4. It can help to break down a scenario and check whether the description of a data acquisition scenario fits a market-based approach. However, further validation of this framework, by actually applying markets to the identified scenarios, is required. Future work therefore aims to use and refine this framework as part of a questionnaire and scenario description when developing market-based approaches for data acquisition.

In addition to identifying scenarios, the framework can also help in the justification process by enabling a clear breakdown of the reasons for applying markets to a specific data acquisition problem.

References

Brydon, M., 2006. Economic metaphors for solving intrafirm allocation problems: What does a market buy us? Decis. Support Syst. 42, 1657–1672. doi:10.1016/j.dss.2006.02.009

Christoffel, M., 2002. Information Integration as a Matter of Market Agents, in: Proceedings of the 4th International Conference on Electronic Commerce Research (ICECR-5), Montreal, Canada.

Dailianas, A., Yemini, Y., Florissi, D., Huang, H., 2000. MarketNet: Market-based protection of network systems and services - an application to SNMP protection, in: Proceedings of INFOCOM 2000, Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Tel Aviv, Israel, pp. 1391–1400. doi:10.1109/INFCOM.2000.832536

Dias, M.B., Stentz, A., 2003. TraderBots: A Market-Based Approach For Resource, Role, and Task Allocation in Multirobot Coordination (Technical Report No. CMU-RI-TR-03-19). Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.

Fan, M., Stallaert, J., Whinston, A.B., 2003. Decentralized mechanism design for supply chain organizations using an auction market. Inf. Syst. Res. 14, 1–22. doi:10.1287/isre.14.1.1.14763

Jonker, G., Meyer, J.-J., Dignum, F., 2005. Towards a market mechanism for airport traffic control, in: Progress in Artificial Intelligence, Lecture Notes in Computer Science. Springer, pp. 500–511.

Kaihara, T., Fujii, S., 2008. A study on modeling methodology of oligopolistic virtual market and its application into resource allocation problem. Electr. Eng. Jpn. 164, 77–85. doi:10.1002/eej.20628

Koroni, A., Redman, T., Gao, J., 2009. Internal Data Markets: The Opportunity and First Steps, in: COINFO ‘09 Proceedings of the 2009 Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology. Presented at the Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology, 2009. COINFO ‘09., IEEE, Beijing, China, pp. 127–130. doi:10.1109/COINFO.2009.57

Kwiat, K., 2002. Using markets to engineer resource management for the information grid. Inf. Syst. Front. 4, 55–62. doi:10.1023/A:1015386522883

Reeves, D.M., Wellman, M.P., MacKie-Mason, J.K., Osepayshvili, A., 2005. Exploring bidding strategies for market-based scheduling. Decis. Support Syst. 39, 67–85. doi:10.1016/j.dss.2004.08.014

Shapiro, C., Varian, H.R., 1998. Information Rules: A Strategic Guide to the Network Economy, 1st ed. Harvard Business Review Press.

Smith, A., 2012. The Wealth of Nations. Simon & Brown.

Tucker, P., Bermany, F., 1996. On Market Mechanisms as a Software Technique (Technical Report No. CS96-513). Department of Computer Science and Engineering, University of California, San Diego, San Diego, California, USA.

Virginas, B., Voudouris, C., Owusu, G., Anim-Ansah, G., 2003. ARMS Collaborator—intelligent agents using markets to organise resourcing in modern enterprises. BT Technol. J. 21, 59–64. doi:10.1007/s10550-007-0082-9

Voos, H., Litz, L., 2000. Market-based optimal control: a general introduction, in: Proceedings of the 2000 American Control Conference. Chicago, IL, USA, pp. 3398–3402. doi:10.1109/ACC.2000.879198

Wang, T., Lin, Z., Yang, B., Gao, J., Huang, A., Yang, D., Zhang, Q., Tang, S., Niu, J., 2012. MBA: A market-based approach to data allocation and dynamic migration for cloud database. Sci. China Inf. Sci. 55, 1935–1948. doi:10.1007/s11432-011-4432-3

Xu, Y., Scerri, P., Sycara, K., Lewis, M., 2006. Comparing market and token-based coordination, in: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems. Presented at the AAMAS’06, Fifth international joint conference on Autonomous agents and multiagent systems, ACM, Hakodate, Hokkaido, Japan., pp. 1113–1115. doi:10.1145/1160633.1160834

Yemini, Y., Dailianas, A., Florissi, D., 1998. MarketNet: A market-based architecture for survivable large-scale information systems, in: Proceedings of Fourth ISSAT International Conference on Reliability and Quality in Design.

Balancing Big Data with Data Quality in Industrial Decision-Making Philip Woodall* and Maciej Trzcinski Department of Engineering, University of Cambridge * corresponding author: [email protected]

Abstract

Data is now an abundant resource in many industrial organisations, generated both internally (for example, from continuously operating manufacturing equipment) and sourced externally from the Web. As datasets rise in size, many challenges will follow, but techniques will also emerge which can harness this opportunity to address old industrial data problems in new ways. This research follows this path and aims to determine whether, rather than improving data quality in the traditional way, it is possible to achieve the same benefits by simply gathering more data. While this may seem like an odd notion, our aim is to test this claim empirically and objectively to determine whether it is possible for certain manufacturing-related decisions, and this paper presents experimental results towards this endeavour.

Introduction

The volume and availability of datasets are increasing massively in today's industrial organisations. Despite its critical role in the past, however, the problem of data quality is sometimes dismissed as irrelevant in this Big Data world, and the importance of improving data quality is being debated (Mayer-Schonberger and Cukier, 2013). One important question is: "as we scale data, will the effects of errors in the datasets be reduced to an extent where data quality assessment and improvement activities are no longer needed" (Woodall et al., 2014)?

For example, consider the question of determining the performance of a particular supplier to a manufacturing company (see Woodall et al., 2014). A series of transactions with the supplier could be recorded in a database indicating whether the supplier delivered parts to the company on time or not (see Table 10). The company could calculate the supplier's performance as the number of on-time deliveries over the total number of deliveries for that supplier. If, however, the supplier is referred to by two different names (the quality error being that the supplier should be referred to consistently), then some of the deliveries for that supplier may not be counted. This can happen when a query to the database requests all records where the company is "Air parts Ltd."; such a query would not return records for "Air pts". The result may be an inaccurate performance figure, as the example in Table 10 shows: the performance of the "Air parts Ltd." supplier would be counted as 75% (3/4). Correcting the error in the supplier names ("Air pts" to "Air parts Ltd.") would reveal the actual performance to be, instead, 50% (3/6).

If the company has a policy of keeping suppliers above a threshold of 70%, then correcting data quality in this case actually changes the decision: without correcting data quality the company would continue business with the supplier, whereas after correcting quality the company would cancel the contract with the supplier according to its 70% delivery-performance policy.

Table 10: Supplier transactions

Company         On-time delivery?
Air parts Ltd.  y
Air parts Ltd.  y
Air parts Ltd.  y
Air pts         n
Air pts         n
Air parts Ltd.  n
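The supplier-performance calculation in this example is simple enough to state directly in code. The following minimal Python sketch reproduces the 75% and 50% figures above; the records and the 70% threshold come from the example, while the name-normalisation map is an illustrative assumption rather than part of the original analysis.

```python
# Records from Table 10; the 70% threshold comes from the example above.
records = [
    ("Air parts Ltd.", "y"), ("Air parts Ltd.", "y"), ("Air parts Ltd.", "y"),
    ("Air pts", "n"), ("Air pts", "n"), ("Air parts Ltd.", "n"),
]

def performance(records, supplier, normalise=None):
    """On-time deliveries / total deliveries for one supplier. `normalise`
    optionally maps inconsistent names to their canonical form."""
    fix = (lambda name: normalise.get(name, name)) if normalise else (lambda name: name)
    deliveries = [on_time for name, on_time in records if fix(name) == supplier]
    return sum(1 for d in deliveries if d == "y") / len(deliveries)

# Querying the exact name misses the "Air pts" rows: 3/4 = 75%.
print(performance(records, "Air parts Ltd."))

# Correcting the naming inconsistency (an assumed normalisation map) gives the
# actual figure, 3/6 = 50%, which is below the 70% policy threshold and
# therefore changes the decision.
print(performance(records, "Air parts Ltd.", normalise={"Air pts": "Air parts Ltd."}))
```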

In relation to scenarios of this type, the question we aim to address is whether industrial decision makers can arrive at the correct decision only by increasing the size of the dataset, without having to correct data quality.

In particular, we ask whether collecting more data records (in the spirit of the Big Data era) will lead to a result where a few erroneous entries no longer inflict such a large bias on the overall results, so that the result tends towards the correct supplier delivery performance. In the broadest sense, this is asking what effect Big Data has on data quality in industrial decision-making scenarios.

If we continue the example from Table 10 and include more data records (possibly from another data source, or by recording more deliveries, etc.) and re-calculate the supplier performance in the same way, we obtain a performance for "Air parts Ltd." of 66% (8/12). The company therefore arrives at the correct decision to cancel the contract with the supplier, because this value is below the policy threshold of 70%, and this has been achieved without spending time correcting data quality. Note that in this case the "actual answer" (after correcting data quality in Table 11) is 61% (11/18).

In this motivating example, it is clear that the situation has been conveniently set up to reveal a desirable outcome, because the errors "cancel each other out". In real situations this may or may not be the case, and it is our intention to determine whether, and in which particular cases and for which types of decisions, increasing data volume is equivalent to, or an improvement over, correcting data quality.

Table 11: Table 10 after obtaining more records

Company         On-time delivery?      Company         On-time delivery?
Air parts Ltd.  y                      Air pts         y
Air parts Ltd.  y                      Air pts         y
Air parts Ltd.  y                      Air parts Ltd.  n
Air pts         n                      Air parts Ltd.  n
Air pts         n                      Air parts Ltd.  y
Air parts Ltd.  n                      Air parts Ltd.  y
Air parts Ltd.  y                      Air pts         y
Air parts Ltd.  y                      Air pts         n
Air parts Ltd.  y                      Air parts Ltd.  n

Approach

In order to address our research question, we focus on the volume aspect of Big Data and investigate the effects of the completeness and inconsistency data quality dimensions (Wang and Strong, 1996) when more data records are appended to a dataset. The decision to be made on the basis of the data in this dataset is similar to the one presented in the introduction, where a proportion is computed over the records. It is this proportion that could therefore be inaccurate because of data quality, and that could be made more accurate by improving quality or, possibly, by obtaining more of the same data records over time.

We performed an experiment using software simulations, similar to the ones developed for the research in Woodall et al. (2014), to assess the effects of the following criteria on this scenario:

• The distribution and frequency of errors.

• The types of errors (missing data, identifier inconsistencies).

• The level of data volume increase.

• The way in which data is initially sampled.

Background

The definition of Big Data most widely adopted by industry is described in terms of three Vs: volume, velocity and variety (Suciu, 2013). That is, it applies to constantly growing, complex datasets of large volume drawn from heterogeneous sources. Because of one or more of these factors, Big Data may be difficult to process using conventional data analytics, and the approaches to data manipulation used hitherto may not be the most efficient when faced with the challenge of analysing Big Data (Suciu, 2013).

The emergence of Big Data is readily illustrated by the variety of data sources available on the World Wide Web alone and by figures on data generation provided by IBM: the company claims that 2.5 million terabytes of data are generated every day and that 90% of all data has been created in the past two years.

The main focus of this work is Big Data in the industrial context. The data sources are numerous, including information systems in the areas of warehousing, manufacturing processes, suppliers, customers, finance, forecasting and human resources, and they extend to relevant data available online. All of this data is updated constantly, every day a company operates.


Experimental Design

Figure 7: Experimental design part 1

The experiment is designed to test the relationship between data quality and volume, and to allow the accuracy of statistical analyses of data samples to be compared, as shown in Figure 7 and Figure 8. First, the analysis was run on the complete dataset with 100% quality as a point of reference (see Figure 7). Then data quality errors were inserted and results were obtained for the complete dataset with reduced quality. Finally, a randomly or systematically chosen fraction of the low-quality dataset was tested. The most important comparison is between the “reduced quality” and the “fraction of the reduced quality” datasets; the complete high-quality dataset provides the benchmark against which overall accuracy is compared.

The second part of the experiment compares a reduced-quality fraction of the data with a high-quality sample (all data quality problems removed) of the same size, as depicted in Figure 8. This shows the improvement achieved by correcting data quality, which can then be compared directly with the improvement achieved by increasing dataset size. The deciding metric is the improvement in accuracy, where inaccuracy is defined as the difference from the underlying true value obtained from the complete high-quality dataset.


Figure 8: Experimental design part 2
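As a rough illustration of how the two parts of the design fit together, the sketch below is our own simplification, not the authors' code; the record layout, function names, the restriction to missing-value errors and the use of purely random sampling are all assumptions. It computes the benchmark value together with the initial low-quality estimate e0, the quality-corrected estimate eq and the volume-increased estimate ev for a single run:

    import random

    def delayed_proportion(records):
        """Percentage of delayed flights; records with a missing flag are simply excluded."""
        usable = [flag for _, flag in records if flag is not None]
        return 100.0 * sum(usable) / len(usable)

    def run_once(clean, error_rate=0.10, fraction=0.5, seed=0):
        """clean: list of (airline, delayed) tuples with delayed in {0, 1} and no quality problems."""
        rng = random.Random(seed)
        true_value = delayed_proportion(clean)             # benchmark: complete dataset, 100% quality

        # Insert missing-data errors into a copy of the complete dataset.
        dirty = list(clean)
        for i in rng.sample(range(len(dirty)), int(error_rate * len(dirty))):
            dirty[i] = (dirty[i][0], None)

        # Part 1: a reduced-quality fraction versus the full (still reduced-quality) dataset.
        idx = rng.sample(range(len(dirty)), int(fraction * len(dirty)))
        e0 = delayed_proportion([dirty[i] for i in idx])    # initial estimate: low quality, reduced size
        ev = delayed_proportion(dirty)                      # estimate after increasing volume

        # Part 2: the same-size fraction with all quality problems corrected.
        eq = delayed_proportion([clean[i] for i in idx])    # estimate after improving quality

        return true_value, e0, eq, ev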

A nested experimental design was used, and the following list summarises the different parameters/variables used in the experiment:

1. Sampling type: partial, interval, random

2. Error frequency (proportion of records with errors inserted into the datasets): 5%, 10%, 20%

3. Volume increase: 2x, 3x

4. Error type: missing data, identifier inconsistencies.

These variables are explained in the following sections; the first subsection describes the dataset used in the experiment.

Dataset selection

The selected dataset contains airline on-time performance statistics provided by the US Bureau of Transportation Statistics and is publicly available online at http://apps.bts.gov. The database contains nearly 500,000 entries for each month, with historic entries going back as far as 1987. The part of the database used for the experiments is from January 2014, when 14 airlines were active. An example record containing a selection of attributes is shown in Figure 9.

Figure 9: An example record from the airline dataset


Table 12: Airline on-time performance data

Airline Name                   Total Flights   Delayed Flights   Delayed %   Mean delay time
Southwest Airlines Co          86698           29017             33.5        22
Delta Air Lines Inc.           55928           14498             25.9        20
ExpressJet Airlines Inc.       49365           16611             33.6        26
SkyWest Airlines Inc.          47578           11094             23.3        14
American Airlines Inc.         43711           9381              21.5        13
United Air Lines Inc.          37291           9390              25.2        17
US Airways Inc.                33651           6690              19.9        11
Envoy Air                      30091           9783              32.5        24
JetBlue Airways                17966           6462              36.0        31
Alaska Airlines Inc            12169           1279              10.5        7
AirTran Airways Corporation    8776            2581              29.4        16
Hawaiian Airlines Inc.         6007            442               7.0         4
Frontier Airlines Inc.         5647            2100              37.2        26
Virgin America                 4742            686               14.5        10

We set a threshold for airline on-time performance that classifies a flight as delayed when it is 15 or more minutes late; this is indicated in the “delayed” field on the far right of the record, which contains either a yes or no value. The percentage of delayed flights determines the performance of an airline, and it is this metric that is used to check any improvement or reduction in accuracy.
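A sketch of this metric, assuming each record carries the airline name and the arrival delay in minutes (the field layout and names are our assumption):

    from collections import defaultdict

    DELAY_THRESHOLD_MINUTES = 15   # a flight counts as delayed at 15 or more minutes late

    def delayed_percentage_by_airline(records):
        """records: iterable of (airline_name, arrival_delay_minutes) pairs."""
        totals, delayed = defaultdict(int), defaultdict(int)
        for airline, delay_minutes in records:
            totals[airline] += 1
            if delay_minutes >= DELAY_THRESHOLD_MINUTES:
                delayed[airline] += 1
        return {airline: 100.0 * delayed[airline] / totals[airline] for airline in totals}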

Selecting dataset size (and increasing volume)

Using different types of sampling serves the purpose of simulating different industrial data collection scenarios. Sample datasets were selected from the full-size dataset using the following three methods:

Random: Random sampling is equivalent to a situation in which a company collects some data from some parts of its operations, but the collection is not structured in any systematic way and records are gathered at random.

Interval: A different situation arises when a company collects data regularly for a particular production process, but only at certain time/unit intervals, e.g. by sampling and testing a product every 15 minutes. In this case, increasing the volume of collected data can be realised through more frequent sampling.

Partial: In the third case, data has only been collected for a limited time and collection stopped at some arbitrary point. In our experiment, the “partial” method takes the first half of the records in the dataset; as another instance of this method we also took the first third of the records.
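The three methods can be sketched as follows (an illustrative reading of the descriptions above; the exact slicing conventions used in the experiment are not specified in the paper):

    import random

    def partial_sample(records, fraction):
        """Take the first contiguous block of records (e.g. the first half or first third)."""
        return records[: int(len(records) * fraction)]

    def interval_sample(records, fraction):
        """Take every k-th record, mimicking regular but infrequent collection."""
        step = max(1, round(1 / fraction))
        return records[::step]

    def random_sample(records, fraction, seed=None):
        """Take a uniform random selection, mimicking unstructured collection."""
        rng = random.Random(seed)
        return rng.sample(records, int(len(records) * fraction))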


The number of records selected for each of these sampling types was either one half or one third of the complete dataset. To increase the volume of a dataset, the full-size dataset was restored (still including the quality errors), so the initial sample size determines the increase in volume: starting from the half-size dataset, the increase is equivalent to doubling the dataset; starting from the one-third-size dataset, the volume is tripled. These are referred to as doubled volume and tripled volume in the results section.

Error frequency and distribution

In many industrial cases the underlying reason for quality problems is human error. The number of variables influencing human behaviour is so vast that making mistakes can be treated as an essentially random process. Hence, for the experimental scenarios, the records affected by quality problems were selected at random; a 5% error rate means that 5% of the records in the dataset were selected and the value indicating whether the flight was delayed or not was removed.

Error types

The most straightforward scenario is one in which a portion of the data is missing from the dataset. There is a variety of reasons for such a situation, the most common being negligence or technical errors.

A data inconsistency problem occurs when the identifiers (names) used as indices for summarising statistical data are entered in multiple ways. This can happen when a name is entered manually by an employee rather than selected through an automated workflow. Examples of such mistakes include hyphenated or split variants such as ‘Air-Tran Airways Corporation’ or ‘Air Tran Airways Corporation’ instead of the actual ‘AirTran Airways Corporation’.

Although this initially seems to be a consistency problem, it is actually a special case of incompleteness. Because the variations of the actual identifier are not known, they cannot be directly referenced by the numerical analysis. To calculations focusing on, for example, airline on-time performance, these records are not accessible and, although not completely lost, they affect the results in the same way as if they were missing.
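The two error types can be injected as in the following sketch, which is our own illustration: the record layout is assumed, and since the paper does not state how the spelling variants were generated, we simply insert a hyphen or a space at a random position in the identifier.

    import random

    def insert_missing(records, error_rate, seed=None):
        """Blank out the delayed flag in a randomly chosen error_rate share of the records."""
        rng = random.Random(seed)
        damaged = list(records)                      # records are (airline, delayed_flag) tuples
        for i in rng.sample(range(len(damaged)), int(error_rate * len(damaged))):
            damaged[i] = (damaged[i][0], None)
        return damaged

    def misspell(name, rng):
        """Produce an inconsistent variant of an identifier, e.g. 'Air-Tran ...' or 'Air Tran ...'."""
        pos = rng.randrange(1, len(name))
        return name[:pos] + rng.choice("- ") + name[pos:]

    def insert_inconsistent(records, error_rate, seed=None):
        """Replace the airline identifier with an unrecognised spelling variant in a share of the records."""
        rng = random.Random(seed)
        damaged = list(records)
        for i in rng.sample(range(len(damaged)), int(error_rate * len(damaged))):
            damaged[i] = (misspell(damaged[i][0], rng), damaged[i][1])
        return damaged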

Measuring accuracy improvement

For each configuration of the experiment, the improvement obtained by correcting quality problems and the improvement obtained by increasing volume were measured by calculating the true value of the percentage of delayed flights for each airline (computed from the 100% complete dataset that is free of quality problems) and checking the difference from the estimates obtained from the datasets with reduced quality and/or reduced size. Formally, let t denote the underlying true value, let e0 be the initial estimate from the low-quality sample, and let eq and ev be the estimates calculated after improving quality and after increasing dataset size, respectively. The ideal estimate matches the underlying true value, i.e. e = t and hence e − t = 0. We therefore define the inaccuracy of an estimate e as its distance from the true value:

    inaccuracy(e) = |e − t|


Since the ideal inaccuracy is 0, we then calculate the relative improvements as the difference between the absolute inaccuracy of the initial estimate and that of each improved estimate:

    Δq = |e0 − t| − |eq − t|        Δv = |e0 − t| − |ev − t|

Note that the value of Δ will be negative if the improved estimate is worse than the initial one. Since a dataset can have multiple categories n over which to estimate statistics (14 in the case of the airline on-time performance database, one per airline), the resulting improvements in accuracy have to be summarised across all of them. In order to separate the occasional cases where improvements happen to be negative, we use a positive and a negative sum, defined as follows:

    Δ+ = sum of Δi over all categories i with Δi > 0        Δ− = sum of Δi over all categories i with Δi < 0

(these correspond to the Q+/Q− rows for quality improvement and the V+/V− rows for volume increase in the results tables).
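In code, this bookkeeping amounts to the following sketch (the function names are ours):

    def inaccuracy(estimate, true_value):
        """Distance of an estimate from the underlying true value."""
        return abs(estimate - true_value)

    def improvement(e0, e_improved, true_value):
        """Positive when the improved estimate is closer to the true value than the initial one."""
        return inaccuracy(e0, true_value) - inaccuracy(e_improved, true_value)

    def positive_negative_sums(improvements):
        """Summarise per-airline improvements, keeping gains and losses separate."""
        positive = sum(d for d in improvements if d > 0)
        negative = sum(d for d in improvements if d < 0)
        return positive, negative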

Results

This section presents the results of the experiments for all scenarios. The interpretation of the results focuses on answering the question of whether increasing dataset size or improving data quality is the better choice for improving the accuracy of the result.

Because of the random variation in the error distribution, the results differ for each execution of the experiment. We therefore executed each configuration 100 times and present the average accuracy improvement (shown as the bars of a bar chart) together with the minimum and maximum values (shown as the whiskers).
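A sketch of that aggregation (our own illustration), where `run` stands for one execution of a given configuration returning a single improvement value:

    def aggregate(run, repetitions=100):
        """Repeat a configuration with different random error placements and summarise the outcomes."""
        values = [run(seed) for seed in range(repetitions)]
        mean = sum(values) / len(values)
        return mean, max(values), min(values)   # bar height, upper whisker, lower whisker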


Figure 10: Comparison of accuracy improvement between doubled volume and improved quality for quality error rates of 5%, 10% and 20% for the missing data error. For each rate there are two alternative fields: the grey field shows the accuracy improvement when quality is corrected and the white field shows the results for increased dataset size. There are three values for each measurement, based on the three different ways of selecting the initial dataset: partial, interval and random.

Table 13: Data values for missing data – doubled volume

Sampling method      Partial                  Interval                 Random
Error rate           mean    max     min      mean    max     min      mean    max     min
5%    Q+             0.73    1.3     0.2      0.585   1       0        0.735   1.2     0.2
      Q-            -0.505  -0.1    -1.2     -0.445   0      -1       -0.34    0      -0.8
      V+            80.26   81.2    79.4      2.025   2.8     1.2      2.835   4       1.4
      V-             0       0       0       -0.15    0      -0.5     -0.09    0      -0.5
10%   Q+             1.025   1.7     0.5      0.97    2       0.3      0.995   1.9     0.4
      Q-            -0.69   -0.2    -1.5     -0.515  -0.2    -0.9     -0.38   -0.2    -0.9
      V+            80.04   81.3    78.9      2.045   3.1     1.1      2.8     3.8     1.6
      V-             0       0       0       -0.185   0      -0.7     -0.165   0      -0.7
20%   Q+             1.555   2.8     0.6      1.31    2.8     0.4      1.355   2.5     0.5
      Q-            -1.02   -0.4    -2.1     -0.555  -0.2    -1       -0.59   -0.2    -1
      V+            79.58   80.6    78.2      1.825   3       1        2.485   3.6     1.1
      V-             0       0       0       -0.325   0      -0.8     -0.36   -0.1    -0.6


Missing data

Results from partial, interval and random sampling for all configurations show that, in general, more accurate results are obtained by doubling the dataset size (see Figure 10, Figure 11, Table 13 and Table 14). To explain this result, consider as an example the configuration that appears most favourable for quality improvement: a 20% error rate with doubled volume gives the largest possible quality improvement and the smallest dataset enlargement. Given an initial sample of size s, the amount of usable information in the reduced-quality sample equals 0.8s. Doubling the size of the sample gives 1.6s, compared with s obtained by recreating the missing data. Since increasing volume offers 60% more data than quality improvement, the accuracy is naturally higher.
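The arithmetic behind this argument, written out as a quick check (not part of the reported experiment):

    s = 1.0                                   # size of the initial sample (arbitrary units)
    usable_after_cleaning = 1.0 * s           # every missing value recreated
    usable_after_doubling = 0.8 * (2 * s)     # twice as many records, 20% of them still unusable
    print(usable_after_doubling / usable_after_cleaning)   # 1.6, i.e. 60% more usable data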

Figure 11: Comparison of accuracy improvement between tripled volume and improved quality for quality error rates of 5%, 10% and 20% for the missing data error


Table 14: Data values for missing data – tripled volume

Sampling method      Partial                    Interval                 Random
Error rate           mean     max     min       mean    max     min      mean    max     min
5%    Q+             0.79     1.9     0.2       0.746   1.6     0.1      0.7     1.4     0.1
      Q-            -0.669   -0.1    -1.7      -0.502   0      -1.1     -0.585   0      -1.4
      V+           190.455  191.7   188.8       2.983   4.1     1.7      4.442   8       2.4
      V-             0        0       0        -0.105   0      -0.5     -0.105   0      -0.4
10%   Q+             1.154    2.9     0         1.186   2.3     0.1      1.101   2.3     0.3
      Q-            -0.955   -0.1    -2.1      -0.621   0      -1.3     -0.698  -0.1    -1.5
      V+           190.201  192.3   187.9       3.008   4.4     1.5      4.42    8.3     2.4
      V-             0        0       0        -0.141   0      -0.5     -0.127   0      -0.6
20%   Q+             1.689    3.5     0.3       1.834   3.7     0.4      1.748   3.5     0.4
      Q-            -1.45    -0.1    -3.6      -0.741  -0.1    -1.7     -0.843   0      -2.8
      V+           189.623  191.9   186.3       3.055   4.8     1.4      4.425   8.5     2
      V-            -0.003    0      -0.3      -0.281   0      -1       -0.251   0      -1

When improving data quality the dataset remains a sample, and the airline performance is therefore still an estimate calculated from that sample. In our experiment, increasing the dataset size results in obtaining the full population of records (minus those that are effectively excluded because of a missing value in the delayed field, or an inconsistent value in a filtering field), so this too is a sample of the population. Thus, improving quality is equivalent to adding records that were excluded because of quality issues, and increasing volume is equivalent to adding records that were not captured by the sampling process. Overall, for the problem of missing and inconsistent values, whether to choose improving quality or increasing dataset size therefore depends on how many records can be added. It also depends on how the errors are spread between the current sample and the “new” records that will be obtained when increasing dataset size.

Consider the drastically different results for the partial sampling method compared with the random and interval methods (in both Figure 10 and Figure 11). This suggests that the distribution of delays is not balanced between the beginning and the end of the time period analysed; that is, the first half or third of the data is not statistically representative of the whole period. The clustering of delays is likely to be caused by external circumstances (e.g. weather) that may affect all airlines at the same point in time. Hence, the partial sampling approach (where a whole chunk of contiguous records is selected) is more likely to pick up either a high or a low number of delays. This contrasts with random and interval sampling, which gather a more evenly distributed set of delays.

Therefore, when considering increasing volume as the option for improving results, one must also consider:

the number of errors in the “new” set of records, which will be influenced by:

1. the distribution of errors in the population (e.g. are the errors clustered?)


2. the sampling method for obtaining the new records (is this likely to collect records which have a high/low number of erroneous records?)

3. what data is being selected by the query/filter (does the value of interest contain many errors, is the filtering value likely to contain missing or inconsistent values, and what is the distribution of the value of interest, e.g. is it clustered like the delays caused by bad weather?)

Similarly, when considering data quality improvement as the option for improving results, one should consider:

the number of errors in the current dataset and how they are likely to be spread throughout the entire population.

If a low proportion of the total errors is in the current dataset, then improving quality will yield only minimal gains, whereas if a high proportion of the total errors is in the current dataset, then improving data quality is likely to be a good option. Whether to choose improving quality or increasing dataset size therefore depends on these variables, and a combination of quality improvement and increased dataset size may yield the best results given the constraints of time, resources and the cost to collect or improve data.

Summary and Future Work

This paper reports an experimental study addressing the question: can industrial decision makers arrive at the correct decision by increasing the size of the dataset, without having to correct data quality?

We have shown that, for the missing and inconsistent data quality problems, correcting data quality is in certain circumstances equivalent to adding more data records. Hence, managers in industrial organisations could consider applying either approach to improve the data that supports their decisions, for the types of problems and scenarios discussed in this paper.

In future work we aim to test other data quality problems, besides missing and inconsistent values, in order to determine, for a broad range of issues found in industry, what the effects are and what the trade-off is between improving quality and increasing dataset size.

References

Mayer-Schonberger, V. and Cukier, K. (2013). Big Data: A Revolution That Will Transform How We Live, Work and Think. John Murray.

Suciu, D. (2013). Big Data Begets Big Database Theory. In G. Gottlob et al., eds. Big Data. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 1–5. Available at: http://link.springer.com/chapter/10.1007/978-3-642-39467-6_1 [Accessed August 29, 2014].

Wang, R.Y. and Strong, D.M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 12(4), pp. 5–34.

Woodall, P. et al. (2014). An investigation of how data quality is affected by dataset size in the context of Big Data analytics. In Proceedings of the International Conference on Information Quality.


Appendix 1: Programme

09:00  Registration and coffee

09:30  Welcome to CIM (Prof Tom Jackson, CIM Director)

09:45  Opening Keynote: Open Data and Linked Data – how can they help your organisation? (Mark Harrison, University of Cambridge)

Parallel 1 – A / Parallel 1 – B
10:30  A: The implementation of Basel Committee BCBS 239: An industry-wide challenge for international data management (Malcolm Chisholm)
       B: Assessing trustworthiness in digital information (Laura Sbaffi)
10:50  A: Using Big Data in banking to create financial stress indicators and diagnostics: Lessons from the SNOMED CT ontology (Alistair Milne)
       B: Exploring vendor’s capabilities for cloud services delivery: A case study of a large Chinese service provider (Gongtao Zhang)
11:10  A: Re-purposing manufacturing data: a survey (Philip Woodall)
       B: Exploring different information needs in Building Information Modelling (BIM) using soft systems (Mohammad Mayouf)

11:30  Refreshments and poster session

12:00  The role of social networks in mobilizing knowledge (Jacky Swan, University of Warwick)

12:40  Lunch and poster session

13:40  A new Era of Knowledge Management? Reflections on the implications of ubiquitous computing (Sue Newell, University of Sussex)

14:20  Can you make “Agile” work with a global team? (Peter Cooke, Ford Motor Company)

15:00  Refreshments and poster session

Parallel 2 – A / Parallel 2 – B
15:30  A: UK Data Service: creating economic and social science metadata microcosms (Lucy Bell)
       B: Managing the risks of internal uses of microblogging within small and medium enterprises (Soureh Latif Shabgahi)
15:50  A: Adopting a situated learning framework for (big) data projects (Martin Douglas)
       B: Learning about information management across projects (Sunila Lobo)
16:10  The use of ontologies to gauge risks associated with the use and reuse of E-Government services (Onyekachi Onwudike)

16:30  Best paper and poster awards (Tom Jackson and Crispin Coombs); closing remarks

17:00  Close


www.lboro.ac.uk/cim

ISBN: 978-1-905499-51-9