
Albert Y. Zomaya · Sherif Sakr
Editors

Handbook of Big Data Technologies

Foreword by Sartaj Sahni, University of Florida


Editors
Albert Y. Zomaya
School of Information Technologies
The University of Sydney
Sydney, NSW, Australia

Sherif Sakr
The School of Computer Science
The University of New South Wales
Eveleigh, NSW, Australia

and

King Saud Bin Abdulaziz University for Health Sciences
Riyadh, Saudi Arabia

ISBN 978-3-319-49339-8    ISBN 978-3-319-49340-4 (eBook)
DOI 10.1007/978-3-319-49340-4

Library of Congress Control Number: 2016959184

© Springer International Publishing AG 2017
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To the loving memory of my Grandparents.

Albert Y. Zomaya

To my wife, Radwa, my daughter, Jana, and my son, Shehab, for their love, encouragement, and support.

Sherif Sakr

Foreword

Handbook of Big Data Technologies (edited by Albert Y. Zomaya and Sherif Sakr) is an exciting and well-written book that deals with a wide range of topical themes in the field of Big Data. The book probes many issues related to this important and growing field—processing, management, analytics, and applications.

Today, we are witnessing many advances in Big Data research and technologies brought about by developments in big data algorithms, high performance computing, databases, data mining, and more. In addition to covering these advances, the book showcases critical evolving applications and technologies. These developments in Big Data technologies will lead to serious breakthroughs in science and engineering over the next few years.

I believe that the current book is a great addition to the literature. It will serve as a keystone of gathered research in this continuously changing area. The book also provides an opportunity for researchers to explore the use of advanced computing technologies and their impact on enhancing our capabilities to conduct more sophisticated studies.

The book will be well received by the research and development community and will be beneficial for researchers and graduate students focusing on Big Data. Also, the book is a useful reference source for practitioners and application developers.

Finally, I would like to congratulate Profs. Zomaya and Sakr on a job well done!

Sartaj Sahni
University of Florida
Gainesville, FL, USA


Preface

We live in the era of Big Data. We are witnessing radical expansion and integration of digital devices, networking, data storage, and computation systems. Data generation and consumption is becoming a main part of people's daily life, especially with the pervasive availability and usage of Internet technology and applications. In the enterprise world, many companies continuously gather massive datasets that store customer interactions, product sales, and results from advertising campaigns on the Web, in addition to various other types of information. The term Big Data has been coined to reflect the tremendous growth of the world's digital data, which is generated from various sources and in many formats. Big Data has attracted a lot of interest from both the research and industrial worlds, with a goal of creating the best means to process, analyze, and make the most of this data.

This handbook presents comprehensive coverage of recent advancements in Big Data technologies and related paradigms. Chapters are authored by international leading experts in the field, and all contributions have been reviewed and revised for maximum reader value. The volume consists of twenty-five chapters organized into four main parts. Part I covers the fundamental concepts of Big Data technologies, including data curation mechanisms, data models, storage models, programming models, and programming platforms. It also dives into the details of implementing Big SQL query engines and big stream processing systems. Part II focuses on the semantic aspects of Big Data management, including data integration and exploratory ad hoc analysis, in addition to structured querying and pattern matching techniques. Part III presents a comprehensive overview of large-scale graph processing. It covers the most recent research in large-scale graph processing platforms, introducing several scalable graph querying and mining mechanisms in domains such as social networks. Part IV details novel applications that have been made possible by the rapid emergence of Big Data technologies, such as Internet-of-Things (IoT), Cognitive Computing, and SCADA Systems. All parts of the book discuss open research problems, including potential opportunities, that have arisen from the rapid progress of Big Data technologies and the associated increasing requirements of application domains. We hope that our readers will benefit from these discussions to enrich their own future research and development.


This book is a timely contribution to the growing Big Data field, designed for researchers, IT professionals, and graduate students. Big Data has been recognized as one of the leading emerging technologies that will have a major contribution and impact on the various fields of science and various aspects of human society over the coming decades. Therefore, the content of this book will be an essential tool to help readers understand the development and future of the field.

Sydney, Australia    Albert Y. Zomaya
Eveleigh, Australia; Riyadh, Saudi Arabia    Sherif Sakr


Contents

Part I Fundamentals of Big Data Processing

Big Data Storage and Data Models
Dongyao Wu, Sherif Sakr and Liming Zhu

Big Data Programming Models
Dongyao Wu, Sherif Sakr and Liming Zhu

Programming Platforms for Big Data Analysis
Jiannong Cao, Shailey Chawla, Yuqi Wang and Hanqing Wu

Big Data Analysis on Clouds
Loris Belcastro, Fabrizio Marozzo, Domenico Talia and Paolo Trunfio

Data Organization and Curation in Big Data
Mohamed Y. Eltabakh

Big Data Query Engines
Mohamed A. Soliman

Large-Scale Data Stream Processing Systems
Paris Carbone, Gábor E. Gévay, Gábor Hermann, Asterios Katsifodimos, Juan Soto, Volker Markl and Seif Haridi

Part II Semantic Big Data Management

Semantic Data Integration
Michelle Cheatham and Catia Pesquita

Linked Data Management
Manfred Hauswirth, Marcin Wylot, Martin Grund, Paul Groth and Philippe Cudré-Mauroux

Non-native RDF Storage Engines
Manfred Hauswirth, Marcin Wylot, Martin Grund, Sherif Sakr and Philippe Cudré-Mauroux

Exploratory Ad-Hoc Analytics for Big Data
Julian Eberius, Maik Thiele and Wolfgang Lehner

Pattern Matching Over Linked Data Streams
Yongrui Qin and Quan Z. Sheng

Searching the Big Data: Practices and Experiences in Efficiently Querying Knowledge Bases
Wei Emma Zhang and Quan Z. Sheng

Part III Big Graph Analytics

Management and Analysis of Big Graph Data: Current Systems and Open Challenges
Martin Junghanns, André Petermann, Martin Neumann and Erhard Rahm

Similarity Search in Large-Scale Graph Databases
Peixiang Zhao

Big-Graphs: Querying, Mining, and Beyond
Arijit Khan and Sayan Ranu

Link and Graph Mining in the Big Data Era
Ana Paula Appel and Luis G. Moyano

Granular Social Network: Model and Applications
Sankar K. Pal and Suman Kundu

Part IV Big Data Applications

Big Data, IoT and Semantics
Beniamino di Martino, Giuseppina Cretella and Antonio Esposito

SCADA Systems in the Cloud
Philip Church, Harald Mueller, Caspar Ryan, Spyridon V. Gogouvitis, Andrzej Goscinski, Houssam Haitof and Zahir Tari

Quantitative Data Analysis in Finance
Xiang Shi, Peng Zhang and Samee U. Khan

Emerging Cost Effective Big Data Architectures
K. Ashwin Kumar

Bringing High Performance Computing to Big Data Algorithms
H. Anzt, J. Dongarra, M. Gates, J. Kurzak, P. Luszczek, S. Tomov and I. Yamazaki

Cognitive Computing: Where Big Data Is Driving Us
Ana Paula Appel, Heloisa Candello and Fábio Latuf Gandour

Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges
Dinusha Vatsalan, Ziad Sehili, Peter Christen and Erhard Rahm


Part I Fundamentals of Big Data Processing

Big Data Storage and Data Models

Dongyao Wu, Sherif Sakr and Liming Zhu

Abstract Data and storage models are the basis for big data ecosystem stacks. While the storage model captures the physical aspects and features of data storage, the data model captures the logical representation and structures for data processing and management. Understanding storage and data models together is essential for understanding the big data ecosystems built on top of them. In this chapter we investigate and compare the key storage and data models in the spectrum of big data frameworks.

The growing demand for storing and processing large-scale datasets has been driving the development of data storage and database systems in the last decade. Data storage has been improved and enhanced from local storage to clustered, distributed and cloud-based storage. Additionally, database systems have migrated from traditional RDBMS to the more recent NoSQL-based systems. In this chapter, we present the major storage and data models, with illustrations of related example systems in big data scenarios and contexts, based on the taxonomy of data store systems and platforms illustrated in Fig. 1.

1 Storage Models

A storage model is the core of any big data-related system. It affects the scalability, data structures, programming and computational models of the systems that are built on top of it [1, 2].

D. Wu (B) · S. Sakr · L. Zhu
Data61, CSIRO, Sydney, Australia
e-mail: [email protected]

D. Wu · S. Sakr
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia

S. Sakr
National Guard, King Saud Bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia

© Springer International Publishing AG 2017
A.Y. Zomaya and S. Sakr (eds.), Handbook of Big Data Technologies,
DOI 10.1007/978-3-319-49340-4_1



Fig. 1 Taxonomy of data stores and platforms

Understanding the underlying storage model is also key to understanding the entire spectrum of big data frameworks. To address different considerations and priorities, three main storage models have been developed over the past few decades, namely Block-based Storage, File-based Storage and Object-based Storage.

1.1 Block-Based Storage

Block-level storage is one of the most classical storage models in computer science. A traditional block-based storage system presents itself to servers using industry-standard Fibre Channel and iSCSI [3] connectivity mechanisms. Basically, block-level storage can be considered as a hard drive in a server, except that the hard drive might be installed in a remote chassis and is accessible using Fibre Channel or iSCSI. In block-based storage, data is stored as blocks which normally have a fixed size yet carry no additional information (metadata); a unique identifier is used to access each block. Block-based storage focuses on performance and scalability to store and access very large-scale data. As a result, it is usually used as a low-level storage paradigm underlying higher-level storage systems such as file-based systems, object-based systems and transactional databases.


Fig. 2 Block-based storage model

1.1.1 Architecture

A simple model of block-based storage can be seen in Fig. 2. Basically, data is stored as blocks which normally have a fixed size yet carry no additional information (metadata). A unique identifier is used to access each block; the identifier is mapped to the exact location of the actual data blocks through the access interfaces. Traditionally, block-based storage is bound to physical storage protocols, such as SCSI [4], iSCSI, ATA [5] and SATA [6].

With the development of distributed computing and big data, the block-based storage model has also been extended to support distributed and cloud-based environments. As we can see from Fig. 3, the architecture of a distributed block-storage system is composed of a block server and a group of block nodes. The block server is responsible for maintaining the mapping or indexing from block IDs to the actual data blocks in the block nodes. The block nodes are responsible for storing the actual data in fixed-size partitions, each of which is considered a block.
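As a minimal illustration (a hypothetical sketch, not code from any particular system), the following Python fragment shows a block server that splits data into fixed-size blocks, spreads them across block nodes and retrieves them again purely by block ID:

```python
import uuid

BLOCK_SIZE = 4096  # fixed block size in bytes

class BlockNode:
    """Holds raw fixed-size blocks, addressed only by block ID (no metadata)."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

class BlockServer:
    """Maintains the mapping from block IDs to the node holding each block."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.index = {}  # block_id -> BlockNode

    def write(self, data):
        block_ids = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = uuid.uuid4().hex
            node = self.nodes[hash(block_id) % len(self.nodes)]
            node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.index[block_id] = node
            block_ids.append(block_id)
        return block_ids  # callers must keep the IDs to read the data back

    def read(self, block_ids):
        return b"".join(self.index[bid].blocks[bid] for bid in block_ids)

store = BlockServer([BlockNode() for _ in range(3)])
ids = store.write(b"some raw payload " * 1000)
assert store.read(ids) == b"some raw payload " * 1000
```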


Fig. 3 Architecture of distributed block-based storage

1.1.2 Amazon Elastic Block Store (Amazon EBS)

Amazon Elastic Block Store (Amazon EBS) [7] is a block-level storage service used for AWS EC2 (Elastic Compute Cloud) [8] instances hosted in the Amazon cloud platform. Amazon EBS can be considered as a massive SAN (Storage Area Network) in the AWS infrastructure; the physical storage could be hard disks, SSDs, etc. under the EBS architecture. Amazon EBS is one of the most important and heavily used storage services of AWS; even the building-block component offerings from AWS such as RDS [9], DynamoDB [10] and CloudSearch [11] rely on EBS in the cloud. In Amazon EBS, block volumes are automatically replicated within the availability zone to protect against data loss and failures, which also provides high availability and durability for users. EBS volumes can be used just as traditional block devices and simply plugged into EC2 virtual machines. In addition, users can scale their volumes up or down within minutes. Since the lifetime of an Amazon EBS volume is separate from the instance on which it is mounted, users can detach a volume and later attach it to another EC2 instance in the same availability zone.
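As a hedged illustration of this workflow, the sketch below uses the boto3 AWS SDK to create a volume, attach it to an EC2 instance and later detach it; the region, zone and instance ID are placeholders, and configured AWS credentials are assumed:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

# Create a 100 GiB volume; EBS replicates it within the availability zone.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp2")
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach the volume to an existing instance, where it appears as a block device.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",
)

# The volume's lifetime is independent of the instance: it can be detached
# and re-attached to another instance in the same availability zone.
ec2.detach_volume(VolumeId=volume["VolumeId"])
```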

1.1.3 OpenStack Cinder and Nova

In an open-source cloud such as OpenStack [12], the block storage service is provided by the Nova [13] system working with the Cinder [14] system. When you start a Nova compute instance, it comes configured with some block storage devices by default, at the very least to hold the read/write partitions of the running OS. These block storage instances can be "ephemeral" (the data goes away when the compute instance stops) or "persistent" (the data is kept and can be used again after the compute instance stops), depending on the configuration of the OpenStack system being used.


Fig. 4 File-based storage model

Cinder manages the creation, attaching and detaching of block devices to instances in OpenStack. Block storage volumes are fully integrated into OpenStack Compute and the Dashboard, allowing cloud users to manage their own storage on demand. Data in volumes is replicated and also backed up through snapshots. In addition, snapshots can be restored or used to create a new block storage volume.
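The following sketch shows how this workflow might look with the openstacksdk Python library; it assumes a cloud profile named "mycloud" in clouds.yaml and an existing server named "my-server" (both placeholders), and exact call signatures vary across SDK releases:

```python
import openstack

# Credentials and endpoints are read from clouds.yaml for the "mycloud" profile.
conn = openstack.connect(cloud="mycloud")

# Cinder: create a 10 GiB persistent volume.
volume = conn.block_storage.create_volume(name="data-vol", size=10)
conn.block_storage.wait_for_status(volume, status="available")

# Nova: attach the volume to a running compute instance.
server = conn.compute.find_server("my-server")  # placeholder server name
conn.compute.create_volume_attachment(server, volume_id=volume.id)

# Snapshots back up the volume and can later seed new volumes.
conn.block_storage.create_snapshot(volume_id=volume.id, name="data-vol-snap")
```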

1.2 File-Based Storage

File-based storage inherits from the traditional file system architecture and considers data as files that are maintained in a hierarchical structure. It is the most common storage model and is relatively easy to implement and use. In a big data scenario, a file-based storage system can be built on some other low-level abstraction (such as the block-based or object-based model) to improve its performance and scalability.

1.2.1 Architecture

The file-based storage paradigm is shown in Fig. 4. File paths are organized in a hierarchy and are used as the entries for accessing data in the physical storage. In a big data scenario, distributed file systems (DFS) are commonly used as basic storage systems. Figure 5 shows a typical architecture of a distributed file system, which normally contains one or several name nodes and a number of data nodes. The name node is responsible for maintaining the file entry hierarchy for the entire system while the data nodes are responsible for the persistence of file data.

In a file-based system, a user needs to know the namespaces and paths in order to access the stored files. For sharing files across systems, the path or namespace of a file includes three main parts: the protocol, the domain name and the path of the file. For example, an HDFS [15] file can be indicated as: "[hdfs://][ServerAddress:ServerPort]/[FilePath]" (Fig. 6).
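For example, a client can read and write such paths over WebHDFS with the hdfs Python package; this is a sketch, and the name node address, port and file paths are placeholders:

```python
from hdfs import InsecureClient  # WebHDFS client from the `hdfs` package

# The client contacts the name node, which resolves paths to data nodes.
client = InsecureClient("http://namenode.example.com:9870", user="hdfs")

# Write a file; HDFS partitions and replicates its blocks across data nodes.
client.write("/data/events/part-00000.txt", data=b"hello hdfs\n", overwrite=True)

# Read it back; the name node directs the client to a live replica.
with client.read("/data/events/part-00000.txt") as reader:
    print(reader.read())
```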


Fig. 5 Architecture of distributed file systems

Fig. 6 Architecture of Hadoop distributed file systems

For a distributed infrastructure, replication is very important for providing fault tolerance in file-based systems. Normally, every file has multiple copies stored on the underlying storage nodes, and if one of the copies is lost or fails, the name node can automatically find the next available copy to make the failure transparent to users.


1.2.2 NFS-Family

Network File System (NFS) is a distributed file system protocol originally developed by Sun Microsystems. Basically, a Network File System allows remote hosts to mount file systems over a network and interact with those file systems as though they were mounted locally. This enables system administrators to consolidate resources onto centralized servers on the network. NFS is built on the Open Network Computing Remote Procedure Call (ONC RPC) system. NFS has been widely used in Unix and Linux-based operating systems and has also inspired the development of modern distributed file systems. There have been three main generations (NFSv2, NFSv3 and NFSv4) of the NFS protocol due to the continuous development of storage technology and the growth of user requirements.

An NFS deployment consists of a few servers and many clients. The clients remotely access the data that is stored on the server machines. In order for this to function properly, a few processes have to be configured and running. NFS is well-suited for sharing entire file systems with a large number of known hosts in a transparent manner. However, with ease of use comes a variety of potential security problems. Therefore, NFS also provides two basic options for access control of shared files (see the example configuration after the list):

• First, the server restricts which hosts are allowed to mount which file systems, either by IP address or by host name.

• Second, the server enforces file system permissions for users on NFS clients in the same way it does for local users.
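A minimal /etc/exports entry on the server (host names and paths below are examples only) illustrates the first control; once a client has mounted the share, the ordinary file system permissions then apply as they would locally:

```
# /etc/exports: one host gets read-write access, a whole subnet read-only
/srv/share   client1.example.com(rw,sync)   192.168.1.0/24(ro,sync)
```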

1.2.3 HDFS

HDFS (Hadoop Distributed File System) [15] is an open source distributed file system written in Java. It is the open source implementation of Google File System (GFS) and works as the core storage for Hadoop ecosystems and the majority of existing big data platforms. HDFS inherits its design principles from GFS to provide highly scalable and reliable data storage across a large set of commodity server nodes [16]. HDFS has demonstrated production scalability of up to 200 PB of storage in a single cluster of 4500 servers, supporting close to a billion files and blocks. Basically, HDFS is designed to serve the following goals:

• Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is expected to be frequent. Therefore, HDFS has mechanisms for quick and automatic fault detection and recovery.

• Huge datasets: HDFS scales to hundreds of nodes per cluster to manage applications with huge datasets.

• Hardware at data: A requested task can be executed efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.


As shown in Fig. 6, the architecture of HDFS consists of a name node and a set of data nodes. The name node manages the file system namespace, regulates access to files and also executes file system operations such as renaming, closing, etc. A data node performs read-write operations on the actual data stored on that node and also performs operations such as block creation, deletion and replication according to the instructions of the name node.

Data in HDFS is seen as files and is automatically partitioned and replicated within the cluster. The storage capacity of HDFS grows almost linearly by adding new data nodes to the cluster. HDFS also provides an automated balancer to improve the utilization of the cluster storage. In addition, recent versions of HDFS have introduced a backup node to address the single point of failure represented by the primary name node.

1.3 Object-Based Storage

The object-based storage model was first introduced on Network Attached Secure Devices [17] to provide more flexible data containers called objects. Over the past decade, object-based storage has been further developed, with investments made by both system vendors such as EMC, HP, IBM and Redhat and cloud providers such as Amazon, Microsoft and Google.

In the object-based storage model, data is managed as objects. As shown in Fig. 7, every object includes the data itself, some meta-data, attributes and a globally unique object identifier (OID). The object-based storage model abstracts the lower layers of storage away from administrators and applications. Object storage systems can be implemented at different levels, including the device level, the system level and the interface level.

Data is exposed and managed as objects which include additional descriptive meta-data that can be used for better indexing or management. Meta-data can be anything from security, privacy and authentication properties to any application-associated information.
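A schematic Python representation (purely illustrative) makes the three parts of an object explicit:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class StorageObject:
    """An object bundles its payload with meta-data and a globally unique OID."""
    data: bytes
    metadata: dict = field(default_factory=dict)  # security, app-specific info, ...
    oid: str = field(default_factory=lambda: uuid.uuid4().hex)

photo = StorageObject(
    data=b"...jpeg bytes...",
    metadata={"content-type": "image/jpeg", "owner": "alice", "acl": "private"},
)
# A flat object store then only needs an OID -> object mapping:
bucket = {photo.oid: photo}
print(bucket[photo.oid].metadata["owner"])
```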

Fig. 7 Object-based storage model


Fig. 8 Architecture of object-based storage

1.3.1 Architecture

The typical architecture of an object-based storage system is shown in Fig. 8. As we can see from the figure, an object-based storage system normally uses a flat namespace, in which the identifiers of data and their locations are usually maintained as key-value pairs in the object server. In principle, the object server provides location-independent addressing and constant lookup latency for reading every object. In addition, meta-data is separated from the data and is maintained as objects in a meta-data server (which might be co-located with the object server). As a result, this provides a standard and easier way of processing, analyzing and manipulating the meta-data without affecting the data itself.

Due to the flat architecture, it is very easy to scale out object-based storage systems by adding additional storage nodes to the system. Besides, the added storage can be automatically expanded as capacity that is available to all users. Drawing on the object container and the meta-data maintained, such systems can also provide much more flexible and fine-grained data policies at different levels; for example, Amazon S3 [18] provides bucket-level policies, Azure [19] provides storage-account-level policies and Atmos [20] provides per-object policies.

1.3.2 Amazon S3

Amazon S3 (Simple Storage Service) [18] is a cloud-based object storage system offered by Amazon Web Services (AWS). It has been widely used for online backup and archiving of data and application programs. Although the architecture and implementation of S3 are not published, it has been designed with high scalability, availability and low latency at commodity costs.


In S3, data is stored as arbitrary objects with up to 5 terabytes of data and up to 2 kilobytes of meta-data. These data objects are organized into buckets which are managed by AWS accounts and authorized based on the AMI identifier and private keys. S3 supports data/object manipulation operations such as creation, listing and retrieval through either RESTful HTTP interfaces or SOAP-based interfaces. In addition, objects can also be downloaded using the BitTorrent protocol, in which each bucket is served as a feed. S3 claims to guarantee a 99.9% SLA by using technologies such as redundant replication, failover support and fast data recovery.

S3 was intentionally designed with a minimal feature set and was created to make web-scale computing easier for developers. The service gives users access to the same systems that Amazon uses to run its own web sites. S3 employs a simple web-based interface and uses encryption for the purpose of user authentication. Users can choose to keep their data private or make it publicly accessible, and can even encrypt data prior to writing it out to storage.
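As a brief, hedged example of that interface, the following boto3 sketch stores and retrieves an object together with user-defined meta-data; the bucket name and keys are placeholders and valid AWS credentials are assumed:

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-handbook-bucket"  # placeholder; bucket names are globally unique

# Store an object along with user-defined meta-data.
s3.put_object(
    Bucket=bucket,
    Key="backups/2017/db.dump",
    Body=b"...payload...",
    Metadata={"origin": "nightly-backup", "schema-version": "3"},
)

# Retrieve it; the response carries both the data and its meta-data.
obj = s3.get_object(Bucket=bucket, Key="backups/2017/db.dump")
print(obj["Metadata"], obj["Body"].read()[:16])

# List the bucket's contents.
for entry in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
    print(entry["Key"], entry["Size"])
```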

1.3.3 EMC Atmos

EMC Atmos [20] is an object-based storage services platform developed by EMC Corporation. Atmos can be deployed either as a hardware appliance or as software in a virtual environment such as a cloud. Atmos is designed based on the object storage architecture, aiming to manage petabytes of information and billions of objects across multiple geographic locations as a single system. In addition, Atmos supports two forms of replication: synchronous replication and asynchronous replication. For a particular object, both types of replication can be specified, depending on the needs of the application and the criticality of the data.

Atmos can be used as a data storage system for custom or packaged applications using either a REST or SOAP data API, or even traditional storage interfaces like NFS and CIFS. It stores information as objects (files + metadata) and provides a single unified namespace/object-space which is managed by user- or administrator-defined policies. In addition, EMC has recently added support for the Amazon S3 application interfaces that allows the movement of data from S3 to any Atmos public or private cloud.

1.3.4 OpenStack Swift

Swift [21] is a scalable, redundant and distributed object storage system for the OpenStack cloud platform. With the data replication service of OpenStack, objects and files in Swift are written to multiple nodes that are spread throughout the cluster in the data center. Storage in Swift can scale horizontally simply by adding new servers. If a server or hard drive fails, Swift automatically replicates its content from other active nodes to new locations in the cluster. Swift uses software logic to ensure data replication and distribution across different devices. In addition, inexpensive commodity hard drives and servers can be used for Swift clusters (Fig. 9).


Fig. 9 Architecture of the Swift object store

The architecture of Swift consists of several components, including a proxy server, account servers, container servers and object servers:

• The Proxy Server is responsible for tying together the rest of the Swift architecture. It exposes the Swift API to users and streams objects to and from the client based on requests.

• The Object Server is a simple blob storage server which handles storage functions such as the retrieval and deletion of objects stored on local devices.

• The Container Server is responsible for handling the listings of objects. Objects in Swift are logically organized in specific containers. The listing relations are stored as SQLite database files and replicated across the cluster.

• The Account Server is similar to the Container Server except that it is responsible for the listings of containers rather than objects.

Objects in Swift are accessed through REST interfaces, and can be stored, retrieved and updated on demand. The object store can be easily scaled across a large number of servers. Swift uses rings to keep track of the locations of partitions and replicas for objects and data.
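The python-swiftclient library wraps these REST interfaces; the sketch below (auth endpoint, account and key are placeholders) stores an object in a container and reads it back:

```python
from swiftclient.client import Connection

# Authenticate against the cluster's auth endpoint (v1 auth, placeholders below).
conn = Connection(
    authurl="http://swift.example.com:8080/auth/v1.0",
    user="account:user",
    key="secret",
)

conn.put_container("photos")
conn.put_object("photos", "cat.jpg", contents=b"...jpeg bytes...",
                content_type="image/jpeg")

# get_object returns the response headers and the object body.
headers, body = conn.get_object("photos", "cat.jpg")
print(headers["content-type"], len(body))
```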

1.4 Comparison of Storage Models

In practice, there is no perfect model which can suit all possible scenarios. Therefore, developers and users should choose a storage model according to their application requirements and context. Basically, each of the storage models that we have discussed in this section has its own pros and cons.


1.4.1 Block-Based Model

• Block-based storage is known for its flexibility, versatility and simplicity. In a block-level storage system, raw storage volumes (composed of a set of blocks) are created, and then a server-based system connects to these volumes and uses them as individual storage drives. This makes block-based storage usable for almost any kind of application, including file storage, database storage, virtual machine file system (VMFS) volumes, and more.

• Block-based storage can also be used for data-sharing scenarios. After block-based volumes are created, they can be logically connected or migrated between different user spaces. Therefore, users can use these overlapped block volumes for sharing data with each other.

• Block-based storage normally has high throughput and performance and is generally configurable for capacity and performance. As data is partitioned and maintained in fixed-size blocks, the amount of small data segments is reduced and IO throughput increases due to more sequential reading and writing of data blocks.

• However, block-based storage is complex to manage and not easy to use due to the lack of information (such as meta-data, logical semantics and relations between data blocks) when compared with other storage models such as file-based and object-based storage.

1.4.2 File-Based Model

• File storage is easy to manage and implement. It is also less expensive to use than block storage. It is used more often on home computers and in smaller businesses, while block-level storage is used by larger enterprises, with each block being controlled by its own hard drive and managed through a server-based operating system.

• File-based storage is usually accessible using common file-level protocols such as SMB/CIFS (Windows) and NFS (Linux, VMware). At the same time, files contain more information for management purposes, such as authentication, permissions, access control and backup. Therefore, it is more user-friendly and maintainable.

• Due to its hierarchical structure, file-based storage is less scalable when the number of files becomes extremely large. It becomes extremely challenging to maintain both low latency and scalability for large-scale distributed file systems such as NFS and HDFS.

1.4.3 Object-Based Model

• Object-based storage solves the provisioning management issues presented by the expansion of storage at very large scale. Object-based storage architectures can be scaled out and managed simply by adding additional nodes. The flat namespace organization of the data, in combination with the expandable metadata functionality, facilitates this ease of use.


Table 1 Comparison of storage models

Storage model   Data model                             Indexing           Scalability   Consistency
Block-based     Blocks with fixed size                 Block ID           Flat          Strong
File-based      Files                                  File path          Hierarchy     Configurable
Object-based    Objects and meta-data, size not fixed  Object ID or URI   Flat          Configurable

• Object storage is commonly used for storing large-scale unstructured data such as photos on Facebook, songs on Spotify and even files in Dropbox.

• Object storage facilitates the storage of unstructured data sets where data is generally read but not written to. Object storage generally does not provide the ability to incrementally edit one part of a file (as block storage and file storage do). Objects have to be manipulated as a whole unit, requiring the entire object to be accessed, updated and then re-written into physical storage, which may have performance implications. It is also not recommended to use object storage for transactional data because of the eventual consistency model.

1.4.4 Summary of Data Storage Models

As a result, the main features of each storage model can be summarized as shown in Table 1. Generally, block-based storage has a fixed size for each storage unit, while the file-based and object-based models can have various sizes of storage units based on application requirements. In addition, file-based models use a file directory to locate the data, whilst block-based and object-based models both rely on a global identifier for locating data. Furthermore, both block-based and object-based models have flat scalability, while file-based storage may be limited by its hierarchical indexing structure. Lastly, block-based storage can normally guarantee strong consistency, while for file-based and object-based models the consistency model is configurable for different scenarios.

2 Data Models

A data model illustrates how data elements are organized and structured. It also represents the relations among different data elements. The data model is at the core of data storage, analytics and processing in contemporary big data systems. According to their data models, current data storage systems can be categorized into two big families: relational stores (SQL) and NoSQL stores.


For the past decades, relational database management systems (RDBMS) have been considered the dominant solution for most data persistence and management services. However, with the tremendous growth of data size and data variety, the traditional strong consistency and pre-defined schemas of relational databases have limited their capability for dealing with large-scale and semi-/un-structured data in the new era. Therefore, a new generation of highly scalable, more flexible data store systems has recently emerged to challenge the dominance of relational databases. This new group of systems is called NoSQL (Not only SQL) systems. The principle underneath the advance of NoSQL systems is actually a trade-off between the CAP properties of distributed storage systems.

As we know from the CAP theorem [22], a distributed system can only guarantee at most two out of the three properties: Consistency, Availability and Partition tolerance. Traditional RDBMS normally provide a strong consistency model based on their ACID [23] transaction model, while NoSQL systems try to sacrifice some extent of consistency for either higher availability or better partition tolerance. As a result, data storage systems can be categorized into three main groups based on their CAP properties:

• CA systems, which are consistent and highly available yet not partition-tolerant.

• CP systems, which are consistent and partition-tolerant but not highly available.

• AP systems, which are highly available and partition-tolerant but not strictly consistent.

In the remainder of this section, we discuss the major NoSQL systems and scalable relational databases, respectively.

2.1 NoSQL (Not only SQL)

Relational database management systems (such as MySQL [24], Oracle [25], SQL Server [26] and PostgreSQL [27]) had been dominating the database community for decades until they faced the limitation of scaling to very large-scale datasets. Therefore, a group of database systems which abandoned the support of ACID transactions (Atomicity, Consistency, Isolation and Durability, the key principles of relational databases) has recently emerged to tackle the challenge of big data. These database systems are named NoSQL (Not only SQL) systems and aim to provide horizontal scalability to any large scale of datasets. A majority of NoSQL systems are originally designed and built to support distributed environments with the need to improve performance by adding new nodes to the existing ones. Recall that the CAP theorem states that a distributed system can only choose at most two of the three properties: Consistency, Availability and Partition tolerance. One key principle of NoSQL systems is to compromise consistency to trade for high availability and scalability. Basically, the implementations of a majority of NoSQL systems share a few common design features, as listed below:


• High scalability, which requires the ability to scale horizontally over a large cluster of nodes;

• High availability and fault tolerance, which are supported by replicating and distributing data over distributed servers;

• Flexible data models, with the ability to dynamically define and update attributes and schemas;

• Weaker consistency models, which abandon ACID transactions and are usually referred to as BASE models (Basically Available, Soft state, Eventually consistent) [28];

• Simple interfaces, which are normally single call-level interfaces or protocols, in contrast to SQL bindings.

To address different scenarios and usage focuses, many NoSQL systems have been developed in both industry and academia. Based on the data model supported, these NoSQL systems can be classified into three main groups: key-value stores, document stores and extensible-record/column-based stores.

2.1.1 Key-Value Stores

Key-value stores use a simple data model in which data is considered as a set of key-value pairs; keys are unique IDs for each datum and also work as indexes when accessing the data (Fig. 10). Values are attributes or objects which contain the actual information of the data; therefore, these systems are called key-value stores. The data in key-value stores can be accessed using simple interfaces such as insert, delete and search by key. Normally, secondary keys and indexes are not supported. In addition, these systems also provide persistence mechanisms as well as replication, locking, sorting and other features.
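A toy Python sketch of this interface (illustrative only) shows how narrow the access model is: values are opaque, and the only lookup path is the key:

```python
class KeyValueStore:
    """Minimal key-value interface: insert, delete and search by key only."""
    def __init__(self):
        self._data = {}  # key -> opaque value; no secondary indexes

    def insert(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def search(self, key):
        return self._data.get(key)

store = KeyValueStore()
store.insert("user:42", {"name": "Alice", "visits": 7})
print(store.search("user:42"))      # found by key
print(store.search("name=Alice"))   # None: lookup by value is not supported
```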

• Redis

Redis [29] is an open source key-value store system written in C. It supports a fairly rich data model for stored values: values can be lists, sets and structures in addition to basic types (Integer, String, Double and Boolean). Apart from ordinary operations such as reading and writing, Redis also provides a few atomic modifiers such as incrementing a numeric value by one, adding an element to a list, etc. Redis mainly stores data in memory, which ensures high performance. To provide persistence, data snapshots and modification operations are written out to disk for failure tolerance. Redis can scale out by distributing data among multiple Redis servers (normally achieved on the client side) and providing asynchronous data replication through master-slaves.

Fig. 10 Data model of key-value stores


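For illustration, the redis-py client exposes these operations directly (a sketch assuming a Redis server on localhost:6379):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("page:home:hits", 0)
r.incr("page:home:hits")                 # atomic increment of a numeric value

r.rpush("recent:users", "alice", "bob")  # list value
r.sadd("online:users", "alice")          # set value

print(r.get("page:home:hits"))           # b'1'
print(r.lrange("recent:users", 0, -1))   # [b'alice', b'bob']
```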

• Memcached family

Memcached [30] is the first generation of key-value stores, initially working as a cache for web servers and then developed into a memory-based key-value store system. Memcached has been enhanced to support features such as high availability, dynamic growth and backup. The original design of Memcached does not support persistence and replication. However, its follow-up variations, Membrain and Membase, do include these features, which makes them more like storage systems.

• DynamoDB

DynamoDB [10] is a NoSQL store service provided by Amazon. DynamoDB supports a much more flexible data model than typical key-value stores. Data in DynamoDB is stored in tables, each of which has a unique primary ID for access. Each table can have a set of attributes, which are schema-free; scalar types and sets are supported. Data in DynamoDB can be manipulated by searching, inserting and deleting based on the primary keys. In addition, conditional operations, atomic modifications and search by non-key attributes are also supported (yet inefficient), which makes it closer to a document store. DynamoDB provides a fast and scalable architecture where sharding and replication are automatically performed. In addition, DynamoDB supports both eventual consistency and strong consistency for reads, although strong consistency degrades the performance.
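A short boto3 sketch illustrates this model; it assumes an existing table named "users" with primary key "id" (both placeholders) and configured AWS credentials:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("users")  # placeholder table with primary key "id"

# Items are schema-free beyond the primary key; scalars and sets are supported.
table.put_item(Item={"id": "42", "name": "Alice", "tags": {"admin", "beta"}})

# Retrieval by primary key is the fast path.
print(table.get_item(Key={"id": "42"}).get("Item"))

# Conditional/atomic modification guarded by an update expression.
table.update_item(
    Key={"id": "42"},
    UpdateExpression="SET visits = if_not_exists(visits, :zero) + :one",
    ExpressionAttributeValues={":zero": 0, ":one": 1},
)
```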

2.1.2 Document Stores

Document stores provide a more complex data structure and richer capabilities than key-value systems. In document stores, the unit of data is called a document, which is actually an object that can contain an arbitrary set of fields, values and even nested objects and arrays (Fig. 11). Document stores normally do not have predefined schemas for data and support search and indexing by document fields and attributes. Unlike key-value stores, they generally support secondary indexes, nested objects and lists. Additionally, some of them even support queries with constraints, aggregations, sorting and evaluations.

• MongoDB

MongoDB [31] is an open source project developed in C++ and supported by the company 10gen. MongoDB provides a data model based on JSON documents, maintained as BSON (a compact and binary representation of JSON). Each document in MongoDB has a unique identifier which can be automatically generated by the server or manually created by users.


Fig. 11 Data model of document stores

A document contains an arbitrary set of fields which can be either arrays or embedded documents. MongoDB is schema-free, and even documents in the same collection can have completely different fields. Documents in MongoDB are manipulated based on the JSON representation using search, insertion, deletion and modification operations. Users can find or query documents by writing expressions of constraints over fields. In addition, complex operations such as sorting, iteration and projection are supported. Moreover, users can run MapReduce-like programs and aggregation pipelines on documents, which makes it possible to execute more complicated analytic queries and programs. Documents can be completely replaced, and any part of their fields can also be manipulated and replaced.

Indexes on one or more fields in a collection are supported to speed up search queries. In addition, MongoDB scales out by distributing the documents of a collection among nodes based on a sharding key. Replication is performed between a master and slaves, with different consistency models depending on whether reading from secondary nodes is allowed and how many nodes are required to reach a confirmation.
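The pymongo client shows these capabilities in a few lines (a sketch assuming a MongoDB server on localhost; database and collection names are illustrative):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts  # databases and collections are created lazily

# Documents are schema-free JSON-like objects with nested fields and arrays.
posts.insert_one({
    "title": "Hello", "author": {"name": "Alice"}, "tags": ["intro", "misc"],
})

# Query by constraints over (possibly nested) fields.
print(posts.find_one({"author.name": "Alice", "tags": "intro"}))

# Secondary index on a field to speed up searches.
posts.create_index([("author.name", ASCENDING)])

# Aggregation pipeline, e.g. counting posts per author.
for row in posts.aggregate([{"$group": {"_id": "$author.name", "n": {"$sum": 1}}}]):
    print(row)
```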

• CouchDB

CouchDB [32] is an Apache open source project written in Erlang. It is a distributed document store that manages JSON documents. CouchDB is schema-free; documents are organized as collections. Each document contains a unique identifier and a set of fields which can be scalar fields, arrays or embedded documents.


Queries on CouchDB documents are called views, which are MapReduce-based JavaScript functions specifying the matching constraints and aggregation logic. These functions are structured into so-called design documents for execution. For these views, B-Tree based indexes are supported and updated during modifications. CouchDB also supports optimistic locking based on MVCC (Multi-Versioned Concurrency Control) [33], which enables CouchDB to be lock-free during read operations. In addition, every modification is immediately written down to disk, and old versions of data are also saved. CouchDB scales by asynchronous replication, in which both master-slave and master-master replication are supported. Each client is guaranteed to see a consistent state of the database; however, different clients may see different states (a strengthened form of eventual consistency).
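Because CouchDB is driven entirely over HTTP, a design document with a view can be installed with plain requests calls; the sketch below (server URL and database name are placeholders) defines a JavaScript map function with the built-in _count reduce and then queries it:

```python
import requests

db = "http://localhost:5984/blog"  # placeholder CouchDB server and database
requests.put(db)                   # create the database

# A design document holds views: MapReduce functions written in JavaScript.
design = {
    "views": {
        "posts_per_author": {
            "map": "function(doc) { if (doc.author) emit(doc.author, 1); }",
            "reduce": "_count",
        }
    }
}
requests.put(f"{db}/_design/stats", json=design)

requests.post(db, json={"author": "alice", "title": "Hello"})

# Querying the view runs the MapReduce-defined aggregation.
print(requests.get(f"{db}/_design/stats/_view/posts_per_author",
                   params={"group": "true"}).json())
```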

2.1.3 Extensible-Record Stores

Extensible-record stores (also called column stores) were initially motivated by Google's BigTable project [34]. In these systems, data is organized in tables with rows and column families, in which both rows and columns can be split over multiple nodes (Fig. 12). Due to this flexible and loosely coupled data model, these systems support both horizontal and vertical partitioning for scalability purposes. In addition, correlated fields/columns (named column families) are located on the same partition to facilitate query performance. Normally, column families are predefined before creating a data table. However, this is not a big limitation, as new columns and fields can always be dynamically added to existing tables.
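A bare-bones Python sketch (illustrative only) captures the essence of this model: column families are fixed up front, but columns within a family can be added to any row at any time:

```python
from collections import defaultdict

class ExtensibleRecordTable:
    """Rows hold (family, column) -> value maps; families are predefined."""
    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = defaultdict(dict)  # row_key -> {(family, column): value}

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, column)] = value  # new columns cost nothing

    def get(self, row_key, family, column):
        return self.rows[row_key].get((family, column))

table = ExtensibleRecordTable(column_families=["profile", "activity"])
table.put("user42", "profile", "name", "Alice")
table.put("user42", "activity", "last_login", "2017-01-01")  # dynamic column
print(table.get("user42", "profile", "name"))
```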

• BigTable

BigTable [34] was introduced by Google in 2004 as a column store to support various Google services. BigTable is built on Google File System (GFS) [35] and can easily scale up to hundreds or thousands of nodes maintaining Terabyte- and Petabyte-scale data.

Fig. 12 Data model of extensible-record stores