
Copyright © IJIFR 2013 | Author's Research Area: Big Data
Available online at: http://www.ijifr.com/searchjournal.aspx
www.ijifr.com | [email protected] | ISSN (Online): 2347-1697

INTERNATIONAL JOURNAL OF INFORMATIVE & FUTURISTIC RESEARCH
An Enlightening Online Open Access, Refereed & Indexed Journal of Multidisciplinary Research
Volume 1, Issue 7, March 2014, pp. 22-33

Big Data: A Detailed Review

Abstract

This paper explains Big Data, its background, and the four phases of the big data value chain: data generation, data acquisition, data storage, and data analysis. It highlights the need to switch to NoSQL/Big Data technologies and describes their characteristics and applications. Further, different storage mechanisms and the major points of difference between HBase, FlockDB, Cassandra, SimpleDB, CouchDB, and Neo4j are discussed. The paper reviews several published papers, white papers, seminar presentations, and articles in order to build a complete picture of the background, emergence, utility, applications, comparisons between different techniques, and open issues in the area of Big Data. It also discusses the risks involved with this emerging technology. Finally, the survey concludes with a discussion of open problems and future directions.

Keywords: Big Data, Cloud Computing, Internet of Things, Data Center, Hadoop, NoSQL

1. Introduction

Over the past 20 years, data has grown on a large scale in various fields. According to a report from the International Data Corporation (IDC), in 2011 the overall volume of created and copied data in the world was 1.8 ZB (about 1.8 × 10^21 bytes), an increase of nearly nine times within five years [1]. This figure is expected to double at least every two years in the near future. Under this explosive increase of global data, the term Big Data is mainly used to describe enormous datasets. Big Data typically includes masses of unstructured data that need more real-time analysis [23]. The sharply increasing data deluge of the big data era brings huge challenges in data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS).
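The growth rates quoted above can be sanity-checked with a little arithmetic. The projection below only illustrates the "doubling every two years" claim from the 1.8 ZB baseline; it is not a figure taken from the cited IDC report.

```python
# Project the global data volume from the 2011 baseline of 1.8 ZB,
# assuming it doubles every two years (1 ZB = 10**21 bytes).
def projected_volume(year, base_year=2011, base_zb=1.8):
    """Volume in ZB after (year - base_year) years of biennial doubling."""
    return base_zb * 2 ** ((year - base_year) / 2)

for year in (2011, 2015, 2020):
    print(year, round(projected_volume(year), 1), "ZB")
# 2011 -> 1.8 ZB, 2015 -> 7.2 ZB, 2020 -> 40.7 ZB
```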

Ms. Subita Kumari, Research Scholar, Computer Science and Engineering Department, UIET, MDU, Rohtak, Haryana, India. PAPER ID: IJIFR/V1/E7/017

Author's Subject Area: Big Data, Page No.: 22-33

However, such RDBMSs apply only to structured data, not to semi-structured or unstructured data. In addition, RDBMSs increasingly rely on more and more expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. For the permanent storage and management of large-scale disordered datasets, distributed file systems and NoSQL (Not Only SQL) databases are good choices.

Nowadays, big data related to the services of Internet companies is growing rapidly. For example, Google processes hundreds of petabytes (PB) of data, and Facebook generates over 10 PB of log data per month. Figure 1 illustrates the boom of the global data volume. While the amount of large datasets is drastically rising, it also brings about many challenging problems demanding prompt solutions. The key challenges are data representation, redundancy reduction and data compression, data life-cycle management, analytical mechanisms, data confidentiality, energy management, expandability and scalability, and cooperation.

Figure 1: The Continuously Growing Big Data

Big data is an abstract concept. Apart from masses of data, it also has some other features that distinguish it from "massive data" or "very big data." In general, big data means datasets that cannot be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time. In 2010, Apache Hadoop defined big data as "datasets which could not be captured, managed, and processed by general computers within an acceptable scope." At present, big data generally ranges from several TB to several PB [1]. In 2011, an IDC report defined big data as follows: "big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis" [23]. With this definition, the characteristics of big data may be summarized as four Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge value but very low density), as shown in Figure 2. Many challenges on big data arose along the way. With the development of Internet services, indexes and queried contents grew rapidly, so search engine companies had to face the challenges of handling such big data. Google created GFS [26] and the MapReduce [27] programming model to cope with the challenges of data management and analysis at Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change in computing architectures and large-scale data processing mechanisms.

Figure 2: The 4Vs of Big Data

2. Literature Review

John Gantz and David Reinsel, in 2011, observed that the growth of the digital universe continues to outpace the growth of storage capacity. A gigabyte of stored content can generate a petabyte or more of transient data that we typically do not store (e.g., digital TV signals we watch but do not record, or voice calls that are made digital in the network backbone for the duration of a call). So, like our physical universe, the digital universe is something to behold: 1.8 trillion gigabytes in 500 quadrillion "files," more than doubling every two years. That is nearly as many bits of information in the digital universe as stars in our physical universe [1].

In 2012, danah boyd and Kate Crawford stated that the era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Their six provocations for Big Data are:

1. Big Data changes the definition of knowledge
2. Claims to objectivity and accuracy are misleading
3. Bigger data are not always better data
4. Taken out of context, Big Data loses its meaning
5. Just because it is accessible does not make it ethical
6. Limited access to Big Data creates new digital divides [2]

A MongoDB white paper states that companies should consider five critical dimensions to make the right choice for their applications and their businesses: Data Model, Query Model, Consistency Model, APIs & Commercial Support, and Community Strength [3]. Moniruzzaman, A. B. M. and Hossain, Syed Akhter presented a paper on the classification, characteristics, and evaluation of NoSQL databases in Big Data analytics. The report is intended to help users, especially organizations, obtain an independent understanding of the strengths and weaknesses of various NoSQL database approaches to supporting applications that process huge volumes of data [4]. Jongwook Woo, Siddharth Basopia, and Yuhang Xu presented a paper on an HBase schema for processing transaction data with a Market Basket Analysis algorithm. The algorithm runs on Hadoop MapReduce, reading data from both HBase and HDFS. The experimental results show that the Map/Reduce code improves performance as more nodes are added, but at a certain point a bottleneck prevents further gains. Besides, executing the algorithm with data in HBase is slower than in HDFS [5].

In 2012, Vatika Sharma and Meenu Dave presented a paper on the characteristics of NoSQL: it does not use the relational data model and thus does not use the SQL language, it stores large volumes of data, it can be used without inconsistency in a distributed environment, it is open-source, it has no fixed schema, it does not use the ACID properties, and it offers high performance and a more flexible structure [6]. Kailash Raj Joshi presented the pros and cons of both technologies and tried to address issues involving data visualization. Characteristics such as flexibility, low latency, scalability, schema-less design, fast queries, and performance are some major advantages of a NoSQL database [7]. A Couchbase NoSQL white paper states, "Today, the use of NoSQL technology is rising rapidly among Internet companies and the enterprise. It's increasingly considered a viable alternative to relational databases" [8].

In an international seminar in 2012, Arto Salminen shared information about the NoSQL databases used by big organizations, e.g., Google (BigTable, LevelDB), LinkedIn (Voldemort), Facebook (Cassandra), Twitter (Hadoop/HBase, FlockDB, Cassandra), Netflix (SimpleDB, Hadoop/HBase, Cassandra), and CERN (CouchDB) [9]. Looking at the emerging Big Data trends, NYTimes.com states that data is not only becoming more available but also more understandable to computers. Most of the Big Data surge is data in the wild: unruly stuff like words, images, and video on the Web, and those streams of sensor data. It is called unstructured data and is not typically grist for traditional databases. "GOOD with numbers? Fascinated by data? The sound you hear is opportunity knocking" [10]. In December 2012, Hsinchun Chen, Roger H. L. and Veda C. said, "Now, in this era of Big Data, even while BI&A 2.0 is still maturing, we find ourselves poised at the brink of BI&A 3.0, with all the attendant uncertainty that new and potentially revolutionary technologies bring" [11]. A white paper by TATA Consultancy Services describes various breeds, choices, and tradeoffs to enable enterprises to make informed decisions to optimize the utilization of NoSQL databases [12]. Darshana Shimpi and Sangita Chaudhari presented a comparison of current graph databases and observed that Neo4j is the best among them [13].

In a paper submitted in April 2013, Renu Kanwar, Prakriti Trivedi, and Kuldeep Singh claimed that NoSQL is the solution for use cases where ACID is not the major concern; it uses BASE instead, which works on eventual consistency [14]. In 2013, Mirko Kampf and Jan W. Kantelhardt described a computational framework for time-series analysis. Generic data structures represent different types of time series, e.g., event and inter-event time series, and define reliable interfaces to existing big data [15]. David Navetta states that the potential uses and benefits of Big Data are endless. Unfortunately, Big Data also poses some risk both to the companies seeking to unlock its potential and to the individuals whose information is now continuously being collected, combined, mined, analyzed, disclosed, and acted upon. Big Data and some of the privacy-related legal issues and risks associated with it are explained in his paper [16]. Matt Bishop says Big Data is revolutionizing our view of science and has the potential to do the same for the social sciences and humanities. With the benefits come very serious potential problems, ranging from invasion of personal privacy to enabling spectacular failures of analytics; his article discusses some of them [17]. In 2013, Scott J. Lusher, Ross McGuire, Rene van Schaik, David Nicholson, and Jacob de Vlieg highlighted the benefits of Big Data in the field of medicine: "The exploitation of so-called 'big data' will enable us to undertake research projects never previously possible but should also stimulate a re-evaluation of all our data practices" [18].

Peter Tseng, Westbrook M. Weaver, Mahdokht Masaeli, Keegan Owsley, and Dino Di Carlo highlighted a study showing that the development of large-scale mutant or fusion libraries and the automation of microscopy, image analysis, and data extraction will be key components as microfluidics meets big data [19]. Nicholas P. Restifo, in his 2013 paper, admits that this "Big Data Revolution" is likely to affect fields as diverse as weather forecasting and crime fighting, and the conduct of immunobiology is certainly no exception [20]. Marton Mestyan, Taha Yasseri, and Janos Kertesz showed that the popularity of a movie can be predicted well before its release by measuring and analyzing the activity level of editors and viewers of the corresponding entry in Wikipedia, the well-known online encyclopedia [21]. In February 2014, Wenliang Huang, Zhen Chen, Wenyu Dong, Hang Li, Bin Cao, and Junwei Cao presented a comparison of HBase and Oracle: "Compared with Oracle database, our HBase shows very consistent performance, and the peak insertion rate reaches approximately 100 000 records per second" [22].

Min Chen, Shiwen Mao, and Yunhao Liu state that traditional relational databases cannot meet the challenges in categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large-volume data, and are becoming the core technology for big data [23]. Raghupathi and Raghupathi stated in their paper that McKinsey estimates that big data analytics can enable more than $300 billion in savings per year in U.S. healthcare, two-thirds of that through reductions of approximately 8% in national healthcare expenditures. Clinical operations and R&D are two of the largest areas for potential savings, with $165 billion and $108 billion in waste respectively. McKinsey believes big data could help reduce waste and inefficiency in the following three areas: clinical operations, research & development, and public health [24].

3. The development of big data

In the late 1970s, the concept of the "database machine" emerged. In the 1980s, people proposed "share nothing," a parallel database architecture; the Teradata system was the first successful commercial parallel database system. In the late 1990s, the advantages of parallel databases were widely recognized in the database field. Taking IBM as an example, since 2005 IBM has invested USD 16 billion in 30 acquisitions related to big data. In the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, classified big data computing, social analysis, and stored data analysis among the 48 emerging technologies that deserve the most attention. Fundamental technologies closely related to big data include cloud computing, the Internet of Things (IoT), data centers, and Hadoop.

4. Big data generation and acquisition

The value chain of big data can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

4.1. Data generation

Data generation is the first step of big data. Taking Internet data as an example, huge amounts of data in terms of search entries, Internet forum posts, chat records, and microblog messages are generated. Data can be generated from enterprise data, IoT data, bio-medical data, and data from other fields [23].

4.2. Big data acquisition

Big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications.

4.3. Data collection

Data collection utilizes special techniques to acquire raw data from a specific data generation environment. Common data collection methods are log files, sensing, and methods for acquiring network data. Current network data acquisition technologies mainly include traditional Libpcap-based packet capture, zero-copy packet capture, and specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

Data transportation: Upon the completion of raw data collection, data is transferred to a data storage infrastructure for processing and analysis. Big data is mainly stored in a data center. Data transmission consists of two phases: inter-DCN transmissions and intra-DCN transmissions. As a strengthening technology, Zhou et al. in [30] adopt wireless links in the 60 GHz frequency band to strengthen wired links.
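Log files, the first collection method listed above, are typically turned into structured records before storage. The sketch below (an illustration only; the simplified Common-Log-style format is a hypothetical example, not one used by the paper) parses web-server-style lines and counts requests per client:

```python
import re
from collections import Counter

# Minimal sketch of log-file-based data collection: extract structured
# fields from access-log lines and count requests per client IP.
# The format here is a simplified, hypothetical Common Log Format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

def collect(lines):
    """Tally requests per IP from raw log lines, skipping malformed ones."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            hits[m.group("ip")] += 1
    return hits

sample = [
    '10.0.0.1 - - [12/Mar/2014:10:00:00 +0000] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [12/Mar/2014:10:00:01 +0000] "GET /about.html HTTP/1.1" 404',
    '10.0.0.1 - - [12/Mar/2014:10:00:02 +0000] "POST /login HTTP/1.1" 200',
    'garbage line that does not match',
]
print(collect(sample))  # Counter({'10.0.0.1': 2, '10.0.0.2': 1})
```

Skipping malformed lines here foreshadows the cleaning step discussed under data pre-processing.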

4.4. Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. Typical data pre-processing techniques are integration, cleaning, and redundancy elimination. Generally, data integration methods are accompanied by flow processing engines and search engines [31]. The authors in [33] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information. Apart from these pre-processing methods, specific data objects shall go through other operations such as feature extraction, which plays an important role in multimedia search and DNA analysis [37-39].
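As a toy illustration of the redundancy-elimination step (a sketch of the general idea, not a method from the paper), duplicate records can be dropped by hashing a canonical form of each record:

```python
import hashlib
import json

# Minimal sketch of redundancy elimination during pre-processing:
# drop records whose canonical JSON form has already been seen.
def deduplicate(records):
    seen = set()
    unique = []
    for rec in records:
        # Canonicalize with sorted keys so field order does not matter.
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True).encode()
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

raw = [
    {"user": "a", "event": "click"},
    {"event": "click", "user": "a"},   # same record, different key order
    {"user": "b", "event": "view"},
]
print(deduplicate(raw))  # two unique records remain
```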

5. Big data storage

5.1. Storage mechanism for big data

Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

5.2. File Systems

Google's GFS is an expandable distributed file system supporting large-scale, distributed, data-intensive applications [26]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. These limitations have been overcome by Colossus, the successor of GFS. Other examples are HDFS, Kosmosfs, Cosmos, and Haystack. Microsoft developed Cosmos [44] to support its search and advertisement business. Facebook utilizes Haystack [45] to store its large number of small-sized photos.

5.3. Database technology

Relational databases cannot meet the challenges in categories and scales brought about by big data. Eric Brewer proposed the CAP theorem [41,42] in 2000, which states that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance; at most two of the three can be satisfied simultaneously. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy replication, simple APIs, eventual consistency, and support of large-volume data. The three main classes of NoSQL databases are key-value databases, column-oriented databases, and document-oriented databases.

5.4 Key-value Databases: Key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and clients query values according to their keys. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [46,47]. Other examples are Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris.

5.5 Column-oriented Databases: Column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to achieve expandability. Column-oriented databases are mainly inspired by Google's BigTable, which is built on many fundamental Google components, including GFS [26], the cluster management system, the SSTable file format, and Chubby [49]. HBase is a BigTable clone programmed in Java and is part of Apache Hadoop's MapReduce framework; HBase replaces GFS with HDFS. Cassandra is a distributed storage system for managing huge amounts of structured data distributed across multiple commodity servers [50].
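The BigTable/HBase data model can be pictured as a sparse, sorted map from (row key, column family:qualifier) to values. The nested-dict sketch below only illustrates that logical layout (the row keys and families are hypothetical examples); it is not HBase's actual API:

```python
# Sketch of the BigTable/HBase logical data model: a sparse map
# row_key -> {"family:qualifier": value}. Rows are kept sorted by key,
# which is what lets real systems split them into contiguous regions.
table = {}

def put(row, column, value):
    table.setdefault(row, {})[column] = value

put("com.example/index", "contents:html", "<html>...</html>")
put("com.example/index", "anchor:cnn.com", "Example link")
put("com.example/about", "contents:html", "<html>about</html>")

def scan_family(family):
    """Scan one column family; untouched families are never read."""
    for row in sorted(table):                      # rows sorted by key
        for col, val in table[row].items():
            if col.startswith(family + ":"):
                yield row, col, val

for row, col, val in scan_family("contents"):
    print(row, col, val)
```

The column-oriented layout pays off in scans like this one: only the requested family is touched, not entire rows.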

5.6 Document Databases:

Compared with key-value storage, document storage can support more complex data forms. Three important representatives of document storage systems are MongoDB, SimpleDB, and CouchDB. MongoDB stores documents as Binary JSON (BSON) objects, which are similar to objects; every document has an ID field as the primary key. SimpleDB is a distributed database offered as an Amazon web service. Data in SimpleDB is organized into various domains in which data may be stored, acquired, and queried; domains include different properties and name/value pair sets of projects. Apache CouchDB is a document-oriented database written in Erlang. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects.

Recently, some parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. MapReduce [27] is a simple but powerful programming model for large-scale computing. It has two functions, Map and Reduce. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. The initial MapReduce framework did not support multiple datasets in a task, a limitation mitigated by some recent enhancements [51].

6. Big data analysis

Data analysis is the final and most important phase in the value chain of big data; its purpose is to extract useful values and to provide suggestions or decisions.

6.1. Traditional data analysis

Traditional data analysis approaches are cluster analysis, factor analysis, correlation analysis, regression analysis, A/B testing, statistical analysis, and data mining algorithms. Big data analysis approaches are Bloom filters, hashing, indexes, tries, and parallel computing.
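Of the big-data methods just listed, the Bloom filter is easy to sketch: a bit array plus several hash functions gives memory-cheap membership tests with a tunable false-positive rate. This is an illustrative sketch, with hash functions derived from SHA-256 for simplicity rather than the faster hashes production filters use:

```python
import hashlib

# Minimal Bloom filter: k hash functions set/test k bits in an m-bit array.
# Lookups may return false positives but never false negatives.
class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0                    # integer used as an m-bit array

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add("hadoop")
print(bf.might_contain("hadoop"))  # True: the element was added
print(bf.might_contain("oracle"))  # never added; False except with small false-positive probability
```

The low memory footprint (m bits regardless of item size) is what makes the structure attractive for big-data workloads such as duplicate URL detection.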


6.2. Tools for big data mining and analysis

Major tools used for big data analysis and mining are R (30.7%), Excel (29.8%), RapidMiner (26.7%), KNIME (21.8%), and Weka/Pentaho (14.8%).

Figure 3 depicts the complete Big Data architecture. Key applications of Big Data are: Big Data in enterprises, IoT-based big data, online social network-oriented big data, healthcare and medical big data, collective intelligence, and the smart grid. Analysis of Big Data can be classified into structured data analysis, text data analysis, Web data analysis, multimedia data analysis, network data analysis, and mobile data analysis.

Figure 3: Big Data architecture

7. Conclusion and Open Issues

The analysis of big data is confronted with many challenges, and the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and the existing models are not strictly verified. An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated. In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges: big data privacy, data quality, big data safety mechanisms, and big data applications in information security. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

8. References

[1] Gantz J, Reinsel D, "Extracting Value from Chaos", IDC iView, pp 1-12 (2011).
[2] MongoDB, "Top 5 Considerations When Evaluating NoSQL Databases", White Paper.
[3] Danah Boyd, Kate Crawford, "Critical Questions for Big Data", Routledge, Information, Communication & Society, Vol. 15, No. 5, June 2012, pp. 662-679, ISSN 1369-118 (2012).
[4] Moniruzzaman A. B. M., Hossain Syed Akhter, "NoSQL Database: New Era of Databases for Big Data Analytics - Classification, Characteristics and Comparison", International Journal of Database Theory & Application, Vol. 6, Issue 4, pp 13 (2013).
[5] Jongwook Woo, Siddharth Basopia, Yuhang Xu, "Market Basket Analysis Algorithm with NoSQL DB HBase and Hadoop", Computer Information Systems Department, California State University, Los Angeles, CA, USA.
[6] Vatika Sharma, Meenu Dave, "SQL and NoSQL Databases", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 8, August 2012, ISSN: 2277 128X.
[7] Kailash Raj Joshi, "Graph Visualization Using the NoSQL Database", Paper submitted to the Graduate Faculty of the North Dakota State University of Agriculture and Applied Science.
[8] CouchBase, "NoSQL", White Paper.
[9] Arto Salminen, "Introduction to NoSQL", NoSQL Seminar 2012 @ TUT.
[10] "Big Data's Impact in the World", NYTimes.com.
[11] Hsinchun Chen, Roger H. L, Veda C., "Business Intelligence and Analytics: From Big Data to Big Impact", MIS Quarterly, Vol. 36, No. 4, December 2012, Eller College of Management, University of Arizona, Tucson, AZ 85721, USA (2012).
[12] TATA Consultancy Services, "NoSQL, the Database for the Cloud", White Paper.
[13] Darshana Shimpi, Sangita Chaudhari, "An Overview of Graph Databases", International Conference on Recent Trends in Information Technology and Computer Science (ICRTITCS-2012), proceedings published in International Journal of Computer Applications (IJCA) (0975-8887) (2012).
[14] Renu Kanwar, Prakriti Trivedi, Kuldeep Singh, "NoSQL, a Solution for Distributed Database Management System", International Journal of Computer Applications (0975-8887), Volume 67, No. 2 (2013).
[15] Mirko Kampf, Jan W. Kantelhardt, "Hadoop.TS: Large-Scale Time-Series Processing", International Journal of Computer Applications (0975-8887), Volume 74, No. 17 (2013).
[16] David Navetta, "Legal Implications of Big Data - A Primer", ISSA - Developing and Connecting Cybersecurity Leaders Globally, ISSA member, Denver, USA Chapter.
[17] Bishop M, "Caution: Danger Ahead with Big Data", ISSA - Developing and Connecting Cybersecurity Leaders Globally.
[18] Scott J. Lusher, Ross McGuire, Rene van Schaik, David Nicholson, Jacob de Vlieg, "Data-Driven Medicinal Chemistry in the Era of Big Data", Elsevier, Drug Discovery Today, Volume 00, Number 00, December (2013).
[19] Peter Tseng, Westbrook M. Weaver, Mahdokht Masaeli, Keegan Owsley, Dino Di Carlo, "Research Highlights: Microfluidics Meets Big Data", Royal Society of Chemistry, www.rsc.org/loc.
[20] Nicholas P. Restifo, "'Big Data' View of the Tumor 'Immunome'", Immunity Previews, Cell Press, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.

Ms. Subita Kumari: Big Data: A Detailed Review

www.ijifr.com Email: [email protected] © IJIFR 2013

This paper is available online at - http://www.ijifr.com/searchjournal.aspx

PAPER ID: IJIFR/V1/E7/017


[21] Marton Mestyan, Taha Yasseri, Janos Kertesz, "Early Prediction of Movie Box Office Success Based on Wikipedia Activity Big Data", PLOS.
[22] Wenliang Huang, Zhen Chen, Wenyu Dong, Hang Li, Bin Cao, Junwei Cao, "Mobile Internet Big Data Platform in China Unicom", Tsinghua Science and Technology, ISSN 1007-0214, Volume 19, Number 1, pp 95-101 (2014).
[23] Min Chen, Shiwen Mao, Yunhao Liu, "Big Data: A Survey", Springer, School of Computer Science and Technology, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan, 430074, China (2014).
[24] Raghupathi and Raghupathi, "Big Data Analytics in Healthcare: Promise and Potential", Health Information Science and Systems, http://www.hissjournal.com/content/2/1/3 (2014).
[25] Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH, "Big Data: The Next Frontier for Innovation, Competition, and Productivity", McKinsey Global Institute (2011).
[26] Ghemawat S, Gobioff H, Leung S-T, "The Google File System", In: ACM SIGOPS Operating Systems Review, vol 37, ACM, pp 29-43.
[27] Dean J, Ghemawat S, "MapReduce: Simplified Data Processing on Large Clusters", Commun ACM 51(1):107-113.
[28] Singla A, Singh A, Ramachandran K, Xu L, Zhang Y, "Proteus: A Topology Malleable Data Center Network", In: Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, ACM, p 8 (2010).
[29] Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P, "Energy-Efficient Design of a Scalable Optical Multiplane Interconnection Architecture", IEEE J Sel Top Quantum Electron 17(2):377-383 (2011).
[30] Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H, "Mirror Mirror on the Ceiling: Flexible Wireless Links for Data Centers", ACM SIGCOMM Comput Commun Rev 42(4):443-454 (2012).
[31] Cafarella MJ, Halevy A, Khoussainova N, "Data Integration for the Relational Web", Proc VLDB Endowment 2(1):1090-1101 (2009).
[32] Maletic JI, Marcus A, "Data Cleansing: Beyond Integrity Analysis", In: IQ, Citeseer, pp 200-209 (2000).
[33] Kohavi R, Mason L, Parekh R, Zheng Z, "Lessons and Challenges from Mining Retail E-Commerce Data", Mach Learn 57(1-2):83-113 (2004).
[34] Chen H, Ku W-S, Wang H, Sun M-T, "Leveraging Spatiotemporal Redundancy for RFID Data Cleansing", In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, pp 51-62 (2010).
[35] Zhao Z, Ng W, "A Model-Based Approach for RFID Data Stream Cleansing", In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ACM, pp 862-871 (2012).
[36] Khoussainova N, Balazinska M, Suciu D, "Probabilistic Event Extraction from RFID Data", In: IEEE 24th International Conference on Data Engineering (ICDE 2008), IEEE, pp 1480-1482 (2008).
[37] Kamath U, Compton J, Dogan RI, Jong KD, Shehu A, "An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction", IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387-1398 (2012).
[38] Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y, "Data Mining on DNA Sequences of Hepatitis B Virus", IEEE/ACM Transac Comput Biol Bioinforma 8(2):428-440 (2011).
[39] Huang Z, Shen H, Liu J, Zhou X, "Effective Data Co-Reduction for Multimedia Similarity Search", In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ACM, pp 1021-1032 (2011).
[40] Bleiholder J, Naumann F, "Data Fusion", ACM Comput Surv (CSUR) 41(1):1 (2008).
[41] Brewer EA, "Towards Robust Distributed Systems", In: PODC, p 7 (2000).
[42] Gilbert S, Lynch N, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", ACM SIGACT News 33(2):51-59 (2002).
[43] McKusick MK, Quinlan S, "GFS: Evolution on Fast-Forward", ACM Queue 7(7):10 (2009).
[44] Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J, "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets", Proc VLDB Endowment 1(2):1265-1276 (2008).
[45] Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, "Finding a Needle in Haystack: Facebook's Photo Storage", In: OSDI, vol 10, pp 1-8 (2010).
[46] DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W, "Dynamo: Amazon's Highly Available Key-Value Store", In: SOSP, vol 7, pp 205-220 (2007).


[47] Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D, "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web", In: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, ACM, pp 654-663 (1997).
[48] Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE, "Bigtable: A Distributed Storage System for Structured Data", ACM Trans Comput Syst (TOCS) 26(2):4 (2008).
[49] Burrows M, "The Chubby Lock Service for Loosely-Coupled Distributed Systems", In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, USENIX Association, pp 335-350 (2006).
[50] Lakshman A, Malik P, "Cassandra: Structured Storage System on a P2P Network", In: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, ACM, pp 5-5 (2009).
[51] Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y, "A Comparison of Join Algorithms for Log Processing in MapReduce", In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, pp 975-986 (2010).
[52] Pike R, Dorward S, Griesemer R, Quinlan S, "Interpreting the Data: Parallel Analysis with Sawzall", Sci Program 13(4):277-298 (2005).
[53] Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U, "Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience", Proc VLDB Endowment 2(2):1414-1425 (2009).
[54] Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R, "Hive: A Warehousing Solution over a Map-Reduce Framework", Proc VLDB Endowment 2(2):1626-1629 (2009).
[55] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks", ACM SIGOPS Oper Syst Rev 41(3):59-72 (2007).