
Bamuengine.com


Documented by Prof. K. V. Reddy Asst.Prof at DIEMS

UNIT 4

BIG DATA AND ANALYTICS

Big Data, Challenges in Big Data, Hadoop: Definition, Architecture, Cloud file systems: GFS and HDFS, BigTable, HBase and

Dynamo, MapReduce and extensions: Parallel computing, The MapReduce model: Parallel efficiency of MapReduce, Relational

operations using MapReduce, Projects in Hadoop: Hive, HBase, Pig, Oozie, Flume, Sqoop

BIG DATA DEFINITION

Big Data is the term for a collection of data sets so large and complex that it becomes difficult to

process using on-hand database management tools or traditional data processing applications.

"Big data technologies describe a new generation of technologies and architectures, designed to

economically extract value from very large volumes of a wide variety of data, by enabling high-velocity

capture, discovery, and/or analysis."

Characteristics of Big Data and its role in the current world:

Data defined as Big Data includes machine-generated data from sensor networks, nuclear plants, X-

ray scanning devices, airplane engines, and consumer-driven data from social media.

1) DATA VOLUME:- Volume refers to the size of the dataset. It may be in KB, MB, GB, TB, or PB

based on the type of the application that generates or receives the data. Data volume is characterized by the

amount of data that is generated continuously. Different data-types come in different sizes. For example, a

blog text is a few kilobytes; voice calls or videos a few megabytes; sensor data, machine logs, and

clickstream data can be in gigabytes. The following are some examples of data generated by different

sources.

Machine data: Every machine (device) that we use today from industrial to personal devices can generate a

lot of data. This data includes both usage and behaviors of the owners of these machines. Machine-generated

data is often characterized by a steady pattern of numbers and text, which occurs in a rapid-fire fashion.

There are several examples of machine-generated data; for instance, radio signals: satellites and mobile

devices all transmit signals.

Application log: Another form of machine-generated data is an application log. Different devices generate

logs at different paces and formats. For example: CT scanners, X-ray machines, body scanners at airports,

airplanes, ships, military equipment, commercial satellites.

Clickstream logs: The usage statistics of the web page are captured in clickstream data. This data type

provides insight into what a user is doing on the web page, and can provide data that is highly useful for

behavior and usability analysis, marketing, and general research.

2) DATA VELOCITY:-

Velocity refers to the low latency, real-time speed at which the analytics need to be applied. With the

advent of Big Data, understanding the velocity of data is extremely important; the basic reason is that the

data often needs to be analyzed as fast as it is generated. Let us look at some examples of data velocity.


Amazon, Yahoo, and Google

The business models adopted by Amazon, Facebook, Yahoo, and Google became the de-facto

business models for most web-based companies. The clickstreams on these websites amount to

millions of clicks gathered from users every second, producing large volumes of data. This data can be

processed, segmented, and modeled to study population behaviors based on time of day, geography,

advertisement effectiveness, click behavior, and guided navigation response. The velocity of data produced

by user clicks on any website today is a prime example for Big Data velocity.

Sensor data

Another prime example of data velocity comes from a variety of sensors like GPS, mobile devices,

biometric systems, airplane sensors and engines. The data generated from sensor networks can range from a

few gigabytes per second to terabytes per second. For example, a flight from London to New York generates

650 TB of data from the airplane engine sensors. There is a lot of value in reading this information during

the stream processing and post gathering for statistical modeling purposes.

Social media

Another Big Data favorite, different social media sites produce and provide data at different

velocities and in multiple formats. While Twitter is fixed at 140 characters, Facebook, YouTube, or Flickr

can have posts of varying sizes from the same user. Not only is the size of the post important, understanding

how many times it is forwarded or shared and how much follow-on data it gathers is essential to process the

entire data set.

3) DATA VARIETY:-

Variety refers to the various types of the data that can exist, for example, text, audio, video, and photos.

Big Data comes in multiple formats as it ranges from emails to tweets to social media and sensor data. There

is no control over the input data format or the structure of the data. A key processing challenge associated

with a variety of formats is the availability of appropriate metadata for identifying what is contained in the

actual data. This is critical when we process images, audio, video, and large chunks of text. The platform

requirements for processing new formats are:

● Scalability

● Distributed processing capabilities

● Image processing capabilities

● Graph processing capabilities

● Video and audio processing capabilities

TYPES OF DATA:-

Structured data is characterized by a high degree of organization and is typically the kind of data you see

in relational databases or spreadsheets. Because of its defined structure, it maps easily to one of the standard

data types. It can be searched using standard search algorithms and manipulated in well-defined ways.


Semi-structured data (such as what you might see in log files) is a bit more difficult to understand than

structured data. Normally, this kind of data is stored in the form of text files, where there is some degree of

order, for example tab-delimited files, where columns are separated by a tab character. So instead of being

able to issue a database query for a certain column and knowing exactly what you're getting back, users

typically need to explicitly assign data types to any data elements extracted from semi-structured data sets.

Unstructured data has none of the advantages of having structure coded into a data set. Its analysis by way

of more traditional approaches is difficult and costly at best, and logistically impossible at worst. Without a

robust set of text analytics tools, it would be extremely tedious to determine any interesting behavior

patterns.

Challenges in Big-Data:

Acquire: The first and foremost challenge is quickly capturing the high volumes of data generated in many

different formats.

Organize: A big data research platform needs to process massive quantities of data—filtering, transforming

and sorting it before loading it into a data warehouse.

Analyze: The infrastructure required for analyzing big data must be able to support deeper analytics such as

statistical analysis and data mining on a wider variety of data types stored in diverse systems; scale to

extreme data volumes; deliver faster response times; and automate decisions based on analytical models.

Scale: With big data you want to be able to scale very rapidly and elastically. Most of the NoSQL solutions

like MongoDB or HBase have their own scaling limitations.

Performance: In an online world where nanosecond delays can cost your sales, big data must move at

extremely high velocities no matter how much you scale or what workloads your database must perform.

Continuous Availability: When you rely on big data to feed your essential, revenue-generating 24/7

business applications, even high availability is not high enough. Your data can never go down. A certain

amount of downtime is built into RDBMSs and many NoSQL systems.

Data Security: Big data carries some big risks when it contains credit card data, personal ID information

and other sensitive assets. Most NoSQL big data platforms have few if any security mechanisms in place to

safeguard your big data.

HADOOP

DEFINITION: Apache Hadoop is a framework that allows for the distributed processing of large data sets

across clusters of commodity computers using a simple programming model. Apache Hadoop is an open-

source software framework written in Java for distributed storage and distributed processing of very large

data sets on computer clusters built from commodity hardware. The core of Apache Hadoop consists of a

storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce).

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common – contains libraries and utilities needed by other Hadoop modules.


Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity

machines, providing very high aggregate bandwidth across the cluster.

Hadoop YARN – a resource-management platform responsible for managing computing resources

in clusters and using them for scheduling of users' applications and

Hadoop MapReduce – a programming model for large scale data processing.

Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on

their MapReduce and Google File System. Apache Hadoop is a registered trademark of the Apache Software

Foundation.

Architecture:

A multi-node Hadoop cluster:

A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a

JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and

TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and

shutdown scripts require that Secure Shell (ssh) be set up between nodes in the cluster.

In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the file system

index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thus

preventing file-system corruption and reducing loss of data. Similarly, a standalone JobTracker server can

manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate

file system, the NameNode, secondary NameNode, and DataNode architecture of HDFS are replaced by the

file-system-specific equivalents.

Hadoop distributed file system

The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written

in Java for the Hadoop framework. A Hadoop cluster has nominally a single name node plus a cluster of

data nodes. Each data node serves up blocks of data over the network using a block protocol specific to

HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure call (RPC) to

communicate with each other.


HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It

achieves reliability by replicating the data across multiple hosts. With the default replication value 3, data

is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other

to rebalance data, to move copies around, and to keep the replication of data high. HDFS stores data by

dividing large files into fixed-size blocks, 64 MB each by default (the block size is configurable and is

commonly set to 128 MB).
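The replication factor and block size discussed above are ordinary client-side configuration in Hadoop. The following is a minimal sketch in Java using the standard Hadoop FileSystem API (added for illustration; the file path is hypothetical, and dfs.replication / dfs.blocksize are the usual configuration keys in recent Hadoop versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for 3 replicas and 64 MB blocks (the classic defaults described above).
        conf.set("dfs.replication", "3");
        conf.set("dfs.blocksize", String.valueOf(64L * 1024 * 1024));

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path

        // Write a file; HDFS transparently splits larger files into blocks
        // and replicates each block on multiple DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read it back; the client obtains block locations from the NameNode
        // and then streams the data directly from a DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}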

Understanding HDFS components

HDFS is managed with the master-slave architecture included with the following components:

• NameNode:

This is the master of the HDFS system. It maintains the directories, files, and manages the blocks that

are present on the DataNodes.

Only one per Hadoop cluster.

Manages the file system namespace and metadata.

Single point of failure, mitigated by writing its state to multiple file systems.

Because it is a single point of failure, do not use inexpensive commodity hardware for this node; it also has

large memory requirements.

• DataNode:

These are slaves that are deployed on each machine and provide actual storage. They are responsible for

serving read-and-write data requests for the clients.

Many per Hadoop cluster.

Manages blocks with data and serves them to clients.

Periodically reports to the NameNode the list of blocks it stores.

Use inexpensive commodity hardware for this node.

• Secondary NameNode: The HDFS file system includes a so-called secondary namenode, a misleading

name that some might incorrectly interpret as a backup namenode for when the primary namenode goes

offline. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots

of the primary namenode's directory information, which the system then saves to local or remote directories.

These checkpointed images can be used to restart a failed primary namenode without having to replay the


entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because

the namenode is the single point for storage and management of metadata, it can become a bottleneck for

supporting a huge number of files, especially a large number of small files. HDFS Federation, a new

addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate

namenodes.

Understanding the MapReduce architecture:

MapReduce is also implemented over master-slave architectures. Classic MapReduce contains job

submission, job initialization, task assignment, task execution, progress and status update, and job

completion-related activities, which are mainly managed by the JobTracker node and executed by

TaskTracker. The client application submits a job to the JobTracker, and the input is divided across the cluster.

The JobTracker then calculates the number of map and reduce tasks to be processed and commands the

TaskTrackers to start executing the job. Each TaskTracker copies the required resources to its local machine and

launches a JVM to run the map and reduce programs over the data. Along with this, the TaskTracker periodically

sends updates to the JobTracker; these heartbeats carry the job ID, job

status, and resource usage.

Understanding MapReduce components

MapReduce is managed with master-slave architecture included with the following components:

• JobTracker:

This is the master node of the MapReduce system, which manages the jobs and resources in the

cluster (TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the data being

processed, ideally on a TaskTracker running on the same DataNode that holds the underlying block.

One per Hadoop cluster.

Receives job requests submitted by client.

Schedules and monitors MapReduce jobs on task trackers.

• TaskTracker:

These are the slaves that are deployed on each machine. They are responsible for running the map

and reducing tasks as instructed by the JobTracker.


Many per Hadoop cluster.

Executes MapReduce operations.

CLOUD FILE SYSTEMS: GFS AND HDFS

The Google File System (GFS) is designed to manage relatively large files using a very large distributed

cluster of commodity servers connected by a high-speed network. It is therefore designed to (a) expect and

tolerate hardware failures, even during the reading or writing of an individual file (since files are expected to

be very large) and (b) support parallel reads, writes and appends by multiple client programs. A common

use case that is efficiently supported is that of many 'producers' appending to the same file in parallel,

which is also being simultaneously read by many parallel 'consumers'. Traditional parallel

databases, in contrast, do not make similar assumptions regarding the prevalence of failures or the

expectation that failures will occur even during large computations; as a result, they also do not scale as well.

The Hadoop Distributed File System (HDFS) is an open source implementation of the GFS architecture that

is also available on the Amazon EC2 cloud platform.

We refer to both GFS and HDFS as 'cloud file systems.' The architecture of cloud file systems is

illustrated in Figure 10.3. Large files are broken up into 'chunks' (GFS) or 'blocks' (HDFS), which are

themselves large (64MB being typical). These chunks are stored on commodity (Linux) servers called

Chunk Servers (GFS) or Data Nodes (HDFS); further each chunk is replicated at least three times, both on a

different physical rack as well as a different network segment in anticipation of possible failures of these

components apart from server failures.

When a client program ('cloud application') needs to read/write a file, it sends the full path and

offset to the Master (GFS) which sends back meta-data for one (in the case of read) or all (in the case of

write) of the replicas of the chunk where this data is to be found. The client caches such meta-data so that it

need not contact the Master each time. Thereafter the client directly reads data from the designated chunk

server/DataNode. This data is not cached since most reads are large and caching would complicate writes.


Figure: Anatomy of HDFS read operations.

In case of a write, in particular an append, the client sends only the data to be appended to all the

chunk servers/DataNodes; when they all acknowledge receiving this data it informs a designated 'primary'

chunk server, whose identity it receives (and also caches) from the Master. The primary chunk server

appends its copy of data into the chunk at an offset of its choice; note that this may be beyond the EOF to

account for multiple writers who may be appending to this file simultaneously.

The primary then forwards the request to all other replicas which in turn write the data at the same

offset if possible or return a failure. In case of a failure the primary rewrites the data at possibly another

offset and retries the process. The Master maintains regular contact with each chunk server through

heartbeat messages and in case it detects a failure its meta-data is updated to reflect this, and if required

assigns a new primary for the chunks being served by a failed chunk server. Since clients cache meta-data,

occasionally they will try to connect to failed chunk servers, in which case they update their meta-data from

the master and retry. In [26] it is shown that this architecture efficiently supports multiple parallel readers

and writers. It also supports writing (appending) and reading the same file by parallel sets of writers and

readers while maintaining a consistent view, i.e. each reader always sees the same data regardless of the

replica it happens to read from. Finally, note that computational processes (the ‗client‘ applications above)

run on the same set of servers that files are stored on.

Figure: Anatomy of HDFS write operations.


BIGTABLE & HBASE:

BigTable is a distributed structured storage system built on GFS; Hadoop's HBase is a similar open source

system that uses HDFS. BigTable is accessed by a row key, column key and a timestamp. Each column can

store arbitrary name–value pairs of the form (column-family:label, string). The set of possible column-

families for a table is fixed when it is created whereas columns, i.e. labels within the column family, can be

created dynamically at any time. Column families are stored close together in the distributed file system;

thus the BigTable model shares elements of column-oriented databases.

We illustrate these features below through an example. Figure 10.4 illustrates the BigTable data

structure: Each row stores information about a specific sale transaction and the row key is a transaction

identifier. The 'location' column family stores columns relating to where the sale occurred, whereas the

'product' column family stores the actual products sold and their classification. Note that there are two

values for region having different timestamps, possibly because of a reorganization of sales regions.

Since data in each column family is stored together, using this data organization results in efficient

data access patterns depending on the nature of analysis: For example, only the location column family may

be read for traditional data-cube based analysis of sales, whereas only the product column family is needed

for say, market-basket analysis. Thus, the BigTable structure can be used in a manner similar to a column-

oriented database. Figure 10.5 illustrates how BigTable tables are stored on a distributed file system such as

GFS or HDFS. Each table is split into different row ranges, called tablets. Each tablet is managed by a tablet

server that stores each column family for the given row range in a separate distributed file, called an

SSTable. Additionally, a single Metadata table is managed by a meta-data server that is used to locate the

tablets of any user table in response to a read or write request. The Metadata table itself can be large and is

also split into tablets, with the root tablet being special in that it points to the locations of other meta-data

tablets. BigTable and HBase rely on the underlying distributed file systems GFS and HDFS respectively and

therefore also inherit some of the properties of these systems. In particular large parallel reads and inserts are

efficiently supported, even simultaneously on the same table, unlike a traditional relational database. In

particular, reading all rows for a small number of column families from a large table, such as in aggregation

queries, is efficient in a manner similar to column-oriented databases. Similarly, the consistency properties

of large parallel inserts are stronger than that for parallel random writes, as is pointed out in. Further, writes

can even fail if a few replicas are unable to write even if other replicas are successfully updated.
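To make the row key / column-family:label model concrete, the following is a minimal sketch using the standard HBase Java client API (added for illustration; the table name sales and the column families location and product mirror the Figure 10.4 example and are otherwise hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SalesTableDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table sales = conn.getTable(TableName.valueOf("sales"))) {

            // Row key = transaction identifier; each cell is addressed as family:label
            // and automatically carries a timestamp (multiple versions may coexist).
            Put put = new Put(Bytes.toBytes("txn-0001"));
            put.addColumn(Bytes.toBytes("location"), Bytes.toBytes("region"), Bytes.toBytes("West"));
            put.addColumn(Bytes.toBytes("product"), Bytes.toBytes("category"), Bytes.toBytes("Books"));
            sales.put(put);

            // Read back only the 'location' family, as a data-cube style analysis would.
            Get get = new Get(Bytes.toBytes("txn-0001"));
            get.addFamily(Bytes.toBytes("location"));
            Result row = sales.get(get);
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("location"), Bytes.toBytes("region"))));
        }
    }
}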


DYNAMO (AMAZON):

Dynamo is a distributed data system developed at Amazon that underlies its SimpleDB key-

value pair database. Unlike BigTable, Dynamo was designed specifically for supporting a large volume of

concurrent updates, each of which could be small in size, rather than bulk reads and appends as in the case

of BigTable and GFS. Dynamo's data model is that of simple key-value pairs, and it is expected that

applications read and write such data objects fairly randomly. This model is well suited for many web-based

e-commerce applications that all need to support constructs such as a 'shopping cart.' Dynamo also

replicates data for fault tolerance, but uses distributed object versioning and quorum-consistency to enable

writes to succeed without waiting for all replicas to be successfully updated, unlike in the case of GFS.

Managing conflicts if they arise is relegated to reads which are provided enough information to

enable application dependent resolution. Because of these features, Dynamo does not rely on any underlying

distributed file system and instead directly manages data storage across distributed nodes. The architecture

of Dynamo is illustrated in Figure 10.6. Objects are key-value pairs whose keys and values are arbitrary arrays of bytes. An MD5

hash of the key is used to generate a 128-bit hash value. The range of this hash function is mapped to a set of

virtual nodes arranged in a ring, so each key gets mapped to one virtual node. The object is replicated at this

primary virtual node as well as N − 1 additional virtual nodes (where N is fixed for a particular Dynamo

cluster). Each physical node (server) handles a number of virtual nodes at distributed positions on the ring so

as to continuously distribute load evenly as nodes leave and join the cluster because of transient failures or

network partitions. Notice that the Dynamo architecture is completely symmetric, with each node being equal, unlike

the BigTable/GFS architecture that has special master nodes at both the BigTable as well as GFS layer. A

write request on an object is first executed at one of its virtual nodes which then forward the request to all

nodes having replicas of the object. Objects are always versioned, so a write merely creates a new version of

the object with its local timestamp (Tx on node X) incremented. Thus the timestamps capture the history of

object updates; versions that are superseded by later versions having a larger vector timestamp are discarded.

For example, two sequential updates at node X would create an object version with vector timestamp to [2 0


0], so an earlier version with timestamp [1 0 0] can be safely discarded. However, if the second write took

place at node Y before the first write had propagated to this replica, it would have a timestamp of [0 1 0]. In

this case even when the first write arrives at Y (and Y's write arrives symmetrically at X), the two versions

[1 0 0] and [0 1 0] would both be maintained and returned to any subsequent read to be resolved using

application-dependent logic. Say this read took place at node Z and was reconciled by the application which

then further updated the object; the new timestamp for the object would be set to [1 1 1], and as this

supersedes other versions they would be discarded once this update was propagated to all replicas. We

mention in passing that such vector-timestamp-based ordering of distributed events was first conceived of by

Lamport in [35].
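The version bookkeeping described above can be sketched in a few lines (a simplified illustration, not Dynamo's actual code): each replica keeps one counter per node, a version supersedes another only if it is at least as large in every position, and otherwise the two versions conflict and are handed to the application to reconcile.

import java.util.Arrays;

// Simplified vector timestamp over a fixed set of nodes (X = 0, Y = 1, Z = 2).
public class VectorClock {
    final int[] counts;

    VectorClock(int... counts) { this.counts = counts.clone(); }

    // True if this version supersedes (dominates) the other one.
    boolean dominates(VectorClock other) {
        boolean strictlyGreater = false;
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] < other.counts[i]) return false;
            if (counts[i] > other.counts[i]) strictlyGreater = true;
        }
        return strictlyGreater;
    }

    boolean conflictsWith(VectorClock other) {
        return !dominates(other) && !other.dominates(this);
    }

    public static void main(String[] args) {
        VectorClock atX = new VectorClock(2, 0, 0);  // two sequential writes at node X
        VectorClock old = new VectorClock(1, 0, 0);  // superseded version
        VectorClock atY = new VectorClock(0, 1, 0);  // concurrent write at node Y

        System.out.println(atX.dominates(old));       // true  -> [1 0 0] can be discarded
        System.out.println(atX.conflictsWith(atY));   // true  -> both versions are kept for the reader
        System.out.println(Arrays.toString(new VectorClock(1, 1, 1).counts)); // reconciled at node Z
    }
}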

In Dynamo write operations are allowed to return even if all replicas are not updated. However a

quorum protocol is used to maintain eventual consistency of the replicas when a large number of concurrent

reads and writes take place: Each read operation accesses R replicas and each write ensures propagation to

W replicas; as long as R +W > N the system is said to be quorum consistent [14]. Thus, if we want very

efficient writes, we pay the price of having to read many replicas, and vice versa. In practice Amazon uses N = 3,

with R and W being configurable depending on what is desired; for a high update frequency one uses W = 1, R = 3,

whereas for a high-performance read store W = 3, R = 1 is used.

Dynamo is able to handle transient failures by passing writes intended for a failed node to another

node temporarily. Such replicas are kept separately and scanned periodically with replicas being sent back to

their intended node as soon as it is found to have revived. Finally, Dynamo can be implemented using

different storage engines at the node level, such as Berkeley DB or even MySQL; Amazon is said to use the

former in production.
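The ring of virtual nodes used by Dynamo can also be sketched with an MD5 hash and a sorted map. This is an illustrative simplification (fixed virtual-node labels, no handling of node joins, hinted handoff, or skipping virtual nodes of the same physical server), not Amazon's implementation:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DynamoRing {
    // Ring position (128-bit MD5 value) -> virtual node label.
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addVirtualNode(String label) {
        ring.put(md5(label), label);
    }

    // The N virtual nodes responsible for a key: the first node clockwise
    // from the key's hash, followed by the next N - 1 nodes on the ring.
    List<String> preferenceList(String key, int n) {
        List<String> nodes = new ArrayList<>();
        Map.Entry<BigInteger, String> e = ring.ceilingEntry(md5(key));
        if (e == null) e = ring.firstEntry();        // wrap around the ring
        while (nodes.size() < n) {
            nodes.add(e.getValue());
            e = ring.higherEntry(e.getKey());
            if (e == null) e = ring.firstEntry();
        }
        return nodes;
    }

    private static BigInteger md5(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);        // interpret the 128-bit hash as a positive integer
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    public static void main(String[] args) {
        DynamoRing r = new DynamoRing();
        for (String v : new String[]{"A-1", "A-2", "B-1", "B-2", "C-1", "C-2"}) {
            r.addVirtualNode(v);                     // two virtual nodes each for servers A, B, C
        }
        System.out.println(r.preferenceList("cart:user42", 3));  // key replicated at N = 3 virtual nodes
    }
}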

MAPREDUCE AND EXTENSIONS

The MapReduce programming model was developed at Google in the process of implementing large-scale

search and text processing tasks on massive collections of web data stored using BigTable and the GFS

distributed file system. The MapReduce programming model is designed for processing and generating large


volumes of data via massively parallel computations utilizing tens of thousands of processors at a time. The

underlying infrastructure to support this model needs to assume that processors and networks will fail, even

during a particular computation, and build in support for handling such failures while ensuring progress of

the computations being performed.

Hadoop is an open source implementation of the MapReduce model developed at Yahoo, and

presumably also used internally. Hadoop is also available on pre-packaged AMIs in the Amazon EC2 cloud

platform, which has sparked interest in applying the MapReduce model for large-scale, fault-tolerant

computations in other domains, including such applications in the enterprise context.

PARALLEL COMPUTING

Parallel computing has a long history with its origins in scientific computing in the late 60s and early 70s.

Different models of parallel computing have been used based on the nature and evolution of multiprocessor

computer architectures. The shared-memory model assumes that any processor can access any memory

location, but not equally fast. In the distributed memory model each processor can address only its own

memory and communicates with other processors using message passing over the network. In scientific

computing applications for which these models were developed, it was assumed that data would be loaded

from disk at the start of a parallel job and then written back once the computations had been completed, as

scientific tasks were largely compute bound. Over time, parallel computing also began to be applied in the

database arena: storage architectures such as SAN and NAS appeared, and database systems supporting

shared-memory, shared-disk and shared-nothing models became available.

The premise of parallel computing is that a task that takes time T should take time T/p if executed

on p processors. In practice, inefficiencies are introduced by distributing the computations such as

(a) The need for synchronization among processors,

(b) Overheads of communication between processors through messages or disk, and

(c) Any imbalance in the distribution of work to processors.

Thus in practice the time Tp to execute on p processors is greater than T/p, and the parallel efficiency of an

algorithm is defined as:

ε = T / (p Tp) ……………………………………………... (11.1)

A scalable parallel implementation is one where:

(a) The parallel efficiency remains constant as the size of data is increased along with a corresponding

increase in processors and

(b) The parallel efficiency increases with the size of data for a fixed number of processors.

We illustrate how parallel efficiency and scalability depends on the algorithm, as well as the nature

of the problem, through an example. Consider a very large collection of documents; say web pages crawled

from the entire internet. The problem is to determine the frequency (i.e., total number of occurrences) of

each word in this collection. Thus, if there are n documents and m distinct words, we wish to determine m

frequencies, one for each word.


Now we compare two approaches to compute these frequencies in parallel using p processors:

(a) Let each processor compute the frequencies for m/p words and

(b) Let each processor compute the frequencies of m words across n/p documents, followed by all the

processors summing their results.

At first glance it appears that approach

(a), where each processor works independently, may be more efficient as compared to approach

(b), where the processors need to communicate with each other to add up all the frequencies. However, a more careful

analysis reveals otherwise: We assume a distributed-memory model with a shared disk, so that each

processor is able to access any document from disk in parallel with no contention.

Further we assume that the time spent c for reading each word in the document is the same as that of

sending it to another processor via inter-processor communication. On the other hand, the time to add to a

running total of frequencies is negligible as compared to the time spent on a disk read or inter-processor

communication, so we ignore the time taken for arithmetic additions in our analysis.

Finally, assume that each word occurs f times in a document, on average. With these assumptions,

the time for computing all the m frequencies with a single processor is n × m × f × c, since each word needs

to be read approximately f times in each document.

Using approach (a) each processor reads approximately n × m × f words and adds them n × m/p × f

times. Ignoring the time spent in additions, the parallel efficiency can be calculated as:

εa = (c n m f) / (p × c n m f) = 1/p ……………………………………………... (11.2)

Since the efficiency falls as 1/p with increasing p, the algorithm is not scalable.

On the other hand using approach (b) each processor performs approximately n/p×m×f reads and the same

number of additions in the first phase, producing p vectors of m partial frequencies, which can be written to

disk in parallel by each processor in time cm. In the second phase these vectors of partial frequencies need to

be added: First each processor sends p – 1 sub-vectors of size m/p to each of the remaining processors. Each

processor then adds p sub-vectors locally to compute one pth of the final m-vector of frequencies. The

parallel efficiency is computed as:

εb = (c n m f) / (p (c n m f / p + 2 c m)) = 1 / (1 + 2p/(n f)) ……………………………………………... (11.3)

Since in practice p << nf the efficiency of approach (b) is higher than that of approach (a), and can even be

close to one: For example, with n = 10 000 documents and f = 10, the condition (11.3) works out to p << 50

000, so method (b) is efficient (εb ≈ 0.9) even with thousands of processors. The reason is that in the first

approach each processor is reading many words that it need not read, resulting in wasted work, whereas in

the second approach every read is useful in that it results in a computation that contributes to the final


answer. Algorithm (b) is also scalable, since εb remains constant as p and n both increase, and approaches

one as n increases for a fixed p.
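As a quick numerical check of the two results above (a worked sketch, taking p = 5 000 as a representative value for 'thousands of processors', with n = 10 000 and f = 10 as in the example):

\epsilon_a = \frac{1}{p} = \frac{1}{5\,000} = 0.0002,
\qquad
\epsilon_b = \frac{1}{1 + 2p/(nf)} = \frac{1}{1 + \frac{2 \times 5\,000}{10\,000 \times 10}} = \frac{1}{1.1} \approx 0.91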

THE MAPREDUCE MODEL

Traditional parallel computing algorithms were developed for systems with a small number of processors,

dozens rather than thousands. So it was safe to assume that processors would not fail during a computation.

At significantly larger scales this assumption breaks down, as was experienced at Google in the course of

having to carry out many large-scale computations similar to the one in our word counting example. The

MapReduce parallel programming abstraction was developed in response to these needs, so that it could be

used by many different parallel applications while leveraging a common underlying fault-tolerant

implementation that was transparent to application developers. Figure 11.1 illustrates MapReduce using the

word counting example where we needed to count the occurrences of each word in a collection of

documents.

MapReduce proceeds in two phases, a distributed 'map' operation followed by a distributed 'reduce'

operation; at each phase a configurable number of M 'mapper' processors and R 'reducer' processors are

assigned to work on the problem (we have used M = 3 and R = 2 in the illustration). The computation is

coordinated by a single master process (not shown in the figure).

A MapReduce implementation of the word counting task proceeds as follows: In the map phase each

mapper reads approximately 1/Mth of the input (in this case documents), from the global file system, using

locations given to it by the master. Each mapper then performs a 'map' operation to compute word

frequencies for its subset of documents. These frequencies are sorted by the words they represent and

written to the local file system of the mapper. At the next phase reducers are each assigned a subset of

words; in our illustration the first reducer is assigned w1 and w2 while the second one handles w3 and w4.

In fact during the map phase itself each mapper writes one file per reducer, based on the words assigned to

each reducer, and keeps the master informed of these file locations. The master in turn informs the reducers

where the partial counts for their words have been stored on the local files of respective mappers; the

reducers then make remote procedure call requests to the mappers to fetch these. Each reducer performs a

'reduce' operation that sums up the frequencies for each word, which are finally written back to the GFS file

system.
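The word-count job just described is the canonical Hadoop example. The following is a minimal sketch of it using the standard org.apache.hadoop.mapreduce API (added for illustration; the input and output paths are taken from the command line and are otherwise hypothetical):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (document offset, line of text) -> [(word, 1)]
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [partial counts]) -> (word, total count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input documents in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // word counts written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}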

The MapReduce programming model generalizes the computational structure of the above example.

Each map operation consists of transforming one set of key-value pairs to another:

Map: (k1, v1) → [(k2, v2)]……………………………… (11.4)

In our example each map operation takes a document indexed by its id and emits a list of word-count pairs

indexed by word-id: (dk, [w1 . . .wn]) → [(wi, ci)]. The reduce operation groups the results of the map step

using the same key k2 and performs a function f on the list of values that correspond to each key:

Reduce: (k2, [v2]) → (k2, f ([v2]))………………………. (11.5)


In our example each reduce operation sums the frequency counts for each word:

(wi, [ci1, ci2, . . .]) → (wi, Σj cij)

The implementation also generalizes. Each mapper is assigned an input-key range (set of values for k1) on

which map operations need to be performed. The mapper writes results of its map operations to its local disk

in R partitions, each corresponding to the output-key range (values of k2) assigned to a particular reducer,

and informs the master of these locations. Next each reducer fetches these pairs from the respective mappers

and performs reduce operations for each key k2 assigned to it. If a processor fails during the execution, the

master detects this through regular heartbeat communications it maintains with each worker, wherein

updates are also exchanged regarding the status of tasks assigned to workers.

If a mapper fails, then the master reassigns the key-range designated to it to another working node

for re-execution. Note that re-execution is required even if the mapper had completed some of its map

operations, because the results were written to local disk rather than the GFS. On the other hand if a reducer

fails only its remaining tasks (values k2) are reassigned to another node, since the completed tasks would

already have been written to the GFS.

Finally, heartbeat failure detection can be fooled by a wounded task that has a heartbeat but is

making no progress: Therefore, the master also tracks the overall progress of the computation and if results

from the last few processors in either phase are excessively delayed, these tasks are duplicated and assigned

to processors who have already completed their work. The master declares the task completed when any one

of the duplicate workers completes.


The MapReduce model is widely applicable to a number of parallel computations, including

database-oriented tasks which we cover later. Indexing large collections is not only important in web search,

but also a critical aspect of handling structured data; so it is important to know that it can be executed

efficiently in parallel using MapReduce. Traditional parallel databases focus on rapid query execution

against data warehouses that are updated infrequently; as a result these systems often do not parallelize

index creation sufficiently well.

PARALLEL EFFICIENCY OF MAPREDUCE

As we have seen earlier, parallel efficiency is impacted by overheads such as synchronization and

communication costs, or load imbalance. The MapReduce master process is able to balance load efficiently

if the number of map and reduce operations are significantly larger than the number of processors. For large

data sets this is usually the case (since an individual map or reduce operation usually deals with a single

document or record). However, communication costs in the distributed file system can be significant,

especially when the volume of data being read, written and transferred between processors is large.


For the purposes of our analysis we assume a general computational task, on a volume of data D,

which takes wD time on a uniprocessor, including the time spent reading data from disk, performing

computations, and writing it back to disk (i.e. we assume that computational complexity is linear in the size

of data). Let c be the time spent reading one unit of data (such as a word) from disk. Further, let us assume

that our computational task can be decomposed into map and reduce stages as follows: First cmD

computations are performed in the map stage, producing σD data as output. Next the reduce stage performs

crσD computations on the output of the map stage, producing σμD data as the final result. Finally, we

assume that our decomposition into a map and reduce stages introduces no additional overheads when run

on a single processor, such as having to write intermediate results to disk, and so

wD = cD + cmD + crσD + cσμD………………………………… (11.6)

Now consider running the decomposed computation on P processors that serve as both mappers and

reducers in respective phases of a MapReduce based parallel implementation. As compared to the single

processor case, the additional overhead in a parallel MapReduce implementation is between the map and

reduce phases where each mapper writes to its local disk followed by each reducer remotely reading from

the local disk of each mapper. For the purposes of our analysis we shall assume that the time spent reading a

word from a remote disk is also c, i.e. the same as for a local read. Each mapper produces approximately

σD/P data that is written to a local disk (unlike in the uniprocessor case), which takes cσD/P time. Next,

after the map phase, each reducer needs to read its partition of data from each of the P mappers, with

approximately one Pth of the data at each mapper by each reducer, i.e. σD/P². The entire exchange can be

executed in P steps. Thus the transfer time is cσD/P² × P = cσD/P. The total overhead in the parallel

implementation because of intermediate disk writes and reads is therefore 2cσD/P. We can now compute the

parallel efficiency of the MapReduce implementation as:

εMR = wD / (P (wD/P + 2cσD/P)) = 1 / (1 + 2cσ/w) ……………………………………………... (11.7)

Let us validate (11.7) above for our parallel word counting example discussed in Section 11.1: The volume

of data is D = nmf. We ignore the time spent in adding word counts, so cr = cm = 0. We also did not include

the (small) time cm for writing the final result to disk. So wD = wnmf = cnmf, or w = c. The map phase

produces mP partial counts, so σ = mP/(nmf) = P/(nf). Using (11.7) and c = w we reproduce (11.3) as computed

earlier. It is important to note how εMR depends on σ, the 'compression' in data achieved in the map phase,

and its relation to the number of processors P. To illustrate this dependence, let us recall the definition (11.4)

of a map operation, as applied to the word counting problem, i.e. (dk, [w1 . . . wn]) → [(wi, ci)]. Each map

operation takes a document as input and emits a partial count for each word in that document alone, rather

than a partial sum across all the documents it sees. In this case the output of the map phase is of size mn (an

m-vector of counts for each document). So, σ = mn/(nmf) = 1/f and the parallel efficiency is 1/(1 + 2/f),

independent of data size or number of processors, which is not scalable.
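The two cases just worked out can be summarized as follows (a sketch in the notation of (11.7), with w = c for the word-count example):

\epsilon_{MR} = \frac{1}{1 + 2c\sigma/w} =
\begin{cases}
\dfrac{1}{1 + 2/f}, & \sigma = 1/f \ \ \text{(each map emits per-document counts; not scalable)}\\[1.5ex]
\dfrac{1}{1 + 2P/(nf)}, & \sigma = P/(nf) \ \ \text{(each mapper emits partial sums; scalable, reproducing (11.3))}
\end{cases}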


A strict implementation of MapReduce as per the definitions (11.4) and (11.5) does not allow for partial

reduction across all input values seen by a particular reducer, which is what enabled the parallel

implementation of Section 11.1 to be highly efficient and scalable. Therefore, in practice the map phase

usually includes a combine operation in addition to the map, defined as follows:

Combine: (k2, [v2]) → (k2, fc([v2])). ………………………….(11.8)

The function fc is similar to the function f in the reduce operation but is applied only across documents

processed by each mapper, rather than globally. The equivalence of a MapReduce implementation with and

without a combiner step relies on the reduce function f being commutative and associative,

i.e. f (v1, v2, v3) = f (v3, f (v1, v2)).

Finally, recall our definition of a scalable parallel implementation: A MapReduce implementation is scalable

if we are able to achieve an efficiency that approaches one as data volume D grows, and remains constant as

D and P both increase. Using combiners is crucial to achieving scalability in practical MapReduce

implementations by achieving a high degree of data 'compression' in the map phase, so that σ is

proportional to P/D, which in turn results in scalability due to (11.7).
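In the Hadoop API the combiner is simply a Reducer applied to each mapper's local output before the shuffle. For the WordCount sketch shown earlier, reusing IntSumReducer as the combiner is valid precisely because integer addition (the reduce function f) is commutative and associative (a hedged addition to that sketch, not from the original notes):

import org.apache.hadoop.mapreduce.Job;

public class WordCountWithCombiner {
    // One extra line in the WordCount driver enables local partial sums on each
    // mapper, increasing the data 'compression' sigma achieved in the map phase.
    static void enableCombiner(Job job) {
        job.setCombinerClass(WordCount.IntSumReducer.class);
    }
}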

RELATIONAL OPERATIONS USING MAPREDUCE

Enterprise applications rely on structured data processing, which over the years has become virtually

synonymous with the relational data model and SQL. Traditional parallel databases have become fairly

sophisticated in automatically generating parallel execution plans for SQL statements. At the same time

these systems lack the scale and fault-tolerance properties of MapReduce implementations, naturally

motivating the quest to execute SQL statements on large data sets using the MapReduce model. Parallel

joins in particular are well studied, and so it is instructive to examine how a relational join could be executed

in parallel using MapReduce. Figure 11.2 illustrates such an example: Point of sale transactions taking place

at stores (identified by addresses) are stored in a Sales table. A Cities table captures the addresses that fall

within each city. In order to compute the gross sales by city these two tables need to be joined using SQL as

shown in the figure. The MapReduce implementation works as follows: In the map step, each mapper reads

a (random) subset of records from each input table Sales and Cities, and segregates each of these by address,

i.e. the reduce key k2 is 'address.' Next each reducer fetches Sales and Cities data for its assigned range of

address values from each mapper, and then performs a local join operation including the aggregation of sale

value and grouping by city. Note that since addresses are randomly assigned to reducers, sales aggregates for

any particular city will still be distributed across reducers. A second MapReduce step is needed to group the

results by city and compute the final sales aggregates. Parallel SQL implementations usually distribute the

smaller table, Cities in this case, to all processors. As a result, local joins and aggregations can be performed

in the first map phase itself, followed by a reduce phase using city as the key, thus obviating the need for

two phases of data exchange. Naturally there have been efforts at automatically translating SQL-like

statements to a map-reduce framework. Two notable examples are Pig Latin developed at Yahoo!, and Hive

developed and used at Facebook. Both of these are open source tools available as part of the Hadoop project,


and both leverage the Hadoop distributed file system HDFS. Figure 11.3 illustrates how the above SQL

query can be represented using the Pig Latin language as well as the HiveQL dialect of SQL. Pig Latin has

features of an imperative language, wherein a programmer specifies a sequence of transformations that each

read and write large distributed files. The Pig Latin compiler generates MapReduce phases by treating each

GROUP (or COGROUP) statement as defining a map-reduce boundary, and pushing remaining statements

on either side into the map or reduce steps.

HiveQL, on the other hand, shares SQL's declarative syntax. Once again though, as in Pig Latin,

each JOIN and GROUP operation defines a map-reduce boundary. As depicted in the figure, the Pig Latin as

well as HiveQL representations of our SQL query translate into two MapReduce phases similar to our

example of Figure 11.2.
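A minimal sketch of the first, reduce-side join phase of Figure 11.2 in plain Hadoop Java is given below; the comma-separated layouts of the Sales (address,saleValue) and Cities (address,city) inputs are assumed purely for illustration, and the per-city totals it emits would still need the second aggregation phase described above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesCityJoin {

    // Tag Sales records by their source: (address, "S,<saleValue>")
    public static class SalesMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");        // address,saleValue
            ctx.write(new Text(f[0]), new Text("S," + f[1]));
        }
    }

    // Tag Cities records by their source: (address, "C,<city>")
    public static class CitiesMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");        // address,city
            ctx.write(new Text(f[0]), new Text("C," + f[1]));
        }
    }

    // Join on address (the reduce key k2), then emit the sales total for that address keyed by city.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text address, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String city = null;
            double total = 0.0;
            for (Text v : values) {
                String[] f = v.toString().split(",", 2);
                if (f[0].equals("C")) city = f[1];
                else total += Double.parseDouble(f[1]);
            }
            if (city != null) {
                ctx.write(new Text(city), new Text(String.valueOf(total)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sales-city join (phase 1)");
        job.setJarByClass(SalesCityJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SalesMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, CitiesMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}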


Pig Latin is ideal for executing sequences of large-scale data transformations using MapReduce. In the

enterprise context it is well suited for the tasks involved in loading information into a data warehouse.

HiveQL, being more declarative and closer to SQL, is a good candidate for formulating analytical queries on

a large distributed data warehouse.

There has been considerable interest in comparing the performance of MapReduce-based

implementations of SQL queries with that of traditional parallel databases, especially specialized column-

oriented databases tuned for analytical queries. In general, as of this writing, parallel databases are still faster

than available open source implementations of MapReduce (such as Hadoop), for smaller data sizes using

fewer processors, where fault tolerance is less critical. MapReduce-based implementations, on the other hand,

are able to handle orders of magnitude larger data using massively parallel clusters in a fault-tolerant

manner. MapReduce is also preferable over traditional databases if data needs to be processed only once and

then discarded: As an example, the time required to load some large data sets into a database is 50 times

greater than the time to both read and perform the required analysis using MapReduce. On the contrary, if

data needs to be stored for a long time, so that queries can be performed against it regularly, a traditional

database wins over MapReduce, at least as of this writing. HadoopDB is an attempt at combining the

advantages of MapReduce and relational databases by using databases locally within nodes while using

MapReduce to coordinate parallel execution. Another example is SQL/MR from Aster Data that enhances a

set of distributed SQL-compliant databases with MapReduce programming constructs. Needless to say,

relational processing using MapReduce is an active research area and many improvements to the available

state of the art are to be expected in the near future.

PIG:

Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of

creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into the Apache

Software Foundation.

Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this

platform is called Pig Latin. Pig Latin is a high-level data-flow scripting language that enables data

workers to write complex data transformations without knowing Java. Pig Latin can be extended using UDFs

(User Defined Functions), which the user can write in Java, Python, JavaScript, Ruby, or Groovy and then

call directly from the language. Pig works with data from many sources, including structured and

unstructured data, and stores the results into the Hadoop Distributed File System (HDFS).
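As a brief sketch of the UDF mechanism (the jar name myudfs.jar and the class com.example.pig.ToUpper are hypothetical), a Java UDF is packaged into a jar, registered, and then called like a built-in function:

REGISTER 'myudfs.jar';
DEFINE ToUpper com.example.pig.ToUpper();
users = LOAD 'users.txt' AS (name:chararray, city:chararray);
upper = FOREACH users GENERATE ToUpper(name) AS name, city;
DUMP upper;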

PIG ARCHITECTURE:

The Pig Latin compiler converts the Pig Latin code into executable code. The executable code is in the form

of MapReduce jobs. The sequence of MapReduce programs enables Pig programs to do data processing and

analysis in parallel, leveraging Hadoop MapReduce and HDFS.

Pig programs can run on MapReduce v1 or MapReduce v2 without any code changes, regardless of which

mode your cluster is running in. Pig scripts can also run on the Tez API instead. Apache Tez


provides a more efficient execution framework than MapReduce. YARN enables application frameworks

other than MapReduce (like Tez) to run on Hadoop.

Figure: Pig architecture.

Executing Pig

Pig has two modes for executing Pig scripts:

1) Local mode: When you run Pig in local mode, the Pig program runs in the context of a local Java

Virtual Machine, and data access is via the local file system of a single machine. To run Pig in local mode,

type pig -x local in a terminal.

2) MapReduce mode (also known as Hadoop mode): MapReduce mode runs on a Hadoop cluster and converts

the Pig statements into MapReduce code. To run Pig in MapReduce mode, simply type pig or pig -x mapreduce.

Pig programs can be packaged in three different ways:

✓ Script: This method is nothing more than a file containing Pig Latin commands, identified by the .pig

suffix (FlightData.pig, for example).

✓ Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin at the Grunt

command line and immediately see the response.

✓ Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript programs.

PIG DATA TYPES:

Pig's data types can be divided into two categories: scalar types, which contain a single value, and complex

types, which contain other types.

SCALAR TYPES

Table: Atomic (scalar) data types in Pig Latin


int Signed 32-bit integer

long Signed 64-bit integer

float 32-bit floating point

double 64-bit floating point

chararray Character array (string) in Unicode UTF-8

bytearray Byte array (binary object)

COMPLEX DATA TYPES:

Tuple (12.5,hello world,-2) A tuple is an ordered set of fields. It is most often used as a row in a relation,

and is represented by fields separated by commas, all enclosed by parentheses.

Bag {(12.5,hello world,-2),(2.87,bye world,10)} A bag is an unordered collection of tuples. A relation is

a special kind of bag, sometimes called an outer bag. An inner bag is a bag that is a field within some

complex type. A bag is represented by tuples separated by commas, all enclosed by curly brackets. Tuples

in a bag are not required to have the same schema or even the same number of fields, although it is a good

idea to keep them uniform unless you are handling semistructured or unstructured data.

Map [key#value] A map is a set of key/value pairs. Keys must be unique and must be a string (chararray).

The value can be any type.
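As an illustrative sketch (the file and field names are made up), all three complex types can appear together in a LOAD schema:

students = LOAD 'students.txt'
           AS (name:chararray,
               grades:bag{t:tuple(course:chararray, score:int)},
               info:map[]);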

Input and Output

Before you can do anything of interest, you need to be able to add inputs and outputs to your data flows.

Load: alias = LOAD 'file' [USING function] [AS schema];

Load data from a file into a relation. Uses the PigStorage load function as default unless specified otherwise

with the USING option. The data can be given a schema using the AS option.

Store: STORE alias INTO 'directory' [USING function];

Store data from a relation into a directory. The directory must not exist when this command is executed. Pig

will create the directory and store the relation in files named part-nnnnn in it. Uses the PigStorage store

function as default unless specified otherwise with the USING option.

Dump: DUMP alias;

Display the content of a relation. Use it mainly for debugging. The relation should be small enough for

printing on screen. You can apply the LIMIT operation on an alias to make sure it is small enough for

display.
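Putting LOAD, DUMP, and STORE together, a minimal end-to-end script might look like the sketch below; the input file, its comma delimiter, and the field names are assumptions for illustration:

flights = LOAD 'FlightData.csv' USING PigStorage(',')
          AS (carrier:chararray, delay:int);
late    = FILTER flights BY delay > 15;
DUMP late;                        -- inspect the result while debugging
STORE late INTO 'late_flights';   -- output directory must not already exist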

HIVE:

Hive, a framework for data warehousing on top of Hadoop, was one of the biggest ingredients in the

Information Platform built by Jeff Hammerbacher's team at Facebook in 2007. Hive was open sourced in

August 2008.

Hive provides an SQL-oriented query language, abbreviated as HiveQL or simply HQL.

Hive saves you from having to write the MapReduce programs.

Like most SQL dialects, HiveQL does not conform to the ANSI SQL standard and it differs in various

ways from the familiar SQL dialects provided by Oracle, MySQL, and SQL Server.


Hive is best suited for data warehouse applications, where relatively static data is analyzed, fast

response times are not required, and the data is not changing rapidly.

Hive is not a full database because the design constraints and limitations of Hadoop and HDFS impose

limits on what Hive can do.

The biggest limitation is that Hive does not provide record-level update, insert, or delete.

HIVE ARCHITECTURE:

CLI: The command-line interface to Hive (the shell). This is the default service.

HiveServer: Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written

in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server

to communicate with Hive.

Metastore Database: The metastore is the central repository of Hive metadata.

The Hive Web Interface (HWI): As an alternative to the shell, you might want to try Hive's simple web

interface, which is started with the hive --service hwi command.

Hive clients: If you run Hive as a server (hive --service hiveserver), then there are a number of different

mechanisms for connecting to it from applications. The relationship between Hive clients and Hive services

is illustrated in the Hive architecture figure.

Thrift Client: The Hive Thrift Client makes it easy to run Hive commands from a wide range of

programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby. They

can be found in the src/service/src subdirectory in the Hive distribution.

JDBC Driver: Hive provides a Type 4 (pure Java) JDBC driver, defined in the class

org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form

jdbc:hive://host:port/dbname, a Java application will connect to a Hive server running in a separate process

at the given host and port.
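A minimal Java sketch of using this driver follows; it assumes the older HiveServer described here is listening on localhost:10000, the Hive JDBC jars are on the classpath, and sample_table is a hypothetical table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Driver class and URI form as described above
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1));   // print the first column of each row
    }
    rs.close();
    stmt.close();
    con.close();
  }
}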

ODBC Driver : The Hive ODBC Driver allows applications that support the ODBC protocol to connect to

Hive.


We can create two types of tables in Hive

Managed Tables: When a managed table is dropped, Hive deletes both the table definition and the data that

was loaded into it, so dropping a managed table results in data loss.

External Tables: When an external table is dropped, only the table definition is removed; the underlying

data remains safe in its original location.
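A minimal HiveQL sketch of the difference; the table names and the HDFS location are hypothetical:

-- Managed table: DROP TABLE removes both the metadata and the data
CREATE TABLE managed_logs (line STRING);

-- External table: DROP TABLE removes only the metadata; the files under
-- /data/raw_logs remain untouched in HDFS
CREATE EXTERNAL TABLE external_logs (line STRING)
LOCATION '/data/raw_logs';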

Hive supports many of the primitive data types you find in relational databases, as well as three

collection data types that are rarely found in relational databases.

Primitive Data Types : Hive supports several sizes of integer and floating-point types, a Boolean type, and

character strings of arbitrary length. Hive v0.8.0 added types for timestamps and binary fields.

Note that Hive does not support "character arrays" (strings) with maximum-allowed lengths, as is

common in other SQL dialects.

Collection Data Types: Hive supports columns that are structs, maps, and arrays. Note that the literal syntax

examples in the table below are actually calls to built-in functions. Table: Collection data types.

STRUCT  Analogous to a C struct or an "object." Fields can be accessed using the "dot" notation. For

example, if a column name is of type STRUCT {first STRING; last STRING}, then the first name field can be

referenced using name.first. Example: struct('John', 'Doe')

MAP  A collection of key-value tuples, where the fields are accessed using array notation (e.g., ['key']). For

example, if a column name is of type MAP with key→value pairs 'first'→'John' and 'last'→'Doe', then the last

name can be referenced using name['last']. Example: map('first', 'John', 'last', 'Doe')

ARRAY  Ordered sequences of the same type that are indexable using zero-based integers. For example, if a

column name is of type ARRAY of strings with the value ['John', 'Doe'], then the second element can be

referenced using name[1]. Example: array('John', 'Doe')
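As a sketch (the table and column names are illustrative), a table can combine all three collection types, and a query can access them with the notation shown above:

CREATE TABLE employees (
  name   STRUCT<first:STRING, last:STRING>,
  phones MAP<STRING, STRING>,
  skills ARRAY<STRING>
);

SELECT name.first, phones['home'], skills[0]
FROM employees;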

HBASE:

• Hbase project was started toward the end of 2006 by Chad Walters and Jim Kellerman at

Powerset.

• HBase is the Hadoop database, modeled after Google's Bigtable.

• HBase is "a sparse, distributed, persistent, multi-dimensional sorted map" for structured data, following the

definition of Bigtable by Chang et al.

• Sparse means that most cells can be empty: rows do not need a value in every column.

• Distributed means that the storage of the data is spread out across commodity hardware.

• Persistent means that the data will be saved (durably stored).


• Multidimensional in HBase means that for a particular cell, there can be multiple versions of it.

• Sorted map means that the map is kept sorted by key; to obtain a value, you must provide its key.

• The first HBase release was bundled as part of Hadoop 0.15.0 in October 2007.

• In May 2010, HBase graduated from a Hadoop sub-project to become an Apache top-level project.

• Facebook, Twitter, and other leading websites use HBase for their Big Data.

• HBase is a distributed, column-oriented database built on top of HDFS.

• Use HBase for random, real-time read/write access to your datasets.

• HBase and its native API are written in Java, but you do not have to use Java to access the API.

HBASE ARCHITECTURE:

In HBase Architecture, we have the Master and multiple region servers consisting of multiple regions.

HBase Master

• Responsible for managing region servers and their locations

– Assigns regions to region servers

– Re-balances regions to accommodate workloads

– Recovers regions if a region server becomes unavailable

– Uses ZooKeeper, a distributed coordination service

• Doesn't actually store or read data


Region Server

Each region server is basically a node within your cluster.

You would typically have at least 3 region servers in your distributed environment.

A region server contains one or more regions.

It is important to know that each region holds a contiguous range of rows, and within a region the data for each column family is kept in its own store.

The files are primarily handled by the HRegionServer.

The HLog is shared between all the stores of an HRegionServer.

WAL

The Write-Ahead Log is kept in the HLog file. If your data is only in the MemStore and the system fails

before the data is flushed to a file, that data is lost. On a large system the chances of this are high, because

you will have hundreds of commodity servers. The WAL exists to prevent data loss should this happen.

HRegion:

A region consists of one or more column families for a range of row keys.

The default size of a region is 256 MB.


Data Storage

• Data is stored in files called HFiles/StoreFiles

• HFile is basically a key-value map

• When data is added it's written to a log called Write Ahead Log (WAL) and is also stored in memory

(memstore)

• Flush: when in-memory data exceeds maximum value it is flushed to an HFile

HBase Data Model

• Data is stored in Tables

• Tables contain rows

– Rows are referenced by a unique key

• The key is an array of bytes.

• Rows are made of columns, which are grouped into column families

• Data is stored in cells

– Identified by row x column-family x column

– Cell's content is also an array of bytes.

HBase Families

• Columns are grouped into column families

– Labeled as "family:column"

– Example: "user:first_name"

– A way to organize your data

– Various features are applied to families

HBase Timestamps: Cell values are versioned: for each cell, multiple versions are kept (3 by default). The

timestamp is another dimension used to identify your data; it is either assigned by the region server or

provided explicitly by the client. Versions are stored in decreasing timestamp order, so the latest version is

read first, which is an optimization for reading the current value.
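Versioning can be observed with the HBase shell commands introduced in the CRUD section below; the table 'test' and family 'cf1' match those later examples, while the qualifier 'a' and the values are added here for illustration. Assuming the column family keeps multiple versions (3 by default, as noted above), writing the same cell twice and requesting several versions returns both values with their timestamps, newest first:

hbase(main):001:0> put 'test', 'row1', 'cf1:a', 'v1'
hbase(main):002:0> put 'test', 'row1', 'cf1:a', 'v2'
hbase(main):003:0> get 'test', 'row1', {COLUMN => 'cf1:a', VERSIONS => 3}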


HBase Cells: Value = Table + RowKey + Family + Column + Timestamp

SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>

CRUD operations In HBase:

These operations enable the basics of HBase's get, put, and delete commands. They are the building blocks

of an HBase application.

Create Command: creates the table with a column family.

hbase(main):001:0> create 'test', 'cf1'

Put Command: The put command allows you to put data into HBase. This is also the same command to

update data currently in HBase. Remember that a table can have more than one column family and that each

column family consists of one or more columns.

hbase(main):003:0> put 'test', 'row1', 'cf1', 'val2'

Get Command: The Get command retrieves data from an HBase table.

hbase(main):003:0> get 'test', 'row1'

Delete command: This deletes data from an HBase table.

hbase(main):003:0> delete 'test', 'row1', 'cf1'

Scan operation: Scans could be used for something like counting the occurrences of a hash tag over a given

time period. One thing to remember is that you need to release your scanner instance as soon as you are

done.

hbase(main):004:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf1:, timestamp=1297853125623, value=val2

Hbase vs BigTable:

HBase Bigtable

Region Tablet

Region Server Tablet Server

Write-ahead log Commit log

HDFS GFS

Hadoop Map-Reduce Map-Reduce

MemStore Memtable

HFile SSTable

ZooKeeper Chubby


SQOOP:

When Big Data storages and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the

Hadoop ecosystem came into picture, they required a tool to interact with the relational database servers for

importing and exporting the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop

ecosystem to provide feasible interaction between relational database servers and Hadoop's HDFS.

Sqoop: "SQL to Hadoop and Hadoop to SQL"

Sqoop is a command-line interface application for transferring data between relational databases and

Hadoop. Sqoop is used to import data from a relational database management system (RDBMS) such as

MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop

MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be

imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as

fault tolerance. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server

databases to Hadoop. Couchbase, Inc. also provides a Couchbase Server-Hadoop connector by means of

Sqoop.

SQOOP ARCHITECTURE

Sqoop Import

The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record

in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.

$ sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp_add -m 1 --target-dir /queryresult

Sqoop Export



The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop

contain records, which are called rows in the table. These are read and parsed into a set of records and

delimited with a user-specified delimiter.

$ sqoop export --connect jdbc:mysql://localhost/db --username root --table employee --export-dir

/emp/emp_data

FLUME:

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and

moving large amounts of log data from many different sources to a centralized data store. Apache Flume is a

top-level project at the Apache Software Foundation. There are currently two release code lines available,

versions 0.9.x and 1.x; the description below applies to the 1.x code line.

Architecture Of FLUME:

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string

attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an

external source to the next destination (hop). A Flume source consumes events delivered to it by an external

source like a web server. The external source sends events to Flume in a format that is recognized by the

target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro

clients or other Flume agents in the flow that send events from an Avro sink. When a Flume source receives

an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it's

consumed by a Flume sink. The JDBC channel is one example -- it uses a file system backed embedded

database. The sink removes the event from the channel and puts it into an external repository like HDFS (via

Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The

source and sink within the given agent run asynchronously with the events staged in the channel.
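A minimal single-agent configuration sketch in Flume's properties-file format; the agent name a1, the netcat source, the port, and the HDFS path are illustrative placeholders, and the memory channel is the non-durable channel described under Recoverability below:

# components of agent a1: one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# source: listens for newline-separated events on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# channel: in-memory staging area between source and sink
a1.channels.c1.type = memory

# sink: writes events to HDFS (path is a placeholder)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events

# wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Such a file is typically passed to the flume-ng agent command together with the agent name, for example: flume-ng agent --conf-file example.conf --name a1.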

Complex flows: Flume allows a user to build multi-hop flows where events travel through multiple agents

before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup

routes (fail-over) for failed hops.

Reliability: The events are staged in a channel on each agent. The events are then delivered to the next

agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they


are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message

delivery semantics in Flume provide end-to-end reliability of the flow. Flume uses a transactional approach

to guarantee reliable delivery of the events. The sources and sinks encapsulate, in a transaction, the storage

and retrieval, respectively, of the events placed in or provided by the channel.

This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-

hop flow, the sink from the previous hop and the source from the next hop both have their transactions

running to ensure that the data is safely stored in the channel of the next hop.

Recoverability: The events are staged in the channel, which manages recovery from failure. Flume supports

a durable JDBC channel which is backed by a relational database. There's also a memory channel which

simply stores the events in an in-memory queue, which is faster but any events still left in the memory

channel when an agent process dies can't be recovered.

OOZIE:

Oozie Workflow Overview

Oozie is a server based Workflow Engine specialized in running workflow jobs with actions that run

Hadoop Map/Reduce and Pig jobs. Oozie is a Java Web-Application that runs in a Java servlet-container.

For the purposes of Oozie, a workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs)

arranged in a control dependency DAG (Directed Acyclic Graph). "Control dependency" from one action to

another means that the second action can't run until the first action has completed. In terms of the actions we

can schedule, Oozie supports a wide range of job types, including Pig, Hive, and MapReduce, as well as

jobs coming from Java programs and Shell scripts.

An Oozie coordinator job, for example, enables us to schedule any workflows we have already created. We

can schedule them to run based on specific time intervals, or even based on data availability. At an even

higher level, we can create an Oozie bundle job to manage our coordinator jobs; using a bundle job, we can

easily apply policies against a set of coordinator jobs.

For all three kinds of Oozie jobs (workflow, coordinator, and bundle), we start out by defining them using

individual .xml files, and then we configure them using a combination of properties files and command-line

options.
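As a sketch only (the coordinator name, dates, and application path are placeholders), a coordinator definition that runs an already-deployed workflow once a day might look like this:

<coordinator-app name="daily-coordinator" frequency="${coord:days(1)}"
                 start="2020-01-01T00:00Z" end="2020-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <app-path>${nameNode}/user/oozie/apps/sample-wf</app-path>
    </workflow>
  </action>
</coordinator-app>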

Writing Oozie workflow definitions

Oozie workflow definitions are written in XML, based on the hPDL (Hadoop Process Definition Language)

schema. This particular schema is, in turn, based on the XML Process Definition Language (XPDL) schema,

which is a product-independent standard for modeling business process definitions.

Oozie workflow actions start jobs in remote systems (i.e., Hadoop, Pig). Upon action completion, the remote

system calls back Oozie to notify it of the completion, at which point Oozie proceeds to the next action in

the workflow.

Oozie workflows contain control flow nodes and action nodes.


Control flow nodes define the beginning and the end of a workflow (start, end and fail nodes) and provide a

mechanism to control the workflow execution path (decision, fork and join nodes).

Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing

task. Oozie provides support for different types of actions: Hadoop map-reduce, Hadoop file system, Pig,

SSH, HTTP, eMail and Oozie sub-workflow. Oozie can be extended to support additional type of actions.

To see how this concept would look, check out Listing 10-1, which shows an example of the basic structure

of an Oozie workflow's XML file.

Listing 10-1: A Sample Oozie XML File

<workflow-app name="SampleWorkflow" xmlns="uri:oozie:workflow:0.1">
    <start to="firstJob"/>
    <action name="firstJob">
        <pig>...</pig>
        <ok to="secondJob"/>
        <error to="kill"/>
    </action>
    <action name="secondJob">
        <map-reduce>...</map-reduce>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>"Killed job."</message>
    </kill>
    <end name="end"/>
</workflow-app>

In this example, aside from the start, end, and kill nodes, you have two action nodes. Each action node

represents an application or a command being executed. The next few sections look a bit closer at each node

type.

Start and end nodes

Each workflow XML file must have one matched pair of start and end nodes. The sole purpose of the start

node is to direct the workflow to the first node, which is done using the to attribute. Because it's the

automatic starting point for the workflow, no name identifier is required.

Action nodes need name identifiers, as the Oozie server uses them to track the current position of the control

flow as well as to specify which action to execute next. The sole purpose of the end node is to provide a

termination point for the workflow. A name identifier is required, but there's no need for a to attribute.

Kill nodes

Oozie workflows can include kill nodes, which are a special kind of node dedicated to handling error

conditions. Kill nodes are optional, and you can define multiple instances of them for cases where you need

specialized handling for different kinds of errors. Action nodes can include error transition tags, which direct

the control flow to the named kill node in case of an error.

You can also direct decision nodes to point to a kill node based on the results of decision predicates, if

needed. Like an end node, a kill node results in the workflow ending, and it does not need a to attribute.