8/13/2019 Big Data Doc
A DIVE INTO BIG DATA AND ITS
SOLUTION USING HADOOP
BY
G.LOUIS AROKIARAJ
B.TECH CSE (IV YEAR)
NATIONAL INSTITUTE OF TECHNOLOGY, PUDUCHERRY
HOW IS IT MORE EFFICIENT THAN DATA WAREHOUSE ANALYSIS?
Data warehouses store current as well as historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons.
The drawback of a data warehouse is that when such a large quantity of data floods into the system, it is not able to process that data. It is also more expensive. Some of the data warehouse solutions are Informatica and Teradata.
SOURCES OF BIG DATA
Social media: Facebook, Twitter, Google+, Orkut.
Stock market: risk analysis.
Health care: patient details, diagnoses, prescriptions, medicines, reports.
Information technology companies: employee details, statistics.
E-commerce: recommendations.
The Indian Government is trying to apply big data analysis to tax revenues to boost the country's economy.
STATISTICS
In the present data world, about 90% of data is unstructured; only the remaining 10% is structured.
In the last 2 years there has been an immense increase in the quantity of data because of various factors like online shopping, Facebook, Twitter, etc.
In 1 day, about 2.2 million pieces of data are created.
In 2010, the big data market was valued at $3.2 billion. In 2016, the big data market is expected to reach $16.9 billion.
[Figure: a graph illustrating that huge amounts of data are lost and left unprocessed.]
HOW TO SOLVE THIS?
Here comes the solution: HADOOP.
Hadoop is one of the solutions to big data analysis. Other big data technologies include:
NoSQL databases: Cassandra, MongoDB, etc.
Search tools: Lucene, Elasticsearch, etc.
Stream processing: Storm, S4, etc.
Data collection and serialization: Kafka, Thrift, Scribe, etc.
WHY HADOOP?
Flexible: Hadoop can process all 3 types of data (structured, semi-structured, unstructured). Hadoop supports various languages like Perl, Python, Java and SQL through the Hadoop Streaming API, so Hadoop is not restricted to Java experts alone!
Scale-out architecture.
Building a more efficient data economy.
Robust ecosystem.
Cost effective.
Hadoop is getting cloudy too!
Hadoop focuses on moving code to data instead of data to code.
Hadoop supports OLAP (online analytical processing) but not OLTP (online transaction processing).
HADOOP OVER TRADITIONAL SYSTEMS

HADOOP ANALYSIS
Hadoop is an open-source framework which allows distributed processing of large data sets across clusters of computers using a simple programming model.
When the data keeps growing and the rack architecture is not able to take it up, we can just add or substitute a cheap commodity machine to the rack and continue the execution.
Scale-out architecture.

TRADITIONAL SYSTEM ANALYSIS
Traditional systems use paid software and tools for the analysis of data.
When the data keeps growing and the system is not able to withstand it, we can extend the system only up to a certain limit; if the data crosses beyond that limit, we are forced to replace the entire machine.
Scalable (scale-up) architecture.
HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene (the widely used text search library), and Mike Cafarella.
The concept was proposed in a paper by Google describing the Google File System (GFS), which evolved into HDFS (Hadoop Distributed File System) in Hadoop.
Likewise, from Google's MapReduce paper, Hadoop's concept of distributed parallel processing came into existence.
2004: Initial versions of what are now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella.
December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006: Doug Cutting joins Yahoo!
February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
February 2006: Adoption of Hadoop by the Yahoo! Grid team.
April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
May 2006: Yahoo! set up a Hadoop research cluster of 300 nodes.
May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
October 2006: Research cluster reaches 600 nodes.
December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
January 2007: Research cluster reaches 900 nodes.
April 2007: Research clusters: 2 clusters of 1000 nodes.
April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
October 2008: Loading 10 terabytes of data per day onto research clusters.
March 2009: 17 clusters with a total of 24,000 nodes.
April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
The name Hadoop came from the name of Doug Cutting's son's toy elephant, and thus the elephant symbol came into existence.
DISTRIBUTORS OF HADOOP
Apache (the original open-source distribution)
Hortonworks
Cloudera
MapR
Intel
HADOOP ARCHITECTURE
The hadoop architecture comprises of 2 major parts
HDFSHadoop distributed file systemA distributed file system that runs on large clusters of commodity machines.
It comprises of 3 components:
Name Node is the master of the system. It maintains the name system (directories andfiles) and manages the blocks which are present on the Data Nodes. It holds the metadatafor hdfs.
Data Nodes are the slaves which are deployed on each machine and provide the actualstorage. They are responsible for serving read and write requests for the clients.
Secondary Name Node is responsible for performing periodic checkpoints. In the eventof Name Node failure, you can restart the Name Node using the checkpoint.
MAP REDUCEThis is responsible for 2 processes
Map taskbreak the input into set of key value pairs. Reduce task It will consolidate the outputs from each distributed executions and
process them into reduced tuples.
This is responsible for the computation of the problem.
It comprises of 2 components:
Job Tracker is the master of the system which manages the jobs and resources in the clus-ter (Task Trackers). The Job Tracker tries to schedule each map as close to the actual databeing processed i.e. on the Task Tracker which is running on the same Data Node as theunderlying block.
Task Trackers are the slaves which are deployed on each machine. They are responsiblefor running the map and reduce tasks as instructed by the Job Tracker.
The input and output must always be in HDFS for the execution of Hadoop.
CLUSTER: The entire configuration of one Hadoop architecture is called a cluster.
RACK: A rack is a metal shelf which holds the various nodes, servers and memory storage components.
DATA BLOCKS: The data to be processed is split into blocks of data which are subjected to parallel execution. The block size can be 64 MB, 128 MB or 256 MB.
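As a rough sketch of how a file maps onto blocks, the number of blocks a file occupies is a ceiling division of the file size by the block size (the class and sizes below are illustrative, not part of any Hadoop API):

```java
// Illustrative helper: how many HDFS blocks a file of a given size occupies.
public class BlockMath {
    // Ceiling division: e.g. a 200 MB file with 64 MB blocks needs 4 blocks,
    // the last one only partially filled.
    public static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(numBlocks(200 * mb, 64 * mb));  // 4
        System.out.println(numBlocks(200 * mb, 128 * mb)); // 2
    }
}
```

Note that the last block usually holds less than a full block of data; HDFS does not pad it out to the block size.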
FAULT TOLERANCE SCHEME
In order to make the system fault tolerant, Hadoop uses the concept of REPLICATION. The given data is replicated and saved in other blocks of memory. In case of any data loss, the replicated block of data is used for further processing/execution.
The replication factor can be 1, 2 or 3.
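A toy model (not Hadoop code; all class and node names are invented for illustration) of why replication gives fault tolerance: when every block is stored on several nodes, losing any single node still leaves at least one live copy of every block, whereas an unreplicated block is lost with its node.

```java
import java.util.*;

// Toy model of HDFS replication: each block is placed on several data nodes,
// so the loss of any single node leaves at least one surviving copy.
public class ReplicationDemo {
    // Returns true if every block still has a copy on a node other than the failed one.
    public static boolean allBlocksSurvive(Map<String, List<String>> blockToNodes,
                                           String failedNode) {
        for (List<String> nodes : blockToNodes.values()) {
            boolean alive = false;
            for (String n : nodes) {
                if (!n.equals(failedNode)) { alive = true; break; }
            }
            if (!alive) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, List<String>> placement = new HashMap<>();
        placement.put("block1", Arrays.asList("nodeA", "nodeB", "nodeC")); // replication 3
        placement.put("block2", Arrays.asList("nodeB", "nodeC", "nodeD"));
        System.out.println(allBlocksSurvive(placement, "nodeB")); // true
        placement.put("block3", Arrays.asList("nodeB"));          // replication 1
        System.out.println(allBlocksSurvive(placement, "nodeB")); // false
    }
}
```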
EXECUTION PROCESS OF HADOOP
The client will submit the job to the job tracker The job tracker communicates with the name node which holds the metadata of all the other
nodes. ( holding the index of all other nodes)
The job trackers retrieves the information from the name node determining the availability of thetask trackers.
Depending upon the availability, the job trackers will assign the jobs to the corresponding tasktrackers.
The corresponding data nodes hold the block of data upon which the task tracker works. The task trackers will intimate the job tracker upon the completion of execution.. Then upon receiving the completion status of all the blocks of given data, the job tracker informs
back to the client saying that the execution is complete.
The reduce task will start only after the entire mapping task is done.WORD COUNT PROBLEM
The word count problem is a basic example illustrating the working of the Hadoop architecture. Here a huge text file consisting of various words and phrases is fed into the Hadoop cluster. The job is to count the number of distinct words and their occurrences in the given file. This is done by the following process:
Mapper instantiation for each line.
Map key-value splitting.
Sorting and shuffling.
Reducing the key-value pairs.
Printing the final output.
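The steps above can be sketched in plain Java, outside Hadoop, to show what the map, shuffle/sort and reduce phases each contribute (an in-memory simulation for illustration only, not the MapReduce API):

```java
import java.util.*;

// In-memory simulation of the word-count pipeline:
// map (emit <word, 1>), shuffle/sort (group pairs by key), reduce (sum the 1s).
public class WordCountSim {
    public static Map<String, Integer> countWords(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every token in every line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        // Shuffle/sort phase: group the pairs by key (TreeMap keeps keys sorted).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data big hadoop", "hadoop big");
        System.out.println(countWords(lines)); // {big=3, data=1, hadoop=2}
    }
}
```

In real Hadoop the same three phases run distributed across Task Trackers, with the framework performing the shuffle and sort between the map and reduce tasks.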
This simple problem can be solved using a single-node cluster, which can be set up on our personal computers.
SETTING UP A SINGLE-NODE CLUSTER
The following steps are used to set up a single-node cluster. Hadoop 1.1.2 supports only Linux operating systems.
1. Unzip the tar file: $ tar -xzvf hadoop-1.1.2.tar.gz
2. Install the JDK: $ sh jdk-6u45-linux-x64.bin
3. Open the /home/ubunutu/.bashrc file and set the path.
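The original screenshot of the .bashrc edit is not reproduced here; a typical fragment might look like the following (the JDK directory name is an assumption based on the installer used above; adjust it to the actual unpack location):

```shell
# Illustrative ~/.bashrc additions for this guide's layout.
export JAVA_HOME=/home/ubunutu/jdk1.6.0_45
export HADOOP_HOME=/home/ubunutu/hadoop-1.1.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
```

Run `source ~/.bashrc` (or open a new terminal) for the changes to take effect.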
4. Edit the Hadoop configuration files. Path: /home/ubunutu/hadoop-1.1.2/conf
$ cd hadoop-1.1.2/
$ cd conf
$ vi hadoop-env.sh
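The screenshot of this edit is not reproduced here; the essential change in hadoop-env.sh is pointing Hadoop at the JDK (the path is an assumption matching the installer used above):

```shell
# In conf/hadoop-env.sh, uncomment and set JAVA_HOME:
export JAVA_HOME=/home/ubunutu/jdk1.6.0_45
```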
5. $ vi core-site.xml
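The screenshot of core-site.xml is not reproduced here; a minimal single-node configuration for Hadoop 1.x might look like this (the port 9000 and the temp directory are common tutorial choices, not requirements):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubunutu/hadoop-tmp</value>
  </property>
</configuration>
```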
6. $ vi mapred-site.xml
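The screenshot of mapred-site.xml is not reproduced here; a minimal single-node configuration might look like this (port 9001 is the conventional tutorial choice for the Job Tracker, not a requirement):

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```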
7. $ vi hdfs-site.xml
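The screenshot of hdfs-site.xml is not reproduced here; on a single-node cluster the replication factor is usually set to 1, since there is only one Data Node to hold copies:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```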
8. Change the directory to /home/ubunutu/
9. Now generate the SSH key from the user path:
$ ssh-keygen -t rsa
$ cd .ssh
$ sudo apt-get install openssh-server
$ cat id_rsa.pub >> authorized_keys
$ ssh localhost
(It prompts for a password.)
$ exit
10. Now we have set up the single-node cluster on our personal system.
11. We have to start the cluster using:
$ cd /home/ubunutu/hadoop-1.1.2/bin
$ ./start-all.sh
The jps command is used to list the various daemons running in the cluster.
12. Since Hadoop reads from HDFS and writes back to HDFS, it requires both the input file and the output file to be present in its local HDFS.
Thus we now need to copy the file from our local storage to HDFS storage. This is done using the command:
$ hadoop dfs -copyFromLocal /home/ubunutu/filename.txt /home/ubunutu/inputdata
Thus the file is copied from our local file system to Hadoop's HDFS file system.
13. Now we can execute the distributed processing of the data using the following command:
$ hadoop jar (jar name) (class name) (input file path) (output file path)
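With the names used in this guide, a concrete invocation might look like the following (the jar name and output path are illustrative; the class name is the one from the WordCount program below):

```shell
hadoop jar wordcount.jar org.samples.mapreduce.training.WordCount \
    /home/ubunutu/inputdata /home/ubunutu/outputdata
```

Note that the output directory must not already exist in HDFS; Hadoop creates it and fails the job if it is present.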
The input is a sample text file containing various words and phrases.
The output can be inspected through the following web addresses:
Name Node: http://localhost:50070
Job Tracker: http://localhost:50030
Thus the word count program is executed on sample data using Hadoop's architecture of distributed execution.
WORD COUNT PROGRAM IN JAVA
package org.samples.mapreduce.training;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCount.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
BIBLIOGRAPHY
Hadoop: The Definitive Guide, 3rd edition, Tom White, O'Reilly.
Hadoop: The Definitive Guide, 2nd edition, Tom White, O'Reilly and Yahoo! Press.
Kick Start Hadoop: Word Count - Hadoop Map Reduce Example.
Hadoop Tutorial 1 -- Running WordCount (DftWiki).