Let’s Hadoop
1. WHAT’S THE BIG DEAL WITH BIG DATA?
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
Gartner predicts 800% data growth over the next 5 years.
Big Data opens the door to a new approach to engaging customers and making decisions
2. BIG DATA: WHAT ARE THE CHALLENGES?
How can we capture and deliver data to the right people in real time?
How can we understand and use big data when it comes in a variety of forms?
How can we store and analyze the data given its size and the computational capacity required?
While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. Example: processing a 100 TB dataset
• On 1 node, scanning @ 50 MB/s: ~23 days
• On a 1000-node cluster, scanning @ 50 MB/s: ~33 min
Hardware problems: data must be processed and combined from multiple disks. Traditional systems can’t scale, are not reliable, and are expensive.
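The cluster arithmetic above can be checked with a few lines of Java. This is a back-of-the-envelope illustration only; the 50 MB/s scan rate and perfect 1000-way parallelism are the slide’s simplifying assumptions.

```java
// Back-of-the-envelope scan times for a 100 TB dataset at 50 MB/s per drive.
public class ScanTime {
    public static void main(String[] args) {
        double datasetBytes = 100e12;      // 100 TB
        double rateBytesPerSec = 50e6;     // 50 MB/s per drive
        double oneNodeSecs = datasetBytes / rateBytesPerSec;
        double clusterSecs = oneNodeSecs / 1000; // 1000 nodes scanning in parallel
        System.out.printf("1 node:      %.0f days%n", oneNodeSecs / 86400);  // ~23 days
        System.out.printf("1000 nodes:  %.0f min%n", clusterSecs / 60);      // ~33 min
    }
}
```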
3. WHAT TECHNOLOGIES SUPPORT BIG DATA?
Scale-out everything:
• Storage
• Compute
4. WHAT MAKES HADOOP DIFFERENT?
■ Accessible: Hadoop runs on large clusters of commodity machines or on cloud services such as Amazon EC2.
■ Robust: Hadoop is architected with the assumption of frequent hardware malfunctions; it can gracefully handle most such failures.
■ Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple: Hadoop allows users to quickly write efficient parallel code.
■ Data locality: move the computation to the data.
■ Replication: replicate data across servers to cope with unreliable storage and servers.
5. IS HADOOP A ONE-STOP SOLUTION?
Not good for...
• Real-time processing
• Small datasets
• Algorithms that require large temporary space
• Problems that are CPU bound and have lots of cross-talk
Good for...
Hadoop is an open source framework for writing and running distributed applications that process large amounts of data.
• Framework written in Java
• Designed to solve problems that involve analyzing large data (petabytes)
• Programming model based on Google’s MapReduce
• Infrastructure based on Google’s distributed file system (GFS)
Hadoop consists of two core components:
• The Hadoop Distributed File System (HDFS): a distributed file system
• MapReduce: distributed processing on compute clusters
NameNode
Manages the file system namespace (metadata) and regulates access to files by clients.
The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.

DataNode
Manages storage attached to the node on which it runs.
A DataNode serves read and write requests and performs block creation, deletion, and replication upon request from the NameNode.
There are many DataNodes, typically one per physical node.
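As an illustration of the NameNode’s bookkeeping, the sketch below splits a file into fixed-size blocks and assigns replicas to DataNodes. This is not HDFS code: the 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement is a toy stand-in for HDFS’s rack-aware placement policy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of NameNode-style block placement (not actual HDFS code).
public class BlockPlacementSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, a common HDFS default
    static final int REPLICATION = 3;                  // common HDFS default replication factor

    // For each block of the file, pick the DataNodes that hold a replica.
    static List<List<String>> place(long fileSize, List<String> dataNodes) {
        int numBlocks = (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE); // ceiling division
        List<List<String>> placement = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)
                replicas.add(dataNodes.get((b + r) % dataNodes.size())); // naive round robin
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4");
        // A 300 MB file needs 3 blocks of 128 MB (the last one partially full).
        place(300L * 1024 * 1024, nodes).forEach(System.out::println);
    }
}
```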
Large-Scale Data Processing
o Want to use 1000s of CPUs
o But don’t want the hassle of managing them
The MapReduce architecture provides:
o Automatic parallelization & distribution
o Fault tolerance
o I/O scheduling
o Monitoring & status updates
MapReduce is a method for distributing a task across multiple nodes. Each node processes data stored on that node. A job consists of two phases:
o Map
o Reduce
In the map phase, the mapper reads input data in the form of key/value pairs and emits intermediate key/value pairs.
The reducer processes all output from the mappers, arrives at the final output as final key/value pairs, and writes them to HDFS.
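The two phases can be sketched in plain Java with the classic word-count example. This simulation stands in for the Hadoop Mapper/Reducer API: the map step emits (word, 1) pairs, a grouping step plays the role of the framework’s shuffle and sort, and the reduce step sums the counts for each key.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java simulation of the map and reduce phases for word counting.
// Illustrative only: a real Hadoop job implements the Mapper and Reducer
// classes of the Hadoop API rather than these standalone methods.
public class WordCountSim {

    // Map phase: split each input line (the value) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1))
                     .collect(Collectors.toList());
    }

    // Reduce phase: sum all intermediate values for one key.
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");

        // Shuffle/sort: group intermediate pairs by key, as the framework would.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());

        // Prints each word with its count, e.g. "the" appears twice.
        grouped.forEach((word, counts) ->
            System.out.println(word + "\t" + reduce(word, counts)));
    }
}
```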
There are two types of nodes that control the job execution process:
o JobTracker
o TaskTrackers
The JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers. TaskTrackers run tasks and send progress reports to the JobTracker. The JobTracker typically runs on the NameNode machine, and each DataNode runs a TaskTracker.