TRANSCRIPT
1
Let’s Hadoop
2
1. WHAT’S THE BIG DEAL WITH BIG DATA?
3
Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
Gartner predicts 800% data growth over the next 5 years.
Big Data opens the door to a new approach to engaging customers and making decisions
4
2. BIG DATA: WHAT ARE THE CHALLENGES?
5
How can we capture and deliver data to the right people in real time?
How can we understand and use big data when it comes in a variety of forms?
How can we store and analyze the data given its size and the computational capacity required?
While the storage capacity of hard drives has increased massively over the years, access speeds—the rate at which data can be read from a drive—have not kept up. Example: processing a 100 TB dataset, scanning at 50 MB/s:
• On 1 node: ~23 days
• On a 1000-node cluster: ~33 minutes
Hardware problems: the data must be read and combined from multiple disks across many machines. Traditional systems can’t scale, are unreliable, and are expensive.
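The scan-time figures above follow from simple arithmetic. A quick sketch (assuming 1 MB = 10^6 bytes and a perfectly even split of the scan across nodes):

```python
# Back-of-the-envelope scan times for a 100 TB dataset at 50 MB/s per node.
DATASET_BYTES = 100 * 10**12   # 100 TB
SCAN_RATE = 50 * 10**6         # 50 MB/s per node

def scan_seconds(nodes):
    """Seconds to scan the whole dataset with work split evenly across nodes."""
    return DATASET_BYTES / (SCAN_RATE * nodes)

print(scan_seconds(1) / 86400)   # ~23.1 days on a single node
print(scan_seconds(1000) / 60)   # ~33.3 minutes on a 1000-node cluster
```

This is the core argument for scale-out: the per-disk transfer rate is fixed, so the only way to cut scan time is to read from many disks in parallel.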
6
3. WHAT TECHNOLOGIES SUPPORT BIG DATA?
7
Scale-out everything:
• Storage
• Compute
8
4. WHAT MAKES HADOOP DIFFERENT?
9
■ Accessible—Hadoop runs on large clusters of commodity machines or in the cloud (e.g., EC2).
■ Robust—Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
■ Scalable—Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple—Hadoop allows users to quickly write efficient parallel code.
■ Data locality—move computation to the data.
■ Replication—use replication across servers to deal with unreliable storage and servers.
10
5. IS HADOOP ONE-STOP SOLUTION?
11
Not good for...
• Real-time processing
• Small datasets
• Algorithms that require large temporary space
• Problems that are CPU-bound and have lots of cross-talk
Good for....
12
Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data.
• Framework written in Java
• Designed to solve problems that involve analyzing large data (petabytes)
• Programming model based on Google’s MapReduce
• Infrastructure based on Google’s distributed file system (GFS)
Hadoop consists of two core components:
• The Hadoop Distributed File System (HDFS)—a distributed file system
• MapReduce—distributed processing on compute clusters
13
NameNode—manages the file system namespace (metadata) and regulates access to files by clients. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories.
DataNode—manages storage attached to the node on which it runs. DataNodes serve read and write requests and perform block creation, deletion, and replication upon request from the NameNode.
There are many DataNodes, typically one DataNode per physical node.
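The division of labor above can be illustrated with a toy model (this is not Hadoop’s real API—the block size, node names, and placement policy here are invented for the demo): the NameNode holds only metadata—which blocks make up each file and where each replica lives—while DataNodes hold the block bytes themselves.

```python
# Toy model of HDFS roles (illustrative only, not Hadoop's actual API).
BLOCK_SIZE = 4    # tiny block size for the demo; real HDFS defaults to 128 MB
REPLICATION = 3   # HDFS's default replication factor

datanodes = {f"dn{i}": {} for i in range(4)}  # node -> {block_id: bytes}
namenode = {}                                 # filename -> [(block_id, [nodes])]

def put(filename, data):
    """Split data into blocks; replicate each block on REPLICATION DataNodes."""
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"{filename}#{i // BLOCK_SIZE}"
        # simple round-robin placement across distinct DataNodes
        nodes = [sorted(datanodes)[(i // BLOCK_SIZE + r) % len(datanodes)]
                 for r in range(REPLICATION)]
        for n in nodes:
            datanodes[n][block_id] = data[i:i + BLOCK_SIZE]
        blocks.append((block_id, nodes))
    namenode[filename] = blocks  # NameNode records metadata only

def get(filename):
    """Ask the NameNode for block locations, then read from any replica."""
    return b"".join(datanodes[nodes[0]][bid] for bid, nodes in namenode[filename])

put("demo.txt", b"hello hadoop")
print(get("demo.txt"))  # b'hello hadoop'
```

Because every block exists on three nodes, losing any single DataNode leaves every block readable from a surviving replica—which is how Hadoop tolerates the frequent hardware failures mentioned earlier.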
14
Large-scale data processing:
o Want to use 1000s of CPUs
o But don’t want the hassle of managing things
The MapReduce architecture provides:
o Automatic parallelization & distribution
o Fault tolerance
o I/O scheduling
o Monitoring & status updates
MapReduce is a method for distributing a task across multiple nodes. Each node processes data stored on that node. It consists of two phases:
o Map
o Reduce
15
16
In the map phase, the mapper reads data in the form of key/value pairs.
The reducer processes all the output from the mappers, arrives at the final output as key/value pairs, and writes it to HDFS.
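The two phases can be sketched with the canonical word-count example. This is a plain-Python simulation, not Hadoop’s Java API; the shuffle step that groups mapper output by key is folded into the driver:

```python
from collections import defaultdict

def mapper(key, value):
    """Map phase: for each input line, emit one (word, 1) pair per word."""
    for word in value.split():
        yield (word, 1)

def reducer(key, values):
    """Reduce phase: sum all counts emitted for one word."""
    return (key, sum(values))

def run_job(lines):
    """Driver: map every line, shuffle by key, then reduce each group."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):   # key = line offset, value = line
        for k, v in mapper(offset, line):
            grouped[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(grouped.items()))

print(run_job(["big data is big", "hadoop handles big data"]))
# {'big': 3, 'data': 2, 'hadoop': 1, 'handles': 1, 'is': 1}
```

Because each mapper sees only its own lines and each reducer sees only one key’s values, both phases parallelize across nodes with no coordination beyond the shuffle.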
There are two types of nodes that control the job execution process:
o JobTracker
o TaskTrackers
The JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers.
TaskTrackers run tasks and send progress reports to the JobTracker. The JobTracker typically runs on the master node alongside the NameNode; a TaskTracker runs on each worker node alongside its DataNode.
17