lecture 2 part 1
TRANSCRIPT
What is Hadoop? What is Hadoop not? Hadoop assumptions
What are racks, clusters, nodes, and commodity hardware?
HDFS - Hadoop Distributed File System
Using HDFS commands
MapReduce
Higher-level languages over Hadoop: Pig and Hive
HBase – Overview
HCatalog
What is Hadoop and its components?
What is commodity server hardware?
Why HDFS?
What is the responsibility of NameNode in HDFS?
What is Fault Tolerance?
What is the default replication factor in HDFS?
What is the heartbeat in HDFS?
What are JobTracker and TaskTracker?
Why MapReduce programming model?
Where do we have Data Locality in MapReduce?
Why do we need to use Pig and Hive?
What is the difference between HBase and HCatalog?
Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.
It includes:
Storage: the Hadoop Distributed File System (HDFS)
Access: HBase
Processing: MapReduce
Hadoop is not:
• A replacement for existing data warehouse systems
• A file system
• An online transaction processing (OLTP) system
• A replacement for all programming logic
• A database
It is written with large clusters of computers in mind
and is built around the following assumptions:
Hardware will fail.
Applications need high data throughput rather than low latency.
A typical HDFS file is gigabytes to terabytes in size.
HDFS should provide high aggregate data bandwidth and scale within a single cluster.
It should support tens of millions of files in a single instance.
Files follow a write-once-read-many access model.
Moving computation is cheaper than moving data.
Portability is important.
The three major categories of machine roles in a Hadoop deployment are:
1. Client machines
2. The master nodes: manage distributed data processing (MapReduce) and distributed data storage (HDFS)
3. The slave nodes: do all the dirty work
[Figure: Hadoop cluster roles. The client talks to the master nodes (JobTracker, NameNode, Secondary NameNode). Each slave node runs a DataNode and a TaskTracker. Processing (MapReduce) means running the computations; storage (HDFS) means storing the data. The slave daemons communicate with their master and receive instructions from it.]
The Master nodes
NameNode:
stores and manages all of the cluster's metadata,
so it is the single point of contact for HDFS
JobTracker:
runs on a master node (on small clusters, often the same machine as the NameNode)
schedules and coordinates the MapReduce jobs
Secondary NameNode:
maintains a checkpoint copy of:
• the metadata held on the NameNode
• the file system change history
It is not a hot standby for the NameNode.
The Slave nodes
DataNode: contains the actual data.
Default block size: 64 MB
TaskTracker: performs tasks on the local data, as assigned
by the JobTracker.
Client machines
• Have Hadoop installed with all the cluster settings
• Load data into the cluster
• Submit MapReduce jobs
• View the results of the job
In a small cluster, a single physical server may play several roles.
Yahoo! on 1000-node cluster
“Cheap” commodity server hardware
No need for supercomputers; use unreliable commodity hardware
Not desktops!
Typically in a two-level architecture:
• Nodes are commodity PCs
• 30-40 nodes per rack
• Uplink from rack is 3-4 gigabit
• Rack-internal is 1 gigabit
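The rack figures above imply that the uplink is heavily shared, which is one reason Hadoop prefers to keep traffic inside a rack. A rough back-of-the-envelope sketch, assuming every node in the rack tries to send at its full 1 gigabit at the same time (the function name is made up for illustration):

```python
# Figures come from the list above; the ratio is derived here for illustration.
NODE_LINK_GBIT = 1  # rack-internal link speed per node

def oversubscription(nodes_per_rack: int, uplink_gbit: float) -> float:
    """Ratio of what the nodes could send to what the rack uplink can carry."""
    return nodes_per_rack * NODE_LINK_GBIT / uplink_gbit

best = oversubscription(30, 4)   # fewest nodes, fastest uplink
worst = oversubscription(40, 3)  # most nodes, slowest uplink
print(f"{best:.1f}:1 to {worst:.1f}:1")
```

Under these assumptions the uplink is oversubscribed somewhere between about 7.5:1 and 13.3:1, so reading a block from a node in the same rack is much cheaper than pulling it across racks.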
The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks.
The DataNodes store the blocks and constantly report to the NameNode to keep the metadata current.
After a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
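To make the division of labour concrete, here is a toy sketch of how work could be handed out so that each map task runs on a node that already stores its block. This is not Hadoop's actual scheduler; the function name, block IDs, and node names are all hypothetical, and the "least loaded local node" rule is an assumption chosen for simplicity.

```python
from typing import Dict, List

def assign_map_tasks(block_locations: Dict[str, List[str]],
                     task_trackers: List[str]) -> Dict[str, str]:
    """Assign one map task per block, preferring a TaskTracker that stores the block."""
    load = {tt: 0 for tt in task_trackers}
    assignment = {}
    for block, holders in block_locations.items():
        local = [tt for tt in holders if tt in load]
        candidates = local or task_trackers  # fall back to any node if no local copy
        chosen = min(candidates, key=lambda tt: load[tt])  # least busy candidate
        assignment[block] = chosen
        load[chosen] += 1
    return assignment

# Hypothetical metadata, as the NameNode might report it: block -> nodes holding a replica.
locations = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
    "blk_3": ["node1", "node3", "node4"],
}
plan = assign_map_tasks(locations, ["node1", "node2", "node3", "node4"])
for block, node in plan.items():
    print(block, "->", node)
```

Every block in this example ends up on a node that holds a replica of it, so no block data has to cross the network before the map task starts.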
HDFS block
File F (300 MB) is split into five blocks:
Block 1 (64 MB)
Block 2 (64 MB)
Block 3 (64 MB)
Block 4 (64 MB)
Block 5 (remaining 44 MB)
Files added to HDFS are split into fixed-size blocks
The block size is configurable; the default is 64 MB in Hadoop 1.x (128 MB from Hadoop 2.x onward)
The NameNode maintains metadata about files (blocks)
The DataNodes store the actual data (blocks)
Each block is replicated N times (default N = 3)
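The File F example can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the splitting rule (full blocks plus one smaller last block) and of the raw storage cost at the default replication factor; the function name is made up and nothing here calls Hadoop.

```python
from typing import List

def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> List[int]:
    """Return the sizes (in MB) of the blocks a file of the given size is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only holds what is left
    return blocks

blocks = split_into_blocks(300)   # File F from the example, 64 MB blocks
replication = 3                   # default replication factor
raw_storage_mb = sum(blocks) * replication

print(blocks)          # [64, 64, 64, 64, 44]
print(raw_storage_mb)  # 900
```

With three replicas, the 300 MB file occupies 900 MB of raw disk space across the cluster.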
Benefits of replication
Availability: data isn’t lost when a node fails
Reliability: HDFS compares replicas and fixes data corruption
Performance: allows for data locality
The NameNode endeavors to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove. The NameNode will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block's availability.
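The two preferences described above can be sketched as a small selection function. This is a simplified model, not the NameNode's real code: the Replica type, the field names, and the sample DataNodes are assumptions made for the example.

```python
from collections import Counter
from typing import List, NamedTuple

class Replica(NamedTuple):
    datanode: str
    rack: str
    free_gb: int  # available disk space on that DataNode

def choose_replica_to_remove(replicas: List[Replica]) -> Replica:
    """Pick the replica to drop from an over-replicated block."""
    per_rack = Counter(r.rack for r in replicas)
    # First preference: only consider replicas whose rack holds more than one copy,
    # so that removing one does not reduce the number of racks hosting the block.
    safe = [r for r in replicas if per_rack[r.rack] > 1]
    candidates = safe or replicas
    # Second preference: remove from the DataNode with the least available space.
    return min(candidates, key=lambda r: r.free_gb)

replicas = [
    Replica("dn1", "rackA", 500),
    Replica("dn2", "rackA", 120),
    Replica("dn3", "rackB", 40),
    Replica("dn4", "rackC", 300),
]
victim = choose_replica_to_remove(replicas)
print(victim.datanode)  # dn2
```

Note that dn3 has the least free space overall, but removing its replica would leave rackB without a copy, so the sketch picks dn2 instead.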
[Figure: Block replication. A centralized NameNode and many DataNodes (1000s); the five blocks of a file are replicated across different DataNodes.]
Given Data: Huge file containing all emails sent to customer service.
Sample Scenario: How many times did our customers type the word “Refund” into emails sent to customer service?
Typical workflow:
• Load data into the cluster (HDFS writes)
• Analyze the data (MapReduce)
• Store results in the cluster (HDFS writes)
• Read the results from the cluster (HDFS reads)
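The "Refund" scenario maps naturally onto the map and reduce steps. The sketch below imitates the idea in plain Python on three made-up emails. It is not the Hadoop MapReduce API, and matching the word case-insensitively is an assumption about what the question intends.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_refund(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit ("refund", 1) for every occurrence of the word in one line."""
    for word in re.findall(r"[A-Za-z]+", line):
        if word.lower() == "refund":
            yield ("refund", 1)

def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: add up all the counts for one key."""
    return key, sum(values)

# Made-up sample data standing in for the huge email file.
emails = [
    "Please issue a refund for my order.",
    "I asked for a Refund twice, still no refund!",
    "Thanks, the package arrived.",
]
pairs = [pair for line in emails for pair in map_refund(line)]
results = dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
print(results)  # {'refund': 3}
```

On a real cluster each map task would run over one block of the file on the node that stores it, and only the small (key, count) pairs would travel over the network to the reducer.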
• [-ls <path>]
• [-du <path>]
• [-cp <src> <dst>]
• [-rm <path>]
• [-put <localsrc> <dst>]
• [-copyFromLocal <localsrc> <dst>]
• [-moveFromLocal <localsrc> <dst>]
• [-get [-crc] <src> <localdst>]
• [-cat <src>]
• [-copyToLocal [-crc] <src> <localdst>]
• [-moveToLocal [-crc] <src> <localdst>]
• [-mkdir <path>]
• [-touchz <path>]
• [-test -[ezd] <path>]
• [-stat [format] <path>]
• [-help [cmd]]
1. Create a directory in HDFS at the given path(s): hadoop fs -mkdir <path>
2. List the contents of a directory: hadoop fs -ls <path>
3. Remove a file or directory in HDFS: hadoop fs -rm <path> removes the files given as arguments. Without -r, a non-empty directory is not deleted.
Remove a folder and its contents: hadoop fs -rm -r /folder
4. Upload a file to HDFS: hadoop fs -put <localsrc> <dst> copies a single source file, or multiple source files, from the local file system to HDFS.
5. Download a file from HDFS: hadoop fs -get <src> <localdst> copies files to the local file system.
6. See the contents of a file: hadoop fs -cat <src>
./filename.exet
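The numbered tasks above can be strung together into one small session. The sketch below only builds and prints the command lines, so it runs without a cluster. The paths and file names are hypothetical, the parent directory /user/demo is assumed to exist already, and only options from the command list above are used.

```python
from typing import List

def fs_command(option: str, *args: str) -> List[str]:
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", option, *args]

# Hypothetical paths; adjust them to your own cluster layout.
workflow = [
    fs_command("-mkdir", "/user/demo/emails"),                                # 1. create a directory
    fs_command("-put", "emails.txt", "/user/demo/emails"),                    # 4. upload a file
    fs_command("-ls", "/user/demo/emails"),                                   # 2. list the directory
    fs_command("-cat", "/user/demo/emails/emails.txt"),                       # 6. see the contents
    fs_command("-get", "/user/demo/emails/emails.txt", "emails_copy.txt"),    # 5. download a file
    fs_command("-rm", "/user/demo/emails/emails.txt"),                        # 3. remove the file
]
for argv in workflow:
    print(" ".join(argv))
```

On a machine with the Hadoop client installed, each list could be passed to subprocess.run(argv, check=True) to execute the command for real.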