Hadoop Distributed File System A.A. 2016/17
Matteo Nardelli
Master's Degree in Computer Engineering (Laurea Magistrale in Ingegneria Informatica) - 2nd year
Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica
The reference Big Data stack
• Resource Management
• Data Storage
• Data Processing
• High-level Interfaces
• Support / Integration
HDFS
• Hadoop Distributed File System
– open-source implementation
– clones the Google File System
– de-facto standard for batch-processing frameworks, e.g., Hadoop MapReduce, Spark, Hive, Pig
Design principles
• Process very large files: hundreds of megabytes, gigabytes, or terabytes in size
• Simple coherency model: files follow the write-once, read-many-times pattern
• Commodity hardware: HDFS is designed to carry on working without noticeable interruption to the user even when failures occur
• Portability across heterogeneous hardware and software platforms
HDFS
HDFS does not work well with:
• Low-latency data access: HDFS is optimized for delivering a high throughput of data
• Lots of small files: the number of files in HDFS is limited by the amount of memory on the namenode, which holds the file system metadata in memory (as a rule of thumb, each file, directory, and block takes about 150 bytes of namenode memory)
• Multiple writers, arbitrary file modifications
HDFS
A file is split into one or more blocks, and these blocks are stored on a set of storage nodes (called DataNodes)
HDFS: architecture
• An HDFS cluster has two types of nodes:
– one master, called NameNode
– multiple workers, called DataNodes
HDFS
NameNode
• manages the file system namespace
• manages the metadata for all the files and directories
• determines the mapping between blocks and DataNodes

DataNodes
• store and retrieve the blocks (also called shards or chunks) when they are told to (by clients or by the namenode)
• manage the storage attached to the nodes where they execute

• Without the namenode, HDFS cannot be used
– It is important to make the namenode resilient to failures
• Large blocks (default 128 MB): why? Large blocks keep seek time small relative to transfer time, and they limit the amount of block metadata the namenode must hold in memory.
HDFS: file read
The NameNode is only used to get block locations; clients then read the data directly from the DataNodes
Source: “Hadoop: The definitive guide”
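To see how a file actually maps to blocks and DataNode locations, you can ask the NameNode with fsck (a quick check, assuming HDFS is running; the path is illustrative):

$ hdfs fsck /democontent -files -blocks -locations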
HDFS: file write
Source: “Hadoop: The definitive guide”
• Clients ask the NameNode for a list of suitable DataNodes
• This list forms a pipeline: the first DataNode stores a copy of the block, then forwards it to the second, and so on
Installation and Configuration of HDFS (step by step)
Apache Hadoop 2: Configuration
Download
http://hadoop.apache.org/releases.html

Configure environment variables
In the .profile (or .bash_profile), export all needed environment variables
(on a Linux/Mac OS system)

$ cd
$ nano .profile

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre
export HADOOP_HOME=/usr/local/hadoop-2.7.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
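To reload the profile and verify that the variables are set correctly (a minimal check; hadoop version simply prints the installed release):

$ source ~/.profile
$ hadoop version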
Apache Hadoop 2: Configuration
Allow remote login
• Your system should accept connections through SSH
(i.e., run an SSH server, set your firewall to allow incoming connections)
• Enable login without a password, using an RSA key
• Create a new RSA key and add it to the list of authorized keys

(on a Linux/Mac OS system)

$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
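You can then check that passwordless login works; the following should open a shell without prompting for a password:

$ ssh localhost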
Apache Hadoop 2: Configuration
Hadoop Configuration in $HADOOP_HOME/etc/hadoop:
• core-site.xml: common settings for HDFS, MapReduce, and YARN
• hdfs-site.xml: configuration settings for HDFS daemons (i.e., namenode, secondary namenode, and datanodes)
• mapred-site.xml: configuration settings for MapReduce (e.g., job history server)
• yarn-site.xml: configuration settings for YARN daemons (e.g., resource manager, node managers)
By default, Hadoop runs in non-distributed mode, as a single Java process. We will configure Hadoop to execute in pseudo-distributed mode.
More on the Hadoop configuration: https://hadoop.apache.org/docs/current/
Apache Hadoop 2: Configuration
core-site.xml hdfs-site.xml
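The XML shown on this slide did not survive the transcript; for a pseudo-distributed setup, the typical settings (as in the Apache Hadoop 2 documentation; hdfs://localhost:9000 is the conventional default) are:

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>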
Apache Hadoop 2: Configuration
mapred-site.xml yarn-site.xml
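Again, the XML itself is missing from the transcript; the standard pseudo-distributed settings from the Apache Hadoop 2 documentation are:

mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>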
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Installation and Configuration of HDFS (our pre-configured Docker image)
HDFS with Docker
• create a small network named hadoop_network with one namenode (master) and three datanodes (slaves)
$ docker pull matnar/hadoop
$ docker network create --driver bridge hadoop_network
$ docker run -t -i -p 50075:50075 -d --network=hadoop_network --name=slave1 matnar/hadoop
$ docker run -t -i -p 50076:50075 -d --network=hadoop_network --name=slave2 matnar/hadoop
$ docker run -t -i -p 50077:50075 -d --network=hadoop_network --name=slave3 matnar/hadoop
$ docker run -t -i -p 50070:50070 --network=hadoop_network --name=master matnar/hadoop
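Before using the cluster, you can check that all four containers are up, and open a shell on the master (assuming the image provides bash):

$ docker ps
$ docker exec -it master bash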
HDFS with Docker
How to remove the containers
• stop and delete the namenode and datanodes
• remove the network (only after its containers are gone, otherwise Docker refuses to delete it)

$ docker kill master slave1 slave2 slave3
$ docker rm master slave1 slave2 slave3
$ docker network rm hadoop_network
HDFS: initialization and operations
Apache Hadoop 2: Configuration
At the first execution, HDFS needs to be initialized:
• this operation erases the content of the HDFS
• it should be executed only during the initialization phase

$ hdfs namenode -format
HDFS: Configuration
Start HDFS:

$ $HADOOP_HOME/sbin/start-dfs.sh

Stop HDFS:

$ $HADOOP_HOME/sbin/stop-dfs.sh
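After starting HDFS, a quick sanity check is to list the running Java daemons with jps (part of the JDK); in pseudo-distributed mode you should see NameNode, DataNode, and SecondaryNameNode (PIDs will differ):

$ jps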
HDFS: Configuration
When HDFS is started, you can check its WebUI:
• http://localhost:50070/

Obtain basic filesystem information and statistics:

$ hdfs dfsadmin -report
HDFS: Basic operations
ls: for a file, returns stats on the file; for a directory, returns the list of its direct children

$ hdfs dfs -ls [-d] [-h] [-R] <args>

-d: directories are listed as plain files
-h: format file sizes in a human-readable fashion
-R: recursively list subdirectories encountered

mkdir: takes path URIs as arguments and creates directories

$ hdfs dfs -mkdir [-p] <paths>

-p: creates parent directories along the path
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
HDFS: Basic operations
mv: moves files from source to destination. This command allows multiple sources, in which case the destination needs to be a directory. Moving files across file systems is not permitted.

$ hdfs dfs -mv URI [URI ...] <dest>

put: copies a single src, or multiple srcs, from the local file system to the destination file system

$ hdfs dfs -put <localsrc> ... <dst>

put also reads input from stdin and writes to the destination file system:

$ hdfs dfs -put - <dst>
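For example, writing a line from stdin into a new HDFS file (a minimal sketch; /hello is an arbitrary target path):

$ echo "hello" | hdfs dfs -put - /hello
$ hdfs dfs -cat /hello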
HDFS: Basic operations
appendToFile: appends single or multiple files from the local file system to the destination file system

$ hdfs dfs -appendToFile <localsrc> ... <dst>

get: copies files to the local file system; files that fail the CRC check may be copied with the -ignorecrc option

$ hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

cat: copies source paths to stdout

$ hdfs dfs -cat URI [URI ...]
HDFS: Basic operations
rm: deletes files specified as args

$ hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]

-f: does not display a diagnostic message (modifies the exit status to reflect an error if the file does not exist)
-R (or -r): deletes the directory and any content under it recursively
-skipTrash: bypasses trash, if enabled

cp: copies files from source to destination. This command also allows multiple sources, in which case the destination must be a directory.

$ hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

-f: overwrites the destination if it already exists
-p: preserves file attributes [topax] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no arg, it preserves timestamps, ownership, and permission.
HDFS: Basic operations
stat: prints statistics about the file/directory at <path> in the specified format

$ hadoop fs -stat [format] <path> ...

Format accepts:
%b Size of file in bytes
%F Will return "file", "directory", or "symlink" depending on the type of inode
%g Group name
%n Filename
%o HDFS block size in bytes (128 MB by default)
%r Replication factor
%u Username of owner
%y Formatted mtime of inode
%Y UNIX epoch mtime of inode
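For instance, printing filename, size, and replication factor of a file (the path is illustrative):

$ hadoop fs -stat "%n %b %r" /democontent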
An example
$ echo "File content" >> file
$ hdfs dfs -put file /file
$ hdfs dfs -ls /
$ hdfs dfs -mv /file /democontent
$ hdfs dfs -cat /democontent
$ hdfs dfs -appendToFile file /democontent
$ hdfs dfs -cat /democontent
$ hdfs dfs -mkdir /folder01
$ hdfs dfs -cp /democontent /folder01/text
$ hdfs dfs -ls /folder01
$ hdfs dfs -rm /democontent
$ hdfs dfs -get /folder01/text textfromhdfs
$ cat textfromhdfs
$ hdfs dfs -rm -r /folder01
HDFS: Snapshot
Snapshots
• read-only point-in-time copies of the file system
• can be taken on a sub-tree or on the entire file system

Common use cases: data backup, protection against user errors, and disaster recovery.

The implementation is efficient:
• the creation is instantaneous
• blocks in datanodes are not copied (it operates on metadata only)
• a snapshot does not adversely affect regular HDFS operations
– changes are recorded in reverse chronological order so that the current data can be accessed directly
– the snapshot data is computed by subtracting the modifications from the current data
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
HDFS: Snapshot
Declare a folder where snapshot operations are allowed:

$ hdfs dfsadmin -allowSnapshot <folder>

Create a snapshot:

$ hdfs dfs -createSnapshot <folder> <snapshot-name>

List the snapshots:

$ hdfs dfs -ls <folder>/.snapshot

Delete a snapshot:

$ hdfs dfs -deleteSnapshot <folder> <snapshot-name>

Disable snapshot operations within a folder:

$ hdfs dfsadmin -disallowSnapshot <folder>
An example
$ hdfs dfs -mkdir /snap
$ hdfs dfs -cp /debs /snap/debs
$ hdfs dfsadmin -allowSnapshot /snap
$ hdfs dfs -createSnapshot /snap snap001
$ hdfs dfs -ls /snap/.snapshot
$ hdfs dfs -ls /snap/.snapshot/snap001
$ hdfs dfs -cp -ptopax /snap/.snapshot/snap001/debs /debs
$ hdfs dfs -deleteSnapshot /snap snap001
$ hdfs dfsadmin -disallowSnapshot /snap
$ hdfs dfs -rm -r /snap
HDFS: Replication
setrep: changes the replication factor of a file. If path is a directory, the command recursively changes the replication factor of all files under the directory tree rooted at path.

$ hdfs dfs -setrep [-w] <numReplicas> <path>

-w: requests that the command wait for the replication to complete; this can potentially take a very long time
An example
$ hdfs dfs -put file /norepl/file
$ hdfs dfs -ls /norepl
$ hdfs dfs -setrep 1 /norepl
$ hdfs dfs -ls /norepl
$ hdfs dfs -put file /norepl/file2
$ hdfs dfs -ls /norepl
$ hdfs dfs -setrep 1 /norepl/file2
# also check block availability from the WebUI