
Hadoop Distributed File System A.A. 2016/17

Matteo Nardelli

Laurea Magistrale in Ingegneria Informatica - II anno

Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack


• Resource Management
• Data Storage
• Data Processing
• High-level Interfaces
• Support / Integration


HDFS
• Hadoop Distributed File System
  – open-source implementation
  – clones the Google File System
  – de-facto standard for batch-processing frameworks, e.g., Hadoop MapReduce, Spark, Hive, Pig

Design principles
• Process very large files: hundreds of megabytes, gigabytes, or terabytes in size
• Simple coherency model: files follow the write-once, read-many-times pattern
• Commodity hardware: HDFS is designed to carry on working without a noticeable interruption to the user even when failures occur
• Portability across heterogeneous hardware and software platforms


HDFS
HDFS does not work well with:
• Low-latency data access: HDFS is optimized for delivering a high throughput of data
• Lots of small files: the number of files in HDFS is limited by the amount of memory on the NameNode, which holds the file system metadata in memory
• Multiple writers and arbitrary file modifications


HDFS
A file is split into one or more blocks, and these blocks are stored on a set of storage nodes (called DataNodes).


HDFS: architecture
• An HDFS cluster has two types of nodes:
  – one master, called the NameNode
  – multiple workers, called DataNodes


HDFS
NameNode
• manages the file system namespace
• manages the metadata for all files and directories
• determines the mapping between blocks and DataNodes

DataNodes
• store and retrieve blocks (also called shards or chunks) when they are told to (by clients or by the NameNode)
• manage the storage attached to the nodes where they execute

• Without the NameNode, HDFS cannot be used
  – it is important to make the NameNode resilient to failures
• Large blocks (default 128 MB): why? Large blocks keep the per-file metadata on the NameNode small and amortize seek time over long sequential transfers, which suits the high-throughput workloads HDFS targets.
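To see how a file is actually laid out in blocks and where the replicas are stored, fsck can be used (the path /demo/file below is just a placeholder):

$ hdfs fsck /demo/file -files -blocks -locations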


HDFS: file read

The NameNode is only used to obtain block locations; the client then reads the data directly from the DataNodes.


Source: “Hadoop: The Definitive Guide”


HDFS: file write


Source: “Hadoop: The Definitive Guide”

• Clients ask the NameNode for a list of suitable DataNodes
• This list forms a pipeline: the first DataNode stores a copy of the block, then forwards it to the second, and so on
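The pipeline length equals the replication factor. As a sketch (the path is a placeholder), the factor can be overridden for a single write with a generic -D option:

$ hdfs dfs -D dfs.replication=2 -put file /demo/file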

Installation and Configuration of HDFS (step by step)


Apache Hadoop 2: Configuration


Download
http://hadoop.apache.org/releases.html

Configure environment variables
In .profile (or .bash_profile), export all the needed environment variables (on a Linux/Mac OS system):

$ cd
$ nano .profile

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre
export HADOOP_HOME=/usr/local/hadoop-2.7.2
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
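A quick sanity check (not on the original slide) that the variables are picked up:

$ source ~/.profile
$ hadoop version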

Apache Hadoop 2: Configuration


Allow remote login
• Your system should accept connections through SSH (i.e., run an SSH server, and set your firewall to allow incoming connections)
• Enable passwordless login using an RSA key
• Create a new RSA key and add it to the list of authorized keys

(on a Linux/Mac OS system)

$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
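To verify the setup (not on the original slide), logging into the local machine should now succeed without a password prompt:

$ ssh localhost
$ exit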


Apache Hadoop 2: Configuration


Hadoop configuration, in $HADOOP_HOME/etc/hadoop:
• core-site.xml: common settings for HDFS, MapReduce, and YARN
• hdfs-site.xml: configuration settings for the HDFS daemons (i.e., namenode, secondary namenode, and datanodes)
• mapred-site.xml: configuration settings for MapReduce (e.g., the job history server)
• yarn-site.xml: configuration settings for the YARN daemons (e.g., resource manager, node managers)

By default, Hadoop runs in non-distributed mode, as a single Java process. We will configure Hadoop to execute in pseudo-distributed mode.

More on the Hadoop configuration: https://hadoop.apache.org/docs/current/

Apache Hadoop 2: Configuration


core-site.xml and hdfs-site.xml
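The original slide shows the XML of both files. As a sketch, a typical pseudo-distributed configuration (values taken from the standard Hadoop 2 single-node setup, not recovered from the slide) is:

core-site.xml:

<configuration>
  <property>
    <!-- default file system: the local HDFS namenode -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <!-- a single node cannot host more than one replica -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>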


Apache Hadoop 2: Configuration


mapred-site.xml and yarn-site.xml
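Again as a sketch (standard values for running MapReduce on YARN, not recovered from the slide):

mapred-site.xml:

<configuration>
  <property>
    <!-- run MapReduce jobs on YARN instead of the local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:

<configuration>
  <property>
    <!-- auxiliary service needed by the MapReduce shuffle phase -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>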

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Installation and Configuration of HDFS (our pre-configured Docker image)


HDFS with Docker


• Create a small network named hadoop_network with one namenode (master) and three datanodes (slaves):

$ docker pull matnar/hadoop
$ docker network create --driver bridge hadoop_network
$ docker run -t -i -p 50075:50075 -d --network=hadoop_network --name=slave1 matnar/hadoop
$ docker run -t -i -p 50076:50075 -d --network=hadoop_network --name=slave2 matnar/hadoop
$ docker run -t -i -p 50077:50075 -d --network=hadoop_network --name=slave3 matnar/hadoop
$ docker run -t -i -p 50070:50070 --network=hadoop_network --name=master matnar/hadoop
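Once the containers are up, one way to confirm that all three datanodes registered with the namenode (assuming the image puts the hadoop binaries on the PATH, which the slide does not state) is:

$ docker exec -it master hdfs dfsadmin -report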

HDFS with Docker


How to remove the containers
• stop and delete the namenode and datanodes
• remove the network (only after its containers are gone)

$ docker kill master slave1 slave2 slave3
$ docker rm master slave1 slave2 slave3
$ docker network rm hadoop_network


HDFS: initialization and operations

Apache Hadoop 2: Configuration


At its first execution, HDFS needs to be initialized
• this operation erases the content of HDFS
• it should be executed only during the initialization phase

$ hdfs namenode -format


HDFS: Configuration


Start HDFS:

$ $HADOOP_HOME/sbin/start-dfs.sh

Stop HDFS:

$ $HADOOP_HOME/sbin/stop-dfs.sh

HDFS: Configuration

When HDFS is started, you can check its WebUI:
• http://localhost:50070/

Obtain basic filesystem information and statistics:

$ hdfs dfsadmin -report


HDFS: Basic operations


ls: for a file, returns stat on the file; for a directory, returns the list of its direct children

$ hdfs dfs -ls [-d] [-h] [-R] <args>

-d: directories are listed as plain files
-h: format file sizes in a human-readable fashion
-R: recursively list subdirectories encountered

mkdir: takes path URIs as arguments and creates directories

$ hdfs dfs -mkdir [-p] <paths>

-p: creates parent directories along the path

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
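For instance (the paths are placeholders):

$ hdfs dfs -mkdir -p /user/demo/input
$ hdfs dfs -ls -h -R /user/demo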

HDFS: Basic operations


mv: moves files from source to destination. This command allows multiple sources, in which case the destination needs to be a directory. Moving files across file systems is not permitted.

$ hdfs dfs -mv URI [URI ...] <dest>

put: copies a single src, or multiple srcs, from the local file system to the destination file system

$ hdfs dfs -put <localsrc> ... <dst>

Also reads input from stdin and writes to the destination file system:

$ hdfs dfs -put - <dst>
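For instance, writing a file directly from stdin (the path is a placeholder):

$ echo "hello from stdin" | hdfs dfs -put - /demo/hello.txt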


HDFS: Basic operations


appendToFile: appends single or multiple files from the local file system to the destination file system

$ hdfs dfs -appendToFile <localsrc> ... <dst>

get: copies files to the local file system; files that fail the CRC check may be copied with the -ignorecrc option

$ hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>

cat: copies source paths to stdout

$ hdfs dfs -cat URI [URI ...]

HDFS: Basic operations


rm: deletes files specified as args

$ hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]

-f: if the file does not exist, does not display a diagnostic message or modify the exit status to reflect an error
-R (or -r): deletes the directory and any content under it recursively
-skipTrash: bypasses trash, if enabled

cp: copies files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.

$ hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

-f: overwrites the destination if it already exists
-p: preserves file attributes [topax] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no arg, it preserves timestamps, ownership, and permission.


HDFS: Basic operations


stat: prints statistics about the file/directory at <path> in the specified format

$ hadoop fs -stat [format] <path> ...

Format accepts:
%b  size of file in bytes
%F  returns "file", "directory", or "symlink" depending on the type of inode
%g  group name
%n  filename
%o  HDFS block size in bytes (128 MB by default)
%r  replication factor
%u  username of owner
%y  formatted mtime of inode
%Y  UNIX epoch mtime of inode
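For instance (the path is a placeholder):

$ hadoop fs -stat "%n %F %b %r" /demo/hello.txt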

An example


$ echo "File content" >> file
$ hdfs dfs -put file /file
$ hdfs dfs -ls /
$ hdfs dfs -mv /file /democontent
$ hdfs dfs -cat /democontent
$ hdfs dfs -appendToFile file /democontent
$ hdfs dfs -cat /democontent
$ hdfs dfs -mkdir /folder01
$ hdfs dfs -cp /democontent /folder01/text
$ hdfs dfs -ls /folder01
$ hdfs dfs -rm /democontent
$ hdfs dfs -get /folder01/text textfromhdfs
$ cat textfromhdfs
$ hdfs dfs -rm -r /folder01


HDFS: Snapshot


Snapshots
• read-only, point-in-time copies of the file system
• can be taken on a sub-tree or on the entire file system

Common use cases: data backup, protection against user errors, and disaster recovery.

The implementation is efficient:
• creation is instantaneous
• blocks in the datanodes are not copied (snapshots operate on metadata only)
• a snapshot does not adversely affect regular HDFS operations
  – changes are recorded in reverse chronological order so that the current data can be accessed directly
  – the snapshot data is computed by subtracting the modifications from the current data

https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

HDFS: Snapshot


Declare a folder where snapshot operations are allowed:

$ hdfs dfsadmin -allowSnapshot <folder>

Create a snapshot:

$ hdfs dfs -createSnapshot <folder> <snapshot-name>

List the snapshots:

$ hdfs dfs -ls <folder>/.snapshot

Delete a snapshot:

$ hdfs dfs -deleteSnapshot <folder> <snapshot-name>

Disable snapshot operations within a folder:

$ hdfs dfsadmin -disallowSnapshot <folder>


An example


$ hdfs dfs -mkdir /snap
$ hdfs dfs -cp /debs /snap/debs
$ hdfs dfsadmin -allowSnapshot /snap
$ hdfs dfs -createSnapshot /snap snap001
$ hdfs dfs -ls /snap/.snapshot
$ hdfs dfs -ls /snap/.snapshot/snap001
$ hdfs dfs -cp -ptopax /snap/.snapshot/snap001/debs /debs
$ hdfs dfs -deleteSnapshot /snap snap001
$ hdfs dfsadmin -disallowSnapshot /snap
$ hdfs dfs -rm -r /snap

HDFS: Replication


setrep: changes the replication factor of a file. If the path is a directory, the command recursively changes the replication factor of all files under the directory tree rooted at the path.

$ hdfs dfs -setrep [-w] <numReplicas> <path>

-w: requests that the command wait for the replication to complete; this can potentially take a very long time


An example


$ hdfs dfs -put file /norepl/file
$ hdfs dfs -ls /norepl
$ hdfs dfs -setrep 1 /norepl
$ hdfs dfs -ls /norepl
$ hdfs dfs -put file /norepl/file2
$ hdfs dfs -ls /norepl
$ hdfs dfs -setrep 1 /norepl/file2
# also check block availability from the web UI