Serengeti User's Guide
Serengeti 0.8
VMware, Inc.
Contents

1. Serengeti User's Guide
1.1 Intended Audience
2. Serengeti Overview
2.1 Serengeti
2.1.1 Serengeti Features
2.1.2 Serengeti Architecture Overview
2.2 Hadoop
2.3 VMware Virtual Infrastructure
2.4 Serengeti Virtual Appliance Requirements
2.5 Serengeti CLI Requirements
3. Installing the Serengeti Virtual Appliance
3.1 Download
3.2 Deploy Serengeti
4. Quick Start
4.1 Set up the Serengeti CLI
4.2 Deploy a Hadoop Cluster
4.3 Deploy a HBase Cluster
5. Using Serengeti
5.1 Manage Serengeti Users
5.1.1 Add/Delete a User in Serengeti
5.1.2 Modify User Password
5.2 Manage Resources in Serengeti
5.2.1 Add a Datastore
5.2.2 Add a Network
5.2.3 Add a Resource Pool
5.2.4 View Datastores
5.2.5 View Networks
5.2.6 View Resource Pools
5.2.7 Remove a Datastore
5.2.8 Remove a Network
5.2.9 Remove a Resource Pool
5.3 Manage Distros
5.3.1 Supported Distros
5.3.2 Add a Distro to Serengeti
5.3.3 List Distros
5.3.4 Using a Distro
5.4 Hadoop Clusters
5.4.1 Deploy Hadoop Clusters
5.4.2 Manage Hadoop Clusters
5.4.3 Use Hadoop Clusters
5.5 HBase Clusters
5.5.1 Deploy HBase Clusters
5.5.2 Manage HBase Clusters
5.5.3 Use HBase Clusters
5.6 Monitoring Cluster Deployed by Serengeti
5.7 Make Hadoop Master Node HA/FT
5.8 Hadoop Topology Awareness
5.9 Start and Stop Serengeti Services
6. Cluster Specification Reference
7. Serengeti Command Reference
7.1 connect
7.2 cluster
7.2.1 cluster config
7.2.2 cluster create
7.2.3 cluster delete
7.2.4 cluster export
7.2.5 cluster limit
7.2.6 cluster list
7.2.7 cluster resize
7.2.8 cluster start
7.2.9 cluster stop
7.2.10 cluster target
7.2.11 cluster unlimit
7.3 datastore
7.3.1 datastore add
7.3.2 datastore delete
7.3.3 datastore list
7.4 distro
7.4.1 distro list
7.5 disconnect
7.6 fs
7.6.1 fs cat
7.6.2 fs chgrp
7.6.3 fs chmod
7.6.4 fs chown
7.6.5 fs copyFromLocal
7.6.6 fs copyToLocal
7.6.7 fs copyMergeToLocal
7.6.8 fs count
7.6.9 fs cp
7.6.10 fs du
7.6.11 fs expunge
7.6.12 fs get
7.6.13 fs ls
7.6.14 fs mkdir
7.6.15 fs moveFromLocal
7.6.16 fs mv
7.6.17 fs put
7.6.18 fs rm
7.6.19 fs setrep
7.6.20 fs tail
7.6.21 fs text
7.6.22 fs touchz
7.7 hive
7.7.1 hive cfg
7.7.2 hive script
7.8 mr
7.8.1 mr jar
7.8.2 mr job counter
7.8.3 mr job events
7.8.4 mr job history
7.8.5 mr job kill
7.8.6 mr job list
7.8.7 mr job set priority
7.8.8 mr job status
7.8.9 mr job submit
7.8.10 mr task fail
7.8.11 mr task kill
7.9 network
7.9.1 network add
7.9.2 network delete
7.9.3 network list
7.10 pig script
7.10.1 pig cfg
7.10.2 pig script
7.11 resourcepool
7.11.1 resourcepool add
7.11.2 resourcepool delete
7.11.3 resourcepool list
7.12 topology
7.12.1 topology upload
7.12.2 topology list
8. vSphere Settings
8.1 vSphere Cluster Configuration
8.1.1 Setup Cluster
8.1.2 Enable DRS/HA on an existing cluster
8.1.3 Add Hosts to Cluster
8.1.4 DRS/FT Settings
8.1.5 Enable FT on specific virtual machine
8.2 Network Settings
8.2.1 Setup Port Group - Option A (vSphere Distributed Switch)
8.2.2 Setup Port Group - Option B (vSwitch)
8.3 Storage Settings
8.3.1 Shared Storage Setting
8.3.2 Local Storage Settings
9. Appendix A: Create Local Yum Repository for MapR
9.1 Install a web server to serve as yum server
9.1.1 Configure http proxy
9.1.2 Install Apache Web Server
9.1.3 Install yum related packages
9.1.4 Sync the remote MapR yum repository
9.2 Create local yum repository
9.3 Configure http proxy for the VMs created by Serengeti Server
10. Appendix B: Create Local Yum Repository for CDH4
10.1 Install a web server to serve as yum server
10.1.1 Configure http proxy
10.1.2 Install Apache Web Server
10.1.3 Install yum related packages
10.1.4 Sync the remote CDH4 yum repository
10.2 Create local yum repository
10.3 Configure http proxy for the VMs created by Serengeti Server
1. Serengeti User’s Guide
The Serengeti User's Guide provides information about installing and using Serengeti to deploy and scale Hadoop clusters on vSphere.
To help you get started with Serengeti, this guide includes descriptions of Serengeti concepts and features, along with a set of usage examples and sample scripts.
1.1 Intended Audience
This book is intended for anyone who needs to install and use Serengeti. The information in this book is
written for administrators and developers who are familiar with VMware vSphere.
2. Serengeti Overview
2.1 Serengeti
The Serengeti virtual appliance is a management service that you can use to deploy Hadoop clusters on
VMware vSphere systems. It is a “one-click” deployment toolkit that allows you to leverage the VMware
vSphere platform to deploy a highly available Hadoop cluster in minutes, including common Hadoop
components such as HDFS, MapReduce, Pig, and Hive on a virtual platform. Serengeti supports multiple
Hadoop 0.20 based distributions, CDH4 (except YARN), and MapR M5.
2.1.1 Serengeti Features
2.1.1.1 Rapid Provisioning
Serengeti can deploy Hadoop clusters with HDFS, MapReduce, HBase, Pig, Hive client and Hive server
in your vSphere system easily and quickly.
Serengeti includes a provisioning engine, the Apache Hadoop distribution, and a virtual machine template.
Serengeti is preconfigured to automate Hadoop cluster deployment and configuration. With Serengeti,
you can save time in getting started with Hadoop because you do not need to install and configure an
operating system, or download, install and configure each software package on each machine.
2.1.1.2 High Availability
Serengeti takes advantage of vSphere High Availability (HA) to protect the Hadoop master node virtual
machine. vSphere can monitor the master node virtual machine: if the Hadoop NameNode or JobTracker
service stops unexpectedly, vSphere restarts the master node to recover it. If the virtual machine itself
stops unexpectedly because of a host failure, or becomes unreachable due to network problems, vSphere
Fault Tolerance (FT) automatically starts a standby virtual machine to reduce unplanned downtime.
2.1.1.3 Local Disk Management
Serengeti allows you to use both shared storage and local storage. After the disks are formatted to
datastores in vSphere, you can add the datastores to Serengeti easily. You can specify whether the
datastores are shared storage (SHARED) or local storage (LOCAL). Serengeti automatically allocates the
datastores to Hadoop clusters when you deploy a Hadoop cluster.
By default, Serengeti allocates Hadoop master nodes and client nodes on SHARED datastores, and
data/compute nodes on LOCAL datastores, including both system disk and data disks of those nodes. If
you specify only local storage or shared storage, Serengeti allocates all Hadoop nodes on the available
datastores for a default cluster.
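As a sketch, adding and tagging datastores in the Serengeti CLI might look like the following (the command shape follows the datastore commands in section 7.3; the datastore names and the wildcard pattern here are hypothetical):

```
datastore add --name sharedDS --spec share* --type SHARED
datastore add --name localDS --spec local* --type LOCAL
datastore list
```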
2.1.1.4 Easy Scale Out
With Serengeti you can add more nodes to a Hadoop cluster with a single command after it has been
deployed. You can start with a small Hadoop cluster and scale out as needed.
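For example, scaling out an existing cluster is a single cluster resize command (see section 7.2.7; the cluster and node group names here are hypothetical):

```
cluster resize --name myHadoop --nodeGroup worker --instanceNum 10
```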
2.1.1.5 Configuration
Serengeti allows you to customize the following:
Number of virtual machines
CPU, RAM, and storage for the virtual machines
Software packages for the virtual machines
Hadoop configuration
Serengeti automatically adjusts the Hadoop configuration according to the virtual machine specification.
After creation, you can export a Hadoop cluster's specification and tune its Hadoop configuration without
affecting unrelated Hadoop nodes.
Serengeti provides both cluster-level and node-group-level configuration. You can set different
parameters for different node groups.
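As an illustration, a node-group-level configuration section in an exported cluster specification might look like the following sketch (check section 6, Cluster Specification Reference, for the exact schema; the node group name and property value here are hypothetical):

```
"nodeGroups": [
  {
    "name": "worker",
    "configuration": {
      "hadoop": {
        "mapred-site.xml": {
          "mapred.tasktracker.map.tasks.maximum": 4
        }
      }
    }
  }
]
```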
2.1.1.6 Data Compute Separation
Serengeti allows you to deploy a Hadoop cluster with separated data and compute nodes.
You can specify the number of data nodes per host.
You can specify the number of compute nodes per data node, and place a compute node and its related data node on the same physical host.
Serengeti also allows you to deploy a compute-only cluster, either to provide performance isolation between different MapReduce clusters or to consume an existing HDFS:
Deploy a Hadoop cluster with only a JobTracker and TaskTrackers to consume an existing Apache Hadoop 0.20-based HDFS.
Deploy a Hadoop cluster with only a JobTracker and TaskTrackers to consume a third-party HDFS.
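A compute-only cluster of this kind can be sketched in a cluster specification as follows (the externalHDFS attribute, role names, and URI are illustrative; check section 6 for the exact schema):

```
{
  "externalHDFS": "hdfs://hdfs-namenode.example.com:8020",
  "nodeGroups": [
    { "name": "master", "roles": ["hadoop_jobtracker"],  "instanceNum": 1 },
    { "name": "worker", "roles": ["hadoop_tasktracker"], "instanceNum": 8 }
  ]
}
```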
2.1.1.7 Remote CLI
You can remotely access the Serengeti Management Server by installing the CLI client in your
environment. The CLI is a one-stop shell to deploy, manage, and use Hadoop.
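For example, a typical remote CLI session might look like the following sketch (the host name is a placeholder; the individual commands are described in section 7):

```
connect --host serengeti.example.com:8080
cluster list
cluster create --name myHadoop
disconnect
```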
2.1.1.8 Hadoop Distribution Management
Serengeti allows you to use any of the following Hadoop distributions:
Apache Hadoop 1.0.x
Greenplum HD 1.2
Hortonworks HDP-1
CDH3
CDH4
MapR M5
You can add your preferred distribution to Serengeti and deploy Hadoop clusters accordingly.
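For example, you can list the available distributions and then select one by name when creating a cluster (the distro name here is illustrative; see sections 7.4 and 7.2.2):

```
distro list
cluster create --name myCDH4 --distro cdh4
```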
2.1.2 Serengeti Architecture Overview
The Serengeti virtual appliance runs on top of a vSphere system and includes a Serengeti Management
Server virtual machine and a Hadoop Template virtual machine. The Hadoop Template virtual machine
includes an agent.
Serengeti performs these major steps to deploy a Hadoop cluster:
1. Serengeti Management Server searches for ESXi hosts with sufficient resources.
2. Serengeti Management Server selects ESXi hosts on which to place Hadoop virtual machines.
3. Serengeti Management Server sends a request to vCenter to clone and reconfigure virtual
machines.
4. Agent configures the OS parameters and network configurations.
5. Agent downloads Hadoop software packages from the Serengeti Management Server.
6. Agent installs Hadoop software.
7. Agent configures Hadoop parameters.
Provisioning is performed in parallel, which reduces deployment time.
2.2 Hadoop
Apache Hadoop is open source software for distributed storage and distributed computing. It includes HDFS, a distributed file system, and MapReduce, a software framework for distributed data processing.
You can find more information about Apache Hadoop at http://hadoop.apache.org/.
2.3 VMware Virtual Infrastructure
VMware's leading virtualization solutions provide multiple benefits to IT administrators and users. VMware
virtualization creates a layer of abstraction between the resources required by an application and
operating system, and the underlying hardware that provides those resources. The value of this
abstraction layer includes the following:
Consolidation: VMware technology allows multiple application servers to be consolidated onto
one physical server, with little or no decrease in overall performance.
Ease of Provisioning: VMware virtualization encapsulates an application into an image that can
be duplicated or moved, greatly reducing the cost of application provisioning and deployment.
Manageability: Virtual machines may be moved from server to server with no downtime using
VMware vMotion™, which simplifies common operations like hardware maintenance and reduces
planned downtime.
Availability: Unplanned downtime can be reduced and higher service levels can be provided to an
application. VMware High Availability (HA) ensures that in the case of an unplanned hardware
failure, any affected virtual machines are restarted on another host in a VMware cluster.
2.4 Serengeti Virtual Appliance Requirements
Software
o VMware vSphere 5.0 Enterprise or VMware vSphere 5.1 Enterprise
o VMWare vSphere Client 5.0 or VMWare vSphere Client 5.1
o SSH client
Network
o DNS Server
o DHCP Server or Static IP Address Block
Resource requirements
o Resource pool with at least 27.5GB RAM
o Port group with at least 6 uplink ports
o 350G or more disk spaces are suggested.
o 17GB is for Serengeti virtual appliance,
o 300GB is for your first Hadoop cluster. You can reduce the disk space
requirements by specifying the storage size in a cluster specification.
o The remaining disk space is reserved for swap space.
o Shared storage is required if you use HA or FT for the Hadoop master node.
Others
o All ESXi hosts should have their time synchronized using the Network Time Protocol (NTP)
2.5 Serengeti CLI Requirements
OS
o Windows
o Linux
Software
o Java 1.6.26 or later
o Unzip tool
Network
o Must be able to access the Serengeti Management Server through HTTP in order to download the
CLI package
3. Installing the Serengeti Virtual Appliance
3.1 Download
Download a Serengeti Virtual Appliance OVA from the VMware site.
3.2 Deploy Serengeti
Serengeti runs on VMware vSphere. You can use the vSphere Client to connect to VMware vCenter
Server and deploy Serengeti.
1. In the vSphere Client, select File > Deploy OVF Template.
2. Select the OVA file location of the Serengeti Virtual Appliance. The vSphere Client verifies the OVA file and
shows you brief information about it.
3. Specify the Serengeti virtual appliance name and inventory location.
Only alphabetic letters (a-z, A-Z), numbers (0-9), spaces, hyphens (-), and underscores (_) can be
used in the virtual appliance name and resource pool name. A datastore name can additionally
contain parentheses and periods.
4. Select the resource pool on which to deploy the template.
You MUST deploy Serengeti in a top-level resource pool.
5. Select a datastore.
6. Select a format for the virtual disks.
7. Map the networks used in the OVF template to the networks in your inventory.
8. Set the properties for this Serengeti deployment.
Serengeti Management Server Network Settings
Network Type Select DHCP or Static IP.
IP Address Enter IP address for the Serengeti Management Server virtual machine.
Net mask Enter the subnet mask of the network.
Gateway Enter the IP address for the network gateway.
DNS Server 1 Enter the DNS server IP address.
DNS Server 2 Enter a second DNS server IP address.
Hadoop Resource Settings
Initialize Resources Keep this option selected to add the resource pool, datastore, and network
to the Serengeti Management Server database. Users can deploy Hadoop
clusters in the resource pool, datastore, and network in which the Serengeti
virtual appliance is deployed. Hadoop node virtual machines attempt to
obtain IP addresses by using DHCP on the network.
9. Verify binding to vCenter Extension Service.
10. Click Next to deploy the virtual appliance. Deployment takes several minutes.
After the Serengeti virtual appliance is deployed successfully, two virtual machines are installed in
vSphere: the Serengeti Management Server virtual machine and the virtual machine
template for Hadoop nodes.
11. Power on the Serengeti vApp and open the console of the Serengeti Management Server VM; the
welcome screen shows the initial OS login password for the root and serengeti users. After logging in
to the VM, update the password with the command "sudo /opt/serengeti/sbin/set-password -u"; the
initial password then no longer appears on the welcome screen.
4. Quick Start
4.1 Set up the Serengeti CLI
The Serengeti command-line shell runs on Windows or Linux and requires Java on the machine.
You can download VMware-Serengeti-cli-0.8.0.0-<build number>.zip from the Serengeti Management Server (http://your-serengeti-server/cli).
Unzip the downloaded package to a directory, go to the "cli" subdirectory, and run "java -jar serengeti*.jar" to start the Serengeti CLI.
Refer to the troubleshooting document if you have any issues.
4.2 Deploy a Hadoop Cluster
You can use the Serengeti CLI to perform actions such as creating and customizing Hadoop clusters. You
can access the Serengeti CLI in two ways: run it on the Serengeti Management Server virtual machine, or
install the CLI on any machine and use it there.
1. Enter the Serengeti shell.
>java -jar serengeti*.jar
2. Run the "connect" command to connect to the Serengeti server.
serengeti>connect --host xx.xx.xx.xx:8080 --username xxx --password xxx
A user named "serengeti" with the password "password" is created by default.
3. Run the "cluster create" command to deploy a Hadoop cluster on vSphere.
serengeti>cluster create --name myHadoop
In the example, “myHadoop” is the name of the Hadoop cluster you deploy. The Serengeti command
continually updates the progress of the deployment.
Only alphabetic letters (a-z, A-Z), numbers (0-9), and underscores (_) can be used in the cluster
name.
This command deploys a Hadoop cluster with one master node virtual machine, three worker node
virtual machines, and one client node virtual machine. The master node runs the NameNode and
JobTracker services, the worker nodes run the DataNode and TaskTracker services, and the client node
contains a Hadoop client environment, including the Hadoop client shell, Pig, and Hive.
After the deployment is complete, you can view the IP addresses of the Hadoop node virtual machines.
Hint
Use the tab key for auto-completion and to get help for commands and parameters.
By default, Serengeti may use any of the added resources to deploy a Hadoop cluster. To limit the
resources used for the cluster, you can specify resource pools, datastores, or a network in the
"cluster create" command:
serengeti>cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW
In this example, "myRP" is the resource pool in which the Hadoop cluster is deployed, "myDS" is the
datastore where the virtual machine images are stored, and "myNW" is the network that the virtual
machines will use.
Once you have a Hadoop cluster deployed, you can execute Hadoop commands directly in the CLI. In this section, we describe how to copy files from the local file system to HDFS and then run a MapReduce job.
1. Start the Serengeti CLI and connect to the Serengeti Management Server as described in section 4.1.
2. Run the "cluster list" command to show all the available clusters.
serengeti>cluster list
3. Run the "cluster target --name" command to connect to the cluster you want to move data into or out of. The "--name" value is the name of the cluster you want to connect to.
serengeti>cluster target --name cluster1
4. Run the "fs put" command to upload data to HDFS.
serengeti>fs put from /etc/inittab to /tmp/input/inittab
5. Run the "fs get" command to download data from HDFS.
serengeti>fs get from /tmp/input/inittab to /tmp/local-inittab
6. Run the "mr jar" command to run a MapReduce job.
serengeti>mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.WordCount --args "/tmp/input /tmp/output"
7. Run the "fs cat" command to show the output of the MapReduce job.
serengeti>fs cat /tmp/output/part-r-00000
8. Run the "fs get" command to download the output of the MapReduce job.
serengeti>fs get from /tmp/output/part-r-00000 to /tmp/wordcount
Hint
You can use the "resourcepool list", "datastore list", and "network list" commands to see what resources
are in Serengeti.
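The WordCount job run in step 6 produces one word-TAB-count line per word, which is what "fs cat" displays in step 7. The reduce logic behind that output format can be sketched locally in plain Python; this is only an illustration of the expected output shape, not the actual Hadoop job:

```python
from collections import Counter

def word_count(lines):
    """Count words across input lines, mimicking the WordCount example's
    word<TAB>count output format (one line per word, sorted by word)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return ["%s\t%d" % (word, n) for word, n in sorted(counts.items())]

if __name__ == "__main__":
    # A couple of lines in the style of /etc/inittab, as uploaded in step 4.
    sample = ["id:5:initdefault:", "si::sysinit:/etc/rc.d/rc.sysinit"]
    for row in word_count(sample):
        print(row)
```

Each line printed here has the same shape as a line of /tmp/output/part-r-00000 from the real job.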
Another way to use Hadoop is through the client VM. By default, Serengeti deploys a client VM with the Hadoop client, Pig, and Hive installed, and with the OS configured and ready to use Hadoop. You can see the IP address of the client VM after a cluster is deployed, or use the "cluster list" command to see it. Follow these steps to verify that the Hadoop cluster is working properly:
1. Use SSH to log in to the client VM.
Use "joe" as the user name. The password is "password".
2. Create your own home directory.
$ hadoop fs -mkdir /usr/joe
3. Run a sample Hadoop MapReduce job.
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 10000000
Feel free to submit other MapReduce, Pig, or Hive jobs as well.
4.3 Deploy an HBase Cluster
Serengeti also supports deploying an HBase cluster on HDFS. The easiest way to deploy an HBase cluster is to run the following command:
serengeti>cluster create --name myHBase --type hbase
In this example, "myHBase" is the name of the HBase cluster you deploy, and "--type hbase" indicates that
you want to deploy an HBase cluster based on a default template that Serengeti provides. This command
deploys one master node virtual machine running the NameNode and HBase Master daemons, three
ZooKeeper nodes running the ZooKeeper daemon, three data nodes running the Hadoop DataNode and
HBase RegionServer daemons, and one client node from which you can launch Hadoop or HBase jobs.
When deployment finishes, you can access HBase in several ways:
1. Log in to the client VM and run "hbase shell" commands.
2. Launch an HBase job, for example "hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred
randomWrite 3".
The default HBase cluster does not contain the Hadoop JobTracker or Hadoop TaskTracker daemons, so
you need to deploy a customized cluster if you want to run an HBase MapReduce job.
3. Access HBase through the RESTful web service or the Thrift gateway. The HBase REST and Thrift services
are configured on the HBase client node; the REST service listens on port 8080 and the Thrift service
listens on port 9090.
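A quick smoke test of the REST gateway is to request the HBase version over HTTP. The sketch below only builds the request URL; the host name "client-vm-ip" is a placeholder, and the "/version" endpoint is part of the standard HBase REST interface, so verify it against your HBase release before relying on it:

```python
def hbase_rest_url(host, endpoint="version", port=8080):
    """Build a URL for the HBase REST gateway running on the HBase client
    node; port 8080 is the default REST port in this deployment."""
    return "http://%s:%d/%s" % (host, port, endpoint.lstrip("/"))

if __name__ == "__main__":
    # To actually issue the request, run this on a machine that can reach
    # the client node, and uncomment the two lines below:
    # import urllib.request
    # print(urllib.request.urlopen(hbase_rest_url("client-vm-ip")).read())
    print(hbase_rest_url("client-vm-ip"))
```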
5. Using Serengeti
5.1 Manage Serengeti Users
Spring Security in-memory authentication is used for Serengeti authentication and user management. To manage Serengeti users, modify the /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file, and then restart the Tomcat service with the command "sudo service tomcat restart".
5.1.1 Add/Delete a User in Serengeti
Add or delete users in the user-service element of the /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file. The following sample adds one user to user-service:
<authentication-manager alias="authenticationManager">
  <authentication-provider>
    <user-service>
      <user name="serengeti" password="password" authorities="ROLE_ADMIN"/>
      <user name="joe" password="password" authorities="ROLE_ADMIN"/>
    </user-service>
  </authentication-provider>
</authentication-manager>
The authorities value is intended to define the user's role in Serengeti, but it is not used in M2, so any value is acceptable here.
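Adding a user can also be scripted. A minimal sketch using Python's ElementTree, assuming the simplified, namespace-free element names shown in the sample above; the real spring-security-context.xml may declare an XML namespace, in which case the tag lookups must be qualified accordingly:

```python
import xml.etree.ElementTree as ET

# Simplified sample matching the structure shown above (an assumption;
# the deployed file may carry namespaces and extra elements).
SAMPLE = """<authentication-manager alias="authenticationManager">
  <authentication-provider>
    <user-service>
      <user name="serengeti" password="password" authorities="ROLE_ADMIN"/>
    </user-service>
  </authentication-provider>
</authentication-manager>"""

def add_user(xml_text, name, password, authorities="ROLE_ADMIN"):
    """Insert a <user> element into the user-service element and return
    the modified XML as a string."""
    root = ET.fromstring(xml_text)
    service = root.find(".//user-service")
    ET.SubElement(service, "user", name=name, password=password,
                  authorities=authorities)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(add_user(SAMPLE, "joe", "password"))
```

After writing the result back to the file, restart Tomcat as described above for the change to take effect.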
5.1.2 Modify User Password
Modify the password value in the user-service element of the same file. Following is a sample:
<authentication-manager alias="authenticationManager">
  <authentication-provider>
    <user-service>
      <user name="serengeti" password="password" authorities="ROLE_ADMIN"/>
      <user name="joe" password="welcome1" authorities="ROLE_ADMIN"/>
    </user-service>
  </authentication-provider>
</authentication-manager>
5.2 Manage Resources in Serengeti
When deploying the Serengeti OVA, the VI admin might allow you to use the same resources that the
Serengeti virtual appliance is using. You can also add more resources to Serengeti for your Hadoop
clusters. You can list the resources in Serengeti and delete them when they are no longer needed.
You must add a resource pool, datastore, and network before deploying a Hadoop cluster if the VI
admin does not allow you to deploy Hadoop clusters in the same set of resources as the Serengeti
server.
5.2.1 Add a Datastore
You can use the "datastore add" command to add a vSphere datastore to Serengeti.
serengeti>datastore add --name myLocalDS --spec local* --type LOCAL
In this example, "myLocalDS" is the name you use when creating the Hadoop cluster.
"local*" is a wildcard specifying a set of datastores. All datastores whose names start with "local" will be
added and managed as a whole.
"LOCAL" specifies that the datastores are local storage.
In this version, Serengeti does not check if the datastore really exists. If you use a nonexistent
datastore, cluster creation will fail.
5.2.2 Add a Network
You can use the "network add" command to add a network to Serengeti. A network is a port group plus a
way to obtain IP addresses on that port group.
serengeti>network add --name myNW --portGroup 10GPG --dhcp
In this example, "myNW" is the name you use when creating the Hadoop cluster.
"10GPG" is the name of the port group created by the VI admin in vSphere.
Virtual machines using this network will use DHCP to obtain IP addresses.
You can also add networks using a static IP.
serengeti>network add --name myNW --portGroup 10GPG --ip 192.168.1.2-100 --dns 10.111.90.2 --gateway 192.168.1.1 --mask 255.255.255.0
In this example, “192.168.1.2-100” is the IP address range Hadoop nodes can use.
“10.111.90.2” is the DNS server IP.
“192.168.1.1” is the gateway.
“255.255.255.0” is the subnet mask.
In this version, Serengeti does not check whether the added network is valid. If you use a wrong
network, cluster creation will fail.
5.2.3 Add a Resource Pool
You can use the "resourcepool add" command to add a vSphere resource pool to Serengeti.
serengeti>resourcepool add --name myRP --vccluster cluster1 --vcrp rp1
In this example, "myRP" is the name you use when creating the Hadoop cluster.
"cluster1" is the vSphere cluster name and "rp1" is the vSphere resource pool name.
In this version, Serengeti does not check if the resource pool really exists. If you use a
nonexistent resource pool, cluster creation will fail.
vSphere nested resource pools are not supported in the current version. The resource pool must be
located directly under a cluster.
5.2.4 View Datastores
In the Serengeti shell, you can list datastores added to Serengeti.
serengeti>datastore list
You can see details of datastores.
serengeti> datastore list --detail
You can specify which datastore to list.
serengeti> datastore list --name myDS --detail
5.2.5 View Networks
In the Serengeti shell, you can list networks added to Serengeti.
serengeti>network list
You can see details of networks.
serengeti> network list --detail
You can specify which network to list.
serengeti> network list --name myNW --detail
5.2.6 View Resource Pools
In the Serengeti shell, you can list resource pools added to Serengeti.
serengeti>resourcepool list
You can see details of resource pools.
serengeti>resourcepool list --detail
You can specify which resource pool to list.
serengeti>resourcepool list --name myRP --detail
5.2.7 Remove a Datastore
You can use the “datastore delete” command to remove a datastore from Serengeti.
serengeti>datastore delete --name myDS
In this example, “myDS” is the name you specified when you added the datastore.
You cannot remove a datastore from Serengeti if it is referenced by a Hadoop cluster.
5.2.8 Remove a Network
You can use the “network delete” command to remove a network from Serengeti.
serengeti>network delete --name myNW
In this example, “myNW” is the name you specified when you added the network.
You cannot remove a network from Serengeti if it is referenced by a Hadoop cluster.
You can use “network list” command to see which cluster is referencing the network.
5.2.9 Remove a Resource Pool
You can use the "resourcepool delete" command to remove a resource pool from Serengeti.
serengeti>resourcepool delete --name myRP
In this example, “myRP” is the name you specified when you added the resource pool.
You cannot remove a resource pool from Serengeti if the resource pool is referenced by a
Hadoop cluster.
5.3 Manage Distros
5.3.1 Supported Distros
The Serengeti Management Server includes Apache Hadoop 1.0.1, but you can use your preferred
Hadoop distro as well. Greenplum HD 1, CDH3, CDH4 (YARN is not supported at this moment), HDP 1,
and MapR M5 are also supported. Serengeti supports deployment of Hadoop clusters and Pig and Hive instances.
5.3.2 Add a Distro to Serengeti
Serengeti uses tar balls or a yum repository to deploy Hadoop clusters for different Hadoop distributions.
5.3.2.1 Using tar balls to deploy a Hadoop cluster
Serengeti uses tar balls to deploy the following Hadoop distros:
Apache Hadoop 1.0.x
Greenplum HD 1
CDH3
HDP 1
1. Download the three packages (hadoop/pig/hive) in tar ball format from the distro vendor's site.
2. Upload them to the Serengeti Management Server virtual machine.
3. Put the packages in "/opt/serengeti/www/distros/". The hierarchy should be
DISTRO_NAME/VERSION_NUMBER/TARBALLS. For example, place the Apache Hadoop distro as follows:
apache/
  1.0.1/
    hadoop-1.0.1.tar.gz
    hive-0.8.1.tar.gz
    pig-0.9.2.tar.gz
4. Edit "/opt/serengeti/www/distros/manifest" in the Serengeti Management Server virtual machine to
add the mapping between Hadoop roles and the distro's tar ball packages, as in the following
example. Add JSON text to the manifest file:
{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_namenode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
},
In this example, the CDH tar balls are placed in the directory /opt/serengeti/www/distros/cdh/3u3. Note that if a distro supports HVE, add "hveSupported" : "true" after the version line in the example above.
5. Restart the Tomcat server in the Serengeti Management Server to make the Serengeti Management
Server read the new manifest file.
$ sudo service tomcat restart
If the commands are successful, issuing the "distro list" command in the Serengeti shell shows the
distro that you added. Otherwise, make sure the JSON text in the manifest is correct.
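Because a malformed manifest is the usual reason a new distro fails to appear in "distro list", it can help to sanity-check an entry before restarting Tomcat. A rough validation sketch; the required keys are inferred from the samples in this guide, not from a published schema:

```python
import json

def check_distro_entry(entry):
    """Return a list of problems found in one manifest distro entry;
    an empty list means the entry looks structurally sound."""
    problems = []
    for key in ("name", "version", "packages"):
        if key not in entry:
            problems.append("missing key: %s" % key)
    for i, pkg in enumerate(entry.get("packages", [])):
        if "roles" not in pkg:
            problems.append("package %d: missing roles" % i)
        # Tar-ball distros use "tarball"; yum-based distros use "package_repos".
        if "tarball" not in pkg and "package_repos" not in pkg:
            problems.append("package %d: needs tarball or package_repos" % i)
    return problems

if __name__ == "__main__":
    entry = json.loads("""{
      "name": "cdh", "version": "3u3",
      "packages": [{"roles": ["hadoop_namenode"],
                    "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"}]
    }""")
    print(check_distro_entry(entry))
```

Running json.loads on your edited manifest fragment also catches plain syntax errors (missing commas, unbalanced braces) before the server ever reads it.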
5.3.2.2 Using a yum repository to deploy a Hadoop cluster
Serengeti uses a yum repository to deploy the following Hadoop distros:
CDH4
MapR M5
1. Open the sample manifest file "/opt/serengeti/www/distros/manifest.sample" in the Serengeti
Management Server virtual machine; you will see the following distro configuration for MapR and
CDH4:
{
  "name" : "mapr",
  "vendor" : "MAPR",
  "version" : "2.1.1",
  "packages" : [
    {
      "roles" : ["mapr_zookeeper", "mapr_cldb", "mapr_jobtracker", "mapr_tasktracker", "mapr_fileserver", "mapr_nfs", "mapr_webserver", "mapr_metrics", "mapr_client", "mapr_pig", "mapr_hive", "mapr_hive_server", "mapr_mysql_server"],
      "package_repos" : ["http://<ip_of_serengeti_server>/mapr/2/mapr-m5.repo"]
    }
  ]
},
{
  "name" : "cdh4",
  "vendor" : "CDH",
  "version" : "4.1.2",
  "packages" : [
    {
      "roles" : ["hadoop_namenode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_journalnode", "hadoop_client", "hive", "hive_server", "pig", "hbase_master", "hbase_regionserver", "hbase_client", "zookeeper"],
      "package_repos" : ["http://<ip_of_serengeti_server>/cdh/4/cloudera-cdh4.repo"]
    }
  ]
}
The two yum repo files (mapr-m5.repo and cloudera-cdh4.repo) point to the official yum repositories of MapR and CDH4 on the Internet. You can copy the sample file "/opt/serengeti/www/distros/manifest.sample" to "/opt/serengeti/www/distros/manifest". When you create a MapR or CDH4 cluster, the Hadoop nodes download RPM packages from the MapR or CDH4 official yum repository on the Internet. If the VMs in a cluster created by the Serengeti Management Server do not have access to the Internet, or the bandwidth to the Internet is low, we strongly suggest creating a local yum repository for MapR and CDH4. Please read Appendix A: Create Local Yum Repository for MapR and Appendix B: Create Local Yum Repository for CDH4 to create a yum repository.
2. Configure the local yum repository URL in the manifest file.
Once the local yum repository for MapR/CDH4 is created, open /opt/serengeti/www/distros/manifest
and add the distro configuration (use the sample in the previous step and change the
"package_repos" attribute to the URL of the local yum repository file).
3. Restart the Tomcat server in the Serengeti Management Server to make the Serengeti Management
Server read the new manifest file.
$ sudo service tomcat restart
If the commands are successful, issuing the "distro list" command in the Serengeti shell shows the
distro that you added. Otherwise, make sure the JSON text in the manifest is correct.
5.3.3 List Distros
You can use the "distro list" command to see available distros.
serengeti> distro list
You can see the packages in each distro and verify that it includes the services you want to deploy.
5.3.4 Using a Distro
You can choose which distro you use when deploying a cluster.
serengeti>cluster create --name myHadoop --distro cdh
5.4 Hadoop Clusters
5.4.1 Deploy Hadoop Clusters
5.4.1.1 Deploy a Customized Hadoop Cluster
You can customize the number of nodes, the size of virtual machines, and other settings when you create a cluster.
In the Serengeti Management Server, you can find sample specs in /opt/serengeti/samples/. If you are using the Serengeti CLI from your desktop, you can find the sample specs in the client folder.
1. Edit a cluster spec file.
For example:
{
  "nodeGroups" : [
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "instanceType": "MEDIUM"
    },
    {
      "name": "worker",
      "roles": ["hadoop_datanode", "hadoop_tasktracker"],
      "instanceNum": 5,
      "instanceType": "SMALL"
    },
    {
      "name": "client",
      "roles": ["hadoop_client", "hive", "hive_server", "pig"],
      "instanceNum": 1,
      "instanceType": "SMALL"
    }
  ]
}
In this example, you get one MEDIUM-size master virtual machine, five SMALL-size worker virtual
machines, and one SMALL-size client virtual machine. You can also specify the number of CPUs, RAM,
disk size, and so on for each node group.
2. Specify the spec file when creating the cluster. You need to use the full path to the file.
serengeti>cluster create --name myHadoop --specFile /home/serengeti/mySpec.txt
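Before handing a spec file to "cluster create", it is easy to total up what it will provision. A small sketch that assumes only the fields shown in the sample spec above:

```python
import json

def summarize_spec(spec_text):
    """Summarize a cluster spec: total VM count plus per-group instance
    counts, assuming the nodeGroups/instanceNum layout shown above."""
    spec = json.loads(spec_text)
    groups = {g["name"]: g.get("instanceNum", 1) for g in spec["nodeGroups"]}
    return {"total": sum(groups.values()), "groups": groups}

if __name__ == "__main__":
    sample = """{"nodeGroups": [
      {"name": "master", "roles": ["hadoop_namenode", "hadoop_jobtracker"],
       "instanceNum": 1, "instanceType": "MEDIUM"},
      {"name": "worker", "roles": ["hadoop_datanode", "hadoop_tasktracker"],
       "instanceNum": 5, "instanceType": "SMALL"},
      {"name": "client", "roles": ["hadoop_client", "hive", "pig"],
       "instanceNum": 1, "instanceType": "SMALL"}]}"""
    print(summarize_spec(sample))  # 7 VMs in total for this sample spec
```

A json.loads failure here also flags spec files that Serengeti would reject for bad syntax.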
CAUTION
Changing the roles of node groups might make the deployed Hadoop cluster unusable.
Deploy a CDH4 Hadoop Cluster
You can create a default CDH4 Hadoop cluster by executing the following command in the Serengeti CLI:
serengeti>cluster create --name mycdh --distro cdh4
You can also create a customized CDH4 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mycdh --distro cdh4 --specFile /opt/serengeti/samples/default_cdh4_ha_hadoop_cluster.json
/opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json is a sample spec file for
CDH4. You can make a copy of it and modify the parameters in the file before creating the cluster. In this
example, nameservice0 and nameservice1 are federated with each other, and the name nodes in the
nameservice0 node group (with instanceNum set to 2) have HDFS2 HA enabled. In Serengeti, the name
node group names become the name service names of HDFS2.
5.4.1.1.1 Deploy a MapR Hadoop Cluster
You can create a default MapR M5 Hadoop cluster by executing the following command in Serengeti CLI:
serengeti>cluster create --name mymapr --distro mapr
You can also create a customized MapR M5 Hadoop cluster with a cluster spec file:
serengeti>cluster create --name mymapr --distro mapr --specFile /opt/serengeti/samples/default_mapr_cluster.json
/opt/serengeti/samples/default_mapr_cluster.json is a sample spec file for MapR. You can make a copy of it and modify the parameters in the file before creating the cluster.
5.4.1.2 Separating Data and Compute nodes
You can separate data and compute nodes in a cluster and apply finer control of node placement
among ESX hosts. For example, you can use Serengeti to deploy the following clusters:
1. A data and compute separated cluster, without any node placement constraints.
{
  "nodeGroups":[
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "LOCAL", "sizeGB": 50 }
    },
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": { "type": "LOCAL", "sizeGB": 20 }
    },
    {
      "name": "client",
      "roles": ["hadoop_client", "hive", "pig"],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": { "type": "LOCAL", "sizeGB": 50 }
    }
  ],
  "configuration": { }
}
In this example, four data nodes and eight compute nodes are created, each in its own VM. By default, Serengeti uses a round-robin algorithm to distribute the VMs evenly across ESX hosts.
2. A data-compute separated cluster, with an instancePerHost constraint.
{
  "nodeGroups":[
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "LOCAL", "sizeGB": 50 },
      "placementPolicies": { "instancePerHost": 1 }
    },
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": { "type": "LOCAL", "sizeGB": 20 },
      "placementPolicies": { "instancePerHost": 2 }
    },
    {
      "name": "client",
      "roles": ["hadoop_client", "hive", "pig"],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": { "type": "LOCAL", "sizeGB": 50 }
    }
  ],
  "configuration": { }
}
In this example, the data and compute node groups have placement policy constraints. After a successful provision, four data nodes and eight compute nodes are created, each in its own VM. With the "instancePerHost": 1 constraint, the four data nodes are placed on four ESX hosts. The eight compute nodes are placed on four ESX hosts as well, two nodes on each. Note that it is not guaranteed that two compute nodes stay collocated with a data node on each of the four ESX hosts. To ensure that this is the case, create a VM-VM affinity rule between each host's compute nodes and data node, or disable DRS on the compute nodes.
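The arithmetic of the "instancePerHost" constraint can be illustrated with a toy placement routine that fills hosts round-robin and never exceeds the per-host limit. This is only an illustration of the constraint, not Serengeti's actual placement engine:

```python
def place(instance_num, instance_per_host, hosts):
    """Assign instance_num nodes across hosts in round-robin order,
    allowing at most instance_per_host nodes per host. Raises if the
    hosts cannot satisfy the constraint."""
    if instance_num > instance_per_host * len(hosts):
        raise ValueError("not enough hosts for the placement constraint")
    placement = {h: 0 for h in hosts}
    for i in range(instance_num):
        placement[hosts[i % len(hosts)]] += 1
    return placement

if __name__ == "__main__":
    hosts = ["esx1", "esx2", "esx3", "esx4"]
    print(place(4, 1, hosts))  # one data node per host
    print(place(8, 2, hosts))  # two compute nodes per host
```

Note that with only three hosts, place(4, 1, hosts) raises, which mirrors why a cluster create with too few ESX hosts for its constraints fails.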
3. A data-compute separated cluster, with instancePerHost and groupAssociations constraints for the
compute node group and a groupRacks constraint for the data node group.
{
  "nodeGroups":[
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "LOCAL", "sizeGB": 50 },
      "placementPolicies": {
        "instancePerHost": 1,
        "groupRacks": { "type": "ROUNDROBIN", "racks": ["rack1", "rack2", "rack3"] }
      }
    },
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": { "type": "LOCAL", "sizeGB": 20 },
      "placementPolicies": {
        "instancePerHost": 2,
        "groupAssociations": [ { "reference": "data", "type": "STRICT" } ]
      }
    },
    {
      "name": "client",
      "roles": ["hadoop_client", "hive", "pig"],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": { "type": "LOCAL", "sizeGB": 50 }
    }
  ],
  "configuration": { }
}
In this example, after a successful provision, the four data nodes and eight compute nodes are placed on exactly the same four ESX hosts: each ESX host has one data node and two compute nodes, and these four ESX hosts are selected fairly from "rack1", "rack2", and "rack3". As the definition of the "compute" node group specifies, the placement of compute nodes strictly follows the placement of the "data" node group; that is, compute nodes are placed only on ESX hosts that have data nodes.
5.4.1.3 Deploy a Compute Only Cluster
You can create a compute only cluster that refers to an existing HDFS cluster with the following steps:
1. Edit a cluster spec file and save it, for example, as /home/serengeti/coSpec.txt.
For example:
{
  "externalHDFS": "hdfs://hostname-of-namenode:8020",
  "nodeGroups": [
    {
      "name": "master",
      "roles": ["hadoop_jobtracker"],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "worker",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 4,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": { "type": "LOCAL", "sizeGB": 20 }
    },
    {
      "name": "client",
      "roles": ["hadoop_client", "hive", "pig"],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": { "type": "LOCAL", "sizeGB": 50 }
    }
  ],
  "configuration": { }
}
In this example, the externalHDFS field points to an existing HDFS. You should also specify node
groups with the hadoop_jobtracker and hadoop_tasktracker roles. Note that the externalHDFS field
conflicts with node groups that have the hadoop_namenode and hadoop_datanode roles. The sample
cluster spec can also be found in samples/compute_only_cluster.json in the Serengeti CLI directory.
2. Specify the spec file when creating the cluster. You need to use the full path to the file.
serengeti>cluster create --name computeOnlyCluster --specFile /home/serengeti/coSpec.txt
5.4.1.4 Control Hadoop Virtual Machine Placement
Serengeti provides a way for users to control how Hadoop virtual machines are placed. Generally, this is done by specifying the "placementPolicies" field inside a node group, for example:
{
  "nodeGroups":[
    …
    {
      "name": "group_name",
      …
      "placementPolicies": {
        "instancePerHost": 2,
        "groupRacks": {
          "type": "ROUNDROBIN",
          "racks": ["rack1", "rack2", "rack3"]
        },
        "groupAssociations": [{
          "reference": "another_group_name",
          "type": "STRICT" // or "WEAK"
        }]
      }
    },
    …
  ]
}
As this example shows, the "placementPolicies" field contains three optional items: "instancePerHost", "groupRacks", and "groupAssociations".
As the name implies, "instancePerHost" indicates how many VM nodes or instances should be placed on each physical ESX host; this constraint is aimed at balancing the workload.
The "groupRacks" item controls how VM nodes are distributed across the racks you specify. In this example, the rack type equals "ROUNDROBIN", and the "racks" item indicates which racks in the topology map (refer to chapter 5.8 to see how to configure topology map information and enable rack awareness for a Hadoop cluster) are used for this placement policy. If the "racks" item is omitted, Serengeti uses all racks in the topology map. "ROUNDROBIN" means that the candidates are selected fairly when determining which rack is used for each node.
Note that if you specify both "instancePerHost" and "groupRacks" in a placement policy, you should make sure that enough hosts are available. You can get the rack-host information with the "topology list" command.
"groupAssociations" means that the node group is associated with target node groups; each association has "reference" and "type" fields. The "reference" field is the name of a target node group, and "type" can be "STRICT" or "WEAK". "STRICT" means the node group must be placed on the same set, or a subset, of the ESX hosts used by the target group, while "WEAK" means the node group tries to be placed on the same set or subset of ESX hosts as the target group, but with no guarantee.
A typical scenario for applying "groupRacks" and "groupAssociations" is deploying a Hadoop cluster with data and compute nodes separated. In this case, you might want to put compute nodes and data nodes on the same set of physical hosts for better performance, especially throughput. Refer to section 5.4.1.2 for practical examples of deploying a Hadoop cluster with placement policies.
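The "ROUNDROBIN" rack selection can be pictured as cycling through the rack list and handing out racks fairly, one per node. Again an illustration of the idea, not Serengeti's real algorithm:

```python
from itertools import cycle

def pick_racks(racks, node_count):
    """Select a rack for each node by cycling through the rack list,
    so nodes spread fairly across the racks."""
    rack_cycle = cycle(racks)
    return [next(rack_cycle) for _ in range(node_count)]

if __name__ == "__main__":
    # Four nodes over the three racks from the example above:
    print(pick_racks(["rack1", "rack2", "rack3"], 4))
```

With four nodes and three racks, one rack necessarily receives two nodes; "fairly" here means no rack is skipped while another gets ahead by more than one.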
5.4.1.5 Use NFS as Compute Nodes’ Local Directory
Serengeti allows users to specify NFS storage for compute nodes. There are several benefits: 1) it increases the capacity of each compute node; 2) storage resources are returned when compute nodes are stopped. Here is an example showing how to deploy a cluster whose compute nodes have only NFS storage:
{
  "nodeGroups":[
    {
      "name": "master",
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instanceNum": 1,
      "instanceType": "LARGE",
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "haFlag": "on"
    },
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "LOCAL", "sizeGB": 50 },
      "placementPolicies": { "instancePerHost": 1 }
    },
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 8,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "TEMPFS" },
      "placementPolicies": {
        "instancePerHost": 2,
        "groupAssociations": [ { "reference": "data", "type": "STRICT" } ]
      }
    },
    {
      "name": "client",
      "roles": ["hadoop_client", "hive", "hive_server", "pig"],
      "instanceNum": 1,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": { "type": "LOCAL", "sizeGB": 50 }
    }
  ]
}
In this example, the cluster has data and compute nodes separated, and the compute nodes are strictly associated with the data nodes. By setting the "storage" field of the compute node group to "type": "TEMPFS", Serengeti installs an NFS server on the associated data nodes, installs an NFS client on the compute nodes, and mounts the data nodes' disks on the compute nodes. Serengeti does not assign disks to compute nodes, and all temporary files generated while running MapReduce jobs are saved on the NFS disks.
5.4.2 Manage Hadoop Clusters
5.4.2.1 Modify Hadoop Configuration
Serengeti provides a simple way to tune the Hadoop cluster configuration, including attributes in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, log4j.properties, fair-scheduler.xml, capacity-scheduler.xml, and so on.
In addition to modifying the Hadoop configuration of an existing cluster created by Serengeti, you can also define the Hadoop configuration in the cluster spec file when creating a new cluster.
5.4.2.1.1 Cluster Level Configuration
You can modify the Hadoop configuration of an existing cluster by following the steps below:
1. Export the cluster spec file of the cluster:
serengeti>cluster export --spec --name myHadoop --output /home/serengeti/myHadoop.json
2. Modify the 'configuration' section at the bottom of /home/serengeti/myHadoop.json to match the following content, and add your customized Hadoop configuration in this 'configuration' section:
…
"configuration": {
  "hadoop": {
    "core-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
      // note: any value (int, float, boolean, string) must be enclosed in double quotes, for example:
      // "io.file.buffer.size": "4096"
    },
    "hdfs-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
    },
    "mapred-site.xml": {
      // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
    },
    "hadoop-env.sh": {
      // "HADOOP_HEAPSIZE": "",
      // "HADOOP_NAMENODE_OPTS": "",
      // "HADOOP_DATANODE_OPTS": "",
      // "HADOOP_SECONDARYNAMENODE_OPTS": "",
      // "HADOOP_JOBTRACKER_OPTS": "",
      // "HADOOP_TASKTRACKER_OPTS": "",
      // "HADOOP_CLASSPATH": "",
      // "JAVA_HOME": "",
      // "PATH": ""
    },
    "log4j.properties": {
      // "hadoop.root.logger": "DEBUG,DRFA",
      // "hadoop.security.logger": "DEBUG,DRFA"
    },
    "fair-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
      // "text": "the full content of fair-scheduler.xml in one line"
    },
    "capacity-scheduler.xml": {
      // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
    }
  }
}
…
Serengeti provides a tool to convert the Hadoop configuration files of an existing cluster into the above JSON format, so you do not need to write this JSON file manually. See the section 'Tool for converting Hadoop Configuration'.
Some Hadoop distributions ship their own Java jar files outside $HADOOP_HOME/lib, so by default the Hadoop daemons cannot find them. To use these jars, add a cluster configuration that includes the full path of the jar files in $HADOOP_CLASSPATH.
Here is a sample cluster configuration for a Cloudera CDH3 Hadoop cluster with the Fair Scheduler (whose jar files are located in /usr/lib/hadoop/contrib/fairscheduler/):
…
"configuration": {
  "hadoop": {
    "hadoop-env.sh": {
      "HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH"
    },
    "mapred-site.xml": {
      "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"
      …
    },
    "fair-scheduler.xml": {
      …
    }
  }
}
…
3. Run the 'cluster config' command to apply the new Hadoop configuration:
serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
4. If you want to reset an existing configuration attribute to the Hadoop default value, simply remove it or comment it out with '//' in the 'configuration' section of the cluster spec file, and run the 'cluster config' command again.
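Step 4 can be scripted. The sketch below is a small helper of our own, not a tool shipped with Serengeti; it comments out a named attribute in an exported spec file with sed, and it assumes the pretty-printed one-attribute-per-line layout shown above. The attribute name and file path in the usage note are only examples.

```shell
#!/bin/sh
# Hypothetical helper (not part of Serengeti): comment out one Hadoop
# attribute in an exported cluster spec so that a later "cluster config"
# run resets it to the Hadoop default. Assumes one attribute per line.
reset_to_default() {   # usage: reset_to_default <attribute> <spec-file>
  sed -i "s|^\\( *\\)\"$1\"|\\1// \"$1\"|" "$2"
}
```

For example, `reset_to_default io.file.buffer.size /home/serengeti/myHadoop.json` followed by a 'cluster config' run would restore that attribute's Hadoop default.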
5.4.2.1.2 Group Level Configuration
You can also modify the Hadoop configuration within a node group of an existing cluster by following the steps below:
1. Export the cluster spec file of the cluster:
serengeti>cluster export --spec --name myHadoop --output /home/serengeti/myHadoop.json
2. Modify the 'configuration' section within the node group in /home/serengeti/myHadoop.json with the same content as in 'Cluster Level Configuration', and add the customized Hadoop configuration for this node group.
A Hadoop configuration attribute set at the group level overrides the attribute with the same name set at the cluster level.
3. Run the 'cluster config' command to apply the new Hadoop configuration:
serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
5.4.2.1.3 Black List and White List in Hadoop Configuration
Almost all configuration attributes provided in Apache Hadoop are configurable in Serengeti; these attributes belong to the White List. A few attributes are not configurable in Serengeti; these belong to the Black List.
If you set an attribute in the cluster spec file that is in the Black List or not in the White List, and then run the 'cluster config' command, Serengeti detects the attribute and gives a warning; answer 'yes' to continue or 'no' to abort.
Usually you do not need to configure 'fs.default.name' or 'dfs.http.address' if there is a NameNode or JobTracker in your cluster, because Serengeti configures these two attributes automatically. For example, when you create a default cluster in Serengeti, it contains a NameNode and a JobTracker, so you do not need to configure 'fs.default.name' and 'dfs.http.address' explicitly. However, you can set 'fs.default.name' to the URI of another NameNode if you really want to.
5.4.2.1.3.1 White List
core-site.xml
all attributes listed on http://hadoop.apache.org/common/docs/stable/core-default.html
exclude attributes defined in Black List
hdfs-site.xml
all attributes listed on http://hadoop.apache.org/common/docs/stable/hdfs-default.html
exclude attributes defined in Black List
mapred-site.xml
all attributes listed on http://hadoop.apache.org/common/docs/stable/mapred-default.html
exclude attributes defined in Black List
hadoop-env.sh
JAVA_HOME
PATH
HADOOP_CLASSPATH
HADOOP_HEAPSIZE
HADOOP_NAMENODE_OPTS
HADOOP_DATANODE_OPTS
HADOOP_SECONDARYNAMENODE_OPTS
HADOOP_JOBTRACKER_OPTS
HADOOP_TASKTRACKER_OPTS
HADOOP_LOG_DIR
log4j.properties
hadoop.root.logger
hadoop.security.logger
log4j.appender.DRFA.MaxBackupIndex
log4j.appender.RFA.MaxBackupIndex
log4j.appender.RFA.MaxFileSize
fair-scheduler.xml
text
all attributes described on http://hadoop.apache.org/docs/stable/fair_scheduler.html , which can be put inside the 'text' field
exclude attributes defined in Black List
capacity-scheduler.xml
all attributes described on http://hadoop.apache.org/docs/stable/capacity_scheduler.html
exclude attributes defined in Black List
5.4.2.1.3.2 Black List
core-site.xml
net.topology.impl
net.topology.nodegroup.aware
dfs.block.replicator.classname
hdfs-site.xml
dfs.http.address
dfs.name.dir
dfs.data.dir
topology.script.file.name
mapred-site.xml
mapred.job.tracker
mapred.local.dir
mapred.task.cache.levels
mapred.jobtracker.jobSchedulable
mapred.jobtracker.nodegroup.awareness
hadoop-env.sh
HADOOP_HOME
HADOOP_COMMON_HOME
HADOOP_MAPRED_HOME
HADOOP_HDFS_HOME
HADOOP_CONF_DIR
HADOOP_PID_DIR
log4j.properties
None
fair-scheduler.xml
None
capacity-scheduler.xml
None
mapred-queue-acls.xml
None
5.4.2.1.4 Tool for converting Hadoop Configuration
If you have a lot of Hadoop configuration in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, log4j.properties, fair-scheduler.xml, capacity-scheduler.xml, mapred-queue-acls.xml, and so on for your existing Hadoop cluster, you can use a tool provided by Serengeti to convert these Hadoop XML configuration files into the JSON format used by Serengeti.
1) Copy the directory $HADOOP_HOME/conf/ in your existing Hadoop cluster to the Serengeti Server.
2) Execute 'convert-hadoop-conf.rb /path/to/hadoop_conf/' in a bash shell; it prints all the converted Hadoop configuration attributes in JSON format.
3) Open the cluster spec file and replace the Cluster Level Configuration or Group Level Configuration with the content printed in step 2.
4) Execute 'cluster config --name … --specFile …' to apply the new configuration to the existing cluster, or execute 'cluster create --name … --specFile …' to create a new cluster with your configuration.
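If you are curious what the converter does, the fragment below is a rough shell approximation of the name/value extraction that convert-hadoop-conf.rb performs on a *-site.xml file. It is illustrative only, written for this guide rather than shipped with Serengeti; it assumes one XML tag per line, as in Hadoop's stock configuration files, and is no substitute for the real tool.

```shell
#!/bin/sh
# Illustrative approximation of convert-hadoop-conf.rb for one *-site.xml:
# print each <name>/<value> property pair as a JSON attribute line.
# Assumes one tag per line; the real Ruby tool handles general XML.
xml_to_json_attrs() {
  awk '
    /<name>/  { gsub(/.*<name>|<\/name>.*/, "");   n = $0 }
    /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); printf "  \"%s\": \"%s\",\n", n, $0 }
  ' "$1"
}
```

For example, a core-site.xml property io.file.buffer.size with value 4096 prints as "io.file.buffer.size": "4096", ready to paste into the "core-site.xml" section of the spec.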
5.4.2.2 Scale Out a Hadoop Cluster
You can scale out a cluster to add more Hadoop worker nodes or client nodes after the Hadoop cluster is provisioned. In the following example, the number of instances in the "worker" node group of the "myHadoop" cluster is increased to 10.
serengeti>cluster resize --name myHadoop --nodeGroup worker --instanceNum 10
You cannot set a number smaller than the current instance number in this version of the Serengeti virtual appliance.
5.4.2.3 Scale TaskTracker Nodes Rapidly
You can change the number of active TaskTracker nodes rapidly in a running Hadoop cluster or node
group. The selection of TaskTrackers to be enabled or disabled is done with the goal of balancing the
number of TaskTrackers enabled per host in the specified Hadoop cluster or node group.
In this example, the number of active TaskTracker nodes in “worker” node group in “myHadoop” cluster is
set to 8:
serengeti>cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8
If fewer than 8 TaskTracker nodes were running in the “worker” node group of “myHadoop” cluster,
additional TaskTracker nodes are enabled (re-commissioned and powered-on), up to the number
provisioned in the “worker” node group. If more than 8 TaskTrackers were running in the “worker” node
group, excess TaskTracker nodes are disabled (decommissioned and powered-off). No action is
performed if the number of active TaskTrackers already equals 8.
If the node group is not specified, the TaskTracker nodes are enabled/disabled such that the total number
of active TaskTrackers is 8 across all the compute node groups in the “myHadoop” cluster:
serengeti>cluster limit --name myHadoop --activeComputeNodeNum 8
To enable all the TaskTrackers in the “myHadoop” cluster, use the “cluster unlimit” command:
serengeti>cluster unlimit --name myHadoop
This command is especially useful to fix any potential mismatch between the number of active TaskTrackers as seen by Hadoop and the number of powered-on TaskTracker nodes as seen by vCenter.
To enable all TaskTrackers within only one compute node group, specify the name of the node group using the “--nodeGroup” option, similar to the “cluster limit” command.
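The balancing goal described above can be illustrated with a little arithmetic: for a target number of active TaskTrackers and a number of usable hosts, an even spread gives each host target/hosts (rounded down) TaskTrackers, with the remainder distributed one per host. The sketch below only illustrates that goal with made-up numbers; it is not Serengeti's actual selection code.

```shell
#!/bin/sh
# Illustration of the per-host balancing goal of "cluster limit" (not the
# actual Serengeti algorithm). Host count and target are made-up examples.
balance() {   # usage: balance <target-active-tasktrackers> <host-count>
  target=$1; hosts=$2
  base=$((target / hosts))    # every host gets at least this many
  extra=$((target % hosts))   # the first $extra hosts get one more
  i=1
  while [ "$i" -le "$hosts" ]; do
    if [ "$i" -le "$extra" ]; then n=$((base + 1)); else n=$base; fi
    echo "host$i: $n active TaskTrackers"
    i=$((i + 1))
  done
}
balance 8 3
```

With a target of 8 active TaskTrackers on 3 hosts, two hosts end up with 3 and one with 2, which matches the "balanced per host" goal stated above.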
5.4.2.4 Start/Stop Hadoop Cluster
In the Serengeti shell, you can start (or stop) a whole Hadoop cluster:
serengeti>cluster start --name mycluster
5.4.2.5 View Hadoop Clusters Deployed by Serengeti
In the Serengeti shell, you can list Hadoop clusters deployed by Serengeti.
serengeti>cluster list
You can specify which cluster to list.
serengeti>cluster list --name mycluster
You can see details of Hadoop clusters.
serengeti>cluster list --detail
5.4.2.6 Login to Hadoop Nodes
You can log in to the Hadoop nodes, including master, worker, and client nodes, with password-less SSH from the Serengeti Management Server, using SSH client tools such as ssh, pdsh, ClusterSSH, and mussh, to troubleshoot or run your own management automation scripts.
The Serengeti Management Server is configured to SSH to the Hadoop cluster nodes without a password. Other clients or machines can SSH to the Hadoop cluster nodes with a user name and password.
All deployed nodes are protected by random passwords. If you want to log in to a Hadoop node directly, log in to it from the vSphere client and change the password by following step 11 in Section 3.2. Press "Ctrl + D" to display the login information with the original random password.
5.4.2.7 Delete a Hadoop Cluster
You can delete a Hadoop cluster that you no longer need.
serengeti>cluster delete --name myHadoop
In this example, “myHadoop” is the name of the Hadoop cluster you want to delete.
When a Hadoop cluster is deleted, all virtual machines in the cluster are destroyed.
You can delete a Hadoop cluster even while it is running.
5.4.3 Use Hadoop Clusters
5.4.3.1 Run Pig Scripts
You can run Pig scripts in the Serengeti CLI. For example, suppose you have a Pig script in "/tmp/data.pig".
serengeti> pig cfg
serengeti> pig script --location /tmp/data.pig
5.4.3.2 Run Hive Scripts
You can run Hive scripts in the Serengeti CLI. For example, suppose you have a Hive script in "/tmp/data.hive".
serengeti>hive cfg
serengeti>hive script --location /tmp/data.hive
5.4.3.3 Run HDFS Commands
You can run HDFS commands in the Serengeti CLI. For example, suppose you have a file in "/home/serengeti/data" and want to put it in your HDFS path /tmp.
serengeti> fs put --from /home/serengeti/data --to /tmp
5.4.3.4 Run MapReduce Jobs
You can run MapReduce jobs in the Serengeti CLI. For example, suppose you have the example jar file "/opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar" and want to run pi.
serengeti> mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.PiEstimator --args "10 10"
Make sure you have first chosen a cluster as the target in the Serengeti CLI. See Chapter 7.2.10.
5.4.3.5 Using Data through JDBC
Through Hive JDBC, you can execute SQL from different programming languages, such as Java, Python, and PHP. The following is a JDBC client sample in Java.
1. SSH to the node that contains the Hive server role.
2. Create a Java file HiveJdbcClient.java containing the following sample code for connecting to the Hive server:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
public class HiveJdbcClient {
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
/**
* @param args
* @throws SQLException
**/
public static void main(String[] args) throws SQLException {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e){
// TODO Auto-generated catch block
e.printStackTrace();
System.exit(1);
}
Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
Statement stmt = con.createStatement();
String tableName = "testHiveDriverTable";
stmt.executeQuery("drop table " + tableName);
ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
if (res.next()) {
System.out.println(res.getString(1));
}
// describe table
sql = "describe " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()) {
System.out.println(res.getString(1) + "\t" + res.getString(2));
}
// load data into table
// NOTE: filepath has to be local to the hive server
// NOTE: /tmp/test_hive_server.txt is a ctrl-A separated file with two fields per line
String filepath = "/tmp/test_hive_server.txt";
sql = "load data local inpath '" + filepath + "' into table " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
// select * query
sql = "select * from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()){
System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
}
// regular hive query
sql = "select count(1) from " + tableName;
System.out.println("Running: " + sql);
res = stmt.executeQuery(sql);
while (res.next()){
System.out.println(res.getString(1));
}
}
}
3. Run the JDBC sample code.
a. Compile it on the command line:
$ javac HiveJdbcClient.java
b. Alternatively, you can run the following bash script, which seeds the data file and builds the classpath before invoking the client:
#!/bin/bash
HADOOP_HOME=/usr/lib/hadoop
HIVE_HOME=/usr/lib/hive
echo -e '1\x01foo' > /tmp/test_hive_server.txt
echo -e '2\x01bar' >> /tmp/test_hive_server.txt
HADOOP_CORE=`ls $HADOOP_HOME/hadoop-core-*.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf
for jar_file_name in ${HIVE_HOME}/lib/*.jar
do
CLASSPATH=$CLASSPATH:$jar_file_name
done
java -cp $CLASSPATH HiveJdbcClient
For more information on the Hive client, visit https://cwiki.apache.org/Hive/hiveclient.html.
5.4.3.6 Using Data through ODBC
You can use an out-of-box ODBC driver for Hadoop Hive, such as the MapR Hive ODBC Connector or the Apache Hadoop Hive ODBC Driver.
Take the MapR ODBC Connector as an example:
1. Install the MapR Hive ODBC Connector on Windows 7 Professional or Windows 2008 R2.
2. Create a Data Source Name (DSN) with the ODBC Connector's Data Source Administrator to connect to your remote Hive server.
3. Import rows of the HIVE_SYSTEM table in the Hive server into Excel by connecting to this DSN.
For more information about Hive ODBC, refer to https://cwiki.apache.org/Hive/hiveodbc.html. For more information about the MapR Hive ODBC Connector, refer to www.mapr.com/doc/display/MapR/Hive+ODBC+Connector.
5.5 HBase Clusters
5.5.1 Deploy HBase Clusters
You can customize an HBase cluster by specifying your own spec file. The following is an example:
{
  "nodeGroups" : [
    {
      "name" : "zookeeper",
      "roles" : [ "zookeeper" ],
      "instanceNum" : 3,
      "instanceType" : "SMALL",
      "storage" : { "type" : "shared", "sizeGB" : 20 },
      "cpuNum" : 1,
      "memCapacityMB" : 3748,
      "haFlag" : "on",
      "configuration" : { }
    },
    {
      "name" : "hadoopmaster",
      "roles" : [ "hadoop_namenode", "hadoop_jobtracker" ],
      "instanceNum" : 1,
      "instanceType" : "MEDIUM",
      "storage" : { "type" : "shared", "sizeGB" : 50 },
      "cpuNum" : 2,
      "memCapacityMB" : 7500,
      "haFlag" : "on",
      "configuration" : { }
    },
    {
      "name" : "hbasemaster",
      "roles" : [ "hbase_master" ],
      "instanceNum" : 1,
      "instanceType" : "MEDIUM",
      "storage" : { "type" : "shared", "sizeGB" : 50 },
      "cpuNum" : 2,
      "memCapacityMB" : 7500,
      "haFlag" : "on",
      "configuration" : { }
    },
    {
      "name" : "worker",
      "roles" : [ "hadoop_datanode", "hadoop_tasktracker", "hbase_regionserver" ],
      "instanceNum" : 3,
      "instanceType" : "SMALL",
      "storage" : { "type" : "local", "sizeGB" : 50 },
      "cpuNum" : 1,
      "memCapacityMB" : 3748,
      "haFlag" : "off",
      "configuration" : { }
    },
    {
      "name" : "client",
      "roles" : [ "hadoop_client", "hbase_client" ],
      "instanceNum" : 1,
      "instanceType" : "SMALL",
      "storage" : { "type" : "shared", "sizeGB" : 50 },
      "cpuNum" : 1,
      "memCapacityMB" : 3748,
      "haFlag" : "off",
      "configuration" : { }
    }
  ],
  // we suggest running convert-hadoop-conf.rb to generate the "configuration" section and pasting the output here
  "configuration" : {
    "hadoop": {
      "core-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html
        // note: any value (int, float, boolean, string) must be enclosed in double quotes, for example:
        // "io.file.buffer.size": "4096"
      },
      "hdfs-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html
      },
      "mapred-site.xml": {
        // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html
      },
      "hadoop-env.sh": {
        // "HADOOP_HEAPSIZE": "",
        // "HADOOP_NAMENODE_OPTS": "",
        // "HADOOP_DATANODE_OPTS": "",
        // "HADOOP_SECONDARYNAMENODE_OPTS": "",
        // "HADOOP_JOBTRACKER_OPTS": "",
        // "HADOOP_TASKTRACKER_OPTS": "",
        // "HADOOP_CLASSPATH": "",
        // "JAVA_HOME": "",
        // "PATH": ""
      },
      "log4j.properties": {
        // "hadoop.root.logger": "DEBUG,DRFA",
        // "hadoop.security.logger": "DEBUG,DRFA"
      },
      "fair-scheduler.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html
        // "text": "the full content of fair-scheduler.xml in one line"
      },
      "capacity-scheduler.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html
      },
      "mapred-queue-acls.xml": {
        // check for all settings at http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons
        // "mapred.queue.queue-name.acl-submit-job": "",
        // "mapred.queue.queue-name.acl-administer-jobs": ""
      }
    },
    "hbase": {
      "hbase-site.xml": {
        // check for all settings at http://hbase.apache.org/configuration.html#hbase.site
      },
      "hbase-env.sh": {
        // "JAVA_HOME": "",
        // "PATH": "",
        // "HBASE_CLASSPATH": "",
        // "HBASE_HEAPSIZE": "",
        // "HBASE_OPTS": "",
        // "HBASE_USE_GC_LOGFILE": "",
        // "HBASE_JMX_BASE": "",
        // "HBASE_MASTER_OPTS": "",
        // "HBASE_REGIONSERVER_OPTS": "",
        // "HBASE_THRIFT_OPTS": "",
        // "HBASE_ZOOKEEPER_OPTS": "",
        // "HBASE_REGIONSERVERS": "",
        // "HBASE_SSH_OPTS": "",
        // "HBASE_NICENESS": "",
        // "HBASE_SLAVE_SLEEP": ""
      },
      "log4j.properties": {
        // "hbase.root.logger": "DEBUG,DRFA"
      }
    },
    "zookeeper": {
      "java.env": {
        // "JVMFLAGS": "-Xmx2g"
      },
      "log4j.properties": {
        // "zookeeper.root.logger": "DEBUG,DRFA"
      }
    }
  }
}
Compared to the template mentioned in section 4.4, this example has JobTracker and TaskTracker roles, which means you can launch HBase MapReduce jobs. It separates the Hadoop NameNode and HBase Master roles. The HBase Master instances are protected by HBase's internal HA mechanism.
5.5.2 Manage HBase Clusters
An HBase cluster has a few more configurable files than a Hadoop cluster, including hbase-site.xml, hbase-env.sh, log4j.properties, and java.env for Zookeeper nodes. Refer to the official HBase site to tune your HBase clusters.
Most operations and advanced specifications for Hadoop clusters also apply to HBase clusters, such as scaling out a node group, separating data and compute nodes, and controlling placement policy, with the following exceptions:
1. Zookeeper nodes cannot be scaled out in this version;
2. You cannot deploy a compute-only cluster pointing to an HBase cluster to run HBase MapReduce jobs.
5.5.3 Use HBase Clusters
Serengeti supports most of the ways that HBase provides to access the database, including:
1. Operations through the HBase shell;
2. If the deployed HBase cluster has Hadoop JobTracker and TaskTracker roles, you can develop an HBase MapReduce job and access HBase from the client node. Here is an example:
>hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 3
3. A RESTful web service runs on the client node, listening on port 8080:
>curl -I http://<client_node_ip>:8080/status/cluster
4. The Thrift gateway is also enabled, listening on port 9090.
5.6 Monitoring Cluster Deployed by Serengeti
Serengeti creates one VM folder for each deployed Serengeti Server. The folder name is SERENGETI-vApp-<vApp name>; the vApp name is specified during Serengeti deployment.
For each cluster, two levels of folders are created under the Serengeti instance folder: the first level is the cluster name, and the second level is the node group name. A node group folder contains all nodes in that node group.
To browse the VMs and check VM status in the vCenter client, select "Inventory" > "VMs and Templates". The Serengeti folder is listed in the left panel, and you can check VM nodes by following the folder structure.
If you have installed vCOPs, you can also fetch VM-level metrics, including the cluster's health state, workload, resource allocation, hardware status, and so on. Refer to the vCOPs manual for more details.
5.7 Make Hadoop Master Node HA/FT
You can leverage vSphere HA and FT to address the SPOF problem of Hadoop.
1. Make sure you have enabled HA for the vSphere cluster where the Hadoop cluster is deployed. Refer to the relevant vSphere documentation for detailed setup steps as needed.
2. Make sure you provide shared storage for Hadoop to deploy on.
3. By default, the Hadoop master node is configured to be protected by vSphere HA.
With this done, once the master node virtual machine becomes unreachable, vSphere automatically starts a new instance on another available ESXi host to serve the Hadoop cluster.
There is a short downtime during recovery. If you want to eliminate the downtime, you can use vSphere FT to protect the master node.
Serengeti supports configuring FT for master nodes. In the cluster spec file, set "haFlag" to "ft" to enable FT protection.
...
"name": "master",
"cpuNum": 1,
"haFlag": "ft",
"storage": {
"type": "SHARED"
}
With this cluster spec, the master node of the Hadoop cluster is protected by vSphere FT. When the master becomes unreachable, vSphere switches traffic to the standby virtual machine immediately, so there is no failover downtime.
Refer to Apache Hadoop 1.0 High Availability Solution on VMware vSphere for more information.
5.8 Hadoop Topology Awareness
You can make a Hadoop cluster topology-aware by creating it with the --topology option in the CLI. Three types of topology awareness are supported: HVE, RACK_AS_RACK, and HOST_AS_RACK.
Here is an example of creating a cluster with the HVE topology.
serengeti>cluster create --name myHadoop --topology HVE --distro HVE-supported_Distro
HVE stands for Hadoop Virtualization Extensions (see footnote 2). HVE refines Hadoop's replica placement, task scheduling, and balancer policies so that Hadoop clusters implemented on virtualized infrastructure have full awareness of the topology on which they are running, enhancing their reliability and performance. For more information about HVE, refer to https://issues.apache.org/jira/browse/HADOOP-8468.
RACK_AS_RACK is the standard topology in existing Hadoop 1.0.x, where only rack and host information is exposed to Hadoop.
HOST_AS_RACK is a simplified form of RACK_AS_RACK for when all the physical hosts for Serengeti are on a single rack. In this case, each physical host is treated as a rack, to avoid worst cases where all replicas of an HDFS block are placed on a single physical host.
HVE is the recommended topology in Serengeti if the distro supports it. Otherwise, we recommend RACK_AS_RACK in multi-rack environments. HOST_AS_RACK is used only when there is a single rack for Serengeti or no rack information at all.
In addition, when you enable HVE or RACK_AS_RACK, you need to upload the rack and physical host information to Serengeti with the CLI command below before you create a topology-aware cluster.
serengeti>topology upload --fileName name_of_rack_hosts_mapping_file
Here is a sample of the rack and physical hosts mapping file.
rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com
In this sample, physical hosts a.b.foo.com and a.c.foo.com are in rack1, and c.a.foo.com is in rack2.
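When the mapping file grows, a flat host-to-rack listing is easier to check for duplicates or typos than the rack-per-line format. The helper below is a small illustration written for this guide, not a Serengeti tool, and it assumes the exact "rack: host, host" format shown above.

```shell
#!/bin/sh
# Illustrative helper (not part of Serengeti): expand a rack-to-hosts
# mapping file in the "topology upload" format into one "host rack"
# pair per line.
expand_topology() {   # usage: expand_topology <mapping-file>
  awk -F': *' '
    NF == 2 {
      n = split($2, hosts, / *, */)
      for (i = 1; i <= n; i++) print hosts[i], $1
    }
  ' "$1"
}
```

Run against the sample above, it prints one line per host, pairing each host with its rack, so a stray duplicate host stands out immediately.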
After a cluster is created with the selected topology option, you can view the allocated nodes on each rack with:
serengeti>cluster list --name cluster-name --detail
5.9 Start and Stop Serengeti Services
You can stop and start the Serengeti services to make a configuration change take effect or to recover from an abnormal situation.
Run the following command in a Linux shell to stop the Serengeti services.
$ sudo serengeti-stop-services.sh
Run the following command in a Linux shell to start the Serengeti services.
$ sudo serengeti-start-services.sh
(2) HVE is currently supported on Greenplum HD 1.2.
6. Cluster Specification Reference
A cluster specification is a JSON text file. Here is a longer example with line numbers; the same file without line numbers is attached as an appendix.
1 {
2 "nodeGroups" : [
3 {
4 "name": "master",
5 "roles": [
6 "hadoop_namenode",
7 "hadoop_jobtracker"
8 ],
9 "instanceNum": 1,
10 "instanceType": "LARGE",
11 "cpuNum": 2,
12 "memCapacityMB":4096,
13 "storage": {
14 "type": "SHARED",
15 "sizeGB": 20
16 },
17 "haFlag":"on",
18 "rpNames": [
19 "rp1"
20 ]
21 },
22 {
23 "name": "data",
24 "roles": [
25 "hadoop_datanode"
26 ],
27 "instanceNum": 3,
28 "instanceType": "MEDIUM",
29 "cpuNum": 2,
30 "memCapacityMB":2048,
31 "storage": {
32 "type": "LOCAL",
33 "sizeGB": 50
34 },
35 "placementPolicies": {
36 "instancePerHost": 1,
37 "groupRacks": {
38 "type": "ROUNDROBIN",
39 "racks": ["rack1", "rack2", "rack3"]
40 }
41 }
42 },
43 {
44 "name": "compute",
45 "roles": [
46 "hadoop_tasktracker"
47 ],
48 "instanceNum": 6,
49 "instanceType": "SMALL",
50 "cpuNum": 2,
51 "memCapacityMB":2048,
52 "storage": {
53 "type": "LOCAL",
54 "sizeGB": 10
55 },
56 "placementPolicies": {
57 "instancePerHost": 2,
58 "groupAssociations": [{
59 "reference": "data",
60 "type": "STRICT"
61 }]
62 }
63 },
64 {
65 "name": "client",
66 "roles": [
67 "hadoop_client",
68 "hive",
69 "hive_server",
70 "pig"
71 ],
72 "instanceNum": 1,
73 "instanceType": "SMALL",
74 "memCapacityMB": 2048,
75 "storage": {
76 "type": "LOCAL",
77 "sizeGB": 10,
78 "dsNames": ["ds1", "ds2"]
79 }
80 }
81 ],
82 "configuration": {
83 }
84 }
It defines 4 node groups.
Line 3 to 21 defines a node group named "master".
Line 22 to 42 defines a data node group named "data".
Line 43 to 63 defines a compute node group named "compute".
Line 64 to 80 defines a client node group.
Line 3 to 21 is an object that defines the "master" node group. Its attributes are as follows.
Line 4 defines the name of the node group. The attribute name is "name"; the value is "master".
Line 5 to 8 defines the roles of the node group. The attribute name is "roles"; the values are "hadoop_namenode" and "hadoop_jobtracker". This means hadoop_namenode and hadoop_jobtracker will be deployed to the virtual machine in the group.
You can see the available roles with the "distro list" command.
Line 9 defines the number of instances in the node group. The attribute name is "instanceNum"; the value is 1. This means only one virtual machine is created for the group.
You can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig, and hive, but only one instance for hadoop_namenode and hadoop_jobtracker.
Line 10 defines the instance type of the node group. The attribute name is "instanceType"; the value is "LARGE". The instance types are predefined virtual machine specs: combinations of number of CPUs, RAM size, and storage size. The predefined numbers can be overridden by the cpuNum, memCapacityMB, and storage values specified in the file.
Line 11 defines the number of CPUs per virtual machine. The attribute name is "cpuNum"; the value is 2. It overrides the number of CPUs of the predefined virtual machine spec.
Line 12 defines the RAM size per virtual machine. The attribute name is "memCapacityMB"; the value is 4096. It overrides the RAM size of the predefined virtual machine spec.
Line 13 to 16 defines the storage requirement of the node group. It is an object named "storage".
o Line 14 defines the storage type. It is an attribute of the "storage" object; the attribute name is "type" and the value is "SHARED", meaning Hadoop data must be stored on shared storage.
o Line 15 defines the storage size. It is an attribute of the "storage" object; the attribute name is "sizeGB" and the value is 20, meaning there will be a 20GB disk for Hadoop to use.
Line 17 defines whether HA applies to the node. The attribute name is "haFlag"; the value is "on", meaning the virtual machine in the group is protected by vSphere HA.
Line 18 to 20 defines the resource pools with which the node group must be associated. The attribute name is "rpNames"; the value is an array containing one resource pool, "rp1".
You can see the same structure for the other 3 node groups. In addition, the "data" and "compute" groups specify a pair of placement constraints:
Line 35 to 41 defines the placement constraints for the data node group. The attribute name is "placementPolicies", and the value is a hash containing "instancePerHost" and "groupRacks". This constraint means you need at least 3 ESX hosts, because the group requires 3 instances and forces 1 instance per host; furthermore, the group is provisioned on hosts in "rack1", "rack2", and "rack3" using the "ROUNDROBIN" algorithm.
Line 56 to 62 defines the placement constraints for the compute node group, which contains "instancePerHost" and "groupAssociations". This constraint means you also need at least 3 ESX hosts, for the same reason, and the group has a "STRICT" association to node group "data" for better performance.
You can customize the Hadoop configuration with the "configuration" attribute on line 82 to 83, which happens to be empty in this sample.
You can modify value of the attributes, and you can also remove the optional value if you don‟t care.
Following is definition for the outer most attributes in a cluster spec:
Attribute Type Mandatory/optional Description
nodeGroups object Mandatory Contains one or more node group specifications;
details are in the node group table below.
configuration object Optional Customizable Hadoop configuration key/value
pairs.
externalHDFS string Optional URI of an external HDFS (only valid for a
compute-only cluster).
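As a sketch, a minimal spec using only these outermost attributes might look like the following (the node group contents here are placeholders, not a complete specification; the full node group attribute set is defined in the next table):

```python
import json

# Minimal cluster spec sketch using only the outermost attributes above.
# Node group contents are illustrative placeholders.
spec = {
    "nodeGroups": [
        {"name": "master", "roles": ["hadoop_namenode"], "instanceNumber": 1},
        {"name": "worker", "roles": ["hadoop_datanode"], "instanceNumber": 3},
    ],
    "configuration": {},  # optional Hadoop key/value overrides
}
# Serialize for use as a spec file with `cluster create --specFile ...`
print(json.dumps(spec, indent=2))
```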
Following is the definition of the objects and attributes for a particular node group.
Attribute Type Mandatory/Optional Description
name string Mandatory User defined node group name.
roles list of
string
Mandatory A list of software packages or services to be
installed on the virtual machines in the node
group. Each item must exactly match a role
shown by "distro list".
instanceNumber integer Mandatory The number of virtual machines in the node group.
It must be a positive integer. For
hadoop_namenode and hadoop_jobtracker, it
must be 1.
instanceType string Optional Size of the virtual machines in the node group. It
is the name of a predefined virtual machine
template: "SMALL", "MEDIUM", "LARGE", or
"EXTRA_LARGE".
The cpuNum, memCapacityMb, and
storage.sizeGB attributes override this attribute
if they are defined in the same node group.
cpuNum integer Optional Number of vCPUs per virtual machine
memCapacityMb integer Optional RAM size in MB per virtual machine
storage object Optional Storage settings
type string Optional It can be “LOCAL” or “SHARED”.
sizeGB integer Optional Data storage size. It must be a positive integer.
dsNames list of
string
Optional Datastores the node group can use.
rpNames list of
string
Optional Resourcepools the node group can use.
haFlag string Optional It can be "on", "off" or "ft". "on" means the node
group is protected by vSphere HA; "ft" means it
is protected by vSphere FT.
By default, the name node and job tracker are
protected by vSphere HA.
placementPolicies object Optional It can contain three optional constraints:
"instancePerHost", "groupRacks" and
"groupAssociations"; refer to Section 5.3.2 for details.
Serengeti comes with predefined virtual machine specifications.
SMALL MEDIUM LARGE EXTRA_LARGE
Number of vCPU 1 2 4 8
RAM 3.75GB 7.5GB 15GB 30GB
Disk size for Hadoop master data 25GB 50GB 100GB 200GB
Disk size for Hadoop worker data 50GB 100GB 200GB 400GB
Disk size for Hadoop client data 50GB 100GB 200GB 400GB
When creating virtual machines, Serengeti tries to allocate datastores of the preferred type. SHARED
storage is preferred for masters and clients; LOCAL storage is preferred for workers.
Separate disks are created for the OS and swap.
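The table above can be expressed as a lookup keyed by instanceType, for instance (RAM converted to MB, disk sizes in GB; client data disks use the same sizes as worker data disks):

```python
# Predefined virtual machine sizes from the table above, keyed by
# instanceType. RAM is expressed in MB (3.75GB = 3840MB, and so on);
# client data disks use the same sizes as worker data disks.
INSTANCE_TYPES = {
    "SMALL":       {"cpuNum": 1, "memCapacityMB": 3840,  "masterGB": 25,  "workerGB": 50},
    "MEDIUM":      {"cpuNum": 2, "memCapacityMB": 7680,  "masterGB": 50,  "workerGB": 100},
    "LARGE":       {"cpuNum": 4, "memCapacityMB": 15360, "masterGB": 100, "workerGB": 200},
    "EXTRA_LARGE": {"cpuNum": 8, "memCapacityMB": 30720, "masterGB": 200, "workerGB": 400},
}

def predefined_spec(instance_type):
    """Look up the predefined sizing for an instanceType such as "MEDIUM"."""
    return INSTANCE_TYPES[instance_type]

print(predefined_spec("MEDIUM")["memCapacityMB"])  # → 7680
```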
7. Serengeti Command Reference
7.1 connect
Connect and log in to a remote Serengeti server.
Parameter Mandatory/Optional Description
--host Mandatory Specify the Serengeti web service URL in the format <Serengeti
Management Server IP or hostname>:<port>. By default, the Serengeti
web service listens on port 8080.
--username Optional The Serengeti user name
--password Optional The Serengeti password
The command reads the username and password in interactive mode. Section 5.1 describes how to
manage Serengeti users.
If connect fails, or the connect command has not been run, no other Serengeti command can be
executed.
7.2 cluster
7.2.1 cluster config
Modify Hadoop configuration of an existing default or customized Hadoop cluster in Serengeti.
Parameter Mandatory/Optional Description
--name <cluster name in
Serengeti>
Mandatory Specify the Hadoop cluster name in Serengeti.
--specFile <spec file path> Optional Specify the Hadoop cluster's specification in a customized file.
--yes Optional Answer 'y' to the 'Y/N' confirmation. If not specified, the user
must answer 'y' or 'n' explicitly.
7.2.2 cluster create
Create a default/customized Hadoop cluster in Serengeti.
Parameter Mandatory/Optional Description
--name <cluster name
in Serengeti>
Mandatory Specify the Hadoop cluster name in Serengeti.
--type <cluster type> Optional Specify the cluster type. Hadoop and HBase are supported.
The default is Hadoop.
--specFile <spec file
path>
Optional Specify the Hadoop cluster's specification in a customized
file
--distro <Hadoop distro
name>
Optional Specify which distro is used to deploy the Hadoop
cluster. The supported distros include Apache Hadoop,
Greenplum HD, CDH3 and HDP1.
--dsNames <datastore
names>
Optional Specify which datastores are used to deploy the Hadoop
cluster in Serengeti. By default, the cluster uses the same
datastore as the Serengeti virtual machine. Multiple
datastores can be specified, separated by ",".
--networkName
<network name>
Optional Specify which network is used to deploy the Hadoop
cluster in Serengeti. By default, the cluster uses the same
network as the Serengeti virtual machine.
--rpNames <resource
pool name>
Optional Specify which resource pools are used to deploy the
Hadoop cluster in Serengeti. By default, the cluster uses
the same resource pool as the Serengeti virtual machine.
Multiple resource pools can be specified, separated by ",".
--resume Optional If specified, this command resumes a cluster creation
process that previously failed.
--topology <topology
type>
Optional Specify which topology type will be used for rack
awareness: HVE, RACK_AS_RACK, or
HOST_AS_RACK.
--yes Optional Answer 'y' to the 'Y/N' confirmation. If not specified, the
user must answer 'y' or 'n' explicitly.
--skipConfigValidation Optional Skip cluster configuration validation.
If the cluster spec does not include required nodes (for example, a master node), Serengeti generates
them with a default configuration.
7.2.3 cluster delete
Delete a Hadoop cluster in Serengeti.
Parameter Mandatory/Optional Description
--name <cluster name> Mandatory Delete a specified Hadoop cluster in Serengeti.
7.2.4 cluster export
Export cluster information.
Parameter Mandatory/Optional Description
--spec Mandatory Export cluster specification. The exported cluster specification can be
used in cluster create or cluster config command.
--output Optional Specify the output file name for exported cluster information.
If not specified, the output will be displayed in the console.
7.2.5 cluster limit
Limit the number of active compute nodes in the specified Hadoop cluster or node group in Serengeti
to the number specified by activeComputeNodeNum. Compute nodes are re-commissioned and
powered on, or decommissioned and powered off, to reach the specified number of active compute nodes.
Parameter Mandatory/Optional Description
--name <cluster_name> Mandatory Name of the Hadoop cluster in Serengeti
--nodeGroup
<node_group_name>
Optional Name of a node group in the specified Hadoop cluster
in Serengeti (supports node groups with task tracker
role only)
--activeComputeNodeNum
<number>
Mandatory Number of active compute nodes for the specified
Hadoop cluster or node group within that cluster.
The valid values are integers greater than or
equal to zero.
- For the value zero, all nodes in the specified
Hadoop cluster or node group (if --nodeGroup
is specified) are decommissioned and
powered off.
- For a value between 1 and the maximum
node number of the Hadoop cluster or node
group (if --nodeGroup is specified), that
number of nodes stays commissioned and
powered on; the other nodes are
decommissioned.
- For a value larger than the maximum node
number of the Hadoop cluster or node group
(if --nodeGroup is specified), all nodes in the
specified cluster or node group are
re-commissioned and powered on.
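The three cases reduce to a simple clamp. As an illustrative model (not Serengeti source code):

```python
def active_nodes_after_limit(requested, max_nodes):
    """Number of compute nodes left commissioned and powered on after
    `cluster limit --activeComputeNodeNum <requested>`, following the
    three cases described above (illustrative model only)."""
    if requested < 0:
        raise ValueError("activeComputeNodeNum must be >= 0")
    return min(requested, max_nodes)

# zero -> all decommissioned; in range -> exactly that many;
# above the maximum -> all re-commissioned.
print(active_nodes_after_limit(0, 8),
      active_nodes_after_limit(5, 8),
      active_nodes_after_limit(20, 8))  # → 0 5 8
```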
7.2.6 cluster list
List all Hadoop clusters in Serengeti.
Parameter Mandatory/Optional Description
--name <cluster
name in
Serengeti>
Optional List the specified Hadoop cluster in Serengeti, including its
name, distro, status, and each role's information. For each
role, the instance count, CPU, memory, type and size are listed.
--detail Optional List all Hadoop clusters' details, including the name in
Serengeti, distro, deploy status, and each node's information
by role.
Note: with this option specified, Serengeti queries vCenter
Server for the latest node status. That operation may take a
few seconds per cluster.
For example:
7.2.7 cluster resize
Change the number of nodes in a node group.
Parameter Mandatory/Optional Description
--name <cluster name in
Serengeti>
Mandatory Specify the target Hadoop cluster in Serengeti.
--nodeGroup <name of
the node group>
Mandatory Specify the target node group to be scaled out in the
Hadoop cluster deployed by Serengeti.
--instanceNum <instance
number>
Mandatory Specify the target instance count to scale out to.
The target count must be larger than the original.
Example:
cluster resize --name foo --nodeGroup slave --instanceNum 10
7.2.8 cluster start
Start a Hadoop cluster in Serengeti.
Parameter Mandatory/Optional Description
--name <cluster name> Mandatory Start a specified Hadoop cluster in Serengeti.
7.2.9 cluster stop
Stop a Hadoop cluster in Serengeti.
Parameter Mandatory/Optional Description
--name <cluster name> Mandatory Stop a specified Hadoop cluster in Serengeti.
7.2.10 cluster target
Connect to a Hadoop cluster to interact with it through the Serengeti CLI, including running the fs, mr,
pig, and hive commands.
Parameter Mandatory/Optional Description
--name <cluster name> Optional The name of the cluster to connect to. If this parameter is
not specified, the first cluster listed by the "cluster list"
command is used.
--info Optional Show the targeted cluster's information, such as the HDFS
URL, Job Tracker URL and Hive server URL.
Note: --name and --info cannot be used together.
7.2.11 cluster unlimit
Enable all of the provisioned compute nodes in the specified Hadoop cluster or node group in Serengeti.
Compute nodes are re-commissioned and powered-on as necessary.
Parameter Mandatory/Optional Description
--name <cluster_name> Mandatory Name of the Hadoop cluster in Serengeti
--nodeGroup
<node_group_name>
Optional Name of a node group in the specified Hadoop cluster
in Serengeti (only supports node groups with task
tracker role)
7.3 datastore
7.3.1 datastore add
Add a datastore to Serengeti for deployment.
Parameter Mandatory/Optional Description
--name <datastore
name in Serengeti> Mandatory Specify the name of datastore added to Serengeti
--spec <datastore
name in vCenter> Mandatory Specify the datastore name in vSphere. You can use wild
cards to specify multiple VMFS stores; * and ? are
supported.
--type <datastore type:
LOCAL|SHARED>
Mandatory Specify the datastore type in vSphere: local storage or
shared storage.
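Assuming the wild card follows the usual shell-glob semantics (* matches any run of characters, ? matches exactly one character), the matching behaves like Python's fnmatch; the datastore names here are illustrative:

```python
from fnmatch import fnmatch

# Illustrative: which datastore names a "--spec" pattern would select,
# assuming shell-glob semantics for * and ?.
datastores = ["datastore1", "datastore2", "datastore10", "nfs-shared"]
print([d for d in datastores if fnmatch(d, "datastore*")])  # all three VMFS stores
print([d for d in datastores if fnmatch(d, "datastore?")])  # one trailing character only
```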
7.3.2 datastore delete
Delete a datastore from Serengeti.
Parameter Mandatory/Optional Description
--name <datastore name in
Serengeti> Mandatory Delete a specified datastore in
Serengeti.
7.3.3 datastore list
List datastores added to Serengeti.
Parameter Mandatory/Optional Description
--name <Name of datastore name
in Serengeti>
Optional List the specified datastore information
including name, type.
--detail Optional List the datastore details including datastore
path in vSphere.
All datastores that are added to Serengeti are listed if the name is not specified.
For example:
7.4 distro
7.4.1 distro list
Show which roles are offered in a distro.
Parameter Mandatory/Optional Description
--name <distro name> Optional List the specified distro information.
For example:
7.5 disconnect
Disconnect and log out from the remote Serengeti server. After disconnecting, the user is not allowed
to run any CLI commands.
7.6 fs
7.6.1 fs cat
Copy source paths to stdout.
Parameter Mandatory/Optional Description
<file name> Mandatory The file to be displayed in the console. Multiple files must be quoted,
such as “/path/file1 /path/file2”
7.6.2 fs chgrp
Change group association of files.
Parameter Mandatory/Optional Description
--group <group name> Mandatory The group name of the file
--recursive true|false Optional Make the change recursively through the directory
structure
<file name> Mandatory The file whose group is to be changed. Multiple files
must be quoted, such as “/path/file1 /path/file2”
7.6.3 fs chmod
Change the permissions of files.
Parameter Mandatory/Optional Description
--mode <permission mode> Mandatory The file permission mode, such as “755”
--recursive true|false Optional Make the change recursively through the directory
structure
<file name> Mandatory The file whose permissions are to be changed. Multiple
files must be quoted, such as “/path/file1
/path/file2”
7.6.4 fs chown
Change the owner of files.
Parameter Mandatory/Optional Description
--owner <owner
name>
Mandatory The file owner name
--recursive true|false Optional Make the change recursively through the directory structure
<file name> Mandatory The file whose owner is to be changed. Multiple files must be
quoted, such as “/path/file1 /path/file2”
7.6.5 fs copyFromLocal
Copy a single source file, or multiple source files, from the local file system to the destination file
system. It is the same as put.
Parameter Mandatory/Optional Description
--from <local file
path>
Mandatory The local file path. Multiple files must be quoted, such as
"/path/file1 /path/file2"
--to <HDFS file
path>
Mandatory The file path in HDFS. If "--from" specifies multiple files,
"--to" is a directory name.
7.6.6 fs copyToLocal
Copy files to the local file system. It is the same as get.
Parameter Mandatory/Optional Description
--from < HDFS file path > Mandatory The file path in HDFS. Multiple files must be quoted,
such as "/path/file1 /path/file2"
--to < local file path > Mandatory The local file path. If "--from" specifies multiple files,
"--to" is a directory name.
7.6.7 fs copyMergeToLocal
Takes a source directory and a destination file as input and concatenates the files in the HDFS directory
into the local file system.
Parameter Mandatory/Optional Description
--from < HDFS file path > Mandatory The file path in HDFS. Multiple files must be quoted,
such as “/path/file1 /path/file2”.
--to < local file path > Mandatory The local file path.
--endline <true|false> Optional Whether to add an end-of-line character.
7.6.8 fs count
Count the number of directories, files, bytes, quota, and remaining quota.
Parameter Mandatory/Optional Description
--path < HDFS path > Mandatory The path to be counted.
--quota <true|false> Optional Whether to include quota information.
7.6.9 fs cp
Copy files from source to destination. This command allows multiple sources as well in which case the
destination must be a directory.
Parameter Mandatory/Optional Description
--from <HDFS source
file path>
Mandatory The source file path in HDFS. Multiple files must be
quoted, such as "/path/file1 /path/file2"
--to <HDFS destination
file path>
Mandatory The destination path in HDFS. If "--from" specifies
multiple files, "--to" is a directory name.
7.6.10 fs du
Display the sizes of files and directories contained in the given directory, or the length of a file if it is
just a file.
Parameter Mandatory/Optional Description
<file name> Mandatory The file whose size is to be displayed. Multiple files must be quoted,
such as “/path/file1 /path/file2”.
7.6.11 fs expunge
Empty the trash bin in the HDFS.
7.6.12 fs get
Copy files to the local file system.
Parameter Mandatory/Optional Description
--from < HDFS file
path >
Mandatory The file path in HDFS. Multiple files must be quoted, such as
"/path/file1 /path/file2".
--to < local file
path >
Mandatory The local file path. If "--from" specifies multiple files, "--to" is a
directory name.
7.6.13 fs ls
List files in the directory.
Parameter Mandatory/Optional Description
<path name> Mandatory The path to be listed. Multiple paths must be quoted,
such as "/path/file1 /path/file2".
--recursive <true|false> Optional Whether to list the directory recursively.
7.6.14 fs mkdir
Create a new directory.
Parameter Mandatory/Optional Description
<dir name> Mandatory The directory name to be created.
7.6.15 fs moveFromLocal
Similar to the put command, except that the local source file is deleted after it is copied.
Parameter Mandatory/Optional Description
--from <local file path> Mandatory The local file path. Multiple files must be quoted, such
as "/path/file1 /path/file2".
--to <HDFS file path> Mandatory The file path in HDFS. If "--from" specifies multiple files,
"--to" is a directory name.
7.6.16 fs mv
Move source files to destination in the HDFS.
Parameter Mandatory/Optional Description
--from <source file path> Mandatory The source file path in HDFS. Multiple files must be
quoted, such as "/path/file1 /path/file2".
--to <dest file path> Mandatory The destination path in HDFS. If "--from" specifies
multiple files, "--to" is a directory name.
7.6.17 fs put
Copy a single source, or multiple sources, from the local file system to the HDFS.
Parameter Mandatory/Optional Description
--from <local file path> Mandatory The local file path. Multiple files must be quoted, such
as "/path/file1 /path/file2".
--to <HDFS file path> Mandatory The file path in HDFS. If "--from" specifies multiple files,
"--to" is a directory name.
7.6.18 fs rm
Remove files in the HDFS.
Parameter Mandatory/Optional Description
< file path> Mandatory The file to be removed.
--recursive <true|false> Optional Remove files recursively.
--skipTrash <true|false> Optional Bypass the trash.
7.6.19 fs setrep
Change the replication factor of a file.
Parameter Mandatory/Optional Description
--path < file path> Mandatory The path whose replication factor is to be changed.
--replica <replica number> Mandatory Number of replicas.
--recursive <true|false> Optional Whether to set the replication factor recursively.
--waiting <true|false> Optional Whether to wait until the replica count reaches the
specified number.
7.6.20 fs tail
Display the last kilobyte of the file to stdout.
Parameter Mandatory/Optional Description
<file path> Mandatory The file path to be displayed.
--file <true|false> Optional Show content as the file grows.
7.6.21 fs text
Take a source file and output it in text format.
Parameter Mandatory/Optional Description
<file path> Mandatory The file path to be displayed.
7.6.22 fs touchz
Create a file of zero length.
Parameter Mandatory/Optional Description
<file path> Mandatory The file name to be created.
7.7 hive
7.7.1 hive cfg
Configure Hive.
Parameter Mandatory/Optional Description
--host <server host > Optional The server host.
--port <server port> Optional The server port.
--timeout Optional The timeout in milliseconds.
7.7.2 hive script
Execute a Hive script. Note: You need to run hive cfg before running a hive script.
Parameter Mandatory/Optional Description
--location <script path> Mandatory The hive script file name to be executed.
7.8 mr
7.8.1 mr jar
Run a MapReduce job located inside the provided jar.
Parameter Mandatory/Optional Description
--jarfile <jar file path> Mandatory The jar file path.
--mainclass <main class name> Mandatory The class that contains the main() method.
--args <arg> Optional The arguments to the main class. If there are
multiple arguments, they must be double
quoted.
7.8.2 mr job counter
Print the counter value of the MR job.
Parameter Mandatory/Optional Description
--jobid <job id> Mandatory The MR job id.
--groupname <group name> Mandatory The counter‟s group name.
--countername <counter name> Mandatory The counter‟s name.
7.8.3 mr job events
Print the details of events received by the JobTracker within the given range.
Parameter Mandatory/Optional Description
--jobid <job id> Mandatory The MR job id.
--from < from-event-#> Mandatory The start number of events to be printed.
--number < #-of-events> Mandatory The total number of events to be printed.
7.8.4 mr job history
Print job details, failed and killed job details.
Parameter Mandatory/Optional Description
<job history directory> Mandatory The directory where the job history files are stored.
--all <true|false> Optional Print information for all jobs.
7.8.5 mr job kill
Kill the MR job.
Parameter Mandatory/Optional Description
--jobid <job id> Mandatory The job id.
7.8.6 mr job list
List MR jobs.
Parameter Mandatory/Optional Description
--all <true|false> Optional Whether to list all jobs.
7.8.7 mr job set priority
Change the priority of the job.
Parameter Mandatory/Optional Description
--jobid <jobid> Mandatory The job id.
--priority
<VERY_HIGH|HIGH|NORMAL|LOW|VERY_LOW>
Mandatory The job's priority.
7.8.8 mr job status
Query MR job status.
Parameter Mandatory/Optional Description
--jobid <jobid> Mandatory The job id.
7.8.9 mr job submit
Submit an MR job defined in a job file.
Parameter Mandatory/Optional Description
--jobfile <jobfile> Mandatory Specify the file that defines the MR job. The file is a
standard Hadoop configuration file. An example
configuration file follows:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.jar</name>
    <value>/home/hadoop/hadoop-1.0.1/hadoop-examples-1.0.1.jar</value>
  </property>
  <property>
    <name>mapred.input.dir</name>
    <value>/user/hadoop/input</value>
  </property>
  <property>
    <name>mapred.output.dir</name>
    <value>/user/hadoop/output</value>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>wordcount</value>
  </property>
  <property>
    <name>mapreduce.map.class</name>
    <value>org.apache.hadoop.examples.WordCount.TokenizerMapper</value>
  </property>
  <property>
    <name>mapreduce.reduce.class</name>
    <value>org.apache.hadoop.examples.WordCount.IntSumReducer</value>
  </property>
</configuration>
7.8.10 mr task fail
Fail the Map Reduce task.
Parameter Mandatory/Optional Description
--taskid <taskid> Mandatory Specify the task id.
7.8.11 mr task kill
Kill the Map Reduce task.
Parameter Mandatory/Optional Description
--taskid <taskid> Mandatory Specify the task id.
7.9 network
7.9.1 network add
Add a network to Serengeti.
Parameter Mandatory/Optional Description
--name <network name in Serengeti> Mandatory Specify the name of network resource
added to Serengeti
--portGroup <port group name in
vSphere>
Mandatory Specify the name of the port group in vSphere
that you want to add to Serengeti
--dhcp Combination 1 Specify the IP address assignment type,
DHCP.
--ip <IP Spec, an IP address range
looks like xx.xx.xx.xx-xx[,xx]*>
--dns <dns server ip>
--secondaryDNS <dns server ip>
--gateway <gateway IP>
--mask <network mask>
Combination 2 Specify the IP address assignment type,
static IP.
For example:
>network add --name ipNetwork --ip 192.168.1.1-100,192.168.1.120-180 --portGroup pg1 --dns
202.112.0.1 --gateway 192.168.1.254 --mask 255.255.255.0
>network add --name dhcpNetwork --dhcp --portGroup pg1
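The IP spec format ("xx.xx.xx.xx-xx[,xx]*") can be parsed as in this sketch. This is an illustrative parser, not Serengeti source; it assumes each range varies only in the last octet, as in the example above:

```python
import ipaddress

def expand_ip_spec(spec):
    """Expand an IP spec such as "192.168.1.1-100,192.168.1.120-180" into
    individual addresses. Illustrative parser only; assumes each range
    varies only in the last octet."""
    addresses = []
    for part in spec.split(","):
        if "-" in part:
            start, last_octet = part.rsplit("-", 1)
            prefix = start.rsplit(".", 1)[0]
            first = int(ipaddress.IPv4Address(start))
            last = int(ipaddress.IPv4Address(prefix + "." + last_octet))
            addresses.extend(str(ipaddress.IPv4Address(a))
                             for a in range(first, last + 1))
        else:
            addresses.append(part)  # a single address, no range
    return addresses

ips = expand_ip_spec("192.168.1.1-100,192.168.1.120-180")
print(len(ips))  # → 161 (100 + 61 addresses)
```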
7.9.2 network delete
Delete a network in Serengeti.
Parameter Mandatory/Optional Description
--name <network name in Serengeti> Mandatory Delete the specified network in Serengeti.
7.9.3 network list
List available networks in Serengeti.
Parameter Mandatory/Optional Description
--name <network name in Serengeti> Optional List the specified network in Serengeti
including name, port group in vSphere, IP
address assignment type, assigned IP
address and so on.
--detail Optional List the network detail information in
Serengeti including Hadoop cluster node's
network information.
For example:
7.10 pig script
7.10.1 pig cfg
Configure Pig.
Parameter Mandatory/Optional Description
--props Optional Specify the Pig properties file location.
--jobName Optional Specify the job name.
--jobPriority Optional Specify the job priority.
--jobTracker Optional Specify the job tracker.
--execType Optional Specify the execution type.
--validateEachStatement Optional Whether to validate each statement.
7.10.2 pig script
Execute a Pig script. Note: You need to run pig cfg before running this command.
Parameter Mandatory/Optional Description
--location <script path> Mandatory Specify the name of the script to be executed.
7.11 resourcepool
7.11.1 resourcepool add
Add a resource pool in vSphere to Serengeti.
Parameter Mandatory/Optional Description
--name <resource pool name in Serengeti> Mandatory Specify the name of resource pool
added to Serengeti.
--vccluster <vSphere cluster of the resource
pool>
Mandatory Specify the name of the vSphere
cluster that contains the resource
pool.
--vcrp <vSphere resource pool name> Mandatory Specify the vSphere resource pool
to add to Serengeti for
deployment. The vSphere
resource pool must be directly
under a cluster.
7.11.2 resourcepool delete
Remove a resource pool from Serengeti.
Parameter Mandatory/Optional Description
--name <resource pool name in Serengeti> Mandatory Remove specified resource pool
from Serengeti.
7.11.3 resourcepool list
List resource pools added to Serengeti.
Parameter Mandatory/Optional Description
--name <resource pool name in Serengeti> Optional List the specific resource pool
name, path.
--detail Optional List each resource pool's general
information and the Hadoop cluster
nodes in this resource pool.
All resource pools that are added to Serengeti are listed if a name is not specified. For each resource
pool, NAME and PATH are listed. NAME is the name in Serengeti. PATH is the combination of the
vSphere cluster name and resource pool name, separated by "/".
For example:
7.12 topology
7.12.1 topology upload
Upload a rack-to-hosts topology mapping file to Serengeti. A newly uploaded file overwrites the
existing file. The accepted format is one rack per line: rackname: hostname1, hostname2, ...
where hostname1, hostname2, ... are the host names displayed in vSphere.
Parameter Mandatory/Optional Description
--fileName <topology file name> Mandatory Specify the topology file name.
--yes Optional Answer 'y' to the 'Y/N' confirmation.
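A parser for the accepted file format might look like this sketch (the host names are illustrative):

```python
def parse_topology(text):
    """Parse the "rackname: hostname1, hostname2, ..." format described
    above into a rack -> host-list mapping (illustrative sketch)."""
    topology = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rack, _, hosts = line.partition(":")
        topology[rack.strip()] = [h.strip() for h in hosts.split(",") if h.strip()]
    return topology

sample = "rack1: esx-host1.example.com, esx-host2.example.com\nrack2: esx-host3.example.com"
print(parse_topology(sample))
```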
7.12.2 topology list
List rack-hosts mapping topology stored in Serengeti.
8. vSphere Settings
8.1 vSphere Cluster Configuration
8.1.1 Setup Cluster
In the vCenter Client, select “Inventory”, “Hosts and Clusters”. In the left column, right-click the Datacenter
and select "New Cluster..." Follow the New Cluster wizard using the following settings:
Enable “vSphere HA” and “vSphere DRS”
Enable Host Monitoring
Enable Admission Control and set desired policy. (Default policy is to tolerate 1 host failure)
Virtual machine restart priority “High”
Virtual machine Monitoring “virtual machine and Application Monitoring”
Monitoring sensitivity “High”
8.1.2 Enable DRS/HA on an existing cluster
If DRS or HA is not already enabled on an existing cluster, it can be enabled by right-clicking the cluster
and selecting “Edit Settings”. Under “Cluster Features”, select "Turn On vSphere DRS" and "Turn On
vSphere HA". Use settings specified in "Setup Cluster" above.
8.1.3 Add Hosts to Cluster
In the vCenter Client, select “Inventory”, “Hosts and Clusters”. In the left column, right-click the Cluster
that was just created and select "Add Host...". Follow the Add Host Wizard to add a Host. Repeat for each
additional Host.
8.1.4 DRS/FT Settings
In the vCenter Client, select “Inventory”, “Hosts and Clusters”. In the left column, click a host in the
cluster. On the right side there will be a row of tabs near the top of the window, click on “Configuration”
then click on Networking. The window displays the vSwitch port groups. By default, a VMkernel Port called
“Management Network” is pre-configured. Click “Properties...” of the vSwitch, choose the “Management
Network” and click the “Edit” button. Enable “vMotion” and “Fault Tolerance Logging” from the
“Management Network Properties” window.
To verify the FT status of a host, click on the Summary tab and locate “Host Configured for FT” in the
general section. If there are any issues with FT they will be shown here.
8.1.5 Enable FT on specific virtual machine
Fault Tolerance runs one virtual machine on two separate hosts, which allows instant failover in a
variety of situations. Before enabling FT, ensure the necessary requirements are met:
Host hardware is listed in the VMware Hardware Compatibility List (HCL)
All hosts in the cluster have Hardware VT enabled in the BIOS
The “Management Network” (VMkernel Port) has “vMotion” and "Fault Tolerance Logging"
enabled
Available capacity in the cluster
Virtual machine disks are thick provisioned, without snapshots and located on shared storage
Virtual machine is single vCPU
In the vCenter Client, select “Inventory”, “Hosts and Clusters”. In the left column, right click the virtual
machine and select “Fault Tolerance”, “Turn On Fault Tolerance”.
8.2 Network Settings
Serengeti currently deploys using a single network. Virtual machines are deployed with one NIC which is
attached to a specific Port Group. How this Port Group is configured and the network backing the Port
Group depends on the environment. Here we will cover a basic network configuration that may be
customized as needed.
Either a vSwitch or vSphere Distributed Switch can be used to provide the Port Group backing a
Serengeti cluster. vDS acts as a single virtual switch across all attached hosts while a vSwitch is per-host
and requires the Port Group to be configured manually.
8.2.1 Setup Port Group - Option A (vSphere Distributed Switch)
In the vCenter Client, select “Inventory”, “Networking”. Right Click the Datacenter and select “New
vSphere Distributed Switch”.
In the Create vSphere Distributed Switch wizard, choose Switch Version 5.0, then enter a name and
the number of uplink ports (physical adapters) you require.
On the Add Hosts and Physical Adapters step, select the adapter(s) on each host that will carry traffic to
the switch.
On the last step, the wizard creates a default Port Group. You can rename this Port Group after it is
created and the wizard is completed.
8.2.2 Setup Port Group - Option B (vSwitch)
In the vCenter Client, select “Inventory”, “Hosts and Clusters”. Navigate to the Networking section of the
Configuration Tab. Make sure the “vSphere Standard Switch” view is selected.
vSwitch0 is created by default. You may add a Port Group to this vSwitch or create a new
vSwitch that binds to different physical adapters.
To create a Port Group on the existing vSwitch click “Properties…” on that vSwitch and then click the
“Add” button. Follow the wizard to create the Port Group.
To create a new vSwitch, click on “Add Networking…” and follow the Add Network Wizard.
8.3 Storage Settings
Serengeti provisions virtual machines on shared storage to enable vSphere HA, FT and DRS features.
Local datastores are attached to virtual machines to be used for data.
8.3.1 Shared Storage Setting
Create a LUN on shared storage (SAN/NAS) and verify that it is accessible by all hosts in the cluster.
The vSphere HA Datastore Heartbeating feature requires two datastores.
8.3.2 Local Storage Settings
8.3.2.1 Configure DAS on Physical Hosts
Direct Attached Storage should be attached and configured on the physical controller to present each
disk separately to the OS. This configuration is commonly described as JBOD (Just A Bunch Of Disks) or
single disk RAID0.
8.3.2.2 Provision VMFS Datastores on DAS of Each Host
Create VMFS datastores on the Direct Attached Storage. This can be done in either of the following
two ways:
Manually, using the vSphere Client or the vSphere Management Assistant
Automatically, using vSphere PowerCLI
8.3.2.2.1 Manually Using vSphere Client (Manual per disk):
1. Expand Cluster => Select Host
2. Go to "Configuration" Tab
3. Choose "Storage"
4. Click "Add Storage..."
This starts the Add Storage wizard. Continue through the wizard steps:
5. Select "Disk/LUN" for Storage Type => Next
6. Select a Local Disk from the list => Next
7. Select "VMFS-5" for File System Version => Next => Next
8. Enter Datastore Name => Next
9. "Maximum Available Space" => Next
10. Finish
8.3.2.2.2 Automation by vSphere PowerCLI
This method requires that vSphere PowerCLI is installed. Refer to the vSphere PowerCLI site to
download and install PowerCLI.
Once PowerCLI is installed, you can use it to format many Direct Attached Storage disks as VMFS
datastores at a time.
1. Select Start > Programs > VMware > VMware vSphere PowerCLI.
The VMware vSphere PowerCLI console window opens.
2. In the VMware vSphere PowerCLI console window, run PowerCLI commands to format the disks.
CAUTION
The commands apply to multiple ESXi hosts at a time. Make sure the scope is what you intend
before you run a command.
Here's a sample script that provisions datastores. You can type the commands line by line in the
PowerCLI shell.
This example formats the local disks on all hosts in a vSphere cluster named “My Cluster” as VMFS
datastores. The datastore name prefix is “abcde”.
vSphere PowerCLI - Create Local Datastores for Cluster
# Connect to a vCenter Server.
Connect-VIServer -Server 10.23.112.235 -Protocol https -User admin -Password pass

# Prepare variables.
$i = 0
$localDisks = @{}
$clusterName = "My Cluster"
$datastoreName = "abcde"

# Select hosts.
$vmHosts = Get-VMHost -Location $clusterName

# Get local disks.
$ldArray = $vmHosts | Get-VMHostDisk | select -ExpandProperty ScsiLun | where {$_.IsLocal -eq "True"}

# Get primary disks.
$pdArray = $vmHosts | Get-VMHostDiagnosticPartition

# Add local disks to a hashtable keyed by canonical name.
foreach ($ld in $ldArray) {$localDisks.Add($ld.CanonicalName, $ld)}

# Remove primary disks from the local disk hashtable.
foreach ($pd in $pdArray) {$localDisks.Remove($pd.CanonicalName)}

# Create datastores. Creation fails for any local disk that is in use.
foreach ($ld in $localDisks.Values) {$i++; New-Datastore -Vmfs -Name ($datastoreName + $i.ToString("D3")) -Path $ld.CanonicalName -VMHost $ld.VMHost}
9. Appendix A: Create Local Yum Repository for MapR
9.1 Install a web server to serve as the yum server
Find a machine or virtual machine running 64-bit CentOS 5.x (or 64-bit RHEL 5.x) that has Internet access, and install a web server such as Apache or lighttpd on it. Alternatively, you can use the Serengeti Management Server if you don't have another machine. This web server serves as the yum server. This guide uses the Apache web server as an example.
9.1.1 Configure http proxy
First, open a bash shell terminal. If the machine needs an http proxy server to connect to the Internet, set the http_proxy environment variable:
# switch to root user
sudo su
export http_proxy=http://<proxy_server:port>
9.1.2 Install Apache Web Server
yum install -y httpd
/sbin/service httpd start
Make sure the firewall on the machine doesn't block network port 80, which is used by the Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to verify that the default test page of the Apache web server shows up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop
9.1.3 Install yum related packages
Install the yum-utils and createrepo packages if they are not already installed (yum-utils includes the reposync command):
yum install -y yum-utils createrepo
9.1.4 Sync the remote MapR yum repository
1) Create a new file /etc/yum.repos.d/mapr-m5.repo using vi or another editor, with the following content:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v2.1.1/redhat/
enabled=1
gpgcheck=0
protect=1

[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem/redhat
enabled=1
gpgcheck=0
protect=1
2) Mirror the remote yum repository to the local machine:
reposync -r maprtech
reposync -r maprecosystem
Depending on network bandwidth, it takes several minutes to download all the RPMs in the remote repositories. The RPMs are placed in new folders named maprtech and maprecosystem.
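As a quick sanity check (not part of the official procedure), you can count the RPMs that reposync downloaded into each folder; a count of zero suggests the sync failed. This hedged helper assumes you run it from the directory where reposync was executed:

```shell
# Hedged helper: report how many RPMs each mirror folder contains.
# The folder names match the repo ids used above (maprtech, maprecosystem).
for d in maprtech maprecosystem; do
  count=$(find "$d" -name '*.rpm' 2>/dev/null | wc -l)
  printf '%s: %s RPM(s)\n' "$d" "$count"
done
```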
9.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache web server. The Document Root folder is /var/www/html/ for Apache by default; if you use the Serengeti Management Server to set up the yum server, the folder is /opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/mapr/2
mv maprtech/ maprecosystem/ $doc_root/mapr/2/
2) Create a yum repository for the RPMs:
cd $doc_root/mapr/2
createrepo .
3) Create a new file /var/www/html/mapr/2/mapr-m5.repo with the following content:
[mapr-m5]
name=MapR Version 2
baseurl=http://<ip_of_webserver>/mapr/2
enabled=1
gpgcheck=0
protect=1
Replace <ip_of_webserver> with the IP address of the web server.
Ensure you can download http://<ip_of_webserver>/mapr/2/mapr-m5.repo from another machine.
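The reachability check can be scripted. This hedged sketch assumes curl is available on the machine you test from; the default URL is a placeholder that you must replace with http://<ip_of_webserver>/mapr/2/mapr-m5.repo:

```shell
# Hypothetical check: confirm the repo file is reachable over HTTP.
# The default URL below is a placeholder; pass the real one as $1.
url="${1:-http://127.0.0.1/mapr/2/mapr-m5.repo}"
if curl -fsS "$url" >/dev/null 2>&1; then
  echo "reachable: $url"
else
  echo "NOT reachable: $url"
fi
```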
9.3 Configure http proxy for the VMs created by Serengeti Server
This step is optional and applies only if the VMs created by the Serengeti Management Server need an http proxy to connect to the yum repository. Configure the http proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>

# set the IPs of the Serengeti Management Server and the local yum repository
# servers in 'serengeti.no_proxy'. Wildcards for matching multiple IPs do not work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.
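The edit can also be scripted. In this hedged sketch the proxy address and the no_proxy IPs are placeholders that you must replace, and the target path defaults to a local copy so the script can be tried safely before touching the real /opt/serengeti/conf/serengeti.properties:

```shell
# Hypothetical helper: append proxy settings to a serengeti.properties file.
# The real file lives at /opt/serengeti/conf/serengeti.properties; override
# SERENGETI_CONF to point there once the output looks right.
conf="${SERENGETI_CONF:-./serengeti.properties}"
cat >> "$conf" <<'EOF'
# set http proxy server (placeholder address)
serengeti.http_proxy = http://proxy.example.com:3128
# placeholder IPs; wildcards do not work in serengeti.no_proxy
serengeti.no_proxy = 10.0.0.10, 192.168.1.20
EOF
echo "appended proxy settings to $conf"
```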
10. Appendix B: Create Local Yum Repository for CDH4
10.1 Install a web server to serve as the yum server
Find a machine or virtual machine running 64-bit CentOS 5.x (or 64-bit RHEL 5.x) that has Internet access, and install a web server such as Apache or lighttpd on it. Alternatively, you can use the Serengeti Management Server if you don't have another machine. This web server serves as the yum server. This guide uses the Apache web server as an example.
10.1.1 Configure http proxy
First, open a bash shell terminal. If the machine needs an http proxy server to connect to the Internet, set the http_proxy environment variable:
# switch to root user
sudo su
export http_proxy=http://<proxy_server:port>
10.1.2 Install Apache Web Server
yum install -y httpd
/sbin/service httpd start
Make sure the firewall on the machine doesn't block the network port 80 used by Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to ensure the default test page of Apache web server shows up.
If you would like to stop the firewall, execute this command:
/sbin/service iptables stop
10.1.3 Install yum related packages
Install the yum-utils and createrepo packages if they are not already installed (yum-utils includes the reposync command):
yum install -y yum-utils createrepo
10.1.4 Sync the remote CDH4 yum repository
1) Create a new file /etc/yum.repos.d/cloudera-cdh4.repo using vi or another editor, with the following content:
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.1.2/
gpgkey = http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
2) Mirror the remote yum repository to the local machine:
reposync -r cloudera-cdh4
Depending on network bandwidth, it takes several minutes to download all the RPMs in the remote repository. The RPMs are placed in a new folder named cloudera-cdh4.
10.2 Create local yum repository
1) Put all the RPMs into a new folder under the Document Root folder of the Apache web server. The Document Root folder is /var/www/html/ for Apache by default; if you use the Serengeti Management Server to set up the yum server, the folder is /opt/serengeti/www/.
doc_root=/var/www/html
mkdir -p $doc_root/cdh/4/
mv cloudera-cdh4/RPMS $doc_root/cdh/4/
2) Create a yum repository for the RPMs:
cd $doc_root/cdh/4
createrepo .
3) Create a new file /var/www/html/cdh/4/cloudera-cdh4.repo with the following content:
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://<ip_of_webserver>/cdh/4/
enabled=1
gpgcheck=0
Replace <ip_of_webserver> with the IP address of the web server.
Ensure you can download http://<ip_of_webserver>/cdh/4/cloudera-cdh4.repo from another machine.
10.3 Configure http proxy for the VMs created by Serengeti Server
This step is optional and applies only if the VMs created by the Serengeti Management Server need an http proxy to connect to the yum repository. Configure the http proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:
# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>

# set the IPs of the Serengeti Management Server and the local yum repository
# servers in 'serengeti.no_proxy'. Wildcards for matching multiple IPs do not work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.