
Serengeti User's Guide

Serengeti 0.8

VMware, Inc.

Contents

1. Serengeti User's Guide
1.1 Intended Audience
2. Serengeti Overview
2.1 Serengeti
2.1.1 Serengeti Features
2.1.2 Serengeti Architecture Overview
2.2 Hadoop
2.3 VMware Virtual Infrastructure
2.4 Serengeti Virtual Appliance Requirements
2.5 Serengeti CLI Requirements
3. Installing the Serengeti Virtual Appliance
3.1 Download
3.2 Deploy Serengeti
4. Quick Start
4.1 Set up the Serengeti CLI
4.2 Deploy a Hadoop Cluster
4.3 Deploy an HBase Cluster
5. Using Serengeti
5.1 Manage Serengeti Users
5.1.1 Add/Delete a User in Serengeti
5.1.2 Modify User Password
5.2 Manage Resources in Serengeti
5.2.1 Add a Datastore
5.2.2 Add a Network
5.2.3 Add a Resource Pool
5.2.4 View Datastores
5.2.5 View Networks
5.2.6 View Resource Pools
5.2.7 Remove a Datastore
5.2.8 Remove a Network
5.2.9 Remove a Resource Pool
5.3 Manage Distros
5.3.1 Supported Distros
5.3.2 Add a Distro to Serengeti
5.3.3 List Distros
5.3.4 Using a Distro
5.4 Hadoop Clusters
5.4.1 Deploy Hadoop Clusters
5.4.2 Manage Hadoop Clusters
5.4.3 Use Hadoop Clusters
5.5 HBase Clusters
5.5.1 Deploy HBase Clusters
5.5.2 Manage HBase Clusters
5.5.3 Use HBase Clusters
5.6 Monitoring Clusters Deployed by Serengeti
5.7 Make the Hadoop Master Node HA/FT
5.8 Hadoop Topology Awareness
5.9 Start and Stop Serengeti Services
6. Cluster Specification Reference
7. Serengeti Command Reference
7.1 connect
7.2 cluster
7.2.1 cluster config
7.2.2 cluster create
7.2.3 cluster delete
7.2.4 cluster export
7.2.5 cluster limit
7.2.6 cluster list
7.2.7 cluster resize
7.2.8 cluster start
7.2.9 cluster stop
7.2.10 cluster target
7.2.11 cluster unlimit
7.3 datastore
7.3.1 datastore add
7.3.2 datastore delete
7.3.3 datastore list
7.4 distro
7.4.1 distro list
7.5 disconnect
7.6 fs
7.6.1 fs cat
7.6.2 fs chgrp
7.6.3 fs chmod
7.6.4 fs chown
7.6.5 fs copyFromLocal
7.6.6 fs copyToLocal
7.6.7 fs copyMergeToLocal
7.6.8 fs count
7.6.9 fs cp
7.6.10 fs du
7.6.11 fs expunge
7.6.12 fs get
7.6.13 fs ls
7.6.14 fs mkdir
7.6.15 fs moveFromLocal
7.6.16 fs mv
7.6.17 fs put
7.6.18 fs rm
7.6.19 fs setrep
7.6.20 fs tail
7.6.21 fs text
7.6.22 fs touchz
7.7 hive
7.7.1 hive cfg
7.7.2 hive script
7.8 mr
7.8.1 mr jar
7.8.2 mr job counter
7.8.3 mr job events
7.8.4 mr job history
7.8.5 mr job kill
7.8.6 mr job list
7.8.7 mr job set priority
7.8.8 mr job status
7.8.9 mr job submit
7.8.10 mr task fail
7.8.11 mr task kill
7.9 network
7.9.1 network add
7.9.2 network delete
7.9.3 network list
7.10 pig
7.10.1 pig cfg
7.10.2 pig script
7.11 resourcepool
7.11.1 resourcepool add
7.11.2 resourcepool delete
7.11.3 resourcepool list
7.12 topology
7.12.1 topology upload
7.12.2 topology list
8. vSphere Settings
8.1 vSphere Cluster Configuration
8.1.1 Setup Cluster
8.1.2 Enable DRS/HA on an Existing Cluster
8.1.3 Add Hosts to Cluster
8.1.4 DRS/FT Settings
8.1.5 Enable FT on a Specific Virtual Machine
8.2 Network Settings
8.2.1 Setup Port Group - Option A (vSphere Distributed Switch)
8.2.2 Setup Port Group - Option B (vSwitch)
8.3 Storage Settings
8.3.1 Shared Storage Settings
8.3.2 Local Storage Settings
9. Appendix A: Create Local Yum Repository for MapR
9.1 Install a web server to serve as the yum server
9.1.1 Configure HTTP proxy
9.1.2 Install Apache Web Server
9.1.3 Install yum-related packages
9.1.4 Sync the remote MapR yum repository
9.2 Create local yum repository
9.3 Configure HTTP proxy for the VMs created by Serengeti Server
10. Appendix B: Create Local Yum Repository for CDH4
10.1 Install a web server to serve as the yum server
10.1.1 Configure HTTP proxy
10.1.2 Install Apache Web Server
10.1.3 Install yum-related packages
10.1.4 Sync the remote CDH4 yum repository
10.2 Create local yum repository
10.3 Configure HTTP proxy for the VMs created by Serengeti Server


1. Serengeti User’s Guide

The Serengeti User's Guide provides information about installing and using Serengeti to deploy and scale Hadoop clusters on vSphere.

To help you get started with Serengeti, this information includes descriptions of Serengeti concepts and features, along with a set of usage examples and sample scripts.

1.1 Intended Audience

This book is intended for anyone who needs to install and use Serengeti. The information in this book is

written for administrators and developers who are familiar with VMware vSphere.

2. Serengeti Overview

2.1 Serengeti

The Serengeti virtual appliance is a management service that you can use to deploy Hadoop clusters on

VMware vSphere systems. It is a “one-click” deployment toolkit that allows you to leverage the VMware

vSphere platform to deploy a highly available Hadoop cluster in minutes, including common Hadoop

components such as HDFS, MapReduce, Pig, and Hive on a virtual platform. Serengeti supports multiple

Hadoop 0.20-based distributions, CDH4 (except YARN), and MapR M5.

2.1.1 Serengeti Features

2.1.1.1 Rapid Provisioning

Serengeti can deploy Hadoop clusters with HDFS, MapReduce, HBase, Pig, Hive client, and Hive server in your vSphere system easily and quickly.

Serengeti includes a provisioning engine, the Apache Hadoop distribution, and a virtual machine template.

Serengeti is preconfigured to automate Hadoop cluster deployment and configuration. With Serengeti,

you can save time in getting started with Hadoop because you do not need to install and configure an

operating system, or download, install and configure each software package on each machine.

2.1.1.2 High Availability

Serengeti takes advantage of vSphere High Availability (HA) to protect the Hadoop master node virtual machine. vSphere can monitor the master node virtual machine; if the Hadoop NameNode or JobTracker service stops unexpectedly, vSphere restarts the master node to recover. If the virtual machine itself stops unexpectedly because of a host failure, or becomes unreachable because of a network problem, vSphere can use Fault Tolerance (FT) to start a standby virtual machine automatically, reducing unplanned downtime.


2.1.1.3 Local Disk Management

Serengeti allows you to use both shared storage and local storage. After the disks are formatted as datastores in vSphere, you can add the datastores to Serengeti easily. You can specify whether the

datastores are shared storage (SHARED) or local storage (LOCAL). Serengeti automatically allocates the

datastores to Hadoop clusters when you deploy a Hadoop cluster.

By default, Serengeti allocates Hadoop master nodes and client nodes on SHARED datastores, and

data/compute nodes on LOCAL datastores, including both system disk and data disks of those nodes. If

you specify only local storage or shared storage, Serengeti allocates all Hadoop nodes on the available

datastores for a default cluster.
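For example, once your vSphere administrator has created the datastores, you can register a shared set and a local set with the "datastore add" command described in section 5.2.1. The names and wildcard patterns here are illustrative:

serengeti>datastore add --name mySharedDS --spec share* --type SHARED

serengeti>datastore add --name myLocalDS --spec local* --type LOCAL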

2.1.1.4 Easy Scale Out

With Serengeti you can add more nodes to a Hadoop cluster with a single command after it has been

deployed. You can start with a small Hadoop cluster and scale out as needed.
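For example, a deployed cluster's worker node group might be grown with a single "cluster resize" command, roughly as follows. The cluster and node group names are illustrative; see section 7.2.7 for the exact options:

serengeti>cluster resize --name myHadoop --nodeGroup worker --instanceNum 10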

2.1.1.5 Configuration

Serengeti allows you to customize the following:

Number of virtual machines

CPU, RAM, and storage for the virtual machines

Software packages for the virtual machines

Hadoop configuration

Serengeti automatically adjusts Hadoop configurations according to the virtual machine specification.

After creation, you can export a Hadoop cluster's spec and tune the Hadoop configuration without impacting unrelated Hadoop nodes.

Serengeti provides both cluster level and node group level configuration. You can set different

parameters for different node groups.
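For example, a cluster specification can carry a cluster-level configuration section, and each node group can carry its own configuration element that overrides it. The following fragment is an illustrative sketch only; the property values are arbitrary examples, and section 6 describes exactly which files and properties are supported.

"configuration": {
  "hadoop": {
    "core-site.xml": {
      "io.file.buffer.size": 131072
    },
    "mapred-site.xml": {
      "mapred.child.java.opts": "-Xmx1024m"
    }
  }
}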

2.1.1.6 Data Compute Separation

Serengeti allows you to deploy a Hadoop cluster whose data and compute nodes are separated.

You can specify the number of data nodes per host.

You can specify the number of compute nodes for each data node, and place a compute node and its related data node on the same physical host.

Serengeti also allows you to deploy a compute-only cluster, either to isolate the performance of different MapReduce clusters or to consume an existing HDFS:

Deploy a Hadoop cluster with only JobTracker and TaskTracker to consume an existing Apache 0.20-based HDFS.

Deploy a Hadoop cluster with only JobTracker and TaskTracker to consume a third-party HDFS.

2.1.1.7 Remote CLI

You can access the Serengeti Management Server remotely by installing the CLI client in your environment. The CLI is a one-stop shell to deploy, manage, and use Hadoop.

2.1.1.8 Hadoop Distribution Management

Serengeti allows you to use any of the following Hadoop distributions:


Apache Hadoop 1.0.x

Greenplum HD 1.2

Hortonworks HDP-1

CDH3

CDH4

MapR M5

You can add your preferred distribution to Serengeti and deploy Hadoop clusters accordingly.

2.1.2 Serengeti Architecture Overview

The Serengeti virtual appliance runs on top of a vSphere system and includes a Serengeti Management

Server virtual machine and a Hadoop Template virtual machine. The Hadoop Template virtual machine

includes an agent.

Serengeti performs these major steps to deploy a Hadoop cluster:

1. Serengeti Management Server searches for ESXi hosts with sufficient resources.

2. Serengeti Management Server selects ESXi hosts on which to place Hadoop virtual machines.

3. Serengeti Management Server sends a request to vCenter to clone and reconfigure virtual

machines.

4. The agent configures the OS parameters and network configuration.

5. The agent downloads Hadoop software packages from the Serengeti Management Server.

6. The agent installs the Hadoop software.

7. The agent configures Hadoop parameters.

Provisioning is performed in parallel, which reduces deployment time.

2.2 Hadoop

Apache Hadoop is open source software for distributed storage and computing. Apache Hadoop includes HDFS, a distributed file system, and MapReduce, a software framework for distributed data processing.

You can find more information about Apache Hadoop at http://hadoop.apache.org/.


2.3 VMware Virtual Infrastructure

VMware's leading virtualization solutions provide multiple benefits to IT administrators and users. VMware virtualization creates a layer of abstraction between the resources required by an application and operating system, and the underlying hardware that provides those resources. The value of this abstraction layer includes the following:

Consolidation: VMware technology allows multiple application servers to be consolidated onto

one physical server, with little or no decrease in overall performance.

Ease of Provisioning: VMware virtualization encapsulates an application into an image that can

be duplicated or moved, greatly reducing the cost of application provisioning and deployment.

Manageability: Virtual machines may be moved from server to server with no downtime using

VMware vMotion™, which simplifies common operations like hardware maintenance and reduces

planned downtime.

Availability: Unplanned downtime can be reduced and higher service levels can be provided to an

application. VMware High Availability (HA) ensures that in the case of an unplanned hardware

failure, any affected virtual machines are restarted on another host in a VMware cluster.

2.4 Serengeti Virtual Appliance Requirements

Software

o VMware vSphere 5.0 Enterprise or VMware vSphere 5.1 Enterprise

o VMware vSphere Client 5.0 or VMware vSphere Client 5.1

o SSH client

Network

o DNS server

o DHCP server or a static IP address block

Resource requirements

o A resource pool with at least 27.5GB RAM

o A port group with at least 6 uplink ports

o 350GB or more of disk space is suggested:

o 17GB for the Serengeti virtual appliance

o 300GB for your first Hadoop cluster. You can reduce the disk space requirement by specifying the storage size in a cluster specification.

o The remaining disk space is reserved for swap space.

o Shared storage is required if you use HA or FT for the Hadoop master node.

Others

o All ESXi hosts should have time synchronized using the Network Time Protocol (NTP)

2.5 Serengeti CLI Requirements

OS

o Windows

o Linux

Software

o Java 1.6.0_26 or later

o An unzip tool

Network

o Access to the Serengeti Management Server through HTTP, in order to download the CLI package

3. Installing the Serengeti Virtual Appliance

3.1 Download

Download a Serengeti Virtual Appliance OVA from the VMware site.

3.2 Deploy Serengeti

Serengeti runs in a VMware vSphere system. You can use the vSphere Client to connect to VMware vCenter Server and deploy Serengeti.

1. In the vSphere Client, select File -> Deploy OVF Template.

2. Select the OVA file location of the Serengeti virtual appliance. The vSphere Client verifies the OVA file and shows you brief information about it.

3. Specify the Serengeti virtual appliance name and inventory location.

Only alphabetic letters (a-z, A-Z), numbers (0-9), spaces (" "), hyphens ("-"), and underscores ("_") can be used in the virtual appliance name and resource pool name. A datastore name can additionally contain parentheses ("(", ")") and periods (".").

4. Select the resource pool on which to deploy the template.

You MUST deploy Serengeti in a top-level resource pool.


5. Select a datastore.

6. Select a format for the virtual disks.

7. Map the networks used in the OVF template to the networks in your inventory.

8. Set the properties for this Serengeti deployment.


Serengeti Management Server Network Settings

Network Type: Select DHCP or Static IP.

IP Address: Enter the IP address for the Serengeti Management Server virtual machine.

Netmask: Enter the subnet mask of the network.

Gateway: Enter the IP address of the network gateway.

DNS Server 1: Enter the DNS server IP address.

DNS Server 2: Enter a second DNS server IP address.

Hadoop Resource Settings

Initialize Resources: Keep this option selected to add the resource pool, datastore, and network in which the Serengeti virtual appliance is deployed to the Serengeti Management Server database. Users can then deploy Hadoop clusters in that resource pool, datastore, and network. Hadoop node virtual machines attempt to obtain IP addresses by using DHCP on the network.

9. Verify binding to vCenter Extension Service.

10. Click Next to deploy the virtual appliance. Deployment takes several minutes.

After the Serengeti virtual appliance is deployed successfully, two virtual machines are present in vSphere: the Serengeti Management Server virtual machine and the virtual machine template for Hadoop nodes.

11. Power on the Serengeti vApp and open the console of the Serengeti Management Server VM. The console shows the initial OS login password for the root and serengeti users. After logging in to the VM, update the password with the command "sudo /opt/serengeti/sbin/set-password -u"; the initial password then no longer appears on the welcome screen.
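For example, a first-login session on the Management Server (over SSH once you know its address; the IP here is a placeholder) looks roughly like this:

$ ssh serengeti@192.168.1.10

$ sudo /opt/serengeti/sbin/set-password -u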

4. Quick Start

4.1 Set up the Serengeti CLI

The Serengeti command-line shell runs on Windows or Linux. You need Java installed on the machine.

You can download VMware-Serengeti-cli-0.8.0.0-<build number>.zip from the Serengeti Management Server (http://your-serengeti-server/cli).

Unzip the downloaded package to a directory, go to the "cli" subdirectory, and run the Serengeti CLI by entering "java -jar serengeti*.jar".
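Put together, a typical setup session looks like the following; the build number in the file name and the target directory are placeholders:

$ unzip VMware-Serengeti-cli-0.8.0.0-12345.zip -d serengeti-cli

$ cd serengeti-cli/cli

$ java -jar serengeti*.jar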

Please refer to the troubleshooting document if you have any issues.

4.2 Deploy a Hadoop Cluster

You can use the Serengeti CLI to perform actions such as creating and customizing Hadoop clusters. You can access the Serengeti CLI in two ways: from the Serengeti Management Server virtual machine, or by installing the CLI on any machine and using it there.

1. Enter the Serengeti shell.

>java -jar serengeti*.jar

2. Run "connect” command to connect to the Serengeti server.

serengeti>connect --host xx.xx.xx.xx:8080 --username xxx --password xxx

A user named “serengeti” with password “password” is created by default.

3. Run "cluster create” command to deploy a Hadoop cluster on vSphere.

serengeti>cluster create --name myHadoop

In this example, "myHadoop" is the name of the Hadoop cluster to deploy. The Serengeti CLI continually reports the progress of the deployment.

Only alphabetic letters (a-z, A-Z), numbers (0-9), and underscores ("_") can be used in a cluster name.

This command deploys a Hadoop cluster with one master node virtual machine, three worker node virtual machines, and one client node virtual machine. The master node virtual machine runs the NameNode and JobTracker services. The worker node virtual machines run the DataNode and TaskTracker services. The client node virtual machine contains a Hadoop client environment, including the Hadoop client shell, Pig, and Hive.

After the deployment is complete, you can view the IP addresses of the Hadoop node virtual machines.

Hint

Use the tab key for auto-completion and to get help for commands and parameters.

By default, Serengeti might use any of the added resources to deploy a Hadoop cluster. To limit the scope of resources for the cluster, you can specify resource pools, datastores, or a network in the "cluster create" command:

serengeti>cluster create --name myHadoop --rpNames myRP --dsNames myDS --networkName myNW

In this example, "myRP" is the resource pool in which the Hadoop cluster is deployed, "myDS" is the datastore on which the virtual machine images are stored, and "myNW" is the network the virtual machines will use.

Once you have a Hadoop cluster deployed, you can execute Hadoop commands directly in the CLI. This section describes how to copy files from the local file system to HDFS and then run a MapReduce job.

1. Start the Serengeti CLI and connect to the Serengeti Management Server as described in section 4.1.

2. Run the "cluster list" command to show all the available clusters.

serengeti>cluster list

3. Run the "cluster target --name" command to connect to the cluster you want to move data in or out of. The "--name" value is the name of the cluster you want to connect to.

serengeti>cluster target --name cluster1

4. Run the "fs put" command to upload data to HDFS.

serengeti>fs put --from /etc/inittab --to /tmp/input/inittab

5. Run the "fs get" command to download data from HDFS.

serengeti>fs get --from /tmp/input/inittab --to /tmp/local-inittab

6. Run the "mr jar" command to run a MapReduce job.

serengeti>mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.WordCount --args "/tmp/input /tmp/output"

7. Run the "fs cat" command to show the output of the MapReduce job.

serengeti>fs cat /tmp/output/part-r-00000

8. Run the "fs get" command to download the output of the MapReduce job.

serengeti>fs get --from /tmp/output/part-r-00000 --to /tmp/wordcount

Hint

You can use the "resourcepool list", "datastore list", and "network list" commands to see what resources are in Serengeti.


Another way to use Hadoop is through the client VM. By default, Serengeti deploys a client VM with the Hadoop client, Pig, and Hive installed, and with the OS configured and ready to use Hadoop. You can see the IP address of the client VM after a cluster is deployed, or use the "cluster list" command to see it. Follow these steps to verify that the Hadoop cluster is working properly.

1. Use ssh to log in to the client VM.

Use "joe" as the user name; the password is "password".

2. Create your own home directory.

$ hadoop fs -mkdir /user/joe

3. Or run a sample Hadoop MapReduce job.

$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 10000000

Feel free to submit other MapReduce, Pig, or Hive jobs as well.
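For instance, a quick smoke test of the Pig and Hive clients might look like the following; these one-liners are only illustrations, and any small script or query works:

$ pig -e "fs -ls /"

$ hive -e "show tables;"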

4.3 Deploy an HBase Cluster

Serengeti also supports deploying an HBase cluster on HDFS. The easiest way to deploy an HBase cluster is to run the following command:

serengeti>cluster create --name myHBase --type hbase

In this example, "myHBase" is the name of the HBase cluster you deploy, and "--type hbase" indicates that you want to deploy an HBase cluster based on the default template Serengeti provides. This command deploys one master node virtual machine that runs the NameNode and HBase Master daemons, three zookeeper nodes running the ZooKeeper daemon, three data nodes running the Hadoop DataNode and HBase RegionServer daemons, and one client node from which you can launch Hadoop or HBase jobs.

When the deployment finishes, you can access HBase in several ways:

1. Log in to the client VM and run "hbase shell" commands.

2. Launch an HBase job, for example "hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 3".

The default HBase cluster does not contain the Hadoop JobTracker or Hadoop TaskTracker daemons, so you need to deploy a customized cluster if you want to run an HBase MapReduce job.

3. Access HBase through the RESTful web service or the Thrift gateway. The HBase REST and Thrift services are configured on the HBase client node; the REST service listens on port 8080 and the Thrift service listens on port 9090. A sketch of a REST call follows this list.
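For example, a cluster-status query against the REST service might look like the following; the IP address is a placeholder, and /status/cluster is the standard HBase REST endpoint rather than anything Serengeti-specific:

$ curl http://192.168.1.50:8080/status/cluster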

5. Using Serengeti

5.1 Manage Serengeti Users

Spring Security in-memory authentication is used for Serengeti authentication and user management. You can modify the /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file to manage Serengeti users, and then restart the Tomcat service with the command "sudo service tomcat restart".

5.1.1 Add/Delete a User in Serengeti

Add or delete users in the user-service element of the /opt/serengeti/tomcat6/webapps/serengeti/WEB-INF/spring-security-context.xml file. The following sample adds one user to the user-service.

<authentication-manager alias="authenticationManager">
  <authentication-provider>
    <user-service>
      <user name="serengeti" password="password" authorities="ROLE_ADMIN"/>
      <user name="joe" password="password" authorities="ROLE_ADMIN"/>
    </user-service>
  </authentication-provider>
</authentication-manager>

The authorities value is meant to define the user's role in Serengeti, but it is not used in M2, so any value is acceptable here.

5.1.2 Modify User Password

Modify the password value in the user-service element of the same file. The following is a sample.

<authentication-manager alias="authenticationManager">
  <authentication-provider>
    <user-service>
      <user name="serengeti" password="password" authorities="ROLE_ADMIN"/>
      <user name="joe" password="welcome1" authorities="ROLE_ADMIN"/>
    </user-service>
  </authentication-provider>
</authentication-manager>
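As with adding or deleting users, restart the Tomcat service after editing the file so the new password takes effect:

$ sudo service tomcat restart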

5.2 Manage Resources in Serengeti

When deploying the Serengeti OVA, the VI admin might allow you to use the same resources that the Serengeti virtual appliance is using. You can also add more resources to Serengeti for your Hadoop clusters, list the resources in Serengeti, and delete them when they are no longer needed.

You must add a resource pool, datastore, and network before deploying a Hadoop cluster if the VI admin does not allow you to deploy Hadoop clusters in the same set of resources as the Serengeti server.

5.2.1 Add a Datastore

You can use the "datastore add" command to add a vSphere datastore to Serengeti.

serengeti>datastore add --name myLocalDS --spec local* --type LOCAL

In this example, "myLocalDS" is the name you use in Serengeti to refer to the datastore, for example when you create a Hadoop cluster.

"local*" is a wildcard specifying a set of datastores. All datastores whose names start with "local" are added and managed as a whole.

“LOCAL” specifies that the datastores are local storage.

In this version, Serengeti does not check whether the datastore really exists. If you use a nonexistent datastore, cluster creation will fail.

5.2.2 Add a Network

You can use the "network add" command to add a network to Serengeti. A network is a port group plus a way to obtain IP addresses on that port group.

serengeti>network add --name myNW --portGroup 10GPG --dhcp

In this example, "myNW" is the name you use in Serengeti to refer to the network, for example when you create a Hadoop cluster.

“10GPG” is the name of the port group created by VI Admin in vSphere.

Virtual machines using this network will use DHCP to obtain IP addresses.

You can also add networks using a static IP.

serengeti>network add --name myNW --portGroup 10GPG --ip 192.168.1.2-100 --dns 10.111.90.2 --gateway 192.168.1.1 --mask 255.255.255.0

In this example, “192.168.1.2-100” is the IP address range Hadoop nodes can use.

“10.111.90.2” is the DNS server IP.

“192.168.1.1” is the gateway.

“255.255.255.0” is the subnet mask.

In this version, Serengeti does not check whether the added network is correct. If you use a wrong network, cluster creation will fail.

5.2.3 Add a Resource Pool

You can use the "resourcepool add" command to add a vSphere resource pool to Serengeti.

serengeti>resourcepool add --name myRP --vccluster cluster1 --vcrp rp1

In this example, "myRP" is the name you use in Serengeti to refer to the resource pool, for example when you create a Hadoop cluster.

"cluster1" is the vSphere cluster name and "rp1" is the vSphere resource pool name.

In this version, Serengeti does not check whether the resource pool really exists. If you use a nonexistent resource pool, cluster creation will fail.

vSphere nested resource pools are not supported in the current version. Each resource pool must be located directly under a cluster.

5.2.4 View Datastores

In the Serengeti shell, you can list datastores added to Serengeti.

serengeti>datastore list

You can see details of datastores.

serengeti> datastore list --detail

You can specify which datastore to list.

serengeti> datastore list --name myDS --detail

5.2.5 View Networks

In the Serengeti shell, you can list networks added to Serengeti.

serengeti>network list

You can see details of networks.

serengeti> network list --detail

You can specify which network to list.

serengeti> network list --name myNW --detail

5.2.6 View Resource Pools

In the Serengeti shell, you can list resource pools added to Serengeti.

serengeti>resourcepool list

You can see details of resource pools.

serengeti>resourcepool list --detail


You can specify which resource pool to list.

serengeti>resourcepool list --name myRP --detail

5.2.7 Remove a Datastore

You can use the “datastore delete” command to remove a datastore from Serengeti.

serengeti>datastore delete --name myDS

In this example, “myDS” is the name you specified when you added the datastore.

You cannot remove a datastore from Serengeti if it is referenced by a Hadoop cluster.

5.2.8 Remove a Network

You can use the “network delete” command to remove a network from Serengeti.

serengeti>network delete --name myNW

In this example, “myNW” is the name you specified when you added the network.

You cannot remove a network from Serengeti if it is referenced by a Hadoop cluster.

You can use the "network list" command to see which cluster is referencing the network.

5.2.9 Remove a Resource Pool

You can use the "resourcepool delete" command to remove a resource pool from Serengeti.

serengeti>resourcepool delete --name myRP

In this example, “myRP” is the name you specified when you added the resource pool.

You cannot remove a resource pool from Serengeti if the resource pool is referenced by a

Hadoop cluster.

5.3 Manage Distros

5.3.1 Supported Distros

The Serengeti Management Server includes Apache Hadoop 1.0.1, but you can use your preferred Hadoop distro as well. Greenplum HD 1, CDH3, CDH4 (YARN is not supported at this time), HDP1, and MapR M5 are also supported.

Serengeti supports deploying Hadoop clusters as well as Pig and Hive instances.

5.3.2 Add a Distro to Serengeti

Serengeti uses a tar ball or a yum repository to deploy Hadoop clusters for the different Hadoop distributions.

5.3.2.1 Using tar ball to deploy Hadoop cluster

Serengeti uses a tar ball to deploy the following Hadoop distros:

Apache Hadoop 1.0.x

Greenplum HD 1

CDH3

HDP1

1. Download the three packages (Hadoop, Pig, and Hive) in tar ball format from the distro vendor's site.

2. Upload them to the Serengeti Management Server virtual machine.

3. Put the packages in /opt/serengeti/www/distros/. The hierarchy should be DISTRO_NAME/VERSION_NUMBER/TARBALLS. For example, place the Apache Hadoop distro as follows:

apache/
  1.0.1/
    hadoop-1.0.1.tar.gz
    hive-0.8.1.tar.gz
    pig-0.9.2.tar.gz

4. Edit /opt/serengeti/www/distros/manifest in the Serengeti Management Server virtual machine to add the mapping between Hadoop roles and the tar ball packages of the distro. For example, add the following JSON text to the manifest file:

{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_namenode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
},

In this example, the CDH tar balls are placed in the directory /opt/serengeti/www/distros/cdh/3u3. If a distro supports HVE, add "hveSupported" : "true", after the version line in the example above.

5. Restart the Tomcat server on the Serengeti Management Server so that it reads the new manifest file.

$ sudo service tomcat restart

If the commands are successful, the distro you added appears when you issue the "distro list" command in the Serengeti shell. Otherwise, make sure the JSON text in the manifest is correct.

5.3.2.2 Using yum repository to deploy Hadoop cluster

Serengeti uses a yum repository to deploy the following Hadoop distros:

CDH4

MapR M5

1. Open the sample manifest file /opt/serengeti/www/distros/manifest.sample in the Serengeti Management Server virtual machine. It contains the following distro configuration for MapR and CDH4:

{
  "name" : "mapr",
  "vendor" : "MAPR",
  "version" : "2.1.1",
  "packages" : [
    {
      "roles" : ["mapr_zookeeper", "mapr_cldb", "mapr_jobtracker", "mapr_tasktracker", "mapr_fileserver", "mapr_nfs", "mapr_webserver", "mapr_metrics", "mapr_client", "mapr_pig", "mapr_hive", "mapr_hive_server", "mapr_mysql_server"],
      "package_repos" : ["http://<ip_of_serengeti_server>/mapr/2/mapr-m5.repo"]
    }
  ]
},
{
  "name" : "cdh4",
  "vendor" : "CDH",
  "version" : "4.1.2",
  "packages" : [
    {
      "roles" : ["hadoop_namenode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_journalnode", "hadoop_client", "hive", "hive_server", "pig", "hbase_master", "hbase_regionserver", "hbase_client", "zookeeper"],
      "package_repos" : ["http://<ip_of_serengeti_server>/cdh/4/cloudera-cdh4.repo"]
    }
  ]
}

The two yum repo files (mapr-m5.repo and cloudera-cdh4.repo) point to the official yum repositories of MapR and CDH4 on the Internet. You can copy the sample file /opt/serengeti/www/distros/manifest.sample to /opt/serengeti/www/distros/manifest, as shown below. When you create a MapR or CDH4 cluster, the Hadoop nodes then download rpm packages from the official MapR/CDH4 yum repository on the Internet.

If the VMs created by the Serengeti Management Server do not have access to the Internet, or the bandwidth to the Internet is low, we strongly suggest creating a local yum repository for MapR and CDH4. See Appendix A: Create Local Yum Repository for MapR and Appendix B: Create Local Yum Repository for CDH4.
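The copy itself is a single command on the Serengeti Management Server:

$ cp /opt/serengeti/www/distros/manifest.sample /opt/serengeti/www/distros/manifest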

2. Configure the local yum repository URL in the manifest file.

Once the local yum repository for MapR/CDH4 is created, open /opt/serengeti/www/distros/manifest and add the distro configuration (use the sample from the previous step and change the "package_repos" attribute to the URL of the local yum repository file).

3. Restart the Tomcat server on the Serengeti Management Server so that it reads the new manifest file.

$ sudo service tomcat restart


If the commands are successful, the distro you added appears when you issue the "distro list" command in the Serengeti shell. Otherwise, make sure the JSON text in the manifest is correct.

5.3.3 List Distros

You can use the "distro list" command to see available distros.

serengeti> distro list

You can see the packages in each distro and make sure it includes the services you want to deploy.

5.3.4 Using a Distro

You can choose which distro to use when deploying a cluster.

serengeti>cluster create --name myHadoop --distro cdh

5.4 Hadoop Clusters

5.4.1 Deploy Hadoop Clusters

5.4.1.1 Deploy a Customized Hadoop Cluster

You can customize the number of nodes, the size of the virtual machines, and so on when you create a cluster.

On the Serengeti Management Server you can find sample specs in /opt/serengeti/samples/. If you are using the Serengeti CLI from your desktop, you can find the sample specs in the client folder.

1. Edit a cluster spec file.

For example:

{ "nodeGroups" : [ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "instanceType": "MEDIUM" }, { "name": "worker", "roles": [ "hadoop_datanode", "hadoop_tasktracker" ], "instanceNum": 5, "instanceType": "SMALL" }, { "name": "client", "roles": [

Page 22: Serengeti User Guide_0.8

Serengeti User’s Guide

22

"hadoop_client", "hive", "hive_server", "pig" ], "instanceNum": 1, "instanceType": "SMALL" } ] }

In this example, you get one master virtual machine of MEDIUM size, five worker virtual machines of SMALL size, and one client virtual machine of SMALL size. You can also specify the number of CPUs, RAM, disk size, and so on for each node group.

2. Specify the spec file when creating the cluster. You must use the full path to the file.

serengeti>cluster create --name myHadoop --specFile /home/serengeti/mySpec.txt

CAUTION

Changing the roles of node groups might make the deployed Hadoop cluster unworkable.

Deploy a CDH4 Hadoop Cluster

You can create a default CDH4 Hadoop cluster by executing the following command in the Serengeti CLI:

serengeti>cluster create --name mycdh --distro cdh4

You can also create a customized CDH4 Hadoop cluster with a cluster spec file:

serengeti>cluster create --name mycdh --distro cdh4 --specFile /opt/serengeti/samples/default_cdh4_ha_hadoop_cluster.json

/opt/serengeti/samples/default_cdh4_ha_and_federation_hadoop_cluster.json is a sample spec file for CDH4. You can make a copy of it and modify the parameters in the file before creating the cluster. In this sample, nameservice0 and nameservice1 are federated with each other, and the name nodes inside the nameservice0 node group (with instanceNum set to 2) are HDFS2 HA enabled. In Serengeti, the name node group names become the name service names of HDFS2.

5.4.1.1.1 Deploy a MapR Hadoop Cluster

You can create a default MapR M5 Hadoop cluster by executing the following command in the Serengeti CLI:

serengeti>cluster create --name mymapr --distro mapr

You can also create a customized MapR M5 Hadoop cluster with a cluster spec file:

serengeti>cluster create --name mymapr --distro mapr --specFile /opt/serengeti/samples/default_mapr_cluster.json

/opt/serengeti/samples/default_mapr_cluster.json is a sample spec file for MapR. You can make a copy of it and modify the parameters in the file before creating the cluster.

5.4.1.2 Separating Data and Compute Nodes

You can separate data and compute nodes in a cluster and apply finer-grained control of node placement among ESX hosts. For example, you can use Serengeti to deploy the following clusters:

1. A data and compute separated cluster, without any node placement constraints.

{
  "nodeGroups":[
    {
      "name": "master",
      "roles": [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "cpuNum": 2,
      "memCapacityMB": 7500
    },
    {
      "name": "data",
      "roles": [
        "hadoop_datanode"
      ],
      "instanceNum": 4,
      "cpuNum": 1,
      "memCapacityMB": 3748,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      }
    },
    {
      "name": "compute",
      "roles": [
        "hadoop_tasktracker"
      ],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 7500,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 20
      }
    },
    {
      "name": "client",
      "roles": [
        "hadoop_client",
        "hive",
        "pig"
      ],
      "instanceNum": 1,
      "cpuNum": 1,
      "storage": {
        "type": "LOCAL",
        "sizeGB": 50
      }
    }
  ],
  "configuration": {
  }
}

In this example, four data nodes and eight compute nodes are created, each in its own virtual machine. By default, Serengeti uses a round-robin algorithm to distribute the VMs evenly across ESX hosts.

2. A data and compute separated cluster, with an instancePerHost constraint.

{ "nodeGroups":[ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 7500, }, { "name": "data", "roles": [ "hadoop_datanode" ], "instanceNum": 4, "cpuNum": 1, "memCapacityMB": 3748, "storage": { "type": "LOCAL", "sizeGB": 50 }, "placementPolicies": { "instancePerHost": 1 } }, { "name": "compute", "roles": [ "hadoop_tasktracker" ], "instanceNum": 8, "cpuNum": 2, "memCapacityMB": 7500, "storage": { "type": "LOCAL", "sizeGB": 20 }, "placementPolicies": { "instancePerHost": 2 } }, { "name": "client", "roles": [ "hadoop_client", "hive",

Page 25: Serengeti User Guide_0.8

Serengeti User’s Guide

25

"pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 50 } } ], "configuration": { } }

In this example, the data and compute node groups each have a "placementPolicies" constraint. After successful provisioning, four data nodes and eight compute nodes are created, each in its own VM. With the "instancePerHost=1" constraint, the four data nodes are placed on four ESX hosts. The eight compute nodes are also put onto four ESX hosts, two nodes on each. Note that it is not guaranteed that two compute nodes stay collocated with a data node on each of the four ESX hosts. To ensure that this is the case, create a VM-VM affinity rule between each host's compute nodes and data node, or disable DRS on the compute nodes.

3. A data and compute separated cluster, with instancePerHost and groupAssociations constraints for the compute node group, and a groupRacks constraint for the data node group.

{ "nodeGroups":[ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 7500, }, { "name": "data", "roles": [ "hadoop_datanode" ], "instanceNum": 4, "cpuNum": 1, "memCapacityMB": 3748, "storage": { "type": "LOCAL", "sizeGB": 50 }, "placementPolicies": { "instancePerHost": 1, "groupRacks": { "type": "ROUNDROBIN", "racks": ["rack1", "rack2", "rack3"] },

Page 26: Serengeti User Guide_0.8

Serengeti User’s Guide

26

} }, { "name": "compute", "roles": [ "hadoop_tasktracker" ], "instanceNum": 8, "cpuNum": 2, "memCapacityMB": 7500, "storage": { "type": "LOCAL", "sizeGB": 20 }, "placementPolicies": { "instancePerHost": 2, "groupAssociations": [ { "reference": "data", "type": "STRICT" } } }, { "name": "client", "roles": [ "hadoop_client", "hive", "pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 50 } } ], "configuration": { } }

In this example, after successful provisioning, the four data nodes and eight compute nodes are placed on exactly the same four ESX hosts, each ESX host having one data node and two compute nodes, and these four ESX hosts are selected fairly from "rack1", "rack2", and "rack3". As the definition of the "compute" node group says, the placement of compute nodes must strictly follow the placement of the "data" node group: "compute" nodes are only placed on ESX hosts that have "data" nodes.

5.4.1.3 Deploy a Compute Only Cluster

You can create a compute-only cluster that refers to an existing HDFS cluster with the following steps:

1. Edit a cluster spec file and save it, for example, as /home/serengeti/coSpec.txt.


For example:

{ "externalHDFS": "hdfs://hostname-of-namenode:8020", "nodeGroups": [ { "name": "master", "roles": [ "hadoop_jobtracker" ], "instanceNum": 1, "cpuNum": 2, "memCapacityMB": 7500, }, { "name": "worker", "roles": [ "hadoop_tasktracker", ], "instanceNum": 4, "cpuNum": 2, "memCapacityMB": 7500, "storage": { "type": "LOCAL", "sizeGB": 20 }, }, { "name": "client", "roles": [ "hadoop_client", "hive", "pig" ], "instanceNum": 1, "cpuNum": 1, "storage": { "type": "LOCAL", "sizeGB": 50 }, }

], “configuration” : { }

}

In this example, the externalHDFS field points to an existing HDFS. You must also specify node groups with the hadoop_jobtracker and hadoop_tasktracker roles. Note that the externalHDFS field conflicts with node groups that have the hadoop_namenode and hadoop_datanode roles. The sample cluster spec can also be found in samples/compute_only_cluster.json in the Serengeti CLI directory.

2. Specify the spec file when creating the cluster. You must use the full path to the file.

serengeti>cluster create --name computeOnlyCluster --specFile /home/serengeti/coSpec.txt


5.4.1.4 Control Hadoop Virtual Machine Placement

Serengeti provides a way for users to control how Hadoop virtual machines are placed. Generally, this is done by specifying the "placementPolicies" field inside a node group, for example:

{ "nodeGroups":[ … { "name": "group_name", … "placementPolicies": { "instancePerHost": 2, "groupRacks": { "type": "ROUNDROBIN", "racks": ["rack1", "rack2", "rack3"] }, "groupAssociations": [{ "reference": "another_group_name", "type": "STRICT" // or "WEAK" }] } }, … }

As this example shows, the "placementPolicies" field contains three optional items: "instancePerHost", "groupRacks", and "groupAssociations".

As the name implies, "instancePerHost" indicates how many VM nodes or instances should be placed on each physical ESX host; this constraint is aimed at balancing the workload.

The "groupRacks" constraint controls how VM nodes are placed across the racks you specify. In this example, the rack type is "ROUNDROBIN", and the "racks" item indicates which racks in the topology map (refer to chapter 5.8 for how to configure the topology map and make the Hadoop cluster rack-aware) are used for this placement policy. If the "racks" item is omitted, Serengeti uses all racks in the topology map. "ROUNDROBIN" means the candidates are selected fairly when determining which rack each node goes to.

Note that if you specify both "instancePerHost" and "groupRacks" in a placement policy, you must make sure the number of available hosts is large enough. You can get the rack-to-hosts information with the "topology list" command.
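For example, to review the mapping that was uploaded (see section 5.8 for how to upload it):

serengeti>topology list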

"groupAssociations" means the node group is associated with target node groups; each association has "reference" and "type" fields. "reference" is the name of a target node group, and "type" can be "STRICT" or "WEAK". "STRICT" means the node group must be placed on the same set, or a subset, of the ESX hosts used by the target group, while "WEAK" means Serengeti tries to do so but does not guarantee it.
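For instance, a minimal sketch of a weak association (only the "type" value differs from the STRICT examples above; the reference name is illustrative):

"placementPolicies": {
  "groupAssociations": [{
    "reference": "data",
    "type": "WEAK"    // best effort only: co-locate with the "data" group's hosts when possible
  }]
}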

A typical scenario for applying "groupRacks" and "groupAssociations" is deploying a Hadoop cluster with data and compute nodes separated. In this case, you usually want to put compute nodes and data nodes on the same set of physical hosts for better performance, especially throughput. Refer to 5.3.3 for practical examples of deploying a Hadoop cluster with placement policies.


5.4.1.5 Use NFS as Compute Nodes’ Local Directory

Serengeti allows you to use NFS storage for compute nodes. This has several benefits: 1) it increases the capacity available to each compute node; 2) storage resources are returned when compute nodes are stopped. Here is an example of deploying a cluster whose compute nodes have only NFS storage:

{ "nodeGroups":[ { "name": "master", "roles": [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum": 1, "instanceType": "LARGE", "cpuNum": 2, "memCapacityMB": 7500, "haFlag": "on" }, { "name": "data", "roles": [ "hadoop_datanode" ], "instanceNum": 4, "cpuNum": 1, "memCapacityMB": 3748, "storage": { "type": "LOCAL", "sizeGB": 50 }, "placementPolicies": { "instancePerHost": 1 } }, { "name": "compute", "roles": [ "hadoop_tasktracker" ], "instanceNum": 8, "cpuNum": 1, "memCapacityMB": 3748, "storage": { "type": "TEMPFS" }, "placementPolicies": { "instancePerHost": 2, "groupAssociations": [ { "reference": "data",

Page 30: Serengeti User Guide_0.8

Serengeti User’s Guide

30

"type": "STRICT" } ] } }, { "name": "client", "roles": [ "hadoop_client", "hive", "hive_server", "pig" ], "instanceNum": 1, "cpuNum": 1, "memCapacityMB": 3748, "storage": { "type": "LOCAL", "sizeGB": 50 } } ] } In this example, the cluster is D/C separated. Compute nodes are strictly associated with data nodes. By specifying the “Storage” field of compute node group to “type: TEMPFS”, Serengeti will install NFS server on associated data nodes, install NFS client on compute nodes, and mount data nodes‟ disks on compute nodes. Serengeti will not assign disks to compute nodes, and all temp files generated during running MapReduce jobs are saved on the NFS disks.

5.4.2 Manage Hadoop Clusters

5.4.2.1 Modify Hadoop Configuration

Serengeti provides a simple and easy way to tune the Hadoop cluster configuration, including attributes in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, log4j.properties, fair-scheduler.xml, capacity-scheduler.xml, and so on.

In addition to modifying the Hadoop configuration of an existing Hadoop cluster created by Serengeti, you can also define the Hadoop configuration in the cluster spec file when creating a new cluster.

5.4.2.1.1 Cluster Level Configuration

You can modify the Hadoop configuration of an existing cluster by following the steps below:

1. Export the cluster spec file of the cluster:

serengeti>cluster export --spec --name myHadoop --output /home/serengeti/myHadoop.json

2. Modify the "configuration" section at the bottom of /home/serengeti/myHadoop.json with the following content, and add your customized Hadoop configuration in this "configuration" section:


"configuration": { "hadoop": { "core-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html // note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a sample: // "io.file.buffer.size": "4096" }, "hdfs-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html }, "mapred-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html }, "hadoop-env.sh": { // "HADOOP_HEAPSIZE": "", // "HADOOP_NAMENODE_OPTS": "", // "HADOOP_DATANODE_OPTS": "", // "HADOOP_SECONDARYNAMENODE_OPTS": "", // "HADOOP_JOBTRACKER_OPTS": "", // "HADOOP_TASKTRACKER_OPTS": "", // "HADOOP_CLASSPATH": "", // "JAVA_HOME": "", // "PATH": "", }, "log4j.properties": { // "hadoop.root.logger": "DEBUG, DRFA ", // "hadoop.security.logger": "DEBUG, DRFA ", }, "fair-scheduler.xml": { // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html // "text": "the full content of fair-scheduler.xml in one line" }, "capacity-scheduler.xml": { // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html } } }

Serengeti provides a tool to convert the Hadoop configuration files of your existing cluster into the above JSON format, so you do not need to write this JSON file manually. See the section "Tool for converting Hadoop Configuration".

Some Hadoop distributions have their own Java jar files that are not placed in $HADOOP_HOME/lib, so by default the Hadoop daemons cannot find them. To use these jars, you need to add a cluster configuration that includes the full path of the jar files in $HADOOP_CLASSPATH.

Here is a sample cluster configuration that configures a Cloudera CDH3 Hadoop cluster with the Fair Scheduler (whose jar files are located in /usr/lib/hadoop/contrib/fairscheduler/):


"configuration": { "hadoop": { "hadoop-env.sh": { "HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH" }, "mapred-site.xml": { "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler" … }, "fair-scheduler.xml": { … } } }

3. Run the "cluster config" command to apply the new Hadoop configuration:

serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json

4. If you want to reset an existing configuration attribute to its Hadoop default value, simply remove it or comment it out with "//" in the "configuration" section of the cluster spec file, and run the "cluster config" command again.
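For example, assuming you had previously set io.file.buffer.size (the attribute is just an illustration), commenting it out restores the Hadoop default on the next "cluster config" run:

"configuration": {
  "hadoop": {
    "core-site.xml": {
      // "io.file.buffer.size": "8192"   // commented out, so the Hadoop default is restored
    }
  }
}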

5.4.2.1.2 Group Level Configuration

You can also modify the Hadoop configuration within a node group of an existing cluster by following the steps below:

1. Export the cluster spec file of the cluster:

serengeti>cluster export --spec --name myHadoop --output /home/serengeti/myHadoop.json

2. Modify the "configuration" section within the node group in /home/serengeti/myHadoop.json, using the same content as in "Cluster Level Configuration", and add the customized Hadoop configuration for this node group.

Hadoop configuration set at the group level overrides configuration with the same name set at the cluster level, as sketched below.
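As a minimal sketch (the node group name and attribute are illustrative only), a group-level "configuration" section sits inside the node group object, alongside the cluster-level one:

{
  "nodeGroups": [
    {
      "name": "worker",
      …
      "configuration": {
        "hadoop": {
          "mapred-site.xml": {
            // applies only to nodes in this group and overrides any cluster-level value
            // "mapred.tasktracker.map.tasks.maximum": "4"
          }
        }
      }
    }
  ],
  "configuration": {
    // cluster-level configuration
  }
}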

3. Run the "cluster config" command to apply the new Hadoop configuration:

serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json

5.4.2.1.3 Black List and White List in Hadoop Configuration

Almost all of the configuration attributes provided by Apache Hadoop are configurable in Serengeti; these attributes belong to the White List. However, a few attributes are not configurable in Serengeti; these belong to the Black List.

If you set an attribute in the cluster spec file that is in the Black List, or not in the White List, and then run the "cluster config" command, Serengeti detects these attributes and gives a warning; you must answer "yes" to continue or "no" to abort.

Usually you do not need to configure "fs.default.name" or "dfs.http.address" if there is a NameNode or JobTracker in your cluster, because Serengeti configures these two attributes automatically. For example, when you create a default cluster in Serengeti, it contains a NameNode and a JobTracker, and you do not need to explicitly configure "fs.default.name" or "dfs.http.address".

However, you can set "fs.default.name" to the URI of another NameNode if you really want to.
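For example, a hypothetical override pointing the cluster at another NameNode (the host name is a placeholder):

"configuration": {
  "hadoop": {
    "core-site.xml": {
      // URI of a different NameNode; only set this if you really need to
      "fs.default.name": "hdfs://other-namenode.example.com:8020"
    }
  }
}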


5.4.2.1.3.1 White List

core-site.xml
  - all attributes listed at http://hadoop.apache.org/common/docs/stable/core-default.html, excluding attributes defined in the Black List

hdfs-site.xml
  - all attributes listed at http://hadoop.apache.org/common/docs/stable/hdfs-default.html, excluding attributes defined in the Black List

mapred-site.xml
  - all attributes listed at http://hadoop.apache.org/common/docs/stable/mapred-default.html, excluding attributes defined in the Black List

hadoop-env.sh
  - JAVA_HOME
  - PATH
  - HADOOP_CLASSPATH
  - HADOOP_HEAPSIZE
  - HADOOP_NAMENODE_OPTS
  - HADOOP_DATANODE_OPTS
  - HADOOP_SECONDARYNAMENODE_OPTS
  - HADOOP_JOBTRACKER_OPTS
  - HADOOP_TASKTRACKER_OPTS
  - HADOOP_LOG_DIR

log4j.properties
  - hadoop.root.logger
  - hadoop.security.logger
  - log4j.appender.DRFA.MaxBackupIndex
  - log4j.appender.RFA.MaxBackupIndex
  - log4j.appender.RFA.MaxFileSize

fair-scheduler.xml
  - text: all attributes described at http://hadoop.apache.org/docs/stable/fair_scheduler.html can be put inside the "text" field, excluding attributes defined in the Black List

capacity-scheduler.xml
  - all attributes described at http://hadoop.apache.org/docs/stable/capacity_scheduler.html, excluding attributes defined in the Black List

5.4.2.1.3.2 Black List

core-site.xml
  - net.topology.impl
  - net.topology.nodegroup.aware
  - dfs.block.replicator.classname

hdfs-site.xml
  - dfs.http.address
  - dfs.name.dir
  - dfs.data.dir
  - topology.script.file.name

mapred-site.xml
  - mapred.job.tracker
  - mapred.local.dir
  - mapred.task.cache.levels
  - mapred.jobtracker.jobSchedulable
  - mapred.jobtracker.nodegroup.awareness

hadoop-env.sh
  - HADOOP_HOME
  - HADOOP_COMMON_HOME
  - HADOOP_MAPRED_HOME
  - HADOOP_HDFS_HOME
  - HADOOP_CONF_DIR
  - HADOOP_PID_DIR

log4j.properties
  - None

fair-scheduler.xml
  - None

capacity-scheduler.xml
  - None

mapred-queue-acls.xml
  - None

5.4.2.1.4 Tool for converting Hadoop Configuration

If you have a lot of Hadoop configuration in core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, log4j.properties, fair-scheduler.xml, capacity-scheduler.xml, mapred-queue-acls.xml, and so on for your existing Hadoop cluster, you can use a tool provided by Serengeti to convert the Hadoop XML configuration files into the JSON format used by Serengeti.

1) Copy the directory $HADOOP_HOME/conf/ from your existing Hadoop cluster to the Serengeti Server.

2) Execute "convert-hadoop-conf.rb /path/to/hadoop_conf/" in a bash shell; it prints all the converted Hadoop configuration attributes in JSON format.


3) Open the cluster spec file and replace the Cluster Level Configuration or Group Level Configuration with the content printed in step 2.

4) Execute "cluster config --name … --specFile …" to apply the new configuration to the existing cluster, or execute "cluster create --name … --specFile …" to create a new cluster with your configuration.
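A typical session might look like the following (the paths are illustrative, not the tool's required locations):

# on the Serengeti Server, after copying the conf directory to /tmp/hadoop_conf
$ convert-hadoop-conf.rb /tmp/hadoop_conf/ > /tmp/converted_conf.json
# paste the printed JSON into the cluster spec file, then apply it:
serengeti>cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json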

5.4.2.2 Scale Out a Hadoop Cluster

You can scale out to more Hadoop worker nodes or client nodes after the Hadoop cluster is provisioned. In the following example, the number of instances in the "worker" node group of the "myHadoop" cluster is increased to 10.

serengeti>cluster resize --name myHadoop --nodeGroup worker --instanceNum 10

You cannot set a number smaller than the current instance number in this version of the Serengeti virtual appliance.

5.4.2.3 Scale TaskTracker Nodes Rapidly

You can change the number of active TaskTracker nodes rapidly in a running Hadoop cluster or node group. TaskTrackers are selected for enabling or disabling with the goal of balancing the number of enabled TaskTrackers per host in the specified Hadoop cluster or node group.

In this example, the number of active TaskTracker nodes in the "worker" node group of the "myHadoop" cluster is set to 8:

serengeti>cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8

If fewer than 8 TaskTracker nodes are running in the "worker" node group of the "myHadoop" cluster, additional TaskTracker nodes are enabled (re-commissioned and powered on), up to the number provisioned in the "worker" node group. If more than 8 TaskTrackers are running in the "worker" node group, the excess TaskTracker nodes are disabled (decommissioned and powered off). No action is performed if the number of active TaskTrackers already equals 8.

If the node group is not specified, TaskTracker nodes are enabled or disabled so that the total number of active TaskTrackers across all the compute node groups in the "myHadoop" cluster is 8:

serengeti>cluster limit --name myHadoop --activeComputeNodeNum 8

To enable all the TaskTrackers in the "myHadoop" cluster, use the "cluster unlimit" command:

serengeti>cluster unlimit --name myHadoop

This command is especially useful for fixing a potential mismatch between the number of active TaskTrackers as seen by Hadoop and the number of powered-on TaskTracker nodes as seen by vCenter.

To enable all TaskTrackers within only one compute node group, specify the name of the node group with the "--nodeGroup" option, similar to the "cluster limit" command.
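For example (using the "worker" node group from the examples above):

serengeti>cluster unlimit --name myHadoop --nodeGroup worker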

5.4.2.4 Start/Stop Hadoop Cluster

In the Serengeti shell, you can start (or stop) a whole Hadoop cluster:

serengeti>cluster start --name mycluster
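Similarly, you can stop it (see section 7.2.9 for the full syntax):

serengeti>cluster stop --name mycluster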

5.4.2.5 View Hadoop Clusters Deployed by Serengeti

In the Serengeti shell, you can list Hadoop clusters deployed by Serengeti.

serengeti>cluster list


You can specify which cluster to list.

serengeti>cluster list --name mycluster

You can see details of Hadoop clusters.

serengeti>cluster list --detail

5.4.2.6 Login to Hadoop Nodes

You can log in to Hadoop nodes, including master, worker, and client nodes, with password-less SSH from the Serengeti Management Server, using SSH client tools such as ssh, pdsh, ClusterSSH, and mussh, to troubleshoot or run your own management automation scripts.

The Serengeti Management Server is configured to SSH to Hadoop cluster nodes without a password. Other clients or machines must use a user name and password to SSH to the Hadoop cluster nodes.
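For example, from the Serengeti Management Server (the node IP below is a placeholder; you can look up node details with "cluster list --detail"):

$ ssh 192.168.1.101    # no password prompt: the management server's key is pre-authorized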

All deployed nodes are protected with random passwords. If you want to log in to a Hadoop node directly, first log in to the node from the vSphere client and change the password by following Section 3.2, step 11. Press "Ctrl + D" to get the login information with the original random password.

5.4.2.7 Delete a Hadoop Cluster

You can delete a Hadoop cluster that you no longer need.

serengeti>cluster delete --name myHadoop

In this example, “myHadoop” is the name of the Hadoop cluster you want to delete.

When a Hadoop cluster is deleted, all virtual machines in the cluster are destroyed.

You can delete a Hadoop cluster even while it is running.

5.4.3 Use Hadoop Clusters

5.4.3.1 Run Pig Scripts

You can run a Pig script in the Serengeti CLI. For example, if you have a Pig script at /tmp/data.pig:

serengeti> pig cfg

serengeti> pig script --location /tmp/data.pig

5.4.3.2 Run Hive Scripts

You can run a Hive script in the Serengeti CLI. For example, if you have a Hive script at /tmp/data.hive:

serengeti>hive cfg

serengeti>hive script --location /tmp/data.hive

5.4.3.3 Run HDFS command

You can run HDFS commands in the Serengeti CLI. For example, if you have a file at /home/serengeti/data and want to put it into the HDFS path /tmp:

serengeti> fs put --from /home/serengeti/data --to /tmp


5.4.3.4 Run MapReduce Job

You can run a MapReduce job in the Serengeti CLI. For example, to run the pi example from the jar file at /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar:

serengeti> mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.PiEstimator --args "10 10"

Make sure you have first targeted a cluster in the Serengeti CLI; see Chapter 7.2.10.
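For example, to target the cluster used above:

serengeti>cluster target --name myHadoop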

5.4.3.5 Using Data through JDBC

Through Hive JDBC, you can execute SQL from different programming languages, such as Java, Python, and PHP. The following is a JDBC client sample in Java.

1. SSH to the node that contains the hive_server role.

2. Create a Java file HiveJdbcClient.java containing the following sample code for connecting to the Hive Server:

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     **/
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/test_hive_server.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/test_hive_server.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }
        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}

3. Run the JDBC sample code.

a. Compile the client on the command line:

$ javac HiveJdbcClient.java

b. Alternatively, you can run the following bash script, which seeds the data file and builds your classpath before invoking the client:

#!/bin/bash
HADOOP_HOME=/usr/lib/hadoop
HIVE_HOME=/usr/lib/hive

echo -e '1\x01foo' > /tmp/test_hive_server.txt
echo -e '2\x01bar' >> /tmp/test_hive_server.txt

HADOOP_CORE=`ls $HADOOP_HOME/hadoop-core-*.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf

for jar_file_name in ${HIVE_HOME}/lib/*.jar
do
  CLASSPATH=$CLASSPATH:$jar_file_name
done

java -cp $CLASSPATH HiveJdbcClient

For more information about the Hive client, visit https://cwiki.apache.org/Hive/hiveclient.html.


5.4.3.6 Using Data through ODBC

You can use an out-of-the-box ODBC driver for Hadoop Hive, such as the MapR Hive ODBC Connector or the Apache Hadoop Hive ODBC Driver.

Take the MapR ODBC Connector as an example:

1. Install the MapR Hive ODBC Connector on your Windows 7 Professional or Windows 2008 R2 machine.
2. Create a Data Source Name (DSN) with the ODBC Connector's Data Source Administrator to connect to your remote Hive server.
3. Import rows of the HIVE_SYSTEM table in the Hive server into Excel by connecting to this DSN.

For more information about Hive ODBC, refer to https://cwiki.apache.org/Hive/hiveodbc.html. For more information about the MapR Hive ODBC Connector, refer to www.mapr.com/doc/display/MapR/Hive+ODBC+Connector.

5.5 HBase Clusters

5.5.1 Deploy HBase Clusters

You can customize an HBase cluster by specifying your own spec file. The following is an example:

{ "nodeGroups" : [ { "name" : "zookeeper", "roles" : [ "zookeeper" ], "instanceNum" : 3, "instanceType" : "SMALL", "storage" : { "type" : "shared", "sizeGB" : 20 }, "cpuNum" : 1, "memCapacityMB" : 3748, "haFlag" : "on", "configuration" : { } }, { "name" : "hadoopmaster", "roles" : [ "hadoop_namenode", "hadoop_jobtracker" ], "instanceNum" : 1, "instanceType" : "MEDIUM", "storage" : { "type" : "shared", "sizeGB" : 50 }, "cpuNum" : 2, "memCapacityMB" : 7500, "haFlag" : "on", "configuration" : {

Page 41: Serengeti User Guide_0.8

Serengeti User’s Guide

41

} }, { "name" : "hbasemaster", "roles" : [ "hbase_master" ], "instanceNum" : 1, "instanceType" : "MEDIUM", "storage" : { "type" : "shared", "sizeGB" : 50 }, "cpuNum" : 2, "memCapacityMB" : 7500, "haFlag" : "on", "configuration" : { } }, { "name" : "worker", "roles" : [ "hadoop_datanode", "hadoop_tasktracker", "hbase_regionserver" ], "instanceNum" : 3, "instanceType" : "SMALL", "storage" : { "type" : "local", "sizeGB" : 50 }, "cpuNum" : 1, "memCapacityMB" : 3748, "haFlag" : "off", "configuration" : { } }, { "name" : "client", "roles" : [ "hadoop_client", "hbase_client" ], "instanceNum" : 1, "instanceType" : "SMALL", "storage" : { "type" : "shared", "sizeGB" : 50 }, "cpuNum" : 1, "memCapacityMB" : 3748, "haFlag" : "off", "configuration" : { }

Page 42: Serengeti User Guide_0.8

Serengeti User’s Guide

42

} ], // we suggest running convert-hadoop-conf.rb to generate "configuration" section and paste the output here "configuration" : { "hadoop": { "core-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/stable/core-default.html // note: any value (int, float, boolean, string) must be enclosed in double quotes and here is a sample: // "io.file.buffer.size": "4096" }, "hdfs-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-default.html }, "mapred-site.xml": { // check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-default.html }, "hadoop-env.sh": { // "HADOOP_HEAPSIZE": "", // "HADOOP_NAMENODE_OPTS": "", // "HADOOP_DATANODE_OPTS": "", // "HADOOP_SECONDARYNAMENODE_OPTS": "", // "HADOOP_JOBTRACKER_OPTS": "", // "HADOOP_TASKTRACKER_OPTS": "", // "HADOOP_CLASSPATH": "", // "JAVA_HOME": "", // "PATH": "" }, "log4j.properties": { // "hadoop.root.logger": "DEBUG,DRFA", // "hadoop.security.logger": "DEBUG,DRFA" }, "fair-scheduler.xml": { // check for all settings at http://hadoop.apache.org/docs/stable/fair_scheduler.html // "text": "the full content of fair-scheduler.xml in one line" }, "capacity-scheduler.xml": { // check for all settings at http://hadoop.apache.org/docs/stable/capacity_scheduler.html }, "mapred-queue-acls.xml": { // check for all settings at http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons // "mapred.queue.queue-name.acl-submit-job": "", // "mapred.queue.queue-name.acl-administer-jobs", "" } }, "hbase": { "hbase-site.xml": { // check for all settings at http://hbase.apache.org/configuration.html#hbase.site }, "hbase-env.sh": { // "JAVA_HOME": "", // "PATH": "", // "HBASE_CLASSPATH": "", // "HBASE_HEAPSIZE": "",

Page 43: Serengeti User Guide_0.8

Serengeti User’s Guide

43

// "HBASE_OPTS": "", // "HBASE_USE_GC_LOGFILE": "", // "HBASE_JMX_BASE": "", // "HBASE_MASTER_OPTS": "", // "HBASE_REGIONSERVER_OPTS": "", // "HBASE_THRIFT_OPTS": "", // "HBASE_ZOOKEEPER_OPTS": "", // "HBASE_REGIONSERVERS": "", // "HBASE_SSH_OPTS": "", // "HBASE_NICENESS": "", // "HBASE_SLAVE_SLEEP": "" }, "log4j.properties": { // "hbase.root.logger": "DEBUG,DRFA" } }, "zookeeper": { "java.env": { // "JVMFLAGS": "-Xmx2g" }, "log4j.properties": { // "zookeeper.root.logger": "DEBUG,DRFA" } } } }

Compared to the template mentioned in section 4.4, this example includes the JobTracker and TaskTracker roles, which means you can launch HBase MapReduce jobs. It also separates the Hadoop NameNode and HBase Master roles. HBase Master instances are protected by HBase's internal HA function.

5.5.2 Manage HBase Clusters

An HBase cluster has a few more configurable files than a Hadoop cluster, including hbase-site.xml, hbase-env.sh, log4j.properties, and java.env for ZooKeeper nodes. Refer to the HBase official site to tune your HBase clusters.

Most operations and advanced specifications for Hadoop clusters also apply to HBase clusters, such as scaling out a node group, separating data and compute nodes, and controlling placement policy, with the following exceptions:

1. ZooKeeper nodes cannot be scaled out in this version.

2. You cannot deploy a compute-only cluster pointing to an HBase cluster to run HBase MapReduce jobs.

5.5.3 Use HBase Clusters

Serengeti supports most of the ways HBase provides to access the database, including:

1. Performing operations through the HBase shell (see the example after this list);

2. If the deployed HBase cluster has the Hadoop JobTracker and TaskTracker roles, developing an HBase MapReduce job and running it from the client node. Here is an example:

>hbase org.apache.hadoop.hbase.PerformanceEvaluation randomWrite 3


3. A RESTful web service runs on the client node, listening on port 8080:

>curl -I http://<client_node_ip>:8080/status/cluster

4. The Thrift gateway is also enabled, listening on port 9090.

5.6 Monitoring Cluster Deployed by Serengeti

Serengeti creates one VM folder for each deployed Serengeti Server. The folder name is SERENGETI-vApp-<vApp name>, where the vApp name is specified during Serengeti deployment.

For each cluster, two levels of folders are created under the Serengeti instance folder: the first level is the cluster name, and the second level is the node group name. Each node group folder contains all the nodes in that node group.

To browse the VMs and check VM status in the vCenter client, select "Inventory" > "VMs and Templates". The Serengeti folder is listed in the left panel, and you can check the VM nodes by following the folder structure.

If you have installed vCOps, you can also fetch VM-level metrics, including the cluster's health state, workload, resource allocation, hardware status, and so on. Refer to the vCOps manual for more details.

5.7 Make Hadoop Master Node HA/FT

You can leverage vSphere HA and FT to address Hadoop's single-point-of-failure (SPOF) problem.

1. Make sure HA is enabled for the vSphere cluster where the Hadoop cluster is deployed. Refer to the vSphere documentation for detailed setup steps as needed.

2. Make sure you provide shared storage for Hadoop to deploy on.

3. By default, the Hadoop master node is configured to be protected by vSphere HA.

After this, if the master node virtual machine becomes unreachable by vSphere, vSphere automatically starts a new instance on another available ESXi host to serve the Hadoop cluster.

There is a short downtime during this recovery. If you want to eliminate the downtime, you can use vSphere FT to protect the master node.

Serengeti supports configuring the FT feature for master nodes. In the cluster spec file, set "haFlag" to "ft" to enable FT protection:

...
"name": "master",
"cpuNum": 1,
"haFlag": "ft",
"storage": {
  "type": "SHARED"
}
...

With this cluster spec, the master node of the Hadoop cluster is protected by vSphere FT. When the master is not reachable, vSphere switches traffic to the standby virtual machine immediately, so there is no failover downtime.

Please refer to Apache Hadoop 1.0 High Availability Solution on VMware vSphere for more information.


5.8 Hadoop Topology Awareness

You can make a Hadoop cluster topology-aware by creating it with the --topology option of the CLI. Three types of topology awareness are supported: HVE, RACK_AS_RACK, and HOST_AS_RACK.

Here is an example of creating a cluster with the HVE topology:

serengeti>cluster create --name myHadoop --topology HVE --distro HVE-supported_Distro

HVE stands for Hadoop Virtualization Extensions [2]. HVE refines Hadoop's replica placement, task scheduling, and balancer policies so that Hadoop clusters implemented on virtualized infrastructure have full awareness of the topology on which they are running, which enhances the reliability and performance of these clusters. For more information about HVE, refer to https://issues.apache.org/jira/browse/HADOOP-8468.

RACK_AS_RACK stands for the standard topology in existing Hadoop 1.0.x, where only rack and host information are exposed to Hadoop.

HOST_AS_RACK is a simplified form of RACK_AS_RACK for when all the physical hosts used by Serengeti are on a single rack. In this case, each physical host is treated as a rack, to avoid the worst case where all HDFS data replicas land on a single physical host.

HVE is the recommended topology in Serengeti if the distro supports it. Otherwise, we recommend the RACK_AS_RACK topology in multi-rack environments. Use HOST_AS_RACK only when there is a single rack for Serengeti, or no rack information at all.

In addition, when you enable HVE or RACK_AS_RACK, you must upload the rack and physical host information to Serengeti with the CLI command below before you create a topology-aware cluster.

serengeti>topology upload --fileName name_of_rack_hosts_mapping_file

Here is a sample rack and physical hosts mapping file.

rack1: a.b.foo.com, a.c.foo.com
rack2: c.a.foo.com

In this sample, physical hosts a.b.foo.com and a.c.foo.com are in rack1, and c.a.foo.com is in rack2.

After a cluster is created with the selected topology option, you can view the allocated nodes on each rack with:

serengeti>cluster list --name cluster-name --detail

5.9 Start and Stop Serengeti Services

You can stop and start the Serengeti service to make a configuration change take effect or to recover from an abnormal situation.

You can run the following command in a Linux shell to stop the Serengeti service.

$ sudo serengeti-stop-services.sh

You can run the following command in a Linux shell to start the Serengeti service.

$ sudo serengeti-start-services.sh

[2] HVE is currently supported on Greenplum HD 1.2.


6. Cluster Specification Reference

A cluster specification is a JSON text file. Here is a longer example with line numbers; the same file without line numbers is attached as an appendix.

 1 {
 2   "nodeGroups" : [
 3     {
 4       "name": "master",
 5       "roles": [
 6         "hadoop_namenode",
 7         "hadoop_jobtracker"
 8       ],
 9       "instanceNum": 1,
10       "instanceType": "LARGE",
11       "cpuNum": 2,
12       "memCapacityMB": 4096,
13       "storage": {
14         "type": "SHARED",
15         "sizeGB": 20
16       },
17       "haFlag": "on",
18       "rpNames": [
19         "rp1"
20       ]
21     },
22     {
23       "name": "data",
24       "roles": [
25         "hadoop_datanode"
26       ],
27       "instanceNum": 3,
28       "instanceType": "MEDIUM",
29       "cpuNum": 2,
30       "memCapacityMB": 2048,
31       "storage": {
32         "type": "LOCAL",
33         "sizeGB": 50
34       },
35       "placementPolicies": {
36         "instancePerHost": 1,
37         "groupRacks": {
38           "type": "ROUNDROBIN",
39           "racks": ["rack1", "rack2", "rack3"]
40         }
41       }
42     },
43     {
44       "name": "compute",
45       "roles": [
46         "hadoop_tasktracker"
47       ],
48       "instanceNum": 6,
49       "instanceType": "SMALL",
50       "cpuNum": 2,
51       "memCapacityMB": 2048,
52       "storage": {
53         "type": "LOCAL",
54         "sizeGB": 10
55       },
56       "placementPolicies": {
57         "instancePerHost": 2,
58         "groupAssociations": [{
59           "reference": "data",
60           "type": "STRICT"
61         }]
62       }
63     },
64     {
65       "name": "client",
66       "roles": [
67         "hadoop_client",
68         "hive",
69         "hive_server",
70         "pig"
71       ],
72       "instanceNum": 1,
73       "instanceType": "SMALL",
74       "memCapacityMB": 2048,
75       "storage": {
76         "type": "LOCAL",
77         "sizeGB": 10,
78         "dsNames": ["ds1", "ds2"]
79       }
80     }
81   ],
82   "configuration": {
83   }
84 }

It defines four node groups:

Lines 3 to 21 define a node group named "master".
Lines 22 to 42 define a data node group named "data".
Lines 43 to 63 define a compute node group named "compute".
Lines 64 to 80 define a client node group.

Lines 3 to 21 are an object that defines the "master" node group. Its attributes are as follows.

Line 4 defines the name of the node group. The attribute name is "name"; the value is "master".

Lines 5 to 8 define the roles of the node group. The attribute name is "roles"; the values are "hadoop_namenode" and "hadoop_jobtracker", which means the hadoop_namenode and hadoop_jobtracker services will be deployed to the virtual machine in the group. You can see the available roles with the "distro list" command.

Line 9 defines the number of instances in the node group. The attribute name is "instanceNum"; the value is 1, which means only one virtual machine is created for the group. You can have multiple instances for hadoop_tasktracker, hadoop_datanode, hadoop_client, pig, and hive, but only one instance for hadoop_namenode and hadoop_jobtracker.

Line 10 defines the instance type of the node group. The attribute name is "instanceType"; the value is "LARGE". The instance types are predefined virtual machine specs: combinations of number of CPUs, RAM size, and storage size. The predefined numbers can be overridden by the cpuNum, memCapacityMB, and storage values specified in the file.

Line 11 defines the number of CPUs per virtual machine. The attribute name is "cpuNum"; the value is 2. It overrides the number of CPUs of the predefined virtual machine spec.

Line 12 defines the RAM size per virtual machine. The attribute name is "memCapacityMB"; the value is 4096. It overrides the RAM size of the predefined virtual machine spec.

Lines 13 to 16 define the storage requirement of the node group. It is an object named "storage".

o Line 14 defines the storage type. The attribute name is "type"; the value is "SHARED", which means Hadoop data must be stored on shared storage.

o Line 15 defines the storage size. The attribute name is "sizeGB"; the value is 20, which means there will be a 20GB disk for Hadoop to use.

Line 17 defines whether HA applies to the node. The attribute name is "haFlag"; the value is "on", which means the virtual machine in the group is protected by vSphere HA.

Lines 18 to 20 define the resource pools the node group must be associated with. The attribute name is "rpNames"; the value is an array that contains one resource pool, "rp1".

You can see the same structure for the other three node groups. In addition, for the "data" and "compute" groups, we specify a pair of placement constraints:

Lines 35 to 41 define the placement constraints for the data node group. The attribute name is "placementPolicies" and the value is a hash that contains "instancePerHost" and "groupRacks". This constraint means you need at least 3 ESX hosts, because this group requires 3 instances and forces putting 1 instance on each host; furthermore, this group is provisioned on hosts in "rack1", "rack2", and "rack3" using the "ROUNDROBIN" algorithm.

Lines 56 to 62 define the placement constraints for the compute node group, which contain "instancePerHost" and "groupAssociations". This constraint means you also need at least 3 ESX hosts for the same reason, and this group is "STRICT" associated with the node group "data" for better performance.

You can customize the Hadoop configuration with the "configuration" attribute on lines 82 to 83, which happens to be empty in this sample. You can modify the values of the attributes, and you can also remove optional values you do not care about.

The following table defines the outermost attributes in a cluster spec:

Attribute      Type    Mandatory/Optional  Description
nodeGroups     object  Mandatory           Contains one or more node group specifications; see the table below for details.
configuration  object  Optional            Customizable Hadoop configuration key/value pairs.
externalHDFS   string  Optional            URI of an external HDFS (only valid for a compute-only cluster).


The following table defines the objects and attributes for a particular node group.

Attribute          Type            Mandatory/Optional  Description
name               string          Mandatory           User-defined node group name.
roles              list of string  Mandatory           A list of software packages or services to be installed on the virtual machines in the node group. Each item must exactly match a role shown by "distro list".
instanceNum        integer         Mandatory           How many virtual machines are in the node group. It must be a positive integer. For hadoop_namenode and hadoop_jobtracker, it must be 1.
instanceType       string          Optional            Size of the virtual machines in the node group: the name of a predefined virtual machine template. It can be "SMALL", "MEDIUM", "LARGE", or "EXTRA_LARGE". The cpuNum, memCapacityMB, and storage.sizeGB attributes overwrite this attribute if they are defined in the same node group.
cpuNum             integer         Optional            Number of vCPUs per virtual machine.
memCapacityMB      integer         Optional            Amount of RAM in MB per virtual machine.
storage            object          Optional            Storage settings.
  type             string          Optional            It can be "LOCAL" or "SHARED".
  sizeGB           integer         Optional            Data storage size. It must be a positive integer.
  dsNames          list of string  Optional            Datastores the node group can use.
rpNames            list of string  Optional            Resource pools the node group can use.
haFlag             string          Optional            It can be "on", "off", or "ft". "on" means use vSphere HA to protect the node group; "ft" means use vSphere FT to protect the node group. By default, the name node and job tracker are protected by vSphere HA.
placementPolicies  object          Optional            It can contain three optional constraints: "instancePerHost", "groupRacks", and "groupAssociations"; refer to 5.3.2 for details.


Serengeti comes with predefined virtual machine specifications:

                                  SMALL   MEDIUM  LARGE   EXTRA_LARGE
Number of vCPUs                   1       2       4       8
RAM                               3.75GB  7.5GB   15GB    30GB
Disk size for Hadoop master data  25GB    50GB    100GB   200GB
Disk size for Hadoop worker data  50GB    100GB   200GB   400GB
Disk size for Hadoop client data  50GB    100GB   200GB   400GB

When creating virtual machines, Serengeti tries to allocate datastores of the preferred type: SHARED storage is preferred for masters and clients, and LOCAL storage is preferred for workers. Separate disks are created for the OS and swap.

7. Serengeti Command Reference

7.1 connect

Connect and log in to a remote Serengeti server.

Parameter   Mandatory/Optional  Description
--host      Mandatory           Specify the Serengeti web service URL in the format <Serengeti Management Server IP or host>:<port>. By default, the Serengeti web service is started on port 8080.
--username  Optional            The Serengeti user name.
--password  Optional            The Serengeti password.

The command reads the username and password in interactive mode. Section 5.1 describes how to manage Serengeti users.

If the connection fails, or the connect command has not been run, no other Serengeti command is allowed to execute.

7.2 cluster

7.2.1 cluster config

Modify the Hadoop configuration of an existing default or customized Hadoop cluster in Serengeti.

Parameter                           Mandatory/Optional  Description
--name <cluster name in Serengeti>  Mandatory           Specify the Hadoop cluster name in Serengeti.
--specFile <spec file path>         Optional            Specify the Hadoop cluster's specification in a customized file.
--yes                               Optional            Answer 'y' to the 'Y/N' confirmation. If not specified, you must answer 'y' or 'n' explicitly.

7.2.2 cluster create

Create a default or customized Hadoop cluster in Serengeti.

Parameter                           Mandatory/Optional  Description
--name <cluster name in Serengeti>  Mandatory           Specify the Hadoop cluster name in Serengeti.
--type <cluster type>               Optional            Specify the cluster type. Hadoop or HBase is supported; the default is Hadoop.
--specFile <spec file path>         Optional            Specify the Hadoop cluster's specification in a customized file.
--distro <Hadoop distro name>       Optional            Specify which distro is used to deploy the Hadoop cluster. The distros include Apache Hadoop, Greenplum HD, CDH3, and HDP1.
--dsNames <datastore names>         Optional            Specify which datastores are used to deploy the Hadoop cluster in Serengeti. By default, it uses the same one as the Serengeti virtual machine. Multiple datastores can be used, separated by ",".
--networkName <network name>        Optional            Specify which network is used to deploy the Hadoop cluster in Serengeti. By default, it uses the same one as the Serengeti virtual machine.
--rpNames <resource pool name>      Optional            Specify which resource pools are used to deploy the Hadoop cluster in Serengeti. By default, it uses the same one as the Serengeti virtual machine. Multiple resource pools can be used, separated by ",".
--resume                            Optional            If resume is specified, this command recovers the creation process of a cluster whose deployment failed.
--topology <topology type>          Optional            Specify which topology type is used for rack awareness: HVE, RACK_AS_RACK, or HOST_AS_RACK.
--yes                               Optional            Answer 'y' to the 'Y/N' confirmation. If not specified, you must answer 'y' or 'n' explicitly.
--skipConfigValidation              Optional            Skip cluster configuration validation.

If the cluster spec does not include required nodes, for example a master node, Serengeti generates them with a default configuration.


7.2.3 cluster delete

Delete a Hadoop cluster in Serengeti.

Parameter              Mandatory/Optional  Description
--name <cluster name>  Mandatory           Delete the specified Hadoop cluster in Serengeti.

7.2.4 cluster export

Export cluster information.

Parameter  Mandatory/Optional  Description
--spec     Mandatory           Export the cluster specification. The exported cluster specification can be used in the cluster create or cluster config command.
--output   Optional            Specify the output file name for the exported cluster information. If not specified, the output is displayed in the console.

7.2.5 cluster limit

Enable or disable provisioned compute nodes in the specified Hadoop cluster or node group in Serengeti to reach the limit specified by activeComputeNodeNum. Compute nodes are re-commissioned and powered on, or decommissioned and powered off, to reach the specified number of active compute nodes.

Parameter                        Mandatory/Optional  Description
--name <cluster_name>            Mandatory           Name of the Hadoop cluster in Serengeti.
--nodeGroup <node_group_name>    Optional            Name of a node group in the specified Hadoop cluster in Serengeti (supports node groups with the task tracker role only).
--activeComputeNodeNum <number>  Mandatory           Number of active compute nodes for the specified Hadoop cluster or node group within that cluster. The valid range is integers greater than or equal to zero.
                                                     - For zero, all the nodes in the Hadoop cluster or the specific node group (if --nodeGroup is specified) are decommissioned and powered off.
                                                     - For a value between 1 and the maximum node number of the Hadoop cluster or node group (if --nodeGroup is specified), that number of nodes stays commissioned and powered on; the other nodes are decommissioned.
                                                     - For a value larger than the maximum node number of the Hadoop cluster or node group (if --nodeGroup is specified), all the nodes in the Hadoop cluster or node group are re-commissioned and powered on.

7.2.6 cluster list

List all Hadoop clusters in Serengeti.

Parameter                           Mandatory/Optional  Description
--name <cluster name in Serengeti>  Optional            List the specified Hadoop cluster in Serengeti, including name, distro, status, and each role's information. For each role, it lists instance count, CPU, memory, type, and size.
--detail                            Optional            List all the Hadoop clusters' details, including name in Serengeti, distro, deploy status, and each node's information in the different roles. Note: with this option specified, Serengeti queries the vCenter server for the latest node status; that operation may take a few seconds per cluster.


7.2.7 cluster resize

Change the number of nodes in a node group.

Parameter                             Mandatory/Optional  Description
--name <cluster name in Serengeti>    Mandatory           Specify the target Hadoop cluster in Serengeti.
--nodeGroup <name of the node group>  Mandatory           Specify the target node group to be scaled out in the Hadoop cluster deployed by Serengeti.
--instanceNum <instance number>       Mandatory           Specify the target instance count to scale out to. The target count must be larger than the original.

Example:

cluster resize --name foo --nodeGroup slave --instanceNum 10

7.2.8 cluster start

Start a Hadoop cluster in Serengeti.

Parameter              Mandatory/Optional  Description
--name <cluster name>  Mandatory           Start the specified Hadoop cluster in Serengeti.


7.2.9 cluster stop

Stop a Hadoop cluster in Serengeti.

Parameter              Mandatory/Optional  Description
--name <cluster name>  Mandatory           Stop the specified Hadoop cluster in Serengeti.

7.2.10 cluster target

Connect to a Hadoop cluster to interact with it through the Serengeti CLI, including running the fs, mr, pig, and hive commands.

Parameter              Mandatory/Optional  Description
--name <cluster name>  Optional            The name of the cluster to connect to. If you do not specify this parameter, the first cluster listed by the "cluster list" command is used.
--info                 Optional            Show the targeted cluster's information, such as the HDFS URL, Job Tracker URL, and Hive server URL.

Note: --name and --info cannot be used together.

7.2.11 cluster unlimit

Enable all of the provisioned compute nodes in the specified Hadoop cluster or node group in Serengeti. Compute nodes are re-commissioned and powered on as necessary.

Parameter                      Mandatory/Optional  Description
--name <cluster_name>          Mandatory           Name of the Hadoop cluster in Serengeti.
--nodeGroup <node_group_name>  Optional            Name of a node group in the specified Hadoop cluster in Serengeti (only supports node groups with the task tracker role).

7.3 datastore

7.3.1 datastore add

Add a datastore to Serengeti for deployment.

Parameter                             Mandatory/Optional  Description
--name <datastore name in Serengeti>  Mandatory           Specify the name of the datastore added to Serengeti.
--spec <datastore name in vCenter>    Mandatory           Specify the datastore name in vSphere. You can use wildcards to specify multiple VMFS stores; * and ? are supported.
--type <datastore type: LOCAL|SHARE>  Mandatory           Specify the datastore type in vSphere: local storage or shared storage.

7.3.2 datastore delete

Delete a datastore from Serengeti.

Parameter                             Mandatory/Optional  Description
--name <datastore name in Serengeti>  Mandatory           Delete the specified datastore in Serengeti.

7.3.3 datastore list

List the datastores added to Serengeti.

Parameter                             Mandatory/Optional  Description
--name <datastore name in Serengeti>  Optional            List the specified datastore's information, including name and type.
--detail                              Optional            List the datastore details, including the datastore path in vSphere.

All datastores that are added to Serengeti are listed if the name is not specified.


7.4 distro

7.4.1 distro list

Show the roles offered in a distro.

Parameter             Mandatory/Optional  Description
--name <distro name>  Optional            List the specified distro's information.

7.5 disconnect

Disconnect and log out from the remote Serengeti server. After disconnecting, you cannot run any other CLI commands.

7.6 fs

7.6.1 fs cat

Copy source paths to stdout.

Parameter    Mandatory/Optional  Description
<file name>  Mandatory           The file to be shown in the console. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.2 fs chgrp

Change the group association of files.

Parameter               Mandatory/Optional  Description
--group <group name>    Mandatory           The group name of the file.
--recursive true|false  Optional            Make the change recursively through the directory structure.
<file name>             Mandatory           The file whose group is to be changed. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.3 fs chmod

Change the permissions of files.

Parameter                 Mandatory/Optional  Description
--mode <permission mode>  Mandatory           The file permission mode, such as "755".
--recursive true|false    Optional            Make the change recursively through the directory structure.
<file name>               Mandatory           The file whose permissions are to be changed. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.4 fs chown

Change the owner of files.

Parameter               Mandatory/Optional  Description
--owner <owner name>    Mandatory           The file owner name.
--recursive true|false  Optional            Make the change recursively through the directory structure.
<file name>             Mandatory           The file whose owner is to be changed. Multiple files must be quoted, such as "/path/file1 /path/file2".

7.6.5 fs copyFromLocal

Copy a single source file, or multiple source files, from the local file system to the destination file system. It is the same as put.

Parameter                 Mandatory/Optional  Description
--from <local file path>  Mandatory           The local file path. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <HDFS file path>     Mandatory           The file path in HDFS. If "--from" is multiple files, "--to" is a directory name.

7.6.6 fs copyToLocal

Copy files to the local file system. It is the same as get.

Parameter                Mandatory/Optional  Description
--from <HDFS file path>  Mandatory           The file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <local file path>   Mandatory           The local file path. If "--from" is multiple files, "--to" is a directory name.


7.6.7 fs copyMergeToLocal

Take a source directory and a destination file as input, and concatenate the files in the HDFS directory into a single file on the local file system.

Parameter Mandatory/Optional Description

--from <HDFS file path>    Mandatory    The file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <local file path>    Mandatory    The local file path.
--endline <true|false>    Optional    Whether to add an end-of-line character.

7.6.8 fs count

Count the number of directories, files, bytes, quota, and remaining quota.

Parameter Mandatory/Optional Description

--path <HDFS path>    Mandatory    The path to be counted.
--quota <true|false>    Optional    Whether to include quota information.

7.6.9 fs cp

Copy files from source to destination. Multiple sources are allowed, in which case the destination must be a directory.

Parameter Mandatory/Optional Description

--from <HDFS source file path>    Mandatory    The source file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <HDFS destination file path>    Mandatory    The destination path in HDFS. If "--from" specifies multiple files, "--to" must be a directory name.

7.6.10 fs du

Display the sizes of files and directories contained in the given directory, or the length of a file if the path is just a file.

Parameter Mandatory/Optional Description

<file name>    Mandatory    The file or directory to report on. Multiple paths must be quoted, such as "/path/file1 /path/file2".

7.6.11 fs expunge

Empty the trash bin in HDFS.


7.6.12 fs get

Copy files to the local file system.

Parameter Mandatory/Optional Description

--from <HDFS file path>    Mandatory    The file path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <local file path>    Mandatory    The local file path. If "--from" specifies multiple files, "--to" must be a directory name.

7.6.13 fs ls

List files in the directory.

Parameter Mandatory/Optional Description

<path name>    Mandatory    The path to be listed. Multiple paths must be quoted, such as "/path/file1 /path/file2".
--recursive <true|false>    Optional    Whether to list the directory recursively.

7.6.14 fs mkdir

Create a new directory.

Parameter Mandatory/Optional Description

<dir name> Mandatory The directory name to be created.

7.6.15 fs moveFromLocal

Similar to the put command, except that the local source file is deleted after it is copied.

Parameter Mandatory/Optional Description

--from <local file path>    Mandatory    The local file path. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <HDFS file path>    Mandatory    The file path in HDFS. If "--from" specifies multiple files, "--to" must be a directory name.

7.6.16 fs mv

Move source files to a destination within HDFS.

Parameter Mandatory/Optional Description
--from <source file path>    Mandatory    The source path in HDFS. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <dest file path>    Mandatory    The destination path in HDFS. If "--from" specifies multiple files, "--to" must be a directory name.

7.6.17 fs put

Copy one or more source files from the local file system to HDFS.

Parameter Mandatory/Optional Description

--from <local file path>    Mandatory    The local file path. Multiple files must be quoted, such as "/path/file1 /path/file2".
--to <HDFS file path>    Mandatory    The file path in HDFS. If "--from" specifies multiple files, "--to" must be a directory name.
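For example, a sketch of copying two local files into an HDFS directory (all paths are hypothetical):

>fs put --from "/tmp/file1 /tmp/file2" --to /user/serengeti/data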

7.6.18 fs rm

Remove files in the HDFS.

Parameter Mandatory/Optional Description

<file path>    Mandatory    The file to be removed.
--recursive <true|false>    Optional    Remove files recursively.
--skipTrash <true|false>    Optional    Bypass the trash.

7.6.19 fs setrep

Change the replication factor of a file.

Parameter Mandatory/Optional Description

--path <file path>    Mandatory    The path whose replication factor is to be changed.
--replica <replica number>    Mandatory    The number of replicas.
--recursive <true|false>    Optional    Whether to set the replication factor recursively.
--waiting <true|false>    Optional    Whether to wait until the replica count reaches the requested number.
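For example, a hypothetical invocation that sets the replication factor of a directory tree to 3 and waits until re-replication completes:

>fs setrep --path /user/serengeti/data --replica 3 --recursive true --waiting true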

7.6.20 fs tail

Display the last kilobyte of the file to stdout.

Parameter Mandatory/Optional Description

<file path>    Mandatory    The file path to display.
--file <true|false>    Optional    Show content as the file grows.

7.6.21 fs text

Take a source file and output the file in text format.

Parameter Mandatory/Optional Description

<file path> Mandatory The file path to be displayed.

7.6.22 fs touchz

Create a file of zero length.

Parameter Mandatory/Optional Description

<file path> Mandatory The file name to be created.

7.7 hive

7.7.1 hive cfg

Configure Hive.

Parameter Mandatory/Optional Description

--host <server host>    Optional    The server host.
--port <server port>    Optional    The server port.
--timeout    Optional    The timeout in milliseconds.

7.7.2 hive script

Execute a Hive script. Note: You need to run hive cfg before running a hive script.

Parameter Mandatory/Optional Description

--location <script path> Mandatory The hive script file name to be executed.
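For example, a hypothetical sequence (the host, port, and script path depend on your cluster):

>hive cfg --host 192.168.1.100 --port 10000
>hive script --location /home/serengeti/queries/report.hql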

7.8 mr

7.8.1 mr jar

Run a MapReduce job located inside the provided jar.

Parameter Mandatory/Optional Description

--jarfile <jar file path>    Mandatory    The jar file path.
--mainclass <main class name>    Mandatory    The class that contains the main() method.
--args <arg>    Optional    The arguments to the main class. Multiple arguments must be double quoted.
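For example, a sketch that runs the wordcount example from the Hadoop examples jar (the jar path and HDFS paths are hypothetical):

>mr jar --jarfile /home/serengeti/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.WordCount --args "/user/serengeti/input /user/serengeti/output"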

7.8.2 mr job counter

Print the counter value of the MR job.

Parameter Mandatory/Optional Description

--jobid <job id> Mandatory The MR job id.

--groupname <group name>    Mandatory    The counter's group name.
--countername <counter name>    Mandatory    The counter's name.

7.8.3 mr job events

Print the details of the events received by the JobTracker within the given range.

Parameter Mandatory/Optional Description

--jobid <job id>    Mandatory    The MR job id.
--from <from-event-#>    Mandatory    The number of the first event to print.
--number <#-of-events>    Mandatory    The total number of events to print.

7.8.4 mr job history

Print job details, including failed and killed job details.

Parameter Mandatory/Optional Description

<job history directory>    Mandatory    The directory where the job history files are kept.
--all <true|false>    Optional    Print information for all jobs.

7.8.5 mr job kill

Kill the MR job.

Parameter Mandatory/Optional Description

--jobid <job id> Mandatory The job id.


7.8.6 mr job list

List MR jobs.

Parameter Mandatory/Optional Description

--all <true|false>    Optional    Whether to list all jobs.

7.8.7 mr job set priority

Change the priority of the job.

Parameter Mandatory/Optional Description

--jobid <jobid> Mandatory The job id.

--priority <VERY_HIGH|HIGH|NORMAL|LOW|VERY_LOW>    Mandatory    The job's priority.

7.8.8 mr job status

Query MR job status.

Parameter Mandatory/Optional Description

--jobid <jobid> Mandatory The job id.

7.8.9 mr job submit

Submit a MR job defined in the job file.

Parameter Mandatory/Optional Description

--jobfile <jobfile>    Mandatory    Specify the file that defines the MR job. The file is a standard Hadoop configuration file. An example configuration file follows:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapred.jar</name>
    <value>/home/hadoop/hadoop-1.0.1/hadoop-examples-1.0.1.jar</value>
  </property>
  <property>
    <name>mapred.input.dir</name>
    <value>/user/hadoop/input</value>
  </property>
  <property>
    <name>mapred.output.dir</name>
    <value>/user/hadoop/output</value>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>wordcount</value>
  </property>
  <property>
    <name>mapreduce.map.class</name>
    <value>org.apache.hadoop.examples.WordCount.TokenizerMapper</value>
  </property>
  <property>
    <name>mapreduce.reduce.class</name>
    <value>org.apache.hadoop.examples.WordCount.IntSumReducer</value>
  </property>
</configuration>
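For example, a hypothetical invocation, assuming the configuration above is saved as /home/serengeti/wordcount.xml:

>mr job submit --jobfile /home/serengeti/wordcount.xml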

7.8.10 mr task fail

Fail the MapReduce task.

Parameter Mandatory/Optional Description

--taskid <taskid> Mandatory Specify the task id.

7.8.11 mr task kill

Kill the MapReduce task.

Parameter Mandatory/Optional Description

--taskid <taskid> Mandatory Specify the task id.

7.9 network

7.9.1 network add

Add a network to Serengeti.

Parameter Mandatory/Optional Description

--name <network name in Serengeti>    Mandatory    Specify the name of the network resource added to Serengeti.
--portGroup <port group name in vSphere>    Mandatory    Specify the name of the port group in vSphere to add to Serengeti.
--dhcp    Combination 1    Specify the IP address assignment type: DHCP.
--ip <IP spec, an IP address range of the form xx.xx.xx.xx-xx[,xx]*>
--dns <DNS server IP>
--secondaryDNS <DNS server IP>
--gateway <gateway IP>
--mask <network mask>    Combination 2    Specify the IP address assignment type: static IP. These parameters are specified together.

For example:

>network add --name ipNetwork --ip 192.168.1.1-100,192.168.1.120-180 --portGroup pg1 --dns 202.112.0.1 --gateway 192.168.1.254 --mask 255.255.255.0
>network add --name dhcpNetwork --dhcp --portGroup pg1

7.9.2 network delete

Delete a network in Serengeti.

Parameter Mandatory/Optional Description

--name <network name in Serengeti> Mandatory Delete the specified network in Serengeti.

7.9.3 network list

List available networks in Serengeti.

Parameter Mandatory/Optional Description

--name <network name in Serengeti>    Optional    List the specified network in Serengeti, including its name, port group in vSphere, IP address assignment type, assigned IP addresses, and so on.
--detail    Optional    List detailed network information in Serengeti, including each Hadoop cluster node's network information.

For example:
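(Illustrative invocations; "ipNetwork" refers to the network added in the network add example above.)

>network list
>network list --name ipNetwork --detail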


7.10 pig

7.10.1 pig cfg

Configure Pig.

Parameter Mandatory/Optional Description

--props Optional Specify the Pig properties file location.

--jobName Optional Specify the job name.

--jobPriority Optional Specify the job priority.

--jobTracker Optional Specify the job tracker.

--execType Optional Specify the execution type.

--validateEachStatement Optional Whether to validate each statement.

7.10.2 pig script

Execute a Pig script. Note: You need to run pig cfg before running this command.

Parameter Mandatory/Optional Description

--location <script path> Mandatory Specify the name of the script to be executed.

7.11 resourcepool

7.11.1 resourcepool add

Add a resource pool in vSphere to Serengeti.

Parameter Mandatory/Optional Description

--name <resource pool name in Serengeti>    Mandatory    Specify the name of the resource pool added to Serengeti.
--vccluster <vSphere cluster of the resource pool>    Mandatory    Specify the name of the vSphere cluster that contains the resource pool.
--vcrp <vSphere resource pool name>    Mandatory    Specify the vSphere resource pool that is added to Serengeti for deployment. The vSphere resource pool must be directly under a cluster.

7.11.2 resourcepool delete

Remove a resource pool from Serengeti.

Parameter Mandatory/Optional Description

--name <resource pool name in Serengeti>    Mandatory    Remove the specified resource pool from Serengeti.

7.11.3 resourcepool list

List resource pools added to Serengeti.

Parameter Mandatory/Optional Description

--name <resource pool name in Serengeti>    Optional    List the specified resource pool's name and path.
--detail    Optional    List each resource pool's general information and the Hadoop cluster nodes in the resource pool.

All resource pools that are added to Serengeti are listed if a name is not specified. For each resource pool, NAME and PATH are listed. NAME is the name in Serengeti. PATH is the combination of the vSphere cluster name and the resource pool name, separated by "/".

For example:
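(Illustrative invocations; "rp1" stands for a resource pool previously added with resourcepool add.)

>resourcepool list
>resourcepool list --name rp1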


7.12 topology

7.12.1 topology upload

Upload a rack-hosts mapping topology file to Serengeti. The newly uploaded file overwrites any existing file. The accepted file format is one rack per line: rackname: hostname1, hostname2, ... where hostname1, hostname2, ... are the host names displayed in vSphere.

Parameter Mandatory/Optional Description

--fileName <topology file name> Mandatory Specify the topology file name.

--yes Optional Answer 'y' to the 'Y/N' confirmation.
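For example, a sketch of an accepted topology file with two racks (rack and host names are hypothetical), followed by the upload command:

rack1: host-101, host-102
rack2: host-201, host-202

>topology upload --fileName /home/serengeti/topology.data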

7.12.2 topology list

List rack-hosts mapping topology stored in Serengeti.

8. vSphere Settings

8.1 vSphere Cluster Configuration

8.1.1 Setup Cluster

In the vCenter Client, select "Inventory", "Hosts and Clusters". In the left column, right-click the Datacenter and select "New Cluster...". Follow the New Cluster Wizard using the following settings:

- Enable "vSphere HA" and "vSphere DRS"
- Enable Host Monitoring
- Enable Admission Control and set the desired policy (the default policy is to tolerate 1 host failure)
- Virtual machine restart priority: "High"
- Virtual machine monitoring: "Virtual Machine and Application Monitoring"
- Monitoring sensitivity: "High"


8.1.2 Enable DRS/HA on an existing cluster

If DRS or HA is not already enabled on an existing cluster, it can be enabled by right-clicking the cluster and selecting "Edit Settings". Under "Cluster Features", select "Turn On vSphere DRS" and "Turn On vSphere HA". Use the settings specified in "Setup Cluster" above.

8.1.3 Add Hosts to Cluster

In the vCenter Client, select "Inventory", "Hosts and Clusters". In the left column, right-click the Cluster that was just created and select "Add Host...". Follow the Add Host Wizard to add a host. Repeat for each additional host.

8.1.4 DRS/FT Settings

In the vCenter Client, select "Inventory", "Hosts and Clusters". In the left column, click a host in the cluster. On the right side there is a row of tabs near the top of the window; click "Configuration", then click "Networking". The window displays the vSwitch port groups. By default, a VMkernel Port called "Management Network" is pre-configured. Click "Properties..." on the vSwitch, choose the "Management Network", and click the "Edit" button. Enable "vMotion" and "Fault Tolerance Logging" in the "Management Network Properties" window.

To verify the FT status of a host, click the Summary tab and locate "Host Configured for FT" in the General section. Any issues with FT are shown here.

8.1.5 Enable FT on specific virtual machine

Fault Tolerance runs one virtual machine on two separate hosts, allowing instant failover in a variety of situations. Before enabling FT, ensure the necessary requirements are met:

- Host hardware is listed in the VMware Hardware Compatibility List (HCL)
- All hosts in the cluster have hardware VT enabled in the BIOS
- The "Management Network" (VMkernel Port) has "vMotion" and "Fault Tolerance Logging" enabled
- The cluster has available capacity
- Virtual machine disks are thick provisioned, have no snapshots, and are located on shared storage
- The virtual machine has a single vCPU

In the vCenter Client, select "Inventory", "Hosts and Clusters". In the left column, right-click the virtual machine and select "Fault Tolerance", "Turn On Fault Tolerance".

8.2 Network Settings

Serengeti currently deploys using a single network. Virtual machines are deployed with one NIC, which is attached to a specific Port Group. How this Port Group is configured, and the network backing it, depends on the environment. This section covers a basic network configuration that can be customized as needed.


Either a vSwitch or a vSphere Distributed Switch (vDS) can be used to provide the Port Group backing a Serengeti cluster. A vDS acts as a single virtual switch across all attached hosts, while a vSwitch is per-host and requires the Port Group to be configured manually on each host.

8.2.1 Setup Port Group - Option A (vSphere Distributed Switch)

In the vCenter Client, select "Inventory", "Networking". Right-click the Datacenter and select "New vSphere Distributed Switch".

In the Create vSphere Distributed Switch wizard, choose Switch Version 5.0, then enter a name and the number of uplink ports (physical adapters) you require.

On the Add Hosts and Physical Adapters step, select the adapter(s) on each host that will carry traffic to the switch.

The last step creates a default Port Group. You can rename this Port Group after the wizard completes.

8.2.2 Setup Port Group - Option B (vSwitch)

In the vCenter Client, select "Inventory", "Hosts and Clusters". Navigate to the Networking section of the Configuration tab. Make sure the "vSphere Standard Switch" view is selected.

vSwitch0 is already created by default. You can add a Port Group to this vSwitch or create a new vSwitch that binds to different physical adapters.

To create a Port Group on the existing vSwitch, click "Properties..." on that vSwitch and then click the "Add" button. Follow the wizard to create the Port Group.

To create a new vSwitch, click "Add Networking..." and follow the Add Network Wizard.

8.3 Storage Settings

Serengeti provisions virtual machines on shared storage to enable the vSphere HA, FT, and DRS features. Local datastores are attached to virtual machines to be used for data.

8.3.1 Shared Storage Setting

Create a LUN on shared storage (SAN/NAS) and verify that it is accessible by all hosts in the cluster. The vSphere HA Datastore Heartbeat feature requires two datastores.

8.3.2 Local Storage Settings

8.3.2.1 Configure DAS on Physical Hosts

Direct Attached Storage should be attached and configured on the physical controller to present each disk separately to the OS. This configuration is commonly described as JBOD (Just a Bunch Of Disks) or single-disk RAID0.

8.3.2.2 Provision VMFS Datastores on DAS of Each Host

Create VMFS datastores on Direct Attached Storage. This can be done in either of the following two ways:

- Manually, using the vSphere Client or the vSphere Management Assistant
- Automatically, using vSphere PowerCLI


8.3.2.2.1 Manually Using vSphere Client (Manual per disk):

1. Expand Cluster => Select Host

2. Go to "Configuration" Tab

3. Choose "Storage"

4. Click "Add Storage..."

This starts the Add Storage wizard. In the wizard, continue with the following steps.

5. Select "Disk/LUN" for Storage Type => Next

6. Select a Local Disk from the list => Next

7. Select "VMFS-5" for File System Version => Next => Next

8. Enter Datastore Name => Next

9. "Maximum Available Space" => Next

10. Finish

8.3.2.2.2 Automation by vSphere PowerCLI

This method requires vSphere PowerCLI to be installed. Refer to the vSphere PowerCLI site to download and install PowerCLI.

Once PowerCLI is installed, you can use it to format many Direct Attached Storage disks to VMFS at a time.

1. Select Start > Programs > VMware > VMware vSphere PowerCLI.

The VMware vSphere PowerCLI console window opens.

2. In the VMware vSphere PowerCLI console window, run PowerCLI commands to format the disks.

CAUTION

The commands apply to multiple ESXi hosts at a time. Make sure the scope is what you intend before you run a command.

Here is a sample script for provisioning datastores. You can type the commands line by line in the PowerCLI shell.

In this example, the script formats the local disks on all hosts in a vSphere cluster named "My Cluster". The disks are formatted as VMFS datastores, with datastore names prefixed "abcde".

vSphere PowerCLI - Create Local Datastores for Cluster

# Connect to a vCenter Server.
Connect-VIServer -Server 10.23.112.235 -Protocol https -User admin -Password pass

# Prepare variables.
$i = 0
$localDisks = @{}
$clusterName = "My Cluster"
$datastoreName = "abcde"

# Select hosts.
$vmHosts = Get-VMHost -Location $clusterName

# Get local disks.
$ldArray = $vmHosts | Get-VMHostDisk | select -ExpandProperty ScsiLun | where {$_.IsLocal -eq "True"}

# Get primary disks.
$pdArray = $vmHosts | Get-VMHostDiagnosticPartition

# Add local disks to a hashtable keyed by canonical name.
foreach ($ld in $ldArray) {$localDisks.Add($ld.CanonicalName, $ld)}

# Remove primary disks from the local disk hashtable.
foreach ($pd in $pdArray) {$localDisks.Remove($pd.CanonicalName)}

# Create datastores. Creation fails for any local disks that are in use.
foreach ($ld in $localDisks.Values) {$i++; New-Datastore -Vmfs -Name ($datastoreName + $i.ToString("D3")) -Path $ld.CanonicalName -VMHost $ld.VMHost}
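To spot-check the result, you can list the datastores whose names start with the sample prefix (an optional verification step; "abcde" matches the prefix used above):

# List the datastores created by the script above
Get-Datastore | Where-Object { $_.Name -like "abcde*" }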

9. Appendix A: Create Local Yum Repository for MapR

9.1 Install a web server to serve as the yum server

Find a machine or virtual machine running 64-bit CentOS 5.x (or 64-bit RHEL 5.x) that has Internet access, and install a web server such as Apache or lighttpd on it. Alternatively, you can use the Serengeti Management Server if you don't have another machine. This web server serves as the yum server. This guide uses the Apache web server as the example.

9.1.1 Configure http proxy

First, open a bash shell terminal. If the machine needs an HTTP proxy server to connect to the Internet, set the http_proxy environment variable:

# switch to the root user
sudo su
export http_proxy=http://<proxy_server:port>

9.1.2 Install Apache Web Server

yum install -y httpd
/sbin/service httpd start

Make sure the firewall on the machine doesn't block network port 80, which is used by the Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to verify that the default Apache test page appears.
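Alternatively, a quick check from another machine's shell (this assumes curl is installed; replace <ip_of_webserver> with the web server's IP address):

# Request only the response headers; HTTP/1.1 200 OK indicates the server is reachable
curl -I http://<ip_of_webserver>/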

If you would like to stop the firewall, execute this command:

/sbin/service iptables stop


9.1.3 Install yum related packages

Install the yum-utils and createrepo packages if they are not already installed (yum-utils includes the reposync command):

yum install -y yum-utils createrepo

9.1.4 Sync the remote MapR yum repository

1) Create a new file /etc/yum.repos.d/mapr-m5.repo using vi or another editor, with the following content:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v2.1.1/redhat/
enabled=1
gpgcheck=0
protect=1

[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem/redhat
enabled=1
gpgcheck=0
protect=1

2) Mirror the remote yum repository to the local machine:

reposync -r maprtech
reposync -r maprecosystem

This takes several minutes (depending on network bandwidth) to download all the RPMs in the remote repository. The RPMs are placed in new folders named maprtech and maprecosystem.

9.2 Create local yum repository

1) Put all the RPMs into a new folder under the Document Root folder of the Apache web server. The Document Root folder is /var/www/html/ for Apache by default; if you use the Serengeti Management Server as the yum server, the folder is /opt/serengeti/www/.

doc_root=/var/www/html
mkdir -p $doc_root/mapr/2
mv maprtech/ maprecosystem/ $doc_root/mapr/2/

2) Create a yum repository for the RPMs:


cd $doc_root/mapr/2
createrepo .

3) Create a new file /var/www/html/mapr/2/mapr-m5.repo with the following content:

[mapr-m5]
name=MapR Version 2
baseurl=http://<ip_of_webserver>/mapr/2
enabled=1
gpgcheck=0
protect=1

Replace <ip_of_webserver> with the IP address of the web server.

Ensure you can download http://<ip_of_webserver>/mapr/2/mapr-m5.repo from another machine.

9.3 Configure http proxy for the VMs created by Serengeti Server

This step is optional and applies only if the VMs created by the Serengeti Management Server need an HTTP proxy to connect to the yum repository. Configure the HTTP proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:

# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the IPs of the Serengeti Management Server and the local yum repository servers
# for 'serengeti.no_proxy'. The wildcard for matching multiple IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.

10. Appendix B: Create Local Yum Repository for CDH4

10.1 Install a web server to serve as the yum server

Find a machine or virtual machine running 64-bit CentOS 5.x (or 64-bit RHEL 5.x) that has Internet access, and install a web server such as Apache or lighttpd on it. Alternatively, you can use the Serengeti Management Server if you don't have another machine. This web server serves as the yum server. This guide uses the Apache web server as the example.

10.1.1 Configure http proxy

First, open a bash shell terminal. If the machine needs an HTTP proxy server to connect to the Internet, set the http_proxy environment variable:

# switch to the root user
sudo su
export http_proxy=http://<proxy_server:port>

10.1.2 Install Apache Web Server

yum install -y httpd
/sbin/service httpd start

Make sure the firewall on the machine doesn't block network port 80, which is used by the Apache web server. You can open a web browser on another machine and navigate to http://<ip_of_webserver>/ to verify that the default Apache test page appears.

If you would like to stop the firewall, execute this command:

/sbin/service iptables stop

10.1.3 Install yum related packages

Install the yum-utils and createrepo packages if they are not already installed (yum-utils includes the reposync command):

yum install -y yum-utils createrepo

10.1.4 Sync the remote CDH4 yum repository

1) Create a new file /etc/yum.repos.d/cloudera-cdh4.repo using vi or another editor, with the following content:

[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.1.2/
gpgkey=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1

2) Mirror the remote yum repository to the local machine:

reposync -r cloudera-cdh4

This takes several minutes (depending on network bandwidth) to download all the RPMs in the remote repository. The RPMs are placed in a new folder named cloudera-cdh4.

10.2 Create local yum repository

1) Put all the RPMs into a new folder under the Document Root folder of the Apache web server. The Document Root folder is /var/www/html/ for Apache by default; if you use the Serengeti Management Server as the yum server, the folder is /opt/serengeti/www/.


doc_root=/var/www/html
mkdir -p $doc_root/cdh/4/
mv cloudera-cdh4/RPMS $doc_root/cdh/4/

2) Create a yum repository for the RPMs:

cd $doc_root/cdh/4
createrepo .

3) Create a new file /var/www/html/cdh/4/cloudera-cdh4.repo with the following content:

[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://<ip_of_webserver>/cdh/4/
enabled=1
gpgcheck=0

Replace <ip_of_webserver> with the IP address of the web server.

Ensure you can download http://<ip_of_webserver>/cdh/4/cloudera-cdh4.repo from another machine.

10.3 Configure http proxy for the VMs created by Serengeti Server

This step is optional and applies only if the VMs created by the Serengeti Management Server need an HTTP proxy to connect to the yum repository. Configure the HTTP proxy for the VMs as follows: on the Serengeti Server, add the following content to /opt/serengeti/conf/serengeti.properties:

# set http proxy server
serengeti.http_proxy = http://<proxy_server:port>
# set the IPs of the Serengeti Management Server and the local yum repository servers
# for 'serengeti.no_proxy'. The wildcard for matching multiple IPs doesn't work.
serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.