
VMware vSphere Big Data Extensions Administrator's and User's Guide

vSphere Big Data Extensions 1.0

This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document, see http://www.vmware.com/support/pubs.

EN-001291-00


You can find the most up-to-date technical documentation on the VMware Web site at:

http://www.vmware.com/support/

The VMware Web site also provides the latest product updates.

If you have comments about this documentation, submit your feedback to:

[email protected]

This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 United States License (http://creativecommons.org/licenses/by-nd/3.0/us/legalcode).

VMware is a registered trademark or trademark of VMware, Inc. in the United States and other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies.

VMware, Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com

Contents

1 About This Book 7

2 About VMware vSphere Big Data Extensions 9

Big Data Extensions and Project Serengeti 10
About Big Data Extensions Architecture 10
Big Data Extensions Support for Hadoop Features By Distribution 11
Hadoop Feature Support By Distribution 12

3 Getting Started with Big Data Extensions 13

System Requirements for Big Data Extensions 15
Deploy the Big Data Extensions vApp in the vSphere Web Client 17
Install the Big Data Extensions Plug-in 19
Connect to a Serengeti Management Server 21
Install the Serengeti Remote Command-Line Interface Client 22
About Hadoop Distribution Deployment Types 23
Configure a Tarball-Deployed Hadoop Distribution 23
Configuring Yum and Yum Repositories 25
Add a Hadoop Distribution Name and Version to the Create Cluster Dialog Box 33
Install CentOS 6.x and VMware Tools on the Hadoop Template Virtual Machine 34
Maintain a Customized Hadoop Template Virtual Machine 37
Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client 38
Create a New Username and Password for the Serengeti Command-Line Interface 39
Create a New Password for the Serengeti Management Server 40
Start and Stop Serengeti Services 40
Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client 41
Change the User Password on all Cluster Nodes 41

4 Managing vSphere Resources for Hadoop and HBase Clusters 43

Add a Resource Pool with the Serengeti Command-Line Interface 43
Remove a Resource Pool with the Serengeti Command-Line Interface 44
Add a Datastore in the vSphere Web Client 44
Remove a Datastore in the vSphere Web Client 45
Add a Network in the vSphere Web Client 45
Remove a Network in the vSphere Web Client 46

5 Creating Hadoop and HBase Clusters 47

About Hadoop and HBase Cluster Deployment Types 48
About Cluster Topology 48
Create a Hadoop or HBase Cluster in the vSphere Web Client 49
Add Topology for Load Balancing and Performance with the Serengeti Command-Line Interface 51


6 Managing Hadoop and HBase Clusters 53

Start and Stop a Hadoop Cluster in the vSphere Web Client 53
Scale Out a Hadoop Cluster in the vSphere Web Client 54
Scale CPU and RAM in the vSphere Web Client 54
Reconfigure a Hadoop or HBase Cluster in the Serengeti Command-Line Interface 55
Delete a Hadoop Cluster in the vSphere Web Client 57
About Resource Usage and Elastic Scaling 57
Optimize Cluster Resource Usage with Elastic Scaling in the vSphere Web Client 59
Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client 60
About vSphere High Availability and vSphere Fault Tolerance 61
Recover from Disk Failure in the Serengeti Command-Line Interface Client 61

7 Monitoring the Big Data Extensions Environment 63

View Provisioned Clusters in the vSphere Web Client 63
View Cluster Information in the vSphere Web Client 64
Monitor the Hadoop Distributed File System Status in the vSphere Web Client 65
Monitor MapReduce Status in the vSphere Web Client 66
Monitor HBase Status in the vSphere Web Client 66

8 Performing Hadoop and HBase Operations with the Serengeti Command-Line Interface 67

Copy Files Between the Local Filesystem and HDFS with the Serengeti Command-Line Interface 67
Run Pig and PigLatin Scripts with the Serengeti Command-Line Interface 68
Run Hive and Hive Query Language Scripts with the Serengeti Command-Line Interface 68
Run HDFS Commands with the Serengeti Command-Line Interface 69
Run MapReduce Jobs with the Serengeti Command-Line Interface 69
Access HBase Databases 70

9 Accessing Hive Data with JDBC or ODBC 71

Configure Hive to Work with JDBC 71
Configure Hive to Work with ODBC 73

10 Troubleshooting 75

Log Files for Troubleshooting 76
Configure Serengeti Logging Levels 76
Troubleshooting Cluster Creation Failures 76
Cluster Provisioning Hangs for 30 Minutes Before Completing with an Error 79
Cannot Restart or Reconfigure a Cluster After Changing Its Distribution 80
Virtual Machine Cannot Get IP Address 80
vCenter Server Connections Fail to Log In 81
SSL Certificate Error When Connecting to Non-Serengeti Server with the vSphere Console 81
Serengeti Operations Fail After You Rename a Resource in vSphere 81
A New Plug-In with the Same or Lower Version Number as an Old Plug-In Does Not Load 82
MapReduce Job Fails to Run and Does Not Appear In the Job History 82
Cannot Submit MapReduce Jobs for Compute-Only Clusters with External Isilon HDFS 83
MapReduce Job Hangs on PHD or CDH4 YARN Cluster 83
Unable to Connect the Big Data Extensions Plug-In to the Serengeti Server 84


Index 85


1 About This Book

VMware vSphere Big Data Extensions Administrator's and User's Guide describes how to install Big Data Extensions within your vSphere environment, and how to manage and monitor Hadoop and HBase clusters using the Big Data Extensions plug-in for the vSphere Web Client.

VMware vSphere Big Data Extensions Administrator's and User's Guide also describes how to perform Hadoop and HBase operations using the Serengeti Command-Line Interface Client, which provides a greater degree of control for certain system management and Big Data cluster creation tasks.

Intended Audience

This guide is for system administrators and developers who want to use Big Data Extensions to deploy and manage Hadoop clusters. To successfully work with Big Data Extensions, you should be familiar with VMware® vSphere® and Hadoop and HBase deployment and operation.

VMware Technical Publications Glossary

VMware Technical Publications provides a glossary of terms that might be unfamiliar to you. For definitions of terms as they are used in VMware technical documentation, go to http://www.vmware.com/support/pubs.

Document Feedback

VMware welcomes your suggestions for improving our documentation. If you have comments, send your feedback to [email protected].

Technical Support and Education Resources

The following technical support resources are available to you. To access the current version of this book and other books, go to http://www.vmware.com/support/pubs.

Online and Telephone Support
To use online support to submit technical support requests, view your product and contract information, and register your products, go to http://www.vmware.com/support.


Customers with appropriate support contracts should use telephone support for the fastest response on priority 1 issues. Go to http://www.vmware.com/support/phone_support.html.

Support Offerings
To find out how VMware support offerings can help meet your business needs, go to http://www.vmware.com/support/services.

VMware Professional Services
VMware Education Services courses offer extensive hands-on labs, case study examples, and course materials designed to be used as on-the-job reference tools. Courses are available onsite, in the classroom, and live online. For onsite pilot programs and implementation best practices, VMware Consulting Services provides offerings to help you assess, plan, build, and manage your virtual environment. To access information about education classes, certification programs, and consulting services, go to http://www.vmware.com/services.


2 About VMware vSphere Big Data Extensions

VMware® vSphere™ Big Data Extensions lets you deploy and centrally operate Hadoop and HBase clusters running on VMware vSphere. Big Data Extensions simplifies the Hadoop and HBase deployment and provisioning process, gives you a real-time view of the running services and the status of their virtual hosts, provides a central place from which to manage and monitor your Hadoop and HBase cluster, and incorporates a full range of tools to help you optimize cluster performance and utilization.

■ Big Data Extensions and Project Serengeti on page 10
  Big Data Extensions runs on top of Project Serengeti, the open source project initiated by VMware to automate the deployment and management of Apache Hadoop and HBase on virtual environments such as vSphere.

■ About Big Data Extensions Architecture on page 10
  The Serengeti Management Server and Hadoop Template virtual machine work together to configure and provision Hadoop and HBase clusters.

■ Big Data Extensions Support for Hadoop Features By Distribution on page 11
  Big Data Extensions provides different levels of feature support depending on the Hadoop distribution and version you use.

■ Hadoop Feature Support By Distribution on page 12
  Each Hadoop distribution and version provides differing feature support. Learn which Hadoop distributions support which features.


Big Data Extensions and Project Serengeti

Big Data Extensions runs on top of Project Serengeti, the open source project initiated by VMware to automate the deployment and management of Apache Hadoop and HBase on virtual environments such as vSphere.

Big Data Extensions and Project Serengeti provide the following components.

Serengeti Management Server
Provides the framework and services to run Big Data clusters on vSphere. The Serengeti Management Server performs resource management, policy-based virtual machine placement, cluster provisioning, configuration management, and monitoring capabilities.

Serengeti Command-Line Interface Client
The command-line client provides a comprehensive set of tools and utilities with which to monitor and manage your Big Data deployment. If you are using the open source version of Serengeti (without Big Data Extensions) it is also the only interface through which you can perform administrative tasks.

Big Data Extensions
The commercial version of the open source Project Serengeti, Big Data Extensions provides the following additional features.

■ The Big Data Extensions plug-in, a graphical user interface integrated with the vSphere Web Client to perform common Hadoop infrastructure and cluster management administrative tasks.

■ Elastic scaling lets you optimize cluster performance and utilization of TaskTracker and NodeManager compute nodes. Elasticity-enabled clusters start and stop virtual machines automatically and dynamically to optimize resource consumption. Elasticity is ideal in a mixed workload environment to ensure that high priority jobs are assigned sufficient resources.

  Elasticity adjusts the number of active compute virtual machines based on configuration settings you specify.

About Big Data Extensions Architecture

The Serengeti Management Server and Hadoop Template virtual machine work together to configure and provision Hadoop and HBase clusters.

Big Data Extensions performs the following steps to deploy a Hadoop or HBase cluster.

1 The Serengeti Management Server searches for ESXi hosts with sufficient resources to operate the cluster based on the configuration settings you specify, and selects ESXi hosts on which to place Hadoop virtual machines.

2 The Serengeti Management Server sends a request to the vCenter Server to clone and reconfigure virtual machines for use with the Hadoop or HBase cluster, and configures the operating system and network parameters for the newly created virtual machines.

3 The Serengeti Management Server downloads the Hadoop distribution software packages.

4 The Hadoop Template virtual machine installs the Hadoop distribution software.

5 The Hadoop Template virtual machine configures the Hadoop parameters based on the cluster configuration settings you specify.

6 The Serengeti Management Server starts the Hadoop services, at which point you have a running cluster based on your configuration settings.


Big Data Extensions Support for Hadoop Features By Distribution

Big Data Extensions provides different levels of feature support depending on the Hadoop distribution and version you use.

Big Data Extensions Support for Hadoop Features

Big Data Extensions provides differing levels of feature support depending on the Hadoop distribution and version you configure for use. The following table lists the supported Hadoop distributions and indicates which features are supported when using it with Big Data Extensions.

Table 2-1. Big Data Extensions Feature Support for Hadoop

Feature                             Apache      Cloudera  Cloudera  Greenplum  Hortonworks  Pivotal  MapR
                                    Hadoop 1.2  CDH3      CDH4      HD 1.3     HDP-1.3      PHD 1.0  2.1.3
Automatic Deployment                Yes         Yes       Yes       Yes        Yes          Yes      Yes
Scale Out                           Yes         Yes       Yes       Yes        Yes          Yes      Yes
Data/Compute Separation             Yes         Yes       Yes       Yes        Yes          Yes      Yes
Compute-only                        Yes         Yes       Yes       Yes        Yes          Yes      No
Elastic Scaling of Compute Nodes    Yes         Yes       Yes       Yes        Yes          No       No
Hadoop Configuration                Yes         Yes       Yes       Yes        Yes          Yes      No
Hadoop Topology Configuration       Yes         Yes       Yes       Yes        Yes          Yes      No
Run Hadoop Commands from the CLI    Yes         No        No        Yes        Yes          No       No
Hadoop Virtualization Extensions    Yes         No        No        Yes        Yes          Yes      No
vSphere HA                          Yes         Yes       Yes       Yes        Yes          Yes      Yes
Service Level vSphere HA            Yes         Yes       See note  Yes        Yes          No       No
vSphere FT                          Yes         Yes       Yes       Yes        Yes          Yes      Yes

Note: For Cloudera CDH4, see "About Service Level vSphere HA for Cloudera CDH4" below.

About Service Level vSphere HA for Cloudera CDH4

Cloudera CDH4 offers the following support for Service Level vSphere HA.

■ Cloudera CDH4 using MapReduce v1 provides service level vSphere HA support for JobTracker, NameNode, and ResourceManager.


■ Cloudera CDH4 using MapReduce v2 (YARN) provides its own service level HA support.

Hadoop Feature Support By Distribution

Each Hadoop distribution and version provides differing feature support. Learn which Hadoop distributions support which features.

Hadoop Features

The following table illustrates which Hadoop distributions support which features.

Table 2-2. Hadoop Feature Support

Feature              Apache      Cloudera  Cloudera  Greenplum  Hortonworks  Pivotal  MapR
                     Hadoop 1.2  CDH3      CDH4      GPHD 1.2   HDP 1.3      PHD 1.0  2.1.3
HDFS1                Yes         Yes       No        Yes        No           No       No
HDFS2                No          No        Yes       No         Yes          Yes      No
MapReduce v1         Yes         Yes       Yes       Yes        Yes          No       Yes
MapReduce v2 (YARN)  No          No        Yes       No         No           Yes      No
Pig                  Yes         Yes       Yes       Yes        Yes          Yes      Yes
Hive                 Yes         Yes       Yes       Yes        Yes          Yes      Yes
Hive Server          Yes         Yes       Yes       Yes        Yes          Yes      Yes
HBase                Yes         Yes       Yes       Yes        Yes          Yes      Yes
ZooKeeper            Yes         Yes       Yes       Yes        Yes          Yes      Yes


3 Getting Started with Big Data Extensions

VMware vSphere Big Data Extensions allows you to quickly and easily deploy Hadoop and HBase clusters. The tasks in this chapter describe how to set up vSphere for use with Big Data Extensions, deploy the Big Data Extensions vApp, access the vCenter Server and command line (CLI) administrative consoles, and configure a Hadoop distribution for use with Big Data Extensions.

Procedure

1 System Requirements for Big Data Extensions on page 15
  Before you begin the Big Data Extensions deployment tasks, make sure that your system meets all of the prerequisites.

2 Deploy the Big Data Extensions vApp in the vSphere Web Client on page 17
  Deploying the Big Data Extensions vApp is the first step in getting your Hadoop cluster up and running on vSphere Big Data Extensions.

3 Install the Big Data Extensions Plug-in on page 19
  To enable the Big Data Extensions user interface for use with a vCenter Server system, you need to register it with the vSphere Web Client.

4 Connect to a Serengeti Management Server on page 21
  You must connect the Big Data Extensions plug-in to the Serengeti Management Server you want to use before you can use it. Connecting to the Serengeti Management Server lets you manage and monitor Hadoop and HBase distributions deployed within the server instance.

5 Install the Serengeti Remote Command-Line Interface Client on page 22
  Although the Big Data Extensions Plug-in for vSphere Web Client supports basic resource and cluster management tasks, you can perform a greater number of the management tasks using the Serengeti Command-Line Interface Client.

6 About Hadoop Distribution Deployment Types on page 23
  You can choose which Hadoop distribution to use when you deploy a cluster. The type of distribution you choose determines how you configure it for use with Big Data Extensions.

7 Configure a Tarball-Deployed Hadoop Distribution on page 23
  When you deploy the Big Data Extensions vApp, the Apache 1.2.1 Hadoop distribution is included in the OVA that you download and deploy. You can add and configure other Hadoop distributions using the command line. You can configure multiple Hadoop distributions from different vendors.

8 Configuring Yum and Yum Repositories on page 25
  You can deploy Cloudera CDH4, Pivotal PHD, and MapR Hadoop distributions using Yellowdog Updater Modified (Yum). Yum allows automatic updates and package management of RPM-based distributions.


9 Add a Hadoop Distribution Name and Version to the Create Cluster Dialog Box on page 33
  You can add Hadoop distributions other than those pre-configured for use upon installation so that you can easily select them in the user interface to create new clusters.

10 Install CentOS 6.x and VMware Tools on the Hadoop Template Virtual Machine on page 34
  You can create a Hadoop Template virtual machine using a customized version of the CentOS 6.x operating system that includes VMware Tools for CentOS 6.x.

11 Maintain a Customized Hadoop Template Virtual Machine on page 37
  You can modify or update the Hadoop Template virtual machine operating system. When you make such updates, you must remove the snapshot that is automatically created by the virtual machine.

12 Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client on page 38
  You can access the Serengeti Command-Line Interface using the Serengeti Remote Command-Line Interface Client. The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to deploy, manage, and use Hadoop.

13 Create a New Username and Password for the Serengeti Command-Line Interface on page 39
  The Serengeti Command-Line Interface Client uses the vCenter Server login credentials with read permissions on the Serengeti Management Server. Otherwise, the Serengeti Command-Line Interface uses only the default vCenter Server administrator credentials.

14 Create a New Password for the Serengeti Management Server on page 40
  Starting the Serengeti Management Server generates a random root user password. You can change that password and other operating system passwords using the virtual machine console.

15 Start and Stop Serengeti Services on page 40
  You can stop and start Serengeti services to make a reconfiguration take effect, or to recover from an operational anomaly.

16 Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client on page 41
  To perform troubleshooting or to run your own management automation scripts, log in to Hadoop master, worker, and client nodes with password-less SSH from the Serengeti Management Server using SSH client tools such as SSH, PDSH, ClusterSSH, and Mussh.

17 Change the User Password on all Cluster Nodes on page 41
  You can change the user password for all nodes in a cluster. The user passwords you can change include those of the serengeti and root users.


System Requirements for Big Data Extensions

Before you begin the Big Data Extensions deployment tasks, make sure that your system meets all of the prerequisites.

Big Data Extensions requires that you install and configure vSphere, and that your environment meets minimum resource requirements. You must also make sure that you have licenses for the VMware components of your deployment.

vSphere Requirements
Before you can install Big Data Extensions, you must have set up the following VMware products.

■ Install vSphere 5.0 (or later) Enterprise or Enterprise Plus.

  NOTE The Big Data Extensions graphical user interface is only supported when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, you must perform all administrative tasks using the command-line interface.

■ When installing Big Data Extensions on vSphere 5.1 or later, you must use VMware® vCenter™ Single Sign-On to provide user authentication. When logging in to vSphere 5.1 or later you pass authentication to the vCenter Single Sign-On server, which you can configure with multiple identity sources such as Active Directory and OpenLDAP. On successful authentication, your username and password are exchanged for a security token which is used to access vSphere components such as Big Data Extensions.

■ Enable the vSphere Network Time Protocol on the ESXi hosts. The Network Time Protocol (NTP) daemon ensures that time-dependent processes occur in sync across hosts.

Cluster Settings
Configure your cluster with the following settings.

■ Enable vSphere HA and vSphere DRS.
■ Enable Host Monitoring.
■ Enable Admission Control and set the desired policy. The default policy is to tolerate one host failure.
■ Set the virtual machine restart priority to High.
■ Set virtual machine monitoring to virtual machine and Application Monitoring.
■ Set the Monitoring sensitivity to High.
■ Enable vMotion and Fault Tolerance Logging.
■ Ensure that all hosts in the cluster have Hardware VT enabled in the BIOS.
■ Ensure that the Management Network VMkernel Port has vMotion and Fault Tolerance Logging enabled.

Network Settings
Big Data Extensions deploys clusters on a single network. Virtual machines are deployed with one NIC, which is attached to a specific Port Group. The environment determines how this Port Group is configured and which network backs the Port Group.


Either a vSwitch or vSphere Distributed Switch can be used to provide the Port Group backing a Serengeti cluster. A vDS acts as a single virtual switch across all attached hosts, while a vSwitch is per-host and requires the Port Group to be configured manually.

When configuring your network for use with Big Data Extensions, the following ports must be open as listening ports; example firewall rules follow this list.

■ Ports 8080 and 8443 are used by the Big Data Extensions plug-in user interface and the Serengeti Command-Line Interface Client.
■ Port 22 is used by SSH clients.
■ To prevent having to open a network firewall port to access Hadoop services, log into the Hadoop client node, and from that node you can access your cluster.
■ To connect to the Internet (for example, to create an internal Yum repository from which to install Hadoop distributions), you may use a proxy.
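
The following is a minimal sketch of opening these listening ports, assuming the host firewall is managed with the iptables service on a CentOS or RHEL system (that assumption is ours; adapt the commands to however your environment manages its firewall).

# Allow the Big Data Extensions plug-in UI and Serengeti CLI Client ports, plus SSH
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp --dport 8443 -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
# Persist the rules across restarts (CentOS/RHEL iptables service)
service iptables save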

Direct Attached Storage
Direct Attached Storage should be attached and configured on the physical controller to present each disk separately to the operating system. This configuration is commonly described as Just A Bunch Of Disks (JBOD). You must create VMFS Datastores on Direct Attached Storage using the following disk drive recommendations.

■ 8-12 disk drives per host. The more disk drives per host, the better the performance.
■ 1-1.5 disk drives per processor core.
■ 7,200 RPM Serial ATA disk drives.

Resource Requirements for the vSphere Management Server and Templates
■ Resource pool with at least 27.5GB RAM.
■ 40GB or more (recommended) disk space for the management server and Hadoop template virtual disks.

Resource Requirements for the Hadoop Cluster
■ Datastore free space is not less than the total size needed by the Hadoop cluster, plus swap disks for each Hadoop node that are equal to the memory size requested. (See the worked example after this list.)
■ Network configured across all relevant ESX or ESXi hosts, and has connectivity with the network in use by the management server.
■ HA is enabled for the master node if HA protection is needed. You must use shared storage in order to use HA or FT to protect the Hadoop master node.
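
As a worked example with hypothetical numbers: a cluster of 10 Hadoop nodes, each requesting 200GB of disk and 8GB of memory, needs at least 10 x (200GB + 8GB) = 2,080GB of datastore free space, because every node also requires a swap disk equal to its requested memory size.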

Hardware Requirements
Host hardware is listed in the VMware Compatibility Guide. To run at optimal performance, install your vSphere and Big Data Extensions environment on the following hardware.

■ Dual Quad-core CPUs or greater that have Hyper-Threading enabled. If you can estimate your computing workload, consider using a more powerful CPU.
■ Use High Availability (HA) and dual power supplies for the master node's host machine.


■ 4-8GB of memory per processor core, with 6% overhead for virtualization. (See the worked example after this list.)
■ Use a 1 Gigabit Ethernet interface or greater to provide adequate network bandwidth.
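
As a worked example with hypothetical numbers: a dual quad-core host (8 cores) sized at 4-8GB per core needs roughly 32-64GB of RAM, plus about 6% of that amount (approximately 2-4GB) as virtualization overhead.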

Tested Host and Virtual Machine Support
The following is the maximum host and virtual machine support that has been confirmed to successfully run with Big Data Extensions.

■ 45 physical hosts running a total of 182 virtual machines.
■ 128 virtual ESXi hosts deployed on 45 physical hosts, running 256 virtual machines.

Licensing
You must use a vSphere Enterprise license or above in order to use VMware High Availability (HA) and VMware Distributed Resource Scheduler (DRS).

Deploy the Big Data Extensions vApp in the vSphere Web Client

Deploying the Big Data Extensions vApp is the first step in getting your Hadoop cluster up and running on vSphere Big Data Extensions.

Prerequisites

■ Install and configure vSphere, paying particular attention to the following items.
  ■ Enable the vSphere Network Time Protocol on the ESXi hosts. The Network Time Protocol (NTP) daemon ensures that time-dependent processes occur in sync across hosts.
  ■ When installing Big Data Extensions on vSphere 5.1 or later, you must use vCenter Single Sign-On to provide user authentication.

■ Verify that you have the required licenses. You manage these licenses in the vSphere Web Client or in vCenter Server.
  ■ vSphere license. In a production environment, an enterprise license is required to protect the master node with vSphere High Availability (HA) and Fault Tolerance (FT).
  ■ One license for each Hadoop worker node virtual machine. For example, if a cluster contains 1 master node, 10 worker nodes and 1 client node, 10 licenses are required.

■ Install the Client Integration plug-in before you deploy an OVF template. This plug-in enables OVF deployment on your local filesystem.

  NOTE Depending on the security settings of your browser, you might have to explicitly approve the plug-in when you use it the first time.

■ Download the Big Data Extensions OVA from the VMware download site.

■ Verify that you have at least 40GB available for the OVA. You need additional resources for the Hadoop cluster.

■ Ensure that you know the vCenter Single Sign-On Look-up Service URL for your vCenter Single Sign-On service.

  If you are installing Big Data Extensions on vSphere 5.1 or later, ensure that your environment includes vCenter Single Sign-On. You must use vCenter Single Sign-On to provide user authentication on vSphere 5.1 or later.

See “System Requirements for Big Data Extensions,” on page 15 for a complete list.


Procedure

1 In the vSphere Web Client, select Actions > All vCenter Actions > Deploy OVF Template

2 Specify the location where the source OVF template resides.

Option            Description
Deploy from File  Browse your file system for an OVF or OVA template.
Deploy from URL   Type a URL to an OVF template located on the Internet. Example: http://vmware.com/VMTN/appliance.ovf

3 View the OVF Template Details page and click Next.

4 Accept the license agreement and click Next.

5 Specify a name for the vApp, select a target datacenter for the OVA, and click Next.

The only valid characters for vApp names are alphanumeric and underscores. The vApp name cannot exceed 80 characters.

6 Select a vSphere resource pool for the OVA and click Next.

You must select a top-level resource pool. Child resource pools are not supported.

7 Select shared storage for the OVA if possible and click Next.

If shared storage is not available, local storage is acceptable.

8 Leave DHCP, the default network setting, or select static IP and provide the network settings for your environment and click Next.

An IPv4 address is made of 4 numbers separated by dots as in aaa.bbb.ccc.ddd, where each number ranges from 0 to 255.

■ If using static IP, the IP range format is aaa.bbb.ccc.ddd-iii.jjj.kkk.lll. You can also type a single IP address.

  For example, to specify a range of IP addresses that belong to the same subnet, type 192.168.1.1-192.168.1.254.

■ If you use a static IP address, enter a valid netmask. For example, 255.255.255.0.

■ (Optional) Enter the IP address of the DNS server.

  If the vCenter Server or any ESXi host is resolved using an FQDN, you must enter a DNS address.

NOTE You cannot use a shared IP pool with Big Data Extensions.

9 Make sure the Initialize Resources check box is checked and click Next.

If the check box is checked, the resource pool, datastore, and network assigned to the vApp are added to the Big Data Extensions server for use by the Hadoop cluster you create. If the check box is unchecked, the resource pool, datastore, and network connection assigned to the vApp will not be added to Big Data Extensions for use by Hadoop clusters.

If you choose not to automatically add the resource pool, datastore, and network when deploying the vApp, you must use the vSphere Web Client or the Serengeti Command-Line Interface Client to specify resource pool, datastore, and network information before creating a Hadoop cluster.

10 If you are deploying Big Data Extensions on vSphere 5.1 or later, provide the vCenter Single Sign-On Look-up Service URL for your vCenter Single Sign-On service. The Serengeti Management Server, vSphere Web Client, and Serengeti Command-Line Interface Client must use the same SSO service.

The default vCenter Single Sign-On Look-up Service URL is: https://SSO_FQDN_or_IP:7444/lookupservice/sdk


where:

SSO_FQDN_or_IP  The Fully Qualified Domain Name (FQDN) or IP address of the vCenter Single Sign-On service. If you use the embedded vCenter Single Sign-On service, the SSO FQDN or IP address is the same as the vCenter Server FQDN or IP address.

7444            The default vCenter Single Sign-On HTTPS port number. If you used a different port number when you installed vCenter Single Sign-On, use that port number.
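
For example, assuming an embedded vCenter Single Sign-On service on a vCenter Server with the hypothetical host name vc01.example.com that uses the default port, the URL is https://vc01.example.com:7444/lookupservice/sdk.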

The vCenter Server starts deploying the Big Data Extensions vApp. When deployment completes, two virtual machines are available in the vApp.

■ The Management Server virtual machine (also referred to as the Serengeti Management Server), which is started as part of the OVA deployment.
■ The Hadoop Template virtual machine, which is not started. Big Data Extensions clones Hadoop nodes from this template when provisioning a cluster. You should not start or stop this virtual machine without good reason. No Hadoop distribution is included in the template.

What to do next

Install the Big Data Extensions plug-in within the vSphere Web Client. See "Install the Big Data Extensions Plug-in," on page 19.

If you did not leave the Initialize Resources check box checked, you must add resources to the Big Data Extensions server prior to creating a Hadoop cluster.

Install the Big Data Extensions Plug-in

To enable the Big Data Extensions user interface for use with a vCenter Server system, you need to register it with the vSphere Web Client.

The Big Data Extensions plug-in provides a graphical user interface that integrates with the vSphere Web Client. Using the Big Data Extensions plug-in interface you can perform common Hadoop infrastructure and cluster management tasks.

NOTE Use only the Big Data Extensions plug-in interface in the vSphere Web Client, or the Serengeti Command-Line Interface Client, to monitor and manage your Big Data Extensions environment. Performing management operations in vCenter Server may cause the Big Data Extensions management tools to become un-synchronized, and unable to accurately report the operational status of your Big Data Extensions environment.

Prerequisites

■ Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

■ Log in to the system on which the vSphere 5.1 Web Client is installed.

  The Big Data Extensions graphical user interface is only supported when using vSphere Web Client 5.1 and later. If you install Big Data Extensions on vSphere 5.0, you must perform all administrative tasks using the Serengeti Command-Line Interface Client.

■ Ensure that you have login credentials with administrator privileges for the vCenter Server system with which you are registering Big Data Extensions.

■ If you want to use the vCenter Server IP address to access the vSphere Web Client and your browser uses a proxy, add the server IP address to the list of proxy exceptions.


Procedure

1 Open a Web browser and go to the URL of the vSphere Web Client: https://hostname:port/vsphere-client.

The hostname can be either the DNS hostname or IP address of vCenter Server. By default the port is 9443, but this may have been changed during installation of the vSphere Web Client.
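
For example, assuming a vCenter Server with the hypothetical DNS hostname vcenter.example.com and the default port, the URL is https://vcenter.example.com:9443/vsphere-client.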

The vSphere Web Client login screen displays.

2 Type the username and password with administrative privileges that have permissions on vCenter Server, and click Login.

3 Using the vSphere Web Client Navigator panel, locate the Serengeti Management Server that you want to register with the plug-in.

You can find the Serengeti Management Server under the datacenter and resource pool into which you deployed it in the previous task.

4 Select management-server in the inventory tree to display information about the object in the center pane, and click the Summary tab in the center pane to access additional information.

5 Record the IP address of the Management Server virtual machine.

6 Open a Web browser and go to the URL of the management-server virtual machine: https://management-server-ip-address:8443/register-plugin

The management-server-ip-address is the IP address you recorded in step 5.
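
For example, if the IP address you recorded is 192.0.2.10 (a hypothetical address), the URL is https://192.0.2.10:8443/register-plugin.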

The Register Big Data Extensions Plug-in Web page displays.

7 Specify the information to register the plug-in.

Option                                  Description
Register or Unregister Radio Button     Select the Register radio button to install the plug-in. Select Unregister to uninstall the plug-in.
vCenter Server host name or IP address  Type the server host name or IP address of vCenter Server. NOTE Do not include http:// or https:// when you type the host name or IP address.
User Name and Password                  Type the user name and password with administrative privileges that you use to connect to vCenter Server.
Big Data Extensions Package URL         The URL with the IP address of the management-server virtual machine where the Big Data Extensions plug-in is installed: https://management-server-ip-address/vcplugin/serengeti-plugin.zip

8 Click Submit.

The Big Data Extensions plug-in registers with vCenter Server and the vSphere Web Client.

9 Log out of the vSphere Web Client, and log back in using your vCenter Server user name and password.

The Big Data Extensions icon now appears in the list of objects in the inventory.

10 From the Inventory pane, click Big Data Extensions.

The Big Data Extensions home page appears.

What to do next

You can now connect the Big Data Extensions plug-in to the Big Data Extensions instance you want to manage by connecting to its Serengeti Management Server.


Connect to a Serengeti Management Server

You must connect the Big Data Extensions plug-in to the Serengeti Management Server you want to use before you can use it. Connecting to the Serengeti Management Server lets you manage and monitor Hadoop and HBase distributions deployed within the server instance.

You can deploy multiple instances of the Serengeti Management Server in your environment; however, you can only connect the Big Data Extensions plug-in with one Serengeti Management Server instance at a time. You can change which Serengeti Management Server instance the plug-in connects to using this procedure, and use the Big Data Extensions plug-in interface to manage and monitor multiple Hadoop and HBase distributions deployed in your environment.

NOTE The Serengeti Management Server that you connect to is shared by all users of the Big Data Extensions plug-in interface within the vSphere Web Client. If a user connects to a different Serengeti Management Server, all other users will be affected by this change.

Prerequisites

■ Verify that the Big Data Extensions vApp deployment was successful and that the Serengeti Management Server virtual machine is running. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

  The version of the Serengeti Management Server and the Big Data Extensions plug-in must be the same.

■ Ensure that vCenter Single Sign-On is properly enabled and configured for use by Big Data Extensions for vSphere 5.1 and later. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

■ Install the Big Data Extensions plug-in. See "Install the Big Data Extensions Plug-in," on page 19.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 Click the Summary tab.

4 Click the Connect Server link in the Connected Server dialog.

The Connect to Serengeti Server dialog box displays.

5 Navigate to the Serengeti Management Server virtual machine within the Big Data Extensions vApp that you want to connect to, select it, and click OK to confirm your selection.

The Big Data Extensions plug-in communicates using SSL with the Serengeti Management Server. When you connect to a Serengeti server instance, the plug-in verifies that the SSL certificate in use by the server is correctly installed, valid, and trusted.

The Serengeti server instance appears in the list of connected servers viewable in the Summary tab of the Big Data Extensions Home.

What to do next

You can add additional datastore and network resources to your Big Data Extensions deployment, and then create Hadoop and HBase clusters that you can then provision for use.


Install the Serengeti Remote Command-Line Interface Client

Although the Big Data Extensions Plug-in for vSphere Web Client supports basic resource and cluster management tasks, you can perform a greater number of the management tasks using the Serengeti Command-Line Interface Client.

NOTE If you are using Cloudera CDH3 or CDH4, some Hadoop operations cannot be run from the Serengeti Command-Line Interface Client due to incompatible protocols between Cloudera Impala and CDH3 and CDH4 distributions. If you wish to run Hadoop administrative commands using the command line (such as fs, mr, pig, and hive), use a Hadoop client node to issue these commands.

Prerequisites

■ Verify that the Big Data Extensions vApp deployment was successful and that the Management Server is running.

■ Verify that you have the correct user name and password to log into the Serengeti Command-line Interface Client.
  ■ If you are deploying on vSphere 5.1 or later, the Serengeti Command-line Interface Client uses your vCenter Single Sign-On credentials.
  ■ If you are deploying on vSphere 5.0, the Serengeti Command-line Interface Client uses the default vCenter Server administrator credentials.

■ Verify that the Java Runtime Environment (JRE) is installed in your environment, and that its location is in your PATH environment variable.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 In the Getting Started tab, click the Download Serengeti CLI Console link.

A ZIP file containing the Serengeti Command-line Interface Client downloads to your computer.

4 Unzip and examine the download, which includes the following components in the cli directory.

■ The serengeti-cli-version JAR file, which includes the Serengeti Command-line Interface Client.

■ The samples directory, which includes sample cluster configurations.

  See the VMware vSphere Big Data Extensions Command-line Interface Guide for more information.

■ Libraries in the lib directory.

5 Open a command shell, and navigate to the directory where you unzipped the Serengeti Command-line Interface Client download package. Change to the cli directory, and run the following command to start the Serengeti Command-line Interface Client:

java -jar serengeti-cli-version.jar
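
For example, if the JAR file in your download is named serengeti-cli-1.0.0.jar (the version number here is hypothetical; use the file name from your own download):

cd cli
java -jar serengeti-cli-1.0.0.jar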

What to do next

To learn more about using the Serengeti Command-line Interface Client, see the VMware vSphere Big Data Extensions Command-line Interface Guide.


About Hadoop Distribution Deployment Types

You can choose which Hadoop distribution to use when you deploy a cluster. The type of distribution you choose determines how you configure it for use with Big Data Extensions.

Depending on which Hadoop distribution you want to configure for use with Big Data Extensions, you will use either a tarball or Yum repository to install your chosen distribution. The table below lists all supported Hadoop distributions, and the distribution name, vendor abbreviation, and version number to use as an input parameter when configuring them for use with Big Data Extensions.

Table 3-1. Hadoop Deployment Types

Hadoop Distribution           Version Number  Distribution Name  Vendor Abbreviation  Deployment Type  HVE Support?
Apache                        1.2.1           apache             Apache               Tarball          Yes
Greenplum HD                  1.2             gphd               GPHD                 Tarball          Yes
Pivotal HD                    1.0             phd                PHD                  Tarball or Yum   Yes
Hortonworks Data Platform     1.3             hdp                HDP                  Tarball          Yes
Cloudera CDH3 Update 5 or 6   3.5.0 or 3.6.0  cdh3               CDH                  Tarball          No
Cloudera CDH4.3 MapReduce v1  4.3.0           cdh4               CDH                  Yum              No
Cloudera CDH4.3 MapReduce v2  4.3.0           cdh4               CDH                  Yum              No
MapR                          2.1.3           mapr               MAPR                 Yum              No

About Hadoop Virtualization Extensions

VMware’s Hadoop Virtualization Extensions (HVE) improve Hadoop performance in virtual environments by enhancing Hadoop’s topology awareness mechanism to account for the virtualization layer.

Configure a Tarball-Deployed Hadoop Distribution

When you deploy the Big Data Extensions vApp, the Apache 1.2.1 Hadoop distribution is included in the OVA that you download and deploy. You can add and configure other Hadoop distributions using the command line. You can configure multiple Hadoop distributions from different vendors.

In most cases, you download Hadoop distributions from the Internet. If you are behind a firewall you may need to modify your proxy settings to allow the download. Refer to your Hadoop vendor's Web site to obtain the download URLs to use for the different components you wish to install. Prior to installing and configuring a tarball-based deployment, ensure that you have the vendor's URLs from which to download the different Hadoop components you want to install. You will use these URLs as input parameters to the config-distro.rb configuration utility.

If you have a local Hadoop distribution and your server does not have access to the Internet, you can manually upload the distribution. See the VMware vSphere Big Data Extensions Command-Line Interface Guide.

Prerequisites

■ Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.


■ Review the different Hadoop distributions so you know which distribution name abbreviation, vendor name abbreviation, and version number to use as an input parameter, and whether or not the distribution supports Hadoop Virtualization Extension (HVE). See "About Hadoop Distribution Deployment Types," on page 23.

■ (Optional) Set the password for the Serengeti Management Server. See "Create a New Password for the Serengeti Management Server," on page 40.

Procedure

1 Using PuTTY or another SSH client, log in to the Serengeti Management Server virtual machine as the user serengeti.

2 Run the /opt/serengeti/sbin/config-distro.rb Ruby script, as follows.

config-distro.rb --name distro_name --vendor vendor_name --version version_number --hadoop hadoop_package_url --pig pig_package_url --hive hive_package_url --hbase hbase_package_url --zookeeper zookeeper_package_URL --hve {true | false}

Option                Description
--name                Name of the Hadoop distribution you want to download. For example, hdp for Hortonworks.
--vendor              Vendor name whose Hadoop distribution you want to use. For example, HDP for Hortonworks.
--version             Version of the Hadoop distribution you want to use. For example, 1.3.
--hadoop              Specifies the URL to use to download the Hadoop distribution tarball package. This is the URL from which you can download the software from your Hadoop vendor's Web site.
--pig                 Specifies the URL to use to download the Pig distribution tarball package. This is the URL from which you can download the software from your Hadoop vendor's Web site.
--hive                Specifies the URL to use to download the Hive distribution tarball package. This is the URL from which you can download the software from your Hadoop vendor's Web site.
--hbase               (Optional) Specifies the URL to use to download the HBase distribution tarball package. This is the URL from which you can download the software from your Hadoop vendor's Web site.
--zookeeper           (Optional) Specifies the URL to use to download the ZooKeeper distribution tarball package. This is the URL from which you can download the software from your Hadoop vendor's Web site.
--hve {true | false}  (Optional) Specifies whether or not the Hadoop distribution supports HVE.

The example below downloads the tarball version of Hortonworks Data Platform (HDP), which consists of Hortonworks Hadoop, Hive, HBase, Pig, and ZooKeeper distributions. Note that you must provide the download URL for each of the software components you wish to configure for use with Big Data Extensions.

config-distro.rb --name hdp --vendor HDP --version 1.0.7 --hadoop http://public-repo-1.hortonworks.com/HDP-1.0.7/repos/centos5/tars/hadoop-1.0.2.tar.gz --pig http://public-repo-1.hortonworks.com/HDP-1.0.7/repos/centos5/tars/pig-0.9.2.tar.gz --hive http://public-repo-1.hortonworks.com/HDP-1.0.7/repos/centos5/tars/hive-0.8.1.tar.gz --hbase http://public-repo-1.hortonworks.com/HDP-1.0.7/repos/centos5/tars/hbase-0.92.0.tar.gz --zookeeper http://public-repo-1.hortonworks.com/HDP-1.0.7/repos/centos5/tars/zookeeper-3.3.4.tar.gz --hve false

The script downloads the files.


3 When the download completes, explore the newly created /opt/serengeti/www/distros directory, which includes the following directories and files.

Item              Description
name              Directory that is named after the distribution. For example, apache.
manifest          Manifest file generated by config-distro.rb that is used to download the Hadoop distribution.
manifest.example  Example manifest file. This file is available before you perform the download. The manifest file is a JSON file with three sections: name, version, and packages. The VMware vSphere Big Data Extensions Command-Line Interface Guide includes information about the manifest.

4 To enable Big Data Extensions to use the added distribution, open a command shell on the Serengeti Management Server (on the Serengeti server itself, not from the Serengeti Command-Line Interface Client), and restart the Tomcat service.

sudo service tomcat restart

The Serengeti Management Server reads the revised manifest file, and adds the distribution to those from which you can create a cluster.

5 Return to the Big Data Extensions Plug-in for vSphere Web Client, and click Hadoop Distributions to verify that the Hadoop distribution is available for use to create a cluster.

The distribution and the corresponding role are displayed.

The distribution is added to the Serengeti Management Server, but is not installed in the Hadoop Template virtual machine. An agent is pre-installed on each virtual machine; during the Hadoop cluster creation process it copies the distribution components that you specify from the Serengeti Management Server to the nodes.

What to do next

You can create and deploy Hadoop or HBase clusters using your chosen Hadoop distribution. See "Create a Hadoop or HBase Cluster in the vSphere Web Client," on page 49.

You can add additional datastore and network resources for use by the Hadoop clusters you intend to create. See Chapter 4, "Managing vSphere Resources for Hadoop and HBase Clusters," on page 43.

Configuring Yum and Yum Repositories

You can deploy Cloudera CDH4, Pivotal PHD, and MapR Hadoop distributions using Yellowdog Updater Modified (Yum). Yum allows automatic updates and package management of RPM-based distributions.

Procedure

1 Yum Repository Configuration Values on page 26
  You use the Yum repository configuration values in a file that you create to update the Yum repositories used to install or update Hadoop software on CentOS and other operating systems that make use of Red Hat Package Manager (RPM).

2 Create a Local Yum Repository for Cloudera and MapR Hadoop Distributions on page 27
  Although publicly available Yum repositories exist for Cloudera and MapR distributions, creating your own Yum repository can result in better access and greater control over the repository.

3 Create a Local Yum Repository for Pivotal Hadoop Distributions on page 29
  Pivotal does not provide a publicly available Yum repository. Creating your own Yum repository for Pivotal provides you with better access and control over installing and updating your Pivotal HD distribution software.


4 Configure a Yum-Deployed Cloudera or MapR Hadoop Distribution on page 31
  You can install Hadoop distributions that use Yum repositories (as opposed to a tarball) for use with Big Data Extensions. When you create a cluster for a Yum-deployed Hadoop distribution, the Hadoop nodes download Red Hat Package Manager (RPM) packages from the distribution's official Yum repositories.

5 Configure a Yum-Deployed Pivotal Hadoop Distribution on page 32
  You can install Hadoop distributions that use Yum repositories (as opposed to tarballs) for use with Big Data Extensions. When you create a cluster for a Yum-deployed Hadoop distribution, the Hadoop nodes download Red Hat Package Manager (RPM) packages from the distribution's official Yum repositories.

Yum Repository Configuration Values

You use the Yum repository configuration values in a file that you create to update the Yum repositories used to install or update Hadoop software on CentOS and other operating systems that make use of Red Hat Package Manager (RPM).

About Yum Repository Configuration Values

To create a local Yum repository you create a configuration file that identifies the specific file and package names you wish to download and deploy. When you create the configuration file you replace a set of placeholder values with those that correspond to your Hadoop distribution. The table below lists the values you need to use for Cloudera CDH4, MapR, and Pivotal distributions.

Table 3-2. Yum Repository Placeholder Values

repo_file_name
  MapR: mapr-m5.repo
  CDH4: cloudera-cdh4.repo
  Pivotal: phd.repo

package_info
  MapR:
    [maprtech]
    name=MapR Technologies
    baseurl=http://package.mapr.com/releases/v2.1.3/redhat/
    enabled=1
    gpgcheck=0
    protect=1
    [maprecosystem]
    name=MapR Technologies
    baseurl=http://package.mapr.com/releases/ecosystem/redhat
    enabled=1
    gpgcheck=0
    protect=1
    If you use a version other than v2.1.3, substitute the appropriate version number in the path name.
  CDH4:
    [cloudera-cdh4]
    name=Cloudera's Distribution for Hadoop, Version 4
    baseurl=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.2.2/
    gpgkey=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
    gpgcheck=1
    If you use a version other than 4.2.2, substitute the appropriate version number in the path name.
  Pivotal: Not applicable

mirror_cmds
  MapR:
    reposync -r maprtech
    reposync -r maprecosystem
  CDH4:
    reposync -r cloudera-cdh4
  Pivotal: Not applicable

default_rpm_dir
  MapR: maprtech and maprecosystem
  CDH4: cloudera-cdh4/RPMS
  Pivotal: pivotal


target_rpm_dir
  MapR: mapr/2
  CDH4: cdh/4
  Pivotal: phd/1

local_repo_info
  MapR:
    [mapr-m5]
    name=MapR Version 2
    baseurl=http://ip_of_webserver/mapr/2/
    enabled=1
    gpgcheck=0
    protect=1
  CDH4:
    [cloudera-cdh4]
    name=Cloudera's Distribution for Hadoop, Version 4
    baseurl=http://ip_of_webserver/cdh/4/
    enabled=1
    gpgcheck=0
  Pivotal:
    [pivotalhd]
    name=PHD Version 1.0
    baseurl=http://ip_of_webserver/phd/1/
    enabled=1
    gpgcheck=0
  In each case, replace ip_of_webserver with the IP address of the Web server you have configured for use as your Yum repository.

Create a Local Yum Repository for Cloudera and MapR Hadoop Distributions

Although publicly available Yum repositories exist for Cloudera and MapR distributions, creating your own Yum repository can result in better access and greater control over the repository.

Prerequisites

■ High-speed Internet access with which to download Red Hat Package Manager (RPM) packages from your chosen Hadoop distribution's official Yum repository to the local repository you create.

■ CentOS 5.x 64 bit or Red Hat Enterprise Linux (RHEL) 5.x 64 bit.

  NOTE If the Hadoop distribution you want to use requires the use of a 64 bit CentOS 6.x or Red Hat Enterprise Linux (RHEL) 6.x operating system, you must use a 64 bit CentOS 6.x or RHEL 6.x operating system to create your Yum repository.

■ An HTTP server with which to create the Yum repository. For example, Apache Lighttpd.

■ Ensure that any firewall in use does not block the network port number used by your HTTP server proxy. Typically this is port 80.

■ Refer to the Yum repository placeholder values to populate the necessary variables required in the steps below. See "Yum Repository Configuration Values," on page 26.

Procedure

1 If your Yum server requires an HTTP proxy server to connect to the Internet, you must set the http_proxy environment variable. Open a Bash shell terminal, and run the following commands to export the http_proxy environment variable.

# switch to root user

sudo su

export http_proxy=http://proxy_server:port

Option Description

proxy_server The IP address or domain name of the proxy server.

port The network port number to use with the proxy server.
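
For example, assuming a proxy server with the hypothetical name proxy.example.com listening on port 3128:

export http_proxy=http://proxy.example.com:3128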


2 Install the Web server you want to use as a Yum server.

Install the Apache Web Server and enable the httpd server to start whenever the machine is restarted.

yum install -y httpd

/sbin/service httpd start

/sbin/chkconfig httpd on

3 If they are not already installed, install the Yum utils and createrepo packages.

The Yum utils package includes the reposync command.

yum install -y yum-utils createrepo

4 Synchronize the Yum server with the official Yum repository of your preferred Hadoop vendor.

a Using a text editor, create the file /etc/yum.repos.d/repo_file_name.

b Add the package_info content to the new file.

c Mirror the remote Yum repository to the local machine by running the mirror_cmds for your distribution packages.

Depending on your network bandwidth, it might take several minutes to download the RPMs from the remote repository. The RPMs are placed in the default_rpm_dir directories.
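For example, a minimal sketch of step 4 for the CDH4 distribution, assuming the repo file is named cloudera-cdh4.repo and using the package_info shown in Table 3-2 (adjust the names and version for your distribution):

cat > /etc/yum.repos.d/cloudera-cdh4.repo <<'EOF'
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4.2.2/
gpgkey=http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck=1
EOF

# mirror the remote repository into the local cloudera-cdh4/ directory
reposync -r cloudera-cdh4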

5 Create the local Yum repository.

a Move the RPMs to a new directory under the Apache Web Server document root.

The default Apache document root is /var/www/html/. If you use the Serengeti Management Server as your Yum server machine, the document root is /opt/serengeti/www/.

doc_root=/var/www/html   # or /opt/serengeti/www

mkdir -p $doc_root/target_rpm_dir

mv default_rpm_dir/* $doc_root/target_rpm_dir/

The wildcard (default_rpm_dir/*) designation means that you should include every directory listed in your distribution's default_rpm_dir value.

For example, for the MapR Hadoop distribution, run the following mv command:

mv maprtech/ maprecosystem/ $doc_root/mapr/2/

b Create a Yum repository for the RPMs.

cd $doc_root/target_rpm_dir

createrepo .

c Create a new file, $doc_root/target_rpm_dir/repo_file_name, and include the local_repo_info.

d From a different machine, ensure that you can download the repo file from http://ip_of_webserver/target_rpm_dir/repo_file_name.
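As an illustration of steps 5c and 5d for CDH4, again assuming the file name cloudera-cdh4.repo and using the local_repo_info shown in Table 3-2:

cat > $doc_root/cdh/4/cloudera-cdh4.repo <<'EOF'
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://ip_of_webserver/cdh/4/
enabled=1
gpgcheck=0
EOF

# from a different machine, verify that the repo file is reachable over HTTP
wget http://ip_of_webserver/cdh/4/cloudera-cdh4.repo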

6 If the virtual machines created by the Serengeti Management Server need an HTTP proxy to connect to the local Yum repository, configure an HTTP proxy.

a On the Serengeti Management Server, edit the /opt/serengeti/conf/serengeti.properties file and add the following content anywhere in the file:

# set http proxy server

serengeti.http_proxy = http://<proxy_server:port>

# set the IPs of the Serengeti Management Server and the local Yum repository servers for
# 'serengeti.no_proxy'. The wildcard for matching multiple IPs does not work.

serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.


Create a Local Yum Repository for Pivotal Hadoop Distributions

Pivotal does not provide a publicly available Yum repository. Creating your own Yum repository for Pivotal provides you with better access and control over installing and updating your Pivotal HD distribution software.

Pivotal does not provide a publicly accessible Yum repository from which you can easily deploy and upgrade the Pivotal Hadoop software distribution. For this reason, you may wish to download the Pivotal software tarballs and create your own Yum repository from which to deploy and configure the Pivotal Hadoop software.

Prerequisites

n High-speed Internet access with which to download Red Hat Package Manager (RPM) packages from the Pivotal Web site.

n The Yum server you create must use a CentOS 6.x 64-bit operating system.

Since the Pivotal Hadoop distribution requires the use of CentOS 6.2 or 6.4 in a 64-bit version (x86_64), the Yum server you create to deploy it must also use a CentOS 6.x 64-bit operating system.

n Any HTTP server with which to create the Yum repository. For example, Apache or lighttpd.

n Ensure that any firewall in use does not block the network port number used by your HTTP server proxy. Typically this is port 80.

Procedure

1 If your Yum server requires an HTTP proxy server to connect to the Internet, you must set the http_proxy environment variable. Open a Bash shell terminal, and run the following commands to export the http_proxy environment variable.

# switch to root user

sudo su

export http_proxy=http://proxy_server:port

Option Description

proxy_server The IP address or domain name of the proxy server.

port The network port number to use with the proxy server.

2 Install the Web server you want to use as a Yum server.

Install the Apache Web Server and enable the httpd server to start whenever the machine is restarted.

yum install -y httpd

/sbin/service httpd start

/sbin/chkconfig httpd on

3 If they are not already installed, install the Yum utils and createrepo packages.

The Yum utils package includes the reposync command.

yum install -y yum-utils createrepo

4 Download the Pivotal HD 1.0 tarball from the Pivotal Web site.

5 Extract the tarball you downloaded.

tar -xf phd_1.0.1.0-19_community.tar


6 Extract PHD_1.0.1_CE/PHD-1.0.1.0-19.tar into the default_rpm_dir directory. For Pivotal Hadoop, the default_rpm_dir directory is pivotal.

NOTE The version numbers of the tar file you extract may be different from those used in the example if there has been an update.

mkdir -p pivotal
tar -xf PHD_1.0.1_CE/PHD-1.0.1.0-19.tar -C pivotal

7 Create and configure the local Yum repository.

a Move the RPMs to a new directory under the Apache Web Server document root.

The default Apache document root is /var/www/html/. If you use the Serengeti Management Server as your Yum server machine, the document root is /opt/serengeti/www/.

doc_root=/var/www/html   # or /opt/serengeti/www

mkdir -p $doc_root/target_rpm_dir

mv default_rpm_dir/* $doc_root/target_rpm_dir/

The wildcard (default_rpm_dir/*) designation means that you should include every directory listed in your distribution's default_rpm_dir value.

For example, for the Pivotal Hadoop distribution, run the following mv command:

mv pivotal/ $doc_root/phd/1/

b Create a Yum repository for the RPMs.

cd $doc_root/target_rpm_dir

createrepo .

c Create a new file, $doc_root/target_rpm_dir/repo_file_name, and include the local_repo_info.

d From a different machine, ensure that you can download the repository file (repo_file_name) from http://ip_of_webserver/target_rpm_dir/repo_file_name.
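For example, a sketch of steps 7c and 7d, assuming the repo file is named phd.repo (the name used later with config-distro.rb) and using the pivotalhd local_repo_info from Table 3-2:

cat > $doc_root/phd/1/phd.repo <<'EOF'
[pivotalhd]
name=PHD Version 1.0
baseurl=http://ip_of_webserver/phd/1/
enabled=1
gpgcheck=0
EOF

# from a different machine, verify that the repo file is reachable over HTTP
wget http://ip_of_webserver/phd/1/phd.repo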

8 If the virtual machines created by the Serengeti Management Server need an HTTP proxy to connect to your local Yum repository, configure an HTTP proxy.

a On the Serengeti Management Server, edit the /opt/serengeti/conf/serengeti.properties file, and add the following content anywhere in the file:

# set http proxy server

serengeti.http_proxy = http://<proxy_server:port>

# set the IPs of the Serengeti Management Server and the local Yum repository servers for
# 'serengeti.no_proxy'. The wildcard for matching multiple IPs does not work.

serengeti.no_proxy = 10.x.y.z, 192.168.x.y, etc.

What to do next

n The Pivotal Hadoop distribution must be installed on the 64-bit version of the CentOS 6.x operating system. If you have not already done so, you must create a Hadoop Template virtual machine using CentOS 6.2 or 6.4 as its operating system. See "Install CentOS 6.x and VMware Tools on the Hadoop Template Virtual Machine," on page 34.

n Once you have created a Yum repository for your Pivotal Hadoop software, and a Hadoop Template virtual machine using CentOS 6.2 or 6.4 as its operating system, you can configure your Pivotal Hadoop deployment for use with Big Data Extensions. See "Configure a Yum-Deployed Pivotal Hadoop Distribution," on page 32.


Configure a Yum-Deployed Cloudera or MapR Hadoop Distribution

You can install Hadoop distributions that use Yum repositories (as opposed to a tarball) for use with Big Data Extensions. When you create a cluster for a Yum-deployed Hadoop distribution, the Hadoop nodes download Red Hat Package Manager (RPM) packages from the distribution's official Yum repositories.

Prerequisites

n High-speed Internet access from the Serengeti Management Server in order to download RPM packages from your chosen Hadoop distribution's official Yum repository. If you do not have adequate Internet access to download the RPM packages from your Hadoop vendor's Yum repository, you can create a local Yum repository for your Hadoop distribution.

n Review the different Hadoop distributions so you know which distribution name, vendor abbreviation, and version number to use as an input parameter, and whether or not the distribution supports Hadoop Virtualization Extensions. See "About Hadoop Distribution Deployment Types," on page 23.

n Create a local Yum repository for your Hadoop distribution. Creating your own repository can result in better access and more control over the repository. See "Configuring Yum and Yum Repositories," on page 25.

Procedure

1 Log in to the Serengeti Management Server virtual machine using PuTTY or another SSH client.

2 Run the /opt/serengeti/sbin/config-distro.rb Ruby script, as follows.

config-distro.rb --name distro_name --vendor vendor_abbr --version ver_number --repos http://url_to_yum_repo/repo_file_name.repo

Option Description

--name Name of the Hadoop distribution you want to download. For example, cdh4.

--vendor Abbreviation of the vendor name whose Hadoop distribution you want to use. For example, CDH.

--version Version of the Hadoop distribution software you want to use. For example, 4.3.0.

--repos Specifies the URL to use to download the Hadoop distribution Yum package. This can be a local Yum repository that you create, or a publicly accessible Yum repository hosted by the software vendor.

The example below downloads and configures the Yum deployment of the Cloudera CDH4 Hadoop distribution.

config-distro.rb --name cdh4 --vendor CDH --version 4.3.0 --repos http://url_to_cdh4_yum_repo/cloudera-cdh4.repo

The example below downloads and configures the Yum deployment of the MapR Hadoop distribution.

config-distro.rb --name mapr --vendor MAPR --version 2.1.3 --repos http://url_to_mapr_yum_repo/mapr-m5.repo

3 To enable Big Data Extensions to use the newly installed distribution, open a command shell on the Serengeti Management Server (on the Serengeti server itself, not from the Serengeti Command-Line Interface Client), and restart the Tomcat service.

sudo service tomcat restart

The Serengeti Management Server reads the revised manifest file and adds the distribution to those from which you can create a cluster.


4 Return to the Big Data Extensions Plug-in for vSphere Web Client, and click Hadoop Distributions to verify that the Hadoop distribution is available for use to create a cluster.

The distribution and the corresponding role are displayed.

What to do next

You can now create Hadoop and HBase clusters. See Chapter 5, "Creating Hadoop and HBase Clusters," on page 47.

Configure a Yum-Deployed Pivotal Hadoop Distribution

You can install Hadoop distributions that use Yum repositories (as opposed to tarballs) for use with Big Data Extensions. When you create a cluster for a Yum-deployed Hadoop distribution, the Hadoop nodes download Red Hat Package Manager (RPM) packages from the distribution's official Yum repositories.

The Pivotal Hadoop distribution must be installed on the 64-bit version of the CentOS 6.x operating system. For this reason, you must create a Hadoop Template virtual machine using CentOS 6.2 or 6.4 as its operating system after you deploy the Big Data Extensions vApp.

Prerequisites

n High-speed Internet access from the Serengeti Management Server in order to download RPM packages from the Pivotal Web site.

n Create a local Yum repository from which to deploy and configure the Pivotal software. Pivotal does not provide a publicly available Yum repository, so you must create your own Yum repository for the Pivotal software. See "Create a Local Yum Repository for Pivotal Hadoop Distributions," on page 29.

n Create a Hadoop Template virtual machine using CentOS 6.2 or 6.4 as its operating system. See "Install CentOS 6.x and VMware Tools on the Hadoop Template Virtual Machine," on page 34.

The Pivotal Hadoop distribution must be installed on the 64-bit version of the CentOS 6.x operating system. For this reason, you must create a Hadoop Template virtual machine using CentOS 6.2 or 6.4 as its operating system after you deploy the Big Data Extensions vApp.

Procedure

1 Using PuTTY or another SSH client, log in to the Serengeti Management Server as the serengeti user.

2 Run the /opt/serengeti/sbin/config-distro.rb Ruby script, as follows.

config-distro.rb --name phd --vendor PHD --version 1.0.1 --repos http://url_to_pivotalhd_yum_repo/phd.repo

Option Description

--name Name of the Hadoop distribution you want to download. For Pivotal, type phd.

--vendor Abbreviation of the vendor name whose Hadoop distribution you want to use. For Pivotal, type PHD.

--version Version of the Hadoop distribution you want to use. For example, 1.0.1.

--repos Specifies the URL to use to download the Hadoop distribution Yum package. This is the URL of the Yum server you create from which to deploy the Pivotal software.

The example below downloads the Yum deployment of the Pivotal PHD software.

config-distro.rb --name phd --vendor PHD --version 1.0.1 --repos http://url_to_pivotalhd_yum_repo/phd.repo


3 To enable Big Data Extensions to use the added distribution, open a command shell on the Serengeti Management Server (on the Serengeti server itself, not from the Serengeti Command-Line Interface Client), and restart the Tomcat service.

sudo service tomcat restart

The Serengeti Management Server reads the revised manifest file and adds the distribution to those from which you can create a cluster.

4 Return to the Big Data Extensions Plug-in for vSphere Web Client, and click Hadoop Distributions to verify that the Hadoop distribution is available for use to create a cluster.

The distribution and the corresponding role are displayed.

What to do next

To deploy a Pivotal Hadoop distribution using Big Data Extensions, you must create a Hadoop Template virtual machine using the 64-bit version of CentOS 6.2 or 6.4 as the operating system. See "Install CentOS 6.x and VMware Tools on the Hadoop Template Virtual Machine," on page 34.

Add a Hadoop Distribution Name and Version to the Create Cluster Dialog Box

You can add Hadoop distributions other than those pre-configured for use upon installation so that you can easily select them in the user interface to create new clusters.

You can add a Hadoop distribution vendor name and version number that is not pre-configured for use with Big Data Extensions such that it is listed in the available distributions from which you can create a cluster in the Create New Big Data Cluster dialog box.

NOTE When adding a new Hadoop distribution, ensure that you do not use duplicate vendor name abbreviations. For example, do not use the name CDH twice, as this will result in a naming conflict within the user interface.

Prerequisites

n Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

n Review the different Hadoop distributions so you know which vendor name abbreviation and version number to use as an input parameter. See "About Hadoop Distribution Deployment Types," on page 23.

n Ensure that the vendor name and version number you specified when you added your Hadoop distribution using the config-distro.rb script exactly match the vendor name and version number you add to the user interface.


Procedure

1 Edit the file /opt/serengeti/www/specs/map to include the Hadoop distribution vendor name and version number.

The vendor and version parameters must be the same as those defined in the distribution's manifest file. The type parameter defines the cluster type that will be displayed in the Create New Big Data Cluster dialog box. The path parameter identifies the specification template to use when creating a cluster with this Hadoop distribution. You can select a specification template from /opt/serengeti/www/specs/ for your distribution.

NOTE Apache 1.x, Cloudera CDH 3.x, and Greenplum GPHD 1.x distributions use the specification templates located in the apache1.0 directory.

This example specifies Apache for the vendor, and 1.2.1 for the version number. Note the location of the specification template defined with the path parameter.

{

"vendor" : "Apache",

"version" : "1.2.1",

"type" : "Basic Hadoop Cluster",

"path" : "apache1.0/basic/spec.json"

}

This example specifies CDH (Cloudera) for the vendor, and 4.2.2 for the version number. Unlike the above example, the specification file is located in the directory cdh4/basic/.

{

"vendor" : "CDH",

"version" : "4.2.2",

"type" : "Basic Hadoop Cluster",

"path" : "cdh4/basic/spec.json"

}

NOTE The values you specify for the vendor and version parameters must exactly match those you specified using the config-distro.rb script, otherwise the Hadoop distribution name will not properly display in the Create New Big Data Cluster dialog box.

2 Refresh the vSphere Web Client to see your changes take effect. The newly added distribution appears in the list of available Hadoop distributions from which you can create a cluster.

What to do next

Create a cluster with the newly added Hadoop distribution. See Chapter 5, "Creating Hadoop and HBase Clusters," on page 47.

Install CentOS 6.x and VMware Tools on the Hadoop Template Virtual Machine

You can create a Hadoop Template virtual machine using a customized version of the CentOS 6.x operating system that includes VMware Tools for CentOS 6.x.

You can create a Hadoop Template virtual machine using CentOS 6.2 or 6.4 Linux as the guest operating system into which you can install VMware Tools for CentOS 6.x in combination with a supported Hadoop distribution. This allows you to create a Hadoop Template virtual machine using your organization's preferred operating system configuration. When you provision Big Data clusters using the customized template, the VMware Tools for CentOS 6.x will be in the virtual machines that are created from the Hadoop Template virtual machine.


If you create Hadoop Template virtual machines with multiple cores per socket, when you specify the CPU settings for the virtual machine you must specify a multiple of cores per socket. For example, if the virtual machine uses two cores per socket, the vCPU settings must be an even number. For example: 4, 8, or 12. If you specify an odd number, the cluster provisioning or CPU resizing will fail.

Prerequisites

n Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

n Obtain the IP address of the Serengeti Management Server.

n Locate the VMware Tools version that corresponds to the ESXi version in your data center.

Procedure

1 Create a virtual machine template with a 20GB thin provisioned disk, and install CentOS 6.x.

a Download the CentOS 6.x installation package from www.centos.org to a datastore.

b From vCenter Server, create a virtual machine template with a 20GB thin provisioned disk and select CentOS 6 (64-bit) as the Guest OS.

c Right-click the virtual machine, select CD Device, and select the datastore ISO file for the CentOS version ISO.

d Under Device Status, select Connected and Connect at power on, and click OK.

e From the console window, install the CentOS 6.x operating system using the default settings.

You can select the language and time zone you want the operating system to use, and you can specify that the swap partition use a smaller size to save disk space (for example, 500 MB). The swap partition is not used by Big Data Extensions, so you can safely reduce the size.

The new operating system installs into the virtual machine template.

2 After you have successfully installed the operating system into the virtual machine template, verify that you have network connectivity by running the ifconfig command. This procedure assumes the use of Dynamic Host Configuration Protocol (DHCP).

n If IP address information appears, skip to step 4.

n If no IP address information is shown, which can be the case when DHCP is not yet configured for the interface, continue with step 3.

3 Configure the network.

a Using a text editor open the /etc/sysconfig/network-scripts/ifcfg-eth0 file.

b Locate the following parameters and specify the following configuration.

DEVICE=eth0

ONBOOT=yes

BOOTPROTO=dhcp

c Save your changes and close the file.

d Restart the network service using the commands sudo service network stop and sudo service network start.

sudo service network stop

sudo service network start

e Verify that you have connectivity by running the ifconfig command.


4 Install JDK 6u31 (jdk-6u31-linux-x64-rpm.bin).

a From the Oracle® Java SE 6 Downloads page, download jdk-6u31-linux-x64-rpm.bin and copy it to the virtual machine template's root folder.

b Set the file access.

chmod a+x jdk-6u31-linux-x64-rpm.bin

c Run the following command to install the JDK in the /usr/java/default directory.

./jdk-6u31-linux-x64-rpm.bin

d Edit /etc/environment and add the following line, JAVA_HOME=/usr/java/default, to your environment.

This sets JAVA_HOME to the /usr/java/default directory in your environment.

e Restart the virtual machine.
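For example, a minimal sketch of steps 4d and 4e, assuming you are logged in to the template as root:

# append the JAVA_HOME definition to /etc/environment
echo 'JAVA_HOME=/usr/java/default' >> /etc/environment

# restart the virtual machine so the new setting takes effect
reboot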

5 Install VMware Tools for CentOS 6.x.

a Right-click the CentOS 6 virtual machine in Big Data Extensions, then select Guest > Install/Upgrade VMware Tools.

b Log in to the virtual machine and mount the CD-ROM to access the VMware Tools installation package.

mkdir /mnt/cdrom

mount /dev/cdrom /mnt/cdrom

mkdir /tmp/vmtools

cd /tmp/vmtools

c Run tar xf to extract the VMware Tools package tar file.

tar xf /mnt/cdrom/VMwareTools-*.tar.gz

d Make vmware-tools-distrib your working directory, and run the vmware-install.pl script.

cd vmware-tools-distrib
./vmware-install.pl

Press Enter to finish the installation.

e Remove the temporary vmtools directory that is created as an artifact of the installation process.

rm -rf /tmp/vmtools

6 (Optional) You can create a snapshot to use for recovery operations. In the vSphere Web Client, right-click the virtual machine and select Snapshot > Take Snapshot.

7 Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

8 Run the installation scripts to customize the local lib with CentOS 6.x.

a Download the scripts from https://deployed_serengeti_server_IP/custos/custos.tar.gz.

b Make the virtual machine template directory your working directory.

c Run tar xf to uncompress the tar file.

tar xf custos.tar.gz

d Run the installer.sh script specifying the /usr/java/default directory path.

./installer.sh /usr/java/default


9 Remove the /etc/udev/rules.d/70-persistent-net.rules file to prevent increasing the eth number during the clone operation.

If you do not remove the /etc/udev/rules.d/70-persistent-net.rules file, the virtual machine cloned from the template cannot get an IP address. If you power on this virtual machine to make changes, you must remove this file before shutting down this virtual machine.
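For example, a one-line sketch of this step, run as root inside the template virtual machine:

rm -f /etc/udev/rules.d/70-persistent-net.rules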

10 Shut down the virtual machine.

11 If you created a snapshot as described in Step 6, you must delete it. In the vSphere Web Client, right-click the virtual machine, select Snapshot > Snapshot Manager, select the serengeti-snapshot, and click Delete.

12 In the vSphere Web Client, edit the template settings, and deselect (uncheck) all devices, including the CD-ROM and floppy disk.

13 Replace the original Hadoop Template virtual machine with the custom CentOS-enabled virtual machine that you have just created.

a Move the original Hadoop Template virtual machine out of the vApp.

b Drag the new template virtual machine that you just created into the vApp.

14 Log in to the Serengeti Management Server as the user serengeti, and restart the Tomcat service.

$ sudo service tomcat restart

Restarting the Tomcat service enables the custom CentOS virtual machine template, making it your Hadoop Template virtual machine.

What to do next

If in the future you intend to modify or update the Hadoop Template virtual machine operating system, you must remove the serengeti-snapshot that is automatically created each time you shut down and restart the virtual machine. See "Maintain a Customized Hadoop Template Virtual Machine," on page 37.

Maintain a Customized Hadoop Template Virtual Machine

You can modify or update the Hadoop Template virtual machine operating system. When you make such updates, you must remove the snapshot that is automatically created for the virtual machine.

If you create a custom Hadoop Template virtual machine, using a version of CentOS 6.x, or otherwise modify the operating system, you must remove the automatically generated serengeti-snapshot that Big Data Extensions creates. If you don't remove the serengeti-snapshot, any changes you've made to the Hadoop Template virtual machine will not take effect.

Prerequisites

n Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

n Create a customized Hadoop Template virtual machine using CentOS 6.x. See "Install CentOS 6.x and VMware Tools on the Hadoop Template Virtual Machine," on page 34.

Procedure

1 Power on the Hadoop Template virtual machine, and apply any changes or updates.


2 Remove the /etc/udev/rules.d/70-persistent-net.rules file to prevent increasing the eth number during the clone operation.

If you do not remove the /etc/udev/rules.d/70-persistent-net.rules file, the virtual machines cloned from the template cannot get an IP address. If you power on this virtual machine to make changes, you must remove this file again before shutting down the virtual machine.

3 Shut down the Hadoop Template virtual machine.

4 Delete the snapshot labeled serengeti-snapshot in the customized Hadoop Template virtual machine.

a In the vSphere Web Client, right-click the Hadoop Template virtual machine and select Snapshot > Snapshot Manager.

b Select the serengeti-snapshot, and click Delete.

The automatically generated snapshot is removed.

Access the Serengeti Command-Line Interface Using the Remote Command-Line Interface Client

You can access the Serengeti Command-Line Interface using the Serengeti Remote Command-Line Interface Client. The Serengeti Remote Command-Line Interface Client lets you access the Serengeti Management Server to deploy, manage, and use Hadoop.

IMPORTANT If you are using the Cloudera CDH3 or CDH4 distributions, or the MapR or Pivotal distributions, some Hadoop commands cannot be run from the Serengeti CLI due to incompatible protocols between the Command-Line Interface and the Cloudera CDH3 and CDH4 distributions. The workaround is to use a client node to run Hadoop commands such as fs, mr, pig, and hive.

Prerequisites

n Use the vSphere Web Client to log in to vCenter Server, on which you deployed the Serengeti vApp.

n Verify that the Big Data Extensions vApp deployment was successful and that the Management Server is running.

n Verify that you have the correct password to log in to the Serengeti Command-Line Interface Client. See "Create a New Username and Password for the Serengeti Command-Line Interface," on page 39.

The Serengeti Command-Line Interface Client uses its vCenter Server credentials.

n Verify that the Java Runtime Environment (JRE) is installed in your environment and that its location is in your PATH environment variable.

Procedure

1 Open a Web browser to connect to the Serengeti Management Server cli directory.

http://ip_address/cli

2 Download the ZIP file for your version and build.

The filename is in the format VMware-Serengeti-cli-version_number-build_number.ZIP.
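For example, a sketch of steps 1 through 3, assuming the exact file name shown on the download page for your version and build:

# download the CLI package from the Serengeti Management Server
wget http://ip_address/cli/VMware-Serengeti-cli-version_number-build_number.ZIP

# unzip it into the current directory
unzip VMware-Serengeti-cli-version_number-build_number.ZIP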

3 Unzip the download.

The download includes the following components.

n The serengeti-cli-version_number JAR file, which includes the Serengeti Remote Command-Line Interface Client.

n The samples directory, which includes sample cluster configurations.


n Libraries in the lib directory.

4 Open a command shell, and change to the directory where you unzipped the package.

5 Change to the cli directory, and run the following command to enter the Serengeti CLI.

java -jar serengeti-cli-version_number.jar

6 Connect to the Serengeti service.

You must run the connect --host command every time you begin a command-line session, and again after the 30-minute session timeout. If you do not run this command, you cannot run any other commands.

a Run the connect command.

connect --host xx.xx.xx.xx:8443

b At the prompt, type your user name, which might be different from your login credentials for the Serengeti Management Server.

c At the prompt, type your password.

A command shell opens, and the Serengeti Command-Line Interface prompt appears. You can use the help command to get help with Serengeti commands and command syntax.

n To display a list of available commands, type help.

n To get help for a specific command, append the name of the command to the help command.

help cluster create

n Press the Tab key to automatically complete a command.

Create a New Username and Password for the Serengeti Command-Line Interface

The Serengeti Command-Line Interface Client uses the vCenter Server login credentials with read permissions on the Serengeti Management Server. Otherwise, the Serengeti Command-Line Interface uses only the default vCenter Server administrator credentials.

Prerequisites

n Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

n Install the Serengeti Command-Line Interface Client. See "Install the Serengeti Remote Command-Line Interface Client," on page 22.

Procedure

1 Open a Web browser and go to the URL of the vSphere Web Client: https://vc-hostname:port/vsphere-client.

The vc-hostname can be either the DNS hostname or IP address of vCenter Server. By default the port is 9443, but this may have been changed during installation of the vSphere Web Client.

The vSphere Web Client login screen displays.

2 Type the username and password that has administrative privileges on vCenter Server, and click Login.

NOTE vCenter Server 5.5 users must use a local domain to perform SSO related operations.

3 Click Administration > SSO Users and Groups in the vSphere Web Client Navigator panel.

4 Change the login credentials.


The login credentials are updated. The next time you access the Serengeti Command-Line Interface, use the new login credentials.

What to do next

You can change the password of the Serengeti Management Server. See "Create a New Password for the Serengeti Management Server," on page 40.

Create a New Password for the Serengeti Management Server

Starting the Serengeti Management Server generates a random root user password. You can change that password and other operating system passwords using the virtual machine console.

When you log in to the Serengeti Management Server service using the vSphere Web Client, you use the username serengeti and the password for the Serengeti user. You can change the password for that user and perform other virtual machine management tasks from the Serengeti Management Server. When you log in to the Serengeti Management Server console as the root user for the first time, you use a randomly generated password. You can change the password.

NOTE You can change the password for any node's virtual machine using this procedure.

Prerequisites

n Deploy the Serengeti vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

n Log in to a vSphere Web Client connected to the vCenter Server and verify that the Serengeti Management Server virtual machine is running.

Procedure

1 Right-click the Serengeti Management Server virtual machine and select Open Console.

The current password for the Serengeti Management Server displays.

NOTE If the password scrolls off the console screen, type Ctrl-D to return to the command prompt.

2 Use PuTTY or another SSH client to log in to the Serengeti Management Server as root, using the IP address displayed in the Summary tab and the current password.

3 Use the /opt/serengeti/sbin/set-password command to change the password for the root user and the serengeti user.

/opt/serengeti/sbin/set-password -u root

You are prompted to type a new password and asked to retype it.

The next time you access the Serengeti Management Server using a Web browser, use the new password.

What to do next

You can change the username and password of the Serengeti Command-Line Interface Client. See "Create a New Username and Password for the Serengeti Command-Line Interface," on page 39.

Start and Stop Serengeti Services

You can stop and start Serengeti services to make a reconfiguration take effect, or to recover from an operational anomaly.


Procedure

1 Open a command shell to the Serengeti Management Server.

2 Run the serengeti-stop-services.sh script to stop the Serengeti services.

$ serengeti-stop-services.sh

3 Run the serengeti-start-services.sh script to start the Serengeti services.

$ serengeti-start-services.sh

Log in to Hadoop Nodes with the Serengeti Command-Line Interface Client

To perform troubleshooting or to run your own management automation scripts, log in to Hadoop master, worker, and client nodes with password-less SSH from the Serengeti Management Server using SSH client tools such as SSH, PDSH, ClusterSSH, and Mussh.

You can use a user name and password authenticated login to connect to Hadoop cluster nodes over SSH. All deployed nodes have random password protection.

Procedure

1 (Optional) To obtain the login information with the original random password, press Ctrl+D.

2 Log in to the Hadoop node from the vSphere Web Client, and the generated password for the root user is displayed on the virtual machine console in the vSphere Web Client.

3 Change the Hadoop node’s password by running the set-password -u command.

sudo /opt/serengeti/sbin/set-password -u

Change the User Password on all Cluster Nodes

You can change the user password for all nodes in a cluster. The user passwords you can change include those of the serengeti and root users.

You can change a user's password on all nodes within a given cluster.

Prerequisites

n Deploy the Big Data Extensions vApp. See "Deploy the Big Data Extensions vApp in the vSphere Web Client," on page 17.

n Configure a Hadoop distribution for use with Big Data Extensions.

n Create a cluster. See Chapter 5, “Creating Hadoop and HBase Clusters,” on page 47.

Procedure

1 Log in to the Serengeti Management Server virtual machine as the user serengeti.


2 Run the command serengeti-ssh.sh cluster_name 'echo new_password | sudo passwd username --stdin'.

Option Description

cluster_name The name of the cluster whose node passwords you want to update.

new_password The new password you want to use for the user.

username The user account whose password you want to change. For example, the user serengeti.

This example changes the password for all nodes in the cluster labeled mycluster for the user serengeti to mypassword.

serengeti-ssh.sh mycluster 'echo mypassword | sudo passwd serengeti --stdin'

The password for the user account you specify will be changed on all nodes in the cluster.


Managing vSphere Resources for Hadoop and HBase Clusters 4

Big Data Extensions lets you manage the datastores and networks that you use to create Hadoop and HBase clusters.

n Add a Resource Pool with the Serengeti Command-Line Interface on page 43
You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located at the top level of a cluster. Nested resource pools are not supported.

n Remove a Resource Pool with the Serengeti Command-Line Interface on page 44
You can remove any resource pool from Serengeti that is not in use by a Hadoop cluster. You remove resource pools if they are no longer needed or if you want the Hadoop clusters you create in the Serengeti Management Server to be deployed under a different resource pool. Removing a resource pool removes its reference in vSphere. The resource pool is not deleted.

n Add a Datastore in the vSphere Web Client on page 44
You can add datastores to Big Data Extensions to make them available to Hadoop and HBase clusters. Big Data Extensions supports both shared datastores and local datastores.

n Remove a Datastore in the vSphere Web Client on page 45
You remove a datastore from Big Data Extensions when you no longer want the Hadoop clusters you create to use that datastore.

n Add a Network in the vSphere Web Client on page 45
You add networks to Big Data Extensions to make the IP addresses contained by those networks available to Hadoop and HBase clusters.

n Remove a Network in the vSphere Web Client on page 46
You can remove an existing network from Big Data Extensions when you no longer need it. Removing an unused network frees the IP addresses for use by other services.

Add a Resource Pool with the Serengeti Command-Line Interface

You add resource pools to make them available for use by Hadoop clusters. Resource pools must be located at the top level of a cluster. Nested resource pools are not supported.

When you add a resource pool to Big Data Extensions it symbolically represents the actual vSphere resource pool as recognized by vCenter Server. This symbolic representation lets you use the Big Data Extensions resource pool name, instead of the full path of the resource pool in vCenter Server, in cluster specification files.

Prerequisites

Deploy Big Data Extensions.


Procedure

1 Access the Serengeti Command-Line Interface client.

2 Run the resourcepool add command.

The --vcrp parameter is optional.

This example adds a Serengeti resource pool named myRP to the vSphere rp1 resource pool that is contained by the cluster1 vSphere cluster.

resourcepool add --name myRP --vccluster cluster1 --vcrp rp1

What to do next

After you add a resource pool to Big Data Extensions, do not rename it in vSphere. If you rename it, you cannot perform any Serengeti operations on clusters that use that resource pool.

Remove a Resource Pool with the Serengeti Command-Line Interface

You can remove any resource pool from Serengeti that is not in use by a Hadoop cluster. You remove resource pools if they are no longer needed or if you want the Hadoop clusters you create in the Serengeti Management Server to be deployed under a different resource pool. Removing a resource pool removes its reference in vSphere. The resource pool is not deleted.

Procedure

1 Access the Serengeti Command-Line Interface client.

2 Run the resourcepool delete command.

If the command fails because the resource pool is referenced by a Hadoop cluster, you can use the resourcepool list command to see which cluster is referencing the resource pool.

This example deletes the resource pool named myRP.

resourcepool delete --name myRP

Add a Datastore in the vSphere Web Client

You can add datastores to Big Data Extensions to make them available to Hadoop and HBase clusters. Big Data Extensions supports both shared datastores and local datastores.

Prerequisites

Deploy Big Data Extensions, and log in to the vSphere Web Client.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Resources.

4 Expand Inventory List, and select Datastores.

5 Click the plus (+) icon.

The Add Datastore dialog box displays.

6 Type a name with which to identify the datastore in Big Data Extensions.

7 Type the name of a datastore as it is labeled in vSphere. You can choose multiple vSphere datastores.

You can use the * and ? pattern matching operators (wildcards) to specify multiple datastores. For example, to specify all datastores whose name begins with data-, type data-*.


8 Select the datastore type in vSphere.

Type Description

Shared Recommended for master nodes. Enables you to leverage vMotion, HA, and Fault Tolerance.
NOTE If you do not specify shared storage and attempt to provision a cluster using vMotion, HA, or Fault Tolerance, the provisioning will fail.

Local Recommended for worker nodes. Throughput is scalable and the cost of storage is lower.

9 Click OK to save your changes and close the dialog box.

The vSphere datastore is made available for use by Hadoop and HBase clusters deployed within Big Data Extensions.

Remove a Datastore in the vSphere Web Client

You remove a datastore from Big Data Extensions when you no longer want the Hadoop clusters you create to use that datastore.

Prerequisites

Remove all Hadoop clusters associated with the datastore. See "Delete a Hadoop Cluster in the vSphere Web Client," on page 57.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Resources.

4 Expand Resources > Inventory List, and select Datastores.

5 Select the datastore you want to remove and click the Remove (×) icon.

6 Click Yes to confirm that you want to remove the datastore.

If you have not removed the cluster that uses the datastore, you will receive an error message indicating the datastore cannot be removed because it is currently in use.

The datastore is removed from Big Data Extensions.

Add a Network in the vSphere Web Client

You add networks to Big Data Extensions to make the IP addresses contained by those networks available to Hadoop and HBase clusters.

Prerequisites

Deploy Big Data Extensions. See Chapter 3, "Getting Started with Big Data Extensions," on page 13.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Resources.

4 Expand Resources > Inventory List, and select Networks.


5 Click the plus (+) icon.

The Add Networks dialog box displays.

6 Type a name with which to identify the network resource in Big Data Extensions.

7 Type the name of the port group that you want to add as it is labeled in vSphere.

8 Select DHCP, or specify IP address information.

9 Click OK to save your changes and close the dialog box.

The IP addresses of the network are available to Hadoop and HBase clusters you create within Big Data Extensions.

Remove a Network in the vSphere Web Client

You can remove an existing network from Big Data Extensions when you no longer need it. Removing an unused network frees the IP addresses for use by other services.

Prerequisites

Remove clusters assigned to the network. See "Delete a Hadoop Cluster in the vSphere Web Client," on page 57.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Resources.

4 Expand Resources > Inventory List, and select Networks.

5 Select the network you want to remove, and click the Remove (x) icon.

6 Click Yes to confirm that you want to remove the network.

If you have not removed the cluster that uses the network, you will receive an error message indicating the network cannot be removed because it is currently in use.

The network is removed and the IP addresses become available for use.


Creating Hadoop and HBase Clusters 5

Big Data Extensions lets you quickly and easily create and deploy Hadoop and HBase clusters.

n About Hadoop and HBase Cluster Deployment Types on page 48
Big Data Extensions lets you deploy several different types of Hadoop and HBase clusters. Before you create a cluster, you need to know what types of clusters you can create.

n About Cluster Topology on page 48
You can improve workload balance across your cluster nodes, and improve performance and throughput, by specifying how Hadoop virtual machines are placed using topology awareness.

n Create a Hadoop or HBase Cluster in the vSphere Web Client on page 49
After you complete deployment of the Hadoop distribution, you can create Hadoop and HBase clusters to process data. You can create multiple clusters in your Big Data Extensions environment, but your environment must meet all prerequisites and have adequate resources.

n Add Topology for Load Balancing and Performance with the Serengeti Command-Line Interface on page 51
To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.


About Hadoop and HBase Cluster Deployment Types

Big Data Extensions lets you deploy several different types of Hadoop and HBase clusters. Before you create a cluster, you need to know what types of clusters you can create.

You can create the following types of clusters.

Basic Hadoop Cluster You can create a simple Hadoop deployment for proof of concept projects and other small-scale data processing tasks using the basic Hadoop cluster.

HBase Cluster You can create an HBase cluster. To run HBase MapReduce jobs you must configure the HBase cluster to include JobTracker or TaskTrackers.

Data/Compute Separation Hadoop Cluster You can separate data and compute nodes in a Hadoop cluster, and control how nodes are placed on your environment's vSphere ESX hosts.

Compute-only Hadoop Cluster You can create a compute-only cluster to run MapReduce jobs. Compute-only clusters run only MapReduce services that read data from external HDFS clusters without having to store data themselves.

Customize You can use an existing cluster specification file to create clusters using the same configuration as your previously created clusters. You can also edit the file to further customize the cluster configuration.

About Cluster Topology

You can improve workload balance across your cluster nodes, and improve performance and throughput, by specifying how Hadoop virtual machines are placed using topology awareness.

To get maximum performance out of your Hadoop or HBase cluster, it is important to configure your cluster so that it has awareness of the topology of your environment's host and network information. Hadoop performs better when it uses within-rack transfers (where more bandwidth is available) instead of off-rack transfers when assigning MapReduce tasks to nodes. HDFS will be able to place replicas more intelligently to trade off performance and resilience. For example, if you have separate data and compute nodes, you can improve performance and throughput by placing the nodes on the same set of physical hosts.

You can specify the following topology awareness configurations.

Hadoop Virtualization Extensions (HVE) Enhanced cluster reliability and performance provided by refined Hadoop replica placement, task scheduling, and balancer policies. Hadoop clusters implemented on a virtualized infrastructure have full awareness of the topology on which they are running when using HVE.


To create a cluster using HVE you must deploy a Hadoop distribution that supports HVE.

NOTE To use HVE, you must create and upload a server topology file.

RACK_AS_RACK Standard topology for Apache Hadoop distributions. Only rack and host information are exposed to Hadoop.

NOTE To use RACK_AS_RACK, you must create and upload a server topology file.

HOST_AS_RACK Simplified topology for Apache Hadoop distributions. To avoid placing all HDFS data block replicas on the same physical host, each physical host is treated as a rack. Because data block replicas are never all placed on the same rack, this avoids the worst-case scenario of a single host failure causing the complete loss of any data block.

Use HOST_AS_RACK only if your cluster uses a single rack, or if you do not have any rack information with which to make a decision about topology configuration options.

None No topology is specified.

Create a Hadoop or HBase Cluster in the vSphere Web Client

After you complete deployment of the Hadoop distribution, you can create Hadoop and HBase clusters to process data. You can create multiple clusters in your Big Data Extensions environment, but your environment must meet all prerequisites and have adequate resources.

Prerequisites

n Deploy the Big Data Extensions vApp. See Chapter 3, "Getting Started with Big Data Extensions," on page 13.

n Install the Big Data Extensions plug-in. See "Install the Big Data Extensions Plug-in," on page 19.

n Connect to a Serengeti Management Server. See "Connect to a Serengeti Management Server," on page 21.

n Configure one or more Hadoop distributions. See "Configure a Tarball-Deployed Hadoop Distribution," on page 23 or "Configuring Yum and Yum Repositories," on page 25.

n Understand the topology configuration options you may want to use with your cluster.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Clusters.

3 Click New Big Data Cluster in the Big Data Cluster List tab.

The Create New Big Data Cluster dialog box displays.


4 Specify the information for the cluster that you want to create.

Option Description

Hadoop cluster name Type a name by which to identify the cluster.

Hadoop distro Select the Hadoop distribution.

Deployment type Select the type of cluster you want to create.
n Basic Hadoop Cluster
n HBase Cluster
n Data/Compute Separation Hadoop Cluster
n Compute-only Hadoop Cluster
n Customize
The type of cluster you create determines the available node group selections. If you select Customize, a dialog box displays that lets you load an existing cluster specification file. You can edit the file in the text field to further refine the cluster specification.

DataMaster Node Group The DataMaster node is a virtual machine that runs the Hadoop NameNode service. This node manages HDFS data and assigns tasks to Hadoop JobTracker services deployed in the worker node group.
Select a resource template from the drop-down menu, or select Customize to customize a resource template.
For the master node, use shared storage so that you protect this virtual machine with VMware HA and FT.

ComputeMaster Node Group The ComputeMaster node is a virtual machine that runs the Hadoop JobTracker service. This node assigns tasks to Hadoop TaskTracker services deployed in the worker node group.
Select a resource template from the drop-down menu, or select Customize to customize a resource template.
For the master node, use shared storage so that you protect this virtual machine with VMware HA and FT.

HBaseMaster Node Group (HBase cluster only) The HBaseMaster node is a virtual machine that runs the HBase master service. This node orchestrates a cluster of one or more RegionServer slave nodes.
Select a resource template from the drop-down menu, or select Customize to customize a resource template.
For the master node, use shared storage so that you protect this virtual machine with VMware HA and FT.

Worker Node Group Worker nodes are virtual machines that run the Hadoop DataNode and TaskTracker services. These nodes store HDFS data and execute tasks.
Select the number of nodes and the resource template from the drop-down menu, or select Customize to customize a resource template.
For worker nodes, use local storage.
NOTE You can add nodes to the worker node group by using Scale Out Cluster. You cannot reduce the number of nodes.

Client Node Group A client node is a virtual machine that contains Hadoop client components. From this virtual machine you can access HDFS, submit MapReduce jobs, run Pig scripts, run Hive queries, and run HBase commands.
Select the number of nodes and a resource template from the drop-down menu, or select Customize to customize a resource template.
NOTE You can add nodes to the client node group by using Scale Out Cluster. You cannot reduce the number of nodes.

Hadoop Topology Select the topology configuration you want the cluster to use.
n RACK_AS_RACK
n HOST_AS_RACK
n HVE
n NONE


Network Select the network that you want the cluster to use.

Resource Pools Select one or more resource pools that you want the cluster to use.

The Serengeti Management Server clones the template virtual machine to create the nodes in the cluster. When each virtual machine starts, the agent on that virtual machine pulls the appropriate Big Data Extensions software components to that node and deploys the software.

Add Topology for Load Balancing and Performance with the Serengeti Command-Line Interface

To achieve a balanced workload or to improve performance and throughput, you can control how Hadoop virtual machines are placed by adding topology awareness to the Hadoop clusters. For example, you can have separate data and compute nodes, and improve performance and throughput by placing the nodes on the same set of physical hosts.

Procedure

1 If you are using HVE or RACK_AS_RACK topologies, create a plain text file to associate the racks and physical hosts.

For example:

rack1: a.b.foo.com, a.c.foo.com

rack2: c.a.foo.com

Physical hosts a.b.foo.com and a.c.foo.com are in rack1, and c.a.foo.com is in rack2.

2 Access the Serengeti Command-Line Interface Client.

3 (Optional) View the current topology by running the topology list command.

topology list [--detail]

4 Upload the rack and physical host information.

topology upload --fileName name_of_rack_hosts_mapping_file

5 Create the cluster by running the cluster create command.

This example creates an HVE topology.

cluster create --name cluster-name --topology HVE --distro name_of_HVE-supported_distro

6 View the allocated nodes on each rack.

cluster list --name cluster-name --detail


Managing Hadoop and HBase Clusters 6

You can use the vSphere Web Client to start and stop your Hadoop or HBase cluster and to modify cluster configuration. You can also manage a cluster using the Serengeti Command-Line Interface.

CAUTION Do not use vSphere management functions such as cluster migration for clusters that you create with Big Data Extensions. Performing such cluster management functions outside of the Big Data Extensions environment can make it impossible for you to manage those clusters with Big Data Extensions.

This chapter includes the following topics:

n “Start and Stop a Hadoop Cluster in the vSphere Web Client,” on page 53

n “Scale Out a Hadoop Cluster in the vSphere Web Client,” on page 54

n “Scale CPU and RAM in the vSphere Web Client,” on page 54

n “Reconfigure a Hadoop or HBase Cluster in the Serengeti Command-Line Interface,” on page 55

n “Delete a Hadoop Cluster in the vSphere Web Client,” on page 57

n “About Resource Usage and Elastic Scaling,” on page 57

n “Optimize Cluster Resource Usage with Elastic Scaling in the vSphere Web Client,” on page 59

n “Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client,” on page 60

n “About vSphere High Availability and vSphere Fault Tolerance,” on page 61

n “Recover from Disk Failure in the Serengeti Command-Line Interface Client,” on page 61

Start and Stop a Hadoop Cluster in the vSphere Web Client

You can stop a currently running Hadoop cluster and start a stopped Hadoop cluster from the vSphere Web Client.

Prerequisites

n To stop a cluster it must be running or in an error state.

n To start a cluster it must be stopped or in an error state.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.


4 Select the cluster that you want to stop or start from the Hadoop Cluster Name column, and right-click to display the Actions menu.

5 Select Shut Down Big Data Cluster to stop a running cluster, or select Start Big Data Cluster to start a cluster.

Scale Out a Hadoop Cluster in the vSphere Web Client

You specify the number of nodes to use when you create Hadoop clusters. You can later scale out the cluster by increasing the number of worker nodes and client nodes.

You can scale the cluster using the vSphere Web Client or the Serengeti Command-Line Interface Client. The command-line interface provides more configuration options than the vSphere Web Client. See "Install the Serengeti Remote Command-Line Interface Client," on page 22 for installation instructions. See the vSphere Big Data Extensions Command-Line Interface Guide for command-line interface documentation.

You cannot decrease the number of worker and client nodes from the vSphere Web Client.

Prerequisites

n Start the cluster if it is not running. See “Start and Stop a Hadoop Cluster in the vSphere Web Client,” on page 53.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.

4 Select the cluster that you want to scale out from the Hadoop Cluster Name column.

5 Click the All Actions icon, and select Scale Out.

6 Select the worker or client node group to scale out from the Node group drop-down menu.

7 Specify the target number of node instances to add in the Instance number text box and click OK.

Because you cannot decrease the number of nodes, a Scale Out Failed error occurs if you specify an instance number that is less than or equal to the current number of instances.

The cluster is updated to include the specified number of nodes.
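For reference, a node group can also be scaled out from the Serengeti Command-Line Interface with the cluster resize command. The following is a sketch in which the cluster name and target instance number are hypothetical; the exact options for your version are documented in the VMware vSphere Big Data Extensions Command-Line Interface Guide.

cluster resize --name myCluster --nodeGroup worker --instanceNum 10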

Scale CPU and RAM in the vSphere Web Client

You can increase or decrease a cluster's compute capacity to prevent CPU or memory resource contention among running jobs.

You can adjust compute resources without increasing the workload on the Master node. If increasing or decreasing the cluster's CPU or RAM is unsuccessful for a node, which is commonly due to insufficient resources being available, the node is returned to its original CPU or RAM setting.

Although all node types support CPU and RAM scaling, scaling a cluster's master node CPU is not recommended because Big Data Extensions powers down the virtual machine during the scaling process.

When scaling your cluster's CPU and RAM, each node's vCPU count must be a multiple of its virtual CPU cores, and you must scale RAM as a multiple of 4, with a minimum of 3748 MB.

Prerequisites

n Start the cluster if it is not running. See “Start and Stop a Hadoop Cluster in the vSphere Web Client,” on page 53.


Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.

4 Select the cluster that you want to scale up or down from the Hadoop Cluster Name column.

5 Click the All Actions icon, and select Scale Up/Down.

6 Select the ComputeMaster, DataMaster, Worker, or Client node group whose CPU or RAM you want to scale up or down from the Node group drop-down menu.

7 Specify the number of vCPUs to use and the amount of RAM in the appropriate text boxes and click OK.

After applying new values for CPU and RAM, the cluster is placed into Maintenance mode as it applies the new values. You can monitor the status of the cluster as the new values are applied.

Reconfigure a Hadoop or HBase Cluster in the Serengeti Command-Line Interface

You can reconfigure any Hadoop or HBase cluster that you create with Big Data Extensions.

The cluster configuration is specified by attributes in a variety of Hadoop distribution XML configuration files: core-site.xml, hdfs-site.xml, mapred-site.xml, hadoop-env.sh, and so on.

For details about the Serengeti JSON-formatted configuration file and associated attributes in Hadoop distribution files, see the "Cluster Specification Reference" topic in the VMware vSphere Big Data Extensions Command-Line Interface Guide.

Procedure

1 Use the cluster export command to export the cluster specification file for the cluster you want to reconfigure.

cluster export --name cluster_name --specFile file_path/cluster_spec_file_name

Option Description

cluster_name Name of the cluster you want to reconfigure.

file_path The filesystem path at which to export the specification file.

cluster_spec_file_name The name with which to label the exported cluster specification file.

2 Edit the configuration information located near the end of the exported cluster specification file.

If you are modeling your configuration file on existing Hadoop XML configuration files, you can use the convert-hadoop-conf.rb conversion tool to convert Hadoop XML configuration files to the required JSON format.

"configuration": {

"hadoop": {

"core-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/stable/core-

default.html

// note: any value (int, float, boolean, string) must be enclosed in double quotes

and here is a sample:

// "io.file.buffer.size": "4096"

},

Chapter 6 Managing Hadoop and HBase Clusters

VMware, Inc. 55

"hdfs-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/stable/hdfs-

default.html

},

"mapred-site.xml": {

// check for all settings at http://hadoop.apache.org/common/docs/stable/mapred-

default.html

},

"hadoop-env.sh": {

// "HADOOP_HEAPSIZE": "",

// "HADOOP_NAMENODE_OPTS": "",

// "HADOOP_DATANODE_OPTS": "",

// "HADOOP_SECONDARYNAMENODE_OPTS": "",

// "HADOOP_JOBTRACKER_OPTS": "",

// "HADOOP_TASKTRACKER_OPTS": "",

// "HADOOP_CLASSPATH": "",

// "JAVA_HOME": "",

// "PATH": "",

},

"log4j.properties": {

// "hadoop.root.logger": "DEBUG, DRFA ",

// "hadoop.security.logger": "DEBUG, DRFA ",

},

"fair-scheduler.xml": {

// check for all settings at

http://hadoop.apache.org/docs/stable/fair_scheduler.html

// "text": "the full content of fair-scheduler.xml in one line"

},

"capacity-scheduler.xml": {

// check for all settings at

http://hadoop.apache.org/docs/stable/capacity_scheduler.html

}

}

}

3 (Optional) If your Hadoop distribution's JAR files are not in the $HADOOP_HOME/lib directory, add the full path of the JAR file in $HADOOP_CLASSPATH to the cluster specification file.

This lets the Hadoop daemons locate the distribution JAR files.

For example, the Cloudera CDH3 Hadoop Fair Scheduler JAR files are in /usr/lib/hadoop/contrib/fairscheduler/. Add the following to the cluster specification file to enable Hadoop to use the JAR files.

"configuration": {

"hadoop": {

"hadoop-env.sh": {

"HADOOP_CLASSPATH": "/usr/lib/hadoop/contrib/fairscheduler/*:$HADOOP_CLASSPATH"

},

"mapred-site.xml": {

"mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"

},

"fair-scheduler.xml": {

VMware vSphere Big Data Extensions Administrator's and User's Guide

56 VMware, Inc.

}

}

}

4 Access the Serengeti Command-Line Interface.

5 Run the cluster config command to apply the new Hadoop configuration.

cluster config --name cluster_name --specFile file_path/cluster_spec_file_name

6 (Optional) To reset an existing configuration attribute to its default value, remove it from the cluster configuration file's configuration section, or comment it out using two forward slashes (//), and re-run the cluster config command.
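For example, a minimal reconfiguration round trip for a hypothetical cluster named myCluster might look like the following; the file path is arbitrary, and io.file.buffer.size is one of the core-site.xml settings listed at core-default.html.

cluster export --name myCluster --specFile /home/serengeti/myCluster.json

Edit the configuration section of the exported file, for example:

"configuration": {
    "hadoop": {
        "core-site.xml": {
            "io.file.buffer.size": "131072"
        }
    }
}

Then apply the change:

cluster config --name myCluster --specFile /home/serengeti/myCluster.json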

Delete a Hadoop Cluster in the vSphere Web Client

You can delete a cluster using the vSphere Web Client.

When you create a cluster, Big Data Extensions creates a folder and resource pool for each cluster, as well as resource pools for each node group within the cluster. When you delete a cluster, all of these organizational folders and resource pools are also removed.

When you delete a cluster it is removed from both the inventory and datastore.

You can delete a running or a stopped cluster.

Procedure

1 Log in to the vSphere Web Client.

2 In the object navigator select Big Data Extensions.

3 Under Inventory Lists click Big Data Clusters.

A list of the available clusters appears.

4 Select the cluster that you want to delete from the Hadoop Cluster Name column.

5 Click the All Actions icon, and select Delete Big Data Cluster.

The cluster and all the virtual machines it contains are removed from your Big Data Extensions environment.

About Resource Usage and Elastic Scaling

Scaling lets you adjust the compute capacity of Hadoop data-compute separated clusters. An elasticity-enabled Hadoop cluster automatically stops and starts nodes to match resource requirements to available resources. You can use manual scaling for more explicit cluster control.

Automatic elasticity is best suited for mixed workload environments where resource requirements and availability fluctuate. Manual scaling is more appropriate for static environments where capacity planning can predict resource availability for workloads.

When automatic elasticity is enabled, Big Data Extensions sets the number of active compute nodes within the following range:

n The configured minimum number of nodes for elasticity

n Total number of available nodes

When manual scaling is in effect, Big Data Extensions sets the number of active compute nodes to the configured number of target compute nodes.


vCenter Server does not directly control the number of active nodes. However, vCenter Server does apply the usual reservations, shares, and limits to the cluster's resource pool according to the cluster's vSphere configuration. The vSphere dynamic resource scheduler (DRS) operates as usual, allocating resources between competing workloads.

Big Data Extensions also lets you adjust cluster nodes' access priority for datastores by using the vSphere Storage I/O Control feature. Clusters configured for HIGH IO shares receive higher priority access than clusters with NORMAL priority. Clusters configured for NORMAL IO shares receive higher priority access than clusters with LOW priority. In general, higher priority provides better disk I/O performance.

Scaling Modes

When you enable and disable elastic scaling, you change the scaling mode.

n AUTO. Big Data Extensions automatically adjusts the number of active nodes from the configured minimum to the number of available nodes in the cluster.

Elastic scaling operates on a per-host basis, at a node-level granularity. That is, the more nodes a Hadoop cluster has on a host, the finer the control that Big Data Extensions elasticity can exercise. The tradeoff is that the more nodes you have, the higher the overhead in terms of runtime resource cost, disk footprint, I/O requirements, and so on.

When resources are overcommitted, elastic scaling reduces the number of powered on nodes. Conversely, if the cluster receives all the resources it requested from vSphere, and Big Data Extensions determines that the cluster can make use of additional capacity, elastic scaling powers on additional nodes.

Resources can become overcommitted for many reasons, such as:

n The nodes have lower resource entitlements than a competing workload, according to vCenter Server's application of the usual reservations, shares, and limits as configured for the cluster.

n Physical resources are configured to be available, but another workload is consuming those resources.

n MANUAL. Big Data Extensions sets the number of active compute nodes to the configured number of target compute nodes.

Default Cluster Elasticity Settings

By default, when you create a cluster, it is not elasticity-enabled.

n The cluster's scaling mode is MANUAL.

n The cluster's minimum number of compute nodes is zero.

n The cluster's target number of nodes is not applicable.

Interactions Between Scaling and Other Cluster Operations

Some cluster operations cannot be performed while Big Data Extensions is actively scaling a cluster.

If you try to perform the following operations while Big Data Extensions is scaling a cluster, Serengeti warns you that in the cluster's current state, the operation cannot be performed.

n Concurrent attempt at manual scaling

n Switch to AUTO scaling while MANUAL scaling is in progress

If a cluster is elasticity-enabled when you perform the following cluster operations on it, Big Data Extensions changes the scaling mode to MANUAL and disables elasticity. You can re-enable elasticity after the cluster operation finishes.

n Delete the cluster


n Repair the cluster

n Stop the cluster

If a cluster is elasticity-enabled when you perform the following cluster operations on it, Big Data Extensions temporarily switches the cluster to MANUAL scaling. When the cluster operation finishes, Serengeti returns the elasticity mode to AUTO, which re-enables elasticity.

n Resize the cluster

n Reconfigure the cluster

If Big Data Extensions is scaling a cluster when you perform an operation that changes the scaling mode to MANUAL, your requested operation waits until the scaling finishes, and then the operation begins.

Optimize Cluster Resource Usage with Elastic Scaling in the vSphere Web Client

You can specify the elasticity of the cluster. Elasticity lets you specify the number of nodes the cluster can use, and whether it automatically adds nodes, or uses nodes within a targeted range.

When you enable elasticity for a cluster, Big Data Extensions optimizes cluster performance and utilization of nodes that have a Hadoop TaskTracker role.

When you set a cluster's elasticity mode to automatic, you should configure the minimum number of compute nodes. If you do not configure the minimum number of compute nodes, the previous setting is retained. When you set a cluster's elasticity mode to manual, you should configure the target number of compute nodes. If you do not configure the target number of compute nodes, the previous setting is retained.

Prerequisites

n Before optimizing your cluster's resource usage with elastic scaling, understand how elastic scaling and resource usage work. See “About Resource Usage and Elastic Scaling,” on page 57.

n You can only use elasticity with clusters you create using data-compute separation. See “About Hadoop and HBase Cluster Deployment Types,” on page 48.

Procedure

1 Log in to the vSphere Web Client.

2 In the object navigator select Big Data Extensions.

3 Under Inventory Lists click Big Data Clusters.

A list of the available clusters appears.

4 Select the cluster whose elasticity mode you want to set from the Hadoop Cluster Name column.

5 Click the All Actions icon, and select Set Elasticity Mode.


6 Specify the Elasticity settings for the cluster that you want to modify.

Option Description

Elasticity mode: Select the type of elasticity mode you want to use. You can choose manual or automatic.

Target compute nodes: Specify the desired number of compute nodes the cluster should target for use. This is applicable only when using manual elasticity mode. NOTE: A value of zero (0) means that the node setting is not applicable.

Min compute nodes: Specify the minimum number (the lower limit) of active compute nodes to maintain in the cluster. This is applicable only when using automatic elasticity mode. To ensure that under contention elasticity keeps a cluster operating with more than a cluster's initial default setting of zero compute nodes, configure the minimum number of compute nodes to a non-zero number.

What to do next

Specify the cluster's access priority for datastores. See “Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client,” on page 60.

Use Disk I/O Shares to Prioritize Cluster Virtual Machines in the vSphere Web Client

You can set the disk IO shares for the virtual machines running a cluster. Disk shares distinguish high-priority from low-priority virtual machines.

Disk shares is a value that represents the relative metric for controlling disk bandwidth to all virtual machines. The values are compared to the sum of all shares of all virtual machines on the server and, on an ESX host, the service console. Big Data Extensions can adjust disk shares for all virtual machines in a cluster. Using disk shares, you can change a cluster's I/O bandwidth to improve the cluster's I/O performance.

For more information on using disk shares to prioritize virtual machines, see the VMware vSphere ESXi and vCenter Server documentation.

Procedure

1 Log in to the vSphere Web Client.

2 In the object navigator select Big Data Extensions.

3 Under Inventory Lists click Big Data Clusters.

A list of the available clusters appears.

4 Select the cluster whose disk IO shares you want to set from the Hadoop Cluster Name column.

5 Click the Actions icon, and select Set Disk IO Share.

6 Specify a value to allocate a number of shares of disk bandwidth to the virtual machine running the cluster.

Clusters configured for HIGH I/O shares receive higher priority access than those with NORMAL and LOW priorities, which provides better disk I/O performance. Disk shares are commonly set LOW for compute virtual machines and NORMAL for data virtual machines. The master node virtual machine is commonly set to NORMAL.

7 Click OK to save your changes and close the dialog box.


About vSphere High Availability and vSphere Fault Tolerance

Big Data Extensions leverages vSphere HA to protect the Hadoop master node virtual machine, which can be monitored by vSphere.

When a Hadoop NameNode or JobTracker service stops unexpectedly, vSphere restarts the Hadoop virtual machine in another host, thereby reducing unplanned downtime. If vSphere Fault Tolerance is configured and the master node virtual machine stops unexpectedly due to host failover or loss of network connectivity, the secondary node is used immediately, without any downtime.

Recover from Disk Failure in the Serengeti Command-Line Interface Client

If there is a disk failure in a Hadoop cluster, and the disk does not perform management roles such as NameNode, JobTracker, ResourceManager, HMaster, or ZooKeeper, you can recover by running the Serengeti cluster fix command.

Big Data Extensions uses a large number of inexpensive disk drives for data storage (configured as Just a Bunch of Disks). If several disks fail, the Hadoop data node may shut down. For this reason, Big Data Extensions lets you recover from disk failures.

Serengeti supports recovery from swap and data disk failure on all supported Hadoop distributions. Disks are recovered and started in sequence, not in parallel, to avoid the temporary loss of multiple nodes at once. A new disk matches the corresponding failed disk's storage type and placement policies.

NOTE MapR does not support recovery from disk failure using the cluster fix command.

Procedure

1 Access the Serengeti Command-Line Interface Client.

2 Run the cluster fix command.

The nodeGroup parameter is optional.

cluster fix --name cluster_name --disk [--nodeGroup nodegroup_name]
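For example, to recover failed data disks in only the worker node group of a hypothetical cluster named myCluster:

cluster fix --name myCluster --disk --nodeGroup worker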


Chapter 7 Monitoring the Big Data Extensions Environment

You can monitor the status of Serengeti-deployed clusters, including their datastores, networks, and resource pools, through the Serengeti Command-Line Interface. Limited monitoring capabilities are also available using the vSphere Console.

n View Provisioned Clusters in the vSphere Web Client on page 63
You can view the clusters deployed within Big Data Extensions, including information about whether the cluster is running, the type of Hadoop distribution used by a cluster, and the number and type of nodes in the cluster.

n View Cluster Information in the vSphere Web Client on page 64
Use the vSphere Web Client to view virtual machines running each node, resource allocation, IP addresses, and storage information for each node in the Hadoop cluster.

n Monitor the Hadoop Distributed File System Status in the vSphere Web Client on page 65
When you configure a Hadoop distribution to use with Big Data Extensions, the Hadoop software includes the Hadoop Distributed File System (HDFS). You can monitor the health and status of HDFS from the vSphere Web Client. The HDFS page lets you browse the Hadoop filesystem, view NameNode logs, and view cluster information including live, dead, and decommissioning nodes, and NameNode storage information.

n Monitor MapReduce Status in the vSphere Web Client on page 66
The Hadoop software includes MapReduce, a software framework for distributed data processing. You can monitor MapReduce status in the vSphere Web Client. The MapReduce Web page includes information about scheduling, running jobs, retired jobs, and log files.

n Monitor HBase Status in the vSphere Web Client on page 66
HBase is the Hadoop database. You can monitor the health and status of your HBase cluster, as well as the tables it hosts, from the vSphere Web Client.

View Provisioned Clusters in the vSphere Web Client

You can view the clusters deployed within Big Data Extensions, including information about whether the cluster is running, the type of Hadoop distribution used by a cluster, and the number and type of nodes in the cluster.

Prerequisites

n Create one or more Hadoop or HBase clusters whose information you can view. See Chapter 5, “Creating Hadoop and HBase Clusters,” on page 47.


Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.

4 Select Big Data Clusters.

Information about all provisioned clusters appears in the right pane.

Table 7‑1. Cluster Information

Option Description

Cluster Name: Name of the cluster.

Status: Status of the cluster, which can be: Running, Starting, Started, Stopping, Stopped, Provisioning, or Provision Error.

Distribution: Hadoop distribution in use by the cluster. The supported distributions are Apache, Cloudera (CDH), Greenplum (GPHD), Pivotal HD (PHD), Hortonworks (HDP), and MapR.

Elasticity Mode: The elasticity mode in use by the cluster, which is Manual, Automatic, or Not Applicable (N/A).

Disk IO Shares: The disk IO shares in use by the cluster, which is Low, Normal, or High.

Resource: The resource pool or vCenter Server cluster in use by the Big Data cluster.

Big Data Cluster Information: Number and type of nodes in the cluster.

Progress: Status messages of actions being performed on the cluster.

View Cluster Information in the vSphere Web Client

Use the vSphere Web Client to view virtual machines running each node, resource allocation, IP addresses, and storage information for each node in the Hadoop cluster.

Prerequisites

n Create one or more Hadoop clusters.

n Start the Hadoop cluster.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.

4 Double click a Big Data cluster.

Information about the cluster appears in the right pane in the Nodes tab.

Table 7-2. Cluster Information

Option Description

Node Group: Lists all nodes by type in the cluster.

VM Name: Name of the virtual machine on which a node is running.

Status: The virtual machine reports the following status types:
n Not Exist. Status prior to creating a virtual machine instance in vSphere.
n Created. After the virtual machine is cloned from the node template.
n Powered On. The virtual machine is powered on after virtual disks and network are configured.
n VM Ready. A virtual machine is started and IP is ready.
n Service Ready. Services inside the virtual machine have been provisioned.
n Bootstrap Failed. A service inside the virtual machine failed to provision.
n Powered Off. The virtual machine is powered off.

IP address: IP address of the virtual machine.

Host: IP address or Fully Qualified Domain Name (FQDN) of the ESX host on which the virtual machine is running.

vCPU: Number of virtual CPUs assigned to the node.

RAM: Amount of RAM used by the node. NOTE: The RAM size that appears for each node shows the allocated RAM, not the RAM that is in use.

Storage: The amount of storage allocated for use by the virtual machine running the node.

Monitor the Hadoop Distributed File System Status in the vSphere Web Client

When you configure a Hadoop distribution to use with Big Data Extensions, the Hadoop software includes the Hadoop Distributed File System (HDFS). You can monitor the health and status of HDFS from the vSphere Web Client. The HDFS page lets you browse the Hadoop filesystem, view NameNode logs, and view cluster information including live, dead, and decommissioning nodes, and NameNode storage information.

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.

Prerequisites

n Create one or more Hadoop clusters. See Chapter 5, “Creating Hadoop and HBase Clusters,” on page 47.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.

4 Select the cluster whose HDFS status you want to view from the Big Data Cluster List tab.

5 Select Open HDFS Status Page from the Actions menu.

The HDFS status information displays in a new Web page.


Monitor MapReduce Status in the vSphere Web Client

The Hadoop software includes MapReduce, a software framework for distributed data processing. You can monitor MapReduce status in the vSphere Web Client. The MapReduce Web page includes information about scheduling, running jobs, retired jobs, and log files.

Prerequisites

n Create one or more Hadoop clusters whose MapReduce status you can monitor.

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.

4 Select the cluster whose MapReduce status you want to view from the Big Data Cluster List tab.

5 Select Open MapReduce Status Page from the Actions menu.

The MapReduce status information displays in a new Web page.

Monitor HBase Status in the vSphere Web Client

HBase is the Hadoop database. You can monitor the health and status of your HBase cluster, as well as the tables it hosts, from the vSphere Web Client.

Prerequisites

n Create one or more HBase clusters. See Chapter 5, “Creating Hadoop and HBase Clusters,” on page 47

Procedure

1 Log in to the vSphere Web Client.

2 Select Big Data Extensions.

3 From the Inventory Lists, click Big Data Clusters.

4 Select the cluster whose HBase status you want to view from the Big Data Cluster List tab.

5 Select Open HBase Status Page from the Actions menu.

The HBase status information displays in a new Web page.


Chapter 8 Performing Hadoop and HBase Operations with the Serengeti Command-Line Interface

The Serengeti Command-Line Interface lets you perform Hadoop and HBase operations. You can run Hiveand Pig scripts, HDFS commands, and MapReduce jobs. You can also access Hive data.

n Copy Files Between the Local Filesystem and HDFS with the Serengeti Command-Line Interface on page 67
You can copy files or directories between the local filesystem and the Hadoop filesystem (HDFS). The filesystem commands can operate on files or directories in any HDFS.

n Run Pig and PigLatin Scripts with the Serengeti Command-Line Interface on page 68
You can run Pig and PigLatin scripts from the Serengeti Command-Line Interface.

n Run Hive and Hive Query Language Scripts with the Serengeti Command-Line Interface on page 68
You can run Hive and Hive Query Language (HQL) scripts from the Serengeti Command-Line Interface Client.

n Run HDFS Commands with the Serengeti Command-Line Interface on page 69
You can run HDFS commands from the Serengeti Command-Line Interface Client.

n Run MapReduce Jobs with the Serengeti Command-Line Interface on page 69
You can run MapReduce jobs on your Hadoop cluster.

n Access HBase Databases on page 70
Serengeti supports several methods of HBase database access.

Copy Files Between the Local Filesystem and HDFS with the Serengeti Command-Line Interface

You can copy files or directories between the local filesystem and the Hadoop filesystem (HDFS). The filesystem commands can operate on files or directories in any HDFS.

Prerequisites

Log in to the Serengeti Command-Line Interface Client.

Procedure

1 List the available clusters with the cluster list command.

2 Connect to the Hadoop cluster whose files or directories you want to copy to or from your local filesystem.

cluster target --name cluster_name


3 Run the command cfg fs --namenode namenode_address.

You must run this command to identify the NameNode of the HDFS before you use fs put or fs get.

4 You can copy (upload) a file from the local filesystem to a specific HDFS using the fs put command.

fs put --from source_path_and_file --to dest_path_and_file

The specified file or directory is copied from your local filesystem to the HDFS.

5 You can copy (download) a file from a specific HDFS to your local filesystem using the fs get command.

fs get --from source_path_and_file --to dest_path_and_file

The specified file or directory is copied from the HDFS to your local filesystem.
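A short example session, using a hypothetical NameNode address and file paths:

cfg fs --namenode 192.168.1.100
fs put --from /home/serengeti/data/weblogs.txt --to /tmp/weblogs.txt
fs get --from /tmp/weblogs.txt --to /home/serengeti/data/weblogs-copy.txt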

Run Pig and PigLatin Scripts with the Serengeti Command-Line Interface

You can run Pig and PigLatin scripts from the Serengeti Command-Line Interface.

Pig lets you write PigLatin statements. PigLatin statements are converted by the Pig service into MapReduce jobs, which are executed across your Hadoop cluster.

Prerequisites

n You must have a PigLatin script you want to execute against your Hadoop cluster.

n Log in to the Serengeti Command-Line Interface Client.

Procedure

1 Access the Serengeti Command-Line Interface Client.

2 Show available clusters by running the cluster list command.

cluster list

3 Connect to the cluster where you want to run a Pig script.

cluster target --name cluster_name

4 Run the pig script command to run an existing PigLatin script.

This example runs the PigLatin script data.pig located in the /pig/scripts directory.

pig script --location /pig/scripts/data.pig

What to do next

If the PigLatin script stores its results in a file, you may want to copy that file to your local filesystem.

Run Hive and Hive Query Language Scripts with the Serengeti Command-Line Interface

You can run Hive and Hive Query Language (HQL) scripts from the Serengeti Command-Line Interface Client.

Hive lets you write HQL statements. HQL statements are converted by the Hive service into MapReduce jobs, which are executed across your Hadoop cluster.

Prerequisites

n You must have an HQL script you want to execute against your Hadoop cluster.


n Log in to the Serengeti Command-Line Interface Client.


Procedure

1 Access the Serengeti Command-Line Interface Client.

2 Show available clusters by running the cluster list command.

cluster list

3 Connect to the cluster where you want to run a Hive script.

cluster target --name cluster_name

4 Run the hive script command to run an existing Hive script.

This example runs the Hive script data.hive located in the /hive/scripts directory.

hive script --location /hive/scripts/data.hive

Run HDFS Commands with the Serengeti Command-Line Interface

You can run HDFS commands from the Serengeti Command-Line Interface Client.

Procedure

1 Access the Serengeti Command-Line Interface Client.

2 List the available clusters by running the cluster list command.

cluster list

3 Connect to the target cluster.

cluster target --name cluster_name

4 Run HDFS commands.

This example uses the fs put command to move files from /home/serengeti/data to the HDFS path /tmp.

fs put --from /home/serengeti/data --to /tmp

This example uses the fs get command to download a file from a specific HDFS to your local filesystem.

fs get --from source_path_and_file --to dest_path_and_file

This example uses the fs ls command to display the contents of the dir1 directory in /home/serengeti.

fs ls /home/serengeti/dir1

This example uses the fs mkdir command to create the dir2 directory in the /home/serengeti directory.

fs mkdir /home/serengeti/dir2

Run MapReduce Jobs with the Serengeti Command-Line Interface

You can run MapReduce jobs on your Hadoop cluster.

Procedure

1 Access the Serengeti Command-Line Interface Client, and connect to a Hadoop cluster.

2 Show available clusters by running the cluster list command.

cluster list


3 Connect to the cluster where you want to run a MapReduce job.

cluster target --name cluster_name

4 Run the mr jar command.

mr jar --jarfile path_and_jar_filename --mainclass class_with_main_method [--args double_quoted_arg_list]

This example runs the MapReduce job located in the hadoop-examples-1.2.1.jar JAR, which is located in the /opt/serengeti/cli/lib directory. Two arguments are passed to the MapReduce job: /tmp/input and /tmp/output.

mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.2.1.jar --mainclass org.apache.hadoop.examples.WordCount --args "/tmp/input /tmp/output"

5 Show the output of the MapReduce job by running the fs cat command.

fs cat file_to_display_to_stdout

6 Download the output of the MapReduce job from HDFS to the local file system.

fs get --from HDFS_file_path_and_name --to local_file_path_and_name

Access HBase Databases

Serengeti supports several methods of HBase database access.

n Log in to the client node virtual machine and run hbase shell commands. A sample shell session appears after this list.

n Run HBase jobs by using the hbase command.

hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 3

The default Serengeti-deployed HBase cluster does not contain Hadoop JobTracker or Hadoop TaskTracker daemons. Therefore, to run an HBase MapReduce job, you must deploy a customized cluster.

n Use the HBase cluster's client node RESTful Web services, which listen on port 8080, by using the curl command.

curl -I http://client_node_ip:8080/status/cluster

n Use the HBase cluster’s client node Thrift gateway, which listens on port 9090.
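The following is a sketch of an hbase shell session on the client node; status and list are standard HBase shell commands, and the actual output depends on your cluster.

hbase shell
status
list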


Chapter 9 Accessing Hive Data with JDBC or ODBC

You can run Hive queries from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application by leveraging the Hive JDBC and ODBC drivers.

You can access data from Hive using either JDBC or ODBC.

Hive JDBC Driver

Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of the form jdbc:hive://host:port/dbname, a Java application can connect to a Hive server running at the specified host and port. The driver makes calls to an interface implemented by the Hive Thrift Client using the Java Thrift bindings.

You may alternatively choose to connect to Hive through JDBC in embedded mode using the URI jdbc:hive://. In embedded mode, Hive runs in the same JVM as the application invoking it, so there is no need to launch it as a standalone server, since it does not use the Thrift service or the Hive Thrift Client.
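For example, in the sample program shown later in this chapter, only the URI passed to DriverManager changes for embedded mode (a sketch based on that sample; no separate Hive server process is required):

Connection con = DriverManager.getConnection("jdbc:hive://", "", "");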

Hive ODBC Driver

The Hive ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.

n Configure Hive to Work with JDBC on page 71
The Hive JDBC driver allows you to access Hive from a Java program that you write, or a Business Intelligence or similar application that uses JDBC to communicate with database products.

n Configure Hive to Work with ODBC on page 73
The Hive ODBC driver allows you to access Hive from a program that you write, or a Business Intelligence or similar application that uses ODBC to communicate with database products.

Configure Hive to Work with JDBC

The Hive JDBC driver allows you to access Hive from a Java program that you write, or a Business Intelligence or similar application that uses JDBC to communicate with database products.

The default JDBC 2.0 port is 21050. Hive accepts JDBC connections through this same port 21050 by default. Make sure this port is available for communication with other hosts on your network. For example, ensure that it is not blocked by firewall software.

Prerequisites

n You must have an application that can connect to a Hive server using the Hive JDBC driver.

n Log in to the Serengeti Command-Line Interface client.


n Log in to the Hive server using PuTTY or another secure-shell (SSH) client.

Procedure

1 Log in to the Hive server node using an SSH client.

2 Create the file HiveJdbcClient.java with the Java code to connect to the Hive Server.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     **/
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
            System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/test_hive_server.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/test_hive_server.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }
        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}

3 Run the JDBC code using one of the following methods.

u Run the javac command, identifying the Java file containing the JDBC code.

javac HiveJdbcClient.java

u Run a Bash shell script to populate the data file, define the classpath, and invoke the JDBC client. The example below uses the Apache Hadoop 1.1.2 distribution. If you are using a different Hadoop distribution, you must update the value of the HADOOP_CORE variable to correspond to the version of the distribution you are using.

#!/bin/bash
HADOOP_HOME=/usr/lib/hadoop
HIVE_HOME=/usr/lib/hive
echo -e '1\x01foo' > /tmp/test_hive_server.txt
echo -e '2\x01bar' >> /tmp/test_hive_server.txt
HADOOP_CORE=`ls /usr/lib/hadoop-1.1.2/hadoop-core-*.jar`
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf
for jar_file_name in ${HIVE_HOME}/lib/*.jar
do
  CLASSPATH=$CLASSPATH:$jar_file_name
done
java -cp $CLASSPATH HiveJdbcClient

Performing either of the above methods establishes a JDBC connection with the Hive server using the host and port information you specify in the Java application or Bash shell script.

Configure Hive to Work with ODBC

The Hive ODBC driver allows you to access Hive from a program that you write, or a Business Intelligence or similar application that uses ODBC to communicate with database products.

To access Hive data using ODBC, use the ODBC driver recommended for use with your Hadoop distribution.

Prerequisites

n Verify that the Hive ODBC driver supports the application or third-party product you intend to use.

n Download an appropriate ODBC connector and configure it for use with your environment.

n Configure a Data Source Name (DSN).


DSNs specify how an application connects to Hive or other database products. DSNs are typically managed by the operating system and may be used by multiple applications. Some applications do not use DSNs. Refer to your particular application's documentation to understand how it connects to Hive and other database products using ODBC.

Procedure

1 Open the ODBC Data Source Administrator from the Windows Start menu.

2 Click the System DSN tab, then click Add.

3 Select the ODBC driver you want to use with your Hadoop distribution, and click Finish.

4 Enter values for the following fields.

Option Description

Data Source Name: Type a name by which to identify the DSN.

Host: Fully qualified hostname or IP address of the node running the Hive service.

Port: Port number for the Hive service. The default is 21000.

Hive Server Type: Set to HiveServer1 or HiveServer2.

Authentication: If you are using HiveServer2, specify the following.
Mechanism: Set to User Name.
User Name: Specify the user name with which to run Hive queries.

5 Click OK.

6 Click Test to test the ODBC connection. When you have verified that the connection works, click Finish.

The new ODBC connector will appear in the User Data Sources list.

What to do next

You can now configure the application to work with your Hadoop distribution's Hive service. Refer to your particular application's documentation to understand how it connects to Hive and other database products using ODBC.


Chapter 10 Troubleshooting

The troubleshooting topics provide solutions to potential problems that you might encounter when using Big Data Extensions.

This chapter includes the following topics:

n “Log Files for Troubleshooting,” on page 76

n “Configure Serengeti Logging Levels,” on page 76

n “Troubleshooting Cluster Creation Failures,” on page 76

n “Cluster Provisioning Hangs for 30 Minutes Before Completing with an Error,” on page 79

n “Cannot Restart or Reconfigure a Cluster After Changing Its Distribution,” on page 80

n “Virtual Machine Cannot Get IP Address,” on page 80

n “vCenter Server Connections Fail to Log In,” on page 81

n “SSL Certificate Error When Connecting to Non-Serengeti Server with the vSphere Console,” on page 81

n “Serengeti Operations Fail After You Rename a Resource in vSphere,” on page 81

n “A New Plug-In with the Same or Lower Version Number as an Old Plug-In Does Not Load,” on page 82

n “MapReduce Job Fails to Run and Does Not Appear In the Job History,” on page 82

n “Cannot Submit MapReduce Jobs for Compute-Only Clusters with External Isilon HDFS,” on page 83

n “MapReduce Job Hangs on PHD or CDH4 YARN Cluster,” on page 83

n “Unable to Connect the Big Data Extensions Plug-In to the Serengeti Server,” on page 84


Log Files for Troubleshooting

Big Data Extensions and Serengeti automatically create log files that provide system and status information that you can use to troubleshoot deployment and operation problems.

Table 10-1. Log Files

Serengeti deployment logs: serengeti-firstboot.log and serengeti-firstboot.err. Deployment-time messages, which you can use to troubleshoot an unsuccessful deployment. Location: /opt/serengeti/logs

Serengeti server service log: serengeti.log. Web service component logs. Location: /opt/serengeti/logs

Serengeti server installation and configuration log: ironfan.log. Software installation and configuration information. Location: /opt/serengeti/logs

Serengeti server elasticity log: vhm.log. Elasticity logs. Location: /opt/serengeti/logs

Configure Serengeti Logging Levels

The Serengeti system and backend tasks use Apache log4j, with the default logging level INFO, to log messages. You can configure the logging level to customize the amount and type of information shown in the system and event logs.

Procedure

1 Open the /opt/serengeti/conf/log4j.properties file for editing.

2 Change the logging level as desired.

3 Save your changes and close the file.

4 Stop and then restart the Serengeti services.
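For example, to raise the amount of detail in the logs you might change a level from INFO to DEBUG. The line below is only a sketch that assumes the file contains a standard log4j root logger entry; match the property and appender names to the entries you actually find in /opt/serengeti/conf/log4j.properties.

log4j.rootLogger=DEBUG, stdout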

Troubleshooting Cluster Creation Failures

The cluster creation process can fail for many reasons.

If cluster creation fails, the first thing you should try is to resume the process.

n If you created the cluster with the Serengeti Command-Line Interface, run the cluster create --resume command.

n If you created the cluster with the vSphere Web Client, select the cluster, right-click, and select Resume.

Bootstrap Failed 401 Unauthorized Error

When you run the cluster create or cluster create --resume command, the command can fail. The reason it failed is logged to the associated background task log, /opt/serengeti/logs/ironfan.log.

Problem

The cluster create or cluster create --resume command fails.

n On the Command-Line Interface, the following error message appears:

Bootstrap Failed


n In the log file, the following error message appears:

[Fri, 09 Aug 2013 01:24:01 +0000] INFO: *** Chef 10.X.X ***
[Fri, 09 Aug 2013 01:24:01 +0000] INFO: Client key /home/ubuntu/chef-repo/client.pem is not present - registering
[Fri, 09 Aug 2013 01:24:01 +0000] INFO: HTTP Request Returned 401 Unauthorized: Failed to authenticate. Please synchronize the clock on your client
[Fri, 09 Aug 2013 01:24:01 +0000] FATAL: Stacktrace dumped to /var/chef/cache/chef-stacktrace.out
[Fri, 09 Aug 2013 01:24:01 +0000] FATAL: Net::HTTPServerException: 401 "Unauthorized"

Cause

This error occurs if the Serengeti Management Server and the failed virtual machine clocks are not synchronized.

Solution

u From the vSphere Client, configure all physical hosts to synchronize their clocks by using the same NTP server.

Provisioned virtual machines automatically sync their clocks to their ESXi hosts.

After the clocks are corrected, you can run the cluster create --resume command to complete the cluster provisioning process.

Cannot Create a Cluster with the template-cluster-spec.json File

If you use the /opt/serengeti/conf/hdfs-hbase-template-spec.json from the Serengeti server virtual machine to create a cluster, cluster creation fails.

Problem

The cluster create or cluster create --resume command fails, and the Command-Line Interface displays the following error message:

cluster cluster_name create failed: Unrecognized field "groups" (Class com.vmware.bdd.apitypes.ClusterCreate), not marked as ignorable at [Source: java.io.StringReader@7563a320; line: 3, column: 13] (through reference chain: com.vmware.bdd.apitypes.ClusterCreate["groups"])

Cause

The /opt/serengeti/conf/template-cluster-spec.json file is for Serengeti management server internal use only. It is not a valid cluster specification file.

Solution

u Create your own cluster specification file.

Sample cluster specification files are in the /opt/serengeti/samples directory.

Insufficient Storage Space

If there are insufficient resources available when you run the cluster create or cluster create --resume command, cluster creation fails.

Problem

The cluster create or cluster create --resume command fails, and the Command-Line Interface displays the following error message:

cluster $CLUSTER_NAME create failed: cannot find a host with enough storage to place base nodes [$NODE_NAME] you can get task failure details from serengeti server logs /opt/serengeti/logs/serengeti.log /opt/serengeti/logs/ironfan.log


Cause

This error occurs if there is insufficient datastore space.

Solution

1 Review the serengeti.log file, and search for the phrase “cannot find host with enough”.

This information shows the Serengeti server snapshot of the vCenter Server cluster environment immediately after the placement failure.

You can also find information about the datastore name and its capacity. Additionally, you can find the cluster specification file that you used, and information for the nodes that have been successfully placed.

2 Review your cluster specification file.

The cluster specification file defines the cluster's datastore requirements and determines the available space on the datastore that you added to Serengeti. Use this information to determine which storage is insufficient.

For example, if there is insufficient LOCAL datastore capacity for worker nodes, you must add additional LOCAL datastores to the Serengeti server and assign them to the cluster.

Distribution Download Failure

If the Hadoop distribution's server is down when you run the cluster create or cluster create --resume command, cluster creation fails.

Problem

If the Hadoop distribution's server is down when you run the cluster create or cluster create --resume command, cluster creation fails. The reason the command failed is logged.

n On the Command-Line Interface, the following error message appears:

Bootstrap Failed

n For tarball-deployed distributions, the following error message is logged to the associated background task log, /opt/serengeti/logs/ironfan.log:

Downloading tarball tarball_url failed

n For Yum-deployed distributions, the following error message is logged to the associated background task log, /opt/serengeti/logs/ironfan.log:

Errno::EHOSTUNREACH: remote_file[/etc/yum.repos.d/cloudera-cdh4.repo] (hadoop_common::add_repo line 45) had an error: Errno::EHOSTUNREACH: No route to host - connect(2)

Cause

The package server is down.

n For tarball-deployed distributions, the package server is the Serengeti management server.

n For Yum-deployed distributions, the package server is the source of the Yum-deployed distribution: either the official Yum repository or your local Yum server.

Solution

1 For tarball-deployed distributions, ensure that the httpd service on the Serengeti management server is alive.

2 For Yum-deployed distributions, ensure that the Yum repository file URLs are correctly configured in the manifest file.


3 Ensure that the necessary file is downloadable from the failed node.

Distribution Type Necessary File

tarball-deployed tarball

Yum-deployed Yum repository file

Serengeti Management Server IP Address Unexpectedly Changes

The IP address of the Serengeti management server changes unexpectedly.

Problem

When you create a cluster after the Serengeti management server IP address changes, the cluster creation fails with a bootstrap failure.

Cause

The network setting is DHCP.

Solution

u Restart the Serengeti management server virtual machine.

Cluster Provisioning Hangs for 30 Minutes Before Completing with an Error

When you create, configure, or resume creating or configuring a cluster, the process hangs for 30 minutes before completing with an error message.

Problem

After waiting for 30 minutes for cluster provisioning to complete, a PROVISION_ERROR occurs.

Cause

The NameNode or JobTracker node experienced a Bootstrap Failed error during the cluster creation or configuration process.

Solution

1 Do one of the following:

n Wait for the provisioning process to complete and display the PROVISION_ERROR status.

The wait is 30 minutes.

n If you are using the Serengeti CLI, press Ctrl+C.

If you are using the vSphere Console, no action is required.

2 Log in to the Serengeti management server as user serengeti.

3 Write status information to the metadata database.

echo "update cluster set status = 'PROVISION_ERROR' where name = 'your_cluster_name'; \q" |

psql

4 Review the log file, /var/chef/cache/chef-stacktrace.out, for the NameNode and JobTracker node to determine why the bootstrap failed.

5 Using the information from the log file and from the Troubleshooting topics, solve the problem.


6 Resume the cluster creation process.

a Access the Serengeti CLI.

b Run the cluster create ... --resume command.

Cannot Restart or Reconfigure a Cluster After Changing Its Distribution

After you change a cluster's distribution vendor or distribution version, but not the distribution name, the cluster cannot be restarted or reconfigured.

Problem

When you try to restart or reconfigure a cluster after changing its distribution vendor or distribution version in the manifest, you receive the following error message:

Bootstrap Failed

Cause

When you manually change an existing distribution's vendor or version in the manifest file and reuse a distribution's name, the Serengeti server cannot bootstrap the node.

Solution

1 Revert the manifest file.

2 Use the config-distro.rb tool to add a new distribution, with a unique name, for the distribution vendor and version that you want.

Virtual Machine Cannot Get IP Address

When you run a Serengeti command, it fails.

Problem

A Serengeti command fails, and the Command-Line Interface displays the following error message:

Virtual Machine Cannot Get IP Address

Cause

This error occurs when there is a network configuration error.

For static IP, the cause is typically an IP address conflict.

For DHCP, common causes include:

n The number of virtual machines that require IPs exceeds the available DHCP addresses.

n The DHCP server fails to allocate sufficient addresses.

n The DHCP renew process failed after an IP address expires.

Solution

n Verify that the vSphere port group has enough available ports for the new virtual machine.

n If the network is using static IP addressing, ensure that the IP address range is not used by another virtual machine.

n If the network is using DHCP addressing, ensure that there is an available IP address to allocate for the new virtual machine.


vCenter Server Connections Fail to Log In

The Serengeti management server tries but fails to connect to vCenter Server.

Problem

The Serengeti management server tries but fails to connect to vCenter Server.

Cause

vCenter Server is unreachable for any reason, such as network issues or too many running tasks.

Solution

u Ensure that vCenter Server is reachable.

n Connect to vCenter Server with the vSphere Web Client or the VMware Infrastructure Client (VI Client).

n Ping the vCenter Server IP address to verify that the Serengeti server is connecting to the correct IP address.

SSL Certificate Error When Connecting to Non-Serengeti Server with the vSphere Console

From the vSphere Web Client, you cannot connect to a non-Serengeti management server.

Problem

When you use the Big Data Extensions plug-in to vCenter Server and try to connect to a non-Serengeti server, you receive the following error message:

SSL error:

Check certificate failed.

Please select a correct serengeti server.

Cause

When you use the Big Data Extensions plug-in, you can connect only to Serengeti servers.

Solution

Connect only to Serengeti servers. Do not perform any certificate-related operations.

Serengeti Operations Fail After You Rename a Resource in vSphere

After you use vSphere to rename a resource, Serengeti commands fail for all Serengeti clusters that use that resource.

Problem

If you use vSphere to rename a Serengeti resource that is used by provisioned Serengeti clusters, Serengeti operations fail for the clusters that use that resource. This problem occurs for vCenter Server resource pools, datastores, and networks that you add to Serengeti, and their related hosts, vCenter Server clusters, and so on. The error message depends on the type of resource, but generally indicates that the resource is inaccessible.

Cause

The Serengeti resource mapping requires that resource names do not change.


Solution

u Use vSphere to revert the resource to its original name.

A New Plug-In with the Same or Lower Version Number as an Old Plug-In Does Not Load

When you install a new Big Data Extensions plug-in that has the same version as a previous Big Data Extensions plug-in, the previous version is loaded instead of the new version.

Problem

When you install a new Big Data Extensions plug-in that has the same or lower version number as a previous Big Data Extensions plug-in, the previous version is loaded instead of the new version. This happens regardless of whether you uninstall the previous plug-in.

Cause

When you uninstall a plug-in, the vSphere Web client does not remove the plug-in package from the Serengeti server.

After you install a plug-in with the same or lower version number as the old plug-in, and try to load it, vSphere finds the old package in its local directory. Therefore, vSphere does not download the new plug-in package from the remote Serengeti server.

Solution

1 Uninstall the old plug-in.

2 Remove the old plug-in.

n For vCenter Server Appliances, delete the /var/lib/vsphere-client/vc-packages/vsphere-client-serenity/vsphere-bigdataextensions-$version folder (see the example after this procedure).

n For vSphere Web client servers on Windows, delete the %ProgramData%/vmware/vSphere Web Client/vc-packages/vsphere-client-serenity/vsphere-bigdataextensions-$version folder.

3 Restart the vSphere Web client.

n For vCenter Server Appliances, restart the vSphere Web client service at the vCenter Server Appliance Web console at http://$vCenter-Server-Appliance-IP:5480

n For vSphere Web client servers on Windows, restart the vSphere Web client service from the services console.

4 Install the new plug-in.
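On a vCenter Server Appliance, steps 2 and 3 might look like the following shell session. The version string 1.0.0 is only an example; substitute the version of the plug-in that you uninstalled.

# Remove the cached plug-in package that the uninstall leaves behind.
rm -rf /var/lib/vsphere-client/vc-packages/vsphere-client-serenity/vsphere-bigdataextensions-1.0.0

# Then restart the vSphere Web Client service from the appliance Web console at
# http://$vCenter-Server-Appliance-IP:5480 before you install the new plug-in.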

MapReduce Job Fails to Run and Does Not Appear In the Job History

A submitted MapReduce job fails to run and does not appear in the job history.

Problem

When you submit a MapReduce job and the workload is very heavy, the MapReduce job might not run, and it might not appear in the MapReduce job history.

Cause

During heavy workloads, the JobTracker or NameNode service might be too busy to respond to vSphere HA monitoring within the configured timeout value. When a service does not respond to a vSphere HA request, vSphere restarts the affected service.


Solution

1 Stop the HMonitor service.

Stopping the HMonitor service disables vSphere HA failover.

a Log in to the affected cluster's node.

b Stop the HMonitor service.

sudo /etc/init.d/hmonitor-*-monitor stop

2 Increase the JobTracker vSphere timeout value.

a Open the /usr/lib/hadoop/monitor/vm-jobtracker.xml file for editing.

b Locate the service.monitor.probe.connect.timeout property.

c Change the value of the <value> element (the example after this procedure shows the property format).

d Save your changes and close the file.

3 Increase the NameNode vSphere timeout value.

a Open the /usr/lib/hadoop/monitor/vm-namenode.xml file for editing.

b Locate the service.monitor.portprobe.connect.timeout property.

c Change the value of the <value> element.

d Save your changes and close the file.

4 Start the HMonitor service.

sudo /etc/init.d/hmonitor-*-monitor start
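In both monitor files, the timeout setting uses the standard Hadoop property format. The fragment below only illustrates the shape of the entry; the value shown is an assumed placeholder, so check the existing value in your file for the expected units, and note that in vm-namenode.xml the property name is service.monitor.portprobe.connect.timeout instead.

<property>
  <name>service.monitor.probe.connect.timeout</name>
  <value>30</value>
</property>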

Cannot Submit MapReduce Jobs for Compute-Only Clusters with External Isilon HDFS

You cannot submit MapReduce Jobs for compute-only clusters that point to an external Isilon HDFS.

Problem

If you deploy a compute-only cluster with an external HDFS pointing to Isilon, the deployment appears to be successful. However, the JobTracker is in safe mode, and so you cannot submit MapReduce jobs.

Cause

JobTracker requires a user named mapred.

Solution

u Add the mapred user to HDFS.
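One possible approach, assuming you have shell access to a cluster node with an HDFS client and file-system superuser rights, is to create the account and an HDFS home directory for it. Adjust these commands to the user-management conventions of your Isilon OneFS deployment; they are a sketch, not the only supported method.

# Create a local mapred account if one does not already exist.
sudo useradd mapred

# Create an HDFS home directory for the user and assign ownership.
hadoop fs -mkdir /user/mapred
hadoop fs -chown mapred:mapred /user/mapred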

MapReduce Job Hangs on PHD or CDH4 YARN Cluster

A MapReduce job hangs on a PHD or CDH4 YARN cluster with one DataNode and one NodeManager agent, each with only 378MB of memory.

Problem

MapReduce jobs might hang when you run them on a PHD or CDH4 YARN cluster with one DataNode and one NodeManager agent.

Cause

Insufficient memory resources.


Solution

1 Create a PHD or CDH4 YARN cluster with two DataNodes and two NodeManagers.

2 Rerun the MapReduce job.

Unable to Connect the Big Data Extensions Plug-In to the Serengeti Server

When you install Big Data Extensions on vSphere 5.1 or later, the connection to the Serengeti server fails to authenticate.

Problem

The Big Data Extensions plug-in is unable to connect to the Serengeti server.

Cause

During the deployment, the Single Sign-On (SSO) link was not entered. Therefore, the Serengeti management server cannot authenticate the connection from the plug-in.

Solution

1 Open a command shell on the Serengeti Management server.

2 Configure the SSO settings.

EnableSSOAuth https://VCIP:7444/lookupservice/sdk

3 Restart the Tomcat service.

sudo service tomcat restart

4 Reconnect the Big Data Extensions plug-in to the Serengeti Management server.
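Putting steps 2 and 3 together, the session on the Serengeti Management server might look like the following. The address 192.0.2.10 is a placeholder for your vCenter Server (SSO lookup service) IP address.

# Point the Serengeti server at the vCenter Single Sign-On lookup service.
EnableSSOAuth https://192.0.2.10:7444/lookupservice/sdk

# Restart Tomcat so that the new SSO configuration takes effect.
sudo service tomcat restart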


Index

Numerics
401 unauthorized error 76

A
about 9
About this book 7
About cluster topology 48
access Hive data 71
accessing
    Command-Line Interface 38
    HBase databases 70
accessing data 71
accessing with JDBC or ODBC 71
add a datastore 44
add a network 45
Add distribution 33
adding
    resource pools 43
    topology 51
Apache log4j logs 76
architecture 10

B
balancing workloads 51
Big Data Extensions
    delete a cluster 57
    remove a datastore 45
bootstrap failed 76

C
change on all cluster nodes 41
CLI console 22
client nodes for Hadoop 48, 49
clock synchronization, and cluster creation failure 76
Cloudera 31
cluster, creating 49
cluster config command 55
cluster creation failure
    401 unauthorized error 76
    Bootstrap Failed 76
    distribution download failure 78
    Serengeti server IP address changes 79
    template-cluster-spec.json file 77
cluster export command 55
cluster fix command 61
cluster specification files
    Hadoop distribution JARs 55
    reconfiguring clusters 55
    resource pool symbolic link 43
    samples 77
cluster target command 67, 68
clusters
    creating 47
    deploying under different resource pools 44
    failover 61
    managing 53
    provisioning hangs 79
    reconfiguring 55
    running MapReduce jobs on PHD or CDH4 83
    scaling out 54
    stopping and starting 53
    topology constraints 51
    view for Hadoop 64
Command-Line Interface, accessing 38
configuration files, converting Hadoop XML to Serengeti JSON 55
Configure a Yum-deployment 32
Configure a Yum deployment 31, 32
configure Hive 73
configuring, logging levels 76
Configuring Yum and Yum repositories 25
connect to a server instance 21
connect to Serengeti Management Server 21
connect to with Big Data Extensions plug-in 21
connecting, Serengeti service 38
connection failure 81, 84
convert-hadoop-conf.rb conversion tool 55
converting Hadoop XML to Serengeti JSON 55
copying files, between local filesystem and HDFS 67
create cluster 49
create local yum repo 27
creating, clusters 47
custom CentOS virtual machine 34


D
datastores
    adding 44
    removing 45
delete cluster 57
deploy OVA 17
deploy vApp 17
deployment log files 76
disk failure, recovering from 61
disk IO shares 60
distribution 23
distribution download failure 78
download CLI console 22

E
elastic scaling 57
elasticity
    log file 76
    See also resources

F
feature support by distribution 12
feature support for Hadoop 11
feature support for Hadoop distributions 11
files, copying between local filesystem and HDFS 67
fs get command 67
fs put command 67

H
Hadoop cluster, delete 57
Hadoop cluster information, viewing 64
Hadoop clusters
    viewing 63
    See also clusters
Hadoop distribution 23
Hadoop operations, performing 67
Hadoop Distributed File System, monitor status 65
Hadoop distributions, JAR files 55
Hadoop Virtualization Extensions (HVE) 51
HBase clusters, See also clusters
HBase database access 70
HBase operations, performing 67
HBase, monitor status 66
HBase status 66
HDFS commands, running 69
HDFS status, See Hadoop Distributed File System
Hive scripts, running 68
hive script command 68

I
I/O bandwidth 60
install 19
intended audience 7
IP address of virtual machine, cannot get 80

J
Java Runtime Environment (JRE) 38
JDBC 71

L
loading, plug-ins 82
log files 76
log in 41
logging, levels, configuring 76

M
managing
    clusters 53
    vSphere resources 43
MapR 31
MapReduce 66
MapReduce jobs
    and compute-only clusters 83
    cannot submit 83
    fail to run 82
    hanging 83
    running 69
master nodes for Hadoop 48, 49
monitor HBase status 66
monitor the Hadoop Distributed File System status 65
monitoring, Big Data Extensions environment 63
monitoring nodes 64
mr jar command 69

N
network, removing 46
networks, adding 45
nodes
    cannot get IP address 80
    monitoring 64

O
ODBC 71
Open Database Connectivity 73
operations fail 81
optimize resource usage 59

P
passwords 40
performing, Hadoop and HBase operations 67
pig script command 68


Pig scripts, running 68
Pivotal Hadoop 29
plug-ins
    cannot connect 84
    changing versions 82
    loading 82
prerequisites 15
Project Serengeti 10
provisioning, hangs 79

R
rack topologies 51
recovering from disk failure 61
register 19
remove a datastore 45
remove a network 46
remove serengeti-snapshot 37
removing, resource pools 44
renaming, vSphere resources 81
resource pools
    adding 43
    removing 44
resourcepool add command 43
resourcepool delete command 44
resourcepool list command 44
resources, renaming in vSphere 81
running
    HDFS commands 69
    Hive scripts 68
    MapReduce jobs 69
    Pig scripts 68

S
scale out cluster 54
Scale Out Failed error message 54
scale up or down 54
Serengeti servers
    fail to connect to vCenter Server 81
    IP address changes 79
Serengeti service, connecting 38
service log file 76
Single Sign-on (SSO) 38, 84
SSL certificates, errors 81
start cluster 53
stop cluster 53
stop service 40
submitting, problems with MapReduce jobs 83
system requirements 15

T
tarball 23
Technical Support 7
template-cluster-spec.json file, and cluster creation failure 77
topology, adding 51
topology upload command 51
topology list command 51
troubleshooting
    cluster creation failures 76
    log files for 76
    overview 75

U
unregister 19
uploading, topology 51
username 39
username and password 39

V
vCenter Server Web console, start or stop a cluster 53
vCenter Server, connections fail to log in 81
versions, plug-ins 82
view cluster information 63
view cluster name 63
view Hadoop cluster information 64
virtual machine, cannot get IP address 80
VMware Technical Publications Glossary 7
vSphere Fault Tolerance (FT) 61
vSphere High Availability (HA) 61
vSphere resources
    managing 43
    resource pools 43

W
worker nodes for Hadoop 48, 49
workloads, balancing 51

Y
yum 23
yum values 26

