
Upgrading IBM® Open Platform with Apache© Hadoop 4.1

and Ambari™ 2.1

from IBM Spectrum Scale Hadoop connector

to IBM HDFS Transparency

Version 1.1

IBM Spectrum Scale BDA Team

2016-8-15


Contents

1. Background
2. Upgrade Guide
   2.1 Preparation
   2.2 Checklist
   2.3 Update steps
3. Revision History

1. Background

IBM Spectrum Scale provides integration with the Hadoop framework through a Hadoop

connector.

IBM Spectrum Scale has released two types of Hadoop connectors:

First generation connector: Hadoop connector

This first generation connector enables Hadoop over IBM Spectrum Scale through the Hadoop file system APIs. It is in Support Mode only, and no new functions are being delivered for it.

The Hadoop connector rpm package name is gpfs.hadoop-connector-<version>.<arch>.rpm.

The Hadoop connector is integrated with IBM Spectrum Scale as an Ambari stack in IBM

BigInsights™ Ambari IOP 4.1 released in November, 2015.

The Ambari integration package is called gpfs.ambari-iop_4.1-<version>.noarch.rpm.

Second generation connector: HDFS transparency

The second generation IBM Spectrum Scale HDFS Transparency, also known as HDFS

Protocol, offers a set of interfaces that allows applications to use the native HDFS client to

access IBM Spectrum Scale through HDFS RPC requests. This new connector has an


improved architecture that leverages the native HDFS client for better compatibility,

performance and support for third party tools. The HDFS transparency connector is the

strategic direction for Hadoop support on Spectrum Scale.

The HDFS transparency rpm package name is gpfs.hdfs-protocol-<version>.<arch>.rpm.

HDFS transparency is integrated with IBM Spectrum Scale as an Ambari service in IBM

BigInsights Ambari IOP 4.1 released in July, 2016.

The Ambari integration package is called gpfs.hdfs-transparency.ambari-iop_4.1-<version>.noarch.rpm.

This document describes how to upgrade from an IOP 4.1 and IBM Spectrum Scale cluster that uses the first generation Hadoop connector to an IOP 4.1 and IBM Spectrum Scale cluster that uses the second generation HDFS transparency connector. This manual upgrade from the first generation connector to the new HDFS Transparency connector is a one-time process; future upgrades will be handled through the Ambari dashboard.

For an existing cluster that runs IOP 4.1, Ambari 2.1, IBM Spectrum Scale, and the Hadoop connector, the following packages are already deployed in your environment:

gpfs.ambari-iop_4.1-<version>.noarch.rpm
gpfs.hadoop-connector-2.7.0-<version>.<arch>.rpm

To upgrade to the second generation HDFS transparency, the following packages are required:

gpfs.hdfs-transparency.ambari-iop_4.1-0.noarch.rpm
gpfs.hdfs-protocol-<version>.x86_64.rpm

The packages above can be downloaded from the IBM DeveloperWorks - IOP with Apache

Hadoop 2nd generation HDFS Transparency webpage.

To determine the connector that the cluster is currently using, run the following commands:

To see if the first generation Hadoop connector is running, run the following command

on all the nodes where the connector is installed:

rpm -qa | grep gpfs.hadoop


To see if HDFS transparency is running, run the following command on all the nodes

where the connector is installed:

rpm -qa | grep gpfs.hdfs

If the command returns the corresponding package, then the cluster is using that connector.

To determine the Ambari integration package that the cluster is currently using, run the

following commands on the Ambari server:

To check if the first generation Hadoop connector Ambari integration package is being

used, run the following command:

rpm -qa | grep gpfs.ambari

To check if the HDFS transparency Ambari integration package is being used, run the

following command:

rpm -qa | grep gpfs.hdfs-transparency

If the command returns the corresponding package, the cluster is using that Ambari integration

package.
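The checks above can be combined into one pass over the cluster. The following is a minimal sketch, assuming passwordless ssh from the Ambari server to every node and a hypothetical file nodes.txt that lists one cluster hostname per line:

#!/bin/bash
# Report which connector package each node has installed.
# nodes.txt is a hypothetical file with one cluster hostname per line.
while read -r node; do
    echo "=== ${node} ==="
    ssh "${node}" "rpm -qa | grep -E 'gpfs\.(hadoop|hdfs)' || echo 'no connector package installed'"
done < nodes.txt

# On the Ambari server itself, report which Ambari integration package is installed.
rpm -qa | grep -E 'gpfs\.(ambari|hdfs-transparency)' || echo "no Ambari integration package installed"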

2. Upgrade Guide

2.1 Preparation

IMPORTANT NOTE

The current environment contains information that must be captured before starting the upgrade to HDFS transparency. Record all the information mentioned in Step 1 through Step 5 in a separate document and keep it up to date.

Step1) Write your Ambari server hostname. This document will refer to it as

ambari_server_host.

Step2) Write the Ambari administrator user name and password.


By default, the user name is 'admin' and the password is 'admin', as used when the Ambari server was set up. If the username and password were changed during the installation, ensure that you have the new values.

Step3) Write the zookeeper server hostname. If you have more than one zookeeper server, write down only one of them. This document will refer to it as zookeeper_server_host.

Step4) Write the HBase configuration value hbase.rootdir.

From the Ambari GUI, click HBase > Configs > Advanced to get the hbase.rootdir value.
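If you prefer the command line, the same value can usually be read from the HBase configuration file that Ambari deploys. A minimal sketch, assuming the client configuration is under /etc/hbase/conf on an HBase node (the path may differ in your environment):

# Print the hbase.rootdir property and its value from the deployed hbase-site.xml.
grep -A1 '<name>hbase.rootdir</name>' /etc/hbase/conf/hbase-site.xml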

Step5) Write the meta data information about Hive:

If MySQL server is used for Hive’s meta data:

Write the MySQL server hostname if you have installed the Hive service. From the Ambari GUI, click Hive > Summary to get the MySQL Server value. This document will refer to it as hive_mysql_server_host.

Write the MySQL server hostname, username, password, and database values. From the Ambari GUI, click Hive > Configs > Advanced > Hive Metastore to get the values.

Write the following information from the Hive Metastore panel:

Database Host: The MySQL server for Hive.

Database Name: The database name. This document will refer to it as Hive_mySQL_db.

Database Username: This document will refer to it as Hive_MySQL_Username.

Database Password: This document will refer to it as Hive_MySQL_Password.
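To confirm that the values you wrote down are correct, you can test them against the Hive metastore database. A minimal sketch, assuming the mysql client is available on the node where you run it:

# Verify the recorded credentials by listing the databases on the Hive MySQL server.
# Enter Hive_MySQL_Password when prompted; the listing should include Hive_mySQL_db.
mysql -h <Database Host> -u <Hive_MySQL_Username> -p -e "show databases;"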


If PostgreSQL is used for Hive’s meta data:

Write the PostgreSQL Database Host, Database Name, Database Username and Database

Password from the Ambari GUI under Hive > Configs > Hive Metastore:

Database Host: This document will refer to it as Hive_PostgreSQL_Server

Database Name: The database name. This document will refer to it as

Hive_PostgreSQL_db.

Database Username: This document will refer to it as Hive_PostgreSQL_Username.


Database Password: This document will refer to it as Hive_PostgreSQL_Password.

Step6) Check a sample of the current data in the HBase, Hive, and BigInsights Value-Add databases.

This sample is used as a sanity check to verify that everything is functioning correctly after the upgrade is complete.

HBase

On any HBase node, run the following commands to check the data:

# su - hbase
$ /usr/iop/4.1.0.0/hbase/bin/hbase shell
hbase(main):001:0> list
TABLE
ambarismoketest
moviedb
2 row(s) in 0.2050 seconds
hbase(main):003:0> scan 'moviedb'
ROW                 COLUMN+CELL
 21jumpstreet-2012  column=director:, timestamp=1470840069133, value=phil lord
 21jumpstreet-2012  column=genre:, timestamp=1470840069146, value=comedy
 21jumpstreet-2012  column=title:, timestamp=1470840069126, value=21 jump street

Note: moviedb is an example table. Replace moviedb with the name of a table that exists in your cluster.

Hive

On any Hive node, run the following commands to check the data:

su - hive
$ /usr/iop/4.1.0.0/hive/bin/hive
hive> show databases;
OK
bigdata
default
Time taken: 1.86 seconds, Fetched: 2 row(s)


hive> use default;
hive> show tables;
OK
hivetbl1
hivetbl10_part
Time taken: 0.077 seconds, Fetched: 2 row(s)
hive> select * from hivetbl1;

Note: default is an example database value. Replace default with the name of a database that

exists in your cluster.

BigInsights BigSQL

On the BigSQL head node, run the following commands to check the data:

su - bigsql
db2 => LIST ACTIVE DATABASES

                        Active Databases

Database name                     = BIGSQL
Applications connected currently  = 3
Database path                     = /var/ibm/bigsql/database/bigdb/bigsql/NODE0000/SQL00001/MEMBER0000/

db2 => connect to bigsql

   Database Connection Information

 Database server       = DB2/LINUXX8664 10.6.3
 SQL authorization ID  = BIGSQL
 Local database alias  = BIGSQL

db2 => select schemaname from syscat.schemata

SCHEMANAME
------------------------------


BIGSQL
DEFAULT
SYSFUN
SYSHADOOP
SYSIBM
SYSIBMADM
SYSIBMINTERNAL
SYSIBMTS
SYSPROC
SYSPUBLIC
SQLJ
SYSCAT
SYSSTAT
SYSTOOLS
GOSALESDW
NULLID

  16 record(s) selected.

From the "select schemaname from syscat.schemata" command output, select a schema that holds application data and use it in the list tables command below. Save the current table list output; it is used later for a sanity check after the upgrade process. This example uses the <user-app-schema1> schema.

db2 => list tables for schema <user-app-schema1>

Table/View                      Schema          Type  Creation time
------------------------------- --------------- ----- --------------------------

Step7) Write the default data replica value of your IBM Spectrum Scale file system:

# mmlsfs <your-file-system> -r
flag                value                    description
------------------- ------------------------ -----------------------------------
 -r                 3                        Default number of data replicas

In this example, 3 is the default number of data replicas.

Step8) Stop all application jobs, including Hive and BigInsights Value-Add jobs.


Note: The backup steps that follow require the Ambari server and the Hive service to be up. If you are unable to stop all the application jobs, an alternative is to stop all Ambari services (such as YARN, Hive, and HBase) from the Ambari GUI. After all the services are stopped, start only the Hive service.

Step9) Perform a backup of the Ambari database.

Ensure that the Ambari server is up.

Log in to the ambari_server_host as root and run the following commands to back up the

Ambari database:

su - postgres

pg_dump ambari > ambari.backup
pg_dump ambarirca > ambarirca.backup

Note: Do not remove the backup files.

Step10) Perform a backup of the Hive meta data database.

Ensure that the Hive services are up.

If MySQL server is used as the Hive’s meta data database:

Log in to the Hive MySQL server node as root and run the following commands to list all the

databases in the MySQL server:

## run the following command to list all the databases in your MySQL environment:
mysql -u <Hive_MySQL_Username> -p

## input your Hive_MySQL_Password here
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.01 sec)


MariaDB [(none)]>

For each database listed above, run the following command from the bash console to perform

the backup:

# for the above listed databases, run the following commands to back them up
mysqldump -u hive -p <Hive_mySQL_db> > hive.backup

The planned upgrade modifies the <Hive_mySQL_db>. However, to avoid any potential issues,

perform a backup of all the databases.

For example:

mysqldump -u hive -p hive > hive.backup
mysqldump -u hive -p information_schema > information_schema.backup
mysqldump -u hive -p mysql > mysql.backup
mysqldump -u hive -p performance_schema > performance_schema.backup
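If there are many databases, the individual mysqldump commands can be driven by a loop. A minimal sketch, run on the Hive MySQL server node; note that system schemas such as information_schema may need extra options (for example --skip-lock-tables) depending on the MySQL/MariaDB version:

# Back up every database returned by "show databases", one dump file per database.
# The password is prompted once for the listing and once for each dump.
for db in $(mysql -u <Hive_MySQL_Username> -p -N -s -e "show databases;"); do
    mysqldump -u <Hive_MySQL_Username> -p "${db}" > "${db}.backup"
done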

If PostgreSQL server is used as the Hive’s meta data database:

Log in to the Hive_PostgreSQL_Server as root and run the following commands to back up the

Hive’s meta data:

su - <Hive_PostgreSQL_Username>

pg_dump <Hive_PostgreSQL_db> > <Hive_PostgreSQL_db>.backup

Note: Input the Hive_PostgreSQL_Password. Replace the <Hive_PostgreSQL_Username> and

the <Hive_PostgreSQL_db> values according to the information in Step 5.

After performing these steps, you can perform the update to HDFS transparency.

2.2 Checklist

Review the following checklist to ensure that all the tasks are completed before proceeding with the upgrade.

Checklist#  Description                                                             Completed?
1           Downloaded the HDFS transparency package?
2           Downloaded the new Ambari integration for HDFS transparency package?
3           Wrote the following Ambari information:
            - Ambari Server node hostname
            - Ambari username
            - Ambari password
4           Wrote the following MySQL information:
            - MySQL server node hostname
            - MySQL database username
            - MySQL database password
5           Performed the sanity check for Hive data?
6           Performed the sanity check for BigInsights Value-Add, such as BigSQL?
7           Performed a backup of the Ambari database?
8           Performed a backup of the MySQL database?

2.3 Update steps

Before proceeding, ensure that you have performed all the steps in Section 2.1 Preparation.

IMPORTANT NOTE

Review the sample commands in Step6 and Step16. If you can perform those steps, proceed. Otherwise, contact [email protected] for guidance.

To upgrade to HDFS transparency, perform the following steps:

Step1) Check that all the services except the Hive service are stopped on the Ambari GUI.

Note: The Hive service must remain active for the metadata updates in the following steps.

Step2) Remove the GPFS service through the Ambari REST API by running the following command from a Bash console on the Ambari server as root.

curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://localhost:8080/api/v1/clusters/<your-IOP-cluster-name>/services/GPFS

Note: Replace <your-IOP-cluster-name> in the above command with your cluster name. The cluster name is displayed in the top-left panel after you log in to the Ambari GUI. Replace admin:admin with your Ambari username and password.
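Before issuing the DELETE, you can confirm that the GPFS service is visible through the same REST endpoint. A minimal sketch, assuming the Ambari server listens on the default port 8080:

# List the GPFS service definition; an error or 404 response means it is already removed.
curl -u admin:admin -H "X-Requested-By: ambari" -X GET http://localhost:8080/api/v1/clusters/<your-IOP-cluster-name>/services/GPFS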


For example, iop420 is the cluster name in the environment used for this document.

Refresh the Ambari GUI and check that the Spectrum Scale menu from the left panel is

removed.

Step3) Log in to the Ambari postgres database console.

Log in to the Ambari server node as root:

# su - postgres
# psql
postgres=# \connect ambari

Step4) Check the stack version listed on the postgres console (see Step3):

ambari=# select * from ambari.stack;
 stack_id | stack_name  | stack_version
----------+-------------+-------------------
        1 | BigInsights | 4.1
        2 | BigInsights | 4.0
       51 | BigInsights | 4.1.SpectrumScale
(3 rows)

Write the stack_id values corresponding to the stack_version column for 4.1 and

4.1.SpectrumScale.

For the above output, the numbers 1 and 51 are the stack_ids for the corresponding stack

versions 4.1 and 4.1.SpectrumScale. In later steps, we need to change database records from

the stack version "4.1.SpectrumScale" to “4.1”. The 4.1.SpectrumScale stack_version is the

Ambari GPFS integration package version for the older Hadoop connector. For the new HDFS

transparency connector, a different Ambari stack is not required because it integrates as a

service in the default stack.

Step5) Dump all the Hive meta data records:

If MySQL server is used as the Hive’s meta data database:

Log in to the Hive MySQL server node and dump all the records from the MySQL database.

# first create mysql_migrate.sh


# vim mysql_migrate.sh
# cat mysql_migrate.sh
#!/bin/bash
database="$1"
username="$2"
password="$3"
if [[ "$database" == "" || "$password" == "" ]]; then
    echo "$0 <database-name> <username> <password>"
    exit
fi
echo "Begin to query all tables under database $1..."
index=1
for table in BUCKETING_COLS CDS COLUMNS_V2 COMPACTION_QUEUE COMPLETED_TXN_COMPONENTS \
    DATABASE_PARAMS DBS DB_PRIVS DELEGATION_TOKENS FUNCS FUNC_RU GLOBAL_PRIVS HIVE_LOCKS \
    IDXS INDEX_PARAMS MASTER_KEYS NEXT_COMPACTION_QUEUE_ID NEXT_LOCK_ID NEXT_TXN_ID \
    NOTIFICATION_LOG NOTIFICATION_SEQUENCE NUCLEUS_TABLES PARTITIONS PARTITION_EVENTS \
    PARTITION_KEYS PARTITION_KEY_VALS PARTITION_PARAMS PART_COL_PRIVS PART_COL_STATS \
    PART_PRIVS ROLES ROLE_MAP SDS SD_PARAMS SEQUENCE_TABLE SERDES SERDE_PARAMS \
    SKEWED_COL_NAMES SKEWED_COL_VALUE_LOC_MAP SKEWED_STRING_LIST SKEWED_STRING_LIST_VALUES \
    SKEWED_VALUES SORT_COLS TABLE_PARAMS TAB_COL_STATS TBLS TBL_COL_PRIVS TBL_PRIVS \
    TXNS TXN_COMPONENTS TYPES TYPE_FIELDS VERSION
do
    echo "${index} table name $table"
    echo "============>"
    echo "use ${database};" > /tmp/iop41_mig.sql
    echo "select * from ${table};" >> /tmp/iop41_mig.sql
    mysql -u ${username} --password=${password} < /tmp/iop41_mig.sql
    echo "<============"
    echo
    ((index++))
done

# chmod a+rx mysql_migrate.sh
# run the script to dump the records

# ./mysql_migrate.sh <Hive_mySQL_db> <Hive_MySQL_Username> <Hive_MySQL_Password> > mysqlData.output

The <Hive_mySQL_db>, <Hive_MySQL_Username>, and <Hive_MySQL_Password> values are the ones recorded in Step 5 of section 2.1 Preparation.

If PostgreSQL server is used as the Hive’s meta data database:


Log in to the Hive_PostgreSQL_Server node and dump all the records from the PostgreSQL database.

# first create postgresql_migrate.sh
# vim postgresql_migrate.sh
# cat postgresql_migrate.sh
#!/bin/bash
database="$1"
username="$2"
if [[ "$database" == "" ]]; then
    echo "$0 <database-name> <username>"
    exit
fi
echo "Begin to query all tables under database $1..."
echo > /tmp/iop41_mig.sql
echo "\c ${database};" > /tmp/iop41_mig.sql
index=1
for table in BUCKETING_COLS CDS COLUMNS_V2 compaction_queue completed_txn_components \
    DATABASE_PARAMS DBS DB_PRIVS DELEGATION_TOKENS FUNCS FUNC_RU GLOBAL_PRIVS hive_locks \
    IDXS INDEX_PARAMS MASTER_KEYS next_compaction_queue_id next_lock_id next_txn_id \
    NOTIFICATION_LOG NOTIFICATION_SEQUENCE NUCLEUS_TABLES PARTITIONS PARTITION_EVENTS \
    PARTITION_KEYS PARTITION_KEY_VALS PARTITION_PARAMS PART_COL_PRIVS PART_COL_STATS \
    PART_PRIVS ROLES ROLE_MAP SDS SD_PARAMS SEQUENCE_TABLE SERDES SERDE_PARAMS \
    SKEWED_COL_NAMES SKEWED_COL_VALUE_LOC_MAP SKEWED_STRING_LIST SKEWED_STRING_LIST_VALUES \
    SKEWED_VALUES SORT_COLS TABLE_PARAMS TAB_COL_STATS TBLS TBL_COL_PRIVS TBL_PRIVS \
    txns txn_components TYPES TYPE_FIELDS VERSION
do
    echo "select * from \"${table}\";" >> /tmp/iop41_mig.sql
    ((index++))
done
echo "\q" >> /tmp/iop41_mig.sql
psql -U ${username} < /tmp/iop41_mig.sql

# chmod a+rx postgresql_migrate.sh
# run the script to dump the records

# ./postgresql_migrate.sh <Hive_PostgreSQL_db> <Hive_PostgreSQL_Username> > mysqlData.output

The <Hive_PostgreSQL_db>, <Hive_PostgreSQL_Username>, and <Hive_PostgreSQL_Password> values are the ones recorded in Step 5 of section 2.1 Preparation.


NOTE: To avoid database crashes that might occur because of using the wrong stack_id entries

for Step6 and Step16, you can send the output from Step4 and the file mysqlData.output to

[email protected] before proceeding. The IBM Support team will return a list of commands for

your environment for performing Step6 and Step16.

If you have carefully reviewed your commands and changes for Step6 and Step16 and confirmed that they are correct, continue with the following steps.

Step6) Update the Ambari database to switch the stack version from 4.1.SpectrumScale to 4.1.

Note: The commands with the stack_id values of 1 and 51 are derived from the output of Step

4. You must change the values according to the output of Step 4.

update ambari.clusterconfig set stack_id = '1' where stack_id = '51';
update ambari.clusters set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.clusterstate set current_stack_id = '1' where current_stack_id = '51';
update ambari.servicedesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.serviceconfig set stack_id = '1' where stack_id = '51';
update ambari.servicecomponentdesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
update ambari.hostcomponentdesiredstate set desired_stack_id = '1' where desired_stack_id = '51';
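After running the updates, you can confirm that no records still reference the old stack_id before restarting Ambari. A minimal sketch, run as root on the Ambari server node; it assumes the old stack_id is 51, so adjust the value to match your output from Step4 (all counts should be 0):

# Count rows in a few of the updated Ambari tables that still reference the old stack_id.
su - postgres -c "psql -d ambari -c \"
select 'clusterconfig' as tbl, count(*) from ambari.clusterconfig where stack_id = '51'
union all
select 'clusters', count(*) from ambari.clusters where desired_stack_id = '51'
union all
select 'serviceconfig', count(*) from ambari.serviceconfig where stack_id = '51';
\""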

Step7) Restart the Ambari server and stop/start all the Ambari agents.

On the Ambari server node, run the following commands to stop and start the Ambari server:

ambari-server stop

ambari-server start

On all the Ambari agent nodes, run the following commands to stop and start the Ambari

agents:

ambari-agent stop

ambari-agent start

Log in to the Ambari GUI.

NOTE: If you cannot log in to the Ambari GUI, contact [email protected] immediately.

Step8) Uninstall the IBM Spectrum Scale Ambari integration package for the Hadoop connector.


On the Ambari server, uninstall the old integration package by running the following command:

rpm -e gpfs.ambari-iop_4.1*

Follow the commands in Step7 to restart the Ambari server and all the agents.

Step9) Add the native HDFS service into Ambari.

Follow the Ambari wizard from the Ambari dashboard. Click Actions > Add Service.

Note: The HDFS NameNode in this step will be the HDFS transparency NameNode in step 15

and it should be one of the nodes of the IBM Spectrum Scale cluster.

Step10) Check the configuration for some of the services in Ambari.

HDFS

Check the fs.defaultFS configuration on the Ambari dashboard under HDFS > Configs > Advanced core-site and ensure that it is hdfs://<hdfs-namenode-hostname>:8020. The <hdfs-namenode-hostname> must be the HDFS NameNode from Step 9.
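You can also check the value that the deployed client configuration actually resolves to. A minimal sketch, run on any node after the HDFS client configurations have been pushed out by Ambari:

# Print the effective fs.defaultFS from the Hadoop client configuration.
hdfs getconf -confKey fs.defaultFS
# Expected output: hdfs://<hdfs-namenode-hostname>:8020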

MapReduce2

For MapReduce2, on the Ambari dashboard, click MapReduce2 > Configs > Advanced and check:

mapreduce.client.submit.file.replication

If the value is 0, change it to the data replica value that was written down in Step7 of section 2.1 Preparation.

HBase

For HBase, remove the following configuration entries from the HBase > Configs > Advanced > Custom hbase-site panel:

gpfs.sync.queue=true
gpfs.sync.range=true
hbase.fsutil.hdfs.impl=org.apache.hadoop.hbase.gpfs.util.FSGPFSUtils
hbase.regionserver.hlog.writer.impl=org.apache.hadoop.hbase.gpfs.regionserver.wal.PreallocatedProtobufLogWriter
hbase.regionserver.hlog.reader.impl=org.apache.hadoop.hbase.gpfs.regionserver.wal.PreallocatedProtobufLogReader


You can click the Remove button next to each entry to remove it from Custom hbase-site.

Check the hbase.rootdir field under HBase > Configs > Advanced > Advanced hbase-site and ensure that the hostname specified in the value is the HDFS NameNode hostname, for example, hdfs://c16f1n06.gpfs.net:8020/apps/hbase/data. If the value in this field does not reflect the correct NameNode (c16f1n06.gpfs.net in this example), modify it accordingly.

Step11) Restart all the services and run the service check for HDFS.

There is no need to run service checks for the other services.

Step12) Stop all services on the Ambari GUI.

Step13) Manually uninstall the old connector from all the nodes.

/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhadoopctl connector stop"
/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhadoopctl connector detach --distribution BigInsights"
/usr/lpp/mmfs/bin/mmdsh -N all "rpm -e gpfs.hadoop-connector"

Note: If the above steps are not performed, IBM Spectrum Scale will report errors during the later installation steps. The first command, mmhadoopctl connector stop, reports an error if the Spectrum Scale Hadoop connector was already stopped in Step 1; in that case the error messages only mean that the connector is not up. You can use the mmhadoopctl connector getstate command to check the connector and run mmhadoopctl connector stop only if the connector is still up, as in the sketch below.
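A minimal sketch of the check-then-stop approach described in the note above, using the same mmdsh invocation as the uninstall commands:

# Check the first generation connector state on every node.
/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhadoopctl connector getstate"

# Run the stop only if the previous command shows the connector is still up on some nodes.
/usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhadoopctl connector stop"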

Step14) Install the new GPFS Ambari integration module for HDFS Transparency on the Ambari

server node.

Download the GPFS Ambari integration module (gpfs.hdfs-transparency.ambari-iop_4.1-<version>.noarch.bin) from the IBM DeveloperWorks Spectrum Scale Wiki - IBM Open Platform with Apache Hadoop - 2nd generation HDFS Transparency - Download Releases section.

Download the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with

Ambari 2.1 document on the IBM DeveloperWorks Spectrum Scale Wiki.

o Follow the section 5.4.2 Setting up the IBM Spectrum Scale repository to set up

IBM Spectrum Scale HDFS transparency repository in the Deploying BigInsights

4.1 IBM Spectrum Scale HDFS transparency with Ambari 2.1 document.

o Follow the section 4.2.1.3 Add Spectrum Scale service to an existing Ambari IOP

and an HDFS Transparency cluster - Install the GPFS integration module into

Ambari in the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with

Ambari 2.1 document.

Step15) Add the IBM Spectrum Scale service to Ambari and integrate the existing IOP with the

existing IBM Spectrum Scale cluster.

Follow section 4.2.1.3 Add Spectrum Scale service to an existing Ambari IOP and an HDFS Transparency cluster - Adding the IBM Spectrum Scale service to Ambari in the Deploying BigInsights 4.1 IBM Spectrum Scale HDFS Transparency with Ambari 2.1 document on the IBM DeveloperWorks Spectrum Scale Wiki.

Step16) Update the data in the Hive meta data server.

The old data ingested through the Hadoop connector uses the gpfs:// schema in the meta data database.

This schema is not supported by HDFS transparency because HDFS transparency uses the native HDFS schema, so the correct schema is hdfs://. Therefore, all the records in the meta data database must be changed from the gpfs:// value to the hdfs:// value.

NOTE: If this modification is not implemented, you will be unable to view the old data in Hive.

Assuming that your HDFS transparency NameNode is HDFS-Transparency-host in Step15, the

correct schema after the upgrade is hdfs://HDFS-Transparency-host:8020. Check the mysqlData.output file from Step5 and determine the to-be-updated table list: the tables that have records with an incorrect gpfs:// or hdfs:// value. For example, the following schema values are all invalid: gpfs:///, gpfs://HDFS-Transparency-host:8020, gpfs://not-HDFS-Transparency-host:8020, and hdfs://not-HDFS-Transparency-host:8020. If a table has records that use any of these four invalid schema formats, add the table to the to-be-updated table list.


For example, if in the mysqlData.output of Step5 the records of table DBS use gpfs://c8f2n13.gpfs.net:8020 (assuming that the correct schema is hdfs://c8f2n13.gpfs.net:8020), then the table DBS must be added to the to-be-updated table list:

table name DBS

============>

DB_ID DESC DB_LOCATION_URI NAME OWNER_NAME OWNER_TYPE

1 Default Hive database gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse default public ROLE

6 Hive test database gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse/bigdata.db bigdata hive USER

11 NULL gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse/bigsql.db bigsql bigsql USER

16 NULL gpfs://c8f2n13.gpfs.net:8020/apps/hive/warehouse/gosalesdw.db gosalesdw bigsql USER

The tables DBS and SDS are two tables that must be changed. Check the other tables in your cluster to see whether they also need to be changed.
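To help build the to-be-updated table list, you can search the dump from Step5 for invalid URIs. A minimal sketch, assuming the dump file is mysqlData.output and the correct NameNode is HDFS-Transparency-host; it relies on the "table name <TABLE>" banner that the dump script prints before each table:

# List every dumped table that still contains a gpfs:// URI.
awk '/table name/ {tbl=$0} /gpfs:\/\// {print tbl}' mysqlData.output | sort -u

# List every dumped table that contains an hdfs:// URI pointing at the wrong host.
awk '/table name/ {tbl=$0} /hdfs:\/\// && !/hdfs:\/\/HDFS-Transparency-host:8020/ {print tbl}' mysqlData.output | sort -u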

If using MySQL for Hive’s meta data:

For each table in to-be-updated table list, issue commands similar to the following to correct

the schema for all records:

update DBS set DB_LOCATION_URI=(REPLACE(DB_LOCATION_URI, 'gpfs://', 'hdfs://'));

If you want to update only one specific record (e.g. only update the record whose DB_ID is 1),

use a command similar to the following:

update DBS set DB_LOCATION_URI=(REPLACE(DB_LOCATION_URI, 'gpfs://', 'hdfs://')) where DB_ID ='1';

If your schema in the table DBS is gpfs://not-HDFS-Transparency-host:<port-number>, use a

command similar to the following:

update DBS set DB_LOCATION_URI=(REPLACE(DB_LOCATION_URI, 'gpfs://not-HDFS-Transparency-host:<port-number>', 'hdfs://HDFS-Transparency-host:8020'));
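After running the updates, a quick way to confirm that no gpfs:// records remain in the tables you changed is to count them. A minimal sketch using the mysql client; it assumes the standard Hive metastore column names (DB_LOCATION_URI in DBS and LOCATION in SDS):

# Count records that still use the old gpfs:// schema; both counts should be 0.
mysql -u <Hive_MySQL_Username> -p <Hive_mySQL_db> -e "
select count(*) as dbs_gpfs_rows from DBS where DB_LOCATION_URI like 'gpfs://%';
select count(*) as sds_gpfs_rows from SDS where LOCATION like 'gpfs://%';"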

If using PostgreSQL for Hive’s meta data:

For each table in to-be-updated table list, issue commands similar to the following to correct

the schema for all the records:


update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'gpfs://', 'hdfs://'));

If you want to update only one specific record (e.g. only update the record whose DB_ID is 1),

use a command similar to the following:

update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'gpfs://', 'hdfs://')) where "DB_ID"='1';

If the schema in the table DBS points at the wrong host, replace the full prefix with the correct one. For example, if the records use hdfs://localhost:8020 and the correct NameNode is c16f1n06.gpfs.net, use a command similar to the following:

update "DBS" set "DB_LOCATION_URI"=(REPLACE("DB_LOCATION_URI", 'hdfs://localhost:8020', 'hdfs://c16f1n06.gpfs.net:8020'));

Step17) Start all the services from Ambari and run service checks for all the services.

Step18) Follow Step 6 (Check a sample of the current data in HBase, Hive, and BigInsights Value-Add databases) in section 2.1 Preparation to sanity check the upgrade by comparing the data output after the upgrade with the previously saved outputs.
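For example, for the Hive portion of the check you can capture the listing again and compare it with the output saved before the upgrade. A minimal sketch, assuming the pre-upgrade output was saved to a hypothetical file hive_before.txt:

# Re-run the Hive listing used in the pre-upgrade sanity check and compare the outputs.
su - hive -c "/usr/iop/4.1.0.0/hive/bin/hive -e 'show databases; use default; show tables;'" > hive_after.txt
diff hive_before.txt hive_after.txt && echo "Hive sanity check passed: no differences"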

3. Revision History

Version Change Date Owner Change Logs

0.1 2016-8-5 Yong Yong initialized the draft

0.2 2016-8-16 Yong Merged some comments from Wen Qi

0.3 2016-8-17 Yong Merged comments from Linda

0.4 2016-8-17 Yong Merged comments from PC

0.5 2016-8-23 Yong Merged comments from Linda and PC

0.6 2016-8-23 Yong Merged comments from ID team member Lata

0.7 2016-8-24 Yong Merged comments from PC

0.8 2016-8-26 Yong Merged comments from customer

0.9 2016-8-30 Yong Updated the guide with PostgreSQL for Hive’s meta data

1.0 2016-8-31 Yong Merged comments from Linda

1.1 2016-9-1 Yong Merged comments from PC