

Informatica (Version 10.0)

Big Data Management

Installation and Configuration Guide


    Informatica Big Data Management Installation and Configuration Guide

Version 10.0

November 2015

    Copyright (c) 1993-2015 Informatica LLC. All rights reserved.

This software and documentation contain proprietary information of Informatica LLC and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. This Software may be protected by U.S. and/or international Patents and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.

Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging and Informatica Master Data Management are trademarks or registered trademarks of Informatica LLC in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights reserved. Copyright © Sun Microsystems. All rights reserved. Copyright © RSA Security Inc. All Rights Reserved. Copyright © Ordinal Technology Corp. All rights reserved. Copyright © Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright © Meta Integration Technology, Inc. All rights reserved. Copyright © Intalio. All rights reserved. Copyright © Oracle. All rights reserved. Copyright © Adobe Systems Incorporated. All rights reserved. Copyright © DataArt, Inc. All rights reserved. Copyright © ComponentSource. All rights reserved. Copyright © Microsoft Corporation. All rights reserved. Copyright © Rogue Wave Software, Inc. All rights reserved. Copyright © Teradata Corporation. All rights reserved. Copyright © Yahoo! Inc. All rights reserved. Copyright © Glyph & Cog, LLC. All rights reserved. Copyright © Thinkmap, Inc. All rights reserved. Copyright © Clearpace Software Limited. All rights reserved. Copyright © Information Builders, Inc. All rights reserved. Copyright © OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved. Copyright Cleo Communications, Inc. All rights reserved. Copyright © International Organization for Standardization 1986. All rights reserved. Copyright © ej-technologies GmbH. All rights reserved. Copyright © Jaspersoft Corporation. All rights reserved. Copyright © International Business Machines Corporation. All rights reserved. Copyright © yWorks GmbH. All rights reserved. Copyright © Lucent Technologies. All rights reserved. Copyright (c) University of Toronto. All rights reserved. Copyright © Daniel Veillard. All rights reserved. Copyright © Unicode, Inc. Copyright IBM Corp. All rights reserved. Copyright © MicroQuill Software Publishing, Inc. All rights reserved. Copyright © PassMark Software Pty Ltd. All rights reserved. Copyright © LogiXML, Inc. All rights reserved. Copyright © 2003-2010 Lorenzi Davide, All rights reserved. Copyright © Red Hat, Inc. All rights reserved. Copyright © The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Copyright © EMC Corporation. All rights reserved. Copyright © Flexera Software. All rights reserved. Copyright © Jinfonet Software. All rights reserved. Copyright © Apple Inc. All rights reserved. Copyright © Telerik Inc. All rights reserved. Copyright © BEA Systems. All rights reserved. Copyright © PDFlib GmbH. All rights reserved. Copyright © Orientation in Objects GmbH. All rights reserved. Copyright © Tanuki Software, Ltd. All rights reserved. Copyright © Ricebridge. All rights reserved. Copyright © Sencha, Inc. All rights reserved. Copyright © Scalable Systems, Inc. All rights reserved. Copyright © jQWidgets. All rights reserved. Copyright © Tableau Software, Inc. All rights reserved. Copyright © MaxMind, Inc. All Rights Reserved. Copyright © TMate Software s.r.o. All rights reserved. Copyright © MapR Technologies Inc. All rights reserved. Copyright © Amazon Corporate LLC. All rights reserved. Copyright © Highsoft. All rights reserved. Copyright © Python Software Foundation. All rights reserved. Copyright © BeOpen.com. All rights reserved. Copyright © CNRI. All rights reserved.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and/or other software which is licensed under various versions of the Apache License (the "License"). You may obtain a copy of these Licenses at http://www.apache.org/licenses/. Unless required by applicable law or agreed to in writing, software distributed under these Licenses is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the Licenses for the specific language governing permissions and limitations under the Licenses.

This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software copyright © 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under various versions of the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (©) 1993-2006, all rights reserved.

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.

This product includes Curl software which is Copyright 1996-2013, Daniel Stenberg. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

The product includes software copyright 2001-2005 (©) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.dom4j.org/license.html.

The product includes software copyright © 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://dojotoolkit.org/license.

This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.

This product includes software copyright © 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at http://www.gnu.org/software/kawa/Software-License.html.

This product includes OSSP UUID software which is Copyright © 2002 Ralf S. Engelschall, Copyright © 2002 The OSSP Project, Copyright © 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.

This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are subject to terms available at http://www.boost.org/LICENSE_1_0.txt.

This product includes software copyright © 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at http://www.pcre.org/license.txt.

This product includes software copyright © 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.eclipse.org/org/documents/epl-v10.php and at http://www.eclipse.org/org/documents/edl-v10.php.


This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://www.stlport.org/doc/license.html, http://asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://httpunit.sourceforge.net/doc/license.html, http://jung.sourceforge.net/license.txt, http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/license-agreements/fuse-message-broker-v-5-3-license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html; http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html; http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231; http://www.slf4j.org/license.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http://forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.tcl.tk/software/tcltk/license.html, http://www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html; http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License; http://www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html; http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/license.html; http://www.openmdx.org/#FAQ; http://www.php.net/license/3_01.txt; http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http://www.jmock.org/license.html; http://xsom.java.net; http://benalman.com/about/license/; https://github.com/CreateJS/EaselJS/blob/master/src/easeljs/display/Bitmap.js; http://www.h2database.com/html/license.html#summary; http://jsoncpp.sourceforge.net/LICENSE; http://jdbc.postgresql.org/license.html; http://protobuf.googlecode.com/svn/trunk/src/google/protobuf/descriptor.proto; https://github.com/rantav/hector/blob/master/LICENSE; http://web.mit.edu/Kerberos/krb5-current/doc/mitK5license.html; http://jibx.sourceforge.net/jibx-license.html; https://github.com/lyokato/libgeohash/blob/master/LICENSE; https://github.com/hjiang/jsonxx/blob/master/LICENSE; https://code.google.com/p/lz4/; https://github.com/jedisct1/libsodium/blob/master/LICENSE; http://one-jar.sourceforge.net/index.php?page=documents&file=license; https://github.com/EsotericSoftware/kryo/blob/master/license.txt; http://www.scala-lang.org/license.html; https://github.com/tinkerpop/blueprints/blob/master/LICENSE.txt; http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html; https://aws.amazon.com/asl/; https://github.com/twbs/bootstrap/blob/master/LICENSE; https://sourceforge.net/p/xmlunit/code/HEAD/tree/trunk/LICENSE.txt; https://github.com/documentcloud/underscore-contrib/blob/master/LICENSE, and https://github.com/apache/hbase/blob/master/LICENSE.txt.

This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution License (http://www.opensource.org/licenses/cddl1.php), the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the new BSD License (http://opensource.org/licenses/BSD-3-Clause), the MIT License (http://www.opensource.org/licenses/mit-license.php), the Artistic License (http://www.opensource.org/licenses/artistic-license-1.0) and the Initial Developer's Public License Version 1.0 (http://www.firebirdsql.org/en/initial-developer-s-public-license-version-1-0/).

This product includes software copyright © 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab. For further information please visit http://www.extreme.indiana.edu/.

This product includes software Copyright (c) 2013 Frank Balluffi and Markus Moeller. All rights reserved. Permissions and limitations regarding this software are subject to terms of the MIT license.

    See patents at https://www.informatica.com/legal/patents.html.

DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose. Informatica LLC does not warrant that this software or documentation is error free. The information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.

    NOTICES

This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software Corporation ("DataDirect") which are subject to the following terms and conditions:

1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.

    Part Number: IN-BDI-10000-0001


    Table of Contents

    Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica My Support Portal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Documentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Product Availability Matrixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Web Site. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica How-To Library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Knowledge Base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Support YouTube Channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Marketplace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Velocity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Global Customer Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Chapter 1: Installation and Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Installation and Configuration Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    Informatica Big Data Management Installation Process. . . . . . . . . . . . . . . . . . . . . . . . . . 10

    Before You Begin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    Install and Configure the Informatica Domain and Clients. . . . . . . . . . . . . . . . . . . . . . . . . 11

    Install and Configure PowerExchange Adapters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    Install and Configure Data Replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    Pre-Installation Tasks for a Single Node Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    Pre-Installation Tasks for a Cluster Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    Informatica Big Data Management Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    Installing in a Single Node Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol. . . . . . . 15

Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    Installing in a Cluster Environment from any Machine. . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    Installing Big Data Management Using Cloudera Manager. . . . . . . . . . . . . . . . . . . . . . . . 17

     After You Install. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    Configure Hadoop Pushdown Properties for the Data Integration Service. . . . . . . . . . . . . . . 18

    Reference Data Requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Hive Variables for Mappings in a Hadoop Environment. . . . . . . . . . . . . . . . . . . . . . . . 19

Update Hadoop Cluster Configuration Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . 20

    Library Path and Path Variables for Mappings in a Hadoop Environment. . . . . . . . . . . . . . . 21

    Configure the Blaze Engine Log Directories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    Hadoop Environment Properties File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    Informatica Developer Files and Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    Open the Required Ports for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    Enable Support for Lookup Transformations with Teradata Data Objects. . . . . . . . . . . . . . . 22


    Informatica Big Data Management Uninstallation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    Uninstalling Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    Chapter 2: Mappings on Hadoop Distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    Mappings on Hadoop Distributions Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    Big Data Management Configuration Utility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Use Cloudera Manager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Use SSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    Use a Shared Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    Mappings on Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    Configure Hadoop Cluster Properties on the Data Integration Service Machine. . . . . . . . . . . 29

    Create a Staging Directory on HDFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

    Configure Virtual Memory Limits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    Configure the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    Mappings on Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    Configure Hadoop Cluster Properties for the Data Integration Service. . . . . . . . . . . . . . . . . 35

    Enable Tez. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    Enable HBase Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    Mappings on IBM BigInsights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    User Account for the JDBC and Hive Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    Mappings on MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    Verify the Cluster Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    Configure hive-site.xml on the Data Integration Service Machine for MapReduce 1. . . . . . . . 43

    Configure hive-site.xml on Every Node in the Hadoop Cluster for MapReduce 1. . . . . . . . . . 44

Configure Hadoop Cluster Properties on the Data Integration Service Machine for MapReduce 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Configure yarn-site.xml on Every Node in the Cluster for MapReduce 2. . . . . . . . . . . . . . . 45

    Configure MapR Distribution Variables for Mappings in a Hadoop Environment. . . . . . . . . . . 47

    Configure the Heap Space for the MapR-FS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    Enable Hadoop Pushdown for HBase. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    Configure the Application Timeline Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    Mappings on Pivotal HD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    Configure Hadoop Cluster Properties for Pivotal HD in yarn-site.xml. . . . . . . . . . . . . . . . . . 49

    Configure Virtual Memory Limits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    Chapter 3: High Availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    Configure High Availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    Configuring Big Data Management for a Highly Available Cloudera CDH Cluster. . . . . . . . . . . . . 53

    Configuring Big Data Management for a Highly Available Hortonworks HDP Cluster. . . . . . . . . . . 54

    Configuring Big Data Management for a Highly Available IBM BigInsights Cluster. . . . . . . . . . . . 55


Configuring Big Data Management for a Highly Available MapR Cluster. . . . . . . . . . . . . . . . . . 56

Configuring Big Data Management for a Highly Available Pivotal Cluster. . . . . . . . . . . . . . . . . . 57

    Appendix A: Upgrade Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    Upgrading Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


    Preface

The Informatica Big Data Management Installation and Configuration Guide is written for the system

    administrator who is responsible for installing Informatica Big Data Management. This guide assumes you

    have knowledge of operating systems, relational database concepts, and the database engines, flat files, or

    mainframe systems in your environment. This guide also assumes you are familiar with the interface

    requirements for the Hadoop environment.

    Informatica Resources

    Informatica My Support Portal

As an Informatica customer, the first step in reaching out to Informatica is through the Informatica My Support

    Portal at https://mysupport.informatica.com. The My Support Portal is the largest online data integration

    collaboration platform with over 100,000 Informatica customers and partners worldwide.

     As a member, you can:

    •  Access all of your Informatica resources in one place.

    • Review your support cases.

    • Search the Knowledge Base, find product documentation, access how-to documents, and watch support

    videos.

    • Find your local Informatica User Group Network and collaborate with your peers.

    Informatica Documentation

    The Informatica Documentation team makes every effort to create accurate, usable documentation. If you

    have questions, comments, or ideas about this documentation, contact the Informatica Documentation team

through email at infa_documentation@informatica.com. We will use your feedback to improve our

    documentation. Let us know if we can contact you regarding your comments.

    The Documentation team updates documentation as needed. To get the latest documentation for your

    product, navigate to Product Documentation from https://mysupport.informatica.com.

    Informatica Product Availability Matrixes

    Product Availability Matrixes (PAMs) indicate the versions of operating systems, databases, and other types

    of data sources and targets that a product release supports. You can access the PAMs on the Informatica My

    Support Portal at https://mysupport.informatica.com.


    Informatica Web Site

    You can access the Informatica corporate web site at https://www.informatica.com. The site contains

    information about Informatica, its background, upcoming events, and sales offices. You will also find product

    and partner information. The services area of the site includes important information about technical support,

training and education, and implementation services.

    Informatica How-To Library

     As an Informatica customer, you can access the Informatica How-To Library at

    https://mysupport.informatica.com. The How-To Library is a collection of resources to help you learn more

    about Informatica products and features. It includes articles and interactive demonstrations that provide

    solutions to common problems, compare features and behaviors, and guide you through performing specific

    real-world tasks.

    Informatica Knowledge Base

     As an Informatica customer, you can access the Informatica Knowledge Base at

    https://mysupport.informatica.com. Use the Knowledge Base to search for documented solutions to known

    technical issues about Informatica products. You can also find answers to frequently asked questions,

    technical white papers, and technical tips. If you have questions, comments, or ideas about the Knowledge

Base, contact the Informatica Knowledge Base team through email at KB_feedback@informatica.com.

    Informatica Support YouTube Channel

    You can access the Informatica Support YouTube channel at http://www.youtube.com/user/INFASupport. The

    Informatica Support YouTube channel includes videos about solutions that guide you through performing

    specific tasks. If you have questions, comments, or ideas about the Informatica Support YouTube channel,

contact the Support YouTube team through email at supportvideos@informatica.com or send a tweet to

    @INFASupport.

    Informatica Marketplace

    The Informatica Marketplace is a forum where developers and partners can share solutions that augment,

    extend, or enhance data integration implementations. By leveraging any of the hundreds of solutions

    available on the Marketplace, you can improve your productivity and speed up time to implementation on

    your projects. You can access Informatica Marketplace at http://www.informaticamarketplace.com.

    Informatica Velocity

    You can access Informatica Velocity at https://mysupport.informatica.com. Developed from the real-world

    experience of hundreds of data management projects, Informatica Velocity represents the collective

knowledge of our consultants who have worked with organizations from around the world to plan, develop, deploy, and maintain successful data management solutions. If you have questions, comments, or ideas

about Informatica Velocity, contact Informatica Professional Services at ips@informatica.com.

    Informatica Global Customer Support

    You can contact a Customer Support Center by telephone or through the Online Support.

    Online Support requires a user name and password. You can request a user name and password at

    http://mysupport.informatica.com.


    The telephone numbers for Informatica Global Customer Support are available from the Informatica web site

    at http://www.informatica.com/us/services-and-training/support-services/global-support-centers/.


Chapter 1: Installation and Configuration

    This chapter includes the following topics:

    • Installation and Configuration Overview, 10

    • Before You Begin, 11

    • Informatica Big Data Management Installation, 14

• After You Install, 17

• Informatica Big Data Management Uninstallation, 23

    Installation and Configuration Overview

    The Informatica Big Data Management installation is distributed to the Hadoop cluster as a Red Hat Package

    Manager (RPM) installation package.

    The RPM package includes the Informatica 10.0 engine, the Blaze engine, and adapter components. The

    RPM package and the binary files that you need to run the Big Data Management installation are compressed

    into a tar.gz file.

     After you complete the installation, you must configure the Informatica domain and the Hadoop cluster to

    enable Informatica mappings to run on a Hadoop cluster.

    Informatica Big Data Management Installation Process

    You can install Big Data Management in a single node or cluster environment.

    Installing in a Single Node Environment

    You can install Big Data Management in a single node environment.

    1. Extract the Big Data Management tar.gz file to the machine.

    2. Install Big Data Management by running the installation shell script in a Linux environment.

    Installing in a Cluster Environment

    You can install Big Data Management in a cluster environment.

    1. Extract the Big Data Management tar.gz file to a machine.


    2. Distribute the RPM package to all of the nodes within the Hadoop cluster. You can distribute the RPM

    package using any of the following protocols: File Transfer Protocol (FTP), Hypertext Transfer Protocol

    (HTTP), Network File System (NFS), or Secure Copy Protocol (SCP).

    3. Install Big Data Management by running the installation shell script in a Linux environment. You can

    install Big Data Management from the primary NameNode or from any machine using the

    HadoopDataNodes file.

    • Install from the primary NameNode. You can install Big Data Management using FTP, HTTP, NFS or

    SCP protocol. During the installation, the installer shell script picks up all of the DataNodes from the

    following file: $HADOOP_HOME/conf/slaves. Then, it copies the Big Data Management binary files to

the following directory on each of the DataNodes: /<Big Data Management installation directory>/Informatica. You can perform this step only if you are deploying Hadoop from the primary

    NameNode.

    • Install from any machine. Add the IP addresses or machine host names, one for each line, for each of

    the nodes in the Hadoop cluster in the HadoopDataNodes file. During the Big Data Management

    installation, the installation shell script picks up all of the nodes from the HadoopDataNodes file and

copies the Big Data Management binary files to the /<Big Data Management installation directory>/Informatica directory on each of the nodes.
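For example, a HadoopDataNodes file for a three-node cluster might contain the following lines; the host names are hypothetical:

    hadoop-node01.example.com
    hadoop-node02.example.com
    hadoop-node03.example.com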

    Before You Begin

    Before you begin the installation, install the Informatica components and PowerExchange adapters, and

    perform the pre-installation tasks.

    Install and Configure the Informatica Domain and Clients

    Before you install Big Data Management, install and configure the Informatica domain and clients.

    You must install the Informatica services and clients. Run the Informatica services installation to configure

    the Informatica domain and create the Informatica services. Run the Informatica client installation to install

    the Informatica client tools.

    Install and Configure PowerExchange Adapters

    Based on your business needs, install and configure Informatica adapters. Use Big Data Management with

    Informatica adapters for access to sources and targets.

    To run Informatica mappings in a Hadoop environment you must install and configure Informatica adapters.

    You can use the following Informatica adapters as part of Big Data Management:

    • PowerExchange for DataSift

    • PowerExchange for Facebook

    • PowerExchange for HBase

    • PowerExchange for HDFS

    • PowerExchange for Hive

    • PowerExchange for LinkedIn

    • PowerExchange for Teradata Parallel Transporter API


    • PowerExchange for Twitter 

    • PowerExchange for Web Content-Kapow Katalyst

    For more information, see the PowerExchange adapter documentation.

Install and Configure Data Replication

To migrate data with minimal downtime and perform auditing and operational reporting functions, install and configure Data Replication. For more information, see the Informatica Data Replication User Guide.

    Pre-Installation Tasks for a Single Node Environment

Before you begin the Big Data Management installation in a single node environment, perform the pre-installation tasks.

• Verify that Hadoop is installed with Hadoop File System (HDFS) and MapReduce. The Hadoop installation should include a Hive data warehouse that is configured to use a non-embedded database as the MetaStore. For more information, see the Apache website here: http://hadoop.apache.org.

• To perform both read and write operations in native mode, install the required third-party client software. For example, install the Oracle client to connect to the Oracle database.

• Verify that the Big Data Management administrator user can run sudo commands or has root user privileges.

    • Verify that the temporary folder on the local node has at least 700 MB of disk space.

• Download the following file to the temporary folder: InformaticaHadoop-<version>.tar.gz

• Extract the following file to the local node where you want to run the Big Data Management installation: InformaticaHadoop-<version>.tar.gz
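For example, the following commands extract a downloaded archive in the temporary folder; the version string is hypothetical:

    cd /tmp
    tar -xzf InformaticaHadoop-10.0.0.tar.gz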

Pre-Installation Tasks for a Cluster Environment

Before you begin the Big Data Management installation in a cluster environment, perform the following tasks:

    • Install third-party software.

    • Verify the distribution method.

    • Verify system requirements.

    • Verify connection requirements.

    • Download the RPM.

    Install Third-Party Software

    Verify that the following third-party software is installed:

    Hadoop with Hadoop Distributed File System (HDFS) and MapReduce

    Hadoop must be installed on every node within the cluster. The Hadoop installation must include a Hive

    data warehouse that is configured to use a MySQL database as the MetaStore. You can configure Hive

    to use a local or remote MetaStore server. For more information, see the Apache website here:

    http://hadoop.apache.org/.

    Note: Informatica does not support embedded MetaStore server setups.


    Database client software to perform read and write operations in native mode

    Install the client software for the database. Informatica requires the client software to run MapReduce

jobs. For example, install the Oracle client to connect to the Oracle database. Install the database client

    software on all the nodes within the Hadoop cluster.

    Verify the Distribution Method

    You can distribute Big Data Management to the Hadoop cluster with one of the following protocols:

    • File Transfer Protocol (FTP)

    • Hypertext Transfer Protocol (HTTP)

    • Network File System (NFS) protocol

    • Secure Copy (SCP) protocol

    • Cloudera Manager.

    To verify that you can distribute Big Data Management to the Hadoop cluster with one of the protocols,

    perform the following tasks:

Note: If you use Cloudera Manager to distribute Big Data Management to the Hadoop cluster, skip these tasks.

    1. Ensure that the server or service for your distribution method is running.

    2. In the config file on the machine where you want to run the Big Data Management installation, set the

    DISTRIBUTOR_NODE parameter to the following setting:

• FTP: Set DISTRIBUTOR_NODE=ftp://<FTP server>/pub

• HTTP: Set DISTRIBUTOR_NODE=http://<web server>

• NFS: Set DISTRIBUTOR_NODE=<shared file location>

    The file location must be accessible to all nodes in the cluster.
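For example, if you distribute the package over HTTP from a hypothetical web server named repo01.example.com, the config file entry might look like the following line:

    DISTRIBUTOR_NODE=http://repo01.example.com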

    Verify System Requirements

    Verify the following system requirements:

    • The Big Data Management administrator can run sudo commands or has root user privileges.

    • The temporary folder in each of the nodes on which Big Data Management will be installed has at least

    700 MB of disk space.
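For example, you can check the available space in the temporary folder with a standard Linux command:

    df -h /tmp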

    Verify Connection Requirements

    Verify the connection to the Hadoop cluster nodes.

    Big Data Management requires a Secure Shell (SSH) connection without a password between the machine

    where you want to run the Big Data Management installation and all the nodes in the Hadoop cluster.
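A minimal sketch of setting up a passwordless SSH connection with OpenSSH; the user and host names are hypothetical:

    # Generate a key pair without a passphrase on the installation machine.
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    # Copy the public key to every node in the Hadoop cluster.
    ssh-copy-id hadoop@hadoop-node01.example.com
    # Verify that the connection no longer prompts for a password.
    ssh hadoop@hadoop-node01.example.com hostname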

    Download the RPM

    Download the following file to a temporary folder:

InformaticaHadoop-<version>.tar.gz

    Extract the file to the machine from where you want to distribute the RPM package and run the Big Data

    Management installation.


    Copy the following package to a shared directory based on the transfer protocol you are using:

InformaticaHadoop-<version>.rpm

    For example,

    • HTTP: /var/www/html

    • FTP: /var/ftp/pub

• NFS: <shared file location>

    The file location must be accessible by all the nodes in the cluster.

    Note: The RPM package must be stored on a local disk and not on HDFS.
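For example, if you distribute the RPM package over HTTP, you might copy it to the web server document root; the version string is hypothetical:

    cp InformaticaHadoop-10.0.0.rpm /var/www/html/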

    Informatica Big Data Management Installation

    You can install Big Data Management in a single node environment. You can also install Big Data

    Management in a cluster environment from the primary NameNode or from any machine.

    Install Big Data Management in a single node environment or cluster environment:

    • Install Big Data Management in a single node environment.

    • Install Big Data Management in a cluster environment from the primary NameNode using SCP protocol.

    • Install Big Data Management in a cluster environment from the primary NameNode using FTP, HTTP, or

    NFS protocol.

    • Install Big Data Management in a cluster environment from any machine.

    Install Big Data Management from a shell command line.

    Installing in a Single Node Environment

    You can install Big Data Management in a single node environment.

    1. Log in to the machine.

    2. Run the following command from the Big Data Management root directory to start the installation in

    console mode:

    bash InformaticaHadoopInstall.sh

    3. Press y to accept the Big Data Management terms of agreement.

    4. Press Enter .

    5. Press 1 to install Big Data Management in a single node environment.

    6. Press Enter .

    7. Type the absolute path for the Big Data Management installation directory and press Enter .

    Start the path with a slash. The directory names in the path must not contain spaces or the following

    special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of

    the nodes during the installation. Default is /opt.

    8. Press Enter .

The installer creates the /<Big Data Management installation directory>/Informatica directory and

    populates all of the file systems with the contents of the RPM package.


To get more information about the tasks performed by the installer, you can view the informatica-hadoop-install.<timestamp>.log installation log file.

    Installing in a Cluster Environment from the Primary NameNode

Using SCP Protocol

You can install Big Data Management in a cluster environment from the primary NameNode using SCP.

    1. Log in to the primary NameNode.

    2. Run the following command to start the Big Data Management installation in console mode:

    bash InformaticaHadoopInstall.sh

    3. Press y to accept the Big Data Management terms of agreement.

    4. Press Enter .

    5. Press 2 to install Big Data Management in a cluster environment.

    6. Press Enter .

7. Type the absolute path for the Big Data Management installation directory. Start the path with a slash. The directory names in the path must not contain spaces or the following

    special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of

    the nodes during the installation. Default is /opt.

    8. Press Enter .

    9. Press 1 to install Big Data Management from the primary NameNode.

    10. Press Enter .

    11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

    12. Press Enter .

13. Type y.

14. Press Enter .

    The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the

    DataNodes, the installer creates the Informatica directory and populates all of the file systems with the

contents of the RPM package. The Informatica directory is located here: /<Big Data Management installation directory>/Informatica

You can view the informatica-hadoop-install.<timestamp>.log installation log file to get more

    information about the tasks performed by the installer.

Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol

    You can install Big Data Management in a cluster environment from the primary NameNode using FTP,

    HTTP, or NFS protocol.

    1. Log in to the primary NameNode.

    2. Run the following command to start the Big Data Management installation in console mode:

    bash InformaticaHadoopInstall.sh

    3. Press y to accept the Big Data Management terms of agreement.

    4. Press Enter .


    5. Press 2 to install Big Data Management in a cluster environment.

    6. Press Enter .

    7. Type the absolute path for the Big Data Management installation directory.

    Start the path with a slash. The directory names in the path must not contain spaces or the following

    special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of

    the nodes during the installation. Default is /opt.

    8. Press Enter .

    9. Press 1 to install Big Data Management from the primary NameNode.

    10. Press Enter .

    11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

    12. Press Enter .

    13. Type n.

    14. Press Enter .

    15. Type y.

    16. Press Enter .

    The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the

DataNodes, the installer creates the /<Big Data Management installation directory>/Informatica

    directory and populates all of the file systems with the contents of the RPM package.

You can view the informatica-hadoop-install.<timestamp>.log installation log file to get more

    information about the tasks performed by the installer.

    Installing in a Cluster Environment from any Machine

    You can install Big Data Management in a cluster environment from any machine.

    1. Verify that the Big Data Management administrator has user root privileges on the node that will be

    running the Big Data Management installation.

    2. Log in to the machine as the root user.

    3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop

    cluster on which you want to install Big Data Management. The HadoopDataNodes file is located on the

node from where you want to launch the Big Data Management installation. Add one IP address or machine host name per line in the file.

    4. Run the following command to start the Big Data Management installation in console mode:

    bash InformaticaHadoopInstall.sh

    5. Press y to accept the Big Data Management terms of agreement.

    6. Press Enter .

    7. Press 2 to install Big Data Management in a cluster environment.

    8. Press Enter .

    9. Type the absolute path for the Big Data Management installation directory and press Enter . Start the

    path with a slash. Default is /opt.

    10. Press Enter .

    11. Press 2 to install Big Data Management using the HadoopDataNodes file.


    12. Press Enter .

The installer creates the /<Big Data Management installation directory>/Informatica directory and

    populates all of the file systems with the contents of the RPM package on the first node that appears in

    the HadoopDataNodes file. The installer repeats the process for each node in the HadoopDataNodes file.

    Installing Big Data Management Using Cloudera Manager 

    You can install Big Data Management on a Cloudera CDH cluster using Cloudera Manager.

    Perform the following steps:

1. Download the following file: INFORMATICA-<version>-informatica-<version>.parcel.tar

    2. Extract manifest.json and the parcels from the .tar file.

    3. Verify the location of your Local Parcel Repository.

In Cloudera Manager, click Administration > Settings > Parcels.

    4. Create a SHA file with the parcel name and hash listed in manifest.json that corresponds with your

    Hadoop cluster.

    For example, use the following parcel name for Hadoop cluster nodes that run Red Hat Enterprise Linux

    6.4 64-bit:

INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel

    Use the following hash listed for Red Hat Enterprise Linux 6.4 64-bit:

    8e904e949a11c4c16eb737f02ce4e36ffc03854f 

    To create a SHA file, run the following command:

echo <hash> > <parcel name>.sha

    For example, run the following command:

echo "8e904e949a11c4c16eb737f02ce4e36ffc03854f" > INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel.sha

5. Transfer the parcel and SHA file to the Local Parcel Repository with FTP. A sketch of steps 4 and 5 appears after these steps.

    6. Check for new parcels with Cloudera Manager.

    To check for new parcels, click Hosts > Parcels.

    7. Distribute the Big Data Management parcels.

    8. Activate the Big Data Management parcels.
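A minimal sketch of steps 4 and 5, assuming the Red Hat Enterprise Linux 6.4 parcel named above and the Local Parcel Repository at its common default location; verify the actual path under Administration > Settings > Parcels:

    # Create the SHA file from the hash listed in manifest.json.
    echo "8e904e949a11c4c16eb737f02ce4e36ffc03854f" > INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel.sha
    # Copy the parcel and the SHA file to the Local Parcel Repository (path assumed).
    cp INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel* /opt/cloudera/parcel-repo/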

     After You Install

     After you install Big Data Management, perform the post-installation tasks to ensure that Big Data

    Management runs properly.

    Perform the following tasks:

    • Configure the Hadoop pushdown properties for the Data Integration Service.

    • Optionally, install the Address Validation reference data.

    • Configure Hive variables for mappings in a Hadoop environment.

    • Update Hadoop cluster configuration parameters for mappings in a Hadoop environment.

    • Configure library path and path variables for mappings in a Hadoop environment.


    • Start the Application Timeline Server for the Blaze engine.

    • Configure the Blaze log directories.

    • Configure environment variables in the Big Data Management properties file.

    • Open the required ports for the Blaze engine.

    • Enable support for Lookup transformations with Teradata data objects.

    Note: The Blaze engine only supports the following Hadoop distributions: Cloudera CDH, Hortonworks HDP,

    and MapR. Skip the tasks for the Blaze engine if the Blaze engine does not support the distribution that the

    Hadoop cluster runs.

Configure Hadoop Pushdown Properties for the Data Integration Service

    Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hadoop

    environment.

    You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.

The following table describes the Hadoop pushdown properties for the Data Integration Service:

Informatica Home Directory on Hadoop

The Big Data Management home directory on every data node created by the Hadoop RPM install. Type /<Big Data Management installation directory>/Informatica.

Hadoop Distribution Directory

The directory containing a collection of Hive and Hadoop JARs on the cluster from the RPM install locations. The directory contains the minimum set of JARs required to process Informatica mappings in a Hadoop environment. Type /<Big Data Management installation directory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name].

Data Integration Service Hadoop Distribution Directory

The Hadoop distribution directory on the Data Integration Service node. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.

    Hadoop Distribution Directory

    You can modify the Hadoop distribution directory on the data nodes.

    When you modify the Hadoop distribution directory, you must copy the minimum set of Hive and Hadoop

    JARS, and the Snappy libraries required to process Informatica mappings in a Hadoop environment from

    your Hadoop install location. The actual Hive and Hadoop JARS can vary depending on the Hadoop

    distribution and version.

    The Hadoop RPM installs the Hadoop distribution directories in the following path:

/<Big Data Management installation directory>/Informatica/services/shared/hadoop.
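A minimal sketch of copying the libraries into a modified Hadoop distribution directory; the source paths are examples only and vary by Hadoop distribution and version:

    # Copy the minimum set of Hive and Hadoop JARs from the Hadoop install location.
    cp /usr/lib/hadoop/lib/*.jar /<modified Hadoop distribution directory>/lib/
    cp /usr/lib/hive/lib/*.jar /<modified Hadoop distribution directory>/lib/
    # Copy the Snappy native libraries.
    cp /usr/lib/hadoop/lib/native/libsnappy* /<modified Hadoop distribution directory>/lib/native/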


    Reference Data Requirements

    If you have a Data Quality product license, you can push a mapping that contains data quality

    transformations to a Hadoop cluster. Data quality transformations can use reference data to verify that data

    values are accurate and correctly formatted.

When you apply a pushdown operation to a mapping that contains data quality transformations, the operation can copy the reference data that the mapping uses. The pushdown operation copies reference table data,

    content set data, and identity population data to the Hadoop cluster. After the mapping runs, the cluster

    deletes the reference data that the pushdown operation copied with the mapping.

    Note: The pushdown operation does not copy address validation reference data. If you push a mapping that

    performs address validation, you must install the address validation reference data files on each DataNode

    that runs the mapping. The cluster does not delete the address validation reference data files after the

    address validation mapping runs.

Address validation mappings validate and enhance the accuracy of postal address records. You can buy

    address reference data files from Informatica on a subscription basis. You can download the current address

    reference data files from Informatica at any time during the subscription period.

Installing the Address Reference Data Files

To install the address reference data files on each DataNode in the cluster, create an automation script. A sketch of such a script follows these steps.

1. Browse to the address reference data files that you downloaded from Informatica. You download the files in a compressed format.
2. Extract the data files.
3. Copy the files to the NameNode machine or to another machine that can write to the DataNodes.
4. Create an automation script to copy the files to each DataNode.
   • If you copied the files to the NameNode, use the slaves file for the Hadoop cluster to identify the DataNodes. If you copied the files to another machine, use the Hadoop_Nodes.txt file to identify the DataNodes. Find the Hadoop_Nodes.txt file in the Big Data Management installation package.
   • The default directory for the address reference data files in the Hadoop environment is /reference_data. If you install the files to a non-default directory, create the following custom property on the Data Integration Service to identify the directory: AV_HADOOP_DATA_LOCATION. Create the custom property on the Data Integration Service that performs the pushdown operation in the native environment.
5. Run the automation script. The script copies the address reference data files to the DataNodes.
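The following is a minimal sketch of such a script. It assumes that the extracted files are staged in /tmp/address_reference_data on the NameNode, that the slaves file is at /etc/hadoop/conf/slaves (both paths are assumptions), and that passwordless SSH is set up to each DataNode:

    #!/bin/bash
    # Copy extracted address reference data files to every DataNode listed in the slaves file.
    SLAVES_FILE=/etc/hadoop/conf/slaves
    SOURCE_DIR=/tmp/address_reference_data
    TARGET_DIR=/reference_data
    while read -r node; do
        ssh "$node" "mkdir -p $TARGET_DIR"
        scp "$SOURCE_DIR"/* "$node:$TARGET_DIR/"
    done < "$SLAVES_FILE"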

Hive Variables for Mappings in a Hadoop Environment

To run mappings in a Hadoop environment, configure Hive environment variables. You can configure Hive environment variables in the following file: /<Big Data Management installation directory>/Informatica/services/shared/hadoop/<Hadoop distribution name>/conf/hive-site.xml.


Configure the following Hive environment variables (a sample hive-site.xml fragment follows the list):

• hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict. Configure these variables if you want to use Hive dynamic partitioned tables.
• hive.optimize.ppd=false. Disable predicate pushdown optimization to get accurate results for mappings with Hive version 0.9.0. You cannot use predicate pushdown optimization for a Hive query that uses multiple insert statements. The default Hadoop RPM installation sets hive.optimize.ppd to false.
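The fragment below is a sketch of how these variables appear in hive-site.xml, using the standard Hadoop configuration property format:

    <property>
      <name>hive.exec.dynamic.partition</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.exec.dynamic.partition.mode</name>
      <value>nonstrict</value>
    </property>
    <property>
      <name>hive.optimize.ppd</name>
      <value>false</value>
    </property>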

Update Hadoop Cluster Configuration Parameters

Hadoop cluster configuration parameters that set the Java library path in the mapred-site.xml file can override the paths set in hadoopEnv.properties. Update the mapred-site.xml cluster configuration file on all the cluster nodes to remove Java options that set the Java library path.

The following cluster configuration parameters in mapred-site.xml can override the Java library path set in hadoopEnv.properties:

• mapreduce.admin.map.child.java.opts
• mapreduce.admin.reduce.child.java.opts

If the Data Integration Service cannot access the native libraries set in hadoopEnv.properties, mappings can fail to run in a Hadoop environment.

After you install, perform the following steps:

• Update the cluster configuration file mapred-site.xml to remove the Java option -Djava.library.path from the property configuration.
• Edit hadoopEnv.properties to include the user Hadoop libraries in the Java library path.

Example to Update mapred-site.xml on Cluster Nodes

If mapred-site.xml sets the following configuration for the mapreduce.admin.map.child.java.opts parameter:

    <property>
      <name>mapreduce.admin.map.child.java.opts</name>
      <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/:/mylib/ -Djava.net.preferIPv4Stack=true</value>
      <final>true</final>
    </property>

The path to Hadoop libraries in mapreduce.admin.map.child.java.opts overrides the following path set in hadoopEnv.properties:

    infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64 -Djava.security.egd=file:/dev/./urandom

To run mappings in a Hadoop environment, complete the following steps:

• Remove the -Djava.library.path Java option from the mapreduce.admin.map.child.java.opts parameter.
• Change hadoopEnv.properties to include the Hadoop libraries in the paths /usr/lib/hadoop/lib/native and /mylib/ with the following syntax:

    infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native/:/mylib/ -Djava.security.egd=file:/dev/./urandom


Library Path and Path Variables for Mappings in a Hadoop Environment

To run mappings in a Hadoop environment, configure the library path and path environment variables in the hadoopEnv.properties file.

Configure the following library path and path environment variables:

• When you run mappings in a Hadoop environment, configure the ODBC library path before the Teradata library path. For example:

    infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/ODBC7.0/lib/:/opt/teradata/client/13.10/tbuild/lib64:/opt/teradata/client/13.10/odbc_64/lib:/databases/oracle11.2.0_64BIT/lib:/databases/db2v9.5_64BIT/lib64/:$HADOOP_NODE_INFA_HOME/DataTransformation/bin:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH

• When you use the MapR distribution on the Linux operating system, change the environment variable LD_LIBRARY_PATH to include the following path: <Informatica installation directory>/services/shared/hadoop/mapr_<version>/lib/native/Linux-amd64-64.
• When you use the MapR distribution on the Linux operating system, change the environment variable MAPR_HOME to include the following path: <Informatica installation directory>/services/shared/hadoop/mapr_<version>. A sketch of these entries follows this list.
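For reference, the corresponding hadoopEnv.properties entries might look like the following sketch. The infapdo.env.entry.mapr_home property name is an assumption based on the infapdo.env.entry convention shown above, and mapr_<version> stands for the installed distribution directory:

    infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=$HADOOP_NODE_INFA_HOME/services/shared/hadoop/mapr_<version>/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH
    infapdo.env.entry.mapr_home=MAPR_HOME=$HADOOP_NODE_INFA_HOME/services/shared/hadoop/mapr_<version>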

Configure the Blaze Engine Log Directories

The hadoopEnv.properties file lists the log directories that the Blaze engine uses on the node and on HDFS. You must grant the user account that starts the Blaze engine write permission on the log directories.

Grant the user account that starts the Blaze engine write permission for the directories specified in the following properties (an example follows the list):

• infagrid.node.local.root.log.dir
• infacal.hadoop.logs.directory
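For example, if infagrid.node.local.root.log.dir points to /tmp/infa/blaze/logs on the node and infacal.hadoop.logs.directory points to /blaze/logs on HDFS (both paths and the blazeuser account are hypothetical), the grants might look like this:

    # Local log directory on each node
    chown -R blazeuser /tmp/infa/blaze/logs
    chmod -R u+w /tmp/infa/blaze/logs
    # Log directory on HDFS
    hadoop fs -chown -R blazeuser /blaze/logs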

For more information about user accounts for the Blaze engine, see the Informatica Big Data Management Security Guide.

Hadoop Environment Properties File

To add environment variables or to extend existing ones, use the Hadoop environment properties file, hadoopEnv.properties. You can optionally add third-party environment variables or extend the existing PATH environment variable in hadoopEnv.properties.

1. Go to the following location: <Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>/infaConf
2. Find the file named hadoopEnv.properties.
3. Back up the file before you modify it.
4. Use a text editor to open the file and modify the properties.
5. Save the properties file with the name hadoopEnv.properties.
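For example, to extend the PATH variable with a third-party binary directory, you might add an entry like the following sketch. The /opt/thirdparty/bin directory is a hypothetical example, and the property name follows the infapdo.env.entry convention shown earlier:

    infapdo.env.entry.path=PATH=/opt/thirdparty/bin:$PATH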


Informatica Developer Files and Variables

Edit developerCore.ini to enable the Developer tool to communicate with the Hadoop cluster on a particular Hadoop distribution. After you edit the file, you must run run.bat to launch the Developer tool client again. If you use the MapR distribution, you must also set the MAPR_HOME environment variable to run MapR mappings in a Hadoop environment.

developerCore.ini is located in the following directory: <Informatica installation directory>\clients\DeveloperClient

Add the following property to developerCore.ini:

• -DINFA_HADOOP_DIST_DIR=hadoop\<Hadoop distribution name>

For example, the distribution name for a Hadoop cluster that runs MapR version 4.0.2 is mapr_4.0.2.

For a Hadoop cluster that runs MapR, you must perform the following additional tasks:

• Add the following properties to developerCore.ini:
  - -Djava.library.path=hadoop\mapr_<version>\lib\native\Win64;bin;..\DT\bin
  - -Dmapr.library.flatclass
• Edit run.bat to set the MAPR_HOME environment variable and the -clean setting. For example, include the following lines:

    MAPR_HOME=<Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_<version>
    developerCore.exe -clean

• Copy mapr-cluster.conf to the following directory on the machine where the Developer tool runs: <Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_<version>\conf. You can find mapr-cluster.conf in the following directory on any node in the Hadoop cluster: <MapR installation directory>/conf

Open the Required Ports for the Blaze Engine

You must open a range of ports for the Blaze engine to use to communicate with the Informatica domain.

Note: Skip this task if the Blaze engine does not support the distribution that the Hadoop cluster runs.

If the Hadoop cluster is behind a firewall, work with your network administrator to open the range of ports that the Blaze engine uses. When you create the Hadoop connection, specify the port range that the Blaze engine can use with the minimum port and maximum port fields.
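For example, on cluster nodes that run firewalld, opening a hypothetical range of 12300 through 12600 (use the same range that you enter in the minimum port and maximum port fields) might look like this:

    sudo firewall-cmd --permanent --add-port=12300-12600/tcp
    sudo firewall-cmd --reload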

Enable Support for Lookup Transformations with Teradata Data Objects

To use Lookup transformations with a Teradata data object in Hadoop pushdown mode, you must copy the Teradata JDBC drivers to the Informatica installation directory.

You can download the Teradata JDBC drivers from Teradata. For more information about the drivers, see the following Teradata website: http://downloads.teradata.com/download/connectivity/jdbc-driver.

The software available for download at the referenced links belongs to a third party or third parties, not Informatica LLC. The download links are subject to the possibility of errors, omissions or change. Informatica assumes no responsibility for such links and/or such software, disclaims all warranties, either express or implied, including but not limited to, implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and disclaims all liability relating thereto.

Copy the tdgssconfig.jar and terajdbc4.jar files from the Teradata JDBC drivers to the following directory on the machine where the Data Integration Service runs and on every node in the Hadoop cluster: <Informatica installation directory>/externaljdbcjars

Additionally, you must copy the tdgssconfig.jar and terajdbc4.jar files to the following directory on the machine where the Developer tool runs: <Informatica installation directory>\clients\externaljdbcjars.
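A sketch of the copy, assuming the drivers were downloaded to /tmp/teradata and the Informatica installation directory is /opt/Informatica (both paths are assumptions):

    # On the Data Integration Service machine
    cp /tmp/teradata/tdgssconfig.jar /tmp/teradata/terajdbc4.jar /opt/Informatica/externaljdbcjars/
    # Repeat for every node in the Hadoop cluster, for example with scp
    scp /tmp/teradata/tdgssconfig.jar /tmp/teradata/terajdbc4.jar node1:/opt/Informatica/externaljdbcjars/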

Informatica Big Data Management Uninstallation

The Big Data Management uninstallation deletes the Big Data Management binary files from all of the DataNodes within the Hadoop cluster. Uninstall Big Data Management with a shell command.

Uninstalling Big Data Management

To uninstall Big Data Management in a single node or cluster environment:

1. Verify that the Big Data Management administrator can run sudo commands.
2. If you are uninstalling Big Data Management in a cluster environment, set up a password-less Secure Shell (SSH) connection between the machine where you want to run the Big Data Management installation and all of the nodes from which Big Data Management will be uninstalled.
3. If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file, verify that the HadoopDataNodes file contains the IP addresses or machine host names of each of the nodes in the Hadoop cluster from which you want to uninstall Big Data Management. The HadoopDataNodes file is located on the node from which you want to launch the Big Data Management installation. Add one IP address or machine host name of a node in the Hadoop cluster on each line in the file. An example file appears after these steps.
4. Log in to the machine. The machine you log in to depends on the Big Data Management environment and uninstallation method:
   • If you are uninstalling Big Data Management in a single node environment, log in to the machine on which Big Data Management is installed.
   • If you are uninstalling Big Data Management in a cluster environment using the HADOOP_HOME environment variable, log in to the primary NameNode.
   • If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file, log in to any node.
5. Run the following command to start the Big Data Management uninstallation in console mode:

    bash InformaticaHadoopInstall.sh

6. Press y to accept the Big Data Management terms of agreement.
7. Press Enter.
8. Select 3 to uninstall Big Data Management.
9. Press Enter.
10. Select the uninstallation option, depending on the Big Data Management environment:
    • Select 1 to uninstall Big Data Management in a single node environment.
    • Select 2 to uninstall Big Data Management in a cluster environment.
11. Press Enter.
12. If you are uninstalling Big Data Management in a cluster environment, select the uninstallation option, depending on the uninstallation method:
    • Select 1 to uninstall Big Data Management from the primary NameNode.
    • Select 2 to uninstall Big Data Management using the HadoopDataNodes file.
13. Press Enter.
14. If you are uninstalling Big Data Management in a cluster environment from the primary NameNode, type the absolute path for the Hadoop installation directory. Start the path with a slash.

The uninstaller deletes all of the Big Data Management binary files from the /<Big Data Management installation directory>/Informatica directory. In a cluster environment, the uninstaller deletes the binary files from all of the nodes within the Hadoop cluster.
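For reference, a HadoopDataNodes file for a three-node cluster might contain nothing more than one entry per line; the host names below are hypothetical:

    hadoop-node1.example.com
    hadoop-node2.example.com
    192.168.1.103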


CHAPTER 2

Mappings on Hadoop Distributions

This chapter includes the following topics:

• Mappings on Hadoop Distributions Overview
• Big Data Management Configuration Utility
• Mappings on Cloudera CDH
• Mappings on Hortonworks HDP
• Mappings on IBM BigInsights
• Mappings on MapR
• Mappings on Pivotal HD

Mappings on Hadoop Distributions Overview

After you install Big Data Management, you must enable mappings to run on a Hadoop cluster on a Hadoop distribution.

After you enable Informatica mappings to run on a Hadoop cluster, you must configure the Big Data Management client files to communicate with a Hadoop cluster on a particular Hadoop distribution. You can use the Big Data Management Configuration Utility to automatically configure some of the properties. After you run the utility, you must complete the configuration for your Hadoop distribution. Alternatively, you can manually configure Big Data Management without the utility.

The following table describes the Hadoop distributions, MapReduce versions, and schedulers that you can use with Big Data Management:

Hadoop Distribution    MapReduce Version    Scheduler
Cloudera CDH 5.4       MRv2                 Fair Scheduler
Hortonworks HDP 2.2    MRv2                 Capacity Scheduler
IBM BigInsights 3.0    MRv1                 Capacity Scheduler and Fair Scheduler
MapR 4.0.2             MRv1 or MRv2         Capacity Scheduler and Fair Scheduler
Pivotal HD 2.1         MRv2                 Capacity Scheduler and Fair Scheduler

Big Data Management Configuration Utility

You can use the Big Data Management Configuration Utility to automate part of the configuration for Big Data Management. Alternatively, you can manually configure Big Data Management.

To automate part of the configuration process for the Hadoop cluster properties on the machine where the Data Integration Service runs, perform the following steps:

1. On the machine where the Data Integration Service runs, open the command line.
2. Go to the following directory: <Informatica installation directory>/tools/BDEUtil.
3. Run BDEConfig.sh.
4. Press Enter.
5. Choose the Hadoop distribution:
   1 - Cloudera CDH
   2 - Hortonworks HDP
   3 - MapR
   4 - Pivotal HD
   5 - IBM BigInsights

6. Choose the Hadoop distribution version you want to use to configure Big Data Management.
7. Choose how to access files on the Hadoop cluster.
   If you choose Cloudera CDH, the following options appear:
   1 - Cloudera Manager. Enter this option to use the Cloudera Manager API to access files on the Hadoop cluster.
   2 - Secure Shell (SSH). Enter this option to use SSH to access files on the Hadoop cluster. This option requires SSH connections to the machines that host the NameNode, JobTracker, and Hive client. If you select this option, Informatica recommends that you use an SSH connection without a password or have sshpass or Expect installed.
   3 - Shared directory. Enter this option to use a shared directory to access files on the Hadoop cluster. You must have read permission for the shared directory.
   Note: Informatica recommends the Cloudera Manager or SSH option.
   If you choose a distribution other than Cloudera CDH, the following options appear:
   1 - Secure Shell (SSH). Enter this option to use SSH to access files on the Hadoop cluster. This option requires SSH connections to the machines that host the NameNode, JobTracker, and Hive client. If you select this option, Informatica recommends that you use an SSH connection without a password or have sshpass or Expect installed.
   2 - Shared directory. Enter this option to use a shared directory to access files on the Hadoop cluster. You must have read permission for the shared directory.
   Note: Informatica recommends the SSH option.

8. If you chose Cloudera CDH, choose the Cloudera CDH cluster you want to use to configure Big Data Management. If you did not choose Cloudera CDH, continue to step 9.
9. Based on the option you selected, see the corresponding topic to continue with the configuration process:
   • "Use Cloudera Manager"
   • "Use SSH"
   • "Use a Shared Directory"

Use Cloudera Manager

If you choose Cloudera Manager, perform the following steps to configure Big Data Management:

1. Enter the Cloudera Manager host.
2. Enter the Cloudera user ID.
3. Enter the password for the user ID.
4. Enter the port for Cloudera Manager. The Big Data Management Configuration Utility retrieves the required information from the Hadoop cluster.
5. Complete the manual configuration steps.

Use SSH

If you choose SSH, you must provide host names and Hadoop configuration file locations.

Note: Informatica recommends that you use an SSH connection without a password or have sshpass or Expect installed. If you do not use one of these methods, you must enter the password each time the utility downloads a file from the Hadoop cluster.


Verify the following host names: NameNode, JobTracker, and Hive client. Additionally, verify the locations for the following files on the Hadoop cluster:

• hdfs-site.xml
• core-site.xml
• mapred-site.xml
• yarn-site.xml
• hive-site.xml

Perform the following steps to configure Big Data Management:

1. Enter the NameNode host name.
2. Enter the SSH user ID.
3. Enter the password for the SSH user ID. If you use an SSH connection without a password, leave this field blank and press Enter.
4. Enter the location for the hdfs-site.xml file on the Hadoop cluster.
5. Enter the location for the core-site.xml file on the Hadoop cluster. The Big Data Management Configuration Utility connects to the NameNode and downloads the following files: hdfs-site.xml and core-site.xml.
6. Enter the JobTracker host name.
7. Enter the SSH user ID.
8. Enter the password for the SSH user ID. If you use an SSH connection without a password, leave this field blank and press Enter.
9. Enter the directory for the mapred-site.xml file on the Hadoop cluster.
10. Enter the directory for the yarn-site.xml file on the Hadoop cluster. The utility connects to the JobTracker and downloads the following files: mapred-site.xml and yarn-site.xml.
11. Enter the Hive client host name.
12. Enter the SSH user ID.
13. Enter the password for the SSH user ID. If you use an SSH connection without a password, leave this field blank and press Enter.
14. Enter the directory for the hive-site.xml file on the Hadoop cluster. The utility connects to the Hive client and downloads the following file: hive-site.xml.
15. Complete the manual configuration steps.

Use a Shared Directory

If you choose shared directory, perform the following steps to configure Big Data Management:

1. Enter the location of the shared directory.
   Note: You must have read permission for the directory, and the directory should contain the following files:
   • core-site.xml
   • hdfs-site.xml
   • hive-site.xml
   • mapred-site.xml
   • yarn-site.xml
2. Complete the manual configuration steps.

Mappings on Cloudera CDH

You can enable Informatica mappings to run on a Hadoop cluster on Cloudera CDH. Informatica supports Cloudera CDH clusters that are deployed on-premise, on Amazon EC2, or on Microsoft Azure.

To enable Informatica mappings to run on a Cloudera CDH cluster, complete the following steps:

1. Configure Hadoop cluster properties on the machine on which the Data Integration Service runs.
2. Configure virtual memory limits.
3. Create a staging directory.
4. Add hbase-protocol.jar to the Hadoop classpath.

Configure Hadoop Cluster Properties on the Data Integration Service Machine

Configure Hadoop cluster properties in the hive-site.xml and yarn-site.xml files that the Data Integration Service uses when it runs mappings on a Cloudera CDH cluster.

Configure hive-site.xml for the Data Integration Service

hive-site.xml is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/cloudera_cdh/conf

In hive-site.xml, configure the following property:

hive.optimize.constant.propagation
    Whether to enable the constant propagation optimizer. Set this value to false.

The following sample code describes the property you can set in hive-site.xml:

    <property>
      <name>hive.optimize.constant.propagation</name>
      <value>false</value>
    </property>

Configure yarn-site.xml for the Data Integration Service

The yarn-site.xml file is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/cloudera_cdh/conf

Configure the following Hadoop cluster property:

yarn.resourcemanager.webapp.address
    Web application address for the Resource Manager.


    Use the value in the following file: /etc/hadoop/conf/yarn-site.xml.

The Big Data Management Configuration Utility automatically configures the following properties in the yarn-site.xml file. You can also manually configure the properties.

mapreduce.jobhistory.address
    Location of the MapReduce JobHistory Server. Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

mapreduce.jobhistory.webapp.address
    Web address of the MapReduce JobHistory Server. Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

yarn.resourcemanager.scheduler.address
    Scheduler interface address. Use the value in the following file: /etc/hadoop/conf/yarn-site.xml

You can set the following properties in yarn-site.xml:

    <property>
      <name>mapreduce.jobhistory.address</name>
      <value>hostname:port</value>
      <description>MapReduce JobHistory Server IPC host:port</description>
    </property>
    <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>hostname:port</value>
      <description>MapReduce JobHistory Server Web UI host:port</description>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>hostname:port</value>
      <description>The address of the scheduler interface</description>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>hostname:port</value>
      <description>The address for the Resource Manager web application</description>
    </property>

Create a Staging Directory on HDFS

If the Cloudera cluster uses HiveServer 2, you must grant the anonymous user the Execute permission on the staging directory, or you must create another staging directory on HDFS.

By default, a staging directory already exists on HDFS. You must grant the anonymous user the Execute permission on the staging directory. If you cannot grant the anonymous user the Execute permission on this directory, you must enter a valid user name for the user in the Hive connection. If you use the default staging directory on HDFS, you do not have to configure mapred-site.xml or hive-site.xml.

If you want to create another staging directory to store MapReduce jobs, you must create a directory on HDFS. After you create the staging directory, you must add it to mapred-site.xml and hive-site.xml.

To create another staging directory on HDFS, run the following commands from the command line of the machine that runs the Hadoop cluster:

    hadoop fs -mkdir /staging
    hadoop fs -chmod -R 0777 /staging

Add the staging directory to mapred-site.xml.


mapred-site.xml is located in the following directory on the Hadoop cluster: /etc/hadoop/conf/mapred-site.xml

For example, add the following entry to mapred-site.xml:

    <property>
      <name>yarn.app.mapreduce.am.staging-dir</name>
      <value>/staging</value>
    </property>

Add the staging directory to hive-site.xml on the machine where the Data Integration Service runs. hive-site.xml is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/cloudera_cdh/conf.

In hive-site.xml, add the yarn.app.mapreduce.am.staging-dir property. Use the value that you specified in mapred-site.xml. For example, add the following entry to hive-site.xml:

    <property>
      <name>yarn.app.mapreduce.am.staging-dir</name>
      <value>/staging</value>
    </property>

Configure Virtual Memory Limits

Configure the virtual memory limits in yarn-site.xml for every node in the Hadoop cluster. After you configure virtual memory limits, you must restart the Hadoop cluster.

yarn-site.xml is located in the following directory on every node in the Hadoop cluster: /etc/hadoop/conf/yarn-site.xml

In yarn-site.xml, configure the following property:

yarn.nodemanager.vmem-check-enabled
    Determines virtual memory limits. Set this value to false.

The following example describes the property you can configure in yarn-site.xml:

    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
      <description>Enforces virtual memory limits for containers.</description>
    </property>

Add hbase-protocol.jar to the Hadoop Classpath

Add hbase-protocol.jar to the Hadoop classpath on every node on the Hadoop cluster. Then, restart the Node Manager for each node in the Hadoop cluster.

hbase-protocol.jar is located in the HBase installation directory on the Hadoop cluster. For more information, refer to the following link: https://issues.apache.org/jira/browse/HBASE-10304
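One common way to add the JAR to the classpath is to extend HADOOP_CLASSPATH in hadoop-env.sh on each node; the JAR path below is an assumption, so use the actual location in your HBase installation directory:

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hbase/hbase-protocol.jar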

Configure the Blaze Engine

To use the Blaze engine, you must configure the Hadoop cluster. To configure a Cloudera CDH cluster for the Blaze engine, complete the following tasks:

• Create symbolic links to the Jackson JAR files.
• Configure yarn-site.xml on every node in the Hadoop cluster.
• Start the Application Timeline Server.
• Enable the Blaze Engine console.

Create Symbolic Links to Jackson JAR Files for the Blaze Engine

The Hadoop Application Timeline Server requires the latest version of the Jackson JAR files to run. You must create symbolic links to the Jackson JAR files that the Hadoop Application Timeline Server requires.

Perform the following steps on the node where you want to start the Application Timeline Server:

1. Navigate to the directory that contains the following Jackson JAR files on the Hadoop cluster:
   • jackson-xc-1.8.8.jar
   • jackson-jaxrs-1.8.8.jar
   • jackson-core-asl-1.8.8.jar
   • jackson-mapper-asl-1.8.8.jar
   If you use Cloudera Manager to configure the Hadoop cluster, you can find the files in the following directory: /opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/lib/. If you configure the Hadoop cluster manually, you can find the files in the following directory: /usr/lib/hadoop/lib/.
2. Remove the link to the Jackson JAR files. Run the following command for each Jackson JAR file:

    rm <Jackson JAR file name>

   For example, run the following command to remove the link to jackson-xc-1.8.8.jar on a Cloudera CDH cluster that