

Informatica (Version 10.0)

Big Data Management

Installation and Configuration Guide


    Informatica Big Data Management Installation and Configuration Guide

Version 10.0

November 2015

    Copyright (c) 1993-2015 Informatica LLC. All rights reserved.

This software and documentation contain proprietary information of Informatica LLC and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC. This Software may be protected by U.S. and/or international Patents and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.

Informatica, Informatica Platform, Informatica Data Services, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Transformation, Informatica B2B Data Exchange, Informatica On Demand, Informatica Identity Resolution, Informatica Application Information Lifecycle Management, Informatica Complex Event Processing, Ultra Messaging and Informatica Master Data Management are trademarks or registered trademarks of Informatica LLC in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright DataDirect Technologies. All rights reserved. Copyright © Sun Microsystems. All rights reserved. Copyright © RSA Security Inc. All Rights Reserved. Copyright © Ordinal Technology Corp. All rights reserved. Copyright © Aandacht c.v. All rights reserved. Copyright Genivia, Inc. All rights reserved. Copyright Isomorphic Software. All rights reserved. Copyright © Meta Integration Technology, Inc. All rights reserved. Copyright © Intalio. All rights reserved. Copyright © Oracle. All rights reserved. Copyright © Adobe Systems Incorporated. All rights reserved. Copyright © DataArt, Inc. All rights reserved. Copyright © ComponentSource. All rights reserved. Copyright © Microsoft Corporation. All rights reserved. Copyright © Rogue Wave Software, Inc. All rights reserved. Copyright © Teradata Corporation. All rights reserved. Copyright © Yahoo! Inc. All rights reserved. Copyright © Glyph & Cog, LLC. All rights reserved. Copyright © Thinkmap, Inc. All rights reserved. Copyright © Clearpace Software Limited. All rights reserved. Copyright © Information Builders, Inc. All rights reserved. Copyright © OSS Nokalva, Inc. All rights reserved. Copyright Edifecs, Inc. All rights reserved. Copyright Cleo Communications, Inc. All rights reserved. Copyright © International Organization for Standardization 1986. All rights reserved. Copyright © ej-technologies GmbH. All rights reserved. Copyright © Jaspersoft Corporation. All rights reserved. Copyright © International Business Machines Corporation. All rights reserved. Copyright © yWorks GmbH. All rights reserved. Copyright © Lucent Technologies. All rights reserved. Copyright (c) University of Toronto. All rights reserved. Copyright © Daniel Veillard. All rights reserved. Copyright © Unicode, Inc. Copyright IBM Corp. All rights reserved. Copyright © MicroQuill Software Publishing, Inc. All rights reserved. Copyright © PassMark Software Pty Ltd. All rights reserved. Copyright © LogiXML, Inc. All rights reserved. Copyright © 2003-2010 Lorenzi Davide, All rights reserved. Copyright © Red Hat, Inc. All rights reserved. Copyright © The Board of Trustees of the Leland Stanford Junior University. All rights reserved. Copyright © EMC Corporation. All rights reserved. Copyright © Flexera Software. All rights reserved. Copyright © Jinfonet Software. All rights reserved. Copyright © Apple Inc. All rights reserved. Copyright © Telerik Inc. All rights reserved. Copyright © BEA Systems. All rights reserved. Copyright © PDFlib GmbH. All rights reserved. Copyright © Orientation in Objects GmbH. All rights reserved. Copyright © Tanuki Software, Ltd. All rights reserved. Copyright © Ricebridge. All rights reserved. Copyright © Sencha, Inc. All rights reserved. Copyright © Scalable Systems, Inc. All rights reserved. Copyright © jQWidgets. All rights reserved. Copyright © Tableau Software, Inc. All rights reserved. Copyright © MaxMind, Inc. All Rights Reserved. Copyright © TMate Software s.r.o. All rights reserved. Copyright © MapR Technologies Inc. All rights reserved. Copyright © Amazon Corporate LLC. All rights reserved. Copyright © Highsoft. All rights reserved. Copyright © Python Software Foundation. All rights reserved. Copyright © BeOpen.com. All rights reserved. Copyright © CNRI. All rights reserved.

This product includes software developed by the Apache Software Foundation (http://www.apache.org/), and/or other software which is licensed under various versions of the Apache License (the "License"). You may obtain a copy of these Licenses at http://www.apache.org/licenses/. Unless required by applicable law or agreed to in writing, software distributed under these Licenses is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the Licenses for the specific language governing permissions and limitations under the Licenses.

This product includes software which was developed by Mozilla (http://www.mozilla.org/), software copyright The JBoss Group, LLC, all rights reserved; software copyright © 1999-2006 by Bruno Lowagie and Paulo Soares and other software which is licensed under various versions of the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/licenses/lgpl.html. The materials are provided free of charge by Informatica, "as-is", without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

The product includes ACE(TM) and TAO(TM) software copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (©) 1993-2006, all rights reserved.

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (copyright The OpenSSL Project. All Rights Reserved) and redistribution of this software is subject to terms available at http://www.openssl.org and http://www.openssl.org/source/license.html.

This product includes Curl software which is Copyright 1996-2013, Daniel Stenberg. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://curl.haxx.se/docs/copyright.html. Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

The product includes software copyright 2001-2005 (©) MetaStuff, Ltd. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.dom4j.org/license.html.

The product includes software copyright © 2004-2007, The Dojo Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://dojotoolkit.org/license.

This product includes ICU software which is copyright International Business Machines Corporation and others. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://source.icu-project.org/repos/icu/icu/trunk/license.html.

This product includes software copyright © 1996-2006 Per Bothner. All rights reserved. Your right to use such materials is set forth in the license which may be found at http://www.gnu.org/software/kawa/Software-License.html.

This product includes OSSP UUID software which is Copyright © 2002 Ralf S. Engelschall, Copyright © 2002 The OSSP Project, Copyright © 2002 Cable & Wireless Deutschland. Permissions and limitations regarding this software are subject to terms available at http://www.opensource.org/licenses/mit-license.php.

This product includes software developed by Boost (http://www.boost.org/) or under the Boost software license. Permissions and limitations regarding this software are subject to terms available at http://www.boost.org/LICENSE_1_0.txt.

This product includes software copyright © 1997-2007 University of Cambridge. Permissions and limitations regarding this software are subject to terms available at http://www.pcre.org/license.txt.

This product includes software copyright © 2007 The Eclipse Foundation. All Rights Reserved. Permissions and limitations regarding this software are subject to terms available at http://www.eclipse.org/org/documents/epl-v10.php and at http://www.eclipse.org/org/documents/edl-v10.php.


This product includes software licensed under the terms at http://www.tcl.tk/software/tcltk/license.html, http://www.bosrup.com/web/overlib/?License, http://www.stlport.org/doc/license.html, http://asm.ow2.org/license.html, http://www.cryptix.org/LICENSE.TXT, http://hsqldb.org/web/hsqlLicense.html, http://httpunit.sourceforge.net/doc/license.html, http://jung.sourceforge.net/license.txt, http://www.gzip.org/zlib/zlib_license.html, http://www.openldap.org/software/release/license.html, http://www.libssh2.org, http://slf4j.org/license.html, http://www.sente.ch/software/OpenSourceLicense.html, http://fusesource.com/downloads/license-agreements/fuse-message-broker-v-5-3-license-agreement; http://antlr.org/license.html; http://aopalliance.sourceforge.net/; http://www.bouncycastle.org/licence.html; http://www.jgraph.com/jgraphdownload.html; http://www.jcraft.com/jsch/LICENSE.txt; http://jotm.objectweb.org/bsd_license.html; http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231; http://www.slf4j.org/license.html; http://nanoxml.sourceforge.net/orig/copyright.html; http://www.json.org/license.html; http://forge.ow2.org/projects/javaservice/, http://www.postgresql.org/about/licence.html, http://www.sqlite.org/copyright.html, http://www.tcl.tk/software/tcltk/license.html, http://www.jaxen.org/faq.html, http://www.jdom.org/docs/faq.html, http://www.slf4j.org/license.html; http://www.iodbc.org/dataspace/iodbc/wiki/iODBC/License; http://www.keplerproject.org/md5/license.html; http://www.toedter.com/en/jcalendar/license.html; http://www.edankert.com/bounce/index.html; http://www.net-snmp.org/about/license.html; http://www.openmdx.org/#FAQ; http://www.php.net/license/3_01.txt; http://srp.stanford.edu/license.txt; http://www.schneier.com/blowfish.html; http://www.jmock.org/license.html; http://xsom.java.net; http://benalman.com/about/license/; https://github.com/CreateJS/EaselJS/blob/master/src/easeljs/display/Bitmap.js; http://www.h2database.com/html/license.html#summary; http://jsoncpp.sourceforge.net/LICENSE; http://jdbc.postgresql.org/license.html; http://protobuf.googlecode.com/svn/trunk/src/google/protobuf/descriptor.proto; https://github.com/rantav/hector/blob/master/LICENSE; http://web.mit.edu/Kerberos/krb5-current/doc/mitK5license.html; http://jibx.sourceforge.net/jibx-license.html; https://github.com/lyokato/libgeohash/blob/master/LICENSE; https://github.com/hjiang/jsonxx/blob/master/LICENSE; https://code.google.com/p/lz4/; https://github.com/jedisct1/libsodium/blob/master/LICENSE; http://one-jar.sourceforge.net/index.php?page=documents&file=license; https://github.com/EsotericSoftware/kryo/blob/master/license.txt; http://www.scala-lang.org/license.html; https://github.com/tinkerpop/blueprints/blob/master/LICENSE.txt; http://gee.cs.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/intro.html; https://aws.amazon.com/asl/; https://github.com/twbs/bootstrap/blob/master/LICENSE; https://sourceforge.net/p/xmlunit/code/HEAD/tree/trunk/LICENSE.txt; https://github.com/documentcloud/underscore-contrib/blob/master/LICENSE, and https://github.com/apache/hbase/blob/master/LICENSE.txt.

This product includes software licensed under the Academic Free License (http://www.opensource.org/licenses/afl-3.0.php), the Common Development and Distribution License (http://www.opensource.org/licenses/cddl1.php), the Common Public License (http://www.opensource.org/licenses/cpl1.0.php), the Sun Binary Code License Agreement Supplemental License Terms, the BSD License (http://www.opensource.org/licenses/bsd-license.php), the new BSD License (http://opensource.org/licenses/BSD-3-Clause), the MIT License (http://www.opensource.org/licenses/mit-license.php), the Artistic License (http://www.opensource.org/licenses/artistic-license-1.0) and the Initial Developer's Public License Version 1.0 (http://www.firebirdsql.org/en/initial-developer-s-public-license-version-1-0/).

This product includes software copyright © 2003-2006 Joe Walnes, 2006-2007 XStream Committers. All rights reserved. Permissions and limitations regarding this software are subject to terms available at http://xstream.codehaus.org/license.html. This product includes software developed by the Indiana University Extreme! Lab. For further information please visit http://www.extreme.indiana.edu/.

This product includes software Copyright (c) 2013 Frank Balluffi and Markus Moeller. All rights reserved. Permissions and limitations regarding this software are subject to terms of the MIT license.

    See patents at https://www.informatica.com/legal/patents.html.

DISCLAIMER: Informatica LLC provides this documentation "as is" without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of noninfringement, merchantability, or use for a particular purpose. Informatica LLC does not warrant that this software or documentation is error free. The information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.

    NOTICES

This Informatica product (the "Software") includes certain drivers (the "DataDirect Drivers") from DataDirect Technologies, an operating company of Progress Software Corporation ("DataDirect") which are subject to the following terms and conditions:

1. THE DATADIRECT DRIVERS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.

    Part Number: IN-BDI-10000-0001


    Table of Contents

    Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica My Support Portal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Documentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Product Availability Matrixes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    Informatica Web Site. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica How-To Library. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Knowledge Base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Support YouTube Channel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Marketplace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Velocity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    Informatica Global Customer Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Chapter 1: Installation and Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Installation and Configuration Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    Informatica Big Data Management Installation Process. . . . . . . . . . . . . . . . . . . . . . . . . . 10

    Before You Begin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    Install and Configure the Informatica Domain and Clients. . . . . . . . . . . . . . . . . . . . . . . . . 11

    Install and Configure PowerExchange Adapters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    Install and Configure Data Replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    Pre-Installation Tasks for a Single Node Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    Pre-Installation Tasks for a Cluster Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    Informatica Big Data Management Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    Installing in a Single Node Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol. . . . . . . 15

Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    Installing in a Cluster Environment from any Machine. . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    Installing Big Data Management Using Cloudera Manager. . . . . . . . . . . . . . . . . . . . . . . . 17

     After You Install. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    Configure Hadoop Pushdown Properties for the Data Integration Service. . . . . . . . . . . . . . . 18

    Reference Data Requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Hive Variables for Mappings in a Hadoop Environment. . . . . . . . . . . . . . . . . . . . . . . . 19

Update Hadoop Cluster Configuration Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . 20

    Library Path and Path Variables for Mappings in a Hadoop Environment. . . . . . . . . . . . . . . 21

    Configure the Blaze Engine Log Directories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    Hadoop Environment Properties File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    Informatica Developer Files and Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    Open the Required Ports for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    Enable Support for Lookup Transformations with Teradata Data Objects. . . . . . . . . . . . . . . 22


    Informatica Big Data Management Uninstallation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    Uninstalling Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    Chapter 2: Mappings on Hadoop Distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    Mappings on Hadoop Distributions Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    Big Data Management Configuration Utility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Use Cloudera Manager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Use SSH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    Use a Shared Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    Mappings on Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    Configure Hadoop Cluster Properties on the Data Integration Service Machine. . . . . . . . . . . 29

    Create a Staging Directory on HDFS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

    Configure Virtual Memory Limits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    Configure the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    Mappings on Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    Configure Hadoop Cluster Properties for the Data Integration Service. . . . . . . . . . . . . . . . . 35

    Enable Tez. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Add hbase_protocol.jar to the Hadoop classpath. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    Enable HBase Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    Configure the Hadoop Cluster for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

    Mappings on IBM BigInsights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    User Account for the JDBC and Hive Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    Mappings on MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    Verify the Cluster Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    Configure hive-site.xml on the Data Integration Service Machine for MapReduce 1. . . . . . . . 43

    Configure hive-site.xml on Every Node in the Hadoop Cluster for MapReduce 1. . . . . . . . . . 44

Configure Hadoop Cluster Properties on the Data Integration Service Machine for MapReduce 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Configure yarn-site.xml on Every Node in the Cluster for MapReduce 2. . . . . . . . . . . . . . . 45

    Configure MapR Distribution Variables for Mappings in a Hadoop Environment. . . . . . . . . . . 47

    Configure the Heap Space for the MapR-FS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    Enable Hadoop Pushdown for HBase. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    Configure the Application Timeline Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    Mappings on Pivotal HD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    Configure Hadoop Cluster Properties for Pivotal HD in yarn-site.xml. . . . . . . . . . . . . . . . . . 49

    Configure Virtual Memory Limits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

    Chapter 3: High Availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    Configure High Availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

    Configuring Big Data Management for a Highly Available Cloudera CDH Cluster. . . . . . . . . . . . . 53

    Configuring Big Data Management for a Highly Available Hortonworks HDP Cluster. . . . . . . . . . . 54

    Configuring Big Data Management for a Highly Available IBM BigInsights Cluster. . . . . . . . . . . . 55


Configuring Big Data Management for a Highly Available MapR Cluster. . . . . . . . . . . . . . . . . . 56

Configuring Big Data Management for a Highly Available Pivotal Cluster. . . . . . . . . . . . . . . . . . 57

    Appendix A: Upgrade Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    Upgrading Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


    Preface

The Informatica Big Data Management Installation and Configuration Guide is written for the system

    administrator who is responsible for installing Informatica Big Data Management. This guide assumes you

    have knowledge of operating systems, relational database concepts, and the database engines, flat files, or

    mainframe systems in your environment. This guide also assumes you are familiar with the interface

    requirements for the Hadoop environment.

    Informatica Resources

    Informatica My Support Portal

As an Informatica customer, the first step in reaching out to Informatica is through the Informatica My Support

    Portal at https://mysupport.informatica.com. The My Support Portal is the largest online data integration

    collaboration platform with over 100,000 Informatica customers and partners worldwide.

     As a member, you can:

    •  Access all of your Informatica resources in one place.

    • Review your support cases.

    • Search the Knowledge Base, find product documentation, access how-to documents, and watch support

    videos.

    • Find your local Informatica User Group Network and collaborate with your peers.

    Informatica Documentation

    The Informatica Documentation team makes every effort to create accurate, usable documentation. If you

    have questions, comments, or ideas about this documentation, contact the Informatica Documentation team

through email at infa_documentation@informatica.com. We will use your feedback to improve our

    documentation. Let us know if we can contact you regarding your comments.

    The Documentation team updates documentation as needed. To get the latest documentation for your

    product, navigate to Product Documentation from https://mysupport.informatica.com.

    Informatica Product Availability Matrixes

    Product Availability Matrixes (PAMs) indicate the versions of operating systems, databases, and other types

    of data sources and targets that a product release supports. You can access the PAMs on the Informatica My

    Support Portal at https://mysupport.informatica.com.


    Informatica Web Site

    You can access the Informatica corporate web site at https://www.informatica.com. The site contains

    information about Informatica, its background, upcoming events, and sales offices. You will also find product

    and partner information. The services area of the site includes important information about technical support,

training and education, and implementation services.

    Informatica How-To Library

     As an Informatica customer, you can access the Informatica How-To Library at

    https://mysupport.informatica.com. The How-To Library is a collection of resources to help you learn more

    about Informatica products and features. It includes articles and interactive demonstrations that provide

    solutions to common problems, compare features and behaviors, and guide you through performing specific

    real-world tasks.

    Informatica Knowledge Base

     As an Informatica customer, you can access the Informatica Knowledge Base at

    https://mysupport.informatica.com. Use the Knowledge Base to search for documented solutions to known

    technical issues about Informatica products. You can also find answers to frequently asked questions,

    technical white papers, and technical tips. If you have questions, comments, or ideas about the Knowledge

Base, contact the Informatica Knowledge Base team through email at KB_feedback@informatica.com.

    Informatica Support YouTube Channel

    You can access the Informatica Support YouTube channel at http://www.youtube.com/user/INFASupport. The

    Informatica Support YouTube channel includes videos about solutions that guide you through performing

    specific tasks. If you have questions, comments, or ideas about the Informatica Support YouTube channel,

contact the Support YouTube team through email at supportvideos@informatica.com or send a tweet to

    @INFASupport.

    Informatica Marketplace

    The Informatica Marketplace is a forum where developers and partners can share solutions that augment,

    extend, or enhance data integration implementations. By leveraging any of the hundreds of solutions

    available on the Marketplace, you can improve your productivity and speed up time to implementation on

    your projects. You can access Informatica Marketplace at http://www.informaticamarketplace.com.

    Informatica Velocity

    You can access Informatica Velocity at https://mysupport.informatica.com. Developed from the real-world

    experience of hundreds of data management projects, Informatica Velocity represents the collective

knowledge of our consultants who have worked with organizations from around the world to plan, develop, deploy, and maintain successful data management solutions. If you have questions, comments, or ideas

about Informatica Velocity, contact Informatica Professional Services at ips@informatica.com.

    Informatica Global Customer Support

    You can contact a Customer Support Center by telephone or through the Online Support.

    Online Support requires a user name and password. You can request a user name and password at

    http://mysupport.informatica.com.


    The telephone numbers for Informatica Global Customer Support are available from the Informatica web site

    at http://www.informatica.com/us/services-and-training/support-services/global-support-centers/.


Chapter 1: Installation and Configuration

    This chapter includes the following topics:

    • Installation and Configuration Overview, 10

    • Before You Begin, 11

    • Informatica Big Data Management Installation, 14

• After You Install, 17

• Informatica Big Data Management Uninstallation, 23

    Installation and Configuration Overview

    The Informatica Big Data Management installation is distributed to the Hadoop cluster as a Red Hat Package

    Manager (RPM) installation package.

    The RPM package includes the Informatica 10.0 engine, the Blaze engine, and adapter components. The

    RPM package and the binary files that you need to run the Big Data Management installation are compressed

    into a tar.gz file.

     After you complete the installation, you must configure the Informatica domain and the Hadoop cluster to

    enable Informatica mappings to run on a Hadoop cluster.

    Informatica Big Data Management Installation Process

    You can install Big Data Management in a single node or cluster environment.

    Installing in a Single Node Environment

    You can install Big Data Management in a single node environment.

    1. Extract the Big Data Management tar.gz file to the machine.

    2. Install Big Data Management by running the installation shell script in a Linux environment.

    Installing in a Cluster Environment

    You can install Big Data Management in a cluster environment.

    1. Extract the Big Data Management tar.gz file to a machine.


    2. Distribute the RPM package to all of the nodes within the Hadoop cluster. You can distribute the RPM

    package using any of the following protocols: File Transfer Protocol (FTP), Hypertext Transfer Protocol

    (HTTP), Network File System (NFS), or Secure Copy Protocol (SCP).

    3. Install Big Data Management by running the installation shell script in a Linux environment. You can

    install Big Data Management from the primary NameNode or from any machine using the

    HadoopDataNodes file.

    • Install from the primary NameNode. You can install Big Data Management using FTP, HTTP, NFS or

    SCP protocol. During the installation, the installer shell script picks up all of the DataNodes from the

    following file: $HADOOP_HOME/conf/slaves. Then, it copies the Big Data Management binary files to

the following directory on each of the DataNodes: /<Big Data Management installation directory>/Informatica. You can perform this step only if you are deploying Hadoop from the primary

    NameNode.

    • Install from any machine. Add the IP addresses or machine host names, one for each line, for each of

    the nodes in the Hadoop cluster in the HadoopDataNodes file. During the Big Data Management

    installation, the installation shell script picks up all of the nodes from the HadoopDataNodes file and

copies the Big Data Management binary files to the /<Big Data Management installation directory>/Informatica directory on each of the nodes.
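For example, a HadoopDataNodes file for a three-node cluster might contain the following lines; the host names are hypothetical:

    hadoop-node01.example.com
    hadoop-node02.example.com
    hadoop-node03.example.com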

    Before You Begin

    Before you begin the installation, install the Informatica components and PowerExchange adapters, and

    perform the pre-installation tasks.

    Install and Configure the Informatica Domain and Clients

    Before you install Big Data Management, install and configure the Informatica domain and clients.

    You must install the Informatica services and clients. Run the Informatica services installation to configure

    the Informatica domain and create the Informatica services. Run the Informatica client installation to install

    the Informatica client tools.

    Install and Configure PowerExchange Adapters

    Based on your business needs, install and configure Informatica adapters. Use Big Data Management with

    Informatica adapters for access to sources and targets.

    To run Informatica mappings in a Hadoop environment you must install and configure Informatica adapters.

    You can use the following Informatica adapters as part of Big Data Management:

    • PowerExchange for DataSift

    • PowerExchange for Facebook

    • PowerExchange for HBase

    • PowerExchange for HDFS

    • PowerExchange for Hive

    • PowerExchange for LinkedIn

    • PowerExchange for Teradata Parallel Transporter API


    • PowerExchange for Twitter 

    • PowerExchange for Web Content-Kapow Katalyst

    For more information, see the PowerExchange adapter documentation.

Install and Configure Data Replication

To migrate data with minimal downtime and perform auditing and operational reporting functions, install and configure Data Replication. For more information, see the Informatica Data Replication User Guide.

    Pre-Installation Tasks for a Single Node Environment

Before you begin the Big Data Management installation in a single node environment, perform the pre-installation tasks.

• Verify that Hadoop is installed with Hadoop File System (HDFS) and MapReduce. The Hadoop installation should include a Hive data warehouse that is configured to use a non-embedded database as the MetaStore. For more information, see the Apache website here: http://hadoop.apache.org.

• To perform both read and write operations in native mode, install the required third-party client software. For example, install the Oracle client to connect to the Oracle database.

• Verify that the Big Data Management administrator user can run sudo commands or has root user privileges.

    • Verify that the temporary folder on the local node has at least 700 MB of disk space.

• Download the following file to the temporary folder: InformaticaHadoop-<version>.tar.gz

• Extract the following file to the local node where you want to run the Big Data Management installation: InformaticaHadoop-<version>.tar.gz
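For example, the following commands extract a downloaded archive in the temporary folder; the version string is hypothetical:

    cd /tmp
    tar -xzf InformaticaHadoop-10.0.0.tar.gz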

Pre-Installation Tasks for a Cluster Environment

Before you begin the Big Data Management installation in a cluster environment, perform the following tasks:

    • Install third-party software.

    • Verify the distribution method.

    • Verify system requirements.

    • Verify connection requirements.

    • Download the RPM.

    Install Third-Party Software

    Verify that the following third-party software is installed:

    Hadoop with Hadoop Distributed File System (HDFS) and MapReduce

    Hadoop must be installed on every node within the cluster. The Hadoop installation must include a Hive

    data warehouse that is configured to use a MySQL database as the MetaStore. You can configure Hive

    to use a local or remote MetaStore server. For more information, see the Apache website here:

    http://hadoop.apache.org/.

    Note: Informatica does not support embedded MetaStore server setups.


    Database client software to perform read and write operations in native mode

    Install the client software for the database. Informatica requires the client software to run MapReduce

jobs. For example, install the Oracle client to connect to the Oracle database. Install the database client

    software on all the nodes within the Hadoop cluster.

    Verify the Distribution Method

    You can distribute Big Data Management to the Hadoop cluster with one of the following protocols:

    • File Transfer Protocol (FTP)

    • Hypertext Transfer Protocol (HTTP)

    • Network File System (NFS) protocol

    • Secure Copy (SCP) protocol

    • Cloudera Manager.

    To verify that you can distribute Big Data Management to the Hadoop cluster with one of the protocols,

    perform the following tasks:

Note: If you use Cloudera Manager to distribute Big Data Management to the Hadoop cluster, skip these tasks.

    1. Ensure that the server or service for your distribution method is running.

    2. In the config file on the machine where you want to run the Big Data Management installation, set the

    DISTRIBUTOR_NODE parameter to the following setting:

• FTP: Set DISTRIBUTOR_NODE=ftp://<FTP server>/pub

• HTTP: Set DISTRIBUTOR_NODE=http://<web server>

• NFS: Set DISTRIBUTOR_NODE=<shared file location>

    The file location must be accessible to all nodes in the cluster.
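For example, if you distribute the package over HTTP from a hypothetical web server named repo01.example.com, the config file entry might look like the following line:

    DISTRIBUTOR_NODE=http://repo01.example.com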

    Verify System Requirements

    Verify the following system requirements:

    • The Big Data Management administrator can run sudo commands or has root user privileges.

    • The temporary folder in each of the nodes on which Big Data Management will be installed has at least

    700 MB of disk space.
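For example, you can check the available space in the temporary folder with a standard Linux command:

    df -h /tmp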

    Verify Connection Requirements

    Verify the connection to the Hadoop cluster nodes.

    Big Data Management requires a Secure Shell (SSH) connection without a password between the machine

    where you want to run the Big Data Management installation and all the nodes in the Hadoop cluster.
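A minimal sketch of setting up a passwordless SSH connection with OpenSSH; the user and host names are hypothetical:

    # Generate a key pair without a passphrase on the installation machine.
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    # Copy the public key to every node in the Hadoop cluster.
    ssh-copy-id hadoop@hadoop-node01.example.com
    # Verify that the connection no longer prompts for a password.
    ssh hadoop@hadoop-node01.example.com hostname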

    Download the RPM

    Download the following file to a temporary folder:

InformaticaHadoop-<version>.tar.gz

    Extract the file to the machine from where you want to distribute the RPM package and run the Big Data

    Management installation.


    Copy the following package to a shared directory based on the transfer protocol you are using:

InformaticaHadoop-<version>.rpm

    For example,

    • HTTP: /var/www/html

    • FTP: /var/ftp/pub

• NFS: <shared file location>

    The file location must be accessible by all the nodes in the cluster.

    Note: The RPM package must be stored on a local disk and not on HDFS.
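For example, if you distribute the RPM package over HTTP, you might copy it to the web server document root; the version string is hypothetical:

    cp InformaticaHadoop-10.0.0.rpm /var/www/html/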

    Informatica Big Data Management Installation

    You can install Big Data Management in a single node environment. You can also install Big Data

    Management in a cluster environment from the primary NameNode or from any machine.

    Install Big Data Management in a single node environment or cluster environment:

    • Install Big Data Management in a single node environment.

    • Install Big Data Management in a cluster environment from the primary NameNode using SCP protocol.

    • Install Big Data Management in a cluster environment from the primary NameNode using FTP, HTTP, or

    NFS protocol.

    • Install Big Data Management in a cluster environment from any machine.

    Install Big Data Management from a shell command line.

    Installing in a Single Node Environment

    You can install Big Data Management in a single node environment.

    1. Log in to the machine.

    2. Run the following command from the Big Data Management root directory to start the installation in

    console mode:

    bash InformaticaHadoopInstall.sh

    3. Press y to accept the Big Data Management terms of agreement.

    4. Press Enter .

    5. Press 1 to install Big Data Management in a single node environment.

    6. Press Enter .

    7. Type the absolute path for the Big Data Management installation directory and press Enter .

    Start the path with a slash. The directory names in the path must not contain spaces or the following

    special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of

    the nodes during the installation. Default is /opt.

    8. Press Enter .

The installer creates the /<Big Data Management installation directory>/Informatica directory and

    populates all of the file systems with the contents of the RPM package.


To get more information about the tasks performed by the installer, you can view the informatica-hadoop-install.<timestamp>.log installation log file.

    Installing in a Cluster Environment from the Primary NameNode

Using SCP Protocol

You can install Big Data Management in a cluster environment from the primary NameNode using SCP.

    1. Log in to the primary NameNode.

    2. Run the following command to start the Big Data Management installation in console mode:

    bash InformaticaHadoopInstall.sh

    3. Press y to accept the Big Data Management terms of agreement.

    4. Press Enter .

    5. Press 2 to install Big Data Management in a cluster environment.

    6. Press Enter .

7. Type the absolute path for the Big Data Management installation directory. Start the path with a slash. The directory names in the path must not contain spaces or the following

    special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of

    the nodes during the installation. Default is /opt.

    8. Press Enter .

    9. Press 1 to install Big Data Management from the primary NameNode.

    10. Press Enter .

    11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

    12. Press Enter .

13. Type y.

14. Press Enter .

    The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the

    DataNodes, the installer creates the Informatica directory and populates all of the file systems with the

contents of the RPM package. The Informatica directory is located here: /<Big Data Management installation directory>/Informatica

You can view the informatica-hadoop-install.<timestamp>.log installation log file to get more

    information about the tasks performed by the installer.

Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol

    You can install Big Data Management in a cluster environment from the primary NameNode using FTP,

    HTTP, or NFS protocol.

    1. Log in to the primary NameNode.

    2. Run the following command to start the Big Data Management installation in console mode:

    bash InformaticaHadoopInstall.sh

    3. Press y to accept the Big Data Management terms of agreement.

    4. Press Enter .


    5. Press 2 to install Big Data Management in a cluster environment.

    6. Press Enter .

    7. Type the absolute path for the Big Data Management installation directory.

    Start the path with a slash. The directory names in the path must not contain spaces or the following

    special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of

    the nodes during the installation. Default is /opt.

    8. Press Enter .

    9. Press 1 to install Big Data Management from the primary NameNode.

    10. Press Enter .

    11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

    12. Press Enter .

    13. Type n.

    14. Press Enter .

    15. Type y.

    16. Press Enter .

    The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the

DataNodes, the installer creates the /<Big Data Management installation directory>/Informatica

    directory and populates all of the file systems with the contents of the RPM package.

You can view the informatica-hadoop-install.<timestamp>.log installation log file to get more

    information about the tasks performed by the installer.

    Installing in a Cluster Environment from any Machine

    You can install Big Data Management in a cluster environment from any machine.

    1. Verify that the Big Data Management administrator has user root privileges on the node that will be

    running the Big Data Management installation.

    2. Log in to the machine as the root user.

    3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop

    cluster on which you want to install Big Data Management. The HadoopDataNodes file is located on the

node from where you want to launch the Big Data Management installation. Add one IP address or machine host name per line in the file.

    4. Run the following command to start the Big Data Management installation in console mode:

    bash InformaticaHadoopInstall.sh

    5. Press y to accept the Big Data Management terms of agreement.

    6. Press Enter .

    7. Press 2 to install Big Data Management in a cluster environment.

    8. Press Enter .

    9. Type the absolute path for the Big Data Management installation directory and press Enter . Start the

    path with a slash. Default is /opt.

    10. Press Enter .

    11. Press 2 to install Big Data Management using the HadoopDataNodes file.


    12. Press Enter .

The installer creates the /<Big Data Management installation directory>/Informatica directory and

    populates all of the file systems with the contents of the RPM package on the first node that appears in

    the HadoopDataNodes file. The installer repeats the process for each node in the HadoopDataNodes file.

    Installing Big Data Management Using Cloudera Manager 

    You can install Big Data Management on a Cloudera CDH cluster using Cloudera Manager.

    Perform the following steps:

1. Download the following file: INFORMATICA-<version>-informatica-<version>.parcel.tar

    2. Extract manifest.json and the parcels from the .tar file.

    3. Verify the location of your Local Parcel Repository.

In Cloudera Manager, click Administration > Settings > Parcels.

    4. Create a SHA file with the parcel name and hash listed in manifest.json that corresponds with your

    Hadoop cluster.

    For example, use the following parcel name for Hadoop cluster nodes that run Red Hat Enterprise Linux

    6.4 64-bit:

INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel

    Use the following hash listed for Red Hat Enterprise Linux 6.4 64-bit:

    8e904e949a11c4c16eb737f02ce4e36ffc03854f 

    To create a SHA file, run the following command:

echo <hash> > <parcel name>.sha

    For example, run the following command:

echo "8e904e949a11c4c16eb737f02ce4e36ffc03854f" > INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel.sha

5. Transfer the parcel and SHA file to the Local Parcel Repository with FTP. A sketch of steps 4 and 5 appears after these steps.

    6. Check for new parcels with Cloudera Manager.

    To check for new parcels, click Hosts > Parcels.

    7. Distribute the Big Data Management parcels.

    8. Activate the Big Data Management parcels.
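A minimal sketch of steps 4 and 5, assuming the Red Hat Enterprise Linux 6.4 parcel named above and the Local Parcel Repository at its common default location; verify the actual path under Administration > Settings > Parcels:

    # Create the SHA file from the hash listed in manifest.json.
    echo "8e904e949a11c4c16eb737f02ce4e36ffc03854f" > INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel.sha
    # Copy the parcel and the SHA file to the Local Parcel Repository (path assumed).
    cp INFORMATICA-9.6.1-1.informatica9.6.1.1.p0.1203-el6.parcel* /opt/cloudera/parcel-repo/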

     After You Install

     After you install Big Data Management, perform the post-installation tasks to ensure that Big Data

    Management runs properly.

    Perform the following tasks:

    • Configure the Hadoop pushdown properties for the Data Integration Service.

    • Optionally, install the Address Validation reference data.

    • Configure Hive variables for mappings in a Hadoop environment.

    • Update Hadoop cluster configuration parameters for mappings in a Hadoop environment.

    • Configure library path and path variables for mappings in a Hadoop environment.


    • Start the Application Timeline Server for the Blaze engine.

    • Configure the Blaze log directories.

    • Configure environment variables in the Big Data Management properties file.

    • Open the required ports for the Blaze engine.

    • Enable support for Lookup transformations with Teradata data objects.

    Note: The Blaze engine only supports the following Hadoop distributions: Cloudera CDH, Hortonworks HDP,

    and MapR. Skip the tasks for the Blaze engine if the Blaze engine does not support the distribution that the

    Hadoop cluster runs.

Configure Hadoop Pushdown Properties for the Data Integration Service

    Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hadoop

    environment.

    You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.

The following table describes the Hadoop pushdown properties for the Data Integration Service:

Informatica Home Directory on Hadoop

The Big Data Management home directory on every data node created by the Hadoop RPM install. Type /<Big Data Management installation directory>/Informatica.

Hadoop Distribution Directory

The directory containing a collection of Hive and Hadoop JARs on the cluster from the RPM install locations. The directory contains the minimum set of JARs required to process Informatica mappings in a Hadoop environment. Type /<Big Data Management installation directory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name].

Data Integration Service Hadoop Distribution Directory

The Hadoop distribution directory on the Data Integration Service node. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.

    Hadoop Distribution Directory

    You can modify the Hadoop distribution directory on the data nodes.

    When you modify the Hadoop distribution directory, you must copy the minimum set of Hive and Hadoop

    JARS, and the Snappy libraries required to process Informatica mappings in a Hadoop environment from

    your Hadoop install location. The actual Hive and Hadoop JARS can vary depending on the Hadoop

    distribution and version.

    The Hadoop RPM installs the Hadoop distribution directories in the following path:

/<Big Data Management installation directory>/Informatica/services/shared/hadoop.
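A minimal sketch of copying the libraries into a modified Hadoop distribution directory; the source paths are examples only and vary by Hadoop distribution and version:

    # Copy the minimum set of Hive and Hadoop JARs from the Hadoop install location.
    cp /usr/lib/hadoop/lib/*.jar /<modified Hadoop distribution directory>/lib/
    cp /usr/lib/hive/lib/*.jar /<modified Hadoop distribution directory>/lib/
    # Copy the Snappy native libraries.
    cp /usr/lib/hadoop/lib/native/libsnappy* /<modified Hadoop distribution directory>/lib/native/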


    Reference Data Requirements

    If you have a Data Quality product license, you can push a mapping that contains data quality

    transformations to a Hadoop cluster. Data quality transformations can use reference data to verify that data

    values are accurate and correctly formatted.

When you apply a pushdown operation to a mapping that contains data quality transformations, the operation can copy the reference data that the mapping uses. The pushdown operation copies reference table data,

    content set data, and identity population data to the Hadoop cluster. After the mapping runs, the cluster

    deletes the reference data that the pushdown operation copied with the mapping.

    Note: The pushdown operation does not copy address validation reference data. If you push a mapping that

    performs address validation, you must install the address validation reference data files on each DataNode

    that runs the mapping. The cluster does not delete the address validation reference data files after the

    address validation mapping runs.

Address validation mappings validate and enhance the accuracy of postal address records. You can buy

    address reference data files from Informatica on a subscription basis. You can download the current address

    reference data files from Informatica at any time during the subscription period.

Installing the Address Reference Data Files

To install the address reference data files on each DataNode in the cluster, create an automation script. A sketch of such a script follows these steps.

1. Browse to the address reference data files that you downloaded from Informatica. You download the files in a compressed format.
2. Extract the data files.
3. Copy the files to the NameNode machine or to another machine that can write to the DataNodes.
4. Create an automation script to copy the files to each DataNode.
   • If you copied the files to the NameNode, use the slaves file for the Hadoop cluster to identify the DataNodes. If you copied the files to another machine, use the Hadoop_Nodes.txt file to identify the DataNodes. Find the Hadoop_Nodes.txt file in the Big Data Management installation package.
   • The default directory for the address reference data files in the Hadoop environment is /reference_data. If you install the files to a non-default directory, create the following custom property on the Data Integration Service to identify the directory: AV_HADOOP_DATA_LOCATION. Create the custom property on the Data Integration Service that performs the pushdown operation in the native environment.
5. Run the automation script. The script copies the address reference data files to the DataNodes.
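The following is a minimal sketch of such a script. It assumes that the extracted files are staged in /tmp/address_reference_data on the NameNode, that the slaves file is at /etc/hadoop/conf/slaves (both paths are assumptions), and that passwordless SSH is set up to each DataNode:

    #!/bin/bash
    # Copy extracted address reference data files to every DataNode listed in the slaves file.
    SLAVES_FILE=/etc/hadoop/conf/slaves
    SOURCE_DIR=/tmp/address_reference_data
    TARGET_DIR=/reference_data
    while read -r node; do
        ssh "$node" "mkdir -p $TARGET_DIR"
        scp "$SOURCE_DIR"/* "$node:$TARGET_DIR/"
    done < "$SLAVES_FILE"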

Hive Variables for Mappings in a Hadoop Environment

To run mappings in a Hadoop environment, configure Hive environment variables. You can configure Hive environment variables in the following file: /<Big Data Management installation directory>/Informatica/services/shared/hadoop/<Hadoop distribution name>/conf/hive-site.xml.


Configure the following Hive environment variables (a sample hive-site.xml fragment follows the list):

• hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict. Configure these variables if you want to use Hive dynamic partitioned tables.
• hive.optimize.ppd=false. Disable predicate pushdown optimization to get accurate results for mappings with Hive version 0.9.0. You cannot use predicate pushdown optimization for a Hive query that uses multiple insert statements. The default Hadoop RPM installation sets hive.optimize.ppd to false.
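The fragment below is a sketch of how these variables appear in hive-site.xml, using the standard Hadoop configuration property format:

    <property>
      <name>hive.exec.dynamic.partition</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.exec.dynamic.partition.mode</name>
      <value>nonstrict</value>
    </property>
    <property>
      <name>hive.optimize.ppd</name>
      <value>false</value>
    </property>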

Update Hadoop Cluster Configuration Parameters

Hadoop cluster configuration parameters that set the Java library path in the mapred-site.xml file can override the paths set in hadoopEnv.properties. Update the mapred-site.xml cluster configuration file on all the cluster nodes to remove Java options that set the Java library path.

The following cluster configuration parameters in mapred-site.xml can override the Java library path set in hadoopEnv.properties:

• mapreduce.admin.map.child.java.opts
• mapreduce.admin.reduce.child.java.opts

If the Data Integration Service cannot access the native libraries set in hadoopEnv.properties, mappings can fail to run in a Hadoop environment.

After you install, perform the following steps:

• Update the cluster configuration file mapred-site.xml to remove the Java option -Djava.library.path from the property configuration.
• Edit hadoopEnv.properties to include the user Hadoop libraries in the Java library path.

Example to Update mapred-site.xml on Cluster Nodes

If mapred-site.xml sets the following configuration for the mapreduce.admin.map.child.java.opts parameter:

    <property>
      <name>mapreduce.admin.map.child.java.opts</name>
      <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/:/mylib/ -Djava.net.preferIPv4Stack=true</value>
      <final>true</final>
    </property>

The path to Hadoop libraries in mapreduce.admin.map.child.java.opts overrides the following path set in hadoopEnv.properties:

    infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64 -Djava.security.egd=file:/dev/./urandom

To run mappings in a Hadoop environment, complete the following steps:

• Remove the -Djava.library.path Java option from the mapreduce.admin.map.child.java.opts parameter.
• Change hadoopEnv.properties to include the Hadoop libraries in the paths /usr/lib/hadoop/lib/native and /mylib/ with the following syntax:

    infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native/:/mylib/ -Djava.security.egd=file:/dev/./urandom


Library Path and Path Variables for Mappings in a Hadoop Environment

To run mappings in a Hadoop environment, configure the library path and path environment variables in the hadoopEnv.properties file.

Configure the following library path and path environment variables:

• When you run mappings in a Hadoop environment, configure the ODBC library path before the Teradata library path. For example:

    infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/ODBC7.0/lib/:/opt/teradata/client/13.10/tbuild/lib64:/opt/teradata/client/13.10/odbc_64/lib:/databases/oracle11.2.0_64BIT/lib:/databases/db2v9.5_64BIT/lib64/:$HADOOP_NODE_INFA_HOME/DataTransformation/bin:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH

• When you use the MapR distribution on the Linux operating system, change the environment variable LD_LIBRARY_PATH to include the following path: <Informatica installation directory>/services/shared/hadoop/mapr_<version>/lib/native/Linux-amd64-64.
• When you use the MapR distribution on the Linux operating system, change the environment variable MAPR_HOME to include the following path: <Informatica installation directory>/services/shared/hadoop/mapr_<version>. A sketch of these entries follows this list.
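For reference, the corresponding hadoopEnv.properties entries might look like the following sketch. The infapdo.env.entry.mapr_home property name is an assumption based on the infapdo.env.entry convention shown above, and mapr_<version> stands for the installed distribution directory:

    infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=$HADOOP_NODE_INFA_HOME/services/shared/hadoop/mapr_<version>/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH
    infapdo.env.entry.mapr_home=MAPR_HOME=$HADOOP_NODE_INFA_HOME/services/shared/hadoop/mapr_<version>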

Configure the Blaze Engine Log Directories

The hadoopEnv.properties file lists the log directories that the Blaze engine uses on the node and on HDFS. You must grant the user account that starts the Blaze engine write permission on the log directories.

Grant the user account that starts the Blaze engine write permission for the directories specified in the following properties (an example follows the list):

• infagrid.node.local.root.log.dir
• infacal.hadoop.logs.directory
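For example, if infagrid.node.local.root.log.dir points to /tmp/infa/blaze/logs on the node and infacal.hadoop.logs.directory points to /blaze/logs on HDFS (both paths and the blazeuser account are hypothetical), the grants might look like this:

    # Local log directory on each node
    chown -R blazeuser /tmp/infa/blaze/logs
    chmod -R u+w /tmp/infa/blaze/logs
    # Log directory on HDFS
    hadoop fs -chown -R blazeuser /blaze/logs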

For more information about user accounts for the Blaze engine, see the Informatica Big Data Management Security Guide.

Hadoop Environment Properties File

To add environment variables or to extend existing ones, use the Hadoop environment properties file, hadoopEnv.properties. You can optionally add third-party environment variables or extend the existing PATH environment variable in hadoopEnv.properties.

1. Go to the following location: <Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>/infaConf
2. Find the file named hadoopEnv.properties.
3. Back up the file before you modify it.
4. Use a text editor to open the file and modify the properties.
5. Save the properties file with the name hadoopEnv.properties.
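For example, to extend the PATH variable with a third-party binary directory, you might add an entry like the following sketch. The /opt/thirdparty/bin directory is a hypothetical example, and the property name follows the infapdo.env.entry convention shown earlier:

    infapdo.env.entry.path=PATH=/opt/thirdparty/bin:$PATH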


Informatica Developer Files and Variables

Edit developerCore.ini to enable the Developer tool to communicate with the Hadoop cluster on a particular Hadoop distribution. After you edit the file, you must run run.bat to launch the Developer tool client again. If you use the MapR distribution, you must also set the MAPR_HOME environment variable to run MapR mappings in a Hadoop environment.

developerCore.ini is located in the following directory: <Informatica installation directory>\clients\DeveloperClient

Add the following property to developerCore.ini:

• -DINFA_HADOOP_DIST_DIR=hadoop\<Hadoop distribution name>

For example, the distribution name for a Hadoop cluster that runs MapR version 4.0.2 is mapr_4.0.2.

For a Hadoop cluster that runs MapR, you must perform the following additional tasks:

• Add the following properties to developerCore.ini:
  - -Djava.library.path=hadoop\mapr_<version>\lib\native\Win64;bin;..\DT\bin
  - -Dmapr.library.flatclass
• Edit run.bat to set the MAPR_HOME environment variable and the -clean setting. For example, include the following lines:

    MAPR_HOME=<Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_<version>
    developerCore.exe -clean

• Copy mapr-cluster.conf to the following directory on the machine where the Developer tool runs: <Informatica installation directory>\clients\DeveloperClient\hadoop\mapr_<version>\conf. You can find mapr-cluster.conf in the following directory on any node in the Hadoop cluster: <MapR installation directory>/conf

Open the Required Ports for the Blaze Engine

You must open a range of ports for the Blaze engine to use to communicate with the Informatica domain.

Note: Skip this task if the Blaze engine does not support the distribution that the Hadoop cluster runs.

If the Hadoop cluster is behind a firewall, work with your network administrator to open the range of ports that the Blaze engine uses. When you create the Hadoop connection, specify the port range that the Blaze engine can use with the minimum port and maximum port fields.
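For example, on cluster nodes that run firewalld, opening a hypothetical range of 12300 through 12600 (use the same range that you enter in the minimum port and maximum port fields) might look like this:

    sudo firewall-cmd --permanent --add-port=12300-12600/tcp
    sudo firewall-cmd --reload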

Enable Support for Lookup Transformations with Teradata Data Objects

To use Lookup transformations with a Teradata data object in Hadoop pushdown mode, you must copy the Teradata JDBC drivers to the Informatica installation directory.

You can download the Teradata JDBC drivers from Teradata. For more information about the drivers, see the following Teradata website: http://downloads.teradata.com/download/connectivity/jdbc-driver.

The software available for download at the referenced links belongs to a third party or third parties, not Informatica LLC. The download links are subject to the possibility of errors, omissions or change. Informatica assumes no responsibility for such links and/or such software, disclaims all warranties, either express or implied, including but not limited to, implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and disclaims all liability relating thereto.

Copy the tdgssconfig.jar and terajdbc4.jar files from the Teradata JDBC drivers to the following directory on the machine where the Data Integration Service runs and on every node in the Hadoop cluster: <Informatica installation directory>/externaljdbcjars

Additionally, you must copy the tdgssconfig.jar and terajdbc4.jar files to the following directory on the machine where the Developer tool runs: <Informatica installation directory>\clients\externaljdbcjars.
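A sketch of the copy, assuming the drivers were downloaded to /tmp/teradata and the Informatica installation directory is /opt/Informatica (both paths are assumptions):

    # On the Data Integration Service machine
    cp /tmp/teradata/tdgssconfig.jar /tmp/teradata/terajdbc4.jar /opt/Informatica/externaljdbcjars/
    # Repeat for every node in the Hadoop cluster, for example with scp
    scp /tmp/teradata/tdgssconfig.jar /tmp/teradata/terajdbc4.jar node1:/opt/Informatica/externaljdbcjars/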

Informatica Big Data Management Uninstallation

The Big Data Management uninstallation deletes the Big Data Management binary files from all of the DataNodes within the Hadoop cluster. Uninstall Big Data Management with a shell command.

Uninstalling Big Data Management

To uninstall Big Data Management in a single node or cluster environment:

1. Verify that the Big Data Management administrator can run sudo commands.
2. If you are uninstalling Big Data Management in a cluster environment, set up a password-less Secure Shell (SSH) connection between the machine where you want to run the Big Data Management installation and all of the nodes from which Big Data Management will be uninstalled.
3. If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file, verify that the HadoopDataNodes file contains the IP addresses or machine host names of each of the nodes in the Hadoop cluster from which you want to uninstall Big Data Management. The HadoopDataNodes file is located on the node from which you want to launch the Big Data Management installation. Add one IP address or machine host name of a node in the Hadoop cluster on each line in the file. An example file appears after these steps.
4. Log in to the machine. The machine you log in to depends on the Big Data Management environment and uninstallation method:
   • If you are uninstalling Big Data Management in a single node environment, log in to the machine on which Big Data Management is installed.
   • If you are uninstalling Big Data Management in a cluster environment using the HADOOP_HOME environment variable, log in to the primary NameNode.
   • If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file, log in to any node.
5. Run the following command to start the Big Data Management uninstallation in console mode:

    bash InformaticaHadoopInstall.sh

6. Press y to accept the Big Data Management terms of agreement.
7. Press Enter.
8. Select 3 to uninstall Big Data Management.
9. Press Enter.
10. Select the uninstallation option, depending on the Big Data Management environment:
    • Select 1 to uninstall Big Data Management in a single node environment.
    • Select 2 to uninstall Big Data Management in a cluster environment.
11. Press Enter.
12. If you are uninstalling Big Data Management in a cluster environment, select the uninstallation option, depending on the uninstallation method:
    • Select 1 to uninstall Big Data Management from the primary NameNode.
    • Select 2 to uninstall Big Data Management using the HadoopDataNodes file.
13. Press Enter.
14. If you are uninstalling Big Data Management in a cluster environment from the primary NameNode, type the absolute path for the Hadoop installation directory. Start the path with a slash.

The uninstaller deletes all of the Big Data Management binary files from the /<Big Data Management installation directory>/Informatica directory. In a cluster environment, the uninstaller deletes the binary files from all of the nodes within the Hadoop cluster.
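For reference, a HadoopDataNodes file for a three-node cluster might contain nothing more than one entry per line; the host names below are hypothetical:

    hadoop-node1.example.com
    hadoop-node2.example.com
    192.168.1.103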


CHAPTER 2

Mappings on Hadoop Distributions

This chapter includes the following topics:

• Mappings on Hadoop Distributions Overview
• Big Data Management Configuration Utility
• Mappings on Cloudera CDH
• Mappings on Hortonworks HDP
• Mappings on IBM BigInsights
• Mappings on MapR
• Mappings on Pivotal HD

Mappings on Hadoop Distributions Overview

After you install Big Data Management, you must enable mappings to run on a Hadoop cluster on a Hadoop distribution.

After you enable Informatica mappings to run on a Hadoop cluster, you must configure the Big Data Management client files to communicate with a Hadoop cluster on a particular Hadoop distribution. You can use the Big Data Management Configuration Utility to automatically configure some of the properties. After you run the utility, you must complete the configuration for your Hadoop distribution. Alternatively, you can manually configure Big Data Management without the utility.

The following table describes the Hadoop distributions, MapReduce versions, and schedulers that you can use with Big Data Management:

Hadoop Distribution    MapReduce Version    Scheduler
Cloudera CDH 5.4       MRv2                 Fair Scheduler
Hortonworks HDP 2.2    MRv2                 Capacity Scheduler
IBM BigInsights 3.0    MRv1                 Capacity Scheduler and Fair Scheduler
MapR 4.0.2             MRv1 or MRv2         Capacity Scheduler and Fair Scheduler
Pivotal HD 2.1         MRv2                 Capacity Scheduler and Fair Scheduler

Big Data Management Configuration Utility

You can use the Big Data Management Configuration Utility to automate part of the configuration for Big Data Management. Alternatively, you can manually configure Big Data Management.

To automate part of the configuration process for the Hadoop cluster properties on the machine where the Data Integration Service runs, perform the following steps:

1. On the machine where the Data Integration Service runs, open the command line.
2. Go to the following directory: <Informatica installation directory>/tools/BDEUtil.
3. Run BDEConfig.sh.
4. Press Enter.
5. Choose the Hadoop distribution:
   1 - Cloudera CDH
   2 - Hortonworks HDP
   3 - MapR
   4 - Pivotal HD
   5 - IBM BigInsights

6. Choose the Hadoop distribution version you want to use to configure Big Data Management.
7. Choose how to access files on the Hadoop cluster.
   If you choose Cloudera CDH, the following options appear:
   1 - Cloudera Manager. Enter this option to use the Cloudera Manager API to access files on the Hadoop cluster.
   2 - Secure Shell (SSH). Enter this option to use SSH to access files on the Hadoop cluster. This option requires SSH connections to the machines that host the NameNode, JobTracker, and Hive client. If you select this option, Informatica recommends that you use an SSH connection without a password or have sshpass or Expect installed.
   3 - Shared directory. Enter this option to use a shared directory to access files on the Hadoop cluster. You must have read permission for the shared directory.
   Note: Informatica recommends the Cloudera Manager or SSH option.
   If you choose a distribution other than Cloudera CDH, the following options appear:
   1 - Secure Shell (SSH). Enter this option to use SSH to access files on the Hadoop cluster. This option requires SSH connections to the machines that host the NameNode, JobTracker, and Hive client. If you select this option, Informatica recommends that you use an SSH connection without a password or have sshpass or Expect installed.
   2 - Shared directory. Enter this option to use a shared directory to access files on the Hadoop cluster. You must have read permission for the shared directory.
   Note: Informatica recommends the SSH option.

8. If you chose Cloudera CDH, choose the Cloudera CDH cluster you want to use to configure Big Data Management. If you did not choose Cloudera CDH, continue to step 9.
9. Based on the option you selected, see the corresponding topic to continue with the configuration process:
   • "Use Cloudera Manager"
   • "Use SSH"
   • "Use a Shared Directory"

Use Cloudera Manager

If you choose Cloudera Manager, perform the following steps to configure Big Data Management:

1. Enter the Cloudera Manager host.
2. Enter the Cloudera user ID.
3. Enter the password for the user ID.
4. Enter the port for Cloudera Manager. The Big Data Management Configuration Utility retrieves the required information from the Hadoop cluster.
5. Complete the manual configuration steps.

Use SSH

If you choose SSH, you must provide host names and Hadoop configuration file locations.

Note: Informatica recommends that you use an SSH connection without a password or have sshpass or Expect installed. If you do not use one of these methods, you must enter the password each time the utility downloads a file from the Hadoop cluster.


Verify the following host names: NameNode, JobTracker, and Hive client. Additionally, verify the locations for the following files on the Hadoop cluster:

• hdfs-site.xml
• core-site.xml
• mapred-site.xml
• yarn-site.xml
• hive-site.xml

Perform the following steps to configure Big Data Management:

1. Enter the NameNode host name.
2. Enter the SSH user ID.
3. Enter the password for the SSH user ID. If you use an SSH connection without a password, leave this field blank and press Enter.
4. Enter the location for the hdfs-site.xml file on the Hadoop cluster.
5. Enter the location for the core-site.xml file on the Hadoop cluster. The Big Data Management Configuration Utility connects to the NameNode and downloads the following files: hdfs-site.xml and core-site.xml.
6. Enter the JobTracker host name.
7. Enter the SSH user ID.
8. Enter the password for the SSH user ID. If you use an SSH connection without a password, leave this field blank and press Enter.
9. Enter the directory for the mapred-site.xml file on the Hadoop cluster.
10. Enter the directory for the yarn-site.xml file on the Hadoop cluster. The utility connects to the JobTracker and downloads the following files: mapred-site.xml and yarn-site.xml.
11. Enter the Hive client host name.
12. Enter the SSH user ID.
13. Enter the password for the SSH user ID. If you use an SSH connection without a password, leave this field blank and press Enter.
14. Enter the directory for the hive-site.xml file on the Hadoop cluster. The utility connects to the Hive client and downloads the following file: hive-site.xml.
15. Complete the manual configuration steps.

Use a Shared Directory

If you choose shared directory, perform the following steps to configure Big Data Management:

1. Enter the location of the shared directory.
   Note: You must have read permission for the directory, and the directory should contain the following files:
   • core-site.xml
   • hdfs-site.xml
   • hive-site.xml
   • mapred-site.xml
   • yarn-site.xml
2. Complete the manual configuration steps.

Mappings on Cloudera CDH

You can enable Informatica mappings to run on a Hadoop cluster on Cloudera CDH. Informatica supports Cloudera CDH clusters that are deployed on-premise, on Amazon EC2, or on Microsoft Azure.

To enable Informatica mappings to run on a Cloudera CDH cluster, complete the following steps:

1. Configure Hadoop cluster properties on the machine on which the Data Integration Service runs.
2. Configure virtual memory limits.
3. Create a staging directory.
4. Add hbase-protocol.jar to the Hadoop classpath.

Configure Hadoop Cluster Properties on the Data Integration Service Machine

Configure Hadoop cluster properties in the hive-site.xml and yarn-site.xml files that the Data Integration Service uses when it runs mappings on a Cloudera CDH cluster.

Configure hive-site.xml for the Data Integration Service

hive-site.xml is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/cloudera_cdh/conf

In hive-site.xml, configure the following property:

hive.optimize.constant.propagation
    Whether to enable the constant propagation optimizer. Set this value to false.

The following sample code describes the property you can set in hive-site.xml:

    <property>
      <name>hive.optimize.constant.propagation</name>
      <value>false</value>
    </property>

Configure yarn-site.xml for the Data Integration Service

The yarn-site.xml file is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/cloudera_cdh/conf

Configure the following Hadoop cluster property:

yarn.resourcemanager.webapp.address
    Web application address for the Resource Manager.


    Use the value in the following file: /etc/hadoop/conf/yarn-site.xml.

The Big Data Management Configuration Utility automatically configures the following properties in the yarn-site.xml file. You can also manually configure the properties.

mapreduce.jobhistory.address
    Location of the MapReduce JobHistory Server. Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

mapreduce.jobhistory.webapp.address
    Web address of the MapReduce JobHistory Server. Use the value in the following file: /etc/hadoop/conf/mapred-site.xml

yarn.resourcemanager.scheduler.address
    Scheduler interface address. Use the value in the following file: /etc/hadoop/conf/yarn-site.xml

You can set the following properties in yarn-site.xml:

    <property>
      <name>mapreduce.jobhistory.address</name>
      <value>hostname:port</value>
      <description>MapReduce JobHistory Server IPC host:port</description>
    </property>
    <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>hostname:port</value>
      <description>MapReduce JobHistory Server Web UI host:port</description>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>hostname:port</value>
      <description>The address of the scheduler interface</description>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>hostname:port</value>
      <description>The address for the Resource Manager web application</description>
    </property>

Create a Staging Directory on HDFS

If the Cloudera cluster uses HiveServer 2, you must grant the anonymous user the Execute permission on the staging directory, or you must create another staging directory on HDFS.

By default, a staging directory already exists on HDFS. You must grant the anonymous user the Execute permission on the staging directory. If you cannot grant the anonymous user the Execute permission on this directory, you must enter a valid user name for the user in the Hive connection. If you use the default staging directory on HDFS, you do not have to configure mapred-site.xml or hive-site.xml.

If you want to create another staging directory to store MapReduce jobs, you must create a directory on HDFS. After you create the staging directory, you must add it to mapred-site.xml and hive-site.xml.

To create another staging directory on HDFS, run the following commands from the command line of the machine that runs the Hadoop cluster:

    hadoop fs -mkdir /staging
    hadoop fs -chmod -R 0777 /staging

Add the staging directory to mapred-site.xml.


mapred-site.xml is located in the following directory on the Hadoop cluster: /etc/hadoop/conf/mapred-site.xml

For example, add the following entry to mapred-site.xml:

    <property>
      <name>yarn.app.mapreduce.am.staging-dir</name>
      <value>/staging</value>
    </property>

Add the staging directory to hive-site.xml on the machine where the Data Integration Service runs. hive-site.xml is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/cloudera_cdh/conf.

In hive-site.xml, add the yarn.app.mapreduce.am.staging-dir property. Use the value that you specified in mapred-site.xml. For example, add the following entry to hive-site.xml:

    <property>
      <name>yarn.app.mapreduce.am.staging-dir</name>
      <value>/staging</value>
    </property>

Configure Virtual Memory Limits

Configure the virtual memory limits in yarn-site.xml for every node in the Hadoop cluster. After you configure virtual memory limits, you must restart the Hadoop cluster.

yarn-site.xml is located in the following directory on every node in the Hadoop cluster: /etc/hadoop/conf/yarn-site.xml

In yarn-site.xml, configure the following property:

yarn.nodemanager.vmem-check-enabled
    Determines virtual memory limits. Set this value to false.

The following example describes the property you can configure in yarn-site.xml:

    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
      <description>Enforces virtual memory limits for containers.</description>
    </property>

Add hbase-protocol.jar to the Hadoop Classpath

Add hbase-protocol.jar to the Hadoop classpath on every node on the Hadoop cluster. Then, restart the Node Manager for each node in the Hadoop cluster.

hbase-protocol.jar is located in the HBase installation directory on the Hadoop cluster. For more information, refer to the following link: https://issues.apache.org/jira/browse/HBASE-10304
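One common way to add the JAR to the classpath is to extend HADOOP_CLASSPATH in hadoop-env.sh on each node; the JAR path below is an assumption, so use the actual location in your HBase installation directory:

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hbase/hbase-protocol.jar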

Configure the Blaze Engine

To use the Blaze engine, you must configure the Hadoop cluster. To configure a Cloudera CDH cluster for the Blaze engine, complete the following tasks:

• Create symbolic links to the Jackson JAR files.
• Configure yarn-site.xml on every node in the Hadoop cluster.
• Start the Application Timeline Server.
• Enable the Blaze Engine console.

Create Symbolic Links to Jackson JAR Files for the Blaze Engine

The Hadoop Application Timeline Server requires the latest version of the Jackson JAR files to run. You must create symbolic links to the Jackson JAR files that the Hadoop Application Timeline Server requires.

Perform the following steps on the node where you want to start the Application Timeline Server:

1. Navigate to the directory that contains the following Jackson JAR files on the Hadoop cluster:
   • jackson-xc-1.8.8.jar
   • jackson-jaxrs-1.8.8.jar
   • jackson-core-asl-1.8.8.jar
   • jackson-mapper-asl-1.8.8.jar
   If you use Cloudera Manager to configure the Hadoop cluster, you can find the files in the following directory: /opt/cloudera/parcels/CDH/lib/hadoop/libexec/../../hadoop/lib/. If you configure the Hadoop cluster manually, you can find the files in the following directory: /usr/lib/hadoop/lib/.
2. Remove the link to the Jackson JAR files. Run the following command for each Jackson JAR file:

    rm <Jackson JAR file name>

   For example, run the following command to remove the link to jackson-xc-1.8.8.jar on a Cloudera CDH cluster that