Installing Hadoop / Spark from Scratch
TRANSCRIPT
© 2016 IBM Corporation
Big Data Developer meetup
Installing Apache Hadoop and Spark from scratch
Ljubljana, June 2016
Agenda
Why do you need Hadoop
What do you need before you install Apache Hadoop
Hadoop distributions
Hadoop components you need to know about
About Spark
Installation process walk-through
Adding cluster nodes
Ways to automate
Zero-install options
Why do you need Apache Hadoop
License-free
Scalable
General purpose MPP engine
Distributed storage
Packed with tools
Backend for your Big Data project
What do you need before you install Hadoop and Spark
A server (or servers)
An installed OS (in the case of IBM: RHEL 6.5–7 or SUSE 11 SP3)
A Hadoop distribution (more later)
Or avoid all that trouble by using a VM / Docker if you are just playing (more later)
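For the "just playing" route, a single-node sandbox container is the fastest start. A hypothetical sketch, where the image name is a placeholder rather than a specific product:

```shell
# Placeholder image name -- substitute the quick-start image of whichever
# distribution you pick (the slides don't name one).
docker pull <vendor>/hadoop-spark-sandbox
# Expose the Ambari UI (8080) and HDFS NameNode UI (50070) to the host:
docker run -d --name sandbox -p 8080:8080 -p 50070:50070 <vendor>/hadoop-spark-sandbox
```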
Apache Hadoop Distributions
Hortonworks HDP
Cloudera CDH
IBM IOP (today's focus)
A number of others
Distributions are very similar yet different, much as with Linux distributions
Some are part of ODP, some are not
Hadoop components you need to know about
YARN – resource manager
HDFS
MapReduce
Ambari
ZooKeeper
Hive
Pig
Sqoop
Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
– Fast
• Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes, even when no jobs are running
• Faster than MapReduce
– General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex analytics
– Flexible and easier to use than MapReduce
• Spark is written in Scala, an object-oriented, functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or in the cloud
[Chart: Logistic regression in Hadoop and Spark, from http://spark.apache.org]
Installation process walk-through
Review the requirements
Review the installation docs
Get the IOP software: http://www-01.ibm.com/support/docview.wss?uid=swg24040517
Prereqs
Install the OS
Set up the yum repository
Install prerequisites
• yum install nc
Full list of preparation steps:
Make sure your hostname is in /etc/hosts
Tweak some settings (disable Transparent Huge Pages)
• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Generate an ssh key and set up passwordless ssh
• ssh-keygen
• chmod 700 ~/.ssh
• Check with ssh localhost
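The passwordless-ssh steps above can be sketched as one script (a minimal single-node version, run as the installing user; made idempotent so re-running it is safe):

```shell
# Create the key pair only if one doesn't already exist, with no passphrase:
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 700 ~/.ssh
# Authorize the key for login on this host:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify with: ssh localhost   (should log in without a password prompt)
```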
Prereqs (cont.)
Disable IPv6
Configure ulimit
• /etc/security/limits.conf
Disable SELinux
Set up NTP on all servers
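A sketch of these remaining tweaks as a system-config fragment, assuming RHEL 6-style paths and root access (the limit values are illustrative, not from the slides):

```shell
# Disable SELinux now and on subsequent boots:
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

# Raise file-descriptor and process limits for all users:
cat >> /etc/security/limits.conf <<'EOF'
*   soft   nofile   65536
*   hard   nofile   65536
*   soft   nproc    65536
*   hard   nproc    65536
EOF

# Keep clocks in sync across the cluster:
yum -y install ntp && chkconfig ntpd on && service ntpd start
```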
First step – install Ambari
Install the repository
• yum install iop-4.1.0.0-1.<version>.<platform>.rpm
Install Ambari
• yum install ambari-server
Set up the Ambari server
• sudo ambari-server setup
Start the Ambari server
• ambari-server start
Go to the Ambari interface at <your-ip>:8080
• Default user/pass = admin/admin
Launch the installation wizard
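The whole Ambari bootstrap, as one sequence (run as root; the repo rpm filename keeps the slide's placeholders, so fill in your version and platform):

```shell
yum -y install iop-4.1.0.0-1.<version>.<platform>.rpm   # registers the IOP yum repo
yum -y install ambari-server
ambari-server setup     # interactive; accept the defaults for a test cluster
ambari-server start
# Then browse to http://<your-ip>:8080 and log in as admin/admin.
```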
Ambari installation
Next-next-next
Provide cluster name
Provide private ssh key
Choose services
Assign masters
Assign slaves and clients
Customize services
Here you would have to set up proper DB server connections in your prod environment
Review and deploy
Validate
Adding a new cluster node
Create a new server with the same prereqs
Make sure that passwordless ssh works from the Ambari server to the node
• ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01
And done
Extra steps
Install Anaconda / Jupyter for data analysis
• PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
Ways to automate – Ansible
A simple automation tool
Infrastructure as code
Agent-less
Easy to learn
Check for examples online: "ansible hadoop playbook"
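As a taste of the agent-less model: Ansible only needs the same passwordless ssh you already set up, plus an inventory of hosts. A hypothetical layout (the group names and site.yml are placeholders, not from the slides):

```shell
# Write an inventory grouping the cluster hosts:
cat > inventory <<'EOF'
[hadoop_masters]
master01

[hadoop_workers]
node01
node02
EOF
# Then, with Ansible installed, verify connectivity and apply a playbook:
#   ansible -i inventory all -m ping
#   ansible-playbook -i inventory site.yml
```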
Zero-installation options
• BigInsights QSE (Quick Start Edition)
• BigInsights on Cloud (paid)
WRAP-UP