Installing Hadoop / Spark from Scratch
TRANSCRIPT
© 2016 IBM Corporation
Big Data Developer meetup
Installing Apache Hadoop and Spark from scratch
Ljubljana, June 2016
Agenda
Why do you need Hadoop
What do you need before you install Apache Hadoop
Hadoop distributions
Hadoop components you need to know about
About Spark
Installation process walk-through
Adding cluster nodes
Ways to automate
Zero-install options
Why do you need Apache Hadoop
License-free
Scalable
General purpose MPP engine
Distributed storage
Packed with tools
Backend for your Big Data project
What do you need before you install Hadoop and Spark
A server (or servers)
An installed OS (in the case of IBM: RHEL 6.5–7 or SUSE 11 SP3)
A Hadoop distribution (more later)
Or avoid all that trouble by using a VM / Docker if you are just playing (more later)
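For the "just playing" route, a single-node sandbox container is the fastest start. A hypothetical sketch, where the image name is a placeholder rather than a specific product:

```shell
# Placeholder image name -- substitute the quick-start image of whichever
# distribution you pick (the slides don't name one).
docker pull <vendor>/hadoop-spark-sandbox
# Expose the Ambari UI (8080) and HDFS NameNode UI (50070) to the host:
docker run -d --name sandbox -p 8080:8080 -p 50070:50070 <vendor>/hadoop-spark-sandbox
```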
Apache Hadoop Distributions
Hortonworks HDP
Cloudera CDH
IBM IOP (today's focus)
A number of others
Distributions are very similar yet different, much as with Linux distributions
Some are part of ODP, some are not
Hadoop components you need to know about
YARN – resource manager
HDFS
MapReduce
Ambari
ZooKeeper
Hive
Pig
Sqoop
Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
– Fast
• Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes, even when no jobs are running
• Faster than MapReduce
– General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex analytics
– Flexible and easier to use than MapReduce
• Spark is written in Scala, an object-oriented, functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or in the cloud
[Chart: Logistic regression in Hadoop and Spark, from http://spark.apache.org]
Installation process walk-through
Review the requirements
Review the installation docs
Get the IOP software: http://www-01.ibm.com/support/docview.wss?uid=swg24040517
Prereqs
Install the OS
Set up the yum repository
Install prerequisites
• yum install nc
Full list of preparation steps:
Make sure your hostname is in /etc/hosts
Tweak some settings (disable Transparent Huge Pages)
• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Generate an ssh key and set up passwordless ssh
• ssh-keygen
• chmod 700 ~/.ssh
• Check with ssh localhost
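The passwordless-ssh steps above can be sketched as one script (a minimal single-node version, run as the installing user; made idempotent so re-running it is safe):

```shell
# Create the key pair only if one doesn't already exist, with no passphrase:
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
chmod 700 ~/.ssh
# Authorize the key for login on this host:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# Verify with: ssh localhost   (should log in without a password prompt)
```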
Prereqs (cont.)
Disable IPv6
Configure ulimit
• /etc/security/limits.conf
Disable SELinux
Set up NTP on all servers
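A sketch of these remaining tweaks as a system-config fragment, assuming RHEL 6-style paths and root access (the limit values are illustrative, not from the slides):

```shell
# Disable SELinux now and on subsequent boots:
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config

# Raise file-descriptor and process limits for all users:
cat >> /etc/security/limits.conf <<'EOF'
*   soft   nofile   65536
*   hard   nofile   65536
*   soft   nproc    65536
*   hard   nproc    65536
EOF

# Keep clocks in sync across the cluster:
yum -y install ntp && chkconfig ntpd on && service ntpd start
```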
First step – install Ambari
Install the repository
• yum install iop-4.1.0.0-1.<version>.<platform>.rpm
Install Ambari
• yum install ambari-server
Set up the Ambari server
• sudo ambari-server setup
Start the Ambari server
• ambari-server start
Go to the Ambari interface at <your-ip>:8080
• Default user/pass = admin/admin
Launch the installation wizard
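The whole Ambari bootstrap, as one sequence (run as root; the repo rpm filename keeps the slide's placeholders, so fill in your version and platform):

```shell
yum -y install iop-4.1.0.0-1.<version>.<platform>.rpm   # registers the IOP yum repo
yum -y install ambari-server
ambari-server setup     # interactive; accept the defaults for a test cluster
ambari-server start
# Then browse to http://<your-ip>:8080 and log in as admin/admin.
```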
Ambari installation
Next-next-next
Provide cluster name
Provide private ssh key
Choose services
Assign masters
Assign slaves and clients
Customize services
Here you would have to set up proper DB server connections in your prod environment
Review and deploy
Validate
Adding a new cluster node
Create a new server with the same prereqs
Make sure that passwordless ssh works from the Ambari server to the node
• ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01
And done
Extra steps
Install Anaconda / Jupyter for data analysis
• PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
Ways to automate – Ansible
A simple automation tool
Infrastructure as code
Agent-less
Easy to learn
Check for examples online: "ansible hadoop playbook"
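As a taste of the agent-less model: Ansible only needs the same passwordless ssh you already set up, plus an inventory of hosts. A hypothetical layout (the group names and site.yml are placeholders, not from the slides):

```shell
# Write an inventory grouping the cluster hosts:
cat > inventory <<'EOF'
[hadoop_masters]
master01

[hadoop_workers]
node01
node02
EOF
# Then, with Ansible installed, verify connectivity and apply a playbook:
#   ansible -i inventory all -m ping
#   ansible-playbook -i inventory site.yml
```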
Zero-installation options
• BigInsights QSE (Quick Start Edition)
• BigInsights on Cloud (paid)
WRAP-UP