Docker Session 7.0 BDC
TRANSCRIPT
Part 7: Big Data Components | Dawood Sayyed/GLDS | May 24, 2016 | Internal
© 2015 SAP SE or an SAP affiliate company. All rights reserved. Internal
Why would any organization want to store data?
• The present and the future belong to those who hold onto their data and work with it to improve their current operations and innovate to generate new products and opportunities.
• Data, and the creative use of it, is at the heart of organizations such as Google, Facebook, Netflix, Amazon, and Yahoo!.
• The same applies to any other organization that uses a database. (Why? For predictive analytics and reporting.)
• They have proven that data, along with powerful analysis, helps in building fantastic and powerful products.
What is Big Data?
• Organizations now want to use this data to gain insight, to help them understand existing problems, seize new opportunities, and be more profitable.
• The study and analysis of these vast volumes of data has given birth to the term big data.
Distributed Computing / Clusters
Several companies have been working to solve this problem and have come out with a few commercial offerings that leverage the power of distributed computing.
In this solution, multiple computers work together (a cluster) to store and process large volumes of data in parallel, thus making the analysis of large volumes of data possible.
Google, the Internet search engine giant, ran into issues when their data, acquired by crawling the Web, started growing to such large volumes that it was becoming impossible to process.
They had to find a way to solve this problem, and this led to the creation of the Google File System (GFS) and MapReduce.
What is Apache Hadoop?
• Apache Hadoop is a widely used open source distributed computing framework that is employed to efficiently process large volumes of data using large clusters of cheap or commodity computers.
• Apache Hadoop is a framework written in Java that:
• Is used for distributed storage and processing of large volumes of data, running on top of a cluster that can scale from a single computer to thousands of computers
• Stores and processes data on every worker node (the nodes on the cluster that are responsible for the storage and processing of data) and handles hardware failures efficiently, providing high availability
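As a sketch of how data lands on the cluster's distributed storage, the usual HDFS shell commands look like the following. This is a hedged example: the paths and file names are placeholders, and it assumes a running Hadoop installation with `hdfs` on the PATH.

```shell
# Create a directory in HDFS and copy a local file into it.
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put localdata.txt /user/demo/input/
# List the directory to verify the file was stored on the cluster.
hdfs dfs -ls /user/demo/input
```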
MapReduce & YARN (Yet Another Resource Negotiator)
Hadoop uses the MapReduce programming model to process data.
Most Apache Hadoop clusters in production run Apache Hadoop 1.x (MRv1, MapReduce Version 1).
The new version of Apache Hadoop, 2.x (MRv2, MapReduce Version 2), also referred to as Yet Another Resource Negotiator (YARN), is being actively adopted by many organizations.
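To make the MapReduce programming model concrete, the classic word-count example can be mimicked with a plain shell pipeline. This is only an analogy, not Hadoop itself: `tr` plays the map step (emit one word per line), `sort` plays the shuffle (bring identical keys together), and `uniq -c` plays the reduce (aggregate a count per key).

```shell
# Toy "MapReduce" word count as a shell pipeline.
printf 'big data big cluster data big\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

In a real Hadoop job the map and reduce steps run in parallel across the worker nodes, but the data flow is the same.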
YARN vs MapReduce
• YARN is a general-purpose, distributed, application management framework for processing data in Hadoop clusters.
• YARN was built to solve the following two important problems:
• Support for large clusters (4000 nodes or more)
• Ability to run applications other than MapReduce, for example Apache Giraph, so that they can make use of data already stored in HDFS
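A hedged sketch of submitting a job to YARN on a Hadoop 2.x cluster; the example jar ships with Hadoop distributions, though its exact path varies by installation:

```shell
# Submit the bundled Pi-estimation MapReduce example to YARN:
# 10 map tasks, 100 samples each.
yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
# List applications currently running on the cluster.
yarn application -list
```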
Scenario from Customer
• The customer needs a HANA DB running on 500 GB, and Apache components running on the remaining 500 GB.
• The Apache components will sync with the HANA DB to produce the customer's desired results.
• The database team will handle further requests from the customer.
• The Apache components needed are:
Spark
Zookeeper
Kafka
Solr
Apache Spark: Why is Spark a better choice for HANA?
• Apache Spark is a data processing engine for large data sets.
• Apache Spark is much faster (up to 100 times faster in memory) than Apache Hadoop MapReduce.
• Cluster mode: Spark applications run as independent processes coordinated by the SparkContext object in the driver program, which is the main program.
• SparkContext: connects to several types of cluster managers to allocate resources to Spark applications.
• Supported cluster managers include the Standalone cluster manager, Mesos, and YARN.
• Apache Spark is designed to access data from varied data sources including HDFS, Apache HBase, and NoSQL databases such as Apache Cassandra and MongoDB.
• We will run an Apache Spark Master in cluster mode using the YARN cluster manager in a Docker container.
Setting the Environment
Setting the Environment
Running the Docker Container for CDH
Running Apache Spark Job in yarn-cluster Mode
Running Apache Spark Job in yarn-client Mode
Running the Apache Spark Shell
Apache Spark
sudo docker pull svds/cdh
sudo docker run -p 8088 -d --name cdh svds/cdh
docker exec -it cdh bash
spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
Running the Apache Spark Shell (this is what developers do)
spark-shell --master yarn-client
object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}
HelloWorld.main(null)
Summary: Apache Spark
• Ran Apache Spark applications on a YARN cluster in a Docker container using the spark-submit command.
• Ran the example application in yarn-cluster and yarn-client modes.
• Ran a HelloWorld Scala script in a Spark shell.
Apache Solr / Environment Set-up
• Apache Solr is an open source search platform built on Apache Lucene, a text search engine library.
• Apache Solr is scalable and reliable and provides indexing and querying service.
• Cloudera Search is based on Apache Solr.
• Setting the Environment
• Starting Docker Container for Apache Solr Server
• Starting Interactive Shell
• Logging in to the Solr Admin Console
• Creating a Core Admin Index
Apache Solr / Environment Set-up
Creating a Core Admin Index
Loading Sample Data
Querying Apache Solr in Solr Admin Console
Querying Apache Solr using REST API Client
Deleting Data
Listing Logs
Stopping Apache Solr Server
Starting Docker Container for Apache Solr Server
docker pull solr
docker run -p 8983:8983 -d --name solr_on_docker solr
docker logs -f <containerID>
docker exec -it --user=solr solr_on_docker bash
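The remaining agenda steps (creating a core, loading data, querying, deleting) might look like the following sketch. The core name `gettingstarted` and the sample document are made up for illustration, and the commands assume the layout of the official `solr` image:

```shell
# Create a core from inside the container.
docker exec -it --user=solr solr_on_docker bin/solr create_core -c gettingstarted
# Index a sample document over the REST API.
curl -s 'http://localhost:8983/solr/gettingstarted/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","title":"hello solr"}]'
# Query it back.
curl -s 'http://localhost:8983/solr/gettingstarted/select?q=title:hello'
# Delete all documents from the core.
curl -s 'http://localhost:8983/solr/gettingstarted/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '{"delete":{"query":"*:*"}}'
```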
Apache Kafka
• Apache Kafka is a messaging system based on the publish-subscribe model.
• A Kafka cluster consists of one or more servers called brokers. Kafka keeps messages categorized by “topics”.
• Producers produce messages and publish the messages to topics.
• Consumers subscribe to specific topics and consume feeds of messages published to those topics.
• The messages published to a topic do not have to be consumed as they are produced; they are stored in the topic for a configurable duration.
• A consumer may choose to consume the messages in a topic from the beginning.
• Apache ZooKeeper server is used to coordinate a Kafka cluster.
Kafka + Zookeeper
Setting up the Environment for Kafka
Starting Docker Containers for Apache Kafka
Finding IP Addresses
Listing the Kafka Logs
Creating a Kafka Topic
Starting the Kafka Producer
Starting the Kafka Consumer
Producing and Consuming Messages
Stopping and Removing the Docker Containers
Starting Docker Containers for Apache Kafka
docker pull dockerkafka/zookeeper
docker pull dockerkafka/kafka
docker run -d --name zookeeper -p 2181:2181 dockerkafka/zookeeper
docker run -d --name kafka -p 9092:9092 --link zookeeper:zookeeper dockerkafka/kafka
Finding IP Addresses
export ZK_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper)
export KAFKA_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' kafka)
echo $ZK_IP
echo $KAFKA_IP
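With these IP addresses in hand, the remaining agenda steps (creating a topic, producing, consuming) might look like this sketch. It assumes the image puts the standard Kafka shell scripts on the PATH, which varies between images, and uses the ZooKeeper-based consumer of Kafka releases from this era:

```shell
# Create a topic with a single partition and replica.
docker exec kafka kafka-topics.sh --create --zookeeper $ZK_IP:2181 \
  --replication-factor 1 --partitions 1 --topic test-topic
# Start a console producer (type messages, one per line).
docker exec -it kafka kafka-console-producer.sh \
  --broker-list $KAFKA_IP:9092 --topic test-topic
# In another terminal, consume the topic from the beginning.
docker exec -it kafka kafka-console-consumer.sh \
  --zookeeper $ZK_IP:2181 --topic test-topic --from-beginning
```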
Running Docker on SLES 12 SP1
Docker is supported only on SLES version 12.
For the Monsoon environment we need central server email registration and a registration code.