Docker Session 7.0 BDC


Page 1: Docker Session 7.0 BDC


Part 7: Big Data Components. Dawood Sayyed/GLDS, May 24, 2016. Internal

Page 2: Docker Session 7.0 BDC


Why would any organization want to store data?

• The present and the future belong to those who hold on to their data and work with it to improve their current operations and to innovate, generating new products and opportunities.

• Data, and the creative use of it, is at the heart of organizations such as Google, Facebook, Netflix, Amazon, and Yahoo!

• The same applies to any other organization that uses a database. (Why? For predictive analytics and reporting.)

• These companies have proven that data, along with powerful analysis, helps build fantastic and powerful products.

Page 3: Docker Session 7.0 BDC


What is Big Data?

• Organizations now want to use this data to gain insights that help them understand existing problems, seize new opportunities, and be more profitable.

• The study and analysis of these vast volumes of data has given birth to the term big data.

Page 4: Docker Session 7.0 BDC


Distributed Computing / Clusters

Several companies have been working to solve this problem and have come out with a few commercial offerings that leverage the power of distributed computing.

In this solution, multiple computers work together (a cluster) to store and process large volumes of data in parallel, thus making the analysis of large volumes of data possible.

Google, the Internet search engine giant, ran into issues when the data it acquired by crawling the Web grew to such large volumes that it was becoming nearly impossible to process.

They had to find a way to solve this problem and this led to the creation of Google File System (GFS) and MapReduce.

Page 5: Docker Session 7.0 BDC


What is Apache Hadoop?

• Apache Hadoop is a widely used open source distributed computing framework that is employed to efficiently process large volumes of data using large clusters of cheap or commodity computers.

• Apache Hadoop is a framework written in Java that:

• Is used for distributed storage and processing of large volumes of data, which run on top of a cluster and can scale from a single computer to thousands of computers

• Stores and processes data on every worker node (the nodes in the cluster responsible for the storage and processing of data) and handles hardware failures efficiently, providing high availability; a short storage sketch follows below
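To make the distributed-storage side concrete, here is a minimal sketch of interacting with HDFS from a cluster node; the directory and file names are illustrative assumptions.

# List the root of the distributed file system
hdfs dfs -ls /
# Create a directory and copy a local file into HDFS (names are illustrative)
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put localfile.txt /user/demo/input/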

Page 6: Docker Session 7.0 BDC


MapReduce & YARN (Yet Another Resource Negotiator)

Hadoop uses the MapReduce programming model to process data.

Most of the Apache Hadoop clusters in production run Apache Hadoop 1.x (MRv1, MapReduce Version 1).

The new version of Apache Hadoop, 2.x (MRv2, MapReduce Version 2), also referred to as Yet Another Resource Negotiator (YARN), is being actively adopted by many organizations.
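As a quick illustration of the MapReduce model, the examples jar that ships with Hadoop can run a word count. A minimal sketch follows; the jar path is the usual CDH location but should be treated as an assumption, and the HDFS paths are illustrative.

# Run the bundled word-count example over an HDFS input directory
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/demo/input /user/demo/output
# Inspect the reducer output
hdfs dfs -cat /user/demo/output/part-r-00000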

Page 7: Docker Session 7.0 BDC


YARN vs MapReduce

• YARN is a general-purpose, distributed, application management framework for processing data in Hadoop clusters.

• YARN was built to solve the following two important problems:

• Support for large clusters (4000 nodes or more)

• Ability to run applications other than MapReduce, for example Apache Giraph, so that they too can make use of data already stored in HDFS

Page 8: Docker Session 7.0 BDC


Scenario from a Customer

• The customer needs HANA DB running on 500 GB, and Apache components running on the remaining 500 GB

• The Apache components will sync with the HANA DB to produce the customer's desired results

• The database team will handle further requests from the customer

• The Apache components needed are:

Spark

Zookeeper

Kafka

Solr

Page 9: Docker Session 7.0 BDC


Apache Spark: Why is Spark a better choice for HANA?

• Apache Spark is a data processing engine for large data sets.

• Apache Spark is much faster (up to 100 times faster in memory) than Apache Hadoop MapReduce.

• Cluster mode: Spark applications run as independent processes coordinated by the SparkContext object in the driver program, which is the main program.

• SparkContext: connects to several types of cluster managers to allocate resources to Spark applications.

• Supported cluster managers include the Standalone cluster manager, Mesos, and YARN (see the sketch after this list).

• Apache Spark is designed to access data from varied data sources including HDFS, Apache HBase, and NoSQL databases such as Apache Cassandra and MongoDB.

• In this session, we run an Apache Spark Master in cluster mode using the YARN cluster manager in a Docker container.
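To make the cluster-manager choice concrete, the same example application can be pointed at different managers through the --master flag. A minimal sketch; the host names and ports are illustrative, and <examples-jar> stands for the examples jar used on Page 11.

# Standalone cluster manager (default port 7077)
spark-submit --master spark://master-host:7077 --class org.apache.spark.examples.SparkPi <examples-jar> 1000
# Mesos (default port 5050)
spark-submit --master mesos://master-host:5050 --class org.apache.spark.examples.SparkPi <examples-jar> 1000
# YARN, as used in this session
spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi <examples-jar> 1000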

Page 10: Docker Session 7.0 BDC


Setting the Environment

Running the Docker Container for CDH

Running Apache Spark Job in yarn-cluster Mode

Running Apache Spark Job in yarn-client Mode

Running the Apache Spark Shell

Page 11: Docker Session 7.0 BDC


Apache Spark

sudo docker pull svds/cdh

sudo docker run -p 8088 -d --name cdh svds/cdh

docker exec -it cdh bash

spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000

spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
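To verify that the submitted applications completed, the YARN CLI inside the container can list them; a minimal sketch, assuming the Hadoop 2.x yarn command is on the path.

# List YARN applications that have finished, with their final status
yarn application -list -appStates FINISHED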

Page 12: Docker Session 7.0 BDC


Running the Apache Spark Shell (this is what developers do)

spark-shell --master yarn-client

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}

HelloWorld.main(null)
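Beyond pasting definitions interactively, the shell can also be driven non-interactively. A minimal sketch that pipes a one-line job into it, assuming the same yarn-client setup as above; sc is the SparkContext the shell creates automatically.

# Count 1000 parallelized elements on the cluster and print the result
echo 'println(sc.parallelize(1 to 1000).count())' | spark-shell --master yarn-client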

Page 13: Docker Session 7.0 BDC


Summary: Apache Spark

• Ran Apache Spark applications on a YARN cluster in a Docker container using the spark-submit command.

• Ran the example application in yarn-cluster and yarn-client modes.

• Ran a HelloWorld Scala script in the Spark shell.

Page 14: Docker Session 7.0 BDC


Apache Solr / Environment Setup

• Apache Solr is an open source search platform built on Apache Lucene, a text search engine library.

• Apache Solr is scalable and reliable and provides indexing and querying services.

• Cloudera Search is based on Apache Solr.

• Setting the Environment

• Starting Docker Container for Apache Solr Server

• Starting Interactive Shell

• Logging in to the Solr Admin Console


Page 15: Docker Session 7.0 BDC


Apache Solr / Environment Setup

Creating a Core Admin Index

Loading Sample Data

Querying Apache Solr in Solr Admin Console

Querying Apache Solr using REST API Client

Deleting Data

Listing Logs

Stopping Apache Solr Server

Page 16: Docker Session 7.0 BDC


Starting Docker Container for Apache Solr Server

docker pull solr

docker run -p 8983:8983 -d --name solr_on_docker solr

docker logs -f <containerID>

docker exec -it --user=solr solr_on_docker bash
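From here, the steps listed on Page 15 can be carried out. A minimal sketch of creating a core and querying it, assuming the official solr image layout (bin/solr in the container's working directory); the core name gettingstarted is illustrative.

# Inside the container, as the solr user: create a core
bin/solr create_core -c gettingstarted
# From the host: query the core over the REST API
curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json"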

Page 17: Docker Session 7.0 BDC


Apache Kafka

• Apache Kafka is a messaging system based on the publish-subscribe model.

• A Kafka cluster consists of one or more servers called brokers. Kafka keeps messages categorized by “topics”.

• Producers produce messages and publish the messages to topics.

• Consumers subscribe to specific topics and consume the feeds of messages published to those topics.

• The messages published to a topic do not have to be consumed as they are produced; they are stored in the topic for a configurable duration.

• A consumer may choose to consume the messages in a topic from the beginning.

• An Apache ZooKeeper server is used to coordinate a Kafka cluster.

Page 18: Docker Session 7.0 BDC


Kafka + Zookeeper

Page 19: Docker Session 7.0 BDC


Setting up the Environment for Kafka

Starting Docker Containers for Apache Kafka

Finding IP Addresses

Listing the Kafka Logs

Creating a Kafka Topic

Starting the Kafka Producer

Starting the Kafka Consumer

Producing and Consuming Messages

Stopping and Removing the Docker Containers

Page 20: Docker Session 7.0 BDC


Starting Docker Containers for Apache Kafka

docker pull dockerkafka/zookeeper

docker pull dockerkafka/kafka

docker run -d --name zookeeper -p 2181:2181 dockerkafka/zookeeper

docker run -d --name kafka -p 9092:9092 --link zookeeper:zookeeper dockerkafka/kafka

Page 21: Docker Session 7.0 BDC


Finding IP Addresses

export ZK_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper)

export KAFKA_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' kafka)

echo $ZK_IP

echo $KAFKA_IP
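With the containers linked, the remaining steps from Page 19 can be carried out; a minimal sketch, assuming the Kafka shell scripts are on the path inside the kafka container (the exported $ZK_IP and $KAFKA_IP values serve the same purpose when the tools are run from the host). The topic name test is illustrative.

# Open a shell in the kafka container; thanks to --link, the host name 'zookeeper' resolves inside it
sudo docker exec -it kafka bash
# Create a topic, coordinated through ZooKeeper
kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic test
# Start a console producer and type messages, one per line
kafka-console-producer.sh --broker-list localhost:9092 --topic test
# In a second shell in the container: consume the topic from the beginning
kafka-console-consumer.sh --zookeeper zookeeper:2181 --topic test --from-beginning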

Page 22: Docker Session 7.0 BDC


Running Docker on SLES 12 SP1

Docker supports only SLES version 12.

For the Monsoon environment, we need to complete central server registration via email and obtain a code.
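A minimal sketch of installing and starting Docker on SLES 12 SP1, assuming the Containers module/repository is already enabled for the system.

# Install the Docker package and start the daemon
sudo zypper install docker
sudo systemctl start docker
sudo systemctl enable docker
# Optionally let a non-root user run docker commands (re-login required)
sudo usermod -aG docker $USER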