Docker Session 7.0 BDC
TRANSCRIPT
Part 7: Big Data Components | Dawood Sayyed/GLDS | May 24, 2016 | Internal
© 2015 SAP SE or an SAP affiliate company. All rights reserved. Internal
Why would any organization want to store data?
• The present and the future belong to those who hold onto their data and work with it to improve their current operations and innovate to generate new products and opportunities.
• Data, and the creative use of it, is at the heart of organizations such as Google, Facebook, Netflix, Amazon, and Yahoo!.
• The same applies to any other organization that uses a database. (Why? For predictive analytics and reporting.)
• They have proven that data, along with powerful analysis, helps in building fantastic and powerful products.
What is Big Data?
• Organizations now want to use this data to gain insight, to help them understand existing problems, seize new opportunities, and be more profitable.
• The study and analysis of these vast volumes of data has given birth to the term big data.
Distributed Computing / Clusters
Several companies have been working to solve this problem and have come out with a few commercial offerings that leverage the power of distributed computing.
In this solution, multiple computers work together (a cluster) to store and process large volumes of data in parallel, thus making the analysis of large volumes of data possible.
Google, the Internet search engine giant, ran into issues when their data, acquired by crawling the Web, started growing to such large volumes that it was becoming impossible to process.
They had to find a way to solve this problem, and this led to the creation of the Google File System (GFS) and MapReduce.
What is Apache Hadoop?
• Apache Hadoop is a widely used open source distributed computing framework that is employed to efficiently process large volumes of data using large clusters of cheap or commodity computers.
• Apache Hadoop is a framework written in Java that:
• Is used for distributed storage and processing of large volumes of data, running on top of a cluster that can scale from a single computer to thousands of computers
• Stores and processes data on every worker node (the nodes on the cluster that are responsible for the storage and processing of data) and handles hardware failures efficiently, providing high availability
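As a sketch of how data lands on the cluster's distributed storage, the usual HDFS shell commands look like the following. This is a hedged example: the paths and file names are placeholders, and it assumes a running Hadoop installation with `hdfs` on the PATH.

```shell
# Create a directory in HDFS and copy a local file into it.
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put localdata.txt /user/demo/input/
# List the directory to verify the file was stored on the cluster.
hdfs dfs -ls /user/demo/input
```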
MapReduce & YARN (Yet Another Resource Negotiator)
Hadoop uses the MapReduce programming model to process data.
Most Apache Hadoop clusters in production run Apache Hadoop 1.x (MRv1, MapReduce Version 1).
The new version of Apache Hadoop, 2.x (MRv2, MapReduce Version 2), also referred to as Yet Another Resource Negotiator (YARN), is being actively adopted by many organizations.
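To make the MapReduce programming model concrete, the classic word-count example can be mimicked with a plain shell pipeline. This is only an analogy, not Hadoop itself: `tr` plays the map step (emit one word per line), `sort` plays the shuffle (bring identical keys together), and `uniq -c` plays the reduce (aggregate a count per key).

```shell
# Toy "MapReduce" word count as a shell pipeline.
printf 'big data big cluster data big\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

In a real Hadoop job the map and reduce steps run in parallel across the worker nodes, but the data flow is the same.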
YARN vs MapReduce
• YARN is a general-purpose, distributed, application management framework for processing data in Hadoop clusters.
• YARN was built to solve the following two important problems:
• Support for large clusters (4000 nodes or more)
• Ability to run applications other than MapReduce, for example Apache Giraph, so that they can make use of data already stored in HDFS
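A hedged sketch of submitting a job to YARN on a Hadoop 2.x cluster; the example jar ships with Hadoop distributions, though its exact path varies by installation:

```shell
# Submit the bundled Pi-estimation MapReduce example to YARN:
# 10 map tasks, 100 samples each.
yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100
# List applications currently running on the cluster.
yarn application -list
```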
Scenario from Customer
• The customer needs a HANA DB running on 500 GB, and Apache components running on the remaining 500 GB.
• The Apache components will sync with the HANA DB to produce the customer's desired results.
• The database team will handle further requests from the customer.
• The Apache components needed are:
Spark
Zookeeper
Kafka
Solr
Apache Spark: Why is Spark a better choice for HANA?
• Apache Spark is a data processing engine for large data sets.
• Apache Spark is much faster (up to 100 times faster in memory) than Apache Hadoop MapReduce.
• Cluster mode: Spark applications run as independent processes coordinated by the SparkContext object in the driver program, which is the main program.
• SparkContext: connects to several types of cluster managers to allocate resources to Spark applications.
• Supported cluster managers include the Standalone cluster manager, Mesos, and YARN.
• Apache Spark is designed to access data from varied data sources including HDFS, Apache HBase, and NoSQL databases such as Apache Cassandra and MongoDB.
• We will run an Apache Spark Master in cluster mode using the YARN cluster manager in a Docker container.
Setting the Environment
Setting the Environment
Running the Docker Container for CDH
Running Apache Spark Job in yarn-cluster Mode
Running Apache Spark Job in yarn-client Mode
Running the Apache Spark Shell
Apache Spark
sudo docker pull svds/cdh
sudo docker run -p 8088 -d --name cdh svds/cdh
docker exec -it cdh bash
spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/lib/spark-examples-1.3.0-cdh5.4.7-hadoop2.6.0-cdh5.4.7.jar 1000
Running the Apache Spark Shell (this is what developers do)
spark-shell --master yarn-client
object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
  }
}
HelloWorld.main(null)
Summary: Apache Spark
• Ran Apache Spark applications on a YARN cluster in a Docker container using the spark-submit command.
• Ran the example application in yarn-cluster and yarn-client modes.
• Ran a HelloWorld Scala script in a Spark shell.
Apache Solr / Environment Set-up
• Apache Solr is an open source search platform built on Apache Lucene, a text search engine library.
• Apache Solr is scalable and reliable and provides indexing and querying service.
• Cloudera Search is based on Apache Solr.
• Setting the Environment
• Starting Docker Container for Apache Solr Server
• Starting Interactive Shell
• Logging in to the Solr Admin Console
• Creating a Core Admin Index
Apache Solr / Environment Set-up
Creating a Core Admin Index
Loading Sample Data
Querying Apache Solr in Solr Admin Console
Querying Apache Solr using REST API Client
Deleting Data
Listing Logs
Stopping Apache Solr Server
Starting Docker Container for Apache Solr Server
docker pull solr
docker run -p 8983:8983 -d --name solr_on_docker solr
docker logs -f <containerID>
docker exec -it --user=solr solr_on_docker bash
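The remaining agenda steps (creating a core, loading data, querying, deleting) might look like the following sketch. The core name `gettingstarted` and the sample document are made up for illustration, and the commands assume the layout of the official `solr` image:

```shell
# Create a core from inside the container.
docker exec -it --user=solr solr_on_docker bin/solr create_core -c gettingstarted
# Index a sample document over the REST API.
curl -s 'http://localhost:8983/solr/gettingstarted/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","title":"hello solr"}]'
# Query it back.
curl -s 'http://localhost:8983/solr/gettingstarted/select?q=title:hello'
# Delete all documents from the core.
curl -s 'http://localhost:8983/solr/gettingstarted/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '{"delete":{"query":"*:*"}}'
```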
Apache Kafka
• Apache Kafka is a messaging system based on the publish-subscribe model.
• A Kafka cluster consists of one or more servers called brokers. Kafka keeps messages categorized by “topics”.
• Producers produce messages and publish the messages to topics.
• Consumers subscribe to specific topics and consume feeds of messages published to those topics.
• The messages published to a topic do not have to be consumed as they are produced; they are stored in the topic for a configurable duration.
• A consumer may choose to consume the messages in a topic from the beginning.
• Apache ZooKeeper server is used to coordinate a Kafka cluster.
Kafka + Zookeeper
Setting up the Environment for Kafka
Starting Docker Containers for Apache Kafka
Finding IP Addresses
Listing the Kafka Logs
Creating a Kafka Topic
Starting the Kafka Producer
Starting the Kafka Consumer
Producing and Consuming Messages
Stopping and Removing the Docker Containers
Starting Docker Containers for Apache Kafka
docker pull dockerkafka/zookeeper
docker pull dockerkafka/kafka
docker run -d --name zookeeper -p 2181:2181 dockerkafka/zookeeper
docker run -d --name kafka -p 9092:9092 --link zookeeper:zookeeper dockerkafka/kafka
Finding IP Addresses
export ZK_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper)
export KAFKA_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' kafka)
echo $ZK_IP
echo $KAFKA_IP
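With these IP addresses in hand, the remaining agenda steps (creating a topic, producing, consuming) might look like this sketch. It assumes the image puts the standard Kafka shell scripts on the PATH, which varies between images, and uses the ZooKeeper-based consumer of Kafka releases from this era:

```shell
# Create a topic with a single partition and replica.
docker exec kafka kafka-topics.sh --create --zookeeper $ZK_IP:2181 \
  --replication-factor 1 --partitions 1 --topic test-topic
# Start a console producer (type messages, one per line).
docker exec -it kafka kafka-console-producer.sh \
  --broker-list $KAFKA_IP:9092 --topic test-topic
# In another terminal, consume the topic from the beginning.
docker exec -it kafka kafka-console-consumer.sh \
  --zookeeper $ZK_IP:2181 --topic test-topic --from-beginning
```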
Running Docker on SLES 12 SP1
Docker is supported only on SLES version 12.
For the Monsoon environment we need central server email registration and a registration code.