An Introduction to Big Data Pipelining with Cassandra & Spark (Westminster Meetup, 2015)

Introduction To Big Data Pipelining with Docker, Cassandra, Spark, Spark-Notebook & Akka

Upload: simon-ambridge

Post on 07-Jan-2017


TRANSCRIPT

Page 1:

Introduction To Big Data Pipelining with Docker, Cassandra, Spark,

Spark-Notebook & Akka

Page 2:

Apache Cassandra and DataStax enthusiast who enjoys explaining to customers that the traditional approaches to data management just don’t cut it anymore in the new always-on, no-single-point-of-failure, high-volume, high-velocity, real-time distributed data management world.

Previously spent 25 years designing, building, implementing and supporting complex data management solutions with traditional RDBMS technology, including Oracle Hyperion & E-Business Suite deployments at clients such as the Financial Services Authority, Olympic Delivery Authority, BT, RBS, Virgin Entertainment, HP, Sun and Oracle.

Oracle certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE; has worked extensively with Oracle Hyperion, Oracle E-Business Suite, Oracle Virtual Machine and Oracle Exalytics.

[email protected]

@stratman1958

Simon Ambridge, Pre-Sales Solution Engineer, DataStax UK

Page 3:

Big Data Pipelining: Outline

•  1-hour introduction to Big Data Pipelining and a working sandbox
•  Presented at a half-day workshop at Devoxx, November 2015
•  Uses the Data Pipeline environment from Data Fellas
•  Contributors from Typesafe, Mesos, DataStax
•  Demonstrates how to use scalable, distributed technologies:
    •  Docker
    •  Spark
    •  Spark-Notebook
    •  Cassandra
•  Objective is to introduce the demo environment
•  Key takeaway: understanding how to build a reactive, repeatable Big Data pipeline

Page 4:

Big Data Pipelining: Devoxx & Data Fellas

Andy Petrella
•  Co-founder of Data Fellas
•  Certified Scala/Spark trainer and author of the Learning Play! Framework 2 book
•  Creator of Spark-Notebook, one of the top projects on GitHub related to Apache Spark and Scala

Xavier Tordoir
•  Co-founder of Data Fellas
•  Ph.D. in experimental atomic physics
•  Specialist in prediction of biological molecular structures and interactions, and applied Machine Learning methodologies

Iulian Dragos
•  Key member of Martin Odersky’s Scala team at Typesafe
•  For the last six years the main contributor to many critical Scala components, including the compiler backend, its optimizer and the Eclipse build manager

Simon Ambridge
•  DataStax Solutions Engineer
•  Prior to DataStax, extensive experience with traditional RDBMS technologies at Oracle, Sun, Compaq, DEC etc.

Page 5:

Big Data Pipelining: Legacy

Sampling → Data Modeling → Tuning → Report → Interpret (repeated iterations)

•  Sampling and analysis often run on a single machine
•  CPU and memory limitations
•  Frequently dictates limited sampling because of data size limitations
•  Multiple iterations over large datasets

Page 6:

Big Data Pipelining: Big Data Problems

•  Data is getting bigger or, more accurately, the number of available data sources is exploding

•  Sampling the data is becoming more difficult
•  The validity of the analysis becomes obsolete faster
•  Analysis becomes too slow to get any ROI from the data

Page 7:

Big Data Pipelining: Big Data Needs

•  Scalable infrastructure + distributed technologies:
    •  Allow data volumes to be scaled
    •  Faster processing
    •  More complex processing
    •  Constant data flow
    •  Visible, reproducible analysis
•  For example, SHAR3 from Data Fellas

Page 8:

Big Data Pipelining: Pipeline Flow

[Pipeline flow diagram, featuring ADAM]

Page 9:

Intro To Docker: Quick History

What is Docker?
•  Open source project started in 2013
•  Easy to build, deploy and copy containers
•  Great for packaging and deploying applications
•  Similar resource isolation to VMs, but a different architecture
•  Lightweight:
    •  Containers share the OS kernel
    •  Fast start
    •  Layered filesystems share underlying OS files and directories

“Each virtual machine includes the application, the necessary binaries and libraries and an entire guest operating system - all of which may be tens of GBs in size.”

“Containers include the application and all of its dependencies, but share the kernel with other containers. They run as an isolated process in userspace on the host operating system. They’re also not tied to any specific infrastructure – Docker containers run on any computer, on any infrastructure and in any cloud.”

Page 10:

Intro To ADAM: Quick History

What is ADAM?
•  Started at UC Berkeley in 2012
•  Open-source library for bioinformatics analysis, written for Spark
•  Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics methods
•  A set of formats, APIs, and processing stage implementations for genomic data
•  Fully open source under the Apache 2 license
•  Implemented on top of Avro and Parquet for data storage
•  Compatible with Spark up to 1.5.1

Page 11:

Intro To Spark: Quick History

What is Apache Spark?
•  Started at UC Berkeley in 2009
•  Apache project since 2010
•  Fast: 10x-100x faster than Hadoop MapReduce
•  Distributed in-memory processing
•  Rich Scala, Java and Python APIs
•  2x-5x less code than R
•  Batch and streaming analytics
•  Interactive shell (REPL)

Page 12:

Intro To Spark-Notebook: Quick History

What is Spark-Notebook?
•  Drive your data analysis from the browser
•  Can be deployed on a single host or a large cluster, e.g. Mesos, EC2, GCE etc.
•  Tight integration with Apache Spark, with handy tools for analysts:
    •  Reproducible visual analysis
    •  Charting
    •  Widgets
    •  Dynamic forms
    •  SQL support
    •  Extensible with custom libraries

Page 13:

Intro To Parquet: Quick History

What is Parquet?
•  Started at Twitter and Cloudera in 2013
•  Databases traditionally store information in rows and are optimized for working with one record at a time
•  Columnar storage systems are optimised to store data by column
•  Netflix is a big user: 7 PB of warehoused data in Parquet format
•  A compressed, efficient columnar data representation
•  Allows complex data to be encoded efficiently
•  Compression schemes can be specified at a per-column level
•  Not as compressed as ORC (Hortonworks), but faster to read and analyse

Page 14:

Intro To Cassandra: Quick History

What is Apache Cassandra?
•  Originally started at Facebook in 2008
•  Top-level Apache project since 2010
•  Open source distributed database
•  Handles large amounts of data
    •  At high velocity
    •  Across multiple data centres
•  No single point of failure
    •  Continuous availability
    •  Disaster avoidance
•  Enterprise Cassandra from DataStax

Page 15:

Intro To Akka: Quick History

What is Akka?
•  Open source toolkit first released in 2009
•  Simplifies the construction of concurrent and distributed applications on the JVM
•  Primarily designed for actor-based concurrency
•  Akka enforces parental supervision:
    •  Actors are arranged hierarchically
    •  Each actor is created and supervised by its parent actor
    •  Program failures are treated as events handled by an actor’s supervisor
•  Message-based and asynchronous; typically no mutable data are shared
•  Language bindings exist for both Java and Scala

Page 16:

Spark: RDD

What Is A Resilient Distributed Dataset?
•  RDD: a distributed memory abstraction for parallel in-memory computations
•  An RDD represents a dataset consisting of objects and records
    •  Such as Scala, Java or Python objects
•  An RDD is distributed across nodes in the Spark cluster
    •  Nodes hold partitions, and partitions hold records
•  An RDD is read-only, or immutable
    •  An RDD can be transformed into a new RDD
•  Operations:
    •  Transformations (e.g. map, filter, groupBy)
    •  Actions (e.g. count, collect, save)
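The split between lazy transformations and eager actions can be sketched in the Spark shell, in Scala. This is an illustrative sketch, not code from the deck: sc is the SparkContext the shell predefines, and the numbers are made up.

```scala
// Transformations (map, filter) are lazy: each one only describes a new RDD.
val numbers = sc.parallelize(1 to 10)      // RDD[Int] from a local collection
val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing runs yet
val scaled  = evens.map(_ * 10)            // transformation: still nothing runs

// Actions trigger the actual distributed computation.
scaled.count()     // action: returns 5
scaled.collect()   // action: returns Array(20, 40, 60, 80, 100)
```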

Page 17:

Spark: DataFrames

What Is A DataFrame?
•  Inspired by data frames in R and Python
•  Data is organized into named columns
•  Conceptually equivalent to a table in a relational database
•  Can be constructed from a wide array of sources:
    •  Structured data files: JSON, Parquet
    •  Tables in Hive
    •  Relational database systems via JDBC
    •  Existing RDDs
•  Can be extended to support any third-party data formats or sources
•  Existing third-party extensions already include Avro, CSV, ElasticSearch and Cassandra
•  Enables applications to easily combine data from disparate sources
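Two of these construction routes can be sketched with the Spark 1.x API; sc is an existing SparkContext, and the file path and case class are illustrative assumptions, not from the deck:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // enables .toDF() on RDDs

// From a structured file: the schema comes from the Parquet metadata.
val users = sqlContext.read.parquet("/data/users.parquet")
users.printSchema()

// From an existing RDD of case classes, via reflection.
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 28))).toDF()
people.select("name").show()
```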

Page 18:

Spark & Cassandra: How?

How Does Spark Access Cassandra?
•  The DataStax Spark Cassandra Connector, open source:
    •  https://github.com/datastax/spark-cassandra-connector
•  Compatible with:
    •  Spark 0.9+
    •  Cassandra 2.0+
    •  DataStax Enterprise 4.5+
    •  Scala 2.10 and 2.11
    •  Java and Python
•  Exposes Cassandra tables as Spark RDDs
•  Executes arbitrary CQL queries in Spark applications
•  Saves RDDs back to Cassandra via the saveToCassandra call
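Both directions can be sketched with the connector API; the keyspace, table and column names here are illustrative, and the target table is assumed to already exist:

```scala
import com.datastax.spark.connector._

// Read: expose a Cassandra table as an RDD of CassandraRow.
val users = sc.cassandraTable("demo", "users")
println(users.count())

// Write: save an RDD of tuples back into a Cassandra table,
// mapping tuple fields onto the named columns.
val rows = sc.parallelize(Seq(("ann", 34), ("bob", 28)))
rows.saveToCassandra("demo", "users", SomeColumns("name", "age"))
```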

Page 19:

Spark: How Do You Access RDDs?

Create A ‘Spark Context’
•  To create an RDD you need a Spark Context object
•  A Spark Context represents a connection to a Spark cluster
•  In the Spark shell the sc object is created automatically
•  In a standalone application a Spark Context must be constructed
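For the standalone case, a minimal construction sketch looks like the following; the application name, master URL and Cassandra host are illustrative placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of standalone-application setup.
val conf = new SparkConf()
  .setAppName("pipeline-demo")
  .setMaster("spark://127.0.0.1:7077")                 // or "local[*]" on one machine
  .set("spark.cassandra.connection.host", "127.0.0.1") // for the Cassandra connector

val sc = new SparkContext(conf)
// ... build and operate on RDDs via sc ...
sc.stop()
```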

Page 20:

Spark: Architecture

Spark Architecture
•  Master-worker architecture:
    •  One master
    •  Spark Workers run on all nodes
    •  Executors belonging to different clients/SparkContexts are isolated
    •  Executors belonging to the same client/SparkContext can communicate
    •  Client jobs are divided into tasks, executed by multiple threads
•  First Spark node is promoted as Spark Master
•  Master HA feature available in DataStax Enterprise
    •  Standby Master promoted on failure
•  Workers are resilient by default

Page 21:

Open Source: Analytics Integration
•  Apache Spark for real-time analytics
•  Analytics nodes separate from data nodes
•  ETL required
•  Loose integration
•  Data separate from processing
•  Millisecond response times

[Diagram: a Cassandra cluster feeding separate Spark, Solr and ES clusters via ETL; 10-core, 16GB minimum nodes]

Page 22:

DataStax Enterprise: Analytics Integration
•  Integrated Apache Spark for real-time analytics
•  Integrated Apache Solr for enterprise search
•  Search and analytics nodes close to data
•  No ETL required
•  Tight integration
•  Data locality
•  Microsecond response times

[Diagram: a single Cassandra cluster with co-located Spark and Solr, replacing the separate ETL, Spark, Solr and ES clusters; 12+ core, 32GB+ nodes]

Page 23:

Big Data Pipelining: Demo

Build & Run Steps
1.  Provision a 64-bit Linux environment
2.  Pre-requisites (5 mins)
3.  Install Docker (5 mins)
4.  Clone the Pipeline repo from GitHub (2 mins)
5.  Pull the Docker image from Docker Hub (20 mins)
6.  Run the image as a container (5 mins)
7.  Run the demo setup script, inside the container (2 mins)
8.  Run the demo from a browser, on the host (30 mins)

Page 24:

Big Data Pipelining: Demo

Steps 1.  Provision a host

Required machine spec: 3 cores, 5GB
•  Linux machine: http://www.ubuntu.com/download/desktop
•  Create a VM (e.g. Ubuntu): http://virtualboxes.org/images/ubuntu/ or http://www.osboxes.org/ubuntu/

Page 25:

Big Data Pipelining: Demo

Steps 2.  Pre-requisites

https://docs.docker.com/installation/ubuntulinux/
•  Updates to apt-get sources and gpg key
•  Check kernel version

Page 26:

Big Data Pipelining: Demo

Steps 3.  Install Docker

$ sudo apt-get update
$ sudo apt-get install docker.io     # on Ubuntu the Docker package is docker.io
$ sudo usermod -aG docker <myuserid>

Log out/in

$ docker run hello-world

Page 27:

Big Data Pipelining: Demo

Steps 4.  Clone the Pipeline repo

$ mkdir ~/pipeline
$ cd ~/pipeline
$ git clone https://github.com/distributed-freaks/pipeline.git

Page 28:

Big Data Pipelining: Demo

Steps 5.  Pull the Pipeline image

$ docker pull xtordoir/pipeline

Page 29:

Big Data Pipelining: Demo

Steps 6.  Run the Pipeline image as a container

$ docker run -it -m 8g \
    -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 \
    -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 \
    -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 \
    -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 \
    -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 \
    -p 37979:7979 -p 38989:8989 \
    xtordoir/pipeline bash

Page 30:

Big Data Pipelining: Demo

Steps 7.  Run the demo setup script in the container

$ cd pipeline
$ source devoxx-setup.sh     # ignore Cassandra errors

Run cqlsh

Page 31:

Big Data Pipelining: Demo

Steps 8.  Run the demo in the host browser

http://localhost:39000/tree/pipeline

Page 32:

Page 33:

Thank you!

Page 34:

Big Data Pipelining: Appendix

RDD/Cassandra Reference

Page 35:

Spark: RDD

How Do You Create An RDD? 1.  From an existing collection:

[code screenshot, annotated ‘action’]
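The slide's code screenshot did not survive the transcript; a minimal Spark-shell sketch of this pattern, assuming the predefined sc SparkContext and made-up data, might look like:

```scala
// Create an RDD from an existing local Scala collection.
val data = List(1, 2, 3, 4, 5)
val rdd  = sc.parallelize(data)

rdd.collect()   // 'action': returns Array(1, 2, 3, 4, 5) to the driver
```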

Page 36:

Spark: RDD

How Do You Create An RDD? 2.  From a text file:

[code screenshot, annotated ‘action’]
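Again the screenshot is missing; a sketch of the same idea, with an illustrative file path, would be:

```scala
// Create an RDD[String] from a text file, one element per line.
val lines = sc.textFile("/data/sample.txt")

lines.count()   // 'action': returns the number of lines in the file
```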

Page 37:

Spark: RDD

How Do You Create An RDD? 3.  From data in a Cassandra database:

[code screenshot, annotated ‘action’]
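A sketch with the Spark Cassandra Connector, using illustrative keyspace and table names:

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as an RDD of CassandraRow.
val rdd = sc.cassandraTable("demo", "users")

rdd.first()   // 'action': fetches one CassandraRow from the cluster
```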

Page 38:

Spark: RDD

How Do You Create An RDD? 4.  From an existing RDD:

[code screenshot, annotated ‘action’ and ‘transformation’]
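A sketch of deriving one RDD from another; the words are illustrative:

```scala
// A 'transformation' derives a new, immutable RDD from an existing one;
// nothing executes until an 'action' is called on the chain.
val words   = sc.parallelize(Seq("spark", "cassandra", "akka"))
val lengths = words.map(_.length)   // 'transformation': lazy

lengths.collect()                   // 'action': Array(5, 9, 4)
```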

Page 39:

Spark: RDD’s & Cassandra

Accessing Data As An RDD

[code screenshot, annotated ‘action’ and RDD method]

Page 40:

Spark: Filtering Data In Cassandra

Server-side Selection
•  Reduce the amount of data transferred
•  Selecting rows (by clustering columns and/or secondary indexes)
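The connector's where() method pushes the predicate down to Cassandra so that only matching rows cross the network. This sketch uses illustrative names; the filtered column is assumed to be a clustering column or secondary index:

```scala
import com.datastax.spark.connector._

// Server-side selection: the CQL predicate runs inside Cassandra.
val recent = sc.cassandraTable("demo", "events")
  .where("event_time > ?", "2015-01-01")

recent.count()
```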

Page 41:

Spark: Saving Data In Cassandra

Saving Data •  saveToCassandra
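A sketch of saveToCassandra; keyspace, table and column names are illustrative, and the table is assumed to already exist:

```scala
import com.datastax.spark.connector._

// Write an RDD of tuples into a Cassandra table, mapping the tuple
// fields onto the named columns.
val scores = sc.parallelize(Seq(("ann", 10), ("bob", 7)))
scores.saveToCassandra("demo", "scores", SomeColumns("name", "score"))
```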

Page 42:

Spark: Using SparkSQL & Cassandra

You Can Also Access Cassandra Via SparkSQL!
•  The Spark Conf object can be used to create a Cassandra-aware Spark SQL context object
•  Use regular CQL syntax
•  Cross-table operations: joins, unions etc.!
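With the 1.x connector this Cassandra-aware context was CassandraSQLContext; a sketch with illustrative keyspace and table names:

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Cassandra-aware SQL context built from an existing SparkContext.
val cc = new CassandraSQLContext(sc)

// Regular SQL over Cassandra tables, including cross-table joins.
val top = cc.sql("SELECT name, score FROM demo.scores WHERE score > 5")
top.show()
```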

Page 43:

Spark: Streaming Data

Spark Streaming
•  High-velocity data: IoT, sensors, Twitter etc.
•  Micro-batching
•  Each batch represented as an RDD
•  Fault tolerant
•  Exactly-once processing
•  Represents a unified stream and batch processing framework

Page 44:

Spark: Streaming Data Into Cassandra

Streaming Example
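The deck's streaming example itself is not in the transcript; one plausible sketch of the pattern, micro-batching a socket stream and saving each batch's word counts to Cassandra, follows. The host, port, keyspace and table names are illustrative assumptions:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Micro-batch the stream every 5 seconds; each batch is an RDD.
val ssc   = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

// Count words per batch and write the counts to Cassandra.
lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("demo", "word_counts", SomeColumns("word", "count"))

ssc.start()
ssc.awaitTermination()
```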