An Introduction to Big Data Pipelining with Cassandra & Spark (Westminster Meetup, 2015)

Introduction To Big Data Pipelining with Docker, Cassandra, Spark, Spark-Notebook & Akka

Upload: simon-ambridge

Post on 07-Jan-2017


TRANSCRIPT

Page 1:

Introduction To Big Data Pipelining with Docker, Cassandra, Spark,

Spark-Notebook & Akka

Page 2:

Apache Cassandra and DataStax enthusiast who enjoys explaining to customers that the traditional approaches to data management just don’t cut it anymore in the new always-on, no-single-point-of-failure, high-volume, high-velocity, real-time distributed data management world.

Previously spent 25 years designing, building, implementing and supporting complex data management solutions with traditional RDBMS technology, including Oracle Hyperion & E-Business Suite deployments at clients such as the Financial Services Authority, Olympic Delivery Authority, BT, RBS, Virgin Entertainment, HP, Sun and Oracle.

Oracle certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE; has worked extensively with Oracle Hyperion, Oracle E-Business Suite, Oracle Virtual Machine and Oracle Exalytics.

[email protected]

@stratman1958

Simon Ambridge, Pre-Sales Solution Engineer, DataStax UK

Page 3:

Big Data Pipelining: Outline

•  1-hour introduction to Big Data Pipelining and a working sandbox
•  Presented at a half-day workshop at Devoxx, November 2015
•  Uses the Data Pipeline environment from Data Fellas
•  Contributors from Typesafe, Mesos, DataStax
•  Demonstrates how to use scalable, distributed technologies:
    •  Docker
    •  Spark
    •  Spark-Notebook
    •  Cassandra
•  Objective is to introduce the demo environment
•  Key takeaway: understanding how to build a reactive, repeatable Big Data pipeline

Page 4:

Big Data Pipelining: Devoxx & Data Fellas

Andy Petrella
•  Co-founder of Data Fellas
•  Certified Scala/Spark trainer and author of the Learning Play! Framework 2 book
•  Creator of Spark-Notebook, one of the top projects on GitHub related to Apache Spark and Scala

Xavier Tordoir
•  Co-founder of Data Fellas
•  Ph.D. in experimental atomic physics
•  Specialist in prediction of biological molecular structures and interactions, and applied Machine Learning methodologies

Iulian Dragos
•  Key member of Martin Odersky’s Scala team at Typesafe
•  For the last six years the main contributor to many critical Scala components, including the compiler backend, its optimizer and the Eclipse build manager

Simon Ambridge
•  DataStax Solutions Engineer
•  Prior to DataStax, extensive experience with traditional RDBMS technologies at Oracle, Sun, Compaq, DEC etc.

Page 5:

Big Data Pipelining: Legacy

Sampling → Data Modeling → Tuning → Report → Interpret (repeated iterations)

•  Sampling and analysis often run on a single machine
•  CPU and memory limitations
•  Frequently dictates limited sampling because of data size limitations
•  Multiple iterations over large datasets

Page 6:

Big Data Pipelining: Big Data Problems

•  Data is getting bigger or, more accurately, the number of available data sources is exploding

•  Sampling the data is becoming more difficult
•  The validity of the analysis becomes obsolete faster
•  Analysis becomes too slow to get any ROI from the data

Page 7:

Big Data Pipelining: Big Data Needs

•  Scalable infrastructure + distributed technologies:
    •  Allow data volumes to be scaled
    •  Faster processing
    •  More complex processing
    •  Constant data flow
    •  Visible, reproducible analysis
•  For example, SHAR3 from Data Fellas

Page 8:

Big Data Pipelining: Pipeline Flow

[Pipeline flow diagram, featuring ADAM]

Page 9:

Intro To Docker: Quick History

What is Docker?
•  Open source project started in 2013
•  Easy to build, deploy and copy containers
•  Great for packaging and deploying applications
•  Similar resource isolation to VMs, but a different architecture
•  Lightweight:
    •  Containers share the OS kernel
    •  Fast start
    •  Layered filesystems share underlying OS files and directories

“Each virtual machine includes the application, the necessary binaries and libraries and an entire guest operating system - all of which may be tens of GBs in size.”

“Containers include the application and all of its dependencies, but share the kernel with other containers. They run as an isolated process in userspace on the host operating system. They’re also not tied to any specific infrastructure – Docker containers run on any computer, on any infrastructure and in any cloud.”

Page 10:

Intro To ADAM: Quick History

What is ADAM?
•  Started at UC Berkeley in 2012
•  Open-source library for bioinformatics analysis, written for Spark
•  Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics methods
•  A set of formats, APIs, and processing stage implementations for genomic data
•  Fully open source under the Apache 2 license
•  Implemented on top of Avro and Parquet for data storage
•  Compatible with Spark up to 1.5.1

Page 11:

Intro To Spark: Quick History

What is Apache Spark?
•  Started at UC Berkeley in 2009
•  Apache project since 2010
•  Fast: 10x-100x faster than Hadoop MapReduce
•  Distributed in-memory processing
•  Rich Scala, Java and Python APIs
•  2x-5x less code than R
•  Batch and streaming analytics
•  Interactive shell (REPL)

Page 12:

Intro To Spark-Notebook: Quick History

What is Spark-Notebook?
•  Drive your data analysis from the browser
•  Can be deployed on a single host or a large cluster, e.g. Mesos, EC2, GCE etc.
•  Tight integration with Apache Spark, with handy tools for analysts:
    •  Reproducible visual analysis
    •  Charting
    •  Widgets
    •  Dynamic forms
    •  SQL support
    •  Extensible with custom libraries

Page 13:

Intro To Parquet: Quick History

What is Parquet?
•  Started at Twitter and Cloudera in 2013
•  Databases traditionally store information in rows and are optimized for working with one record at a time
•  Columnar storage systems are optimised to store data by column
•  Netflix is a big user: 7 PB of warehoused data in Parquet format
•  A compressed, efficient columnar data representation
•  Allows complex data to be encoded efficiently
•  Compression schemes can be specified at a per-column level
•  Not as compressed as ORC (Hortonworks), but faster to read and analyse

Page 14:

Intro To Cassandra: Quick History

What is Apache Cassandra?
•  Originally started at Facebook in 2008
•  Top-level Apache project since 2010
•  Open source distributed database
•  Handles large amounts of data
    •  At high velocity
    •  Across multiple data centres
•  No single point of failure
    •  Continuous availability
    •  Disaster avoidance
•  Enterprise Cassandra from DataStax

Page 15:

Intro To Akka: Quick History

What is Akka?
•  Open source toolkit first released in 2009
•  Simplifies the construction of concurrent and distributed applications on the JVM
•  Primarily designed for actor-based concurrency
•  Akka enforces parental supervision:
    •  Actors are arranged hierarchically
    •  Each actor is created and supervised by its parent actor
    •  Program failures are treated as events handled by an actor’s supervisor
•  Message-based and asynchronous; typically no mutable data are shared
•  Language bindings exist for both Java and Scala

Page 16:

Spark: RDD

What Is A Resilient Distributed Dataset?
•  RDD: a distributed memory abstraction for parallel in-memory computations
•  An RDD represents a dataset consisting of objects and records
    •  Such as Scala, Java or Python objects
•  An RDD is distributed across nodes in the Spark cluster
    •  Nodes hold partitions, and partitions hold records
•  An RDD is read-only, or immutable
    •  An RDD can be transformed into a new RDD
•  Operations:
    •  Transformations (e.g. map, filter, groupBy)
    •  Actions (e.g. count, collect, save)
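The split between lazy transformations and eager actions can be sketched in the Spark shell, in Scala. This is an illustrative sketch, not code from the deck: sc is the SparkContext the shell predefines, and the numbers are made up.

```scala
// Transformations (map, filter) are lazy: each one only describes a new RDD.
val numbers = sc.parallelize(1 to 10)      // RDD[Int] from a local collection
val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing runs yet
val scaled  = evens.map(_ * 10)            // transformation: still nothing runs

// Actions trigger the actual distributed computation.
scaled.count()     // action: returns 5
scaled.collect()   // action: returns Array(20, 40, 60, 80, 100)
```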

Page 17:

Spark: DataFrames

What Is A DataFrame?
•  Inspired by data frames in R and Python
•  Data is organized into named columns
•  Conceptually equivalent to a table in a relational database
•  Can be constructed from a wide array of sources:
    •  Structured data files: JSON, Parquet
    •  Tables in Hive
    •  Relational database systems via JDBC
    •  Existing RDDs
•  Can be extended to support any third-party data formats or sources
•  Existing third-party extensions already include Avro, CSV, ElasticSearch and Cassandra
•  Enables applications to easily combine data from disparate sources
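Two of these construction routes can be sketched with the Spark 1.x API; sc is an existing SparkContext, and the file path and case class are illustrative assumptions, not from the deck:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // enables .toDF() on RDDs

// From a structured file: the schema comes from the Parquet metadata.
val users = sqlContext.read.parquet("/data/users.parquet")
users.printSchema()

// From an existing RDD of case classes, via reflection.
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 28))).toDF()
people.select("name").show()
```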

Page 18:

Spark & Cassandra: How?

How Does Spark Access Cassandra?
•  The DataStax Spark Cassandra Connector, open source:
    •  https://github.com/datastax/spark-cassandra-connector
•  Compatible with:
    •  Spark 0.9+
    •  Cassandra 2.0+
    •  DataStax Enterprise 4.5+
    •  Scala 2.10 and 2.11
    •  Java and Python
•  Exposes Cassandra tables as Spark RDDs
•  Executes arbitrary CQL queries in Spark applications
•  Saves RDDs back to Cassandra via the saveToCassandra call
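Both directions can be sketched with the connector API; the keyspace, table and column names here are illustrative, and the target table is assumed to already exist:

```scala
import com.datastax.spark.connector._

// Read: expose a Cassandra table as an RDD of CassandraRow.
val users = sc.cassandraTable("demo", "users")
println(users.count())

// Write: save an RDD of tuples back into a Cassandra table,
// mapping tuple fields onto the named columns.
val rows = sc.parallelize(Seq(("ann", 34), ("bob", 28)))
rows.saveToCassandra("demo", "users", SomeColumns("name", "age"))
```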

Page 19:

Spark: How Do You Access RDDs?

Create A ‘Spark Context’
•  To create an RDD you need a Spark Context object
•  A Spark Context represents a connection to a Spark cluster
•  In the Spark shell the sc object is created automatically
•  In a standalone application a Spark Context must be constructed
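For the standalone case, a minimal construction sketch looks like the following; the application name, master URL and Cassandra host are illustrative placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of standalone-application setup.
val conf = new SparkConf()
  .setAppName("pipeline-demo")
  .setMaster("spark://127.0.0.1:7077")                 // or "local[*]" on one machine
  .set("spark.cassandra.connection.host", "127.0.0.1") // for the Cassandra connector

val sc = new SparkContext(conf)
// ... build and operate on RDDs via sc ...
sc.stop()
```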

Page 20:

Spark: Architecture

Spark Architecture
•  Master-worker architecture:
    •  One master
    •  Spark Workers run on all nodes
    •  Executors belonging to different clients/SparkContexts are isolated
    •  Executors belonging to the same client/SparkContext can communicate
    •  Client jobs are divided into tasks, executed by multiple threads
•  First Spark node is promoted as Spark Master
•  Master HA feature available in DataStax Enterprise
    •  Standby Master promoted on failure
•  Workers are resilient by default

Page 21:

Open Source: Analytics Integration
•  Apache Spark for real-time analytics
•  Analytics nodes separate from data nodes
•  ETL required
•  Loose integration
•  Data separate from processing
•  Millisecond response times

[Diagram: a Cassandra cluster feeding separate Spark, Solr and ES clusters via ETL; 10-core, 16GB minimum nodes]

Page 22:

DataStax Enterprise: Analytics Integration
•  Integrated Apache Spark for real-time analytics
•  Integrated Apache Solr for enterprise search
•  Search and analytics nodes close to data
•  No ETL required
•  Tight integration
•  Data locality
•  Microsecond response times

[Diagram: a single Cassandra cluster with co-located Spark and Solr, replacing the separate ETL, Spark, Solr and ES clusters; 12+ core, 32GB+ nodes]

Page 23:

Big Data Pipelining: Demo

Build & Run Steps
1.  Provision a 64-bit Linux environment
2.  Pre-requisites (5 mins)
3.  Install Docker (5 mins)
4.  Clone the Pipeline repo from GitHub (2 mins)
5.  Pull the Docker image from Docker Hub (20 mins)
6.  Run the image as a container (5 mins)
7.  Run the demo setup script, inside the container (2 mins)
8.  Run the demo from a browser, on the host (30 mins)

Page 24:

Big Data Pipelining: Demo

Steps 1.  Provision a host

Required machine spec: 3 cores, 5GB
•  Linux machine: http://www.ubuntu.com/download/desktop
•  Create a VM (e.g. Ubuntu): http://virtualboxes.org/images/ubuntu/ or http://www.osboxes.org/ubuntu/

Page 25:

Big Data Pipelining: Demo

Steps 2.  Pre-requisites

https://docs.docker.com/installation/ubuntulinux/
•  Updates to apt-get sources and gpg key
•  Check kernel version

Page 26:

Big Data Pipelining: Demo

Steps 3.  Install Docker

$ sudo apt-get update
$ sudo apt-get install docker.io     # on Ubuntu the Docker package is docker.io
$ sudo usermod -aG docker <myuserid>

Log out/in

$ docker run hello-world

Page 27:

Big Data Pipelining: Demo

Steps 4.  Clone the Pipeline repo

$ mkdir ~/pipeline
$ cd ~/pipeline
$ git clone https://github.com/distributed-freaks/pipeline.git

Page 28:

Big Data Pipelining: Demo

Steps 5.  Pull the Pipeline image

$ docker pull xtordoir/pipeline

Page 29:

Big Data Pipelining: Demo

Steps 6.  Run the Pipeline image as a container

$ docker run -it -m 8g \
    -p 30080:80 -p 34040-34045:4040-4045 -p 9160:9160 -p 9042:9042 \
    -p 39200:9200 -p 37077:7077 -p 36060:6060 -p 36061:6061 \
    -p 32181:2181 -p 38090:8090 -p 38099:8099 -p 30000:10000 \
    -p 30070:50070 -p 30090:50090 -p 39092:9092 -p 36066:6066 \
    -p 39000:9000 -p 39999:19999 -p 36081:6081 -p 35601:5601 \
    -p 37979:7979 -p 38989:8989 \
    xtordoir/pipeline bash

Page 30:

Big Data Pipelining: Demo

Steps 7.  Run the demo setup script in the container

$ cd pipeline
$ source devoxx-setup.sh     # ignore Cassandra errors

Run cqlsh

Page 31:

Big Data Pipelining: Demo

Steps 8.  Run the demo in the host browser

http://localhost:39000/tree/pipeline

Page 32:

Page 33:

Thank you!

Page 34:

Big Data Pipelining: Appendix

RDD/Cassandra Reference

Page 35:

Spark: RDD

How Do You Create An RDD? 1.  From an existing collection:

[code screenshot, annotated ‘action’]
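The slide's code screenshot did not survive the transcript; a minimal Spark-shell sketch of this pattern, assuming the predefined sc SparkContext and made-up data, might look like:

```scala
// Create an RDD from an existing local Scala collection.
val data = List(1, 2, 3, 4, 5)
val rdd  = sc.parallelize(data)

rdd.collect()   // 'action': returns Array(1, 2, 3, 4, 5) to the driver
```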

Page 36:

Spark: RDD

How Do You Create An RDD? 2.  From a text file:

[code screenshot, annotated ‘action’]
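Again the screenshot is missing; a sketch of the same idea, with an illustrative file path, would be:

```scala
// Create an RDD[String] from a text file, one element per line.
val lines = sc.textFile("/data/sample.txt")

lines.count()   // 'action': returns the number of lines in the file
```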

Page 37:

Spark: RDD

How Do You Create An RDD? 3.  From data in a Cassandra database:

[code screenshot, annotated ‘action’]
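A sketch with the Spark Cassandra Connector, using illustrative keyspace and table names:

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as an RDD of CassandraRow.
val rdd = sc.cassandraTable("demo", "users")

rdd.first()   // 'action': fetches one CassandraRow from the cluster
```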

Page 38:

Spark: RDD

How Do You Create An RDD? 4.  From an existing RDD:

[code screenshot, annotated ‘action’ and ‘transformation’]
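A sketch of deriving one RDD from another; the words are illustrative:

```scala
// A 'transformation' derives a new, immutable RDD from an existing one;
// nothing executes until an 'action' is called on the chain.
val words   = sc.parallelize(Seq("spark", "cassandra", "akka"))
val lengths = words.map(_.length)   // 'transformation': lazy

lengths.collect()                   // 'action': Array(5, 9, 4)
```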

Page 39:

Spark: RDD’s & Cassandra

Accessing Data As An RDD

[code screenshot, annotated ‘action’ and RDD method]

Page 40:

Spark: Filtering Data In Cassandra

Server-side Selection
•  Reduce the amount of data transferred
•  Selecting rows (by clustering columns and/or secondary indexes)
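The connector's where() method pushes the predicate down to Cassandra so that only matching rows cross the network. This sketch uses illustrative names; the filtered column is assumed to be a clustering column or secondary index:

```scala
import com.datastax.spark.connector._

// Server-side selection: the CQL predicate runs inside Cassandra.
val recent = sc.cassandraTable("demo", "events")
  .where("event_time > ?", "2015-01-01")

recent.count()
```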

Page 41:

Spark: Saving Data In Cassandra

Saving Data •  saveToCassandra
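A sketch of saveToCassandra; keyspace, table and column names are illustrative, and the table is assumed to already exist:

```scala
import com.datastax.spark.connector._

// Write an RDD of tuples into a Cassandra table, mapping the tuple
// fields onto the named columns.
val scores = sc.parallelize(Seq(("ann", 10), ("bob", 7)))
scores.saveToCassandra("demo", "scores", SomeColumns("name", "score"))
```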

Page 42:

Spark: Using SparkSQL & Cassandra

You Can Also Access Cassandra Via SparkSQL!
•  The Spark Conf object can be used to create a Cassandra-aware Spark SQL context object
•  Use regular CQL syntax
•  Cross-table operations: joins, unions etc.!
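With the 1.x connector this Cassandra-aware context was CassandraSQLContext; a sketch with illustrative keyspace and table names:

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Cassandra-aware SQL context built from an existing SparkContext.
val cc = new CassandraSQLContext(sc)

// Regular SQL over Cassandra tables, including cross-table joins.
val top = cc.sql("SELECT name, score FROM demo.scores WHERE score > 5")
top.show()
```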

Page 43:

Spark: Streaming Data

Spark Streaming
•  High-velocity data: IoT, sensors, Twitter etc.
•  Micro-batching
•  Each batch represented as an RDD
•  Fault tolerant
•  Exactly-once processing
•  Represents a unified stream and batch processing framework

Page 44:

Spark: Streaming Data Into Cassandra

Streaming Example
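The deck's streaming example itself is not in the transcript; one plausible sketch of the pattern, micro-batching a socket stream and saving each batch's word counts to Cassandra, follows. The host, port, keyspace and table names are illustrative assumptions:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

// Micro-batch the stream every 5 seconds; each batch is an RDD.
val ssc   = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)

// Count words per batch and write the counts to Cassandra.
lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("demo", "word_counts", SomeColumns("word", "count"))

ssc.start()
ssc.awaitTermination()
```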