JUC Europe 2015: Jenkins Pipeline for Continuous Delivery of Big Data Projects


Upload: cloudbees

Post on 15-Jan-2017

Gerrit and Jenkins for Big Data Continuous Delivery

London, UK, June 2015

www.gerritforge.com

#jenkinsconf

About GerritForge

•  Founded in 2009 in London
•  Committed to OpenSource

The Team

Luca Milanesio
•  Co-founder and Director of GerritForge
•  Over 20 years in Agile Development and ALM
•  Open source contributor to many projects (BigData, Continuous Integration, Git/Gerrit)

Antonios Chalkiopoulos
•  Author of Programming MapReduce with Scalding
•  Open source contributor to many BigData projects
•  Working on the "land-of-Hadoop" (landoop.com)

The Team (2)

Tiago Palma
•  Data Warehouse & Big Data Development
•  Senior Data Modeler
•  Big Data infrastructure specialist

Stefano Galarraga
•  20 years of Agile Development
•  Middleware, Big Data, Reactive Distributed Systems
•  Open source contributor to many BigData projects

Agenda

•  Why continuous deployment on BigData?
•  Our Development Lifecycle ingredients
    –  Gerrit, Jenkins, Mesos, Marathon, CDH / Spark
•  Topics to address in BigData development
    –  Type of tests (Unit vs. Integration)
    –  Testing the "real thing" (aka the Cluster)
•  Our BigData virtualised infrastructure
    –  Marathon, Mesos and Dockers all around
•  Live (minimised) Demo

WHY?

•  Early BigData had no process at all, so it could fail at any time
•  Mature BigData is a mission-critical decision maker
•  Need for more stable software engineering methodologies:
    –  Test-Driven Development (Stefano's ScaldingUnit)
    –  Continuous Integration with Jenkins
    –  Integration & Performance testing
    –  Code review and validation

Code-Review BigData Lifecycle (1)

•  Git used by distributed teams (UK, Israel, India)
•  Topics and Code Review
•  Jenkins build on every patch-set
•  Commits reviewed / approved via Gerrit Submit

Code-Review BigData Lifecycle (2)

Code-Review BigData Lifecycle (3)

•  Submitting a Topic automatically:
    –  merges all patch-sets (semi-atomically)
    –  triggers a longer chain of CI steps
    –  promotes a RC if everything passes
•  Jenkins automation via Gerrit Trigger Plugin
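
The lifecycle above amounts to a small routing rule: a new patch-set gets a quick build, a merged change gets the longer chain, and a release candidate is promoted only if every step passes. The sketch below illustrates that rule in Python. The event types `patchset-created` and `change-merged` are the ones Gerrit emits on its event stream; the job names and the `run_job` callback are hypothetical. In the talk this wiring is done by the Gerrit Trigger Plugin, not by custom code.

```python
from typing import Callable, Dict, List

# Hypothetical job names; the slides do not name the real Jenkins jobs.
QUICK_JOBS = ["unit-tests"]
FULL_CHAIN = ["unit-tests", "integration-tests", "performance-tests"]


def jobs_for_event(event: Dict) -> List[str]:
    """Map a Gerrit stream event to the CI steps that should run."""
    event_type = event.get("type")
    if event_type == "patchset-created":
        return QUICK_JOBS      # fast feedback on every patch-set
    if event_type == "change-merged":
        return FULL_CHAIN      # longer chain after submit
    return []                  # ignore every other event type


def run_chain(jobs: List[str], run_job: Callable[[str], bool]) -> bool:
    """Run the steps in order and stop at the first failure.

    Returns True only when at least one step ran and all of them passed,
    which is the condition for promoting a release candidate.
    """
    for job in jobs:
        if not run_job(job):
            return False
    return bool(jobs)
```

Passing `run_job` in as a callback keeps the routing logic testable without a Jenkins server.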

Ingredients: Gerrit

•  Git-based Code Review system
•  Pre-commit review
•  Allows multiple validation steps (pipeline)
•  Validation + Integration flags
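
To make the "flags" concrete: in Gerrit a flag is a label with a score, and a build can vote on a patch-set through the REST "set review" endpoint. The sketch below only builds the URL and JSON body for such a vote. The `/a/changes/{change}/revisions/{revision}/review` path and the `message` / `labels` fields follow Gerrit's REST API; the host and the label name `Verified` are assumptions, and the deck's own Validation and Integration flags would be custom labels configured on the project.

```python
import json
from typing import Tuple
from urllib.parse import quote


def build_review(base_url: str, change_id: str, revision: str,
                 label: str, score: int, message: str) -> Tuple[str, str]:
    """Return (url, json_body) for voting on one revision of a change."""
    url = "{}/a/changes/{}/revisions/{}/review".format(
        base_url.rstrip("/"),
        quote(change_id, safe=""),   # change ids may contain '/' or '~'
        quote(revision, safe=""),
    )
    body = json.dumps({"message": message, "labels": {label: score}},
                      sort_keys=True)
    return url, body
```

The body would be sent as an authenticated POST with `Content-Type: application/json`. One label per validation step lets Gerrit block the submit until every step has voted positively.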

Ingredients: Jenkins

•  Plugins:
    –  Gerrit trigger
    –  Docker build step
    –  Post-build script plugin

Fitting CDH Into this Picture

•  Integration Test
    –  Running integration tests in a CDH-enabled Docker container
    –  Hadoop/local and Spark/standalone are not enough
    –  Need to test class serialisation
    –  Validate packaged fat-jars (library conflicts with CDH)
    –  Performance on a real cluster

Fitting CDH Into this Picture (2)

•  Acceptance / performance test with short-lived CDHs
•  Solution: Mesos, Marathon and Docker:
    –  Ephemeral clusters with defined capacity
    –  Automatic cluster-config
    –  All controlled via Docker/Mesos

Mesos + Marathon

•  Apache Mesos
    –  Abstracts CPU, memory, storage and other compute resources away from machines
•  Marathon Framework
    –  Runs on top of Mesos
    –  Guarantees that long-running applications never stop
    –  REST API for managing and scaling services
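
As an illustration of "managing and scaling services" through the REST API, the sketch below builds the JSON bodies for two Marathon v2 calls: creating a Docker-based app (`POST /v2/apps`) and changing its instance count (`PUT /v2/apps/{app_id}`). The field names follow the Marathon v2 app format. The app id, image name and resource sizes in the usage note are hypothetical and do not come from the talk.

```python
from typing import Dict


def docker_app(app_id: str, image: str, instances: int,
               cpus: float, mem_mb: int) -> Dict:
    """Body of POST /v2/apps: run `instances` copies of a Docker image."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,        # CPU share requested per instance
        "mem": mem_mb,       # memory per instance, in MB
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


def scale_body(instances: int) -> Dict:
    """Body of PUT /v2/apps/{app_id}: change the number of instances."""
    return {"instances": instances}
```

For example, `docker_app("/cdh/agent", "registry.example.com/cdh-agent", 3, 2.0, 4096)` describes three agent containers with a fixed capacity, and `scale_body(0)` is one way to stop them after the tests without deleting the app.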

CDH Components

•  CDH 5.4.1 distribution
    –  Apache Spark
    –  Hadoop HDFS
    –  YARN

Integration Test Flow on CDH Cluster

[Diagram: Jenkins Master, Mesos Master, Marathon, Private Docker Registry, Mesos Slave (Slave Host), Docker]

•  POST to Marathon REST API to start 1 Docker container with Cloudera Manager and N Docker containers with Cloudera agents
•  Marathon Framework receives resource offers from Mesos Master and submits the tasks
•  The task is sent to the Mesos Slave
•  Mesos Slave starts the Docker container
•  Docker image is fetched from Docker registry if not present on the Slave host
•  Waiting for Dockers
•  Dockers UP
•  Install Cloudera packages via Cloudera Manager API using Python
•  Deploy the ETL, run the ETL and the Integration Tests
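
The "Waiting for Dockers" and "Dockers UP" steps can be sketched as a polling loop. The version below assumes the status comes from Marathon's `GET /v2/apps/{app_id}`, whose response carries `instances` and `tasksRunning` under the `app` key. The `fetch_status` callback, the timeout and the poll interval are hypothetical, and the slides do not say how the real job waits. A running task is also not proof that Cloudera Manager inside the container is ready, so a real job would probably add a check against the Cloudera Manager API.

```python
import time
from typing import Callable, Dict


def all_tasks_running(app_status: Dict) -> bool:
    """True when Marathon reports as many running tasks as requested."""
    app = app_status.get("app", {})
    wanted = app.get("instances", 0)
    return wanted > 0 and app.get("tasksRunning", 0) >= wanted


def wait_for_dockers(fetch_status: Callable[[], Dict],
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until all containers are up; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if all_tasks_running(fetch_status()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a Mesos cluster. In the job, `fetch_status` would wrap the HTTP call to Marathon.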

Unit and Integration Tests sample

•  Test project:
    –  Test Spark project
    –  ETL from Oracle to HDFS
•  Unit-test directly on Spark logic
•  Integration tests for every patch-set:
    –  VERY small dataset just for this demo
    –  CDH and Oracle Docker Images
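
One way to unit-test the logic directly is to keep each record transformation as a plain function, so it can be tested without a cluster and then passed unchanged to the Spark job. The sketch below is in Python for illustration only. The column names `CUSTOMER_ID` and `AMOUNT` are hypothetical, and the demo project's actual code and language are not shown in the slides.

```python
import unittest
from typing import Dict, Optional


def clean_row(row: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Normalise one source record; return None for rows to be dropped."""
    customer = (row.get("CUSTOMER_ID") or "").strip()
    if not customer:
        return None
    try:
        amount = float(row.get("AMOUNT", ""))
    except (TypeError, ValueError):
        return None
    return {"customer_id": customer, "amount": amount}


class CleanRowTest(unittest.TestCase):
    def test_valid_row_is_normalised(self):
        self.assertEqual(
            clean_row({"CUSTOMER_ID": " 42 ", "AMOUNT": "9.5"}),
            {"customer_id": "42", "amount": 9.5},
        )

    def test_row_without_customer_is_dropped(self):
        self.assertIsNone(clean_row({"CUSTOMER_ID": "", "AMOUNT": "1"}))

    def test_non_numeric_amount_is_dropped(self):
        self.assertIsNone(clean_row({"CUSTOMER_ID": "7", "AMOUNT": "n/a"}))
```

The tests run with `python -m unittest`. In a PySpark job the same function could be applied with `rdd.map(clean_row).filter(lambda r: r is not None)`, which leaves serialisation and cluster behaviour to the integration tests.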

Unit and Integration Tests

[Diagram: Jenkins, Spark Standalone, Hadoop Pseudo-distributed mode; labels: Build Job init, Submit job, Init/read HDFS]

DEMO: Small-scale BigData Delivery Pipeline

References

•  Demo sources https://github.com/GerritForge

•  Blog: https://gitenterprise.me

•  Twitter: @GerritReview @GitEnterprise @GerritForge

•  Learn Gerrit Code Review book: GerritHub.io/book

•  Get in touch with GerritForge: [email protected]
