![Page 1: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/1.jpg)
DataOps with Project Amaterasu
Yaniv Rodenski, Karel Alfonso
![Page 2: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/2.jpg)
What Data Pipelines Are Made Of
• Big Data applications:
• Ingestion
• Storage
• Processing
• Serving
• Workflows
• Machine learning
• Data Sources and Destinations
• Tests?
• Schemas??
![Page 3: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/3.jpg)
Archetypes of Data Pipeline Builders
Data People (Data Scientists/Analysts/BI Devs)
• Exploratory workloads
• Data centric
• Simple deployment
Software Developers
• Code centric
• Heavy on methodologies
• Heavy tooling
• Very complex deployment
![Page 4: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/4.jpg)
Making Big Data Teams Scale
• Scaling teams is hard
• Scaling Big Data teams is harder
• Different mentality between data professionals/engineers
• Mixture of technologies
• Data as integration point
• Often schema-less
• Lack of tools
![Page 5: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/5.jpg)
Continuous Delivery
• Keep software in a production ready state
• Test all the changes: unit, integration
• Exercise deployments
• Faster feedback cycle
![Page 6: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/6.jpg)
DevOps & Collaboration
• No silos
• Autonomous teams
• Feedback
• Automation
• Build quality in
• Shared responsibility
![Page 7: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/7.jpg)
The case for CI/CD/DevOps in Big Data Projects
• Coordination: data engineers, analysts, business, ops
• Integrate and test critical jobs
• Complex infrastructure: multiple distributed systems
• Need to decouple cluster operation via APIs/DSLs
• DevOps team to manage cluster operations: scaling, monitoring, deployment.
• Include CI/CD practices as part of the delivery process.
![Page 8: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/8.jpg)
![Page 9: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/9.jpg)
How are these techniques applicable to
Big Data applications?
![Page 10: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/10.jpg)
What Do We Need for Deploying our apps?
• Source control system: Git, Hg, etc
• CI process to run tests and package app
• A repository to store packaged app
• A repository to store configuration
• An API/DSL to deploy to the cluster
• Mechanism to monitor the behaviour and performance of the app
![Page 11: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/11.jpg)
Who are we? Software developers with years of Big Data experience
What do we want? Simple and robust way to deploy Big Data applications
How will we get it? Write thousands of lines of code on top of Mesos
![Page 12: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/12.jpg)
Amaterasu - Simple Continually Deployed Data Apps
• Amaterasu is the Shinto goddess of the sun
• In the Japanese manga series Naruto, Amaterasu is a supernatural power in the shape of a black flame that can only be put out by its sender
• Started as a framework to reliably execute Spark driver programs
![Page 13: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/13.jpg)
Amaterasu - Simple Continually Deployed Data Apps
• Big Data apps in Multiple Frameworks (Currently Only Spark is Supported)
• Multiple Languages (soon)
• Workflow as YAML
• Simple to Write, easy to deploy
• Reliable execution (via Mesos)
• Multiple Environments
![Page 14: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/14.jpg)
Big Data Pipeline Ops Requirements
• Support managing multiple distributed technologies: Apache Spark, HDFS, Kafka, Cassandra, etc.
• Treat the data center as the OS while providing resource isolation, scalability and fault tolerance.
• Ability to run multiple tasks per machine to maximize utilization
![Page 15: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/15.jpg)
Why Mesos?
• General purpose, battle tested cluster resource scheduler.
• Can run major modern Big Data systems: Hadoop, Spark, Kafka, Cassandra
• Can deploy Spark as part of the execution
• Supports scheduled and long running apps.
• Improves resource management and efficiency
• Great APIs
• DC/OS provides an even richer environment
![Page 16: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/16.jpg)
Amaterasu Repositories
• Jobs are defined in repositories
  • Current implementation: git repositories
  • Local directory support is planned for a future release
• Repo structure:
  • maki.yml - the workflow definition
  • src - a folder containing the actions (Spark scripts, etc.) to be executed
  • env - a folder containing configuration per environment
• Benefits of using git:
  • Branching
  • Tooling
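The repository layout on this slide can be checked mechanically before a job is submitted. The following is a minimal sketch and not part of Amaterasu: `RepoLayout` and `missingEntries` are hypothetical names, and the only assumption is the three entries listed above (`maki.yml`, `src`, `env`).

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Amaterasu: checks that a checked-out
// job repository contains the entries listed on the slide.
object RepoLayout {
  // Entry name -> whether it is expected to be a directory.
  val expected: Seq[(String, Boolean)] = Seq(
    "maki.yml" -> false, // the workflow definition
    "src"      -> true,  // the actions to execute
    "env"      -> true   // configuration per environment
  )

  // Returns the names of expected entries that are missing or of the wrong kind.
  def missingEntries(repoRoot: Path): Seq[String] =
    expected.collect {
      case (name, isDir) if !matches(repoRoot.resolve(name), isDir) => name
    }

  private def matches(p: Path, isDir: Boolean): Boolean =
    if (isDir) Files.isDirectory(p) else Files.isRegularFile(p)
}
```

An empty result from `RepoLayout.missingEntries(root)` means the checkout has the expected shape; a CI step could fail the build otherwise.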
![Page 17: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/17.jpg)
Workflow DSL - maki.yml
---
job-name: amaterasu-test
flow:
  # Actions
  - name: start
    type: spark-scala
    file: file.scala
  - name: step2
    type: spark-scala
    file: file2.scala
    # Error handling actions
    error:
      name: handle-error
      type: spark-scala
      file: cleanup.scala
...
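One way to read this workflow is as an ordered list of actions, each with an optional error handling action. The sketch below models that reading in plain Scala. The types `Action`, `Job` and `FlowRunner` are hypothetical and are not Amaterasu's classes, and the failure semantics (run the error action, then stop the flow) are an assumption, since the slide does not spell them out.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical model of a maki.yml flow; these types are illustrative and
// are not Amaterasu's own classes.
final case class Action(
  name: String,
  actionType: String,
  file: String,
  error: Option[Action] = None // optional error handling action
)

final case class Job(name: String, flow: List[Action])

object FlowRunner {
  // Runs the actions in order using the supplied `execute` function.
  // Assumption: when an action fails, its error action (if any) is run and
  // the remaining actions are skipped.
  // Returns the names of the actions attempted, in order.
  def run(job: Job)(execute: Action => Try[Unit]): List[String] = {
    @annotation.tailrec
    def loop(remaining: List[Action], attempted: List[String]): List[String] =
      remaining match {
        case Nil => attempted.reverse
        case action :: rest =>
          execute(action) match {
            case Success(_) => loop(rest, action.name :: attempted)
            case Failure(_) =>
              val handled = action.error.toList.map { handler =>
                execute(handler) // best effort; its outcome is ignored here
                handler.name
              }
              (handled.reverse ::: action.name :: attempted).reverse
          }
      }
    loop(job.flow, Nil)
  }

  // The job from the slide, expressed with the types above.
  val example: Job = Job(
    "amaterasu-test",
    List(
      Action("start", "spark-scala", "file.scala"),
      Action("step2", "spark-scala", "file2.scala",
        error = Some(Action("handle-error", "spark-scala", "cleanup.scala")))
    )
  )
}
```

Passing `execute` in as a function keeps the model independent of how an action is actually launched, so the ordering and error handling can be exercised without a cluster.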
![Page 18: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/18.jpg)
Amaterasu is not a workflow engine, it’s a deployment tool that understands that Big
Data applications are rarely deployed independently of other Big Data applications
![Page 19: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/19.jpg)
Actions DSL
• Your Spark code, in Scala (other languages in the future)
• A few changes:
  • Don’t create a new sc/sqlContext; use the one in scope, or access it via AmaContext.sc and AmaContext.sqlContext
  • AmaContext.getDataFrame and AmaContext.getRDD are used to access data from previously executed actions
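The idea of reading data from a previously executed action can be illustrated without Spark. The sketch below is not Amaterasu's implementation: `ResultStore` and `Actions` are hypothetical names, and plain Scala collections stand in for RDDs. It only shows the lookup pattern, where a value is addressed by the name of the action that produced it and the name of the value.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the idea behind AmaContext.getRDD and
// AmaContext.getDataFrame: results are looked up by the name of the action
// that produced them and the name of the value. Plain Scala collections are
// used instead of Spark RDDs so the sketch runs without a cluster.
object ResultStore {
  private val results = mutable.Map.empty[(String, String), Seq[Int]]

  // Called after an action finishes to publish one of its values.
  def put(action: String, name: String, value: Seq[Int]): Unit =
    results((action, name)) = value

  // Called by a later action to read a value from an earlier one.
  def get(action: String, name: String): Option[Seq[Int]] =
    results.get((action, name))
}

object Actions {
  // Action "start": builds a collection and publishes it as "rdd".
  def start(): Unit = {
    val data = Seq(1, 2, 3, 4, 5)
    ResultStore.put("start", "rdd", data)
  }

  // A later action: reads "rdd" from "start" and keeps the odd values.
  def step2(): Seq[Int] =
    ResultStore.get("start", "rdd").getOrElse(Seq.empty).filter(_ % 2 != 0)
}
```

Calling `Actions.start()` and then `Actions.step2()` mirrors the two-action flow: the second action never builds its own input, it only asks for what the first one published.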
![Page 20: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/20.jpg)
Actions DSL (in action)
Action 1 ("start")
import io.shinto.amaterasu.runtime._
val data = Array(1, 2, 3, 4, 5)
val x = data.tail
val rdd = AmaContext.sc.parallelize(data)
val odd = rdd.filter(n => n % 2 != 0)
Action 2
import io.shinto.amaterasu.runtime._
// Read the RDD named "rdd" produced by the "start" action and keep the odd values.
val oddRdd = AmaContext.getRDD[Int]("start", "rdd").filter(x => x % 2 != 0)
oddRdd.take(5).foreach(println)
val highNoDf = AmaContext.getDataFrame("start", "odd").where("_1 > 3")
highNoDf.write.json("file:///tmp/test1")
![Page 21: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/21.jpg)
Environments
• Configuration is stored per environment
• Stored as JSON
• Contains:
• Spark master URI
• Input/output path
• Work dir
• User defined key-values
![Page 22: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/22.jpg)
production.json
{
  "name": "production",
  "sparkMasterUrl": "mesos://server1:5050",
  "inputPath": "hdfs://hdfsprd:9000/user/amaterasu/input",
  "outputPath": "hdfs://hdfsprd:9000/user/amaterasu/output",
  "workingDir": "alluxio://server3:19998/",
  "configuration": {
    "spark.cassandra.connection.host": "cassie-prod",
    "sourceTable": "documents"
  }
}
![Page 23: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/23.jpg)
dev.json
{
  "name": "test",
  "sparkMasterUrl": "local[*]",
  "inputRootPath": "file:///tmp/input",
  "outputRootPath": "file:///tmp/output",
  "workingDir": "file:///tmp/work",
  "configuration": {
    "spark.cassandra.connection.host": "127.0.0.1",
    "sourceTable": "documents"
  }
}
![Page 24: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/24.jpg)
Environments in the Actions DSL
import io.shinto.amaterasu.runtime._
val oddRdd = AmaContext.getRDD[Int]("start", "rdd").filter(x => x % 2 != 0)
oddRdd.take(5).foreach(println)
val highNoDf = AmaContext.getDataFrame("start", "x").where("_1 > 3")
// Write to the output path configured for the current environment.
highNoDf.write.json(Env.outputPath)
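The point of `Env.outputPath` is that the same action code writes to a different location in each environment. The sketch below illustrates this in plain Scala. `Environment`, `Environments` and `outputFile` are hypothetical names rather than Amaterasu's API, and the values are copied from the production.json and dev.json slides.

```scala
// Hypothetical model of a per-environment configuration; only the fields
// needed for the illustration are included.
final case class Environment(
  name: String,
  sparkMasterUrl: String,
  outputPath: String
)

object Environments {
  // Values taken from the production.json slide.
  val production: Environment = Environment(
    name = "production",
    sparkMasterUrl = "mesos://server1:5050",
    outputPath = "hdfs://hdfsprd:9000/user/amaterasu/output"
  )

  // Values taken from the dev.json slide (which names the environment "test"
  // and uses the key outputRootPath for the output location).
  val dev: Environment = Environment(
    name = "test",
    sparkMasterUrl = "local[*]",
    outputPath = "file:///tmp/output"
  )

  // The "action": it never hard-codes a location, it only asks the
  // environment where output should go.
  def outputFile(env: Environment, fileName: String): String =
    env.outputPath.stripSuffix("/") + "/" + fileName
}
```

With this shape, promoting a job from development to production changes only which `Environment` value is passed in, not the action code.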
![Page 25: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/25.jpg)
Future Development
• Continuous integration and test automation
• R, shell and Python support (R is already in progress)
• Extend environments to support:
• Full Spark configuration (spark-defaults.conf, etc.)
• Extendable configuration model
• Better tooling
• DC/OS universe package
• Other frameworks: Flink, Vowpal Wabbit
• YARN?
![Page 26: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/26.jpg)
Getting started
• Amaterasu + demos: https://github.com/shintoio/
• Slack: http://shintoio.slack.com
![Page 27: DataOps with Project Amaterasu](https://reader030.vdocument.in/reader030/viewer/2022020314/586fdea01a28ab18428b6c79/html5/thumbnails/27.jpg)
Thank you!