robust data pipelines with drake and docker talks/tamas szilagyi.pdf · robust data pipelines with...

15
Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com

Upload: others

Post on 19-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

Robust Data Pipelineswith drake and Docker

@tudosgar | tamaszilagyi.com

Page 2: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

�2

The end goal for most data analytics projects is to build a data product, whether it is a weekly dashboard or deploying a ML model.

From getting and cleaning the data to generating plots or fitting models, each project can be split up into individual tasks.

Afterwards, we want to make sure that the environment we wrote our code in can be replicated.

The low-down

@tudosgar | tamaszilagyi.com

Page 3: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

�3

Automation. Reproducibility.

@tudosgar | tamaszilagyi.com

Page 4: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

Workflow management tool that enables you to build managed workflows. It is a first serious attempt to create a tool like Airflow or Luigi, but in R

@tudosgar | tamaszilagyi.com

building pipelines

Page 5: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

�5

Your analysis is a sequence of transformations.

@tudosgar | tamaszilagyi.com

Page 6: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

write a plan.

A plan can be simple, or very complex. It all depends on how you define the tasks’ dependencies.

@tudosgar | tamaszilagyi.com

Page 7: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

not only targets are kept track of.

By default we can visualise all imports, functions and transformations in our DAG, not only completed tasks. As a nice bonus we can also monitor progress for longer jobs.

@tudosgar | tamaszilagyi.com

Page 8: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

make(my_plan)

Checks dependencies and cache before creating plan. This means that on subsequent runs, only the changed tasks will rerun, leaving the rest intact.

@tudosgar | tamaszilagyi.com

Page 9: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

�9

A container is kinda like a VM but..different.

@tudosgar | tamaszilagyi.com

Page 10: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

benefits of containers

- Easily reproduce your infrastructure

- Runs independent of host OS

- Consistency in production

@tudosgar | tamaszilagyi.com

Page 11: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

step 1: Dockerfile

A Dockerfile includes, in backwards order:

- Your script

- Package dependencies of your script

- System level dependencies of these packages

@tudosgar | tamaszilagyi.com

Page 12: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

step 2: build image

This is the stage where we actually build our mini computer, ready for deployment.

@tudosgar | tamaszilagyi.com

Page 13: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

step 3: run container

Instantiates the container, mounts the folder from our host where the data resides and runs our executable as defined in the Dockerfile.

@tudosgar | tamaszilagyi.com

Page 14: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

http://tamaszilagyi.com/ @tudosgar

@tudosgar | tamaszilagyi.com

Find me

Page 15: Robust Data Pipelines with drake and Docker talks/Tamas Szilagyi.pdf · Robust Data Pipelines with drake and Docker @tudosgar | tamaszilagyi.com. 2 The end goal for most data analytics

Thank you.@tudosgar | tamaszilagyi.com