a data layer in clojure

Post on 29-Jan-2018


A data layer in Clojure

@sbelak simon@goopti.com

• Started in machine learning

• Turned to data science and helped 20+ companies become data-driven

• Now leading data science department at GoOpti

Self-service infrastructure for data scientists

The analytics chasm

Ideal. Almost real-time, can be done during brainstorming without disrupting flow

[Figure: time-to-answer scale with the labels "< 2min", "< 20min" and "project", annotated "squeeze in somewhere in the day", "fail" and "roadmap ahoy!"]

My goto architecture

[Diagram: DB, Events, Kafka and three Onyx nodes]

Persist all events to S3

• time travel

• query with AWS Athena

Onyx

a masterless, cloud scale, fault tolerant, high performance distributed computation system

… written entirely in Clojure

Clojure at a glance

• Lisp running on JVM

• Functional, dynamic, immutable

• Excellent concurrency and state management support

• Unparalleled data manipulation

• Good Java interoperability

Onyx at

• In production for almost a year

• ETL

• online machine learning

• offline (batch) machine learning

• ad-hoc analysis

Onyx at a glance

Job = workflow + flow conditions + catalogue

Workflow

    [[:input :processing-1]
     [:input :processing-2]
     [:processing-1 :output-1]
     [:processing-2 :output-2]]

Flow conditions

    [{:flow/from :input-stream
      :flow/to [:process-adults]
      :flow/predicate :my.ns/adult?
      :flow/doc "Emits segment if an adult."}]

Catalogue

    [{:onyx/name :add-5
      :onyx/fn :my/adder
      :onyx/type :function
      :my/n 5
      :onyx/params [:my/n]}

     {:onyx/name :in
      :onyx/plugin :onyx.plugin.core-async/input
      :onyx/type :input
      :onyx/medium :core.async
      :onyx/batch-size batch-size
      :onyx/max-peers 1
      :onyx/doc "Reads segments from a core.async channel"}

     {:onyx/name :out
      :onyx/plugin :onyx.plugin.core-async/output
      :onyx/type :output
      :onyx/medium :core.async
      :onyx/doc "Writes segments to a core.async channel"}]
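The flow condition above refers to a predicate, :my.ns/adult?, that the slides do not show. A minimal sketch of what it could look like, assuming Onyx's documented argument order for flow predicates (event, old segment, new segment, all new segments); the :age key and the threshold of 18 are made up for illustration:

```clojure
;; Hypothetical predicate behind :flow/predicate :my.ns/adult?.
;; Assumptions: segments carry an :age key, and adulthood starts at 18.
(defn adult?
  [_event _old-segment new-segment _all-new-segments]
  ;; A segment without :age is treated as age 0, so it is not routed.
  (>= (:age new-segment 0) 18))

(adult? nil {:id 1} {:id 1 :age 34} [])  ; true
(adult? nil {:id 2} {:id 2 :age 12} [])  ; false
```

Because the predicate is an ordinary function, it can be exercised at the REPL before any job is submitted.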

Catalogue

    [{:onyx/name :add-5
      :onyx/fn :my/adder
      :onyx/type :function
      :my/n 5
      :onyx/params [:my/n]}

     {:onyx/name :in
      :onyx/plugin :onyx.plugin.core-async/input
      :onyx/type :input
      :onyx/medium :core.async
      :onyx/batch-size batch-size
      :onyx/max-peers 1
      :onyx/doc "Reads segments from a core.async channel"}

     {:onyx/name :out
      :onyx/plugin :onyx.plugin.core-async/output
      :onyx/type :output
      :onyx/medium :core.async
      :onyx/doc "Writes segments to a core.async channel"}]

Vanilla Clojure function

    (defn adder [n {:keys [x] :as segment}]
      (assoc segment :x (+ n x)))
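Since the task function is plain Clojure, it can be called directly. The sketch below mimics by hand what the :onyx/params entry asks for, on the assumption that Onyx passes the values of the listed keys as leading arguments; the sample segment is invented:

```clojure
(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))

;; Catalogue entry from the slide; :my/n is listed in :onyx/params.
(def add-5
  {:onyx/name :add-5
   :onyx/fn :my/adder
   :onyx/type :function
   :my/n 5
   :onyx/params [:my/n]})

;; Look up the parameter values in the entry itself: (5)
(def params (map add-5 (:onyx/params add-5)))

;; Call the function with the parameters first and the segment last.
(apply adder (concat params [{:x 1 :id 42}]))
;; => {:x 6, :id 42}
```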

Plugins (I/O)

seq, async, Kafka, Datomic, SQL, S3, SQS, …

parameter

self-documenting

Computation entirely described with data

data is code!

Everything can be run locally!

Testing without mocking
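To illustrate why a job described as data is easy to test, here is a toy single-process interpreter for a linear workflow. It is not Onyx's API: run-local, task-fn, the fns lookup table and the :double task are all hypothetical, and real Onyx jobs are submitted to peers rather than interpreted like this.

```clojure
;; A linear workflow and a catalogue, written as plain data.
(def workflow [[:in :add-5] [:add-5 :double] [:double :out]])

(def catalogue
  [{:onyx/name :add-5  :onyx/fn :my/adder :my/n 5 :onyx/params [:my/n]}
   {:onyx/name :double :onyx/fn :my/doubler}])

;; Plain functions standing in for the keywords under :onyx/fn.
(def fns
  {:my/adder   (fn [n segment] (update segment :x + n))
   :my/doubler (fn [segment] (update segment :x * 2))})

(defn task-fn
  "Builds a segment -> segment function from a catalogue entry."
  [entry]
  (let [f      (fns (:onyx/fn entry))
        params (map entry (:onyx/params entry))]
    (fn [segment] (apply f (concat params [segment])))))

(defn run-local
  "Follows a linear workflow from :in to :out, applying each task in turn."
  [workflow catalogue segments]
  (let [next-task (into {} workflow) ; task -> downstream task
        by-name   (into {} (map (juxt :onyx/name identity)) catalogue)]
    (loop [task (next-task :in), segments segments]
      (if (= task :out)
        (vec segments)
        (recur (next-task task)
               (map (task-fn (by-name task)) segments))))))

(run-local workflow catalogue [{:x 1} {:x 2}])
;; => [{:x 12} {:x 14}]
```

The input is a literal vector and the output is a value to compare against, so nothing needs to be mocked.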

Resilience and handling state

• Activity log

• Window and trigger states checkpointed

• Resume points

• Configurable flux policies

How Onyx rewired my brain

It’s not about scaling, but clean architecture

Decomplect everything

Computation graphs

Machine learning with Onyx

• Hyperparameter server built on top of Onyx parameters

• Batch & streaming mode

• Monitoring: performance metrics, side channels for partial results/introspection into computation

• Everything is data so easy to build tools around

Onyx/Pyroclast

Putting “data is code” to work

Describing data with clojure.spec

composing smaller parts into the whole

code is data!
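A minimal sketch of composing small specs into a larger one with clojure.spec. The :shop namespace, the attribute names and the predicates are assumptions chosen for illustration, not the specs used in the talk:

```clojure
(require '[clojure.spec.alpha :as s])

;; Small parts: one spec per attribute.
(s/def :shop/account-age nat-int?)
(s/def :shop/country string?)
(s/def :shop/promo-code string?)

;; Composed into larger wholes with s/keys.
(s/def :shop/user (s/keys :req [:shop/account-age :shop/country]))
(s/def :shop/order (s/keys :req [:shop/user] :opt [:shop/promo-code]))

(s/valid? :shop/order
          {:shop/user {:shop/account-age 3 :shop/country "SI"}})
;; => true

;; The required :shop/user key is missing.
(s/valid? :shop/order {:shop/promo-code "WINTER"})
;; => false
```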

Queryable data descriptions

Turn spec into a graph

A fully interactive and open type system!

[Figure: spec graph with the nodes "order", "promo code", "user", "account age" and "country"; edges are labelled "always" (three times) and "maybe" (once)]
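One way to turn specs into a graph is to read the registry and inspect the form of each s/keys spec. The sketch below is an assumption about how this could be done, not the speaker's implementation: the :shop specs, keys-edges and spec-graph are hypothetical, and required keys are labelled :always while optional keys are labelled :maybe.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical specs to walk.
(s/def :shop/account-age nat-int?)
(s/def :shop/country string?)
(s/def :shop/promo-code string?)
(s/def :shop/user (s/keys :req [:shop/account-age :shop/country]))
(s/def :shop/order (s/keys :req [:shop/user] :opt [:shop/promo-code]))

(defn keys-edges
  "Returns [from to label] edges for one registry entry,
  or nil when the entry is not an s/keys spec."
  [[spec-name spec]]
  (let [form (s/form spec)]
    (when (and (seq? form) (= `s/keys (first form)))
      (let [{:keys [req opt]} (apply hash-map (rest form))]
        (concat (for [k req] [spec-name k :always])
                (for [k opt] [spec-name k :maybe]))))))

(defn spec-graph
  "Collects the edges of all keyword-named specs in the given namespace."
  [ns-name]
  (->> (s/registry)
       (filter (fn [[k _]] (and (keyword? k) (= ns-name (namespace k)))))
       (mapcat keys-edges)
       set))

(spec-graph "shop")
;; => #{[:shop/order :shop/user :always]
;;      [:shop/order :shop/promo-code :maybe]
;;      [:shop/user :shop/account-age :always]
;;      [:shop/user :shop/country :always]}
```

The result is an ordinary set of edges, so it can be queried, filtered or handed to a graph library like any other data.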

“Composition is about decomposing.”

— E. Normand

Case study: autogenerating materialised views

[Diagram: Events, External data, Kafka, three Onyx nodes and Materialised views]

Automatic view generation

• Event & attribute ontology

    • Manual (via spec)

    • Inferred

• Statistical analysis (seasonality detection, outlier removal, …)

Automatic view generation

1. Walk spec registry

2. Apply rules

    1. Define new view (spec)

    2. Trigger Onyx job that creates the view

Takeouts

Everything should be live and interactive

Computation graphs are a great way to structure data processing code

Queryable data and computation descriptions supercharge interactive development and are a great building block for automation

Questions

@sbelak

simon@goopti.com
