When we Spark and when we don’t: ML Pipeline Development at Stitch Fix

Posted: 19-Jul-2020

TRANSCRIPT

Page 1: When we Spark and when we don’t - QCon.ai

When we Spark and when we don’t:

ML Pipeline Development at Stitch Fix

Page 2:

Talk Flow

● What is Stitch Fix?

● Infrastructure and Tech Stack

● Thoughts on Good Practices for Developing ML Pipelines

● Case Study: Inventory Recommendation Models

● Tooling & Abstractions at Stitch Fix

Page 3:

Share your style, size, and price preferences with your personal stylist.

Get 5 hand-selected pieces of clothing delivered to your door.

Try your Fix on in the comfort of your home.

Leave feedback and pay for only the items you keep.

Return the other items in the envelope provided.

Stitch Fix

Page 4:

There’s an algorithm for that...

Styling Algorithms

Client/Stylist Matching

Demand Modeling

Human Computation

Pick Path Optimization

New Style Development

Inventory Allocation

State Machines

Warehouse Assignment

Batch Picking

Replenishment

* Find out more at http://algorithms-tour.stitchfix.com/

Page 5:

Our Infrastructure and Tech Stack

Page 6:

[Architecture diagram spanning Data Acquisition, Data Processing, Data Storage, and Data Management]

● Camera: state snapshots
● AWS S3: data storage (prod and dev/research)
● Metastore + Bumblebee: metadata manager
● Flotilla: job execution on AWS ECS clusters
● Uhura: workflow management

Page 7:

Some facts

● 1000s of jobs / day

○ Model training, featurization, test analysis, reporting, analytics, ad-hoc research

● Production jobs run on

○ Spark: mostly Spark SQL and PySpark

○ Flotilla: Python or R in Docker containers on ECS

● ML pipelines typically consist of several jobs spanning the stack of technologies

● Data scientists own pipelines and implementations end-to-end

Page 8:

Good Practices for Developing ML Pipelines

Page 9:

Pipelines should be designed to support constant iteration

○ Individual pipelines/algorithms/implementations change quickly

○ Tooling and infrastructure should be relatively stable

Page 10:

At scale, failure should be expected

○ Be robust to failure

■ Checkpointing

■ Isolation

■ Automated Retries

■ Alerting

○ Make it easy to debug and diagnose

○ We train 100s of models per day and expect some number of them to fail.
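The robustness bullets above (automated retries plus alerting) can be sketched in a few lines; the helper names here are illustrative, not Stitch Fix tooling:

```python
# Minimal sketch of retries with alerting (hypothetical helper names).
import time

def run_with_retries(job, max_attempts=3, alert=print, backoff=0.0):
    """Run job(); on failure, alert and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as e:
            alert(f"attempt {attempt} failed: {e}")
            if attempt == max_attempts:
                raise  # exhausted retries; surface the failure for diagnosis
            time.sleep(backoff * attempt)  # simple linear backoff
```

In practice `alert` would page or post to a channel rather than print, and the failed run's artifacts would be kept for debugging.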

Page 11:

Pipelines and jobs should be idempotent.
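One common way to get idempotence in a batch job is to write each run's output to a partition keyed by the run date, replacing any earlier output for that date instead of appending. A toy sketch of that pattern, with illustrative names (this is the general technique, not Stitch Fix's implementation):

```python
# Idempotent batch write: re-running for the same date yields the same
# output, because the job overwrites its date partition rather than appending.
import json
import os
import shutil

def run_job(run_date: str, records: list, output_root: str) -> str:
    """Write records under date=<run_date>, replacing any prior output."""
    partition = os.path.join(output_root, f"date={run_date}")
    tmp = partition + ".tmp"
    if os.path.exists(tmp):
        shutil.rmtree(tmp)
    os.makedirs(tmp)
    with open(os.path.join(tmp, "part-0.json"), "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    # Swap the finished output into place so readers never see partial data.
    if os.path.exists(partition):
        shutil.rmtree(partition)
    os.rename(tmp, partition)
    return partition
```

Because reruns converge to the same state, a failed pipeline can simply be restarted from the top without double-counting.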

Page 12:

Make pragmatic choices with respect to technology.

Page 13:

Case Study: Inventory Recommendation

Models

Page 14:

Algo_V1_1: Model by Inventory Department

[Pipeline diagram: a separate Extract Training Data → Train Model → Upload Model sequence runs for each inventory department's model]

Page 15:

[Pipeline diagram: User/Item Rating Data → Ingest → Extract “wide” Client Training Data and Extract “wide” Item Training Data → Model A/B/C/D Training Data → Train Model A/B/C/D → Upload Model A/B/C/D]

Page 16:

[Pipeline diagram, continued: Ingest feeds the “wide” Client and Item extracts; each model (A–D) gets its own training-data artifact, then its own Train and Upload steps]

Page 17:

Feature definitions:

    client_features: {
      "expanded_colors": { "in": ["client_colors"], "fn": "dummy_expand" },
      "X_Y_ratio":       { "in": ["X", "Y"], "fn": "compute_scaled_ratio" },
      …
    }
    item_features: {
      "expanded_print": { "in": ["colors"], "fn": "dummy_expand" }
    }
    interaction_features: {}

Model definitions:

    {
      "deptA": {
        "computed_features": ["example_feature"],
        "formula": ["s ~ 1 + f_a + shiny_material_flag + x_y_ratio"]
      },
      "deptB": {
        "computed_features": ["example_feature"],
        "formula": ["s ~ 1 + f_a + x_y_ratio + client_color_a + expanded_print_x"]
      }
    }

Extract jobs are generated from the resolution of Model + Feature Definitions.
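The resolution step can be sketched roughly as follows: parse each department's R-style formula, intersect its terms with the defined computed features, and emit one extract job per department. All names below are illustrative stand-ins, not Stitch Fix's actual definitions or tooling:

```python
# Sketch: resolve model formulas against feature definitions to decide
# which feature computations each department's extract job must run.
import re

# Hypothetical feature definitions: name -> inputs and computing function.
FEATURE_DEFS = {
    "x_y_ratio": {"in": ["X", "Y"], "fn": "compute_scaled_ratio"},
    "expanded_print_x": {"in": ["colors"], "fn": "dummy_expand"},
}

# Hypothetical model definitions: one R-style formula per department.
MODEL_DEFS = {
    "deptA": {"formula": "s ~ 1 + f_a + shiny_material_flag + x_y_ratio"},
    "deptB": {"formula": "s ~ 1 + f_a + x_y_ratio + expanded_print_x"},
}

def resolve_extract_jobs(model_defs, feature_defs):
    """For each department, collect the computed features its formula needs."""
    jobs = {}
    for dept, spec in model_defs.items():
        rhs = spec["formula"].split("~", 1)[1]          # right-hand side terms
        terms = {t.strip() for t in re.split(r"[+*:]", rhs)}
        # Terms not in feature_defs (e.g. raw columns) need no computation.
        jobs[dept] = {name: feature_defs[name] for name in terms
                      if name in feature_defs}
    return jobs
```

Raw columns like `shiny_material_flag` fall through unresolved, since they come straight from the source data rather than a feature function.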

Page 18:

Some Observations

1. Spark is utilized heavily for feature engineering.

2. Model fitting occurs in containerized Python and R environments.

3. Individual jobs communicate via data dependencies.

4. Our inventory recommendation algorithms are specified with a high degree of tooling.

5. Pipelines leave behind multiple artifacts for analysis, debugging, and checkpointing (extract, train, load).

6. Individual models are isolated from one another (and can fail without impacting the rest of the group).

7. Data is contextual: e.g. item type, business line.
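Observations 3, 5, and 6 combine naturally: stages talk only through persisted artifacts, each artifact doubles as a checkpoint, and one model's failure doesn't stop the rest. A toy sketch of that shape, with invented names and trivial "models":

```python
# Sketch: extract -> train stages that communicate only via files on disk,
# with per-model failure isolation. Names and logic are illustrative only.
import json
import os

def extract(dept, workdir):
    """Write a department's training data; the file is also a checkpoint."""
    rows = [] if dept.startswith("empty") else [[1, 2], [3, 4]]
    path = os.path.join(workdir, f"{dept}_training_data.json")
    with open(path, "w") as f:
        json.dump({"dept": dept, "rows": rows}, f)
    return path

def train(data_path, workdir):
    """Read the extract artifact, fit a trivial 'model', persist it."""
    with open(data_path) as f:
        data = json.load(f)
    if not data["rows"]:
        raise ValueError("no training data")
    model_path = os.path.join(workdir, data["dept"] + "_model.json")
    with open(model_path, "w") as f:
        json.dump({"coef": sum(map(sum, data["rows"]))}, f)
    return model_path

def run_all(depts, workdir):
    """Isolation: one department failing never stops the others."""
    results = {}
    for dept in depts:
        try:
            results[dept] = train(extract(dept, workdir), workdir)
        except Exception as e:
            results[dept] = f"FAILED: {e}"
    return results
```

Because every intermediate lands on disk, a failed department can be debugged from its extract artifact and retried alone.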

Page 19:

Platform Tooling is Important!

Page 20:

Desirable Properties of Infrastructure & Tooling

● Isolation should be guaranteed by the infrastructure

● It should be obvious what running jobs and services are doing, when, and why

● Access to data should be easy, consistent, and self-service

● Guide rails should enforce, or strongly encourage, idempotent patterns

● Scaling, logging, and security should be baked into infrastructure and tooling

Page 21:

Access to Data

● All data is managed and tracked by the Metastore

○ Hive metastore abstracted by Bumblebee

○ Location, Schema, Format

● Data access for Python and R is a 1st class citizen

○ Typically accessed as dataframes

○ df = load_dataframe(namespace, table)

○ store_dataframe(df, namespace, table)
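The pattern behind that API can be shown with a toy, file-backed version: the metastore maps a logical (namespace, table) name to a physical location and format, so callers never hardcode paths. This is only an illustration of the pattern, not the Bumblebee/Metastore implementation (which sits over a Hive metastore and returns real dataframes):

```python
# Toy metastore-backed data access: logical names resolve to physical
# locations, so readers and writers agree without sharing paths.
import csv
import os

METASTORE = {}  # (namespace, table) -> {"location": ..., "format": ...}

def store_dataframe(rows, namespace, table, root="/tmp/warehouse"):
    """Write rows (list of dicts) and register the table's location/format."""
    path = os.path.join(root, namespace, table + ".csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    METASTORE[(namespace, table)] = {"location": path, "format": "csv"}

def load_dataframe(namespace, table):
    """Look up location in the metastore; the caller never sees a path."""
    entry = METASTORE[(namespace, table)]
    with open(entry["location"], newline="") as f:
        return list(csv.DictReader(f))
```

Centralizing location, schema, and format in one registry is what makes data access consistent and self-service across Spark, Python, and R jobs.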

Page 22:

The cloud: embrace elasticity.

Page 23:

Containerized Batch Jobs

● Containerized job execution has many benefits

○ Strong isolation

○ High degree of control over resources and environment

● But it needs abstraction over job definition and management

○ So we developed Flotilla

○ And open sourced it!

https://stitchfix.github.io/flotilla-os/

Page 24:

Questions?

Get in touch: [email protected] · @jeffmagnusson · http://www.linkedin.com/in/jmagnuss