When we Spark and when we don’t - QCon.ai
When we Spark and when we don’t:
ML Pipeline Development at Stitch Fix
Talk Flow
● What is Stitch Fix?
● Infrastructure and Tech Stack
● Thoughts on Good Practices for Developing ML Pipelines
● Case Study: Inventory Recommendation Models
● Tooling & Abstractions at Stitch Fix
Stitch Fix
● Share your style, size and price preferences with your personal stylist.
● Get 5 hand-selected pieces of clothing delivered to your door.
● Try your fix on in the comfort of your home.
● Leave feedback and pay for only the items you keep.
● Return the other items in the envelope provided.
There’s an algorithm for that...
Styling Algorithms
Client/Stylist Matching
Demand Modeling
Human Computation
Pick Path Optimization
New Style Development
Inventory Allocation
State Machines
Warehouse Assignment
Batch Picking
Replenishment
* Find out more at http://algorithms-tour.stitchfix.com/
Our Infrastructure and Tech Stack
[Architecture diagram. Layers: Data Acquisition, Data Processing, Data Storage, Data Management, Job Execution, Workflow Management. Labels: Camera: State Snapshots; Flotilla; Bumblebee: Metadata Manager; Metastore; Uhura; AWS S3; AWS ECS Cluster (three instances); Prod; Dev/Research.]
Some facts
● 1000s of jobs / day
○ Model training, featurization, test analysis, reporting, analytics, ad hoc research
● Production jobs run on
○ Spark: mostly Spark SQL and PySpark
○ Flotilla: Python or R in Docker containers on ECS
● ML pipelines typically consist of several jobs spanning the stack of technologies
● Data scientists own pipelines and implementations end-to-end
Good Practices for Developing ML Pipelines
Pipelines should be designed to support constant iteration
○ Individual pipelines/algorithms/implementations change quickly
○ Tooling and infrastructure should be relatively stable
At scale, failure should be expected
○ Be robust to failure
■ Checkpointing
■ Isolation
■ Automated Retries
■ Alerting
○ Make it easy to debug and diagnose
○ We train 100s of models / day, and expect some number of them to fail.
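The retry and alerting bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used at Stitch Fix: `run_with_retries`, its parameters, and `flaky_training_job` are hypothetical names, and `alert` defaults to `print` as a stand-in for a real alerting hook.

```python
import time


def run_with_retries(job, max_attempts=3, base_delay_s=1.0, alert=print):
    """Run job() until it succeeds or max_attempts is reached.

    Returns the job's result. If every attempt fails, calls alert() once
    and re-raises the last exception.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                alert(f"job failed after {attempt} attempts: {exc!r}")
                raise
            # Exponential backoff between attempts.
            time.sleep(base_delay_s * 2 ** (attempt - 1))


# Hypothetical job that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_training_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "model-ok"


result = run_with_retries(flaky_training_job, base_delay_s=0)
```

Retrying blindly is only safe when the job can be rerun without side effects, which is why retries are usually paired with idempotent jobs.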
Pipelines and jobs should be idempotent.
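One common way to get idempotent jobs is to write each run's output to a deterministic location and overwrite it atomically, so that a rerun leaves the same final state. The sketch below assumes a local directory and JSON output purely for illustration; `write_partition`, the `run_date=` naming, and the sample date are hypothetical and not taken from the talk.

```python
import json
import os
import tempfile


def write_partition(base_dir, run_date, rows):
    """Overwrite the output for run_date; rerunning leaves the same final state."""
    os.makedirs(base_dir, exist_ok=True)
    final_path = os.path.join(base_dir, f"run_date={run_date}.json")
    # Write to a temporary file first, then atomically swap it into place,
    # so readers never observe a partially written partition.
    fd, tmp_path = tempfile.mkstemp(dir=base_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(rows, handle, sort_keys=True)
    os.replace(tmp_path, final_path)
    return final_path


# Running the same job twice for the same (hypothetical) date produces one file.
out_dir = tempfile.mkdtemp()
rows = [{"item": 1, "score": 0.5}]
first = write_partition(out_dir, "2020-01-01", rows)
second = write_partition(out_dir, "2020-01-01", rows)
```

The key design choice is that the output path depends only on the job's inputs (here the run date), never on when or how many times the job ran.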
Make pragmatic choices with respect to technology.
Case Study: Inventory Recommendation Models
Algo_V1_1: Model by Inventory Department
[Diagram: seven parallel pipelines, each with the stages Extract Training Data -> Train Model -> Upload Model.]
[Diagram labels: User Item Rating Data; Ingest; Extract “wide” Client Training Data; Extract “wide” Item Training Data; Model A, B, C, and D Training Data; Train Model A, B, C, and D; Upload Model A, B, C, and D.]
[Diagram labels: Extract “wide” Client Training Data; User Item Rating Data; Extract “wide” Item Training Data; Ingest; Model A, B, C, and D Training Data; Train Model A, B, C, and D; Upload Model A, B, C, and D.]
Extract Jobs generated from resolution of Model + Feature Definitions

client_features: {
  "expanded_colors": { "in": [ "client_colors" ], "fn": "dummy_expand" },
  "X_Y_ratio": { "in": [ X, Y ], "fn": "compute_scaled_ratio" }
  …
},
item_features: {
  "expanded_print": { "in": [ colors ], "fn": "dummy_expand" }
},
interaction_features: {}

{
  "deptA": {
    "computed_features": [ "example_feature" ],
    "formula": [ "s ~ 1 + f_a + shiny_material_flag + x_y_ratio" ]
  },
  "deptB": {
    "computed_features": [ "example_feature" ],
    "formula": [ "s ~ 1 + f_a + x_y_ratio + client_color_a + expanded_print_x" ]
  }
}
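The slide does not say how model formulas are resolved against feature definitions, so the following Python is only one plausible reading. It assumes a formula term refers to a computed feature when it equals the feature name or starts with the feature name plus an underscore (as a dummy-expanded column such as `expanded_print_x` might). `FEATURES`, `MODELS`, `formula_terms`, and `resolve` are hypothetical, simplified stand-ins for the definitions shown above.

```python
# Hypothetical, simplified copies of the definitions shown on the slide.
FEATURES = {
    "expanded_print": {"in": ["colors"], "fn": "dummy_expand"},
    "x_y_ratio": {"in": ["X", "Y"], "fn": "compute_scaled_ratio"},
}

MODELS = {
    "deptA": {"formula": "s ~ 1 + f_a + shiny_material_flag + x_y_ratio"},
    "deptB": {"formula": "s ~ 1 + f_a + x_y_ratio + expanded_print_x"},
}


def formula_terms(formula):
    """Return the right-hand-side terms of an R-style formula, minus the intercept."""
    rhs = formula.split("~", 1)[1]
    terms = [term.strip() for term in rhs.split("+")]
    return [term for term in terms if term and term != "1"]


def resolve(formula, features):
    """Split formula terms into raw input columns and the computed features they imply.

    Assumed matching rule: a term names a computed feature when it equals the
    feature name or starts with the feature name followed by an underscore.
    """
    raw_columns, computed = set(), set()
    for term in formula_terms(formula):
        match = next(
            (name for name in features
             if term == name or term.startswith(name + "_")),
            None,
        )
        if match is None:
            # Not a computed feature: treat the term as a column to extract as-is.
            raw_columns.add(term)
        else:
            computed.add(match)
            raw_columns.update(features[match]["in"])
    return {"columns": sorted(raw_columns), "computed": sorted(computed)}


# One extract plan per department, derived from its formula.
extract_plans = {dept: resolve(spec["formula"], FEATURES) for dept, spec in MODELS.items()}
```

Under this reading, each department's extract job only pulls the columns its own formula needs, and adding a feature to a formula changes the extract without hand-editing the job.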
Some Observations
1. Spark is utilized heavily for feature engineering.
2. Model fitting occurs in containerized Python and R environments.
3. Individual jobs communicate via data dependencies.
4. Our inventory recommendation algorithms are specified with a high degree of tooling.
5. Pipelines leave behind multiple artifacts for analysis, debugging, and checkpointing. (extract, train, load)
6. Individual models are isolated from one another. (and can fail without impacting the rest of the group)
7. Data is contextual: e.g. item type; business line
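The observation that jobs communicate via data dependencies can be illustrated with a small Python sketch: each job declares the tables it reads and writes, and a run order is derived from those declarations alone. The job names and table names below are hypothetical, and `graphlib` requires Python 3.9 or later; this is not a description of the actual workflow manager.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each declares the tables it reads and the table it writes.
JOBS = {
    "upload_model_a": {"reads": ["model_a"], "writes": "model_a_uploaded"},
    "train_model_a": {"reads": ["client_wide", "item_wide"], "writes": "model_a"},
    "extract_client": {"reads": ["ratings"], "writes": "client_wide"},
    "extract_item": {"reads": ["ratings"], "writes": "item_wide"},
}


def job_order(jobs):
    """Order jobs so that every job runs after the jobs that produce its inputs."""
    producers = {spec["writes"]: name for name, spec in jobs.items()}
    # Map each job to the set of jobs it depends on. Tables with no producer
    # (here "ratings") are treated as already available.
    graph = {
        name: {producers[table] for table in spec["reads"] if table in producers}
        for name, spec in jobs.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = job_order(JOBS)
```

Because no job references another job by name, a job can be replaced or rerun in isolation as long as it keeps producing the same table.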
Platform Tooling is Important!
Desirable Properties of Infrastructure & Tooling
● Isolation should be guaranteed by the infrastructure
● It should be obvious what running jobs and services are doing, when, and why
● Access to data should be easy, consistent, and self-service
● Guide rails should enforce, or strongly encourage, idempotent patterns
● Scaling, logging, and security should be baked into infrastructure and tooling
Access to Data
● All data is managed and tracked by the Metastore
○ Hive metastore abstracted by Bumblebee
○ Location, Schema, Format
● Data access from Python and R is a first-class citizen
○ Typically accessed as dataframes
○ df = load_dataframe(namespace, table)
○ store_dataframe(df, namespace, table)
the cloud.
embrace elasticity.
Containerized Batch Jobs
● Containerized job execution has many benefits
○ Strong isolation
○ High degree of control over resources and environment
● But, needs abstraction over job definition and management
○ So we developed Flotilla
○ And open sourced it!
https://stitchfix.github.io/flotilla-os/