feature store: the missing data layer in ml pipelines ... · outline 1 hopsworks:...

42
Feature Store: the missing data layer in ML pipelines? 1 Spotify ML Guild Fika Kim Hammar [email protected] February 26, 2019 1 Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018. Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 1 / 29

Upload: others

Post on 14-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Feature Store: the missing data layer in ML pipelines?1

Spotify ML Guild Fika

Kim Hammar

[email protected]

February 26, 2019

1Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?https://www.logicalclocks.com/feature-store/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 1 / 29

Page 2: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Model

ϕ(x)

Data Predictions

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 2 / 29

Page 3: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Model

ϕ(x)

Data Predictions

Distributed TrainingData Validation

Feature Engineering

Data Collection

HardwareManagement

HyperParameterTuning

ModelServing

Pipeline Management

A/BTesting

Monitoring

2

2Image inspired from Sculley et al. (Google) Hidden Technical Debt in Machine Learning Systems

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 3 / 29

Page 4: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Outline

1 Hopsworks: Quick background of the platform

2 What is a Feature Store

3 Why You Need a Feature Store, Things to Consider:How to encourage feature reusage?How to store large-scale datasets for deep learning?How to serve features for inference?

4 How to Build a Feature Store (Hopsworks Feature Store Case Study)

5 Demo

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 4 / 29

Page 5: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

REST API

Kafka

TF Serving

Data Ingestion Data Prep Feature Store Training Serving

Orchestration

CPUs GPUs

HopsML

HopsYARN (fork of YARN)

HopsFS (fork of HDFS)

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 5 / 29

Page 6: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

REST API

Kafka

TF Serving

Data Ingestion Data Prep Feature Store Training Serving

Orchestration

CPUs GPUs

HopsML

HopsYARN (fork of YARN)

HopsFS (fork of HDFS)

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 5 / 29

Page 7: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29

Page 8: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

y?_\_( ") )_/

_

“Data is the hardest part of ML and the most important piece toget right.

Modelers spend most of their time selecting and transformingfeatures at training time and then building the pipelines to deliverthose features to production models.”

- Uber3

3Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.https://eng.uber.com/scaling-michelangelo/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29

Page 9: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

y?_\_( ") )_/

_

“Data is the hardest part of ML and the most important piece toget right.

Modelers spend most of their time selecting and transformingfeatures at training time and then building the pipelines to deliverthose features to production models.”

- Uber3

3Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.https://eng.uber.com/scaling-michelangelo/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29

Page 10: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

ϕ(x)y1...yn

x1,1 . . . x1,n

... . . ....

xn,1 . . . xn,n

yFeature Store

“Data is the hardest part of ML and the most important piece toget right.

Modelers spend most of their time selecting and transformingfeatures at training time and then building the pipelines to deliverthose features to production models.”

- Uber4

4Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo.https://eng.uber.com/scaling-michelangelo/. 2018.

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 6 / 29

Page 11: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Disentangle ML Pipelines with a Feature Store

Raw/Structured Data

Feature StoreFeature Engineering Training

Modelsb0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y

A feature store is a central vault for storing documented, curated, andaccess-controlled features.

The feature store is the interface between data engineering and datamodel development

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 7 / 29

Page 12: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Dataset 1 Dataset 2 . . . Dataset n

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

FeatureEngineering

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 8 / 29

Page 13: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Dataset 1 Dataset 2 . . . Dataset n

Feature Store

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29

Page 14: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Dataset 1 Dataset 2 . . . Dataset n

Feature Store

Backfilling

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29

Page 15: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Dataset 1 Dataset 2 . . . Dataset n

Feature Store

Backfilling

Analysis

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29

Page 16: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Dataset 1 Dataset 2 . . . Dataset n

Feature Store

Backfilling

Analysis

Versioning

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29

Page 17: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Dataset 1 Dataset 2 . . . Dataset n

Feature Store

Backfilling

Analysis

Versioning

Documentation

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y≥ 0.9 < 0.9

≥ 0.2

< 0.2

≥ 11.2

< 11.2

B B

A

(−1,−1) (−8,−8)(−10, 0) (0,−10)

40 60 80 100

160

180

200

X

Y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 9 / 29

Page 18: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

What is a Feature?

A feature is a measurable property of some data-sample

A feature could be..An aggregate value (min, max, mean, sum)A raw value (a pixel, a word from a piece of text)A value from a database table (the age of a customer)A derived representation: e.g an embedding or a cluster

Features are the fuel for AI systems:

x1...xn

Features

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y

Model θ

y

Prediction

L(y , y)

Loss

Gradient∇θL(y , y)

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 10 / 29

Page 19: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Raw text lower-case& remove noise

tokenization lemmatization

words.txt group by post words_post.csv

word2vec TF-IDF LDAontology-matchingannotation with weaksupervision

Model

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 11 / 29

Page 20: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Raw text

Feature Store

TF-IDF word2vec LDA weakannotation

normalization

Model

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 12 / 29

Page 21: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

How to Encourage FeatureReusage?

Page 22: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Feature Marketplace

Feature Marketplace

Download

Features

Search

FeaturesFeatures

Publish

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 14 / 29

Page 23: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Feature Store API Service

from hops import featurestore

features_df =

featurestore.get_features([

"average_attendance",

"average_player_age"

])

Feature Relationships

Feature Groups

Shared Storage

Feature Store API ServiceFeature

Metadata

in ML pipelines

Include features

Figure: Feature Store API Service

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 15 / 29

Page 24: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

How to Store Datasets for DeepLearning?

Page 25: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

How to Store Datasets for Deep Learning?

Should be frameworkagnostic

Need to be able to storetensor datasets

Should support sharding fordistributed training

Advanced features:row-predicate filtering, SQLinterface, columnar selection.

?

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 17 / 29

Page 26: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

How to Store Datasets for Deep Learning?

, /

HDF5

/ ,

TFRecords

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 18 / 29

Page 27: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

How to Store Datasets for Deep Learning?

Petastorm is a datasetformat designed for deeplearning

Petastorm stores data asparquet files with extrametadata to handlemulti-dimensional tensors

Petastorm contains readersfor the popular machinelearning frameworks such asSparkML, Tensorflow,PyTorch

, ,

Petastorm

55Robbie Gruener, Owen Cheng, and Yevgeni Litvin. Introducing Petastorm: Uber ATG’s Data Access Library for

Deep Learning. https://eng.uber.com/petastorm/. 2018.Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 19 / 29

Page 28: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

How to Serve Features forInference?

Page 29: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Delivering Features for Training and Serving is Different

Serving can require real-timefeatures

Ideally we want consistencybetween real-time featuresand batch features used fortraining

Complex engineeringproblem

Feature Store

Real-Time Features

xPrediction

Inference Request

b0

x 0,1

x 0,2

x 0,3

b1

x 1,1

x 1,2

x 1,3

y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 21 / 29

Page 30: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

How to Implement a (batch)Feature Store?

Page 31: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

The Components of a Feature Store

The Storage Layer: For storing feature data in the feature storeThe Metadata Layer: For storing feature metadata (versioning,feature analysis, documentation, jobs)The Feature Engineering Jobs: For computing featuresThe Feature Registry: A user interface to share and discoverfeaturesThe Feature Store API: For writing/reading to/from the featurestore

Feature Storage

Feature Metadata Jobs

Feature Registry API

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 23 / 29

Page 32: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Feature Storage

FeatureComputation

Raw/Structured Data

Data Lake

FeatureGroup 1

FeatureGroup 2

FeatureGroup 3

FeatureGroup 4

project_featurestore.db

HiveMetastore

Foreign keys

Feature Storage

Feature Metadata Jobs

Feature Registry API

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 24 / 29

Page 33: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Feature Metadata

FeatureComputation

Raw/Structured Data

Data Lake

FeatureGroup 1

FeatureGroup 2

FeatureGroup 3

FeatureGroup 4

project_featurestore.db

HiveMetastore

FeaturestoreMetadata

Foreign keys

Foreign keys

Feature Storage

Feature Metadata Jobs

Feature Registry API

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 25 / 29

Page 34: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Feature Registry and API

FeatureComputation

Raw/Structured Data

Data Lake

FeatureGroup 1

FeatureGroup 2

FeatureGroup 3

FeatureGroup 4

project_featurestore.db

HiveMetastore

FeaturestoreMetadata

Foreign keys

Foreign keys

HopsworksFeature registry (UI)

REST API

Program APIs

Feature Storage

Feature Metadata Jobs

Feature Registry API

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 26 / 29

Page 35: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

FeatureComputation

Raw/Structured Data

Data Lake Feature Store

Curated FeaturesModel

b0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y

Demo-Setting

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 27 / 29

Page 36: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Summary

Machine learning comes with a high technical costMachine learning pipelines needs proper data managementA feature store is a place to store curated and documented featuresThe feature store serves as an interface between feature engineeringand model development, it can help disentangle complex ML pipelinesHopsworks6 provides the world’s first open-source feature store

@hopshadoop

www.hops.io

@logicalclocks

www.logicalclocks.com

We are open source:https://github.com/logicalclocks/hopsworks

https://github.com/hopshadoop/hops

76Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.7Thanks to Logical Clocks Team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou,

Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, and Alex Ormenisan

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 28 / 29

Page 37: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

References

Hopsworks’ feature store8 (the only open-source one!)Uber’s feature store9

Airbnb’s feature store10

Comcast’s feature store11

GO-JEK’s feature store12

HopsML13

Hopsworks14

8Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines?https://www.logicalclocks.com/feature-store/. 2018.

9Li Erran Li et al. “Scaling Machine Learning as a Service”. In: Proceedings of The 3rd InternationalConference on Predictive Applications and APIs. Ed. by Claire Hardgrove et al. Vol. 67. Proceedings of MachineLearning Research. Microsoft NERD, Boston, USA: PMLR, 2017, pp. 14–29. URL:http://proceedings.mlr.press/v67/li17a.html.

10Nikhil Simha and Varant Zanoyan. Zipline: Airbnb’s Machine Learning Data Management Platform.https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform. 2018.

11Nabeel Sarwar. Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions.https://databricks.com/session/operationalizing-machine-learning-managing-provenance-from-raw-data-to-predictions. 2018.

12Willem Pienaar. Building a Feature Platform to Scale Machine Learning | DataEngConf BCN ’18.https://www.youtube.com/watch?v=0iCXY6VnpCc. 2018.

13Logical Clocks AB. HopsML: Python-First ML Pipelines.https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018.

14Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 29 / 29

Page 38: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Backup Slides

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 30 / 29

Page 39: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Modeling Data in the Feature Store

A feature group is a logical grouping of featuresTypically from the same input dataset and computed with the same job

A training dataset is a set of features suitable for a prediction taskFeatures in a training dataset are often from several feature groupsE.g features on customers, features on user activities, etc.

Training Datasets d

Feature groups g

Features f f1 f2 f3 f4 f5

g1

f6 f7 f8 f9 f10

g2

f11 f12

g3

d1 d2

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 31 / 29

Page 40: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Hopsworks Feature Store API Service

SQL

hops-util

Query PlannerFeature Store Data

Hive on HopsFS

Feature StoreMetadata

Dataframe With Features

Client Interface Feature Store Service Output

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 32 / 29

Page 41: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Training Pipeline in HopsML

1 Create job/notebook to compute features and publish to the featurestore

2 Create job/notebook to read features/labels and save to a trainingdataset

3 Read the training dataset into your model for training

HopsFSData Lake

HiveFeature store

Raw data

Feature computation

Feature store features

hops-utilhops-util-py

Basis features Training dataset Modelb0

x0,1

x0,2

x0,3

b1

x1,1

x1,2

x1,3

y

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 33 / 29

Page 42: Feature Store: the missing data layer in ML pipelines ... · Outline 1 Hopsworks: Quickbackgroundoftheplatform 2 What isaFeatureStore 3 Why YouNeedaFeatureStore,ThingstoConsider:

Hopsworks Feature Store API

Reading from the Feature Store:

from hops import featurestore

features_df = featurestore.get_features([

"average_attendance",

"average_player_age"

])

Writing to the Feature Store:

from hops import featurestore

raw_data = spark.read.parquet(filename)

pol_features = raw_data.map(lambda x: x^2)

featurestore.insert_into_featuregroup(pol_features , "pol_featuregroup")

Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 34 / 29