TRANSCRIPT
Feature Store: the missing data layer in ML pipelines?[1]
Spotify ML Guild Fika
Kim Hammar
February 26, 2019
[1] Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018.
Kim Hammar (Logical Clocks) Hopsworks Feature Store February 26, 2019 1 / 29
[Figure: a model ϕ(x) maps data to predictions]
[Figure: the model ϕ(x) is only a small piece of the overall ML pipeline, surrounded by data collection, data validation, feature engineering, distributed training, hardware management, hyperparameter tuning, model serving, pipeline management, A/B testing, and monitoring[2]]
[2] Image inspired by Sculley et al. (Google), Hidden Technical Debt in Machine Learning Systems.
Outline
1 Hopsworks: Quick background of the platform
2 What is a Feature Store
3 Why You Need a Feature Store, Things to Consider:
  - How to encourage feature reuse?
  - How to store large-scale datasets for deep learning?
  - How to serve features for inference?
4 How to Build a Feature Store (Hopsworks Feature Store Case Study)
5 Demo
[Figure: Hopsworks platform architecture: the HopsML pipeline (data ingestion → data prep → feature store → training → serving) with orchestration, a REST API, Kafka, and TF Serving, running on CPUs/GPUs over HopsYARN (a fork of YARN) and HopsFS (a fork of HDFS)]
[Figure: a feature matrix x₁,₁ … xₙ,ₙ is fed to a model ϕ(x) that produces predictions y₁ … yₙ]
[Figure: the same feature matrix and model, with a shrug ¯\_(ツ)_/¯ over where the features come from]
“Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.”
- Uber[3]
[3] Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. 2018.
[Figure: the feature matrix and model again, with the shrug replaced by a Feature Store]
Disentangle ML Pipelines with a Feature Store
[Figure: raw/structured data → feature engineering → feature store → training → models]
A feature store is a central vault for storing documented, curated, and access-controlled features.
The feature store is the interface between data engineering and model development.
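The "central vault" idea can be sketched in a few lines of plain Python. All names here (`MiniFeatureStore`, `publish`, `get`) are hypothetical; a real feature store such as Hopsworks' adds persistence, access control, and feature analysis on top:

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str
    description: str              # documentation is mandatory, not an afterthought
    version: int = 1
    values: list = field(default_factory=list)

class MiniFeatureStore:
    """Toy central vault: data engineers publish features, modelers look them up."""
    def __init__(self):
        self._features = {}

    def publish(self, feature):
        # key by (name, version) so old feature definitions stay reproducible
        self._features[(feature.name, feature.version)] = feature

    def get(self, name, version=1):
        return self._features[(name, version)]

store = MiniFeatureStore()
store.publish(Feature("average_player_age", "Mean age per team", values=[24.3, 27.1]))
print(store.get("average_player_age").description)  # Mean age per team
```

The point of the sketch is the interface: the publisher and the consumer never exchange code, only named, versioned, documented features.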
[Figure: without a feature store, each of Dataset 1, Dataset 2, …, Dataset n needs its own feature engineering before feeding downstream models such as neural networks, decision trees, and regression fits]
[Figure: with a feature store between Dataset 1 … Dataset n and the downstream models, features gain backfilling, analysis, versioning, and documentation]
What is a Feature?
A feature is a measurable property of some data sample.
A feature could be:
- An aggregate value (min, max, mean, sum)
- A raw value (a pixel, a word from a piece of text)
- A value from a database table (the age of a customer)
- A derived representation: e.g. an embedding or a cluster
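For instance, the aggregate-value features above take only a few lines of Python to compute (the attendance numbers are made up for illustration):

```python
from statistics import mean

# Toy raw records: stadium attendance per match (hypothetical data)
attendances = [31000, 28000, 35000, 30000]

# Aggregate-value features: min, max, mean, sum
features = {
    "min_attendance": min(attendances),
    "max_attendance": max(attendances),
    "average_attendance": mean(attendances),
    "sum_attendance": sum(attendances),
}
print(features["average_attendance"])  # 31000
```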
Features are the fuel for AI systems:
[Figure: features x₁ … xₙ feed a model with parameters θ; the model outputs a prediction ŷ, the loss L(y, ŷ) compares it to the label y, and the gradient ∇θ L(y, ŷ) is used to update θ]
[Figure: an ad-hoc NLP pipeline: raw text → lower-case & remove noise → tokenization → lemmatization → words.txt → group by post → words_post.csv → word2vec / TF-IDF / LDA / ontology matching / annotation with weak supervision → model]
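One step of such a pipeline, TF-IDF, is easy to sketch in plain Python; this is exactly the kind of logic that gets re-implemented by every team when there is no shared feature store (a minimal sketch, not the implementation used in Hopsworks):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency * inverse document frequency per term
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["spark", "ml", "pipeline"], ["spark", "feature", "store"]]
vecs = tf_idf(docs)
# "spark" appears in every document, so its idf (and hence its tf-idf) is 0
print(vecs[0]["spark"])  # 0.0
```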
[Figure: the same pipeline with a feature store: raw text is ingested once, and the feature store serves TF-IDF, word2vec, LDA, and weak-annotation features, with normalization, to the model]
How to Encourage Feature Reuse?
Feature Marketplace
[Figure: users publish features to a shared feature marketplace, and search for and download features from it]
Feature Store API Service

    from hops import featurestore
    features_df = featurestore.get_features([
        "average_attendance",
        "average_player_age"
    ])

[Figure: Feature Store API Service, backed by feature metadata (feature relationships, feature groups, shared storage), used to include features in ML pipelines]
How to Store Datasets for Deep Learning?
- Should be framework agnostic
- Need to be able to store tensor datasets
- Should support sharding for distributed training
- Advanced features: row-predicate filtering, SQL interface, columnar selection
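The sharding requirement can be illustrated with a minimal sketch, assuming a plain JSON-lines layout rather than a real tensor format: records are split round-robin across shard files so that distributed workers can each read a disjoint part of the dataset.

```python
import json
import os
import tempfile

def write_shards(records, out_dir, num_shards):
    """Split a dataset into shard files so workers can read disjoint parts."""
    paths = [os.path.join(out_dir, f"part-{i:05d}.jsonl") for i in range(num_shards)]
    files = [open(p, "w") for p in paths]
    for i, rec in enumerate(records):
        # round-robin assignment: record i goes to shard i mod num_shards
        files[i % num_shards].write(json.dumps(rec) + "\n")
    for f in files:
        f.close()
    return paths

records = [{"x": i} for i in range(10)]
out_dir = tempfile.mkdtemp()
paths = write_shards(records, out_dir, num_shards=4)
print(len(paths))  # 4
```

Real formats (TFRecords, Parquet/Petastorm) add compression, schemas, and efficient columnar reads on top of the same basic idea.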
How to Store Datasets for Deep Learning?
[Figure: HDF5 and TFRecords each get one smiley and one frowny face: neither format satisfies all the requirements]
How to Store Datasets for Deep Learning?
- Petastorm is a dataset format designed for deep learning
- Petastorm stores data as Parquet files with extra metadata to handle multi-dimensional tensors
- Petastorm contains readers for popular machine learning frameworks such as SparkML, TensorFlow, and PyTorch
[Figure: Petastorm gets two smiley faces[5]]
[5] Robbie Gruener, Owen Cheng, and Yevgeni Litvin. Introducing Petastorm: Uber ATG's Data Access Library for Deep Learning. https://eng.uber.com/petastorm/. 2018.
How to Serve Features for Inference?
Delivering Features for Training and Serving is Different
- Serving can require real-time features
- Ideally we want consistency between the real-time features and the batch features used for training
- This is a complex engineering problem
[Figure: an inference request triggers a lookup of real-time features x from the feature store; the model maps x to a prediction ŷ]
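One common way to get that consistency is to route both paths through the same transformation code; a minimal sketch (the `transform` function and the feature name are hypothetical):

```python
def transform(raw):
    """Single source of truth for the feature transformation,
    shared by the batch (training) and online (serving) paths."""
    return {"age_squared": raw["age"] ** 2}

def batch_features(rows):
    # training path: applied offline over the whole dataset
    return [transform(r) for r in rows]

def online_features(request):
    # serving path: applied per inference request, reusing the same code
    return transform(request)

train = batch_features([{"age": 5}])
serve = online_features({"age": 5})
print(train[0] == serve)  # True: no training/serving skew
```

When the transformation lives in the feature store rather than being duplicated in two codebases, this property holds by construction.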
How to Implement a (Batch) Feature Store?
The Components of a Feature Store
- The Storage Layer: for storing feature data in the feature store
- The Metadata Layer: for storing feature metadata (versioning, feature analysis, documentation, jobs)
- The Feature Engineering Jobs: for computing features
- The Feature Registry: a user interface to share and discover features
- The Feature Store API: for writing/reading to/from the feature store
[Figure: the five components layered: feature storage, feature metadata, jobs, feature registry, API]
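The metadata layer and registry are what make features discoverable; a toy sketch of registry-style search over feature documentation (all names and records are made up):

```python
# Toy metadata layer: documentation and provenance for registered features
metadata = [
    {"name": "average_attendance",
     "doc": "Mean stadium attendance per team",
     "job": "attendance_job"},
    {"name": "average_player_age",
     "doc": "Mean player age per team",
     "job": "age_job"},
]

def search(query):
    """Registry-style free-text search over feature names and documentation."""
    q = query.lower()
    return [m["name"] for m in metadata
            if q in m["doc"].lower() or q in m["name"]]

print(search("player"))  # ['average_player_age']
```

Because documentation is stored next to the data, a modeler can find an existing feature instead of recomputing it.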
Feature Storage
[Figure: feature computation jobs read raw/structured data from the data lake and write Feature Groups 1-4 into project_featurestore.db, whose schema lives in the Hive Metastore with foreign keys between tables]
Feature Metadata
[Figure: the same storage layout, extended with a Featurestore Metadata database linked to the feature group tables via foreign keys]
Feature Registry and API
[Figure: on top of the storage and metadata layers, the Hopsworks feature registry (UI), a REST API, and program APIs expose the feature store to users]
Demo Setting
[Figure: feature computation turns raw/structured data from the data lake into curated features in the feature store, which feed a model]
Summary
- Machine learning comes with a high technical cost
- Machine learning pipelines need proper data management
- A feature store is a place to store curated and documented features
- The feature store serves as an interface between feature engineering and model development; it can help disentangle complex ML pipelines
- Hopsworks[6] provides the world's first open-source feature store
@hopshadoop
www.hops.io
@logicalclocks
www.logicalclocks.com
We are open source: https://github.com/logicalclocks/hopsworks
https://github.com/hopshadoop/hops
[6] Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.
[7] Thanks to the Logical Clocks team: Jim Dowling, Seif Haridi, Theo Kakantousis, Fabio Buso, Gautier Berthou, Ermias Gebremeskel, Mahmoud Ismail, Salman Niazi, Antonios Kouzoupis, Robin Andersson, and Alex Ormenisan.
References
- Hopsworks' feature store[8] (the only open-source one!)
- Uber's feature store[9]
- Airbnb's feature store[10]
- Comcast's feature store[11]
- GO-JEK's feature store[12]
- HopsML[13]
- Hopsworks[14]
[8] Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018.
[9] Li Erran Li et al. "Scaling Machine Learning as a Service". In: Proceedings of The 3rd International Conference on Predictive Applications and APIs. Ed. by Claire Hardgrove et al. Vol. 67. Proceedings of Machine Learning Research. Microsoft NERD, Boston, USA: PMLR, 2017, pp. 14-29. URL: http://proceedings.mlr.press/v67/li17a.html.
[10] Nikhil Simha and Varant Zanoyan. Zipline: Airbnb's Machine Learning Data Management Platform. https://databricks.com/session/zipline-airbnbs-machine-learning-data-management-platform. 2018.
[11] Nabeel Sarwar. Operationalizing Machine Learning: Managing Provenance from Raw Data to Predictions. https://databricks.com/session/operationalizing-machine-learning-managing-provenance-from-raw-data-to-predictions. 2018.
[12] Willem Pienaar. Building a Feature Platform to Scale Machine Learning | DataEngConf BCN '18. https://www.youtube.com/watch?v=0iCXY6VnpCc. 2018.
[13] Logical Clocks AB. HopsML: Python-First ML Pipelines. https://hops.readthedocs.io/en/latest/hopsml/hopsML.html. 2018.
[14] Jim Dowling. Introducing Hopsworks. https://www.logicalclocks.com/introducing-hopsworks/. 2018.
Backup Slides
Modeling Data in the Feature Store
A feature group is a logical grouping of features, typically from the same input dataset and computed with the same job.
A training dataset is a set of features suitable for a prediction task. Features in a training dataset often come from several feature groups, e.g. features on customers, features on user activities, etc.
[Figure: training datasets d₁, d₂ draw features f₁ … f₁₂ from feature groups g₁, g₂, g₃]
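The figure's model, a training dataset joining features from several feature groups, can be sketched as a key-based join (`team_id` and the feature values are made up for illustration):

```python
# Two toy feature groups keyed by team_id (hypothetical data)
player_fg = {1: {"average_player_age": 24.3},
             2: {"average_player_age": 27.1}}
attendance_fg = {1: {"average_attendance": 31000},
                 2: {"average_attendance": 28000}}

def create_training_dataset(keys, *feature_groups):
    """Join features from several feature groups into one training dataset."""
    rows = []
    for k in keys:
        row = {"team_id": k}
        for fg in feature_groups:
            row.update(fg[k])  # merge this group's features for the key
        rows.append(row)
    return rows

dataset = create_training_dataset([1, 2], player_fg, attendance_fg)
print(dataset[0])
# {'team_id': 1, 'average_player_age': 24.3, 'average_attendance': 31000}
```

In Hopsworks the same join happens over Hive tables rather than Python dicts, but the primary-key-based composition is the same idea.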
Hopsworks Feature Store API Service
[Figure: a client sends SQL via hops-util; the feature store service's query planner combines feature store data (Hive on HopsFS) with feature store metadata and returns a dataframe with features (client interface → feature store service → output)]
Training Pipeline in HopsML
1 Create a job/notebook to compute features and publish them to the feature store
2 Create a job/notebook to read features/labels and save them as a training dataset
3 Read the training dataset into your model for training
[Figure: raw data in HopsFS (data lake) → feature computation → Hive feature store → hops-util / hops-util-py reads basis features → training dataset → model]
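The three steps above can be sketched end to end with plain Python stand-ins for the feature store and model (all data and names are illustrative):

```python
# Step 1: compute features and publish them to a (toy) feature store
feature_store = {}
raw = [{"team_id": 1, "ages": [24, 26]},
       {"team_id": 2, "ages": [28, 26]}]
feature_store["average_player_age"] = {
    r["team_id"]: sum(r["ages"]) / len(r["ages"]) for r in raw
}

# Step 2: read features and materialize a training dataset
training_dataset = [{"team_id": t, "average_player_age": v}
                    for t, v in feature_store["average_player_age"].items()]

# Step 3: feed the dataset to a model (here a trivial mean "model")
prediction = (sum(r["average_player_age"] for r in training_dataset)
              / len(training_dataset))
print(prediction)  # 26.0
```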
Hopsworks Feature Store API
Reading from the Feature Store:
from hops import featurestore
features_df = featurestore.get_features([
"average_attendance",
"average_player_age"
])
Writing to the Feature Store:
from hops import featurestore
raw_data = spark.read.parquet(filename)
# square a numeric column (hypothetical column name "value");
# note ** is Python's power operator; ^ is bitwise XOR
pol_features = raw_data.select((raw_data["value"] ** 2).alias("value_squared"))
featurestore.insert_into_featuregroup(pol_features, "pol_featuregroup")