driftage Documentation Release latest Diogo Munaro Vieira Nov 10, 2020

GETTING STARTED

1 Install

2 Examples

3 API

4 DataFlow

5 Sequence Diagram

6 About

Python Module Index

Driftage is a modular multi-agent framework to detect concept drifts from batch or streaming data.

CHAPTER ONE: INSTALL

Install Driftage as a Python 3.7+ package:

pip install -U driftage

or

conda install driftage

Now you can build your own Monitor, Analyser, Planner and Executor using the Knowledge Base.

There is an example of detecting muscle voltage changes that you can follow.

If you want to know more about how Driftage is structured, take a look at the DataFlow and Sequence Diagram parts of this doc.

CHAPTER TWO: EXAMPLES

An example was created as a health monitor that tracks a user while punching. The data comes from the EMG Physical Action Data Set, obtained from the UCI Machine Learning Repository.

There is a Jupyter Notebook analysis that illustrates how different voltage signals are collected from the following muscles:

• R-Bic: right bicep (C1)

• R-Tri: right tricep (C2)

• L-Bic: left bicep (C3)

• L-Tri: left tricep (C4)

• R-Thi: right thigh (C5)

• R-Ham: right hamstring (C6)

• L-Thi: left thigh (C7)

• L-Ham: left hamstring (C8)

The analysis shows how the ADWIN drift detection algorithm adapts to each signal.

One example for each kind of agent was implemented:

• Spark Monitor: a Monitor integrated with Apache Spark Structured Streaming that reads CSV punching signals from muscles.

• ADWIN Analyser: an Analyser for each muscle signal that uses ADWIN from Skmultiflow to detect drifts in muscle activity.

• Voting Planner: a Planner that votes with a threshold of 3 <= X < 8. If X muscle signals are interpreted as drifting and X falls within the threshold, then it alerts the Executor of a drift.

• Csv Executor: an Executor that checks that the filesystem is OK and saves the detected Concept Drift.
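The voting rule used by the Voting Planner can be sketched in a few lines of plain Python. This is only an illustration, not the code from the example repository: the function name `should_alert` and the dictionary of per-muscle flags are hypothetical, and the threshold is read as "at least 3 and fewer than 8 drifting signals".

```python
from typing import Dict

MIN_VOTES = 3  # inclusive lower bound (assumed reading of the threshold)
MAX_VOTES = 8  # exclusive upper bound (assumed reading of the threshold)


def should_alert(drift_flags: Dict[str, bool]) -> bool:
    """Return True when the number of drifting signals X satisfies 3 <= X < 8."""
    votes = sum(1 for drifting in drift_flags.values() if drifting)
    return MIN_VOTES <= votes < MAX_VOTES


# One flag per muscle signal, as an Analyser might report them (hypothetical data).
flags = {
    "R-Bic": True, "R-Tri": True, "L-Bic": True, "L-Tri": False,
    "R-Thi": False, "R-Ham": False, "L-Thi": False, "L-Ham": False,
}
print(should_alert(flags))
```

With three of the eight signals drifting, the rule fires; with fewer than three, or with all eight, it does not.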

TimescaleDB was chosen as the Knowledge Base because it is fully compatible with SQLAlchemy, which the connection adapter uses, and it handles time series data very well.

The full example can be executed using Docker Compose following these 3 steps:

1. Clone the repository:

git clone https://github.com/dmvieira/driftage.git

2. If you already have Docker Compose installed, run in the driftage folder:

make example

3. Wait until the Executor logs that it has written a drift, then take a look at the file with drifts:

cat example/health_monitor/build/executor/output.csv

For more details, take a look at API.

CHAPTER THREE: API

3.1 Monitor

Captures data from any framework you want to integrate with: Spark, Flink, or even a Python function.

class driftage.monitor.Monitor(jid: str, password: str, identifier: Optional[str] = None, verify_security: bool = False)

An agent that collects data from sources and sends it to the Analyser. This agent authenticates on an XMPP server.

Parameters

• jid (str) – Id for XMPP authentication. Ex: user@localhost

• password (str) – Password for XMPP authentication.

• identifier (Optional[str], optional) – Data identification or agent jid, defaults to None

• verify_security (bool, optional) – Security validation with XMPP server, defaults to False.

collect(data: dict)

Callback to collect data to be sent as dict.

Parameters data (dict) – Data to send

async setup()

Agent startup for behaviours.

3.2 Analyser

Analyses data collected by the Monitor using a customized Predictor for Concept Drift detection. Fast classifiers from Scikit-Multiflow or Facebook Prophet are great projects for that.

3.2.1 Analyser Predictor

class driftage.analyser.predictor.AnalyserPredictor(connection: driftage.db.connection.Connection)

Predictor base class for Concept Drift detection.

Parameters connection (Connection) – Knowledge Base connection

async fit()

Load a new model or get old data for model retraining. If retrain_period is None, then you can ignore this function when inheriting.

abstract async predict(X: driftage.analyser.predictor.PredictionData) → bool

Receives PredictionData and predicts whether the new data is a Concept Drift or not.

Parameters X (PredictionData) – Data to be predicted as Concept Drift

Raises NotImplementedError – Needs to be implemented when overridden

Returns Prediction for whether PredictionData is a drift or not

Return type bool

retrain_period() → Optional[int]

Retrain time period (in seconds). This property defines how long AnalyserPredictor waits until the next fit call. Retraining is optional: if it returns None, the fit method is never called.

Returns Time to wait for retrain in seconds or None if no retrain

Return type Optional[int]

class driftage.analyser.predictor.PredictionData(data: dict, timestamp: datetime.datetime, identifier: str)

Dataclass to store data to predict.

Parameters

• data (dict) – Data that comes from the Monitor.

• timestamp (datetime) – Datetime object of when the message was created.

• identifier (str) – Data identifier that comes from the Monitor or any identifier you want.

data: dict = None

identifier: str = None

timestamp: datetime = None
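To make the predictor contract concrete, here is a minimal sketch of a custom Analyser predictor. It does not import driftage: the `AnalyserPredictor` and `PredictionData` classes below are local stand-ins that mirror only the signatures documented above, and `VoltageThresholdPredictor`, its `limit` parameter and the `"voltage"` key are hypothetical names chosen for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PredictionData:
    """Stand-in mirroring the documented dataclass fields."""
    data: dict
    timestamp: datetime
    identifier: str


class AnalyserPredictor(ABC):
    """Stand-in for the documented base class (assumption, not the real one)."""

    def __init__(self, connection):
        self.connection = connection

    async def fit(self):
        # Optional hook; nothing to retrain in this sketch.
        pass

    @abstractmethod
    async def predict(self, X: PredictionData) -> bool:
        raise NotImplementedError


class VoltageThresholdPredictor(AnalyserPredictor):
    """Hypothetical predictor: reports a drift when |voltage| exceeds a limit."""

    def __init__(self, connection, limit: float = 1000.0):
        super().__init__(connection)
        self.limit = limit

    async def predict(self, X: PredictionData) -> bool:
        return abs(X.data["voltage"]) > self.limit


sample = PredictionData({"voltage": 1500.0}, datetime(2020, 11, 10), "R-Bic")
predictor = VoltageThresholdPredictor(connection=None)
print(asyncio.run(predictor.predict(sample)))
```

A real predictor would subclass the class shipped with Driftage and would usually wrap a proper drift detector instead of a fixed threshold.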

3.3 Planner

Observes the situation for possible new predictions and based on them chooses whether the drift is actually valid. If it’s a real drift, it should be sent to the Executor. A custom Predictor can be written for that too, like a voting one or a more time-consuming algorithm from Scikit-Learn, TensorFlow, PyTorch, or others.

3.3.1 Planner Predictor

class driftage.planner.predictor.PlannerPredictor(connection: driftage.db.connection.Connection)

Predictor base class for Concept Drift detection.

Parameters connection (Connection) – Knowledge Base connection

abstract async predict() → List[driftage.planner.predictor.PredictResult]

Using data stored in the KB, predicts whether this data is a Concept Drift or not, and therefore whether to send it to the Executor.

Raises NotImplementedError – Needs to be implemented when overridden

Returns Results predicted that should or shouldn’t be sent to the Executor

Return type List[PredictResult]

abstract property predict_period

Predict time period (in seconds). This property defines how long PlannerPredictor waits until the next predict call.

Raises NotImplementedError – Needs to be implemented when overridden

Returns Time to wait between predict calls, in seconds

Return type Union[float, int]

class driftage.planner.predictor.PredictResult(identifier: str, predicted: Union[bool, str, int, float], should_send: bool)

Dataclass that stores each prediction result and whether it should be sent to the Executor.

Parameters

• identifier (str) – Data identifier that comes from Monitor or any identifier you want.

• predicted (Union[bool, str, int, float]) – Value predicted from the Drift detection algorithm. This can even inform the type of drift to the Executor.

• should_send (bool) – Whether this prediction should be sent to the Executor. Sometimes your Planner can decide not to send it because of time or other business rules.

identifier: str = None

predicted: Union[bool, str, int, float] = None

should_send: bool = None

3.4 Executor

Receives new drifts from the Planner and sends them to a custom Sink, such as Apache Kafka, RabbitMQ, or an API.

3.4.1 Retry Config

class driftage.executor.retry_config.RetryConfig(send_timeout: Union[int, float, None] = 1.0, retry_backoff: Union[int, float] = 1.0, max_tries: int = 3, all_tries_timeout: Union[int, float, None] = None, retry_exceptions: Tuple[Exception] = (<class 'Exception'>, ))

Retry configuration for the Sink connection. Configuring retries makes the Executor resilient when sending data to the Sink.

Parameters

• send_timeout (Optional[Union[int, float]], optional) – Timeout for each send, in seconds, defaults to 1.0

• retry_backoff (Union[int, float], optional) – Time to wait before another try, in seconds, defaults to 1.0

• max_tries (int, optional) – Maximum number of tries, defaults to 3

• all_tries_timeout (Optional[Union[int, float]], optional) – Total timeout across all retries, in seconds, defaults to None

• retry_exceptions (Tuple[Exception], optional) – Exceptions that should trigger a retry, defaults to (Exception,)
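How these parameters interact can be illustrated with a small retry loop. This is a sketch of the general technique (bounded retries with a fixed backoff and a per-attempt timeout), not Driftage's implementation: `send_with_retry` and `flaky_send` are hypothetical helpers, the parameter names are borrowed from the list above, and `all_tries_timeout` is left out for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple, Type


async def send_with_retry(
    send: Callable[[], Awaitable[None]],
    send_timeout: Optional[float] = 1.0,
    retry_backoff: float = 1.0,
    max_tries: int = 3,
    retry_exceptions: Tuple[Type[Exception], ...] = (Exception,),
) -> int:
    """Call send() until it succeeds and return the number of attempts used."""
    for attempt in range(1, max_tries + 1):
        try:
            # Each attempt gets its own timeout.
            await asyncio.wait_for(send(), timeout=send_timeout)
            return attempt
        except retry_exceptions:
            if attempt == max_tries:
                raise  # out of tries: surface the last error
            await asyncio.sleep(retry_backoff)
    raise RuntimeError("max_tries must be at least 1")


calls = {"count": 0}


async def flaky_send() -> None:
    """Hypothetical send that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("sink unavailable")


attempts = asyncio.run(send_with_retry(flaky_send, retry_backoff=0.01))
print(attempts)
```

Narrowing `retry_exceptions` to the errors a Sink can recover from avoids retrying on programming mistakes.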

3.4.2 Sink

class driftage.executor.sink.Sink(circuit_breaker: aiobreaker.circuitbreaker.CircuitBreaker = <aiobreaker.circuitbreaker.CircuitBreaker object>, is_available_cache_ttl: Union[int, float] = 1.0, retry_config: driftage.executor.retry_config.RetryConfig = <driftage.executor.retry_config.RetryConfig object>)

Sink base class to implement custom Sinks like Kafka, RabbitMQ, MariaDB, or even an API.

Parameters

• circuit_breaker (CircuitBreaker, optional) – Circuit breaker to protect the Sink if it’s down, defaults to CircuitBreaker()

• is_available_cache_ttl (Union[int, float], optional) – Time in seconds to cache the result of the is_available healthcheck, defaults to 1.0

• retry_config (RetryConfig, optional) – Retry configuration to send data to the Sink, defaults to RetryConfig()

abstract async drain(data: dict)

Method that sends data to the Sink. This receives predicted data from the Planner and sends it out.

Parameters data (dict) – Predicted data with timestamp, predicted and identifier

Raises NotImplementedError – Needs to be implemented when overridden

abstract is_available() → bool

Healthcheck function to know if the Sink is available to receive data.

Raises NotImplementedError – Needs to be implemented when overridden

Returns True if it is available or False if not

Return type bool
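As an illustration of the two abstract methods, the sketch below implements a file-based sink. It is self-contained and does not import driftage: the `Sink` class here is a local stand-in that omits the constructor arguments documented above, and `CsvFileSink` and its column order are hypothetical.

```python
import asyncio
import csv
import os
import tempfile
from abc import ABC, abstractmethod


class Sink(ABC):
    """Stand-in for the documented base class; constructor arguments omitted."""

    @abstractmethod
    async def drain(self, data: dict):
        raise NotImplementedError

    @abstractmethod
    def is_available(self) -> bool:
        raise NotImplementedError


class CsvFileSink(Sink):
    """Hypothetical sink that appends each drift to a CSV file."""

    FIELDS = ("timestamp", "identifier", "predicted")

    def __init__(self, path: str):
        self.path = path

    def is_available(self) -> bool:
        # Healthy when the target directory exists and is writable.
        directory = os.path.dirname(self.path) or "."
        return os.path.isdir(directory) and os.access(directory, os.W_OK)

    async def drain(self, data: dict):
        with open(self.path, "a", newline="") as handle:
            csv.writer(handle).writerow([data[name] for name in self.FIELDS])


output_path = os.path.join(tempfile.mkdtemp(), "output.csv")
sink = CsvFileSink(output_path)
if sink.is_available():
    asyncio.run(sink.drain(
        {"timestamp": "2020-11-10T00:00:00", "identifier": "R-Bic", "predicted": True}
    ))
with open(output_path) as handle:
    print(handle.read())
```

The healthcheck is deliberately cheap, since the documented cache TTL suggests it may be called often.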

3.5 Knowledge Base

Stores data collected and predicted by the Analyser. Can be queried by the Analyser for retraining or by the Planner for predictions.

3.5.1 Connection

Built on top of SQLAlchemy to interact with the Knowledge Base.

class driftage.db.connection.Connection(db_engine: sqlalchemy.engine.base.Engine, bulk_size: int, bulk_time: Union[int, float], circuit_breaker: aiobreaker.circuitbreaker.CircuitBreaker = <aiobreaker.circuitbreaker.CircuitBreaker object>)

Connects with an SQLAlchemy Engine to store and query data for concept drift detection.

Parameters

• db_engine (Engine) – SQLAlchemy Engine to use as backend

• bulk_size (int) – Quantity of data that the connection waits for before making a bulk insert

• bulk_time (Union[int, float]) – Time in seconds since the last insert. If bulk_size is not reached within the bulk_time interval, the insert is performed anyway

• circuit_breaker (CircuitBreaker, optional) – Circuit Breaker configuration to connect with the Database, defaults to CircuitBreaker()

async get_between(column: sqlalchemy.sql.schema.Column, from_datetime: datetime.datetime, to_datetime: datetime.datetime) → pandas.core.frame.DataFrame

Collects data between dates from database.

Parameters

• column (Column) – Database column from schema

• from_datetime (datetime) – Start Datetime to search

• to_datetime (datetime) – End Datetime to search

Returns Data got from specified date range

Return type pd.DataFrame

async lazy_insert(df: pandas.core.frame.DataFrame)

Inserts into the database once the bulk size or bulk time is reached.

Parameters df (pd.DataFrame) – Data to be inserted
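The size-or-time rule behind lazy_insert can be sketched independently of any database. The example below is a simplified, synchronous illustration that buffers plain dictionaries instead of DataFrames; `BulkBuffer`, its `flush` callback and its injectable `clock` are hypothetical names, not part of Driftage.

```python
import time
from typing import Callable, List


class BulkBuffer:
    """Hypothetical buffer: flush when bulk_size rows are queued
    or bulk_time seconds have passed since the last flush."""

    def __init__(self, bulk_size: int, bulk_time: float,
                 flush: Callable[[List[dict]], None],
                 clock: Callable[[], float] = time.monotonic):
        self.bulk_size = bulk_size
        self.bulk_time = bulk_time
        self._flush = flush
        self._clock = clock
        self._rows: List[dict] = []
        self._last_insert = clock()

    def lazy_insert(self, rows: List[dict]) -> bool:
        """Queue rows; return True when a bulk insert was triggered."""
        self._rows.extend(rows)
        size_reached = len(self._rows) >= self.bulk_size
        time_reached = self._clock() - self._last_insert >= self.bulk_time
        if not (size_reached or time_reached):
            return False
        self._flush(self._rows)
        self._rows = []
        self._last_insert = self._clock()
        return True


inserted: List[List[dict]] = []
buffer = BulkBuffer(bulk_size=3, bulk_time=60.0, flush=inserted.append)
print(buffer.lazy_insert([{"v": 1}]))            # below both limits
print(buffer.lazy_insert([{"v": 2}, {"v": 3}]))  # bulk_size reached
```

Batching this way trades a little latency for far fewer round trips to the database, which matters for high-frequency signals.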

3.5.2 Schema

Schema of the stored and predicted data.

Definition of the table where predicted drifts are stored:

Table: driftage_kb or defined by DRIFTAGE_TABLENAME environment variable

Columns:

• driftage_jid: ID from the Analyser that predicted and saved data

• driftage_datetime_monitored: Datetime of collected data

• driftage_datetime_analysed: Datetime of analysed data

• driftage_identifier: Identifier from the Monitor that collected data

• driftage_data: JSON object with the collected data

• driftage_predicted: True or False depending on if data is drift or not
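To show what rows in this table look like, the sketch below recreates the columns in an in-memory SQLite database. This is an assumption-laden illustration only: Driftage itself goes through SQLAlchemy, the column types chosen here (text for datetimes and JSON, integer for the boolean) are guesses made for SQLite, and the inserted row is made-up sample data.

```python
import json
import os
import sqlite3

# Table name follows the documented default and environment variable.
TABLE = os.environ.get("DRIFTAGE_TABLENAME", "driftage_kb")

conn = sqlite3.connect(":memory:")
conn.execute(f"""
    CREATE TABLE "{TABLE}" (
        driftage_jid TEXT,
        driftage_datetime_monitored TEXT,
        driftage_datetime_analysed TEXT,
        driftage_identifier TEXT,
        driftage_data TEXT,         -- JSON stored as text in this sketch
        driftage_predicted INTEGER  -- 1 for drift, 0 otherwise
    )
""")
conn.execute(
    f'INSERT INTO "{TABLE}" VALUES (?, ?, ?, ?, ?, ?)',
    ("analyser@localhost", "2020-11-10T00:00:00", "2020-11-10T00:00:01",
     "R-Bic", json.dumps({"voltage": 1500.0}), 1),
)
# Query only the rows that were predicted as drifts.
drifts = conn.execute(
    f'SELECT driftage_identifier, driftage_data FROM "{TABLE}" '
    "WHERE driftage_predicted = 1"
).fetchall()
print(drifts)
```

A Planner-style query would typically also filter on one of the datetime columns to look only at recent predictions.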

CHAPTER FOUR: DATAFLOW

CHAPTER FIVE: SEQUENCE DIAGRAM

CHAPTER SIX: ABOUT

Data volumes grow and behaviours change very fast in this interconnected world. These repeated changes and the amount of data make machine learning algorithms lose accuracy because they don’t know about the new patterns. This change in the pattern of data is known as Concept Drift, and there are already many approaches for treating these drifts. Usually these treatments are costly to implement because they require knowledge of drift detection algorithms and software engineering, and they need maintenance for new drifts.

Driftage proposes a framework built on multi-agent systems to simplify the implementation of concept drift detectors for dynamic environments.

Driftage is a modular framework where:

6.1 Monitor

Captures data from any framework you want to integrate with: Spark, Flink, or even a Python function.

6.2 Analyser

Analyses data collected by the Monitor using a customized Predictor for Concept Drift detection. Fast classifiers from Scikit-Multiflow or Facebook Prophet are great projects for that.

6.3 Planner

Observes new predictions and based on them chooses whether the drift is really valid. If it’s a real drift, it should be sent to the Executor. A custom Predictor can be written for that too, like a voting one or a more time-consuming algorithm from Scikit-Learn, TensorFlow, PyTorch, or others.

6.4 Executor

Receives new drifts from the Planner and sends them to a custom Sink, such as Apache Kafka, RabbitMQ, or an API.

6.5 Knowledge Base

Stores data collected and predicted by the Analyser. Can be queried by the Analyser for retraining or by the Planner for predictions.

If you want to know more about how Driftage is structured, take a look at DataFlow and Sequence Diagram.

PYTHON MODULE INDEX

• driftage.analyser.predictor

• driftage.db.connection

• driftage.executor.retry_config

• driftage.executor.sink

• driftage.monitor

• driftage.planner.predictor
