driftage Documentation, Release latest
Diogo Munaro Vieira
Nov 10, 2020
GETTING STARTED
1 Install
2 Examples
3 API
4 DataFlow
5 Sequence Diagram
6 About
Driftage is a modular multi-agent framework to detect concept drifts from batch or streaming data.
CHAPTER
ONE
INSTALL
Install Driftage as a Python 3.7+ package:
pip install -U driftage
or
conda install driftage
Now you can build your own Monitor, Analyser, Planner and Executor using Knowledge Base.
There’s an example of detecting muscle voltage changes that you can follow.
If you want to know more about how Driftage is structured, take a look at the DataFlow and Sequence Diagram parts of this doc.
CHAPTER
TWO
EXAMPLES
An example was created as a health monitor for the moment when a user is punching. The data comes from the EMG Physical Action Data Set in the UCI Machine Learning Repository.
There is a Jupyter Notebook analysis that illustrates how different voltage signals are collected from the following muscles:
• R-Bic: right bicep (C1)
• R-Tri: right tricep (C2)
• L-Bic: left bicep (C3)
• L-Tri: left tricep (C4)
• R-Thi: right thigh (C5)
• R-Ham: right hamstring (C6)
• L-Thi: left thigh (C7)
• L-Ham: left hamstring (C8)
The analysis shows how the ADWIN drift detection algorithm adapts to each signal.
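To give a feel for what a drift detector does on a signal like these, here is a minimal sketch. It is not ADWIN: instead of ADWIN's adaptive window it compares the means of two fixed-size adjacent windows. The class name, window size, threshold, and the synthetic signal are all illustrative assumptions.

```python
from collections import deque
from statistics import mean


class TwoWindowDetector:
    """Flags drift when the recent window mean departs from the reference mean.

    Simplified stand-in for illustration only; ADWIN adapts its window size
    and uses a statistical bound instead of a fixed threshold.
    """

    def __init__(self, window_size: int = 20, threshold: float = 1.0):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = deque(maxlen=window_size)  # older samples
        self.recent = deque(maxlen=window_size)     # newest samples

    def add(self, value: float) -> bool:
        """Add one sample and return True if drift is detected."""
        if len(self.recent) == self.window_size:
            # The oldest "recent" sample graduates to the reference window.
            self.reference.append(self.recent[0])
        self.recent.append(value)
        if len(self.reference) < self.window_size:
            return False  # not enough history yet
        drifted = abs(mean(self.recent) - mean(self.reference)) > self.threshold
        if drifted:
            # Start over so the new level becomes the reference.
            self.reference.clear()
            self.recent.clear()
        return drifted


detector = TwoWindowDetector(window_size=20, threshold=1.0)
signal = [0.0] * 60 + [5.0] * 60  # hypothetical voltage with a level shift
drift_points = [i for i, v in enumerate(signal) if detector.add(v)]
```

The detector reports the shift a few samples after it happens, once enough new values have entered the recent window to move its mean past the threshold.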
One example for each kind of agent was implemented:
• Spark Monitor: as a Monitor integrated with Apache Spark Structured Streaming to read CSV punching signals from muscles.
• ADWIN Analyser: as an Analyser for each muscle signal using ADWIN from Skmultiflow to detect drifts on muscle activity.
• Voting Planner: as a Planner for voting with a threshold 3 <= X < 8. If X muscle signals are interpreted as having drift within the threshold, then it alerts a drift to the Executor.
• Csv Executor: as an Executor that validates if the filesystem is ok and saves the detected Concept Drift.
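The voting rule can be sketched in a few lines. This is not the code of the example itself; it assumes the threshold is meant as 3 <= X < 8, where X is the number of muscle signals currently flagged as drifting, and the function name `vote` is hypothetical.

```python
from typing import Mapping


def vote(drift_flags: Mapping[str, bool], low: int = 3, high: int = 8) -> bool:
    """Return True when the number of drifting signals X satisfies low <= X < high."""
    drifting = sum(1 for flagged in drift_flags.values() if flagged)
    return low <= drifting < high


# Hypothetical Analyser outputs: three of the eight muscles report drift.
flags = {"R-Bic": True, "R-Tri": True, "L-Bic": True, "L-Tri": False,
         "R-Thi": False, "R-Ham": False, "L-Thi": False, "L-Ham": False}
alert = vote(flags)
```

Under this reading, a drift on one or two muscles is treated as noise, and a drift on all eight at once is not reported either.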
TimescaleDB was chosen as the Knowledge Base because it is fully compatible with SQLAlchemy, which is used in the connection adapter, and handles time series data very well.
The full example can be executed using Docker Compose in these 3 steps:
1. Clone the repository:
git clone https://github.com/dmvieira/driftage.git
2. With Docker Compose installed, run from the driftage folder:
make example
3. Wait until the Executor logs that a drift has been written, then take a look at the file with drifts:
cat example/health_monitor/build/executor/output.csv
For more details, take a look at API.
CHAPTER
THREE
API
3.1 Monitor
Captures data integrated to any framework you want: Spark, Flink, or even a Python function.
class driftage.monitor.Monitor(jid: str, password: str, identifier: Optional[str] = None, verify_security: bool = False)
An Agent to collect data from sources and send to Analyser. This agent authenticates on XMPP server.
Parameters
• jid (str) – Id for XMPP authentication. Ex: user@localhost
• password (str) – Password for XMPP authentication.
• identifier (Optional[str], optional) – Data identification or agent jid, defaults to None
• verify_security (bool, optional) – Security validation with XMPP server, defaults to False.
collect(data: dict)
Callback to collect data to be sent as dict.
Parameters data (dict) – Data to send
async setup()
Agent startup for behaviours.
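Because collect takes a plain dict, any Python data source can feed a Monitor. The sketch below shows that shape without importing driftage or connecting to an XMPP server: `feed_rows` is a hypothetical helper, the column names are made up, and `collected.append` stands in for a real Monitor's collect method.

```python
import csv
import io
from typing import Callable, Iterable


def feed_rows(lines: Iterable[str], collect: Callable[[dict], None]) -> int:
    """Parse CSV lines and pass each row to `collect` as a dict of floats."""
    sent = 0
    for row in csv.DictReader(lines):
        collect({column: float(value) for column, value in row.items()})
        sent += 1
    return sent


collected = []  # stand-in for a Monitor; a real agent would send to the Analyser
source = io.StringIO("R-Bic,R-Tri\n0.12,0.40\n0.15,0.38\n")
count = feed_rows(source, collected.append)
```

With a real agent, the second argument would be the collect method of a Monitor instance.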
3.2 Analyser
Analyses data collected by the Monitor using a customized Predictor for Concept Drift detection. Fast classifiers from Scikit-Multiflow or Facebook Prophet are great projects for that.
3.2.1 Analyser Predictor
class driftage.analyser.predictor.AnalyserPredictor(connection: driftage.db.connection.Connection)
Predictor base class for Concept Drift detection.
Parameters connection (Connection) – Knowledge Base connection
async fit()
Load a new model or get old data for model retraining. If retrain_period is None, you can ignore this function when inheriting.
abstract async predict(X: driftage.analyser.predictor.PredictionData) → bool
Receives PredictionData and predicts whether the new data is a Concept Drift or not.
Parameters X (PredictionData) – Data to be predicted as Concept Drift
Raises NotImplementedError – Needs to be implemented when overridden
Returns Prediction for whether PredictionData is a drift or not
Return type bool
retrain_period() → Optional[int]
Retrain time period (in seconds). This property defines how long AnalyserPredictor will wait until the next fit call. Retraining is optional: if it returns None, the fit method is never called.
Returns Time to wait for retrain in seconds or None if no retrain
Return type Optional[int]
class driftage.analyser.predictor.PredictionData(data: dict, timestamp: datetime.datetime, identifier: str)
Dataclass to store data to predict.
Parameters
• data (dict) – Data that comes from the Monitor.
• timestamp (datetime) – Datetime object of when the message was created.
• identifier (str) – Data identifier that comes from the Monitor or any identifier you want.
data: dict = None
identifier: str = None
timestamp: datetime = None
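A custom Analyser predictor mostly comes down to implementing predict. The sketch below is self-contained and does not import driftage: `PredictionData` and `AnalyserPredictor` are local stand-ins that mirror the documented fields and methods, `retrain_period` is modelled as a property (an assumption based on the description above), and `LimitPredictor` with its `voltage` key is hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PredictionData:
    """Stand-in with the documented fields."""
    data: dict
    timestamp: datetime
    identifier: str


class AnalyserPredictor(ABC):
    """Stand-in for driftage.analyser.predictor.AnalyserPredictor."""

    def __init__(self, connection):
        self.connection = connection

    async def fit(self) -> None:
        """Optional retrain hook; a no-op in this stand-in."""

    @abstractmethod
    async def predict(self, X: PredictionData) -> bool:
        raise NotImplementedError

    @property
    def retrain_period(self) -> Optional[int]:
        return None  # None means fit is never scheduled


class LimitPredictor(AnalyserPredictor):
    """Hypothetical predictor: drift when |voltage| exceeds a fixed limit."""

    def __init__(self, connection, limit: float):
        super().__init__(connection)
        self.limit = limit

    async def predict(self, X: PredictionData) -> bool:
        return abs(X.data["voltage"]) > self.limit


predictor = LimitPredictor(connection=None, limit=1.0)
sample = PredictionData({"voltage": 2.5}, datetime(2020, 11, 10), "R-Bic")
is_drift = asyncio.run(predictor.predict(sample))
```

In a real deployment the base class and dataclass would be imported from driftage.analyser.predictor, and the connection would be a Knowledge Base Connection rather than None.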
3.3 Planner
Observes the situation for possible new predictions and based on them chooses whether the drift is actually valid. If it’s a real drift, it should be sent to the Executor. A custom Predictor can be built for that too, like a voting one or a more time-consuming algorithm from Scikit-Learn, TensorFlow, PyTorch, or others.
3.3.1 Planner Predictor
class driftage.planner.predictor.PlannerPredictor(connection: driftage.db.connection.Connection)
Predictor base class for Concept Drift detection.
Parameters connection (Connection) – Knowledge Base connection
abstract async predict() → List[driftage.planner.predictor.PredictResult]
Using data stored in the KB, predicts whether this data is a Concept Drift or not, and so whether it should be sent to the Executor.
Raises NotImplementedError – Needs to be implemented when overridden
Returns Results predicted that should or shouldn’t be sent to the Executor
Return type List[PredictResult]
abstract property predict_period
Predict time period (in seconds). This property defines how long PlannerPredictor will wait until the next predict call.
Raises NotImplementedError – Needs to be implemented when overridden
Returns Time it takes to wait for predict in seconds
Return type Union[float, int]
class driftage.planner.predictor.PredictResult(identifier: str, predicted: Union[bool, str, int, float], should_send: bool)
Dataclass to store each prediction, its result, and whether this prediction should be sent to the Executor.
Parameters
• identifier (str) – Data identifier that comes from Monitor or any identifier you want.
• predicted (Union[bool, str, int, float]) – Value predicted from Drift detection algorithm. This can even inform type of drift to Executor.
• should_send (bool) – If this prediction should be sent to Executor. Sometimes your Planner can decide to not send it because of time or other business rules.
identifier: str = None
predicted: Union[bool, str, int, float] = None
should_send: bool = None
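As an illustration of a time-based business rule for should_send, the sketch below suppresses a drift if one was already sent for the same identifier within a cooldown window. `PredictResult` here is a local stand-in with the documented fields; `build_results`, the cooldown rule, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Union


@dataclass
class PredictResult:
    """Stand-in with the documented fields."""
    identifier: str
    predicted: Union[bool, str, int, float]
    should_send: bool


def build_results(predictions: Dict[str, bool], last_sent: Dict[str, datetime],
                  now: datetime, cooldown: timedelta) -> List[PredictResult]:
    """Mark a drift for sending only if none was sent within `cooldown`."""
    results = []
    for identifier, predicted in predictions.items():
        previous = last_sent.get(identifier)
        cooled_down = previous is None or now - previous >= cooldown
        results.append(PredictResult(identifier, predicted, predicted and cooled_down))
    return results


now = datetime(2020, 11, 10, 12, 0, 0)
results = build_results(
    predictions={"R-Bic": True, "L-Bic": True, "R-Tri": False},
    last_sent={"L-Bic": now - timedelta(seconds=10)},
    now=now,
    cooldown=timedelta(minutes=1),
)
```

Here only R-Bic is marked for sending: L-Bic drifted but was reported ten seconds earlier, and R-Tri did not drift.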
3.4 Executor
Receives new drifts from the Planner and sends them to a custom Sink. It can be Apache Kafka, RabbitMQ, an API, and so on.
3.4.1 Retry Config
class driftage.executor.retry_config.RetryConfig(send_timeout: Union[int, float, None] = 1.0, retry_backoff: Union[int, float] = 1.0, max_tries: int = 3, all_tries_timeout: Union[int, float, None] = None, retry_exceptions: Tuple[Exception] = (<class 'Exception'>, ))
Retry configuration for the Sink connection. Configuring retries makes the Executor resilient when sending data to the Sink.
Parameters
• send_timeout (Optional[Union[int, float]], optional) – Timeout for each send, in seconds, defaults to 1.0
• retry_backoff (Union[int, float], optional) – Time to wait before another try, in seconds, defaults to 1.0
• max_tries (int, optional) – Maximum number of tries, defaults to 3
• all_tries_timeout (Optional[Union[int, float]], optional) – Total timeout across all retries, in seconds, defaults to None
• retry_exceptions (Tuple[Exception], optional) – Exceptions that should be taken into account to retry, defaults to (Exception,)
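To make the interplay of these parameters concrete, here is a minimal retry loop using only asyncio. It is an illustration of the semantics described above, not driftage's implementation; `send_with_retry` and `flaky_send` are hypothetical, and all_tries_timeout is left out for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple, Type


async def send_with_retry(
    send: Callable[[], Awaitable[None]],
    send_timeout: Optional[float] = 1.0,
    retry_backoff: float = 1.0,
    max_tries: int = 3,
    retry_exceptions: Tuple[Type[BaseException], ...] = (Exception,),
) -> int:
    """Call `send` until it succeeds; return the number of tries used."""
    for attempt in range(1, max_tries + 1):
        try:
            # With timeout=None, wait_for waits without a time limit.
            await asyncio.wait_for(send(), timeout=send_timeout)
            return attempt
        except retry_exceptions:
            if attempt == max_tries:
                raise  # out of tries: surface the last error
            await asyncio.sleep(retry_backoff)
    raise ValueError("max_tries must be at least 1")


calls = {"count": 0}


async def flaky_send() -> None:
    """Hypothetical sink call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("sink unavailable")


tries_used = asyncio.run(send_with_retry(flaky_send, retry_backoff=0.01))
```

With the default max_tries of 3, the two failures are absorbed and the third attempt succeeds; a third failure would be re-raised to the caller.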
3.4.2 Sink
class driftage.executor.sink.Sink(circuit_breaker: aiobreaker.circuitbreaker.CircuitBreaker = <aiobreaker.circuitbreaker.CircuitBreaker object>, is_available_cache_ttl: Union[int, float] = 1.0, retry_config: driftage.executor.retry_config.RetryConfig = <driftage.executor.retry_config.RetryConfig object>)
Sink base class to implement custom Sinks like Kafka, RabbitMQ, MariaDB, or even an API.
Parameters
• circuit_breaker (CircuitBreaker, optional) – Circuit breaker to protect Sink if it’s down, defaults to CircuitBreaker()
• is_available_cache_ttl (Union[int, float], optional) – Healthcheck cache: how many seconds the result of the is_available method is cached, defaults to 1.0
• retry_config (RetryConfig, optional) – Retry configuration to send data to Sink, defaults to RetryConfig()
abstract async drain(data: dict)
Method that sends data to the Sink. This receives predicted data from the Planner and sends it out.
Parameters data (dict) – Predicted data with timestamp, predicted and identifier
Raises NotImplementedError – Needs to be implemented when overridden
abstract is_available() → bool
Healthcheck function to know if Sink is available to receive data.
Raises NotImplementedError – Needs to be implemented when overridden
Returns True if it is available or False if not
Return type bool
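A custom Sink needs the two abstract methods, drain and is_available. The sketch below uses a local stand-in base class instead of importing driftage, so the circuit breaker, healthcheck cache, and retry configuration are not modelled. `MemorySink`, its capacity rule, and the exact shape of the `drift` dict (including the timestamp format) are assumptions.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List


class Sink(ABC):
    """Stand-in for driftage.executor.sink.Sink with its two abstract methods."""

    @abstractmethod
    async def drain(self, data: dict) -> None:
        raise NotImplementedError

    @abstractmethod
    def is_available(self) -> bool:
        raise NotImplementedError


class MemorySink(Sink):
    """Hypothetical sink that keeps drifts in a list, e.g. for tests."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.received: List[dict] = []

    async def drain(self, data: dict) -> None:
        self.received.append(data)

    def is_available(self) -> bool:
        # Report unavailable once the in-memory buffer is full.
        return len(self.received) < self.capacity


sink = MemorySink(capacity=1)
# Timestamp as epoch seconds is an assumption about the payload format.
drift = {"timestamp": 1605009600.0, "identifier": "R-Bic", "predicted": True}
was_available = sink.is_available()
asyncio.run(sink.drain(drift))
```

A production Sink would replace the list with a Kafka producer, a message queue client, or an HTTP call, and is_available with a real healthcheck.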
3.5 Knowledge Base
Stores data collected and predicted by the Analyser. Can be queried by the Analyser for retraining or by the Planner for predictions.
3.5.1 Connection
Built on top of SQLAlchemy to interact with the Knowledge Base.
class driftage.db.connection.Connection(db_engine: sqlalchemy.engine.base.Engine, bulk_size: int, bulk_time: Union[int, float], circuit_breaker: aiobreaker.circuitbreaker.CircuitBreaker = <aiobreaker.circuitbreaker.CircuitBreaker object>)
Connects with SQLAlchemy Engine to store and query data for concept drift detection.
Parameters
• db_engine (Engine) – SQLAlchemy Engine to use as backend
• bulk_size (int) – Quantity of data that connection will wait to make bulk insert
• bulk_time (Union[int, float]) – Time in seconds between last insert and now. If bulk_size is not reached within the bulk_time interval, the insert is performed anyway
• circuit_breaker (CircuitBreaker, optional) – Circuit Breaker configuration to connect with Database, defaults to CircuitBreaker()
async get_between(column: sqlalchemy.sql.schema.Column, from_datetime: datetime.datetime, to_datetime: datetime.datetime) → pandas.core.frame.DataFrame
Collects data between dates from database.
Parameters
• column (Column) – Database column from schema
• from_datetime (datetime) – Start Datetime to search
• to_datetime (datetime) – End Datetime to search
Returns Data got from specified date range
Return type pd.DataFrame
async lazy_insert(df: pandas.core.frame.DataFrame)
Insert in database if bulk size or bulk time reached.
Parameters df (pd.DataFrame) – Data to be inserted
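The bulk_size and bulk_time rule can be sketched as a small buffer that flushes when either limit is reached. This illustrates the rule only and is not driftage's implementation: `BulkBuffer` is hypothetical, it is synchronous, it buffers dicts rather than DataFrames, and the injectable clock exists just to keep the example deterministic.

```python
import time
from typing import Callable, List


class BulkBuffer:
    """Buffers rows and flushes when bulk_size or bulk_time is reached."""

    def __init__(self, bulk_size: int, bulk_time: float,
                 flush: Callable[[List[dict]], None],
                 clock: Callable[[], float] = time.monotonic):
        self.bulk_size = bulk_size
        self.bulk_time = bulk_time
        self.flush = flush
        self.clock = clock
        self.rows: List[dict] = []
        self.last_flush = clock()

    def lazy_insert(self, row: dict) -> bool:
        """Buffer one row; return True if this call triggered a flush."""
        self.rows.append(row)
        size_reached = len(self.rows) >= self.bulk_size
        time_reached = self.clock() - self.last_flush >= self.bulk_time
        if not (size_reached or time_reached):
            return False
        self.flush(self.rows)
        self.rows = []  # new list, so the flushed batch is left untouched
        self.last_flush = self.clock()
        return True


batches: List[List[dict]] = []
fake_now = [0.0]  # controllable clock for a deterministic example
buffer = BulkBuffer(bulk_size=3, bulk_time=10.0,
                    flush=batches.append, clock=lambda: fake_now[0])
flushed = [buffer.lazy_insert({"n": n}) for n in range(3)]  # size trigger
fake_now[0] = 15.0
flushed.append(buffer.lazy_insert({"n": 3}))                # time trigger
```

The first flush happens because three rows accumulated; the second happens with a single row because more than bulk_time seconds passed since the last flush.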
3.5.2 Schema
Schema of the data stored and predicted.
Definition of the table where predicted drifts are stored:
Table: driftage_kb, or the name defined by the DRIFTAGE_TABLENAME environment variable
Columns:
• driftage_jid: ID from the Analyser that predicted and saved data
• driftage_datetime_monitored: Datetime of collected data
• driftage_datetime_analysed: Datetime of analysed data
• driftage_identifier: Identifier from the Monitor that collected data
• driftage_data: Json type object from data collected
• driftage_predicted: True or False depending on if data is drift or not
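For orientation, the sketch below recreates a table with these column names in an in-memory SQLite database and queries the rows flagged as drift. SQLite is used here only as a stand-in for the real Knowledge Base; the column types, the sample row, and storing JSON as text are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE driftage_kb (
        driftage_jid TEXT,
        driftage_datetime_monitored TEXT,
        driftage_datetime_analysed TEXT,
        driftage_identifier TEXT,
        driftage_data TEXT,          -- JSON serialised as text in this sketch
        driftage_predicted INTEGER   -- 1 for drift, 0 otherwise
    )
    """
)
# Hypothetical row written by an Analyser.
conn.execute(
    "INSERT INTO driftage_kb VALUES (?, ?, ?, ?, ?, ?)",
    ("analyser@localhost", "2020-11-10T12:00:00", "2020-11-10T12:00:01",
     "R-Bic", json.dumps({"voltage": 2.5}), 1),
)
drifts = conn.execute(
    "SELECT driftage_identifier, driftage_data FROM driftage_kb "
    "WHERE driftage_predicted = 1"
).fetchall()
```

A Planner-style query would typically also filter on driftage_datetime_analysed to look only at recent predictions.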
CHAPTER
FOUR
DATAFLOW
CHAPTER
FIVE
SEQUENCE DIAGRAM
CHAPTER
SIX
ABOUT
Data volumes and behaviours change very fast in this interconnected world. These repeated changes and the amount of data make machine learning algorithms lose accuracy because they don’t know about the new patterns. This change in the pattern of data is known as Concept Drift, and there are already many approaches for treating these drifts. These treatments are usually costly to implement because they require knowledge of drift detection algorithms and software engineering, and they need maintenance for new drifts.
The proposal of Driftage is to build a framework using multi-agent systems to simplify the implementation of concept drift detectors for dynamic environments.
Driftage is a modular framework where:
6.1 Monitor
Captures data integrated to any framework you want: Spark, Flink, or even a Python function.
6.2 Analyser
Analyses data collected by the Monitor using a customized Predictor for Concept Drift detection. Fast classifiers from Scikit-Multiflow or Facebook Prophet are great projects for that.
6.3 Planner
Observes new predictions and based on them chooses if the drift is really valid. If it’s a real drift, it should be sent to the Executor. A custom Predictor can be built for that too, like a voting one or a more time-consuming algorithm from Scikit-Learn, TensorFlow, PyTorch, or others.
6.4 Executor
Receives new drifts from the Planner and sends them to a custom Sink. It can be Apache Kafka, RabbitMQ, an API, and so on.
6.5 Knowledge Base
Stores data collected and predicted by the Analyser. Can be queried by the Analyser for retraining or by the Planner for predictions.
If you want to know more about how Driftage is structured, take a look at DataFlow and Sequence Diagram.