www.cranfield.ac.uk
Hongmei He
Online Anomaly Detection of Time Series at Scale
Cyber Science 2019, 3-4 Jun 2019
Outline
The project team
Research Background
The Challenges of Time-Series Anomaly Detection
Parametric Approaches
Non-parametric approaches
Classic machine learning platforms
Cloud and streaming platforms
Project Consortium
Teams: Cranfield Team, iDirect Team
Dr Hongmei He, PI
Dr Yifan Zhao, Supervisor
Leoni Padilla, Knowledge Exchange Officer
Raymon Gompelman, Director – PULSE Product Development
Andrew Mason, KTP Associate
Srikanth Mandava, Supervisor
Satellite Communication
Google Inc. plans to create a satellite constellation of around 1000 satellites orbiting the earth and covering around 75% of its surface.
[Image source: Internet]
Very Small Aperture Terminals (VSATs)
A two-way satellite ground station, consisting of two units: Outdoor Unit (dish antenna <3.8m) and Indoor Unit
Used for reliable transmission of data, video and voice via satellite
Simply plug in existing terminal equipment
[Image source: Internet]
Motivation
The Goal of the Project
iDirect UK Ltd. aims to add value to its current software product in a new and innovative way, giving customers a better understanding of their network traffic.
The vision is to develop an ‘add-on’ module for the software product that can predict and report network behaviours that will eventually lead to an error or a fault.
The Problem to Be Solved
Satellite operators need to be able to react quickly to network issues such as a remote terminal going down or a faulty earth station. How can we provide our customers with early warning of anomalies in their networks?
Time-Series Anomaly Detection (TSAD)
[Image source: Internet]
Time Series Prediction
$$y_t = f(y_{t-d},\, y_{t-d-1},\, \ldots,\, y_{t-d-n+1}) + \varepsilon_t$$
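The prediction formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming that d is the lag between the newest input and the predicted point, that n is the number of lagged values fed to f, and that the residual is the difference between the observed value and the prediction. The predictor `mean_of_lags` is a hypothetical placeholder for f, which the slides do not specify.

```python
from typing import Callable, List, Sequence, Tuple


def lagged_inputs(y: Sequence[float], t: int, d: int, n: int) -> List[float]:
    """Return [y[t-d], y[t-d-1], ..., y[t-d-n+1]], the arguments passed to f."""
    if t - d - n + 1 < 0:
        raise ValueError("not enough history for this t, d and n")
    return [y[t - d - k] for k in range(n)]


def residuals(
    y: Sequence[float],
    f: Callable[[List[float]], float],
    d: int,
    n: int,
) -> List[Tuple[int, float]]:
    """Compute eps_t = y_t - f(lagged inputs) for every t with enough history."""
    first_t = d + n - 1
    return [(t, y[t] - f(lagged_inputs(y, t, d, n))) for t in range(first_t, len(y))]


def mean_of_lags(lags: List[float]) -> float:
    """Placeholder predictor (an assumption): the average of the lagged values."""
    return sum(lags) / len(lags)


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
eps = residuals(series, mean_of_lags, d=1, n=2)
```

A large residual at some t is one possible signal of an anomaly; any fitted model could replace `mean_of_lags` without changing the surrounding code.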
Critical Challenges for Online Anomaly Detection
Challenge 1: Real-time performance of the algorithm, given the complexity of the algorithm itself and the volume of data to be processed;
Challenge 2: Uncertainty of anomaly situations, which may require different analyses and hybrid techniques;
Challenge 3: Lack of positive data samples and/or extreme imbalance between positive (anomaly) and negative (normal) data samples.
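Challenge 3 can be made concrete with a small sketch. Assuming a hypothetical stream in which 1% of samples are anomalies, a detector that never raises an alarm still reaches 99% accuracy while missing every anomaly, which is why recall is worth tracking alongside accuracy. The data and function names below are illustrative assumptions, not taken from the project.

```python
from typing import List, Tuple


def confusion_counts(truth: List[bool], predicted: List[bool]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), where True means 'anomaly'."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    return tp, fp, fn, tn


def accuracy(truth: List[bool], predicted: List[bool]) -> float:
    tp, fp, fn, tn = confusion_counts(truth, predicted)
    return (tp + tn) / len(truth)


def recall(truth: List[bool], predicted: List[bool]) -> float:
    tp, _, fn, _ = confusion_counts(truth, predicted)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels: 2 anomalies among 200 samples (1% positive).
truth = [True] * 2 + [False] * 198
never_alarm = [False] * 200  # a detector that flags nothing

acc = accuracy(truth, never_alarm)
rec = recall(truth, never_alarm)
```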
Issues in implementing real-time time series anomaly detection
Need for labels/supervised learning
Lack of robustness to previous anomalous events
High latency requirements
Multi-pass processes over data
Non-adaptiveness to changes in underlying distributions
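Several of the issues listed above (labels, multiple passes, non-adaptiveness) can be avoided together by a single-pass detector with exponentially forgotten statistics. The sketch below is one possible design, not the project's method; the class name and the values of `alpha`, `threshold` and `warmup` are assumptions chosen for illustration.

```python
import math


class EwmaDetector:
    """Single-pass, label-free detector with exponentially weighted mean and variance."""

    def __init__(self, alpha: float = 0.1, threshold: float = 3.0, warmup: int = 5):
        self.alpha = alpha          # forgetting factor (assumed value)
        self.threshold = threshold  # cut-off in standard deviations (assumed value)
        self.warmup = warmup        # points observed before any flagging
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def update(self, x: float) -> bool:
        """Score x against the current state, then fold it into the state."""
        self.count += 1
        if self.count == 1:
            self.mean = x
            return False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        if not is_anomaly:
            # Flagged points are skipped so past anomalies do not distort the baseline.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly


stream = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 25.0, 10.0, 9.9]
detector = EwmaDetector()
flags = [detector.update(x) for x in stream]
```

Skipping flagged points keeps the baseline robust to earlier anomalies, but it means the detector follows gradual drift only; after an abrupt, permanent level shift it would keep flagging until reset, so the trade-off between robustness and adaptiveness remains a design decision.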
Important Considerations in Online Time Series Anomaly Detection
Timeliness
Rate of change/concept drift
Scale
Conciseness
Level of Autonomy
Parametric approaches
Statistical methods, under two assumptions: normally distributed data and stationarity
Time series analysis methods, under the assumption that the series carries structural information
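As an illustration of a statistical method that relies on both assumptions, the sketch below applies the familiar k-sigma rule: limits are fitted once on a reference sample (stationarity) and the false alarm rate is derived from the normal distribution (normality). The function names and sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev
from typing import List, Sequence, Tuple


def fit_limits(reference: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Fit fixed control limits mu +/- k*sigma on a sample assumed to be normal."""
    mu = mean(reference)
    sigma = stdev(reference)
    return mu - k * sigma, mu + k * sigma


def flag(values: Sequence[float], lower: float, upper: float) -> List[bool]:
    """Mark values outside the fitted limits as point anomalies."""
    return [not (lower <= v <= upper) for v in values]


def false_alarm_rate(k: float) -> float:
    """Two-sided tail probability of a standard normal beyond k sigma."""
    return 2.0 * (1.0 - NormalDist().cdf(k))


reference = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1]  # hypothetical normal traffic
lower, upper = fit_limits(reference)
flags = flag([10.05, 11.0, 9.7], lower, upper)
rate = false_alarm_rate(3.0)
```

If the data are not normal or the mean drifts, the computed false alarm rate no longer holds, which is the practical cost of the two assumptions.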
Statistical Methods[1]
P: Parametric Techniques, PT: Point Anomalies, PA: Pattern Anomalies, INC: Incremental Techniques, ROBUST: Robustness to Noise, RECENCY: Ability to Weight Observations by Age, TG: Time Granularity of Data that can be used by the method, CFAR: Constant False Alarm Rate
[1] Choudhary, D., Kejariwal, A., & Orsini, F., “On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data”. 12 Oct 2017. arXiv:1710.04735v1.
Time Series Analysis [1]
Pattern Mining & Machine Learning [1]
Non-parametric methods – Machine Learning Techniques
1. Supervised Learning Strategies for Time Series Forecasting (Bontempi, S, et al, eBISS, 2012)
2. Semi-supervised
- OCSVM (Arbon, E.K., Smet, P.J., 2015)
3. Unsupervised
- IsolationForest (Liu, F.T. et al, ICDM ’08, 413-422)
PS. A comparison of ML and statistical approaches found that statistical approaches are better than ML in most cases (Makridakis, S. et al. 2018. PLoS ONE, 13(3), 1-26).
A hybrid approach could be a good way to solve time-series prediction problems.
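The unsupervised option listed above, Isolation Forest, rests on the idea that anomalies are isolated by fewer random splits than normal points. The following is a minimal, standard-library sketch of that idea for illustration only; it is not the cited authors' code, and in practice a library implementation such as scikit-learn's `IsolationForest` would normally be used. The data, seed and parameter defaults are assumptions.

```python
import math
import random
from typing import Dict, List, Sequence

Point = Sequence[float]
EULER_GAMMA = 0.5772156649


def average_path_length(n: int) -> float:
    """c(n): expected path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points: List[Point], depth: int, max_depth: int, rng: random.Random) -> Dict:
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    low = min(p[dim] for p in points)
    high = max(p[dim] for p in points)
    if low == high:  # identical values on this feature cannot be split
        return {"size": len(points)}
    split = rng.uniform(low, high)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(point: Point, node: Dict, depth: int = 0) -> float:
    """Depth at which the point lands, plus c(size) for leaves holding several points."""
    if "split" not in node:
        return depth + average_path_length(node["size"])
    branch = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(point, node[branch], depth + 1)


def anomaly_scores(
    data: List[Point], n_trees: int = 100, sample_size: int = 256, seed: int = 0
) -> List[float]:
    """Score each point in (0, 1]; values closer to 1 are more anomalous."""
    if len(data) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(data, size), 0, max_depth, rng) for _ in range(n_trees)]
    norm = average_path_length(size)
    return [
        2.0 ** (-sum(path_length(p, t) for t in trees) / (n_trees * norm)) for p in data
    ]


# Hypothetical one-feature data: a tight cluster near 10 plus one outlier at 50.
data = [[9.5 + 0.05 * i] for i in range(21)] + [[50.0]]
scores = anomaly_scores(data)
```

For time series, each point would typically be a feature vector built from a sliding window of recent values rather than a single reading.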
Classic Machine Learning Platforms
• MATLAB: point-and-click apps for training and comparing machine learning models, with advanced signal processing, automatic hyperparameter tuning and feature selection, and scaling of processing to big data and clusters.
• WEKA: a collection of machine learning algorithms for data mining, including data pre-processing, classification, regression, clustering, association rule mining, and visualization.
• KNIME: open source software for creating data science applications and services. It provides reusable components for data processing and analysis, which allow users to create graphical workflows for their tasks.
R and Python provide high-level programming platforms with strong support for statistics and machine learning techniques, respectively.
Cloud Platforms
Platform | Amazon | Google Cloud | Microsoft Azure | IBM Watson
Interface | Console & command line | REST API | Azure Machine Learning Studio | SPSS graphical interface
Data storage | AWS account: S3, RDS, Redshift | Cloud account | Azure Cloud (data over 2GB) | IBM Bluemix
Charge | Small charge on tasks | Small charge on tasks | Free and paid versions; fixed charge per user and a small charge on time | Paid and free versions
ML algorithms | Pre-built algorithms | TensorFlow (open source) | Automated algorithms | Machine learning via API
Merits | Easy and fast to use | Scale, speed, and stability | Easy to deliver individual tasks | Enterprise client services for medicine, finance and large organisations
Streaming Platforms
Streaming platform | Advantages | Disadvantages
[logo not recovered] | Very low latency; true streaming; mature, with high throughput; excellent for non-complicated streaming use cases | No state management; no advanced features such as event-time processing, aggregation, windowing, watermarks or sessions; at-least-once guarantee
[logo not recovered] | Supports Lambda architecture; comes free with Spark; high throughput where low latency is not required; fault tolerance by default due to its micro-batch nature; higher-level APIs | Not true streaming; not suitable for low-latency requirements; many parameters to tune; stateless by nature
[logo not recovered] | Leader of innovation in the open source streaming landscape; true streaming framework with all advanced features | Less popular; less well suited to standalone applications and microservices that need to do stream processing, in contrast to Kafka with its lightweight library
Apache Kafka | Very lightweight library; good for microservices and IoT applications | Tightly coupled with Kafka; cannot be used without Kafka in the picture
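The "advanced features" named in the comparison (event-time processing, windowing, watermarks) can be illustrated without any framework. The sketch below is a plain-Python approximation of event-time tumbling windows with a watermark; it does not reproduce the API of any platform in the table, and the window length, allowed lateness and events are assumed values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, float]  # (event_time_seconds, value)


def tumbling_window_sums(
    events: Iterable[Event], window: float = 10.0, allowed_lateness: float = 5.0
) -> Tuple[List[Tuple[float, float]], List[Event]]:
    """Sum values per event-time window; a window closes once the watermark passes its end."""
    open_windows: Dict[int, float] = defaultdict(float)
    emitted: List[Tuple[float, float]] = []  # (window_start, sum)
    dropped: List[Event] = []                # events that arrived after their window closed
    watermark = float("-inf")
    for event_time, value in events:
        # The watermark trails the largest event time seen by the allowed lateness.
        watermark = max(watermark, event_time - allowed_lateness)
        index = int(event_time // window)
        if (index + 1) * window <= watermark:
            dropped.append((event_time, value))
        else:
            open_windows[index] += value
        for i in sorted(k for k in open_windows if (k + 1) * window <= watermark):
            emitted.append((i * window, open_windows.pop(i)))
    for i in sorted(open_windows):  # flush whatever is still open at end of input
        emitted.append((i * window, open_windows[i]))
    return emitted, dropped


events = [(1.0, 1.0), (4.0, 2.0), (12.0, 3.0), (8.0, 4.0), (17.0, 5.0), (3.0, 6.0), (25.0, 7.0)]
emitted, dropped = tumbling_window_sums(events)
```

The out-of-order event at time 8.0 is still counted because the watermark has not yet passed the end of its window, while the event at time 3.0 arrives too late and is dropped.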
Thank you!