SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Natasha Balac, Ph.D Chuck Wells, Ph.D

Nicole Wolter Albert Nguyen

Jake Schurmeier

Predictive Analytics Center of Excellence (PACE) San Diego Supercomputer Center University of California, San Diego

PMU Time Series Data Mining



Brief History of SDSC

• 1985-1997: NSF national supercomputer center; managed by General Atomics

• 1997-2007: NSF PACI program leadership center; managed by UCSD

• PACI: Partnerships for Advanced Computational Infrastructure

• 2007-2009: Internal transition to support more diversified research computing

• Still NSF national “resource provider”

• 2009-future: Multi-constituency cyberinfrastructure (CI) center

• Provide data-intensive CI resources, services, and expertise for campus, state, and nation

• Approaching $1B in lifetime contract and grant activity



Gordon Speeds and Feeds

INTEL SANDY BRIDGE COMPUTE NODE

• Sockets & Cores: 2 & 16
• Clock speed: 2.6 GHz
• DRAM capacity and speed: 64 GB, 1,333 MHz

INTEL 710 eMLC FLASH I/O NODE

• NAND flash SSD drives: 16
• SSD capacity per drive & per node: 16 * 300 GB = 4.8 TB

SMP SUPER-NODE (VIA VSMP)

• Compute nodes / I/O nodes: 32 / 2
• Addressable DRAM: 2 TB
• Addressable memory including flash: 11.6 TB

GORDON (AGGREGATE)

• Compute nodes: 1,024
• Compute cores: 16,384
• Peak performance: 341 TF
• DRAM/SSD memory: 64 TB DRAM; 300 TB SSD

INFINIBAND INTERCONNECT

• Architecture: Dual-rail, 3D torus
• Link bandwidth: QDR
• Vendor: Mellanox

LUSTRE-BASED DISK I/O SUBSYSTEM (SHARED)

• Total storage, current/planned: 4 PB / 6 PB (raw)
• Total bandwidth: 100 GB/s

Presentation Notes

• Designed to accelerate access to massive amounts of data in areas of genomics, earth science, engineering, medicine, and others
• Emphasizes memory and I/O over FLOPS
• Appro-integrated 1,024-node Sandy Bridge cluster
• 300 TB of high-performance Intel flash
• Large-memory supernodes via vSMP Foundation from ScaleMP
• 3D torus interconnect from Mellanox
• In production operation since February 2012
• Funded by the NSF and available through the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program



PMU Frequency Data



Sampling PMU Frequency Data and Fast Fourier Transformation (FFT)

• Transforming Frequency Data to FFT Data

• 23 samples of Frequency Data were taken from the PMU at different times

• The FFT was computed for each sample

• Each FFT was standardized by setting the max value to 1

• The following slides show the standardized FFT for the various time samples
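The FFT standardization described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `standardized_fft`, the 30 samples/s reporting rate, the synthetic signal, and the mean-removal step (so the nominal 60 Hz offset does not dominate the spectrum) are all assumptions.

```python
import numpy as np

def standardized_fft(samples, sample_rate_hz):
    """Return (frequencies, magnitudes) with the largest magnitude scaled to 1."""
    samples = np.asarray(samples, dtype=float)
    # Assumption: subtract the mean so the constant offset does not dominate.
    centered = samples - samples.mean()
    magnitudes = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(centered), d=1.0 / sample_rate_hz)
    peak = magnitudes.max()
    if peak > 0:
        magnitudes = magnitudes / peak  # "set the max value to 1"
    return freqs, magnitudes

# Synthetic stand-in for one PMU frequency sample (hypothetical values):
# 60 s of data at 30 samples/s with a small 0.5 Hz oscillation around 60 Hz.
rate = 30.0
t = np.arange(0, 60, 1.0 / rate)
freq_signal = 60.0 + 0.02 * np.sin(2 * np.pi * 0.5 * t)

freqs, mags = standardized_fft(freq_signal, rate)
dominant = freqs[np.argmax(mags)]  # frequency of the strongest component
```

Scaling every spectrum to a peak of 1 makes samples taken at different times comparable in shape regardless of their absolute energy.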



[Figure: FFT at Various Times (1 of 4). X-axis: frequency; Y-axis: magnitude.]



Time Series Representation and Similarity Measure

• Transforming FFT Data into FFT Bins

• For each preceding sample, FFT frequencies are discretized into 25 bins

• For each bin, the mean and the sum are calculated

• A correlation matrix compares the corresponding event and control frequency bins
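One way to read the binning and correlation steps above is sketched below. The slides do not say how the bins were formed or how the correlation matrix was assembled, so the equal-width contiguous bins, the use of per-bin means, the equal number of control and event samples, and every name and random stand-in array here are assumptions.

```python
import numpy as np

def bin_fft(magnitudes, n_bins=25):
    """Split a magnitude spectrum into n_bins contiguous bins.

    Returns the per-bin mean and per-bin sum.
    """
    chunks = np.array_split(np.asarray(magnitudes, dtype=float), n_bins)
    means = np.array([c.mean() for c in chunks])
    sums = np.array([c.sum() for c in chunks])
    return means, sums

# Hypothetical stand-ins: each row is one standardized FFT spectrum.
rng = np.random.default_rng(0)
control_spectra = rng.random((10, 500))
event_spectra = rng.random((10, 500))

# One row of 25 bin means per sample.
control_bins = np.array([bin_fft(s)[0] for s in control_spectra])
event_bins = np.array([bin_fft(s)[0] for s in event_spectra])

# Correlate every control bin with every event bin across samples.
# corrcoef stacks the 25 + 25 bin columns into one 50 x 50 matrix;
# the upper-right block holds the control-vs-event correlations.
full = np.corrcoef(control_bins, event_bins, rowvar=False)
cross_corr = full[:25, 25:]
```

Entry `cross_corr[i, j]` is the correlation between control bin `i` and event bin `j`; the diagonal compares corresponding bins.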



FFT Correlation Matrix

[Figure: FFT correlation matrices for the Control Group and the Event Group]



Simple Anomaly Detection

• Benford’s Law

• Also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time

• Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution
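The first-digit percentages quoted above follow from the Benford probability P(d) = log10(1 + 1/d). The sketch below compares that expectation with the observed first-digit frequencies of a data set; the helper names and the powers-of-two example data are illustrative assumptions, not PMU measurements.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading nonzero decimal digit of a nonzero finite number."""
    # Scientific notation always starts with the leading digit.
    return int(f"{abs(x):.15e}"[0])

def first_digit_distribution(values):
    """Observed share of each leading digit 1-9, ignoring zeros and non-finite values."""
    digits = [first_digit(v) for v in values if v != 0 and math.isfinite(v)]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

expected = benford_expected()

# Illustrative data only: powers of two are a classic Benford-like sequence.
observed = first_digit_distribution([2 ** k for k in range(1, 201)])

# Largest gap between observed and expected shares, a crude anomaly score.
max_gap = max(abs(observed[d] - expected[d]) for d in range(1, 10))
```

A large gap between the observed and expected distributions is the kind of simple signal that could flag a data segment for closer inspection.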



[Figure: Benford Distribution Between Compressed and Uncompressed Data. X-axis: first digit (1 to 9); Y-axis: share of values (0% to 40%); series: Benford, Uncompressed, Compressed.]



Benford Distribution Between Control and Event

[Figure: first-digit distribution. X-axis: first digit (1 to 9); Y-axis: share of values (0% to 40%); series: Control, Event.]



Next Steps

• Alternate time series representation and dimensionality reduction

• Discrete wavelet transform

• Discrete Haar Wavelet Transform (DWT)

• Adaptive Piecewise Constant Approximation

• Symbolic Aggregate Approximation (SAX) representation
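As a rough illustration of the SAX representation listed above: the series is z-normalized, reduced with Piecewise Aggregate Approximation (PAA), and each segment mean is mapped to a letter using breakpoints that cut the standard normal distribution into equiprobable regions. The alphabet size of 4, the segment count, and the function names are assumptions chosen for the example.

```python
import numpy as np

# Breakpoints splitting N(0, 1) into 4 equiprobable regions (alphabet size 4).
BREAKPOINTS_4 = np.array([-0.6745, 0.0, 0.6745])

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment.

    Assumes len(series) is divisible by n_segments.
    """
    series = np.asarray(series, dtype=float)
    return series.reshape(n_segments, -1).mean(axis=1)

def sax(series, n_segments, breakpoints=BREAKPOINTS_4, alphabet="abcd"):
    """Convert a series to a SAX word of n_segments letters."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    # z-normalize; a constant series maps to all zeros.
    z = (series - series.mean()) / std if std > 0 else np.zeros_like(series)
    segment_means = paa(z, n_segments)
    # Index of the region each segment mean falls into.
    symbols = np.searchsorted(breakpoints, segment_means)
    return "".join(alphabet[i] for i in symbols)

# A steadily rising ramp of 16 points reduced to a 4-letter word.
word = sax(np.arange(16), n_segments=4)
```

The resulting word is a compact symbolic stand-in for the series, which is what makes hashing and string-based algorithms applicable.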

Presentation Notes

Possibly PCA (Principal Component Analysis): an orthogonal transformation that reduces the data representation (feature reduction).

FFT has shown great promise in initial results for dimensionality reduction and representation. However, no one representation can be superior for all tasks, so we will experiment with three major ones to gain insight into what the best representation for PMU data might be.

One important difference is that wavelets are localized in time, i.e., some of the wavelet coefficients represent small, local subsections of the data being studied. This is in contrast to Fourier coefficients, which always represent a global contribution to the data. This property is very useful for multiresolution analysis of the data. The first few coefficients contain an overall, coarse approximation of the data; additional coefficients can be imagined as “zooming in” to areas of high detail.

Recently, there has been an explosion of interest in using wavelets for data compression, filtering, analysis, and other areas where Fourier methods have previously been used. Chan and Fu (1999) produced a breakthrough for time series indexing with wavelets by producing a distance measure defined on wavelet coefficients that provably satisfies the lower bounding requirement. The work is based on a simple but powerful type of wavelet known as the Haar wavelet. The Discrete Haar Wavelet Transform (DWT) can be calculated efficiently, and an entire dataset can be indexed in O(mn).

The DWT does have some drawbacks, however. It is only defined for sequences whose length is an integral power of two. Although much work has been undertaken on more flexible distance measures using the Haar wavelet (Huhtala et al., 1995; Struzik and Siebes, 1999), none of those techniques are indexable.
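The Haar transform and its power-of-two restriction mentioned in these notes can be illustrated with a short sketch. It uses the orthonormal form of the Haar wavelet, which preserves Euclidean distance, so a distance computed on only the first few (coarsest) coefficients never exceeds the true distance. The function name and example data are assumptions, and this is not the indexing scheme of Chan and Fu itself.

```python
import numpy as np

def haar_dwt(series):
    """Full orthonormal Haar decomposition, coefficients ordered coarse to fine.

    Raises ValueError unless the length is a power of two.
    """
    data = np.asarray(series, dtype=float)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    approx = data
    while len(approx) > 1:
        pairs = approx.reshape(-1, 2)
        # Scaled differences capture local detail at this resolution.
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
        # Scaled sums form the coarser approximation for the next level.
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
        details.insert(0, detail)
    return np.concatenate([approx] + details)

# Two illustrative series of length 8.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 9.0])

cx, cy = haar_dwt(x), haar_dwt(y)
true_dist = np.linalg.norm(x - y)
# Distance on the first 4 coefficients only: a lower bound on true_dist.
reduced_dist = np.linalg.norm(cx[:4] - cy[:4])
```

Keeping only the leading coefficients gives the coarse approximation described in the notes, while the lower-bounding property is what allows an index to prune candidates without false dismissals.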



Time Series Data Mining

• Pattern Discovery and Clustering for:

• Motif discovery – K-motif detection

• Anomaly detection or finding discords

• Distance-based Clustering

• Self Organizing Map (SOM)

• Multi-resolution Clustering (MPAA)

• ARIMA EM-Clustering

• Hidden Markov Model (HMM)

• Motif-based clustering



Classification

• Descriptive techniques

• Supervised learning - maps data into predefined categories/classes

• Nearest Neighbor classifier

• Applies the similarity measures to the object to be classified to determine its best classification based on the existing data that has already been classified

• Decision trees

• A set of rules is inferred from the training data, and this set of rules is then applied to any new data to be classified
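The nearest neighbor classifier described above can be sketched in a few lines. Euclidean distance is used here as the similarity measure purely as an assumption, and the "control"/"event" labels and the tiny training set are hypothetical.

```python
import numpy as np

def nearest_neighbor_classify(query, train_series, train_labels):
    """Label a query series with the class of its closest training series."""
    train = np.asarray(train_series, dtype=float)
    q = np.asarray(query, dtype=float)
    # Assumption: Euclidean distance serves as the similarity measure.
    distances = np.linalg.norm(train - q, axis=1)
    return train_labels[int(np.argmin(distances))]

# Hypothetical, already-classified feature vectors (e.g. FFT bin means).
train_series = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.1, 0.0, 0.1],
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.9, 0.9, 0.7],
]
train_labels = ["control", "control", "event", "event"]

predicted = nearest_neighbor_classify([0.85, 0.8, 0.8, 0.8], train_series, train_labels)
```

Any other distance, such as one defined on SAX words or wavelet coefficients, could be swapped in for the Euclidean norm without changing the structure of the classifier.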



Clustering

[Figure: clustering examples for K-means and Hierarchical Clustering]



Scalability

• From one to multiple PMUs – multivariate time series mining

• Sub-second data collection and processing



Analytics Architecture

• OSIsoft PI server direct export to Hadoop

• Hadoop & myHadoop with Mahout

• KNIME batch job

• Revolution Analytics R libraries for “Big Data”

• Once the models are trained, (near) real-time scoring can be implemented on the sensor streams, enabling a large prediction window horizon

Presentation Notes

The increasing level of interest in indexing and mining time series data has produced many algorithms and representations. However, with few exceptions, the size of datasets considered, indexed, and mined seems to have stalled at the megabyte level. At the same time, improvements in our ability to capture and store data have led to the proliferation of terabyte-plus time series datasets.

In this work, we show how a novel multiresolution symbolic representation called indexable Symbolic Aggregate approXimation (iSAX) can be used to index datasets which are several orders of magnitude larger than anything else considered in current literature. The iSAX approach allows for both fast exact search and ultra fast approximate search. Beyond mere similarity search, we show how to exploit the combination of both types of search as sub-routines in data mining algorithms, permitting the exact mining of truly massive datasets, with millions of time series, occupying up to a terabyte of disk space.

Our approach is based on a modification of the SAX representation to allow extensible hashing [12]. In essence, we show how we can modify SAX to be a multiresolution representation, similar in spirit to wavelets. It is this multiresolution property that allows us to index time series with zero overlap at leaf nodes [2], unlike R-trees and other spatial access methods. As we shall show, our indexing technique is fast and scalable due to intrinsic properties of the iSAX representation. Because of this, we do not require the use of specialized databases or file managers. Our results, conducted on massive datasets, are all achieved using a simple tree structure which simply uses the standard Windows XP NTFS file system for disk access. While it might have been possible to achieve faster times with a sophisticated DBMS, we feel that the simplicity of this approach is a great strength, and will allow easy adoption, replication, and extension of our work.

A further advantage of our representation is that, being symbolic, it allows the use of data structures and algorithms that are not well defined for real-valued data, including suffix trees, hashing, Markov models, etc. [12]. Furthermore, given that iSAX is a superset of classic SAX, the several dozen research groups that use SAX will be able to adopt iSAX to improve scalability [11].

The rest of the paper is organized as follows. In Section 2 we review related work and background material. Section 3 introduces the iSAX representation, and Section 4 shows how it can be used for approximate and exact indexing. In Section 5 we perform a comprehensive set of experiments on both indexing and data mining problems. Finally, in Section 6 we offer conclusions and suggest directions for future work.



Thank you!

For further information, contact:

• Chuck Wells [email protected]

• Natasha Balac [email protected]
