PMU Time Series Data Mining - WECC PMU Data for JSIS
SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA; SAN DIEGO
Natasha Balac, Ph.D.
Chuck Wells, Ph.D.
Nicole Wolter
Albert Nguyen
Jake Schurmeier
Predictive Analytics Center of Excellence (PACE) San Diego Supercomputer Center University of California, San Diego
PMU Time Series Data Mining
Brief History of SDSC
• 1985-1997: NSF national supercomputer center; managed by General Atomics
• 1997-2007: NSF PACI program leadership center; managed by UCSD
• PACI: Partnerships for Advanced Computational Infrastructure
• 2007-2009: Internal transition to support more diversified research computing
• Still NSF national “resource provider”
• 2009-future: Multi-constituency cyberinfrastructure (CI) center
• Provide data-intensive CI resources, services, and expertise for campus, state, and nation
• Approaching $1B in lifetime contract and grant activity
Gordon Speeds and Feeds
INTEL SANDY BRIDGE COMPUTE NODE
• Sockets & cores: 2 & 16
• Clock speed: 2.6 GHz
• DRAM capacity and speed: 64 GB, 1,333 MHz
INTEL 710 eMLC FLASH I/O NODE
• NAND flash SSD drives: 16
• SSD capacity per drive & per node: 16 * 300 GB = 4.8 TB
SMP SUPER-NODE (VIA VSMP)
• Compute nodes / I/O nodes: 32 / 2
• Addressable DRAM: 2 TB
• Addressable memory including flash: 11.6 TB
GORDON (AGGREGATE)
• Compute nodes: 1,024
• Compute cores: 16,384
• Peak performance: 341 TF
• DRAM/SSD memory: 64 TB DRAM; 300 TB SSD
INFINIBAND INTERCONNECT
• Architecture: Dual-rail, 3D torus
• Link bandwidth: QDR
• Vendor: Mellanox
LUSTRE-BASED DISK I/O SUBSYSTEM (SHARED)
• Total storage (current/planned): 4 PB/6 PB (raw)
• Total bandwidth: 100 GB/s
PMU Frequency Data
Sampling PMU Frequency Data and Fast Fourier Transform (FFT)
• Transforming Frequency Data to FFT Data
• 23 samples of frequency data were taken from the PMU at different times
• The FFT was computed for each sample
• Each FFT was standardized by scaling it so that its maximum value is 1
• The following slides show the standardized FFT for the various time samples
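The steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the presenters' code: the function name `standardized_fft`, the 30 samples-per-second rate, the synthetic signal, and the mean removal before the transform are all hypothetical choices made for the example.

```python
import numpy as np

def standardized_fft(samples, sample_rate_hz):
    """Return (frequencies, magnitudes) with the largest magnitude scaled to 1."""
    samples = np.asarray(samples, dtype=float)
    # Assumption: remove the mean so the constant (near-nominal) level
    # does not dominate the spectrum.
    centered = samples - samples.mean()
    magnitudes = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate_hz)
    peak = magnitudes.max()
    if peak > 0:
        magnitudes = magnitudes / peak  # "setting the max value to 1"
    return freqs, magnitudes

# Synthetic stand-in for one PMU frequency sample:
# a 60 Hz nominal level plus a small 0.5 Hz oscillation.
rate = 30.0                      # hypothetical reporting rate (samples/s)
t = np.arange(1800) / rate       # 60 seconds of data
signal = 60.0 + 0.01 * np.sin(2 * np.pi * 0.5 * t)

freqs, mags = standardized_fft(signal, rate)
dominant_hz = freqs[np.argmax(mags)]
print(f"dominant component: {dominant_hz:.3f} Hz")
```

Scaling by the maximum makes spectra from different time samples comparable in shape even when their absolute amplitudes differ.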
FFT at Various Times (1 of 4 through 4 of 4)
[Figures: standardized FFT plots for the time samples. X-axis: frequency; Y-axis: magnitude]
Time Series Representation and Similarity Measure
• Transforming FFT Data into FFT Bins
• For each preceding sample, the FFT frequencies are discretized into 25 bins
• For each bin, the mean and the sum are calculated
• A correlation matrix compares the corresponding event and control frequency bins
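One possible reading of the binning and correlation steps is sketched below. It is a minimal illustration with assumptions labeled: equal-width bins via `np.array_split`, random synthetic spectra, a hypothetical 12/11 split between control and event samples, and a Pearson correlation between per-sample bin-mean profiles. The slides do not specify how the bins or the correlation were actually computed.

```python
import numpy as np

def bin_features(magnitudes, n_bins=25):
    """Split a spectrum into n_bins contiguous bins; return per-bin means and sums."""
    bins = np.array_split(np.asarray(magnitudes, dtype=float), n_bins)
    means = np.array([b.mean() for b in bins])
    sums = np.array([b.sum() for b in bins])
    return means, sums

rng = np.random.default_rng(0)
# Synthetic stand-ins: each row is the FFT magnitude spectrum of one sample.
control_spectra = rng.random((12, 500))   # hypothetical control group
event_spectra = rng.random((11, 500))     # hypothetical event group

control_means = np.array([bin_features(s)[0] for s in control_spectra])
event_means = np.array([bin_features(s)[0] for s in event_spectra])

# Stack the 25-bin profiles and correlate every sample with every other one.
# Entry (i, j) is the Pearson correlation between the profiles of samples i and j.
profiles = np.vstack([control_means, event_means])
corr = np.corrcoef(profiles)
print(corr.shape)
```

With real data, blocks of high correlation within the control rows or within the event rows would suggest that the binned spectra separate the two groups.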
FFT Correlation Matrix
[Figure: correlation matrix with regions labeled Control Group and Event Group]
Simple Anomaly Detection
• Benford's Law
• Also called the First-Digit Law, it describes the frequency distribution of leading digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 is the first digit less than 5% of the time
• Benford's Law also gives the expected distribution for digits beyond the first, which approaches a uniform distribution
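The first-digit probabilities come from the standard formula P(d) = log10(1 + 1/d) for d = 1..9. The sketch below computes that expected distribution and compares it with the observed leading digits of a data set. The helper names and the powers-of-two test data are illustrative assumptions, not part of the original analysis.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading nonzero decimal digit of a nonzero number."""
    # Scientific notation always starts with the leading digit.
    return int(f"{abs(x):.15e}"[0])

def first_digit_distribution(values):
    """Observed share of each leading digit 1..9, ignoring zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Illustrative data: powers of two are a classic Benford-like sequence.
values = [2 ** k for k in range(1, 201)]
observed = first_digit_distribution(values)
expected = benford_expected()

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}  expected {expected[d]:.3f}")

# A large gap between observed and expected shares flags a possible anomaly.
max_gap = max(abs(observed[d] - expected[d]) for d in range(1, 10))
```

For anomaly detection, the same comparison would be run on the measured values of interest, with the tolerated gap chosen empirically.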
Benford Distribution Between Compressed and Uncompressed Data
[Figure: first-digit frequency (0% to 40%) for first digits 1 to 9. Series: Benford, Uncompressed, Compressed]
Benford Distribution Between Control and Event
[Figure: first-digit frequency (0% to 40%) for first digits 1 to 9. Series: Control, Event]
Next Steps
• Alternate time series representation and dimensionality reduction
• Discrete wavelet transform
• Discrete Haar Wavelet Transform (DTWT)
• Adaptive Piecewise Constant Approximation
• Symbolic Aggregate Approximation (SAX) representation
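As an illustration of one of the listed representations, here is a minimal SAX sketch. It follows the usual recipe: z-normalize the series, reduce it with Piecewise Aggregate Approximation (PAA), then map each segment mean to a letter using breakpoints that cut the standard normal distribution into equal-probability regions. The function names, segment count, and four-letter alphabet are assumptions chosen for the example.

```python
import numpy as np
from statistics import NormalDist

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each of n_segments chunks."""
    chunks = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([c.mean() for c in chunks])

def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length n_segments."""
    series = np.asarray(series, dtype=float)
    std = series.std()
    # z-normalize; a constant series maps to all zeros.
    z = (series - series.mean()) / std if std > 0 else np.zeros_like(series)
    reduced = paa(z, n_segments)
    k = len(alphabet)
    # Breakpoints split the standard normal into k equal-probability regions.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    indices = np.searchsorted(breakpoints, reduced)
    return "".join(alphabet[i] for i in indices)

# A steadily rising series should produce a rising word.
word = sax(np.arange(16.0), n_segments=4)
print(word)
```

The resulting short strings can be compared, hashed, or searched far more cheaply than the raw series, which is the point of using SAX for dimensionality reduction.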
Time Series Data Mining
• Pattern Discovery and Clustering for:
• Motif discovery – K-motif detection
• Anomaly detection or finding discords
• Distance-based Clustering
• Self Organizing Map (SOM)
• Multi-resolution Clustering (MPAA)
• ARIMA EM-Clustering
• Hidden Markov Model (HMM)
• Motif-based clustering
Classification
• Descriptive techniques
• Supervised learning - maps data into predefined categories/classes
• Nearest Neighbor classifier
• Applies the similarity measures to the object to be classified to determine its best classification based on the existing data that has already been classified
• Decision trees
• A set of rules is inferred from the training data, and this set of rules is then applied to any new data to be classified
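The nearest neighbor idea can be shown in a few lines. This sketch assumes Euclidean distance as the similarity measure and uses made-up three-value feature vectors with hypothetical "control" and "event" labels. In practice the features might be binned FFT values, and another distance could be substituted.

```python
import numpy as np

def nearest_neighbor_label(query, train_X, train_y):
    """Return the label of the training example closest to query (1-NN)."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every already-classified example.
    distances = np.linalg.norm(train_X - query, axis=1)
    return train_y[int(np.argmin(distances))]

# Made-up, already-classified feature vectors.
train_X = [
    [0.1, 0.2, 0.1],
    [0.2, 0.1, 0.2],
    [0.9, 0.8, 0.9],
    [0.8, 0.9, 0.7],
]
train_y = ["control", "control", "event", "event"]

label = nearest_neighbor_label([0.85, 0.80, 0.80], train_X, train_y)
print(label)
```

No model is trained here: all the work happens at classification time, so the cost grows with the amount of labeled data.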
Clustering
[Figure: illustrations of K-means and hierarchical clustering]
Scalability
• From one to multiple PMUs – multivariate time series mining
• Sub-second data collection and processing
Analytics Architecture
• OSIsoft PI server direct export to Hadoop
• Hadoop & myHadoop with Mahout
• KNIME batch job
• Revolution Analytics R libraries for “Big Data”
• Once the models are trained, (near) real-time scoring can be implemented on the sensor streams, enabling a large prediction window horizon
Thank you!
For further information, contact:
• Chuck Wells [email protected]
• Natasha Balac [email protected]