machine learning for predictive maintenance on wind turbines1420733/fulltext01.pdf · 2020. 3....

Linköpings universitetSE–581 83 Linköping+46 13 28 10 00 , www.liu.se

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

202020 | LIU-IDA/LITH-EX-A--2020/008--SE

Machine Learning for PredictiveMaintenance on Wind Turbines– Using SCADA Data and the Apache Hadoop EcosystemBehovsstyrt Underhåll av Vindkraftverk med Maskininlärning iApache Hadoop

John Eriksson

Supervisor : Rouhollah MahfouziExaminer : Martin Sjölund

External supervisor : Fredrik Eklund

http://www.liu.se

Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publicer-ingsdatum under förutsättning att inga extraordinära omständigheter uppstår.Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior förenskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Över-föring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användningav dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och till-gängligheten finns lösningar av teknisk och administrativ art.Upphovsmannens ideella rätt innefattar rätt att bli nämnd somupphovsman i den omfattning som godsed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet än-dras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannenslitterära eller konstnärliga anseende eller egenart.För ytterligare information om Linköping University Electronic Press se förlagets hemsidahttp://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for aperiod of 25 years starting from the date of publication barring exceptional circumstances.The online availability of the document implies permanent permission for anyone to read, to down-load, or to print out single copies for his/hers own use and to use it unchanged for non-commercialresearch and educational purpose. Subsequent transfers of copyright cannot revoke this permission.All other uses of the document are conditional upon the consent of the copyright owner. The publisherhas taken technical and administrative measures to assure authenticity, security and accessibility.According to intellectual property law the author has the right to be mentioned when his/her work isaccessed as described above and to be protected against infringement.For additional information about the Linköping University Electronic Press and its proceduresfor publication and for assurance of document integrity, please refer to its www home page:http://www.ep.liu.se/.

© John Eriksson

http://www.ep.liu.se/

http://www.ep.liu.se/

Abstract

This thesis explores how to implement a predictive maintenance system for wind turbinesin Apache Spark using SCADA data. How to balance and scale the data set is evaluated,together with the effects of applying the algorithms available in Spark mllib to the givenproblem. These algorithms include Multilayer Perceptron (MLP), Linear Regression (LR),Random Forest (RF), Decision Tree (DT), Support Vector Machine (SVM) and GradientBoosted Tree (GBT). This thesis also evaluates the effects of applying stacking and baggingalgorithms in an attempt to decrease the variance and improve the metrics of the model.It is found that the MLP produces the most promising model for predicting failures on thegiven data set and that stacking multiple MLP models is a good way of producing a modelwith a lower variance than the individual base models. In addition to this, a function thatcreates a savings estimation is developed. Using this function, a time window functionthat explores the decisiveness of a model is created. The conclusion is made that a modelis more decisive if the failure it predicts occurs in a turbine where it has been trained onfailure data from that same component, indicating that there are unknown variables thataffect the sensor data.

Acknowledgments

I would like to acknowledge my supervisor Fredrik Eklund at Attentec for providing valuableideas about how to proceed with the research and what to prioritize when being stuck at across roads, as well as for providing valuable feedback regarding the report.

I would also like to acknowledge the work of my supervisor Rouhollah Mahfouzi and myexaminer Martin Sjölund at LiU for helping me improve and finalize this thesis.

v

Contents

Abstract iii

Acknowledgments v

Contents vii

List of Figures ix

List of Tables xi

1 Introduction 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Theory 52.1 Literature Study Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Wind Turbines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Wind Turbine Failure Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.4 SCADA Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.5 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.6 Big Data Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.7 Machine Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.8 Predictive Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.9 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.10 Model Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.11 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.12 Homogeneous Ensemble Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 222.13 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Method 253.1 Data Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.3 Model Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293.4 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.5 Final System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Results 474.1 Data Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.3 Model Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484.4 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

vii

4.5 Kafka Based Prediction System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5 Discussion 535.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555.3 The work in a wider context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6 Conclusion 576.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Bibliography 61

viii

List of Figures

2.1 Wind turbine component scheme. Retrieved from energy.gov. Image is work inthe public domain according to EERE copyright policy. . . . . . . . . . . . . . . . . 6

2.2 Subsystem downtime per turbine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3 Subsystem failure rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.4 Monolithic SCADA architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.5 Distributed SCADA architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.6 Networked SCADA architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.7 Big Data Life Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.8 MapReduce word count example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.9 Bias variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1 Amb_Temp_Avg over time. This variable describe the measured average ambienttemperature. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2 Prod_LatestAvg_TotReactPwr over time. This variable describe the measured to-tal reactive power produced. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3 Scaler comparison for a feature with no outliers, using the data from theAmb_Temp_Avg variable. 3.3a, 3.3b and 3.3c illustrate the data distribu-tion of this variable when scaled using the MinMaxScaler, StandardScaler andPowerTransformer-Yeo-Johnson respectively. . . . . . . . . . . . . . . . . . . . . . . . . 30

3.4 Scaler comparison for features with no outliers, using the data from theProd_LatestAvg_TotReactPwr variable. 3.4a, 3.4b and 3.4c illustrate the data dis-tribution of this variable when scaled using the MinMaxScaler, StandardScaler andPowerTransformer-Yeo-Johnson respectively. . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5 Visualization of precision, sensitivity and specificity metrics in relation to the ratiobetween positive and negative samples in training data . . . . . . . . . . . . . . . . 37

3.6 Visualization of decision tree based bagging algorithm performance . . . . . . . . . 423.7 Visualization of multilayer perceptron based stacking algorithm performance . . . 433.8 System that creates a warning Kafka topic using the prediction models . . . . . . . 46

4.1 Output from Kafka producer that creates a stream of turbine sensor measurements 51

ix

https://www.energy.gov

https://www.energy.gov/about-us/web-policies


List of Tables

2.1 Related work metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1 Turbine failure summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.2 Baseline theory about turbine failures with respect to training and testing data . . 273.3 Costs for component operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323.4 50-50 ratio with normalizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.5 50-50 ratio with standardizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.6 60-40 ratio with normalizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.7 60-40 ratio with standardizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.8 70-30 ratio with normalizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.9 70-30 ratio with standardizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.10 75-25 ratio with normalizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.11 75-25 ratio with standardizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.12 80-20 ratio with normalizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353.13 80-20 ratio with standardizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.14 90-10 ratio with normalizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.15 90-10 ratio with standardizing scaler . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.16 PCA evaluation using data with 70-30 ratio . . . . . . . . . . . . . . . . . . . . . . . 383.17 Algorithm metrics for models produced by the hyperparameter evaluation . . . . 383.18 Cross-validation comparison using two and five folds . . . . . . . . . . . . . . . . . 393.19 Bagging model evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.20 Stacking model evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.21 Savings evaluations for the gearbox component . . . . . . . . . . . . . . . . . . . . 44

4.1 Metrics for training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.2 Metrics for testing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494.3 Cost estimation for all components using unfiltered predictions . . . . . . . . . . . 504.4 Cost estimation for all components using window filtered predictions . . . . . . . 504.5 Outcome of model predictions of turbine failures with respect to training and test-

ing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.6 Output from Kafka consumer that displays gearbox predictions . . . . . . . . . . . 52

xi

1 Introduction

This section describes the motivation and concepts discussed in this report, as well as theresearch questions and delimitations.

1.1 Motivation

Renewable energy sources play an increasingly important role in the global energy mix, asthe effort to reduce the environmental impact of energy production increases. Out of all therenewable energy alternatives, wind energy is the most developed technology worldwidewith over 597GW capacity in 2018 [1].

Over an estimated wind turbine life span of 20 years, it is estimated that the cumulativeoperation and maintenance costs is 65-90% of the total investment cost. These costs includecrane costs and inflation rates. The lower estimate is based on the Danish fleet of 600kWwind turbines while the higher estimate is based on 600kW-750kW machines located in NorthAmerica [2]. In another perspective, maintenance costs is estimated to constitute 20-25% ofthe levelized cost per kWh for wind turbines [3]. It is clear that operation and maintenancecosts has an impact on the profitability of the wind farm and on the competitiveness of windturbines compared to other green energy alternatives. However, this also means that there isgreat room for improvement using new technologies.

The U.S Department of Energy have put together a guide to achieving operational efficiencywhere maintenance practices are explained [4]. They define maintenance as either proac-tive or reactive, where the aim of proactive maintenance is to correct the error before failureoccurs, while reactive maintenance reacts only on errors or failures. They claim that eventhough reactive maintenance has low running costs it usually leads to increased costs dueto unplanned downtime. Proactive maintenance can be grouped into two sub groups: pre-ventive and predictive [4]. Preventive maintenance means that maintenance is performed ona time-based schedule. This leads to an increased product component life cycle in most ofthe cases and an estimated 12-18% lower cost compared to reactive maintenance. Predictivemaintenance uses sensor information and analysis methods to measure and predict degrada-tion and future component capability. The idea behind predictive maintenance is that failure

1

1. INTRODUCTION

patterns are predictable. If the time when a component will fail can be predicted accuratelyand the component is replaced before it fails, the costs of operation and maintenance will bemuch lower. Predictive maintenance leads to an estimated 8-12% cost saving compared topreventive maintenance as it’s less labor intensive [4].

There are many reasons for why proactive, and especially predictive maintenance, leads tolower maintenance costs, such as:

• Wind turbines are often located in remote locations and downtime can last for daysbefore the required spare parts reach their destination.

• Errors classified as a major failures, meaning that the failure has an associated down-time greater than one day, constitute 25% of the amount of errors but are responsiblefor 95% of the downtime.

• Not only can predictive maintenance reduce the amount of failures by correcting errors,it can also reduce the amount of redundant hours being spent on routine controls ormaintenance of well functioning components.

With recent advances in technologies related to internet of things, predictive maintenance isstarting to become the norm for industrial equipment monitoring. Still, about 30% of all in-dustrial equipment does not benefit from predictive maintenance technologies and insteadrelies on periodic inspections to detect anomalies. A study on predictive maintenance tech-niques by Hashemian et al. put these numbers into the perspective of common failure modelsand presented the conclusion that predictive maintenance is preferred in 89% of the cases [5].Another argument for monitoring presented in the same study, is that SKF Group, a lead-ing manufacturer and supplier of bearings and condition monitoring systems, stress testedbearings and measured the time to failure. They presented data that showed a seemingly uni-formly distributed failure pattern. In addition, the study displayed a high range in durabilitywith some of the 30 bearings lasting fewer than 15 hours and one lasting for 300 hours. Sincebearings are a key component in wind turbines, this is an indication that monitoring usingsensors is crucial for predictive maintenance in wind turbines.

1.2 Aim

The main objective of this thesis project is to enhance the current body of knowledge re-garding predictive maintenance on wind turbines by implementing a big data analysis ona real world data set. The evaluation is done by comparing the implications of using thechosen techniques for data analysis used in this report with the respective data from relatedresearch. All machine learning algorithms available in Apache Spark are considered and theimplications of applying stacking and bagging to these algorithms are evaluated. The pro-duced models are compared with the models presented in related research with regards toimplementation, run time, usability and accuracy. The researched techniques were chosen inpart due to discoveries being made when studying research on predictive maintenance forwind turbines and in part when studying research on big data analysis methods.

1.3 Research questions

1. Can a system for predictive maintenance in wind turbines be implemented usingApache Spark?

2. Will a model created by applying bagging or stacking algorithms on several base mod-els perform better than the base models?

2

1.4. Delimitations

3. Which algorithms available in Spark are eligible for stacking and bagging?

4. How does the final solution compare to the current state of the art predictive mainte-nance for wind turbines with regards to implementation, run time, usability and accu-racy metrics?

1.4 Delimitations

This is a study that focuses on the Hadoop Ecosystem and Spark, in particular. Inherently,Spark does not support algorithms that consider time series, meaning that the models pro-duced in this study have both been trained on, as well as make predictions considering onlyrow of data points at the time.

3

1. INTRODUCTION

4

2 Theory

This chapter describes the theory necessary to understand the thesis report as well as relevantbackground that has influenced this research. First, Section 2.1 describe how information wasfound in order to make it easier for the reader to find relevant literature for a similar study.In order to understand, validate and prepare the data for the algorithms, some knowledgeof the turbine construction and its components is required. Therefore, section 2.2 presentsthe necessary theory regarding how wind turbines are designed and function with regards tocomponents and sensors and is followed by section 2.3 that connects this theory to wind tur-bine failure modes. Section 2.4 describes what SCADA data is and gives a historical overviewon how the architecture of SCADA systems have evolved to the cloud based systems that arebeing developed today. This is followed by section 2.5 and 2.6 that present what big data isfollowed by an overview of the tools used to work with big data in this study. Following thisis section 2.7 that presents descriptions of the algorithms that have been used and comparedfor the predictive models in this study. Lastly, related work on predictive maintenance withregards to its relation with SCADA, big data and machine learning is presented in section2.13.

2.1 Literature Study Method

Search terms included predictive maintenance, wind turbines, big data frameworks, machine learn-ing, failure modes, SCADA, Hadoop, Spark and combinations of these. The platforms used forknowledge discovery was Science Direct, Research Gate, Google Scholar, Springer and IEEEex-plore.

A paper was read in its entirety if the abstract indicated that it would answer one of thefollowing questions.

• How is the Hadoop Ecosystem being used for predictive maintenance?

• Which machine learning algorithms and methods are used for state of the art predictivemaintenance?

• How is SCADA data being used for predictive maintenance?

5

2. THEORY

Figure 2.1: Wind turbine component scheme. Retrieved from energy.gov. Image is work inthe public domain according to EERE copyright policy.

• What does the fault process for wind turbine components look like?

2.2 Wind Turbines

The majority of wind turbines are horizontal axis wind turbines, as opposed to vertical axisturbines, and that is also the type of wind turbine considered in this study.

2.2.1 Design and Components

There are many variations and many schematics that can be applied to wind turbines. Theschematic described here captures the significant components needed to create a referenceguide throughout the paper.

Figure 2.1 provides an illustration of how the majority of the components described in thefollowing list are placed in the turbine.

• Tower

Made from tubular steel, concrete, or steel lattice. Supports the structure of the turbine.Because wind speed increases with height, taller towers enable turbines to capture moreenergy and generate more electricity.

• Blades

The blades are shaped to create a pressure differential when air moves across them,causing them to lift in the upwards direction relative to the blade. Most wind tur-bines have three blades for several reasons. A turbine with two blades is prone to aphenomenon called gyroscopic precession, causing a wobbly motion and unnecessarystress on the components. Four blades or more would increase the torque, but also theloads on the tower, the wind resistance and the cost, making the wind turbine less costeffective [6].

• Rotor and Pitch System

The rotor is where the blades are connected to the hub. Inside the hub resides a pitchsystem that can control the pitch of the blades to control the rotor speed.

6

https://www.energy.gov



2.2. Wind Turbines

• Brakes

A mechanical disk brake designed to stop the rotor in case of an emergency. The me-chanical brake is used in case of failure of the aerodynamic brake, or during a turbineservice.

• Low-speed Shaft

The low-speed shaft connects the hub to the gearbox. It rotates at 20-60 rpm dependingon the turbine model. The pipes for the hydraulics system that enables the aerodynamicbrakes are contained in the low-speed shaft.

• High-speed Shaft

The high-speed shaft rotates at about 1,000 to 1,800 rpm depending on the model andthe current wind speed. The reason for this is because the electrical generator requiresa high rotational speed to produce electricity.

• Gearbox

The gearbox connects the low-speed shaft which is also known as the main shaft, tothe high-speed shaft and increases the rotational speed of the high-speed shaft. Thegearbox is an expensive and heavy part of the turbine.

• Generator

Generates AC or DC depending on the type.

• Yaw System

Consists of the yaw drive and the yaw motor, as well as the anemometer and the wheelsand pinions needed to drive the component. The yaw system aligns the wind turbineproperly using wind speed information from the anemometer. This is necessary in anupwind turbine, however a downwind turbine will achieve this naturally.

• Nacelle

The term for the housing containing all of the electrical components.

• Wind Vane

Measures wind direction. This information is used by the yaw system to align the tur-bine.

• Controller

Starts and turns off the wind turbine in a controlled manner depending on the windspeed.

• Electric System

Transformer, fuses, switches, cables and connections needed to carry currents and sig-nals between components.

2.2.2 Sensors and Monitoring Solutions

Sensors are the very heart of all monitoring systems. Modern wind turbines are usuallyequipped with sensors that provide fault detection on either system or subsystem level. Thereare other solutions available in addition to the mentioned SCADA system. These includeblade monitoring systems and holistic models that may take weather data such as temper-atures and salinity, or hours of continuous work into account [7]. However, these solutionswill not be considered further as they are out of scope for this thesis.

7

2. THEORY

The SCADA system provides fault detection on a subsystem level. Parameters being mon-itored include generator rpm, generator bearing temperatures, oil temperature, pitch angle,yaw system, wind speeds and more [7].

2.3 Wind Turbine Failure Modes

It is clear that wind turbines are expensive equipment, both with respect to procurementas well as maintenance. It is also clear that the current state of the art research regardingmaintenance is focused towards predictive maintenance, as it has shown to be the most costeffective maintenance method at this point in time. To be able to understand and validatethe SCADA data, as well as prioritizing research efforts, it is important to get a picture ofcomponent failure rate and how this affects the turbine productivity. A study published bythe National Renewable Energy Laboratory did a review using survey data from six publicationson wind turbines failure rates. The failures were compared with regards to failure rate peryear and downtime per year. The reasoning was that downtime can be used as an indicatorof cost and effort to repair a component, as well as a direct measurement of lost revenue [8].When comparing the data in Figure 2.2 and 2.3, it is found that some subsystems such as thegearbox or the blades and pitch system generate a high amount of downtime even thoughthey rarely fail. The electric system on the other hand fails much more frequently, howeverthe downtime per failure is much lower.

0 5 10 15 20 25

Gearbox

Electric System

Blades and Pitch System

Generator

Control System

Hydraulics

Main Shaft and Drive Train

Yaw System

Mechanical Brakes

hours/year

Figure 2.2: Subsystem downtime per turbine

2.4 SCADA Systems

SCADA (Supervisory Control And Data Acquisition) systems can be used for both moni-toring as well as controlling industrial systems remotely and provides an efficient way forindustries to gather and analyze data in real time. Hundreds of thousands of sensors may beused in larger SCADA systems, generating large amounts of data.

2.4.1 Historical Overview

The first SCADA systems were developed in the late 1960s and have played an importantpart in improving maintenance efficiency ever since. Vendors of SCADA systems usuallyrelease one major and two minor versions every year to take advantage of new technologi-cal advances and meet the requirements of their customers, meaning that SCADA systems

8

2.4. SCADA Systems

0 0.05 0.1 0.15 0.2 0.25 0.3

Gearbox

Electric System

Blades and Pitch System

Generators

Control System

Hydraulics

Main Shaft and Drive Train

Yaw System

Mechanical Brakes

Number of failures per turbine per year

Figure 2.3: Subsystem failure rate

historically have stayed relatively up to date with technological progress. The first iterationof SCADA systems were based on a centralized computing architecture and is referred to asmonolithic or stand alone. This architecture is described in Figure 2.4. As internet technologyadvanced in the 1990s together with system miniaturization, SCADA systems adapted andwere developed to run on distributed computing architectures. This improved the responsetimes and reliability of the system as more computing capacity and redundancy could beintroduced to a lower cost [9]. This architecture is described in Figure 2.5

Remote TerminalUnit

Remote TerminalUnit

Remote TerminalUnit

SCADA Master

WAN

WAN

WAN

Figure 2.4: Monolithic SCADA architecture

9

2. THEORY

LAN

Operating Station Operating Station Communication Server

Operating Station Operating Station

Remote Terminal Unit

Remote Terminal Unit

WAN

WAN

Figure 2.5: Distributed SCADA architecture

2.4.2 From Distributed to Cloud

Traditionally, SCADA servers have been large and expensive with an expected life span of8 to 15 years. After that, the system is replaced and the old hardware is usually discarded.With a more open architecture, such as a cloud computing based solution, the lifetime of thesystem can be improved even further [10]. There have been multiple attempts at describingSCADA systems using a generalized architecture, but no single standard exists. In Church etal. "SCADA Systems in the Cloud" in Handbook of Big Data Technologies, some of the key at-tempts are summarized. Based on the IEEE Standard for SCADA and Automation Systems,a generalized cloud based architecture for a SCADA system is proposed [11]. When compar-ing this cloud based architecture to the generalized architecture proposed in What is SCADA?from 1999 it is clear that the body of knowledge regarding networked distributed computingand scalable solutions have improved [12]. The older architecture did have a network baseddistributed computing architecture in mind, which can be observed as the file server andthe control server are described using similar internal structures. In the newer architecturehowever, each sensor is connected to a field device that has an internal processor, memory,power supply and network interface. Field devices are grouped and connected to a deviceserver. The module responsible for reading and writing data to the data processing moduleis moved from the server to the field device to enable parallel read and writes, thus makingit possible to utilize one advantage of cloud computing. The field devices are connected tothe file server to achieve a scalable solution where field devices can be added to the systemwhen needed. This third generation of SCADA systems are referred to as networked SCADAsystems and can be observed in Figure 2.6. One of the main improvements compared tothe distributed architecture comes from the use of WAN protocols for communicating withservers and equipment, allowing the system to be spread across multiple LAN networks andthus also geographically, allowing for more cost-effective scaling for very large scale SCADAsystems [11].

10

2.5. Big Data

CommunicationServer

Legacy RemoteTerminal Unit

Networked RemoteTerminal Unit

Cloud Service

SCADA Master

Figure 2.6: Networked SCADA architecture

2.5 Big Data

During the last three decades, there has been an exponential increase in data volumes. Inthe 1990s, data was measured in terabytes and could be managed using standard relationaldatabases. A decade later, the data volumes had increased to being measured in petabytes.The increase in volume stems from an increase in connected hardware such as industrialmachines, but also content repositories and network attached storage systems. Moving for-wards to the decade of 2010s, data is being measured in exabytes, even though there arefew applications or companies that store or process close to an exabyte of data. Everythingfrom machines and human interaction with machines to the actual processing of data, gen-erate data. Mobile sensors, surveillance, smart grids, medicinal imaging, gene sequencingand more is driving this modern age deluge of data and it is clear that a paradigm shift hashappened [13].

Big data has become a buzzword and is sometimes misused. Big data is not a framework or atechnology in itself, but rather a problem statement. Tools designed to handle big data weredesigned with the size and complexity of big data in mind and might not be the best choiceunless the data in mind is actually big data. The first step towards choosing the right toolsshould therefore be to understand what big data is. Big data is a very wide term and con-sequently, there are many definitions for it. The most known definition is what is describedas "3V". The term springs from the words Volume, Variety and Velocity. Another common

11

2. THEORY

definition is known as "5V". This term includes the "3V" and adds Value and Veracity to thelist of V’s [14].

• Volume - big data volume is always increasing and can be billions or rows and millionsof columns.

• Variety - big data reflects the variety of data sources, formats and structures. It can beboth structured and unstructured, or a combination of both. This increased the com-plexity of storing and analysing the data.

• Velocity - big data can describe high velocity data, with high speed data ingestion anddata analysis. Handling these demands is often a challenge.

• Value - a research project that does not produce value is not worth the investment,however it can be difficult to determine if and when big data research will deliver thedesired value.

• Veracity - it is crucial to ensure correctness and accuracy of the obtained data. Factorsto consider include trustworthiness, authenticity, accountability and availability.

Lately, the focus of big data research regarding the storage and computing solution hasshifted from Message Passing Interface (MPI) and Distributed Database Management Sys-tems (D-DBMS) to Cloud Computing. Reasons for this shift is the elasticity regarding theusage of computing resources and space, as well as flexible costs and lower managementefforts that is associated with Cloud Computing [15].

2.5.1 The Big Data Life Cycle

Jagadish et al. mentions a best practises guide for big data where incremental steps andconditions for moving forwards to the next step are defined [16]. This approach, illustratedin Figure 2.7, is the approach that was used during the research presented in thesis to ensurethat raw data, processed data, and models met the requirements before moving forward tothe next step.

• Discovery. Learn the domain, find data sources and evaluate the quality and sustain-ability of the data sources. Define the problems, the aims and the hypotheses of theproject.

• Data Preparation. Clean, integrate, transform, reduce and discretize the data accordingto the project needs. If analytical models are fed with poor quality data, the predictionswill most likely be suboptimal or even misleading.

• Model Planning. The techniques, methods and workflows for the models are chosen.Ensure that the choices made will enable the earlier defined hypotheses to be proven ordisproved.

• Model Building. The models defined in the model planning step are developed andexecuted. The models are fine tuned and the results are documented.

• Communicate Results. The criteria for success and failure are evaluated against theoutcome of the research by assessing the results of the models. The results are discussedand recommendations for future research is made.

• Operationalize. The models are deployed and tested in a small scale production-likeenvironment before a full deployment is made.

12

2.5. Big Data

Figure 2.7: Big Data Life Cycle

13

2. THEORY

2.6 Big Data Tools

The advances of wireless sensor technology and the introduction of SCADA systems has pro-vided companies with new ways of collecting more data about performance and degradationof their industrial machines in an easier manner. One of the challenges that developers ofsuch systems for wind turbines have to overcome is that the daily data volumes produced bya SCADA system are too large to be processed with traditional technology [17]. Even thoughsome analysis could be made using traditional methods, an important part of big data anal-ysis is ensuring a response within an acceptable time. This is where a natural connectionbetween big data analysis tools and predictive maintenance for wind turbines is made.

As data volumes are increasing at a faster rate than the computing resources meant to analysethem, new methods and tools have had to be discovered in order facilitate the needs. Thesetools have made it possible not only to analysing more data, but also to discover new anal-ysis methods made possible by the access to these kinds of data volumes. This section willcover Hadoop and relevant parts of the so called Hadoop Ecosystem that has sprung from itscreation.

2.6.1 Hadoop

Hadoop is frequently mentioned as the number one framework for big data management.It was introduced by Apache in 2007 as an open source implementation of the MapReduceprocessing engine bundled with a distributed file system. Hadoop thus solves the scalabilityproblem of a MapReduce on its own by using distributed storage and processing. Hadoopprovides an extensible platform for applications that process large data volumes, such asmachine learning. Because of that, many open source and commercial extensions have beenbased upon Hadoop since its release and it has grown into what is known as the HadoopEcosystem. The components of the Hadoop Ecosystem can be described as follows [18], [19].

Major Components

• HDFS: Hadoop Distributed File System

• YARN: Yet Another Resource Negotiator

• MapReduce: Programming based data processing

• Common: Set of common utilities needed by other components

Extensions

• Spark: In-memory data processing

• Kafka: Distributed publish-subscribe message streaming system

• HIVE: SQL-like data querying

• Pig: High level scripting

• HBase: Column oriented data store built on top of HDFS

• Mahout, Spark MLlib: Machine learning algorithm libraries

• Solr, Lucene: searching and indexing

• Zookeeper: Cluster management and coordination

14

2.6. Big Data Tools

• Oozie: Job scheduling

• Hue: Web interface

HDFS

The Hadoop Distributed File System is designed specifically to store large amounts of struc-tured and unstructured data across multiple nodes. There are two major components in theHDFS: Name node and Data node. HDFS is designed using a master-slave architecture. Thename node is the master, which holds references to file locations and metadata. Its primaryresponsibility is directing traffic to the data nodes. The data nodes are the slaves in this sys-tem. They can consist of commodity hardware which increases scalability, as commodityhardware is readily available, cheap and easily extendable.

HDFS stores files in blocks that it distributes over the cluster. A block size is typically 64Mb.If possible, the file blocks are stored on different machines, enabling parallel map step opera-tions on the blocks. This design entails that for a system that has many files smaller than theblock size, HDFS is most likely not the best solution.

YARN

YARN is a resource manager. As such, it schedules and allocates resources for the Hadoopsystem across the clusters. YARN was introduced together with Hadoop 2.0 in 2012 to handlesome of the deficiencies of the older Hadoop version, where the MapReduce module wasresponsible for resource management and job scheduling. This introduced the possibilityof running other types of distributed applications beyond MapReduce within the Hadoopframework. There are three main components in YARN: resource manager, node managerand application manager.

MapReduce

The MapReduce framework is used to break a task into smaller tasks, execute them in paralleland collect the individual outputs. As can be gathered from the name, a MapReduce jobconsists of two phases, a map phase and a reduce phase. The mapper contains the logic to beprocessed on each data block, which produces key/value pairs that are sent to their respectivereducer based on key value. A reducer thus receives many key value pairs from multiplemappers, which it then aggregates according to the defined reducing logic into smaller set ofkey/value pairs that are the final outputs. The MapReduce framework may be best explainedwith an example. In Figure 2.8, a word count algorithm is illustrated. The input data is splitand stored on different machines according to block size, called map nodes. The map nodesexecute the job, which is to count how many times each word occurs and outputs the pairs.The map nodes are also responsible for the shuffling phase that sorts the output and writesto disk. The sorted output is then sent to the reducer nodes, which reduces the output andwrites that part of the final result to the output folder.

One of the main deficiencies with MapReduce is that it is difficult to design an algorithm thatuses iterative processing, which is common for machine learning and graph applications.Iterative computations in MapReduce not only require careful manual programming andscheduling of multiple MapReduce jobs but are also slow as the data is written and readfrom disk between each iteration. In addition to this, MapReduce can’t do real-time analysisas it was designed to do batch processing.

15

2. THEORY

Deer Bear RiverCar Car RiverDeer Car Bear

Deer Car BearCar Car RiverDeer Bear River

Deer, 1Bear, 1River, 1

Car, 1Car, 1

River, 1

Deer, 1Car, 1Bear, 1

Bear, (1,1) Car, (1,1,1) Deer, (1,1) River, (1,1)

Bear, 2 Car, 3 Deer, 2 River, 2

Bear, 2Car, 3

Deer, 2River, 2

Input

Splitting

Mapping

Shuffling

Reducing

Final Result

Figure 2.8: MapReduce word count example

2.6.2 Spark

Spark started as a project at the University of California, Berkeley, but is now a top-levelproject supported by Apache. Spark is based on MapReduce and is designed to resolve someof the deficiencies with MapReduce mentioned above. It supports iterative algorithms anddoes fault tolerance without replication through its data storage model called Resilient Dis-tributed Dataset (RDD). An RDD is an immutable data set that remembers each deterministicoperation that was performed. As such, in case a worker node fails, the RDD can be recreatedusing the operation lineage. RDDs are primarily used for manipulating data with functionalprogramming constructs. For high-level expressions, Spark has introduced DataFrames andDatasets.

Spark has been proven to be fast and highly scalable. In an article by García et al. [20], Sparkis compared to Apache Flink by implementing the two popular machine learning algorithmsSVM and LR and comparing speed and scalability of the training process. Another article byXianrui et al. [21] compared Spark to MapReduce with regards to speed and scalability by im-plementing five different algorithms. It was found that “MapReduce’s scheduling overheadand lack of support for iterative computation substantially slow down its performance on

16

2.7. Machine Learning Theory

moderately sized datasets. In contrast, MLlib exhibits excellent performance and scalability,and in fact can scale to much larger problems”.

DataFrames and Datasets

When working with Spark, one will also encounter the terms DataFrame and Dataset. ADataFrame is described as a two-dimensional structure where each column contains valuesconcerning one variable and each row contains one set of values. The Spark DataFrame wasintroduced as an extension of RDDs to improve the performance and scalability of Spark forsemi-structured and structured data, as well as providing developers with high-level abstrac-tions. As an example, these abstractions give developers access to SQL queries and to MLib’sMachine Learning API, making them very useful for machine learning applications.

2.6.3 Kafka

Kafka works on a publish-subscribe basis and delivers a fault tolerant messaging systemthat is scalable and distributed by design. It achieves fault tolerance by replicating messageswithin the cluster. Kafka has many use cases, for example aggregating statistics and logsfrom distributed application and making this available to multiple consumers, or for streamprocessing. Relative to many other messaging systems, Kafka has a low overhead because ofhow it keeps messages for only a set amount of time and thus makes the consumer responsi-ble for tracking relevant messages [22].

Spark has an API named Spark Streaming which is used to integrate Kafka with Spark. Itenables Spark to ingest the data stream in a scalable and fault-tolerant manner, divide it intobatches and process the data using the Spark engine. Finally, the processed data can be eitherstored on the HDFS or pushed to another Kafka topic to be consumed by a subscriber.

2.7 Machine Learning Theory

Machine learning (ML) has, just like big data, become a buzzword as it has boomed in pop-ularity. It is sometimes incorrectly used to describe what is really Artificial Intelligence (AI).In the same way, Deep Learning (DL) is sometimes used to describe what is ML. To clarify,ML is a subset of AI that uses statistical methods to enable machines to improve as they areexposed to more data, where AI are a much broader term that can be used to describe alltechniques that mimic human behaviour. DL is a subset of ML where the models being usedare based on artificial neural networks and have one or more intermediate layer between theinput and the output layers. Additionally, the model parameters of the intermediate layersare learned using outputs of the preceding layers, instead of being learned directly from thefeatures of the training data [23]. ML can be used for a wide range of real-world applicationsuch as image processing, natural language processing, computational finance and more1.

When choosing which algorithm to use for a specific problem, it is important to know thedifference between supervised and unsupervised learning as well as training and validationmethods to be able to explore and prepare the data correctly.

2.7.1 Model Training and Validation

It is common to divide the data intro three parts; training, validation and testing data sets.The training data is used to fit the initial model and the validation data is used to provide anunbiased metric on how well the model generalizes with some hyperparameters of choice,

1https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/i/88174_92991v00_machine_learning_section1_ebook.pdf

17

https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/i/88174_92991v00_machine_learning_section1_ebook.pdf




2. THEORY

meaning that the validation data is used to select the model with the best hyperparameters.When the best model is trained and found, the testing data is used to evaluate how well themodel generalizes using a data set that has not affected the model in any way. It is common tofirst split the data into two, where the first part contains both training and validation data andthe second part is the testing data. The first part is then split according to the needed amountsof training and validation data. A model with many hyperparameters are more difficult totune and will require a higher amount of validation data. The split ratio is dependent on theamount of hyperparameters that the model has, as well the amount of data available. As arule of thumb, sklearn uses a default ratio of 75% training and 25% testing data2.

There are some important metrics for evaluating a prediction model, which are defined below.These are reoccurring in related studies and will be used throughout this thesis to benchmarkmodel performance. TP, TN, FP and FN stands for true positive, true negative, false positiveand false negative.

Sensitivity = TPTP+FN = CorrectPositivePredictions

AllPositiveValues

Speci f icity = TNTN+FP = CorrectNegativePredictions

AllNegativeValues

Precision = TPTP+FP = CorrectPositivePredictions

AllPositivePredictions

Accuracy = TN+TPTN+TP+FN+FP = CorrectPredictions

AllValues

2.7.2 Supervised Learning

Supervised learning can be used when each sample of the data set used to train the modelhas a set of known input values x as well as an output value y. The algorithm at hand isthen used to find a mapping function f (x) = y. This is a common case for classification andregression problems, which constitute the majority of machine learning problems. Under-standing the difference between these is key to classifying whether the machine learning taskis a classification or a regression problem.

Classification problems consists of creating a mapping function such that it maps the inputto a discrete or categorical value. Regression problems on the other hand, creates a mappingfunction to a continuous variable. With this in mind, predictive maintenance can be either aregression or a classification problem, depending on if the output variable used to train themodel is designed as a categorical value such as warning level, or a continuous value such asestimated time to failure [24].

2.7.3 Unsupervised Learning

Unsupervised learning is used to make inferences from data when the output values for agiven set of input values are unknown. Because the output data is unknown, applying regres-sion directly is not possible. Using some technique, the input is interpreted and grouped byfinding previously unknown patterns or underlying structures in the data. Common tasks in-clude clustering and association used for exploratory analysis and dimensionality reduction[25]. The only unsupervised learning technique used during this thesis is Principal Com-ponent Analysis (PCA) and therefore the report will not describe any other unsupervisedtechniques more in depth.

2https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

18

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

2.8. Predictive Classifiers

2.8 Predictive Classifiers

This section will present the fundamental theory for each of the algorithms that have beenconsidered for building predictive models in this thesis. Knowing fundamental theory aboutthe algorithms being considered for the solution is key to selecting the best algorithm andevaluating it in a correct manner. The following information is gathered from The Hundred-page Machine Learning Book [23].

2.8.1 Decision Trees

A decision tree is built by repeatedly splitting a set of data into subsets and choosing the splitthat minimizes the entropy. The algorithm stops either when the tree reaches a configuredmaximum depth d, or when all possible splits reduces the entropy less than ε. It is necessaryto be aware of and explore these parameters cautiously. An example of what can happen isthat a very tall decision tree will model insignificant noise and overfit to the training data,meaning that it will perform very well on the training data, but perform poorly on futureexamples. The final result of the decision tree algorithm is an acyclic graph that can be usedto make decisions by inspecting one feature at the time.

2.8.2 Support Vector Machine

Support Vector Machines uses the dot product between feature vectors to solve an optimiza-tion problem that consists of finding the hyperplane that has the greatest margin to the near-est point of any class. Finding the hyperplane with the largest margin is important as thatcontributes to how well the model will function for future examples. However, outliers mayaffect the SVM such that the data is not linearly separable. For those cases, a hinge function isused that introduces a trade-off possibility between decreasing margin size in order to clas-sify the training data well and the ability to classify future examples well. For cases where thedata is inherently non-linear, SVM can be extended by the use of kernels to make non linearclassification models. The final result is a model that can assign a feature vector to one of twocategories.

2.8.3 K-Nearest Neighbors

K-Nearest Neighbors (KNN) produces a model which is a collection of all the training sam-ples. It works by comparing all the previously recorded samples and compares them to thenew data point. Once it has decided the K nearest samples, the new data point is assigned thelabel that the majority of those K samples has. The distance function for comparing the dis-tance between data points needs to be chosen by the data analyst, however Euclidean distanceis frequently used.

2.8.4 Neural Networks

Neural networks consists of an input layer, one or more hidden layers and an output layer.In each layer is a number of neurons and inside every neuron is a core with the assignmentof using an activation function to produce an output given an input. There are many typesof neural networks, however most are not relevant for the case of using tabular multivariatetime series data for predictive maintenance. To further narrow it down, the only type ofneural network provided by Spark MLlib is the Multilayer Perceptron Classifier3.

3https://spark.apache.org/docs/latest/ml-classification-regression.html

19

https://spark.apache.org/docs/latest/ml-classification-regression.html

2. THEORY

Multilayer Perceptron

Multilayer Perceptron Classifier (MLP) is feedforward neural network that have been de-scribed as the classical type of neural network, or vanilla neural networks [26]. MLPs aresuitable for classification prediction problems as well as regression prediction problems, mak-ing them highly relevant for predictive maintenance. MLP trains by solving an optimizationproblem where each neuron weight is optimized through gradient descent and back propa-gation techniques. The metric used for solving the optimization problem is the mean squarederror between the model output and the answers.

2.9 Dimensionality Reduction

Dimensionality reduction can be used to speed up model training by allowing simpler mod-els to be used. With modern techniques such as cloud computing and improved graphicalprocessing units, dimensionality reduction is less important now compared compared to thepast. Another more common use case nowadays is to be able to visualize higher dimensionaldata [23]. One effect of dimensionality reduction is that it removes redundant or highly cor-related features and thus removes noise in the data. Therefore, it can also be used to reduceconfusion and improve model precision when working with complex data, as can be seen inan article by Meigarom Lopes [27].

2.9.1 Principal Component Analysis

PCA is an unsupervised method that is used to find the linear combination of variables thatmaximizes the variance in the variable space. This means that the new set of components willbe of a lower dimension, while still retaining all or most of the information needed to findpatterns in the data.

2.10 Model Tuning

The prediction algorithms do not inherently tune hyperparameters. Instead, these parametershave to be tuned manually.

2.10.1 Grid Search

Grid search is a simple technique for hyperparameter tuning where a number of values areentered for each variable that required optimization. All possible combinations of these vari-ables are then tested in a parameter search and the best performing model are kept.

2.10.2 K-Folds Cross-Validation

In order to evaluate how well a certain model using the hyperparameters currently under testgeneralize to an independent data set, a k-fold cross-validation may be applied. The trainingdata set is split into 1 partition of validation data and k partitions of training data. The resultis the average evaluation of k models trained on different subsets of the data set. Combiningcross-validation with grid search is a powerful technique to find good hyperparameters thatusually gives a more sound estimation of how well the parameters are performing, howeverit can be computationally expensive as every model is trained k times instead of one4.

4https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

20

https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

2.11. Ensemble Methods

Bias

Erro

r

Model Compexity

Total Error

Variance

Figure 2.9: Bias variance trade-off

2.11 Ensemble Methods

This section will present ensemble methods as a concept and three different methods forproducing an ensemble model: bagging, boosting and stacking. Ensemble methods are al-gorithms that are based on the idea that multiple models can be combined to create a singlemodel that is better than any of the base models. Bagging and boosting considers homo-geneous weak learners, meaning that the weak learners are different variants of the samealgorithm. Stacking however, often considers heterogeneous weak learners, meaning that itcombines multiple models based on different learning algorithms.

The terms bias and variance will be frequently used in this section. A model with a high biasis prone to underfitting, while a model with a high variance is prone to overfitting. A modelwith a high bias will make incorrect predictions both on the training data as well as on the testdata, where a model with a high variance will be very accurate on the training data but makeincorrect predictions on the test data. As is illustrated in Figure 2.9, there generally exists anoptimal balance between bias and variance. This is known as the bias variance trade-off.

An explanation to why ensemble methods works is that each model can be viewed as a signaland some surrounding noise, meaning that a model usually has either a bias, or a variancethat is too high. Assuming that the noise is evenly distributed around the true value, averag-ing the models cancels out the noise and finds a balance. Ensemble methods are divided intoparallel and sequential methods. Parallel methods combine base models that can be trainedindependently of each other, while sequential methods are limited to training one base modelat a time as the current model rely on information that the previous model generates.

2.11.1 Bagging

Bagging, which stands for Bootstrap Aggregating, is a method used to improve stability andaccuracy while at the same time decreasing the variance of a model by avoiding overfitting.Bagging does not reduce the bias of the models however and therefore, base model with alow bias and a high variance should be selected. Bagging achieves a decrease in variance byaveraging multiple base models that are trained using random subsamples of the trainingdata. The first step in this process is the bootstrap sampling, which selects a number ofsamples at random for every base model. The second step aggregates the base models into afinal model. Averaging or voting is used for aggregating the models in an optimal way.

21

2. THEORY

2.11.2 Boosting

Boosting has the objective of converting weak learners to strong learners by reducing the biasof the model, with the drawback that overfitting may be increased. Therefore, weak learnerswith a high bias and a low variance should be selected when using boosting. A weak learnercan be represented by a single model, or a combination of models, but it must be better thanrandom chance. The method is based on the question posed by Kearns and Valiant: "Can a setof weak learners create a single strong learner?" [28] Boosting does not take subsamples fromthe training data, but instead uses the complete data set together with a weighting systemto make future models focus on classifications which were perceived as difficult by previousmodels. For each model being added, the data weights are readjusted to increase the weightsfor data that have been previously misclassified and decrease the weights for points that areeasy to classify. Each weak learner is given a voting weight based on their accuracy, such thatmodels with a higher accuracy have a stronger vote. This vote is later used when combiningthe weak learners into a more complex, strong learner.

2.11.3 Stacking

Stacking utilizes a learning algorithm, referred to as a meta learner, that learns how to combinebase-level model predictions into a new model. As previously stated, the base-level modelsused with stacking are often heterogeneous, however one may use homogeneous models ifdesirable. As noted by Saso Džeroski and Bernard Ženko [29], one may have to experimentwhen deciding the number of base-level models and which algorithm to use for the metalearner. A good starting point however is to consider between three and seven base-levelmodels. They propose a solution called multi-response linear regression for the meta learnerand shows that it outperforms other stacking approaches. One may have more than onelayer in a stacking ensemble. In such a model with i.e three layers, the base layer would beconnected to a number of meta learners that are connected to a final meta learner. This issuggested by Rodolfo Lorbieski and Silvia Modesto Nassar [30] as the ensemble alternativethat is most likely to increase the accuracy of the model with the drawback that the computingtime is heavily increased.

2.12 Homogeneous Ensemble Algorithms

This section presents the common algorithms random forest and gradient boosting that are ainherent to the Spark MLlib5 and thus can be used and evaluated with no further implemen-tation work required.

2.12.1 Random Forest

Random forests ensemble is a method that can be used for both classification and regressionproblems. They exist to solve the problem of overfitting that decision trees are prone to. Over-fitting means that the model performs very well on the training data but does not generalizevery well. The reason why this can happen for a decision tree is that the tree is designed tooperfectly with regards to the training data and thus it ends up with branches that make strictdecisions on very small amounts of data. A random forest is an aggregated model of a collec-tion of trees that are trained on subsets of the training data. These subsets are created usingthe standard bagging method, with the small twist that only a certain amount of features areselected for each tree. For tabular data, this means that a subset of rows are selected and thena subset of columns are picked from these rows. Apart from random inducing features, thetraining process is identical to the one for decision trees.

5https://spark.apache.org/docs/latest/ml-classification-regression.html

22

https://spark.apache.org/docs/latest/ml-classification-regression.html

2.13. Related work

2.12.2 Gradient Boosting

Like random forests, gradient boosting can be used for both classification and regressionproblems. Gradient boosting is typically used with shallow decision trees as its weak learn-ers. Each tree ti is trained using the training data and is then added to the model together witha weight wi that represents the accuracy. The residuals [31] are used to update the weightsto configure what the next tree should focus on. When the configured maximum amount oftrees are trained, they are combined and the ensemble model is returned.

2.12.3 Neural Network

The greatest strength with neural networks is that they can find and predict complex non-linear relationships in data. The cost of this high flexibility is that they are highly sensitiveto noise and errors in the training data, which often results in a high variance. An article byLars Kai Hansen, Peter Salamon shows evidence that an ensemble of neural networks per-form better than a single neural network when trained using bagged subsets of the trainingdata[32] and ensembles consisting of MLP neural networks will therefore be considered inthis thesis.

2.13 Related work

This section presents and discusses related research. The related work has influenced thealgorithms taken into consideration and the methods used for benchmarking their perfor-mance. In addition, the related research has acted as a source of validation for the obtainedvalues during the thesis work.

Olgun Aydin and Seren Guldamlasioglu [33] investigated how to implement the Keras libraryon the distributed clustering platform of Spark. For this purpose they used Elephas, whichis an extension that allows for deep learning models built in Keras to run on Spark. Theyimplemented an LSTM model with the purpose of predicting engine condition. The timeframe in this study was set to “200 epochs” and they received an accuracy of 85%. Any othermetrics are not disclosed, however they present the conclusion that Spark provides a large-scale distributed data processing environment that is suitable for this kind of research andthat an LSTM model is a promising alternative when creating a prediction system.

Chinedu et al. [34] described a generalized solution for fault detection on complex systemsusing SCADA data. The article describes how to process the SCADA data from the data ac-quisition step to data preparation, model training and model validation in order to create anArtificial Neural Network (ANN) that can predict future readings from a target component.The importance of data preparation is emphasised. The authors choose to remove low vari-ance features, impute missing values by calculating neighbor mean value and filter outliersby using the Interquartile Range Rule for Outliers. The models were built using a four-foldcross-validation ensemble method for ANNs. The conclusion is that cross-validation ensem-ble technique is superior to the classic ANN when comparing predictive ability. The authorshypothesize that better results could be made using an 8-12-fold cross-validation ANN, butdo not mention why they choose four folds for their solution. However, this is still convinc-ing evidence that using k-fold cross-validation should be considered when working with datafrom complex systems that SCADA systems usually monitor.

Leahy et al. [35] investigated how to build a predictive system for wind turbine fault de-tection using SCADA data and support vector machines. The produced models are able topredict an error up to 12 hours in advance for a specific failure. The authors achieve a veryhigh recall, but express their concern with the poor precision metric of the models. Theyexpress a hypothesis that a feature extraction method would allow for models with higher

23

2. THEORY

Author Accuracy Sensitivity Specificity Time frameKusiak et al. 76.50% 77.60% 75.70% 5 hoursCanizo et al. 82.04% 92.34% 60.58% 1 hour

Table 2.1: Related work metrics

precision and propose that a future study should consider the costs of false negative versusfalse positive to find an optimal balance between precision and specificity. Feature extractionwill therefore be explored, as will the relationship between accuracy metrics and maintenancecosts.

Canizo et al. [36] describe a complete solution for predictive maintenance using HDFS andSpark, showing that such a solution is possible to build using only the Hadoop framework.The only algorithm considered in that study is the random forest algorithm, achieving anaccuracy of 82.04%. It is worth noticing that the specificity was significantly lower comparedto related work by Kusiak et al. [37]. The metrics found in these two studies are presentedin Table 2.1 together with the time frame in which the predictor is designed to operate. Inaddition, Canizo et al. performed some experimentation on the number of trees Ntrees andthe depth of the trees Maxdepth with the conclusion that Ntrees = 40 and Maxdepth = 25 resultsin an optimal random forest algorithm. This is however, under the condition that the SCADAdata is not only very similar but also preprocessed in the same way and should therefore onlybe used as a starting point for a parameter analysis. Lastly, they present a hypothesis that theaccuracy of the predictive model could have been improved if the data set had been balancedduring the preprocessing step.

24

3 Method

This chapter describes the stages that were followed during the thesis and how each stagewas carried out. Before anything else, a literature study was made to determine the feasibil-ity of the research questions and the current state-of-the-art knowledge in each area of inter-est. After the literature study, the steps described in Figure 2.7 were followed to ensure thatno premature decisions that would have negative consequences in a consecutive step wasmade. Therefore, a period of data gathering took place to determine if sufficient data wasavailable, followed by a period of going back and forth between data preparation and modelplanning. When the quality of the data and how to process the data to make it useful hadbeen determined and the analytical plan was defined, the models were built and evaluated.

3.1 Data Discovery

Data discovery is the first step in any big data project. To find suitable data, a number ofcompanies were contacted and sources for open access datasets such as World Bank OpenData, EU Open Data Portal, Data.gov, Kaggle and more were searched. The found datasets werecompared and explored with the goal of finding a strategy to use the dataset in question tobuild a classification or regression model.

3.1.1 EDP Data Set

Energias De Portugal (EDP) provided a dataset that was used by the competitors during a windturbine themed hackathon held in May 2019 named Wind Turbine Failure Detection1. Thisdataset is now open access and was found to be suitable for evaluating software, methodsand models for this thesis. The distinction between this and many other datasets was thepresence of identifiable distinct error codes, making it possible to add labelling columns suchas remaining useful life that the algorithms can use for training and validating.

The dataset contains measurements from five turbines. Measurements have been recordedevery ten minutes over the course of two years, 2016 and 2017. The data from 2016 is used as

1https://opendata.edp.com/pages/challenges

25

https://opendata.edp.com/pages/challenges

3. METHOD

Component F/C F in training F in testing T 1 T 2 T 3 T 4 T 5Gearbox 4 2 2 1 1 0 2 0Generator 7 5 2 0 5 1 0 1Generator Bearing 6 4 2 0 0 2 4 0Transformer 3 2 1 1 0 2 0 0Hydraulic Group 8 2 6 0 2 2 1 3Failures/T 28 16 12 2 8 7 7 4

Table 3.1: Turbine failure summary

training data and the data from 2017 is used for testing purposes. The measurements are com-posed of 81 variables derived from sensors that monitor 12 components and environmentalaspects. For details about the variables, sensors and components, please see the EDP OpenData website2. The datasets are available for registered users on the Data page, however thedescription of the data can be found under Challenges and Wind Turbine Failure Detection.

Failure Data Analysis

The error codes provided by EDP cover five components - gearbox, generator, generator bear-ing, transformer and hydraulic group. Table 3.1 summarizes the failures (F) per component(C), per turbine (T), how the failures are divided between the training and the testing data aswell as the amount of failures per turbine.

These error codes can be translated reasonably well to subsystem downtime per turbine, seeFigure 2.2 presented in section 2.3. When evaluating the EDP dataset downtime coverage,two assumptions were made. First, that the transformer is the error prone component ofthe Electric System. Secondly, that the downtime for both the generator as well as generatorbearing is included in Generator. Under these assumptions, the five components that themodels are trained to predict are responsible for 62% of the average downtime, which wasdeemed as sufficient coverage to continue working with this dataset.

Assuming that the models are more precise the more training data they have, the three mod-els for the gearbox, transformer and hydraulic group components would perform equallywell, given that they have two failure occurrences each in the training data. Further, the gen-erator bearing predictor would perform slightly better and and the generator would performthe best. However, if the performance of the model is indeed related to the specific turbinethat it has been trained on, a model will be limited to only predicting component failureswhere it has been trained on failure data in that very same turbine. The cells marked withyellow in Table 3.2 illustrate training data that is expected to be of no use and the testingdata where the models are expected to fail, while the cells marked with green illustrate thetraining data that is expected to be useful and the testing data where the models are expectedto predict an error with high precision, according to this theory. Therefore, to prove that amodel is able to generalize predictions across turbines, failures marked with yellow must bepredicted.

3.2 Data Preparation

Data preparation is the process of transforming the raw data such it can be used with machinelearning algorithms and is last step where any deficiencies in the data should be corrected orremoved before planning and building the models. Jupyter Notebooks, Pandas and Matplotlibwere used to visualize and explore the data.

2https://opendata.edp.com/

26

https://opendata.edp.com/

3.2. Data Preparation

Component Training T1 Testing T1Gearbox 1 0Generator 0 0Gen. Bearing 0 0Transformer 0 1Hyd. Group 0 0

(a) Turbine 1 failures with respect to trainingand testing


(b) Turbine 2 failures with respect to trainingand testing


(c) Turbine 3 failures with respect to trainingand testing


(d) Turbine 4 failures with respect to trainingand testing


(e) Turbine 5 failures with respect to trainingand testing

Table 3.2: Baseline theory about turbine failures with respect to training and testing data

3.2.1 Cleaning

First, an analysis was made on the data with regards to outliers, null values and the varianceof the features.

Outliers

The data was found to be of high quality with few apparent outliers. The available choices areto drop, keep or transform the outlier. Since outliers in this dataset may be either the resultof a temporary malfunction in a sensor, or a correct value that is the result of componentmalfunction, there is a chance that similar cases will occur in the test data. Therefore, alloutliers were kept in the dataset for the first model evaluation.

Null Values

Null values have to be handled, since the algorithms do not know how to handle missingvalues and will throw errors if null values are present. Out of 521, 784 measurements, 7contained null values. Due to the large amount of available correct data, these measurementswere removed from the dataset.

Low Variance Features

Low variance features were removed. The standard deviation for each feature was comparedagainst a threshold of 0.1 to determine which features to remove. 4 features were removedfrom the dataset: Grd_Prod_CosPhi_Avg, Grd_Prod_Freq_Avg, Prod_LatestAvg_ActPwrGen2and Prod_LatestAvg_ReactPwrGen2. By performing three PCA, creating 5, 25 and 50 princi-

27

3. METHOD

Figure 3.1: Amb_Temp_Avg over time. This variable describe the measured average ambienttemperature.

pal components it was confirmed that these features could be safely removed, since they hadno influence on the extracted principal components in neither of the analyses made.

3.2.2 Scaling

Many machine learning algorithms assume that the data is scaled in a certain way. It maybe that the algorithms do not work at all unless the data is scaled in a specific way, or it mayaffect performance such as training time or accuracy. As is described in an article by ShayGeller, the scaler of choice can have significant effect on the end result and is therefore animportant part of the data preparation [38]. Due to this observation, every algorithm wasevaluated using two common scaling methods to be able to choose the optimal scaler.

The scalers considered for this evaluation had to be limited to MinMaxScaler and Standard-Scaler, as they are the only scalers supported by mllib. Both the MinMaxScaler and the Stan-dardScaler are sensitive to outliers. Therefore, the following images has been extended toinclude the PowerTransformer-Yeo-Johnson as it reacts differently to outliers. The MinMaxScalertransforms the data so that all samples are in the range [0 ´ 1]. The StandardScaler sets themean to zero and scales the data to unit variance. PowerTransformer-Yeo-Johnson scales the datainto a more Gaussian-like distribution with the purpose of avoiding modelling problems re-lated to heteroscedasticity, also known as non-constant feature variance. The different scalersare illustrated using Figure 3.3 and 3.4 that uses the data from the variables Amb_Temp_Avgand Prod_LatestAvg_TotReactPwr. The measured data from these variables is illustrated inFigure 3.1 and Figure 3.2 respectively. The reason for using these variables is that the firstcontains no outliers and the second has the most amount of outliers in the dataset, using theInterquartile Range Rule for Outliers. It can be observed that StandardScaler and MinMaxScalerproduce data with identical shapes where only the mean value and variance differs whilePowerTransformer-Yeo-Johnson changes the shape of the data noticeably in comparison.

3.2.3 Dimensionality Reduction

The dimensionality of the problem can be reduced to speed up the process of training themodel. This is a common application of Principal Component Analysis (PCA). The data couldbe reduced to 5 features while still keeping 99% of the variance using PCA.

3.2.4 Feature Engineering

While cleaning is about data reduction, feature engineering is about addition. It’s wheredomain knowledge is used to enhance the dataset and increase the accuracy of the model bymaking it easier for the machine learning algorithm. The data could not be divided into train-

28

3.3. Model Planning

Figure 3.2: Prod_LatestAvg_TotReactPwr over time. This variable describe the measuredtotal reactive power produced.

ing, validation and testing data, because of how each model should be trained to recognizeonly faults in one component. Therefore, one training and one testing set for each componentwas required. It was decided to use the data related to gearbox failures as training and val-idation data to make evaluations and tune hyperparameters. The reason for this is that thegearbox model has two errors to train on and two errors to test on, while the other compo-nents are unbalanced in this regard. Furthermore, the gearbox failures are divided such thatone of the failures in the training data occurs in the same turbine as one of the failures in thetesting data and thus it can be evaluated whether or not the model’s ability to predict failuresvary depending on whether or not it has been trained on such a failure in the same turbine.The remaining four components were used to evaluate the results received in this first step.

To be able to create a label for each measurement, the future point in time where a relevanterror occurs has to be known. Therefore, the sensor data was joined with the failure datausing the condition that the turbine id is the same and that the timestamp is less or equalthan the timestamp in the failure data, creating the column failure timestamp. This failure datacontains turbine id, timestamp, component name and remarks such as “Gearbox pump dam-aged”. What is worth mentioning here is that this resulted in duplication of many sensordata points, as every future error is in the scope of the join query. The difference between thedata timestamp and the failure timestamp was used to create a new column named remaininguseful lifetime. Using this, labels for both classification and regression models were createdand evaluated in order to stay open to possibility of evaluating both in the model planningstep. The first binary label to be added and evaluated is label_60_days and is a binary classi-fier to classify whether or not a component is going to fail within sixty days from the giventimestamp. The reason for having sixty days as the classifier limit is that the scoring systemused by EDP in the competition awarded maximum score for predicting a failure sixty daysin advance. The technical reason for using this limit is not disclosed. These duplicated datapoints were then ranked depending on if the data point was related to the component rele-vant to the model and if the label indicated an error and then removed so that only one datapoint for every measurement existed.

3.3 Model Planning

Model planning is about creating a clear vision regarding which models to build and use.Here, the two scalers and the dataset balance ratios mentioned in the previous section wereevaluated to get a good picture about how to produce the best base models.

29

3. METHOD

(a) Normalized Amb_Temp_Avg

(b) Standardized Amb_Temp_Avg

(c) Power Transformed Amb_Temp_Avg

Figure 3.3: Scaler comparison for a feature with no outliers, using the data from theAmb_Temp_Avg variable. 3.3a, 3.3b and 3.3c illustrate the data distribution of this vari-able when scaled using the MinMaxScaler, StandardScaler and PowerTransformer-Yeo-Johnsonrespectively.

30

3.3. Model Planning

(a) Normalized Prod_LatestAvg_TotReactPwr

(b) Standardized Prod_LatestAvg_TotReactPwr

(c) Power Transformed Prod_LatestAvg_TotReactPwr

Figure 3.4: Scaler comparison for features with no outliers, using the data from theProd_LatestAvg_TotReactPwr variable. 3.4a, 3.4b and 3.4c illustrate the data distribution ofthis variable when scaled using the MinMaxScaler, StandardScaler and PowerTransformer-Yeo-Johnson respectively.

31

3. METHOD

Component Replacement cost Repair cost Inspection costGearbox 100 000 € 20 000 € 5 000€Generator 60 000 € 15 000 € 5 000€Generator Bearing 30 000 € 12 500€ 4 500€Transformer 50 000 € 3 500€ 1 500€Hydraulic Group 20 000 € 3 000€ 2 000€

Table 3.3: Costs for component operations

3.3.1 Task Description

To be able to compare the produced models with other models under equal conditions, therules used to evaluate models during the Wind Turbine Failure Detection competitions wereapplied. The reason for including these rules in the planning stage is that they affect how thedata is prepared. The following rules have been extracted from the EDP website [39].

“True positives (TP) are failures of the correct wind turbine and subsystem, detected between2 and 60 days before the date of the break. If a failure is detected in the right period but in thewrong wind turbine or subsystem, it counts as a false positive. True Positives are translatedinto savings, which are the difference between replacement and repair costs.

False negatives (FN) are real failures in a wind turbine and subsystem where there is nodetection in the previous 2-60 days. False negatives are translated into replacement costs.

False positives (FP) are produced warnings in a wind turbine and subsystem where there isno failure in the next 2-60 days. False positives are translated into inspection costs.”

The following formulas are used to calculate the total reduction or increase in maintenancecosts caused by the prediction system.

TPSavings =ř

i=#TP(Replacement ´ (Repair + (Replacement ´ Repair)(1 ´ti60 )))

FPCost =ř

i=#FP(Inspection)

FNCost =ř

i=#FN(Replacement)

TotalSavings =ř

Turbineř

Component(TPSavings ´ FPCost ´ FNCost)

To summarize these rules into a task: for each sensor data si, produce a prediction if com-ponent ci in turbine ti is going to fail within sixty days. The dataset created according to theaforementioned specification is iterated, ci and ti are given together with the sensor data ineach row.

3.3.2 Core Problem Evaluation

The core of the model planning problem is to decide which type of model to build. Two kindsof models were evaluated: multinomial classification and binary classification. All initial modelswere built and evaluated using a normalized dataset. Multinomial classification were eval-uated for each model in mllib having this functionality, with the aim of producing a modelcapable of predicting if a component failure is imminent within sixty days and if so, whichcomponent is going to fail. Poor results were received, indicating that the models were notable to successfully learn either how to detect a failure or how to decide in which componentfailure is imminent. Due to the many sources of confusion when building a multinomial clas-

32

3.3. Model Planning

sification model on this dataset, the binary classifier was given priority and was evaluatedmore thoroughly.

Binary Classification

When evaluating binary classification models, one model had to be produced for each com-ponent. The dataset was split accordingly, creating subsets of training data. Significantlybetter results were received using this approach with the first models producing up to 79%accuracy, indicating that this is indeed the better approach for making component specificpredictions using the given dataset. Therefore, all following evaluations and optimizationswere made with the goal of producing several binary classifiers.

Ensemble Evaluation

The package spark-ensemble3 was found to be suitable for evaluating the effects of bagging andstacking algorithms because of how it has not only implemented the algorithms in scope forthis thesis but also is the only available package for Spark with this functionality. In additionto these two algorithms, it also supports combining machine learning algorithms throughboosting and Gradient Boosting Machines however this was not evaluated. Boosting aims atreducing the bias, however the produced models have a low bias and a high variance andwould most likely not benefit from the boosting algorithm. Gradient Boosted Machines wereleft out due to time constraints.

3.3.3 Data Balancing

After adding the engineered features, the dataset was skewed with regards to the classifier,with 17050 out of 238957 measurements (6.66%) being labelled as positive. Considering thehypothesis made by Canizo et al.[36] regarding the balance of the dataset, it was decided thatthe models should be evaluated using differently balanced versions of the same dataset. Theratios 90 ´ 10, 80 ´ 20, 70 ´ 30, 60 ´ 40 and 50 ´ 50 were considered with the motivation thatthese four test cases would indicate a trend and that further tuning would be based on thistrend. Notice that this is the ratio between positive and negative labels in the training datasetand is unrelated to the ratio between training and testing data.

3.3.4 Data Balancing and Scaler Evaluation

To better understand how the balance between positive and negative samples affect themodel performance, a number of tests were performed using the ratios described earlier in3.3.3. The balancing itself was done using downsampling where the number of positive sam-ples were calculated and then used to pick an appropriate number of positive samples fromthe dataset at random. This has the obvious downside that the datasets are not identical,even if they are built using the same settings for data balancing and therefore the results mayvary depending on the data points that are seeded. However, no other solution was foundas Spark was unable to save a DataFrame the size of the training data to file due to lack ofavailable memory. Thus, there exists a natural variation in model performance that has to betaken into account when observing the results. A naive oversampling was tried in order tomitigate this, where the group of positive samples were duplicated multiple times until thedesired ratio between positive and negative samples was achieved, but it was found that thishad the same performance outcome as if the model was trained on the complete training setwithout any data balancing. In addition to the balancing evaluation, the scalers were evalu-ated in this same process by creating two almost identical datasets where the only differencebetween them was whether the data was scaled using the MinMaxScaler or the StandardScaler.

3https://pierrenodet.github.io/spark-ensemble

33

https://pierrenodet.github.io/spark-ensemble

3. METHOD

Model TP FP TN FN Prec. Sens. Spec. Acc. TimeLR 9569 40242 109950 7672 19.21% 55.50% 73.21% 71.38% 1259DT 6384 19629 130563 10857 24.54% 37.03% 86.93% 81.79% 118RF 6916 24198 125994 10325 22.23% 40.11% 83.89% 79.38% 1025MLP 6393 21752 128440 10848 22.71% 37.08% 85.52% 80.53% 3414L-SVC 11178 46082 104110 6063 19.52% 64.83% 69.32% 68.86% 564

Table 3.4: 50-50 ratio with normalizing scaler


Table 3.5: 50-50 ratio with standardizing scaler



To better understand the data, the ratio of 75 ´ 25 was added in retrospect. Linear Regres-sion (LR), Decision Tree (DT), Random Forest (RF), Multilayer Perceptron (MLP) and LinearSupport Vector Machine (L-SVC) were considered for this evaluation.

The following tables describe the results of the performed tests, the name of the algorithmsare abbreviated for spatial reasons. Conclusions are presented below. These tests were per-formed on a computer with an 8th generation quad core Intel I7-8550U processor and 16GBDDR4 RAM operating at 1200 MHz. Therefore, the running times presented are only relativein relation to each other.

The following conclusions were made from the results of these tests.



34

3.3. Model Planning











35

3. METHOD







• The scaler of choice had no impact on the performance of decision tree based modelssuch as DT and RF.

• The scaler of choice was found to be affect the performance of the Multilayer Perceptronespecially, with MinMaxScaler being the preferred scaler.

• Logistic Regression was found to be unreliable when the negative-positive ratio wasequal or greater than 90 ´ 10.

• Linear SVC performed overall poorly and produced zero sensitivity models when thenegative-positive ratio was equal or greater than 70 ´ 30.

The most interesting metric for this problem is the precision, as this translates directly to sav-ings due to avoided failures and inspection costs due to unnecessary inspections. In additionto this, the sensitivity and the specificity metrics are observed in order to create a benchmarkagainst related studies. It can be observed in Figure 3.5 that the precision metric peaks at60 ´ 40 for DT, 75 ´ 25 for MLP and at 80 ´ 20 for RF and LR. The decision for which metricto use in the proceeding evaluations has to consider all of these metrics, since the ideal modelshould not only be trustworthy when it produces a positive label, but also robust as to not leta false positive through. Since these models are to be used with spark-ensemble, which createsa Classifier that inherits from Predictor, the models must be trained using the same dataset.Thus, only one optimal balance can be selected to be used with all the models in the stack.Because of the aforementioned reasons, it was decided to use a ratio of 70 ´ 30 percent forbuilding the models. It was decided not to use the linear SVC any further because of the poor

36

3.3. Model Planning

50 55 60 65 70 75 80 85 90Percentage of Negative Samples in Training Data

0

5

10

15

20

25

Precision

AlgorithmLRDTRFMLPL-SVC

(a) Precision, or the model’s ability to label positive samples


0

5

10

15

20

25

Precision


(b) Sensitivity, or the amount of positive samples labeled correctly


0

5

10

15

20

25

Precision


(c) Specificity, or the amount of negative samples labeled correctly

Figure 3.5: Visualization of precision, sensitivity and specificity metrics in relation to the ratiobetween positive and negative samples in training data

37

3. METHOD

Model TP FP TN FN Prec. Sens. Spec. Acc. TimeLR 555 3620 146572 16686 13.29% 3.22% 97.59% 87.87% 153DT 2097 37368 112824 12144 5.31% 14.73% 75.12% 69.89% 402RF 4145 24684 125508 13096 14.38% 24.04% 83.57% 77.44% 3855MLP 2181 13941 136251 15060 13.53% 12.65% 90.72% 82.68% 3126

Table 3.16: PCA evaluation using data with 70-30 ratio

Model TP FP TN FN Prec. Sens. Spec. Acc.LR 5123 16290 133902 12118 23.92% 29.71% 89.15% 83.03%DT 5373 16753 133439 11868 24.28% 31.16% 88.85% 82.91%RF 4640 14393 135799 12601 24.38% 26.91% 90.42% 83.88%MLP 6803 16992 133200 10438 28.59% 39.46% 88.69% 83.62%GBT 5110 15455 134737 12131 24.85% 29.64% 89.71% 83.52%

Table 3.17: Algorithm metrics for models produced by the hyperparameter evaluation

results received, however LR, DT, RF and MLP were kept to be evaluated for further use withbagging and stacking.

3.3.5 PCA Evaluation

When the balance, scaler and algorithms were selected, an evaluation on the usefulness ofPCA in this context was made. The produced models were compared with the models in theprevious section to evaluate how PCA affected metrics and run time for the given dataset andalgorithms. The time consumption together with the metrics can be seen in Table 3.16 and arecompared to the metrics in Table 3.8. This comparison showed inconsistent changes to timeconsumption, as LR showed a decrease in time, however the total time required to train thefour models increased by 42%. It also showed a detrimental effect on model performance.Through this evaluation it was decided not to use PCA in this thesis.

3.3.6 Hyperparameter Evaluation

To find settings for the models that would further improve the metrics, a parameter grid wasdeployed. All other settings were identical to the settings used during the balance evaluationprocess in order to keep the results comparable. The gradient boosted trees (GBT) algorithm,which had been excluded in the previous stages due to the amount of time required to traina model, was included at this point to determine its usefulness under the given prerequisites.The results from these evaluations are presented in Table 3.17. The time metric is excludedin this Table as it is heavily dependant on the amount of parameters tested and thereforeirrelevant. In the following text, the parameters are named as they appear in the Spark docu-mentation.

Linear Regression

The grid search concluded that regParam = 0.0 and maxIter = 3000 was optimal for LR.The regParam, or regularization parameter defines the trade-off between minimizing trainingerror and minimizing model complexity. This parameter decides how much to penalize theweights, a higher value forces the algorithm to build a simpler model. In this case, the mostcomplex model performed the best, indicating that the problem can’t be solved optimallyusing linear methods.

38

3.3. Model Planning

Model TP FP TN FN Prec. Sens. Spec. Acc. Time FoldsLR 6701 17787 132405 10540 27.36% 38.87% 88.16% 83.08% 14989 2LR 6184 16919 133273 11057 26.77% 35.87% 88.74% 83.29% 13660 5DT 5373 16753 133439 11868 24.28% 31.16% 88.85% 82.91% 278 2DT 5833 17378 132814 11388 25.13% 33.87% 88.43% 82.82% 486 5RF 4630 14192 136000 12611 24,60% 26,85% 90,55% 83,99% 3141 2RF 4707 14448 135744 12534 24,57% 27,30% 90,38% 83,88% 3141 5MLP 6800 18244 131948 10441 27,15% 39,44% 87,85% 82,87% 2864 2MLP 7115 18752 131440 10126 27,51% 41,27% 87,51% 82,75% 4364 5

Table 3.18: Cross-validation comparison using two and five folds

Decision Tree

For the DT algorithm, having MaxBins = 100 and MaxDepth = 20 produced the best model.

Random Forest

It was found that a random forest algorithm with MaxBins = 40 and MaxDepth = 30 wasthe optimal choice. Worth noticing is that MaxDepth = 30 is the largest this parameter can bein Spark. Therefore, the random forest algorithm may produce even better result in anotherframework where a deeper tree is allowed.


For the Multilayer Perceptron, having solver = l ´ b f gs, layers = [77, 2] and MaxIterations =800 gave the best result. Notice that because of time constraints only six settings for layerconfigurations, five of them containing between one and two hidden layers, however none ofthem produced a better result than the layer configuration with just one input and one outputlayer. Both increasing dimension, triangular dimension and decreasing dimension layoutswere evaluated for the layers parameter, however the simpler [77, 2] proved to produce thebest metrics. Three settings for MaxIterations (200, 500, 800) and two settings for solver (l-bfgs, gd) were tested and that there may exist better settings. The reason for not increasingMaxIterations further is that when going from 500 to 800 only a very small performanceincrease was observed in relation to the increase in computation time.

Gradient Boosted Trees

Because of time constraints, only a few settings were explored. Spark documentation statesthat “it is often reasonable to use smaller (shallower) trees with GBTs than with RandomForests”. However, even though the grid included more shallow options it found maxBins =40 and MaxDepth = 30 to be the optimal choice for the GBT algorithm, just as it did for RF.

3.3.7 Cross-validation Evaluation

Once the hyperparameters were set, the effects of using cross-validation was explored. Eachalgorithm was evaluated using 2 and 5 folds. As cross-validation is a technique to help avoidoverfitting in complex models, the theory was that MLP would benefit most from it and thatit would have little to no effect on the simpler LR algorithm. Table 3.18 describe the resultsfrom this evaluation.

39

3. METHOD

Model TP FP TN FN Prec. Sens. Spec. Acc. TimeRF 5181 15171 135021 12060 25.46% 30.05% 89.90% 83.74% 3821DT 6401 16849 133343 10840 27.53% 37.12% 88.78% 83.46% 2436

Table 3.19: Bagging model evaluation

Linear Regression

The LR algorithm was the only algorithm where the number of TP decreased as the numberof folds increased. This meant that the precision and sensitivity metrics fell while specificityand accuracy increased.

Decision Tree

For the DT, performance increased slightly with regards to precision and sensitivity and de-creased slightly with regards to specificity and accuracy as the number of folds increased.

Random Forest

Increasing the amount of folds had no effect on the performance of RF. This was expected,since RF uses a bagging technique similar to that of cross-validation when training and ag-gregating the trees and thus, no information gain would be expected from repeating thisprocess.


For the MLP, performance increased slightly with regards to precision and sensitivity anddecreased slightly with regards to specificity and accuracy as the number of folds increased.

Gradient Boosted Tree

Unsuccessful attempts to evaluate the effects of cross-validation on GBT were made. Thereason for these failures were limitations in the amount of available RAM memory, accordingto the error messages.

3.3.8 Bagging Model Evaluation

Bagging was evaluated by using the optimal hyperparameters found in previous step andapplying a bagging algorithm. This algorithm works by training a number of base modelson different subsets of the training data and then combining them using a decision tree basedmeta learner. For this evaluation, the bagged models were created using eight base models,each containing 80% of the features and 80% of the training data. The attempts to createan MLP based model were unsuccessful, due to difficulties in finding an appropriate inputlayer size when only a subset of the features were selected at random. Table 3.19 describe thereceived metrics from the bagging evaluation.

3.3.9 Stacking Model Evaluation

To further evaluate the effects of combining models with themselves and each other, a stack-ing algorithm was utilized. In most cases, the experiment showed that this added confusionto the model and that using a single algorithm model would be the preferred option. Becauseof time constraints, this search evaluation could not be a complete and exhaustive search.However, it was found that by combining two MLP, one with layers = [77, 2] and one withlayers = [77, 36, 2], a model with higher precision and sensitivity metrics compared to all

40

3.3. Model Planning

Models TP FP TN FN Prec. Sens. Spec. Acc. TimeMLP / MLP 8291 20119 130073 8950 29.18% 48.09% 86.60% 82.64% 7272MLP / RF 4866 14969 135223 12375 24.53% 28.22% 90.03% 83.67% 6954MLP / DT 5057 16670 133520 12184 23.28% 29.33% 88.90% 82.77% 5470MLP / LR 6985 19386 130806 10256 26.49% 40.51% 87.09% 82.30% 9456RF / RF 4991 14919 135273 12250 25.07% 28.95% 90.07% 83.77% 4358RF / DT 4799 14375 135817 12442 25.03% 27.83% 90.43% 83.98% 3240RF / LR 4579 14275 135917 12662 24.29% 26.56% 90.50% 83.91% 12081DT / DT 5598 15413 143779 11643 26.64% 32.47% 90.32% 84.67% 860DT / LR 5838 19352 130840 11403 23.18% 33.86% 87.12% 81.63% 16426

Table 3.20: Stacking model evaluation

other models produced in this study, was found. Table 3.20 describe the received metricsfrom the stacking evaluation.

3.3.10 Output Filtering

When observing the number of warnings that any of these produced models would send, itis reasonable to believe that sending a technician to investigate whether or not a part needsreplacement every time the model sends a warning is not what would happen in a real lifesituation. More likely, if a component has been examined and deemed to function properly,subsequent warnings would be ignored for a period of time before a technician is sent againto examine the component. It is also reasonable to believe that once a component has beenreplaced, the sensor readings would change. Therefore, a top layer function was developedwith the purpose of filtering the predictions in order to evaluate the costs or savings thatwould entail from using the selected model.

In order to evaluate how the filter should be designed and how long the intervals where fur-ther warnings from the model are ignored should be, the predictions were visualized. Thebagged DT model described in Table 3.19 as well as the stacked MLP model described inTable 3.20 were selected for this purpose. The DT and the MLP model performance is visual-ized in Figure 3.6 and Figure 3.7 respectively. In these figures, the row index that the data hasin the testing dataset is used as the x axis to avoid further cluttering the plot by having fivepredictions on the same x position, which would have been the effect if the timestamp of thedata had been used. Furthermore, the actual label is marked with orange and thus representthe output of the optimal predictor, the unfiltered predictions are marked with blue and thepredictions that have gone through a low pass filter function are marked with green. This lowpass filter is implemented using a window function that requires that 140 out of the latest 144predictions are positive in order to send a warning and will be referred to as the window func-tion for this chapter. It was developed in order to explore model decisiveness, in an attemptto reduce the amount of false positives. The labels in the scatter plots in Figure 3.6a, 3.6b,3.7a and 3.7b have been displaced to allow for visible distinction. A low y axis value under0.10 should be read as a negative prediction and a high value above 0.10 should be read as apositive prediction. Because of the large amount of predictions, it may look as if the model ispredicting two labels at the same time. This should be interpreted as indecisiveness, as thisresult is produced when the model is going back and forth between negative and positivepredictions in close time intervals. To compliment and clarify the scatter plots, line plots thatdescribe the total amount of produced positive predictions on the y axis are available belowthe scatter plots in Figure 3.6c, 3.6d, 3.7c and 3.7d.

Without post-processing, both models overestimate the amount of warnings and have trou-bles detecting the first appearing failure. The window function works well on the testing data

41

3. METHOD

(a) Training dataset individual predictions

(b) Testing dataset individual predictions

0 10000 20000 30000 40000 50000Sensor data row index

0250050007500

1000012500150001750020000

Warning

cou

nt

raw_prediction_countcorrect_label_countwindow_count

(c) Training dataset amount of warnings

0 20000 40000 60000 80000 100000 120000 140000 160000Sensor data row index

0

5000

10000

15000

20000

Warning

cou

nt


(d) Testing dataset amount of warnings

Figure 3.6: Visualization of decision tree based bagging algorithm performance

when used on the MLP model, but very poorly on the DT, indicating that the MLP is moredecisive when predicting an error on data that it has not seen before.

3.3.11 Cost Evaluation

The algorithms described in section 3.3.1 were used to extract the inspection costs, repaircosts and replacement costs associated with false positives, true positives and false negatives.What should be noted here is that a false negative in the sense that it entails a cost, is not a

42

3.3. Model Planning

(a) Training dataset individual predictions

(b) Testing dataset individual predictions

0 10000 20000 30000 40000 50000Sensor data row index

0250050007500

1000012500150001750020000

Warning

cou

nt


(c) Training dataset amount of warnings

0 20000 40000 60000 80000 100000 120000 140000 160000Sensor data row index

0

5000

10000

15000

20000

25000

Warning

cou

nt


(d) Testing dataset amount of warnings

Figure 3.7: Visualization of multilayer perceptron based stacking algorithm performance

43

3. METHOD

Prediction function TP FP FN Savings (€)Ignore + Unfiltered 2 23 0 -4157Ignore + Window 1 6 1 -64712

(a) Savings evaluation for the gearbox component using the MLP modelPrediction Function TP FP FN Savings (€)Ignore + Unfiltered 2 28 0 -66657Ignore + Window 0 2 2 -210000

(b) Savings evaluation for the gearbox component using the DT model

Table 3.21: Savings evaluations for the gearbox component

false negative as described in the metrics tables, but a scenario where no warning have beensent whatsoever during the 60 days before failure.

The duration where the models should be ignored after a warning were explored and set to30 days after a false positive and to the remaining period of the sixty days plus 7 days after atrue positive. The first number, 30 days, is based around the fact that the positive label is 60days before failure. As such, even though a false positive occurs close to a failure, more thanhalf of the 60 days interval remains for the model to act on. The second number is based onthe fact that the model should not send any more warnings in the interval where a componenthas been replaced and the assumption that the new component should function well duringits first 7 days of service, but may require a running-in period where abnormal sensor valuesare being sent and where the prediction model therefore should be inactive.

The savings implications for applying these model on the test dataset can be observed inTable 3.21. A positive savings is an actual saving under the rules described in section 3.3.1, anegative number means that there is a cost. These numbers will be discussed and comparedto the cost implications of applying preventive or reactive maintenance in Chapter 5. It isclear when observing these tables that the MLP model is preferred, both when looking atthe unfiltered predictions and when applying the window function. Due to the fact that thewindow function leads to a gearbox failure being missed, the cost metric for this functionis heavily negative compared to the unfiltered predictions. On the other hand, it can beobserved in Figure 3.7 that the model is unconfident when predicting this particular failureand it can be argued that finding it at all is more by chance than by design. However, thedecision was made that avoiding false negatives must be prioritized due to the heavy costimplication of not reporting a failure and the unfiltered function was therefore chosen overthe window function.

3.4 Model Building

When a clear idea about how to build and evaluate the prediction models had been estab-lished, prediction models for the four remaining components were built with the aim of cre-ating a complete prediction system for all the monitored components. Due to the metricsevaluation performed in the model planning stage, the stacked MLP model was chosen asthe algorithm as it displayed the best metrics and proved to generalize better than the DTbased model. The metrics and savings implications from using these models are presentedin the results chapter. The scalers used to scale the data well as the produced models weresaved to file to enable other Spark compatible programs to use them by simply loading thefiles.

44

3.5. Final System

3.5 Final System

This section will present how the produced programs were used to produce and analyse pre-dictions. First, the system that was built for the purpose of writing the predictions to fileand analysing the performance of the models will be presented. Following this is a presenta-tion of a Kafka based system that was made as a proof of concept that it is possible to makepredictions in close to real time on streamed sensor data.

3.5.1 Performance Analysis System

The system used for analysing model performance was implemented in Scala. First, thedataset containing the measurements that that the models should make predictions on isloaded from a csv file. Then, the five prediction models that were previously built are loadedand used to make predictions. The predictions are recorded and written to a csv file in thesame order as the data was read from the input file. Since the system uses five models, one foreach component, one corresponding csv file with predictions is created for each component.Because of how the correct label is known in the testing data, the produced csv files can beanalysed to see model performance by comparing the predicted label and the correct label.The program used to analyse and visualize the performance of the models was implementedin Jupyter Notebook using Python and the matplotlib library.

3.5.2 Kafka Based System

A solution was built using the Kafka framework with the aim of providing streamed predic-tions in close to real time. The implementation was done using Python. Figure 3.8 illustratesthe flow that the data takes through this system. It uses the same scaler that is created whenprocessing the data before training the models in order to scale individual sensor readingsin a correct way. The sensor measurements are read from a specified csv file one at the timeand put onto the system using a KafkaProducer that creates events on a topic turbine-stream.The measurements are read by several KafkaConsumers, one for each predictor, that are re-sponsible for fetching messages that they have not yet processed from this stream. Once theKafkaConsumers have received a message, the sensor measurements are scaled and put intothe prediction model. If a warning is produced by the model, a message is created contain-ing the turbine id, component and timestamp and put on the topic turbine-warning. AnotherKafkaConsumer is used to read the turbine-warning topic and display alerts to the end user.This application is terminal based, but could be integrated with any software that supportsreading from Kafka topics.

45

3. METHOD

Allow consumer to readmessages asynchronously

Gearboxpredictor

Generatorpredictor

Generatorbearing

predictor

Transformerpredictor

Hydraulicsystem

predictor

Read from file and create postson Kafka topic "turbine-stream"

KafkaProducer

turbine-stream

Allow warnings to be readasynchronously

turbine-warning

Display warnings to end user

KafkaConsumer

Figure 3.8: System that creates a warning Kafka topic using the prediction models

46

4 Results

This chapter summarizes the results that were observed during each step of the process asdescribed in Chapter 3 and presents the results observed from running the complete predic-tion system on the full data set with all components. Overall, the results confirm the benefitsof using predictive maintenance over scheduled or reactive maintenance as well as indicat-ing that metrics can be improved by applying bagging and stacking algorithms to a machinelearning use case.

4.1 Data Discovery

Finding appropriate SCADA data for such a study proved to be difficult. Over a dozen com-panies were contacted with the proposal of sharing data with this study. Many showed inter-est, but due to service agreements or perceived difficulties with anonymizing and transferringsuch big amounts of data they were unable to initiate a cooperation. The data sets found inpublic data set libraries were most of the time generated with the exception of the data setfound on EDP Open Data, which contains actual measurements from existing wind turbines.

4.2 Data Preparation

Compared to data sets described in related research, this data set required a relatively smallamount of work with regards to cleaning, as the amount of outliers, null values and lowvariance features was low. In order to understand the implications of using a certain scalerover another, the data was visualized. However, Spark requires integration with Matplotlib1

or a similar library such as D32 in order to be able to create visualizations of data. Therefore,it was easier to do this process using Pandas and Matplotlib directly.

1https://matplotlib.org/2https://d3js.org/

47

https://matplotlib.org/

https://d3js.org/

4. RESULTS

4.3 Model Planning

The model planning step made use of the training and validation data set related to the gear-box model. The model planning section provided the material needed to make conclusionsabout how to balance the data set and how this affected the metrics of the models. In addi-tion to this, an evaluation about which scaler and which model parameters to use with thedata set was made. This section also explored the effects of bagging and stacking machinelearning models with regards to the metrics by training several models.

4.3.1 Balance, Scaler and Hyperparameter Evaluation

For this data set, multiple binary classifiers were found to perform better than a single multi-nomial classifier. After multiple tests a ratio of 70 ´ 30 between negative and positive sampleswere proven to produce the best models with regards to the metrics precision, sensitivity,specificity and accuracy. The decision to use 70 ´ 30 as the balance ratio was a trade-off thatprioritized precision given how this metric has a more direct effect on the maintenance costs.The MinMaxScaler were found to either have no effect, or create overall more robust classifierscompared to the StandardScaler with regards to the aforementioned metrics. It was thereforeselected as the scaler of choice. PCA was found to have a detrimental effect on model perfor-mance and the decision was made to not use it any further. The hyperparameter grid searchwas used with the purpose of finding hyperparameters that further improved the metricsand found to improve metrics slightly. Using cross validation and increasing the amount offolds from 2 to 5 improved the metrics even further in all cases but for linear regression. Crossvalidation affected the sensitivity metric especially.

4.3.2 Bagging Algorithm

The bagging algorithm was unable to work well with the MLP algorithm because of how itselects a subset of the features, which is not able to work with the MLP, as the input layer thatthe model is trained on does not correspond to how the input of the testing data is shaped. Ithad a slight positive effect on the RF model, as well as a more visible positive effect on the DTmodel. It did not have a negative impact on any model, indicating that bagging is a robustway of increasing performance in decision tree based models. The produced DT model hadthe second best metrics of the evaluation series and was selected for prediction visualization.

4.3.3 Stacking Algorithm

The stacking algorithm was evaluated with combinations of the best performing models withthemselves and with each other. It was found that when stacking two MLP models withdifferent layer configurations, the best performing model of the entire evaluation series wasfound. This model was therefore selected with the DT model for prediction visualization.Some of the combined models performed worse when compared to their corresponding basemodels and only a few performed significantly better, indicating that stacking is a processthat may boost model performance but that it should be used with caution.

4.3.4 Output Filtering

When visualizing the output from the models it became clear that the MLP model generalizedbetter and provided better cost metrics compared to the DT model. It also became clear thatwhile the implemented window function avoided many false positives, it also lead to thefact that the predictor missed one of two failures in the testing data set and the use of it wastherefore discarded in favor of using the unfiltered predictions.

48

4.4. Model Building

Component TP FP TN FN Prec. Sens. Spec. Acc. TimeGearbox 16749 2610 36605 301 86.52% 98.23% 93.34% 94.83% 7272Generator 30053 3868 67388 928 88.60% 97.00% 94.57% 95.31% 11341Gen. Bearing 36680 5082 83939 2025 87.83% 94.77% 94.29% 94.44% 12626Transformer 14823 1347 33017 118 91.67% 99.21% 96.08% 97.03% 9209Hyd. Group 16507 2319 36981 580 87.68% 96.61% 94.10% 94.86% 10097All 114812 15226 257930 3952 88.29% 96.67% 94.43% 95.11% 50545

Table 4.1: Metrics for training data

Component TP FP TN FN Prec. Sens. Spec. Acc. TimeGearbox 8291 20119 130073 8950 29.18% 48.09% 86.60% 82.64% 7272Generator 0 20405 121474 8484 0.00% 0.00% 85.62% 80.79% 11341Gen. Bearing 4904 21294 116924 7070 18.72% 40.96% 84.59% 81.11% 12626Transformer 479 10766 139426 7927 4.26% 5.70% 92.83% 88.21% 9209Hyd. Group 9844 8627 111410 41737 53.29% 19.08% 92.81% 70.65% 10097All 23518 81211 619307 74168 22.46% 24.08% 88.41% 80.53% 50545

Table 4.2: Metrics for testing data

4.4 Model Building

The model building step included building and testing models using the full data set withall five components to validate that it is indeed possible to use the methods derived from themodel planning step. Three models performed worse when compared to the gearbox modelwith regards to the savings estimation metric, with the exception of the hydraulic groupmodel that produced a savings positive model.

4.4.1 Model Metrics

The metrics obtained when running the complete prediction system can be observed in Table4.1 and Table 4.2. These tables describe the metrics from running the produced models onthe training data and the testing data in order to see how well the models generalize. Itis clear that some models performed better than others but that all models have problemsgeneralizing to the testing dataset. The predictor for the generator component performedabnormally bad and was not able to capture a single failure, even though many false positiveswere reported. The predictor for the transformer component is another model that performedpoorly, having a precision of 4.26%. When comparing these observed metrics to the metrics inpresented regarding related work in Section 2.13 it is clear that these models perform worse.Possible reasons for this will be discussed in Section 5.

4.4.2 Cost Evaluation

Table 4.3 describe the estimated savings per component using the stacked MLP models. Inorder to verify the earlier made decision that prioritizing a high number of TP over a lownumber of FP, the corresponding numbers for the window function are described in Table4.4. The unfiltered predictions are still more cost efficient, however none of the alternativespresent a positive savings figure using the rules mentioned in Section 3.1.1. The number offalse positives severely impact the savings metric in a negative way, as do the undetectedfailure in the generator component since this is a particularly expensive component.

49

4. RESULTS

Component Savings (€) TP FP FNGearbox -4157 2 23 0Generator -170000 0 22 1Generator Bearing -82778 2 23 0Transformer -59535 1 20 1Hydraulic Group 21342 6 24 0Sum -295128 11 112 2

Table 4.3: Cost estimation for all components using unfiltered predictions

Component Savings (€) TP FP FNGearbox -64712 1 6 1Generator -85000 0 5 1Generator Bearing -10516 2 6 0Transformer -103000 0 2 2Hydraulic Group -85386 1 1 5Sum -348614 4 20 9

Table 4.4: Cost estimation for all components using window filtered predictions

4.4.3 Failure Detection Analysis

In order to explore the theory made in chapter 3.1.1 regarding the expected failure detectionoutcome, the model predictions were saved to file and analysed. The results presented inTable 4.5 are marked such that red cells are failures in the testing data that the models failedto predict, yellow cells represent training data that was expected to be of no use and testingdata that was labelled correctly and green cells represent training data that was expected tobe useful and failure data that the models were expected to label correctly according to thetheory presented in Subsection 3.1.1 regarding model generalization across turbines. Fivefailure marked with yellow are predicted, and thus convincing evidence that the models areable to generalize across turbines using the given training data is presented. It can be seen thatthe models label all but two failures in components where no training data for that componentin the corresponding turbine is present.

4.5 Kafka Based Prediction System

The built system serves as a proof of concept that it is possible to build a complete predictionsystem that handles data processing, model training and acts as a prediction service that candisplay warnings to a user through a terminal based program using the software included inthe Hadoop ecosystem. Figure 4.1 and Table 4.6 illustrates the output from the producer andthe consumer programs that were developed to make predictions based on a Kafka stream.

50

4.5. Kafka Based Prediction System


(a) Turbine 1 failures with respect to trainingand testing


(b) Turbine 2 failures with respect to trainingand testing


(c) Turbine 3 failures with respect to trainingand testing


(d) Turbine 4 failures with respect to trainingand testing


(e) Turbine 5 failures with respect to trainingand testing

Table 4.5: Outcome of model predictions of turbine failures with respect to training and test-ing data

1 2020-03-01 18:33:19.048927. Sending: T01, 2016-01-17T21:50:00.000+01:00, 1404.2,1279.0, 1344.1, 28.0, 36, 58, 57, 57, 27, 48, 53, 25, 12.5, 11.3, 11.9, 15.4,0.9, 6.1, 0.9, 8.1, 64.0, 14, 0, 62987, 0, 62987, 0, -15959, 0, -15959, 50, 66,65, 35, 36, 23, 39, 22, 15, -2.2, -1.3, -1.9, 0.2, 95, 34, 34, 34, 33, 377.5,1.0, 50.0, 399.1, 397.6, 393.8, 301.4, 340.1, 333.2, 501.9, 300.5, 36, 0.2,6.1, 42.5, -95.7, -90.3, -106.2, 3.2, 376.9, 505.7, 294.1, 42.3, -1000.0,-1000.0, -1000.0, 0.0, 1000.0, 1000.0, 1000.0, 0.0, 36, 55.9, T01, GEARBOX,2016-07-18T04:10:00.000+02:00, Gearbox pump damaged, 15744000, 0

2 2020-03-01 18:33:19.149597. Sending: T01, 2016-01-17T21:50:00.000+01:00, 1404.2,1279.0, 1344.1, 28.0, 36, 58, 57, 57, 27, 48, 53, 25, 12.5, 11.3, 11.9, 15.4,0.9, 6.1, 0.9, 8.1, 64.0, 14, 0, 62987, 0, 62987, 0, -15959, 0, -15959, 50, 66,65, 35, 36, 23, 39, 22, 15, -2.2, -1.3, -1.9, 0.2, 95, 34, 34, 34, 33, 377.5,1.0, 50.0, 399.1, 397.6, 393.8, 301.4, 340.1, 333.2, 501.9, 300.5, 36, 0.2,6.1, 42.5, -95.7, -90.3, -106.2, 3.2, 376.9, 505.7, 294.1, 42.3, -1000.0,-1000.0, -1000.0, 0.0, 1000.0, 1000.0, 1000.0, 0.0, 36, 55.9, T01, TRANSFORMER,2017-08-11T15:14:00.000+02:00, Transformer fan damaged, 49393440, 0

Figure 4.1: Output from Kafka producer that creates a stream of turbine sensor measurements

51

4. RESULTS

Timestamp ID prediction rawPrediction probability label Component2016-01-17 21:50:00 T01 0.0 11.78031472429 0.589015736214 0.0 GEARBOX2016-01-17 22:00:00 T01 0.0 11.78031472429 0.589015736214 0.0 GEARBOX2016-01-17 22:10:00 T01 0.0 12.78031472429 0.639015736214 0.0 GEARBOX2016-01-17 22:20:00 T01 0.0 11.78031472429 0.589015736214 0.0 GEARBOX2016-01-17 22:30:00 T01 0.0 11.78031472429 0.589015736214 0.0 GEARBOX2016-01-17 22:40:00 T01 0.0 12.77526421924 0.638763210962 0.0 GEARBOX2016-01-17 22:50:00 T01 0.0 13.77890229491 0.688945114745 0.0 GEARBOX2016-01-17 23:00:00 T01 0.0 13.77890229491 0.688945114745 0.0 GEARBOX2016-01-17 23:10:00 T01 0.0 11.78031472429 0.589015736214 0.0 GEARBOX2016-01-17 23:20:00 T01 0.0 13.78031472429 0.689015736214 0.0 GEARBOX

Table 4.6: Output from Kafka consumer that displays gearbox predictions

52

5 Discussion

The discussion is divided into three parts. First, the result presented in Chapter 4 is discussedin Section 5.1. The second part found in Section 5.2 discusses the methodology with focuson strengths and weaknesses in the different solutions that were evaluated. The last part,Chapter 5.3 discusses the work in a wider context.

5.1 Results

This Section will discuss the acquired results from the implementation stage to the receivedsavings figures.

5.1.1 Implementation

Hadoop with Spark was not very time consuming to set up and get acquainted with. It alsoproved to be as scalable as promised, in the sense that the code that had been written couldbe executed both on a local machine such as the laptop used for most of this thesis, as well ason a cluster. An AWS EMR cluster1 was set up and used to verify this in the early stages ofthe project. Worth noting is that preparing the data and training model with Spark proved tobe relatively expensive and thus, the cluster was not used for other purposes than verifyingthe scalability of Spark.

Due to the lack of some functionality such as time series classification, or non-linear supportvector machines, there may be better options for doing predictive maintenance such as Keras2 or H2O ai3. The team behind H2O also maintains a project called sparkling water whichintegrates Spark with H2O and may be worth researching, as the H2O algorithm supportstime series forecasting according to the documentation4.

1https://aws.amazon.com/emr/2https://keras.io/3https://www.h2o.ai/4http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/time-series.html

53

https://aws.amazon.com/emr/

https://keras.io/

https://www.h2o.ai/

http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/time-series.html

5. DISCUSSION

5.1.2 Performance

The received metrics are not as good as the metrics in the two related works that are presentedin Section 2.13. These metrics are not directly comparable however, as the produced modelspredict failures up to 60 days in advance, while the models in the study by Kuziak et al. [37]are trained to predict failures five hours in advance and the models in the study by Canizo etal. [36] one hour in advance.

Model Precision in Relation to Dataset

When observing the failures that the window function predicts in Table 4.4, they correspondwell to the failures marked with green in Table 4.5. When decreasing the required number ofpositive predictions from 140 out of 144 to 120 out of 144, the previous results were confirmedbut with one additional hydraulic group failure in turbine 2 being reported. It is clear thata model is more decisive when predicting a failure in a component where a previous failurein the same component and turbine was available in the training data. This indicates thateven though all turbines in the data set that were used were manufactured equal there maybe differences in how the errors present themselves in the data that are dependant on factorsthat are not component , but rather turbine specific. It may also be that not all features usedto train the models in this study are relevant and that this caused confusion. Once again,more domain knowledge in the field of wind turbines are required to begin to answer thesequestions.

Bagging and Stacking

The process of bagging and stacking was successful with both methods achieving improvedmetrics compared to the other methods evaluated in this thesis. No related work was foundwhich implemented bagging and stacking methods on SCADA data and therefore it is notpossible to compare the results to any directly related work. While the metrics increasedonly slightly, the run time increased by double or more. This was expected since the amountof different models that have to be trained increased as well. The implementation of spark-ensemble does not support loading already trained models and creating a meta learner usingtheir predictions, but instead requires all models to be trained at the same time using thesame data set. This slowed down the project to a point where a bagging evaluation of linearregression was left out due to an expected training time of 33 hours and to give priority morepromising models.

Savings

In order to discuss the presented savings figures, a baseline needs to be established. Thesavings were therefore compared to making preventive maintenance through scheduled in-spections, as well as purely reactive maintenance. Given the rules for the data set at handdescribed in 3.3.1, the maximum interval between inspections is 60 days provided that cap-ture of all failures is desired. When summarizing the inspection costs and savings gives afigure of ´427, 500 € assuming that the failures are detected on average 30 days in advance.The corresponding number for reactive maintenance would be ´540, 000 €. In order to ex-plore the impact of the top function on the savings figure, the minimum interval betweenmodel warnings after a false positive were changed from 30 days to 60 days. Using this set-ting, the same amount of true positives were detected by the models while the savings metricimproved to ´169, 756 € from the previous ´295, 128 € in Table 4.3. Some failures were de-tected very close to the time of failure and therefore, this is not a recommended configuration.It is also poor practise to optimize methods using the testing data and this figure can not bepresented as a result, nor is this improvement significant enough to justify further researcharound optimizing this time interval. It is however clear that the produced models can be

54

5.2. Method

used to provide predictive maintenance that has an advantage over scheduled maintenanceand reactive maintenance with regards to cost.

5.2 Method

This section will discuss and criticize the method that was used to perform the evaluationsand obtain the results. First, the implementation implications of using Hadoop and Spark arediscussed. Secondly, the design of the evaluations is covered.

5.2.1 Implication of Using Spark

The implementation was an overall success, being able to evaluate how to use Spark for pre-dictive maintenance as well as the effects of using bagging and stacking on machine learningalgorithms. However, a few problems arised during the implementation that will be dis-cussed here.

Data Discovery

Due to the lack of inherent visualization aid in Spark, such as the ability to plot data in scatterand line plots, this step was implemented in Python with Jupyter, Pandas and Matplotlib.Using Jupyter for this purpose was a good decision in hindsight, as the computations can belimited to sections and all results are being kept in memory. This is important when experi-menting with larger data sets where a relatively simple computation can take a few minutesas it provides an environment where small changes can be made without having to wait foridentical computations to finish.

Preprocessing

While Spark was fast to install and get started with, while computing some algorithms suchas gradient boosted trees, or writing large DataFrames to file it became noticeable that Sparkis developed for distributed computing. Several crashes occurred during computations dueto OutOfMemoryException and quite some time was spent on tuning Spark configuration pa-rameters. After a while it became clear that executor.memory and driver.memory are importantparameters that control the amount of memory per executor process and the amount of allo-cated memory for a Spark driver respectively. The executor.memory is 512MB by default andwas increased to 4GB, while the driver.memory is 1GB by default and was increased to 8GBin order to be able to collect and save some of the larger DataFrames to file. These numberswere gathered by experimentation and reading forums such as StackOverflow and articles onsites such as towardsdatascience, where Spark has a strong community and many discussionthreads and articles cover Spark and what can be done with it. Without this community, theimplementation details and figuring out error messages would have taken much longer.

Spark also imposed limitations in this step, as the scalers of choice were limited to MinMaxS-caler and the StandardScaler. As mentioned in Section 3.2.2, it would have been interesting toexplore the effects of scalers that handle outliers differently, such as the PowerTransformer-Yeo-Johnson.

Algorithm Limitations

Spark limited the amount of algorithms that could be evaluated, especially in terms of neuralnetworks. Another implication of using Spark for the implementation of the models is thatneither MLP nor decision trees support a weighted loss function and therefore, instead ofevaluating possibilities with using different loss functions and configurations of weight func-

55

5. DISCUSSION

tions, the balancing evaluation described in Section 3.3.3 using undersampling was carriedout.

5.3 The work in a wider context

Predictive maintenance continues to be an important method for improving efficiency in allkinds of environments where machines that wear down over time are involved. The possi-bilities of manufacturing and placing cheap, connected sensors will continue to increase withthe rise of IoT [13]. And as the amount of data increases with the amount of sensors, so willthe possibilities of applying machine learning algorithms to perform predictive maintenance.While other methods, currently not available in mllib, that are able to look at a longer timeseries may prove better at predicting wear and tear, the scalability of Spark and the Hadoopecosystem remains interesting.

56

6 Conclusion

This thesis implemented a prediction system using Spark and other parts of the Apacheecosystem.

6.0.1 Can a system for predictive maintenance in wind turbines be implementedusing Apache Spark?

As mentioned in the theory chapter, Spark is proven to be fast and highly scalable whencompared to other solutions. This is an important factor, as predictive maintenance solutionsoften deal with large quantities of data and need to process this data under time constraints.This system was implemented in Scala, which is the native language for Spark. However,Spark also provides APIs in Java and Python. PySpark was evaluated, but considering thatone of the goals was to export a jar file which can be submitted to an AWS EMR cluster,Scala was chosen. More specifically, IntelliJ Idea with SBT as the build tool. The algorithmsinherent to Spark mllib were used to implement the models. This is a growing collection ofclassification and regression algorithms, which still lacks support for some of the algorithmsthat would have been interesting to consider in this thesis, such as a feedforward neuralnetwork or a long short-term memory neural network. Overall, the produced system provesthat Spark is worth considering when building a prediction system, but that it may requireintegration with other services to access the optimal algorithms.

6.0.2 Will a model created by applying bagging or stacking algorithms onseveral base models perform better than the base models?

When compared to the base models, the stacked algorithms rarely produced unanimouslybetter results than the best of the two base models, with regards to the metrics used. However,stacking multiple neural networks had this effect. The final stacked model produced moretrue positive and true negative, while at the same time reducing the amount of false negatives.

6.0.3 Which algorithms available in Spark are eligible for stacking and bagging?

As bagging reduces the training data to a subset of the features as well as a subset of thesamples when training a base model, neural networks can not be used with this algorithm.

57

6. CONCLUSION

While only the decision tree and the random forest algorithm was evaluated using bagging,all other algorithms in the mllib library could be implemented using a bagging algorithm.

Stacking trains the base models on the full data set and utilizes the fact that heterogeneousmodels may recognize different underlying structures in the data. No limitations were foundwith regards to which algorithms that may be used as a base model with stacking.

6.0.4 How does the final solution compare to the current state of the artpredictive maintenance for wind turbines with regards to implementation,run time, usability and accuracy metrics?

For clarity, the conclusion regarding this research question have been split up into four sec-tions.

Implementation

Spark was used in the related study by Canizo et al. [36] and got similar results to what theydescribed as the current state of the art. In that study, they used a random forest algorithm,which was evaluated in this thesis as well. With regards to implementation, this solutiontherefore follows what could be considered the current state of the art. However, the fieldis still very much experimental and no consensus regarding the current state of the art withregards to implementation exists.

Run time

No comparative number was found when looking at the run time. Once the models aretrained, this solution produces two years worth of predictions in 150 seconds when executedon the personal computer mentioned in section 3.3.3. That corresponds to 0.00029 seconds perprediction, which in the given context of a failure process of 60 days, produces a predictionwithin a sufficient time frame.

Usability

The usability of a predictive system implemented in Spark such as this one would be consid-ered high, given the fact that other software in the Hadoop ecosystem such as Kafka can loadand use trained Spark models to combine the functionality of that software with predictiveability.

Accuracy

While the precision and sensitivity metrics were visibly lower when compared to relatedwork, the given time frame in which the models make predictions is very different and itmay not be fair to compare the number out of the box. The average model had a precisionof 22.46%, a sensitivity of 24.08%, a specificity of 92.81% and an accuracy of 80.53%. Worthnoting is that the metrics are heavily punished by the fact the two of the models failed togeneralize to the testing data.

6.1 Future Work

There is room for improvement and a different approach that can produce better performingmodels surely exists. The work was limited to working with the models in the mllib librarydue to the research being focused on the effects of bagging and stacking, however it would bevery interesting to see the effects of implementing a solution such as the one described in therelated work by Olgun Aydin and Seren Guldamlasioglu [33] applied in the context of this

58

6.1. Future Work

thesis. It would be interesting to evaluate the effects of stacking multiple neural networkssuch as the LSTM, given the results of this study.

59

6. CONCLUSION

60

Bibliography

[1] W. W. E. Association. [Online]. Available: https://wwindea.org/blog/2019/02/25/wind-power-capacity-worldwide-reaches-600-gw-539-gw-added-in-2018/.

[2] C. A. Walford, “Wind turbine reliability: Understanding and minimizing wind turbineoperation and maintenance costs.”, 2006.

[3] I. El-Thalji and J. Liyanage, “On the operation and maintenance practices of windpower asset: A status review and observations”, Journal of Quality in Maintenance Engi-neering, vol. 18, pp. 232–266, Aug. 2012. DOI: 10.1108/13552511211265785.

[4] G. Sullivan, R. Pugh, A. P. Melendez, and W. D. Hunt, “Operations & maintenancebest practices - a guide to achieving operational efficiency (release 3)”, Aug. 2010. DOI:10.2172/1034595.

[5] H. Hashemian, “State-of-the-art predictive maintenance techniques”, Instrumentationand Measurement, IEEE Transactions on, vol. 60, pp. 226–236, Feb. 2011. DOI: 10.1109/TIM.2010.2047662.

[6] R. Crossley and P. J. Schubel, “Wind turbine blade design”, Energies, vol. 5, Sep. 2012.DOI: 10.3390/en5093425.

[7] A. Kusiak and W. Li, “The prediction and diagnosis of wind turbine faults”, RenewableEnergy, vol. 36, no. 1, pp. 16–23, 2011.

[8] K. Kim, G. Parthasarathy, Ö. Uluyol, W. Foslien, S. Sheng, and P. Fleming, “Use of scadadata for failure detection in wind turbines”, ASME 2011 5th International Conference onEnergy Sustainability, ES 2011, Jan. 2011. DOI: 10.1115/ES2011-54243.

[9] Y. Ebata, H. Hayashi, Y. Hasegawa, S. Komatsu, and K. Suzuki, “Development of theintranet-based scada (supervisory control and data acquisition system) for power sys-tem”, in 2000 IEEE Power Engineering Society Winter Meeting. Conference Proceedings(Cat. No.00CH37077), vol. 3, Jan. 2000, 1656–1661 vol.3. DOI: 10.1109/PESW.2000.847593.

[10] K. Barnes, B. Johnson, and R. Nickelson, “Review of supervisory control and dataacquisition (scada) systems”, Idaho National Engineering and Environmental Laboratory,2004.

61

https://wwindea.org/blog/2019/02/25/wind-power-capacity-worldwide-reaches-600-gw-539-gw-added-in-2018/



https://doi.org/10.1108/13552511211265785

https://doi.org/10.2172/1034595

https://doi.org/10.1109/TIM.2010.2047662

https://doi.org/10.1109/TIM.2010.2047662

https://doi.org/10.3390/en5093425

https://doi.org/10.1115/ES2011-54243

https://doi.org/10.1109/PESW.2000.847593

https://doi.org/10.1109/PESW.2000.847593

BIBLIOGRAPHY

[11] P. Church, H. Mueller, C. Ryan, S. V. Gogouvitis, A. Goscinski, H. Haitof, and Z. Tari,“Scada systems in the cloud”, in Handbook of Big Data Technologies, A. Y. Zomaya andS. Sakr, Eds. Cham: Springer International Publishing, 2017, pp. 691–718, ISBN: 978-3-319-49340-4. DOI: 10.1007/978-3-319-49340-4_20. [Online]. Available: https://doi.org/10.1007/978-3-319-49340-4_20.

[12] A. Daneels and W. Salter, “What is scada?”, 1999.

[13] M. Vozábal, “Tools and methods for big data analysis”, 2016.

[14] K. Krishnan, “Chapter 1 - introduction to big data”, in Data Warehousing in the Age ofBig Data, ser. MK Series on Business Intelligence, K. Krishnan, Ed., Boston: MorganKaufmann, 2013, pp. 3–14, ISBN: 978-0-12-405891-0. DOI: https://doi.org/10.1016/B978- 0- 12- 405891- 0.00001- 5. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780124058910000015.

[15] A. Fernández, S. del Río, V. López, A. Bawakid, M. J. del Jesus, J. M. Benítez, and F.Herrera, “Big data with cloud computing: An insight on the computing environment,mapreduce, and programming frameworks”, Wiley Interdisciplinary Reviews: Data Min-ing and Knowledge Discovery, vol. 4, no. 5, pp. 380–409, 2014. DOI: 10.1002/widm.1134. eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/widm.1134. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1134.

[16] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakr-ishnan, and C. Shahabi, “Big data and its technical challenges”, Commun. ACM, vol. 57,no. 7, pp. 86–94, Jul. 2014, ISSN: 0001-0782. DOI: 10.1145/2611567. [Online]. Avail-able: http://doi.acm.org/10.1145/2611567.

[17] F. Provost and T. Fawcett, “Data science and its relationship to big data and data-drivendecision making”, Big Data, vol. 1, Mar. 2013. DOI: 10.1089/big.2013.1508.

[18] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, “A survey of open sourcetools for machine learning with big data in the hadoop ecosystem”, Journal of Big Data,vol. 2, pp. 1–36, 2015.

[19] M. Khan, “Big data analytics evaluation”, International Journal of Engineering Research inComputer Science and Engineering (IJERCSE), vol. 5, pp. 2394–2320, Feb. 2018.

[20] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, “A comparison on scalabil-ity for batch big data processing on apache spark and apache flink”, Big Data Analytics,vol. 2, no. 1, p. 1, Mar. 2017, ISSN: 2058-6345. DOI: 10.1186/s41044-016-0020-2.[Online]. Available: https://doi.org/10.1186/s41044-016-0020-2.

[21] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M.Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar,“Mllib: Machine learning in apache spark”, Journal of Machine Learning Research, vol. 17,no. 34, pp. 1–7, 2016. [Online]. Available: http://jmlr.org/papers/v17/15-237.html.

[22] D. Infotech, “Achieving fault tolerance with kafka: A detailed explanation”, 2018. [On-line]. Available: https://medium.com/@debutinfotech/achieving-fault-tolerance-with-kafka-a-detailed-explanation-a9828929d00d.

[23] A. Burkov, The Hundred-page Machine Learning Book. Andriy Burkov, 2019, ISBN:9781999579517. [Online]. Available: https://books.google.se/books?id=0jbxwQEACAAJ.

[24] D. M. J. Garbade, “Regression versus classification machine learning: What’s the dif-ference?”, 2018. [Online]. Available: https : / / medium . com / quick - code /regression- versus- classification- machine- learning- whats- the-difference-345c56dd15f7.

62

https://doi.org/10.1007/978-3-319-49340-4_20

https://doi.org/10.1007/978-3-319-49340-4_20

https://doi.org/10.1007/978-3-319-49340-4_20

https://doi.org/https://doi.org/10.1016/B978-0-12-405891-0.00001-5

https://doi.org/https://doi.org/10.1016/B978-0-12-405891-0.00001-5

http://www.sciencedirect.com/science/article/pii/B9780124058910000015

http://www.sciencedirect.com/science/article/pii/B9780124058910000015

https://doi.org/10.1002/widm.1134

https://doi.org/10.1002/widm.1134

https://onlinelibrary.wiley.com/doi/pdf/10.1002/widm.1134

https://onlinelibrary.wiley.com/doi/pdf/10.1002/widm.1134

https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1134

https://onlinelibrary.wiley.com/doi/abs/10.1002/widm.1134

https://doi.org/10.1145/2611567

http://doi.acm.org/10.1145/2611567

https://doi.org/10.1089/big.2013.1508

https://doi.org/10.1186/s41044-016-0020-2

https://doi.org/10.1186/s41044-016-0020-2

http://jmlr.org/papers/v17/15-237.html

http://jmlr.org/papers/v17/15-237.html

https://medium.com/@debutinfotech/achieving-fault-tolerance-with-kafka-a-detailed-explanation-a9828929d00d

https://medium.com/@debutinfotech/achieving-fault-tolerance-with-kafka-a-detailed-explanation-a9828929d00d

https://books.google.se/books?id=0jbxwQEACAAJ

https://books.google.se/books?id=0jbxwQEACAAJ

https://medium.com/quick-code/regression-versus-classification-machine-learning-whats-the-difference-345c56dd15f7



Bibliography

[25] D. Soni, “Supervised vs. unsupervised learning”, 2018. [Online]. Available: https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d.

[26] J. Brownlee, “When to use mlp, cnn, and rnn neural networks”, 2018. [Online]. Avail-able: https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/.

[27] M. Lopes, “Dimensionality reduction — does pca really improve classifica-tion outcome?”, 2017. [Online]. Available: https : / / towardsdatascience .com / dimensionality - reduction - does - pca - really - improve -classification-outcome-6e9ba21f0a32.

[28] M. Kearns, “Thoughts on hypothesis boosting”, 1988. [Online]. Available: https://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf.

[29] S. Džeroski and B. Ženko, “Is combining classifiers with stacking better than selectingthe best one?”, Machine Learning, vol. 54, no. 3, pp. 255–273, Mar. 2004, ISSN: 1573-0565.DOI: 10.1023/B:MACH.0000015881.36452.6e. [Online]. Available: https://doi.org/10.1023/B:MACH.0000015881.36452.6e.

[30] R. Lorbieski and S. Nassar, “Impact of an extra layer on the stacking algorithm forclassification problems”, Journal of Computer Science, vol. 14, pp. 613–622, May 2018.DOI: 10.3844/jcssp.2018.613.622.

[31] S. Glen, “Error term: Definition and examples”, 2017. [Online]. Available: https://www.statisticshowto.datasciencecentral.com/error-term/.

[32] L. K. Hansen and P. Salamon, “Neural network ensembles”, IEEE Trans. Pattern Anal.Mach. Intell., vol. 12, pp. 993–1001, 1990.

[33] O. Aydin and S. Guldamlasioglu, “Using lstm networks to predict engine condition onlarge scale data processing framework”, in 2017 4th International Conference on Electricaland Electronic Engineering (ICEEE), Apr. 2017, pp. 281–285. DOI: 10.1109/ICEEE2.2017.7935834.

[34] C. Ossai, “Integrated big data analytics technique for real-time prognostics, fault de-tection and identification for complex systems”, Infrastructures, vol. 2, p. 20, Nov. 2017.DOI: 10.3390/infrastructures2040020.

[35] K. Leahy, R. L. Hu, I. C. Konstantakopoulos, C. J. Spanos, A. M. Agogino, and D. T. J.O’Sullivan, “Diagnosing and predicting wind turbine faults from scada data using sup-port vector machines”, 2018.

[36] M. Canizo, E. Onieva, A. Conde, S. Charramendieta, and S. Trujillo, “Real-time predic-tive maintenance for wind turbines using big data frameworks”, pp. 70–77, Jun. 2017.DOI: 10.1109/ICPHM.2017.7998308.

[37] A. Verma and A. Kusiak, “Prediction of status patterns of wind turbines: A data-miningapproach”, Journal of Solar Energy Engineering, vol. 133, pp. 011 008–1, Feb. 2011. DOI:10.1115/1.4003188.

[38] S. Geller, “Normalization vs standardization — quantitative analysis”, 2019. [On-line]. Available: https://towardsdatascience.com/normalization- vs-standardization-quantitative-analysis-a91e8a79cebf.

[39] Edp open data challenges, https://opendata.edp.com/pages/challenges/description#description, Accessed: 2020-01-07.

[40] X. Sun, N. Gebraeel, and M. Yildirim, “Integrated predictive analytics and optimiza-tion for wind farm maintenance and operations”, IEEE Transactions on Power Systems,vol. PP, Feb. 2017. DOI: 10.1109/TPWRS.2017.2666722.

63

https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d



https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/

https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks/

https://towardsdatascience.com/dimensionality-reduction-does-pca-really-improve-classification-outcome-6e9ba21f0a32



https://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf

https://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf

https://doi.org/10.1023/B:MACH.0000015881.36452.6e



https://doi.org/10.3844/jcssp.2018.613.622

https://www.statisticshowto.datasciencecentral.com/error-term/

https://www.statisticshowto.datasciencecentral.com/error-term/

https://doi.org/10.1109/ICEEE2.2017.7935834

https://doi.org/10.1109/ICEEE2.2017.7935834

https://doi.org/10.3390/infrastructures2040020

https://doi.org/10.1109/ICPHM.2017.7998308

https://doi.org/10.1115/1.4003188

https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf

https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf

https://opendata.edp.com/pages/challenges/description#description

https://opendata.edp.com/pages/challenges/description#description

https://doi.org/10.1109/TPWRS.2017.2666722

BIBLIOGRAPHY

[41] J. Ribrant and L. M. Bertling, “Survey of failures in wind power systems with focuson swedish wind power plants during 1997 ndash;2005”, IEEE Transactions on EnergyConversion, vol. 22, no. 1, pp. 167–173, Mar. 2007, ISSN: 0885-8969. DOI: 10.1109/TEC.2006.889614.

[42] A. Kusiak, H. Zheng, and Z. Song, “Short-term prediction of wind farm power: A datamining approach”, IEEE Transactions on Energy Conversion, vol. 24, no. 1, pp. 125–136,Mar. 2009, ISSN: 0885-8969. DOI: 10.1109/TEC.2008.2006552.

[43] R. Nickelson, B. Johnson, and K. Barnes, “Review of supervisory control and data ac-quisition (scada) systems”, 2004.

[44] “Ieee standard for scada and automation systems - redline”, IEEE Std C37.1-2007 (Revi-sion of IEEE Std C37.1-1994) - Redline, pp. 1–200, May 2008.

64

https://doi.org/10.1109/TEC.2006.889614

https://doi.org/10.1109/TEC.2006.889614

https://doi.org/10.1109/TEC.2008.2006552

machine learning for predictive maintenance on wind turbines1420733/fulltext01.pdf · 2020. 3....

Documents